Presentation on theme: "Lecturer: Dr. Simon Winberg — In industry and academia, professionals need to get used to the paradigm of listening to presentations and taking notes."— Presentation transcript:


2 Lecturer: Dr. Simon Winberg


4 In industry and academia, professionals need to get used to the paradigm of listening to presentations and taking note of what is useful to them. In this case we are still in a course, so you should consider most things to be relevant to you. But here are a few techniques:
• Keep a notebook (if not a logbook with all your ideas in it) where you can jot down useful points, or issues you found confusing and want to either ask about (if the opportunity permits) or look up later.
• If you have the assigned reading in front of you, either in digital form (e.g., on a tablet) or printed, then by all means browse through it to check on things while the presentation covers material you already know or find less important.
• Reflect on and review the reading (after the lecture).

5 Let’s go ahead and dive into the first reading! Reading 1: “The Landscape of Parallel Computing Research: A View from Berkeley” by Krste Asanovic, Ras Bodik, Bryan Catanzaro, et al. Technical report, Electrical Engineering and Computer Sciences, University of California at Berkeley, December 18, 2006. Since you wouldn’t have had time to read this yet, this presentation is designed around presenting the main topics to you so that you get the gist, after which you can read through it more thoroughly for homework.

6 • Setting the scene • 7 critical questions • Overview of recommendations • The major needs • Old vs. new conventional wisdom • Dwarfs • Hardware design issues • Take-home messages

7 • The recent switch to widely used parallel microprocessor architectures is a historical milestone. • Expect a diminishing-returns effect for more than 16 processors per computer. • Much can be learned from the “extremes of computing”: embedded computing and high-performance computing.

8 Seven critical questions: 1. What are the applications? 2. What are common kernels of the applications? 3. What are the hardware building blocks? 4. How to connect them? 5. How to describe applications and kernels? 6. How to program the hardware? Evaluation: 7. How to measure success? (Figure: the questions span from applications down to hardware, with a tension between embedded and server computing.)

9 1. Make it easier to write parallel programs. 2. Aim for 1000s of cores per chip, with optimization of the cores: • maximize MIPS/watt • maximize MIPS/area • maximize MIPS/$. 3. Use the 13 Dwarfs to design and evaluate parallel models & architectures.

10 4. Use of “autotuners”*. 5. “Human-centric” programming models to improve programmer productivity. 6. Programming models independent of the number of processors. 7. Support a wide range of data types and models of parallelism: task-level, word-level, or bit-level parallelism. * Tools that automatically adjust/tune the computer system or code depending on the platform and its loading.
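
As a toy illustration of the autotuning idea in recommendation 4 (my own sketch, not from the report — the variant names and the clock()-based timing are illustrative), a program can benchmark several implementations of the same kernel and keep the fastest one for the machine it finds itself on:

```c
/* Toy autotuner: time two variants of the same kernel and pick the
   faster one for this machine. Real autotuners search far larger
   spaces of code variants and parameters. */
#include <stdio.h>
#include <time.h>

#define N 1024
static double a[N][N];

/* Variant 1: row-major traversal (cache friendly for C arrays). */
static double sum_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Variant 2: column-major traversal (strided; usually slower). */
static double sum_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Time one variant with the standard clock() facility. */
static double time_variant(double (*kernel)(void)) {
    clock_t t0 = clock();
    volatile double sink = kernel();  /* volatile keeps the call live */
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    double t_rows = time_variant(sum_rows);
    double t_cols = time_variant(sum_cols);
    printf("row-major: %.3f s, column-major: %.3f s -> use %s\n",
           t_rows, t_cols, t_rows <= t_cols ? "sum_rows" : "sum_cols");
    return 0;
}
```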

11 8. Architecture should provide for accurate performance measurement facilities 9. Operating system functionality directed by libraries and virtual machines 10. Use of FPGA-based system emulators to explore design spaces

12 • Major requirements to make parallel applications development more widely accessible: • naturally parallel programming models • naturally parallel system software • naturally parallel supporting architecture • a general reinvention of the cornerstones of computing.

13 • Discussed old conventional wisdom (CW) vs. new CW. • Major ones I found interesting: 1. Old CW: power is free, transistors are costly. New CW: the “power wall”. 4. Old CW: build on prior success, using abstraction & reuse, to build bigger systems. New CW: technicalities impose limitations. 9, 10. Old CW: uniprocessor performance doubles every 18 months. New CW: the “brick wall” — it could be a long wait.

14 Seven Thirteen Dwarfs (the original seven dwarfs, extended by the report to thirteen)

15 • Dwarfs capture common requirements (and patterns) of classes of applications — i.e., a dwarf is a “worker” that helps to accomplish the application. • An attempt towards composing an application program from “dwarfs”. • A means to guide processor manufacturers and benchmarking platforms.

16 • Small is beautiful — small cores give: • highest performance / area • fine granularity & dynamic power scaling • easier to design & work with • a good chance of 1000s of cores per die. • Amdahl’s law — the less parallel portion of a program can limit performance (see the worked example below). • Benefits to different-sized processors in a manycore architecture. • Benefits of reconfigurable coprocessors.
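
For reference, Amdahl’s law quantifies that bullet. With parallel fraction p of the program and N cores (the p = 0.95 figure below is my own example, not from the slides):

```latex
\text{Speedup}(N) = \frac{1}{(1-p) + p/N}
% e.g. p = 0.95, N = 1000:
% Speedup = 1 / (0.05 + 0.95/1000) \approx 19.6
```

So even with 1000 cores, a 5% sequential portion caps the speedup at roughly 20x — which is why the less parallel portion dominates at manycore scale.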

17 • Memory wall — a bounding factor for many dwarfs. • DRAM keeping pace with Moore’s law. • Making interconnects adapt to communication patterns at run time. • Use of traditional communication primitives (synchronization locks, mutual exclusion via transactional memory, message passing).

18 • Performance and energy counters: additional resources provided in hardware for the specific purpose of performance evaluation. • Useful to processor designers as well as developers who use the platforms. • Help developers assess their programs in terms of energy as well as speed.

19 1. Use the 13 Dwarfs to design and evaluate parallel models & architecture 2. Architectures should provide accurate performance measurement facilities 3. Small is beautiful, but not always so beneficial

20 • Berkeley’s report confirms the expected benefits of reconfigurable co-processors (if not entirely reconfigurable platforms for prototyping architectures).

21 Open Forum Discussion: questions, answers and reflections from the class

22 Parallel Computing Systems (EEE4084F: Digital Systems)

23 Anyone else besides me feel that way? … [Chart: processor cores vs. major software]

24 CPUs at idle

25 Computation Methods
• Hardware (e.g., PCBs, ASICs). Advantages: high speed & performance; efficient (possibly lower power than an idle processor); parallelizable. Drawbacks: expensive; static (cannot change).
• Reconfigurable computer (e.g., IBM Blade, FPGA-based computing platform). Advantages: faster than software alone; more flexible than software; more flexible than hardware; parallelizable. Drawbacks: expensive; complex (both s/w & h/w).
• Software processor (e.g., PC, embedded software on a microcontroller). Advantages: flexible; adaptable; can be much cheaper. Drawbacks: the hardware is static; clock-speed limit; sequential processing.

26 • Intel® Core™ i7 — i7 Extreme Edition vs. (regular) i7 processor. • Intel® Xeon™ processors — 4/8 cores; 8/16 threads; 40 W to 130 W (more cores, exponential growth in power); 1.86–3.3 GHz CPU clock, 1066–1600 MHz bus. • Intel® Itanium® — scalable: 1/2/4 cores; hyperthreading*; allows designs up to 512 cores; 32/64-bit; power use starts around 75 W; 1–2.53 GHz (Itanium 9500 ‘Poulson’); QPI 6.4 GT/s bus. * Hyperthreading: two virtual/logical processors per core (more: http://www.techopedia.com/definition/2866/hyperthreading-ht). Gigatransfers per second (GT/s) or megatransfers per second (MT/s): a (somewhat informal) count of the data-transfer operations occurring each second on a given data-transfer channel. Also known as the sample rate, i.e., the number of data samples captured per second, each sample normally occurring at a clock edge.
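
To make the GT/s figure concrete, a back-of-envelope calculation (my arithmetic; the 16-bit payload width per direction is QPI’s standard figure, assumed here rather than stated on the slide):

```latex
6.4~\text{GT/s} \times 2~\text{bytes/transfer} = 12.8~\text{GB/s per direction}
\approx 25.6~\text{GB/s for the bidirectional link}
```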

27 • Most server-class machines today are: • PC-class SMPs (symmetric multiprocessors*) — 2, 4 or 8 processors; cheap; run Windows & Linux. • Deluxe SMPs — 8 to 64 processors; expensive: a 16-way SMP costs ≈ 4 × a 4-way SMP. • Applications: databases, web servers, internet commerce / OLTP (online transaction processing). • Newer applications: technical computing, threat analysis, credit-card fraud... An SMP offers all processors and memory on a common front-side bus (FSB — the bus that connects the CPU and motherboard subsystems). * Also termed “shared-memory processor”.

28 • Hundreds of processors, typically built as clusters of SMPs. • Often custom built with government funding (costly! 10 to 100 million USD). • A national/international resource. • Total sales are a tiny fraction of PC server sales. • Few independent software developers. • Programmed by a small set of very smart people.

29 • Some applications: • code breaking (CIA, FBI) • weather and climate modelling/prediction • pharmaceuticals — drug simulation, DNA modelling and drug design • scientific research (e.g., astronomy, SETI) • defense and weapons development. • Large-scale parallel systems are often used for modelling.

30 • Important software sections (frequently run sections) are usually hand-crafted (often from initial sequential versions) to run on the parallel computing platform. • Why this happens: • parallel programming is difficult • parallel programs run poorly on sequential machines (which need to simulate them) • automatic parallelization is difficult (& messy). • This leads to high utilization expenses.

31 • Yes! … some other examples…

32 “Parallel and Distributed Computing: Memories of Time Past and a Glimpse at the Future”, D. C. Marinescu, June 2014. Abstract: “We propose to look at the evolution of ideas related to parallel systems, algorithms, and applications during the past three decades and then glimpse at the future. The journey starts with massively parallel systems of the early nineties, transitions gently to computing for the masses on the clouds of the first decade of the new millennium, and continues with what we could expect from quantum computers and quantum parallelism in the next few decades...” File: 06900194.pdf, http://dx.doi.org/10.1109/ISPDC.2014.33. This paper gives good insight into the current state of the art and likely approaches that will be used in the future.

33 • Yes! • MatLab, Simulink, SystemC, UML*, NESL** and others. • “Automatic parallelization”: converting sequential code into multi-threaded or vectorised code (or both) to utilize multiple processors simultaneously (e.g., for an SMP machine). • … short powwow on the topic… * Model-driven development using CASE tools, e.g., Rhapsody for RT-UML. ** Try the interactive tutorial at http://www.cs.cmu.edu/~scandal/nesl/tutorial2.html

34 Powwow* moment. Consult the four winds, and your neighbouring classmates, as to: why is it probably not easy to automate conversion from sequential code (e.g., a BASIC or standard C program) to parallel code? HINT: Perhaps start by clarifying the difference between sequential and parallel code. PS: You’re also welcome to get up, stretch, move about, infiltrate a more intelligent-looking tribe, and so on. Note the approx. 5 min. time limit! The next slide provides some reasons... * A term from North America’s Native people, referring to a cultural gathering.

35 Return from the Powwow

36 Some thoughts on why it is probably not easy to automate conversion from sequential code (e.g., BASIC or C) to parallel code:
• Data hazards
• Timing issues
• Deciding how to break up data to distribute
• Deciding when to implement semaphores and locks
• When code needs to block and when not
• How to split up a loop into parallel parts
• Having to convert blocks of statements or functions into inter-process calls
• Figuring out timing dependencies
• Difficulty in figuring out data dependencies
The sketch below illustrates the data-dependence point.
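
A small C illustration (mine, not from the slides; the OpenMP pragma is just one common way to express the parallel version): two loops that look almost identical, of which only one can safely be parallelized automatically.

```c
/* Why automatic parallelization is hard: superficially similar loops,
   but only the first has independent iterations. */
#include <stdio.h>

#define N 8

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[N] = {0};

    /* Independent iterations: b[i] depends only on a[i], so a compiler
       (or this OpenMP pragma, if compiled with -fopenmp) can safely
       split the loop across cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i];

    /* Loop-carried dependence: each a[i] needs the already-updated
       a[i-1] (a running prefix sum), so iterations must run in order;
       naive parallelization would silently give wrong answers. */
    for (int i = 1; i < N; i++)
        a[i] = a[i] + a[i - 1];

    printf("b[%d] = %.1f, a[%d] = %.1f\n", N - 1, b[N - 1], N - 1, a[N - 1]);
    return 0;
}
```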

37 • Accumulation of 30+ years of research… • Only limited success in parallelism detection and program transformations: • instruction-level parallelism at the basic-block level (e.g., pipelining of instructions) • parallelism in nested for-loops containing arrays with simple index expressions. • Analysis methods: data-dependence analysis, pointer analysis, abstraction back to a more optimized implementation, flow analysis.

38 • Main reasons (user perspective): • It tends to take too long. • It tends to be too fragile (i.e., breaks down after small changes to the code). • It tends to miss many things a human would notice and provide an effective solution for — i.e., human intellect, underlying knowledge of the application, and training in writing and problem-solving parallel code. • So: instead of training compilers to recognize parallelism, people are being trained to write parallel code (i.e., a “no middle man” approach).


40 • Golden measure: • A (usually) sequential solution that you develop as the ‘yardstick’. • A solution that may run slowly and isn’t optimized, but that you know gives (numerically speaking) excellent results. • E.g., a solution written in Octave or MATLAB; verify that it is correct using graphs, inspecting values, checking by hand with a calculator, etc.

41 • Sequential / serial (serial.c): a non-parallelized code solution. • Generally, you can call your code solutions parallel.c (or para1.c, para2.c if you have multiple versions). • You can also include some test data (if it isn’t too big, < 1 MB), e.g., gold.csv or serial.csv, and para1.csv. A sketch of comparing such files against the golden measure follows below.
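
A minimal sketch of validating a parallel run against the golden measure (my own helper, not part of the course hand-out; the 1e-6 tolerance and the command-line file names are illustrative):

```c
/* Compare two CSV files of numbers element by element, allowing a
   small numerical tolerance, e.g.:  ./compare gold.csv para1.csv  */
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s gold.csv para1.csv\n", argv[0]);
        return 1;
    }
    FILE *fg = fopen(argv[1], "r");
    FILE *fp = fopen(argv[2], "r");
    if (!fg || !fp) { perror("fopen"); return 1; }

    double g, p;
    long n = 0, bad = 0;
    /* " %lf ," reads a number and skips an optional trailing comma,
       so both comma- and newline-separated values work. */
    while (fscanf(fg, " %lf ,", &g) == 1) {
        if (fscanf(fp, " %lf ,", &p) != 1) {
            fprintf(stderr, "files have different lengths\n");
            return 1;
        }
        n++;
        if (fabs(g - p) > 1e-6) {        /* tolerance is arbitrary */
            if (++bad <= 5)              /* report first few only  */
                fprintf(stderr, "value %ld: gold=%g parallel=%g\n", n, g, p);
        }
    }
    printf("%ld values compared, %ld outside tolerance\n", n, bad);
    fclose(fg);
    fclose(fp);
    return bad ? 1 : 0;
}
```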

42 Power concerns & other trends (a GST perspective)

43 Computation Design Trends. [Intel performance graph] For the past decades, the means to increase computer performance has focused largely on producing faster processors, which included packing more transistors into smaller spaces. Moore’s law has been holding pretty well… when measured in terms of transistors (e.g., doubling the number of transistors). But this trend has drawbacks, and seems to be slowing…

44 [Graph: calculations per second per $1,000, trend over time]

45 [Figure: illustration of demand for computers (Intel perspective); source: alphabytesoup.files.wordpress.com/2012/07/computer-timeline.gif (unknown license)]

46 Computation Design Trends — power concerns. Processors are getting too power hungry: there are too many transistors that need power. Also, the size of transistors can’t come down by much — it might not be possible to have transistors smaller than a few atoms! And how would you connect them up? The industry is now tending towards multi-core processors… Sure, it can double the transistors every 2–3 years (and the power), but what of performance? A dual-core Intel system with GPU and LCD monitor draws about 220 watts. [Graph: power projections over time; obviously, we’ve since seen that the reality isn’t as bad.]
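
The first-order model behind these power concerns is the standard CMOS dynamic-power relation (textbook background, not from the slides):

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V^{2} \, f
```

where α is the switching activity, C the switched capacitance, V the supply voltage and f the clock frequency. Pushing f higher also demands a higher V, so power grows steeply; conversely, dropping V and f slightly cuts power sharply, and the lost speed can be recovered with additional cores — hence the shift to multi-core.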

47 Image source: http://commons.wikimedia.org/wiki/File:Processor_families_in_TOP500_supercomputers.svg

48 “The TOP500 List and Progress in High-Performance Computing”, Strohmaier, Meuer, Dongarra and Simon, Nov. 2015. Abstract: “For more than two decades, the TOP500 list has enjoyed incredible success as a metric for supercomputing performance and as a source of data for identifying technological trends. The project’s editors reflect on its usefulness and limitations for guiding large-scale scientific computing into the exascale era.” File: 07328648.pdf, http://dx.doi.org/10.1109/MC.2015.338. This paper gives good insight into the current state of the art and likely approaches that will be used in the future.


50 Image sources: clipart — public domain CC0 (http://pixabay.com/); commons.wikimedia.org; images from flickr. Disclaimers and copyright/licensing details: I have tried to follow the correct practices concerning copyright and licensing of material, particularly image sources that have been used in this presentation. I have put much effort into trying to make this material open access so that it can be of benefit to others in their teaching and learning practice. Any mistakes or omissions with regard to these issues I will correct when notified. To the best of my understanding, the material in these slides can be shared according to the Creative Commons “Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)” license, and that is why I selected that license to apply to this presentation (it’s not because I particularly want my slides referenced, but more to acknowledge the sources and generosity of others who have provided free material such as the images I have used).

