1. CS 320 Spring 2003: Introduction

Laxmikant Kale
http://charm.cs.uiuc.edu

2. Course objectives and outline

You will learn about:
– Parallel architectures overview
  – Message passing support, routing, interconnection networks
  – Cache-coherent scalable shared memory, synchronization
  – Later: relaxed consistency models (?); novel architectures: Tera, Blue Gene, processors-in-memory
– Parallel programming models
  – Emphasis on three: message passing, shared memory, and shared objects (a minimal message-passing sketch follows after this outline)
  – Ongoing evaluation and comparison of models
– Commonly needed parallel algorithms/operations, and analysis techniques
– Parallel application categories
– Performance analysis and optimization of parallel applications
– Parallel application case studies
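To make the message-passing model concrete before the course develops it, here is a minimal sketch in C using MPI. It is illustrative only: the choice of MPI, the two ranks, and the single-integer payload are assumptions, not course material.

```c
/* Minimal message-passing sketch (illustrative): rank 0 sends one
 * integer to rank 1 over MPI_COMM_WORLD. Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, msg;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", msg);
    }
    MPI_Finalize();
    return 0;
}
```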

3. Project and homeworks

– Significant (in effort and grade percentage) course project
  – Groups of 5 students
  – Publication-quality results expected
– Homeworks/machine problems: weekly (sometimes biweekly)
– Parallel machines:
  – NCSA Origin 2000, Turing Cluster, Sun cluster, SMP machine
  – Possible: large machines for evaluating scalability, such as the 1000-processor NCSA cluster and the 3000-processor Lemieux machine at PSC

4. Resources

Much of the course will be run via the web:
– Lecture slides and assignments will be available on the course web page: http://www-courses.cs.uiuc.edu/~cs320
– Most of the reading material (papers, manuals) will be on the web
– Projects will coordinate and submit information on the web; web pages for individual projects will be linked to the course web page
– Newsgroup: uiuc.class.ece392
You are expected to read the newsgroup and web pages regularly.

5. Advent of parallel computing

– "Parallel computing is necessary to increase speeds" was the cry of the '70s
  – Instead, processors kept pace with Moore's law, doubling in speed every 18 months
– Now, finally, the time is ripe:
  – Uniprocessors are commodities (and processor speeds show signs of slowing down)
  – It is highly economical to build parallel machines

6. Why parallel computing?

– It is the only way to increase speed beyond uniprocessors
  – Except, of course, waiting for uniprocessors to become faster!
  – Several applications require orders of magnitude more performance than is feasible on uniprocessors
– Cost effectiveness (an older argument; the arithmetic is spelled out below):
  – In 1985, a supercomputer cost 2000 times more than a desktop, yet performed only 400 times faster
  – So: combine microcomputers to get speed at lower cost
  – Incremental scalability: you can get in-between performance points with 20, 50, 100, ... processors
  – But: you may get speedup lower than 400 on 2000 processors!
– Microcomputers became faster, effectively killing off supercomputers
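Spelling out the 1985 arithmetic: at 2000 times the price for 400 times the performance, the supercomputer's cost per unit of performance was 2000/400 = 5 times worse than the desktop's. Equivalently, 2000 desktops costing the same as the supercomputer needed only a combined speedup above 400, an efficiency of just 20%, to be the better buy; that is why even speedups well below 2000 on 2000 processors could still make the cluster worthwhile.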

7. Technology Trends

The natural building block for multiprocessors (the commodity microprocessor) is now also about the fastest processor available!

8. Economics

– Commodity microprocessors are not only fast but CHEAP
  – Development cost is tens of millions of dollars ($5-100 million typical), but many more units are sold than supercomputers
  – It is crucial to take advantage of that investment and use the commodity building block
  – Exotic parallel architectures amount to no more than special-purpose hardware
– Multiprocessors are being pushed by software vendors (e.g., database) as well as hardware vendors
– Standardization by Intel makes small, bus-based SMPs a commodity
– Desktop: a few smaller processors versus one larger one?
  – Multiprocessor on a chip

9. What to Expect?

Parallel machine classes:
– Cost and usage define a class! The architecture of a class may change.
– Desktops, engineering workstations, database/web servers, supercomputers
Commodity (home/office) desktop:
– Less than $10,000
– Possible to provide 10-50 processors for that price!
– Driver applications: games, video/signal processing, possibly "peripheral" AI: speech recognition, natural language understanding (?), smart spaces and agents
– New applications?

10. Engineering workstations

– Price: less than $100,000 (used to be); the new acceptable price level may be $50,000
– 100+ processors, large memory
– Driver applications: CAD (computer-aided design) of various sorts
  – VLSI
  – Structural and mechanical simulations
  – Etc. (many specialized applications)

11. Commercial Servers

– Price range: variable ($10,000 to several hundred thousand)
– Defining characteristic: usage
  – Database servers, decision support (MIS), web servers, e-commerce
– High availability and fault tolerance are the main criteria
– Trends to watch for:
  – Likely emergence of specialized architectures/systems, e.g., Oracle's "No Native OS" approach
  – Currently dominated by database servers and TPC benchmarks, which measure transactions per second
  – But this may change to data mining and application servers, with a corresponding impact on architecture

12. Supercomputers

"Definition": an expensive system?!
– Used to be defined by architecture (vector processors, ...)
– More than a million US dollars?
– Thousands of processors
Driving applications: grand challenges in science and engineering:
– Global weather modeling and forecasting
– Rational drug design / molecular simulations
– Processing of genetic (genome) information
– Rocket simulation
– Airplane design (wings and fluid flow, ...)
– Operations research?? Not recognized yet
– Other non-traditional applications?

13. Consider Scientific Supercomputing

Proving ground and driver for innovative architecture and techniques:
– The market is smaller relative to commercial computing as multiprocessors become mainstream
– Dominated by vector machines starting in the '70s
– Microprocessors have made huge gains in floating-point performance (illustrated below):
  – high clock rates
  – pipelined floating-point units (e.g., a multiply-add every cycle)
  – instruction-level parallelism
  – effective use of caches (e.g., automatic blocking)
– Plus economics
Large-scale multiprocessors are replacing vector supercomputers:
– Well under way already
– Except for the Earth Simulator: thousands of vector processors
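A back-of-the-envelope illustration of those floating-point gains (the 500 MHz clock is an assumed round number, not a figure from the slide): a microprocessor that issues one pipelined multiply-add, i.e. two floating-point operations, per cycle at 500 MHz has a peak of 500 x 10^6 cycles/s x 2 flops/cycle = 1 GFLOPS, vector-class throughput from a commodity part.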

14. Scientific Computing Demand

15. Engineering Computing Demand

Large parallel machines are a mainstay in many industries:
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion efficiency)
– Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization, in all of the above as well as entertainment (films like Toy Story) and architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
– etc.

16. Applications: Speech and Image Processing

Also CAD, databases, ...

100 processors gets you the uniprocessor performance of 10 years from now; 1000 processors, 20 years! (The arithmetic behind this is sketched below.)
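The claim is the doubling argument run in reverse (hedged: the exact doubling period is an assumption). If uniprocessor speed doubles every d years, a speedup of S is worth t = d x log2(S) years of waiting. With d = 1.5 years (the 18 months quoted on slide 5), S = 100 gives t of about 10 years; S = 1000 gives about 15 years, approaching the slide's 20 if the doubling period stretches toward two years.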

17. Learning Curve for Parallel Applications

– AMBER molecular dynamics simulation program
– Starting point was vector code for the Cray-1
– 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D

18. Scalability Challenges

Machines are getting bigger and faster, but:
– Communication speeds?
– Memory speeds?

"Now, here, you see, it takes all the running you can do to keep in the same place."
  --- Red Queen to Alice in "Through The Looking Glass"

Further, applications are getting more ambitious and complex:
– Irregular structures and dynamic behavior
– Programming models?
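One standard way to quantify the Red Queen effect (a textbook model, not from the slide): if a problem takes time T_1 on one processor and each of p processors pays a fixed communication overhead T_comm, then

    S(p) = T_1 / (T_1/p + T_comm)

so speedup saturates at T_1/T_comm no matter how many processors are added; only faster communication (smaller T_comm) raises the ceiling.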

19. Current Scenario: Machines

Extremely high-performance machines abound.
– Clusters in every lab
  – GigaFLOPS per processor!
  – 100 GFLOPS aggregate performance possible
– High-end machines at centers and labs:
  – Many thousands of processors, multi-teraflop performance
  – Earth Simulator, ASCI White, PSC Lemieux, ...
– Future machines:
  – Blue Gene/L: 128k processors!
  – Blue Gene Cyclops design: 1M processors, multiple processors per chip, low memory-to-processor ratio

20. Communication Architecture

On clusters:
– 100 Mbit Ethernet: ~100 μs latency
– Myrinet switches: user-level memory-mapped communication, 5-15 μs latency, 200 MB/s bandwidth; relatively expensive when compared with cheap PCs
– VIA, InfiniBand
On high-end machines:
– 5-10 μs latency, 300-500 MB/s bandwidth
– Custom switches (IBM, SGI, ...)
– Quadrics
Overall: communication speeds have increased, but not as much as processor speeds. (A first-order cost model is sketched below.)
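A first-order cost model for these numbers (a sketch; the function is illustrative, with constants taken from the Myrinet bullet above):

```c
/* Estimated one-way message time under the latency/bandwidth model:
 *   T(n) = latency + n / bandwidth
 * Constants below are the slide's Myrinet-class figures: ~10 us
 * latency, ~200 MB/s bandwidth (1 MB/s == 1 byte/us). */
#include <stdio.h>

double msg_time_us(double n_bytes, double latency_us, double mb_per_s)
{
    return latency_us + n_bytes / mb_per_s;
}

int main(void)
{
    /* Small messages are latency-bound, large ones bandwidth-bound. */
    printf("100 B: %8.1f us\n", msg_time_us(100.0, 10.0, 200.0));
    printf("1 MB:  %8.1f us\n", msg_time_us(1.0e6, 10.0, 200.0));
    return 0;
}
```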

21. Memory and Caches

Bottom line again:
– Memories are faster, but not keeping pace with processors
– Memory hierarchies are deep: on-chip and off-chip
– They must be handled almost explicitly in programs to get good performance
  – A factor of 10 (or even 50) slowdown is possible with bad cache behavior
  – Increase reuse of data: while the data is in cache, use it for as many of the things you need to do as possible
  – Blocking helps (a sketch follows below)
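A minimal sketch of the blocking idea (the matrix size and tile size are illustrative assumptions; in practice the tile size is tuned so the working set of three tiles fits in cache):

```c
/* Blocked (tiled) matrix multiply: C += A * B for N x N row-major
 * matrices. Each TB x TB tile is reused while it is cache-resident,
 * instead of streaming whole rows from memory on every pass. */
#define N  512   /* matrix dimension (assumed; a multiple of TB)      */
#define TB 64    /* tile size (assumed; tune so ~3 tiles fit in cache) */

void matmul_blocked(const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += TB)
        for (int kk = 0; kk < N; kk += TB)
            for (int jj = 0; jj < N; jj += TB)
                /* multiply one pair of TB x TB tiles */
                for (int i = ii; i < ii + TB; i++)
                    for (int k = kk; k < kk + TB; k++) {
                        double a = A[i * N + k];
                        for (int j = jj; j < jj + TB; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```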

22. Application Complexity is Increasing

Why?
– With more FLOPS, you need better algorithms; it is not enough to just do more of the same
– Better algorithms lead to complex structure
– Example: gravitational force calculation
  – Direct all-pairs: O(N^2), but easy to parallelize (a sketch follows below)
  – Barnes-Hut: O(N log N), but more complex
– Multiple modules, dual time-stepping
– Adaptive and dynamic refinements
Ambitious projects with new objectives lead to dynamic behavior and multiple components.
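A sketch of the direct O(N^2) method (illustrative; the softening parameter eps, added to avoid division by zero at zero separation, is an assumption beyond the slide):

```c
#include <math.h>   /* link with -lm */

/* Direct all-pairs gravitational acceleration: O(N^2) interactions.
 * Trivial to parallelize by splitting the outer loop over i among
 * processors; Barnes-Hut instead replaces the inner loop with an
 * O(log N) tree walk, at the cost of much more complex structure. */
void accel_direct(int n, const double x[][3], const double m[],
                  double a[][3], double G, double eps)
{
    for (int i = 0; i < n; i++) {
        a[i][0] = a[i][1] = a[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = x[j][0] - x[i][0];
            double dy = x[j][1] - x[i][1];
            double dz = x[j][2] - x[i][2];
            double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            a[i][0] += G * m[j] * dx * inv_r3;
            a[i][1] += G * m[j] * dy * inv_r3;
            a[i][2] += G * m[j] * dz * inv_r3;
        }
    }
}
```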

23. Disparity between peak and attained speed

As a combination of all of these factors:
– The attained performance of most real applications is substantially lower than the peak performance of machines
– Caution: expecting to attain peak performance is a pitfall; we don't use such a metric for our internal combustion engines, for example
– But it does give us a metric to gauge how much improvement is possible
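The metric in question is simply efficiency = attained performance / peak performance. As an assumed, purely illustrative example: an application sustaining 100 MFLOPS on a processor with a 1 GFLOPS peak is running at 10% of peak, consistent with the factor-of-10 slowdowns from bad cache behavior mentioned on slide 21.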

