Architectural Considerations for Petaflops and Beyond
Bill Camp, Sandia National Labs
March 4, 2003
SOS7, Durango, CO, USA

Programming Models: A historical perspective
- 1948--53: Machine language rules
- 1953--1973: Single-threaded Fortran
- 1973--1980: Single-threaded vector Fortran
- 1978--1995: Shared memory parallel vector Fortran; directives: multi-, auto- and microtasking
- 1987--present: Massively parallel, message-passing Fortran and C
- 1995--present: Threads-based, shared memory parallelism
- 1996--present: Hybrid threads + message passing

Programming Models: Some false starts
- Late 80’s--early 90’s: SIMD Fortran for heterogeneous problems
- Mid-eighties--present: Dataflow parallelism and functional programming
- Mid-eighties--late eighties: AI-based languages, e.g. LISP
- Mid-nineties: CRAFT-90 (shared memory approach to MPPs)
- Early nineties to ~2000: MPP threads

Programming Models -- Observations
- Shared memory programming models have never scaled well.
- Directives-based approaches lead to code explosion and are not effective at dealing with Amdahl’s Law.
- Outer-loop, distributed memory parallelism requires a “physics-centric” approach. I.e., it changed the way we think about parallelism but (largely) preserved our code base, didn’t lead to code explosion, and made it easier to marginalize the effects of Amdahl’s Law.
- People will change approaches only for a huge perceived gain.

Petaflops-- can we get there with what we have now? YES

What’s Important? SURE:
- Scalability
- Usability
- Reliability
- Expense minimization

A more REAListic Amdahlian Law
The actual scaled speedup is more like
$S(N) \approx S_{\mathrm{Amdahl}}(N) / (1 + f_{\mathrm{comm}} \, R_{p/c})$,
where $f_{\mathrm{comm}}$ is the fraction of work devoted to communications and $R_{p/c}$ is the ratio of processor speed to communications speed.
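As a rough illustration, the formula on this slide can be evaluated in a few lines of Python. The slide does not spell out S Amdahl (N); the textbook form 1/((1 - p) + p/N), with p the parallel fraction of the work, is assumed here. The function names, parameter names and sample inputs are illustrative only.

```python
def amdahl_speedup(n_procs: int, parallel_fraction: float) -> float:
    """Textbook Amdahl speedup (an assumed form; not given on the slide)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


def real_speedup(n_procs: int, parallel_fraction: float,
                 f_comm: float, r_pc: float) -> float:
    """Slide formula: S(N) ~ S_Amdahl(N) / (1 + f_comm * R_p/c)."""
    return amdahl_speedup(n_procs, parallel_fraction) / (1.0 + f_comm * r_pc)


if __name__ == "__main__":
    # Hypothetical inputs: 99.9% parallel work, 5% communication, R_p/c = 1.
    for n in (64, 1024, 16384):
        ideal = amdahl_speedup(n, 0.999)
        actual = real_speedup(n, 0.999, f_comm=0.05, r_pc=1.0)
        print(f"N={n:6d}  Amdahl={ideal:8.1f}  with comms={actual:8.1f}")
```

The communications term is independent of N in this model, so it scales the whole Amdahl curve by a constant factor rather than changing its shape.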

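REAL Law Implications: $S_{\mathrm{real}}(N) / S_{\mathrm{Amdahl}}(N)$
Let’s consider three cases on two computers. The two computers are identical except that one has an $R_{p/c}$ of 1 and the second has a communications-to-processor speed ratio of 0.05 (that is, $R_{p/c} = 20$ under the definition above). The three cases are $f_{\mathrm{comm}}$ = 0.01, 0.05 and 0.10.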

REAL Law Implications: S(N) / S Amdahl (N)

R p/c                          f comm = 0.01   f comm = 0.05   f comm = 0.10
1.0                            0.99            0.95            0.9
0.05 (comm/proc; R p/c = 20)   0.83            0.50            0.33
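The table entries can be checked against the formula 1 / (1 + f comm x R p/c). The sketch below is a minimal check, not the author's calculation. It assumes the 0.05 machine enters the formula as R p/c = 1/0.05 = 20, which is the reading that reproduces the printed values; the names are illustrative.

```python
def comm_penalty(f_comm: float, r_pc: float) -> float:
    """Ratio S(N) / S_Amdahl(N) = 1 / (1 + f_comm * R_p/c)."""
    return 1.0 / (1.0 + f_comm * r_pc)


F_COMM_CASES = (0.01, 0.05, 0.10)

# Processor-to-communications speed ratios.
# 20.0 = 1 / 0.05 is an assumed reading of the slide's second machine.
MACHINES = {"balanced": 1.0, "weak comms": 20.0}

table = {
    name: [round(comm_penalty(f, r_pc), 2) for f in F_COMM_CASES]
    for name, r_pc in MACHINES.items()
}

for name, row in table.items():
    print(f"{name:>10}: {row}")
```

For the balanced machine at f comm = 0.10 the formula gives about 0.909, which the slide shows as 0.9.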

Bottom line: A well-balanced architecture is nearly insensitive to communications overhead. By contrast, a system with weak communications can lose over half its power for applications in which communications is important.

Petaflops-- Why can we get there with what we have now?
We only need 3 more spins of Moore’s Law:
- Today’s 6-GF Hammer becomes a 48-GF processor by 2009
- 10-Gigabit ethernet becomes 40 or 80-Gbit ethernet
- Memory capacities and prices continue to improve on current trend until 2009
- Disk technology continues on its current trajectory for 6 more years
- We use small, optical switches to give us 40--80 Gbyte/sec interconnects
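The arithmetic behind the "3 more spins" can be sketched as follows. This assumes each spin is one doubling, which is the reading that turns 6 GF into 48 GF; the helper name is hypothetical.

```python
def project(value: float, doublings: int) -> float:
    """Scale a quantity by 2**doublings (one doubling per 'spin', assumed)."""
    return value * 2 ** doublings


HAMMER_2003_GFLOPS = 6.0     # from the slide
ETHERNET_2003_GBIT = 10.0    # from the slide
SPINS = 3                    # "3 more spins of Moore's Law"
YEARS = 2009 - 2003

processor_2009 = project(HAMMER_2003_GFLOPS, SPINS)
ethernet_low = project(ETHERNET_2003_GBIT, 2)
ethernet_high = project(ETHERNET_2003_GBIT, 3)
years_per_spin = YEARS / SPINS

print(f"processor: {processor_2009} GF")
print(f"ethernet: {ethernet_low} to {ethernet_high} Gbit")
print(f"implied years per spin: {years_per_spin}")
```

Under that reading, the 40 and 80-Gbit ethernet figures correspond to two and three doublings of 10-Gigabit ethernet.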

Petaflops-- Why can we get there with what we have now?
- We need 12,000--25,000 processors to get a peak PETAFLOP
- It will have 250--1000 TB memory
- It will have several hundred petabytes disk storage
- It will sustain about a half terabyte/sec I/O (more costs more)
- It will have about 30 TB/sec XC BW
- It will have about 5--10 PB/sec memory BW
BALANCE REMAINS ESSENTIALLY LIKE THAT IN THE RED STORM DESIGN
COST in 2009: $100M--$250M in then-year dollars
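The sizing figures above can be turned into per-processor and per-flop ratios. The sketch below uses only numbers from the slides (a 1 PF peak, the 48-GF processor projected for 2009, and the listed bandwidths). The function names are illustrative, and the bytes-per-flop framing is one common way to express balance, not something the slides state.

```python
PETAFLOP = 1.0e15  # peak flop/s


def processors_needed(per_proc_gflops: float) -> float:
    """Processors required for a peak petaflop at a given per-processor peak."""
    return PETAFLOP / (per_proc_gflops * 1.0e9)


def implied_gflops(n_procs: int) -> float:
    """Per-processor peak (GF) implied by a given processor count."""
    return PETAFLOP / n_procs / 1.0e9


def bytes_per_flop(bandwidth_bytes_per_s: float) -> float:
    """Bandwidth expressed per peak flop."""
    return bandwidth_bytes_per_s / PETAFLOP


print(f"48-GF processors needed: {processors_needed(48.0):,.0f}")
print(f"12,000 processors imply {implied_gflops(12_000):.1f} GF each")
print(f"25,000 processors imply {implied_gflops(25_000):.1f} GF each")
print(f"memory BW: {bytes_per_flop(5e15):.0f} to {bytes_per_flop(10e15):.0f} bytes/flop")
print(f"XC BW:     {bytes_per_flop(30e12):.3f} bytes/flop")
print(f"I/O:       {bytes_per_flop(0.5e12):.4f} bytes/flop")
```

A 48-GF part lands at roughly 21,000 processors, inside the 12,000--25,000 range; the low end of that range would need a per-processor peak of about 83 GF.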

Petaflops-- Design issues
- It will use commodity processors with multiple cores per chip
- It will run a partitioned OS based on Linux
- It could have partitions with fast vector processors in a mix-and-match architecture
- It won’t look like the Earth Simulator
- It won’t run IA-64, based on current Intel design intent
- It will probably run Power PC or HAMMER follow-ons

Petaflops-- Why not Earth Simulator?
- On our codes, commodity processors are nearly as fast as the ES nodes, and they have a cost/performance advantage of 1.5--2.0 orders of magnitude.
- BTW, this is also true, but with not as huge a difference, for the McKinley versus the Pentium-4.
- Example: The geometric mean of Livermore Loops on ES is only 60% faster than on a 2 GHz Pentium-4.
- Example: A real CTH problem is about as fast on that P-4 as it is on the ES.

Petaflops-- Why not Earth Simulator? Amdahl’s Law and the high cost of custom processors

Why not Earth Simulator? Amdahl’s Law
$S = T_S / T_V$
$S = 1 / \{ [\, pW/(sN) + (1-p)W/(s/M) \,] / [\, W/s \,] \}$
$S = [\, p/N + M(1-p) \,]^{-1}$
Let $N = M = 4$: $S = 1 / [\, p/4 + 4(1-p) \,]$.

Why not Earth Simulator? Amdahl’s Law ($p$ = vector fraction of work)
$S = [\, p/N + M(1-p) \,]^{-1}$
Let $N = M = 4$: $S = 1 / [\, p/4 + 4(1-p) \,]$.
$p$ must be greater than or equal to 0.8 for breakeven!
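The breakeven claim follows from setting S = 1 in the formula: p/N + M(1 - p) = 1 gives p = (M - 1)/(M - 1/N), which is 3/3.75 = 0.8 for N = M = 4. A minimal sketch follows. The slides do not define N and M; reading N as the speedup on vectorizable work and M as the slowdown on the remaining work is an interpretation, and the function names are illustrative.

```python
def relative_speedup(p: float, n: float = 4.0, m: float = 4.0) -> float:
    """Slide formula: S = 1 / (p/N + M*(1 - p))."""
    return 1.0 / (p / n + m * (1.0 - p))


def breakeven_fraction(n: float = 4.0, m: float = 4.0) -> float:
    """Vector fraction p at which S = 1, from p/N + M*(1 - p) = 1."""
    return (m - 1.0) / (m - 1.0 / n)


if __name__ == "__main__":
    print(f"breakeven p = {breakeven_fraction():.2f}")
    for p in (0.5, 0.8, 0.9, 1.0):
        print(f"p = {p:.1f}  S = {relative_speedup(p):.2f}")
```

Below the breakeven fraction the formula gives S < 1, i.e. the vector machine is slower overall than the commodity processor under these assumptions.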

Petaflops-- Why not IA-64?
- Heat
- Size
- Complexity
- Cost
- High latency / low BW
- Difficulty in compilability
- Competition from Intel
- …

Processor           Peak Speed     fma3d ratio   Normalized fma3d ratio
Intel Itanium II    4.0 Gflops     776           190
Intel Pentium-4     3.06 Gflops*   1038          340
IBM Power4          5.2 Gflops     1020          200
HP Alpha EV7        2.3 Gflops     1380          600
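The slide does not say how the normalized column was derived. Reading the fused figures as an fma3d ratio followed by a normalized ratio (776 and 190, 1038 and 340, 1020 and 200, 1380 and 600), dividing the fma3d ratio by peak Gflops reproduces the normalized values to the nearest 10. The sketch below checks that inference; it is an assumption about the normalization, not a documented method.

```python
# (processor, peak Gflops, fma3d ratio) as read from the slide.
ROWS = [
    ("Intel Itanium II", 4.0, 776),
    ("Intel Pentium-4", 3.06, 1038),
    ("IBM Power4", 5.2, 1020),
    ("HP Alpha EV7", 2.3, 1380),
]


def normalized_ratio(fma3d_ratio: float, peak_gflops: float) -> float:
    """fma3d ratio per peak Gflop (assumed normalization)."""
    return fma3d_ratio / peak_gflops


normalized = {
    name: round(normalized_ratio(ratio, peak), -1)  # nearest 10
    for name, peak, ratio in ROWS
}

for name, value in normalized.items():
    print(f"{name:<18} {value:5.0f}")
```

On this reading, the column measures delivered fma3d performance per unit of peak speed, which is why the Alpha EV7 ranks highest despite the lowest peak.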

The Bad News Somewhere between a petaflop and an Exaflop, we will run the string out on this approach to computing

The Good News - For ExaFlops computing, there is lots of potential for innovation:
New approaches:
- DNA computers
- New memory-centric technologies (e.g., spin computers)
- (Not) quantum computers
- Very low power semiconductor-based systems

The Good News - For ExaFlops computing, there is lots of potential for innovation: The Requirements for SURE will not change!

The Good News I’ll be gone fishing! The END (almost)
