Architectural Considerations for Petaflops and Beyond
Bill Camp, Sandia National Labs
March 4, 2003 -- SOS7, Durango, CO, USA
Programming Models -- A historical perspective
- Machine language rules
- Single-threaded Fortran
- Single-threaded vector Fortran
- Shared-memory parallel vector Fortran directives (multi-, auto-, and microtasking) -- present
- Massively parallel, message-passing Fortran and C -- present
- Threads-based, shared-memory parallelism -- present
- Hybrid threads + message passing
Programming Models -- Some false starts
- Late 80s--early 90s: SIMD Fortran for heterogeneous problems
- Mid-eighties--present: Dataflow parallelism and functional programming
- Mid-eighties--late eighties: AI-based languages, e.g., LISP
- Mid-nineties: CRAFT-90 (shared-memory approach to MPPs)
- Early nineties to ~2000: MPP threads
Programming Models -- Observations
- Shared-memory programming models have never scaled well
- Directives-based approaches lead to code explosion and are not effective at dealing with Amdahl's Law
- Outer-loop, distributed-memory parallelism requires a "physics-centric" approach. I.e., it changed the way we think about parallelism but (largely) preserved our code base, didn't lead to code explosion, and made it easier to marginalize the effects of Amdahl's Law
- People will change approaches only for a huge perceived gain
Petaflops-- can we get there with what we have now? YES
What's Important? SURE:
- Scalability
- Usability
- Reliability
- Expense minimization
A more REAListic Amdahlian Law
The actual scaled speedup is more like

  S(N) ~ S_Amdahl(N) / [1 + f_comm * R_p/c],

where f_comm is the fraction of work devoted to communications and R_p/c is the ratio of processor speed to communications speed.
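The correction factor in the law above is easy to evaluate directly. A minimal sketch in Python (the function name is my own; the formula is the one on the slide):

```python
def real_over_amdahl(f_comm, r_pc):
    """Ratio of the REAListic speedup to plain Amdahl speedup:
    S_real(N) / S_Amdahl(N) = 1 / (1 + f_comm * R_p/c)."""
    return 1.0 / (1.0 + f_comm * r_pc)

# f_comm = 0.10 on a balanced machine (R_p/c = 1): lose ~9% to communications
print(real_over_amdahl(0.10, 1.0))
```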
REAL Law Implications: S_real(N) / S_Amdahl(N)
Let's consider three cases on two computers. The two computers are identical except that one has an R_p/c of 1 and the second an R_p/c of 0.05. The three cases are f_comm = 0.01, 0.05, and 0.10.
[Chart: S(N) / S_Amdahl(N) plotted against R_p/c for the three f_comm cases]
Bottom line: A well-balanced architecture is nearly insensitive to communications overhead. By contrast, a system with weak communications can lose over half its power on applications in which communications is important.
Petaflops-- Why can we get there with what we have now? We only need 3 more spins of Moore's Law:
- Today's 6-GF Hammer becomes a 48-GF processor by 2009
- Gigabit ethernet becomes 40- or 80-Gbit ethernet
- Memory capacities and prices continue to improve on the current trend until 2009
- Disk technology continues on its current trajectory for 6 more years
- We use small, optical switches to give us GByte/sec interconnects
Petaflops-- Why can we get there with what we have now?
- We need roughly 20,000 processors (at 48 GF each) to get a peak PETAFLOP
- It will have TB memory
- It will have several hundred petabytes of disk storage
- It will sustain about a half terabyte/sec of I/O (more costs more)
- It will have about 30 TB/sec of XC BW
- It will have about PB/sec of memory BW
- BALANCE REMAINS ESSENTIALLY LIKE THAT IN THE RED STORM DESIGN
- COST in 2009: $100M--$250M in then-year dollars
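The processor count follows from the 48-GF per-processor figure on the previous slide. A back-of-the-envelope check (the exact count on the original slide is garbled, so this is just the implied arithmetic):

```python
peak_target = 1.0e15        # 1 petaflop peak
per_processor = 48.0e9      # 48 GF per processor

# three Moore's-law doublings: 6 GF -> 12 -> 24 -> 48 GF
assert 6.0e9 * 2**3 == per_processor

processors_needed = peak_target / per_processor
print(round(processors_needed))  # on the order of 20,000 processors
```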
Petaflops-- Design issues
- It will use commodity processors with multiple cores per chip
- It will run a partitioned OS based on Linux
- It could have partitions with fast vector processors in a mix-and-match architecture
- It won't look like the Earth Simulator
- It won't run IA-64, based on current Intel design intent
- It will probably run Power PC or HAMMER follow-ons
Petaflops-- Why not Earth Simulator?
- On our codes, commodity processors are nearly as fast as the ES nodes, and they have an order-of-magnitude cost/performance advantage
- BTW, this is also true-- but with not as huge a difference-- for the McKinley versus the Pentium-4
- Example: The geometric mean of the Livermore Loops on the ES is only 60% faster than on a 2-GHz Pentium-4
- Example: A real CTH problem is about as fast on that P-4 as it is on the ES
Petaflops-- Why not Earth Simulator? Amdahl’s Law and the high cost of custom processors
Why not Earth Simulator? Amdahl's Law

  S = T_S / T_V
  S = 1 / { [ pW/(sN) + (1-p)W/(s/M) ] / [ W/s ] }
  S = [ p/N + M(1-p) ]^-1

Let N = M = 4: S = 1 / [ p/4 + 4(1-p) ].
Why not Earth Simulator? Amdahl's Law (p = vector fraction of work)

  S = [ p/N + M(1-p) ]^-1

Let N = M = 4: S = 1 / [ p/4 + 4(1-p) ].
p must be greater than or equal to 0.8 for breakeven!
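The breakeven claim can be checked numerically. A minimal sketch of the slide's model (function name is mine) for a vector machine with N = 4 lanes whose scalar unit is M = 4x slower than the commodity reference:

```python
def vector_speedup(p, n=4, m=4):
    """Speedup of an N-wide vector machine whose scalar unit is M times
    slower, per the slide's model: S = 1 / (p/N + M*(1-p))."""
    return 1.0 / (p / n + m * (1.0 - p))

print(vector_speedup(0.8))  # breakeven: ~1.0
print(vector_speedup(0.9))  # a modest win
print(vector_speedup(0.7))  # below 1: the vector machine loses
```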
Petaflops-- Why not IA-64?
- Heat
- Size
- Complexity
- Cost
- High latency / low BW
- Difficulty of compilation
- Competition from Intel
- ….
Processor          Peak Speed    fma3d ratio    Normalized fma3d ratio
Intel Itanium II   4.0 Gflops    --             --
Intel Pentium 4    -- Gflops*    --             --
IBM Power4         5.2 Gflops    --             --
HP Alpha EV7       2.3 Gflops    --             --
The Bad News Somewhere between a petaflop and an Exaflop, we will run the string out on this approach to computing
The Good News -- For ExaFlops computing, there is lots of potential for innovation. New approaches:
- DNA computers
- New memory-centric technologies (e.g., spin computers)
- (Not) quantum computers
- Very low-power semiconductor-based systems
The Good News -- The requirements for SURE will not change!
The Good News -- I'll be gone fishing!
The END (almost)