Architectural Considerations for Petaflops and Beyond. Bill Camp, Sandia National Laboratories. SOS7, March 4, 2003, Durango, CO, USA.

Programming Models: a historical perspective
- Machine language rules
- Single-threaded Fortran
- Single-threaded vector Fortran
- Shared-memory parallel vector Fortran; directives: multi-, auto-, and microtasking
- Present: massively parallel, message-passing Fortran and C
- Present: threads-based, shared-memory parallelism
- Present: hybrid threads + message passing

Programming Models: some false starts
- Late '80s to early '90s: SIMD Fortran for heterogeneous problems
- Mid-'80s to present: dataflow parallelism and functional programming
- Mid-'80s to late '80s: AI-based languages, e.g., LISP
- Mid-'90s: CRAFT-90 (a shared-memory approach to MPPs)
- Early '90s to ~2000: MPP threads

Programming Models: observations
- Shared-memory programming models have never scaled well.
- Directives-based approaches lead to code explosion and are not effective at dealing with Amdahl's Law.
- Outer-loop, distributed-memory parallelism requires a "physics-centric" approach: it changed the way we think about parallelism but (largely) preserved our code base, didn't lead to code explosion, and made it easier to marginalize the effects of Amdahl's Law.
- People will change approaches only for a huge perceived gain.

Petaflops: can we get there with what we have now? YES.

What's Important? SURE:
- Scalability
- Usability
- Reliability
- Expense minimization

A more REAListic Amdahlian Law. The actual scaled speedup is more like
$S(N) \approx S_{\text{Amdahl}}(N) \,/\, [\,1 + f_{\text{comm}} \cdot R_{p/c}\,]$,
where $f_{\text{comm}}$ is the fraction of work devoted to communications and $R_{p/c}$ is the ratio of processor speed to communications speed.

REAL Law Implications: $S_{\text{real}}(N) / S_{\text{Amdahl}}(N)$. Let's consider three cases on two computers. The two computers are identical except that the first is well balanced, with $R_{p/c} = 1$, while the second has communications only 0.05 times as fast as its processors, i.e. $R_{p/c} = 20$. The three cases are $f_{\text{comm}} = 0.01$, $0.05$, and $0.10$.

[Chart: $S(N)/S_{\text{Amdahl}}(N)$ as a function of $R_{p/c}$ and $f_{\text{comm}}$.]
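The chart's data did not survive transcription, but the ratio is easy to recompute from the REAL law above. A minimal sketch in Python, assuming (per the reading above) $R_{p/c} = 20$ for the poorly balanced machine:

```python
# S_real / S_Amdahl = 1 / (1 + f_comm * R_pc), from the REAL law above.
# R_pc = 1 is the balanced machine; R_pc = 20 is an assumed value for the
# machine whose communications run at only 0.05x processor speed.
for r_pc in (1.0, 20.0):
    for f_comm in (0.01, 0.05, 0.10):
        ratio = 1.0 / (1.0 + f_comm * r_pc)
        print(f"R_p/c = {r_pc:4.0f}  f_comm = {f_comm:.2f}  "
              f"S_real/S_Amdahl = {ratio:.2f}")
```

At $R_{p/c} = 1$ even the worst case keeps 91% of the Amdahl speedup; at $R_{p/c} = 20$, the $f_{\text{comm}} = 0.10$ case drops to 33%.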

Bottom line: a well-balanced architecture is nearly insensitive to communications overhead. By contrast, a system with weak communications can lose over half its power for applications in which communications is important.

Petaflops: why can we get there with what we have now? We only need three more spins of Moore's Law:
- Today's 6-GF Hammer becomes a 48-GF processor by 2009 (checked in the sketch below).
- Gigabit Ethernet becomes 40- or 80-Gbit Ethernet.
- Memory capacities and prices continue to improve on the current trend until 2009.
- Disk technology continues on its current trajectory for six more years.
- We use small optical switches to give us GByte/sec interconnects.
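A minimal sketch of the doubling arithmetic behind "three more spins" (the ~2-year doubling cadence is an assumption; the slide gives only the endpoints):

```python
# Three Moore's-Law doublings from a 6 GF processor, at an assumed
# ~2-year doubling cadence starting from the talk's date (2003).
speed_gf, year = 6.0, 2003
for _ in range(3):
    speed_gf *= 2
    year += 2
print(f"{speed_gf:.0f} GF by {year}")  # 48 GF by 2009
```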

Petaflops: why can we get there with what we have now?
- We need 12,000 or more processors to get a peak PETAFLOP (see the sizing sketch below).
- It will have terabytes of memory.
- It will have several hundred petabytes of disk storage.
- It will sustain about half a terabyte/sec of I/O (more costs more).
- It will have about 30 TB/sec of cross-section (XC) bandwidth.
- It will have about a PB/sec of memory bandwidth.
- BALANCE REMAINS ESSENTIALLY LIKE THAT IN THE RED STORM DESIGN.
- Cost in 2009: $100M to $250M in then-year dollars.
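A quick sizing check using only figures from this slide and the 48-GF processor from the previous one; the processor count and bytes-per-flop ratios below are derived here, not taken from the talk:

```python
# Derived sizing for the hypothetical 2009 peak-petaflop machine.
peak_flops = 1e15    # 1 PF peak
proc_flops = 48e9    # 48 GF per processor (previous slide)
xc_bw = 30e12        # ~30 TB/s cross-section bandwidth
io_bw = 0.5e12       # ~0.5 TB/s sustained I/O

print(f"processors needed:  {peak_flops / proc_flops:,.0f}")  # ~20,833
print(f"XC bytes per flop:  {xc_bw / peak_flops:.3f}")        # 0.030
print(f"I/O bytes per flop: {io_bw / peak_flops:.4f}")        # 0.0005
```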

Petaflops: design issues
- It will use commodity processors with multiple cores per chip.
- It will run a partitioned OS based on Linux.
- It could have partitions with fast vector processors in a mix-and-match architecture.
- It won't look like the Earth Simulator.
- It won't run IA-64, given current Intel design intent.
- It will probably run PowerPC or HAMMER follow-ons.

Petaflops: why not Earth Simulator?
- On our codes, commodity processors are nearly as fast as the ES nodes, and they have an order-of-magnitude cost/performance advantage.
- BTW, this is also true, though with not as huge a difference, for the McKinley versus the Pentium 4.
- Example: the geometric mean of the Livermore Loops on the ES is only 60% faster than on a 2-GHz Pentium 4.
- Example: a real CTH problem is about as fast on that P4 as it is on the ES.

Petaflops: why not Earth Simulator? Amdahl's Law and the high cost of custom processors.

Why not Earth Simulator? Amdahl's Law.
$S = T_S / T_V$, the speedup of the vector machine over a commodity scalar processor. Reading the algebra (the slide does not define the symbols): $W$ is the total work, $s$ the scalar processor's speed, $p$ the vectorizable fraction, $sN$ the vector unit's speed, and $s/M$ the vector machine's scalar speed. Then
$S = \left\{ \left[ \frac{pW}{sN} + \frac{(1-p)W}{s/M} \right] \Big/ \frac{W}{s} \right\}^{-1} = \left[ \frac{p}{N} + M(1-p) \right]^{-1}$.
Let $N = M = 4$: $S = 1/[\,p/4 + 4(1-p)\,]$.

Why not Earth Simulator? Amdahl's Law ($p$ = vector fraction of work).
$S = \left[\, p/N + M(1-p) \,\right]^{-1}$. Let $N = M = 4$: $S = 1/[\,p/4 + 4(1-p)\,]$.
Setting $S = 1$ gives $p/4 + 4(1-p) = 1$, i.e. $3.75\,p = 3$, so $p$ must be greater than or equal to 0.8 just to break even! (A quick numerical sketch follows.)
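A minimal sketch of this breakeven computation (the function name and the grid of $p$ values are illustrative, not from the talk):

```python
# Speedup of a machine with a 4x-faster vector unit but a 4x-slower scalar
# unit, relative to a commodity processor, versus vectorizable fraction p:
# S(p) = 1 / (p/N + M*(1-p)) with N = M = 4.
def vector_speedup(p, n=4.0, m=4.0):
    return 1.0 / (p / n + m * (1.0 - p))

for p in (0.5, 0.8, 0.9, 0.99):
    print(f"p = {p:.2f}: S = {vector_speedup(p):.2f}")
# p = 0.80 gives S = 1.00: the custom machine only breaks even
# when 80% of the work vectorizes.
```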

Petaflops: why not IA-64?
- Heat
- Size
- Complexity
- Cost
- High latency / low bandwidth
- Difficulty of compilation
- Competition from Intel …

Processor          Peak speed     fma3d ratio   Normalized fma3d ratio
Intel Itanium II   4.0 Gflops
Intel Pentium 4        Gflops*
IBM Power4         5.2 Gflops
HP Alpha EV7       2.3 Gflops

The Bad News: somewhere between a petaflop and an exaflop, we will run the string out on this approach to computing.

The Good News: for exaflops computing, there is lots of potential for innovation. New approaches:
- DNA computers
- New memory-centric technologies (e.g., spin computers)
- (Not) quantum computers
- Very low-power semiconductor-based systems

The Good News: for exaflops computing, there is lots of potential for innovation. The requirements for SURE will not change!

The Good News: I'll be gone fishing! The END (almost)