
1 HPC Achievement and Impact – 2012: a personal perspective
Thomas Sterling
Professor of Informatics and Computing
Chief Scientist and Associate Director, Center for Research in Extreme Scale Technologies (CREST), Pervasive Technology Institute at Indiana University
Fellow, Sandia National Laboratories CSRI
June 20, 2012

2 HPC Year in Review
– A continuing tradition at ISC (9th year, and still going at it)
– As always, a personal perspective – how I've seen it
  – Highlights – the big picture, but not all the nitty details, sorry
  – Necessarily biased, but not intentionally so
  – Iron-oriented, but software too
  – Trends and implications for the future
– And a continuing predictor: The Canonical HEC Computer
Previous Years' Themes:
– 2004: "Constructive Continuity"
– 2005: "High Density Computing"
– 2006: "Multicore to Petaflops"
– 2007: "Multicore: the Next Moore's Law"
– 2008: "Run-up to Petaflops"
– 2009: "Year 1 A.P. (After Petaflops)"
– 2010: "Igniting Exaflops"
– 2011: "Petaflops: the New Norm"
This Year's Theme: ?

3 Trends in Highlight
– International across the high end
– Reversal of power growth – green screams!
– GPU use increases in the HPC community: everybody is doing software (apps or systems) for them
– But … a resurgence of multi-core: IBM BG, Intel MIC (oops, I mean Xeon Phi)
– Exaflops takes hold of international planning
– Big data becomes a driving demand
– Commodity Linux clusters still the norm for mainstream computing

4 HPC Year in Review
– A continuing tradition at ISC (9th year, and still going at it)
– As always, a personal perspective – how I've seen it
  – Highlights – the big picture, but not all the nitty details, sorry
  – Necessarily biased, but not intentionally so
  – Iron-oriented, but software too
  – Trends and implications for the future
– And a continuing predictor: The Canonical HEC Computer
Previous Years' Themes:
– 2004: "Constructive Continuity"
– 2005: "High Density Computing"
– 2006: "Multicore to Petaflops"
– 2007: "Multicore: the Next Moore's Law"
– 2008: "Run-up to Petaflops"
– 2009: "Year 1 A.P. (After Petaflops)"
– 2010: "Igniting Exaflops"
– 2011: "Petaflops: the New Norm"
This Year's Theme:
– 2012: "The Million Core Computer"

5 I. ADVANCEMENTS IN PROCESSOR ARCHITECTURES

6 Processors: Intel
Intel's 22 nm processors: Ivy Bridge, a die shrink of the Sandy Bridge microarchitecture, based on tri-gate (3D) transistors
– Die size of 160 mm2 with 1.4 billion transistors
Intel Xeon Processor E7-8870:
– 10 cores, 20 threads
– 112 GFLOPS (max turbo)
– 2.4 GHz
– 30 MB cache
– 130 W max TDP
– 32 nm
– Supports up to 4 TB of DDR3 memory
Next-generation 14 nm Intel processors are planned.

7 Processors: AMD
AMD's 32 nm processors: Trinity
– Die size of 246 mm2 with 1.3 billion transistors
AMD Opteron 6200 Series:
– 16 cores
– 2.7 GHz
– 32 MB cache
– 140 W max TDP
– 32 nm
– Linpack benchmark (2P) measured with 2 x AMD Opteron 6276 processors

8 Processors: IBM
IBM Power7
– Released in 2010; the main processor in the PERCS system
– 45 nm SOI process, 567 mm2, 1.2 billion transistors
– … GHz clock speed, 4 chips per quad module
– 4, 6, or 8 cores per chip; 64 kB L1 and 256 kB L2 per core; 32 MB L3 cache
– Max … GFLOPS per core, … GFLOPS per chip
IBM Power8
– Successor to Power7, currently under development
– Anticipated: improved SMT, reliability, larger cache size, more cores
– Will be built on a 22 nm process

9 Processors: Fujitsu SPARC64 IXfx
– Introduced in November 2011
– Used in the PRIMEHPC FX10 (an HPC system that is a follow-up to Fujitsu's K computer)
– Architecture: SPARC V9 ISA extended for HPC with an increased number of registers
– 16 cores
– … GFLOPS peak performance
– 12 MB shared L2 cache
– 1.85 GHz
– 115 W TDP

10 Many Integrated Core (MIC) Architecture – Xeon Phi
– Many-core coprocessor architecture
– Prototype products (named Knights Ferry) were announced and released in 2010 to the European Organization for Nuclear Research, the Korea Institute of Science and Technology Information, and the Leibniz Supercomputing Center, among others
Knights Ferry prototype:
– 45 nm process
– 32 in-order cores at up to 1.2 GHz, with 4 threads per core
– 2 GB GDDR5 memory, 8 MB L2 cache
– Single-board performance exceeds 750 GFLOPS
The commercial release (codenamed Knights Corner), to be built on a 22 nm process, is planned; the Texas Advanced Computing Center (TACC) will use Knights Corner cards in its 10 PFLOPS "Stampede" supercomputer

11 Processors: ShenWei SW1600
Third-generation CPU by the Jiāngnán Computing Research Lab
Specs:
– Frequency: … GHz
– 140 GFLOPS
– 16-core RISC architecture
– Up to 8 TB virtual memory and 1 TB physical memory supported
– L1 cache: 8 KB; L2 cache: 96 KB
– 128-bit system bus

12 Processors: ARM
– Industry-leading provider of 32-bit embedded microprocessors
– Wide portfolio of more than 20 processors is available
– Application processors are capable of executing a wide range of OSs: Linux, Android/Chrome, Microsoft Windows CE, Symbian, and others
NVIDIA: The CUDA on ARM Development Kit
– A high-performance, energy-efficient kit featuring an NVIDIA Tegra 3 quad-core ARM A9 CPU
– CPU memory: 2 GB
– 270 GFLOPS single-precision performance

13 II. ADVANCES IN GPUs

14 GPUs: NVIDIA
Current: Kepler architecture (GK-series GPU)
– … billion transistors
– 192 CUDA cores, 32 special function units, and 32 load/store units per SMX
– More than 1 TFLOPS
Upcoming: Maxwell architecture – 2013
– ARM instruction set
– 20 nm fab process
– ??? (expected) GFLOPS per watt in double precision
HPC systems with NVIDIA GPUs:
– Tianhe-1A (2nd on TOP500): 7,168 NVIDIA Tesla M2050 general-purpose GPUs
– Titan (currently 3rd on TOP500): 960 NVIDIA GPUs
– Nebulae (4th on TOP500): 4,640 NVIDIA Tesla C2050 GPUs
– Tsubame 2.0 (5th on TOP500): 4,224 NVIDIA Tesla M2050; 34 NVIDIA Tesla S1070

15 GPUs: AMD
Current: AMD Radeon HD 7970 (Southern Islands HD 7xxx series)
– Released in January 2012
– 28 nm fab process
– 352 mm2 die size with 4.3 billion transistors
– Up to 925 MHz engine clock
– 947 GFLOPS double-precision compute power
– 230 W TDP
Latest AMD architecture – Graphics Core Next (GCN):
– 28 nm GPU architecture
– Designed both for graphics and general computing
– 32 compute units (2,048 stream processors)
– Offloads compute workloads from the CPU

16 GPU Programming Models
Current: CUDA 4.1
– Share GPUs across multiple threads
– Unified Virtual Addressing
– Use all GPUs from a single host thread
– Peer-to-peer communication
Coming in CUDA 5:
– Direct communication between GPUs and other PCI devices
– Easily accelerated nested parallel loops, starting with the Tesla K20 "Kepler" GPU
Current: OpenCL 1.2
– Open, royalty-free standard for cross-platform parallel computing
– Latest version released in November 2011
– Host-thread safety, enabling OpenCL commands to be enqueued from multiple host threads
– Improved OpenGL interoperability by linking OpenCL event objects to OpenGL
OpenACC
– Programming standard developed by Cray, NVIDIA, CAPS, and PGI
– Designed to simplify parallel programming of heterogeneous CPU/GPU systems
– Programming is done through compiler directives (pragmas) and API functions (see the sketch after this slide)
– Planned supported compilers – Cray, PGI, and CAPS
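To make the directive-based style concrete, here is a minimal, hedged sketch of a vector add offloaded with OpenACC pragmas, in the spirit of what the slide describes. The function name, array sizes, and data values are illustrative assumptions; only the `#pragma acc parallel loop` construct with `copyin`/`copyout` data clauses comes from the OpenACC specification itself.

```c
/* Illustrative OpenACC sketch (not from the original slides): a vector add
 * offloaded to an accelerator with directives. On a compiler without
 * OpenACC support the pragma is ignored and the loop runs on the host. */
#include <stdio.h>
#include <stdlib.h>

static void vadd(const float * restrict a, const float * restrict b,
                 float * restrict c, int n)
{
    /* copyin: move a,b to device memory; copyout: bring c back to the host */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    float *c = malloc(n * sizeof *c);
    if (!a || !b || !c) return 1;

    for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    vadd(a, b, c, n);
    printf("c[42] = %f\n", c[42]);   /* expect 126.0 */

    free(a); free(b); free(c);
    return 0;
}
```

In contrast to CUDA, which requires explicit kernel code and launches, the directive approach leaves the loop as ordinary C, which is the "simplify parallel programming" point the slide makes about OpenACC.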

17 III. HPC SYSTEMS around the World

18 IBM Blue Gene/Q
– IBM PowerPC 1.6 GHz CPU
– 16 cores per node
– 16 GB SDRAM-DDR3 per node
– 90% water cooling, 10% air cooling
– 209 TFLOPS peak performance per rack
– 80 kW per rack power consumption
HPC systems powered by Blue Gene/Q:
– Mira, Argonne National Lab: 786,432 cores; … PFLOPS Rmax, … PFLOPS Rpeak; 3rd on TOP500 (June 2012)
– SuperMUC, Leibniz Rechenzentrum: 147,456 cores; … PFLOPS Rmax, … PFLOPS Rpeak; 4th on TOP500 (June 2012)

19 USA: Sequoia
– First on the TOP500 as of June 2012
– Fully deployed in June 2012
– … PFLOPS Rmax and … PFLOPS Rpeak
– 7.89 MW power consumption
– IBM Blue Gene/Q
– 98,304 compute nodes, 1.6 million processor cores
– 1.6 PB of memory
– 8- or 16-core Power Architecture processors built on a 45 nm fabrication process
– Lustre parallel file system

20 Japan: K computer
– First on the TOP500 as of November 2011
– Located at the RIKEN Advanced Institute for Computational Science in Kobe, Japan
– Distributed-memory architecture
– Manufactured by Fujitsu
– Performance: … PFLOPS Rmax and … PFLOPS Rpeak
– … MW power consumption, … GFLOPS/kW
– Annual running costs: $10 million
– 864 cabinets: 88,128 8-core SPARC64 VIIIfx processors at 2.0 GHz, 705,024 cores
– 96 computing nodes + 6 I/O nodes per cabinet
– Water cooling system minimizes failure rate and power consumption

21 China: Tianhe-1A
– #5 on the TOP500
– … PFLOPS Rmax and … PFLOPS Rpeak
– 112 computer cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets:
  – 14,336 Xeon X5670 processors
  – 7,168 NVIDIA Tesla M2050 general-purpose GPUs
  – 2,048 FeiTeng-1000 SPARC-based processors
– 4.04 MW
– Chinese custom-designed NUDT proprietary high-speed interconnect, called Arch (160 Gbit/s)
– Purpose: petroleum exploration, aircraft design
– $88 million to build, $20 million to maintain annually, operated by about 200 workers

22 China: Sunway Blue Light
– 26th on the TOP500
– … TFLOPS Rmax and … PFLOPS Rpeak
– 8,704 ShenWei SW1600 processors at … MHz
– 150 TB main memory
– 2 PB external storage
– Total power consumption: … MW
– 9-rack water-cooled system
– InfiniBand interconnect, Linux OS

23 USA: Intrepid
– Located at Argonne National Laboratory
– IBM Blue Gene/P
– 23rd on the TOP500 list
– 40 racks, 1,024 nodes per rack
– 850 MHz quad-core processors
– 640 I/O nodes
– 458 TFLOPS Rmax and 557 TFLOPS Rpeak
– 1,260 kW

24 Russia: Lomonosov
– Located at the Moscow State University Research Computing Center
– Most powerful HPC system in Eastern Europe
– … PFLOPS Rpeak and … TFLOPS Rmax (TOP500); 1.7 PFLOPS Rpeak and … TFLOPS Rmax (according to the system's website)
– 22nd on the TOP500
– 5,104 CPU nodes (T-Platforms T-Blade2 system)
– 1,065 GPU nodes (T-Platforms TB2-TL)
– Applications: regional and climate research, nanoscience, protein simulations

25 Cray Systems
Cray XK6
– Cray's Gemini interconnect, AMD's multicore scalar processors, and NVIDIA's GPU processors
– Scalable up to 50 PFLOPS of combined performance
– 70 TFLOPS per cabinet
– 16-core AMD Opteron 6200 (96/cabinet), 1,536 cores/cabinet
– NVIDIA Tesla 2090 (96 per cabinet)
– OS: Cray Linux Environment
Cray XE6
– Cray's Gemini interconnect, AMD Opteron 6100 processors
– 20.2 TFLOPS per cabinet
– 12-core AMD Opteron 6100 (192/cabinet), 2,304 cores/cabinet
– 3D torus interconnect

26 ORNL's "Titan" System
Upgrade of Jaguar from Cray XT5 to XK6:
– Cray Linux Environment operating system
– Gemini interconnect: 3-D torus, globally addressable memory, advanced synchronization features
– AMD Opteron 6274 processors (Interlagos)
– New accelerated node design using NVIDIA many-core accelerators
  – 2011: 960 NVIDIA X2090 "Fermi" GPUs
  – 2012: 14,592 NVIDIA "Kepler" GPUs
– 20+ PFLOPS peak system performance
– 600 TB DDR3 memory + … TB GDDR5 memory (a quick consistency check follows after this slide)
– Liquid cooling at the cabinet level (Cray EcoPHLex)
Titan specs:
– Compute nodes: 18,688
– Login & I/O nodes: 512
– Memory per node: 32 GB + 6 GB
– Number of Fermi chips (2012): 960
– Number of NVIDIA "Kepler" chips (2013): 14,592
– Total system memory: 688 TB
– Total system peak performance: 20+ PFLOPS
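As a quick, hedged consistency check on the memory figures above, assuming 32 GB of DDR3 per compute node and 6 GB of GDDR5 per Kepler GPU as listed in the specs:

\[
18{,}688 \times 32\ \mathrm{GB} \approx 598\ \mathrm{TB\ (DDR3)}, \qquad
14{,}592 \times 6\ \mathrm{GB} \approx 88\ \mathrm{TB\ (GDDR5)}
\]

The sum, roughly 686 TB, is consistent with the quoted 600 TB of DDR3 and the 688 TB total system memory.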

27 USA: Blue Waters
– Petascale supercomputer to be deployed at NCSA at UIUC
– IBM terminated the contract in August 2011; Cray was chosen to deploy the system
Specifications:
– Over 300 cabinets (Cray XE6 and XK6)
– Over 25,000 compute nodes
– 11.5 PFLOPS peak performance
– Over 49,000 AMD processors (380,000 cores)
– Over 3,000 NVIDIA GPUs
– Over 1.5 PB overall system memory (4 GB per core)

28 UK: HECToR
– Located at the University of Edinburgh in Scotland
– 32nd on the TOP500 list
– … TFLOPS Rmax and … TFLOPS Rpeak
– Currently 30 cabinets, 704 compute blades
– Cray XE6 system
– 16-core Interlagos at 2.3 GHz, 90,112 cores total

29 Germany: HERMIT
– Located at the High Performance Computing Center Stuttgart (HLRS)
– 24th on the TOP500 list
– … TFLOPS Rmax and … TFLOPS Rpeak
– Currently 38 cabinets
– Cray XE6 with … compute nodes, 113,664 compute cores
– Dual-socket AMD 2.3 GHz (16 cores per socket)
– 2 MW max power consumption
– Research at HLRS: OpenMP Validation Suite, MPI-Start, molecular dynamics, biomechanical simulations, Microsoft parallel visualization, virtual reality, augmented reality

30 NEC SX-9
Supercomputer, an SMP system
Single-chip vector processor:
– 3.2 GHz
– 8-way replicated vector pipes (each has 2 multiply and 2 addition units)
– … GFLOPS peak performance
– 65 nm CMOS technology
– Up to 64 GB of memory

31 Passing of the Torch: IV. In Memoriam

32 Alan Turing - Centennial

33 Alan Turing:
– Father of computability, with the abstract "Turing Machine"
– Father of Colossus – arguably the first digital electronic supercomputer
– Broke German Enigma codes
– Saved tens of thousands of lives

34 Passing of the Torch: V. The Flame Burns On

35 Key HPC Awards
Seymour Cray Award: Charles L. Seitz
– For innovation in high-performance message-passing architectures and networks
– Member of the National Academy of Engineering
– President of Myricom, Inc. until last year
– Projects he worked on include VLSI design, programming and packet-switching techniques for 2nd-generation multicomputers, and the Myrinet high-performance interconnect
Sidney Fernbach Award: Cleve Moler
– For fundamental contributions to linear algebra, mathematical software, and enabling tools for computational science
– Professor at the University of Michigan, Stanford University, and the University of New Mexico
– Co-founder of MathWorks, Inc. (in 1984)
– One of the authors of LINPACK and EISPACK
– Member of the National Academy of Engineering
– Past president of the Society for Industrial and Applied Mathematics

36 Key Awards
Ken Kennedy Award: Susan L. Graham
– For foundational compilation algorithms and programming tools; research and discipline leadership; and exceptional mentoring
– Fellow: Association for Computing Machinery, American Association for the Advancement of Science, American Academy of Arts and Sciences, National Academy of Engineering
– Awards: IEEE John von Neumann Medal in 2009
– Research includes work on Harmonia, a language-based framework for interactive software development, and Titanium, a Java-based parallel programming language, compiler, and runtime system
ACM Turing Award: Judea Pearl
– For fundamental contributions to artificial intelligence through the development of a calculus for probabilistic and causal reasoning
– Professor of Computer Science at UCLA
– Pioneered probabilistic and causal reasoning; created the computational foundation for processing information under uncertainty
– Awards: IEEE Intelligent Systems, ACM Allen Newell Award, and many others
– Member: AAAI, IEEE, National Academy of Engineering, and others

37 VII. EXASCALE

38 Picking up the Exascale Gauntlet
International Exascale Software Project
– 7th meeting in Cologne, Germany
– 8th meeting in Kobe, Japan
European Exascale Software Initiative
– EESI initial study completed
– EESI-2 inaugurated
US DOE X-stack Program
– 4 teams selected to develop the system software stack
– To be announced
Plans in formulation by many nations

39 Strategic Exascale Issues
– Next-generation execution model(s)
– Runtime systems
– Numeric parallel algorithms
– Programming models and languages
– Transition of legacy codes and methods
– Computer architecture: highly controversial – must be COTS, but can't be the same
– Managing power and reliability
– Good fast machines, or just machines fast?

40 IX. CANONICAL HPC SYSTEM

41 Canonical Machine: Selection Process
– Calculate histograms for selected non-numeric technical parameters
  – Architecture: Cluster – 410, MPP – 89, Constellations – 1
  – Processor technology: Intel Nehalem – 196, other Intel EM64T – 141, AMD x86_64 – 63, Power – 49, etc.
  – Interconnect family: GigE – 224, IB – 209, Custom – 29, etc.
  – OS: Linux – 427, AIX – 28, CNK/SLES 9 – 11, etc.
– Find the largest intersection with a single modality for each parameter (Cluster, Intel Nehalem, GigE, Linux): 110 machines matching
  – Note: this does not always include the largest histogram buckets. For example, in 2003 the dominant system architecture was the cluster, but the largest parameter subset was generated for (Constellations, PA-RISC, Myrinet, HP-UX)
– In the computed subset, find the centroid for selected numerical parameters: Rmax, number of compute cores, processor frequency
  – Another useful metric would be power, but it is not available for all entries in the list
– Order the machines in the subset by increasing distance (least-square error) from the centroid
  – The parameter values have to be normalized due to the different units used
– The canonical machine is found at the head of the list (a code sketch of the centroid and distance-ordering steps follows below)
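As a concrete illustration of the centroid and distance-ordering steps, here is a small, hedged C sketch that min-max normalizes the three numeric parameters (Rmax, core count, clock frequency), computes their centroid over a candidate subset, and sorts the machines by squared distance from it. The `Machine` structure, the three sample entries, and the choice of min-max normalization are illustrative assumptions; the slide only states that the values are normalized and ordered by least-square error.

```c
/* Illustrative sketch of the canonical-machine selection:
 * normalize parameters, compute the centroid, sort by squared distance.
 * The sample machines are made-up stand-ins, not TOP500 data. */
#include <stdio.h>
#include <stdlib.h>

#define NPARAM 3   /* Rmax (TFLOPS), compute cores, frequency (GHz) */

typedef struct { const char *name; double p[NPARAM]; double dist; } Machine;

static Machine subset[] = {
    { "system_A", {  62.3,  7776, 2.93 }, 0.0 },
    { "system_B", {  40.1,  5120, 2.66 }, 0.0 },
    { "system_C", { 110.5, 16384, 2.60 }, 0.0 },
};
static const int n = sizeof subset / sizeof subset[0];

static int by_dist(const void *a, const void *b)
{
    double d = ((const Machine *)a)->dist - ((const Machine *)b)->dist;
    return (d > 0) - (d < 0);
}

int main(void)
{
    double lo[NPARAM], hi[NPARAM], centroid[NPARAM] = {0};

    /* Per-parameter bounds for min-max normalization (one possible choice). */
    for (int j = 0; j < NPARAM; j++) {
        lo[j] = hi[j] = subset[0].p[j];
        for (int i = 1; i < n; i++) {
            if (subset[i].p[j] < lo[j]) lo[j] = subset[i].p[j];
            if (subset[i].p[j] > hi[j]) hi[j] = subset[i].p[j];
        }
    }

    /* Normalize in place and accumulate the centroid. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < NPARAM; j++) {
            double range = hi[j] - lo[j];
            subset[i].p[j] = range > 0 ? (subset[i].p[j] - lo[j]) / range : 0.0;
            centroid[j] += subset[i].p[j] / n;
        }

    /* Squared (least-square) distance from the centroid, then sort ascending. */
    for (int i = 0; i < n; i++) {
        subset[i].dist = 0.0;
        for (int j = 0; j < NPARAM; j++) {
            double d = subset[i].p[j] - centroid[j];
            subset[i].dist += d * d;
        }
    }
    qsort(subset, n, sizeof subset[0], by_dist);

    printf("canonical machine (closest to centroid): %s\n", subset[0].name);
    return 0;
}
```

Run over the real 110-machine (Cluster, Intel Nehalem, GigE, Linux) subset instead of the toy data, the head of the sorted list would be the canonical system described on the following slides.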

42 Canonical Subset with Machine Ranks

43 Centroid Placement in Parameter Space
Centroid coordinates:
– #cores: …
– Rmax: 64.3 TFLOPS
– Processor frequency: 2.82 GHz

44 The Canonical HPC System – 2012
Architecture: commodity cluster
– Dominant class of HEC system
Processor: 64-bit Intel Xeon Westmere-EP E56xx (32 nm die shrink of the E55xx) in most deployed machines
– Older E55xx Nehalem in second place by machine count
– Minimal changes since last year: strong Intel dominance, with AMD- and POWER-based systems in distant second and third place
– Intel maintains a strong lead in total number of cores deployed (4,337,423), followed by AMD Opteron (2,038,956) and IBM POWER (1,434,544)
Canonical example: #315 on the TOP500
– Rmax of 62.3 TFLOPS, Rpeak of 129 TFLOPS, … cores
System node
– Based on the HS22 BladeCenter (3rd most popular system family)
– Equipped with Intel X5670 6C clocked at 2.93 GHz (6 cores/12 threads per processor)
Homogeneous
– Only 39 accelerated machines in the list; accelerator cores constitute 5.7% of all cores in the TOP500
IBM systems integrator
– Remains the most popular vendor (225 machines in the list)
Gigabit Ethernet interconnect
– InfiniBand still did not surpass GigE in popularity
Linux
– By far the #1 OS
Power consumption: 309 kW, … MFLOPS/W (an estimate follows below)
Industry owned and operated (telecommunications company)
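For reference, a hedged back-of-the-envelope estimate of the power efficiency implied by these figures, assuming it is computed from Rmax and the quoted system power:

\[
\frac{R_{\max}}{P} = \frac{62.3\ \mathrm{TFLOPS}}{309\ \mathrm{kW}} \approx 202\ \mathrm{MFLOPS/W}
\]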

45 X. CLOSING REMARKS

46 Next Year (if invited) – Leipzig
– 10th anniversary of retrospective keynotes
– Focus on impact and accomplishments across an entire decade
– Application oriented: science and industry
– System software, looking forward


