
Supercomputing Challenges at the National Center for Atmospheric Research
Dr. Richard Loft
Computational Science Section, Scientific Computing Division
National Center for Atmospheric Research, Boulder, CO, USA

Talk Outline
- Supercomputing trends and constraints
- Observed NCAR cluster performance (aggregate)
- Microprocessor efficiency: what is possible?
- Microprocessor efficiency: recent efforts to improve CAM2 performance
- Some RISC/vector cluster comparisons
- Conclusions

The Demand: High Cost of Science Goals
- Climate scientists project a need for 150x more computing power over the next 5 years.
- Doubling horizontal resolution (T42 -> T85) increases computational cost eightfold.
- Many additional constituents will be advected.
- New physics: the computational cost of CAM/CCM, holding resolution constant, has increased 4x. More is coming...
- Future: introducing super-parameterizations of moist processes would increase physics costs dramatically.
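
The eightfold figure follows from a simple scaling argument; the sketch below (my own, not from the slides) spells it out: twice the grid points in each horizontal direction, plus a halved timestep from the CFL limit.

```python
# A minimal sketch (my own, not from the slides) of the scaling argument
# behind "doubling horizontal resolution costs ~8x": twice the grid points
# in each horizontal direction, plus a halved timestep (CFL limit).

def relative_cost(resolution_factor: float) -> float:
    """Cost of a run relative to baseline when horizontal resolution
    is multiplied by resolution_factor."""
    horizontal_points = resolution_factor ** 2   # lat x lon
    timesteps = resolution_factor                # dt shrinks as dx shrinks
    return horizontal_points * timesteps

print(relative_cost(2.0))   # 8.0, e.g. T42 -> T85
```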

Existing Infrastructure Limits at NCAR
- Cooling capacity: 450 tons (1.58 megawatts)
  - Most limiting
  - One P690 node ~ 7.9 kW ~ 2.5 tons
  - Must balance cooling with power
- Power: ~1.2 MW without modifications
  - Second most limiting
  - The NCAR computer room currently draws 602 kW
  - About 400 kW of that is from the IBM clusters
- Space: ~14,000 sq. ft.
  - P690 ~ 196 W/sq. ft.
  - Least limiting based on current trends
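
As a sanity check on these figures, the sketch below (my numbers, not the slide's) applies the standard conversion of one ton of refrigeration to about 3.517 kW.

```python
# Rough consistency check (my numbers, not the slide's) of the cooling
# figures, using the standard conversion 1 ton of refrigeration ~= 3.517 kW.

KW_PER_TON = 3.517

total_tons = 450
print(total_tons * KW_PER_TON / 1000)    # ~1.58 MW, matching the quoted capacity

p690_node_kw = 7.9
print(p690_node_kw / KW_PER_TON)         # ~2.2 tons per P690 node (slide rounds to ~2.5)

# Cooling headroom left if the computer room already draws 602 kW:
print(total_tons * KW_PER_TON - 602)     # ~981 kW
```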

Mass Storage Growth
- 1.3 PBytes total; adding ~3 TBytes/day
- 5-year doubling times:
  - Unique files: 2.1 years
  - File size: 10.4 years
  - Media performance (GB/$): 1.9 years
- Alarming trends:
  - The MSS growth-rate doubling time has accelerated over the past year; it is now 18 months.
  - MSS costs are increasing...
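
To make the alarming trend concrete, the projection below is my own sketch, assuming the daily ingest itself keeps doubling every 18 months; none of these projected totals appear on the slide.

```python
# Illustrative projection (my own sketch, assuming the daily ingest itself
# keeps doubling every 18 months) of how fast the archive grows.

from math import exp, log

pb_now = 1.3             # PBytes stored today
tb_per_day = 3.0         # current ingest rate
doubling_months = 18.0
days_per_month = 30.44

def archive_pb_after(months: float) -> float:
    """Archive size in PBytes after `months`, with exponentially growing ingest."""
    rate = log(2) / doubling_months                      # per-month growth of ingest
    added_tb = tb_per_day * days_per_month * (exp(rate * months) - 1) / rate
    return pb_now + added_tb / 1000.0

for m in (12, 24, 36, 60):
    print(m, round(archive_pb_after(m), 1), "PB")        # ~2.7, 4.9, 8.4, 22.8 PB
```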

Observed Cluster Performance (Aggregate)

IBM Clusters at NCAR
- Bluesky: 1024-processor IBM 1.3 GHz Power-4 cluster
  - 32 P690/32 compute servers
  - 736 processors in 92 8-way "nodes" (bluesky8)
  - 288 processors in 9 32-way "nodes" (bluesky32)
  - Peak: 5.3 TFlops
  - Dual "Colony" interconnect
- Blackforest: IBM 375 MHz Power-3 cluster
  - 283 "winterhawk" 4-way SMPs
  - Peak: TFlops
  - TBMX interconnect

Observed IBM Cluster Efficiencies
System        Application efficiency (% of peak)
bluesky8      4.1%
bluesky32     4.5%
blackforest   5.7%
- Newer systems are less efficient.
- Larger nodes are more efficient.
- Max sustained performance: GFlops
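
The sustained figure can be estimated from numbers elsewhere in the deck; the sketch below is my arithmetic, not the slide's: sustained rate = peak rate x observed efficiency.

```python
# Sketch (my arithmetic, not the slide's) of what "max sustained" works out
# to: sustained rate = peak rate x observed efficiency, using figures that
# appear elsewhere in this deck.

bluesky_peak_gflops = 5300.0                  # 5.3 TFlops peak (P690 cluster slide)
efficiency = {"bluesky8": 0.041, "bluesky32": 0.045, "blackforest": 0.057}

print(bluesky_peak_gflops * efficiency["bluesky32"])   # ~240 GFlops sustained on bluesky
```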

Why is workload efficiency low?
- Computational character of the workload, on average:
  - L3 cache miss rate: 31%
  - Computational intensity: 0.8
- Applications are memory-bandwidth limited.
  - A simple bandwidth model predicts 5.5% of peak for bluesky32.
- A good metric of efficiency is flops/cycle.
  - Factors out the dual FPUs.
  - Bluesky32: 0.18 flops/cycle
  - Blackforest: 0.23 flops/cycle
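
To see how the metric factors out the dual FPUs, here is a small sketch (mine, following the slide's reasoning): both POWER chips peak at 4 flops/cycle (two fused multiply-add units), so dividing the measured flops/cycle by 4 recovers the efficiency figures from the earlier slide.

```python
# Both processors peak at 4 flops/cycle (two fused multiply-add FPUs), so
# flops/cycle divided by 4 recovers the efficiency numbers quoted earlier.

PEAK_FLOPS_PER_CYCLE = 4

def efficiency(flops_per_cycle: float) -> float:
    return flops_per_cycle / PEAK_FLOPS_PER_CYCLE

print(efficiency(0.18))   # 0.045  -> bluesky32's 4.5% of peak
print(efficiency(0.23))   # 0.0575 -> blackforest's 5.7% of peak
print(0.18 * 1.3)         # ~0.23 GFlops sustained per 1.3 GHz Power-4 CPU
```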

RISC Cluster Network Comparison
- IBM Power-4 cluster with dual "Colony" network
- IBM Power-3 cluster with single TBMX network
- Compaq Alpha cluster with Quadrics network
- Bisection bandwidth
  - Important for global communications
  - The XPAIR benchmark initiates all-to-all communication.
  - Dual Colony P690 local:global bandwidth ratio is 50:1.
- Global reductions
  - For P processors these should scale as log(P).
  - They actually scale linearly.
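
The log(P) expectation comes from a binary reduction tree; the sketch below (mine, not from the slides) contrasts the tree's step count with the linear growth actually measured.

```python
# Sketch (mine, not from the slides) of the log(P) expectation: a binary
# reduction tree needs ceil(log2(P)) communication steps, while the measured
# reduction times on these clusters grew roughly linearly with P.

from math import ceil, log2

def tree_reduction_steps(p: int) -> int:
    return ceil(log2(p))

for p in (32, 128, 512, 1024):
    print(p, tree_reduction_steps(p))   # 5, 7, 9, 10 steps

# Going from 32 to 1024 processors should cost ~2x (10 vs 5 steps);
# linear scaling makes it ~32x instead.
```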

Cluster Network Performance

Microprocessor efficiency: What is possible?

Example: 3-D FFT Performance
- Hand-tuned, multithreaded 3-D FFT (STK)
- Three 1-D FFTs, one along each axis, with transpositions
- FFTs are memory-bandwidth intensive
  - Both loads and flops scale like N*log(N)
- The FFT is not multiply-add dominated
- The FFT butterfly is a non-local, strided calculation
  - It gets more non-local as the size of the FFT increases
- 1024^3 transforms on a P690 (IBM Power-4)
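
A rough flop-per-byte estimate (my own, using the textbook 5*N*log2(N) operation count for a complex FFT; not from the slides) shows why a 1024^3 transform is limited by memory traffic rather than by the floating-point units.

```python
# Rough flop/byte estimate (my own, using the textbook 5*N*log2(N) operation
# count for a complex FFT of N points) for a 1024^3 transform.

from math import log2

n = 1024                       # 1024^3 transform, as on the slide
points = n ** 3
bytes_per_point = 16           # double-precision complex

flops = 5 * points * log2(points)               # all three 1-D passes combined
bytes_moved = 3 * 2 * points * bytes_per_point  # each pass streams the array in and out

print(flops / 1e9)             # ~161 GFlop per transform
print(flops / bytes_moved)     # ~1.6 flops/byte: low enough that memory
                               # bandwidth, not the FPUs, sets the pace
```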

Microprocessor efficiency: Recent efforts to improve CAM2 performance…

CCM Benchmark Performance on Existing Multiprocessor Clusters

Some RISC/Vector Cluster Comparisons…

Processor Comparison
                     Power 4 (2 cores)    Pentium 4     Itanium II        SX-6
Process              0.18 µ Cu / 7l       0.13 µ Cu     0.18 µ Al / 6l    0.15 µ Cu / 9l
MHz                  1300                               /1000
Peak GFlops          5.2                  2.8/
Die area             400 mm²
Transistors          170 M                55 M          221 M             57 M
On-chip cache        1.77 MB              512 KB        3.3 MB
On-chip bandwidth    41 GB/s (per core)   89.6 GB/s     64 GB/s
Memory bandwidth     5.8 GB/s             4.3 GB/s      6.4 GB/s          32 GB/s

IBM P690 Cluster
- 5.3 TFlops peak
- 1024 processors (32 32-way P690 nodes)
- 5.2 GFlops/processor
- Observed 4.1%-4.5% of peak on NCAR codes
- Max sustained on workload: GFlops
- Est. peak price performance: $2.6/MFlops
- Sustained price performance: $59/MFlops
- Sustained power performance: 0.7 GFlops/kW

Earth Simulator
- 40.96 TFlops peak
- 5120 processors (640 8-processor GS40 nodes)
- 8 GFlops/processor
- Estimated 30% of peak on NCAR codes
- Est. max sustained on workload: 12,200 GFlops
- Est. peak price performance: $8.5/MFlops
- Est. sustained price performance: $28/MFlops
- Est. sustained power performance: GFlops/kW
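
Putting the two slides side by side, the sketch below (my arithmetic on the quoted figures, not the slides' own derivation) reproduces the roughly 2x sustained price/performance advantage claimed in the conclusions.

```python
# My arithmetic on the figures quoted on these two slides, reproducing the
# "about 2x more cost effective" comparison drawn in the conclusions.

ibm_sustained = 59.0   # $/MFlops sustained, IBM P690 cluster
es_sustained = 28.0    # $/MFlops sustained, Earth Simulator (estimated)
print(ibm_sustained / es_sustained)   # ~2.1x in the vector system's favor

# The efficiency gap does the work: 30% of peak vs ~4.5% more than offsets
# the vector machine's higher peak price ($8.5 vs $2.6 per peak MFlops).
print(8.5 / 0.30)      # ~28 $/MFlops sustained, consistent with the ES slide
print(2.6 / 0.045)     # ~58 $/MFlops sustained, consistent with the IBM slide
```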

Power 4 die floor plan

Power 4 cache/CPU area comparison

Conclusions
- Infrastructure (power, cooling, space) is becoming a critical constraint.
- NCAR IBM clusters sustain 4.1%-4.5% of peak.
- The workload is memory-bandwidth limited.
- RISC cluster interconnects are not great.
- We're making steady progress learning how to program around these limitations.
- At this point, vector systems appear to be about 2x more cost effective in both price performance and power performance.

Pentium-4 die floor plan

Pentium-4 cache/CPU comparison

Itanium II die floor plan

Itanium II CPU/cache area comparison