Oak Ridge Interconnects Workshop - November 1999 (ASCI-00-003): ASCI Terascale Simulation Requirements and Deployments, David A. Nowak, ASCI

Presentation transcript:

Slide 1: ASCI Terascale Simulation Requirements and Deployments
Oak Ridge Interconnects Workshop - November 1999
David A. Nowak, ASCI Program Leader
Mark Seager, ASCI Terascale Systems Principal Investigator
Lawrence Livermore National Laboratory, University of California
* Work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.

Slide 2: Overview
- ASCI program background
- Applications requirements
- Balanced terascale computing environment
- Red partnership and CPLANT
- Blue-Mountain partnership
  - Sustained Stewardship TeraFLOP/s (SST)
- Blue-Pacific partnership
  - Sustained Stewardship TeraFLOP/s (SST)
- White partnership
- Interconnect issues for future machines

Slide 3: A successful Stockpile Stewardship Program requires a successful ASCI
[Diagram: Directed Stockpile Work across the B61, W62, W76, W78, W80, B83, W84, W87, and W88 weapon systems, supported by Primary Certification, Materials Dynamics, Advanced Radiography, Secondary Certification, Enhanced Surety, ICF, Enhanced Surveillance, Hostile Environments, and Weapons System Engineering.]

Slide 4: ASCI's major technical elements meet Stockpile Stewardship requirements
[Diagram: major elements Defense Applications & Modeling, Integration, Simulation & Computer Science, Integrated Computing Systems, and University Partnerships, with supporting elements Applications Development, Physical & Materials Modeling, V&V, DisCom2, PSE, Physical Infrastructure & Platforms, Operation of Facilities, Alliances, Institutes, VIEWS, and PathForward.]

Slide 5: Example terascale computing environment in CY00 with ASCI White at LLNL
[Chart: projected growth from 1996 through 2000 in computing speed (TFLOPS), memory and memory bandwidth, interconnect bandwidth, cache bandwidth, parallel I/O, disk, archive bandwidth, and archival storage.]
ASCI is achieving programmatic objectives, but the computing environment will not be in balance at LLNL for the ASCI White platform.
Computational Resource Scaling for ASCI Physics Applications (per FLOPS of peak compute; a worked example follows below):
  Cache BW          16 Bytes/s per FLOPS
  Memory            1 Byte per FLOPS
  Memory BW         3 Bytes/s per FLOPS
  Interconnect BW   0.5 Bytes/s per FLOPS
  Local disk        50 Bytes per FLOPS
  Parallel I/O BW   0.002 Bytes/s per FLOPS
  Archive BW        (value missing) Bytes/s per FLOPS
  Archive           1000 Bytes per FLOPS
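As a rough illustration of how these ratios translate into machine requirements, the sketch below applies them to a hypothetical 10 TFLOPS peak (roughly White class). The peak value, and the Python rendering itself, are our assumptions, not part of the slide.

```python
# Illustrative only: apply the slide's per-FLOPS scaling ratios to a
# hypothetical peak of 10 TFLOPS. The peak is assumed, not taken from the slide.
PEAK_FLOPS = 10e12

ratios = {                      # resource: units per FLOPS of peak compute
    "Cache BW (Bytes/s)":        16,
    "Memory (Bytes)":            1,
    "Memory BW (Bytes/s)":       3,
    "Interconnect BW (Bytes/s)": 0.5,
    "Local disk (Bytes)":        50,
    "Parallel I/O BW (Bytes/s)": 0.002,
    "Archive (Bytes)":           1000,
}

for resource, per_flops in ratios.items():
    total = per_flops * PEAK_FLOPS
    print(f"{resource:28s} {total / 1e12:12.3f} Tera-units")
```

At that assumed peak, the ratios call for roughly 10 TB of memory, 30 TB/s of memory bandwidth, 5 TB/s of interconnect bandwidth, 500 TB of local disk, 20 GB/s of parallel I/O, and about 10 PB of archive.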

Slide 6: SNL/Intel ASCI Red
Good interconnect bandwidth; poor memory/NIC bandwidth.
[System diagram:]
- Kestrel compute board (~2332 in system): two CPU + NIC pairs with 256 MB each; runs the Cougar operating system.
- Eagle I/O board (~100 in system): CPU + NIC with 128 MB and a second CPU with 128 MB; runs the Terascale Operating System (TOS); cache coherent on node, not cache coherent across the board.
- Diagram labels: 38 switches, 32 switches; 6.25 TB storage (classified, unclassified), 2 GB/s each sustained (total system).
- 800 MB/s bi-directional per link; shortest hop ~13 µs, longest hop ~16 µs (a rough transfer-time sketch follows below).
- Aggregate link bandwidth = (value missing) TB/s.
- The diagram marks the memory/NIC path as the bottleneck and the interconnect links as good.
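A minimal latency-plus-bandwidth (alpha-beta) estimate gives a feel for where small messages are latency-bound on a fabric like this. It uses the slide's ~13-16 µs hop latency and treats the 800 MB/s bi-directional link as ~400 MB/s per direction; that split, and the model itself, are our assumptions, not measured ASCI Red data.

```python
# Alpha-beta estimate: time = latency + size / bandwidth.
LATENCY_S = 15e-6      # middle of the 13-16 us hop range quoted on the slide
LINK_BW = 400e6        # assumed bytes/s per direction (half of 800 MB/s bi-directional)

def transfer_time(msg_bytes):
    """Estimated one-way transfer time for a message of msg_bytes."""
    return LATENCY_S + msg_bytes / LINK_BW

for size in (8, 1024, 64 * 1024, 1024 * 1024):
    t = transfer_time(size)
    print(f"{size:>8d} B  ->  {t * 1e6:9.1f} us  ({size / t / 1e6:7.1f} MB/s effective)")
```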

Slide 7: SNL/Compaq ASCI C-Plant
- C-Plant is located at Sandia National Laboratories.
- Currently a hypercube; C-Plant will be reconfigured as a mesh in 2000.
- C-Plant has 50 scalable units, runs the Linux operating system, and is a "Beowulf" configuration.
- Scalable unit: 16 "boxes" and 2 16-port Myricom switches; 160 MB/s each direction, 320 MB/s total.
- "Box": 500 MHz Compaq eV56 processor, 192 MB SDRAM, 55 MB NIC, serial port, Ethernet port, ______ disk space.
- This is a research project, a long way from being a production system.

Slide 8: LANL/SGI/Cray ASCI Blue Mountain, 3 TeraOPS peak
- 48 x 128-CPU SMPs = 3 TF; 48 x 32 GB/SMP = 1.5 TB memory; 48 x 1.44 TB/SMP = 70 TB disk.
- Inter-SMP compute fabric bandwidth = 48 x 12 HIPPI/SMP = 115,200 MB/s bi-directional (checked in the sketch below).
- LAN bandwidth = 48 x 1 HIPPI/SMP = 9,600 MB/s bi-directional; HPSS metadata server on the LAN.
- Link to legacy MXs: 8 HIPPI = 1,600 MB/s bi-directional; link to LAN.
- HPSS bandwidth = 4 HIPPI = 800 MB/s bi-directional; 20(14) HPSS movers = 40(27) HPSS tape drives of (rate missing) MB/s each.
- RAID bandwidth = 48 x 10 FC/SMP = 96,000 MB/s bi-directional.
- Aggregate link bandwidth = (value missing) TB/s.
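A quick back-of-the-envelope check of these fabric totals, assuming 200 MB/s bi-directional per HIPPI link (the per-link figure quoted on the next slide) and the same per-link rate for the RAID Fibre Channel connections; both per-link rates are assumptions on our part.

```python
# Reproduce the Blue Mountain fabric totals from the per-SMP link counts.
SMPS = 48
MB_PER_HIPPI = 200   # assumed bi-directional MB/s per HIPPI link
MB_PER_FC = 200      # assumed bi-directional MB/s per Fibre Channel link

compute_fabric = SMPS * 12 * MB_PER_HIPPI   # 48 x 12 HIPPI/SMP
lan_fabric     = SMPS * 1  * MB_PER_HIPPI   # 48 x 1 HIPPI/SMP
raid_fabric    = SMPS * 10 * MB_PER_FC      # 48 x 10 FC/SMP

print(f"Inter-SMP compute fabric: {compute_fabric:,} MB/s")  # 115,200 MB/s
print(f"LAN fabric:               {lan_fabric:,} MB/s")      # 9,600 MB/s
print(f"RAID fabric:              {raid_fabric:,} MB/s")     # 96,000 MB/s
```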

Slide 9: Blue Mountain planned GSN compute fabric
- 9 separate 32x32 crossbar switch networks; 3 groups of 16 computers each.
- Expected improvements:
  - Throughput: 115,200 MB/s => 460,800 MB/s (4x)
  - Link bandwidth: 200 MB/s => 1,600 MB/s (8x)
  - Round-trip latency: 110 µs => ~10 µs (11x)
- Aggregate link bandwidth = (value missing) TB/s.
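The improvement factors follow directly from the before/after numbers on the slide; the small check below only makes the arithmetic explicit.

```python
# Verify the 4x / 8x / 11x improvement factors quoted on the slide.
old = {"throughput_MBps": 115_200, "link_MBps": 200, "rtt_us": 110}
new = {"throughput_MBps": 460_800, "link_MBps": 1_600, "rtt_us": 10}

print(f"Throughput gain:   {new['throughput_MBps'] / old['throughput_MBps']:.0f}x")  # 4x
print(f"Link bandwidth:    {new['link_MBps'] / old['link_MBps']:.0f}x")              # 8x
print(f"Latency reduction: {old['rtt_us'] / new['rtt_us']:.0f}x")                    # 11x
```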

Slide 10: LLNL/IBM Blue-Pacific, 3.89 TFLOP/s peak
- Three SP sectors (S, Y, K), each comprised of 488 Silver nodes, connected by 24 HPGN links, with HiPPI and FDDI connections outward.
- System parameters: 3.89 TFLOP/s peak, 2.6 TB memory, 62.5 TB global disk.
- Per-sector configurations: one sector with 2.5 GB/node memory, 24.5 TB global disk, and 8.3 TB local disk; two sectors with 1.5 GB/node memory, 20.5 TB global disk, and 4.4 TB local disk each.
- SST achieved >1.2 TFLOP/s on sPPM, on a problem >70x larger than ever solved before!
- Aggregate link bandwidth = (value missing) TB/s.
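The system totals are consistent with the per-sector figures; a small cross-check of node count and memory, with the sector configurations taken as listed on the slide:

```python
# Cross-check Blue-Pacific totals from the per-sector figures.
NODES_PER_SECTOR = 488
sector_mem_gb_per_node = [2.5, 1.5, 1.5]   # one 2.5 GB/node sector, two 1.5 GB/node sectors

total_nodes = 3 * NODES_PER_SECTOR
total_mem_tb = sum(NODES_PER_SECTOR * g for g in sector_mem_gb_per_node) / 1024

print(f"Total Silver nodes: {total_nodes}")           # 1,464, as on slide 11
print(f"Total memory:       {total_mem_tb:.1f} TB")   # ~2.6 TB, matching the slide
```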

Slide 11: I/O hardware architecture of the SST
System data and control networks; each 488-node IBM SP sector has 56 GPFS servers and 432 Silver compute nodes, with 24 SP links to the second-level switch.
- Each SST sector has:
  - Local and global I/O file systems (GPFS)
  - 2.2 GB/s delivered global I/O performance
  - 3.66 GB/s delivered local I/O performance
  - Separate SP first-level switches
  - Independent command and control
  - Link bandwidth = 300 Mb/s bi-directional
- Full system mode:
  - Application launch over the full 1,464 Silver nodes
  - 1,048 MPI/US tasks, 2,048 MPI/IP tasks
  - High-speed, low-latency communication between all nodes
  - Single STDIO interface

Slide 12: Partisn (SN-method) scaling
[Two plots of performance versus number of processors at a constant number of cells per processor (weak scaling), comparing RED, Blue Pacific (1 process/node and 4 processes/node), and Blue Mountain (UPS, MPI, MPI.2s2r, and UPS.offbox variants); a minimal efficiency sketch follows.]
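These are weak-scaling curves: the per-processor problem size is held constant as processors are added. As a reminder of how such efficiency numbers are derived from runs of this kind, here is a minimal sketch; the timings are made up for illustration and are not Partisn measurements.

```python
# Weak-scaling efficiency: wall time at P processors versus 1 processor,
# with the per-processor problem size held fixed. Hypothetical timings.
timings = {1: 100.0, 64: 108.0, 512: 121.0, 4096: 145.0}  # seconds, made up

t1 = timings[1]
for procs, t in sorted(timings.items()):
    efficiency = t1 / t   # ideal weak scaling keeps wall time flat (efficiency = 1)
    print(f"P = {procs:5d}  time = {t:7.1f} s  weak-scaling efficiency = {efficiency:5.1%}")
```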

Slide 13: The JEEP calculation adds to our understanding of the performance of insensitive high explosives
- This calculation involved 600 atoms (the largest number ever at such high resolution) with 1,920 electrons, using about 3,840 processors.
- The simulation provides crucial insight into the detonation properties of IHE at high pressures and temperatures.
- Relevant experimental data (e.g., shock-wave data) on hydrogen fluoride (HF) are almost nonexistent because of its corrosive nature. Quantum-level simulations of HF-H2O mixtures, like this one, can substitute for such experiments.

Slide 14: Silver node memory bandwidth
- Silver node delivered memory bandwidth is around 200 MB/s per process.
- Silver peak Bytes:FLOP/s ratio is 1.3/2.565 = 0.51.
- Silver delivered B:F ratio is 200/114 = 1.75.
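The two ratios are simple quotients of the numbers above; the sketch below just makes the arithmetic explicit, with units as used on the slide (peak figures per node, delivered figures per process).

```python
# Bytes-per-FLOP ratios for a Silver node, from the slide's figures.
peak_bf      = 1.3 / 2.565   # peak memory BW / peak FLOP rate  ~= 0.51
delivered_bf = 200 / 114     # delivered MB/s / delivered MFLOP/s per process ~= 1.75

print(f"Peak B:F      = {peak_bf:.2f}")
print(f"Delivered B:F = {delivered_bf:.2f}")
# The delivered ratio is higher than the peak ratio: delivered FLOP rates fall
# off relative to peak faster than delivered memory bandwidth does.
```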

Slide 15: MPI_SEND/US delivers low latency and high aggregate bandwidth, but counterintuitive behavior per MPI task
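Per-task latency and bandwidth figures of this kind are typically measured with a ping-pong microbenchmark between two ranks. Below is a minimal sketch using mpi4py; the library choice is ours, not something the deck mentions, and this is a generic measurement pattern rather than the benchmark behind the slide.

```python
# Minimal MPI ping-pong: run with 2 ranks, e.g. "mpirun -np 2 python pingpong.py".
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
REPS = 100

for size in (8, 1024, 64 * 1024, 1024 * 1024):
    buf = np.zeros(size, dtype="b")           # 'size' bytes
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(REPS):
        if rank == 0:
            comm.Send([buf, MPI.BYTE], dest=1, tag=0)
            comm.Recv([buf, MPI.BYTE], source=1, tag=0)
        elif rank == 1:
            comm.Recv([buf, MPI.BYTE], source=0, tag=0)
            comm.Send([buf, MPI.BYTE], dest=0, tag=0)
    elapsed = MPI.Wtime() - t0
    if rank == 0:
        one_way = elapsed / (2 * REPS)        # half of the average round trip
        print(f"{size:>8d} B  latency {one_way * 1e6:8.1f} us  "
              f"bandwidth {size / one_way / 1e6:8.1f} MB/s")
```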

Slide 16: LLNL/IBM White, 10.2 TeraOPS peak
- Aggregate link bandwidth = (value missing) TB/s: five times better than the SST, while peak is only three times better, so the Bytes:FLOPS ratio is improving.
- MuSST (PERF) system: 8 PDEBUG nodes with 16 GB SDRAM, ~484 PBATCH nodes with 8 GB SDRAM; 12.8 GB/s delivered global I/O performance, 5.12 GB/s delivered local I/O performance; 16 Gigabit Ethernet external network; up to 8 HIPPI-800.
- Programming/usage model: application launch over ~492 NH-2 nodes; 16-way MuSPPA, shared memory, 32-bit MPI; 4,096 MPI/US tasks; likely usage is 4 MPI tasks/node with 4 threads per MPI task; single STDIO interface.

Slide 17: Interconnect issues for future machines: why optical?
- Need to increase the Bytes:FLOPS ratio.
  - Memory bandwidth (cache line) utilization will be dramatically lower for codes that use arbitrarily connected meshes and adaptive refinement (indirect addressing).
  - Interconnect bandwidth must be increased and latency must be reduced to allow a broader range of applications and packages to scale well.
- To get very large configurations (30, 70, 100 TeraOPS), larger SMPs will be deployed.
  - At a fixed B:F interconnect ratio this means more bandwidth coming out of each SMP (see the sketch below); multiple pipes/planes will be used, and optical reduces cable count.
- Machine footprint is growing; 24,000 square feet may require optical.
- Network interface paradigm: virtual memory direct memory access, low-latency remote get/put.
- Reliability, Availability, and Serviceability (RAS).
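As a rough illustration of the fixed-B:F point, the sketch below applies the 0.5 Bytes/s per FLOPS interconnect target from slide 5 to 30, 70, and 100 TeraOPS machines. The SMP counts are hypothetical, chosen only to show how the bandwidth required out of each SMP grows when machines are built from fewer, larger SMPs.

```python
# Interconnect bandwidth implied by a fixed 0.5 B/s per FLOPS ratio.
INTERCONNECT_B_PER_FLOP = 0.5   # from the slide 5 scaling table

for peak_tflops in (30, 70, 100):
    total_bw_tbps = peak_tflops * INTERCONNECT_B_PER_FLOP   # TB/s = TF * (B/s per FLOPS)
    for smp_count in (512, 128, 32):                        # hypothetical machine layouts
        per_smp_gbps = total_bw_tbps * 1e3 / smp_count      # GB/s out of each SMP
        print(f"{peak_tflops:3d} TF machine, {smp_count:3d} SMPs: "
              f"{total_bw_tbps:5.1f} TB/s total, {per_smp_gbps:7.1f} GB/s per SMP")
```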

Slide 18: ASCI Terascale Simulation Requirements and Deployments
David A. Nowak, ASCI Program Leader
Mark Seager, ASCI Terascale Systems Principal Investigator
Lawrence Livermore National Laboratory, University of California
* Work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.