BlueGene/L Supercomputer: A System Overview
Pietro Cicotti
October 10, 2005
University of California, San Diego

Outline
- Introduction
- Hardware Overview
  - Packaging
  - Nodes
  - Networks
- Software Overview
  - Using the Double Floating Point Unit
  - Computation Modes
  - Examples
- Conclusions

Introduction (1)
- Goals
  - Price/performance
  - Power/performance
  - Performance
- Massively parallel system
  - Largest system scheduled at LLNL
  - 2^16 compute nodes
  - 64x32x32 three-dimensional torus
  - 360 TFlops

Introduction (2)
- How to meet these goals?
  - Modest clock rate
  - Single ASIC
  - Dense packaging
- Results
  - ~1 MW, ~300 tons, <2500 sq ft
- For comparison: Earth Simulator (40 TFlops)
  - 2 x 37,125 sq ft

Packaging (1)
- Chips to racks (packaging hierarchy diagram)

Packaging (2)
- Air-cooled system
  - Standard raised floor
  - 220 V rack feed and failover air
  - Adaptive fan speed
- MTBF
  - Dominated by memory failures
  - 6.16 days

Nodes (1)
- ASIC
  - Dual PowerPC 440 FP2 cores at 700 MHz
  - 32 KB L1 caches (not coherent between the two cores)
  - 2 KB prefetch buffer (L2)
  - 4 MB shared embedded-DRAM L3
  - Integrated network controllers
- Memory
  - 1 GB DDR memory per node

Nodes (2)
- (node ASIC block diagram)

Nodes (3)
- PowerPC 440 FP2
  - Two floating-point units
  - SIMOMD instructions
  - Quadword datapath
  - Superscalar architecture
  - ALU + load/store pipelines

Networks (1)
- Torus
  - 3D point-to-point links
  - Routers embedded in the node ASIC
  - 6 links per node
  - 175 MB/s bandwidth per link
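As an illustration of the torus topology (a sketch, not BG/L system software): every node in the 64x32x32 three-dimensional torus has exactly six neighbors, one in each direction of each dimension, with wraparound at the edges.

    #include <stdio.h>

    #define NX 64
    #define NY 32
    #define NZ 32

    /* Print the six torus neighbors of node (x, y, z). */
    static void print_neighbors(int x, int y, int z)
    {
        printf("x: (%d,%d,%d) (%d,%d,%d)\n",
               (x + 1) % NX, y, z, (x + NX - 1) % NX, y, z);
        printf("y: (%d,%d,%d) (%d,%d,%d)\n",
               x, (y + 1) % NY, z, x, (y + NY - 1) % NY, z);
        printf("z: (%d,%d,%d) (%d,%d,%d)\n",
               x, y, (z + 1) % NZ, x, y, (z + NZ - 1) % NZ);
    }

    int main(void)
    {
        print_neighbors(63, 0, 31);   /* a corner node still has six links */
        return 0;
    }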

Networks (2)
- Tree (collective network)
  - 350 MB/s bandwidth, 1.5 msec latency
  - Integer ALU per router module (combines data in the network)
- Interrupt/barrier network
  - Barriers
  - AND/OR operations
- JTAG
  - Control network
- Gigabit Ethernet for I/O
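A hedged sketch of the kind of operation the tree network accelerates: a global integer reduction, which the per-router integer ALU can combine as the data moves up the tree. Standard MPI is shown here; how the call maps onto the tree is up to the BG/L MPI library, not this code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local = rank;   /* each task contributes one integer */
        MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over all tasks = %d\n", global);

        MPI_Finalize();
        return 0;
    }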

Software Support
- Compute nodes
  - Run the Compute Node Kernel
  - Asymmetric view of the two cores
  - Run only one job at a time
  - Different operation modes
- I/O nodes
  - Run Linux
  - Manage compute nodes
  - Perform file I/O

Using the Double FP Unit
- Compiler optimization
  - Compiler generates SIMOMD operations
  - Requires consecutive, 16-byte-aligned data
  - C/C++: explicit aliasing disambiguation needed
- Use of primitive (built-in) functions
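A minimal sketch of the alignment and aliasing hints the slide refers to, assuming IBM XL C on BG/L: __alignx() and #pragma disjoint are XL extensions, and the -qarch=440d option for double-FPU code generation is named from memory; treat the exact names as assumptions to verify against the installed compiler.

    /* Compile with the double-FPU target (e.g. -qarch=440d, assumed name). */
    void scale_add(int n, double a, double *x, double *y)
    {
    #pragma disjoint(*x, *y)       /* tell the compiler x and y do not alias */
        int i;

        __alignx(16, x);           /* assert 16-byte (quadword) alignment */
        __alignx(16, y);
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* pairs of iterations can map to SIMOMD ops */
    }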

Operation Modes
- Default
  - One core for computation, one for communication
- Virtual node mode
  - Resources split between the two cores
- Coprocessor mode
  - L1 caches are not hardware-coherent
  - Software-based coherence required

Coprocessor Mode
- Fork-join model of execution
  - No communication from the coprocessor
  - Single-shot work
  - Permanent work
- Avoiding false sharing
  - 32-byte alignment
  - Data partitioning
  - Use of shadow variables
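A minimal sketch of the false-sharing precautions listed above, not the BG/L coprocessor API: each core writes into its own 32-byte slot (the 440's L1 line size), so the two partial results never share a cache line, and the main core combines them only after the join.

    #define L1_LINE 32    /* PPC440 L1 cache line size in bytes */

    struct padded_sum {
        double value;
        char   pad[L1_LINE - sizeof(double)];  /* keep each slot on its own line,
                                                  assuming the array is line-aligned */
    };

    static struct padded_sum partial[2];       /* one "shadow" slot per core */

    double combine_partials(void)
    {
        /* The L1 caches are not hardware-coherent, so the real code would
           flush/invalidate the relevant lines before reading the other
           core's slot here. */
        return partial[0].value + partial[1].value;
    }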

Reciprocal Computation
- T = 8.4282 n processor clocks (single core)
- T_co = 5.4335 n processor clocks (coprocessor mode)
- Crossover point at n ≈ 430
- 78% parallel efficiency (the asymptotic speedup 8.4282/5.4335 ≈ 1.55 on two cores gives ≈78%)

Daxpy Routine
- BLAS routine: y(i) = a*x(i) + y(i)
- Theoretical limit of 8 flops in 3 cycles
- Obtained 6 flops in 3 cycles (75%)
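For reference, the routine being measured is the BLAS-1 daxpy below (2 flops per element). Note that the 8-flops-in-3-cycles ceiling on the slide is below the double FPU's 4-flops-per-cycle FMA peak (12 flops in 3 cycles), so the limit comes from moving x and y through the load/store path rather than from the arithmetic units.

    void daxpy(int n, double a, const double *x, double *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }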

Linpack
- Performance reported as percentage of peak
- Weak scaling (problem sized to 70% of memory)
- Coprocessor vs. virtual node mode
  - 512 nodes
  - 70% vs. 65% of peak

NAS BT
- Solves the Navier-Stokes equations
- 3D space decomposition
- 2D square process mesh
- Performance measured in Mflops/task
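As an illustration of the square process mesh (a sketch, not NAS source): BT distributes the 3-D domain over a square 2-D mesh of MPI tasks, so the task count must be a perfect square, and each rank maps to a (row, col) position.

    #include <math.h>
    #include <stdio.h>

    /* Map an MPI rank to its (row, col) position on a square process mesh.
       Returns -1 if nprocs is not a perfect square. */
    static int mesh_coords(int rank, int nprocs, int *row, int *col)
    {
        int side = (int)(sqrt((double)nprocs) + 0.5);
        if (side * side != nprocs)
            return -1;
        *row = rank / side;
        *col = rank % side;
        return 0;
    }

    int main(void)
    {
        int r, c;
        if (mesh_coords(5, 16, &r, &c) == 0)
            printf("rank 5 of 16 -> (%d,%d)\n", r, c);   /* prints (1,1) */
        return 0;
    }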

Conclusions
- System overview
- For peak performance
  - Use both FP units
  - Run in coprocessor mode (but handle the cache coherence problem in software)
  - Or run in virtual node mode (half the resources per task, better scalability; map processes to nodes carefully)
- Questions?

References
- An Overview of the BlueGene/L Supercomputer
- Enabling Dual-Core Mode in BlueGene/L: Challenges and Solutions
- Unlocking Performance of the BlueGene/L Supercomputer, unix.mcs.anl.gov/%7Egropp/projects/parallel/BGL/docs/unlockingperf.pdf
- BG System Overview (BG_overview.ppt)