The IBM Blue Gene/L System Architecture Presented by Sabri KANTAR.

What is Blue Gene/L? Blue Gene is an IBM Research project dedicated to exploring the frontiers in supercomputing. In November 2004, the IBM Blue Gene/L computer became the fastest supercomputer in the world. The system is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 teraFLOPS. Example uses: hydrodynamics, quantum chemistry, molecular dynamics, climate modeling, financial modeling.

A High-Level View of the BG/L Architecture Within a node: low-latency, high-bandwidth memory system; strong floating-point performance (4 floating-point operations/cycle). Across nodes: low-latency, high-bandwidth networks. Many nodes: low power/node, low cost/node, RAS (reliability, availability, and serviceability). Familiar SW API: C, C++, Fortran, MPI, POSIX subset, …

Main Design Principles for Blue Gene/L Some science and engineering applications scale up to and beyond 10,000 parallel processes. Improve computing capability while holding total system cost: reduce cost/FLOP, reduce complexity and size. ~25 kW/rack is the maximum for air cooling in a standard machine room, so the performance/power ratio must improve; the 700 MHz PowerPC 440 used in the ASIC has an excellent FLOP/Watt ratio. Maximize integration: on chip, an ASIC with everything except main memory; off chip, maximize the number of nodes in a rack. Large systems require excellent reliability, availability, and serviceability (RAS).

Main Design Principles (cont’d) Make cost/performance trade-offs considering the end use: Applications <> Architecture <> Packaging. Examples: 1 or 2 differential signals per torus link, i.e., 1.4 or 2.8 Gb/s; a maximum of 3 or 4 neighbors on the collective network, i.e., the depth of the network and thus the global latency. Maximize overall system efficiency: a small team designed all of Blue Gene/L. Example: the ASIC die and chip pin-out were chosen to ease circuit card routing.

Reducing Cost and Complexity Cables are bigger, costlier, and less reliable than traces, so the number of cables should be minimized. A 3-dimensional torus is therefore chosen as the main BG/L network, with each node connected to 6 neighbors, and the number of nodes connected via circuit cards only is maximized. A BG/L midplane has 8*8*8 = 512 nodes. (Number of cable connections) / (all connections) = (6 faces * 8 * 8 nodes) / (6 neighbors * 8 * 8 * 8 nodes) = 1/8.
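The 1/8 figure can be reproduced with a quick back-of-the-envelope check. The following C sketch (not part of the original slides) counts the torus links of an 8x8x8 midplane and the subset that must leave the midplane over cables:

```c
#include <stdio.h>

int main(void) {
    /* One 8x8x8 midplane: 512 nodes, each with 6 torus links. */
    int nodes_per_midplane = 8 * 8 * 8;                          /* 512  */
    int all_connections    = 6 * nodes_per_midplane;             /* 3072 */

    /* Only links that cross the midplane boundary need cables:
       one link per node on each of the 6 faces of the 8x8x8 cube. */
    int cable_connections  = 6 * 8 * 8;                          /* 384  */

    printf("cable fraction = %d/%d = 1/%d\n",
           cable_connections, all_connections,
           all_connections / cable_connections);                 /* 1/8  */
    return 0;
}
```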

Blue Gene/L Architecture Up to 32*32*64 = 65,536 nodes (3D torus). Up to 360 teraFLOPS of computation power. Each processor can perform 4 floating-point operations per cycle (in the form of two 64-bit floating-point multiply-adds per cycle). Five networks connect the nodes to one another and to the outside world.
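As a rough consistency check (my own arithmetic, not from the slides), the peak figure follows from the node count, the two cores per node, the 4 flops/cycle, and the 700 MHz clock quoted elsewhere in this presentation; the product lands near the ~360 teraFLOPS quoted above:

```c
#include <stdio.h>

int main(void) {
    double nodes       = 65536;    /* 32 * 32 * 64 torus             */
    double cores       = 2;        /* two PowerPC 440 cores per node */
    double flops_cycle = 4;        /* two fused multiply-adds/cycle  */
    double clock_hz    = 700e6;    /* 700 MHz                        */

    double peak = nodes * cores * flops_cycle * clock_hz;
    printf("theoretical peak ~ %.0f teraFLOPS\n", peak / 1e12);   /* ~367 */
    return 0;
}
```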

Node Architecture IBM PowerPC embedded CMOS processors, embedded DRAM, and system-on-a-chip techniques are used on a compact die, allowing for a very high density of processing. The ASIC uses IBM CMOS CU-11 technology. The 700 MHz processor speed is close to the memory speed. There are two processors per node; the second processor is intended primarily for handling message-passing operations.

The BG/L node ASIC includes: two standard PowerPC 440 processing cores, each with a PowerPC 440 FP2 core (an enhanced “double” 64-bit floating-point unit); the two cores are not L1 cache coherent; each core has a small 2 KB L2 cache; a 4 MB L3 cache made from embedded DRAM; an integrated external DDR memory controller; a Gigabit Ethernet adapter; and a JTAG interface.

BlueGene/L node diagram.

Link ASIC In addition to the compute ASIC, there is a “link” ASIC. When crossing a midplane boundary, BG/L’s torus, global combining tree, and global interrupt signals pass through the BG/L link ASIC, which redrives the signals over the cables between BG/L midplanes. The link ASIC can also redirect signals between its different ports, which enables BG/L to be partitioned into multiple, logically separate systems with no traffic interference between them.

The PowerPC 440 FP2 core It consists of a primary side and a secondary side. Each side has its own 32-element, 64-bit register file, a double-precision computational datapath, and a double-precision storage access datapath. The primary side is capable of executing standard PowerPC floating-point instructions. An enhanced set of instructions includes those executed solely on the secondary side and those executed simultaneously on both sides; the enhanced set includes SIMD operations.

The FP2 core (cont’d) This enhanced set goes beyond the capabilities of traditional SIMD architectures: a single instruction can initiate a different but related operation on different data, i.e., Single Instruction Multiple Operation Multiple Data (SIMOMD). Either side can access data from the other side’s register file, which saves a lot of operand swapping when working purely on complex arithmetic.
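To see why complex arithmetic benefits, consider a complex multiply-accumulate. The plain C sketch below (illustrative only, not the actual FP2 instruction sequence or intrinsics) shows how the real and imaginary updates form two related but different multiply-add chains, one per side, each needing operands from both sides:

```c
#include <stdio.h>

/* Complex multiply-accumulate: acc += a * b.
 * The real-part and imaginary-part updates are two related but
 * different fused multiply-add chains; on FP2 one could map them to
 * the primary and secondary sides, each reading from both register
 * files.  (Plain C illustration only.) */
static void cmac(double *acc_re, double *acc_im,
                 double a_re, double a_im,
                 double b_re, double b_im) {
    *acc_re += a_re * b_re - a_im * b_im;   /* "primary side" result   */
    *acc_im += a_re * b_im + a_im * b_re;   /* "secondary side" result */
}

int main(void) {
    double re = 0.0, im = 0.0;
    cmac(&re, &im, 1.0, 2.0, 3.0, 4.0);     /* (1+2i)*(3+4i) = -5+10i  */
    printf("%.1f %+.1fi\n", re, im);
    return 0;
}
```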

Memory System It is designed for high-bandwidth, low-latency memory and cache accesses. An L2 hit returns in 6 to 10 processor cycles, an L3 hit in about 25 cycles, and an L3 miss in about 75 cycles. The system has a 16-byte interface to nine 256 Mb SDRAM-DDR devices, operating at one half or one third of the processor speed.
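A rough reading of those numbers (my own back-of-the-envelope arithmetic, not figures from the slides): if one of the nine devices is assumed to hold ECC bits, the eight 256 Mb data devices give 256 MB per node, and a 16-byte interface clocked at half of 700 MHz moves on the order of 5-6 GB/s:

```c
#include <stdio.h>

int main(void) {
    /* Capacity: nine 256 Mb devices; assume (not stated on the slide)
       that one device holds ECC bits, leaving eight for data. */
    double mbits_per_device = 256.0;
    double data_devices     = 8.0;
    printf("usable memory ~ %.0f MB per node\n",
           data_devices * mbits_per_device / 8.0);            /* 256 MB    */

    /* Bandwidth: a 16-byte interface at half of the 700 MHz clock. */
    double bytes_per_beat = 16.0;
    double bus_hz         = 700e6 / 2.0;
    printf("raw bandwidth ~ %.1f GB/s\n",
           bytes_per_beat * bus_hz / 1e9);                    /* ~5.6 GB/s */
    return 0;
}
```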

3D Torus Network It is used for general-purpose, point-to-point message passing and for multicast operations to a selected “class” of nodes. The topology is a three-dimensional torus constructed with point-to-point serial links between routers embedded within the BlueGene/L ASICs, so each ASIC has six nearest-neighbor connections. Routing is virtual cut-through with multi-packet buffering on collision, and is minimal, adaptive, and deadlock free.

Torus Network (cont’d) Class routing capability (deadlock-free hardware multicast): packets can be deposited along the route to a specified destination, which allows for efficient one-to-many communication in some instances. Active messages allow for fast transposes as required in FFTs. Independent on-chip network interfaces enable concurrent access.
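For intuition about distances on the torus, the short C sketch below (my illustration, not BG/L’s adaptive router) computes the minimal hop count between two nodes of a 32x32x64 torus, taking the shorter way around each ring:

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal hops along one torus dimension: traffic can go either way
 * around the ring, so take the shorter direction. */
static int ring_hops(int a, int b, int size) {
    int d = abs(a - b);
    return d < size - d ? d : size - d;
}

/* Minimal hop count between two nodes of a 3D torus (shortest path
 * only; illustrative, not BG/L's adaptive routing algorithm). */
static int torus_hops(const int a[3], const int b[3], const int dims[3]) {
    int hops = 0;
    for (int i = 0; i < 3; i++)
        hops += ring_hops(a[i], b[i], dims[i]);
    return hops;
}

int main(void) {
    int dims[3] = {32, 32, 64};           /* full 65,536-node machine */
    int src[3]  = {0, 0, 0};
    int dst[3]  = {31, 16, 63};
    printf("minimal hops: %d\n", torus_hops(src, dst, dims));  /* 1+16+1 = 18 */
    return 0;
}
```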

Other Networks A global combining/broadcast tree for collective operations; a Gigabit Ethernet network for connection to other systems, such as hosts and file systems; a global barrier and interrupt network; and another Gigabit Ethernet-to-JTAG network for machine control.

Collective Network It has a tree structure, with one-to-all broadcast functionality and reduction-operations functionality. 2.8 Gb/s of bandwidth per link; the latency of a tree traversal is 2.5 µs. ~23 TB/s total binary tree bandwidth (64k machine). It interconnects all compute and I/O nodes (1024).
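The ~23 TB/s aggregate figure is consistent with roughly one 2.8 Gb/s uplink per node across the 64Ki-node machine; the sketch below is my own sanity check under that assumption:

```c
#include <stdio.h>

int main(void) {
    /* Assume roughly one tree uplink per node (64Ki-node machine). */
    double links          = 65536.0;
    double gbits_per_link = 2.8;      /* per the slide */

    double tbytes = links * gbits_per_link / 8.0 / 1000.0;
    printf("aggregate tree bandwidth ~ %.0f TB/s\n", tbytes);   /* ~23 */
    return 0;
}
```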

Gb Ethernet Disk/Host I/O Network I/O nodes are leaves on the collective network. Compute and I/O nodes use the same ASIC, but: an I/O node has Ethernet, not torus, which provides I/O separation from the application; a compute node has torus, not Ethernet, so no cables are needed. The ratio of I/O to compute nodes is configurable as 1:8, 16, 32, 64, or 128. Applications run on compute nodes, not I/O nodes.
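The configurable ratio directly sets how many I/O nodes a full machine carries; the small sketch below (my arithmetic, not from the slides) shows, for example, that a 1:64 ratio over 65,536 compute nodes gives the 1,024 I/O nodes mentioned on the collective-network slide:

```c
#include <stdio.h>

int main(void) {
    int compute_nodes = 65536;
    int ratios[] = {8, 16, 32, 64, 128};

    /* I/O nodes required for each configurable compute:I/O ratio. */
    for (int i = 0; i < 5; i++)
        printf("1:%-3d -> %5d I/O nodes\n",
               ratios[i], compute_nodes / ratios[i]);
    return 0;
}
```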

Fast Barrier/Interrupt Network Four independent barrier or interrupt channels, each independently configurable as “or” or “and”. Asynchronous propagation: it can halt operation quickly (the current estimate is 1.3 µs worst-case round trip), and 3/4 of this delay is time-of-flight. Sticky-bit operation allows global barriers with a single channel. It is user-space accessible and system selectable. It is partitioned along the same boundaries as the tree and torus networks, and each user partition contains its own set of barrier/interrupt signals.

Control Network A JTAG interface to 100 Mb Ethernet provides direct access to all nodes: boot and system debug availability, runtime noninvasive RAS support, non-invasive access to performance counters, and direct access to the shared SRAM in every node. Control, configuration, and monitoring: all active devices are accessible through JTAG, I2C, or another “simple” bus (only the clock buffers and DRAM are not accessible).

Packaging 2 nodes per compute card. 16 compute cards per node board. 16 node boards per 512-node midplane. Two midplanes in a 1024-node rack. For compiling, diagnostics, and analysis, a host computer is required. An I/O node handles communication between a compute node and other systems, including the host and file servers.
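Multiplying the packaging counts out (my own check; the 64-rack figure for the full 65,536-node system is an inference, not stated on this slide):

```c
#include <stdio.h>

int main(void) {
    int nodes_per_card      = 2;
    int cards_per_board     = 16;
    int boards_per_midplane = 16;
    int midplanes_per_rack  = 2;

    int nodes_per_midplane = nodes_per_card * cards_per_board * boards_per_midplane;
    int nodes_per_rack     = nodes_per_midplane * midplanes_per_rack;

    printf("nodes/midplane = %d\n", nodes_per_midplane);            /* 512  */
    printf("nodes/rack     = %d\n", nodes_per_rack);                /* 1024 */
    printf("racks for 65,536 nodes = %d\n", 65536 / nodes_per_rack);/* 64   */
    return 0;
}
```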

BlueGene/L packaging.

Science Application Study of protein folding and dynamics. The aim is to obtain a microscopic view of the thermodynamics and kinetics of the folding process. Simulating longer and longer time scales is the key challenge. The focus is on improving the speed of execution for a fixed-size system by utilizing additional CPUs. Understanding the logical limits to concurrency within the application is very important.

Conclusion The Blue Gene/L supercomputer is designed to improve cost/performance for a relatively broad class of applications with good scaling behavior. This is achieved by using parallelism and system-on-chip technology: the functionality of a node is contained within a single ASIC. BG/L has significantly lower cost in terms of power, space, and service, while doing no worse than its competitors.

The End. Questions?