Case study: IBM Blue Gene/L system and InfiniBand

1 Case study IBM Bluegene/L system InfiniBand

2 Interconnect Family share for 06/2011 top 500 supercomputers

Interconnect Family  Count  Share %  Rmax Sum (GF)  Rpeak Sum (GF)  Processor Sum
Myrinet                  4   0.80 %         384451          524412          55152
Quadrics                 1   0.20 %          52840           63795           9968
Gigabit Ethernet       232  46.40 %       11796979        22042181        2098562
Infiniband             206  41.20 %       22980393        32759581        2411516
Mixed                    1   0.20 %          66567           82944          13824
NUMAlink                 2   0.40 %         107961          121241          18944
SP Switch                1   0.20 %          75760           92781          12208
Proprietary             29   5.80 %        9841862        13901082        1886982
Fat Tree                 1   0.20 %         122400          131072           1280
Custom                  23   4.60 %       13500813        15460859        1271488
Totals                 500    100 %    58930025.59     85179949.00        7779924
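The share column is the system count divided by 500. The same ratio can be taken over Rmax to see how much of the aggregate performance each family carries. A minimal Python sketch using three rows copied from the table (the variable and function names are illustrative only):

```python
# Rows copied from the table above: family -> (count, Rmax sum in GF).
rows = {
    "Gigabit Ethernet": (232, 11796979),
    "Infiniband": (206, 22980393),
    "Custom": (23, 13500813),
}
TOTAL_COUNT = 500
TOTAL_RMAX_GF = 58930025.59


def shares(count, rmax_gf):
    """Return (share of systems, share of aggregate Rmax), both in percent."""
    return 100.0 * count / TOTAL_COUNT, 100.0 * rmax_gf / TOTAL_RMAX_GF


for family, (count, rmax) in rows.items():
    by_count, by_rmax = shares(count, rmax)
    print(f"{family:17s} {by_count:5.1f} % of systems, {by_rmax:5.1f} % of Rmax")
```

With these numbers, Gigabit Ethernet has the largest share of systems but only about a fifth of the total Rmax, while Infiniband and the custom interconnects carry a larger share of performance than their system counts alone would suggest.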

3 Overview of the IBM Blue Gene/L System Architecture Design objectives Hardware overview –System architecture –Node architecture –Interconnect architecture

4 Highlights A 64K-node highly integrated supercomputer based on system-on-a-chip technology –Two ASICs Blue Gene/L compute (BLC), Blue Gene/L Link (BLL) Distributed memory, massively parallel processing (MPP) architecture. Use the message passing programming model (MPI). 360 Tflops peak performance Optimized for cost/performance

5 Design objectives Objective 1: 360-Tflops supercomputer –Earth Simulator (Japan, fastest supercomputer from 2002 to 2004): 35.86 Tflops Objective 2: power efficiency –Performance/rack = performance/watt * watt/rack Watt/rack is a constant of around 20kW Performance/watt determines performance/rack

6 Power efficiency: –360Tflops => 20 megawatts with conventional processors –Need low-power processor design (2-10 times better power efficiency)
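The relation Performance/rack = performance/watt * watt/rack can be worked through with the numbers on the two slides above. A small Python sketch, using only the 360 Tflops target, the 20 MW estimate, and the 20 kW/rack budget (the function name is illustrative):

```python
PEAK_FLOPS = 360e12        # design target from the slides: 360 Tflops
CONVENTIONAL_WATTS = 20e6  # slides: about 20 MW with conventional processors
WATTS_PER_RACK = 20e3      # slides: about 20 kW per rack


def racks_needed(flops_per_watt):
    """Racks required to reach PEAK_FLOPS at a given power efficiency."""
    perf_per_rack = flops_per_watt * WATTS_PER_RACK
    return PEAK_FLOPS / perf_per_rack


# Efficiency implied by the conventional-processor estimate (flops per watt).
baseline = PEAK_FLOPS / CONVENTIONAL_WATTS

for factor in (1, 2, 10):
    print(f"{factor:2d}x efficiency -> {racks_needed(baseline * factor):.0f} racks")
```

Under these figures a conventional design would need on the order of 1000 racks, and the 2-10x efficiency improvement brings that down to roughly 500-100 racks.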

7 Design objectives (continued) Objective 3: extreme scalability –Optimized for cost/performance => use low-power, less powerful processors => need a lot of processors. Up to 65536 processors. –Interconnect scalability

8 Blue Gene/L system components

9 Blue Gene/L Compute ASIC 2 PowerPC 440 cores with floating-point enhancements –700MHz –Everything of a typical superscalar processor: pipelined microarchitecture with dual instruction fetch, decode, out-of-order issue, out-of-order dispatch, out-of-order execution, out-of-order completion, etc. –1 W each through extensive power management

10 Blue Gene/L Compute ASIC

11 Memory system on a BGL node BG/L supports only the distributed-memory paradigm. No need for efficient support for cache coherence on each node. –Coherence enforced by software if needed. The two cores operate in one of two modes: –Communication coprocessor mode: needs coherence, managed in system-level libraries –Virtual node mode: memory is physically partitioned (not shared).

12 Blue Gene/L networks Five networks. –100 Mbps Ethernet control network for diagnostics, debugging, and some other things. –1000 Mbps Ethernet for I/O –Three high-bandwidth, low-latency networks for data transmission and synchronization: 3-D torus network for point-to-point communication, collective network for global operations, barrier network. All network logic is integrated in the BG/L node ASIC –Memory-mapped interfaces from user space

13 3-D torus network Supports p2p communication. Link bandwidth 1.4Gb/s, 6 bidirectional links per node (1.2GB/s). 64x32x32 torus: diameter 32+16+16=64 hops, worst-case hardware latency 6.4us. Cut-through routing. Adaptive routing.
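The diameter figure follows from the wraparound links: in a ring of d nodes the farthest node is d/2 hops away, and the three dimensions add up. A minimal Python sketch of that arithmetic (the function names are illustrative; the per-hop latency assumes the worst-case latency is spread evenly over the hops, which the slide does not state):

```python
def torus_diameter(dims):
    """Maximum hop count between two nodes in a torus with wraparound links."""
    return sum(d // 2 for d in dims)


def torus_hops(a, b, dims):
    """Minimal hop count between nodes a and b, taking the shorter way around each ring."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def neighbors(node, dims):
    """The six nearest neighbors of a node in a 3-D torus."""
    result = []
    for axis, d in enumerate(dims):
        for step in (-1, 1):
            n = list(node)
            n[axis] = (n[axis] + step) % d
            result.append(tuple(n))
    return result


dims = (64, 32, 32)
print(torus_diameter(dims))                       # 32 + 16 + 16 = 64 hops
print(torus_hops((0, 0, 0), (63, 0, 0), dims))    # 1 hop thanks to the wraparound link
print(neighbors((0, 0, 0), dims))                 # one neighbor per link, six in total
print(6.4e-6 / torus_diameter(dims))              # implied latency per hop, in seconds
```

Dividing 6.4us by 64 hops gives roughly 100ns per hop under the even-spread assumption.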

14 Collective network Binary tree topology, static routing. Link bandwidth: 2.8Gb/s. Maximum hardware latency: 5us. With arithmetic and logical hardware: can perform integer operations on the data –Efficient support for reduce, scan, global sum, and broadcast operations –Floating-point operations can be done in 2 passes.
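The slide does not say how the two passes work. One plausible reading, offered here only as an assumption, is that the first pass uses an integer max to find a common exponent and the second pass uses an integer add on mantissas aligned to that exponent. A Python sketch of that idea (`tree_reduce`, `two_pass_float_sum`, and `MANTISSA_BITS` are hypothetical names, not BG/L interfaces):

```python
import math

MANTISSA_BITS = 52  # assumed fixed-point precision for this sketch


def tree_reduce(values, op):
    """Combine values pairwise, level by level, like a reduction up a binary tree."""
    level = list(values)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd node out moves up unchanged
        level = nxt
    return level[0]


def two_pass_float_sum(values):
    # Pass 1: integer max over the binary exponents of all contributions.
    max_exp = tree_reduce([math.frexp(v)[1] for v in values], max)
    # Pass 2: integer add over mantissas shifted to the common exponent.
    scaled = [int(math.ldexp(v, MANTISSA_BITS - max_exp)) for v in values]
    total = tree_reduce(scaled, lambda a, b: a + b)
    return math.ldexp(total, max_exp - MANTISSA_BITS)


print(two_pass_float_sum([1.5, 2.25, -0.75, 8.0]))
```

Only integer operations (max and add) run inside the tree in this sketch; the conversions to and from fixed point happen at the nodes. Values much smaller than the largest one lose low-order bits when aligned.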

15 Barrier network Hardware support for global synchronization. 1.5us for barrier on 64K nodes.

16 IBM BlueGene/L summary Optimize cost/performance –limiting applications. –Use low power design Lower frequency, system-on-a-chip Great performance per watt metric Scalability support –Hardware support for global communication and barrier –Low latency, high bandwidth support

17 Case 2: Infiniband architecture –Specification (Infiniband architecture specification release 1.2.1, January 2008/Oct. 2006) available at Infiniband Trade Association (

18 Infiniband architecture overview

19 –Components: Links, Channel adaptors, Switches, Routers –The specification allows Infiniband wide area networks, but it is mostly adopted as a system/storage area network. –Topology: Irregular; Regular: Fat tree –Link speed: Single data rate (SDR): 2.5Gbps (1X), 10Gbps (4X), and 30Gbps (12X). Double data rate (DDR): 5Gbps (1X), 20Gbps (4X). Quad data rate (QDR): 40Gbps (4X).
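The link speeds listed above are the per-lane signaling rate multiplied by the link width. A short Python sketch of that arithmetic; the 8b/10b line-coding factor used for the usable data rate is not on the slide and is added here as an assumption about SDR/DDR/QDR links:

```python
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}  # per-lane signaling rate
ENCODING_EFFICIENCY = 8 / 10  # assumption: 8b/10b line coding, not stated on the slide


def signaling_rate(speed, width):
    """Raw link rate in Gbps for a speed grade and a width of 1, 4, or 12 lanes."""
    return LANE_GBPS[speed] * width


def data_rate(speed, width):
    """Usable data rate in Gbps after line-coding overhead."""
    return signaling_rate(speed, width) * ENCODING_EFFICIENCY


for speed, width in (("SDR", 1), ("SDR", 4), ("SDR", 12), ("DDR", 4), ("QDR", 4)):
    print(f"{speed} {width}X: {signaling_rate(speed, width):g} Gbps signaling, "
          f"{data_rate(speed, width):g} Gbps data")
```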

20 Layers: somewhat similar to TCP/IP –Physical layer –Link layer Error detection (CRC checksum) flow control (credit based) switching, virtual lanes (VL), forwarding table computed by subnet manager –Single path deterministic routing (not adaptive) –Network layer: across subnets. No use for the cluster environment –Transport layer Reliable/unreliable, connection/datagram –Verbs: interface between adaptors and OS/Users
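Credit-based flow control at the link layer means a sender transmits only when the receiver has advertised free buffer space, so packets are held back rather than dropped. A toy Python model of one virtual lane (the class, its methods, and the buffer counts are illustrative, not part of the specification):

```python
class CreditLink:
    """Toy credit-based flow control on one virtual lane."""

    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers  # sender's view of free receive buffers
        self.in_flight = []              # packets sitting in receiver buffers

    def try_send(self, packet):
        if self.credits == 0:
            return False  # sender must wait; nothing is dropped
        self.credits -= 1
        self.in_flight.append(packet)
        return True

    def receiver_drains(self, count=1):
        """Receiver frees buffers and returns that many credits to the sender."""
        drained = self.in_flight[:count]
        del self.in_flight[:count]
        self.credits += len(drained)
        return drained


link = CreditLink(receiver_buffers=2)
print(link.try_send("p1"), link.try_send("p2"), link.try_send("p3"))
link.receiver_drains(1)
print(link.try_send("p3"))
```

The third send is refused until the receiver drains a buffer and hands a credit back.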

21 Infiniband Link layer Packet format: Local Route Header (LRH): 8 bytes. Used for local routing by switches within an IBA subnet. Global Route Header (GRH): 40 bytes. Used for routing between subnets. Base Transport Header (BTH): 12 bytes, for IBA transport. Extended transport headers: –Reliable datagram extended transport header (RDETH): 4 bytes, just for reliable datagram –Datagram extended transport header (DETH): 8 bytes –RDMA extended transport header (RETH): 16 bytes –Atomic, ACK, Atomic ACK, Immediate DATA extended transport header: 4 bytes, optimized for small packets. Invariant CRC and variant CRC: –CRCs over the fields that do not change and the fields that may change in transit.
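The header sizes above determine the per-packet overhead. A Python sketch that adds them up for a packet inside one subnet; the CRC sizes (4-byte invariant CRC, 2-byte variant CRC) are not given on the slide and are assumptions here, and the function names are illustrative:

```python
HEADER_BYTES = {
    "LRH": 8,     # local route header
    "GRH": 40,    # global route header, only when crossing subnets
    "BTH": 12,    # base transport header
    "RDETH": 4,   # reliable datagram extended transport header
    "DETH": 8,    # datagram extended transport header
    "RETH": 16,   # RDMA extended transport header
}
ICRC_BYTES = 4  # assumption: invariant CRC size, not given on the slide
VCRC_BYTES = 2  # assumption: variant CRC size, not given on the slide


def packet_bytes(headers, payload):
    """Total bytes on the wire for the given headers and payload size."""
    return sum(HEADER_BYTES[h] for h in headers) + payload + ICRC_BYTES + VCRC_BYTES


def efficiency(headers, payload):
    """Fraction of the packet that is payload."""
    return payload / packet_bytes(headers, payload)


rdma_write = ["LRH", "BTH", "RETH"]  # first packet of an RDMA write within a subnet
for payload in (64, 256, 2048):
    print(payload, packet_bytes(rdma_write, payload),
          round(efficiency(rdma_write, payload), 3))
```

Small payloads pay a noticeably larger fraction of overhead, which is why the compact extended headers matter for short messages.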

22 Local Route Header: –Switching based on the destination port address (LID) –Multipath switching by allocating multiple LIDs to one port
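A switch forwards on the destination LID alone, so giving one port several LIDs lets the forwarding tables send each LID along a different path. A minimal Python sketch with a made-up forwarding table (all LID values, host names, and port numbers are hypothetical):

```python
# Hypothetical forwarding table of one switch: destination LID -> output port.
forwarding_table = {
    0x10: 1,  # first LID of host A, reached through port 1
    0x11: 2,  # second LID of the same host A, reached through port 2
    0x20: 3,  # only LID of host B
}

# LIDs assigned to each end port; several LIDs give several paths.
port_lids = {"hostA": [0x10, 0x11], "hostB": [0x20]}


def output_port(dlid):
    """Switching decision: look up the destination LID, nothing else."""
    return forwarding_table[dlid]


def paths_to(host):
    """All (LID, output port) choices a sender has for reaching this host."""
    return [(lid, output_port(lid)) for lid in port_lids[host]]


print(paths_to("hostA"))  # two LIDs, two different output ports
print(paths_to("hostB"))  # single LID, single path
```

The sender selects a path simply by choosing which destination LID to put in the LRH; the switch itself stays deterministic.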

23 Subnet management Initialize the network –Discover subnet topology and topology changes, compute the paths, assign LIDs, distribute the routes, configure devices. –Related devices and entities Devices: Channel Adapters (CA), Host Channel Adapters, switches, routers. Subnet manager (SM): discovers, configures, activates, and manages the subnet. A subnet management agent (SMA) in every device generates and responds to control packets (subnet management packets (SMPs)), and configures local components for subnet management. The SM exchanges control packets with the SMAs through the subnet management interface (SMI).

24 Subnet Management phases: –Topology discovery: sending directed-route SMPs to every port and processing the responses. –Path computation: computing valid paths between each pair of end nodes –Path distribution phase: configuring the forwarding tables
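The three phases can be mimicked on a tiny made-up subnet. The Python sketch below uses breadth-first search as a stand-in for the SMP exchange and for path computation; it is an illustration under that assumption, not how any particular subnet manager is implemented, and the topology, node names, and LID numbering are hypothetical:

```python
from collections import deque

# Hypothetical subnet: links[node] maps a local port number to the neighbor on that port.
links = {
    "sw1": {1: "hostA", 2: "hostB", 3: "sw2"},
    "sw2": {1: "sw1", 2: "hostC"},
    "hostA": {1: "sw1"},
    "hostB": {1: "sw1"},
    "hostC": {1: "sw2"},
}


def is_switch(node):
    return node.startswith("sw")


def discover(start):
    """Phase 1: breadth-first sweep outward from the subnet manager's node."""
    seen, queue = [start], deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links[node].values():
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
    return seen


def first_hop_ports(switch):
    """Phase 2: shortest paths from one switch, as destination -> output port."""
    ports, queue = {}, deque()
    for port, neighbor in links[switch].items():
        ports[neighbor] = port
        queue.append(neighbor)
    while queue:
        node = queue.popleft()
        if not is_switch(node):
            continue  # end nodes do not forward packets
        for neighbor in links[node].values():
            if neighbor != switch and neighbor not in ports:
                ports[neighbor] = ports[node]  # inherit the first hop
                queue.append(neighbor)
    return ports


def configure(start):
    """Phase 3: assign LIDs and build one forwarding table (LID -> port) per switch."""
    nodes = discover(start)
    lids = {node: i + 1 for i, node in enumerate(nodes)}
    tables = {
        sw: {lids[dst]: port for dst, port in first_hop_ports(sw).items()}
        for sw in nodes if is_switch(sw)
    }
    return lids, tables


lids, tables = configure("hostA")
print(tables["sw1"][lids["hostC"]])  # port on sw1 that leads toward hostC
```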

25 Base transport header:

26 Verbs –OS/Users access the adaptor through verbs –Communication mechanism: Queue Pair (QP) Users can queue up a set of instructions that the hardware executes. A pair of queues in each QP: one for send, one for receive. Users can post send requests to the send queue and receive requests to the receive queue. Three types of send operations: SEND, RDMA- (WRITE, READ, ATOMIC), MEMORY-BINDING One receive operation (matching SEND)


28 To communicate: –Make system calls to set up everything (open QP, bind QP to port, bind completion queues, connect local QP to remote QP, register memory, etc). –Post send/receive requests as user level instructions. –Check completion.
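The queue-pair mechanism and the post/check-completion steps above can be modeled in a few lines. This Python toy is not the verbs API; the class and method names are invented for illustration, and `process()` stands in for what the adapter hardware would do:

```python
from collections import deque


class QueuePair:
    """Toy model of a queue pair with a completion queue."""

    def __init__(self):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.completions = deque()
        self.peer = None

    def post_send(self, data):
        self.send_queue.append(data)

    def post_recv(self, buffer):
        self.recv_queue.append(buffer)

    def process(self):
        """Stand-in for the hardware: match queued SENDs with the peer's posted receives."""
        while self.send_queue and self.peer.recv_queue:
            data = self.send_queue.popleft()
            buffer = self.peer.recv_queue.popleft()
            buffer[:len(data)] = data  # deliver into the user-provided buffer
            self.completions.append(("send", len(data)))
            self.peer.completions.append(("recv", len(data)))


def connect(a, b):
    """Setup step: tie the local QP to the remote QP."""
    a.peer, b.peer = b, a


sender, receiver = QueuePair(), QueuePair()
connect(sender, receiver)

buf = bytearray(8)
receiver.post_recv(buf)      # the receiver supplies the buffer (user-managed)
sender.post_send(b"hello")   # post a SEND request
sender.process()             # the "hardware" executes the queued work

print(bytes(buf[:5]), sender.completions[0], receiver.completions[0])
```

A SEND stays queued until the peer has posted a matching receive, which mirrors the note that buffer management is left to the user.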

29 InfiniBand has an almost perfect software/network interface: –The network subsystem realizes most user level functionality. Network supports in-order delivery and fault tolerance. Buffer management is pushed out to the user. –OS bypass: User level accesses to the network interface. A few machine instructions will accomplish the transmission task without involving the OS.
