© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Blue Gene/L Torus Interconnection Network N. R. Adiga, et.al IBM Journal.


1 Blue Gene/L Torus Interconnection Network
N. R. Adiga et al., IBM Journal of Research & Development
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
Image from http://i.n.com.com/i/ne/p/photo/BlueGeneL_03_550x366.jpg

2 ECE 8813a (2) Overview
An initiative for a petaflops machine in support of computational biology
Influenced by the success of “lattice” architectures targeted to specific problems
  - Customization and SoC technology
  - Better price/performance and energy/performance
System: 32x32x64 nodes
http://www.mcs.anl.gov/bgconsortium/
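The 32x32x64 system size can be made concrete with a short Python sketch of a 3D torus: node count, the six wraparound neighbors of a node, and the minimal hop distance between two nodes. The function names and the (x, y, z) tuple representation are assumptions for illustration, not taken from the slides or the paper.

```python
# Illustrative sketch only; names and coordinate layout are assumptions.
DIMS = (32, 32, 64)  # system size from the slide: 32 x 32 x 64 nodes


def node_count(dims=DIMS):
    """Total number of nodes in the torus."""
    total = 1
    for size in dims:
        total *= size
    return total


def neighbors(coord, dims=DIMS):
    """Return the six nearest neighbors of coord, with wraparound links."""
    result = []
    for axis, size in enumerate(dims):
        for step in (+1, -1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            result.append(tuple(n))
    return result


def hop_distance(src, dst, dims=DIMS):
    """Minimal hop count between two nodes, using the shorter way around each ring."""
    total = 0
    for s, d, size in zip(src, dst, dims):
        delta = abs(s - d)
        total += min(delta, size - delta)
    return total
```

With these dimensions, `node_count()` gives 65,536 nodes, and the farthest node from (0, 0, 0) is (16, 16, 32), at 64 hops.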

3 Packaging and Scale-Up
Image from http://www.rug.nl/cit/diensten/system_services/nieuwsbrief/200504/bouwbluegene-buildup-800x571.jpg

4 Some Physical Notes
Design point emphasizes “cellular” style problems
  - Nearest neighbor interconnect
  - Sensitivity to cabling
Emphasis on performance per unit volume
  - Speed/energy
Integration
  - Minimize parts count: avoid separate NI cards/chips

5 Blue Gene/L Node
Image from http://www.bgl.mcs.anl.gov/Presentations/Bair-INCITE-BGArch-20060301-full.pdf
Dual PPC 440 cores at 700 MHz
  - Dual-issue OOO core
  - Dual FPUs
Five interconnects
  - Torus inter-processor communication network
  - Global/collectives network
  - Global barrier/interrupts
  - Ethernet
  - Control network

6 Blue Gene/L Node
[Block diagram: two PPC 440 cores, each with 32K/32K L1 and an L2; shared SRAM buffer; shared L3 directory (EDRAM); L3 or memory (EDRAM); torus, collective, barrier, and GigE interfaces. Diagram labels: 128 bits, 256 bits, snoop, 2KB, 4 global barriers/interrupts, 3 I/Os, 6 I/Os.]
Not coherent across L1

7 The Router Microarchitecture
[Figure labels: 19x6, byte wide; six input/output links plus injection and ejection; bypass; adaptive VC, escape VC, high priority VC; input pipeline, 8 stage]
Pressure on routing logic and receiver arbiters
Bit serial links: 175 MB/sec
  - Pin constraints
One Kbyte VCs
Switch speedup
  - Concurrent transfers to 2 senders

8 Router Vital Statistics
8 injection FIFOs
  - 2 high priority and 6 normal
14 ejection FIFOs
  - Two groups of 7
    o One high priority and six normal for each direction
Watermarks on the injection and ejection FIFOs tied to interrupts
Deterministic bubble routing on the escape channel
Worst case hardware latency through the node is 69 ns
Area equivalent to one core
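The slide names bubble routing on the escape channel without spelling out the rule. As a rough illustration, the sketch below encodes the classical bubble condition in units of 32-byte tokens: a packet continuing along a ring needs room for itself, while a packet entering a ring (by injection or by turning into a new dimension) must also leave one maximum-size packet of free space. This is an assumed textbook formulation, not necessarily the exact Blue Gene/L rule, and `may_enter_escape_vc` is a hypothetical name.

```python
# Sketch of the classical bubble rule; the exact BG/L condition is an assumption here.
MAX_PACKET_TOKENS = 8  # 256-byte maximum packet at 32 bytes per token


def may_enter_escape_vc(packet_tokens, free_tokens, entering_ring):
    """Decide whether a packet may move into the downstream escape VC.

    packet_tokens: packet length in 32-byte tokens (1..8)
    free_tokens:   tokens currently free in the downstream escape VC
    entering_ring: True if the packet is injected or turns into a new dimension
    """
    if not 1 <= packet_tokens <= MAX_PACKET_TOKENS:
        raise ValueError("packet must be 1 to 8 tokens long")
    needed = packet_tokens
    if entering_ring:
        # Keep one maximum-size "bubble" free so the ring can always drain.
        needed += MAX_PACKET_TOKENS
    return free_tokens >= needed
```

The extra bubble is what keeps a ring from filling completely, which is the condition that would otherwise allow a cyclic wait.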

9 Arbitration
Three stage arbitration
  - Join the shortest queue (JSQ) (RC + VCA)
    o Use token availability
    o Use deterministic VC
    o Do not compete
  - Serve the longest queue (SLQ) (for SA)
    o 2-bit granularity
    o % of cycles devoted to randomized selection
  - Modified SLQ for SA allocation
[Figure: router diagram repeated from slide 7]
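One plausible reading of the JSQ stage, sketched in Python: pick the adaptive VC with the most available downstream tokens (the shortest queue), fall back to the deterministic escape VC if no adaptive VC has room, and otherwise sit out the cycle. The function name, the VC labels, and the tie-breaking are assumptions; the slide only lists the three options.

```python
# Hedged sketch of join-the-shortest-queue VC selection; names are hypothetical.
def choose_vc(adaptive_tokens, escape_tokens, packet_tokens):
    """Pick a virtual channel for a packet, or None to skip this cycle.

    adaptive_tokens: dict mapping adaptive VC name -> free downstream tokens
    escape_tokens:   free downstream tokens on the deterministic (escape) VC
    packet_tokens:   packet length in tokens
    """
    # Only adaptive VCs with room for the whole packet are candidates.
    candidates = {vc: t for vc, t in adaptive_tokens.items() if t >= packet_tokens}
    if candidates:
        # Most free tokens corresponds to the shortest downstream queue.
        return max(candidates, key=candidates.get)
    if escape_tokens >= packet_tokens:
        return "escape"
    return None  # do not compete in this arbitration cycle
```

Using token counts as the queue-length signal means the choice needs no information beyond what flow control already tracks.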

10 The Switching Layer
Packet size varies from 32 bytes to 256 bytes (32 byte increments)
Virtual cut through
Token flow control: one token = 32 bytes
  - Not sufficient for deadlock freedom with variable sized packets?
  - Flow control + acknowledgements signaling protocol
  - Buffer allocation and freeing space in the retransmission buffer
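The token accounting can be sketched as follows: a sender tracks free 32-byte tokens for a downstream buffer and, under virtual cut-through, only starts a packet when the whole packet fits. The 1 KB buffer size matches the "One Kbyte VCs" note on the router slide; the class and method names are hypothetical.

```python
# Minimal sketch of token-based flow control; class and method names are assumptions.
TOKEN_BYTES = 32
MIN_PACKET, MAX_PACKET = 32, 256


def tokens_for(packet_bytes):
    """Number of tokens a packet consumes; sizes must be multiples of 32 bytes."""
    if not MIN_PACKET <= packet_bytes <= MAX_PACKET or packet_bytes % TOKEN_BYTES:
        raise ValueError("packet size must be 32..256 bytes in 32-byte steps")
    return packet_bytes // TOKEN_BYTES


class TokenCounter:
    """Sender-side view of free space in one downstream VC buffer."""

    def __init__(self, buffer_bytes=1024):
        self.tokens = buffer_bytes // TOKEN_BYTES

    def try_send(self, packet_bytes):
        """Consume tokens if the whole packet fits (virtual cut-through)."""
        need = tokens_for(packet_bytes)
        if need > self.tokens:
            return False
        self.tokens -= need
        return True

    def release(self, packet_bytes):
        """Return tokens when the receiver frees the packet's buffer space."""
        self.tokens += tokens_for(packet_bytes)
```

A 1 KB buffer holds 32 tokens, so four maximum-size packets fill it and a fifth must wait until tokens are returned.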

11 The Switching Layer (cont.)
Time-outs and link level retransmission for corrupted packets
[Figure: packet format]
Packet = 1 to 8 chunks of 32 bytes
8 bytes:
  - Link-level info
  - Routing info
  - VC & size
  - 8-bit CRC (protects header)
24-bit CRC, valid
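To show how the chunk structure constrains packet sizes, the sketch below splits a message into packets of 1 to 8 chunks, reserving the first 8 bytes of each packet for the header fields listed above. It assumes the 24-bit CRC and valid indicator travel outside the 32-byte chunks, which the slide does not state; `packet_sizes` is a hypothetical helper.

```python
# Sketch under stated assumptions: 8-byte header inside the first chunk,
# link-level CRC/valid bytes not counted in the chunked packet size.
CHUNK_BYTES = 32
HEADER_BYTES = 8
MAX_CHUNKS = 8


def packet_sizes(message_bytes):
    """Return the sizes (in bytes, header included) of the packets for a message."""
    max_payload = MAX_CHUNKS * CHUNK_BYTES - HEADER_BYTES  # 248 bytes per packet
    sizes = []
    remaining = message_bytes
    while remaining > 0:
        payload = min(remaining, max_payload)
        # Round header + payload up to a whole number of 32-byte chunks.
        chunks = -(-(payload + HEADER_BYTES) // CHUNK_BYTES)
        sizes.append(chunks * CHUNK_BYTES)
        remaining -= payload
    return sizes
```

Under these assumptions a 249-byte message needs one full 256-byte packet plus one 32-byte packet, since only 248 bytes of payload fit alongside the header.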

12 The Routing Layer
Adaptive and deterministic minimal path routing
  - Deadlock freedom: bubble router or deterministic
Router registers store state
  - Neighbor coordinates
  - Hints early in the header to pipeline arbitration
  - Routing function implementation: hints+VCs
Hardware broadcast down a single dimension
[Figure labels: 24-bit CRC, valid; hint bits 101001; router state registers; route]
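The role of the hint bits can be illustrated with a sketch that derives six direction bits for a minimal path on the torus. The bit order (x+, x-, y+, y-, z+, z-) and the choice of the plus direction when both ways around a ring are equally long are assumptions; the slide shows only an example bit pattern.

```python
# Hedged sketch: bit ordering and tie-breaking are assumptions, not from the slides.
DIMS = (32, 32, 64)


def hint_bits(src, dst, dims=DIMS):
    """Return a 6-character bit string in the assumed order x+ x- y+ y- z+ z-."""
    bits = []
    for s, d, size in zip(src, dst, dims):
        forward = (d - s) % size  # hops needed when travelling in the + direction
        if forward == 0:
            bits += ["0", "0"]    # already aligned in this dimension
        elif forward <= size - forward:
            bits += ["1", "0"]    # + direction is minimal (ties go to +)
        else:
            bits += ["0", "1"]    # - direction is minimal
    return "".join(bits)
```

With this ordering, a packet from (0, 0, 0) to (1, 1, 63) gets `101001`: one hop in x+, one in y+, and one in z- across the wraparound link.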

13 Tree Structured Collective Network
One to all communication
Embedded associative operations
  - min/max, add/sub, and/or
Leaf to root latency of 2.5 microseconds
Routing table driven
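An embedded associative operation on a tree can be sketched as a leaf-to-root reduction in which each node combines its own value with the results from its children. The dictionary-based tree and the function name are assumptions for illustration; the hardware combines values in the network rather than by recursion.

```python
# Illustrative software model of an in-network reduction; names are hypothetical.
import operator


def reduce_up_tree(children, values, op, node=0):
    """Combine values from the leaves toward the root with an associative op.

    children: dict mapping node id -> list of child node ids
    values:   dict mapping node id -> that node's contribution
    op:       associative binary operation, e.g. max, min, operator.add
    """
    result = values[node]
    for child in children.get(node, []):
        result = op(result, reduce_up_tree(children, values, op, child))
    return result


# Small example tree: node 0 is the root, nodes 2, 3, and 4 are leaves.
tree = {0: [1, 2], 1: [3, 4]}
contrib = {0: 5, 1: 3, 2: 8, 3: 1, 4: 7}
total = reduce_up_tree(tree, contrib, operator.add)  # sum of all contributions
largest = reduce_up_tree(tree, contrib, max)         # global maximum
```

Because the operation is associative, the result does not depend on the order in which subtrees are combined, which is what allows the combining to happen hop by hop.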

14 Fast Barrier/Interrupt Network
Four tree structured networks for OR/AND operation
1.3 microseconds max delay
User space accessible

15 Impact of Deadlock Avoidance Mechanism
Asymmetry in traditional deadlock avoidance schemes
From N. R. Adiga et al., “Blue Gene/L Torus Interconnection Network,” IBM J. Research and Development, March/May 2005

16 Impact of Adaptive Routing
Diminishing returns for increasing number of virtual channels
System non-uniformities affect maximum achievable link utilization
  - MPI all-to-all pattern

17 Summary
Domain-specific system architecture
  - Note the impact of system design goals on the choices
Heterogeneous interconnection network architecture

