
1 Hot Interconnects 2005
Control Path Implementation for a Low-Latency Optical HPC Switch
C. Minkenberg¹, F. Abel¹, P. Müller¹, R. Krishnamurthy¹, M. Gusat¹, B. R. Hemenway²
¹ IBM Research, Zurich Research Laboratory
² Corning Inc., Science and Technology
© 2005 IBM Corporation & Corning Incorporated

2 Outline
- Motivation
- Can optics play a significant role in high-performance computing interconnects?
- OSMOSIS
- Requirements
- Design decisions
- Architecture
- Control path challenges
- Arbiter speed & complexity
- Summary

3 HPC Interconnection Networks
- Presently implemented as electronic packet-switching networks, but quickly approaching electronic limits with further scaling
- The future could be based on maturing all-optical packet switching, but the technical challenges must be solved and the cost reduction of all-optical packet switching for HPC interconnects accelerated
- Goal: build a full-function all-optical packet-switch demonstrator system showing the scalability, performance, and cost paths for a potential commercial system
- OSMOSIS: Optical Shared MemOry Supercomputer Interconnect System
- Sponsored by DoE & NNSA as part of ASC
- Joint 2½-year project between Corning (optics and packaging) and IBM (electronics: arbiter, input and output adapters; and system design)

4 HPC Requirements for OSMOSIS
- Near 1 microsecond memory-to-memory latency
  - Includes encoding/decoding, arbitration, virtual output queues (VOQs)
- Scaling to 2048+ nodes
  - In a multi-stage topology
- Very low bit-error rate (10⁻²¹)
  - After forward error correction (FEC) and reliable delivery (RD)
- Low switching overhead (<25%)
  - Includes optical switching overhead, header, line coding, and FEC
- FPGA-only implementation
  - For cost and flexibility

5 Key design choices
- Large switch radix
  - A 64-port switch allows scaling to 2048 nodes in two levels
  - 3-stage, 2-level fat-tree topology with flow control
  - Basic module scales from 16 to 128 ports
- Cell switching
  - No provisioning or aggregation techniques (burst/container switching)
  - Full switch reconfiguration every time slot (51.2 ns) requires low overhead and fast arbitration
  - Enabled by fast semiconductor optical amplifiers (SOAs)
- Input queuing with central arbitration
  - Optical crossbar, no buffering in the optical domain
  - Electronic input buffers with VOQs to eliminate head-of-line (HOL) blocking
  - Electronic central arbiter to achieve high maximum throughput and low latency
- Port speed 40 Gb/s, cell size 256 B (checked in the sketch below)
  - Allows ~25% overhead
  - Arbitration feasible at 40 Gb/s
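The 51.2 ns time slot follows directly from the port speed and cell size. A quick sanity check of the slide's numbers (an illustrative sketch, not project code):

```python
# Back-of-the-envelope check of the slide's timing numbers (illustration only).
CELL_SIZE_BYTES = 256      # cell size from the slide
LINE_RATE_BPS = 40e9       # 40 Gb/s port speed

cell_time_ns = CELL_SIZE_BYTES * 8 / LINE_RATE_BPS * 1e9
print(f"cell duration = {cell_time_ns:.1f} ns")                       # -> 51.2 ns

# A ~25% overhead budget leaves roughly 64 B of each 256 B cell for SOA
# switching guard time, header, line coding, and FEC.
print(f"overhead budget = {0.25 * CELL_SIZE_BYTES:.0f} B per cell")   # -> 64 B
```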

6 OSMOSIS System Architecture
- Broadcast-and-select architecture (crossbar)
- Combination of wavelength- and space-division multiplexing (port addressing sketched below)
- Fast switching based on SOAs
- Electronic input and output adapters
- Electronic arbitration
[Figure: system diagram showing 64 ingress adapters (VOQs, Tx control) with control links to the central arbiter (bipartite-graph matching algorithm); an all-optical switch built from 8 broadcast units (8x1 combiner, WDM mux, optical amplifier, 1x128 star coupler) and 128 select units (fast-SOA 1x8 fiber selector gates and fast-SOA 1x8 wavelength selector gates); and 64 egress adapters (2 Rx, EQ control)]
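Because the data path combines 8 broadcast fibers with 8 wavelengths per fiber, each of the 64 ingress ports can be identified by a (fiber, wavelength) pair that the egress select units gate on. A minimal sketch, assuming inputs are simply numbered 8 per broadcast unit (the actual port ordering is not given on the slide):

```python
# Minimal sketch of broadcast-and-select addressing. The port-to-(fiber,
# wavelength) mapping below is an assumption for illustration only.
FIBERS = 8        # broadcast units (space dimension)
WAVELENGTHS = 8   # wavelengths per broadcast unit (WDM dimension)

def select_gates_for_input(src_port: int) -> tuple[int, int]:
    """Return the (fiber index, wavelength index) an egress select unit must
    gate on to receive the cell broadcast by ingress port src_port."""
    assert 0 <= src_port < FIBERS * WAVELENGTHS
    return src_port // WAVELENGTHS, src_port % WAVELENGTHS

# Example: to receive from ingress port 42, an output enables the SOA for
# broadcast fiber 5 in its fiber-selection stage and wavelength 2 in its
# wavelength-selection stage.
print(select_gates_for_input(42))   # -> (5, 2)
```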

7 Control Path Challenges
- System size
  - Remote adapter cards: long round-trip time
- Control channel protocol
  - Combines many functions
- Matching algorithm
  - Iterative round-robin-based matching algorithm: good performance, practical, amenable to distribution
  - Requires about log₂(64) = 6 iterations for highest performance
- Speed
  - Short cell duration makes it impossible to complete sufficient iterations
- Complexity
  - Implement an iterative matching algorithm for 64 ports @ 51.2 ns in FPGAs
  - Parallelism and distribution are needed (a toy matcher is sketched below)
- Packaging
  - A large number of high-end FPGAs must be accommodated in close proximity
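To make the iteration count concrete, here is a toy two-phase, round-robin iterative matcher in Python. It is not the OSMOSIS arbiter's exact algorithm (the pointer-update policy in particular is simplified); it only illustrates why several iterations are needed to converge on a good matching for 64 ports.

```python
# Toy two-phase round-robin iterative matcher (illustration, not the real RTL).
import random

N = 64  # switch radix

def iterative_match(requests, iterations=6):
    """requests[i] is the set of outputs for which input i has queued cells.
    Returns a dict mapping each matched input to its output."""
    in_ptr = [0] * N    # round-robin pointers of the input selectors
    out_ptr = [0] * N   # round-robin pointers of the output selectors
    match = {}
    for _ in range(iterations):
        # Phase 1: every unmatched input requests one free output, round-robin.
        proposals = {}
        for i in range(N):
            if i in match or not requests[i]:
                continue
            free = [o for o in requests[i] if o not in match.values()]
            if not free:
                continue
            o = min(free, key=lambda o: (o - in_ptr[i]) % N)
            proposals.setdefault(o, []).append(i)
        # Phase 2: every requested output grants one input, round-robin.
        for o, candidates in proposals.items():
            i = min(candidates, key=lambda i: (i - out_ptr[o]) % N)
            match[i] = o
            in_ptr[i] = (o + 1) % N
            out_ptr[o] = (i + 1) % N
    return match

random.seed(0)
reqs = [set(random.sample(range(N), 4)) for _ in range(N)]
print(f"{len(iterative_match(reqs))} of {N} ports matched")
```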

8 System Size
- Large installations
  - Switch remote from nodes
  - Long cables
- Long round-trip latency between
  - adapters and arbiter
  - adapters and crossbar
  - arbiter and crossbar
- Control channel protocol (ΔRGP)
  - Incremental requests and grants (bookkeeping sketched below)
  - Arbiter keeps track of pending requests per VOQ
  - Careful delay matching to ensure correct operation
[Figure: adapters, arbiter, and crossbar connected by long links]
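A minimal sketch of the arbiter-side ΔRGP bookkeeping, assuming one pending-request counter per VOQ; the class and message field names are illustrative, not taken from the design:

```python
# Sketch of incremental request/grant (ΔRGP) bookkeeping (names are assumptions).
class ArbiterVOQState:
    """Arbiter-side view of one ingress adapter's virtual output queues."""

    def __init__(self, num_outputs: int = 64):
        self.pending = [0] * num_outputs   # cells requested but not yet granted

    def on_request_message(self, deltas: dict[int, int]) -> None:
        # The adapter reports only *new* arrivals per VOQ since its previous
        # message, so control messages stay small despite the long round-trip.
        for output, delta in deltas.items():
            self.pending[output] += delta

    def on_grant(self, output: int) -> None:
        # Issuing a grant consumes one pending request for that VOQ.
        assert self.pending[output] > 0
        self.pending[output] -= 1
```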

9 Control Channel Protocol
- Incremental request/grant protocol (ΔRGP)
  - To cope with the round-trip time
- "Census"
  - To ensure consistency of ΔRGP in the presence of errors
- Reliable delivery
  - Relaying of intra-switch acknowledgments
- Flow control
  - To prevent egress buffer overflow (on-off, watermark-based)
- Multicast
  - Very large fanout (64 bits)
  - Special control message format
- Control channel bandwidth
  - 12 B control messages
  - 2 Gb/s/port (2.5 Gb/s raw), bidirectional
  - Aggregate arbiter bandwidth = (64 + 16) × 2.5 × 2 = 400 Gb/s (recomputed below)
  - One control channel interface (CCI) FPGA per two ports
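Reproducing the bandwidth arithmetic; the one-message-per-time-slot cadence in the second calculation is an assumption used only to show it is consistent with the ~2 Gb/s per-port figure:

```python
# Reproducing the slide's control-bandwidth arithmetic (illustration only).
PORTS = 64            # data ports
EXTRA_LINKS = 16      # additional control links (the slide's "+ 16" term)
RAW_GBPS = 2.5        # raw control-channel rate per link, per direction
DIRECTIONS = 2        # bidirectional

print((PORTS + EXTRA_LINKS) * RAW_GBPS * DIRECTIONS, "Gb/s aggregate")  # -> 400.0

# Assuming one 12 B control message per 51.2 ns time slot (an assumption),
# the payload rate per port is 12 * 8 bits / 51.2 ns ≈ 1.9 Gb/s, consistent
# with the ~2 Gb/s per-port figure quoted above.
print(12 * 8 / 51.2, "Gb/s payload per port")                           # -> 1.875
```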

10 FLPPR: Fast Low-latency Parallel Pipelined aRbitration
- Short cell duration and large radix
  - Pipelining is required to complete enough iterations per matching
  - Mean latency decreases as the number of iterations increases
- FLPPR
  - Pipelined allocators, parallel requests
  - Allows requests to be issued to any allocator in any time slot (see the skeleton below)
  - Matching rate independent of the number of iterations
- Performance advantages
  - Eliminates pipelining latency at low load
  - Achieves 100% throughput with uniform traffic
  - Reduces latency with respect to PMM also at high load
  - Can improve throughput with nonuniform traffic
  - Highly amenable to distributed implementation in FPGAs
  - Can be applied to any existing iterative matching algorithm
[Figure: FLPPR block diagram showing VOQ pending-request counters, a delay line, four sub-arbiters, and the flow of requests, intermediate requests, intermediate grants, and grants]
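A very simplified skeleton of the pipelining-plus-steering idea: several sub-arbiters each spend multiple time slots on a matching, but new requests may be steered to whichever sub-arbiter can take them, so one matching is still issued every slot and, at low load, a request can join the sub-arbiter that completes next. The pipeline depth, steering policy, and data layout below are all assumptions for illustration, not the OSMOSIS design.

```python
# Highly simplified FLPPR-style pipeline skeleton (assumed parameters/policy).
from collections import deque

K = 4  # pipeline depth = number of sub-arbiters (assumption)

class SubArbiter:
    def __init__(self, slots_left: int):
        self.requests = {}          # input port -> requested output port
        self.slots_left = slots_left  # slots until this matching is issued

def run_slot(pipeline, new_requests):
    # Steer each new request to the soonest-completing sub-arbiter that has not
    # already booked that input or output (a toy policy; the slide notes the
    # real request policy needs care). Requests that fit nowhere simply wait.
    for inp, out in new_requests.items():
        for sa in pipeline:
            if inp not in sa.requests and out not in sa.requests.values():
                sa.requests[inp] = out
                break
    for sa in pipeline:
        sa.slots_left -= 1
    issued = None
    if pipeline[0].slots_left == 0:
        issued = pipeline.popleft().requests   # this matching configures the crossbar
        pipeline.append(SubArbiter(K))         # a fresh sub-arbiter enters the pipeline
    return issued

pipeline = deque(SubArbiter(i + 1) for i in range(K))
print(run_slot(pipeline, {0: 5, 1: 5}))   # -> {0: 5}; the conflicting request waits
```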

11 Distribution of Arbiter Complexity
- Matching 64 ports is state of the art
  - The full algorithm does not fit in the largest Xilinx FPGA for 64 ports
- Approach (partitioning sketched below)
  - Distribute the input and output selectors
  - Place 2 input selectors per CCI FPGA
  - Place 64 output selectors per sub-arbiter FPGA
  - Works well only with a two-phase algorithm (e.g. DRRM, but not SLIP)
  - Still performs well
  - Requires a careful request policy
- New issue
  - Round trip between input and output selectors: still under study
[Figure: input selectors co-located with the control channel interfaces, output selectors in the sub-arbiter]
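A small sketch of how the stated counts partition the selectors across FPGAs; only the counts come from the slides, while the concrete port-to-FPGA mapping is an illustrative assumption:

```python
# Illustrative partitioning of selectors across FPGAs (mapping is assumed).
NUM_PORTS = 64
CCI_FPGAS = 32            # one control channel interface FPGA per two ports
SUB_ARBITER_FPGAS = 4     # sub-arbiter count from the packaging table (next slide)

# Input selectors live next to the control channel interfaces: 2 per CCI FPGA.
input_selector_home = {port: f"OSCI[{port // 2}]" for port in range(NUM_PORTS)}

# Each sub-arbiter FPGA holds a full set of 64 output selectors, so the grant
# phase of a given matching stays inside one FPGA (an inference, not a quote).
output_selector_home = {fpga: list(range(NUM_PORTS)) for fpga in range(SUB_ARBITER_FPGAS)}

print(input_selector_home[42])        # -> 'OSCI[21]'
print(len(output_selector_home[0]))   # -> 64
```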

12 Arbiter Structure and Packaging

  Name                       ID   #   Location
  Control channel interface  CCI  32  OSCI[0:31]
  Switch command interface   SCI  8   OSCI[32:39]
  Sub-arbiters               A    4   OSCB
  Clocking and control       CLK  1   OSCB
  Multiplexer                MUX  1   OSCB
  ACK router                 ACK  1   OSCB
  Embedded microprocessor    μP   1   OSCB
  Total                           47 + 1 devices

[Figure: OSCB layout; midplane (OSCB; prototype shown here) with 40 daughter boards (OSCI)]

13 Summary
- OSMOSIS
  - All-optical data path
  - Multi-stage ready
  - High radix
  - Cell switching
  - Pipelined, distributed central arbitration
  - Low latency, high throughput, and low overhead
- Project status
  - All FPGAs designed (placed and routed)
  - Final arbiter baseboard in layout
  - Final switch being integrated
  - Scheduled for completion in 1Q06
[Figure: optical switching module (fiber-selection stage with 8 SOAs)]

