1 © Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Interconnection Networks

2 ECE 4100/6100 (2) Reading Appendix E

3 ECE 4100/6100 (3) Goals
Coverage of basic concepts for high-performance multiprocessor interconnection networks
- Primarily link- and data-layer communication protocols
Understand established and emerging micro-architecture concepts and implementations

4 ECE 4100/6100 (4) BlackWidow Technology Trends Source: W. J. Dally, “Enabling Technology for On-Chip Networks,” NOCS-1, May 2007

5 ECE 4100/6100 (5) Where is the Demand?
[Figure: demand drivers and associated costs: throughput; area and power; cables, connectors, and transceivers; latency and power; cost vs. performance]

6 ECE 4100/6100 (6) Blurring the Boundary
Use heterogeneous multi-core chips for embedded devices
- IBM Cell (gaming)
- Intel IXP (network processors)
- Graphics processors (NVIDIA)
Use large numbers of multicore processors to build supercomputers
- IBM Cell
- Intel Tera-ops processor
Interconnection networks are central all across the spectrum!

7 ECE 4100/6100 (7) Cray XT3 3D Torus interconnect HyperTransport + Proprietary link/switch technology 45.6GB/s switching capacity per switch

8 ECE 4100/6100 (8) Blue Gene/L

9 ECE 4100/6100 (9) Blue Gene From http://i.n.com.com/i/ne/p/photo/BlueGeneL_03_550x366.jpg

10 ECE 4100/6100 (10) Intel TeraOp Die 2D Mesh Really a test chip Aggressive speed – multi-GHz links?

11 ECE 4100/6100 (11) IBM Cell On-chip network

12 ECE 4100/6100 (12) On-Chip Networks
Why are they different?
- Abundant bandwidth
- Power
- Wire length distribution
Different functions
- Operand networks
- Cache memory
- Message passing

13 ECE 4100/6100 (13) Some Industry Standards
On-chip (these are not traditional switched networks)
- Open Core Protocol (really an interface standard)
- AXI
Traditional switched networks
- PCI Express
- AMD HyperTransport (HT)
- Intel QuickPath Interconnect (QPI)
- InfiniBand
- Ethernet

14 ECE 4100/6100 (14) Basic Concepts
- Link level
- Switch level
- Topology
- Routing
- End-to-end latency

15 © Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Link Level Concepts

16 ECE 4100/6100 (16) Messaging Hierarchy
Routing Layer (Where?): destination decisions, i.e., which output port
Switching Layer (When?): when is data forwarded
Physical Layer (How?): synchronization of data transfer
Lossless transmission (unlike Ethernet)
Performance rather than interoperability is key
Largely responsible for deadlock and livelock properties

17 ECE 4100/6100 (17) System View
[Figure: processor, Northbridge, Southbridge, NI, and PCIe; the path through the bridges to the NI is a high-latency region]
From http://www.psc.edu/publications/tech_reports/PDIO/CrayXT3-ScalableArchitecture.jpg

18 ECE 4100/6100 (18) Messaging Units
Data is transmitted based on a hierarchical data structuring mechanism
- Messages are segmented into packets, packets into flits, and flits into phits
- While flits and phits are fixed size, packets and messages may be variable sized
Flits: flow control digits; phits: physical flow control digits
[Figure: a message split into packets; each packet has a head flit carrying destination info, sequence number, type, and misc fields, followed by data flits and a tail flit]
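To make the hierarchy concrete, here is a minimal Python sketch (not from the slides) that segments a message into packets and flits. The flit size, phit size, payload limit, and field names are illustrative assumptions rather than any particular machine's format.

```python
from dataclasses import dataclass
from typing import List

FLIT_BYTES = 8    # assumed fixed flit size
PHIT_BYTES = 2    # assumed physical link width: one phit crosses the link per cycle

@dataclass
class Flit:
    kind: str         # 'head', 'body', or 'tail'
    payload: bytes    # up to FLIT_BYTES of data (head/tail carry control info)

@dataclass
class Packet:
    dest: int         # destination, carried in the head flit
    seq: int          # sequence number for reassembly
    flits: List[Flit]

def packetize(message: bytes, dest: int, max_payload: int = 32) -> List[Packet]:
    """Split a variable-sized message into packets made of fixed-size flits."""
    packets = []
    for seq, off in enumerate(range(0, len(message), max_payload)):
        chunk = message[off:off + max_payload]
        flits = [Flit('head', b'')]                           # routing info lives here
        flits += [Flit('body', chunk[i:i + FLIT_BYTES])
                  for i in range(0, len(chunk), FLIT_BYTES)]
        flits.append(Flit('tail', b''))                       # closes the packet
        packets.append(Packet(dest, seq, flits))
    return packets

pkts = packetize(b'x' * 100, dest=5)
n_flits = sum(len(p.flits) for p in pkts)
print(len(pkts), 'packets,', n_flits, 'flits,',
      n_flits * (FLIT_BYTES // PHIT_BYTES), 'phits')
```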

19 ECE 4100/6100 (19) Link Level Flow Control
Unit of synchronized communication
- Smallest unit whose transfer is requested by the sender and acknowledged by the receiver
- No restriction on the relative timing of control vs. data transfers
Flow control occurs at two levels
- Level of buffer management (flits/packets)
- Level of physical transfers (phits)
- Relationship between flits and phits is machine & technology specific
Briefly: bufferless flow control

20 ECE 4100/6100 (20) Physical Channel Flow Control
Synchronous flow control: how is buffer space availability indicated?
Asynchronous flow control: what is the limiting factor on link throughput?

21 ECE 4100/6100 (21) Flow Control Mechanisms
- Credit-based flow control
- On/off flow control
- Optimistic flow control (also Ack/Nack)
- Virtual channel flow control

22 ECE 4100/6100 (22) Message Flow Control
Basic network structure and functions: credit-based flow control
[Figure: sender and receiver connected by a pipelined link; the sender keeps a credit counter initialized to the receiver's buffer depth (10)]
The sender sends packets whenever the credit counter is not zero; each flit sent decrements the counter
When the receiver queue is not serviced, no credits return and the counter counts down (10, 9, 8, ... 0), stalling the sender
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

23 ECE 4100/6100 (23) Message Flow Control
Basic network structure and functions: credit-based flow control (continued)
[Figure: the same sender/receiver pair; the receiver drains its queue and returns credits (+5 in the example)]
The receiver sends credits back after buffers become available; the sender's counter is replenished and injection resumes
© T.M. Pinkston, J. Duato, with major contributions by J. Flich
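A minimal sketch of the credit mechanism on these two slides, assuming one credit corresponds to one flit buffer at the receiver; the class name is illustrative, and the buffer depth of 10 matches the example.

```python
class CreditLink:
    """Sender-side credit counter for one downstream buffer pool
    (one credit = one flit buffer at the receiver)."""

    def __init__(self, buffers: int = 10):
        self.credits = buffers      # initialized to the receiver's buffer depth

    def can_send(self) -> bool:
        return self.credits > 0     # transmit only while the counter is non-zero

    def send_flit(self) -> None:
        assert self.can_send()
        self.credits -= 1           # a downstream buffer is now claimed

    def credit_returned(self, n: int = 1) -> None:
        self.credits += n           # receiver freed n buffers and returned credits

link = CreditLink(buffers=10)
sent = 0
while link.can_send():              # receiver queue not serviced: no credits return
    link.send_flit()
    sent += 1
print('sender stalls after', sent, 'flits')       # stalls after 10
link.credit_returned(5)                           # queue serviced: 5 credits come back
print('credits available again:', link.credits)   # injection resumes
```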

24 ECE 4100/6100 (24) Timeline*
[Timing diagram: Node 1 sends a flit to Node 2; Node 2 processes it and returns a credit; the round-trip credit time t_rt can equivalently be expressed in number of flow control buffer units]
*From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004

25 ECE 4100/6100 (25) Performance of Credit Based Schemes
The control bandwidth can be reduced by submitting block credits
Buffers must be sized to maximize link utilization
- Large enough to host packets in transit during the credit round trip: #flit buffers >= t_rt x link bandwidth / flit size
*From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004
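The buffer-sizing rule above can be evaluated directly; a hedged sketch with assumed numbers (the 40 ns round trip, 128 Gb/s link, and 128-bit flit are illustrative, not from the slides):

```python
import math

def min_flit_buffers(t_rt_s: float, link_bw_bps: float, flit_bits: int) -> int:
    """Flit buffers needed to keep a credit-based link fully utilized:
    enough credits must be outstanding to cover one round trip of data."""
    return math.ceil(t_rt_s * link_bw_bps / flit_bits)

t_rt = 40e-9     # 40 ns credit round-trip time (illustrative)
bw   = 128e9     # 128 Gb/s link bandwidth (illustrative)
flit = 128       # 128-bit flits (illustrative)
print(min_flit_buffers(t_rt, bw, flit), 'flit buffers needed')   # 40
```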

26 ECE 4100/6100 (26) Optimistic Flow Control
Transmit flits when available
- De-allocate when reception is acknowledged
- Re-transmit if flit is dropped (and negative ack received)
Issues
- Inefficient buffer usage
  - Messages held at source
  - Re-ordering may be required due to out of order reception
Also known as Ack/Nack flow control

27 ECE 4100/6100 (27) Virtual Channel Flow Control
Physical channels are idle when messages block
Virtual channels decouple the physical channel from the message
Flow control (buffer allocation) is between corresponding virtual channels

28 ECE 4100/6100 (28) Virtual Channels
Each virtual channel is a pair of unidirectional channels
- Independently managed buffers multiplexed over the physical channel
Improves performance through reduction of blocking delay
Important in realizing deadlock freedom (later)

29 ECE 4100/6100 (29) Virtual Channel Flow Control
As the number of virtual channels increases, the increased channel multiplexing has multiple effects (more later)
- Overall performance
- Router complexity and critical path
Flits must now record VC information
- Or send VC information out of band
[Figure: packet format with a VC field added to each flit alongside the type field]
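A hedged sketch of how per-VC state might look: each flit carries its VC identifier in-band, each VC has its own queue and credits, and a simple round-robin pick keeps the physical channel busy even when one VC is blocked. Class and field names are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class VCFlit:
    vc: int      # virtual-channel identifier carried in-band with the flit
    kind: str    # 'head', 'body', or 'tail'

class VirtualChannelLink:
    """One physical channel shared by several VCs, each with its own queue and credits."""

    def __init__(self, num_vcs: int = 4, credits_per_vc: int = 4):
        self.queues  = [deque() for _ in range(num_vcs)]
        self.credits = [credits_per_vc] * num_vcs
        self.rr = 0                                   # round-robin pointer

    def enqueue(self, flit: VCFlit) -> None:
        self.queues[flit.vc].append(flit)

    def transmit_one(self):
        """Send one flit from the next VC (round-robin) that has both a flit and a credit.
        Blocked VCs are skipped, so one stalled message does not idle the physical channel."""
        n = len(self.queues)
        for i in range(n):
            vc = (self.rr + i) % n
            if self.queues[vc] and self.credits[vc] > 0:
                self.credits[vc] -= 1
                self.rr = (vc + 1) % n
                return self.queues[vc].popleft()
        return None

link = VirtualChannelLink()
link.enqueue(VCFlit(vc=0, kind='head'))
link.enqueue(VCFlit(vc=2, kind='head'))
link.credits[0] = 0                  # VC 0 is blocked downstream (no credits)
print(link.transmit_one())           # VC 2 still makes progress on the shared link
```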

30 ECE 4100/6100 (30) Flow Control: Global View
Flow control parameters are tuned based on link length, link width, and processing overhead at the end-points
Effective FC and buffer management is necessary for high link utilization and network throughput
- In-band vs. out-of-band flow control
Links may be non-uniform, e.g., lengths/widths on chips
- Buffer sizing for long links
Latency: overlapping FC, buffer management, and switching impacts end-to-end latency
Some research issues
- Reliability
- QoS
- Congestion management

31 © Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Switching Techniques

32 ECE 4100/6100 (32) Switching
Determines “when” message units are moved
Relationship with flow control has a major impact on performance
- For example, overlapping switching and flow control
- Overlapping routing and flow control
Tightly coupled with buffer management

33 ECE 4100/6100 (33) Circuit Switching
Hardware path setup by a routing header or probe
End-to-end acknowledgment initiates transfer at full hardware bandwidth
System is limited by signaling rate along the circuits
Routing, arbitration and switching overhead experienced once/message
[Timing diagram: the header probe incurs t_r and t_s per hop during t_setup; the acknowledgment returns; data then occupies the circuit for t_data while the links are busy]

34 ECE 4100/6100 (34) Circuit Switching
Switching: circuit switching
[Figure: source end node, destination end node, and intermediate switches with buffers for “request” tokens]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

35 ECE 4100/6100 (35) Routing, Arbitration, and Switching
Switching: circuit switching
Request for circuit establishment (routing and arbitration are performed during this step)
[Figure: the request token advances from the source end node toward the destination end node through buffers for “request” tokens]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

36 ECE 4100/6100 (36) Routing, Arbitration, and Switching
Switching: circuit switching
Request for circuit establishment
Acknowledgment and circuit establishment (as the token travels back to the source, connections are established)
[Figure: the acknowledgment token returns from the destination end node through buffers for “ack” tokens]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

37 ECE 4100/6100 (37) Routing, Arbitration, and Switching
Switching: circuit switching
Request for circuit establishment
Acknowledgment and circuit establishment
Packet transport (neither routing nor arbitration is required)
[Figure: the packet streams from the source end node to the destination end node over the established circuit]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

38 ECE 4100/6100 (38) Routing, Arbitration, and Switching
Switching: circuit switching
Request for circuit establishment
Acknowledgment and circuit establishment
Packet transport
[Figure: a second request is blocked while the circuit is held]
High contention and low link utilization lead to low throughput
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

39 ECE 4100/6100 (39) Packet Switching (Store & Forward)
Finer grained sharing of the link bandwidth
Routing, arbitration, switching overheads experienced for each packet
Increased storage requirements at the nodes
Packetization and in-order delivery requirements
Alternative buffering schemes
- Use of local processor memory
- Central (to the switch) queues
[Timing diagram: each link is busy for t_packet while the header and data of the whole packet are received and stored before forwarding; t_r is incurred per hop]

40 ECE 4100/6100 (40) Routing, Arbitration, and Switching
Switching: store-and-forward switching
Packets are completely stored before any portion is forwarded
[Figure: source end node, destination end node, and switches with buffers for data packets; the packet is stored at the first switch]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

41 ECE 4100/6100 (41) Routing, Arbitration, and Switching
Switching: store-and-forward switching
Packets are completely stored before any portion is forwarded (store, then forward)
Requirement: buffers must be sized to hold an entire packet (MTU)
[Figure: the packet is stored at each switch in turn and then forwarded to the next]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

42 ECE 4100/6100 (42) Virtual Cut-Through
Messages cut through to the next router when feasible
In the absence of blocking, messages are pipelined
- Pipeline cycle time is the larger of intra-router and inter-router flow control delays
When the header is blocked, the complete message is buffered at a switch
High load behavior approaches that of packet switching
[Timing diagram: the packet header cuts through each router after t_r and t_s; if blocked, the message waits for t_blocking; each link is busy while the packet passes]

43 ECE 4100/6100 (43) Routing, Arbitration, and Switching
Switching: cut-through switching
Portions of a packet may be forwarded (“cut through”) to the next switch before the entire packet is stored at the current switch
[Figure: routing at each switch as flits stream from the source end node toward the destination end node]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

44 ECE 4100/6100 (44) Wormhole Switching
Messages are pipelined, but buffer space is on the order of a few flits
Small buffers + message pipelining lead to small, compact switches/routers
Supports variable sized messages
Messages cannot be interleaved over a channel: routing information is only associated with the header
Base latency is equivalent to that of virtual cut-through
[Timing diagram: the single-flit header incurs t_r and t_s per hop; links stay busy for t_wormhole as the remaining flits follow in a pipeline]

45 ECE 4100/6100 (45) Routing, Arbitration, and Switching
Switching: virtual cut-through vs. wormhole
- Virtual cut-through: buffers for data packets; requirement: buffers must be sized to hold an entire packet (MTU)
- Wormhole: buffers for flits; packets can be larger than the buffers
[Figure: source and destination end nodes for each scheme, showing the per-switch buffer sizes]
“Virtual Cut-Through: A New Computer Communication Switching Technique,” P. Kermani and L. Kleinrock, Computer Networks, 3, pp. 267–286, January, 1979.
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

46 ECE 4100/6100 (46) Routing, Arbitration, and Switching
Switching: virtual cut-through vs. wormhole (behavior under blocking)
- Virtual cut-through: when blocked, the packet is completely stored at one switch; buffers sized for an entire packet (MTU)
- Wormhole: when blocked, the packet is stored along the path, holding several links busy; buffers hold only flits
[Figure: busy links for each scheme under blocking]
“Virtual Cut-Through: A New Computer Communication Switching Technique,” P. Kermani and L. Kleinrock, Computer Networks, 3, pp. 267–286, January, 1979.
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

47 ECE 4100/6100 (47) Comparison of Switching Techniques
Packet switching and virtual cut-through
- Consume network bandwidth proportional to network load
- Predictable demands
- VCT behaves like wormhole at low loads and like packet switching at high loads
- Link level error control for packet switching
Wormhole switching
- Provides low (unloaded) latency
- Lower saturation point
- Higher variance of message latency than packet or VCT switching
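The trade-offs show up in first-order, no-load latency models in the spirit of the textbook treatment: D hops, per-hop routing delay t_r and switch delay t_s, message length L, and link bandwidth b. The sketch below ignores wire delay, phit granularity, and contention, so the formulas and numbers are only illustrative approximations.

```python
def circuit(D, t_r, t_s, L, b):
    # probe and ack cross D hops to set up the path; data then streams at full bandwidth
    return D * (t_r + 2 * t_s) + L / b

def store_and_forward(D, t_r, t_s, L, b):
    # the entire packet is serialized into and out of a buffer at every hop
    return D * (t_r + t_s + L / b)

def cut_through(D, t_r, t_s, L, b):
    # header pipelines ahead of the data; serialization cost is paid only once
    # (same no-load form for virtual cut-through and wormhole)
    return D * (t_r + t_s) + L / b

D, t_r, t_s = 8, 20e-9, 5e-9       # 8 hops, illustrative per-hop delays
L, b = 1024 * 8, 128e9             # 1 KB packet (in bits), 128 Gb/s links
for name, model in [('circuit switching', circuit),
                    ('store & forward', store_and_forward),
                    ('cut-through / wormhole', cut_through)]:
    print(f'{name:24s} {model(D, t_r, t_s, L, b) * 1e9:7.1f} ns')
```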

48 © Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Topologies

49 ECE 4100/6100 (49) The Network Model
[Figure: nodes with registers, ALU, MEM, and NI attached to the network; the message path traverses the network]
Metrics (for now): latency and bandwidth
Routing, switching, flow control, error control

50 ECE 4100/6100 (50) Pipelined Switch Microarchitecture
[Figure: a five-stage router pipeline: Stage 1 IB (input buffering), Stage 2 RC (routing computation), Stage 3 VCA (virtual channel allocation), Stage 4 SA (switch allocation), Stage 5 ST (switch traversal through the crossbar) and output buffering; input and output buffers connect to the physical channels through link control, DEMUX, and MUX logic]
L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001
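A small sketch relating the five stages to flit types in a canonical, non-speculative pipeline of this kind: only the head flit needs route computation and VC allocation, while body and tail flits reuse the head's route and VC. This is a simplified model, not the specific router in the cited paper.

```python
HEAD_STAGES = ['IB', 'RC', 'VCA', 'SA', 'ST']   # head flit: all five stages
BODY_STAGES = ['IB', 'SA', 'ST']                # body/tail flits skip RC and VCA

def stages_for(flit_kind: str):
    """Pipeline stages a flit occupies in this simplified, non-speculative model."""
    return HEAD_STAGES if flit_kind == 'head' else BODY_STAGES

for kind in ('head', 'body', 'tail'):
    print(f'{kind:4s} flit: {" -> ".join(stages_for(kind))}')
```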

51 ECE 4100/6100 (51) Classification
Shared medium networks
- Example: backplane buses
Direct networks
- Example: k-ary n-cubes, meshes, and trees
Indirect networks
- Example: multistage interconnection networks

52 ECE 4100/6100 (52) Direct Networks
Buses do not scale, electrically or in bandwidth
Full connectivity is too expensive (not the same as Xbars)
Network built on point-to-point transfers
[Figure: each node couples a processor and memory to a router via injection and ejection channels]

53 ECE 4100/6100 (53) System View
[Figure: the node's processor, NB, SB, NI, and PCIe; the router with its injection and ejection channels sits in the performance-critical path, while the path through the bridges is a high-latency region]
From http://www.psc.edu/publications/tech_reports/PDIO/CrayXT3-ScalableArchitecture.jpg

54 ECE 4100/6100 (54) Common Topologies
- Binary n-cube
- Torus (3-ary 2-cube)
- n-dimensional mesh

55 ECE 4100/6100 (55) Metrics
Network             | Bisection Width | Node Size
k-ary n-cube        | 2W k^(n-1)      | 2Wn
Binary n-cube       | NW/2            | nW
n-dimensional mesh  | W k^(n-1)       | 2Wn
(W = channel width, k = radix, n = dimensions, N = number of nodes)
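The table entries can be evaluated directly; a small sketch computing bisection width and node size for concrete k, n, and channel width W (the parameter values are illustrative):

```python
def k_ary_n_cube(k, n, W):
    """Torus: 2*k^(n-1) channels of width W cross the bisection; degree 2n."""
    return {'bisection': 2 * W * k ** (n - 1), 'node_size': 2 * W * n}

def binary_n_cube(n, W):
    """Hypercube on N = 2^n nodes: N/2 channels cross the bisection; degree n."""
    N = 2 ** n
    return {'bisection': N * W // 2, 'node_size': n * W}

def n_dim_mesh(k, n, W):
    """Mesh: k^(n-1) channels cross the bisection; degree 2n."""
    return {'bisection': W * k ** (n - 1), 'node_size': 2 * W * n}

W = 16  # 16-bit channels (illustrative)
print('8-ary 2-cube (torus):', k_ary_n_cube(8, 2, W))
print('binary 6-cube       :', binary_n_cube(6, W))
print('8x8 mesh            :', n_dim_mesh(8, 2, W))
```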

56 ECE 4100/6100 (56) Blocking vs. Non-blocking Networks
Consider the permutation behavior
- Model the input-output requests as permutations of the source addresses
[Figure: an 8-port blocking multistage topology in which two requests conflict on an internal link, and an 8-port non-blocking topology that routes the same permutation without conflict]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

57 ECE 4100/6100 (57) Crossbar Network
[Figure: a 16 x 16 crossbar connecting input ports 0-15 to output ports 0-15]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

58 ECE 4100/6100 (58) Non-Blocking Clos Network
[Figure: a 16-port, three-stage Clos network connecting input ports 0-15 to output ports 0-15]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

59 ECE 4100/6100 (59) Clos Network Properties
General 3-stage non-blocking network
- Originally conceived for telephone networks
Recursive decomposition
- Produces the Benes network with 2x2 switches
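For a symmetric three-stage Clos network with n inputs per first-stage switch and m middle-stage switches, the classic conditions are m >= 2n - 1 for strict-sense non-blocking and m >= n for rearrangeably non-blocking. A small sketch applying these conditions (the function name is illustrative):

```python
def clos_nonblocking(n: int, m: int) -> str:
    """Classify a symmetric 3-stage Clos network with n inputs per first-stage
    switch and m middle-stage switches, using the standard Clos conditions."""
    if m >= 2 * n - 1:
        return 'strict-sense non-blocking'
    if m >= n:
        return 'rearrangeably non-blocking'
    return 'blocking'

for n, m in [(4, 7), (4, 4), (4, 3)]:
    print(f'n={n}, m={m}: {clos_nonblocking(n, m)}')
```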

60 ECE 4100/6100 (60) Clos Network: Recursive Decomposition
[Figure: a 16-port, 5-stage Clos network obtained by decomposing the middle-stage switches]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

61 ECE 4100/6100 (61) Clos Network: Recursive Decomposition
[Figure: a 16-port, 7-stage Clos network built from 2x2 switches = Benes topology]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

62 ECE 4100/6100 (62) Path Diversity
[Figure: alternative paths from port 0 to port 1 in the 16-port, 7-stage Clos network (Benes topology)]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

63 ECE 4100/6100 (63) Path Diversity
[Figure: alternative paths from port 4 to port 0 in the 16-port, 7-stage Clos network (Benes topology)]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

64 ECE 4100/6100 (64) Path Diversity
[Figure: contention-free paths from port 0 to port 1 and from port 4 to port 1 in the 16-port, 7-stage Clos network (Benes topology)]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

65 ECE 4100/6100 (65) Moving to Fat Trees
Nodes at tree leaves
Switches at tree vertices
Total link bandwidth is constant across all tree levels, with full bisection bandwidth
Equivalent to folded Benes topology
Preferred topology in many system area networks
Folded Clos = Folded Benes = Fat tree network
[Figure: a fat tree with the network bisection marked]
© T.M. Pinkston, J. Duato, with major contributions by J. Flich

66 ECE 4100/6100 (66) Fat Trees: Another View
Equivalent to the preceding multistage implementation
Common topology in many supercomputer installations
[Figure: a fat tree drawn as a tree, with forward and backward links]

67 ECE 4100/6100 (67) Relevance of Fat Trees
Popular interconnect for commodity supercomputing
Active research problems
- Efficient, low latency routing
- Fault tolerant routing
- Scalable protocols (coherence)
Routing in MINs
[Figure: a 16-port multistage fat tree]

68 © Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Routing

69 ECE 4100/6100 (69) Generic Router Architecture

70 ECE 4100/6100 (70) Routing Algorithms
- Deterministic routing
- Oblivious routing
- Source routing
- Table-based routing
- Adaptive routing
- Unicast vs. multicast

71 ECE 4100/6100 (71) Deterministic Routing Algorithms
Strictly increasing or decreasing order of dimension
Acyclic channel dependencies
[Figure: dimension-order routing examples on a mesh and a binary hypercube]
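Dimension-order (XY) routing on a 2D mesh is the canonical example: each packet fully resolves the X offset before touching Y, so channel dependencies never turn back to an earlier dimension. A minimal sketch, assuming a mesh with E/W/N/S output ports:

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing on a 2D mesh: fully resolve the X offset, then Y.
    Returns the sequence of output ports the packet takes, hop by hop."""
    (x, y), (dx, dy) = src, dst
    hops = []
    hops += ['E' if dx > x else 'W'] * abs(dx - x)   # dimension X first
    hops += ['N' if dy > y else 'S'] * abs(dy - y)   # then dimension Y
    return hops

print(xy_route((0, 0), (3, 2)))   # ['E', 'E', 'E', 'N', 'N']
print(xy_route((2, 3), (1, 1)))   # ['W', 'S', 'S']
```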

72 © Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Deadlock

73 ECE 4100/6100 (73) Overview: The Problem
Guaranteeing message delivery
- Fundamentally an issue of finite resources (buffers) and the manner in which they are allocated
  - How are distributed structural hazards handled?
Waiting for resources can lead to
- Deadlock: a configuration of messages that cannot make progress
- Livelock: messages that never reach the destination
- Starvation: messages that never receive requested resources
  - Even though the requested resource does become available

74 ECE 4100/6100 (74) Deadlocked Configuration of Messages
Routing in a 2D mesh
What can we infer from this figure?
- Routing
- Wait-for relationships

75 ECE 4100/6100 (75) Example of Deadlock Freedom
Strictly increasing or decreasing order of dimension
Acyclic channel dependencies
[Figure: dimension-order routing examples on a mesh and a binary hypercube]

76 ECE 4100/6100 (76) Channel Dependencies
When a packet holds a channel and requests another channel, there is a direct dependency between them
Channel dependency graph D = G(C,E)
For deterministic routing: single dependency at each node
For adaptive routing: all requested channels produce dependencies, and the dependency graph may contain cycles
[Figure: channel dependencies for SAF & VCT and for wormhole, showing header and data flits]

77 ECE 4100/6100 (77) Breaking Cycles in Rings/Tori
The configuration to the left can deadlock
Add (virtual) channels
- We can make the channel dependency graph acyclic via routing restrictions (via the routing function)
Routing function: at node n_i with destination n_j, use channel c_1i when j > i and c_0i when j < i
[Figure: a 4-node unidirectional ring n0-n3 with physical channels c0-c3, and the same ring with each channel split into virtual channels c_0i and c_1i]

78 ECE 4100/6100 (78) Breaking Cycles in Rings/Tori (cont.)
Channels c_00 and c_13 are unused
The routing function breaks cycles in the channel dependency graph
[Figure: the ring with virtual channels and the resulting acyclic channel dependency graph over c_01, c_02, c_03, c_10, c_11, c_12]
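The routing restriction on these two slides can be checked mechanically: enumerate all source-destination pairs on the 4-node unidirectional ring, record which virtual channel each hop uses, build the channel dependency graph, and verify it is acyclic (and that c_00 and c_13 go unused). A hedged sketch; the helper names are illustrative.

```python
from collections import defaultdict
from itertools import product

N = 4  # nodes n0..n3 on a unidirectional ring; channel class 'c0' or 'c1' at each node

def route_channels(src, dst):
    """Channels used from src to dst under the slide's routing function:
    at node i, use c1 while the destination index j > i, and c0 while j < i."""
    chans, i = [], src
    while i != dst:
        chans.append(('c1', i) if dst > i else ('c0', i))
        i = (i + 1) % N
    return chans

# Build the channel dependency graph: a -> b if some packet holds a and then requests b.
deps, used = set(), set()
for s, d in product(range(N), repeat=2):
    if s != d:
        path = route_channels(s, d)
        used.update(path)
        deps.update(zip(path, path[1:]))

def has_cycle(nodes, edges):
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
    state = {}  # node -> 0 while on the DFS stack, 1 when fully explored

    def dfs(u):
        state[u] = 0
        for v in adj[u]:
            if state.get(v) == 0 or (v not in state and dfs(v)):
                return True
        state[u] = 1
        return False

    return any(dfs(u) for u in nodes if u not in state)

all_channels = set(product(('c0', 'c1'), range(N)))
print('unused channels:', sorted(all_channels - used))            # [('c0', 0), ('c1', 3)]
print('dependency graph is acyclic:', not has_cycle(used, deps))  # True
```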

79 ECE 4100/6100 (79) Summary
Multiprocessor and multicore interconnection networks are different from traditional inter-processor communication networks
- Unique microarchitectures
- Distinct concepts
Network-on-chip is critical to multicore architectures

80 ECE 4100/6100 (80) Glossary Adaptive routing Arbitration Bisection bandwidth Blocking networks Channel dependency Circuit switching Clos networks Credit-based flow control Crossbar network Deadlock and deadlock freedom Deterministic routing Direct networks Fat trees Flit Flow control Hypercubes Indirect networks K-ary n-cubes Oblivious routing Multistage networks Non-blocking networks Packet Packet switching Phit Routing Routing header Routing payload Router computation Switch allocation Virtual channels Virtual channel allocation (in a switch) Virtual channel flow control Virtual cut-through switching Wormhole switching

