Presentation is loading. Please wait.

Presentation is loading. Please wait.

Prof. Natalie Enright Jerger

Similar presentations


Presentation on theme: "Prof. Natalie Enright Jerger"— Presentation transcript:

1 Prof. Natalie Enright Jerger
ECE 1749H: Interconnection Networks for Parallel Computer Architectures: Router Microarchitecture Prof. Natalie Enright Jerger Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

2 ECE 1749H: Interconnection Networks (Enright Jerger)
Introduction Topology: connectivity Routing: paths Flow control: resource allocation Router Microarchitecture: implementation of routing, flow control and router pipeline Impacts per-hop delay and energy Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

3 Router Microarchitecture Overview
Focus on microarchitecture of Virtual Channel router Router complexity increase with bandwidth demands Simple routers built when high throughput is not needed Wormhole flow control, unpipelined, limited buffer Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

4 Virtual Channel Router
Credits Out Credits In VC Allocator Route Computation Switch Allocator Input buffers VC 1 VC 2 VC 3 VC 4 Input 1 Output 1 Input buffers VC 1 VC 2 VC 3 VC 4 Input 5 Output 5 Winter 2010 Crossbar switch

5 ECE 1749H: Interconnection Networks (Enright Jerger)
Router Components Input buffers, route computation logic, virtual channel allocator, switch allocator, crossbar switch Most OCN routers are input buffered Use single-ported memories Buffer store flits for duration in router Contrast with processor pipeline that latches between stages Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

6 Baseline Router Pipeline
BW RC VA SA ST LT Logical stages Fit into physical stages depending on frequency Canonical 5-stage pipeline BW: Buffer Write RC: Routing computation VA: Virtual Channel Allocation SA: Switch Allocation ST: Switch Traversal LT: Link Traversal Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

7 Atomic Modules and Dependencies in Router
Decode + Routing Switch Arbitration Crossbar Traversal Wormhole Router Decode + Routing VC Allocation Switch Arbitration Crossbar Traversal Virtual Channel Router VC Allocation Decode + Routing Crossbar Traversal Speculative Switch Arbitration Speculative Virtual Channel Router Dependence between output of one module and input of another Determine critical path through router Cannot bid for switch port until routing performed Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

8 ECE 1749H: Interconnection Networks (Enright Jerger)
Atomic Modules Some components of router cannot be easily pipelined Example: pipeline VC allocation Grants might not be correctly reflected before next allocation Separable allocator: many wires connecting input/output stages requiring latches if pipelined Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

9 Baseline Router Pipeline (2)
1 2 3 4 5 6 7 8 9 BW RC VA SA ST LT Head BW SA ST LT Body 1 BW SA ST LT Body 2 BW SA ST LT Tail Routing computation performed once per packet Virtual channel allocated once per packet Body and tail flits inherit this info from head flit Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

10 Router Pipeline Performance
Baseline (no load) delay Ideally, only pay link delay Techniques to reduce pipeline stages BW NRC VA SA ST LT Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

11 Pipeline Optimizations: Lookahead Routing
At current router perform routing computation for next router Overlap with BW Precomputing route allows flits to compete for VCs immediately after BW RC decodes route header Routing computation needed at next hop Can be computed in parallel with VA BW RC VA SA ST LT Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

12 Pipeline Optimizations: Speculation
Assume that Virtual Channel Allocation stage will be successful Valid under low to moderate loads Entire VA and SA in parallel If VA unsuccessful (no virtual channel returned) Must repeat VA/SA in next cycle Prioritize non-speculative requests BW RC VA SA ST LT Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

13 Pipeline Optimizations: Bypassing
When no flits in input buffer Speculatively enter ST On port conflict, speculation aborted In the first stage, a free VC is allocated, next routing is performed and the crossbar is setup VA RC Setup ST LT Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

14 Pipeline Bypassing No buffered flits when A arrives
Lookahead Routing Computation Virtual Channel Allocation 1a Inject N 1 1b A S E W Eject N S E 2 W No buffered flits when A arrives Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

15 Speculation A succeeds in VA but fails in SA, retry SA
Virtual Channel Allocation 2a Lookahead Routing Computation 1a Switch Allocation 2b 3 Port conflict detected Inject 1 1c 1b B N 1 1c 1b A S E W Eject N S E W 4 Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger) 3

16 ECE 1749H: Interconnection Networks (Enright Jerger)
Buffer Organization Physical channels Virtual channels Also multiple variable length queues Single buffer per input Multiple fixed length queues per physical channel Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

17 ECE 1749H: Interconnection Networks (Enright Jerger)
Buffer Organization VC 0 tail head VC 1 tail head Multiple variable length queues Multiple VCs share a large buffer Each VC must have minimum 1 flit buffer Prevent deadlock More complex circuitry Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

18 ECE 1749H: Interconnection Networks (Enright Jerger)
Buffer Organization Many shallow VCs? Few deep VCs? More VCs ease HOL blocking More complex VC allocator Light traffic Many shallow VCs – underutilized Heavy traffic Few deep VCs – less efficient, packets blocked due to lack of VCs Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

19 ECE 1749H: Interconnection Networks (Enright Jerger)
Switch Organization Heart of datapath Switches bits from input to output High frequency crossbar designs challenging Crossbar composed for many multiplexers Common in low-frequency router designs i00 i10 i20 i30 i40 o0 sel0 o1 sel1 o2 sel2 o3 sel3 o4 sel4 Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

20 Switch Organization: Crosspoint
Inject N S E W Eject w columns w rows Area and power scale at O((pw)2) p: number of ports (function of topology) w: port width in bits (determines phit/flit size and impacts packet energy and delay) Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

21 ECE 1749H: Interconnection Networks (Enright Jerger)
Crossbar speedup 10:5 crossbar 5:10 crossbar 10:10 crossbar Increase internal switch bandwidth Simplifies allocation or gives better performance with a simple allocator More inputs to select from  higher probability each output port will be matched (used) each cycle Output speedup requires output buffers Multiplex onto physical link Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

22 Crossbar Dimension Slicing
Crossbar area and power grow with O((pw)2) Replace 1 5x5 crossbar with 2 3x3 crossbars Suited to DOR Traffic mostly stays within 1 dimension Inject E-in E-out W-in W-out N-in N-out S-in S-out Ports and width Eject Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

23 Arbiters and Allocators
Allocator matches N requests to M resources Arbiter matches N requests to 1 resource Resources are VCs (for virtual channel routers) and crossbar switch ports. Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

24 Arbiters and Allocators (2)
Virtual-channel allocator (VA) Resolves contention for output virtual channels Grants them to input virtual channels Switch allocator (SA) that grants crossbar switch ports to input virtual channels Allocator/arbiter that delivers high matching probability translates to higher network throughput. Must also be fast and/or able to be pipelined Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

25 ECE 1749H: Interconnection Networks (Enright Jerger)
Round Robin Arbiter Last request serviced given lowest priority Generate the next priority vector from current grant vector Exhibits fairness Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

26 ECE 1749H: Interconnection Networks (Enright Jerger)
Round Robin (2) Priority 0 Next priority 0 Grant 0 Priority 1 Next priority 1 Grant 1 Priority 2 Next priority 2 Grant 2 Gi granted, next cycle Pi+1 high Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

27 ECE 1749H: Interconnection Networks (Enright Jerger)
Matrix Arbiter Least recently served priority scheme Triangular array of state bits wij for i < j Bit wij indicates request i takes priority over j Each time request k granted, clears all bits in row k and sets all bits in column k Good for small number of inputs Fast, inexpensive and provides strong fairness Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

28 ECE 1749H: Interconnection Networks (Enright Jerger)
Matrix Arbiter (2) Request 0 01 02 10 Request 1 20 21 12 Grant 0 Disable 2 Disable 1 Disable 0 Request 2 Grant 1 Grant 2 Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

29 Matrix Arbiter Example
Request 0 01 02 10 Request 1 20 Grant 0 Disable 2 Disable 1 Disable 0 Request 2 Grant 1 Grant 2 1 A2 A1 B1 1 1 C2 C1 Bit [1,0] = 1, Bit [2,0] = 1  1 and 2 have priority over 0 Bit [2,1] = 1  2 has priority over 1 C1 (Req 2) granted Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

30 Matrix Arbiter Example (2)
Request 0 01 02 10 Request 1 20 Grant 0 Disable 2 Disable 1 Disable 0 Request 2 Grant 1 Grant 2 1 1 1 A2 A1 B1 C2 Set column 2, clear row 2 Bit [1,0] = 1, Bit [1,2] = 1  Req 1 has priority over 0 and 2 Grant B1 (Req 1) Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

31 Matrix Arbiter Example (3)
Request 0 01 02 10 Request 1 20 Grant 0 Disable 2 Disable 1 Disable 0 Request 2 Grant 1 Grant 2 1 1 A2 A1 1 C2 Set column 1, clear row 1 Bit [0,1] = 1, Bit [0,2] = 1  Req 0 has priority over 1 and 2 Grant A1 (Req 0) Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

32 Matrix Arbiter Example (4)
Request 0 01 02 10 Request 1 20 Grant 0 Disable 2 Disable 1 Disable 0 Request 2 Grant 1 Grant 2 1 A2 1 1 C2 Set column 0, clear row 0 Bit [2,0] = 1, Bit [2,1] = 1  Req 2 has priority over 0 and 1 Grant C2 (Req 2) Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

33 Matrix Arbiter Example (5)
Request 0 01 02 10 Request 1 20 Grant 0 Disable 2 Disable 1 Disable 0 Request 2 Grant 1 Grant 2 1 1 1 A2 Set column 2, clear row 2 Grant Request A2 Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

34 ECE 1749H: Interconnection Networks (Enright Jerger)
Wavefront Allocator Arbitrates among requests for inputs and outputs simultaneously Row and column tokens granted to diagonal group of cells If a cell is requesting a resource, it will consume row and column tokens Request is granted Cells that cannot use tokens pass row tokens to right and column tokens down Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

35 Wavefront Allocator Example
Tokens inserted at P0 Entry [0,0] receives grant, consumes token A requesting Resources 0, 1 ,2 p0 00 01 02 03 Remaining tokens pass down and right B requesting Resources 0, 1 p1 10 11 12 13 C requesting Resource 0 p2 20 21 22 23 [3,2] receives 2 tokens and is granted D requesting Resources 0, 2 p3 30 31 32 33 Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

36 Wavefront Allocator Example
00 01 02 03 [1,1] receives 2 tokens and granted p1 10 11 12 13 All wavefronts propagated p2 20 21 22 23 p3 30 31 32 33 Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

37 ECE 1749H: Interconnection Networks (Enright Jerger)
Separable Allocator Need for pipelineable allocators Allocator composed of arbiters Arbiter chooses one out of N requests to a single resource Separable switch allocator First stage: select single request at each input port Second stage: selects single request for each output port Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

38 Separable Allocator A 3:4 allocator
Requestor 1 requesting resource A 3:1 arbiter Resource A granted to Requestor 1 Requestor 1 requesting resource C 4:1 arbiter Resource A granted to Requestor 2 Resource A granted to Requestor 3 Resource A granted to Requestor 4 3:1 arbiter 4:1 arbiter 3:1 arbiter Resource C granted to Requestor 1 4:1 arbiter Resource C granted to Requestor 2 Requestor 4 requesting resource A Resource C granted to Requestor 3 3:1 arbiter Resource C granted to Requestor 4 A 3:4 allocator First stage: 3:1 – ensures only one grant for each input Second stage: 4:1 – only one grant asserted for each output Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

39 Separable Allocator Example
Requestor 1 wins A A 3:1 arbiter B 4:1 arbiter C A 3:1 arbiter B 4:1 arbiter A 3:1 arbiter Requestor 4 wins C 4:1 arbiter A 3:1 arbiter need to animate C 4 requestors, 3 resources Arbitrate locally among requests Local winners passed to second stage Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

40 Virtual Channel Allocator Organization
Depends on routing function If routing function returns single VC VCA need to arbitrate between input VCs contending for same output VC Returns multiple candidate VCs (for same physical channel) Needs to arbitrate among v first stage requests before forwarding winning request to second stage) Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

41 Virtual Channel Allocators
If routing function returns single virtual channel Need piv:1 arbiter for each output virtual channel (pov) Arbitrate among input VCs competing for same output VC piv:1 arbiter 1 piv:1 arbiter pov Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

42 Virtual Channel Allocators
v:1 arbiter 1 v:1 arbiter po 1 Routing function returns VCs on a single physical channel First stage of v:1 arbiters for each input VC Second stage piv:1 arbiters for each output VC piv:1 arbiter 1 piv:1 arbiter pov v:1 arbiter 1 v:1 arbiter po pi Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

43 Virtual Channel Allocators
pov:1 arbiter 1 pov:1 arbiter piv piv:1 arbiter 1 piv:1 arbiter pov Routing function returns candidate VCs on any physical channel First stage: pov:1 arbiter to handle max pov output VCs desired by each input VC Second stage: piv:1 for each output VC Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

44 Adaptive Routing & Allocator Design
Deterministic routing Single output port Switch allocator bids for output port Adaptive routing Returns multiple candidate output ports Switch allocator can bid for all ports Granted port must match VC granted Return single output port Reroute if packet fails VC allocation Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

45 Separable Switch Allocator
First stage: Pi v:1 arbiters For each Pi input, select among v input virtual channels Second stage: Po pi:1 arbiters Winners of v:1 arbiters select output port request of winning VC Forward output port request to pi:1 arbiters Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

46 ECE 1749H: Interconnection Networks (Enright Jerger)
Speculative VC Router Non-speculative switch requests must have higher priority than speculative ones Two parallel switch allocators 1 for speculative 1 for non-speculative From output, choose non-speculative over speculative Possible for flit to succeed in speculative switch allocation but fail in virtual channel allocation Done in parallel Speculation incorrect Switch reservation is wasted Body and Tail flits: non-speculative switch requests Do not perform VC allocation  inherit VC from head flit Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

47 ECE 1749H: Interconnection Networks (Enright Jerger)
Router Floorplanning Determining placement of ports, allocators, switch Critical path delay Determined by allocators and switch traversal Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

48 Router Floorplanning 16 Bytes 5x5 Crossbar (0.763 mm2) BF P1 P0 P3 P2
SA + BFC + VA (0.016 mm2) P3 P2 Req Grant + misc control lines (0.043 mm2) Req Grant + misc control lines (0.043 mm2) left off here… P4 Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

49 ECE 1749H: Interconnection Networks (Enright Jerger)
Router Floorplanning North Output M5 North Input North Output Module M5 Local Output Module Switch West Output HR M6 North Input Module South Input Module West Input Module Local Input Module East Input Module West Output Module East Output Module M6 East Output South Output Module M5 South Input South Output M5 WR Placing all input ports on left side Frees up M5 and M6 for crossbar wiring Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

50 Microarchitecture Summary
Ties together topological, routing and flow control design decisions Pipelined for fast cycle times Area and power constraints important in NoC design space Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

51 Interconnection Network Summary
Throughput given by flow control Latency Throughput given by routing Zero load latency (topology+routing+flow control) Throughput given by topology Min latency given by routing algorithm Min latency given by topology Offered Traffic (bits/sec) Latency vs. Offered Traffic Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

52 Towards the Ideal Interconnect
Ideal latency Solely due to wire delay between source and destination D = Manhatten distance L = packet size b = channel bandwidth v = propagation velocity Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

53 ECE 1749H: Interconnection Networks (Enright Jerger)
State of the Art Dedicated wiring impractial Long wires segmented with insertion of routers Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

54 Latency Throughput Gap
Latency gap Aggressive speculation and bypassing 8 VCs/port Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

55 Towards the Ideal Interconnect
Ideal Energy Only energy of interconnect wires D = Distance Pwire = transmission power per unit length Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

56 ECE 1749H: Interconnection Networks (Enright Jerger)
State of the Art No longer just wires Prouter = buffer read/write power, arbitration power, crossbar traversal Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

57 ECE 1749H: Interconnection Networks (Enright Jerger)
Power Gap Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

58 Key Research Challenges
Low power on-chip networks Power consumed largely dependent on bandwidth it has to support Bandwidth requirement depends on several factors Beyond conventional interconnects Power efficient link designs 3D stacking Optics Resilient on-chip networks Manufacturing defects and variability Soft errors and wearout Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)

59 ECE 1749H: Interconnection Networks (Enright Jerger)
Next Week Paper 1: Flattened Butterfly Presenter: Robert Hesse Paper 2: Design and Evaluation of a Hierarchical On-Chip Interconnect Presenter: Jason Luu Paper 3: Design Trade-offs for Tiled CMP On-Chip Networks Paper 4: Cost-Efficient Dragonfly Two critiques due at the start of class Winter 2010 ECE 1749H: Interconnection Networks (Enright Jerger)


Download ppt "Prof. Natalie Enright Jerger"

Similar presentations


Ads by Google