Switch Design: a unified view of micro-architecture and circuits
Giorgos Dimitrakopoulos
Electrical and Computer Engineering, Democritus University of Thrace (DUTH), Xanthi, Greece
Switch Design - NoCs 2012

System abstraction
Processors for computation, memories for storage, IO for connecting to the outside world, and a network for communication and system integration
– Abstraction stack: Algorithms/Applications, Operating System, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Logic design, Circuits, Devices

Logic, State, and Memory
Datapath functions
– Controlled by FSMs
– Can be pipelined
Mapped on silicon chips
– Gate-level netlist from a cell library
– Cells built from transistors after custom layout
Memory macros store large chunks of data
– Multi-ported register files for fast local storage and access of data

On-Chip Wires
Passive devices that connect transistors
Many layers of wiring on a chip
Wire width and spacing depend on the metal layer
– High-density local connections: Metal 1–5
– Upper metal layers (6 and above) are wider and used for less dense, low-delay global connections

Future of wires: the evolution toward 2.5D and 3D integration

Optical wiring
Optical connections will be integrated on chip
– Useful when the power of electrical connections limits the available chip IO bandwidth
A balanced solution that involves both optical and electrical components will probably win

Let’s send a word on a chip
Sender and receiver are on the same clock domain
– Clock-domain crossing just adds latency
– Any relation between the sender and receiver clocks is exploited: mesochronous interfaces, tightly coupled synchronizers
[AMD Zacate]

Point-to-point links: Flow control
Synchronous operation
– Data on every cycle
Sender can stall
– Data valid signal
Receiver can stall
– Stall (back-pressure) signal
Either can stall
– Valid and stall signals together
– Partially decouple sender and receiver by adding a buffer at the receive side

Sender and receiver decoupled by a buffer
The receiver accepts some of the sender’s traffic even if the transmitted words are not consumed
– When to stop? How is buffer overflow avoided?
Let’s first see how to build a buffer
– Clock-domain crossing can be tightly coupled within the buffer

Buffer organization
A FIFO container that maintains order of arrival
– 4 interfaces: full, empty, put, get
Elastic
– Cascade of depth-1 stages
– Internal full/empty signals
Shift register in / parallel out
– Put: shift all entries
– Get: tail pointer
Circular buffer
– Memory with head/tail pointers
– Wrap-around array implementation
– Storage can be register based
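A circular buffer with the four interfaces above can be sketched in a few lines of Python (a behavioral model with illustrative names, not RTL):

```python
class Fifo:
    """Circular-buffer FIFO: wrap-around array with head/tail pointers."""

    def __init__(self, depth):
        self.mem = [None] * depth
        self.head = 0     # next slot to read (get side)
        self.tail = 0     # next slot to write (put side)
        self.count = 0    # occupancy, distinguishes full from empty

    def full(self):
        return self.count == len(self.mem)

    def empty(self):
        return self.count == 0

    def put(self, word):
        assert not self.full()
        self.mem[self.tail] = word
        self.tail = (self.tail + 1) % len(self.mem)   # wrap around
        self.count += 1

    def get(self):
        assert not self.empty()
        word = self.mem[self.head]
        self.head = (self.head + 1) % len(self.mem)   # wrap around
        self.count -= 1
        return word
```

In hardware the same structure maps to a register file plus two counters; the elastic and shift-register organizations differ only in how put/get move the data.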

Buffer implementation
The same basic structure evolves with extra read/write flexibility
Multiplexers and head/tail pointers handle data movement and addressing
– Variants: elastic, circular array, shift-in/parallel-out

Link-level flow control: Backpressure
Link-level flow control provides a closed feedback loop that controls the flow of data from a sender to a receiver
Explicit flow control (stall–go)
– The receiver notifies the sender when to stop/resume transmission
Implicit flow control (credits)
– The sender knows when to stop to avoid buffer overflow
For unreliable channels we need extra mechanisms for detecting and handling transmission errors

STALL-GO flow control
One STALL/GO signal is sent back to the sender
– STALL=0 (GO) means that the sender is allowed to send
– STALL=1 (STALL) means that the sender should stop
– The sender changes its behavior the moment it detects a change on the backpressure signal
Data valid (not shown) is asserted when new data are available

STALL-GO flow control: example
In-flight words will be dropped, or they will replace the ones waiting to be consumed
– In either case data are lost
STALL and GO should be tied to the buffer availability of the receiver’s queue
– The example assumes that the receiver is stalled or released for other network reasons

Buffering requirements of STALL&GO
STALL should be asserted early enough
– To avoid dropping words in flight
– The timing of the STALL assertion guarantees lossless operation
GO should be asserted late enough
– To have words ready to consume before new words arrive; if none are available the link remains idle
– Correct timing guarantees high throughput
Minimum buffering for full throughput and lossless operation should cover both the STALL and GO reaction cycles

STALL&GO on pipelined and elastic links
Traffic is “blind” during a time interval of one round-trip time (RTT)
– The source learns about the effects of its transmission only RTT after the transmission has started
– The (corrective) effects of a contention notification appear at the site of contention only RTT after that occurrence

Credit-based flow control
The sender keeps track of the available buffer slots of the receiver
– The number of available slots is called credits; the available credits are stored in a credit counter
If #credits > 0 the sender is allowed to send a new word
– Credits are decremented by 1 for each transmitted word
When a buffer slot is freed at the receive side, the sender is notified to increase the credit count
– In the example, the credit-update signal is registered first at the receive side
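The sender-side bookkeeping is just a counter; a minimal behavioral sketch (illustrative names, not a full link model):

```python
class CreditCounter:
    """Sender-side credit state for one link."""

    def __init__(self, receiver_buffer_slots):
        # Initialized to the receiver's buffer depth: one credit per free slot.
        self.credits = receiver_buffer_slots

    def try_send(self):
        """Send a word only if a downstream slot is guaranteed to be free."""
        if self.credits == 0:
            return False          # must stall: no free slot is known
        self.credits -= 1         # the slot is now presumed occupied
        return True

    def credit_update(self):
        """Called when the receiver signals that a buffer slot was freed."""
        self.credits += 1
```

Because the sender never transmits without a credit, overflow is impossible by construction, for any buffer size greater than zero.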

Credit-based flow control: Example
– In the timeline, a * marks a cycle where the credit counter is incremented and decremented in the same cycle (it was and stays at 0)

Credit-based flow control: Buffers and throughput

Condition for 100% throughput
The registers that the data and the credits pass through define the credit loop
– 100% throughput is guaranteed only when the number of available buffer slots at the receive side equals the number of registers in the credit loop
Changing the available number of credits can reconfigure the maximum throughput at runtime
– Credit-based flow control is lossless with any buffer size > 0
– Stall&go flow control requires at least one loop of extra buffer space compared to credit-based flow control
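The condition can be checked with a small cycle-based simulation. This is a sketch under simplifying assumptions: the receiver drains each word immediately, and a credit takes `loop_cycles` cycles (the registers of the combined data+credit loop) to come back to the sender.

```python
def credit_loop_throughput(buffer_slots, loop_cycles, sim_cycles=1000):
    """Fraction of cycles in which a word is sent, given the receiver's
    buffer depth and the register count of the credit loop."""
    credits = buffer_slots
    returning = [0] * loop_cycles      # credits in flight back to the sender
    sent = 0
    for _ in range(sim_cycles):
        credits += returning.pop(0)    # a credit arrives after loop_cycles
        if credits > 0:
            credits -= 1
            returning.append(1)        # word sent; its credit starts back
            sent += 1
        else:
            returning.append(0)        # no credits: the link stays idle
    return sent / sim_cycles
```

With `buffer_slots` equal to the loop length the link sustains 100% throughput; with fewer slots the steady-state throughput degrades to `buffer_slots / loop_cycles`, which is exactly the runtime-reconfiguration knob mentioned above.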

Link-level flow control enhancements
Reservation-based flow control
– Separates control and data functions
– Control links race ahead of the data to reserve resources
– When data words arrive, they can proceed with little overhead
Speculative flow control
– The sender can transmit cells even without sufficient credits; speculative transmissions occur when no other word with available credits is eligible for transmission
– The receiver drops an incoming cell if its buffer is full; for every dropped word a NACK is returned to the sender, and each cell remains stored at the sender until it is positively acknowledged
– Each cell may be speculatively transmitted at most once; all retransmissions must be performed when credits are available
– The sender consumes credits for every cell sent, i.e., for speculative as well as credited transmissions

Send a large message (packet)
Send a long packet of 1 Kbit over a 32-bit-wide channel
– Serialize the message into 32 words of 32 bits
– Need 32 cycles for packet transmission
Each packet is transmitted word by word
– When the output port is free, send the next word immediately
– Old-fashioned store-and-forward required the entire packet to reach each node before initiating the next transmission
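The serialization step itself is trivial; a sketch working on a byte-level payload (names illustrative):

```python
def serialize(payload: bytes, channel_bytes: int):
    """Split a packet payload into channel-wide words, head first."""
    return [payload[i:i + channel_bytes]
            for i in range(0, len(payload), channel_bytes)]

packet = bytes(128)            # 1 Kbit = 128 bytes of payload
words = serialize(packet, 4)   # 32-bit channel -> 4 bytes per word
```

Each element of `words` then occupies the channel for one cycle, so the transmission takes as many cycles as there are words.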

Buffer allocation policies
Each transmitted word needs a free downstream buffer slot
– When the output of the downstream node is blocked, the buffer holds the arriving words
How much free buffering is guaranteed before sending the first word of a packet?
– Virtual cut-through (VCT): the available buffer slots equal the words of the packet; each blocked packet stays together and consumes the buffers of only one node
– Wormhole: just a few slots are enough; a blocked packet inevitably occupies the buffers of more than one node
In both cases nothing is lost, thanks to the flow-control backpressure policy

VCT and wormhole in graphics

Link sharing
The number of wires of the link does not increase
– One word can be sent on each clock cycle
– The channel should be shared: a multiplexer is needed at the output port of the sender
Each packet is sent uninterrupted
– Wormhole and VCT behave this way
– The connection is locked for a packet until its tail passes the output port

Who drives the select signals?
The arbiter is responsible for selecting which packet will gain access to the output channel
– A word is sent only if buffer slots are available downstream
It receives requests from the inputs and grants only one of them
– Decisions are based on some internal priority state

Arbitration for wormhole and VCT
In wormhole and VCT the words of a packet are not interleaved with the words of other packets
Arbitration is performed once per packet and the decision is locked at the output for the whole packet duration
Even if a packet is blocked downstream, the connection does not change until the tail of the packet leaves the output port
– Buffer utilization is managed by the flow-control mechanism

How can I place my buffers?

Let’s add some complexity: Networks
A network of terminal nodes
– Each node can be a source or a sink
Multiple point-to-point links connected with switches
Parallel communication between components

Complexity affects the switches
Multiple input–output permutations should be supported
Contention should be resolved and non-winning inputs should be handled
– Buffered locally
– Deflected to the network
Separate flow control for each link
Each packet needs to know/compute the path to its destination

How are the terminal nodes connected to the switch?
More than one terminal node can connect per switch
– Concentration is good for bursty traffic
– A local switch isolates local traffic from the main network

Switch design: IO interface
– Separate flow control per link

Switch design: One output port
Let’s reuse the circuit we already have for one output port
– The arbiter is driven by per-output requests

Switch design: Input buffers
Move the buffers to the inputs
– Figure: the data from input #1 generate requests for output #0

Switch design: Complete output ports
How are the output requests computed?

Routing computation
Routing computation generates per-output requests
– The header of the packet carries the requests for each intermediate node (source routing)
– The requests are computed/retrieved based on the packet’s destination (distributed routing)

Routing logic
Routing logic translates a global destination address to a local output-port request
– For example: to reach node X from node Y, use output port #2 of Y
A lookup table is enough for holding the request vector that corresponds to each destination
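For distributed routing, the lookup table simply maps a destination to a one-hot request vector. A sketch with a hypothetical node naming and port numbering (both are assumptions for illustration):

```python
NUM_PORTS = 4

# Hypothetical table held at node Y: destination node -> local output port.
ROUTE_LUT = {"X": 2, "Z": 0, "W": 1}

def request_vector(destination):
    """Translate a global destination into a one-hot output-port request."""
    vec = [0] * NUM_PORTS
    vec[ROUTE_LUT[destination]] = 1   # assert exactly one per-output request
    return vec
```

In source routing the same vector would instead be peeled off the packet header at each hop.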

Switch building blocks

Running example of switch operation
Switches transfer packets
Packets are broken into flits
– Only the head flit knows the packet’s destination
The number of wires of each link equals the bits of each flit

Buffer access
Buffer incoming packets per link
Read the destination of the head of each queue

Routing computation / request generation
Compute the output requests and drive the output arbiters

Arbitration – multiplexer path setup
Arbitrate per output
The grant signals
– Drive the output multiplexers
– Notify the inputs about the arbitration outcome

Switch traversal
The words marked H (head flits) will leave the switch on the next clock edge, provided they have at least one credit

Link traversal
Words going to a non-blocked output leave the switch
The grants of a blocked output (due to flow control) are lost
– An output arbiter can also stall in case of a blocked output

Head-of-line blocking: performance limiter
The FIFO order of the input buffers limits the throughput of the switch
– A flit is blocked behind the head-of-line flit that lost arbitration
– Essentially a memory-throughput problem

Wormhole switch operation
The operations can fit in the same cycle or they can be pipelined
– Extra registers are needed in the control path
– Registers in the input/output ports are already present
– LT at the end involves a register write
Body/tail flits inherit the decisions taken by the head flit

Look-ahead routing
Routing computation is based only on the packet’s destination
– It can be performed in switch A and used in switch B
– This is look-ahead routing computation (LRC)

Look-ahead routing
The LRC is performed in parallel with SA
LRC should complete before the ST stage of the same switch
– The head flit needs the output-port requests for the next switch

Look-ahead routing details
The head flit of each packet carries the output-port requests for the next switch together with the destination address

Low-latency organizations
Baseline
– SA precedes ST (no speculation)
SA decoupled from ST
– Predict or speculate the arbiter’s decisions
– When the prediction is wrong, replay all the tasks (same as baseline)
Do in different phases: circuit switching
– Arbitration and routing at the setup phase
– At transmit, only ST is needed since contention is already resolved
Bypass switches
– Reduce latency under certain criteria
– When bypass is not enabled, same as baseline

Prediction-based ST: Hit
Idle state: output port X+ is selected and reserved by the predictor; the crossbar is reserved
1st cycle: the incoming flit is transferred to X+ without RC and SA; in parallel, RC is performed – the prediction is correct!
2nd cycle: the next flit is transferred to X+ without RC and SA

Prediction-based ST: Miss
Idle state: output port X+ is selected and reserved
1st cycle: the incoming flit is transferred to X+ without RC and SA; in parallel, RC is performed – the prediction is wrong! (X- is correct); the kill signal to X+ is asserted
2nd/3rd cycle: the dead flit is removed and the flit is retransmitted to the correct port – the tasks are replayed as in the baseline case

Speculative ST
Assume contention doesn’t happen
– If correct, the flit is transferred directly to the output port without waiting for SA
– In case of contention, replay SA
A cycle is wasted in the event of contention
– The arbiter decides what will be sent on the next cycle

XOR-based ST
Assume contention never happens
– If correct, the flit is transferred directly to the output port
– If not, bitwise-XOR all the competing flits and send the encoded result on the link
– At the same time, arbitrate and mask (set to 0) the winning input
– Repeat on the next cycle
In the case of contention, the encoded outputs are resolved at the receiver
– This can also be done at the output port of the switch

XOR-based ST: Flit recovery
Works upon a simple XOR property
– (A^B^C) ^ (B^C) = A
– Always able to decode by XORing two sequential values
Performs similarly to speculative switches
– Only head-flit collisions matter
– Maintains the previous router’s arbitration order
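The recovery property is easy to check numerically (the flit values below are arbitrary examples):

```python
# Three flits colliding at a switch output (example 8-bit values).
A, B, C = 0b1010_0001, 0b0110_1100, 0b1111_0000

cycle1 = A ^ B ^ C   # all three collide: the encoded word goes on the link
cycle2 = B ^ C       # A won arbitration and was masked; B and C collide again

# XORing two sequential coded values cancels the common terms:
recovered = cycle1 ^ cycle2
assert recovered == A
```

Since XOR is its own inverse, the receiver only needs the stream of coded values, in the same order the arbiter resolved the collision, to peel off one winning flit per cycle.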

Bypassing intermediate nodes
Switch bypassing criteria:
– Frequently used paths
– Packets continually moving along the same dimension
Most techniques can bypass some pipeline stages only for specific packet transfers and traffic patterns
– Not generic enough
Figure: virtual bypassing paths between SRC and DST – bypassed switches take 1 cycle instead of 3

Circuit switching
Network traversal is done in phases
Path reservation (multiple switch allocations) is done all at once
Switch traversal finds no contention
– Data buffers are avoided
Part of the reserved but unutilized path is needlessly blocked

Speculation-free low-latency switches
Prediction and speculation drawbacks
– On a mis-prediction (mis-speculation) the tasks must be replayed
– Latency is not always saved; it depends on network conditions
Merged switch allocation and traversal (SAT)
– Latency is always saved – no speculation
– The delay of SAT is smaller than SA and ST in series

Arbitration and multiplexing
Stop thinking of arbitration and multiplexing separately
One new algorithm that fits every policy
– A generic priority-based solution that works even when arbitration and multiplexing are done separately

Round-robin arbitration
– Most commonly used
– Start from the high-priority position and grant the first active request found while cyclically searching all requests
– The granted input becomes the lowest-priority one for the next arbitration
Cyclic search is found in many other algorithms
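A behavioral sketch of that cyclic search (this models the policy, not the hardware structure):

```python
def round_robin(requests, hp):
    """Grant the first active request at or cyclically after index hp.
    Returns (grant index or None, next high-priority index)."""
    n = len(requests)
    for offset in range(n):
        i = (hp + offset) % n
        if requests[i]:
            return i, (i + 1) % n   # granted input becomes lowest priority
    return None, hp                 # no requests: priority state unchanged
```

Each call performs one arbitration; feeding the returned priority back in gives the rotating fairness the slide describes.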

Let’s think out of the box
Transform each request and priority bit into a 2-bit unsigned arithmetic symbol
– The request is the MSB
Round-robin arbitration is equivalent to finding the maximum symbol that lies in the rightmost position
– The cyclic search disappears

Working examples
Maximum selection is done via a tree structure
– The rightmost maximum symbol always wins
Direction flags (L, R) always point to the direction of the winning input
– The direction flags form the path to the winning input
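A behavioral sketch of the idea: the priority bit is taken as a thermometer code that is 1 at and after the high-priority position, and the grant is the first maximum 2-bit symbol (a plain linear `max` here stands in for the hardware MAX tree; the index convention is illustrative, while the slides count from the other side):

```python
def rr_via_max(requests, hp):
    """Round-robin arbitration recast as maximum selection.
    Per input: symbol = {request, priority}, priority = 1 iff index >= hp."""
    symbols = [2 * r + (1 if i >= hp else 0) for i, r in enumerate(requests)]
    best = max(symbols)
    if best < 2:                 # request bit (MSB) set nowhere
        return None
    return symbols.index(best)   # first maximum = first request in hp region
```

Because a request inside the high-priority region encodes as 3 and outside it as 2, picking the first maximum reproduces the cyclic search without ever wrapping around.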

Why not switch data in parallel?

Grant signals produced simultaneously
Each MAX node keeps a direction flag F
– When F=0 the maximum came from the right
– When F=1 the maximum came from the left
One-hot, thermometer, or weighted-binary grant signals can be derived from the tree of MAX nodes

Wormhole/VCT MARX-based switches

SRAM-based input buffers
Buffer reads and writes are treated as separate tasks
– A buffer write always occurs after link traversal
A separate read and a separate write port are required for maximum performance

Speculative buffer read
A buffer read occurs after SA for head flits (no speculation)
A buffer read can occur in parallel with SA (speculation)
– The HOL head flit is read out before knowing if it received a grant
– Once SA has finished, speculation is removed for the remaining flits

Pipelining and credits
The credit loop begins from the upstream SA stage
Deep pipelining increases the buffering requirements for 100% throughput
– Elastic pipeline stages that can stall independently can partially alleviate the problem

Bufferless switching
Assume there are no buffers
– When a packet loses switch allocation it is either dropped or deflected to any free output
Deflection spreads contention in space (in the network)
– Allocation solves contention at each time slot but spreads it in time (to the next time slots)
Deflection (misrouting) can occur in buffered switches too
– e.g., the Rotary router

Slicing
Introduces hierarchy inside the switch
– Variants: dimension slicing, port slicing
When traffic is concentrated on certain outputs the switch suffers high performance penalties
– Intermediate buffers partially alleviate the loss

How can we increase throughput?
The green flow is blocked until the red one passes the switch, leaving the physical channel idle

Virtual channels
Decouple output-port allocation from next-hop buffer allocation
Contention is present on:
– Output links (crossbar output ports)
– Input ports of the crossbar
Contention is resolved by time-sharing the resources
– Words of two packets are mixed on the same channel
– The words travel on different virtual channels
– Separate buffers at the end of the link guarantee no interference between the packets

Virtual channels
Virtual-channel support does not mean extra links
– They act as extra street lanes; traffic on each lane is time-shared on a common channel
Provide dedicated buffer space for each virtual channel
– Decouple channels from buffers
– Interleave flits from different packets
“The Swiss Army knife for interconnection networks”
– Prevent deadlocks
– Reduce head-of-line blocking
– Provide QoS

Datapath of a VC-based switch
Separate buffer for each VC
Separate flow-control signals (credits) for each VC
The radix of the crossbar can stay the same
– Input VCs can share a common input port of the crossbar
– On each cycle at most one VC will receive a new word

Per-packet operation of a VC-based switch
A switch connects input VCs to output VCs
Routing computation (RC) determines the output port
– It may restrict which output VCs can be used
An input VC must first allocate an output VC
– Allocation is performed by the VC allocator (VA)
RC and VA are done per packet on the head flit, and the results are inherited by the rest of the packet’s flits

Per-flit operation of a VC-based switch
Flits with an allocated output VC fight for an output port
– The output port is allocated by the switch allocator
– The VCs of the same input share a common input port of the crossbar
– Each input can have multiple requests (up to the number of input VCs)
A flit leaves the switch provided that credits are available downstream
– Credits are counted per output VC

Switch allocation
All VCs at a given input port share one crossbar input port
The switch allocator matches ready-to-go flits with crossbar time slots
– Allocation is performed on a cycle-by-cycle basis
– N×V requests (input VCs), N resources (output ports)
– At most one flit at each input port can be granted
– At most one flit at each output port can leave
Other options need more crossbar ports (input/output speedup)

Switch allocation example
One request (arc) for each input VC
Example with 2 VCs per input
– At most 2 arcs leave each input of the bipartite graph
– At most 2 requests per row in the request matrix
Matching:
– Each grant must satisfy a request
– Each requester gets at most one grant
– Each resource is granted at most once

Separable allocation
Matchings have at most one grant per row and per column
Two phases of arbitration
– Column-wise and row-wise (input-first or output-first), performed in either order
– The arbiters in each stage are independent, but the outcome of each one affects the quality of the overall match
Fast and cheap
Bad choices in the first phase can prevent the second stage from generating a good matching
– Multiple iterations are required for a good match
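An input-first separable allocator can be sketched with fixed-priority arbiters in both phases (round-robin priority state is omitted for brevity; names are illustrative):

```python
def separable_input_first(requests):
    """requests[i] = list of output ports requested by input i.
    Phase 1 (input arbiters): each input picks one of its requests.
    Phase 2 (output arbiters): each output grants one requesting input."""
    picked = [outs[0] if outs else None for outs in requests]  # fixed priority
    grants = {}                                                # output -> input
    for i, out in enumerate(picked):
        if out is not None and out not in grants:              # fixed priority
            grants[out] = i
    return grants
```

With `requests = [[0], [0, 1]]` both inputs pick output 0 in the first phase, so only input 0 is matched even though input 1 could have used output 1 – exactly the first-phase “bad choice” that iteration (or running several allocators in parallel) mitigates.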

Implementation
(Figure: circuits for output-first and input-first allocation)

Multi-cycle separable allocators
Allocators produce better results if they run for many cycles
– On each cycle they try to extend the input-output match with new edges
We don't have the time to wait more than one cycle
– Run two or more allocators in parallel and interleave their grants to the switch

Centralized allocator
Wavefront allocation
– Pick an initial diagonal
– Grant all requests on the diagonal: they never conflict, since no two cells of a diagonal share a row or column
– For each grant, delete the requests in the same row and column
– Repeat for the next diagonal
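The diagonal sweep can be modeled directly in software. A minimal sketch, assuming a square N×N request matrix and a fixed starting diagonal (a real wavefront allocator rotates the priority diagonal each cycle for fairness):

```python
def wavefront_allocate(req):
    """req is an NxN boolean request matrix.
    Returns a list of (input, output) grants with at most one grant
    per row and per column."""
    n = len(req)
    row_free = [True] * n
    col_free = [True] * n
    grants = []
    for d in range(n):  # sweep the n diagonals, starting from diagonal 0
        for i in range(n):
            o = (i + d) % n  # cells of one diagonal: no shared row/column
            if req[i][o] and row_free[i] and col_free[o]:
                grants.append((i, o))
                # "Delete" requests in the same row and column.
                row_free[i] = False
                col_free[o] = False
    return grants
```

Because each diagonal's cells are conflict-free by construction, all of them can be granted in parallel in hardware; the sequential loop here only mimics that sweep.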

Switch allocation for adaptive routing
Input VCs can request more than one output port
– Called the set of Admissible Output Ports (AOP)
– This adds an extra selection step (not arbitration)
– Selection mostly tries to load-balance the traffic
Input-first allocation
– For each input VC, select one request of the AOP
– Arbitrate locally per input and select one input VC
– Arbitrate globally per output and select one VC among all competing inputs
Output-first allocation
– Send all requests of the AOP of each input VC to the outputs
– Arbitrate globally per output and grant one request
– Arbitrate locally per input and grant an input VC
– For this input VC, select one of the possibly multiple grants of its AOP set
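The selection step differs from arbitration in that any admissible choice is legal; the policy only steers load balance. A hypothetical sketch (the function name and the per-output load estimate are assumptions for illustration, not from the slides):

```python
def select_from_aop(aop, load):
    """Selection step for adaptive routing: pick one port from the set of
    Admissible Output Ports (AOP). Here the policy is 'least loaded',
    where load[o] is some congestion estimate for output port o
    (e.g. downstream buffer occupancy)."""
    return min(aop, key=lambda o: load[o])
```

After this selection, each input VC holds exactly one request, and the allocation proceeds as in the ordinary (deterministic-routing) separable allocator.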

VC allocation
Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels)
Before a packet can proceed through the router, it must claim ownership of a VC buffer at the next router
– The VC is acquired by the head flit and inherited by the body and tail flits
The VC allocator assigns waiting packets at the inputs to output VC buffers that are not currently in use
– N×V inputs (input VCs), N×V outputs (output VCs)
– Once assigned, the VC is used for the entire packet's duration in the switch

VC allocation example
Each input VC can be matched to an output VC simultaneously with the rest
– Even if it belongs to the same input
– No port constraint as in switch allocators
VC allocation refers to allocating a buffer id (output VC) on the next router
– Allocation can be either separable (2 arbitration steps) or centralized
(Figure: requests and grants between the input VCs of In#0-In#2 and the output VCs of Out#0-Out#2)

Input-output VC mapping
Any-to-any flexibility in the VC allocator is unnecessary
– Partition the set of VCs to restrict legal requests
Different use cases for VCs restrict the possible transitions:
– Message class never changes
– Resource classes are traversed in order
– VCs within a packet class are functionally equivalent
These properties can be exploited to reduce VC allocator complexity!

VA single-cycle or pipelined organization
Head flits see longer latency than body/tail flits
– RC and VA decisions are taken for head flits and inherited by the rest of the packet
– Every flit fights for SA
Can we parallelize SA and VA?

The order of VC and switch allocation
VA first, SA follows
– Only packets with an allocated output VC fight for SA
VA and SA can be performed concurrently:
– Speculate that waiting packets will successfully acquire a VC
– Prioritize non-speculative requests over speculative ones
– Speculation holds only for the head flits (the body/tail flits always know their output VC)

VA   | SA   | Description
Win  | Win  | Everything OK! The flit leaves the switch
Win  | Lose | A VC was allocated; retry SA (no longer speculative, so high priority next cycle)
Lose | Win  | The output VC is unknown; the output-port grant is lost (output stays idle)
Lose | Lose | Retry both VA and SA

Speculative switch allocation
Perform switch allocation in parallel with VC allocation
– Speculate that the latter will be successful
– If so, this saves delay; otherwise, try again
– Reduces zero-load latency, but adds complexity
Prioritize non-speculative requests
– Avoids performance degradation due to mis-speculation
Usually implemented through a secondary switch allocator
– But non-speculative grants must still be prioritized

Free list of VCs per output
A VC can be assigned non-speculatively after SA
A free list of output VCs exists at each output
– The flit that was granted access to this output receives the first free VC before leaving the switch
– If no VC is available, the output-port allocation slot is missed and the flit retries switch allocation
VCs are not unnecessarily occupied by flits that don't win SA
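The per-output free list above amounts to a small queue of VC identifiers consulted after switch allocation. A minimal sketch, with invented names:

```python
from collections import deque

class OutputVCFreeList:
    """Per-output free list of VC ids, consulted after switch allocation."""

    def __init__(self, num_vcs):
        # All output VCs of this port start out free.
        self.free = deque(range(num_vcs))

    def assign(self):
        # Called for the flit that won SA at this output.
        # Returns a VC id, or None: the slot is missed and the flit retries SA.
        return self.free.popleft() if self.free else None

    def release(self, vc):
        # The tail flit has drained from the downstream VC buffer.
        self.free.append(vc)
```

Since a VC id is handed out only to a flit that has already won switch allocation, no VC is ever held by a flit that then fails SA, which is exactly the benefit claimed above.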

VC buffer implementation
(Figure: static partitioning vs. dynamic partitioning; linked-list shared buffer implementation)

VC-based switches with MARX units
Merged switch allocation and traversal can be applied to VC-based switches too
– VA can be run before or in parallel with SAT

VC-based switches with MARX units: Datapath
(Figure: datapath of a VC-based switch with MARX units)

NoC: The science & art of on-chip connections
(Figure: Network-on-Chip at the intersection of micro-architecture and circuits)
Advertisement: "Micro-architecture of Network-on-Chip Routers", Giorgos Dimitrakopoulos, Springer, mid 2013

References (1)
- W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks", DAC 2001
- A. Kumar, et al., "A 4.6Tbit/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS", ICCD 2007
- A. Kumar, et al., "Express virtual channels: Towards the ideal interconnection fabric", ISCA 2007
- H. Matsutani, et al., "Prediction router: A low-latency on-chip router architecture with multiple predictors", IEEE Trans. Computers, 2011
- G. Michelogiannakis, J. Balfour, and W. Dally, "Elastic buffer flow control for on-chip networks", HPCA 2009
- M. Hayenga and M. Lipasti, "The NoX router", MICRO 2011
- T. Moscibroda and O. Mutlu, "A case for bufferless routing in on-chip networks", ISCA 2009
- R. Mullins, A. West, and S. Moore, "Low-latency virtual-channel routers for on-chip networks", ISCA 2004
- L.-S. Peh and W. J. Dally, "A delay model and speculative architecture for pipelined routers", HPCA 2001
- D. Wentzlaff, et al., "On-chip interconnection architecture of the Tile processor", IEEE Micro, 2007
- Y. J. Yoon, et al., "Virtual channels vs. multiple physical networks", DAC 2010
- M. Azimi, et al., "Flexible and adaptive on-chip interconnect for terascale architectures", Intel Technology Journal
- A. Golander, et al., "A cost-efficient L1-L2 multicore interconnect: Performance, power, and area considerations", IEEE TCAS-I
- P. Kumar, et al., "Exploring concentration and channel slicing in on-chip network router", HPCA 2009
- M. Galles, "Spider: A high-speed network interconnect", IEEE Micro
- A. S. Vaidya, et al., "LAPSES: A recipe for high performance adaptive router design", HPCA
- C. Batten, Interconnection Networks course
- M. Katevenis, Packet Switch Architectures course, University of Crete, Greece
- W. J. Dally, "Virtual-channel flow control", ISCA
- D. U. Becker and W. J. Dally, "Allocator implementations for network-on-chip routers", SC
- S. S. Mukherjee, et al., "A comparative study of arbitration algorithms for the Alpha pipelined router", ASPLOS
- Y. Tamir and H.-C. Chi, "Symmetric crossbar arbiters for VLSI communication switches", IEEE Trans. on Parallel and Distributed Systems
- J. Hurt, et al., "Design and implementation of high-speed symmetric crossbar schedulers", ICC 1999
- G. Ascia, et al., "Implementation and analysis of a new selection strategy for adaptive routing in networks-on-chip", IEEE Trans. on Computers
- P. Salihundam, et al., "A 2Tb/s 6x4 mesh network with DVFS and 2.3Tb/s/W router in 45nm CMOS", Symp. VLSI Circuits
- P. Gupta and N. McKeown, "Design and implementation of a fast crossbar scheduler", IEEE Micro
- J. Flich and D. Bertozzi (editors), "Network on Chip in the Nanoscale Era", CRC Press, 2010

References (2)
- L. Pirvu, et al., "The impact of link arbitration on switch performance", HPCA
- M. Coppola, et al., "Spidergon: A novel on-chip communication network", IEEE SOC
- W. Dally and C. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks", IEEE Trans. on Computers, 1987
- M. Karol, "Input vs. output queueing on a space-division packet switch", IEEE Transactions on Communications
- Zhonghai Lu, et al., "Evaluation of on-chip networks using deflection routing", GLSVLSI
- Zhonghai Lu, et al., "Layered switching for networks on chip", DAC 2007
- R. Ginosar, "Metastability and synchronizers: A tutorial", IEEE Design & Test, Sept/Oct
- G. Dimitrakopoulos and D. Bertozzi, "Switch architecture", in J. Flich and D. Bertozzi (editors), "Network on Chip in the Nanoscale Era", CRC Press, 2010
- G. Dimitrakopoulos, "Logic-level design of basic switch components", in J. Flich and D. Bertozzi (editors), "Network on Chip in the Nanoscale Era", CRC Press, 2010
- G. Dimitrakopoulos and E. Kalligeros, "Dynamic-priority arbiter and multiplexer soft macros for on-chip network switches", DATE 2012
- G. Dimitrakopoulos, E. Kalligeros, and K. Galanopoulos, "Merged switch allocation and traversal in network-on-chip switches", to appear in IEEE Transactions on Computers (available at IEEE Xplore preprints)
- Se-Joong Lee, et al., "Packet-switched on-chip interconnection network for system-on-chip applications", IEEE TCAS-II
- Donghyun Kim, et al., "A reconfigurable crossbar switch with adaptive bandwidth control for networks-on-chip", ISCAS 2005
- Anh Tran and Bevan Baas, "RoShaQ: High-performance on-chip router with shared queues", ICCD 2011
- Anh Tran, et al., "A reconfigurable source-synchronous on-chip network for GALS many-core platforms", IEEE Trans. on CAD
- W. Dally and B. Towles, "Interconnection Networks", Morgan Kaufmann, 2004
- C. A. Nicopoulos, "ViChaR: A dynamic virtual channel regulator for network-on-chip routers", MICRO
- Clive Maxfield, "2D vs. 2.5D vs. 3D ICs 101", EE Times, Design 2012
- Mike Santarini, "2.5D ICs are more than a stepping stone to 3D ICs", EE Times, Design 2012
- Nathan Binkert, et al., "The role of optics in future high radix switch design", ISCA 2011
- Eylon Caspi, "Design Automation for Streaming Systems", PhD thesis, Berkeley, 2005
- C. Minkenberg and M. Gusat, "Design and performance of speculative flow control for high-radix datacenter interconnect switches", JPDC 2009
- L.-S. Peh and W. J. Dally, "Flit-reservation flow control", HPCA 1999
- M. Gerla and L. Kleinrock, "Flow control: A comparative survey", IEEE Transactions on Communications