CHIPPER: A Low-complexity Bufferless Deflection Router
Chris Fallin, Chris Craik, Onur Mutlu

Motivation
- In many-core chips, the on-chip interconnect (NoC) consumes significant power.
  - Intel Terascale: ~30%; MIT RAW: ~40% of system power
- The NoC must maintain low latency and good throughput: it is on the critical path for cache misses.
(Diagram: a tile with Core, L1, L2 slice, and Router.)

Motivation
- Recent work has proposed bufferless deflection routing (BLESS [Moscibroda, ISCA 2009]):
  - Energy savings: ~40% in total NoC energy
  - Area reduction: ~40% in total NoC area
  - Minimal performance loss: ~4% on average
- Unfortunately, it leaves unaddressed complexities in the router: a long critical path and large reassembly buffers.
- Goal: obtain these benefits while simplifying the router, in order to make bufferless NoCs practical.

Bufferless Deflection Routing
- Key idea: packets are never buffered in the network. When two packets contend for the same link, one is deflected.
- No buffers → lower power, smaller area
- Conceptually simple
- New traffic can be injected whenever there is a free output link.
(Diagram: contending flits deflected around each other on the way to their destination.)

Problems that Bufferless Routers Must Solve
1. Must provide livelock freedom: a packet should not be deflected forever.
2. Must reassemble packets upon arrival.
- Flit: atomic routing unit
- Packet: one or multiple flits

A Bufferless Router: A High-Level View
(Diagram: a local node attached to the router; injected flits pass through the deflection routing logic and crossbar, and ejected flits go to reassembly buffers. Problem 1, livelock freedom, lives in the routing logic; Problem 2, packet reassembly, lives at ejection.)

Complexity in Bufferless Deflection Routers
1. Must provide livelock freedom
   - Flits are sorted by age, then assigned to output ports in age order
   - → 43% longer critical path than a buffered router
2. Must reassemble packets upon arrival
   - Reassembly buffers must be sized for the worst case
   - → 4KB per node (8x8 mesh, 64-byte cache block)
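Worked out from the figures above: an 8x8 mesh has 64 potential senders, and each may have one 64-byte cache block in flight to the same receiver, so worst-case reassembly storage is 64 × 64 B = 4096 B = 4 KB per node.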

Problem 1: Livelock Freedom
(Diagram: the same high-level router, highlighting the deflection routing logic, where livelock freedom must be guaranteed.)

Livelock Freedom in Previous Work
- What stops a flit from deflecting forever?
- All flits are timestamped, and the oldest flits are assigned their desired ports.
- Flit age forms a total order among flits → guaranteed progress: new traffic enters at lowest priority, and every flit eventually becomes the oldest.
- But what is the cost of this total order?

Age-Based Priorities Are Expensive: Sorting
- The router must sort flits by age: a long-latency sort network.
- Three comparator stages are needed for 4 flits.
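A minimal Python sketch of a three-stage age sort for four flits, to make the comparator-stage count concrete. The flit representation (dicts with an "age" field) and the compare_swap helper are illustrative assumptions, not the BLESS hardware.

```python
def compare_swap(a, b):
    """Return (older, younger) of two flits; an empty slot (None) sorts last."""
    if a is None:
        return b, a
    if b is None:
        return a, b
    return (a, b) if a["age"] >= b["age"] else (b, a)

def age_sort4(f0, f1, f2, f3):
    """Sort four flit slots oldest-first using three comparator stages."""
    # Stage 1: compare (f0, f1) and (f2, f3) in parallel.
    a, b = compare_swap(f0, f1)
    c, d = compare_swap(f2, f3)
    # Stage 2: find the oldest overall and the youngest overall.
    w, x = compare_swap(a, c)
    y, z = compare_swap(b, d)
    # Stage 3: resolve the middle pair.
    m, n = compare_swap(x, y)
    return [w, m, n, z]   # oldest ... youngest
```

Each stage can run its comparators in parallel, but the three stages are serial, which is where the latency comes from.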

Age-Based Priorities Are Expensive: Allocation
- After sorting, flits are assigned to output ports in priority order.
- The port assignment of younger flits depends on that of older flits → a sequential dependence in the port allocator.
- Example (age-ordered flits): GRANT Flit 1 → East; DEFLECT Flit 2 → North; GRANT Flit 3 → South; DEFLECT Flit 4 → West.
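A sketch of this sequential port allocation in Python. The port names and the "desired" field are illustrative, and which free port a deflected flit receives is arbitrary here; the point is the loop-carried dependence from older to younger flits.

```python
def allocate_ports(age_ordered_flits, ports=("N", "E", "S", "W")):
    """Assign each flit, in age order, its desired port if free; else deflect."""
    free = list(ports)
    assignment = []                          # (flit, granted_port) pairs
    for flit in age_ordered_flits:           # younger flits wait on older ones
        if flit is None:
            continue
        if flit["desired"] in free:
            port = flit["desired"]           # GRANT
        else:
            port = free[0]                   # DEFLECT to some remaining free port
        free.remove(port)
        assignment.append((flit, port))
    return assignment
```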

Age-Based Priorities Are Expensive
- Overall, deflection routing logic based on Oldest-First (priority sort followed by the port allocator) has a 43% longer critical path than a buffered router.
- Question: is there a cheaper way to route while guaranteeing livelock freedom?

Solution: Golden Packet for Livelock Freedom
- What is really necessary for livelock freedom?
- Key insight: no total order is needed. It is enough to:
  1. Pick one flit to prioritize until it arrives
  2. Ensure that any flit is eventually picked
- → A "Golden Flit" partial ordering is sufficient for guaranteed progress.

What Does Golden Flit Routing Require?
- Only the Golden Flit needs to be routed properly.
- First insight: no need for a full sort.
- Second insight: no need for sequential allocation.

Golden Flit Routing with Two Inputs
- Let's route the Golden Flit in a two-input router first.
- Step 1: pick a "winning" flit: the Golden Flit if present, else a random one.
- Step 2: steer the winning flit to its desired output and deflect the other flit.
- → The Golden Flit always routes toward its destination.
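A hedged Python sketch of such a two-input arbiter block. The helper names (is_golden, wants_lower_output) and the random tie-break are assumptions for illustration; only the two-step structure (pick a winner, steer it, deflect the loser) comes from the slide.

```python
import random

def arbitrate2(flit_a, flit_b, is_golden, wants_lower_output):
    """2x2 arbiter block: return (lower_output, upper_output)."""
    # Step 1: pick the winning flit (Golden first, else any valid flit, else random).
    if flit_a is not None and is_golden(flit_a):
        winner, loser = flit_a, flit_b
    elif flit_b is not None and is_golden(flit_b):
        winner, loser = flit_b, flit_a
    elif flit_a is None:
        winner, loser = flit_b, flit_a
    elif flit_b is None:
        winner, loser = flit_a, flit_b
    else:
        winner, loser = random.choice([(flit_a, flit_b), (flit_b, flit_a)])
    if winner is None:
        return None, None                     # both inputs empty
    # Step 2: steer the winner toward its desired output;
    # the other flit takes whatever is left (it may be deflected).
    if wants_lower_output(winner):
        return winner, loser
    return loser, winner
```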

Golden Flit Routing with Four Inputs
- The four-input router is built from two-input blocks.
- Each block makes decisions independently: deflection is a distributed decision.
(Diagram: N/E/S/W inputs routed to N/E/S/W outputs through two stages of two-input blocks.)
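A sketch of composing four inputs from two stages of independent two-input blocks, reusing arbitrate2 from the previous sketch. The exact wiring of CHIPPER's partially-permuting network is not reproduced here; desired_port is an assumed helper returning a flit's preferred output ("N", "E", "S", or "W") at this router.

```python
def route4(n, e, s, w, is_golden, desired_port):
    """Route four input flits with two stages of independent 2x2 blocks."""
    # Stage 1: two independent blocks; each steers its winner toward the
    # stage-2 block that can reach the winner's desired output half.
    a_lo, a_hi = arbitrate2(n, e, is_golden,
                            lambda f: desired_port(f) in ("N", "E"))
    b_lo, b_hi = arbitrate2(s, w, is_golden,
                            lambda f: desired_port(f) in ("N", "E"))
    # Stage 2: one block drives outputs (N, E), the other drives (S, W).
    out_n, out_e = arbitrate2(a_lo, b_lo, is_golden,
                              lambda f: desired_port(f) == "N")
    out_s, out_w = arbitrate2(a_hi, b_hi, is_golden,
                              lambda f: desired_port(f) == "S")
    return {"N": out_n, "E": out_e, "S": out_s, "W": out_w}
```

No block waits on another block's decision, which is the contrast with the sequential port allocator shown earlier.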

Permutation Network Operation
(Diagram: worked example of the two-stage permutation network. At each two-input block the winning flit, the Golden flit if present, either swaps or passes straight through, and the losing flit is deflected. The permutation network replaces both the priority sort and the port allocator.)

Which Packet is Golden?
- We select the Golden Packet so that:
  1. A given packet stays golden long enough to ensure its arrival → the epoch is at least the maximum no-contention latency.
  2. The selection rotates through all possible packet IDs → a static rotation schedule, for simplicity.
- A packet is identified by its header fields (source, destination, request ID); all routers follow the same schedule, e.g. Src 0/Req 0, Src 1, Src 2, Src 3, then Src 0/Req 1, Src 1, Src 2, Src 3, advancing with the cycle count.
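A sketch of a static golden-selection schedule of this kind. The epoch length and ID-space sizes below are assumed constants (the real values depend on network size and the maximum no-contention latency); the point is only that every router derives the same golden (source, request ID) pair from a shared cycle count.

```python
EPOCH_CYCLES = 64    # assumed; must be >= maximum no-contention latency
NUM_SOURCES = 64     # 8x8 mesh
NUM_REQ_IDS = 16     # assumed number of per-source outstanding-request IDs

def golden_id(cycle):
    """All routers compute the same golden (source, request ID) from the cycle count."""
    epoch = cycle // EPOCH_CYCLES
    src = epoch % NUM_SOURCES                  # rotate over sources first...
    req = (epoch // NUM_SOURCES) % NUM_REQ_IDS  # ...then over request IDs
    return src, req

def is_golden(flit, cycle):
    """A flit is golden iff its packet matches the current golden ID."""
    return (flit["src"], flit["req_id"]) == golden_id(cycle)
```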

Permutation Network-based Pipeline
(Diagram: the router pipeline, with an inject/eject stage and reassembly buffers in front of the two-stage permutation network.)

Problem 2: Packet Reassembly
(Diagram: the same pipeline, highlighting the reassembly buffers at ejection.)

Reassembly Buffers are Large
- Worst case: every node sends a packet to one receiver.
- With one packet in flight per node and N sending nodes, the receiver needs O(N) reassembly space.
- Why can't we make reassembly buffers smaller?

Small Reassembly Buffers Cause Deadlock
- What happens when the reassembly buffer is too small?
- Many senders, one receiver: the reassembly buffer fills, so the network cannot eject.
- The network fills, so new traffic cannot be injected.
- But the remaining flits must be injected to make forward progress → deadlock.

Reserve Space to Avoid Deadlock?
- What if every sender asks permission from the receiver before it sends?
  1. Reserve slot
  2. ACK
  3. Send packet
- → Adds additional delay to every request.

Escaping Deadlock with Retransmissions
- Instead, the sender is optimistic: it assumes the receiver's buffer is free.
- If not, the receiver drops the packet and NACKs, and the sender retransmits:
  1. Send (2 flits), 2. Drop, NACK, 3. Other packet completes, 4. Retransmit packet, 5. ACK, 6. Sender frees data.
- → No additional delay in the best case.
- → But: retransmit buffering overhead for all packets, and potentially many retransmits.

Solution: Retransmitting Only Once
- Key idea: retransmit only when space becomes available.
- The receiver drops a packet if its buffer is full, and notes which packet it dropped (e.g., "Pending: Node 0, Req 0").
- When space frees up, the receiver reserves it so the retransmit is guaranteed to succeed.
- The receiver then notifies the sender to retransmit.
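A minimal Python sketch of the receiver side of Retransmit-Once as described above. The class and method names are illustrative; real hardware would track dropped packets in a small table rather than a queue object.

```python
from collections import deque

class RetransmitOnceReceiver:
    def __init__(self, num_slots):
        self.free_slots = num_slots
        self.pending_drops = deque()      # (sender, request_id) of dropped packets

    def on_packet(self, sender, request_id):
        """Accept a packet into a free reassembly slot, or drop it and note the drop."""
        if self.free_slots > 0:
            self.free_slots -= 1          # accept: this slot now reassembles the packet
            return "accept"
        self.pending_drops.append((sender, request_id))
        return "drop"                     # no retransmit yet; wait for space

    def on_slot_freed(self):
        """When a slot frees up, reserve it and ask the oldest dropped sender to retransmit."""
        self.free_slots += 1
        if self.pending_drops:
            sender, request_id = self.pending_drops.popleft()
            self.free_slots -= 1          # reserve the slot so the retransmit cannot fail
            return ("retransmit", sender, request_id)
        return None
```

Because the slot is reserved before the retransmit request is sent, each dropped packet is retransmitted at most once.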

Using MSHRs as Reassembly Buffers
- Replace the dedicated reassembly buffers at the eject stage with the cache miss buffers (MSHRs) already present at each node.
- Using miss buffers for reassembly makes this a truly bufferless network.

CHIPPER: Cheap Interconnect Partially-Permuting Router
- Starting from the baseline bufferless deflection router:
  - Long critical path (1. sort by age, 2. allocate ports sequentially) → replaced by Golden Packet prioritization and the permutation network.
  - Large buffers for the worst case → replaced by Retransmit-Once and cache (miss) buffers.

CHIPPER: Cheap Interconnect Partially-Permuting Router
(Diagram: the final CHIPPER pipeline, with an inject/eject stage backed by miss buffers (MSHRs), followed by the permutation network.)

EVALUATION

Methodology
- Multiprogrammed workloads: CPU2006, server, desktop
  - 8x8 (64 cores), 39 homogeneous and 10 mixed sets
- Multithreaded workloads: SPLASH-2, 16 threads
  - 4x4 (16 cores), 5 applications
- System configuration:
  - Buffered baseline: 2-cycle router, 4 VCs/channel, 8 flits/VC
  - Bufferless baseline: 2-cycle latency, FLIT-BLESS
  - Instruction-trace driven, closed-loop, 128-entry OoO window
  - 64KB L1, perfect L2 (stresses the interconnect), XOR mapping

Methodology
- Hardware modeling
  - Verilog models for CHIPPER, BLESS, and buffered logic, synthesized with a commercial 65nm library
  - ORION for crossbar, buffers, and links
- Power
  - Static and dynamic power from the hardware models
  - Based on event counts in cycle-accurate simulations

Results: Performance Degradation
(Chart: performance degradation relative to the buffered baseline; annotated values include 1.8%, 3.6%, and 49.8%.)
- Minimal loss for low-to-medium-intensity workloads.

Results: Power Reduction
(Chart: network power relative to the buffered baseline; a reduction of 73.4% is annotated.)
- Removing buffers accounts for the majority of the power savings.
- Slight additional savings from BLESS to CHIPPER.

Results: Area and Critical Path Reduction
(Chart: normalized router area and critical path; annotated values include -29.1%, +1.1%, and -1.6%.)
- CHIPPER maintains the area savings of BLESS.
- The critical path becomes competitive with a buffered router.

Conclusions
- Two key issues in bufferless deflection routing: livelock freedom and packet reassembly.
- Prior bufferless deflection routers were high-complexity and impractical:
  - Oldest-First prioritization → long critical path in the router
  - No end-to-end flow control for reassembly → prone to deadlock with reasonably sized reassembly buffers
- CHIPPER is a new, practical bufferless deflection router:
  - Golden Packet prioritization → short critical path in the router
  - Retransmit-Once protocol → deadlock-free packet reassembly
  - Cache miss buffers as reassembly buffers → truly bufferless network
- CHIPPER's frequency is comparable to buffered routers, at much lower area and power cost and with minimal performance loss.

Thank you!

The packet flew quick through the NoC,
Paced by the clock's constant tock.
But the critical path
Soon unleashed its wrath –
And further improvements did block.

CHIPPER: A Low-complexity Bufferless Deflection Router
Chris Fallin, Chris Craik, Onur Mutlu

BACKUP SLIDES

What about High Network Loads?
- Recall, our goal is to enable bufferless deflection routing as a compelling design point; this is orthogonal to the question of spanning the whole load spectrum.
- If performance under very high workload intensity is important, hybrid solutions (e.g., AFC [Jafri10]) can be used to enable buffers selectively. Congestion control might also be used.
- Any system that incorporates bufferless deflection routing at lower load points benefits from CHIPPER's contributions.

What About Injection and Ejection?
- Local access is conceptually separate from in-network routing.
- It is handled in a separate pipeline stage (before the permutation stage):
  - Eject locally-bound flits into the reassembly buffers
  - Inject queued new traffic from the local FIFO, if there is a free slot
- Ejection obeys the priority rules (Golden Packet first).
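A sketch of such an inject/eject stage, assuming one ejection and one injection per cycle. The reassembly.receive call, the flit fields, and the FIFO representation are assumed interfaces for illustration, not CHIPPER's actual implementation.

```python
def inject_eject(in_flits, local_id, inject_fifo, is_golden, reassembly):
    """Eject at most one locally-bound flit (Golden first), then inject from the FIFO."""
    # Ejection: prefer a golden locally-bound flit, then any locally-bound flit.
    local = [i for i, f in enumerate(in_flits)
             if f is not None and f["dest"] == local_id]
    local.sort(key=lambda i: not is_golden(in_flits[i]))   # golden flits first
    if local:
        i = local[0]
        reassembly.receive(in_flits[i])   # hand the flit to the reassembly buffers
        in_flits[i] = None                # its input slot is now free
    # Injection: place the head of the local queue into a free input slot, if any.
    if inject_fifo and None in in_flits:
        in_flits[in_flits.index(None)] = inject_fifo.pop(0)
    return in_flits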

Use MSHRs as Reassembly Buffers
- An MSHR (Miss Status Handling Register) already tracks each outstanding cache miss: status, pending bit, block address (e.g., 0x3C), and a data buffer.
- Reassembly buffering comes "for free" → a truly bufferless NoC!
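A sketch of an MSHR entry doubling as the reassembly buffer for its own miss, under the assumption of a 64-byte block split into 8 flit payloads. The field names follow the labels on the slide, but the layout and sizes are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class MSHREntry:
    pending: bool = False
    block_address: int = 0
    data: bytearray = field(default_factory=lambda: bytearray(64))  # one cache block
    flits_received: int = 0

    def on_flit(self, flit_index, payload, flits_per_packet=8):
        """Write an arriving flit's payload directly into the miss data buffer."""
        size = len(self.data) // flits_per_packet
        self.data[flit_index * size:(flit_index + 1) * size] = payload
        self.flits_received += 1
        return self.flits_received == flits_per_packet   # True when fully reassembled
```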

Golden Packet vs. Golden Flit
- It is actually a Golden Packet, not a Golden Flit.
- Within the Golden Packet, each arbiter breaks ties using flit sequence numbers.
- This case is rare: if a packet is golden when injected, its flits never meet each other. But if it becomes golden later, its flits could be anywhere in the network and might contend.

Sensitivity Studies
- Locality:
  - We assumed simple striping across all cache slices. What if data is more intelligently distributed?
  - 10 mixed multiprogrammed workloads, Buffered → CHIPPER weighted-speedup degradation:
    - 8x8 baseline: 11.6%
    - 4x4 neighborhoods: 6.8%
    - 2x2 neighborhoods: 1.1%
- Golden Packet:
  - Percentage of packets that are Golden: 0.37% (0.41% max)
  - Sensitivity to epoch length (8 – 8192 cycles): 0.89% delta

Retransmit-Once Operation
- Requesters always opportunistically send the first packet.
- A response to a request implies a buffer reservation.
- If the receiver cannot reserve space, it drops the packet and makes a note.
  - When space becomes free later, it reserves the space and requests a retransmit.
  - → Only one retransmit is ever necessary.
- Beyond this point, the reassembly buffer is reserved until the requester releases it.

Retransmit-Once: Example
Example (core to L2): Request → Response → Writeback
  1. Send packet: Request
  2. Drop (buffers full); receiver notes "Pending: Node 0, Req 0"
  3. Other packet completes
  4. Receiver reserves space
  5. Receiver sends Retransmit packet
  6. Sender regenerates the request
  7. Data response
  8. Dirty writeback

Retransmit-Once: Multiple Packets
(Figure-only slides.)

Parameters for Evaluation (figure-only slide)

Full Workload Results (figure-only slide)

Hardware-Cost Results (figure-only slide)

Network-Level Evaluation: Latency (figure-only slide)

Network-Level Evaluation: Deflections (figure-only slide)

Reassembly Buffer Size Sensitivity (figure-only slide)