Efficient Microarchitecture for Network-on-Chip Routers

Presentation transcript:

Efficient Microarchitecture for Network-on-Chip Routers. Daniel U. Becker, Concurrent VLSI Architecture Group. PhD Oral Examination, 8/21/2012.

Outline: INTRODUCTION; Allocator Implementations; Buffer Management; Infrastructure; Conclusions.

Networks-on-Chip: Moore’s Law alive & well. Many cores per chip. Must work together. Networks-on-Chip (NoCs) aim to provide a scalable, efficient communication fabric. [Figure labels: Chip, Core]

Why Does the Network Matter? Performance: latency, throughput, fairness, QoS. Cost: die area, wiring resources, design complexity. Power & energy efficiency. [Harting et al., “Energy and Performance Benefits of Active Messages”]

Optimizing the Network: applications & programming models; communication primitives; topologies & routing; flow control; router microarchitecture; circuit design.

Router Microarchitecture Overview [Figure annotated with Part 1 and Part 2] [Peh and Dally: “A Delay Model for Router Microarchitectures”]

Outline: Introduction; ALLOCATOR IMPLEMENTATIONS; Buffer Management; Infrastructure; Conclusions. [Becker and Dally: “Allocator Implementations for Network-on-Chip Routers,” SC’09]

Allocators: Fundamental part of router control logic. Manage access to network resources. Orchestrate flow of packets through router. Affect network utilization. Potentially affect cycle time.

Virtual Channel Allocation: Virtual channels (VCs) allow multiple packets to be interleaved on physical channels. Similar to lanes on a highway, they allow blocked traffic to be bypassed. Before a packet can use a network channel, it needs to claim ownership of a VC. The VC allocator assigns output VCs to waiting packets.

Sparse VC Allocation [Figure, single input port shown. Labels: IVC, OVC; REQ, REP; NM, MIN. Request counts: 64, 32, 24; P×8, P×4, P×2. VC configurations: 2×2×2 VCs, 2×4 VCs, 8 VCs.]
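The request counts on the sparse VC allocation slide (64, 32, 24) can be reproduced with a short counting sketch. The transition rules used here are an assumption, not something the transcript states: a packet keeps its message class (REQ or REP), and its resource class may stay the same or advance (for example non-minimal to minimal) but never go back. The function name and parameters are hypothetical.

```python
from itertools import product

def count_vc_requests(message_classes, resource_classes, vcs_per_class):
    """Count legal (input VC, output VC) pairs for one input/output port pair."""
    # A VC is identified by (message class, resource class, index within class).
    vcs = list(product(range(message_classes),
                       range(resource_classes),
                       range(vcs_per_class)))

    def legal(ivc, ovc):
        in_msg, in_res, _ = ivc
        out_msg, out_res, _ = ovc
        # Assumed rules: the message class never changes, and the resource
        # class may stay the same or advance (e.g. non-minimal -> minimal).
        return in_msg == out_msg and out_res >= in_res

    return sum(legal(i, o) for i in vcs for o in vcs)

print(count_vc_requests(1, 1, 8))  # 8 VCs, no class structure
print(count_vc_requests(2, 1, 4))  # 2x4 VCs
print(count_vc_requests(2, 2, 2))  # 2x2x2 VCs
```

Under these assumed rules the three calls print 64, 32 and 24, which matches the figures on the slide; the allocator only needs logic for the pairs that can actually occur.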

VC Allocator Delay [Chart; configurations: 5 ports, 2x1 VCs and 5 ports, 2x2 VCs; baseline: canonical design; annotations: -58%, -30%, -40%, -30%]

VC Allocator Area [Chart; configurations: 5 ports, 2x1 VCs and 5 ports, 2x2 VCs; annotations: -78%, 31800, -50%, -78%, -60%]

Switch Allocation: Once a VC is allocated, the packet can be forwarded. Packets are broken down into flits. Each flit must request crossbar access. The switch allocator generates the crossbar schedule. [Figure labels: inputs, outputs] [Enright Jerger and Peh, “On-Chip Networks”]
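To make the idea of a crossbar schedule concrete, here is a minimal sketch of one generic allocator style: a separable, input-first allocator with fixed-priority arbiters. This is an illustration only; the transcript does not say which allocator organization the talk uses for this slide, and the function name and data layout are hypothetical.

```python
def allocate_switch(requests):
    """requests[i] is the set of output ports requested at input port i.

    Returns a dict mapping each winning input port to its granted output port.
    """
    # Stage 1: every input port selects one of its requested outputs.
    # A fixed-priority arbiter (lowest index wins) keeps the sketch short.
    chosen = {i: min(outs) for i, outs in enumerate(requests) if outs}

    # Stage 2: every output port selects one of the inputs that chose it,
    # again with fixed priority (lowest input index wins).
    winner_of_output = {}
    for i in sorted(chosen):
        winner_of_output.setdefault(chosen[i], i)

    return {i: o for o, i in winner_of_output.items()}

# Input 0 wants output 0 or 1, input 1 wants output 0, input 2 wants output 2.
schedule = allocate_switch([{0, 1}, {0}, {2}])
print(schedule)
```

In this example inputs 0 and 2 are granted and input 1 is not, even though a schedule serving all three exists. That gap between a simple allocator and a maximum matching is one way allocation quality affects network utilization.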

Speculative Switch Allocation: Reduce pipeline latency by attempting switch allocation in parallel with VC allocation. Speculate that a VC will be assigned! But mis-speculation wastes crossbar bandwidth. Must prioritize non-speculative requests.

Pessimistic Speculation: Speculation matters most when the network is lightly loaded. At low network load, most requests are granted. Idea: Assume all non-speculative requests will be granted! [Figure labels: non-spec. requests, non-spec. allocator, non-spec. grants, conflict detection, mask, spec. requests, spec. allocator, spec. grants]
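The difference between masking with non-speculative grants and masking with non-speculative requests can be sketched as below. This is an illustrative model rather than the router's logic: it assumes that a "conflict" means sharing an input port or an output port, and the function name and data layout are hypothetical.

```python
def mask_speculative(spec_grants, nonspec_pairs):
    """Drop speculative grants that conflict with the given non-spec. pairs.

    spec_grants: dict mapping input port -> output port (speculative grants).
    nonspec_pairs: iterable of (input port, output port) pairs. Pass the
    non-speculative grants for conventional masking, or the non-speculative
    requests for pessimistic masking.
    """
    nonspec_pairs = list(nonspec_pairs)
    busy_inputs = {i for i, _ in nonspec_pairs}
    busy_outputs = {o for _, o in nonspec_pairs}
    # Assumption: a conflict is any shared input port or shared output port.
    return {i: o for i, o in spec_grants.items()
            if i not in busy_inputs and o not in busy_outputs}

nonspec_requests = [(0, 2), (1, 2)]   # two inputs compete for output 2
nonspec_grants = [(0, 2)]             # input 0 wins
spec_grants = {1: 3, 2: 0}

conventional = mask_speculative(spec_grants, nonspec_grants)
pessimistic = mask_speculative(spec_grants, nonspec_requests)
print(conventional)
print(pessimistic)
```

Pessimistic masking does not have to wait for the non-speculative allocator to finish, so the mask can be computed in parallel with allocation. The price shows up in the example: the speculative grant at input 1 is discarded although input 1 lost its non-speculative request, which is the kind of loss that only matters when many requests compete, that is, near saturation.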

Performance with Speculation [Chart; Mesh, 2 VCs; UR traffic; annotations: <2%, -21% zero-load latency]

Area and Delay Impact [Full router; Mesh, 2 VCs; TSMC 45nm GP]: +16% max. clock freq.; -13% area @ 1.2 GHz; -5% area @ 1 GHz.

Additional Contributions: fast loop-free wavefront allocators; priority-based speculation; practical combined VC and switch allocation. Details in thesis.

Summary: Sparse VC allocation exploits traffic classes to reduce VC allocator complexity. It reduces delay by 30-60% and area by 50-80%, with no change in functionality. Pessimistic speculation reduces overhead for speculative switch allocation. It reduces overall router area by up to 13% and critical path delay by up to 14%, in exchange for some throughput loss near saturation.

Outline: Introduction; Allocator Implementations; BUFFER MANAGEMENT; Infrastructure; Conclusions. [Becker et al.: “Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks,” to appear in ICCD’12]

Buffer Cost [Figure] [Wang et al.: “Power-driven Design of Router Microarchitectures in On-chip Networks”]

Buffer Management: Many designs divide the buffer statically among VCs, assigning each VC its fair share. But the optimal buffer organization depends on load: low load favors deep VCs, high load favors many VCs. For a fixed buffer size, static schemes must pick one or the other. Improve utilization by allowing buffer space to be shared among VCs.

Buffer Management Performance [Chart; linked-list based scheme; harmonic mean across traffic patterns; annotations: -18%, +8%, -28%]

Buffer Monopolization: Congestion leads to buffer monopolization. Uncongested traffic sees reduced buffer space, which increases latency and reduces throughput. Congestion spreads across VCs!

Adaptive Backpressure: Avoid unproductive use of buffer space. Impose quotas on outstanding credits. Share freely under benign conditions. Limit sharing to avoid performance pathologies. Vary backpressure based on demand.
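One way to picture "quotas on outstanding credits" is an upstream credit tracker for a shared downstream buffer. The sketch below is a simplified model under stated assumptions, not the design from the talk: the class name, its methods, and the rule that a flit may be sent only when both the shared buffer has a free slot and the VC is below its quota are all hypothetical.

```python
class SharedBufferCredits:
    """Upstream view of a downstream input buffer shared among VCs."""

    def __init__(self, buffer_size, quotas):
        self.buffer_size = buffer_size          # total slots shared by all VCs
        self.quotas = list(quotas)              # per-VC limit on outstanding credits
        self.outstanding = [0] * len(self.quotas)

    def can_send(self, vc):
        # Assumed rule: need a free shared slot AND room under this VC's quota.
        return (sum(self.outstanding) < self.buffer_size
                and self.outstanding[vc] < self.quotas[vc])

    def send(self, vc):
        if not self.can_send(vc):
            raise RuntimeError("no credit available for VC %d" % vc)
        self.outstanding[vc] += 1

    def credit_returned(self, vc):
        if self.outstanding[vc] == 0:
            raise RuntimeError("unexpected credit for VC %d" % vc)
        self.outstanding[vc] -= 1

# Four shared slots; VC 0 is limited to two of them, VC 1 may use all four.
credits = SharedBufferCredits(buffer_size=4, quotas=[2, 4])
credits.send(0)
credits.send(0)
vc0_blocked_by_quota = not credits.can_send(0)
vc1_still_free = credits.can_send(1)
```

With a quota of 2, VC 0 stops after two flits while VC 1 can still use the remaining slots, so a congested VC cannot take over the whole buffer. Setting every quota equal to the buffer size recovers unrestricted sharing.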

Buffer Quota Heuristic: Goal: set quota values just high enough to support the observed throughput for each VC. Allow credit stalls that overlap with other stalls. Drain unproductive buffer occupancy. Throughput is difficult to measure directly; instead, infer it from credit round-trip times (RTT). In the absence of congestion, set the quota to the RTT. For each downstream stall cycle, reduce the quota by one.
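The quota rule stated on the heuristic slide can be written out as a small worked example. This is a sketch under assumptions: the function and parameter names are hypothetical, stall cycles are inferred as the excess of the observed credit round-trip time over the uncongested one, and the lower bound `min_quota` is added here so the VC can always make progress; the transcript does not specify a floor.

```python
def stall_cycles_from_rtt(observed_rtt, base_rtt):
    """Infer downstream stall cycles from the credit round-trip time.

    Assumes a stalled flit delays its credit by exactly the stall duration,
    so observed_rtt = base_rtt + stall cycles.
    """
    return max(0, observed_rtt - base_rtt)

def buffer_quota(base_rtt, stall_cycles, min_quota=1):
    """Quota in flits: the uncongested RTT, reduced by one per stall cycle."""
    # min_quota is an assumption made for this sketch, not taken from the talk.
    return max(min_quota, base_rtt - stall_cycles)

base_rtt = 8  # cycles, uncongested credit round trip (example value)
for observed in (8, 11, 30):
    stalls = stall_cycles_from_rtt(observed, base_rtt)
    print(observed, stalls, buffer_quota(base_rtt, stalls))
```

With no congestion the quota equals the round-trip time (8 here), which is enough credits to keep the channel busy; an observed round trip of 11 cycles implies 3 stall cycles and a quota of 5, and a heavily congested VC falls back to the assumed floor of 1.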

Buffer Quota Motivation (1) [Timing diagrams between Router 0 and Router 1; labels: Tcrt,0; Tcrt,0+Tstall; Tstall; excess flits]: Congestion causes downstream stall and unproductive buffer occupancy. Full throughput is achieved in steady state.

Buffer Quota Motivation (2) [Timing diagrams between Router 0 and Router 1; labels: Tstall; Tidle; excess flit drained]: Insufficient credit supply causes an idle cycle downstream. The credit stall resolves unproductive buffer occupancy.

Network Stability [Chart; tornado traffic; annotation: 6.3x]

Traffic Isolation [Measure zero-load latency increase with background traffic] [Charts; uniform random foreground traffic; panels: uniform random background traffic, hotspot background traffic; annotations: -38%, -33%]

Zero-load Latency with Background [Chart; 50% uniform random background traffic; annotations: -31%, w/o background]

Throughput with Background [Chart; 50% uniform random background traffic; annotations: -13%, 3.3x, w/o background]

Application Performance Setup: Model traffic in a heterogeneous CMP. Each node generates two types of traffic: PARSEC application traffic models a latency-optimized core; streaming traffic to memory controllers models an array of throughput-optimized cores.

Application Performance [Chart; 12.5% injection rate for streaming traffic; annotations: -31%, w/o background]

Summary: Sharing improves buffer utilization, but can lead to pathological performance. Adaptive Backpressure minimizes unproductive use of shared buffer space. It mitigates performance degradation in the presence of adversarial traffic, but maintains the key benefits of buffer sharing under benign conditions.

Infrastructure: Open source NoC router RTL. State-of-the-art router implementation. Highly parameterized: topology, routing, allocators, buffers, … Pervasive clock gating. Fully synthesizable. 100 files, >22k LOC of Verilog-2001. Used in research efforts both inside and outside our research group.

Conclusions: Future large-scale chip multiprocessors will require efficient on-chip networks. Router microarchitecture is one of many aspects that need to be optimized. Allocation has a direct impact on router delay and throughput. By exploiting higher-level properties, we can reduce cost and delay without degrading performance. Input buffers are attractive candidates for optimization. However, care must be taken to avoid performance pathologies. By avoiding unproductive use of buffer space, Adaptive Backpressure mitigates undesired interference effects.

Acknowledgements: Bill; Christos and Kunle; Prof. Nishi; George, Ted, Curt & the rest of the CVA gang.

That’s it for today. Thank You!