Design of a High-Throughput Distributed Shared-Buffer NoC Router

Presentation transcript:

Design of a High-Throughput Distributed Shared-Buffer NoC Router
Rohit Sunkam Ramanujam*, Vassos Soteriou†, Bill Lin*, Li-Shiuan Peh‡
*Dept. of Electrical Engineering, UCSD, USA; †Dept. of Electrical Engineering, CUT, Cyprus; ‡Dept. of Electrical Eng. and Computer Science, MIT, USA

Chip Multiprocessors are a reality …
Power wall, frequency wall, ILP wall, non-recurring engineering costs, time to market. We were in the uniprocessor era until earlier this decade, when chip makers started facing a number of obstacles that made it hard to keep improving the performance of uniprocessor systems. Together, these factors led to the advent of the multicore era, in which parallelism is leveraged to improve processor performance. Sources: Intel Inc. and Tilera Inc.

The need for a Network on Chip (NoC)
Computer architects have found that heavily pipelined, monolithic superscalar uniprocessor cores, which aim to speed up application execution through ILP and high operating frequencies, have reached a fundamental limit: performance gains from one generation to the next show diminishing returns as technology keeps scaling. Instead, economies of scale and new design paradigms point to "divide and conquer" strategies, where applications are broken down into smaller concurrent operations that are distributed across many smaller processing units residing on a single chip. These many small processing units communicate with each other over a shared medium. As the number of on-chip units increases, core interconnectivity is moving away from fully connected crossbars (whose complexity grows on the order of n², where n is the number of connected cores) and bus architectures, which connect only a handful of cores, toward interconnection networks. An on-chip interconnection network is the communication medium of choice because its wire segments are shorter than global wires, so wire delays scale with the size of the architecture: links between computation units can be traversed in a single cycle at high bandwidth, offering delay-predictable communication. Additional benefits include resource reuse, since traffic from different components and applications shares routers and wiring, which can be exploited for traffic balancing and even fault tolerance, allowing throughput and performance to keep scaling. An application can also be mapped so that its load is distributed evenly across the network topology.
Scalable communication. Modular design. Efficient use of wires. A new way to organize and build VLSI systems.

The Problem – Delivering high throughput in NoCs
Why care? NoCs in CMPs connect general-purpose processors. Future applications are unknown, so the traffic is unknown. Exploiting parallelism requires fine-grained interaction between cores, so we can expect high traffic volume for current and future applications running on many-core processors, e.g., cache coherence among a large number of distributed shared L2 caches.
Parallel programming and parallel machines have been around for a while in the context of clusters, supercomputers and grids. However, the granularity of parallelism in these machines is quite coarse because off-chip communication costs are high. With multi-core processors, the communication cost drops drastically thanks to the proximity of the cores within a chip and the abundance of on-chip wiring, so we can expect to exploit parallelism at a much finer granularity. This in turn will greatly increase the traffic volume between cores and will require networks that can deliver high throughput.

An important design choice that affects throughput: the router microarchitecture. How well does a router multiplex packets onto its output links?

NoC routers – Current design
Input Buffered Routers (IBRs): flits are buffered at the input ports, and each cycle a maximal matching between inputs and outputs decides which flits cross the crossbar. In the two-input, two-output example animated on the slide, Input 1 is matched to Output 1 in one cycle and Input 2 to Output 1 in another, yet Output 2 is unutilized in cycle 3 even though a flit destined for Output 2 is buffered in the router.
Bottleneck: the maximal matching used for arbitration is not good enough (roughly 70-80% efficiency).
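The bottleneck can be reproduced in a few lines. The following is a minimal sketch, not the paper's allocator: a greedy, fixed-priority pass that produces a maximal (but not maximum) matching. The request pattern is an assumption chosen to mirror the slide's two-input, two-output example.

```python
# A minimal sketch (assumed setup, not the paper's allocator) of why a maximal
# matching can leave an output idle. Input 1 holds flits for outputs 1 and 2
# (in separate VCs); input 2 holds a flit for output 1.
requests = {1: {1, 2}, 2: {1}}   # input port -> set of requested output ports

def greedy_maximal_matching(reqs):
    """Grant each input at most one output, scanning in a fixed order.
    The result is maximal (no grant can be added) but not necessarily maximum."""
    granted_outputs, matching = set(), {}
    for inp, outs in reqs.items():
        for out in sorted(outs):
            if out not in granted_outputs:
                matching[inp] = out
                granted_outputs.add(out)
                break
    return matching

print(greedy_maximal_matching(requests))  # {1: 1}: input 2 is blocked and output 2 idles
# A maximum matching, {1: 2, 2: 1}, would keep both output links busy this cycle.
```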

Output queueing to the rescue …
Output Buffered Router (OBR): flits are buffered at the output ports. Output links are always utilized when there are flits available for them. Better multiplexing of flits onto output links ⇒ higher throughput.
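For contrast, here is the output-buffered counterpart of the sketch above, again purely illustrative; the arrival schedule is an assumption. Flits are written straight into per-output queues, so a link idles only when no flit in the router is destined for it.

```python
# Illustrative output-buffered router: arrivals go directly into per-output
# queues, and every non-empty output queue sends one flit per cycle.
from collections import deque

arrivals = {1: [("A", 1), ("B", 1)],   # cycle -> list of (flit id, destination output)
            2: [("C", 2)]}
out_q = {1: deque(), 2: deque()}

for cycle in (1, 2, 3):
    for flit, out in arrivals.get(cycle, []):
        out_q[out].append(flit)                    # buffered at the output side on arrival
    for out, q in out_q.items():
        status = f"sends {q.popleft()}" if q else "idle (nothing buffered for it)"
        print(f"cycle {cycle}: output {out} {status}")
```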

How much difference does it make? Uniform traffic: a throughput gap of 18%! OBR: infinite buffering, 5-stage pipeline as described in the paper (matches the DSB pipeline). IBR: 200 flits of buffering, 8 VCs, 5 flits/VC.

How much difference does it make? Complement traffic: a throughput gap of 12%! OBR: infinite buffering, 5-stage pipeline as described in the paper (matches the DSB pipeline). IBR: 200 flits of buffering, 8 VCs, 5 flits/VC.

How much difference does it make? Tornado traffic: a throughput gap of 22%! OBR: infinite buffering, 5-stage pipeline as described in the paper (matches the DSB pipeline). IBR: 200 flits of buffering, 8 VCs, 5 flits/VC.

Performance impact on real applications: up to a 98% reduction in average packet latency on SPLASH-2 benchmark applications.

Output buffering is great …
OBRs offer much higher throughput than IBRs, and OBRs have predictable delay: queueing delay can be modeled with M/D/1 queues. Packet delays are not predictable for IBRs.
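The M/D/1 model the slide refers to gives a closed-form mean delay. The textbook form is sketched below (the paper's exact model may differ): with deterministic service time 1/μ, i.e. one flit per cycle on the output link, and offered load ρ = λ/μ,

```latex
W_q \;=\; \frac{\rho}{2\,\mu\,(1-\rho)},
\qquad
T \;=\; \frac{1}{\mu} + W_q
  \;=\; \frac{1}{\mu}\left(1 + \frac{\rho}{2(1-\rho)}\right).
```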

So why aren’t OBRs used in NoCs?
Implementing output buffering requires either a crossbar speedup of P, where P is the number of ports, which is not practical for aggressively clocked designs, or output buffers with P write ports and a P×P² crossbar, which carries huge area and power penalties.
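For concreteness, here is the same arithmetic evaluated at P = 5, the mesh router used later in the talk; this is just the cost of the two options as stated above, not a claim about any particular implementation.

```python
# Rough cost of true output buffering for a P-port router, evaluated at P = 5.
P = 5
print(f"option 1: crossbar/buffer speedup of {P}x the link rate")
print(f"option 2: {P} x {P * P} crossbar plus output buffers with {P} write ports")
```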

Our approach: emulate output queueing without any speedup
Step 1: Timestamp the flits – assign a future time at which the flit would depart the router assuming output buffering.
Step 2: Find a conflict-free middle memory.
Step 3: Move the flit from the input buffer through crossbar 1 into the chosen middle memory.
Step 4: When the current time equals the timestamp, read the flit from the middle memory and send it through crossbar 2 to the output port.
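A minimal sketch of these four steps follows. The timestamping policy, pipeline delay, and data structures are simplifying assumptions for illustration, not the paper's exact logic.

```python
# Sketch of the DSB idea: timestamp, pick a conflict-free middle memory,
# write through crossbar 1, drain through crossbar 2 at the stamped cycle.
class DSBSketch:
    def __init__(self, num_ports=5, num_memories=9, pipeline_delay=3):
        self.memories = [dict() for _ in range(num_memories)]  # timestamp -> (flit, output)
        self.next_free_time = [0] * num_ports    # earliest free departure cycle per output
        self.pipeline_delay = pipeline_delay     # assumed minimum cycles before departure

    def timestamp(self, now, out_port):
        # Step 1: the departure cycle the flit would get in an ideal output-buffered router.
        ts = max(now + self.pipeline_delay, self.next_free_time[out_port])
        self.next_free_time[out_port] = ts + 1   # one flit per output per cycle
        return ts

    def find_middle_memory(self, ts, written_this_cycle):
        # Step 2: usable memory = no other write this cycle (arrival conflict) and
        # no stored flit leaving at cycle ts (departure conflict).
        for m, mem in enumerate(self.memories):
            if m not in written_this_cycle and ts not in mem:
                return m
        return None                              # no memory free: stall (handled by flow control)

    def write(self, mem_index, ts, flit, out_port):
        # Step 3: crossbar 1 traversal plus middle-memory write.
        self.memories[mem_index][ts] = (flit, out_port)

    def drain(self, now):
        # Step 4: when the current cycle equals a flit's timestamp, read it out
        # of its middle memory and send it through crossbar 2 to its output.
        for mem in self.memories:
            if now in mem:
                yield mem.pop(now)
```

For example, under these assumptions a flit arriving at cycle 2 for an output whose next free slot is cycle 8 is stamped 8, parked in any memory with no write this cycle and no resident flit stamped 8, and read out at cycle 8.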

Arrival and Departure Conflicts
Arrival conflicts – with P input ports, a flit can have an arrival conflict with up to P−1 other flits. Departure conflicts – with P output ports, a flit can have a departure conflict with up to P−1 other flits. By the pigeonhole principle, 2P−1 middle memories are needed to avoid all arrival and departure conflicts.
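Written out, the counting behind the 2P−1 bound is simply a restatement of the slide's argument:

```latex
\underbrace{(P-1)}_{\text{arrival conflicts}} \;+\; \underbrace{(P-1)}_{\text{departure conflicts}}
\;=\; 2P-2
\quad\Longrightarrow\quad
N_{\text{middle memories}} \;\ge\; 2P-1 \;\text{ leaves at least one conflict-free memory for every flit.}
```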

The Distributed Shared-Buffer Router (DSB)
Aims at emulating the packet-servicing scheme of an OBR with limited buffers and no speedup, with first-come-first-served servicing of flits. Objectives: close the performance gap between IBRs and OBRs with infinite buffers (high throughput); make a feasible design → low power and area overhead; make packet delays more predictable for delay-sensitive NoC applications.

DSB Router Innovations
A router pipeline with new stages for timestamping flits and finding a conflict-free middle memory; complexity- and delay-balanced pipeline stages for a high-clocked, high-performance implementation; a new flow control scheme that prevents packet dropping when resources are unavailable; and an evaluation of the power-performance tradeoff of DSB architectures with fewer than 2P−1 middle memories.

Evaluation
Cycle-accurate flit-level simulator. Mesh topology – each router has 5 ports: N, S, E, W plus injection/ejection. Dimension-Ordered Routing (DOR) is used to decouple the effects of the routing algorithm from the network performance results.
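As a reference for readers unfamiliar with DOR, here is a sketch of XY dimension-ordered routing on a 2D mesh; the coordinate orientation and port names are assumptions for illustration, not taken from the paper.

```python
# Sketch of XY dimension-ordered routing: route fully in X, then in Y.
def dor_xy_next_port(cur, dst):
    """Return the output port a flit takes at router `cur` toward `dst`."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return "E" if dx > cx else "W"   # resolve the X dimension first
    if cy != dy:
        return "N" if dy > cy else "S"   # then the Y dimension
    return "EJECT"                       # arrived: eject to the local core

# Example on a 7x7 mesh: (0, 0) -> (3, 5) moves east first, then north.
assert dor_xy_next_port((0, 0), (3, 5)) == "E"
assert dor_xy_next_port((3, 0), (3, 5)) == "N"
assert dor_xy_next_port((3, 5), (3, 5)) == "EJECT"
```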

Evaluation – Traffic traces
Three synthetic traffic patterns: Uniform, Bit Complement (Complement), and Tornado. Real traffic/memory traces from running multiple threads (49 threads ⇒ 7x7 mesh) of eight SPLASH-2 benchmarks: Complex 1D FFT, LU decomposition, Water-nsquared, Water-spatial, Ray tracer, Barnes-Hut, Integer Radix sort, and Ocean simulation.
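For reference, the three synthetic patterns can be sketched as destination functions on a k x k mesh. The node numbering (node = y * k + x) and the exact tornado offset follow common simulator conventions and are assumptions here, not definitions taken from the paper.

```python
# Destination generators for the three synthetic patterns (illustrative).
import random

def uniform(src, k):
    return random.randrange(k * k)          # every node is an equally likely destination

def complement(src, k):
    # Complement each coordinate: (x, y) -> (k-1-x, k-1-y), i.e. k*k - 1 - src.
    return (k * k - 1) - src

def tornado(src, k):
    # Shift ceil(k/2) - 1 positions in each dimension (some definitions shift only X).
    x, y = src % k, src // k
    shift = (k + 1) // 2 - 1
    return ((y + shift) % k) * k + (x + shift) % k
```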

Performance on Uniform traffic: a throughput gap of just 9%. OBR and IBR configured as on the earlier throughput-gap slides. DSB: 200 flits of buffering, 4 VCs, 5 flits/VC, 5 middle memories (MM), 20 flits/MM.

Performance on Complement traffic: a throughput gap of just 4%. Router configurations as on the previous slide.

Performance on Tornado traffic: a throughput gap of just 8%. Router configurations as on the previous slides.

Performance of DSB on SPLASH-2 benchmarks
The small difference in packet latency between the OBR and DSB routers is mainly due to the limited buffering in the DSB router. The Raytrace, Barnes and Ocean traces have very little contention; for these traces the IBR has lower latency because of its shorter pipeline. There are huge performance improvements over the IBR on traces that exhibit high contention and demand high bandwidth (chart annotations: 64%, 72%, 97%). The performance of DSB is very close to an OBR with the same number of pipeline stages. Router configurations as on the previous slides.

Input Buffered Router (IBR) pipeline: RC → VA → SA → ST → LT.
Route Computation (RC): determine the output port of the flit based on the destination coordinates.
Virtual Channel Allocation (VA): reserve an output virtual channel (buffering) at the next-hop router.
Switch Arbitration (SA): acquire access to the output port through the crossbar.
Switch Traversal (ST): traverse the crossbar to reach the output link.
Link Traversal (LT): traverse the link to reach the input buffer of the next-hop router.

Distributed Shared-Buffer Router pipeline: RC → TS → CR + VA → XB1 + MM_WR → MM_RD + XB2 → LT (the loop-back arrow labeled "If CR or VA fails" indicates that such a flit is retried).
Route Computation (RC): determine the output port of the flit based on the destination coordinates.
Timestamp Allocation (TS): assign a timestamp to the flit for the requested output port; the timestamp is the future time (cycle) at which the flit can depart the middle-memory buffer.
Conflict Resolution + Virtual Channel Allocation (CR + VA): find a conflict-free middle memory, and reserve a virtual channel at the input of the next-hop router.
Crossbar 1 + Middle Memory Write (XB1 + MM_WR): the flit traverses the first crossbar and is written into the assigned middle memory.
Middle Memory Read + Crossbar 2 (MM_RD + XB2): when the current time equals the timestamp, the flit is read from the middle memory and traverses the second crossbar.
Link Traversal (LT): the flit traverses the output link to reach the input buffer of the next-hop router.
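The two pipelines can be summarized side by side as ordered stage lists; the stage names follow the slides, and the comments only paraphrase the descriptions above.

```python
# IBR vs. DSB pipeline stages, as listed on the preceding two slides.
IBR_PIPELINE = [
    "RC",            # route computation
    "VA",            # virtual-channel allocation at the next hop
    "SA",            # switch arbitration for the (single) crossbar
    "ST",            # switch traversal
    "LT",            # link traversal
]

DSB_PIPELINE = [
    "RC",            # route computation
    "TS",            # timestamp allocation: future departure cycle
    "CR+VA",         # conflict resolution (pick a middle memory) + VC allocation
    "XB1+MM_WR",     # crossbar 1 traversal and middle-memory write
    "MM_RD+XB2",     # at the timestamped cycle: middle-memory read and crossbar 2
    "LT",            # link traversal
]
# Per the slide, a flit whose CR or VA step fails is retried rather than dropped.
```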

Higher throughput – at what cost? Extra power!
Two crossbars instead of one: with N middle memories, we need one P×N and one N×P crossbar. Middle-memory buffers – the router can have fewer input buffers to compensate for the extra middle-memory buffers. A TS stage instead of the Switch Arbitration stage of IBRs, and an extra stage for Conflict Resolution.

Power-Performance tradeoff
Theoretically, 2P−1 middle memories are needed to resolve all conflicts. For a 5-port mesh router this means 2P−1 = 9 middle memories and a 5x9 plus a 9x5 crossbar – a large power overhead. What is the impact of using fewer than 2P−1 middle memories?

Power and Area Comparison
Router power overhead of 50% for the DSB-5 router. If the NoC consumes 20% of tile power, the tile power overhead is only 7% for the DSB-5 router; if the NoC consumes 10% of tile power, the tile power overhead is only 3.5%.

Thank you Questions?