Natalie Enright Jerger, Li Shiuan Peh, and Mikko Lipasti


Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support
Natalie Enright Jerger, Li Shiuan Peh, and Mikko Lipasti
University of Wisconsin – Madison and Princeton University

Executive Summary
- Demonstrate necessity of multicasting on-chip
  - State-of-the-art router insufficient
  - Significant number of proposals could leverage multicasting
- Provide efficient multicasting solution using Virtual Circuit Trees
  - Overlay logical routing trees on mesh network
  - Reduces interconnect latency by up to 90%
  - Reduces switching activity by up to 53%
6/24/2008 Enright Jerger - ISCA 2008

Packet-Switched Unicast Router
- 3-stage packet-switched router
  - Based on most aggressive recent proposals
- Aggressive baseline is not well matched to all types of communication
  - Multicast is performed using multiple unicasts
[Figure: router pipeline across two routers and links: Buffer Write, Virtual Channel/Switch Allocation, Switch Traversal, Link Traversal]

State-of-the-Art Router
- Current router architecture poorly equipped to handle even a low amount of multicast (MC) traffic

Outline
- Motivation
- VCTM Implementation
- Multicasting Scenarios
- Baseline router problems
- Example
- Architecture
- Multicasting Scenarios
- Description
- Characterization
- Evaluation
- Conclusion

Baseline Router Example
- More resources to solve this problem?
  - More buffers, virtual channels, links?
[Figure: router X and neighboring routers, each with virtual channels (VCs); packet copies 1B, 1C, 1D, 2B, 2C, and 2D occupy VCs; one output is marked Busy]

Key Router Problems
- Redundant (wasteful) use of resources: same payload occupying extra buffers, links
- Injection bandwidth: burst of messages at network interface
- Alternative routing: improves throughput, but wastes power
- Speculation problems: predicated on low loads
[Figure: router X with neighbors A, B, C, and D; packet copies 1A-1D and 2A-2D occupy virtual channels (VCs); one output is marked Busy; a burst of messages arrives at the network interface]

Virtual Circuit Tree Multicasting Overview
- Builds on existing state-of-the-art router
  - Unicast performance is not impacted
- Build multicast trees incrementally
- Tree reuse is necessary for effectiveness
  - Significant temporal destination set reuse across all scenarios
- Fewer packets improves speculation
[Figure: multicast from 0 to <2,4,5> on a mesh with nodes 0-5. Without a tree, 3 packets (A, B, C) are injected into the network. Unicast setup packets (1 per destination) build Tree M incrementally, leaving routing entries M: <East>, M: <East, South>, M: <Eject, South>, and M: <Eject>. With the tree, link redundancy is removed and the injection problem is solved.]
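The incremental tree build in the example above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes a mesh three nodes wide with row-major node IDs (inferred from the port labels in the figure), X-Y dimension-order routing, and "South" meaning the next row down. The function names `xy_route`, `build_tree`, and `link_traversals` are hypothetical.

```python
from collections import defaultdict

def xy_route(src, dst, width):
    """Return the (router, output_port) hops for X-Y dimension-order routing.

    Assumes row-major node IDs on a mesh `width` nodes wide.
    """
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    hops = []
    while x != dx:  # travel along X first
        hops.append((y * width + x, "East" if dx > x else "West"))
        x += 1 if dx > x else -1
    while y != dy:  # then along Y
        hops.append((y * width + x, "South" if dy > y else "North"))
        y += 1 if dy > y else -1
    hops.append((y * width + x, "Eject"))  # deliver at the destination
    return hops

def build_tree(src, dests, width):
    """Overlay one unicast setup route per destination to form the tree."""
    tree = defaultdict(set)
    for dst in dests:
        for router, port in xy_route(src, dst, width):
            tree[router].add(port)  # shared hops collapse into one entry
    return dict(tree)

def link_traversals(src, dests, width):
    """Compare link use: one unicast per destination versus one tree."""
    unicast = sum(len(xy_route(src, d, width)) - 1 for d in dests)
    tree = build_tree(src, dests, width)
    multicast = sum(len(ports - {"Eject"}) for ports in tree.values())
    return unicast, multicast

tree_m = build_tree(0, [2, 4, 5], width=3)
unicast_links, tree_links = link_traversals(0, [2, 4, 5], width=3)
```

Under these assumptions the tree entries match the labels in the figure (router 1 forks East and South, router 2 ejects and forwards South), and the three unicasts cross 7 links where the tree crosses 4.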

VCTM Router Architecture
[Figure: VCTM router architecture showing the Virtual Circuit Tree Table (columns: Src, VCTnum, Id, Ej, N, S, E, W, Fork), the Virtual Channel Allocator, the Switch Allocator, and input ports with unicast virtual channels (VC 0 ... VC x) and a multicast virtual channel (MVC 0)]

Implementation Details (1)
- Destination Set Content Addressable Memory (CAM)
  - If not present -> replace oldest tree -> perform setup
- Example: Destination Set <5,4,2>
  - Encode Tree ID 2 into multicast header
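A minimal software sketch of the destination set lookup described above, under stated assumptions: "oldest" is taken to mean the tree that was allocated earliest (FIFO), and an evicted tree's ID is reused for the new tree. The class name `DestinationSetCAM` and its `lookup` method are hypothetical and stand in for what the slides describe as a hardware CAM.

```python
from collections import OrderedDict

class DestinationSetCAM:
    """Maps a destination set to a tree ID, replacing the oldest entry when full."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = OrderedDict()  # frozenset(dests) -> tree_id, oldest first

    def lookup(self, dests):
        """Return (tree_id, needs_setup) for a multicast to `dests`."""
        key = frozenset(dests)
        if key in self.entries:
            return self.entries[key], False  # hit: reuse the existing tree
        if len(self.entries) < self.num_entries:
            tree_id = len(self.entries)  # free entry still available
        else:
            # Assumption: evict the earliest-allocated tree and reuse its ID.
            _, tree_id = self.entries.popitem(last=False)
        self.entries[key] = tree_id
        return tree_id, True  # miss: setup packets must build the tree

cam = DestinationSetCAM(num_entries=2)
first = cam.lookup([5, 4, 2])   # miss, tree must be set up
again = cam.lookup([2, 4, 5])   # same set in another order, hit
other = cam.lookup([1, 3])      # miss, second entry
evict = cam.lookup([7])         # miss, replaces the oldest tree
```

On a hit the returned tree ID is what would be encoded into the multicast header; on a miss the sender would first issue setup packets.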

Implementation Details (2)
- VCTs provide routing, not resources
  - Multicast arbitration same as unicast
  - VCTs do not pre-allocate resources
- Multiple arbitration steps at tree branch
  - If one desired output is blocked, other tree branch outputs can still proceed
  - Longer buffer occupancy
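The branch behavior above can be illustrated with a toy model. It is a sketch under simplifying assumptions, not the router's actual allocator: every unblocked output is granted in the same cycle (no contention between packets is modeled), and the flit keeps its buffer slot until every branch output has been served. The function `forward_at_branch` and its arguments are hypothetical.

```python
def forward_at_branch(outputs, blocked_by_cycle):
    """Serve each tree-branch output as soon as it is not blocked.

    outputs: output ports the tree table lists for this flit.
    blocked_by_cycle: list of sets; entry i holds the ports blocked in cycle i.
    Returns ({port: cycle_sent}, cycles_the_flit_occupied_the_buffer).
    """
    pending = set(outputs)
    sent = {}
    cycle = 0
    while pending:
        blocked = blocked_by_cycle[cycle] if cycle < len(blocked_by_cycle) else set()
        for port in sorted(pending - blocked):
            sent[port] = cycle      # this branch proceeds independently
        pending &= blocked          # only blocked branches remain pending
        cycle += 1
    return sent, cycle

# South is blocked for two cycles; East proceeds immediately.
sent, occupancy = forward_at_branch({"East", "South"}, [{"South"}, {"South"}])
```

In this run East leaves in cycle 0 while South waits until cycle 2, so the flit holds its buffer for 3 cycles, which is the longer buffer occupancy the slide mentions.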

VCTM Overhead
Access time: < 1 cycle

Virtual Circuit Tree Routing Tables
Number of Entries    Area (mm²)    Energy (nJ)
512                  0.024         0.002
1024                 0.041         (not given)
2048                 0.078         0.003

Destination Set CAMs
Number of Entries    Area (mm²)    Energy (nJ)
32                   0.018         0.007
64                   0.021         0.010
128                  0.029         0.017

Outline
- Motivation
- VCTM Implementation
- Multicasting Scenarios
- Baseline router problems
- Example
- Architecture
- Multicasting Scenarios
- Description
- Characterization
- Evaluation
- Conclusion

Multicasting Scenarios (1)
- Token Coherence [Martin, 2003]
  - TokenB: Broadcast for tokens
  - 1 token to read
  - All tokens to write
- SGI Origin Directory Protocol [Laudon, 1997]
  - Multicast invalidate requests
- Opteron Protocol [Conway, 2007]
  - Coherence requests sent to ordering point and broadcast to all cores
  - Some filtering of destinations

Multicasting Scenarios (2)
- Region Multicasting
  - Two-level protocol
  - 1st level: Multicast to sharers of address region
  - 2nd level: Fall back on directory when no region information available
- TRIPs [Sankaralingam, 2003]
  - Operand network
  - Multicast results of instructions to tiles containing dependent instructions
  - 35% of dynamic instructions have 2 or more future uses

Multicasting Scenarios (3)
- Uncorq [Strauss, 2007]
  - Unordered broadcast, ordered response network
- Virtual Hierarchies [Marty, 2007]
  - 1st level: directory
  - 2nd level: global broadcast
- Dynamic NUCA caches [Kim, 2002]
  - Multicast for cache hit

Characterizing Multicasts
- Unique destination sets: combination of destinations in multicast
  - Token: 1 destination set for each node
  - Region and Directory: much larger variety of destination sets
- Number of destinations per multicast
  - TRIPs and Directory: small destination sets
  - TokenB and Opteron: large destination sets
  - Region Multicast: wide variety of sizes
- Up to 13% of traffic is multicast
- VCTM is an inexpensive solution to support multicasting

Simulation Methodology
- Network traffic from 5 different scenarios
- Detailed network simulator
  - Cycle-accurate modeling of router stages
- Flexible, lightweight VCTM mechanism provides improvement for diverse scenarios
- Many more results in paper

Network Configuration
Topology:                4-ary 2-mesh; 5-ary 2-mesh (TRIPs)
Routing:                 Dimension order: X-Y routing
Channel width:           16 bytes
Packet size:             1 flit (coherence request = address + command); 5 flits (data); 3 flits (TRIPs)
Virtual channels:        4
Buffers per port:        24
Router ports:            5
Virtual Circuit Trees:   Varied from 16 to 4K (1 to 256 VCTs/core)
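The tree counts in the configuration follow from the mesh size, as the short arithmetic check below shows. It assumes one core per mesh node and uses the standard k-ary n-mesh node count of k^n; the function names are hypothetical.

```python
def mesh_nodes(k, n):
    """Number of nodes in a k-ary n-mesh (k nodes along each of n dimensions)."""
    return k ** n

def total_trees(trees_per_core, k=4, n=2):
    """Total virtual circuit trees, assuming one core per mesh node."""
    return trees_per_core * mesh_nodes(k, n)

nodes_4ary = mesh_nodes(4, 2)     # 16 nodes in the 4-ary 2-mesh
nodes_trips = mesh_nodes(5, 2)    # 25 nodes in the 5-ary 2-mesh (TRIPs)
min_trees = total_trees(1)        # 1 VCT/core   -> 16 trees
max_trees = total_trees(256)      # 256 VCTs/core -> 4096 (4K) trees
```

With 16 nodes, the range of 1 to 256 VCTs per core gives exactly the 16 to 4K total trees listed in the configuration.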

Power Savings
- On-chip networks consume up to ~36% of chip power [Wang, 2002]
- Links, buffers, and crossbars consume nearly 100% of network power
- Power saved through activity reduction

Performance Results Summary
- SPECweb: 12%; Art: 55%; TPC-H: 68%
- Small number of trees required for majority of benefit
- Performance improvement depends on network pressure

VCTM vs. Aggressive Network
- VCTM outperforms aggressive (unrealistic) network

VCTM Summary (1)
- Improves performance across a variety of scenarios
  - Reduces interconnect latency by up to 90%
  - Reduces switching activity by up to 53%
- Small number of trees necessary
  - 8 trees/core achieves substantial benefit
  - Dynamic table partitioning could further reduce total tree storage

VCTM Summary (2)
- Outperforms aggressive router
- No impact on unicast performance
- Integrates with existing state-of-the-art router architecture
- Easily extendable to more scalable topologies and routing algorithms
- Opens door for new optimizations

Thank you
Questions?