Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support
Natalie Enright Jerger, Li-Shiuan Peh, and Mikko Lipasti
University of Wisconsin – Madison and Princeton University
Executive Summary
- Demonstrate necessity of multicasting on-chip
  - State-of-the-art router insufficient
  - Significant number of proposals could leverage multicasting
- Provide efficient multicasting solution using Virtual Circuit Trees
  - Overlay logical routing trees on mesh network
  - Reduces interconnect latency by up to 90%
  - Reduces switching activity by up to 53%
6/24/2008 Enright Jerger - ISCA 2008
Packet-Switched Unicast Router
- 3-stage packet-switched router
  - Based on most aggressive recent proposals
- Aggressive baseline not well matched to all types of communication
  - Multicast is performed using multiple unicasts
[Figure: router pipeline spanning two routers and the links between them; stage labels: Buffer Write, Virtual Channel/Switch Allocation, Switch Traversal, Link Traversal]
State-of-the-Art Router
- Current router architecture is poorly equipped to handle even a low amount of multicast (MC) traffic
Outline
- Motivation
  - Multicasting Scenarios
  - Baseline router problems
- VCTM Implementation
  - Example
  - Architecture
- Multicasting Scenarios
  - Description
  - Characterization
- Evaluation
- Conclusion
Baseline Router Example
- More resources to solve this problem?
  - More buffers, virtual channels, links?
[Figure: router X with nodes labeled C and D; virtual channels (VCs) hold packets labeled 1B, 2B, 1C, 2C, 1D, 2D; one resource is marked Busy]
Key Router Problems
- Redundant (wasteful) use of resources: same payload occupying extra buffers, links
- Injection Bandwidth: Burst of messages at network interface
- Alternative routing: Improves throughput, but wastes power
- Speculation Problems: predicated on low loads
  - Burst of messages
[Figure: router X with nodes labeled A, B, C, and D; virtual channels (VCs) hold packets labeled 1A, 1B, 1C, 1D, 2A, 2B, 2C, 2D; one resource is marked Busy]
Virtual Circuit Tree Multicasting Overview
- Builds on existing state-of-the-art router
  - Unicast performance is not impacted
- Build multicast trees incrementally
- Tree reuse is necessary for effectiveness
  - Significant temporal destination set reuse across all scenarios
- Fewer packets improve speculation
[Figure: mesh with nodes 0-5. Captions: Multicast from 0 to <2,4,5>; 3 Packets Injected into Network; Unicast Setup Packets (1 per destination); Build Tree Incrementally (Tree M); Link Redundancy Removed; Injection problem solved. Routing entries shown at routers: M: <East>, M: <East, South>, M: <Eject, South>, M: <Eject>]
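The incremental tree build on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the mesh is assumed to be 3 nodes wide with node IDs increasing eastward and then southward, routing is X-then-Y dimension order, and the names `xy_route`, `build_tree`, and the link-counting helpers are hypothetical. Each destination contributes one setup path, and the output ports it uses are merged into a per-router entry for the tree.

```python
from collections import defaultdict

def xy_route(src, dst, width):
    """Yield (node, output_port) hops from src to dst, X dimension first, then Y.

    Assumed numbering: node = y * width + x, with y growing toward South.
    """
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    while x != dx:
        step = 1 if dx > x else -1
        yield y * width + x, "East" if step == 1 else "West"
        x += step
    while y != dy:
        step = 1 if dy > y else -1
        yield y * width + x, "South" if step == 1 else "North"
        y += step
    yield y * width + x, "Eject"

def build_tree(src, dests, width):
    """Merge one setup path per destination into per-router output-port sets."""
    table = defaultdict(set)
    for dst in dests:
        for node, port in xy_route(src, dst, width):
            table[node].add(port)
    return dict(table)

def unicast_link_traversals(src, dests, width):
    """Links crossed when the multicast is sent as separate unicasts."""
    return sum(1 for d in dests for _, p in xy_route(src, d, width) if p != "Eject")

def tree_link_traversals(tree):
    """Links crossed when one packet follows the merged tree."""
    return sum(len(ports - {"Eject"}) for ports in tree.values())

tree_m = build_tree(0, [2, 4, 5], width=3)
for node in sorted(tree_m):
    print(node, sorted(tree_m[node]))
print(unicast_link_traversals(0, [2, 4, 5], 3), tree_link_traversals(tree_m))
```

Under these assumptions the merged entries match the ones shown in the figure (East at node 0, East and South at node 1, Eject and South at node 2, Eject at nodes 4 and 5), and the tree crosses 4 links where three separate unicasts cross 7.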
VCTM Router Architecture
[Figure: router microarchitecture with a Virtual Circuit Tree Table (columns: Src, VCTnum, Id, Ej, N, S, E, W, Fork), Virtual Channel Allocator, Switch Allocator, and Input Ports with buffers labeled VC 0 ... VC x and MVC 0]
Implementation Details (1)
- Destination Set Content Addressable Memory (CAM)
  - If not present, replace oldest tree and perform setup
- Encode Tree ID 2 into multicast header
[Figure: CAM lookup of Destination Set <5,4,2>]
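The lookup described on this slide can be modeled roughly as follows. This is a software sketch only: the class name `DestSetCAM` and its method are hypothetical, "oldest" is assumed to mean the tree that was set up earliest (the slide does not say whether hits refresh an entry's age), and the evicted tree's ID is assumed to be reused for the new tree.

```python
from collections import OrderedDict

class DestSetCAM:
    """Maps a destination set to a tree ID at the source network interface."""

    def __init__(self, num_trees):
        self.num_trees = num_trees
        self.entries = OrderedDict()  # frozenset(dests) -> tree ID, oldest first

    def lookup(self, dests):
        """Return (tree_id, needs_setup) for a multicast to `dests`."""
        key = frozenset(dests)  # order of destinations does not matter
        if key in self.entries:
            return self.entries[key], False  # tree reuse: no setup packets
        if len(self.entries) < self.num_trees:
            tree_id = len(self.entries)  # a free tree ID is still available
        else:
            _, tree_id = self.entries.popitem(last=False)  # evict oldest tree
        self.entries[key] = tree_id
        return tree_id, True  # caller must send setup packets for this tree

cam = DestSetCAM(num_trees=2)
first = cam.lookup([5, 4, 2])   # miss: tree must be set up
again = cam.lookup([2, 4, 5])   # hit: same set, different order
```

The returned tree ID is what would be encoded into the multicast header; a `needs_setup` value of True corresponds to the "perform setup" step.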
Implementation Details (2)
- VCTs provide routing, not resources
  - Multicast arbitration same as unicast
  - VCTs do not pre-allocate resources
- Multiple arbitration steps at tree branch
  - If one desired output is blocked, other tree branch outputs can still proceed
  - Longer buffer occupancy
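The branch behavior above can be illustrated with a toy model. The assumptions are mine rather than the slide's: each cycle the flit is granted whichever of its still-pending outputs are free, and it leaves its input buffer only after every branch output has been served. The function name `cycles_in_buffer` is hypothetical.

```python
def cycles_in_buffer(branch_outputs, free_per_cycle):
    """Return (cycles the flit occupies its buffer, outputs granted each cycle).

    branch_outputs: output ports the tree entry requests at this router.
    free_per_cycle: for each cycle, the set of outputs this flit could win.
    """
    pending = set(branch_outputs)
    grants = []
    for free in free_per_cycle:
        granted = pending & set(free)  # unblocked branches proceed right away
        grants.append(granted)
        pending -= granted
        if not pending:
            return len(grants), grants
    raise ValueError("flit still waiting on outputs: %s" % sorted(pending))

# South is blocked for two cycles; East proceeds in the first cycle anyway.
cycles, grants = cycles_in_buffer({"East", "South"},
                                  [{"East"}, set(), {"East", "South"}])
```

In this example the East branch is not held up by the blocked South branch, but the flit stays buffered for three cycles, which is the longer buffer occupancy the slide mentions.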
VCTM Overhead
Access Time < 1 cycle

Virtual Circuit Tree Routing Tables
Number of Entries   Area (mm^2)   Energy (nJ)
512                 0.024         0.002
1024                0.041
2048                0.078         0.003

Destination Set CAMs
Number of Entries   Area (mm^2)   Energy (nJ)
32                  0.018         0.007
64                  0.021         0.010
128                 0.029         0.017
Outline
- Motivation
  - Multicasting Scenarios
  - Baseline router problems
- VCTM Implementation
  - Example
  - Architecture
- Multicasting Scenarios
  - Description
  - Characterization
- Evaluation
- Conclusion
Multicasting Scenarios (1)
- Token Coherence [Martin, 2003]
  - TokenB: Broadcast for tokens
  - 1 token to read
  - All tokens to write
- SGI Origin Directory Protocol [Laudon, 1997]
  - Multicast invalidate requests
- Opteron Protocol [Conway, 2007]
  - Coherence requests sent to ordering point and broadcast to all cores
  - Some filtering of destinations
Multicasting Scenarios (2)
- Region Multicasting
  - Two-level protocol
  - 1st level: Multicast to sharers of address region
  - 2nd level: Fall back on directory when no region information available
- TRIPS [Sankaralingam, 2003]
  - Operand network
  - Multicast results of instructions to tiles containing dependent instructions
  - 35% of dynamic instructions have 2 or more future uses
Multicasting Scenarios (3)
- Uncorq [Strauss, 2007]
  - Unordered broadcast, ordered response network
- Virtual Hierarchies [Marty, 2007]
  - 1st level: directory
  - 2nd level: global broadcast
- Dynamic NUCA caches [Kim, 2002]
  - Multicast for cache hit
Characterizing Multicasts
- Unique Destination Sets: combination of destinations in multicast
  - Token: 1 destination set for each node
  - Region and Directory: Much larger variety of destination sets
- Number of Destinations per multicast
  - TRIPS and Directory: Small destination sets
  - TokenB and Opteron: Large destination sets
  - Region Multicast: Wide variety of sizes
- Up to 13% of traffic is multicast
- VCTM is an inexpensive solution to support multicasting
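The three quantities on this slide can be computed from a message trace along the following lines. The trace format (a list of `(source, destinations)` pairs, with single-destination messages treated as unicast) and the function name `characterize` are assumptions made for illustration; the sample trace is invented and does not come from the evaluated workloads.

```python
from collections import Counter

def characterize(trace):
    """Summarize a list of (source, destinations) messages.

    Returns (unique destination sets, histogram of destinations per
    multicast, fraction of messages that are multicasts).
    """
    multicasts = [frozenset(dests) for _, dests in trace if len(dests) > 1]
    unique_sets = set(multicasts)  # same nodes in any order count as one set
    size_hist = Counter(len(dests) for dests in multicasts)
    mc_fraction = len(multicasts) / len(trace)
    return len(unique_sets), dict(size_hist), mc_fraction

# Invented sample trace: two unicasts and three multicasts.
sample_trace = [(0, [1]), (0, [2, 4, 5]), (0, [5, 4, 2]), (1, [0, 2]), (3, [4])]
num_sets, sizes, fraction = characterize(sample_trace)
```

A workload with few unique sets relative to its multicast count has high destination set reuse, which is the property tree reuse depends on.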
Simulation Methodology
- Network traffic from 5 different scenarios
- Detailed network simulator
  - Cycle-accurate modeling of router stages
- Flexible, lightweight VCTM mechanism provides improvement for diverse scenarios
- Many more results in paper
Network Configuration
Topology                4-ary 2-mesh; 5-ary 2-mesh (TRIPS)
Routing                 Dimension Order: X-Y Routing
Channel Width           16 Bytes
Packet Size             1 flit (Coherence request = Address + Command); 5 flits (Data); 3 flits (TRIPS)
Virtual Channels        4
Buffers per port        24
Router ports            5
Virtual Circuit Trees   Varied from 16 to 4K (1 to 256 VCTs/core)
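As a quick consistency check on the configuration, the per-core tree counts follow from the node count of the 4-ary 2-mesh. The helper names below are hypothetical, and the sketch assumes one core per node and that 4K means 4096.

```python
def mesh_nodes(k, n):
    """A k-ary n-mesh has k nodes along each of n dimensions."""
    return k ** n

def trees_per_core(total_trees, k, n):
    """Total virtual circuit trees divided evenly across nodes (one core each)."""
    return total_trees // mesh_nodes(k, n)

def packet_bytes(flits, channel_width_bytes=16):
    """Packet size in bytes when each flit is one channel width."""
    return flits * channel_width_bytes

low = trees_per_core(16, k=4, n=2)     # smallest configuration
high = trees_per_core(4096, k=4, n=2)  # largest configuration (4K trees)
data_packet = packet_bytes(5)          # 5-flit data packet
```

With 16 nodes, 16 and 4096 total trees give 1 and 256 trees per core, matching the range in the table; a 5-flit data packet on a 16-byte channel is 80 bytes.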
Power Savings
- On-chip networks consume up to ~36% of chip power [Wang, 2002]
- Links, buffers, and crossbars consume nearly 100% of network power
- Power saved through activity reduction
Performance Results Summary
- SPECweb: 12%; Art: 55%; TPC-H: 68%
- Small number of trees required for majority of benefit
- Performance improvement depends on network pressure
VCTM vs. Aggressive Network
- VCTM outperforms aggressive (unrealistic) network
VCTM Summary (1)
- Improves performance across a variety of scenarios
  - Reduces interconnect latency by up to 90%
  - Reduces switching activity by up to 53%
- Small number of trees necessary
  - 8 trees/core achieves substantial benefit
  - Dynamic table partitioning could further reduce total tree storage
VCTM Summary (2)
- Outperforms aggressive router
- No impact on unicast performance
- Integrates with existing state-of-the-art router architecture
- Easily extended to more scalable topologies and routing algorithms
- Opens the door for new optimizations
Thank you
Questions?