Natalie Enright Jerger, Li Shiuan Peh, and Mikko Lipasti


1 Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support
Natalie Enright Jerger, Li Shiuan Peh, and Mikko Lipasti University of Wisconsin – Madison and Princeton University

2 Executive Summary
Demonstrates the necessity of on-chip multicasting: the state-of-the-art router is insufficient, and a significant number of proposals could leverage multicasting.
Provides an efficient multicasting solution using Virtual Circuit Trees, which overlay logical routing trees on the mesh network.
Reduces interconnect latency by up to 90%.
Reduces switching activity by up to 53%.
6/24/2008 Enright Jerger - ISCA 2008

3 Packet-Switched Unicast Router
A 3-stage packet-switched router, based on the most aggressive recent proposals.
This aggressive baseline is not well matched to all types of communication: multicast is performed using multiple unicasts.
[Figure: two routers joined by links; per-hop stages Buffer Write, Virtual Channel/Switch Allocation, Switch Traversal, Link Traversal]

4 State-of-the-Art Router
The current router architecture is poorly equipped to handle even a small amount of multicast (MC) traffic.

5 Outline
Motivation: baseline router problems
VCTM Implementation: example, architecture
Multicasting Scenarios: description, characterization
Evaluation
Conclusion

6 Baseline Router Example
Could more resources solve this problem: more buffers, virtual channels, or links?
[Figure: routers X, B, C, D; unicast copies 1B/2B, 1C/2C, 1D/2D occupying virtual channels (VCs); one output marked Busy]

7 Key Router Problems
Redundant (wasteful) use of resources: the same payload occupies extra buffers and links.
Injection bandwidth: a burst of messages at the network interface.
Alternative routing: improves throughput, but wastes power.
Speculation problems: speculation is predicated on low loads, which a burst of messages violates.
[Figure: nodes A, B, X, C, D; unicast copies 1A/2A through 1D/2D queued in VCs; one output marked Busy; burst of messages at injection]
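The first problem above, where the same payload occupies extra links, can be quantified by counting link traversals. The following is a minimal sketch under stated assumptions: a k-column 2D mesh, row-major node IDs (node = y * k + x), and X-Y dimension-order routing. All function names are hypothetical, and the 3-column example mesh is an illustration rather than the paper's configuration.

```python
# Hypothetical model: k-column 2D mesh, row-major node IDs, X-Y routing.


def xy_route(src: int, dst: int, k: int) -> list[tuple[int, int]]:
    """Return the directed links (from_node, to_node) on the X-Y path."""
    x, y = src % k, src // k
    dx, dy = dst % k, dst // k
    links = []
    while x != dx:  # route in X first
        nx = x + (1 if dx > x else -1)
        links.append((y * k + x, y * k + nx))
        x = nx
    while y != dy:  # then in Y
        ny = y + (1 if dy > y else -1)
        links.append((y * k + x, ny * k + x))
        y = ny
    return links


def unicast_link_traversals(src: int, dests, k: int) -> int:
    """Multicast as separate unicasts: every copy crosses its whole path."""
    return sum(len(xy_route(src, d, k)) for d in dests)


def tree_link_traversals(src: int, dests, k: int) -> int:
    """Multicast on a tree: a link shared by several paths is crossed once."""
    return len({link for d in dests for link in xy_route(src, d, k)})


if __name__ == "__main__":
    # Multicast from node 0 to <2, 4, 5> in a hypothetical 3-column mesh.
    print(unicast_link_traversals(0, [2, 4, 5], k=3))  # 7
    print(tree_link_traversals(0, [2, 4, 5], k=3))     # 4
```

In this example the three unicasts cross 7 links. The X-Y paths share their first hops, so a tree needs only 4 distinct link traversals. The gap grows with larger destination sets.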

8 Virtual Circuit Tree Multicasting Overview
Builds on the existing state-of-the-art router; unicast performance is not impacted.
Builds multicast trees incrementally.
Tree reuse is necessary for effectiveness, and there is significant temporal destination set reuse across all scenarios.
Fewer packets improve speculation.
[Figure: multicast from node 0 to <2,4,5>. Tree M is built incrementally by 3 unicast setup packets (1 per destination); each router on a path records an entry such as M: <East>, M: <East, South>, M: <Eject, South>, or M: <Eject>. Labels: 3 packets injected into network; link redundancy removed; injection problem solved.]
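Incremental tree construction can be sketched as follows. Each unicast setup packet follows its normal X-Y route, and every router it passes adds the output port toward that destination to its entry for the tree. This is a hypothetical model, not the paper's hardware. It assumes a k-column mesh with row-major IDs and takes "South" to mean increasing row index. The names `next_hop`, `send_setup_packet`, and `build_tree` are illustrative.

```python
from collections import defaultdict

# Hypothetical mesh model: k columns, row-major node IDs (node = y * k + x),
# X-Y routing, and "South" taken to mean increasing row index.


def next_hop(cur: int, dst: int, k: int):
    """Return (output_port, next_node) for one X-Y routing step."""
    x, y = cur % k, cur // k
    dx, dy = dst % k, dst // k
    if x < dx:
        return "East", cur + 1
    if x > dx:
        return "West", cur - 1
    if y < dy:
        return "South", cur + k
    if y > dy:
        return "North", cur - k
    return "Eject", None  # arrived: deliver to the local node


def send_setup_packet(tables, tree_id, src: int, dst: int, k: int) -> None:
    """One unicast setup packet: every router on its path adds the output
    port toward dst to that router's entry for tree_id."""
    cur = src
    while cur is not None:
        port, nxt = next_hop(cur, dst, k)
        tables[cur][tree_id].add(port)
        cur = nxt


def build_tree(tree_id, src: int, dests, k: int):
    # router -> tree_id -> set of output ports
    tables = defaultdict(lambda: defaultdict(set))
    for d in dests:  # one setup packet per destination
        send_setup_packet(tables, tree_id, src, d, k)
    return tables


if __name__ == "__main__":
    tables = build_tree("M", src=0, dests=[2, 4, 5], k=3)
    for router in sorted(tables):
        print(router, sorted(tables[router]["M"]))
```

In a hypothetical 3-column mesh, the run prints `0 ['East']`, `1 ['East', 'South']`, `2 ['Eject', 'South']`, `4 ['Eject']`, and `5 ['Eject']`. These are entries of the same form as the M: <...> labels on the slide. After setup, a single packet tagged with tree M can follow these entries instead of three separate unicasts.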

9 VCTM Router Architecture
[Figure: router with a Virtual Circuit Tree Table (columns Src, VCTnum, Id, Ej, N, S, E, W, Fork), a Virtual Channel Allocator, a Switch Allocator, and input ports with virtual channels VC 0 to VC x and MVC 0]
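The table columns above suggest a simple data structure. Each entry is identified by the tree's source and VCT number and holds one bit per output port (Ej, N, S, E, W), with a fork whenever more than one bit is set. This sketch is hypothetical. It keys entries by (Src, VCTnum) and omits the Id column, whose exact role the slide does not explain. It also does not model clearing stale bits when a tree is replaced.

```python
from dataclasses import dataclass

# Output-port order follows the slide's column labels (Ej N S E W).
PORTS = ("Eject", "North", "South", "East", "West")


@dataclass
class VCTEntry:
    """One Virtual Circuit Tree table row (hypothetical layout)."""
    src: int        # source node that owns the tree ("Src")
    vct_num: int    # tree number assigned by that source ("VCTnum")
    ports: int = 0  # output-port bit vector, bit i <-> PORTS[i]

    def add_port(self, port: str) -> None:
        self.ports |= 1 << PORTS.index(port)

    def outputs(self) -> list[str]:
        return [p for i, p in enumerate(PORTS) if (self.ports >> i) & 1]

    @property
    def fork(self) -> bool:
        # More than one output bit set: the tree branches at this router.
        return bin(self.ports).count("1") > 1


class VCTTable:
    """Per-router table keyed by (src, vct_num) carried in the multicast header."""

    def __init__(self) -> None:
        self.entries: dict[tuple[int, int], VCTEntry] = {}

    def record(self, src: int, vct_num: int, port: str) -> None:
        # Invoked when a setup packet for this tree passes through the router.
        entry = self.entries.setdefault((src, vct_num), VCTEntry(src, vct_num))
        entry.add_port(port)

    def route(self, src: int, vct_num: int) -> list[str]:
        # Invoked for a multicast flit: output ports come from the table.
        return self.entries[(src, vct_num)].outputs()
```

A bit vector keeps each entry small (5 bits of port state for a 5-port router). The fork flag can then be derived from the bits rather than stored separately.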

10 Implementation Details (1)
Destination Set Content Addressable Memory (CAM).
If the destination set is not present, replace the oldest tree and perform setup.
Example: destination set <5,4,2> maps to tree ID 2, so tree ID 2 is encoded into the multicast header.
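The CAM lookup and replacement policy can be modeled in software as a map from destination sets to tree IDs. This is a minimal sketch. It reads "oldest" as first-inserted (FIFO); an LRU variant would additionally call `move_to_end` on hits. The class and method names are hypothetical.

```python
from collections import OrderedDict


class DestinationSetCAM:
    """Hypothetical model of one source's destination-set CAM.

    Maps a destination set to a tree ID; on a miss the oldest tree
    (first inserted, i.e. FIFO) is replaced and setup must be performed.
    """

    def __init__(self, num_trees: int) -> None:
        self.num_trees = num_trees
        self.sets = OrderedDict()  # frozenset(dests) -> tree ID, oldest first
        self.next_id = 0

    def lookup(self, dests) -> tuple[int, bool]:
        """Return (tree_id, needs_setup) for the given destinations."""
        key = frozenset(dests)  # <5,4,2> and <2,4,5> are the same set
        if key in self.sets:
            return self.sets[key], False  # hit: reuse the existing tree
        if len(self.sets) < self.num_trees:
            tree_id = self.next_id  # a free tree ID is still available
            self.next_id += 1
        else:
            _, tree_id = self.sets.popitem(last=False)  # evict the oldest tree
        self.sets[key] = tree_id
        return tree_id, True  # miss: send setup packets before multicasting
```

With two trees, `lookup([5, 4, 2])` returns `(0, True)` and a later `lookup([2, 4, 5])` returns `(0, False)`. In that case only the tree ID needs to go into the header. Temporal reuse of destination sets is what keeps the setup path off the common case.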

11 Implementation Details (2)
VCTs provide routing, not resources: VCTs do not pre-allocate resources, and multicast arbitration is the same as unicast.
A tree branch requires multiple arbitration steps: if one desired output is blocked, the other tree branch outputs can still proceed.
The cost is longer buffer occupancy.
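The branch behavior above can be illustrated with a toy per-cycle model. Outputs that win arbitration proceed immediately, while the flit keeps its input buffer until the last branch is served. This is a hypothetical sketch. The `grants` argument stands in for the switch allocator's decisions, which are not modeled.

```python
def forward_at_fork(outputs, grants, max_cycles=1000):
    """Serve one multicast flit at a tree branch.

    outputs: ports the tree entry says this flit must leave on.
    grants:  dict port -> set of cycles in which that port wins arbitration
             (a stand-in for the switch allocator).
    Returns (served, release_cycle): the cycle each port was served and the
    cycle after which the input buffer slot is freed.
    """
    pending = set(outputs)
    served = {}
    for cycle in range(max_cycles):
        for port in sorted(pending):
            if cycle in grants.get(port, ()):
                served[port] = cycle  # this branch proceeds independently
        pending.difference_update(served)
        if not pending:
            return served, cycle  # buffer held until the last branch leaves
    raise RuntimeError("some branch never won arbitration")
```

For example, `forward_at_fork(["East", "South"], {"East": {0}, "South": {3}})` sends the East copy in cycle 0 even though South is blocked, but the buffer slot is only released after cycle 3. This is the longer buffer occupancy the slide mentions.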

12 VCTM Overhead
Access time: < 1 cycle

Virtual Circuit Tree Routing Tables
Entries   Area (mm²)   Energy (nJ)
512       0.024        0.002
1024      0.041        (not given)
2048      0.078        0.003

Destination Set CAMs
Entries   Area (mm²)   Energy (nJ)
32        0.018        0.007
64        0.021        0.010
128       0.029        0.017

13 Outline
(Same outline as slide 5.)

14 Multicasting Scenarios (1)
Token Coherence [Martin, 2003]: TokenB broadcasts for tokens; 1 token is needed to read and all tokens to write.
SGI Origin directory protocol [Laudon, 1997]: multicast invalidate requests.
Opteron protocol [Conway, 2007]: coherence requests are sent to an ordering point and broadcast to all cores, with some filtering of destinations.
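The TokenB rules above can be summarized in a few lines. The sketch also shows why broadcast protocols are friendly to tree reuse: each requester always targets the same set of nodes. It is a deliberately simplified, hypothetical model. It ignores the owner token and data-validity rules of the full Token Coherence protocol and assumes the number of tokens per block is a fixed parameter.

```python
from dataclasses import dataclass


@dataclass
class TokenState:
    """Tokens one cache holds for one block (simplified TokenB-style rules)."""
    held: int
    total: int  # tokens that exist for the block (assumed: one per core)

    def can_read(self) -> bool:
        return self.held >= 1  # at least one token permits reading

    def can_write(self) -> bool:
        return self.held == self.total  # all tokens are needed to write


def tokenb_destination_set(requester: int, num_nodes: int) -> frozenset:
    # TokenB broadcasts its token request to every other node, so each
    # requester always produces the same destination set.
    return frozenset(n for n in range(num_nodes) if n != requester)
```

On a 16-node mesh, `tokenb_destination_set(3, 16)` always yields the same 15 destinations. A single virtual circuit tree per source could therefore serve every broadcast from that source.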

15 Multicasting Scenarios (2)
Region Multicasting: a two-level protocol. 1st level: multicast to the sharers of the address region. 2nd level: fall back on the directory when no region information is available.
TRIPs [Sankaralingam, 2003]: the operand network multicasts instruction results to the tiles containing dependent instructions; 35% of dynamic instructions have 2 or more future uses.
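The two-level decision in Region Multicasting can be sketched as a destination-selection function. Known region sharers produce a multicast; otherwise the request goes to the directory. This is a hypothetical illustration. The 1024-byte region size, the `region_sharers` map, and the `directory_of` home-node function are assumptions, not details from the slide.

```python
def region_request_destinations(addr, region_sharers, directory_of, region_bytes=1024):
    """Choose destinations for a coherence request under a two-level scheme.

    region_sharers: dict region_index -> set of nodes known to share the region
    directory_of:   callable addr -> home directory node (hypothetical)
    region_bytes:   hypothetical region size
    """
    region = addr // region_bytes
    sharers = region_sharers.get(region)
    if sharers:
        # 1st level: multicast to the known sharers of the address region.
        return frozenset(sharers), "multicast"
    # 2nd level: no region information, fall back on the directory.
    return frozenset({directory_of(addr)}), "directory"
```

Because sharer sets vary from region to region, this scheme produces many distinct destination sets. That is consistent with the wide variety reported for Region Multicast in the characterization.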

16 Multicasting Scenarios (3)
Uncorq [Strauss, 2007]: unordered broadcast, ordered response network.
Virtual Hierarchies [Marty, 2007]: 1st level directory, 2nd level global broadcast.
Dynamic NUCA caches [Kim, 2002]: multicast to search for a cache hit.

17 Characterizing Multicasts
Metrics: unique destination sets (the distinct combinations of destinations in multicasts) and the number of destinations per multicast.
Token: 1 destination set for each node.
TRIPs and Directory: small destination sets.
TokenB and Opteron: large destination sets.
Region Multicast: a wide variety of sizes.
Region and Directory: a much larger variety of destination sets.
Up to 13% of traffic is multicast.
VCTM is an inexpensive solution to support multicasting.
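The two metrics above can be computed from a message trace roughly as follows. This is a hypothetical sketch with three stated assumptions. A message with more than one destination counts as a multicast. Traffic share is measured by message count rather than flits. Destination sets are counted per source, since trees belong to a source.

```python
from collections import Counter


def characterize(trace):
    """Summarize multicast behaviour of a message trace.

    trace: iterable of (src, dests) pairs. Assumptions: a message with more
    than one destination counts as a multicast, traffic share is by message
    count, and destination sets are counted per source (trees are per source).
    """
    total = 0
    multicast_sets = []
    for src, dests in trace:
        total += 1
        if len(dests) > 1:
            multicast_sets.append((src, frozenset(dests)))
    return {
        "multicast_fraction": len(multicast_sets) / total if total else 0.0,
        "unique_destination_sets": len(set(multicast_sets)),
        "destinations_per_multicast": Counter(len(d) for _, d in multicast_sets),
    }
```

Few unique destination sets relative to the number of multicasts means high temporal reuse. In that case a small tree table per source captures most of the traffic.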

18 Simulation Methodology
Network traffic from 5 different scenarios.
A detailed network simulator with cycle-accurate modeling of router stages.
The flexible, lightweight VCTM mechanism provides improvement for diverse scenarios; many more results are in the paper.

19 Network Configuration
Topology: 4-ary 2-mesh; 5-ary 2-mesh (TRIPs)
Routing: dimension order (X-Y)
Channel width: 16 bytes
Packet size: 1 flit (coherence request = address + command); 5 flits (data); 3 flits (TRIPs)
Virtual channels: 4
Buffers per port: 24
Router ports: 5
Virtual Circuit Trees: varied from 16 to 4K (1 to 256 VCTs/core)
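The per-core tree range follows from the node count of the 4-ary 2-mesh. A short worked check, assuming one core per mesh node and reading 4K as 4096:

```latex
N = k^{n} = 4^{2} = 16 \text{ nodes}, \qquad
\frac{16 \text{ trees}}{16 \text{ cores}} = 1 \text{ VCT/core}, \qquad
\frac{4096 \text{ trees}}{16 \text{ cores}} = 256 \text{ VCTs/core}
```

By the same formula, the 5-ary 2-mesh used for TRIPs has 5^2 = 25 nodes.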

20 Power Savings
On-chip networks consume up to ~36% of chip power [Wang, 2002].
Links, buffers, and crossbars consume nearly 100% of network power.
Power is saved through activity reduction.
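One common first-order way to see why activity reduction saves power is to count events per component and multiply by a per-event energy. The formula below is a generic activity-based model, assumed here rather than taken from the paper. It keeps only the three components named above, since together they account for nearly all network power.

```latex
E_{\text{net}} \approx N_{\text{buf}}\,E_{\text{buf}} + N_{\text{xbar}}\,E_{\text{xbar}} + N_{\text{link}}\,E_{\text{link}}
```

Here N_buf, N_xbar, and N_link are the numbers of buffer accesses, crossbar traversals, and link traversals, and each E term is the energy per event. Sending one multicast as d unicasts charges every router and link on a shared path once per copy. A tree charges them once, which lowers all three N terms.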

21 Performance Results Summary
SPECweb: 12%; Art: 55%; TPC-H: 68%.
A small number of trees is required for the majority of the benefit.
Performance improvement depends on network pressure.

22 VCTM vs. Aggressive Network
VCTM outperforms an aggressive (unrealistic) network.

23 VCTM Summary (1)
Improves performance across a variety of scenarios: reduces interconnect latency by up to 90% and switching activity by up to 53%.
A small number of trees is necessary: 8 trees/core achieves substantial benefit.
Dynamic table partitioning could further reduce total tree storage.

24 VCTM Summary (2)
Outperforms an aggressive router.
No impact on unicast performance.
Integrates with existing state-of-the-art router architecture.
Easily extendable to more scalable topologies and routing algorithms.
Opens the door for new optimizations.

25 Thank you. Questions?

