
1 Chip Multiprocessor Coherence and Interconnect System Design Natalie Enright Jerger PhD Defense October 20, 2008

2 Key Contributions – Interconnects (1)
• Trend: proliferation of architectures requiring broadcast and multicast communication
• Challenge: broadcast/multicast is too expensive for on-chip networks
• Solution: Virtual Circuit Tree Multicasting
  • Improves throughput with low-overhead routing tables

3 Key Contributions – Interconnects (2)
• Trend: interconnect latency has a significant impact on overall performance
• Challenge: router overhead contributes substantially to total network latency
• Solution: Hybrid Circuit Switching
  • Circuit switching can be made fast without sacrificing bandwidth

4 Key Contributions – Cache Coherence
• Trend: the need for all cores to observe a given coherence request will be rare
• Challenge: low-latency cache-to-cache transfers without high bandwidth overhead
• Solutions: leverage properties of the interconnection network
  • Virtual Tree Coherence: efficient multicast
  • Circuit-Switched Coherence: optimized pair-wise sharing

5 Outline
• Overview/Background
  • On-chip Networks
  • Cache Coherence
• Virtual Circuit Tree Multicasting
• Hybrid Circuit Switching
• Virtual Tree Coherence
• Circuit-Switched Coherence
• Conclusion/Future Work

6 Communication system co-design
• Communication consists of both:
  • How data is communicated (interconnect)
  • What is communicated, to whom, and when (coherence)
• Focus on co-design:
  • Interconnect provides efficient multicasting → design a multicast coherence protocol
  • Interconnect optimizes pair-wise communication → integrate pair-wise sharing into the protocol

7 On-Chip Interconnection Networks
• Replace global wiring as the communication medium
  • Global wires: prohibitive latency, inefficient, poor scalability
• Modular network design
  • High bandwidth
  • Multiple concurrent communication flows
• Challenges:
  • Low-latency communication
  • Low power and area

8 Multicore Examples: Sun Niagara (crossbar interconnect)

9 Multicore Examples: IBM Cell (ring interconnect)

10 Many Core Examples
• Intel Terascale: 80-core prototype, 2-D mesh topology
• MIT RAW, TRIPS: also 2-D meshes

11 Designing an on-chip network
• Routing: path from source to destination
• Switching & flow control: packet switching, virtual channels
• Router micro-architecture: logic, buffers

12 Routing Design Decisions
• Determine the path through the network
• Unicast (single source to a single destination): highly optimized, e.g., deterministic X-Y routing
• Multicast (one source, multiple destinations): a small percentage of traffic
• Current solution: replicate the message at the source (multiple unicasts), which is too expensive and unnecessary
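To make the cost of multicast-as-multiple-unicasts concrete, here is a minimal sketch of dimension-order X-Y routing on a 2-D mesh and a link-traversal count for both approaches. The function names and coordinate convention are illustrative, not the thesis's simulator code:

```python
def xy_route(src, dst):
    """Dimension-order (X-Y) route on a 2-D mesh: correct X first, then Y.
    Returns the list of links (node pairs) traversed."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def unicast_link_traversals(src, dests):
    """Multicast as multiple unicasts: every copy pays for its full path."""
    return sum(len(xy_route(src, d)) for d in dests)

def tree_link_traversals(src, dests):
    """Tree-based multicast: links shared by several destinations are
    traversed once (the union of the per-destination routes)."""
    shared = set()
    for d in dests:
        shared.update(xy_route(src, d))
    return len(shared)
```

For example, multicasting from (0, 0) to the three nodes (3, 0), (3, 1), and (3, 2) costs 12 link traversals as separate unicasts, but only 5 on a shared tree, since the three X-direction hops are traversed once instead of three times.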

13 Communication latency
• Network latency
  • Router overheads
  • Network congestion due to inefficient multicasting
• Coherence protocol
  • Cache-to-cache transfer latency
  • Bandwidth requirements

14 Existing Coherence Schemes
• Broadcast
  • Low latency
  • Simple ordering (using a bus)
  • Single ordering point → bottleneck
  • Low-bandwidth interconnect
  • High coherence traffic requirements
• Directory
  • Distributed ordering point
  • Mostly unicast coherence traffic
  • Scalable topologies
  • Indirection latency
  • Storage overhead

15 Full-chip Coherence: Broadcast
• Too expensive: too much power, too much bandwidth
• The majority of cores do not need to see each broadcast

16 Full-chip Coherence: Directory
• Long latency: a store miss must go to the directory, be forwarded, and wait for the response, even though sharers may be closer than the directory
• Directory storage overheads

17 Hierarchical Coherence for CMPs
• Limit coherence actions to a subset of cores
  • Reduces bandwidth demands and power consumption
  • A multicast to active sharers
• Fast cache-to-cache transfers
  • Avoids indirection through a directory
• Limited storage overhead
  • Leverage coarse-grain information → efficient directories

18 Workloads & Coherence
• Hierarchical coherence → a good match to applications
• Server consolidation
  • Emerging class of workloads for many-core
  • Limited global (inter-VM) communication
  • Communication and coherence dominated by intra-VM sharing
• Traditional workloads
  • Small number of sharers per cache line or region (multiple cache lines)

19 Outline
• Motivation
• Virtual Circuit Tree Multicasting
  • Baseline Inefficiencies
  • Router Architecture
  • Results
• Hybrid Circuit Switching
• Virtual Tree Coherence
• Circuit-Switched Coherence
• Conclusion/Future Work

20 State-of-the-Art Router
• Current router architectures are poorly equipped to handle even a low amount of multicast (MC) traffic

21 Packet-Switched Router
• 3-stage router with virtual-channel flow control (Buffer Write; Virtual Channel/Switch Allocation; Switch Traversal; then Link Traversal)
• Based on the most aggressive recent proposals
• Even this aggressive baseline is not well matched to all types of communication
  • Multicast is performed as multiple unicasts

22 Baseline Router Example
• More resources to solve this problem? More buffers, virtual channels, links?

23 Key Router Problems
• Injection bandwidth: a burst of messages at the network interface
• Redundant (wasteful) use of resources: the same payload occupies extra buffers and links
• Alternative routing improves throughput but wastes power
• Speculation problems: speculation is predicated on low loads, which a burst of messages defeats

24 VCTM Overview
• Multiple unicasts vs. tree-based multicast

25 VCTM Overview
• Mapping trees onto a mesh (or an arbitrary topology) with dimension-order X-Y routing

26 Virtual Circuit Tree Multicasting Example
• Builds on the existing state-of-the-art router; unicast performance is not impacted
• Build multicast trees incrementally: 1 unicast setup packet per destination installs per-router output-port entries (e.g., M: <East, South>, M: <Eject, South>), so later multicasts inject only a single packet into the network
• Injection problem solved; link redundancy removed; fewer packets improve speculation
• Tree reuse is necessary for effectiveness: there is significant temporal destination-set reuse across all scenarios
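The incremental setup above can be sketched as follows: each setup packet walks the X-Y route to one destination and adds output ports to each router's tree-table entry. Data structures and names here are illustrative, not the thesis's exact hardware tables:

```python
from collections import defaultdict

def xy_links(src, dst):
    """Links on the dimension-order (X-Y) route from src to dst."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        step = 1 if dx > x else -1
        links.append(((x, y), (x + step, y)))
        x += step
    while y != dy:
        step = 1 if dy > y else -1
        links.append(((x, y), (x, y + step)))
        y += step
    return links

def port_toward(a, b):
    """Output port at router a for the link toward neighbor b."""
    if b[0] != a[0]:
        return "East" if b[0] > a[0] else "West"
    return "South" if b[1] > a[1] else "North"

def setup_tree(tables, tree_id, src, dests):
    """Incremental tree construction: one unicast setup packet per
    destination adds output ports to each router's tree-table entry."""
    for dst in dests:
        node = src
        for a, b in xy_links(src, dst):
            tables[a][tree_id].add(port_toward(a, b))
            node = b
        tables[node][tree_id].add("Eject")

tables = defaultdict(lambda: defaultdict(set))
setup_tree(tables, "M", (0, 0), [(1, 0), (1, 1)])
```

After setup, a multicast flit for tree M arriving at (1, 0) is forked to both Eject and South, matching the `M: <Eject, South>` entry in the slide.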

27 VCTM Router Architecture
• Adds a Virtual Circuit Tree Table, indexed by source and VCT number, that supplies the routing computation with an output-port bit-vector (Id, Ej, N, S, E, W) alongside the usual virtual-channel and switch allocators, input ports, and multicast VCs

28 Implementation Details (1)
• Destination-set lookup via a Content Addressable Memory (CAM)
• If the destination set is not present → replace a tree and perform setup
• The matching entry's tree ID (e.g., tree 2) is encoded into the multicast header
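A behavioral sketch of the destination-set CAM: the key is the set of destinations, the value is a tree ID, and a miss replaces an entry (triggering setup in the real router). The LRU replacement policy here is an assumption for illustration; the slide does not specify one:

```python
from collections import OrderedDict

class DestSetCAM:
    """Sketch of the destination-set CAM: maps a destination set
    to a virtual circuit tree ID."""
    def __init__(self, num_trees):
        self.num_trees = num_trees
        self.entries = OrderedDict()  # frozenset(dests) -> tree ID

    def lookup(self, dests):
        """Returns (tree_id, hit). A miss allocates or replaces an entry,
        which in the real router triggers unicast setup packets."""
        key = frozenset(dests)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark most recently used
            return self.entries[key], True
        if len(self.entries) < self.num_trees:
            tree_id = len(self.entries)            # free tree available
        else:
            _, tree_id = self.entries.popitem(last=False)  # evict LRU, reuse ID
        self.entries[key] = tree_id
        return tree_id, False
```

Note the key is a set, so {1, 2} and {2, 1} hit the same tree: it is the destination combination, not the request order, that identifies a tree.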

29 Implementation Details (2)
• VCTs provide routing, not resources
  • Multicast arbitration is the same as unicast; VCTs do not pre-allocate resources
• Multiple arbitration steps at a tree branch
  • If one desired output is blocked, the other tree-branch outputs can still proceed
  • Cost: longer buffer occupancy
• VCT entries are statically partitioned equally among nodes (dynamic partitioning also explored)

30 Characterizing Multicasts
• Unique destination sets: the combination of destinations in a multicast
  • TokenB: 1 destination set per node; VTC and Directory: a much larger variety of destination sets
• Number of destinations per multicast
  • TRIPS and Directory: small destination sets; TokenB and Opteron: large destination sets; VTC: a wide variety of sizes
• Up to 13% of traffic is multicast; VCTM is an inexpensive solution to support it

31 Simulation Methodology
• Network traffic from 5 different scenarios
• Detailed network simulator with cycle-accurate modeling of router stages
• Scenarios evaluated: TokenB, VTC, AMD Opteron, SGI Origin Directory, TRIPS*
• *Thanks to Niket Agarwal & Noel Eisley for the TokenB, Opteron, and TRIPS traces

32 Network Configuration
• Topology: 4-ary 2-mesh; 5-ary 2-mesh (TRIPS)
• Routing: dimension-order X-Y routing
• Channel width: 16 bytes
• Packet size: 1 flit (coherence request = address + command); 5 flits (data); 3 flits (TRIPS)
• Virtual channels: 4
• Buffers per port: 24
• Router ports: 5
• Virtual circuit trees: varied from 16 to 4K (1 to 256 VCTs/core)

33 Power Savings
• On-chip networks consume up to ~36% of chip power [Wang, 2002]
• Links, buffers, and crossbars consume nearly 100% of network power
• Power is saved through activity reduction

34 Performance Results Summary
• A small number of trees per core delivers the majority of the benefit
• Performance improvement depends on network pressure (TPC-H: 68%, Art: 55%, SPECweb: 12%)

35 VCTM vs. Aggressive Network
• VCTM outperforms an aggressive (unrealistic) network

36 Related Work
• Off-chip multicast routers [Sivaram, 99; Turner, 94; Stunkel, 99; Chiang, 95; Malumbres, 96]
  • Expensive: high-radix routers, large routing tables, targeting multi-stage networks
• On-chip
  • Path-based, SoC [Lu, 06]
  • D-NUCA cache [Jin, 07]
  • Table-based [Rodrigo, 08]

37 VCTM Summary
• Key contributions
  • Characterized multicasting: support is necessary
  • VCTM improves network throughput
• Improves performance across a variety of scenarios
  • Reduces interconnect latency by up to 90%
  • Reduces switching activity by up to 53%
• Only a small number of trees is necessary (8/core)

38 Outline
• Motivation
• Virtual Circuit Tree Multicasting
• Hybrid Circuit Switching
  • Brief (work presented at Prelim) (2 slides)
  • Extended (12 slides)
• Virtual Tree Coherence
• Circuit-Switched Coherence
• Conclusion/Future Work

39 Hybrid Circuit Switching (1)
• Latency is critical: utilize circuit switching for lower latency
  • A circuit connects resources across multiple hops to avoid router overhead
• Traditional circuit switching performs poorly
• HCS contributions:
  • Novel setup mechanism
  • Bandwidth stealing

40 Hybrid Circuit Switching (2)
• Hybrid network: interleaves circuit-switched and packet-switched flits
• Optimizes setup latency
• Improves throughput over traditional circuit switching: packet-switched flits snoop idle bandwidth
• Reduces interconnect delay by an average of 18% (up to 22%)

41 Outline
• Motivation
• Virtual Circuit Tree Multicasting
• Hybrid Circuit Switching
• Virtual Tree Coherence
  • Hierarchical Coherence
  • Coarse Grain Regions
  • Ordering
  • Results
• Circuit-Switched Coherence
• Conclusion/Future Work

42 Virtual Tree Coherence
• Local coherence: snoopy multicast
  • Root information present: send the coherence request to the tree root
  • The tree root multicasts the request to sharers using VCTM
  • Add the requestor to the sharing vector
  • Sharers supply data (and invalidate on a store)
• Global coherence (no root information):
  • Request the root from a coarse directory, or broadcast for the root
  • Then perform the first-level actions
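The first-level protocol at a region's tree root can be sketched as follows. The class and field names, and the representation of the sharing vector as a set, are illustrative rather than the thesis's exact structures:

```python
class TreeRoot:
    """Sketch of Virtual Tree Coherence first-level ordering at a
    region's tree root."""
    def __init__(self, region, sharers):
        self.region = region
        self.sharers = set(sharers)  # current sharing vector
        self.order = []              # order the root imposes on requests

    def request(self, requestor, is_store):
        """Order the request, return the multicast destination set
        (the current sharers, reached via VCTM), and update the
        sharing vector."""
        self.order.append(requestor)
        targets = set(self.sharers)     # sharers supply data
        if is_store:
            self.sharers = {requestor}  # sharers invalidate on a store
        else:
            self.sharers.add(requestor)
        return targets
```

Using the slide-48 example (region A, sharers {0, 1, 4, 5}, competing stores from nodes 1 and 0), the root serializes the stores as 1 then 0, and every sharer observes that same order.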

43 Coarse Grain Region Overview
• Take a macro view of memory: better observe access patterns (1 KB regions)
• Use RegionTracker for efficient storage [Zebchuk, 07]
• Region-based optimizations
  • Coarse-Grain Coherence Tracking [Cantin, 05]: eliminate unnecessary broadcasts
  • Track sharers and roots at coarse granularity
  • Multicast to sharers for data
  • Efficient upgrading for non-shared regions

44 Selecting Root Node
• Root should be a sharer → fast ordering
• First touch: possibility of a single node becoming root for many regions
• Migration
  • Migrate the root to a more central location as sharers are added
  • Forces the majority of roots onto center nodes
  • Forward in-flight requests to the old root
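A minimal sketch of the migration heuristic: among the current sharers, pick the root that minimizes the total ordering hop count on the mesh. This is an illustrative policy matching "migrate to a more central location" (and slide 56 notes that migrating on hop count alone can create hotspots):

```python
def hops(a, b):
    """Manhattan distance between two mesh nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def ordering_cost(root, sharers):
    """Total hops for all sharers' requests to reach the ordering point."""
    return sum(hops(s, root) for s in sharers)

def migrate_root(sharers):
    """Root must be a sharer (fast ordering); choose the sharer with the
    lowest ordering hop count. Tie-breaking is arbitrary in this sketch."""
    return min(sharers, key=lambda r: ordering_cost(r, sharers))
```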

45 Global Coherence Trade-offs
• Global coarse directories: storage overhead; unicast requests
• Global broadcast: no storage overhead; broadcast traffic is expensive but rare

46 Virtual Tree Coherence Ordering
• Intra-region order
• Ordering point
  • The tree root orders all messages to the same address region
  • Multiple roots prevent an ordering bottleneck
• Sharers
  • A multicast must contain all current sharers
  • Too many false sharers → wasted power; too few → incorrect execution

47 Preserving Network Order
• In an arbitrary topology, messages can become reordered either
  • By virtual channel allocation, or
  • By adaptive routing
• Restrict virtual channel assignment to prevent reordering in the network
  • A tree must use the same VC throughout its network traversal
• Leaves (sharers) then observe requests in the same order the root processed them
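The hazard and the fix can be illustrated with a toy single-hop model: each virtual channel is a FIFO, so flits on the same VC can never pass each other, but a switch that drains VCs in some priority order can interleave flits on different VCs. The drain policy below is an arbitrary assumption purely to exhibit the reordering:

```python
from collections import deque

def traverse_hop(arrivals, num_vcs):
    """One router hop: per-VC FIFO buffers, then a switch that drains the
    VCs round-robin (an arbitrary policy for illustration). 'arrivals' is
    a list of (vc, msg) in arrival order; returns departures in order."""
    bufs = [deque() for _ in range(num_vcs)]
    for vc, msg in arrivals:
        bufs[vc].append(msg)
    out = []
    while any(bufs):
        for vc in reversed(range(num_vcs)):  # drain order interleaves VCs
            if bufs[vc]:
                out.append((vc, bufs[vc].popleft()))
    return out
```

If the root's requests all ride the same VC, their order survives the hop; if they split across VCs, the switch can deliver them out of order, which is exactly what the single-VC-per-tree restriction forbids.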

48 Virtual Tree Coherence Example
• Competing store requests to address A[0] from nodes 0 and 1
• Region table: region A has sharers {0, 1, 4, 5} and root 4; region B has sharers {2, 3} and root 2
• The root orders node 1's store first, then node 0's; all sharers observe the same order (1, 0), invalidating A[0], and each store completes in turn when its data response arrives

49 Inter-Region Order (1)
• Safe to collect acknowledgments on every store miss (Opteron-like), given our intra-region tree order
• But acknowledging every store miss is overly conservative
• Before a dirty block becomes visible to the system (leaves the core), we must fence on all previously dirtied/unfenced regions
  • A fence is issued for each region on that region's tree
• A memory barrier requires that fences on all dirty/unfenced regions are complete (PowerPC consistency)

50 Inter-Region Order (2)
• Write atomicity
  • Remote reads to dirty/unfenced regions initiate fence(s) and are deferred until the fence(s) complete
  • Fence acks guarantee that all prior stores (invalidates) to that region have been sunk by all cores
• Write serialization
  • Writes within a region are serialized through tree order
  • Writes across a region boundary (e.g., St A, St B) are serialized with a fence
  • The fence occurs on the tree of the earlier store (A): an eager fence

51 Simulation Methodology
• PHARMsim: full-system multi-core simulator
  • Detailed network-level model with VCTM
• Baseline: directory coherence
• Workloads:
  • Commercial: SPECjbb, SPECweb, TPC-H, TPC-W
  • Scientific: Barnes-Hut, Ocean, Radiosity, Raytrace
  • Server consolidation

52 Simulation Configuration
• Processors
  • Cores: 16, 64 in-order general purpose
• Memory system
  • L1 I/D caches: 32 KB, 2-way set associative, 1 cycle
  • Private L2 caches: 1 MB, 8-way set associative, 6 cycles, 64-byte lines
  • RegionTracker (associated with each L2): 1024 sets, 8 ways, 1 KB regions
• Interconnect: 4x4 2-D mesh
  • Packet-switched baseline: 3 router stages; 8 virtual channels with 4 buffers each
  • Virtual Circuit Tree Multicasting: 64 trees per source node; 1024 or 4096 total trees

53 Performance (16 cores)
• Outperforms Directory by an average of 25%
• The difference between VTC-Bcast and VTC-Mcast-* will become more pronounced in larger systems

54 Server Consolidation Performance
• The performance difference between VTC-Bcast and VTC-Mcast-Dir grows from 6% to 16% when moving from 16 to 64 cores

55 Ordering Hop Counts
• Significant reduction in the hop count needed to order requests: 21% and 41% on average

56 Tree Root Migration: Hotspots
• Migration based on lowest ordering hop count results in severe congestion

57 Activity (16 cores)
• VTC-Mcast-Dir increases bandwidth 2.1x over Directory

58 Related Work
• Unordered interconnects
  • Logical/virtual ordering: UnCorq [Strauss, 07]
  • Token counting: Token Coherence [Martin, 03]
• Hierarchical coherence
  • Virtual Hierarchies [Marty, 07]
  • Multiprocessors built from SMP building blocks [Laudon, 97; Lenoski, 90; Lovett, 96; Hagersten, 99]
• Multicast coherence
  • Multicast Snooping, Destination Set Prediction [Bilir, 99; Martin, 03]

59 Virtual Tree Coherence Summary
• High-bandwidth, low-latency ordering substrate
• Fast cache-to-cache transfers
• Efficient bandwidth utilization through multicasting
• Coarse-grain tracking of sharers
• Scales well for server consolidation workloads

60 Outline
• Motivation
• Virtual Circuit Tree Multicasting
• Hybrid Circuit Switching
• Virtual Tree Coherence
• Circuit-Switched Coherence
  • Brief (work presented at Prelim) (1 slide)
  • Extended (6 slides)
• Conclusion/Future Work

61 Circuit-Switched Coherence
• Goal: better exploit circuits through the coherence protocol
  • Observe pair-wise sharing
  • Take the directory off the critical path
  • The directory remains the sole ordering point
• Performance improvement increases with HCS + protocol optimization
  • Protocol optimization drives up circuit reuse, better utilizing HCS
  • Performance improvement up to 18%

62 Outline
• Motivation
• Virtual Circuit Tree Multicasting
• Hybrid Circuit Switching
• Virtual Tree Coherence
• Circuit-Switched Coherence
• Conclusion/Future Work

63 Conclusion
• Communication behaviors must influence interconnect design
  • The coherence protocol dictates interconnect traffic
  • Designing networks solely for unicast traffic is problematic
• The protocol should take advantage of interconnect properties
  • If the interconnect provides implicit ordering → use it

64 Future Work
• Network quality of service/fairness
  • Root selection creates potential for hotspots in the network
• Reduction/collection networks
  • Reduce interconnect activity by combining acknowledgments in the network
• Techniques to improve the scalability of VCTM

65 Thank you

66 Extended Versions (HCS, CSC)

67 Hybrid Circuit Switching (Extended)
• Latency is critical: utilize circuit switching for lower latency
  • A circuit connects resources across multiple hops to avoid router overhead
• Traditional circuit switching performs poorly
• HCS contributions:
  • Novel setup mechanism
  • Bandwidth stealing

68 Traditional Circuit Switching: Path Setup (with Acknowledgement)
• A configuration probe and its acknowledgement must cross the network before data can use the circuit
• Significant latency overhead prior to data transfer
• Other requests are forced to wait for resources

69 Novel Circuit Setup Policy
• Overlap circuit setup with the 1st data transfer
• Reconfigure existing circuits if no unused links are available
  • Allows the piggy-backed request to always achieve low latency
• Multiple circuit planes prevent frequent reconfiguration

70 Setup Network
• Light-weight setup network
  • Narrow: circuit plane identifier (2 bits) + destination (4 bits)
  • Low load: no virtual channels → small area footprint
• Stores circuit configuration information
• Multiple narrow circuit planes prevent frequent reconfiguration
• Reconfiguration is buffered and traverses the packet-switched pipeline
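A quick sketch of why the setup network can be so narrow: the fields above pack into 6 bits per flit. The field layout (plane ID in the high bits) is an assumption for illustration; only the widths come from the slide:

```python
PLANE_BITS = 2  # circuit plane identifier: 4 circuit planes
DEST_BITS = 4   # destination node: 16 cores in a 4x4 mesh

def pack_setup(plane, dest):
    """Pack a setup-network flit: 2-bit plane ID + 4-bit destination
    = 6 bits total."""
    assert 0 <= plane < (1 << PLANE_BITS)
    assert 0 <= dest < (1 << DEST_BITS)
    return (plane << DEST_BITS) | dest

def unpack_setup(flit):
    """Recover (plane, dest) from a packed setup flit."""
    return flit >> DEST_BITS, flit & ((1 << DEST_BITS) - 1)
```

Six bits per flit, no virtual channels, and low load are what keep the setup network's area footprint small relative to the 16-byte-wide data network.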

71 Packet-Switched Bandwidth Stealing
• Remember: the problem with traditional circuit switching is poor bandwidth
  • Need to overcome this limitation
• Hybrid circuit-switched solution: packet-switched messages snoop incoming links
  • When there are no circuit-switched messages on the link, a waiting packet-switched message can steal the idle bandwidth
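A per-cycle view of bandwidth stealing on one link, as a sketch: circuit-switched flits always own the link, and a queued packet-switched flit takes any cycle the circuit leaves idle. The input format and queueing discipline are illustrative assumptions:

```python
from collections import deque

def schedule_link(cycles):
    """Link arbitration sketch. 'cycles' is a list of
    (cs_flit_or_None, newly_arrived_ps_flits) per cycle; returns what
    the link carries each cycle."""
    ps_queue = deque()
    sent = []
    for cs_flit, new_ps in cycles:
        ps_queue.extend(new_ps)
        if cs_flit is not None:
            sent.append(cs_flit)             # circuit traffic owns the link
        elif ps_queue:
            sent.append(ps_queue.popleft())  # steal the idle cycle
        else:
            sent.append(None)                # link idle
    return sent
```

This is how HCS recovers the bandwidth that traditional circuit switching wastes: idle circuit cycles are never left empty while packet-switched flits are waiting.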

72 Hybrid Circuit-Switched Router Design
• (Router datapath: injection and N/S/E/W input ports through the allocators and crossbar to the ejection and N/S/E/W output ports)

73 HCS Pipeline
• Circuit-switched messages: 1 router stage (plus link traversal)
• Packet-switched messages: 3 router stages (Buffer Write; Virtual Channel/Switch Allocation; Switch Traversal; then Link Traversal)
  • Aggressive speculation reduces the number of stages

74 Simulation Workloads
• Commercial
  • SPECjbb: Java server workload, 24 warehouses, 200 requests
  • SPECweb: web server, 300 requests
  • TPC-W: web e-commerce, 40 transactions
  • TPC-H: decision support system
• Scientific
  • Barnes-Hut: 8k particles, full run
  • Ocean: 514x514, parallel phase
  • Radiosity: parallel phase
  • Raytrace: car input, parallel phase
• Synthetic
  • Uniform random: destination selected with a uniform random distribution
  • Permutation traffic: each node communicates with one other node (pair-wise)

75 Simulation Configuration
• Processors
  • Cores: 16 in-order general purpose
• Memory system
  • L1 I/D caches: 32 KB, 2-way set associative, 1 cycle
  • Private L2 caches: 512 KB, 4-way set associative, 6 cycles, 64-byte lines
  • Shared L3 cache: 16 MB (1 MB bank/tile), 4-way set associative, 12 cycles
  • Main memory latency: 100 cycles
• Interconnect: 4x4 2-D mesh
  • Packet-switched baseline: optimized 1-3 router stages; 4 virtual channels with 4 buffers each
  • Hybrid circuit switching: 1 router stage; 2 or 4 circuit planes

76 Network Results
• Communication latency is key: shave precious cycles off network latency

77 Flit Breakdown
• Interconnect latency is reduced for a significant fraction of messages

78 Uniform Random Traffic
• HCS successfully overcomes the bandwidth limitations associated with circuit switching

79 HCS Conclusion
• Overcomes the bandwidth limitations of traditional circuit switching
• Fast setup technique for low latency
• Improves performance
  • Network latency reduced by an average of 18% (up to 22%)
  • Overall performance improvement: up to 7%

80 Circuit-Switched Coherence (Extended)
• Goal: better exploit circuits through the coherence protocol
• Modifications:
  • Allow a cache to send a request directly to another cache
  • Notify the directory in parallel
  • Prediction mechanism for pair-wise sharers
  • The directory remains the sole ordering point

81 Circuit-Switched Coherence Optimization
• Example: node 1 sends Read A directly to predicted sharer node 2 (step 1), node 2 sends the data response straight back (step 2), and the directory is updated in parallel (step 3: Update A / Ack A), transitioning A from Exclusive with sharer {2} to Shared with sharers {1, 2}

82 Region Prediction
• Each memory region spans 1 KB
• Takes advantage of spatial and temporal sharing
• Example: a miss to A[0] goes through the directory (1: Miss A[0]; 2: Forward Read A[0]; 3: Data Response A[0]) and updates the requester's region table (4: Region A → node 2), so the subsequent Read A[1] (5) goes directly to the predicted sharer
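The region predictor can be sketched as a small table keyed by the 1 KB region of the address, remembering which node last supplied data for that region. The table structure and method names are illustrative; only the 1 KB granularity comes from the slide:

```python
REGION_BITS = 10  # 1 KB regions: addr >> 10 identifies the region

def region_of(addr):
    """Region number for an address (1 KB granularity)."""
    return addr >> REGION_BITS

class RegionPredictor:
    """Sketch of the pair-wise sharer prediction table."""
    def __init__(self):
        self.table = {}  # region number -> predicted sharer node

    def predict(self, addr):
        """Predicted sharer for this address's region, or None
        (fall back to the directory)."""
        return self.table.get(region_of(addr))

    def update(self, addr, supplier):
        """Learn the supplier after a directory-forwarded response."""
        self.table[region_of(addr)] = supplier
```

Because the key is the region, a hit learned from A[0] immediately covers A[1] and every other line in the same 1 KB region, which is the spatial-sharing advantage the slide describes.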

83 Simulation Configuration
• Processors
  • Cores: 16 in-order general purpose
• Memory system
  • L1 I/D caches: 32 KB, 2-way set associative, 1 cycle
  • Private L2 caches: 512 KB, 4-way set associative, 6 cycles, 64-byte lines
  • Shared L3 cache: 16 MB (1 MB bank/tile), 4-way set associative, 12 cycles
  • Main memory latency: 100 cycles
• Interconnect: 4x4 2-D mesh
  • Packet-switched baseline: optimized 1-3 router stages; 4 virtual channels with 4 buffers each
  • Hybrid circuit switching: 1 router stage; 2 or 4 circuit planes

84 HCS + Protocol Optimization
• The improvement of HCS + protocol optimization is greater than the sum of HCS or protocol optimization alone
• Protocol optimization drives up circuit reuse, better utilizing HCS

85 Circuit-Switched Coherence Summary
• Replace the packet-switched mesh with a hybrid circuit-switched mesh
  • Interleave circuit- and packet-switched flits
  • Reconfigurable circuits
• Dedicated bandwidth for frequent pair-wise sharers
• Low latency and low power: avoid switching/routing
• Devise novel coherence mechanisms to take advantage of the benefits of circuit switching

86 Inter-Region Order (3)
• Option 1: require acknowledgments from multicast sharers
• Option 2: coarse-grain fences
  • Fence between stores to different regions; significantly reduces acknowledgments
  • A fence is not required if the previous region was held Exclusive
• Same region (P0: St A[0], St A[1]; P1: Rd A[1], Rd A[0]): if P1 observes the store to A[1], it has also observed A[0], because of tree order
• Different regions (P0: St A, St B; P1: Rd B, Rd A): St A may be delayed on Tree 1 while St B arrives quickly on Tree 2, so P1 could read the new value of B and a stale value of A; remote reads to B cannot be satisfied by P0 until acks are collected for the fences on A and B (an implicit fence follows each store miss)

87 Inter-Region Order (4)
• Write atomicity example: P0: A = 1; P1: while (A == 0); B = 1; P2: while (B == 0); Rd A
• Without write atomicity, P2 can read B = 1 but A = 0

88 Inter-Region Order (5)
• Write atomicity: remote reads to a dirty/unfenced region require a fence to guarantee that the store is visible to all processors
• In the example, P1's remote read of A triggers Fence A and cannot be satisfied until Fence A has been sunk by everyone, so P2 will read A = 1

