Chip Multiprocessor Coherence and Interconnect System Design Natalie Enright Jerger PhD Defense October 20, 2008
Key Contributions: Interconnects (1)
Trend: proliferation of architectures requiring broadcast and multicast communication.
Challenge: broadcast is too expensive for on-chip networks.
Solution: Virtual Circuit Tree Multicasting, which improves throughput with low-overhead routing tables.
Key Contributions: Interconnects (2)
Trend: interconnect latency has a significant impact on overall performance.
Challenge: router overhead contributes substantially to total network latency.
Solution: Hybrid Circuit Switching. Circuit switching can be made fast without sacrificing bandwidth.
Key Contributions: Cache Coherence
Trend: it will be rare for all cores to need to observe a coherence request.
Challenge: low-latency cache-to-cache transfers without high bandwidth overhead.
Solutions: leverage properties of the interconnection network.
Virtual Tree Coherence: efficient multicast.
Circuit-Switched Coherence: optimized pair-wise sharing.
Outline
Overview/Background: On-Chip Networks, Cache Coherence
Virtual Circuit Tree Multicasting
Hybrid Circuit Switching
Virtual Tree Coherence
Circuit-Switched Coherence
Conclusion/Future Work
Communication System Co-Design
Communication consists of both how data is communicated (interconnect) and what is communicated, to whom, and when (coherence).
Focus on co-design:
If the interconnect provides efficient multicasting, design a multicast coherence protocol.
If the interconnect optimizes pair-wise communication, integrate pair-wise sharing into the protocol.
On-Chip Interconnection Networks
Replace global wiring as the communication medium: global wires suffer prohibitive latency, inefficiency, and poor scalability.
Benefits: modular network design, high bandwidth, multiple concurrent communication flows.
Challenges: low-latency communication, low power and area.
Multicore Examples: Sun Niagara (crossbar interconnect).
Multicore Examples: IBM Cell (ring interconnect).
Many-Core Examples: Intel Terascale 80-core prototype, MIT RAW, and TRIPS (2-D mesh topology).
Designing an On-Chip Network
Routing: the path from source to destination.
Switching and flow control: packet switching, virtual channels.
Router microarchitecture: logic, buffers.
Routing Design Decisions
Routing determines the path through the network.
Unicast (a single source to a single destination, e.g. a deterministic X-Y route): highly optimized.
Multicast (one source with multiple destinations): a small percentage of traffic.
Current solution: replicate the message at the source as multiple unicasts, which is too expensive and unnecessary.
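The deterministic X-Y routing named above can be sketched in a few lines: a minimal Python model (node numbering and coordinate convention are assumptions) that fully traverses the X dimension before the Y dimension.

```python
def xy_route(src, dst, width):
    """Hops from src to dst (node indices) on a 2-D mesh of `width`
    columns, using dimension-order X-Y routing: X fully, then Y."""
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    path = [(x, y)]
    while x != dx:                 # traverse the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                 # then the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path
```

Dimension-order routing is deterministic and deadlock-free on a mesh, which is why it appears as the baseline throughout these slides.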
Communication Latency
Network latency: router overheads, plus network congestion due to inefficient multicasting.
Coherence protocol: cache-to-cache transfer latency, bandwidth requirements.
Existing Coherence Schemes
Broadcast: low latency and simple ordering (via a bus), but the single ordering point is a bottleneck, the interconnect has low bandwidth, and coherence traffic requirements are high.
Directory: a distributed ordering point, mostly unicast coherence traffic, and scalable topologies, but it incurs indirection latency and storage overhead.
Full-Chip Coherence: Broadcast
Too expensive: too much power and bandwidth, and the majority of cores do not need to see each broadcast.
Full-Chip Coherence: Directory
Long latency: sharers may be closer than the directory, yet a store miss must be forwarded through the directory before the response returns; directories also add storage overhead.
Hierarchical Coherence for CMPs
Limit coherence actions to a subset of cores: reduces bandwidth demands and power consumption (a multicast reaches only the active sharers).
Fast cache-to-cache transfers: avoid directory indirection, with limited storage overhead.
Leverage coarse-grain information for efficient directories.
Workloads and Coherence
Hierarchical coherence is a good match to applications.
Server consolidation: an emerging class of workloads for many-core chips with limited global (inter-VM) communication; communication and coherence are dominated by intra-VM sharing.
Traditional workloads: a small number of sharers per cache line or region (a region spans multiple cache lines).
Outline
Motivation
Virtual Circuit Tree Multicasting: Baseline Inefficiencies, Router Architecture, Results
Hybrid Circuit Switching
Virtual Tree Coherence
Circuit-Switched Coherence
Conclusion/Future Work
State-of-the-Art Router
Current router architectures are poorly equipped to handle even a low amount of multicast (MC) traffic.
Packet-Switched Router
A three-stage router with virtual-channel flow control, based on the most aggressive recent proposals.
Even this aggressive baseline is not well matched to all types of communication: multicast is performed as multiple unicasts.
Pipeline: Buffer Write; Virtual Channel/Switch Allocation; Switch Traversal; Link Traversal.
Baseline Router Example
A single multicast, sent as unicasts, occupies many virtual channels at once and leaves the rest busy.
Could more resources solve this problem (more buffers, virtual channels, links)?
Key Router Problems
Injection bandwidth: a burst of messages at the network interface.
Redundant (wasteful) use of resources: the same payload occupies extra buffers and links.
Alternative routing improves throughput but wastes power.
Speculation is predicated on low loads, which a burst of messages violates.
VCTM Overview
Multiple unicasts vs. tree-based multicast: trees are mapped onto the mesh (or an arbitrary topology) using dimension-order X-Y routing.
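The saving from sharing overlapping X-Y paths as a tree can be illustrated with a small sketch (a simplified model, not the thesis implementation): count link traversals for multiple unicasts versus one tree whose branches share common prefixes.

```python
def xy_route(src, dst, width):
    """X-Y route between node indices on a 2-D mesh (width columns)."""
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def unicast_links(src, dests, width):
    """Link traversals when the source injects one copy per destination."""
    return sum(len(xy_route(src, d, width)) - 1 for d in dests)

def tree_links(src, dests, width):
    """Link traversals when overlapping X-Y paths are shared as a tree:
    each distinct link carries the payload only once."""
    edges = set()
    for d in dests:
        p = xy_route(src, d, width)
        edges.update(zip(p, p[1:]))
    return len(edges)
```

For a 4x4 mesh with source 0 and destinations {1, 2, 5}, the unicast approach crosses 5 links while the shared tree crosses only 3, which is exactly the link redundancy VCTM removes.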
Virtual Circuit Tree Multicasting Example
Builds on the existing state-of-the-art router; unicast performance is not impacted.
Multicast trees are built incrementally: unicast setup packets (one per destination) install per-node routing-table entries (e.g., <East, South> or <Eject, South>) along the path, after which a single multicast packet is injected instead of one packet per destination.
Result: the injection problem is solved, link redundancy is removed, and fewer packets improve speculation.
Tree reuse is necessary for effectiveness, and there is significant temporal destination-set reuse across all scenarios.
VCTM Router Architecture
Figure: input ports with unicast and multicast virtual channels, a Virtual Circuit Tree Table (indexed by source and VCT number, holding per-output-port bits), routing computation, and the virtual channel and switch allocators.
Implementation Details (1)
The destination set is looked up in a content-addressable memory; the matching entry's tree ID is encoded into the multicast header.
If the set is not present, replace a tree and perform setup.
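The destination-set lookup described above can be sketched as a small table with LRU replacement. This is a behavioral model whose structure (an ordered map standing in for the CAM, LRU as the replacement policy) is an assumption for illustration.

```python
from collections import OrderedDict

class VirtualCircuitTreeTable:
    """Behavioral sketch of the destination-set CAM: maps a
    destination-set bitmask to a tree ID. A hit reuses the existing
    tree; a miss evicts the least-recently-used entry and signals
    that unicast setup packets must (re)build the tree."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # dest_mask -> tree ID

    def lookup(self, dest_mask):
        if dest_mask in self.entries:
            self.entries.move_to_end(dest_mask)            # refresh LRU order
            return self.entries[dest_mask], True           # hit: reuse tree
        if len(self.entries) >= self.capacity:
            _, tree_id = self.entries.popitem(last=False)  # evict LRU tree
        else:
            tree_id = len(self.entries)                    # next free slot
        self.entries[dest_mask] = tree_id
        return tree_id, False                              # miss: send setups
```

The returned boolean distinguishes the cheap common case (tree reuse) from the miss path that triggers per-destination setup packets.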
Implementation Details (2)
VCTs provide routing, not resources: multicast arbitration is the same as unicast, and VCTs do not pre-allocate resources.
Multiple arbitration steps occur at a tree branch: if one desired output is blocked, the other tree-branch outputs can still proceed, at the cost of longer buffer occupancy.
VCT entries are statically partitioned equally among nodes; dynamic partitioning was also explored.
Characterizing Multicasts
Unique destination sets (the combination of destinations in a multicast): TokenB has one destination set per node, while VTC and directory protocols show a much larger variety.
Number of destinations per multicast: TRIPS and directory have small destination sets, TokenB and Opteron have large ones, and VTC shows a wide variety of sizes.
Up to 13% of traffic is multicast; VCTM is an inexpensive solution to support it.
Simulation Methodology
Network traffic from five different scenarios, run through a detailed network simulator with cycle-accurate modeling of router stages.
Scenarios: TokenB, VTC, AMD Opteron, SGI Origin Directory, TRIPS.
(Thanks to Niket Agarwal and Noel Eisley for the TokenB, Opteron, and TRIPS traces.)
Network Configuration
Topology: 4-ary 2-mesh; 5-ary 2-mesh (TRIPS)
Routing: dimension-order X-Y routing
Channel width: 16 bytes
Packet size: 1 flit (coherence request = address + command); 5 flits (data); 3 flits (TRIPS)
Virtual channels: 4
Buffers per port: 24
Router ports: 5
Virtual circuit trees: varied from 16 to 4K (1 to 256 VCTs/core)
Power Savings
On-chip networks consume up to ~36% of chip power [Wang, 2002], and links, buffers, and crossbars consume nearly 100% of network power.
VCTM saves power through activity reduction.
Performance Results Summary
A small number of trees per core captures the majority of the benefit.
Performance improvement depends on network pressure: TPC-H 68%, Art 55%, SPECweb 12%.
VCTM vs. Aggressive Network
VCTM outperforms even an aggressive (unrealistic) network.
Related Work
Off-chip multicast routers [Sivaram, 99; Turner, 94; Stunkel, 99; Chiang, 95; Malumbres, 96]: expensive high-radix routers with large routing tables, targeting multi-stage networks.
On-chip: path-based SoC multicast [Lu, 06]; D-NUCA cache [Jin, 07]; table-based [Rodrigo, 08].
VCTM Summary
Key contributions: the characterization shows multicast support is necessary, and VCTM improves network throughput.
Improves performance across a variety of scenarios: reduces interconnect latency by up to 90% and switching activity by up to 53%.
A small number of trees is sufficient (8 per core).
Outline
Motivation
Virtual Circuit Tree Multicasting
Hybrid Circuit Switching: brief here (work presented at Prelim, 2 slides); extended version in backup (12 slides)
Virtual Tree Coherence
Circuit-Switched Coherence
Conclusion/Future Work
Hybrid Circuit Switching (1)
Latency is critical, so utilize circuit switching for lower latency: a circuit connects resources across multiple hops, avoiding per-router overhead. Traditional circuit switching, however, performs poorly.
HCS contributions: a novel setup mechanism and bandwidth stealing.
Hybrid Circuit Switching (2)
The hybrid network interleaves circuit-switched and packet-switched flits, optimizing setup latency and improving throughput over traditional circuit switching; packet-switched flits snoop idle bandwidth.
Interconnect delay is reduced by an average of 18% (up to 22%).
Outline
Motivation
Virtual Circuit Tree Multicasting
Hybrid Circuit Switching
Virtual Tree Coherence: Hierarchical Coherence, Coarse-Grain Regions, Ordering, Results
Circuit-Switched Coherence
Conclusion/Future Work
Virtual Tree Coherence
Local coherence (snoopy multicast), when root information is present: send the coherence request to the tree root; the root multicasts the request to the sharers using VCTM and adds the requestor to the sharing vector; sharers supply the data (and invalidate on a store).
Global coherence, when no root information is present: request the root from a coarse directory, or broadcast to find the root, then perform the first-level actions.
Coarse-Grain Region Overview
Take a macro view of memory to better observe access patterns (1 KB regions), using RegionTracker for efficient storage [Zebchuk, 07].
Region-based optimizations: Coarse Grain Coherence Tracking [Cantin, 05] eliminates unnecessary broadcasts; track sharers and roots at coarse granularity; multicast to sharers for data; efficient upgrades for non-shared regions.
Selecting the Root Node
The root should be a sharer, for fast ordering; initially the first toucher is chosen.
A single node may become the root for many regions, so the root can migrate to a more central location as sharers are added. This forces the majority of roots onto center nodes; in-flight requests are forwarded to the old root.
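One way to read "migrate to a more central location" is as a hop-count minimization over the current sharers. The sketch below is a hypothetical heuristic in that spirit (the exact migration policy in the thesis may differ): pick the sharer with the lowest total X-Y distance to the other sharers, breaking ties in favor of the earliest (first-touch) candidate.

```python
def best_root(sharers, width):
    """Hypothetical root-selection heuristic: choose the sharer that
    minimizes total X-Y hop distance to all sharers on a 2-D mesh of
    `width` columns. Ties keep the first listed (first-touch) node."""
    def hops(a, b):
        return abs(a % width - b % width) + abs(a // width - b // width)
    return min(sharers, key=lambda r: sum(hops(r, s) for s in sharers))
```

Because central nodes minimize distance to most sharer sets, this kind of policy naturally concentrates roots in the middle of the mesh, which is exactly the hotspot risk the later results slide points out.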
Global Coherence Trade-offs
Global coarse directories: storage overhead, but unicast requests.
Global broadcast: no storage overhead; broadcast traffic is expensive but rare.
Virtual Tree Coherence Ordering
Intra-region order: the tree root is the ordering point for all messages to the same address region; multiple roots prevent an ordering bottleneck.
Sharers: a multicast must contain all current sharers. Too many false sharers wastes power; too few causes incorrect execution.
Preserving Network Order
In an arbitrary topology, messages can become reordered by virtual-channel allocation or by adaptive routing.
Restrict virtual-channel assignment to prevent reordering: a tree must use the same VC throughout its network traversal, so the leaves (sharers) observe requests in the same order the root processed them.
Virtual Tree Coherence Example
Competing store requests to address A[0] arrive from nodes 0 and 1. Region A (sharers 0, 1, 4, 5) is rooted at node 4; region B (sharers 2, 3) at node 2. The root orders node 1's request before node 0's and multicasts both down the tree in that order: node 1 receives the data and completes its store first; node 0's store then invalidates node 1's copy, receives the data, and completes.
Inter-Region Order (1)
Given our intra-region tree order, it is safe to collect acknowledgments on every store miss (Opteron-like), but acknowledging every store miss is overly conservative.
Instead, before a dirty block becomes visible to the system (leaves the core), fence on all previously dirty/unfenced regions: a fence is issued for each such region on that region's tree.
A memory barrier requires that fences on all dirty/unfenced regions complete (PowerPC-style consistency).
Inter-Region Order (2)
Write atomicity: remote reads to dirty/unfenced regions initiate the necessary fence(s) and are deferred until they complete; fence acks guarantee that all prior stores (invalidates) to that region have been sunk by all cores.
Write serialization: writes within a region are serialized through tree order; writes that cross a region boundary (e.g., St A then St B) are serialized with a fence, issued eagerly on the tree of the earlier store (A).
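The eager-fence bookkeeping described above can be sketched as a per-core tracker (a simplified model; the set-based bookkeeping and method names are assumptions): each store records its region as dirty/unfenced, and a store that crosses into a different region first issues fences on the earlier regions' trees.

```python
class RegionFenceTracker:
    """Sketch of coarse-grain eager fencing: track dirty/unfenced
    regions; a store entering a new region fences all other
    dirty/unfenced regions first (on each region's own tree)."""

    REGION = 1024  # 1 KB regions, per the slides

    def __init__(self):
        self.unfenced = set()       # regions with unfenced dirty data
        self.fences_issued = []     # order in which fences were sent

    def store(self, addr):
        region = addr // self.REGION
        for r in list(self.unfenced):
            if r != region:                  # crossing a region boundary
                self.fences_issued.append(r) # eager fence on r's tree
                self.unfenced.discard(r)
        self.unfenced.add(region)
```

Stores that stay inside one region need no fence (tree order already serializes them); only the cross-region case pays the fence cost, which is what makes this cheaper than acknowledging every store miss.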
Simulation Methodology
PHARMsim: a full-system multi-core simulator with a detailed network-level model including VCTM. Baseline: directory coherence.
Workloads. Commercial: SPECjbb, SPECweb, TPC-H, TPC-W. Scientific: Barnes-Hut, Ocean, Radiosity, Raytrace. Plus server consolidation.
Simulation Configuration
Cores: 16 or 64, in-order, general purpose
L1 I/D caches: 32 KB, 2-way set associative, 1 cycle
Private L2 caches: 1 MB, 8-way set associative, 6 cycles, 64-byte lines
RegionTracker (associated with each L2): 1024 sets, 8 ways, 1 KB regions
Interconnect: 4x4 2-D mesh; packet-switched baseline with 3 router stages, 8 virtual channels with 4 buffers each
Virtual Circuit Tree Multicasting: 64 trees per source node (1024 or 4096 total trees)
Performance (16 Cores)
VTC outperforms the directory baseline by an average of 25%; the difference between VTC-Bcast and VTC-Mcast-* becomes more pronounced in larger systems.
Server Consolidation Performance
The performance difference between VTC-Bcast and VTC-Mcast-Dir grows from 6% to 16% when moving from 16 to 64 cores.
Ordering Hop Counts
Significant reduction in the hop count needed to order requests: 21% and 41% on average.
Tree Root Migration: Hotspots
Migration based on the lowest ordering hop count results in severe congestion.
Activity (16 Cores)
VTC-Mcast-Dir increases bandwidth consumption 2.1x over the directory baseline.
Related Work
Unordered interconnects: logical/virtual ordering in UnCorq [Strauss, 07]; token counting in Token Coherence [Martin, 03].
Hierarchical coherence: Virtual Hierarchies [Marty, 07]; multiprocessors built from SMP building blocks [Laudon, 97; Lenoski, 90; Lovett, 96; Hagersten, 99].
Multicast coherence: Multicast Snooping and Destination Set Prediction [Bilir, 99; Martin, 03].
Virtual Tree Coherence Summary
A high-bandwidth, low-latency ordering substrate: fast cache-to-cache transfers, efficient bandwidth utilization through multicasting, and coarse-grain tracking of sharers.
Scales well for server consolidation workloads.
Outline
Motivation
Virtual Circuit Tree Multicasting
Hybrid Circuit Switching
Virtual Tree Coherence
Circuit-Switched Coherence: brief here (work presented at Prelim, 1 slide); extended version in backup (6 slides)
Conclusion/Future Work
Circuit-Switched Coherence
Goal: better exploit circuits through the coherence protocol by observing pair-wise sharing and taking the directory off the critical path (the directory remains the sole ordering point).
The improvement increases when HCS is combined with the protocol optimization: the protocol optimization drives up circuit reuse, better utilizing HCS. Performance improvement of up to 18%.
Outline
Motivation
Virtual Circuit Tree Multicasting
Hybrid Circuit Switching
Virtual Tree Coherence
Circuit-Switched Coherence
Conclusion/Future Work
Conclusion
Communication behavior must influence interconnect design: the coherence protocol dictates interconnect traffic, and designing networks solely for unicast traffic is problematic.
The protocol should take advantage of interconnect properties: if the interconnect provides implicit ordering, use it.
Future Work
Network quality of service and fairness: root selection creates the potential for network hotspots.
Reduction/collection networks: reduce interconnect activity by combining acknowledgments in the network.
Techniques to improve the scalability of VCTM.
Thank you
Backup Slides: Extended Versions (HCS, CSC)
Hybrid Circuit Switching (Extended)
Latency is critical, so utilize circuit switching for lower latency: a circuit connects resources across multiple hops, avoiding per-router overhead. Traditional circuit switching, however, performs poorly.
HCS contributions: a novel setup mechanism and bandwidth stealing.
Traditional Circuit Switching
Path setup (with acknowledgment): a configuration probe travels to the destination and an acknowledgment returns before data can use the circuit.
Significant latency overhead before data transfer, and other requests are forced to wait for the reserved resources.
Novel Circuit Setup Policy
Overlap circuit setup with the first data transfer: the piggy-backed request always achieves low latency.
Reconfigure existing circuits only if no unused links are available; multiple circuit planes prevent frequent reconfiguration.
Setup Network
A light-weight setup network: narrow flits (2-bit circuit-plane identifier + 4-bit destination), low load, no virtual channels, and a small area footprint; it stores the circuit configuration information.
Multiple narrow circuit planes prevent frequent reconfiguration.
Reconfiguration traffic is buffered and traverses the packet-switched pipeline.
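The 2-bit plane identifier plus 4-bit destination adds up to a 6-bit setup flit for a 16-node mesh. A small sketch of one plausible packing (the exact bit layout is an assumption):

```python
PLANE_BITS, DEST_BITS = 2, 4  # field widths from the slide (16-node mesh)

def setup_flit(plane, dest):
    """Pack one setup-network flit: a 2-bit circuit-plane identifier
    followed by a 4-bit destination. The bit order is illustrative."""
    assert 0 <= plane < (1 << PLANE_BITS) and 0 <= dest < (1 << DEST_BITS)
    return (plane << DEST_BITS) | dest

def unpack_flit(flit):
    """Recover (plane, dest) from a packed setup flit."""
    return flit >> DEST_BITS, flit & ((1 << DEST_BITS) - 1)
```

Six bits per flit is why the setup network can be so narrow compared with the 16-byte data channels.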
Packet-Switched Bandwidth Stealing
Remember: the problem with traditional circuit switching is poor bandwidth, and HCS must overcome this limitation.
Hybrid circuit-switched solution: packet-switched messages snoop the incoming links; when no circuit-switched message is on a link, a waiting packet-switched message can steal the idle bandwidth.
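The per-cycle priority rule above can be sketched in a few lines (a deliberately simplified model of the arbitration, not the router's actual logic): circuit traffic always wins the link, and packet traffic uses it only when the circuit is idle.

```python
def arbitrate_link(circuit_flit, packet_queue):
    """One cycle of link arbitration under bandwidth stealing:
    a circuit-switched flit has priority; otherwise a waiting
    packet-switched flit steals the idle link; else the link idles."""
    if circuit_flit is not None:
        return circuit_flit            # circuit traffic owns the link
    if packet_queue:
        return packet_queue.pop(0)     # steal idle bandwidth
    return None                        # link idle this cycle
```

Because stealing never displaces a circuit flit, circuits keep their single-stage latency while packets recover the bandwidth that traditional circuit switching would waste.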
Hybrid Circuit-Switched Router Design
Figure: router with injection/ejection and N/S/E/W ports, allocators, and crossbar.
HCS Pipeline
Circuit-switched messages: 1 stage (switch traversal, then link traversal).
Packet-switched messages: 3 stages (buffer write; virtual channel/switch allocation; switch traversal; then link traversal), with aggressive speculation to reduce stages.
Simulation Workloads
Commercial: SPECjbb (Java server workload, 24 warehouses, 200 requests); SPECweb (web server, 300 requests); TPC-W (web e-commerce, 40 transactions); TPC-H (decision support system).
Scientific: Barnes-Hut (8K particles, full run); Ocean (514x514, parallel phase); Radiosity (parallel phase); Raytrace (car input, parallel phase).
Synthetic: uniform random (destinations selected with a uniform random distribution); permutation traffic (each node communicates with one other node, pair-wise).
Simulation Configuration
Cores: 16, in-order, general purpose
L1 I/D caches: 32 KB, 2-way set associative, 1 cycle
Private L2 caches: 512 KB, 4-way set associative, 6 cycles, 64-byte lines
Shared L3 cache: 16 MB (1 MB bank/tile), 4-way set associative, 12 cycles
Main memory latency: 100 cycles
Interconnect: 4x4 2-D mesh; packet-switched baseline with optimized 1-3 router stages, 4 virtual channels with 4 buffers each
Hybrid circuit switching: 1 router stage, 2 or 4 circuit planes
Network Results
Communication latency is key: HCS shaves precious cycles off network latency.
Flit Breakdown
HCS reduces interconnect latency for a significant fraction of messages.
Uniform Random Traffic
HCS successfully overcomes the bandwidth limitations associated with circuit switching.
HCS Conclusion
Overcomes the bandwidth limitations of traditional circuit switching, with a fast setup technique for low latency.
Network latency is reduced by an average of 18% (up to 22%); overall performance improves by up to 7%.
Circuit-Switched Coherence (Extended)
Goal: better exploit circuits through the coherence protocol.
Modifications: allow a cache to send a request directly to another cache, notifying the directory in parallel, and add a prediction mechanism for pair-wise sharers. The directory remains the sole ordering point.
Circuit-Switched Coherence Optimization
Example: node 1 reads address A, held Exclusive by node 2. Node 1 sends the read directly to node 2 (step 1); node 2 responds with the data (step 2) while the directory is updated in parallel (step 3); the directory entry for A moves from Exclusive (sharer 2) to Shared (sharers 1, 2), and an acknowledgment completes the transaction.
Region Prediction
Each memory region spans 1 KB, taking advantage of spatial and temporal sharing: a miss to A[0] is forwarded through the directory, which updates the requester's region table with the responding sharer, so a later read to A[1] in the same region can be sent directly to the predicted sharer.
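The region table described above can be sketched as a per-core map from 1 KB regions to the last known sharer (class and method names are illustrative, not from the thesis): a hit lets a miss go directly to that cache while the directory is notified in parallel; a miss falls back to the directory.

```python
class RegionPredictor:
    """Sketch of the pair-wise sharer prediction: per-1KB-region
    table remembering the last sharer that supplied data, used to
    send subsequent requests in that region cache-to-cache."""

    REGION = 1024  # 1 KB regions, per the slides

    def predict(self, addr):
        """Predicted sharer for addr, or None (go to the directory)."""
        return self.owner.get(addr // self.REGION)

    def update(self, addr, node):
        """Record the sharer that answered a request in this region."""
        self.owner[addr // self.REGION] = node

    def __init__(self):
        self.owner = {}
```

A correct prediction turns a three-hop directory indirection into a direct transfer over an already-established circuit, which is what drives up circuit reuse.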
Simulation Configuration: identical to the HCS configuration above.
HCS + Protocol Optimization
The improvement from HCS plus the protocol optimization is greater than the sum of either alone: the protocol optimization drives up circuit reuse, better utilizing HCS.
Circuit-Switched Coherence Summary
Replace the packet-switched mesh with a hybrid circuit-switched mesh: interleave circuit- and packet-switched flits, with reconfigurable circuits.
Dedicated bandwidth for frequent pair-wise sharers yields low latency and low power by avoiding switching and routing.
Novel coherence mechanisms are devised to take advantage of the benefits of circuit switching.
Inter-Region Order (3)
Option 1: require acknowledgments from the multicast sharers.
Option 2: coarse-grain fences between stores to different regions, which significantly reduces acknowledgments; no fence is required if the previous region was held Exclusive.
Intra-region example (P0: St A[0], St A[1]; P1: Rd A[1], Rd A[0]): if P1 observes the store to A[1], it has also observed A[0], because of tree order.
Cross-region example (P0: St A, St B; P1: Rd B, Rd A): the store to A may be delayed on tree 1 while the store to B arrives quickly on tree 2, so P1 could read the new value of B and a stale value of A. Therefore remote reads to B cannot be satisfied by P0 until acks are collected for the fences on A and B: each store miss carries an implicit fence on the earlier region.
Inter-Region Order (4)
Write atomicity problem (P0: A = 1; P1: while (A == 0); then B = 1; P2: while (B == 0); then Rd A): without write atomicity, P2 can read B = 1 yet still read A = 0.
Inter-Region Order (5)
Write atomicity solution: remote reads to a dirty/unfenced region require a fence guaranteeing the store is visible to all processors.
In the example, P1's read of A cannot be satisfied until the fence on A has been sunk by everyone, so P2 will read A = 1.