Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
Jason F. Cantin, Mikko H. Lipasti, and James E. Smith
International Symposium on Computer Architecture, June 7th, 2005
Overview of Idea
Coarse-Grain Coherence Tracking:
- Monitors coherence status of memory at a multi-line granularity
- Uses the coarse-grain information to identify requests that don’t need a coherence broadcast
- Sends these requests directly to memory
June 7, 2005 ISCA 2005
Problem
- Snoop-based systems support a limited number of processors
  - Limited broadcast bandwidth
  - Increasing memory latency
[Diagram labels: Broadcast Network, Data Network, P, $, NC, MC, DRAM]
Opportunity
- Some data requests don’t need a broadcast
  - Requests for non-shared data
  - Fetches of unmodified instructions
  - Write-backs
- Some non-data requests don’t need to leave the processor
  - Requests to upgrade a copy that is not shared
  - Requests to flush copies of lines that are not cached elsewhere
Unnecessary Broadcasts
Our Approach
- Identify requests that don’t need a broadcast
- Send data requests directly to memory
  - Reduce broadcast traffic
  - Reduce latency in some systems
- Avoid sending non-data requests externally
  - Further reduce broadcast traffic
  - Reduce latency
Coarse-Grain Coherence Tracking
- Memory is divided into coarse-grain regions
  - Aligned, power-of-two multiple of cache line size
  - Can range from two lines to a physical page
- A cache-like structure is added to each processor for monitoring coherence at the granularity of regions
  - Region Coherence Array (RCA)
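Because regions are aligned and a power-of-two multiple of the line size, mapping an address to its region is plain bit masking. The sketch below illustrates that arithmetic; the 64-byte line and 512-byte region defaults, and all function names, are assumptions chosen for illustration rather than values fixed by the slides.

```python
LINE_SIZE = 64     # bytes per cache line (assumed for illustration)
REGION_SIZE = 512  # bytes per region (assumed); power-of-two multiple of LINE_SIZE


def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def region_base(addr: int, region_size: int = REGION_SIZE) -> int:
    """Address of the first byte of the aligned region containing addr."""
    assert is_power_of_two(region_size)
    return addr & ~(region_size - 1)


def line_index_in_region(addr: int, line_size: int = LINE_SIZE,
                         region_size: int = REGION_SIZE) -> int:
    """Which line of its region addr falls in (0 is the first line)."""
    assert is_power_of_two(line_size) and region_size % line_size == 0
    return (addr & (region_size - 1)) // line_size


def lines_per_region(line_size: int = LINE_SIZE,
                     region_size: int = REGION_SIZE) -> int:
    """Number of cache lines tracked by one region entry."""
    return region_size // line_size
```

With two 4-byte lines per region (`region_size=8`, `line_size=4`), addresses `0b1000` and `0b1100` land in the same region, as lines 0 and 1 respectively.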
Coarse-Grain Coherence Tracking
- Each entry has an address tag, a state, and a count of lines cached by the processor
- The state indicates whether the processor and/or other processors are sharing or modifying lines in the region
- On cache misses, the region state is read to determine whether a broadcast is necessary
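A minimal sketch of an RCA entry and the miss-time check follows. The two-letter state names (first letter for the local processor, second for other processors) and the decision rule are assumptions made for illustration; the slides themselves only show `Invalid`, `DI`, and `DD`, and do not spell out the full protocol.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RegionState(Enum):
    # Assumed naming: first letter = local processor (Clean/Dirty),
    # second letter = other processors (Invalid/Clean/Dirty).
    I = "I"    # region not tracked
    CI = "CI"
    CC = "CC"
    CD = "CD"
    DI = "DI"
    DC = "DC"
    DD = "DD"


@dataclass
class RCAEntry:
    tag: int                             # region address tag
    state: RegionState = RegionState.I   # coarse-grain coherence state
    line_count: int = 0                  # lines of this region cached locally


def needs_broadcast(entry: Optional[RCAEntry], wants_exclusive: bool) -> bool:
    """Assumed rule: decide on a cache miss whether a broadcast is needed."""
    if entry is None or entry.state is RegionState.I:
        return True                      # nothing known about the region
    others = entry.state.value[1]
    if others == "I":
        return False                     # no other processor caches the region
    if others == "C":
        return wants_exclusive           # shared reads may go straight to memory
    return True                          # others may hold modified lines
```

For example, a load that misses in a region held in `DI` returns `False` (the request can go directly to memory), while the same load in a `DD` region returns `True`.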
Coarse-Grain Coherence Tracking
- On snoops, the region state provides a response for the region
  - Piggy-backed onto the conventional response
  - Used to update other processors’ region state
- RCA maintains inclusion over caches
  - When regions are evicted, their lines are evicted
  - RCA must respond correctly if a region’s lines are cached
  - Replacement algorithm uses line count
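The inclusion and replacement points above can be sketched as follows. This is an illustrative model, not the authors' implementation: the `Region` class, the LRU tie-break among candidates, and the use of a Python set as the cache are all assumptions, and write-back of dirty lines is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    tag: int
    cached_lines: set = field(default_factory=set)  # line addresses cached locally
    last_used: int = 0                              # timestamp for LRU (assumed)

    @property
    def line_count(self) -> int:
        return len(self.cached_lines)


def choose_victim(rca_set: list) -> Region:
    """Prefer regions with no lines cached; break ties by least recent use."""
    empty = [r for r in rca_set if r.line_count == 0]
    candidates = empty or rca_set
    return min(candidates, key=lambda r: r.last_used)


def evict_region(rca_set: list, cache: set) -> Region:
    """Evict one region and, to keep inclusion, every line it still caches."""
    victim = choose_victim(rca_set)
    rca_set.remove(victim)
    for line in victim.cached_lines:
        cache.discard(line)  # a real design would write back dirty lines first
    return victim
```

Evicting a region with a zero line count costs nothing in the cache, which is why such regions are the preferred victims in this sketch.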
Example: Conventional Snooping
- P0 loads 1000₂ and misses in its cache ($0)
- Read: P0, 1000₂ is broadcast on the network and a snoop is performed
- Response sent: Invalid
- Data transfer; the $0 line with tag 0010 goes from Invalid to Pending to Exclusive
[Diagram labels: Network, P0, P1, $0, $1, M0, M1, Tag, State]
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
- P0 loads 1000₂ and misses in its cache ($0)
- Read: P0, 1000₂ is broadcast on the network and a snoop is performed
- Response sent: Invalid, Region Not Shared
- Data transfer; the $0 line with tag 0010 goes from Invalid to Pending to Exclusive
- The P0 RCA entry with tag 001 goes from Invalid to Pending to DI
- P0 has exclusive access to region
[Diagram labels: Network, P0, P1, $0, $1, RCA, M0, M1, Tag, State]
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
- P0 loads 1100₂: MISS, Region Hit (RCA entry 001 is in state DI)
- Exclusive region state, broadcast unnecessary
- Direct request sent: Read: P0, 1100₂
- Data transfer; the $0 line with tag 0011 goes from Invalid to Pending to Exclusive
[Diagram labels: Network, P0, P1, $0, $1, RCA, M0, M1, Tag, State]
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
- P1 stores 1000₂ and misses in its cache ($1)
- RFO: P1, 1000₂ is broadcast on the network and a snoop is performed
- Hits in P0 cache; response sent: Owned, Region Owned
- Data transfer; the $1 line with tag 0010 goes from Invalid to Pending to Modified, and the P1 RCA entry 001 goes from Invalid to Pending to DD
- The $0 line with tag 0010 goes from Exclusive to Pending to Invalid, and the P0 RCA entry 001 goes from DI to DD; the $0 line with tag 0011 stays Exclusive
- Region not exclusive anymore
[Diagram labels: Network, P0, P1, $0, $1, RCA, M0, M1, Tag, State]
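The three-step example (P0 loads 1000₂, P0 loads 1100₂, P1 stores 1000₂) can be replayed with a toy model. This is a deliberately simplified sketch: it uses only the region states `DI` and `DD` seen in the example, assumes 4-byte lines so that the cache and RCA tags match the slides, and omits pending states, write-backs, and evictions. The `Processor` class and `access` function are hypothetical names.

```python
LINE_BITS = 2    # 4-byte lines (assumed): cache tag of 0b1000 is 0b0010
REGION_BITS = 3  # two lines per region: region tag of 0b1000 is 0b001


class Processor:
    def __init__(self, name: str):
        self.name = name
        self.cache = {}  # line tag -> line state
        self.rca = {}    # region tag -> region state ("DI" or "DD")


def access(requester, others, addr, store=False):
    """Return 'hit', 'direct' (sent straight to memory), or 'broadcast'."""
    line, region = addr >> LINE_BITS, addr >> REGION_BITS
    state = requester.cache.get(line, "Invalid")
    if state in ("Exclusive", "Modified") or (state == "Shared" and not store):
        if store:
            requester.cache[line] = "Modified"
        return "hit"
    if requester.rca.get(region) == "DI":
        # Exclusive region: no other processor caches it, so skip the broadcast.
        requester.cache[line] = "Modified" if store else "Exclusive"
        return "direct"
    region_shared = line_shared = False
    for p in others:  # snoop every other processor
        if region in p.rca:
            region_shared = True
            p.rca[region] = "DD"  # simplification: treat any sharing as dirty
        if line in p.cache:
            if store:
                del p.cache[line]  # invalidate the remote copy
            else:
                p.cache[line] = "Shared"
                line_shared = True
    if store:
        requester.cache[line] = "Modified"
    else:
        requester.cache[line] = "Shared" if line_shared else "Exclusive"
    requester.rca[region] = "DD" if region_shared else "DI"
    return "broadcast"


p0, p1 = Processor("P0"), Processor("P1")
trace = [
    access(p0, [p1], 0b1000),              # first touch of the region
    access(p0, [p1], 0b1100),              # miss, region hit
    access(p1, [p0], 0b1000, store=True),  # P1 takes the line from P0
]
```

After the trace, P0 holds only line 0011 in Exclusive, P1 holds line 0010 in Modified, and both RCAs record region 001 as `DD`, matching the final state on the slide.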
Overhead
- Storage space needed for RCA
  - 3-6% storage overhead for cache
- Two bits needed in snoop response for region response
- Path to memory needed to avoid broadcasts
  - Simple with on-chip memory controllers
  - May leverage data network
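One way the two region-response bits could be used is sketched below. The slides state only that two bits are needed; the bit meanings (`REGION_CLEAN`, `REGION_DIRTY`), the OR-combining of responses, and the two-letter state strings are assumptions made for illustration.

```python
from typing import Iterable, Optional

REGION_CLEAN = 0b01  # assumed: a responder caches only unmodified lines
REGION_DIRTY = 0b10  # assumed: a responder may hold modified lines


def region_response(local_state: Optional[str]) -> int:
    """Two-bit reply a snooped processor contributes from its own region state.

    local_state is an assumed two-letter state such as "CI" or "DD",
    or None / "I" when the region is not tracked.
    """
    if local_state is None or local_state == "I":
        return 0
    return REGION_DIRTY if local_state[0] == "D" else REGION_CLEAN


def combine(responses: Iterable[int]) -> int:
    """OR the per-processor replies into one two-bit region response."""
    combined = 0
    for r in responses:
        combined |= r
    return combined


def others_part(combined: int) -> str:
    """What the requester learns about other processors: 'I', 'C', or 'D'."""
    if combined & REGION_DIRTY:
        return "D"
    if combined & REGION_CLEAN:
        return "C"
    return "I"
```

A combined response of zero corresponds to the "Region Not Shared" reply in the earlier example, after which the requester may treat the region as exclusive.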
Simulator
- PHARMsim: execution-driven simulator built on top of SimOS-PPC
- Four 4-way superscalar out-of-order processors
- Two-level hierarchy with split L1, unified L2 caches
- Separate address / data networks, similar to Fireplane
- Region Coherence Array with same sets and associativity as L2
Workloads
- Scientific: Ocean, Raytrace, Barnes
- Multiprogrammed: SPECint2000_rate
- Commercial: TPC-W, TPC-B, TPC-H, SPECweb99, SPECjbb2000
Broadcasts Avoided
Snoop Traffic Reduction – Peak 64% 51% 38%
Snoop Traffic Reduction – Average 47% 74% 86%
Execution Time 91.2%
Remaining Opportunity
- With 512B regions, ~10% of requests are broadcast unnecessarily
  - A third of the 10% are region false sharing
  - Half of the 10% miss in RCA
    - Potential for prefetching
Inclusion Overhead
- Regions with no lines cached are replaced first
Conclusion
Coarse-Grain Coherence Tracking:
- Reduces broadcast traffic
  - Most data requests sent directly to memory
- Reduces latency
  - Many requests not sent to central arbitration point
  - Many non-data requests not sent externally
- Improves scalability and performance
The End
Inclusion Evictions
Ordering
- Ordering point is now the Region Coherence Array
  - A direct request is ordered once it accesses the RCA
- Direct requests are serialized with respect to snoop requests
  - A direct request occurs either before or after a snoop
  - All must appear to access and update the RCA atomically
- No two processors can have exclusive access to a region at the same time (no races)
Comparison to RegionScout
Where two values are given, the first is for CGCT and the second for RegionScout.
- Optimization: Latency; Power
- Avoids broadcast for non-shared data: Yes
- Avoids broadcast for clean data: No
- Avoids tag lookups on snoops: Yes, like Jetty
- Region state storage: Inclusive cache; hash table, small cache
- Region state transfer: 2 bits in snoop response; 1 bit in snoop response
- Region protocol: 7 states; effectively 4 states
Execution Time