Jason F. Cantin, Mikko H. Lipasti, and James E. Smith



1 Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
Jason F. Cantin, Mikko H. Lipasti, and James E. Smith
International Symposium on Computer Architecture
June 7th, 2005

2 Overview of Idea
Coarse-Grain Coherence Tracking:
  Monitors coherence status of memory at a multi-line granularity
  Uses the coarse-grain information to identify requests that don’t need a coherence broadcast
  Sends these requests directly to memory
June 7, 2005 ISCA 2005

3 Problem
Snoop-based systems support a limited number of processors
  Limited broadcast bandwidth
  Increasing memory latency
[Figure: processors (P) with caches ($), NC, MC, and DRAM, connected by a broadcast network and a data network]

4 Opportunity
Some data requests don’t need a broadcast
  Requests for non-shared data
  Fetches of unmodified instructions
  Write-backs
Some non-data requests don’t need to leave the processor
  Requests to upgrade copy, but not shared
  Requests to flush copies, but not cached elsewhere

5 Unnecessary Broadcasts

6 Our Approach
Identify requests that don’t need a broadcast
Send data requests directly to memory
  Reduce broadcast traffic
  Reduce latency in some systems
Avoid sending non-data requests externally
  Further reduce broadcast traffic
  Reduce latency

7 Coarse-Grain Coherence Tracking
Memory is divided into coarse-grain regions
  Aligned, power-of-two multiple of cache line size
  Can range from two lines to a physical page
A cache-like structure is added to each processor for monitoring coherence at the granularity of regions
  Region Coherence Array (RCA)
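The region mapping described on this slide can be sketched as simple bit arithmetic. This is a minimal illustration, not the authors' implementation: the 64-byte line size, the 512-byte default region size, and all function names are assumptions chosen for the example.

```python
LINE_SIZE = 64      # bytes per cache line (assumed for illustration)
REGION_SIZE = 512   # bytes per region (assumed; a power-of-two multiple of LINE_SIZE)


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def region_base(addr: int, region_size: int = REGION_SIZE) -> int:
    """Address of the first byte of the aligned region that contains addr."""
    assert is_power_of_two(region_size)
    return addr & ~(region_size - 1)


def region_number(addr: int, region_size: int = REGION_SIZE) -> int:
    """Index of the region that contains addr (addr divided by the region size)."""
    assert is_power_of_two(region_size)
    return addr >> (region_size.bit_length() - 1)


def lines_per_region(region_size: int = REGION_SIZE, line_size: int = LINE_SIZE) -> int:
    """Number of cache lines covered by one region."""
    assert region_size % line_size == 0
    assert is_power_of_two(region_size // line_size)
    return region_size // line_size
```

With two lines per region and addresses counted in lines, as in the example slides, `region_number(0b0010, 2)` and `region_number(0b0011, 2)` both give region `0b001`.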

8 Coarse-Grain Coherence Tracking
Each entry has an address tag, state, and count of lines cached by the processor
The state indicates if the processor and/or other processors are sharing/modifying lines in the region
On cache misses, the region state is read to determine if a broadcast is necessary
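A minimal sketch of an RCA entry and the miss-time check might look like the following. The field names, the split of the region state into a local part and an others part, the request kinds, and the exact decision rule are assumptions made for illustration; the slides give only the entry contents and the fact that the region state decides whether to broadcast.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Sharing(Enum):
    INVALID = "I"   # no lines of the region cached
    CLEAN = "C"     # only unmodified copies
    DIRTY = "D"     # may hold modified copies


@dataclass
class RegionEntry:
    tag: int
    local: Sharing       # what this processor holds in the region
    others: Sharing      # what other processors may hold in the region
    line_count: int = 0  # lines of this region cached by this processor


def needs_broadcast(entry: Optional[RegionEntry], request: str) -> bool:
    """request is one of "read_shared", "read_exclusive", "writeback" (hypothetical names)."""
    if request == "writeback":
        return False   # write-backs never need a broadcast
    if entry is None:
        return True    # region not tracked: nothing is known, so broadcast
    if entry.others is Sharing.INVALID:
        return False   # non-shared region: go directly to memory
    if entry.others is Sharing.CLEAN and request == "read_shared":
        return False   # others hold only clean copies, so memory is up to date
    return True
```

For example, an entry with `others=Sharing.INVALID` lets every request for that region skip the broadcast, which is the situation shown for region 001 in the example slides.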

9 Coarse-Grain Coherence Tracking
On snoops, the region state provides a response for the region
  Piggy-backed onto the conventional response
  Used to update other processors’ region state
RCA maintains inclusion over caches
  When regions are evicted, their lines are evicted
  RCA must respond correctly if region’s lines cached
  Replacement algorithm uses line count
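One way a line-count-aware replacement policy could work is sketched below. It prefers regions with no cached lines, since evicting those forces no cache evictions. The LRU tie-break, the `last_used` timestamp, and the class and function names are assumptions, not details given in the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegionWay:
    tag: int
    line_count: int   # lines of this region currently in the cache
    last_used: int    # smaller means less recently used (assumed bookkeeping)


def choose_victim(ways: List[RegionWay]) -> RegionWay:
    """Pick the region to evict from one RCA set.

    Empty regions sort first; among equals, the least recently used wins.
    """
    return min(ways, key=lambda w: (w.line_count > 0, w.last_used))


def lines_forced_out(victim: RegionWay) -> int:
    """Cache lines that must be evicted with the region to preserve inclusion."""
    return victim.line_count
```

In a set holding regions with 3, 0, and 1 cached lines, the empty region is chosen even if it was used most recently, so no cache lines are lost.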

10 Example: Conventional Snooping
[Animated figure: processors P0 and P1 with caches $0 and $1 (tag and state per line) and memories M0 and M1, connected by a network]
P0 loads 0010
  MISS in $0; the line goes from Invalid to Pending
  Read: P0 is sent on the network; snoop performed
  Response sent: Invalid
  Data transfer; line 0010 becomes Exclusive in $0

11 Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
[Animated figure: as in the previous example, with an RCA (tag and state per region) beside each cache]
P0 loads 0010
  MISS in $0; line and region 001 go from Invalid to Pending
  Read: P0 is sent on the network; snoop performed
  Response sent: Invalid, Region Not Shared
  Data transfer; line 0010 becomes Exclusive in $0 and region 001 becomes DI in P0’s RCA
P0 has exclusive access to region

12 Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
[Animated figure: $0 holds line 0010 Exclusive; P0’s RCA holds region 001 in state DI]
P0 loads 0011
  MISS, Region Hit
  Exclusive region state, broadcast unnecessary
  Direct request sent: Read: P0
  Data transfer; line 0011 becomes Exclusive in $0

13 Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
[Animated figure: $0 holds lines 0010 and 0011 Exclusive; P0’s RCA holds region 001 in state DI]
P1 stores 0010
  MISS in $1; line and region 001 go from Invalid to Pending
  RFO: P1 is sent on the network; snoop performed; hits in P0 cache
  Response sent: Owned, Region Owned
  Data transfer; line 0010 becomes Invalid in $0 and Modified in $1; region 001 becomes DD in both RCAs
Region not exclusive anymore
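The three-step example (P0 loads 0010, P0 loads 0011, P1 stores 0010) can be replayed with a toy model. This is a sketch under heavy simplification, not the authors' protocol: the region protocol is collapsed to the two states the example visits (DI and DD), line states are single letters, and the `Proc` class and `access` function are hypothetical names.

```python
class Proc:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # line address -> "E", "M", "S" or "O"
        self.rca = {}    # region number -> "DI" or "DD"


def region_of(line):
    return line >> 1     # two lines per region, as on the slides


def access(p, others, line, write):
    """Return "hit", "direct" or "broadcast" for one load or store."""
    r = region_of(line)
    state = p.cache.get(line)
    if state in ("M", "E") or (state is not None and not write):
        if write:
            p.cache[line] = "M"
        return "hit"
    if p.rca.get(r) == "DI":
        # Region is exclusive to p: skip the broadcast, go straight to memory.
        p.cache[line] = "M" if write else "E"
        return "direct"
    # Otherwise broadcast; every other processor snoops its cache and RCA.
    line_shared = region_shared = False
    for q in others:
        if r in q.rca:
            region_shared = True
            q.rca[r] = "DD"          # region response updates the snooper too
        if line in q.cache:
            line_shared = True
            if write:
                del q.cache[line]    # read-for-ownership invalidates the copy
            else:
                q.cache[line] = "O" if q.cache[line] in ("M", "O") else "S"
    p.rca[r] = "DD" if region_shared else "DI"
    p.cache[line] = "M" if write else ("S" if line_shared else "E")
    return "broadcast"


p0, p1 = Proc("P0"), Proc("P1")
step1 = access(p0, [p1], 0b0010, write=False)  # first touch of region 001
step2 = access(p0, [p1], 0b0011, write=False)  # same region, different line
step3 = access(p1, [p0], 0b0010, write=True)   # P1 takes the line away
```

In this replay the first load broadcasts, the second goes directly to memory because region 001 is held in DI, and the store by P1 broadcasts and leaves region 001 in DD in both arrays.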

14 Overhead
Storage space needed for RCA
  3-6% storage overhead for cache
Two bits needed in snoop response for region response
Path to memory needed to avoid broadcasts
  Simple with on-chip memory controllers
  May leverage data network
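A back-of-the-envelope estimate shows how a storage overhead of this order can arise. Every parameter below is an assumption for illustration (1 MiB 2-way cache, 64-byte lines, 512-byte regions, 40-bit physical addresses, five line states), and the entry layout ignores parity, ECC, and replacement bits. It is not the authors' calculation. The sketch assumes the RCA has the same number of sets and ways as the cache.

```python
def log2_exact(n: int) -> int:
    """log2 of a power of two."""
    assert n > 0 and (n & (n - 1)) == 0
    return n.bit_length() - 1


def bits_for(values: int) -> int:
    """Bits needed to encode `values` distinct values."""
    return max(1, (values - 1).bit_length())


def rca_overhead(cache_bytes, ways, line_bytes, region_bytes, addr_bits,
                 region_states=7, line_states=5):
    """Ratio of one RCA entry's bits to one cache line's bits (data + tag + state)."""
    sets = cache_bytes // line_bytes // ways
    index_bits = log2_exact(sets)
    cache_tag = addr_bits - index_bits - log2_exact(line_bytes)
    cache_entry = 8 * line_bytes + cache_tag + bits_for(line_states)
    rca_tag = addr_bits - index_bits - log2_exact(region_bytes)
    count_bits = bits_for(region_bytes // line_bytes + 1)  # count runs 0..lines per region
    rca_entry = rca_tag + bits_for(region_states) + count_bits
    return rca_entry / cache_entry


overhead = rca_overhead(cache_bytes=1 << 20, ways=2, line_bytes=64,
                        region_bytes=512, addr_bits=40)
```

With these assumed parameters an RCA entry is 25 bits against 536 bits per cache line, a ratio of about 4.7%.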

15 Simulator
PHARMsim: Execution-driven simulator built on top of SimOS-PPC
Four 4-way superscalar out-of-order processors
Two-level hierarchy with split L1, unified L2 caches
Separate address / data networks, similar to Fireplane
Region Coherence Array with same sets/assoc. as L2

16 Workloads
Scientific
  Ocean, Raytrace, Barnes
Multiprogrammed
  SPECint2000_rate
Commercial
  TPC-W, TPC-B, TPC-H, SPECweb99, SPECjbb2000

17 Broadcasts Avoided

18 Snoop Traffic Reduction – Peak
[Chart; values shown: 64%, 51%, 38%]

19 Snoop Traffic Reduction – Average
[Chart; values shown: 47%, 74%, 86%]

20 Execution Time
[Chart; value shown: 91.2%]

21 Remaining Opportunity
With 512B regions, ~10% of requests are broadcast unnecessarily
  A third of the 10% are region false sharing
  Half of the 10% miss in RCA
Potential for prefetching

22 Inclusion Overhead
Regions with no lines cached replaced first

23 Conclusion
Coarse-Grain Coherence Tracking:
Reduces broadcast traffic
  Most data requests sent directly to memory
Reduces latency
  Many requests not sent to central arbitration point
  Many non-data requests not sent externally
Improves scalability and performance

24 The End

25 Inclusion Evictions

26 Ordering
Ordering point is now the Region Coherence Array
  A direct request is ordered once it accesses the RCA
Direct requests are serialized w.r.t. snoop requests
  A direct request occurs either before or after a snoop
  All must appear to access and update RCA atomically
No two processors can have exclusive access to a region at the same time (no races)
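The atomicity requirement can be illustrated with a software analogy in which one lock guards the RCA, so a direct request and a snoop can never interleave. This is only an analogy for the hardware ordering point, and the class, method names, and two-value region state are hypothetical.

```python
import threading


class RegionCoherenceArray:
    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}   # region number -> "exclusive" or "shared"
        self.order = []    # observed global order of RCA accesses

    def fill_exclusive(self, region):
        """Record that a broadcast found the region not shared."""
        with self._lock:
            self._state[region] = "exclusive"

    def direct_request(self, region):
        """Try to skip the broadcast; the request is ordered at this access."""
        with self._lock:
            allowed = self._state.get(region) == "exclusive"
            if allowed:
                self.order.append(("direct", region))
            return allowed  # False means the caller must broadcast instead

    def snoop(self, region):
        """An external request touches the region; exclusivity is lost."""
        with self._lock:
            if region in self._state:
                self._state[region] = "shared"
            self.order.append(("snoop", region))
```

A direct request that takes the lock before the snoop is ordered before it. One that takes the lock afterwards sees the shared state and falls back to a broadcast.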

27 Comparison to RegionScout
(values listed as CGCT / RegionScout; rows with a single value appear that way in the transcript)
Optimization: Latency / Power
Avoids broadcast for non-shared data: Yes
Avoids broadcast for clean data: No
Avoids tag lookups on snoops: Yes, like Jetty
Region state storage: Inclusive cache / Hash table, small cache
Region state transfer: 2 bits in snoop response / 1 bit in snoop response
Region protocol: 7 states / Effectively 4 states

28 Execution Time

