
The Locality-Aware Adaptive Cache Coherence Protocol
George Kurian (MIT), Omer Khan (University of Connecticut, Storrs), Srini Devadas (MIT)


1 The Locality-Aware Adaptive Cache Coherence Protocol
George Kurian (1), Omer Khan (2), Srini Devadas (1)
(1) Massachusetts Institute of Technology, (2) University of Connecticut, Storrs

2 Cache Hierarchy Organization: Directory-Based Coherence
[Diagram: a write miss in a private cache goes to the shared cache + directory (steps 1-2), which invalidates the sharer and collects its acknowledgement (steps 3-4) before the write to the word completes.]
Private caches: 1 or 2 levels. Shared cache: last level.
Concurrent reads lead to replication in private caches; the directory maintains coherence for the replicated lines.

3 Private Caching: Advantages & Drawbacks
☺ Exploits spatio-temporal locality
☺ Efficient low-latency local access to private and shared data (cache-line replication)
☹ Inefficiently handles data with LOW spatio-temporal locality:
– Working set > private cache size
– Inefficient cache utilization (cache thrashing)
– Unnecessary fetch of the entire cache line
– Shared-data replication increases the working set

4 Private Caching: Advantages & Drawbacks
☺ Exploits spatio-temporal locality
☺ Efficient low-latency local access to private and shared data (cache-line replication)
☹ Inefficiently handles data with LOW spatio-temporal locality:
– Working set > private cache size
– Shared data with frequent writes
– Wasteful invalidations, synchronous writebacks, cache-line ping-ponging
Result: increased on-chip communication and time spent waiting for expensive events

5 On-Chip Communication Problem
"Wires relative to gates are getting worse every generation" (Shekhar Borkar, Intel)
"Bit movement is much more expensive than computation" (Bill Dally, Stanford)
Must architect efficient coherence protocols

6 Locality of Benchmarks: Evaluating Reuse before Evictions
Utilization: number of private L1 cache accesses before a cache line is evicted
[Chart: utilization distribution; 40% of evicted lines have a utilization < 4.]

7 Locality of Benchmarks: Evaluating Reuse before Invalidations
Utilization: number of private L1 cache accesses before a cache line is invalidated (intervening write)
[Chart: utilization distribution at invalidation.]

8 Remote-Word Access (RA)
NUCA-based protocol [Fensch et al., HPCA'08] [Hoffmann et al., HiPEAC'10]
Assign each memory address to a unique "home" core: the cache line is present only in the shared cache at the home core (single location)
For an access to a non-locally cached word, request the remote shared cache on the home core to perform the read/write access
[Diagram: a write to a word homed remotely is (1) sent to the home core and (2) performed at its shared cache.]

9 Remote-Word Access: Advantages & Drawbacks
☺ Energy-efficient for low-locality data: a word access (~200 bits) is cheaper than a cache-line fetch (~640 bits)
☺ NO data replication: efficient private cache utilization
☺ NO invalidations / synchronous writebacks
☹ Round-trip network request for every remote WORD access
☹ Expensive for high-locality data
☹ Data placement dictates the distance and frequency of remote accesses
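The slide's approximate packet sizes can be reproduced with a back-of-the-envelope traffic model (a sketch: the 64-bit header, word, and 64-byte line widths below are assumptions chosen to illustrate the comparison, not figures stated in the talk):

```python
# Back-of-the-envelope on-chip traffic model (all widths are assumptions).
HEADER_BITS = 64   # assumed network-packet header
WORD_BITS = 64     # one data word
LINE_BITS = 512    # 64-byte cache line

# Remote-word access: request packet + reply carrying a single word.
remote_word = HEADER_BITS + (HEADER_BITS + WORD_BITS)

# Private caching: request packet + reply carrying the whole cache line.
line_fetch = HEADER_BITS + (HEADER_BITS + LINE_BITS)

print(remote_word, "vs", line_fetch, "bits")  # 192 vs 640 bits
```

Under these assumed widths a remote-word round trip moves roughly 192 bits against 640 for a full line fetch, in line with the ~200 vs. ~640 figures on the slide.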

10 Locality-Aware Cache Coherence
Combine the advantages of private caching and remote access:
– Privately cache high-locality lines: optimize hit latency and energy
– Remotely cache low-locality lines: prevent data replication and costly data movement
Private Caching Threshold (PCT):
– Utilization >= PCT → mark as private
– Utilization < PCT → mark as remote
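The classification rule above can be sketched in a few lines (illustrative names; PCT is the private-caching threshold from the slide):

```python
# Per-core, per-cache-line classification rule (a sketch).
def classify(utilization: int, pct: int) -> str:
    """Mark a line private if its measured utilization reaches PCT."""
    return "private" if utilization >= pct else "remote"

print(classify(5, 4))  # high-locality line: replicate in the private L1
print(classify(2, 4))  # low-locality line: access it at the shared cache
```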

11 Locality-Aware Cache Coherence
[Chart: invalidations vs. utilization with a Private Caching Threshold (PCT) of 4; lines with utilization below PCT are classified remote, the rest private.]

12 Outline
Motivation for Locality-Aware Coherence
Detailed Implementation
Optimizations
Evaluation
Conclusion

13 Baseline System
Compute pipeline; private L1-I and L1-D caches; logically shared, physically distributed L2 cache with integrated directory; router per core
L2 cache managed by Reactive-NUCA [Hardavellas, ISCA'09]
ACKwise limited-directory protocol [Kurian, PACT'10]
[Diagram: tiled multicore; each tile contains a compute pipeline, L1 I/D caches, an L2 shared-cache slice with directory, and a router.]

14 Locality-Aware Coherence: Important Features
Intelligent allocation of cache lines:
– In the private L1 cache
– Allocation decision made per-core at cache-line granularity
Efficient locality-tracking hardware:
– Decoupled from traditional coherence-tracking structures
Low protocol complexity:
– NO additional networks for deadlock avoidance

15 Implementation Details: Private Cache Line Tag
Tag fields: State | LRU | Tag | Private Utilization
Private Utilization bits track the cache line's usage in the L1 cache
Communicated back to the directory on eviction or invalidation
Storage overhead is only 0.4%

16 Implementation Details: Directory Entry
Entry fields: State | Tag | ACKwise Pointers 1…p | P/R_1 … P/R_n | Remote Utilization_1 … Remote Utilization_n
P/R_i: Private/Remote mode for core i
Remote Utilization_i: line usage by core i at the shared L2 cache
Complete locality classifier: tracks mode and remote utilization for all cores (storage overhead reduced later)

17 Mode Transitions Summary
Classification is based on the line's previous behavior:
– Initial → Private
– Private → Remote when Private Utilization < PCT (stays Private when >= PCT)
– Remote → Private when Remote Utilization >= PCT (stays Remote when < PCT)
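The transition diagram reduces to a single threshold test on the utilization observed during the line's previous lifetime, measured in the private L1 for a private core and at the shared L2 for a remote one (a sketch; function and value names are illustrative):

```python
# Mode transition rule (a sketch). In this base scheme both demotion and
# promotion use the same threshold PCT; the RAT optimization on a later
# slide raises the remote-to-private promotion threshold separately.
def next_mode(utilization: int, pct: int) -> str:
    """Decide a core's next mode for a line from its last-lifetime utilization."""
    return "private" if utilization >= pct else "remote"

print(next_mode(1, 2))  # low private utilization: demote to remote
print(next_mode(2, 2))  # remote utilization reached PCT: promote to private
```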

18–49 Walk-Through Example (PCT = 2)
Four cores (A, B, C, D) are connected by a network to a shared L2 cache + directory. All cores start out in private mode, and line X is initially uncached.
– Core A reads X once and Core C reads X twice: the directory ships the full cache line to each requester's private L1 and tracks them as sharers (line Shared, Clean).
– Core B writes X: the directory invalidates A and C, and each invalidation reply reports the evicted line's private utilization. A's utilization (1) is below PCT, so the directory reclassifies core A as remote; C's utilization (2) meets PCT, so C stays in private mode. B then receives the line exclusively (Modified).
– Core A reads X: because A is now in remote mode, the directory first obtains a write-back from B (the line becomes Shared and Dirty at the L2) and then returns only the requested word, incrementing A's remote utilization.
– Core A writes X: the write is also performed at the shared L2; B's copy is downgraded via an upgrade reply and the line becomes Modified at the L2.
– Core A continues to read X remotely; once its remote utilization reaches PCT = 2, the directory promotes A back to private mode and ships the full cache line to A's L1.
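The walk-through can be re-enacted with a toy model of one core's classification state (a simplified, illustrative sketch; the class and method names are not from the talk, and real utilization counters live in the L1 tag and directory entry rather than in one structure):

```python
# Toy re-enactment of Core A's classification in the walk-through (PCT = 2).
PCT = 2

class CoreState:
    def __init__(self):
        self.mode = "private"   # all cores start out in private mode
        self.private_util = 0   # L1 accesses in the line's current lifetime
        self.remote_util = 0    # word accesses served at the shared L2

    def l1_access(self):
        self.private_util += 1

    def invalidated(self):
        # Utilization is reported to the directory on invalidation;
        # a low-utilization lifetime demotes the core to remote mode.
        if self.private_util < PCT:
            self.mode = "remote"
        self.private_util = 0

    def remote_access(self):
        self.remote_util += 1
        if self.remote_util >= PCT:
            self.mode = "private"   # directory ships the full line back
            self.remote_util = 0

a = CoreState()
a.l1_access()       # Core A reads X once in its L1...
a.invalidated()     # ...then Core B's write invalidates it: 1 < PCT
print(a.mode)       # remote
a.remote_access()   # word accesses now performed at the shared L2
a.remote_access()   # second remote access reaches PCT
print(a.mode)       # private again
```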

50 Outline
Motivation for Locality-Aware Coherence
Detailed Implementation
Optimizations
Evaluation
Conclusion

51 Complete Locality Classifier: High Directory Storage
Complete locality classifier: tracks locality information for all cores
Entry fields: State | Tag | ACKwise Pointers 1…p | P/R_1 … P/R_n | Remote Utilization_1 … Remote Utilization_n
Bit overhead per core (256 KB L2): Complete classifier 192 KB (60%)

52 Limited Locality Classifier: Reduces Directory Storage
Utilization and mode are tracked for only k sharers
Modes of the other sharers are obtained by taking a majority vote among the tracked ones
Entry fields: State | Tag | ACKwise Pointers 1…p | Core ID_1 … Core ID_k | P/R_1 … P/R_k | Remote Utilization_1 … Remote Utilization_k
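The majority-vote fallback can be sketched as follows (illustrative names; the default mode for an empty tracked set is assumed here to be private, matching the walk-through's starting state):

```python
from collections import Counter

# Limited-k classifier fallback (a sketch): cores not among the k tracked
# sharers inherit the majority mode of the tracked ones.
def mode_of(core: int, tracked: dict) -> str:
    """tracked maps core-id -> 'private' or 'remote' for up to k sharers."""
    if core in tracked:
        return tracked[core]          # tracked core: exact answer
    if not tracked:
        return "private"              # assumption: cores start out private
    votes = Counter(tracked.values())
    return votes.most_common(1)[0][0] # untracked core: majority vote

tracked = {0: "remote", 3: "remote", 7: "private"}  # k = 3 (illustrative)
print(mode_of(3, tracked))  # tracked core
print(mode_of(5, tracked))  # untracked core, majority says remote
```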

53 Limited-3 Locality Classifier
Utilization and mode tracked for 3 sharers
Bit overhead per core (256 KB L2): Complete 192 KB (60%) vs. Limited-3 18 KB (5.7%)
Limited-3 vs. Complete: completion time 3% lower, energy 1.5% lower
Achieves the performance and energy of the Complete locality classifier; completion time and energy are slightly lower because remote-mode classification is learned faster with Limited-3
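The storage totals on this and the previous slide are consistent with a simple bit count (the field widths below are assumptions chosen to reproduce the slide's numbers, not figures stated in the talk: 5-bit utilization counters, a 1-bit P/R flag, and 6-bit core IDs for 64 cores):

```python
# Reproducing the classifier storage overheads (assumed field widths).
CORES = 64
LINE_BYTES = 64
L2_LINES = (256 * 1024) // LINE_BYTES          # 256 KB L2 -> 4096 entries
UTIL_BITS, MODE_BITS, CORE_ID_BITS = 5, 1, 6   # assumptions

# Complete: one (mode, utilization) pair per core per directory entry.
complete_bits = L2_LINES * CORES * (MODE_BITS + UTIL_BITS)

# Limited-3: three tracked sharers, each with a core ID as well.
limited3_bits = L2_LINES * 3 * (CORE_ID_BITS + MODE_BITS + UTIL_BITS)

print(complete_bits // 8 // 1024, "KB")  # 192 KB
print(limited3_bits // 8 // 1024, "KB")  # 18 KB
```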

54 Private/Remote Transitions Result in Private Cache Thrashing
[Same transition diagram as slide 17.]
A core reverts to private mode after PCT accesses to a cache line at the shared L2 cache
The returning lines evict other lines in the private L1 cache, resulting in low spatio-temporal locality for all of them
It is difficult to measure the private-cache locality of a line while it resides in the shared L2 cache

55 Ideal Classifier: NO Private Cache Thrashing
An ideal classifier maintains part of the working set in the private cache; the other lines are placed in remote mode at the shared cache

56 Remote Access Threshold Reduces Private Cache Thrashing
Remote Access Threshold (RAT), varied based on PCT and application behavior [details in paper]:
– Remote → Private when Remote Utilization >= RAT (stays Remote when < RAT)
– If a core is classified as a remote sharer (for capacity reasons), increase the cost of promotion to private mode
– If a core is classified as a private sharer, reset the cost back to its starting value
Reduces private cache thrashing to a negligible level
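One plausible sketch of the adaptive threshold follows; the doubling policy, the cap, and the starting value are assumptions (the talk defers the actual RAT schedule to the paper):

```python
# Adaptive remote-access threshold (a sketch; policy details assumed).
PCT, RAT_MAX = 4, 16

def on_reclassify(mode: str, rat: int) -> int:
    """Update the promotion threshold each time the directory reclassifies a core."""
    if mode == "remote":
        # Still low locality: make promotion back to private mode costlier.
        return min(rat * 2, RAT_MAX)
    # Classified private again: reset the cost to its starting value.
    return PCT

rat = PCT
rat = on_reclassify("remote", rat)   # raised to 8
rat = on_reclassify("remote", rat)   # capped at 16
rat = on_reclassify("private", rat)  # reset to 4
print(rat)
```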

57 Outline
Motivation for Locality-Aware Coherence
Implementation Details
Optimizations
Evaluation
Conclusion

58 Reducing Capacity Misses: Private L1 Cache Miss Rate vs. PCT (Blackscholes)
The miss rate decreases as PCT increases (better utilization): multiple expensive capacity misses are replaced with single cheap word accesses
The miss rate increases again at high PCT, as one capacity miss turns into multiple word misses

59 Energy vs. PCT (Blackscholes)
Reducing L1 cache misses (and converting capacity misses to word accesses) leads to less network traffic and fewer L2 accesses
Accessing a word (~200 bits) is cheaper than fetching the entire cache line (~640 bits)

60 Completion Time vs. PCT (Blackscholes)
Lower L1 cache miss rate and miss penalty: less time spent waiting on L1 cache misses

61 Reducing Sharing Misses: Private L1 Cache Miss Rate vs. PCT (Streamcluster)
Expensive sharing misses are turned into cheap word misses as PCT increases

62 Energy vs. PCT (Streamcluster)
Reduced invalidations, asynchronous write-backs, and cache-line ping-ponging

63 Completion Time vs. PCT (Streamcluster)
Less time spent waiting on invalidations and on loads stalled behind previous stores
Critical-section time reduction → synchronization time reduction

64 Variation with PCT: Results Summary
Evaluated 18 benchmarks from the SPLASH-2, PARSEC, Parallel-MiBench and UHPC suites, plus 3 hand-written benchmarks
A PCT of 4 obtains a 25% reduction in energy and a 15% reduction in completion time
Evaluations done using the Graphite simulator for 64 cores, McPAT/CACTI cache energy models, and DSENT network energy models at 11 nm

65 Conclusion
Three potential advantages of the locality-aware adaptive cache coherence protocol:
– Better private cache utilization
– Reduced on-chip communication (invalidations, asynchronous write-backs and cache-line transfers)
– Reduced memory access latency and energy
Efficient locality-tracking hardware, decoupled from traditional coherence-tracking structures: the Limited-3 locality classifier has a low overhead of 18 KB per core (with a 256 KB per-core L2 cache)
Simple to implement: NO additional networks for deadlock avoidance

