
1 Achieving Non-Inclusive Cache Performance with Inclusive Caches Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer Intel Corporation, VSSAD IEEE/ACM International Symposium on Microarchitecture (MICRO’2010)

2–3 Motivation: Factors making caching important: CPU speed >> memory speed; Chip Multi-Processors (CMPs). Goal: a high-performing LLC, and more broadly a high-performing cache hierarchy. The focus of this talk is designing a high-performing cache hierarchy. [Figure: two cores, each with iL1/dL1 and a private L2, sharing the Last Level Cache (LLC)]

4–7 Cache Hierarchy 101: Kinds of Cache Hierarchies
Inclusive hierarchy (L1 is a subset of the LLC): core requests fill both L1 and the LLC; an LLC eviction back-invalidates the line from L1. Back-invalidates: YES. Total capacity: LLC. Coherence: the LLC acts as a directory.
Exclusive hierarchy (L1 contents are NOT in the LLC): fills go to L1 only; L1 victims fill the LLC; LLC victims go to memory. Back-invalidates: NO. Total capacity: L1 + LLC. Coherence: an LLC miss snoops all L1s (or use a snoop filter).
Non-inclusive hierarchy (L1 is not necessarily a subset of the LLC): fills go to both L1 and the LLC, but LLC evictions do not back-invalidate L1. Back-invalidates: NO. Total capacity: >= LLC and <= (L1 + LLC). Coherence: an LLC miss snoops all L1s (or use a snoop filter).
IN A NUTSHELL: Inclusive caches (+) simplify cache coherence, (−) waste cache capacity, (−) back-invalidates limit performance. Non-inclusive caches (+) do not waste cache capacity, (−) complicate cache coherence, (−) need extra hardware for snoop filtering.
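The three fill-and-eviction flows above can be sketched as a toy model: a demand fill from memory behaves differently in each hierarchy. This is an illustrative sketch (one set per level, LRU-ordered lists with the LRU line first); the names and structure are mine, not the paper's.

```python
def fill(kind, l1, llc, addr, l1_ways=2, llc_ways=4):
    """Apply one demand fill of `addr` to a toy two-level hierarchy.
    l1 and llc are LRU-ordered lists of tags (LRU line first)."""
    def insert(cache, ways, tag):
        victim = cache.pop(0) if len(cache) >= ways else None
        cache.append(tag)                        # new line becomes MRU
        return victim
    if kind == "exclusive":
        l1_victim = insert(l1, l1_ways, addr)    # fill L1 only...
        if l1_victim is not None:
            insert(llc, llc_ways, l1_victim)     # ...L1 victims fill the LLC
        return
    llc_victim = insert(llc, llc_ways, addr)     # inclusive and non-inclusive fill both
    if kind == "inclusive" and llc_victim in l1:
        l1.remove(llc_victim)                    # back-invalidate keeps L1 a subset of LLC
    insert(l1, l1_ways, addr)
```

Starting from the same state, an inclusive fill back-invalidates the evicted line from L1 while a non-inclusive fill leaves the L1 copy alone, which is exactly the capacity and coherence trade-off summarized in the nutshell above.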

8 Performance of Non-Inclusive and Exclusive LLCs (2-core CMP with 32KB L1, 256KB L2, LLC sized by ratio). Enforcing inclusion is bad when the LLC is not significantly larger than the MLC. Why do non-inclusive (NI) and exclusive LLCs perform better? 1. They make use of extra cache capacity by avoiding duplication. 2. They avoid the problems caused by harmful back-invalidates. Which of these two reasons limits the performance of inclusion? [Chart: performance relative to baseline inclusion across cache ratios; AMD and Intel design points marked]

9 Back-Invalidate Problem with Inclusive Caches
Inclusion victims: lines evicted from the core caches due to an LLC eviction. Small caches filter temporal locality: small-cache hits do not update the LLC LRU state, so "hot" small-cache lines drift toward LRU in the LLC.
Example reference pattern: … a, b, a, c, a, d, a, e, a, f …
[Diagram: the LLC LRU stack grows a → b,a → c,b,a → d,c,b,a (MRU on the left), with the hot line 'a' stuck at the LRU position because its L1 hits never reach the LLC. Reference 'e' misses and evicts 'a' from the hierarchy, so the next reference to 'a' misses.]
Filtered temporal locality → lines become LRU in the LLC → hierarchy eviction.
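A minimal simulation of this slide's reference pattern makes the inclusion victim concrete. This is an illustrative sketch (one set per level, 2-way L1, 4-way LLC, true LRU), not the paper's simulator:

```python
from collections import OrderedDict

def simulate_inclusive(pattern, l1_ways=2, llc_ways=4):
    """Toy inclusive hierarchy where L1 hits do NOT update the LLC's
    LRU state, i.e. the small cache filters temporal locality."""
    l1, llc = OrderedDict(), OrderedDict()   # insertion order = LRU order
    outcomes = []
    for addr in pattern:
        if addr in l1:
            l1.move_to_end(addr)             # L1 hit: LLC LRU state untouched
            outcomes.append("L1 hit")
            continue
        if addr in llc:
            llc.move_to_end(addr)
            outcomes.append("LLC hit")
        else:
            if len(llc) >= llc_ways:
                victim = next(iter(llc))     # the LLC's LRU line
                del llc[victim]
                l1.pop(victim, None)         # back-invalidate: an inclusion victim
            llc[addr] = True
            outcomes.append("memory")
        if len(l1) >= l1_ways:
            del l1[next(iter(l1))]           # evict L1's LRU line
        l1[addr] = True
    return outcomes

# The hot line 'a' hits in L1 three times but never moves up in the LLC,
# so the fill of 'e' evicts it from the entire hierarchy: the final
# access to 'a' goes all the way to memory.
print(simulate_inclusive(list("abacadaea")))
```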

10–13 Inclusion Problem Exacerbated on CMPs!
Types of applications: Core Cache Fitting (CCF) apps: working set fits in the core caches. LLC Fitting (LLCF) apps: working set fits in the LLC. LLC Thrashing (LLCT) apps: working set is larger than the LLC.
CCF apps are serviced from the L2 cache and rarely from the LLC, so the replacement state of CCF lines becomes LRU at the LLC. An LLCF app then replaces the CCF working set in the LLC, and inclusion mandates removing the CCF working set from the entire hierarchy. Performance of CCF apps significantly degrades in the presence of LLCF/LLCT apps. [Figure: two cores, one running a CCF app and one an LLCF app, sharing the LLC]

14 Eliminate "Inclusion Victims" Using Temporal Locality Hints
Baseline policies only update replacement state at the level of the hit. Proposal: convey the temporal locality seen in the small caches to the LLC. Temporal Locality Hints (TLH): non-data requests sent to update the LLC replacement state. [Diagram: a core request that hits in L1 updates the L1 LRU and sends a TLH that updates the LLC LRU]
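A sketch of how a hint changes the outcome of slide 9's reference pattern, assuming the same kind of toy single-set model (2-way L1, 4-way LLC, true LRU); the only change is one extra LRU update on an L1 hit:

```python
from collections import OrderedDict

def simulate_tlh(pattern, hints, l1_ways=2, llc_ways=4):
    """Toy inclusive hierarchy. With hints=True, every L1 hit also sends
    a temporal locality hint: a non-data message that promotes the line
    to MRU in the LLC."""
    l1, llc = OrderedDict(), OrderedDict()   # insertion order = LRU order
    last = None
    for addr in pattern:
        if addr in l1:
            l1.move_to_end(addr)
            if hints:
                llc.move_to_end(addr)        # the TLH update
            last = "L1 hit"
            continue
        if addr in llc:
            llc.move_to_end(addr)
            last = "LLC hit"
        else:
            if len(llc) >= llc_ways:
                victim = next(iter(llc))
                del llc[victim]
                l1.pop(victim, None)         # back-invalidate
            llc[addr] = True
            last = "memory"
        if len(l1) >= l1_ways:
            del l1[next(iter(l1))]
        l1[addr] = True
    return last                              # outcome of the final access

# Without hints the final access to 'a' misses the whole hierarchy;
# with hints 'a' is kept near MRU in the LLC and still hits in L1.
print(simulate_tlh(list("abacadaea"), hints=False))   # memory
print(simulate_tlh(list("abacadaea"), hints=True))    # L1 hit
```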

15 Conveying ALL Temporal Locality to the LLC
The bulk of non-inclusive cache performance comes from avoiding back-invalidates, NOT from capacity! Inclusive LLC management must be temporal locality aware. [Chart: baseline system vs. baseline inclusion. *Our studies do not model TLH bandwidth.]

16 Performance of L1 Temporal Locality Hints
L1 hints bridge 85% of the gap between inclusion and non-inclusion (5.2% speedup vs. 6.1% for non-inclusion over baseline inclusion; 2T workloads on a 1:4 hierarchy; *our studies do not model TLH bandwidth). Limitation of L1 hints: very high bandwidth, since the number of messages equals the number of L1 hits. We need a low-bandwidth alternative to L1 temporal locality hints.

17 Improving Inclusive Cache Performance
Eliminating back-invalidates (i.e., building non-inclusive caches) increases coherence complexity. Goal: retain the benefits of inclusion yet avoid inclusion victims. Solution: Temporal Locality Aware (TLA) cache management: ensure the LLC does NOT evict "hot" lines from the core caches. This requires identifying LLC lines that have high temporal locality in the core caches.

18 Early Core Invalidate (ECI)
Main idea: derive temporal locality by removing a line early from the core caches. ECI: on an LLC miss, send an early invalidate for the next victim in the same set. If that line is "hot", the core will "rescue" it from the LLC, and the rescue updates the LLC replacement state as a side effect. [Diagram: miss flow through L1/L2/L3 to memory; a back-invalidate for the LLC victim, and an early core invalidate for the next victim in the LRU stack]
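The ECI miss flow can be sketched as follows; the function and variable names are mine, and only the replacement bookkeeping is modeled (no timing or coherence messages):

```python
def llc_miss_with_eci(llc_set, core_lines, new_tag):
    """One LLC miss under ECI. llc_set is an LRU-ordered list of tags
    (LRU first); core_lines is the set of tags held in the core caches."""
    victim = llc_set.pop(0)          # evict the LLC's LRU line...
    core_lines.discard(victim)       # ...with the usual back-invalidate
    llc_set.append(new_tag)          # fill the demand line at MRU
    next_victim = llc_set[0]         # the set's *next* victim
    core_lines.discard(next_victim)  # early invalidate: the core copy is
                                     # dropped but the line stays in the LLC
    return victim, next_victim
```

If the early-invalidated line is hot, the core re-requests it before the next miss to this set; that request hits in the LLC, and the hit promotes the line in the LLC's replacement state, rescuing it from eviction.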

19 Performance of Early Core Invalidate (ECI)
ECI bridges 55% of the gap between inclusion and non-inclusion (3.4% vs. 6.1%; 2T workloads on a 1:4 hierarchy). Pros: no hardware overhead, and low bandwidth, since the number of messages equals the number of LLC misses. Limitation: a short window to rescue; the rescue must occur BEFORE the next miss to the set. We can still do better than ECI.

20–21 Query Based Selection (QBS)
Main idea: replace lines that are NOT resident in the core caches. QBS: on an LLC miss, the LLC sends a back-invalidate request for its LRU line; a core rejects the request if the line is resident in its caches. On a reject, the LLC updates that line to MRU and repeats the back-invalidate process with the next victim until a core accepts the request (or a timeout is reached). [Figure: miss flow through L1/L2/L3 to memory; the core REJECTs the back-invalidate for hot line 'a', which is promoted to MRU, then ACCEPTs the back-invalidate for 'b']
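Victim selection under QBS can be sketched as a small loop; the names and the list-based LRU stack are illustrative, and the two-try timeout follows the result quoted on the next slide:

```python
def qbs_select_victim(lru_stack, core_resident, max_tries=2):
    """lru_stack: tags ordered LRU-first; core_resident: tags held by any
    core cache. Each rejected candidate is promoted to MRU; after
    max_tries rejections the LLC evicts its current LRU line anyway."""
    stack = list(lru_stack)
    for _ in range(max_tries):
        candidate = stack[0]                 # ask the cores about the LRU line
        if candidate not in core_resident:
            return candidate, stack          # cores accept the back-invalidate
        stack = stack[1:] + [candidate]      # reject: promote to MRU, retry
    return stack[0], stack                   # timeout: evict unconditionally

victim, new_order = qbs_select_victim(["a", "b", "c", "d"], {"a"})
print(victim, new_order)   # hot line 'a' is promoted to MRU; 'b' is evicted
```

Note that the queries ride on the existing back-invalidate path, which is why QBS needs no extra hardware state in the LLC.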

22 Performance of Query Based Selection (QBS)
QBS outperforms non-inclusion (6.6% vs. 6.1%; 2T workloads on a 1:4 hierarchy). Pros: no hardware overhead, and low bandwidth, since the number of messages equals the number of LLC misses. Studies show a maximum of two tries is sufficient for victim selection.

23 Summary of TLA Cache Management (2-core CMP)
QBS performs similarly to non-inclusive caches for all cache ratios. [Chart: speedup over baseline inclusion across cache ratios]

24 QBS Scalability (2-core, 4-core, 8-core CMPs)
With 2T, 4T, and 8T workloads on a 1:4 hierarchy, QBS scales with the number of cores and performs similarly to non-inclusive caches. [Chart: speedup over baseline inclusion]

25 Summary
Problem: the inclusive cache problem becomes WORSE on CMPs, e.g., core-cache-fitting apps running alongside LLC-fitting/thrashing apps. Conventional wisdom: the primary benefit of a non-inclusive cache is its higher capacity. We show the primary benefit is NOT capacity but avoiding back-invalidates. Proposal: Temporal Locality Aware cache management retains the benefit of inclusion while minimizing the back-invalidate problem; a TLA-managed inclusive cache matches the performance of a non-inclusive cache.

26 Q&A

27 Cache Hierarchy 101: Kinds of Cache Hierarchies (backup)
Inclusive hierarchy: L1 is a subset of the LLC; an LLC eviction back-invalidates the line from L1. Non-inclusive hierarchy: L1 is not necessarily a subset of the LLC; fills go to both levels but LLC evictions do not back-invalidate L1.

28 Eliminate "Inclusion Victims" in Inclusive Caches Using Temporal Locality Hints
Baseline policies only update replacement state at the level of the hit. Proposal: convey the temporal locality seen in the small caches to the larger caches. Temporal Locality Hints: non-data requests sent to update the LLC replacement state. [Diagram: hints sent on both L1 hits and L2 hits update the LLC LRU]

29 Performance of L2 Temporal Locality Hints
L2 hints bridge less than 50% of the gap between inclusion and non-inclusion (2.8% vs. 6.1%; 2T workloads on a 1:4 hierarchy; *our studies do not model TLH bandwidth). Limitations of L2 hints: not as good as L1 hints, and still high bandwidth, since the number of messages equals the number of L2 hits. We need a low-bandwidth alternative to temporal locality hints.

30 QBS Scalability (4-core and 8-core CMPs)
With 4T and 8T workloads on a 1:4 hierarchy, QBS scales with the number of cores and performs similarly to non-inclusive caches. [Chart: speedup over baseline inclusion]

