Presentation on theme: "International Symposium on Computer Architecture ( ISCA – 2010 )"— Presentation transcript:

1 International Symposium on Computer Architecture ( ISCA – 2010 )
High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)
Aamer Jaleel, Kevin Theobald, Simon Steely Jr., Joel Emer
Intel Corporation, VSSAD
International Symposium on Computer Architecture (ISCA 2010)

2 Motivation Factors making caching important
Increasing ratio of CPU speed to memory speed
Multi-core poses challenges for shared cache management
LRU has been the standard replacement policy at the LLC
However, LRU has problems!

A mature field such as caching still matters today because memory speeds continue to lag behind processor speeds. Additionally, the multi-core era poses significant challenges for shared cache management. Until now, LRU and its approximations have been the standard replacement policy for caches. Why? Because LRU works well in most cases! However, for some application access patterns, LRU poses significant problems.

3 Problems with LRU Replacement
Working set larger than the cache causes thrashing: every reference misses
References to non-temporal data (scans) discard the frequently referenced working set
Our studies show that scans occur frequently in many commercial workloads

The first problem with LRU replacement arises when the working set is larger than the cache. In such scenarios LRU causes cache thrashing and every reference results in a miss. The second problem is when references to non-temporal data, called scans, discard the frequently referenced working set from the cache. To illustrate: when the working set is smaller than the LLC it receives cache hits, and successive references to the working set continue to hit. However, after a one-time reference to a long stream of data, re-references to the working set miss under LRU replacement. Successive re-references hit again once the data is re-fetched from memory, until the next scan, and the problem repeats: after every scan, the frequently referenced working set always misses. Why is this important? Our studies show that scans occur frequently in many commercial workloads.
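To make the thrashing case concrete, here is a minimal sketch (with illustrative sizes, not the paper's configuration): a fully-associative LRU cache is walked repeatedly by a working set slightly larger than the cache, and after the cold pass every reference still misses.

```cpp
// Minimal sketch: cyclic working set of 10 blocks over an 8-block LRU cache.
// Every reference after the cold pass misses, i.e., LRU thrashes.
#include <algorithm>
#include <cstdio>
#include <list>

int main() {
    const int cache_blocks = 8;     // hypothetical cache capacity in blocks
    const int working_set  = 10;    // working set larger than the cache
    std::list<int> lru;             // front = MRU, back = LRU
    int hits = 0, refs = 0;

    for (int pass = 0; pass < 4; ++pass) {
        for (int addr = 0; addr < working_set; ++addr, ++refs) {
            auto it = std::find(lru.begin(), lru.end(), addr);
            if (it != lru.end()) { ++hits; lru.erase(it); }             // hit: promote
            else if ((int)lru.size() == cache_blocks) lru.pop_back();   // miss: evict LRU
            lru.push_front(addr);                                       // (re)insert at MRU
        }
    }
    std::printf("LRU hits: %d of %d references\n", hits, refs);         // prints 0 hits
}
```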

4 Desired Behavior from Cache Replacement
Working set larger than the cache: preserve some of the working set in the cache
Recurring scans: preserve the frequently referenced working set in the cache

Under both these scenarios, the desired behavior from cache replacement is as follows: if the working set is larger than the cache, preserve some of it in the cache; in the presence of recurring scans, preserve the frequently referenced working set in the cache.

5 Prior Solutions to Enhance Cache Replacement
Working set larger than the cache: preserve some of the working set in the cache. Dynamic Insertion Policy (DIP) provides thrash-resistance with minimal changes to hardware.
Recurring scans: preserve the frequently referenced working set in the cache. Least Frequently Used (LFU) addresses scans, but LFU adds complexity and performs poorly for recency-friendly workloads.

When the working set is larger than the cache, preserving some of the working set can be accomplished using DIP, a simple solution that requires minimal hardware changes. Dealing with scans has been addressed using LFU replacement; however, LFU does not perform well for recency-friendly workloads. The goal of our work is to design a scan-resistant replacement policy that performs well for recency-friendly workloads and can also be easily extended to provide thrash-resistance.

GOAL: Design a high-performing scan-resistant policy that requires minimal changes to hardware.

6 Belady’s Optimal (OPT) Replacement Policy
Replacement decisions made using perfect knowledge of the future reference order
Victim Selection Policy: replace the block that will be re-referenced furthest in the future

To design a high-performing replacement policy, let us take some lessons from Belady's OPT. Consider the physical ways of an 8-way set-associative cache set, where each block is annotated with the "time" at which it will be referenced next. On a cache miss, Belady uses perfect knowledge of the future reference order: the victim selection policy replaces the block that will be re-referenced furthest in the future, in this example block 'c'.
[Figure: an 8-way set (physical ways 0-7) holding tags a, c, b, h, f, d, g, e with future reference times 4, 13, 11, 5, 3, 6, 9, 1; block 'c', whose next reference is furthest away, is the victim.]
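A small hedged sketch of that victim selection: among the blocks currently in the set, evict the one whose next reference lies furthest in the future (or that is never referenced again). The tags and the future trace below are illustrative only, chosen so that 'c' is the furthest-out block as in the slide.

```cpp
// Belady OPT victim selection sketch: scan the future trace for each resident tag.
#include <cstdio>
#include <vector>
#include <cstddef>

// Returns the index (way) of the victim block.
size_t opt_victim(const std::vector<char>& set_tags, const std::vector<char>& future) {
    size_t victim = 0, furthest = 0;
    for (size_t way = 0; way < set_tags.size(); ++way) {
        size_t next = future.size();                       // "never used again" by default
        for (size_t t = 0; t < future.size(); ++t)
            if (future[t] == set_tags[way]) { next = t; break; }
        if (next >= furthest) { furthest = next; victim = way; }
    }
    return victim;
}

int main() {
    std::vector<char> set_tags = {'a','c','b','h','f','d','g','e'};   // 8-way set
    std::vector<char> future   = {'e','f','d','h','a','g','b','c'};   // made-up future order
    size_t way = opt_victim(set_tags, future);
    std::printf("OPT victim: way %zu (tag %c)\n", way, set_tags[way]);
}
```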

7 Practical Cache Replacement Policies
Replacement decisions made by predicting the future reference order
Victim Selection Policy: replace the block predicted to be re-referenced furthest in the future
Continually update predictions of the future reference order; natural update opportunities are cache fills and cache hits

Practical cache replacement policies do not have perfect knowledge of the future. Instead, they make replacement decisions by predicting the future reference order: for each block in the set, the policy maintains some state holding the "predicted time" at which the block will be re-referenced next, and the victim selection policy replaces the block predicted to be re-referenced furthest in the future. Unlike Belady's OPT, practical policies cannot see into the future, so they must also keep updating these predictions; the natural opportunities to do so are cache fills and cache hits.

8 LRU Replacement in Prediction Framework
The "LRU chain" maintains the re-reference prediction
Head of chain (i.e., MRU position) is predicted to be re-referenced soon
Tail of chain (i.e., LRU position) is predicted to be re-referenced far in the future
LRU predicts that blocks are re-referenced in the reverse order of reference
Rename the "LRU chain" the "Re-Reference Prediction (RRP) chain"; the MRU position becomes the RRP head and the LRU position becomes the RRP tail

Using this prediction framework, we can describe the idea behind LRU replacement. The LRU chain maintains the re-reference prediction: the head of the chain (the MRU position) points to a block predicted to be re-referenced soon, and the tail of the chain (the LRU position) points to a block predicted to be re-referenced far in the future. In essence, LRU predicts that blocks are re-referenced in the reverse order of reference. When implemented, the LRU chain position is stored with each cache block. This chain is in fact a Re-Reference Prediction chain, and the MRU and LRU positions are the RRP head and RRP tail respectively.

9 Practicality of Chain Based Replacement
Problem: chain-based replacement is too expensive!
log2(associativity) bits required per cache block (16-way requires 4 bits/block)
Solution: LRU chain positions can be quantized into buckets
Each bucket corresponds to a predicted re-reference interval
The value of the bucket is called the Re-Reference Prediction Value (RRPV)
Hardware cost: 'n' bits per block (ideally n < log2(associativity))
Qualitative prediction for n = 2: RRPV 0 = 'near-immediate', 1 = 'intermediate', 2 = 'far', 3 = 'distant'

One problem with chain-based replacement is that it is too expensive: the number of bits required per block is the log of the associativity. To minimize storage, LRU chain positions can be quantized into buckets. Each bucket corresponds to a re-reference interval, and the value of each bucket is called the re-reference prediction value (RRPV). The hardware cost of such a scheme is 'n' bits per block, where ideally n is less than the log of the associativity. Mapping chain positions to prediction values for n = 2: a value of '0' means the block will be re-referenced soon (near-immediate), '1' means an intermediate re-reference interval, '2' means it will be re-referenced far in the future, and '3' means the distant future. Given these predictions, it is clear that you want to evict blocks from the last bucket, i.e., those with the distant prediction.
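As a quick illustration of the storage savings (assuming, for this writeup only, a 2 MB cache with 64 B lines, i.e., 32,768 blocks): exact LRU chain positions for a 16-way cache need log2(16) = 4 bits per block, or 16 KB of replacement state, while a 2-bit RRPV needs only 8 KB.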

10 Representation of Quantized Replacement (n = 2)
We can now map the logical LRU chain and the quantized predictions onto the physical implementation of the cache: with each cache block we store the block's re-reference prediction value. For example, if block 'a' has re-reference prediction '3', we store the value 3 alongside its physical way.
[Figure: an 8-way set where each way's tag (a, c, b, h, f, d, g, e) is stored together with its RRPV.]

11 Emulating LRU with Quantized Buckets (n=2)
Victim Selection Policy: evict a block with the distant RRPV (i.e., 2^n - 1 = '3'); if no distant RRPV is found, increment all RRPVs and repeat the search; if multiple are found, a tie breaker is needed, so always start the search from physical way '0'
Insertion Policy: insert new blocks with RRPV = '0'
Update Policy: cache hits set the block's RRPV to '0'

With quantized buckets we can emulate LRU. Remember, a re-reference prediction of 0 corresponds to the head of the chain, and a prediction of 2^n - 1 corresponds to the tail. For n = 2, a victim is found by searching for a '3'. If there is more than one '3', a tie breaker is needed; as the tie breaker we always start the search from physical way '0', so here we would replace 'a'. Since LRU inserts at the head, the insertion policy inserts new blocks with prediction '0'. On a cache hit, LRU moves the block to the head; similarly, here we update the block's prediction to '0'. Thus, with fewer bits than LRU, quantized buckets can emulate LRU (a minimal sketch is given below).

But we want to do BETTER than LRU!
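Before improving on LRU, here is a hedged sketch of the quantized-bucket LRU emulation just described, with n = 2 RRPV bits per block. The structure and names are illustrative, not the paper's implementation; empty ways are simply initialized to the distant value.

```cpp
// Quantized LRU emulation: victim = first way (from way 0) with RRPV == 2^n - 1;
// if none, increment all RRPVs and retry. Fills and hits both set RRPV = 0.
#include <cstdio>
#include <vector>

struct Set {
    static constexpr int kDistant = 3;       // 2^n - 1 for n = 2
    std::vector<int> rrpv;
    explicit Set(int ways) : rrpv(ways, kDistant) {}

    int find_victim() {                      // tie breaker: lowest physical way wins
        for (;;) {
            for (int way = 0; way < (int)rrpv.size(); ++way)
                if (rrpv[way] == kDistant) return way;
            for (int& v : rrpv) ++v;         // age everyone and search again
        }
    }
    void insert(int way) { rrpv[way] = 0; }  // LRU emulation: predict near-immediate reuse
    void hit(int way)    { rrpv[way] = 0; }  // LRU emulation: promote to the RRP head
};

int main() {
    Set set(8);
    int victim = set.find_victim();
    set.insert(victim);
    set.hit(victim);
    std::printf("filled way %d, RRPV now %d\n", victim, set.rrpv[victim]);
}
```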

12 Re-Reference Interval Prediction (RRIP)
The framework enables re-reference predictions to be tuned at insertion and update:
Unlike LRU, we can use a non-zero RRPV on insertion
Unlike LRU, we can use a non-zero RRPV on cache hits
Static Re-Reference Interval Prediction (SRRIP): determine the best insertion/update prediction using profiling and apply it to all applications
Dynamic Re-Reference Interval Prediction (DRRIP): dynamically determine the best re-reference prediction at insertion

With this framework, we can tune re-reference predictions on insertion and update; for example, we can use non-zero predictions on insertion and on cache hits. We present Static Re-Reference Interval Prediction, where we profile workloads to find which combination of insertion and update predictions works best, and Dynamic Re-Reference Interval Prediction, where we dynamically determine the best insertion prediction.

13 Static RRIP Insertion Policy – Learn Block’s Re-reference Interval
Key Idea: do not give new blocks too much (or too little) time in the cache
Predict that a new cache block will not be re-referenced soon
Insert new blocks with some RRPV other than '0'
Similar to inserting in the "middle" of the RRP chain, though NOT identical to a fixed insertion position on the chain (see paper)

Let's talk about Static RRIP, where we learn a block's re-reference interval. The key idea is not to give new blocks too much (or too little) time in the cache: too much time wastes cache space, too little time causes cache misses. Thus we need to be careful about how we insert blocks. For n = 2, we probably do not want to insert blocks with '0' or '3', so we insert with some value in between ('1' or '2'). To illustrate, the victim block is replaced and the new block is inserted with prediction '2'. Note that this is similar to inserting in the "middle" of the chain, but it is not identical to a fixed insertion position; see the paper for why.

14 Static RRIP Update Policy on Cache Hits
Hit Priority (HP): like LRU, always update RRPV to 0 on cache hits
Intuition: predicts that blocks receiving hits after insertion will be re-referenced soon

On a cache hit we do what LRU does: always update the prediction to '0'. An alternative update scheme is also described in the paper. A small sketch combining this hit policy with the SRRIP insertion policy from the previous slide is shown below.
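A hedged SRRIP-HP sketch (illustrative names and trace, not the paper's code): new blocks are filled at RRPV = 2^n - 2 and promoted to RRPV = 0 on a hit, with the same aging victim search as before. The toy trace places a small scan between reuses of the working set to show why this helps.

```cpp
// SRRIP with Hit Priority for a single 4-way set (n = 2).
#include <cstdio>
#include <vector>

struct SrripSet {
    static constexpr int kDistant = 3, kLong = 2;   // 2^n - 1 and 2^n - 2 for n = 2
    std::vector<char> tag;
    std::vector<bool> valid;
    std::vector<int>  rrpv;
    explicit SrripSet(int ways) : tag(ways, 0), valid(ways, false), rrpv(ways, kDistant) {}

    bool access(char t) {
        for (size_t w = 0; w < tag.size(); ++w)
            if (valid[w] && tag[w] == t) { rrpv[w] = 0; return true; }   // Hit Priority
        for (;;) {                                                       // miss path
            for (size_t w = 0; w < tag.size(); ++w)
                if (rrpv[w] == kDistant) {                               // victim found
                    tag[w] = t; valid[w] = true; rrpv[w] = kLong;        // SRRIP fill
                    return false;
                }
            for (int& v : rrpv) ++v;                                     // age and retry
        }
    }
};

int main() {
    SrripSet set(4);
    const char trace[] = {'a','b','a','b','x','y','z','a','b'};  // scan x,y,z between reuses of a,b
    int hits = 0;
    for (char t : trace) hits += set.access(t);
    std::printf("SRRIP-HP hits: %d of %zu (4-way LRU would get only 2)\n", hits, sizeof(trace));
}
```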

15 SRRIP Hit Priority Sensitivity to Cache Insertion Prediction at LLC
Averaged across PC games, multimedia, server, and SPEC CPU2006 workloads on a 16-way 2MB LLC

Let's look at how this performs. The x-axis is the re-reference prediction used on a cache miss, and the y-axis is the reduction in misses relative to LRU, averaged across all workloads in the study on a 16-way 2MB LLC, while varying the number of bits stored with each block. When n = 1 there are two possibilities: always insert at '0' or always insert at '1'. Always inserting at '0' performs slightly worse than LRU, while always inserting at '1' performs worst for n = 1. Note that n = 1 is in fact the NRU replacement policy commonly used in commercial processors.

16 SRRIP Hit Priority Sensitivity to Cache Insertion Prediction at LLC
Averaged across PC games, multimedia, server, and SPEC CPU2006 workloads on a 16-way 2MB LLC

Now consider n = 2, which has four insertion predictions: 0, 1, 2, and 3. With n = 2 the best insertion prediction is 2, giving 6% fewer misses. With n = 3 the best insertion prediction is 6, giving about 8% fewer misses. With n = 4 the best is 14, and with n = 5 the best is 30, giving as much as 10% fewer misses than LRU. Across all 'n', Static RRIP performs best when the insertion RRPV is 2^n - 2 and worst when it is 2^n - 1. Using n = 2 or n = 3 captures the bulk of the benefits of Static RRIP.

Regardless of 'n', Static RRIP performs best when RRPV_insertion is 2^n - 2 and worst when RRPV_insertion is 2^n - 1.

17 Why Does RRPV_insertion of 2^n - 2 Work Best for SRRIP?
Before a scan, the re-reference prediction of the active working set is '0'
Recall that NRU (n = 1) is not scan-resistant
For scan resistance, RRPV_insertion MUST be different from the RRPV of the working-set blocks
A larger insertion RRPV tolerates larger scans, so the maximum insertion prediction (i.e., 2^n - 2) works best!
In general, re-references after a scan hit if Slen < (RRPV_insertion - Starting-RRPV_workingset) * (LLCsize - Wsize)
SRRIP is scan-resistant for Slen < RRPV_insertion * (LLCsize - Wsize)

So why does an insertion RRPV of 2^n - 2 work best for SRRIP? First, recall that n = 1 is not scan-resistant: NRU cannot distinguish blocks that receive hits from blocks that do not. Before the scan, the prediction of the active working set is '0'. To achieve scan resistance, new blocks must be inserted with a prediction different from that of the working set, so that scan blocks are evicted before working-set blocks; the insertion prediction must therefore be larger than '0'. At the same time, a block must have enough time to receive hits, so for the common case the prediction must be less than 2^n - 1. The larger the insertion prediction, the longer it takes for the working-set blocks to reach '3', and hence the larger the scans that can be tolerated. In fact, the maximum tolerable scan length can be calculated mathematically; see the paper for the derivation.

For n > 1, Static RRIP is scan-resistant! What about thrash resistance?
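As an illustrative worked example (numbers chosen for this writeup, not taken from the paper): with n = 2 and SRRIP insertion at RRPV = 2, suppose the active working set starts at RRPV = 0 and occupies 512 blocks of a 2 MB LLC with 64 B lines (32,768 blocks). The bound above then tolerates scans of roughly 2 * (32,768 - 512) ≈ 64K blocks, i.e., about 4 MB of streaming data can pass through before the working set is evicted.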

18 DRRIP: Extending Scan-Resistant SRRIP to Be Thrash-Resistant
Always using the same prediction for all insertions will thrash the cache
Like DIP, we need to preserve some fraction of the working set in the cache
Extend DIP to SRRIP to provide thrash resistance
Dynamic Re-Reference Interval Prediction: dynamically select between inserting blocks with 2^n - 1 and 2^n - 2 using Set Dueling
Inserting blocks with 2^n - 1 is the same as "no update on insertion"

As with LRU and its approximations, when the working set is larger than the cache, SRRIP is susceptible to thrashing. Thrashing can be avoided by preserving a portion of the working set in the cache. We therefore extend DIP to SRRIP to provide thrash resistance. We call the result DRRIP: we dynamically select between inserting blocks with 2^n - 1 and 2^n - 2 using set dueling (sketched below).

DRRIP provides both scan-resistance and thrash-resistance.
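A hedged sketch of the set-dueling selection (the leader-set spacing, counter width, and 1/32 bimodal probability are illustrative choices, not necessarily the paper's parameters): a few "leader" sets always fill with the SRRIP prediction (2^n - 2), a few always use the bimodal insertion (mostly 2^n - 1), and a saturating PSEL counter tracks which leader group misses less; follower sets insert with whichever policy the counter currently favors.

```cpp
// DRRIP-style set dueling between SRRIP insertion and bimodal (mostly distant) insertion.
#include <cstdio>
#include <cstdlib>

constexpr int kLeaderPeriod = 32;       // every 32nd set leads for one policy
constexpr int kPselMax = 1023;          // 10-bit saturating counter
static int psel = kPselMax / 2;

enum class SetKind { SrripLeader, BrripLeader, Follower };

SetKind classify(int set_index) {
    if (set_index % kLeaderPeriod == 0) return SetKind::SrripLeader;
    if (set_index % kLeaderPeriod == 1) return SetKind::BrripLeader;
    return SetKind::Follower;
}

int bimodal_rrpv() {                    // mostly "distant" (3), occasionally "long" (2)
    return (std::rand() % 32 == 0) ? 2 : 3;
}

// Called on a miss in 'set_index'; returns the RRPV for the newly filled block (n = 2).
int insertion_rrpv(int set_index) {
    switch (classify(set_index)) {
        case SetKind::SrripLeader: if (psel < kPselMax) ++psel; return 2;
        case SetKind::BrripLeader: if (psel > 0) --psel;        return bimodal_rrpv();
        default:                   // followers copy the leader policy with fewer misses
            return (psel > kPselMax / 2) ? bimodal_rrpv() : 2;
    }
}

int main() {
    std::printf("SRRIP-leader set 0 fills at RRPV=%d\n", insertion_rrpv(0));
    std::printf("follower set 5 fills at RRPV=%d\n", insertion_rrpv(5));
}
```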

19 Performance Comparison of Replacement Policies
16-way 2MB LLC

We now compare the performance of DRRIP and SRRIP on the different workload categories used in the paper. The x-axis is the workload category and the y-axis is the performance improvement over LRU. The first bar is NRU, which always performs similarly to or slightly worse than LRU. Next, DIP performs best for SPEC CPU2006 but not for the commercial workloads. SRRIP improves performance for the multimedia and games workloads. Finally, DRRIP provides additional benefits except for the multimedia workloads; it turns out that for multimedia workloads the cost of a miss varies significantly, and since DRRIP trains primarily on reducing misses, this hurts multimedia performance. Nonetheless, SRRIP always outperforms LRU, and on average DRRIP provides additional gains over SRRIP.

Static RRIP always outperforms LRU replacement; Dynamic RRIP further improves the performance of Static RRIP.

20 Cache Replacement Competition (CRC) Results
Averaged across PC games, multimedia, enterprise server, and SPEC CPU2006 workloads
16-way 1MB private cache, 65 single-threaded workloads; 16-way 4MB shared cache, 165 4-core workloads

At this year's ISCA we organized the Cache Replacement Championship workshop, where contestants submitted their replacement ideas, evaluated on private and shared caches across many SPEC and commercial workloads. Un-tuned DRRIP would have ranked 2nd in the competition and is within 1% of the CRC winner. Across 65 single-threaded workloads DRRIP provides a little over 2% performance improvement, and a little over 6% for a 4-core CMP with a shared cache. Unlike the CRC winner, DRRIP does not require any changes to the cache structure.

21 Total Storage Overhead (16-way Set Associative Cache)
LRU: 4 bits / cache block
NRU: 1 bit / cache block
DRRIP-3: 3 bits / cache block
CRC winner: ~8 bits / cache block

DRRIP outperforms LRU with less storage than LRU; NRU can be easily extended to realize DRRIP!

22 Summary
Scan-resistance is an important problem in commercial workloads, and state-of-the-art policies do not address it
We propose a simple and practical replacement policy:
Static RRIP (SRRIP) for scan-resistance
Dynamic RRIP (DRRIP) for both thrash-resistance and scan-resistance
DRRIP requires ONLY 3 bits per block; in fact it incurs less storage than LRU
Un-tuned DRRIP would have taken 2nd place in the CRC championship while requiring significantly less storage than the CRC winner

23 Q&A


26 But NRU Is Not Scan-Resistant 
Static RRIP with n = 1 is the commonly used NRU policy (with polarity inverted)
Victim Selection Policy: evict a block with RRPV = '1'
Insertion Policy: insert new blocks with RRPV = '0'
Update Policy: cache hits set the block's RRPV to '0'

It turns out that Static RRIP with n = 1 is in fact the commonly used NRU policy (with the polarity inverted). Let's quickly review how NRU works. On a cache miss, the victim selection policy searches for the first '1' starting from physical way '0'; the victim block is replaced and the new block is inserted with a prediction of '0'. On cache hits, the prediction is updated to '0'. However, NRU is not scan-resistant, because it cannot distinguish blocks that receive hits from recently filled blocks.

27 SRRIP Update Policy on Cache Hits
Frequency Priority (FP): improve the re-reference prediction to be shorter than before on hits (i.e., RRPV--)
Intuition: like LFU, predicts that frequently referenced blocks should have higher priority to stay in the cache (a small sketch contrasting FP with HP follows)
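A hedged sketch contrasting the two hit-update policies (function names are illustrative, not from the paper): Hit Priority resets a hitting block's RRPV to 0, while Frequency Priority only decrements it, so a block must hit repeatedly to earn the shortest prediction, an LFU-like bias.

```cpp
// HP vs. FP update on a cache hit.
#include <algorithm>
#include <cstdio>

void hit_priority(int& rrpv)       { rrpv = 0; }                      // HP: promote fully
void frequency_priority(int& rrpv) { rrpv = std::max(0, rrpv - 1); }  // FP: promote one step

int main() {
    int hp = 2, fp = 2;             // both blocks were inserted with RRPV = 2 (n = 2)
    hit_priority(hp);               // 2 -> 0 after a single hit
    frequency_priority(fp);         // 2 -> 1; two hits needed to reach 0
    std::printf("after one hit: HP block RRPV=%d, FP block RRPV=%d\n", hp, fp);
}
```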

28 SRRIP-HP and SRRIP-FP Cache Performance
SRRIP-HP provides 2X better cache performance relative to LRU than SRRIP-FP
We do not need to precisely detect frequently referenced blocks; we need to preserve the blocks that receive hits

29 Common Access Patterns in Workloads
Games, multimedia, enterprise server, and mixed workloads exhibit the following access patterns (a small sketch expanding this notation follows the list):

Stack access pattern: (a1, a2, ..., ak, ..., a2, a1)^A. For any 'k', LRU performs well for such access patterns.
Streaming access pattern: (a1, a2, ..., ak) for k >> assoc. No solution: cache replacement cannot solve this problem.
Thrashing access pattern: (a1, a2, ..., ak)^A, for k > assoc. LRU receives no cache hits due to cache thrashing. Solution: preserve some fraction of the working set in the cache (e.g., use BIP, which does NOT update replacement state for the majority of cache insertions).
Mixed access pattern: [(a1, a2, ..., ak, ..., a2, a1)^A (b1, b2, ..., bm)]^N, m > assoc - k. The (b1, b2, ..., bm) portion is commonly referred to as a scan in the literature; in the absence of scans LRU performs well, but LRU always misses on the frequently referenced (a1, a2, ..., ak, ..., a2, a1)^A. Solution: preserve the frequently referenced working set in the cache (e.g., use LFU, which replaces infrequently referenced blocks in the presence of frequently referenced blocks).
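The sketch below expands the pattern notation above into concrete reference streams, so (a1, ..., ak, ..., a1)^A and the mixed pattern are easy to read; the values of k, A, m, and N in main() are arbitrary illustrative choices.

```cpp
// Generators for the access patterns named above.
#include <cstdio>
#include <vector>

// Stack pattern: a1, a2, ..., ak, ..., a2, a1
std::vector<int> stack_pattern(int k) {
    std::vector<int> p;
    for (int i = 1; i <= k; ++i) p.push_back(i);
    for (int i = k - 1; i >= 1; --i) p.push_back(i);
    return p;
}

// Scan: b1, b2, ..., bm -- one-time references disjoint from the a's
std::vector<int> scan_pattern(int m, int base) {
    std::vector<int> p;
    for (int i = 0; i < m; ++i) p.push_back(base + i);
    return p;
}

// Mixed pattern: [ (a1..ak..a2,a1)^A  (b1..bm) ]^N
std::vector<int> mixed_pattern(int k, int A, int m, int N) {
    std::vector<int> p;
    for (int n = 0; n < N; ++n) {
        for (int a = 0; a < A; ++a) {
            std::vector<int> s = stack_pattern(k);
            p.insert(p.end(), s.begin(), s.end());
        }
        std::vector<int> b = scan_pattern(m, 1000);
        p.insert(p.end(), b.begin(), b.end());
    }
    return p;
}

int main() {
    std::vector<int> trace = mixed_pattern(/*k=*/4, /*A=*/2, /*m=*/16, /*N=*/3);
    std::printf("mixed pattern length: %zu references\n", trace.size());
}
```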

30 Performance of Hybrid Replacement Policies at LLC
4-way OoO processor, 32KB L1, 256KB L2, 2MB LLC; workload categories: PC games/multimedia, server, SPEC CPU2006, and the average

DIP addresses SPEC workloads but NOT PC games and multimedia workloads. Real-world workloads prefer scan-resistance instead of thrash-resistance.

31 Understanding LRU Enhancements in the Prediction Framework
Recent policies, e.g., DIP, say "insert new blocks at the 'LRU position'". What does it mean to insert an MRU line in the LRU position? It is a prediction that the new block will be re-referenced later than the existing blocks in the cache; what DIP really means is "insert new blocks at the RRP tail".
Other policies, e.g., PIPP, say "insert new blocks in the 'middle of the LRU chain'". This is a prediction that the new block will be re-referenced at an intermediate time, in between near-immediate and distant.

The re-reference prediction framework helps describe the intuition behind existing replacement policy enhancements.

32 Performance Comparison of Replacement Policies
16-way 2MB LLC

We again compare the performance of DRRIP and SRRIP for the different commercial workloads: the x-axis is the workload category and the y-axis is the performance improvement over LRU. NRU always performs similarly to or slightly worse than LRU; DIP performs best for SPEC CPU2006 but not for the commercial workloads; SRRIP improves the multimedia and games workloads; and DRRIP provides additional benefits. Since SRRIP in some sense inserts blocks in the middle of the LRU chain, we compare it to a statically profiled ...

Static RRIP always outperforms LRU replacement; Dynamic RRIP further improves the performance of Static RRIP.

