Sampling Dead Block Prediction for Last-Level Caches

Presentation on theme: "Sampling Dead Block Prediction for Last-Level Caches"— Presentation transcript:

1 Sampling Dead Block Prediction for Last-Level Caches
Samira Khan, Yingying Tian, Daniel A. Jiménez

2 Dead Blocks
The last-level cache (LLC) contains useless blocks. A dead block will not be referenced again before it is replaced. Dead blocks waste cache space and power.
Figure: each pixel shows the average live time of a cache block in 400.perlbench; brighter means longer live time.
The cache is an important component for mitigating long memory latency, so we want it to hold blocks that will be used again. In practice, however, it contains many useless blocks. A live block is referenced again before eviction; from its last hit until it is evicted, a block is dead. Dead blocks occupy valuable cache space but contribute nothing to the hit rate. In our experiments, on average 86% of blocks are dead in a 2MB last-level cache!

3 Origin of Dead Blocks Least-recently-used replacement (LRU)
Figure: the lifetime of a block in an LRU cache set, from fill through several hits to the last hit (live) and then down the LRU stack to eviction (dead).
Why are there dead blocks in the cache? The reason is the LRU replacement policy. After its last hit, a block is placed in the MRU position and must then travel down the LRU stack from the MRU to the LRU position before it is evicted. In this example, a block is brought into the cache and accessed a couple of times; after the last access it moves down the LRU stack and is eventually evicted. After the last hit, blocks remain in the cache for a long time.

4 Dead Block Predictors
Dead block predictors identify dead blocks. Problems with current predictors: they consume significant state, they update the predictor on every access, they depend on the LRU replacement policy, and they do not work well in the last-level cache because the L1 and L2 filter out the temporal locality.
The literature offers several dead block predictors that can predict when a block becomes dead, but they share some common problems. First, they consume a lot of extra state. They update the predictor on every access, which costs power. Their notion of a dead block depends on LRU replacement. And they do not work well in the LLC, since the L1 and L2 caches filter out most of the temporal locality.
Goal: a dead block predictor that uses far less state and works well for the LLC.

5 Sampling Dead Block Predictor
No need to update the predictor on every cache access; the predictor can learn from a few sampler sets of partial tags.
Figure: a cache with sets 0 through 7; sampler sets shadow set 3 and set 7.
In this work we introduce sampling-based dead block prediction. The key idea is that the predictor does not need to learn from every cache access. We can select a few sample sets, and the predictor learns only from those sets. The predictor learns only from some sample sets.

6 Contribution
Prediction using sampling: the predictor learns from only a few sample sets, and the sampled sets do not need to reflect the real sets.
Decoupled replacement policy: the cache can use a different replacement policy than the sampler.
Skewed predictor.
Results: 5.9% speedup for single-thread workloads, 12.5% weighted speedup for multi-core workloads, and low predictor power.
Our contribution is sampling-based dead block prediction: instead of learning from every cache access, the predictor learns only from sample sets. We show that these sample sets do not need to reflect the real sets; that is, the replacement policy used in the cache can be decoupled from the policy used in the sample sets. We also introduce a skewed dead block predictor. Our technique achieves a 5.9% speedup for a subset of memory-intensive SPEC CPU 2006 benchmarks and a 12.5% weighted speedup for 4-core workloads, and it also improves power and accuracy.

7 Outline Introduction Background Sampling Predictor Methodology Results
Conclusion
This is the outline of the rest of the talk. Next I will discuss dead block predictors in the background section, then explain how our sampling-based predictor works, then present the methodology used in our experiments and show the results, and finally conclude the talk with a summary.

8 Dead Block Predictors
Trace based [Lai & Falsafi 2001]: predicts the last touch based on the PC sequence.
Time based [Hu et al. 2002]: predicts a block dead after a certain number of cycles.
Counting based [Kharbutli & Solihin 2008]: predicts a block dead after a certain number of accesses.
Dead block predictors predict that a block is dead after its last access. The predicted dead blocks are then used for different optimizations: prefetched blocks can be placed in dead blocks, dead blocks can be replaced instead of the LRU block, and leakage power can be reduced by turning dead blocks off.

9 Dead Block Optimizations
Dead blocks can be used for optimization: prefetching [Lai & Falsafi 2001, Hu et al. 2002, Liu et al. 2008], reducing cache leakage power [Abella et al. 2005], and dead block replacement and bypass [Kharbutli & Solihin 2008]. On a cache miss: replace a dead block rather than the LRU block; if the new block is predicted dead, do not place it at all. We use dead block replacement and bypass with the Sampling Dead Block Predictor, as sketched below.
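To make the miss-handling policy concrete, here is a minimal C++ sketch, not the paper's implementation; the structures (Block, CacheSet) and the incoming_predicted_dead flag are hypothetical names, and the dead/not-dead decision is assumed to come from whatever predictor is in use.

    #include <cstdint>
    #include <vector>

    struct Block { uint64_t tag = 0; bool valid = false; bool predicted_dead = false; };
    struct CacheSet { std::vector<Block> ways; };

    // On a cache miss: bypass if the incoming block is predicted dead; otherwise
    // victimize a predicted-dead block if one exists; otherwise fall back to the
    // default victim (the LRU block in a conventional cache).
    int choose_victim_or_bypass(const CacheSet& set, bool incoming_predicted_dead,
                                int default_victim) {
        if (incoming_predicted_dead)
            return -1;                          // bypass: do not place the new block
        for (size_t w = 0; w < set.ways.size(); ++w)
            if (set.ways[w].valid && set.ways[w].predicted_dead)
                return static_cast<int>(w);     // replace a dead block, not the LRU block
        return default_victim;                  // no dead block: use the normal policy
    }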

10 Reference Trace Predictor [Lai & Falsafi 2001]
Predicts the last touch based on the sequence of instructions that access a block. Encoding: a truncated addition of the instruction PCs, called the signature. The predictor table is indexed by the hashed signature and holds 2-bit saturating counters. A concrete sketch follows.
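As a rough illustration only, here is a minimal C++ sketch of the signature encoding and counter table; the 15-bit signature width, the direct use of the truncated signature as the index, and the prediction threshold are assumptions, not the exact parameters of Lai & Falsafi's design.

    #include <array>
    #include <cstdint>

    constexpr unsigned kSigBits = 15;                   // assumed signature width
    constexpr uint32_t kSigMask = (1u << kSigBits) - 1;

    // Signature: truncated addition of the PCs of the instructions touching a block.
    uint32_t update_signature(uint32_t sig, uint64_t pc) {
        return (sig + static_cast<uint32_t>(pc)) & kSigMask;
    }

    // Table of 2-bit saturating counters; here the truncated signature is used
    // directly as the index (the real predictor hashes it first).
    std::array<uint8_t, 1u << kSigBits> counters{};

    // Train toward "dead" when a block is evicted with this signature, toward
    // "live" when a block carrying this signature is referenced again.
    void train(uint32_t sig, bool block_died) {
        uint8_t& c = counters[sig];
        if (block_died) { if (c < 3) ++c; }
        else            { if (c > 0) --c; }
    }

    bool predict_dead(uint32_t sig) { return counters[sig] >= 2; }  // assumed threshold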

11 Reference Trace Predictor [Lai & Falsafi 2001]
Figure: a sequence of instructions (PC i: ld a, PC j: ld b, PC k: st c, PC l: ld a) accesses an 8-set cache; hits update the predictor table as live, and an eviction updates the evicted block's trace as dead.
This example shows how every cache access results in a predictor update. PC i hits in set 4, and the predictor table is updated to mark that block's trace live. PC j misses in set 2; the LRU block is replaced, and the predictor table is updated to mark the evicted block's trace dead. PC k hits in set 7 and updates the predictor, and PC l hits in set 4 and updates the predictor again. The predictor learns from every access to the cache.

12 Outline Introduction Background Sampling Predictor Methodology Results
Conclusion

13 Sampling Dead Block Prediction
Cache behavior is more-or-less uniform across all sets [Moin et al. 2007]. Keep a few sampler sets of partial tags and update the predictor only when a sampler set is accessed.
Figure: a cache with sets 0 through 7; sampler sets shadow set 3 and set 7, and the predictor is updated only when these sets are accessed.
The intuition behind the sampling predictor is that cache behavior is uniform across the sets, so we do not need to learn from every set. We keep a few sampler sets holding partial tags, and the predictor learns only from cache accesses to the sampler sets, as in the sketch below.
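A minimal C++ sketch of the gating idea: only accesses whose set maps to one of the sampled sets ever reach the predictor's training logic. The uniform modulo mapping and the constants (2048 cache sets, 32 sampler sets) are illustrative assumptions, not the paper's exact scheme.

    #include <cstdint>

    constexpr unsigned kCacheSets   = 2048;  // e.g. a 2MB, 16-way LLC with 64B lines
    constexpr unsigned kSamplerSets = 32;    // only 32 sampler sets for a 2MB LLC
    constexpr unsigned kStride      = kCacheSets / kSamplerSets;

    bool is_sampled(unsigned cache_set)        { return cache_set % kStride == 0; }
    unsigned sampler_index(unsigned cache_set) { return cache_set / kStride; }

    // Called on every LLC access; most accesses return immediately and never
    // touch the predictor, which is where the power saving comes from.
    void on_llc_access(unsigned cache_set, uint64_t pc, uint64_t partial_tag) {
        if (!is_sampled(cache_set))
            return;
        // train_sampler(sampler_index(cache_set), pc, partial_tag);  // hypothetical hook
        (void)pc; (void)partial_tag;
    }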

14 Sampling Dead Block Prediction
Figure: the same access sequence (PC i: ld a, PC j: ld b, PC k: st c, PC l: ld a) on a cache where sampler sets shadow set 3 and set 7; only accesses to the sampler sets update the predictor table.
Going through the same example with the sampling predictor, the only difference is that only accesses to sampler sets result in a predictor update. PC i hits in set 4; set 4 is not a sampler set, so there is no update. PC j misses in set 2; set 2 is not a sampler set, so again there is no update. PC k hits in set 7, which is a sampler set, so the predictor is updated. PC l hits in set 4, which is not a sampler set, so there is no update. Using the sampler, the number of predictor updates drops significantly: only 32 sampler sets are needed in a 2MB LLC.

15 Sampler can have reduced associativity
The sampler can have lower associativity than the cache. Blocks close to the LRU position are dead most of the time, so reduced associativity evicts them early and accelerates the discovery of dead blocks.
Figure: a cache with sets 0 through 7 and reduced-associativity sampler sets for set 3 and set 7.
Another advantage of sampler sets is that they can have lower associativity than the cache sets, which lets the predictor learn faster and identify dead blocks earlier. In our experiments the sampler sets are 12-way associative while the cache is 16-way associative: in a 16-way set-associative 2MB last-level cache, the sampler can be 12-way set associative.

16 Sampler Decouples Replacement Policy
The predictor learns from the LRU policy in the sampler, while the cache can deploy a cheap replacement policy, e.g. random replacement.
Figure: cache sets 0 through 7 use random replacement, while the sampler sets for set 3 and set 7 use LRU replacement.
Another advantage of the sampler is that it decouples the replacement policies. The sampler uses LRU and the predictor learns from it, but the cache can use any cheap replacement policy, such as random. This saves the state and power needed for the LRU policy in the cache, as illustrated by the sketch below.
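A minimal C++ sketch of the decoupling, with assumed field widths and names: LRU state (along with the partial tags and PC bits the predictor needs) lives only in the small sampler, while the main cache keeps no recency bits and can pick victims at random.

    #include <cstdint>
    #include <cstdlib>

    // One entry of a sampler set; only the sampler pays for this bookkeeping.
    struct SamplerEntry {
        uint16_t partial_tag  = 0;   // assumed partial-tag width
        uint16_t partial_pc   = 0;   // PC bits used to index the predictor
        uint8_t  lru_position = 0;   // recency within the reduced-associativity set
        bool     valid        = false;
    };

    // LRU maintained only inside the sampler: move the hit entry to MRU, age the rest.
    void sampler_touch(SamplerEntry* set, unsigned ways, unsigned hit_way) {
        uint8_t old_pos = set[hit_way].lru_position;
        for (unsigned w = 0; w < ways; ++w)
            if (set[w].valid && set[w].lru_position < old_pos)
                ++set[w].lru_position;
        set[hit_way].lru_position = 0;
    }

    // The main cache keeps no LRU stack; a cheap random victim suffices, and
    // dead block replacement (slide 9) can still override it.
    unsigned random_victim(unsigned associativity) {
        return static_cast<unsigned>(std::rand()) % associativity;
    }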

17 Skewed Predictor Reduces conflict in the predictor table
Reference trace predictor table: index = hash(signature); predict dead if confidence >= threshold.
Skewed predictor table: index1 = hash1(PC), index2 = hash2(PC), index3 = hash3(PC); predict dead if conf1 + conf2 + conf3 >= threshold.
We propose a skewed organization of the predictor table. Instead of one hash function, we use three hash functions to index three predictor tables. Previously a block was predicted dead if its confidence counter was at least a predefined threshold; now there are three confidence counters, and the block is predicted dead if the sum of the three confidences is at least the threshold. This reduces conflict in the predictor table. A sketch follows.
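Here is a minimal C++ sketch of the skewed organization: three small counter tables indexed by three different hashes of the PC, with the prediction made on the sum of the three confidence counters. The hash functions, table size, counter width, and threshold value are illustrative assumptions.

    #include <array>
    #include <cstdint>

    constexpr unsigned kIndexBits = 12;                  // assumed: 4096 entries per table
    constexpr uint32_t kMask      = (1u << kIndexBits) - 1;
    constexpr int      kThreshold = 8;                   // assumed prediction threshold

    std::array<uint8_t, 1u << kIndexBits> t1{}, t2{}, t3{};

    // Three different cheap hashes so a conflict in one table rarely repeats in the others.
    uint32_t hash1(uint64_t pc) { return static_cast<uint32_t>(pc) & kMask; }
    uint32_t hash2(uint64_t pc) { return (static_cast<uint32_t>(pc >> kIndexBits) ^ static_cast<uint32_t>(pc)) & kMask; }
    uint32_t hash3(uint64_t pc) { return static_cast<uint32_t>((pc * 0x9E3779B1ULL) >> 13) & kMask; }

    bool predict_dead(uint64_t pc) {
        int sum = t1[hash1(pc)] + t2[hash2(pc)] + t3[hash3(pc)];
        return sum >= kThreshold;                        // dead if conf1 + conf2 + conf3 >= threshold
    }

    // Training bumps all three counters in the same direction (saturating).
    void bump(uint8_t& c, bool died) {
        if (died) { if (c < 15) ++c; }                   // assumed 4-bit saturating counters
        else      { if (c > 0)  --c; }
    }
    void train(uint64_t pc, bool died) {
        bump(t1[hash1(pc)], died);
        bump(t2[hash2(pc)], died);
        bump(t3[hash3(pc)], died);
    }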

18 Outline Introduction Background Sampling Predictor Methodology Results
Conclusion

19 Methodology
CMP$im cycle-accurate simulator [Jaleel et al. 2008]; 2MB/core, 16-way set-associative shared LLC; 32KB I+D L1, 256KB L2; 200-cycle DRAM access time; 19 memory-intensive SPEC CPU 2006 benchmarks; 10 mixes of SPEC CPU 2006 for 4 cores. Power numbers are from CACTI 5.3 [Shyamkumar et al. 2008].

20 Fewer Dead Blocks from Sampling Based Dead Block Replacement and Bypass
Figure: 400.perlbench; each pixel represents the average live time of a cache block, and brighter means longer live time.
This figure shows the reduced dead time from sampling-based dead block replacement and bypass. The first image shows live time in a regular cache and the second shows the optimized cache. Live time is clearly much higher in the optimized cache.

21 Space Overhead
This graph shows the space overhead of the different dead block predictors. The x axis lists the reference trace predictor, the counting-based predictor, and the sampling predictor; the y axis is space overhead in KB. The reference trace predictor uses an 8KB predictor table and 16 bits per cache line. The counting-based predictor uses a 40KB predictor table and 17 bits per cache line. Our sampling predictor uses a 3KB predictor table and 1 bit per cache line, plus 32 sampler sets that take 6.75KB. Sampling dead block prediction uses a 3KB predictor table, one bit per cache line, and a 6.75KB sampler tag array.

22 Power Usage
This slide shows the power consumption of the different dead block predictors. The x axis again lists the reference trace predictor, the counting-based predictor, and the sampling-based predictor; the y axis is the total power consumed by the predictor table and cache metadata, in watts. Our predictor uses less dynamic power because only a few accesses update the predictor. It also uses less than half the leakage power of the other two, mainly because it keeps only 1 bit of cache metadata per block. Compared with the total power of the LLC, sampling prediction consumes 3.1% of the total dynamic power and 1.2% of the total leakage power.

23 Component Contribution to Speedup
Figure: the design's components (predictor table, sampler sets, cache sets).
Here we show the contribution of each component to the geometric mean speedup. Sampling alone achieves 3.9% speedup over the baseline LRU; reducing the sampler associativity raises this to 5.6%; adding the skewed predictor brings it to 5.9% average speedup. Overall the design achieves a 5.9% average geometric mean speedup for single-threaded workloads.

24 Speedup for single-thread workloads
This graph shows the geometric mean speedup for single-threaded workloads. The x axis includes the reference trace predictor, the counting-based predictor, and DIP (dynamic insertion policy) and RRIP (re-reference interval prediction), two recent replacement policies: DIP accounts for thrashing, and RRIP accounts for both thrashing and scanning. We also compare the sampler against random cache replacement and against the counting-based predictor with random replacement. The reference trace predictor performs poorly because the LLC sees little temporal locality and the predictor depends on the PC sequence accessing each block. The counting-based predictor achieves 2.3% speedup, DIP 3.1%, RRIP 4.1%, and the sampler 5.9%. On average the sampler provides 5.9% speedup over LRU, and the sampler with the default random policy still yields 3.5% speedup.

25 Normalized weighted speedup for multi-core workloads
This graph shows the normalized weighted speedup for multi-core workloads. On average the sampler provides a 12.5% benefit over LRU, and the sampler with the default random policy yields a 7% benefit. Note that the dead block predictor is not optimized for multiple threads, unlike DIP/RRIP: the same predictor used for single-threaded workloads is used for the multi-core workloads.

26 Conclusion
Sampling consumes less power, reduces storage overhead, and decouples the replacement policy. Dead block replacement and bypass with sampling achieves a geometric mean speedup of 5.9% for single-threaded workloads and 12.5% for multi-core workloads.

27 Thank you

