Using Dead Blocks as a Virtual Victim Cache


1 Using Dead Blocks as a Virtual Victim Cache
Samira Khan, Daniel A. Jiménez, Doug Burger, Babak Falsafi

2 The Cache Utilization Wall
Performance gap: processors are getting faster while memory is only getting larger. Caches are not efficient: they are designed for fast lookup and contain too many useless blocks!
We all know there is a huge performance gap between the processor and memory. A large on-chip cache can hide most of the memory latency: it serves a block in a few cycles, while a miss that goes to memory waits hundreds of cycles for the data. We want our cache to be as efficient as possible, but caches are designed for fast lookup, and in practice more than half of the cache blocks are never accessed again, so the cache is not well utilized.

3 Cache Problem: Dead Blocks
A live block will be referenced again before eviction; a dead block is dead from its last reference until it is evicted. (Diagram: a block's lifetime in a cache set, from fill through several hits to the last hit and eventual eviction, alongside the MRU-to-LRU stack.)
Let us look at how efficient the cache actually is. A live block is a block that will be referenced again; from the last reference until the block is evicted, it is dead. A dead block is never referenced again, yet it occupies valuable cache space. In this example, a block is brought into the cache and accessed a couple of times; after the last access it moves down the LRU stack and is eventually evicted, so it is dead from the last access until eviction. The important point: cache blocks are dead on average 59% of the time.

4 Reducing Dead Blocks: Virtual Victim Cache
Put victim blocks in the dead blocks. (Diagram: a set-associative cache, MRU to LRU, with live blocks in blue and dead blocks in black; a victim block is placed into a dead block.)
We want to get rid of the dead blocks and use them to hold more live data. Looking at the cache, there are some live blocks and many dead blocks. We propose to use the dead blocks to hold victim blocks: whenever a replacement must evict a block, instead of discarding the victim we place it in a dead block. In this way the dead blocks all over the cache act as a victim cache, which is why we call these in-cache victim blocks a Virtual Victim Cache.

5 Contribution: Virtual Victim Cache
Skewed dead block predictor. Victim placement and lookup. Results: improves predictor accuracy by 4.7%, reduces miss rate by 26%, improves performance by 12.1%.
Our work makes two contributions: a modified dead block predictor that we call the skewed dead block predictor, and a scheme for victim placement and lookup in dead blocks. The skewed predictor improves accuracy by 4.7%. For single-threaded workloads, our scheme reduces the miss rate by 26% and improves performance by 12.1%.

6 Introduction Virtual Victim Cache Methodology Results Conclusion
In the rest of the talk I will present the Virtual Victim Cache in detail, then describe the methodology we used in our experiments and show the results, and finally conclude with a summary.

7 Virtual Victim Cache
Goal: use dead blocks to hold the victim blocks. Mechanisms required: identify which blocks are dead, and look up the victims.
Our goal is to use the dead blocks to hold victim blocks. To do that we need two mechanisms: first, we need to know which blocks are dead so that we can place victims in them; second, we need to be able to find those victims again so that we can use them.

8 Different Dead Block Predictors
Counting based [ICCD05]: predicts a block dead after a certain number of accesses.
Time based [ISCA02]: predicts a block dead after a certain number of cycles.
Trace based [ISCA01]: predicts the last touch based on the PC sequence.
Cache burst based [MICRO08]: predicts a block dead when it moves out of the MRU position.
Dead blocks can be identified using different properties: we can predict a block dead based on its number of accesses, its number of cycles, or its last touch. We use a trace-based dead block predictor in this work, so we will discuss the trace-based predictor in detail.

9 Trace-Based Dead Block Predictor [ISCA 01]
Predicts the last touch based on the sequence of instructions that access a block. Encoding: truncated addition of instruction PCs, called a signature. The predictor table is indexed by the signature and holds 2-bit saturating counters.
The trace-based dead block predictor uses the PC sequence to identify dead blocks. The intuition is that if a PC sequence once led to the last access of a block, the same PC sequence is likely to lead to a last access again. So the predictor adds and hashes the PC sequence into a fixed-length encoding called the signature; this signature indexes the predictor table, which is a collection of saturating counters. A block is predicted dead when its signature's counter is above a threshold; see the sketch below.
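As a rough illustration, here is a minimal sketch of this predictor in Python. The 15-bit encoding and the 32768-entry table match the methodology slide; the direct-index hash, the threshold value, and the exact counter update policy are assumptions made for the sketch, not details taken from the talk.

```python
SIG_BITS = 15                       # trace encoding width (methodology slide)
TABLE_ENTRIES = 32768               # predictor table size (methodology slide)
SIG_MASK = (1 << SIG_BITS) - 1

table = [0] * TABLE_ENTRIES         # 2-bit saturating counters, initially 0

def update_signature(sig, pc):
    # truncated addition: add the instruction PC, keep the low SIG_BITS bits
    return (sig + pc) & SIG_MASK

def predict_dead(sig, threshold=2):
    # threshold is an assumed value; the talk only says "some predefined threshold"
    return table[sig % TABLE_ENTRIES] >= threshold

def train(sig, was_last_touch):
    i = sig % TABLE_ENTRIES
    if was_last_touch:              # signature led to a block's last touch: count up
        table[i] = min(table[i] + 1, 3)
    else:                           # block was referenced again: count down
        table[i] = max(table[i] - 1, 0)
```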

10 Trace-Based Dead Block Predictor [ISCA 01]
(Diagram: block a is filled by PC1 (ld a), then hit by PC3 (ld a), PC4 (st a), and PC5 (ld a); intervening instructions PC2, PC6, and PC7 access other blocks; the last touch is PC8 (st a), after which the block is evicted. Signature = <PC1, PC3, PC4, PC5, PC8>, and the predictor table entry for this signature is trained to 1, i.e. dead.)
Looking at the same example, PC1, PC3, PC4, PC5, and PC8 access the block, and the last access is at PC8. The PCs are hashed into a signature, and the predictor table is indexed by this signature. The table entry is incremented to record that this signature leads to a dead block; conversely, when a block is referenced again after its signature's counter was trained up, that counter is decremented.

11 Skewed Trace Predictor
Reference trace predictor: index = hash(signature); dead if confidence >= threshold. Skewed trace predictor: index1 = hash1(signature) and index2 = hash2(signature); dead if conf1 + conf2 >= threshold.
We propose a skewed organization of the predictor table. Instead of one hash function, we use two hash functions to index two predictor tables. Previously a block was predicted dead if its confidence counter was greater than some predefined threshold; now there are two confidence counters, and the block is predicted dead if the sum of the two confidences is greater than the threshold.
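A minimal sketch of the skewed lookup, extending the predictor sketch above. The two hash functions, the bank sizes, and the threshold are illustrative assumptions; the talk specifies only that two differently hashed tables are indexed and their confidences summed against a threshold.

```python
BANK_ENTRIES = 16384                # two banks; splitting the budget this way is an assumption

bank1 = [0] * BANK_ENTRIES          # 2-bit saturating counters
bank2 = [0] * BANK_ENTRIES

def hash1(sig):
    return sig % BANK_ENTRIES                              # illustrative hash

def hash2(sig):
    return ((sig >> 3) ^ (sig * 0x9E37)) % BANK_ENTRIES    # illustrative second hash

def predict_dead_skewed(sig, threshold=4):
    # dead only if the two confidence counters together cross the threshold
    return bank1[hash1(sig)] + bank2[hash2(sig)] >= threshold

def train_skewed(sig, was_last_touch):
    for bank, h in ((bank1, hash1), (bank2, hash2)):
        i = h(sig)
        if was_last_touch:
            bank[i] = min(bank[i] + 1, 3)
        else:
            bank[i] = max(bank[i] - 1, 0)
```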

12 Skewed Trace Predictor
Uses two different hash functions, which reduces conflicts and improves accuracy. (Diagram: signatures sigX and sigY collide on one entry under a single hash function, but land on different entries in at least one of the two tables under two hash functions.)
The advantage of using two hash functions is that it reduces conflicts in the table. Two signatures can hash to the same entry when only one hash function is used, but it is very unlikely that two signatures will collide in both tables under two different hash functions.

13 Victim Placement and Lookup in VVC
Place victims in dead blocks of adjacent sets. Any victim could be placed in any set, but then every set would have to be searched on a lookup, so there is a trade-off between the number of sets and the lookup latency.
Having identified dead blocks with the predictor, we now want to place victims in the dead blocks of adjacent sets. Ideally a victim could be placed in any set, giving the best cache efficiency, but then we would have to search every set to find the block again. Because of this trade-off between the number of adjacent sets considered and the lookup latency, we use only one adjacent set, to minimize lookup latency.

14 How to determine the adjacent set?
The adjacent set is the set whose index differs by only one bit, chosen far enough away not to be a hot set. (Diagram: the original set and its adjacent set in the cache, MRU to LRU; the victim in the LRU position of the original set moves into a dead block of the adjacent set.)
In our scheme we use only one adjacent set, whose index differs from the original set's index by one bit; the only restriction is that it should be far enough away not to be a hot set. Looking at the cache, we have the original set and the adjacent set: the victim is in the LRU position of the original set, the adjacent set has some dead blocks, and the victim goes into one of those dead blocks.
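In code, the pairing can be a single XOR. This sketch assumes the flipped bit is the 4th set-index bit, as mentioned in the backup slides; the exact position is a design parameter.

```python
ADJ_BIT = 4    # which set-index bit to flip (the backup slide says the 4th bit)

def adjacent_set(set_index):
    # flipping one bit pairs each set with exactly one partner set,
    # 2**ADJ_BIT sets away -- far enough to avoid sharing a hot region
    return set_index ^ (1 << ADJ_BIT)

assert adjacent_set(adjacent_set(0x2A)) == 0x2A   # the pairing is symmetric
```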

15 Victim Lookup
On a miss, search the adjacent set; if the block is found there, bring it back to its original set. (Diagram: a miss in the original set triggers a search of the adjacent set; on a hit there, the block moves back to its original set.)
Now when we access that block again, we first search the original set, which misses. Then we search the adjacent set, where we find the block, and we move the victim block back to its original set.
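The lookup path can be sketched with a toy model: each set is a dict mapping tag to block, the receiver flag marks victims parked by the VVC, and adjacent_set is the pairing function from the sketch above. The names Block and vvc_lookup are illustrative, not from the paper.

```python
class Block:
    """Toy cache block: just a tag plus the one receiver bit the VVC adds."""
    def __init__(self, tag, receiver=False):
        self.tag = tag
        self.receiver = receiver

def vvc_lookup(sets, set_index, tag):
    blk = sets[set_index].get(tag)
    if blk is not None and not blk.receiver:
        return blk                        # ordinary hit; receiver blocks are ignored here
    adj = adjacent_set(set_index)
    blk = sets[adj].get(tag)
    if blk is not None and blk.receiver:  # second lookup: a victim parked by the VVC
        del sets[adj][tag]
        blk.receiver = False
        sets[set_index][tag] = blk        # bring the block back to its original set
        return blk
    return None                           # true miss: fetch from the next level
```

Because ordinary tag matches skip receiver blocks in their host set, a parked victim can never be confused with a block that natively belongs to that set.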

16 Virtual Victim Cache: Why it Works
Reduces conflict misses: provides extra associativity to the hot set. Reduces capacity misses: puts the LRU block in a dead block, which a fully associative cache would simply have replaced; increasing the number of live blocks effectively increases capacity. Robust to false positive predictions: the VVC will find the block in the adjacent set and avoid the miss.
So why does the Virtual Victim Cache work? It reduces conflict misses because it provides extra associativity through the dead blocks. It also reduces capacity misses: the VVC puts the LRU block into a dead block, whereas a fully associative cache would have replaced that LRU block, so increasing the number of live blocks effectively increases capacity. Finally, the VVC is robust to false positive predictions: even if a block is incorrectly predicted dead, the VVC does not discard it, and we can find it in the adjacent set and bring it back to its original set.

17 Introduction Virtual Victim Cache Methodology Results Conclusion
Now I will talk about the methodology we have used in our experiments.

18 Experimental Methodology
Simulator: modified version of SimpleScalar. Benchmarks: SPEC CPU2000 and SPEC CPU2006.
Issue Width: 4
L1 I Cache: 64KB, 2-way LRU, 64B blocks, 1 cycle hit
L1 D Cache: 64KB, 2-way LRU, 64B blocks, 3 cycle hit
L2 Cache: 2MB, 16-way LRU, 64B blocks, 12 cycle hit
Main Memory: 270 cycles
Cores: 4
Trace Encoding: 15 bits
Predictor Table Entries: 32768
Predictor Entry: 2 bits
We used a modified version of SimpleScalar, and our benchmarks are a subset of SPEC CPU2000 and CPU2006. The VVC is implemented in the L2 cache, and we used 4 cores for our multi-threaded workloads.

19 Single Thread Speedup
(Graph: speedup per benchmark for the VVC, a fully associative cache, and a 64KB victim cache; bar labels include 1.3, 2.6, 1.7, 1.7, 1.3, and 0.9.)
This graph shows the speedup for single-threaded workloads. We compare the VVC with a fully associative cache and with a 64KB victim cache, whose size was chosen to match the VVC overhead. The VVC outperforms these two schemes in most cases. For benchmarks like vpr, parser, and vortex the VVC does not perform well, mainly because these benchmarks do not have regular access patterns, so the predictor cannot predict dead blocks accurately. Note that the fully associative cache and the 64KB victim cache are both unrealistic designs.

20 Single Thread Speedup
(Graph: speedup per benchmark; bar labels include 1.2, 2.6, 1.6, 1.4, and 1.7.)
The accuracy of the predictor is more important in dead block replacement.

21 Speedup for Multiple Threads
(Graph: throughput speedup per multi-threaded workload; the lowest bars are around 0.88, 0.88, 0.89, and 0.84.)
This graph shows the speedup for multi-threaded workloads. The numbers on the x axis identify each benchmark; for example, 175 means 175.vpr. We compare the VVC with a victim cache, dynamic insertion policy, and dead block replacement, and the VVC outperforms all of them, improving throughput by 4%. Dead block replacement performs poorly in the presence of multiple threads, because blocks become less predictable when threads share the cache; the VVC still performs well because it is robust to false positive predictions.

22 Tag Array Reads due to VVC
The Virtual Victim Cache needs a second lookup to find a victim, but most hits are satisfied by the first lookup. (Graph: number of tag array reads on the y axis, benchmarks on the x axis.) Tag array reads in the baseline cache are 3.9% of the total number of instructions executed, versus 4.9% for the VVC; in other words, the VVC increases the number of lookups by about 26% on average.

23 Conclusion
Skewed predictor improves accuracy by 4.7%. Virtual Victim Cache achieves 12.1% speedup for single-threaded workloads and 4% speedup for multi-threaded workloads. Future work in dead block prediction: improve accuracy and reduce overhead.
To summarize, the Virtual Victim Cache achieves 12.1% speedup for single-threaded workloads and 4% for multi-threaded workloads. It improves cache efficiency by 26% for single-threaded workloads and 62% for multi-threaded workloads, and it performs better than dead block replacement because it is robust to false positive predictions. See our paper in MICRO 2010.

24 Thank you

25 Extra slides

26 Dead blocks as a Virtual Victim Cache
Placing victim blocks into the adjacent set: evicted blocks are placed in an invalid or predicted-dead block of the adjacent set; if no such block is present, the victim is placed in the LRU block. The receiver block is then moved to the MRU position; adaptive insertion is also used. Cache lookup for a previously evicted block: the original set lookup misses, then the adjacent set lookup hits, and the block is refilled from the adjacent set into its original set, with the receiver block in the adjacent set marked invalid. One bit keeps track of receiver blocks, and tag matches for ordinary accesses to the original set ignore receiver blocks. A placement sketch follows below.
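A minimal sketch of the placement step under these rules. Frame is an illustrative type, each set is modeled as a list of frames ordered MRU to LRU, and predicted_dead stands in for the skewed predictor; the MRU insert shown is one of the two positions the adaptive insertion policy chooses between.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    tag: int = 0
    valid: bool = False
    receiver: bool = False      # one bit tracks blocks parked by the VVC

def place_victim(adj_frames, victim, predicted_dead):
    """Park an evicted block in the adjacent set (adj_frames is MRU -> LRU)."""
    # prefer an invalid or predicted-dead frame in the adjacent set
    target = next((f for f in adj_frames if not f.valid or predicted_dead(f)), None)
    if target is None:
        target = adj_frames[-1]          # otherwise overwrite the adjacent set's LRU block
    adj_frames.remove(target)
    victim.valid = True
    victim.receiver = True               # mark it so ordinary tag matches skip it
    adj_frames.insert(0, victim)         # MRU insert; adaptive insertion may pick LRU instead
```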

27 Reduction in Cache Area

28 Predictor Coverage and False Positive Rate
(Graph: coverage and false positive rate per benchmark, for the SPEC CPU2000 and CPU2006 subset from 175.vpr through 473.astar, plus the arithmetic mean.)

29 Trace Based Dead Block Predictor
(Diagram: a worked example of a memory instruction sequence going to cache set s. A fill of block a by pc m initializes the signature to m; hits by pc n and pc o update it by truncated addition to m+n and then m+n+o, with the signature updated on each access; on eviction, the predictor entry for the final signature m+n+o is trained to 1, i.e. dead.)

30 MPKI

31 IPC

32 Speedup
(Graph: bar labels include 2.5, 2.6, and 2.6.)

33 Motivation
(Diagram: cache contents.)

34 False Positive Prediction
Shared cache contention results in more false positive predictions

35 Predictor Table Hardware Budget
With an 8KB predictor table, the VVC achieves 5.4% speedup with the original predictor, versus 12.1% speedup with the skewed predictor.

36 Cache Efficiency
VVC improves cache efficiency by 62% for multiple-threaded workloads and by 26% for single-threaded workloads.

37 Introduction Background Virtual Victim Cache Methodology Results Conclusion

38 Introduction Background Virtual Victim Cache Methodology Results Conclusion

39 Experimental Methodology
Dead block predictor parameters:
Trace Encoding: 16 bits
Predictor Table Entries: 32768
Predictor Entry: 2 bits
Predictor Overhead: 8KB
Cache Overhead: 64KB
Total Overhead: 76KB
The overhead is 3.4% of the total 2MB L2 cache space.
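One plausible reading of these numbers (the slide does not show the breakdown, so treat this as an assumption): the predictor table costs 32768 entries x 2 bits = 8KB; the 2MB L2 holds 2MB / 64B = 32768 blocks, so storing a 16-bit trace signature with each block costs 32768 x 16 bits = 64KB; and one receiver bit per block would add a further 32768 bits = 4KB, which reconciles 8KB + 64KB + 4KB = 76KB.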

40 Reducing Dead Blocks: Virtual Victim Cache
(Diagram: a set-associative cache, MRU to LRU.) Dead blocks all over the cache act as a victim cache.

41 Virtual Victim Cache
Place evicted blocks in dead blocks of other adjacent sets. On a miss, search the other adjacent sets for a match; if the block is found in an adjacent set, bring it back to its original set. Dead blocks across the whole cache act as a victim cache.

42 Virtual Victim Cache: How it Works
How to determine the adjacent set? It is the set whose index differs by only one bit (in our case, the 4th bit), far enough away not to be a hot set. How to find a receiver block in the adjacent set? Add one bit to each block to mark receiver blocks. Where to place the receiver block? Use the dynamic insertion policy to choose either the LRU or the MRU position.

