Using Dead Blocks as a Virtual Victim Cache

Presentation transcript:

Using Dead Blocks as a Virtual Victim Cache Samira Khan, Daniel A. Jiménez, Doug Burger, Babak Falsafi

The Cache Utilization Wall
Performance gap: processors keep getting faster while memory only gets larger. Caches are not efficient: they are designed for fast lookup and contain too many useless blocks!
We all know that there is a huge performance gap between the processor and memory. A large on-chip cache can hide most of the memory latency: a hit is serviced in a few cycles, but a miss that goes to memory waits hundreds of cycles for the data. We want the cache to be as efficient as possible, but caches are designed for fast lookup, and in practice more than half of the cache blocks are never accessed again, so the capacity is not utilized well.

Cache Problem: Dead Blocks
A live block will be referenced again before eviction; a dead block is dead from its last reference until it is evicted. (Diagram: a block's lifetime in a cache set: fill, hit, hit, hit, last hit, eviction; the block is live up to the last hit and dead afterward as it drifts from MRU to LRU.)
We want to see how efficient the cache really is. A live block is a block that will be referenced again, and from its last reference until it is evicted it is dead: it is never referenced again, yet it occupies valuable cache space. In this example a block is brought into the cache and accessed a couple of times; after the last access it moves down the LRU stack and is eventually evicted, so from the last access until the eviction it was dead. The important point is that cache blocks are dead on average 59% of the time.

Reducing Dead Blocks: Virtual Victim Cache
Put victim blocks in the dead blocks.
We want to get rid of dead blocks and use them to hold more live data. Looking at the cache, there are live blocks and many dead blocks. We propose to use the dead blocks to hold victim blocks: whenever a replacement has to evict a block, instead of discarding the victim we place it in a dead block. In this way the dead blocks spread all over the cache act as a victim cache, which is why we call these in-cache victim blocks the Virtual Victim Cache.

Contribution: Virtual Victim Cache
Skewed dead block predictor; victim placement and lookup.
Results: improves predictor accuracy by 4.7%, reduces the miss rate by 26%, improves performance by 12.1%.
The contributions of this work are a modified dead block predictor, which we call the skewed dead block predictor, and a scheme for victim placement and lookup in dead blocks. The skewed predictor improves accuracy by 4.7%. For single-threaded workloads, our scheme reduces the miss rate by 26% and improves performance by 12.1%.

Introduction Virtual Victim Cache Methodology Results Conclusion In the rest of the talk I will present the virtual victim cache in detail, then describe the methodology used in our experiments and show the results, and conclude with a summary.

Virtual Victim Cache
Goal: use dead blocks to hold the victim blocks.
Mechanisms required: identify which blocks are dead; look up the victims.
Our goal is to use dead blocks to hold victim blocks. To do that we need two mechanisms: first, we need to know which blocks are dead so that we can place victims in them; second, we need to be able to find those victims again so that we can use them.

Different Dead Block Predictors
Counting based [ICCD05]: predicts a block dead after a certain number of accesses.
Time based [ISCA02]: predicts a block dead after a certain number of cycles.
Trace based [ISCA01]: predicts the last touch based on the PC.
Cache burst based [MICRO08]: predicts a block dead when it moves out of the MRU position.
Dead blocks can be identified using different properties: a block can be predicted dead based on its number of accesses, the number of elapsed cycles, or its last touch. We use a trace-based dead block predictor in this work, so the trace-based predictor is discussed in detail.

Trace-Based Dead Block Predictor [ISCA 01]
Predicts the last touch based on a sequence of instructions. The sequence is encoded by truncated addition of the instruction PCs, called the signature, and the predictor table, a set of 2-bit saturating counters, is indexed by this signature.
The trace-based dead block predictor uses the PC sequence to identify dead blocks. The intuition is that if a PC sequence once led to the last access of a block, the same PC sequence is likely to lead to the last access again. The predictor therefore adds and hashes the PC sequence into a fixed-length encoding called the signature, and this signature indexes the predictor table, a collection of saturating counters. A sketch of this structure follows below.
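As a concrete illustration, here is a minimal C sketch of such a reference trace-based predictor: the signature is a truncated sum of the PCs that touch a block, and each table entry is a 2-bit saturating counter. The routine names, the threshold of 2, and the exact truncation are illustrative assumptions, not the paper's precise design.

    #include <stdint.h>

    #define SIG_BITS        15                 /* signature width, per the configuration slide */
    #define TABLE_ENTRIES   (1u << SIG_BITS)   /* 32768 two-bit saturating counters */
    #define DEAD_THRESHOLD  2                  /* counter value at which a block is predicted dead */

    static uint8_t pred_table[TABLE_ENTRIES];  /* 2-bit counters, one byte each for simplicity */

    /* Truncated addition of the PCs of the instructions that touch a block: the signature. */
    static inline uint16_t update_signature(uint16_t sig, uint64_t pc)
    {
        return (uint16_t)((sig + pc) & (TABLE_ENTRIES - 1));
    }

    /* Training on a re-access: the block was not dead after its previous access,
       so decrement the counter for the signature it had at that point. */
    void train_block_lived(uint16_t prev_sig)
    {
        if (pred_table[prev_sig] > 0)
            pred_table[prev_sig]--;
    }

    /* Training on an eviction: the signature at the block's last access led to a
       dead block, so increment its counter. */
    void train_block_died(uint16_t last_sig)
    {
        if (pred_table[last_sig] < 3)
            pred_table[last_sig]++;
    }

    /* Prediction, consulted after each access with the block's current signature. */
    int predict_dead(uint16_t sig)
    {
        return pred_table[sig] >= DEAD_THRESHOLD;
    }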

Trace-Based Dead Block Predictor [ISCA 01]
Example (block a in one cache set): PC1 (ld a) fills the block; PC3 (ld a), PC4 (st a), and PC5 (ld a) hit it; and PC8 (st a) is the last touch before the eviction. PC2, PC6, and PC7 access other blocks. The signature is <PC1, PC3, PC4, PC5, PC8>.
In this example PC1, PC3, PC4, PC5, and PC8 access the block, and the last access is at PC8. These PCs are hashed into a signature that indexes the predictor table, and on the eviction the corresponding counter is incremented to record that this signature leads to a dead block; when a block with that signature is accessed again instead, the counter is decremented.

Skewed Trace Predictor
Reference trace predictor: index = hash(signature); predict dead if confidence >= threshold.
Skewed trace predictor: index1 = hash1(signature), index2 = hash2(signature); predict dead if conf1 + conf2 >= threshold.
We propose a skewed organization of the predictor table: instead of one hash function, two hash functions index two predictor tables. Previously a block was predicted dead if its confidence counter reached a predefined threshold; now there are two confidence counters, and the block is predicted dead if the sum of the two confidences reaches the threshold.

Skewed Trace Predictor
Using two different hash functions reduces conflicts and improves accuracy.
Two signatures, sigX and sigY, can hash to the same entry when only one hash function is used, but it is very unlikely that two signatures will collide in both tables when two different hash functions are used; a conflict in both tables is much less likely. A sketch of this organization follows below.
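A minimal C sketch of the skewed organization, reusing the 2-bit-counter idea above. The two hash functions, the per-table size, and the threshold on the summed confidences are illustrative assumptions.

    #include <stdint.h>

    #define SKEW_ENTRIES    16384   /* per-table size; illustrative, not the paper's budget */
    #define SKEW_THRESHOLD  4       /* threshold on the sum of the two confidences */

    static uint8_t skew_table1[SKEW_ENTRIES];   /* 2-bit saturating confidence counters */
    static uint8_t skew_table2[SKEW_ENTRIES];

    /* Two different hash functions over the signature (illustrative choices). */
    static inline uint32_t skew_hash1(uint32_t sig) { return sig % SKEW_ENTRIES; }
    static inline uint32_t skew_hash2(uint32_t sig) { return ((sig * 2654435761u) >> 16) % SKEW_ENTRIES; }

    /* Predicted dead when the sum of the two confidences reaches the threshold. */
    int skewed_predict_dead(uint32_t sig)
    {
        return skew_table1[skew_hash1(sig)] + skew_table2[skew_hash2(sig)] >= SKEW_THRESHOLD;
    }

    /* Both tables are trained in the same direction on the same event. */
    void skewed_train(uint32_t sig, int block_died)
    {
        uint8_t *c1 = &skew_table1[skew_hash1(sig)];
        uint8_t *c2 = &skew_table2[skew_hash2(sig)];
        if (block_died) {
            if (*c1 < 3) (*c1)++;
            if (*c2 < 3) (*c2)++;
        } else {
            if (*c1 > 0) (*c1)--;
            if (*c2 > 0) (*c2)--;
        }
    }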

Victim Placement and Lookup in VVC
Place victims in the dead blocks of adjacent sets. Any victim could in principle be placed in any set, but then every candidate set has to be searched on a lookup, so there is a trade-off between the number of sets used and the lookup latency.
Having identified dead blocks with the predictor, we want to place victims in the dead blocks of adjacent sets. Ideally a victim could be placed in any adjacent set, which would give the best cache efficiency, but then every such set would have to be searched to find the block again. We therefore use only one adjacent set to minimize lookup latency.

How to determine the adjacent set?
The adjacent set differs from the original set index by only one bit and is far enough away not to be a hot set.
In our scheme we use a single adjacent set whose index differs by one bit from the original set; the only restriction is that it should be far enough away not to be a hot set. The victim sits in the LRU position of the original set, and some blocks in the adjacent set are dead, so the victim is placed in a dead block of the adjacent set. The index computation is sketched below.
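A minimal C sketch of the adjacent-set computation: flip one bit of the set index. The choice of the 4th bit comes from the backup slides; the helper name is ours.

    #include <stdint.h>

    #define ADJ_BIT 4   /* which set-index bit to flip; the backup slides mention the 4th bit */

    /* The adjacent set's index differs from the original index in exactly one bit,
       chosen so that the partner set is unlikely to be another hot set. Because XOR
       is its own inverse, the same function maps a receiver block back to its home set. */
    static inline uint32_t adjacent_set(uint32_t set_index)
    {
        return set_index ^ (1u << ADJ_BIT);
    }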

Victim Lookup
On a miss, search the adjacent set; if the block is found there, bring it back to its original set.
When the block is accessed again we first search the original set, which misses; we then search the adjacent set, find the block there, and move the victim block back to its original set. The lookup flow is sketched below.
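A minimal C sketch of this lookup flow, built on the adjacent_set() helper above. The cache arrays, field names, and the simplified handling of the block displaced from the original set are assumptions for illustration, not the paper's implementation.

    #include <stdint.h>

    #define NUM_SETS  2048      /* 2MB / (64B blocks * 16 ways), matching the L2 configuration */
    #define ASSOC     16

    typedef struct {
        uint64_t tag;
        uint16_t sig;       /* signature of the block's most recent access, for the predictor */
        uint8_t  valid;
        uint8_t  receiver;  /* 1 if this frame holds a victim received from the partner set */
        uint8_t  lru;       /* 0 = MRU ... ASSOC-1 = LRU */
    } block_t;

    static block_t l2[NUM_SETS][ASSOC];

    /* Probe one set; ordinary lookups ignore receiver frames, while VVC lookups in
       the adjacent set match only receiver frames. */
    static block_t *probe(uint32_t set, uint64_t tag, int want_receiver)
    {
        for (int w = 0; w < ASSOC; w++) {
            block_t *b = &l2[set][w];
            if (b->valid && b->tag == tag && b->receiver == want_receiver)
                return b;
        }
        return 0;
    }

    /* Lookup flow from the slide: original set first, then the single adjacent set,
       then a true miss. For brevity this sketch overwrites the original set's LRU
       frame when pulling a block back and omits the LRU bookkeeping of the other
       ways; the full scheme would treat the displaced frame as a new victim. */
    block_t *vvc_lookup(uint32_t set, uint64_t tag)
    {
        block_t *b = probe(set, tag, 0);
        if (b)
            return b;                                   /* ordinary hit */

        block_t *v = probe(adjacent_set(set), tag, 1);  /* receiver copy of the block? */
        if (v) {
            block_t *dst = &l2[set][0];                 /* find the original set's LRU frame */
            for (int w = 1; w < ASSOC; w++)
                if (l2[set][w].lru > dst->lru)
                    dst = &l2[set][w];
            *dst = *v;                                  /* move the block back home */
            dst->receiver = 0;
            dst->lru = 0;                               /* re-inserted at MRU */
            v->valid = 0;                               /* free the receiver frame */
            return dst;                                 /* miss avoided */
        }
        return 0;                                       /* true miss: fetch from the next level */
    }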

Virtual Victim Cache: Why it Works
Reduces conflict misses: provides extra associativity to hot sets.
Reduces capacity misses: puts the LRU block in a dead block, which a fully associative cache would simply have replaced; increasing the number of live blocks effectively increases capacity.
Robust to false positive predictions: the VVC will find a wrongly evicted block in the adjacent set and avoid the miss.
Why does the Virtual Victim Cache work? It reduces conflict misses because it provides extra associativity through the dead blocks. It also reduces capacity misses: the VVC puts the LRU block in a dead block, whereas a fully associative cache would have replaced that LRU block, and increasing the number of live blocks effectively increases capacity. Finally, it is robust to false positive predictions: even if a block is incorrectly predicted dead, the VVC does not discard it, since we can find it in the adjacent set and bring it back to its original set.

Introduction Virtual Victim Cache Methodology Results Conclusion Now I will talk about the methodology we have used in our experiments.

Experimental Methodology
Simulator: modified version of SimpleScalar. Benchmarks: SPEC CPU2000 and SPEC CPU2006.
Issue width: 4
L1 I cache: 64KB, 2-way LRU, 64B blocks, 1 cycle hit
L1 D cache: 64KB, 2-way LRU, 64B blocks, 3 cycle hit
L2 cache: 2MB, 16-way LRU, 64B blocks, 12 cycle hit
Main memory: 270 cycles
Cores: 4 (for the multi-threaded workloads)
Trace encoding: 15 bits
Predictor table entries: 32768
Predictor entry: 2 bits
We used a modified version of SimpleScalar, and our benchmarks are a subset of SPEC CPU2000 and CPU2006. The VVC is implemented in the L2 cache, and we use 4 cores for the multi-threaded workloads.

Single Thread Speedup
(Graph: speedup over the baseline for each single-threaded benchmark; a few bars are clipped, with labels such as 1.3, 1.7, and 2.6, and one benchmark falls below 1.0 at 0.9.)
This graph shows the speedup for single-threaded workloads. We compare the VVC with a fully associative cache and with a 64KB victim cache, the victim cache size being chosen to match the VVC overhead. The VVC outperforms both schemes in most cases. For benchmarks like vpr, parser, and vortex the VVC does not perform well, mainly because these benchmarks do not have regular access patterns, so the predictor cannot predict dead blocks accurately. Note that a fully associative cache and a 64KB victim cache are both unrealistic designs.

Single Thread Speedup
(Graph: single-thread speedup for each benchmark; a few bars are clipped, with labels such as 1.2, 1.4, 1.6, 1.7, and 2.6.)
The accuracy of the predictor is more important in dead block replacement.

Speedup for Multiple Threads
(Graph: speedup for the multi-threaded workloads; a few bars fall below 1.0, between 0.84 and 0.89.)
This graph shows the speedup for multi-threaded workloads; the numbers on the x-axis identify the benchmarks, so 175 means 175.vpr. We compare the VVC with a victim cache, a dynamic insertion policy, and dead block replacement, and the VVC outperforms all of them, improving throughput by 4%. Dead block replacement performs poorly in the presence of multiple threads because blocks become less predictable when threads share the cache, but the VVC still performs well because it is robust to false positive predictions.

Tag Array Reads due to VVC
The virtual victim cache needs a second lookup to find a victim, but most hits are satisfied by the first lookup. (Graph: number of tag array reads on the y-axis, benchmarks on the x-axis.) On average the VVC increases the number of lookups by 26%; in other words, tag array reads amount to 3.9% of the total instructions executed in the baseline versus 4.9% in the VVC.

Conclusion
The skewed predictor improves accuracy by 4.7%. The Virtual Victim Cache achieves a 12.1% speedup for single-threaded workloads and a 4% speedup for multi-threaded workloads. Future work in dead block prediction: improve accuracy and reduce overhead.
To summarize, the Virtual Victim Cache achieves a 12.1% speedup for single-threaded workloads and a 4% speedup for multi-threaded workloads. It improves cache efficiency by 26% for single-threaded workloads and 62% for multi-threaded workloads, and it performs better than dead block replacement because it is robust to false positive predictions. See our paper in MICRO 2010.

Thank you

Extra slides

Dead Blocks as a Virtual Victim Cache
Placing victim blocks into the adjacent set: evicted blocks are placed in an invalid or predicted-dead block of the adjacent set; if no such block is present, the victim is placed in the LRU block. The receiver block is then moved to the MRU position, and adaptive insertion is also used.
Cache lookup for a previously evicted block: the original set lookup misses, the adjacent set lookup hits, and the block is refilled from the adjacent set into its original set while the receiver block in the adjacent set is marked invalid. One bit per block keeps track of receiver blocks, and tag matches for ordinary accesses ignore the receiver blocks. A placement sketch follows below.
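A minimal C sketch of the placement step described above, building on the l2 array, block_t, adjacent_set(), and skewed_predict_dead() sketched earlier. The per-frame signature field and the unconditional MRU insertion are simplifying assumptions; as the slide notes, adaptive insertion may instead pick the LRU position.

    /* Victim placement, following this backup slide: prefer an invalid or
       predicted-dead frame in the adjacent set, otherwise fall back to that set's
       LRU frame, then mark the chosen frame as a receiver. This sketch always
       inserts at MRU; the actual scheme also uses adaptive/dynamic insertion to
       choose between the MRU and LRU positions. */
    void vvc_place_victim(uint32_t set, const block_t *victim, uint16_t victim_sig)
    {
        uint32_t adj = adjacent_set(set);
        block_t *dst = 0;

        /* 1. Look for an invalid or predicted-dead frame in the adjacent set. */
        for (int w = 0; w < ASSOC && !dst; w++) {
            block_t *b = &l2[adj][w];
            if (!b->valid || skewed_predict_dead(b->sig))
                dst = b;
        }

        /* 2. Otherwise use the adjacent set's LRU frame. */
        if (!dst) {
            dst = &l2[adj][0];
            for (int w = 1; w < ASSOC; w++)
                if (l2[adj][w].lru > dst->lru)
                    dst = &l2[adj][w];
        }

        /* 3. Install the victim as a receiver block (one bit marks receivers). */
        *dst = *victim;
        dst->sig = victim_sig;
        dst->receiver = 1;
        dst->valid = 1;
        dst->lru = 0;   /* MRU insertion in this sketch */
    }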

Reduction in Cache Area

Predictor Coverage and False Positive Rate
(Graph: coverage and false positive rate for each benchmark: 175.vpr, 178.galgel, 179.art, 181.mcf, 187.facerec, 188.ammp, 197.parser, 255.vortex, 256.bzip2, 300.twolf, 401.bzip2, 429.mcf, 450.soplex, 456.hmmer, 464.h264ref, 473.astar, plus their arithmetic mean.)

Trace-Based Dead Block Predictor
(Diagram: a memory instruction sequence going to cache set s. Block a is filled at pc m (ld a) and hit at pc n (ld a) and pc o (st a), with its signature updated on each access from m to m+n to m+n+o; the later instructions pc p through pc w access other blocks (b through i), block a drifts toward eviction, and on the evict action the predictor entry for <signature m+n+o> is updated to 1.)

MPKI

IPC

Speedup
(Graph: speedup for each benchmark; a few bars are clipped, with labels around 2.5 and 2.6.)

Motivation
(Diagram of a cache.)

False Positive Prediction Shared cache contention results in more false positive predictions

Predictor Table Hardware Budget
With an 8KB predictor, the VVC achieves a 5.4% speedup with the original predictor, whereas it achieves a 12.1% speedup with the skewed predictor.

Cache Efficiency
The VVC improves cache efficiency by 26% for single-threaded workloads and by 62% for multiple-threaded workloads.

Introduction Background Virtual Victim Cache Methodology Results Conclusion

Experimental Methodology: Dead Block Predictor Parameters
Trace encoding: 16
Predictor table entries: 32768
Predictor entry: 2 bit
Predictor overhead: 8KB
Cache overhead: 64KB
Total overhead: 76KB
The overhead is 3.4% of the total 2MB L2 cache space.

Reducing Dead Blocks: Virtual Victim Cache
(Diagram: cache with blocks from MRU to LRU.) Dead blocks all over the cache act as a victim cache.

Virtual Victim Cache
Place evicted blocks in the dead blocks of adjacent sets. On a miss, search the adjacent sets for a match; if the block is found in an adjacent set, bring it back to its original set. Dead blocks across the whole cache act as a victim cache.

Virtual Victim Cache: How it Works
How is the adjacent set determined? It is the set whose index differs by only one bit (in our case the 4th bit) and is far enough away not to be a hot set.
How is a receiver block found in the adjacent set? One bit is added to mark receiver blocks.
Where is the receiver block placed? A dynamic insertion policy chooses either the LRU or the MRU position.