Improving Cache Performance using Victim Tag Stores

Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Michael A. Kozuch

Caches
- Exploit temporal and spatial locality to reduce the latency of memory accesses
Concerns
- Increasing off-chip memory latencies
- Multi-core systems: shared last-level caches and increased demand on main memory
- Cache utilization becomes more important

Problem
- Most prior work focused primarily on replacement policies and followed a single insertion policy for all blocks
- Comparatively little work exists on cache insertion policies
Goal: improve the cache insertion policy and, consequently, system performance

Cache Insertion Policy
Question: with what priority, anywhere on the scale from lowest to highest, should a missed block be inserted?
- Highest priority: the block stays in the cache for a long time
- Lowest priority: the block gets evicted almost immediately
- Neither policy is good for all blocks

Our Approach: Predict and Insert
- Predict the temporal locality of a missed block
- Good-locality blocks: insert with high priority
- Bad-locality blocks: insert with lowest priority
Prediction mechanism, key insight:
- A good-locality block, if prematurely evicted, will be accessed again soon after eviction
- A bad-locality block will not be accessed for a long time after getting evicted

VTS-Cache: Mechanism
The Victim Tag Store (VTS) is a list of recently evicted block addresses.
- On an eviction: the evicted block's address is added to the VTS
- On a cache miss: the missed block's address is looked up in the VTS
- Present in the VTS: the block was prematurely evicted, i.e., a good-locality block; insert with high priority
- Not present: a bad-locality block or a first access; insert with lowest priority
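
As a concrete illustration (not the paper's hardware design), here is a minimal C++ sketch of this decision logic; VictimTagStore, Priority, and the two hooks are hypothetical names, and the bounded FIFO stands in for however the real VTS is sized:

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    enum class Priority { Lowest, Highest };

    // Hypothetical software model of the basic VTS: a bounded list of
    // recently evicted block addresses plus a membership test.
    class VictimTagStore {
    public:
        explicit VictimTagStore(std::size_t capacity) : capacity_(capacity) {}

        // Hook called when the cache evicts a block.
        void OnEviction(uint64_t addr) {
            if (fifo_.size() == capacity_) {   // full: forget the oldest victim
                uint64_t oldest = fifo_.front();
                fifo_.pop_front();
                if (--counts_[oldest] == 0) counts_.erase(oldest);
            }
            fifo_.push_back(addr);
            ++counts_[addr];
        }

        // Hook called on a cache miss: choose the insertion priority.
        Priority OnMiss(uint64_t addr) const {
            // In the VTS: evicted recently, i.e. prematurely, so the block
            // has good locality and is inserted with high priority.
            if (counts_.count(addr)) return Priority::Highest;
            // Not in the VTS: bad locality or a first access.
            return Priority::Lowest;
        }

    private:
        std::size_t capacity_;
        std::deque<uint64_t> fifo_;                 // eviction order
        std::unordered_map<uint64_t, int> counts_;  // multiset membership
    };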

Improving Performance
Applications with large working sets have large reuse distances, so almost all of their blocks are predicted bad-locality. Inserting all of them with lowest priority leads to thrashing; use the bimodal insertion policy instead.
- Bimodal Insertion Policy (BIP): insert with high priority with a low probability and with lowest priority with a high probability
- Insert predicted bad-locality blocks with BIP (a sketch follows below)
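
A minimal sketch of the bimodal choice, reusing the Priority enum from the sketch above; the 1-in-32 high-priority probability is an illustrative parameter, not a value from the slides:

    #include <random>

    // Hypothetical BIP helper: with a small probability epsilon, insert
    // with high priority; otherwise insert with lowest priority. The rare
    // high-priority inserts keep a small fraction of a thrashing working
    // set resident instead of letting the whole set churn.
    Priority BimodalPriority(std::mt19937& rng, double epsilon = 1.0 / 32.0) {
        std::bernoulli_distribution coin(epsilon);
        return coin(rng) ? Priority::Highest : Priority::Lowest;
    }

With this helper, the miss path in the earlier sketch would return BimodalPriority(rng) instead of Priority::Lowest for blocks not found in the VTS.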

Improved VTS-Cache
Same structure as before, with one change on the miss path:
- Present in the VTS: prematurely evicted, good-locality block; insert with high priority
- Not present: bad-locality block or first access; insert with BIP instead of lowest priority

Improving Robustness
LRU-friendly applications
- LRU is the best-performing policy: blocks that were just accessed have high locality
- The VTS predicts bad locality on a block's first access, which increases the miss rate
LRU-friendly workloads (all applications in the workload are LRU-friendly)
- Detect this situation using set dueling (a sketch follows below)
- Disable the VTS for LRU-friendly workloads
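
Set dueling comes from the DIP line of work cited later; here is a rough sketch of how it could gate the VTS, with the leader-set sampling rule, counter width, and threshold all illustrative choices rather than the paper's parameters:

    #include <algorithm>
    #include <cstdint>

    // Hypothetical set-dueling monitor: a few "leader" sets always use the
    // VTS policy, a few always use plain LRU insertion, and a saturating
    // counter of their misses picks the policy for all "follower" sets.
    class SetDueler {
    public:
        // Illustrative sampling rule: 1 in 64 sets leads for each policy.
        bool IsVtsLeader(uint32_t set) const { return set % 64 == 0; }
        bool IsLruLeader(uint32_t set) const { return set % 64 == 1; }

        // Called on every cache miss: a miss in a leader set votes
        // against that leader's policy.
        void OnMiss(uint32_t set) {
            if (IsVtsLeader(set)) psel_ = std::min(psel_ + 1, kMax);
            else if (IsLruLeader(set)) psel_ = std::max(psel_ - 1, 0);
        }

        // Follower sets use the VTS only while it is winning the duel;
        // otherwise the cache falls back to always-high-priority insertion.
        bool UseVts() const { return psel_ < kMax / 2; }

    private:
        static constexpr int kMax = 1023;   // illustrative 10-bit counter
        int psel_ = kMax / 2;               // start undecided
    };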

Improved Robust VTS-Cache
Same structure, with set dueling choosing between two modes for the missed block's priority:
- VTS mode: present in the VTS, insert with high priority; not present, insert with BIP
- LRU-friendly mode: always insert with high priority

Practical Implementation
Naïve implementation: a set-associative structure similar to the main tag store
- High storage overhead
- High lookup latency and power consumption
Implementing the VTS using a Bloom filter
- A Bloom filter is a compact representation of a set
- Inserts always succeed (unlike in hash tables)
- A test can return false positives (no correctness issues, only occasional mispredictions)
- Easy to implement in hardware
- Storage overhead: less than 1.5% of the entire cache
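
A software sketch of such a Bloom filter; the bit-array size, hash count, and multiplicative hash are illustrative choices (real hardware would use a plain bit array with cheap hashing circuits):

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Hypothetical Bloom filter: k hash functions set/test bits in a bit
    // array. Inserts always succeed; tests may return false positives,
    // which here only cause an occasional harmless misprediction.
    class BloomFilter {
    public:
        void Insert(uint64_t addr) {
            for (int i = 0; i < kHashes; ++i) bits_[Hash(addr, i)] = true;
        }

        bool MayContain(uint64_t addr) const {
            for (int i = 0; i < kHashes; ++i)
                if (!bits_[Hash(addr, i)]) return false;  // definitely absent
            return true;                                  // probably present
        }

        void Clear() { bits_.fill(false); }

    private:
        static constexpr std::size_t kBits = 1u << 16;  // illustrative sizes
        static constexpr int kHashes = 4;

        static std::size_t Hash(uint64_t addr, int i) {
            // Cheap multiplicative hash, taking a different slice of the
            // product for each of the k functions.
            return ((addr * 0x9E3779B97F4A7C15ULL) >> (8 * i)) % kBits;
        }

        std::array<bool, kBits> bits_{};
    };

Because a Bloom filter cannot delete individual entries, the design periodically clears the whole filter; that is what the counter on the next slide is for.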

VTS-Cache: Final Look
1. On an eviction: insert the evicted block's address into the Bloom filter and increment the counter
2. On a miss: test the filter for the missed block's address and return the prediction (always high priority in LRU-friendly mode)
3. When the counter reaches its maximum: clear the filter and reset the counter

VTS-Cache: Benefits
1. Only adds a Bloom filter and a counter: easy to implement
2. Does not modify the cache structure: easy to design and verify
3. Operates only on a cache miss: works well with other mechanisms

Methodology
System
- Three-level cache hierarchy
- 32 KB private L1 cache, 256 KB private L2 cache
- Shared L3 cache (size depends on the experiment)
Benchmarks
- SPEC 2006, SPEC 2000, TPC-C, TPC-H, Apache
Multi-programmed workloads
- Varying intensity and cache-space sensitivity
- 208 2-core workloads and 135 4-core workloads

Prior Approaches
Policies that follow a per-application insertion policy
- Thread-Aware Dynamic Insertion Policy: TA-DIP (Qureshi, ISCA 2007; Jaleel, PACT 2008)
- Thread-Aware Dynamic Re-reference Interval Prediction: TA-DRRIP (Jaleel, ISCA 2010)
Policies that follow a per-block insertion policy
- Single-Usage Block Prediction: SUB-P (Seznec, ACSAC 2007)
- Run-time Cache Bypassing: RTB (Johnson, ISCA 1997)
- Adaptive Replacement Cache: ARC (Megiddo, FAST 2003)

Main Results
2-core (1 MB last-level cache)
- 15% better performance than LRU
- 6% better than TA-DRRIP (the best previous mechanism)
4-core (2 MB last-level cache)
- 21% better performance than LRU
- 8% better than TA-DRRIP
Single core (1 MB last-level cache)
- 5% better than LRU
- On par with TA-DRRIP

2-core Performance

4-core Performance

Block-level Insertion Policy Approaches

Conclusion
VTS-Cache
- Predicts the temporal locality of a missed block based on how recently the block was evicted
- Determines the insertion policy from the prediction: good-locality blocks get high priority; bad-locality blocks use the bimodal insertion policy
- Provides significant performance improvements over mechanisms with per-application insertion policies and over other mechanisms with block-level insertion policies

VTS with Bloom Filter + Counter
- Initialize: clear the Bloom filter and the counter
- On an eviction: insert the evicted block's address into the filter and increment the counter
- On a miss: test the Bloom filter for the missed address and return the prediction
- When the counter reaches its maximum (the preferred VTS size): clear the filter and reset the counter
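
Putting the pieces together, here is a sketch of these operations on top of the BloomFilter class sketched earlier; preferredSize is whatever eviction count the design wants the VTS to remember:

    // Hypothetical composition of the Bloom filter and counter into the
    // final VTS, following the operations listed above.
    class BloomFilterVts {
    public:
        explicit BloomFilterVts(int preferredSize) : max_(preferredSize) {}

        // On an eviction: remember the victim. Bloom filters cannot delete
        // individual entries, so the whole filter is cleared periodically.
        void OnEviction(uint64_t addr) {
            filter_.Insert(addr);
            if (++counter_ >= max_) {   // counter reached the preferred VTS size
                filter_.Clear();        // clear the filter and reset the counter
                counter_ = 0;
            }
        }

        // On a miss: a (possibly false-positive) filter hit predicts
        // good locality, i.e. high insertion priority.
        bool PredictsGoodLocality(uint64_t addr) const {
            return filter_.MayContain(addr);
        }

    private:
        BloomFilter filter_;
        int counter_ = 0;
        int max_;
    };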

Single Core Results

Fairness Results: 2-core

Fairness Results: 4-core