The 3P and 4P cache replacement policies
Pierre Michaud, INRIA
Cache Replacement Championship, June 20, 2010

Optimal replacement?
Offline (we know the future) ➔ Belady's algorithm
Online (we don't know the future) ➔ a problem without a solution
–On random address sequences, all online replacement policies perform equally on average
The best online replacement policy does not exist

In practice…
We search for a policy that performs well on as many applications as possible
We hope that our benchmarks are representative
But there is no guarantee that a replacement policy will always perform well

The DIP replacement policy
Qureshi et al., ISCA 2007
Key idea #1: bimodal insertion (BIP)
–LRU behaves badly on cyclic access patterns ➔ try to correct this
–On a miss, insert the block in the MRU position only with probability E = 1/32; otherwise leave it in the LRU position
Key idea #2: set sampling
–32 sets dedicated to LRU, 32 sets dedicated to BIP; use the winning policy in all the other sets
Beauty of DIP: just one counter!
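To make the set-dueling mechanics concrete, here is a minimal C++ sketch of BIP insertion and the single PSEL counter. The counter width and the names (psel, on_miss, …) are illustrative assumptions, not details from the talk.

```cpp
#include <cstdint>
#include <cstdlib>

// Illustrative sketch of DIP's set dueling, not code from the talk.
// The 10-bit PSEL width is an assumption.
constexpr int PSEL_MAX = 511;
constexpr int PSEL_MIN = -512;
int psel = 0;  // saturating counter shared by the whole cache

enum SetType { DEDICATED_LRU, DEDICATED_BIP, FOLLOWER };

// A miss in a dedicated set is a vote against that set's policy.
void on_miss(SetType t) {
    if (t == DEDICATED_LRU && psel > PSEL_MIN) --psel;  // LRU missed
    if (t == DEDICATED_BIP && psel < PSEL_MAX) ++psel;  // BIP missed
}

// BIP insertion: MRU position with probability E = 1/32, LRU otherwise.
bool bip_inserts_at_mru() { return (rand() & 31) == 0; }

// Follower sets adopt the policy that currently misses less:
// a low PSEL means LRU is missing a lot, so BIP wins.
bool followers_use_bip() { return psel < 0; }
```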

Proposed policy
Incrementally derived from DIP
–Start from a carefully tuned DIP
Based on CLOCK instead of LRU
–Needs less storage than LRU
Combines more than 2 different insertion policies
–(New?) method for multi-policy selection

Carefully tuned DIP
All cache levels use a single line size? ➔ OK
–Otherwise a (small) filter would have been needed
Don't update replacement info on writes
–The fact that a block is evicted from an upper cache level does not mean that the block is likely to be accessed again soon
–If it is, that is chance, not a manifestation of temporal locality
28 SPEC 2006 benchmarks, CRC simulator, 16-way 1 MB L3
Speedup DIP / LRU ➔ avg: +2%; max: +20%; min: -4%

CLOCK DIP
CLOCK policy
–One use bit per block, one clock hand per cache set
–16-way cache ➔ 16 + 4 = 20 bits per set
–On access to a block (hit or insertion), set its use bit
–On a miss, the hand points to the potential victim: if its use bit is set, reset the bit and increment the hand (mod 16); repeat until a victim is found
CLOCK BIP
–On insertion, set the use bit with probability E = 1/32
CLOCK DIP / DIP ➔ avg: +0.2%; max: +1.2%; min: -0.5%
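A minimal sketch of one CLOCK-managed set, following the slide's description; the struct and member names are illustrative.

```cpp
#include <cstdint>
#include <cstdlib>

// One 16-way cache set under CLOCK, as described on the slide:
// 16 use bits + a 4-bit hand = 20 bits of replacement state per set.
struct ClockSet {
    bool use[16] = {false};  // one use bit per block
    uint8_t hand = 0;        // clock hand, 0..15

    // On a hit (or a normal CLOCK insertion), set the block's use bit.
    void on_access(int way) { use[way] = true; }

    // On a miss: while the hand points at a block with its use bit set,
    // reset the bit and advance the hand (mod 16). The first block found
    // with a clear use bit is the victim. Terminates within 16 steps.
    int find_victim() {
        while (use[hand]) {
            use[hand] = false;
            hand = (hand + 1) & 15;
        }
        return hand;
    }

    // CLOCK BIP: on insertion, set the use bit with probability E = 1/32.
    void on_insert_bip(int way) { use[way] = (rand() & 31) == 0; }
};
```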

Multi-policy selection mechanism
DIP uses a single PSEL counter
–Miss in an LRU-dedicated set ➔ decrement PSEL
–Miss in a BIP-dedicated set ➔ increment PSEL
Generalization: N policies, N counters P1, …, PN
–Miss in a set dedicated to policy j ➔ add N-1 to Pj and subtract 1 from all the other counters
–Keep P1 + P2 + … + PN = 0 ➔ if any counter would saturate, leave all counters unchanged
–The best policy is the one with the smallest counter value
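A sketch of the generalized selection counters in C++. The saturation bound SAT and the names are assumptions; the talk only specifies the update rule and the zero-sum invariant.

```cpp
#include <algorithm>
#include <array>

// N zero-sum selection counters, per the slide's generalization of PSEL.
template <int N>
struct PolicySelector {
    std::array<int, N> p{};          // P1..PN, all start at 0 (sum = 0)
    static constexpr int SAT = 512;  // assumed saturation bound

    // Miss in a set dedicated to policy j: add N-1 to Pj, subtract 1 from
    // each other counter. If any counter would saturate, change nothing,
    // which preserves the zero-sum invariant.
    void on_dedicated_miss(int j) {
        std::array<int, N> next = p;
        for (int k = 0; k < N; ++k) next[k] += (k == j) ? (N - 1) : -1;
        for (int v : next)
            if (v > SAT || v < -SAT) return;
        p = next;
    }

    // Best policy = smallest counter (fewest misses relative to the rest).
    int best_policy() const {
        return static_cast<int>(
            std::min_element(p.begin(), p.end()) - p.begin());
    }
};
```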

The 3P policy
For a few benchmarks, neither LRU nor BIP performs well
–For example, 473.astar exhibits access patterns that are approximately cyclic, but that drift relatively quickly
We found that, on a few benchmarks, BIP with E = 1/2 can outperform both LRU and BIP with E = 1/32 ➔ 3 policies
–All three policies use the same hardware
For E = 1/2, it is possible to improve MLP
–Instead of setting the use bit on every other insertion, set it for 64 consecutive insertions out of every 128 misses (see the sketch below)
3P / CLOCK DIP ➔ avg: +0.5%; max: +5.7%; min: -2.1%
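A sketch of the MLP-friendly E = 1/2 variant: the bit is set in a burst of 64 insertions per 128 misses rather than on alternating insertions, so the resulting misses cluster and can overlap in the memory system. The counter name and width are illustrative.

```cpp
#include <cstdint>

// Bursty E = 1/2 insertion for better MLP, per the slide.
struct BurstyBip {
    uint8_t miss_count = 0;  // 7-bit counter, wraps at 128

    // Returns whether this insertion should set the use bit:
    // true for the first 64 of every 128 misses, false for the rest.
    // The long-run probability is still 1/2.
    bool set_use_bit_on_insert() {
        bool set_bit = miss_count < 64;
        miss_count = (miss_count + 1) & 127;
        return set_bit;
    }
};
```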

Shared-cache replacement
Thread-unaware policies like DIP or 3P may be unfair
–OK when threads have equal force, i.e., equal miss rates (in misses per cycle)
–But fragile threads (low miss rate) are penalized when they share the cache with aggressive threads (high miss rate)
BIP is good at containing aggressive threads
Thread-aware bimodal insertion (TABIP): use normal insertion for fragile threads and bimodal insertion for aggressive threads

TABIP: identifying fragile threads
Heuristic: one TMISS counter per running thread
Update the TMISS counters the same way as the policy-selection counters
–E.g., with 4 running threads: a miss by thread k ➔ add 3 to TMISS[k], subtract 1 from the TMISS of each other thread (keeping the sum of the TMISS[i] null)
Define fragile threads as those whose TMISS is negative (a sketch follows below)
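A C++ sketch of the TMISS heuristic for 4 running threads; names are illustrative and saturation handling is omitted for brevity. Because the counters sum to zero, a negative TMISS means the thread misses less than average, i.e., it is fragile.

```cpp
#include <array>

// Zero-sum per-thread miss counters, per the slide's TMISS heuristic.
struct ThreadClassifier {
    static constexpr int T = 4;  // number of running threads
    std::array<int, T> tmiss{};  // all start at 0, sum stays 0

    // Miss by thread k: add T-1 to TMISS[k], subtract 1 from the others.
    void on_miss(int k) {
        for (int i = 0; i < T; ++i) tmiss[i] += (i == k) ? (T - 1) : -1;
    }

    // Fragile threads (TMISS < 0) get normal insertion under TABIP;
    // aggressive threads (TMISS >= 0) get bimodal insertion.
    bool is_fragile(int k) const { return tmiss[k] < 0; }
};
```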

The 4P policy
4P = 3P + CLOCK TABIP
–Uses 4 policy-selection counters instead of 3
28 SPEC 2006 benchmarks, CRC simulator, 16-way 4 MB L3
–100 fixed random 4-thread mixes ➔ the performance of an application = the arithmetic mean of its CPIs over the 400 per-thread CPIs
Speedup 4P / LRU ➔ avg: +3%; max: +18%; min: -4.5%
Speedup 4P / 3P ➔ avg: +1%; max: +7%; min: -3%
4P is fairer than 3P

Questions?