Prefetch-Aware Cache Management for High Performance Caching


Prefetch-Aware Cache Management for High Performance Caching (PACMan). Carole-Jean Wu¶, Aamer Jaleel*, Margaret Martonosi¶, Simon Steely Jr.*, Joel Emer*§. ¶Princeton University, *Intel VSSAD, §MIT. December 7, 2011, International Symposium on Microarchitecture.

Memory Latency is the Performance Bottleneck. Many memory-latency optimization techniques are commonly studied; our work studies two. (1) Prefetching: for our workloads, prefetching alone improves performance by an average of 35%. (2) Intelligent last-level cache (LLC) management [ISCA '10] [MICRO '10] [MICRO '11]. Each helps on its own; this work is the first to investigate how the two interact.

L2 Prefetcher: LLC Misses. [Figure: a 4-core CMP; each core (CPU0-CPU3) has private L1I/L1D caches and a private L2 with an attached prefetcher (PF), and all cores share the LLC.] When the prefetcher requests a specific address for the first time, the request misses in the LLC. Two types of requests therefore go to the LLC: prefetch requests and demand requests.

L2 Prefetcher: LLC Hits. [Figure: the same 4-core CMP; this time the request finds its address already in the LLC, an LLC hit.]

Prefetching + Intelligent LLC Management: let's see what happens when these two commonly used memory-latency optimization techniques are applied together.

Observation 1: For not-easily-prefetchable applications, prefetcher-induced cache pollution causes unexpected performance degradation despite intelligent LLC management.

Observation 2: For prefetching-friendly applications, prefetched data in the LLC diminishes the performance gains from intelligent LLC management. [Chart: on SPEC CPU2006, intelligent LLC management improves performance by 6.5% without prefetching but only 3.0% with prefetching; the gain is halved.]

Design Dimensions for Prefetcher/Cache Management. Prior approaches and this work, compared on whether they reduce prefetcher-cache interference, whether they recover the performance gains lost by intelligent LLC management, and their hardware overhead:
- Adaptive prefetch filters/buffers: reduces interference; does not recover LLC-management gains; some new hardware.
- Prefetch pollution estimation: reduces interference; does not recover LLC-management gains; some new hardware.
- Perf. counter-based prefetcher manager: reduces interference; does not recover LLC-management gains; software overhead.
- Synergistic management of prefetchers and intelligent LLC management (this work): addresses both; moderate overhead (a prefetch bit per line).

PACMan: Prefetch-Aware Cache Management. The two observations above about the interaction between intelligent LLC management and hardware prefetching motivate our prefetch-aware cache management proposal, PACMan, which targets two research questions. Research Question 1: for applications suffering from prefetcher cache pollution, can PACMan minimize that interference? Research Question 2: for applications already benefiting from prefetching, can PACMan improve performance even further?

Talk Outline: Motivation; PACMan: Prefetch-Aware Cache Management (PACMan-M, PACMan-H, PACMan-HM, PACMan-Dyn); Performance Evaluation; Conclusion.

Opportunities for a More Intelligent Cache Management Policy. A cache line's state is naturally updated at two points: when an incoming line is inserted on a cache miss, and when a line's state is updated on a cache hit. Re-Reference Interval Prediction (RRIP) [ISCA '10] encodes each line's predicted re-reference interval as a value from 0 to 3: immediate (0), intermediate (1), far (2), or distant (3). [State diagram: a line is inserted with a "far" prediction; on a re-reference it is promoted toward "immediate"; when no victim is found, all lines age toward "distant"; a line predicted "distant" is evicted.] PACMan treats demand and prefetch requests differently at exactly these two points: cache insertion and hit promotion.
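To make the baseline concrete, here is a minimal Python sketch of 2-bit RRIP for a single cache set, assuming the common "hit priority" variant; the class and method names are ours, not from the paper:

```python
# Minimal sketch of 2-bit RRIP replacement for one cache set.
# RRPV encoding: 0 = immediate, 1 = intermediate, 2 = far, 3 = distant.
DISTANT, FAR, INTERMEDIATE, IMMEDIATE = 3, 2, 1, 0

class RRIPSet:
    def __init__(self, ways=16):
        # Every way starts "distant", i.e., immediately evictable.
        self.rrpv = [DISTANT] * ways

    def victim(self):
        # Evict a line predicted "distant"; if none exists, age every line
        # by one step and retry (the "no victim is found" arc above).
        while True:
            for way, v in enumerate(self.rrpv):
                if v == DISTANT:
                    return way
            self.rrpv = [v + 1 for v in self.rrpv]

    def insert(self, way):
        # Incoming lines get a "far" (not "distant") prediction.
        self.rrpv[way] = FAR

    def hit(self, way):
        # A re-referenced line is predicted to be re-referenced soon again.
        self.rrpv[way] = IMMEDIATE
```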

PACMan-M: Treat Prefetch Requests Differently at Cache Misses. PACMan-M reduces prefetcher cache pollution at cache-line insertion: on a miss, a demand fill is inserted with the baseline "far" prediction, while a prefetch fill is inserted with a "distant" prediction. [State diagram: separate insertion arcs for prefetch and demand fills; the re-reference and eviction arcs are unchanged.]
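As a sketch, PACMan-M is a one-line change to the insertion rule above (same assumptions and names as before):

```python
# PACMan-M sketch: only the insertion prediction depends on request type.
def insert_pacman_m(cache_set, way, is_prefetch):
    # Demand fills keep the baseline "far" insertion; prefetch fills are
    # inserted "distant", so a never-used prefetch is the next victim and
    # pollutes the set for as short a time as possible.
    cache_set.rrpv[way] = DISTANT if is_prefetch else FAR
```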

PACMan-H: Treat Prefetch Requests Differently at Cache Hits. PACMan-H retains the more "valuable" cache lines at hit promotion: similar to PACMan-M, it deprioritizes prefetch requests relative to demand requests, but on hits rather than misses. Cache lines referenced by demand requests are "more valuable", so a demand hit promotes the line as usual while a prefetch hit does not. [State diagram: separate "demand hit" and "prefetch hit" re-reference arcs at each state.]
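Again as a sketch under the same assumptions, PACMan-H changes only the hit-promotion rule:

```python
# PACMan-H sketch: only demand hits promote a line.
def hit_pacman_h(cache_set, way, is_prefetch):
    if not is_prefetch:
        cache_set.rrpv[way] = IMMEDIATE  # demand hit: promote as usual
    # Prefetch hit: no promotion; the prefetcher touching a line again
    # does not make it "more valuable" to keep.
```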

PACMan-HM = PACMan-H + PACMan-M. PACMan-HM applies both rules, distinguishing prefetch from demand requests at both insertion (misses) and promotion (hits). [State diagram: combines the prefetch/demand miss arcs of PACMan-M with the prefetch/demand hit arcs of PACMan-H.]
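Composing the two sketches gives the whole static policy; the toy check at the end shows a never-demanded prefetch staying first in line for eviction:

```python
# PACMan-HM sketch: a line's treatment depends only on whether the
# current request is a prefetch or a demand access.
class PACManHMSet(RRIPSet):
    def insert(self, way, is_prefetch=False):
        insert_pacman_m(self, way, is_prefetch)

    def hit(self, way, is_prefetch=False):
        hit_pacman_h(self, way, is_prefetch)

# Toy check: a prefetch fill that is never demanded stays the preferred victim.
s = PACManHMSet(ways=4)
w = s.victim()
s.insert(w, is_prefetch=True)
assert s.victim() == w  # still "distant", so still first to be evicted
```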

PACMan-Dyn dynamically chooses between the static PACMan policies using set dueling. [Diagram: a few Set Dueling Monitors (SDMs) each run the baseline plus one static policy (PACMan-H, PACMan-M, or PACMan-HM) and feed a per-policy counter; the set index selects between SDM and follower sets, and the follower sets adopt the policy whose counter is the minimum (MIN).]
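A sketch of that selector follows; the three-SDM structure, per-policy counters, and MIN selection are from the slide, while the set counts, counter width, decay step, and index mapping are our illustrative assumptions:

```python
# Set-dueling selector sketch for PACMan-Dyn.
NUM_SETS = 1024                       # total LLC sets (illustrative)
POLICIES = ("PACMan-H", "PACMan-M", "PACMan-HM")
CTR_MAX = 1023                        # assumed 10-bit saturating counters
miss_ctr = {p: 0 for p in POLICIES}

def sdm_owner(set_index):
    # Dedicate a sparse slice of sets to each monitored policy
    # (1 in every 32 sets per policy here; real designs sample similarly).
    r = set_index % 32
    return POLICIES[r] if r < len(POLICIES) else None

def policy_for(set_index):
    owner = sdm_owner(set_index)
    if owner is not None:
        return owner                          # SDM sets always run their policy
    return min(miss_ctr, key=miss_ctr.get)    # followers copy the current winner

def record_llc_miss(set_index):
    owner = sdm_owner(set_index)
    if owner is not None and miss_ctr[owner] < CTR_MAX:
        miss_ctr[owner] += 1                  # duel: more misses = worse policy

def end_epoch():
    # Periodically decay the counters so selection can adapt over time
    # (an assumption; the paper's exact mechanism may differ).
    for p in POLICIES:
        miss_ctr[p] //= 2
```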

Evaluation Methodology. CMP$im simulation framework; 4-way OOO processor with a 128-entry ROB; 3-level cache hierarchy: L1 instruction and data caches (32KB, 4-way, private, 1-cycle), unified L2 cache (256KB, 8-way, private, 10-cycle), L3 last-level cache (1MB per core, 16-way, shared, 30-cycle); main memory with 32 outstanding requests and 200-cycle latency; streamer prefetcher with 16 stream detectors; DRRIP-based LLC with 2-bit RRIP counters.
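For reference, the same parameters gathered into one place (a plain Python summary of this slide, not a CMP$im configuration format):

```python
# Simulated system parameters from the evaluation methodology slide.
CONFIG = {
    "core":       {"issue_width": 4, "rob_entries": 128, "type": "out-of-order"},
    "l1i_l1d":    {"size_kb": 32, "ways": 4, "scope": "private", "latency_cycles": 1},
    "l2":         {"size_kb": 256, "ways": 8, "scope": "private, unified", "latency_cycles": 10},
    "llc":        {"size_mb_per_core": 1, "ways": 16, "scope": "shared", "latency_cycles": 30},
    "memory":     {"outstanding_requests": 32, "latency_cycles": 200},
    "prefetcher": {"type": "streamer", "stream_detectors": 16},
    "llc_policy": {"base": "DRRIP", "rrpv_bits": 2},
}
```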

PACMan-HM Outperforms PACMan-H and PACMan-M. While the PACMan policies improve performance overall, the static PACMan policies can hurt some applications, e.g., bwaves and GemsFDTD.

PACMan-Dyn: Better and More Predictable Performance Gains. PACMan-Dyn performs best overall while providing more consistent performance gains.

Revisiting the research questions. Research Question 1: for applications suffering from prefetcher cache pollution, can PACMan minimize that interference? Research Question 2: for applications already benefiting from prefetching, can PACMan improve performance even further?

PACMan Combines the Benefits of Intelligent LLC Management and Prefetching. [Chart: applications suffering prefetch-induced LLC interference run 22% better; prefetching-friendly applications run 15% better.]

Other Topics in the Paper: PACMan-Dyn-Local/Global for multiprogrammed workloads (an average of 21.0% performance improvement); PACMan's sensitivity to cache size; PACMan for inclusive, non-inclusive, and exclusive cache hierarchies; PACMan's impact on memory bandwidth.

PACMan Conclusion. PACMan is the first synergistic approach to prefetching and intelligent LLC management. Prefetch-aware cache insertion and update deliver roughly 21% performance improvement with minimal hardware storage overhead, and PACMan's fine-grained prefetcher control reduces the performance variability caused by prefetching.
