SHiP++: Enhancing Signature-Based Hit Predictor for Improved Cache Performance
CRC-2, ISCA 2017, Toronto, Canada, June 25, 2017
Vinson Young, Georgia Tech; Chia-Chen Chou, Georgia Tech; Aamer Jaleel, NVIDIA; Moinuddin K. Qureshi, Georgia Tech
(Presenter note: 25 minutes total per slot; 20-minute presentation, 4-minute questions, 1-minute change.)

Importance of Replacement Policy
- An increasing number of cores increases memory load.
- Improving the cache hit rate reduces memory load cheaply.
- Improving access latency → improves performance.
- Reducing memory accesses → improves power and performance.
- LRU is commonly used.

Problems with LRU Replacement
- A working set larger than the cache causes thrashing (every reference misses).
- References to non-temporal data (scans) discard the frequently referenced working set.
- Scans occur frequently in commercial workloads.
[Animation: hit/miss sequences for a working set of size Wsize against an LLC of size LLCsize, with and without scans.]
Speaker notes: The first problem with LRU replacement arises when the working set is larger than the cache; LRU then causes thrashing and every reference misses. The second problem arises when references to non-temporal data, called scans, discard the frequently referenced working set from the cache. When the working set is smaller than the LLC, references to it hit, and successive references continue to hit. However, after a one-time reference to a long stream of data, re-references to the working set miss under LRU. Re-references after re-fetching the data from memory hit again until the next scan, and the problem repeats: after every scan, the frequently referenced working set misses. This matters because our studies show that scans occur frequently in many commercial workloads.

Desired Behavior from Cache Replacement
- Working set larger than the cache → preserve some of the working set in the cache. [DIP (ISCA'07) and DRRIP (ISCA'10) achieve this effect]
- Recurring scans → preserve the frequently referenced working set in the cache. [SRRIP (ISCA'10) achieves this effect]
[Animation: hit/miss sequences showing part of an oversized working set hitting, and the working set surviving a scan.]
Speaker notes: Under both scenarios, the desired behavior from cache replacement is as follows. If the working set is larger than the cache, preserve some of it in the cache. In the presence of recurring scans, preserve the frequently referenced working set in the cache.

Dynamic Re-Reference Interval Prediction (DRRIP) = scan-resistant SRRIP + thrash-resistant BRRIP
[Diagram: 2-bit RRIP states 0 (immediate), 1 (intermediate), 2 (far), 3 (distant); lines are inserted, promoted on re-reference, and evicted from the distant state.]
Speaker notes: Since our work builds on the re-reference interval framework, here is a quick review of RRIP replacement. Where LRU maintains an LRU position for each cache line, RRIP replaces the notion of an LRU position with a prediction of the line's likely re-reference interval. With 2-bit RRIP there are four possible re-reference intervals: a line with interval 0 is predicted to be re-referenced soon (immediate), and a line with interval 3 is predicted to be re-referenced in the distant future; in between are the intermediate and far intervals. When selecting a victim, RRIP always evicts a line with a distant re-reference interval; if no such line is found, the states of all lines in the set are incremented until one is. When inserting new lines, scan-resistant SRRIP inserts all lines with the far re-reference interval in order to learn each block's re-reference interval dynamically: a line with no locality is quickly discarded, while a line with locality is promoted to the immediate state on its next re-reference and thus preserved in the cache longer.
[Jaleel et al., ISCA'10]
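The speaker notes above describe the full RRIP mechanism. A minimal C++ sketch of that behavior (illustrative only, not the talk's code; the Line struct, MAX_RRPV, and the 1/32 BRRIP probability are my own assumptions):

```cpp
// Minimal sketch of 2-bit RRIP victim selection plus SRRIP/BRRIP insertion,
// following the description in the notes above (Jaleel et al., ISCA'10).
// The Line struct, MAX_RRPV, and the 1/32 BRRIP probability are assumptions.
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

constexpr uint8_t MAX_RRPV = 3;          // "distant" re-reference interval

struct Line { uint8_t rrpv = MAX_RRPV; };

// Evict a line predicted to be re-referenced in the distant future (rrpv == 3);
// if none exists, increment every line's state in the set and retry.
int find_victim(std::vector<Line>& set) {
    for (;;) {
        for (std::size_t way = 0; way < set.size(); ++way)
            if (set[way].rrpv == MAX_RRPV) return static_cast<int>(way);
        for (Line& line : set)
            ++line.rrpv;
    }
}

// SRRIP (scan-resistant): insert at "far" so one-time scans leave quickly,
// and promote to "immediate" on a re-reference.
void srrip_insert(Line& line)   { line.rrpv = MAX_RRPV - 1; }
void on_rereference(Line& line) { line.rrpv = 0; }

// BRRIP (thrash-resistant): insert mostly at "distant", occasionally at "far",
// so only part of an oversized working set is retained.
void brrip_insert(Line& line) {
    line.rrpv = (std::rand() % 32 == 0) ? MAX_RRPV - 1 : MAX_RRPV;
}
```

DRRIP then set-duels SRRIP against BRRIP and uses whichever insertion policy is currently winning.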

[Slide repeated without notes: DRRIP = scan-resistant SRRIP + thrash-resistant BRRIP over the 2-bit RRIP state diagram. Jaleel et al., ISCA'10]

Signature-based Hit Predictor (SHiP)
[Diagram: the same 2-bit RRIP state diagram; PC-classified re-use lines are inserted at state 2 (far), PC-classified scans at state 3 (distant).]
[Wu et al., MICRO'11]

Observe Signature Re-Reference Behavior
- Observe the re-reference pattern in the baseline cache.
[Diagram: an LLC line's existing metadata, indexed by the load/store address: cache tag, replacement state, coherence state.]

Observe Signature Re-Reference Behavior
- Observe the re-reference pattern in the baseline cache.
- Gathered per line: a reuse bit recording whether the line was re-referenced after cache insertion (1 bit), and the "signature" responsible for the cache insertion (14 bits).
[Diagram: the LLC line metadata extended with the reuse bit and signature_insert fields, derived from the load/store address.]
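For concreteness, a rough sketch of this per-line metadata (the field names and packing are my own; the slide only specifies the 1-bit reuse flag and the 14-bit insertion signature):

```cpp
// Rough sketch of the per-line metadata described on this slide. The bit
// widths in the comments follow the slide; the packing is not specified there.
#include <cstdint>

struct ShipLineMeta {
    uint16_t signature_insert; // 14-bit signature responsible for the insertion
    uint8_t  reuse;            // 1-bit flag: re-referenced after insertion?
    uint8_t  rrpv;             // existing 2-bit RRIP replacement state
};
```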

Learn Signature Re-Reference Behavior
- Signature History Counter Table (SHCT): 16K entries of 3-bit counters.
- Learning with the SHCT:
  - Cache hit → SHCT[signature_insert]++
  - Evict with re-use = 0 → SHCT[signature_insert]--
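A minimal sketch of these learning rules, assuming a simple 16K-entry array of saturating counters (helper names and the byte-per-counter storage are mine):

```cpp
// Illustrative sketch of the baseline SHCT learning rules on this slide:
// 16K 3-bit saturating counters indexed by the insertion signature.
#include <array>
#include <cstdint>

constexpr int     SHCT_SIZE = 16384;  // 16K entries
constexpr uint8_t SHCT_MAX  = 7;      // 3-bit saturating counter

std::array<uint8_t, SHCT_SIZE> shct{};

// Cache hit: the inserting signature proved useful, so increment its counter.
void shct_on_hit(uint16_t signature_insert) {
    uint8_t& ctr = shct[signature_insert % SHCT_SIZE];
    if (ctr < SHCT_MAX) ++ctr;
}

// Eviction of a line that was never re-referenced (reuse bit = 0): decrement.
void shct_on_evict(uint16_t signature_insert, bool reuse) {
    uint8_t& ctr = shct[signature_insert % SHCT_SIZE];
    if (!reuse && ctr > 0) --ctr;
}
```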

Predicting Signature Re-Reference Behavior
- Learn signature re-reference behavior in the SHCT (16K entries of 3-bit counters).
- Predicting with the SHCT:
  - SHCTR == 0: predict the line will NOT be re-referenced; install at state 3.
  - SHCTR != 0: predict the signature will be re-referenced; install at state 2.
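The corresponding insertion decision, sketched with an assumed helper name:

```cpp
// Illustrative baseline SHiP insertion decision from this slide: a zero counter
// predicts "no re-reference" (install distant), otherwise install at state 2.
#include <cstdint>

uint8_t ship_insert_rrpv(uint8_t shctr) {
    return (shctr == 0) ? 3   // predicted scan: "distant", evicted first
                        : 2;  // predicted re-use: "far", given a chance to hit
}
```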

SHiP Improvements
Three improvements under no prefetching:
- High-Confidence Install
- Balanced SHCT Training
- Writeback-Aware Install
Two improvements under prefetching:
- Prefetch-Aware Training
- Prefetch-Aware State Update

Improvement 1: High-Confidence Installs
- Previous: SHiP always installs at state 2 or 3.
- Observation: RRIP requires a re-use before promoting a line to state 0, but some workloads benefit from keeping re-used lines longer.
- Solution: leverage the SHCT to confidently install at state 0; install at state 0 when the SHCTR is saturated at 7.
- Takeaway: leverage the SHCT to improve the confidence of the install (see the sketch below).
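A hedged sketch of Improvement 1, extending the hypothetical insertion helper from the previous sketch:

```cpp
// Illustrative SHiP++ insertion with high-confidence installs (Improvement 1):
// a saturated counter (7) installs at state 0 so the line is kept longest.
#include <cstdint>

uint8_t shippp_insert_rrpv(uint8_t shctr) {
    if (shctr == 7) return 0;   // high-confidence re-use: "immediate"
    if (shctr == 0) return 3;   // predicted scan: "distant"
    return 2;                   // default re-use prediction: "far"
}
```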

Improvement 1: High-Confidence Installs
[Diagram: insertion points on the RRIP state diagram. SHCTR == 7 (high-confidence re-use) → state 0 (immediate); 0 < SHCTR < 7 (re-use) → state 2 (far); SHCTR == 0 (scans) → state 3 (distant). Insight: keep high-confidence lines longer.]
[Jaleel et al., ISCA'10]

Improvement 2: Balanced SHCT Training
- Previous: the SHCT learns on all hits and evictions.
- Observation: a small number of high-access-frequency lines saturate the counters (e.g., in mcf and sphinx).
- Solution: learn only from first hits and from evictions (see the sketch below).
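A sketch of the balanced training rule, again with assumed names and a byte-per-counter SHCT:

```cpp
// Illustrative balanced SHCT training (Improvement 2): only the first hit after
// insertion (reuse bit still 0) increments the counter, so a handful of very
// hot lines cannot saturate the table on their own.
#include <array>
#include <cstdint>

std::array<uint8_t, 16384> shct{};   // 16K 3-bit counters, one per byte here

void shct_train_on_hit(uint16_t signature_insert, bool line_reuse_bit) {
    if (!line_reuse_bit) {                       // first re-reference only
        uint8_t& ctr = shct[signature_insert % shct.size()];
        if (ctr < 7) ++ctr;
    }
    // the caller sets the line's reuse bit after this first hit
}
```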

Improvement 2: Balanced SHCT Training (updated learning rules)
- Cache hit with re-use = 0 (i.e., the first hit after insertion) → SHCT[signature_insert]++
- Evict with re-use = 0 → SHCT[signature_insert]--

Improvement 3: Writeback-Aware Installs
- Previous: no differentiation for writebacks.
- Observation: writebacks are not on the critical path and signal the end of a context, so they could be bypassed.
- Solution: install writebacks at state 3 (bypass would be preferable, but the simulation model requires writebacks to be installed); see the sketch below.
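A sketch of Improvement 3 layered on the earlier hypothetical insertion helper:

```cpp
// Illustrative writeback-aware install (Improvement 3): writebacks go to
// state 3; everything else follows the high-confidence rule from Improvement 1.
#include <cstdint>

uint8_t shippp_insert_rrpv_wb(uint8_t shctr, bool is_writeback) {
    if (is_writeback) return 3;  // off the critical path: treat like a scan
    if (shctr == 7)   return 0;  // high-confidence re-use: "immediate"
    if (shctr == 0)   return 3;  // predicted scan: "distant"
    return 2;                    // default: "far"
}
```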

Improvement 3: Writeback-Aware Installs
[Diagram: insertion points. High-confidence (SHCTR == 7) → state 0 (immediate); re-use (0 < SHCTR < 7) → state 2 (far); scans and writebacks ((SHCTR == 0) || is_wb) → state 3 (distant). Insight: keep high-confidence lines longer.]
[Jaleel et al., ISCA'10]

Results (under no prefetching)
- High-confidence installs help calculix, Gems, and zeusmp.
- Better SHCT training (first hit + eviction) helps mcf and sphinx.
- SHiP++ achieves a 6.2% speedup over LRU (SHiP achieves 3.9%).
[Chart: per-benchmark speedups omitted.]

Improvement 4: Prefetch-Aware Training
- Previous: no differentiation for prefetches.
- Observation: demand accesses may see re-use, while prefetched lines may not.
- Solution: learn separately in the two halves of the SHCT, using Signature = (PC << 1) + is_pf (see the sketch below).
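A sketch of the prefetch-aware signature from this slide (the masking to 14 index bits is my assumption, matching the 16K-entry table from the earlier SHCT slide):

```cpp
// Illustrative prefetch-aware signature (Improvement 4): folding is_pf into the
// low bit splits the SHCT into demand and prefetch halves, so inaccurate
// prefetches from a PC do not pollute that PC's demand statistics.
#include <cstdint>

constexpr unsigned SHCT_INDEX_BITS = 14;

uint16_t make_signature(uint64_t pc, bool is_prefetch) {
    uint64_t sig = (pc << 1) | (is_prefetch ? 1u : 0u);
    return static_cast<uint16_t>(sig & ((1u << SHCT_INDEX_BITS) - 1));
}
```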

Improvement 4: Prefetch-Aware Training
[Diagram: the SHCT split into a prefetch half and a demand half. Cache hit with re-use = 0 → SHCT[(signature << 1) | is_pf]++; evict with re-use = 0 → SHCT[(signature << 1) | is_pf]--.]

Improvement 4: Prefetch-Aware Training
[Diagram: predicting with the SHCT. Re-use is predicted separately for prefetched and demand lines, using the same (signature << 1) | is_pf index for training and prediction.]

Improvement 5: Prefetch-Aware State Update
- Previous: no differentiation for prefetches.
- Observation: prefetched lines stay in the cache for a long time; the first access to a prefetched line is a demand access, so baseline SHiP promotes accurate prefetches and keeps them past their usefulness.
- Solution: ignore the state update on the first access to a prefetched line; update on subsequent accesses (see the sketch below).
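A sketch of Improvement 5's hit handling (the struct and function names are mine):

```cpp
// Illustrative prefetch-aware state update (Improvement 5): the first demand
// access to a prefetched line clears is_pf but does not promote the line;
// later re-references promote to "immediate" as usual.
#include <cstdint>

struct LineState {
    uint8_t rrpv  = 3;      // RRIP state (3 = distant)
    bool    is_pf = false;  // line was installed by a prefetch
};

void on_cache_hit(LineState& line) {
    if (line.is_pf) {
        line.is_pf = false; // first access after the prefetch: skip the update
        return;
    }
    line.rrpv = 0;          // subsequent accesses promote the line
}
```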

Improvement 5: Prefetch-Aware State Update
[Diagram: same insertion points as before (high-confidence → state 0, re-use → state 2, scans and writebacks → state 3). On the first access to a prefetched line: clear is_pf, no state update; promotion on re-reference only when !is_pf. Insight: keep high-confidence lines longer.]
[Jaleel et al., ISCA'10]

Results (under prefetching)
- Sphinx and mcf: SHiP++ learns that their prefetches are not accurate and installs them at low priority.
- SHiP++ achieves a 4.6% speedup over LRU (SHiP achieves 2.3%).
[Chart: per-benchmark speedups omitted.]

Summary
SHiP++ improves the PC-based classifier that separates re-use PCs from no-re-use PCs:
- High-Confidence Install
- Balanced SHCT Training
- Writeback-Aware Install
- Prefetch-Aware Training
- Prefetch-Aware State Update
6.2% speedup in the base configuration and 4.6% speedup in the prefetch configuration.

THANK YOU