Improving Cache Performance by Exploiting Read-Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez

Summary
Read misses are more critical than write misses
– Read misses can stall the processor; writes are not on the critical path
Problem: Cache management does not exploit read-write disparity
Goal: Design a cache that favors reads over writes to improve performance
– Lines that are only written to are less critical
– Prioritize lines that service read requests
Key observation: Applications differ in their read reuse behavior in clean and dirty lines
Idea: Read-Write Partitioning
– Dynamically partition the cache between clean and dirty lines
– Protect the partition that has more read hits
Improves performance over three recent mechanisms

Outline
Motivation
Reuse Behavior of Dirty Lines
Read-Write Partitioning
Results
Conclusion

Motivation
[Timeline figure: Rd A misses and STALLs the processor; Wr B is buffered and written back off the critical path; Rd C proceeds]
Cache management does not exploit the disparity between read and write requests
Read and write misses are not equally critical
– Read misses are more critical than write misses
– Read misses can stall the processor
– Writes are not on the critical path

Key Idea
Example access stream: Rd A, Wr B, Rd B, Wr C, Rd D
Favor reads over writes in the cache
– Differentiate between lines that are read and lines that are only written to
– The cache should protect lines that serve read requests
– Lines that are only written to are less critical
Improve performance by maximizing read hits
[Figure: lines A, B, C, D classified as read-only, read-and-written, or write-only]

An Example
Access pattern, repeated each iteration: Rd A, Wr B, Rd B, Wr C, Rd D, with a cache that holds three lines
[Figure: cache contents over time under each policy]
LRU replacement policy: Rd A miss (STALL), Wr B miss, Rd B hit, Wr C miss, Rd D miss (STALL), i.e. 2 stalls per iteration
Read-biased replacement policy, which replaces the write-only line C: Rd A hit, Wr B hit, Rd B hit, Wr C miss, Rd D miss (STALL), i.e. 1 stall per iteration, saving cycles
Evicting lines that are only written to can improve performance
Dirty lines are treated differently depending on read requests
(A minimal simulation of this example follows.)
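
The sketch below replays the example (Python; not from the slides, and the read-biased rule is a simplification: on a miss, evict a never-read line if one exists, otherwise the LRU line). It reproduces the 2-stall vs. 1-stall counts above.

```python
def simulate(policy, iterations=10, capacity=3):
    cache = []   # entries are [tag, was_read]; index 0 = LRU, index -1 = MRU
    read_misses = 0
    pattern = [("R", "A"), ("W", "B"), ("R", "B"), ("W", "C"), ("R", "D")]
    for it in range(iterations):
        for op, tag in pattern:
            line = next((l for l in cache if l[0] == tag), None)
            if line is not None:                  # hit: refresh recency, note reads
                cache.remove(line)
                line[1] = line[1] or (op == "R")
                cache.append(line)
                continue
            if op == "R" and it == iterations - 1:
                read_misses += 1                  # count read stalls once warmed up
            if len(cache) == capacity:
                if policy == "read-biased":       # prefer a never-read (write-only) victim
                    victim = next((l for l in cache if not l[1]), cache[0])
                else:                             # plain LRU victim
                    victim = cache[0]
                cache.remove(victim)
            cache.append([tag, op == "R"])
    return read_misses

for policy in ("lru", "read-biased"):
    print(policy, "-> read stalls in one steady-state iteration:", simulate(policy))
# Prints 2 for LRU and 1 for the read-biased policy, matching the slide.
```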

Outline
Motivation
Reuse Behavior of Dirty Lines
Read-Write Partitioning
Results
Conclusion

Reuse Behavior of Dirty Lines
Not all dirty lines are the same
Write-only lines
– Do not receive read requests; can be evicted
Read-write lines
– Receive read requests; should be kept in the cache
Evicting write-only lines provides more space for read lines and can improve performance

Reuse Behavior of Dirty Lines
On average, 37.4% of lines are write-only and 9.4% of lines are both read and written
Applications have different read reuse behavior in dirty lines

Outline
Motivation
Reuse Behavior of Dirty Lines
Read-Write Partitioning
Results
Conclusion

Read-Write Partitioning
Goal: Exploit the different read reuse behavior in dirty lines to maximize the number of read hits
Observation:
– Some applications have more reads to clean lines
– Other applications have more reads to dirty lines
Read-Write Partitioning:
– Dynamically partitions the cache into clean and dirty lines
– Evicts lines from the partition that has less read reuse
Improves performance by protecting lines with more read reuse

Read-Write Partitioning
Applications have significantly different read reuse behavior in clean and dirty lines

Read-Write Partitioning
Utilize the disparity in read reuse between clean and dirty lines
Partition the cache into clean and dirty lines
Predict the partition size that maximizes read hits
Maintain the partition through replacement (sketched below)
– DIP [Qureshi et al. 2007] selects the victim within the partition
[Figure: cache sets divided into clean and dirty lines; with a predicted best partition size of 3, the victim is replaced from the dirty partition]
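
The slides do not give the replacement rule in code, but one plausible reading of "replace from the dirty partition" is sketched below (Python; the Line class and the fallback behavior are assumptions): if a set holds more dirty lines than the predicted best partition size, the victim is the LRU dirty line, otherwise it is taken from the clean side.

```python
from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    dirty: bool

def select_victim(lines, target_dirty):
    """lines is one cache set in LRU order (index 0 = least recently used)."""
    dirty = [l for l in lines if l.dirty]
    if len(dirty) > target_dirty:
        return dirty[0]          # dirty partition over its target: evict LRU dirty line
    clean = [l for l in lines if not l.dirty]
    return clean[0] if clean else lines[0]   # otherwise take from the clean partition

# Example: a 4-way set with a predicted dirty partition of 1 way
s = [Line(0xA, True), Line(0xB, False), Line(0xC, True), Line(0xD, False)]
print(hex(select_victim(s, target_dirty=1).tag))   # -> 0xa, the LRU dirty line
```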

Predicting Partition Size
Predicts partition size using sampled shadow tags
– Based on utility-based partitioning [Qureshi et al. 2006]
Counts the number of read hits in clean and dirty lines
Picks the partition (x, associativity – x) that maximizes the number of read hits (a sketch of this selection follows)
[Figure: sampled sets maintain separate clean and dirty shadow tags with per-recency-position read-hit counters (MRU, MRU-1, ..., LRU+1, LRU); the split with the maximum number of read hits is chosen]
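
As an illustration of the selection step, the sketch below follows the utility-based scheme the slide cites: given read-hit counts per recency position from the clean and dirty shadow tags (the counter values here are made up), it picks the split (x, associativity – x) with the most read hits.

```python
ASSOC = 16
# Hypothetical sampled counters: read hits by stack position (index 0 = MRU).
dirty_reads = [90, 60, 30, 10] + [0] * 12
clean_reads = [80, 70, 50, 40, 20, 5] + [0] * 10

def best_dirty_ways(dirty_reads, clean_reads, assoc=ASSOC):
    def read_hits(x):
        # read hits if x ways hold dirty lines and assoc - x hold clean lines
        return sum(dirty_reads[:x]) + sum(clean_reads[:assoc - x])
    return max(range(assoc + 1), key=read_hits)

x = best_dirty_ways(dirty_reads, clean_reads)
print(f"predicted best partition: {x} dirty ways, {ASSOC - x} clean ways")
```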

Outline
Motivation
Reuse Behavior of Dirty Lines
Read-Write Partitioning
Results
Conclusion

Methodology
CMP$im x86 cycle-accurate simulator [Jaleel et al. 2008]
4MB 16-way set-associative LLC
32KB I+D L1, 256KB L2
200-cycle DRAM access time
550M representative instructions
Benchmarks:
– 10 memory-intensive SPEC benchmarks
– 35 multi-programmed applications
(The configuration is restated as code below.)
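
For reference, the evaluated configuration can be written down as plain data (a transcription of this slide's parameters; the field names are mine):

```python
CONFIG = {
    "simulator": "CMP$im, x86, cycle-accurate",  # [Jaleel et al. 2008]
    "llc": {"size_mb": 4, "ways": 16},           # 4MB 16-way set-associative LLC
    "l1_kb": 32,                                 # I+D L1
    "l2_kb": 256,
    "dram_access_cycles": 200,
    "instructions": 550_000_000,                 # representative slice
    "benchmarks": {"single_core": 10, "multi_programmed": 35},
}
```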

Comparison Points
DIP, RRIP: insertion policies [Qureshi et al. 2007, Jaleel et al. 2010]
– Avoid thrashing and cache pollution by dynamically inserting lines at different stack positions
– Low overhead
– Do not differentiate between read and write accesses
SUP+: Single-Use Reference Predictor [Piquet et al. 2007]
– Avoids cache pollution by bypassing lines that do not receive re-references
– High accuracy
– Does not differentiate between read and write accesses, so it does not bypass write-only lines
– High storage overhead; needs the PC in the LLC

Comparison Points: Read Reference Predictor (RRP)
A new predictor inspired by prior work [Tyson et al. 1995, Piquet et al. 2007]
Identifies read and write-only lines by their allocating PC
– Bypasses write-only lines
Writebacks are not associated with any PC
– So RRP associates the allocating PC with the line in the L1 and passes the PC down through the L2 and LLC
High storage overhead
[Timeline figure: PC P allocates A with Rd A; when A is written back (Wb A) without another read, P is marked as a PC that allocates lines that are never read again. A writeback from PC Q (Wb A) has no allocating PC of its own, so the allocating PC carried from the L1 is used]
(A rough sketch of this predictor follows.)
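
The sketch below (Python; names and structure are illustrative, not the paper's hardware) captures the training loop described above: a line carries its allocating PC down from the L1, a PC whose lines are evicted unread gets marked, and later allocations from marked PCs bypass the cache.

```python
never_read_pcs = set()       # PCs whose allocated lines left the cache unread

class AllocatedLine:
    def __init__(self, pc):
        self.pc = pc         # allocating PC, carried down from the L1
        self.was_read = False

def on_allocate(pc):
    """Allocate into the cache, or return None to bypass a predicted write-only line."""
    return None if pc in never_read_pcs else AllocatedLine(pc)

def on_read(line):
    line.was_read = True
    never_read_pcs.discard(line.pc)   # retrain: this PC's lines do get read

def on_evict(line):
    if not line.was_read:
        never_read_pcs.add(line.pc)   # evicted without a read: mark the PC

# Example: PC 0x400A allocates a line that is evicted unread...
line = on_allocate(0x400A)
on_evict(line)
print(on_allocate(0x400A))            # -> None: later lines from 0x400A bypass
```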

Single Core Performance
[Chart: speedup over the baseline for the compared mechanisms and RWP]
Differentiating read vs. write-only lines improves performance over recent mechanisms
RWP performs within 3.4% of RRP, but requires 18X less storage overhead (48.4KB for RRP)

4 Core Performance
[Chart: weighted speedup over the baseline across workload mixes]
Differentiating read vs. write-only lines improves performance over recent mechanisms
More benefit (up to +8%) when more applications are memory intensive

Average Memory Traffic
Increases writeback traffic by 2.5%, but reduces overall memory traffic by 16%

Dirty Partition Sizes
Partition size varies significantly for some benchmarks

Dirty Partition Sizes
Partition size varies significantly at runtime for some benchmarks

Outline
Motivation
Reuse Behavior of Dirty Lines
Read-Write Partitioning
Results
Conclusion

Conclusion
Problem: Cache management does not exploit read-write disparity
Goal: Design a cache that favors read requests over write requests to improve performance
– Lines that are only written to are less critical
– Protect lines that serve read requests
Key observation: Applications differ in their read reuse behavior in clean and dirty lines
Idea: Read-Write Partitioning
– Dynamically partition the cache into clean and dirty lines
– Protect the partition that has more read hits
Results: Improves performance over three recent mechanisms

Thank you

Extra Slides

Reuse Behavior of Dirty Lines in LLC
Dirty lines show different read reuse behavior:
Read intensive / non-write intensive
– Most accesses are reads, only a few writes
– Example: 483.xalancbmk
Write intensive
– Generates a huge amount of intermediate data
– Example: 456.hmmer
Read-write intensive
– Iteratively reads and writes a huge amount of data
– Example: 450.soplex

Read Intensive / Non-Write Intensive
Most accesses are reads, only a few writes
483.xalancbmk: Extensible Stylesheet Language Transformations (XSLT) processor
[Figure: the XSLT processor takes XML input and XSLT code and produces HTML/XML/TXT output]
92% of accesses are reads; 99% of write accesses are stack operations

Non-Write Intensive
[Figure: logical view of the stack in elementAt; the prologue (push %ebp; mov %esp, %ebp; push %esi) and epilogue (pop %esi; pop %ebp; ret) produce write-only and read-and-written stack lines that are later written back]
…% of lines in the LLC are write-only
These dirty lines should be evicted

Write Intensive
Generates a huge amount of intermediate data
456.hmmer: searches a protein sequence database
[Figure: a Hidden Markov Model of multiple sequence alignments is matched against the database using the Viterbi algorithm (dynamic programming) to produce the best-ranked matches]
Only 0.4% of writes are from stack operations

Write Intensive
[Figure: writebacks from a 2D table; most lines are write-only, a few are read and written]
92% of lines in the LLC are write-only
These lines can be evicted

Read-Write Intensive
Iteratively reads and writes a huge amount of data
450.soplex: linear programming solver
– Input: inequalities and an objective function; output: the minimum/maximum of the objective function
– Simplex algorithm, iterating over a matrix
– Different operations over the entire matrix at each iteration

Read-Write Intensive
[Figure: matrix with a pivot row and pivot column; lines are read and written back as the algorithm iterates]
19% of lines in the LLC are write-only, 10% of lines are read and written
Read-and-written lines should be protected

Read Reuse of Clean and Dirty Lines in LLC
On average, 37.4% of blocks are dirty non-read and 42% of blocks are clean non-read

Single Core Performance
5% speedup over the whole SPEC CPU2006 benchmark suite

Speedup
On average, 14.6% speedup over the baseline
5% speedup over the whole SPEC CPU2006 benchmark suite

Writeback Traffic to Memory
Increases traffic by 17% over the whole SPEC CPU2006 benchmark suite

4-Core Performance

Change of IPC with Static Partitioning