The Dirty-Block Index
Vivek Seshadri ∙ Abhishek Bhowmick ∙ Onur Mutlu ∙ Phillip B. Gibbons ∙ Michael A. Kozuch ∙ Todd C. Mowry

Presentation transcript:

Summary
Problem: The dirty bit organization in caches does not match the queries made on it, causing inefficiency and performance loss.
The Dirty-Block Index (DBI):
– Removes dirty bits from the cache tag store
– Organizes dirty bits by DRAM row
– Efficiently responds to the queries "Get all dirty blocks of a DRAM row" and "Is block B dirty?"
Enables efficient implementation of many optimizations: DRAM-aware writeback, bypassing cache lookups, reducing ECC cost, and more.
Improves performance while reducing overall cache area:
– 28% performance over baseline, 6% over state-of-the-art (8-core)
– 8% cache area reduction

Information: Organization and Query
Example queries: "Get all the files belonging to males with first name starting with 'Q'"; "Get all files between 2013 and ...".
A mismatch between how information is organized and how it is queried leads to inefficiency.

Mismatch between Organization and Query
Books sorted by title (A, B, C, ..., Z). Query: "Get all the books written by author X." Sorting by title is a bad organization for this query.

Metadata: Information About a Cache Block
– V: valid bit
– D: dirty bit (writeback cache)
– Sh: sharing status (multi-cores)
– ECC: error correction (reliability)
– Repl: replacement policy (set-associative cache)

Block-Oriented Metadata Organization
All metadata for a block is stored together in one entry:
V | Block Address | D | Sh | Repl | ECC

Block-Oriented Metadata Organization
Tag entry in the cache tag store: V | Block Address | D | Sh | Repl | ECC
Pros: simple to implement; scalable.
Con: any metadata query requires an expensive tag store lookup.
Is this the best organization?


Focus of This Work
Is putting the dirty bit (D) in the tag entry the best approach? The dirty bit is queried by many operations and optimizations.

Outline
Introduction
Shortcomings of Block-Oriented Organization
The Dirty-Block Index (DBI)
Optimizations Enabled by DBI
Evaluation
Conclusion

DRAM-Aware Writeback
1. The memory controller buffers writes and flushes them to DRAM in a burst.
2. Row buffer hits are faster and more efficient than row misses.
(Virtual Write Queue [ISCA 2010], DRAM-Aware Writeback [TR-HPS])

DRAM-Aware Writeback
When writing back a dirty block, proactively write back all other dirty blocks from the same DRAM row, i.e., ask the last-level cache: "Get all dirty blocks of DRAM row R." This significantly increases the DRAM write row hit rate.
(Virtual Write Queue [ISCA 2010], DRAM-Aware Writeback [TR-HPS])

Shortcoming of Block-Oriented Organization
Query: "Get all dirty blocks of DRAM row R."

Shortcoming of Block-Oriented Organization
A DRAM row is a set of blocks co-located in DRAM: an ~8KB row contains 128 cache blocks. To answer "Get all dirty blocks of DRAM row R," the tag store must be asked: "Is block 1 of row R dirty? Is block 2 of row R dirty? Is block 3 of row R dirty? ... Is block 128 of row R dirty?"

Shortcoming of Block-Oriented Organization
Answering "Get all dirty blocks of DRAM row R" requires many expensive (and possibly unnecessary) tag lookups and significantly increases tag store contention. This is inefficient.

Many Cache Optimizations/Operations
Many optimizations and operations need metadata for dirty blocks:
– DRAM-aware writeback
– Bulk DMA
– Bypassing cache lookup
– Load balancing memory accesses
– Cache flushing
– DRAM write scheduling

Queries for the Dirty Bit Information
These optimizations issue two kinds of queries: "Get all dirty blocks that belong to a coarse-grained region" and "Is block B dirty?" The block-based dirty bit organization is inefficient for both queries.

Outline
Introduction
Shortcomings of Block-Oriented Organization
The Dirty-Block Index (DBI)
Optimizations Enabled by DBI
Evaluation
Conclusion

The Dirty-Block Index (DBI)
Remove the dirty bit (D) from the tag entry, leaving V | Block Address | Sh | Repl | ECC in the cache tag store, and track dirty bits in a separate structure, the DBI: a DRAM row-oriented organization of dirty bits.

DBI Entry
Each DBI entry contains: a DBI entry valid bit (V), a DRAM row address, and a dirty bit vector with one bit per block of the row.

DBI Semantics
A block in the cache is dirty if and only if:
1. The DBI has a valid entry for the DRAM row that contains the block, and
2. The dirty bit for the block in the bit vector of the corresponding DBI entry is set.
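These semantics can be sketched in a few lines. This is an illustrative software model, not the paper's hardware design; the class and method names are assumptions for exposition.

```python
class DBI:
    """Tracks dirty bits per DRAM row instead of per cache tag entry."""

    def __init__(self, blocks_per_row=128):
        self.blocks_per_row = blocks_per_row
        # DRAM row address -> dirty bit vector (int bitmask).
        # A row absent from this dict models an invalid DBI entry.
        self.entries = {}

    def mark_dirty(self, row, offset):
        # Allocate an entry on demand (models setting the entry valid bit).
        self.entries[row] = self.entries.get(row, 0) | (1 << offset)

    def mark_clean(self, row, offset):
        if row in self.entries:
            self.entries[row] &= ~(1 << offset)
            if self.entries[row] == 0:
                del self.entries[row]  # no dirty bits left: invalidate entry

    def is_dirty(self, row, offset):
        # Dirty iff (1) a valid entry exists for the block's DRAM row and
        # (2) the block's bit in that entry's dirty bit vector is set.
        return bool(self.entries.get(row, 0) & (1 << offset))

    def dirty_blocks_of_row(self, row):
        # A single DBI lookup answers "get all dirty blocks of DRAM row R".
        bits = self.entries.get(row, 0)
        return [i for i in range(self.blocks_per_row) if bits & (1 << i)]
```

Note that presence in the cache and dirtiness are decoupled: a block can be cached yet have no set bit in the DBI, in which case it is clean by definition.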

DBI Semantics by Example
If a block's dirty bit is not set in its row's DBI entry (or the row has no valid DBI entry), then even if the block is present in the cache, it is not dirty.

Benefits of DBI
"Get all dirty blocks of DRAM row R": a single lookup to row R in the DBI, compared to 128 lookups with the existing organization.
"Is block B dirty?": the DBI is faster to look up than the tag store.

Outline
Introduction
Shortcomings of Block-Oriented Organization
The Dirty-Block Index (DBI)
Optimizations Enabled by DBI
Evaluation
Conclusion

Optimization 1: DRAM-Aware Writeback
When a dirty block is written back, a single DBI lookup for its DRAM row yields the bit vector of all other dirty blocks in that row; the cache is looked up only for those blocks before proactively writing them back. DBI achieves the benefit of DRAM-aware writeback without increasing contention for the tag store.
(Virtual Write Queue [ISCA 2010], DRAM-Aware Writeback [TR-HPS])
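The lookup above can be sketched as a standalone function. This is an illustrative model under assumed structures (a dict of row bitmasks), not the paper's implementation.

```python
def dram_aware_writeback(dbi_entries, row, blocks_per_row=128):
    """On a writeback to DRAM row `row`, return the offsets of all dirty
    blocks in that row, found with one DBI lookup. Only these blocks need
    a cache tag store lookup before being written back in a burst."""
    bits = dbi_entries.get(row, 0)  # single DBI lookup for the whole row
    return [i for i in range(blocks_per_row) if bits & (1 << i)]

# Example: row 42 has dirty blocks at offsets 0, 3, and 5 (bit vector 0b101001).
entries = {42: 0b101001}
assert dram_aware_writeback(entries, 42) == [0, 3, 5]
# A row with no valid DBI entry has no dirty blocks to flush.
assert dram_aware_writeback(entries, 7) == []
```

Contrast this with the block-oriented organization, where the same query costs up to 128 separate tag store lookups.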

Optimization 2: Bypassing Cache Lookups
If an access is likely to miss, we can bypass the tag lookup. On a read, a miss predictor decides: if a hit is likely, look up the cache tag store as usual; if a miss is likely, check the DBI first. If the DBI says the block is dirty, it must be in the cache, so look it up; otherwise, forward the request to the next level. This reduces access latency and energy, and reduces tag store contention.
Without a DBI, predictors must either guarantee no false negatives or require a write-through cache, which is not desirable (Mostly-No Monitors [HPCA 2003], SkipCache [PACT 2012]). DBI seamlessly enables simpler and more aggressive miss predictors.
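The bypass decision can be sketched as follows. This is an illustrative model; the predictor interface and function names are assumptions, not the paper's design.

```python
def should_bypass_tag_lookup(predict_miss, dbi_is_dirty, row, offset):
    """Decide whether a read can skip the cache tag store lookup.
    `predict_miss` is any (possibly imprecise) miss predictor;
    `dbi_is_dirty` is the exact dirty check provided by the DBI."""
    if not predict_miss(row, offset):
        return False  # likely hit: do the normal tag store lookup
    if dbi_is_dirty(row, offset):
        return False  # dirty implies cached: must look up the cache
    return True       # likely miss and clean: forward to the next level

# Example with a deliberately aggressive predictor that always says "miss".
dirty = {(42, 3)}
always_miss = lambda r, o: True
in_dbi = lambda r, o: (r, o) in dirty

assert should_bypass_tag_lookup(always_miss, in_dbi, 42, 3) is False
assert should_bypass_tag_lookup(always_miss, in_dbi, 42, 4) is True
```

The key point the code makes concrete: the DBI check catches every dirty block, so a mispredicted bypass of a clean block costs only an extra next-level access, never correctness.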

Optimization 3: Reducing ECC Overhead
A dirty block requires error correction (ECC); a clean block requires only error detection (EDC). Prior works keep the ECC for dirty blocks in some other structure, which requires a complex mechanism to identify the location of the ECC.
(ECC-Cache [IAS 2009], Memory-Mapped ECC [ISCA 2009], ECC-FIFO [SC 2009])

Optimization 3: Reducing ECC Overhead
With DBI, the cache stores only EDC per block, and ECC is kept only for the blocks tracked by the DBI. Because the DBI tracks far fewer blocks than the cache, this enables a simpler mechanism to reduce ECC cost: an 8% reduction in overall cache area.
(ECC-Cache [IAS 2009], Memory-Mapped ECC [ISCA 2009], ECC-FIFO [SC 2009])
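A back-of-the-envelope calculation shows why this saves area. The EDC and ECC bit widths below are assumed values for illustration only, not the paper's exact numbers.

```python
EDC_BITS = 8   # error detection code per block (assumed width)
ECC_BITS = 64  # error correction code per block (assumed width)

def conventional_ecc_area(num_blocks):
    # Every block carries full ECC, whether dirty or clean.
    return num_blocks * ECC_BITS

def dbi_ecc_area(num_blocks, alpha):
    # Every block carries only EDC; full ECC is kept only for the
    # alpha fraction of blocks the DBI tracks (the only candidates
    # for being dirty).
    return num_blocks * EDC_BITS + int(num_blocks * alpha) * ECC_BITS

blocks = 16384  # e.g., a 1MB cache of 64B blocks
baseline = conventional_ecc_area(blocks)
with_dbi = dbi_ecc_area(blocks, alpha=0.25)
assert with_dbi < baseline  # DBI-based scheme needs less metadata storage
```

Under these assumed widths the DBI scheme stores 393,216 bits of reliability metadata versus 1,048,576 conventionally; the exact saving depends on the chosen codes and α.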

DBI – Other Optimizations (discussed in paper)
– Load balancing memory accesses in hybrid memory
– Better DRAM write scheduling
– Fast cache flushing
– Bulk DMA coherence
– ...

Outline
Introduction
Shortcomings of Block-Oriented Organization
The Dirty-Block Index (DBI)
Optimizations Enabled by DBI
Evaluation
Conclusion

Evaluation Methodology
Processor: 2.67 GHz, single issue, out-of-order, 128-entry instruction window
Cache hierarchy: 32KB private L1, 256KB private L2, 2MB/core shared L3
DDR DRAM: 1 channel, 1 rank, 8 banks, 8KB row buffer, FR-FCFS, open-row policy
Workloads: SPEC CPU2006, STREAM
Multi-core: multiple workload mixes; multiple metrics for performance and fairness

Mechanisms Compared
– Dynamic Insertion Policy (baseline) (ISCA 2007, PACT 2008)
– DRAM-Aware Writeback (DAWB) (TR-HPS, UT Austin)
– Virtual Write Queue (ISCA 2010)
– Skip Cache (PACT 2012); these prior mechanisms are difficult to combine with one another
– Dirty-Block Index: with no optimization, with aggressive writeback, with cache lookup bypass, and with both optimizations (DBI+Both)

Effect on Writes and Tag Lookups
DBI achieves almost all the benefits of DAWB with significantly lower tag store contention.

System Performance
Performance improvement over baseline / over the best previous mechanism: 13%/0%, 23%/4%, 35%/6%, 28%/6% across the evaluated configurations. Reduced tag store contention due to DBI translates to significant performance improvement.

Other Results in Paper
– Detailed cache area analysis (with and without ECC)
– DBI power consumption analysis
– Effect of individual optimizations
– Other multi-core performance/fairness metrics
– Sensitivity to DBI parameters
– Sensitivity to cache size/replacement policy

Conclusion
The Dirty-Block Index key idea: a DRAM row-oriented organization of dirty bits.
Enables efficient implementation of several optimizations: DRAM-aware writeback, cache lookup bypass, reducing ECC cost.
– 28% performance over baseline, 6% over the best previous work
– 8% reduction in overall cache area
Wider applicability:
– Can be applied to other caches
– Can be applied to other metadata (e.g., coherence)


Backup Slides

Cache Coherence
MOESI states: M (exclusive modified), O (shared modified), E (exclusive unmodified), S (shared unmodified), I (invalid). The dirty bit (D) corresponds to the modified states (M and O).

Operation of a Cache with DBI
1. Read access: look up the tag store.
2. Writeback: update the tag store; update the DBI to indicate the block is dirty.
3. Cache eviction: check the DBI; write back the block if it is dirty.
4. DBI eviction: write back all blocks marked dirty by the evicted entry.
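The four operations above can be sketched as follows. This is an illustrative model with simplified structures (a set for the tag store, a dict of row bitmasks for the DBI); read accesses touch only the tag store, so they need no DBI code.

```python
cache = set()  # block addresses (row, offset) present in the cache
dbi = {}       # DRAM row -> dirty bit vector; absent row = invalid entry

def on_writeback(row, offset):
    """2. Writeback: update tag store; mark the block dirty in the DBI."""
    cache.add((row, offset))
    dbi[row] = dbi.get(row, 0) | (1 << offset)

def on_cache_eviction(row, offset):
    """3. Cache eviction: check the DBI; write back only if dirty."""
    cache.discard((row, offset))
    if dbi.get(row, 0) & (1 << offset):
        dbi[row] &= ~(1 << offset)
        return "write back to DRAM"
    return "silently drop"

def on_dbi_eviction(row, blocks_per_row=128):
    """4. DBI eviction: write back every block the entry marks dirty."""
    bits = dbi.pop(row, 0)
    return [i for i in range(blocks_per_row) if bits & (1 << i)]
```

One design consequence visible here: evicting a DBI entry forces writebacks of all blocks it covers, which is why the DBI size and granularity parameters (next slide) matter.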

DBI Design Parameters
– DBI size (α): total number of blocks tracked by the DBI, represented as a fraction of the number of blocks in the cache.
– DBI granularity (g): number of blocks tracked by each DBI entry.

DBI Design Parameters – Example
1MB cache with 64B blocks; α = ¼, g = 64.
The cache tracks 16,384 blocks, so the DBI tracks 16,384 × ¼ = 4,096 blocks. With each entry tracking 64 blocks, the DBI has 4,096 / 64 = 64 entries.
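The example's arithmetic, checked directly:

```python
cache_bytes = 1 << 20  # 1MB cache
block_bytes = 64       # 64B blocks
alpha = 1 / 4          # DBI size: fraction of cache blocks tracked
g = 64                 # DBI granularity: blocks tracked per DBI entry

cache_blocks = cache_bytes // block_bytes  # blocks tracked by the cache
dbi_blocks = int(cache_blocks * alpha)     # blocks tracked by the DBI
dbi_entries = dbi_blocks // g              # number of DBI entries

assert cache_blocks == 16384
assert dbi_blocks == 4096  # DBI tracks 4096 blocks
assert dbi_entries == 64   # DBI has 64 entries
```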

Effect on Writes and Tag Lookups (additional results)

System Performance (additional results)