Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison.




Communication vs. Computation [Keckler, MICRO 2011]: moving data costs roughly 200X more energy than computing on it. Improving cache utilization is critical for energy efficiency!

Compressed Cache: Compress and Compact Blocks
+ Higher effective cache size
+ Small area overhead
+ Higher system performance
+ Lower system energy
Limits of previous work on compression effectiveness:
- Limited number of tags
- High internal fragmentation
- Energy-expensive re-compaction

Our two key ideas, decoupled super-blocks and non-contiguous sub-blocks, address the limits of previous work:
- Limited number of tags
- High internal fragmentation
- Energy-expensive re-compaction


Outline
 Motivation
 Compressed caching
 Our proposal: Decoupled Compressed Cache
 Experimental results
 Conclusions

Uncompressed Caching: a fixed one-to-one mapping between the tag array and the data array.

Compressed Caching: compress cache blocks, compact the compressed blocks to make room, and add more tags to increase effective capacity.

(1) Compression: how to compress blocks? Different compression algorithms exist; they are not the focus of this work, but the choice of algorithm matters. (Example: a compressor shrinks a 64-byte block to 20 bytes.)

Compression Potentials: a high compression ratio potentially yields a large normalized effective cache capacity. Compression Ratio = Original Size / Compressed Size. (Figure: cycles to decompress vs. compression ratio across compression algorithms.) We use C-PACK+Z for the rest of the talk!
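The compression-ratio arithmetic above can be sketched as follows; the 64-byte and 20-byte block sizes are the example values from the slides, not measured data.

```python
# Illustrative arithmetic for the compression-ratio definition above.

def compression_ratio(original_size, compressed_size):
    """Compression Ratio = Original Size / Compressed Size."""
    return original_size / compressed_size

# A 64-byte cache block compressed to 20 bytes:
ratio = compression_ratio(64, 20)
print(ratio)  # 3.2

# With ratio R, a compressed cache could in principle hold up to R times
# as many blocks, i.e. normalized effective capacity is bounded by R.
```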

(2) Compaction: how to store and find blocks? Compaction is critical to achieving the compression potential, and is the focus of this work. A Fixed-Size Compressed Cache (FixedC) [Kim, WMPI 2002; Yang, MICRO 2002] suffers from internal fragmentation.

(2) Compaction: how to store and find blocks? A Variable-Sized Compressed Cache (VSC) [Alameldeen, ISCA 2004] compacts each compressed block into a variable number of sub-blocks.

Previous Compressed Caches
(Limit 1) Limited tags/metadata: adding 4X or more tags incurs high area overhead.
(Limit 2) Internal fragmentation: e.g., a 10B compressed block occupying a 16B sub-block wastes space, lowering cache capacity utilization.
Potential: Normalized Effective Capacity = Number of Valid Blocks in the LLC / MAX Number of (Uncompressed) Blocks.

(Limit 3) Energy-expensive re-compaction: VSC must re-compact the data array whenever a block's compressed size changes (e.g., updating block B so that it now needs 2 sub-blocks), causing up to 3X higher LLC dynamic energy.

Outline
 Motivation
 Compressed caching
 Our proposal: Decoupled Compressed Cache
 Experimental results
 Conclusions

Decoupled Compressed Cache
(1) Exploiting spatial locality: low area overhead.
(2) Decoupling the tag/data mapping: eliminates energy-expensive re-compaction and reduces internal fragmentation.
(3) Co-DCC, dynamically co-compacting super-blocks: further reduces internal fragmentation.

(1) Exploiting Spatial Locality: neighboring blocks tend to co-reside in the LLC. (Figure: percentage of LLC blocks whose neighbors are also resident.)

(1) Exploiting Spatial Locality: DCC tracks LLC blocks at super-block granularity. Each super-tag holds one shared address tag plus per-block state, covering either a quad (Q: blocks A, B, C, D) or a singleton (S: block E). With only 2X tags, DCC tracks up to 4X blocks at low area overhead.
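A minimal sketch of a super-tag entry as described above; the field names and the "V"/"I" state encoding are my own illustration, not the paper's hardware layout.

```python
from dataclasses import dataclass, field

BLOCKS_PER_SUPERBLOCK = 4  # a quad covers 4 neighboring blocks

@dataclass
class SuperBlockTag:
    # One address tag shared by up to 4 neighboring blocks (a quad);
    # a singleton uses the same entry format for a single block.
    tag: int
    # Per-block state, one entry per block ("I" = invalid, "V" = valid).
    state: list = field(default_factory=lambda: ["I"] * BLOCKS_PER_SUPERBLOCK)

    def valid_blocks(self):
        return sum(s != "I" for s in self.state)

# One super-tag can track up to 4 blocks, so 2X tag entries can track
# up to 4X blocks, which is why the area overhead stays low.
quad = SuperBlockTag(tag=0x1A2B)
quad.state = ["V", "V", "I", "V"]   # blocks A, B, D present; C absent
print(quad.valid_blocks())  # 3
```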

(2) Decoupling the tag/data mapping: DCC decouples the mapping between tags and data so that sub-blocks can be allocated flexibly and non-contiguously. Updating a block (e.g., B) no longer forces re-compaction.

(2) Decoupling the tag/data mapping: back pointers identify the owner block of each sub-block. Each back pointer stores a tag ID (which super-tag entry) and a block ID (which block within that super-block).

(3) Co-compacting super-blocks: Co-DCC dynamically co-compacts the blocks of a super-block (e.g., quad Q: A, B, C, D) into shared sub-blocks, further reducing internal fragmentation.

Outline
 Motivation
 Compressed caching
 Our proposal: Decoupled Compressed Cache
 Experimental results
 Conclusions

Experimental Methodology
 Integrated DCC with the AMD Bulldozer cache design.
– We model the timing and allocation constraints of sequential regions at the LLC in detail.
– No alignment network is needed.
 Verilog implementation and synthesis of the tag-match and sub-block selection logic.
– One additional cycle of latency due to sub-block selection.

Experimental Methodology
 Full-system simulation with a GEMS-based simulator.
 Wide range of applications with different levels of cache sensitivity:
– Commercial workloads: apache, jbb, oltp, zeus
– SPEC-OMP: ammp, applu, equake, mgrid, wupwise
– PARSEC: blackscholes, canneal, freqmine
– SPEC 2006 mixes (m1-m8): bzip2; libquantum-bzip2; libquantum; gcc; astar-bwaves; cactus-mcf-milc-bwaves; gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip; omnetpp-lbm
 System configuration:
– Cores: eight OOO cores, 3.2 GHz
– L1I$/L1D$: private, 32 KB, 8-way
– L2$: private, 256 KB, 8-way
– L3$: shared, 8 MB, 16-way, 8 banks
– Main memory: 4 GB, 16 banks, 800 MHz bus frequency, DDR3

Effective LLC Capacity
Area overhead (normalized LLC area):
Components            FixedC/VSC-2X   DCC     Co-DCC
Tag array             6.3%            2.1%    11.3%
Back pointer array    0               4.4%    5.4%
(De-)Compressors      1.8%            1.8%    1.8%
Total area overhead   8.1%            8.3%    18.5%
(Figure: normalized effective LLC capacity of Baseline, 2X Baseline, FixedC, VSC, DCC, and Co-DCC.)

(Co-)DCC Performance: DCC and Co-DCC boost system performance significantly.

(Co-)DCC Energy Consumption: DCC and Co-DCC reduce system energy by reducing the number of accesses to main memory.

Summary
 Analyzed the limits of compressed caching: limited number of tags, internal fragmentation, and energy-expensive re-compaction.
 Decoupled Compressed Cache improves the performance and energy of compressed caching via decoupled super-blocks and non-contiguous sub-blocks.
 Co-DCC further reduces internal fragmentation.
 Practical designs (details in the paper).

Backup
 (De-)Compression overhead
 DCC data array organization with AMD Bulldozer
 DCC timing
 DCC lookup
 Applications
 Co-DCC design
 LLC effective capacity
 LLC miss rate
 Memory dynamic energy
 LLC dynamic energy

(De-)Compression Overhead
Parameters               Compressor   Decompressor
Pipeline depth           6            2
Latency (cycles)
Power consumption (mW)
(Latency and power values appeared in the slide's figure and did not survive transcription.)

DCC Data Array Organization (AMD Bulldozer)

DCC Timing

DCC Lookup
1. Access super tags and back pointers in parallel.
2. Find the matching back pointers.
3. Read the corresponding sub-blocks and decompress.
(Figure: reading block C of quad Q; its sub-blocks are located via matching back pointers.)
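The three lookup steps above can be sketched in software; the data structures and field names are illustrative simplifications, not the paper's hardware layout, and decompression is omitted.

```python
# Sketch of the DCC lookup steps under simplifying assumptions: super
# tags are dicts, back pointers are (tag_id, blk_id) tuples, and the
# data array is a list of compressed sub-blocks.

def dcc_lookup(super_tags, back_pointers, data, addr_tag, blk_id):
    # Step 1: in hardware, super tags and back pointers are read in
    # parallel; here we search the super tags sequentially.
    for tag_id, entry in enumerate(super_tags):
        if entry["tag"] == addr_tag and entry["state"][blk_id] != "I":
            # Step 2: find the sub-blocks whose back pointer names this
            # (tag ID, block ID) pair as owner; the sub-blocks need not
            # be contiguous in the data array.
            subs = [data[i] for i, bp in enumerate(back_pointers)
                    if bp == (tag_id, blk_id)]
            # Step 3: concatenate the sub-blocks (decompression omitted).
            return b"".join(subs)
    return None  # LLC miss

# Reading block C (blk_id=2) of a quad whose two sub-blocks sit at
# non-contiguous data slots 0 and 2:
super_tags = [{"tag": 0xAB, "state": ["V", "V", "V", "V"]}]
back_pointers = [(0, 2), (0, 1), (0, 2)]
data = [b"c0", b"b0", b"c1"]
print(dcc_lookup(super_tags, back_pointers, data, 0xAB, 2))  # b'c0c1'
```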

Applications
SPEC 2006 mixes (m1-m8): bzip2; libquantum-bzip2; libquantum; gcc; astar-bwaves; cactus-mcf-milc-bwaves; gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip; omnetpp-lbm
(Figure: workloads categorized as sensitive to cache capacity and latency, sensitive to cache capacity, sensitive to cache latency, or cache-insensitive.)

Co-DCC Design

LLC Effective Cache Capacity

LLC Miss Rate

Memory Dynamic Energy

LLC Dynamic Energy