
1 Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching. Somayeh Sardashti and David A. Wood, University of Wisconsin-Madison


4 Communication vs. Computation [Keckler, Micro 2011]: data movement costs roughly 200X the energy of computation. Improving cache utilization is critical for energy efficiency!

5 Compressed Cache: Compress and Compact Blocks
+ Higher effective cache size
+ Small area overhead
+ Higher system performance
+ Lower system energy
Previous work limits compression effectiveness:
- Limited number of tags
- High internal fragmentation
- Energy-expensive re-compaction

6 Previous work limits compression effectiveness: a limited number of tags, high internal fragmentation, and energy-expensive re-compaction. Our proposal addresses these limits with two ideas: decoupled super-blocks and non-contiguous sub-blocks.


8 Outline
- Motivation
- Compressed caching
- Our proposal: Decoupled Compressed Cache
- Experimental results
- Conclusions

9 Uncompressed Caching: a conventional cache uses a fixed one-to-one tag/data mapping. [Figure: tag array and data array]

10 Compressed Caching: compress cache blocks, compact the compressed blocks to make room, and add more tags to increase effective capacity. [Figure: tag array and data array]

11 (1) Compression: how do we compress blocks? There are many compression algorithms; the algorithm itself is not the focus of this work, but which algorithm we use matters. [Figure: a 64-byte block is compressed to 20 bytes]

12 Compression Potentials: a high compression ratio means a potentially large normalized effective cache capacity. Compression Ratio = Original Size / Compressed Size. [Chart: compression ratio vs. cycles to decompress for several compression algorithms, with ratios ranging from about 1.5 to 3.9] We use C-PACK+Z for the rest of the talk.

13 (2) Compaction: how do we store and find compressed blocks? Compaction is critical to realizing the compression potential and is the focus of this work. Fixed-Size Compressed Cache (FixedC) [Kim'02 WMPI, Yang Micro'02] suffers from internal fragmentation. [Figure: tag and data arrays]

14 (2) Compaction, continued: Variable-Size Compressed Cache (VSC) [Alameldeen, ISCA 2004] allocates compressed blocks in sub-block units. [Figure: tag array and a data array divided into sub-blocks]
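A minimal sketch of how much space a compressed block occupies under the two compaction schemes above. The block, segment, and sub-block sizes (64 B, 32 B, 16 B) are illustrative assumptions, not the exact parameters of the original designs:

```c
/* Sketch: space allocated for a compressed block under FixedC vs. VSC.
 * Sizes below are illustrative assumptions. */
#include <stdio.h>

#define BLOCK_SIZE     64   /* uncompressed cache block, in bytes        */
#define FIXEDC_SEGMENT 32   /* FixedC: one fixed-size segment per block  */
#define VSC_SUBBLOCK   16   /* VSC: space allocated in sub-block units   */

/* FixedC: a block fits in one fixed segment or stays uncompressed. */
static int fixedc_alloc(int compressed_bytes)
{
    return (compressed_bytes <= FIXEDC_SEGMENT) ? FIXEDC_SEGMENT : BLOCK_SIZE;
}

/* VSC: round the compressed size up to a whole number of sub-blocks. */
static int vsc_alloc(int compressed_bytes)
{
    int subblocks = (compressed_bytes + VSC_SUBBLOCK - 1) / VSC_SUBBLOCK;
    return subblocks * VSC_SUBBLOCK;
}

int main(void)
{
    int sizes[] = { 10, 20, 40, 64 };   /* example compressed sizes */
    for (int i = 0; i < 4; i++)
        printf("%2d B compressed -> FixedC allocates %2d B, VSC allocates %2d B\n",
               sizes[i], fixedc_alloc(sizes[i]), vsc_alloc(sizes[i]));
    return 0;
}
```

For a 10-byte compressed block, FixedC still allocates a full 32-byte segment while VSC allocates one 16-byte sub-block, which is the internal-fragmentation gap these slides point at.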

15 Previous Compressed Caches
(Limit 1) Limited tags/metadata: adding 4X or more tags incurs high area overhead.
(Limit 2) Internal fragmentation: low cache capacity utilization.
Normalized Effective Capacity = Number of Valid LLC Blocks / Maximum Number of (Uncompressed) Blocks
[Chart: achieved normalized effective capacities (roughly 1.7 to 2.6) fall well short of the compression potential of about 3.9]
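The two metrics used on these slides (compression ratio from slide 12, normalized effective capacity from slide 15), written out directly from their definitions. The input values are made-up placeholders, not measured results:

```c
/* The slides' two metrics, computed from their definitions.
 * All inputs are hypothetical placeholders. */
#include <stdio.h>

int main(void)
{
    /* Compression Ratio = Original Size / Compressed Size */
    double original_size   = 64.0;   /* bytes per uncompressed block */
    double compressed_size = 20.0;   /* hypothetical compressed size */
    double compression_ratio = original_size / compressed_size;

    /* Normalized Effective Capacity =
     *   number of valid LLC blocks / max number of uncompressed blocks */
    double valid_blocks = 250000.0;                  /* hypothetical count        */
    double max_blocks   = 8.0 * 1024 * 1024 / 64.0;  /* e.g. 8 MB LLC, 64 B blocks */
    double normalized_effective_capacity = valid_blocks / max_blocks;

    printf("compression ratio             = %.2f\n", compression_ratio);
    printf("normalized effective capacity = %.2f\n", normalized_effective_capacity);
    return 0;
}
```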

16 (Limit 3) Energy-Expensive Re-Compaction: VSC requires energy-expensive re-compaction (3X higher LLC dynamic energy!). [Figure: updating block B so that it now needs 2 sub-blocks forces the blocks packed after it to move]
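A small sketch of why such an update is expensive when blocks are packed contiguously: growing one block shifts every block stored after it. The set layout and sizes here are a simplified illustration, not the actual VSC hardware:

```c
/* Sketch of a VSC-style update: when block B grows from one sub-block to
 * two, every block stored after it in the set must be shifted
 * (re-compacted), touching many sub-blocks. */
#include <stdio.h>
#include <string.h>

#define SET_SUBBLOCKS 16

/* each entry records which block owns that sub-block, or -1 if free */
static int set_layout[SET_SUBBLOCKS];

static void recompact_grow(int start, int old_len, int new_len)
{
    int grow = new_len - old_len;
    /* shift every sub-block stored after the updated block toward the end;
     * each move corresponds to a sub-block read plus write in the cache
     * (no overflow handling in this sketch) */
    for (int i = SET_SUBBLOCKS - 1; i >= start + new_len; i--)
        set_layout[i] = set_layout[i - grow];
    /* the grown block now owns the freed-up sub-blocks */
    for (int i = start; i < start + new_len; i++)
        set_layout[i] = set_layout[start];
}

int main(void)
{
    /* blocks A(0), B(1), C(2), D(3) packed contiguously, one sub-block each */
    int init[SET_SUBBLOCKS] = { 0, 1, 2, 3, -1, -1, -1, -1,
                               -1, -1, -1, -1, -1, -1, -1, -1 };
    memcpy(set_layout, init, sizeof init);

    recompact_grow(1, 1, 2);   /* B is updated and now needs 2 sub-blocks */

    for (int i = 0; i < SET_SUBBLOCKS; i++)
        printf("%d ", set_layout[i]);   /* 0 1 1 2 3 -1 ... : C and D had to move */
    printf("\n");
    return 0;
}
```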

17 Outline
- Motivation
- Compressed caching
- Our proposal: Decoupled Compressed Cache
- Experimental results
- Conclusions

18 Decoupled Compressed Cache
(1) Exploiting spatial locality: low area overhead.
(2) Decoupling the tag/data mapping: eliminates energy-expensive re-compaction and reduces internal fragmentation.
(3) Co-DCC, dynamically co-compacting super-blocks: further reduces internal fragmentation.

19 (1) Exploiting Spatial Locality: neighboring blocks co-reside in the LLC. [Chart: about 89% on average across our workloads]

20 (1) Exploiting Spatial Locality: DCC tracks LLC blocks at super-block granularity. One super-block tag covers up to four neighboring blocks (a quad, e.g. A, B, C, D) or a single block (a singleton, e.g. E), with per-block state kept next to the shared tag. Up to 4X blocks with low area overhead! [Figure: 4X conventional tags vs. 2X super tags plus the data array]
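A rough sketch, in C, of what one super-tag entry might hold. The field names, widths, and coherence states are illustrative assumptions, not the paper's exact metadata layout:

```c
/* Illustrative sketch of a DCC super-tag entry: one address tag shared by up
 * to four neighboring blocks, plus per-block state.  Fields are assumptions. */
#include <stdint.h>

#define BLOCKS_PER_SUPERBLOCK 4

typedef enum {
    BLK_INVALID = 0,
    BLK_SHARED,
    BLK_EXCLUSIVE,
    BLK_MODIFIED
} blk_state_t;

typedef struct {
    uint64_t    super_tag;                        /* one tag for the whole super-block (quad) */
    blk_state_t state[BLOCKS_PER_SUPERBLOCK];     /* per-block state: A, B, C, D              */
    uint8_t     comp_size[BLOCKS_PER_SUPERBLOCK]; /* assumed: compressed size in sub-blocks   */
} dcc_super_tag_t;
```

Under this sketch, a singleton such as E simply occupies one of the four block slots while the other three stay invalid.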

21 (2) Decoupling the Tag/Data Mapping: DCC decouples the tag/data mapping to eliminate re-compaction. A block's sub-blocks can be allocated flexibly, so updating B does not force neighboring blocks to move. [Figure: super tags for a quad (A, B, C, D) and a singleton (E), with flexible, non-contiguous allocation in the data array]

22 (2) Decoupling the Tag/Data Mapping: back pointers identify the owner block of each sub-block. Each back-pointer entry holds a tag ID and a block ID. [Figure: super tags, data array, and the back-pointer array]
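A matching sketch of a back-pointer entry. The bit widths, and the limits they imply on tags per set and blocks per super-block, are assumptions for illustration only:

```c
/* Illustrative sketch of the back-pointer metadata kept per data sub-block:
 * it names the owning super-tag entry and the block within that super-block.
 * Field widths are assumptions, not the paper's exact layout. */
typedef struct {
    unsigned int tag_id : 6;   /* which super-tag entry in the set owns this sub-block */
    unsigned int blk_id : 2;   /* which of the up-to-4 blocks within that super-block  */
} dcc_back_ptr_t;
```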

23 (3) Co-Compacting Super-Blocks: Co-DCC dynamically co-compacts super-blocks, further reducing internal fragmentation. [Figure: the blocks of a quad (A, B, C, D) packed together into shared sub-blocks]

24 Outline
- Motivation
- Compressed caching
- Our proposal: Decoupled Compressed Cache
- Experimental results
- Conclusions

25 Experimental Methodology
- Integrated DCC with the AMD Bulldozer cache design.
  - We model the timing and allocation constraints of sequential regions at the LLC in detail.
  - No need for an alignment network.
- Verilog implementation and synthesis of the tag-match and sub-block selection logic.
  - One additional cycle of latency due to sub-block selection.

26 Experimental Methodology
- Full-system simulation with a simulator based on GEMS.
- A wide range of applications with different levels of cache sensitivity:
  - Commercial workloads: apache, jbb, oltp, zeus
  - SPEC OMP: ammp, applu, equake, mgrid, wupwise
  - PARSEC: blackscholes, canneal, freqmine
  - SPEC 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm
- System configuration:
  - Cores: eight OOO cores, 3.2 GHz
  - L1I$/L1D$: private, 32 KB, 8-way
  - L2$: private, 256 KB, 8-way
  - L3$: shared, 8 MB, 16-way, 8 banks
  - Main memory: 4 GB, 16 banks, 800 MHz bus frequency, DDR3

27 Effective LLC Capacity and Area Overhead

Components            FixedC/VSC-2X   DCC     Co-DCC
Tag Array             6.3%            2.1%    11.3%
Back Pointer Array    0               4.4%    5.4%
(De-)Compressors      1.8%            1.8%    1.8%
Total Area Overhead   8.1%            8.3%    18.5%

[Chart: normalized effective LLC capacity vs. normalized LLC area for Baseline, 2X Baseline, FixedC, VSC, DCC, and Co-DCC]

28 (Co-)DCC Performance: (Co-)DCC boosts system performance significantly. [Chart: normalized runtime for the evaluated designs (0.93, 0.96, 0.95, 0.90, 0.86)]

29 (Co-)DCC Energy Consumption: (Co-)DCC reduces system energy by reducing the number of accesses to main memory. [Chart: normalized system energy for the evaluated designs (0.93, 0.96, 0.97, 0.91, 0.88)]

30 Summary
- We analyze the limits of compressed caching: limited number of tags, internal fragmentation, and energy-expensive re-compaction.
- Decoupled Compressed Cache improves the performance and energy of compressed caching with decoupled super-blocks and non-contiguous sub-blocks.
- Co-DCC further reduces internal fragmentation.
- Practical designs (details in the paper).

31 Backup
- (De-)Compression overhead
- DCC data array organization with AMD Bulldozer
- DCC timing
- DCC lookup
- Applications
- Co-DCC design
- LLC effective capacity
- LLC miss rate
- Memory dynamic energy
- LLC dynamic energy

32 (De-)Compression Overhead

Parameters               Compressor   Decompressor
Pipeline Depth           6            2
Latency (cycles)         16           9
Power Consumption (mW)   25.84        19.01
[One additional value, 0.016, whose row label was not captured in the transcript]

33 DCC Data Array Organization with the AMD Bulldozer cache [figure]

34 DCC Timing [figure]

35 DCC Lookup
1. Access the super tags and back pointers in parallel.
2. Find the matching back pointers.
3. Read the corresponding sub-blocks and decompress.
[Figure: reading block C of the quad (A, B, C, D); the super tags, back pointers, and data array are shown with the matching entries highlighted]
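A sketch in C of the three lookup steps above. The back-pointer structure is the illustrative one sketched earlier, decompress() stands in for the C-PACK+Z decompression pipeline, and the set geometry and tag-match step are simplified assumptions rather than the exact hardware:

```c
/* Illustrative software sketch of the DCC read path (steps 2 and 3 above). */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SUBBLOCKS_PER_SET 64   /* assumed set geometry */
#define SUBBLOCK_BYTES    16
#define BLOCK_BYTES       64

typedef struct {
    unsigned int tag_id : 6;   /* owning super-tag entry (illustrative width) */
    unsigned int blk_id : 2;   /* block within that super-block               */
} dcc_back_ptr_t;

/* Placeholder for the decompression pipeline. */
extern void decompress(const uint8_t *in, size_t in_bytes, uint8_t *out);

/* Given the matching super-tag entry (tag_id) and the block's position within
 * that super-block (blk_id), gather its sub-blocks and decompress them. */
void dcc_read_block(const dcc_back_ptr_t bptr[SUBBLOCKS_PER_SET],
                    const uint8_t data[SUBBLOCKS_PER_SET][SUBBLOCK_BYTES],
                    unsigned tag_id, unsigned blk_id,
                    uint8_t out[BLOCK_BYTES])
{
    uint8_t compressed[BLOCK_BYTES];
    size_t  nbytes = 0;

    /* Step 2: the back pointers (read in parallel with the super tags) tell
     * us which sub-blocks belong to this (tag_id, blk_id). */
    for (unsigned i = 0; i < SUBBLOCKS_PER_SET; i++) {
        if (bptr[i].tag_id == tag_id && bptr[i].blk_id == blk_id
            && nbytes + SUBBLOCK_BYTES <= BLOCK_BYTES) {
            /* Step 3: read the matching, possibly non-contiguous sub-blocks
             * (bounded at 4 sub-blocks, i.e. one uncompressed block). */
            memcpy(compressed + nbytes, data[i], SUBBLOCK_BYTES);
            nbytes += SUBBLOCK_BYTES;
        }
    }

    /* Step 3, continued: decompress the gathered sub-blocks into the block. */
    decompress(compressed, nbytes, out);
}
```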

36 Applications
SPEC 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm
[Table: workloads categorized as sensitive to cache capacity and latency, sensitive to cache capacity, sensitive to cache latency, or cache insensitive]

37 Co-DCC Design [figure]

38 LLC Effective Cache Capacity [chart]

39 LLC Miss Rate [chart]

40 Memory Dynamic Energy [chart]

41 LLC Dynamic Energy [chart]

