1 Adaptive Cache Compression for High-Performance Processors. Alaa Alameldeen and David Wood, University of Wisconsin-Madison, Wisconsin Multifacet Project. http://www.cs.wisc.edu/multifacet

2 Overview (ISCA 2004 | Alaa Alameldeen – Adaptive Cache Compression)
• Design of high-performance processors: processor speed improves faster than memory
• Memory latency dominates performance: need more effective cache designs
• On-chip cache compression: + increases effective cache size, - increases cache hit latency
• Does cache compression help or hurt?

6 Does Cache Compression Help or Hurt? (chart built up over slides 3-6)
• Adaptive Compression determines when compression is beneficial

7 Outline
• Motivation
• Cache Compression Framework: Compressed Cache Hierarchy, Decoupled Variable-Segment Cache
• Adaptive Compression
• Evaluation
• Conclusions

8 Compressed Cache Hierarchy (figure): the instruction fetcher and load-store queue access uncompressed L1 I- and D-caches; below them sits a compressed L2 cache, reached through a compression pipeline, a decompression pipeline, an uncompressed-line bypass, and an L1 victim cache; the L2 connects to memory.
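The hierarchy above can be sketched as a simple latency model. This is an illustrative sketch, not the paper's simulator: the function name and the additive-latency assumption are mine, while the cycle counts (2-cycle L1, 20-cycle L2, 400-cycle memory, 5-cycle decompression) come from the configuration and FPC slides later in the talk.

```python
# Read-path sketch for the compressed cache hierarchy: L1 hits are served
# uncompressed; on an L1 miss, a compressed L2 line goes through the
# decompression pipeline, while an uncompressed L2 line takes the bypass.
# Latencies are taken from the talk's configuration; the additive model
# is a simplification.

L1_LATENCY = 2        # cycles per L1 access
L2_LATENCY = 20       # cycles per L2 access (before decompression)
DECOMP_LATENCY = 5    # five-cycle decompression pipeline
MEM_LATENCY = 400     # off-chip DRAM access

def load_latency(in_l1, in_l2, l2_compressed):
    """Cycles to satisfy a load, by where the line is found."""
    if in_l1:
        return L1_LATENCY                     # uncompressed L1 hit
    if in_l2:
        latency = L1_LATENCY + L2_LATENCY
        if l2_compressed:
            latency += DECOMP_LATENCY         # decompression pipeline
        return latency                        # else: uncompressed-line bypass
    return L1_LATENCY + L2_LATENCY + MEM_LATENCY  # off-chip miss
```

The 5-cycle decompression is small next to the 400-cycle memory access, which is why trading hit latency for capacity can pay off.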

9 Decoupled Variable-Segment Cache
• Objective: pack more lines into the same space
• Baseline (figure): 2-way set-associative cache with 64-byte lines; each set has a tag area (Address A, Address B) and a data area
• Each tag contains the address tag, permissions, and LRU (replacement) bits

10 Decoupled Variable-Segment Cache (figure, continued): add two more tags per set (Address C, Address D)

11 Decoupled Variable-Segment Cache (figure, continued): add a compression size, a compression status, and more LRU bits to each tag

12 Decoupled Variable-Segment Cache (figure, continued): divide the data area into 8-byte segments

13 Decoupled Variable-Segment Cache (figure, continued): data lines are composed of 1-8 segments

14 Decoupled Variable-Segment Cache (figure, completed example): Addr A uncompressed, size 3; Addr B compressed, size 2; Addr C compressed, size 6; Addr D compressed, size 4. The tag area records each line's compression status and compressed size; a tag can be present while its line isn't.
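The set organization built up above can be sketched in code. This is a toy model, not the paper's implementation: the names `Tag`, `segments_of`, and `resident_lines` are illustrative, and it assumes four tags per set with a 16-segment data area filled in LRU-stack order.

```python
# Toy model of one set in a decoupled variable-segment cache: four tags
# share a data area of 16 eight-byte segments (room for two uncompressed
# 64-byte lines). All names are illustrative, not from the paper.

SET_SEGMENTS = 16  # data area per set, in 8-byte segments
SET_TAGS = 4       # twice the tags of the uncompressed 2-way baseline

class Tag:
    """One tag entry: address tag, CStatus bit, and CSize in segments."""
    def __init__(self, addr, compressed, csize):
        self.addr = addr              # address tag
        self.compressed = compressed  # CStatus: stored compressed?
        self.csize = csize            # CSize: compressed size, 1..8 segments

def segments_of(tag):
    # An uncompressed line always occupies a full 8 segments;
    # a compressed line occupies only its CSize segments.
    return tag.csize if tag.compressed else 8

def resident_lines(stack):
    """Fill the data area in LRU-stack order; tags whose lines do not fit
    stay present without data (tag present, line absent)."""
    used, resident = 0, []
    for tag in stack:
        if used + segments_of(tag) <= SET_SEGMENTS:
            used += segments_of(tag)
            resident.append(tag.addr)
    return resident
```

With the slide's example (A uncompressed with CSize 3; B, C, D compressed with sizes 2, 6, 4), A consumes a full 8 segments because it is stored uncompressed, so D's tag is present but its data is not.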

15 Outline
• Motivation
• Cache Compression Framework
• Adaptive Compression: Key Insight, Classification of L2 accesses, Global compression predictor
• Evaluation
• Conclusions

16 Adaptive Compression
• Use the past to predict the future
• Key Insight: the LRU stack [Mattson, et al., 1970] indicates for each reference whether compression helps or hurts
• Decision rule: if Benefit(Compression) > Cost(Compression), compress future lines; otherwise, do not compress them

17 Cost/Benefit Classification
• Classify each cache reference
• Example: a four-way set-associative cache with space for two 64-byte lines (16 available segments in total)
• LRU stack: Addr A (uncompressed, 3), Addr B (compressed, 2), Addr C (compressed, 6), Addr D (compressed, 4)

18 An Unpenalized Hit
• Read/Write Address A: LRU stack order = 1 ≤ 2, so it hits regardless of compression
• The line is uncompressed, so there is no decompression penalty
• Neither cost nor benefit

19 A Penalized Hit
• Read/Write Address B: LRU stack order = 2 ≤ 2, so it hits regardless of compression
• The line is compressed, so a decompression penalty is incurred
• Compression cost

20 An Avoided Miss
• Read/Write Address C: LRU stack order = 3 > 2, so it hits only because of compression
• Compression benefit: an off-chip miss is eliminated

21 An Avoidable Miss
• Read/Write Address D: the line is not in the cache, but its tag exists at LRU stack order = 4
• Sum(CSize) = 15 ≤ 16: it missed only because some lines are not compressed
• Potential compression benefit

22 An Unavoidable Miss
• Read/Write Address E: LRU stack order > 4; the line is not in the cache and no tag exists for it
• Compression wouldn't have helped
• Neither cost nor benefit
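The five cases above can be collected into one routine. A minimal sketch, assuming the four-tag, 16-segment set of the running example; the `Tag` fields and function name are illustrative, and `resident` marks whether a tag's data is actually stored in the set.

```python
from dataclasses import dataclass

SET_SEGMENTS = 16  # data area: room for two uncompressed 64-byte lines
SET_TAGS = 4       # the top half (2 tags) would hit even without compression

@dataclass
class Tag:
    addr: str
    compressed: bool  # CStatus
    csize: int        # CSize in 8-byte segments
    resident: bool    # is the line's data actually in the set?

def classify(addr, stack):
    """Classify one L2 reference against the set's LRU stack
    (index 0 = most recently used)."""
    pos = next((i + 1 for i, t in enumerate(stack) if t.addr == addr), None)
    if pos is None:
        return "unavoidable miss"      # no tag at all: compression can't help
    tag = stack[pos - 1]
    if pos <= SET_TAGS // 2:           # top half: hits regardless of compression
        return "penalized hit" if tag.compressed else "unpenalized hit"
    if tag.resident:
        return "avoided miss"          # hit only because of compression
    # Tag present but data absent: would it all have fit fully compressed?
    if sum(t.csize for t in stack[:pos]) <= SET_SEGMENTS:
        return "avoidable miss"
    return "unavoidable miss"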

23 Compression Predictor
• Estimate: Benefit(Compression) – Cost(Compression)
• A single counter: the Global Compression Predictor (GCP), a saturating up/down 19-bit counter
• GCP is updated on each cache access: a benefit increments it by the memory latency, a cost decrements it by the decompression latency; as an optimization, both are normalized to a decompression latency of 1
• Cache allocation: allocate a compressed line if GCP ≥ 0, an uncompressed line if GCP < 0
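The predictor above can be sketched as code. The class structure and method names are my own; the 400-cycle memory latency and 5-cycle decompression latency match the configuration and FPC slides, and the normalization follows the slide's optimization.

```python
class GlobalCompressionPredictor:
    """A single saturating up/down 19-bit counter estimating
    Benefit(Compression) - Cost(Compression). A sketch, not the RTL."""

    def __init__(self, bits=19, memory_latency=400, decompression_latency=5):
        self.max = (1 << (bits - 1)) - 1   # saturate as a signed counter
        self.min = -(1 << (bits - 1))
        self.value = 0
        # Normalize so a decompression costs 1 and an avoided memory
        # access is worth memory_latency / decompression_latency.
        self.benefit = memory_latency // decompression_latency
        self.cost = 1

    def update(self, access_class):
        # Benefit: compression eliminated (or would have eliminated) a miss.
        if access_class in ("avoided miss", "avoidable miss"):
            self.value = min(self.max, self.value + self.benefit)
        # Cost: a penalized hit paid the decompression latency.
        elif access_class == "penalized hit":
            self.value = max(self.min, self.value - self.cost)

    def allocate_compressed(self):
        return self.value >= 0  # GCP >= 0: store newly allocated lines compressed
```

Unpenalized hits and unavoidable misses leave the counter unchanged, since they carry neither cost nor benefit.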

24 Outline
• Motivation
• Cache Compression Framework
• Adaptive Compression
• Evaluation: Simulation Setup, Performance
• Conclusions

25 Simulation Setup
• Simics full-system simulator augmented with a detailed OoO processor simulator [TFSim, Mauer, et al., 2002] and a detailed memory timing simulator [Martin, et al., 2002]
• Workloads: commercial workloads (database servers: OLTP and SPECJBB; static web serving: Apache and Zeus) and SPEC2000 benchmarks (SPECint: bzip, gcc, mcf, twolf; SPECfp: ammp, applu, equake, swim)

26 System Configuration
• A dynamically scheduled SPARC V9 uniprocessor
• Configuration parameters:
L1 Cache: split I&D, 64KB each, 2-way SA, 64B lines, 2 cycles/access
L2 Cache: unified 4MB, 8-way SA, 64B lines, 20 cycles + decompression latency per access
Memory: 4GB DRAM, 400-cycle access time, 128 outstanding requests
Processor pipeline: 4-wide superscalar, 11-stage pipeline: fetch (3), decode (3), schedule (1), execute (1+), retire (3)
Reorder buffer: 64 entries

27 Simulated Cache Configurations
• Always: all compressible lines are stored in compressed format; a decompression penalty applies to every compressed line
• Never: all cache lines are stored in uncompressed format; the cache is 8-way set-associative with half the number of sets and incurs no decompression penalty
• Adaptive: our adaptive compression scheme

28 Performance (chart: SPECint, SPECfp, and commercial workloads)

29 Performance (chart, continued)

30 Performance (chart, continued; annotations: 35% speedup, 18% slowdown)

31 Performance (chart, continued)
• Adaptive performs similar to the best of Always and Never
• Bug in GCP update

32 Effective Cache Capacity (chart)

33 Cache Miss Rates (chart: misses per 1000 instructions per benchmark, annotated with penalized hits per avoided miss: 6709, 489, 12.3, 4.7, 0.09, 2.52, 12.28, 14.38)

34 Adapting to L2 Sizes (chart: misses per 1000 instructions across L2 sizes, annotated with penalized hits per avoided miss: 0.93, 5.7, 6503, 326000 and 104.8, 36.9, 0.09, 0.05)

35 Conclusions
• Cache compression increases cache capacity but slows down cache hit time: it helps some benchmarks (e.g., apache, mcf) and hurts others (e.g., gcc, ammp)
• Our proposal, adaptive compression, uses the (LRU) replacement stack to determine whether compression helps or hurts, and updates a single global saturating counter on cache accesses
• Adaptive compression performs similar to the better of Always Compress and Never Compress

36 Backup Slides
• Frequent Pattern Compression (FPC)
• Decoupled Variable-Segment Cache
• Classification of L2 Accesses
• (LRU) Stack Replacement
• Cache Miss Rates
• Adapting to L2 Sizes – mcf
• Adapting to L1 Size
• Adapting to Decompression Latency – mcf
• Adapting to Decompression Latency – ammp
• Phase Behavior – gcc
• Phase Behavior – mcf
• Can We Do Better Than Adaptive?

37 Decoupled Variable-Segment Cache
• Each set contains four tags and space for two uncompressed lines
• The data area is divided into 8-byte segments
• Each tag is composed of: an address tag, permissions, and LRU/replacement bits (same as an uncompressed cache), plus CStatus (1 if the line is compressed, 0 otherwise) and CSize (size of the compressed line in segments)

38 Frequent Pattern Compression
• A significance-based compression algorithm
• Related work: X-Match and X-RL algorithms [Kjelso, et al., 1996]; address and data significance-based compression [Farrens and Park, 1991; Citron and Rudolph, 1995; Canal, et al., 2000]
• A 64-byte line is decompressed in five cycles
• More details in the technical report: "Frequent Pattern Compression: A Significance-Based Compression Algorithm for L2 Caches," Alaa R. Alameldeen and David A. Wood, Dept. of Computer Sciences Technical Report CS-TR-2004-1500, April 2004 (available online)

39 Frequent Pattern Compression (FPC)
• A significance-based compression algorithm combined with zero run-length encoding
• Compresses each 32-bit word separately; suitable for short (32-256 byte) cache lines
• Compressible patterns: zero runs; sign-extended 4-, 8-, and 16-bit values; zero-padded half-word; two sign-extended half-words; repeated byte
• A 64-byte line is decompressed in a five-stage pipeline
• More details in the technical report cited on the previous slide
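As an illustration of the pattern set listed above, the sketch below classifies a single 32-bit word. The function and pattern names are mine, and it is deliberately simplified: prefix encoding, segment packing, and cross-word zero-run handling are omitted.

```python
def fpc_pattern(word):
    """Name the first FPC-style pattern a 32-bit word matches (simplified)."""
    w = word & 0xFFFFFFFF
    signed = w - (1 << 32) if w & 0x80000000 else w  # two's-complement value
    if w == 0:
        return "zero"                        # candidate for a zero run
    if -8 <= signed < 8:
        return "4-bit sign-extended"
    if -128 <= signed < 128:
        return "8-bit sign-extended"
    if -32768 <= signed < 32768:
        return "16-bit sign-extended"
    if w & 0xFFFF == 0:
        return "zero-padded half-word"       # data in upper half, lower half zero
    def sign_extended_byte(half):
        # does a 16-bit half-word look like a sign-extended byte?
        s = half - (1 << 16) if half & 0x8000 else half
        return -128 <= s < 128
    if sign_extended_byte(w >> 16) and sign_extended_byte(w & 0xFFFF):
        return "two sign-extended half-words"
    if w == (w & 0xFF) * 0x01010101:
        return "repeated byte"
    return "uncompressible"                  # stored as a full 32-bit word
```

Each compressed word would carry a short prefix naming its pattern plus only the significant bits, which is why small integers and zero-heavy data compress so well.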

40 Classification of L2 Accesses
• Cache hits: an unpenalized hit is a hit to an uncompressed line that would have hit without compression; a penalized hit (-) is a hit to a compressed line that would have hit without compression; an avoided miss (+) is a hit to a line that would NOT have hit without compression
• Cache misses: an avoidable miss (+) is a miss to a line that would have hit with compression; an unavoidable miss is a miss to a line that would have missed even with compression

41 (LRU) Stack Replacement
• How do we differentiate penalized hits from avoided misses? Only hits to the top half of the tags in the LRU stack are penalized hits
• How do we differentiate avoidable from unavoidable misses?
• The classification does not depend on LRU replacement: any replacement algorithm works for the top half of the tags, and any stack algorithm for the remaining tags

42 Cache Miss Rates (chart)

43 Adapting to L2 Sizes (chart: misses per 1000 instructions across L2 sizes, annotated with penalized hits per avoided miss: 11.6, 4.4, 12.6, 2x10^6 and 98.9, 88.1, 12.4, 0.02)

44 Adapting to L1 Size (chart)

45 Adapting to Decompression Latency (chart)

46 Adapting to Decompression Latency (chart, continued)

47 Phase Behavior (chart: predictor value (K) and cache size (MB) over time)

48 Phase Behavior (chart, continued: predictor value (K) and cache size (MB) over time)

49 Can We Do Better Than Adaptive?
• Optimal is an unrealistic configuration: Always with no decompression penalty

