
1 Adaptive Cache Compression for High-Performance Processors Alaa R. Alameldeen and David A. Wood, Computer Sciences Department, University of Wisconsin-Madison

2 Outline: Introduction, Motivation, Adaptive Cache Compression, Evaluation Methodology, Reported Performance, Review Conclusion, Critique/Suggestions

3 Introduction The increasing performance gap between processors and memory calls for faster memory access. Cache memories – reduce average memory latency. Cache compression – improves the performance of cache memories. Adaptive cache compression – the theme of this discussion.

4 Motivation Cache compression can improve the effectiveness of cache memories by increasing effective cache capacity. Increasing effective cache capacity reduces the miss rate, so performance improves!

5 Adaptive Cache Compression: An Overview Dynamically optimize cache performance. Use the past to predict the future:  How likely is compression to help, hurt, or make no difference to the next reference?  Feedback from previous compression decisions helps decide whether to compress the next write to the cache.

6 Adaptive Cache Compression: Implementation Two-level cache hierarchy: the L1 caches (data and instruction) are uncompressed; the L2 cache is unified and optionally compressed. Compression/decompression is applied or skipped as necessary. Pros: L1 cache performance is not affected. Cons: compression/decompression introduces latency.

7 Adaptive Cache Compression: L2 Cache Detail 8-way set associative. A compression information tag is stored with each address tag. Each set holds 32 data segments (8 bytes each). An uncompressed line comprises 8 segments, so at most 4 uncompressed lines fit in a set; compressed lines are 1 to 7 segments long. The maximum number of lines in each set is 8. Least recently used (LRU) lines are evicted, and compaction may be used to make room for a new line.
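
A minimal sketch of this set organization in Python (the class, names, and eviction loop are our own illustration of the slide's geometry, not the paper's implementation):

    SEGMENTS_PER_SET = 32      # 32 x 8 B = 256 B of data storage per set
    MAX_LINES_PER_SET = 8      # at most 8 address tags (lines) per set
    UNCOMPRESSED_SEGMENTS = 8  # an uncompressed 64 B line uses 8 segments

    class CompressedSet:
        def __init__(self):
            # Each entry: (address_tag, size_in_segments), ordered MRU -> LRU.
            self.lines = []

        def used_segments(self):
            return sum(size for _, size in self.lines)

        def allocate(self, tag, size_in_segments):
            """Insert a line at MRU, evicting LRU lines until it fits."""
            assert 1 <= size_in_segments <= UNCOMPRESSED_SEGMENTS
            self.lines.insert(0, (tag, size_in_segments))
            # Evict LRU lines while over the tag or segment budget
            # (real hardware would also compact the surviving segments).
            while (len(self.lines) > MAX_LINES_PER_SET
                   or self.used_segments() > SEGMENTS_PER_SET):
                self.lines.pop()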

8 Adaptive Cache Compression: To compress or not to compress? While compression eliminates some L2 misses, it increases the latency of L2 hits (which are more frequent). However, the penalty for an L2 miss is usually large, while the extra latency due to decompression is usually small. Compression helps if: (avoided L2 misses) x (L2 miss penalty) > (penalized L2 hits) x (decompression penalty). Example: for a 5-cycle decompression penalty and a 400-cycle L2 miss penalty, compression wins if it eliminates at least one L2 miss for every 400/5 = 80 penalized L2 hits.
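
The inequality can be checked in a few lines of Python (a sketch; the function name is ours, and the default cycle counts are the slide's example values):

    def compression_helps(avoided_misses, penalized_hits,
                          miss_penalty=400, decompression_penalty=5):
        """Net cycle-benefit test from the inequality above."""
        return (avoided_misses * miss_penalty >
                penalized_hits * decompression_penalty)

    # Slide's example: one avoided miss pays for up to 400/5 = 80
    # penalized hits.
    assert compression_helps(1, 79)        # 400 > 395: compression wins
    assert not compression_helps(1, 81)    # 400 < 405: compression loses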

9 Adaptive Cache Compression: Classification of Cache References Classifications of hits:  Unpenalized hit (e.g. reference to address A)  Penalized hit (e.g. reference to address C)  Avoided miss (e.g. reference to address E) Classifications of misses:  Avoidable miss (e.g. reference to address G)  Unavoidable miss (e.g. reference to address H)
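
A rough sketch of how these classes could be derived from the LRU stack, assuming the set geometry of slide 7 (4 uncompressed lines per set, or up to 8 lines in 32 segments); the function and its arguments are our own formulation, not the paper's code:

    # depth:  LRU stack depth of the referenced address (1 = MRU), or None
    #         if the address is beyond the tracked stack.
    # hit:    whether the line is physically present in the compressed cache.
    # stored_compressed: whether the hit line is stored in compressed form.
    # sizes:  compressed sizes (in segments) of the lines at depths 1..depth.

    def classify(hit, depth, stored_compressed, sizes):
        if hit:
            if depth > 4:     # an uncompressed cache (4 lines/set) would miss
                return "avoided miss"
            return "penalized hit" if stored_compressed else "unpenalized hit"
        # Miss: would a fully compressed cache (8 tags, 32 segments) have hit?
        if depth is not None and depth <= 8 and sum(sizes[:depth]) <= 32:
            return "avoidable miss"
        return "unavoidable miss"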

10 Adaptive Cache Compression: Hardware Used in Decision-Making Global Compression Predictor (GCP)  estimates the recent cost or benefit of compression  On a penalized hit, the controller biases against compression by decrementing the counter (subtracted value = decompression penalty)  On an avoided or avoidable miss, the controller increments the counter by the L2 miss penalty  The controller consults the GCP when allocating a line in the L2 cache  Positive value -> compression has helped, so compress  Negative value -> compression has been penalizing, so don't compress  The size of the GCP determines its sensitivity to changes  In this paper, a 19-bit counter is used (saturates at 262143 or -262144)
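
A minimal Python sketch of the GCP as a saturating counter (the class and method names are ours; the saturation bounds are the 19-bit values above, and the default penalties are the example values from slide 8):

    class GlobalCompressionPredictor:
        MAX, MIN = 2**18 - 1, -(2**18)   # 19-bit: +262143 / -262144

        def __init__(self, decompression_penalty=5, miss_penalty=400):
            self.counter = 0
            self.decompression_penalty = decompression_penalty
            self.miss_penalty = miss_penalty

        def on_penalized_hit(self):
            # Compression cost a decompression; bias against it.
            self.counter = max(self.MIN,
                               self.counter - self.decompression_penalty)

        def on_avoided_or_avoidable_miss(self):
            # Compression saved (or would have saved) a full miss penalty.
            self.counter = min(self.MAX, self.counter + self.miss_penalty)

        def should_compress(self):
            # Consulted when allocating a line in the L2 cache.
            return self.counter > 0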

11 Adaptive Cache Compression: Sensitivity Effectiveness depends on the workload's size, the cache's size, and latencies. Sensitive to L2 cache size (compression is more effective for small L2 caches). Sensitive to L1 cache size (trade-offs observed). Adapting to benchmark phases: changes in phase behaviour may hurt the adaptive policy, which takes time to adapt.

12 Evaluation Methodology Simulated target system: a dynamically-scheduled, out-of-order superscalar SPARC V9 uniprocessor. Simulation parameters: (table omitted from the transcript)

13 Evaluation Methodology (continued) Simulator: Simics full-system simulator, extended with a detailed processor timing simulator (TFsim) and a detailed memory system timing simulator. Workloads:  multi-threaded commercial workloads from the Wisconsin Commercial Workload Suite  eight SPEC CPU2000 benchmarks: integer (bzip, gcc, mcf, twolf) and floating-point (ammp, applu, equake, swim) Workloads were selected to cover a wide range of compressibility properties, miss rates, and working-set sizes.

14 Evaluation Methodology (continued) To understand the utility of adaptive compression, two extreme policies (Never compress and Always compress) were compared with the adaptive policy. 'Never' strives to reduce hit latency; 'Always' strives to reduce miss rate; 'Adaptive' strives to balance the two.
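
How the three policies might differ at L2 allocation time, reusing the predictor sketched under slide 10 (again our own illustration, not the paper's code):

    def store_compressed(policy, line_is_compressible, gcp=None):
        """Decide whether an allocated L2 line is stored compressed."""
        if policy == "never":
            return False                    # minimize hit latency
        if policy == "always":
            return line_is_compressible     # minimize miss rate
        # "adaptive": let the predictor's net cost/benefit decide
        return line_is_compressible and gcp.should_compress()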

15 Reported Performance (Average cache capacity) Figure: Average cache capacity during benchmark runs (4MB uncompressed)

16 Reported Performance (cache miss rate) Figure: L2 cache miss rate (normalized to “Never” miss rate)

17 Reported Performance (Runtime) Figure: Runtime for the three compression alternatives (normalized to “Never”)

18 Reported Performance (sensitivity of adaptive compression to benchmark phase changes) Top: temporal changes in Global Compression Predictor values. Bottom: effective cache size

19 Review Conclusion Compressing all compressible cache lines only improves memory-intensive applications; applications with a low miss rate or low compressibility suffer. Optimizations achieved by the adaptive scheme:  up to a 26% speedup (over the uncompressed scheme) for memory-intensive, highly-compressible benchmarks  less than 0.4% performance degradation for other benchmarks

20 Critique/Suggestions Data inconsistency: a 17% performance improvement for memory-intensive commercial workloads is claimed on page 2, but 26% is claimed on page 11. Miscalculation on page 4:  the sum of the compressed sizes at stack depths 1 through 7 totals 29,  yet the paper states that the miss cannot be avoided because the sum of compressed sizes exceeds the total number of segments (i.e. 35 > 32). All in all, the proposed technique doesn't seem to enhance performance significantly with respect to 'Always'.

21 Thank you!

