Interactions Between Compression and Prefetching in Chip Multiprocessors
Alaa R. Alameldeen* (Intel Corporation) and David A. Wood (University of Wisconsin-Madison)
* Work largely done while at UW-Madison

2 Overview
 Addressing the memory wall in CMPs
 Compression
  Cache compression: increases effective cache size
  – Increases latency, uses additional cache tags
  Link compression: increases effective pin bandwidth
 Hardware prefetching
  + Decreases average cache latency?
  – Increases bandwidth demand  contention
  – Increases working set size  cache pollution
 Contribution #1: Adaptive prefetching
  Uses additional tags (like cache compression)
  Avoids useless and harmful prefetches
 Contribution #2: Prefetching & compression interact positively
  e.g., link compression absorbs the extra bandwidth demand of prefetching

3 Outline
 Overview
 Compressed CMP Design
  – Compressed Cache Hierarchy
  – Link Compression
 Adaptive Prefetching
 Evaluation
 Conclusions

4 Compressed Cache Hierarchy
[Diagram: Processors 1..n, each with an uncompressed L1 cache, connect to a shared compressed L2 cache through compression/decompression logic with an uncompressed-line bypass; the memory controller (which could also compress/decompress) connects to/from other chips and memory]

5 Cache Compression: Decoupled Variable-Segment Cache (ISCA 2004)
 Example cache set — each tag-area entry records a compression status and a compressed size (in segments); the data area holds the variable-size lines:

    Tag     Compression status  Compressed size
    Addr B  compressed          2
    Addr A  uncompressed        3
    Addr C  compressed          6
    Addr D  compressed          4

 Tags are decoupled from data: a tag can be present even when its line isn't
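To make the decoupled tag/data bookkeeping concrete, here is a minimal C++ sketch of one set. Everything here (TagEntry, kTags, kSegments, the segment granularity) is an illustrative assumption for the sketch, not the paper's actual structures.

```cpp
#include <array>
#include <cstdint>

// Hypothetical sketch of one set in a decoupled variable-segment cache.
// The tag area holds more entries than the data area can hold uncompressed
// lines; each tag records whether its line is compressed and how many data
// segments it occupies. Names and sizes are illustrative only.
struct TagEntry {
    uint64_t addr_tag;    // address tag (always present)
    bool     valid_data;  // false => "tag is present but line isn't"
    bool     compressed;  // compression status
    uint8_t  segments;    // compressed size, in data segments
};

struct CacheSet {
    static constexpr int kTags     = 8;   // e.g., twice the uncompressed ways
    static constexpr int kSegments = 32;  // total data segments in the set

    std::array<TagEntry, kTags> tags{};

    // Segments currently occupied by lines that actually have data.
    int segmentsInUse() const {
        int used = 0;
        for (const TagEntry& t : tags)
            if (t.valid_data) used += t.segments;
        return used;
    }
    // A new line fits only if enough data segments remain free.
    bool fits(int newSegments) const {
        return segmentsInUse() + newSegments <= kSegments;
    }
};
```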

6 Link Compression
 On-chip memory controller transfers compressed messages
 Data messages: 1-8 sub-messages (flits), 8 bytes each
 Off-chip memory controller combines flits and stores to memory
[Diagram: the CMP's processors/L1 caches and L2 cache connect through the on-chip memory controller to/from memory]
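The flit arithmetic here is just a ceiling division; a one-function sketch, assuming the slide's 8-byte flits and a 64-byte uncompressed line:

```cpp
#include <cstdint>

// Number of 8-byte flits needed to send one line over the link.
// An uncompressed 64-byte line takes all 8; a compressed line takes
// only ceil(compressedBytes / 8) of them (1-8 sub-messages).
uint32_t flitsNeeded(uint32_t compressedBytes) {
    return (compressedBytes + 7) / 8;  // ceiling division by the flit size
}
```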

7 Outline
 Overview
 Compressed CMP Design
 Adaptive Prefetching
 Evaluation
 Conclusions

8 Hardware Stride-Based Prefetching
 Evaluation based on the IBM POWER4 implementation
 Stride-based prefetching:
  + L1 prefetching hides L2 latency
  + L2 prefetching hides memory latency
  – Increases working set size  cache pollution
  – Increases on-chip & off-chip bandwidth demand  contention
 Prefetching taxonomy [Srinivasan et al. 2004]:
  – Useless prefetches increase traffic, not misses
  – Harmful prefetches replace useful cache lines
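As a rough illustration of the mechanism, here is a minimal per-PC stride-stream detector in C++. The table organization, confidence threshold, and prefetch degree are assumptions for the sketch, not the POWER4's actual parameters (the evaluated configuration appears on the system-configuration slide).

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal per-PC stride detector. On each miss it checks whether the
// current address continues the last observed stride; after two
// confirmations it issues prefetches down the stream. Table size,
// confidence threshold, and prefetch degree are illustrative.
class StridePrefetcher {
    struct Stream {
        uint64_t lastAddr = 0;
        int64_t  stride   = 0;
        int      conf     = 0;  // saturating confidence
    };
    std::unordered_map<uint64_t, Stream> table_;  // indexed by load PC
    static constexpr int kConfThreshold = 2;
    static constexpr int kDegree        = 4;      // prefetches per trigger

public:
    std::vector<uint64_t> onMiss(uint64_t pc, uint64_t addr) {
        std::vector<uint64_t> prefetches;
        Stream& s = table_[pc];
        int64_t stride = static_cast<int64_t>(addr - s.lastAddr);
        if (s.lastAddr != 0 && stride != 0 && stride == s.stride) {
            if (s.conf < kConfThreshold) ++s.conf;   // stream confirmed
        } else {
            s.stride = stride;                       // retrain on new stride
            s.conf   = 0;
        }
        s.lastAddr = addr;
        if (s.conf >= kConfThreshold)
            for (int i = 1; i <= kDegree; ++i)
                prefetches.push_back(addr + i * s.stride);
        return prefetches;
    }
};
```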

9 Prefetching in CMPs
[Chart: Zeus workload]

10 Prefetching in CMPs
 Can we avoid useless & harmful prefetches?
[Chart: Zeus workload — prefetching hurts performance]

11 Prefetching in CMPs
 Adaptive prefetching avoids hurting performance
[Chart: Zeus workload]

12 Adaptive Prefetching
 Same example cache set, extended with a PF bit per tag (the slide marks two set PF bits with √)
 PF bit: set on prefetched-line allocation, reset on access
 How do we identify useful, useless, and harmful prefetches?

13 Useful Prefetch
 Read/Write Address B  PF bit set  prefetch was used before replacement
 Access resets the PF bit
[Figure: the example cache set, with Addr B's PF bit set and then cleared by the access]

14 Useless Prefetch
 Replace Address C with Address E  C's PF bit still set  prefetch was unused before replacement
 The prefetch did not change #misses, only increased traffic
[Figure: the example cache set, with Addr C (PF bit set) replaced by Addr E]

15 Harmful Prefetch
 Read/Write Address D  tag found, but the line isn't present  could have been a hit
 Heuristic: if another line's PF bit is set  avoidable miss: the prefetch(es) incurred an extra miss
[Figure: the example cache set, with Addr D's tag matching while another entry's PF bit is set]

16 Adaptive Prefetching: Summary
 Uses extra compression tags & a PF bit per tag
  – PF bit set when a prefetched line is allocated, reset on access
  – Extra tags still maintain the address tag and status
 Uses a counter for #startup prefetches per cache, from 0 to MAX (see the sketch below):
  – Increment for useful prefetches
  – Decrement for useless & harmful prefetches
 Classification of prefetches:
  – Useful prefetch: PF bit set on a cache hit
  – Useless prefetch: replacing a line whose PF bit is set
  – Harmful prefetch: a cache miss matches an invalid line's tag while a valid line's tag has its PF bit set
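A minimal sketch of that throttle, assuming a simple saturating counter; kMax, the +/-1 step, and the names are illustrative, not the talk's exact mechanism:

```cpp
#include <algorithm>

// Sketch of the adaptive throttle described above: one saturating counter
// per cache bounds how many startup prefetches a miss may launch. Useful
// prefetches raise the bound; useless and harmful ones lower it.
class AdaptivePrefetchThrottle {
    static constexpr int kMax = 25;  // e.g., the L2's max startup prefetches
    int counter_ = kMax;

public:
    enum class Outcome { Useful, Useless, Harmful };

    void record(Outcome o) {
        if (o == Outcome::Useful)
            counter_ = std::min(counter_ + 1, kMax);
        else  // useless or harmful
            counter_ = std::max(counter_ - 1, 0);
    }
    // How many startup prefetches to issue on the next miss.
    int startupPrefetches() const { return counter_; }
};
```

With this scheme, a run of useless or harmful prefetches quickly ramps the prefetcher down toward zero, while sustained useful prefetches restore it to full aggressiveness.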

17 Outline
 Overview
 Compressed CMP Design
 Adaptive Prefetching
 Evaluation
  – Workloads and System Configuration
  – Performance
  – Compression and Prefetching Interactions
 Conclusions

18 Simulation Setup
 Workloads:
  – Commercial workloads [Computer'03, CAECW'02]: OLTP (IBM DB2 running a TPC-C-like workload), SPECjbb, and static Web serving (Apache and Zeus)
  – SPEComp benchmarks: apsi, art, fma3d, mgrid
 Simulator:
  – Simics full-system simulator, augmented with the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) [Martin et al., 2005]

19 System Configuration
 8-core CMP
 Cores: single-threaded, out-of-order superscalar; 11-stage pipeline, 64-entry instruction window, 128-entry ROB, 5 GHz clock frequency
 L1 caches: 64 KB instruction, 64 KB data, 4-way set-associative, 3-cycle latency, 320 GB/sec total on-chip bandwidth (to/from L1)
 Shared L2 cache: 4 MB, 8-way set-associative (uncompressed), 15-cycle uncompressed latency
 Memory: 400-cycle access latency, 20 GB/sec memory bandwidth
 Prefetching:
  – 8 unit/negative/non-unit stride streams per processor, for both L1 and L2
  – Issue up to 6 L1 prefetches on an L1 miss
  – Issue up to 25 L2 prefetches on an L2 miss

20 Performance

21 Performance

22 Performance

23 Performance  Significant Improvements when Combining Prefetching and Compression

24 Performance: Adaptive Prefetching  Adaptive PF avoids significant performance losses

25 Compression and Prefetching Interactions
 Positive interactions:
  – L1 prefetching hides part of the decompression overhead
  – Link compression reduces the extra bandwidth demand caused by prefetching
  – Cache compression increases effective L2 size, offsetting the working-set growth from L2 prefetching
 Negative interactions:
  – L2 prefetching and L2 compression can eliminate the same misses
 Is the overall interaction positive or negative?

26 Interactions Terminology
 Assume a base system S with two architectural enhancements A and B; all systems run program P:
    Speedup(A) = Runtime(P, S) / Runtime(P, A)
    Speedup(B) = Runtime(P, S) / Runtime(P, B)
    Speedup(A, B) = Speedup(A) x Speedup(B) x (1 + Interaction(A, B))
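A quick worked check of this definition with made-up runtimes (a positive result means the two enhancements help each other beyond the product of their individual speedups):

```cpp
#include <cstdio>

// Hypothetical runtimes of program P on base S, A alone, B alone, and A+B.
int main() {
    double tS = 100.0, tA = 80.0, tB = 90.0, tAB = 70.0;
    double speedupA  = tS / tA;   // 1.25
    double speedupB  = tS / tB;   // ~1.11
    double speedupAB = tS / tAB;  // ~1.43
    // Speedup(A,B) = Speedup(A) * Speedup(B) * (1 + Interaction(A,B))
    double interaction = speedupAB / (speedupA * speedupB) - 1.0;
    std::printf("Interaction(A,B) = %+.3f (positive => synergy)\n", interaction);
    return 0;
}
```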

27 Interaction Between Prefetching and Compression  Interaction is positive for all benchmarks except apsi

28 Conclusions
 Compression can help CMPs
  – Cache compression improves CMP performance
  – Link compression reduces pin bandwidth demand
 Hardware prefetching can hurt performance
  – Increases cache pollution
  – Increases bandwidth demand
 Contribution #1: Adaptive prefetching reduces useless & harmful prefetches
  – Gets most of the benefit of prefetching
  – Avoids large performance losses
 Contribution #2: Compression & prefetching interact positively

29 Google "Wisconsin GEMS"
 Works with Simics (free to academics)  commercial OS/apps
 SPARC out-of-order, x86 in-order, CMPs, SMPs, & LogTM
 GPL release of GEMS used in four HPCA 2007 papers
[Diagram: GEMS components — the Opal detailed processor model, Simics, microbenchmarks, a random tester, deterministic contended locks, and a trace file]

30 Backup Slides

31 Frequent Pattern Compression (FPC)
 A significance-based compression algorithm
  – Compresses each 32-bit word separately
  – Suitable for short cache lines
  – Compressible patterns: zeros; sign-extended 4-, 8-, and 16-bit values; zero-padded half-word; two sign-extended half-words; repeated byte
  – A 64-byte line is decompressed in a five-stage pipeline
 More details in Alaa's thesis
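A sketch of the per-word pattern test in C++. The pattern set follows the bullet above, but the test order, the encodings, and which half of the zero-padded half-word is zero are assumptions; the FPC technical report specifies the real prefix codes and layout.

```cpp
#include <cstdint>

// Illustrative classifier for FPC's compressible 32-bit word patterns.
enum class Pattern {
    Zero, SignExt4, SignExt8, SignExt16,
    ZeroPaddedHalf, TwoSignExtHalves, RepeatedByte, Uncompressed
};

// Does the signed value fit in `bits` bits after sign extension?
static bool fitsSigned(int32_t v, int bits) {
    int32_t lo = -(1 << (bits - 1)), hi = (1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

Pattern classify(uint32_t w) {
    int32_t s = static_cast<int32_t>(w);
    if (w == 0)            return Pattern::Zero;
    if (fitsSigned(s, 4))  return Pattern::SignExt4;
    if (fitsSigned(s, 8))  return Pattern::SignExt8;
    if (fitsSigned(s, 16)) return Pattern::SignExt16;
    if ((w & 0xFFFFu) == 0)                // assumes the low half is the pad
        return Pattern::ZeroPaddedHalf;
    int16_t hiHalf = static_cast<int16_t>(w >> 16);
    int16_t loHalf = static_cast<int16_t>(w & 0xFFFFu);
    if (fitsSigned(hiHalf, 8) && fitsSigned(loHalf, 8))
        return Pattern::TwoSignExtHalves;  // two sign-extended half-words
    uint32_t b = w & 0xFFu;
    if (w == b * 0x01010101u)
        return Pattern::RepeatedByte;      // all four bytes identical
    return Pattern::Uncompressed;
}
```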

32 Positive and Negative Interactions

33 Interaction Between Adaptive Prefetching and Compression
 Less bandwidth impact  interaction is slightly negative (except art, mgrid)

34 Positive Interaction: Pin Bandwidth  Compression saves bandwidth consumed by prefetching

35 Negative Interaction: Avoided Misses
 Only a small fraction of misses is avoided by both

36 Sensitivity to Processor Count

37 Compression Speedup

38 Compression Bandwidth Demand

39 Commercial Workloads & Adaptive Prefetching