Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch

Executive Summary
Main memory is a limited shared resource
Observation: significant data redundancy
Idea: compress data in main memory
Problem: how to avoid inefficiency in address computation?
Solution: Linearly Compressed Pages (LCP) – fixed-size, cache-line-granularity compression
Increases memory capacity (62% on average)
Decreases memory bandwidth consumption (24%)
Decreases memory energy consumption (4.9%)
Improves overall performance (13.9%)

Potential for Data Compression
Significant redundancy in in-memory data, e.g.: 0x00000000 0x0000000B 0x00000003 0x00000004 …
How can we exploit this redundancy? Main memory compression helps: it provides the effect of a larger memory without making it physically larger.

Challenges in Main Memory Compression
Challenge 1: address computation
Challenge 2: mapping and fragmentation

Challenge 1: Address Computation
In an uncompressed page, each 64B cache line L0, L1, L2, …, LN-1 starts at a fixed address offset: 0, 64, 128, …, (N-1)*64.
In a compressed page, the cache lines have different compressed sizes, so the address offset of each line is no longer known in advance.
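
To make the challenge concrete, here is a minimal Python sketch (illustrative only, not from the paper): with fixed 64B lines the offset of line i is a single multiplication, whereas with per-line compressed sizes the offset requires summing the sizes of all preceding lines.

```python
# Illustrative sketch (not from the paper): why variable-size compression
# makes address computation expensive.

LINE_SIZE = 64  # uncompressed cache line size in bytes

def uncompressed_offset(line_index: int) -> int:
    """Offset of cache line i in an uncompressed page: one multiply."""
    return line_index * LINE_SIZE

def variable_size_offset(line_index: int, compressed_sizes: list[int]) -> int:
    """Offset of cache line i when every line has its own compressed size:
    requires knowing (and summing) the sizes of all preceding lines."""
    return sum(compressed_sizes[:line_index])

# Example: line 2 starts at byte 128 uncompressed, but its compressed offset
# depends on how well lines 0 and 1 happened to compress.
sizes = [20, 34, 16, 64]               # hypothetical per-line compressed sizes
print(uncompressed_offset(2))          # 128
print(variable_size_offset(2, sizes))  # 54
```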

Challenge 2: Mapping & Fragmentation
A 4KB virtual page now maps to a compressed physical page whose size is not known up front (? KB), which complicates virtual-to-physical address mapping and leads to fragmentation in physical memory.

Outline Motivation & Challenges Shortcomings of Prior Work LCP: Key Idea LCP: Implementation Evaluation Conclusion and Future Work

Key Parameters in Memory Compression
Compression ratio, address computation latency, decompression latency, complexity and cost

Shortcomings of Prior Work
IBM MXT [IBM J.R.D. ’01]: 2X compression ratio; address computation latency of 64 cycles; decompression latency of 64 processor cycles – effectively doubling the memory access latency; complex and costly (a 32 MB LLC with 1KB cache blocks).

Shortcomings of Prior Work (2)
The comparison extended with Robust Main Memory Compression [ISCA’05], whose main remaining shortcomings are in complexity and cost.

Shortcomings of Prior Work (3)
The comparison is completed with LCP, our proposal. In prior work the address computation latency can be hidden, but hiding it leads to a high energy cost; LCP aims to do well on all four parameters.

Linearly Compressed Pages (LCP): Key Idea
In LCP, we restrict the compression algorithm to provide the same fixed compressed size for all cache lines in a page. The primary advantage of this restriction is that address computation becomes a simple linear scaling of the cache line's offset before compression. For example, with 4:1 compression of a 4KB uncompressed page (64 lines of 64B), the 3rd cache line's offset changes from 128 bytes to simply 32 bytes within the LCP page. The fixed-size restriction also makes the compressed data region fixed in size (e.g., 1KB with 4:1 compression). LCP effectively solves challenge 1: address computation.
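
A minimal sketch of this linear scaling (illustrative; the 16-byte compressed line size corresponds to the 4:1 example):

```python
# Illustrative sketch: with LCP's fixed compressed line size, the offset of
# any cache line inside the compressed data region is a linear scaling of
# its index -- no per-line bookkeeping is needed.

LINE_SIZE = 64  # uncompressed cache line size in bytes

def lcp_offset(line_index: int, fixed_compressed_size: int) -> int:
    """Byte offset of a cache line within the compressed data region."""
    return line_index * fixed_compressed_size

# 4:1 compression -> every line is stored in 16 bytes.
print(lcp_offset(2, 16))   # 3rd cache line: offset 128 uncompressed -> 32 compressed
print(64 * 16)             # compressed data region size: 1KB for a 4KB page
```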

LCP: Key Idea (2)
Unfortunately, not all data is compressible, so LCP adds two structures to the page: (1) cache lines that do not compress to the fixed size are kept in uncompressed form in a separate exception region, and (2) a dedicated metadata region (64B) stores, for every cache line, a bit indicating whether the line is stored compressed or as an exception, plus the index of the exception-storage slot occupied by an uncompressed line. A compressed page thus consists of the compressed data region (e.g., 1KB), the metadata region (64B), and the exception storage.
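
The sketch below illustrates this page organization; the exact metadata encoding and field layout here are assumptions for illustration, not the paper's precise format.

```python
# Illustrative sketch of an LCP page: compressed data region, per-line
# metadata (exception bit + exception index), and exception storage.
from dataclasses import dataclass, field

@dataclass
class LCPPage:
    fixed_size: int                       # fixed compressed size per line (e.g., 16B)
    compressed: bytearray                 # compressed data region (64 * fixed_size)
    exceptions: list[bytes] = field(default_factory=list)            # uncompressed 64B lines
    e_bit: list[bool] = field(default_factory=lambda: [False] * 64)  # 1 bit per cache line
    e_index: list[int] = field(default_factory=lambda: [0] * 64)     # slot in exception storage

    def read_line(self, i: int) -> bytes:
        if self.e_bit[i]:
            # exceptional line: stored uncompressed in the exception region
            return self.exceptions[self.e_index[i]]
        start = i * self.fixed_size
        blob = self.compressed[start:start + self.fixed_size]
        return decompress(blob)           # placeholder for FPC/BDI decompression

def decompress(blob: bytes) -> bytes:
    # stand-in decompressor for the sketch: pad back to a 64-byte line
    return bytes(blob).ljust(64, b"\x00")
```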

But, wait …
Accessing a cache line in a compressed page now appears to require two memory accesses: one for the metadata and one for the data itself. How to avoid the two accesses? Our solution is a metadata (MD) cache, 32KB in size, located at the memory controller. In the common case of an MD cache hit (more than 90% of accesses), the access to the in-memory metadata is fully avoided.
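
A minimal sketch of the MD cache idea (the entry count and LRU policy used here are simplifying assumptions):

```python
# Illustrative sketch: a small metadata cache at the memory controller.
# On a hit, the extra DRAM access for the page's metadata is avoided.
from collections import OrderedDict

class MDCache:
    def __init__(self, capacity_entries: int = 512):    # assumed capacity
        self.entries = OrderedDict()                     # page_number -> metadata bytes
        self.capacity = capacity_entries

    def lookup(self, page_number: int, read_metadata_from_dram):
        if page_number in self.entries:                  # common case: hit
            self.entries.move_to_end(page_number)        # LRU update
            return self.entries[page_number]
        metadata = read_metadata_from_dram(page_number)  # miss: one extra DRAM access
        self.entries[page_number] = metadata
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)             # evict the LRU entry
        return metadata
```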

Key Ideas: Summary  Fixed compressed size per cache line  Metadata (MD) cache

Outline Motivation & Challenges Shortcomings of Prior Work LCP: Key Idea LCP: Implementation Evaluation Conclusion and Future Work

LCP Overview
Page table entry extension: compression type and size (fixed encoding)
OS support for multiple page sizes: 4 memory pools (512B, 1KB, 2KB, 4KB)
Handling uncompressible data
Hardware support: memory controller logic and a metadata (MD) cache

Page Table Entry Extension
Each page table entry is extended with four fields:
c-bit (1b) – compressed or uncompressed page
c-type (3b) – compression encoding used
c-size (2b) – LCP size (e.g., 1KB)
c-base (3b) – offset within a page
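
A small sketch of packing and unpacking these fields (the bit positions chosen here are illustrative assumptions, not the paper's exact PTE layout):

```python
# Illustrative sketch: packing the LCP fields into spare PTE bits.
# Bit positions are assumptions for illustration only.

def pack_lcp_pte_bits(c_bit: int, c_type: int, c_size: int, c_base: int) -> int:
    assert c_bit < 2 and c_type < 8 and c_size < 4 and c_base < 8
    return (c_bit << 8) | (c_type << 5) | (c_size << 3) | c_base   # 9 bits total

def unpack_lcp_pte_bits(bits: int) -> dict:
    return {
        "c_bit":  (bits >> 8) & 0x1,   # compressed or uncompressed page
        "c_type": (bits >> 5) & 0x7,   # compression encoding used
        "c_size": (bits >> 3) & 0x3,   # LCP size encoding (e.g., 1KB)
        "c_base":  bits       & 0x7,   # offset within the physical page
    }

bits = pack_lcp_pte_bits(c_bit=1, c_type=2, c_size=1, c_base=4)
print(unpack_lcp_pte_bits(bits))
```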

Physical Memory Layout
(Figure: the page table maps pages onto physical memory organized as 4KB regions that can hold one 4KB page, two 2KB pages, four 1KB pages, or eight 512B pages. For example, PA1 with c-base = 4 locates a 1KB compressed page at PA1 + 512*4, and PA2 with c-base = 1 locates a 512B compressed page at PA2 + 512*1. 512 bytes is the minimum page size.)
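
A minimal sketch of how c-base locates a compressed page inside a physical region, matching the PA1 + 512*4 and PA2 + 512*1 examples (the frame base addresses are hypothetical):

```python
# Illustrative sketch: a compressed page starts at the region base address
# plus c-base times the minimum page size (512 bytes).

MIN_PAGE_SIZE = 512  # bytes, minimum compressed page size

def compressed_page_start(region_base_pa: int, c_base: int) -> int:
    return region_base_pa + c_base * MIN_PAGE_SIZE

PA1, PA2 = 0x20000, 0x30000                  # hypothetical region base addresses
print(hex(compressed_page_start(PA1, 4)))    # PA1 + 512*4 (e.g., a 1KB compressed page)
print(hex(compressed_page_start(PA2, 1)))    # PA2 + 512*1 (e.g., a 512B compressed page)
```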

Memory Request Flow
1. Initial page compression
2. Cache line read
3. Cache line writeback

Memory Request Flow (2)
(Figure: processor with core, TLB, and last-level cache; memory controller with compress/decompress logic and the MD cache; compressed 1KB/2KB pages in DRAM; uncompressed 4KB pages on disk.)
1. Initial page compression (1/3): a 4KB page loaded from disk is compressed (e.g., to 1KB or 2KB) before being placed in DRAM.
2. Cache line read (2/3): the memory controller reads the compressed cache line from DRAM and decompresses it before returning it to the last-level cache; an MD cache hit is the common case.
3. Cache line writeback (3/3): the evicted cache line is compressed by the memory controller and written back into the compressed page in DRAM.

Handling Page Overflows
A page overflow happens on a writeback when the written-back cache line no longer fits in its compressed slot and all slots in the exception storage are already taken. Two possible scenarios:
Type-1 overflow: requires a larger compressed physical page size (e.g., 2KB instead of 1KB).
Type-2 overflow: requires decompression and a full uncompressed physical page (e.g., 4KB).
Overflows happen infrequently – roughly once per ~2M instructions.
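
A control-flow sketch of the two overflow types (simplified; the real mechanism involves OS and hardware cooperation to migrate the page):

```python
# Illustrative sketch of page-overflow handling on a cache line writeback.
from dataclasses import dataclass

PAGE_SIZES = [512, 1024, 2048, 4096]   # the four memory pools

@dataclass
class PageState:
    physical_size: int = 1024
    compressed: bool = True

def handle_overflow(page: PageState, still_compressible: bool) -> str:
    """Called when a written-back line no longer fits and the exception
    storage of the current compressed page is already full."""
    larger = [s for s in PAGE_SIZES if page.physical_size < s < 4096]
    if still_compressible and larger:
        # Type-1: keep the page compressed, but move it to the next larger
        # physical page size (e.g., 1KB -> 2KB) to gain exception slots.
        page.physical_size = larger[0]
        return "type-1 overflow"
    # Type-2: decompress the page into a full 4KB uncompressed physical page.
    page.physical_size, page.compressed = 4096, False
    return "type-2 overflow"

page = PageState()
print(handle_overflow(page, still_compressible=True), page.physical_size)  # type-1, 2048
```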

Compression Algorithms
Key requirements: low hardware complexity, low decompression latency, high effective compression ratio.
Frequent Pattern Compression [ISCA’04] uses simplified dictionary-based compression.
Base-Delta-Immediate Compression [PACT’12] exploits the low dynamic range of values in the data.

Base-Delta Encoding [PACT’12]
Example (values from a real application, mcf): a 32-byte uncompressed cache line containing 0xC04039C0, 0xC04039C8, 0xC04039D0, …, 0xC04039F8 is stored as the base 0xC04039C0 plus 1-byte deltas 0x00, 0x08, 0x10, …, 0x38 – a 12-byte compressed cache line, saving 20 bytes. This is the basic idea behind BDI. BDI [PACT’12] uses two bases: a zero base (for narrow values) and an arbitrary base (the first non-zero value in the cache line).
Fast decompression: vector addition. Simple hardware: arithmetic and comparison. Effective: good compression ratio.
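
A simplified sketch of base-delta compression for the example above, using one arbitrary base and unsigned 1-byte deltas (the full BDI design also has a zero base and other base/delta widths):

```python
# Simplified base-delta sketch: one 4-byte base plus 1-byte deltas.
# The full BDI design also uses a zero base and other base/delta widths.

def bd_compress(values_4byte: list[int]):
    base = values_4byte[0]
    deltas = [v - base for v in values_4byte]
    if all(0 <= d <= 0xFF for d in deltas):      # every delta fits in 1 byte
        return base, bytes(deltas)               # 4 + N bytes total
    return None                                  # line not compressible this way

def bd_decompress(base: int, deltas: bytes) -> list[int]:
    return [base + d for d in deltas]            # effectively a vector addition

line = [0xC04039C0 + 8 * i for i in range(8)]    # the mcf-like example: 32 bytes
base, deltas = bd_compress(line)
print(hex(base), deltas.hex())                   # 4-byte base + 8 one-byte deltas = 12 bytes
assert bd_decompress(base, deltas) == line
```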

LCP-Enabled Optimizations
Memory bandwidth reduction: with a fixed compressed cache line size, several neighboring compressed lines (e.g., four) can be fetched in 1 transfer instead of 4.
Zero pages and zero cache lines: handled separately with 1 bit in the TLB (per page) and 1 bit per cache line in the metadata.
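
A tiny sketch of these two optimizations (illustrative): a fixed compressed line size determines how many neighboring lines fit in one 64B transfer, and lines flagged as zero need no DRAM transfer at all.

```python
# Illustrative sketch of the LCP bandwidth optimizations.
TRANSFER_SIZE = 64   # bytes per memory transfer

def lines_per_transfer(fixed_compressed_size: int) -> int:
    # e.g., 16B compressed lines -> 4 cache lines per 64B transfer
    return TRANSFER_SIZE // fixed_compressed_size

def read_line(i: int, zero_line_bits: list[bool], fetch_from_dram) -> bytes:
    if zero_line_bits[i]:
        return bytes(64)              # zero cache line: no DRAM transfer needed
    return fetch_from_dram(i)

print(lines_per_transfer(16))         # 4 -> "1 transfer instead of 4"
```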

Outline Motivation & Challenges Shortcomings of Prior Work LCP: Key Idea LCP: Implementation Evaluation Conclusion and Future Work

Methodology
Simulator: x86 event-driven simulator based on Simics
Workloads (32 applications): SPEC2006 benchmarks, TPC, Apache web server
System parameters: L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA’08]; 512KB – 16MB L2 caches; DDR3-1066, 1 memory channel
Metrics: performance – instructions per cycle, weighted speedup; capacity – effective compression ratio; bandwidth – bytes per kilo-instruction (BPKI); energy – memory subsystem energy

Evaluated Designs
Baseline – no compression
RMC – Robust Main Memory Compression [ISCA’05] with FPC [ISCA’04]
LCP-FPC – LCP framework with FPC
LCP-BDI – LCP framework with BDI [PACT’12]
LZ – Lempel-Ziv compression (per page)

Effect on Memory Capacity
32 SPEC2006, databases, web workloads, 2MB L2 cache
LCP-based designs achieve average compression ratios competitive with prior work

Effect on Bus Bandwidth
32 SPEC2006, databases, web workloads, 2MB L2 cache
LCP-based designs significantly reduce bandwidth consumption (by 24%), due to data compression

Effect on Performance LCP-based designs significantly improve performance over RMC

Effect on Memory Subsystem Energy
32 SPEC2006, databases, web workloads, 2MB L2 cache
The LCP framework is more energy efficient than RMC

Effect on Page Faults
32 SPEC2006, databases, web workloads, 2MB L2 cache
The LCP framework significantly decreases the number of page faults (by 23% on average for a 768MB memory)

Other Results and Analyses in the Paper
Analysis of page overflows; compressed page size distribution; compression ratio over time; number of exceptions (per page); detailed single-/multi-core evaluation; comparison with stride prefetching (performance and bandwidth)

Conclusion
Old idea: compress data in main memory
Problem: how to avoid inefficiency in address computation?
Solution: a new main memory compression framework called LCP (Linearly Compressed Pages)
Key idea: fixed size for compressed cache lines within a page
Evaluation: increases memory capacity (62% on average); decreases bandwidth consumption (24%); decreases memory energy consumption (4.9%); improves overall performance (13.9%)

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch

Backup Slides

Large Pages (e.g., 2MB or 1GB)
Split large pages into smaller 4KB sub-pages that are compressed individually, with a 64-byte metadata chunk for every sub-page.

Physically Tagged Caches
(Figure: the core issues a virtual address; address translation through the TLB is on the critical path and produces the physical address used to tag the L2 cache lines.)

Changes to Cache Tagging Logic
Before: cache line tags use p-base.
After: cache line tags use p-base together with c-idx.
p-base – physical page base address; c-idx – cache line index within the page.
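
A sketch of the tag change (bit widths are assumptions): before, the tag is derived from the line's full physical address; after, it combines the physical page base with the cache line index.

```python
# Illustrative sketch of the tag change; bit widths are assumptions.
PAGE_SHIFT = 12        # 4KB pages
LINE_INDEX_BITS = 6    # 64 cache lines per page

def tag_before(line_physical_address: int) -> int:
    # conventional physically tagged cache: tag derived from the line's address
    return line_physical_address >> 6                       # drop the 64B line offset

def tag_after(p_base: int, c_idx: int) -> int:
    # LCP: physical page base address combined with the cache line index
    return ((p_base >> PAGE_SHIFT) << LINE_INDEX_BITS) | c_idx

print(hex(tag_before(0x20140)))   # tag for an ordinary physical line address
print(hex(tag_after(0x20000, 5))) # tag built from page base + line index
```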

Analysis of Page Overflows

Frequent Pattern Compression
Idea: encode cache lines based on frequently occurring patterns, e.g., the first half of a word being zero.
Frequent patterns: 000 – all zeros; 001 – first half zeros; 010 – second half zeros; 011 – repeated bytes; 100 – all ones; …; 111 – not a frequent pattern.
Example: 0x00000001 → 001 + 0x0001; 0x00000000 → 000; 0xFFFFFFFF → 011 + 0xFF; 0xABCDEFFF → 111 + 0xABCDEFFF.
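
A sketch of this pattern matching for a single 32-bit word, covering a subset of the patterns listed above (prefix codes as on the slide):

```python
# Illustrative sketch: match one 32-bit word against a few FPC patterns.
def fpc_encode_word(word: int):
    """Return (3-bit pattern code, payload bytes) for a 32-bit word."""
    if word == 0x00000000:
        return "000", b""                                   # all zeros
    if word <= 0xFFFF:
        return "001", word.to_bytes(2, "big")               # first half zeros
    if word & 0xFFFF == 0:
        return "010", (word >> 16).to_bytes(2, "big")       # second half zeros
    b = word.to_bytes(4, "big")
    if len(set(b)) == 1:
        return "011", b[:1]                                 # repeated bytes
    return "111", b                                         # not a frequent pattern

for w in (0x00000001, 0x00000000, 0xFFFFFFFF, 0xABCDEFFF):
    print(hex(w), fpc_encode_word(w))
# 0x00000001 -> ('001', 0x0001); 0x00000000 -> ('000', empty);
# 0xFFFFFFFF -> ('011', 0xFF);   0xABCDEFFF -> ('111', full word)
```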

GPGPU Evaluation
Simulator: GPGPU-Sim v3.x
Card: NVIDIA GeForce GTX 480 (Fermi)
Caches: DL1 – 16 KB with 128B lines; L2 – 768 KB with 128B lines
Memory: GDDR5

Effect on Bandwidth Consumption

Effect on Throughput

Physical Memory Layout
(Figure: same layout as before, highlighting the c-base field; e.g., PA1 with c-base = 4 locates a 1KB compressed page at PA1 + 512*4, and PA2 with c-base = 1 locates a 512B compressed page at PA2 + 512*1. 512 bytes is the minimum page size.)

Page Size Distribution

Compression Ratio Over Time

IPC (1-core)

Weighted Speedup

Bandwidth Consumption

Page Overflows

Stride Prefetching - IPC

Stride Prefetching - Bandwidth

Future Work
LCP + prefetching: use prefetching as a "hint" producer to detect locality; LCP feedback – don't generate prefetch requests for cache lines that LCP can bring in for free; (adaptive) LCP dynamic restructuring based on spatial pattern information
GPU evaluation
New compressed memory designs
New compression algorithms