Skewed Compressed Cache


Skewed Compressed Cache MICRO 2014. Somayeh Sardashti and David A. Wood, Computer Sciences Department, University of Wisconsin-Madison. Hello, I am Jaehyung Ahn, and I will be presenting SCC from MICRO 2014.

SCC Off-chip accesses cost latency, bandwidth, and power. The LLC is already large, so the goal is to increase its effective capacity rather than its physical size.

Cache Compression Observation: many cache lines contain low-dynamic-range data, i.e., values that differ from one another by only small amounts, which makes them highly compressible.
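A minimal sketch of why low dynamic range matters: a base-plus-delta scheme (in the spirit of the Base-Delta-Immediate work cited earlier in this page) stores one base value and narrow signed deltas. The function name and sizes are illustrative, not the paper's exact algorithm.

```python
def base_delta_compress(words, delta_bytes=1):
    """Try to compress a line of 4-byte words as one base value plus
    narrow signed deltas; return None if any delta is too wide."""
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas  # 4 bytes (base) + len(words) * delta_bytes
    return None  # not compressible at this delta width

# A run of nearby pointers has low dynamic range, so it compresses:
line = [0x1000, 0x1004, 0x1010, 0x100C]
assert base_delta_compress(line) == (0x1000, [0, 4, 16, 12])
# 16 uncompressed bytes shrink to 4 (base) + 4 (deltas) = 8 bytes.
```

Lines whose values span a wide range (e.g., a pointer next to a small integer) fail the delta-width check and are stored uncompressed.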

SCC Designing a compressed cache requires (1) a compression algorithm to compress blocks and (2) a compaction mechanism to fit the compressed blocks in the cache. *In general, SCC is independent of the compression algorithm in use. Early compressed caches supported only a single compression ratio; this gave fast lookups and relatively low metadata overhead, but the compression ratio was so low that effective capacity remained low. More flexible designs instead suffer internal fragmentation, extra metadata, and indirection.

Motivation How can we design a compressed cache? (Design goals) 1. Tightly compact variable-size compressed blocks. 2. Keep tag and other metadata overhead low. 3. Allow fast lookups. => Previous compressed cache designs fail to achieve all three goals at once.

Compressed Cache Taxonomy 64B blocks are divided into 16B subblocks; VSC tracks each block's compressed size as 0-4 subblocks. IIC-C and DCC locate compressed blocks through per-block pointers. Superblock tags amortize tag cost: e.g., 16 tags per set in a 16-way associative cache, where each tag tracks a 4-block superblock and can map up to 4 cache blocks. The two key design questions are how to provide the additional tags and how to find the corresponding block given a matching tag.

SCC Key Observations. 1) Spatial locality: neighboring blocks tend to reside in the cache at the same time. 2) Compression locality: neighboring blocks tend to compress similarly.

SCC A 48-bit physical address is split into a superblock tag and data-locating bits. A 2-bit compression factor (CF) encodes each block's compressed size: CF = 0b00: 64B (uncompressed, 16 words); CF = 0b01: 32B; CF = 0b10: 16B; CF = 0b11: 8B.
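The two-bit compression factor maps directly to a power-of-two size, a minimal sketch assuming the four sizes listed above:

```python
def compressed_size(cf):
    """Map the 2-bit compression factor to a compressed block size in
    bytes: 0b00 -> 64B (uncompressed), 0b01 -> 32B, 0b10 -> 16B, 0b11 -> 8B."""
    assert 0 <= cf <= 0b11
    return 64 >> cf

assert [compressed_size(cf) for cf in range(4)] == [64, 32, 16, 8]
```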

SCC

SuperBlock Cache

16-way set-associative Cache A 48-bit address (bits 47..0) selects a cache block; blocks are divided into subblocks, and each way group is 4-way set associative.

SuperBlock 1 superblock = 8 contiguous blocks = 8 x 64B = 512B. Unlike a single large block, only the requested block is fetched, and each block keeps its own valid bit and other per-block state. Address fields: bits 5..0 select a byte within the 64B block, and bits 8..6 give the block ID within the superblock.
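The address split above can be sketched directly from the field boundaries on this slide (the function name is illustrative):

```python
def split_address(addr):
    """Split a 48-bit physical address per this slide's field layout:
    bits 5..0 byte select (64B block), bits 8..6 block ID
    (8 blocks per 512B superblock), bits 47..9 superblock number."""
    byte_select = addr & 0x3F         # A5..A0
    block_id    = (addr >> 6) & 0x7   # A8..A6
    superblock  = addr >> 9           # A47..A9
    return superblock, block_id, byte_select

# Superblock 3, block 5, byte 63:
assert split_address((3 << 9) | (5 << 6) | 63) == (3, 5, 63)
```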

Way Group Selection The way group is selected by XORing address bits with the compression factor, alongside the superblock tag. On a write, the compression factor is known, so the way group can be chosen directly. On a read, the block's compression factor is unknown; however, because the mapping XORs A10..A9 with the way group, it can be inverted: once A's superblock tag (and valid bit) is found, the way group it hit in reveals the compression factor. (Store: write / cache miss. Load: lookup.)
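The invertibility argument above can be shown in a few lines. The choice of A10..A9 and a 2-bit XOR is taken from this slide; the exact hash used by SCC is more involved, so treat this as a sketch of the idea only.

```python
def way_group(addr, cf):
    """Pick one of 4 way groups by XORing two address bits (A10..A9)
    with the 2-bit compression factor."""
    return ((addr >> 9) & 0b11) ^ cf

def recover_cf(addr, hit_way_group):
    """XOR is its own inverse, so a tag hit in a given way group
    reveals the compression factor the block was stored with."""
    return ((addr >> 9) & 0b11) ^ hit_way_group

addr = 0x7F32A40
for cf in range(4):
    g = way_group(addr, cf)
    assert recover_cf(addr, g) == cf  # round-trips for every CF
```

This is why a read can probe the tag arrays without knowing the compression factor in advance: the location of the hit carries that information for free.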

Compressed blocks are placed within the data entry in superblock order (using address bits 8..6), so a block's position can be computed rather than tracked.

Example Q&A: Why use a hash function at all? To implement the skewed cache. Why hash A47..A11? To skew differently per superblock; for adjacent superblocks, blocks with the same compression factor map to the same index so they pack as tightly as possible, while poorly compressing blocks spread across sets to avoid conflict misses (the way group already separates compression factors). Why are A10 and A9 excluded? Because each way group is only 4-way set associative and those bits already participate in way-group selection.


2-way Skewed Cache. Index each way with a different hash function, spreading out accesses and reducing conflict misses.

SCC A 16-way cache with 8 sets, divided into 4 way groups; 64B cache blocks; 8-block superblocks, each block occupying 1, 2, 4, or 8 subblocks; a separate sparse superblock tag array. Because the address and CF together determine the way group and set index, no extra metadata about the compression ratio is needed: a block's location encodes it. (For clarity the figure places each block in the first way of its way group; in reality each way group is itself skewed.) The block offset within the data entry follows directly from the address, so no metadata is needed to track where data sits in the data array. The datapath is simpler, lookup is faster, area is smaller, and the design is easier, while multiple compression ratios are still supported. The cost of partitioning into way groups is that the full 16-way associativity cannot be exploited, increasing conflict misses; this is why each way group is skewed.

SCC Lookup: on a first lookup the CF is unknown, so all way groups must be probed; the index for each probe is derived from the CF implied by that way group (Fig. 2a). If the tag matches and the block's valid bit is set, it is a hit, and the subblock's position is computed from the recovered CF and the byte offset (Eq. 3). Write: blocks are uncompressed in L1/L2 and compressed at the LLC, so the CF is known on a write-back (e.g., a write-back to an inclusive LLC); on a cache miss, the block is compressed as it is brought from memory into the LLC, so the CF is again known. The way group is then fixed, and if the target set is full, a victim is chosen by LRU within that way group. Unlike DCC, SCC never risks evicting blocks belonging to other superblocks, and when a superblock is evicted only the blocks in its data entry go with it, which keeps the implementation simple. *97% of updated blocks fit in their original place.
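The lookup flow described above can be sketched as a serial probe of each way group, deriving the set index from the CF implied by that group. The hash, the dictionary-based structures, and the 4-group/8-set sizes here are illustrative placeholders, not the paper's exact design:

```python
def scc_lookup(cache, addr):
    """Probe every way group; the CF implied by each group determines
    the set index to check there. A superblock-tag match with a valid
    bit set for this block is a hit (illustrative structures)."""
    superblock = addr >> 9
    block_id = (addr >> 6) & 0x7
    for group in range(4):
        cf = ((addr >> 9) & 0b11) ^ group  # invert the XOR mapping
        index = hash((superblock, cf)) % len(cache[group])
        entry = cache[group][index]
        if entry and entry["tag"] == superblock and entry["valid"][block_id]:
            return group, cf, index  # hit: the location encodes the CF
    return None  # miss in all four way groups

# Tiny 4-group, 8-set sketch with one block installed by hand:
cache = [[None] * 8 for _ in range(4)]
addr, cf = 0x0ABCD240, 0b10
g = ((addr >> 9) & 0b11) ^ cf
idx = hash((addr >> 9, cf)) % 8
cache[g][idx] = {"tag": addr >> 9, "valid": [False] * 8}
cache[g][idx]["valid"][(addr >> 6) & 0x7] = True
assert scc_lookup(cache, addr) == (g, cf, idx)
```

Note that at most four probes are ever needed, one per way group, because each group corresponds to exactly one candidate CF for a given address.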

Area Overhead FixedC needs 1 extra bit per block; VSC needs 3 bits per block to locate each subblock within a set. SCC needs only LRU state for the tags and no extra data to locate subblocks: just tag addresses, LRU state, and per-block coherence state. Configurations compared: Baseline: conventional 16-way 8MB LLC. FixedC: doubles the number of tags; compresses only to half size. VSC: 0-4 16B subblocks. DCC4-16: 0-4 16B subblocks. SCC8-8: 0-8 8B subblocks.

Methodology GEMS simulator; CACTI 6.5 for area and power at 32nm. We run mixes of multi-programmed workloads from memory-bound and compute-bound SPEC CPU2006 benchmarks, plus different applications from SPEC OMP, PARSEC, commercial workloads, and SPEC CPU2006. Workloads are shown in increasing order of baseline performance; caches are warmed up and multiple runs are averaged. Baseline: conventional 16-way 8MB LLC. 2x Baseline: conventional 32-way 16MB LLC.

Evaluation - MPKI The 2x Baseline improves MPKI by 15% on average; SCC by 13% on average.

Evaluation - Energy SCC improves system energy by up to 20%, and by 6% on average.

Conclusion SCC achieves performance comparable to a conventional cache with twice the capacity and associativity, with only 1.5% area overhead (vs. 6.8% for DCC) and lower design complexity. SCC's replacement mechanism is much simpler than DCC's. In DCC, allocating space for a block can trigger the eviction of several blocks, sometimes belonging to different superblocks; on a superblock miss, all blocks associated with the victim superblock tag must be evicted, and blocks belonging to other superblocks may need to be evicted too, so determining the best victim in DCC is very complex. SCC evicts only the blocks belonging to a particular data entry. SCC also never needs to evict a block on a superblock hit, while DCC may: SCC allocates the missing block in its corresponding data entry, which is guaranteed to have enough space since the compression factor is part of the search criteria, whereas in DCC a superblock hit does not guarantee any free space in the data array.

FixedC

VSC

DCC

SCC

Sector Cache


Cache Compression [Goals] Fast (low decompression latency), simple (avoid complex hardware changes), and effective (good compression ratio).

Motivation Off-chip memory latency is high -> a larger cache reduces misses, at the cost of more area and power. Off-chip memory accesses require high energy -> a larger cache reduces off-chip accesses. Off-chip interconnect bandwidth is limited -> a larger last-level cache reduces off-chip traffic.