
EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE-STACKED DRAM CACHES
MICRO-44, Porto Alegre, Brazil, December 2011
Gabriel H. Loh [1] and Mark D. Hill [2][1]
[1] AMD Research   [2] University of Wisconsin-Madison
Hill's work largely performed while on sabbatical at [1].

EXECUTIVE SUMMARY
- A good use of stacked DRAM is as a cache, but:
  – Tags in stacked DRAM were believed to be too slow
  – On-chip tags are too large (e.g., 96 MB for a 1 GB stacked DRAM cache)
- Solution: put tags in stacked DRAM, and:
  – Faster hits: schedule the tag and data stacked-DRAM accesses together
  – Faster misses: an on-chip MissMap bypasses stacked DRAM on misses
- Result (e.g., 1 GB stacked DRAM cache with a 2 MB on-chip MissMap):
  – 29-67% faster than naïve tag+data in stacked DRAM
  – Within 88-97% of a stacked DRAM cache with impractical on-chip tags

OUTLINE
- Motivation
- Fast Hits via Compound Access Scheduling
- Fast Misses via MissMap
- Experimental Results
- Related Work and Summary

CHIP STACKING IS HERE
[Figure: two stacking options: "vertical", with the DRAM layers stacked directly on the cores, and "horizontal", with the cores and DRAM layers side by side on a silicon interposer.]
- ISSCC'11: "A 1.2V 12.8Gb/s 2Gb Mobile Wide-I/O DRAM with 4x128 I/Os Using TSV-Based Stacking" (256 MB)

HOW TO USE STACKED MEMORY?
- Complete main memory
  – A few GB is too small for all but some embedded systems
- OS-managed NUMA memory
  – Page-size fragmentation is an issue
  – Requires OS-hardware cooperation (across companies)
- Cache with a conventional block (line) size, e.g., 64 B (TAKE 1)
  – But on-chip tags for a 1 GB cache are impractical: 96 MB! (see the sketch below)
- Sector/subblock cache
  – Tag per 2 KB block (sector) + state bits with each 64 B subblock
  – Tags+state fit on-chip, but fragmentation is an issue (see paper)
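
Where the 96 MB figure comes from, as a back-of-the-envelope sketch: a 1 GB cache of 64 B blocks has 16M blocks, and at roughly 48 bits of tag+state per block (the tag size used on the tags-in-DRAM slide; treat it here as an illustrative assumption) the tag array alone needs 96 MB of SRAM.

```python
# Back-of-the-envelope SRAM tag-array size for a 1 GB cache of 64 B blocks.
# The 48 b (6 B) per-block tag+state figure is an illustrative assumption.
cache_bytes = 1 << 30                 # 1 GB stacked DRAM cache
block_bytes = 64                      # conventional cache line
tag_bits = 48                         # tag + state bits per block (assumed)

num_blocks = cache_bytes // block_bytes               # 16,777,216 blocks
tag_array_mb = num_blocks * tag_bits / 8 / (1 << 20)  # size of the on-chip tag array
print(num_blocks, tag_array_mb)                       # 16777216  96.0 (MB)
```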

TAG+DATA IN DRAM (CONVENTIONAL BLOCKS – TAKE 2)
- Use 2 KB stacked-DRAM pages, but replace the 32 64 B blocks with:
  – 29 tags (48 b each) + 29 data blocks (see the layout check below)
  – Previously dismissed as too slow
[Figure: a 2 KB row buffer (32 x 64-byte cachelines = 2048 bytes) reorganized into a tag region plus 29 ways of data. Timeline: DRAM tag lookup, then DRAM data access (data returned), then tag update; the request latency is shorter than the total bank occupancy.]
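
A quick arithmetic check of the row layout above (a sketch, not text from the paper): 29 data blocks occupy 1856 B of the 2048 B row, leaving 192 B, which comfortably holds 29 tags at 48 b each (174 B).

```python
# Sanity-check the tags-in-DRAM row layout described above (illustrative only).
row_bytes, block_bytes, tag_bits, ways = 2048, 64, 48, 29

data_bytes = ways * block_bytes          # 1856 bytes of data
tag_bytes = ways * tag_bits / 8          # 174.0 bytes of tags
assert data_bytes + tag_bytes <= row_bytes
print(row_bytes - data_bytes)            # 192 bytes left over for the 29 tags
```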

IMPRACTICAL IDEAL & OUR RESULT FORECAST
[Chart: performance of Tags in DRAM, Compound Access Scheduling, CAS + MissMap, and Ideal SRAM Tags; average of Web-Index, SPECjbb05, TPC-C, and SPECweb05. Methods come later.]
- Compound Access Scheduling + MissMap approximates impractical on-chip SRAM tags

OUTLINE
- Motivation
- Fast Hits via Compound Access Scheduling
- Fast Misses via MissMap
- Experimental Results
- Related Work and Summary

FASTER HITS (CONVENTIONAL BLOCKS – TAKE 3)
[Timing diagrams, not to scale:
 - CPU-side SRAM tags: SRAM tag lookup, then one DRAM access (ACT, RD, data transfer, bounded by tRCD, tCAS, tRAS) returns the data.
 - Tags in DRAM: a full DRAM access for the tag lookup, then a second full DRAM access for the data, then a write to update the tag.
 - Compound Access Scheduling: a DRAM tag lookup (ACT, RD, tag check), then the data read at row-buffer-hit latency in the same open row (RD, data transfer), then the tag update (WR).]
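
The effect can be made concrete with rough, assumed latencies (illustrative round numbers, not the paper's measurements):

```python
# Illustrative hit-latency comparison of the three organizations above.
# All values are assumed round numbers in nanoseconds.
t_sram_tag = 2      # on-chip SRAM tag lookup
t_dram_full = 40    # full stacked-DRAM access (ACT + RD + data transfer)
t_row_hit = 15      # stacked-DRAM access hitting an already-open row

sram_tags = t_sram_tag + t_dram_full       # impractical on-chip tags
tags_in_dram = 2 * t_dram_full             # two serialized DRAM accesses
compound = t_dram_full + t_row_hit         # tag access + guaranteed row-buffer hit
print(sram_tags, tags_in_dram, compound)   # 42 80 55
```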

COMPOUND ACCESS SCHEDULING
- Reserve the bank for the data access, guaranteeing a row-buffer hit
  – Approximately trades an SRAM lookup for a row-buffer hit (see the sketch below)
- On a miss, the bank is unnecessarily held open for the tag-check latency
  – Prevents a tag lookup on another row in the same bank
  – Effective penalty is minimal, since tRAS must elapse before this row can be closed, so the bank would be unavailable anyway
[Figure: the SRAM-tag case issues ACT+RD for the data after the SRAM lookup; the tags-in-DRAM case issues ACT+RD for the tags followed by RD for the data in the same open row.]
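
Below is a minimal scheduler sketch of the idea, assuming a simplified bank model; the names (Bank, read_block, find_way) are hypothetical and not the paper's implementation.

```python
# Minimal sketch of compound access scheduling (illustrative, not the paper's design).
# The point: the tag read and the dependent data read are scheduled as one compound
# operation, so the data read is guaranteed to be a row-buffer hit in the open row.
from dataclasses import dataclass

@dataclass
class Bank:
    open_row: int = -1
    reserved: bool = False     # bank held for the pending data access

def read_block(bank, row, which):
    # Stand-in for a column read (RD + data transfer) from the currently open row.
    assert bank.open_row == row
    return (row, which)

def compound_access(bank, row, find_way):
    bank.reserved = True                      # no other request may use or close this bank
    bank.open_row = row                       # ACT: open the row holding the set's tags + data
    tags = read_block(bank, row, "tags")      # RD: transfer the tag block and check the tags
    way = find_way(tags)                      # hit -> way number, miss -> None
    data = read_block(bank, row, way) if way is not None else None
    bank.reserved = False                     # on a miss the bank was held open needlessly,
    return data                               # but tRAS keeps it busy that long anyway

# Example: pretend the tag check reports a hit in way 3.
print(compound_access(Bank(), row=42, find_way=lambda tags: 3))   # (42, 3)
```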

OUTLINE
- Motivation
- Fast Hits via Compound Access Scheduling
- Fast Misses via MissMap
- Experimental Results
- Related Work and Summary

FASTER MISSES (CONVENTIONAL BLOCKS – TAKE 4)
- Want to avoid the delay and power of a stacked DRAM access on a miss
- Impractical on-chip tags answer:
  – Q1 "Present": Is the block in the stacked DRAM cache?
  – Q2 "Where": Where in the stacked DRAM cache (set/way)?
- New on-chip MissMap:
  – Approximates the impractical tags at practical cost
  – Answers Q1 "Present"
  – But NOT Q2 "Where"

MISSMAP
- On-chip structure to answer Q1: Is the block in the stacked DRAM cache?
- MissMap requirements:
  – Add a block on a miss (fill); remove a block on victimization
  – No false negatives: if it says "not present", the block must not be present
  – False positives allowed: if it says "present", the access may (rarely) still miss
- Sounds like a Bloom filter?
  – But our implementation is precise: no false negatives or false positives
  – Extreme subblocking with over-provisioning
[Flow: look up the address in the MissMap; on a MissMap miss, go directly to main memory; on a MissMap hit, look up the tag in the DRAM cache and (almost always) get the data from the DRAM cache.]
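
The lookup flow above, as a sketch; the object names and methods (missmap.contains, dram_cache.lookup, main_memory.read) are assumed helpers for illustration.

```python
# Sketch of how a request consults the MissMap before the stacked DRAM cache
# (illustrative control flow only; the helper objects are assumed).
def access(addr, missmap, dram_cache, main_memory):
    if not missmap.contains(addr):
        # Guaranteed miss: skip the stacked-DRAM tag probe entirely.
        return main_memory.read(addr)
    # MissMap says "present": probe the DRAM cache (compound tag+data access).
    data = dram_cache.lookup(addr)
    if data is None:
        # Cannot happen with a precise MissMap; with a false-positive-tolerant
        # design this would be a rare wasted probe followed by a memory access.
        return main_memory.read(addr)
    return data
```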

MISSMAP IMPLEMENTATION
- Key 1: Extreme subblocking (see the data-structure sketch below)
[Figure: a MissMap entry is a segment tag plus a presence bit vector; e.g., a tag plus 16 bits, one per 64 B block, tracks a 1 KB memory segment. Installing line X[7] in the DRAM cache sets bit 7 of X's vector; evicting line Y[3] clears bit 3 of Y's vector.]
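
A minimal data-structure sketch of such a table of entries, using the 4 KB segment / 64-bit vector sizing from the next slide; a real MissMap is a bounded, set-associative SRAM structure, not an unbounded dictionary.

```python
# Sketch of a MissMap: {segment tag -> presence bit vector}, one bit per 64 B block.
BLOCK_BYTES = 64
SEGMENT_BYTES = 4096                                 # 4 KB segments (the paper's example)
BLOCKS_PER_SEGMENT = SEGMENT_BYTES // BLOCK_BYTES    # 64 -> a 64-bit vector per entry

class MissMap:
    def __init__(self):
        self.entries = {}            # segment tag -> int used as a bit vector

    def _split(self, addr):
        return addr // SEGMENT_BYTES, (addr % SEGMENT_BYTES) // BLOCK_BYTES

    def install(self, addr):
        # Called when a line is filled into the DRAM cache: set its presence bit.
        seg, blk = self._split(addr)
        self.entries[seg] = self.entries.get(seg, 0) | (1 << blk)

    def evict(self, addr):
        # Called when a line is evicted from the DRAM cache: clear its presence bit.
        seg, blk = self._split(addr)
        if seg in self.entries:
            self.entries[seg] &= ~(1 << blk)
            if self.entries[seg] == 0:
                del self.entries[seg]

    def contains(self, addr):
        # Q1 "Present" only; the MissMap never answers Q2 "Where".
        seg, blk = self._split(addr)
        return bool(self.entries.get(seg, 0) & (1 << blk))

mm = MissMap()
mm.install(0x12340040)               # install block 1 of its segment
print(mm.contains(0x12340040))       # True
mm.evict(0x12340040)
print(mm.contains(0x12340040))       # False
```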

MISSMAP IMPLEMENTATION
- Key 2: Over-provisioning
  – Contrast with a 1 KB-block subblocked cache: tags only for cached (large) blocks, data not cached due to fragmentation, poor cache efficiency
- Key 3: Answer Q1 "Present", NOT Q2 "Where"
  – 36 b tag + 64 b presence vector = 100 b per entry
  – NOT 36 b tag + 5*64 b vector = 356 b (3.6x larger)
- Example: 2 MB MissMap, 4 KB pages
  – Each entry is ~12.5 bytes (36 b tag, 64 b vector); ~167,000 entries total
  – Best case, it tracks ~640 MB; few bits are likely set in each vector (arithmetic below)
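
The example's numbers follow from simple arithmetic (a check of the slide's figures, with rounding):

```python
# Check the 2 MB MissMap example above (sizes from the slide; arithmetic only).
missmap_bytes = 2 * (1 << 20)            # 2 MB carved out of the on-chip L3
entry_bits = 36 + 64                     # 36 b segment tag + 64 b presence vector
entry_bytes = entry_bits / 8             # 12.5 bytes per entry

entries = missmap_bytes / entry_bytes    # ~167,772 entries ("167,000" on the slide)
reach_mb = entries * 4096 / (1 << 20)    # each entry covers one 4 KB page
print(int(entries), int(reach_mb))       # 167772  655  (the slide rounds to ~640 MB)
```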

OUTLINE
- Motivation
- Fast Hits via Compound Access Scheduling
- Fast Misses via MissMap
- Experimental Results
- Related Work and Summary

METHODOLOGY (SEE PAPER FOR DETAILS)
- Workloads (memory footprint):
  – Web-Index (2.98 GB) // SPECjbb05 (1.20 GB)
  – TPC-C (1.03 GB) // SPECweb05 (1.02 GB)
- Base target system:
  – Eight 3.2 GHz cores (1 IPC peak) with 2-cycle, 2-way, 32 KB I$ + D$
  – 10-cycle, 8-way, 2 MB L2 per two cores + 24-cycle, 16-way, 8 MB shared L3
  – Off-chip DRAM: DDR3-1600, 2 channels
- Enhanced target system:
  – 12-way, 6 MB shared L3 + 2 MB MissMap
  – Stacked DRAM: 4 channels, 2x frequency (~1/2 latency), 2x bus width
- gem5 simulation infrastructure (Wisconsin GEMS + Michigan M5)

KEY RESULT: COMPOUND SCHEDULING + MISSMAP WORK
[Chart: performance of Tags in DRAM, Compound Access Scheduling, CAS + MissMap, and Ideal SRAM Tags.]
- Compound Access Scheduling + MissMap approximates impractical on-chip SRAM tags

2ND KEY RESULT: OFF-CHIP CONTENTION REDUCED
- For requests that miss, main memory is more responsive:
  – Fewer off-chip requests → lower queuing delay
  – Fewer off-chip requests → more row-buffer hits → lower DRAM latency
[Chart: base system vs. 128 MB, 256 MB, 512 MB, and 1024 MB stacked DRAM cache configurations.]

OTHER RESULTS IN PAPER
- Impact on all off-chip DRAM traffic (activate, read, write, precharge)
- Dynamic active memory footprint of the DRAM cache
- Additional traffic due to MissMap evictions
- Cacheline vs. MissMap entry lifetimes
- Sensitivity to how the L3 is divided between data and the MissMap
- Sensitivity to the MissMap segment size
- Performance against sub-blocked caches

OUTLINE
- Motivation
- Fast Hits via Compound Access Scheduling
- Fast Misses via MissMap
- Experimental Results
- Related Work and Summary

RELATED WORK
- Stacked DRAM as main memory
  – Mostly assumes all of main memory can be stacked [Kgil+ ASPLOS'06, Liu+ IEEE D&T'05, Loh ISCA'08, Woo+ HPCA'10]
- As a large cache
  – Mostly assumes the tag-in-DRAM latency is too costly [Dong+ SC'10, Ghosh+ MICRO'07, Jiang+ HPCA'10, Loh MICRO'09, Zhao+ ICCD'07]
- Other stacked approaches (NVRAM, hybrid technologies, etc.)
  – [Madan+ HPCA'09, Zhang/Li PACT'09]
- MissMap related
  – Subblocking [Liptay IBM Sys. J. '68, Hill/Smith ISCA'84, Seznec ISCA'94, Rothman/Smith ICS'99]
  – "Density Vector" for prefetch suppression [Lin+ ICCD'01]
  – Coherence optimization [Moshovos+ HPCA'01, Cantin+ ISCA'05]

EXECUTIVE SUMMARY
- A good use of stacked DRAM is as a cache, but:
  – Tags in stacked DRAM were believed to be too slow
  – On-chip tags are too large (e.g., 96 MB for a 1 GB stacked DRAM cache)
- Solution: put tags in stacked DRAM, and:
  – Faster hits: schedule the tag and data stacked-DRAM accesses together
  – Faster misses: an on-chip MissMap bypasses stacked DRAM on misses
- Result (e.g., 1 GB stacked DRAM cache with a 2 MB on-chip MissMap):
  – 29-67% faster than naïve tag+data in stacked DRAM
  – Within 88-97% of a stacked DRAM cache with impractical on-chip tags

Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. ©2011 Advanced Micro Devices, Inc. All rights reserved.

BACKUP SLIDES

UNIQUE PAGES IN L4 VS. MISSMAP REACH
- Example: 70% of the time, a 256 MB cache held ~90,000 or fewer unique pages

IMPACT ON OFF-CHIP DRAM ACTIVITY

MISSMAP EVICTION TRAFFIC
- Many MissMap evictions correspond to clean pages (i.e., no writeback traffic from the L4)
- By the time a MissMap entry is evicted, most of its cachelines are long past dead or already evicted

SENSITIVITY TO MISSMAP VS. DATA ALLOCATION OF L3
- 2 MB MissMap + 6 MB data provides good performance
- 3 MB MissMap + 5 MB data is slightly better, but can hurt server workloads that are more sensitive to L3 capacity

SENSITIVITY TO MISSMAP SEGMENT SIZE
- A 4 KB segment size works best
- Our simulations use physical addresses, so consecutive virtual pages can be mapped to arbitrary physical pages (limiting the contiguity that a larger segment could exploit)

COMPARISON TO SUB-BLOCKED CACHE
- Beyond 128 MB, the sub-blocked cache's overhead is greater than the MissMap's
- At the largest sizes (512 MB, 1 GB), the sub-blocked cache delivers performance similar to our approach, but at substantially higher cost

BENCHMARK FOOTPRINTS
- TPC-C: ~80% of accesses served by the hottest 128 MB worth of pages
- SPECweb05: ~80% of accesses served by 256 MB
- SPECjbb05: ~80% of accesses served by 512 MB
- Web-Index: huge active footprint

MISSMAP IMPLEMENTATION
- Key 1: Extreme subblocking for space efficiency
[Figure: a MissMap entry is a segment tag plus a presence bit vector; in this example, a tag plus 16 bits, one per 64 B block, tracks a 1 KB memory segment. Installing line X[7] in the DRAM cache sets bit 7 of X's vector; evicting line Y[3] clears bit 3 of Y's vector.]