Lecture 12: Large Cache Design. Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers.

Presentation transcript:

1 Lecture 12: Large Cache Design
Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers

2 Shared vs. Private
- SHR: No replication of blocks
- SHR: Dynamic allocation of space among cores
- SHR: Low latency for shared data in the LLC (no indirection through a directory)
- SHR: No interconnect traffic or tag replication to maintain directories
- PVT: More isolation and better quality-of-service
- PVT: Shorter wire traversal for LLC hits, on average
- PVT: Lower contention when accessing some shared data
- PVT: No need for software support to maintain data proximity

3 Innovations for Private Caches: Cooperation
Cooperative Caching, Chang and Sohi, ISCA’06
[Figure: eight tiles, each a processor P with a private cache C, connected to a central directory D]
- Prioritize replicated blocks for eviction, with a given probability; the directory must track and communicate each block’s “replica” status
- “Singlet” blocks (the only cached copy) are sent to a sibling cache upon eviction (probabilistic one-chance forwarding); the forwarded block is placed in the LRU position of the sibling
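A minimal sketch of the two heuristics above, assuming plain Python lists ordered MRU-to-LRU for cache sets and illustrative probabilities; the paper's actual probabilities and directory interactions differ.

```python
import random

def pick_victim(cache_set, replica_blocks, evict_replica_prob=0.75):
    """Evict a replicated block with a given probability, else the plain LRU block.
    cache_set is ordered MRU -> LRU; replica_blocks holds addresses the directory
    has marked as replicated elsewhere. Probabilities here are assumptions."""
    replicas = [b for b in cache_set if b in replica_blocks]
    if replicas and random.random() < evict_replica_prob:
        return replicas[-1]              # the least-recently-used replica
    return cache_set[-1]                 # plain LRU victim

def evict_and_forward(cache_set, replica_blocks, sibling_set, forward_prob=0.5):
    """Singlet victims get one probabilistic chance in a sibling cache,
    entering at the sibling's LRU position (the tail of its list)."""
    victim = pick_victim(cache_set, replica_blocks)
    cache_set.remove(victim)
    if victim not in replica_blocks and random.random() < forward_prob:
        if sibling_set:
            sibling_set.pop()            # sketch: drop the sibling's LRU block
        sibling_set.append(victim)       # place the victim at the sibling's LRU slot
    return victim
```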

4 Dynamic Spill-Receive
Dynamic Spill-Receive, Qureshi, HPCA’09
- Instead of forcing a block upon a sibling, designate caches as Spillers and Receivers; all cooperation is between Spillers and Receivers
- Every cache designates a few of its sets as Spiller sets and a few as Receiver sets (each cache picks different sets for this profiling)
- Each private cache independently tracks the global miss rate on its S/R sets (either by watching the bus or at the directory)
- The sets with the winning policy determine the policy for the rest of that private cache; this is referred to as set-dueling
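A rough sketch of the set-dueling decision, with assumed sample-set choices and counter bounds; the real mechanism tracks misses seen across all caches on the dueling sets.

```python
# Sample-set selection and counter width below are assumptions for illustration.
NUM_SETS = 1024
SPILLER_SAMPLE = set(range(0, NUM_SETS, 64))       # sets that always spill
RECEIVER_SAMPLE = set(range(32, NUM_SETS, 64))     # sets that always receive

class SpillReceivePolicy:
    def __init__(self):
        # Saturating counter: misses on spiller sets minus misses on receiver sets.
        self.psel = 0

    def record_miss(self, set_index):
        if set_index in SPILLER_SAMPLE:
            self.psel = min(self.psel + 1, 1023)
        elif set_index in RECEIVER_SAMPLE:
            self.psel = max(self.psel - 1, -1024)

    def should_spill(self, set_index):
        if set_index in SPILLER_SAMPLE:
            return True
        if set_index in RECEIVER_SAMPLE:
            return False
        # Follower sets adopt the policy whose sample sets missed less often.
        return self.psel <= 0
```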

5 Innovations for Shared Caches: NUCA
[Figure: CPU surrounded by an array of cache banks]
Issues to be addressed for Non-Uniform Cache Access:
- Mapping
- Migration
- Search
- Replication

6 Static and Dynamic NUCA
Static NUCA (S-NUCA)
- The address index bits determine where the block is placed; sets are distributed across banks
- Page coloring can help here to improve locality
Dynamic NUCA (D-NUCA)
- Ways are distributed across banks
- Blocks are allowed to move between banks, so some search mechanism is needed:
  - Each core can maintain a partial tag structure so it has an idea of where the data might be (complex!)
  - Every possible bank is looked up and the search propagates, either in series or in parallel (complex!)
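A small sketch of S-NUCA placement with illustrative field widths (64 B blocks, 16 banks, 1K sets per bank); the point is that the bank is a pure function of the address, so no search is needed.

```python
BLOCK_OFFSET_BITS = 6     # 64 B blocks (assumed)
NUM_BANKS = 16            # assumed bank count
SETS_PER_BANK = 1024      # assumed sets per bank

def snuca_bank(physical_addr):
    """Low-order set-index bits select the bank (set interleaving)."""
    return (physical_addr >> BLOCK_OFFSET_BITS) % NUM_BANKS

def snuca_set(physical_addr):
    """The remaining index bits select the set within that bank."""
    return ((physical_addr >> BLOCK_OFFSET_BITS) // NUM_BANKS) % SETS_PER_BANK

# In D-NUCA, by contrast, the ways of a set are spread across banks, so a lookup
# may have to search several candidate banks, serially or in parallel.
```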

7 Beckmann and Wood, MICRO’04
[Figure: CMP floorplan with NUCA banks; the closest banks have 13-17 cycle latency, the farthest about 65 cycles]
Data must be placed close to the center-of-gravity of its requests

8 Alternative Layout
From Huh et al., ICS’05; the paper also introduces the notion of sharing degree
- A bank can be shared by any number of cores, from N=1 to N=16
- Will need support for L2 coherence as well

9 Victim Replication, Zhang & Asanovic, ISCA’05
[Figure: eight tiles, each a processor P with a private L1 C, sharing a distributed L2]
- Large shared L2 cache (each core has a local slice)
- On an L1 eviction, place the victim in the local L2 slice (if there are unused lines)
- The replication does not impact correctness: this core is still in the sharer list and will receive invalidations
- On an L1 miss, the local L2 slice is checked before forwarding the request to the correct slice
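A sketch of the victim-replication lookup and fill paths, using plain dicts for the L2 slices; the sharer-list bookkeeping that keeps replicas coherent is assumed to live in the directory.

```python
def on_l1_eviction(block_addr, block_data, local_slice, capacity):
    """Replicate the L1 victim in the local L2 slice only if a line is free."""
    if len(local_slice) < capacity:
        local_slice[block_addr] = block_data

def on_l1_miss(block_addr, local_slice, home_slice):
    """Check the local slice (a possible replica) before going to the home slice."""
    if block_addr in local_slice:
        return local_slice[block_addr]
    return home_slice.get(block_addr)   # forward the request to the correct slice
```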

10 Page Coloring
Cache view of a physical address: Tag | Set Index | Block offset
OS view of the same address: Physical page number | Page offset
- Bank number with set-interleaving: taken from the low-order set-index bits (inside the page offset)
- Bank number with page-to-bank mapping: taken from the low-order bits of the physical page number, so the OS can choose the bank by choosing the physical frame (its “color”)
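A sketch of the two bank-numbering options in the diagram above, with assumed widths (64 B blocks, 4 KB pages, 16 banks).

```python
BLOCK_OFFSET_BITS = 6      # 64 B blocks (assumed)
PAGE_OFFSET_BITS = 12      # 4 KB pages (assumed)
NUM_BANKS = 16             # assumed bank count

def bank_by_set_interleaving(addr):
    # Bank bits come from the low-order set-index bits (inside the page offset),
    # so consecutive blocks go to consecutive banks and the OS has no say.
    return (addr >> BLOCK_OFFSET_BITS) % NUM_BANKS

def bank_by_page_color(addr):
    # Bank bits come from the low-order physical page-number bits (the "color"),
    # so the OS can steer a whole page to a bank by choosing its physical frame.
    return (addr >> PAGE_OFFSET_BITS) % NUM_BANKS
```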

11 Cho and Jin, MICRO’06
- Page coloring to improve proximity of data and computation
- Flexible software policies
- Has the benefits of S-NUCA: each address has a unique location and no search is required
- Has the benefits of D-NUCA: page re-mapping can help migrate data, although at a page granularity
- Easily extends to multi-core and can easily mimic the behavior of private caches

12 Page Coloring Example
[Figure: eight tiles (processor P with cache slice C); pages are colored so they map to nearby slices]
- Awasthi et al., HPCA’09 propose a mechanism for hardware-based re-coloring of pages without requiring copies in DRAM memory
- They also formalize the cost functions that determine the optimal home for a page

13 R-NUCA, Hardavellas et al., ISCA’09
- A page is categorized as “shared instruction”, “private data”, or “shared data”; the TLB tracks this and prevents accesses of a different kind
- Depending on the page type, the indexing function into the shared cache is different:
  - “Private data” only looks up the local bank
  - “Shared instruction” looks up a region of 4 banks
  - “Shared data” looks up all the banks
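A simplified sketch of class-dependent indexing; the fixed 4-bank cluster and simple modulo interleaving here are stand-ins for R-NUCA's rotational interleaving, and the 16-bank count is an assumption.

```python
NUM_BANKS = 16   # assumed

def candidate_banks(page_class, requesting_core):
    """Return the banks the indexing function may map this page's blocks to."""
    local = requesting_core % NUM_BANKS
    if page_class == "private_data":
        return [local]                                 # local bank only
    if page_class == "shared_instruction":
        base = (local // 4) * 4
        return list(range(base, base + 4))             # a 4-bank cluster (simplified)
    if page_class == "shared_data":
        return list(range(NUM_BANKS))                  # all banks
    raise ValueError(page_class)

def home_bank(page_class, requesting_core, block_addr):
    """Pick one bank within the cluster by plain address interleaving."""
    banks = candidate_banks(page_class, requesting_core)
    return banks[block_addr % len(banks)]
```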

14 Rotational Interleaving
Allows arbitrary group (cluster) sizes and a bank numbering that distributes load evenly across banks

15 Basic Replacement Policies
- LRU: least recently used
- LFU: least frequently used (requires small saturating counters)
- Pseudo-LRU: organize the ways as a tree and track which sub-tree was last accessed
- NRU: every block has a bit; the bit is reset to 0 upon a touch; when evicting, pick a block with its bit set to 1; if no block has a 1, set every bit to 1
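A sketch of the NRU policy exactly as described above, one bit per block.

```python
class NRUSet:
    def __init__(self, num_ways):
        self.blocks = [None] * num_ways
        self.nru_bit = [1] * num_ways      # 1 = "not recently used"

    def touch(self, way):
        self.nru_bit[way] = 0              # reset the bit on a touch

    def pick_victim(self):
        if 1 not in self.nru_bit:          # every block was recently used
            self.nru_bit = [1] * len(self.blocks)
        return self.nru_bit.index(1)       # any block with its bit set to 1

    def fill(self, way, tag):
        self.blocks[way] = tag
        self.nru_bit[way] = 0
```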

16 Why the Basic Policies Fail
- Access types that pollute the cache without yielding many hits: streaming (no reuse), thrashing (distant reuse)
- Current hit rates are far short of those with an oracular replacement policy (Belady: evict the block whose next access is most distant)
- A large fraction of the cache is useless: blocks that have serviced their last hit and are on the slow walk from MRU to LRU

17 Insertion, Promotion, Victim Selection
- Instead of viewing the set as a recency stack, simply view it as a priority list; in LRU, priority = recency
- When we fetch a block, it can be inserted at any position in the list
- When a block is touched, it can be promoted up the priority list in one of many ways
- When a block must be victimized, we can select any block (not necessarily the tail of the list)

18 MIP, LIP, BIP, and DIP
Qureshi et al., ISCA’07
- MIP: MRU insertion policy (the baseline)
- LIP: LRU insertion policy; assumes that blocks are useless and should be kept around only if touched twice in succession
- BIP: Bimodal insertion policy; put most blocks at the tail and, with a small probability, insert at the head; for thrashing workloads, it can retain part of the working set and yield hits on it
- DIP: Dynamic insertion policy; pick the better of MIP and BIP, decided with set-dueling
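A sketch of the insertion policies above (head of the list = MRU, tail = LRU); the bimodal probability is an assumed value and the set-dueling counter is left out.

```python
import random

BIP_EPSILON = 1 / 32                      # assumed small MRU-insertion probability

def insert_mip(cache_set, block):
    cache_set.insert(0, block)            # MRU insertion (the baseline)

def insert_lip(cache_set, block):
    cache_set.append(block)               # insert at the LRU position

def insert_bip(cache_set, block):
    if random.random() < BIP_EPSILON:
        insert_mip(cache_set, block)      # occasionally insert at MRU
    else:
        insert_lip(cache_set, block)      # usually insert at LRU

def insert_dip(cache_set, block, use_bip):
    """Follower sets consult a set-dueling counter (not shown) to set use_bip;
    dedicated sample sets always use their fixed policy."""
    if use_bip:
        insert_bip(cache_set, block)
    else:
        insert_mip(cache_set, block)
```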

19 RRIP
Jaleel et al., ISCA’10
- Re-Reference Interval Prediction: in essence, insert blocks near the end of the list rather than at the very end
- Implement with a multi-bit version of NRU: zero a block’s counter on a touch; evict the block with the max counter value; if none is at the max, increment every counter by one
- RRIP can be implemented easily by setting the initial counter value to max-1 (does not require list management)
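A sketch of RRIP as a multi-bit NRU, with 2-bit counters assumed.

```python
RRPV_BITS = 2
RRPV_MAX = (1 << RRPV_BITS) - 1          # 3 with 2-bit counters

class RRIPSet:
    def __init__(self, num_ways):
        self.rrpv = [RRPV_MAX] * num_ways

    def touch(self, way):
        self.rrpv[way] = 0               # zero the counter on a hit

    def pick_victim(self):
        while RRPV_MAX not in self.rrpv:
            self.rrpv = [v + 1 for v in self.rrpv]   # age every block
        return self.rrpv.index(RRPV_MAX)             # evict a block at the max

    def fill(self, way):
        self.rrpv[way] = RRPV_MAX - 1    # insert near, but not at, the end
```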

20 UCP
Qureshi et al., MICRO’06
- Utility-Based Cache Partitioning: partition ways among cores based on the estimated marginal utility of each additional way to each core
- Each core maintains a shadow tag structure for the L2 cache that is populated only by requests from that core; the core can then estimate the hit rate it would see with W ways of L2
- Every epoch, statistics are collected and ways are re-assigned
- Shadow-tag storage overhead can be reduced by using set sampling and partial tags
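A sketch of way partitioning from the per-core utility curves that the shadow tags provide; a simple greedy allocation is shown here, whereas the paper uses a lookahead algorithm.

```python
def partition_ways(hits_with_ways, total_ways):
    """hits_with_ways[c][w] = hits core c would get with w ways (w = 0..total_ways),
    i.e. the utility curve estimated from the shadow tags."""
    num_cores = len(hits_with_ways)
    alloc = [1] * num_cores                       # give each core one way to start
    for _ in range(total_ways - num_cores):
        # Marginal utility of one more way for each core.
        gains = [hits_with_ways[c][alloc[c] + 1] - hits_with_ways[c][alloc[c]]
                 for c in range(num_cores)]
        winner = max(range(num_cores), key=lambda c: gains[c])
        alloc[winner] += 1
    return alloc

# Example: two cores sharing 8 ways; core 1's curve saturates quickly.
curves = [[0, 10, 30, 60, 100, 140, 170, 190, 200],
          [0, 50, 60, 65, 68, 70, 71, 72, 72]]
print(partition_ways(curves, 8))    # most ways go to core 0
```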

21 TADIP
Jaleel et al., PACT’08
- Thread-Aware DIP: each thread dynamically decides to use MIP or BIP; threads that use BIP get a smaller partition of the cache
- Better than UCP because even a thrashing workload gets to keep part of its working set in the cache
- Needs many set-dueling monitors, but no extra shadow tags

22 PIPP
Xie and Loh, ISCA’09
- Promotion/Insertion Pseudo-Partitioning: incoming blocks are inserted at a per-core position in the list, and on every touch they are promoted up the list with a given probability
- Applications with a large partition are inserted near the head of the list and promoted aggressively
- Partition sizes are decided with marginal utility estimates
- In a few sampled sets, a core gets to use N-1 ways and hits to each way are counted; the other threads only get to use the last way
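A sketch of PIPP's insertion and single-step probabilistic promotion; the promotion probability and the mapping from partition size to insertion position are assumptions in the spirit of the paper.

```python
import random

PROMOTION_PROB = 0.75      # assumed promotion probability

def pipp_insert(cache_set, block, partition_size):
    """Insert at a distance from the tail equal to the core's allocated ways
    (head = MRU): cores with bigger partitions insert closer to the head."""
    pos = max(0, len(cache_set) - partition_size)
    cache_set.insert(pos, block)

def pipp_promote(cache_set, block):
    """On a hit, move the block one position toward the head, probabilistically."""
    i = cache_set.index(block)
    if i > 0 and random.random() < PROMOTION_PROB:
        cache_set[i - 1], cache_set[i] = cache_set[i], cache_set[i - 1]
```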

23 Aggressor VT
Liu and Yeung, PACT’09
- Under an oracle policy, 80% of evictions belong to a thrashing “aggressor” thread
- Hence, if the LRU block belongs to an aggressor thread, evict it; otherwise, evict the aggressor thread’s LRU block with a probability of either 99% or 50%
- At the start of each phase change, sample behavior for each thread in one of three modes (non-aggressor, aggressor-99%, aggressor-50%) and pick the best-performing mode

24 Set Partitioning
- Sets can also be partitioned among cores by assigning page colors to each core
- Needs little hardware support, but must adapt to the dynamic arrival and exit of tasks
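A sketch of OS-side set partitioning via page colors, with an assumed color count and a naive allocator.

```python
NUM_COLORS = 64            # assumed: number of page colors the cache index spans

def colors_for_core(core_id, num_cores):
    """Give each core a disjoint, contiguous range of colors."""
    per_core = NUM_COLORS // num_cores
    start = core_id * per_core
    return set(range(start, start + per_core))

def pick_frame(free_frames, core_id, num_cores):
    """Allocate a free physical frame number whose color belongs to the core."""
    allowed = colors_for_core(core_id, num_cores)
    for frame in free_frames:
        if frame % NUM_COLORS in allowed:
            free_frames.remove(frame)
            return frame
    return None   # no frame of the right color; a real OS would fall back or steal
```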
