1 Lecture 11: Large Cache Design IV Topics: prefetch, dead blocks, cache networks.

2 Temporal Memory Streaming Wenisch et al., ISCA’05
• When a thread incurs a series of misses to blocks (P, Q, R, S; not necessarily contiguous), other threads are likely to incur a similar series of misses
• Each thread maintains its miss log in a circular buffer in memory; the directory entry for P also keeps pointers to multiple log entries for P
• When a thread misses on P, it contacts the directory, which provides the log pointers; the thread receives multiple streams and starts prefetching
• Log accesses and prefetches are off the critical path
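The logging and lookup steps on this slide can be sketched in a few lines of Python. This is a minimal software model, not the hardware design from the paper: the names `MissLog`, `Directory`, and `on_miss`, the log capacity, the stream length, and the number of pointers kept per block are all assumptions made for illustration.

```python
from collections import defaultdict, deque

LOG_SIZE = 8     # hypothetical capacity of each thread's circular miss log
STREAM_LEN = 3   # hypothetical number of blocks returned per stream


class MissLog:
    """Per-thread miss log kept in a circular buffer."""

    def __init__(self, size=LOG_SIZE):
        self.size = size
        self.entries = [None] * size
        self.head = 0  # total number of misses appended so far

    def append(self, block):
        pos = self.head
        self.entries[pos % self.size] = block
        self.head += 1
        return pos

    def stream_after(self, pos, length=STREAM_LEN):
        """Blocks logged right after position pos, unless pos was overwritten."""
        if self.head - pos > self.size:
            return []
        end = min(pos + 1 + length, self.head)
        return [self.entries[i % self.size] for i in range(pos + 1, end)]


class Directory:
    """Maps a block to the most recent (thread, log position) pointers."""

    def __init__(self, pointers_per_block=2):
        self.pointers = defaultdict(lambda: deque(maxlen=pointers_per_block))

    def record(self, block, thread_id, pos):
        self.pointers[block].append((thread_id, pos))

    def lookup(self, block):
        return list(self.pointers.get(block, ()))


def on_miss(block, thread_id, logs, directory):
    """Collect candidate prefetch streams for a miss, then log the miss."""
    streams = [logs[t].stream_after(p) for t, p in directory.lookup(block)]
    pos = logs[thread_id].append(block)
    directory.record(block, thread_id, pos)
    return [s for s in streams if s]


logs = {0: MissLog(), 1: MissLog()}
directory = Directory()
for b in "PQRS":                 # thread 0 misses on P, Q, R, S
    on_miss(b, 0, logs, directory)
streams = on_miss("P", 1, logs, directory)  # thread 1 later misses on P
```

In this model thread 1's miss on P returns the stream that followed P in thread 0's log, which it could then prefetch.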

3 Spatial Memory Streaming Somogyi et al., ISCA’06
• Threads often enter a new region (page) and touch a few arbitrary blocks in that region
• A predictor is indexed with the PC and the block offset of the first access to that region; the predictor returns a bit vector indicating the blocks accessed within that region
• Can even prefetch for regions that have not been touched before!
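A minimal sketch of the predictor described above, assuming 64-byte blocks and 4 KB regions. The class name `SpatialPredictor`, the `end_generation` hook (standing in for whatever event ends a region's tracking in hardware), and the unbounded dictionaries are illustrative assumptions, not details from the paper.

```python
BLOCK_SIZE = 64          # assumed block size in bytes
BLOCKS_PER_REGION = 64   # assumed 4 KB region


class SpatialPredictor:
    def __init__(self):
        self.patterns = {}  # (pc, offset) -> bit vector of blocks touched
        self.active = {}    # region -> [trigger key, bits accumulated so far]

    def access(self, pc, addr):
        """Record an access; on the first access to a region, return prefetch addresses."""
        region, offset = divmod(addr // BLOCK_SIZE, BLOCKS_PER_REGION)
        if region in self.active:
            self.active[region][1] |= 1 << offset
            return []
        key = (pc, offset)
        self.active[region] = [key, 1 << offset]
        pattern = self.patterns.get(key, 0)
        base = region * BLOCKS_PER_REGION
        return [(base + i) * BLOCK_SIZE
                for i in range(BLOCKS_PER_REGION)
                if ((pattern >> i) & 1) and i != offset]

    def end_generation(self, region):
        """Stop tracking a region and store its pattern under the trigger key."""
        key, bits = self.active.pop(region)
        self.patterns[key] = bits


pred = SpatialPredictor()
# Train on region 0x10000: the first access (PC 0x400) hits block offset 2.
pred.access(0x400, 0x10000 + 2 * BLOCK_SIZE)
pred.access(0x404, 0x10000 + 5 * BLOCK_SIZE)
pred.access(0x408, 0x10000 + 9 * BLOCK_SIZE)
pred.end_generation(0x10000 // (BLOCK_SIZE * BLOCKS_PER_REGION))
# The same PC and offset in a never-touched region triggers prefetches.
prefetches = pred.access(0x400, 0x80000 + 2 * BLOCK_SIZE)
```

Because the key is the (PC, offset) pair and not the region address, the learned pattern carries over to a region the program has never visited.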

4 Feedback Directed Prefetching Srinath et al., HPCA’07
• A stream prefetcher has two parameters:
  P: prefetch distance: how far ahead of the start of the stream we prefetch
  N: prefetch degree: how far we advance the start when there is a hit in the stream
• These two parameters can be varied based on prefetch effectiveness
• Accuracy: a bit tracks whether a prefetched block was touched
• Timeliness: was the block touched while still in the MSHR?
• Pollution: track recent evictions (Bloom filter) and see if they are re-touched; also guides insertion policy
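One way to picture the feedback loop is a function that runs at the end of each sampling interval and moves the prefetcher between aggressiveness levels. The sketch below is a deliberately simplified policy, not the decision table from the paper: the `LEVELS` values, the thresholds, and the name `next_level` are illustrative assumptions.

```python
# Illustrative (distance, degree) pairs, from least to most aggressive.
LEVELS = [(4, 1), (8, 1), (16, 2), (32, 4), (64, 4)]


def next_level(level, sent, used, late, polluting_misses, demand_misses,
               acc_hi=0.75, acc_lo=0.40, late_thresh=0.01, poll_thresh=0.005):
    """Pick the aggressiveness level for the next interval (thresholds are assumed)."""
    accuracy = used / sent if sent else 0.0           # prefetched blocks that were touched
    lateness = late / used if used else 0.0           # touched while still in the MSHR
    pollution = polluting_misses / demand_misses if demand_misses else 0.0

    if accuracy < acc_lo or pollution > poll_thresh:
        step = -1   # inaccurate or polluting: throttle down
    elif accuracy >= acc_hi and lateness > late_thresh:
        step = +1   # accurate but late: run further ahead
    else:
        step = 0
    return max(0, min(len(LEVELS) - 1, level + step))


level = next_level(2, sent=100, used=90, late=20,
                   polluting_misses=0, demand_misses=1000)
distance, degree = LEVELS[level]
```

Here an accurate but late prefetcher at the middle level moves up one level, so `distance` and `degree` become the next entry in the table.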

5 Dead Block Prediction
• Can keep track of the number of accesses to a line during its previous residence; the block is deemed dead after that many accesses (Kharbutli, Solihin, IEEE TOC’08)
• To reduce noise, an access can be defined as a block’s move to the MRU position (Liu et al., MICRO 2008)
• Earlier DBPs used a trace of PCs to capture when a block has completed its use
• DBP is used for energy savings, replacement policies, and cache bypassing
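The counter-based idea in the first bullet can be sketched as follows. This is a simplified model that keys the history by block address and keeps it in an unbounded dictionary; the class name `CounterDBP` and its method names are assumptions for illustration, and a hardware predictor would use a small, hashed table.

```python
class CounterDBP:
    """Predicts a block dead once it reaches its previous residence's access count."""

    def __init__(self):
        self.learned = {}  # block -> accesses seen during its previous residence
        self.live = {}     # block -> accesses seen during its current residence

    def fill(self, block):
        self.live[block] = 0

    def access(self, block):
        self.live[block] += 1

    def is_dead(self, block):
        threshold = self.learned.get(block)
        return threshold is not None and self.live[block] >= threshold

    def evict(self, block):
        self.learned[block] = self.live.pop(block)


dbp = CounterDBP()
dbp.fill("A")
for _ in range(3):          # first residence: three accesses, no history yet
    dbp.access("A")
dead_without_history = dbp.is_dead("A")
dbp.evict("A")

dbp.fill("A")               # second residence
dbp.access("A")
dbp.access("A")
dead_after_two = dbp.is_dead("A")
dbp.access("A")
dead_after_three = dbp.is_dead("A")
```

A block flagged dead this way becomes a candidate for early replacement, bypassing, or power-down, matching the uses listed in the last bullet.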

6 Distill Cache Qureshi, HPCA 2007
• Half the ways are traditional (line-organized cache, LOC); when a block is evicted from the LOC, only the touched words are stored in a word-organized cache (WOC) that has many narrow ways
• Incurs a fair bit of complexity (more tags for the WOC, collection of word touches in L1s, blocks with holes, etc.)
• Does not need a predictor; actions are based on the block’s behavior during its current residence
• Useless word identification is orthogonal to cache compression
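The eviction path from the LOC to the WOC, and the "blocks with holes" it creates, can be illustrated with a small sketch. The function names `distill` and `woc_lookup`, the eight-word line, the dictionary standing in for the WOC, and the `"hole"` status label are assumptions made for this example.

```python
WORDS_PER_LINE = 8  # assumed line size in words


def distill(line_words, touched_mask):
    """Keep only the words touched during LOC residence, as {word_index: value}."""
    return {i: w for i, w in enumerate(line_words) if (touched_mask >> i) & 1}


def woc_lookup(woc, tag, word_index):
    """Return (status, value); a present line with an absent word is a 'hole'."""
    words = woc.get(tag)
    if words is None:
        return "miss", None
    if word_index in words:
        return "hit", words[word_index]
    return "hole", None


woc = {}
line = list(range(100, 100 + WORDS_PER_LINE))
woc[0x1A] = distill(line, 0b00100101)   # words 0, 2, and 5 were touched
```

A lookup for an untouched word of a distilled line must be treated as a miss even though the tag matches, which is part of the extra complexity the slide mentions.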

7 Traditional Networks Huh et al., ICS’05; Beckmann and Wood, MICRO’04
• Example designs for contiguous L2 cache regions

8 Explorations for Optimality Muralimanohar et al., ISCA’07

9 3D Designs, Li et al., ISCA’06
• D-NUCA: first search in cylinder, then multicast search everywhere
• Data is migrated close to the requester, but need not jump across layers

10 Halo Network Jin et al., HPCA’07
• D-NUCA: sets are distributed across columns; ways are distributed across rows

11 Halo Network

12 Nahalal Guz et al., CAL’07

13 Nahalal
• A block is initially placed in the core’s private bank and then swapped into the shared bank if frequently accessed by other cores
• Parallel search across all banks
