CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache IV Steve Ko Computer Sciences and Engineering University at Buffalo.

Slides:



Advertisements
Similar presentations
Anshul Kumar, CSE IITD CSL718 : Memory Hierarchy Cache Performance Improvement 23rd Feb, 2006.
Advertisements

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache III Steve Ko Computer Sciences and Engineering University at Buffalo.
CMSC 611: Advanced Computer Architecture Cache Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Snoopy Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Complex Pipelining II Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP II Steve Ko Computer Sciences and Engineering University at Buffalo.
February 28, 2012CS152, Spring 2012 CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Snoopy Caches II Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP III Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches II Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache II Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Synchronization and Consistency II Steve Ko Computer Sciences and Engineering University at.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Address Translation and Protection Steve Ko Computer Sciences and Engineering University at.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Pipelining III Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Virtual Memory I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590 Computer Architecture Virtual Memory II
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Putting it all together: Intel Nehalem Steve Ko Computer Sciences and Engineering University.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ILP I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Synchronization and Consistency I Steve Ko Computer Sciences and Engineering University at Buffalo.
Spring 2003CSE P5481 Introduction Why memory subsystem design is important CPU speeds increase 55% per year DRAM speeds increase 3% per year rate of increase.
February 9, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and.
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Nov. 10, 2003 Topic: Caches (contd.)
CS 152 Computer Architecture and Engineering Lecture 15 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.
1 Lecture 16: Large Cache Innovations Today: Large cache design and other cache innovations Midterm scores  91-80: 17 students  79-75: 14 students 
Memory Hierarchy Design Chapter 5 Karin Strauss. Background 1980: no caches 1995: two levels of caches 2004: even three levels of caches Why? Processor-Memory.
February 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and.
March 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.
CS 152 Computer Architecture and Engineering Lecture 21: Directory-Based Cache Protocols Scott Beamer (substituting for Krste Asanovic) Electrical Engineering.
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of.
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Oct. 30, 2002 Topic: Caches (contd.)
February 9, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and.
CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of.
1 Lecture 16: Cache Innovations / Case Studies Topics: prefetching, blocking, processor case studies (Section 5.2)
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of.
April 18, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.
February 11, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering.
© Krste Asanovic, 2014CS252, Spring 2014, Lecture 10 CS252 Graduate Computer Architecture Spring 2014 Lecture 10: Memory Krste Asanovic
ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 13 Memory – Part 2 Benjamin Lee Electrical and Computer Engineering Duke University
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 14 Virtual Memory Benjamin Lee Electrical and Computer Engineering Duke University
February 9, 2012CS152, Spring 2012 CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and.
ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 14 Virtual Memory Benjamin Lee Electrical and Computer Engineering Duke University
Spring 2003CSE P5481 Advanced Caching Techniques Approaches to improving memory system performance eliminate memory operations decrease the number of misses.
Chapter 5 Memory III CSE 820. Michigan State University Computer Science and Engineering Miss Rate Reduction (cont’d)
Nov. 15, 2000Systems Architecture II1 Machine Organization (CS 570) Lecture 8: Memory Hierarchy Design * Jeremy R. Johnson Wed. Nov. 15, 2000 *This lecture.
CS.305 Computer Architecture Memory: Caches Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made available.
Cache (Memory) Performance Optimization. Average memory access time = Hit time + Miss rate x Miss penalty To improve performance: reduce the miss rate.
Memory Hierarchy—Improving Performance Professor Alvin R. Lebeck Computer Science 220 Fall 2008.
Prefetching Techniques. 2 Reading Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000)
Computer Architecture and Parallel Computing 并行结构与计算 Lecture 3 - Memory Peng LIU ( 刘鹏 ) College of Information Science & Electronic Engineering Zhejiang.
Computer Organization & Design 计算机组成与设计 Weidong Wang ( 王维东 ) College of Information Science & Electronic Engineering 信息与通信工程研究所 Zhejiang.
Cache Memories CENG331 - Computer Organization Instructor: Murat Manguoglu(Section 1) Adapted from: and
Chapter 5 Memory Hierarchy Design. 2 Many Levels in Memory Hierarchy Pipeline registers Register file 1st-level cache (on-chip) 2nd-level cache (on same.
CENG709 Computer Architecture and Operating Systems Lecture 7 - Memory Hierarchy-II Murat Manguoglu Department of Computer Engineering Middle East Technical.
5.2 Eleven Advanced Optimizations of Cache Performance
Krste Asanovic Electrical Engineering and Computer Sciences
Bojian Zheng CSCD70 Spring 2018
Krste Asanovic Electrical Engineering and Computer Sciences
Chapter 5 Memory CSE 820.
Krste Asanovic Electrical Engineering and Computer Sciences
Electrical and Computer Engineering
Krste Asanovic Electrical Engineering and Computer Sciences
Siddhartha Chatterjee
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 7 – Memory III Krste Asanovic Electrical Engineering and Computer.
ECE 463/563, Microprocessor Architecture, Prof. Eric Rotenberg
Cache Performance Improvements
CSE 486/586 Distributed Systems Cache Coherence
Lecture 14: Cache Performance
Overview Problem Solution CPU vs Memory performance imbalance
Presentation transcript:

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Cache IV Steve Ko Computer Sciences and Engineering University at Buffalo

CSE 490/590, Spring Last time… Types of cache misses: 3 C’s –Compulsory, Capacity, and Conflict Write policies –Write through vs. write back –No write allocate vs. write allocate Multi-level cache hierarchies reduce miss penalty –Inclusive versus exclusive caching policy –Can change design tradeoffs of L1 cache if known to have L2 Prefetching –Speculate future I & D accesses and fetch them into caches –Usefulness & timeliness

CSE 490/590, Spring 2011 Write-Back Cache Accesses Write-back cache –Writes only go to cache (make dirty lines) –Upon evict, update memory 0 mem access –Write hit 1 mem access –Read miss on a clean line 2 mem accesses –Read miss on a dirty line Variable cycles per read/write, might complicate the pipeline control 3

CSE 490/590, Spring Hardware Instruction Prefetching Instruction prefetch in Alpha AXP –Fetch two blocks on a miss; the requested block (i) and the next consecutive block (i+1) –Requested block placed in cache, and next block in instruction stream buffer –If miss in cache but hit in stream buffer, move stream buffer block into cache and prefetch next block (i+2) L1 Instruction Unified L2 Cache RF CPU Stream Buffer Prefetched instruction block Req block Req block

CSE 490/590, Spring Hardware Data Prefetching Prefetch-on-miss: –Prefetch b + 1 upon miss on b One Block Lookahead (OBL) scheme –Initiate prefetch for block b + 1 when block b is accessed –Why is this different from doubling block size? –Can extend to N-block lookahead Strided prefetch –If observe sequence of accesses to block b, b+N, b+2N, then prefetch b+3N etc. Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of current access

CSE 490/590, Spring Software Prefetching for(i=0; i < N; i++) { prefetch( &a[i + 1] ); prefetch( &b[i + 1] ); SUM = SUM + a[i] * b[i]; } What property do we require of the cache for prefetching to work ?

CSE 490/590, Spring Software Prefetching Issues Timing is the biggest issue, not predictability –If you prefetch very close to when the data is required, you might be too late –Prefetch too early, cause pollution –Estimate how long it will take for the data to come into L1, so we can set P appropriately – Why is this hard to do? for(i=0; i < N; i++) { prefetch( &a[i + P ] ); prefetch( &b[i + P ] ); SUM = SUM + a[i] * b[i]; } Must consider cost of prefetch instructions

CSE 490/590, Spring Compiler Optimizations Restructuring code affects the data block access sequence –Group data accesses together to improve spatial locality –Re-order data accesses to improve temporal locality Prevent data from entering the cache –Useful for variables that will only be accessed once before being replaced –Needs mechanism for software to tell hardware not to cache data (“no-allocate” instruction hints or page table bits) Kill data that will never be used again –Streaming data exploits spatial locality but not temporal locality –Replace into dead cache locations

CSE 490/590, Spring Loop Interchange for(i=0; i < M; i++) { for(j=0; j < N; j++) { x[i][j] = 2 * x[i][j]; } } What type of locality does this improve? for(j=0; j < N; j++) { for(i=0; i < M; i++) { x[i][j] = 2 * x[i][j]; } }

CSE 490/590, Spring Loop Fusion for(i=0; i < N; i++) a[i] = b[i] * c[i]; for(i=0; i < N; i++) d[i] = a[i] * c[i]; for(i=0; i < N; i++) { a[i] = b[i] * c[i]; d[i] = a[i] * c[i]; } What type of locality does this improve?

CSE 490/590, Spring for(i=0; i < N; i++) for(j=0; j < N; j++) { r = 0; for(k=0; k < N; k++) r = r + y[i][k] * z[k][j]; x[i][j] = r; } Matrix Multiply, Naïve Code Not touchedOld accessNew access xj i yk i zj k

CSE 490/590, Spring for(jj=0; jj < N; jj=jj+B) for(kk=0; kk < N; kk=kk+B) for(i=0; i < N; i++) for(j=jj; j < min(jj+B,N); j++) { r = 0; for(k=kk; k < min(kk+B,N); k++) r = r + y[i][k] * z[k][j]; x[i][j] = x[i][j] + r; } Matrix Multiply with Cache Tiling What type of locality does this improve? yk i zj k xj i

CSE 490/590, Spring CSE 490/590 Administrivia Midterm on Friday, 3/4 Project 1 deadline: Friday, 3/11 Guest lectures possibly this month Course early-evaluation today

CSE 490/590, Spring Acknowledgements These slides heavily contain material developed and copyright by –Krste Asanovic (MIT/UCB) –David Patterson (UCB) And also by: –Arvind (MIT) –Joel Emer (Intel/MIT) –James Hoe (CMU) –John Kubiatowicz (UCB) MIT material derived from course UCB material derived from course CS252