COMP 206: Computer Architecture and Implementation
Montek Singh
Wed., Nov. 10, 2003
Topic: Caches (contd.)

Presentation transcript:


2 Outline
- Improving Cache Performance
  - Reducing misses (contd. from previous lecture)
  - Reducing miss penalty
- Reading: HP3 Sections

3 5. Reduce Misses by Hardware Prefetching
- Instruction prefetching
  - Alpha fetches 2 blocks on a miss
  - Extra block placed in stream buffer
  - On miss, check stream buffer
- Works with data blocks too
  - Jouppi [1990]: 1 data stream buffer got 25% of the misses from a 4KB cache; 4 stream buffers got 43%
  - Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of the misses from 2 64KB, 4-way set associative caches
- Prefetching relies on extra memory bandwidth that can be used without penalty
  - e.g., up to 8 prefetch stream buffers in the UltraSPARC III
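The "on miss, check stream buffer" step can be sketched as a tiny software model. This is only an illustration under stated assumptions: the buffer holds sequential blocks, so the model tracks just the block address at its head; prefetches are assumed to arrive in time; and the names `stream_buffer` and `sb_on_miss` are hypothetical, not taken from any real design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of one sequential stream buffer (assumption: contents are
 * consecutive blocks, so only the head address needs to be tracked). */
struct stream_buffer {
    bool     valid;  /* buffer has been started on some stream */
    uint32_t head;   /* block address currently at the head of the FIFO */
};

/* Called when the cache misses on block address blk.
 * Returns true if the stream buffer supplies the block, false if the
 * request has to go to memory. */
bool sb_on_miss(struct stream_buffer *sb, uint32_t blk)
{
    assert(sb != NULL);
    if (sb->valid && sb->head == blk) {
        /* Hit at the head: the block moves into the cache and the freed
         * slot is refilled by prefetching further along the stream. */
        sb->head = blk + 1;
        return true;
    }
    /* Miss in the buffer as well: flush it and restart the stream at the
     * block following the one that missed. */
    sb->valid = true;
    sb->head = blk + 1;
    return false;
}
```

With a sequential access pattern, only the first miss of a stream goes to memory in this model; a jump to an unrelated address restarts the buffer.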

4 6. Reducing Misses by Software Prefetching
- Data prefetch
  - Compiler inserts special "prefetch" instructions into the program
    - Load data into register (HP PA-RISC loads)
    - Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v9)
  - A form of speculative execution
    - don't really know if the data is needed, or if it is already in the cache
  - Most effective prefetches are "semantically invisible" to the program
    - do not change registers or memory
    - cannot cause a fault/exception
    - if they would fault, they are simply turned into NOPs
- Issuing prefetch instructions takes time
  - Is cost of prefetch issues < savings in reduced misses?
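A cache prefetch of this kind can be written by hand in C. The sketch below assumes GCC or Clang, whose `__builtin_prefetch` builtin emits a non-faulting prefetch hint where the target supports one. The function name `sum_with_prefetch` and the distance `PREFETCH_DISTANCE` are hypothetical; a real distance would have to be tuned to the miss latency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning parameter: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting the cache about elements needed later.
 * The prefetch is "semantically invisible": it changes no registers or
 * memory visible to the program and does not fault. */
long sum_with_prefetch(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* args: address, 0 = prefetch for read, 3 = high temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

The bounds test and the prefetch itself execute on every iteration, which is exactly the issue cost that has to be weighed against the misses saved.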

5 7. Reduce Misses by Compiler Optimizations
- Instructions
  - Reorder procedures in memory so as to reduce misses
  - Profiling to look at conflicts
  - McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks
- Data
  - Merging Arrays
    - Improve spatial locality by using a single array of compound elements vs. 2 arrays
  - Loop Interchange
    - Change nesting of loops to access data in the order it is stored in memory
  - Loop Fusion
    - Combine two independent loops that have the same looping and some overlapping variables
  - Blocking
    - Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

6 Merging Arrays Example
- Reduces conflicts between val and key
- Addressing expressions are different

    /* Before */
    int val[SIZE];
    int key[SIZE];

    /* After */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];

7 Loop Interchange Example
- Sequential accesses instead of striding through memory every 100 words

    /* Before */
    for (k = 0; k < 100; k++)
        for (j = 0; j < 100; j++)
            for (i = 0; i < 5000; i++)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k++)
        for (i = 0; i < 5000; i++)
            for (j = 0; j < 100; j++)
                x[i][j] = 2 * x[i][j];

8 Loop Fusion Example
- Before: 2 misses per access to a and c
- After: 1 miss per access to a and c

    /* Before */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }

9 Blocking Example
- Two inner loops:
  - Read all NxN elements of z[]
  - Read N elements of 1 row of y[] repeatedly
  - Write N elements of 1 row of x[]
- Capacity misses are a function of N and cache size
  - If the cache can hold all 3 NxN matrices, there are no capacity misses; otherwise...
- Idea: compute on a BxB submatrix that fits in the cache

    /* Before */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            r = 0;
            for (k = 0; k < N; k++)
                r = r + y[i][k]*z[k][j];
            x[i][j] = r;
        }

10 Blocking Example (contd.)
- Age of accesses
  - White means not touched yet
  - Light gray means touched a while ago
  - Dark gray means newer accesses

11 Blocking Example (contd.)
- Work with BxB submatrices
  - smaller working set can fit within the cache
  - fewer capacity misses

    /* After */
    /* x[][] must be zero on entry: each kk pass adds a partial sum. */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i++)
                /* upper bound is jj+B (not jj+B-1) so the last column of each block is included */
                for (j = jj; j < min(jj+B,N); j++) {
                    r = 0;
                    for (k = kk; k < min(kk+B,N); k++)
                        r = r + y[i][k]*z[k][j];
                    x[i][j] = x[i][j] + r;
                }
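A self-contained version of the blocked multiply can be checked against the plain triple loop. This is a sketch under assumptions: the dimension `N` is fixed at a small hypothetical value, the blocking factor is passed at run time as `b`, and the helper names (`min_int`, `matmul_naive`, `matmul_blocked`) are made up for illustration. The loop bounds use `min_int(jj + b, N)` so that every column of a block is visited, and `x` is cleared first because each `kk` pass only adds a partial sum.

```c
#include <assert.h>

#define N 8  /* hypothetical matrix dimension, kept small for illustration */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Reference version: x = y * z with the plain triple loop. */
void matmul_naive(int x[N][N], int y[N][N], int z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked version: x = y * z, working on one b-by-b block of z at a time.
 * b is the blocking factor; it does not have to divide N. */
void matmul_blocked(int x[N][N], int y[N][N], int z[N][N], int b)
{
    assert(b > 0);

    /* x accumulates partial sums across kk blocks, so it starts at zero. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 0;

    for (int jj = 0; jj < N; jj += b)
        for (int kk = 0; kk < N; kk += b)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + b, N); j++) {
                    int r = 0;
                    for (int k = kk; k < min_int(kk + b, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

Both functions should produce identical results for any positive `b`; only the order in which `y` and `z` are touched differs.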

12 Blocking Example (contd.)
- Capacity misses go from $2N^3 + N^2$ to $2N^3/B + N^2$
- B = "blocking factor"
- What happens to conflict misses?
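To get a feel for the size of the reduction, the two expressions can be evaluated for sample values. The values N = 512 and B = 32 below are assumptions chosen for illustration, not figures from the lecture.

```latex
\begin{align*}
\text{unblocked:}\quad & 2N^3 + N^2 = 2(512)^3 + (512)^2 = 268{,}435{,}456 + 262{,}144 = 268{,}697{,}600 \\
\text{blocked:}\quad   & \frac{2N^3}{B} + N^2 = \frac{268{,}435{,}456}{32} + 262{,}144 = 8{,}388{,}608 + 262{,}144 = 8{,}650{,}752
\end{align*}
```

For these values the count drops by roughly a factor of 31, close to B, because the $2N^3$ term dominates.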

13 Reducing Conflict Misses by Blocking
- Conflict misses in caches that are not fully associative vs. block size
  - Lam et al [1991] found that a blocking factor of 24 had a fifth the misses vs. 48, despite the fact that both fit in the cache

14 Summary: Compiler Optimizations to Reduce Cache Misses

15 Reducing Miss Penalty
1. Read Priority over Write on Miss
- Write through
  - Using write buffers: RAW conflicts with reads on cache misses
  - Simply waiting for the write buffer to empty might increase the read miss penalty by 50% (old MIPS 1000)
  - Check write buffer contents before the read; if there are no conflicts, let the memory access continue
- Write back?
  - Read miss replacing a dirty block
  - Normal: write the dirty block to memory, and then do the read
  - Instead: copy the dirty block to a write buffer, then do the read, and then do the write
  - CPU stalls less since it restarts as soon as the read completes
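The "check write buffer contents before the read" step amounts to an address comparison against every pending write. The sketch below is a software model only; the buffer depth `WB_ENTRIES`, the struct layout, and the function name `wb_conflicts_with_read` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WB_ENTRIES 4  /* hypothetical write-buffer depth */

struct wb_entry {
    bool     valid;       /* entry holds a write not yet sent to memory */
    uint32_t block_addr;  /* block address of the pending write */
};

struct write_buffer {
    struct wb_entry entry[WB_ENTRIES];
};

/* Returns true if a read miss to block_addr has a RAW conflict with a
 * pending write, so the read must not be served from memory ahead of it.
 * Returns false if the read can safely go ahead of the queued writes. */
bool wb_conflicts_with_read(const struct write_buffer *wb, uint32_t block_addr)
{
    assert(wb != NULL);
    for (size_t i = 0; i < WB_ENTRIES; i++) {
        if (wb->entry[i].valid && wb->entry[i].block_addr == block_addr)
            return true;
    }
    return false;
}
```

When the function returns false, the read miss is given priority and the buffered writes drain afterwards; what to do on a conflict (drain first, or forward the buffered data) is a separate design decision not modeled here.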

16 2. Fetching Subblocks to Reduce Miss Penalty
- Don't have to load the full block on a miss
- Have valid bits per subblock to indicate which subblocks hold valid data
[Figure label: Valid Bits]

17 3. Early Restart and Critical Word First
- Don't wait for the full block to be loaded before restarting the CPU
  - Early Restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives
    - let the CPU continue while filling the rest of the words in the block
    - also called "wrapped fetch" and "requested word first"
- Generally useful only with large blocks
- Spatial locality is a problem
  - programs tend to want the next sequential word, so it is not clear that early restart helps
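The order in which words arrive under a wrapped fetch can be written down directly: start at the missed word and wrap around the block. The function name `wrapped_fetch_order` and its parameters are hypothetical, used only to illustrate the idea.

```c
#include <assert.h>

/* Fill order[0..words_per_block-1] with the word indices in the order a
 * critical-word-first (wrapped) fetch would request them. */
void wrapped_fetch_order(unsigned words_per_block, unsigned critical_word,
                         unsigned order[])
{
    assert(words_per_block > 0 && critical_word < words_per_block);
    for (unsigned i = 0; i < words_per_block; i++)
        order[i] = (critical_word + i) % words_per_block;
}
```

For an 8-word block with a miss on word 5, the request order is 5, 6, 7, 0, 1, 2, 3, 4, so the CPU can restart after the first transfer instead of the sixth.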

18 4. Non-blocking Caches
- A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  - "Hit under miss"
    - reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU
  - "Hit under multiple miss" or "miss under miss"
    - may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses

19 Value of Hit Under Miss for SPEC
- FP programs on average: AMAT= > > > 0.26
- Int programs on average: AMAT= > > > 0.19
- 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss
[Figure: "Hit under i Misses", Integer and Floating Point]

20 5. Miss Penalty Reduction: L2 Cache
L2 Equations:
  $\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Miss Penalty}_{L1}$
  $\text{Miss Penalty}_{L1} = \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2}$
  $\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times (\text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2})$
Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache ($\text{Miss Rate}_{L2}$)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU ($\text{Miss Rate}_{L1} \times \text{Miss Rate}_{L2}$)
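The L2 equations translate directly into a few lines of C. This is a sketch for experimenting with numbers; the struct `cache_level` and the function names are hypothetical, and times are in whatever unit the caller uses consistently (for example, clock cycles).

```c
#include <assert.h>

struct cache_level {
    double hit_time;         /* time to hit in this level */
    double local_miss_rate;  /* misses / accesses that reach this level */
};

/* AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * MissPenalty_L2) */
double amat_two_level(struct cache_level l1, struct cache_level l2,
                      double l2_miss_penalty)
{
    assert(l1.local_miss_rate >= 0.0 && l1.local_miss_rate <= 1.0);
    assert(l2.local_miss_rate >= 0.0 && l2.local_miss_rate <= 1.0);
    double l1_miss_penalty = l2.hit_time + l2.local_miss_rate * l2_miss_penalty;
    return l1.hit_time + l1.local_miss_rate * l1_miss_penalty;
}

/* Global L2 miss rate = MissRate_L1 * MissRate_L2 (local). */
double l2_global_miss_rate(struct cache_level l1, struct cache_level l2)
{
    return l1.local_miss_rate * l2.local_miss_rate;
}
```

As an illustration with made-up numbers: an L1 with hit time 1 and miss rate 0.04, backed by an L2 with hit time 10, local miss rate 0.5, and miss penalty 100, gives an L1 miss penalty of 60 and an AMAT of 3.4, with a global L2 miss rate of 0.02.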

21 Reducing Misses: Which Apply to L2 Cache?
- Reducing Miss Rate
  1. Reduce Misses via Larger Block Size
  2. Reduce Conflict Misses via Higher Associativity
  3. Reducing Conflict Misses via Victim Cache
  4. Reducing Conflict Misses via Pseudo-Associativity
  5. Reducing Misses by HW Prefetching Instr, Data
  6. Reducing Misses by SW Prefetching Data
  7. Reducing Capacity/Conf. Misses by Compiler Optimizations

22 L2 cache block size & A.M.A.T.
- 32KB L1, 8 byte path to memory

23 Reducing Miss Penalty Summary
- Five techniques
  - Read priority over write on miss
  - Subblock placement
  - Early Restart and Critical Word First on miss
  - Non-blocking Caches (Hit Under Miss)
  - Second Level Cache
    - reducing L2 miss rate effectively reduces L1 miss penalty!
- Can be applied recursively to Multilevel Caches
  - Danger is that time to DRAM will grow with multiple levels in between

24 Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache. (next lecture)