Presentation on theme: "Cache Performance" — Presentation transcript:

1 Cache Performance
Cache Performance. Fall 2000. Siddhartha Chatterjee.

2 Sources of Cache Misses
                   Direct Mapped    N-way Set Associative    Fully Associative
Cache Size         Big              Medium                   Small
Compulsory Miss    Same             Same                     Same
Conflict Miss      High             Medium                   Zero
Capacity Miss      Low(er)          Medium                   High
Invalidation Miss  Same             Same                     Same

If you are going to run "billions" of instructions, compulsory misses are insignificant.
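
As an illustration of the conflict column (not from the slides; the cache size, block size, and data placement are assumptions), consider a 4 KB direct-mapped cache with two 4 KB arrays that happen to alias:

/* Hypothetical sketch: 4 KB direct-mapped cache, 4 KB arrays.
   If the linker places a[] and b[] a multiple of the cache size apart,
   a[i] and b[i] map to the same cache line and evict each other, so every
   access below is a conflict miss. A 2-way set-associative cache of the
   same size can keep one block of each array per set, leaving only the
   compulsory misses. Placement is up to the compiler/linker. */
#define N 1024                  /* 1024 * 4 bytes = 4 KB per array */
static float a[N], b[N];

float dot(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += a[i] * b[i];       /* alternating accesses to (possibly) conflicting blocks */
    return s;
}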

3 3Cs Absolute Miss Rate
[Figure: 3Cs absolute miss rate vs. cache size; labeled component: Conflict.]

4 How to Improve Cache Performance
Latency
  Reduce miss rate (Section 5.3 of HP2)
  Reduce miss penalty (Section 5.4 of HP2)
  Reduce hit time (Section 5.5 of HP2)
Bandwidth
  Increase hit bandwidth
  Increase miss bandwidth
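
The three latency targets above are exactly the terms of the average memory access time relation used throughout HP2; the numbers in the example below are illustrative, not from the slides.

  AMAT = Hit time + Miss rate × Miss penalty
  e.g., 1 cycle + 0.05 × 40 cycles = 3 cycles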

5 1. Reduce Misses via Larger Block Size
[Figure: miss rate as a function of block size for several cache sizes.]

6 2. Reduce Misses via Higher Associativity
2:1 Cache Rule: miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2.
Not merely empirical: theoretical justification in Sleator and Tarjan, "Amortized efficiency of list update and paging rules", CACM 28(2), 1985.
Beware: execution time is the only final measure! Will the clock cycle time increase?
Hill [1988] suggested hit time grows by about +10% for external caches and +2% for internal caches for 2-way vs. 1-way.

7 3. Reduce Conflict Misses via Victim Cache
How do we combine the fast hit time of a direct-mapped cache and still avoid conflict misses?
Add a small, highly associative buffer (the victim cache) to hold data discarded from the cache.
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache.
[Diagram: CPU, direct-mapped cache (tag/data), victim cache (tag/data), memory.]

8 5. Reduce Misses by Hardware Prefetching of Instruction & Data
Instruction prefetching: the Alpha fetches 2 blocks on a miss; the extra block is placed in a stream buffer; on a subsequent miss, the stream buffer is checked first.
Works with data blocks too:
  Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4 KB cache; 4 streams caught 43%.
  Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64 KB, 4-way set-associative caches.
Prefetching relies on extra memory bandwidth that can be used without penalty.

9 6. Reducing Misses by Software Prefetching Data
Data prefetch:
  Load data into a register (HP PA-RISC loads): binding.
  Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9): non-binding.
  Special prefetching instructions cannot cause faults; a form of speculative execution.
Issuing prefetch instructions takes time:
  Is the cost of issuing prefetches < the savings in reduced misses?
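
A minimal sketch of non-binding software prefetching in C. The slides' examples are the MIPS IV / PowerPC / SPARC v9 prefetch instructions; the __builtin_prefetch intrinsic used here is the GCC/Clang mechanism for emitting such a prefetch (an illustration choice, not something the slides mention), and the prefetch distance of 16 elements is arbitrary.

#include <stddef.h>

/* Scale an array, prefetching a couple of cache blocks ahead of the use.
   The prefetch is non-binding: it only warms the cache and cannot fault. */
void scale(double *x, size_t n, double c)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&x[i + 16], 1, 1);  /* rw = 1 (write), moderate locality */
        x[i] *= c;
    }
}

The loop only pays off if the prefetch issue slots and the extra memory bandwidth cost less than the misses avoided, which is exactly the cost/benefit question on the slide.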

10 7. Reduce Misses by Compiler Optimizations
Instructions
  Reorder procedures in memory so as to reduce misses.
  Use profiling to look at conflicts.
  McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks.
Data
  Merging arrays: improve spatial locality with a single array of compound elements vs. 2 arrays.
  Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
  Loop fusion: combine two independent loops that have the same looping and some variables in common.
  Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows.

11 Merging Arrays Example
/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key.
The addressing expressions are different.
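
A hedged sketch of the kind of traversal that benefits, assuming both fields are used together on every iteration (the function and its role are illustrative, not from the slides).

/* With the merged (array-of-structs) layout, val and key for the same index
   share a cache block, so they cannot conflict-miss against each other the
   way two aliasing arrays can, and each block fill brings in both fields the
   loop is about to use. */
#include <stddef.h>

long weighted_sum(const struct merge *m, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += (long)m[i].val * m[i].key;
    return s;
}

The opposite transformation can also win: if a loop touches only one of the fields, keeping separate arrays avoids dragging the unused field through the cache.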

12 Loop Interchange Example
/* Before */
for (k = 0; k < 100; k++)
  for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
  for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words.

13 Loop Fusion Example
/* Before */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

Two misses per access to a & c vs. one miss per access.

14 Blocking Example
/* Before */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    r = 0;
    for (k = 0; k < N; k++)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }

Two inner loops:
  Read all NxN elements of z[].
  Read N elements of 1 row of y[] repeatedly.
  Write N elements of 1 row of x[].
Capacity misses are a function of N and cache size:
  If the cache holds all 3 NxN matrices, there are no capacity misses; otherwise ...
Idea: compute on a BxB submatrix that fits in the cache.

15 Blocking Example (continued)
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i++)
      for (j = jj; j < min(jj+B-1,N); j++) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k++)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }

Capacity misses go from 2N^3 + N^2 to 2N^3/B + N^2.
B is called the blocking factor.
What happens to conflict misses?

16 Summary of Compiler Optimizations to Reduce Cache Misses
[Figure: summary chart of the improvements from the compiler optimizations above (merging arrays, loop interchange, loop fusion, blocking).]

17 Impact of Caches
Speed = ƒ(no. operations)
1997: Pipelined Execution & Fast Clock Rate, Out-of-Order completion, Superscalar Instruction Issue
1999: Speed = ƒ(non-cached memory accesses)
What does this mean for Compilers, Operating Systems, Algorithms, Data Structures?

