1 COMP 740: Computer Architecture and Implementation
Montek Singh
Sep 14, 2016
Topic: Optimization of Cache Performance

2 Outline
Cache performance
Means of improving performance
Read textbook Appendix B.3 and Ch. 2.2

3 How to Improve Cache Performance
Latency:
  Reduce miss rate
  Reduce miss penalty
  Reduce hit time
Bandwidth:
  Increase hit bandwidth
  Increase miss bandwidth

4 1. Reduce Misses via Larger Block Size
Figure B.10 Miss rate versus block size for five different-sized caches. Note that miss rate actually goes up if the block size is too large relative to the cache size. Each line represents a cache of different size. Figure B.11 shows the data used to plot these lines. Unfortunately, SPEC2000 traces would take too long if block size were included, so these data are based on SPEC92 on a DECstation 5000 [Gee et al. 1993].

5 2. Reduce Misses by Increasing Cache Size
Increasing cache size reduces cache misses: both capacity misses and conflict misses are reduced.

6 3. Reduce Misses via Higher Associativity
2:1 Cache Rule: Miss Rate of a direct-mapped cache of size N ≈ Miss Rate of a 2-way set-associative cache of size N/2
Not merely empirical: theoretical justification in Sleator and Tarjan, "Amortized efficiency of list update and paging rules", CACM 28(2), 1985
Beware: execution time is the only final measure!
  Will clock cycle time increase?
  Hill [1988] suggested hit time is ~10% higher for 2-way vs. 1-way

7 Example: Ave Mem Access Time vs. Miss Rate
Example: assume the clock cycle time is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the clock cycle time of a direct-mapped cache. (In the accompanying table, red entries mean AMAT is not improved by more associativity.) A worked example with assumed numbers is shown below.
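As a hedged illustration with assumed numbers (a 25-cycle miss penalty and miss rates of 5% for direct-mapped vs. 4% for 2-way; none of these come from the slide's missing table):

  AMAT_direct-mapped = 1.00 + 0.05 × 25 = 2.25 cycles
  AMAT_2-way         = 1.10 + 0.04 × 25 = 2.10 cycles

Here 2-way still wins despite its longer cycle time; with a smaller miss-rate difference or a larger cycle-time penalty, the direct-mapped cache could come out ahead, which is what the red entries in the original table indicate.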

8 4. Miss Penalty Reduction: L2 Cache
L2 Equations:
  AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
Definitions:
  Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
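A minimal sketch of these equations in C; the hit times, miss rates, and miss penalty in main() are illustrative assumptions, not values from the slides:

  #include <stdio.h>

  /* Average memory access time for a two-level cache hierarchy:
     AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * MissPenalty_L2) */
  static double amat_two_level(double hit_l1, double miss_rate_l1,
                               double hit_l2, double miss_rate_l2,
                               double miss_penalty_l2)
  {
      double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
      return hit_l1 + miss_rate_l1 * miss_penalty_l1;
  }

  int main(void)
  {
      /* Assumed example values: 1-cycle L1 hit, 4% local L1 miss rate,
         10-cycle L2 hit, 50% local L2 miss rate, 100-cycle memory access. */
      printf("AMAT = %.2f cycles\n", amat_two_level(1.0, 0.04, 10.0, 0.5, 100.0));
      return 0;
  }

With these assumed numbers, AMAT = 1 + 0.04 × (10 + 0.5 × 100) = 3.4 cycles.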

9 5. Reducing Miss Penalty: Read Priority over Write on Miss
Goal: allow reads to be served before outstanding writes have completed
Challenges:
  Write-through caches (using write buffers):
    RAW conflicts between buffered writes and reads on cache misses
    Simply waiting for the write buffer to empty might increase the read miss penalty by 50% (old MIPS 1000)
    Better: check the write buffer contents before the read; if there are no conflicts, let the memory access continue
  Write-back caches: read miss replacing a dirty block
    Normal: write the dirty block to memory, and then do the read
    Instead: copy the dirty block to a write buffer, then do the read, and then do the write
    The CPU stalls less since it restarts as soon as the read completes

10 Summary of Basic Optimizations
Six basic cache optimizations:
  Larger block size
    Reduces compulsory misses
    Increases capacity and conflict misses; increases miss penalty
  Larger total cache capacity to reduce miss rate
    Increases hit time; increases power consumption
  Higher associativity
    Reduces conflict misses
  Higher number of cache levels
    Reduces overall memory access time
  Giving priority to read misses over writes
    Reduces miss penalty
  Avoiding address translation in cache indexing (later)
    Reduces hit time

11 More advanced optimizations

12 1. Fast Hit Times via Small, Simple Caches
Simple caches can be faster
  cache hit time is increasingly a bottleneck to CPU performance
  set associativity requires complex tag matching, hence is slower
  direct-mapped caches are simpler, hence faster, allowing shorter CPU cycle times
    the tag check can be overlapped with transmission of the data
Smaller caches can be faster
  can fit on the same chip as the CPU
  avoid the penalty of going off-chip
  for L2 caches, a compromise: keep tags on chip, and data off chip
    fast tag check, yet greater cache capacity
  L1 data cache reduced from 16KB in the Pentium III to 8KB in the Pentium 4

13 Simple and small is fast
Access time vs. size and associativity

14 Simple and small is energy-efficient
Energy per read vs. size and associativity

15 2. Way Prediction
Way prediction to improve hit time
  Goal: reduce conflict misses, yet maintain the hit speed of a direct-mapped cache
Approach: keep extra bits to predict the "way" within the set
  the output multiplexor is pre-set to select the predicted block
  if the block is the correct one, fast hit time of 1 clock cycle
  if the block isn't correct, check the other blocks in a 2nd clock cycle
    mis-prediction gives a longer hit time
Prediction accuracy
  > 90% for two-way
  > 80% for four-way
  I-cache has better accuracy than D-cache
First used on the MIPS R10000 in the mid-90s; also used on the ARM Cortex-A8

16 2a. Way Selection
Extension of way prediction
Idea: instead of pre-setting the output multiplexor to select the correct block out of many, only the ONE predicted block is actually read from the cache
Pros: energy efficient
  only one block is read (assuming the prediction is correct)
Cons: longer latency on misprediction
  if the prediction was wrong, the other block(s) now have to be read and their tags checked

17 3. Pipelining Cache
Pipeline cache access to improve bandwidth
For a faster clock cycle time:
  allow the L1 hit time to be multiple clock cycles (instead of 1 cycle)
  make the cache pipelined, so it still has high bandwidth
Examples:
  Pentium: 1 cycle
  Pentium Pro through Pentium III: 2 cycles
  Pentium 4 through Core i7: 4 cycles
Cons:
  increases the number of pipeline stages for an instruction
  longer branch mis-prediction penalty
  more clock cycles between a "load" and receiving the data
Pros:
  allows a faster clock rate for the processor
  makes it easier to increase associativity

18 4. Non-blocking Caches
A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss
"Hit under miss" reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses

19 Value of Hit Under Miss for SPEC
Hit under 1 miss, 2 misses, and 64 misses
  Hit under 1 miss: miss penalty reduced by 9% for integer and 12.5% for floating-point programs
  Hit under 2 misses: benefit is slightly higher, 10% and 16% respectively
  No further benefit at 64 misses

20 5. Multibanked Caches
Organize the cache as independent banks to support simultaneous access
  originally, banks were used only for main memory
  now common for L2 caches
  ARM Cortex-A8 supports 1-4 banks for L2
  Intel i7 supports 4 banks for L1 and 8 banks for L2
Interleave banks according to block address, so that sequential blocks can be accessed in parallel (see the sketch below)
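As a small illustrative sketch (the block size, bank count, and sequential mapping are assumptions, not taken from any particular processor), block-address interleaving just means the low-order bits of the block address pick the bank:

  /* Sequential interleaving of cache blocks across banks (illustrative only).
     Block i goes to bank (i mod NUM_BANKS), so consecutive blocks land in
     different banks and can be accessed in parallel. */
  #define BLOCK_SIZE 64   /* bytes per cache block (assumed) */
  #define NUM_BANKS  4    /* power of two, so the mod is a cheap mask */

  static unsigned bank_of(unsigned long addr)
  {
      unsigned long block = addr / BLOCK_SIZE;
      return (unsigned)(block % NUM_BANKS);
  }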

21 6. Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU
  Early Restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue while filling the rest of the words in the block
    also called "wrapped fetch" and "requested word first"
Generally useful only with large blocks
Spatial locality can be a problem: the CPU tends to want the next sequential word, so it is not clear how much early restart helps

22 7. Merging Write Buffer
Write buffers are used in both write-through and write-back caches
  write-through: the write is sent to the buffer so the memory update can happen in the background
  write-back: when a dirty block is replaced, the write is sent to the buffer
Merging writes: when updating a location that is already pending in the write buffer, update that write buffer entry instead of creating a new one
(Figure: write buffer contents without and with write buffer merging)

23 Merging Write Buffer (contd.)
Pros: reduces stalls due to the write buffer being full
But: I/O writes cannot be merged
  with memory-mapped I/O, I/O writes become memory writes
  they should not be merged because I/O has different semantics: we want to keep each I/O event distinct
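A minimal sketch of the merging idea in C, assuming a tiny 4-entry buffer with four 64-bit word slots per entry (the sizes, field names, and layout are illustrative, not from the slides):

  #include <stdint.h>
  #include <stdbool.h>

  #define WB_ENTRIES      4   /* entries in the write buffer (assumed) */
  #define WORDS_PER_BLOCK 4   /* 64-bit words per cache block (assumed) */

  struct wb_entry {
      bool     used;
      uint64_t block_addr;                  /* address of the block being written */
      uint64_t data[WORDS_PER_BLOCK];
      uint8_t  valid_mask;                  /* one bit per word already buffered */
  };

  static struct wb_entry write_buffer[WB_ENTRIES];

  /* Returns true if the write was buffered (merged or placed in a new entry),
     false if the buffer is full and the CPU would have to stall. */
  static bool wb_write(uint64_t block_addr, unsigned word_idx, uint64_t value)
  {
      /* First try to merge with an existing entry for the same block. */
      for (int i = 0; i < WB_ENTRIES; i++) {
          if (write_buffer[i].used && write_buffer[i].block_addr == block_addr) {
              write_buffer[i].data[word_idx] = value;
              write_buffer[i].valid_mask |= (uint8_t)(1u << word_idx);
              return true;
          }
      }
      /* No match: allocate a fresh entry if one is free. */
      for (int i = 0; i < WB_ENTRIES; i++) {
          if (!write_buffer[i].used) {
              write_buffer[i].used = true;
              write_buffer[i].block_addr = block_addr;
              write_buffer[i].data[word_idx] = value;
              write_buffer[i].valid_mask = (uint8_t)(1u << word_idx);
              return true;
          }
      }
      return false;   /* buffer full: without merging this happens much sooner */
  }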

24 8. Reduce Misses by Compiler Optzns.
Instructions
  Reorder procedures in memory so as to reduce misses
  Use profiling to look at conflicts
  McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks
Data
  Merging Arrays: improve spatial locality by using a single array of compound elements instead of 2 separate arrays
  Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
  Loop Fusion: combine two independent loops that have the same looping and some variables in common
  Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows

25 Merging Arrays Example
/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key
Addressing expressions are different

26 Loop Interchange Example
/* Before */
for (k = 0; k < 100; k++)
  for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
  for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words

27 Loop Fusion Example
/* Before */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

Before: 2 misses per access to a and c
After: 1 miss per access to a and c

28 Blocking Example
/* Before */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    r = 0;
    for (k = 0; k < N; k++)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }

Two inner loops:
  Read all NxN elements of z[]
  Read N elements of 1 row of y[] repeatedly
  Write N elements of 1 row of x[]
Capacity misses are a function of N and cache size
  if the cache can hold 3 NxN matrices, there are no capacity misses; otherwise ...
Idea: compute on a BxB submatrix that fits in the cache

29 Blocking Example (contd.)
Age of accesses in the figure:
  White means not touched yet
  Light gray means touched a while ago
  Dark gray means newer accesses

30 Blocking Example (contd.)
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i++)
      for (j = jj; j < min(jj+B,N); j++) {
        r = 0;
        for (k = kk; k < min(kk+B,N); k++)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }

Work with BxB submatrices
  smaller working set can fit within the cache
  fewer capacity misses

31 Blocking Example (contd.)
Capacity required goes from (2N^3 + N^2) down to (2N^3/B + N^2)
B = "blocking factor"

32 Summary: Compiler Optimizations to Reduce Cache Misses

33 9. Reduce Misses by Hardware Prefetching
Prefetching is done by hardware outside of the cache
Instruction prefetching
  The Alpha fetches 2 blocks on a miss
  The extra block is placed in a stream buffer
  On a miss, the stream buffer is checked first
Works with data blocks too
  Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 stream buffers caught 43%
  Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches
Prefetching relies on extra memory bandwidth that can be used without penalty
  e.g., up to 8 prefetch stream buffers in the UltraSPARC III
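A minimal sketch of the stream-buffer idea in C (a single-entry buffer; the structure and naming are illustrative assumptions, not a description of any real prefetcher):

  #include <stdint.h>
  #include <stdbool.h>

  /* A one-entry stream buffer next to the cache.  On a miss we also start
     fetching the next sequential block into the buffer, so a later miss to
     that block can be served from the buffer instead of memory. */
  struct stream_buffer {
      bool     valid;
      uint64_t block_addr;
  };

  static struct stream_buffer sb;

  /* Called on a cache miss for 'block_addr' (a block number, not a byte
     address).  Returns true if the stream buffer already held the block. */
  static bool miss_with_stream_buffer(uint64_t block_addr)
  {
      bool hit_in_sb = sb.valid && sb.block_addr == block_addr;

      /* Whether or not we hit, start prefetching the next sequential block. */
      sb.valid = true;
      sb.block_addr = block_addr + 1;

      return hit_in_sb;
  }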

34 Hardware Prefetching: Benefit
Fetch two blocks on a miss: the requested block and the next sequential block
(Figure: Pentium 4 prefetching benefit)

35 10. Reducing Misses by Software Prefetching
Data prefetch
  The compiler inserts special "prefetch" instructions into the program
    Register prefetch: load the data into a register (HP PA-RISC loads)
    Cache prefetch: load the data into the cache (MIPS IV, PowerPC, SPARC v9)
  A form of speculative execution
    we don't really know whether the data will be needed, or whether it is already in the cache
  The most effective prefetches are "semantically invisible" to the program
    they do not change registers or memory
    they cannot cause a fault/exception; if they would fault, they are simply turned into NOPs
Issuing prefetch instructions takes time
  Is the cost of issuing prefetches < the savings from reduced misses?
  Combine with loop unrolling and software pipelining (see the sketch below)
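A hedged sketch using GCC/Clang's __builtin_prefetch (a compiler builtin chosen for illustration, not something the slides mandate); the prefetch distance of 16 elements is an assumption that would need tuning for a real machine:

  /* Software prefetching combined with a simple reduction over a large array.
     __builtin_prefetch(addr, rw, locality) compiles to a non-faulting
     prefetch instruction where the target supports one. */
  #define PREFETCH_DIST 16   /* elements ahead to prefetch (assumed/tuned) */

  double sum_with_prefetch(const double *a, long n)
  {
      double s = 0.0;
      for (long i = 0; i < n; i++) {
          if (i + PREFETCH_DIST < n)
              __builtin_prefetch(&a[i + PREFETCH_DIST], /*rw=*/0, /*locality=*/1);
          s += a[i];
      }
      return s;
  }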

36 A couple of other optimizations

37 Reduce Conflict Misses via Victim Cache
How to combine the fast hit time of a direct-mapped cache yet avoid its conflict misses?
Add a small, highly associative buffer (a "victim cache") to hold data recently discarded from the cache
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache
(Figure: CPU, direct-mapped cache with tag/data arrays, victim cache, and memory)
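A minimal sketch of the lookup path with a victim cache, assuming a direct-mapped L1 and a 4-entry fully associative victim buffer (all sizes, field names, and the swap policy are illustrative assumptions):

  #include <stdint.h>
  #include <stdbool.h>

  #define L1_SETS      256  /* direct-mapped L1 lines (assumed) */
  #define VICTIM_LINES 4    /* fully associative victim cache entries (assumed) */

  struct l1_line { bool valid; uint64_t tag; };
  struct vc_line { bool valid; uint64_t block_addr; };   /* full block address */

  static struct l1_line l1[L1_SETS];
  static struct vc_line victim[VICTIM_LINES];

  /* Returns true on a hit in either the L1 or the victim cache.
     On a victim hit, the hot block is promoted back into L1 and the line it
     displaces takes its place in the victim buffer. */
  static bool lookup(uint64_t block_addr)
  {
      unsigned set = (unsigned)(block_addr % L1_SETS);
      uint64_t tag = block_addr / L1_SETS;

      if (l1[set].valid && l1[set].tag == tag)
          return true;                          /* normal direct-mapped hit */

      for (int i = 0; i < VICTIM_LINES; i++) {
          if (victim[i].valid && victim[i].block_addr == block_addr) {
              if (l1[set].valid)
                  victim[i].block_addr = l1[set].tag * (uint64_t)L1_SETS + set;
              else
                  victim[i].valid = false;
              l1[set].valid = true;
              l1[set].tag   = tag;
              return true;                      /* slower "victim hit" */
          }
      }
      return false;                             /* miss: go to the next level */
  }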

38 Reduce Conflict Misses via Pseudo-Assoc.
How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit)
Drawback: CPU pipeline design is hard if a hit can take 1 or 2 cycles
  better for caches not tied directly to the processor
(Figure: timeline showing hit time, pseudo-hit time, and miss penalty)

39 Fetching Subblocks to Reduce Miss Penalty
Don't have to load the full block on a miss
Keep a valid bit per subblock to indicate which subblocks are present
(Figure: cache lines with per-subblock valid bits)
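A brief sketch of per-subblock valid bits in C (the block/subblock sizes and field names are assumptions for illustration):

  #include <stdint.h>
  #include <stdbool.h>

  #define SUBBLOCKS_PER_BLOCK 4   /* e.g., a 64-byte block in 16-byte subblocks (assumed) */

  struct cache_line {
      uint64_t tag;
      uint8_t  subblock_valid;    /* one valid bit per subblock */
  };

  /* A reference hits only if the tag matches AND the addressed subblock
     has been fetched; on a miss, only that subblock needs to be loaded. */
  static bool subblock_hit(const struct cache_line *line,
                           uint64_t tag, unsigned subblock)
  {
      return line->tag == tag &&
             (line->subblock_valid & (1u << subblock)) != 0;
  }

  /* Fill just the missing subblock instead of the whole block. */
  static void fill_subblock(struct cache_line *line,
                            uint64_t tag, unsigned subblock)
  {
      if (line->tag != tag) {
          line->tag = tag;
          line->subblock_valid = 0;   /* new block: all subblocks invalid */
      }
      line->subblock_valid |= (uint8_t)(1u << subblock);
  }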

40 Review: Improving Cache Performance
1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

41 Summary

