Presentation on theme: "Anshul Kumar, CSE IITD CSL718 : Memory Hierarchy Cache Performance Improvement 23rd Feb, 2006."— Presentation transcript:

1 CSL718: Memory Hierarchy – Cache Performance Improvement
Anshul Kumar, CSE IITD
23rd Feb, 2006

2 Performance
Average memory access time = Hit time + Mem stalls / access
                           = Hit time + Miss rate * Miss penalty
Program execution time = IC * Cycle time * (CPI_exec + Mem stalls / instr)
Mem stalls / instr = Miss rate * Miss penalty * Mem accesses / instr
Miss penalty in an OOO processor = Total miss latency - Overlapped miss latency
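As a quick illustration of these formulas, the following sketch plugs in made-up numbers; the miss rate, penalties, CPI, instruction count and cycle time are all assumptions, not figures from the slides.

    /* Hedged sketch: the slide's formulas evaluated with assumed numbers. */
    #include <stdio.h>

    int main(void) {
        double hit_time = 1.0;          /* cycles */
        double miss_rate = 0.05;        /* 5% of accesses miss (assumed) */
        double miss_penalty = 100.0;    /* cycles to service a miss (assumed) */
        double cpi_exec = 1.2;          /* base CPI without memory stalls (assumed) */
        double mem_acc_per_instr = 1.3; /* fetches + loads/stores per instr (assumed) */
        double instr_count = 1e9;       /* IC (assumed) */
        double cycle_time_ns = 0.5;     /* clock period in ns (assumed) */

        /* Average memory access time = Hit time + Miss rate * Miss penalty */
        double amat = hit_time + miss_rate * miss_penalty;

        /* Mem stalls / instr = Miss rate * Miss penalty * Mem accesses / instr */
        double stalls_per_instr = miss_rate * miss_penalty * mem_acc_per_instr;

        /* Program execution time = IC * Cycle time * (CPI_exec + Mem stalls / instr) */
        double exec_time_ns = instr_count * cycle_time_ns * (cpi_exec + stalls_per_instr);

        printf("AMAT = %.2f cycles\n", amat);
        printf("Memory stalls per instruction = %.2f cycles\n", stalls_per_instr);
        printf("Execution time = %.3f s\n", exec_time_ns * 1e-9);
        return 0;
    }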

3 Performance Improvement
– Reducing miss penalty
– Reducing miss rate
– Reducing miss penalty * miss rate
– Reducing hit time

4 Reducing Miss Penalty
– Multi-level caches
– Critical word first and early restart
– Giving priority to read misses over writes
– Merging write buffer
– Victim caches

5 Multi-Level Caches
Average memory access time = Hit time(L1) + Miss rate(L1) * Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2)
Multi-level inclusion and multi-level exclusion
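A minimal worked example of the two-level formula; all latencies and miss rates below are invented for illustration, only the structure of the calculation follows the slide.

    /* Hedged sketch: two-level AMAT with assumed numbers. */
    #include <stdio.h>

    int main(void) {
        double hit_L1 = 1.0,  miss_rate_L1 = 0.05;   /* assumed */
        double hit_L2 = 10.0, miss_rate_L2 = 0.20;   /* local L2 miss rate (assumed) */
        double penalty_L2 = 100.0;                   /* main-memory access (assumed) */

        double penalty_L1 = hit_L2 + miss_rate_L2 * penalty_L2;  /* = 30 cycles */
        double amat = hit_L1 + miss_rate_L1 * penalty_L1;        /* = 2.5 cycles */

        printf("L1 miss penalty = %.1f cycles, AMAT = %.1f cycles\n", penalty_L1, amat);
        return 0;
    }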

6 Misses in a Multilevel Cache
Local miss rate – no. of misses / no. of requests, as seen at a level
Global miss rate – no. of misses / no. of requests, on the whole
Solo miss rate – miss rate if only this cache were present

7 Two-Level Cache Miss Example
A: block found in both L1 and L2    B: not in L1, in L2
C: in L1, not in L2                 D: in neither
Local miss rate (L1) = (B+D)/(A+B+C+D)
Local miss rate (L2) = D/(B+D)
Global miss rate = D/(A+B+C+D)
Solo miss rate (L2) = (C+D)/(A+B+C+D)
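The same accounting written out directly; the counts A, B, C, D below are invented, only the ratios follow the slide.

    /* Hedged sketch of the A/B/C/D miss-rate accounting with assumed counts. */
    #include <stdio.h>

    int main(void) {
        double A = 900.0, B = 60.0, C = 30.0, D = 10.0;  /* example counts (assumed) */
        double total = A + B + C + D;

        double local_miss_L1 = (B + D) / total;   /* L1 misses / all requests     */
        double local_miss_L2 = D / (B + D);       /* L2 misses / requests L2 sees */
        double global_miss   = D / total;         /* requests that reach memory   */
        double solo_miss_L2  = (C + D) / total;   /* miss rate if only L2 existed */

        printf("local L1 = %.3f, local L2 = %.3f, global = %.3f, solo L2 = %.3f\n",
               local_miss_L1, local_miss_L2, global_miss, solo_miss_L2);
        return 0;
    }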

8 Critical Word First and Early Restart
– Read policy: critical word first – fetch the requested word from memory first and forward it to the processor immediately
– Load policy: early restart – fetch words in normal order, but let the processor resume as soon as the requested word arrives
– More effective when block size is large

9 Read Miss Priority Over Writes
– Provide write buffers
– Processor writes into the buffer and proceeds (for write-through as well as write-back)
– On a read miss:
  – wait for the buffer to empty, or
  – check addresses in the buffer for a conflict (sketched below)
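A rough sketch of the second option: on a read miss, scan the pending write buffer for the missing block before letting the read bypass the writes. The buffer size, structure and field names are assumptions for illustration.

    /* Hedged sketch: check the write buffer for an address conflict on a read miss. */
    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 8
    typedef struct { bool valid; uint32_t block_addr; } wb_entry_t;
    static wb_entry_t write_buffer[WB_ENTRIES];

    bool read_miss_conflicts(uint32_t block_addr) {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr)
                return true;    /* drain or forward before fetching from memory */
        return false;           /* safe to send the read ahead of the buffered writes */
    }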

10 Merging Write Buffer
Merge writes belonging to the same block (in the case of write-through)

11 Victim Cache (proposed by Jouppi)
– Evicted blocks are recycled into a small victim cache
– Much faster than getting the block from the next level
– Size = 1 to 5 blocks
– A significant fraction of misses may be found in the victim cache
[Figure: victim cache alongside the main cache, on the path from memory to processor]

12 Reducing Miss Rate
– Large block size
– Larger cache
– Higher associativity
– Way prediction and pseudo-associative cache
– Warm start in multi-tasking
– Compiler optimizations

13 Large Block Size
– Reduces compulsory misses
– Too large a block size – misses increase
– Miss penalty increases

14 Large Cache
– Reduces capacity misses
– Hit time increases
– Keep a small L1 cache and a large L2 cache

15 Higher Associativity
– Reduces conflict misses
– 8-way is almost like fully associative
– Hit time increases

16 Way Prediction and Pseudo-Associative Cache
Way prediction: the low miss rate of a set-associative cache with the hit time of a direct-mapped cache
– Only one tag is compared initially
– Extra bits are kept for prediction
– Hit time in case of misprediction is high
Pseudo-associative (or column-associative) cache: get the advantage of a set-associative cache in a direct-mapped cache
– Check sequentially within a pseudo-set
– Fast hit and slow hit (see the sketch below)
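A possible shape of the pseudo-associative lookup: a direct-mapped array where a miss in the primary line triggers a second probe at the index with its most significant bit flipped. The cache size, block size and probing rule below are illustrative assumptions, not taken from the slide.

    /* Hedged sketch of a column-associative lookup with fast and slow hits. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LINES 1024                  /* assumed number of cache lines */

    typedef struct { bool valid; uint32_t tag; } line_t;
    static line_t cache[NUM_LINES];

    /* returns 0 = fast hit, 1 = slow hit (second probe), 2 = miss */
    int lookup(uint32_t addr) {
        uint32_t index = (addr >> 6) & (NUM_LINES - 1);  /* 64 B blocks assumed */
        uint32_t tag   = addr >> 6;                      /* full block address kept as tag for simplicity */

        if (cache[index].valid && cache[index].tag == tag)
            return 0;                                    /* fast hit */

        uint32_t alt = index ^ (NUM_LINES >> 1);         /* flip MSB of the index */
        if (cache[alt].valid && cache[alt].tag == tag)
            return 1;                                    /* slow hit in the pseudo-set */

        return 2;                                        /* genuine miss */
    }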

17 Warm Start in Multi-Tasking
Cold start
– process starts with an empty cache
– blocks of the previous process are invalidated
Warm start
– some blocks from the previous activation are still available

18 Compiler Optimizations
Loop interchange – improve spatial locality by scanning arrays row-wise (see the sketch below)
Blocking – improve temporal and spatial locality
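A minimal sketch of loop interchange on a row-major C array; the array name and size are made up. The first version strides column-wise with poor spatial locality, the second scans row-wise with unit stride.

    /* Hedged illustration of loop interchange. */
    #define N 1024
    double x[N][N];

    void column_order(void) {          /* before interchange: stride of N elements */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                x[i][j] *= 2.0;
    }

    void row_order(void) {             /* after interchange: unit stride */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] *= 2.0;
    }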

19 Improving Locality
Matrix multiplication example

20 Cache Organization for the Example
– Cache line (or block) = 4 matrix elements
– Matrices are stored row-wise
– The cache cannot accommodate a full row or column (i.e., L, M and N are so large w.r.t. the cache size that after an iteration along any of the three indices, an element accessed again results in a miss)
– Ignore misses due to conflicts between matrices (as if there were a separate cache for each matrix)

21 Matrix Multiplication: Code I
for (i = 0; i < L; i++)
  for (j = 0; j < M; j++)
    for (k = 0; k < N; k++)
      c[i][j] += A[i][k] * B[k][j];

          C       A       B
accesses  LM      LMN     LMN
misses    LM/4    LMN/4   LMN

Total misses = LM(5N+1)/4

22 Matrix Multiplication: Code II
for (k = 0; k < N; k++)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      c[i][j] += A[i][k] * B[k][j];

          C       A       B
accesses  LMN     LN      LMN
misses    LMN/4   LN      LMN/4

Total misses = LN(2M+4)/4

23 Matrix Multiplication: Code III
for (i = 0; i < L; i++)
  for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
      c[i][j] += A[i][k] * B[k][j];

          C       A       B
accesses  LMN     LN      LMN
misses    LMN/4   LN/4    LMN/4

Total misses = LN(2M+1)/4

24 Blocking
[Figure: blocked traversal of the i-j-k iteration space, with tile loops jj and kk]
5 nested loops, blocking factor = b (a possible C version appears below)

          C        A        B
accesses  LMN/b    LMN/b    LMN
misses    LMN/4b   LMN/4b   MN/4

Total misses = MN(2L/b+1)/4
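A sketch of what the five-loop blocked multiplication might look like in C, using the slide's L, M, N naming and blocking factor b; the particular loop ordering and boundary handling are assumptions, not taken from the slide.

    /* Hedged sketch of a blocked (tiled) matrix multiplication, blocking factor b. */
    void matmul_blocked(int L, int M, int N, int b,
                        double **c, double **A, double **B) {
        for (int kk = 0; kk < N; kk += b)              /* tile along k */
            for (int jj = 0; jj < M; jj += b)          /* tile along j */
                for (int i = 0; i < L; i++)
                    for (int k = kk; k < kk + b && k < N; k++)
                        for (int j = jj; j < jj + b && j < M; j++)
                            c[i][j] += A[i][k] * B[k][j];
    }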

25 Loop Blocking
for (k = 0; k < N; k += 4)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      c[i][j] += A[i][k]   * B[k][j]
               + A[i][k+1] * B[k+1][j]
               + A[i][k+2] * B[k+2][j]
               + A[i][k+3] * B[k+3][j];

          C        A       B
accesses  LMN/4    LN      LMN
misses    LMN/16   LN/4    LMN/4

Total misses = LN(5M/4+1)/4

26 Reducing Miss Penalty * Miss Rate
– Non-blocking cache
– Hardware prefetching
– Compiler-controlled prefetching

27 Non-blocking Cache
Useful in an OOO processor
Hit under a miss
– complexity of the cache controller increases
Hit under multiple misses, or miss under a miss
– memory should be able to handle multiple misses

28 Hardware Prefetching
Prefetch items before they are requested
– both data and instructions
What and when to prefetch?
– fetch two blocks on a miss (requested + next)
Where to keep prefetched information?
– in the cache
– in a separate buffer (most common case)

29 Prefetch Buffer / Stream Buffer
[Figure: prefetch buffer alongside the cache, on the path from memory to processor]

30 Hardware Prefetching: Stream Buffers
Jouppi's experiment [1990]:
– a single instruction stream buffer catches 15% to 25% of the misses from a 4 KB direct-mapped instruction cache with 16-byte blocks
– a 4-block buffer catches 50%, a 16-block buffer 72%
– a single data stream buffer catches 25% of the misses from a 4 KB direct-mapped cache
– 4 data stream buffers (each prefetching at a different address) catch 43%

31 HW Prefetching: UltraSPARC III Example
64 KB data cache, 36.9 misses per 1000 instructions
22% of instructions make a data reference
hit time = 1, miss penalty = 15
prefetch hit rate = 20%, 1 cycle to get data from the prefetch buffer
What size of cache will give the same performance?
miss rate = 36.9/220 = 16.7%
avg. memory access time = 1 + (0.167*0.2*1) + (0.167*0.8*15) = 3.046
effective miss rate = (3.046-1)/15 = 13.6% => 256 KB cache
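The arithmetic above, redone as a small program; the input figures are the slide's, everything else is just the AMAT formula with 20% of misses served from the prefetch buffer in 1 cycle.

    /* Hedged re-computation of the UltraSPARC III prefetch example. */
    #include <stdio.h>

    int main(void) {
        double misses_per_1000_instr   = 36.9;
        double data_refs_per_1000_instr = 220.0;   /* 22% of instructions */
        double hit_time = 1.0, miss_penalty = 15.0;
        double prefetch_hit_rate = 0.20, prefetch_hit_time = 1.0;

        double miss_rate = misses_per_1000_instr / data_refs_per_1000_instr; /* ~0.167 */
        double amat = hit_time
                    + miss_rate * prefetch_hit_rate * prefetch_hit_time
                    + miss_rate * (1.0 - prefetch_hit_rate) * miss_penalty;  /* ~3.05  */
        double effective_miss_rate = (amat - hit_time) / miss_penalty;       /* ~0.136 */

        printf("miss rate = %.3f, AMAT = %.3f, effective miss rate = %.3f\n",
               miss_rate, amat, effective_miss_rate);
        return 0;
    }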

32 Compiler-Controlled Prefetching
– Register prefetch / cache prefetch
– Faulting / non-faulting (non-binding)
– Semantically invisible (no change in registers or cache contents)
– Makes sense if the processor doesn't stall while prefetching (non-blocking cache)
– The overhead of the prefetch instructions should not exceed the benefit

33 SW Prefetch Example
8 KB direct-mapped, write-back data cache with 16-byte blocks; each array element is 8 bytes
a is 3 × 100, b is 101 × 3

for (i = 0; i < 3; i++)
  for (j = 0; j < 100; j++)
    a[i][j] = b[j][0] * b[j+1][0];

misses in array a = 3 * 100 / 2 = 150
misses in array b = 101
total misses = 251

34 SW Prefetch Example – contd.
Suppose we need to prefetch 7 iterations in advance.

for (j = 0; j < 100; j++) {
  prefetch(b[j+7][0]);
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i++)
  for (j = 0; j < 100; j++) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0] * b[j+1][0];
  }

misses in the first loop = 7 (for b[0..6][0]) + 4 (for a[0][0..6])
misses in the second loop = 4 (for a[1][0..6]) + 4 (for a[2][0..6])
total misses = 19, total prefetches = 400

35 SW Prefetch Example – contd.
Performance improvement?
Assume no capacity or conflict misses; prefetches overlap with each other and with misses
Loop body: original loop 7 cycles per iteration, prefetch loops 9 and 8 cycles; miss penalty = 100 cycles
Original loop = 300*7 + 251*100 = 27,200 cycles
1st prefetch loop = 100*9 + 11*100 = 2,000 cycles
2nd prefetch loop = 200*8 + 8*100 = 2,400 cycles
Speedup = 27200 / (2000 + 2400) = 6.2

36 Reducing Hit Time
– Small and simple caches
– Avoiding time loss in address translation
– Pipelined cache access
– Trace caches

37 Small and Simple Caches
– Small size => faster access
– Small size => fits on the chip, lower delay
– Simple (direct mapped) => lower delay
– Second level: tags may be kept on chip

38 Cache Access Time Estimates Using CACTI
[Figure: access-time estimates for 0.8 micron technology, 1 R/W port, 32-bit address, 64-bit output, 32 B blocks]

39 Avoiding Time Loss in Address Translation
Virtually indexed, physically tagged cache
– simple and effective approach
– possible only if the cache is not too large (see the sketch below)
Virtually addressed cache
– protection?
– multiple processes?
– aliasing?
– I/O?
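A small sketch of the "not too large" condition: the cache index plus block offset must fit within the page offset so that the index is identical in the virtual and physical addresses. The sizes used are illustrative assumptions.

    /* Hedged check of the virtually-indexed, physically-tagged size constraint. */
    #include <stdio.h>

    int main(void) {
        unsigned page_bytes  = 8 * 1024;    /* 13-bit page offset (assumed) */
        unsigned ways        = 2;
        unsigned cache_bytes = 16 * 1024;   /* candidate cache size (assumed) */

        unsigned indexed_bytes = cache_bytes / ways;  /* bytes spanned by index + block offset */
        printf("%s\n", indexed_bytes <= page_bytes
               ? "index available before translation (OK)"
               : "cache too large: index depends on translation");
        return 0;
    }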

40 Cache Addressing
Physical address
– first convert the virtual address into a physical address, then access the cache
– no time loss if the index field is available without address translation
Virtual address
– access the cache directly using the virtual address

41 Problems with a Virtually Addressed Cache
Page-level protection?
– copy protection info from the TLB
The same virtual address from two different processes needs to be distinguished
– purge cache blocks on a context switch, or use PID tags along with the other address tags
Aliasing (different virtual addresses from two processes pointing to the same physical address) – inconsistency?
I/O uses physical addresses

42 Multiple Processes in a Virtually Addressed Cache
– purge cache blocks on a context switch
– use PID tags along with the other address tags

43 Inconsistency in a Virtually Addressed Cache
Hardware solution (Alpha 21264)
– 64 KB cache, 2-way set associative, 8 KB pages
– a block with a given offset in a page can map to 8 locations in the cache
– check all 8 locations, invalidate duplicate entries (see the sketch below)
Software solution (page coloring)
– make the 18 lsbs of all aliases the same – ensures that a direct-mapped cache <= 256 KB has no duplicates
– i.e., 4 KB pages are mapped to 64 sets (or colors)
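The 8-location figure can be reproduced with a little arithmetic; a minimal sketch, assuming only the sizes quoted above.

    /* Hedged sketch of the alias arithmetic behind the Alpha 21264 example. */
    #include <stdio.h>

    int main(void) {
        unsigned cache_bytes = 64 * 1024, ways = 2, page_bytes = 8 * 1024;

        unsigned bytes_per_way = cache_bytes / ways;                    /* 32 KB indexed per way */
        unsigned alias_positions_per_way = bytes_per_way / page_bytes;  /* index bits beyond the page offset -> 4 */
        unsigned locations_to_check = alias_positions_per_way * ways;   /* 8 */

        printf("possible cache locations for one block = %u\n", locations_to_check);
        return 0;
    }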

44 Pipelined Cache Access
– Multi-cycle cache access, but pipelined
– Reduces cycle time, but the hit time is more than one cycle
– Pentium 4 takes 4 cycles
– Greater penalty on branch misprediction
– More clock cycles between the issue of a load and the use of its data

45 Trace Caches
What maps to a cache block?
– not statically determined
– decided by the dynamic sequence of instructions, including predicted branches
Used in Pentium 4 (NetBurst architecture)
– starting addresses are not word size * powers of 2
Better utilization of cache space
Downside – the same instruction may be stored multiple times

