# Faculty of Computer Science © 2006 CMPUT 229 Cache Performance Analysis Hitting for performance.



© 2006 Department of Computing Science CMPUT 229

Standard Matrix Multiplication

`for (i = 0; i` ...
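The loop on this slide is cut off in the transcript. A minimal sketch of the textbook triple loop follows. It assumes row-major storage in flat arrays of `double` (8 bytes per element), and the function name `matmul_standard` and the flat indexing are illustrative choices rather than the slide's own code.

```c
#include <stddef.h>

/* Computes c = a * b for n x n matrices stored row-major in flat arrays.
 * Assumption: element [i][j] lives at index i * n + j. */
void matmul_standard(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                /* a is walked along a row, b is walked down a column */
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

In the inner loop, `a` is read with a stride of one element while `b` is read with a stride of a full row, which is the access pattern the cache analysis examines.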

Data Access Pattern

[Figure: access pattern over matrices A and B.]

Cache Access Analysis

Assume that:

- Each matrix element is stored in 8 bytes;
- The data cache has 32 Kbytes and 128-byte cache lines;
- The data cache is direct mapped;
- n = 1024, Address(a[0,0]) = \$80000000, Address(b[0,0]) = \$80800000, Address(c[0,0]) = \$81000000.

What is the data cache hit ratio for this program?

$$\frac{128\text{-byte cache line}}{8\text{-byte element}} = 16 \text{ elements/line} \qquad \frac{32\text{K-byte cache}}{128\text{-byte cache line}} = 256 \text{ lines/cache}$$
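The two divisions above can be checked with a small sketch. The helper names `lines_per_cache` and `elements_per_line` are hypothetical and simply restate the arithmetic on the slide.

```c
#include <stddef.h>

/* Number of cache lines in a cache of cache_bytes with line_bytes per line. */
size_t lines_per_cache(size_t cache_bytes, size_t line_bytes)
{
    return cache_bytes / line_bytes;
}

/* Number of matrix elements that fit in one cache line. */
size_t elements_per_line(size_t line_bytes, size_t element_bytes)
{
    return line_bytes / element_bytes;
}
```

With the slide's parameters, `lines_per_cache(32 * 1024, 128)` gives 256 and `elements_per_line(128, 8)` gives 16.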

Cache Data Access Pattern

If we ignore conflict misses, then:

- Every 16th access of A is a miss;
- Every access to B is a miss.

How many hits and misses will occur to compute one element of C? (256 lines/cache, 16 elements/line.)

In A there will be 1024/16 = 64 misses and 1024 − 64 = 960 hits. In B there will be 1024 misses. Thus, what is the hit ratio?

$$\text{Hit ratio} = \frac{\#\text{ hits}}{\#\text{ of accesses}} = \frac{960 \text{ hits}}{2048 \text{ accesses}} \approx 0.47 = 47\%$$
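The count for one element of C can be sketched as follows, under the slide's assumption that conflict misses are ignored. The type `AccessCount` and the function `standard_inner_loop` are illustrative names, not part of the original material.

```c
/* Accesses and hits for one pass of the inner loop (one element of C). */
typedef struct {
    unsigned long accesses;
    unsigned long hits;
} AccessCount;

/* n: matrix dimension; elems_per_line: elements that fit in a cache line.
 * A is read along a row (one miss per line); B is read down a column
 * (every access misses). Conflict misses are ignored. */
AccessCount standard_inner_loop(unsigned long n, unsigned long elems_per_line)
{
    unsigned long a_misses = n / elems_per_line;
    unsigned long b_misses = n;
    AccessCount r = { 2 * n, 2 * n - a_misses - b_misses };
    return r;
}
```

For n = 1024 and 16 elements per line this returns 2048 accesses and 960 hits, matching the figures above.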

Address anatomy

The data cache has 32 Kbytes and 128-byte cache lines, giving 256 lines/cache and 16 elements/line. Since $128 = 2^7$ and $256 = 2^8$, a 32-bit address splits as follows:

| Field | Bits | Width |
| --- | --- | --- |
| Tag | 31–15 | 17 bits |
| Index | 14–7 | 8 bits |
| Offset | 6–0 | 7 bits |
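A minimal sketch of this split, assuming 32-bit addresses with a 7-bit offset and an 8-bit index as stated on the slide. The names `AddressFields` and `split_address` are hypothetical.

```c
#include <stdint.h>

enum { ADDR_OFFSET_BITS = 7, ADDR_INDEX_BITS = 8 };

typedef struct {
    uint32_t tag;    /* bits 31..15 */
    uint32_t index;  /* bits 14..7  */
    uint32_t offset; /* bits 6..0   */
} AddressFields;

/* Splits a 32-bit address into tag, index, and offset fields. */
AddressFields split_address(uint32_t addr)
{
    AddressFields f;
    f.offset = addr & ((1u << ADDR_OFFSET_BITS) - 1u);
    f.index = (addr >> ADDR_OFFSET_BITS) & ((1u << ADDR_INDEX_BITS) - 1u);
    f.tag = addr >> (ADDR_OFFSET_BITS + ADDR_INDEX_BITS);
    return f;
}
```

For example, `split_address(0x80801000u).index` is 32 and `split_address(0x80808000u).index` wraps back to 0.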

Conflict Misses

256 lines/cache, 16 elements/line.

| Cache access | Address | Index | Outcome |
| --- | --- | --- | --- |
| A[0,0] | \$80000000 | 0 | miss |
| B[0,0] | \$80800000 | 0 | miss |
| A[0,1] | \$80000004 | 0 | miss |
| B[1,0] | \$80801000 | 32 | miss |
| A[0,2] | \$80000008 | 0 | hit |
| B[2,0] | \$80802000 | 64 | miss |
| A[0,3] | \$8000000C | 0 | hit |
| B[3,0] | \$80803000 | 96 | miss |
| A[0,4] | \$80000010 | 0 | hit |
| B[4,0] | \$80804000 | 128 | miss |
| A[0,5] | \$80000014 | 0 | hit |
| B[5,0] | \$80805000 | 160 | miss |
| A[0,6] | \$80000018 | 0 | hit |
| B[6,0] | \$80806000 | 192 | miss |
| A[0,7] | \$8000001C | 0 | hit |
| B[7,0] | \$80807000 | 224 | miss |
| A[0,8] | \$80000020 | 0 | hit |
| B[8,0] | \$80808000 | 0 | miss |
| A[0,9] | \$80000024 | 0 | miss |
| B[9,0] | \$80809000 | 32 | miss |

The accesses to B cycle through the indices 0 → 32 → 64 → 96 → 128 → 160 → 192 → 224.

In general: a 1024-element row of A occupies 64 16-element cache lines. There will be 2 conflict misses in each of two of these lines, for a total of 4 conflict misses per row. Thus the accesses of A will result in 68 misses and 956 hits for each 1024 accesses. The conflict misses are not significant and can be ignored.
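The outcomes in the table can be replayed with a small direct-mapped cache model. This is a sketch under stated assumptions: the cache has 256 lines of 128 bytes, and the address steps are passed in as parameters so that they can be taken from the addresses listed in the table (4 bytes between consecutive A accesses, 0x1000 between consecutive B accesses). All names here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { SIM_NUM_LINES = 256, SIM_OFFSET_BITS = 7, SIM_INDEX_BITS = 8 };

typedef struct {
    bool valid[SIM_NUM_LINES];
    uint32_t tag[SIM_NUM_LINES];
} DirectMappedCache;

/* Empties the cache: every line becomes invalid. */
void cache_reset(DirectMappedCache *cache)
{
    memset(cache, 0, sizeof *cache);
}

/* Returns true on a hit; on a miss, installs the line and returns false. */
bool cache_access(DirectMappedCache *cache, uint32_t addr)
{
    uint32_t index = (addr >> SIM_OFFSET_BITS) & (SIM_NUM_LINES - 1u);
    uint32_t tag = addr >> (SIM_OFFSET_BITS + SIM_INDEX_BITS);
    if (cache->valid[index] && cache->tag[index] == tag) {
        return true;
    }
    cache->valid[index] = true;
    cache->tag[index] = tag;
    return false;
}

/* Replays `pairs` interleaved accesses A[0,k], B[k,0] starting from an
 * empty cache and reports the misses seen for each matrix. */
void count_interleaved_misses(uint32_t a_base, uint32_t a_step,
                              uint32_t b_base, uint32_t b_step,
                              int pairs, int *a_misses, int *b_misses)
{
    DirectMappedCache cache;
    cache_reset(&cache);
    *a_misses = 0;
    *b_misses = 0;
    for (int k = 0; k < pairs; k++) {
        if (!cache_access(&cache, a_base + (uint32_t)k * a_step)) {
            (*a_misses)++;
        }
        if (!cache_access(&cache, b_base + (uint32_t)k * b_step)) {
            (*b_misses)++;
        }
    }
}
```

Calling `count_interleaved_misses(0x80000000u, 4, 0x80800000u, 0x1000, 10, &am, &bm)` replays the 20 accesses in the table; the model reports 3 misses for A (at A[0,0], A[0,1], and A[0,9]) and 10 misses for B.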

Matrix Multiplication with Transpose

`for (i = 0; i` ...
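The code on this slide is truncated in the transcript. The sketch below shows one plausible shape for it, assuming the transpose of `b` is first written into a separate matrix `b1` and the multiplication then reads `a` and `b1` row by row. The function name, the flat row-major indexing, and the caller-supplied `b1` buffer are assumptions.

```c
#include <stddef.h>

/* Computes c = a * b by first building b1 = transpose(b).
 * All matrices are n x n, row-major, in flat arrays; b1 is scratch space
 * of n * n elements provided by the caller. */
void matmul_with_transpose(size_t n, const double *a, const double *b,
                           double *b1, double *c)
{
    /* Transpose: b1 is written along a row, b is read down a column. */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            b1[i * n + j] = b[j * n + i];
        }
    }
    /* Multiply: both a and b1 are now read along rows. */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b1[j * n + k];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The one-time cost of the column-wise reads in the transpose buys unit-stride reads of both operands in the much longer multiplication loop.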

Where to place matrix b1?

Address fields: Tag (bits 31–15), Index (bits 14–7), Offset (bits 6–0).

Intuitively, the index of b1[0][0] should be far from the index of a[0][0]. The index of a[0][0] is 0. Thus we could aim to place b1 at an address whose index is 128.
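One way to find such an address is to round a candidate address up to the next one whose index field has the desired value. The helper below is a hypothetical sketch for the 32-Kbyte, 128-byte-line cache assumed in this presentation; the slides do not say how b1 is actually placed.

```c
#include <stdint.h>

/* Returns the smallest address >= addr whose cache index (bits 14..7)
 * equals target_index and whose offset (bits 6..0) is zero.
 * Assumes a 32-Kbyte cache with 128-byte lines and target_index < 256. */
uint32_t next_address_with_index(uint32_t addr, uint32_t target_index)
{
    const uint32_t line_bytes = 128u;
    const uint32_t cache_bytes = 32u * 1024u;
    uint32_t base = addr & ~(cache_bytes - 1u);
    uint32_t candidate = base + target_index * line_bytes;
    if (candidate < addr) {
        candidate += cache_bytes; /* move to the next 32-Kbyte window */
    }
    return candidate;
}
```

For instance, starting the search at the illustrative address 0x81800000, the first address with index 128 is 0x81804000.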

Cache Access Pattern for the Transpose

If we ignore conflict misses, then:

- Every 16th access of b1 is a miss;
- Every access to b is a miss.

The transpose's inner loop yields 2048 accesses and 960 hits. The inner loop is repeated 1024 times, giving 1024 × 2048 accesses and 1024 × 960 hits. Thus, the hit ratio is:

$$\text{Hit ratio} = \frac{\#\text{ hits}}{\#\text{ of accesses}} = \frac{960 \text{ hits}}{2048 \text{ accesses}} \approx 0.47 = 47\%$$

`for (i = 0; i` ...

Cache Access Pattern for the Multiplication

If we ignore conflict misses, then:

- Every 16th access of a is a miss;
- Every 16th access to b1 is a miss.

Thus the inner loop yields 2048 accesses and 1920 hits.

`for (i = 0; i` ...

Hit Ratio for Multiplication with Transpose

The transpose yields 1024 × 2048 accesses and 1024 × 960 hits.

The multiplication (ignoring accesses to c) yields 1024 × 1024 × 2048 accesses and 1024 × 1024 × 1920 hits.

$$\text{Hit ratio} = \frac{1024 \times 960 + 1024 \times 1024 \times 1920 \text{ hits}}{1024 \times 2048 + 1024 \times 1024 \times 2048 \text{ accesses}} = \frac{960 + 1024 \times 1920 \text{ hits}}{1025 \times 2048 \text{ accesses}} = 0.937 = 93.7\%$$
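The combined ratio can be reproduced numerically. The sketch below uses the same counting model as the slides (conflict misses and accesses to c ignored); the function name `transposed_hit_ratio` is an illustrative assumption.

```c
/* Overall hit ratio for transpose followed by multiplication.
 * n: matrix dimension; elems_per_line: elements per cache line.
 * Doubles are used so the large products cannot overflow. */
double transposed_hit_ratio(unsigned long n, unsigned long elems_per_line)
{
    double dn = (double)n;
    /* Hits from one row-wise pass over n elements. */
    double row_hits = dn - (double)(n / elems_per_line);
    /* Transpose: n inner loops, each with 2n accesses and row_hits hits. */
    double transpose_accesses = dn * 2.0 * dn;
    double transpose_hits = dn * row_hits;
    /* Multiply: n * n inner loops, each with 2n accesses, both row-wise. */
    double multiply_accesses = dn * dn * 2.0 * dn;
    double multiply_hits = dn * dn * 2.0 * row_hits;
    return (transpose_hits + multiply_hits) /
           (transpose_accesses + multiply_accesses);
}
```

With n = 1024 and 16 elements per line the result is approximately 0.937, in agreement with the figure above.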

Blocked Matrix Multiplication*

`for (i0 = 0; i0` ...
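Only the opening of the blocked loop survives in the transcript. A common formulation of blocked (tiled) multiplication is sketched below, assuming outer loops over block origins `i0`, `j0`, `k0` and inner loops over one block at a time. Apart from `i0`, every name here, including the block-size parameter `bs`, is an assumption rather than the slide's code.

```c
#include <stddef.h>

static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* Computes c = a * b for n x n row-major matrices using bs x bs blocks.
 * Requires bs > 0; n does not have to be a multiple of bs. */
void matmul_blocked(size_t n, size_t bs, const double *a, const double *b,
                    double *c)
{
    for (size_t i = 0; i < n * n; i++) {
        c[i] = 0.0;
    }
    for (size_t i0 = 0; i0 < n; i0 += bs) {
        for (size_t j0 = 0; j0 < n; j0 += bs) {
            for (size_t k0 = 0; k0 < n; k0 += bs) {
                size_t i_end = min_size(i0 + bs, n);
                size_t j_end = min_size(j0 + bs, n);
                size_t k_end = min_size(k0 + bs, n);
                /* Multiply one block of a by one block of b,
                 * accumulating into the matching block of c. */
                for (size_t i = i0; i < i_end; i++) {
                    for (size_t j = j0; j < j_end; j++) {
                        double sum = c[i * n + j];
                        for (size_t k = k0; k < k_end; k++) {
                            sum += a[i * n + k] * b[k * n + j];
                        }
                        c[i * n + j] = sum;
                    }
                }
            }
        }
    }
}
```

Each block multiplication touches only bs × bs elements of `a` and of `b`, so a block that fits in the cache is reused many times before it is evicted.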

Data Access Pattern

[Figure: animation over the blocks of A and B that marks each access as a miss or a hit. Running miss/hit counts: 2/0, 3/1, 4/2, 4/4, 4/6, 4/8, 4/10, 4/12, 4/14.]

Multiplying the first row of the block of A by the block of B required 18 accesses that resulted in 4 misses. How many of the 18 accesses required to multiply the second row of the block of A by the block of B will be misses?

Data Access Pattern

[Figure: the animation continues over the second and third rows of the block of A.]

Misses per row of the block: 4 | 1 | 1 = 6. Hits per row of the block: 14 | 17 | 17 = 48.

What is the hit ratio for the next block multiplication? There are 3 misses in 54 references. In general, there are b misses and 2 × b³ accesses:

$$\text{Hit ratio} = \frac{2b^3 - b}{2b^3} = \frac{2b^2 - 1}{2b^2}$$
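The general formula is easy to evaluate for different block sizes. This small sketch assumes the slide's model of b misses per 2b³ accesses; the function name `block_hit_ratio` is hypothetical.

```c
/* Hit ratio for one block multiplication with block size b (b > 0),
 * assuming b misses out of 2 * b^3 accesses. */
double block_hit_ratio(unsigned b)
{
    double accesses = 2.0 * b * b * b;
    return (accesses - (double)b) / accesses;
}
```

For the 3 × 3 block in the figures, `block_hit_ratio(3)` evaluates to 51/54, about 0.944, and the ratio approaches 1 as b grows.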

Data Access Pattern

[Figure: blocks of A and B with a miss/hit legend.]

Assume that:

- Each matrix element is stored in 8 bytes;
- The data cache has 32 Kbytes and 128-byte cache lines;
- The data cache is direct mapped.

What should be the value of b? Do the memory locations of A and B matter?

Cache Usage for Blocked Matrix Multiplication

Assume that:

- Each matrix element is stored in 8 bytes;
- The data cache has 32 Kbytes and 128-byte cache lines;
- The data cache is direct mapped.

`for (i0 = 0; i0` ...