Published by Jovan Parramore. Modified over 2 years ago.

1
The Study of Cache Oblivious Algorithms
Prepared by Jia Guo

2
CS598dhp
Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science (FOCS '99), 17-18 October 1999, New York, NY, USA.

3
Outline
Cache complexity
Cache-aware algorithms
Cache-oblivious algorithms
  Matrix multiplication
  Matrix transposition
  FFT
Conclusion

4
Assumption
Only two levels of memory hierarchy:
  An ideal cache
    Fully associative
    Optimal replacement strategy
    "Tall cache": Z = Ω(L^2)
  A very large memory

5
An Ideal Cache Model
An ideal cache model (Z, L)
  Z: total number of words in the cache
  L: number of words in one cache line

6
Cache Complexity
An algorithm with input size n is measured by:
  Work complexity W(n): its conventional running time.
  Cache complexity Q(n; Z, L): the number of cache misses it incurs.
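
To make Q(n; Z, L) concrete, the sketch below counts misses for a sequence of word addresses on a fully associative cache of Z words with lines of L words. It is an illustration only: the function name `count_misses` is hypothetical, and LRU replacement is used as a stand-in for the optimal replacement strategy that the ideal cache assumes.

```python
from collections import OrderedDict


def count_misses(addresses, Z, L):
    """Count cache misses for a sequence of word addresses.

    Assumption: LRU replacement approximates the ideal cache's
    optimal replacement. The cache holds Z // L lines of L words.
    """
    capacity = Z // L
    cache = OrderedDict()  # line index -> None, oldest entry first
    misses = 0
    for addr in addresses:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            cache[line] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses


# A sequential scan of n words touches each line once: n / L misses.
scan_misses = count_misses(range(1024), Z=256, L=8)

# A column-by-column walk over a row-major 64 x 64 matrix has no
# spatial locality once a column spans more lines than the cache holds.
column_walk = (i * 64 + j for j in range(64) for i in range(64))
column_misses = count_misses(column_walk, Z=256, L=8)
```

With these parameters the scan misses once per line, while every access of the column walk misses, because each column touches 64 distinct lines and the cache holds only 32.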

8
Cache-Aware Algorithms
Contain parameters that are tuned to minimize the cache complexity for a particular cache size (Z) and line length (L).
The parameters must be adjusted when the algorithm runs on a different platform.

9
Example: A Blocked Matrix Multiplication Algorithm
s is a tuning parameter chosen to make the algorithm run fast.
(Figure: an n x n matrix A with an s x s block A11 marked.)

10
Example (2)
Cache complexity
The three s x s submatrices should fit into the cache, so they occupy Θ(s + s^2/L) cache lines.
Optimal performance is obtained when s = Θ(√Z).
At most Z/L cache misses are needed to bring the three submatrices into the cache.
n^2/L cache misses are needed to read n^2 elements.
The cache complexity is Θ(1 + n^2/L + n^3/(L√Z)).
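
A minimal sketch of a blocked multiplication with the tuning parameter s, assuming square matrices stored as Python lists of lists. The names `blocked_matmul` and `choose_block_size` are hypothetical; the block-size rule simply picks the largest s with 3 s^2 ≤ Z, which is one way to satisfy "three s x s submatrices fit into the cache" and is not necessarily the rule used on the slides.

```python
import math


def choose_block_size(Z):
    """Largest s such that three s x s blocks fit in Z words (assumed rule)."""
    return max(1, math.isqrt(Z // 3))


def blocked_matmul(A, B, s):
    """Multiply two n x n matrices block by block with block size s."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Multiply the s x s block of A at (i0, k0) by the
                # block of B at (k0, j0) and accumulate into C at (i0, j0).
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C


s = choose_block_size(256)
product = blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], s)
```

The dependence on Z in `choose_block_size` is exactly what makes the algorithm cache-aware: moving to a machine with a different cache means recomputing s.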

12
Cache-Oblivious Algorithms
Have no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L).
No tuning is needed; the algorithms are platform independent.
The algorithms introduced in the following slides are proved to have optimal cache complexity.

13
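Matrix Multiplication
A is n x m and B is m x p. Partition A and B by halving the largest dimension:
  If n ≥ max(m, p), split A into top and bottom halves A1, A2: A*B consists of A1*B stacked on A2*B.
  If m ≥ max(n, p), split A into left and right halves A1, A2 and B into top and bottom halves B1, B2: A*B = A1*B1 + A2*B2.
  If p ≥ max(n, m), split B into left and right halves B1, B2: A*B consists of A*B1 next to A*B2.
Proceed recursively until reaching the base case of one element.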
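
The recursion can be sketched as follows, assuming matrices stored as lists of lists and subproblems described by index ranges. The names `rec_mult` and `multiply` are hypothetical. Note that no cache parameter appears anywhere in the code.

```python
def rec_mult(A, B, C, i0, i1, k0, k1, j0, j1):
    """Accumulate A[i0:i1, k0:k1] * B[k0:k1, j0:j1] into C[i0:i1, j0:j1]."""
    n, m, p = i1 - i0, k1 - k0, j1 - j0
    if n == 0 or m == 0 or p == 0:
        return
    if n == 1 and m == 1 and p == 1:
        # Base case: one element of A times one element of B.
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif n >= max(m, p):
        # Split the rows of A (and of C).
        mid = i0 + n // 2
        rec_mult(A, B, C, i0, mid, k0, k1, j0, j1)
        rec_mult(A, B, C, mid, i1, k0, k1, j0, j1)
    elif m >= max(n, p):
        # Split the shared dimension: A1*B1 + A2*B2.
        mid = k0 + m // 2
        rec_mult(A, B, C, i0, i1, k0, mid, j0, j1)
        rec_mult(A, B, C, i0, i1, mid, k1, j0, j1)
    else:
        # Split the columns of B (and of C).
        mid = j0 + p // 2
        rec_mult(A, B, C, i0, i1, k0, k1, j0, mid)
        rec_mult(A, B, C, i0, i1, k0, k1, mid, j1)


def multiply(A, B):
    """Multiply an n x m matrix A by an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C


result = multiply([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
```

A practical implementation would stop the recursion at a larger base case to reduce call overhead; the one-element base case here follows the slide.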

14
Matrix Multiplication (2)
Assume the sizes of A and B are n x 4n and 4n x n. The recursion unfolds as:
  A*B = A1*B1 + A2*B2
  A1*B1 = A11*B11 + A12*B12
  A2*B2 = A21*B21 + A22*B22

15
Matrix Multiplication (3)
Intuitively, once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further misses.

16
Matrix Multiplication (4)
Cache complexity
The recursive algorithm achieves the same cache complexity as the cache-aware Block-MULT algorithm.
For square matrices, the optimal cache complexity is achieved.
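
As a worked check of the square case: the bound below is the one stated by Frigo et al. for the recursive algorithm on the three dimensions m, n, p (the bound is symmetric in them). Setting m = n = p gives the square-matrix form, whose dominant terms n^2/L and n^3/(L√Z) are those of the blocked algorithm.

```latex
\begin{align*}
Q(m, n, p) &= \Theta\!\left(m + n + p + \frac{mn + np + mp}{L} + \frac{mnp}{L\sqrt{Z}}\right) \\
Q(n, n, n) &= \Theta\!\left(n + \frac{n^{2}}{L} + \frac{n^{3}}{L\sqrt{Z}}\right)
\end{align*}
```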

18
Matrix Transposition
A (m x n) is transposed into B = A^T (n x m):
  for i ← 1 to m
    for j ← 1 to n
      B(j, i) = A(i, j)
If n is very large, the column-wise access to B causes a cache miss every time (no spatial locality in B).

19
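Matrix Transposition (2)
Partition array A along the longer dimension and recursively execute the transpose function.
(Figure: A with blocks A11, A12 in the top row and A21, A22 in the bottom row; A^T with blocks A11^T, A21^T in the top row and A12^T, A22^T in the bottom row.)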
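
A minimal sketch of the recursive transpose, assuming A is an m x n list of lists and B is a preallocated n x m list of lists. The names `rec_transpose` and `transpose` are hypothetical.

```python
def rec_transpose(A, B, i0, i1, j0, j1):
    """Write the transpose of A[i0:i1, j0:j1] into B[j0:j1, i0:i1]."""
    rows, cols = i1 - i0, j1 - j0
    if rows == 0 or cols == 0:
        return
    if rows == 1 and cols == 1:
        B[j0][i0] = A[i0][j0]
    elif rows >= cols:
        # The row range is the longer dimension: halve it.
        mid = i0 + rows // 2
        rec_transpose(A, B, i0, mid, j0, j1)
        rec_transpose(A, B, mid, i1, j0, j1)
    else:
        # The column range is the longer dimension: halve it.
        mid = j0 + cols // 2
        rec_transpose(A, B, i0, i1, j0, mid)
        rec_transpose(A, B, i0, i1, mid, j1)


def transpose(A):
    """Return the transpose of an m x n matrix A."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B


transposed = transpose([[1, 2, 3], [4, 5, 6]])
```

Halving the longer dimension keeps the subproblems close to square, so a subproblem and its transposed destination eventually fit in the cache together, whatever Z and L are.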

20
Matrix Transposition (3)
Cache complexity
The algorithm has the optimal cache complexity: Q(m, n) = Θ(1 + mn/L)
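
The gap between the doubly nested loop and the recursive transpose can be observed by replaying their memory accesses through a small cache simulator. This is a sketch under stated assumptions: LRU replacement stands in for optimal replacement, A and B are row-major and stored at the hypothetical base addresses `a_base` and `b_base`, and all function names are made up for this example.

```python
from collections import OrderedDict


def lru_misses(trace, Z, L):
    """Misses of a fully associative LRU cache with Z words, L words per line."""
    lines = OrderedDict()
    misses = 0
    for addr in trace:
        line = addr // L
        if line in lines:
            lines.move_to_end(line)
        else:
            misses += 1
            lines[line] = None
            if len(lines) > Z // L:
                lines.popitem(last=False)
    return misses


def naive_trace(m, n, a_base, b_base):
    """Addresses touched by the doubly nested transpose loop."""
    for i in range(m):
        for j in range(n):
            yield a_base + i * n + j  # read A(i, j)
            yield b_base + j * m + i  # write B(j, i)


def recursive_trace(i0, i1, j0, j1, m, n, a_base, b_base):
    """Addresses touched by the recursive transpose of A[i0:i1, j0:j1]."""
    rows, cols = i1 - i0, j1 - j0
    if rows == 1 and cols == 1:
        yield a_base + i0 * n + j0
        yield b_base + j0 * m + i0
    elif rows >= cols:
        mid = i0 + rows // 2
        yield from recursive_trace(i0, mid, j0, j1, m, n, a_base, b_base)
        yield from recursive_trace(mid, i1, j0, j1, m, n, a_base, b_base)
    else:
        mid = j0 + cols // 2
        yield from recursive_trace(i0, i1, j0, mid, m, n, a_base, b_base)
        yield from recursive_trace(i0, i1, mid, j1, m, n, a_base, b_base)


N, Z, L = 64, 256, 8
naive = lru_misses(naive_trace(N, N, 0, N * N), Z, L)
recursive = lru_misses(recursive_trace(0, N, 0, N, N, N, 0, N * N), Z, L)
```

For this 64 x 64 example the naive loop incurs 4608 misses (every write to B misses), while the recursive order incurs 1024, which is 2 N^2 / L: one miss per cache line of A and of B.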

21
Fast Fourier Transform
Use the Cooley-Tukey algorithm.
Cooley-Tukey algorithms recursively re-express a DFT of a composite size n = n1 n2 as:
  1. Perform n2 DFTs of size n1.
  2. Multiply by complex roots of unity called twiddle factors.
  3. Perform n1 DFTs of size n2.

22
(Figure: a diagram with dimensions labeled n1 and n2.)

23
Assume X is a row-major n1 × n2 matrix.
Steps:
  1. Transpose X in place.
  2. Compute n2 DFTs of size n1.
  3. Multiply by twiddle factors.
  4. Transpose X in place.
  5. Compute n1 DFTs of size n2.
  6. Transpose X in place.
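
The six steps can be sketched in code as follows. This is an illustration under assumptions, not the authors' implementation: n is a power of 2, the sign convention is exp(-2πi jk/n), the split picks n2 = 2^⌊(log2 n)/2⌋, and the transposes are done out of place for brevity (the slide's in-place transposes would use the recursive transpose). The names `transpose_flat` and `fft` are hypothetical.

```python
import cmath


def transpose_flat(x, rows, cols):
    """Transpose a row-major rows x cols matrix stored in a flat list."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]


def fft(x):
    """DFT of x, where len(x) is a power of 2 (assumed)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    n2 = 1 << ((n.bit_length() - 1) // 2)
    n1 = n // n2
    # Step 1: transpose the n1 x n2 input so each size-n1 DFT is contiguous.
    t = transpose_flat(x, n1, n2)
    # Steps 2 and 3: n2 DFTs of size n1, then the twiddle factors.
    for b in range(n2):
        row = fft(t[b * n1:(b + 1) * n1])
        for k1 in range(n1):
            t[b * n1 + k1] = row[k1] * cmath.exp(-2j * cmath.pi * b * k1 / n)
    # Step 4: transpose back to n1 x n2 so each size-n2 DFT is contiguous.
    u = transpose_flat(t, n2, n1)
    # Step 5: n1 DFTs of size n2.
    for k1 in range(n1):
        u[k1 * n2:(k1 + 1) * n2] = fft(u[k1 * n2:(k1 + 1) * n2])
    # Step 6: final transpose puts output k1 + n1 * k2 at its own index.
    return transpose_flat(u, n1, n2)


spectrum = fft([1, 2, 3, 4, 5, 6, 7, 8])
```

For n = 8 this split gives n1 = 4 and n2 = 2, and the inner size-4 transforms split again into n1 = 2, n2 = 2 before reaching the size-2 base case.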

24
Fast Fourier Transform
(Figure: worked example with n1 = 4, n2 = 2.)
  Transpose to select n2 DFTs of size n1.
  Call FFT recursively with n1 = 2, n2 = 2.
  Reach the base case and return.
  Multiply by the twiddle factors.
  Transpose to select n1 DFTs of size n2.
  Transpose and return.

25
Fast Fourier Transform
Cache complexity
Optimal for a Cooley-Tukey algorithm when n is an exact power of 2:
Q(n) = O(1 + (n/L)(1 + log_Z n))

26
Other Cache-Oblivious Algorithms
Funnelsort
Distribution sort
LU decomposition without pivots

28
Questions
How large is the range of practicality of cache-oblivious algorithms?
What are the relative strengths of cache-oblivious and cache-aware algorithms?

29
Practicality of Cache-Oblivious Algorithms
(Figure: average time to transpose an N x N matrix, divided by N^2.)

30
Practicality of Cache-Oblivious Algorithms (2)
(Figure: average time taken to multiply two N x N matrices, divided by N^3.)

31
Question 2
Do cache-oblivious algorithms perform as well as cache-aware algorithms?
FFTW library
No answer yet.

32
References
Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-Oblivious Algorithms. In the 40th Annual Symposium on Foundations of Computer Science (FOCS '99), 17-18 October 1999, New York, NY, USA.
Harald Prokop. Cache-Oblivious Algorithms. Master's Thesis, MIT Department of Electrical Engineering and Computer Science, June 1999.
Xiaoming Li and María Jesús Garzarán. Optimizing Matrix Multiplication with a Classifier Learning System. LCPC 2005.
