The Study of Cache Oblivious Algorithms Prepared by Jia Guo.

CS598dhp 2 Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

CS598dhp 3 Outline: cache complexity; cache-aware algorithms; cache-oblivious algorithms (matrix multiplication, matrix transposition, FFT); conclusion.

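CS598dhp 4 Assumption: only two levels of memory hierarchy. (1) An ideal cache that is fully associative, uses an optimal replacement strategy, and is a "tall cache", i.e. it holds more cache lines than there are words in one line, up to a constant factor. (2) A very large memory.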

CS598dhp 5 An Ideal Cache Model. An ideal cache is described by two parameters (Z, L): Z is the total number of words in the cache, and L is the number of words in one cache line.

CS598dhp 6 Cache Complexity. An algorithm with input size n is measured by its work complexity W(n) and by its cache complexity Q(n; Z, L), the number of cache misses it incurs.
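To make Q(n; Z, L) concrete, the following minimal sketch counts misses for a stream of word addresses. It is an illustration under stated assumptions, not the model from the slides: the ideal cache uses an optimal replacement strategy, whereas this sketch substitutes LRU replacement because it is simple to simulate. The function name `count_misses` is hypothetical.

```python
from collections import OrderedDict

def count_misses(addresses, Z, L):
    """Count misses of a fully associative cache holding Z words in lines of L words.

    Assumption: LRU replacement stands in for the optimal replacement
    strategy of the ideal cache model.
    """
    num_lines = Z // L
    cache = OrderedDict()  # keys are line numbers, least recently used first
    misses = 0
    for addr in addresses:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1                    # miss: load the line
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return misses

# A sequential scan of n words touches ceil(n / L) lines.
print(count_misses(range(1000), Z=64, L=8))        # 125
# A scan with stride L touches a new line on every access.
print(count_misses(range(0, 8000, 8), Z=64, L=8))  # 1000
```

The two calls show why access order matters: the same number of accesses costs n/L misses with spatial locality and n misses without it.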

CS598dhp 8 Cache-Aware Algorithms. They contain parameters that are tuned to minimize the cache complexity for a particular cache size (Z) and line length (L). The parameters need to be adjusted when running on a different platform.

CS598dhp 9 Example: a blocked matrix multiplication algorithm. The n x n matrix A is divided into s x s blocks (A11, ...), where s is a tuning parameter chosen to make the algorithm run fast.
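The blocked scheme can be sketched as follows. This is a minimal illustration, not the exact Block-MULT pseudocode: the function name `blocked_matmul` is hypothetical, matrices are plain lists of lists, and the `min(...)` bounds are an added assumption so that s does not have to divide n.

```python
def blocked_matmul(A, B, s):
    """Multiply two n x n matrices using s x s blocks; s is the tuning parameter."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):          # block row of C
        for j0 in range(0, n, s):      # block column of C
            for k0 in range(0, n, s):  # block index along the shared dimension
                # Ordinary multiplication restricted to one triple of blocks.
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C

print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], s=1))  # [[19, 22], [43, 50]]
```

The result does not depend on s; only the memory access order does, which is why s has to be retuned for each cache.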

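CS598dhp 10 Example (2): cache complexity. The three s x s submatrices should fit into the cache, so they occupy Θ(s²/L) cache lines. Optimal performance is obtained when s = Θ(√Z). Then Θ(Z/L) cache misses are needed to bring three submatrices into the cache, and this happens for each of the (n/s)³ = Θ((n/√Z)³) block multiplications. In addition, Θ(n²/L) cache misses are needed to read the n² elements. The total is Q(n) = Θ(1 + n²/L + n³/(L√Z)).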

CS598dhp 11 Outline: cache complexity; cache-aware algorithms; cache-oblivious algorithms (matrix multiplication, matrix transposition and FFT); conclusion.

CS598dhp 12 Cache-Oblivious Algorithms. They have no parameters that depend on the hardware, such as the cache size (Z) or the cache-line length (L), so they need no tuning and are platform independent. The algorithms introduced in the following slides are proved to have optimal cache complexity.

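CS598dhp 13 Matrix Multiplication. Given A (n x m) and B (m x p), halve the largest of the three dimensions and proceed recursively until reaching the base case of one element. Case n ≥ max(m, p): split A by rows into A1 and A2, and compute A1*B and A2*B. Case m ≥ max(n, p): split A by columns into A1 and A2 and B by rows into B1 and B2, and compute A1*B1 + A2*B2. Case p ≥ max(n, m): split B by columns into B1 and B2, and compute A*B1 and A*B2.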
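The three-way case split above can be sketched in a few lines. This is an illustration of the recursion structure only, under these assumptions: the names `rec_mult` and `multiply` are hypothetical, all dimensions are at least 1, and submatrices are addressed by offset and size rather than copied. Python lists of lists are not contiguous in memory, so the sketch does not reproduce the cache behavior of a real implementation.

```python
def rec_mult(A, B, C, i0, n, k0, m, j0, p):
    """Add A[i0:i0+n, k0:k0+m] * B[k0:k0+m, j0:j0+p] into C[i0:i0+n, j0:j0+p]."""
    if n == 1 and m == 1 and p == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]   # base case: one element
    elif n >= max(m, p):                      # split the rows of A
        h = n // 2
        rec_mult(A, B, C, i0, h, k0, m, j0, p)
        rec_mult(A, B, C, i0 + h, n - h, k0, m, j0, p)
    elif m >= max(n, p):                      # split the shared dimension
        h = m // 2
        rec_mult(A, B, C, i0, n, k0, h, j0, p)
        rec_mult(A, B, C, i0, n, k0 + h, m - h, j0, p)
    else:                                     # split the columns of B
        h = p // 2
        rec_mult(A, B, C, i0, n, k0, m, j0, h)
        rec_mult(A, B, C, i0, n, k0, m, j0 + h, p - h)

def multiply(A, B):
    """Multiply an n x m matrix by an m x p matrix with the recursive scheme."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C

print(multiply([[1, 2, 3], [4, 5, 6]], [[1, 0], [0, 1], [1, 1]]))  # [[4, 5], [10, 11]]
```

Note that no cache size or line length appears anywhere in the code, in contrast to the block size s of the cache-aware version.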

CS598dhp 14 Matrix Multiplication (2). Assume the sizes of A and B are n x 4n and 4n x n. Recursion tree: A*B = A1*B1 + A2*B2, where A1*B1 = A11*B11 + A12*B12 and A2*B2 = A21*B21 + A22*B22.

CS598dhp 15 Matrix Multiplication (3). Intuitively, once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further misses.

CS598dhp 16 Matrix Multiplication (4): cache complexity. The recursive algorithm can achieve the same cache complexity as the cache-aware Block-MULT algorithm. For square matrices, the optimal cache complexity is achieved.

CS598dhp 18 Matrix Transposition. Transpose an m x n matrix A into an n x m matrix B = A^T: for i ← 1 to m, for j ← 1 to n, B(j, i) = A(i, j). If n is very large, the column-wise access of B causes a cache miss every time (no spatial locality in B).

CS598dhp 19 Matrix Transposition (2). Partition array A along the longer dimension and recursively execute the transpose function. In block form, [A11 A12; A21 A22] transposes to [A11^T A21^T; A12^T A22^T].
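A minimal sketch of the recursive transpose described above, assuming a one-element base case and plain lists of lists. The names `rec_transpose` and `transpose` are hypothetical, and the sketch writes into a separate output matrix B rather than working in place. As with any list-of-lists code, it shows the recursion pattern rather than real cache behavior.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Write the transpose of A[i0:i0+m, j0:j0+n] into B[j0:j0+n, i0:i0+m]."""
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]              # base case: one element
    elif m >= n:                           # rows are the longer dimension
        h = m // 2
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)
    else:                                  # columns are the longer dimension
        h = n // 2
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)

def transpose(A):
    """Return the n x m transpose of an m x n matrix."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B

print(transpose([[1, 2, 3], [4, 5, 6]]))  # [[1, 4], [2, 5], [3, 6]]
```

A practical version would likely stop the recursion at a small block and use the simple double loop there to reduce call overhead; that cutoff is a performance choice and not part of the asymptotic argument.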

CS598dhp 20 Matrix Transposition (3): cache complexity. The recursive algorithm has the optimal cache complexity, Q(m, n) = Θ(1 + mn/L).

CS598dhp 21 Fast Fourier Transform. Use the Cooley-Tukey algorithm, which recursively re-expresses a DFT of composite size n = n1·n2 as three steps: (1) perform n2 DFTs of size n1; (2) multiply by complex roots of unity called twiddle factors; (3) perform n1 DFTs of size n2.

CS598dhp 22 [Figure: diagram with dimensions labeled n1 and n2.]

CS598dhp 23 Assume X is a row-major n1 × n2 matrix. Steps: (1) transpose X in place; (2) compute n2 DFTs of size n1; (3) multiply by twiddle factors; (4) transpose X in place; (5) compute n1 DFTs of size n2; (6) transpose X in place.
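The steps above can be sketched as a short recursive routine. This is an illustration under several assumptions rather than a definitive implementation: the names `dft_naive`, `transpose`, and `fft` are hypothetical, the input length must be a power of 2, n1 and n2 are chosen as close to √n as possible, the base case is a direct DFT on at most two points, and the transposes build new lists instead of working in place.

```python
import cmath

def dft_naive(x):
    """Direct O(n^2) DFT, used as the base case and as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def transpose(flat, rows, cols):
    """Transpose a row-major rows x cols matrix stored as a flat list (not in place)."""
    return [flat[r * cols + c] for c in range(cols) for r in range(rows)]

def fft(x):
    """Cooley-Tukey FFT organized around transposes; len(x) must be a power of 2."""
    n = len(x)
    if n <= 2:
        return dft_naive(x)
    k = n.bit_length() - 1            # n = 2**k
    n1 = 1 << ((k + 1) // 2)          # n1 = 2**ceil(k/2)
    n2 = n // n1                      # n2 = 2**floor(k/2)
    a = transpose(x, n1, n2)          # now n2 rows of length n1
    for j2 in range(n2):
        row = fft(a[j2 * n1:(j2 + 1) * n1])          # one of n2 DFTs of size n1
        for i1 in range(n1):
            # multiply by the twiddle factor for (i1, j2)
            a[j2 * n1 + i1] = row[i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
    a = transpose(a, n2, n1)          # now n1 rows of length n2
    for i1 in range(n1):
        a[i1 * n2:(i1 + 1) * n2] = fft(a[i1 * n2:(i1 + 1) * n2])  # size-n2 DFT
    return transpose(a, n1, n2)       # restore the natural output order

x = [1, 2, 3, 4, 0, -1, 2.5, 7]
print(max(abs(u - v) for u, v in zip(fft(x), dft_naive(x))) < 1e-9)  # True
```

Each transpose makes the inputs of the next batch of DFTs contiguous, which is what lets the recursive calls work on data that fits in cache.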

CS598dhp 24 Fast Fourier Transform: example with n1 = 4, n2 = 2. Transpose to select n2 DFTs of size n1; call FFT recursively with n1 = 2, n2 = 2; reach the base case and return; multiply by the twiddle factors; transpose to select n1 DFTs of size n2; transpose and return.

CS598dhp 25 Fast Fourier Transform: cache complexity. Optimal for a Cooley-Tukey algorithm when n is an exact power of 2: Q(n) = O(1 + (n/L)(1 + log_Z n)).

CS598dhp 26 Other Cache-Oblivious Algorithms: funnelsort, distribution sort, and LU decomposition without pivots.

CS598dhp 28 Questions. How large is the range of practicality of cache-oblivious algorithms? What are the relative strengths of cache-oblivious and cache-aware algorithms?

CS598dhp 29 Practicality of Cache-Oblivious Algorithms. [Figure: average time to transpose an N x N matrix, divided by N².]

CS598dhp 30 Practicality of Cache-Oblivious Algorithms (2). [Figure: average time taken to multiply two N x N matrices, divided by N³.]

CS598dhp 31 Question 2. Do cache-oblivious algorithms perform as well as cache-aware algorithms? Example to consider: the FFTW library. No answer yet.

CS598dhp 32 References
Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-Oblivious Algorithms. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.
Harald Prokop. Cache-Oblivious Algorithms. Master's Thesis, MIT Department of Electrical Engineering and Computer Science, June 1999.
Xiaoming Li and María Jesús Garzarán. Optimizing Matrix Multiplication with a Classifier Learning System. LCPC 2005.
