Published by Jovan Parramore. Modified over 3 years ago.

1
The Study of Cache Oblivious Algorithms
Prepared by Jia Guo

2
CS598dhp 2 Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

3
Outline
- Cache complexity
- Cache-aware algorithms
- Cache-oblivious algorithms
  - Matrix multiplication
  - Matrix transposition
  - FFT
- Conclusion

4
Assumption
Only two levels of memory hierarchy:
- An ideal cache
  - Fully associative
  - Optimal replacement strategy
  - "Tall cache": Z = Ω(L^2), with Z and L as defined in the ideal cache model
- A very large memory

5
An Ideal Cache Model
An ideal cache is described by two parameters (Z, L):
- Z: total number of words in the cache
- L: number of words in one cache line

6
Cache Complexity
An algorithm with input size n is measured by:
- Work complexity W(n): its conventional running time.
- Cache complexity Q(n; Z, L): the number of cache misses it incurs on an ideal cache of size Z with line length L.
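The notion of counting misses Q(n; Z, L) can be made concrete with a small simulator. The following is only a sketch: the function name `count_misses` is hypothetical, and it assumes LRU replacement as a stand-in for the optimal replacement strategy of the ideal cache, which cannot be computed online.

```python
from collections import OrderedDict


def count_misses(addresses, Z, L):
    """Count cache misses for a sequence of word addresses.

    Z is the cache size in words and L the line length in words.
    Assumption: LRU replacement approximates the ideal cache's
    optimal replacement; the cache is fully associative.
    """
    num_lines = Z // L
    cache = OrderedDict()  # line index -> None, least recently used first
    misses = 0
    for addr in addresses:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict least recently used line
            cache[line] = None
    return misses


n, Z, L = 1024, 256, 8
# Sequential scan: one miss per cache line touched.
sequential_misses = count_misses(range(n), Z, L)
# Stride-L scan over the same n words: each access lands on a different line.
strided = [(i * L) % n + (i * L) // n for i in range(n)]
strided_misses = count_misses(strided, Z, L)
```

Both scans touch the same n words, but the strided order cycles through more lines than the cache holds, so it incurs far more misses than the sequential scan.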

8
Cache-Aware Algorithms
- Contain parameters that are tuned to minimize the cache complexity for a particular cache size (Z) and line length (L).
- The parameters must be re-tuned when the algorithm runs on a different platform.

9
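Example: A Blocked Matrix Multiplication Algorithm
- Each n x n matrix is partitioned into s x s submatrices (blocks), such as A11 in matrix A.
- s is a tuning parameter chosen to make the algorithm run fast.
[Figure: n x n matrix A partitioned into s x s blocks such as A11]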

10
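Example (2)
Cache complexity (following the analysis in Frigo et al.)
- The three s x s submatrices should fit into the cache, so they occupy Θ(s + s^2/L) cache lines.
- Optimal performance is obtained when s = Θ(√Z).
- Θ(Z/L) cache misses are needed to bring the 3 submatrices into cache.
- Θ(n^2/L) cache misses are needed to read the n^2 elements.
- The overall cache complexity is Θ(1 + n^2/L + n^3/(L√Z)).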
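A minimal sketch of a blocked multiplication with tuning parameter s, assuming square matrices stored as lists of lists. The name `block_mult` and the loop order inside a block are illustrative choices, not taken from the slides.

```python
def block_mult(A, B, s):
    """Multiply two n x n matrices block by block with block size s.

    s is the cache-aware tuning parameter: it should be chosen so that
    three s x s blocks fit in the cache together.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Multiply block A[i0.., k0..] by block B[k0.., j0..]
                # and accumulate into block C[i0.., j0..].
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The result does not depend on s; only the memory access pattern does, which is why s has to be re-tuned per platform.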

12
Cache-Oblivious Algorithms
- Have no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L).
- Need no tuning and are platform independent.
- The algorithms introduced in the following slides are proved to have optimal cache complexity.

13
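Matrix Multiplication
- A is n x m and B is m x p.
- Split A and B in half along the largest of the three dimensions:
  - n ≥ max(m, p): split A into top and bottom halves A1, A2; then AB = [A1 B; A2 B].
  - m ≥ max(n, p): split A into left and right halves A1, A2 and B into top and bottom halves B1, B2; then AB = A1 B1 + A2 B2.
  - p ≥ max(n, m): split B into left and right halves B1, B2; then AB = [A B1, A B2].
- Recurse until the base case of one element is reached.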
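The three-way case split can be sketched as follows, assuming non-empty matrices stored as lists of lists. The names `rec_mult` and `go` are hypothetical; the recursion works on index ranges instead of copying submatrices.

```python
def rec_mult(A, B):
    """Multiply an n x m matrix A by an m x p matrix B by recursive halving.

    No cache parameter appears anywhere: the recursion always halves the
    largest of the three dimensions until a single element remains.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]

    def go(i0, rows, k0, inner, j0, cols):
        # Accumulates A[i0:i0+rows, k0:k0+inner] * B[k0:k0+inner, j0:j0+cols]
        # into C[i0:i0+rows, j0:j0+cols].
        if rows == 1 and inner == 1 and cols == 1:
            C[i0][j0] += A[i0][k0] * B[k0][j0]
        elif rows >= max(inner, cols):
            h = rows // 2  # split A into top and bottom halves
            go(i0, h, k0, inner, j0, cols)
            go(i0 + h, rows - h, k0, inner, j0, cols)
        elif inner >= max(rows, cols):
            h = inner // 2  # split A left/right and B top/bottom
            go(i0, rows, k0, h, j0, cols)
            go(i0, rows, k0 + h, inner - h, j0, cols)
        else:
            h = cols // 2  # split B into left and right halves
            go(i0, rows, k0, inner, j0, h)
            go(i0, rows, k0, inner, j0 + h, cols - h)

    go(0, n, 0, m, 0, p)
    return C
```

A practical version would stop the recursion at a small block and use a plain loop there; the one-element base case is kept here to mirror the description.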

14
Matrix Multiplication (2)
Assume A is n x 4n and B is 4n x n. The recursion splits the shared dimension twice:
  A*B   = A1*B1 + A2*B2
  A1*B1 = A11*B11 + A12*B12
  A2*B2 = A21*B21 + A22*B22

15
Matrix Multiplication (3)
Intuitively, once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further misses.

16
Matrix Multiplication (4)
Cache complexity
- Q(n, m, p) = Θ(n + m + p + (nm + mp + np)/L + nmp/(L√Z)), as given in Frigo et al.
- This matches the cache complexity of the cache-aware Block-MULT algorithm.
- For square matrices, the optimal cache complexity is achieved.

18
Matrix Transposition
A (m x n) → B = A^T (n x m)

  for i ← 1 to m
      for j ← 1 to n
          B(j, i) = A(i, j)

If n is very large, accessing B column by column causes a cache miss on every access (no spatial locality in B).

19
Matrix Transposition (2)
Partition array A along its longer dimension and recursively apply the transpose function.

  A = [ A11  A12 ]      A^T = [ A11^T  A21^T ]
      [ A21  A22 ]            [ A12^T  A22^T ]
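A short sketch of the recursive transpose, assuming a non-empty matrix stored as a list of lists and an out-of-place result. The names `rec_transpose` and `go` are illustrative.

```python
def rec_transpose(A):
    """Return the transpose of an m x n matrix by recursive halving.

    The longer dimension of the current submatrix is split in half at
    each step, so sub-blocks eventually fit in cache without the code
    knowing Z or L.
    """
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]

    def go(i0, rows, j0, cols):
        # Transposes A[i0:i0+rows, j0:j0+cols] into B[j0:j0+cols, i0:i0+rows].
        if rows == 1 and cols == 1:
            B[j0][i0] = A[i0][j0]
        elif rows >= cols:
            h = rows // 2  # split along the row dimension
            go(i0, h, j0, cols)
            go(i0 + h, rows - h, j0, cols)
        else:
            h = cols // 2  # split along the column dimension
            go(i0, rows, j0, h)
            go(i0, rows, j0 + h, cols - h)

    go(0, m, 0, n)
    return B
```

Applied to a 2 x 3 matrix, `rec_transpose([[1, 2, 3], [4, 5, 6]])` yields the 3 x 2 matrix with rows and columns swapped.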

20
Matrix Transposition (3)
Cache complexity
- The recursive algorithm has the optimal cache complexity Q(m, n) = Θ(1 + mn/L).

21
Fast Fourier Transform
Use the Cooley-Tukey algorithm, which recursively re-expresses a DFT of composite size n = n1 n2 as:
1. Perform n2 DFTs of size n1.
2. Multiply by complex roots of unity called twiddle factors.
3. Perform n1 DFTs of size n2.

22
[Figure: the input array viewed as a matrix with dimensions n1 and n2]

23
Assume X is a row-major n1 x n2 matrix.
Steps:
1. Transpose X in place.
2. Compute n2 DFTs of size n1.
3. Multiply by twiddle factors.
4. Transpose X in place.
5. Compute n1 DFTs of size n2.
6. Transpose X in place.
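The six steps above can be sketched as a recursive function. This is a simplified illustration under stated assumptions: the input length is a power of 2, the transposes are done out of place through a hypothetical helper `transpose` (the slides call for in-place transposes), and n1 is chosen as roughly the square root of n.

```python
import cmath


def transpose(x, rows, cols):
    """Return the row-major cols x rows transpose of a row-major rows x cols array."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]


def fft(x):
    """DFT of a sequence whose length is a power of 2, via the six-step scheme."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    log_n = n.bit_length() - 1
    n1 = 1 << ((log_n + 1) // 2)  # n1 is about sqrt(n)
    n2 = n // n1
    # Step 1: transpose the n1 x n2 input; rows now hold the size-n1 inputs.
    t = transpose(x, n1, n2)
    # Step 2: n2 DFTs of size n1, one per row.
    rows = [fft(t[r * n1:(r + 1) * n1]) for r in range(n2)]
    # Step 3: multiply entry (j2, i1) by the twiddle factor exp(-2*pi*i*i1*j2/n).
    t = [rows[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
         for j2 in range(n2) for i1 in range(n1)]
    # Step 4: transpose back to n1 x n2 so the next inputs are contiguous.
    t = transpose(t, n2, n1)
    # Step 5: n1 DFTs of size n2, one per row.
    rows = [fft(t[r * n2:(r + 1) * n2]) for r in range(n1)]
    t = [v for row in rows for v in row]
    # Step 6: final transpose puts the outputs in natural order.
    return transpose(t, n1, n2)
```

The sign convention used here is the forward DFT, Y[k] = sum over j of X[j] exp(-2πi jk/n); the slides do not fix a convention, so this is an assumption of the sketch.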

24
Fast Fourier Transform
[Figure: worked example with n1 = 4, n2 = 2. Annotations:]
- Transpose to select n2 DFTs of size n1
- Call FFT recursively with n1 = 2, n2 = 2
- Reach the base case, return
- Multiply by twiddle factors
- Transpose to select n1 DFTs of size n2
- Transpose and return

25
Fast Fourier Transform
Cache complexity
- Q(n) = O(1 + (n/L)(1 + log_Z n))
- Optimal for a Cooley-Tukey algorithm when n is an exact power of 2.

26
Other Cache-Oblivious Algorithms
- Funnelsort
- Distribution sort
- LU decomposition without pivoting

28
Questions
- How large is the range of practicality of cache-oblivious algorithms?
- What are the relative strengths of cache-oblivious and cache-aware algorithms?

29
Practicality of Cache-Oblivious Algorithms
[Figure: average time to transpose an N x N matrix, divided by N^2]

30
Practicality of Cache-Oblivious Algorithms (2)
[Figure: average time taken to multiply two N x N matrices, divided by N^3]

31
Question 2
Do cache-oblivious algorithms perform as well as cache-aware algorithms?
- FFTW library
- No answer yet.

32
References
- Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-Oblivious Algorithms. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October 1999, New York, NY, USA.
- Harald Prokop. Cache-Oblivious Algorithms. Master's Thesis, MIT Department of Electrical Engineering and Computer Science, June 1999.
- Xiaoming Li and María Jesús Garzarán. Optimizing Matrix Multiplication with a Classifier Learning System. LCPC 2005.
