The Study of Cache Oblivious Algorithms Prepared by Jia Guo.

CS598dhp 2 Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, October, 1999, New York, NY, USA.

Outline
- Cache complexity
- Cache-aware algorithms
- Cache-oblivious algorithms
  - Matrix multiplication
  - Matrix transposition
  - FFT
- Conclusion

Assumption
Only two levels in the memory hierarchy:
- An ideal cache
  - Fully associative
  - Optimal replacement strategy
  - "Tall cache"
- A very large memory

An Ideal Cache Model
An ideal cache model (Z, L):
- Z: total number of words in the cache
- L: number of words in one cache line
- Tall-cache assumption: Z = Ω(L²)

Cache Complexity
An algorithm with input size n is measured by:
- Work complexity W(n)
- Cache complexity Q(n; Z, L): the number of cache misses it incurs
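
To make the cache-complexity measure concrete, here is a minimal sketch that counts misses for a trace of word addresses on a fully associative cache of Z words with L-word lines. The names `count_misses`, `row_major_scan`, and `column_major_scan` are hypothetical, and LRU replacement is used as a stand-in for the optimal replacement strategy of the ideal cache, so the counts are an upper bound on the ideal-cache figure, not the exact value of Q.

```python
from collections import OrderedDict


def count_misses(addresses, Z, L):
    """Count cache misses for a trace of word addresses.

    Fully associative cache holding Z words in lines of L words.
    Assumption: LRU replacement stands in for optimal replacement.
    """
    resident = OrderedDict()   # line ids, least recently used first
    capacity = Z // L          # number of lines that fit in the cache
    misses = 0
    for addr in addresses:
        line = addr // L
        if line in resident:
            resident.move_to_end(line)        # hit: mark as most recent
        else:
            misses += 1                       # miss: load the line
            resident[line] = True
            if len(resident) > capacity:
                resident.popitem(last=False)  # evict least recently used
    return misses


def row_major_scan(n):
    """Addresses touched when reading an n x n row-major array row by row."""
    return [i * n + j for i in range(n) for j in range(n)]


def column_major_scan(n):
    """Addresses touched when reading the same array column by column."""
    return [i * n + j for j in range(n) for i in range(n)]
```

Comparing `count_misses(row_major_scan(n), Z, L)` with `count_misses(column_major_scan(n), Z, L)` for an array much larger than the cache shows how strongly the miss count depends on access order even though the work W(n) is identical.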

Cache-Aware Algorithms
- Contain parameters that are tuned to minimize the cache complexity for a particular cache size (Z) and line length (L).
- The parameters must be readjusted when running on a different platform.

Example: A blocked matrix multiplication algorithm
- s is a tuning parameter chosen to make the algorithm run fast.
- [Figure: an n x n matrix A partitioned into s x s blocks such as A11]

Example (2)
Cache complexity:
- The three s x s submatrices should fit into the cache, so they occupy Θ(s + s²/L) cache lines.
- Optimal performance is obtained when s = Θ(√Z).
- Z/L cache misses are needed to bring the 3 submatrices into the cache.
- n²/L cache misses are needed to read n² elements.
- The total cache complexity is Θ(1 + n²/L + n³/(L√Z)).
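
The blocked algorithm in this example can be sketched as follows. This is only an illustration of the loop structure under stated assumptions: the function name `blocked_matmul` is hypothetical, the matrices are square Python lists of lists, and `s` is the tuning parameter from the slide. Nested Python lists do not give the contiguous memory layout the analysis assumes, so the sketch shows the blocking pattern rather than real cache behavior.

```python
def blocked_matmul(A, B, s):
    """Multiply two n x n matrices using s x s blocks (cache-aware sketch).

    s is the tuning parameter: it should be chosen so that three
    s x s blocks fit in the cache of the target machine.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # Loop over blocks of C, and over the blocks of A and B that feed them.
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Ordinary triple loop restricted to one block product.
                for i in range(i0, min(i0 + s, n)):
                    row_a, row_c = A[i], C[i]
                    for k in range(k0, min(k0 + s, n)):
                        a = row_a[k]
                        row_b = B[k]
                        for j in range(j0, min(j0 + s, n)):
                            row_c[j] += a * row_b[j]
    return C
```

The result is the same for every value of `s`; only the order in which elements are touched changes, which is why `s` has to be retuned per platform.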

Cache-Oblivious Algorithms
- Have no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L).
- No tuning needed; platform independent.
- The algorithms introduced next are proved to have optimal cache complexity.

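Matrix Multiplication
- Partition matrices A and B in half along the largest dimension, where A is n x m and B is m x p.
- Proceed recursively until reaching the base case of one element.
- Three cases ([X; Y] stacks blocks vertically, [X Y] places them side by side):
  - n ≥ max(m, p): split A by rows, AB = [A1; A2] B = [A1 B; A2 B]
  - m ≥ max(n, p): split A by columns and B by rows, AB = [A1 A2] [B1; B2] = A1 B1 + A2 B2
  - p ≥ max(n, m): split B by columns, AB = A [B1 B2] = [A B1  A B2]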
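
The recursion described above can be sketched as follows. This is an illustrative version, not the authors' code: the names `rec_mult` and `multiply` are hypothetical, the matrices are non-empty lists of lists, and submatrices are passed as index ranges so that no data is copied. Note that no cache parameter appears anywhere.

```python
def rec_mult(A, B, C, i0, n, k0, m, j0, p):
    """Add A[i0:i0+n, k0:k0+m] * B[k0:k0+m, j0:j0+p] into C[i0:i0+n, j0:j0+p]."""
    if n == 1 and m == 1 and p == 1:
        # Base case: one element of each operand.
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif n >= max(m, p):
        # n is largest: split A by rows.
        h = n // 2
        rec_mult(A, B, C, i0, h, k0, m, j0, p)
        rec_mult(A, B, C, i0 + h, n - h, k0, m, j0, p)
    elif m >= max(n, p):
        # m is largest: split A by columns and B by rows, then add.
        h = m // 2
        rec_mult(A, B, C, i0, n, k0, h, j0, p)
        rec_mult(A, B, C, i0, n, k0 + h, m - h, j0, p)
    else:
        # p is largest: split B by columns.
        h = p // 2
        rec_mult(A, B, C, i0, n, k0, m, j0, h)
        rec_mult(A, B, C, i0, n, k0, m, j0 + h, p - h)


def multiply(A, B):
    """Return the product of an n x m matrix A and an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C
```

A practical implementation would likely stop the recursion at a small block and switch to a plain loop to reduce call overhead; the one-element base case here follows the slide.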

Matrix Multiplication (2)
Assume A is n x 4n and B is 4n x n. The recursion splits the shared dimension twice:
- A*B = A1*B1 + A2*B2
- A1*B1 = A11*B11 + A12*B12
- A2*B2 = A21*B21 + A22*B22

Matrix Multiplication (3)
Intuitively, once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further misses.

Matrix Multiplication (4)
Cache complexity:
- Matches the cache complexity of the cache-aware Block-MULT algorithm.
- For square matrices, the optimal cache complexity is achieved.

Matrix Transposition
Compute B = A^T, where A is m x n and B is n x m:
    for i ← 1 to m
        for j ← 1 to n
            B(j, i) = A(i, j)
If n is very large, the column-wise access of B causes a cache miss every time (no spatial locality in B).

Matrix Transposition (2)
Partition array A along the longer dimension and recursively execute the transpose function.
[A11 A12; A21 A22]^T = [A11^T A21^T; A12^T A22^T]
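
A minimal sketch of this recursive transposition, assuming a non-empty m x n matrix stored as a list of lists and an out-of-place result; the names `rec_transpose` and `transpose` are hypothetical.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Write the transpose of A[i0:i0+m, j0:j0+n] into B[j0:j0+n, i0:i0+m]."""
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]
    elif m >= n:
        # Rows are the longer dimension: split the row range in half.
        h = m // 2
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)
    else:
        # Columns are the longer dimension: split the column range in half.
        h = n // 2
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)


def transpose(A):
    """Return the n x m transpose of an m x n matrix A."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B
```

Because each call halves the longer side, the subproblems stay roughly square, so once one fits in the cache both its reads from A and its writes to B stay within cached lines.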

Matrix Transposition (3)
Cache complexity:
- It has optimal cache complexity.
- Q(m, n) = Θ(1 + mn/L)

Fast Fourier Transform
Use the Cooley-Tukey algorithm, which recursively re-expresses a DFT of composite size n = n1 n2 as:
- Perform n2 DFTs of size n1.
- Multiply by complex roots of unity called twiddle factors.
- Perform n1 DFTs of size n2.
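
The three steps above correspond to the standard Cooley-Tukey factorization, written out here as a worked equation. The symbols are assumptions introduced for illustration: X is the input, Y the output, and \omega_n = e^{-2\pi i/n}. The inner sum is a DFT of size n1, the middle factor is the twiddle factor, and the outer sum is a DFT of size n2.

```latex
% Input index j = j_1 n_2 + j_2, output index i = i_1 + i_2 n_1,
% with 0 <= i_1, j_1 < n_1 and 0 <= i_2, j_2 < n_2.
Y[i_1 + i_2 n_1]
  = \sum_{j_2=0}^{n_2-1}
    \left[
      \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\, \omega_{n_1}^{i_1 j_1} \right)
      \omega_{n}^{i_1 j_2}
    \right]
    \omega_{n_2}^{i_2 j_2}
```

The factorization follows from expanding \omega_n^{(j_1 n_2 + j_2)(i_1 + i_2 n_1)} and using \omega_n^{n_2} = \omega_{n_1}, \omega_n^{n_1} = \omega_{n_2}, and \omega_n^{n_1 n_2} = 1.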

[Figure: labels n2 and n1]

Assume X is a row-major n1 × n2 matrix.
Steps:
- Transpose X in place.
- Compute n2 DFTs.
- Multiply by twiddle factors.
- Transpose X in place.
- Compute n1 DFTs.
- Transpose X in place.
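
The steps above can be sketched as follows. This is a simplified illustration under several assumptions: the names `fft` and `transpose_flat` are hypothetical, the input length is a power of 2, the forward transform uses the root of unity exp(-2πi/n), n1 is chosen as roughly the square root of n, and the transposes build new lists instead of working in place as the slide specifies.

```python
import cmath


def transpose_flat(x, rows, cols):
    """Transpose a row-major rows x cols matrix stored as a flat list."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]


def fft(x):
    """DFT of a power-of-2 length sequence via transposes and recursive DFTs."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    lg = n.bit_length() - 1
    n1 = 1 << ((lg + 1) // 2)   # n1 = 2^ceil(lg n / 2)
    n2 = n // n1
    # Transpose the n1 x n2 input so each row holds one size-n1 subsequence.
    t = transpose_flat(x, n1, n2)
    for j2 in range(n2):
        # Compute one of the n2 DFTs of size n1, then apply twiddle factors.
        row = fft(t[j2 * n1:(j2 + 1) * n1])
        for i1 in range(n1):
            t[j2 * n1 + i1] = row[i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
    # Transpose back to n1 x n2 so each row holds one size-n2 subsequence.
    t = transpose_flat(t, n2, n1)
    for i1 in range(n1):
        # Compute one of the n1 DFTs of size n2.
        t[i1 * n2:(i1 + 1) * n2] = fft(t[i1 * n2:(i1 + 1) * n2])
    # Final transpose puts the output in natural order.
    return transpose_flat(t, n1, n2)
```

For a length-8 input this picks n1 = 4 and n2 = 2, and the recursive call on each size-4 row splits it as 2 × 2 before reaching the size-2 base case.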

Fast Fourier Transform
Example with n1 = 4, n2 = 2:
- Transpose to select n2 DFTs of size n1.
- Call FFT recursively with n1 = 2, n2 = 2.
- Reach the base case and return.
- Multiply by the twiddle factors.
- Transpose to select n1 DFTs of size n2.
- Transpose and return.

Fast Fourier Transform
Cache complexity:
- Optimal for a Cooley-Tukey algorithm when n is an exact power of 2.
- Q(n) = O(1 + (n/L)(1 + log_Z n))

Other Cache-Oblivious Algorithms
- Funnelsort
- Distribution sort
- LU decomposition without pivots

Questions
- How large is the range of practicality of cache-oblivious algorithms?
- What are the relative strengths of cache-oblivious and cache-aware algorithms?

Practicality of Cache-Oblivious Algorithms
[Figure: average time to transpose an N x N matrix, divided by N²]

Practicality of Cache-Oblivious Algorithms (2)
[Figure: average time taken to multiply two N x N matrices, divided by N³]

Question 2
Do cache-oblivious algorithms perform as well as cache-aware algorithms?
- FFTW library
- No answer yet.

References
- Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, October 1999, New York, NY, USA.
- Cache-Oblivious Algorithms by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June
- Optimizing Matrix Multiplication with a Classifier Learning System by Xiaoming Li and María Jesús Garzarán. LCPC 2005.