15-853: Algorithms in the Real World Locality II: Cache-oblivious algorithms – Matrix multiplication – Distribution sort – Static searching.

I/O Model
– Abstracts a single level of the memory hierarchy.
– Fast memory (cache) of size M. Accessing fast memory is free, but moving data from slow memory is expensive.
– Memory is grouped into size-B blocks of contiguous data, so the cache holds M/B blocks.
– Cost: the number of block transfers (or I/Os) from slow memory to fast memory.
[Figure: CPU attached to a fast memory of M/B size-B blocks, backed by slow memory.]

Cache-Oblivious Algorithms
– Algorithms not parameterized by B or M: these algorithms are unaware of the parameters of the memory hierarchy.
– Analyzed in the ideal cache model, which is the same as the I/O model except that optimal replacement is assumed.
– Optimal replacement means proofs may posit an arbitrary replacement policy, even defining an algorithm for selecting which blocks to load/evict.

Advantages of Cache-Oblivious Algorithms
– Since CO algorithms do not depend on memory parameters, bounds generalize to multilevel hierarchies.
– Algorithms are platform-independent.
– Algorithms remain effective even when B and M are not static.

Matrix Multiplication
Consider the standard iterative matrix-multiplication algorithm, Z := X·Y, where X, Y, and Z are N×N matrices:

  for i = 1 to N do
    for j = 1 to N do
      for k = 1 to N do
        Z[i][j] += X[i][k] * Y[k][j]

This is Θ(N³) computation in the RAM model. What about I/O?
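The triple loop can be rendered directly in Python; a minimal sketch for illustration only (real code would use a tuned library, and the function name is ours):

```python
def matmul_iterative(X, Y):
    """Multiply two N x N matrices (lists of lists) with the
    standard triple loop: Z[i][j] = sum_k X[i][k] * Y[k][j]."""
    N = len(X)
    Z = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                Z[i][j] += X[i][k] * Y[k][j]
    return Z
```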

How Are Matrices Stored?
How data is arranged in memory affects I/O performance. Suppose X, Y, and Z are all in row-major order.

  for i = 1 to N do
    for j = 1 to N do
      for k = 1 to N do
        Z[i][j] += X[i][k] * Y[k][j]

– If N ≥ B, reading a column of Y is expensive: Θ(N) I/Os.
– If N ≥ M, there is no locality across iterations for X and Y: Θ(N³) I/Os in total.

How Are Matrices Stored?
Suppose X and Z are in row-major order but Y is in column-major order. This is not too inconvenient: transposing Y is relatively cheap.

  for i = 1 to N do
    for j = 1 to N do
      for k = 1 to N do
        Z[i][j] += X[i][k] * Y[k][j]

– Scanning a row of X and a column of Y costs Θ(N/B) I/Os.
– If N ≥ M, there is still no locality across iterations for X and Y: Θ(N³/B) I/Os in total.
We can do much better than Θ(N³/B) I/Os, even if all matrices are row-major.

Recursive Matrix Multiplication
Summing two matrices with row-major layout is cheap: just scan the matrices in memory order. The cost is Θ(N²/B) I/Os to sum two N×N matrices, assuming N ≥ B.
Split each matrix into four N/2 × N/2 quadrants and compute the 8 submatrix products recursively:
  Z11 := X11·Y11 + X12·Y21
  Z12 := X11·Y12 + X12·Y22
  Z21 := X21·Y11 + X22·Y21
  Z22 := X21·Y12 + X22·Y22
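A sketch of the recursive scheme in Python, assuming N is a power of two (the function and helper names are ours; a practical version would recurse in place instead of copying quadrants):

```python
def matmul_recursive(X, Y):
    """Recursive divide-and-conquer multiply of N x N matrices:
    split into quadrants, compute the 8 subproducts, and sum."""
    N = len(X)
    if N == 1:
        return [[X[0][0] * Y[0][0]]]
    h = N // 2
    def quad(A, r, c):  # copy out the h x h quadrant at (r, c)
        return [row[c:c + h] for row in A[r:r + h]]
    def add(A, B):      # entrywise sum of two h x h matrices
        return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    X11, X12, X21, X22 = quad(X, 0, 0), quad(X, 0, h), quad(X, h, 0), quad(X, h, h)
    Y11, Y12, Y21, Y22 = quad(Y, 0, 0), quad(Y, 0, h), quad(Y, h, 0), quad(Y, h, h)
    Z11 = add(matmul_recursive(X11, Y11), matmul_recursive(X12, Y21))
    Z12 = add(matmul_recursive(X11, Y12), matmul_recursive(X12, Y22))
    Z21 = add(matmul_recursive(X21, Y11), matmul_recursive(X22, Y21))
    Z22 = add(matmul_recursive(X21, Y12), matmul_recursive(X22, Y22))
    top = [r1 + r2 for r1, r2 in zip(Z11, Z12)]
    bot = [r1 + r2 for r1, r2 in zip(Z21, Z22)]
    return top + bot
```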

Recursive Multiplication Analysis
  Mult(n) = 8·Mult(n/2) + Θ(n²/B)
  Mult(n₀) = O(M/B) when the n₀×n₀ submatrices of X, Y, and Z fit in memory.
The big question is the base case n₀.

Array storage
How many blocks does a size-N array occupy?
– If it is aligned on a block boundary (usually true for cache-aware algorithms), it takes exactly ⌈N/B⌉ blocks.
– If you are unlucky, it takes ⌈N/B⌉ + 1 blocks. This is generally what you must assume for cache-oblivious algorithms, since you cannot force alignment.
– In either case, it is Θ(1 + N/B) blocks.

Row-major matrix
If you look at the full N×N matrix, it is just a single array: rows appear one after the other in slow memory. So the entire matrix fits in N²/B + 1 = Θ(1 + N²/B) blocks.

Row-major submatrix
In a k×k submatrix, rows are not adjacent in slow memory. We need to treat it as k separate arrays, so the total number of blocks needed to store the submatrix is k(k/B + 1) = Θ(k + k²/B).

Row-major submatrix
Recall we had the recurrence
  Mult(N) = 8·Mult(N/2) + Θ(N²/B).  (1)
The question is when the base case occurs. Specifically, does a Θ(√M)×Θ(√M) matrix fit in cache, i.e., does it occupy at most M/B different blocks? If a Θ(√M)×Θ(√M) submatrix fits in cache, we can load the full submatrices into cache, stop the analysis at size Θ(√M), and treat the lower levels of the recursion as free:
  Mult(Θ(√M)) = Θ(M/B).  (2)
Solving (1) with (2) as a base case gives Mult(N) = Θ(N²/B + N³/(B√M)).
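Unrolling recurrence (1) with base case (2) shows why the bound holds; a sketch:

```latex
% Level i of the recursion tree has 8^i subproblems of size N/2^i, so it
% costs 8^i \cdot \Theta((N/2^i)^2/B) = \Theta(2^i N^2/B): a geometric
% series dominated by the last level L, where 2^L = \Theta(N/\sqrt{M}).
\begin{align*}
\mathrm{Mult}(N)
 &= \sum_{i=0}^{L-1} \Theta\!\left(\frac{2^i N^2}{B}\right)
    + 8^{L}\,\Theta\!\left(\frac{M}{B}\right) \\
 &= \Theta\!\left(\frac{N^2}{B}\right)
    + \Theta\!\left(\frac{N^2}{B}\cdot\frac{N}{\sqrt{M}}\right)
    + \Theta\!\left(\frac{N^3}{M^{3/2}}\cdot\frac{M}{B}\right)
  = \Theta\!\left(\frac{N^2}{B} + \frac{N^3}{B\sqrt{M}}\right).
\end{align*}
```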

Is that assumption correct?
Does a Θ(√M)×Θ(√M) matrix occupy at most Θ(M/B) different blocks? We have a formula from before: a k×k submatrix requires Θ(k + k²/B) blocks, so a Θ(√M)×Θ(√M) submatrix occupies roughly √M + M/B blocks.
The answer is "yes" only if Θ(√M + M/B) = Θ(M/B), i.e., iff √M ≤ M/B, or M ≥ B².
If "no," the base case of the analysis is broken: recursing into the submatrix will still require more I/Os.

Fixing the base case
Two fixes:
1. The "tall cache" assumption: M ≥ B². Then the base case is correct, completing the analysis.
2. Change the matrix layout.

Without the Tall-Cache Assumption
Try a better matrix layout. The algorithm is recursive, so use a layout that matches its recursive structure. For example, Z-Morton ordering: store each quadrant of the matrix contiguously, and recurse within each quadrant. (Drawing the curve that connects elements adjacent in memory traces out a recursive "Z" pattern.)
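One common way to realize a Z-Morton layout is bit interleaving; a sketch, assuming a 2^bits × 2^bits matrix (the function name is ours):

```python
def z_morton_index(i, j, bits):
    """Position of entry (i, j) in a Z-Morton layout of a
    2**bits x 2**bits matrix, obtained by interleaving the bits
    of i and j (i's bits in the more significant positions, so
    quadrants come out in the order Z11, Z12, Z21, Z22)."""
    idx = 0
    for b in range(bits):
        idx |= ((i >> b) & 1) << (2 * b + 1)  # row bit
        idx |= ((j >> b) & 1) << (2 * b)      # column bit
    return idx
```

Because each quadrant's indices form a contiguous range, every recursive submatrix of the multiply is contiguous in memory.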

Recursive MatMul with Z-Morton
The analysis becomes easier: each quadrant of the matrix is contiguous in memory, so a c√M × c√M submatrix fits in memory.
– The tall-cache assumption is not required to make this base case work.
The rest of the analysis is the same.

Searching: binary search is bad
Binary search hits a different block on every probe until the keyspace is reduced to size Θ(B). Thus the total cost is log₂N − Θ(log₂B) = Θ(log₂(N/B)) ≈ Θ(log₂N) for N ≫ B.
[Figure: binary search for element A over the sorted keys A–O with block size B = 2; each probe lands in a different block.]
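A small simulation of this argument: count the distinct size-B blocks a binary search touches, charging one block per probe (an illustrative model and function name, not part of the deck):

```python
def blocks_touched_by_binary_search(N, B, target_idx):
    """Number of distinct size-B blocks read while binary-searching
    a sorted array of N keys stored contiguously, assuming block
    boundaries at multiples of B."""
    blocks = set()
    lo, hi = 0, N - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        blocks.add(mid // B)       # probe index mid lives in this block
        if mid == target_idx:
            break
        if mid < target_idx:
            lo = mid + 1
        else:
            hi = mid - 1
    return len(blocks)
```

For N = 1024 and B = 2, searching for the smallest key touches 9 distinct blocks, matching log₂N − log₂B = 10 − 1.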

Static cache-oblivious searching
Goal: organize N keys in memory to facilitate efficient searching (the van Emde Boas layout):
1. Build a balanced binary tree on the keys.
2. Lay out the tree recursively in memory, splitting the tree at half its height: store the top subtree of ~√N nodes first, then each of the ~√N bottom subtrees of ~√N nodes, applying the same layout recursively within each piece.

Static layout example
For a 15-key balanced tree over A–O (root H, children D and L), the van Emde Boas layout stores the height-2 top subtree first, then the four height-2 bottom subtrees:
  H D L | B A C | F E G | J I K | N M O
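A sketch that reproduces this layout for a complete tree (the names Node, build_bst, and veb_order are ours; assumes 2^h − 1 keys):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def build_bst(keys):
    """Perfectly balanced BST over sorted keys (len = 2**h - 1)."""
    if not keys:
        return None
    m = len(keys) // 2
    return Node(keys[m], build_bst(keys[:m]), build_bst(keys[m + 1:]))

def veb_order(t, h):
    """van Emde Boas memory order of the height-h tree rooted at t:
    lay out the top half-height subtree, then each bottom subtree,
    recursively."""
    if h == 1:
        return [t.key]
    top_h = h // 2                      # split the tree at half its height
    def roots_at_depth(node, d):        # roots of the bottom subtrees
        if d == 0:
            return [node]
        return roots_at_depth(node.left, d - 1) + roots_at_depth(node.right, d - 1)
    order = veb_order(t, top_h)         # top subtree first
    for b in roots_at_depth(t, top_h):
        order += veb_order(b, h - top_h)  # then each bottom subtree
    return order
```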

Cache-oblivious searching: Analysis I
Consider the recursive subtrees of size √B to B on a root-to-leaf search path.
– Each such subtree is contiguous and fits in O(1) blocks.
– Each such subtree has height Θ(lg B), so there are Θ(log_B N) of them on the path.

Cache-oblivious searching: Analysis II
Alternatively, analyze using a recurrence that counts the number of random accesses:
  S(N) = 2S(√N) + O(1), with base case S(N) = O(1) for N < B,
which solves to O(log_B N).

Distribution sort outline
Analogous to multiway quicksort.
1. Split the input array into √N contiguous subarrays of size √N; sort the subarrays recursively.

Distribution sort outline
2. Choose √N − 1 "good" pivots p_1 ≤ p_2 ≤ … ≤ p_{√N−1}.
3. Distribute the subarrays into √N buckets according to the pivots: bucket i receives the elements between p_{i−1} and p_i, and each bucket ends up with between √N and 2√N elements.

Distribution sort outline
4. Recursively sort the buckets.
5. Copy the concatenated buckets back to the input array; the result is sorted.
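The five steps above can be sketched as runnable code. Pivot selection here is deliberately simplified (it fully sorts a sample), so this shows only the control flow of the outline, not the cache-oblivious cost; all names are ours:

```python
import bisect
import math

def distribution_sort(a):
    """Sketch of the distribution-sort outline: split, recursively
    sort subarrays, pick pivots, distribute into buckets, recursively
    sort buckets, concatenate."""
    n = len(a)
    if n <= 4:
        return sorted(a)
    m = math.isqrt(n)
    # 1. split into ~sqrt(n) contiguous subarrays; sort each recursively
    subarrays = [distribution_sort(a[i:i + m]) for i in range(0, n, m)]
    # 2. choose ~sqrt(n) - 1 pivots (simplified sampling)
    sample = sorted(x for s in subarrays for x in s)
    pivots = sample[m - 1::m][:m - 1]
    # 3. distribute every subarray element into its bucket
    buckets = [[] for _ in range(len(pivots) + 1)]
    for s in subarrays:
        for x in s:
            buckets[bisect.bisect_left(pivots, x)].append(x)
    # 4-5. recursively sort the buckets and concatenate
    out = []
    for b in buckets:
        # fall back to sorted() if a bucket did not shrink
        # (possible with many duplicate keys in this sketch)
        out.extend(distribution_sort(b) if len(b) < n else sorted(b))
    return out
```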

Distribution sort analysis sketch
– Step 1 (implicitly) divides the array and sorts √N size-√N subproblems.
– Step 4 sorts √N buckets of size √N ≤ n_i ≤ 2√N, with total size N.
– Step 5 copies back the output with a scan.
This gives the recurrence
  T(N) = √N·T(√N) + Σ_i T(n_i) + Θ(N/B) + (cost of steps 2 and 3)
       ≈ 2√N·T(√N) + Θ(N/B), with base case T(N) = O(1) for N < M,
which solves to Θ((N/B)·log_{M/B}(N/B)) if M ≥ B².

Missing steps
Step 2, choosing the √N − 1 "good" pivots p_1 ≤ p_2 ≤ … ≤ p_{√N−1}, is not too hard to do in Θ(N/B) I/Os. The interesting step is step 3: distributing the subarrays into buckets according to the pivots.

Naïve distribution
Distribute the first subarray, then the second, then the third, …
– The cost is only Θ(N/B) to scan the input array.
– What about writing to the output buckets? Suppose each subarray writes one element to each bucket. The cost is 1 I/O per write, for N I/Os in total!

Better recursive distribution
Given subarrays s_1,…,s_k and buckets b_1,…,b_k:
1. Recursively distribute s_1,…,s_{k/2} to b_1,…,b_{k/2}
2. Recursively distribute s_1,…,s_{k/2} to b_{k/2+1},…,b_k
3. Recursively distribute s_{k/2+1},…,s_k to b_1,…,b_{k/2}
4. Recursively distribute s_{k/2+1},…,s_k to b_{k/2+1},…,b_k
Despite the crazy order, each subarray operates left to right, so we only ever need to check the next pivot.
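A sketch of the four recursive calls with explicit per-subarray pointers (the data representation and names are our assumptions; a real implementation would move whole blocks rather than single elements):

```python
def distribute(subarrays, buckets, upper, ptr, si, bi, k):
    """Distribute subarrays si..si+k-1 into buckets bi..bi+k-1.
    `subarrays` are the sorted input runs, `upper[b]` is the exclusive
    upper pivot of bucket b (float('inf') for the last bucket), and
    `ptr[s]` records how far subarray s has been consumed. Assumes k
    is a power of two."""
    if k == 1:
        s = subarrays[si]
        # each subarray moves strictly left to right, so only the
        # current bucket's upper pivot ever needs to be checked
        while ptr[si] < len(s) and s[ptr[si]] < upper[bi]:
            buckets[bi].append(s[ptr[si]])
            ptr[si] += 1
        return
    h = k // 2
    distribute(subarrays, buckets, upper, ptr, si,     bi,     h)  # 1.
    distribute(subarrays, buckets, upper, ptr, si,     bi + h, h)  # 2.
    distribute(subarrays, buckets, upper, ptr, si + h, bi,     h)  # 3.
    distribute(subarrays, buckets, upper, ptr, si + h, bi + h, h)  # 4.
```

The recursion order guarantees that each subarray visits its buckets in increasing order, so the single pointer per subarray is sufficient.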

Distribute analysis
Counting only "random accesses" here:
  D(k) = 4D(k/2) + O(k)
Base case: when the next block of each of the k buckets/subarrays fits in memory (this is like an M/B-way merge), D(M/B) is free; with M = B², that is D(B) = 0.
This solves to D(k) = O(k²/B), so distribute uses O(N/B) random accesses for k = √N. The rest is scanning, at a cost of O(1/B) per element.

Note on distribute
If you unroll the recursion, it visits the subarray × bucket matrix in Z-Morton order: first distribute s_1 to b_1, then s_1 to b_2, then s_2 to b_1, then s_2 to b_2, and so on.