Funnel Sort*: Cache Efficiency and Parallelism

Presentation transcript:

Funnel Sort*: Cache Efficiency and Parallelism Paul Youn, 6.895 Final Project (5689FPPYaaceijllnnoortuu) * Developed by Frigo, Leiserson, et al.

Outline Cache Efficient vs Cache Oblivious Funnel Sort: How it achieves cache efficiency Performance: Serial Execution Cache Performance Parallel Implementation Future Possible Work Conclusions

Cache Efficiency and Sorting Cache Efficiency: Many sorting algorithms are cache oblivious, but not cache efficient. Take a 2-way recursive sorting algorithm such as Quick Sort or Merge Sort: at every level of the recursion all n elements are processed, which means all n elements are loaded into cache. If elements are loaded one line at a time, then every L elements loaded (L = number of elements per cache line) cost 1 cache miss: 1/L amortized cache misses per element.

Cache Performance With lg n levels of recursion, this implies an O((n/L) lg n) cache miss bound. If Z is the number of lines in the cache, Professor Bender’s lecture showed a cache-aware sorting algorithm with O((n/L) * (log n)/(log Z)) cache misses, obtained by doing an O(Z)-way merge sort. It works by decreasing the height of the recursion tree. Can we achieve this bound with a cache-oblivious algorithm? Yes!
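To make the gap between the two bounds concrete, here is a minimal sketch that evaluates both expressions with all constant factors dropped. The function names and the sample values of n, L, and Z are assumptions chosen for illustration, not measurements from the project.

```python
import math

def two_way_misses(n, L):
    """Estimate (n/L) * lg n: one pass over all lines per level, lg n levels."""
    return (n / L) * math.log2(n)

def z_way_misses(n, L, Z):
    """Estimate (n/L) * (lg n / lg Z): a Z-way merge has only log_Z n levels."""
    return (n / L) * math.log2(n) / math.log2(Z)

if __name__ == "__main__":
    # Hypothetical sizes: 2^24 elements, 16 elements per line, 2^10 cache lines.
    n, L, Z = 2**24, 16, 2**10
    print("2-way estimate:", two_way_misses(n, L))
    print("Z-way estimate:", z_way_misses(n, L, Z))
```

With these sample values the Z-way estimate is smaller by a factor of lg Z = 10, which is exactly the reduction in tree height.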

Funnel Sort Try to get as close to a Z-way merge as possible. Intuition: recursively lay out a K-way merger (let's call it a K-funnel) so that it consists of smaller funnels. It is important that all K-funnels be the same size no matter where they sit in the sort tree!

Note that all K-funnels will be of the same size regardless of the size of inputs. At some point, K is small enough that the K-way merge fits into cache.
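The recursive layout can be sketched by counting how much internal buffer space a K-funnel needs. This sketch assumes the construction from the cache-oblivious algorithms paper by Frigo et al.: a K-funnel is built from sqrt(K) child sqrt(K)-funnels feeding one top sqrt(K)-funnel, with one buffer of 2*K^(3/2) elements per child. The function name, the base case, and the rounding of sqrt(K) are assumptions for illustration.

```python
import math

def funnel_buffer_space(k, base=2):
    """Total elements of internal buffer space in a k-funnel (sketch).

    Assumes sqrt(k) child funnels plus one top funnel, each a sqrt(k)-funnel,
    and one buffer of 2 * k^(3/2) elements per child.
    """
    if k <= base:
        return 0  # a base-case merger is treated as having no internal buffers
    r = math.isqrt(k)            # sqrt(k), rounded down for non-squares
    buffers = r * (2 * r**3)     # r buffers of 2 * r^3 = 2 * k^(3/2) elements
    return buffers + (r + 1) * funnel_buffer_space(r, base)

if __name__ == "__main__":
    for k in (4, 16, 256):
        space = funnel_buffer_space(k)
        print(k, space, space / k**2)  # the ratio to k^2 stays bounded
```

The space depends only on K, not on the length of the inputs, and it is dominated by the top-level buffers. Once K is small enough that this total fits in cache, the whole funnel does.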

When the K-funnel fits in cache For some K, the entire K-funnel will fit in cache. Now, think of the original problem as a series of K-way merges, where K is close to Z.

Funnel Sort cache efficiency With appropriate constants for the buffer sizes, this cache-oblivious sorting algorithm achieves O((n/L) * (log n)/(log Z)) cache misses. It is provable that no better cache-oblivious bound exists.
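The top-level recursion can be sketched as follows, assuming the usual formulation in which the input is split into about n^(1/3) segments of about n^(2/3) elements each. Note the substitution: `heapq.merge` stands in for the K-funnel, so this sketch shows only the recursion structure and has none of the funnel's cache behavior. The function name and the `base` cutoff are hypothetical.

```python
import heapq

def funnel_sort_sketch(a, base=8):
    """Sort a list with the funnel sort recursion shape (not cache-oblivious)."""
    n = len(a)
    if n <= base:
        return sorted(a)                 # small inputs: fall back to built-in sort
    k = max(2, round(n ** (1 / 3)))      # about n^(1/3) segments
    size = -(-n // k)                    # ceil(n / k), about n^(2/3) elements each
    runs = [funnel_sort_sketch(a[i:i + size], base) for i in range(0, n, size)]
    # Stand-in for the k-funnel: an ordinary heap-based k-way merge.
    return list(heapq.merge(*runs))

if __name__ == "__main__":
    data = [(i * 7919) % 1000 for i in range(1000)]
    assert funnel_sort_sketch(data) == sorted(data)
    print("sorted", len(data), "elements")
```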

Performance: Serial Execution Actual implementation performed poorly! Runs significantly slower than Quicksort (about 4x as slow).

Why? Possible Reasons Bad implementation. Runtime not dominated by cache misses. Much more memory management than Quick Sort (Funnel Sort is not in place). Lots of calculations to keep track of buffers and how full they are. Cache dominated by the internal buffers of a funnel: a K-funnel uses K^2 memory for internal buffers and 2*K for input buffers. Rounding errors! Funnel sort frequently takes square roots and cube roots, which in practice yields an imbalanced funnel.
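One way to limit the rounding problem is to avoid floating-point roots entirely, since a floating-point cube root can land just below the exact value and truncate to the wrong integer. The sketch below computes an exact integer cube root and splits n elements into segments whose sizes differ by at most one. Both helper names are hypothetical, and this is not taken from the project's implementation.

```python
def icbrt(n):
    """Largest integer r with r**3 <= n, using only integer arithmetic."""
    if n < 0:
        raise ValueError("n must be non-negative")
    lo, hi = 0, 1
    while hi ** 3 <= n:          # grow hi until hi**3 exceeds n
        hi *= 2
    while hi - lo > 1:           # invariant: lo**3 <= n < hi**3
        mid = (lo + hi) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid
    return lo

def balanced_sizes(n, k):
    """Split n elements into k segment sizes that differ by at most one."""
    q, r = divmod(n, k)
    return [q + 1] * r + [q] * (k - r)

if __name__ == "__main__":
    n = 1000
    k = icbrt(n)
    print(k, balanced_sizes(n, k))
```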

Cache Performance Tested by changing the type of item being sorted from int to long. Doubling the element size halves the number of elements per cache line, so it should approximately double the number of cache misses! Yet a 4-way merge sort (Cilk Sort, implemented by Matteo Frigo) slows down by less than 5%, implying perhaps that cache misses are not a large cost. It appears that Funnel Sort suffers the smallest slowdown from increasing the size of the data (versus Quick Sort and Cilk Sort), but it is difficult to say accurately.

Parallelize Funnel Sort? Not terribly practical without a fast serial implementation, but the algorithm lends itself well to parallelism. Recursive merging can be done on separate processors. Because of buffering, the final merge can start shortly after the inputs start being processed. Could use locks on the circular buffers so that simultaneous reads from the head and writes to the tail are possible. With log(n)/log(Z) processors, possibly O(n) sorting?
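A shared circular buffer between a producing merger and a consuming merger might be sketched as below. This version uses a single lock with two condition variables, which is simpler than the separate head and tail locking suggested above and serializes each individual access. The class and helper names are hypothetical, and in CPython the threads illustrate the synchronization pattern rather than a real speedup.

```python
import threading

class CircularBuffer:
    """Bounded FIFO ring buffer shared by one producer and one consumer."""

    def __init__(self, capacity):
        self._items = [None] * capacity
        self._head = 0           # index of the next element to read
        self._count = 0          # number of elements currently stored
        lock = threading.Lock()
        self._not_empty = threading.Condition(lock)
        self._not_full = threading.Condition(lock)

    def put(self, item):
        with self._not_full:
            while self._count == len(self._items):
                self._not_full.wait()        # block while the buffer is full
            tail = (self._head + self._count) % len(self._items)
            self._items[tail] = item
            self._count += 1
            self._not_empty.notify()

    def get(self):
        with self._not_empty:
            while self._count == 0:
                self._not_empty.wait()       # block while the buffer is empty
            item = self._items[self._head]
            self._head = (self._head + 1) % len(self._items)
            self._count -= 1
            self._not_full.notify()
            return item

def pipe_through(values, capacity=4):
    """Push values through the buffer from a producer thread; return them in order."""
    buf, done = CircularBuffer(capacity), object()

    def producer():
        for v in values:
            buf.put(v)
        buf.put(done)                        # sentinel marks the end of the stream

    t = threading.Thread(target=producer)
    t.start()
    out = []
    while (item := buf.get()) is not done:
        out.append(item)
    t.join()
    return out

if __name__ == "__main__":
    print(pipe_through(list(range(10))))
```

The consumer starts receiving elements as soon as the producer has written any, which is the property that lets a downstream merge begin before its inputs are complete.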

Conclusions Possible to create a cache-oblivious sorting algorithm whose cache misses are on the order of those of a cache-aware algorithm. In practice, difficult to implement correctly. The extra overhead is difficult to recover through the reduced cache misses. Potentially very parallelizable.