A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 3: Parallel Prefix, Pack, and Sorting
Steve Wolfman, based on work by Dan Grossman

MOTIVATION: Pack

AKA filter, like getting all the elements less than the pivot in quicksort.

Given an array input, produce an array output containing only the elements for which f(elt) is true.

Example: input  [17, 4, 6, 8, 11, 5, 13, 19, 0, 24]
         f:     is elt > 10?
         output [17, 11, 13, 19, 24]

Parallelizable? Sure, using a list-concatenation reduction.

Efficiently parallelizable on arrays? Can we just put the output straight into the array at the right spots?

MOTIVATION: Pack as a map, reduce, prefix combo??

Given an array input, produce an array output containing only the elements for which f(elt) is true.

Example: input [17, 4, 6, 8, 11, 5, 13, 19, 0, 24]
         f:    is elt > 10?

Which pieces can we do as maps, reduces, or prefixes?

MOTIVATION: Parallel prefix sum to the rescue (if only we had it!)

1. Parallel map to compute a bit-vector of true elements:
   input  [17, 4, 6, 8, 11, 5, 13, 19, 0, 24]
   bits   [ 1, 0, 0, 0,  1, 0,  1,  1, 0,  1]
2. Parallel prefix-sum on the bit-vector:
   bitsum [ 1, 1, 1, 1,  2, 2,  3,  4, 4,  5]
3. Parallel map to produce the output:
   output [17, 11, 13, 19, 24]

   output = new array of size bitsum[n-1]
   FORALL(i=0; i < input.size(); i++) {
     if (bits[i])
       output[bitsum[i]-1] = input[i];
   }
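Putting the three phases together: below is a minimal sequential sketch of pack in C++ (the names pack and f are ours, not from the slides). Each marked loop is independent across iterations, so a real implementation would replace it with a parallel map, and the middle loop with the two-pass parallel prefix sum developed next.

  #include <vector>
  using std::vector;

  template <typename Pred>
  vector<int> pack(const vector<int>& input, Pred f) {
      int n = input.size();
      if (n == 0) return {};
      // Phase 1: map -- bit-vector marking the elements that pass the filter.
      vector<int> bits(n);
      for (int i = 0; i < n; i++)           // a parallel map in the real version
          bits[i] = f(input[i]) ? 1 : 0;
      // Phase 2: prefix sum over the bit-vector (sequential stand-in here).
      vector<int> bitsum(n);
      bitsum[0] = bits[0];
      for (int i = 1; i < n; i++)           // parallel prefix in the real version
          bitsum[i] = bitsum[i-1] + bits[i];
      // Phase 3: map -- each passing element already knows its output slot.
      vector<int> output(bitsum[n-1]);
      for (int i = 0; i < n; i++)           // a parallel map in the real version
          if (bits[i])
              output[bitsum[i]-1] = input[i];
      return output;
  }

For the running example, pack(input, [](int x){ return x > 10; }) yields [17, 11, 13, 19, 24].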

The prefix-sum problem

Given a list of integers as input, produce a list of integers as output where
  output[i] = input[0] + input[1] + … + input[i]

Sequential version is straightforward:

  vector<int> prefix_sum(const vector<int>& input) {
    vector<int> output(input.size());
    output[0] = input[0];
    for (int i = 1; i < input.size(); i++)
      output[i] = output[i-1] + input[i];
    return output;
  }

Example: input  [6, 4, 16, 10, 16, 14, 2, 8]
         output [6, 10, 26, 36, 52, 66, 68, 76]

The prefix-sum problem (continued)

Why isn't the sequential prefix_sum above (obviously) parallelizable? Isn't it just a map or reduce? (Hint: each output[i] depends on output[i-1], so the loop body cannot simply run in parallel.)

Work?  Span?

Let's just try D&C…

[Figure: input and output arrays with a binary tree of ranges drawn over the input: root (0,8); children (0,4) and (4,8); then (0,2), (2,4), (4,6), (6,8); leaves (0,1), (1,2), …, (7,8).]

So far, this is the same as every map or reduce we've done.

Let's just try D&C…

[Figure: the same range tree.]

What do we need to solve this problem?

Let's just try D&C…

[Figure: the same range tree.]

How about this problem?

Re-using what we know

[Figure: the same range tree, where every node now also stores the sum of its range.]

We already know how to do a D&C parallel sum (reduce with "+"). Does it help?

Example

[Figure: the same range tree, where every node stores its sum and a "fromleft" value; the root's fromleft is 0.]

Algorithm from [Ladner and Fischer, 1977]

Let's do just one branch (path to a leaf) first. That's what a fully parallel solution will do!

Parallel prefix-sum

The parallel-prefix algorithm does two passes:
1. build a "sum" tree bottom-up
2. traverse the tree top-down, accumulating the sum from the left

The algorithm, step 1

1. Step one does a parallel sum to build a binary tree:
   – The root has the sum of the range [0,n)
   – An internal node with the sum of [lo,hi) has
     a left child with the sum of [lo,middle) and
     a right child with the sum of [middle,hi)
   – A leaf has the sum of [i,i+1), i.e., input[i]
     (or of an appropriately larger region, with a cutoff)

How? Parallel sum, but explicitly build a tree:
   return left + right;
becomes
   return new Node(left->sum + right->sum, left, right);

Step 1: Work? Span?

The algorithm, step 2

2. Step two does a parallel map, passing down a fromLeft parameter:
   – The root gets a fromLeft of 0
   – Each internal node passes along:
     to its left child, the same fromLeft;
     to its right child, fromLeft plus its left child's sum
     (already calculated in step 1!)
   – At a leaf node for array position i: output[i] = fromLeft + input[i]

How? A map down the step-1 tree, leaving results in the output array.
Notice the invariant: fromLeft is the sum of the elements left of the node's range.

Step 2: Work? Span?
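Here is a minimal sketch of the two passes in C++, with plain recursion standing in for fork/join (Node, buildTree, passDown, and prefix_sum_parallel are our names, not from the slides):

  #include <vector>
  using std::vector;

  struct Node {
      long sum;            // sum of input[lo..hi)
      int lo, hi;          // the half-open range this node covers
      Node *left, *right;
  };

  // Pass 1: bottom-up. A parallel sum (reduce with "+") that keeps the tree.
  Node* buildTree(const vector<int>& input, int lo, int hi) {
      Node* n = new Node{0, lo, hi, nullptr, nullptr};
      if (hi - lo == 1) {                        // leaf: one element
          n->sum = input[lo];                    // (real code: a subrange below a cutoff)
      } else {
          int mid = lo + (hi - lo) / 2;
          n->left  = buildTree(input, lo, mid);  // these two calls would be forked
          n->right = buildTree(input, mid, hi);
          n->sum   = n->left->sum + n->right->sum;
      }
      return n;
  }

  // Pass 2: top-down. fromLeft = sum of everything left of this node's range.
  void passDown(const Node* n, long fromLeft,
                const vector<int>& input, vector<long>& output) {
      if (n->hi - n->lo == 1) {
          output[n->lo] = fromLeft + input[n->lo];
      } else {
          passDown(n->left, fromLeft, input, output);   // forked in parallel
          passDown(n->right, fromLeft + n->left->sum, input, output);
      }
  }

  vector<long> prefix_sum_parallel(const vector<int>& input) {
      vector<long> output(input.size());
      if (input.empty()) return output;
      Node* root = buildTree(input, 0, (int)input.size());
      passDown(root, 0, input, output);
      return output;                             // (the tree is leaked in this sketch)
  }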

Parallel prefix-sum

The parallel-prefix algorithm does two passes:
1. build a "sum" tree bottom-up
2. traverse the tree top-down, accumulating the sum from the left

Step 1: Work: O(n)   Span: O(lg n)
Step 2: Work: O(n)   Span: O(lg n)
Overall: Work? Span? Parallelism (work/span)?

In practice, of course, we'd use a sequential cutoff!

Parallelizing Quicksort

Recall quicksort was sequential, in-place, expected time O(n lg n).

                                                    Best/expected-case work
1. Pick a pivot element                             O(1)
2. Partition all the data into:                     O(n)
   A. the elements less than the pivot
   B. the pivot
   C. the elements greater than the pivot
3. Recursively sort A and C                         2T(n/2)

How do we parallelize this? What span do we get? T∞(n) = ?

Parallelizing Quicksort

Recall quicksort was sequential, in-place, expected time O(n lg n).

                                                    Best/expected-case work
1. Pick a pivot element                             O(1)
2. Partition all the data into:                     O(n)
   A. the elements less than the pivot
   B. the pivot
   C. the elements greater than the pivot
3. Recursively sort A and C                         2T(n/2)

How should we parallelize this?
– Parallelize the recursive calls, as we usually do in fork/join D&C.
– Parallelize the partition by doing two packs (filters) instead.
A sketch combining both ideas follows.
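Below is a minimal sketch of this plan, with std::copy_if as a sequential stand-in for the parallel pack (three packs rather than two, so duplicates of the pivot are handled; all names are ours). Note this version is not in-place; pack needs an auxiliary array anyway.

  #include <algorithm>
  #include <iterator>
  #include <vector>
  using std::vector;

  void quicksort(vector<int>& a) {
      if (a.size() <= 1) return;           // base case (real code: cutoff to
                                           // a sequential sort)
      int pivot = a[a.size() / 2];         // simple pivot choice
      vector<int> less, equal, greater;
      // The packs: each copy_if would be a parallel pack over the whole
      // array, and the packs could also run in parallel with each other.
      std::copy_if(a.begin(), a.end(), std::back_inserter(less),
                   [&](int x) { return x < pivot; });
      std::copy_if(a.begin(), a.end(), std::back_inserter(equal),
                   [&](int x) { return x == pivot; });
      std::copy_if(a.begin(), a.end(), std::back_inserter(greater),
                   [&](int x) { return x > pivot; });
      quicksort(less);                     // forked in the parallel version
      quicksort(greater);
      // Concatenate less ++ equal ++ greater back into a.
      std::copy(less.begin(), less.end(), a.begin());
      std::copy(equal.begin(), equal.end(), a.begin() + less.size());
      std::copy(greater.begin(), greater.end(),
                a.begin() + less.size() + equal.size());
  }

With both the packs and the recursive calls parallelized, the span recurrence becomes T∞(n) = O(lg n) + T∞(n/2), analyzed on the next slide.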

Analyzing T∞(n) = lg n + T∞(n/2)

Turns out our techniques from way back at the start of the term will work just fine for this:

  T∞(n) = lg n + T∞(n/2)   if n > 1
        = 1                 otherwise
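A quick expansion (ours, not on the slide): the recurrence unrolls as

  T∞(n) = lg n + lg(n/2) + lg(n/4) + … + 1
        = (lg n) + (lg n − 1) + (lg n − 2) + … + 1
        = Θ(lg² n)

So with expected work still O(n lg n), parallel quicksort's parallelism is O(n lg n) / Θ(lg² n) = O(n / lg n).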