
Parallel Sorting Sathish Vadhiyar

Sorting  Sorting n keys over p processors  Sort and move the keys to the appropriate processor so that every key on processor k is larger than every key on processor k-1  The number of keys on any processor should not be larger than (n/p + thres)  Communication-intensive due to large migration of data between processors

Bitonic Sort  One of the traditional algorithms for parallel sorting  Follows a divide-and-conquer algorithm  Also has nice properties – only a pair of processors communicate at each stage  Can be mapped efficiently to hypercube and mesh networks

Bitonic Sequence  Rearranges a bitonic sequence into a sorted sequence  Bitonic sequence – sequence of elements (a0,a1,a2,…,an-1) such that  Or there exists a cyclic shift of indices satisfying the above  E.g.: (1,2,4,7,6,0) or (8,9,2,1,0,4) a0 ai an-1

Using a bitonic sequence for sorting  Let s = (a0,a1,…,an-1) be a bitonic sequence such that a0 <= a1 <= … <= a n/2-1 and a n/2 >= a n/2+1 >= … >= a n-1  Consider S1 = (min(a 0,a n/2 ),min(a 1,a n/2+1 ),…,min(a n/2-1,a n-1 )) and S2 = (max(a 0,a n/2 ),max(a 1,a n/2+1 ),…,max(a n/2-1,a n-1 ))  Both are bitonic sequences  Every element of S1 is smaller than every element of S2
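The bitonic split into element-wise minima and maxima can be sketched as follows (a serial illustration; the function name is ours, not from the slides):

```python
def bitonic_split(s):
    """Split a bitonic sequence s of even length n into two bitonic
    halves: element-wise minima and maxima of s[i] and s[i + n/2]."""
    half = len(s) // 2
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    return s1, s2

# Example: (2, 4, 7, 6, 3, 1) is bitonic (rises, then falls)
lo, hi = bitonic_split([2, 4, 7, 6, 3, 1])
print(lo, hi)  # every element of lo is <= every element of hi
```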

Using a bitonic sequence for sorting  Thus, the initial problem of rearranging a bitonic sequence of size n is reduced to the problem of rearranging two smaller bitonic sequences of size n/2 and concatenating the results  This splitting operation is called a bitonic split  It is applied recursively until the size is 1, at which point the sequence is sorted; the number of split levels is log n  This procedure of sorting a bitonic sequence using bitonic splits is called a bitonic merge

Bitonic Merging Network  Takes a bitonic sequence and outputs it in sorted order; contains log n columns of compare-exchange elements  A bitonic merging network with n inputs is denoted BM[n]; a ⊕BM[n] network produces increasing order and a ⊖BM[n] network decreasing order

Sorting n unordered elements  By repeatedly merging bitonic sequences of increasing length  An unsorted sequence can be viewed as a concatenation of bitonic sequences of size two  Each stage merges adjacent bitonic sequences into alternately increasing and decreasing order, forming larger bitonic sequences

Bitonic Sort  Eventually obtain a bitonic sequence of size n which can be merged into a sorted sequence  Figure 9.8 in your book  Total number of stages, d(n) = d(n/2)+logn = O(log 2 n)  Total time complexity = O(nlog 2 n)

Parallel Bitonic Sort Mapping to a Hypercube  Imagine n processes (one element per process)  Each process id can be mapped to the corresponding node number of the hypercube  Communications between processes for compare-exchange operations are then always nearest-neighbor communications  In the ith step of the final stage, processes communicate along the (d-(i-1))th dimension  Figure 9.9 in the book

Parallel Bitonic Sort Mapping to a Mesh  Connectivity of a mesh is lower than that of a hypercube  One mapping is the row-major shuffled mapping  Processes that do frequent compare-exchanges are located close to each other

Mesh..  For example, processes that perform compare-exchange during every stage of bitonic sort are neighbors

Block of Elements per Process General

General..  For a given stage, a process communicates with only one other process  Communications are for only logP steps  In a given step i, the communicating process is determined by the ith bit

Drawbacks  Bitonic sort moves data between pairs of processes  Moves data O(logP) times  Bottleneck for large P

Sample Sort

Sample Sort  A sample of data of size s is collected from each processor; then samples are combined on a single processor  The processor produces p-1 splitters from the sp-sized sample; broadcasts the splitters to others  Using the splitters, processors send each key to the correct final destination

Parallel Sorting by Regular Sampling (PSRS) 1. Each processor sorts its local data 2. Each processor selects a sample vector of size p-1; the kth element is the local element at index (n/p)(k+1)/p 3. The samples are sent to and merge-sorted on processor 0 4. Processor 0 defines a vector of p-1 splitters from the sorted sample, starting from the element at index p/2; i.e., the kth splitter is the element at index p(k+1/2); it broadcasts these to the other processors

PSRS 5. Each processor sends its local data to the correct destination processors based on the splitters; an all-to-all exchange 6. Each processor merges the data chunks it receives

Step 5  Each processor finds where each of the p-1 pivots divides its list, using a binary search  i.e., finds the index of the largest element number larger than the jth pivot  At this point, each processor has p sorted sublists with the property that each element in sublist i is greater than each element in sublist i-1 in any processor

Step 6  Each processor i performs a p-way merge-sort to merge the ith sublists of p processors


Analysis  The first phase of local sorting takes O((n/p)log(n/p))  2 nd phase: Sorting p(p-1) elements in processor 0 – O(p 2 logp 2 ) Each processor performs p-1 binary searches of n/p elements – plog(n/p)  3 rd phase: Each processor merges (p-1) sublists Size of data merged by any processor is no more than 2n/p (proof) Complexity of this merge sort 2(n/p)logp  Summing up: O((n/p)logn)

Analysis  1 st phase – no communication  2 nd phase – p(p-1) data collected; p-1 data broadcast  3 rd phase: Each processor sends (p-1) sublists to other p-1 processors; processors work on the sublists independently

Analysis  Not scalable to large numbers of processors  Merging of the p(p-1) sample elements is done on one processor; at very large p that processor can require as much as 16 GB of memory

Sorting by Random Sampling  An interesting alternative; a random sample is flexible in size and collected randomly from each processor's local data  Advantage: a random sample can be retrieved before local sorting, allowing overlap between local sorting and splitter calculation

Sources/References  On the versatility of parallel sorting by regular sampling. Li et al. Parallel Computing.  Parallel sorting by regular sampling. Shi and Schaeffer. JPDC.  Highly scalable parallel sorting. Solomonik and Kale. IPDPS 2010.

 END

Bitonic Sort - Compare-splits  When dealing with a block of elements per process, instead of compare-exchange, use compare-split  i.e., each process first sorts its local elements; then each process in a pair sends all its elements to the other  Both processes perform the rearrangement with all the elements  Each process then keeps only its portion of the rearranged order, so nothing needs to be sent back  Reduces data communication latencies
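A minimal serial sketch of one compare-split between a pair of processes (in real MPI code each side would send/receive the partner's block and discard what it does not keep):

```python
def compare_split(my_block, partner_block, keep_small):
    """Compare-split on blocks: merge the two sorted blocks, then keep
    either the smaller half or the larger half of the merged order."""
    merged = sorted(my_block + partner_block)  # both inputs already sorted
    n = len(my_block)
    return merged[:n] if keep_small else merged[-n:]

a = [1, 4, 6, 9]
b = [2, 3, 7, 8]
print(compare_split(a, b, True))   # -> [1, 2, 3, 4]
print(compare_split(a, b, False))  # -> [6, 7, 8, 9]
```

Both processes compute the same merged order, so the low-keeping process and the high-keeping process end up with complementary halves without a second exchange.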

Block of elements and Compare Splits  Think of blocks as elements  The problem of sorting p blocks is identical to performing bitonic sort on the p blocks using compare-split operations  O(log^2 p) steps  At the end, all n elements are sorted, since compare-splits preserve the sorted order within each block  The n/p elements assigned to each process are sorted initially using a fast sequential algorithm

Block of Elements per Process Hypercube and Mesh  Similar to the one-element-per-process case, but now we have p blocks of size n/p, and compare-exchanges are replaced by compare-splits  Each compare-split takes O(n/p) computation and O(n/p) communication time  For the hypercube, the complexity is: O((n/p) log(n/p)) for the local sort, O((n/p) log^2 p) for computation, and O((n/p) log^2 p) for communication

Histogram Sort  Another splitter-based method  Histogram also determines a set of p-1 splitters  It achieves this task by taking an iterative approach rather than one big sample  A processor broadcasts k (> p-1) initial splitter guesses called a probe  The initial guesses are spaced evenly over data range

Histogram Sort Steps 1. Each processor sorts its local data 2. Each creates a histogram based on its local data and the splitter guesses 3. A reduction sums up the histograms 4. A processor analyzes which splitter guesses were satisfactory (in terms of load) 5. If any splitters are unsatisfactory, the processor broadcasts a new probe and we return to step 2; else we proceed to the next steps

Histogram Sort Steps 6. Each processor sends its local data to the appropriate processors – an all-to-all exchange 7. Each processor merges the data chunks it receives Merits:  Only moves the actual data once  Deals with uneven distributions
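Steps 2-4 can be sketched serially as follows (an illustrative toy with p = 2 processes and hand-picked values; the names and the probe are ours):

```python
import bisect

def local_histogram(sorted_block, probe):
    """How many local keys are <= each splitter guess in the probe."""
    return [bisect.bisect_right(sorted_block, g) for g in probe]

# Two "processes", n = 8 keys total, p = 2, so one splitter whose
# ideal global rank is n/p = 4.
blocks = [sorted([3, 9, 14, 20]), sorted([1, 7, 11, 18])]
probe = [6, 10, 13]  # k = 3 initial guesses

# reduction: sum the local histograms element-wise
counts = [sum(h) for h in zip(*(local_histogram(b, probe) for b in blocks))]
print(counts)  # -> [2, 4, 5]

# the guess whose global rank is closest to the ideal location wins;
# here guess 10 splits the data exactly at rank 4
best = min(zip(probe, counts), key=lambda gc: abs(gc[1] - 4))
print(best)    # -> (10, 4)
```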

Probe Determination  Should be efficient – done on one processor  The processor keeps track of bounds for all splitters Ideal location of a splitter i is (i+1)n/p When a histogram arrives, the splitter guesses are scanned

Probe Determination  A splitter can either Be a success – its location is within some threshold of the ideal location Or not – update the desired splitter to narrow the range for the next guess  Size of a generated probe depends on how many splitters are yet to be resolved  Any interval containing s unachieved splitters is subdivided with sxk/u guess where u is the total number of unachieved splitters and k is the number of newly generated splitters

Merging and all-to-all overlap  For merging the p arrays at the end, we can either  Iterate through all arrays simultaneously, or  Merge pairs of arrays using a binary tree  In the first case, we need all the arrays to have arrived  In the second case, we can start as soon as two arrays arrive  Hence the tree-based merging can be overlapped with the all-to-all

Radix Sort  During every step, the algorithm puts every key in a bucket corresponding to the value of some subset of the key’s bits  A k-bit radix sort looks at k bits every iteration  Easy to parallelize – assign some subset of buckets to each processor  Lad balance – assign variable number of buckets to each processor

Radix Sort – Load Balancing  Each processor counts how many of its keys will go to each bucket  These histograms are summed up with a reduction  Once a processor receives the combined histogram, it can adaptively assign buckets to processors

Radix Sort - Analysis  Requires multiple iterations of a costly all-to-all  Cache efficiency is low – any given key can move to any bucket, irrespective of the destination of the previously indexed key  This affects communication efficiency as well