# Parallel Sorting (Sathish Vadhiyar)


Sorting

- Sorting n keys over p processors
- Sort and move the keys to the appropriate processor so that every key on processor k is larger than every key on processor k-1
- The number of keys on any processor should not be larger than (n/p + thres), where thres is an allowed load-imbalance threshold
- Communication-intensive due to large migration of data between processors

Bitonic Sort

- One of the traditional algorithms for parallel sorting
- Follows a divide-and-conquer approach
- Also has nice properties: only a pair of processors communicate at each stage
- Can be mapped efficiently to hypercube and mesh networks

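Bitonic Sequence

- Bitonic sort rearranges a bitonic sequence into a sorted sequence
- Bitonic sequence: a sequence of elements $(a_0, a_1, \dots, a_{n-1})$ such that
  - there exists an index $i$, $0 \le i \le n-1$, with $a_0 \le a_1 \le \dots \le a_i$ and $a_i \ge a_{i+1} \ge \dots \ge a_{n-1}$
  - or there exists a cyclic shift of indices satisfying the above
- E.g.: (1,2,4,7,6,0) or (8,9,2,1,0,4)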

Using bitonic sequence for sorting

- Let $s = (a_0, a_1, \dots, a_{n-1})$ be a bitonic sequence such that $a_0 \le a_1 \le \dots \le a_{n/2-1}$ and $a_{n/2} \ge a_{n/2+1} \ge \dots \ge a_{n-1}$
- Consider $s_1 = (\min(a_0, a_{n/2}), \min(a_1, a_{n/2+1}), \dots, \min(a_{n/2-1}, a_{n-1}))$ and $s_2 = (\max(a_0, a_{n/2}), \max(a_1, a_{n/2+1}), \dots, \max(a_{n/2-1}, a_{n-1}))$
- Both are bitonic sequences
- Every element of $s_1$ is less than or equal to every element of $s_2$

Using bitonic sequence for sorting

- Thus, the initial problem of rearranging a bitonic sequence of size n is reduced to rearranging two smaller bitonic sequences and concatenating the results
- This splitting operation is called a bitonic split
- It is applied recursively until the size is 1, at which point the sequence is sorted; the number of splits is log n
- This procedure of sorting a bitonic sequence using bitonic splits is called bitonic merge
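
The split and merge described above can be sketched in a few lines of Python. This is a minimal sequential illustration, not the presenter's code; the function names `bitonic_split` and `bitonic_merge` are chosen here for illustration, and the input length is assumed to be a power of two. The sample input is the 16-element bitonic sequence used in the merging-network slide.

```python
def bitonic_split(seq):
    """Split a bitonic sequence of even length into (s1, s2).

    s1 holds the element-wise minima of the two halves, s2 the maxima.
    """
    half = len(seq) // 2
    s1 = [min(seq[i], seq[i + half]) for i in range(half)]
    s2 = [max(seq[i], seq[i + half]) for i in range(half)]
    return s1, s2


def bitonic_merge(seq):
    """Sort a bitonic sequence (length assumed to be a power of two)."""
    if len(seq) <= 1:
        return list(seq)
    s1, s2 = bitonic_split(seq)
    # Both halves are bitonic and every element of s1 <= every element of s2,
    # so sorting each half and concatenating sorts the whole sequence.
    return bitonic_merge(s1) + bitonic_merge(s2)


bitonic_input = [3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]
first_split = bitonic_split(bitonic_input)
merged = bitonic_merge(bitonic_input)
```

The recursion depth is log n, matching the number of splits stated above.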

Bitonic Merging Network

- Input (bitonic): 3 5 8 9 10 12 14 20 95 90 60 40 35 23 18 0
- After column 1: 3 5 8 9 10 12 14 0 95 90 60 40 35 23 18 20
- After column 2: 3 5 8 0 10 12 14 9 35 23 18 20 95 90 60 40
- After column 3: 3 0 8 5 10 9 14 12 18 20 35 23 60 40 95 90
- After column 4 (sorted): 0 3 5 8 9 10 12 14 18 20 23 35 40 60 90 95

Takes a bitonic sequence and outputs it in sorted order; contains log n columns of comparators (drawn as + in the original figure). A bitonic merging network with n inputs is denoted BM[n].

Sorting unordered n elements

- By repeatedly merging bitonic sequences of increasing length
- An unsorted sequence can be viewed as a concatenation of bitonic sequences of size two
- Each stage merges adjacent bitonic sequences alternately into increasing (+) and decreasing (-) order, forming a larger bitonic sequence
- [Figure: a 16-input sorting network built from successive stages of BM[2], BM[4], BM[8] and BM[16] merging networks]

Bitonic Sort

- Eventually we obtain a bitonic sequence of size n, which can be merged into a sorted sequence
- Figure 9.8 in your book
- Total number of stages: $d(n) = d(n/2) + \log n = O(\log^2 n)$
- Total time complexity: $O(n \log^2 n)$
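
As a rough illustration of the full sort, the sketch below runs the compare-exchange schedule sequentially and counts the columns of comparators. It assumes the input length is a power of two; the names `bitonic_sort` and `compare_exchange` are illustrative, not taken from the slides.

```python
def compare_exchange(a, i, j, ascending):
    """Order a[i], a[j] (i < j) ascending or descending in place."""
    if (a[i] > a[j]) == ascending:
        a[i], a[j] = a[j], a[i]


def bitonic_sort(keys):
    """Return (sorted copy of keys, number of compare-exchange columns)."""
    a = list(keys)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    columns = 0
    size = 2
    while size <= n:            # stage: merge bitonic sequences of this size
        dist = size // 2
        while dist >= 1:        # one column of the merging network
            for i in range(n):
                j = i ^ dist
                if j > i:
                    # Blocks alternate between increasing and decreasing
                    # merges; the last stage (size == n) is all increasing.
                    compare_exchange(a, i, j, (i & size) == 0)
            columns += 1
            dist //= 2
        size *= 2
    return a, columns


unsorted = [10, 20, 5, 9, 3, 8, 12, 14, 90, 0, 60, 40, 23, 35, 95, 18]
result, columns = bitonic_sort(unsorted)
```

For n = 16 the schedule has 1 + 2 + 3 + 4 = 10 columns, which is the log n (log n + 1)/2 behind the $O(\log^2 n)$ bound.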

Parallel Bitonic Sort: Mapping to a Hypercube

- Imagine N processes (one element per process)
- Each process id can be mapped to the corresponding node number of the hypercube
- Communications between processes for compare-exchange operations will always be between neighbors
- In the ith step of the final stage, processes communicate along the (d-(i-1))th dimension
- Figure 9.9 in the book

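Parallel Bitonic Sort: Mapping to a Mesh

- Connectivity of a mesh is lower than that of a hypercube
- One mapping is the row-major shuffled mapping
- Processes that do frequent compare-exchanges are located close together
- Row-major shuffled mapping of 16 processes onto a 4×4 mesh, rows listed top to bottom: (0, 1, 4, 5), (2, 3, 6, 7), (8, 9, 12, 13), (10, 11, 14, 15)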

Mesh..

- For example, processes that perform compare-exchange during every stage of bitonic sort are neighbors
- The mesh layout is the same 4×4 row-major shuffled mapping: (0, 1, 4, 5), (2, 3, 6, 7), (8, 9, 12, 13), (10, 11, 14, 15)
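
One way to generate the row-major shuffled layout is to de-interleave the bits of the process id: even-numbered bits form the column, odd-numbered bits form the row. The sketch below assumes a square mesh whose side is a power of two; `shuffled_position` and `shuffled_grid` are hypothetical helper names.

```python
def shuffled_position(pid, bits_per_axis):
    """Return (row, col) of process pid in a row-major shuffled mesh."""
    row = col = 0
    for k in range(bits_per_axis):
        col |= ((pid >> (2 * k)) & 1) << k        # even bits -> column
        row |= ((pid >> (2 * k + 1)) & 1) << k    # odd bits  -> row
    return row, col


def shuffled_grid(side):
    """Build the side x side mesh layout (side assumed a power of two)."""
    bits = side.bit_length() - 1
    grid = [[None] * side for _ in range(side)]
    for pid in range(side * side):
        r, c = shuffled_position(pid, bits)
        grid[r][c] = pid
    return grid


grid = shuffled_grid(4)
```

Processes whose ids differ only in the lowest bit, which compare-exchange in every stage, land in adjacent columns of the same row.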

Block of Elements per Process: General

[Figure: the 16-input bitonic merging network, from the bitonic input 3 5 8 9 10 12 14 20 95 90 60 40 35 23 18 0 to the sorted output 0 3 5 8 9 10 12 14 18 20 23 35 40 60 90 95]

General..

- For a given stage, a process communicates with only one other process
- Communication takes place in only log P steps
- In a given step i, the partner process is determined by the ith bit of the process id
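
The partner rule can be sketched as a bit flip on the process rank. The snippet assumes P is a power of two and bits are numbered from 0 at the least significant end; it lists the partners of one process during the final merge stage, highest dimension first. Function names are illustrative.

```python
def partner(rank, bit):
    """Rank of the process that differs from `rank` only in `bit`."""
    return rank ^ (1 << bit)


def final_stage_partners(rank, num_procs):
    """Partners of `rank` in the log P steps of the last merge stage."""
    d = num_procs.bit_length() - 1          # hypercube dimension, log2(P)
    return [partner(rank, bit) for bit in range(d - 1, -1, -1)]


schedule = final_stage_partners(5, 8)
```

With 8 processes, rank 5 (binary 101) talks to ranks 1, 7 and 4 in turn, one partner per step.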

Drawbacks

- Bitonic sort moves data between pairs of processes
- Moves data O(log P) times
- Bottleneck for large P

Sample Sort

Sample Sort

- A sample of data of size s is collected from each processor; the samples are then combined on a single processor
- That processor produces p-1 splitters from the sp-sized sample and broadcasts the splitters to the others
- Using the splitters, processors send each key to its correct final destination

Parallel Sorting by Regular Sampling (PSRS)

1. Each processor sorts its local data
2. Each processor selects a sample vector of size p-1; the kth element is the one at local index (n/p) * (k+1)/p
3. Samples are sent to and merge-sorted on processor 0
4. Processor 0 defines a vector of p-1 splitters starting from the p/2-th element; i.e., the kth splitter is element p(k + 1/2) of the sorted sample; it broadcasts the splitters to the other processors

PSRS

5. Each processor sends local data to the correct destination processors based on the splitters; all-to-all exchange
6. Each processor merges the data chunks it receives

Step 5

- Each processor finds where each of the p-1 pivots divides its list, using a binary search
- i.e., it finds the index of the largest element no larger than the jth pivot
- At this point, each processor has p sorted sublists with the property that each element in sublist i is greater than each element in sublist i-1 on any processor

Step 6

- Each processor i performs a p-way merge to combine the ith sublists of the p processors
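
The six PSRS steps can be simulated sequentially as below. This is a sketch under stated assumptions rather than a parallel implementation: each "processor" is a Python list, the gather, broadcast and all-to-all are plain list operations, every block is assumed non-empty, and sample and splitter indices use 0-based integer division. The function name `psrs` is illustrative.

```python
import bisect
import heapq


def psrs(blocks):
    """Simulate PSRS; blocks[q] is the unsorted data of processor q."""
    p = len(blocks)
    # Step 1: every processor sorts its local data.
    local = [sorted(block) for block in blocks]
    # Step 2: p-1 regular samples per processor, at index (n/p)*(k+1)/p.
    samples = []
    for block in local:
        m = len(block)
        samples.extend(block[m * (k + 1) // p] for k in range(p - 1))
    # Step 3: samples are gathered and sorted on "processor 0".
    samples.sort()
    # Step 4: p-1 splitters, taken at index p*k + p/2 of the sorted sample.
    splitters = [samples[p * k + p // 2] for k in range(p - 1)]
    # Step 5: binary search splits each local list into p sublists.
    pieces = []
    for block in local:
        cuts = [0] + [bisect.bisect_right(block, s) for s in splitters]
        cuts.append(len(block))
        pieces.append([block[cuts[j]:cuts[j + 1]] for j in range(p)])
    # Step 6: processor i merges the ith sublist of every processor.
    return [list(heapq.merge(*(pieces[q][i] for q in range(p))))
            for i in range(p)]


data = [(i * 37) % 101 for i in range(64)]
blocks = [data[i:i + 16] for i in range(0, 64, 16)]
sorted_blocks = psrs(blocks)
```

Concatenating `sorted_blocks` in processor order gives the globally sorted sequence.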


Analysis

- The first phase of local sorting takes $O((n/p)\log(n/p))$
- 2nd phase:
  - Sorting p(p-1) elements on processor 0: $O(p^2 \log p^2)$
  - Each processor performs p-1 binary searches over n/p elements: $O(p \log(n/p))$
- 3rd phase:
  - Each processor merges (p-1) sublists
  - Size of data merged by any processor is no more than 2n/p (proof)
  - Complexity of this merge: $O(2(n/p)\log p)$
- Summing up: $O((n/p)\log n)$
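
The summation can be spelled out term by term. The step below assumes $n \ge p^3$, a condition that is not stated on the slide but under which the terms that depend on p alone are dominated by the local work:

$$
\underbrace{\frac{n}{p}\log\frac{n}{p}}_{\text{local sort}}
+ \underbrace{p^2\log p^2}_{\text{sorting samples}}
+ \underbrace{p\log\frac{n}{p}}_{\text{binary searches}}
+ \underbrace{\frac{2n}{p}\log p}_{\text{final merge}}
= O\!\left(\frac{n}{p}\log n\right)
$$

The first and last terms combine because $\frac{n}{p}\log\frac{n}{p} + \frac{n}{p}\log p = \frac{n}{p}\log n$, while $p^2\log p^2 = 2p^2\log p \le 2\frac{n}{p}\log p$ and $p\log\frac{n}{p} \le \frac{n}{p}\log\frac{n}{p}$ under the assumption.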

Analysis

- 1st phase: no communication
- 2nd phase: p(p-1) data items collected; p-1 data items broadcast
- 3rd phase: each processor sends (p-1) sublists to the other p-1 processors; processors then work on the sublists independently

Analysis

- Not scalable for a large number of processors
- Merging of p(p-1) elements is done on one processor; 16384 processors require 16 GB memory

Sorting by Random Sampling

- An interesting alternative; the random sample is flexible in size and collected randomly from each processor's local data
- Advantage: a random sample can be retrieved before local sorting, allowing overlap between sorting and splitter calculation

Sources/References

- Li et al. On the versatility of parallel sorting by regular sampling. Parallel Computing, 1993.
- Shi and Schaeffer. Parallel sorting by regular sampling. JPDC, 1992.
- Solomonik and Kale. Highly scalable parallel sorting. IPDPS, 2010.

END

Bitonic Sort - Compare-splits

- When dealing with a block of elements per process, use compare-split instead of compare-exchange
- i.e., each process sorts its local elements; then each process in a pair sends all its elements to the receiving process
- Both processes do the rearrangement with all the elements
- The process then sends only the necessary elements, in the rearranged order, to the other process
- Reduces data communication latencies
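
A compare-split between two processes can be sketched as a merge followed by a cut. The version below follows the common textbook formulation, in which both sorted blocks are merged and the lower-ranked process keeps the smaller half; it is sequential and does not model the message exchange itself. The name `compare_split` is illustrative.

```python
import heapq


def compare_split(low_block, high_block):
    """Merge two sorted blocks; return (smaller part, larger part).

    The first result has the size of low_block, the second the size of
    high_block, and both remain sorted.
    """
    m = len(low_block)
    merged = list(heapq.merge(low_block, high_block))
    return merged[:m], merged[m:]


kept_low, kept_high = compare_split([1, 6, 8, 11, 13], [2, 7, 9, 10, 12])
```

Because each output block stays sorted, later compare-splits need only a linear merge, which is the O(n/p) computation cost per step.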

Block of elements and Compare Splits

- Think of blocks as elements
- The problem of sorting the p blocks is identical to performing bitonic sort on the p blocks using compare-split operations
- $\log^2 P$ steps
- At the end, all n elements are sorted, since compare-splits preserve the sorted order within each block
- The n/p elements assigned to each process are sorted initially using a fast sequential algorithm

Block of Elements per Process: Hypercube and Mesh

- Similar to the one-element-per-process case, but now we have p blocks of size n/p, and compare-exchanges are replaced by compare-splits
- Each compare-split takes O(n/p) computation and O(n/p) communication time
- For the hypercube, the complexity is:
  - $O((n/p)\log(n/p))$ for sorting
  - $O((n/p)\log^2 p)$ for computation
  - $O((n/p)\log^2 p)$ for communication

Histogram Sort

- Another splitter-based method
- Histogram sort also determines a set of p-1 splitters
- It achieves this with an iterative approach rather than one big sample
- A processor broadcasts k (> p-1) initial splitter guesses, called a probe
- The initial guesses are spaced evenly over the data range

Histogram Sort Steps

1. Each processor sorts its local data
2. Each processor creates a histogram based on its local data and the splitter guesses
3. A reduction sums up the histograms
4. A processor analyzes which splitter guesses were satisfactory (in terms of load)
5. If any splitters are unsatisfactory, the processor broadcasts a new probe and the algorithm goes back to step 2; otherwise it proceeds to the next steps

Histogram Sort Steps

6. Each processor sends local data to the appropriate processors: all-to-all exchange
7. Each processor merges the data chunks it receives

Merits:

- Only moves the actual data once
- Deals with uneven distributions
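
Steps 2 to 4 for a single probe can be sketched as follows. As a simplifying assumption, the histogram here is cumulative (number of keys less than or equal to each guess), the reduction is a plain sum over lists, and a guess is accepted when its count is within `tolerance` keys of the ideal location (i+1)n/p. All function and parameter names are hypothetical.

```python
import bisect


def global_histogram(local_sorted_blocks, probe):
    """Count, over all processors, the keys <= each splitter guess."""
    totals = [0] * len(probe)
    for block in local_sorted_blocks:        # done in parallel in practice
        for j, guess in enumerate(probe):
            totals[j] += bisect.bisect_right(block, guess)
    return totals                            # stands in for the reduction


def satisfied_splitters(totals, probe, n, p, tolerance):
    """Map splitter index i to an accepted guess, where one exists."""
    found = {}
    for i in range(p - 1):
        ideal = (i + 1) * n // p
        for count, guess in zip(totals, probe):
            if abs(count - ideal) <= tolerance:
                found[i] = guess
                break
    return found


blocks = [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
probe = [2, 4, 6, 8, 10]
totals = global_histogram(blocks, probe)
accepted = satisfied_splitters(totals, probe, n=12, p=3, tolerance=0)
```

Splitters missing from `accepted` are the ones for which a new, narrower probe would be broadcast in step 5.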

Probe Determination

- Should be efficient, since it is done on one processor
- The processor keeps track of bounds for all splitters
  - The ideal location of splitter i is (i+1)n/p
  - When a histogram arrives, the splitter guesses are scanned

Probe Determination

- A splitter can either
  - be a success: its location is within some threshold of the ideal location
  - or not: update the desired splitter to narrow the range for the next guess
- The size of a generated probe depends on how many splitters are yet to be resolved
- Any interval containing s unachieved splitters is subdivided with s × k / u guesses, where u is the total number of unachieved splitters and k is the number of newly generated splitters

Merging and all-to-all overlap

- For merging p arrays at the end, either
  - iterate through all arrays simultaneously
  - or merge using a binary tree
- In the first case, we need all the arrays to have arrived
- In the second case, we can start as soon as two arrays arrive
- Hence this merging can be overlapped with the all-to-all
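
The two merging strategies can be contrasted in a short sketch. `heapq.merge` iterates through all arrays simultaneously and so needs every array up front, while the hypothetical `tree_merge` below consumes arrays one at a time in arrival order and merges pairs as soon as they are available, using a binary-counter arrangement of partial runs.

```python
import heapq


def merge_two(a, b):
    """Merge two sorted lists into one sorted list."""
    return list(heapq.merge(a, b))


def tree_merge(arrivals):
    """Merge sorted arrays pairwise as they arrive."""
    levels = []                 # levels[k]: a run built from 2**k arrays
    for arr in arrivals:
        run, k = list(arr), 0
        # Like adding 1 to a binary counter: carry while a slot is full.
        while k < len(levels) and levels[k] is not None:
            run = merge_two(levels[k], run)
            levels[k] = None
            k += 1
        if k == len(levels):
            levels.append(None)
        levels[k] = run
    result = []
    for run in levels:          # fold in whatever partial runs remain
        if run is not None:
            result = merge_two(result, run)
    return result


arrays = [[1, 5, 9], [2, 6], [0, 3, 7, 8], [4]]
all_at_once = list(heapq.merge(*arrays))
as_they_arrive = tree_merge(arrays)
```

Both produce the same output; the difference is that the tree version can do useful work while later messages of the all-to-all are still in flight.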

Radix Sort

- During every step, the algorithm puts every key in a bucket corresponding to the value of some subset of the key's bits
- A k-bit radix sort looks at k bits every iteration
- Easy to parallelize: assign some subset of buckets to each processor
- Load balance: assign a variable number of buckets to each processor

Radix Sort: Load Balancing

- Each processor counts how many of its keys will go to each bucket
- These histograms are summed up with reductions
- Once a processor receives this combined histogram, it can adaptively assign buckets
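
A minimal sketch of this load-balancing step follows. It assumes non-negative integer keys and one particular greedy policy: each processor receives a contiguous run of buckets until the running key count reaches its cumulative share of n/p. The slides do not specify the assignment policy, so `assign_buckets` is only one possible choice, and both function names are illustrative.

```python
def bucket_histogram(keys, shift, radix_bits):
    """Count local keys per bucket for the digit starting at bit `shift`."""
    counts = [0] * (1 << radix_bits)
    mask = (1 << radix_bits) - 1
    for key in keys:
        counts[(key >> shift) & mask] += 1
    return counts


def assign_buckets(global_counts, p):
    """Return owner[b], the processor that receives bucket b."""
    n = sum(global_counts)
    owner = []
    proc, seen = 0, 0
    for count in global_counts:
        owner.append(proc)
        seen += count
        # Move on once this processor has reached its cumulative share.
        while proc < p - 1 and seen >= (proc + 1) * n / p:
            proc += 1
    return owner


local_counts = bucket_histogram([0b000, 0b001, 0b110, 0b111, 0b110],
                                shift=1, radix_bits=2)
owners = assign_buckets([4, 0, 1, 1, 2, 0, 0, 0], p=2)
```

In the second call the first bucket alone holds half of the keys, so processor 0 owns only that bucket and processor 1 takes the remaining seven.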

Radix Sort - Analysis

- Requires multiple iterations of costly all-to-all
- Cache efficiency is low: any given key can move to any bucket, irrespective of the destination of the previously indexed key
- Affects communication as well
