Adaptive Sorting “A Dynamically Tuned Sorting Library” “Optimizing Sorting with Genetic Algorithms” By Xiaoming Li, Maria Jesus Garzaran, and David Padua.


1 Adaptive Sorting “A Dynamically Tuned Sorting Library” “Optimizing Sorting with Genetic Algorithms” By Xiaoming Li, Maria Jesus Garzaran, and David Padua Presented by Anton Morozov

2 Motivations and Observations The success of ATLAS, FFTW, and SPIRAL (automatically tuned linear algebra and signal processing libraries). What can be done for sorting?

3 Why are we interested in sorting algorithms? Does this reflect the performance of the sorting algorithms?

4 Which additional factors influence the performance of the sorting algorithm?

5 Performance vs. Standard Deviation

6 Observation Quicksort and merge sort are both comparison-based sorts, so they are independent of the chosen distribution or standard deviation. Their performance depends on the degree of sortedness, i.e., the number of inversions (at most n(n-1)/2).
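The inversion count mentioned above can be computed in O(n log n) with a merge-sort pass. A minimal sketch (the function name and structure are mine, not from the paper):

```python
def count_inversions(keys):
    """Count pairs (i, j) with i < j and keys[i] > keys[j].
    0 means fully sorted; the maximum n*(n-1)/2 means reverse order."""
    def sort_count(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_left = sort_count(a[:mid])
        right, inv_right = sort_count(a[mid:])
        merged, inv = [], inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                # every remaining left element exceeds right[j]:
                # each such pair is one inversion
                inv += len(left) - i
                merged.append(right[j]); j += 1
        merged.extend(left[i:]); merged.extend(right[j:])
        return merged, inv
    return sort_count(list(keys))[1]
```

For example, `count_inversions([3, 2, 1])` gives 3, the maximum for n = 3.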


8 Architectural Model and Empirical Search We saw how libraries like ATLAS (which generates tuned BLAS routines) use empirical search to establish the parameters of the underlying architecture.

9 So which sorting algorithm is better? What does the performance of a sorting algorithm depend on? How do we choose the best sorting algorithm?

10 Sorting algorithms: QuickSort, Radix Sort, Merge Sort, Insertion Sort, Sorting Networks, Heap Sort


12 Sorting algorithms: QuickSort; Radix Sort → Cache-Conscious Radix Sort; Merge Sort → Multiway Merge Sort; Insertion Sort and Sorting Networks (register sorts); Heap Sort

13 Quick Sort Description: pick a pivot and move records around it: records smaller than the pivot go to the front, bigger ones go to the back, and the pivot is inserted between them. Improvements: proceed iteratively; choose the pivot among the first, middle, and last keys; use fast sorts (insertion sort or sorting networks) for the small partitions.
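A sketch of Quicksort with these improvements: median-of-three pivot selection, partial iteration (recursing on one side and looping on the other), and insertion sort below a threshold. The threshold value here is illustrative; the library finds it by empirical search.

```python
INSERTION_THRESHOLD = 16  # illustrative cutoff; the library searches for this

def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] in place; fast for small partitions."""
    for i in range(lo + 1, hi + 1):
        key, j = a[i], i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]; j -= 1
        a[j + 1] = key

def median_of_three(a, lo, hi):
    """Order a[lo], a[mid], a[hi] and return the median as the pivot."""
    mid = (lo + hi) // 2
    if a[mid] < a[lo]: a[lo], a[mid] = a[mid], a[lo]
    if a[hi] < a[lo]: a[lo], a[hi] = a[hi], a[lo]
    if a[hi] < a[mid]: a[mid], a[hi] = a[hi], a[mid]
    return a[mid]

def quicksort(a, lo=0, hi=None):
    if hi is None: hi = len(a) - 1
    while lo < hi:                       # loop instead of recursing twice
        if hi - lo < INSERTION_THRESHOLD:
            insertion_sort(a, lo, hi); return
        pivot = median_of_three(a, lo, hi)
        i, j = lo, hi
        while i <= j:                    # partition around the pivot value
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]; i += 1; j -= 1
        quicksort(a, lo, j)              # recurse on one side,
        lo = i                           # iterate on the other
```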

14 Cache-Conscious Radix Sort Given b-bit integers and a radix of size 2^r, the algorithm first sorts by the lower r bits, then by the next r bits, for a total of b/r phases, where r is chosen so that r ≤ log2(S_TLB) − 1, with S_TLB the number of entries in the translation look-aside buffer. Improvements: proceed iteratively; compute the histogram of each r-bit digit the first time the sort is applied; choose r as described above.
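The b/r-phase scheme can be sketched as a least-significant-digit radix sort; r = 8 here is illustrative, whereas the real library derives r from the TLB size at installation time.

```python
def radix_sort(keys, b=32, r=8):
    """LSD radix sort of b-bit non-negative integers, r bits per phase
    (b/r phases total)."""
    mask = (1 << r) - 1
    for shift in range(0, b, r):
        counts = [0] * (1 << r)
        for k in keys:                       # histogram of this r-bit digit
            counts[(k >> shift) & mask] += 1
        pos, total = [0] * (1 << r), 0
        for d in range(1 << r):              # prefix sums -> start offsets
            pos[d] = total; total += counts[d]
        out = [0] * len(keys)
        for k in keys:                       # stable copy into buckets
            d = (k >> shift) & mask
            out[pos[d]] = k; pos[d] += 1
        keys = out
    return keys
```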

15 Multiway Merge Sort It partitions the keys into p subsets; each subset is then sorted (in this case with CC-radix sort) and the subsets are merged using a heap. First, the smallest element of each subset is promoted to the leaves of the heap; then leaves are compared and the appropriate one is promoted. The heap contains 2p−1 nodes. Each parent in the heap has A/r children, where A is the cache line size and r the size of a node.
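The merge phase can be sketched with Python's heapq standing in for the paper's hand-built heap of 2p−1 nodes (the cache-line-wide fan-out is not modeled here):

```python
import heapq

def multiway_merge(sorted_runs):
    """Merge p sorted runs by keeping each run's current head in a heap;
    popping the minimum 'promotes' the winning leaf, as in the paper."""
    heap, out = [], []
    for run_id, run in enumerate(sorted_runs):
        if run:
            heapq.heappush(heap, (run[0], run_id, 0))
    while heap:
        key, run_id, idx = heapq.heappop(heap)
        out.append(key)
        idx += 1
        if idx < len(sorted_runs[run_id]):   # promote that run's next element
            heapq.heappush(heap, (sorted_runs[run_id][idx], run_id, idx))
    return out
```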

16 Insertion Sort Used for small data sizes. Working from left to right, for each key the algorithm scans to the left of the key and places it in the appropriate position. Sorting Networks Compare two inputs at a time in a fixed sequence and, if one is bigger than the other, swap them.
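For example, the classic 5-comparator network for 4 keys: every step is a fixed compare-and-swap, which is why these map well onto registers. This is a sketch of the idea, not the library's code.

```python
def sort4_network(a, b, c, d):
    """Sorting network for 4 keys: comparators (a,b), (c,d), (a,c),
    (b,d), (b,c); each one swaps the pair if it is out of order."""
    if a > b: a, b = b, a
    if c > d: c, d = d, c
    if a > c: a, c = c, a
    if b > d: b, d = d, b
    if b > c: b, c = c, b
    return a, b, c, d
```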

17 Input Data Factors Number of keys, distribution, standard deviation, … Approximate the standard deviation with an entropy vector: ∑_i −P_i·log2(P_i), where P_i = c_i/N and c_i is the number of keys with value i in that digit.
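The entropy vector (one entropy value per r-bit digit) can be sketched as follows; the digit width r = 8 and key width b = 32 are illustrative defaults, not values from the paper.

```python
import math
from collections import Counter

def entropy_vector(keys, b=32, r=8):
    """One entry per r-bit digit position: sum_i -P_i*log2(P_i),
    with P_i = c_i/N and c_i the count of keys whose digit equals i."""
    n, mask = len(keys), (1 << r) - 1
    vec = []
    for shift in range(0, b, r):
        counts = Counter((k >> shift) & mask for k in keys)
        vec.append(-sum((c / n) * math.log2(c / n)
                        for c in counts.values()))
    return vec
```

A constant input yields all-zero entropies; a digit that is uniformly distributed yields the maximum entropy r for that position.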

18 Parameters to search for during installation Merge Sort: size of the heap and the fan-out; depends on cache size, cache line, input size, and entropy; at run time needs N and E. Quick Sort: insertion sort or sorting networks and their thresholds; depends on the number of registers and the cache size. CC-radix Sort: insertion sort, sorting networks, or standard radix sort depending on the size; also depends on the number of registers and the cache size.

19 Learning procedure Winnow algorithm: ∑_i w_i·E_i > Θ. Learns a weight vector and threshold over the entropy vector: f: (N, E) → {CC-radix, Multiway Merge(N,E), Quicksort}.

20 Selection at run time Sample the input array (every fourth entry); compute the entropy vector; compute S = ∑_i w_i·entropy_i. If S ≥ Θ, choose CC-radix; otherwise choose between the others based on the size of the input (either Merge Sort or QuickSort).
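The whole run-time decision might look like this sketch. The weights and Θ would come from the installation-time Winnow training; the size threshold, digit widths, and algorithm labels here are placeholders of mine.

```python
import math
from collections import Counter

def entropy_vec(keys, b=32, r=8):
    """Per-digit entropy of the r-bit digits of the keys."""
    n, mask = len(keys), (1 << r) - 1
    return [-sum((c / n) * math.log2(c / n)
                 for c in Counter((k >> s) & mask for k in keys).values())
            for s in range(0, b, r)]

def select_sort(keys, weights, theta, size_threshold=4_000_000):
    """Sample every fourth key, compute S = sum_i w_i * E_i, and branch."""
    sample = keys[::4] or keys
    s = sum(w * e for w, e in zip(weights, entropy_vec(sample)))
    if s >= theta:
        return "cc-radix"
    return "multiway-merge" if len(keys) > size_threshold else "quicksort"
```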

21 Summary: Empirical Search Runtime factors: distribution shape of the data (any, since it doesn't matter); amount of data to sort and distribution width (learned at installation time). Architectural factors: cache/TLB size, number of registers, cache line size (established by empirical search).

22 Performance Results


24 Is it possible to do better?

25 Sorting Primitives To build new sorting algorithms: sorting and selection primitives. A sorting primitive is one of the pure sorting algorithms seen earlier. A selection primitive is a process executed at run time to decide which sorting algorithm to apply.

26 Sorting Primitives Divide-by-Value (DV): corresponds to the first phase of Quicksort; takes the number of pivots np as a parameter (producing np+1 partitions). Select one or multiple pivots and partition the input array around them. Divide-by-Position (DP): corresponds to the initial break-up of Merge Sort; takes the size of each partition and the fan-out of the heap. Divide the input into same-size sub-partitions and use a heap to merge the sorted sub-partitions.
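A sketch of the DV primitive with multiple pivots; the simplistic pivot sampling here is mine, not the paper's.

```python
import bisect

def divide_by_value(keys, num_pivots=3):
    """Pick num_pivots pivots and scatter the keys into num_pivots+1
    partitions, so that partition i holds keys between pivots i-1 and i."""
    step = max(1, len(keys) // (num_pivots + 1))
    pivots = sorted(keys[step::step][:num_pivots])   # naive sampled pivots
    parts = [[] for _ in range(len(pivots) + 1)]
    for k in keys:
        parts[bisect.bisect_left(pivots, k)].append(k)  # k's partition index
    return pivots, parts
```

Each partition can then be handled independently (and recursively), which is what makes DV a composable building block.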

27 Sorting Primitives Divide-by-Radix (DR): corresponds to one step of the radix sort algorithm; takes the radix (r bits) as a parameter. Step 1: scan the input to build the distribution array, which records how many elements fall in each of the 2^r sub-partitions. Step 2: compute the accumulated distribution array, used as the indexes when copying the input to the destination array. Step 3: copy the input to the 2^r sub-partitions.

28 Sorting Primitives Divide-by-Radix-assuming-Uniform-distribution (DRU): same as DR, but assumes each bucket contains n/2^r keys. Steps 1 and 2 of DR are expensive; if the input elements are distributed nearly evenly among the 2^r sub-partitions, the input can be copied into the destination array directly, assuming every partition has the same number of elements. Overhead: partition overflow.

29 Sorting Primitives Once the partitions are small: Leaf-Divide-by-Value (LDV): same as DV but applied recursively to the partitions; below a threshold it applies register sorting. Leaf-Divide-by-Radix (LDR): same as DR but used on all remaining subsets; below a threshold it applies register sorting.

30 Selection Primitives Branch-by-Size: selects different paths based on the input size. Branch-by-Entropy: uses entropy to branch onto different paths; uses Winnow to learn the weight vector.

31 Genetic Algorithm Crossover: propagate good sub-trees. Mutation: mutate the structure of the algorithm; change the parameter values of primitives.

32 Genetic Algorithm Fitness function: average performance across standard deviations. Uses rank instead of raw fitness.

33 Performance Results


35 Is it possible to do better? Empirically, it was observed that the Branch-by-Entropy selection primitive was never used.

36 Classifier Sorting Based on the idea that the performance of an algorithm in one region of the input space can be independent of its performance in another. i is an input-characteristic bit string; c is a condition string over "1", "0", and "*" (don't care).

37 Example: encode the number of keys into 4 bits (0000: 0~1M, 0001: 1~2M, …). Number of keys = 10.5M, encoded as "1100".

Condition   Action               Fitness   Accuracy
1100        (dr 5 (lq 1 16))     ……        ……
01**        ……                   ……        ……
1010        (dp 4 2 (lr 5 16))   ……        ……
110*        (dv 2 (lr 6 16))     ……        ……
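Matching a condition string against an input-characteristic string reduces to a per-bit check; a minimal sketch:

```python
def matches(condition, inp):
    """True if condition (over '1', '0', '*' = don't care) matches the
    input-characteristic bit string inp, position by position."""
    return all(c == '*' or c == bit for c, bit in zip(condition, inp))
```

For the input "1100" above, the rules "1100" and "110*" match while "01**" and "1010" do not.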

38 Experimental Results


41 Summary and Future Work The work presented shows how sorting can be adapted to the underlying platform. Potential future work: - Figure out what went wrong (or not) with those graphs - Incorporate the notion of "sortedness" into sort selection - Simplify the selection algorithm - See whether these ideas can be used in a cache-oblivious way

