A Memory-hierarchy Conscious and Self-tunable Sorting Library To appear in 2004 International Symposium on Code Generation and Optimization (CGO ’ 04)

Xiaoming Li, María Jesús Garzarán, and David Padua, University of Illinois at Urbana-Champaign

2 Motivation
• Sorting
  – Core operation in many applications, such as databases
  – Well understood symbolic computing problem
• Library generators such as ATLAS and SPIRAL have used empirical search to adapt to
  – Architectural features of the target machine
  – Size of the input data
• But the performance of sorting also depends on the distribution of the values to be sorted

3 Motivation
• Main difficulties in building a sorting library
  1. Theoretical complexity is not sufficient to measure quality
     – Cache effects, instructions executed
  2. Performance depends on the characteristics of the input
     – Amount and distribution of the data to sort
• A single algorithm is not optimal for all possible input sets

4 Contributions
1. Identify the architectural and runtime factors that affect the performance of the sorting algorithms.
2. Use empirical search to identify the best shape and parameter values of a sorting algorithm.
3. Use machine learning and runtime adaptation to select the best sorting algorithm for a specific input set.

5 Contributions
[Figure: IBM Power 3, sorting 12 M keys (32-bit integers). Axes: standard deviation of the inputs, execution time (cycles).]

6 Outline
• Sorting Algorithms
• Factors that determine performance
• The Library
• Evaluation
• Future Work
• Conclusions

7 Sorting Algorithms
• Our sorting library contains
  – Quicksort
  – CC-Radix
  – Multiway Merge
  – Insertion Sort (for small partitions)
  – Sorting Networks (for small partitions)

8 Quicksort
• Divide and conquer in-place sorting algorithm
• Our implementation includes Sedgewick's optimizations:
  – Set guardians (sentinels) at both ends of the input array.
  – Eliminate recursion.
  – Correctly select the pivot.
  – Use insertion sort for small partitions.
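The optimizations listed above can be sketched roughly as follows. This is a minimal illustration and not the library's code: the names `quicksort` and `insertion_sort`, the median-of-three pivot rule, and the cutoff of 16 are assumptions chosen for the example. Median-of-three leaves a small key at the left end and a large key at the right end, so those two positions act as the guardians for the inner scans.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive) in place."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def quicksort(a, cutoff=16):
    """Iterative quicksort: explicit stack, median-of-three pivot, small-partition cutoff."""
    stack = [(0, len(a) - 1)]            # explicit stack replaces recursion
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= cutoff:        # small partition: use insertion sort
            insertion_sort(a, lo, hi)
            continue
        mid = (lo + hi) // 2
        # Median-of-three: afterwards a[lo] <= a[mid] <= a[hi],
        # so a[lo] and a[hi] bound the two scans below.
        if a[mid] < a[lo]:
            a[lo], a[mid] = a[mid], a[lo]
        if a[hi] < a[lo]:
            a[lo], a[hi] = a[hi], a[lo]
        if a[hi] < a[mid]:
            a[mid], a[hi] = a[hi], a[mid]
        pivot = a[mid]
        i, j = lo, hi
        while i <= j:                    # Hoare-style partition around the pivot value
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        stack.append((lo, j))
        stack.append((i, hi))
    return a
```

In a tuned library the cutoff would be one of the parameters found by empirical search rather than a fixed constant.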

9 Radix sort
• Non comparison algorithm
[Figure: worked example of one radix sort pass, showing the vector to sort, the counter array, the accumulated counts (accum.), and the destination vector.]
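The counter, accum. and destination-vector steps from the slide correspond to one counting pass per digit. A minimal sketch, assuming non-negative integer keys; the function names and the 8-bit digit size are illustrative choices, not taken from the talk:

```python
def radix_pass(src, shift, bits):
    """One stable counting pass on the digit (key >> shift) & mask."""
    mask = (1 << bits) - 1
    count = [0] * (1 << bits)
    for key in src:                          # 1. counter: histogram of digit values
        count[(key >> shift) & mask] += 1
    start = 0
    for d in range(len(count)):              # 2. accum.: exclusive prefix sum
        count[d], start = start, start + count[d]
    dest = [0] * len(src)
    for key in src:                          # 3. scatter into the destination vector
        d = (key >> shift) & mask
        dest[count[d]] = key
        count[d] += 1
    return dest


def radix_sort(keys, key_bits=32, bits=8):
    """Least-significant-digit radix sort for integers in [0, 2**key_bits)."""
    for shift in range(0, key_bits, bits):
        keys = radix_pass(keys, shift, bits)
    return keys
```

Each pass is stable, which is what allows the least significant digit to be processed first.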

10 CC-radix (Cache Conscious Radix Sort)
• Tries to exploit data locality in caches
• Based on radix sort (Jimenez and Larriba – UPC)
  CC-radix(bucket)
    if fits in cache (bucket) then
      radix sort (bucket)
    else
      sub-buckets = Reverse sorting(bucket)
      for each sub-bucket in sub-buckets
        CC-radix(sub-bucket)
      endfor
    endif
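The recursion above might look like the following in runnable form. This is a sketch under stated assumptions: "Reverse sorting" is read here as partitioning on the most significant remaining digit, "fits in cache" is approximated by a key-count threshold (`cache_keys`, a made-up parameter), and keys are non-negative integers below `2**remaining_bits`.

```python
def lsd_radix_sort(keys, key_bits, bits):
    """Plain radix sort on the low key_bits bits, least significant digit first."""
    mask = (1 << bits) - 1
    for shift in range(0, key_bits, bits):
        buckets = [[] for _ in range(1 << bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return keys


def cc_radix(bucket, remaining_bits=32, bits=8, cache_keys=1 << 14):
    """Sort a bucket whose keys agree on every bit above remaining_bits."""
    if len(bucket) <= cache_keys or remaining_bits <= 0:
        # Fits in cache (or nothing left to distinguish): ordinary radix sort.
        return lsd_radix_sort(bucket, remaining_bits, bits)
    # Too large: split on the most significant remaining digit.
    shift = max(remaining_bits - bits, 0)
    mask = (1 << (remaining_bits - shift)) - 1
    sub_buckets = [[] for _ in range(mask + 1)]
    for k in bucket:
        sub_buckets[(k >> shift) & mask].append(k)
    out = []
    for sub in sub_buckets:                  # sub-buckets are already in key order
        out.extend(cc_radix(sub, shift, bits, cache_keys))
    return out
```

Because the sub-buckets are visited in digit order, concatenating their sorted contents yields the sorted bucket without a merge step.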

11 Multiway Merge Sort
[Figure: p sorted subsets feeding a heap with 2*p - 1 nodes.]
• This algorithm exploits data locality very efficiently
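A rough sketch of the idea: sort p subsets independently, then merge them through a heap. Note the substitution: the slide describes a heap with 2*p - 1 nodes, whereas this example uses Python's `heapq` binary heap with at most p entries. The function names and the default p are assumptions for illustration.

```python
import heapq


def multiway_merge(runs):
    """Merge already-sorted lists using a heap keyed on each run's current head."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        key, i, pos = heap[0]                # smallest head among all runs
        out.append(key)
        if pos + 1 < len(runs[i]):
            # Replace the head with the next key from the same run.
            heapq.heapreplace(heap, (runs[i][pos + 1], i, pos + 1))
        else:
            heapq.heappop(heap)              # this run is exhausted
    return out


def multiway_merge_sort(keys, p=4):
    """Split keys into at most p subsets, sort each, then merge them."""
    if not keys:
        return []
    size = -(-len(keys) // p)                # ceiling division
    runs = [sorted(keys[s:s + size]) for s in range(0, len(keys), size)]
    return multiway_merge(runs)
```

Each subset is read sequentially during the merge, which is where the locality mentioned on the slide comes from.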

12 Sorting algorithms for small partitions
• Insertion sort
  – Exploits locality in the cache line
• Sorting networks
  – Register blocking

13 Performance Comparison
[Figure: Pentium III Xeon, 16 M keys (float).]

15 Factors that determine performance
• Architectural Factors Considered
  – Cache / TLB size
  – Number of Registers
  – Cache Line Size
• Runtime Factors Considered
  – Amount of data to sort
  – Distribution of the data

16 Architectural: Cache Size/TLB Size
• Tiling: partition the data into subsets that fit in the cache
  – Quicksort
      Use multiple pivots to tile
  – CC-radix
      Fit each partition into cache
      The number of active partitions < TLB size
  – Multiway Merge Sort
      Fit the heap into cache
      Fit sorted subsets into cache

17 Architectural: Number of Registers
• For small partitions, sort in place using the processor registers
• Optimizations such as unrolling and scheduling can be applied
  Before scheduling: cmp&swap(r0,r1) cmp&swap(r2,r3) cmp&swap(r1,r2) cmp&swap(r0,r3) cmp&swap(r4,r5) …..
  After scheduling: cmp&swap(r0,r1) cmp&swap(r2,r3) cmp&swap(r4,r5) cmp&swap(r1,r2) cmp&swap(r0,r3)
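To make the compare-and-swap idea concrete, here is a small sketch of a complete sorting network for four values. It uses the standard 5-comparator network for 4 inputs, not the (partial) sequence shown on the slide; `cmp_swap`, `NETWORK_4`, and `sort4` are names invented for the example, and a Python list stands in for the registers.

```python
def cmp_swap(r, i, j):
    """Put the smaller of r[i], r[j] in r[i] and the larger in r[j]."""
    if r[i] > r[j]:
        r[i], r[j] = r[j], r[i]


# Standard 5-comparator network for 4 inputs. The comparators in each
# of the first two pairs touch disjoint positions, so they are independent
# and a scheduler may reorder or overlap them.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4(values):
    """Sort exactly four values with a fixed, data-independent comparison sequence."""
    r = list(values)
    for i, j in NETWORK_4:
        cmp_swap(r, i, j)
    return r
```

The comparison sequence never depends on the data, which is what makes unrolling and instruction scheduling straightforward.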

18 Architectural: Cache Line Size
• Fanout = Cache Line Size
• Increase cache line utilization when accessing children nodes
[Figure: children nodes laid out within a cache line.]

19 Runtime: Amount and Distribution Shape
[Figure: axes are number of keys (millions) and execution time (cycles).]

20 Runtime: Amount and Distribution Shape
[Figure: axes are number of keys (millions) and execution time (cycles).]

21 Runtime: Standard Deviation
[Figure: Pentium III Xeon, 16 M keys. Axes: standard deviation of the keys, execution time (cycles).]

23 Library adaptation
• Architectural Factors (handled by empirical search)
  – Cache / TLB size
  – Number of Registers
  – Cache Line Size
• Runtime Factors
  – Distribution shape of the data (does not matter)
  – Amount of data to sort (handled by machine learning and runtime adaptation)
  – Standard deviation (handled by machine learning and runtime adaptation)

24 The Library
• Building the library (installation time)
  – Empirical Search
  – Learning Procedure (uses training data)
• Running the library (runtime)
  – Runtime Procedure (runtime adaptation)

25 Runtime Adaptation: Learning Procedure
• Goal function: $f: (N, E) \rightarrow \{\text{Multiway Merge Sort}, \text{Quicksort}, \text{CC-radix}\}$
    $N$: amount of input data
    $E$: the entropy vector
  – Use $N$ to choose between Multiway Merge and Quicksort
  – Use the entropy and the Winnow algorithm to learn the best algorithm
      Output: weight vector ($\vec{w}$) and threshold ($\Theta$)
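For readers unfamiliar with Winnow, the following is a sketch of the textbook variant (multiplicative promotion and demotion on binary features, often called Winnow2). How the library turns entropy values into features and how it sets the threshold is not stated on the slide, so the binary features, the threshold `n / 2`, the factor `alpha`, and the function names here are all assumptions.

```python
def winnow_train(samples, n, alpha=2.0, epochs=10):
    """Learn weights w and threshold theta from (x, label) pairs.

    x is a sequence of n binary features; label is 0 or 1.
    """
    w = [1.0] * n
    theta = n / 2
    for _ in range(epochs):
        for x, label in samples:
            s = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1 if s >= theta else 0
            if pred == label:
                continue                     # weights change only on mistakes
            # Promote active features on a missed positive, demote on a false positive.
            factor = alpha if label == 1 else 1.0 / alpha
            w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
    return w, theta


def winnow_predict(w, theta, x):
    """Return 1 if the weighted sum reaches the threshold, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
```

The output has the same shape as on the slide: a weight vector and a threshold that are later compared against a weighted sum at runtime.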

26 Runtime Adaptation: Runtime Procedure
• Sample the input array
• Compute the entropy vector
• Compute $S = \sum_i w_i \cdot \mathrm{entropy}_i$
• If $S \ge \Theta$ choose CC-radix, else choose others
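The four steps above can be sketched as follows. The slide does not define the entropy vector, so this example assumes one Shannon entropy value per radix digit of the sampled keys; the sampling stride, the digit size, and the returned labels `"cc_radix"` and `"other"` are likewise illustrative.

```python
import math


def entropy_vector(keys, key_bits=32, bits=8):
    """Shannon entropy (in bits) of each digit position over the given keys."""
    mask = (1 << bits) - 1
    n = len(keys)
    vec = []
    for shift in range(0, key_bits, bits):
        counts = {}
        for k in keys:
            d = (k >> shift) & mask
            counts[d] = counts.get(d, 0) + 1
        vec.append(-sum(c / n * math.log2(c / n) for c in counts.values()))
    return vec


def choose_algorithm(keys, w, theta, sample_step=64):
    """Sample the input, compute S = sum_i w_i * entropy_i, compare with theta."""
    sample = keys[::sample_step]             # 1. sample the input array
    e = entropy_vector(sample)               # 2. entropy vector of the sample
    s = sum(wi * ei for wi, ei in zip(w, e)) # 3. weighted sum S
    return "cc_radix" if s >= theta else "other"   # 4. threshold test
```

Sampling keeps the decision cost small relative to the sort itself, which matters because this procedure runs on every call.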

28 Experimental Setup
• Test Platforms:
  – SGI R12000: 300 MHz; L1I/D=32KB; L2 = 4MB
  – UltraSparcIII: 750 MHz; L1I/D=32KB, 64KB; L2 = 8MB
  – PentiumIII Xeon: 550 MHz; L1I/D=16KB; L2 = 512KB
  – IBM Power3: 375 MHz; L1I/D=64KB; L2 = 8MB

29 Sun UltraSparcIII: 12 M keys
[Figure: axes are standard deviation of the keys and execution time (cycles per key).]

30 IBM Power3: 12 M keys
[Figure: axes are standard deviation of the keys and execution time (cycles per key).]

31 Conclusions
• Identify the architectural and runtime factors
• Use empirical search to find the best parameter values
• Our machine learning techniques prove to be quite effective:
  – Always select the best algorithm.
  – A wrong decision would introduce a 37% average performance degradation.
  – Overhead: 5% on average, 7% in the worst case.

32 Future Work
1. Search in the space of sorting algorithms using high-level primitives
2. Extend sorting to include more data types
3. Include other comparison strategies
4. Parallel algorithms
5. Explore other database operations, such as join.
For example, less than to sort vectors, graphs, …

33 Empirical Search
• Adaptation to the architecture of the machine
  – Quicksort and CC-radix: the best configuration does not change significantly with the characteristics of the input data set.
      Quicksort, CC-radix:
      - Use of insertion sort/sorting networks for small partitions
      - Threshold to use them
      CC-radix:
      - Size of the radix
  – Multiway Merge Sort: the best configuration changes with the amount and the distribution of the input data. The best values are searched for during the learning procedure.

35 Multiway Merge Sort
[Figure: worked example of four sorted runs being merged through a heap.]

36 Empirical Search Example:
• Multiway Merge
  Search for the heap size that gives the best performance:
  - For different amounts of data and standard deviations
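In its simplest form, an empirical search of this kind times each candidate parameter value and keeps the fastest. The sketch below is generic and hypothetical: `empirical_search`, `run`, `repeats`, and the injectable `clock` are names made up for the example, and the library's actual search strategy is not described on the slide.

```python
import time


def empirical_search(candidates, run, repeats=3, clock=time.perf_counter):
    """Return the candidate for which run(candidate) has the lowest best-of-N time."""
    best_value, best_time = None, float("inf")
    for value in candidates:
        elapsed = float("inf")
        for _ in range(repeats):             # best of several runs reduces timing noise
            t0 = clock()
            run(value)
            elapsed = min(elapsed, clock() - t0)
        if elapsed < best_time:
            best_value, best_time = value, elapsed
    return best_value
```

For the multiway merge case, `run` would sort a fixed training input with the given heap size, and the search would be repeated for each combination of input size and standard deviation of interest.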
