
1 The Group Runtime Optimization for High-Performance Computing: An Install-Time System for Automatic Generation of Optimized Parallel Sorting Algorithms
Marek Olszewski and Michael Voss, ECE Department, University of Toronto

2 PDPTA 2004 Motivation
- Sorting is a fundamental algorithm
- Many algorithmic choices for sorting
- Performance is heavily influenced by:
  - The data being sorted (type, entropy)
  - The target machine being used
- How can we build the best sort for a given machine? An empirical install-time system

3 Outline of Talk
- Motivation
- An Overview of Sorting Algorithms
- Our install-time empirical system
  - An adaptive hybrid sequential sort
  - An adaptive hybrid parallel sort
- An Evaluation
- Related Work
- Conclusions

4 An overview of sorting algorithms
- The Art of Computer Programming, Vol. 3 (Knuth)
  - 25 algorithms comprehensively studied
- Comparison sorts
  - Lower bound shown to be Ω(n log n)
  - Examples include: insertion sort, quick sort and merge sort
- Non-comparison sorts
  - Can be linear time, i.e. O(n)
  - But require knowing the range of the data
  - Examples include: radix sort and bucket sort
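To make the non-comparison case concrete, here is a minimal counting sort sketch (an illustrative example, not code from the paper): it runs in O(n + k) time but, like the radix and bucket sorts mentioned above, needs the value range known up front.

```cpp
#include <cstddef>
#include <vector>

// Counting sort: linear-time, but only for values in a known range
// [0, max_value]. This is why non-comparison sorts can beat the
// Omega(n log n) comparison-sort lower bound.
std::vector<int> counting_sort(const std::vector<int>& in, int max_value) {
    std::vector<std::size_t> count(max_value + 1, 0);
    for (int x : in) ++count[x];                  // O(n) histogram
    std::vector<int> out;
    out.reserve(in.size());
    for (int v = 0; v <= max_value; ++v)          // O(k) sweep over the range
        out.insert(out.end(), count[v], v);       // emit v, count[v] times
    return out;
}
```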

5 An overview of sorting algorithms
- Hybrid sorts
  - Divide-and-conquer sorts are recursive
  - It may be beneficial to switch algorithms between recursive steps
  - Most C++ STL sorts are hybrid sorts
- Gnu std::sort is a hybrid sort with pre-defined points to switch between heap sort, quick sort, merge sort and insertion sort
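As an illustration of the hybrid idea (a sketch with an assumed fixed cutoff of 16; the system described in this talk learns such switch points empirically), a quick sort that falls back to insertion sort on small subranges:

```cpp
#include <algorithm>
#include <vector>

const int kCutoff = 16;  // assumed switch point, for illustration only

template <typename It>
void insertion_sort(It first, It last) {
    for (It i = first; i != last; ++i) {
        auto key = *i;               // shift larger elements right,
        It j = i;                    // then drop key into place
        while (j != first && *(j - 1) > key) {
            *j = *(j - 1);
            --j;
        }
        *j = key;
    }
}

template <typename It>
void hybrid_sort(It first, It last) {
    if (last - first <= kCutoff) {
        insertion_sort(first, last);  // cheap on small lists
        return;
    }
    auto pivot = *(first + (last - first) / 2);
    // Three-way partition: [first, lt) < pivot, [lt, gt) == pivot, rest > pivot.
    It lt = std::partition(first, last,
                           [&](const auto& x) { return x < pivot; });
    It gt = std::partition(lt, last,
                           [&](const auto& x) { return !(pivot < x); });
    hybrid_sort(first, lt);
    hybrid_sort(gt, last);
}
```

The recursive quick sort handles the bulk of the work; insertion sort's small lower-order terms win once subranges shrink below the cutoff.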

6 An overview of parallel sorts
- Ideally, O((n log n) / p)
  - If p = n, then O(log n)
  - Several parallel sorts achieve this bound, e.g. column sort
  - Parallelized sequential sorts are often better for low numbers of processors (our focus)
- Parallelized divide-and-conquer algorithms
  - Effective for small numbers of processors
  - Use a work-queue model
    - Tasks are placed in a shared work queue
    - Idle processors remove tasks from the queue
  - Good load balance
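The work-queue model above can be sketched as follows (names, the fixed cutoff, and the spin-wait are illustrative assumptions, not the paper's implementation): a task is an index range; workers pop tasks, push the halves of large ranges back into the shared queue, and sort small ranges directly, so idle processors always find work.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

void workqueue_sort(std::vector<int>& a, int nthreads,
                    std::size_t cutoff = 1024) {
    struct Task { std::size_t lo, hi; };       // half-open range [lo, hi)
    std::deque<Task> queue{{0, a.size()}};
    std::mutex mtx;
    std::atomic<std::size_t> pending{1};       // tasks created, not yet done

    auto worker = [&] {
        while (pending.load() != 0) {
            Task t;
            {
                std::lock_guard<std::mutex> lock(mtx);
                if (queue.empty()) continue;   // spin until work appears
                t = queue.front();
                queue.pop_front();
            }
            if (t.hi - t.lo <= cutoff) {
                std::sort(a.begin() + t.lo, a.begin() + t.hi);
            } else {
                // Three-way partition, then push both halves as new tasks.
                int pivot = a[t.lo + (t.hi - t.lo) / 2];
                auto lt = std::partition(a.begin() + t.lo, a.begin() + t.hi,
                                         [&](int x) { return x < pivot; });
                auto gt = std::partition(lt, a.begin() + t.hi,
                                         [&](int x) { return x == pivot; });
                std::lock_guard<std::mutex> lock(mtx);
                queue.push_back({t.lo, std::size_t(lt - a.begin())});
                queue.push_back({std::size_t(gt - a.begin()), t.hi});
                pending += 2;
            }
            --pending;                         // this task is finished
        }
    };
    std::vector<std::thread> pool;
    for (int i = 0; i < nthreads; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```

Because every subrange in the queue is disjoint, workers never touch the same elements; the only shared state is the queue itself, which is exactly the synchronization cost the work-share cutoff (slide 10) tries to amortize.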

7 Our install-time system
Sequential flow: Start → sample input data provided to the installer → time sorts (random algorithms chosen at each recursive step) → calculate the best sorting algorithm for each data set size → C4.5 creates a decision tree → convert the tree to C++ → specialized decision function placed in the library → End
If the sort is parallel, the installer additionally times sorts over different input sizes and work-share points, and a work-share cutoff point tree and C++ functions are generated.

8 Algorithms available to our hybrid sort:

Algorithm            Description
Insertion Sort       O(n²), but with small lower-order terms. Efficient for small lists.
Merge Sort           O(n log n). Subtasks are evenly divided, but it has higher lower-order terms than quick sort.
Quick Sort           O(n log n) on average, but O(n²) worst-case. Has smaller lower-order terms than merge sort.
In-place Merge Sort  O(n log n). Higher constant coefficient than merge sort, but uses less memory.
Heap Sort            O(n log n). Non-recursive algorithm. Can do well on medium-sized lists. Higher lower-order terms than quick sort.

9 Hybrid Adaptive Sequential Sort
- Use random data to train the system
  - Up to 10 million elements
  - Insertion sort not used for large inputs
  - Not all inputs sorted to completion
- Dynamic programming used to find the best choice
  - Assume the best sort at each subsequent step
  - Per-step timings were measured
- C4.5 decision tree used to analyze this data
- C4.5 tree converted to C++ template code
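The generated C++ might take roughly this shape (a hypothetical sketch: the algorithm names come from the slides, but every threshold below is invented for illustration; the real values are learned at install time from the C4.5 tree):

```cpp
#include <cstddef>

// Algorithms available to the hybrid sort (from slide 8).
enum class Algo { Insertion, Quick, Merge, InPlaceMerge, Heap };

// Hypothetical decision function compiled from the learned tree.
inline Algo choose_sort(std::size_t n) {
    // Each branch mirrors one path through the decision tree;
    // all cutoffs below are invented, not measured values.
    if (n <= 32)         return Algo::Insertion;  // tiny lists
    if (n <= 4096)       return Algo::Quick;      // cache-resident
    if (n <= (1u << 20)) return Algo::Merge;      // medium lists
    return Algo::Quick;                           // large lists
}
```

At each recursive step the hybrid sort calls such a function on the current subrange size and dispatches to the chosen algorithm.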

10 Hybrid Adaptive Parallel Sort
- Start with the sequential hybrid sort
- Determine the work-sharing cutoff point
  - When should a thread execute its own tasks?
  - When should a thread place tasks in the work queue?
- This determines the point at which synchronization costs are no longer amortized by the small amount of work being shared
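The cutoff decision itself reduces to a simple predicate (the threshold value here is an assumption for illustration; the installer measures the real one empirically):

```cpp
#include <cstddef>

// Invented threshold; the real value is found by timing different
// input sizes and work-share points at install time.
const std::size_t kWorkShareCutoff = 50000;

inline bool should_enqueue(std::size_t task_size) {
    // Large tasks are worth sharing through the queue; small ones are
    // executed by the owning thread, since the cost of synchronizing
    // on the queue would exceed the work being handed off.
    return task_size >= kWorkShareCutoff;
}
```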

11 Methodology: Platforms
- Sequential platforms
  - Linux 2.4.18 on a 1.6 GHz Intel Pentium 4 Xeon
  - Linux 2.4.24 on an AMD Athlon XP 1700+
  - SunOS 5.8 on a 600 MHz Sparc workstation
- Parallel platform
  - 4-processor 1.6 GHz Intel Xeon SMP
  - Modified 2.4.18-smp kernel (allowed binding)

12 Methodology: Comparisons
- Adaptive Hybrid Sequential Sort
- Adaptive Hybrid Parallel Sort
- Gnu G++ 2.96 std::sort and std::stable_sort
  - Also hybrid sorts
  - Complex – not easily parallelized
- 8 equally sized merge sorts that called std::sort and std::stable_sort in parallel

13 Serial Non-Optimized (w/o –O) Results

14 Serial Optimized (with –O) Results

15 Parallel Work-share Cutoff Point

16 Parallel Non-Optimized (w/o –O) Results

17 Parallel Optimized (with –O) Results

18 Parallel Sort Speedups

19 Related Work
- Install-time empirical optimization systems
  - ATLAS: Level 3 BLAS
  - FFTW: FFT
- STAPL: Adaptive Parallel C++ Library
  - Uses decision trees like our approach
  - Uses only single-level sorts, not hybrids
  - Not available for comparison
- A Dynamically Tuned Sorting Library (CGO'04)
  - Install-time tuning of sequential sorts
  - Only single-level sorts, not hybrids

20 Conclusion
- Presented an install-time system for empirically constructing a "best" sorting algorithm for a target machine
- Competitive with STL sort on 1 processor
- Better than a parallelized STL sort on multiple processors

