High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne Key Laboratory of Computer.

Slides:

Advertisements

Similar presentations

List Ranking on GPUs Sathish Vadhiyar. List Ranking on GPUs Linked list prefix computations – computations of prefix sum on the elements contained in.

Advertisements

CS 400/600 – Data Structures External Sorting.

Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.

Parallel Sorting Sathish Vadhiyar. Sorting  Sorting n keys over p processors  Sort and move the keys to the appropriate processor so that every key.

Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward Fernandez, Walid Najjar, Stefano Lonardi, Jason Villarreal UC Riverside, Department of Computer.

Advanced Topics in Algorithms and Data Structures Lecture pg 1 Recursion.

Advanced Topics in Algorithms and Data Structures Lecture 6.1 – pg 1 An overview of lecture 6 A parallel search algorithm A parallel merging algorithm.

2009/04/07 Yun-Yang Ma.  Overview  What is CUDA ◦ Architecture ◦ Programming Model ◦ Memory Model  H.264 Motion Estimation on CUDA ◦ Method ◦ Experimental.

Comp 122, Spring 2004 Elementary Sorting Algorithms.

L15: Review for Midterm. Administrative Project proposals due today at 5PM (hard deadline) – handin cs6963 prop March 31, MIDTERM in class L15: Review.

1 Tuesday, November 14, 2006 “UNIX was never designed to keep people from doing stupid things, because that policy would also keep them from doing clever.

Sorting Algorithms CS 524 – High-Performance Computing.

Back-Projection on GPU: Improving the Performance Wenlay “Esther” Wei Advisor: Jeff Fessler Mentor: Yong Long April 29, 2010.

Cache effective mergesort and quicksort Nir Zepkowitz Based on: “Improving Memory Performance of Sorting Algorithms” by Li Xiao, Xiaodong Zhang, Stefan.

Sorting and Searching Timothy J. PurcellStanford / NVIDIA Updated Gary J. Katz based on GPUTeraSort (MSR TR )U. of Pennsylvania.

CS 284a, 4 November 1997 Copyright (c) , John Thornley1 CS 284a Lecture Tuesday, 4 November, 1997.

CS 584. Sorting n One of the most common operations n Definition: –Arrange an unordered collection of elements into a monotonically increasing or decreasing.

CS 584. Sorting n One of the most common operations n Definition: –Arrange an unordered collection of elements into a monotonically increasing or decreasing.

Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram.

1 CSE 326: Data Structures: Sorting Lecture 17: Wednesday, Feb 19, 2003.

To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,

GPGPU platforms GP - General Purpose computation using GPU

External Sorting Problem: Sorting data sets too large to fit into main memory. –Assume data are stored on disk drive. To sort, portions of the data must.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Lecture 12: Parallel Sorting Shantanu Dutt ECE Dept. UIC.

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Accelerating Simulation of Agent-Based Models on Heterogeneous Architectures.

Performance Tuning on Multicore Systems for Feature Matching within Image Collections Xiaoxin Tang*, Steven Mills, David Eyers, Zhiyi Huang, Kai-Cheung.

PAGE: A Framework for Easy Parallelization of Genomic Applications 1 Mucahid Kutlu Gagan Agrawal Department of Computer Science and Engineering The Ohio.

Adaptive Parallel Sorting Algorithms in STAPL Olga Tkachyshyn, Gabriel Tanase, Nancy M. Amato

Outline  introduction  Sorting Networks  Bubble Sort and its Variants 2.

Programming Concepts in GPU Computing Dušan Gajić, University of Niš Programming Concepts in GPU Computing Dušan B. Gajić CIITLab, Dept. of Computer Science.

Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.

MapReduce How to painlessly process terabytes of data.

1 Evaluation of parallel particle swarm optimization algorithms within the CUDA™ architecture Luca Mussi, Fabio Daolio, Stefano Cagnoni, Information Sciences,

Evaluating FERMI features for Data Mining Applications Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal.

Accelerating Statistical Static Timing Analysis Using Graphics Processing Units Kanupriya Gulati and Sunil P. Khatri Department of ECE, Texas A&M University,

1 CSE 326: Data Structures Sorting It All Out Henry Kautz Winter Quarter 2002.

CUDA Optimizations Sathish Vadhiyar Parallel Programming.

CS 193G Lecture 7: Parallel Patterns II. Overview Segmented Scan Sort Mapreduce Kernel Fusion.

Parallel Algorithms Patrick Cozzi University of Pennsylvania CIS Fall 2013.

Autonomic scheduling of tasks from data parallel patterns to CPU/GPU core mixes Published in: High Performance Computing and Simulation (HPCS), 2013 International.

Memory Management during Run Generation in External Sorting – Larson & Graefe.

CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.

Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.

Optimizing MapReduce for GPUs with Effective Shared Memory Usage Department of Computer Science and Engineering The Ohio State University Linchuan Chen.

CS 284a, 29 October 1997 Copyright (c) , John Thornley1 CS 284a Lecture Tuesday, 29 October, 1997.

Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.

A Comparison of Parallel Sorting Algorithms on Different Architectures Nancy M. Amato, Ravishankar Iyer, Sharad Sundaresan and Yan Wu Texas A&M University.

Radix Sort and Hash-Join for Vector Computers Ripal Nathuji 6.893: Advanced VLSI Computer Architecture 10/12/00.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 11 Parallel Computation.

Parallel Algorithms Continued Patrick Cozzi University of Pennsylvania CIS Spring 2012.

An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-body Algorithm By Martin Burtscher and Keshav Pingali Jason Wengert.

A Memory-hierarchy Conscious and Self-tunable Sorting Library To appear in 2004 International Symposium on Code Generation and Optimization (CGO ’ 04)

© David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 10 Reduction Trees.

CS/EE 217 GPU Architecture and Parallel Programming Midterm Review

Sudhanshu Khemka.  Treats each document as a vector with one component corresponding to each term in the dictionary  Weight of a component is calculated.

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

3/12/2013Computer Engg, IIT(BHU)1 CONCEPTS-1. Pipelining Pipelining is used to increase the speed of processing It uses temporal parallelism In pipelining,

Sunpyo Hong, Hyesoon Kim

Parallel Programming - Sorting David Monismith CS599 Notes are primarily based upon Introduction to Parallel Programming, Second Edition by Grama, Gupta,

Unit-8 Sorting Algorithms Prepared By:-H.M.PATEL.

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1.

CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose.

Computer Orgnization Rabie A. Ramadan Lecture 9. Cache Mapping Schemes.

GPU ProgrammingOther Relevant ObservationsExperiments GPU kernels run on C blocks (CTAs) of W warps. Each warp is a group of w=32 threads, which are executed.

Sathish Vadhiyar Parallel Programming

Accelerating MapReduce on a Coupled CPU-GPU Architecture

Database Applications (15-415) DBMS Internals- Part V Lecture 17, March 20, 2018 Mohammad Hammoud.

CS/EE 217 – GPU Architecture and Parallel Programming

CS179: GPU PROGRAMMING Recitation 2 GPU Memory Synchronization

Presentation transcript:

High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne Key Laboratory of Computer System and Architecture ICT, CAS, China

Outline GPU computation model Our sorting algorithm –A new bitonic-based merge sort, named Warpsort Experiment results conclusion

GPU computation model Massively multi-threaded, data-parallel many-core architecture Important features: –SIMT execution model Avoid branch divergence –Warp-based scheduling implicit hardware synchronization among threads within a warp –Access pattern Coalesced vs. non-coalesced

Why merge sort ? Similar case with external sorting –Limited shared memory on chip vs. limited main memory Sequential memory access –Easy to meet coalesced requirement

Why bitonic-based merge sort ? Massively fine-grained parallelism –Because of the relatively high complexity, bitonic network is not good at sorting large arrays –Only used to sort small subsequences in our implementation Again, coalesced memory access requirement

Problems in bitonic network naïve implementation –Block-based bitonic network –One element per thread Some problems –in each stage n elements produce only n/2 compare-and-swap operations Form both ascending pairs and descending pairs –Between stages synchronization block thread Too many branch divergences and synchronization operations

What we use ? Warp-based bitonic network –each bitonic network is assigned to an independent warp, instead of a block Barrier-free, avoid synchronization between stages –threads in a warp perform 32 distinct compare-and-swap operations with the same order Avoid branch divergences At least 128 elements per warp And further a complete comparison-based sorting algorithm: GPU-Warpsort

Overview of GPU-Warpsort Divide input seq into small tiles, and each followed by a warp-based bitonic sort Merge, until the parallelism is insufficient. Split into small subsequences Merge, and form the output

Step1: barrier-free bitonic sort divide the input array into equal-sized tiles Each tile is sorted by a warp-based bitonic network –128+ elements per tile to avoid branch divergence –No need for __syncthreads() –Ascending pairs + descending pairs –Use max() and min() to replace if-swap pairs

Step 2: bitonic-based merge sort t-element merge sort –Allocate a t-element buffer in shared memory –Load the t/2 smallest elements from seq A and B, respectively –Merge –Output the lower t/2 elements –Load the next t/2 smallest elements from A or B t = 8 in this example

Step 3: split into small tiles Problem of merge sort –the number of pairs decreases geometrically –Can not fit this massively parallel platform Method –Divide the large seqs into independent small tiles which satisfy:

Step 3: split into small tiles (cont.) How to get the splitters? –Sample the input sequence randomly

Step 4: final merge sort Subsequences (0,i), (1,i),…, (l-1,i) are merged into S i Then,S 0, S 1,…, S l are assembled into a totally sorted array

Experimental setup Host –AMD 2.4 GHz, 2GB RAM GPU –9800GTX+, 512 MB Input sequence –Key-only and key-value configurations 32-bit keys and values –Sequence size: from 1M to 16M elements –Distributions Zero, Sorted, Uniform, Bucket, and Gaussian

Performance comparison Mergesort –Fastest comparison-based sorting algorithm on GPU (Satish, IPDPS’09) –Implementations already compared by Satish are not included Quicksort –Cederman, ESA’08 Radixsort –Fastest sorting algorithm on GPU (Satish, IPDPS’09) Warpsort –Our implementation

Performance results Key-only –70% higher performance than quicksort Key-value –20%+ higher performance than mergesort –30%+ for large sequences (>4M)

Results under different distributions Uniform, Bucket, and Gaussian distribution almost get the same performance Zero distribution is the fastest Not excel on Sorted distribution –Load imbalance

Conclusion We present an efficient comparison-based sorting algorithm for many-core GPUs –carefully map the tasks to GPU architecture Use warp-based bitonic network to eliminate barriers –provide sufficient homogeneous parallel operations for each thread avoid thread idling or thread divergence –totally coalesced global memory accesses when fetching and storing the sequence elements The results demonstrate up to 30% higher performance –Compared with previous optimized comparison-based algorithms

Thanks