Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology.



 System Model  The Algorithm  Experiments

Normal processors:
 Multi-core
 Large cache
 Few threads
 Speculative execution

Graphics processors:
 Many-core
 Small cache
 Wide and fast memory bus
 Thousands of threads hide memory latency

 Designed for general purpose computation  Extensions to C/C++

[Figure, built up over several slides: the GPU memory hierarchy. A Global Memory is shared by all multiprocessors; Multiprocessor 0 and Multiprocessor 1 each have their own Shared Memory; Thread Blocks 0, 1, …, x, y are scheduled onto the multiprocessors.]

 Minimal, manual cache  32-word SIMD instruction  Coalesced memory access  No block synchronization  Expensive synchronization primitives

 System Model  The Algorithm  Experiments

Data Parallel!

NOT Data Parallel!

Phase One
Sequences divided
Block synchronization on the CPU
[Figure: Thread Block 1 and Thread Block 2 working on the same sequence]

Phase Two
Each block is assigned own sequence
Run entirely on GPU
[Figure: Thread Block 1 and Thread Block 2, each with its own sequence]

 Assign Thread Blocks

 Uses an auxiliary array  In-place too expensive

Fetch-And-Add on LeftPointer
[Figure, animated across several slides: Thread Block ONE and Thread Block TWO each perform Fetch-And-Add on LeftPointer to claim write slots in the auxiliary array; LeftPointer advances from 0 to 2 to 4 as slots are claimed.]

Three-way partition

[Figure: elements scattered into three regions — less than, equal to, and greater than the pivot]
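The three-way scatter can be sketched sequentially (an illustrative reconstruction with hypothetical names, mirroring the two-pass count-then-write scheme a GPU would use with a prefix sum). Elements equal to the pivot land in the middle region and never need to be recursed on:

```cpp
#include <cstddef>
#include <vector>

// Pass 1 counts how many elements fall below / equal the pivot
// (on the GPU, per-thread counts plus a prefix sum give the offsets).
// Pass 2 scatters each element into its region of the auxiliary array.
void threeWayPartition(const std::vector<int>& in, int pivot,
                       std::vector<int>& out,
                       size_t& eqBegin, size_t& eqEnd) {
    size_t lt = 0, eq = 0;
    for (int x : in) {
        if (x < pivot) ++lt;
        else if (x == pivot) ++eq;
    }
    out.assign(in.size(), 0);
    size_t l = 0, m = lt, r = lt + eq;   // start offsets of the three regions
    for (int x : in) {
        if (x < pivot) out[l++] = x;
        else if (x == pivot) out[m++] = x;
        else out[r++] = x;
    }
    eqBegin = lt;                        // the [eqBegin, eqEnd) region is done
    eqEnd = lt + eq;
}
```

Only the "less than" and "greater than" regions are pushed for further sorting, which keeps inputs with many duplicate keys (such as the Zero distribution) cheap.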

 Recurse on smallest sequence
◦ Max depth log(n)
 Explicit stack
 Alternative sorting
◦ Bitonic sort
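The stack discipline above can be sketched in C++ (an illustrative reconstruction; the function name and the cutoff of 16 are my choices, and std::sort stands in for the bitonic-sort fallback). Always deferring the larger partition and continuing with the smaller one bounds the explicit stack at about log2(n) entries:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

void quicksortExplicitStack(std::vector<int>& a) {
    std::vector<std::pair<size_t, size_t>> stack;  // pending [lo, hi) ranges
    size_t lo = 0, hi = a.size();
    for (;;) {
        while (hi - lo > 16) {                     // small-sequence cutoff
            int pivot = a[lo + (hi - lo) / 2];
            // Three-way split: [lo, l) < pivot, [l, r) == pivot, [r, hi) > pivot.
            auto midLo = std::partition(a.begin() + lo, a.begin() + hi,
                                        [&](int x) { return x < pivot; });
            auto midHi = std::partition(midLo, a.begin() + hi,
                                        [&](int x) { return x == pivot; });
            size_t l = midLo - a.begin(), r = midHi - a.begin();
            if (l - lo < hi - r) {                 // left half is smaller
                stack.emplace_back(r, hi);         // defer the larger half
                hi = l;                            // recurse into the smaller half
            } else {
                stack.emplace_back(lo, l);
                lo = r;
            }
        }
        std::sort(a.begin() + lo, a.begin() + hi); // bitonic-sort stand-in
        if (stack.empty()) break;
        lo = stack.back().first;
        hi = stack.back().second;
        stack.pop_back();
    }
}
```

The middle (equal-to-pivot) region is never pushed, so each iteration strictly shrinks the work and the loop always terminates.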

 System Model  The Algorithm  Experiments

 GPU-Quicksort (Cederman and Tsigas, ESA ’08)
 GPUSort (Govindaraju et al., SIGMOD ’05)
 Radix-Merge (Harris et al., GPU Gems 3, ’07)
 Global radix (Sengupta et al., GH ’07)
 Hybrid (Sintorn and Assarsson, GPGPU ’07)
 Introsort (David Musser, Software: Practice and Experience)

 Uniform  Gaussian  Bucket  Sorted  Zero  Stanford Models


 First Phase
◦ Average of maximum and minimum element
 Second Phase
◦ Median of first, middle and last element
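The two pivot rules can be sketched as small helpers (my own illustration and naming; the real code tracks each subsequence's minimum and maximum while partitioning, so the phase-one pivot is essentially free):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Phase one: average of the smallest and largest element seen so far.
// Halving before adding avoids signed overflow for extreme values.
int phaseOnePivot(int minSeen, int maxSeen) {
    return minSeen / 2 + maxSeen / 2;
}

// Phase two: median of the first, middle and last element.
int phaseTwoPivot(const std::vector<int>& a) {
    int f = a.front(), m = a[a.size() / 2], l = a.back();
    return std::max(std::min(f, m), std::min(std::max(f, m), l));
}
```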

 8800GTX
◦ 16 multiprocessors (128 cores)
◦ 86.4 GB/s bandwidth
 8600GTS
◦ 4 multiprocessors (32 cores)
◦ 32 GB/s bandwidth
 CPU-Reference
◦ AMD Dual-Core Opteron 265 / 1.8 GHz

CPU-Reference ~4s

GPUSort and Radix-Merge ~1.5s

GPU-Quicksort ~380ms; Global Radix ~510ms

Reference is faster than GPUSort and Radix-Merge

GPU-Quicksort takes less than 200ms

Bitonic is unaffected by distribution

Three-way partition

Reference equal to Radix-Merge

Quicksort and hybrid ~4 times faster

Hybrid needs randomization

Reference better

Similar to uniform distribution

Sequence needs to be a power of two in length

 Minimal, manual cache
◦ Used only for prefix sum and bitonic sort
 32-word SIMD instruction
◦ Main part executes same instructions
 Coalesced memory access
◦ All reads coalesced
 No block synchronization
◦ Only required in first phase
 Expensive synchronization primitives
◦ Two passes amortize the cost
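The prefix sum mentioned above, the main use of the shared memory, turns per-thread element counts into write offsets. A sequential stand-in (my illustration; on the GPU this runs in shared memory within a thread block) looks like:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: offsets[i] is the sum of counts[0..i-1],
// i.e. where thread i should start writing its elements.
std::vector<size_t> exclusiveScan(const std::vector<size_t>& counts) {
    std::vector<size_t> offsets(counts.size());
    size_t running = 0;
    for (size_t i = 0; i < counts.size(); ++i) {
        offsets[i] = running;    // sum of all earlier counts
        running += counts[i];
    }
    return offsets;
}
```

For example, if three threads count 3, 1 and 4 elements below the pivot, they write starting at offsets 0, 3 and 4 respectively.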

 Quicksort is a viable sorting method for graphics processors and can be implemented in a data-parallel way
 It is competitive