CS 179: GPU Programming Lecture 7

Week 3 Goals:
– More involved GPU-accelerable algorithms
  – Relevant hardware quirks
– CUDA libraries

Outline
GPU-accelerated:
– Reduction
– Prefix sum
– Stream compaction
– Sorting (quicksort)

Reduction
Find the sum of an array (or apply any associative operator, e.g. product).
CPU code:
    float sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += A[i];

Add two arrays: A[] + B[] -> C[]
CPU code:
    float *C = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
Find the sum of an array (or apply any associative operator, e.g. product).
CPU code:
    float sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += A[i];

Reduction vs. elementwise add
Add two arrays (multithreaded pseudocode):
    (allocate memory for C)
    (create threads, assign indices)
    ...
    In each thread,
        for (i in this thread's region)
            C[i] <- A[i] + B[i]
    Wait for threads to synchronize...
Sum of an array (multithreaded pseudocode):
    (set sum to 0.0)
    (create threads, assign indices)
    ...
    In each thread,
        (set thread_sum to 0.0)
        for (i in this thread's region)
            thread_sum += A[i]
        "return" thread_sum
    Wait for threads to synchronize...
    for j = 0,...,#threads-1:
        sum += (thread j's sum)

Reduction vs. elementwise add
(Same pseudocode as above.) Note the final loop over the per-thread sums: serial recombination!

Reduction vs. elementwise add
Serial recombination has greater impact with more threads:
– CPU: no big deal (a handful of threads)
– GPU: big deal (potentially thousands of threads)

Reduction vs. elementwise add (v2)
Add two arrays (multithreaded pseudocode):
    (allocate memory for C)
    (create threads, assign indices)
    ...
    In each thread,
        for (i in this thread's region)
            C[i] <- A[i] + B[i]
    Wait for threads to synchronize...
Sum of an array (multithreaded pseudocode):
    (set sum to 0.0)
    (create threads, assign indices)
    ...
    In each thread,
        (set thread_sum to 0.0)
        for (i in this thread's region)
            thread_sum += A[i]
        Atomically add thread_sum to sum
    Wait for threads to synchronize...

Reduction vs. elementwise add (v2)
(Same pseudocode as above.) The atomic adds into the single global sum mean serialized access!

Naive reduction Suppose we wished to accumulate our results…

Naive reduction
Accumulating results by having every thread update a single shared variable directly is thread-unsafe!

Naive (but correct) reduction
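The code for this slide was an image that didn't survive the transcript. As a minimal sketch (not the lecture's exact code), a naive but correct reduction can have each thread accumulate a private sum and publish it with a single atomicAdd, assuming result is zero-initialized on the device:

    __global__ void naiveReduce(const float *A, float *result, int N) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // Grid-stride loop: each thread privately sums its region of A.
        float thread_sum = 0.0f;
        for (int i = idx; i < N; i += blockDim.x * gridDim.x)
            thread_sum += A[i];
        // atomicAdd makes the shared update thread-safe, at the cost of
        // serializing access to result.
        atomicAdd(result, thread_sum);
    }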

GPU threads in naive reduction

Shared memory accumulation

Shared memory accumulation (2)
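Again the slide's code was lost; a sketch of the idea (names and structure assumed): accumulate within each block in shared memory first, so only one global atomic is issued per block instead of one per thread:

    __global__ void sharedAccumReduce(const float *A, float *result, int N) {
        __shared__ float block_sum;          // one accumulator per block
        if (threadIdx.x == 0) block_sum = 0.0f;
        __syncthreads();

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float thread_sum = 0.0f;
        for (int i = idx; i < N; i += blockDim.x * gridDim.x)
            thread_sum += A[i];

        // Shared-memory atomics contend only within the block.
        atomicAdd(&block_sum, thread_sum);
        __syncthreads();
        if (threadIdx.x == 0)
            atomicAdd(result, block_sum);    // one global atomic per block
    }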

“Binary tree” reduction
One thread atomicAdd’s the block’s result to the global result.

“Binary tree” reduction Use __syncthreads() before proceeding!

“Binary tree” reduction Divergence! – Uses twice as many warps as necessary!
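A sketch of the interleaved-addressing tree this slide criticizes (assumed, not the lecture's exact code): at each level only threads whose index is a multiple of the stride do work, so the active threads stay scattered across every warp and no warp can retire early:

    __global__ void treeReduce(const float *A, float *result, int N) {
        extern __shared__ float sdata[];      // blockDim.x floats
        int tid = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + tid;
        sdata[tid] = (idx < N) ? A[idx] : 0.0f;
        __syncthreads();

        // Interleaved addressing: the stride s doubles at each tree level.
        for (int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2 * s) == 0)            // divergent: active threads are
                sdata[tid] += sdata[tid + s];  // spread across all warps
            __syncthreads();                   // sync before the next level
        }
        if (tid == 0)
            atomicAdd(result, sdata[0]);       // publish the block's sum
    }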

Non-divergent reduction

Non-divergent reduction
Bank conflicts!
– 1st iteration: 2-way
– 2nd iteration: 4-way (!)
– …

Sequential addressing
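This slide's code was also an image; as a sketch, sequential addressing replaces the inner loop of the tree kernel above. Active threads stay contiguous, so whole warps drop out together and the strided bank conflicts disappear:

    // Stride starts at half the block and halves at each level.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)                       // threads 0..s-1 stay active;
            sdata[tid] += sdata[tid + s];  // inactive warps retire entirely
        __syncthreads();
    }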

Reduction
More improvements possible:
– “Optimizing Parallel Reduction in CUDA” (Harris) – code examples!
Moral:
– A different type of GPU-accelerable problem: some are “parallelizable” in a different sense
– More hardware considerations in play

Outline
GPU-accelerated:
– Reduction
– Prefix sum
– Stream compaction
– Sorting (quicksort)

Prefix Sum

Prefix Sum sample code (up-sweep)
Original array: [1, 2, 3, 4, 5, 6, 7, 8]
We want: [0, 1, 3, 6, 10, 15, 21, 28]
Up-sweep progression:
[1, 3, 3, 7, 5, 11, 7, 15]
[1, 3, 3, 10, 5, 11, 7, 26]
[1, 3, 3, 10, 5, 11, 7, 36]
(University of Michigan EECS)

Prefix Sum sample code (down-sweep)
Original: [1, 2, 3, 4, 5, 6, 7, 8]
Starting from the up-sweep result, zero the last element, then sweep down:
[1, 3, 3, 10, 5, 11, 7, 36]
[1, 3, 3, 10, 5, 11, 7, 0]
[1, 3, 3, 0, 5, 11, 7, 10]
[1, 0, 3, 3, 5, 10, 7, 21]
[0, 1, 3, 6, 10, 15, 21, 28]  <- final result
(University of Michigan EECS)

Prefix Sum (Up-Sweep)
Use __syncthreads() before proceeding!
(University of Michigan EECS)

Prefix Sum (Down-Sweep)
Use __syncthreads() before proceeding!
(University of Michigan EECS)
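The up-sweep and down-sweep code on these slides was lost in the transcript. A single-block sketch in the style of GPU Gems 3, Ch. 39, assuming n is a power of two and the kernel is launched with n/2 threads and n * sizeof(float) of shared memory:

    __global__ void blellochScan(float *data, int n) {
        extern __shared__ float temp[];
        int tid = threadIdx.x;
        // Each thread loads two elements.
        temp[2 * tid]     = data[2 * tid];
        temp[2 * tid + 1] = data[2 * tid + 1];

        // Up-sweep (reduce) phase: build partial sums up the tree.
        int offset = 1;
        for (int d = n >> 1; d > 0; d >>= 1) {
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2 * tid + 1) - 1;
                int bi = offset * (2 * tid + 2) - 1;
                temp[bi] += temp[ai];
            }
            offset *= 2;
        }

        // Down-sweep phase: clear the root, then push partial sums down.
        if (tid == 0) temp[n - 1] = 0.0f;
        for (int d = 1; d < n; d *= 2) {
            offset >>= 1;
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2 * tid + 1) - 1;
                int bi = offset * (2 * tid + 2) - 1;
                float t  = temp[ai];
                temp[ai] = temp[bi];
                temp[bi] += t;
            }
        }
        __syncthreads();
        data[2 * tid]     = temp[2 * tid];   // write back exclusive prefix sum
        data[2 * tid + 1] = temp[2 * tid + 1];
    }

Launched, e.g., as blellochScan<<<1, n / 2, n * sizeof(float)>>>(d_data, n).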

Prefix sum
Bank conflicts!
– 2-way, 4-way, …

Prefix sum
The fix: pad addresses!
(University of Michigan EECS)
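One common padding scheme (from the GPU Gems scan, with the bank count updated for current hardware; treat this as a sketch): shift each shared-memory index up by one extra slot for every NUM_BANKS elements, so the strided accesses land in distinct banks:

    #define NUM_BANKS 32
    #define LOG_NUM_BANKS 5
    // Every index gains one extra slot per NUM_BANKS elements; the shared
    // array must be declared with room for this padding.
    #define CONFLICT_FREE_OFFSET(i) ((i) >> LOG_NUM_BANKS)

    int ai = offset * (2 * tid + 1) - 1;
    int bi = offset * (2 * tid + 2) - 1;
    temp[bi + CONFLICT_FREE_OFFSET(bi)] += temp[ai + CONFLICT_FREE_OFFSET(ai)];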

Why does the prefix sum matter?

Outline
GPU-accelerated:
– Reduction
– Prefix sum
– Stream compaction
– Sorting (quicksort)

Stream Compaction
Problem:
– Given array A, produce the subarray of A defined by a boolean condition
– e.g. given an array of numbers, produce the array of only those elements greater than some value

Stream Compaction
Given array A:
– GPU kernel 1: Evaluate the boolean condition; produce array M: 1 if true, 0 if false
– GPU kernel 2: Cumulative sum of M (denote S)
– GPU kernel 3: At each index idx, if M[idx] is 1, store A[idx] in the output at position (S[idx] - 1)
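A sketch of kernels 1 and 3 (kernel 2 is a scan like the one above, taken as inclusive here; the predicate, names, and threshold parameter are illustrative, not the lecture's):

    // Kernel 1: evaluate the predicate, here hypothetically A[i] > threshold.
    __global__ void evalPredicate(const float *A, int *M, int N,
                                  float threshold) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            M[i] = (A[i] > threshold) ? 1 : 0;
    }

    // Kernel 3: scatter kept elements to their compacted positions.
    __global__ void scatter(const float *A, const int *M, const int *S,
                            float *out, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N && M[i] == 1)
            out[S[i] - 1] = A[i];  // S is inclusive, hence the -1;
                                   // an exclusive scan would use out[S[i]]
    }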

Outline
GPU-accelerated:
– Reduction
– Prefix sum
– Stream compaction
– Sorting (quicksort)

GPU-accelerated quicksort
Quicksort:
– Divide-and-conquer algorithm
– Partition the array around a chosen pivot
Pseudocode (sequential version):
    quicksort(A, lo, hi):
        if lo < hi:
            p := partition(A, lo, hi)
            quicksort(A, lo, p - 1)
            quicksort(A, p + 1, hi)

GPU-accelerated partition
Given array A:
– Choose pivot (e.g. 3)
– Stream compact on condition: ≤ 3
– Store pivot
– Stream compact on condition: > 3 (store with offset)
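A host-side sketch of one partition step, built around a hypothetical compact() helper (not in the lecture) that runs the compaction kernels above and returns how many elements it kept:

    enum Pred { LESS_EQUAL, GREATER };

    // Hypothetical helper (not shown): keeps the elements of src[0..n) that
    // satisfy pred vs. pivot, writes them to dst, and returns the count.
    int compact(const float *src, float *dst, int n, Pred pred, float pivot);

    // Partition device array A[lo..hi] around the pivot A[hi], using a
    // device scratch buffer of at least (hi - lo + 1) floats.
    int partitionGPU(float *A, int lo, int hi, float *scratch) {
        float pivot_val;
        cudaMemcpy(&pivot_val, A + hi, sizeof(float), cudaMemcpyDeviceToHost);

        int n = hi - lo;  // number of non-pivot elements
        int n_le = compact(A + lo, scratch, n, LESS_EQUAL, pivot_val);
        compact(A + lo, scratch + n_le + 1, n, GREATER, pivot_val);

        // Place the pivot between the two compacted halves, then copy back.
        cudaMemcpy(scratch + n_le, &pivot_val, sizeof(float),
                   cudaMemcpyHostToDevice);
        cudaMemcpy(A + lo, scratch, (n + 1) * sizeof(float),
                   cudaMemcpyDeviceToDevice);
        return lo + n_le;  // final index of the pivot
    }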

GPU acceleration details
Continued partitioning (with synchronization) on the sub-arrays results in a sorted array.

Final Thoughts
“Less obviously parallelizable” problems:
– Hardware matters! (synchronization, bank conflicts, …)
Resources:
– GPU Gems, Vol. 3, Ch. 39