CS 179: GPU Programming Lecture 7.

CS 179: GPU Programming Lecture 7

Week 3 Goals:
– Advanced GPU-accelerable algorithms
– CUDA libraries and tools

This Lecture
GPU-accelerable algorithms:
– Reduction
– Prefix sum
– Stream compaction
– Sorting (quicksort)

Elementwise Addition
Problem: C[i] = A[i] + B[i]

CPU code:
float *C = malloc(N * sizeof(float));
for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];

GPU code (kernel body):
// assign device and host memory pointers, and allocate device memory
int thread_index = threadIdx.x + blockIdx.x * blockDim.x;
while (thread_index < N) {
    C[thread_index] = A[thread_index] + B[thread_index];
    thread_index += blockDim.x * gridDim.x;
}
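For completeness, a self-contained sketch of the full kernel and a possible launch; the kernel name, launch configuration, and device pointer names are illustrative assumptions, not from the slide:

// Grid-stride elementwise addition: works even when N exceeds the total
// number of launched threads.
__global__ void vector_add(const float *A, const float *B, float *C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    while (i < N) {
        C[i] = A[i] + B[i];
        i += blockDim.x * gridDim.x;
    }
}

// Host side, after cudaMalloc/cudaMemcpy of d_A, d_B, d_C:
// vector_add<<<num_blocks, threads_per_block>>>(d_A, d_B, d_C, N);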

Reduction Example
Problem: SUM(A[])

CPU code:
float sum = 0.0;
for (int i = 0; i < N; i++)
    sum += A[i];

GPU pseudocode:
// assign, allocate, initialize device and host memory pointers
// create threads and assign indices for each thread
// assign each thread a specific region to sum over
// wait for all threads to finish running ( __syncthreads(); )
// combine all thread sums for the final solution

Naïve Reduction
Problem: sum of an array.
Serial recombination causes a slowdown on GPUs, especially with a higher number of threads.
The GPU must use atomic functions for mutual exclusion:
– atomicCAS
– atomicAdd

Naive Reduction Suppose we wished to accumulate our results…

Naive Reduction Suppose we wished to accumulate our results… Thread-unsafe!

Naive (but correct) Reduction
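The code on this slide appears only as an image in the original; a minimal sketch of a naive but thread-safe reduction, where every thread atomicAdd's its elements into a single global accumulator, might look like this (all names are assumptions):

// *result must be zero-initialized on the host before launch.
__global__ void naive_reduction(const float *A, float *result, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    while (i < N) {
        // atomicAdd avoids the race an unguarded "*result += A[i]" would cause
        atomicAdd(result, A[i]);
        i += blockDim.x * gridDim.x;
    }
}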

Shared memory accumulation

Shared memory accumulation (2)
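The shared-memory code is likewise not reproduced in the transcript; a sketch of this approach, where each block accumulates into shared memory and issues only one atomicAdd per block, might look like the following (all names are assumptions):

__global__ void shmem_reduction(const float *A, float *result, int N) {
    extern __shared__ float partial[];   // one slot per thread in the block
    int tid = threadIdx.x;
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // Each thread accumulates its grid-stride portion of A
    float sum = 0.0f;
    while (i < N) {
        sum += A[i];
        i += blockDim.x * gridDim.x;
    }
    partial[tid] = sum;
    __syncthreads();

    // One thread per block serially combines the block's partial sums
    if (tid == 0) {
        float block_sum = 0.0f;
        for (int j = 0; j < blockDim.x; j++)
            block_sum += partial[j];
        atomicAdd(result, block_sum);
    }
}

// Launch with dynamic shared memory sized to the block:
// shmem_reduction<<<blocks, threads, threads * sizeof(float)>>>(d_A, d_result, N);

The serial loop at the end is exactly what the "binary tree" reduction on the next slides replaces.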

“Binary tree” reduction One thread atomicAdd’s this to global result

“Binary tree” reduction Use __syncthreads() before proceeding!

“Binary tree” reduction Warp Divergence! Odd threads won’t even execute
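The tree-reduction code itself is only shown as an image; a sketch of this interleaved-addressing version (following Harris's "Optimizing Parallel Reduction in CUDA"; variable names reuse the assumed shared-memory kernel above) makes the divergence visible:

// Replaces the serial combine at the end of the shared-memory kernel.
// In the first iteration only even-numbered threads do work, then only every
// fourth thread, and so on -- threads within a warp take different paths,
// causing warp divergence.
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0)
        partial[tid] += partial[tid + s];
    __syncthreads();   // all sums at this level must be complete before the next
}
if (tid == 0)
    atomicAdd(result, partial[0]);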

Non-divergent reduction

Non-divergent reduction Shared Memory Bank Conflicts! 1st iteration: 2-way, 2nd iteration: 4-way (!), …

Sequential addressing
Sequential addressing automatically resolves the bank conflict problems.
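A sketch of the sequential-addressing loop (again following Harris's reduction guide; names as in the sketches above) that replaces the interleaved version:

// Active threads form a contiguous range [0, s), and thread tid reads
// partial[tid + s]: consecutive threads touch consecutive shared-memory
// addresses, which fall in different banks. Assumes blockDim.x is a power of two.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        partial[tid] += partial[tid + s];
    __syncthreads();
}
if (tid == 0)
    atomicAdd(result, partial[0]);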

Reduction
More improvements are possible; see "Optimizing Parallel Reduction in CUDA" (Harris) for code examples.
Moral: this is a different type of GPU-accelerable problem. Some problems are "parallelizable" in a different sense, with more hardware considerations in play.

Outline
GPU-accelerated:
– Reduction
– Prefix sum (next)
– Stream compaction
– Sorting (quicksort)

Prefix Sum
Given an input sequence x[n], produce the sequence y[n] = x[0] + x[1] + … + x[n]
e.g. x[n] = (1, 2, 3, 4, 5, 6) -> y[n] = (1, 3, 6, 10, 15, 21)
Recurrence relation: y[n] = y[n-1] + x[n]

Prefix Sum
Given an input sequence x[n], produce the sequence y[n] = x[0] + x[1] + … + x[n]
e.g. x[n] = (1, 1, 1, 1, 1, 1, 1) -> y[n] = (1, 2, 3, 4, 5, 6, 7)
e.g. x[n] = (1, 2, 3, 4, 5, 6) -> y[n] = (1, 3, 6, 10, 15, 21)

Prefix Sum
Recurrence relation: y[n] = y[n-1] + x[n]
Is it parallelizable? Is it GPU-accelerable?

Recall:
y[n] = x[n] + x[n-1] + … + x[n-(K-1)]   -- easily parallelizable!
y[n] = c·x[n] + (1-c)·y[n-1]            -- not so much

Prefix Sum
Recurrence relation: y[n] = y[n-1] + x[n]
Is it parallelizable? Is it GPU-accelerable?
Goal: parallelize using a "reduction-like" strategy

Prefix Sum sample code (up-sweep)

for d = 0 to log2(n) - 1 do
    for all k = 0 to n-1 by 2^(d+1) in parallel do
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

Original array: [1, 2, 3, 4, 5, 6, 7, 8]
d = 0:          [1, 3, 3, 7, 5, 11, 7, 15]
d = 1:          [1, 3, 3, 10, 5, 11, 7, 26]
d = 2:          [1, 3, 3, 10, 5, 11, 7, 36]

We want: [0, 1, 3, 6, 10, 15, 21, 28]

(University of Michigan EECS, http://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf)

Prefix Sum sample code (down-sweep)

x[n-1] = 0
for d = log2(n) - 1 down to 0 do
    for all k = 0 to n-1 by 2^(d+1) in parallel do
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]

After up-sweep:  [1, 3, 3, 10, 5, 11, 7, 36]
Set x[n-1] = 0:  [1, 3, 3, 10, 5, 11, 7, 0]
d = 2:           [1, 3, 3, 0, 5, 11, 7, 10]
d = 1:           [1, 0, 3, 3, 5, 10, 7, 21]
d = 0:           [0, 1, 3, 6, 10, 15, 21, 28]

Original: [1, 2, 3, 4, 5, 6, 7, 8]
Final result: [0, 1, 3, 6, 10, 15, 21, 28]

(University of Michigan EECS, http://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf)
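A minimal single-block CUDA sketch combining the up-sweep and down-sweep above (an exclusive Blelloch scan in shared memory) might look like this. It assumes n is a power of two, one thread per element in a single block, and ignores the bank-conflict padding discussed below; all names are assumptions:

__global__ void blelloch_scan(float *data, int n) {
    extern __shared__ float temp[];
    int tid = threadIdx.x;
    temp[tid] = data[tid];
    __syncthreads();

    // Up-sweep (reduce) phase: stride d doubles each iteration
    for (int d = 1; d < n; d *= 2) {
        int idx = (tid + 1) * 2 * d - 1;
        if (idx < n)
            temp[idx] += temp[idx - d];
        __syncthreads();
    }

    // Clear the last element, then down-sweep
    if (tid == 0)
        temp[n - 1] = 0.0f;
    __syncthreads();

    for (int d = n / 2; d >= 1; d /= 2) {
        int idx = (tid + 1) * 2 * d - 1;
        if (idx < n) {
            float t = temp[idx - d];
            temp[idx - d] = temp[idx];
            temp[idx] += t;
        }
        __syncthreads();
    }

    data[tid] = temp[tid];
}

// Possible launch: blelloch_scan<<<1, n, n * sizeof(float)>>>(d_data, n);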

Prefix Sum (Up-Sweep)
Use __syncthreads() before proceeding!
(University of Michigan EECS, http://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf)

Prefix Sum (Down-Sweep)
Use __syncthreads() before proceeding!
(University of Michigan EECS, http://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf)

Prefix sum Bank conflicts galore! 2-way, 4-way, …

Prefix sum
Bank conflicts! 2-way, 4-way, …
Pad addresses!
(University of Michigan EECS, http://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf)
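A common padding scheme (adapted from GPU Gems 3, Ch. 39; the macro names follow that chapter, and the bank count here is an assumption for current hardware) adds an extra offset to each shared-memory index so strided accesses stop mapping to the same bank:

#define NUM_BANKS 32
#define LOG_NUM_BANKS 5
#define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)

// Usage inside the scan kernel (sketch); the shared array must be allocated
// with room for the extra padding elements:
// int ai = ..., bi = ...;
// ai += CONFLICT_FREE_OFFSET(ai);
// bi += CONFLICT_FREE_OFFSET(bi);
// temp[bi] += temp[ai];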

Prefix Sum
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
See the link for a more in-depth explanation of up-sweep and down-sweep.

Outline
GPU-accelerated:
– Reduction
– Prefix sum
– Stream compaction (next)
– Sorting (quicksort)

Stream Compaction
Problem: given array A, produce the subarray of A defined by a boolean condition.
e.g. given the array [2, 5, 1, 4, 6, 3], produce the array of numbers > 3: [5, 4, 6]

Stream Compaction
Given array A:
GPU kernel 1: evaluate the boolean condition; array M: 1 if true, 0 if false
GPU kernel 2: cumulative sum of M (denote S)
GPU kernel 3: at each index, if M[idx] is 1, store A[idx] in the output at position (S[idx] - 1)

e.g. A = [2, 5, 1, 4, 6, 3], condition > 3:
M = [0, 1, 0, 1, 1, 0]
S = [0, 1, 1, 2, 3, 3]
Output = [5, 4, 6]
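A minimal sketch of kernels 1 and 3 (kernel 2 would be the prefix-sum scan above, assumed here to be an inclusive scan; all names are assumptions):

// Kernel 1: mark elements satisfying the condition (here: > 3)
__global__ void mark(const int *A, int *M, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    while (i < N) {
        M[i] = (A[i] > 3) ? 1 : 0;
        i += blockDim.x * gridDim.x;
    }
}

// Kernel 3: scatter marked elements using S, the inclusive scan of M
__global__ void scatter(const int *A, const int *M, const int *S,
                        int *out, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    while (i < N) {
        if (M[i] == 1)
            out[S[i] - 1] = A[i];
        i += blockDim.x * gridDim.x;
    }
}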

Outline
GPU-accelerated:
– Reduction
– Prefix sum
– Stream compaction
– Sorting (quicksort) (next)

GPU-accelerated quicksort
Divide-and-conquer algorithm: partition the array around a chosen pivot point.

Pseudocode (sequential version):
quicksort(A, lo, hi):
    if lo < hi:
        p := partition(A, lo, hi)
        quicksort(A, lo, p - 1)
        quicksort(A, p + 1, hi)
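The partition step is not spelled out on the slide; a plain C sketch of the sequential version using a Lomuto-style partition (an assumption, since the slide does not specify the scheme) might look like this:

// Lomuto partition: uses the last element as the pivot and returns its final
// index; elements <= pivot end up on its left, larger elements on its right.
static int partition(int *A, int lo, int hi) {
    int pivot = A[hi];
    int i = lo;
    for (int j = lo; j < hi; j++) {
        if (A[j] <= pivot) {
            int tmp = A[i]; A[i] = A[j]; A[j] = tmp;
            i++;
        }
    }
    int tmp = A[i]; A[i] = A[hi]; A[hi] = tmp;
    return i;
}

static void quicksort(int *A, int lo, int hi) {
    if (lo < hi) {
        int p = partition(A, lo, hi);
        quicksort(A, lo, p - 1);
        quicksort(A, p + 1, hi);
    }
}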

GPU-accelerated partition
Given array A = [2, 5, 1, 4, 6, 3]:
Choose a pivot (e.g. 3)
Stream compact on condition ≤ 3:                      [2, 1]
Store the pivot:                                      [2, 1, 3]
Stream compact on condition > 3 (store with offset):  [2, 1, 3, 5, 4, 6]

GPU acceleration details
Continued partitioning/synchronization on sub-arrays results in a sorted array.

Final Thoughts
"Less obviously parallelizable" problems
Hardware matters! (synchronization, bank conflicts, …)
Resources:
– GPU Gems, Vol. 3, Ch. 39
– "Optimizing Parallel Reduction in CUDA" (Harris): a highly recommended guide to CUDA optimization, with a reduction example