CS 179: GPU Programming Lecture 7

Week 3 Goals:
– More involved GPU-accelerable algorithms
  – Relevant hardware quirks
– CUDA libraries

Outline GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Elementwise Addition

Problem: C[i] = A[i] + B[i]

CPU code:
    float *C = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];

GPU code:
    // assign device and host memory pointers, and allocate memory on the host
    int thread_index = threadIdx.x + blockIdx.x * blockDim.x;
    while (thread_index < N) {
        C[thread_index] = A[thread_index] + B[thread_index];
        thread_index += blockDim.x * gridDim.x;
    }
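For reference, here is a minimal compilable sketch of the GPU code above as an actual kernel plus launch; the kernel name addKernel and the launch configuration are illustrative, not part of the original slide.

    // Each thread walks the array with a grid-stride loop, so the code is
    // correct even when N exceeds the total number of launched threads.
    __global__ void addKernel(const float *A, const float *B, float *C, int N) {
        int thread_index = threadIdx.x + blockIdx.x * blockDim.x;
        while (thread_index < N) {
            C[thread_index] = A[thread_index] + B[thread_index];
            thread_index += blockDim.x * gridDim.x;
        }
    }

    // Host-side launch, after cudaMalloc()/cudaMemcpy() of dev_A, dev_B, dev_C:
    // addKernel<<<num_blocks, threads_per_block>>>(dev_A, dev_B, dev_C, N);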

Reduction Example

Problem: Sum of Array

CPU code:
    float sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += A[i];

GPU "code" (outline):
    // assign, allocate, initialize device and host memory pointers
    // create threads and assign indices for each thread
    // assign each thread a specific region to get a sum over
    // wait for all threads to finish running ( __syncthreads(); )
    // combine all thread sums for final solution

Naïve Reduction

Problem: Sum of Array
– Serial recombination causes a slowdown on GPUs, especially with a higher number of threads
– The GPU must use atomic functions for mutual exclusion:
  – atomicCAS
  – atomicAdd

Naive Reduction Suppose we wished to accumulate our results…

Naive Reduction Suppose we wished to accumulate our results… Thread-unsafe!

Naive (but correct) Reduction
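A minimal sketch of the naive-but-correct approach: every thread atomicAdd's its elements directly into a single global accumulator. This is thread-safe, but every update is serialized on the atomic. The kernel name is illustrative.

    __global__ void naiveSumKernel(const float *A, float *result, int N) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        while (i < N) {
            atomicAdd(result, A[i]);   // correct, but every update is serialized
            i += blockDim.x * gridDim.x;
        }
    }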

GPU threads in naive reduction

Shared memory accumulation

Shared memory accumulation (2)

“Binary tree” reduction One thread atomicAdd’s the block’s partial sum to the global result

“Binary tree” reduction Use __syncthreads() before proceeding!

“Binary tree” reduction Warp Divergence! – Odd threads won’t even execute
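A sketch of this shared-memory “binary tree” reduction with interleaved indexing; the modulo test is exactly what produces the warp divergence called out above. Kernel and variable names are illustrative; launch with blockDim.x * sizeof(float) of dynamic shared memory.

    __global__ void treeReduceDivergent(const float *A, float *result, int N) {
        extern __shared__ float sdata[];
        int tid = threadIdx.x;
        int i = threadIdx.x + blockIdx.x * blockDim.x;

        sdata[tid] = (i < N) ? A[i] : 0.0f;   // each thread loads one element
        __syncthreads();                      // all loads must finish before summing

        for (int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2 * s) == 0)           // only every (2s)-th thread works: divergent
                sdata[tid] += sdata[tid + s];
            __syncthreads();                  // wait before the next tree level
        }

        if (tid == 0)
            atomicAdd(result, sdata[0]);      // one thread adds the block sum to the global result
    }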

Non-divergent reduction

Non-divergent reduction

Shared memory bank conflicts!
– 1st iteration: 2-way
– 2nd iteration: 4-way (!)
– …

Sequential addressing

Sequential addressing automatically resolves bank conflict problems
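A sketch of the same block reduction with sequential addressing: the active threads at each step form a contiguous range (tid < s), so warps stay converged and consecutive threads touch consecutive shared-memory words in different banks. Names are illustrative.

    __global__ void treeReduceSequential(const float *A, float *result, int N) {
        extern __shared__ float sdata[];
        int tid = threadIdx.x;
        int i = threadIdx.x + blockIdx.x * blockDim.x;

        sdata[tid] = (i < N) ? A[i] : 0.0f;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)                      // contiguous active threads: no divergence within full warps
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            atomicAdd(result, sdata[0]);
    }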

Reduction

More improvements possible:
– “Optimizing Parallel Reduction in CUDA” (Harris) has code examples!

Moral:
– A different type of GPU-accelerable problem: some are “parallelizable” in a different sense
– More hardware considerations in play

Outline GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Prefix Sum

Prefix Sum sample code (up-sweep)

Original array: [1, 2, 3, 4, 5, 6, 7, 8]
We want: [0, 1, 3, 6, 10, 15, 21, 28]

    for d = 0 to (log2 n) - 1 do
        for all k = 0 to n - 1 by 2^(d+1) in parallel do
            x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

d = 0: [1, 3, 3, 7, 5, 11, 7, 15]
d = 1: [1, 3, 3, 10, 5, 11, 7, 26]
d = 2: [1, 3, 3, 10, 5, 11, 7, 36]

(University of Michigan EECS)

Prefix Sum sample code (down-sweep)

Up-sweep result: [1, 3, 3, 10, 5, 11, 7, 36]
Original: [1, 2, 3, 4, 5, 6, 7, 8]

    x[n - 1] = 0
    for d = log2(n) - 1 down to 0 do
        for all k = 0 to n - 1 by 2^(d+1) in parallel do
            t = x[k + 2^d - 1]
            x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
            x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]

Set x[n - 1] = 0: [1, 3, 3, 10, 5, 11, 7, 0]
d = 2: [1, 3, 3, 0, 5, 11, 7, 10]
d = 1: [1, 0, 3, 3, 5, 10, 7, 21]
d = 0: [0, 1, 3, 6, 10, 15, 21, 28]   (final result)

(University of Michigan EECS)

Prefix Sum (Up-Sweep) Original array Use __syncthreads() before proceeding! (University of Michigan EECS)

Prefix Sum (Down-Sweep) Final result Use __syncthreads() before proceeding! (University of Michigan EECS)
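Putting the two sweeps together, here is a minimal single-block CUDA sketch of the work-efficient (Blelloch) scan, closely following the structure in GPU Gems 3, Ch. 39. It assumes n is a power of two with n = 2 * blockDim.x, and does not yet include the bank-conflict padding discussed on the next slides. Names are illustrative.

    __global__ void blellochScan(float *out, const float *in, int n) {
        extern __shared__ float temp[];
        int tid = threadIdx.x;
        int offset = 1;

        // Each thread loads two elements into shared memory.
        temp[2 * tid]     = in[2 * tid];
        temp[2 * tid + 1] = in[2 * tid + 1];

        // Up-sweep (reduce) phase: build partial sums up the tree.
        for (int d = n >> 1; d > 0; d >>= 1) {
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2 * tid + 1) - 1;
                int bi = offset * (2 * tid + 2) - 1;
                temp[bi] += temp[ai];
            }
            offset *= 2;
        }

        // Down-sweep phase: clear the root, then push partial sums back down.
        if (tid == 0) temp[n - 1] = 0;
        for (int d = 1; d < n; d *= 2) {
            offset >>= 1;
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2 * tid + 1) - 1;
                int bi = offset * (2 * tid + 2) - 1;
                float t  = temp[ai];
                temp[ai] = temp[bi];
                temp[bi] += t;
            }
        }
        __syncthreads();

        // Write the exclusive prefix sum back to global memory.
        out[2 * tid]     = temp[2 * tid];
        out[2 * tid + 1] = temp[2 * tid + 1];
    }

    // Example launch: blellochScan<<<1, n / 2, n * sizeof(float)>>>(dev_out, dev_in, n);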

Prefix sum Bank conflicts galore! – 2-way, 4-way, …

Prefix sum Bank conflicts! – 2-way, 4-way, … – Pad addresses! (University of Michigan EECS)
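One common padding scheme (the one in GPU Gems 3, Ch. 39) skews each shared-memory index by one extra slot per bank's worth of elements, so threads in the same step hit different banks; the shared array just needs a little extra space. The bank count below assumes 32 banks, which is typical of current GPUs (the original GPU Gems code used 16).

    #define NUM_BANKS 32
    #define LOG_NUM_BANKS 5
    #define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)

    // Inside the scan sketch above, the indices would become:
    //     int ai = offset * (2 * tid + 1) - 1;
    //     int bi = offset * (2 * tid + 2) - 1;
    //     ai += CONFLICT_FREE_OFFSET(ai);
    //     bi += CONFLICT_FREE_OFFSET(bi);
    // and the shared array is allocated with n + CONFLICT_FREE_OFFSET(n - 1) elements.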

Prefix Sum

See GPU Gems 3, Ch. 39 (gpugems3_ch39.html) for a more in-depth explanation of up-sweep and down-sweep.

Why does the prefix sum matter?

Outline GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

Stream Compaction

Problem:
– Given array A, produce the subarray of A defined by a boolean condition
– e.g. given an array of numbers, produce the array of those greater than some threshold

Stream Compaction

Given array A:
– GPU kernel 1: evaluate the boolean condition into array M: 1 if true, 0 if false
– GPU kernel 2: cumulative sum of M (denote S)
– GPU kernel 3: at each index, if M[idx] is 1, store A[idx] in the output at position (S[idx] - 1)
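Kernels 1 and 2 are an elementwise map and a scan; a minimal sketch of kernel 3 (the scatter step) might look like the following, assuming S is the inclusive prefix sum of M (matching the S[idx] - 1 indexing on the slide). Names are illustrative.

    __global__ void scatterKernel(const float *A, const int *M, const int *S,
                                  float *out, int N) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        while (i < N) {
            if (M[i] == 1)
                out[S[i] - 1] = A[i];   // S[i] - 1 is this element's slot in the compacted output
            i += blockDim.x * gridDim.x;
        }
    }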

Outline GPU-accelerated: – Reduction – Prefix sum – Stream compaction – Sorting (quicksort)

GPU-accelerated quicksort

Quicksort:
– Divide-and-conquer algorithm
– Partition array along chosen pivot point

Pseudocode (sequential version):
    quicksort(A, lo, hi):
        if lo < hi:
            p := partition(A, lo, hi)
            quicksort(A, lo, p - 1)
            quicksort(A, p + 1, hi)
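For reference, a standard sequential partition matching this pseudocode (Lomuto scheme, pivot = last element); this is exactly the step the next slides replace with stream compaction on the GPU. The code is illustrative, not the course's implementation.

    int partition(int *A, int lo, int hi) {
        int pivot = A[hi];                           // choose the last element as the pivot
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (A[j] <= pivot) {                     // move smaller-or-equal elements to the front
                int tmp = A[i]; A[i] = A[j]; A[j] = tmp;
                i++;
            }
        }
        int tmp = A[i]; A[i] = A[hi]; A[hi] = tmp;   // put the pivot in its final place
        return i;
    }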

GPU-accelerated partition Given array A: – Choose pivot (e.g. 3) – Stream compact on condition: ≤ 3 – Store pivot – Stream compact on condition: > 3 (store with offset)

GPU acceleration details Continued partitioning and synchronization on sub-arrays results in a sorted array

Final Thoughts

“Less obviously parallelizable” problems:
– Hardware matters! (synchronization, bank conflicts, …)

Resources:
– GPU Gems, Vol. 3, Ch. 39 (highly recommended reading)
– Guide to CUDA optimization, with a reduction example