CS 193G Lecture 5: Parallel Patterns I

Getting out of the trenches
So far, we've concerned ourselves with low-level details of kernel programming:
- Mapping of threads to work
- Launch grid configuration
- __shared__ memory management
- Resource allocation
Lots of moving parts; hard to see the forest for the trees.

CUDA Madlibs

__global__ void foo(...)
{
  extern __shared__ smem[];
  int i = ???
  // now what???
}

...

int B = ???
int N = ???
int S = ???
foo<<<B, N, S>>>();
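
One possible way to fill in the blanks (a sketch, not from the lecture): a per-element kernel where each thread handles one element of an n-element array. The signature of foo, the names d_data and n, and the choice of 256 threads per block are all assumptions made for illustration.

// hypothetical filling of the madlib: one thread per element,
// one float of dynamic shared memory per thread
__global__ void foo(float *data, int n)
{
  extern __shared__ float smem[];
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
  if(i < n)
    smem[threadIdx.x] = data[i];                   // stage this thread's element
  __syncthreads();
  // ... operate on smem, write results back to data ...
}

// host-side launch (assumes d_data points to n floats in device memory)
void launch_foo(float *d_data, int n)
{
  int N = 256;                         // threads per block
  int B = (n + N - 1) / N;             // enough blocks to cover all n elements
  int S = N * sizeof(float);           // dynamic shared memory bytes per block
  foo<<<B, N, S>>>(d_data, n);
}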

Parallel Patterns
- Think at a higher level than individual CUDA kernels
- Specify what to compute, not how to compute it
- Let the programmer worry about the algorithm
- Defer pattern implementation to someone else

Common Parallel Computing Scenarios
- Many parallel threads need to generate a single result → Reduce
- Many parallel threads need to partition data → Split
- Many parallel threads produce variable output per thread → Compact / Expand

Primordial CUDA Pattern: Blocking
- Partition data to operate in well-sized blocks
  - Small enough to be staged in shared memory
  - Assign each data partition to a thread block
  - No different from cache blocking!
- Provides several performance benefits
  - Have enough blocks to keep processors busy
  - Working in shared memory cuts memory latency dramatically
  - Likely to have coherent access patterns on load/store to shared memory

Primordial CUDA Pattern: Blocking
Partition data into subsets that fit into shared memory

Primordial CUDA Pattern: Blocking
Handle each data subset with one thread block

Primordial CUDA Pattern: Blocking
Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism

Primordial CUDA Pattern: Blocking
Perform the computation on the subset from shared memory

Primordial CUDA Pattern: Blocking
Copy the result from shared memory back to global memory

Primordial CUDA Pattern: Blocking
- All CUDA kernels are built this way
  - Blocking may not matter for a particular problem, but you're still forced to think about it
- Not all kernels require __shared__ memory
  - All kernels do require registers
- All of the parallel patterns we'll discuss have CUDA implementations that exploit blocking in some fashion
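
To make the five steps above concrete, here is a minimal sketch of the blocking pattern (not from the slides); the element-wise operation and the names blocked_transform, in, and out are illustrative assumptions.

// blocking sketch: stage a tile in shared memory, compute on it,
// then write the result back to global memory
__global__ void blocked_transform(const float *in, float *out, size_t n)
{
  extern __shared__ float tile[];                    // one tile per block
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index

  // 1. load this block's subset from global to shared memory
  if(i < n) tile[threadIdx.x] = in[i];
  __syncthreads();

  // 2. compute on the subset in shared memory (placeholder operation)
  if(i < n) tile[threadIdx.x] = 2.0f * tile[threadIdx.x] + 1.0f;
  __syncthreads();

  // 3. copy the result back to global memory
  if(i < n) out[i] = tile[threadIdx.x];
}

A launch would supply the tile size as the dynamic shared memory argument, e.g. blocked_transform<<<num_blocks, block_size, block_size * sizeof(float)>>>(d_in, d_out, n).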

Reduction
- Reduce a vector to a single value
- Via an associative operator (+, *, min/max, AND/OR, ...)
- CPU: sequential implementation
    for(int i = 0; i < n; ++i) ...
- GPU: "tree"-based implementation

Serial Reduction

// reduction via serial iteration
float sum(float *data, int n)
{
  float result = 0;
  for(int i = 0; i < n; ++i)
  {
    result += data[i];
  }
  return result;
}

Parallel Reduction – Interleaved
[Diagram: values in shared memory are combined in place over log2(n) steps; the stride doubles each step (1, 2, 4, 8) and the number of active threads halves.]
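
The interleaved scheme in the diagram corresponds to a loop like the following sketch (inferred from the strides above, not code from the slides); it assumes the values are already loaded into sdata[] and that blockDim.x is a power of two.

// interleaved (stride-doubling) block reduction: at each step, threads whose
// index is a multiple of 2*stride add in the value stride elements away
for(int stride = 1; stride < blockDim.x; stride *= 2)
{
  if(threadIdx.x % (2 * stride) == 0)
    sdata[threadIdx.x] += sdata[threadIdx.x + stride];
  __syncthreads();
}
// sdata[0] now holds the block's sum

The contiguous variant used by the block_sum kernel below halves the offset each step instead, which keeps the active threads contiguous.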

Parallel Reduction – Contiguous

CUDA Reduction

__global__ void block_sum(float *input, float *results, size_t n)
{
  extern __shared__ float sdata[];
  int i = ...;
  int tx = threadIdx.x;

  // load input into __shared__ memory
  float x = 0;
  if(i < n)
    x = input[i];
  sdata[tx] = x;
  __syncthreads();

CUDA Reduction

// block-wide reduction in __shared__ mem
for(int offset = blockDim.x / 2; offset > 0; offset >>= 1)
{
  if(tx < offset)
  {
    // add a partial sum upstream to our own
    sdata[tx] += sdata[tx + offset];
  }
  __syncthreads();
}

CUDA Reduction

// finally, thread 0 writes the result
if(threadIdx.x == 0)
{
  // note that the result is per-block, not per-thread
  results[blockIdx.x] = sdata[0];
}
}  // end of block_sum

An Aside

// is this barrier divergent?
for(int offset = blockDim.x / 2; offset > 0; offset >>= 1)
{
  ...
  __syncthreads();
}

An Aside

// what about this one?
__global__ void do_i_halt(int *input)
{
  int i = ...;
  if(input[i])
  {
    ...
    __syncthreads();  // a divergent barrier
                      // hangs the machine
  }
}
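
One way such a kernel could be restructured so that every thread reaches the barrier (this fix is an illustrative assumption, not shown in the slides):

// all threads execute __syncthreads(); only the guarded work diverges
__global__ void do_i_halt_safely(int *input)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  bool active = input[i] != 0;
  if(active)
  {
    // ... divergent work that contains no barrier ...
  }
  __syncthreads();   // reached by every thread in the block
  if(active)
  {
    // ... work that depends on the barrier ...
  }
}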

CUDA Reduction

// global sum via per-block reductions
float sum(float *d_input, size_t n)
{
  size_t block_size = ..., num_blocks = ...;

  // allocate per-block partial sums
  // plus a final total sum
  float *d_sums = 0;
  cudaMalloc((void**)&d_sums, sizeof(float) * (num_blocks + 1));
  ...

CUDA Reduction

  // reduce per-block partial sums
  int smem_sz = block_size * sizeof(float);
  block_sum<<<num_blocks, block_size, smem_sz>>>(d_input, d_sums, n);

  // reduce partial sums to a total sum
  block_sum<<<1, num_blocks, num_blocks * sizeof(float)>>>(d_sums, d_sums + num_blocks, num_blocks);

  // copy result to host
  float result = 0;
  cudaMemcpy(&result, d_sums + num_blocks, ...);
  return result;
}

Caveat Reductor
- What happens if there are too many partial sums to fit into __shared__ memory in the second stage?
- What happens if the temporary storage is too big?
- Give each thread more work in the first stage
  - Sum is associative & commutative
  - Order doesn't matter to the result
  - We can schedule the sum any way we want → serial accumulation before block-wide reduction (sketched below)
- Exercise left to the hacker
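
A possible sketch of that first-stage change (an assumption, not the course's reference solution): each thread serially accumulates a grid-stride range of the input before the shared-memory tree reduction, so a fixed-size grid can handle any n.

// replacement for the load phase of block_sum: each thread accumulates
// many elements serially before the block-wide reduction
float x = 0;
for(size_t j = blockIdx.x * blockDim.x + threadIdx.x;
    j < n;
    j += (size_t)gridDim.x * blockDim.x)             // grid-stride loop
{
  x += input[j];
}
sdata[tx] = x;
__syncthreads();
// ... block-wide reduction over sdata as before ...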

Parallel Reduction Complexity
- log(N) parallel steps; each step S does N/2^S independent ops
  - Step complexity is O(log N)
- For N = 2^D, performs Σ_{S ∈ [1..D]} 2^(D−S) = N − 1 operations
  - Work complexity is O(N): it is work-efficient, i.e. does not perform more operations than a sequential algorithm
- With P threads physically in parallel (P processors), time complexity is O(N/P + log N)
  - Compare to O(N) for sequential reduction
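
For concreteness, writing the operation count out for N = 8 (so D = 3):
Σ_{S=1..3} 2^(3−S) = 4 + 2 + 1 = 7 = N − 1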

Split Operation
[Diagram: a flag array and its payload array, before and after the split.]
- Given: an array of true and false elements (and payloads)
- Return an array with all true elements at the beginning
- Examples: sorting, building trees

Variable Output Per Thread: Compact
- Remove null elements
- Example: collision detection

Variable Output Per Thread: General Case
- Reserve variable storage per thread
- Example: binning
[Diagram: elements A through H assigned to variable-sized bins.]

Split, Compact, Expand
- Each thread must answer a simple question: "Where do I write my output?"
- The answer depends on what other threads write!
- Scan provides an efficient parallel answer (see the compaction sketch below)
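
To make that concrete, here is a minimal serial sketch of compaction via scan (the data and names are made up for illustration, and on the GPU each loop becomes a data-parallel step): an exclusive scan of the flags gives every surviving element its output address.

#include <stdio.h>

int main(void)
{
  int input[8] = {7, 0, 4, 0, 0, 3, 0, 1};   // 0 marks a "null" element
  int flags[8], addr[8], output[8];

  // flag the elements we want to keep
  for(int i = 0; i < 8; ++i) flags[i] = (input[i] != 0);

  // exclusive scan of the flags: addr[i] = number of kept elements before i
  int running = 0;
  for(int i = 0; i < 8; ++i) { addr[i] = running; running += flags[i]; }

  // scatter each kept element to its address: an order-preserving compact
  for(int i = 0; i < 8; ++i)
    if(flags[i]) output[addr[i]] = input[i];

  for(int i = 0; i < running; ++i) printf("%d ", output[i]);  // prints: 7 4 3 1
  printf("\n");
  return 0;
}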

Scan (a.k.a. Parallel Prefix Sum)
Given an array A = [a_0, a_1, ..., a_(n-1)] and a binary associative operator ⊕ with identity I,
  scan(A) = [I, a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_(n-2))]
Prefix sum: if ⊕ is addition, then scan on the series {3, 1, 7, 0, 4, 1, 6, 3} returns the series {0, 3, 4, 11, 11, 15, 16, 22} (the Serial Scan example below).

Applications of Scan
Scan is a simple and useful parallel building block for many parallel algorithms:
- Radix sort
- Quicksort (segmented scan)
- String comparison
- Lexical analysis
- Stream compaction
- Run-length encoding
- Polynomial evaluation
- Solving recurrences
- Tree operations
- Histograms
- Allocation
- Etc.
Fascinating, since scan is unnecessary in sequential computing!

Serial Scan

int input[8] = {3, 1, 7, 0, 4, 1, 6, 3};
int result[8];

int running_sum = 0;
for(int i = 0; i < 8; ++i)
{
  result[i] = running_sum;
  running_sum += input[i];
}
// result = {0, 3, 4, 11, 11, 15, 16, 22}

A Scan Algorithm – Preview
- Assume the array is already in shared memory
- See Harris, M., S. Sengupta, and J.D. Owens, "Parallel Prefix Sum (Scan) in CUDA", GPU Gems 3

A Scan Algorithm – Preview
Iteration 0, n-1 threads
Iterate log(n) times; each thread adds the value stride elements away to its own value.
[In the diagram, each element shown corresponds to a single thread.]

A Scan Algorithm – Preview
Iteration 1, n-2 threads
Iterate log(n) times; each thread adds the value offset elements away to its own value.
[In the diagram, each element shown corresponds to a single thread.]

A Scan Algorithm – Preview
Iteration i, n-2^i threads
Iterate log(n) times; each thread adds the value offset elements away to its own value.
Note that this algorithm operates in place: no need for double buffering.
[In the diagram, each element shown corresponds to a single thread.]

A Scan Algorithm – Preview
We have an inclusive scan result

A Scan Algorithm – Preview
- For an exclusive scan, right-shift through __shared__ memory
- Note that the unused final element is also the sum of the entire array
  - Often called the "carry"
  - Scan & reduce in one pass?
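
A minimal sketch of that right-shift (an illustrative assumption about the code, inside a kernel where the inclusive result already sits in sdata[]):

// convert the block's inclusive scan in sdata[] into an exclusive scan
int carry = sdata[blockDim.x - 1];                          // block total ("carry")
int shifted = (threadIdx.x == 0) ? 0 : sdata[threadIdx.x - 1];
__syncthreads();                                            // finish all reads first
sdata[threadIdx.x] = shifted;                               // identity lands in slot 0
__syncthreads();
// carry can now be written out as the block's sum: scan & reduce in one pass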

CUDA Block-wise Inclusive Scan

__global__ void inclusive_scan(int *data)
{
  extern __shared__ int sdata[];
  unsigned int i = ...;

  // load input into __shared__ memory
  int sum = data[i];
  sdata[threadIdx.x] = sum;
  __syncthreads();
  ...

CUDA Block-wise Inclusive Scan

for(int o = 1; o < blockDim.x; o <<= 1)
{
  if(threadIdx.x >= o)
    sum += sdata[threadIdx.x - o];

  // wait on reads
  __syncthreads();

  // write my partial sum
  sdata[threadIdx.x] = sum;

  // wait on writes
  __syncthreads();
}

CUDA Block-wise Inclusive Scan

// we're done!
// each thread writes out its result (in place)
data[i] = sdata[threadIdx.x];
}
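
A possible launch of this kernel (the names num_blocks, block_size, and d_data are assumptions), sized so the dynamic shared memory holds one int per thread:

// each block scans block_size elements of d_data in place
inclusive_scan<<<num_blocks, block_size, block_size * sizeof(int)>>>(d_data);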

Results are Local to Each Block
[Diagram: Block 0's and Block 1's inputs and results; each block's scan covers only its own elements.]

Results are Local to Each Block
- Need to propagate results from each block to all subsequent blocks
- 2-phase scan
  1. Per-block scan & reduce
  2. Scan per-block sums
- Final update propagates phase 2 data and transforms to the exclusive scan result (outlined below)
- Details in MP3
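
A rough host-side outline of that two-phase structure (the kernel names scan_blocks and add_block_offsets are hypothetical; MP3 asks for the real implementation):

// phase 1: scan each block and record each block's total in d_block_sums
scan_blocks<<<num_blocks, block_size, smem>>>(d_data, d_block_sums);

// phase 2: scan the array of per-block sums (a single block, for simplicity)
scan_blocks<<<1, num_blocks, num_blocks * sizeof(int)>>>(d_block_sums, NULL);

// final update: add block b's scanned sum to every element owned by block b
add_block_offsets<<<num_blocks, block_size>>>(d_data, d_block_sums);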

Summing Up
- Patterns like reduce, split, compact, scan, and others let us reason about data-parallel problems abstractly
- Higher-level patterns are built from more fundamental patterns
- Scan in particular is fundamental to parallel processing, but unnecessary in a serial world
- Get others to implement these for you! (but not until after MP3)