Presentation transcript: "CS 179: Lecture 3"

1 CS 179: Lecture 3

2 A bold problem! Suppose we have a polynomial P(r) with coefficients c_0, …, c_{n-1}, given by:

P(r) = \sum_{k=0}^{n-1} c_k r^k

We want, for inputs r_0, …, r_{N-1}, the sum:

\sum_{i=0}^{N-1} P(r_i)
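For reference, this is just a doubly nested loop on the CPU. A minimal sequential sketch (function and variable names are illustrative, not from the slides):

    // Sequential reference: sum of P(r_i) over all N inputs.
    // Each polynomial is evaluated with Horner's rule.
    float polySumCPU(const float *coeffs, int n, const float *r, int N) {
        float sum = 0.0f;
        for (int i = 0; i < N; i++) {
            float p = 0.0f;
            for (int k = n - 1; k >= 0; k--)
                p = p * r[i] + coeffs[k];  // Horner step
            sum += p;
        }
        return sum;
    }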

3 A kernel!!
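The kernel code itself is not captured in this transcript. A minimal sketch of a first attempt, assuming one thread per input r_i and a single global accumulator *output (illustrative names); as written, the unsynchronized update of *output is a race condition, one plausible reason a "more correct" kernel is needed:

    __global__ void polySumNaive(const float *coeffs, int n,
                                 const float *r, int N, float *output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            float p = 0.0f;
            for (int k = n - 1; k >= 0; k--)
                p = p * r[idx] + coeffs[k];  // evaluate P(r_idx) by Horner's rule
            *output += p;  // BUG: unsynchronized read-modify-write across threads
        }
    }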

4 A more correct kernel
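Again the slide's code is missing from the transcript; given that the next slide complains about atomicAdd() serialization, the correction presumably makes the accumulation atomic. A hedged sketch with the same illustrative names:

    __global__ void polySumAtomic(const float *coeffs, int n,
                                  const float *r, int N, float *output) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            float p = 0.0f;
            for (int k = n - 1; k >= 0; k--)
                p = p * r[idx] + coeffs[k];
            atomicAdd(output, p);  // correct result, but every thread contends for one address
        }
    }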

5 Performance problem! Serialization of atomicAdd(): every thread updates the same address, so the hardware serializes the additions and the accumulation step becomes effectively sequential.

6 Parallel accumulation

7 Parallel accumulation

8 Accumulation methods – A visual
[Diagram comparing the two accumulation methods, purely atomic and “linear accumulation”, both ultimately adding into *output.]

9 GPU: Internals
[Diagram: the GPU contains multiprocessors, where blocks go; each multiprocessor contains SIMD processing units, where warps go.]

10 Warp: 32 threads that execute simultaneously!
For the A[] + B[] -> C[] problem, this didn’t matter! Warp of threads 0-31: A[idx] + B[idx] -> C[idx]. Warp of threads 32-63: the same, on its own indices. Warp of threads 64-95: likewise.
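For concreteness, a minimal sketch of that element-wise kernel (illustrative names): every thread in a warp executes exactly the same instruction stream, just on different array elements.

    __global__ void vecAdd(const float *A, const float *B, float *C, int N) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            C[idx] = A[idx] + B[idx];  // identical instructions across the warp
    }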

11 Divergence – Example
Branches result in different instructions!

    //Suppose we have a pointer to some floating-point value *val...
    if (threadIdx.x % 2 == 0)
        *val += 2;  //Branch A
    else
        *val /= 3;  //Branch B

12 Divergence
What happens:
Executes normally until the if-statement
Branches to calculate Branch A (the even-indexed threads), while the others wait
Goes back (!) and branches to calculate Branch B (the odd-indexed threads), while the first group waits

13 Calculating polynomial values – does it matter?
Warp of threads 0-31: (calculate some values)
Warp of threads 32-63: (same)
Warp of threads 64-95: (same)
Same instructions in every warp! Doesn’t matter!

14 Linear reduction - does it matter?
(after calculating values…)
Warp of threads 0-31: Thread 0: Accumulate sum; Threads 1-31: Do nothing
Warps of threads 32-63, 64-95, …: Do nothing
Doesn’t really matter… in this case: the accumulation is serial anyway, so divergence isn’t what’s costing us.
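A hedged sketch of this linear accumulation, assuming each thread has already written its value into a shared-memory array partial_outputs[] (the same setup the later pseudocode uses):

    __syncthreads();  // all per-thread values must be in shared memory first
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; i++)  // one thread walks the whole block
            sum += partial_outputs[i];
        atomicAdd(output, sum);  // one atomic per block instead of one per thread
    }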

15 Improving our reduction
More threads participating in the process! “Binary tree”

16 Improving our reduction

17 Improving our reduction
//Let our shared memory block be partial_outputs[]...
synchronize threads before starting
set offset to 1
while ((offset * 2) <= block dimension):
    if (thread index % (offset * 2) is 0) AND you won’t exceed the block dimension:
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    double the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
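A sketch of how this pseudocode might translate to CUDA, assuming partial_outputs[] is the block's shared array and output is the global accumulator:

    __syncthreads();  // all partial values must be written before we combine
    for (unsigned int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
        if (threadIdx.x % (offset * 2) == 0 && threadIdx.x + offset < blockDim.x)
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();  // loop bounds are uniform, so this is safe inside the loop
    }
    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);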

18 Improving our reduction
What will happen? In the first iteration, each warp does meaningful work on only ~16 of its 32 threads, and later iterations are worse: you keep far more warps busy than you have to! A “divergent tree”. How to improve?

19 “Non-divergent tree”

20 “Non-divergent tree” //Let our shared memory block be partial_outputs[]...
set offset to the highest power of 2 that’s less than the block dimension
//For the first iteration, check that you don’t access
//out-of-range memory
while (offset >= 1):
    if (thread index < offset):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    halve the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
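A corresponding CUDA sketch (same assumptions as before; the starting-offset loop is one simple way to get the highest power of 2 below the block dimension):

    // Highest power of 2 strictly less than blockDim.x.
    unsigned int offset = 1;
    while (offset * 2 < blockDim.x)
        offset *= 2;

    __syncthreads();
    for (; offset >= 1; offset /= 2) {
        // The bounds check only matters in the first iteration,
        // when blockDim.x isn't a power of 2.
        if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x)
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);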

21 “Non-divergent tree” Suppose we have our block of 512 threads…
First iteration (offset = 256):
Warps of threads 0-255: Accumulate result
Warps of threads 256-511: Do nothing
Every warp is either fully active or fully idle: no divergence within any warp!

22 “Non-divergent tree” Suppose we’re now down to 32 threads…
Warp of threads 0-31: Threads 0-15: Accumulate result; Threads 16-31: Do nothing
Much less divergence overall! Divergence only occurs once the active threads all fit inside a single warp, if ever!

23 Reduction – Four Approaches
[Diagram comparing the four approaches, atomic only, linear, divergent tree, and non-divergent tree, each accumulating into *output.]

24 Notes on files? (An aside)
Labs and CUDA programming typically involve the following files:
____.cc: Allows C++ code; g++ compiles this
____.cu: Allows CUDA syntax; nvcc compiles this
____.cuh: CUDA header file (declare accessible functions)
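A hypothetical example of how the pieces fit together (all file and function names here are made up for illustration): the .cuh header declares a plain C++ wrapper, the .cu file defines the kernel and the wrapper under nvcc, and the .cc file calls the wrapper under g++.

    // vec_add.cuh: declaration visible to both compilers
    void cudaVecAdd(const float *d_A, const float *d_B, float *d_C, int N);

    // vec_add.cu: CUDA syntax lives here; nvcc compiles this
    __global__ void vecAddKernel(const float *A, const float *B, float *C, int N) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            C[idx] = A[idx] + B[idx];
    }
    void cudaVecAdd(const float *d_A, const float *d_B, float *d_C, int N) {
        vecAddKernel<<<(N + 511) / 512, 512>>>(d_A, d_B, d_C, N);
    }

    // main.cc: plain C++; g++ compiles this and links against vec_add
    #include "vec_add.cuh"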

25 Big Picture (so far)
CUDA allows large speedups!
Parallelism with CUDA is different from CPU parallelism:
Threads are “small”
There is memory to think about
Increase speedup by thinking about your problem’s constraints!
Reduction (and more!)

26 Big Picture (so far) The steps in these two problems are widely applicable! Dot product, norm calculation (and more); see the sketch below.
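To illustrate, a hedged sketch of a dot product built from exactly the same pieces: per-thread products into shared memory, a non-divergent tree reduction per block, then one atomicAdd() per block (names and the block size are illustrative; the block size is a power of 2 so the tree needs no bounds check):

    #define BLOCK_SIZE 512  // power of 2 by construction

    __global__ void dotProduct(const float *A, const float *B, float *output, int N) {
        __shared__ float partial_outputs[BLOCK_SIZE];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // Step 1: per-thread work (the analogue of evaluating the polynomial).
        partial_outputs[threadIdx.x] = (idx < N) ? A[idx] * B[idx] : 0.0f;
        __syncthreads();

        // Step 2: non-divergent tree reduction within the block.
        for (unsigned int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
            if (threadIdx.x < offset)
                partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
            __syncthreads();
        }

        // Step 3: one atomic per block.
        if (threadIdx.x == 0)
            atomicAdd(output, partial_outputs[0]);
    }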

27 Other notes
Office hours: Monday 8-10 PM, Tuesday 8-10 PM
Lab 1 due Wednesday, 5 PM

