CUDA Threads and Atomics
Prepared 8/8/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Threads and Atomics – Slide 2
The Problem: how do you do global communication?
Finish a kernel and start a new one: all writes from all threads complete before a kernel finishes.
Would need to decompose kernels into before and after parts:

  step1<<<...>>>(…);
  // The system ensures that all
  // writes from step 1 complete.
  step2<<<...>>>(…);

CUDA Threads and Atomics – Slide 3
Alternately, write to a predefined memory location. Race condition! Updates can be lost.
What is the value of a in thread 0? In thread 1917?

  // vector[0] was equal to zero
  threadID: 0            threadID: 1917
  vector[0] += 5;        vector[0] += 1;
  …                      …
  a = vector[0];         a = vector[0];

CUDA Threads and Atomics – Slide 4
Thread 0 could have finished execution before 1917 started, or the other way around, or both could be executing at the same time.
Answer: not defined by the programming model; can be arbitrary.

  // vector[0] was equal to zero
  threadID: 0            threadID: 1917
  vector[0] += 5;        vector[0] += 1;
  …                      …
  a = vector[0];         a = vector[0];

CUDA Threads and Atomics – Slide 5
CUDA provides atomic operations to deal with this problem.
An atomic operation guarantees that only a single thread has access to a piece of memory while the operation completes.
The name atomic comes from the fact that it is uninterruptible.
No dropped data, but ordering is still arbitrary.

CUDA Threads and Atomics – Slide 6
CUDA provides atomic operations to deal with this problem. Requires hardware with compute capability 1.1 and above.
Different types of atomic instructions:
Addition/subtraction: atomicAdd, atomicSub
Minimum/maximum: atomicMin, atomicMax
Conditional increment/decrement: atomicInc, atomicDec
Exchange/compare-and-swap: atomicExch, atomicCAS
More types on Fermi: atomicAnd, atomicOr, atomicXor
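For concreteness, here is a minimal sketch (not from the original slides) that exercises a few of these intrinsics; the kernel and array names (update_stats, stats) are illustrative, and stats is assumed to be initialised on the host (stats[0] to 0, stats[1] to INT_MAX, stats[2] to INT_MIN).

  __global__ void update_stats(const int* values, int* stats, int n)
  {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
      int v = values[i];
      atomicAdd(&stats[0], v);   // running sum of all elements
      atomicMin(&stats[1], v);   // global minimum
      atomicMax(&stats[2], v);   // global maximum
    }
  }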

CUDA Threads and Atomics – Slide 7

  // Determine frequency of colors in a picture.
  // Colors have already been converted into integers.
  // Each thread looks at one pixel and increments
  // a counter atomically.
  __global__ void histogram(int* colors, int* buckets)
  {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int c = colors[i];
    atomicAdd(&buckets[c], 1);
  }

atomicAdd returns the previous value at a certain address. Useful for grabbing variable amounts of data from a list.

CUDA Threads and Atomics – Slide 8

  // For algorithms where the amount of work per item
  // is highly non-uniform, it often makes sense for
  // threads to continuously grab work from a queue.
  __global__ void workq(int* work_q, unsigned int* q_counter,
                        int* output, unsigned int queue_max)
  {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int q_index = atomicInc(q_counter, queue_max);
    int result = do_work(work_q[q_index]);
    output[i] = result;
  }

CUDA Threads and Atomics – Slide 9

  int atomicCAS(int* address, int compare, int val)

If compare equals the old value stored at address, then val is stored instead.
In either case, the routine returns the value of old.
Seems a bizarre routine at first sight, but it can be very useful for atomic locks.

CUDA Threads and Atomics – Slide 10
Most general type of atomic: can emulate all the others with CAS.

  // Pseudo-code for what atomicCAS does (atomically):
  int atomicCAS(int* address, int compare, int val)
  {
    int old_reg_val = *address;
    if (old_reg_val == compare)
      *address = val;
    return old_reg_val;
  }
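As an illustration of that generality, the following sketch (not part of the original deck) emulates atomicAdd for ints with a CAS retry loop; my_atomic_add is an invented name.

  __device__ int my_atomic_add(int* address, int incr)
  {
    int old = *address, assumed;
    do {
      assumed = old;
      // Only succeeds if *address still holds the value we read;
      // otherwise another thread got in first and we retry.
      old = atomicCAS(address, assumed, assumed + incr);
    } while (assumed != old);
    return old;   // previous value, matching atomicAdd's convention
  }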

CUDA Threads and Atomics – Slide 11
Atomics are slower than normal load/store.
Most of these are associative operations on signed/unsigned integers: quite fast for data in shared memory, slower for data in device memory.
You can have the whole machine queuing on a single location in memory.
Atomics unavailable on G80!

CUDA Threads and Atomics – Slide 12

  // If you require the maximum across all threads
  // in a grid, you could do it with a single global
  // maximum value, but it will be VERY slow.
  __global__ void global_max(int* values, int* gl_max)
  {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int val = values[i];
    atomicMax(gl_max, val);
  }

CUDA Threads and Atomics – Slide 13

  // Introduce intermediate maximum results, so that
  // most threads do not try to update the global max.
  __global__ void global_max(int* values, int* gl_max,
                             int* regional_maxes, int num_regions)
  {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int val = values[i];
    int region = i % num_regions;
    if (atomicMax(&regional_maxes[region], val) < val) {
      atomicMax(gl_max, val);
    }
  }

CUDA Threads and Atomics – Slide 14
A single value causes a serial bottleneck.
Create a hierarchy of values for more parallelism.
Performance will still be slow, so use judiciously.
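One common way to build such a hierarchy (a sketch under my own assumptions, not code from the slides) is to reduce within each block in shared memory and then issue a single atomic per block. It assumes blockDim.x is a power of two and that the kernel is launched with blockDim.x * sizeof(int) bytes of dynamic shared memory.

  #include <limits.h>   // for INT_MIN

  __global__ void block_then_global_max(const int* values, int* gl_max, int n)
  {
    extern __shared__ int smax[];
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    smax[threadIdx.x] = (i < n) ? values[i] : INT_MIN;
    __syncthreads();

    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
      if (threadIdx.x < s)
        smax[threadIdx.x] = max(smax[threadIdx.x], smax[threadIdx.x + s]);
      __syncthreads();
    }

    // Only one thread per block touches the global value.
    if (threadIdx.x == 0)
      atomicMax(gl_max, smax[0]);
  }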

CUDA Threads and Atomics – Slide 15
Can’t use normal load/store for inter-thread communication because of race conditions.
Use atomic instructions for sparse and/or unpredictable global communication.
Decompose data (very limited use of a single global sum/max/min/etc.) for more parallelism.

CUDA Threads and Atomics – Slide 16
How a streaming multiprocessor (SM) executes threads:
Overview of how a streaming multiprocessor works
SIMT execution
Divergence

CUDA Threads and Atomics – Slide 17
Hardware schedules thread blocks onto available SMs.
No guarantee of ordering among thread blocks.
Hardware will schedule a thread block as soon as a previous thread block finishes.
[Figure: thread blocks 5, 27, 61 and 2001 being dispatched to a streaming multiprocessor]

CUDA Threads and Atomics – Slide 18
A warp = 32 threads launched together. Usually execute together as well.
[Figure: control units paired with ALUs, illustrating how a warp executes]

CUDA Threads and Atomics – Slide 19
Each thread block is mapped to one or more warps.
The hardware schedules each warp independently.
[Figure: Thread Block N (128 threads) split into warps TB N W1 through TB N W4]

CUDA Threads and Atomics – Slide 20
The SM implements zero-overhead warp scheduling.
At any time, only one of the warps is executed by an SM.
Warps whose next instruction has its inputs ready for consumption are eligible for execution.
Eligible warps are selected for execution based on a prioritized scheduling policy.
All threads in a warp execute the same instruction when selected.

CUDA Threads and Atomics – Slide 21
Threads are executed in warps of 32, with all threads in the warp executing the same instruction at the same time.
What happens if you have the following code?

  if (foo(threadIdx.x)) {
    do_A();
  } else {
    do_B();
  }

CUDA Threads and Atomics – Slide 22
This is called warp divergence. CUDA will generate correct code to handle this, but to understand the performance you need to understand what CUDA does with it.

  if (foo(threadIdx.x)) {
    do_A();
  } else {
    do_B();
  }

CUDA Threads and Atomics – Slide 23
[Figure: warp execution timeline from Fung et al., MICRO ’07, showing the warp serializing Path A and Path B after each branch]

CUDA Threads and Atomics – Slide 24
Nested branches are handled as well.

  if (foo(threadIdx.x)) {
    if (bar(threadIdx.x))
      do_A();
    else
      do_B();
  } else
    do_C();

CUDA Threads and Atomics – Slide 25
[Figure: execution timeline for the nested branch, serializing Path A, Path B and Path C]

CUDA Threads and Atomics – Slide 26
You don’t have to worry about divergence for correctness. (Mostly true, except for corner cases such as intra-warp locks.)
You might have to think about it for performance; it depends on your branch conditions.

CUDA Threads and Atomics – Slide 27
One solution: NVIDIA GPUs have predicated instructions which are carried out only if a logical flag is true.
In the previous example, all threads compute the logical predicate and two predicated instructions:

  p = (foo(threadIdx.x));
  p:  do_A();
  !p: do_B();

CUDA Threads and Atomics – Slide 28
Performance drops off with the degree of divergence.

  switch (threadIdx.x % N) {
    case 0: ...
    case 1: ...
  }

CUDA Threads and Atomics – Slide 29
Performance drops off with the degree of divergence. In the worst case, you effectively lose a factor of 32 in performance if one thread needs an expensive branch while the rest do nothing.
Another example: processing a long list of elements where, depending on run-time values, a few require very expensive processing.
GPU implementation: first process the list to build two sub-lists of “simple” and “expensive” elements, then process the two sub-lists separately.
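A sketch of the splitting pass (my own illustration, not from the slides): each thread classifies its element and claims a slot in the appropriate sub-list with atomicAdd. is_expensive() stands in for whatever application-specific device predicate applies, and the two counters are assumed to be zeroed beforehand.

  __global__ void split(const int* items, int n,
                        int* simple, int* n_simple,
                        int* expensive, int* n_expensive)
  {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
      int item = items[i];
      if (is_expensive(item)) {               // hypothetical device predicate
        int pos = atomicAdd(n_expensive, 1);  // claim a slot in the expensive list
        expensive[pos] = item;
      } else {
        int pos = atomicAdd(n_simple, 1);     // claim a slot in the simple list
        simple[pos] = item;
      }
    }
  }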

CUDA Threads and Atomics – Slide 30
We have already introduced __syncthreads(), which forms a barrier: all threads wait until every one has reached this point.
When writing conditional code, you must be careful to make sure that all threads do reach the __syncthreads(). Otherwise, you can end up in deadlock.
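A minimal sketch of the pitfall (illustrative kernel-body fragments, not from the slides): if the barrier sits inside a divergent branch, the threads that skip the branch never arrive and the block hangs.

  // Inside a kernel body:
  if (threadIdx.x < 16) {
    // ... work done by only some threads ...
    __syncthreads();   // WRONG: the other threads never reach this barrier
  }

  // Safer pattern: hoist the barrier out of the conditional so that
  // every thread in the block executes it.
  if (threadIdx.x < 16) {
    // ... work done by only some threads ...
  }
  __syncthreads();     // reached by all threads in the block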

CUDA Threads and Atomics – Slide 31
Fermi supports some new synchronisation instructions which are similar to __syncthreads() but have extra capabilities:
int __syncthreads_count(predicate) counts how many predicates are true
int __syncthreads_and(predicate) returns non-zero (true) if all predicates are true
int __syncthreads_or(predicate) returns non-zero (true) if any predicate is true
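A small usage sketch (my own example, for Fermi or later, assuming the grid exactly covers the input): each block counts how many of its threads see a value above a threshold. The names count_hits and block_counts are invented.

  __global__ void count_hits(const int* values, int* block_counts, int threshold)
  {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    // Barrier plus reduction in one call: every thread gets the same count.
    int hits = __syncthreads_count(values[i] > threshold);
    if (threadIdx.x == 0)
      block_counts[blockIdx.x] = hits;
  }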

CUDA Threads and Atomics – Slide 32
There are similar warp voting instructions which operate at the level of a warp:
int __all(predicate) returns non-zero (true) if all predicates in the warp are true
int __any(predicate) returns non-zero (true) if any predicate in the warp is true
unsigned int __ballot(predicate) sets the nth bit based on the nth thread’s predicate
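For example (a sketch using the Fermi-era forms of the intrinsics listed above; array names are illustrative and blockDim.x is assumed to be a multiple of 32): lane 0 of each warp records a 32-bit mask of which lanes saw a negative value.

  __global__ void scan_for_negatives(const int* values, unsigned int* warp_masks, int n)
  {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int is_neg = (i < n) && (values[i] < 0);

    unsigned int mask = __ballot(is_neg);   // bit n set if lane n's predicate is true
    if ((threadIdx.x % 32) == 0)
      warp_masks[i / 32] = mask;            // one vote mask per warp
  }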

CUDA Threads and Atomics – Slide 33
Use atomic locks very judiciously.
Always include a max_iter in your spinloop!
Decompose your data and your locks.
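A bounded-spin sketch along those lines (illustrative, not from the slides): the try_lock helper gives up after max_iter attempts instead of spinning forever, and the caller must handle failure. The earlier caveat about intra-warp locks still applies.

  __device__ int lock = 0;   // 0 unlocked, 1 locked

  __device__ bool try_lock(int max_iter)
  {
    for (int it = 0; it < max_iter; ++it) {
      if (atomicCAS(&lock, 0, 1) == 0)
        return true;         // acquired the lock
    }
    return false;            // gave up; caller must cope with not holding the lock
  }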

CUDA Threads and Atomics – Slide 34
Problem: when a thread writes data to device memory, the order of completion is not guaranteed, so global writes may not have completed by the time the lock is unlocked.

  // global variable: 0 unlocked, 1 locked
  __device__ int lock = 0;

  __global__ void kernel(...)
  {
    ...
    // set lock
    do {} while (atomicCAS(&lock, 0, 1));
    ...
    // free lock
    lock = 0;
  }

CUDA Threads and Atomics – Slide 35

  // global variable: 0 unlocked, 1 locked
  __device__ int lock = 0;

  __global__ void kernel(...)
  {
    ...
    // set lock
    do {} while (atomicCAS(&lock, 0, 1));
    ...
    // free lock
    __threadfence();  // wait for writes to finish
    lock = 0;
  }

CUDA Threads and Atomics – Slide 36
__threadfence_block(): wait until all global and shared memory writes are visible to all threads in the block.
__threadfence(): wait until all global and shared memory writes are visible to all threads in the block (or, for global data, to all threads in the device).
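A classic use of __threadfence() (a sketch with invented names, not from the slides): publish a result and then raise a flag, with the fence guaranteeing that any thread which sees the flag set will also see the result.

  __device__ int result;
  __device__ volatile int ready = 0;

  __global__ void producer(int value)
  {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
      result = value;
      __threadfence();   // make the write to result visible device-wide first
      ready = 1;         // threads polling ready can now safely read result
    }
  }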

CUDA Threads and Atomics – Slide 37
Lots of esoteric capabilities: don’t worry about most of them.
It is essential to understand warp divergence, which can have a very big impact on performance.
__syncthreads() is vital.
The rest can be ignored until you have a critical need; then read the documentation carefully and look for examples in the SDK.

CUDA Threads and Atomics – Slide 38
Based on original material from:
Oxford University: Mike Giles
Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 8/8/2011.