GPU Programming David Monismith Based on Notes from the Udacity Parallel Programming (cs344) Course

Recap
A grid of blocks is 1, 2, or 3D.
A block of threads is 1, 2, or 3D.
arrayMult<<<dim3(nBlocks,1,1), dim3(nThreads,1,1)>>>(...) == arrayMult<<<nBlocks, nThreads>>>(...)
arrayMult<<<grid, block, shmem>>>(...), where shmem is the shared memory per block in bytes.
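A minimal sketch of such a kernel and launch (the kernel body, array names, and launch values are illustrative, not from the original slide):

    __global__ void arrayMult(float * c, const float * a, const float * b)
    {
        extern __shared__ float scratch[];            // sized by the third launch argument (shmem)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        scratch[threadIdx.x] = a[i] * b[i];           // stage the product in per-block shared memory
        c[i] = scratch[threadIdx.x];                  // write the result to global memory
    }

    // On the host, with device pointers d_a, d_b, d_c already allocated:
    // 4 blocks of 256 threads, 256 * sizeof(float) bytes of shared memory per block
    arrayMult<<<4, 256, 256 * sizeof(float)>>>(d_c, d_a, d_b);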

CUDA Communication
Discussed threads solving a problem by working together.
CUDA communication takes place through memory:
– Read from the same input location
– Write to the same output location
– Exchange partial results

Communication Patterns
How to map tasks (threads) and their memory together.
We have seen map:
– Perform the same task on each piece of data, e.g., on each element of an array
– In CUDA, have one thread do each task (see the sketch below)
Many operations cannot be accomplished with map.
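A minimal sketch of the map pattern in CUDA (the kernel name and the operation are illustrative, not from the original slides):

    __global__ void mapSquare(float * out, const float * in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * in[i];   // every thread applies the same operation to its own element
    }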

More Patterns
Gather – read from many array locations and write a result into one location.
Scatter – write/update many results from information in one array location.
– Note that an array location may be updated (possibly by more than one thread).
Stencil – compute a result using a fixed neighborhood in an array.
– Examples include von Neumann and Moore neighborhoods.
Examples of each of these will be drawn on the board.

An OpenCV Pixel

    struct uchar4 {
        unsigned char x;   // Red
        unsigned char y;   // Green
        unsigned char z;   // Blue
        unsigned char w;   // Alpha
    };

To convert to grayscale (Red is x, Green is y, Blue is z):

    I = .299f*Red + .587f*Green + .114f*Blue

Converting To Grayscale
A grayscale conversion could be viewed as either a stencil or gather operation.

    grayscaleImg[i] = img[i].x + img[i].y + img[i].z;

How could this be implemented in CUDA?
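One possible implementation is sketched below, using the weighted formula from the previous slide (the kernel name and index computation are illustrative, not from the original slides):

    __global__ void rgbaToGrayscale(unsigned char * grayscaleImg, const uchar4 * img, int numPixels)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numPixels)   // one thread gathers the three channels of one pixel
            grayscaleImg[i] = (unsigned char)(.299f * img[i].x + .587f * img[i].y + .114f * img[i].z);
    }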

Transpose
Often, it is necessary to perform a transpose on data such as an image or a matrix.
In CUDA, this information is often stored in a 1D array.
A transpose is also useful when transforming an array of structures into a structure of arrays.
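A hedged sketch of a transpose kernel on a matrix stored in a 1D array (names and the indexing scheme are illustrative, not from the original slides):

    __global__ void transpose(float * out, const float * in, int rows, int cols)
    {
        int r = blockIdx.y * blockDim.y + threadIdx.y;   // row in the input
        int c = blockIdx.x * blockDim.x + threadIdx.x;   // column in the input
        if (r < rows && c < cols)
            out[c * rows + r] = in[r * cols + c];        // element (r, c) becomes element (c, r)
    }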

Recap of Communication
Map and transpose – one to one
Gather – many to one
Scatter – one to many
Stencil – specialized many to one
Reduce – all to one
Scan and sort – all to all

Programming Model and GPU HW
Divide computations into kernels:
– C/C++ functions
– Each thread runs the kernel function
– Different threads may take different paths
Groups of threads are called thread blocks:
– These threads work together to solve a particular problem or subproblem
– The GPU is responsible for assigning/allocating blocks to Streaming Multiprocessors

GPU Hardware
GPUs may contain one or more Streaming Multiprocessors (SMs).
SMs each have their own memory and their own simple processors (i.e., CUDA cores).
– The CUDA cores in an SM may map to one or more thread blocks.
The GPU is responsible for allocating blocks to SMs.
SMs run in parallel and independently.

CUDA Guarantees and Advantages
Advantages of the CUDA paradigm:
– Flexibility
– Scalability
– Efficiency
Disadvantages:
– No communication between blocks.
– No guarantees about where thread blocks will run.
CUDA guarantees:
– All threads in a block run on the same SM at the same time.
– All blocks in one kernel must finish before any block from the next kernel starts.

CUDA Compilation
For Stampede:
– module load cuda
– nvcc -arch=compute_30 -code=sm_30 -o
– #SBATCH -p gpudev
For LittleFe2:
– nvcc -arch=compute_21 -code=sm_21 -o
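As a hypothetical complete command line (the file names below are placeholders, not from the original slide):

    nvcc -arch=compute_30 -code=sm_30 -o myprogram myprogram.cu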

CUDA Memory Access
Consequences of the CUDA paradigm:
– Cannot have communication between blocks
– Threads and blocks must run to completion
Threads have both local and shared memory:
– Shared memory is only shared between threads in the same block
– All threads have access to global memory

GPU Layout
[Board diagram: each thread has its own local memory; threads in a thread block share per-block shared memory; all threads can access the GPU's global memory, which connects to host memory.]

Thread Synchronization
Barrier – a point in a program where all threads or processes stop and wait.
When all threads or processes reach the barrier, they may all continue.
__syncthreads() creates a barrier within a thread block.

Barriers
Need for barriers:

    int idx = threadIdx.x;
    __shared__ int arr[128];
    arr[idx] = threadIdx.x;
    if(idx > 0 && idx <= 127)
        arr[idx] = arr[idx-1];   // race: arr[idx-1] may not have been written yet

Barriers Continued
Should be rewritten as:

    int idx = threadIdx.x;
    __shared__ int arr[128];
    arr[idx] = threadIdx.x;
    __syncthreads();               // make sure every element is written before any thread reads
    int temp = (idx > 0 && idx <= 127) ? arr[idx-1] : arr[idx];
    __syncthreads();               // make sure every read completes before any element is overwritten
    arr[idx] = temp;
    __syncthreads();

Note: __syncthreads() must be reached by every thread in the block, so the barriers are placed outside the divergent branch.

Writing CUDA Programs
CUDA is a hierarchy of computation, synchronization, and memory.
To write efficient programs, use several high-level strategies.
Maximize your program's math intensity:
– Perform lots of math per unit of memory
– Maximize compute operations per thread
– Minimize time spent on memory per thread
Move frequently accessed data to fast memory:
– Memory speed: local > shared >> global
– local – registers/L1 cache
– shared – per-block memory

Local Memory Example

    __global__ void locMemGPU(double in)
    {
        double f;   // local memory
        f = in;     // local memory
    }

    int main(int argc, char ** argv)
    {
        locMemGPU<<<1, 128>>>(4.5);   // launch configuration is illustrative; the original was lost in the transcript
        cudaDeviceSynchronize();      // wait for the kernel to finish
    }

Global Memory Example

    __global__ void globalMemGPU(float * myArr)
    {
        myArr[threadIdx.x] = myArr[threadIdx.x];   // the array is in global memory
    }

    int main(int argc, char ** argv)
    {
        float * myHostArr = (float *) malloc(sizeof(float)*512);
        float * devArr;
        cudaMalloc((void **) &devArr, sizeof(float)*512);

        for(int i = 0; i < 512; i++)
            myHostArr[i] = i;

        cudaMemcpy((void *) devArr, (void *) myHostArr, sizeof(float)*512, cudaMemcpyHostToDevice);
        globalMemGPU<<<1, 512>>>(devArr);   // launch configuration is illustrative; the original was lost in the transcript
        cudaMemcpy((void *) myHostArr, (void *) devArr, sizeof(float)*512, cudaMemcpyDeviceToHost);
    }
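The slides show local and global memory examples; as a companion sketch only (not from the original slides), a shared memory example in the same style might look as follows, assuming a launch with 512 threads per block:

    __global__ void sharedMemGPU(float * myArr)
    {
        __shared__ float block[512];         // per-block shared memory
        int idx = threadIdx.x;
        block[idx] = myArr[idx];             // stage data from global memory into fast shared memory
        __syncthreads();                     // wait until every thread has finished its load
        myArr[idx] = block[idx] + block[(idx + 1) % 512];   // read a neighbor's value, write back to global memory
    }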

CUDA Memory Access
CUDA works well when threads have contiguous memory accesses.
The GPU is most efficient when threads read or write to the same area of memory at the same time.
When a thread accesses global memory, the hardware fetches a whole chunk of memory, not just the single data item.
Therefore, remember the following about memory access:
– Contiguous is good,
– Strided access is not so good, and
– Random access is bad.
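A hedged sketch contrasting contiguous and strided access (names and the stride parameter are illustrative):

    __global__ void contiguous(float * a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] = a[i] + 1.0f;                     // neighboring threads touch neighboring elements: coalesced
    }

    __global__ void strided(float * a, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i * stride] = a[i * stride] + 1.0f;   // neighboring threads are far apart in memory: slower
    }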

Memory Conflicts
Example: assume many threads are accessing/modifying only 10 array elements.
This problem can be solved with atomics:
– atomicAdd()
– atomicMin()
– atomicXor()
– atomicCAS() – compare and swap
Atomics are only provided for certain operations and data types.
There is no atomic mod or exponentiation.
Most operations are only available for integer types.
Any atomic operation can be implemented with CAS, though it is quite complicated (see the sketch below).
– Still no ordering constraints.
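For instance, a double-precision add can be built from atomicCAS() using the well-known compare-and-swap loop; the function name below is illustrative:

    __device__ double atomicAddDouble(double * address, double val)
    {
        unsigned long long int * address_as_ull = (unsigned long long int *) address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            // swap in the new value only if nobody else changed the location in the meantime
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);
        return __longlong_as_double(old);
    }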

Even More Memory Issues
Floating point arithmetic is non-associative: (a + b) + c != a + (b + c) in general.
Synchronizing such operations serializes memory access.
– This makes atomic ops very slow.
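A small host-side illustration of the non-associativity claim (the values are chosen only to make the rounding visible):

    #include <cstdio>

    int main()
    {
        float a = 1.0e20f, b = -1.0e20f, c = 1.0f;
        printf("%f\n", (a + b) + c);   // prints 1.000000
        printf("%f\n", a + (b + c));   // prints 0.000000, because b + c rounds back to -1.0e20f
        return 0;
    }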

Example
Try the following in CUDA:
– 10^6 threads incrementing 10^6 elements ( ms)
– 10^6 threads atomically incrementing 10^6 elements (0.1727 ms)
– 10^6 threads incrementing 100 elements ( ms)
– 10^6 threads atomically incrementing 100 elements (0.372 ms)
– 10^7 threads atomically incrementing 100 elements ( ms)
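A hedged sketch of the kind of kernel such an experiment might use (names, sizes, and launch configuration are illustrative, not the original benchmark code):

    __global__ void incrementAtomic(int * g, int numElements)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        atomicAdd(&g[i % numElements], 1);   // many threads safely update a few shared counters
    }

    // e.g., roughly 10^6 threads updating 100 elements:
    // incrementAtomic<<<1000000 / 1024 + 1, 1024>>>(d_counts, 100);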

Improving CUDA Program Performance
Avoid thread divergence. This means avoiding if statements whenever possible.
Divergence means threads that do different things.
Divergence can happen in loops, too, especially where loops may run for different numbers of iterations.
All other threads have to wait until the divergent threads finish.
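A minimal sketch of divergence (names are illustrative): threads in the same warp take different branches, so the two paths execute one after the other.

    __global__ void divergent(float * data)
    {
        int i = threadIdx.x;
        if (i % 2 == 0)
            data[i] = data[i] * 2.0f;   // even-numbered threads take this path
        else
            data[i] = data[i] + 1.0f;   // odd-numbered threads take this path
    }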