
1 Prepared 6/22/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

2 Things we need to consider: control, synchronization and communication. Parallel programming languages offer different ways of dealing with the above. CUDA Programming Basics – Slide 2

3 CUDA programming model: basic concepts and data types. CUDA application programming interface (API): the basics. Simple examples to illustrate basic concepts and functionalities. Performance features will be covered later. CUDA Programming Basics – Slide 3

4 Basic kernels and execution on GPU Basic memory management Coordinating CPU and GPU execution See the programming guide for the full API CUDA Programming Basics – Slide 4

5 Integrated host + device application program in C: serial or modestly parallel parts in host C code; highly parallel parts in device SPMD kernel C code. Programming model: parallel code (a kernel) is launched and executed on a device by many threads; launches are hierarchical, with threads grouped into blocks and blocks grouped into grids; familiar serial code is written for a thread; each thread is free to execute a unique code path; built-in thread and block ID variables are provided. CUDA Programming Basics – Slide 5

6 CUDA Programming Basics – Slide 6 Serial Code (host) ... Parallel Kernel (device): KernelA<<<nBlk, nTid>>>(args); Serial Code (host) ... Parallel Kernel (device): KernelB<<<nBlk, nTid>>>(args);

7 A compute device is a coprocessor to the CPU or host; has its own DRAM (device memory); runs many threads in parallel; and is typically a GPU but can also be another type of parallel processing device. Data-parallel portions of an application are expressed as device kernels which run on many threads. CUDA Programming Basics – Slide 7

8 Differences between GPU and CPU threads GPU threads are extremely lightweight Very little creation overhead GPU needs 1000s of threads for full efficiency Multi-core CPU needs only a few CUDA Programming Basics – Slide 8

9 The future of GPUs is programmable processing. So: build the architecture around the processor. CUDA Programming Basics – Slide 9 [Figure: G80 graphics-mode block diagram with host, input assembler, vertex/geometry/pixel thread issue, thread processors (SP) with L1/TF, and L2/FB memory partitions]

10 Processors execute computing threads New operating mode/hardware interface for computing CUDA Programming Basics – Slide 10

11 CUDA Programming Basics – Slide 11 [Figure: GPU computing-mode block diagram with multiprocessors and their shared memory (SMEM), global memory, and the connection to the CPU and chipset over PCIe]

12 CUDA Programming Basics – Slide 12 [Figure: a thread on a streaming processor has registers and per-thread memory; a thread block on a streaming multiprocessor has per-block shared memory (SMEM)]

13 CUDA Programming Basics – Slide 13 [Figure: many blocks of threads, each with its own shared memory (SMEM), all accessing per-device global memory]

14 CUDA Programming Basics – Slide 14 Type qualifiers: global, device, shared, local, constant. Keywords: threadIdx, blockIdx. Intrinsics: __syncthreads. Runtime API: memory, symbol and execution management. Function launch.

__device__ float filter[N];

__global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);

15 Mark Murphy, “NVIDIA’s Experience with Open64,” www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc CUDA Programming Basics – Slide 15 [Figure: CUDA compilation flow. Integrated source (foo.cu) passes through cudacc with the EDG C/C++ frontend; CPU host code (foo.cpp) goes to gcc / cl, while device code goes through the Open64 global optimizer to GPU assembly (foo.s) and through OCG to G80 SASS (foo.sass)]

16 A CUDA kernel is executed by an array of threads. All threads run the same code (SPMD). Each thread has an ID that it uses to compute memory addresses and make control decisions. CUDA Programming Basics – Slide 16 [Figure: an array of threads with threadID 0 through 7, each executing: float x = input[threadID]; float y = func(x); output[threadID] = y;]

17 Divide the monolithic thread array into multiple blocks. Threads within a block cooperate via shared memory, atomic operations and barrier synchronization. Threads in different blocks cannot cooperate. CUDA Programming Basics – Slide 17 [Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with local threadIDs 0 through 7 executing: float x = input[threadID]; float y = func(x); output[threadID] = y;]

18 Threads launched for a parallel section are partitioned into thread blocks. Grid = all blocks for a given launch. A thread block is a group of threads that can synchronize their execution and communicate via shared memory. CUDA Programming Basics – Slide 18

19 Any possible interleaving of blocks should be valid Presumed to run to completion without preemption Can run in any order Can run concurrently OR sequentially Blocks may coordinate but not synchronize Shared queue pointer: OK Shared lock: BAD … can easily deadlock Independence requirement gives scalability CUDA Programming Basics – Slide 19
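
As a sketch of the “shared queue pointer: OK” point above (not from the original slides; the kernel name and queue variable are invented), each block can atomically claim work items from a global counter, so no block ever has to wait on another and any block ordering remains valid.

__device__ unsigned int work_head = 0;   // the shared "queue pointer"

__global__ void process_queue(const float *items, float *out, unsigned int n)
{
    while (true) {
        // Atomically claim the next unprocessed index; blocks coordinate
        // through this counter but never wait on each other.
        unsigned int idx = atomicAdd(&work_head, 1u);
        if (idx >= n) break;             // queue exhausted
        out[idx] = items[idx] * 2.0f;    // placeholder per-item work
    }
}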

20 A CUDA program has two pieces Host code on the CPU which interfaces to the GPU Kernel code which runs on the GPU At the host level, there is a choice of 2 APIs (Application Programming Interfaces): Runtime: simpler, more convenient Driver: much more verbose, more flexible, closer to OpenCL We will only use the Runtime API in this course CUDA Programming Basics – Slide 20

21 At the host code level, there are library routines for: memory allocation on the graphics card; data transfer to/from device memory, including constants, texture arrays (useful for lookup tables) and ordinary data; error-checking; and timing. There is also a special syntax for launching multiple copies of the kernel process on the GPU. CUDA Programming Basics – Slide 21
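
A minimal host-side sketch of those routines, assuming the runtime API; the array names and sizes are invented for illustration, with error-checking via cudaGetLastError and timing via CUDA events.

#include <stdio.h>
#include <stdlib.h>

int main()
{
    int n = 1024;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);   // host array (contents unimportant here)
    float *d_x = 0;                        // device array

    cudaMalloc((void **)&d_x, bytes);                        // memory allocation on the graphics card
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);     // data transfer to device memory

    cudaEvent_t start, stop;                                 // timing
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    // ... kernel launches would go here ...
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaError_t err = cudaGetLastError();                    // error-checking
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);     // data transfer from device memory
    printf("elapsed GPU time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    free(h_x);
    return 0;
}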

22 Each thread uses IDs to decide what data to work on. Block ID: 1-D or 2-D, unique within a grid. Thread ID: 1-D, 2-D or 3-D, unique within a block. Dimensions are set at launch and can be unique for each grid. CUDA Programming Basics – Slide 22

23 Built-in variables threadIdx, blockIdx blockDim, gridDim Simplifies memory addressing when processing multidimensional data Image processing Solving PDEs on volumes … CUDA Programming Basics – Slide 23

24 In its simplest form the launch of a kernel looks like: kernel_routine<<<gridDim, blockDim>>>(args); where gridDim is the number of copies of the kernel (the “grid” size), blockDim is the number of threads within each copy (the “block” size), and args is a limited number of arguments, usually mainly pointers to arrays in graphics memory plus some constants which get copied by value. The more general form allows gridDim and blockDim to be 2-D or 3-D to simplify application programs. CUDA Programming Basics – Slide 24

25 At the lower level, when one copy of the kernel is started on an SM it is executed by a number of threads, each of which knows about: some variables passed as arguments; pointers to arrays in device memory (also arguments); global constants in device memory; shared memory and private registers/local variables; and some special variables: gridDim, the size (or dimensions) of the grid of blocks; blockIdx, the index (or 2-D/3-D indices) of the block; blockDim, the size (or dimensions) of each block; and threadIdx, the index (or 2-D/3-D indices) of the thread. CUDA Programming Basics – Slide 25

26 Suppose we have 1000 blocks, and each one has 128 threads – how does it get executed? On current Tesla hardware, would probably get 8 blocks running at the same time on each SM, and each block has 4 warps => 32 warps running on each SM Each clock tick, SM warp scheduler decides which warp to execute next, choosing from those not waiting for data coming from device memory (memory latency) completion of earlier instructions (pipeline delay) Programmer doesn’t have to worry about this level of detail, just make sure there are lots of threads / warps CUDA Programming Basics – Slide 26

27 In the simplest case, we have a 1-D grid of blocks and a 1-D set of threads within each block. If we want to use a 2-D set of threads, then blockDim.x, blockDim.y give the dimensions, and threadIdx.x, threadIdx.y give the thread indices. To launch the kernel we would use something like dim3 nthreads(16,4); my_new_kernel<<<nblocks, nthreads>>>(d_x); where dim3 is a special CUDA datatype with 3 components .x, .y, .z, each initialized to 1. CUDA Programming Basics – Slide 27

28 CUDA Programming Basics – Slide 28 Launch with: dim3 dimGrid(2, 2); dim3 dimBlock(4, 2, 2); kernelFunc<<<dimGrid, dimBlock>>>(…); Zoomed in on the block with blockIdx.x = blockIdx.y = 1, blockDim.x = 4, blockDim.y = blockDim.z = 2. Each thread in the block has coordinates (threadIdx.x, threadIdx.y, threadIdx.z).

29 A similar approach is used for 3-D threads and/or 2-D grids. This can be very useful in 2-D / 3-D finite difference applications. How do 2-D / 3-D threads get divided into warps? 1-D thread ID defined by threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y and this is then broken up into warps of size 32. CUDA Programming Basics – Slide 29
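
A small device-side sketch (not from the slides) that applies this formula: it computes the linear thread ID within the block and the warp that thread lands in, assuming the usual warp size of 32.

__global__ void show_warp_mapping(int *warp_of_thread)
{
    // Linear ID within the block, exactly the formula above
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;

    // Threads 0-31 form warp 0, threads 32-63 form warp 1, and so on
    int warp = tid / 32;

    warp_of_thread[tid] = warp;   // one entry per thread of this block
}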

30 Global memory is the main means of communicating R/W data between host and device; its contents are visible to all threads; access has long latency. We will focus on global memory for now; constant and texture memory will come later. CUDA Programming Basics – Slide 30 [Figure: CUDA memory model with the host, a grid's global memory, per-block shared memory, and per-thread registers]

31 CUDA Programming Basics – Slide 31 [Figure: sequential kernels, Kernel 0 followed by Kernel 1, both reading and writing the same per-device global memory]

32 The API is an extension to the ANSI C programming language (low learning curve). The hardware is designed to enable a lightweight runtime and driver (high performance). CUDA Programming Basics – Slide 32

33 CPU and GPU have separate memory spaces. Data is moved across the PCIe bus. Use functions to allocate/set/copy memory on the GPU, very similar to the corresponding C functions. Pointers are just addresses: you can't tell from the pointer value whether the address is on the CPU or GPU, so you must exercise care when dereferencing; dereferencing a CPU pointer on the GPU will likely crash, and vice versa. CUDA Programming Basics – Slide 33

34 cudaMalloc() allocates an object in the device global memory. It requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object. cudaFree() frees objects from device global memory; it takes a pointer to the freed object. CUDA Programming Basics – Slide 34 [Figure: CUDA memory model, highlighting host access to device global memory]

35 Code example: allocate a 64-by-64 single precision float array and attach the allocated storage to Md (“d” is often used to indicate a device data structure). CUDA Programming Basics – Slide 35

int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaMemset(Md, 0, size);
cudaFree(Md);

36 cudaMemcpy() performs memory data transfer. It requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type of transfer (host to host, host to device, device to host, or device to device). Asynchronous transfer is also available. CUDA Programming Basics – Slide 36 [Figure: CUDA memory model, highlighting transfers between host memory and device global memory]

37 cudaMemcpy() returns after the copy is complete: it blocks the CPU thread until all bytes have been copied, and it doesn't start copying until previous CUDA calls complete. Non-blocking copies are also available. CUDA Programming Basics – Slide 37 [Figure: cudaMemcpy() moving data between host memory, device 0 memory and device 1 memory]
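
A hedged sketch of the non-blocking variant mentioned above: cudaMemcpyAsync with pinned host memory and a stream (the buffer names and sizes are invented for illustration).

float *h_buf = 0, *d_buf = 0;
size_t bytes = 1 << 20;

cudaMallocHost((void **)&h_buf, bytes);   // pinned host memory, needed for truly asynchronous copies
cudaMalloc((void **)&d_buf, bytes);

cudaStream_t stream;
cudaStreamCreate(&stream);

// Returns to the CPU immediately; the copy proceeds in the background
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

// ... independent CPU work can overlap with the transfer here ...

cudaStreamSynchronize(stream);            // block only when the data is actually needed

cudaStreamDestroy(stream);
cudaFree(d_buf);
cudaFreeHost(h_buf);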

38 Code example: transfer a 64-by-64 single precision float array. M is in host memory and Md is in device memory. cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost and cudaMemcpyDeviceToDevice are symbolic constants. CUDA Programming Basics – Slide 38

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

39 CUDA Programming Basics – Slide 39

#include <stdio.h>
#include <stdlib.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0;   // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc((void**)&d_a, num_bytes);

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}

40 C/C++ with some restrictions Can only access GPU memory No variable number of arguments No static variables No recursion No dynamic polymorphism Must be declared with a qualifier __global__ : launched by CPU, cannot be called from GPU __device__ : called from other GPU functions, cannot be called by the CPU __host__ : can be called by the CPU CUDA Programming Basics – Slide 40

41 __global__ defines a kernel function; it must return void. __device__ and __host__ can be used together (sample use: overloading operators). CUDA Programming Basics – Slide 41

__device__ float DeviceFunc()   executed on the device, only callable from the device
__global__ void KernelFunc()    executed on the device, only callable from the host
__host__ float HostFunc()       executed on the host, only callable from the host
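
A small sketch (not from the slides) of using __device__ and __host__ together: the same helper compiles for both sides, so kernels and host code can share it. The function and kernel names are invented.

// Compiled once for the CPU and once for the GPU
__host__ __device__ float squared(float x) { return x * x; }

__global__ void apply_square(float *v, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) v[i] = squared(v[i]);   // device-side use
}

// Host-side use of the very same function:
// float y = squared(3.0f);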

42 __device__ int reduction_lock = 0; The __device__ prefix tells nvcc this is a global variable in the GPU, not the CPU. The variable can be read and modified by any kernel, and its lifetime is the lifetime of the whole application. One can also declare arrays of fixed size. Host code can read/write it using the special routines cudaMemcpyToSymbol and cudaMemcpyFromSymbol, or with a standard cudaMemcpy in combination with cudaGetSymbolAddress. CUDA Programming Basics – Slide 42
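
A hedged host-side sketch of those access routines, writing and then reading the reduction_lock variable declared above (assumed to run in ordinary host code such as main).

int zero = 0, value = 0;

cudaMemcpyToSymbol(reduction_lock, &zero, sizeof(int));      // host writes the device variable
// ... kernels that use reduction_lock would run here ...
cudaMemcpyFromSymbol(&value, reduction_lock, sizeof(int));   // host reads it back

// Alternative: standard cudaMemcpy combined with cudaGetSymbolAddress
int *d_lock = 0;
cudaGetSymbolAddress((void **)&d_lock, reduction_lock);
cudaMemcpy(d_lock, &zero, sizeof(int), cudaMemcpyHostToDevice);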

43 __device__ functions cannot have their address taken. For functions executed on the device: no recursion, no static variable declarations inside the function, and no variable number of arguments. CUDA Programming Basics – Slide 43

44 As seen, a kernel function must be called with an execution configuration. Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking. CUDA Programming Basics – Slide 44

__global__ void KernelFunc(…);
dim3 DimGrid(100, 50);       // 5000 thread blocks
dim3 DimBlock(4, 8, 8);      // 256 threads/block
size_t SharedMemBytes = 64;  // 64 bytes shared memory
KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(…);
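
A hedged sketch of the asynchronous-launch point: the launch above returns immediately, so the host explicitly synchronizes (and checks for errors) before using the results. cudaDeviceSynchronize is assumed here; older toolkits used cudaThreadSynchronize.

KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(/* args */);

// Launch-configuration errors are reported asynchronously
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// Block the CPU until the kernel has actually finished
cudaDeviceSynchronize();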

45 The kernel code looks fairly normal once you get used to two things. First, code is written from the point of view of a single thread: quite different to OpenMP multithreading; similar to MPI, where you use the MPI “rank” to identify the MPI process; all local variables are private to that thread. Second, you need to think about where each variable lives: any operation involving data in the device memory forces its transfer to/from registers in the GPU; there is no cache on old hardware, so a second operation with the same data forces a second transfer; it is often better to copy the value into a local register variable. CUDA Programming Basics – Slide 45
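
A tiny sketch (not from the slides) of the last point: read the global value once into a register and reuse it there, rather than touching device memory twice.

__global__ void scale_twice(const float *in, float *out)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    float x = in[i];               // one read from device memory into a register
    float y = x * x + 2.0f * x;    // both uses hit the register, not global memory

    out[i] = y;
}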

46 CUDA Programming Basics – Slide 46

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    int N = 16;    // total number of elements in the vector/array
    int TPB = 4;   // number of threads per block

    // allocate and initialize host (CPU) memory
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

    // allocate device (GPU) memory
    cudaMalloc( (void**) &d_A, N * sizeof(float));
    cudaMalloc( (void**) &d_B, N * sizeof(float));
    cudaMalloc( (void**) &d_C, N * sizeof(float));

    // assign values to h_A and h_B;

    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Run grid of N/4 blocks of 4 threads each
    vecAdd<<<N/TPB, TPB>>>(d_A, d_B, d_C);

    // copy result back to host memory
    cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // do something with the result…

    // free device (GPU) memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}

47 The __global__ identifier says it's a kernel function. Each thread sets one element of the C[] array. Within each block of threads, threadIdx.x ranges from 0 to blockDim.x - 1, so each thread has a unique value for i. CUDA Programming Basics – Slide 47
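
One common refinement, sketched here as an assumption rather than part of the original example: if N is not a multiple of the block size, round the grid size up and guard the store with a bounds check so the extra threads do nothing.

__global__ void vecAddGuarded(const float *A, const float *B, float *C, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N)                    // threads past the end of the array simply exit
        C[i] = A[i] + B[i];
}

// Launch with the grid rounded up, e.g.
//   vecAddGuarded<<<(N + TPB - 1) / TPB, TPB>>>(d_A, d_B, d_C, N);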

48 CUDA Programming Basics – Slide 48

__global__ void kernel( int *a )
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3

49 CUDA Programming Basics – Slide 49

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}

int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a=0, *h_a=0;   // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc((void**)&d_a, num_bytes);

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x = dimx / block.x;
    grid.y = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }

    free( h_a );
    cudaFree( d_a );

    return 0;
}

50 A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs: local/register usage; thread ID usage; and the memory data transfer API between host and device. Shared memory usage is left until later, and we assume square matrices for simplicity. CUDA Programming Basics – Slide 50

51 P = M × N of size WIDTH-by-WIDTH. Without tiling: one thread calculates one element of P; M and N are loaded WIDTH times from global memory. CUDA Programming Basics – Slide 51 [Figure: matrices M, N and P, each WIDTH by WIDTH]

52 CUDA Programming Basics – Slide 52 [Figure: memory layout of a matrix in C, showing the 4×4 matrix M with elements M0,0 … M3,3 placed into a linear array]

53 [Figure: memory layout of the 4×4 matrix M (continued), showing the order in which elements M0,0 … M3,3 appear in the linear array]

54 CUDA Programming Basics – Slide 54 [Figure: matrix multiplication indexing, with row i of M and column j of N combined over index k to produce element (i, j) of P]

55 CUDA Programming Basics – Slide 55

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

56 CUDA Programming Basics – Slide 56

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // allocate and load M, N to device memory
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // allocate P on the device
    cudaMalloc(&Pd, size);

57 CUDA Programming Basics – Slide 57

    // kernel invocation code – to be shown later (Step 5)
    …

    // read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}

58 CUDA Programming Basics – Slide 58 [Figure: thread mapping for the simple kernel, where threadIdx.x and threadIdx.y select a column of Nd and a row of Md, combined over index k to produce one element of Pd]

59 CUDA Programming Basics – Slide 59

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel (float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }

    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

60 CUDA Programming Basics – Slide 60

// Set up the execution configuration
dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);

// Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

61 One block of threads computes matrix Pd; each thread computes one element of Pd. Each thread loads a row of matrix Md, loads a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements. The compute to off-chip memory access ratio is close to 1:1 (not very high), and the size of the matrix is limited by the number of threads allowed in a thread block. CUDA Programming Basics – Slide 61

62 CUDA Programming Basics – Slide 62 [Figure: Grid 1 containing a single Block 1 of threads; thread (2, 2) is highlighted computing its element of the WIDTH-by-WIDTH matrix Pd from Md and Nd]

63 Have each 2-D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix; each block has (TILE_WIDTH)² threads. Generate a 2-D grid of (WIDTH / TILE_WIDTH)² blocks. You still need to put a loop around the kernel call for cases where WIDTH / TILE_WIDTH is greater than the maximum grid size (64K). CUDA Programming Basics – Slide 63
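
A hedged sketch of the corresponding host-side launch for this tiled decomposition (the deck shows only the revised kernel); it assumes Width is an exact multiple of TILE_WIDTH and that Md, Nd and Pd have already been allocated as before.

// One (TILE_WIDTH x TILE_WIDTH) block of threads per output tile
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);

MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);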

64 CUDA Programming Basics – Slide 64 Break up Pd into tiles. Each block calculates one tile; each thread calculates one element; block size equals tile size. [Figure: Md, Nd and Pd with a TILE_WIDTH tile of Pd selected by block indices (bx, by) and thread indices (tx, ty)]

65 CUDA Programming Basics – Slide 65 [Figure: a small example with TILE_WIDTH = 2, where the 4×4 result Pd is split into four 2×2 tiles computed by Block(0,0), Block(1,0), Block(0,1) and Block(1,1) from the elements of Md and Nd]

66 CUDA Programming Basics – Slide 66

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel (float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

    Pd[Row*Width+Col] = Pvalue;
}

67 All threads in a block execute the same kernel program (SPMD). The programmer declares the block: block size of 1 to 512 concurrent threads; block shape 1-D, 2-D or 3-D; block dimensions in threads. Threads have thread ID numbers within the block, and the thread program uses the thread ID to select work and address shared data. CUDA Tools and Threads – Slide 67 [Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, …, m running the same thread program. Courtesy: John Nickolls, NVIDIA]

68 Threads in the same block share data and synchronize while doing their share of the work. Threads in different blocks cannot cooperate. Each block can execute in any order relative to other blocks! CUDA Tools and Threads – Slide 68 [Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, …, m running the same thread program. Courtesy: John Nickolls, NVIDIA]

69 Hardware is free to assign blocks to any processor at any time. A kernel scales across any number of parallel processors. Each block can execute in any order relative to other blocks. CUDA Tools and Threads – Slide 69 [Figure: the same kernel grid of Blocks 0–7 scheduled over time on a device that runs two blocks at once and on a device that runs four blocks at once]

70 Processors execute computing threads New operating mode/hardware interface for computing CUDA Tools and Threads – Slide 70

71 Threads are assigned to streaming multiprocessors (SMs) in block granularity Up to 8 blocks to each SM as resource allows Each SM in G80 can take up to 768 threads Could be 256 (threads/block) × 3 blocks Or 128 (threads/block) × 6 blocks, etc. Threads run concurrently Each SM maintains thread/block id numbers Each SM manages/schedules thread execution CUDA Tools and Threads – Slide 71

72 CUDA Tools and Threads – Slide 72 [Figure: threads t0, t1, t2, …, tm of several blocks assigned to SM 0 and SM 1, each SM with its MT issue unit, SPs and shared memory: flexible resource allocation]

73 Each block is executed as 32-thread warps (an implementation decision, not part of the CUDA programming model). Warps are the scheduling units in an SM. If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? Each block is divided into 256/32 = 8 warps, so there are 8 × 3 = 24 warps. CUDA Tools and Threads – Slide 73

74 CUDA Tools and Threads – Slide 74 [Figure: a streaming multiprocessor with instruction L1, instruction fetch/dispatch, SPs, SFUs and shared memory, holding the warps (t0 t1 t2 … t31) of Block 1 and Block 2]

75 Each SM implements zero-overhead warp scheduling At any time, only one of the warps is executed by an SM Warps whose next instruction has its operands ready for consumption are eligible for execution Eligible warps are selected for execution on a prioritized scheduling policy All threads in a warp execute the same instruction when selected CUDA Tools and Threads – Slide 75

76 For matrix multiplication using multiple blocks, should I use 8 × 8, 16 × 16 or 32 × 32 blocks? For 8 × 8, we have 64 threads per block. Since each SM can take up to 768 threads, that is 12 blocks; however, each SM can only take up to 8 blocks, so only 512 threads will go into each SM. For 16 × 16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule. For 32 × 32, we have 1024 threads per block; not even one block fits into an SM! CUDA Tools and Threads – Slide 76

77 The API is an extension to the C programming language. It consists of: language extensions, to target portions of the code for execution on the device; and a runtime library split into a common component (providing built-in vector types and a subset of the C runtime library in both host and device codes), a host component (to control and access one or more devices from the host), and a device component (providing device-specific functions). CUDA Tools and Threads – Slide 77

78 dim3 gridDim; Dimensions of the grid in blocks ( gridDim.z unused) dim3 blockDim; Dimensions of the block in threads dim3 blockIdx; Block index within the grid dim3 threadIdx; Thread index within the block CUDA Tools and Threads – Slide 78

79 pow, sqrt, cbrt, hypot exp, exp2, expm1 log, log2, log10, log1p sin, cos, tan, asin, acos, atan, atan2 sinh, cosh, tanh, asinh, acosh, atanh ceil, floor, trunc, round Etc. When executed on the host, a given function uses the C runtime implementation if available These functions are only supported for scalar types, not vector types CUDA Tools and Threads – Slide 79

80 Some mathematical functions (e.g. sinf(x)) have a less accurate, but faster, device-only version (e.g. __sinf(x)): __powf, __logf, __log2f, __log10f, __expf, __sinf, __cosf, __tanf. CUDA Tools and Threads – Slide 80
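
A small sketch contrasting the two on single-precision data (the fast intrinsics are single-precision only); the kernel name is invented.

__global__ void sines(const float *x, float *accurate, float *fast, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);    // standard device math function
        fast[i]     = __sinf(x[i]);  // faster, lower-accuracy intrinsic
    }
}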

81 Provides functions to deal with: device management (including multi-device systems), memory management, and error handling. The runtime initializes the first time a runtime function is called. A host thread can invoke device code on only one device; multiple host threads are required to run on multiple devices. CUDA Tools and Threads – Slide 81
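
A hedged host-side sketch of the device-management and error-handling routines referred to above (assumed to run in ordinary host code).

int count = 0;
cudaGetDeviceCount(&count);            // how many CUDA devices are visible

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);     // query device 0
printf("device 0: %s with %d multiprocessors\n", prop.name, prop.multiProcessorCount);

cudaSetDevice(0);                      // this host thread will use device 0

cudaError_t err = cudaGetLastError();  // error handling
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));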

82 void __syncthreads(); Synchronizes all threads in a block Once all threads have reached this point, execution resumes normally Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory Allowed in conditional constructs only if the conditional is uniform across the entire thread block CUDA Tools and Threads – Slide 82
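
A minimal sketch of the usual pattern (an assumed example, not from the slides): every thread stages one value into shared memory, __syncthreads() guarantees all the writes are visible, and only then does each thread read a neighbour's slot.

__global__ void shift_left(const float *in, float *out)
{
    __shared__ float buf[256];             // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i   = tid + blockDim.x * blockIdx.x;

    buf[tid] = in[i];                      // each thread writes one slot
    __syncthreads();                       // avoid the read-after-write hazard

    int next = (tid + 1) % blockDim.x;     // safe to read a neighbour's slot now
    out[i] = buf[next];
}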

83 Memory allocation: cudaMalloc((void **)&xd, nbytes); Data copying: cudaMemcpy(xh, xd, nbytes, cudaMemcpyDeviceToHost); Reminder: using d (h) to distinguish an array on the device (host) is not mandatory, just helpful labeling. A kernel routine is declared with the __global__ prefix and is written from the point of view of a single thread. CUDA Programming Basics – Slide 83

84 Reading: Chapters 3 and 4, “Programming Massively Parallel Processors” by Kirk and Hwu. Based on original material from the University of Illinois at Urbana-Champaign (David Kirk, Wen-mei W. Hwu), Oxford University (Mike Giles), and Stanford University (Jared Hoberock, David Tarjan). Revision history: last updated 6/22/2011. CUDA Programming Basics – Slide 84

