Intermediate GPGPU Programming in CUDA
Supada Laosooksathit
NVIDIA Hardware Architecture
- Terminologies
- Host memory
- Global memory
- Shared memory
- SMs
Recall: 5 Steps for CUDA Programming
1. Initialize device
2. Allocate device memory
3. Copy data to device memory
4. Execute kernel
5. Copy data back from device memory
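The five steps can be sketched as a minimal host program. This is a hedged sketch (a vector add of N floats, error checking omitted; the kernel body is the one shown later in these slides):

```cuda
// Minimal sketch of the five CUDA steps, assuming a vecAdd kernel.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void vecAdd(float *A, float *B, float *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main(void)
{
    const int N = 256;
    size_t size = N * sizeof(float);
    float *h_A = (float *)malloc(size), *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    cudaSetDevice(0);                                    // 1. initialize device

    float *d_A, *d_B, *d_C;                              // 2. allocate device memory
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // 3. copy data to device
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(d_A, d_B, d_C);                     // 4. execute kernel

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // 5. copy data back

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```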
Initialize Device Calls
- To select the device associated with the host thread: cudaSetDevice(device)
  - This function must be called before any __global__ function; otherwise device 0 is automatically selected.
- To get the number of devices: cudaGetDeviceCount(&deviceCount)
- To retrieve a device's properties: cudaGetDeviceProperties(&deviceProp, device)
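A minimal sketch of the device-selection calls above:

```cuda
// Enumerate CUDA devices, then select one for this host thread.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);   // number of CUDA-capable devices
    printf("%d device(s) found\n", deviceCount);

    // Must precede any __global__ call; otherwise device 0 is used.
    cudaSetDevice(0);
    return 0;
}
```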
Hello World Example Allocate host and device memory
Hello World Example Host code
Hello World Example Kernel code
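The slides above showed this example on screen; as a hedged reconstruction, a "Hello World" kernel that prints its block and thread IDs (matching the demo below) might look like this. Device-side printf requires compute capability 2.0 or higher:

```cuda
// Sketch of a Hello World kernel printing block and thread IDs.
#include <cstdio>

__global__ void hello(void)
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<<2, 4>>>();          // 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait so device output is flushed before exit
    return 0;
}
```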
To Try CUDA Programming
- SSH to 138.47.102.111
- Set environment variables in .bashrc in your home directory:
  export PATH=$PATH:/usr/local/cuda/bin
  export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
- Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
- Compile the following directories:
  NVIDIA_GPU_Computing_SDK/shared/
  NVIDIA_GPU_Computing_SDK/C/common/
- The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/
Demo
- Hello World: print out block and thread IDs
- Vector Add: C = A + B
- Additional demos from the SDK sample directories
CUDA Language Concepts
- CUDA programming model
- CUDA memory model
Some Terminologies
- Device = GPU = set of streaming multiprocessors
- Streaming Multiprocessor (SM) = set of processors & shared memory
- Kernel = GPU program
- Grid = array of thread blocks that execute a kernel
- Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
CUDA Programming Model
- Parallel code (kernel) is launched and executed on a device by many threads
- Threads are grouped into thread blocks
- Parallel code is written for a thread

```cuda
// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
```
Thread Hierarchy
- Threads launched for a parallel section are partitioned into thread blocks
- A thread block is a group of threads that can:
  - Synchronize their execution
  - Communicate via a low-latency shared memory
- Grid = all thread blocks for a given launch
IDs and Dimensions
- Threads
  - 3D IDs, unique within a block
  - Two threads from two different blocks cannot cooperate
- Blocks
  - 2D or 3D IDs (depending on the hardware), unique within a grid
- Dimensions are set at launch time and can be unique for each launch
- Built-in variables: threadIdx, blockIdx, blockDim, gridDim
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid is a 2D array of thread blocks, e.g., Block (0, 0) through Block (2, 1), and each block, e.g., Block (1, 1), is an array of threads, e.g., Thread (0, 0) through Thread (4, 2).]
CUDA Memory Model
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can R/W global, constant, and texture memories.
[Figure: a grid of blocks; each block has its own shared memory, each thread its own registers and local memory; global, constant, and texture memories are per-grid and accessible from the host.]
Host memory
Device DRAM
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
- Texture and constant memories
  - Constants initialized by host
  - Contents visible to all threads
CUDA Global Memory Allocation
- cudaMalloc(pointer, memsize)
  - Allocates object in the device global memory
  - pointer = address of a pointer to the allocated object
  - memsize = size of allocated object in bytes
- cudaFree(pointer)
  - Frees object from device global memory
CUDA Host-Device Data Transfer
- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters:
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type of transfer: host to host, host to device, device to host, device to device
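A minimal sketch combining allocation and transfer. Note that cudaMemcpy takes the destination pointer first:

```cuda
// Allocate device global memory, copy host data in, copy it back, free.
#include <cuda_runtime.h>

int main(void)
{
    const size_t memsize = 64 * sizeof(float);
    float h_data[64] = {0};

    float *d_data;
    cudaMalloc((void **)&d_data, memsize);                       // device global memory

    cudaMemcpy(d_data, h_data, memsize, cudaMemcpyHostToDevice); // host -> device
    cudaMemcpy(h_data, d_data, memsize, cudaMemcpyDeviceToHost); // device -> host

    cudaFree(d_data);                                            // release global memory
    return 0;
}
```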
CUDA Function Declarations

                                 Executed on the:   Only callable from the:
  __device__ float DeviceFunc()  device             device
  __global__ void  KernelFunc()  device             host
  __host__   float HostFunc()    host               host

__global__ defines a kernel function and must return void.
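A small sketch showing all three qualifiers together (the function names here are illustrative, not from the slides):

```cuda
// The three CUDA function qualifiers in one file.
#include <cuda_runtime.h>

__device__ float square(float x)        // runs on device, callable from device
{
    return x * x;
}

__global__ void squareAll(float *data)  // kernel: runs on device, called from host
{
    int i = threadIdx.x;
    data[i] = square(data[i]);          // __global__ code may call __device__ code
}

__host__ float twice(float x)           // runs on host, callable from host
{
    return 2.0f * x;
}
```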
CUDA Function Call Restrictions
- __device__ functions cannot have their address taken
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments
Calling a Kernel Function – Thread Creation
A kernel function must be called with an execution configuration:
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes, Streams >>>(...);
- DimGrid = dimension and size of the grid
- DimBlock = dimension and size of each block
- SharedMemBytes = number of bytes of dynamically allocated shared memory (optional)
- Streams = the associated stream (optional)
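A sketch of an execution configuration using dim3 for the grid and block dimensions plus an explicit shared-memory size (the stream argument is omitted, so the default stream is used; KernelFunc is a placeholder):

```cuda
// Launching a kernel with a 2D grid, 2D blocks, and dynamic shared memory.
__global__ void KernelFunc(float *data) { /* ... */ }

void launch(float *d_data)
{
    dim3 dimGrid(4, 2);           // 4 x 2 = 8 blocks in the grid
    dim3 dimBlock(16, 16);        // 16 x 16 = 256 threads per block
    size_t sharedMemBytes = 256 * sizeof(float);   // dynamic shared memory

    KernelFunc<<<dimGrid, dimBlock, sharedMemBytes>>>(d_data);
}
```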
NVIDIA Hardware Architecture
- Terminologies
- Host memory
- Global memory
- Shared memory
- SMs
NVIDIA Hardware Architecture: Terminologies
- Compute capability
- Threads in a block are grouped into warps of 32 threads
- Warps execute on the cores of an SM
Specifications of a Device

                      Compute Capability 1.3   Compute Capability 2.0
  Warp size           32                       32
  Max threads/block   512                      1024
  Max blocks/grid     65535                    65535
  Shared memory       16 KB/SM                 48 KB/SM

For more details:
- deviceQuery in CUDA SDK
- Appendix F in Programming Guide 4.0
Demo: deviceQuery — shows the hardware specifications in detail
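A minimal deviceQuery-style sketch that prints the properties from the table above via cudaGetDeviceProperties:

```cuda
// Query and print the key limits of device 0.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size (x): %d\n", prop.maxGridSize[0]);
    printf("Shared mem/block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```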
Memory Optimizations
- Reduce the time of memory transfer between host and device
  - Use asynchronous memory transfer (CUDA streams)
  - Use zero copy
- Reduce the number of transactions between on-chip and off-chip memory
  - Memory coalescing
  - Avoid bank conflicts in shared memory
Reduce Time of Host-Device Memory Transfer
- Regular memory transfer is synchronous: the host waits for the copy to complete before proceeding
Reduce Time of Host-Device Memory Transfer
- CUDA streams allow overlapping kernel execution with memory copies
CUDA Streams Example
CUDA Streams Example (cont.)
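The example on the slides above is not reproduced here; as a hedged sketch, overlapping copy and compute with two streams might look like this. Overlap requires page-locked host memory (cudaMallocHost) and asynchronous copies:

```cuda
// Split the work across two streams so the copy of one chunk can
// overlap the kernel execution of the other.
#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 20, chunk = N / 2;
    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, N * sizeof(float));  // page-locked host memory
    cudaMalloc((void **)&d_data, N * sizeof(float));

    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    for (int s = 0; s < 2; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_data + off, chunk);
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();   // wait for both streams to finish

    for (int s = 0; s < 2; ++s) cudaStreamDestroy(stream[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```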
GPU Timers
- CUDA events
  - An API that records timestamps using the GPU clock
  - Accurate for timing kernel executions
- CUDA timer calls
  - Libraries implemented in the CUDA SDK
CUDA Events Example
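A minimal sketch of timing a kernel with CUDA events:

```cuda
// Time a kernel launch with cudaEvent timestamps.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernel(float *d) { d[threadIdx.x] += 1.0f; }

int main(void)
{
    float *d_data;
    cudaMalloc((void **)&d_data, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);        // record on the default stream
    kernel<<<1, 256>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       // wait until the stop event completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```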
Demo simpleStreams
Reduce Time of Host-Device Memory Transfer
- Zero copy
  - Allows device pointers to access page-locked host memory directly
  - Page-locked host memory is allocated by cudaHostAlloc()
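A hedged sketch of zero copy: map page-locked host memory into the device address space so the kernel accesses it directly, with no cudaMemcpy:

```cuda
// Zero copy: the kernel reads and writes mapped host memory directly.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int N = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped pinned memory

    float *h_data, *d_ptr;
    cudaHostAlloc((void **)&h_data, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_ptr, h_data, 0);

    scale<<<(N + 255) / 256, 256>>>(d_ptr, N);  // accesses host memory over PCIe
    cudaDeviceSynchronize();

    cudaFreeHost(h_data);
    return 0;
}
```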
Demo Zero copy
Reduce Number of On-chip and Off-chip Memory Transactions
- When threads in a warp access global memory, the accesses can be coalesced
- Memory coalescing: a single transaction copies a contiguous segment of words for the whole warp
Memory Coalescing
- Threads in a warp access global memory in a straightforward way (one 4-byte word per thread)
Memory Coalescing Memory addresses are aligned in the same segment but the accesses are not sequential
Memory Coalescing Memory addresses are not aligned in the same segment
Shared Memory
- 16 banks for compute capability 1.x, 32 banks for compute capability 2.x
- Helps with utilizing memory coalescing
- Bank conflicts may occur when two or more threads in a warp access the same bank
- In compute capability 1.x, no broadcast
- In compute capability 2.x, the same data is broadcast to all threads that request it
Bank Conflicts
[Figure: left, each thread accesses a distinct bank (no bank conflict); right, pairs of threads access the same bank (2-way bank conflict).]
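A common trick to avoid bank conflicts is padding a shared-memory array by one column, so that column-wise accesses by a warp fall in different banks. A hedged sketch using a transpose-style tile (assuming the matrix width is a multiple of TILE):

```cuda
// Padding a shared-memory tile to avoid bank conflicts on column access.
#define TILE 32

__global__ void transposeTile(float *out, const float *in, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 pad: columns hit distinct banks

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // transposed block offsets
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y]; // conflict-free read
}
```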
Matrix Multiplication Example
Matrix Multiplication Example
- Reduce accesses to global memory:
  - A is read only (B.width/BLOCK_SIZE) times from global memory
  - B is read only (A.height/BLOCK_SIZE) times from global memory
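The shared-memory version shown on the slides follows the tiled scheme from the CUDA Programming Guide; a hedged sketch of the kernel (assuming square matrices with width a multiple of BLOCK_SIZE):

```cuda
// Tiled matrix multiply: each block stages BLOCK_SIZE x BLOCK_SIZE tiles
// of A and B in shared memory, cutting global-memory reads as noted above.
#define BLOCK_SIZE 16

__global__ void matMul(const float *A, const float *B, float *C, int width)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < width / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * width + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * width + col];
        __syncthreads();                      // tile fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // done with this tile
    }
    C[row * width + col] = sum;
}
```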
Demo: Matrix Multiplication
- With and without shared memory
- Different block sizes
Control Flow
- if, switch, do, for, while
- Branch divergence in a warp
  - Threads in a warp take different execution paths
  - The different execution paths are serialized
  - Increases the number of instructions executed by that warp
Branch Divergence
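A hedged sketch contrasting a divergent branch with a warp-uniform one (kernel names are illustrative):

```cuda
// In divergent(), odd and even threads of the same warp take different
// paths, so the warp executes both paths serially. In uniform(), the
// condition is constant across each 32-thread warp, so no divergence.
__global__ void divergent(float *d)
{
    if (threadIdx.x % 2 == 0)          // splits every warp into two paths
        d[threadIdx.x] *= 2.0f;
    else
        d[threadIdx.x] += 1.0f;
}

__global__ void uniform(float *d)
{
    if ((threadIdx.x / 32) % 2 == 0)   // whole warps take the same path
        d[threadIdx.x] *= 2.0f;
    else
        d[threadIdx.x] += 1.0f;
}
```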
Summary
- 5 steps for CUDA programming
- NVIDIA hardware architecture
  - Memory hierarchy: global memory, shared memory, register file
  - Specifications of a device: block, warp, thread, SM
Summary (cont.)
- Memory optimization
  - Reduce overhead due to host-device memory transfer with CUDA streams and zero copy
  - Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
  - Try to avoid bank conflicts in shared memory
- Control flow
  - Try to avoid branch divergence in a warp
References
- http://docs.nvidia.com/cuda/cuda-c-programming-guide/
- http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
- http://www.developer.nvidia.com/cuda-toolkit