
1 Intermediate GPGPU Programming in CUDA Supada Laosooksathit

2 NVIDIA Hardware Architecture [figure; label: host memory]

3 Recall 5 steps for CUDA Programming – Initialize device – Allocate device memory – Copy data to device memory – Execute kernel – Copy data back from device memory

4 Initialize Device Calls To select the device associated with the host thread – cudaSetDevice(device) – This function must be called before any __global__ function is launched, otherwise device 0 is automatically selected. To get the number of devices – cudaGetDeviceCount(&deviceCount) To retrieve a device's properties – cudaGetDeviceProperties(&deviceProp, device)
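A minimal sketch of these calls together (error checking omitted; the device index 0 is just an example):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);        // number of CUDA-capable devices
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);       // properties of device 0
        printf("%d device(s); device 0: %s\n", deviceCount, prop.name);
        cudaSetDevice(0);                        // select device 0 for this host thread
        return 0;
    }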

5 Hello World Example Allocate host and device memory

6 Hello World Example Host code

7 Hello World Example Kernel code
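The code images for slides 5-7 are not reproduced in this transcript. A minimal sketch in the same spirit (names such as hello, dBlock, and dThread are illustrative): each thread writes its block and thread ID to device memory, and the host copies the results back and prints them, covering the allocate, kernel, and copy-back steps.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void hello(int *blockIds, int *threadIds) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        blockIds[i]  = blockIdx.x;
        threadIds[i] = threadIdx.x;
    }

    int main(void) {
        const int nBlocks = 2, nThreads = 4, n = nBlocks * nThreads;
        int hBlock[n], hThread[n];
        int *dBlock, *dThread;
        cudaMalloc((void**)&dBlock,  n * sizeof(int));   // allocate device memory
        cudaMalloc((void**)&dThread, n * sizeof(int));
        hello<<<nBlocks, nThreads>>>(dBlock, dThread);   // execute kernel
        cudaMemcpy(hBlock,  dBlock,  n * sizeof(int), cudaMemcpyDeviceToHost);
        cudaMemcpy(hThread, dThread, n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; i++)
            printf("Hello from block %d, thread %d\n", hBlock[i], hThread[i]);
        cudaFree(dBlock);
        cudaFree(dThread);
        return 0;
    }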

8 To Try CUDA Programming SSH to the server Set environment variables in .bashrc in your home directory:
    export PATH=$PATH:/usr/local/cuda/bin
    export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK Compile the following directories – NVIDIA_GPU_Computing_SDK/shared/ – NVIDIA_GPU_Computing_SDK/C/common/ The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/

9 Demo Hello World – Print out block and thread IDs Vector Add – C = A + B

10 CUDA Language Concepts CUDA programming model CUDA memory model

11 Some Terminology Device = GPU = set of streaming multiprocessors Streaming Multiprocessor (SM) = set of processors & shared memory Kernel = GPU program Grid = array of thread blocks that execute a kernel Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

12 CUDA Programming Model Parallel code (a kernel) is launched and executed on the device by many threads Threads are grouped into thread blocks Parallel code is written from the point of view of a single thread

    // Kernel definition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

13 Thread Hierarchy Threads launched for a parallel section are partitioned into thread blocks A thread block is a group of threads that can: – Synchronize their execution – Communicate via low-latency shared memory Grid = all thread blocks for a given launch

14

15 IDs and Dimensions Threads – 3D IDs – Unique within a block – Two threads from two different blocks cannot cooperate Blocks – 2D or 3D IDs (depending on the hardware) – Unique within a grid Dimensions are set at launch time – Can be different for each kernel launch Built-in variables: – threadIdx, blockIdx – blockDim, gridDim
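A common use of these built-ins is computing one unique global index per thread; a small sketch for a 1D launch (the kernel name scale and the grid-stride loop are illustrative, not from the slides):

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index across the whole grid
        int stride = gridDim.x * blockDim.x;             // total number of threads launched
        for (; i < n; i += stride)                       // grid-stride loop handles n > thread count
            data[i] *= 2.0f;
    }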

16 [Figure: the host launches Kernel 1 on Grid 1, a 3x2 array of blocks, and Kernel 2 on Grid 2; Block (1, 1) is expanded to show its 5x3 array of threads]

17

18

19 CUDA Memory Model Each thread can: – R/W per-thread registers – R/W per-thread local memory – R/W per-block shared memory – R/W per-grid global memory – Read only per-grid constant memory – Read only per-grid texture memory The host can R/W global, constant, and texture memories [Figure: memory hierarchy showing per-thread registers and local memory, per-block shared memory, and per-grid global, constant, and texture memory]

20 [Figure; label: host memory]

21 Device DRAM Global memory – Main means of communicating R/W data between host and device – Contents visible to all threads Texture and Constant Memories – Constants initialized by host – Contents visible to all threads

22 CUDA Global Memory Allocation cudaMalloc(pointer, memsize) – Allocates an object in device global memory – pointer = address of a pointer to the allocated object – memsize = size of the allocated object in bytes cudaFree(pointer) – Frees the object from device global memory

23 CUDA Host-Device Data Transfer cudaMemcpy() – Memory data transfer – Requires four parameters: Pointer to destination Pointer to source Number of bytes copied Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device
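A sketch that puts slides 22 and 23 together inside a host function (error checking omitted; hA, hB, and dA are illustrative names, and note that cudaMemcpy takes the destination pointer first):

    float hA[256], hB[256];                                  // host arrays
    float *dA;
    size_t bytes = 256 * sizeof(float);

    cudaMalloc((void**)&dA, bytes);                          // allocate device global memory
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);       // host -> device
    // ... launch kernels that read/write dA ...
    cudaMemcpy(hB, dA, bytes, cudaMemcpyDeviceToHost);       // device -> host
    cudaFree(dA);                                            // free device global memory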

24 CUDA Function Declaration __global__ defines a kernel function – Must return void

    Declaration                        Executed on   Only callable from
    __device__ float DeviceFunc()      device        device
    __global__ void KernelFunc()       device        host
    __host__ float HostFunc()          host          host

25 CUDA Function Calls Restrictions __device__ functions cannot have their address taken For functions executed on the device: – No recursion – No static variable declarations inside the function – No variable number of arguments

26 Calling a Kernel Function – Thread Creation A kernel function must be called with an execution configuration: KernelFunc<<<DimGrid, DimBlock, SharedMemBytes, Streams>>>(...); – DimGrid = dimension and size of the grid – DimBlock = dimension and size of each block – SharedMemBytes = number of bytes of dynamically allocated shared memory (optional) – Streams = the associated stream (optional)
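For example, launching a vecAdd kernel like the one on slide 12 (assuming it computes its index as blockIdx.x * blockDim.x + threadIdx.x so that several blocks cover the array); dA, dB, dC, and stream are illustrative names:

    int N = 4096;
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up

    // DimGrid and DimBlock may also be dim3 values for 2D/3D configurations
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC);

    // with the optional shared-memory size and stream arguments:
    // vecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(dA, dB, dC);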

27 NVIDIA Hardware Architecture [figure; label: host memory]

28 NVIDIA Hardware Architecture [figure; label: SM]

29 Specifications of a Device For more details – deviceQuery in CUDA SDK – Appendix F in Programming Guide 4.0

    Specification        Compute Capability 1.3   Compute Capability 2.0
    Warp size            32                       32
    Max threads/block    512                      1024
    Max blocks/grid      65535                    65535
    Shared memory        16 KB/SM                 48 KB/SM

30 Demo deviceQuery – Show hardware specifications in details

31 Memory Optimizations Reduce the time of memory transfer between host and device – Use asynchronous memory transfer (CUDA streams) – Use zero copy Reduce the number of transactions between on-chip and off-chip memory – Memory coalescing Avoid bank conflicts in shared memory

32 Reduce Time of Host-Device Memory Transfer Regular (synchronous) memory transfer

33 Reduce Time of Host-Device Memory Transfer CUDA streams – Allow overlap between kernel execution and memory copies

34 CUDA Streams Example
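The code for this slide is not reproduced in the transcript. A minimal sketch of the usual pattern (hA, dA, N, and myKernel are placeholders; hA must be page-locked memory from cudaMallocHost, and N is assumed to split evenly into chunks that are multiples of the block size): the work is divided into chunks so the copy for one chunk can overlap the kernel for another.

    const int nStreams = 2;
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; s++)
        cudaStreamCreate(&streams[s]);

    int chunk = N / nStreams;
    for (int s = 0; s < nStreams; s++) {
        int off = s * chunk;
        cudaMemcpyAsync(dA + off, hA + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);       // async copy in stream s
        myKernel<<<chunk / 256, 256, 0, streams[s]>>>(dA + off, chunk);
        cudaMemcpyAsync(hA + off, dA + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                                       // wait for all streams
    for (int s = 0; s < nStreams; s++)
        cudaStreamDestroy(streams[s]);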

35

36 GPU Timers CUDA Events – An API in the CUDA runtime – Timestamps are recorded on the GPU clock – Accurate for timing kernel executions CUDA timer calls – Library routines provided in the CUDA SDK

37 CUDA Events Example
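The code for this slide is also not reproduced; a minimal sketch of timing a kernel with CUDA events (myKernel, blocks, threads, dA, and n are placeholders):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                  // record in the default stream
    myKernel<<<blocks, threads>>>(dA, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                 // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);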

38 Demo simpleStreams

39 Reduce Time of Host-Device Memory Transfer Zero copy – Allow device pointers to access page-locked host memory directly – Page-locked host memory is allocated by cudaHostAlloc()
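A sketch of the zero-copy pattern (myKernel, blocks, threads, and N are placeholders; mapping must be enabled with cudaSetDeviceFlags before the CUDA context is created):

    cudaSetDeviceFlags(cudaDeviceMapHost);                                   // enable mapped pinned memory

    float *hData, *dData;
    cudaHostAlloc((void**)&hData, N * sizeof(float), cudaHostAllocMapped);   // page-locked, mapped
    cudaHostGetDevicePointer((void**)&dData, hData, 0);                      // device view of hData

    myKernel<<<blocks, threads>>>(dData, N);   // kernel accesses host memory directly, no cudaMemcpy
    cudaDeviceSynchronize();                   // results are visible in hData afterwards

    cudaFreeHost(hData);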

40 Demo Zero copy

41 Reduce the Number of On-chip and Off-chip Memory Transactions Threads in a warp access global memory together Memory coalescing – The hardware combines the warp's accesses into as few memory transactions as possible

42 Memory Coalescing Threads in a warp access global memory in a straightforward way (one 4-byte word per thread, at consecutive addresses)

43 Memory Coalescing Memory addresses are aligned in the same segment but the accesses are not sequential

44 Memory Coalescing Memory addresses are not aligned in the same segment
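For example, a sketch contrasting the access pattern that coalesces well with one that does not (kernel and array names are illustrative):

    __global__ void coalesced(const float *in, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];       // neighbouring threads read neighbouring 4-byte words: few transactions
    }

    __global__ void strided(const float *in, float *out, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];       // neighbouring threads are stride words apart: many transactions
    }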

45 Shared Memory 16 banks for compute capability 1.x, 32 banks for compute capability 2.x Can be used to help achieve memory coalescing (stage data in shared memory) Bank conflicts may occur – Two or more threads in a warp access the same bank – In compute capability 1.x, no broadcast – In compute capability 2.x, the same data is broadcast to all threads that request it

46 Bank Conflicts [Figure: two thread-to-bank access patterns, one with no bank conflict and one with a 2-way bank conflict]
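A common illustration of avoiding bank conflicts (not from the original slides): in a shared-memory tile read by column, all threads of a warp hit the same bank; padding each row by one word spreads the accesses across banks. The sketch below assumes a square matrix and a 32x32 thread block.

    #define TILE 32

    __global__ void transposeTile(const float *in, float *out) {
        __shared__ float tile[TILE][TILE + 1];                // +1 padding avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        int width = gridDim.x * TILE;                         // square matrix assumed

        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load into shared memory
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;                  // swap block offsets for the transpose
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column read, conflict-free with padding
    }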

47 Matrix Multiplication Example

48 With the shared-memory (tiled) version, accesses to global memory are reduced – A is read (B.width/BLOCK_SIZE) times from global memory – B is read (A.height/BLOCK_SIZE) times from global memory
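The slide's code is not in the transcript; a shortened sketch of the shared-memory (tiled) multiplication in the style of the CUDA programming guide, assuming square row-major matrices whose width is a multiple of BLOCK_SIZE and a BLOCK_SIZE x BLOCK_SIZE thread block:

    #define BLOCK_SIZE 16

    __global__ void matMulShared(const float *A, const float *B, float *C, int width) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];          // one tile of A per block
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];          // one tile of B per block

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < width / BLOCK_SIZE; t++) {
            // each thread loads one element of the current A tile and one of the B tile
            As[threadIdx.y][threadIdx.x] = A[row * width + t * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * width + col];
            __syncthreads();                                  // tiles fully loaded before use

            for (int k = 0; k < BLOCK_SIZE; k++)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                                  // finish with the tiles before reloading
        }
        C[row * width + col] = sum;
    }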

49 Demo Matrix Multiplication – With and without shared memory – Different block sizes

50 Control Flow if, switch, do, for, while Branch divergence in a warp – Threads in a warp take different execution paths The different paths are serialized Increases the number of instructions executed by that warp

51 Branch Divergence
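Slide 51's figure is not reproduced; a small sketch of the difference (warp size 32, kernel names illustrative): branching on a per-lane condition splits every warp and serializes both paths, while branching on a warp-aligned condition does not diverge.

    __global__ void divergent(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)                  // even/odd lanes differ within each warp
            data[i] *= 2.0f;                       // both paths are executed serially per warp
        else
            data[i] += 1.0f;
    }

    __global__ void warpUniform(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / warpSize) % 2 == 0)     // all 32 lanes of a warp take the same path
            data[i] *= 2.0f;                       // no divergence, no serialization
        else
            data[i] += 1.0f;
    }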

52 Summary 5 steps for CUDA Programming NVIDIA Hardware Architecture – Memory hierarchy: global memory, shared memory, register file – Specifications of a device: block, warp, thread, SM

53 Summary Memory optimization – Reduce the overhead of host-device memory transfer with CUDA streams and zero copy – Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (and shared memory) – Try to avoid bank conflicts in shared memory Control flow – Try to avoid branch divergence within a warp

54 References – NVIDIA CUDA C Programming Guide – NVIDIA CUDA C Best Practices Guide – NVIDIA CUDA Toolkit


