Intermediate GPGPU Programming in CUDA

1 Intermediate GPGPU Programming in CUDA
Supada Laosooksathit

2 NVIDIA Hardware Architecture
Terminologies: host memory, global memory, shared memory, SMs

3 Recall 5 steps for CUDA Programming
1. Initialize device
2. Allocate device memory
3. Copy data to device memory
4. Execute kernel
5. Copy data back from device memory
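
The five steps can be sketched as one small vector-add program. This is a minimal sketch, not code from the original slides: the helper name runVecAdd, the use of device 0, and the block size of 256 are assumptions made for illustration, and error handling is kept to a minimum.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: runs the five steps for n elements and
// returns true if every element of C equals A + B.
bool runVecAdd(int n) {
    std::vector<float> hA(n), hB(n), hC(n);
    for (int i = 0; i < n; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }
    size_t bytes = n * sizeof(float);

    // 1. Initialize device (device 0 is an assumption).
    if (cudaSetDevice(0) != cudaSuccess) return false;

    // 2. Allocate device memory.
    float *dA = NULL, *dB = NULL, *dC = NULL;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);

    // 3. Copy data to device memory.
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // 4. Execute kernel with enough 256-thread blocks to cover n.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // 5. Copy data back (this blocking copy also waits for the kernel).
    cudaError_t err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < n; ++i)
        if (hC[i] != hA[i] + hB[i]) return false;
    return true;
}

int main() {
    printf("%s\n", runVecAdd(1000) ? "vecAdd OK" : "vecAdd FAILED");
    return 0;
}
```

With the CUDA toolkit on the PATH, a file like this would typically be built with nvcc, for example nvcc vecadd.cu -o vecadd (the file name is an assumption).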

4 Initialize Device Calls
To select the device associated with the host thread
  cudaSetDevice(device)
  This function must be called before any __global__ function; otherwise device 0 is automatically selected.
To get the number of devices
  cudaGetDeviceCount(&devicecount)
To retrieve a device's properties
  cudaGetDeviceProperties(&deviceProp, device)
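
The three calls above might be combined as follows. This is a sketch under assumptions: the helper name listDevices and the choice of printed fields are illustrative, and selecting device 0 at the end is arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a few properties of every device and
// returns the device count, or -1 if the count cannot be queried.
int listDevices() {
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) return -1;

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("Device %d: %s\n", dev, prop.name);
        printf("  compute capability : %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors    : %d\n", prop.multiProcessorCount);
        printf("  warp size          : %d\n", prop.warpSize);
        printf("  max threads/block  : %d\n", prop.maxThreadsPerBlock);
        printf("  shared mem/block   : %zu bytes\n", prop.sharedMemPerBlock);
    }
    return deviceCount;
}

int main() {
    int count = listDevices();
    if (count <= 0) {
        printf("No CUDA device found\n");
        return 1;
    }
    // Select the device for this host thread before launching any kernel.
    cudaSetDevice(0);
    return 0;
}
```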

5 Hello World Example Allocate host and device memory

6 Hello World Example Host code

7 Hello World Example Kernel code

8 To Try CUDA Programming
SSH to
Set environment variables in .bashrc in your home directory
  export PATH=$PATH:/usr/local/cuda/bin
  export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
Compile the following directories
  NVIDIA_GPU_Computing_SDK/shared/
  NVIDIA_GPU_Computing_SDK/C/common/
The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/

9 Demo
Hello World: print out block and thread IDs
Vector Add: C = A + B
Show some real demos: the ones above and additional ones in the sample directories

10 CUDA Language Concept
CUDA programming model
CUDA memory model

11 Some Terminologies
Device = GPU = set of streaming multiprocessors
Streaming Multiprocessor (SM) = set of processors & shared memory
Kernel = GPU program
Grid = array of thread blocks that execute a kernel
Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

12 CUDA Programming Model
Parallel code (kernel) is launched and executed on a device by many threads
Threads are grouped into thread blocks
Parallel code is written for a thread

  // Kernel definition
  __global__ void vecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x;
      C[i] = A[i] + B[i];
  }

13 Thread Hierarchy
Threads launched for a parallel section are partitioned into thread blocks
Thread block is a group of threads that can:
  Synchronize their execution
  Communicate via a low-latency shared memory
Grid = all thread blocks for a given launch


15 IDs and Dimensions
Threads
  3D IDs
  Unique within a block: two threads from two different blocks cannot cooperate
Blocks
  2D and 3D IDs (depending on the hardware)
  Unique within a grid
Dimensions are set at launch time
  Can differ for each parallel section
Built-in variables: threadIdx, blockIdx, blockDim, gridDim
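
A thread usually combines these built-in variables into one global position. The sketch below assumes a 2D grid of 16x16 blocks; the kernel name fillIndex and the helper runFillIndex are hypothetical names chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread derives its global 2D position from the built-in variables
// and stores the matching row-major linear index.
__global__ void fillIndex(int* out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)   // the grid may overshoot the array
        out[row * width + col] = row * width + col;
}

// Hypothetical helper: returns true if element i of the result holds i.
bool runFillIndex(int width, int height) {
    int n = width * height;
    int* d = NULL;
    if (cudaMalloc((void**)&d, n * sizeof(int)) != cudaSuccess) return false;

    // Dimensions are chosen at launch time.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    fillIndex<<<grid, block>>>(d, width, height);

    std::vector<int> h(n, -1);
    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < n; ++i)
        if (h[i] != i) return false;
    return true;
}

int main() {
    printf("%s\n", runFillIndex(100, 37) ? "index OK" : "index FAILED");
    return 0;
}
```

The bounds check matters whenever the array size is not a multiple of the block size, as with the 100 x 37 example here.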

16 (Figure: the host launches Kernel 1 and Kernel 2; on the device each kernel runs as a grid of thread blocks, Grid 1 with Block (0, 0) to Block (2, 1) and Grid 2; Block (1, 1) is expanded into its threads, Thread (0, 0) to Thread (4, 2))



19 CUDA Memory Model
Each thread can:
  R/W per-thread registers
  R/W per-thread local memory
  R/W per-block shared memory
  R/W per-grid global memory
  Read only per-grid constant memory
  Read only per-grid texture memory
The host can R/W global, constant, and texture memories
(Figure: device grid with Block (0, 0) and Block (1, 0), each holding shared memory and threads with registers and local memory, above the global, constant, and texture memories that are shared with the host)

20 Host memory

21 Device DRAM
Global memory
  Main means of communicating R/W data between host and device
  Contents visible to all threads
Texture and Constant Memories
  Constants initialized by host

22 CUDA Global Memory Allocation
cudaMalloc(pointer, memsize)
  Allocates object in the device global memory
  pointer = address of a pointer to the allocated object
  memsize = size of allocated object in bytes
cudaFree(pointer)
  Frees object from device global memory
  pointer = the device pointer that cudaMalloc filled in

23 CUDA Host-Device Data Transfer
cudaMemcpy()
  Memory data transfer
  Requires four parameters, in this order
    Pointer to destination
    Pointer to source
    Number of bytes copied
    Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device

24 CUDA Function Declaration
                                 Executed on the:    Only callable from the:
__device__ float DeviceFunc()    device              device
__global__ void KernelFunc()     device              host
__host__ float HostFunc()        host                host
__global__ defines a kernel function
  Must return void
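
The three qualifiers can be seen side by side in a short program. This is an illustrative sketch: deviceSquare, squareKernel, hostSquare, and runSquare are hypothetical names, not functions from the slides.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Runs on the device, callable only from device code.
__device__ float deviceSquare(float x) { return x * x; }

// Kernel: runs on the device, launched from the host, must return void.
__global__ void squareKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = deviceSquare(data[i]);
}

// Runs on the host, callable only from the host (the default for
// unqualified functions).
__host__ float hostSquare(float x) { return x * x; }

// Hypothetical helper: squares 0..n-1 on the GPU and compares each
// result with the host version.
bool runSquare(int n) {
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float* d = NULL;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    squareKernel<<<(n + 255) / 256, 256>>>(d, n);

    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < n; ++i)
        if (h[i] != hostSquare((float)i)) return false;
    return true;
}

int main() {
    printf("%s\n", runSquare(1000) ? "square OK" : "square FAILED");
    return 0;
}
```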

25 CUDA Function Calls Restrictions
__device__ functions cannot have their address taken
For functions executed on the device:
  No recursion
  No static variable declarations inside the function
  No variable number of arguments

26 Calling a Kernel Function – Thread Creation
A kernel function must be called with an execution configuration:
  KernelFunc<<< DimGrid, DimBlock, SharedMemBytes, Streams >>>(...);
  DimGrid = dimension and size of the grid
  DimBlock = dimension and size of each block
  SharedMemBytes = number of bytes of dynamically allocated shared memory per block (optional)
  Streams = the associated stream (optional)
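
A launch that uses all four configuration arguments could look like the sketch below. The kernel reverseInBlock and the helper runReverse are hypothetical; the kernel reverses the elements within each block's segment so that the dynamically sized shared memory is actually exercised.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Reverses the elements inside each block's segment. The size of the
// shared array comes from the third launch argument.
__global__ void reverseInBlock(int* data) {
    extern __shared__ int tile[];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();                       // whole segment is now in shared memory
    data[base + t] = tile[blockDim.x - 1 - t];
}

// Hypothetical helper: returns true if every segment came back reversed.
bool runReverse(int numBlocks, int threadsPerBlock) {
    int n = numBlocks * threadsPerBlock;
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int* d = NULL;
    if (cudaMalloc((void**)&d, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 dimGrid(numBlocks);               // size of the grid
    dim3 dimBlock(threadsPerBlock);        // size of each block
    size_t sharedMemBytes = threadsPerBlock * sizeof(int);
    reverseInBlock<<<dimGrid, dimBlock, sharedMemBytes, stream>>>(d);

    cudaStreamSynchronize(stream);         // wait for the kernel in this stream
    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaStreamDestroy(stream);
    cudaFree(d);

    for (int i = 0; i < n; ++i) {
        int base = (i / threadsPerBlock) * threadsPerBlock;
        int expected = base + threadsPerBlock - 1 - (i - base);
        if (h[i] != expected) return false;
    }
    return true;
}

int main() {
    printf("%s\n", runReverse(4, 128) ? "reverse OK" : "reverse FAILED");
    return 0;
}
```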

27 NVIDIA Hardware Architecture
Terminologies: host memory, global memory, shared memory, SMs

28 NVIDIA Hardware Architecture
Terminologies
  Compute capability
  Threads in a block are grouped into warps of 32 threads
  Warps execute on the cores of an SM

29 Specifications of a Device
                     Compute Capability 1.3    Compute Capability 2.0
Warp size            32                        32
Max threads/block    512                       1024
Max blocks/grid      65535                     65535
Shared mem           16 KB/SM                  48 KB/SM
For more details
  deviceQuery in CUDA SDK
  Appendix F in Programming Guide 4.0

30 Demo deviceQuery
Show hardware specifications in detail

31 Memory Optimizations
Reduce the time of memory transfer between host and device
  Use asynchronous memory transfer (CUDA streams)
  Use zero copy
Reduce the number of transactions between on-chip and off-chip memory
  Memory coalescing
  Avoid bank conflicts in shared memory

32 Reduce Time of Host-Device Memory Transfer
Regular memory transfer (synchronously)

33 Reduce Time of Host-Device Memory Transfer
CUDA streams Allow overlapping between kernel and memory copy
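
One common way to get this overlap is to split the data into chunks and give each chunk its own stream. The sketch below is an assumption-laden illustration, not the code from the original slides: scale and runStreams are hypothetical names, n is assumed to be a multiple of the stream count, and whether copies and kernels truly overlap depends on the hardware.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: doubles n floats using numStreams streams.
// Assumes n is a multiple of numStreams.
bool runStreams(int n, int numStreams) {
    int chunk = n / numStreams;
    size_t chunkBytes = chunk * sizeof(float);

    // Asynchronous copies need page-locked (pinned) host memory.
    float* h = NULL;
    if (cudaMallocHost((void**)&h, n * sizeof(float)) != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float* d = NULL;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }

    std::vector<cudaStream_t> streams(numStreams);
    for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

    int threads = 256;
    int blocks = (chunk + threads - 1) / threads;
    for (int s = 0; s < numStreams; ++s) {
        int off = s * chunk;
        // Copy in, compute, copy out: ordered within one stream,
        // free to overlap with the work queued in other streams.
        cudaMemcpyAsync(d + off, h + off, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<blocks, threads, 0, streams[s]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();               // wait for all streams

    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (h[i] != 2.0f * i) ok = false;

    for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}

int main() {
    printf("%s\n", runStreams(1 << 20, 4) ? "streams OK" : "streams FAILED");
    return 0;
}
```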

34 CUDA Streams Example

35 CUDA Streams Example

36 GPU Timers
CUDA Events
  An API
  Use the GPU shader clock
  Accurate for timing kernel executions
CUDA timer calls
  Libraries implemented in CUDA SDK
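
Timing a kernel with CUDA events generally follows a record, launch, record, synchronize pattern. In this sketch the kernel addOne and the helper timeKernelMs are hypothetical, and the measured value depends entirely on the hardware, so no expected timing is given.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns the kernel time in milliseconds,
// or -1 if something went wrong.
float timeKernelMs(int n) {
    float* d = NULL;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);             // timestamp before the kernel
    addOne<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop, 0);              // timestamp after the kernel
    cudaEventSynchronize(stop);            // wait until 'stop' has been reached

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return err == cudaSuccess ? ms : -1.0f;
}

int main() {
    printf("kernel time: %.3f ms\n", timeKernelMs(1 << 20));
    return 0;
}
```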

37 CUDA Events Example

38 Demo simpleStreams

39 Reduce Time of Host-Device Memory Transfer
Zero copy
  Allows device threads to access page-locked host memory directly
  Page-locked host memory is allocated by cudaHostAlloc()
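
A zero-copy setup might be sketched as follows. The names increment and runZeroCopy are hypothetical, device 0 is assumed, and the sketch relies on the device reporting canMapHostMemory; it is not the code of the demo that follows.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: the kernel updates host memory in place,
// with no cudaMemcpy in either direction.
bool runZeroCopy(int n) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    if (!prop.canMapHostMemory) return false;   // device cannot map host memory

    // Request host-memory mapping. Older CUDA versions expect this before
    // the device is first used; the return value is ignored in this sketch.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Page-locked host memory that is mapped into the device address space.
    int* h = NULL;
    if (cudaHostAlloc((void**)&h, n * sizeof(int), cudaHostAllocMapped)
            != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) h[i] = i;

    // Device-side pointer that refers to the same host memory.
    int* d = NULL;
    if (cudaHostGetDevicePointer((void**)&d, h, 0) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }

    increment<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();               // results are visible in h afterwards

    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (h[i] != i + 1) ok = false;

    cudaFreeHost(h);
    return ok;
}

int main() {
    printf("%s\n", runZeroCopy(4096) ? "zero copy OK" : "zero copy FAILED");
    return 0;
}
```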

40 Demo Zero copy

41 Reduce Number of On-chip and Off-chip Memory Transactions
Threads in a warp access global memory
Memory coalescing
  The accesses of a warp are combined so that many words are transferred in a single transaction

42 Memory Coalescing Threads in a warp access global memory in a straightforward way (4-byte word per thread)

43 Memory Coalescing Memory addresses are aligned in the same segment but the accesses are not sequential

44 Memory Coalescing Memory addresses are not aligned in the same segment
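
The access patterns on the last three slides can be compared with one copy kernel whose stride is a parameter. This is a sketch under assumptions: copyWithStride, timedCopy, and runCoalescingDemo are hypothetical names, n is assumed to be a multiple of the stride, and the measured times vary by GPU, so none are quoted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Thread i copies element j. With stride 1, j == i, so a warp touches
// consecutive 4-byte words. With stride 32, neighbouring threads touch
// words 128 bytes apart. Assumes n is a multiple of stride, which makes
// the mapping i -> j a permutation of 0..n-1.
__global__ void copyWithStride(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int j = (i * stride) % n + (i * stride) / n;
    out[j] = in[j];
}

// Hypothetical helper: times one launch with CUDA events.
static float timedCopy(const float* dIn, float* dOut, int n, int stride) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    copyWithStride<<<(n + 255) / 256, 256>>>(dIn, dOut, n, stride);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Hypothetical helper: returns true if both access patterns copy correctly.
bool runCoalescingDemo(int n) {
    size_t bytes = n * sizeof(float);
    std::vector<float> h(n), result(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *dIn = NULL, *dOut = NULL;
    if (cudaMalloc((void**)&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, h.data(), bytes, cudaMemcpyHostToDevice);

    bool ok = true;
    const int strides[2] = {1, 32};
    for (int s = 0; s < 2; ++s) {
        cudaMemset(dOut, 0, bytes);
        float ms = timedCopy(dIn, dOut, n, strides[s]);
        cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i)
            if (result[i] != h[i]) ok = false;
        printf("stride %2d: %.3f ms\n", strides[s], ms);
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return ok;
}

int main() {
    return runCoalescingDemo(1 << 20) ? 0 : 1;
}
```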

45 Shared Memory
16 banks for compute capability 1.x, 32 banks for compute capability 2.x
Helps make use of memory coalescing
Bank conflicts may occur
  Two or more threads in a warp access the same bank
  In compute capability 1.x, no broadcast
  In compute capability 2.x, the same data is broadcast to all threads that request it

46 Bank Conflicts
(Figure: thread-to-bank mappings showing a case with no bank conflict and a case with a 2-way bank conflict)
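
A familiar place where bank conflicts show up is a shared-memory tile that is written by rows and read by columns. The sketch below pads each row by one word to avoid that; it assumes 32 banks of 4-byte words (compute capability 2.x), a matrix size that is a multiple of 32, and hypothetical names transposeTile and runTranspose.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 32

// Transposes an n x n matrix tile by tile. Assumes n is a multiple of TILE.
__global__ void transposeTile(const float* in, float* out, int n) {
    // Without the +1, all elements of a tile column would fall into the
    // same bank, and the column read below would be serialized.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // row-wise write
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // column-wise read
}

// Hypothetical helper: returns true if out[r][c] == in[c][r] everywhere.
bool runTranspose(int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> hIn(n * n), hOut(n * n);
    for (int i = 0; i < n * n; ++i) hIn[i] = (float)i;

    float *dIn = NULL, *dOut = NULL;
    if (cudaMalloc((void**)&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);                // 1024 threads per block
    dim3 grid(n / TILE, n / TILE);
    transposeTile<<<grid, block>>>(dIn, dOut, n);

    cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            if (hOut[r * n + c] != hIn[c * n + r]) return false;
    return true;
}

int main() {
    printf("%s\n", runTranspose(256) ? "transpose OK" : "transpose FAILED");
    return 0;
}
```

With the padded row length of 33 words, thread t of a warp reads word t * 33 + threadIdx.y, so the 32 threads land in 32 different banks.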

47 Matrix Multiplication Example

48 Matrix Multiplication Example
Reduce accesses to global memory
  A is read from global memory (B.width/BLOCK_SIZE) times
  B is read from global memory (A.height/BLOCK_SIZE) times
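
A tiled multiplication that stages sub-blocks of A and B in shared memory could be sketched like this. It is an illustration in the spirit of the slide, not the presenter's code: matMulShared and runMatMul are hypothetical names, matrices are row-major, and all dimensions are assumed to be multiples of BLOCK_SIZE.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// C (M x N) = A (M x K) * B (K x N), row-major.
// Assumes M, K, and N are multiples of BLOCK_SIZE.
__global__ void matMulShared(const float* A, const float* B, float* C,
                             int M, int K, int N) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < K / BLOCK_SIZE; ++t) {
        // Each thread loads one element of the current A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * K + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * N + col];
        __syncthreads();                   // tiles fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                   // done with tiles before reloading
    }
    C[row * N + col] = sum;
}

// Hypothetical helper: compares the GPU result with a host reference.
// Inputs are small integers, so float sums are exact in any order.
bool runMatMul(int M, int K, int N) {
    std::vector<float> hA(M * K), hB(K * N), hC(M * N);
    for (int i = 0; i < M * K; ++i) hA[i] = (float)(i % 3);
    for (int i = 0; i < K * N; ++i) hB[i] = (float)(i % 5);

    float *dA = NULL, *dB = NULL, *dC = NULL;
    cudaMalloc((void**)&dA, hA.size() * sizeof(float));
    cudaMalloc((void**)&dB, hB.size() * sizeof(float));
    cudaMalloc((void**)&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(N / BLOCK_SIZE, M / BLOCK_SIZE);
    matMulShared<<<grid, block>>>(dA, dB, dC, M, K, N);

    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += hA[r * K + k] * hB[k * N + c];
            if (hC[r * N + c] != ref) return false;
        }
    return true;
}

int main() {
    printf("%s\n", runMatMul(64, 32, 48) ? "matmul OK" : "matmul FAILED");
    return 0;
}
```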

49 Demo Matrix Multiplication With and without shared memory
Different block sizes

50 Control Flow
if, switch, do, for, while
Branch divergence in a warp
  Threads in a warp take different branches
  Different execution paths will be serialized
  Increases the number of instructions executed by that warp
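
The difference between a divergent branch and a warp-aligned one can be sketched with two kernels. All names here are hypothetical, the block size is assumed to be a multiple of the warp size (32), and for branches this small a compiler may use predication instead of real branching, so the sketch shows the pattern rather than a guaranteed slowdown.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Even and odd threads of the same warp take different paths,
// so the two paths are executed one after the other.
__global__ void divergentBranch(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) data[i] += 1.0f;
    else                      data[i] -= 1.0f;
}

// The condition is constant within a warp, so every thread of a warp
// follows the same path.
__global__ void warpAlignedBranch(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) data[i] += 1.0f;
    else                                   data[i] -= 1.0f;
}

// Hypothetical helper: checks the results of both kernels.
bool runDivergenceDemo(int n) {
    const int threads = 256;               // multiple of the warp size
    const int blocks = (n + threads - 1) / threads;
    size_t bytes = n * sizeof(float);
    std::vector<float> h(n);
    bool ok = true;

    float* d = NULL;
    if (cudaMalloc((void**)&d, bytes) != cudaSuccess) return false;

    cudaMemset(d, 0, bytes);
    divergentBranch<<<blocks, threads>>>(d, n);
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) {
        float expected = ((i % threads) % 2 == 0) ? 1.0f : -1.0f;
        if (h[i] != expected) ok = false;
    }

    cudaMemset(d, 0, bytes);
    warpAlignedBranch<<<blocks, threads>>>(d, n);
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) {
        float expected = (((i % threads) / 32) % 2 == 0) ? 1.0f : -1.0f;
        if (h[i] != expected) ok = false;
    }

    cudaFree(d);
    return ok;
}

int main() {
    printf("%s\n", runDivergenceDemo(1 << 16) ? "branches OK" : "branches FAILED");
    return 0;
}
```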

51 Branch Divergence

52 Summary
5 steps for CUDA Programming
NVIDIA Hardware Architecture
  Memory hierarchy: global memory, shared memory, register file
  Specifications of a device: block, warp, thread, SM

53 Summary
Memory optimization
  Reduce overhead due to host-device memory transfer with CUDA streams and zero copy
  Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
  Try to avoid bank conflicts in shared memory
Control flow
  Try to avoid branch divergence in a warp



