
1 Intermediate GPGPU Programming in CUDA Supada Laosooksathit

2 NVIDIA Hardware Architecture [figure; label: host memory]

3 Recall 5 steps for CUDA Programming – Initialize device – Allocate device memory – Copy data to device memory – Execute kernel – Copy data back from device memory

4 Initialize Device Calls To select the device associated with the host thread – cudaSetDevice(device) – This function must be called before any __global__ function is launched, otherwise device 0 is automatically selected. To get the number of devices – cudaGetDeviceCount(&deviceCount) To retrieve a device's properties – cudaGetDeviceProperties(&deviceProp, device)
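A minimal sketch of these calls together (error checking omitted; the device index 0 is just an example):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);        // number of CUDA-capable devices
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);       // properties of device 0
        printf("%d device(s); device 0: %s\n", deviceCount, prop.name);
        cudaSetDevice(0);                        // select device 0 for this host thread
        return 0;
    }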

5 Hello World Example Allocate host and device memory

6 Hello World Example Host code

7 Hello World Example Kernel code
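The code images for slides 5-7 are not reproduced in this transcript. A minimal sketch in the same spirit (names such as hello, dBlock, and dThread are illustrative): each thread writes its block and thread ID to device memory, and the host copies the results back and prints them, covering the allocate, kernel, and copy-back steps.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void hello(int *blockIds, int *threadIds) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        blockIds[i]  = blockIdx.x;
        threadIds[i] = threadIdx.x;
    }

    int main(void) {
        const int nBlocks = 2, nThreads = 4, n = nBlocks * nThreads;
        int hBlock[n], hThread[n];
        int *dBlock, *dThread;
        cudaMalloc((void**)&dBlock,  n * sizeof(int));   // allocate device memory
        cudaMalloc((void**)&dThread, n * sizeof(int));
        hello<<<nBlocks, nThreads>>>(dBlock, dThread);   // execute kernel
        cudaMemcpy(hBlock,  dBlock,  n * sizeof(int), cudaMemcpyDeviceToHost);
        cudaMemcpy(hThread, dThread, n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; i++)
            printf("Hello from block %d, thread %d\n", hBlock[i], hThread[i]);
        cudaFree(dBlock);
        cudaFree(dThread);
        return 0;
    }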

8 To Try CUDA Programming SSH to the server Set environment variables in .bashrc in your home directory:
    export PATH=$PATH:/usr/local/cuda/bin
    export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK Compile the following directories – NVIDIA_GPU_Computing_SDK/shared/ – NVIDIA_GPU_Computing_SDK/C/common/ The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/

9 Demo Hello World – Print out block and thread IDs Vector Add – C = A + B

10 CUDA Language Concepts CUDA programming model CUDA memory model

11 Some Terminology Device = GPU = set of streaming multiprocessors Streaming Multiprocessor (SM) = set of processors & shared memory Kernel = GPU program Grid = array of thread blocks that execute a kernel Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

12 CUDA Programming Model Parallel code (a kernel) is launched and executed on the device by many threads Threads are grouped into thread blocks Parallel code is written from the point of view of a single thread

    // Kernel definition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

13 Thread Hierarchy Threads launched for a parallel section are partitioned into thread blocks A thread block is a group of threads that can: – Synchronize their execution – Communicate via low-latency shared memory Grid = all thread blocks for a given launch

14

15 IDs and Dimensions Threads – 3D IDs – Unique within a block – Two threads from two different blocks cannot cooperate Blocks – 2D or 3D IDs (depending on the hardware) – Unique within a grid Dimensions are set at launch time – Can be different for each kernel launch Built-in variables: – threadIdx, blockIdx – blockDim, gridDim
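A common use of these built-ins is computing one unique global index per thread; a small sketch for a 1D launch (the kernel name scale and the grid-stride loop are illustrative, not from the slides):

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index across the whole grid
        int stride = gridDim.x * blockDim.x;             // total number of threads launched
        for (; i < n; i += stride)                       // grid-stride loop handles n > thread count
            data[i] *= 2.0f;
    }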

16 [Figure: the host launches Kernel 1 on Grid 1, a 3x2 array of blocks, and Kernel 2 on Grid 2; Block (1, 1) is expanded to show its 5x3 array of threads]

17

18

19 CUDA Memory Model Each thread can: – R/W per-thread registers – R/W per-thread local memory – R/W per-block shared memory – R/W per-grid global memory – Read only per-grid constant memory – Read only per-grid texture memory The host can R/W global, constant, and texture memories [Figure: memory hierarchy showing per-thread registers and local memory, per-block shared memory, and per-grid global, constant, and texture memory]

20 [Figure; label: host memory]

21 Device DRAM Global memory – Main means of communicating R/W data between host and device – Contents visible to all threads Texture and Constant Memories – Constants initialized by host – Contents visible to all threads

22 CUDA Global Memory Allocation cudaMalloc(pointer, memsize) – Allocates an object in device global memory – pointer = address of a pointer to the allocated object – memsize = size of the allocated object in bytes cudaFree(pointer) – Frees the object from device global memory

23 CUDA Host-Device Data Transfer cudaMemcpy() – Memory data transfer – Requires four parameters: Pointer to destination Pointer to source Number of bytes copied Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device
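A sketch that puts slides 22 and 23 together inside a host function (error checking omitted; hA, hB, and dA are illustrative names, and note that cudaMemcpy takes the destination pointer first):

    float hA[256], hB[256];                                  // host arrays
    float *dA;
    size_t bytes = 256 * sizeof(float);

    cudaMalloc((void**)&dA, bytes);                          // allocate device global memory
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);       // host -> device
    // ... launch kernels that read/write dA ...
    cudaMemcpy(hB, dA, bytes, cudaMemcpyDeviceToHost);       // device -> host
    cudaFree(dA);                                            // free device global memory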

24 CUDA Function Declaration __global__ defines a kernel function – Must return void

    Declaration                        Executed on   Only callable from
    __device__ float DeviceFunc()      device        device
    __global__ void KernelFunc()       device        host
    __host__ float HostFunc()          host          host

25 CUDA Function Calls Restrictions __device__ functions cannot have their address taken For functions executed on the device: – No recursion – No static variable declarations inside the function – No variable number of arguments

26 Calling a Kernel Function – Thread Creation A kernel function must be called with an execution configuration: KernelFunc<<<DimGrid, DimBlock, SharedMemBytes, Streams>>>(...); – DimGrid = dimension and size of the grid – DimBlock = dimension and size of each block – SharedMemBytes = number of bytes of dynamically allocated shared memory (optional) – Streams = the associated stream (optional)
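For example, launching a vecAdd kernel like the one on slide 12 (assuming it computes its index as blockIdx.x * blockDim.x + threadIdx.x so that several blocks cover the array); dA, dB, dC, and stream are illustrative names:

    int N = 4096;
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up

    // DimGrid and DimBlock may also be dim3 values for 2D/3D configurations
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC);

    // with the optional shared-memory size and stream arguments:
    // vecAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(dA, dB, dC);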

27 NVIDIA Hardware Architecture [figure; label: host memory]

28 NVIDIA Hardware Architecture [figure; label: SM]

29 Specifications of a Device For more details – deviceQuery in CUDA SDK – Appendix F in Programming Guide 4.0

    Specification        Compute Capability 1.3   Compute Capability 2.0
    Warp size            32                       32
    Max threads/block    512                      1024
    Max blocks/grid      65535                    65535
    Shared memory        16 KB/SM                 48 KB/SM

30 Demo deviceQuery – Show hardware specifications in details

31 Memory Optimizations Reduce the time of memory transfer between host and device – Use asynchronous memory transfer (CUDA streams) – Use zero copy Reduce the number of transactions between on-chip and off-chip memory – Memory coalescing Avoid bank conflicts in shared memory

32 Reduce Time of Host-Device Memory Transfer Regular (synchronous) memory transfer

33 Reduce Time of Host-Device Memory Transfer CUDA streams – Allow overlap between kernel execution and memory copies

34 CUDA Streams Example
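The code for this slide is not reproduced in the transcript. A minimal sketch of the usual pattern (hA, dA, N, and myKernel are placeholders; hA must be page-locked memory from cudaMallocHost, and N is assumed to split evenly into chunks that are multiples of the block size): the work is divided into chunks so the copy for one chunk can overlap the kernel for another.

    const int nStreams = 2;
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; s++)
        cudaStreamCreate(&streams[s]);

    int chunk = N / nStreams;
    for (int s = 0; s < nStreams; s++) {
        int off = s * chunk;
        cudaMemcpyAsync(dA + off, hA + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);       // async copy in stream s
        myKernel<<<chunk / 256, 256, 0, streams[s]>>>(dA + off, chunk);
        cudaMemcpyAsync(hA + off, dA + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                                       // wait for all streams
    for (int s = 0; s < nStreams; s++)
        cudaStreamDestroy(streams[s]);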

35

36 GPU Timers CUDA Events – An API in the CUDA runtime – Timestamps are recorded on the GPU clock – Accurate for timing kernel executions CUDA timer calls – Library routines provided in the CUDA SDK

37 CUDA Events Example
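The code for this slide is also not reproduced; a minimal sketch of timing a kernel with CUDA events (myKernel, blocks, threads, dA, and n are placeholders):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                  // record in the default stream
    myKernel<<<blocks, threads>>>(dA, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                 // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);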

38 Demo simpleStreams

39 Reduce Time of Host-Device Memory Transfer Zero copy – Allow device pointers to access page-locked host memory directly – Page-locked host memory is allocated by cudaHostAlloc()
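A sketch of the zero-copy pattern (myKernel, blocks, threads, and N are placeholders; mapping must be enabled with cudaSetDeviceFlags before the CUDA context is created):

    cudaSetDeviceFlags(cudaDeviceMapHost);                                   // enable mapped pinned memory

    float *hData, *dData;
    cudaHostAlloc((void**)&hData, N * sizeof(float), cudaHostAllocMapped);   // page-locked, mapped
    cudaHostGetDevicePointer((void**)&dData, hData, 0);                      // device view of hData

    myKernel<<<blocks, threads>>>(dData, N);   // kernel accesses host memory directly, no cudaMemcpy
    cudaDeviceSynchronize();                   // results are visible in hData afterwards

    cudaFreeHost(hData);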

40 Demo Zero copy

41 Reduce the Number of On-chip and Off-chip Memory Transactions Threads in a warp access global memory together Memory coalescing – The hardware combines the warp's accesses into as few memory transactions as possible

42 Memory Coalescing Threads in a warp access global memory in a straightforward way (one 4-byte word per thread, at consecutive addresses)

43 Memory Coalescing Memory addresses are aligned in the same segment but the accesses are not sequential

44 Memory Coalescing Memory addresses are not aligned in the same segment
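For example, a sketch contrasting the access pattern that coalesces well with one that does not (kernel and array names are illustrative):

    __global__ void coalesced(const float *in, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];       // neighbouring threads read neighbouring 4-byte words: few transactions
    }

    __global__ void strided(const float *in, float *out, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];       // neighbouring threads are stride words apart: many transactions
    }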

45 Shared Memory 16 banks for compute capability 1.x, 32 banks for compute capability 2.x Can be used to help achieve memory coalescing (stage data in shared memory) Bank conflicts may occur – Two or more threads in a warp access the same bank – In compute capability 1.x, no broadcast – In compute capability 2.x, the same data is broadcast to all threads that request it

46 Bank Conflicts [Figure: two thread-to-bank access patterns, one with no bank conflict and one with a 2-way bank conflict]
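A common illustration of avoiding bank conflicts (not from the original slides): in a shared-memory tile read by column, all threads of a warp hit the same bank; padding each row by one word spreads the accesses across banks. The sketch below assumes a square matrix and a 32x32 thread block.

    #define TILE 32

    __global__ void transposeTile(const float *in, float *out) {
        __shared__ float tile[TILE][TILE + 1];                // +1 padding avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        int width = gridDim.x * TILE;                         // square matrix assumed

        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load into shared memory
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;                  // swap block offsets for the transpose
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column read, conflict-free with padding
    }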

47 Matrix Multiplication Example

48 With the shared-memory (tiled) version, accesses to global memory are reduced – A is read (B.width/BLOCK_SIZE) times from global memory – B is read (A.height/BLOCK_SIZE) times from global memory
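The slide's code is not in the transcript; a shortened sketch of the shared-memory (tiled) multiplication in the style of the CUDA programming guide, assuming square row-major matrices whose width is a multiple of BLOCK_SIZE and a BLOCK_SIZE x BLOCK_SIZE thread block:

    #define BLOCK_SIZE 16

    __global__ void matMulShared(const float *A, const float *B, float *C, int width) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];          // one tile of A per block
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];          // one tile of B per block

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < width / BLOCK_SIZE; t++) {
            // each thread loads one element of the current A tile and one of the B tile
            As[threadIdx.y][threadIdx.x] = A[row * width + t * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * width + col];
            __syncthreads();                                  // tiles fully loaded before use

            for (int k = 0; k < BLOCK_SIZE; k++)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                                  // finish with the tiles before reloading
        }
        C[row * width + col] = sum;
    }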

49 Demo Matrix Multiplication – With and without shared memory – Different block sizes

50 Control Flow if, switch, do, for, while Branch divergence in a warp – Threads in a warp take different execution paths The different paths are serialized Increases the number of instructions executed by that warp

51 Branch Divergence
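Slide 51's figure is not reproduced; a small sketch of the difference (warp size 32, kernel names illustrative): branching on a per-lane condition splits every warp and serializes both paths, while branching on a warp-aligned condition does not diverge.

    __global__ void divergent(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)                  // even/odd lanes differ within each warp
            data[i] *= 2.0f;                       // both paths are executed serially per warp
        else
            data[i] += 1.0f;
    }

    __global__ void warpUniform(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / warpSize) % 2 == 0)     // all 32 lanes of a warp take the same path
            data[i] *= 2.0f;                       // no divergence, no serialization
        else
            data[i] += 1.0f;
    }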

52 Summary 5 steps for CUDA Programming NVIDIA Hardware Architecture – Memory hierarchy: global memory, shared memory, register file – Specifications of a device: block, warp, thread, SM

53 Summary Memory optimization – Reduce the overhead of host-device memory transfer with CUDA streams and zero copy – Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (and shared memory) – Try to avoid bank conflicts in shared memory Control flow – Try to avoid branch divergence within a warp

54 References – NVIDIA CUDA C Programming Guide – NVIDIA CUDA C Best Practices Guide – NVIDIA CUDA Toolkit


