
1 Introduction to CUDA Li Sung-Chi Taiwan Evolutionary Intelligence Laboratory 2016/12/14 Group Meeting Presentation

2 Outline A brief introduction to what CUDA is
What CUDA code looks like; how CUDA runs; CUDA in more detail; optimization

3 What is CUDA An architecture for programming GPUs
First released by NVIDIA in 2007, as a C/C++ extension. There were other GPU programming frameworks before CUDA, e.g. Cg and Brook

4 Applications Bioinformatics, computational finance, deep learning, etc.

5 GPGPU GPUs have become more powerful: General-Purpose GPU (GPGPU) computing
Growing computing power and (on-chip) memory bandwidth. Thousands of cores inside run a huge number of threads concurrently. A single core is less powerful than a CPU core, but two hands are better than one

6 Hardware Requirements To run CUDA, your GPU needs to meet some requirements. Architecture compatibility: AMD GPUs are not compatible; AMD instead released the Boltzmann Initiative. NVIDIA cards need the Fermi architecture or later; later generations support more CUDA features

7 Terminology Host: the CPU and its memory
Device: the GPU and its memory

8 Heterogeneous Computing
A program alternates between serial code (run on the host) and parallel code (run on the device): serial code, then parallel code, then serial code again

9 Processing Flow 1. Copy data from CPU memory to GPU memory
2. Execute hundreds of thousands of threads on the GPU. 3. Copy the result back to the CPU. Kernel: the code the GPU will execute

10 Hello World! Our first simple example: simple_add
Parallel code: kernel
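The code on this slide is not included in the transcript; a minimal sketch of what a simple_add kernel might look like (the variable names are assumptions):

    // Kernel: runs on the GPU; a, b, c point to device memory
    __global__ void simple_add(int *a, int *b, int *c) {
        *c = *a + *b;
    }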

11 Serial code: setup
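A sketch of the host-side setup under the same assumptions: allocate device memory and copy the inputs over.

    int a = 2, b = 7, c;                     // host copies
    int *d_a, *d_b, *d_c;                    // device pointers
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));
    cudaMalloc((void **)&d_c, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);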

12 Serial code: launch the kernel, then collect the result
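A sketch of the launch and collection step, continuing the same example:

    simple_add<<<1, 1>>>(d_a, d_b, d_c);                       // launch 1 block of 1 thread
    cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);  // copy the result back
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);               // release device memory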

13 d_c is just a pointer to device memory that holds the result of the add
If we need the value of d_c, we copy it back with cudaMemcpy and then continue our serial (CPU) code. If we dereference *d_c directly on the host, because it holds a device memory address, we'll get a crash (e.g. a segmentation fault)

14 To launch a kernel, we call a __global__ function with the <<<...>>> syntax: simple_add<<<1, 1>>>(d_a, d_b, d_c);
Here <<<1, 1>>> decides how many threads to launch, simple_add is the kernel function, and (d_a, d_b, d_c) are the function parameters

15 Thread, Block, Grid We saw that <<<1, 1>>> decides to launch 1 thread, but what does <<<1, 1>>> mean? CUDA uses a hierarchical structure to manage threads: grid, block, thread

16 (Diagram: a grid contains blocks (Block 0, Block 1, Block 2, ...), and each block contains threads (thread 0, thread 1, thread 2, ...).)

17 A grid can consist of blocks arranged in up to 3 dimensions
A block can consist of threads arranged in up to 3 dimensions

18 <<<blocks per grid, threads per block>>>
<<<1, 1>>>: a grid with 1 block inside, and the block consists of 1 thread. Total threads: 1. <<<2, 3>>>: a grid with 2 blocks inside, and each block consists of 3 threads. Total threads: 6. Why this kind of management? We'll talk about it later
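Blocks per grid and threads per block can also be given in up to three dimensions with dim3; a small illustrative sketch (my_kernel is a hypothetical kernel):

    __global__ void my_kernel() { /* ... */ }

    int main() {
        dim3 blocks(4, 2);                  // 4 x 2 = 8 blocks per grid
        dim3 threads(8, 8, 2);              // 8 x 8 x 2 = 128 threads per block
        my_kernel<<<blocks, threads>>>();   // 8 * 128 = 1024 threads in total
        cudaDeviceSynchronize();
        return 0;
    }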

19 vector_add The simple_add example is boring, so let's do a more interesting example. threadIdx.x is the x index of the thread within its block

20 blockIdx.x is the x index of the block
blockDim.x is the number of threads in the x dimension (the block width). If we launch vector_add<<<2, 3>>>: for the first thread (block 0, thread 0), idx = 0 + 0 * 3 = 0; for the fourth thread (block 1, thread 0), idx = 0 + 1 * 3 = 3
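The vector_add code itself is not in the transcript; a minimal sketch of what it likely looks like (the bounds check and parameter names are assumptions):

    __global__ void vector_add(int *a, int *b, int *c, int n) {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;   // global thread index
        if (idx < n)                                       // guard against extra threads
            c[idx] = a[idx] + b[idx];
    }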

21 We have added the vectors concurrently!
Remember to cudaMalloc the right size, and also cudaMemcpy with the right size

22 Memory Types CUDA has 5 types of memory, and each has different properties. Key properties: size, access speed, and read/write vs. read-only

23 (Diagram: overview of the CUDA memory types.)

24 Memory Types Global memory: memory allocated with cudaMalloc; large but slow (cached). Texture memory: read-only, with a cache optimized for 2D access patterns. Constant memory: slow, but with an 8 KB cache

25 Memory Types Global memory is accessible to all threads once the kernel call passes in a pointer to it. Constant memory is accessible to all threads even without passing a pointer to the kernel. Texture memory behaves the same way as constant memory in this respect

26 Memory Types Local memory: private to a thread, but as slow as global memory. Shared memory: roughly 100x faster than global memory, but accessible only to the threads within one block

27 Memory Types Shared memory is very fast, but usually only 48 KB is available per block (the 64 KB of on-chip memory is split between shared memory and L1 cache). In effect, shared memory is like the CPU's L1 cache, but controllable by the user. Each block has its own shared memory; that is one reason why threads are managed in the grid/block way!

28 Shared Memory Example 1D stencil:

29 If we program it this way, it is straightforward, but very slow!
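A sketch of this naive version, assuming RADIUS is a compile-time constant; every thread reads all 2*RADIUS + 1 inputs straight from slow global memory:

    #define RADIUS 3

    __global__ void stencil_1d_naive(int *in, int *out) {
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += in[gindex + offset];   // every read goes to global memory
        out[gindex] = result;
    }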

30 With shared memory: with this little change, each thread avoids roughly 2*RADIUS redundant global memory reads!

31 But actually, the result is wrong…
What if thread 5 has finished its copy and starts computing (tmp[3] + tmp[4] + tmp[5]), but thread 6 has not yet finished copying tmp[5]? __syncthreads() makes all threads in a block synchronize! __syncthreads() is a barrier: it blocks every thread until all threads in the block reach that line

32 Then we have the correct 1D stencil
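A sketch of the shared-memory version with the barrier in place, modeled on the well-known NVIDIA 1D stencil example (BLOCK_SIZE and RADIUS are assumed constants, and the input is assumed to have halo padding):

    #define RADIUS 3
    #define BLOCK_SIZE 256

    __global__ void stencil_1d(int *in, int *out) {
        __shared__ int tmp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x;
        int lindex = threadIdx.x + RADIUS;

        tmp[lindex] = in[gindex];                 // each thread loads one element
        if (threadIdx.x < RADIUS) {               // the first RADIUS threads also load the halo
            tmp[lindex - RADIUS]     = in[gindex - RADIUS];
            tmp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }
        __syncthreads();                          // wait until the whole block has finished loading

        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += tmp[lindex + offset];       // reads now hit fast shared memory
        out[gindex] = result;
    }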

33 Reduction CUDA runs many threads, and much like Hadoop (map and reduce), you can perform a reduction to do: summation, search, etc. Of course, with __syncthreads()
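For example, a block-level sum reduction in shared memory might look like this (a minimal sketch; BLOCK_SIZE is an assumed constant, and each block writes one partial sum that the host or a second kernel then combines):

    #define BLOCK_SIZE 256

    __global__ void block_sum(const int *in, int *partial, int n) {
        __shared__ int buf[BLOCK_SIZE];
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        buf[threadIdx.x] = (idx < n) ? in[idx] : 0;   // load one element per thread
        __syncthreads();

        // Tree reduction: halve the number of active threads each step
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                buf[threadIdx.x] += buf[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            partial[blockIdx.x] = buf[0];             // one partial sum per block
    }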

34 CUDA-GDB A CUDA kernel runs on the GPU, so native system APIs do not apply there (e.g. cout). Although compute capability 2.x and above supports printf, debugging with printf is not realistic. CUDA-GDB is a debugger based on gdb that lets you step through GPU threads much like CPU threads

35 Optimization To make your CUDA program fast, you need to:
Avoid memory copies between CPU and GPU memory; use the cache (shared memory) in your kernel; choose the block count well; align arrays; access memory contiguously; use the CUDA APIs

36 Optimization Avoid memory copies between CPU and GPU memory
Memory copies between CPU and GPU are expensive (though not extremely expensive; you can still use them, but try to avoid them)

37 Optimization Use the cache (shared memory) in your kernel. This can be the key to optimizing a CUDA program, since avoiding memory copies between CPU and GPU is not that hard. In many cases the cache is hard to use well due to the size limit, or it is unclear how to plan the caching

38 Optimization Choose the block count. Whether to use more blocks or more threads per block is hard to decide; it is problem dependent. We need to understand how CUDA grids, blocks, and threads map to real GPU cores

39 In a GPU, the unit of processing is the SP (streaming processor); several SPs plus some other components make up an SM (streaming multiprocessor); several SMs make up a TPC (texture processing cluster). In CUDA, we can roughly say that a grid is processed by the whole GPU, a block is processed by an SM, and a thread is processed by an SP.

40 Every 32 threads form a warp
If you choose a thread count that is not divisible by 32, the remaining threads still form a (partial) warp. An SM schedules one warp at a time, so a warp with fewer than 32 active threads leaves some SPs idle and wastes resources.

41 When a thread is waiting for data, the SM will choose other threads to execute, thus hiding the memory access latency. More threads per block can hide more of this latency, but more threads per block also means less shared memory available per thread. NVIDIA's suggestion is at least 192 threads per block to hide memory access latency

42 Optimization Array alignment. Memory accesses perform better if data items are aligned on 64-byte boundaries. Hence, aligning a 2D array so that each row starts at a 64-byte address will improve performance. But this is difficult for the programmer!

43 We pad some dummy bytes at the end of each row so that the next row starts at an aligned address
The padded row width (in bytes) is called the pitch

44 You can use cudaMallocPitch to let the runtime decide the pitch and allocate memory with better access performance
Then use cudaMemcpy2D with the pitch to copy the data. Disadvantages: harder for the programmer, and it wastes memory
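A sketch of how these two calls fit together, assuming a width x height array of floats on the host:

    #include <vector>

    void copy_2d_to_device(size_t width, size_t height) {
        std::vector<float> h_data(width * height, 1.0f);   // host data, rows packed tightly

        float *d_data;
        size_t pitch;   // row stride in bytes, chosen by the runtime
        cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);

        // Host rows are width * sizeof(float) bytes apart; device rows are pitch bytes apart
        cudaMemcpy2D(d_data, pitch,
                     h_data.data(), width * sizeof(float),
                     width * sizeof(float), height,
                     cudaMemcpyHostToDevice);
        cudaFree(d_data);
    }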

45 Optimization Contiguous memory access. If we access memory in a contiguous pattern, we can improve performance, e.g. memcpy on a contiguous block of memory
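On the GPU the same idea applies per warp: neighbouring threads should read neighbouring addresses so the accesses coalesce into few transactions. An illustrative sketch (the kernel names are hypothetical):

    // Coalesced: thread i reads element i, so neighbouring threads touch neighbouring addresses
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: thread i reads element 32 * i, so each warp scatters across many memory segments
    __global__ void copy_strided(const float *in, float *out, int n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (32 * i < n) out[i] = in[32 * i];
    }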

46 Optimization Use the CUDA APIs. CUDA provides many basic math functions such as sin and log, as well as cuRAND (a random number library), etc. Using these APIs can increase performance
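For instance, device math functions and fast intrinsics can be called directly inside a kernel (a small sketch; __sinf and __expf are CUDA's reduced-precision single-precision intrinsics):

    __global__ void math_demo(float *x, int n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
            x[i] = __sinf(x[i]) + __expf(-x[i]);   // fast single-precision intrinsics
    }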

47 Conclusion The GPU is a powerful computing tool running hundreds of thousands of threads, but programming a GPU is not a simple thing; an incorrect programming pattern can even decrease performance. CUDA is a GPGPU computing model, acting as a computing assistant to the CPU. GPGPU is very powerful, but only in some areas (e.g. scientific applications)

48 References http://neuralnetworksanddeeplearning.com/chap6.html

