
1 GPU Architecture and Programming

2 GPU vs CPU

3 GPU Architecture GPUs (Graphics Processing Units) were originally designed as graphics accelerators, used for real-time graphics rendering. Starting in the late 1990s, the hardware became increasingly programmable, culminating in NVIDIA's first GPU in 1999.

4 CPU + GPU is a powerful combination
CPUs consist of a few cores optimized for serial processing, while GPUs consist of thousands of smaller, more efficient cores designed for parallel performance. The serial portions of the code run on the CPU, while the parallel portions run on the GPU.

5 Architecture of GPU NVIDIA GPUs have a number of multiprocessors, each of which executes in parallel with the others.
- The high-end Tesla accelerators have 30 multiprocessors; the high-end Fermi has 16 multiprocessors.
- Each multiprocessor has a group of stream processors. On Tesla, each multiprocessor has one group of 8 stream processors (cores); on Fermi, each multiprocessor has two groups of 16 stream processors (cores). So the high-end Tesla accelerators have 30 x 8 = 240 cores; the high-end Fermi has 16 x 2 x 16 = 512 cores.
- Each core has integer and single-precision floating-point functional units; a shared special function unit in each multiprocessor handles double-precision operations.
- Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion: all cores in the same group execute the same instruction at the same time, much like classical SIMD processors. 32 threads form one execution unit, called a warp, and code is executed warp by warp.
- On Tesla, the 8 cores in a group are quad-pumped to execute one instruction for an entire warp (32 threads) in four clock cycles. A Fermi multiprocessor double-pumps each group of 16 cores to execute one instruction for each of two warps in two clock cycles, for integer or single-precision floating point. For double-precision instructions, a Fermi multiprocessor combines the two groups of cores to look like a single 16-core double-precision multiprocessor.
(The point is that a GPU can execute many threads in parallel.)

6 CUDA Programming CUDA (Compute Unified Device Architecture) is a parallel programming platform created by NVIDIA based on its GPUs. By using CUDA, you can write programs that directly access the GPU. The CUDA platform is accessible to programmers via CUDA libraries and extensions to programming languages like C, C++ and Fortran.
- C/C++ programmers use "CUDA C/C++", compiled with the nvcc compiler.
- Fortran programmers can use CUDA Fortran, compiled with the PGI CUDA Fortran compiler.

7 Terminology: Host: The CPU and its memory (host memory)
Device: The GPU and its memory (device memory)

8 Programming Paradigm Each parallel function of the application is executed as a kernel on the GPU.

9 Programming Flow
- Copy input data from CPU memory to GPU memory.
- Load the GPU program and execute it.
- Copy the results from GPU memory back to CPU memory.
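The slides' code is not reproduced in this transcript, so here is a rough sketch of this flow in CUDA C (the kernel name mykernel, the buffer names, and the per-thread work are illustrative assumptions, not taken from the slides):

__global__ void mykernel(int *data) {
    int i = threadIdx.x;      // this thread's index within its (only) block
    data[i] = 2 * data[i];    // example per-thread work
}

int main(void) {
    const int N = 256;
    int h_data[N];            // host (CPU) buffer
    int *d_data;              // device (GPU) buffer

    for (int i = 0; i < N; i++) h_data[i] = i;

    cudaMalloc((void **)&d_data, N * sizeof(int));                         // allocate device memory
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);   // 1. copy input to GPU
    mykernel<<<1, N>>>(d_data);                                            // 2. launch the GPU program
    cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);   // 3. copy results back
    cudaFree(d_data);
    return 0;
}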

10 Each parallel function of the application is executed as a kernel
That means GPUs are programmed as a sequence of kernels; typically, each kernel completes execution before the next kernel begins. Fermi has some support for executing multiple independent kernels simultaneously, but most kernels are large enough to fill the entire machine on their own.

11 The host program launches a sequence of kernels.
The execution of a kernel is divided into the execution of many threads on the GPU. Let's see how the threads of a kernel are organized. Overall:
- Threads are grouped into blocks, and multiple blocks form a grid. Each thread has a unique local index in its block, and each block has a unique index in the grid. Kernels can use these indices to compute array subscripts.
- Threads in a single block are executed on a single multiprocessor; a warp is always a subset of threads from a single block.
- There is a hard limit on the size of a thread block: 512 threads (16 warps) on Tesla, 1024 threads (32 warps) on Fermi.
- A Tesla multiprocessor can have 1024 threads (32 warps) simultaneously active, from up to 8 thread blocks; a Fermi multiprocessor can have 48 simultaneously active warps (1536 threads), from up to 8 thread blocks.
A sketch of how a launch specifies this thread organization follows.
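As an illustration of this hierarchy (the grid and block dimensions and the kernel name below are made up for this sketch, not taken from the slides), the host program specifies the organization at launch time with dim3 values:

__global__ void mykernel(void) {
    // Each thread can read its own coordinates:
    //   blockIdx.x,  blockIdx.y  : index of this thread's block within the grid
    //   threadIdx.x, threadIdx.y : index of this thread within its block
}

int main(void) {
    dim3 grid(4, 2);    // a 4 x 2 grid of blocks
    dim3 block(8, 8);   // each block holds 8 x 8 = 64 threads, well under the limits above
    mykernel<<<grid, block>>>();
    cudaDeviceSynchronize();   // wait for the kernel to finish before exiting
    return 0;
}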

12 Hello World! Example __global__ is a CUDA C/C++ keyword meaning:
- mykernel() will be executed on the device
- mykernel() will be called from the host
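The slide's code survives only as an image reference, so here is a minimal sketch of the example it describes (an empty kernel launched from the host):

#include <stdio.h>

__global__ void mykernel(void) {
    // executed on the device; does nothing in this minimal example
}

int main(void) {
    mykernel<<<1, 1>>>();      // called from the host: 1 block containing 1 thread
    printf("Hello World!\n");
    return 0;
}

Such a file would be compiled with nvcc, e.g. nvcc hello.cu -o hello.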

13 Addition Example Since add() runs on the device, the pointers a, b, and c must point to device memory.
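The code itself is not in the transcript; a sketch of the kernel this slide describes, in the style of the standard CUDA introduction, could be:

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;   // a, b, and c must point to device memory
}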

14 CUDA API for managing device memory
cudaMalloc(), cudaFree(), cudaMemcpy()
Similar to the C equivalents malloc(), free(), memcpy()
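A sketch of how these calls fit together for the addition example, assuming the add() kernel sketched above (variable names are illustrative):

int main(void) {
    int a = 2, b = 7, c;                  // host copies of the inputs and result
    int *d_a, *d_b, *d_c;                 // device pointers
    int size = sizeof(int);

    cudaMalloc((void **)&d_a, size);      // allocate space on the device
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);   // copy inputs to the device
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    add<<<1, 1>>>(d_a, d_b, d_c);         // launch add() on the device

    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);   // copy the result back

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);         // free device memory
    return 0;
}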

15 Vector Addition Example
Kernel Function: The execution of the kernel on the GPU is actually the execution of many threads; the kernel statement specifies what each thread needs to do. Each thread needs an index into the data it will manipulate. Every thread has a globally unique thread id, so we can map that id to the index of the data element it handles.
For this case we need N threads (assuming the size of the array is N). One way to organize these N threads is to create N blocks in one dimension, with each block containing 1 thread. The id of each block is given by the built-in variable blockIdx: (blockIdx.x, blockIdx.y). Here the ids of the blocks are (0, 0), (1, 0), (2, 0), …, (N-1, 0). The id of a thread within its block is given by the built-in variable threadIdx: (threadIdx.x, threadIdx.y, threadIdx.z). Here each block has a single thread, whose index within the block is (0, 0, 0).
Each thread's globally unique id can therefore be computed from its block id and its thread id within the block: blockIdx.x + threadIdx.x, which equals blockIdx.x since threadIdx.x is always 0 here. We then map this global thread id to the index of the data it processes, as sketched below.
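A sketch of such a kernel (the standard CUDA vector-addition form; the slide's exact code is not reproduced in the transcript):

__global__ void add(int *a, int *b, int *c) {
    int i = blockIdx.x;      // one thread per block, so blockIdx.x is the global thread id
    c[i] = a[i] + b[i];      // each thread adds one pair of elements
}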

16 main: All threads are grouped into N blocks, and each block contains 1 thread. A sketch of this main follows.
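The following is a sketch only, assuming the add() kernel above and arrays of N ints (input initialization is left as a comment; allocation and copying follow the device-memory API shown earlier):

#include <stdlib.h>

#define N 512

int main(void) {
    int *a, *b, *c;           // host arrays
    int *d_a, *d_b, *d_c;     // device arrays
    int size = N * sizeof(int);

    a = (int *)malloc(size);  b = (int *)malloc(size);  c = (int *)malloc(size);
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // (fill a and b with input values here)
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    add<<<N, 1>>>(d_a, d_b, d_c);     // N blocks, each containing 1 thread

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}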

17 Alternative 1: Alternatively, with N threads, we can have one block that contains all N threads. This block's id is (0, 0). Within this block, the thread ids are (0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), …, (N-1, 0, 0), so the global ids of the threads are 0, 1, 2, …, N-1. We then map each global thread id to the index of the data it will manipulate, as sketched below.
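A sketch of this alternative (again in the standard tutorial form, not copied from the slide):

__global__ void add(int *a, int *b, int *c) {
    int i = threadIdx.x;     // a single block, so threadIdx.x is the global thread id
    c[i] = a[i] + b[i];
}

The launch in main then becomes add<<<1, N>>>(d_a, d_b, d_c); i.e. 1 block containing N threads.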

18 Alternative 2: Assume we have multiple blocks and each block has multiple threads. Let M be the number of threads in a block; the total number of blocks we need is then N/M. A thread's global id is
int globalThreadId = threadIdx.x + blockIdx.x * M;   // M is the number of threads in a block
or, equivalently, since the built-in variable blockDim.x holds the number of threads per block,
int globalThreadId = threadIdx.x + blockIdx.x * blockDim.x;

19 So the kernel becomes:
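A sketch of the kernel with this indexing (the standard form; the slide's exact code is not in the transcript):

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;   // global thread id
    c[index] = a[index] + b[index];
}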

20 The main becomes:
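A sketch of the corresponding launch, assuming for now that N is an exact multiple of the block size (N and THREADS_PER_BLOCK are illustrative names):

#define N                  2048
#define THREADS_PER_BLOCK   512

// ... allocate and copy d_a, d_b, d_c as in the earlier main, then:
add<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);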

21 Handling Arbitrary Vector Sizes
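The slide's content survives only as an image reference; the usual way to handle a vector size N that is not a multiple of the block size (a sketch in the standard CUDA tutorial style) is to round the block count up and guard each thread's access:

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                   // threads past the end of the arrays do nothing
        c[index] = a[index] + b[index];
}

// launch with enough blocks to cover all N elements (M = threads per block):
// add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);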

