

1 Training Program on GPU Programming with CUDA 31st July, 7th Aug, 14th Aug 2011 CUDA Teaching Center @ UoM

2 Training Program on GPU Programming with CUDA Sanath Jayasena, CUDA Teaching Center @ UoM Day 1, Session 2: CUDA Programming Model, CUDA Threads

3 Outline for Day 1 Session 2: CUDA Programming Model, CUDA Threads – Data Parallelism – CUDA Program Structure – Memory Model & Data Transfer (brief) – Kernel Functions & Threading – (Discussion with example: matrix multiplication)

4 Data Parallelism – A problem/program property – Many arithmetic operations can be safely performed on the data structures simultaneously – Example: matrix multiplication (next slide) CUDA devices can exploit data parallelism to accelerate the execution of applications

5 Example: Matrix Multiplication [Figure: input matrices M and N and product matrix P, each of size width × width] P = M · N. Each element of P is computed as the dot product between a row of M and a column of N. All elements of P can be computed independently and simultaneously.
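For comparison, the same computation on the CPU is three nested loops; a minimal sketch (the function name MatrixMulOnHost is illustrative, not from the slides):

// Sequential matrix multiplication on the CPU (illustrative sketch).
// M, N and P are square width x width matrices stored row-major in 1-D arrays.
void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    for (int row = 0; row < width; ++row) {
        for (int col = 0; col < width; ++col) {
            float sum = 0;
            for (int k = 0; k < width; ++k)
                sum += M[row * width + k] * N[k * width + col];  // dot product
            P[row * width + col] = sum;                          // one element of P
        }
    }
}

The key observation for CUDA: each of the width × width iterations of the two outer loops is independent, so each can become its own thread.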

6 CUDA Program Structure A CUDA program consists of one or more phases executed on either the host (CPU) or a device (GPU), supplied as a single source code. Little or no data parallelism → host code – ANSI C, compiled with a standard compiler. Significant data parallelism → device code – ANSI C extended with keywords to specify kernels and data structures. The NVIDIA C compiler (nvcc) separates the two and compiles each for its target processor.
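A minimal sketch of such a single-source program (this trivial kernel and its names are illustrative, not part of the training example): main() is plain ANSI C host code, while the __global__ qualifier marks device code that nvcc compiles for the GPU.

#include <stdio.h>

// Device code: a trivial kernel, compiled by nvcc for the GPU
__global__ void AddOneKernel(float* data)
{
    data[threadIdx.x] += 1.0f;   // each thread updates one element
}

// Host code: plain ANSI C, compiled with the standard host compiler
int main(void)
{
    float h[4] = {0, 1, 2, 3};
    float* d;
    cudaMalloc((void**)&d, sizeof(h));                     // device memory
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);   // host -> device
    AddOneKernel<<<1, 4>>>(d);                             // device phase: 4 threads
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d);
    printf("%f %f %f %f\n", h[0], h[1], h[2], h[3]);
    return 0;
}

The memory management and data transfer calls used here are introduced on slides 10-14.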

7 Execution of a CUDA Program [Figure: serial host code interleaved with two kernel invocations, each spawning a grid of device threads]

8 Execution of a CUDA Program Execution starts on the host (CPU). When a kernel is invoked, execution moves to the device (GPU) – A large number of threads is generated – Grid: the collection of all threads generated by a kernel – (The previous slide shows two grids of threads) Once all threads in a grid complete execution, the grid terminates and execution continues on the host.

9 Example: Matrix Multiplication A simple CUDA host code skeleton for matrix multiplication:

int main(void)
{
    // 1. Allocate and initialize matrices M, N, P
    //    I/O to read the input matrices M and N
    ...
    // 2. Compute M * N on the device
    MatrixMulOnDevice(M, N, P, width);
    // 3. I/O to write the output matrix P
    //    Free matrices M, N, P
    ...
    return 0;
}

10 CUDA Device Memory Model The host and devices have separate memory spaces – E.g., hardware cards with their own DRAM To execute a kernel on a device – Need to allocate memory on the device – Transfer data: host memory → device memory After device execution – Transfer results: device memory → host memory – Free device memory that is no longer needed

11 CUDA Device Memory Model [Figure: host memory and device memory as separate spaces, with data transfers between them]

12 CUDA API: Memory Mgt. [Table: cudaMalloc(void** devPtr, size_t size) – allocates an object in device global memory; cudaFree(void* devPtr) – frees an object from device global memory]

13 CUDA API: Memory Mgt. Example

float *Md;                                 // pointer to device memory
int size = Width * Width * sizeof(float);  // size of the matrix in bytes
cudaMalloc((void**)&Md, size);             // allocate device global memory
...
cudaFree(Md);                              // free it when no longer needed

14 CUDA API: Data Transfer [Table: cudaMemcpy(dst, src, count, kind) – memory data transfer; takes a destination pointer, a source pointer, the number of bytes, and a transfer type: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice]
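Continuing the example from the previous slide (a sketch; M, P, Pd and size as in the surrounding slides):

// Copy the input matrix M from host memory to device memory (Md)
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
...
// Copy the result matrix from device memory (Pd) back to host memory (P)
cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);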

15 Example: Matrix Multiplication (host side, using the memory management and data transfer APIs above)
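The slide's code did not survive in the transcript; a sketch of the host-side function, consistent with the skeleton on slide 9 and the APIs on slides 12-14 (the single-block execution configuration matches slide 24):

void MatrixMulOnDevice(float* M, float* N, float* P, int width)
{
    int size = width * width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate device memory and copy the input matrices to the device
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Pd, size);   // output matrix, written by the kernel

    // 2. Launch the kernel (one block of width x width threads; see slide 24)
    dim3 dimBlock(width, width, 1);
    dim3 dimGrid(1, 1, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

    // 3. Copy the result back and free device memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}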

16 Kernel Functions & Threading A kernel function specifies the code to be executed by all threads of a parallel phase – All threads of a parallel phase execute the same code → single-program multiple-data (SPMD), a popular programming style for parallel computing We need a mechanism to – Allow threads to distinguish themselves – Direct them to the specific parts of the data they are supposed to work on

17 Kernel Functions & Threading The keywords threadIdx.x and threadIdx.y – The thread indices of a thread – Allow a thread to identify itself at runtime (by accessing hardware registers associated with it) A thread can be referred to as Thread(threadIdx.x, threadIdx.y) Thread indices reflect a multi-dimensional organization of threads

18 Example: Matrix Multiplication Kernel (see the next slide for more details on accessing the relevant data)
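The kernel code itself is not in the transcript; based on slide 19's description and the Kirk & Hwu text this session follows, it was likely close to this sketch:

// Each thread computes one element of Pd, identified by (threadIdx.x, threadIdx.y)
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;   // column index of this thread's Pd element
    int ty = threadIdx.y;   // row index of this thread's Pd element

    float Pvalue = 0;
    for (int k = 0; k < width; ++k)   // dot product: row ty of Md, column tx of Nd
        Pvalue += Md[ty * width + k] * Nd[k * width + tx];

    Pd[ty * width + tx] = Pvalue;     // write the result element
}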

19 Thread Indices & Accessing Data Relevant to a Thread [Figure: matrices Md, Nd and Pd with a thread's coordinates (tx, ty); Pd laid out in memory as a 1-D array, row 0 followed by row 1 and so on, so a thread's element is at offset ty * width + tx] Each thread uses tx and ty to identify the relevant row of Md, the relevant column of Nd, and its element of Pd in the for loop. E.g., Thread(2,3) will perform the dot product between row 2 of Md and column 3 of Nd and write the result into element (2,3) of Pd.

20 Threading & Grids When a kernel is invoked/launched, it is executed as a grid of parallel threads A CUDA thread grid can have millions of lightweight GPU threads per kernel invocation – To fully utilize the hardware → enough threads are required → large data parallelism is required Threads in a grid have a two-level hierarchy – A grid consists of 1 or more thread blocks – All blocks in a grid have the same number of threads

21 CUDA Thread Organization [Figure: a grid composed of thread blocks, each block a multi-dimensional array of threads]

22 Threading with Grids & Blocks Each thread block has a unique 2-D coordinate given by the CUDA keywords blockIdx.x and blockIdx.y – All blocks must have the same structure and number of threads Each block is a 3-D array of threads, with a maximum of 1024 threads in total – The coordinates of threads in a block are defined by the indices threadIdx.x, threadIdx.y, threadIdx.z – (Not all applications will use all 3 dimensions)
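A thread can combine its block and thread indices into a global position; a sketch (blockDim is the CUDA built-in holding the block's dimensions; this multi-block variant of the kernel is not in the slides):

// Multi-block version: each thread derives a global (row, col) in Pd
__global__ void MatrixMulKernelMultiBlock(float* Md, float* Nd, float* Pd, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index

    if (row < width && col < width) {   // guard threads outside the matrix
        float Pvalue = 0;
        for (int k = 0; k < width; ++k)
            Pvalue += Md[row * width + k] * Nd[k * width + col];
        Pd[row * width + col] = Pvalue;
    }
}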

23 Our Example: Matrix Multiplication The kernel is shown 5 slides before (slide 18) – It uses only one thread block – The block is organized as a 2-D array of threads The code can compute a product matrix Pd of only up to 1024 elements – As a block can have a maximum of 1024 threads – Each thread computes one element of Pd – Is this sufficient/acceptable?

24 Our Example: Matrix Multiplication When the host code invokes the kernel, the grid and block dimensions are set by passing them as parameters Example

// Set up the execution configuration
dim3 dimBlock(16, 16, 1);   // Width = 16, as an example
dim3 dimGrid(1, 1, 1);      // last 1 ignored
// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, 16);
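If one block is not sufficient (slide 23's question), the same mechanism scales to a multi-block grid; a sketch, assuming the multi-block kernel sketched after slide 22 and rounding width up to a whole number of 16x16 blocks:

dim3 dimBlock(16, 16, 1);
dim3 dimGrid((width + 15) / 16, (width + 15) / 16, 1);  // enough blocks to cover Pd
MatrixMulKernelMultiBlock<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);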

25 Here is an Exercise… Implement matrix multiplication – Execute it with different matrix dimensions using (a) CPU only, (b) GPUs and (c) GPUs with different grid/block organizations Fill in a table like the following:

Dimensions (M, N)        | CPU time (s) | GPU time (s) | Speedup
[400,800], [400, 400]    |              |              |
[800,1600], [800, 800]   |              |              |
....                     |              |              |
[2400,4800], [2400, 4800]|              |              |
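For the GPU timings, CUDA events are one standard option; a minimal sketch (the kernel launch is the one from slide 24):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // wait until the kernel has finished

float ms = 0;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);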

26 Conclusion We discussed the CUDA programming model and CUDA thread basics – Data Parallelism – CUDA Program Structure – Memory Model & Data Transfer (briefly) – Kernel Functions & Threading – (Discussion with example: matrix multiplication)

27 References for this Session Chapter 2 of: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010. Chapters 4-5 of: J. Sanders and E. Kandrot, CUDA by Example, Addison-Wesley, 2010. Chapter 2 of: NVIDIA CUDA C Programming Guide, v3.2/4.0, NVIDIA Corp., 2010-2011.

