Presentation on theme: "CUDA - 101 Basics. Overview What is CUDA? Data Parallelism Host-Device model Thread execution Matrix-multiplication."— Presentation transcript:

1 CUDA - 101 Basics

2 Overview What is CUDA? Data Parallelism Host-Device model Thread execution Matrix-multiplication

3 GPU revisited!

4 What is CUDA? Compute Unified Device Architecture Programming interface to the GPU Supports C/C++ and Fortran natively – Third-party wrappers for Python, Java, MATLAB, etc. Various libraries available – cuBLAS, cuFFT and many more… – https://developer.nvidia.com/gpu-accelerated-libraries

5 CUDA computing stack


9 Data-parallel programming Inputs i1, i2, i3, …, iN each pass through the same Kernel, producing outputs o1, o2, o3, …, oN

10 Data-parallel algorithm Dot product: C = A · B The Kernel computes each partial product independently (C1 = A1·B1, C2 = A2·B2, C3 = A3·B3, …, CN = AN·BN); the partial products are then summed (+) into the result
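The dot product on the slide can be sketched in plain C; the function name `dot_product` is illustrative, and the comment marks where a GPU would parallelize:

```c
#include <stddef.h>

/* Data-parallel view of a dot product: each element-wise product
   A[i] * B[i] is independent (the "kernel" step); summing the
   partial products is the final reduction step. */
float dot_product(const float *A, const float *B, int n) {
    float dot = 0.0f;
    for (int i = 0; i < n; i++)
        dot += A[i] * B[i];  /* on a GPU, each product gets its own thread */
    return dot;
}
```

For example, the dot product of {1, 2, 3, 4} and {5, 6, 7, 8} is 5 + 12 + 21 + 32 = 70.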

11 Host-Device model CPU (Host) GPU (Device)

12 Threads A thread is an instance of the kernel program – Independent in a data parallel model – Can be executed on a different core Host tells the device to run a kernel program – And how many threads to launch
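A minimal sketch of the idea above (the kernel name `scale` and the variable `d_data` are hypothetical): each launched thread is one instance of the kernel, and the host's launch syntax says how many threads to run.

```cuda
// Each thread is an independent instance of the kernel.
__global__ void scale(float *data, float factor) {
    int i = threadIdx.x;   // this thread's own index
    data[i] *= factor;     // independent work on this thread's element
}

// Host side: <<<blocks, threadsPerBlock>>> tells the device how many
// threads to launch, e.g. one block of N threads:
//   scale<<<1, N>>>(d_data, 2.0f);
```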

13 Matrix-Multiplication

14 CPU-only matrix multiplication Execute this code for all elements of P
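A C sketch of the CPU-only version (the function name `matrix_mul_cpu` is illustrative; matrices are square, width × width, stored as flat arrays):

```c
/* CPU-only matrix multiplication: P = M * N. The two outer loops
   visit every element of P; the inner loop computes one dot product
   of M's row and N's column. */
void matrix_mul_cpu(const float *M, const float *N, float *P, int width) {
    for (int row = 0; row < width; row++) {
        for (int col = 0; col < width; col++) {
            float sum = 0.0f;
            for (int k = 0; k < width; k++)
                sum += M[row * width + k] * N[k * width + col];
            P[row * width + col] = sum;
        }
    }
}
```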

15 Memory indexing in C (and CUDA) C arrays are row-major, so M(row, col) = M[row * width + col]
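The indexing rule as a tiny helper (the name `flat_index` is illustrative):

```c
/* Row-major layout: element (row, col) of a width-wide matrix
   sits at flat offset row * width + col. */
int flat_index(int row, int col, int width) {
    return row * width + col;
}
```

So in a 4-wide matrix, (row 1, col 2) lands at offset 1 * 4 + 2 = 6.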

16 CUDA version - I

17 CUDA program flow Allocate input and output memory on host – Do the same for device Transfer input data from host -> device Launch kernel on device Transfer output data from device -> host

18 Allocating device memory The host tells the device when to allocate and free device memory Functions called from the host program – cudaMalloc(address of device pointer, size in bytes) – cudaFree(device pointer)
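A sketch of the two calls (`d_M` and `width` are illustrative names; note that cudaMalloc takes the *address* of the device pointer):

```cuda
float *d_M = NULL;
size_t bytes = width * width * sizeof(float);

cudaMalloc((void **)&d_M, bytes);   // allocate matrix M on the device
/* ... launch kernels that use d_M ... */
cudaFree(d_M);                      // free device memory when done
```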

19 Transfer data to/from device Again, the host tells the device when to transfer data cudaMemcpy(destination, source, size in bytes, direction flag)
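A sketch of both directions (`h_M`, `d_M`, `h_P`, `d_P`, `bytes` are illustrative names; the last argument selects the direction):

```cuda
cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice);   // input:  host -> device
cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);   // output: device -> host
```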

20 CUDA version - 2 Host memory Device memory Allocate matrix M on device Transfer M from host -> device Allocate matrix N on device Transfer N from host -> device Allocate matrix P on device Execute kernel on device Transfer P from device -> host Free device memory for M, N and P
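The steps above can be sketched as host code (a sketch only: `h_M`, `h_N`, `h_P` are host arrays assumed already allocated and filled, `MatrixMulKernel` is the kernel defined later, and the single-block launch assumes width × width fits within the device's thread-per-block limit):

```cuda
size_t bytes = width * width * sizeof(float);
float *d_M, *d_N, *d_P;

cudaMalloc((void **)&d_M, bytes);                      // allocate M on device
cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice);   // transfer M host -> device
cudaMalloc((void **)&d_N, bytes);                      // allocate N on device
cudaMemcpy(d_N, h_N, bytes, cudaMemcpyHostToDevice);   // transfer N host -> device
cudaMalloc((void **)&d_P, bytes);                      // allocate P on device

MatrixMulKernel<<<1, dim3(width, width)>>>(d_M, d_N, d_P, width);

cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);   // transfer P device -> host
cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);           // free device memory
```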

21 Matrix Multiplication Kernel Kernel specifies the function to be executed on Device Parameters = Device memories, width Thread = Each element of output matrix P Dot product of M’s row and N’s column Write dot product at current location
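A sketch of the kernel described above, one thread per element of P (assumes a single-block launch so threadIdx alone identifies the element; the kernel name is illustrative):

```cuda
__global__ void MatrixMulKernel(const float *M, const float *N,
                                float *P, int width) {
    int col = threadIdx.x;   // this thread's output column
    int row = threadIdx.y;   // this thread's output row

    float sum = 0.0f;
    for (int k = 0; k < width; k++)                       // dot product of
        sum += M[row * width + k] * N[k * width + col];   // M's row, N's column

    P[row * width + col] = sum;   // write result at current location
}
```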

22 Extensions: Function qualifiers (__global__, __device__, __host__)

23 Extensions: Thread indexing All threads execute the same code – But each needs to work on separate data threadIdx.x & threadIdx.y – These variables automatically receive the corresponding values for their threads

24 Thread Grid Represents group of all threads to be executed for a particular kernel Two level hierarchy – Grid is composed of Blocks – Each Block is composed of threads
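The two-level hierarchy can be sketched as a launch configuration (the grid and block sizes are illustrative numbers; inside the kernel, block and thread indices combine into a unique global position):

```cuda
dim3 dimGrid(4, 4);      // grid of 4 x 4 blocks
dim3 dimBlock(16, 16);   // each block holds 16 x 16 threads
// kernel<<<dimGrid, dimBlock>>>(...);

// Inside the kernel, each thread computes its global position:
//   int col = blockIdx.x * blockDim.x + threadIdx.x;
//   int row = blockIdx.y * blockDim.y + threadIdx.y;
```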

25 Thread Grid Threads indexed (x, y) run from (0, 0), (1, 0), (2, 0), …, (width-1, 0) across the first row, then (0, 1), …, (width-1, 1), (0, 2), …, down to (0, width-1), …, (width-1, width-1)

26 Conclusion Sample code and tutorials CUDA nodes? Programming guide – http://docs.nvidia.com/cuda/cuda-c-programming-guide/ SDK – https://developer.nvidia.com/cuda-downloads – Available for Windows, Mac and Linux – Lots of sample programs

27 QUESTIONS?

