Basic CUDA Programming

1 Basic CUDA Programming
Shin-Kai Chen VLSI Signal Processing Laboratory Department of Electronics Engineering National Chiao Tung University

2 What will you learn in this lab?
Concept of multicore accelerator
Multithreaded/multicore programming
Memory optimization

3 Slides Mostly from Prof. Wen-Mei Hwu of UIUC

4 CUDA – Hardware? Software?

5 Host-Device Architecture
CPU (host)
GPU with local DRAM (device)

6 G80 CUDA mode – A Device Example
[Figure: G80 in CUDA mode. Labels: Host, Input Assembler, Thread Execution Manager, Parallel Data Cache, Texture, Load/store, Global Memory]

7 Functional Units in G80 Streaming Multiprocessor (SM)
1 instruction decoder ( 1 instruction / 4 cycles )
8 streaming processors (SP)
Shared memory
[Figure: SM 0 and SM 1, each with SPs, Shared Memory and MT IU, running threads t0 t1 t2 … tm of their blocks]

8 Setup CUDA for Windows

9 CUDA Environment Setup
Get a GPU that supports CUDA
Download CUDA
  CUDA driver
  CUDA toolkit
  CUDA SDK (optional)
Install CUDA
Test CUDA
  Device Query
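As a rough illustration of the "Device Query" test step, the sketch below lists the devices the CUDA runtime can see. It is not the SDK's own device query sample; the helper name count_cuda_devices is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the lab code): returns the number of
// CUDA-capable devices, or -1 if the runtime reports an error.
int count_cuda_devices() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return n;
}

int main() {
    int n = count_cuda_devices();
    if (n < 0) return 1;
    printf("Found %d CUDA device(s)\n", n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        // Name, compute capability and SM count are usually enough to
        // confirm that the driver and toolkit are talking to the card.
        printf("Device %d: %s, compute capability %d.%d, %d multiprocessor(s)\n",
               i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}
```

If this builds with nvcc and reports at least one device, the driver and toolkit are most likely installed correctly.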

10 Setup CUDA for Visual Studio
From scratch
CUDA VS Wizard
Modified from existing project

11 Lab1: First CUDA Program

12 CUDA Computing Model

13 Data Manipulation between Host and Device
cudaError_t cudaMalloc( void** devPtr, size_t count )
  Allocates count bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory
cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind )
  Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst
  kind indicates the type of memory transfer
    cudaMemcpyHostToHost
    cudaMemcpyHostToDevice
    cudaMemcpyDeviceToHost
    cudaMemcpyDeviceToDevice
cudaError_t cudaFree( void* devPtr )
  Frees the memory space pointed to by devPtr
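The three calls are typically used together. The following is a minimal sketch, not the lab's solution: it allocates a device buffer, copies an array to the device and straight back, then frees the buffer. The helper name roundtrip and the array size 8192 used in main are choices made for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 8192  // illustrative array length

// Hypothetical helper: copies `src` to the device and back into `dst`.
// Returns 0 on success, 1 on any CUDA error.
int roundtrip(const int* src, int* dst, size_t n) {
    int* d_buf = NULL;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc((void**)&d_buf, bytes) != cudaSuccess) return 1;

    // Host -> device, then device -> host; `kind` selects the direction.
    cudaError_t err = cudaMemcpy(d_buf, src, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(dst, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);  // always release the device buffer
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    static int a[N], b[N];
    for (int i = 0; i < N; ++i) a[i] = i;
    if (roundtrip(a, b, N) != 0) { printf("CUDA error\n"); return 1; }
    for (int i = 0; i < N; ++i)
        if (a[i] != b[i]) { printf("mismatch at %d\n", i); return 1; }
    printf("round trip OK\n");
    return 0;
}
```

Note that count is in bytes, so the element count is multiplied by sizeof(int) before each call.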

14 Example
Functionality:
Given an integer array A holding 8192 elements
For each element in array A, calculate A[i]^256 and leave the result in B[i]

15 Now, go and finish your first CUDA program !!!

16 Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
Open project with Visual C ( lab1/cuda_lab/cuda_lab.vcproj )
  main.cu
    Random input generation, output validation, result reporting
  device.cu
    Launch GPU kernel, GPU kernel code
  parameter.h
Fill in appropriate APIs
  GPU_kernel() in device.cu

17 Lab2: Make the Parallel Code Faster

18 Parallel Processing in CUDA
Parallel code can be partitioned into blocks and threads
  cuda_kernel<<<nBlk, nTid>>>(…)
Multiple tasks will be initialized, each with a different block id and thread id
The tasks are dynamically scheduled
  Tasks within the same block will be scheduled on the same streaming multiprocessor
Each task takes care of a single data partition according to its block id and thread id
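To make the block id / thread id idea concrete, here is a small sketch in which every task writes its own ids into an array that the host then prints. The kernel name record_ids and the helper run_record_ids are hypothetical and only serve this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each task stores its block id and thread id at its global position.
__global__ void record_ids(int* blk, int* tid) {
    int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
    blk[task_sn] = blockIdx.x;
    tid[task_sn] = threadIdx.x;
}

// Hypothetical helper: launches nBlk * nTid tasks and copies the recorded
// ids back to the host. Returns 0 on success, 1 on any CUDA error.
int run_record_ids(int nBlk, int nTid, int* h_blk, int* h_tid) {
    size_t bytes = (size_t)nBlk * nTid * sizeof(int);
    int *d_blk = NULL, *d_tid = NULL;
    if (cudaMalloc((void**)&d_blk, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&d_tid, bytes) != cudaSuccess) { cudaFree(d_blk); return 1; }

    record_ids<<<nBlk, nTid>>>(d_blk, d_tid);
    cudaError_t err = cudaGetLastError();  // catches invalid launch configurations
    // cudaMemcpy waits for the kernel to finish before copying.
    if (err == cudaSuccess) err = cudaMemcpy(h_blk, d_blk, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) err = cudaMemcpy(h_tid, d_tid, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_blk);
    cudaFree(d_tid);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    const int nBlk = 2, nTid = 4;
    int blk[nBlk * nTid], tid[nBlk * nTid];
    if (run_record_ids(nBlk, nTid, blk, tid) != 0) { printf("CUDA error\n"); return 1; }
    for (int i = 0; i < nBlk * nTid; ++i)
        printf("task %d: block %d, thread %d\n", i, blk[i], tid[i]);
    return 0;
}
```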

19 Locate Data Partition by Built-in Variables
gridDim (x, y)
blockIdx (x, y)
blockDim (x, y, z)
threadIdx (x, y, z)

20 Data Partition for Previous Example
When processing 64 integers: cuda_kernel<<<2, 2>>>(…)
int total_task = gridDim.x * blockDim.x;
int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
int length = SIZE / total_task;
int head = task_sn * length;
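A possible way to use these four values is sketched below: each task walks its own contiguous chunk of the array. The per-element operation B[i] = A[i] + 1 is only a stand-in chosen for this illustration, not the lab's computation, and the helper run_partitioned is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 64

// Contiguous partitioning: task k handles elements [k*length, (k+1)*length).
// The "+ 1" operation is a placeholder (assumption), not the lab's workload.
__global__ void cuda_kernel(const int* A, int* B) {
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
    int length = SIZE / total_task;  // assumes SIZE is divisible by total_task
    int head = task_sn * length;
    for (int i = head; i < head + length; ++i)
        B[i] = A[i] + 1;
}

// Hypothetical helper: runs the kernel on host arrays of SIZE ints.
// Returns 0 on success, 1 on any CUDA error.
int run_partitioned(int nBlk, int nTid, const int* h_A, int* h_B) {
    int *d_A = NULL, *d_B = NULL;
    size_t bytes = SIZE * sizeof(int);
    if (cudaMalloc((void**)&d_A, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&d_B, bytes) != cudaSuccess) { cudaFree(d_A); return 1; }

    cudaError_t err = cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        cuda_kernel<<<nBlk, nTid>>>(d_A, d_B);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(h_B, d_B, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    int A[SIZE], B[SIZE];
    for (int i = 0; i < SIZE; ++i) A[i] = i;
    if (run_partitioned(2, 2, A, B) != 0) { printf("CUDA error\n"); return 1; }
    printf("B[0] = %d, B[%d] = %d\n", B[0], SIZE - 1, B[SIZE - 1]);
    return 0;
}
```

With <<<2, 2>>> there are 4 tasks, so each one covers 16 consecutive elements: task 0 takes 0..15, task 1 takes 16..31, and so on.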

21 Processing Single Data Partition

22 Parallelize Your Program !!!

23 Partition kernel into threads
  Increase nTid from 1 to 512
  Keep nBlk = 1
Group threads into blocks
  Adjust nBlk and see if it helps
Maintain total number of threads below 512, i.e. nBlk * nTid < 512
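One way to "see if it helps" is to time the kernel with CUDA events for each configuration. The sketch below is an assumption about how such a measurement could look, not the lab's own reporting code: the kernel body is a placeholder workload, and time_kernel_ms is a hypothetical helper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 8192

// Placeholder workload (assumption): each task processes a contiguous chunk.
__global__ void cuda_kernel(const int* A, int* B) {
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
    int length = SIZE / total_task;  // assumes SIZE is divisible by total_task
    int head = task_sn * length;
    for (int i = head; i < head + length; ++i)
        B[i] = A[i] + 1;
}

// Hypothetical helper: returns the kernel time in milliseconds for one
// launch configuration, or a negative value on error.
float time_kernel_ms(int nBlk, int nTid) {
    int *d_A = NULL, *d_B = NULL;
    size_t bytes = SIZE * sizeof(int);
    if (cudaMalloc((void**)&d_A, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void**)&d_B, bytes) != cudaSuccess) { cudaFree(d_A); return -1.0f; }
    cudaMemset(d_A, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cuda_kernel<<<nBlk, nTid>>>(d_A, d_B);
    cudaError_t launch_err = cudaGetLastError();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the kernel and `stop` are done

    float ms = -1.0f;
    if (launch_err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_A);
    cudaFree(d_B);
    return ms;
}

int main() {
    // Keep nBlk = 1 and grow nTid, staying below 512 threads in total.
    for (int nTid = 1; nTid < 512; nTid *= 2)
        printf("nBlk = 1, nTid = %3d: %.3f ms\n", nTid, time_kernel_ms(1, nTid));
    return 0;
}
```

Single measurements are noisy; repeating each configuration a few times and comparing the fastest runs usually gives a clearer picture.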

24 Lab3: Resolve Memory Contention

25 Parallel Memory Architecture
Memory is divided into banks to achieve high bandwidth
Each bank can service one address per cycle
Successive 32-bit words are assigned to successive banks
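The word-to-bank mapping can be sketched in a few lines of host code. The bank count of 16 is an assumption made for this example (the slide does not give a number), and bank_of and max_bank_load are hypothetical helper names.

```cuda
#include <cstdio>

#define NUM_BANKS 16  // assumption: the slide does not state the bank count

// Bank holding a given 32-bit word when successive words go to successive banks.
__host__ __device__ int bank_of(int word_index) {
    return word_index % NUM_BANKS;
}

// Hypothetical helper: nTid threads each access word (t * stride) in the same
// cycle. Returns the largest number of accesses landing on a single bank
// (1 means no two threads collide).
int max_bank_load(int nTid, int stride) {
    int load[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < nTid; ++t) {
        int b = bank_of(t * stride);
        if (++load[b] > worst) worst = load[b];
    }
    return worst;
}

int main() {
    // 4 threads, 16 words apart: words 0, 16, 32, 48 all map to bank 0.
    printf("stride 16: %d access(es) on the busiest bank\n", max_bank_load(4, 16));
    // 4 threads, 1 word apart: words 0, 1, 2, 3 map to banks 0, 1, 2, 3.
    printf("stride 1:  %d access(es) on the busiest bank\n", max_bank_load(4, 1));
    return 0;
}
```

Since each bank services one address per cycle, the four accesses in the stride-16 case have to be served one after another, while the stride-1 accesses can proceed together.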

26 Lab2 Review
When processing 64 integers: cuda_kernel<<<1, 4>>>(…)

27 How about Interleave Accessing?
When processing 64 integers: cuda_kernel<<<1, 4>>>(…)

28 Implementation of Interleave Accessing
cuda_kernel<<<1, 4>>>(…)
head = task_sn
stripe = total_task
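With head and stripe defined this way, the per-task loop could look like the sketch below: task k touches elements k, k + total_task, k + 2 * total_task, and so on, so neighbouring tasks access neighbouring words at each step. As before, B[i] = A[i] + 1 is only a placeholder operation and run_interleaved is a hypothetical helper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 64

// Interleaved access: each task starts at its own serial number and
// advances by the total number of tasks.
// The "+ 1" operation is a placeholder (assumption), not the lab's workload.
__global__ void cuda_kernel(const int* A, int* B) {
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
    int head = task_sn;
    int stripe = total_task;
    for (int i = head; i < SIZE; i += stripe)
        B[i] = A[i] + 1;
}

// Hypothetical helper: runs the kernel on host arrays of SIZE ints.
// Returns 0 on success, 1 on any CUDA error.
int run_interleaved(int nBlk, int nTid, const int* h_A, int* h_B) {
    int *d_A = NULL, *d_B = NULL;
    size_t bytes = SIZE * sizeof(int);
    if (cudaMalloc((void**)&d_A, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&d_B, bytes) != cudaSuccess) { cudaFree(d_A); return 1; }

    cudaError_t err = cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        cuda_kernel<<<nBlk, nTid>>>(d_A, d_B);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(h_B, d_B, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    int A[SIZE], B[SIZE];
    for (int i = 0; i < SIZE; ++i) A[i] = i;
    if (run_interleaved(1, 4, A, B) != 0) { printf("CUDA error\n"); return 1; }
    printf("B[0] = %d, B[%d] = %d\n", B[0], SIZE - 1, B[SIZE - 1]);
    return 0;
}
```

A side effect of looping until i reaches SIZE is that the kernel still covers every element when SIZE is not a multiple of the number of tasks.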

29 Improve Your Program !!!

30 Modify original kernel code in interleaving manner
  cuda_kernel() in device.cu
Adjust nBlk and nTid as in Lab2 and examine the effect
  Maintain total number of threads below 512, i.e. nBlk * nTid < 512

31 Thank You http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
Final project issue
  Subject: Porting & optimizing any algorithm on any multi-core
  Demo: 1 week after final ED412
  Group: 1 to 2 people per group
* Group members & demo time should be registered after final ED412

