Basic CUDA Programming


Shin-Kai Chen (skchen@twins.ee.nctu.edu.tw)
VLSI Signal Processing Laboratory, Department of Electronics Engineering, National Chiao Tung University

What will you learn in this lab?
- Concept of multicore accelerators
- Multithreaded/multicore programming
- Memory optimization

Slides
Mostly from Prof. Wen-Mei Hwu of UIUC: http://courses.ece.uiuc.edu/ece498/al/Syllabus.html

CUDA – Hardware? Software?

Host-Device Architecture
- CPU (host)
- GPU with local DRAM (device)

G80 CUDA Mode – A Device Example
[Block diagram: the host feeds an input assembler and thread execution manager; parallel data caches with texture units and load/store paths connect to global memory.]

Functional Units in G80
- Streaming Multiprocessor (SM)
  - 1 instruction decoder (1 instruction / 4 cycles)
  - 8 streaming processors (SP)
  - Shared memory
[Diagram: SM 0 and SM 1, each with a multithreaded instruction unit (MT IU), SPs, and shared memory; thread blocks t0, t1, t2, ... tm are assigned to each SM.]

Setup CUDA for Windows

CUDA Environment Setup
- Get a GPU that supports CUDA: http://www.nvidia.com/object/cuda_learn_products.html
- Download CUDA: http://www.nvidia.com/object/cuda_get.html
  - CUDA driver
  - CUDA toolkit
  - CUDA SDK (optional)
- Install CUDA
- Test CUDA with the Device Query sample
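If the SDK's Device Query sample is not at hand, a minimal sketch like the one below (not the SDK sample itself) can confirm that the driver and toolkit see your GPU:

    // minimal_device_query.cu - list CUDA devices (a sketch, not the SDK sample)
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d: %s, compute capability %d.%d\n",
                   i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }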

Setup CUDA for Visual Studio
- From scratch: http://forums.nvidia.com/index.php?showtopic=30273
- CUDA VS Wizard: http://sourceforge.net/projects/cudavswizard/
- Modify an existing project

Lab1: First CUDA Program

CUDA Computing Model

Data Manipulation between Host and Device
- cudaError_t cudaMalloc(void** devPtr, size_t count)
  - Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr
- cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
  - Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst
  - kind indicates the direction of the transfer: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice
- cudaError_t cudaFree(void* devPtr)
  - Frees the memory space pointed to by devPtr
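Put together, the usual host-side pattern looks like the following sketch (h_A, h_B, dev_A, and dev_B are illustrative names, not taken from the lab code; SIZE is the element count from parameter.h):

    // allocate, copy in, launch, copy out, free - the standard host-side pattern
    int h_A[SIZE], h_B[SIZE];            // host buffers
    int *dev_A = NULL, *dev_B = NULL;    // device pointers

    cudaMalloc((void**)&dev_A, SIZE * sizeof(int));
    cudaMalloc((void**)&dev_B, SIZE * sizeof(int));

    cudaMemcpy(dev_A, h_A, SIZE * sizeof(int), cudaMemcpyHostToDevice);
    // ... launch the GPU kernel on dev_A / dev_B here ...
    cudaMemcpy(h_B, dev_B, SIZE * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_A);
    cudaFree(dev_B);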

Example
- Functionality: given an integer array A holding 8192 elements, for each element of A calculate A[i]256 and leave the result in B[i]
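A sketch of the corresponding kernel follows. Two assumptions are baked in: the exact operation on A[i] and 256 is ambiguous above, so multiplication is assumed; and since Lab2 starts from nBlk = nTid = 1, the Lab1 kernel presumably loops over the whole array in a single thread.

    // Lab1-style kernel: a single thread computes every element
    // (a sketch, assuming B[i] = A[i] * 256; substitute the operation your lab specifies)
    __global__ void GPU_kernel(int* A, int* B) {
        for (int i = 0; i < 8192; ++i)   // 8192 elements, per the example
            B[i] = A[i] * 256;           // assumed operator
    }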

Now, go and finish your first CUDA program!!!

- Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
- Open the project with Visual C++ 2008 (lab1/cuda_lab/cuda_lab.vcproj)
  - main.cu: random input generation, output validation, result reporting
  - device.cu: launch of the GPU kernel, GPU kernel code
  - parameter.h
- Fill in the appropriate APIs in GPU_kernel() in device.cu

Lab2: Make the Parallel Code Faster

Parallel Processing in CUDA
- Parallel code can be partitioned into blocks and threads: cuda_kernel<<<nBlk, nTid>>>(…)
- Multiple tasks are initialized, each with a different block id and thread id
- The tasks are dynamically scheduled
  - Tasks within the same block are scheduled on the same streaming multiprocessor
- Each task takes care of a single data partition according to its block id and thread id

Locate Data Partition by Built-in Variables
- gridDim: x, y
- blockIdx: x, y
- blockDim: x, y, z
- threadIdx: x, y, z

Data Partition for Previous Example
When processing 64 integer data: cuda_kernel<<<2, 2>>>(…)

    int total_task = gridDim.x * blockDim.x;
    int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
    int length     = SIZE / total_task;
    int head       = task_sn * length;

Processing Single Data Partition
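A minimal sketch of such a kernel, reusing the head/length arithmetic above and the assumed Lab1 operation:

    // each thread walks its own contiguous partition [head, head + length)
    __global__ void cuda_kernel(int* A, int* B) {
        int total_task = gridDim.x * blockDim.x;
        int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
        int length     = SIZE / total_task;   // SIZE comes from parameter.h
        int head       = task_sn * length;
        for (int i = head; i < head + length; ++i)
            B[i] = A[i] * 256;                // assumed Lab1 operation
    }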

Parallelize Your Program!!!

- Partition the kernel into threads
  - Increase nTid from 1 to 512
  - Keep nBlk = 1
- Group threads into blocks
  - Adjust nBlk and see if it helps
- Maintain the total number of threads below 512, i.e. nBlk * nTid < 512
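To judge whether a given nBlk/nTid pair helps, the launch can be timed with CUDA events. main.cu may already report timing, but a standalone sketch looks like this:

    // time one kernel configuration with CUDA events (a sketch)
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cuda_kernel<<<nBlk, nTid>>>(dev_A, dev_B);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("nBlk=%d nTid=%d: %.3f ms\n", nBlk, nTid, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);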

Lab3: Resolve Memory Contention

Parallel Memory Architecture
- Memory is divided into banks to achieve high bandwidth
- Each bank can service one address per cycle
- Successive 32-bit words are assigned to successive banks
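Concretely, a 32-bit word's bank is just its word index modulo the number of banks (16 on G80-era hardware), which is why certain strided access patterns collide. A small sketch of the arithmetic:

    // bank mapping: 16 banks, successive 32-bit words in successive banks
    int bank_of(int word_index) {
        return word_index % 16;
    }
    // stride 1:  thread t touches word t       -> 16 distinct banks, conflict-free
    // stride 16: thread t touches word 16 * t  -> every thread hits bank 0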

Lab2 Review
When processing 64 integer data: cuda_kernel<<<1, 4>>>(…)
Each thread works on a contiguous chunk of 16 elements: thread 0 on elements 0–15, thread 1 on 16–31, and so on.

How about Interleaved Access?
When processing 64 integer data: cuda_kernel<<<1, 4>>>(…)
Instead of contiguous chunks, thread t handles elements t, t+4, t+8, …, so consecutive threads touch consecutive words.

Implementation of Interleaved Access
cuda_kernel<<<1, 4>>>(…)

    head   = task_sn;
    stripe = total_task;
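Filled out as a kernel, the interleaved version might look like this sketch (same assumed operation as Lab1):

    // thread task_sn handles elements task_sn, task_sn + stripe, task_sn + 2*stripe, ...
    __global__ void cuda_kernel(int* A, int* B) {
        int total_task = gridDim.x * blockDim.x;
        int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
        int head   = task_sn;      // first element owned by this thread
        int stripe = total_task;   // distance to this thread's next element
        for (int i = head; i < SIZE; i += stripe)
            B[i] = A[i] * 256;     // assumed Lab1 operation
    }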

Improve Your Program!!!

- Modify the original kernel code in an interleaving manner: cuda_kernel() in device.cu
- Adjust nBlk and nTid as in Lab2 and examine the effect
  - Maintain the total number of threads below 512, i.e. nBlk * nTid < 512

Thank You
http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
Final project
- Subject: porting & optimizing any algorithm on any multi-core platform
- Demo: 1 week after the final exam @ ED412
- Group: 1 to 2 people per group
* Group members & demo time should be registered after the final exam @ ED412