Basic CUDA Programming Shin-Kai Chen VLSI Signal Processing Laboratory Department of Electronics Engineering National Chiao Tung University

What will you learn in this lab?
Concept of a multicore accelerator
Multithreaded/multicore programming
Memory optimization

Slides mostly from Prof. Wen-Mei Hwu of UIUC – http://courses.ece.uiuc.edu/ece498/al/Syllabus.html

CUDA – Hardware? Software?

Host-Device Architecture CPU (host) GPU w/ local DRAM (device)

G80 CUDA mode – A Device Example

Functional Units in G80
Streaming Multiprocessor (SM)
–1 instruction decoder (1 instruction / 4 cycles)
–8 streaming processors (SPs)
–Shared memory
[Figure: SM 0 and SM 1, each holding blocks of threads t0, t1, t2, … tm, an MT IU, SPs, and shared memory]

Setup CUDA for Windows

CUDA Environment Setup
Get a GPU that supports CUDA – http:// ducts.html
Download CUDA
–CUDA driver
–CUDA toolkit
–CUDA SDK (optional)
Install CUDA
Test CUDA
–Device Query
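The "Device Query" test step can be approximated with a few runtime API calls. The following is a minimal sketch, not the SDK's own Device Query sample; the helper name count_cuda_devices is made up for illustration, and the set of properties printed is an arbitrary selection.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of CUDA devices,
// or -1 if the runtime reports an error (e.g. no driver installed).
int count_cuda_devices() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return n;
}

int main() {
    int n = count_cuda_devices();
    if (n <= 0) {
        printf("No usable CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("Device %d: %s\n", dev, prop.name);
        printf("  compute capability:      %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors:         %d\n", prop.multiProcessorCount);
        printf("  global memory:           %zu bytes\n", prop.totalGlobalMem);
        printf("  shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  max threads per block:   %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}
```

If the program lists at least one device, the driver and toolkit are installed well enough to run the labs.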

Setup CUDA for Visual Studio
From scratch – http://forums.nvidia.com/index.php?showtopic=30273
CUDA VS Wizard – wizard/
Modified from existing project

Lab1: First CUDA Program

CUDA Computing Model

Data Manipulation between Host and Device
cudaError_t cudaMalloc( void** devPtr, size_t count )
–Allocates count bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory
cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind )
–Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst
–kind indicates the type of memory transfer: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
cudaError_t cudaFree( void* devPtr )
–Frees the memory space pointed to by devPtr
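The three calls above are typically used together: allocate on the device, copy in, copy out, free. The following is a minimal sketch of that round trip with no kernel in between; the helper name round_trip and the buffer names are illustrative and not part of the lab code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: copies n ints host -> device -> host and
// returns true only if every call succeeded and the data is unchanged.
bool round_trip(const int* src, int* dst, size_t n) {
    int* d_buf = NULL;
    size_t bytes = n * sizeof(int);

    cudaError_t err = cudaMalloc((void**)&d_buf, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return false;
    }
    err = cudaMemcpy(d_buf, src, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(dst, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);  // release device memory on every path

    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err));
        return false;
    }
    return memcmp(src, dst, bytes) == 0;
}

int main() {
    const int N = 8192;
    int* h_in = (int*)malloc(N * sizeof(int));
    int* h_out = (int*)malloc(N * sizeof(int));
    if (!h_in || !h_out) return 1;
    for (int i = 0; i < N; ++i) { h_in[i] = i; h_out[i] = -1; }

    bool ok = round_trip(h_in, h_out, N);
    printf("round trip %s\n", ok ? "succeeded" : "failed");

    free(h_in);
    free(h_out);
    return ok ? 0 : 1;
}
```

Note that cudaMalloc takes the address of the device pointer, and that count is always in bytes, not elements.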

Example
Functionality:
–Given an integer array A holding 8192 elements
–For each element in array A, calculate A[i] 256 and leave the result in B[i]

Now, go and finish your first CUDA program !!!

Download lab1.zip
Open project with Visual C ( lab1/cuda_lab/cuda_lab.vcproj )
–main.cu: random input generation, output validation, result reporting
–device.cu: launch GPU kernel, GPU kernel code
–parameter.h
Fill in appropriate APIs
–GPU_kernel() in device.cu

Lab2: Make the Parallel Code Faster

Parallel Processing in CUDA
Parallel code can be partitioned into blocks and threads
–cuda_kernel<<<nBlk, nTid>>>( … )
Multiple tasks will be initialized, each with a different block id and thread id
The tasks are dynamically scheduled
–Tasks within the same block will be scheduled on the same streaming multiprocessor
Each task takes care of a single data partition according to its block id and thread id

Locate Data Partition by Built-in Variables
Built-in Variables
–gridDim: x, y
–blockIdx: x, y
–blockDim: x, y, z
–threadIdx: x, y, z

Data Partition for Previous Example
When processing 64 integers: cuda_kernel<<<nBlk, nTid>>>(…)
int total_task = gridDim.x * blockDim.x ;
int task_sn = blockIdx.x * blockDim.x + threadIdx.x ;
int length = SIZE / total_task ;
int head = task_sn * length ;

Processing Single Data Partition
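The partition formulas (total_task, task_sn, length, head) can be combined into a complete kernel as sketched below. This is one possible version, not the lab's reference solution. The per-element operation is garbled in the slide text ("A[i] 256"), so process_element is a stand-in that assumes one plausible reading (a 256th power, computed with unsigned wraparound). The helper run_blocked and the test data are also illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 8192  // array length from the example slide

// Stand-in for the per-element operation (assumption): eight successive
// squarings, i.e. x^256 with unsigned 32-bit wraparound.
__host__ __device__ unsigned int process_element(unsigned int x) {
    for (int k = 0; k < 8; ++k) x *= x;
    return x;
}

// Each task handles one contiguous partition of `length` elements.
// Assumes SIZE is a multiple of the total number of tasks.
__global__ void cuda_kernel(const unsigned int* A, unsigned int* B) {
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
    int length = SIZE / total_task;
    int head = task_sn * length;
    for (int i = head; i < head + length; ++i)
        B[i] = process_element(A[i]);
}

// Hypothetical driver: launches nBlk blocks of nTid threads and returns the
// number of mismatches against a CPU reference, or -1 on a CUDA error.
int run_blocked(int nBlk, int nTid) {
    static unsigned int hA[SIZE], hB[SIZE];
    for (int i = 0; i < SIZE; ++i) hA[i] = (unsigned int)i * 2654435761u + 12345u;

    unsigned int *dA = NULL, *dB = NULL;
    size_t bytes = SIZE * sizeof(unsigned int);
    if (cudaMalloc((void**)&dA, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&dB, bytes) != cudaSuccess) { cudaFree(dA); return -1; }

    cudaError_t err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        cuda_kernel<<<nBlk, nTid>>>(dA, dB);
        err = cudaGetLastError();  // catches invalid launch configurations
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hB, dB, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < SIZE; ++i)
        if (hB[i] != process_element(hA[i])) ++bad;
    return bad;
}

int main() {
    int bad = run_blocked(4, 64);  // 256 tasks, 32 elements each
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

With 4 blocks of 64 threads there are 256 tasks, so each task processes a contiguous run of 8192 / 256 = 32 elements.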

Parallelize Your Program !!!

Partition kernel into threads
–Increase nTid from 1 to 512
–Keep nBlk = 1
Group threads into blocks
–Adjust nBlk and see if it helps
Keep the total number of threads below 512, i.e. nBlk * nTid < 512
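To see whether a given nBlk and nTid actually helps, the kernel launch can be timed with CUDA events. The following is a rough sketch under several assumptions: the kernel body is a stand-in (the lab's real operation is not recoverable from the slide text), time_launch_ms is a made-up helper, and a single timed launch is noisy, so the numbers are only indicative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 8192

// Stand-in kernel (assumption): blocked partitioning with eight successive
// squarings per element. Assumes SIZE is a multiple of the number of tasks.
__global__ void cuda_kernel(const unsigned int* A, unsigned int* B) {
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
    int length = SIZE / total_task;
    int head = task_sn * length;
    for (int i = head; i < head + length; ++i) {
        unsigned int x = A[i];
        for (int k = 0; k < 8; ++k) x *= x;
        B[i] = x;
    }
}

// Hypothetical helper: returns the elapsed time of one launch in
// milliseconds, or -1.0f on a CUDA error.
float time_launch_ms(int nBlk, int nTid) {
    unsigned int *dA = NULL, *dB = NULL;
    size_t bytes = SIZE * sizeof(unsigned int);
    if (cudaMalloc((void**)&dA, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void**)&dB, bytes) != cudaSuccess) { cudaFree(dA); return -1.0f; }
    cudaMemset(dA, 1, bytes);  // arbitrary input; values do not matter for timing

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cuda_kernel<<<nBlk, nTid>>>(dA, dB);  // warm-up launch, not timed
    cudaEventRecord(start, 0);
    cuda_kernel<<<nBlk, nTid>>>(dA, dB);  // timed launch
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);           // wait until the kernel has finished

    float ms = -1.0f;
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dB);
    return err == cudaSuccess ? ms : -1.0f;
}

int main() {
    // Sweep the thread count with a single block, staying below 512 threads.
    for (int nTid = 1; nTid <= 256; nTid *= 2)
        printf("nBlk=1 nTid=%3d: %.3f ms\n", nTid, time_launch_ms(1, nTid));
    return 0;
}
```

The same helper can be called with nBlk greater than 1 to compare grouping threads into several blocks against one large block.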

Lab3: Resolve Memory Contention

Parallel Memory Architecture
Memory is divided into banks to achieve high bandwidth
Each bank can service one address per cycle
Successive 32-bit words are assigned to successive banks
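Because successive 32-bit words go to successive banks, the bank of a word is its index modulo the number of banks. The host-only sketch below counts how many distinct banks a group of tasks touches when task t accesses word t * stride at the same time. The bank count of 16 is an assumption (the slide does not state it), and the function names are illustrative.

```cuda
#include <cstdio>

#define NUM_BANKS 16  // assumption: not stated on the slide

// Bank holding a given 32-bit word under the "successive words,
// successive banks" mapping.
__host__ __device__ inline int bank_of_word(int word_index, int num_banks) {
    return word_index % num_banks;
}

// Counts the distinct banks hit when num_tasks tasks simultaneously access
// word (task * stride). Fewer distinct banks means more contention.
// Supports up to 64 banks.
int distinct_banks(int num_tasks, int stride, int num_banks) {
    bool used[64] = {false};
    int count = 0;
    for (int t = 0; t < num_tasks; ++t) {
        int b = bank_of_word(t * stride, num_banks);
        if (!used[b]) {
            used[b] = true;
            ++count;
        }
    }
    return count;
}

int main() {
    // Stride 1: neighbouring tasks read neighbouring words.
    printf("stride 1:  %d distinct banks\n", distinct_banks(16, 1, NUM_BANKS));
    // Stride 32: each task starts a 32-word contiguous partition.
    printf("stride 32: %d distinct banks\n", distinct_banks(16, 32, NUM_BANKS));
    return 0;
}
```

Under this model, 16 tasks with stride 1 spread over all 16 banks, while 16 tasks with stride 32 all land on the same bank and must be serviced one after another.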

Lab2 Review
When processing 64 integers: cuda_kernel<<<nBlk, nTid>>>(…)

How about Interleaved Access?
When processing 64 integers: cuda_kernel<<<nBlk, nTid>>>(…)

Implementation of Interleaved Access
head = task_sn
stripe = total_task
cuda_kernel<<<nBlk, nTid>>>( … )
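With head = task_sn and stripe = total_task, each task starts at its own index and steps through the array by the total number of tasks. The following is a minimal sketch of such a kernel, not the lab's reference solution. As the slide's per-element operation is garbled, process_element is a stand-in (a 256th power with unsigned wraparound), and run_interleaved is a made-up driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 8192

// Stand-in for the per-element operation (assumption): x^256 with
// unsigned 32-bit wraparound, via eight successive squarings.
__host__ __device__ unsigned int process_element(unsigned int x) {
    for (int k = 0; k < 8; ++k) x *= x;
    return x;
}

// Interleaved access: at each step, neighbouring tasks touch
// neighbouring elements.
__global__ void cuda_kernel(const unsigned int* A, unsigned int* B) {
    int total_task = gridDim.x * blockDim.x;
    int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
    int head = task_sn;
    int stripe = total_task;
    for (int i = head; i < SIZE; i += stripe)
        B[i] = process_element(A[i]);
}

// Hypothetical driver: returns the number of mismatches against a CPU
// reference, or -1 on a CUDA error.
int run_interleaved(int nBlk, int nTid) {
    static unsigned int hA[SIZE], hB[SIZE];
    for (int i = 0; i < SIZE; ++i) hA[i] = (unsigned int)i * 2654435761u + 12345u;

    unsigned int *dA = NULL, *dB = NULL;
    size_t bytes = SIZE * sizeof(unsigned int);
    if (cudaMalloc((void**)&dA, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&dB, bytes) != cudaSuccess) { cudaFree(dA); return -1; }

    cudaError_t err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        cuda_kernel<<<nBlk, nTid>>>(dA, dB);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hB, dB, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < SIZE; ++i)
        if (hB[i] != process_element(hA[i])) ++bad;
    return bad;
}

int main() {
    int bad = run_interleaved(4, 64);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Unlike the contiguous partitioning from Lab2, the loop bound is SIZE itself, so this form also covers every element when SIZE is not a multiple of the number of tasks.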

Improve Your Program !!!

Modify the original kernel code to access data in an interleaved manner
–cuda_kernel() in device.cu
Adjust nBlk and nTid as in Lab2 and examine the effect
–Keep the total number of threads below 512, i.e. nBlk * nTid < 512

Thank You
http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
Final project issue
Group issue