First CUDA Program


First CUDA Program

#include "stdio.h" int main() { printf("Hello, world\n"); return 0; } #include __global__ void kernel (void) { } int main (void) { kernel >> (); printf("Hello World!\n"); return 0; } First C program First CUDA program Compilation nvcc -o first first.cu./first Compilation gcc -o first first.c./first

Kernels

CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. A kernel is defined using the __global__ declaration specifier. The number of CUDA threads that execute the kernel for a given kernel call is specified using a new <<< >>> execution configuration syntax.
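To make the "N threads" point concrete, here is a minimal sketch (not from the slides) of a kernel launched with one block of N threads; the kernel name fill and the use of threadIdx.x, which gives each thread its own index, are illustrative and are introduced properly in later slides.

    #include <stdio.h>

    __global__ void fill(int *out)          // runs once per thread
    {
        int i = threadIdx.x;                // each thread gets a unique index
        out[i] = i * i;                     // and writes its own element
    }

    int main(void)
    {
        const int N = 8;
        int h[N], *d;
        cudaMalloc((void **)&d, N * sizeof(int));
        fill<<<1, N>>>(d);                  // 1 block of N threads: N parallel copies of fill()
        cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < N; i++) printf("%d ", h[i]);
        printf("\n");
        cudaFree(d);
        return 0;
    }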

Example Program 1

    #include <stdio.h>

    __global__ void kernel(void) { }

    int main(void) {
        kernel<<<1,1>>>();
        printf("Hello World!\n");
        return 0;
    }

"__global__" says the function is to be compiled to run on a "device" (GPU), not the "host" (CPU). A function executed on the GPU (device) is usually called a "kernel". The triple angle brackets "<<< >>>" pass launch parameters/arguments to the runtime.

Example Program 2

We can pass parameters to a kernel as we would with any C function. We also need to allocate memory on the device to do anything useful there.

    __global__ void add(int a, int b, int *c) {
        *c = a + b;
    }

    int main(void) {
        int c, *dev_c;
        cudaMalloc((void **)&dev_c, sizeof(int));
        add<<<1,1>>>(2, 7, dev_c);    // <<<blocksPerGrid, threadsPerBlock>>>
        cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
        printf("2 + 7 = %d\n", c);
        cudaFree(dev_c);
        return 0;
    }

The two values inside the angle brackets are blocksPerGrid and threadsPerBlock.
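As a hedged sketch (not part of the original slides), the same launch can be written with explicit blocksPerGrid / threadsPerBlock variables of type dim3; the variable names here are illustrative.

    __global__ void add(int a, int b, int *c) { *c = a + b; }

    int main(void)
    {
        int c, *dev_c;
        cudaMalloc((void **)&dev_c, sizeof(int));

        dim3 blocksPerGrid(1);       // how many blocks in the grid
        dim3 threadsPerBlock(1);     // how many threads in each block
        add<<<blocksPerGrid, threadsPerBlock>>>(2, 7, dev_c);

        cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(dev_c);
        return 0;
    }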

CUDA Device Memory Allocation

cudaMalloc():
    cudaError_t cudaMalloc(void **devPtr, size_t size)
– Allocates an object in device global memory: size bytes of linear memory, returning in *devPtr a pointer to the allocated memory.
– Requires two parameters:
  - Address of a pointer to the allocated object
  - Size of the allocated object

cudaFree():
    cudaError_t cudaFree(void *devPtr)
– Frees the object from device global memory.
– Takes the pointer to the object being freed.

(Figure: CUDA memory hierarchy — host, grid global memory, per-block shared memory, per-thread registers.)

CUDA Device Memory Allocation (cont.)

Code example:
– Allocate a 64 * 64 single-precision float array
– Attach the allocated storage to Md
– "d" is often used to indicate a device data structure

    int TILE_WIDTH = 64;
    float *Md;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

    cudaMalloc((void **)&Md, size);
    cudaFree(Md);
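Since cudaMalloc() and cudaFree() return a cudaError_t (as noted on the previous slide), the calls can be checked. A minimal sketch of that pattern, not from the original slides:

    #include <stdio.h>

    int main(void)
    {
        const int TILE_WIDTH = 64;
        float *Md;
        int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

        cudaError_t err = cudaMalloc((void **)&Md, size);   // check the return value
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        /* ... use Md ... */
        cudaFree(Md);
        return 0;
    }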

Memory model

(Figure: per-thread local memory, per-block shared memory, per-device global memory; sequential kernels Kernel 0 and Kernel 1; host memory and device 0/device 1 memories connected by cudaMemcpy().)

The CUDA programming model assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively. There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
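As a hedged illustration of the constant memory space mentioned above (not part of the original slides), a __constant__ variable is declared at file scope and filled from the host with cudaMemcpyToSymbol(); the names coeff and scale are made up for this sketch.

    #include <stdio.h>

    __constant__ float coeff[4];              // resides in device constant memory

    __global__ void scale(float *data)
    {
        int i = threadIdx.x;
        data[i] *= coeff[i % 4];              // every thread reads the same constants
    }

    int main(void)
    {
        float h_coeff[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));   // host -> constant memory

        float h_data[8] = {1, 1, 1, 1, 1, 1, 1, 1}, *d_data;
        cudaMalloc((void **)&d_data, sizeof(h_data));
        cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

        scale<<<1, 8>>>(d_data);

        cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);
        for (int i = 0; i < 8; i++) printf("%g ", h_data[i]);
        printf("\n");
        cudaFree(d_data);
        return 0;
    }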

Each thread can:
– Read/write per-thread registers
– Read/write per-thread local memory
– Read/write per-block shared memory
– Read/write per-grid global memory
– Read-only per-grid constant memory

(Figure: host, grid global memory and constant memory, per-block shared memory, per-thread registers.)
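A hedged sketch (not from the slides) of the per-block shared memory item above: a block-wide array reversal in which threads exchange data through a __shared__ array and synchronize with __syncthreads(). The kernel name reverse and the block size are illustrative.

    #define BLOCK 8

    __global__ void reverse(int *d)
    {
        __shared__ int tmp[BLOCK];            // visible to all threads of this block
        int i = threadIdx.x;
        tmp[i] = d[i];                        // each thread writes one element
        __syncthreads();                      // wait until the whole block has written
        d[i] = tmp[BLOCK - 1 - i];            // then read another thread's element
    }

Such a kernel would be launched as reverse<<<1, BLOCK>>>(dev_array); without the __syncthreads() barrier, a thread could read an element that has not yet been written.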

CUDA Host-Device Data Transfer

cudaMemcpy(): memory data transfer
    cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)
– Requires four parameters:
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type of transfer:
    - Host to Host: cudaMemcpyHostToHost
    - Host to Device: cudaMemcpyHostToDevice
    - Device to Host: cudaMemcpyDeviceToHost
    - Device to Device: cudaMemcpyDeviceToDevice

(Figure: CUDA memory hierarchy — host, grid global memory, per-block shared memory, per-thread registers.)

CUDA Host-Device Data Transfer (cont.)

Code example:
– Transfer a 64 * 64 single-precision float array
– M is in host memory and Md is in device memory
– cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
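Putting the allocation and transfer examples together, a minimal host-side round trip might look like the following sketch (M and Md as above; error checking omitted, and the kernel launch left as a placeholder comment):

    #include <stdlib.h>

    int main(void)
    {
        const int TILE_WIDTH = 64;
        int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

        float *M = (float *)malloc(size);                  // host array
        float *Md;                                         // device array
        cudaMalloc((void **)&Md, size);

        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host -> device
        /* ... launch a kernel that works on Md ... */
        cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);   // device -> host

        cudaFree(Md);
        free(M);
        return 0;
    }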

Summing Vectors

A simple example to illustrate threads and how we use them to code with CUDA C.

Traditional C code on the CPU:

    #include <stdio.h>
    #define N 10

    void add(int *a, int *b, int *c) {
        int tid = 0;                    // this is CPU zero, so we start at zero
        while (tid < N) {
            c[tid] = a[tid] + b[tid];
            tid += 1;                   // we have one CPU, so we increment by one
        }
    }

    int main(void) {
        int a[N], b[N], c[N];
        // fill the arrays 'a' and 'b' on the CPU
        for (int i = 0; i < N; i++) {
            a[i] = -i;
            b[i] = i * i;
        }
        add(a, b, c);
        // display the results
        for (int i = 0; i < N; i++) {
            printf("%d + %d = %d\n", a[i], b[i], c[i]);
        }
        return 0;
    }

GPU Vector Sums

We can accomplish the same addition very similarly on a GPU by writing add() as a device function.

    #define N 10

    int main(void) {
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;

        // allocate the memory on the GPU
        cudaMalloc((void **)&dev_a, N * sizeof(int));
        cudaMalloc((void **)&dev_b, N * sizeof(int));
        cudaMalloc((void **)&dev_c, N * sizeof(int));

        // fill the arrays 'a' and 'b' on the CPU
        for (int i = 0; i < N; i++) {
            a[i] = -i;
            b[i] = i * i;
        }

        // copy the arrays 'a' and 'b' to the GPU
        cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

        add<<<N,1>>>(dev_a, dev_b, dev_c);

        // copy the array 'c' back from the GPU to the CPU
        cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

        // display the results
        for (int i = 0; i < N; i++) {
            printf("%d + %d = %d\n", a[i], b[i], c[i]);
        }

        // free the memory allocated on the GPU
        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);
        return 0;
    }

    // Kernel definition
    __global__ void add(int *a, int *b, int *c) {
        int tid = blockIdx.x;           // handle the data at this index
        if (tid < N)
            c[tid] = a[tid] + b[tid];
    }

Declaration specifiers:
– __global__ void KernelFunc(...);   kernel callable from the host
– __device__ void DeviceFunc(...);   function callable on the device
– __device__ int GlobalVar;          variable in device memory
– __shared__ int SharedVar;          variable in per-block shared memory
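A hedged sketch of how these specifiers combine (the names square, offset, and kernelFunc are illustrative, not from the slides): a __device__ helper callable only from device code and a __device__ variable resident in device memory, both used from a __global__ kernel.

    __device__ int square(int x)              // callable from device code only
    {
        return x * x;
    }

    __device__ int offset = 5;                // variable resident in device global memory

    __global__ void kernelFunc(int *out)      // callable from the host via <<< >>>
    {
        int i = blockIdx.x;
        out[i] = square(i) + offset;
    }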

In
    add<<<N,1>>>(dev_a, dev_b, dev_c);
N is the number of blocks that we want to run in parallel.
– If we call add<<<4,1>>>(...), the kernel will have four copies running in parallel, where each copy is called a block.
A thread block is a (data-)parallel task:
– all blocks in a kernel have the same entry point,
– but they may execute any code they want.

This is what the actual code being executed in each of the four parallel blocks looks like after the runtime substitutes the appropriate block index for blockIdx.x. The runtime effectively launches a different instance of the kernel in each block, each instance seeing one of these indices, so the work is done in parallel.
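The substituted copies the slide refers to would look, schematically, like this (reconstructed here as an illustration, since the original figure is not in the transcript):

    // What block 0 effectively executes:
    int tid = 0;
    if (tid < N) c[tid] = a[tid] + b[tid];

    // What block 1 effectively executes:
    int tid = 1;
    if (tid < N) c[tid] = a[tid] + b[tid];

    // ... and similarly tid = 2 in block 2 and tid = 3 in block 3.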