GPU Program Optimization (CS 680): Parallel Programming with CUDA
Jeremy R. Johnson*
*Parts of this lecture were derived from chapters 1-5 of Sanders & Kandrot, Wikipedia, and the CUDA C Programming Guide

Introduction
Objective: To introduce GPU programming with CUDA
Topics
– GPU (Graphics Processing Unit)
– Introduction to CUDA
  Kernel functions
  Compiling CUDA programs with nvcc
  Copy to/from device memory
  Blocks and threads
  Global and shared memory
  Synchronization
– Parallel vector add
– Parallel dot product
– float and double (local GPU servers, 2 X GeForce GTX 580)

GPUs
A GPU is a specialized processor designed for real-time, high-resolution 3D graphics
– Compute-intensive, data-parallel tasks
GPUs are highly parallel multi-core systems that can operate on large blocks of data in parallel
They are more effective than general-purpose CPUs for algorithms that process large blocks of data in parallel

GTX 580
Third-generation Streaming Multiprocessor (SM)
– 32 CUDA cores per SM
– 16 SMs, for a total of 512 cores
– Dual warp scheduler that simultaneously schedules and dispatches instructions from two independent warps
– 64 KB of on-chip RAM per SM with a configurable partitioning between shared memory and L1 cache

CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA
– CUDA is the computing engine in NVIDIA graphics processing units (GPUs)
– CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs
– CUDA provides both a low-level API and a higher-level API
– Minimal set of extensions to C (define kernels to run on the device and specify thread setup)
– Runtime library to allocate and transfer memory
– Bindings for other languages are available
You must have a CUDA-enabled device, with drivers and the CUDA SDK installed

CUDA Model (figure)

Programming Model (figure)

CUDA Tool Chain
Input files (*.cu) contain a mix of host and device code
Compiling with nvcc
– Separates host and device code
– Compiles device code to PTX (the CUDA ISA) or cubin (CUDA binary) form
– Modifies host code to replace the <<<...>>> launch notation with calls to the runtime library that load and launch the compiled kernels
cudart (runtime library)
– C functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, etc.
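As a sketch of how these stages can be invoked by hand (these are standard nvcc options; the file name kernel.cu is illustrative, and sm_20 is the Fermi architecture of the GTX 580):

$ nvcc -ptx kernel.cu -o kernel.ptx    # compile device code to PTX (the CUDA ISA)
$ nvcc -cubin -arch=sm_20 kernel.cu    # compile device code to a CUDA binary for a specific architecture
$ nvcc kernel.cu -o kernel             # full build: host + device code in one executable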

CS Department GPU Servers
float.cs.drexel.edu and double.cs.drexel.edu
– Dual Intel Xeon processors (8 cores total) at 2.13 GHz
– 12 MB cache, 28 GB RAM
– 2 X GeForce GTX 580 GPUs
CUDA SDK in /usr/local/cuda
– bin (executables, including nvcc)
– doc (CUDA, nvcc, ptx, CUBLAS, CUFFT, OpenCL, etc.)
– include
– lib (32-bit libraries)
– lib64 (64-bit libraries)
GPU Computing SDK in /usr/local/NVIDIA_GPU_Computing_SDK

Using CUDA on float/double
Make sure /usr/local/cuda/bin is in your path
– PATH=$PATH:/usr/local/cuda/bin
– export PATH
Make sure /usr/local/cuda/lib64 is in your library path
– LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
– export LD_LIBRARY_PATH
You should make these changes in your .bashrc file
– If you edit .bashrc and want to use CUDA right away, you need to re-execute it:
– source .bashrc
Make sure this was done properly
– which nvcc
– echo $PATH
– echo $LD_LIBRARY_PATH

Host Hello
#include <stdio.h>

int main( void ) {
    printf( "Hello, World!\n" );
    return 0;
}

To compile and run:
chapter03]$ nvcc hello_world.cu -o hello_world
chapter03]$ ./hello_world
Hello, World!

GPU Hello
#include "../common/book.h"
#define N 15

__global__ void hello( char *dev_greetings ) {
    dev_greetings[0]  = 'h';
    dev_greetings[1]  = 'e';
    dev_greetings[2]  = 'l';
    dev_greetings[3]  = 'l';
    dev_greetings[4]  = 'o';
    dev_greetings[5]  = ' ';
    dev_greetings[6]  = 'f';
    dev_greetings[7]  = 'r';
    dev_greetings[8]  = 'o';
    dev_greetings[9]  = 'm';
    dev_greetings[10] = ' ';
    dev_greetings[11] = 'g';
    dev_greetings[12] = 'p';
    dev_greetings[13] = 'u';
    dev_greetings[14] = '\0';
}

GPU Hello (cont)
int main( void ) {
    char greetings[N];
    char *dev_greetings;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_greetings, N ) );

    hello<<<1,1>>>( dev_greetings );

    // copy the greetings back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( greetings, dev_greetings, N,
                              cudaMemcpyDeviceToHost ) );

    // display the results
    printf( "%s\n", greetings );

    // free the memory allocated on the GPU
    HANDLE_ERROR( cudaFree( dev_greetings ) );
    return 0;
}
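HANDLE_ERROR comes from the book's ../common/book.h header. If you don't have that header, a minimal equivalent along the same lines (a sketch, not necessarily the book's exact code) is:

#include <stdio.h>
#include <stdlib.h>

// Check a CUDA runtime call's return status and abort with a message on failure.
static void HandleError( cudaError_t err, const char *file, int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
        exit( EXIT_FAILURE );
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))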

CUDA Syntax & Functions
Function type qualifiers
– __global__
– __device__
Calling a kernel function
– name<<<blocks, threads>>>()
Runtime library functions
– cudaMalloc
– cudaMemcpy
– cudaFree
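A minimal sketch tying these pieces together (the kernel, array size, and names here are illustrative, not from the lecture):

#include <stdio.h>

__device__ int square( int x ) { return x * x; }   // callable only from device code

__global__ void squares( int *out ) {              // callable from the host
    out[threadIdx.x] = square( threadIdx.x );
}

int main( void ) {
    int host_out[8];
    int *dev_out;
    cudaMalloc( (void**)&dev_out, 8 * sizeof(int) );    // allocate device memory
    squares<<<1,8>>>( dev_out );                        // launch 1 block of 8 threads
    cudaMemcpy( host_out, dev_out, 8 * sizeof(int),
                cudaMemcpyDeviceToHost );               // copy the results back
    for (int i = 0; i < 8; i++)
        printf( "%d ", host_out[i] );
    printf( "\n" );
    cudaFree( dev_out );                                // release device memory
    return 0;
}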

CUDA Device Properties
#include "../common/book.h"

int main( void ) {
    cudaDeviceProp prop;
    int count;
    HANDLE_ERROR( cudaGetDeviceCount( &count ) );
    for (int i=0; i<count; i++) {
        HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );
        printf( " --- General Information for device %d ---\n", i );
        printf( "Name: %s\n", prop.name );
        printf( "Compute capability: %d.%d\n", prop.major, prop.minor );
        printf( "Clock rate: %d\n", prop.clockRate );

CUDA Device Properties (cont)
        printf( "Device copy overlap: " );
        if (prop.deviceOverlap)
            printf( "Enabled\n" );
        else
            printf( "Disabled\n" );
        printf( "Kernel execution timeout : " );
        if (prop.kernelExecTimeoutEnabled)
            printf( "Enabled\n" );
        else
            printf( "Disabled\n" );

CUDA Device Properties (cont)
        printf( " --- Memory Information for device %d ---\n", i );
        printf( "Total global mem: %ld\n", prop.totalGlobalMem );
        printf( "Total constant Mem: %ld\n", prop.totalConstMem );
        printf( "Max mem pitch: %ld\n", prop.memPitch );
        printf( "Texture Alignment: %ld\n", prop.textureAlignment );
        printf( " --- MP Information for device %d ---\n", i );
        printf( "Multiprocessor count: %d\n", prop.multiProcessorCount );
        printf( "Shared mem per mp: %ld\n", prop.sharedMemPerBlock );
        printf( "Registers per mp: %d\n", prop.regsPerBlock );
        printf( "Threads in warp: %d\n", prop.warpSize );
        printf( "Max threads per block: %d\n", prop.maxThreadsPerBlock );
        printf( "Max thread dimensions: (%d, %d, %d)\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2] );
        printf( "Max grid dimensions: (%d, %d, %d)\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2] );
        printf( "\n" );
    }
}
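Beyond printing properties, the same struct can drive device selection through the runtime's cudaChooseDevice. A short sketch (the compute-capability target here is an illustrative choice):

#include <string.h>
#include "../common/book.h"

int main( void ) {
    cudaDeviceProp prop;
    int dev;
    memset( &prop, 0, sizeof( cudaDeviceProp ) );
    prop.major = 2;                                  // ask for compute capability >= 2.0
    prop.minor = 0;
    HANDLE_ERROR( cudaChooseDevice( &dev, &prop ) ); // find the closest match
    HANDLE_ERROR( cudaSetDevice( dev ) );            // make it the current device
    printf( "Using device %d\n", dev );
    return 0;
}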

Grid of Thread Blocks (figure)

Parallel Vector Add (blocks)
#define N 10

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // this thread handles the data at its thread id
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

Parallel Vector Add
int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice ) );

Parallel Vector Add (blocks)
    add<<<N,1>>>( dev_a, dev_b, dev_c );    // N blocks of 1 thread each

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost ) );

    // display the results
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    // free the memory allocated on the GPU
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_c ) );
    return 0;
}

Parallel Vector Add Long (blocks)
#define N (32 * 1024)

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += gridDim.x;    // stride by the total number of blocks
    }
}
…
add<<<128,1>>>( dev_a, dev_b, dev_c );    // any block count works, since the kernel strides by gridDim.x

Parallel Vector Add Long (threads)
#define N (33 * 1024)

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;    // stride by the total number of launched threads
    }
}
…
add<<<128,128>>>( dev_a, dev_b, dev_c );    // 128*128 = 16384 threads, so each handles 2-3 of the 33792 elements

CUDA Builtin Variables
gridDim (number of blocks in the grid)
blockIdx (block index within the grid)
blockDim (number of threads per block)
threadIdx (thread index within the block)
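A small kernel sketch (illustrative, not from the slides) showing how the four variables combine into a unique global index and a grid-wide stride:

__global__ void show_indexing( int *out, int n ) {
    // unique global index for this thread across all blocks
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // total number of threads in the whole grid
    int stride = blockDim.x * gridDim.x;
    for (int i = tid; i < n; i += stride)
        out[i] = i;    // each thread writes the indices it visits
}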

Memory Hierarchy (figure)

Parallel Dot Product
__global__ void dot( float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = temp;

    // synchronize threads in this block
    __syncthreads();

Parallel Dot Product (cont)
    // for reductions, threadsPerBlock must be a power of 2
    // because of the following code
    int i = blockDim.x/2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}

Parallel Dot Product (cont)
int main( void ) {
    float *a, *b, c, *partial_c;
    float *dev_a, *dev_b, *dev_partial_c;
    …
    dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a, dev_b, dev_partial_c );
    …
    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( partial_c, dev_partial_c,
                              blocksPerGrid*sizeof(float),
                              cudaMemcpyDeviceToHost ) );

    // finish up on the CPU side
    c = 0;
    for (int i=0; i<blocksPerGrid; i++) {
        c += partial_c[i];
    }
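The "…" above elides the sizing constants; in Sanders & Kandrot's version of this example they are defined as:

#define imin(a,b) (a<b?a:b)

const int N = 33 * 1024;
const int threadsPerBlock = 256;
// enough blocks to cover N, capped at 32 so the final CPU-side sum stays short
const int blocksPerGrid = imin( 32, (N+threadsPerBlock-1) / threadsPerBlock );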

CUDA Shared Memory
Memory qualifier
– __shared__
Synchronization
– __syncthreads();
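To illustrate the pair outside the dot product, here is a small self-contained sketch (the kernel name and size are illustrative) that reverses one block's worth of data through shared memory; the __syncthreads() barrier keeps the reads from racing ahead of the writes:

#define DIM 64

__global__ void reverse_block( int *d ) {
    __shared__ int s[DIM];    // one copy per block, visible to all of its threads
    int t = threadIdx.x;
    s[t] = d[t];              // stage the data in fast on-chip memory
    __syncthreads();          // wait until every thread has written its element
    d[t] = s[DIM - 1 - t];    // now it is safe to read another thread's slot
}

Launched as reverse_block<<<1,DIM>>>( dev_d ), every thread reads an element written by a different thread, which is only correct because of the barrier between the write and the read.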

Further Information
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Getting_Started_Linux.pdf
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
GeForce GTX 580 datasheet (NVIDIA)
Textbook
– Jason Sanders and Edward Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional, 2010.