High Performance Computing with GPUs: An Introduction
GPU Tutorial, Krešimir Ćosić, Thursday, August 12th, 2010
LSST All Hands Meeting 2010, Tucson, AZ


GPU Tutorial: How To Program for GPUs
Krešimir Ćosić (University of Split, Croatia)

Overview
- CUDA
- Hardware architecture
- Programming model
- Convolution on the GPU

CUDA: 'Compute Unified Device Architecture'
- Hardware and software architecture for issuing and managing computations on the GPU
- Massively parallel architecture: over 8000 concurrent threads is common
- C for CUDA (C++ for CUDA): the C/C++ language with some additions and restrictions
- Enables GPGPU: 'General Purpose Computing on GPUs'

GPU: a multithreaded coprocessor
- SM: streaming multiprocessor, containing 32 SPs (or 16, 48, or more)
- SP: scalar processor ('CUDA core'); executes one thread
- Fast local 'shared memory', shared between the SPs of an SM: 16 KiB (or 64 KiB)
[Diagram: SPs and shared memory inside each SM; global memory on the device]

GPU memory and scale
- GDDR global memory (on device): 512 MiB to 6 GiB
- A GPU contains multiple SMs, e.g.:
  - 30 SMs on GT200
  - 14-15 SMs on Fermi
- For example, GTX 480: 15 SMs x 32 cores = 480 cores on a GPU
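The SM count (and other properties) of the installed card can be queried through the CUDA runtime API; a minimal sketch, assuming device 0:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("%s: %d SMs\n", prop.name, prop.multiProcessorCount);
        return 0;
    }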

How to program for GPUs
- Parallelization: decomposition into threads
- Memory: shared memory, global memory

Important things to keep in mind
- Avoid divergent branches: threads of a single SM must execute the same code, so code that branches heavily and unpredictably will execute slowly
- Threads should be as independent as possible: synchronization and communication can be done efficiently only between threads of a single multiprocessor
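As a hypothetical illustration (not from the slides): in Divergent below, odd and even threads of the same warp take different paths, so the warp executes both branches one after the other; in Uniform, every thread of a given block (and hence every warp in it) takes the same path.

    __global__ void Divergent(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)   // neighbouring threads disagree: the warp diverges
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;
    }

    __global__ void Uniform(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (blockIdx.x % 2 == 0)    // whole block takes the same path: no divergence
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;
    }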

How to program for GPUs (summary)
- Parallelization: decomposition into threads
  - enormous processing power
  - avoid divergence
- Memory: shared memory, global memory
- Thread communication: synchronization within a block; otherwise no interdependencies

Programming model

Thread blocks
- Threads are grouped into thread blocks (typically 128, 192, or 256 threads per block)
- One thread block executes on one SM
  - all of its threads share the 'shared memory'
  - 32 threads execute simultaneously (a 'warp')
[Diagram: a thread block as a 2D array of threads]
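Because all threads of a block share the same 'shared memory', they can cooperate. A minimal sketch of the usual staging pattern (my own example; assumes 128-thread blocks): each thread loads one element, the block synchronizes, then each thread reads an element written by a different thread.

    __global__ void ReverseBlock(float* d)
    {
        __shared__ float s[128];          // visible to all threads of this block
        int base = blockIdx.x * blockDim.x;
        s[threadIdx.x] = d[base + threadIdx.x];
        __syncthreads();                  // wait until every thread of the block has written
        d[base + threadIdx.x] = s[blockDim.x - 1 - threadIdx.x];
    }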

Thread blocks and grids
- Blocks form a grid
- Blocks execute on SMs: they execute in parallel, and they execute independently!
- Thread ID: unique within a block
- Block ID: unique within the grid
[Diagram: a 3x3 grid of blocks, each block a 2D array of threads]

Code that executes on the GPU: kernels
- A kernel is a plain C function that executes on the GPU
- It executes in parallel, once for each thread
- The keyword __global__ tells the compiler to make a function a kernel (and to compile it for the GPU instead of the CPU)
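A minimal sketch of a kernel and its launch (hypothetical names; the launch syntax is covered a few slides later):

    __global__ void AddOne(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                 // guard: the grid may have more threads than n
            data[i] += 1.0f;
    }

    // Host side: launch 30 blocks of 128 threads each
    // (dataGPU is a device pointer obtained from cudaMalloc)
    AddOne<<<30, 128>>>(dataGPU, n);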

Convolution
To get one pixel of the output image:
- multiply the mask (pixelwise) with the image at the corresponding position
- sum the products
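In sequential C this is just a pair of nested loops over the output pixels (a sketch using the same variable names as the kernel on the next slide):

    // CPU reference: iterate over every output pixel in turn
    for (int y = 0; y < outH; y++)
        for (int x = 0; x < outW; x++)
        {
            float sum = 0;
            for (int filtY = 0; filtY < filtH; filtY++)
                for (int filtX = 0; filtX < filtW; filtX++)
                    sum += img[(y + filtY)*imgW + (x + filtX)]
                         * filt[filtY*filtW + filtX];
            out[y*outW + x] = sum;
        }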

Kernel: example code

    __global__ void Convolve(float* img, int imgW, int imgH,
                             float* filt, int filtW, int filtH,
                             float* out)
    {
        const int nThreads = blockDim.x * gridDim.x;
        const int idx = blockIdx.x * blockDim.x + threadIdx.x;

        const int outW = imgW - filtW + 1;
        const int outH = imgH - filtH + 1;
        const int nPixels = outW * outH;

        // Grid-stride loop: thread idx handles pixels idx, idx+nThreads, ...
        for (int curPixel = idx; curPixel < nPixels; curPixel += nThreads)
        {
            int x = curPixel % outW;
            int y = curPixel / outW;

            float sum = 0;
            for (int filtY = 0; filtY < filtH; filtY++)
                for (int filtX = 0; filtX < filtW; filtX++)
                {
                    int sx = x + filtX;
                    int sy = y + filtY;
                    sum += img[sy*imgW + sx] * filt[filtY*filtW + filtX];
                }
            out[y * outW + x] = sum;
        }
    }

Setup and data transfer
- cudaMalloc: allocate memory on the GPU (global memory)
- cudaMemcpy: transfer data to and from the GPU (global memory)
- The GPU is the 'device'; the CPU is the 'host'
- Kernel call syntax: kernel<<<grid, block>>>(arguments);
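Not shown on the slides: every CUDA runtime call returns a cudaError_t, and checking it is cheap insurance. A minimal sketch (the CUDA_CHECK macro name is my own):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                \
        do {                                                \
            cudaError_t err = (call);                       \
            if (err != cudaSuccess) {                       \
                fprintf(stderr, "CUDA error: %s\n",         \
                        cudaGetErrorString(err));           \
                exit(1);                                    \
            }                                               \
        } while (0)

    // usage:
    CUDA_CHECK(cudaMalloc((void**)&imgGPU, imgW * imgH * sizeof(float)));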

Example setup and data transfer, part 1

    int main()
    {
        ...
        float* img ...             // input image on the host
        int imgW, imgH ...

        float* imgGPU;
        cudaMalloc((void**)&imgGPU, imgW * imgH * sizeof(float));
        cudaMemcpy(
            imgGPU,                         // destination
            img,                            // source
            imgW * imgH * sizeof(float),    // size in bytes
            cudaMemcpyHostToDevice          // direction
        );

        float* filter ...
        int filterW, filterH ...

        float* filterGPU;
        cudaMalloc((void**)&filterGPU, filterW * filterH * sizeof(float));
        cudaMemcpy(
            filterGPU,                          // destination
            filter,                             // source
            filterW * filterH * sizeof(float),  // size in bytes
            cudaMemcpyHostToDevice              // direction
        );

Examl e setup and data transf er 2 int resultW = imgW – filterW + 1; int resultH = imgH – filterH + 1; float* result = (float*) malloc(resultW * resultH * sizeof(float)); float* resultGPU; cudaMalloc((void**) &resultGPU, resultW * resultH * sizeof(float)); /* Call the GPU kernel */ dim3 block(128); dim3 grid(30); Convolve >> ( imgGPU, imgW, imgH, filterGPU, filterW, filterH, resultGPU ); cudaMemcpy( result, // Desination resultGPU, // Source resultW * resultH * sizeof(float), // Size in bytes cudaMemcpyDeviceToHost // Direction ); cudaThreadExit();... }

Speedup
- Benchmark: linear combination of 3 filters, each 15x15; image size 2k x 2k
- CPU: one core at 2.0 GHz: 6.58 s (0.89 Mpixels/s)
- GPU: Tesla S1070 (GT200), 30 SMs, 240 CUDA cores, 1.3 GHz: 0.21 s (about 28 Mpixels/s)
- 31 times faster!

CUDA compute capabilities
- 1.0: GeForce 8800 Ultra/GTX/GTS
- 1.1: GeForce 9800 GT/GTX/GTS; adds atomic instructions, ...
- 1.2: GeForce GT 2xx series
- 1.3: Tesla S1070, C1060; GeForce GTX 275, 285; adds double precision (slow), ...
- 2.0: Tesla C2050, GeForce GTX 480; adds ECC, L1 and L2 caches, faster IMUL, faster atomics, faster double precision on Tesla cards, ...
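The capability of the installed card can be checked at runtime; a minimal sketch (device 0 assumed):

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("compute capability %d.%d\n", prop.major, prop.minor);
    // double precision requires capability 1.3 or higher
    bool hasDouble = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);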

CUDA essentials
developer.nvidia.com/object/cuda_3_1_downloads.html
Download:
- Driver
- Toolkit (the nvcc compiler)
- SDK (examples; recommended)
- CUDA Programming Guide
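Once the toolkit is installed, .cu files are compiled with nvcc much like ordinary C files. A minimal example (the file name is hypothetical):

    nvcc convolve.cu -o convolve              # default target
    nvcc -arch=sm_13 convolve.cu -o convolve  # target capability 1.3 (enables double precision)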

Other tools
- 'Emulator': executes on the CPU; slow
- Simple profiler
- cuda-gdb (Linux): on-device debugger
- Parallel Nsight (Windows Vista): profiler and on-device debugger

Logical thread hierarchy
- Thread ID: unique within a block
- Block ID: unique within the grid
- To get a globally unique thread ID, combine the block ID and the thread ID
- Threads can access both shared and global memory
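In code, combining the two IDs for a one-dimensional launch gives the familiar expression; the two-dimensional case below is a sketch of one possible flattening:

    // 1D launch: unique across the whole grid
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // 2D launch (hypothetical layout): flatten block and thread indices first
    int bid   = blockIdx.y * gridDim.x + blockIdx.x;
    int tid   = threadIdx.y * blockDim.x + threadIdx.x;
    int gid2D = bid * (blockDim.x * blockDim.y) + tid;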