
1 GPU Tutorial: How To Program for GPUs
High Performance Computing with GPUs: An Introduction
Krešimir Ćosić, University of Split, Croatia
Thursday, August 12th, 2010. LSST All Hands Meeting 2010, Tucson, AZ

2 Overview
- CUDA
- Hardware architecture
- Programming model
- Convolution on GPU

3 CUDA: 'Compute Unified Device Architecture'
- Hardware and software architecture for issuing and managing computations on the GPU
- Massively parallel architecture: over 8,000 concurrent threads are common
- C for CUDA (C++ for CUDA): the C/C++ language with some additions and restrictions
- Enables GPGPU: 'General Purpose Computing on GPUs'
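To make "some additions" concrete, here is a minimal sketch (not from the slides; the function names are illustrative) of the most visible extensions: the __global__ and __device__ function qualifiers and the <<<grid, block>>> launch syntax.

    // Minimal sketch of the C-for-CUDA language additions.
    __device__ float square(float x)                     // __device__: callable from GPU code only
    {
        return x * x;
    }

    __global__ void squareAll(float* data, int n)        // __global__: a kernel, launched from the host
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in index variables
        if (i < n)
            data[i] = square(data[i]);
    }

    // Host side: the <<<blocks, threadsPerBlock>>> launch syntax is also an addition:
    // squareAll<<<32, 128>>>(dataGPU, n);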

4 GPU: a multithreaded coprocessor
- SM: streaming multiprocessor, containing 32 SPs (or 16, 48 or more)
- SP: scalar processor ('CUDA core'), executes one thread
- Fast local 'shared memory' per SM, shared between its SPs: 16 KiB (or 64 KiB)
- Global memory (on device), shared by all SMs
[diagram: SMs, each with SPs and shared memory, connected to global memory]
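To make the memory hierarchy concrete, a small hedged sketch (not part of the original slides) of a kernel staging data in the fast per-SM shared memory; the kernel name and tile size are illustrative and assume 128 threads per block and a data length that is a multiple of 128.

    // Illustrative sketch: staging data in the SM's fast shared memory.
    __global__ void scaleWithSharedMemory(float* data, float factor)
    {
        __shared__ float tile[128];                   // lives in the SM's shared memory

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = data[i];                  // each thread loads one element from global memory
        __syncthreads();                              // wait until the whole block has loaded its tile

        data[i] = tile[threadIdx.x] * factor;         // work on the on-chip copy, write back to global memory
    }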

5 GDDR memory ('global memory'): 512 MiB - 6 GiB
GPU: many SMs
- 30 SMs on GT200
- 14 SMs on Fermi (e.g. Tesla C2050)
- For example, GTX 470: 14 SMs x 32 cores = 448 cores on a GPU
[diagram: SMs with shared memory, connected to on-device global memory]
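The SM count, memory size and compute capability of whatever card is installed can be queried at runtime; a minimal sketch (not from the slides) using the CUDA runtime API:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);            // properties of device 0

        printf("%s: %d SMs, %zu MiB global memory, compute capability %d.%d\n",
               prop.name,
               prop.multiProcessorCount,
               prop.totalGlobalMem / (1024 * 1024),
               prop.major, prop.minor);
        return 0;
    }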

6 How To Program For GPUs
- Parallelization: decomposition to threads
- Memory: shared memory, global memory
[diagram: SMs with shared memory, connected to on-device global memory]

7 Important Things To Keep In Mind
- Avoid divergent branches: threads of a single SM must execute the same code; code that branches heavily and unpredictably will execute slowly
- Threads should be as independent as possible: synchronization and communication can be done efficiently only for threads of a single multiprocessor
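A small hedged sketch (not from the slides) of what a divergent branch looks like, and an equivalent form that avoids it; the kernels and names are illustrative.

    // Divergence happens when threads of one warp take different paths.
    __global__ void divergent(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)                  // even and odd threads of the same warp
            data[i] = data[i] * 2.0f;    // take different branches: the two paths
        else                             // are serialized, roughly halving throughput
            data[i] = data[i] + 1.0f;
    }

    __global__ void uniform(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (blockIdx.x % 2 == 0)         // all threads of a block (and hence of a warp)
            data[i] = data[i] * 2.0f;    // take the same branch: no divergence
        else
            data[i] = data[i] + 1.0f;
    }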

8 How To Program For GPUs
- Parallelization: decomposition to threads (enormous processing power)
- Memory: shared memory, global memory
- Avoid divergence
- Thread communication: synchronization, no interdependencies
[diagram: SMs with shared memory, connected to on-device global memory]

9 Programming model

10 Thread blocks
- Threads are grouped in thread blocks: 128, 192 or 256 threads in a block
- One thread block executes on one SM
  - All its threads share that SM's 'shared memory'
  - 32 threads are executed simultaneously (a 'warp')
[diagram: a thread block shown as a 2D arrangement of threads (0,0) through (1,2)]

11 Thread blocks
- Blocks form a grid
- Blocks execute on SMs: they execute in parallel, and they execute independently!
- Thread ID is unique within a block; block ID is unique within the grid
[diagram: a 3x3 grid of blocks (BLOCK 0 through BLOCK 8), each containing threads]

12 Code that executes on GPU: Kernels
- A kernel is a simple C function
- It executes on the GPU
- It executes in parallel, as many times as there are threads
- The keyword __global__ tells the compiler to make a function a kernel (and compile it for the GPU instead of the CPU)
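As a warm-up before the convolution kernel on the following slides, a minimal hedged sketch of a kernel (the name and launch configuration are illustrative, not from the slides); each thread handles one array element.

    // Minimal illustrative kernel: each thread scales one element of the array.
    __global__ void scale(float* data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // globally unique thread index
        if (i < n)                                       // guard: the grid may be slightly larger than n
            data[i] = data[i] * factor;
    }

    // Launched from the host, e.g. with 128 threads per block:
    // scale<<<(n + 127) / 128, 128>>>(dataGPU, n, 2.0f);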

13 Convolution
To get one pixel of the output image:
- multiply the mask (pixelwise) with the image at the corresponding position
- sum the products
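In symbols (my own summary, consistent with the kernel on the next slide): out(x, y) = sum over (filtX, filtY) of img(x + filtX, y + filtY) * filt(filtX, filtY). A plain CPU reference for one output pixel, as a hedged sketch; the function name is hypothetical but the variable names follow the kernel below.

    /* CPU reference computation of one output pixel. */
    float convolveOnePixel(const float* img, int imgW,
                           const float* filt, int filtW, int filtH,
                           int x, int y)
    {
        float sum = 0;
        for (int filtY = 0; filtY < filtH; filtY++)
            for (int filtX = 0; filtX < filtW; filtX++)
                sum += img[(y + filtY) * imgW + (x + filtX)] * filt[filtY * filtW + filtX];
        return sum;
    }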

14 Kernel - example code, part 1

    __global__ void Convolve(float* img, int imgW, int imgH,
                             float* filt, int filtW, int filtH,
                             float* out)
    {
        const int nThreads = blockDim.x * gridDim.x;
        const int idx      = blockIdx.x * blockDim.x + threadIdx.x;

        const int outW    = imgW - filtW + 1;
        const int outH    = imgH - filtH + 1;
        const int nPixels = outW * outH;

        for (int curPixel = idx; curPixel < nPixels; curPixel += nThreads)
        {
            int x = curPixel % outW;
            int y = curPixel / outW;

            float sum = 0;
            for (int filtY = 0; filtY < filtH; filtY++)
                for (int filtX = 0; filtX < filtW; filtX++)
                {
                    int sx = x + filtX;
                    int sy = y + filtY;
                    sum += img[sy * imgW + sx] * filt[filtY * filtW + filtX];
                }
            out[y * outW + x] = sum;
        }
    }

For comparison, the corresponding CPU version would loop over all output pixels:

    for (int y = 0; y < outH; y++)
        for (int x = 0; x < outW; x++) { ... }

15 Setup and data transfer
- GPU is the 'device', CPU is the 'host'
- cudaMemcpy: transfer data to and from the GPU (global memory)
- cudaMalloc: allocate memory on the GPU (global memory)
- Kernel call syntax: KernelName<<<grid, block>>>(arguments)

16 Example setup and data transfer, part 1

    int main()
    {
        ...
        float* img ...
        int imgW, imgH ...

        float* imgGPU;
        cudaMalloc((void**)&imgGPU, imgW * imgH * sizeof(float));
        cudaMemcpy(imgGPU,                            // destination
                   img,                               // source
                   imgW * imgH * sizeof(float),       // size in bytes
                   cudaMemcpyHostToDevice);           // direction

        float* filter ...
        int filterW, filterH ...

        float* filterGPU;
        cudaMalloc((void**)&filterGPU, filterW * filterH * sizeof(float));
        cudaMemcpy(filterGPU,                         // destination
                   filter,                            // source
                   filterW * filterH * sizeof(float), // size in bytes
                   cudaMemcpyHostToDevice);           // direction

17 Example setup and data transfer, part 2

        int resultW = imgW - filterW + 1;
        int resultH = imgH - filterH + 1;

        float* result = (float*)malloc(resultW * resultH * sizeof(float));

        float* resultGPU;
        cudaMalloc((void**)&resultGPU, resultW * resultH * sizeof(float));

        /* Call the GPU kernel */
        dim3 block(128);
        dim3 grid(30);
        Convolve<<<grid, block>>>(imgGPU, imgW, imgH,
                                  filterGPU, filterW, filterH,
                                  resultGPU);

        cudaMemcpy(result,                            // destination
                   resultGPU,                         // source
                   resultW * resultH * sizeof(float), // size in bytes
                   cudaMemcpyDeviceToHost);           // direction

        cudaThreadExit();
        ...
    }
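The slides do not show error handling; as a hedged addition, every CUDA runtime call returns a cudaError_t, and errors from a kernel launch can be picked up afterwards with cudaGetLastError, for example:

    /* Hedged addition (not on the slides): checking for errors after the launch; assumes <cstdio>. */
    cudaError_t err = cudaGetLastError();             // reports launch errors (bad configuration, ...)
    if (err != cudaSuccess)
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));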

18 [figure slide]

19 Speedup
- Linear combination of 3 filters, each sized 15x15
- Image size: 2k x 2k
- CPU: Core 2 @ 2.0 GHz (1 core)
- GPU: Tesla S1070 (GT200), 30 SMs, 240 CUDA cores, 1.3 GHz
- CPU: 6.58 s (0.89 Mpixels/s)
- GPU: 0.21 s (27.99 Mpixels/s)
- 31 times faster!

20 [figure slide]

21 CUDA compute capabilities
- 1.0: GeForce 8800 Ultra/GTX/GTS
- 1.1: GeForce 9800 GT/GTX, GTS 250 (+ atomic instructions, …)
- 1.2: GeForce GT 220
- 1.3: Tesla S1070, C1060, GeForce GTX 275, 285 (+ double precision (slow), …)
- 2.0: Tesla C2050, GeForce GTX 480, 470 (+ ECC, L1 and L2 cache, faster IMUL, faster atomics, faster double precision on Tesla cards, …)

22 CUDA essentials
developer.nvidia.com/object/cuda_3_1_downloads.html
Download:
- Driver
- Toolkit (the nvcc compiler)
- SDK (examples) (recommended)
- CUDA Programming Guide

23 Other tools
- 'Emulator': executes on the CPU; slow
- Simple profiler
- cuda-gdb (Linux): on-device debugger
- Parallel Nsight (Vista): simple profiler and on-device debugger

24 ...

25 Logical thread hierarchy
- Thread ID: unique within a block
- Block ID: unique within the grid
- To get a globally unique thread ID, combine the block ID and the thread ID
- Threads can access both shared and global memory
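For a 1D grid of 1D blocks this is the expression already used in the convolution kernel; a short hedged sketch, with an illustrative 2D variant (imageWidth is a hypothetical name, not from the slides):

    // Globally unique thread ID for a 1D grid of 1D blocks, as in the convolution kernel:
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Illustrative 2D variant: one thread per pixel of a 2D image.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int pixel = y * imageWidth + x;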

