GPU Programming and CUDA Sathish Vadhiyar High Performance Computing

GPU – Graphics Processing Unit
A single GPU consists of a large number of cores – hundreds of cores – whereas a single CPU consists of 2, 4, 8 or 12 cores.
Cores? – processing units on a chip that share at least the memory and the L1 cache.

GPU and CPU, CUDA
Typically, the GPU and the CPU coexist in a heterogeneous setting.
The "less" computationally intensive parts run on the CPU (coarse-grained parallelism), and the more intensive parts run on the GPU (fine-grained parallelism).
NVIDIA's GPU architecture is called the CUDA (Compute Unified Device Architecture) architecture, accompanied by the CUDA programming model and the CUDA C language.

Example: SERC System
The SERC system had 4 nodes.
Each node had 16 CPU cores, arranged as four quad-cores.
Each node was connected to a Tesla S1070 GPU system.

Tesla S1070
Each Tesla S1070 system has 4 Tesla GPUs (T10 processors).
The system is connected to one or two host systems via high-speed interconnects.
Each GPU has 240 cores, hence a total of 960 cores.
Frequency of processor cores – up to 1.44 GHz.
Memory – 16 GB total, 4 GB per GPU.

Tesla S1070 Architecture

Tesla T10
240 streaming processors/cores (SPs), organized as 30 streaming multiprocessors (SMs) in 10 independent processing units called Texture Processor Clusters (TPCs).
A TPC consists of 3 SMs; an SM consists of 8 SPs.
The collection of TPCs is called the Streaming Processor Array (SPA).

Tesla S1070 Architecture Details

Tesla T10 Architecture

Hierarchical Parallelism
Parallel computations are arranged as grids; one grid executes after another.
A grid consists of blocks.
Blocks are assigned to SMs: a single block is assigned to a single SM, and multiple blocks can be assigned to an SM.
A block consists of elements; elements are computed by threads.
A thread executes on an SP.

Hierarchical Parallelism

Thread Blocks
Thread block – an array of concurrent threads that execute the same program and can cooperate to compute a result.
A block consists of 1 to 512 threads and has a shape and dimensions (1D, 2D or 3D) for its threads; a thread ID has corresponding 1D, 2D or 3D indices (a sketch of 2D indexing follows).
Each SM executes up to eight thread blocks concurrently.
Threads of a thread block share memory.
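As an illustration of multidimensional thread IDs, here is a minimal sketch of a kernel that builds a global 2D index from the built-in block and thread indices; the name scale2D, the matrix layout and the launch configuration are illustrative, not from the original slides.

__global__ void scale2D(float *a, int width, int height, float factor)
{
    // global 2D indices from block and thread indices
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        a[row * width + col] *= factor;   // scale one matrix element per thread
}

/* launch: 16x16 thread blocks, enough blocks to cover the width x height matrix */
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
scale2D<<<grid, block>>>(d_a, width, height, 2.0f);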

CUDA Programming Language
A programming language for threaded parallelism on GPUs; a minimal extension of C.
A CUDA program is a serial program that calls parallel kernels: the serial code executes on the CPU, while the parallel kernels execute across a set of parallel threads on the GPU.
The programmer organizes threads into a hierarchy of thread blocks and grids.

CUDA C
Built-in variables:
 threadIdx.{x,y,z} – thread ID within a block
 blockIdx.{x,y,z} – block ID within a grid
 blockDim.{x,y,z} – number of threads within a block
 gridDim.{x,y,z} – number of blocks within a grid
kernel<<<nBlocks, nThreads>>>(args)
 Invokes the parallel kernel function on a grid of nBlocks blocks, where each block instantiates nThreads concurrent threads.

Example: Summing Up – a kernel function and its grid of thread blocks
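A minimal sketch of such a summing kernel and its launch, assuming an element-wise sum of two arrays; the names vecSum, d_a, d_b, d_c and the size n are illustrative.

__global__ void vecSum(float *a, float *b, float *c, int n)
{
    // one thread per element: global index from block and thread IDs
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

/* launch on a 1D grid covering all n elements */
int nThreads = 256;
int nBlocks = (n + nThreads - 1) / nThreads;
vecSum<<<nBlocks, nThreads>>>(d_a, d_b, d_c, n);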

General CUDA Steps
1. Copy data from CPU to GPU
2. Compute on GPU
3. Copy data back from GPU to CPU
By default, execution on the host doesn't wait for the kernel to finish.
General rules:
 Minimize data transfer between CPU and GPU
 Maximize the number of threads on the GPU

CUDA Elements
cudaMalloc – allocates memory on the device
cudaMemcpy – copies data between allocated memories, from host to device and from device to host
cudaFree – frees allocated device memory
__syncthreads() – synchronizes all threads in a block, like a barrier
A host-side sketch that puts these together follows.
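A minimal host-side sketch, assuming the vecSum kernel above and an illustrative size n; it walks through the general steps: allocate, copy in, launch, copy back, free.

int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *h_a = (float*) malloc(bytes);
float *h_b = (float*) malloc(bytes);
float *h_c = (float*) malloc(bytes);
/* ... fill h_a and h_b ... */

float *d_a, *d_b, *d_c;
cudaMalloc((void**)&d_a, bytes);                        /* allocate device memory */
cudaMalloc((void**)&d_b, bytes);
cudaMalloc((void**)&d_c, bytes);

cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    /* step 1: copy inputs to GPU */
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

int nThreads = 256;
int nBlocks = (n + nThreads - 1) / nThreads;
vecSum<<<nBlocks, nThreads>>>(d_a, d_b, d_c, n);        /* step 2: compute on GPU */

cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);    /* step 3: copy result back; */
                                                        /* this copy waits for the kernel */
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);            /* free device memory */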

EXAMPLE 1: MATRIX MULTIPLICATION

Example 1: Matrix Multiplication

Example 1
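A minimal sketch of a naive matrix-multiplication kernel in the same spirit, where each thread computes one element of the result; the name matMul, the square size N and the launch shape are illustrative, and d_A, d_B, d_C are assumed to be device arrays already allocated and filled.

#define N 512   /* illustrative square matrix size */

__global__ void matMul(float *A, float *B, float *C)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {
        float sum = 0.0f;
        // dot product of a row of A and a column of B
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

/* launch: 16x16 thread blocks tiling the N x N result */
dim3 block(16, 16);
dim3 grid((N + 15) / 16, (N + 15) / 16);
matMul<<<grid, block>>>(d_A, d_B, d_C);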

EXAMPLE 2: REDUCTION

Example: Reduction
A tree-based approach is used within each thread block.
Partial results then need to be communicated across thread blocks.
Hence, global synchronization is needed across thread blocks.

Reduction
But CUDA does not have global synchronization – it is expensive to build in hardware for a large number of GPU cores.
Solution: decompose the reduction into multiple kernel launches; a kernel launch serves as a global synchronization point.

Illustration

Host Code
int main() {
  int *h_idata, *h_odata;   /* host data */
  int *d_idata, *d_odata;   /* device data */

  int numThreadsperBlock = (n < maxThreadsperBlock) ? n : maxThreadsperBlock;
  int numBlocks = n / numThreadsperBlock;

  /* copy inputs to device memory */
  cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_odata, h_idata, numBlocks*sizeof(int), cudaMemcpyHostToDevice);

  dim3 dimBlock(numThreadsperBlock, 1, 1);
  dim3 dimGrid(numBlocks, 1, 1);
  int smemSize = numThreadsperBlock * sizeof(int);   /* shared memory for the kernel's sdata[] */
  reduce<<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata);

Host Code
  /* keep launching kernels until a single value remains;
     each launch acts as a global synchronization point */
  int s = numBlocks;
  while (s > 1) {
    numThreadsperBlock = (s < maxThreadsperBlock) ? s : maxThreadsperBlock;
    numBlocks = s / numThreadsperBlock;
    dimBlock = dim3(numThreadsperBlock, 1, 1);
    dimGrid  = dim3(numBlocks, 1, 1);
    smemSize = numThreadsperBlock * sizeof(int);
    reduce<<<dimGrid, dimBlock, smemSize>>>(d_odata, d_odata);   /* reduce the partial sums in place */
    s = s / numThreadsperBlock;
  }
}

Device Code
__global__ void reduce(int *g_idata, int *g_odata)
{
  extern __shared__ int sdata[];

  // load shared mem
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
  sdata[tid] = g_idata[i];
  __syncthreads();

  // do reduction in shared mem
  for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if ((tid % (2*s)) == 0)
      sdata[tid] += sdata[tid + s];
    __syncthreads();
  }

  // write result for this block to global mem
  if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

For more information… CUDA SDK code samples – NVIDIA

BACKUP

EXAMPLE 3: SCAN

Example: Scan, or Parallel Prefix Sum
Uses a binary tree, in two phases:
An upward reduction phase (reduce phase or up-sweep phase) – traverses the tree from the leaves to the root, forming partial sums at the internal nodes.
A down-sweep phase – traverses the tree from the root to the leaves, using the partial sums computed in the reduction phase.

Up Sweep

Down Sweep

Host Code
int main() {
  const unsigned int num_threads = num_elements / 2;   /* each thread handles two elements */

  /* cudaMalloc d_idata and d_odata */
  cudaMemcpy(d_idata, h_data, mem_size, cudaMemcpyHostToDevice);

  dim3 grid(256, 1, 1);
  dim3 threads(num_threads, 1, 1);
  unsigned int shared_mem_size = sizeof(float) * num_elements;
  scan_workefficient<<<grid, threads, shared_mem_size>>>(d_odata, d_idata, num_elements);

  cudaMemcpy(h_data, d_odata, sizeof(float) * num_elements, cudaMemcpyDeviceToHost);

  /* cudaFree d_idata and d_odata */
}

Device Code
__global__ void scan_workefficient(float *g_odata, float *g_idata, int n)
{
  // Dynamically allocated shared memory for scan kernels
  extern __shared__ float temp[];

  int thid = threadIdx.x;
  int offset = 1;

  // Cache the computational window in shared memory
  temp[2*thid]   = g_idata[2*thid];
  temp[2*thid+1] = g_idata[2*thid+1];

  // build the sum in place up the tree (up-sweep)
  for (int d = n>>1; d > 0; d >>= 1) {
    __syncthreads();
    if (thid < d) {
      int ai = offset*(2*thid+1)-1;
      int bi = offset*(2*thid+2)-1;
      temp[bi] += temp[ai];
    }
    offset *= 2;
  }

Device Code
  // scan back down the tree
  // clear the last element
  if (thid == 0) temp[n - 1] = 0;

  // traverse down the tree building the scan in place (down-sweep)
  for (int d = 1; d < n; d *= 2) {
    offset >>= 1;
    __syncthreads();
    if (thid < d) {
      int ai = offset*(2*thid+1)-1;
      int bi = offset*(2*thid+2)-1;
      float t   = temp[ai];
      temp[ai]  = temp[bi];
      temp[bi] += t;
    }
  }
  __syncthreads();

  // write results to global memory
  g_odata[2*thid]   = temp[2*thid];
  g_odata[2*thid+1] = temp[2*thid+1];
}
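As a usage note, a small host-side check of the result, assuming the kernel computes an exclusive prefix sum; h_input (a copy of the original input, kept before h_data is overwritten), reference and num_elements are illustrative names.

/* CPU reference: exclusive prefix sum of the input */
float running = 0.0f;
for (unsigned int i = 0; i < num_elements; i++) {
    reference[i] = running;
    running += h_input[i];
}

/* compare against the GPU result copied back into h_data */
int ok = 1;
for (unsigned int i = 0; i < num_elements; i++)
    if (fabsf(h_data[i] - reference[i]) > 1e-5f) { ok = 0; break; }
printf("scan %s\n", ok ? "PASSED" : "FAILED");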