
More CUDA Examples

Different Levels of Parallelism

Thread parallelism – each thread is an independent thread of execution
Data parallelism – across threads in a block; across blocks in a kernel
Task parallelism – different blocks are independent; independent kernels

Thread IDs

Each thread that executes the kernel is given a unique thread ID, accessible within the kernel through the built-in threadIdx variable.
– threadIdx is a 3-component vector, so threads can be identified using a one-, two-, or three-dimensional thread index, forming a one-, two-, or three-dimensional thread block.
– This provides a natural way to invoke computation across the elements of a domain such as a vector, matrix, or volume.

Block ID: 1D or 2D – blockIdx.{x,y}
Thread ID: 1D, 2D, or 3D – threadIdx.{x,y,z}
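
A minimal sketch (not from the original slides) that makes these built-ins visible by printing each thread's coordinates; note that device-side printf requires compute capability 2.0 or later:

#include <cstdio>

// Each thread prints its own block and thread coordinates.
__global__ void showIds()
{
    printf("block (%d,%d)  thread (%d,%d,%d)\n",
           blockIdx.x, blockIdx.y,
           threadIdx.x, threadIdx.y, threadIdx.z);
}

int main()
{
    dim3 grid(2, 2);      // 2D grid of blocks
    dim3 block(2, 2, 2);  // 3D block of threads
    showIds<<<grid, block>>>();
    cudaDeviceSynchronize();  // wait so the device output is flushed
    return 0;
}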

A general guideline is that a block should consist of at least 192 threads in order to hide memory access latency; 256 and 512 threads per block are common, practical choices.

The following kernel uses one block with N threads. The simplest choice is to have each thread calculate one, and only one, element of the final result array:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C);

Here, each of the N threads that execute VecAdd() performs one pair-wise addition.
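
The slide shows only the kernel and its launch. A minimal host-side sketch (an addition, not part of the original slides; N = 256 is a hypothetical compile-time constant) that allocates device memory, copies data, and launches the kernel might look like this:

#include <cuda_runtime.h>

#define N 256  // number of elements (illustrative value)

__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

int main()
{
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2 * i; }

    // Allocate device memory.
    float *dA, *dB, *dC;
    size_t bytes = N * sizeof(float);
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Copy inputs to the device.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    VecAdd<<<1, N>>>(dA, dB, dC);  // one block of N threads

    // Copy the result back and clean up.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}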

The number of threads per block and the number of blocks per grid specified in the <<< >>> syntax can be of type int or dim3. The dimensions of the thread block are accessible within the kernel through the built-in blockDim variable.

Suppose we have 10,240 elements and 256 threads per block. Then the number of blocks required = 10,240 / 256 = 40.

An array of 16 elements divided into 4 blocks (N=16, blockDim=4 -> 4 blocks):

blockIdx.x=0  blockDim.x=4  threadIdx.x=0,1,2,3  idx=0,1,2,3
blockIdx.x=1  blockDim.x=4  threadIdx.x=0,1,2,3  idx=4,5,6,7
blockIdx.x=2  blockDim.x=4  threadIdx.x=0,1,2,3  idx=8,9,10,11
blockIdx.x=3  blockDim.x=4  threadIdx.x=0,1,2,3  idx=12,13,14,15

Each thread computes its global data index as:

int idx = blockDim.x * blockIdx.x + threadIdx.x;
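
Putting this global index to work, the one-block VecAdd from the previous slide generalizes to multiple blocks. A sketch (the round-up launch arithmetic is an illustrative addition, not from the slides):

__global__ void VecAdd(float* A, float* B, float* C, int n)
{
    // Global index: each block covers a contiguous chunk of the array.
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)  // guard: the last block may be partly idle
        C[idx] = A[idx] + B[idx];
}

// Launch: round the block count up so every element is covered.
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, n);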

2D Examples: Add Two Matrices

Case 1: matrix dimensions and block dimensions are the same
– Works for small matrices only, since the whole matrix must fit in a single thread block (at most 1,024 threads on compute capability 2.x)
– No. of blocks needed: 1
– dim3 threadsPerBlock(rows, columns)
– dim3 blocksPerGrid(1)
– Kernel invocation: AddMatrix<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, cols)

The following code adds two matrices A and B of size NxN and stores the result into matrix C:

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

// N * N must not exceed 1024, the maximum number of threads per block

Case 2: MatAdd() example extended to handle multiple blocks. The total number of threads equals the number of threads per block times the number of blocks.

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

Block dimension      Thread index   Thread ID    Data index
1D                   (x)            x            i = blockIdx.x * blockDim.x + threadIdx.x
2D, size (Dx, Dy)    (x, y)         x + y * Dx   i = blockIdx.x * blockDim.x + threadIdx.x
                                                 j = blockIdx.y * blockDim.y + threadIdx.y

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core.

If
– Max number of threads per block: 512
– Max number of blocks per streaming multiprocessor: 8
– Number of streaming multiprocessors: 30

Total number of threads available = 30 x 8 x 512 = 122,880

Technical Specifications

Compute Capability 1.x
– Thread block dimension: 1D, 2D, or 3D
– Grid dimension: 1D or 2D; max x- or y-dimension of a grid: 65,535
– Max x- or y-dimension of a block: 512; max z-dimension: 64
– Max threads per block: 512

Compute Capability 2.x
– Thread block dimension: 1D, 2D, or 3D
– Grid dimension: 1D, 2D, or 3D; max dimension of a grid: 65,535
– Max x- or y-dimension of a block: 1024; max z-dimension: 64
– Max threads per block: 1024
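
These limits need not be hard-coded: a small sketch (an addition to the slides, assuming device 0) queries them at runtime through the CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    return 0;
}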

Matrix Multiplication

(M X P) * (P X N) => (M X N)

Memory layout of a matrix: in C and CUDA a 2D matrix is linearized into memory row by row (row-major order), so element (row, col) of a matrix with N columns sits at linear index row * N + col. The host and device code below index matrices this way.

[Figure: a 4 x 4 matrix M0,0 ... M3,3 shown with its rows laid out one after another in linear memory.]

Matrix multiplication on the host: mat1 is M x P, mat2 is P x N, and the result R is M x N.

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* mat1, float* mat2, float* R, int M, int P, int N)
{
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0;
            for (int k = 0; k < P; ++k) {
                double a = mat1[i * P + k];  // element (i, k) of mat1
                double b = mat2[k * N + j];  // element (k, j) of mat2
                sum += a * b;
            }
            R[i * N + j] = sum;
        }
}

Matrix multiplication on GPU: each thread calculates one value in the resulting matrix.

__global__ void MatrixMulOnDevice(float* mat1, float* mat2, float* R, int M, int P, int N)
{
    float sum = 0;
    int row = threadIdx.y;
    int col = threadIdx.x;
    for (int k = 0; k < P; ++k) {
        float a = mat1[row * P + k];
        float b = mat2[k * N + col];
        sum += a * b;
    }
    R[row * N + col] = sum;
}

// Invocation: a single block of n x m threads, one thread per result element
dim3 dimBlock(n, m);
MatrixMulOnDevice<<<1, dimBlock>>>(A, B, C, m, p, n);

Limitation:
– The size of the matrix is limited by the number of threads allowed in a thread block

Solution: use multiple thread blocks
– Kernel invocation:

int threads = 16;  // 16 x 16 = 256 threads per block (64 x 64 would exceed the 1,024-thread limit)
dim3 threadsPerBlock(threads, threads);
dim3 blocksPerGrid(n / threads, m / threads);  // x covers the n columns, y the m rows
Multiply<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, m, p, n);

– Thread IDs:

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
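
For completeness, a sketch of the corresponding multi-block kernel (the bounds check is an addition, needed when the matrix dimensions are not exact multiples of the block size):

__global__ void Multiply(float* mat1, float* mat2, float* R, int M, int P, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {  // guard against out-of-range threads
        float sum = 0;
        for (int k = 0; k < P; ++k)
            sum += mat1[row * P + k] * mat2[k * N + col];
        R[row * N + col] = sum;
    }
}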

Another solution:
– Give each thread more work
– Instead of doing one operation, each thread is assigned more jobs: a tile of WIDTH * WIDTH entries (a sketch follows below)
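
One way to read this slide is thread coarsening: each thread computes a WIDTH x WIDTH sub-block of the result instead of a single element. A sketch under that assumption (the kernel name and WIDTH value are illustrative, not from the slides):

#define WIDTH 4  // each thread computes a WIDTH x WIDTH tile of R (illustrative value)

__global__ void MultiplyCoarsened(float* mat1, float* mat2, float* R, int M, int P, int N)
{
    // Top-left corner of this thread's tile in the result matrix.
    int baseRow = (blockIdx.y * blockDim.y + threadIdx.y) * WIDTH;
    int baseCol = (blockIdx.x * blockDim.x + threadIdx.x) * WIDTH;

    for (int r = 0; r < WIDTH; ++r)
        for (int c = 0; c < WIDTH; ++c) {
            int row = baseRow + r, col = baseCol + c;
            if (row < M && col < N) {  // skip entries outside the matrix
                float sum = 0;
                for (int k = 0; k < P; ++k)
                    sum += mat1[row * P + k] * mat2[k * N + col];
                R[row * N + col] = sum;
            }
        }
}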

Question??

Write a program to implement the kernel function Increment(a[], b):
– The function increments each element of the array a by b units
– The array size should be determined dynamically
– No. of threads per block: 256
– The number of blocks should be computed dynamically, depending on the size of the array
– Each thread should perform the increment operation on one array element
(A possible sketch follows below.)

Do the same for a two-dimensional array
– With one block
– With a number of blocks
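
A possible sketch of the one-dimensional case (one solution among many; the default element count and initialization are illustrative assumptions):

#include <cstdlib>
#include <cuda_runtime.h>

// Each thread increments exactly one element of a by b.
__global__ void Increment(float a[], float b, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)
        a[idx] += b;
}

int main(int argc, char** argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1024;  // array size chosen at run time
    float b = 5.0f;                              // increment amount (illustrative)

    float* dA;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMemset(dA, 0, n * sizeof(float));  // start from an all-zero array

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // computed from n
    Increment<<<blocks, threadsPerBlock>>>(dA, b, n);

    cudaDeviceSynchronize();
    cudaFree(dA);
    return 0;
}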