GPU Lab 1 Discussion: A Matrix-Matrix Multiplication Example

A simple main function for the matrix multiplication example:

int main(void) {
    // 1. Allocate and initialize the matrices M, N, P;
    //    I/O to read the input matrices M and N

    // 2. Compute M * N on the device
    MatrixMultiplication(M, N, P, width);

    // 3. I/O to write the output matrix P;
    //    free matrices M, N, P
    …
    return 0;
}

Step 1: Allocate and copy

// Allocate device memory for M, N and P
// Copy M and N to the allocated device memory locations

Q1: Why do we have to allocate device memory?

In CUDA, the host and the device have separate memory spaces. Device memory is the DRAM on the GPU card; a kernel running on the device can only access data that resides there, so the input matrices must be allocated and copied to device memory before the kernel is launched.

CUDA API functions for device global memory management:

1. cudaMalloc(): allocates an object in device global memory. It takes two parameters:
   (1) the address of a pointer to the allocated object
   (2) the size of the allocated object in bytes
2. cudaFree(): frees an object from device global memory, given the pointer to that object.
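As a minimal sketch of how these two calls fit together (assuming a square width x width matrix of floats; the variable names here are illustrative, not from the lab code):

int size = width * width * sizeof(float);  // bytes needed for one matrix
float* Md = NULL;
cudaMalloc((void**)&Md, size);  // pass the *address* of the pointer
// ... use Md as the device copy of M ...
cudaFree(Md);                   // release the device global memory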

The program provides the allocation and copy functions as:

Matrix AllocateMatrix(int height, int width, int init);
    // allocate M, N, P on the device
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost);
    // transfer the pertinent data from host memory to device memory
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice);
    // copy P back from device memory when the matrix multiplication is done
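These functions pass around a Matrix struct that the slides never show. A representative definition, hedged as an assumption since it is not given here, would bundle the dimensions with the element pointer:

// Assumed shape of the Matrix struct used above (not shown on the slides)
typedef struct {
    int width;       // number of columns
    int height;      // number of rows
    float* elements; // row-major storage: element (r, c) is elements[r * width + c]
} Matrix;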

So instead of calling cudaMalloc() and cudaMemcpy() directly, I call these functions:

// 1.1 Transfer M and N to device memory
Matrix Mdevice = AllocateDeviceMatrix(M);
CopyToDeviceMatrix(Mdevice, M);
Matrix Ndevice = AllocateDeviceMatrix(N);
CopyToDeviceMatrix(Ndevice, N);

// 1.2 Allocate P on the device
Matrix Pdevice = AllocateDeviceMatrix(P);
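A plausible body for AllocateDeviceMatrix, assuming the Matrix struct sketched above (a reconstruction, not necessarily the lab's exact code):

Matrix AllocateDeviceMatrix(const Matrix M)
{
    Matrix Mdevice = M;  // copy width/height; the pointer is replaced below
    int size = M.width * M.height * sizeof(float);
    cudaMalloc((void**)&Mdevice.elements, size);
    return Mdevice;
}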

Step 2: Kernel invocation code

__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
    float Pvalue = 0;
    // the i, j loops are replaced by the y and x thread indices
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // multiply the two matrices: dot product of one row of M
    // and one column of N
    for (int k = 0; k < M.width; ++k)
        Pvalue += M.elements[row * M.width + k] * N.elements[k * N.width + col];
    P.elements[row * P.width + col] = Pvalue;
}

Q2: Where did the other two loops go?

The two outer loop levels are now replaced by the grid of threads: the original loop variables i and j become the thread coordinates (threadIdx.y and threadIdx.x, combined with the block indices), so each thread computes exactly one element of P.
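For comparison, here is a sketch of the sequential host version with all three loops written out (a standard formulation, not taken from the lab handout); only the innermost k loop survives inside the kernel:

void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
{
    for (int i = 0; i < P.height; ++i)         // becomes the row index (y)
        for (int j = 0; j < P.width; ++j) {    // becomes the column index (x)
            float sum = 0;
            for (int k = 0; k < M.width; ++k)  // kept inside the kernel
                sum += M.elements[i * M.width + k] * N.elements[k * N.width + j];
            P.elements[i * P.width + j] = sum;
        }
}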

Setup and launch

// set up the execution configuration (#define BLOCK_SIZE 16)
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(1, 1);

// launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Mdevice, Ndevice, Pdevice);
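Note that dimGrid(1, 1) launches a single 16x16 block, so this configuration only handles matrices up to BLOCK_SIZE x BLOCK_SIZE. A common extension (not part of this lab's code) rounds the grid size up to cover larger widths:

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid((width + BLOCK_SIZE - 1) / BLOCK_SIZE,
             (width + BLOCK_SIZE - 1) / BLOCK_SIZE);
// with this grid the kernel also needs a bounds check, e.g.
// if (row < P.height && col < P.width) { ... }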

Return P to the host:

cudaMemcpy(P.elements, Pdevice.elements, size, cudaMemcpyDeviceToHost);

Or call the function provided in the program:

CopyFromDeviceMatrix(P, Pdevice);
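A plausible body for CopyFromDeviceMatrix, again assuming the Matrix struct sketched earlier (a reconstruction, not necessarily the lab's exact code):

void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
{
    int size = Mdevice.width * Mdevice.height * sizeof(float);
    cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost);
}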

The last step: free the device pointers

cudaFree(Mdevice.elements);
cudaFree(Ndevice.elements);
cudaFree(Pdevice.elements);

The Questions

1. How many times is each element of the input matrix loaded during the execution of the kernel?

The answer is width. Each element of M lies in a single row, and that row is read once by each of the width threads computing that row of P; likewise, each element of N is read by the width threads computing its column. So every input element is loaded width times, once per thread that needs it. For example, with width = 4, M.elements[0] is read by the four threads computing row 0 of P.

2. What is the memory-access to floating-point computation ratio in each thread?

The ratio is one. In each iteration of the inner loop, a thread issues two global-memory loads (one element of M and one element of N) and performs two floating-point operations (one multiply and one add): two accesses for two operations, a ratio of 1.0.
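Annotating the kernel's inner loop makes the count explicit:

for (int k = 0; k < M.width; ++k)
    Pvalue += M.elements[row * M.width + k]    // global load #1
            * N.elements[k * N.width + col];   // global load #2
                                               // 1 mul + 1 add = 2 flops
// 2 loads per 2 flops in every iteration -> ratio = 1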