Published by Pentti Lehtilä. Modified over 5 years ago.
1
GPU Lab1 Discussion: A Matrix-Matrix Multiplication Example
2
A simple main function for the matrix multiplication example
int main(void) {
    // 1. Allocate and initialize the matrices M, N, P
    //    I/O to read the input matrices M and N

    // 2. M * N on the device
    MatrixMultiplication(M, N, P, width);

    // 3. I/O to write the output matrix P
    //    Free matrices M, N, P
    …
    return 0;
}
3
Step 1: Allocate and copy
// Allocate device memory for M, N and P
// Copy M and N to the allocated device memory locations
4
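Q1: Why do we have to allocate device memory?
In CUDA, the host and the device have separate memory spaces. A device is typically a hardware card with its own DRAM. A kernel running on the device works on data in device memory, so that memory must be allocated, and the input data copied into it, before the kernel is launched.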
5
CUDA API functions for device global memory management
1. cudaMalloc(): Allocates an object in device global memory. Two parameters:
   (1) Address of a pointer to the allocated object
   (2) Size of the allocated object in bytes
2. cudaFree(): Frees an object from device global memory.
   Takes the pointer to the object to be freed.
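The two calls above, together with cudaMemcpy() for the transfers, can be sketched as a simple round trip. This is a minimal illustration rather than the lab's code: the helper names `check` and `roundTrip` and the buffer names are assumptions made for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the slides): stop with a message on a CUDA error.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical helper: copy n floats to the device and back again.
// Returns true if the data read back matches the data sent.
bool roundTrip(int n) {
    size_t size = (size_t)n * sizeof(float);
    float* h_in  = (float*)std::malloc(size);
    float* h_out = (float*)std::malloc(size);
    for (int i = 0; i < n; ++i) { h_in[i] = (float)i; h_out[i] = -1.0f; }

    float* d_buf = nullptr;
    // First parameter: address of the pointer; second: size in bytes.
    check(cudaMalloc((void**)&d_buf, size), "cudaMalloc");
    check(cudaMemcpy(d_buf, h_in, size, cudaMemcpyHostToDevice), "copy to device");
    check(cudaMemcpy(h_out, d_buf, size, cudaMemcpyDeviceToHost), "copy to host");
    check(cudaFree(d_buf), "cudaFree");

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h_in[i] != h_out[i]) ok = false;
    }
    std::free(h_in);
    std::free(h_out);
    return ok;
}

int main() {
    std::printf("round trip %s\n", roundTrip(256) ? "ok" : "FAILED");
    return 0;
}
```

Note that the device pointer is only ever passed to CUDA API calls on the host side; dereferencing it in host code would be an error, which is the practical consequence of the separate memory spaces.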
6
The lab program provides these functions for allocation and copying:
// allocate M, N, P on the device
Matrix AllocateMatrix(int height, int width, int init)
// transfer pertinent data from the host memory to the device memory
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost)
// copy P from the device memory when the matrix multiplication is done
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
7
So I called these functions instead of calling cudaMalloc() and cudaMemcpy() directly:
// 1.1 Transfer M and N to device memory
Matrix Mdevice = AllocateDeviceMatrix(M);
CopyToDeviceMatrix(Mdevice, M);
Matrix Ndevice = AllocateDeviceMatrix(N);
CopyToDeviceMatrix(Ndevice, N);
// 1.2 Allocate P on the device
Matrix Pdevice = AllocateDeviceMatrix(P);
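The slides do not show how these helper functions are implemented. One plausible sketch, assuming a `Matrix` struct with `width`, `height`, and `elements` fields (the layout is an assumption; only `width` and `elements` appear in the slides), wraps cudaMalloc() and cudaMemcpy() as follows. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed layout: the slides use Matrix but do not show its definition.
struct Matrix {
    int width;
    int height;
    float* elements;  // row-major storage
};

// Size of the element array in bytes.
static size_t bytesOf(const Matrix M) {
    return (size_t)M.width * M.height * sizeof(float);
}

// Allocate a device matrix with the same dimensions as M.
Matrix AllocateDeviceMatrix(const Matrix M) {
    Matrix Mdevice = M;          // copies width and height
    Mdevice.elements = nullptr;
    cudaMalloc((void**)&Mdevice.elements, bytesOf(M));
    return Mdevice;
}

// Host -> device copy of the element array.
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost) {
    cudaMemcpy(Mdevice.elements, Mhost.elements, bytesOf(Mhost),
               cudaMemcpyHostToDevice);
}

// Device -> host copy of the element array.
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice) {
    cudaMemcpy(Mhost.elements, Mdevice.elements, bytesOf(Mdevice),
               cudaMemcpyDeviceToHost);
}

int main() {
    float in[4]  = {1.0f, 2.0f, 3.0f, 4.0f};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    Matrix M = {2, 2, in};
    Matrix R = {2, 2, out};

    Matrix Mdevice = AllocateDeviceMatrix(M);
    CopyToDeviceMatrix(Mdevice, M);
    CopyFromDeviceMatrix(R, Mdevice);   // read the same data back
    cudaFree(Mdevice.elements);

    std::printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

Passing `Matrix` by value works here because the struct only holds the dimensions and a pointer; the element data itself is never copied by the function call.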
8
Step 2: Kernel code
__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
    float Pvalue = 0;
    // The i, j loops are replaced by the thread's row and column indices
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Multiply row `row` of M by column `col` of N
    for (int k = 0; k < M.width; ++k)
        Pvalue += M.elements[row * M.width + k] * N.elements[k * N.width + col];
    P.elements[row * P.width + col] = Pvalue;
}
9
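Q2: Where do the other two loops go?
The other two loop levels are replaced by the grid of threads: each thread computes one element of P. The original loop variables “i” and “j” are replaced by the row and column indices that each thread derives from blockIdx, blockDim and threadIdx (just threadIdx.y and threadIdx.x when there is a single block).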
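For comparison, a sequential host version makes the two missing loops visible. This is a sketch for illustration only; the function name `matMulHost` and the use of plain float arrays instead of the `Matrix` type are assumptions made for this example.

```cuda
#include <cstdio>

// Sequential reference: P = M * N for square, row-major width x width matrices.
void matMulHost(const float* M, const float* N, float* P, int width) {
    for (int i = 0; i < width; ++i) {          // row of P: `row` in the kernel
        for (int j = 0; j < width; ++j) {      // column of P: `col` in the kernel
            float Pvalue = 0.0f;
            for (int k = 0; k < width; ++k) {  // this loop stays inside each thread
                Pvalue += M[i * width + k] * N[k * width + j];
            }
            P[i * width + j] = Pvalue;
        }
    }
}

int main() {
    const float M[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float N[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    float P[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    matMulHost(M, N, P, 2);
    std::printf("%.0f %.0f\n%.0f %.0f\n", P[0], P[1], P[2], P[3]);
    return 0;
}
```

Each (i, j) pair in the two outer loops corresponds to one thread in the kernel, so only the inner k loop survives in the device code.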
10
Setup and launch
// Set up the execution configuration (BLOCK_SIZE is #defined as 16)
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(1, 1);  // a single block, so this covers a BLOCK_SIZE x BLOCK_SIZE matrix
// Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Mdevice, Ndevice, Pdevice);
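A single-block grid only covers matrices up to BLOCK_SIZE x BLOCK_SIZE. As a sketch of how the configuration might be generalized, the example below sizes the grid by ceiling division and adds a bounds check to the kernel. It goes beyond what the slides show: the names `matMulKernel` and `blocksFor`, the raw-pointer parameters, and the test width of 40 are all assumptions made for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// Same computation as the slide's kernel, with raw pointers and a bounds check.
__global__ void matMulKernel(const float* M, const float* N, float* P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= width || col >= width) return;  // threads past the matrix edge do nothing
    float Pvalue = 0.0f;
    for (int k = 0; k < width; ++k)
        Pvalue += M[row * width + k] * N[k * width + col];
    P[row * width + col] = Pvalue;
}

// Blocks needed along one grid dimension (ceiling division).
int blocksFor(int width) {
    return (width + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

int main() {
    const int width = 40;  // deliberately not a multiple of BLOCK_SIZE
    const size_t size = (size_t)width * width * sizeof(float);
    std::vector<float> M(width * width, 1.0f), N(width * width, 2.0f), P(width * width, 0.0f);

    float *Md = nullptr, *Nd = nullptr, *Pd = nullptr;
    cudaMalloc((void**)&Md, size);
    cudaMalloc((void**)&Nd, size);
    cudaMalloc((void**)&Pd, size);
    cudaMemcpy(Md, M.data(), size, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N.data(), size, cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(blocksFor(width), blocksFor(width));
    matMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(P.data(), Pd, size, cudaMemcpyDeviceToHost);

    // With M all ones and N all twos, every element of P should equal 2 * width.
    std::printf("P[0] = %.1f, P[last] = %.1f\n", P[0], P[width * width - 1]);

    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
    return 0;
}
```

Without the bounds check, the extra threads in the last row and column of blocks would read and write outside the matrices whenever width is not a multiple of BLOCK_SIZE.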
11
Return P to the host
cudaMemcpy(P.elements, Pdevice.elements, size, cudaMemcpyDeviceToHost);
Or call the function provided in the program:
CopyFromDeviceMatrix(P, Pdevice);
12
The last step: free the device pointers
cudaFree(Mdevice.elements);
cudaFree(Ndevice.elements);
cudaFree(Pdevice.elements);
13
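The Questions
1. How many times is each element of the input matrix loaded during the execution of the kernel?
The answer is width. An element of M in row i is needed for every output element in row i of P, and there are width of those, each computed by a different thread. Likewise, an element of N in column j is needed for every output element in column j of P. Each thread loads its operands from global memory independently, so each input element is loaded width times.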
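The count can be checked with a small host-side simulation that replays the index pattern of the kernel and tallies every read. This is an illustrative sketch, not part of the lab; the function name `eachElementLoadedWidthTimes` is an assumption made for this example.

```cuda
#include <cstdio>
#include <vector>

// Replays the kernel's access pattern on the host and counts reads.
// Returns true if every element of M and of N is read exactly `width` times.
bool eachElementLoadedWidthTimes(int width) {
    std::vector<int> loadsM(width * width, 0);
    std::vector<int> loadsN(width * width, 0);

    // One (row, col) pair per thread, as in the kernel.
    for (int row = 0; row < width; ++row) {
        for (int col = 0; col < width; ++col) {
            for (int k = 0; k < width; ++k) {
                ++loadsM[row * width + k];  // thread reads M[row][k]
                ++loadsN[k * width + col];  // thread reads N[k][col]
            }
        }
    }

    for (int idx = 0; idx < width * width; ++idx) {
        if (loadsM[idx] != width || loadsN[idx] != width) return false;
    }
    return true;
}

int main() {
    const int width = 16;
    std::printf("each element loaded width times: %s\n",
                eachElementLoadedWidthTimes(width) ? "yes" : "no");
    return 0;
}
```

The simulation counts loads issued by the program; it says nothing about hardware caching, which may serve some of those loads without going to DRAM.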
14
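2. What is the memory-access to floating-point computation ratio in each thread?
The ratio is one. Each iteration of the k loop performs two global memory loads (one element of M and one element of N) and two floating-point operations (one multiply and one add).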