ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications
CUDA API Sept. 18, 2008

Before we get started… Today Last Time The CUDA API
Start discussing CUDA programming model Today HW2 is due on Friday, 11:59 PM HW3 available on the class website (due Sept.25) A look at a matrix multiplication example The CUDA execution model 2

Going back to the G80 HW… split personality n.
Two distinct personalities in the same entity, each of which prevails at a particular time. 3 HK-UIUC

Calling a Kernel Function, and the Concept of Execution Configuration
Last slide of previous lecture Calling a Kernel Function, and the Concept of Execution Configuration A kernel function must be called with an execution configuration: __global__ void KernelFunc(...); // declaration dim3 DimGrid(100, 50); // 5000 thread blocks dim3 DimBlock(4, 8, 8); // 256 threads per block size_t SharedMemBytes = 64; // 64 bytes of shared memory KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...); Any call to a kernel function is asynchronous from CUDA 1.0 on, explicit sync needed for blocking 4 HK-UIUC

G80 Thread Computing Pipeline
The future of GPUs is programmable processing So – build the architecture around the processor Processors execute computing threads Alternative operating mode specifically for computing Load/store Global Memory Thread Execution Manager Input Assembler Host Texture Parallel Data Cache L2 FB SP L1 TF Thread Processor Vtx Thread Issue Setup / Rstr / ZCull Geom Thread Issue Pixel Thread Issue Input Assembler Host 5 HK-UIUC

Simple Example: Matrix Multiplication
A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs Leave shared memory usage until later Local variable and register usage Thread ID usage Memory data transfer API between host and device 6 HK-UIUC

Square Matrix Multiplication Example
P = M * N of size WIDTH x WIDTH Software Design Decisions: One thread handles one element of P Each thread will access entries in M and N WIDTH times from global memory WIDTH M P WIDTH WIDTH WIDTH HK-UIUC 7

Step 1: Matrix Data Transfers (Host Side)
// Allocate the device memory where we will copy M to // Although not shown, you do the same for N and P matrices Matrix Md; #define WIDTH 16 Md.width = WIDTH; Md.height = WIDTH; Md.pitch = WIDTH; int size = WIDTH * WIDTH * sizeof(float); cudaMalloc((void**)&Md.elements, size); // Copy M from the host to the device cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice); ...//do your work here… (see next slides) // Read the result from the device to the host into P cudaMemcpy(P.elements, Pd.elements, size, cudaMemcpyDeviceToHost); // Free device memory cudaFree(Md.elements); 8 HK-UIUC

Step 2: Matrix Multiplication A Simple Host Code in C
// Matrix multiplication on the (CPU) host in double precision; void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P) { for (int i = 0; i < M.height; ++i) { for (int j = 0; j < N.width; ++j) { double sum = 0; for (int k = 0; k < M.width; ++k) { double a = M.elements[i * M.width + k]; //you’ll see a lot of this… double b = N.elements[k * N.width + j]; // and of this as well… sum += a * b; } P.elements[i * N.width + j] = sum; 9 HK-UIUC

Step 3: Matrix Multiplication, Host-side. Main Program Code
int main(void) { // Allocate and initialize the matrices. // The last argument in AllocateMatrix: should an initialization with // random numbers be done? Yes: 1. No: 0 (everything is set to zero) Matrix M = AllocateMatrix(WIDTH, WIDTH, 1); Matrix N = AllocateMatrix(WIDTH, WIDTH, 1); Matrix P = AllocateMatrix(WIDTH, WIDTH, 0); // M * N on the device MatrixMulOnDevice(M, N, P); // Free matrices FreeMatrix(M); FreeMatrix(N); FreeMatrix(P); return 0; } 10 HK-UIUC

Step 3: Matrix Multiplication Host-side code
// Matrix multiplication on the device void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P) { // Load M and N to the device Matrix Md = AllocateDeviceMatrix(M); CopyToDeviceMatrix(Md, M); Matrix Nd = AllocateDeviceMatrix(N); CopyToDeviceMatrix(Nd, N); // Allocate P on the device Matrix Pd = AllocateDeviceMatrix(P); CopyToDeviceMatrix(Pd, P); // clear memory // Setup the execution configuration dim3 dimGrid(1, 1); dim3 dimBlock(WIDTH, WIDTH); // Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd); // Read P from the device CopyFromDeviceMatrix(P, Pd); // Free device matrices FreeDeviceMatrix(Md); FreeDeviceMatrix(Nd); FreeDeviceMatrix(Pd); } Continue here… Continue here… 11 HK-UIUC

Multiply Using One Thread Block
Grid 1 One Block of threads computes matrix P Each thread computes one element of P Each thread Loads a row of matrix M Loads a column of matrix N Perform one multiply and addition for each pair of M and N elements Compute to off-chip memory access ratio close to 1:1 Not that good… Size of matrix limited by the number of threads allowed in a thread block Block 1 Thread (2, 2) 48 M P BLOCK_SIZE 12 HK-UIUC

Step 4: Matrix Multiplication- Device-side Kernel Function
// Matrix multiplication kernel – thread specification __global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P) { // 2D Thread ID int tx = threadIdx.x; int ty = threadIdx.y; // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0; for (int k = 0; k < M.width; ++k) { float Melement = M.elements[ty * M.pitch + k]; float Nelement = Nd.elements[k * N.pitch + tx]; Pvalue += Melement * Nelement; } // Write the matrix to device memory; // each thread writes one element P.elements[ty * P.pitch + tx] = Pvalue; N WIDTH M P tx WIDTH ty 13 HK-UIUC WIDTH WIDTH

Step 5: Some Loose Ends // Allocate a device matrix of same size as M. Matrix AllocateDeviceMatrix(const Matrix M) { Matrix Mdevice = M; int size = M.width * M.height * sizeof(float); cudaMalloc((void**)&Mdevice.elements, size); return Mdevice; } // Copy a host matrix to a device matrix. void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost) int size = Mhost.width * Mhost.height * sizeof(float); cudaMemcpy(Mdevice.elements, Mhost.elements, size, cudaMemcpyHostToDevice); // Copy a device matrix to a host matrix. void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice) int size = Mdevice.width * Mdevice.height * sizeof(float); cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost); // Free a device matrix. void FreeDeviceMatrix(Matrix M) { cudaFree(M.elements); } void FreeMatrix(Matrix M) { free(M.elements); 14 HK-UIUC

Step 6: Handling Arbitrary Sized Square Matrices
Have each 2D thread block to compute a (BLOCK_WIDTH)2 sub-matrix (tile) of the result matrix Each has (BLOCK_WIDTH)2 threads Generate a 2D Grid of (WIDTH/BLOCK_WIDTH)2 blocks N WIDTH P M by NOTE: You still need to put a loop around the kernel call for cases where WIDTH is really large ty WIDTH bx tx WIDTH WIDTH 15 HK-UIUC

The Common Pattern to CUDA Programming
Phase 1: Allocate memory on the device and copy to the device the data required to carry out computation on the GPU Phase 2: Let the GPU crunch the numbers for you based on the kernel that you define Phase 3: Bring back the results from the GPU. Free memory on the device (clean up…). You’re done. 16

A Common Programming Pattern (Cntd
A Common Programming Pattern (Cntd.) BRINGING THE SHARED MEMORY INTO THE PICTURE Local and global memory reside in device memory (DRAM) - much slower access than shared memory An advantageous way of performing computation on the device is to block data to take advantage of fast shared memory: Partition data into data subsets that fit into shared memory Handle each data subset with one thread block by: Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element Copying results from shared memory back to global memory 17 HK-UIUC

How about the memory not in registers or shared?
Device Stream Multiprocessor N Stream Multiprocessor 2 Stream Multiprocessor 1 Device memory Shared Memory Instruction Unit Processor 1 Registers … Processor 2 Processor M Constant Cache Texture The local, global, constant, and texture spaces are regions of device memory Beyond registers and shared memory, each multiprocessor has: Read/Write global memory There is a lot of this… A read-only constant cache To speed up access to the constant memory space A read-only texture cache To speed up access to the texture memory space Global, constant, texture memories 18

A Common Programming Pattern (cont.)
Texture and Constant memory also reside in device memory (DRAM) - much slower access than shared memory But… cached! Highly efficient access for read-only data Carefully divide data according to access patterns R/O no structure  constant memory R/O array structured  texture memory R/W shared within Block  shared memory R/W registers spill to local memory R/W inputs/results  global memory 19 HK-UIUC

G80 Hardware Implementation: A Set of SIMD Multiprocessors
The device is a set of Stream Multiprocessors (14 on 8800 GT, but that’s totally irrelevant) Each multiprocessor has a collection of 32-bit Scalar Processors with a Single Instruction Multiple Data architecture – shared instruction unit At each clock cycle, a Stream Multiprocessor executes the same instruction on a group of threads called a warp The number of threads in a warp is the warp size Device Multiprocessor N Multiprocessor 2 Multiprocessor 1 Instruction Unit Scalar Processor 1 … Scalar Processor 2 Scalar Processor M 20 HK-UIUC

Hardware Implementation: Execution Model (review)
Each block of a grid is split into warps, each gets executed by one Stream Multiprocessor (SM) The device processes only one grid at a time Each block is executed by one Stream Multiprocessor The shared memory space resides in the on-chip shared memory A Stream Multiprocessor can execute multiple blocks concurrently Shared memory and registers are partitioned among the threads of all concurrent blocks So, decreasing shared memory usage (per block) and register usage (per thread) increases number of blocks that can run concurrently, which is good Currently, up to eight blocks can be processed concurrently (this is bound to change, as the HW changes…) The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread indices with the first warp containing thread 0. 21

ME964 High Performance Computing for Engineering Applications

Similar presentations

Presentation on theme: "ME964 High Performance Computing for Engineering Applications"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

ME964 High Performance Computing for Engineering Applications

Similar presentations

Presentation on theme: "ME964 High Performance Computing for Engineering Applications"— Presentation transcript:

Similar presentations

About project

Feedback