Introduction to CUDA and CELL SpursEngine Multi-core Programming
Reference:
1. NVIDIA CUDA (Compute Unified Device Architecture) documents
2. Presentation of David Kirk/NVIDIA and Professor Wen-mei W. Hwu
3. IBM CELL documents
4. Toshiba SpursEngine documents
鄭羽伸, Leadtek Research Inc. (麗臺科技股份有限公司)
Nov. 13, 2009

CUDA Programming
CUDA: Compute Unified Device Architecture

Floating-Point Operations per Second for CPU and GPU

Memory Bandwidth for CPU and GPU

CUDA Architecture
A set of SIMT multiprocessors with on-chip shared memory.

Note: Each TPC (Texture Processing Cluster) contains a group of Streaming Multiprocessors (SMs), and each SM is made up of individual Streaming Processors (SPs). The original G80 had eight TPCs, with two SMs inside each TPC and eight SPs inside each SM. Each TPC was therefore composed of 16 individual streaming processors, for a total of 8 * 16 = 128 SPs across the entire die.

Grid of Thread Blocks
A kernel is executed over an NDRange by a grid of thread blocks.

Automatic Scalability
A device with more multiprocessors will automatically execute a kernel in less time than a device with fewer multiprocessors.

G80 CUDA mode
© David Kirk/NVIDIA and Wen-mei W. Hwu, Taiwan, June 30-July 2, 2008

Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs (partial differential equations) on volumes
– …
[Figure: Device running Grid 1, made of Blocks (0, 0) to (2, 1); Block (1, 1) is expanded into Threads (0, 0) to (4, 2)]

Block IDs and Thread IDs
Conceptually, a grid of M x N blocks with W x Z threads each covers the same index space as four nested loops, except that the iterations run in parallel:
    for (bi = 0; bi < M; bi++) {
        for (bj = 0; bj < N; bj++) {
            for (ti = 0; ti < W; ti++) {
                for (tj = 0; tj < Z; tj++) {
                    // Thread processing (bi, bj, ti, tj)
                }
            }
        }
    }
[Figure: Device running Grid 1, made of Blocks (0, 0) to (2, 1); Block (1, 1) is expanded into Threads (0, 0) to (4, 2)]
Note: The number of threads in a block is limited. The limit is 512 or higher, depending on the device.
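The nested-loop view of block and thread IDs can be tried out on the CPU. The following is a minimal sketch in plain C, not CUDA code: the function name emulate_grid and the owner_count array are hypothetical, and the loops run sequentially where a GPU would run the threads in parallel. It shows how a (block, thread) pair maps to one global row and column, the way blockIdx * blockDim + threadIdx does in a kernel.

```c
#include <assert.h>

/* Hypothetical host-side emulation of a 2D grid of 2D blocks.
 * For every (block, thread) pair, compute the global element it owns
 * and count how many threads landed on each element.
 * owner_count must hold (grid_h * block_h) * (grid_w * block_w) ints,
 * all initialised to zero by the caller. */
void emulate_grid(int grid_w, int grid_h, int block_w, int block_h,
                  int *owner_count)
{
    assert(grid_w > 0 && grid_h > 0 && block_w > 0 && block_h > 0);
    int total_w = grid_w * block_w;   /* width of the whole index space */

    for (int by = 0; by < grid_h; ++by) {             /* block row    */
        for (int bx = 0; bx < grid_w; ++bx) {         /* block column */
            for (int ty = 0; ty < block_h; ++ty) {    /* thread row    */
                for (int tx = 0; tx < block_w; ++tx) {/* thread column */
                    /* Same arithmetic as blockIdx * blockDim + threadIdx */
                    int col = bx * block_w + tx;
                    int row = by * block_h + ty;
                    owner_count[row * total_w + col] += 1;
                }
            }
        }
    }
}
```

With the 3 x 2 grid of 5 x 3 blocks drawn in the figure, the index space is 15 x 6 = 90 elements, and every element ends up owned by exactly one thread.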

Programming Model: Square Matrix Multiplication Example
P = M * N of size WIDTH x WIDTH
Without tiling:
– One thread calculates one element of P
– M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each of size WIDTH x WIDTH]

Step 1: Matrix Multiplication
A Simple Host Version in C
    // Matrix multiplication on the (CPU) host.
    // Inputs and output are float; the sum is accumulated in double precision.
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                double sum = 0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[i * Width + k];
                    double b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }
[Figure: row i of M and column j of N are traversed along k to produce element (i, j) of P]

Step 2: Input Matrix Data Transfer (Host-side Code)
    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;   // all three are device pointers
        …
        // 1. Allocate and load M, N to device memory
        cudaMalloc(&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
        // Allocate P on the device
        cudaMalloc(&Pd, size);

Step 3: Output Matrix Data Transfer (Host-side Code)
        // 2. Kernel invocation code (to be shown later)
        …
        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        // Free device matrices
        cudaFree(Md);
        cudaFree(Nd);
        cudaFree(Pd);
    }

Step 4: Kernel Function
    // Matrix multiplication kernel (per-thread code)
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // 2D Thread ID
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;

Step 4: Kernel Function (cont.)
        for (int k = 0; k < Width; ++k) {
            float Melement = Md[ty * Width + k];
            float Nelement = Nd[k * Width + tx];
            Pvalue += Melement * Nelement;
        }
        // Write the matrix to device memory;
        // each thread writes one element
        Pd[ty * Width + tx] = Pvalue;
    }
[Figure: thread (tx, ty) walks row ty of Md and column tx of Nd along k to produce element (tx, ty) of Pd; all matrices are WIDTH x WIDTH]

Step 5: Kernel Invocation (Host-side Code)
    // Set up the execution configuration: one block of Width x Width threads
    dim3 dimBlock(Width, Width);
    dim3 dimGrid(1, 1);
    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
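To check the per-thread logic without a GPU, the kernel body can be run once per thread index on the host. This is a minimal sketch in plain C under that assumption: matrix_mul_thread and emulate_single_block_launch are hypothetical names, the thread indices are passed as arguments instead of coming from threadIdx, and the 512-thread limit in the assertion is the figure quoted earlier in the deck.

```c
#include <assert.h>

/* Per-thread work of the simple kernel. On the GPU, tx and ty would be
 * threadIdx.x and threadIdx.y; here they are ordinary parameters. */
static void matrix_mul_thread(const float *Md, const float *Nd, float *Pd,
                              int Width, int tx, int ty)
{
    float Pvalue = 0.0f;
    for (int k = 0; k < Width; ++k) {
        Pvalue += Md[ty * Width + k] * Nd[k * Width + tx];
    }
    Pd[ty * Width + tx] = Pvalue;   /* each thread writes one element */
}

/* Emulates a launch with dimGrid(1, 1) and dimBlock(Width, Width):
 * a single block that contains Width * Width threads. */
void emulate_single_block_launch(const float *Md, const float *Nd,
                                 float *Pd, int Width)
{
    /* Assumed per-block limit of 512 threads, so Width can be at most 22 */
    assert(Width > 0 && Width * Width <= 512);
    for (int ty = 0; ty < Width; ++ty) {
        for (int tx = 0; tx < Width; ++tx) {
            matrix_mul_thread(Md, Nd, Pd, Width, tx, ty);
        }
    }
}
```

Because the whole matrix must fit in one block, this launch configuration only works for small matrices (22 x 22 = 484 threads fits under a 512-thread limit, 23 x 23 = 529 does not). That restriction is one reason the later slides move to a grid of tiles.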

How about performance on G80?
All threads access global memory for their input matrix elements
– Two memory accesses (8 bytes) per floating-point multiply-add, which counts as 2 FLOPs
– That is 4 bytes of memory traffic per FLOP, or 4 B/s of memory bandwidth per FLOPS
– 4 * 346.5 = 1386 GB/s required to achieve the peak FLOP rating of 346.5 GFLOPS
– The available 86.4 GB/s limits the code to 86.4 / 4 = 21.6 GFLOPS
The actual code runs at about 15 GFLOPS
Need to drastically cut down memory accesses to get closer to the peak GFLOPS
[Figure: CUDA memory model with Host, Grid, Global Memory, Constant Memory, and Blocks (0, 0) and (1, 0), each with Shared Memory and per-thread Registers for Threads (0, 0) and (1, 0)]
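The bandwidth arithmetic on this slide can be written as two one-line functions. This is only a sketch of the slide's own calculation in plain C. The function names are hypothetical, and the model assumes every operand comes from global memory with no caching or reuse.

```c
#include <assert.h>

/* Bandwidth (GB/s) needed to sustain a given compute rate (GFLOPS)
 * when each FLOP requires bytes_per_flop bytes from global memory. */
double required_bandwidth_gbs(double gflops, double bytes_per_flop)
{
    assert(gflops >= 0.0 && bytes_per_flop >= 0.0);
    return gflops * bytes_per_flop;
}

/* Compute rate (GFLOPS) that a given memory bandwidth (GB/s) can feed. */
double bandwidth_bound_gflops(double bandwidth_gbs, double bytes_per_flop)
{
    assert(bandwidth_gbs >= 0.0 && bytes_per_flop > 0.0);
    return bandwidth_gbs / bytes_per_flop;
}
```

Calling required_bandwidth_gbs(346.5, 4.0) reproduces the 1386 GB/s figure, and bandwidth_bound_gflops(86.4, 4.0) reproduces the 21.6 GFLOPS ceiling. Lowering bytes_per_flop, which is what tiling does, raises that ceiling in direct proportion.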

G80 Implementation of CUDA Memories
Each thread can:
– Read/write per-thread registers
– Read/write per-thread local memory
– Read/write per-block shared memory
– Read/write per-grid global memory
– Read only per-grid constant memory
[Figure: CUDA memory model with Host, Grid, Global Memory, Constant Memory, and Blocks (0, 0) and (1, 0), each with Shared Memory and per-thread Registers for Threads (0, 0) and (1, 0)]

Matrix Multiplication Using Shared Memory

Idea: Use Shared Memory to reuse global memory data
Each input element is read by WIDTH threads.
Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
– Tiled algorithms
[Figure: matrices M, N, and P of size WIDTH x WIDTH; thread (tx, ty) computes one element of P]

Tiled Multiply
Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH
Each thread computes one element of Pdsub
Assume that the dimensions of Md and Nd are multiples of TILE_WIDTH
[Figure: Md, Nd, and Pd of size WIDTH x WIDTH, partitioned into TILE_WIDTH x TILE_WIDTH tiles; block indices (bx, by) select Pdsub and thread indices (tx, ty) select an element within it]

First-order Size Considerations in G80
Each thread block should have many threads
– TILE_WIDTH of 16 gives 16 * 16 = 256 threads
There should be many thread blocks
– A 1024 * 1024 Pd gives 64 * 64 = 4096 thread blocks
Each thread block performs 2 * 256 = 512 float loads from global memory for 256 * (2 * 16) = 8,192 mul/add operations
– Memory bandwidth is no longer a limiting factor
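The counts on this slide follow directly from Width and TILE_WIDTH, so they are easy to recompute for other sizes. The sketch below is plain C with hypothetical names (TileStats, tile_stats). It reads the load and operation counts as being per block and per tile step, which is an interpretation of the slide and not something it states outright.

```c
#include <assert.h>

typedef struct {
    int threads_per_block;  /* TILE_WIDTH * TILE_WIDTH                      */
    int thread_blocks;      /* (Width / TILE_WIDTH) squared                 */
    int loads_per_step;     /* float loads from global memory, per block,
                               per tile step (one M and one N element
                               per thread)                                  */
    int ops_per_step;       /* multiply and add operations, per block,
                               per tile step                                */
} TileStats;

TileStats tile_stats(int width, int tile_width)
{
    /* The deck assumes the matrix size is a multiple of the tile size */
    assert(width > 0 && tile_width > 0 && width % tile_width == 0);

    TileStats s;
    s.threads_per_block = tile_width * tile_width;
    s.thread_blocks = (width / tile_width) * (width / tile_width);
    s.loads_per_step = 2 * s.threads_per_block;
    s.ops_per_step = s.threads_per_block * (2 * tile_width);
    return s;
}
```

For Width = 1024 and TILE_WIDTH = 16 this gives 256 threads, 4096 blocks, 512 loads, and 8,192 operations, that is, 16 operations per loaded float. The untiled kernel performs one operation per loaded float, so the global memory traffic per operation drops by a factor of TILE_WIDTH.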

CUDA Code – Kernel Execution Configuration
    // Set up the execution configuration
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);

CUDA Code – Kernel Overview
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Pvalue stores the element of the block sub-matrix
    // that is computed by the thread (automatic variable!)
    float Pvalue = 0;
    // Loop over all the sub-matrices of M and N
    // required to compute the block sub-matrix
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // code from the next few slides
    }

Tiled Multiply
Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH
Each thread computes one element of Pdsub
[Figure: Md, Nd, and Pd of size WIDTH x WIDTH, partitioned into TILE_WIDTH x TILE_WIDTH tiles; tile index m steps along a row of tiles in Md and a column of tiles in Nd, and k indexes within the current tile]

CUDA Code - Load Data to Shared Memory
    // Get a pointer to the current sub-matrix Msub of M
    float* Mdsub = GetSubMatrix(Md, m, by, Width);
    // Get a pointer to the current sub-matrix Nsub of N
    float* Ndsub = GetSubMatrix(Nd, bx, m, Width);
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    // Each thread loads one element of the sub-matrix
    Mds[ty][tx] = GetMatrixElement(Mdsub, tx, ty);
    // Each thread loads one element of the sub-matrix
    Nds[ty][tx] = GetMatrixElement(Ndsub, tx, ty);

CUDA Code - Compute Result
    // Synchronize to make sure the sub-matrices are loaded
    // before starting the computation
    __syncthreads();
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += Mds[ty][k] * Nds[k][tx];
    // Synchronize to make sure that the preceding
    // computation is done before loading two new
    // sub-matrices of M and N in the next iteration
    __syncthreads();

CUDA Code - Save Result
    // Get a pointer to the block sub-matrix of P
    Matrix Psub = GetSubMatrix(P, bx, by);
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    SetMatrixElement(Psub, tx, ty, Pvalue);
This code runs at about 45 GFLOPS on G80.
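The load, compute, and save steps from the last few slides can be put together and checked on the CPU. The following is a minimal sketch in plain C, not the authors' kernel. The names tiled_block and tiled_matrix_mul are hypothetical, the slides' GetSubMatrix and GetMatrixElement helpers are replaced by direct index arithmetic, TILE_WIDTH is set to 2 only to keep test matrices small, and sequential loops stand in for the parallel threads of a block. Finishing a loop over all threads plays the role of __syncthreads().

```c
#include <assert.h>

#define TILE_WIDTH 2   /* illustrative value; the slides use 16 */

/* Emulates the work of one thread block (bx, by) of the tiled kernel. */
static void tiled_block(const float *Md, const float *Nd, float *Pd,
                        int Width, int bx, int by)
{
    float Mds[TILE_WIDTH][TILE_WIDTH];   /* stands in for __shared__ Mds */
    float Nds[TILE_WIDTH][TILE_WIDTH];   /* stands in for __shared__ Nds */
    float Pvalue[TILE_WIDTH][TILE_WIDTH] = {{0}};  /* one per thread */

    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        /* Load phase: thread (tx, ty) copies one element of each tile */
        for (int ty = 0; ty < TILE_WIDTH; ++ty) {
            for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                int row = by * TILE_WIDTH + ty;   /* global row in Md */
                int col = bx * TILE_WIDTH + tx;   /* global col in Nd */
                Mds[ty][tx] = Md[row * Width + (m * TILE_WIDTH + tx)];
                Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + col];
            }
        }
        /* All loads are done here: equivalent of the first __syncthreads() */

        /* Compute phase: every thread reuses the tiles held in Mds and Nds */
        for (int ty = 0; ty < TILE_WIDTH; ++ty) {
            for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                for (int k = 0; k < TILE_WIDTH; ++k) {
                    Pvalue[ty][tx] += Mds[ty][k] * Nds[k][tx];
                }
            }
        }
        /* All threads are done: equivalent of the second __syncthreads() */
    }

    /* Save phase: each thread writes its one element of the sub-matrix */
    for (int ty = 0; ty < TILE_WIDTH; ++ty) {
        for (int tx = 0; tx < TILE_WIDTH; ++tx) {
            int row = by * TILE_WIDTH + ty;
            int col = bx * TILE_WIDTH + tx;
            Pd[row * Width + col] = Pvalue[ty][tx];
        }
    }
}

/* Emulates dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH). */
void tiled_matrix_mul(const float *Md, const float *Nd, float *Pd, int Width)
{
    assert(Width > 0 && Width % TILE_WIDTH == 0);
    for (int by = 0; by < Width / TILE_WIDTH; ++by) {
        for (int bx = 0; bx < Width / TILE_WIDTH; ++bx) {
            tiled_block(Md, Nd, Pd, Width, bx, by);
        }
    }
}
```

Each element placed in Mds or Nds is read from global memory once per block and then used TILE_WIDTH times in the compute phase. That reuse is consistent with the step from about 15 GFLOPS for the untiled kernel to the roughly 45 GFLOPS reported for this version.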
