
1 GPU Computing
Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn

2 Overview
Foundation: GPU, CUDA, Thread, Memory Structure
Intermediate: Kernel, Vector Addition, Matrix Multiplication, Shared Memory
Advanced: Warp, Memory Access, Resource Optimization, Dynamic Parallelism
Extension: Floating Point, Stream, Multiple GPUs, Parallel Matlab

3 What is GPU?
Graphics Processing Unit
First GPU: GeForce 256 (1999)
Connected to the motherboard via PCI Express
High computational density and memory bandwidth
Massively multithreaded many-core chips
Traditionally used for real-time rendering
Several million units are sold each year.

4 Graphics Cards PowerEdge C410x PCIe Expansion Chassis

5 GPU Pipeline Vertex shaders are run once for each vertex given to the graphics processor. The purpose is to transform each vertex's 3D position in virtual space to the 2D coordinate at which it appears on the screen (as well as a depth value for the Z-buffer). Rasterization is the process by which the 2D image space representation of the scene is converted into raster format and the correct resulting pixel values are determined. Pixel shaders, also known as fragment shaders, compute color and other attributes of each fragment.

6 GPU Pipeline Rasterization
Shaders are written to apply transformations to a large set of elements at a time, for example, to each pixel in an area of the screen, or for every vertex of a model. This is well suited to parallel processing, and most modern GPUs have multiple shader pipelines to facilitate this, vastly improving computation throughput.

7 Anti-Aliasing
(Figure: triangle geometry rendered aliased vs. anti-aliased.)

8 GPGPU General-Purpose Computing on GPUs
Massively Parallel, Simple Operations Suitable for compute-intensive engineering problems The original problem needs to be cast into native graphics operations. Launched through OpenGL or DirectX API calls Input data are stored in texture images and issued to the GPU by submitting triangles. Highly restricted access to input/output Very tedious, limited success with painstaking efforts

9 Trend of Computing
Peach pit: sequential problems, covered by the CPU.
Peach meat: parallel problems, partially covered by GPGPU.
(Blue) arc: the barrier between parallel computing and sequential problems.

10 CPU vs. GPU
The design of a CPU is optimized for sequential code performance. It uses sophisticated control logic and large cache memories to allow instructions from a single thread to execute in parallel and to reduce instruction and data access latencies.
GPUs are designed as numeric computing engines: much more chip area is dedicated to floating-point calculation.
CPU: multi-core. GPU: many-core, with far more ALUs and higher memory bandwidth.

11 Power of the Crowd
SM (Streaming Multiprocessor): multi-threaded processor core; the processing unit for a thread block; contains SPs (Streaming Processors) and SFUs (Special Function Units).
SP (Streaming Processor): scalar ALU for a single CUDA thread.
SIMT (Single-Instruction, Multiple-Thread): shared instruction fetch per 32 threads (a warp).
(Figure: a Streaming Multiprocessor with instruction L1 cache, instruction fetch/dispatch, shared memory, eight SPs and two SFUs.)

12 Need For Speed

13 Green Computing
(Figure: GFLOPS per Watt, single precision, comparing GTX 750 Ti, GTX 680, GTX 580 and Intel Core i7-980XE.)

14 Supercomputing TITAN, Oak Ridge National Laboratory
Speed: 24.8 PFLOPS (Theory), 17.6 PFLOPS (Real) CPU: AMD Opteron 6274 (18,688 × 16 cores) GPU: NVIDIA Tesla K20 (18,688 × 2496 cores) Cost: US$ 97 Million Power: 9 MW

15 Personal Supercomputer

16 What is CUDA? http://developer.nvidia.com/category/zone/cuda-zone
Compute Unified Device Architecture Introduced by NVIDIA in 2007 Scalable Parallel Programming Model Small extensions to standard C/C++ Enable general-purpose GPU computing Straightforward APIs to manage devices, memory etc. Only supports NVIDIA GPUs.

17 CUDA-Enabled GPU
(Figure: the host feeds an input assembler and a thread execution manager; arrays of streaming processors with parallel data caches and texture units access global memory through load/store units.)

18 CUDA GPUs
Compute Capability 2.0: GPUs GF100, GF110; cards GeForce GTX 470, GTX 480, GTX 570, GTX 580, GTX 590, Tesla C2050, C2070
Compute Capability 2.1: GPUs GF104, GF114, GF116, GF108, GF106; cards GeForce GT 430, GT 440, GTX 460, GTX 550 Ti, GTX 560 Ti, GT 640, GT 630
Compute Capability 3.0: GPUs GK104, GK106, GK107; cards GeForce GTX 690, GTX 680, GTX 670, GTX 660, GTX 650 Ti, GTX 650
Compute Capability 3.5: GPUs GK110, GK208; cards GeForce GTX TITAN, GT 640 (Rev. 2), GT 630 (Rev. 2), Tesla K40, Tesla K20
Compute Capability 5.0: GPUs GM107, GM108; cards GeForce GTX 750 Ti, GTX 750, GTX 860M

19 Fermi Architecture

20 Kepler Architecture GeForce GTX 680 (Mar. 22, 2012)
GK104, 28 nm process 3.5 billion transistors on a 294 mm2 die CUDA Cores: 1536 (8 SMs X 192 SPs) Memory Bandwidth: 192 GB/S Peak Performance: 3090 GFLOPS TDP: 195W Release Price: $499

21 Maxwell Architecture GeForce GTX 750 Ti (Feb. 18, 2014)
GM107, 28 nm process 1.87 billion transistors on a 148 mm2 die CUDA Cores: 640 (5 SMs X 128 Cores) Memory Bandwidth: 86.4 GB/S Peak Performance: 1306 GFLOPS TDP: 60W Release Price: $149

22 CUDA Teaching Lab
GTX 750 (GM107): Compute Capability 5.0, 512 CUDA cores, 1 GB 128-bit GDDR5, 80 GB/s, 1044 GFLOPS, TDP 55 W, RMB 799
GT 630 (GK208): Compute Capability 3.5, 384 CUDA cores, 2 GB 64-bit GDDR3, 14.4 GB/s, 692.7 GFLOPS, TDP 25 W, RMB 419

23 CUDA Installation

24 CUDA: deviceQuery

25 CUDA: bandwidthTest

26 CUDA Applications

27 CUDA Showcase

28 Heterogeneous Computing
(Figure: the host (CPU) and the device (GPU).)

29 Heterogeneous Computing

30 Grids, Blocks and Threads
(Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2-D array of blocks, and each block, e.g. Block (1, 1), is an array of threads.)

31 Thread Block
Threads have thread ID numbers within their block and use the thread ID to select work.
Threads are assigned to SMs at block granularity.
Each GT200 SM can have at most 8 blocks and at most 1024 threads.
Threads in the same block can share data and synchronize; threads in different blocks cannot cooperate.
Each block can execute in any order relative to other blocks.
(Figure: a thread block running one thread program for threads 0, 1, 2, ..., m.)

32 Code Example

33 Transparent Scalability
(Figure: the same 8-block kernel grid runs on a small device that executes a few blocks at a time and on a larger device that executes more blocks at a time; blocks are scheduled in any order.)

34 Memory Space
Each thread can:
Read/write per-thread registers
Read/write per-block shared memory
Read/write per-grid global memory
Read-only per-grid constant memory
GeForce GTX 680:
Memory bandwidth: 192 GB/s
Single-precision float: 4 B
Peak performance: 3090 GFLOPS
Practical, memory-bound performance: about 48 GFLOPS (192 GB/s divided by 4 B per operand)
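As an illustrative sketch (the kernel and variable names below are invented for this example, not taken from the slides), the four memory types map onto CUDA code as follows:

// Sketch: where CUDA variables live; assumes blockDim.x <= 128.
__constant__ float coeff[16];                  // per-grid constant memory, read-only in kernels

__global__ void memory_demo(float *gdata) {    // gdata points to per-grid global memory
    int i = threadIdx.x;                       // i lives in a per-thread register
    __shared__ float tile[128];                // tile is per-block shared memory
    tile[i] = gdata[i] * coeff[i % 16];
    __syncthreads();
    gdata[i] = tile[i];
}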

35 Hello World!
Plain C version:
int main(void) {
    printf("Hello World!\n");
    return 0;
}
Your first CUDA code:
__global__ void mykernel(void) { }

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

36 Device Code
The CUDA keyword __global__ indicates a kernel function that runs on the device and is called from the host.
The CUDA keyword __device__ indicates a device function that is called from a kernel function or another device function.
Triple angle brackets <<< >>> indicate a call from host code to device code: a kernel launch.
nvcc separates the source code into two components: device functions are processed by the NVIDIA compiler, host functions by the standard host compiler.
$ nvcc hello.cu

37 Addition on Device
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
add() will execute on the device and will be called from the host.
a, b and c must point to device memory, so we need to allocate memory on the GPU.

38 Memory Management
Host and device memories are separate entities.
Device pointers point to GPU memory: they may be passed to/from host code but may not be dereferenced in host code.
Host pointers point to CPU memory: they may be passed to/from device code but may not be dereferenced in device code.
CUDA APIs for handling device memory: cudaMalloc(), cudaFree(), cudaMemcpy(); C equivalents: malloc(), free(), memcpy().

39 Addition on Device: main()
int main(void) {
    int a, b, c;              // host copies
    int *d_a, *d_b, *d_c;     // device copies
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    a = 2; b = 7;

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

40 Addition on Device: main()
    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

41 Moving to Parallel
Each call to add() adds two integers. With add() running in parallel, we can do vector addition in parallel:
add<<<nblocks, 1>>>(d_a, d_b, d_c);
Each parallel invocation of add() is referred to as a block. By using blockIdx.x to index into the arrays, each block handles a different index.
Blocks can be 2-D: dim3 nblocks(M, N); blockIdx.x, blockIdx.y

42 Vector Addition on Device
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
Block 0: c[0]=a[0]+b[0];  Block 1: c[1]=a[1]+b[1];
Block 2: c[2]=a[2]+b[2];  Block 3: c[3]=a[3]+b[3];

43 Vector Addition on Device: main()
#define N 512
int main(void) {
    int *a, *b, *c;           // host copies
    int *d_a, *d_b, *d_c;     // device copies
    int size = N * sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Allocate space for host copies of a, b, c and set up initial values
    a = (int *)malloc(size); rand_ints(a, N);
    b = (int *)malloc(size); rand_ints(b, N);
    c = (int *)malloc(size); rand_ints(c, N);

44 Vector Addition on Device: main()
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Copy results back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

45 CUDA Threads
Each block can be split into parallel threads.
Threads can be up to 3-D: dim3 nthreads(M, N, P); threadIdx.x, threadIdx.y, threadIdx.z
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
add<<<1, N>>>(d_a, d_b, d_c);

46 Combining Blocks and Threads
We have seen parallel vector addition using: Many blocks with one thread each One block with many threads Let’s adapt vector addition to use both blocks and threads. Why bother?

47 Indexing
With M = 8 threads per block:
int index = threadIdx.x + blockIdx.x * M;
In general:
int index = threadIdx.x + blockIdx.x * blockDim.x;
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

48 Indexing
#define N (2048*2048)
#define M 512            // THREADS_PER_BLOCK
...
add<<<N/M, M>>>(d_a, d_b, d_c);
When N is not a multiple of M, round the number of blocks up and guard against out-of-range indices:
__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}
add<<<(N+M-1)/M, M>>>(d_a, d_b, d_c, N);

49 Data Access Pattern
(Figure: a 1-D stencil reads each input element plus radius neighbours on each side to produce one output element. How many times is each input element read?)

50 Sharing Data Between Threads
Each thread generates one output element (blockDim.x elements per block).
Each input element needs to be read several times, which implies a high I/O cost.
Within a block, threads can share data via shared memory: extremely fast on-chip memory, not visible to threads in other blocks.
Declared using the keyword __shared__, allocated per block.
Read (blockDim.x + 2*radius) input elements from global memory into shared memory.

51 Collaborative Threads
(Figure: with blockDim.x = 10 and radius = 3, each block fills a shared array of blockDim.x + 2*radius elements.)
Thread 0 produces the values of temp[i] for i = 0, 3, 13; thread 9 requires the values of temp[i] for i = 9, 10, 11, 12, 13, 14, 15. Without synchronization this is a data race.
void __syncthreads();

52 Kernel Synchronization
__global__ void vector_sum(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2*RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;  // global index
    int lindex = threadIdx.x + RADIUS;                   // local index

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {  // some threads do extra work for the halo
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    __syncthreads();

    int offset, result = 0;
    for (offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}

53 Matrix Multiplication
Host code only (square Width x Width matrices):
void MatrixMulOnHost(float* M, float* N, float* P, int Width) {
    int i, j, k;
    float a, b, sum;
    for (i = 0; i < Width; ++i)
        for (j = 0; j < Width; ++j) {
            sum = 0;
            for (k = 0; k < Width; ++k) {
                a = M[i * Width + k];
                b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

54 Single Thread Block
dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);
...
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
    int k = 0;
    float Pvalue = 0, Melement, Nelement;
    for (k = 0; k < Width; ++k) {
        Melement = Md[threadIdx.y*Width + k];   // Md[threadIdx.y, k]
        Nelement = Nd[k*Width + threadIdx.x];   // Nd[k, threadIdx.x]
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y*Width + threadIdx.x] = Pvalue;  // Pd[threadIdx.y, threadIdx.x]
}

55 Single Thread Block
What is the maximum size of the matrix?
Each thread computes one element of Pd: it loads a row of Md and a column of Nd, and performs one multiply and one addition for each pair of Md and Nd elements.
CGMA: Compute to Global Memory Access ratio.
(Figure: thread (2, 2) of the single block reads row 2 of Md and column 2 of Nd to produce one element of Pd.)

56 Multiple Blocks
Break Pd into square tiles of size TILE_WIDTH x TILE_WIDTH.
Each block calculates one tile, and each thread calculates one element; the block size equals the tile size.
Indexing now requires both the block ID and the thread ID.
(Figure: block (bx, by) computes the sub-matrix Pdsub of Pd from the corresponding strip of Md and Nd.)

57 Multiple Blocks: An Example
(Figure: with TILE_WIDTH = 2, the 4 x 4 product Pd is covered by four 2 x 2 tiles; Block (0,0) computes Pd0,0, Pd1,0, Pd0,1, Pd1,1, Block (1,0) computes Pd2,0, Pd3,0, Pd2,1, Pd3,1, and so on, using the corresponding rows of Md and columns of Nd.)

58 Multiple Blocks: Indexing
Block: blockIdx.x, blockIdx.y
Thread: threadIdx.x, threadIdx.y
Row: blockIdx.y * TILE_WIDTH + threadIdx.y
Col: blockIdx.x * TILE_WIDTH + threadIdx.x

59 Multiple Blocks: Device Code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
    // Calculate the row index of the Pd element and Md
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of the Pd element and Nd
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    int k;
    float Pvalue = 0;
    // Each thread computes one element of the sub-matrix
    for (k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];
    Pd[Row*Width + Col] = Pvalue;
}
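A minimal host-side launch configuration for this kernel might look as follows; this is only a sketch, assuming Width is a multiple of TILE_WIDTH and that Md, Nd and Pd already reside in device memory:

dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);   // one block per TILE_WIDTH x TILE_WIDTH tile of Pd
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // one thread per element of the tile
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);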

60 Block Granularity
Each SM in GT200 can take up to 1024 threads and 8 blocks.
8x8 blocks: 64 threads each; 1024/64 = 16 blocks, but only 8 are allowed, so 64 x 8 = 512 threads per SM (half capacity).
16x16 blocks: 256 threads each; 1024/256 = 4 blocks, full capacity!
32x32 blocks: 1024 threads per block, exceeding the limit of 512 threads per block.
(Figure: blocks of threads t0, t1, ..., tm assigned to SM 0 and SM 1, each SM with its SPs, MT issue unit and shared memory.)

61 Global Memory Access
Each thread requires one row from Md and one column from Nd. For a k x k thread block, each row/column will be accessed k times.
To reduce the global memory I/O, it is beneficial to load the required data once into shared memory.
(Figure: a 2 x 2 thread block in which threads T0,0, T1,0, T0,1, T1,1 share rows of Md and columns of Nd.)

62 Splitting Md
The shared memory per SM is limited (e.g., 64 KB) and is shared among all blocks resident on the same SM.
Luckily, not all data needs to be in shared memory simultaneously: the tile of Md is loaded phase by phase (Phase 1 loads Md0,0, Md1,0, Md0,1, Md1,1 into Mds; Phase 2 loads Md2,0, Md3,0, Md2,1, Md3,1).

63 Shared Memory: Device Code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaboratively load Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();  // Make sure the shared memory is ready

64 Shared Memory: Device Code
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        // Make sure all threads have finished working on the shared memory
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}
Each SM in G80 can take up to 768 threads and 8 blocks, and has 16 KB of shared memory.
For a tile of size 16 x 16, each block requires 16 x 16 x 4 B = 1 KB for Mds; in total 2 KB per block, so 8 blocks could be supported.
However, since 768/(16 x 16) = 3, only 3 blocks (6 KB of shared memory) will actually be in use.

65 Performance Considerations
GPU computing is easy: host, device, kernel, block, thread, GPU cards; you can be up and running in a few days, as long as performance is not a major concern.
There are various performance bottlenecks: a 10x speedup is often within your reach, but a 100x speedup takes a significant amount of tuning effort.

66 Thread Execution
Conceptually, threads in a block can execute in any order with respect to each other; barrier synchronizations are the exception.
Each thread block is partitioned into warps (a hardware cost consideration); the warp is the unit of thread scheduling in an SM.
32 threads per warp: threads [0, ..., 31], [32, ..., 63], ...
All threads in a warp execute the same instruction.
SM hardware implements zero-overhead thread scheduling and can tolerate long-latency operations when enough warps are resident; this is why a GPU does not require as much chip area for cache memories and branch prediction as a CPU.
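As a small illustration (assuming a warp size of 32 and a 1-D block; the kernel and array names are invented), each thread can work out which warp and lane it belongs to:

__global__ void warp_info(int *warp_id, int *lane_id) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global linear thread index
    warp_id[tid] = threadIdx.x / 32;   // threads 0-31 of the block form warp 0, 32-63 warp 1, ...
    lane_id[tid] = threadIdx.x % 32;   // position of the thread inside its warp
}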

67 Warp Scheduling
Suppose one global memory access is needed for every 4 instructions, an instruction takes 4 clock cycles, and the memory latency is 200 clock cycles.
A warp then performs 4 x 4 = 16 cycles of work before it stalls, so about 200/16 = 13 other warps are needed to cover the latency: at least 14 warps are required to keep the execution units fully utilized.

68 Flow Divergence
SIMT reduces the cost of fetching and processing instructions, and it works well when all threads in a warp follow the same control flow.
Multiple sequential passes may be required for an if-then-else construct: the warp executes the then-path with some threads masked off, then the else-path with the others masked off.
(Figure: the SM warp scheduler interleaves instructions from different warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96.)

69 Flow Divergence

70

71 Flow Divergence
The main performance concern with branching is divergence: threads within a warp take different paths, and the control paths are traversed one at a time.
How can divergence be avoided when the branch condition is a function of the thread ID?
With divergence: if (threadIdx.x > 2) { }. Threads 0, 1 and 2 follow a different path than the rest of the threads in the warp.
Without divergence: if (threadIdx.x / WARP_SIZE >= 2) { }. This still creates two different paths for threads in a block, but the branch granularity is a whole multiple of the warp size, so all threads in any given warp follow the same path.
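A hedged sketch of the two cases above (the kernel name and output array are invented, and a warp size of 32 is assumed):

__global__ void divergence_demo(int *out) {
    int i = threadIdx.x;

    // Divergent: inside warp 0, threads 0-2 and threads 3-31 take different paths,
    // so the warp executes both paths one after the other.
    if (i > 2) out[i] = 1;
    else       out[i] = 2;

    // Non-divergent: the condition is constant within each group of 32 threads,
    // so every warp follows a single path even though the block as a whole has two paths.
    if (i / 32 >= 2) out[i] += 10;
    else             out[i] += 20;
}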

72 Memory Coalescing
DRAM (Dynamic Random Access Memory): each bit is stored in a separate capacitor, and all storage locations have nearly identical access time; in practice, many consecutive locations are accessed in parallel.
All threads in a warp should therefore access consecutive memory locations (coalescing) to maximize memory bandwidth utilization.
(Figure: thread 1 and thread 2 traversing Md produce uncoalesced accesses, while traversing Nd produces coalesced accesses.)
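The difference can be made concrete with two copy kernels (a sketch; the array and kernel names are illustrative):

// Consecutive threads touch consecutive addresses: accesses coalesce into few transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Neighbouring threads are stride elements apart: the hardware cannot coalesce the accesses.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}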

73 Memory Layout of a Matrix in C
(Figure: M is stored in row-major order, M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 ...; the diagram traces which elements threads T0-T3 touch in time periods 1 and 2 for one access pattern.)

74 Memory Layout of a Matrix in C
(Figure: the same row-major layout with the alternative access pattern, again traced over time periods 1 and 2 for threads T0-T3.)

75 Shared Memory Architecture
Many threads access memory, so shared memory is divided into banks: successive 32-bit words are assigned to successive banks, and each bank has a bandwidth of 32 bits per clock cycle.
G80 has 16 banks (bank = address % 16), the same as the size of half a warp.
Each memory bank can service one address per cycle, so shared memory can service as many simultaneous accesses as there are banks.
Multiple simultaneous accesses to the same bank may result in a bank conflict; conflicting accesses are serialized. There are no bank conflicts between different half-warps.

76 Bank Conflicts
Shared memory is as fast as registers if there are no bank conflicts.
The fast case: if all threads of a half-warp access different banks, there is no bank conflict; if all threads of a half-warp access the identical address, there is no bank conflict (broadcast).
The slow case: multiple threads in the same half-warp access the same bank, the accesses must be serialized, and the cost is the maximum number of simultaneous accesses to a single bank.
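A sketch of three access patterns for a 16-bank shared memory (G80-style, half-warp of 16 threads; the kernel and array names are invented):

__global__ void bank_demo(float *out) {
    __shared__ float s[256];
    int tid = threadIdx.x;                      // assume blockDim.x == 16 (one half-warp) for this sketch

    for (int i = tid; i < 256; i += blockDim.x) // fill shared memory so the reads below are defined
        s[i] = (float)i;
    __syncthreads();

    float a = s[tid];               // stride 1: 16 threads hit 16 different banks, no conflict
    float b = s[(2 * tid) % 256];   // stride 2: threads t and t+8 share a bank, 2-way conflict
    float c = s[(16 * tid) % 256];  // stride 16: every thread hits bank 0, 16-way conflict (serialized)
    out[tid] = a + b + c;
}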

77 Bank Addressing Example
(Figure: two conflict-free patterns for threads 0-15 and banks 0-15: linear addressing, where thread i reads bank i, and a random 1:1 permutation.)

78 Bank Addressing Example
(Figure: two conflicting patterns: linear addressing with stride 2 gives 2-way bank conflicts, and linear addressing with stride 8 gives 8-way bank conflicts.)

79 Partitioning of SM Resources
Execution resources in an SM: registers, block slots (GT200: 8), thread slots (GT200: 1024).
Number of resident 16x16 blocks = 1024/(16x16) = 4.
These limits determine the number of threads running on an SM, and subtle interactions between them may cause underutilization of resources.
Register file: stores automatic variables declared in a CUDA kernel. G80 has 32 KB (8192 entries) per SM, dynamically partitioned across all blocks in the same SM; each thread can only access the registers assigned to itself.

80 SM Resources Example
For 16x16 blocks, if each thread uses 10 registers: each block requires 16x16x10 = 2560 registers, so the number of blocks = 8192/2560 = 3.
If each thread increases the use of registers by 1: each block now requires 16x16x11 = 2816 registers, so the number of blocks = 8192/2816 = 2.
Only two blocks can run on an SM, and the number of threads drops from 768 to 512: a 1/3 reduction of parallelism due to a single extra automatic variable! This is known as a performance cliff.

81 Occupancy Calculator

82 Instruction Mix
Each processor core has limited instruction processing bandwidth, and every instruction consumes some of it: floating-point calculations, load instructions, branch instructions.
We should try to increase the efficiency of instructions. In the inner loop
for (int k = 0; k < BLOCK_SIZE; k++)
    Pvalue += Mds[ty][k] * Nds[k][tx];
each iteration spends 1 loop counter increment, 1 loop branch and 2 address arithmetic instructions for only 2 floating-point arithmetic instructions.

83 Instruction Mix
Loop unrolling: express the dot-product computation as one long multiply-add expression,
Pvalue += Mds[ty][0]*Nds[0][tx] + Mds[ty][1]*Nds[1][tx] + ...
This eliminates the loop branch instruction and the loop counter update, and the matrix indices become constants rather than variables; with the help of the compiler, the address arithmetic instructions can be eliminated as well. See the sketch below.
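A hedged sketch of the unrolled inner product for a small tile (TILE_WIDTH = 4 here purely so the expression fits; the helper name is invented and the idea is the same for the 16 x 16 tiles used earlier):

#define TILE_WIDTH 4

// Drop-in replacement for the inner k-loop of the tiled kernel.
__device__ float dot_row_col(const float Mds[TILE_WIDTH][TILE_WIDTH],
                             const float Nds[TILE_WIDTH][TILE_WIDTH],
                             int ty, int tx) {
    return Mds[ty][0] * Nds[0][tx]     // no loop counter update, no loop branch,
         + Mds[ty][1] * Nds[1][tx]     // and the second index of every access is a constant
         + Mds[ty][2] * Nds[2][tx]
         + Mds[ty][3] * Nds[3][tx];
}

In practice the same effect can usually be requested from the compiler by placing #pragma unroll before the loop.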

84 Dynamic Parallelism
A child CUDA kernel can be called from within a parent CUDA kernel without CPU involvement: an extension beyond the flat, single level of parallelism. Requires Compute Capability 3.5+.
Benefits: simplified CPU/GPU cooperation, dynamic load balancing, data-dependent execution, recursive parallel algorithms.

85 What does it mean?

86 What does it mean?

87 Example
__global__ void ChildKernel(void* data) {
    // Operate on data
}

__global__ void ParentKernel(void *data) {
    ChildKernel<<<1, 16>>>(data);
}

// In host code
ParentKernel<<<256, 64>>>(data);

__global__ void RecursiveKernel(void* data) {
    if (continueRecursion == true)
        RecursiveKernel<<<64, 16>>>(data);
}
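As a side note, kernels that launch other kernels need relocatable device code and the device runtime library at compile/link time; the file name below is just a placeholder:

$ nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt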

88 Matrix Example
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        convolution_function(i, j);
    }
}

89 Matrix Example
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M[i]; j++) {
        convolution_function(i, j);
    }
}
Oversubscription: the amount of work now varies from row to row.

90 Matrix Example
__global__ void con_kernel(int i) {
    convolution_function(i, threadIdx.x);
}

__global__ void dynamic_parallelism_kernel(int *M) {
    con_kernel<<<1, M[blockIdx.x]>>>(blockIdx.x);
}

// In host code
dynamic_parallelism_kernel<<<N, 1>>>(M);

91 Synchronization
__global__ void Parent_Kernel() {
    ...  // kernel code
    if (threadIdx.x == 0) {
        Child_Kernel<<<1, 256>>>();   // the thread launches the kernel and keeps going
        cudaDeviceSynchronize();      // make this thread wait for Child_Kernel to complete
    }
    __syncthreads();                  // if all threads in the block need Child_Kernel to complete
    // Code that needs data generated by Child_Kernel
}

92 Timing GPU Kernels
float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);            // place the start event in stream 0
kernel<<<...>>>(...);                 // returns immediately
cudaEventRecord(stop, 0);             // place the stop event in stream 0
cudaEventSynchronize(stop);           // make sure stop is reached
cudaEventElapsedTime(&time, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);

93 Multiple Kernels
// Create two streams
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

// Launch a kernel from each stream
KernelOne<<<64, 512, 0, stream[0]>>>(...);
KernelTwo<<<64, 512, 0, stream[1]>>>(...);

// Destroy the streams
for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);

Synchronization is implied for events within the same stream. More than one stream can be associated with a GPU.

94 Multiple GPUs
int nDevices;
cudaGetDeviceCount(&nDevices);

cudaDeviceProp prop;
for (int i = 0; i < nDevices; i++) {
    cudaGetDeviceProperties(&prop, i);
    printf("Device Number: %d\n", i);
    printf("Device Name: %s\n", prop.name);
    printf("Compute Capability: %d.", prop.major);
    printf("%d\n", prop.minor);
    printf("Memory Bus Width: %d\n", prop.memoryBusWidth);
}

95 Streams and Multiple GPUs
cudaSetDevice(0);
cudaStreamCreate(&streamA);
cudaSetDevice(1);
cudaStreamCreate(&streamB);

// Launch kernels
KernelOne<<<..., streamA>>>(...);   // Invalid!
KernelTwo<<<..., streamB>>>(...);   // Valid

Streams belong to the GPU that was active when they were created. Calls to a stream are invalid if its associated GPU is not active.

96 Floating Point Considerations
Numeric values are represented as bit patterns (IEEE Floating Point Standard): sign (S), exponent (E) and mantissa (M). Each (S, E, M) pattern uniquely identifies a floating point number.
For each bit pattern, its numeric value is derived as:
value = (-1)^S x M x 2^E, where 1.0B <= M < 10.0B
The interpretation of S: S=0 means a positive number, S=1 a negative number.

97 Normalized Representation of M
Subscripts D and B denote decimal and binary place values, respectively.
Requiring 1.0B <= M < 10.0B makes the mantissa value of each floating point number unique. For example, the only valid mantissa value for 0.5D = 1.0B x 2^-1 is M = 1.0B; neither 10.0B x 2^-2 (M = 10.0B) nor 0.1B x 2^0 (M = 0.1B) qualifies, just as 10.0D x 10^5 or 0.9D x 10^-3 are not valid normalized decimal forms.
Because all mantissa values are of the form 1.XX..., we can omit the "1." part from the representation: the mantissa of 0.5D in a 2-bit mantissa field is 00, obtained by omitting "1." from 1.00. With the IEEE format, an n-bit mantissa is effectively an (n+1)-bit mantissa.

98 Excess Encoding of E
value = (-1)^S x 1.M x 2^(E - (2^(n-1) - 1))
In an n-bit exponent representation, 2^(n-1) - 1 is added to the two's complement value to form the excess representation. With a 3-bit exponent (excess-3):

Decimal value   Two's complement   Excess-3
Reserved        100                111
 3              011                110
 2              010                101
 1              001                100
 0              000                011
-1              111                010
-2              110                001
-3              101                000

The excess representation increases monotonically with the decimal value, so an unsigned comparator can be used to compare signed exponents.

99 Representable Numbers
Representable numbers of a toy format with a 2-bit exponent (excess-1), a 2-bit mantissa and one sign bit; the S=1 value is the negation of the S=0 value in every cell.

E    M     No-zero           Abrupt underflow   Denormalization
00   00    2^-1              0                  0
00   01    2^-1 + 1*2^-3     0                  1*2^-2
00   10    2^-1 + 2*2^-3     0                  2*2^-2
00   11    2^-1 + 3*2^-3     0                  3*2^-2
01   00    2^0               2^0                2^0
01   01    2^0 + 1*2^-2      2^0 + 1*2^-2       2^0 + 1*2^-2
01   10    2^0 + 2*2^-2      2^0 + 2*2^-2       2^0 + 2*2^-2
01   11    2^0 + 3*2^-2      2^0 + 3*2^-2       2^0 + 3*2^-2
10   00    2^1               2^1                2^1
10   01    2^1 + 1*2^-1      2^1 + 1*2^-1       2^1 + 1*2^-1
10   10    2^1 + 2*2^-1      2^1 + 2*2^-1       2^1 + 2*2^-1
10   11    2^1 + 3*2^-1      2^1 + 3*2^-1       2^1 + 3*2^-1
11   --    Reserved pattern

100 Representable Numbers
The exponent bits define the major intervals of representable numbers, and the mantissa bits define the number of representable numbers within each interval.
Zero is not representable in this (no-zero) format.
Representable numbers become closer to each other toward 0, yet there is a gap in representable numbers in the vicinity of 0.

101 Representing Zero
Abrupt underflow: treats all bit patterns with E=0 as 0. This takes away several representable numbers near zero and lumps them all into 0.
Denormalization: relaxes the normalization requirement for numbers very close to 0. Whenever E=0, the mantissa is assumed to be 0.xx and the exponent is assumed to be the same as in the previous interval:
value = 0.M x 2^(-2^(n-1)+2)
Example: (S E M) = (0 00 01) represents 0.01B x 2^0 = 2^-2.

102 Accuracy and Rounding
With a 3-bit mantissa, adding numbers of very different magnitude loses the small operand:
1.00 x 2^-2 + 1.00 x 2^1 = 0.001 x 2^1 + 1.00 x 2^1 = 1.001 x 2^1, which rounds to 1.00 x 2^1 (error = 0.001 x 2^1)
Summing left to right:
1.00 x 2^0 + 1.00 x 2^0 + 1.00 x 2^-2 + 1.00 x 2^-2
= 1.00 x 2^1 + 1.00 x 2^-2 + 1.00 x 2^-2
which rounds to 1.00 x 2^1 + 1.00 x 2^-2 and then to 1.00 x 2^1
Summing the small numbers first:
[1.00 x 2^-2 + 1.00 x 2^-2] + [1.00 x 2^0 + 1.00 x 2^0]
= 1.00 x 2^-1 + 1.00 x 2^1
= 1.01 x 2^1
Sorting data in ascending order may therefore help achieve greater accuracy, because numbers with similar numerical values are added to each other first.
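The same effect is easy to reproduce in single-precision C code; this is a standalone sketch whose constants are chosen only to sit just below float's rounding threshold:

#include <stdio.h>

int main(void) {
    float big = 1.0f, small = 5.0e-8f;            /* small < 0.5 * FLT_EPSILON relative to big */
    float small_first = (small + small) + big;    /* 1e-7 is large enough to survive the final add */
    float big_first   = (big + small) + small;    /* each small addend is rounded away */
    printf("%.9g vs %.9g\n", small_first, big_first);   /* prints 1.00000012 vs 1 */
    return 0;
}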

103 Single vs. Double Precision
GPUs were traditionally not good at double precision calculation: it requires compute capability 1.3 or above and ran at around 1/8 of the single precision performance, improved greatly to 1/2 with the Fermi architecture.
It is important to avoid using double precision when it is not necessary.
Add the 'f' suffix to float literals:
Y = X * 0.123;    // double assumed
Y = X * 0.123f;   // float explicit
Use the float version of standard library functions:
Y = sin(X);       // double assumed
Y = sinf(X);      // single precision explicit

104 Matlab in Parallel Matlab: Numerical Computing Environment
Parallel Computing Toolbox (PCT) Offload work from one MATLAB session (client) to other MATLAB sessions (workers). Run as many as eight MATLAB workers (2011b) on your local machine in addition to your MATLAB client session.

105 Parfor Parallel for-loop
The parfor body is executed on the client and workers. The data on which parfor operates is sent from the client to workers, and the results are sent back to the client and pieced together. MATLAB workers evaluate iterations in no particular order, and independently of each other. Classification of Variables Loop, Sliced, Reduction, Broadcast, Temporary

106 Classification of Variables
c = pi;                % broadcast
z = 0;                 % reduction
r = rand(1,10);        % sliced
parfor i = 1:10        % i: loop variable
    a = i;             % temporary
    z = z + i;
    b(i) = r(i);       % b: sliced
    if i <= c
        d = 2*a;
    end
end

107 Parfor Example
Serial version:
X = zeros(1,N);
for i = 1:N
    X(i) = sin(i/N*2*pi);
end
Parallelized version:
X = zeros(1,N);
matlabpool open local 8   % create 8 workers
parfor i = 1:N
    X(i) = sin(i/N*2*pi);
end
matlabpool close          % close all workers

108 Notes on Parfor Each loop must be independent of other loops.
In the Windows Task Manager, there are multiple Matlab processes: Higher CPU Usage Higher Memory Usage It incurs significant overhead: only works with long loop iterations and/or time consuming calculations. Be prepared to be surprised: Some Matlab functions are already optimized for multithreading. The practical speedup value is generally quite moderate.

109 GPU Accelerated Matlab
Matlab users can now easily enjoy the benefits of GPU computing. Capabilities Evaluating built-in functions on the GPU. Running MATLAB code on the GPU. Requirements Matlab 2014a (Recommended) NVIDIA CUDA-enabled devices with compute capability of 1.3 or greater NVIDIA CUDA device driver 3.1 or greater Check the GPU environment gpuDeviceCount: number of available GPU devices gpuDevice: select and query GPU device

110 Create Data on GPU
Transferring data between the workspace and the GPU:
M = rand(6);                 % in the workspace
G = gpuArray(single(M));     % send to the GPU
N = gather(G);               % bring back to the workspace
Directly creating GPU data:
G = ones(100, 100, 50, 'single', 'gpuArray');
size(G)
classUnderlying(G)           % single

111 Execute Code on GPU
Run built-in functions:
X = rand(1000, 'single', 'gpuArray');
Gfft = fft(X);
Y = gather(Gfft);
Run element-wise Matlab code:
function c = myCal(rawdata, gain, offst)
    c = (rawdata .* gain) + offst;

meas = ones(1000)*3;                  % CPU
gn   = rand(1000, 'gpuArray')/100;    % GPU
offs = rand(1000, 'gpuArray')/50;     % GPU
corrected = arrayfun(@myCal, meas, gn, offs);
results = gather(corrected);
If any of the input arguments is a gpuArray, the function executes on the GPU and returns a gpuArray. If none of the inputs is a gpuArray, then arrayfun and bsxfun execute on the CPU.

112 Timing GPU Code
Using gputimeit with a function handle:
A = rand(1024, 'gpuArray');
fh = @() fft(A);
gputimeit(fh);

A = rand(12000, 400, 'gpuArray');
B = rand(400, 12000, 'gpuArray');
f = @() A*B;
t = gputimeit(f);
Using tic/toc with wait:
gd = gpuDevice();
tic();
B = fft(A);
wait(gd);
t = toc();
Timing a function with multiple outputs:
X = rand(1000, 'gpuArray');
f
t1 = gputimeit(f, 1);
t3 = gputimeit(f, 3);
wait(gpudev) blocks execution in MATLAB until the GPU device identified by the object gpudev completes its calculations. When gathering results from a GPU, MATLAB automatically waits until all GPU calculations are complete, so you do not need to call wait explicitly in that situation. You must use wait when you are timing GPU calculations to profile your code.

113 Testing Host-GPU Bandwidth
sizeOfDouble = 8;
sizes = power(2, 14:28);
sendTimes = inf(size(sizes));
gatherTimes = inf(size(sizes));
for i = 1:numel(sizes)
    numElements = sizes(i)/sizeOfDouble;
    hostData = randi([0 9], numElements, 1);
    gpuData = gpuArray.randi([0 9], numElements, 1);
    sendFcn = @() gpuArray(hostData);
    sendTimes(i) = gputimeit(sendFcn);
    gatherFcn = @() gather(gpuData);
    gatherTimes(i) = gputimeit(gatherFcn);
end
sendBandwidth = (sizes./sendTimes)/1e9;
[maxSendBandwidth, maxSendIdx] = max(sendBandwidth);
gatherBandwidth = (sizes./gatherTimes)/1e9;
[maxGatherBandwidth, maxGatherIdx] = max(gatherBandwidth);

hold off
semilogx(sizes, sendBandwidth, 'b.-', sizes, gatherBandwidth, 'r.-')
hold on
semilogx(sizes(maxSendIdx), maxSendBandwidth, 'bo-', 'MarkerSize', 10);
semilogx(sizes(maxGatherIdx), maxGatherBandwidth, 'ro-', 'MarkerSize', 10);
grid on
title('Data Transfer Bandwidth')
xlabel('Array size (bytes)')
ylabel('Transfer speed (GB/s)')
legend('Send to GPU', 'Gather from GPU', 'Location', 'NorthWest')

114 Testing Host-GPU Bandwidth
GTX 750

115 Testing CPU Bandwidth
sizeOfDouble = 8;
sizes = power(2, 14:28);
memoryTimesHost = inf(size(sizes));
for i = 1:numel(sizes)
    numElements = sizes(i)/sizeOfDouble;
    hostData = randi([0 9], numElements, 1);
    plusFcn = @() plus(hostData, 1.0);
    memoryTimesHost(i) = timeit(plusFcn);
end
memoryBandwidthHost = 2*(sizes./memoryTimesHost)/1e9;
[maxBWHost, maxBWIdxHost] = max(memoryBandwidthHost);
K20

116 Testing GPU Bandwidth
memoryTimesGPU = inf(size(sizes));
for i = 1:numel(sizes)
    numElements = sizes(i)/sizeOfDouble;
    gpuData = gpuArray.randi([0 9], numElements, 1);
    plusFcn = @() plus(gpuData, 1.0);
    memoryTimesGPU(i) = gputimeit(plusFcn);
end
memoryBandwidthGPU = 2*(sizes./memoryTimesGPU)/1e9;
[maxBWGPU, maxBWIdxGPU] = max(memoryBandwidthGPU);
Computational density: 1/2 FLOP per element, so most of the time is spent on memory reads and writes.

117 Bandwidth: CPU vs. GPU
hold off
semilogx(sizes, memoryBandwidthGPU, 'b.-', sizes, memoryBandwidthHost, 'r.-')
hold on
semilogx(sizes(maxBWIdxGPU), maxBWGPU, 'bo-', 'MarkerSize', 10);
semilogx(sizes(maxBWIdxHost), maxBWHost, 'ro-', 'MarkerSize', 10);
grid on
title('Read+write Bandwidth')
xlabel('Array size (bytes)')
ylabel('Speed (GB/s)')
legend('GPU', 'Host', 'Location', 'NorthWest')

118 Testing Matrix Multiplication
sizes = power(2, 12:2:24);
N = sqrt(sizes);
mmTimesHost = inf(size(sizes));
mmTimesGPU = inf(size(sizes));
for i = 1:numel(sizes)
    A = rand(N(i), N(i));
    B = rand(N(i), N(i));
    mmTimesHost(i) = timeit(@() A*B);
    A = gpuArray(A);
    B = gpuArray(B);
    mmTimesGPU(i) = gputimeit(@() A*B);
end
mmGFlopsHost = (2*N.^3 - N.^2)./mmTimesHost/1e9;
[maxGFlopsHost, maxGFlopsHostIdx] = max(mmGFlopsHost);
mmGFlopsGPU = (2*N.^3 - N.^2)./mmTimesGPU/1e9;
[maxGFlopsGPU, maxGFlopsGPUIdx] = max(mmGFlopsGPU);

119 Testing Matrix Multiplication
hold off
semilogx(sizes, mmGFlopsGPU, 'b.-', sizes, mmGFlopsHost, 'r.-')
hold on
semilogx(sizes(maxGFlopsGPUIdx), maxGFlopsGPU, 'bo-', 'MarkerSize', 10);
semilogx(sizes(maxGFlopsHostIdx), maxGFlopsHost, 'ro-', 'MarkerSize', 10);
grid on
title('Double precision matrix-matrix multiply')
xlabel('Matrix size (numel)')
ylabel('Calculation Rate (GFLOPS)')
legend('GPU', 'Host', 'Location', 'NorthWest')

120 Testing Matrix Multiplication
Single Precision

121 Review What are the differences among MPI, OpenMP and CUDA?
Why is GPU suitable for high performance computing? What is the general framework of CUDA programming? What is a kernel function and how to call it from the host code? What is the advantage of splitting a block into threads? Why do we need multiple thread blocks? What are the major memory types in CUDA? When should we use shared memory? What resource factors are critical to GPU programming?

122 Review What is a warp and why do we need it?
What is flow divergence and how to avoid it? What is bank conflict? What is instruction mix? What are the benefits of Dynamic Parallelism? How to measure the performance of GPU code? How to run kernel functions in parallel? How is a floating point number represented in the IEEE format? How to execute Matlab code on GPU?

