ECE 8823A GPU Architectures Module 5: Execution and Resources - I


1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I

2 Reading Assignment Kirk and Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Chapter 6; CUDA Programming Guide

3 Objective
To understand the implications of programming model constructs on the demand for execution resources
To be able to reason about the performance consequences of programming model parameters: thread blocks, warps, memory behaviors, etc. (a deeper understanding of the architecture, covered later, is needed to make this really valuable)
To understand DRAM bandwidth: the cause of the DRAM bandwidth problem and the programming techniques that address it, such as memory coalescing and corner turning
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al

4 Closer Look: Formation of Warps
How do you form warps out of multidimensional arrays of threads? Linearize the thread IDs.
[Figure: a 1D thread block and a 3D thread block, e.g., Block (1,1) of Grid 1 with threads (0,0,0) through (1,0,3), being flattened into warps]

5 Formation of Warps
[Figure: 2D and 3D thread blocks flattened into linear order: T0,0,0 T0,0,1 T0,0,2 T0,0,3 T0,1,0 T0,1,1 T0,1,2 T0,1,3 T1,0,0 T1,0,1 T1,0,2 T1,0,3 T1,1,0 T1,1,1 T1,1,2 T1,1,3]

6 Mapping Thread Blocks to Warps
Thread Block: an example with a warp size of 16 threads, where threads T0,0 through T3,3 form Warp 0 and threads T4,0 through T7,3 form Warp 1.
Follow row-major order through the Z-dimension: linearize the thread IDs, then split the linear order into warps (a minimal index sketch follows).
This understanding becomes important when optimizing global memory accesses.
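Below is a minimal index sketch, assuming the row-major linearization described above and the 32-thread warps of real hardware (the 16-thread warp above is only a teaching example); the kernel and helper names are illustrative, not from the text.

__device__ unsigned int linearThreadIndex()
{
    // Row-major linearization: x varies fastest, then y, then z
    return threadIdx.z * blockDim.y * blockDim.x
         + threadIdx.y * blockDim.x
         + threadIdx.x;
}

__global__ void whichWarp(unsigned int *warpIdOut)
{
    unsigned int tid  = linearThreadIndex();
    unsigned int warp = tid / 32;   // warp number within the block
    unsigned int lane = tid % 32;   // position within the warp
    warpIdOut[tid] = warp;          // record each thread's warp number (single-block example)
    (void)lane;
}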

7 Execution of Warps Each warp is executed as a SIMD bundle
How do we handle divergent control flow among threads in a warp? Execution semantics How is it implemented? (later) How can we optimize against it? © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al,

8 Impact of Control Divergence
Occurs within a warp. Branches lead to serialization of branch-dependent code: if(…) {…} else {…}
Performance issue: low warp utilization, since threads on the untaken path sit idle during each serialized path until reconvergence.

9 Causes
Traditional nested branches
Loops: variable number of iterations per thread, e.g., a loop condition based on the thread ID
Switching on the thread ID, e.g., if (threadIdx.x > 5) {} (a sketch follows)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al
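A small illustrative kernel (hypothetical, not from the text) contrasting a branch that diverges within a warp with one whose condition is uniform across each 32-thread warp:

__global__ void divergenceExample(float *data)
{
    int t = threadIdx.x;

    // Divergent: within warp 0, threads 0-5 and threads 6-31 take different paths
    if (t > 5)
        data[t] = 2.0f * data[t];
    else
        data[t] = 0.0f;

    // Not divergent: the condition is identical for all 32 threads of a warp
    if ((t / 32) % 2 == 0)
        data[t] += 1.0f;
}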

10 Control Divergence Mitigation: Algorithmic Approach
Flexibility of MIMD control flow + benefits of SIMD execution: can algorithmic techniques maximize the utilization achieved by a warp?

11 Reduction
A commonly used strategy for processing large input data sets
There is no required order of processing elements in a data set (the operation is associative and commutative)
Partition the data set into smaller chunks
Have each thread process a chunk
Use a reduction tree to summarize the results from each chunk into the final answer
We will focus on the reduction tree step for now. Google and Hadoop MapReduce frameworks are examples of this pattern
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois

12 A parallel reduction tree algorithm performs N-1 operations in log(N) steps
[Figure: max-reduction tree over the eight inputs 3, 1, 7, 0, 4, 1, 6, 3: pairwise max gives 3, 7, 4, 6; the next level gives 7, 6; the final result is 7, i.e., 7 operations in 3 steps for N = 8]
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois

13 Reduction: Approach 1
extern __shared__ float partialsum[];
..
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialsum[t] += partialsum[t + stride];
}
[Figure: eight threads reducing shared-memory elements 0..7 in pairs 0+1, 2+3, 4+5, 6+7, then partial sums 0..3 and 4..7, then 0..7]
O(N) additions and therefore work efficient? What about hardware efficiency? (A complete kernel sketch follows.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al
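A self-contained sketch of Approach 1 as a complete kernel; the global-memory load, the bounds check, the fixed block size, and the per-block output array are assumptions added here for illustration.

#define BLOCK_SIZE 256

__global__ void reduceApproach1(const float *in, float *blockSums, int n)
{
    __shared__ float partialsum[BLOCK_SIZE];

    unsigned int t = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + t;
    partialsum[t] = (i < n) ? in[i] : 0.0f;   // one input element per thread

    // Interleaved strides: the active threads become sparse within every warp
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2 * stride) == 0)
            partialsum[t] += partialsum[t + stride];
    }

    if (t == 0)
        blockSums[blockIdx.x] = partialsum[0];   // one partial result per block
}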

14 A Better Strategy
Principle: shift the index usage to ensure high thread utilization within a warp
Remap the thread indices so that the active threads stay consecutive
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois

15 An Example with 16 Threads
[Figure: threads 0 through 15 each add one element from the second half of a 32-element array: 0+16, 1+17, …, 15+31; the active threads are consecutive, so there is no divergence]
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois

16 Reduction: Approach 2
extern __shared__ float partialsum[];
..
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    __syncthreads();
    if (t < stride)
        partialsum[t] += partialsum[t + stride];
}
The difference is in which threads diverge!
For a thread block of 512 threads, the first iteration has stride 256: threads 0 through 255 take the branch, threads 256 through 511 do not.
For a warp size of 32, all threads in a warp then have identical branch conditions, so there is no divergence!
When the number of active threads drops below the warp size, the old problem returns. (A host-side usage sketch follows.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al
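A hedged host-side usage sketch; the array names, the input size, and the final pass over per-block sums on the CPU are illustrative assumptions. The same launch works for the Approach 1 kernel sketched earlier or for an Approach 2 kernel built from the loop above.

int n       = 1 << 20;
int threads = 256;
int blocks  = (n + threads - 1) / threads;

float *d_in, *d_blockSums;
cudaMalloc(&d_in, n * sizeof(float));
cudaMalloc(&d_blockSums, blocks * sizeof(float));
// (copy or generate the input data in d_in here)

reduceApproach1<<<blocks, threads>>>(d_in, d_blockSums, n);

float *h_blockSums = (float *)malloc(blocks * sizeof(float));
cudaMemcpy(h_blockSums, d_blockSums, blocks * sizeof(float), cudaMemcpyDeviceToHost);

float total = 0.0f;
for (int b = 0; b < blocks; ++b)
    total += h_blockSums[b];   // combine the per-block partial sums on the host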

17 Global Memory Bandwidth
How can we map thread access patterns to global memory addresses to maximize bandwidth utilization? Need to understand the organization of DRAMs! Hierarchy of latencies © David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al,

18 Basic Organization 1 1 decode Sense amps and buffer Mux
1 1 decode Example: 32x32 = 1024 bit array Sense amps and buffer Mux ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, I/O pins

19 1Gb Micron DDR2 SDRAM
[Figure: datasheet timing diagram highlighting the row access time and the column access time]

20 Technology Trends
Over the past two decades, the data rate has increased by roughly 1000x, while the RAS/CAS latency decrease is only about 56%.
How? By increasing the burst length.
Courtesy: Synopsys DesignWare Technical Bulletin

21 DRAM Bursting for an 8x2 Bank
Modern DRAM systems are designed to always be accessed in burst mode: the address bits go to the decoder, the core array is accessed (access delay), and the data is then transferred 2 bits at a time over the pins.
Burst bytes are transferred but discarded when accesses are not to sequential locations.
[Figure: non-burst timing vs. burst timing for the 8x2 bank]
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois

22 Multiple DRAM Banks
[Figure: two banks, Bank 0 and Bank 1, each with its own decoder, sense amps, and mux, sharing the data pins]
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois

23 DRAM Bursting for the 8x2 Bank
[Figure: single-bank burst timing leaves dead time on the interface during the core-array access delay; multi-bank burst timing overlaps accesses to different banks and reduces the dead time]
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois

24 First-order Look at the GPU off-chip memory subsystem
NVIDIA V100 (Volta) GPU: peak global memory bandwidth = 900 GB/s, with global memory (HBM2) on a 4096-bit interface
Prior-generation GPUs (e.g., Kepler): 384-bit wide interface, 224 GB/s
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois

25 Multiple Memory Channels
Divide the memory address space into N parts, where N is the number of memory channels, and assign each portion to a channel (a toy mapping sketch follows).
"You can buy bandwidth but you can't bribe God" -- Unknown
[Figure: four channels, Channel 0 through Channel 3, each with its own set of banks]
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois
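A toy sketch of channel interleaving; the burst size, channel count, and mapping policy below are assumptions chosen for illustration, not the actual hardware policy.

// Hypothetical mapping: consecutive 256-byte bursts rotate across 4 channels
unsigned int channelOf(unsigned long long addr)
{
    const unsigned long long BURST_BYTES  = 256;
    const unsigned long long NUM_CHANNELS = 4;
    return (unsigned int)((addr / BURST_BYTES) % NUM_CHANNELS);
}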

26 Lessons
Organize data accesses to maximize burst-mode bandwidth: access consecutive locations (algorithmic strategies + data layout)
Thread blocks issue warp-sized load/store instructions: 32 addresses for a warp size of 32
Coalescing these accesses into a smaller number of memory transactions maximizes memory bandwidth
More later as we discuss the microarchitecture

27 Memory Coalescing
The loads (LD) issued by the threads of a warp are coalesced into a sequence of memory transactions: accesses that fall within the same segment (e.g., a 128-byte segment) are served by a single transaction.
The ability and extent of coalescing depend on the compute capability. (An illustrative kernel follows.)
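An illustrative (hypothetical) kernel contrasting an access a warp can coalesce with a strided one it cannot; assume the input array is large enough for the strided index.

__global__ void accessPatterns(const float *in, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: a warp touches 32 consecutive floats, so few transactions are needed
    float a = in[i];

    // Not coalesced: consecutive threads are 'stride' elements apart,
    // so the warp's 32 addresses fall in many different segments
    float b = in[i * stride];

    out[i] = a + b;
}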

28 Implications of Memory Coalescing
Coalescing reduces the request rate to the L1 and to DRAM.
Distinct from CPU optimizations. Why? The hardware must be able to re-map the data returned by each coalesced access back to the requesting threads.
[Figure: an SM with warp schedulers, SPs, and register file above the L1/shared memory and DRAM, annotated with the L1 and DRAM access bandwidths]

29 Placing a 2D C array into linear memory space
A 2D C array is linearized in row-major order: rows are laid out one after another at increasing addresses (a small index example follows).
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois
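The row-major rule in code form (a minimal example; d_A and the helper name are illustrative, with Width the number of columns):

// Element A[row][col] of a Width-column 2D C array sits at linear offset row*Width + col
__device__ float elementAt(const float *d_A, int row, int col, int Width)
{
    return d_A[row * Width + col];
}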

30 Base Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // Calculate the row index of the d_P element and of d_M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of d_P and of d_N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
    d_P[Row*Width+Col] = Pvalue;
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois

31 Two Access Patterns
Let's look at the two global-memory access patterns in this kernel:
(a) d_M[Row*Width+k]: each thread walks along one row of d_M
(b) d_N[k*Width+Col]: each thread walks down one column of d_N
k is the loop counter in the inner-product loop of the kernel code.
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois

32 N accesses are coalesced
The access direction of a single thread through d_N[k*Width+Col] is down a column, but across successive threads in a warp (T0, T1, T2, T3) each load iteration touches consecutive elements of one row: N0,0 N0,1 N0,2 N0,3 in iteration 0, N1,0 N1,1 N1,2 N1,3 in iteration 1, and so on. These accesses are coalesced.
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois

33 M accesses are not coalesced
A single thread walks along a row of d_M (d_M[Row*Width+k]), so across successive threads in a warp (T0, T1, T2, T3) each load iteration touches one element per row: M0,0 M1,0 M2,0 M3,0 in iteration 0, M0,1 M1,1 M2,1 M3,1 in iteration 1, with addresses Width elements apart. This leads to many distinct memory transactions for accessing d_M.

34 Using Shared Memory
Original access pattern: each thread reads its d_N elements directly from global memory.
Tiled access pattern: copy a tile into scratchpad (shared) memory, then perform the multiplication using the scratchpad values.
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois

35 Shared Memory Accesses
Shared memory is banked; there is no coalescing of shared-memory accesses
Data access patterns should be structured to avoid bank conflicts (a sketch follows)
Low-order interleaved mapping of addresses to banks?
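A hedged sketch of bank behavior, assuming the common organization of 32 banks with consecutive 4-byte words in consecutive banks (low-order interleaving); the kernel is purely illustrative.

// Launch with a 32x32 thread block
__global__ void bankExample(float *out)
{
    __shared__ float tile[32][32];
    tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;
    __syncthreads();

    // Conflict-free: a warp (fixed y, x = 0..31) reads 32 consecutive words,
    // which land in 32 different banks
    float a = tile[threadIdx.y][threadIdx.x];

    // 32-way conflict: the warp's words are 32 apart, so they all map to one bank
    float b = tile[threadIdx.x][threadIdx.y];

    out[threadIdx.y * 32 + threadIdx.x] = a + b;
}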

36 Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the d_P element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;

    // Loop over the d_M and d_N tiles required to compute the d_P element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of d_M and d_N tiles into shared memory
        Mds[ty][tx] = d_M[Row*Width + m*TILE_WIDTH + tx];
        Nds[ty][tx] = d_N[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    d_P[Row*Width+Col] = Pvalue;
}
The inner-loop accesses come from shared memory, hence coalescing is not necessary there; consider bank conflicts instead. (A launch sketch follows.)
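A short host-side launch sketch for the tiled kernel above; it assumes, as the kernel's loop bound does, that Width is a multiple of TILE_WIDTH, and the variable names are illustrative.

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);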

37 Coalescing Behavior
[Figure: d_M, d_N, and d_P with the Pdsub tile, annotated with Row, Col, m*TILE_WIDTH, and TILE_WIDTH; consecutive threads load consecutive d_M and d_N addresses when filling a tile, so the tile loads coalesce]
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois

38 Thread Granularity
Consider instruction bandwidth vs. memory bandwidth
Control the amount of work per thread
[Figure: an SM with fetch/decode, warp schedulers, SPs, and register file above the L1/shared memory and DRAM]

39 Thread Granularity Tradeoffs
Preserving instruction bandwidth (and memory bandwidth): increase the thread granularity
Merge adjacent tiles, sharing the loaded tile data between them (a fragment sketch follows)
[Figure: d_M, d_N, and d_P with the Pdsub tile, annotated with TILE_WIDTH, WIDTH, m*TILE_WIDTH, Col, and Row]
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois
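A minimal fragment (not the textbook kernel) of the merged-tile idea: one thread accumulates two horizontally adjacent output elements, so the d_M tile in shared memory is fetched once and reused. Nds0 and Nds1 are hypothetical shared-memory tiles for the two adjacent d_N tiles.

// Illustrative fragment of a granularity-doubled kernel
float Pvalue0 = 0.0f, Pvalue1 = 0.0f;       // two output elements per thread
for (int m = 0; m < Width / TILE_WIDTH; ++m) {
    // (collaboratively load Mds, Nds0, and Nds1 for this tile, then __syncthreads())
    for (int k = 0; k < TILE_WIDTH; ++k) {
        float a = Mds[ty][k];                // the d_M tile value is read once...
        Pvalue0 += a * Nds0[k][tx];          // ...reused for the first output column
        Pvalue1 += a * Nds1[k][tx];          // ...and for the adjacent output column
    }
    // (__syncthreads() before the next tiles are loaded)
}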

40 Thread Granularity Tradeoffs (2)
Impact on parallelism: the number of thread blocks and the number of registers per thread
Need to explore the impact: autotuning
[Figure: d_M, d_N, and d_P with the Pdsub tile, annotated with TILE_WIDTH, WIDTH, m*TILE_WIDTH, Col, and Row]
©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois

41 Any more questions? Read Chapter 6!

