Presentation on theme: "Chao Li Yi Yang Hongwen Dai Shenggen Yan Frank Mueller Huiyang Zhou"— Presentation transcript:

1 Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs
Chao Li (North Carolina State University), Yi Yang (NEC Lab), Hongwen Dai (North Carolina State University),
Shenggen Yan (Institute of Software, CAS), Frank Mueller (North Carolina State University), Huiyang Zhou (North Carolina State University)

2 Introduction
Two ways to manage on-chip caches effectively:
Explicit software management: shared memory in GPUs; near memory in MIC KNL
Implicit hardware management: L1 data cache (L1 D-cache)
Accelerators like GPUs provide an ideal platform for this study: both structures share the same hardware resources and are configured using a runtime API (a minimal configuration sketch follows below).
We look for insight into two questions:
Is it worthwhile for application developers to explicitly manage shared memory given the existence of the hardware-managed L1 D-caches in GPUs?
What are the main reasons for code utilizing shared memory to outperform code leveraging L1 D-caches (and vice versa)?
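As context for the runtime-API point above, here is a minimal sketch (not from the slides) of how the shared memory / L1 D-cache split is typically chosen per kernel on Fermi/Kepler-class GPUs; the kernel name myKernel is a placeholder:

    // Minimal sketch, assuming a Fermi/Kepler-class GPU where shared memory and
    // the L1 D-cache share the same 64 KB of on-chip storage per SM.
    __global__ void myKernel(float *out) { /* ... */ }

    int main() {
        // Prefer 48 KB shared memory / 16 KB L1 for kernels that stage data explicitly.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
        // Alternatively, prefer 48 KB L1 / 16 KB shared memory for D-cache versions:
        // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
        myKernel<<<1, 32>>>(nullptr);
        cudaDeviceSynchronize();
        return 0;
    }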

3 Outline
Four detailed case studies:
Matrix Multiplication
FFT
Marching Cubes
Pathfinder
Benchmark categorization and experimental results
Conclusions
(Other detailed cases are covered in the paper.)

4 Matrix Multiplication
[Figure: one tile of Matrix A and one tile of Matrix B move through the memory hierarchy (DRAM, L2 cache, and L1 D-cache or shared memory) to compute Matrix C = A * B.]
Matrix multiplication is an important application in high-performance computing and a basic operation in linear transformations. For matrix multiplication with large matrices, the key optimization is loop tiling/blocking to reduce the working set and thereby shorten reuse distances (a small host-side tiling sketch follows below). The blocked operations can be tuned for each architecture to account for the memory hierarchy.
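To make the tiling idea concrete, here is a minimal host-side C sketch (an illustration, not code from the study); N, TILE, the row-major layout, and a zero-initialized C are assumptions:

    #define N    256
    #define TILE 16

    // Blocked (tiled) C = A * B: each TILE x TILE block of A and B is reused
    // TILE times while it is still resident nearby in the cache hierarchy,
    // which shortens reuse distances compared with the naive triple loop.
    // Assumes row-major storage and C initialized to zero.
    void matmul_tiled(const float *A, const float *B, float *C) {
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int kk = 0; kk < N; kk += TILE)
                    for (int i = ii; i < ii + TILE; i++)
                        for (int j = jj; j < jj + TILE; j++) {
                            float sum = C[i * N + j];
                            for (int k = kk; k < kk + TILE; k++)
                                sum += A[i * N + k] * B[k * N + j];
                            C[i * N + j] = sum;
                        }
    }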

5 Matrix Multiplication (continued)
Shared memory version vs. D-cache version:
1) Same tiling optimization
2) Similar cache performance (95% hit ratio)
3) Same hardware, similar latency
Surprisingly, the D-cache version is 43.8% slower than the shared memory version.
[Figure: the same memory-hierarchy diagram (DRAM, L2 cache, L1 D-cache / shared memory) with a tile of Matrix A and Matrix B staged on chip for C = A * B.]

6 Matrix Multiplication: Software-Managed vs. Hardware-Managed Cache Versions
Software-managed cache version (i.e., shared memory version):
    for (each thread block) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        AS(ty, tx) = A[a + WA * ty + tx];   // data-movement overhead: stage the tile in on-chip AS, BS
        BS(ty, tx) = B[b + WB * ty + tx];
        __syncthreads();
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);
    }
Hardware-managed cache version (i.e., D-cache version):
    for (each thread block) {
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += A[a + WA*ty + k] * B[b + k*WB + tx];
    }
Same tiling configuration for both versions: capacity 48 KB for both caches; matrix size 256x256; tile size 16x16; thread block (16,16); thread block number 5 for both versions.
The key optimization is loop tiling to reduce the working set. The matrix multiplication kernel from the NVIDIA SDK uses shared memory to store the tile: it reads tiles into shared memory and performs the multiply-adds on AS and BS before moving to the next tiles, so the tile enjoys low access latency and the programmer explicitly controls where the data is loaded and how it is accessed, at the cost of extra data-movement instructions. When the L1 D-cache is used instead, the tile is stored in the cache implicitly. We expected the D-cache version to be only slightly slower; on real hardware it is 43.8% slower, and GPGPU-Sim shows the same trend, with a higher instruction count that confirms the code analysis.

Code version: Time on GTX480 (ms) / Cycles on GPGPU-Sim
Software-managed cache version: 0.16 / 132,844
Hardware-managed cache version: 0.23 (43.8% slower) / 281,831 (112.2% slower)
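For reference, a self-contained sketch of the shared-memory version as a complete kernel is shown below. It follows the structure of the NVIDIA SDK kernel the slides refer to but is not the authors' exact code; square row-major matrices with a width divisible by BLOCK_SIZE are assumed.

    #define BLOCK_SIZE 16

    // Sketch of the software-managed (shared memory) version; `width` is the
    // matrix dimension and is assumed to be a multiple of BLOCK_SIZE.
    __global__ void matrixMulShared(float *C, const float *A, const float *B, int width) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int tx = threadIdx.x, ty = threadIdx.y;
        int row = blockIdx.y * BLOCK_SIZE + ty;
        int col = blockIdx.x * BLOCK_SIZE + tx;
        float Csub = 0.0f;

        for (int t = 0; t < width / BLOCK_SIZE; ++t) {
            // Stage one tile of A and one tile of B in shared memory.
            As[ty][tx] = A[row * width + t * BLOCK_SIZE + tx];
            Bs[ty][tx] = B[(t * BLOCK_SIZE + ty) * width + col];
            __syncthreads();
            #pragma unroll
            for (int k = 0; k < BLOCK_SIZE; ++k)
                Csub += As[ty][k] * Bs[k][tx];
            __syncthreads();  // ensure the tile is fully consumed before it is overwritten
        }
        C[row * width + col] = Csub;
    }

The hardware-managed (D-cache) version drops the two __shared__ arrays, the staging stores, and the barriers, and reads A and B from global memory directly inside the k loop.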

7 Dynamic Instruction Count
The shared-memory version executes 12.7% more instructions than the D-cache version.
Perfect L1 cache: the shared-memory version is still faster than the D-cache version (by 0.2%).
Micro-benchmark: is it possible for the D-cache version to outperform the shared-memory version?
Shared memory version:
    __global__ void Micro_sm(float *D1, float *D2, int iter, int initer, int stride)
    {
        int i, j;
        float temp = 0;
        __shared__ float sh[32];
        for (j = 0; j < iter; j++) {
            sh[threadIdx.x] = D1[threadIdx.x];
            if (threadIdx.x == 0)
                for (i = 0; i < initer; i++)
                    temp += sh[i];
            D1 += stride;
        }
        D2[0] = temp;
    }
D-cache version:
    __global__ void Micro_L1(float *D1, float *D2, int iter, int initer, int stride)
    {
        int i, j;
        float temp = 0;
        for (j = 0; j < iter; j++) {
            temp = D1[threadIdx.x];
            if (threadIdx.x == 0)
                for (i = 0; i < initer; i++)
                    temp += D1[i];
            D1 += stride;
        }
        D2[0] = temp;
    }
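A possible host-side launch of these micro-kernels is sketched below; the buffer size and the iter/initer/stride values are illustrative assumptions, not the parameters used in the study:

    int main() {
        float *d_D1, *d_D2;
        cudaMalloc((void**)&d_D1, 1 << 20);        // input buffer walked with `stride` (1 MB)
        cudaMalloc((void**)&d_D2, sizeof(float));  // single-element output
        cudaMemset(d_D1, 0, 1 << 20);

        // One 32-thread block, so sh[32] and D1[threadIdx.x] cover exactly one warp.
        Micro_sm<<<1, 32>>>(d_D1, d_D2, /*iter=*/1000, /*initer=*/32, /*stride=*/32);
        Micro_L1<<<1, 32>>>(d_D1, d_D2, /*iter=*/1000, /*initer=*/32, /*stride=*/32);
        cudaDeviceSynchronize();

        cudaFree(d_D1);
        cudaFree(d_D2);
        return 0;
    }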

8 Dynamic Instruction Count (continued)
The shared-memory version executes 12.7% more instructions than the D-cache version.
Perfect L1 cache: the shared-memory version is still faster than the D-cache version (by 0.2%).
Micro-benchmark: the D-cache version is 13.0% faster than the shared-memory version.
Cache configuration:
1) Cache associativity: going from 6-way set-associative to fully associative lowers the miss rate from 3.59 MPKI to 1.79 MPKI and improves performance by 12.8%, yet the D-cache version is still 88.7% slower than the shared-memory version.
2) Cache capacity: 16 kB vs. 48 kB has no impact.
What happened? It is not the total number of misses that matters, but how the cache misses overlap with each other, i.e., memory-level parallelism (MLP).

9 D-Cache Version: Access Pattern
    for (int k = 0; k < BLOCK_SIZE; ++k) {
        Csub += A[a+WA*ty+k] * B[b+k*WB+tx];
    }
For the D-cache version, each thread block loads a tile from matrix A and a tile from matrix B and performs the computation. For warps of 32 threads and a 16x16 thread block, each warp loads elements from two rows of A but from only one row of B.
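The miss counts illustrated on the following slides can be reproduced with the small host-side C sketch below (an illustration under stated assumptions: 4-byte floats, 128-byte cache lines, WA = WB = 256, a = b = 0, a cold cache):

    #include <stdio.h>

    // Count, for each iteration k, how many *new* 128-byte cache lines the
    // accesses A[a+WA*ty+k] (one element per ty row) and B[b+k*WB+tx]
    // (16 consecutive elements) of a 16x16 thread block touch.
    #define MAXL 1024
    static long seen[MAXL];
    static int nseen = 0;

    static int is_new(long line) {
        for (int i = 0; i < nseen; i++)
            if (seen[i] == line) return 0;
        seen[nseen++] = line;
        return 1;
    }

    int main(void) {
        const int W = 256;          // WA = WB = 256 (assumed matrix width)
        const int LINE = 128 / 4;   // cache-line size in floats
        for (int k = 0; k < 4; k++) {
            int misses = 0;
            for (int ty = 0; ty < 16; ty++)   // A: 16 rows, 16 different lines at k = 0
                misses += is_new((long)(W * ty + k) / LINE);
            for (int tx = 0; tx < 16; tx++)   // B: one row, a single line per k
                misses += is_new(100000 + (long)(W * k + tx) / LINE);  // offset keeps B lines distinct from A lines
            printf("k=%d: %d cold misses\n", k, misses);  // prints 17, 1, 1, 1
        }
        return 0;
    }

The printed pattern, 17 misses at k = 0 and then 1 miss per iteration, is exactly what the next slides illustrate.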

10 D-Cache Version: Cache Misses at k = 0
[Figure: at k = 0, warps 0-7 (each with tx = 0~15 and two ty rows) access A[a+WA*0+0] through A[a+WA*15+0], which fall in 16 different cache lines, while every warp accesses the same line B[b+0*WB+0~15].]
16 cache misses (A) + 1 cache miss (B) = 17 cache misses.

11 D-Cache Version: Cache Misses at k = 1
[Figure: at k = 1, the accesses A[a+WA*0+1] through A[a+WA*15+1] hit in the cache because their lines were loaded at k = 0, while B[b+1*WB+0~15] touches one new line.]
0 cache misses (A) + 1 cache miss (B) = 1 cache miss.

12 D-Cache Version: Cache Misses at k = 1 (with load pairs)
[Figure: the same access pattern annotated with the per-warp LD A / LD B instruction pairs: after the 17 misses at k = 0, each iteration contributes only the single miss for B.]
0 cache misses (A) + 1 cache miss (B) = 1 cache miss.

13 D-Cache Version: Miss Timeline
[Figure: timeline of the LD A / LD B pairs for k = 0, 1, 2: 17 misses at k = 0, then 1 miss per iteration.]

14 D-Cache Version: Miss Timeline (continued)
[Figure: the timeline extended over k = 0, 1, 2, 3, and beyond; after the 17 misses at k = 0, every later iteration adds a single isolated miss.]

15 Shared-Memory Version: Loading the Tiles
    AS(ty, tx) = A[a + WA * ty + tx];
    BS(ty, tx) = B[b + WB * ty + tx];
    __syncthreads();
[Figure: warps 0-7 issue their ld A instructions for A[a+WA*0+0] through A[a+WA*15+0] back to back, so the 16 cache misses for tile A are all outstanding together.]
16 cache misses (A) = 16 cache misses.

16 Shared-Memory Version: Loading the Tiles (continued)
[Figure: after the ld A / sts A instructions, all warps issue their ld B instructions as well, so the 16 cache misses for tile B also overlap with each other and with the 16 misses for tile A.]
16 cache misses (B) = 16 cache misses.

17 Shared-Memory Version: Instruction Timeline
[Figure: timeline of the shared-memory version: ld A, sts A, ld B (16 + 16 overlapping misses), a sync, then 16 iterations (k = 0..15) of lds a, lds b, fma served entirely from shared memory.]

18 Putting It Together: MLP Comparison
[Figure: side-by-side timelines of the two versions. D-cache version: 17 misses at k = 0, then a single miss in each of the iterations k = 1..15, each separated by computation, so most misses cannot overlap (low MLP). Shared-memory version: the ld A / sts A / ld B sequence issues all 16 + 16 misses back to back before the sync (high MLP), and the following lds a / lds b / fma iterations for k = 0..15 are miss-free, so the total cycle count is reduced.]
Note: the total number of misses is the same for both versions; what differs is how the misses overlap with each other.
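A rough, hedged illustration of the effect (the 400-cycle miss latency is an assumed round number, not a value reported in the study): if the 15 single misses at k = 1..15 in the D-cache version are each exposed one at a time, they alone account for roughly 15 * 400 = 6000 latency cycles per tile, whereas the 32 grouped misses of the shared-memory version, fully overlapped, cost roughly one 400-cycle latency plus the time to issue them. Same miss count, very different exposed latency.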

19 Short Summary
Matrix Multiplication (shared memory version):
High memory-level parallelism for the shared memory version
More dynamic instructions for the shared memory version
Fast Fourier Transform (shared memory version):
Write-evict policy for the D-cache version
Uncoalesced memory accesses for the D-cache version (a generic coalescing sketch follows after this list)
Marching Cubes (D-cache version):
High thread-level parallelism for the D-cache version
PathFinder (D-cache version):
More opportunities to store data in registers for the D-cache version
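As background for the uncoalesced-access point, here is a generic sketch (not the FFT code from the study) contrasting a coalesced and a strided global-memory access pattern:

    // Coalesced: consecutive threads of a warp read consecutive addresses, so the
    // warp's 32 loads are served by one or two 128-byte memory transactions.
    __global__ void copyCoalesced(float *dst, const float *src) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        dst[i] = src[i];
    }

    // Uncoalesced: consecutive threads read addresses `stride` elements apart, so a
    // warp's 32 loads can require up to 32 separate transactions. Staging such data
    // through shared memory is a common way to restore coalesced global accesses.
    __global__ void copyStrided(float *dst, const float *src, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        dst[i] = src[i * stride];
    }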

20 Experimental Methodology
Experimental setup:
Real hardware: GTX480 (Fermi) and GTX680 (Kepler)
Simulator: GPGPU-Sim v3.2.1
Simulator configuration:
# of execution cores: 15
SIMD pipeline width: 16
Number of threads/core: 1536
Number of registers/core: 32768
Shared memory/core: 16 kB / 48 kB; 32 banks; 3-cycle latency; 1 access per cycle
L1 data cache/core: 16 kB (128 B line, 4-way assoc) or 48 kB (128 B line, 6-way assoc); 1-cycle hit latency
MSHR entries: 128
Constant cache size/core: 8 kB
Texture cache size/core: 12 kB, 128 B line, 24-way assoc
L2 data cache: 768 kB, 128 B line, 16-way assoc
Number of memory channels: 6
Memory channel bandwidth: 8 bytes/cycle
DRAM clock: 1400 MHz
DRAM scheduling queue size: 16, out-of-order (FR-FCFS)
Warp scheduling policy: greedy-then-oldest scheduler

21 Benchmarks: 16 GPGPU workloads covering typical on-chip memory usages.

22 Performance Impact: Fermi
[Chart: performance on GTX480; highlighted data label: 55.7%.]

23 Performance Impact: Kepler
[Chart: performance on GTX680.]

24 Performance Impact: GPGPU-Sim
[Chart: performance on GPGPU-Sim; highlighted data label: 27%.]

25 Performance Impact: GPGPU-Sim (continued)
[Chart: performance on GPGPU-Sim; highlighted data label: 27%.]

26 Energy Impact: GPUWattch
On average, the shared memory versions consume 48.5% of the energy of the D-cache versions with the write-evict (WE) policy, 53.7% of the D-cache versions with the write-back write-allocate (WBWA) policy, and 71.9% of the D-cache versions with WBWA and a fully associative (FA) cache.

27 Conclusion
An in-depth study of the interesting tradeoffs between software-managed caches and hardware-managed caches.
Key reasons for shared memory versions to outperform D-cache versions:
Memory-level parallelism
Memory coalescing (bank-conflict-free accesses)
D-cache versions mainly benefit from:
Improved thread-level parallelism
Using registers to store variables
Shared memory versions outperform D-cache versions and consume less energy for most of the benchmarks, justifying the software complexity of managing shared memory.

28 Thank You! Questions?

29 D-Cache Version of Matrix Multiplication with Prefetching
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        prefetch(A[a + WA * ty + tx]);   // prefetch the tile of matrix A
        prefetch(B[b + WB * ty + tx]);   // prefetch the tile of matrix B
        __syncthreads();
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += A[a+WA*ty+k] * B[b+k*WB+tx];   // tx = threadIdx.x and ty = threadIdx.y
    }
Figure: the D-cache version of matrix multiplication with prefetching instructions to improve MLP.
Similar complexity compared to explicit data management.
The subtle MLP impact is not obvious, so it is non-intuitive to engage in this prefetching optimization.
Even with prefetching instructions, the code is still 8% slower than the shared memory version.
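The prefetch() calls above are pseudocode. One way such a prefetch could be expressed on real hardware is with inline PTX, as in the hedged sketch below (it assumes the target supports the PTX prefetch.global.L1 instruction; the helper name prefetch_l1 is ours, not from the slides):

    // Issue an L1 prefetch for the cache line containing `ptr`.
    __device__ __forceinline__ void prefetch_l1(const void *ptr) {
        asm volatile("prefetch.global.L1 [%0];" :: "l"(ptr));
    }

    // The pseudocode lines in the loop above would then become:
    //     prefetch_l1(&A[a + WA * ty + tx]);
    //     prefetch_l1(&B[b + WB * ty + tx]);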

