
1 Automatic Data Placement Into GPU On-Chip Memory Resources
Chao Li (North Carolina State University), Yi Yang (NEC Labs America), Zhen Lin (North Carolina State University), Huiyang Zhou (North Carolina State University)
www.nec-labs.com

2 Introduction
- GPUs rely on thread-level parallelism (TLP) to hide off-chip latency.
- Judicious utilization of on-chip memory resources remains critical to performance.
- The off-chip memory bandwidth is still a bottleneck, e.g., for big-data applications and deep learning on GPUs.
- Two key challenges:
  - Explicitly managing the intricate on-chip resources;
  - Performance portability across different GPU generations.
- Our solution: automatic data placement into GPU on-chip memory resources
  - Compiler-driven automatic data placement;
  - Focus on programs that have already been reasonably optimized;
  - Revise data placement to achieve both performance enhancement and performance portability.

3 Explicit Resource Management
- Three types of on-chip memory resources: registers, shared memory, and L1 D-caches.
- Different capacity and allocation restrictions:
  - Large register file, small cache;
  - 64 registers per thread, 48KB shared memory per thread block (TB), no limit on D-cache usage.
- Different accessibility:
  - Register file: within threads (within warps on Kepler); shared memory: within a thread block; D-cache: shared by the TBs on the same SM.
- Different performance characteristics:
  - Register file: highest bandwidth; shared memory: high bandwidth with fixed access latency; D-cache: high bandwidth with variable access latency.
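To make the three resource classes concrete, here is a minimal CUDA sketch (a hypothetical kernel written for this transcript, not taken from the paper) that places one value in each class: a register, a shared memory array, and global memory traffic that goes through the L1 D-cache:

    #define TILE 256

    // Assumes blockDim.x == TILE; illustration only.
    __global__ void placement_demo(const float *in, float *out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // Register: private to the owning thread (readable within a warp via shuffle on Kepler).
        float r = (tid < n) ? in[tid] : 0.0f;

        // Shared memory: visible to all threads of the same thread block, fixed access latency.
        __shared__ float s[TILE];
        s[threadIdx.x] = r;
        __syncthreads();

        // Global/local memory: backed by DRAM; on Fermi-class GPUs these accesses are
        // cached in the L1 D-cache, which is shared by all thread blocks on the same SM.
        float left = (threadIdx.x > 0) ? s[threadIdx.x - 1] : 0.0f;
        if (tid < n) out[tid] = r + left;
    }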

4 Performance Portability
- GPUs evolve at a very fast pace.
- Computation throughput has grown faster than off-chip bandwidth: the GFLOPS/GB ratio rose from roughly 10X to 15X (GTX 8800 -> GTX 680).
- Register file and D-cache/shared memory sizes have been changing across generations.

                                   G80 (GTX 8800)  GT200 (GTX 280)  Fermi (GTX 480)  Kepler (GTX 680)  Kepler (K20c)
  Arithmetic throughput (GFLOPS)   504             933              1345             3090              3950
  Memory bandwidth (GB/s)          57              141              177              192               250
  Shared memory size (KB)          16              16               48               48                48
  Register file size (KB)          32              64               128              256               256

5 Our Solution: A Compiler Algorithm for Automatic Data Placement
- Analyze possible data placement patterns;
- Construct compiler algorithms to use the profitable patterns.

6 Data (re)Placement
- Move data from one on-chip resource to another to achieve optimal resource utilization.
- Data (re)placement patterns (slide figure: arrows numbered 1-6 among register variables, shared memory variables, and local/global variables in L1 D-caches):
  - Direction 6: compiler determined, e.g., A[B[0]], B[12];
  - Directions 4 and 5: previous works on specific optimizations, requiring significant code changes; the trend of GPU evolution is toward larger register files;
  - Directions 1, 2, and 3: our focus.

7 Pattern 1: Shared Memory to Registers
- Three reasons:
  - Shared memory usage may limit the TLP;
  - Shared memory has longer access latency and lower bandwidth than registers;
  - Accessing shared memory incurs instruction overhead for address computation.
- Promotion strategy when there are multiple promotable shared memory variables: reference count-based priority.

Baseline:
    __global__ void dynproc_kernel(...) {
        __shared__ float prev[256];
        __shared__ float result[256];
        int tx = threadIdx.x;
        for (int i = 0; i < iteration; i++) {
            ...
            shortest = minum(prev[tx-1], prev[tx], prev[tx+1]);
            result[tx] = shortest + gpuWall[index];
            __syncthreads();
            prev[tx] = result[tx];
            __syncthreads();
        }
        gpuResults[xidx] = result[tx];
    }

Optimized code:
    __global__ void dynproc_kernel(...) {
        __shared__ float prev[256];
        float result;                      // promoted from shared memory into a register
        int tx = threadIdx.x;
        for (int i = 0; i < iteration; i++) {
            ...
            shortest = minum(prev[tx-1], prev[tx], prev[tx+1]);
            result = shortest + gpuWall[index];
            __syncthreads();
            prev[tx] = result;
            __syncthreads();
        }
        gpuResults[xidx] = result;
    }

8 Pattern 2: Shared Memory to L1 D-caches
- Three reasons:
  - Shared memory usage may limit the TLP, but the variable cannot be promoted to registers;
  - Local/global memory implicitly utilizes the L1 D-cache to achieve high performance;
  - Communication among threads can also be ensured through global memory variables.
- To balance the tradeoff between TLP and memory pressure, auto-tuning determines:
  - Which variables to promote;
  - Whether to promote them into global or local memory.

Baseline:
    __global__ void generateTriangles(...) {
        __shared__ float3 vertlist[12*NTHREADS];   // 12*32
        __shared__ float3 normlist[12*NTHREADS];
        // defs of the shared memory arrays
        vertexInterp2(..., vertlist[threadIdx.x], normlist[threadIdx.x]);
        vertexInterp2(..., vertlist[threadIdx.x+NTHREADS], normlist[threadIdx.x+NTHREADS]);
        ...
        edge = tex1Dfetch(triTex, ...);
        // uses of the shared memory arrays
        pos[index] = make_float4(vertlist[(edge*NTHREADS)+threadIdx.x], 1.0f);
        ...
    }

Optimized code:
    __global__ void generateTriangles(...) {
        float3 vertlist[12];                       // demoted to local memory (cached in L1)
        float3 normlist[12];
        // defs of the local memory arrays
        vertexInterp2(..., vertlist[0], normlist[0]);
        vertexInterp2(..., vertlist[1], normlist[1]);
        ...
        edge = tex1Dfetch(triTex, ...);
        // uses of the local memory arrays
        pos[index] = make_float4(vertlist[edge], 1.0f);
        ...
    }

9 Pattern 3: Shared Memory/D-cache to Registers to Achieve Register Tiling
- Two reasons:
  - A common side effect of SPMD: redundant computations and memory accesses across threads;
  - Redundant shared/global memory usage can be converted into register usage.
- Three ways to save bandwidth:
  - Implicitly through the L1 D-cache: hits are cheap, but the data may be evicted by other accesses;
  - Shared memory: select only one warp for the loading task, which adds control flow and __syncthreads();
  - Register file: not shared among warps, so compact warps of threads first; introduce C_Factor to find the best register tiling.

Baseline:
    __global__ void srad_kernel(int *c_cuda, ...) {
        int index_s = cols * BLOCK_SIZE * by + BLOCK_SIZE * bx + cols * BLOCK_SIZE + tx;   // BLOCK_SIZE = 16
        __shared__ float south_c[BLOCK_SIZE][BLOCK_SIZE];
        ...
        south_c[ty][tx] = c_cuda[index_s];
        if (by == gridDim.y - 1)
            south_c[ty][tx] = ...
        __syncthreads();
        ...
    }

Optimized code:
    __global__ void srad_kernel(int *c_cuda, ...) {
        int index_s = cols * BLOCK_SIZE * by + BLOCK_SIZE * bx + cols * BLOCK_SIZE + tx;   // BLOCK_SIZE = 16
        __shared__ float south_c[BLOCK_SIZE][BLOCK_SIZE];
        ...
        int tmp_1 = c_cuda[index_s];       // redundant load performed once, kept in a register
        #pragma unroll
        for (int m = 0; m < C_Factor; m++)
            south_c[ty + m*blockDim.y/C_Factor][tx] = tmp_1;
        if (by == gridDim.y - 1)
            south_c[ty][tx] = ...
        __syncthreads();
        ...
    }

10 Our Solution
- Analyze possible data placement patterns;
- Compiler algorithms to utilize the profitable patterns.

11 Compiler Algorithms
- Compiler pass 1 handles patterns 1 & 2; compiler pass 2 handles pattern 3.
- Each compiler pass has three stages:
  - Identification stage: scan the kernel and generate a list of candidate variables by collecting architecture features and analyzing memory access behavior;
  - Processing stage: implement the placement patterns;
  - Auto-tuning stage: construct the search space, decide which variables to process, and pick the best code generation.

12 Compiler Pass 1
- Input kernel -> identify and collect shared memory variables -> analyze memory access behavior (memory reference counts and allocation sizes) -> candidate variable list.
- For each candidate variable, in priority order, test: (a) is the access shared across threads? (b) is the access index decided only at runtime?
  - !(a) && !(b): promote to the register file;
  - !(a) && (b): promote to local memory;
  - (a): promote to global memory.
- Generate a new kernel and auto-tune for the optimal kernel.

13 Compiler Pass 2
- Identify and analyze the access behavior of global and shared memory variables.
- Check for redundancy along the x or y thread dimension and generate the redundancy type.
- Collect the expressions whose indices feature redundancy and dump out the expression (expr) list; the expressions in the expr list will be performed only once (i.e., no redundancy).
- For each C_Factor: adjust the thread block dimension and construct an unrollable loop for thread compaction/coarsening/merging.
- Generate a new kernel and auto-tune for the optimal kernel.

14 Auto-Tuning
- Auto-tuning steps:
  - Construct a search space based on the tunable parameters;
  - Measure the execution time of each variant;
  - Select the best performing code variant for the target architecture.
- Three search spaces are constructed for data placement:
  - How many, and which, shared memory variables should be promoted into the register file;
  - Which shared memory variables should be promoted into local/global memory;
  - The compaction factor.
- Search space pruning strategies:
  - Memory reference count-based priority;
  - Allocation size-based priority;
  - Limit the compaction factor to powers of 2.
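As a rough host-side illustration of these steps (enumerate a small search space, time each variant, keep the fastest), here is a self-contained CUDA sketch; the two stand-in kernels, their names, and the launch shapes are hypothetical and only mimic what the code generator would emit:

    #include <cuda_runtime.h>
    #include <cfloat>
    #include <cstdio>

    // Stand-in code variants (real variants would differ in promoted variables / compaction factor).
    __global__ void variant_plain(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }
    __global__ void variant_compact2(const float *in, float *out, int n) {
        int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);   // each thread handles 2 elements
        if (i < n)     out[i]     = in[i]     * 2.0f;
        if (i + 1 < n) out[i + 1] = in[i + 1] * 2.0f;
    }

    // Time one launch of a variant with CUDA events; returns milliseconds.
    static float time_variant(const void *kernel, dim3 grid, dim3 block,
                              float *d_in, float *d_out, int n) {
        void *args[] = { &d_in, &d_out, &n };
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaLaunchKernel(kernel, grid, block, args, 0, 0);   // no dynamic smem, default stream
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main() {
        const int n = 1 << 22;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemset(d_in, 0, n * sizeof(float));

        // The (pruned) search space: each entry is one code variant plus its launch shape.
        dim3 block(256);
        struct Point { const char *name; const void *kernel; dim3 grid; };
        Point space[] = {
            { "plain",    (const void *)variant_plain,    dim3((n + 255) / 256) },
            { "compact2", (const void *)variant_compact2, dim3((n / 2 + 255) / 256) },
        };

        // Measure every point and keep the fastest (a real tuner would warm up and average runs).
        int best = -1;
        float best_ms = FLT_MAX;
        for (int i = 0; i < 2; ++i) {
            float ms = time_variant(space[i].kernel, space[i].grid, block, d_in, d_out, n);
            printf("%-8s %8.3f ms\n", space[i].name, ms);
            if (ms < best_ms) { best_ms = ms; best = i; }
        }
        printf("best variant: %s (%.3f ms)\n", space[best].name, best_ms);

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }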

15 Preprocessor
- Memory access index regulation:
  - Indices are normalized into an affine function of the thread index;
  - Scaling factors may be macros/constant variables, kernel launch parameters, or run-time parameters.
- Dynamic loop bounds:
  - Let the user provide the information through profiling; or
  - Use a simple heuristic: a default loop count of 4.
- Collect data structure declarations and annotate data types:
  - int2/float4 vector types are processed the same as int/float;
  - User-defined struct types are identified separately.
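A small illustration (hypothetical kernel and names, not from the paper) of the affine index form the preprocessor normalizes to, versus an index that is only decided at run time:

    #define TILE 16                                   // macro used as a scaling factor

    __global__ void index_forms(float *A, const int *B, int stride, int base) {
        int tx = threadIdx.x, ty = threadIdx.y;

        // Affine index: base + stride*ty + TILE*tx. The coefficients are a run-time
        // parameter (stride), a kernel argument (base), and a macro (TILE), so the
        // access pattern across threads can be analyzed statically.
        float v = A[base + stride * ty + TILE * tx];

        // Runtime-decided index: B[tx] is unknown at compile time, so an array accessed
        // this way would be flagged as runtime-indexed by the identification stage.
        float w = A[B[tx]];

        A[base + stride * ty + TILE * tx] = v + w;
    }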

16 Experimental Methodology
- Implementation:
  - Implemented in Cetus, a source-to-source compiler framework;
  - Basic CUDA syntax support from MCUDA.
- Evaluation environment: three GPU generations, with all possible D-cache and shared memory capacity configurations.

  Parameter                                   GTX480               GTX680                        K20c
  <Shared memory size, L1 D-cache size> (KB)  <16,48>, <48,16>     <16,48>, <32,32>, <48,16>     <16,48>, <32,32>, <48,16>
  Register file size                          128KB                256KB                         256KB
  Max number of threads per SM                512                  1024                          1536
  Max number of registers per thread          64                   64                            256
  Compaction factor                           2, 4, 8, 16

17 Benchmarks
- Shared memory allocation size is defined by the programmer.
- Initial register allocation is controlled statically by the compiler and architecture parameters.

                                             GTX480         GTX680         K20c
  Benchmark                  Input           regs  smem     regs  smem     regs  smem
  HotSpot (HS)               height 2        35    3072     36    3072     39    3072
  Back Prop 1 (BP1)          65536 layer     13    1088     11    1088     12    1088
  Back Prop 2 (BP2)          65536 layer     22    0        20    0        21    0
  SRAD1 (SR1)                2048*2048       20    0              0        26    0
  SRAD2 (SR2)                2048*2048       19    0        20    0              0
  Matrix Multiply (MM)       2048*2048       23    8192     26    8192     25    8192
  Path Finder (PF)           409600 steps    16    2048     18    2048     17    2048
  N-Queue (NQU)              N=8             15    15744    19    15744    16    15744
  Marching Cubes (MC)        32768 voxels    63    9216     63    9216     76    9216
  B+tree1 (BT1)              qrSize=6000     18    0        19    0        21    0
  B+tree2 (BT2)              qrSize=6000     23    0        28    0        30    0
  LU-Decompose (LUD)         2048.dat        15    2048     17    2048     17    2048

  (regs = registers per thread; smem = shared memory allocation in bytes.)

18 Performance Gains from Automatic Data Placement
- Measurement:
  - Baseline: for the original kernel, select the best result over all shared memory/L1 D-cache size configurations;
  - For each device, generate the kernel with the optimal data placement choices.
- Results:
  - GTX480: up to 4.14X, average of 1.76X; GTX680: up to 3.30X, average of 1.61X; K20c: up to 2.44X, average of 1.48X.

19 Optimal Parameters for Different GPUs (the number of shared memory arrays to be promoted, or the C_Factor)
- Performance portability:
  - Our compiler generates the optimized kernel for each specific architecture;
  - The different architecture features of these GPUs lead to different optimal parameters.

20 Auto-Tuning Results
- Effective pruning:
  - The search space is reduced significantly;
  - The performance of the optimized kernel is not impacted.
- The resulting auto-tuning time is small.

  Benchmark   Original search space   Pruned search space   Auto-tuning time (ms)
  HS          48                      8                     42.873
  BP1         16                      3                     11.361
  BP2         16                      4                     15.755
  SR1         16                      5                     24.133
  SR2         16                      5                     21.941
  MM          32                      5                     210.876
  PF          1                       1                     8.88
  NQU         45                      12                    48.124
  MC          9                       6                     23.986
  BT1         3                       3                     12.183
  BT2         3                       3                     14.343
  LUD         16                      4                     129.531

21 Conclusions
- GPUs have been widely used for general-purpose computation:
  - Achieving high performance is not easy; one reason is the intricate on-chip memory resources;
  - Manually tuned code for one device may not perform well on a new device.
- We propose compiler-driven automatic data placement as our solution:
  - Our compiler algorithm refines GPU programs by altering data placement to achieve both performance enhancement and performance portability;
  - We show that, on different GPU devices, the kernels optimized with our compiler algorithm achieve significant performance improvements.

22 Backup

23 Effectiveness Breakdown

24 Impact of Input Sizes (Marching Cubes)
- Problem input size impact:
  - The optimized code generation for on-chip data placement is generally input agnostic;
  - Larger inputs tend to show higher benefit.

25 Compiler Pass 1

    Kernel shared_to_register_or_local_or_global(Kernel kernel) {
        Kernel best_kernel = kernel;
        float exe_time = eval(kernel);                 // execution time of the original kernel

        /** Identification Stage **/
        List arrays;
        for (each shared memory array sma in kernel) {
            sma.is_overlap = false; sma.is_index = false;
            sma.access_count = 0;   sma.size = allocation_size;
            for (each access acc of array sma) {
                sma.access_count += (acc in loop) ? loop_count : 1;
                if (acc is overlapped across threads)
                    sma.is_overlap = true;
                else if (the address of acc is calculated at runtime)
                    sma.is_index = true;
            }
            if (sma.access_count > 0) arrays.add(sma);
        } // end for

        while (arrays is not empty) {
            /** Processing Stage **/
            sma = array with the largest access_count in arrays; pop it out;
            if (!sma.is_index and !sma.is_overlap)
                replace sma with register file;
            else if (sma.is_index and !sma.is_overlap)
                replace sma with local memory;
            else
                replace sma with global memory;

            /** Auto-tuning Stage **/
            generate a new kernel nkernel;
            exe_time1 = eval(nkernel);                 // execution time of nkernel
            if (exe_time1 < exe_time) {                // the new kernel is better
                best_kernel = nkernel;
                exe_time = exe_time1;
            } else
                return best_kernel;                    // found the best kernel
        } // end while
        return best_kernel;
    }

26 Compiler Pass 2

    Kernel register_tiling_via_thread_compaction(Kernel kernel) {
        Kernel best_kernel = kernel;
        float exe_time = eval(kernel);                 // execution time of the original kernel

        /** Identification Stage **/
        List exprs;
        bool is_redundant_1d = false, is_redundant_2d = false;
        for (each shared/global memory array sma in kernel) {
            for (each access acc of array sma in expression expr) {
                if (acc is independent of one thread dimension) {
                    is_redundant_1d = true; exprs.add(expr);
                }
                if (is_redundant_1d && acc is independent of the other thread dimension) {
                    is_redundant_2d = true; exprs.add(expr);
                }
            }
        } // end for

        for (each candidate compaction factor C_Factor) {
            /** Processing Stage **/
            adjust the thread block dimension for C_Factor;
            if (is_redundant_2d) {
                construct a two-level loop with loop bounds C_Factor.x and C_Factor.y
                    to perform the workload of the compacted threads;
                convert each expr in exprs from inter-thread memory usage into register (array) usage;
            } else if (is_redundant_1d) {
                construct a one-level loop with loop bound C_Factor
                    to perform the workload of the compacted threads;
                convert each expr in exprs from inter-thread memory usage into register (array) usage;
            }

            /** Auto-tuning Stage **/
            generate a new kernel nkernel;
            exe_time1 = eval(nkernel);                 // execution time of nkernel
            if (exe_time1 < exe_time) {                // the new kernel is better
                best_kernel = nkernel;
                exe_time = exe_time1;
            } else
                return best_kernel;                    // found the best kernel
        } // end for
        return best_kernel;
    }

27
- Our compiler algorithm focuses on code that has already been reasonably optimized:
  - Either manually or automatically by some compiler tools;
  - Classical loop optimizations such as tiling have already been applied;
  - Important data is already allocated in shared memory, either for communication among threads or for data reuse.
- Our thread compaction can also be referred to as thread merging/coarsening. Compared to generic thread merge/coarsen/fusion, our approach specifically uses this technique for register tiling, i.e., it exploits register reuse to eliminate the redundant shared/global memory usage that exists in GPU programs. We further address how many threads should be compacted to maximize register tiling while limiting the register-pressure impact on TLP, so as to determine the most profitable data placement.

