
1 CUDA Advanced Memory Usage and Optimization
Yukai Hung a0934147@gmail.com
Department of Mathematics
National Taiwan University

2 Register as Cache?

3 Volatile Qualifier

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x];            //first read
        array[threadIdx.x+1]=2;
        temp2=array[threadIdx.x];            //identical second read
        result[threadIdx.x]=temp1*temp2;
    }
}

The two reads of array[threadIdx.x] are identical, so the compiler optimizes the second read away.

4 Volatile Qualifier

What the compiler effectively generates: one read, reused for both temporaries, which misses the write made by the neighboring thread.

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        int temp=array[threadIdx.x];         //single read is reused
        temp1=temp;
        array[threadIdx.x+1]=2;
        temp2=temp;                          //stale value
        result[threadIdx.x]=temp1*temp2;
    }
}

5 Volatile Qualifier

One workaround: make the two reads non-identical and separate them with a barrier.

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;
        __syncthreads();
        temp2=array[threadIdx.x]*2;
        result[threadIdx.x]=temp1*temp2;
    }
}

6 Volatile Qualifier

Declaring the temporaries volatile forces the compiler to perform both reads.

__global__ void kernelFunc(int* result)
{
    volatile int temp1;
    volatile int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;
        temp2=array[threadIdx.x]*2;
        result[threadIdx.x]=temp1*temp2;
    }
}
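To make the example self-contained, here is a minimal sketch assuming array is a shared-memory buffer (its declaration never appears on the slides); putting volatile on the buffer itself is the usual way to guarantee that no access is cached in a register:

__global__ void kernelFunc(int* result)
{
    //assumption: "array" lives in shared memory; volatile on the buffer
    //forces every read and write to actually reach shared memory
    __shared__ volatile int array[33];       //warpSize+1 slots

    if(threadIdx.x<warpSize)
    {
        array[threadIdx.x]=1;                //each lane fills its own slot
        int temp1=array[threadIdx.x]*1;      //first read
        array[threadIdx.x+1]=2;              //lane i overwrites slot i+1
        int temp2=array[threadIdx.x]*2;      //re-read: lanes 1..31 see the neighbor's 2
        result[threadIdx.x]=temp1*temp2;
    }
}

Note that, like the slides, this relies on the implicitly warp-synchronous execution of the hardware of the deck's era; on current GPUs a __syncwarp() between the write and the re-read is required.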

7 Data Prefetch

8 Data Prefetch

Hide memory latency by overlapping loading and computing
- double buffering is a traditional software pipeline technique

[Figure: tiled matrix multiplication with matrices Md, Nd, Pd and sub-block Pdsub - load the blue block into shared memory, then compute on the blue block in shared memory while loading the next block into shared memory]

9 Data Prefetch

Hide memory latency by overlapping loading and computing
- double buffering is a traditional software pipeline technique

Without prefetching:

for loop
{
    load data from global to shared memory
    synchronize block
    compute data in the shared memory
    synchronize block
}

10 Data Prefetch

Hide memory latency by overlapping loading and computing
- double buffering is a traditional software pipeline technique

With prefetching:

load data from global memory to registers
for loop
{
    store data from registers to shared memory
    synchronize block
    load next data from global memory to registers
    compute data in the shared memory
    synchronize block
}

very small overhead: both memories are very fast
computing and loading overlap: registers and shared memory are independent
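A minimal CUDA sketch of the prefetched loop, for a 1D input processed in TILE-sized chunks; the tile size, the kernel name, the idata/odata/ntiles parameters, and the dummy reduction step are illustrative, not from the slides:

#define TILE 128

//each block walks over "ntiles" consecutive tiles of idata;
//launch with TILE threads per block
__global__ void prefetchKernel(float* idata,float* odata,int ntiles)
{
    __shared__ float tile[TILE];

    //prefetch the first tile from global memory into a register
    float reg=idata[blockIdx.x*ntiles*TILE+threadIdx.x];

    for(int loop=0;loop<ntiles;loop++)
    {
        tile[threadIdx.x]=reg;               //store prefetched data to shared memory
        __syncthreads();

        //prefetch the next tile while the current one is being processed
        if(loop+1<ntiles)
            reg=idata[(blockIdx.x*ntiles+loop+1)*TILE+threadIdx.x];

        //placeholder computation on the tile currently in shared memory
        float sum=0.0f;
        for(int k=0;k<TILE;k++)
            sum+=tile[k];
        odata[(blockIdx.x*ntiles+loop)*TILE+threadIdx.x]=sum;

        __syncthreads();                     //do not overwrite the tile too early
    }
}

The global load into reg is issued long before the value is needed, so the memory latency overlaps with the reduction over the tile already sitting in shared memory.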

11 Data Prefetch

Matrix-matrix multiplication

12 Constant Memory

13 Constant Memory

Where is constant memory?
- data is stored in the device global memory
- data is read through the per-multiprocessor constant cache
- 64KB constant memory and 8KB cache for each multiprocessor

How about the performance?
- optimized when a warp of threads reads the same location
- 4 bytes per cycle through broadcasting to a warp of threads
- serialized when a warp of threads reads different locations
- very slow on a cache miss (data is read from global memory)
- access latency can range from one to hundreds of clock cycles
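The broadcast behavior is worth a small illustration before the full example that follows; in this hedged sketch (ctable, broadcastRead, and serializedRead are illustrative names, not from the slides), the first kernel reads one location per warp and is serviced by a single broadcast, while the second makes each thread of a warp read a different location, so its accesses serialize:

__constant__ float ctable[360];              //hypothetical constant table

//fast: all threads of a warp read the same address -> one broadcast
__global__ void broadcastRead(float* out)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;
    out[index]=ctable[0];
}

//slow: threads of a warp read 32 different addresses -> serialized accesses
__global__ void serializedRead(float* out)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;
    out[index]=ctable[threadIdx.x%360];
}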

14 Constant Memory

How to use constant memory?
- declare constant memory at file scope (global variable)
- copy data to constant memory from the host (because it is constant!!)

//declare constant memory
__constant__ float cst_ptr[size];

//copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr,host_ptr,data_size);

15 Constant Memory

//declare constant memory
__constant__ float cangle[360];

int main(int argc,char** argv)
{
    int size=3200;
    float* darray;
    float hangle[360];

    //allocate device memory
    cudaMalloc((void**)&darray,sizeof(float)*size);

    //initialize allocated memory
    cudaMemset(darray,0,sizeof(float)*size);

    //initialize angle array on host
    for(int loop=0;loop<360;loop++)
        hangle[loop]=acos(-1.0f)*loop/180.0f;

    //copy host angle data to constant memory
    cudaMemcpyToSymbol(cangle,hangle,sizeof(float)*360);

16 Constant Memory

    //execute device kernel (assumed launch configuration:
    //size/64 blocks of 64 threads; any divisor of size works)
    test_kernel<<<size/64,64>>>(darray);

    //free device memory
    cudaFree(darray);

    return 0;
}

__global__ void test_kernel(float* darray)
{
    int index;

    //calculate each thread global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    #pragma unroll 10
    for(int loop=0;loop<360;loop++)
        darray[index]=darray[index]+cangle[loop];

    return;
}

17 Texture Memory

18 Texture Memory

[Figure: texture mapping]

19 Texture Memory

[Figure: texture mapping]

20 Texture Memory

Texture filtering: nearest-neighbor interpolation

21 Texture Memory

Texture filtering: linear/bilinear/trilinear interpolation
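For reference (this formula comes from the CUDA programming guide's texture-fetching appendix, not from these slides), 1D linear filtering returns a weighted average of the two nearest texels:

    tex(x) = (1 - α)·T[i] + α·T[i+1],   where i = floor(xB), α = frac(xB), xB = x - 0.5

with α stored in a 9-bit fixed-point format (8 fractional bits); bilinear and trilinear filtering apply the same per-dimension weights over 4 and 8 texels respectively.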

22 Texture Memory

Texture filtering: bilinear interpolation applied twice

23 Texture Memory

[Figure: GPU block diagram - Host, Input Assembler, Vtx/Pixel Thread Issue, Setup/Rstr/ZCull, Work Distribution, SP/L1/TF clusters, and L2/FB partitions; the TF (texture filtering) units perform graphical texture operations]

24 Texture Memory

[Figure: two SMs cooperate as a texture processing cluster (TPC); TPCs are the scalable units on the graphics side, and the texture-specific unit is only available for texture]

25 Texture Memory

Texture-specific unit
- texture address units compute texture addresses
- texture filtering units compute data interpolation
- read-only texture L1 cache

26 Texture Memory

[Figure: the same block diagram - a read-only texture L2 cache serves all TPCs, and each TPC has its own read-only texture L1 cache]

27 Texture Memory

[Figure: texture-specific units]

28 Texture Memory

Texture is an object for reading data
- data is stored in the device global memory
- global memory is bound with the texture cache

[Figure: block diagram with global memory]

29 What are the advantages of texture?

30 Texture Memory

Data caching
- helpful when global memory coalescing is the main bottleneck

[Figure: GPU block diagram]

31 Texture Memory

Data filtering
- supports linear, bilinear, and trilinear hardware interpolation
- intrinsic interpolation in the texture-specific unit
- filter modes: cudaFilterModePoint or cudaFilterModeLinear

32 Texture Memory

Address modes
- clamp and wrap memory accessing for out-of-bound addresses
- address modes: cudaAddressModeClamp or cudaAddressModeWrap

[Figure: clamp boundary vs wrap boundary]

33 Texture Memory

Bound to linear memory
- only supports 1-dimension problems
- only gets the benefit of the texture cache
- does not support addressing modes or filtering

Bound to cuda array
- supports float addressing
- supports addressing modes
- supports hardware interpolation
- supports 1/2/3-dimension problems

34 Texture Memory

Host code
- allocate global linear memory or cuda array
- create and set the texture reference at file scope
- bind the texture reference to the allocated memory
- unbind the texture reference to free cache resources

Device code
- fetch data by indicating the texture reference
- fetch data by using the texture fetch function

35 Texture Memory

Texture memory constraints:

                          Compute capability 1.3   Compute capability 2.0
1D texture linear memory  8192                     32768
1D texture cuda array     1024x128
2D texture cuda array     (65536,32768)            (65536,65536)
3D texture cuda array     (2048,2048,2048)         (4096,4096,4096)

36 Texture Memory

Measuring texture cache misses or hits
- the latest visual profiler can count cache misses and hits
- requires device compute capability 1.2 or higher

37 Example: 1-dimension linear memory

38 Texture Memory

//declare texture reference
texture<float,1,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    int size=3200;
    float* harray;
    float* diarray;
    float* doarray;

    //allocate host and device memory
    harray=(float*)malloc(sizeof(float)*size);
    cudaMalloc((void**)&diarray,sizeof(float)*size);
    cudaMalloc((void**)&doarray,sizeof(float)*size);

    //initialize host array before usage
    for(int loop=0;loop<size;loop++)
        harray[loop]=(float)rand()/(float)(RAND_MAX-1);

    //copy array from host to device memory
    cudaMemcpy(diarray,harray,sizeof(float)*size,cudaMemcpyHostToDevice);

39 Texture Memory

    //bind texture reference with linear memory
    cudaBindTexture(0,texreference,diarray,sizeof(float)*size);

    //execute device kernel (assumed launch configuration:
    //size/64 blocks of 64 threads)
    kernel<<<size/64,64>>>(doarray,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result array from device to host memory
    cudaMemcpy(harray,doarray,sizeof(float)*size,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(harray);
    cudaFree(diarray);
    cudaFree(doarray);

    return 0;
}

40 Texture Memory

__global__ void kernel(float* doarray,int size)
{
    int index;

    //calculate each thread global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    //fetch global memory through texture reference
    doarray[index]=tex1Dfetch(texreference,index);

    return;
}

41 Texture Memory

Offset copy through plain global memory loads (misaligned accesses lose coalescing):

__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //copy data from global memory
    odata[index]=idata[index+offset];
}

42 Texture Memory

The same copy fetched through the texture cache, which softens the misalignment penalty:

__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //copy data through the texture cache
    odata[index]=tex1Dfetch(texreference,index+offset);
}

43 Example: 2-dimension cuda array

44 Texture Memory

#define size 3200

//declare texture reference
texture<float,2,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    dim3 blocknum;
    dim3 blocksize;
    float* hmatrix;
    float* dmatrix;
    cudaArray* carray;
    cudaChannelFormatDesc channel;

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size);

    //initialize host matrix before usage
    for(int loop=0;loop<size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

45 Texture Memory

    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMallocArray(&carray,&channel,size,size);

    //copy matrix from host to device memory
    int bytes=sizeof(float)*size*size;
    cudaMemcpyToArray(carray,0,0,hmatrix,bytes,cudaMemcpyHostToDevice);

    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeClamp;

46 Texture Memory

    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,carray);

    blocksize.x=16;
    blocksize.y=16;
    blocknum.x=(int)ceil((float)size/16);
    blocknum.y=(int)ceil((float)size/16);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(carray);

    return 0;
}

47 Texture Memory

__global__ void kernel(float* dmatrix,int size)
{
    int xindex;
    int yindex;

    //calculate each thread global index
    xindex=blockIdx.x*blockDim.x+threadIdx.x;
    yindex=blockIdx.y*blockDim.y+threadIdx.y;

    //fetch cuda array through texture reference
    dmatrix[yindex*size+xindex]=tex2D(texreference,xindex,yindex);

    return;
}

48 Example: 3-dimension cuda array

49 Texture Memory

#define size 256

//declare texture reference
texture<float,3,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    dim3 blocknum;
    dim3 blocksize;
    float* hmatrix;
    float* dmatrix;
    cudaArray* cudaarray;
    cudaExtent volumesize;
    cudaChannelFormatDesc channel;
    cudaMemcpy3DParms copyparms={0};

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size*size);

50 Texture Memory

    //initialize host matrix before usage
    for(int loop=0;loop<size*size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

    //set cuda array volume size
    volumesize=make_cudaExtent(size,size,size);

    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMalloc3DArray(&cudaarray,&channel,volumesize);

    //set cuda array copy parameters
    copyparms.extent=volumesize;
    copyparms.dstArray=cudaarray;
    copyparms.kind=cudaMemcpyHostToDevice;
    copyparms.srcPtr=
        make_cudaPitchedPtr((void*)hmatrix,sizeof(float)*size,size,size);
    cudaMemcpy3D(&copyparms);

51 Texture Memory

    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeWrap;
    texreference.addressMode[2]=cudaAddressModeClamp;

    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,cudaarray,channel);

    blocksize.x=8;
    blocksize.y=8;
    blocksize.z=8;
    blocknum.x=(int)ceil((float)size/8);
    blocknum.y=(int)ceil((float)size/8);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);

52 Texture Memory

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    cudaMemcpy(hmatrix,dmatrix,sizeof(float)*size*size*size,
               cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(cudaarray);

    return 0;
}

53 Texture Memory

__global__ void kernel(float* dmatrix,int size)
{
    int loop;
    int xindex;
    int yindex;
    int zindex;

    //calculate each thread global index
    xindex=threadIdx.x+blockIdx.x*blockDim.x;
    yindex=threadIdx.y+blockIdx.y*blockDim.y;

    //each z-thread walks the volume with stride blockDim.z,
    //so the 8x8x8 blocks cover all slices without redundant work
    for(loop=threadIdx.z;loop<size;loop+=blockDim.z)
    {
        zindex=loop;

        //fetch cuda array via texture reference
        dmatrix[zindex*size*size+yindex*size+xindex]=
            tex3D(texreference,xindex,yindex,zindex);
    }

    return;
}

54 Performance comparison: image projection

55 Texture Memory

[Figure: image projection or ray casting]

56 Texture Memory

Trilinear interpolation on the 8 nearby voxels
- the intrinsic interpolation units are very powerful
- global memory accessing is very close to random
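A hedged sketch of how a ray-casting kernel can lean on this hardware: with cudaFilterModeLinear set on a 3D texture, a single tex3D fetch at a fractional coordinate returns the trilinearly interpolated value of the 8 surrounding voxels (volumetex, sampleRay, and the ray parameters below are illustrative, not from the slides):

//assumed bound to a 3D cuda array with filterMode=cudaFilterModeLinear
texture<float,3,cudaReadModeElementType> volumetex;

__global__ void sampleRay(float* result,float3 origin,float3 step,int nsample)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;
    if(index<nsample)
    {
        //fractional sample position along an illustrative ray
        float x=origin.x+index*step.x;
        float y=origin.y+index*step.y;
        float z=origin.z+index*step.z;

        //one fetch: the texture unit interpolates the 8 surrounding
        //voxels in hardware (the 0.5f offset addresses voxel centers)
        result[index]=tex3D(volumetex,x+0.5f,y+0.5f,z+0.5f);
    }
}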

57 Texture Memory

Method                               Time     Speedup
global                               1.891    -
global/locality                      0.198    9.5
texture/point                        0.072    26.2
texture/linear                       0.037    51.1
texture/linear/locality              0.012    157.5
texture/linear/locality/fast math    0.011    171.9

object size 512 x 512 x 512 / ray number 512 x 512

58 Why is texture memory so powerful?

59 Texture Memory

CUDA arrays are reordered to something like a space-filling Z-order curve
- the software driver supports reordering the data
- the hardware supports the spatial memory layout

60 Why is the texture cache read-only?

61 Texture Memory

The texture cache cannot detect dirty data

[Figure: a float array in host memory and its copy in the cache - load from memory to cache, perform some operations on the cache, lazy update on write-back, reload from memory to cache after the data was modified by other threads]

62 Texture Memory

Write data to global memory directly, without the texture cache
- only suitable for global linear memory, not cuda array
- read data through the texture cache: tex1Dfetch(texreference,index)
- write data to global memory directly: darray[index]=value;
- the texture cache may not be updated by such writes
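A minimal sketch of the pattern, assuming texreference is the 1D texture reference from the earlier example, bound to darray (the kernel name and factor parameter are illustrative):

//in-place scale: reads go through the texture cache, the write bypasses it
__global__ void scaleInPlace(float* darray,float factor,int size)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;
    if(index<size)
    {
        float value=tex1Dfetch(texreference,index);   //cached read
        darray[index]=value*factor;                   //direct global write; the
                                                      //texture cache is not
                                                      //invalidated by this write
    }
}

The cache becomes consistent again at the next kernel launch, so the pattern is safe as long as no fetch within the same launch depends on the freshly written values.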

63 What about texture data locality?

64 Texture Memory

All blocks get scheduled round-robin based on the number of shaders

Why does CUDA distribute the work blocks in the horizontal direction?

65 Texture Memory

- load balancing over all SMs: assumes consecutive blocks have very similar work loads
- texture cache data locality: assumes consecutive blocks use similar nearby data

66 Texture Memory

Reorder the block index to fit a Z-order curve to take advantage of the texture L1 cache
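A hedged sketch of the remapping idea (not from the slides): inside the kernel, decode the linear block id as a Morton (Z-order) code before computing pixel coordinates, so consecutively scheduled blocks touch nearby texture tiles. It assumes the 2D texreference from the earlier example and a square grid whose side is a power of two; compactBits and zorderKernel are illustrative names:

//keep the even bits of a Morton code and pack them together
__device__ unsigned int compactBits(unsigned int v)
{
    v&=0x55555555u;
    v=(v|(v>>1))&0x33333333u;
    v=(v|(v>>2))&0x0F0F0F0Fu;
    v=(v|(v>>4))&0x00FF00FFu;
    v=(v|(v>>8))&0x0000FFFFu;
    return v;
}

__global__ void zorderKernel(float* odata,int size)
{
    //treat the linear block id as a Morton code, decode to 2D tile coordinates
    unsigned int morton=blockIdx.y*gridDim.x+blockIdx.x;
    unsigned int tilex=compactBits(morton);          //even bits -> x
    unsigned int tiley=compactBits(morton>>1);       //odd bits  -> y

    int xindex=tilex*blockDim.x+threadIdx.x;
    int yindex=tiley*blockDim.y+threadIdx.y;

    //blocks that are adjacent in schedule order now fetch nearby texels
    odata[yindex*size+xindex]=tex2D(texreference,xindex,yindex);
}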

67 Texture Memory

Independent units can execute concurrently:
- streaming processors: temp1=a/b+sin(c)
- special function units: temp2[loop]=__cos(d)
- texture operation units: temp3=tex2D(ref,x,y)

68 Texture Memory

Memory       Location   Cache   Speed            Access
global       off-chip   no      hundreds         all threads
constant     off-chip   yes     one ~ hundreds   all threads
texture      off-chip   yes     one ~ hundreds   all threads
shared       on-chip    -       one              block threads
local        off-chip   no      very slow        single thread
register     on-chip    -       one              single thread
instruction  off-chip   yes     -                invisible

69 Texture Memory

Memory     Read/Write   Property
global     read/write   input or output
constant   read         no structure
texture    read         locality structure
shared     read/write   shared within block
local      read/write   -
register   read/write   local temp variable

70 Reference

- Mark Harris: http://www.markmark.net/
- Wei-Chao Chen: http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu: http://impact.crhc.illinois.edu/people/current/hwu.php

