CUDA Advanced Memory Usage and Optimization
Yukai Hung, Department of Mathematics, National Taiwan University

Register as Cache?

Volatile Qualifier

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x];       //first read
        array[threadIdx.x+1]=2;
        temp2=array[threadIdx.x];       //identical read: the compiler optimizes this read away
        result[threadIdx.x]=temp1*temp2;
    }
}

Volatile Qualifier

What the compiler effectively generates: the two identical reads are collapsed into one.

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        int temp=array[threadIdx.x];
        temp1=temp;
        array[threadIdx.x+1]=2;
        temp2=temp;                     //reuses the first read; the store above is not observed
        result[threadIdx.x]=temp1*temp2;
    }
}

Volatile Qualifier

One workaround: a barrier between the two reads forces the compiler to reload from memory.

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;
        __syncthreads();
        temp2=array[threadIdx.x]*2;
        result[threadIdx.x]=temp1*temp2;
    }
}

Volatile Qualifier

Declaring the variables volatile tells the compiler that every reference must be compiled to an actual memory access.

__global__ void kernelFunc(int* result)
{
    volatile int temp1;
    volatile int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;
        temp2=array[threadIdx.x]*2;
        result[threadIdx.x]=temp1*temp2;
    }
}
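For reference, a minimal self-contained version of this example; the slides never declare array, so the volatile shared buffer, its size, and the initialization here are assumptions:

//a minimal complete sketch (assumes a 64-thread block); the slides
//leave array undeclared, so this shared buffer is an assumption
__global__ void kernelFunc(int* result)
{
    //declaring the buffer itself volatile is the usual idiom: every
    //access then compiles to a real shared-memory instruction
    volatile __shared__ int array[64];

    array[threadIdx.x]=threadIdx.x;     //give each slot an initial value
    __syncthreads();

    if(threadIdx.x<warpSize)
    {
        int temp1=array[threadIdx.x];
        array[threadIdx.x+1]=2;
        int temp2=array[threadIdx.x];   //truly re-read, so the store above can be observed
        result[threadIdx.x]=temp1*temp2;
    }
}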

Data Prefetch

Data Prefetch

Hide memory latency by overlapping loading and computing
 - double buffering is a traditional software pipeline technique

[figure: tiled matrix multiplication Md x Nd = Pd (sub-block Pdsub); load the blue block into shared memory, then compute on the blue block in shared memory while loading the next block into shared memory]

Data Prefetch

Hide memory latency by overlapping loading and computing
 - double buffering is a traditional software pipeline technique

Without prefetching:

for loop
{
    load data from global to shared memory
    synchronize block
    compute data in the shared memory
    synchronize block
}

Data Prefetch

Hide memory latency by overlapping loading and computing
 - double buffering is a traditional software pipeline technique

With prefetching:

load data from global memory to registers
for loop
{
    store data from registers to shared memory
    synchronize block
    load next data from global memory to registers
    compute data in the shared memory
    synchronize block
}

 - very small overhead: both memories are very fast
 - computing and loading overlap: registers and shared memory are independent

Data Prefetch

[figure: data prefetch applied to tiled matrix-matrix multiplication]
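A concrete sketch of this scheme for tiled matrix-matrix multiplication; the tile width, kernel name, and row-major square-matrix layout are assumptions (the slides show only the picture), and the matrix width is assumed to be a multiple of the tile width:

//a minimal sketch of register double buffering in tiled matrix multiplication
#define TILE 16

__global__ void matmulPrefetch(float* Md,float* Nd,float* Pd,int width)
{
    __shared__ float Ms[TILE][TILE];
    __shared__ float Ns[TILE][TILE];

    int row=blockIdx.y*TILE+threadIdx.y;
    int col=blockIdx.x*TILE+threadIdx.x;

    //prefetch the first tiles of Md and Nd into registers
    float mreg=Md[row*width+threadIdx.x];
    float nreg=Nd[threadIdx.y*width+col];

    float sum=0.0f;

    for(int m=0;m<width/TILE;m++)
    {
        //store the prefetched values from registers into shared memory
        Ms[threadIdx.y][threadIdx.x]=mreg;
        Ns[threadIdx.y][threadIdx.x]=nreg;
        __syncthreads();

        //prefetch the next tiles while the current ones are consumed
        if(m+1<width/TILE)
        {
            mreg=Md[row*width+(m+1)*TILE+threadIdx.x];
            nreg=Nd[((m+1)*TILE+threadIdx.y)*width+col];
        }

        //compute on the tiles already resident in shared memory
        for(int k=0;k<TILE;k++)
            sum+=Ms[threadIdx.y][k]*Ns[k][threadIdx.x];
        __syncthreads();
    }

    Pd[row*width+col]=sum;
}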

Constant Memory

Constant Memory

Where is constant memory?
 - data is stored in the device global memory
 - reads go through the per-multiprocessor constant cache
 - 64KB of constant memory, with an 8KB cache on each multiprocessor

How about the performance?
 - optimized when a warp of threads reads the same location
 - 4 bytes per cycle, broadcast to the whole warp
 - serialized when threads in a warp read different locations
 - very slow on a cache miss (data is read from global memory)
 - access latency ranges from one to hundreds of clock cycles
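To make the broadcast rule concrete, a small illustration; the kernel is my own, using the cangle constant array declared in the example below, and assumes at most 360 threads per block:

//illustration only: contrasting constant-memory access patterns
__global__ void accessPatterns(float* out)
{
    //broadcast: every thread in the warp reads the same address,
    //served at full speed through the constant cache
    float fast=cangle[0];

    //serialized: threads in a warp read different addresses, so the
    //accesses are issued one location at a time
    float slow=cangle[threadIdx.x];

    out[threadIdx.x]=fast+slow;
}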

Constant Memory

How to use constant memory?
 - declare constant memory at file scope (global variable)
 - copy data to constant memory from the host (because it is constant!!)

//declare constant memory
__constant__ float cst_ptr[size];

//copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr,host_ptr,data_size);

Constant Memory

//declare constant memory
__constant__ float cangle[360];

int main(int argc,char** argv)
{
    int size=3200;
    float* darray;
    float hangle[360];

    //allocate device memory
    cudaMalloc((void**)&darray,sizeof(float)*size);

    //initialize allocated memory
    cudaMemset(darray,0,sizeof(float)*size);

    //initialize angle array on host
    for(int loop=0;loop<360;loop++)
        hangle[loop]=acos(-1.0f)*loop/180.0f;

    //copy host angle data to constant memory
    cudaMemcpyToSymbol(cangle,hangle,sizeof(float)*360);

Constant Memory

    //execute device kernel (64 threads per block; launch configuration assumed)
    test_kernel<<<size/64,64>>>(darray);

    //free device memory
    cudaFree(darray);
    return 0;
}

__global__ void test_kernel(float* darray)
{
    int index;

    //calculate each thread global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    #pragma unroll 10
    for(int loop=0;loop<360;loop++)
        darray[index]=darray[index]+cangle[loop];
    return;
}

Texture Memory

Texture Memory

[figures: texture mapping examples]

Texture Memory

Texture filtering
[figure: nearest-neighbor interpolation]

Texture Memory

Texture filtering
[figure: linear/bilinear/trilinear interpolation]

Texture Memory

Texture filtering
[figure: two times bilinear interpolation]

Texture Memory

[figure: G80 graphics pipeline block diagram; the TF (texture filtering) units next to each SP/L1 pair perform graphical texture operations]

Texture Memory

Two SMs cooperate as one texture processing cluster (TPC), a scalable unit on the graphics side. The texture-specific unit is available only for texture operations.

Texture Memory

The texture-specific unit contains:
 - texture address units: compute texture addresses
 - texture filtering units: compute data interpolation
 - a read-only texture L1 cache

Texture Memory

[figure: G80 block diagram; a read-only texture L2 cache is shared by all TPCs, and each TPC has its own read-only texture L1 cache]

Texture Memory

[figure: texture-specific units]

Texture Memory

Texture is an object for reading data
 - data is stored in the device global memory
 - global memory is bound with the texture cache

What are the advantages of texture?

Texture Memory

Data caching
 - helpful when global memory coalescing is the main bottleneck

Texture Memory

Data filtering
 - supports linear/bilinear and trilinear hardware interpolation
 - intrinsic interpolation by the texture-specific unit
 - filter modes: cudaFilterModePoint or cudaFilterModeLinear
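A short sketch of what hardware filtering gives you at fetch time; the kernel is my illustration, assuming a 2D float texture bound with filterMode=cudaFilterModeLinear (the 2D example later in this deck uses cudaFilterModePoint):

//illustration only: fetching with hardware linear interpolation
__global__ void filteredFetch(float* out,int width)
{
    int x=blockIdx.x*blockDim.x+threadIdx.x;
    int y=blockIdx.y*blockDim.y+threadIdx.y;

    //with unnormalized coordinates, texel centers sit at +0.5f;
    //any non-integer position is interpolated by the texture unit
    float center =tex2D(texreference,x+0.5f,y+0.5f);
    float between=tex2D(texreference,x+1.0f,y+0.5f); //halfway between two texels

    out[y*width+x]=0.5f*(center+between);
}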

Texture Memory

Addressing modes
 - clamp or wrap memory accesses for out-of-bound addresses
 - address modes: cudaAddressModeClamp or cudaAddressModeWrap

[figure: clamp boundary vs wrap boundary]

Texture Memory

Bound to linear memory
 - only supports 1-dimension problems
 - only gets the benefit of the texture cache
 - does not support addressing modes and filtering

Bound to cuda array
 - supports float addressing
 - supports addressing modes
 - supports hardware interpolation
 - supports 1/2/3-dimension problems

Texture Memory

Host code
 - allocate global linear memory or a cuda array
 - create and set the texture reference at file scope
 - bind the texture reference to the allocated memory
 - unbind the texture reference to free the cache resource

Device code
 - fetch data by indicating the texture reference
 - fetch data by using the texture fetch function

Texture Memory

Texture memory constraints:

                            Compute capability 1.3    Compute capability 2.0
1D texture linear memory    2^27 elements             2^27 elements
1D texture cuda array       8192                      32768
2D texture cuda array       (65536, 32768)            (65536, 65536)
3D texture cuda array       (2048, 2048, 2048)        (4096, 4096, 4096)

Texture Memory

Measuring texture cache misses or hits
 - the latest visual profiler can count cache misses and hits
 - requires device compute capability higher than 1.2

Example: 1-dimension linear memory

Texture Memory

//declare texture reference
texture<float,1,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    int size=3200;
    float* harray;
    float* diarray;
    float* doarray;

    //allocate host and device memory
    harray=(float*)malloc(sizeof(float)*size);
    cudaMalloc((void**)&diarray,sizeof(float)*size);
    cudaMalloc((void**)&doarray,sizeof(float)*size);

    //initialize host array before usage
    for(int loop=0;loop<size;loop++)
        harray[loop]=(float)rand()/(float)(RAND_MAX-1);

    //copy array from host to device memory
    cudaMemcpy(diarray,harray,sizeof(float)*size,cudaMemcpyHostToDevice);

Texture Memory

    //bind texture reference with linear memory
    cudaBindTexture(0,texreference,diarray,sizeof(float)*size);

    //execute device kernel (64 threads per block; launch configuration assumed)
    kernel<<<size/64,64>>>(doarray,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result array from device to host memory
    cudaMemcpy(harray,doarray,sizeof(float)*size,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(harray);
    cudaFree(diarray);
    cudaFree(doarray);

    return 0;
}

Texture Memory

__global__ void kernel(float* doarray,int size)
{
    int index;

    //calculate each thread global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    //fetch global memory through texture reference
    doarray[index]=tex1Dfetch(texreference,index);
    return;
}

Texture Memory

Copying with an offset through plain global memory loads:

__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //copy data from global memory
    odata[index]=idata[index+offset];
}

Texture Memory

The same copy fetched through the texture cache, which softens the penalty of the misaligned (offset) accesses:

__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //copy data fetched through the texture reference
    odata[index]=tex1Dfetch(texreference,index+offset);
}

Example: 2-dimension cuda array

Texture Memory

#define size 3200

//declare texture reference
texture<float,2,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    int bytes;
    dim3 blocknum;
    dim3 blocksize;
    float* hmatrix;
    float* dmatrix;
    cudaArray* carray;
    cudaChannelFormatDesc channel;

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size);

    //initialize host matrix before usage
    for(int loop=0;loop<size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

Texture Memory

    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMallocArray(&carray,&channel,size,size);

    //copy matrix from host to device memory
    bytes=sizeof(float)*size*size;
    cudaMemcpyToArray(carray,0,0,hmatrix,bytes,cudaMemcpyHostToDevice);

    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeClamp;

Texture Memory

    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,carray);

    blocksize.x=16;
    blocksize.y=16;
    blocknum.x=(int)ceil((float)size/16);
    blocknum.y=(int)ceil((float)size/16);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(carray);

    return 0;
}

Texture Memory

__global__ void kernel(float* dmatrix,int size)
{
    int xindex;
    int yindex;

    //calculate each thread global index
    xindex=blockIdx.x*blockDim.x+threadIdx.x;
    yindex=blockIdx.y*blockDim.y+threadIdx.y;

    //fetch cuda array through texture reference
    dmatrix[yindex*size+xindex]=tex2D(texreference,xindex,yindex);
    return;
}

Example: 3-dimension cuda array

Texture Memory

#define size 256

//declare texture reference
texture<float,3,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    dim3 blocknum;
    dim3 blocksize;
    float* hmatrix;
    float* dmatrix;
    cudaArray* cudaarray;
    cudaExtent volumesize;
    cudaChannelFormatDesc channel;
    cudaMemcpy3DParms copyparms={0};

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size*size);

Texture Memory

    //initialize host matrix before usage
    for(int loop=0;loop<size*size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

    //set cuda array volume size
    volumesize=make_cudaExtent(size,size,size);

    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMalloc3DArray(&cudaarray,&channel,volumesize);

    //set cuda array copy parameters
    copyparms.extent=volumesize;
    copyparms.dstArray=cudaarray;
    copyparms.kind=cudaMemcpyHostToDevice;
    copyparms.srcPtr=
        make_cudaPitchedPtr((void*)hmatrix,sizeof(float)*size,size,size);
    cudaMemcpy3D(&copyparms);

Texture Memory

    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeWrap;
    texreference.addressMode[2]=cudaAddressModeClamp;

    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,cudaarray,channel);

    //the kernel walks the z dimension in a loop, so the grid is 2D
    blocksize.x=8;
    blocksize.y=8;
    blocknum.x=(int)ceil((float)size/8);
    blocknum.y=(int)ceil((float)size/8);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);

Texture Memory

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    cudaMemcpy(hmatrix,dmatrix,sizeof(float)*size*size*size,
               cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(cudaarray);

    return 0;
}

Texture Memory

__global__ void kernel(float* dmatrix,int size)
{
    int loop;
    int xindex;
    int yindex;
    int zindex;

    //calculate each thread global index
    xindex=threadIdx.x+blockIdx.x*blockDim.x;
    yindex=threadIdx.y+blockIdx.y*blockDim.y;

    for(loop=0;loop<size;loop++)
    {
        zindex=loop;

        //fetch cuda array via texture reference
        dmatrix[zindex*size*size+yindex*size+xindex]=
            tex3D(texreference,xindex,yindex,zindex);
    }
    return;
}

Performance comparison: image projection

Texture Memory

[figure: image projection / ray casting]

Texture Memory

 - trilinear interpolation over the 8 nearest voxels
 - the intrinsic interpolation units are very powerful
 - the global memory access pattern is very close to random

Texture Memory

Method                               Time    Speedup
global
global/locality
texture/point
texture/linear
texture/linear/locality
texture/linear/locality/fast math

object size 512 x 512 x 512 / ray number 512 x 512

Why is texture memory so powerful?

Texture Memory

The cuda array is reordered to something like a space-filling Z-order
 - the software driver supports reordering the data
 - the hardware supports the spatial memory layout

Why is the texture cache read-only?

Texture Memory

The texture cache cannot detect dirty data:

[figure: a float array in host memory and its cached copy]
 - load data from memory to the cache
 - perform some operations on the cached copy
 - the write-back to memory is lazy
 - reloading from memory into the cache can miss data modified by other threads

Texture Memory

Write data to global memory directly, without going through the texture cache
 - only suitable for global linear memory, not a cuda array

 - read data through the texture cache: tex1Dfetch(texreference,index)
 - write data to global memory directly: darray[index]=value;
 - the texture cache may not be updated by such writes
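A small sketch of this pattern; the kernel and names are my illustration, assuming texreference is bound to darray as in the 1D linear-memory example above:

//read through the texture cache, write to global memory directly
__global__ void scaleKernel(float* darray,int size)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    if(index<size)
    {
        //read is served by the (read-only) texture cache
        float value=tex1Dfetch(texreference,index);

        //write bypasses the texture cache; the cache is not invalidated,
        //so re-reading this element in the same kernel launch may return
        //stale data
        darray[index]=2.0f*value;
    }
}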

How about the texture data locality?

Texture Memory

All blocks are scheduled round-robin, based on the number of shaders.

Why does CUDA distribute the work blocks in the horizontal direction?

Texture Memory

 - load balancing across all SMs: consecutive blocks are assumed to have very similar work loads
 - texture cache data locality: consecutive blocks are assumed to use similar nearby data

Texture Memory

Reorder the block index to fit a z-order curve and take advantage of the texture L1 cache.
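A hedged sketch of such a reordering; the Morton-order bit deinterleave and the kernel are my illustration (assuming a square grid with power-of-two dimensions and a 2D texture bound as in the earlier example), not code from the slides:

//remap the scheduled (linear) block index into a z-order block position
__global__ void zorderKernel(float* dmatrix,int size)
{
    //flatten the block index as the scheduler issues it
    unsigned int id=blockIdx.y*gridDim.x+blockIdx.x;

    //deinterleave the bits of id: even bits give the x block
    //coordinate, odd bits the y block coordinate (Morton order)
    unsigned int bx=0;
    unsigned int by=0;
    for(int bit=0;bit<16;bit++)
    {
        bx|=((id>>(2*bit))&1u)<<bit;
        by|=((id>>(2*bit+1))&1u)<<bit;
    }

    int xindex=bx*blockDim.x+threadIdx.x;
    int yindex=by*blockDim.y+threadIdx.y;

    //consecutive scheduled blocks now touch nearby 2D regions, which
    //keeps their texture fetches within the same cache lines
    if(xindex<size&&yindex<size)
        dmatrix[yindex*size+xindex]=tex2D(texreference,xindex,yindex);
}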

Texture Memory

 - streaming processors:    temp1=a/b+sin(c)
 - special function units:  temp2[loop]=__cosf(d)
 - texture operation units: temp3=tex2D(ref,x,y)

Independent units can execute concurrently.

Texture Memory

Memory        Location    Cache    Speed             Access
global        off-chip    no       hundreds          all threads
constant      off-chip    yes      one ~ hundreds    all threads
texture       off-chip    yes      one ~ hundreds    all threads
shared        on-chip     -        one               block threads
local         off-chip    no       very slow         single thread
register      on-chip     -        one               single thread
instruction   off-chip    yes      -                 invisible

Texture Memory

Memory      Read/Write    Property
global      read/write    input or output
constant    read          no structure
texture     read          locality structure
shared      read/write    shared within block
local       read/write    -
register    read/write    local temp variable

Reference
 - Mark Harris
 - Wei-Chao Chen
 - Wen-Mei Hwu