
Slide 1: SC12 -- The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, Utah. Workshop 119: An Educator's Toolbox for CUDA. Karen L. Karavanic, Portland State University; David Bunde, Knox College; Jens Mache, Lewis and Clark College; Barry Wilkinson, University of North Carolina Charlotte. Wednesday November 14, 2012. Session 2 (Part II): Further Features and Performance of CUDA Programs, 1:30 pm – 3:00 pm. SC12 Workshop 119, Session2.ppt. Modification date: Nov 11, 2012. B. Wilkinson

Slide 2: Session 2. Review of grids, blocks, threads, and built-in CUDA variables (from Session 1); vector/matrix addition and multiplication; measuring performance; improving performance using memory coalescing and shared memory.

Slide 3: Review -- Grids, Blocks, and Threads. Threads are grouped into "blocks". Blocks can be 1, 2, or 3 dimensional. Each kernel call uses a "grid" of blocks. Grids can be 1, 2, or 3* dimensional. The programmer specifies the grid/block organization on each kernel call (it can be different each time), within limits set by the GPU. Limits*: maximum number of threads per block: 1024; maximum x- and y-dimension of a thread block: 1024; maximum z-dimension of a block: 64; maximum size of each dimension of a grid of thread blocks: 65535. (* For compute capability 2.x+ devices. Our C2050s are compute capability 2.0. As of mid 2012, compute capabilities go up to 3.x.)

Slide 4: Defining Grid/Block Structure. Each kernel call needs values for the number of blocks in each dimension and the threads per block in each dimension:

   myKernel<<<B, T>>>(arg1, … );

B -- a CUDA-defined structure that gives the number of blocks in the grid in each dimension (1D, 2D, or possibly 3D). T -- a CUDA-defined structure that gives the number of threads in a block in each dimension (1D, 2D, or 3D). If you want a 1-D structure, an integer can be used for B and T.

Slide 5: CUDA Built-in Variables for a 1-D grid and 1-D block. threadIdx.x -- "thread index" within the block in the "x" dimension. blockIdx.x -- "block index" within the grid in the "x" dimension. blockDim.x -- "block dimension" in the "x" dimension (i.e. number of threads in a block in the x dimension). Example -- 4 blocks (blockIdx.x = 0..3), each having 8 threads (threadIdx.x = 0..7): global thread ID = blockIdx.x * blockDim.x + threadIdx.x, so thread 2 of block 3 is thread 3 * 8 + 2 = 26 with linear global addressing. [Figure: the four blocks of eight threads laid out side by side with their thread indices.]

Slide 6: Code example with a 1-D grid and 1-D blocks -- vector addition.

   #define N 2048   // size of vectors
   #define T 256    // number of threads per block
   #define B 8      // number of blocks in grid, N/T, one element per thread

   __global__ void vecAdd(int *a, int *b, int *c) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      c[i] = a[i] + b[i];
   }

   int main(int argc, char **argv) {
      …
      vecAdd<<<B, T>>>(devA, devB, devC);
      …
      return 0;
   }

Note: __global__ is a CUDA function qualifier (__ is two underscores). A __global__ function must return void.

Slide 7: Built-in CUDA Variables for Grid/Block Sizes. dim3 gridDim -- grid dimensions in x, y, and z; number of blocks in grid = gridDim.x * gridDim.y * gridDim.z. dim3 blockDim -- size of block dimensions in x, y, and z; number of threads in a block = blockDim.x * blockDim.y * blockDim.z. dim3 is essentially a structure of unsigned integers: x, y, z. Example to set dimensions:

   dim3 grid(16, 16);    // Grid -- 16 x 16 blocks
   dim3 block(32, 32);   // Block -- 32 x 32 threads
   …
   myKernel<<<grid, block>>>(...);

which sets: gridDim.x = 16, gridDim.y = 16, gridDim.z = 1, blockDim.x = 32, blockDim.y = 32, blockDim.z = 1.

Slide 8: CUDA Built-in Variables for Grid/Block Indices. uint3 blockIdx -- block index within the grid: blockIdx.x, blockIdx.y, blockIdx.z. uint3 threadIdx -- thread index within the block: threadIdx.x, threadIdx.y, threadIdx.z. uint3 is essentially a CUDA-defined structure of unsigned integers: x, y, z. With a 2-D block and grid, the global thread ID is:

   x = blockIdx.x*blockDim.x + threadIdx.x;
   y = blockIdx.y*blockDim.y + threadIdx.y;

[Figure: a 2-D grid of 2-D blocks, with one thread identified by its (x, y) global indices.]

Slide 9: Flattening arrays onto linear memory. Generally memory is allocated dynamically on the device (GPU), and in that case we cannot use 2-dimensional indices (e.g. A[row][column]) to access the array as we might otherwise. We need to know how the array is laid out in memory and then compute the distance from the beginning of the array. C uses row-major order -- rows are stored one after the other in memory, i.e. row 0, then row 1, etc. Note: GPU memory can also be allocated statically; see later.

Slide 10: Flattening an array. With N columns, array element a[row][column] = a[offset], where offset = column + row * N (row * number of columns skips to the start of the row). CUDA code:

   int col = blockIdx.x*blockDim.x + threadIdx.x;
   int row = blockIdx.y*blockDim.y + threadIdx.y;
   int index = col + row * N;
   A[index] = …

Note: another way to flatten the array is offset = row + column * N. We will come back to this later as it has very significant consequences for performance.

Slide 11: Matrix mapped onto 2-D grid and 2-D blocks. The array is mapped onto the grid/block structure, one element per thread: the element in column blockIdx.x * blockDim.x + threadIdx.x and row blockIdx.y * blockDim.y + threadIdx.y is handled by that thread. Basically the array is divided into "tiles" and one tile is mapped onto one block. [Figure: array, grid, block, and thread, showing the mapping.]

Slide 12: Matrix multiplication.

   __global__ void gpu_matrixmult(int *a, int *b, int *c, int N) {
      int k, sum = 0;
      int col = threadIdx.x + blockDim.x * blockIdx.x;
      int row = threadIdx.y + blockDim.y * blockIdx.y;
      if (col < N && row < N) {
         for (k = 0; k < N; k++)
            sum += a[row * N + k] * b[k * N + col];
         c[row * N + col] = sum;
      }
   }
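A minimal host-side launch for this kernel might look as follows (a sketch, not from the slide; the 32 x 32 block size and the device pointer names dev_a, dev_b, dev_c are assumptions):

   int numBlocks = (N + 31) / 32;       // enough 32 x 32 blocks to cover an N x N matrix
   dim3 Block(32, 32);                  // 2-D block of 1024 threads (the maximum per block)
   dim3 Grid(numBlocks, numBlocks);     // 2-D grid of 2-D blocks
   gpu_matrixmult<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);
   cudaDeviceSynchronize();             // wait for the kernel to complete before using results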

Slide 13: Measuring Performance. Since the primary motive for using GPUs in HPC is increased performance, we need to be able to measure code performance. Here we will introduce: timing program execution; how to measure the execution time of CUDA programs; CUDA "events"; synchronous and asynchronous CUDA routines; bandwidth measures; and computation measures (floating point operations/sec).

Slide 14: Ways to measure time of execution. Generally we instrument the code: measure the time at two places and take the difference. Routines to use to measure time: the C clock() or time() routines; CUDA "events" -- seems the best way (timing is measured using the GPU clock, with a resolution of approximately half a microsecond); the CUDA SDK timer.

Slide 15: Timing GPU Execution with CUDA events.

   cudaEvent_t start, stop;
   float elapsedTime;
   cudaEventCreate(&start);             // create event objects
   cudaEventCreate(&stop);
   cudaEventRecord(start, 0);           // record start event
      …                                 // code being timed goes here (the "time period")
   cudaEventRecord(stop, 0);            // record stop event
   cudaEventSynchronize(stop);          // wait until the event has actually been recorded
   cudaEventElapsedTime(&elapsedTime, start, stop);   // time between events
   cudaEventDestroy(start);             // destroy start event
   cudaEventDestroy(stop);              // destroy stop event

cudaEventRecord() is asynchronous and may return before recording the event! cudaEventSynchronize() waits until the event is actually recorded, i.e. when all work prior to the specified event has been completed by the threads. It is not necessary if there is a synchronous CUDA call in the code.

Slide 16: Asynchronous and synchronous calls. Kernels: a kernel starts after all previous CUDA calls have completed; control is returned to the CPU immediately (asynchronous, non-blocking). cudaMemcpy: the copy starts after all previous CUDA calls have completed; it returns after the copy is complete (synchronous).
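Because a kernel launch returns immediately, a host timer bracketed tightly around the launch only measures the launch overhead. A minimal sketch of the fix (gettimeofday() and the kernel/argument names are illustrative assumptions, not from the slides):

   #include <sys/time.h>
   …
   struct timeval t1, t2;
   gettimeofday(&t1, NULL);
   myKernel<<<B, T>>>(devA);      // asynchronous: control returns to the CPU immediately
   cudaDeviceSynchronize();       // block the host until all preceding device work is done
   gettimeofday(&t2, NULL);
   double ms = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0;
   // without cudaDeviceSynchronize(), ms would include only the launch overhead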

Slide 17: Issues to watch for. Asynchronous CUDA routines returning before they are complete -- a big issue. The first kernel launch is more time consuming than subsequent kernel executions because of code being transferred to the GPU.
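The one-time start-up cost can be kept out of a measurement by doing an untimed "warm-up" launch first (a sketch, not from the slides; the kernel name, arguments, and launch configuration are assumptions):

   myKernel<<<B, T>>>(devA);      // warm-up launch: pays the one-time code-transfer cost, not timed
   cudaDeviceSynchronize();       // make sure the warm-up has finished

   cudaEventRecord(start, 0);     // now time a representative launch with CUDA events
   myKernel<<<B, T>>>(devA);
   cudaEventRecord(stop, 0);
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&elapsedTime, start, stop);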

Slide 18: Bandwidth. Bandwidth is the rate at which data is transferred. The physical connection defines the maximum system bandwidth. Maximum bandwidth:

   S2050 (4 GPUs)                         4121.6 GB/sec
   C2050 Tesla (coit-grid06/7)            1030.4 GB/sec
   GTX 280                                 141.6 GB/sec
   GT 320M/330M (in MacBook Pro laptops)    25.6 GB/sec
   Pentium Core i7 with QuickPath           25.6 GB/sec
   Xbox                                      6.4 GB/sec

Wikipedia: Comparison of Nvidia graphics processing units, http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units#Tesla

Slide 19: Effective Bandwidth. Effective bandwidth is the actual bandwidth achieved by a program. If we measure the effective bandwidth of a program, we can compare it to the maximum possible. The effective bandwidth achieved by a program/kernel is given by:

   Effective bandwidth = (number_Bytes / time) x 10^-9 GB/s

where number_Bytes is the total number of bytes read or written, time is the time period in seconds, and GB/s = gigabytes per second = 1,000,000,000 bytes/s. Use effective bandwidth as a metric for measuring performance/optimization benefits.* (* From NVIDIA CUDA C Best Practices Guide, Version 3.2, 8/20/2010.)

Slide 20: Bandwidth of a Matrix Copy Operation (from NVIDIA CUDA C Best Practices Guide, Version 3.2, 8/20/2010). Copying an N x N matrix:

   Effective bandwidth = ((N^2 x b x 2) / time) x 10^-9 GB/sec

where there are b bytes in each number; the factor of 2 is because there are two transfers, a read plus a write. You need to know the size of the variables: int (32 bits), b = 4 bytes; float (32 bits), b = 4 bytes; double (64 bits), b = 8 bytes.
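For example, given the elapsed time in milliseconds from the CUDA events above, the calculation might look as follows (a sketch; N, b, and elapsedTime are assumed to be already defined):

   float seconds = elapsedTime / 1000.0f;                 // CUDA events report milliseconds
   double bytesMoved = (double) N * N * b * 2.0;          // N x N elements, b bytes each, read + write
   double effectiveBW = (bytesMoved / seconds) * 1e-9;    // effective bandwidth in GB/s
   printf("Effective bandwidth: %f GB/s\n", effectiveBW);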

Slide 21: Computational Measures -- GFLOPS. The classical measure in HPC is the number of floating point operations per second (a petaFLOP is 10^15 FLOPS, a GFLOP is 10^9 FLOPS). Systems have a max/peak GFLOPS rating. Peak single precision GFLOPS:

   IBM's Sequoia supercomputer*         16 petaFLOPS
   S2050 (4 GPUs)                       5152 GFLOPS
   C2050 Tesla (on our coit-grid06/7)   1288 GFLOPS
   Pentium Core i7                      40-55 GFLOPS

(* Current world record, June 2012.) Actual FLOPS are measured using standard benchmark programs such as LINPACK. You can measure FLOPS for your own program and see how close it gets to the peak (which presumably assumes only floating point operations are being done).
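The achieved GFLOPS for a program can be computed in the same way (a sketch; numOps and the timing variables are assumptions chosen to match the sample code on the next slide):

   float seconds = time / 1000.0f;               // CUDA event time is in ms
   double numOps = (double) N * N * T * B;       // floating point operations actually performed
   double gflops = (numOps / seconds) * 1e-9;    // achieved GFLOPS
   printf("Achieved %f GFLOPS\n", gflops);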

Slide 22: Sample code to measure performance.

   #define N 1000   // a big number up to INT_MAX, 2,147,483,647

   __global__ void gpu_compute(float *result) {
      int i, j;
      float a = 0.0;
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      for (i = 0; i < N; i++)
         for (j = 0; j < N; j++)
            a = a + 0.0001;               // do something, N x N floating pt operations
      result[tid] = a;                    // store result
      return;
   }

   int main(int argc, char *argv[]) {
      int T = 1, B = 1;                   // threads per block and blocks per grid
      float cpu_result, *gpu_result, ans[T * B];   // result from gpu, to make sure computation is being done
      cudaEvent_t start, end;             // using CUDA events to measure time,
      float time;                         // which is applicable for asynchronous code also

      cudaEventCreate(&start);            // instrument code to measure start time
      cudaEventCreate(&end);
      cudaEventRecord(start, 0);
      cudaMalloc((void**) &gpu_result, T * B * sizeof(float));
      gpu_compute<<<B, T>>>(gpu_result);
      cudaMemcpy(ans, gpu_result, T * B * sizeof(float), cudaMemcpyDeviceToHost);
      cudaEventRecord(end, 0);            // instrument code to measure end time
      cudaEventSynchronize(end);
      cudaEventElapsedTime(&time, start, end);

      printf("GPU, Answer thread 0, %e\n", ans[0]);
      printf("GPU Number of floating pt operations done %e\n", (double) N * N * T * B);
      printf("GPU Time using CUDA events: %f ms\n", time);   // time is in ms
      cudaEventDestroy(start);
      cudaEventDestroy(end);
      return 0;
   }

Slide 23: GPU Memories. These notes introduce: the basic memory hierarchy in the NVIDIA GPU -- global memory, shared memory, register file, constant memory; how to declare variables for each memory; and cache memory and how to make the most effective use of it in a program.

Slide 24: Improving Performance Using the GPU Memory Hierarchy. Global memory is off-chip on the GPU card. Even though it is an order of magnitude faster than CPU memory, it is still relatively slow and a bottleneck for performance. The GPU is provided with faster on-chip memory. There are two principal on-chip types: shared memory -- up to around 15x the speed of global memory; and registers -- potentially similar to shared memory. Data needs to be explicitly transferred from global memory to the on-chip memories.

Slide 25: Declaring program variables for registers, shared memory and global memory.

   Memory      Declaration                              Scope    Lifetime
   Registers   Automatic variables* other than arrays   Thread   Kernel
   Local       Automatic array variables                Thread   Kernel
   Shared      __shared__                               Block    Kernel
   Global      __device__                               Grid     Application
   Constant    __constant__                             Grid     Application

* Automatic variables are allocated automatically when entering the scope of the variable and de-allocated when leaving the scope. In C, all variables declared within a block are "automatic" by default; see http://en.wikipedia.org/wiki/Automatic_variable (To check: whether registers have the lifetime of a warp.)

Slide 26: Global Memory -- __device__. Global memory is for data available to all threads in the device. It is declared outside function bodies and has the scope of the grid and the lifetime of the application.

   #include <stdio.h>
   #define N 1000
   …
   __device__ int A[N];

   __global__ void kernel() {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      A[tid] = …
      …
   }

   int main() {
      …
   }

Note: this is statically declared GPU memory -- to get the contents back to the host you need the version of cudaMemcpy() that uses the array name (symbol) rather than a pointer.
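That routine is cudaMemcpyFromSymbol() (with cudaMemcpyToSymbol() for the other direction). A minimal sketch, assuming the __device__ array A[N] declared above and a host array h_A:

   int h_A[N];                                       // host copy
   cudaMemcpyFromSymbol(h_A, A, N * sizeof(int));    // device symbol A -> host array h_A
   cudaMemcpyToSymbol(A, h_A, N * sizeof(int));      // host array h_A -> device symbol A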

Slide 27: Issues with using global memory. Long delays -- slow. Access congestion. Cannot synchronize accesses. Need to ensure there are no conflicting accesses between threads.

Slide 28: Shared Memory -- __shared__. Shared memory is on the GPU chip and very fast. It provides separate data available to all threads in one block. It is declared inside function bodies and has the scope of the block and the lifetime of the kernel call, so each block has its own array A[N]:

   #include <stdio.h>
   #define N 1000
   …
   __global__ void kernel() {
      __shared__ int A[N];
      int tid = threadIdx.x;
      A[tid] = …
      …
   }

   int main() {
      …
   }

Slide 29: Transferring data to shared memory.

   int A[N][N];                     // host array, to be copied to device global memory

   __global__ void myKernel(int *A_global) {
      __shared__ int A_sh[n][n];    // declare shared memory for one tile
      int row = …
      int col = …
      A_sh[threadIdx.y][threadIdx.x] = A_global[row + col*N];   // copy from global to shared
      …
   }

   int main() {
      …
      cudaMalloc((void**)&dev_A, size);                    // allocate global memory
      cudaMemcpy(dev_A, A, size, cudaMemcpyHostToDevice);  // copy to global memory
      myKernel<<<B, T>>>(dev_A);
      …
   }

Slide 30: Issues with shared memory. It is not immediately synchronized after access (usually it is the writes that matter) -- use __syncthreads() before you read data that has been altered; see the sketch below. Shared memory is very limited (Fermi has up to 48KB per streaming multiprocessor, NOT per block), hence data may have to be divided into "chunks".
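A minimal sketch of the __syncthreads() pattern (the kernel, the tile size, and the idea of staging a tile into shared memory are illustrative assumptions; launch with blockDim.x == BLOCK_SIZE):

   #define BLOCK_SIZE 32

   __global__ void staged(int *A_global, int *B_global) {
      __shared__ int tile[BLOCK_SIZE];
      int i = blockIdx.x * blockDim.x + threadIdx.x;

      tile[threadIdx.x] = A_global[i];                    // each thread writes one element of the tile
      __syncthreads();                                    // wait until every thread in the block has written

      B_global[i] = tile[(threadIdx.x + 1) % BLOCK_SIZE]; // now safe to read an element written by another thread
   }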

Slide 31: Registers. The compiler will place variables declared in a kernel in registers when possible. There is a limit to the number of registers: Fermi has 32768 32-bit registers per streaming multiprocessor. Registers are divided across the groups of 32 threads ("warps") that operate in SIMT mode, and have the lifetime of the warps(?).

   __global__ void kernel() {
      int x, y, z;
      …
   }

Slide 32: Arrays declared within a kernel (automatic array variables).

   __global__ void kernel() {
      int A[10];
      …
   }

These are generally stored in global memory, but a private copy is made for each thread.* Access can be as slow as global memory, except that it is cached (see later). If the array is indexed with constant values, the compiler may use registers instead. (* In global "local" memory, see later.)

Slide 33: Constant Memory -- __constant__. For data not altered by the device. Although stored in global memory, it is cached and has fast access. It is declared outside function bodies, with the scope of the grid and the lifetime of the application. The size is currently limited to 65536 bytes.

   #include <stdio.h>
   …
   __constant__ int n;

   __global__ void kernel() {
      …
   }

   int main() {
      …   // set n from the host, see the sketch below
      …
   }
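Constant memory cannot be assigned directly from host code; it is set with cudaMemcpyToSymbol(). A minimal sketch, assuming the __constant__ int n declared above:

   int host_n = 1000;                              // value computed on the host
   cudaMemcpyToSymbol(n, &host_n, sizeof(int));    // copy host_n into the __constant__ variable n
   // kernels launched after this copy see the updated value of n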

Slide 34: Local memory. Resides in device memory space (global memory) and is slow, except that it is organized such that consecutive 32-bit words are accessed by consecutive thread IDs for the best coalesced accesses when possible. For compute capability 2.x it is cached in the on-chip L1 and L2 caches. It is used to hold arrays that are not indexed with constant values, and variables when there are no more registers available for them.

Slide 35: Cache memory. More recent GPUs have L1 and L2 cache memory, but apparently without cache coherence, so it is up to the programmer to ensure correctness: make sure each thread accesses different locations, and ideally arrange accesses to fall in the same cache lines. Compute capability 1.3 Teslas do not have cache memory; compute capability 2.0 Fermis have L1/L2 caches.

Slide 36: Poor performance from poor data layout.

   __global__ void kernel(int *A) {
      int i = threadIdx.x + blockDim.x*blockIdx.x;
      A[1000*i] = …
   }

Very bad! Each thread accesses a location on a different cache line. The Fermi line size is 32 integers or floats.

Slide 37: Taking advantage of cache.

   __global__ void kernel(int *A) {
      int i = threadIdx.x + blockDim.x*blockIdx.x;
      A[i] = …
   }

Good! Groups of 32 accesses by consecutive threads fall on the same line, and those threads will be in the same warp. The Fermi line size is 32 integers or floats.

Slide 38: Warp. A "warp" in CUDA is a group of 32 threads that operate in SIMT mode. A "half warp" (16 threads) actually executes simultaneously on current GPUs. Using knowledge of warps and how the memory is laid out can improve code performance.

Slide 39: Memory banks. Consecutive locations are placed on successive memory banks: with B banks, the device can fetch A[0], A[1], A[2], …, A[B-1] at the same time. [Figure: four memory banks holding A[0], A[1], A[2], and A[3] respectively, connected to the device (GPU).]

Slide 40: Memory coalescing -- aligned memory accesses. Threads can read 4, 8, or 16 bytes at a time from global memory, but only if the accesses are aligned: a 4-byte read must start at an address …xxxxx00, an 8-byte read must start at an address …xxxx000, and a 16-byte read must start at an address …xxx0000. Aligned access is then much faster (twice as fast?).
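For example, the built-in vector type float4 gives each thread a naturally aligned 16-byte read (a sketch, not from the slides; pointers returned by cudaMalloc() are suitably aligned):

   __global__ void copy4(float4 *in, float4 *out) {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      out[i] = in[i];      // one aligned 16-byte load and one aligned 16-byte store per thread
   }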

Slide 41: Ideally, try to arrange for threads to access different memory banks at the same time, and consecutive addresses. With 16 banks, a bad case would be: thread 0 accessing A[0], A[1], …, A[15]; thread 1 accessing A[16], A[17], …, A[31]; thread 2 accessing A[32], A[33], …, A[47]; etc. (at each point in time the threads all hit the same bank). A good case would be: thread 0 accessing A[0], A[16], …; thread 1 accessing A[1], A[17], …; thread 2 accessing A[2], A[18], …; etc. (at each point in time consecutive threads hit consecutive banks). You need to know details such as the number of banks! The hands-on session will explore the effects of memory coalescing.

Slide 42: Experiment. Simply load numbers into a two-dimensional array. The global thread ID is loaded into the array element being accessed, so one can tell which thread accesses which location. The loading can be done across rows or down columns, and the execution time of each way is compared. GPU structure: one or more 2-D 32 x 32 blocks in a 2-D grid.

Slide 43: One way (the alternate way is part of the hands-on tasks).

   __global__ void gpu_Comput1(int *h, int N, int T) {
      int col = threadIdx.x + blockDim.x * blockIdx.x;
      int row = threadIdx.y + blockDim.y * blockIdx.y;
      int threadID = col + row * N;       // thread ID
      int index = col + row * N;          // array index
      for (int t = 0; t < T; t++)         // loop to reduce other time effects
         h[index] = threadID;             // load array with global thread ID
   }

Slide 44: [Results] A grid of one block, a 32 x 32 array, and 1000000 iterations: speedup = 17.16.

Slide 45: Unified Virtual Addressing (CUDA Version 4). The host and device(s) memories share a single address space, so pointers can point to either host or device memory. The cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice options can all be handled by a single option, cudaMemcpyDefault, and it may not be necessary to explicitly copy data between memories in the program. [Figure: CPU and GPU sharing one address space from 0x0000 to 0xFFFF.]
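A minimal sketch of the single-option form (the pointer names and size are assumptions; with UVA the runtime infers the copy direction from the pointer values):

   size_t size = 1024 * sizeof(float);
   float *h_data, *d_data;
   cudaMallocHost((void**)&h_data, size);                  // page-locked host memory in the unified address space
   cudaMalloc((void**)&d_data, size);                      // device memory
   cudaMemcpy(d_data, h_data, size, cudaMemcpyDefault);    // direction (host -> device) inferred automatically
   cudaMemcpy(h_data, d_data, size, cudaMemcpyDefault);    // direction (device -> host) inferred automatically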

Slide 46: Questions

