1 SC12: The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, Utah. Workshop 119: An Educator's Toolbox for CUDA. Karen L. Karavanic, Portland State University; David Bunde, Knox College; Jens Mache, Lewis and Clark College; Barry Wilkinson, University of North Carolina Charlotte. Wednesday November 14, 2012. Session 2 (Part II): Further Features and Performance of CUDA Programs, 1:30 pm – 3:00 pm. SC12 Workshop 119, Session2.ppt, modification date: Nov 11, 2012, B. Wilkinson

2 Session 2: Review of grids, blocks, and threads, built-in CUDA variables (from Session 1); vector/matrix addition and multiplication; measuring performance; improving performance using memory coalescing and shared memory.

3 Review: Grids, Blocks, and Threads. Threads are grouped into "blocks". Blocks can be 1, 2, or 3 dimensional. Each kernel call uses a "grid" of blocks. Grids can be 1, 2 or 3* dimensional. The programmer needs to specify the grid/block organization on each kernel call (which can be different each time), within limits set by the GPU. Limits*: maximum number of threads per block: 1024; maximum x- and y-dimension of a thread block: 1024; maximum z-dimension of a block: 64; maximum size of each dimension of a grid of thread blocks: 65535. * For compute capability 2.x+ devices. Our C2050s are compute capability 2.0. As of mid 2012, compute capabilities go up to 3.x.
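These limits vary by device; a minimal sketch (not from the original slides) of querying them at run time with cudaGetDeviceProperties(); the hard-coded device number 0 is an assumption:

#include <stdio.h>

int main() {
   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);            // properties of device 0
   printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
   printf("Max block dims: %d x %d x %d\n",
          prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
   printf("Max grid dims: %d x %d x %d\n",
          prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
   printf("Compute capability: %d.%d\n", prop.major, prop.minor);
   return 0;
}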

4 Defining Grid/Block Structure. Need to provide each kernel call with values for: the number of blocks in each dimension, and the threads per block in each dimension. myKernel<<<B, T>>>(arg1, … ); B – a CUDA-defined structure that defines the number of blocks in the grid in each dimension (1D, 2D, or possibly 3D). T – a CUDA-defined structure that defines the number of threads in a block in each dimension (1D, 2D, or 3D). If you want a 1-D structure, you can use an integer for B and T.

5 CUDA Built-in Variables for a 1-D grid and 1-D block. threadIdx.x -- "thread index" within the block in the "x" dimension. blockIdx.x -- "block index" within the grid in the "x" dimension. blockDim.x -- "block dimension" in the "x" dimension (i.e. number of threads in the block in the x dimension). Example – 4 blocks (blockIdx.x = 0, 1, 2, 3), each having 8 threads (threadIdx.x = 0 … 7). [Figure: the thread with global ID 26, third thread of block 3.] Global thread ID = blockIdx.x * blockDim.x + threadIdx.x = 3 * 8 + 2 = thread 26 with linear global addressing.

6 Code example with a 1-D grid and 1-D blocks: vector addition.

#define N 2048     // size of vectors
#define T 256      // number of threads per block
#define B 8        // number of blocks in grid, N/T, one element per thread

__global__ void vecAdd(int *a, int *b, int *c) {   // parameters in lower case to avoid clash with macros
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   c[i] = a[i] + b[i];
}

int main(int argc, char **argv) {
   …
   vecAdd<<<B, T>>>(devA, devB, devC);
   …
   return (0);
}

Note: __global__ is a CUDA function qualifier. __ is two underscores. A __global__ function must return void.

7 Built-in CUDA Variables for Grid/Block Sizes.

dim3 gridDim -- grid dimensions, x, y, or z. Number of blocks in grid = gridDim.x * gridDim.y * gridDim.z
dim3 blockDim -- size of block dimensions x, y, and z. Number of threads in block = blockDim.x * blockDim.y * blockDim.z

dim3 is essentially a structure of unsigned integers: x, y, z.

Example to set dimensions:

dim3 grid(16, 16);     // grid of 16 x 16 blocks
dim3 block(32, 32);    // block of 32 x 32 threads
…
myKernel<<<grid, block>>>(...);

which sets:
gridDim.x = 16, gridDim.y = 16, gridDim.z = 1
blockDim.x = 32, blockDim.y = 32, blockDim.z = 1

8 CUDA Built-in Variables for Grid/Block Indices.

uint3 blockIdx -- block index within the grid: blockIdx.x, blockIdx.y, blockIdx.z
uint3 threadIdx -- thread index within the block: threadIdx.x, threadIdx.y, threadIdx.z

uint3 is essentially a CUDA-defined structure of unsigned integers: x, y, z.

2-D block and grid, global thread ID:
x = blockIdx.x*blockDim.x + threadIdx.x;
y = blockIdx.y*blockDim.y + threadIdx.y;

[Figure: a thread within a block within the grid, located by blockIdx.x * blockDim.x + threadIdx.x and blockIdx.y * blockDim.y + threadIdx.y.]

9 Flattening arrays onto linear memory. Generally memory is allocated dynamically on the device (GPU), and in that case we cannot use 2-dimensional indices (e.g. A[row][column]) to access the array as we might otherwise. We need to know how the array is laid out in memory and then compute the distance from the beginning of the array. C uses row-major order -- rows are stored one after the other in memory, i.e. row 0, then row 1, etc. Note: GPU memory can also be allocated statically, see later.

10 Flattening an array. With N columns, array element* a[row][column] = a[offset], where offset = column + row * N (the row index times the number of columns, plus the column index).

CUDA code:
int col = blockIdx.x*blockDim.x + threadIdx.x;
int row = blockIdx.y*blockDim.y + threadIdx.y;
int index = col + row * N;
A[index] = …

* Note: another way to flatten the array is offset = row + column * N. We will come back to this later as it has very significant consequences on performance.

11 Matrix mapped onto 2-D grids and 2-D blocks. Arrays are mapped onto the grid/block structure, one element per thread: element A[row][column] is handled by the thread at (blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y). Basically the array is divided into "tiles" and one tile is mapped onto one block. [Figure: correspondence between array, grid, block, and thread.]

12 Matrix multiplication

__global__ void gpu_matrixmult(int *a, int *b, int *c, int N) {
   int k, sum = 0;
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   if (col < N && row < N) {
      for (k = 0; k < N; k++)
         sum += a[row * N + k] * b[k * N + col];
      c[row * N + col] = sum;
   }
}
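A host-side sketch (not in the original slides) of launching this kernel with a 2-D grid of 2-D blocks; the block width of 16, the assumption that N is a multiple of it, and the host arrays a, b, c are illustrative choices:

#define BLK 16                               // 16 x 16 = 256 threads per block

int size = N * N * sizeof(int);
int *dev_a, *dev_b, *dev_c;
cudaMalloc((void**)&dev_a, size);            // allocate matrices in global memory
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, size);
cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

dim3 block(BLK, BLK);                        // threads per block in x and y
dim3 grid(N/BLK, N/BLK);                     // blocks per grid in x and y
gpu_matrixmult<<<grid, block>>>(dev_a, dev_b, dev_c, N);

cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);   // copy result back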

13 Measuring Performance. Since the primary motive for using GPUs in HPC is increased performance, we need to be able to measure code performance. Here we will introduce: timing program execution; how to measure the time of execution of CUDA programs; CUDA "events"; synchronous and asynchronous CUDA routines; bandwidth measures; computation measures – floating point operations/sec.

14 Ways to measure time of execution. Generally instrument the code: measure the time at two places and take the difference. Routines to use to measure time: C clock() or time() routines; CUDA "events" -- seems the best way, timing is measured using the GPU clock, resolution approx. ½ microsecond; the CUDA SDK timer.

15 Timing GPU Execution with CUDA events

cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start);       // create event objects
cudaEventCreate(&stop);
cudaEventRecord(start, 0);     // record start event
…                              // time period being measured
cudaEventRecord(stop, 0);      // record end event
cudaEventSynchronize(stop);    // wait for event to be recorded
cudaEventElapsedTime(&elapsedTime, start, stop);   // time between events
cudaEventDestroy(start);       // destroy start event
cudaEventDestroy(stop);        // destroy stop event

cudaEventRecord() is asynchronous and may return before recording the event! cudaEventSynchronize() waits until the event is actually recorded, i.e. when all work prior to the specified event has been completed by the threads. Not necessary if there is a synchronous CUDA call in the code.
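In practice the work to be timed goes between the two cudaEventRecord() calls; a minimal sketch of timing a kernel launch (myKernel, B, T, and the device pointers are assumptions):

cudaEventRecord(start, 0);                  // mark start on stream 0
myKernel<<<B, T>>>(devA, devB, devC);       // the work being timed
cudaEventRecord(stop, 0);                   // mark end on stream 0
cudaEventSynchronize(stop);                 // wait until the stop event is recorded
cudaEventElapsedTime(&elapsedTime, start, stop);   // elapsed time in milliseconds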

16 Asynchronous and synchronous calls. Kernels: a kernel starts after all previous CUDA calls have completed; control is returned to the CPU immediately (asynchronous, non-blocking). cudaMemcpy: the copy starts after all previous CUDA calls have completed; it returns after the copy is complete (synchronous).

17 Issues to watch for. Asynchronous CUDA routines returning before they are complete – a big issue. The first kernel launch is more time consuming than subsequent kernel executions because of code being transferred to the GPU.
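One common way to deal with both issues (a sketch, not from the slides; myKernel and its arguments are assumptions) is an untimed warm-up launch followed by cudaDeviceSynchronize() before any timing starts:

myKernel<<<B, T>>>(devA, devB, devC);       // untimed warm-up launch
cudaDeviceSynchronize();                    // wait for all prior GPU work to finish

cudaEventRecord(start, 0);                  // timed run
myKernel<<<B, T>>>(devA, devB, devC);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);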

18 Bandwidth. Bandwidth is the rate at which data is transferred. The physical connection will define the maximum system bandwidth.

Maximum bandwidth:
S2050 (4 GPUs)                           GB/sec
C2050 Tesla (coit-grid06/7)              GB/sec
GTX                                      GB/sec
GT 320M/330M (in Mac pro laptops)        25.6 GB/sec
Pentium Core i7 with Quickpath           25.6 GB/sec
Xbox                                     6.4 GB/sec

Source: Wikipedia, Comparison of Nvidia graphics processing units.

19 Effective Bandwidth. Effective bandwidth is the actual bandwidth achieved by a program. If we measure the effective bandwidth of a program, we can compare that to the maximum possible. The effective bandwidth achieved by a program/kernel is given by:

Effective bandwidth = (number_Bytes / time) x 10^-9 GB/s

where number_Bytes is the total number of bytes read or written, time is the time period in seconds, and GB/s = gigabytes per second = 1,000,000,000 bytes/s. Use effective bandwidth as a metric for measuring performance/optimization benefits.*

* From NVIDIA CUDA C Best Practices Guide, Version 3.2, 8/20/2010.
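For example, a small host-side sketch computing the effective bandwidth of the earlier vector addition (N ints read from each of A and B, N ints written to C; elapsedTime in milliseconds from CUDA events):

double numberBytes = 3.0 * N * sizeof(int);            // bytes read plus bytes written
double seconds = elapsedTime / 1000.0;                 // CUDA events report milliseconds
double effectiveBW = (numberBytes / seconds) / 1.0e9;  // GB/s
printf("Effective bandwidth: %f GB/s\n", effectiveBW);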

20 Bandwidth of a Matrix Copy Operation. From NVIDIA CUDA C Best Practices Guide, Version 3.2, 8/20/2010. Copying an N x N matrix:

Effective bandwidth = ((N² x b x 2) / time) x 10^-9 GB/sec

where there are b bytes in each number and the factor of 2 accounts for the two transfers (read plus write). Need to know the size of the variables: int (32 bits), b = 4 bytes; float (32 bits), b = 4 bytes; double (64 bits), b = 8 bytes.

21 Computational Measures – GFLOPS. The classical measure in HPC is the number of floating point operations per second. Systems have max/peak GFLOPS:

IBM's Sequoia supercomputer*           16 petaFLOPS
S2050 (4 GPUs)                         5152 GFLOPS
C2050 Tesla (on our coit-grid06/7)     1288 GFLOPS
Pentium Core i7                        GFLOPS

(Peak single precision GFLOPS; 1 petaFLOP = 10^15 FLOPS, 1 GFLOP = 10^9 FLOPS.)

Actual FLOPS -- measured using standard benchmark programs such as LINPACK. You can measure the FLOPS of your own program and see how close it gets to the peak (which presumably is doing only floating point operations).

* Current world record, June 2012.

22 Sample code to measure performance

#include <stdio.h>
#define N 1000        // a big number up to INT_MAX, 2,147,483,647

__global__ void gpu_compute(float *result) {
   int i, j;
   float a = 0.0;
   int tid = blockIdx.x * blockDim.x + threadIdx.x;
   for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
         a = a + 1.0;             // do something, N x N floating pt operations
   result[tid] = a;               // store result
   return;
}

int main(int argc, char *argv[]) {
   int T = 1, B = 1;              // threads per block and blocks per grid
   float cpu_result, *gpu_result, ans[T * B];   // result from gpu, to make sure computation is being done
   cudaEvent_t start, end;        // using cuda events to measure time
   float time;                    // which is applicable for asynchronous code also

   cudaEventCreate(&start);       // instrument code to measure start time
   cudaEventCreate(&end);
   cudaEventRecord(start, 0);
   cudaMalloc((void**) &gpu_result, T * B * sizeof(float));
   gpu_compute<<<B, T>>>(gpu_result);
   cudaMemcpy(ans, gpu_result, T * B * sizeof(float), cudaMemcpyDeviceToHost);
   cudaEventRecord(end, 0);       // instrument code to measure end time
   cudaEventSynchronize(end);
   cudaEventElapsedTime(&time, start, end);

   printf("GPU, Answer thread 0, %e\n", ans[0]);
   printf("GPU Number of floating pt operations done %e\n", (double) N * N * T * B);
   printf("GPU Time using CUDA events: %f ms\n", time);   // time is in ms

   cudaEventDestroy(start);
   cudaEventDestroy(end);
   return 0;
}

23 GPU Memories. These notes introduce: the basic memory hierarchy in the NVIDIA GPU -- global memory, shared memory, register file, constant memory; how to declare variables for each memory; cache memory and making the most effective use of it in a program.

24 Improving Performance Using the GPU Memory Hierarchy. Global memory is off-chip on the GPU card. Even though it is an order of magnitude faster than CPU memory, it is still relatively slow and a bottleneck for performance. The GPU is provided with faster on-chip memory. There are two principal types on-chip: shared memory -- up to around 15x the speed of global memory; and registers -- potentially similar in speed to shared memory. Need to explicitly transfer data from global memory to the on-chip memories.

25 Declaring program variables for registers, shared memory and global memory

Memory      Declaration                                 Scope     Lifetime
Registers   Automatic variables* (other than arrays)    Thread    Kernel
Local       Automatic array variables                   Thread    Kernel
Shared      __shared__                                  Block     Kernel
Global      __device__                                  Grid      Application
Constant    __constant__                                Grid      Application

* Automatic variables are allocated automatically when entering the scope of the variable and de-allocated when leaving the scope. In C, all variables declared within a block are "automatic" by default.

(Check: registers have the lifetime of a warp.)
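A minimal sketch (not from the slides) showing one declaration of each kind from the table above; it assumes blocks of at most 256 threads:

__constant__ int coeff[16];               // constant memory, file scope, lifetime of application
__device__   int globalData[256];         // global memory, file scope, lifetime of application

__global__ void kernel() {
   int x = threadIdx.x;                   // automatic variable, normally placed in a register
   int localArray[10];                    // automatic array, placed in local memory
   __shared__ int tile[256];              // shared memory, one copy per block
   localArray[0] = coeff[x % 16];
   tile[x] = globalData[x] + localArray[0];
   globalData[x] = tile[x];
}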

26 Global Memory: __device__. Global memory for data available to all threads in the device. Declared outside function bodies. Scope of the grid and lifetime of the application.

#include <stdio.h>
#define N 1000
…
__device__ int A[N];

__global__ void kernel() {
   int tid = blockIdx.x * blockDim.x + threadIdx.x;
   A[tid] = …
   …
}

int main() {
   …
}

Note this is statically declared GPU memory – to get the contents back to the host you need the version of cudaMemcpy() that uses the array name (symbol) rather than a pointer.
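A sketch of initializing the statically declared array A above from the host and copying it back with the symbol-based copy routines; h_A is a hypothetical host array:

int h_A[N];
cudaMemcpyToSymbol(A, h_A, N * sizeof(int));      // host h_A -> device symbol A
…                                                 // kernel launches
cudaMemcpyFromSymbol(h_A, A, N * sizeof(int));    // device symbol A -> host h_A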

27 Issues with using global memory: long delays (slow); access congestion; accesses cannot be synchronized; need to ensure no conflicts of accesses between threads.

28 Shared Memory. Shared memory is on the GPU chip and very fast. Separate data available to all threads in one block. Declared inside function bodies. Scope of the block and lifetime of the kernel call. So each block would have its own array A[N].

#include <stdio.h>
#define N 1000
…
__global__ void kernel() {
   __shared__ int A[N];
   int tid = threadIdx.x;
   A[tid] = …
   …
}

int main() {
   …
}

29 Transferring data to shared memory

int A[N][N];     // to be copied from host to device with cudaMalloc/cudaMemcpy

__global__ void myKernel(int *A_global) {
   __shared__ int A_sh[n][n];            // declare shared memory tile
   int row = …
   int col = …
   A_sh[i][j] = A_global[row + col*N];   // copy from global to shared
   …
}

int main() {
   …
   cudaMalloc((void**)&dev_A, size);     // allocate global memory
   cudaMemcpy(dev_A, A, size, cudaMemcpyHostToDevice);   // copy to global memory
   myKernel<<<B, T>>>(dev_A);
   …
}

30 Issues with Shared Memory. Accesses are not immediately synchronized; usually it is the writes that matter. Use __syncthreads() before you read data that has been altered by other threads. Shared memory is very limited (Fermi has up to 48 KB per GPU core (SM), NOT per block). Hence you may have to divide the data into "chunks".
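A minimal sketch (not from the slides; the 256-thread block and the within-block reversal are arbitrary choices) showing where __syncthreads() is needed:

__global__ void reverseInBlock(int *d_in, int *d_out) {
   __shared__ int s[256];                   // assumes 256 threads per block
   int t = threadIdx.x;
   int i = blockIdx.x * blockDim.x + t;
   s[t] = d_in[i];                          // each thread writes one element
   __syncthreads();                         // wait until all writes to s[] are complete
   d_out[i] = s[blockDim.x - 1 - t];        // safe: reads an element written by another thread
}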

31 Registers. The compiler will place variables declared in a kernel in registers when possible. There is a limit to the number of registers: Fermi has 32K 32-bit registers per SM. Registers are divided across groups of 32 threads ("warps") that operate in SIMT mode and have the lifetime of the warps(?).

__global__ void kernel() {
   int x, y, z;
   …
}

32 Arrays declared within a kernel (automatic array variables)

__global__ void kernel() {
   int A[10];
   …
}

Generally stored in global memory, but a private copy is made for each thread.* Can be as slow to access as global memory, except that it is cached, see later. If the array is indexed with a constant value, the compiler may use registers.

* Global "local" memory, see later.

33 Constant Memory: __constant__. For data not altered by the device. Although stored in global memory, it is cached and has fast access. Declared outside function bodies. Scope of the grid and lifetime of the application. Size currently limited to 65536 bytes (64 KB).

#include <stdio.h>
…
__constant__ int n;

__global__ void kernel() {
   …
}

int main() {
   n = …
   …
}
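Note that the host cannot assign to a __constant__ variable directly as the n = … line above suggests; the value has to be copied with cudaMemcpyToSymbol(). A minimal sketch, with h_n a hypothetical host variable:

int h_n = 1000;                              // value on the host
cudaMemcpyToSymbol(n, &h_n, sizeof(int));    // copy into constant memory variable n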

34 Local memory. Resides in device memory space (global memory) and is slow, except that it is organized so that consecutive 32-bit words are accessed by consecutive thread IDs for the best coalesced accesses when possible. For compute capability 2.x, it is cached in the on-chip L1 and L2 caches. Used to hold arrays that are not indexed with a constant value, and variables when there are no more registers available for them.

35 Cache memory. More recent GPUs have L1 and L2 cache memory, but apparently without cache coherence, so it is up to the programmer to ensure that: make sure each thread accesses different locations; ideally arrange accesses to be in the same cache lines. Compute capability 1.3 Teslas do not have cache memory; compute capability 2.0 Fermis have L1/L2 caches.

36 Poor Performance from Poor Data Layout

__global__ void kernel(int *A) {
   int i = threadIdx.x + blockDim.x*blockIdx.x;
   A[1000*i] = …
}

Very bad! Each thread accesses a location on a different cache line. The Fermi line size is 32 integers or floats (128 bytes).

37 Taking Advantage of the Cache

__global__ void kernel(int *A) {
   int i = threadIdx.x + blockDim.x*blockIdx.x;
   A[i] = …
}

Good! Groups of 32 accesses by consecutive threads fall on the same line, and those threads will be in the same warp. The Fermi line size is 32 integers or floats.

38 Warp. A "warp" in CUDA is a group of 32 threads that operate in SIMT mode. A "half warp" (16 threads) actually executes simultaneously (on current GPUs). Using knowledge of warps and how the memory is laid out can improve code performance.

39 Memory Banks. Consecutive locations are placed on successive memory banks: A[0], A[1], A[2], A[3], … on banks 1, 2, 3, 4, … The device (GPU) can fetch A[0], A[1], A[2], A[3], …, A[B-1] at the same time, where there are B banks. [Figure: device connected to memory banks 1–4 holding A[0]–A[3].]

40 Memory Coalescing: aligned memory accesses. Threads can read 4, 8, or 16 bytes at a time from global memory, but only if the accesses are aligned. That is: a 4-byte read must start at an address … xxxxx00; an 8-byte read must start at an address … xxxx000; a 16-byte read must start at an address … xxx0000. Then the access is much faster (twice as fast?).
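One way to get the wider aligned accesses is to read built-in vector types such as float4, so each thread issues a single 16-byte load; a sketch (the kernel and its scaling operation are illustrative, and cudaMalloc already returns suitably aligned memory):

__global__ void scale4(float4 *A, float s) {
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   float4 v = A[i];                          // one 16-byte load per thread
   v.x *= s;  v.y *= s;  v.z *= s;  v.w *= s;
   A[i] = v;                                 // one 16-byte store per thread
}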

41 Ideally try to arrange for threads to access different memory modules at the same time, and consecutive addresses.

A bad case would be:
Thread 0 accesses A[0], A[1], … A[15]
Thread 1 accesses A[16], A[17], … A[31]
Thread 2 accesses A[32], A[33], … A[47]
… etc.

A good case would be:
Thread 0 accesses A[0], A[16], A[32], …
Thread 1 accesses A[1], A[17], A[33], …
Thread 2 accesses A[2], A[18], A[34], …
… etc., if there are 16 banks. Need to know that detail!

The hands-on session will explore the effects of memory coalescing.

42 Experiment. Simply load numbers into a two-dimensional array. The global thread ID is loaded into the array element being accessed, so one can tell which thread accesses which location. Loading could be done across rows or down columns, and the time of execution of each way is compared. GPU structure: one or more 2-D 32 x 32 blocks in a 2-D grid.

43 One way

__global__ void gpu_Comput1(int *h, int N, int T) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;        // thread ID
   int index = col + row * N;           // array index
   for (int t = 0; t < T; t++)          // loop to reduce other time effects
      h[index] = threadID;              // load array with global thread ID
}

The alternate way is part of the hands-on tasks.

44 Results: a grid of one block, array 32 x 32, with … iterations. Speedup = 17.16.

45 Unified Virtual Addressing (CUDA Version 4, 2012). Host and device(s) memories share a single address space. Pointers can point to either host or device memories. The cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice options are all handled with a single option, cudaMemcpyDefault. It may not be necessary to explicitly copy data between memories in a program. [Figure: host memory and GPU memory mapped into one address space from 0x0000 to 0xFFFF.]
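A minimal sketch of the single-option copy with cudaMemcpyDefault (size and the use of page-locked host memory are assumptions; the runtime infers the copy direction from the pointer values):

float *h_data, *d_data;
cudaMallocHost((void**)&h_data, size);                  // page-locked host memory
cudaMalloc((void**)&d_data, size);                      // device memory
cudaMemcpy(d_data, h_data, size, cudaMemcpyDefault);    // direction inferred: host -> device
cudaMemcpy(h_data, d_data, size, cudaMemcpyDefault);    // direction inferred: device -> host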

Questions