
1 ITCS 4/5010 GPU Programming, B. Wilkinson, Jan 21, 2012. CUDATiming.ppt
Measuring Performance
These notes introduce:
Timing program execution
How to measure time of execution of CUDA programs
CUDA "events"
Synchronous and asynchronous CUDA routines
Bandwidth measures
Computation measures – floating point operations/sec

2 Ways to measure time of execution
Generally, instrument the code: measure the time at two places and take the difference.
Routines to use to measure time:
C clock() or time() routines
CUDA "events" (seems the best way)
CUDA SDK timer

3 Timing with clock()
If the program uses cudaMemcpy, which is synchronous (it waits for previous operations to complete and returns when the copy is complete), the C clock() routine could be used:

#include <time.h>                          // needed for clock()

int main() {
    clock_t start, stop;                   // clock_t values (an integer type)
    ...
    start = clock();                       // number of clock ticks since program launched
    cudaMemcpy( ... );
    mykernel<<<B, T>>>();                  // kernel call
    cudaMemcpy( ... );
    stop = clock();
    ...
    printf("Execution time is %f seconds\n",
           (float) (stop - start) / CLOCKS_PER_SEC);
    return 0;
}

4 If just measuring the time of an asynchronous kernel with clock()
It is important to remember that kernel calls are asynchronous and return immediately, before the kernel has fully executed. Hence we need to wait for the kernel to complete, which can be achieved using cudaThreadSynchronize():

start = clock();
mykernel<<<B, T>>>();          // kernel call
cudaThreadSynchronize();       // wait for kernel to complete
stop = clock();

(We will discuss synchronization within a computation later.)
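Note that in more recent CUDA releases cudaThreadSynchronize() is deprecated in favour of cudaDeviceSynchronize(). A minimal, self-contained sketch of the same timing pattern, assuming a placeholder kernel mykernel and an example launch configuration B, T:

#include <stdio.h>
#include <time.h>

__global__ void mykernel() {
    // ... work to be timed ...
}

int main() {
    int B = 1, T = 1;                      // example launch configuration
    clock_t start = clock();
    mykernel<<<B, T>>>();                  // asynchronous launch, returns immediately
    cudaDeviceSynchronize();               // block host until the kernel finishes
    clock_t stop = clock();
    printf("Kernel time: %f seconds\n",
           (float) (stop - start) / CLOCKS_PER_SEC);
    return 0;
}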

5 CUDA event timer
In general, it is better to use the CUDA event timer. First we need to create the event objects:

cudaEvent_t event1;
cudaEventCreate(&event1);
cudaEvent_t event2;
cudaEventCreate(&event2);

This creates two "event" objects, event1 and event2.

6 Recording Events
cudaEventRecord(event1, 0)
records an "event" into the default "stream" (0). The device will record a timestamp for the event when it reaches that event in the stream, that is, after all preceding operations have completed. (For the default stream 0, this means completed in the whole CUDA context.)
NOTE: This operation is asynchronous and may return before the event is recorded!

7 Making sure the event is actually recorded
cudaEventSynchronize(event) -- waits until the named event has actually been recorded. The event is recorded when all work issued prior to the specified event has been completed by the device.
(Not strictly necessary if there is a synchronous CUDA call in the code.)

8 Measuring time between two events
cudaEventElapsedTime(&time, event1, event2)
will return (through the pointer argument) the time elapsed between the two events, in milliseconds, with a resolution of around 0.5 microseconds. Timing is measured using the GPU clock.

9 Timing GPU Execution with CUDA events
Code:

cudaEvent_t start, stop;
float elapsedTime;

cudaEventCreate(&start);                    // create event objects
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                  // record start event

// ... the time period being measured, e.g. a kernel launch ...

cudaEventRecord(stop, 0);                   // record end event
cudaEventSynchronize(stop);                 // wait for all device work to complete

cudaEventElapsedTime(&elapsedTime, start, stop);   // time between events, in ms

cudaEventDestroy(start);                    // destroy start event
cudaEventDestroy(stop);                     // destroy stop event

10 Issues to watch for
The first kernel launch will be more time consuming than subsequent kernel executions, because of code being transferred to the GPU.
Asynchronous CUDA routines returning before they are complete – a big issue.

11 Asynchronous and synchronous calls
Kernels:
Kernel starts after all previous CUDA calls have completed.
Control is returned to the CPU immediately (asynchronous, non-blocking).
cudaMemcpy:
Copy starts after all previous CUDA calls have completed.
Returns after the copy is complete (synchronous).
NEW: NVIDIA now says this applies only for transfers > 64 KB.
From "CUDA C Programming Guide," October 2012, page 29.

12 Asynchronous CUDA routines
Control returns to the host before the device has completed the requested task for:
Kernel launches
Memory copies between two addresses in the same device memory (device-to-device copies)
Host-to-device memory copies (<= 64 KB)
Memory copies with the Async suffix
Memory set function calls
From "CUDA C Programming Guide," October 2012, page 29.
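As an illustration of the Async-suffix copies listed above, here is a minimal sketch (the array size and stream are arbitrary choices; an Async copy only proceeds concurrently with host work when the host buffer is page-locked, e.g. allocated with cudaMallocHost):

#include <stdio.h>

int main() {
    const int N = 1 << 20;
    float *h_data, *d_data;
    cudaStream_t stream;

    cudaMallocHost((void**) &h_data, N * sizeof(float));   // page-locked host memory
    cudaMalloc((void**) &d_data, N * sizeof(float));
    cudaStreamCreate(&stream);

    // Returns control to the host immediately; the copy proceeds in the background
    cudaMemcpyAsync(d_data, h_data, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... host could do other work here ...

    cudaStreamSynchronize(stream);           // wait for the copy to finish

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}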

13 Timing within Kernel -- Using clock()
It is possible to use clock() within a kernel.
See NVIDIA CUDA C Programming Guide, page 115:
"B.10 Time Function
clock_t clock();
when executed in device code, returns the value of a per-multiprocessor counter that is incremented every clock cycle. Sampling this counter at the beginning and at the end of a kernel, taking the difference of the two samples, and recording the result per thread provides a measure for each thread of the number of clock cycles taken by the device to completely execute the thread, but not of the number of clock cycles the device actually spent executing thread instructions. The former number is greater than the latter since threads are time sliced."
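A minimal sketch of this per-thread sampling, assuming a placeholder kernel body and an example launch configuration:

#include <stdio.h>

__global__ void timedKernel(clock_t *timer) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    clock_t t0 = clock();                  // sample the per-multiprocessor counter at the start

    // ... the work to be timed by this thread goes here ...

    clock_t t1 = clock();                  // sample again at the end
    timer[tid] = t1 - t0;                  // clock cycles taken by this thread
}

int main() {
    const int T = 32, B = 1;               // example launch configuration
    clock_t *d_timer, h_timer[T * B];
    cudaMalloc((void**) &d_timer, T * B * sizeof(clock_t));
    timedKernel<<<B, T>>>(d_timer);
    cudaMemcpy(h_timer, d_timer, T * B * sizeof(clock_t), cudaMemcpyDeviceToHost);
    printf("Thread 0 took %ld cycles\n", (long) h_timer[0]);
    cudaFree(d_timer);
    return 0;
}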

14 Timing within Kernel -- Using events
It appears possible to use the event timer around kernel code.
Events can be recorded in specific "stream" objects – sequences of in-order operations acting on a data set.
Events in the default "stream 0" are completed when all preceding operations have been completed by the device.
See NVIDIA CUDA C Programming Guide, page 39, for more details on streams. (We will come back to this later.)

15 Bandwidth
Bandwidth is the rate at which data is transferred. The physical connection defines the maximum system bandwidth.

Maximum bandwidth:
S2050 (4 GPUs)                          ... GB/sec
C2050 Tesla (coit-grid06/7)             ... GB/sec
GTX ...                                 ... GB/sec
GT 320M/330M (in Mac pro laptops)       25.6 GB/sec
Pentium Core i7 with Quickpath          25.6 GB/sec
Xbox                                    6.4 GB/sec

Wikipedia: Comparison of Nvidia graphics processing units
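Rather than looking these figures up, the theoretical peak memory bandwidth of the GPU in your own machine can be computed from its device properties. A hedged sketch (the factor of 2 assumes DDR-style memory, which is the case for the boards listed above):

#include <stdio.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // properties of device 0

    // memoryClockRate is in kHz, memoryBusWidth in bits; factor of 2 for DDR
    double peakGBs = 2.0 * prop.memoryClockRate * 1.0e3
                     * (prop.memoryBusWidth / 8.0) / 1.0e9;

    printf("%s: theoretical peak memory bandwidth %.1f GB/s\n", prop.name, peakGBs);
    return 0;
}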

16 Effective Bandwidth
Effective bandwidth is the actual bandwidth achieved by a program. If we measure the effective bandwidth of a program, we can compare it to the maximum possible.

Effective bandwidth achieved by a program/kernel is given by:

Effective bandwidth = (number_Bytes / time) x 10^-9  GB/s

where:
number_Bytes is the total number of bytes read or written
time is the time period in seconds
GB/s = Gigabytes per second = 1,000,000,000 Bytes/s

Use effective bandwidth as a metric for measuring performance/optimization benefits*
* From NVIDIA CUDA C Best Practices Guide, Version 3.2, 8/20/2010
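Combined with the CUDA event timing above, the calculation is straightforward. A small self-contained sketch with made-up figures (the elapsed time is assumed to be in milliseconds, as returned by cudaEventElapsedTime):

#include <stdio.h>

// Effective bandwidth in GB/s, given the total bytes read plus written
// and the elapsed time in milliseconds.
double effectiveBandwidthGBs(double numBytes, float elapsedMs) {
    return (numBytes / (elapsedMs / 1000.0)) * 1.0e-9;
}

int main() {
    // Hypothetical example: 16,000,000 bytes moved in 1.5 ms, about 10.7 GB/s
    printf("Effective bandwidth: %.2f GB/s\n",
           effectiveBandwidthGBs(16.0e6, 1.5f));
    return 0;
}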

17 Bandwidth of Matrix Copy Operation
From NVIDIA CUDA C Best Practices Guide, Version 3.2, 8/20/2010
Copying an N x N matrix:

Effective bandwidth = ((N^2 x b x 2) / time) x 10^-9  GB/sec

where there are b bytes in each number, and the factor of 2 accounts for the two transfers (a read plus a write). Need to know the size of the variables:
Integer, int (32 bits)            b = 4 bytes
Floating point, float (32 bits)   b = 4 bytes
Double (64 bits)                  b = 8 bytes
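As a worked example (the matrix size and time are made up for illustration): copying a 1024 x 1024 matrix of floats (b = 4) in 1.0 ms transfers N^2 x b x 2 = 1024 x 1024 x 4 x 2 = 8,388,608 bytes, so the effective bandwidth is (8,388,608 / 0.001) x 10^-9, which is approximately 8.4 GB/sec.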

18 Computational Measures
The classical measure of performance in high performance computing (HPC) is the number of floating point operations per second (FLOPS).
Systems have a peak FLOPS rating. Peak single precision figures:
Tianhe-1                        2.5 PFLOPS*
Cray Jaguar                     1.75 PFLOPS
S2050 (4 GPUs)                  5152 GFLOPS
C2050 Tesla (coit-grid06/7)     1288 GFLOPS
GTX ...                         ... GFLOPS
GT 330M (in Mac pro laptops)    182 GFLOPS
Pentium Core i7                 ... GFLOPS
(PFLOPS = petaFLOPS = 10^15 FLOPS; GFLOPS = gigaFLOPS = 10^9 FLOPS)
* These numbers need checking

19 Actual FLOPS
Measured using standard benchmark programs such as LINPACK.
If we measure it on our own program, we can see how close it gets to the peak (which presumably assumes doing only floating point operations).

20 Sample partial code to measure performance on GPU

#define N 1000                     // a big number, up to INT_MAX, 2,147,483,647

__global__ void gpu_compute(float *result) {
    int i, j;
    float a = 0.0;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a = a + 0.0001;        // do something: N x N floating pt operations
    result[tid] = a;               // store result
    return;
}

int main(int argc, char *argv[]) {
    int T = 1, B = 1;              // threads per block and blocks per grid
    float cpu_result, *gpu_result, ans[T * B];   // result from gpu, to make sure computation is being done
    cudaEvent_t start, end;        // using cuda events to measure time,
    float time;                    // which is applicable for asynchronous code also

    cudaEventCreate(&start);       // instrument code to measure start time
    cudaEventCreate(&end);
    cudaEventRecord(start, 0);

    cudaMalloc((void**) &gpu_result, T * B * sizeof(float));
    gpu_compute<<<B, T>>>(gpu_result);
    cudaMemcpy(ans, gpu_result, T * B * sizeof(float), cudaMemcpyDeviceToHost);

    cudaEventRecord(end, 0);       // instrument code to measure end time
    cudaEventSynchronize(end);
    cudaEventElapsedTime(&time, start, end);

    printf("GPU, Answer thread 0, %e\n", ans[0]);
    printf("GPU Number of floating pt operations done %e\n", (double) N * N * T * B);
    printf("GPU Time using CUDA events: %f ms\n", time);   // time is in ms

    cudaEventDestroy(start);
    cudaEventDestroy(end);
    return 0;
}
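The two printed figures can be combined into a GFLOPS estimate. A possible addition at the end of main(), reusing N, T, B, and time from the code above (time is in milliseconds, so it is converted to seconds first):

// Floating point operations per second, in units of 10^9 (GFLOPS)
double gflops = ((double) N * N * T * B) / (time / 1000.0) / 1.0e9;
printf("GPU GFLOPS: %f\n", gflops);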

Questions