CUDA Programming continued ITCS 4145/5145 Nov 24, 2010 © Barry Wilkinson Revised.

2 Timing GPU Execution

Can use CUDA "events": create two events and compute the time between them:

cudaEvent_t start, stop;
float elapsedTime;

cudaEventCreate(&start);    // create event objects
cudaEventCreate(&stop);

cudaEventRecord(start, 0);  // record start event

// ... work being timed, e.g. a kernel launch ...

cudaEventRecord(stop, 0);   // record stop event
cudaEventSynchronize(stop); // wait for work preceding cudaEventRecord(stop, 0) to complete

cudaEventElapsedTime(&elapsedTime, start, stop); // elapsed time between events, in ms

cudaEventDestroy(start);    // destroy start event
cudaEventDestroy(stop);     // destroy stop event


8 Host Synchronization

Kernel launches:
- Control returns to the CPU immediately (asynchronous, non-blocking)
- Kernel starts after all previous CUDA calls have completed

cudaMemcpy:
- Returns after the copy is complete (synchronous)
- Copy starts after all previous CUDA calls have completed
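A minimal sketch of these rules (hypothetical kernel and buffer names):

kernel<<<blocks, threads>>>(d_data); // returns to the host immediately (asynchronous)
// ... host code here runs while the kernel executes on the GPU ...
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
// starts only after the kernel has completed, and returns only when the copy is done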

9 CUDA Synchronization Routines

Host:
cudaThreadSynchronize()
Blocks until all previous CUDA calls complete.

GPU:
void __syncthreads()
Synchronizes all threads in a block. Acts as a barrier: no thread can pass until all threads in the block reach it, so every thread in the thread block must reach the __syncthreads() call.
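A minimal sketch of the barrier in use (hypothetical kernel, assuming a block size of 256): each thread loads one element into shared memory, and __syncthreads() guarantees every load has completed before any thread reads a neighbor's element.

__global__ void sumBlock(int *in, int *out)
{
    __shared__ int s[256];                      // assumes blockDim.x == 256
    int t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                            // all loads visible before any reads
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();                        // reached by all threads, as required
    }
    if (t == 0)
        out[blockIdx.x] = s[0];                 // one partial sum per block
}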

10 GPU Atomic Operations

Perform a read-modify-write atomic operation on one word residing in global or shared memory. Associative operations on signed/unsigned integers: add, sub, min, max, and, or, xor, increment, decrement, exchange, and compare-and-swap.

Requires a GPU with compute capability 1.1 or higher (shared-memory operations and 64-bit words require higher capability). coit-grid06's Tesla C2050 has compute capability 2.0. See the NVIDIA CUDA documentation for a list of GPU compute capabilities.

11 Atomic Operation Example

int atomicAdd(int* address, int val);

Reads the word old located at address address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. These three operations (read, compute, and write) are performed in one atomic transaction.* The function returns old.

* Once started, it continues to completion without being interrupted by other processors. Other processors cannot read or write the memory location once the atomic operation starts. The mechanism is implemented in hardware.
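A minimal sketch of atomicAdd in use (hypothetical kernel and names): many threads may hit the same histogram bin at once, so a plain bins[data[i]]++ would lose updates, while the atomic version does not.

__global__ void histogram(unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1);           // safe concurrent increment
}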

12 Other operations

int atomicSub(int* address, int val);
int atomicExch(int* address, int val);
int atomicMin(int* address, int val);
int atomicMax(int* address, int val);
unsigned int atomicInc(unsigned int* address, unsigned int val);
unsigned int atomicDec(unsigned int* address, unsigned int val);
int atomicCAS(int* address, int compare, int val); // compare and swap
int atomicAnd(int* address, int val);
int atomicOr(int* address, int val);
int atomicXor(int* address, int val);

Source: NVIDIA CUDA C Programming Guide, version 3.2, 11/9/2010
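Note that atomicInc and atomicDec wrap rather than count without bound: atomicInc computes ((old >= val) ? 0 : (old + 1)). A small sketch (hypothetical names) using this to cycle an index through a fixed-size buffer:

__device__ unsigned int head = 0;                // next slot to fill

__global__ void produce(int *buf, unsigned int N, int item)
{
    unsigned int slot = atomicInc(&head, N - 1); // returns old value; wraps to 0 after N-1
    buf[slot] = item;                            // slot cycles through 0 .. N-1
}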

13 Compare and Swap (also called compare and exchange)

int atomicCAS(int* address, int compare, int val);

Reads the word old located at address address in global or shared memory and compares old with compare. If they are equal, it sets old to val (stores val at address address), i.e.:

if (old == compare) old = val; // else old is unchanged

The three operations (read, compare, and write) are performed in one atomic transaction. The function returns the original value of old. Also available in unsigned int and unsigned long long int versions.

14 Coding Critical Sections with Locks

__device__ int lock = 0; // unlocked

__global__ void kernel(...) {
    ...
    do {} while (atomicCAS(&lock, 0, 1)); // spin: if lock == 0, set it to 1 and enter
    ...          // critical section
    lock = 0;    // free lock
}
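A caution not on the original slide: if every thread in a warp spins on the same lock, the thread that holds it may never be scheduled past its spinning warp-mates, and the kernel can hang (on GPUs without independent thread scheduling). A common workaround, sketched here with a hypothetical counter, is to let one thread per block contend for the lock:

__device__ int counter = 0;                    // shared result, protected by lock above

__global__ void kernel2() {
    if (threadIdx.x == 0) {                    // one thread per block contends
        while (atomicCAS(&lock, 0, 1) != 0) {} // spin until acquired
        counter = counter + 1;                 // critical section
        __threadfence();                       // flush the write before release (see slides 15-16)
        atomicExch(&lock, 0);                  // release lock
    }
}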

15 Memory Fences

Threads may see the effects of a series of writes to memory executed by another thread in different orders. To enforce ordering:

void __threadfence_block();
Waits until all global and shared memory accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block.

Other routines:
void __threadfence();
void __threadfence_system();

16 Critical Sections with Memory Operations

Writes to device memory are not guaranteed to complete in any particular order, so global writes may not have finished by the time the lock is unlocked:

__global__ void kernel(...) {
    ...
    do {} while (atomicCAS(&lock, 0, 1));
    ...                // critical section
    __threadfence();   // wait for writes to finish
    lock = 0;          // free lock
}

17 Error reporting

All CUDA calls (except kernel launches) return an error code of type cudaError_t.

cudaError_t cudaGetLastError(void)
Returns the code for the last error; can be used to retrieve errors from kernel execution.

char* cudaGetErrorString(cudaError_t code)
Returns a null-terminated character string describing the error.

Example:
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
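In practice the return codes are usually checked with a wrapper; CHECK below is a hypothetical helper macro, not part of the CUDA API:

#include <stdio.h>

#define CHECK(call) do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) \
            fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
    } while (0)

// usage (hypothetical kernel and buffers):
CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));
kernel<<<blocks, threads>>>(d_data);
CHECK(cudaGetLastError());                      // catches kernel launch errors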

Questions