Martin Kruliš 4 11. 2014 by Martin Kruliš (v1.0)1.


1 Martin Kruliš 4 11. 2014 by Martin Kruliš (v1.0)

2 • GPU
    ◦ “Independent” device
    ◦ Controlled by the host
    ◦ Used for “offloading”
  • Host Code
    ◦ Needs to be designed so that it
      ▪ Utilizes the GPU(s) efficiently
      ▪ Utilizes the CPU while the GPU is working
      ▪ Keeps the CPU and the GPU from waiting for each other

3 • Bad Example
      cudaMemcpy(..., HostToDevice);
      Kernel1<<<...>>>(...);
      cudaDeviceSynchronize();
      cudaMemcpy(..., DeviceToHost);
      ...
      cudaMemcpy(..., HostToDevice);
      Kernel2<<<...>>>(...);
      cudaDeviceSynchronize();
      cudaMemcpy(..., DeviceToHost);
      ...
  [Figure: CPU and GPU timelines; label: "Device is working"]

4 • Overlapping CPU and GPU work
    ◦ Kernels
      ▪ Are started asynchronously
      ▪ Can be waited for (cudaDeviceSynchronize())
      ▪ A little more can be done with streams
    ◦ Memory transfers
      ▪ cudaMemcpy() is synchronous and blocking
      ▪ Alternatively, cudaMemcpyAsync() starts the transfer and returns immediately
      ▪ Can be synchronized the same way as a kernel

5 • Using Asynchronous Transfers
      cudaMemcpyAsync(HostToDevice);
      Kernel1<<<...>>>(...);
      cudaMemcpyAsync(DeviceToHost);
      ...
      do_something_on_cpu();
      ...
      cudaDeviceSynchronize();
  [Figure: CPU and GPU timelines; label: "Workload balance becomes an issue"]
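The pattern on this slide can be sketched as a small function. This is an illustration under assumptions, not the presenter's code: `scaleKernel`, `doSomethingOnCpu` and `runAsyncOffload` are hypothetical names, the host buffer is pinned with `cudaMallocHost()` so the asynchronous copies can actually overlap with host work, and error handling is reduced to a boolean result.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel: doubles every element in place.
__global__ void scaleKernel(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical stand-in for useful host-side work done while the GPU is busy.
static double doSomethingOnCpu(size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += (double)i;
    return sum;
}

// Returns true if every element came back doubled.
bool runAsyncOffload(size_t n)
{
    const size_t bytes = n * sizeof(float);
    float *host = nullptr, *dev = nullptr;
    if (cudaMallocHost((void **)&host, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) { cudaFreeHost(host); return false; }
    for (size_t i = 0; i < n; ++i) host[i] = 1.0f;

    // All three commands are queued in the default stream and return immediately.
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice);
    scaleKernel<<<(unsigned)((n + 255) / 256), 256>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost);

    // The CPU works while the GPU processes the queued commands.
    volatile double cpuResult = doSomethingOnCpu(n);
    (void)cpuResult;

    // Wait for the GPU; asynchronous errors are reported here as well.
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    for (size_t i = 0; ok && i < n; ++i) ok = (host[i] == 2.0f);

    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

Calling `runAsyncOffload(1 << 16)` exercises the whole sequence; how much CPU work fits into the gap depends on the workload, which is the balance issue the slide mentions.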

6 • CPU Threads
    ◦ Multiple CPU threads may use the GPU
  • GPU Overlapping Capabilities
    ◦ Multiple kernels may run simultaneously
      ▪ Since the Fermi architecture
      ▪ cudaDeviceProp.concurrentKernels
    ◦ Kernel execution may overlap with data transfers
      ▪ Or even with multiple data transfers
      ▪ cudaDeviceProp.asyncEngineCount
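Both properties named on this slide can be read at run time. The following is a minimal sketch; the function name `printOverlapCapabilities` is hypothetical, while `cudaGetDeviceCount()`, `cudaGetDeviceProperties()` and the two `cudaDeviceProp` fields are part of the CUDA runtime API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Prints the overlap-related capabilities of every visible device.
// Returns the number of devices inspected, or -1 if the runtime reports an error.
int printOverlapCapabilities()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;
        // concurrentKernels: nonzero if several kernels may run at once.
        // asyncEngineCount: number of copy engines usable alongside kernels.
        std::printf("%s: concurrentKernels=%d asyncEngineCount=%d\n",
                    prop.name, prop.concurrentKernels, prop.asyncEngineCount);
    }
    return count;
}
```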

7 • Stream
    ◦ In-order GPU command queue (like in OpenCL)
      ▪ Asynchronous GPU operations are registered in the queue
        - Kernel execution
        - Memory data transfers
      ▪ Commands in different streams may overlap
      ▪ Provides means for explicit and implicit synchronization
    ◦ Default stream (stream 0)
      ▪ Always present, does not have to be created
      ▪ Global synchronization capabilities

8 • Stream Creation
      cudaStream_t stream;
      cudaStreamCreate(&stream);
  • Stream Usage
      cudaMemcpyAsync(dst, src, size, kind, stream);
      kernel<<<grid, block, sharedMem, stream>>>(...);
  • Stream Destruction
      cudaStreamDestroy(stream);

9 • Synchronization
    ◦ Explicit
      ▪ cudaStreamSynchronize(stream): waits until all commands issued to the stream have completed
      ▪ cudaStreamQuery(stream): a non-blocking test of whether the stream has finished
    ◦ Implicit
      ▪ Operations in different streams cannot overlap if a special operation is issued between them
        - Memory allocation
        - A CUDA command issued to the default stream
        - A switch between L1/shared memory configurations

10 • Overlapping Behavior
     ◦ Commands in different streams overlap if the hardware is capable of running them concurrently
     ◦ Unless implicit or explicit synchronization prevents it
       for (int i = 0; i < 2; ++i) {
           cudaMemcpyAsync(…HostToDevice, stream[i]);
           MyKernel<<<..., stream[i]>>>(...);
           cudaMemcpyAsync(…DeviceToHost, stream[i]);
       }
   Note: this ordering may cause many implicit synchronizations, depending on the compute capability (CC) and the hardware overlapping capabilities.

11 • Overlapping Behavior
     ◦ Commands in different streams overlap if the hardware is capable of running them concurrently
     ◦ Unless implicit or explicit synchronization prevents it
       for (int i = 0; i < 2; ++i)
           cudaMemcpyAsync(…HostToDevice, stream[i]);
       for (int i = 0; i < 2; ++i)
           MyKernel<<<..., stream[i]>>>(...);
       for (int i = 0; i < 2; ++i)
           cudaMemcpyAsync(…DeviceToHost, stream[i]);
   Note: this ordering leaves far fewer opportunities for implicit synchronization.
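A compilable version of the three-loop ordering might look as follows. It is a sketch under assumptions: `addOne`, `runTwoStreams` and the `CHECK` macro are hypothetical names, host buffers are pinned so the copies can run asynchronously, and cleanup on the error paths is omitted for brevity. Whether the streams really overlap still depends on the hardware.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: leave the function on the first failing CUDA call.
// Cleanup on error paths is intentionally omitted to keep the sketch short.
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical kernel: adds 1 to every element.
__global__ void addOne(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

bool runTwoStreams(size_t n)
{
    const int kStreams = 2;
    const size_t bytes = n * sizeof(float);
    const unsigned blocks = (unsigned)((n + 255) / 256);
    cudaStream_t stream[kStreams];
    float *host[kStreams], *dev[kStreams];

    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaStreamCreate(&stream[i]));
        CHECK(cudaMallocHost((void **)&host[i], bytes));  // pinned host buffer
        CHECK(cudaMalloc((void **)&dev[i], bytes));
        for (size_t j = 0; j < n; ++j) host[i][j] = (float)i;
    }

    // Issue the same kind of command to all streams before the next kind.
    for (int i = 0; i < kStreams; ++i)
        CHECK(cudaMemcpyAsync(dev[i], host[i], bytes, cudaMemcpyHostToDevice, stream[i]));
    for (int i = 0; i < kStreams; ++i)
        addOne<<<blocks, 256, 0, stream[i]>>>(dev[i], n);
    CHECK(cudaGetLastError());
    for (int i = 0; i < kStreams; ++i)
        CHECK(cudaMemcpyAsync(host[i], dev[i], bytes, cudaMemcpyDeviceToHost, stream[i]));

    // Explicit synchronization: wait for each stream to drain.
    for (int i = 0; i < kStreams; ++i)
        CHECK(cudaStreamSynchronize(stream[i]));

    bool ok = true;
    for (int i = 0; i < kStreams; ++i) {
        for (size_t j = 0; ok && j < n; ++j) ok = (host[i][j] == (float)i + 1.0f);
        cudaFree(dev[i]);
        cudaFreeHost(host[i]);
        cudaStreamDestroy(stream[i]);
    }
    return ok;
}
```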

12 • Callbacks
     ◦ Callbacks are registered in streams by
         cudaStreamAddCallback(stream, fnc, data, 0);
     ◦ The callback function is invoked asynchronously after all preceding commands terminate
     ◦ A callback registered to the default stream is invoked after previous commands in all streams terminate
     ◦ Operations issued after the registration start after the callback returns
     ◦ The callback looks like
         void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t errorStatus, void *userData) { ... }
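As an illustration of the registration and the callback signature, here is a minimal sketch. The names `markDone` and `runCallbackDemo` are hypothetical, and the example assumes that `cudaStreamSynchronize()` returns only after the callback has finished, since the callback is part of the stream's work. The callback itself must not call CUDA API functions.

```cuda
#include <cuda_runtime.h>
#include <atomic>

// Host callback: records whether the preceding stream work succeeded.
// It runs on a runtime-owned thread and must not call the CUDA API.
void CUDART_CB markDone(cudaStream_t stream, cudaError_t status, void *userData)
{
    (void)stream;
    std::atomic<int> *flag = static_cast<std::atomic<int> *>(userData);
    flag->store(status == cudaSuccess ? 1 : -1);
}

// Returns true if the callback ran and reported success.
bool runCallbackDemo()
{
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;

    std::atomic<int> flag(0);
    void *dev = nullptr;
    bool ok = cudaMalloc(&dev, 1024) == cudaSuccess;

    // Queue some work, then the callback that fires once that work is done.
    ok = ok && cudaMemsetAsync(dev, 0, 1024, stream) == cudaSuccess;
    ok = ok && cudaStreamAddCallback(stream, markDone, &flag, 0) == cudaSuccess;

    // Wait for the stream, including the callback, to complete.
    ok = ok && cudaStreamSynchronize(stream) == cudaSuccess;
    ok = ok && flag.load() == 1;

    cudaFree(dev);
    cudaStreamDestroy(stream);
    return ok;
}
```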

13 • Events
     ◦ Special markers that can be used for synchronization and performance monitoring
     ◦ Typical usage
       ▪ Waiting until all commands before the marker finish
       ▪ Explicit synchronization between selected streams
       ▪ Measuring the time between two events
     ◦ Example
         cudaEvent_t event;
         cudaEventCreate(&event);
         cudaEventRecord(event, stream);
         cudaEventSynchronize(event);
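The time-measuring use case can be sketched with two events recorded around a kernel launch. `busyKernel` and `timeKernelMs` are hypothetical names; `cudaEventElapsedTime()` reports the elapsed time in milliseconds.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel whose run time is being measured.
__global__ void busyKernel(float *data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

// Measures the GPU time of one kernel launch; returns false on any CUDA error.
bool timeKernelMs(size_t n, float *elapsedMs)
{
    float *dev = nullptr;
    cudaEvent_t start, stop;
    if (cudaMalloc((void **)&dev, n * sizeof(float)) != cudaSuccess) return false;
    bool ok = cudaMemset(dev, 0, n * sizeof(float)) == cudaSuccess;
    ok = ok && cudaEventCreate(&start) == cudaSuccess;
    ok = ok && cudaEventCreate(&stop) == cudaSuccess;
    if (!ok) { cudaFree(dev); return false; }

    // Both markers go into the default stream, around the kernel.
    cudaEventRecord(start, 0);
    busyKernel<<<(unsigned)((n + 255) / 256), 256>>>(dev, n);
    cudaEventRecord(stop, 0);

    // Block the host until everything before the stop marker has finished.
    ok = cudaEventSynchronize(stop) == cudaSuccess;
    ok = ok && cudaEventElapsedTime(elapsedMs, start, stop) == cudaSuccess;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return ok;
}
```

For synchronization between streams, the runtime also offers `cudaStreamWaitEvent(stream, event, 0)`, which makes later commands in `stream` wait for a recorded event without blocking the host.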

14 • Making Good Use of Overlapping
     ◦ Split the work into smaller fragments
     ◦ Create a pipeline effect (load, process, store)

15 • Data Gather and Scatter Problem
   [Figure: input data in host memory is gathered and copied to GPU memory, a kernel executes, and the results are copied back and scattered into host memory]
   Note: multiple cudaMemcpy() calls may be quite inefficient.

16 • Gather and Scatter
     ◦ Reducing overhead
     ◦ Performed by the CPU before/after cudaMemcpy
   [Figure: the main thread drives Stream 0, Stream 1, …; each stream runs gather, HtD copy, kernel, DtH copy, scatter]
   Note: the number of threads per GPU and the number of streams per thread depend on the workload structure.

17 • Page-locked (Pinned) Host Memory
     ◦ Host memory that is prevented from being swapped out
     ◦ Allocated/freed by cudaHostAlloc(), cudaFreeHost(); existing memory is pinned/unpinned by cudaHostRegister(), cudaHostUnregister()
     ◦ Optionally with flags
         cudaHostAllocWriteCombined (optimized for writing, not cached on the CPU)
         cudaHostAllocMapped
         cudaHostAllocPortable
     ◦ Copies between pinned host memory and the device are automatically performed asynchronously
     ◦ Pinned memory is a scarce resource

18 • Device Memory Mapping
     ◦ Allows the GPU to access portions of host memory directly (i.e., without explicit copy operations)
       ▪ For both reading and writing
     ◦ The memory must be allocated/registered with the flag cudaHostAllocMapped
     ◦ The context must have the cudaDeviceMapHost flag (set by cudaSetDeviceFlags())
     ◦ The function cudaHostGetDevicePointer() takes a host pointer and returns the corresponding device pointer
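The steps listed on this slide can be put together roughly as follows. This is a sketch, not a reference implementation: `incrementMapped` and `runMappedDemo` are hypothetical names, and the function assumes it is called before any other CUDA work on the device so that `cudaSetDeviceFlags()` can still take effect.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: increments host-resident data through a mapped pointer.
__global__ void incrementMapped(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns true if the kernel's writes are visible in host memory.
bool runMappedDemo(int n)
{
    // Assumption: no CUDA context is active on this device yet.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return false;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    if (!prop.canMapHostMemory) return false;

    // Pinned host allocation that is also mapped into the device address space.
    int *hostPtr = nullptr;
    if (cudaHostAlloc((void **)&hostPtr, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return false;
    for (int i = 0; i < n; ++i) hostPtr[i] = i;

    // Translate the host pointer into the pointer the kernel must use.
    int *devPtr = nullptr;
    bool ok = cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0) == cudaSuccess;

    if (ok) {
        incrementMapped<<<(n + 255) / 256, 256>>>(devPtr, n);
        // No explicit copy: synchronize, then read the results on the host.
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    for (int i = 0; ok && i < n; ++i) ok = (hostPtr[i] == i + 1);

    cudaFreeHost(hostPtr);
    return ok;
}
```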

19 • Asynchronous Errors
     ◦ An error may occur outside of a CUDA call
       ▪ In the case of asynchronous memory transfers or kernel execution
     ◦ The error is reported by the following CUDA call
     ◦ To make sure all errors have been reported, the device must be synchronized (cudaDeviceSynchronize())
     ◦ Error handling functions
       ▪ cudaGetLastError()
       ▪ cudaPeekAtLastError()
       ▪ cudaGetErrorString(error)
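One common way to apply these functions is to check once right after the launch and once after a synchronization. The sketch below assumes hypothetical helper names (`cudaOk`, `noop`, `launchAndCheck`); only the `cuda*` calls are part of the runtime API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: reports a failed CUDA call and returns false.
inline bool cudaOk(cudaError_t err, const char *what)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Hypothetical empty kernel used only to have something to launch.
__global__ void noop() {}

bool launchAndCheck()
{
    noop<<<1, 1>>>();

    // Launch-time problems (e.g., an invalid configuration) show up immediately.
    if (!cudaOk(cudaGetLastError(), "kernel launch")) return false;

    // Errors raised during execution surface only once the device synchronizes.
    return cudaOk(cudaDeviceSynchronize(), "kernel execution");
}
```

`cudaGetLastError()` also resets the stored error, whereas `cudaPeekAtLastError()` leaves it in place for a later check.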
