Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.

1 Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008

2 Previously
• CUDA Runtime Component
  – Common Component
    • Built-in vector types
    • Math functions
    • Timing
    • Textures
      – Texture fetch
      – Texture reference
      – Texture read modes
      – Normalized texture coordinates
      – Linear texture filtering

3 Today
• CUDA Runtime Component
  – Common Component
  – Device Component
  – Host Component

4 CUDA Runtime Component
• Common Component
• Device Component
• Host Component

5 Device Runtime Component
• Can only be used in device code
• Math functions
  – Faster, less accurate versions of functions from the common component
  – Prefixed with __, e.g. logf and __logf
  – Listed in Appendix B of the Programming Guide
  – To use the fast versions by default
    • Compiler option -use_fast_math
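A minimal sketch of the difference between a common-component function and its device-only fast counterpart. The kernel name compareLog and the helper maxLogDifference are hypothetical names chosen for illustration, and error checking is omitted for brevity.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Evaluates the logarithm twice per element: once with the standard
// function and once with the device-only fast intrinsic.
__global__ void compareLog(const float* in, float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = logf(in[i]);    // common-component version
        fast[i]     = __logf(in[i]);  // intrinsic: faster, less accurate
    }
}

// Hypothetical helper: returns the largest absolute difference observed.
float maxLogDifference()
{
    const int n = 8;
    float hIn[n], hAcc[n], hFast[n];
    for (int i = 0; i < n; ++i) hIn[i] = 1.0f + 10.0f * i;

    float *dIn, *dAcc, *dFast;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dAcc, n * sizeof(float));
    cudaMalloc(&dFast, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    compareLog<<<1, n>>>(dIn, dAcc, dFast, n);

    // A blocking cudaMemcpy waits for the kernel before copying.
    cudaMemcpy(hAcc, dAcc, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hFast, dFast, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dAcc);
    cudaFree(dFast);

    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
        maxDiff = std::fmax(maxDiff, std::fabs(hAcc[i] - hFast[i]));
    return maxDiff;
}

int main()
{
    printf("max |logf - __logf| = %g\n", maxLogDifference());
    return 0;
}
```

When the file is compiled with -use_fast_math, logf itself is mapped to the intrinsic, so the two results should then coincide.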

6 Device Runtime Component
• Synchronization function: __syncthreads()
  – Synchronizes all threads in a block
  – Avoids read-after-write, write-after-read and write-after-write hazards on shared memory accessed by several threads
  – Dangerous to use in conditionals
    • Code hangs or has unwanted effects unless the condition evaluates identically for every thread in the block
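The read-after-write hazard can be sketched with a block that reverses an array through shared memory. The names reverseInBlock and runReverse are hypothetical, the block size of 64 is an arbitrary assumption, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 64  // assumed block size for this sketch

// Reverses N integers in place using one thread block.
__global__ void reverseInBlock(int* data)
{
    __shared__ int tile[N];
    int t = threadIdx.x;

    tile[t] = data[t];          // each thread writes one shared element
    __syncthreads();            // every write must land before any read
    data[t] = tile[N - 1 - t];  // reads an element written by another thread
}

// Hypothetical helper: returns true if the device reversed the array.
bool runReverse()
{
    int h[N];
    for (int i = 0; i < N; ++i) h[i] = i;

    int* d = nullptr;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    reverseInBlock<<<1, N>>>(d);

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < N; ++i)
        if (h[i] != N - 1 - i) return false;
    return true;
}

int main()
{
    printf("%s\n", runReverse() ? "reversed" : "mismatch");
    return 0;
}
```

The barrier sits outside any branch, so every thread in the block reaches it; that is the safe pattern the slide's warning about conditionals points to.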

7 Device Runtime Component
• Atomic functions
  – Guaranteed to complete without interference from other threads
    • The memory address is locked for the duration of the operation
  – Supported by devices of compute capability 1.1 and higher
  – Mostly operate on integers only
  – Listed in Appendix C of the Programming Guide
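As an illustration of an integer atomic, the sketch below counts even values with atomicAdd on a single counter in global memory. The names countEven and countEvenOnDevice are hypothetical, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread that sees an even value increments the shared counter.
// Without atomicAdd, concurrent read-modify-write updates could be lost.
__global__ void countEven(const int* values, int n, int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] % 2 == 0)
        atomicAdd(counter, 1);
}

// Hypothetical helper: counts the even numbers in 0..999 on the device.
int countEvenOnDevice()
{
    const int n = 1000;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;

    int *dValues = nullptr, *dCounter = nullptr;
    cudaMalloc(&dValues, n * sizeof(int));
    cudaMalloc(&dCounter, sizeof(int));
    cudaMemcpy(dValues, h, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dCounter, 0, sizeof(int));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    countEven<<<blocks, threads>>>(dValues, n, dCounter);

    int result = 0;
    cudaMemcpy(&result, dCounter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dValues);
    cudaFree(dCounter);
    return result;
}

int main()
{
    printf("even values: %d\n", countEvenOnDevice());
    return 0;
}
```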

8 Device Runtime Component
• Warp vote functions
  – Supported by devices of compute capability 1.2 and higher
  – Check a condition on all threads in a warp
    • int __all(int predicate): true (non-zero) if predicate is true for all threads in the warp
    • int __any(int predicate): true (non-zero) if predicate is true for any thread in the warp
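A sketch of warp voting on a single warp. Note the assumption: it uses __all_sync and __any_sync, which replaced the __all and __any functions named on the slide starting with CUDA 9; the extra first argument is a mask of participating lanes. The names vote and runVote are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One warp evaluates three predicates; lane 0 stores the vote results.
__global__ void vote(int* out)
{
    const unsigned fullMask = 0xffffffffu;  // all 32 lanes take part
    int lane = threadIdx.x;

    int allNonNegative = __all_sync(fullMask, lane >= 0);  // true for every lane
    int anyIsFive      = __any_sync(fullMask, lane == 5);  // true for one lane
    int allAreFive     = __all_sync(fullMask, lane == 5);  // false for most lanes

    if (lane == 0) {
        out[0] = allNonNegative;
        out[1] = anyIsFive;
        out[2] = allAreFive;
    }
}

// Hypothetical helper: runs the vote and copies the three results back.
void runVote(int* result)
{
    int* d = nullptr;
    cudaMalloc(&d, 3 * sizeof(int));
    vote<<<1, 32>>>(d);  // exactly one warp
    cudaMemcpy(result, d, 3 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
}

int main()
{
    int r[3] = {0, 0, 0};
    runVote(r);
    printf("all(lane >= 0): %s\n", r[0] ? "true" : "false");
    printf("any(lane == 5): %s\n", r[1] ? "true" : "false");
    printf("all(lane == 5): %s\n", r[2] ? "true" : "false");
    return 0;
}
```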

9 Device Runtime Component
• Texture functions: fetching textures, or texturing
  – Texture data may be stored in linear memory or CUDA arrays
  – Texturing from linear memory
    template<class Type> Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
    float tex1Dfetch(texture<unsigned char, 1, cudaReadModeNormalizedFloat> texRef, int x);

10 Device Runtime Component
• Texture functions: fetching textures, or texturing
  – Texturing from linear memory
  – Type can be any of the supported 1-, 2- or 4-component vector types
    template<class Type> Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
    float4 tex1Dfetch(texture<uchar4, 1, cudaReadModeNormalizedFloat> texRef, int x);

11 Device Runtime Component
• Texture functions: fetching textures, or texturing
  – Texturing from linear memory
  – No addressing modes supported
  – No texture filtering supported

12 Device Runtime Component
• Texture functions: fetching textures, or texturing
  – Texturing from CUDA arrays
    template<class Type, enum cudaTextureReadMode readMode> Type tex1D(texture<Type, 1, readMode> texRef, float x);
    template<class Type, enum cudaTextureReadMode readMode> Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);
    template<class Type, enum cudaTextureReadMode readMode> Type tex3D(texture<Type, 3, readMode> texRef, float x, float y, float z);

13 Device Runtime Component
• Texture functions: fetching textures, or texturing
  – Texturing from CUDA arrays
  – Run-time attributes determine
    • Coordinate normalization
    • Addressing mode (clamp/wrap)
    • Filtering

14 CUDA Runtime Component
• Common Component
• Device Component
• Host Component

15 Host Runtime Component
• Can only be used by host functions
• Composed of 2 APIs
  – High-level CUDA runtime API, which runs on top of
  – Low-level CUDA driver API
• No mixing: an application should use either one or the other.

16 Host Runtime Component
• Each API provides functions for
  – Device management
  – Context management
  – Memory management
  – Code module management
  – Execution control
  – Texture reference management
  – OpenGL/Direct3D interoperability

17 Host Runtime Component
• The CUDA runtime API implicitly provides
  – Initialization
  – Context management
  – Module management
• The CUDA driver API does not, and is harder to program.

18 Host Runtime Component
• Recall: nvcc parses an input source file
  – Separates device and host code
  – Device code is compiled to a cubin object
  – Generated host code in C is compiled by an external tool

19 Host Runtime Component
• Generated host code
  – Is in C format
  – Includes the cubin object
• Applications may
  – Ignore the host code and run the cubin object directly using the low-level CUDA driver API
  – Link to the generated host code and launch it using the high-level CUDA runtime API

20 Host Runtime Component
• The CUDA driver API
  – Is harder to program
  – Offers greater control
  – Does not depend on C
  – Does not offer device emulation

21 Host Runtime Component
• CUDA runtime functions and other entry points are prefixed by cuda
• CUDA driver functions and other entry points are prefixed by cu

22 Host Runtime Component - detour
• Device memory is always allocated as either of
  – Linear memory
  – CUDA arrays

23 Host Runtime Component - detour
• Linear memory in device
  – Contiguous segment of memory
  – 32-bit addresses
  – Can be referenced using pointers

24 Host Runtime Component - detour
• CUDA arrays
  – “Opaque” memory layout
  – 1D/2D/3D arrays of 1-, 2- or 4-component vectors of 8/16/32-bit integers or 16/32-bit floats
    • 16-bit floats from the driver API only
  – Optimized for texture fetching
  – Accessible from kernels through texture fetches only

25 Host Runtime Component
• Both the CUDA runtime and CUDA driver APIs
  – Can access device information
  – Enable the host to read/write to linear memory/CUDA arrays
    • With support for pinned memory

26 Host Runtime Component
• Both the CUDA runtime and CUDA driver APIs
  – Can access device information
  – Enable the host to read/write to linear memory/CUDA arrays
    • With support for pinned memory
  – Provide OpenGL/Direct3D interoperability
  – Provide management for asynchronous execution

27 Host Runtime Component
• Asynchronous functions
  – Kernel launches, and some others
  – Async memory copies
  – Device-to-device memory copies
  – Memory setting
• Concurrent execution of functions is managed through streams

28 Host Runtime Component
• Streams
  – A queue of operations
  – An application may have multiple stream objects simultaneously
  – kernel<<<Dg, Db, Ns, S>>>: the fourth execution configuration argument S selects the stream
  – A kernel can be scheduled to execute on a stream
  – Some memory copy functions can also be queued on a stream

29 Host Runtime Component
• Streams
  – If no stream is specified, stream 0 is used by default.
  – Operations in a stream are executed in order
    • Previous stream operations have to end before a new one begins

30 Host Runtime Component
• CUDA runtime and driver APIs provide execution control through stream management
  – StreamQuery()
    • Is the stream free?
  – StreamSynchronize()
    • Wait for stream operations to end

31 Host Runtime Component
• CUDA runtime and driver APIs provide execution control through stream management
  – cudaThreadSynchronize() / cuCtxSynchronize()
    • Wait for all streams to be free
  – StreamDestroy()
    • Wait for the stream to get free
    • Destroy the stream
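The stream operations on the preceding slides can be sketched with the runtime API as follows. This is an illustrative outline, not the lecture's own code: the kernel scale and the helper runStreams are hypothetical names, two streams and pinned host buffers are assumptions, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: queues copy-in, kernel and copy-out on two streams
// and returns true if both results are as expected.
bool runStreams()
{
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);
    const int kStreams = 2;

    cudaStream_t streams[kStreams];
    float* h[kStreams];
    float* d[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMallocHost(&h[s], bytes);  // pinned memory, needed for async copies
        cudaMalloc(&d[s], bytes);
        for (int i = 0; i < n; ++i) h[s][i] = 1.0f;
    }

    // Within one stream the three operations run in order;
    // the two streams may overlap with each other.
    for (int s = 0; s < kStreams; ++s) {
        cudaMemcpyAsync(d[s], h[s], bytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(d[s], n, 2.0f + s);
        cudaMemcpyAsync(h[s], d[s], bytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Non-blocking query: cudaSuccess if idle, cudaErrorNotReady if busy.
    cudaError_t status = cudaStreamQuery(streams[0]);
    printf("stream 0 %s\n", status == cudaSuccess ? "is free" : "is still busy");

    bool ok = true;
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);  // wait for this stream's queue
        for (int i = 0; i < n; ++i)
            if (h[s][i] != 2.0f + s) ok = false;
        cudaStreamDestroy(streams[s]);
        cudaFreeHost(h[s]);
        cudaFree(d[s]);
    }
    return ok;
}

int main()
{
    printf("%s\n", runStreams() ? "results match" : "results differ");
    return 0;
}
```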

32 Host Runtime Component
• Accurate timing using events
  – CUevent/cudaEvent_t start, stop;
    EventCreate(&start);
    EventCreate(&stop);
  – Events have to be recorded
    EventRecord(start, 0); // asynchronous
    // stuff to time
    EventRecord(stop, 0); // asynchronous
  – Stream 0: the event is recorded after all preceding operations from all streams have completed
  – Stream N: the event is recorded after all preceding operations in stream N have completed

33 Host Runtime Component
• Accurate timing using events
  – EventRecord(start, 0); // asynchronous
    // stuff to time
    EventRecord(stop, 0); // asynchronous
    EventSynchronize(stop);
    float time;
    EventElapsedTime(&time, start, stop);
  – As the call to record is asynchronous, the event has to be synchronized before timing
  – EventDestroy(start);
    EventDestroy(stop);
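Written out against the runtime API, the timing recipe above might look like the sketch below. The kernel addOne and the helper timeKernelMs are hypothetical, the problem size is arbitrary, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns the kernel's elapsed time in milliseconds.
float timeKernelMs()
{
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               // asynchronous
    addOne<<<(n + 255) / 256, 256>>>(d, n);  // stuff to time
    cudaEventRecord(stop, 0);                // asynchronous

    // Block the host until the stop event has actually been recorded.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}

int main()
{
    printf("kernel took %.3f ms\n", timeKernelMs());
    return 0;
}
```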

34 Host Runtime Component
• Asynchronous execution can get confusing
  – Can be switched off
  – Useful for debugging
  – Set the environment variable CUDA_LAUNCH_BLOCKING to 1

35 Host Runtime Component
• Device Initialization
  – CUDA runtime API
    • Automatically with the first function call
  – CUDA driver API
    • cuInit()
    • MUST be called before calling any other API function

36 Host Runtime Component
• Device Management
  – cudaDeviceProp / CUdevice device;
  – int devCount;
    cudaGetDeviceCount(&devCount) / cuDeviceGetCount(&devCount)
  – for dev = 0 to devCount - 1 do
      cudaGetDeviceProperties / cuDeviceGet (&device, dev)
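With the runtime API, the enumeration loop above can be sketched as below; device indices start at 0. The helper name listDevices and the choice of printed fields are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints one line per device and returns the count.
int listDevices()
{
    int devCount = 0;
    cudaError_t err = cudaGetDeviceCount(&devCount);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 0;
    }

    for (int dev = 0; dev < devCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, %zu bytes of global memory\n",
               dev, prop.name, prop.major, prop.minor, prop.totalGlobalMem);
    }
    return devCount;
}

int main()
{
    int devCount = listDevices();
    if (devCount > 0)
        cudaSetDevice(0);  // explicit here; device 0 is also the default
    return 0;
}
```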

37 Host Runtime Component
• Device Management
  – cudaSetDevice()
    • Sets the device to be used
    • MUST be set before calling any __global__ function
    • Device 0 used by default

38 Host Runtime Component
• Stream Management
  – CUstream / cudaStream_t st;
  – cudaStreamCreate(&st); / cuStreamCreate(&st, 0);
  – cudaStreamDestroy(st);

39 Host Runtime Component
• Accurate timing using events
  – EventRecord(start, 0); // asynchronous
    // stuff to time
    EventRecord(stop, 0); // asynchronous
    EventSynchronize(stop);
    float time;
    EventElapsedTime(&time, start, stop);
  – As the call to record is asynchronous, the event has to be synchronized before timing
  – EventDestroy(start);
    EventDestroy(stop);

40 Host Runtime Component
• Event management
  – CUevent/cudaEvent_t start, stop;
    EventCreate(&start);
    EventCreate(&stop);
    EventRecord(start, 0); // asynchronous
    // stuff to time
    EventRecord(stop, 0); // asynchronous
    EventSynchronize(stop);
    float time;
    EventElapsedTime(&time, start, stop);
    EventDestroy(start);
    EventDestroy(stop);

41 All for today
• Next time
  – More on the host runtime APIs
    • Memory, stream, event, texture management
    • Debug mode for runtime API
    • Context, module, execution control for driver API
  – Performance & Optimization

42 See you next week!

