Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008.


Previously
CUDA Runtime Component
– Common Component
  – Built-in vector types
  – Math functions
  – Timing
  – Textures
    – Texture fetch
    – Texture reference
    – Texture read modes
    – Normalized texture coordinates
    – Linear texture filtering

Today
CUDA Runtime Component
– Common Component
– Device Component
– Host Component

CUDA Runtime Component
Common Component
Device Component
Host Component

Device Runtime Component
Can only be used in device code
Math functions
– Faster, less accurate versions of functions from the common component
– Named with a leading __, e.g. logf and __logf
– Listed in Appendix B of the Programming Guide
– The compiler option -use_fast_math makes the fast versions the default
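
The trade-off between the common-component math functions and their fast device-only counterparts can be observed directly. The following is a minimal sketch, not code from the lecture: the kernel name compareLog and the input values are made up for illustration. It evaluates logf and __logf on the same inputs and reports the largest difference.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread evaluates the logarithm twice, once with
// the standard logf and once with the fast, less accurate intrinsic __logf.
__global__ void compareLog(const float *in, float *accurate, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = logf(in[i]);
        fast[i]     = __logf(in[i]);
    }
}

int main()
{
    const int n = 256;
    const size_t bytes = n * sizeof(float);
    float hIn[n], hAcc[n], hFast[n];
    for (int i = 0; i < n; ++i) hIn[i] = 1.0f + 0.5f * i;  // arbitrary test inputs

    float *dIn, *dAcc, *dFast;
    cudaMalloc((void **)&dIn, bytes);
    cudaMalloc((void **)&dAcc, bytes);
    cudaMalloc((void **)&dFast, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    compareLog<<<1, n>>>(dIn, dAcc, dFast, n);

    // Blocking copies: they wait for the kernel to finish.
    cudaMemcpy(hAcc, dAcc, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(hFast, dFast, bytes, cudaMemcpyDeviceToHost);

    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
        maxDiff = fmaxf(maxDiff, fabsf(hAcc[i] - hFast[i]));
    printf("largest |logf - __logf| over %d inputs: %g\n", n, maxDiff);

    cudaFree(dIn);
    cudaFree(dAcc);
    cudaFree(dFast);
    return 0;
}
```

If the file is compiled with -use_fast_math, logf is itself mapped to the fast version, so the reported difference should then be zero.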

Device Runtime Component
Synch function: __syncthreads()
– Synchronizes all threads in a block
– Avoids read-after-write, write-after-read and write-after-write hazards on shared memory accessed by several threads
– Dangerous to use in conditionals, unless the condition evaluates identically for every thread in the block
  – Code hangs / unwanted effects
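
A read-after-write hazard on shared memory is easiest to see when a thread reads an element written by a different thread. The sketch below assumes a single block of 64 threads; the kernel name reverseBlock and the constant BLOCK are hypothetical. It reverses an array through shared memory, and without the barrier a thread could read a slot before its owner has written it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64  // threads per block, chosen for illustration

// Hypothetical kernel: reverse BLOCK integers in place via shared memory.
__global__ void reverseBlock(int *data)
{
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;

    tile[t] = data[t];   // every thread writes its own slot

    // Barrier reached unconditionally by all threads of the block:
    // all writes to tile are complete before any thread reads from it.
    __syncthreads();

    data[t] = tile[BLOCK - 1 - t];  // read a slot written by another thread
}

int main()
{
    int h[BLOCK];
    for (int i = 0; i < BLOCK; ++i) h[i] = i;

    int *d;
    cudaMalloc((void **)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    reverseBlock<<<1, BLOCK>>>(d);

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    bool ok = true;
    for (int i = 0; i < BLOCK; ++i)
        if (h[i] != BLOCK - 1 - i) ok = false;
    printf("array reversed correctly: %s\n", ok ? "yes" : "no");
    return 0;
}
```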

Device Runtime Component
Atomic functions
– Guaranteed to execute without interference from other threads
  – Memory address is locked
– Supported by devices of compute capability 1.1 and higher
– Operate mostly on integers
– Listed in Appendix C of the Programming Guide
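
A shared counter is the typical use of an atomic function: many threads update one address, and each read-modify-write has to complete without interference. The following sketch is an illustration only; the kernel name countEven and the input data are assumptions. It counts the even values of an array with atomicAdd on a single integer in global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: threads that see an even value increment one shared
// counter. atomicAdd makes each increment an indivisible read-modify-write.
__global__ void countEven(const int *values, int n, int *evenCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] % 2 == 0)
        atomicAdd(evenCount, 1);
}

int main()
{
    const int n = 1024;
    int hValues[n];
    for (int i = 0; i < n; ++i) hValues[i] = i;

    int *dValues, *dCount;
    cudaMalloc((void **)&dValues, n * sizeof(int));
    cudaMalloc((void **)&dCount, sizeof(int));
    cudaMemcpy(dValues, hValues, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(int));  // counter starts at zero

    countEven<<<(n + 255) / 256, 256>>>(dValues, n, dCount);

    int hCount = 0;
    cudaMemcpy(&hCount, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    printf("even values counted: %d (expected %d)\n", hCount, n / 2);

    cudaFree(dValues);
    cudaFree(dCount);
    return 0;
}
```

With a plain `*evenCount += 1` instead of atomicAdd, concurrent increments could overwrite each other and the count would usually come out too low.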

Device Runtime Component
Warp vote functions
– Supported by devices of compute capability 1.2 and higher
– Check a condition on all threads in a warp
  – int __all (int predicate)
    true (non-zero) if predicate is true for all warp threads
  – int __any (int predicate)
    true (non-zero) if predicate is true for any warp thread

Device Runtime Component
Texture functions: fetching textures, or texturing
– Texture data may be stored in linear memory or CUDA arrays
– Texturing from linear memory
  template<class Type>
  Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
  float tex1Dfetch(texture<unsigned char, 1, cudaReadModeNormalizedFloat> texRef, int x);

Device Runtime Component
Texture functions: fetching textures, or texturing
– Texturing from linear memory
– Type can be any of the supported 1-, 2- or 4-component vector types
  template<class Type>
  Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
  float4 tex1Dfetch(texture<uchar4, 1, cudaReadModeNormalizedFloat> texRef, int x);

Device Runtime Component
Texture functions: fetching textures, or texturing
– Texturing from linear memory
– No addressing modes supported
– No texture filtering supported

Device Runtime Component
Texture functions: fetching textures, or texturing
– Texturing from CUDA arrays
  template<class Type, enum cudaTextureReadMode readMode>
  Type tex1D(texture<Type, 1, readMode> texRef, float x);
  template<class Type, enum cudaTextureReadMode readMode>
  Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);
  template<class Type, enum cudaTextureReadMode readMode>
  Type tex3D(texture<Type, 3, readMode> texRef, float x, float y, float z);

Device Runtime Component
Texture functions: fetching textures, or texturing
– Texturing from CUDA arrays
– Run-time attributes determine
  – Coordinate normalization
  – Addressing mode (clamp/wrap)
  – Filtering

CUDA Runtime Component
Common Component
Device Component
Host Component

Host Runtime Component
Can only be used by host functions
Composed of 2 APIs
– High-level CUDA runtime API, which runs on top of the
– Low-level CUDA driver API
No mixing: an application should use either one or the other.

Host Runtime Component
Each API provides functions for
– Device management
– Context management
– Memory management
– Code module management
– Execution control
– Texture reference management
– OpenGL/Direct3D interoperability

Host Runtime Component
The CUDA runtime API implicitly provides
– Initialization
– Context management
– Module management
The CUDA driver API does not, and is harder to program.

Host Runtime Component
Recall: nvcc parses an input source file
– Separates device and host code
– Device code is compiled to a cubin object
– Generated host code, in C, is compiled by an external tool

Host Runtime Component
Generated host code
– Is in C format
– Includes the cubin object
Applications may
– Ignore the host code and run the cubin object directly using the low-level CUDA driver API
– Link to the generated host code and launch it using the high-level CUDA runtime API

Host Runtime Component
The CUDA driver API
– Is harder to program
– Offers greater control
– Does not depend on C
– Does not offer device emulation

Host Runtime Component
CUDA runtime functions and other entry points are prefixed by cuda
CUDA driver functions and other entry points are prefixed by cu

Host Runtime Component - detour
Device memory is always allocated as one of
– Linear memory
– CUDA arrays

Host Runtime Component - detour
Linear memory in device
– Contiguous segment of memory
– 32-bit addresses
– Can be referenced using pointers

Host Runtime Component - detour
CUDA arrays
– "Opaque" memory layout
– 1D/2D/3D arrays whose elements are 1-, 2- or 4-component vectors of 8/16/32-bit integers or 16/32-bit floats
  – 16-bit floats from the driver API only
– Optimized for texture fetching
– Accessible from kernels through texture fetches only

Host Runtime Component
Both the CUDA runtime and CUDA driver APIs
– Can access device information
– Enable the host to read/write to linear memory/CUDA arrays
  – With support for pinned memory
– Provide OpenGL/Direct3D interoperability
– Provide management for asynchronous execution

Host Runtime Component
Asynchronous functions
– Kernel launches, and some others
– Async memory copies
– Device-to-device memory copies
– Memory setting
Concurrent execution of functions is managed through streams

Host Runtime Component
Streams
– A queue of operations
– An application may have multiple stream objects simultaneously
– kernel<<<Dg, Db, Ns, S>>>
– A kernel can be scheduled to execute on a stream, selected by the last execution-configuration argument S
– Some memory copy functions can also be queued on a stream

Host Runtime Component
Streams
– If no stream is specified, stream 0 is used by default.
– Operations within a stream are executed in order
  – Previous stream operations have to end before a new one begins

Host Runtime Component
CUDA runtime and driver APIs provide execution control through stream management
– StreamQuery()
  – Is stream free?
– StreamSynchronize()
  – Wait for stream operations to end

Host Runtime Component
CUDA runtime and driver APIs provide execution control through stream management
– cudaThreadSynchronize() / cuCtxSynchronize()
  – Wait for all streams to be free
– StreamDestroy()
  – Wait for stream to get free
  – Destroy stream
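
The stream functions above fit together as in the following runtime-API sketch. It is an illustration under assumptions, not code from the lecture: the kernel scale, the buffer sizes and the use of two streams are made up. Each stream gets its own copy-kernel-copy sequence, which runs in order within the stream but may overlap with the other stream. Pinned host memory is used because asynchronous copies require it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply every element by a factor.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int nStreams = 2;
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);

    cudaStream_t streams[nStreams];
    float *hBuf[nStreams], *dBuf[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMallocHost((void **)&hBuf[s], bytes);  // pinned host memory
        cudaMalloc((void **)&dBuf[s], bytes);
        for (int i = 0; i < n; ++i) hBuf[s][i] = 1.0f;
    }

    // Queue copy-in, kernel and copy-out on each stream. The calls return
    // immediately; the fourth <<<>>> argument selects the stream.
    for (int s = 0; s < nStreams; ++s) {
        cudaMemcpyAsync(dBuf[s], hBuf[s], bytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(dBuf[s], n, 2.0f + s);
        cudaMemcpyAsync(hBuf[s], dBuf[s], bytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; ++s) {
        // cudaStreamQuery returns cudaErrorNotReady while work is pending.
        bool busy = (cudaStreamQuery(streams[s]) == cudaErrorNotReady);
        printf("stream %d %s\n", s, busy ? "still busy" : "already idle");

        cudaStreamSynchronize(streams[s]);  // wait for this stream only
        printf("stream %d result: first element = %f\n", s, hBuf[s][0]);
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFreeHost(hBuf[s]);
        cudaFree(dBuf[s]);
    }
    return 0;
}
```

To wait for all streams at once, current toolkits provide cudaDeviceSynchronize(), the newer name for the cudaThreadSynchronize() call mentioned in the lecture.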

Host Runtime Component
Accurate timing using events
– CUevent / cudaEvent_t start, stop;
  EventCreate (&start);
  EventCreate (&stop);
– Events have to be recorded
  EventRecord (start, 0); // asynchronous
  // stuff to time
  EventRecord (stop, 0); // asynchronous
– Stream 0: the event is recorded after all preceding operations from all streams
– Stream N: the event is recorded after all preceding operations in stream N

Host Runtime Component
Accurate timing using events
– EventRecord (start, 0); // asynchronous
  // stuff to time
  EventRecord (stop, 0); // asynchronous
  EventSynchronize (stop);
  float time;
  EventElapsedTime (&time, start, stop);
– As the call to record is asynchronous, the event has to be synchronized before timing
– EventDestroy (start);
  EventDestroy (stop);
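
Written out with the runtime API's cuda prefix, the timing pattern might look as follows. This is a minimal sketch: the kernel busyWork and the problem size are hypothetical and stand in for the "stuff to time".

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload to be timed.
__global__ void busyWork(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                 // asynchronous
    busyWork<<<(n + 255) / 256, 256>>>(d, n);  // asynchronous
    cudaEventRecord(stop, 0);                  // asynchronous

    // The host must wait until the stop event has actually been recorded
    // on the device before the elapsed time can be read.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // result is in milliseconds
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```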

Host Runtime Component
Asynchronous execution can get confusing
– Can be switched off
– Useful for debugging
– Set the environment variable CUDA_LAUNCH_BLOCKING to 1

Host Runtime Component
Device Initialization
– CUDA runtime API
  – Automatically with the first function call
– CUDA driver API
  – cuInit()
  – MUST be called before calling any other API function
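
With the driver API, the explicit initialization step comes first. The following sketch only initializes the driver and lists the devices it finds; it is an illustration, and the buffer size for the device name is an arbitrary choice. It needs to be linked against the driver library (typically with -lcuda).

```cuda
#include <cstdio>
#include <cuda.h>

int main()
{
    // Must precede every other driver API call; the flags argument is 0.
    if (cuInit(0) != CUDA_SUCCESS) {
        printf("cuInit failed\n");
        return 1;
    }

    int devCount = 0;
    cuDeviceGetCount(&devCount);

    // Device ordinals run from 0 to devCount - 1.
    for (int ordinal = 0; ordinal < devCount; ++ordinal) {
        CUdevice device;
        char name[256];  // arbitrary buffer size for the device name
        cuDeviceGet(&device, ordinal);
        cuDeviceGetName(name, (int)sizeof(name), device);
        printf("device %d: %s\n", ordinal, name);
    }
    return 0;
}
```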

Host Runtime Component
Device Management
– cudaDeviceProp / CUdevice device;
– int devCount;
  cudaGetDeviceCount (&devCount) / cuDeviceGetCount (&devCount)
– for dev = 0 to devCount - 1 do
  cudaGetDeviceProperties (&device, dev) / cuDeviceGet (&device, dev)

Host Runtime Component
Device Management
– cudaSetDevice()
  – Sets the device to be used
  – MUST be set before calling any __global__ function
  – Device 0 is used by default
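
With the runtime API, device enumeration and selection could be sketched as below. Choosing device 0 explicitly is only for illustration, since that is also the default.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int devCount = 0;
    cudaGetDeviceCount(&devCount);
    printf("found %d CUDA device(s)\n", devCount);

    // Device indices run from 0 to devCount - 1.
    for (int dev = 0; dev < devCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }

    // Select the device before any kernel launch or other device work.
    if (devCount > 0)
        cudaSetDevice(0);
    return 0;
}
```

The compute capability printed here is what decides whether features such as atomic functions or warp vote functions are available.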

Host Runtime Component
Stream Management
– CUstream / cudaStream_t st;
– cudaStreamCreate (&st); / cuStreamCreate (&st, 0);
– cudaStreamDestroy (st); / cuStreamDestroy (st);


Host Runtime Component
Event management
– CUevent / cudaEvent_t start, stop;
  EventCreate (&start);
  EventCreate (&stop);
  EventRecord (start, 0); // asynchronous
  // stuff to time
  EventRecord (stop, 0); // asynchronous
  EventSynchronize (stop);
  float time;
  EventElapsedTime (&time, start, stop);
  EventDestroy (start);
  EventDestroy (stop);

All for today
Next time
– More on the host runtime APIs
  – Memory, stream, event, texture management
  – Debug mode for runtime API
  – Context, module, execution control for driver API
– Performance & Optimization

See you next week!
