1
Programming with CUDA WS 08/09 Lecture 5 Thu, 6 Nov, 2008
2
Previously
–CUDA programming model
–CUDA hardware model
3
Today
–CAJ registration
–CUDA use cases
–The CUDA API
4
CAJ Registration
–Needed to get a grade for the course
–If you do not want a grade, do not register
5
Some Use Cases
A few videos:
–Neural simulation
–Geo-physical computations
–Bio-medical computations
6
Some Use Cases: gpgpu.org
–Concurrent Number Cruncher: a GPU implementation of a general sparse linear solver
–Einstein@Home, a distributed computing software
–OpenSteer, a game-like application
–Radiation therapy, deformable image registration
–Real-time Visual Tracker by Stream Processing
7
Some Use Cases
http://www.nvidia.com/object/io_1209386593154.html
–Molecular visualization and analysis
–Computational fluid dynamics
–Molecular dynamics
8
Some Use Cases
CUDA is being presented and promoted at
–Universities: talks, scholarship kits, courses
–Supercomputing conferences
  International Supercomputing Conference 2008, Dresden, Germany
  Supercomputing 2008, Texas, USA
9
The CUDA API
10
Minimal extension to C
Consists of a runtime library
–Host component: runs on the host
–Device component: runs on the device
–Common component: runs on both
  Only C functions included in this component can run on the device
11
Extensions to C
4 extensions:
–Function type qualifiers
–Variable type qualifiers
–Kernel calling directive
–5 built-in variables
12
Extensions to C
4 extensions:
–Function type qualifiers
–Variable type qualifiers
–Kernel calling directive
–5 built-in variables
13
Function type qualifiers
Specify
–Where a function executes
–Where a function can be called from
14
Function type qualifiers
__global__
–Specifies a kernel
–Callable from the host only
–Executes on the device
–Must return void
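Not from the slides — a minimal sketch of a __global__ kernel; the name, parameters, and indexing are illustrative:

  __global__ void scale(float *data, int n, float factor)   // kernels must return void
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in index variables, covered later
      if (i < n)                                      // guard against extra threads
          data[i] *= factor;
  }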
15
Function type qualifiers
__device__
–Callable from the device only
–Executes on the device
16
Function type qualifiers
__host__
–Callable from the host only
–Executes on the host
–The qualifier is optional
–Can be combined with __device__ to compile for both host and device
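Not from the slides — a sketch of a helper compiled for both sides; the function is made up for illustration:

  __host__ __device__ float squared(float x)
  {
      return x * x;   // the same source compiles for host code and for device code
  }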
17
Function type qualifiers
Table for function qualifiers (on board)
18
Function type qualifiers - caveats
–__host__ and __device__ can be used together
–__host__ and __global__ cannot be used together
–Function pointers to __device__ functions are not allowed
19
Function type qualifiers - caveats
Functions that execute on the device (__device__, __global__) cannot
–Support recursion
–Declare static variables
–Have a variable number of arguments
20
Function type qualifiers - caveats
__global__ functions
–Must return void
–Must be called using the kernel calling directive
–Are called asynchronously
  Control returns to the host immediately
–Have a parameter size limit of 256 bytes
  A device pointer takes 8 bytes
21
Extensions to C
4 extensions:
–Function type qualifiers
–Variable type qualifiers
–Kernel calling directive
–5 built-in variables
22
Variable type qualifiers
Specify
–Where a variable resides in device memory
–Lifetime of the variable
–Where the variable can be accessed from
23
Variable type qualifiers
__device__
–Resides in global memory
–Lifetime of the application
–Accessible from
  All threads in the grid
  The host
–Can be used with __constant__ or __shared__
24
Variable type qualifiers
__constant__
–Resides in constant memory
–Lifetime of the application
–Accessible from
  All threads in the grid
  The host
–Can be used with __device__
25
Variable type qualifiers
__shared__
–Resides in shared memory
–Lifetime of the block
–Accessible from
  All threads in the block
–Can be used with __device__
–Values assigned to __shared__ variables are guaranteed to be visible to other threads in the block only after a call to __syncthreads()
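Not from the slides — an illustrative sketch combining the three qualifiers (all names and sizes are made up):

  __device__   float d_scale;           // global memory, lifetime of the application
  __constant__ float c_coeffs[16];      // constant memory, read-only on the device

  __global__ void smooth(const float *in, float *out)
  {
      __shared__ float s_buf[256];      // shared memory, lifetime of the block
      s_buf[threadIdx.x] = in[threadIdx.x];
      __syncthreads();                  // writes now visible to the whole block
      out[threadIdx.x] = s_buf[threadIdx.x] * c_coeffs[0] * d_scale;
  }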
26
Variable type qualifiers - caveats
–Qualifiers are not allowed on
  Members of struct and union
  Local variables of host functions
–__device__ can be used with one other qualifier
–__shared__ and __constant__ variables are static
27
Variable type qualifiers - caveats
–__constant__ variables are read-only from device code
  Can be set from the host
–__shared__ variables cannot be initialized on declaration
–Unqualified variables in device code are created in registers
  Large structures may be placed in local memory, which is SLOW
28
Extensions to C
4 extensions:
–Function type qualifiers
–Variable type qualifiers
–Kernel calling directive
–5 built-in variables
29
Execution configuration
Mandatory for calls to __global__ functions
Specifies
–Number of threads that will execute the function
–Amount of shared memory to be allocated per block (optional)
–Stream number (optional)
30
Execution configuration
<<< Dg, Db, Ns, S >>>
–Dg
  Of type dim3
  Grid dimension
–Db
  Of type dim3
  Block dimension
–#threads/function call = (Dg.x*Dg.y) * (Db.x*Db.y*Db.z)
31
Execution configuration
<<< Dg, Db, Ns, S >>>
–Ns
  Of type size_t
  Optional, defaults to 0
  Number of bytes of shared memory dynamically allocated per block
  Accessed from inside the kernel as
    extern __shared__ float sh_data[];
32
Execution configuration
<<< Dg, Db, Ns, S >>>
–S
  Of type cudaStream_t
  Optional, defaults to 0
  Specifies the stream* on which the function should launch
  * to be covered later
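Not from the slides — a sketch of a full launch using all four configuration parameters; the kernel name, d_data, and sizes are assumed for illustration:

  dim3   Dg(16, 16);                    // grid dimensions:  16 x 16 blocks
  dim3   Db(8, 8, 4);                   // block dimensions: 8 x 8 x 4 threads
  size_t Ns = 256 * sizeof(float);      // dynamic shared memory per block

  kernel<<< Dg, Db, Ns >>>(d_data);     // stream S omitted, defaults to 0

  // inside the kernel the dynamic allocation is accessed as
  //     extern __shared__ float sh_data[];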
33
Execution configuration - caveats
–The function call fails if Dg or Db are greater than the device limit
–Shared memory is consumed by
  The execution configuration
  Function parameters
  Function/static variables
–The function call fails if Ns is greater than the device limit minus the sum of the above
34
Extensions to C
4 extensions:
–Function type qualifiers
–Variable type qualifiers
–Kernel calling directive
–5 built-in variables
35
5 built-in variables
gridDim
–Of type dim3
–Contains the grid dimensions
blockDim
–Of type dim3
–Contains the block dimensions
36
5 built-in variables
blockIdx
–Of type uint3
–Contains the block index within the grid
threadIdx
–Of type uint3
–Contains the thread index within the block
37
5 built-in variables
warpSize
–Of type int
–Contains the number of threads in a warp
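Not from the slides — a common pattern combining these variables to compute a per-thread global index (1D case, names illustrative):

  __global__ void add_one(int *data, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index (1D case)
      if (i < n)                                      // extra threads do nothing
          data[i] += 1;
  }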
38
5 built-in variables - caveats
–Cannot have pointers to these variables
–Cannot assign values to these variables
39
The NVCC compiler
–Separates device code and host code
–Compiles device code into binary, the cubin object
–Host code is compiled by some other tool, e.g. g++
40
The NVCC compiler
–Host code can use C++ syntax
  It is compiled by the external tool anyway
–Device code has to be in C
41
Synchronization & Optimization
42
Host Synchronization
All kernel launches are asynchronous
–Control returns to the host immediately
–The kernel executes after all previous CUDA calls have completed
–Host and device can run simultaneously
43
Host Synchronization
cudaMemcpy() is synchronous
–Control returns to the host after the copy completes
–The copy starts after all previous CUDA calls have completed
cudaThreadSynchronize()
–Blocks until all previous CUDA calls complete
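Not from the slides — a sketch of the resulting pattern, assuming kernel, grid, block, d_data, h_data, and nbytes are already defined:

  kernel<<< grid, block >>>(d_data);                  // asynchronous: returns at once

  cudaMemcpy(h_data, d_data, nbytes,                  // waits for the kernel to finish,
             cudaMemcpyDeviceToHost);                 // then blocks until the copy is done

  cudaThreadSynchronize();                            // or: block until the device is idle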
44
Timing
Timing a kernel execution
–Wrong: measures only the call time
  // start timer
  kernel<<< grid, block >>>( ... );
  // stop timer
–Right: also measures execution time
  // start timer
  kernel<<< grid, block >>>( ... );
  cudaThreadSynchronize();
  // stop timer
45
__syncthreads or cudaThreadSynchronize?
__syncthreads()
–Invoked from within device code
–Synchronizes all threads in a block
–Used to avoid inconsistencies in shared memory
cudaThreadSynchronize()
–Invoked from within host code
–Halts execution until the device is free
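Not from the slides — a sketch showing where each call belongs (kernel body and launch parameters are illustrative):

  __global__ void reverse(float *out, const float *in)
  {
      __shared__ float s[256];
      int t = threadIdx.x;
      s[t] = in[t];
      __syncthreads();                  // device side: wait for every thread in the block
      out[t] = s[blockDim.x - 1 - t];
  }

  // host side:
  //   reverse<<< 1, 256 >>>(d_out, d_in);
  //   cudaThreadSynchronize();         // host waits until the device is free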
46
Error reporting
All CUDA calls return an error code
–Except kernel launches
–Of type cudaError_t
cudaError_t cudaGetLastError()
–Returns the code for the last error (also reports "no error")
–Can also be used to get errors from kernel execution (check after synchronizing with the device)
47
Error reporting
char* cudaGetErrorString (cudaError_t code)
–printf ("%s\n", cudaGetErrorString (cudaGetLastError()));
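Not from the slides — a typical error-checking pattern after a kernel launch, assuming kernel, grid, block, and d_data are defined elsewhere:

  kernel<<< grid, block >>>(d_data);
  cudaThreadSynchronize();                            // let the kernel finish first

  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
      printf("CUDA error: %s\n", cudaGetErrorString(err));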
48
Device information
See deviceQuery.cu in the deviceQuery project
–cudaGetDeviceCount (int* count)
–cudaGetDeviceProperties (cudaDeviceProp* prop, int device)
–cudaSetDevice (int device_num)
  Device 0 is set by default
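Not from the slides — a sketch of how these calls are usually combined in host code:

  int count = 0;
  cudaGetDeviceCount(&count);                         // number of CUDA-capable devices

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);                  // properties of device 0
  printf("%s: %d multiprocessors\n", prop.name, prop.multiProcessorCount);

  cudaSetDevice(0);                                   // device 0 is the default anyway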
49
Memory optimizations
Host <-> Device memory transfers: slow
Device <-> Device memory transfers: faster
ipc692 bandwidthTest
–Host -> Device: 1.9 GB/s
–Device -> Host: 1.5 GB/s
–Device -> Device: 5.0 GB/s
50
Memory optimizations – pinned memory
–Pinned memory = portion of main memory that cannot be swapped out
–Allows faster memory transfers with cudaMemcpy()
51
Memory optimizations – pinned memory
ipc692 bandwidthTest
–Host -> Device: 1.9 GB/s
–Device -> Host: 1.5 GB/s
–Device -> Device: 49.7 GB/s
ipc692 bandwidthTest --memory=pinned
–Host -> Device: 2.5 GB/s
–Device -> Host: 1.9 GB/s
–Device -> Device: 49.7 GB/s
52
Memory optimizations
–cudaMallocHost() instead of malloc()
–cudaFreeHost() instead of free()
–Use with caution
  Pinning too much memory leaves little memory for the system
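Not from the slides — a sketch of the substitution, with an illustrative size:

  size_t nbytes = 1 << 20;                            // 1 MB, illustrative
  float *h_data = NULL, *d_data = NULL;

  cudaMalloc((void**)&d_data, nbytes);
  cudaMallocHost((void**)&h_data, nbytes);            // pinned host memory instead of malloc()

  cudaMemcpy(d_data, h_data, nbytes, cudaMemcpyHostToDevice);   // faster copy from pinned memory

  cudaFreeHost(h_data);                               // instead of free()
  cudaFree(d_data);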
53
All for today
Next time
–More on the CUDA API
54
See you next week!