1
Programming with CUDA, WS 08/09
Lecture 5
Thu, 6 Nov 2008
2
Previously
CUDA programming model
CUDA hardware model
3
Today
CAJ registration
CUDA use cases
The CUDA API
4
CAJ Registration
Needed to get a grade for the course
No grade wanted: do not register
5
Some Use Cases
A few videos
  –Neural simulation
  –Geo-physical computations
  –Bio-medical computations
6
Some Use Cases: gpgpu.org
  –Concurrent Number Cruncher: a GPU implementation of a general sparse linear solver
  –Einstein@Home, a distributed computing software
  –OpenSteer, a game-like application
  –Radiation therapy, deformable image registration
  –Real-time Visual Tracker by Stream Processing
7
Some Use Cases
http://www.nvidia.com/object/io_1209386593154.html
  –Molecular visualization and analysis
  –Computational fluid dynamics
  –Molecular dynamics
8
Some Use Cases
CUDA is being presented and promoted at
  –Universities: talks, scholarship kits, courses
  –Supercomputing conferences
    International Supercomputing Conference 2008, Dresden, Germany
    Supercomputing 2008, Texas, USA
9
The CUDA API
10
Minimal extension to C
Consists of a runtime library
  –Host component: runs on host
  –Device component: runs on device
  –Common component: runs on both
Only the C functions included in the common component can run on the device
11
Extensions to C
4 extensions
  –Function type qualifiers
  –Variable type qualifiers
  –Kernel calling directive
  –5 built-in variables
13
Function type qualifiers
Specify
  –Where a function executes
  –Where a function can be called from
14
Function type qualifiers
__global__
  –Specifies a kernel
  –Callable from host only
  –Executes on the device
  –Must return void
15
Function type qualifiers
__device__
  –Callable from device only
  –Executes on device
16
Function type qualifiers
__host__
  –Callable from host only
  –Executes on host
  –The qualifier is optional
  –Can be combined with __device__ to compile for both host and device
17
Function type qualifiers
Table for function qualifiers (on board)
18
Function type qualifiers - caveats
__host__ and __device__ can be used together
__host__ and __global__ cannot be used together
Function pointers to __device__ functions are not allowed
19
Function type qualifiers - caveats
Functions that execute on the device (__device__, __global__) cannot
  –Support recursion
  –Declare static variables
  –Have a variable number of arguments
20
Function type qualifiers - caveats
__global__ functions
  –Must return void
  –Must be called using the kernel calling directive
  –Are called asynchronously
    Control returns to the host immediately
  –Have a parameter size limit of 256 bytes
    Device pointer takes 8 bytes
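The three function qualifiers can be seen side by side in a small program. The following is a minimal sketch, not code from the lecture: the function names (square, squarePlusOne, fill) and the array size are made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compiled for both host and device, so either side may call it.
__host__ __device__ float square(float x) { return x * x; }

// Device-only helper: callable from kernels and other device functions.
__device__ float squarePlusOne(float x) { return square(x) + 1.0f; }

// Kernel: executes on the device, is launched from the host, returns void.
__global__ void fill(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = squarePlusOne((float)i);
}

int main() {
    const int n = 8;
    float host[n];
    float *dev = NULL;
    cudaMalloc((void **)&dev, n * sizeof(float));

    fill<<<1, n>>>(dev, n);  // asynchronous: control returns immediately

    // cudaMemcpy starts only after the kernel has finished.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("%d: device %.1f, host %.1f\n", i, host[i],
               square((float)i) + 1.0f);  // same function, host side

    cudaFree(dev);
    return 0;
}
```

Calling squarePlusOne from main would be rejected by the compiler, since a __device__ function is not callable from the host.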
21
Extensions to C
4 extensions
  –Function type qualifiers
  –Variable type qualifiers
  –Kernel calling directive
  –5 built-in variables
22
Variable type qualifiers
Specify
  –Where a variable resides in device memory
  –Lifetime of the variable
  –Where the variable can be accessed from
23
Variable type qualifiers
__device__
  –Resides in global memory
  –Lifetime of the application
  –Accessible from
    All threads in the grid
    Host
  –Can be used with __constant__ or __shared__
24
Variable type qualifiers
__constant__
  –Resides in constant memory
  –Lifetime of the application
  –Accessible from
    All threads in the grid
    Host
  –Can be used with __device__
25
Variable type qualifiers
__shared__
  –Resides in shared memory
  –Lifetime of the block
  –Accessible from
    All threads in the block
  –Can be used with __device__
  –Values assigned to __shared__ variables are guaranteed to be visible to other threads in the block only after a call to __syncthreads()
26
Variable type qualifiers - caveats
Qualifiers not allowed on
  –Members of struct and union
  –Local variables of host functions
__device__ can be used with one other qualifier
__shared__ and __constant__ variables have implied static storage
27
Variable type qualifiers - caveats
__constant__ variables are read-only from device code
  –Can be set from the host
__shared__ variables cannot be initialized on declaration
Unqualified variables in device code are created in registers
  –Large structures may be placed in local memory, which is slow
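A short sketch may help tie the three variable qualifiers together. Everything named here (scale, result, tile, reverseScaled, N) is hypothetical and chosen only for illustration; the runtime calls cudaMemcpyToSymbol and cudaMemcpyFromSymbol are one way for the host to reach __constant__ and __device__ variables.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 64

__constant__ float scale;    // constant memory: read-only in device code
__device__ float result[N];  // global memory: lifetime of the application

__global__ void reverseScaled(const float *in) {
    __shared__ float tile[N];  // one copy per block, no initializer allowed
    int t = threadIdx.x;       // unqualified local: normally a register
    tile[t] = in[t] * scale;
    __syncthreads();           // writes to tile are now visible block-wide
    result[t] = tile[N - 1 - t];  // read a value written by another thread
}

int main() {
    float hostIn[N], hostOut[N], s = 2.0f;
    for (int i = 0; i < N; ++i) hostIn[i] = (float)i;

    float *devIn = NULL;
    cudaMalloc((void **)&devIn, sizeof(hostIn));
    cudaMemcpy(devIn, hostIn, sizeof(hostIn), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scale, &s, sizeof(s));  // host sets the constant

    reverseScaled<<<1, N>>>(devIn);
    cudaMemcpyFromSymbol(hostOut, result, sizeof(hostOut));

    // hostOut[0] holds in[N-1] * scale, i.e. 63 * 2.
    printf("hostOut[0] = %.1f\n", hostOut[0]);
    cudaFree(devIn);
    return 0;
}
```

Without the __syncthreads() call, a thread could read tile[N - 1 - t] before the thread responsible for that element has written it.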
28
Extensions to C
4 extensions
  –Function type qualifiers
  –Variable type qualifiers
  –Kernel calling directive
  –5 built-in variables
29
Execution configuration
Required for calls to __global__ functions
Specifies
  –Number of threads that will execute the function
  –Amount of shared memory to be allocated per block, optional
  –Stream number, optional
30
Execution configuration
<<< Dg, Db, Ns, S >>>
  –Dg
    Of type dim3
    Grid dimension
  –Db
    Of type dim3
    Block dimension
  –#threads/function call = (Dg.x*Dg.y) * (Db.x*Db.y*Db.z)
31
Execution configuration
<<< Dg, Db, Ns, S >>>
  –Ns
    Of type size_t
    Optional, defaults to 0
    Number of bytes of shared memory dynamically allocated to each block
    Accessed from inside the kernel as extern __shared__ float sh_data[];
32
Execution configuration
<<< Dg, Db, Ns, S >>>
  –S
    Of type cudaStream_t
    Optional, defaults to 0
    Specifies the stream* on which the function should launch
* - to be covered later
33
Execution configuration - caveats
Function call fails if Dg or Db are greater than the device limit
Shared memory is also used for
  –Execution configuration
  –Function parameters
  –Function/static variables
Function call fails if Ns is greater than the device limit minus the sum of the above
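The four parameters of the execution configuration can be illustrated with a kernel that sums each block's input in dynamically sized shared memory. This is a sketch under assumed sizes (4 blocks of 32 threads); the kernel name blockSum and all variable names other than Dg, Db, Ns and sh_data are invented for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(const float *in, float *out) {
    extern __shared__ float sh_data[];  // size is given by Ns at launch
    unsigned t = threadIdx.x;
    sh_data[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();
    if (t == 0) {  // one thread per block adds up the block's values
        float sum = 0.0f;
        for (unsigned i = 0; i < blockDim.x; ++i) sum += sh_data[i];
        out[blockIdx.x] = sum;
    }
}

int main() {
    const int blocks = 4, threads = 32, n = blocks * threads;
    dim3 Dg(blocks);                      // grid dimension
    dim3 Db(threads);                     // block dimension
    size_t Ns = threads * sizeof(float);  // dynamic shared bytes per block

    float hostIn[n], hostOut[blocks];
    for (int i = 0; i < n; ++i) hostIn[i] = 1.0f;

    float *devIn = NULL, *devOut = NULL;
    cudaMalloc((void **)&devIn, sizeof(hostIn));
    cudaMalloc((void **)&devOut, sizeof(hostOut));
    cudaMemcpy(devIn, hostIn, sizeof(hostIn), cudaMemcpyHostToDevice);

    blockSum<<<Dg, Db, Ns, 0>>>(devIn, devOut);  // S = 0: default stream

    cudaMemcpy(hostOut, devOut, sizeof(hostOut), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b)
        printf("block %d sum = %.1f\n", b, hostOut[b]);

    cudaFree(devIn);
    cudaFree(devOut);
    return 0;
}
```

Here Dg.x * Db.x = 128 threads execute the kernel; leaving out Ns and S would be equivalent to passing 0 for both, in which case sh_data would have no storage.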
34
Extensions to C
4 extensions
  –Function type qualifiers
  –Variable type qualifiers
  –Kernel calling directive
  –5 built-in variables
35
5 built-in variables
gridDim
  –Of type dim3
  –Contains grid dimensions
blockDim
  –Of type dim3
  –Contains block dimensions
36
5 built-in variables
blockIdx
  –Of type uint3
  –Contains block index in the grid
threadIdx
  –Of type uint3
  –Contains thread index in the block
37
5 built-in variables
warpSize
  –Of type int
  –Contains #threads in a warp
38
5 built-in variables - caveats
Cannot have pointers to these variables
Cannot assign values to these variables
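A common use of the built-in variables is computing a unique global index per thread. The sketch below assumes a one-dimensional grid of 3 blocks with 4 threads each; the kernel name recordIndices and the info array are hypothetical and serve only to copy the values back to the host for printing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void recordIndices(int *globalIdx, int *info) {
    // Global index from block index, block size and thread index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    globalIdx[i] = i;
    if (i == 0) {  // one thread reports the launch geometry
        info[0] = (int)gridDim.x;
        info[1] = (int)blockDim.x;
        info[2] = warpSize;
    }
}

int main() {
    const int blocks = 3, threads = 4, n = blocks * threads;
    int hostIdx[n], hostInfo[3];
    int *devIdx = NULL, *devInfo = NULL;
    cudaMalloc((void **)&devIdx, sizeof(hostIdx));
    cudaMalloc((void **)&devInfo, sizeof(hostInfo));

    recordIndices<<<blocks, threads>>>(devIdx, devInfo);

    cudaMemcpy(hostIdx, devIdx, sizeof(hostIdx), cudaMemcpyDeviceToHost);
    cudaMemcpy(hostInfo, devInfo, sizeof(hostInfo), cudaMemcpyDeviceToHost);

    printf("gridDim.x = %d, blockDim.x = %d, warpSize = %d\n",
           hostInfo[0], hostInfo[1], hostInfo[2]);
    for (int i = 0; i < n; ++i) printf("%d ", hostIdx[i]);
    printf("\n");

    cudaFree(devIdx);
    cudaFree(devInfo);
    return 0;
}
```

The variables are only read inside the kernel; assigning to them or taking their address is not permitted.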
39
The NVCC compiler
Separates device code and host code
Compiles device code into a binary cubin object
Host code is compiled by some other tool, e.g. g++
40
The NVCC compiler
Host code can be in C++ syntax
  –It is compiled by an external tool anyway
Device code has to be in C format
41
Synchronization & Optimization
42
Host Synchronization
All kernel launches are asynchronous
  –Control returns to host immediately
  –Kernel executes after all previous CUDA calls have completed
  –Host and device can run simultaneously
43
Host Synchronization
cudaMemcpy() is synchronous
  –Control returns to host after copy completes
  –Copy starts after all previous CUDA calls have completed
cudaThreadSynchronize()
  –Blocks until all previous CUDA calls complete
44
Timing
Timing a kernel execution
  –Wrong: measures only call time
      // start timer
      kernel<<<...>>>(...);
      // stop timer
  –Right: also measures execution time
      // start timer
      kernel<<<...>>>(...);
      cudaThreadSynchronize();
      // stop timer
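The difference between the two measurements can be made concrete with a host-side timer. This is a sketch, not the lecture's code: the kernel busy and the helper msBetween are hypothetical, std::chrono is an arbitrary choice of timer, and cudaDeviceSynchronize() is used because it replaced cudaThreadSynchronize() (now deprecated) in later CUDA releases.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Artificial workload so that execution time is clearly measurable.
__global__ void busy(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 10000; ++k) x = x * 0.999f + 1.0f;
        data[i] = x;
    }
}

typedef std::chrono::steady_clock Clock;

static double msBetween(Clock::time_point a, Clock::time_point b) {
    return std::chrono::duration<double, std::milli>(b - a).count();
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = n / threads;
    float *dev = NULL;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemset(dev, 0, n * sizeof(float));

    busy<<<blocks, threads>>>(dev, n);  // warm-up launch, not timed
    cudaDeviceSynchronize();

    Clock::time_point t0 = Clock::now();
    busy<<<blocks, threads>>>(dev, n);
    Clock::time_point t1 = Clock::now();  // wrong: only the launch call
    cudaDeviceSynchronize();              // wait for the kernel to finish
    Clock::time_point t2 = Clock::now();  // right: launch plus execution

    printf("launch only:        %.3f ms\n", msBetween(t0, t1));
    printf("launch + execution: %.3f ms\n", msBetween(t0, t2));

    cudaFree(dev);
    return 0;
}
```

The first number typically stays small regardless of how much work the kernel does, which is why it says nothing about execution time.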
45
__syncthreads() or cudaThreadSynchronize()?
__syncthreads()
  –Invoked from within device code
  –Synchronizes all threads in a block
  –Used to avoid inconsistencies in shared memory
cudaThreadSynchronize()
  –Invoked from within host code
  –Halts execution until device is free
46
Error reporting
All CUDA calls return an error code
  –Except kernel launches
  –cudaError_t type
cudaError_t cudaGetLastError()
  –Returns the code for the last error (also for no error)
  –Can be used to get error from kernel execution?
47
Error reporting
const char* cudaGetErrorString (cudaError_t code)
  –printf ("%s\n", cudaGetErrorString (cudaGetLastError()));
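One possible way to use these calls is sketched below. The CHECK macro and the noop kernel are hypothetical helpers, not part of the CUDA API. The launch deliberately requests far more threads per block than any device supports, so it is expected to be rejected, and cudaGetLastError() is used to pick up that failure because a kernel launch returns no error code itself.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a call fails.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

__global__ void noop() {}

int main() {
    float *dev = NULL;
    CHECK(cudaMalloc((void **)&dev, 16 * sizeof(float)));

    // Block size above the device limit: the launch should fail.
    noop<<<1, 1 << 20>>>();
    cudaError_t err = cudaGetLastError();  // also resets the error state
    printf("oversized launch: %s\n", cudaGetErrorString(err));

    // A valid launch; errors during execution surface at synchronization.
    noop<<<1, 32>>>();
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    printf("valid launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    CHECK(cudaFree(dev));
    return 0;
}
```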
48
Device information
See deviceQuery.cu in the deviceQuery project
cudaGetDeviceCount (int* count)
cudaGetDeviceProperties (cudaDeviceProp* prop, int device)
cudaSetDevice (int device_num)
  –Device 0 is set by default
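A much reduced version of a device query could look like the sketch below. It is not the deviceQuery sample itself; it prints only a handful of cudaDeviceProp fields that relate to limits mentioned earlier in the lecture.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%d CUDA device(s) found\n", count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
        printf("  global memory:         %zu bytes\n", prop.totalGlobalMem);
        printf("  shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  max threads / block:   %d\n", prop.maxThreadsPerBlock);
        printf("  warp size:             %d\n", prop.warpSize);
    }

    // Device 0 is used by default; selecting it explicitly is optional.
    if (count > 0) cudaSetDevice(0);
    return 0;
}
```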
49
Memory optimizations
Host <-> Device memory transfers are slow
Device <-> Device memory transfers are faster
ipc692 bandwidthTest
  –Host -> Device: 1.9 GB/s
  –Device -> Host: 1.5 GB/s
  –Device -> Device: 5.0 GB/s
50
Memory optimizations – pinned memory
Pinned memory = portion of main memory that cannot be swapped out
Allows faster host <-> device transfers with cudaMemcpy()
51
Memory optimizations – pinned memory
ipc692 bandwidthTest
  –Host -> Device: 1.9 GB/s
  –Device -> Host: 1.5 GB/s
  –Device -> Device: 49.7 GB/s
ipc692 bandwidthTest --memory=pinned
  –Host -> Device: 2.5 GB/s
  –Device -> Host: 1.9 GB/s
  –Device -> Device: 49.7 GB/s
52
Memory optimizations
cudaMallocHost() instead of malloc()
cudaFreeHost() instead of free()
Use with caution
  –Pinning too much memory leaves little memory for the system
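The effect of pinned memory can be checked with a small comparison of host-to-device copies. This is a rough sketch rather than a replacement for bandwidthTest: the 64 MB buffer size, the helper copyGBps, and the use of std::chrono are assumptions made for the example, and the measured numbers depend on the machine.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Time one host-to-device copy and return the bandwidth in GB/s.
static double copyGBps(void *dev, const void *host, size_t bytes) {
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();  // make sure the transfer has completed
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return (double)bytes / seconds / 1e9;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MB, an arbitrary test size

    void *pageable = malloc(bytes);  // ordinary, swappable host memory
    void *pinned = NULL;
    void *dev = NULL;
    if (pageable == NULL ||
        cudaMallocHost(&pinned, bytes) != cudaSuccess ||  // pinned memory
        cudaMalloc(&dev, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(pageable, 1, bytes);
    memset(pinned, 1, bytes);

    copyGBps(dev, pageable, bytes);  // warm-up copy, not reported

    printf("pageable host -> device: %.2f GB/s\n",
           copyGBps(dev, pageable, bytes));
    printf("pinned   host -> device: %.2f GB/s\n",
           copyGBps(dev, pinned, bytes));

    cudaFree(dev);
    cudaFreeHost(pinned);  // pinned memory must not be released with free()
    free(pageable);
    return 0;
}
```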
53
All for today
Next time
  –More on the CUDA API
54
See you next week!