
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011. DeviceRoutines.pptx: Device Routines and device variables




1 Device Routines and device variables. These notes will introduce: declaring routines that are executed on the device and on the host; declaring local variables on the device.

2 CUDA extensions to declare kernel routines. Host = CPU, Device = GPU.

__global__ indicates a routine that can only be called from the host and only executed on the device.
__device__ indicates a routine that can only be called from the device and only executed on the device.
__host__ indicates a routine that can only be called from the host and only executed on the host (generally only used in combination with __device__, see later).

Two underscores each. Note: a kernel cannot call a routine to be executed on the host.
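As a minimal sketch of the three qualifiers side by side (function and variable names here are illustrative, not from the slides):

```cuda
#include <stdio.h>

// __device__: callable only from device code, runs on the GPU
__device__ int square(int x) { return x * x; }

// __global__: a kernel; launched from the host, runs on the GPU
__global__ void squareAll(int *a, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) a[tid] = square(a[tid]);
}

// __host__: runs on the CPU (this is the default when no qualifier is given)
__host__ void report(int *a, int n) {
    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    printf("\n");
}
```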

3 So far we have seen __global__: executed on device, called from host. __global__ must have a void return type. Why? Note __global__ is asynchronous: the call returns before the kernel completes.

…
__global__ void add(int *a, int *b, int *c) {
   int tid = blockIdx.x * blockDim.x + threadIdx.x;
   if (tid < N) c[tid] = a[tid] + b[tid];
}

int main(int argc, char *argv[]) {
   int T = 10, B = 1;   // threads per block and blocks per grid
   int a[N], b[N], c[N];
   int *dev_a, *dev_b, *dev_c;
   …
   cudaMalloc((void**)&dev_a, N * sizeof(int));
   cudaMalloc((void**)&dev_b, N * sizeof(int));
   cudaMalloc((void**)&dev_c, N * sizeof(int));
   cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
   cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
   cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);
   add<<<B,T>>>(dev_a, dev_b, dev_c);
   cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);
   …
   cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
   cudaEventDestroy(start); cudaEventDestroy(stop);
   return 0;
}

4 Routines to be executed on device. Generally one cannot call C library routines from the device! However, CUDA has math routines for the device that are equivalent to the standard C math routines with the same names, so in practice one can call math routines such as sin(x). Check the CUDA docs* before use. CUDA also has GPU-only routines that are faster but less accurate (their names carry a __ prefix).*
* See the NVIDIA CUDA C Programming Guide for more details.
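As a sketch of the two families side by side (check the CUDA C Programming Guide for the exact accuracy bounds of each intrinsic):

```cuda
__global__ void sines(float *x, float *slow, float *fast, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        slow[tid] = sinf(x[tid]);   // device version of the standard C routine
        fast[tid] = __sinf(x[tid]); // GPU-only intrinsic: faster, less accurate
    }
}
```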

5 __device__ routines

__device__ void swap(int *x, int *y) {
   int temp = *x;
   *x = *y;
   *y = temp;
}

__global__ void gpu_sort(int *a, int *b, int N) {
   …
   swap(&list[m], &list[j]);
   …
}

int main(int argc, char *argv[]) {
   …
   gpu_sort<<<B,T>>>(dev_a, dev_b, N);
   …
   return 0;
}

Recursion is possible with __device__ routines, so far as I can tell.

6 Routines executable on both host and device. The __device__ and __host__ qualifiers can be used together. The routine is then callable and executable on both host and device, and will be compiled for both. This feature might be used to create code that optionally uses a GPU, or for test purposes. Generally one will need statements that differentiate between host and device.
Note: the __global__ and __host__ qualifiers cannot be used together.

7 __CUDA_ARCH__ macro. Indicates the compute capability of the GPU code being compiled. Can be used to create different paths through device code for different capabilities.
__CUDA_ARCH__ = 100 for 1.0 compute capability
__CUDA_ARCH__ = 110 for 1.1 compute capability
…

8 Example

__host__ __device__ void func() {
#ifdef __CUDA_ARCH__
   … // Device code
#else
   … // Host code
#endif
}

Could also select specific compute capabilities (e.g. #if __CUDA_ARCH__ >= 200).

9 Declaring local variables for host and for device

10 Local variables on host. In C, the scope of a variable is the block it is declared in, which does not extend to routines called from that block. If the scope is to include main and everything within it, including called routines, place the declaration outside main:

#include <stdio.h>

int cpuA[10];
...
void clearArray() {
   for (int i = 0; i < 10; i++) cpuA[i] = 0;
}

void setArray(int n) {
   for (int i = 0; i < 10; i++) cpuA[i] = n;
}

int main(int argc, char *argv[]) {
   …
   clearArray();
   …
   setArray(N);
   …
   return 0;
}

11 Declaring local kernel variables. Declare the variable outside main, but use the __device__ keyword (now used as a variable type qualifier rather than a function type qualifier). Without further qualification, the variable is in global (GPU) memory, accessible by all threads.

#include <stdio.h>

__device__ int gpuA[10];
...
__global__ void clearArray() {
   for (int i = 0; i < 10; i++) gpuA[i] = 0;
}

int main(int argc, char *argv[]) {
   …
   clearArray<<<1,1>>>();
   …
   setArray(N);
   …
   return 0;
}

12 Accessing kernel variables from host. Accessible by the host using cudaMemcpyToSymbol(), cudaMemcpyFromSymbol(), …, where the variable is given as an argument (older CUDA releases also accepted its name as a string):

int main(int argc, char *argv[]) {
   int cpuA[10];
   …
   cudaMemcpyFromSymbol(cpuA, gpuA, sizeof(cpuA), 0, cudaMemcpyDeviceToHost);
   …
   return 0;
}
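The opposite direction uses cudaMemcpyToSymbol(). A sketch, assuming the gpuA declaration from the previous slide:

```cuda
__device__ int gpuA[10];

int main(int argc, char *argv[]) {
    int cpuA[10] = {0};
    // host array -> device global variable
    cudaMemcpyToSymbol(gpuA, cpuA, sizeof(cpuA), 0, cudaMemcpyHostToDevice);
    // device global variable -> host array
    cudaMemcpyFromSymbol(cpuA, gpuA, sizeof(cpuA), 0, cudaMemcpyDeviceToHost);
    return 0;
}
```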

13 Example of both local host and device variables

#include <stdio.h>

int cpu_hist[10];            // globally accessible on cpu
                             // histogram computed on cpu
__device__ int gpu_hist[10]; // globally accessible on gpu
                             // histogram computed on gpu

void cpu_histogram(int *a, int N) { … }

__global__ void gpu_histogram(int *a, int N) { … }

int main(int argc, char *argv[]) {
   …
   gpu_histogram<<<B,T>>>(dev_a, N);
   cpu_histogram(a, N);
   …
   return 0;
}

14 Questions

