1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011 DeviceRoutines.pptx
Device Routines and device variables
These notes will introduce:
Declaring routines to be executed on the device and on the host
Declaring local variables on the device

2 CUDA extensions to declare kernel routines
Host = CPU, Device = GPU

__global__ indicates a routine that can only be called from the host and is executed only on the device
__device__ indicates a routine that can only be called from the device and is executed only on the device
__host__ indicates a routine that can only be called from the host and is executed only on the host (generally only used in combination with __device__, see later)

Note the two underscores on each side of each qualifier.
Note: device code cannot call a routine that executes on the host.
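A minimal sketch illustrating the three qualifiers (the function names here are illustrative, not from the slides):

// Sketch: the three function type qualifiers
__device__ int dev_double(int x) { return 2 * x; }   // callable only from device code

__host__ int host_double(int x) { return 2 * x; }    // callable only from host code
                                                     // (__host__ is the default if no qualifier is given)

__global__ void kernel(int *out) {                   // launched from the host, runs on the device
    out[threadIdx.x] = dev_double(threadIdx.x);      // device code may call __device__ routines
}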

3
#define N 10   // assumed value; N is defined in code elided from the slide
…
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main(int argc, char *argv[]) {
    int T = 10, B = 1;   // threads per block and blocks per grid
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    …
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<B, T>>>(dev_a, dev_b, dev_c);   // launch syntax lost in transcript; <<<B, T>>> reconstructed

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    …
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    cudaEventDestroy(start);   // start and stop are events created in elided timing code
    cudaEventDestroy(stop);
    return 0;
}

So far we have seen __global__:
Executed on the device, called from the host.
__global__ must have a void return type. Why?
Note: a __global__ launch is asynchronous; it returns before the kernel completes.

4 Routines to be executed on device
Generally you cannot call C library routines from device code!
However, CUDA provides math routines for the device that are equivalent to the standard C math routines with the same names, so in practice you can call math routines such as sin(x); check the CUDA docs* before use.
CUDA also provides GPU-only intrinsic routines that are faster but less accurate (their names are prefixed with __).*

* See the NVIDIA CUDA C Programming Guide for more details.
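A small illustrative kernel (a sketch, not from the slides) contrasting a standard device math routine with its faster intrinsic counterpart:

// Sketch: standard device math routine vs. GPU-only intrinsic
__global__ void sine_error(float *out, float *in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float accurate = sinf(in[tid]);    // standard single-precision math routine
        float fast = __sinf(in[tid]);      // GPU-only intrinsic: faster, less accurate
        out[tid] = accurate - fast;        // difference exposes the accuracy trade-off
    }
}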

5 __device__ routines
__device__ void swap(int *x, int *y);   // forward declaration, since swap is used before it is defined

__global__ void gpu_sort(int *a, int *b, int N) {
    …
    swap(&list[m], &list[j]);   // list, m, j come from code elided from the slide
    …
}

__device__ void swap(int *x, int *y) {
    int temp;
    temp = *x;
    *x = *y;
    *y = temp;
}

int main(int argc, char *argv[]) {
    …
    gpu_sort<<<B, T>>>(dev_a, dev_b, N);   // launch syntax lost in transcript; B, T from elided code
    …
    return 0;
}

Recursion is possible with __device__ routines, as far as I can tell.
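A sketch of a recursive __device__ routine (illustrative, not from the slides; device-side recursion requires compute capability 2.0 or higher):

// Sketch: recursion in a __device__ routine (requires compute capability >= 2.0)
__device__ int factorial(int n) {
    if (n <= 1) return 1;
    return n * factorial(n - 1);   // device-side recursive call
}

__global__ void compute_factorials(int *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = factorial(tid);
}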

6 Routines executable on both host and device
The __device__ and __host__ qualifiers can be used together.
The routine is then callable and executable on both host and device, and will be compiled for both.
This feature might be used to create code that optionally uses a GPU, or for test purposes.
Generally you will need statements that differentiate between host and device.
Note: the __global__ and __host__ qualifiers cannot be used together.

7 __CUDA_ARCH__ macro
Indicates the compute capability for which the device code is being compiled.
Can be used to create different paths through device code for different capabilities.
__CUDA_ARCH__ = 100 for 1.0 compute capability
__CUDA_ARCH__ = 110 for 1.1 compute capability
…

8 Example
__host__ __device__ void func() {
#ifdef __CUDA_ARCH__
    …   // Device code
#else
    …   // Host code
#endif
}

Could also select specific compute capabilities, as sketched below.
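A sketch of selecting a specific compute capability (the 2.0 threshold here is illustrative):

// Sketch: selecting device code paths by compute capability
__host__ __device__ void func2() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
    …   // Device code for compute capability 2.0 and above
#elif defined(__CUDA_ARCH__)
    …   // Device code for earlier compute capabilities
#else
    …   // Host code
#endif
}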

9 Declaring local variables for host and for device

10 Local variables on host
In C, the scope of a variable is the block it is declared in, which does not extend to routines called from that block.
If the scope is to include main and everything within it, including called routines, place the declaration outside main:

#include <stdio.h>   // header name lost in transcript; stdio.h assumed

int cpuA[10];
...

void clearArray() {
    for (int i = 0; i < 10; i++)
        cpuA[i] = 0;
}

void setArray(int n) {
    for (int i = 0; i < 10; i++)
        cpuA[i] = n;
}

int main(int argc, char *argv[]) {
    …
    clearArray();
    …
    setArray(N);   // N is defined in code elided from the slide
    …
    return 0;
}

11 Declaring local kernel variables
Declare the variable outside main, but use the __device__ keyword (now used as a variable type qualifier rather than a function type qualifier).
Without further qualification, the variable is in global (GPU) memory and accessible by all threads.

#include <stdio.h>   // header name lost in transcript; stdio.h assumed

__device__ int gpuA[10];   // slide had gpu_A; renamed to match its uses below
...

__global__ void clearArray() {
    for (int i = 0; i < 10; i++)
        gpuA[i] = 0;
}

int main(int argc, char *argv[]) {
    …
    clearArray<<<1, 1>>>();   // launch syntax lost in transcript; configuration assumed
    …
    setArray(N);              // setArray and N come from code elided from the slide
    …
    return 0;
}

12 Accessing kernel variables from host
Accessible from the host using cudaMemcpyToSymbol(), cudaMemcpyFromSymbol(), …, where the device variable is given as an argument:

int main(int argc, char *argv[]) {
    int cpuA[10];
    …
    cudaMemcpyFromSymbol(cpuA, gpuA, sizeof(cpuA), 0, cudaMemcpyDeviceToHost);
                         // the symbol is passed directly; string symbol names ("gpuA")
                         // were removed in later CUDA releases
    …
    return 0;
}
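A minimal round-trip sketch (assuming the gpuA variable and clearArray kernel from the previous slide):

// Sketch: copy a host array to the device variable and back
int main(int argc, char *argv[]) {
    int cpuA[10] = {0};

    cudaMemcpyToSymbol(gpuA, cpuA, sizeof(cpuA));    // host array -> device variable
    clearArray<<<1, 1>>>();                          // kernel updates gpuA in global memory
    cudaMemcpyFromSymbol(cpuA, gpuA, sizeof(cpuA));  // device variable -> host array

    return 0;
}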

13 Example of both local host and device variables

#include <stdio.h>   // header name lost in transcript; stdio.h assumed

int cpu_hist[10];              // globally accessible on cpu
                               // histogram computed on cpu

__device__ int gpu_hist[10];   // globally accessible on gpu
                               // histogram computed on gpu

void cpu_histogram(int *a, int N) {
    …
}

__global__ void gpu_histogram(int *a, int N) {
    …
}

int main(int argc, char *argv[]) {
    …
    gpu_histogram<<<B, T>>>(dev_a, N);   // launch syntax lost in transcript; B, T from elided code
    cpu_histogram(a, N);
    …
    return 0;
}
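One possible body for the elided gpu_histogram (a sketch, not the slide's implementation; atomicAdd is needed because many threads may update the same bin):

// Sketch of a possible gpu_histogram body (not from the slides)
__global__ void gpu_histogram_sketch(int *a, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        atomicAdd(&gpu_hist[a[tid] % 10], 1);   // assumes values map onto the 10 bins via modulo
}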

Questions