Introduction to CUDA Programming
Andreas Moshovos, Winter 2009
Some slides/material from: UIUC course by Wen-Mei Hwu and David Kirk; UCSB course by Andrea Di Blas; Universitat Jena by Waqar Saleem; NVIDIA by Simon Green

Programmer's view: GPU as a co-processor
The CPU (host) and the GPU (device) each have their own memory (1GB of GPU memory on our systems)
– CPU memory bandwidth: 6.4GB/sec – 31.92GB/sec (8B per transfer)
– GPU memory bandwidth: 141GB/sec
– CPU–GPU transfers: 3GB/s – 8GB/s

Target Applications
int a[N]; // N is large
for all elements of a: compute a[i] = a[i] * fade
Lots of independent computations
– CUDA threads need not be completely independent
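As a concrete sketch of this pattern, here is the serial loop and a corresponding CUDA kernel (the kernel name fadeKernel is illustrative, not from the slides):

// Serial CPU version
for (int i = 0; i < N; i++)
    a[i] = a[i] * fade;

// CUDA version: one thread per element
__global__ void fadeKernel (int *a, int fade, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] = a[i] * fade;
}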

Programmer's View of the GPU
GPU: a compute device that:
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads

Why are threads useful? Concurrency:
– Do multiple things in parallel
– Put in more hardware to get higher performance (needs more functional units)

Why are threads useful? #2 – Tolerating stalls
Often a thread stalls, e.g., on a memory access
Multiplex other threads onto the same functional units
Get more performance at a fraction of the cost

GPU vs. CPU Threads
GPU threads are extremely lightweight: very little creation overhead, on the order of microseconds
The GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few

Execution Timeline (CPU/host on one side, GPU/device on the other):
1. Copy to GPU memory
2. Launch GPU kernel
2'. Synchronize with GPU
3. Copy from GPU memory

GBT: Grids of Blocks of Threads Why? Realities of integrated circuits: need to cluster computation and storage to achieve high speeds

Grids of Blocks of Threads: Dimension Limits
Grid of blocks: 1D or 2D
– Max x: 65535
– Max y: 65535
Block of threads: 1D, 2D, or 3D
– Max number of threads: 512
– Max x: 512
– Max y: 512
– Max z: 64
Limits apply to Compute Capability 1.0, 1.1, 1.2, and 1.3
– GTX280 = 1.3

Block and Thread IDs
Threads and blocks have IDs, so each thread can decide what data to work on
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
IDs and dimensions are easily accessible through predefined “variables”, e.g., blockDim.x and threadIdx.x
(Figure: a device running a 2D grid of blocks; each block is a 2D arrangement of threads.)
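For example, a thread processing a 2D image could derive its pixel coordinates as follows (a minimal sketch; the kernel name, img, width, and height are illustrative assumptions):

__global__ void scale2D (float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        img[y * width + x] *= 2.0f;                  // row-major addressing
}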

Thread Batching
A kernel is executed as a grid of thread blocks
– All threads share the data memory space
A thread block: threads that can cooperate with each other by:
– Synchronizing their execution, for hazard-free shared memory accesses
– Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
(Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2, each grid a 2D arrangement of thread blocks.)
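A minimal sketch of in-block cooperation (the kernel name is hypothetical, and the block size is assumed to be 64): each thread stages one element into shared memory, the block synchronizes, and then each thread reads an element written by another thread.

__global__ void reverseBlock (float *d)
{
    __shared__ float s[64];                  // per-block shared memory
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    s[t] = d[base + t];
    __syncthreads();                         // hazard-free: all writes are done
    d[base + t] = s[blockDim.x - 1 - t];     // reverse within the block
}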

Thread Coordination Overview Race-free access to data

Programmer’s view: Memory Model

Programmer's View: Memory Detail – Thread and Host
Each thread can:
– R/W per-thread registers
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory
The host can R/W: global, constant, and texture memories

Memory Model: Global, Constant, and Texture Memories
Global memory:
– Main means of communicating R/W data between host and device
– Contents visible to all threads
Texture and constant memories:
– Constants initialized by the host
– Contents visible to all threads

Memory Model Summary
Memory     Location   Cached   Access   Scope
Local      off-chip   No       R/W      thread
Shared     on-chip    N/A      R/W      all threads in a block
Global     off-chip   No       R/W      all threads + host
Constant   off-chip   Yes      RO       all threads + host
Texture    off-chip   Yes      RO       all threads + host

Execution Model: Ordering
Execution order is undefined. Do not assume or rely on:
– Block 0 executing before block 1
– Thread 10 executing before thread 20
– Or any other ordering, even if you can observe it

Execution Model Summary
Grid of blocks of threads
– 1D/2D grid of blocks
– 1D/2D/3D blocks of threads
All blocks are identical: same structure and # of threads
Block execution order is undefined
Threads in the same block: can synchronize and share data fast (shared memory)
Threads from different blocks: cannot cooperate; communication is through global memory
Threads and blocks have IDs
– Simplifies data indexing
– Can be 1D, 2D, or 3D (threads)
Blocks do not migrate: they execute on the same processor
Several blocks may run on the same processor

CUDA Software Architecture
(Figure: the software stack – libraries, e.g., fft(), on top of the runtime API, cuda…() calls, on top of the driver API, cu…() calls.)

Reasoning about CUDA Call Ordering
GPU communication happens via cuda…() calls and kernel invocations
– cudaMalloc, cudaMemcpy, …
Asynchronous from the CPU's perspective
– The CPU places a request in a “CUDA” queue
– Requests are handled in-order
Streams allow for multiple queues
– More on this much later on

CUDA API: Example
int a[N];
for (i = 0; i < N; i++) a[i] = a[i] + x;
1. Allocate CPU data structure
2. Initialize data on CPU
3. Allocate GPU data structure
4. Copy data from CPU to GPU
5. Define execution configuration
6. Run kernel
7. CPU synchronizes with GPU
8. Copy data from GPU to CPU
9. De-allocate GPU and CPU memory

1. Allocate CPU Data
float *ha;
int main (int argc, char *argv[])
{
    int N = atoi (argv[1]);
    ha = (float *) malloc (sizeof (float) * N);
    ...
}
Pinned memory allocation, via cudaMallocHost (…), results in faster CPU to/from GPU copies
– More on this later
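For reference, a minimal sketch of the pinned alternative (freed with cudaFreeHost rather than free):

float *ha;
cudaMallocHost ((void **) &ha, sizeof (float) * N);   // page-locked (pinned) host memory
...
cudaFreeHost (ha);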

2. Initialize CPU Data
float *ha;
int i;
for (i = 0; i < N; i++)
    ha[i] = i;

3. Allocate GPU Data
float *da;
cudaMalloc ((void **) &da, sizeof (float) * N);
Notice: no assignment of a return value
– NOT: da = cudaMalloc (…)
– The assignment is done internally: that's why we pass &da
Allocates space in Global Memory

GPU Memory Allocation
The host manages GPU memory allocation:
– cudaMalloc (void **ptr, size_t nbytes)
  Must explicitly cast to (void **):
  cudaMalloc ((void **) &da, sizeof (float) * N);
– cudaFree (void *ptr);
  cudaFree (da);
– cudaMemset (void *ptr, int value, size_t nbytes);
  cudaMemset (da, 0, N * sizeof (int));
Check the CUDA Reference Manual
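Putting the three calls together, a minimal allocate/clear/release sketch (assuming N is defined):

float *da;
cudaMalloc ((void **) &da, N * sizeof (float));   // reserve device global memory
cudaMemset (da, 0, N * sizeof (float));           // zero every byte (all-zero floats)
...                                               // use da in kernels
cudaFree (da);                                    // release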

4. Copy Initialized CPU Data to GPU
float *da;
float *ha;
cudaMemcpy ((void *) da,              // DESTINATION
            (void *) ha,              // SOURCE
            sizeof (float) * N,       // #bytes
            cudaMemcpyHostToDevice);  // DIRECTION

Host/Device Data Transfers
The host initiates all transfers:
cudaMemcpy (void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
Processed in-order with other CUDA requests
Note: plain cudaMemcpy blocks the CPU thread until the copy completes; the cudaMemcpyAsync variant returns control immediately
enum cudaMemcpyKind:
– cudaMemcpyHostToDevice
– cudaMemcpyDeviceToHost
– cudaMemcpyDeviceToDevice

5. Define Execution Configuration
How many blocks and threads per block?
int threads_block = 64;
int blocks = N / threads_block;
if (N % threads_block != 0) blocks += 1;
Alternatively:
blocks = (N + threads_block - 1) / threads_block;
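Worked example: with N = 1000 and threads_block = 64, blocks = (1000 + 63) / 64 = 16, so the launch creates 16 × 64 = 1024 threads; the last 24 threads fall past the end of the array and are masked off by the kernel's if (i < N) guard.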

6. Launch Kernel & 7. CPU/GPU Synchronization
Instructs the GPU to launch blocks × threads_block threads:
darradd <<< blocks, threads_block >>> (da, 10.0f, N);
cudaThreadSynchronize ();
darradd: kernel name
<<< blocks, threads_block >>>: execution configuration
– More on this soon
(da, 10.0f, N): arguments
– 256-byte limit / no variable arguments

CPU/GPU Synchronization
The CPU does not block on kernel launches
– Kernels/requests are queued and processed in-order
– Control returns to the CPU immediately
Good if there is other work to be done
– e.g., preparing for the next kernel invocation
Eventually, the CPU must know when the GPU is done
– Then it can safely copy the GPU results
cudaThreadSynchronize ()
– Blocks the CPU until all preceding cuda…() and kernel requests have completed
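A minimal sketch of overlapping CPU work with a running kernel (prepare_next_input is a hypothetical placeholder for useful CPU work):

darradd <<< blocks, threads_block >>> (da, 10.0f, N);   // returns immediately
prepare_next_input (hb);                                // CPU work overlaps GPU execution
cudaThreadSynchronize ();                               // wait for the kernel to finish
cudaMemcpy (ha, da, sizeof (float) * N, cudaMemcpyDeviceToHost);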

8. Copy Data from GPU to CPU & 9. De-allocate Memory
float *da;
float *ha;
cudaMemcpy ((void *) ha,              // DESTINATION
            (void *) da,              // SOURCE
            sizeof (float) * N,       // #bytes
            cudaMemcpyDeviceToHost);  // DIRECTION
cudaFree (da);
// display or process results here
free (ha);

The GPU Kernel
__global__ void darradd (float *da, float x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) da[i] = da[i] + x;
}
blockIdx: unique block ID
– Numerically ascending: 0, 1, …
threadIdx: unique per-block thread index
– 0, 1, …, per block
blockDim: dimensions of the block
– blockDim.x, blockDim.y, blockDim.z
– Unused dimensions default to 1

Array Index Calculation Example
int i = blockIdx.x * blockDim.x + threadIdx.x;
Assuming blockDim.x = 64:
– blockIdx.x = 0, threadIdx.x = 0 … 63  →  i = 0 … 63    (a[0] … a[63])
– blockIdx.x = 1, threadIdx.x = 0 … 63  →  i = 64 … 127  (a[64] … a[127])
– blockIdx.x = 2, threadIdx.x = 0 … 63  →  i = 128 … 191 (a[128] … a[191])
– and so on (a[255], a[256], …)

Generic Unique Thread and Block Index Calculations #1
1D grid / 1D blocks:
UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x + threadIdx.x;
1D grid / 2D blocks:
UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
1D grid / 3D blocks:
UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y * blockDim.z + threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;

Generic Unique Thread and Block Index Calculations #2
2D grid / 1D blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.x + threadIdx.x;
2D grid / 2D blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
2D grid / 3D blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.z * blockDim.y * blockDim.x + threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
UniqueThreadIndex means unique per grid.
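For example, the 2D-grid / 2D-block case inside a kernel (a minimal sketch; the kernel name and out array are illustrative):

__global__ void writeIndex (int *out)
{
    int uniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
    int uniqueThreadIndex = uniqueBlockIndex * blockDim.y * blockDim.x
                          + threadIdx.y * blockDim.x + threadIdx.x;
    out[uniqueThreadIndex] = uniqueThreadIndex;   // one slot per thread in the grid
}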

CUDA Function Declarations
                                 Executed on the:   Only callable from the:
__device__ float DeviceFunc()    device             device
__global__ void KernelFunc()     device             host
__host__ float HostFunc()        host               host
__global__ defines a kernel function
– Must return void
– Can only call __device__ functions
__device__ and __host__ can be used together
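A minimal sketch of combining the two qualifiers: one function compiled for both sides (the name clamp01 is illustrative):

__host__ __device__ float clamp01 (float v)
{
    // callable from host code and from kernels alike
    if (v < 0.0f) return 0.0f;
    if (v > 1.0f) return 1.0f;
    return v;
}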

__device__ Example
Add x to a[i] multiple times:
__device__ float addmany (float a, float b, int count)
{
    while (count--) a += b;
    return a;
}
__global__ void darradd (float *da, float x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) da[i] = addmany (da[i], x, 10);
}

Kernel and Device Function Restrictions
__device__ functions cannot have their address taken
– e.g., f = &addmany; (*f)(…);
For functions executed on the device:
– No recursion:
  darradd (…) { darradd (…); }
– No static variable declarations inside the function:
  darradd (…) { static int canthavethis; }
– No variable number of arguments
  e.g., something like printf (…)

Predefined Vector Datatypes
Can be used both in host and in device code
– [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]
– Structures accessed with .x, .y, .z, .w fields
– Default constructors, make_TYPE (…):
  float4 f4 = make_float4 (1.0f, 10.0f, 1.2f, 0.5f);
dim3
– Type built on uint3
– Used to specify dimensions
– Default value is (1, 1, 1)
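A minimal usage sketch of these types in host code:

float4 f4 = make_float4 (1.0f, 10.0f, 1.2f, 0.5f);
float sum = f4.x + f4.y + f4.z + f4.w;   // access via .x/.y/.z/.w fields

dim3 bd (64, 16);   // unspecified z defaults to 1, i.e., (64, 16, 1)
dim3 gd;            // defaults to (1, 1, 1)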

Execution Configuration
Must be specified when calling a __global__ function:
kernel <<< Dg, Db, Ns, S >>> (…)
where:
– dim3 Dg: grid dimensions in blocks
– dim3 Db: block dimensions in threads
– size_t Ns: per-block additional number of shared memory bytes to allocate
  optional, defaults to 0 – more on this much later on
– cudaStream_t S: the request stream (queue)
  optional, defaults to 0. Compute capability >= 1.1

Built-in Variables
dim3 gridDim
– Number of blocks per grid, in 2D (.z always 1)
uint3 blockIdx
– Block ID, in 2D (blockIdx.z always 0)
dim3 blockDim
– Number of threads per block, in 3D
uint3 threadIdx
– Thread ID within the block, in 3D

Execution Configuration Examples
1D grid / 1D blocks:
dim3 gd (1024);
dim3 bd (64);
akernel <<< gd, bd >>> (...);
gridDim.x = 1024, gridDim.y = 1
blockDim.x = 64, blockDim.y = 1, blockDim.z = 1
2D grid / 3D blocks:
dim3 gd (4, 128);
dim3 bd (64, 16, 4);
akernel <<< gd, bd >>> (...);
gridDim.x = 4, gridDim.y = 128
blockDim.x = 64, blockDim.y = 16, blockDim.z = 4

Error Handling
Most cuda…() functions return a cudaError_t
– If cudaSuccess: the request completed without a problem
cudaGetLastError ():
– Returns the last error to the CPU
– Use with cudaThreadSynchronize ():
cudaError_t code;
cudaThreadSynchronize ();
code = cudaGetLastError ();
const char *cudaGetErrorString (cudaError_t code);
– Returns a human-readable description of the error code

Error Handling Utility Function
void cudaDie (const char *msg)
{
    cudaError_t err;
    cudaThreadSynchronize ();
    err = cudaGetLastError ();
    if (err == cudaSuccess) return;
    fprintf (stderr, "CUDA error: %s: %s.\n", msg, cudaGetErrorString (err));
    exit (EXIT_FAILURE);
}

Error Handling Macros
CUDA_SAFE_CALL (some cuda call), e.g.:
CUDA_SAFE_CALL (cudaMemcpy (a_h, a_d, arr_size, cudaMemcpyDeviceToHost));
Active only when _DEBUG is defined (#define _DEBUG)
– No code emitted when undefined: performance
Use make dbg=1 under NVIDIA_CUDA_SDK
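A minimal sketch of what such a macro can look like (simplified; the SDK's actual cutil.h version differs in detail):

#define CUDA_SAFE_CALL(call) do {                               \
    cudaError_t err = (call);                                   \
    if (err != cudaSuccess) {                                   \
        fprintf (stderr, "CUDA error at %s:%d: %s.\n",          \
                 __FILE__, __LINE__, cudaGetErrorString (err)); \
        exit (EXIT_FAILURE);                                    \
    }                                                           \
} while (0)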

Measuring Time -- gettimeofday
Unix-based:
#include <sys/time.h>
struct timeval start, end;
gettimeofday (&start, NULL);
// WHAT WE ARE INTERESTED IN
gettimeofday (&end, NULL);
timeCpu = (float) (end.tv_sec - start.tv_sec);
if (end.tv_usec < start.tv_usec) {
    timeCpu -= 1.0;
    timeCpu += (double) (1000000 + end.tv_usec - start.tv_usec) / 1000000.0;
} else
    timeCpu += (double) (end.tv_usec - start.tv_usec) / 1000000.0;

Using CUDA clock ()
clock_t clock ();
Can be used in device code; returns a per-multiprocessor counter that is incremented every clock cycle
Sample at the beginning and the end; this is an upper bound, since threads are time-sliced:
uint start = clock ();
... compute (less than 3 sec) ...
uint end = clock ();
if (end > start) time = end - start;
else time = end + (0xffffffff - start);
Look at the clock example under projects in the SDK
Using it takes some effort:
– Every thread measures start and end
– Then you must find the min start and the max end
– Accurate
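A minimal per-block sketch in the same spirit as the SDK clock example (the kernel name and the timer-array layout are assumptions): thread 0 of each block records the counter before and after the work.

__global__ void timedKernel (float *data, clock_t *timer)
{
    if (threadIdx.x == 0) timer[blockIdx.x] = clock ();              // block start
    __syncthreads ();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;                                        // the work being timed

    __syncthreads ();
    if (threadIdx.x == 0) timer[blockIdx.x + gridDim.x] = clock ();  // block end
}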

Using cutTimer…() Library Calls
#include <cutil.h>
unsigned int htimer;
cutCreateTimer (&htimer);
cudaThreadSynchronize ();
cutStartTimer (htimer);
// WHAT WE ARE INTERESTED IN
cudaThreadSynchronize ();
cutStopTimer (htimer);
printf ("time: %f\n", cutGetTimerValue (htimer));

Code Overview: Host Side
#include <cutil.h>
unsigned int htimer;
float *ha, *da;
int main (int argc, char *argv[])
{
    int N = atoi (argv[1]);
    ha = (float *) malloc (sizeof (float) * N);
    for (int i = 0; i < N; i++) ha[i] = i;
    cutCreateTimer (&htimer);
    cudaMalloc ((void **) &da, sizeof (float) * N);
    cudaMemcpy ((void *) da, (void *) ha, sizeof (float) * N, cudaMemcpyHostToDevice);
    cudaThreadSynchronize ();
    cutStartTimer (htimer);
    int threads_block = 64;
    int blocks = (N + threads_block - 1) / threads_block;
    darradd <<< blocks, threads_block >>> (da, 10.0f, N);
    cudaThreadSynchronize ();
    cutStopTimer (htimer);
    cudaMemcpy ((void *) ha, (void *) da, sizeof (float) * N, cudaMemcpyDeviceToHost);
    cudaFree (da);
    free (ha);
    printf ("processing time: %f\n", cutGetTimerValue (htimer));
}

Code Overview: Device Side
__device__ float addmany (float a, float b, int count)
{
    while (count--) a += b;
    return a;
}
__global__ void darradd (float *da, float x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) da[i] = addmany (da[i], x, 10);
}

Variable Declarations – Will revisit next time
__device__
– Stored in device memory (large, high latency, no cache)
– Allocated with cudaMalloc (__device__ qualifier implied)
– Accessible by all threads
– Lifetime: application
__constant__
– Same as __device__, but cached and read-only by the GPU
– Written by the CPU via the cudaMemcpyToSymbol (...) call
– Lifetime: application
__shared__
– Stored in on-chip shared memory (very low latency)
– Accessible by all threads in the same thread block
– Lifetime: kernel launch
Unqualified variables:
– Scalars and built-in vector types are stored in registers
– Arrays of more than 4 elements, or arrays accessed with run-time indices, are stored in device memory
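A minimal sketch of these declarations in use (all names are illustrative; assumes 64 threads per block):

__device__ float d_table[256];   // device memory, visible to all threads, application lifetime
__constant__ float c_fade;       // set from the host with:
                                 // cudaMemcpyToSymbol (c_fade, &fade, sizeof (float));

__global__ void useThem (float *out)
{
    __shared__ float s_buf[64];                      // one copy per block, kernel-launch lifetime
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unqualified scalar: lives in a register
    s_buf[threadIdx.x] = d_table[i % 256] * c_fade;
    __syncthreads ();
    out[i] = s_buf[threadIdx.x];
}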

Measurement Methodology
You will not get exactly the same time measurements every time
– Other processes running / external events (e.g., network activity)
– Cannot control: “non-determinism”
Must take sufficient samples, say 10 or more
– There is theory on what the number of samples must be
Measure the average
Will discuss this next time or will provide a handout online

Handling Large Input Data Sets – 1D Example
Recall gridDim.[xy] <= 65535
The host calls the kernel multiple times:
float *dac = da;        // starting offset for current kernel
while (n_blocks) {
    int bn = n_blocks;
    int elems;          // array elements processed in this kernel
    if (bn > 65535) bn = 65535;
    elems = bn * block_size;
    darradd <<< bn, block_size >>> (dac, 10.0f, elems);
    n_blocks -= bn;
    dac += elems;
}