
1 Programming with CUDA and Parallel Algorithms
Waqar Saleem, Jens Müller (WS09)

2 Recap
- Organization
- GPGPU motivation and platforms: CUDA, Stream, OpenCL
- Simple ray tracing

3 GPU wars
- Nvidia is far more popular
- ATI reports more powerful number crunchers
(image courtesy of Udeepta Bordoloi, AMD)

4 GPU performance
(images courtesy of AnandTech, http://www.anandtech.com/video/showdoc.aspx?i=3643&p=8)

5 CUDA or Stream?
- OpenCL is vendor independent
- OpenCL drivers are provided by ATI and Nvidia for their cards
- An OpenCL-like initiative for Windows: Microsoft DirectCompute

6 Why CUDA?
- Many of the concepts from CUDA carry over almost exactly to OpenCL
- CUDA has been around since Feb 2007 and is very well documented
- The CUDA home page, http://www.nvidia.com/object/cuda_home.html, links to the programming guide, numerous university courses, multimedia presentations...
- OpenCL v1.0 was released in Nov/Dec 2008; OpenCL GPU drivers are less than 6 months old
- A lot of our material will borrow heavily from the above

7 Today
- Motivational videos
- CUDA hardware and programming models
- Threads, blocks and grids
- CUDA memory hierarchy
- Device compute capability
- Example kernel
- Thread IDs
- Memory overhead

8 Programming with CUDA
- The G80 architecture, e.g. the GeForce 8800

9 Simpler than graphics mode
(figure: G80 in graphics mode)

10 (figure only, no slide text)

11 Thinking CUDA
- Break down the problem into serial and parallel parts
- Serial parts execute on the host (few threads)
- Parallel parts execute on the device (massively parallel)

12 Program execution
- The host launches a C program
- Compute-intensive, data-parallel computations are written in special functions, kernels, in extended C
- The host launches a kernel on the compute device with a grid configuration
- The device starts threads according to the provided configuration
- All threads run in parallel on the device
- All threads execute the same kernel

13 Thread organization
- A kernel launch specifies a grid of thread blocks that execute the kernel
- A grid is a 1D or 2D array of blocks
- Each block is a 1D, 2D or 3D array of threads
- Blocks and threads have IDs
- Choose an organization to suit your problem and optimize performance (see the sketch below)
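As a quick sketch of such a configuration (the kernel name myKernel, the pointer d_data and the sizes width and height are our hypothetical examples, not from the slides), a 2D grid of 2D blocks is set up with dim3 values and passed at launch:

    // Hypothetical launch: 16x16-thread blocks tiling a width x height domain
    dim3 threadsPerBlock( 16, 16 );
    dim3 numBlocks( (width  + threadsPerBlock.x - 1) / threadsPerBlock.x,
                    (height + threadsPerBlock.y - 1) / threadsPerBlock.y );
    myKernel<<< numBlocks, threadsPerBlock >>>( d_data );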

14 Thread-specific information
- Each thread may use its block and thread IDs to access its data and make control decisions (see the sketch below)
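A minimal sketch of this pattern (the kernel scale and its arguments are our hypothetical example): each thread combines its built-in IDs into a unique global index, then uses it both to pick its data element and to guard against running off the end of the array.

    __global__ void scale ( float* data, int n, float factor ) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
        if ( i < n )              // control decision based on the thread's ID
            data[i] *= factor;    // data access based on the thread's ID
    }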

15 Thread execution
- Each thread block is assigned to a single multiprocessor (MP)
- If there are more blocks than MPs, multiple blocks are assigned to each MP
- An MP breaks its assigned block(s) into warps
- The order of execution of blocks and warps is determined by the thread scheduler
- We can thus write code independent of our device's specification
- Thread organization follows the problem rather than the hardware
- Caution: for optimum performance, the device specification needs to be considered

16 Device-independent code
- This also allows for scalable code
- Execution on more MPs is faster

17 Memory hierarchy
- Each block is mapped to an MP, and each thread in the block to a processor in the MP

18 (figure: the CUDA memory hierarchy, including texture memory and constant memory)

19 Thread communication
- Threads in a block cooperate via shared memory, barrier synchronization and atomic operations (see the sketch below)
- Threads from different blocks cannot cooperate
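A sketch of the first two mechanisms (the kernel blockSum and the 256-thread block size are our hypothetical choices): threads in a block stage data in shared memory, synchronize at a barrier so every load is visible, then cooperatively reduce it. Atomic operations, e.g. atomicAdd on integers in global memory, are the third mechanism and require compute capability 1.1 or higher.

    // Each 256-thread block sums its slice of `in` into one entry of `out`
    __global__ void blockSum ( const float* in, float* out ) {
        __shared__ float buf[256];                 // visible to the whole block
        int t = threadIdx.x;
        buf[t] = in[blockIdx.x * blockDim.x + t];  // stage in shared memory
        __syncthreads();                           // barrier: all loads done
        for ( int s = blockDim.x / 2; s > 0; s /= 2 ) {
            if ( t < s ) buf[t] += buf[t + s];     // tree reduction
            __syncthreads();
        }
        if ( t == 0 ) out[blockIdx.x] = buf[0];    // one result per block
    }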

20 Host and device communication
- The host can read/write all device memories except registers and shared memory (see the transfer sketch below)
- The device can read/write global memory: large but slow (about 600 clock cycles per access)
- The device can read texture memory: large and slow, but cached after the first read
- The device can read constant memory: small, cached, optimized for certain memory access patterns
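A minimal sketch of moving data through global memory with the CUDA runtime (the buffer names and the size N are our hypothetical example; error checking omitted):

    float* h_A = (float*) malloc( N * sizeof(float) );  // host memory
    float* d_A;
    cudaMalloc( (void**)&d_A, N * sizeof(float) );      // device global memory
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    // ... launch kernels that read and write d_A ...
    cudaMemcpy( h_A, d_A, N * sizeof(float), cudaMemcpyDeviceToHost );
    cudaFree( d_A );
    free( h_A );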

21 CUDA devices
- The compute capability of a device is a version number of the form major.minor
- The major revision number represents a fundamental change in card architecture
- The minor revision number represents incremental changes within the major revision
- CUDA-ready devices have compute capability >= 1.0; it can be queried at runtime (see the sketch below)
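As a sketch, the CUDA runtime reports a device's compute capability through cudaGetDeviceProperties (querying device 0 here is our arbitrary choice):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main () {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, 0 );  // properties of device 0
        printf( "%s: compute capability %d.%d\n",
                prop.name, prop.major, prop.minor );
        return 0;
    }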

22 Example kernel: vector addition
- Time taken by the CPU version: N * (time for 1 addition), since the N additions run one after another

23 Example kernel: vector addition
- Time taken by the GPU version: the time for 1 addition, since all N additions run in parallel, one per thread

24 Example kernel
Host code:
    for ( int i = 0; i < N; i++ )
        C[i] = A[i] + B[i];
Device kernel:
    __global__ void vAdd ( float* A, float* B, float* C ) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

25 Kernel qualifier and thread ID
- Kernels MUST be qualified as __global__
- Kernels MUST be declared void
- threadIdx is the 3-dimensional index of the thread within its block
- A thread with threadIdx (x, y, z) in a block of blockDim (Dx, Dy, Dz) has thread ID x + y*Dx + z*Dx*Dy (see the sketch below)
- For missing dimensions, the blockDim entry is 1 and the threadIdx entry is 0
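In code, the flattening formula reads as follows (the helper name flatThreadId is ours, not a CUDA built-in):

    __device__ int flatThreadId () {
        return threadIdx.x                              // x
             + threadIdx.y * blockDim.x                 // + y*Dx
             + threadIdx.z * blockDim.x * blockDim.y;   // + z*Dx*Dy
    }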

26 Kernel invocation
Host-only program:
    void main () {
        // allocate h_A, h_B, h_C, size N
        // assign values to host vectors
        for ( int i = 0; i < N; i++ )
            h_C[i] = h_A[i] + h_B[i];
        // output h_C
        // free host variables
    }
Host + device program:
    void main () {
        // allocate h_A, h_B, h_C, size N
        // assign values to host vectors
        // initialize device
        // allocate d_A, d_B, d_C, size N
        // copy h_A, h_B to d_A, d_B
        vAdd<<< 1, N >>>( d_A, d_B, d_C );  // 1 block of N threads
        // copy d_C to h_C
        // output h_C
        // free host variables
        // free device variables
    }

27 Memory overhead
- Necessary evil: the device needs data in its own memory
- The overhead is justified if the kernel is compute intensive
- With multiple kernels, memory transfers can be overlapped with computation using streams (see the sketch below)
- Bandwidth between device and host is high
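A sketch of such overlap with two streams (the kernel process, the halved buffers and the 256-thread blocks are our hypothetical setup; the host buffers must be allocated with cudaMallocHost for asynchronous copies):

    cudaStream_t s[2];
    for ( int i = 0; i < 2; i++ ) cudaStreamCreate( &s[i] );
    for ( int i = 0; i < 2; i++ ) {
        int off = i * N/2;  // each stream handles half of the data
        cudaMemcpyAsync( d_in + off, h_in + off, N/2 * sizeof(float),
                         cudaMemcpyHostToDevice, s[i] );
        process<<< (N/2)/256, 256, 0, s[i] >>>( d_in + off, d_out + off );
        cudaMemcpyAsync( h_out + off, d_out + off, N/2 * sizeof(float),
                         cudaMemcpyDeviceToHost, s[i] );
    }
    cudaThreadSynchronize();  // wait until both streams finish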

28 Host-device bandwidth
(figure: host and device)

29 Memory usage
- Host data is typically copied to/from global memory on the device
- Threads have to fetch their data from global memory (slow)
- If the data is to be operated on several times, copy it from global memory to shared memory or registers (fast)
- Write the result back to global memory at the end of the computation

30 Memory usage
    // The heavy computation (a multiply here) stands in for real work
    __global__ void myKernel ( float* in1, float* in2, float* out ) {
        __shared__ float s_in1[256], s_in2[256], s_out[256];
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;
        s_in1[t] = in1[i];                 // copy in1, in2 to shared memory
        s_in2[t] = in2[i];
        __syncthreads();
        s_out[t] = s_in1[t] * s_in2[t];    // compute on the fast copies
        out[i] = s_out[t];                 // copy s_out back to global memory
    }

31 (figure only, no slide text)

32 Other issues
- New time for the exercises
- Some ray tracing issues: to be clarified in the next exercise session

33 Next time
- Copying memory between host and device
- Clarifying grid parameters
- CUDA additions to C
- Memory limitations

34 See you next time!

