1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 26, 2013, DyanmicParallelism.ppt CUDA Dynamic Parallelism These notes outline the CUDA dynamic parallelism facility introduced in the last set of slides, and provide code examples and an application.
2 CUDA Dynamic Parallelism A facility introduced by NVIDIA in their GK110 chip/architecture and embodied in our new K20 GPU server, coit-grid08.uncc.edu. Allows a kernel to call another kernel from within it without returning to the host. Each such kernel call uses the same calling construction: the grid and block sizes and dimensions are set at the time of the call. The facility allows computations to be done with dynamically altering grid structures and with recursion, to suit the computation. For example, it allows a 2-D/3-D simulation mesh to be non-uniform, with increased precision at places of interest (see previous slides).
3 Dynamic Parallelism: Kernels calling other kernels
Host code:
    kernel1<<<…>>>(…);
Device code:
    __global__ void kernel1(…) { … kernel2<<<…>>>(…); return; }
    __global__ void kernel2(…) { … kernel3<<<…>>>(…); return; }
Notice that the kernel call uses the standard syntax, which allows each kernel call to have a different grid/block structure. Nesting depth is limited by available memory and by a maximum that NVIDIA's CUDA documentation gives as 24 levels.
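The parent/child launch pattern on this slide can be sketched as a small program. This is a minimal illustration under stated assumptions, not the course's own code: the kernel names `parent` and `child` and the host driver `run_nested_demo` are hypothetical, and the build is assumed to use relocatable device code (for example `nvcc -rdc=true ... -lcudadevrt`) on a device of compute capability 3.5 or higher.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Child kernel: each thread doubles one element.
__global__ void child(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Parent kernel: a single thread chooses the child's grid and block
// sizes at run time and launches it from the device.
__global__ void parent(int *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int block = 64;
        int grid = (n + block - 1) / block;
        child<<<grid, block>>>(data, n);   // device-side kernel launch
    }
    // No explicit wait here: the parent grid is not considered complete
    // until the child grid it launched has completed.
}

// Hypothetical host driver: returns how many elements hold the expected value.
int run_nested_demo(int n) {
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    parent<<<1, 32>>>(d, n);     // host launches only the parent
    cudaDeviceSynchronize();     // waits for parent and, implicitly, child

    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int ok = 0;
    for (int i = 0; i < n; ++i) ok += (h[i] == 2 * i);
    return ok;
}
```

Calling `run_nested_demo(1000)` from `main` should report all 1000 elements as doubled if the device supports dynamic parallelism. The child's grid size is computed inside the parent, which is the point of the facility: the host never sees that decision.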
4 Dynamic Parallelism: Recursion
Host code:
    kernel1<<<…>>>(…);
Device code:
    __global__ void kernel1(…) { … kernel1<<<…>>>(…); return; }
Recursion is allowed: a kernel may launch itself, subject to the nesting depth limit. Question: How do you get out of an infinite loop in recursion?
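One common answer to the question on this slide is to pass the current depth as a kernel argument and stop launching once a limit is reached. The sketch below assumes that approach. The names `recurse`, `visited`, `maxDepth` and `run_recursion_demo` are hypothetical, and the same build assumptions apply (`-rdc=true`, `-lcudadevrt`, compute capability 3.5 or higher).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Recursive kernel: marks its own level, then launches the next level
// unless the depth limit has been reached (the base case).
__global__ void recurse(int *visited, int depth, int maxDepth) {
    visited[depth] = 1;                    // record that this level ran
    if (depth + 1 < maxDepth) {            // base case stops the recursion
        recurse<<<1, 1>>>(visited, depth + 1, maxDepth);
    }
}

// Hypothetical host driver: returns the number of levels that executed.
// maxDepth is assumed to be at least 1 and small enough to stay within
// the device's nesting depth limit.
int run_recursion_demo(int maxDepth) {
    int *d = nullptr;
    cudaMalloc(&d, maxDepth * sizeof(int));
    cudaMemset(d, 0, maxDepth * sizeof(int));

    recurse<<<1, 1>>>(d, 0, maxDepth);     // level 0 is launched by the host
    cudaDeviceSynchronize();               // returns after all levels finish

    std::vector<int> h(maxDepth);
    cudaMemcpy(h.data(), d, maxDepth * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int levels = 0;
    for (int i = 0; i < maxDepth; ++i) levels += h[i];
    return levels;
}
```

With `maxDepth` set to 5, levels 0 through 4 each run once. Without the `depth + 1 < maxDepth` test the launches would continue until the runtime refuses further nesting.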
5 Parent-Child Launch Nesting [Figure, derived from Fig. 20.4 of Kirk and Hwu, 2nd Ed.: timeline in which the host (CPU) thread launches Grid A (parent); Grid A threads launch Grid B (child); Grid B threads run and Grid B completes; then Grid A completes.] “Implicit synchronization between parent and child forcing parent to wait until all children exit before it can exit.”
6 Kernel Execution configuration re-visited Derived from CUDA 5 Toolkit documentation* The execution configuration is specified by: kernel<<<G, B, Ns, S>>>(…); G specifies the dimension and size of the grid. B specifies the dimension and size of each block. Ns specifies the number of bytes of shared memory that is dynamically allocated per block for this call, in addition to statically allocated memory; Ns is optional and defaults to 0. S specifies the associated stream; S is optional and defaults to 0. Earlier CUDA versions appear to have only three arguments: G, B and S? * http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#c-language-extensions
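To make the four configuration arguments concrete, here is a small sketch that uses all of them. It assumes a hypothetical kernel `reverse_block` that reverses the elements handled by each block using dynamically sized shared memory, and a hypothetical host driver `run_config_demo`. The variable names G, B, Ns and S follow the slide.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Reverses the elements belonging to each block, staging them in
// shared memory whose size is supplied by Ns at launch time.
__global__ void reverse_block(int *data) {
    extern __shared__ int tile[];          // dynamically sized: Ns bytes
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();                       // whole tile loaded before reading
    data[base + t] = tile[blockDim.x - 1 - t];
}

// Hypothetical host driver: returns the number of correctly placed elements.
int run_config_demo() {
    const int blocks = 4, threads = 8, n = blocks * threads;
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    dim3 G(blocks);                        // grid dimension and size
    dim3 B(threads);                       // block dimension and size
    size_t Ns = threads * sizeof(int);     // dynamic shared memory per block
    cudaStream_t S;                        // associated stream
    cudaStreamCreate(&S);

    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpyAsync(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice, S);
    reverse_block<<<G, B, Ns, S>>>(d);
    cudaMemcpyAsync(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost, S);
    cudaStreamSynchronize(S);              // all three operations are done

    cudaFree(d);
    cudaStreamDestroy(S);

    int ok = 0;
    for (int i = 0; i < n; ++i) {
        int base = (i / threads) * threads, t = i % threads;
        ok += (h[i] == base + (threads - 1 - t));
    }
    return ok;
}
```

Leaving out Ns and S, as in `reverse_block<<<G, B>>>(d)`, would give the kernel zero bytes of dynamic shared memory and place it in stream 0, so this particular kernel needs at least the third argument.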
7 Streams re-visited Stream: a sequence of operations that execute in order on the GPU. Provides for concurrency: multiple streams can be supported simultaneously on the GPU. Compute Capability 2.0+: up to 16 CUDA streams*. Compute Capability 3.5: up to 32 CUDA streams. A stream can be specified in a kernel launch as the 4th parameter of the execution configuration: Kernel<<<G, B, Ns, S>>>(…); // S identifies the stream, e.g. stream 3. If missing, the default is stream 0. So technically one can launch multiple concurrent kernels from the host. Apparently it is difficult to get more than 4 streams to run concurrently. *Some devices only; query the concurrentKernels device property.
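A short sketch of launching kernels in several streams from the host, including the `concurrentKernels` query mentioned in the footnote. The kernel `add_value` and the driver `run_streams_demo` are hypothetical names. Whether the kernels actually overlap depends on the device and on how much of the GPU each kernel occupies, so the code only checks results, not concurrency.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <vector>

// Adds a constant to every element of one buffer.
__global__ void add_value(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += value;
}

// Hypothetical host driver: one buffer and one kernel per stream.
// Returns the total number of elements holding the expected value.
int run_streams_demo(int numStreams, int n) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("concurrentKernels = %d\n", prop.concurrentKernels);

    std::vector<cudaStream_t> streams(numStreams);
    std::vector<int *> bufs(numStreams);
    const int block = 256, grid = (n + block - 1) / block;

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&bufs[s], n * sizeof(int));
        // Operations within a stream run in order; different streams may overlap.
        cudaMemsetAsync(bufs[s], 0, n * sizeof(int), streams[s]);
        add_value<<<grid, block, 0, streams[s]>>>(bufs[s], n, s + 1);
    }

    int ok = 0;
    std::vector<int> h(n);
    for (int s = 0; s < numStreams; ++s) {
        cudaStreamSynchronize(streams[s]);     // wait for this stream only
        cudaMemcpy(h.data(), bufs[s], n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) ok += (h[i] == s + 1);
        cudaFree(bufs[s]);
        cudaStreamDestroy(streams[s]);
    }
    return ok;
}
```

The third configuration argument is written explicitly as 0 here because the stream is the fourth argument and cannot be given without it.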
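8 Application: Static Heat Equation/Laplace’s Equation 2-D heat equation: $\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}$. With boundary conditions defined, $\frac{\partial u}{\partial t} \to 0$ as $t$ tends to infinity. We then get the time-independent equation called Laplace’s equation: $\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0$.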
9 Solving Static Heat Equation/Laplace’s Equation: Finite Difference Method Solve for u over the two-dimensional x-y space. For a computer solution, finite difference methods are appropriate. The two-dimensional solution space is “discretized” into a large number of solution points.
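For reference, the standard central-difference argument that links Laplace's equation to an averaging rule can be written out as follows. This is a sketch of the usual textbook derivation, writing $u$ for the unknown temperature and assuming a uniform grid spacing $\Delta$ in both directions (a symbol introduced here, not taken from the slides).

```latex
% Central-difference approximations of the second derivatives
\frac{\partial^2 u}{\partial x^2} \approx
  \frac{u(x+\Delta, y) - 2u(x, y) + u(x-\Delta, y)}{\Delta^2},
\qquad
\frac{\partial^2 u}{\partial y^2} \approx
  \frac{u(x, y+\Delta) - 2u(x, y) + u(x, y-\Delta)}{\Delta^2}

% Substituting into Laplace's equation (sum of the two equals zero)
% and solving for u(x, y):
u(x, y) \approx \frac{1}{4}\left[
  u(x-\Delta, y) + u(x+\Delta, y) + u(x, y-\Delta) + u(x, y+\Delta)
\right]
```

Each point is therefore approximated by the average of its four nearest neighbours, which is the rule that gets iterated on the mesh.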
Divide the area into a fine mesh of points, $h_{i,j}$. The temperature at an inside point is taken to be the average of the temperatures of the four neighboring points. It is convenient to describe the edges by points. The temperature of each point is found by iterating the equation $h_{i,j} = \frac{h_{i-1,j} + h_{i+1,j} + h_{i,j-1} + h_{i,j+1}}{4}$ ($0 < i < n$, $0 < j < n$) for a fixed number of iterations or until the difference between iterations is less than some very small amount.
Heat Distribution Problem For convenience, the edges are also represented by points, but these have fixed values and are used in computing the internal values.
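The iteration with fixed edge points can be sketched as a simple Jacobi-style kernel. This is an illustrative version under stated assumptions, not the course's solution: the names `jacobi_step` and `run_jacobi_demo` are hypothetical, the mesh is stored row-major as `n` by `n` points including the edges (so interior points satisfy 0 < i < n-1), and a fixed iteration count is used instead of a convergence test.

```cuda
#include <cmath>
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// One iteration: every interior point becomes the average of its four
// neighbours from the previous iteration. Edge points are never written.
__global__ void jacobi_step(const float *in, float *out, int n) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
    if (i > 0 && i < n - 1 && j > 0 && j < n - 1) {
        out[i * n + j] = 0.25f * (in[(i - 1) * n + j] + in[(i + 1) * n + j] +
                                  in[i * n + j - 1] + in[i * n + j + 1]);
    }
}

// Hypothetical host driver: edges fixed at 100, interior starts at 0.
// Returns the temperature at the centre point after `iters` iterations.
float run_jacobi_demo(int n, int iters) {
    std::vector<float> h(n * n, 0.0f);
    for (int k = 0; k < n; ++k) {
        h[k] = h[(n - 1) * n + k] = 100.0f;          // top and bottom edges
        h[k * n] = h[k * n + n - 1] = 100.0f;        // left and right edges
    }

    size_t bytes = h.size() * sizeof(float);
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    // Both buffers get the edge values, since the kernel never writes them.
    cudaMemcpy(a, h.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, h.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    for (int it = 0; it < iters; ++it) {
        jacobi_step<<<grid, block>>>(a, b, n);
        std::swap(a, b);                             // newest values now in a
    }

    cudaMemcpy(h.data(), a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    return h[(n / 2) * n + n / 2];
}
```

With all four edges held at the same temperature, the interior should approach that temperature as the iterations proceed, which gives a simple sanity check. Two buffers are used so that every point reads only values from the previous iteration.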
13 Multigrid Method First, a coarse grid of points is used. With these points, the iteration process will start to converge quickly. At some stage, the number of points is increased to include the points of the coarse grid and extra points between them. Initial values of the extra points are found by interpolation. Computation continues with this finer grid. The grid can be made finer and finer as computation proceeds, or computation can alternate between fine and coarse grids. Coarser grids take distant effects into account more quickly and provide a good starting point for the next finer grid.
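The coarse-to-fine step described here can be sketched as an interpolation kernel. This is one possible scheme (bilinear interpolation with the grid spacing halved), chosen for illustration and not stated in the slides. The names `interpolate` and `run_interpolation_demo` are hypothetical. A coarse grid of `nc` by `nc` points maps to a fine grid of `2*nc - 1` by `2*nc - 1` points.

```cuda
#include <cmath>
#include <cuda_runtime.h>
#include <vector>

// Fills the fine grid from the coarse grid. Fine points that coincide
// with coarse points copy them; the extra points in between get the
// average of the nearest coarse points.
__global__ void interpolate(const float *coarse, int nc, float *fine) {
    int nf = 2 * nc - 1;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nf || j >= nf) return;
    int i0 = i / 2, i1 = (i + 1) / 2;   // equal when i is even
    int j0 = j / 2, j1 = (j + 1) / 2;   // equal when j is even
    fine[i * nf + j] = 0.25f * (coarse[i0 * nc + j0] + coarse[i0 * nc + j1] +
                                coarse[i1 * nc + j0] + coarse[i1 * nc + j1]);
}

// Hypothetical host driver: interpolates the linear field c(i,j) = i + j
// and returns the largest deviation from the exact fine-grid values.
float run_interpolation_demo(int nc) {
    int nf = 2 * nc - 1;
    std::vector<float> hc(nc * nc), hf(nf * nf);
    for (int i = 0; i < nc; ++i)
        for (int j = 0; j < nc; ++j) hc[i * nc + j] = float(i + j);

    float *dc = nullptr, *df = nullptr;
    cudaMalloc(&dc, hc.size() * sizeof(float));
    cudaMalloc(&df, hf.size() * sizeof(float));
    cudaMemcpy(dc, hc.data(), hc.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((nf + block.x - 1) / block.x, (nf + block.y - 1) / block.y);
    interpolate<<<grid, block>>>(dc, nc, df);
    cudaMemcpy(hf.data(), df, hf.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dc);
    cudaFree(df);

    float maxErr = 0.0f;
    for (int i = 0; i < nf; ++i)
        for (int j = 0; j < nf; ++j)   // fine index (i,j) sits at coarse (i/2, j/2)
            maxErr = fmaxf(maxErr, fabsf(hf[i * nf + j] - 0.5f * (i + j)));
    return maxErr;
}
```

A linear field is used as the check because bilinear interpolation reproduces it exactly. In a multigrid solver, the interpolated values would serve as the starting guess for further iterations on the finer grid.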
15 Various strategies: V-cycles and W-cycles between resolutions (grid spacings h, 2h, 4h, 8h); gradually decreasing resolution. There is a mathematical basis behind the method. It leads to much faster results than using a single resolution. Could make an interesting project.