Presentation on theme: "L15: Tree-Structured Algorithms on GPUs CS6963L15: Tree Algorithms."— Presentation transcript:
L15: Tree-Structured Algorithms on GPUs CS6963L15: Tree Algorithms
Administrative STRSM due March 17 (EXTENDED) Midterm coming In class April 4, open notes Review notes, readings and review lecture (before break) Will post prior exams Design Review Intermediate assessment of progress on project, oral and short Tentatively April 11 and 13 Final projects Poster session, April 27 (dry run April 25) Final report, May 4 CS6963L15: Tree Algorithms
Outline Mapping trees to data-parallel architectures Sources: Parallel scan from Lin and Snyder, _Principles of Parallel Programming_ “An Effective GPU Implementation of Breadth-First Search,” Lijuan Luo, Martin Wong and Wen-mei Hwu, DAC ‘10, June “Inter-block GPU communication via fast barrier synchronization,” S. Xiao and W. Feng, ?2009 Va. Tech TR?. “Stackless KD-Tree Traversal for High Performance GPU Ray Tracing,” S. Popov, J. Gunther, H-P Seidel, P. Slusallek, Eurographics 2007, 26(3), CS6963L15: Tree Algorithms
Mapping Challenge From this: CS6963L15: Tree Algorithms To this:
Simple Example Parallel Prefix Sum: Compute a partial sum from A,…,A[n-1] Standard way to express it for (i=0; i
Graphical Depiction of Sum Code CS6963L15: Tree Algorithms Original Order Pairwise Order Which decomposition is better suited for parallel execution.
Parallelization Strategy 1.Map tree to flat array 2.Two passes through tree: a.Bottom up: Compute sum for each subtree and propagate all the way up to root node b.Top down: Each non-leaf node receives a value from its parent for sum up to current element. It sends the right child the sum of the parent plus left child value computed on top-down pass. Leaves add the prefix value from parent and saved value to compute final result. 3.Solution can be found on my website from CS4961 last fall CS6963L15: Tree Algorithms
Breadth First Search 9 Definition: Beginning at a node s in a connected component, explore all the neighboring nodes. Then for each of the neighbors, explore their unexplored neighbors until all nodes have been visited. The result of this search is the set of nodes reachable from s. Input: G=(V,E) and distinguished vertex s Output: a breadth first spanning tree with root s that contains all reachable vertices Key data structure: a frontier queue of nodes that have been visited at the current level of the tree CS6963L15: Tree Algorithms
CS6963L15: Tree Algorithms GPU Challenges Very little work at the root node. Much more work as the algorithm progresses to leaves. Managing global frontier queue can lead to high overhead. Sparse matrix implementation (L multiplications corresponding to L levels) can be slower than sequential CPU algorithms for large graphs. Assumption for this paper: graph is sparse
CS6963L15: Tree Algorithms Overview of Strategy Parallelism comes from propagation from all frontier vertices in parallel. With sparse graph, searching each neighbor of a frontier vertex in parallel will not have as much work associated with it. Vary the amount of GPU that is being used depending on threshold of profitability A single warp as the baseline A single block as the next level The entire device as the outermost level
Overview of Data Decomposition 12 Corresponding mapping of frontier queue to levels in implementation Hierarchical frontier queue leads to infrequent global synchronization. Split the frontier queue into levels according to position in the tree Lowest level is for a single warp Next level for per block shared memory Outermost level is for a larger global memory structure. CS6963L15: Tree Algorithms
CS6963L15: Tree Algorithms A Few Details Warp level writes to W-Frontier atomically. Add a single element to queue and update end of queue Different warps write to different W-Frontier queues. A B-Frontier is the union of 8 W-Frontiers. A single thread walks the W-Frontiers to derive indices of frontier nodes. A G-Frontier is shared across the device. Copies from B-Frontier to G-Frontier are done atomically. (Can use coalescing.)
CS6963L15: Tree Algorithms Global synchronization across blocks uses reference to Va Tech TR. Approach depends on: Using the entire device, all SMs Exact match of blocks to SMs All SMs must be included in “global barrier” to prevent deadlock. Global Synchronization Approach
CS6963L15: Tree Algorithms Another Algorithm: K-Nearest Neighbor Ultimate goal: identify the k nearest neighbors to a distinguished point in space from a set of points Create an acceleration data structure, called a KD- tree, to represent the points in space. Given this tree, finding neighbors can be identified for any distinguished point One application of K-Nearest Neighbor is ray tracing
CS6963L15: Tree Algorithms Constructing a KD-Tree Hierarchically partition a set of points (2D example) Slide source: spring/TA/manuals/CGAL/ref-manual2/SearchStructures/Chapter_main.html
CS6963L15: Tree Algorithms Representing the KD-Tree Like the parallel prefix sum, we flatten the tree data structure to represent KD-Tree in memory. Difference: Tree is not fully populated. So, cannot use linearized structure of parallel prefix sum. An auxiliary structure, called “ropes”, provides a link between neighboring cells. Called a “stackless” KD-tree traversal because it is not recursive.
CS6963L15: Tree Algorithms Summary of Lecture To map trees to data parallel architectures with little or no support for recursion Flatten hierarchical data structures into dense vectors Use hierarchical storage corresponding to hierarchical algorithm decomposition to avoid costly global synchronization and minimize global memory accesses. Possibly vary amount of device participating in computation for different levels of the tree.