Download presentation

Presentation is loading. Please wait.

Published byClifford Boyd Modified over 4 years ago

1
©Wen-mei W. Hwu and David Kirk/NVIDIA 2010 ECE 498HK Computational Thinking for Many-core Computing Lecture 15: Dealing with Dynamic Data

2
Objective To understand the nature of dynamic input data extraction To learn efficient queue structures that support massively parallel extraction of input data form bulk data structures To learn efficient kernel structures that cope with dynamic variation of working set size ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

3
Examples of Computation with Dynamic Input Data Graph search and optimization problems –Breadth-first search –Shortest path –Graph coverage – … Rendering and game physics –Ray tracing –Collision detection –… ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

4
Dynamic Data Extraction The data to be processed in each phase of computation need to be dynamically determined and extracted from a bulk data structure –Harder when the bulk data structure is not organized for massively parallel access, such as graphs. Graph algorithms are popular examples that deal with dynamic data –Widely used in EDA and large scale optimization applications –We will use Breadth-First Search (BFS) as an example ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

5
Main Challenges of Dynamic Data Input data need to be organized for locality, coalescing, and contention avoidance as they are extracted during execution The amount of work and level of parallelism often grow and shrink during execution –As more or less data is extracted during each phase –Hard to efficiently fit into one CUDA/OPenCL kernel configuration, which cannot be changed once launched –Different kernel strategies fit different data sizes ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

6
Breadth-First Search (BFS) sr t u v wx y sr t u v wx y sr t u v wx y sr t u v wx y s r w v t x uy Frontier vertex Level 1 Level 2 Level 3 Level 4 Visited vertex ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

7
Sequential BFS Store the frontier in a queue Complexity (O(V+E)) s r w v t x u y Level 1 Level 2 Level 3 Level 4 sr tv x y wu DequeueEnqueue ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

8
Parallelism in BFS Parallel Propagation in each level One GPU kernel per level s r w v t x uy Parallel v t x uy Level 1 Level 2 Level 3 Level 4 Kernel 3 Example kernel Kernel 4 Parallel global barrier ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

9
BFS in VLSI CAD Maze Routing blockage net terminal ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

10
BFS in VLSI CAD Logic Simulation/Timing Analysis 0 0 0 0 0 0 0 1 ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

11
BFS in VLSI CAD In formal verification for reachabiliy analysis. In clustering for finding connected components. In logic synthesis …….. ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

12
Potential Pitfall of Parallel Algorithms Greatly accelerated n 2 algorithm is still slower than an nlogn algorithm. Always need to keep an eye on fast sequential algorithm as the baseline. Running Time Problem Size ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

13
Node Oriented Parallelization IIIT-BFS –P. Harish et. al. “Accelerating large graph algorithms on the GPU using CUDA” –Each thread is dedicated to one node –Every thread examines neighbor nodes to determine if its node will be a frontier node in the next phase –Complexity O(VL+E) (Compared with O(V+E)) –Slower than the sequential version for large graphs Especially for sparsely connect graphs rst uvwx y rst uvwx y v t x uy ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

14
Matrix-based Parallelization Yangdong Deng et. al. “Taming Irregular EDA applications on GPUs” Propagation is done through matrix-vector multiplication –For sparsely connected graphs, the connectivity matrix will be a sparse matrix Complexity O(V+EL) (compared with O(V+E)) –Slower than sequential for large graphs s u v s u v s u v s u v s uv ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

15
Need a More General Technique To efficiently handle most graph types Use more specialized formulation when appropriate as an optimization Efficient queue-based parallel algorithms –Hierarchical scalable queue implementation –Hierarchical kernel arrangements ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

16
An Initial Attempt Manage the queue structure –Complexity: O(V+E) –Dequeue in parallel –Each frontier node is a thread –Enqueue in sequence. Poor coalescing Poor scalability –No speedup vtx u y uy v t x uy Parallel Sequential ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

17
Parallel Insert-Compact Queues C.Lauterbach et.al.“Fast BVH Construction on GPUs” Parallel enqueue with compaction cost Not suitable for light-node problems vt y x u Φ ΦΦΦ uy Compact Propagate v t x uy ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

18
Basic ideas Each thread processes one or more frontier nodes Find the index of each new frontier node Build queue of next frontier hierarchically h Local queues Global queue q1 q2 q3 Index = offset of q2 (#node in q1) + index in q2 a bc gij Local Globalh ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

19
Two-level Hierarchy Block queue (b-queue) –Inserted by all threads in a block –Reside in Shared Memory Global queue (g-queue) –Inserted only when a block completes Problem: –Collision on b-queues –Threads in the same block can cause heavy contention g-queue b-queue Global Mem Shared Mem ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

20
Warp-level Queue Time WiT7WiT7 WiT6WiT6 WiT0WiT0 W i T 15 W i T 14 WiT8WiT8 W i T 23 W i T 22 W i T 16 W i T 30 W i T 29 W i T 24 WjT7WjT7 WjT6WjT6 WjT0WjT0 W j T 15 W j T 14 WjT8WjT8 W j T 23 W j T 22 W j T 16 W j T 30 W j T 29 W j T 24 Warp i Warp j Thread Scheduling Divide threads into 8 groups (for GTX280) –Number of SP’s in each SM in general –One queue to be written by each SP in the SM Each group writes to one warp-level queue Still should use atomic operation –But much lower level of contention ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

21
Three-level hierarchy w-queue b-queue g-queue b-queue ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

22
Shared Memory: –Interleaved queue layout, no bank conflict Global Memory: –Coalescing when releasing a b-queue to g-queue –Moderate contention across blocks Texture memory : –Store graph structure (random access, no coalescing) –Fermi cache may help. W-queues[][8] Hierarchical Queue Management W-queue0W-queue1 W-queue7

23
Hierarchical Queue Management Advantage and limitation –The technique can be applied to any inherently sequential data structure –The w-queues and b-queues are limited by the capacity of shared memory. If we know the upper limit of the degree, we can adjust the number of threads per block accordingly. ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

24
ANY FURTHER QUESTIONS? ©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

Similar presentations

© 2020 SlidePlayer.com Inc.

All rights reserved.

To make this website work, we log user data and share it with processors. To use this website, you must agree to our Privacy Policy, including cookie policy.

Ads by Google