Published by Gordon Fields. Modified about 1 year ago.

1
Understanding the SIMD Efficiency of Graph Traversal on GPU
Yichao Cheng, Hong An, Zhitao Chen, Feng Li, Zhaohui Wang, Xia Jiang and Yi Peng
University of Science and Technology of China

2
Breadth-first Search (BFS)
[Figure: example graph with vertices labeled A to I and a marked source vertex]

3
Breadth-first Search (BFS)
[Figure: example graph with vertices A to I]
BFS_Iteration:
  for u ∈ Current Frontier
    for v ∈ u's neighbors do
      if v has not been labeled
        label v
        put v in Next Frontier
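The iteration above can be sketched sequentially in Python. This is only an illustration of the frontier scheme: the function name `bfs_levels`, the use of the BFS depth as the label, and the small example graph are assumptions, not taken from the slides.

```python
def bfs_levels(adjacency, source):
    """Label every reachable vertex with its BFS depth, one frontier at a time."""
    labels = {source: 0}
    current_frontier = [source]
    depth = 0
    while current_frontier:
        depth += 1
        next_frontier = []
        for u in current_frontier:
            for v in adjacency.get(u, []):
                if v not in labels:          # v has not been labeled
                    labels[v] = depth        # label v
                    next_frontier.append(v)  # put v in Next Frontier
        current_frontier = next_frontier
    return labels


# Hypothetical example graph; the edges are not taken from the slide's figure.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["F"], "E": [], "F": []}
levels = bfs_levels(graph, "A")
```

Each pass of the `while` loop corresponds to one BFS_Iteration: it consumes the current frontier and produces the next one.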

4
Application of BFS
Many real-world datasets are represented as graphs
  VLSI circuits
  Social relationships
  Road connections
Primitive for building complex algorithms
  Path-finding
  Belief propagation
  Points-to Analysis (PTA)

5
The Problem
GPU relies on high SIMD lane occupancy to boost performance
100% efficiency is achieved only if all SIMD lanes follow the same path
  do_something_common();
  if (thread_id > 5) {
    do_something_red();
  } else {
    do_something_blue();
  }
100% utilization

6
The Problem
GPU relies on high SIMD lane occupancy to boost performance
100% efficiency is achieved only if all SIMD lanes follow the same path
  do_something_common();
  if (thread_id > 5) {
    do_something_red();
  } else {
    do_something_blue();
  }
37.5% utilization

7
The Problem
GPU relies on high SIMD lane occupancy to boost performance
100% efficiency is achieved only if all SIMD lanes follow the same path
  do_something_common();
  if (thread_id > 5) {
    do_something_red();
  } else {
    do_something_blue();
  }
62.5% utilization
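A rough way to see where such utilization figures come from is to count how many lanes are active on each side of a divergent branch. The sketch below is illustrative only: the helper `branch_utilization`, the 8-lane width, and the threshold of 4 are assumptions, chosen so that the two paths occupy 3 and 5 lanes.

```python
def branch_utilization(num_lanes, takes_if_branch):
    """Return (if_utilization, else_utilization) as fractions of active lanes."""
    if_lanes = sum(1 for tid in range(num_lanes) if takes_if_branch(tid))
    else_lanes = num_lanes - if_lanes
    return if_lanes / num_lanes, else_lanes / num_lanes


# Assumed setup: 8 lanes, lanes 5..7 take the "if" path, lanes 0..4 the "else" path.
if_util, else_util = branch_utilization(8, lambda tid: tid > 4)
```

Because the two paths execute one after the other, each lane sits idle while the other path runs; only the common code before the branch keeps all lanes busy.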

8
Traditional Implementation
GPU_BFS_Iteration
  u = C[tid]
  for v ∈ u's neighbors do
  end for
The number of sub-iterations depends on the size of u's adjacency list
task 1 = 4 sub-iterations
task 2 = 2 sub-iterations
…
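The cost of this one-thread-per-vertex mapping can be estimated with a simple lockstep model. The sketch assumes that every lane of a warp walks one adjacency list and that the warp keeps running until the longest list is finished; the function name and the two-lane example (using the task sizes 4 and 2 from the slide) are illustrative.

```python
def thread_per_vertex_efficiency(degrees):
    """SIMD efficiency when each lane of one warp walks one vertex's adjacency list."""
    if not degrees or max(degrees) == 0:
        return 1.0  # nothing to expand, so no lane-slot is wasted
    sub_iterations = max(degrees)           # the warp waits for the longest list
    useful_slots = sum(degrees)             # one slot per neighbor actually visited
    total_slots = len(degrees) * sub_iterations
    return useful_slots / total_slots


# Two lanes with the task sizes from the slide: 4 and 2 sub-iterations.
efficiency = thread_per_vertex_efficiency([4, 2])
```

Under this model the lane with the shorter list idles for half of the warp's running time, so a single high-degree vertex can dominate the cost of a whole warp.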

9
Visualizing the Irregularity
[Figure annotations]
vertex range < 8
highly skewed, outliers exist
irregular but concentrated
distributed across a wide range

10
Alternative Way
Assign a warp of threads to each task
Vectorize the sub-iterations!
So, what's the relationship between graph topology and SIMD efficiency?

11
Topology and Utilization
Assign a group of threads to each vertex
[Figure labels: Thread, Warp, Group]
task 1 = 2 sub-iterations
task 2 = 1 sub-iteration
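The sub-iteration counts follow from dividing each adjacency list among the threads of a group. As a minimal sketch, assuming the two tasks have 4 and 2 neighbors (the sizes given for the one-thread-per-vertex case) and a group size of 2, the helper `sub_iterations` below (a hypothetical name) reproduces the counts on this slide.

```python
def sub_iterations(degree, group_size):
    """Number of lockstep passes a group of threads needs to cover one adjacency list."""
    return (degree + group_size - 1) // group_size  # ceiling division


task_degrees = [4, 2]  # assumed adjacency-list sizes of task 1 and task 2
one_thread_each = [sub_iterations(d, 1) for d in task_degrees]
two_threads_each = [sub_iterations(d, 2) for d in task_degrees]
```

Doubling the group size halves the number of passes here, but a list whose length is not a multiple of the group size leaves some threads of the group idle on its last pass.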

12
Topology and Utilization
Divide the SIMD underutilization into two parts:
  InteR-group underutilization (UR)
  IntrA-group underutilization (UA)
[Figure label: SIMD Window]

13
Conclusions From the Model
UR is induced by the heterogeneity of workloads
  Affected by the graph topology
UR is sensitive to the group size (S)
  A large logical SIMD window can narrow the gap
  When S = 32, UR = 0
UA is determined by the intrinsic irregularity of vertex degree
  It can be limited by shrinking S
  When S = 1, UA = 0
UR and UA can convert into each other

14
Comparing Different Mapping Strategies
[Figure: expansion rate (ME/s) and scalability of the mapping strategies; annotations: good, poor, low, high]

15
Evaluating the SIMD Efficiency
Metrics derived from the model:
  UR = inter-group underutilization
  UA = intra-group underutilization
  ME = mapping efficiency
  UR + UA + ME = 100%
Captures the utilization trend with increasing S
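One possible way to compute the three metrics is sketched below. The accounting is an assumption rather than the authors' exact definition: each vertex is handled by a group of S threads, the warp runs until its slowest group finishes, lanes left idle in a group's last pass count toward UA, and groups waiting for slower groups count toward UR. The function name `simd_breakdown` and the toy inputs are hypothetical.

```python
def simd_breakdown(degrees, group_size, warp_size=32):
    """Split all lane-slots into UR, UA and ME, returned as fractions that sum to 1."""
    if warp_size % group_size != 0:
        raise ValueError("group_size must divide warp_size")
    groups_per_warp = warp_size // group_size
    inter = intra = useful = total = 0
    for start in range(0, len(degrees), groups_per_warp):
        batch = list(degrees[start:start + groups_per_warp])
        batch += [0] * (groups_per_warp - len(batch))   # groups without a vertex stay idle
        steps = [(d + group_size - 1) // group_size for d in batch]
        warp_steps = max(steps)                         # warp waits for its slowest group
        total += warp_size * warp_steps
        useful += sum(batch)
        # Lanes left idle inside a group during that group's own passes.
        intra += sum(s * group_size - d for s, d in zip(steps, batch))
        # Whole groups idling while slower groups in the same warp finish.
        inter += sum((warp_steps - s) * group_size for s in steps)
    if total == 0:
        raise ValueError("no edges to expand")
    return {"UR": inter / total, "UA": intra / total, "ME": useful / total}


# Toy setup: a 4-lane "warp" and four vertices; real warps have 32 lanes.
degrees = [4, 2, 1, 1]
results = {s: simd_breakdown(degrees, s, warp_size=4) for s in (1, 2, 4)}
```

In this sketch the extreme cases behave as the model states: with a single thread per vertex UA vanishes, and with a group spanning the whole warp UR vanishes, while intermediate group sizes trade one kind of waste for the other.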

16
Explaining the Result
[Figure: expansion rate (ME/s) and scalability; annotations: good, poor, low, high]
Alleviates UR while introducing minor UA

17
Explaining the Result
[Figure: expansion rate (ME/s) and scalability; annotations: good, poor, low, high]
ME stays at a high level (~80%)

18
Explaining the Result
[Figure: expansion rate (ME/s) and scalability; annotations: good, poor, low, high]
Outweighed by the fast-growing UA

19
Explaining the Result
[Figure: expansion rate (ME/s) and scalability; annotations: good, poor, low, high]
Does little to help UR but leads to severe UA

20
Conclusion
Study the link between graph topology and hardware utilization
Present a model for analyzing the components of SIMD underutilization
Discover that SIMD lanes are wasted due to:
  imbalance of the vertex degree distribution
  heterogeneity of each vertex degree
Develop 3 metrics for quantifying SIMD efficiency
Provide a foundation for developing static analysis and runtime optimization techniques

21
Q&A
