Efficiently Answering Reachability Queries on Large Directed Graphs Ruoming Jin Kent State University Joint work with Yang Xiang (KSU), Ning Ruan (KSU),


1 Efficiently Answering Reachability Queries on Large Directed Graphs Ruoming Jin Kent State University Joint work with Yang Xiang (KSU), Ning Ruan (KSU), and Haixun Wang (IBM T.J. Watson)

2 Reachability Query
[Figure: example DAG on vertices 1-15]
?Query(1,11) Yes
?Query(3,9) No
The problem: Given two vertices u and v in a directed graph G, is there a path from u to v?
Directed graph → DAG (directed acyclic graph) by coalescing the strongly connected components
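As a baseline for the problem statement above, a reachability query can be answered by a plain breadth-first search in O(n+m) time per query with no index. The sketch below is only illustrative: the function name `reachable` and the toy graph are assumptions, and the graph is not the one drawn on the slide.

```python
from collections import deque

def reachable(adj, u, v):
    """Return True if there is a directed path from u to v (BFS, O(n+m))."""
    if u == v:
        return True
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, ()):
            if y == v:
                return True
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

# Hypothetical toy DAG (not the graph from the slide): vertex -> successors.
toy = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [4]}

print(reachable(toy, 1, 4))
print(reachable(toy, 4, 1))
```

Every index discussed in the talk trades construction time and index size against this per-query cost.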

3 Applications
XML
Biological networks
Ontology
Knowledge representation (Lattice operation)
Object programming (Class relationship)
Distributed systems (Reachable states)
Graph Databases

4 Prior Work
Method                                           Query time   Construction    Index size
DFS/BFS                                          O(n+m)
Transitive Closure                               O(1)         O(nm)/O(n^3)    O(n^2)
Optimal Chain Cover (Jagadish, TODS’90)          O(k)         O(nm)           O(nk)
Optimal Tree Cover (Agrawal et al., SIGMOD’89)   O(n)         O(nm)           O(n^2)
Dual-Labeling (Wang et al., ICDE’06)             O(1)         O(n+m+t^3)      O(n+t^2)
Labeling+SSPI (Chen et al., VLDB’05)             O(m-n)       O(n+m)
GRIPP (Trißl et al., SIGMOD’07)                  O(m-n)       O(n+m)
2-HOP (O(nm^(1/2)) and O(n^4)), HOPI, and heuristic algorithms

5 Limitations of Tree-based Approaches
Finding a good tree cover is expensive
Tree cover cannot represent some common types of DAGs, such as grids
Compression limitations
–Chain (1-parent, 1-child)
–Tree (1-parent, multiple children)
–Most existing methods which utilize the tree cover are greatly affected by how many edges are left uncovered

6 Overview of Path-Tree
Chain -> Tree -> Path-Tree (2 parents / multiple children)
Path-tree cover is a spanning subgraph of G in a tree shape (T)
A node in the tree T corresponds to a path in G, and an edge in T corresponds to the edges between two paths in G
A 3-tuple labeling exists for any path-tree to answer a reachability query in O(1)

7 Path-Tree in a Nutshell
[Figure: example DAG on vertices 1-15 partitioned into paths P1-P4, next to the path-graph over P1-P4]
Path-Graph is not necessarily a planar graph
The reachability between any two nodes can be answered in O(1)

8 Key Problems
How to construct a path-tree?
–Algorithm
How can a path-tree help with reachability queries?
–Labeling
–Transitive Closure Compression
How does path-tree compare with the existing methods?
–Optimality

9 Constructing Path-Tree
Step 1: Path-Decomposition of DAG
Step 2: Minimal Equivalent Edge Set between any two paths
Step 3: Path-Graph Construction
Step 4: Path-Tree Cover Extraction

10 Step 1: Path-Decomposition
[Figure: example DAG on vertices 1-15 decomposed into paths P1-P4; one vertex is annotated (PID, SID) = (2, 5)]
For any two nodes u and v in the same path, u → v if and only if u.sid ≤ v.sid
A simple linear algorithm based on topological sort can achieve a path-decomposition
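One way a linear-time path-decomposition could look is sketched below. This is a greedy illustration under stated assumptions, not necessarily the procedure used in the talk: vertices are scanned in topological order, each unassigned vertex starts a new path, and the path is extended through the first unassigned successor. The names `topological_order`, `path_decomposition`, and `same_path_reaches`, and the toy graph, are hypothetical.

```python
from collections import deque

def topological_order(adj):
    """Kahn's algorithm; adj maps every vertex to its list of successors."""
    indeg = {u: 0 for u in adj}
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    queue = deque(u for u in adj if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

def path_decomposition(adj):
    """Return {vertex: (pid, sid)}: path id and position within that path."""
    label = {}
    pid = 0
    for start in topological_order(adj):
        if start in label:
            continue
        pid += 1
        sid = 1
        u = start
        while u is not None:
            label[u] = (pid, sid)
            sid += 1
            # Extend the path through the first successor not yet on any path.
            u = next((w for w in adj[u] if w not in label), None)
    return label

def same_path_reaches(label, u, v):
    """Within one path, u reaches v exactly when u.sid <= v.sid."""
    return label[u][0] == label[v][0] and label[u][1] <= label[v][1]

# Hypothetical toy DAG (not the graph from the slide).
toy = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [4]}
labels = path_decomposition(toy)
print(labels)
```

Consecutive vertices on a path are always joined by an edge, which is what makes the sid comparison valid for two vertices on the same path.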

11 Step 2: Minimal equivalent edge set
[Figure: paths P1 and P2 and the edges from P1 to P2 (P1 → P2)]
The reachability between any two paths can be captured by a unique minimal set of edges
The edges in the minimal equivalent edge set do not cross (always parallel)!
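The non-crossing property can be illustrated with a small dominance sweep. The sketch assumes that only direct edges from P1 to P2 are considered and that each edge is written as a pair (a, b), with a the position (sid) of its source on P1 and b the position of its target on P2. Under that assumption an edge (a, b) is redundant whenever another edge (a', b') has a' ≥ a and b' ≤ b, because a reaches a' along P1 and b' reaches b along P2. The function name and the sample edges are hypothetical, and this is a reformulation, not the procedure from the talk.

```python
def minimal_equivalent_edges(edges):
    """Keep only the edges (a, b) not implied by another edge between the paths.

    a: position of the source on the first path
    b: position of the target on the second path
    """
    kept = []
    best_b = float("inf")
    # Scan sources from last to first; for equal sources, earliest target first.
    for a, b in sorted(set(edges), key=lambda e: (-e[0], e[1])):
        if b < best_b:  # strictly earlier target than any later source provides
            kept.append((a, b))
            best_b = b
    return sorted(kept)

# Hypothetical edges between two paths.
sample = [(1, 1), (1, 3), (2, 2), (3, 4), (2, 4)]
print(minimal_equivalent_edges(sample))  # [(1, 1), (2, 2), (3, 4)]
```

In the result, sources and targets increase together, so no two kept edges cross.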

12 Step 3: Path-Graph Construction
[Figure: example DAG on vertices 1-15 with paths P1-P4, next to the weighted directed path-graph over P1-P4]
Weighted Directed Path-Graph
Weight reflects the cost we have to pay for the transitive closure computation if we exclude this path-tree edge

13 Step 4: Extracting Path-Tree Cover
[Figure: weighted directed path-graph over P1-P4, next to its maximal directed spanning tree with edge weights 5, 2, 2]
Weighted Directed Path-Graph
Maximal Directed Spanning Tree
Chu-Liu/Edmonds algorithm, O(m’ + k log k)

14 Key Problems
How to construct a path-tree?
–Algorithm
How can a path-tree help with reachability queries?
–Labeling
–Transitive Closure Compression
How does path-tree compare with the existing methods?
–Optimality

15 3-Tuple Labeling for Reachability
[Figure: example DAG on vertices 1-15 with paths P1-P4, next to the path-tree with interval labels P1 [1,1], P2 [2,2], P3 [1,3], P4 [1,4]]
DFS labeling (1-tuple)
Interval labeling (2-tuple)
High-level description about paths: Pi → Pj ?
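The interval labels on this slide are consistent with a standard postorder interval labeling of a tree: each node receives [lowest postorder number in its subtree, its own postorder number]. The sketch below assumes a tree shape P4 → P3 → {P1, P2}; that shape is inferred from the intervals and is not stated in the transcript. The function name `interval_labels` is hypothetical.

```python
def interval_labels(children, root):
    """Postorder interval labeling: node -> (lowest postorder in subtree, own postorder)."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        lows = [visit(c) for c in children.get(node, [])]
        counter += 1                      # postorder number of this node
        low = min(lows) if lows else counter
        labels[node] = (low, counter)
        return low

    visit(root)
    return labels

# Assumed tree shape, inferred from the intervals shown on the slide.
path_tree = {"P4": ["P3"], "P3": ["P1", "P2"]}
print(interval_labels(path_tree, "P4"))
```

With this labeling, one tree node is an ancestor of another exactly when its interval contains the other's interval.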

16 DFS labeling
[Figure: example DAG on vertices 1-15 with paths P1-P4; the DFS labels run from 15 down to 1]
1. Start from the first vertex in the root path
2. Always try to visit the next vertex in the same path
3. Label a node when all its neighbors have been visited
L(v) = N - x, where x is the number of nodes that have already been labeled
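The three rules above can be sketched as a depth-first traversal that numbers vertices in reverse finishing order. The sketch makes several assumptions that the slide does not spell out: `next_in_path` maps a vertex to its successor on the same path, remaining neighbors are visited in the order listed, and the traversal restarts from further start vertices if the first one does not reach everything. All names and the toy graph are hypothetical.

```python
def dfs_labels(adj, next_in_path, roots):
    """Assign L(v) = N - x, where x is the number of vertices labeled before v."""
    n = len(adj)
    labels = {}
    visited = set()

    def visit(u):
        visited.add(u)
        nxt = next_in_path.get(u)
        # Rule 2: the next vertex on the same path is tried first.
        ordered = ([nxt] if nxt is not None else []) + [w for w in adj[u] if w != nxt]
        for w in ordered:
            if w not in visited:
                visit(w)
        # Rule 3: label u only after all of its neighbors have been visited.
        labels[u] = n - len(labels)

    for r in roots:  # Rule 1: roots[0] is the first vertex of the root path
        if r not in visited:
            visit(r)
    return labels

# Hypothetical toy DAG with paths 1 -> 2 -> 4, 5, and 3 (not the slide's graph).
toy = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [4]}
toy_labels = dfs_labels(toy, {1: 2, 2: 4}, roots=[1, 5, 3])
print(toy_labels)
```

Because a vertex is labeled only after everything it reaches, every edge u → v in a DAG ends up with L(u) < L(v).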

17 3-Tuple Labeling for Reachability
[Figure: example DAG on vertices 1-15 with DFS labels 15 down to 1, next to the path-tree with interval labels P1 [1,1], P2 [2,2], P3 [1,3], P4 [1,4]]
u → v if and only if
1) Interval label I(u) ⊇ I(v)
2) DFS label L(u) ≤ L(v)
?Query(9,15) P4[1,4] ⊇ P1[1,1] and 5 < 15 Yes
?Query(9,2)
?Query(5,9)
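The query test on this slide reduces to one interval-containment check and one integer comparison, which is why it runs in O(1). The sketch below encodes only the two vertices whose labels the slide gives explicitly (vertex 9 on P4 with DFS label 5, vertex 15 on P1 with DFS label 15). The `Label` type and the `reaches` function are hypothetical names, and the containment direction is read off the worked Query(9,15) example.

```python
from typing import NamedTuple

class Label(NamedTuple):
    lo: int   # start of the interval label of the vertex's path
    hi: int   # end of the interval label of the vertex's path
    dfs: int  # DFS label L(v)

def reaches(lu: Label, lv: Label) -> bool:
    """u reaches v iff I(u) contains I(v) and L(u) <= L(v)."""
    contains = lu.lo <= lv.lo and lv.hi <= lu.hi
    return contains and lu.dfs <= lv.dfs

# Labels taken from the worked example on the slide.
labels = {
    9: Label(1, 4, 5),     # vertex 9: path P4 with interval [1,4], L = 5
    15: Label(1, 1, 15),   # vertex 15: path P1 with interval [1,1], L = 15
}

print(reaches(labels[9], labels[15]))   # Query(9,15)
print(reaches(labels[15], labels[9]))   # the reverse direction
```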

18 Transitive Closure Compression
[Figure: example DAG on vertices 1-15]
An efficient procedure can compute and compress the transitive closure in O(mk), where k is the number of paths in the path-tree
Path-tree cover (including labeling) can be constructed in O(m + n log n)

19 Key Problems
How to construct a path-tree?
–Algorithm
How can a path-tree help with reachability queries?
–Labeling
–Transitive Closure Compression
How does path-tree compare with the existing methods?
–Optimality

20 Theoretical Analysis
Optimal Path-Tree Cover (OPTC) Problem:
–Given a path-decomposition, what is the optimal path-tree cover to maximally compress the transitive closure?
–OptIndex weight assignment based on computing the predecessor set
Optimal Path-Decomposition (OPD) Problem:
–Assuming we only use path-decomposition to compress the transitive closure, what is the optimal path-decomposition to maximally compress the transitive closure?
–Minimal-cost flow problem
–What is the overall optimal path-decomposition?

21 Superiority of Path-Tree Cover
The optimal tree cover is a special case of path-tree cover in which every path consists of a single vertex and the weight is based on OptIndex.
The path-tree cover approach can compress the transitive closure to a size smaller than or equal to that of the optimal tree cover approach (and consequently of the optimal chain cover approach).

22 Experimental Evaluation
Implementation in C++
12 real datasets used in the Dual-labeling paper and the GRIPP paper
Synthetic datasets
–Sparse DAG with edge density = 2
AMD Opteron 2.0GHz / 2GB / Linux
PTree1 (OptIndex) and PTree2
–Mainly compare with Optimal Tree Cover

23 Real Datasets
Graph Name   #V      #E      DAG #V   DAG #E
AgroCyc      13969   17694   12684    13408
aMaze        11877   28700   3710     3600
Anthra       13736   17307   12499    13104
Ecoo157      13800   17308   12620    13350
HpyCyc       5565    8474    4771     5859
Human        40051   43879   38811    39576
Kegg         14271   35170   3617     3908
Mtbrv        10697   13922   9602     10245
Nasa         5704    7942    5605     7735
Reactome     3678    14447   901      846
Vchocyc      10694   14207   9491     10143
Xmark        6483    7654    6080     7028

24 Experimental Result (Real Data)
Columns: Transitive Closure Size (Tree, Ptree-1, Ptree-2), Construction Time in ms (Tree, Ptree-1, Ptree-2), Query Time in ms (Tree, Ptree-1, Ptree-2)
AgroCyc    135509622133149.8224.853142.31146.6291014.393
aMaze      51781571172741062.2834.69763.74819.47821.52961.925
Anthra     131557332620141.11212.258143.56844.9589.31716.498
Ecoo157    134939733592151.46229.29141.95146.67411.22416.739
HpyCyc     59464224466157.378106.55271.67531.53912.08915.503
Human      396369652910446.32648.005465.14870.10720.00823.008
Kegg       5121170330344746.031057.1186.39617.50927.28275.448
Mtbrv      102888123664111.48173.382106.58340.3919.8119.815
Nasa       91625063667085.291111.39753.13937.03716.21420.771
Reactome   1293383106917.24418.1896.317.5656.46713.037
Vchocyc    101838302262109.47170.714103.03640.0268.99914.274
Xmark      8237235610614204.76247.62868.35837.83417.12241.549
On average 10 times better than Tree
On average 3 times better than Tree

25 Experimental Result (Synthetic Data)


28 Conclusion
A novel Path-Tree structure is proposed to assist in compressing the transitive closure and answering reachability queries
Path-tree has the potential to be integrated with other existing methods to further improve the efficiency of reachability query processing

29 Thanks!!

30 Step 3: Path-Graph Construction
[Figure: example DAG on vertices 1-15 with paths P1-P4, next to the weighted directed path-graph over P1-P4]
Weighted Directed Path-Graph
Weight reflects the penalty if we exclude this path-tree edge

31 Step 2: Constructing Minimal Equivalent Edge Set (Pi → Pj)
[Figure: paths P1 and P2 and the edges from P1 to P2 (P1 → P2)]
1. Order the vertices in Pi and Pj in decreasing order
2. Find the first vertex v in Pj that Pi can reach
3. Find the last vertex u in Pi that reaches v
4. Remove all the edges that cross (u,v) and repeat steps 2-4


