Reachability Querying: An Independent Permutation Labeling Approach (published in VLDB 2014) Presenter: WEI, Hao
Graph Reachability Query Given a directed graph G = (V, E) and two vertices u and v, u is said to reach v if there exists a path from u to v in G. Any directed graph can be transformed into a DAG by collapsing each strongly connected component into a single vertex; the query is trivial if u and v are in the same strongly connected component. Query(v1, v8): Reachable. Query(v2, v11): Unreachable.
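As a baseline for the definition above, a reachability query can be answered with a plain depth-first search and no index. The sketch below is illustrative only: the function name `reaches` and the toy graph are assumptions, not the graph shown on the slide.

```python
def reaches(adj, u, v):
    """Return True if a directed path from u to v exists (u reaches itself)."""
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


# Hypothetical toy DAG (adjacency lists), not the slide's example graph.
toy = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["d"]}

print(reaches(toy, "a", "d"))  # True
print(reaches(toy, "d", "a"))  # False
```

Each query costs O(|V| + |E|) in the worst case, which is what an index tries to avoid on large graphs.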
The Issue and the Challenge The 'Big Data' era brings large graphs with millions of nodes and edges. web-uk dataset: 133 million nodes, 5 billion edges. DAG of web-uk: 22 million nodes, 38 million edges. Traditional approaches are not applicable at this scale.
Related Work Recent works build an index, label(u), offline for every node u. Label-Only approach: answer Query(u, v) using label(u) and label(v) only. Hop labeling: TF-Label, Hierarchy Label, Distribution Label, … Transitive closure compression: Chain-Cover, Tree-Cover, … These have non-linear index construction time and index size, and may generate an unacceptably large index. Label+G approach: answer Query(u, v) using label(u) and label(v), accessing G if needed. Interval labeling: GRIPP, GRAIL, Ferrari, … These have linear index size, but may need to perform a DFS.
Main Idea of IP Labeling Both set comparisons, on Out(·) and on In(·), are time- and space-consuming if an exact answer is needed for large sets.
Main Idea of IP Labeling Based on min-wise independent permutations. Answers queries with a high-probability guarantee. Linear index construction time and index size.
Min-wise Independent Permutation
K-min-wise Independent Permutation We propose to use the k smallest numbers instead of only the single smallest number to improve performance.
K-min-wise Independent Permutation
Independent Permutation Generation Knuth Shuffle
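The Knuth (Fisher-Yates) shuffle named above produces a uniformly random permutation in linear time. A minimal sketch follows; the function name `knuth_shuffle` and the seed are illustrative choices, not taken from the slides.

```python
import random


def knuth_shuffle(n, rng=None):
    """Return a uniformly random permutation of 0..n-1 (Fisher-Yates)."""
    if rng is None:
        rng = random.Random()
    perm = list(range(n))
    # Walk from the last position down; swap each slot with a random
    # position at or before it (randint is inclusive on both ends).
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)
        perm[i], perm[j] = perm[j], perm[i]
    return perm


# Assign a random number to each of 12 vertices, as in a 12-vertex example.
perm = knuth_shuffle(12, random.Random(2014))
print(perm)
```

Here `perm[i]` would be the number assigned to vertex i; every value in 0..n-1 is used exactly once.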
IP Label The IP label of u consists of two parts: L_out(u): the min_k{ } set of Out(u), i.e. min_k{Out(u)}; L_in(u): the min_k{ } set of In(u), i.e. min_k{In(u)}.
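One way to compute both label parts is a sweep over a topological order, merging the labels of neighbours and keeping only the k smallest values. This is a sketch under assumptions: it takes Out(u) and In(u) to include u itself, and the names `ip_labels`, `toy_adj`, and `toy_perm` are hypothetical. It relies on the fact that the k smallest values of a union can be obtained from the k smallest values of each part.

```python
import heapq


def ip_labels(adj, perm, k):
    """adj: vertex -> list of successors (a DAG); perm: vertex -> number."""
    radj = {u: [] for u in adj}
    indeg = {u: 0 for u in adj}
    for u, succs in adj.items():
        for v in succs:
            radj[v].append(u)
            indeg[v] += 1

    # Kahn's algorithm for a topological order.
    order = [u for u in adj if indeg[u] == 0]
    i = 0
    while i < len(order):
        u = order[i]
        i += 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)

    def sweep(vertices, nbrs):
        labels = {}
        for u in vertices:
            cand = {perm[u]}
            for w in nbrs[u]:
                cand |= labels[w]  # neighbours were processed earlier
            labels[u] = set(heapq.nsmallest(k, cand))
        return labels

    l_out = sweep(reversed(order), adj)  # successors first
    l_in = sweep(order, radj)            # predecessors first
    return l_out, l_in


# Hypothetical toy DAG and permutation, not the slide's example.
toy_adj = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [3]}
toy_perm = {0: 3, 1: 0, 2: 4, 3: 1, 4: 2}
l_out, l_in = ip_labels(toy_adj, toy_perm, k=2)
print(l_out[0], l_in[3])  # {0, 1} {0, 1}
```

Each vertex stores at most k numbers per label, so the index size stays linear in the number of vertices for a fixed k.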
IP Label (for k = 5)
Vertex  L_out              L_in
v0      {0, 1, 2, 3, 4}    {7}
v1      {0, 1, 2, 3, 4}    {11}
v2      {2, 3, 4, 8, 10}   {7, 8}
v3      {1, 2, 3, 4, 6}    {6, 7}
v4      {2, 3, 4, 10}      {3, 6, 7, 8, 11}
v5      {0, 1, 5, 9, 10}   {0, 7, 11}
v6      {2, 10}            {2, 3, 6, 7, 8}
v7      {1}                {0, 1, 6, 7, 11}
v8      {10}               {0, 2, 3, 6, 7}
v9      {4}                {3, 4, 6, 7, 8}
v10     {9}                {0, 7, 9, 11}
v11     {5}                {0, 5, 7, 11}
IP Label Q1: Query(v2, v7). L_out(v2) = {2, 3, 4, 8, 10}, L_out(v7) = {1}. The value 1 in L_out(v7) is smaller than the largest value in L_out(v2) yet missing from it, so Out(v7) ⊄ Out(v2) and v2 cannot reach v7.
IP Label Q2: Query(v3, v2). L_out(v3) = {1, 2, 3, 4, 6}, L_out(v2) = {2, 3, 4, 8, 10}, L_in(v3) = {6, 7}, L_in(v2) = {7, 8}. L_out gives no contradiction, but 6 is in L_in(v3) and not in L_in(v2), which holds all of In(v2) because it has fewer than k = 5 values; so In(v3) ⊄ In(v2) and v3 cannot reach v2.
IP Label Q3: Query(v1, v8). Need to perform DFS. Effect?
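The label-only test behind these example queries can be sketched as follows. The label values are copied from the k = 5 table; the function names and the handling of labels with fewer than k values (treated as holding the complete set) are assumptions of this sketch rather than a statement of the authors' exact procedure.

```python
K = 5
# Labels for the vertices used in the example queries, from the k = 5 table.
L_OUT = {1: {0, 1, 2, 3, 4}, 2: {2, 3, 4, 8, 10}, 3: {1, 2, 3, 4, 6},
         7: {1}, 8: {10}}
L_IN = {1: {11}, 2: {7, 8}, 3: {6, 7},
        7: {0, 1, 6, 7, 11}, 8: {0, 2, 3, 6, 7}}


def proves_not_subset(label_a, label_b, k):
    """True if the labels prove A is not a subset of B.

    label_x holds the k smallest values of X. A value x of A that is
    missing from label_b is a witness when label_b is the whole of B
    (fewer than k values) or when x is below the largest kept value.
    """
    b_complete = len(label_b) < k
    b_max = max(label_b)
    return any(x not in label_b and (b_complete or x < b_max)
               for x in label_a)


def label_says_unreachable(u, v):
    # u reaches v implies Out(v) ⊆ Out(u) and In(u) ⊆ In(v).
    return (proves_not_subset(L_OUT[v], L_OUT[u], K)
            or proves_not_subset(L_IN[u], L_IN[v], K))


print(label_says_unreachable(2, 7))  # True: 1 is missing from L_out(v2)
print(label_says_unreachable(3, 2))  # True: 6 is missing from L_in(v2)
print(label_says_unreachable(1, 8))  # False: undecided, fall back to DFS
```

A `False` result does not mean "reachable"; it only means the labels found no witness, so the graph has to be searched.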
IP Label Q4: Query(v1, v3)
IP Label Q4: Query(v1, v3). The probability increases significantly!
IP Label As the DFS goes deeper, it becomes much more likely that the labels can answer an unreachability query, so the search can stop at an early stage.
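The early-stopping behaviour can be sketched as a DFS that applies the label test at every visited vertex and does not expand a vertex once its label rules out the target. For brevity this sketch uses only L_out, builds the labels by brute force, and runs on a hypothetical toy graph; all names are illustrative.

```python
import heapq


def out_labels(adj, perm, k):
    """Brute-force min-k labels of Out(u); fine for a tiny example only."""
    def out_set(u):
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen
    return {u: set(heapq.nsmallest(k, (perm[w] for w in out_set(u))))
            for u in adj}


def cannot_reach(l_out, x, v, k):
    """True if the labels prove Out(v) is not a subset of Out(x)."""
    lx, lv = l_out[x], l_out[v]
    complete = len(lx) < k
    return any(y not in lx and (complete or y < max(lx)) for y in lv)


def query(adj, l_out, u, v, k):
    """Return (answer, number of vertices expanded)."""
    visited, expanded, stack = set(), 0, [u]
    while stack:
        x = stack.pop()
        if x == v:
            return True, expanded
        if x in visited:
            continue
        visited.add(x)
        if cannot_reach(l_out, x, v, k):
            continue  # pruned: nothing below x can lead to v
        expanded += 1
        stack.extend(adj[x])
    return False, expanded


# Hypothetical toy DAG and permutation, not the slide's example.
toy_adj = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [3]}
toy_perm = {0: 3, 1: 0, 2: 4, 3: 1, 4: 2}
toy_l_out = out_labels(toy_adj, toy_perm, k=2)
print(query(toy_adj, toy_l_out, 0, 3, k=2))
print(query(toy_adj, toy_l_out, 0, 4, k=2))
```

Pruning is safe because a vertex x that reaches v must satisfy Out(v) ⊆ Out(x), so the test never fires on a vertex that lies on a path to v.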
Two Optimizations Huge-Vertex Label: build an additional index to handle the huge vertices of the graph. Level Label: use the topological structure to prune the search space.
Level Label
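The slide's own definition of the level label is not preserved here, so the following is only a sketch of one common topological-level scheme and may differ from the authors' definition. It assumes level(v) is 0 for a vertex with no incoming edges and otherwise one more than the largest level among its in-neighbours; then u can reach a different vertex v only if level(u) < level(v). The names `topo_levels` and `level_prunes` are hypothetical.

```python
def topo_levels(adj):
    """Longest-path level from the sources of a DAG (assumed definition)."""
    indeg = {u: 0 for u in adj}
    for succs in adj.values():
        for v in succs:
            indeg[v] += 1
    level = {u: 0 for u in adj}
    queue = [u for u in adj if indeg[u] == 0]
    i = 0
    while i < len(queue):
        u = queue[i]
        i += 1
        for v in adj[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return level


def level_prunes(level, u, v):
    """True if the levels alone prove that u cannot reach v."""
    return u != v and level[u] >= level[v]


# Hypothetical toy DAG, not the slide's example.
toy_adj = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [3]}
level = topo_levels(toy_adj)
print(level)                      # {0: 0, 1: 1, 2: 1, 3: 2, 4: 0}
print(level_prunes(level, 3, 0))  # True
print(level_prunes(level, 0, 3))  # False
```

Like the IP label test, this check can only rule reachability out; a `False` result leaves the query undecided.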
Huge-Vertex Label
Vertex  L_hv
v0      {0}
v1
v2      {0}
v3      {0}
v4
v5
v6      {0, 4}
v7      {0, 5}
v8      {0, 5}
v9      {0, 4}
v10     {0, 5}
v11     {0, 5}
Huge-Vertex Label Query(v0, v11)
Huge-Vertex Label Query(v0, v1)
Huge-Vertex Label Query(v5, v6)
Performance Studies Real Datasets
Dataset      |V(G)|   |E(G)|   d_avg   R-ratio
uniprotenc   25M                       E-7
twitter      18M                       E-2
web-uk       22M      38M              E-1
citeseerx    6.5M     15M              E-4
go-uniprot   6.9M     34M              E-6
govwild      8.0M     23M              E-5
Performance Studies Index Construction Time (in seconds). Methods compared: TF-Label, DL, GRAIL, Ferrari, IP+. Datasets: uniprotenc, twitter, web-uk, citeseerx, go-uniprot, govwild.
Performance Studies Query Time (in milliseconds). Methods compared: TF-Label, DL, GRAIL, Ferrari, IP+. Datasets: uniprotenc, twitter, web-uk, citeseerx, go-uniprot, govwild.
Performance Studies
Distribution of the number of vertices visited
Conclusion We propose a new IP labeling approach, the first to exploit randomness to answer reachability queries. Our labeling approach has linear index construction time and index size. Through independent permutation, the query performance is guaranteed with high probability. We evaluate the proposed approach in extensive experimental studies, and it shows both good efficiency and good scalability.