Reachability Querying: An Independent Permutation Labeling Approach (published in VLDB 2014) Presenter: WEI, Hao.

Graph Reachability Query
Given a directed graph G = (V, E) and two vertices u and v, u is said to reach v if there exists a path from u to v over G.
Any directed graph can easily be transformed into a DAG by collapsing each strongly connected component into a single vertex; the query is trivial if u and v are in the same strongly connected component.
[Figure: example DAG with vertices v0 to v11.]
- Query(v1, v8): reachable
- Query(v2, v11): unreachable
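As a baseline for the definition above, a reachability query can be answered with a plain depth-first search. The sketch below is illustrative only: the function name `reaches` and the 5-vertex `toy_dag` are assumptions made for this example, not the graph drawn on the slide.

```python
from typing import Dict, List


def reaches(graph: Dict[int, List[int]], u: int, v: int) -> bool:
    """Iterative DFS: True if there is a path from u to v (u reaches itself)."""
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in graph.get(x, []):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


# Hypothetical DAG used only for illustration.
toy_dag = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [3]}
print(reaches(toy_dag, 0, 3))
print(reaches(toy_dag, 3, 0))
```

Each query of this kind may touch the whole graph, which is what the labeling approaches in the following slides try to avoid.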

The Issue and the Challenge
The ‘Big Data’ era brings us large graphs with millions of nodes and edges.
- web-uk dataset: 133 million nodes, 5 billion edges
- DAG of web-uk: 22 million nodes, 38 million edges
Traditional approaches are not applicable at this scale.

Related Work
Recent works build an index, label(u), offline for every node u.
Label-Only approach: answer Query(u, v) using only label(u) and label(v).
- Hop Labeling: TF-Label, Hierarchy Label, Distribution Label, …
- Transitive Closure Compression: Chain-Cover, Tree-Cover, …
- Non-linear index construction time and index size; may generate an unacceptably large index.
Label+G approach: answer Query(u, v) using label(u) and label(v), with the possibility of accessing G if needed.
- Interval labeling: GRIPP, GRAIL, Ferrari, …
- Linear index size, but may need to perform a DFS.

Main Idea of IP Labeling
Both are time/space consuming if an exact answer is needed for large sets.

Main Idea of IP Labeling
- Based on min-wise independent permutations
- High-probability guarantee for answering queries
- Linear index construction time and index size

Min-wise Independent Permutation

K-min-wise Independent Permutation
We propose to use the k smallest numbers instead of only the single smallest number to improve performance.
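To illustrate why keeping the k smallest numbers helps, the sketch below tests whether a set B can possibly be a subset of a set A by looking only at their min-k summaries. It relies on one certain fact: if B is a subset of A, every element of min_k(B) that is smaller than the largest element of min_k(A) must itself appear in min_k(A). The helper names `min_k` and `may_contain` and the sample sets are assumptions made for this example.

```python
from typing import Iterable, List


def min_k(values: Iterable[int], k: int) -> List[int]:
    """The k smallest distinct values, in ascending order."""
    return sorted(set(values))[:k]


def may_contain(label_a: List[int], label_b: List[int], k: int) -> bool:
    """Return False only when 'B is a subset of A' is impossible.

    label_a and label_b are the min-k summaries of A and B.
    """
    known = set(label_a)
    complete = len(label_a) < k          # fewer than k values: label_a is all of A
    bound = max(label_a) if label_a else -1
    for x in label_b:
        if x not in known and (complete or x < bound):
            return False                 # x would have to be in label_a, but is not
    return True                          # inconclusive: containment is still possible


# Hypothetical sets of already-permuted numbers.
A = {3, 7, 1, 9, 12, 5}
B = {7, 9}       # really a subset of A
C = {4, 12}      # not a subset of A
k = 3
print(may_contain(min_k(A, k), min_k(B, k), k))
print(may_contain(min_k(A, k), min_k(C, k), k))
```

A larger k makes the summaries more informative, so a non-subset is detected more often, at the cost of a larger label.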

K-min-wise Independent Permutation

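Independent Permutation Generation
The permutation is generated with the Knuth shuffle. In the running example, the vertices receive the following numbers:
Vertex   v0  v1  v2  v3  v4  v5  v6  v7  v8  v9  v10  v11
Number    7  11   8   6   3   0   2   1  10   4    9    5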
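The Knuth (Fisher-Yates) shuffle named on this slide can be sketched as follows. The function name `knuth_shuffle` and the optional `seed` parameter are assumptions added for reproducibility; the output is a uniformly random permutation of 0..n-1, with position i holding the number assigned to vertex i.

```python
import random
from typing import List, Optional


def knuth_shuffle(n: int, seed: Optional[int] = None) -> List[int]:
    """Return a uniformly random permutation of 0..n-1 in O(n) time."""
    rng = random.Random(seed)
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)            # inclusive on both ends
        perm[i], perm[j] = perm[j], perm[i]
    return perm


perm = knuth_shuffle(12, seed=2014)
print(perm)
```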

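IP Label
The IP label of u consists of two parts:
- L_out(u): the min_k{ } set of Out(u), i.e. min_k{Out(u)}
- L_in(u): the min_k{ } set of In(u), i.e. min_k{In(u)}
Here Out(u) is the set of permuted numbers of the vertices reachable from u, and In(u) that of the vertices that can reach u; both include u itself.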
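One way to compute such labels in a single pass over a DAG is to merge the labels of neighbours in topological order, since the k smallest numbers of a union are determined by the k smallest numbers of each part. The sketch below is an assumption about how this could be done, not the authors' code; `toy_dag`, `perm`, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]


def topological_order(graph: Graph) -> List[int]:
    """Kahn's algorithm; every vertex must appear as a key of graph."""
    indeg = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indeg[v] += 1
    ready = [u for u in graph if indeg[u] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in graph[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order


def merge_labels(graph: Graph, order: List[int], perm: List[int], k: int) -> Dict[int, List[int]]:
    """order must list each vertex after all of its successors in graph."""
    label: Dict[int, List[int]] = {}
    for u in order:
        merged = {perm[u]}               # a vertex belongs to its own Out/In set
        for v in graph[u]:
            merged.update(label[v])
        label[u] = sorted(merged)[:k]
    return label


def build_ip_labels(graph: Graph, perm: List[int], k: int) -> Tuple[Dict[int, List[int]], Dict[int, List[int]]]:
    order = topological_order(graph)
    reverse: Graph = {u: [] for u in graph}
    for u in graph:
        for v in graph[u]:
            reverse[v].append(u)
    l_out = merge_labels(graph, order[::-1], perm, k)   # sinks first
    l_in = merge_labels(reverse, order, perm, k)        # sources first
    return l_out, l_in


# Hypothetical DAG and permutation, for illustration only.
toy_dag = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [3]}
perm = [3, 0, 4, 1, 2]
l_out, l_in = build_ip_labels(toy_dag, perm, k=2)
print(l_out)
print(l_in)
```

Each vertex merges at most k numbers per edge, so for a fixed k the work is linear in the size of the graph.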

IP Label
Labels of the running example, for k = 5:
Vertex   L_out               L_in
v0       {0, 1, 2, 3, 4}     {7}
v1       {0, 1, 2, 3, 4}     {11}
v2       {2, 3, 4, 8, 10}    {7, 8}
v3       {1, 2, 3, 4, 6}     {6, 7}
v4       {2, 3, 4, 10}       {3, 6, 7, 8, 11}
v5       {0, 1, 5, 9, 10}    {0, 7, 11}
v6       {2, 10}             {2, 3, 6, 7, 8}
v7       {1}                 {0, 1, 6, 7, 11}
v8       {10}                {0, 2, 3, 6, 7}
v9       {4}                 {3, 4, 6, 7, 8}
v10      {9}                 {0, 7, 9, 11}
v11      {5}                 {0, 5, 7, 11}
[Figure: the example DAG annotated with the permuted numbers and intermediate L_out sets.]

IP Label
Q1: Query(v2, v7), for k = 5
L_out(v2) = {2, 3, 4, 8, 10}
L_out(v7) = {1}
The number 1 is smaller than the largest element of L_out(v2) but does not appear in it, so Out(v7) cannot be a subset of Out(v2) and v2 cannot reach v7.

IP Label
Q2: Query(v3, v2), for k = 5
L_out(v3) = {1, 2, 3, 4, 6}
L_out(v2) = {2, 3, 4, 8, 10}
L_in(v2) = {7, 8}
L_in(v3) = {6, 7}
The L_out labels are inconclusive: 2, 3 and 4 appear in L_out(v3), and 8 and 10 are larger than its largest element. The L_in labels decide the query: L_in(v2) has fewer than k elements, so it is all of In(v2), and it does not contain 6. Hence In(v3) is not a subset of In(v2) and v3 cannot reach v2.
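The label checks in these query slides can be sketched as below. The labels are copied from the k = 5 table in the slides (only the vertices needed here); the test itself is one reading of the slides, based on the fact that if u reaches v then Out(v) is a subset of Out(u) and In(u) is a subset of In(v). The names `subset_possible` and `certainly_unreachable` are hypothetical.

```python
from typing import Dict, List

K = 5
# Labels from the slide table, keyed by vertex index.
L_OUT: Dict[int, List[int]] = {
    1: [0, 1, 2, 3, 4], 2: [2, 3, 4, 8, 10], 3: [1, 2, 3, 4, 6], 7: [1], 8: [10],
}
L_IN: Dict[int, List[int]] = {
    1: [11], 2: [7, 8], 3: [6, 7], 7: [0, 1, 6, 7, 11], 8: [0, 2, 3, 6, 7],
}


def subset_possible(label_big: List[int], label_small: List[int]) -> bool:
    """False only if the set behind label_small cannot lie inside the set behind label_big."""
    known = set(label_big)
    whole_set_known = len(label_big) < K     # a short label lists the entire set
    return all(
        x in known or (not whole_set_known and x > label_big[-1])
        for x in label_small
    )


def certainly_unreachable(u: int, v: int) -> bool:
    """True when the labels alone prove that u cannot reach v."""
    return not (subset_possible(L_OUT[u], L_OUT[v]) and subset_possible(L_IN[v], L_IN[u]))


print(certainly_unreachable(2, 7))   # Query(v2, v7)
print(certainly_unreachable(3, 2))   # Query(v3, v2)
print(certainly_unreachable(1, 8))   # Query(v1, v8)
```

When the function returns False the labels are inconclusive, and the graph itself has to be searched.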

IP Label
Q3: Query(v1, v8), for k = 5
The labels alone cannot decide this query: we need to perform a DFS. Effect?

IP Label
Q4: Query(v1, v3), for k = 5
The probability increases significantly!

IP Label
As the DFS goes deeper, the labels become much more likely to answer an unreachability query, so the search can stop at an early stage.
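A label-pruned DFS of this kind might look like the sketch below: before expanding a vertex x, the labels of x and the target v are compared, and the branch is dropped when they prove that x cannot reach v. Everything here is an illustrative assumption: the toy graph, the permutation, the function names, and the brute-force label construction, which is used only to keep the example short.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[int, List[int]]


def closure(graph: Graph, u: int) -> Set[int]:
    """All vertices reachable from u, including u (plain DFS)."""
    seen, stack = {u}, [u]
    while stack:
        for y in graph[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def min_k_labels(graph: Graph, perm: List[int], k: int) -> Dict[int, List[int]]:
    """Brute-force min-k labels; fine for a toy graph, not a linear-time construction."""
    return {u: sorted(perm[x] for x in closure(graph, u))[:k] for u in graph}


def may_contain(big: List[int], small: List[int], k: int) -> bool:
    known, complete = set(big), len(big) < k
    return all(x in known or (not complete and x > big[-1]) for x in small)


def query(graph: Graph, l_out, l_in, k: int, u: int, v: int) -> Tuple[bool, int]:
    """Return (answer, number of vertices expanded by the DFS)."""
    expanded, seen, stack = 0, {u}, [u]
    while stack:
        x = stack.pop()
        if x == v:
            return True, expanded
        # x can reach v only if Out(v) lies inside Out(x) and In(x) lies inside In(v).
        if not (may_contain(l_out[x], l_out[v], k) and may_contain(l_in[v], l_in[x], k)):
            continue                      # labels prove x cannot reach v: prune
        expanded += 1
        for y in graph[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False, expanded


# Hypothetical DAG, its reverse, and a hypothetical permutation.
toy_dag = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [3]}
reverse = {0: [], 1: [0], 2: [0], 3: [1, 2, 4], 4: []}
perm = [3, 0, 4, 1, 2]
K = 2
l_out = min_k_labels(toy_dag, perm, K)
l_in = min_k_labels(reverse, perm, K)
print(query(toy_dag, l_out, l_in, K, 0, 3))
print(query(toy_dag, l_out, l_in, K, 3, 0))
```

Pruning never changes the answer, because a branch is dropped only when the labels rule out reachability; it only reduces how many vertices are expanded.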

Two Optimizations
- Huge-Vertex Label: build an additional index to handle the huge vertices of the graph
- Level Label: use the topological structure to prune the search space
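The slides do not spell out how the level label is defined, so the following is only a sketch under a common assumption: the level of a vertex is the length of the longest path from any source to it. Under that assumption, if u reaches a different vertex v then level(u) < level(v), so a query with level(u) >= level(v) can be answered "unreachable" immediately. The function names and `toy_dag` are hypothetical.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]


def topological_levels(graph: Graph) -> Dict[int, int]:
    """Assumed definition: level[v] = length of the longest path from a source to v."""
    indeg = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indeg[v] += 1
    level = {u: 0 for u in graph}
    ready = [u for u in graph if indeg[u] == 0]
    while ready:
        u = ready.pop()
        for v in graph[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:            # all predecessors done: level[v] is final
                ready.append(v)
    return level


def level_rules_out(level: Dict[int, int], u: int, v: int) -> bool:
    """True when the levels alone prove that u cannot reach v."""
    return u != v and level[u] >= level[v]


# Hypothetical DAG used only for illustration.
toy_dag = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [3]}
level = topological_levels(toy_dag)
print(level)
print(level_rules_out(level, 1, 2))
```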

Level Label

Huge-Vertex Label
[Figure: example DAG with vertices v0 to v11.]
Vertex   L_hv      Vertex   L_hv
v0       {0}       v6       {0, 4}
v1                 v7       {0, 5}
v2       {0}       v8       {0, 5}
v3       {0}       v9       {0, 4}
v4                 v10      {0, 5}
v5                 v11      {0, 5}

Huge-Vertex Label
Example: Query(v0, v11), with L_hv(v0) = {0} and L_hv(v11) = {0, 5}

Performance Studies
Real datasets:
Dataset      |V(G)|   |E(G)|   d_avg   R-ratio
uniprotenc   25M               0.999   1.30E-7
twitter      18M               1.013   7.39E-2
web-uk       22M      38M      1.678   1.50E-1
citeseerx    6.5M     15M      2.295   4.07E-4
go-uniprot   6.9M     34M      4.990   3.64E-6
govwild      8.0M     23M      2.948   7.20E-5

Performance Studies
Index construction time (in seconds):
Dataset      TF-Label   DL       GRAIL    Ferrari   IP+
uniprotenc   58.529     22.280   58.242   24.292    18.96
twitter      15.291     13.719   32.323   19.972    12.44
web-uk       ---        24.240   44.031   26.927    17.46
citeseerx    91.877     12.045   23.170   19.792    7.54
go-uniprot   38.668     18.277   44.557   40.365    9.68
govwild      30.520     18.584   29.237   19.924    8.45

Performance Studies
Query time (in milliseconds):
Dataset      TF-Label   DL        GRAIL     Ferrari   IP+
uniprotenc   119.164    119.618   820.249   116.351   54.205
twitter      102.923    104.698   ---       82.212    79.285
web-uk       ---        146.429   ---       214.857   253.082
citeseerx    230.318    111.329   28774     131.534   101.444
go-uniprot   55.279     153.214   499.505   313.300   34.577
govwild      254.785    128.199   719.494   295.432   112.990

Performance Studies

Distribution of the number of vertices visited

Conclusion
- We propose a new IP labeling approach, the first to exploit randomness to answer reachability queries.
- Our new labeling approach has linear index construction time and index size.
- With independent permutations, the query performance is guaranteed with high probability.
- We analyze the performance of our approach through extensive experimental studies; it shows both good efficiency and scalability.
