Reachability Querying: An Independent Permutation Labeling Approach (published in VLDB 2014). Presenter: WEI, Hao
Graph Reachability Query: Given a directed graph G = (V, E) and two vertices u and v, u is said to reach v if there exists a path from u to v in G. Any directed graph can be easily transformed into a DAG by collapsing each strongly connected component into a single node; the query is trivial if u and v are in the same strongly connected component. [Figure: example graph with vertices numbered 0 to 11] Query(v1, v8): Reachable. Query(v2, v11): Unreachable.
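The definition and the DAG transformation on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `reaches` and `condense`, the adjacency-dict representation, and the toy graph are assumptions made here (the edges of the slide's figure are not recoverable from the transcript), and Kosaraju's algorithm is just one standard way to find strongly connected components.

```python
from collections import deque

def reaches(adj, u, v):
    """Return True if a directed path from u to v exists (plain BFS)."""
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

def condense(adj):
    """Collapse strongly connected components (Kosaraju); return (comp, dag)."""
    # Pass 1: iterative DFS, recording vertices by increasing finishing time.
    order, seen = [], set()
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:  # all out-neighbours explored: node is finished
                order.append(node)
                stack.pop()
    # Pass 2: sweep the reversed graph in decreasing finishing time.
    radj = {u: [] for u in adj}
    for u in adj:
        for v in adj[u]:
            radj[v].append(u)
    comp = {}
    for root in reversed(order):
        if root in comp:
            continue
        comp[root] = root  # the root vertex names its component
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in radj[node]:
                if prev not in comp:
                    comp[prev] = root
                    stack.append(prev)
    # Keep only edges that cross components; the result is acyclic.
    dag = {c: set() for c in set(comp.values())}
    for u in adj:
        for v in adj[u]:
            if comp[u] != comp[v]:
                dag[comp[u]].add(comp[v])
    return comp, dag

# Hypothetical toy graph: a, b, c form a cycle; every vertex appears as a key.
graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": [], "e": ["d"]}
comp, dag = condense(graph)
same_component = comp["a"] == comp["c"]      # trivial case: same SCC
a_reaches_d = reaches(dag, comp["a"], comp["d"])
```

After condensation, a query between two vertices of the same component is answered immediately, and every other query runs on the smaller DAG.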
The Issue and the Challenge: The 'Big Data' era brings large graphs with millions of nodes and edges. The web-uk dataset has 133 million nodes and 5 billion edges; the DAG of web-uk has 22 million nodes and 38 million edges. Traditional approaches are not applicable at this scale.
Related Work
Recent work builds an index, label(u), offline for every node u.
Label-Only approach: answers Query(u, v) using only label(u) and label(v).
  Hop Labeling: TF-Label, Hierarchy Label, Distribution Label, …
  Transitive Closure Compression: Chain-Cover, Tree-Cover, …
  Non-linear index construction time and index size; may generate an unacceptably large index.
Label+G approach: answers Query(u, v) using label(u) and label(v), accessing G if needed.
  Interval labeling: GRIPP, GRAIL, Ferrari, …
  Linear index size, but may need to perform a DFS.
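To make the Label+G idea concrete, here is a minimal sketch of interval labeling with a DFS fallback. It is loosely modeled on the single-traversal form of GRAIL-style labels and is not the implementation of any system named above; `build_intervals`, `contains`, `query`, and the toy DAG are hypothetical names and data introduced for illustration. Each vertex gets an interval (low, rank) from one post-order traversal. If u reaches v, the interval of v lies inside the interval of u, so a failed containment test proves unreachability, while a successful one still has to be confirmed by searching G.

```python
def build_intervals(adj):
    """Label each DAG vertex with (low, rank) from one post-order DFS.

    rank is the post-order number; low is the smallest rank among the
    vertex and everything it can reach. Recursive for brevity, so very
    deep graphs would need an iterative version.
    """
    interval = {}
    counter = 0

    def visit(u):
        nonlocal counter
        low = None
        for c in adj[u]:
            if c not in interval:
                visit(c)
            child_low = interval[c][0]
            low = child_low if low is None else min(low, child_low)
        counter += 1
        interval[u] = (counter if low is None else min(low, counter), counter)

    for u in adj:
        if u not in interval:
            visit(u)
    return interval

def contains(outer, inner):
    """True if interval `inner` lies inside interval `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def query(adj, interval, u, v):
    """Answer reachability: labels first, pruned DFS over G if needed."""
    if not contains(interval[u], interval[v]):
        return False  # containment is necessary, so this answer is exact
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in adj[x]:
            # Skip branches whose interval already rules out reaching v.
            if y not in seen and contains(interval[y], interval[v]):
                seen.add(y)
                stack.append(y)
    return False

# Hypothetical toy DAG; every vertex appears as a key.
toy_dag = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [2], 5: []}
toy_intervals = build_intervals(toy_dag)
```

On this toy DAG the interval of vertex 0 lies inside the interval of vertex 4 although 4 does not reach 0, which is exactly the case where the label alone is inconclusive and the search over G is needed.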
Main Idea of IP Labeling Both are time/space consuming if an exact answer is needed for large sets.
Main Idea of IP Labeling: based on min-wise independent permutations; answers queries with a high-probability guarantee; linear index construction time and index size.
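The transcript does not preserve how the labels are built, so the following is only a sketch of how a min-wise permutation can answer most negative queries from small labels. It rests on assumptions that are not taken from the slides: Out(u) denotes the set of vertices reachable from u, including u itself; each label keeps the k smallest permutation ranks found in Out(u); only out-labels are used; and all function names are hypothetical. The underlying fact is that if u reaches v then Out(v) is a subset of Out(u), so any rank in v's label that should have appeared in u's label but does not proves that u cannot reach v. When the labels are inconclusive, the sketch falls back to a pruned search over G.

```python
import random
from collections import deque

def topological_order(adj):
    """Kahn's algorithm; adj must list every vertex as a key."""
    indeg = {u: 0 for u in adj}
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    queue = deque(u for u in adj if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

def build_out_labels(adj, k, seed=0):
    """Keep the k smallest permutation ranks over Out(u) for each u."""
    rng = random.Random(seed)
    ranks = list(range(len(adj)))
    rng.shuffle(ranks)               # a random permutation of the vertices
    pi = dict(zip(adj, ranks))
    labels = {}
    # Children first: the k smallest ranks of a union can be computed
    # from the k smallest ranks of each part.
    for u in reversed(topological_order(adj)):
        merged = {pi[u]}
        for v in adj[u]:
            merged.update(labels[v])
        labels[u] = sorted(merged)[:k]
    return labels

def surely_unreachable(label_u, label_v, k):
    """True only when the labels prove Out(v) is not a subset of Out(u)."""
    members = set(label_u)
    truncated = len(label_u) == k    # label_u may omit larger ranks
    for x in label_v:
        if x not in members and (not truncated or x < label_u[-1]):
            return True
    return False

def query(adj, labels, k, u, v):
    """Labels answer many negative queries; otherwise search G with pruning."""
    if surely_unreachable(labels[u], labels[v], k):
        return False
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in adj[x]:
            if y not in seen and not surely_unreachable(labels[y], labels[v], k):
                seen.add(y)
                queue.append(y)
    return False

# Hypothetical toy DAG and label size.
toy_dag = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [2], 5: []}
toy_labels = build_out_labels(toy_dag, k=2, seed=1)
```

The label test never rejects a reachable pair, so the answers stay exact; the randomness only affects how often the fallback search is needed.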
Performance Studies: Index Construction Time (in seconds)
Dataset      TF-Label   DL       GRAIL    Ferrari   IP+
uniprotenc   58.529     22.280   58.242   24.292    18.96
twitter      15.291     13.719   32.323   19.972    12.44
web-uk       ---        24.240   44.031   26.927    17.46
citeseerx    91.877     12.045   23.170   19.792    7.54
go-uniprot   38.668     18.277   44.557   40.365    9.68
govwild      30.520     18.584   29.237   19.924    8.45
Performance Studies: Query Time (in milliseconds)
Dataset      TF-Label   DL        GRAIL     Ferrari   IP+
uniprotenc   119.164    119.618   820.249   116.351   54.205
twitter      102.923    104.698   ---       82.212    79.285
web-uk       ---        146.429   ---       214.857   253.082
citeseerx    230.318    111.329   28774     131.534   101.444
go-uniprot   55.279     153.214   499.505   313.300   34.577
govwild      254.785    128.199   719.494   295.432   112.990
Conclusion: We propose a new IP labeling approach, the first to use randomness to answer reachability queries. The approach has linear index construction time and index size. Thanks to independent permutations, query performance is guaranteed with high probability. Extensive experimental studies show that the approach is both efficient and scalable.