Reachability Querying: An Independent Permutation Labeling Approach (published in VLDB 2014) Presenter: WEI, Hao
Graph Reachability Query: Given a directed graph G = (V, E) and two vertices u and v, u is said to reach v if there exists a path from u to v in G. Any directed graph can be easily transformed into a DAG by collapsing each strongly connected component into a single vertex; the query is trivial if u and v are in the same strongly connected component. Examples: Query(v1, v8) is reachable; Query(v2, v11) is unreachable.
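The definition above can be checked directly with a graph search. The following is a minimal sketch, assuming the graph is stored as a dictionary mapping each vertex to a list of its out-neighbours; the function name `reaches` and the adjacency-dictionary layout are illustrative choices, not taken from the slides.

```python
from collections import deque


def reaches(graph, u, v):
    """Return True if there is a directed path from u to v (breadth-first search).

    `graph` is assumed to map each vertex to an iterable of out-neighbours.
    A vertex is considered to reach itself.
    """
    if u == v:
        return True
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in graph.get(x, ()):
            if y == v:
                return True
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False
```

Each query of this kind may touch every vertex and edge, which is the cost that index-based approaches try to avoid on large graphs.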
The Issue and the Challenge: The ‘Big Data’ era brings large graphs with millions of nodes and edges. For example, the web-uk dataset has 133 million nodes and 5 billion edges, and the DAG of web-uk has 22 million nodes and 38 million edges. Traditional approaches are not applicable at this scale.
Related Work: Recent work builds an index, label(u), offline for every node u.
Label-Only approach: answers Query(u, v) using only label(u) and label(v). Examples are Hop Labeling (TF-Label, Hierarchy Label, Distribution Label, …) and Transitive Closure Compression (Chain-Cover, Tree-Cover, …). These methods have non-linear index construction time and index size, and may generate an unacceptably large index.
Label+G approach: answers Query(u, v) using label(u) and label(v), with the possibility of accessing G if needed. Examples are interval labeling methods (GRIPP, GRAIL, Ferrari, …). These methods have linear index size, but may need to perform a DFS at query time.
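To illustrate the Label+G pattern, here is a simplified sketch in the spirit of interval labeling, assuming a DAG stored as an adjacency dictionary. It is not the exact algorithm of GRIPP, GRAIL, or Ferrari; the function names `build_intervals` and `query` are hypothetical. Each vertex gets an interval (low, post) from a single post-order traversal; if the interval of v is not contained in the interval of u, then u cannot reach v, and otherwise the query falls back to a DFS over G that is pruned with the same test.

```python
def build_intervals(graph):
    """Assign each vertex of a DAG an interval (low, post).

    post is the post-order rank; low is the smallest rank among the vertex
    and everything reachable from it. Recursive, so only suitable for small
    graphs in this sketch.
    """
    label = {}
    counter = 0

    def visit(x):
        nonlocal counter
        low = None
        for y in graph.get(x, ()):
            if y not in label:
                visit(y)
            low = label[y][0] if low is None else min(low, label[y][0])
        counter += 1
        label[x] = (counter if low is None else min(low, counter), counter)

    for x in graph:
        if x not in label:
            visit(x)
    return label


def query(graph, label, u, v):
    """Answer whether u reaches v, using labels first and a pruned DFS if needed."""
    lo_v, post_v = label[v]
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        lo_x, post_x = label[x]
        if not (lo_x <= lo_v and post_v <= post_x):
            continue  # interval(v) is not inside interval(x): x cannot reach v
        for y in graph.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False
```

Interval containment is necessary but not sufficient for reachability, which is why the fallback search is still required when the intervals do not rule the pair out.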
Main Idea of IP Labeling Both are time/space consuming if an exact answer is needed for large sets.
Main Idea of IP Labeling: IP labeling is based on min-wise independent permutations. It answers queries with a high-probability guarantee, and has linear index construction time and index size.
Min-wise Independent Permutation
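The property that makes a permutation useful here is a simple fact about minima: if a set A is contained in a set B, then the smallest permuted value in A cannot be smaller than the smallest permuted value in B. The sketch below shows the resulting one-sided test on plain Python sets. The helper names (`random_permutation`, `min_label`, `may_contain`) are hypothetical, and a seeded shuffle stands in for a min-wise independent permutation; both sets are assumed to be non-empty.

```python
import random


def random_permutation(universe, seed=0):
    """Map every element of `universe` to a distinct rank in 0..n-1, at random."""
    items = sorted(universe)
    ranks = list(range(len(items)))
    random.Random(seed).shuffle(ranks)
    return dict(zip(items, ranks))


def min_label(s, perm):
    """Return the smallest permuted value among the elements of a non-empty set."""
    return min(perm[x] for x in s)


def may_contain(big, small, perm):
    """Return False only when `small` is certainly not a subset of `big`.

    A True result is inconclusive: the sets may or may not be nested.
    """
    return min_label(small, perm) >= min_label(big, perm)
```

The test compares two single numbers instead of two whole sets, so it never rejects a true containment, but it can fail to detect a non-containment.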
K-min-wise Independent Permutation: We propose to use the k smallest numbers instead of only the single smallest number to improve the performance.
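One way to picture the k-smallest idea on a DAG is sketched below. It assumes that the set attached to a vertex is the set of vertices reachable from it (including itself), so that u reaching v implies the set of v is contained in the set of u. This is a simplified illustration rather than the paper's full IP+ index: it keeps only one label per vertex, uses a seeded shuffle as the permutation, and all function names are hypothetical. Labels are built in reverse topological order by merging the children's labels, and the test can only answer "certainly unreachable" or "undecided".

```python
import random


def topological_order(graph):
    """Kahn's algorithm on a DAG given as {vertex: [out-neighbours]}."""
    nodes = set(graph) | {w for ws in graph.values() for w in ws}
    indeg = {x: 0 for x in nodes}
    for ws in graph.values():
        for w in ws:
            indeg[w] += 1
    ready = [x for x in nodes if indeg[x] == 0]
    order = []
    while ready:
        x = ready.pop()
        order.append(x)
        for w in graph.get(x, ()):
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return order


def build_out_labels(graph, k, seed=0):
    """Label each vertex with the k smallest permuted values of its reachable set."""
    order = topological_order(graph)
    ranks = list(range(len(order)))
    random.Random(seed).shuffle(ranks)
    perm = dict(zip(sorted(order), ranks))
    labels = {}
    for x in reversed(order):  # successors are labelled before x
        pool = {perm[x]}
        for w in graph.get(x, ()):
            pool.update(labels[w])
        labels[x] = sorted(pool)[:k]
    return labels


def maybe_reaches(labels, k, u, v):
    """Return False only when u certainly cannot reach v; True is inconclusive."""
    lu, lv = labels[u], labels[v]
    lu_set = set(lu)
    for x in lv:
        # If x belonged to u's set and is within the range covered by u's
        # label, it would have to appear in u's label.
        if (len(lu) < k or x <= lu[-1]) and x not in lu_set:
            return False
    return True
```

A pair that passes the test would still need a fallback (for example a graph search) to be confirmed; larger k makes the label more likely to expose a non-containment, at the cost of a larger index.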
Performance Studies: Index Construction Time (in seconds). Methods compared: TF-Label, DL, GRAIL, Ferrari, IP+. Datasets: uniprotenc, twitter, web-uk, citeseerx, go-uniprot, govwild.
Performance Studies: Query Time (in milliseconds). Methods compared: TF-Label, DL, GRAIL, Ferrari, IP+. Datasets: uniprotenc, twitter, web-uk, citeseerx, go-uniprot, govwild.
Distribution of the number of vertices visited
Conclusion: We propose a new IP labeling approach, the first to exploit randomness to answer reachability queries. Our new labeling approach has linear index construction time and index size. With independent permutations, the query performance is guaranteed with high probability. We analyze the performance of our proposed approach through extensive experimental studies, and our approach shows both good efficiency and scalability.