Reachability Querying: An Independent Permutation Labeling Approach (published in VLDB 2014) Presenter: WEI, Hao.



Graph Reachability Query
Given a directed graph G = (V, E) and two vertices u and v, u is said to reach v if there exists a path from u to v in G.
Any directed graph can easily be transformed into a DAG; the query is trivial if u and v are in the same strongly connected component.
Examples: Query(v1, v8) is reachable; Query(v2, v11) is unreachable.
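As a baseline for the definition above, a reachability query can be answered with a plain graph search and no index. The sketch below is a minimal illustration: the function name `reaches` and the toy DAG are assumptions made for this example, not the graph shown on the slides.

```python
def reaches(adj, u, v):
    """Return True if there is a directed path from u to v (u reaches itself)."""
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False

# Hypothetical toy DAG (adjacency lists), not the slide's example graph.
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["d"]}

print(reaches(adj, "a", "d"))  # True: a -> b -> d
print(reaches(adj, "d", "a"))  # False: d has no outgoing edges
```

Each query may visit every vertex and edge in the worst case, which is the cost the labeling approaches below try to avoid.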

The Issue and the Challenge
The 'Big Data' era brings us large graphs with millions of nodes and edges.
web-uk dataset: 133 million nodes, 5 billion edges
DAG of web-uk: 22 million nodes, 38 million edges
Traditional approaches are not applicable at this scale.

Related Work
Recent works build an index, label(u), offline for every node u.
Label-Only Approach: answer Query(u, v) using label(u) and label(v) only
  Hop Labeling: TF-Label, Hierarchy Label, Distribution Label, …
  Transitive Closure Compression: Chain-Cover, Tree-Cover, …
  Non-linear index construction time and index size; may generate an unacceptably large index
Label+G Approach: answer Query(u, v) using label(u) and label(v), accessing G if needed
  Interval Labeling: GRIPP, GRAIL, Ferrari, …
  Linear index size, but may need to perform a DFS

Main Idea of IP Labeling Both are time/space consuming if an exact answer is needed for large sets.

Main Idea of IP Labeling
Based on min-wise independent permutations
High-probability guarantee of answering a query
Linear index construction time and index size

Min-wise Independent Permutation

k-min-wise Independent Permutation
We propose using the k smallest numbers, instead of only the single smallest number, to improve performance.
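To make the "k smallest numbers" idea concrete, the sketch below computes a k-min signature of a set under a random permutation and checks the one-sided property that makes it useful for containment tests. The helper name `min_k`, the seed, and the sets A and B are assumptions chosen for illustration.

```python
import random

def min_k(items, perm, k):
    """Return the k smallest permuted values of `items` (all of them if fewer than k)."""
    return sorted(perm[x] for x in items)[:k]

rng = random.Random(0)               # fixed seed, only for repeatability
universe = list(range(12))
values = universe[:]
rng.shuffle(values)
perm = dict(zip(universe, values))   # element -> permuted value

A = {1, 4, 7}
B = {0, 1, 2, 4, 7, 9, 11}           # A is a subset of B
k = 3
sig_a, sig_b = min_k(A, perm, k), min_k(B, perm, k)

# Because A is contained in B, any value of sig_a that is smaller than
# max(sig_b) must itself appear in sig_b. A violation would prove A is not in B.
assert all(x in sig_b for x in sig_a if x < max(sig_b))
```

With k = 1 only the single minimum is compared; keeping k values gives more chances to observe a violation when the containment does not hold.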

k-min-wise Independent Permutation

Independent Permutation Generation Knuth Shuffle
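The Knuth (Fisher-Yates) shuffle named on this slide produces a uniformly random permutation in linear time. A minimal sketch follows; the function name and the optional `rng` parameter are assumptions for this example.

```python
import random

def knuth_shuffle(n, rng=random):
    """Return a uniformly random permutation of 0..n-1 in O(n) time."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)        # randint is inclusive on both ends
        perm[i], perm[j] = perm[j], perm[i]
    return perm

# perm[i] can serve as the permuted number assigned to vertex i.
print(knuth_shuffle(12, random.Random(1)))
```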

IP Label
The IP label of u consists of two parts:
L_out(u): the min_k{·} set of Out(u), i.e. min_k{Out(u)}
L_in(u): the min_k{·} set of In(u), i.e. min_k{In(u)}
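One way such labels could be built in a single pass over a DAG is sketched below. It assumes that Out(u) is the set of vertices reachable from u, including u itself, and that In(u) is the set of vertices that reach u, including u; the function name `build_labels`, the toy graph, and the permutation are hypothetical. The sketch relies on the fact that the k smallest values of a union can be taken from the k smallest values of each part.

```python
import heapq
from graphlib import TopologicalSorter  # Python 3.9+

def build_labels(adj, perm, k):
    """adj: vertex -> list of successors (a DAG); perm: vertex -> permuted value."""
    preds = {u: [] for u in adj}
    for u, succs in adj.items():
        for v in succs:
            preds[v].append(u)
    # TopologicalSorter expects node -> predecessors and yields predecessors first.
    order = list(TopologicalSorter(preds).static_order())
    l_out, l_in = {}, {}
    for u in reversed(order):            # successors are labeled before u
        cand = {perm[u]}.union(*(l_out[v] for v in adj[u]))
        l_out[u] = heapq.nsmallest(k, cand)
    for u in order:                      # predecessors are labeled before u
        cand = {perm[u]}.union(*(l_in[p] for p in preds[u]))
        l_in[u] = heapq.nsmallest(k, cand)
    return l_out, l_in

# Hypothetical toy DAG and permutation, not the slide's example.
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
perm = {"a": 2, "b": 0, "c": 3, "d": 1}
l_out, l_in = build_labels(adj, perm, k=2)
print(l_out)
print(l_in)
```

Each vertex merges at most k values per incident edge, so for a fixed k the work grows linearly with the size of the graph.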

IP Label (for k = 5)
Vertex  L_out               L_in
v0      {0, 1, 2, 3, 4}     {7}
v1      {0, 1, 2, 3, 4}     {11}
v2      {2, 3, 4, 8, 10}    {7, 8}
v3      {1, 2, 3, 4, 6}     {6, 7}
v4      {2, 3, 4, 10}       {3, 6, 7, 8, 11}
v5      {0, 1, 5, 9, 10}    {0, 7, 11}
v6      {2, 10}             {2, 3, 6, 7, 8}
v7      {1}                 {0, 1, 6, 7, 11}
v8      {10}                {0, 2, 3, 6, 7}
v9      {4}                 {3, 4, 6, 7, 8}
v10     {9}                 {0, 7, 9, 11}
v11     {5}                 {0, 5, 7, 11}

IP Label
Q1: Query(v2, v7)
L_out(v2) = {2, 3, 4, 8, 10}, L_out(v7) = {1}
The value 1 is smaller than 10 yet missing from L_out(v2), so it cannot belong to Out(v2); hence Out(v7) is not contained in Out(v2) and v2 cannot reach v7.

IP Label
Q2: Query(v3, v2)
L_out(v3) = {1, 2, 3, 4, 6}, L_out(v2) = {2, 3, 4, 8, 10}
L_in(v3) = {6, 7}, L_in(v2) = {7, 8}
L_in(v2) holds fewer than k values, so it is all of In(v2); the value 6 is in L_in(v3) but not in L_in(v2), so v3 cannot reach v2.

IP Label
Q3: Query(v1, v8)
The labels alone cannot decide this query, so a DFS is needed. Effect?
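The label-only test used in the queries above can be sketched as follows, with the label values copied from the k = 5 table. The function names are hypothetical, and the rule that a label with fewer than k values is the complete set is an assumption that follows from taking the k smallest values. The check is one-sided: it can prove unreachability but never reachability.

```python
K = 5

def refutes_subset(small, big, k=K):
    """True if the labels prove the set behind `small` is NOT inside the set behind `big`."""
    big_set = set(big)
    complete = len(big) < k              # fewer than k values: label is the whole set
    top = max(big) if big else None
    return any(x not in big_set and (complete or x < top) for x in small)

def label_check(l_out, l_in, u, v):
    """Return 'unreachable' if the labels rule out u -> v, else 'unknown'."""
    # If u reaches v, then Out(v) is inside Out(u) and In(u) is inside In(v).
    if refutes_subset(l_out[v], l_out[u]) or refutes_subset(l_in[u], l_in[v]):
        return "unreachable"
    return "unknown"

# Label values copied from the k = 5 table (only the vertices queried here).
l_out = {"v1": [0, 1, 2, 3, 4], "v2": [2, 3, 4, 8, 10], "v3": [1, 2, 3, 4, 6],
         "v7": [1], "v8": [10]}
l_in = {"v1": [11], "v2": [7, 8], "v3": [6, 7],
        "v7": [0, 1, 6, 7, 11], "v8": [0, 2, 3, 6, 7]}

print(label_check(l_out, l_in, "v2", "v7"))  # unreachable
print(label_check(l_out, l_in, "v3", "v2"))  # unreachable
print(label_check(l_out, l_in, "v1", "v8"))  # unknown
```

An "unknown" result is the case where the search has to fall back to the graph.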

IP Label
Q4: Query(v1, v3)
The probability increases significantly!

IP Label
As the DFS goes deeper, the labels become much more likely to answer an unreachability query, so the search can stop at an early stage.
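The early-stopping behaviour described above can be sketched as a DFS that applies the label test at every vertex it pops. This is an illustration under stated assumptions, not the authors' implementation: the labels are computed by brute force on a hypothetical toy DAG, Out and In are taken to include the vertex itself, and all function names are invented for the example.

```python
def out_sets(adj):
    """Out(u) for every u, by a DFS from each vertex (toy graphs only)."""
    out = {}
    for s in adj:
        seen, stack = {s}, [s]
        while stack:
            for y in adj[stack.pop()]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        out[s] = seen
    return out

def refuted(small, big, k):
    """True if sorted labels prove the set behind `small` is not inside that of `big`."""
    complete = len(big) < k              # fewer than k values: label is the whole set
    return any(x not in big and (complete or x < big[-1]) for x in small)

def guided_dfs(adj, l_out, l_in, k, u, v):
    """Return (answer, number of vertices expanded)."""
    stack, seen, expanded = [u], {u}, 0
    while stack:
        x = stack.pop()
        if x == v:
            return True, expanded
        if refuted(l_out[v], l_out[x], k) or refuted(l_in[x], l_in[v], k):
            continue                     # labels prove x cannot reach v: prune
        expanded += 1
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False, expanded

# Hypothetical toy DAG and permutation.
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["d", "f"], "f": []}
perm = {"a": 3, "b": 0, "c": 4, "d": 1, "e": 5, "f": 2}
k = 2
out = out_sets(adj)
inn = {v: {u for u in adj if v in out[u]} for v in adj}
l_out = {u: sorted(perm[w] for w in out[u])[:k] for u in adj}
l_in = {u: sorted(perm[w] for w in inn[u])[:k] for u in adj}

print(guided_dfs(adj, l_out, l_in, k, "a", "d"))
print(guided_dfs(adj, l_out, l_in, k, "a", "f"))
```

In the second query the labels rule out the start vertex immediately, so no vertex is expanded at all.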

Two Optimizations
Huge-Vertex Label: build an additional index to handle the huge vertices of the graph
Level Label: use the topological structure to prune the search space

Level Label
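The transcript does not preserve how the level label is defined, so the following is only one plausible reading of "use the topological structure to prune the search space". It assumes the level of a vertex is the length of the longest path reaching it from a source; under that assumption, u can reach a different vertex v only if level(u) < level(v). The function names and the toy DAG are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def levels(adj):
    """Longest-path depth of each vertex from any source of the DAG."""
    preds = {u: [] for u in adj}
    for u, succs in adj.items():
        for v in succs:
            preds[v].append(u)
    level = {}
    for u in TopologicalSorter(preds).static_order():   # predecessors first
        level[u] = 1 + max((level[p] for p in preds[u]), default=-1)
    return level

def level_prunes(level, u, v):
    """True if the levels alone prove that u cannot reach v."""
    return u != v and level[u] >= level[v]

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["d", "f"], "f": []}
level = levels(adj)
print(level_prunes(level, "d", "a"))  # True: d is deeper than a
print(level_prunes(level, "a", "d"))  # False: levels cannot rule this out
```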

Huge-Vertex Label
Vertex  L_hv
v0      {0}
v1      (none listed)
v2      {0}
v3      {0}
v4      (none listed)
v5      (none listed)
v6      {0, 4}
v7      {0, 5}
v8      {0, 5}
v9      {0, 4}
v10     {0, 5}
v11     {0, 5}

Huge-Vertex Label
Query(v0, v11)

Huge-Vertex Label
Query(v0, v1)

Huge-Vertex Label
Query(v5, v6)

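Performance Studies
Real datasets (values missing from the transcript are left blank; only the exponent of each R-ratio survives):
Dataset      |V(G)|  |E(G)|  d_avg  R-ratio
uniprotenc   25M                    E-7
twitter      18M                    E-2
web-uk       22M     38M            E-1
citeseerx    6.5M    15M            E-4
go-uniprot   6.9M    34M            E-6
govwild      8.0M    23M            E-5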

Performance Studies
Index construction time (in seconds) for TF-Label, DL, GRAIL, Ferrari, and IP+ on uniprotenc, twitter, web-uk, citeseerx, go-uniprot, and govwild.

Performance Studies
Query time (in milliseconds) for TF-Label, DL, GRAIL, Ferrari, and IP+ on uniprotenc, twitter, web-uk, citeseerx, go-uniprot, and govwild.

Performance Studies

Distribution of the number of vertices visited

Conclusion
We propose a new IP labeling approach, the first to use randomness to answer reachability queries.
Our labeling approach has linear index construction time and index size.
Thanks to independent permutations, query performance is guaranteed with high probability.
Extensive experimental studies show that our approach is both efficient and scalable.