Making Pattern Queries Bounded in Big Graphs. Yang Cao 1,2, Wenfei Fan 1,2, Jinpeng Huai 2, Ruizhe Huang 1. 1 University of Edinburgh, 2 Beihang University.

Presentation transcript:


2 Challenges introduced by big graphs
Graph pattern matching for querying data graphs:
- intractable for subgraph isomorphism;
- $O((|V| + |V_Q|)(|E| + |E_Q|))$ time for graph simulation.
Can we still answer queries on big data with limited resources? What happens when it comes to big graphs?
Using an SSD that reads 6 GB/s, a linear scan of a data set D would take
- 1.9 days when D is 1 PB ($10^{15}$ B);
- 5.28 years when D is 1 EB ($10^{18}$ B).
O(n) time is already beyond reach on big data in practice!
Social graphs are typically huge. Facebook graph: 1.26 billion nodes, 140 billion links, 300 PB.
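
The scan times quoted on this slide can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes "6G/s" means 6 x 10^9 bytes per second and that a year has 365 days; the function and constant names are illustrative only.

```python
# Assumption: 6 GB/s = 6e9 bytes per second; 1 year = 365 days.
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365


def scan_seconds(size_bytes: float, bytes_per_second: float = 6e9) -> float:
    """Time for one sequential pass over a data set of the given size."""
    return size_bytes / bytes_per_second


pb_days = scan_seconds(1e15) / SECONDS_PER_DAY                   # 1 PB
eb_years = scan_seconds(1e18) / SECONDS_PER_DAY / DAYS_PER_YEAR  # 1 EB

print(f"1 PB: {pb_days:.2f} days")
print(f"1 EB: {eb_years:.2f} years")
```

Under these assumptions the results round to the slide's figures of 1.9 days and 5.28 years.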

3 Making big graphs small: effective boundedness
Question: Can we find a class L of queries such that, for each Q in L and for any (possibly big) graph G, there exists a fraction $G_Q$ of G such that
- $Q(G) = Q(G_Q)$, and
- $G_Q$ can be identified in time determined by Q?
This makes the cost of computing Q(G) independent of |G|:
- $|G_Q|$ is independent of the size of G;
- query evaluation scales with G no matter how big G grows.
Such queries are called “effectively bounded” queries.

4 An example: Graph Search (IMDb)
Query: find pairs of first-billed actor and actress from the same country who co-starred in an award-winning film released in [...].
Semantic constraints on IMDb:
(C1) In each year, every award is presented to no more than 4 movies;
(C2) Each movie has at most 30 first-billed actors and actresses;
(C3) Each person has only one country of origin;
(C4) There are no more than 135 years, 24 major movie awards and 196 countries.

5 Effectively bounded query evaluation
A query plan:
(C4) Identify a set $V_1$ of 135 year nodes, 24 award nodes and 196 country nodes.
(C1) Fetch a set $V_2$ of at most 24*3*4 = 288 award-winning movie nodes, with no more than 288*2 = 576 edges connecting movies to awards and years.
(C2) Fetch a set $V_3$ of at most (30+30)*288 = 17280 actors and actresses with [...] edges.
(C3) Connect the actors and actresses in $V_3$ to country nodes in $V_1$, with at most [...] edges.
The plan accesses [...] nodes and [...] edges in total, NO MATTER HOW BIG the IMDb graph is: Q is “effectively bounded” under the semantic constraints.
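
The bounds in the plan follow from multiplying out the constraints (C1) to (C4). The sketch below reproduces the node bounds stated on the slide. The edge totals rest on an assumption that is not stated in the transcript: each fetched person contributes one edge to a movie and one edge to a country (by C3). All variable names are illustrative.

```python
# Bounds taken from constraints (C1)-(C4) as used on the slide.
YEARS, AWARDS, COUNTRIES = 135, 24, 196   # C4
MOVIES_PER_AWARD_AND_YEAR = 4             # C1
PEOPLE_PER_MOVIE = 30 + 30                # factor (30+30) used on the slide
YEARS_IN_QUERY = 3                        # factor 3 in 24*3*4 on the slide

v1 = YEARS + AWARDS + COUNTRIES                               # year, award, country nodes
v2 = AWARDS * YEARS_IN_QUERY * MOVIES_PER_AWARD_AND_YEAR      # movie nodes
v3 = PEOPLE_PER_MOVIE * v2                                    # actor and actress nodes
total_nodes = v1 + v2 + v3

movie_edges = 2 * v2   # each movie linked to one award and one year
# Assumption (not in the transcript): one movie edge and one country
# edge per fetched person.
cast_edges = v3
country_edges = v3
total_edges = movie_edges + cast_edges + country_edges

print("nodes:", v1, v2, v3, "total", total_nodes)
print("edges:", movie_edges, cast_edges, country_edges, "total", total_edges)
```

None of these quantities depends on the size of the IMDb graph; they are determined by the query and the constraints alone.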

6 Questions raised
(1) Given a pattern query Q and a set A of “semantic constraints”, can we determine whether Q is effectively bounded under A?
(2) If Q is effectively bounded, how can we generate a query plan to compute Q(G) in a big graph G by accessing a bounded $G_Q$?
(3) If Q is not bounded, can we make it “bounded” in G by adding simple extra constraints (indices)?
(4) Does the approach work on both localized queries (subgraph isomorphism) and non-localized queries (graph simulation)?
To answer these questions: a package of effectively bounded evaluation for pattern queries.

7 Overview
- Formalization of effective boundedness for graph pattern queries
  – Semantic constraints
  – Effectively bounded queries
- Deciding effective boundedness of localized pattern queries
  – Characterization and complexity
- Generating effectively bounded query plans for queries that are effectively bounded
- Making Q instance-bounded if it is not effectively bounded
- Extending the study to non-localized queries

8 Effectively bounded pattern queries: formulation

9 Access constraints on graphs
An access constraint is of the form S → (l, N), where S is a set of labels and l is a label. It combines a cardinality constraint with an index:
- Cardinality: G satisfies the constraint if for any S-labelled set $V_S$, there exist at most N l-labelled common neighbours of $V_S$.
- Index on G: given a $V_S$, find its l-labelled common neighbours.
Access schema: a set of access constraints.
Discovery: functional dependencies, simple aggregate queries, degree bounds, global constraints.
Maintenance: incrementally and locally, by inspecting changes to G only, independent of the size of G.
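
The cardinality part of an access constraint can be checked by brute force on a small graph. The sketch below is illustrative only and makes two assumptions that the slide does not state: an S-labelled set contains exactly one node per label in S, and an empty S is read as a global bound on the number of l-labelled nodes. The graph, labels and function names are hypothetical.

```python
from itertools import product


def build_adj(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def satisfies(adj, label, S, l, N):
    """Brute-force check of the constraint S -> (l, N) on a labelled graph."""
    by_label = {}
    for v, lab in label.items():
        by_label.setdefault(lab, []).append(v)
    # Assumption: an S-labelled set picks exactly one node per label in S.
    pools = [by_label.get(s, []) for s in sorted(S)]
    for v_s in product(*pools):
        if v_s:
            common = set.intersection(*(adj.get(v, set()) for v in v_s))
        else:
            common = set(label)  # empty S: global bound on l-labelled nodes
        if sum(1 for w in common if label[w] == l) > N:
            return False
    return True


# Toy graph: two movies won award "a" in year "y"; a third movie only has the year.
label = {"y": "year", "a": "award", "m1": "movie", "m2": "movie", "m3": "movie"}
adj = build_adj([("m1", "y"), ("m1", "a"), ("m2", "y"), ("m2", "a"), ("m3", "y")])

print(satisfies(adj, label, {"year", "award"}, "movie", 2))
print(satisfies(adj, label, {"year", "award"}, "movie", 1))
print(satisfies(adj, label, set(), "year", 135))
```

A real deployment would not scan the graph like this; the point of the index is to return the common neighbours of a given $V_S$ directly.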

10 Effectively bounded graph patterns
Graph pattern Q is effectively bounded under access schema A if, for any (big) graph G that satisfies A, there exists a subgraph $G_Q$ of G such that
- $Q(G) = Q(G_Q)$, and
- $G_Q$ can be identified in time determined by Q and A only.
Coping with big data: the cost is independent of the size of G.
Query plan (effectively bounded):
- Node fetching operations: identify $V_Q$ and $E_Q$ by using indices in A only.
- Building $G_Q(V_Q, E_Q)$.
- Return the evaluation results of Q on $G_Q$.

11 Localized and non-localized patterns
Data locality: Q is localized if for any G that matches Q, any node u and neighbor u’ of u in Q, and any match v of u in G, there must exist a match v’ of u’ in G such that v’ is a neighbor of v in G.
- Localized queries: subgraph queries (via subgraph isomorphism)
- Non-localized queries: simulation queries (via graph simulation)
Data locality makes localized queries more likely to be effectively bounded.
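
To see why simulation is non-localized, it helps to look at how a simulation relation is computed. The sketch below is the textbook fixpoint refinement over directed graphs, not the optimized algorithm behind the complexity bound quoted in this talk; the representation (successor lists plus label maps) and all names are assumptions made for illustration.

```python
def max_simulation(q_succ, q_label, g_succ, g_label):
    """Naive fixpoint: largest relation where each match preserves child edges."""
    # Start from all label-compatible pairs.
    sim = {u: {v for v in g_label if g_label[v] == q_label[u]} for u in q_label}
    changed = True
    while changed:
        changed = False
        for u in q_label:
            for u2 in q_succ.get(u, ()):
                # v stays a match of u only if some successor of v matches u2.
                keep = {v for v in sim[u]
                        if any(v2 in sim[u2] for v2 in g_succ.get(v, ()))}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
    return sim


# Pattern: a two-node cycle A -> B -> A.
q_label = {"A": "A", "B": "B"}
q_succ = {"A": ["B"], "B": ["A"]}

# Data graph 1: a path a1 -> b1 -> a2 -> b2 with a dead end at b2.
g1_label = {"a1": "A", "b1": "B", "a2": "A", "b2": "B"}
g1_succ = {"a1": ["b1"], "b1": ["a2"], "a2": ["b2"]}
path_result = max_simulation(q_succ, q_label, g1_succ, g1_label)

# Data graph 2: a genuine cycle a1 -> b1 -> a1.
g2_label = {"a1": "A", "b1": "B"}
g2_succ = {"a1": ["b1"], "b1": ["a1"]}
cycle_result = max_simulation(q_succ, q_label, g2_succ, g2_label)

print(path_result)
print(cycle_result)
```

In the path example, the dead end at b2 eliminates a1 three hops away: whether a node matches can depend on nodes arbitrarily far from it, which is the sense in which simulation lacks data locality.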

12 Effective boundedness of subgraph queries

13 The effective boundedness problem
EBnd(Q, A)
- Input: a subgraph query Q and an access schema A.
- Question: is Q effectively bounded under A?
That is, when can Q be answered scale-independently on any big graph G satisfying A, using the indices in A?
- Is there a sufficient and necessary condition for effective boundedness?
- What is the complexity?

14 Characterization for subgraph queries
The characterization is based on two notions: node coverage VCov(Q, A) and edge coverage ECov(Q, A).
Subgraph query Q is effectively bounded under access schema A iff (1) $VCov(Q, A) = V_Q$ and (2) $ECov(Q, A) = E_Q$.

15 Characterization for subgraph queries
A subgraph query Q is effectively bounded under an access schema A iff (1) $VCov(Q, A) = V_Q$ and (2) $ECov(Q, A) = E_Q$.
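
The definitions of VCov and ECov are not spelled out in this transcript, so the following is only a sketch under an assumed reading: a pattern node is covered if it is the target of a constraint with empty S, or if it is an l-labelled common neighbour (in Q) of an already covered S-labelled set; an edge (u1, u2) is covered if u1 belongs to a covered S-labelled set and u2 is an l-labelled common neighbour of that set. The pattern, schema and function names are hypothetical, and this is not the EBChk algorithm.

```python
from itertools import product


def covers(q_adj, q_label, schema):
    """Return (VCov, ECov) under the assumed inductive reading above."""
    by_label = {}
    for u, lab in q_label.items():
        by_label.setdefault(lab, []).append(u)

    def covered_sets(S, vcov):
        # One covered pattern node per label in S (assumption).
        pools = [[u for u in by_label.get(s, []) if u in vcov] for s in sorted(S)]
        return product(*pools)

    def targets(v_s, l):
        common = (set.intersection(*(q_adj[u] for u in v_s)) if v_s
                  else set(q_label))
        return {u for u in common if q_label[u] == l}

    vcov, changed = set(), True
    while changed:                      # fixpoint over the schema
        changed = False
        for S, l, _n in schema:
            for v_s in covered_sets(S, vcov):
                new = targets(v_s, l) - vcov
                if new:
                    vcov |= new
                    changed = True

    ecov = set()
    for S, l, _n in schema:
        if S:
            for v_s in covered_sets(S, vcov):
                for u2 in targets(v_s, l):
                    ecov.update(frozenset((u1, u2)) for u1 in v_s)
    return vcov, ecov


def is_effectively_bounded(q_adj, q_label, schema):
    vcov, ecov = covers(q_adj, q_label, schema)
    edges = {frozenset((u, v)) for u in q_adj for v in q_adj[u]}
    return vcov == set(q_label) and ecov == edges


# IMDb-style pattern; node names double as labels.
q_label = {n: n for n in ["year", "award", "movie", "actor", "actress", "country"]}
q_edges = [("movie", "year"), ("movie", "award"), ("movie", "actor"),
           ("movie", "actress"), ("actor", "country"), ("actress", "country")]
q_adj = {n: set() for n in q_label}
for a, b in q_edges:
    q_adj[a].add(b)
    q_adj[b].add(a)

schema = [(frozenset(), "year", 135), (frozenset(), "award", 24),
          (frozenset(), "country", 196),
          (frozenset({"year", "award"}), "movie", 4),
          (frozenset({"movie"}), "actor", 30),
          (frozenset({"movie"}), "actress", 30),
          (frozenset({"actor"}), "country", 1),
          (frozenset({"actress"}), "country", 1)]

bounded_full = is_effectively_bounded(q_adj, q_label, schema)
bounded_without_last = is_effectively_bounded(q_adj, q_label, schema[:-1])
print(bounded_full, bounded_without_last)
```

Dropping the last constraint leaves every node covered but the actress-country edge uncovered, so the check fails on condition (2) alone.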

16 The complexity of EBnd for subgraph queries
The complexity bound is proved by providing an algorithm, EBChk, which
(1) combines Q and A via a notion of actualized constraints, and
(2) uses an inverted index on actualized constraints to compute the coverages.

17 Generating query plans for subgraph queries

18 Effectively bounded query plans
A query plan ξ for pattern query Q under A consists of
(a) Node fetching: a sequence of node fetching operations of the form ft(u, $V_S$, φ, $g_Q(u)$), where
- u is an l-labelled node in Q;
- $V_S$ is an S-labelled set of nodes in Q;
- φ is an access constraint in A;
- $g_Q(u)$ is the matching predicate on node u.
(b) Building $G_Q$: fetches $E_Q$ over $V_Q$ via node fetching operations.
ξ is effectively bounded if for all G satisfying A, $ξ(G, A) = G_Q$ satisfies
- $Q(G_Q) = Q(G)$, and
- the time of all operations in ξ depends on A and Q only.
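
A node fetching operation can be pictured as an index lookup followed by a predicate filter. The sketch below is a toy model, not the authors' implementation: it assumes the index for a constraint S → (l, N) is a dictionary from an S-labelled set of data nodes to their l-labelled common neighbours, and all node identifiers and helper names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List

Index = Dict[FrozenSet[str], List[str]]


def fetch(index: Index, v_s: Iterable[str], bound: int,
          pred: Callable[[str], bool]) -> List[str]:
    """Toy ft(u, V_S, phi, g_Q(u)): look up V_S in phi's index, then filter."""
    hits = index.get(frozenset(v_s), [])
    # The constraint promises at most `bound` (N) hits per lookup.
    assert len(hits) <= bound, "graph violates the access constraint"
    return [v for v in hits if pred(v)]


# Hypothetical index for {year, award} -> (movie, 4).
movie_index: Index = {
    frozenset({"y1", "a1"}): ["m1", "m2"],
    frozenset({"y1", "a2"}): ["m3"],
}
# Hypothetical index for {movie} -> (person, 30).
cast_index: Index = {
    frozenset({"m1"}): ["p1", "p2"],
    frozenset({"m2"}): ["p3"],
    frozenset({"m3"}): [],
}

# Step 1: fetch movies for every (year, award) pair named by the plan.
movies = [m for y in ["y1"] for a in ["a1", "a2"]
          for m in fetch(movie_index, [y, a], 4, lambda m: True)]
# Step 2: fetch cast members of those movies, applying a matching predicate.
cast = [p for m in movies
        for p in fetch(cast_index, [m], 30, lambda p: p != "p2")]

print(movies)
print(cast)
```

The number of lookups and the size of each result are bounded by the plan and the constraints, so the total work does not grow with the data graph.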

19 Optimal effectively bounded query plans
Optimal (instance optimal) effectively bounded query plan ξ: for each graph G satisfying A, $ξ(G, A) = G_Q$ is the smallest among all $G_Q'$ for any other plan ξ’ with $ξ'(G, A) = G_Q'$.
There exists no instance optimal effectively bounded query plan.
What about a weaker notion of optimal effectively bounded query plan?

20 Generating worst-case optimal query plans
Given Q and A, we provide an algorithm that finds a worst-case optimal effectively bounded query plan in $O(|V_Q||E_Q||A|)$ time.
Worst-case optimal query plans are within reach in practice!

21 Making queries instance bounded

22 Instance-bounded patterns
What can we do if a query Q in L is not effectively bounded under A?
Instance boundedness aims to process a finite set $L_Q$ of queries on a particular instance G by accessing a bounded amount of data.
M-bounded extension $A_M$ of A on G: extending A with access constraints S → (l, N) with |S| = 0 or 1 such that N ≤ M.
Instance-bounded patterns: given a G satisfying $A_M$, a finite set $L_Q$ of patterns is instance-bounded in G under $A_M$ if for all Q in $L_Q$, there exists a subgraph $G_Q$ of G such that
(a) $Q(G_Q) = Q(G)$; and
(b) $G_Q$ can be found in time determined by $A_M$ and Q only.
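
The constraints available to an M-bounded extension can be enumerated directly from a given instance G. The sketch below lists, for a toy graph, the tightest constraints with |S| = 0 or 1 that hold on G and keeps those with N ≤ M. It only produces candidates; it does not decide EEP, and the graph and function names are hypothetical.

```python
from collections import Counter


def candidate_constraints(adj, label, M):
    """Tightest constraints S -> (l, N) with |S| <= 1 and N <= M holding on G."""
    out = []
    # |S| = 0: N is the number of l-labelled nodes in G.
    for l, n in Counter(label.values()).items():
        if n <= M:
            out.append((frozenset(), l, n))
    # |S| = 1: N is the largest number of l-labelled neighbours
    # of any s-labelled node.
    worst = {}
    for v, nbrs in adj.items():
        for l, n in Counter(label[w] for w in nbrs).items():
            key = (label[v], l)
            worst[key] = max(worst.get(key, 0), n)
    for (s, l), n in worst.items():
        if n <= M:
            out.append((frozenset({s}), l, n))
    return out


# Toy instance: three movies share a year; two of them share an award.
label = {"y": "year", "a": "award", "m1": "movie", "m2": "movie", "m3": "movie"}
adj = {"y": {"m1", "m2", "m3"}, "a": {"m1", "m2"},
       "m1": {"y", "a"}, "m2": {"y", "a"}, "m3": {"y"}}

extension = set(candidate_constraints(adj, label, M=2))
for S, l, n in sorted(extension, key=lambda c: (sorted(c[0]), c[1])):
    print(sorted(S), "->", (l, n))
```

With M = 2 the constraint from year to movie is rejected, because the single year node has three movie neighbours; raising M to 3 would admit it.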

23 The extended effectively bounded problem
$EEP(L_Q, A, M, G)$
- Input: a finite set $L_Q$ of subgraph queries, an access schema A, a natural number M, and a graph G satisfying A.
- Question: does there exist an M-bounded extension $A_M$ of A such that $L_Q$ is instance-bounded in G under $A_M$?
Want a stronger result? $minEEP(L_Q, A, G)$
- Input: $L_Q$, A and G.
- Output: the minimum M such that $L_Q$ is instance-bounded in G under $A_M$.
Complexity:
- $EEP(L_Q, A, M, G)$ is in $O(|G| + (|A| + |L_Q|)|E_{L_Q}| + (\|A\| + |L_Q|)|V_{L_Q}|^2)$ time.
- $minEEP(L_Q, A, G)$ is logAPX-hard.

24 Effectively bounded simulation queries

25 Characterization for simulation queries
EBnd problem for simulation queries:
- Input: a simulation query Q and an access schema A.
- Question: is Q effectively bounded under A?
Characterization: a simulation query Q is effectively bounded under A iff $sVCov(Q, A) = V_Q$ and $sECov(Q, A) = E_Q$.
sVCov(Q, A) and sECov(Q, A) are revisions of VCov(Q, A) and ECov(Q, A) for subgraph queries that take care of data locality.
If pattern Q is effectively bounded under A via simulation, then Q is also effectively bounded under A via subgraph isomorphism.

26 EBnd and EEP revisited for simulation queries
Complexities for simulation queries are the same as for subgraph queries:
- For simulation queries Q, EBnd(Q, A) is in (1) $O(|A||E_Q| + \|A\||V_Q|^2)$ time in general; and (2) $O(|A||E_Q| + |V_Q|^2)$ time in special cases, as for subgraph queries.
- Given a simulation query Q and an access schema A, we provide an algorithm that finds a worst-case optimal effectively bounded query plan in $O(|V_Q||E_Q||A|)$ time.
- For simulation queries, $EEP(L_Q, A, M, G)$ is in $O(|G| + (|A| + |L_Q|)|E_{L_Q}| + (\|A\| + |L_Q|)|V_{L_Q}|^2)$ time.

27 Experimental study

28 Experimental settings
Real-life datasets:
(1) Webbase-2011 (WebBG): 0.1 billion nodes, 1 billion edges and 0.18 billion labels; 204 access constraints.
(2) Internet Movie Data graph (IMDbG): 5.1 million nodes, 19.5 million edges and 168 labels; 168 access constraints.
(3) Knowledge graph (DBpediaG): 4.1 million nodes, 19.5 million edges and 1434 labels; 315 access constraints.
Pattern queries: 100 pattern queries randomly generated for each dataset, controlled by the number of nodes, edges and match predicates.

29 Experimental results
Effectiveness of effective boundedness
(1) Percentage of effectively bounded queries
- Subgraph queries: 61%, 67% and 58% of queries on IMDbG, DBpediaG and WebBG are effectively bounded.
- Simulation queries: 32%, 41% and 33%.
(2) Effectiveness of bounded queries
- Evaluation time is independent of |G|.
- Effective for both localized and non-localized queries.
- Outperforms optimized VF2 and graphSim by 4 and 3 orders of magnitude on average on WebBG, respectively.
(3) Effectiveness of instance boundedness
- A small M suffices to make queries instance-bounded: 0.006% (resp. [...]%) of |G| for 95% of subgraph (resp. simulation) queries on WebBG.

30 Summing up

31 Effectively bounded pattern queries
We propose to answer graph pattern queries by making use of effective boundedness, developing the following techniques:
- access constraints on graphs and effectively bounded pattern queries;
- identifying the complete class of effectively bounded graph patterns;
- generating (worst-case) optimal query plans for queries in the class; and
- instance boundedness for queries that are not in the class.
Outlook:
- A systematic method for discovering access constraints on graphs
- Incremental boundedness