1 QSX: Querying Social Graphs Querying big graphs Parallel query processing Boundedly evaluable queries Query-preserving graph compression Query answering.


1 QSX: Querying Social Graphs
Querying big graphs
◦ Parallel query processing
◦ Boundedly evaluable queries
◦ Query-preserving graph compression
◦ Query answering using views
◦ Bounded incremental query evaluation

2 How to make big graphs small
Input: a class 𝒬 of queries
Question: given a query Q ∈ 𝒬 and any (possibly big) graph G, can we effectively find a small graph G_Q such that Q(G) = Q(G_Q)?
Effective methods for making big graphs small:
◦ Distributed query processing
◦ Boundedly evaluable graph queries
◦ Query preserving graph compression
◦ Query answering using views
◦ Bounded incremental evaluation
[Figure: Q(G) = Q(G_Q), with G_Q much smaller than G]

3 Graph pattern matching by graph simulation
Input: a directed graph G and a graph pattern Q
Output: the maximum simulation relation R
Maximum simulation relation: always exists and is unique
◦ If a match relation exists, then there exists a maximum one
◦ Otherwise, it is the empty set, which is still maximum
Complexity: O((|V| + |V_Q|)(|E| + |E_Q|))
The output is a unique relation, possibly of size |Q||V|
Using views? Incremental?
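As a reminder of what the maximum simulation relation looks like in practice, here is a minimal Python sketch. It is a naive fixpoint refinement, not the algorithm that achieves the complexity bound quoted on the slide, and the function name and the dictionary-based graph encoding are assumptions made for illustration.

```python
def max_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """Naive fixpoint computation of the maximum simulation relation.

    q_nodes / g_nodes: dict mapping node -> label (assumed encoding).
    q_edges / g_edges: set of (source, target) pairs.
    Returns a dict mapping each pattern node to its set of matches.
    """
    g_succ = {v: set() for v in g_nodes}
    for v, w in g_edges:
        g_succ[v].add(w)
    # Start from every label-compatible pair, then refine.
    sim = {u: {v for v in g_nodes if g_nodes[v] == q_nodes[u]} for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # v stays a match of u only if some successor of v matches u2.
            keep = {v for v in sim[u] if g_succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    # If some pattern node has no match, the relation is the empty set.
    if any(not matches for matches in sim.values()):
        return {u: set() for u in q_nodes}
    return sim
```

Each pass only removes pairs, so the loop terminates, and the relation it stops at is the unique maximum one described above.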

4 Graph pattern matching using views

5 Answering queries using views
The cost of query processing: f(|G|, |Q|)
Can we compute Q(G) without accessing G, i.e., independent of |G|?
Query answering using views: given a query Q in a language L and a set V of views, find another query Q' such that
◦ Q and Q' are equivalent: for any G, Q(G) = Q'(G)
◦ Q' only accesses V(G)
The complexity is no longer a function of |G|
Answering graph pattern queries on big social graphs:
◦ Regardless of how big G is, the cost is "independent" of G
◦ V(G) is often much smaller than G (4% to 12% on real-life data)
[Figure: Q applied to G versus Q' applied to V(G)]

6 Querying collaborative network
[Figure: a collaborative pattern over customer, developer and project manager nodes, with query 1 and query 2, shown against a collaborative (chat) network containing PM 1, PM 2, customer 1 to customer n, developer 1 to developer k, and tester nodes. Annotations in the figure: "expensive!" and "views"]
Detecting Coordination Problems in Collaborative Software Development Environments, Amrit Chintan et al, Information System management, 2010

7 Answering query using views
[Figure: a query Q over a database D yields the query result Q(D); a query A over the database views V(D) yields the query result A(V)]
[Figure: timeline (years shown: 1995, 1998, 2000, 2002, 2006, 2007, 2011) covering relational algebra, regular path queries, XPath, tree pattern queries, XML, RDF/SPARQL, and graph pattern queries via simulation]
When possible? What to choose? How to evaluate?
A classical technique, but still in its infancy for graphs

8 When a pattern can be matched using views
Pattern containment: a characterization
A necessary and sufficient condition

9 Pattern containment
[Figure: a query over customer, developer and project manager; View 1 over customer, developer and project manager; View 2 over customer and developer]
View extensions (view edge: matches in G)
◦ (customer, developer): {(customer 2, developer 2), (customer 3, developer 3)}
◦ (developer, customer): {(developer 2, customer 2), (developer 2, customer 3), (developer 3, customer 2)}
◦ (project manager, developer): {(PM 1, developer 2), (PM 2, developer 3)}
◦ (project manager, customer): {(PM 1, customer 2), (PM 2, customer 2)}
Query result (query edge: match)
◦ (project manager, developer): (PM 1, developer 2)
◦ (project manager, customer): (PM 1, customer 2)
◦ (developer, customer): (developer 2, customer 2)
◦ (customer, developer): (customer 2, developer 2)
How to determine the existence of the mapping λ?

10 Determining pattern containment
NP-complete for relational conjunctive queries, undecidable for relational algebra
A practical characterization: patterns are small in practice

11 Pattern containment: example
[Figure: View 1 (customer, developer, project manager) and View 2 (customer, developer) are matched against the query (customer, developer, project manager), which is treated as a "data graph"; the view matches give the mapping λ]
V: the set of views; Q: the query
Query containment: given Q and Q', determine whether Q(G) is contained in Q'(G) for every graph G
A classical problem. What is its complexity for pattern queries? Efficient.
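The idea of treating the query as a "data graph" can be sketched as follows: simulate each view over the query, collect the query edges that each view covers (its view match), and check that together they cover every query edge. This is a hedged illustration only. The function names, the (nodes, edges) encoding and the exact definition of a view match used here are assumptions, not taken from the slides.

```python
def simulate(p_nodes, p_edges, d_nodes, d_edges):
    """Maximum simulation of a pattern over a data graph (naive fixpoint)."""
    succ = {v: set() for v in d_nodes}
    for v, w in d_edges:
        succ[v].add(w)
    sim = {x: {v for v in d_nodes if d_nodes[v] == p_nodes[x]} for x in p_nodes}
    changed = True
    while changed:
        changed = False
        for x, x2 in p_edges:
            keep = {v for v in sim[x] if succ[v] & sim[x2]}
            if keep != sim[x]:
                sim[x], changed = keep, True
    if any(not m for m in sim.values()):
        return {x: set() for x in p_nodes}
    return sim


def view_match(view, query):
    """Query edges covered by a view, with the query used as the data graph."""
    v_nodes, v_edges = view
    q_nodes, q_edges = query
    sim = simulate(v_nodes, v_edges, q_nodes, q_edges)
    return {(a, b) for (x, y) in v_edges for (a, b) in q_edges
            if a in sim[x] and b in sim[y]}


def contained(query, views):
    """True if the view matches together cover every query edge."""
    covered = set()
    for view in views.values():
        covered |= view_match(view, query)
    return covered == set(query[1])
```

Because both the views and the query are small, this check never touches the data graph G, which is why containment can be decided quickly in practice.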

12 Test: Pattern query containment
[Figure: a pattern query and two views over nodes labeled PM, DBA and PRG; View 1 has edges e1 and e2, View 2 has edges e3 and e4]
It takes 0.5 second to check containment of large cyclic patterns

13 Query evaluation using views
Input: a pattern query Q, a graph G, a set V of views with their extensions in G, and a mapping λ
Output: the query result Q(G)
Algorithm
◦ Collect edge matches for each query edge e and λ(e)
◦ Iteratively remove non-matches until no change happens
◦ Return Q(G)
Q(G) can be evaluated in O(|Q||V(G)| + |V(G)|^2) time
Recall the simulation algorithm. This is more efficient. Why?
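The three algorithm steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the name `eval_using_views`, the encoding of λ as a dict from query edges to lists of view-edge keys, and the choice to start from the union of the corresponding view extensions are all assumptions, and the refinement loop is a simple fixpoint rather than the algorithm behind the stated bound.

```python
def eval_using_views(query_edges, lam, extensions):
    """Evaluate a pattern query from view extensions only.

    query_edges: list of (u, u2) pattern edges.
    lam: dict mapping each query edge to a list of view-edge keys.
    extensions: dict mapping a view-edge key to its set of (v, v2) matches.
    """
    # Step 1: collect candidate matches of each query edge via lambda.
    S = {e: set().union(*(extensions[k] for k in lam[e])) for e in query_edges}
    out_edges = {}
    for e in query_edges:
        out_edges.setdefault(e[0], []).append(e)

    def valid(u, v):
        # v can match pattern node u only if v is the source of some
        # surviving match for every query edge leaving u.
        return all(any(src == v for src, _ in S[e]) for e in out_edges.get(u, []))

    # Step 2: iteratively remove non-matches until nothing changes.
    changed = True
    while changed:
        changed = False
        for u, u2 in query_edges:
            bad = {(v, v2) for v, v2 in S[(u, u2)]
                   if not valid(u, v) or not valid(u2, v2)}
            if bad:
                S[(u, u2)] -= bad
                changed = True
    # Step 3: an unmatched query edge means the result is empty.
    if any(not S[e] for e in query_edges):
        return {e: set() for e in query_edges}
    return S
```

Note that the function reads only `extensions`, never G itself, so its cost depends on the size of the view extensions rather than on the size of the graph.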

14 Query evaluation using views
[Figure: the query over customer, developer and project manager; View 1 over customer, developer and project manager; View 2 over customer and developer]
View extensions (view edge: matches in G)
◦ (customer, developer): {(customer 2, developer 2), (customer 3, developer 3)}
◦ (developer, customer): {(developer 2, customer 2), (developer 2, customer 3), (developer 3, customer 2)}
◦ (project manager, developer): {(PM 1, developer 2), (PM 2, developer 3)}
◦ (project manager, customer): {(PM 1, customer 2), (PM 2, customer 2)}
Query result (query edge: matches)
◦ (project manager, developer): {(PM 1, developer 2), (PM 2, developer 3)}
◦ (project manager, customer): {(PM 1, customer 2), (PM 2, customer 2)}
◦ (developer, customer): {(developer 2, customer 2), (developer 2, customer 3), (developer 3, customer 2)}
◦ (customer, developer): {(customer 2, developer 2), (customer 3, developer 3)}
"Bottom-up" strategy
Without accessing the underlying big graph G (the views are 4% to 12% of G)
Are we done yet?

15 What views to choose?
[Figure: a query over customer, developer, project manager, software and tester nodes, with six candidate views (view 1 to view 6), each over a subset of those nodes]
Choose all? Why do we care? Efficiency

16 Minimum containment
Minimum containment is NP-complete
◦ APX-hard as optimization
What can we do? Give two options

17 A log|E_p|-approximation
Idea: greedily select views V that "cover" more query edges
E_c: the query edges already covered, used to decide whether to include a particular view V
Approximation: performance guarantees
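The greedy idea can be illustrated with a short Python sketch in the style of the classical set-cover heuristic. The selection metric on the slide is not recoverable from the transcript, so this sketch assumes the plain rule "pick the view that covers the most uncovered query edges"; the function name and the `view_matches` encoding (view name to the set of query edges it covers) are likewise assumptions.

```python
def greedy_select(query_edges, view_matches):
    """Greedy view selection; returns [] if the views cannot cover the query.

    query_edges: iterable of query edges (E_p).
    view_matches: dict mapping a view name to the set of query edges it covers.
    """
    uncovered = set(query_edges)   # E_p minus E_c
    chosen = []
    while uncovered:
        if not view_matches:
            return []
        # Assumed metric: number of newly covered query edges.
        best = max(view_matches, key=lambda name: len(view_matches[name] & uncovered))
        gain = view_matches[best] & uncovered
        if not gain:
            return []              # no view extends E_c any further
        chosen.append(best)
        uncovered -= gain
    return chosen
```

For example, with views covering {a, b}, {c}, {a, b, c} and {d}, the sketch first picks the view covering three edges and then the one covering d.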

18 Minimum containment: example
[Figure: the query over customer, developer, project manager, software and tester nodes, and views 1 to 6, with the covered edge set E_c marked]
Greedy: based on the metric

19 Minimal containment
Algorithm
◦ Computes the view match for each view
◦ Iteratively selects a view that extends E_c (a new addition)
◦ Repeats until E_c = E_p, or returns the empty set
O(|Q|^2 card(V) + |V|^2 + |Q||V|) time
Minimal containment is in PTIME
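A minimal sketch of these steps in Python, assuming the view matches have already been computed and are given as a dict from view name to the set of query edges covered. The final pass that drops redundant views follows the "eliminate redundant views" remark in the example for this algorithm; the function name and data encoding are assumptions.

```python
def minimal_containment(query_edges, view_matches):
    """Select a minimal (not necessarily minimum) covering set of views."""
    target = set(query_edges)        # E_p
    covered, chosen = set(), []      # E_c and the selected views
    for name, edges in view_matches.items():
        gain = (edges & target) - covered
        if gain:                     # the view extends E_c
            chosen.append(name)
            covered |= gain
        if covered == target:
            break
    if covered != target:
        return []                    # the query is not contained in the views
    # Eliminate redundant views: drop a view if the others still cover E_p.
    for name in list(chosen):
        rest = set().union(*(view_matches[n] for n in chosen if n != name))
        if target <= rest:
            chosen.remove(name)
    return chosen
```

Unlike minimum containment, no attempt is made to find the smallest such set, which is why a single linear pass plus one elimination pass is enough.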

20 Minimal containment: example
[Figure: the query over customer, developer, project manager, software and tester nodes, and views 1 to 6]
Eliminate redundant views

21 Putting together
Problem | Complexity | Algorithm
containment | PTIME | O(card(V)|Q|^2 + |V|^2 + |Q||V|)
minimum containment | NP-c / APX-hard | log|E_p|-approximable, O(card(V)|Q|^2 + |V|^2 + |Q||V| + |Q|card(V)^(3/2))
minimal containment | PTIME | O(card(V)|Q|^2 + |V|^2 + |Q||V|)
evaluation | PTIME | O(|Q||V(G)| + |V(G)|^2)
◦ Characterization: a sufficient and necessary condition for deciding whether a query can be answered using a set of views
◦ Evaluation: how to evaluate queries using views
◦ View selection: what views to choose for answering queries
Improvement: 23 times faster
The study is still in its infancy for graph queries
Subgraph isomorphism? View maintenance?

22 Bounded incremental graph pattern matching

23 Incremental query answering
Real-life data is dynamic: it constantly changes (∆G), e.g. 5%/week in Web graphs
Re-compute Q(G ⊕ ∆G) starting from scratch?
Changes ∆G are typically small
Compute Q(G) once, and then incrementally maintain it
Incremental query processing:
◦ Input: Q, G, Q(G) (the old output), ∆G (changes to the input)
◦ Output: ∆M (changes to the output) such that Q(G ⊕ ∆G) = Q(G) ⊕ ∆M, where Q(G ⊕ ∆G) is the new output
When changes ∆G to the graph G are small, typically so are the changes ∆M to the output Q(G ⊕ ∆G)
Minimizing unnecessary recomputation

24 Complexity of incremental problems
Incremental query answering
◦ Input: Q, G, Q(G), ∆G
◦ Output: ∆M such that Q(G ⊕ ∆G) = Q(G) ⊕ ∆M
The cost of query processing: a function of |G| and |Q|
Incremental algorithms: |CHANGED|, the size of changes in the input (∆G) and the output (∆M)
◦ The updating cost that is inherent to the incremental problem itself
◦ The amount of work absolutely necessary to perform for any incremental algorithm
Bounded: the cost is expressible as f(|CHANGED|, |Q|)?
Optimal: in O(|CHANGED| + |Q|)?
Complexity analysis in terms of the size of changes
Incremental algorithms? Incremental graph simulation: bounded
G. Ramalingam, Thomas W. Reps: On the Computational Complexity of Dynamic Graph Problems. TCS 158(1&2), 1996

25 Why study incremental query answering?
Incremental query answering
◦ Input: Q, G, Q(G), ∆G
◦ Output: ∆M such that Q(G ⊕ ∆G) = Q(G) ⊕ ∆M
E-commerce systems: a fixed set of (parameterized) queries, repeatedly invoked and evaluated
View maintenance: in response to changes to the underlying graph
Compressed graphs: maintenance in the presence of changes
Indexing structure: 2-hop covers
An important issue: one of the important issues for querying big graphs

26 |CHANGED|: the affected area
Result graphs: Gr = (Vr, Er) for graph simulation
◦ Vr: the nodes in G that match pattern nodes in Q
◦ Er: the paths in G that match edges in Q
Affected area (AFF): the difference between Gr, the result graph of Q(G), and Gr', the result graph of Q(G ⊕ ∆G); this is the size of changes in the output
|CHANGED| = |∆G| + |AFF|
Used in the complexity and boundedness analyses of incremental matching
[Figure: pattern Q matched by simulation against Ann (CTO), Pat (DB), John (DB), Bill (Bio) and Mat (Bio)]
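To make the measure concrete, the following small Python sketch computes |CHANGED| from two result graphs. It assumes that the "difference" between Gr and Gr' is counted as the number of nodes and edges present in exactly one of the two result graphs; that reading, and the function name, are assumptions made for illustration.

```python
def changed_size(delta_g, old_result, new_result):
    """|CHANGED| = |delta G| + |AFF| for result graphs given as (nodes, edges)."""
    old_nodes, old_edges = old_result
    new_nodes, new_edges = new_result
    # Assumed reading of AFF: nodes and edges in exactly one result graph.
    aff = len(old_nodes ^ new_nodes) + len(old_edges ^ new_edges)
    return len(delta_g) + aff


old = ({"Ann", "Pat"}, {("Ann", "Pat")})
new = ({"Ann", "Pat", "Dan"}, {("Ann", "Pat"), ("Ann", "Dan")})
print(changed_size({("Ann", "Dan")}, old, new))
```

Here one inserted edge adds one node and one edge to the result graph, so the three unit changes are what any incremental algorithm must at least touch.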

27 Incremental graph pattern matching
[Figure: pattern Q over CTO, DB and Bio nodes; data graph G and result graph Gr over nodes including Ann (CTO), Don (CTO), John (CTO), Pat (DB), Dan (DB), John (DB), Bill (Bio), Mat (Bio), Tom (Bio) and Ross (Med); ∆G inserts edges e1 to e5, and the affected area is marked]
Comparing the cost of incremental matching with its batch counterpart

28 Incremental simulation matching
Input: Q, G, Q(G), ∆G
Output: ∆M such that Q(G ⊕ ∆G) = Q(G) ⊕ ∆M
Incremental simulation is unbounded
Batch updates, general patterns and graphs: in O(|∆G|(|Q||AFF| + |AFF|^2)) time
In O(|AFF|) time, and optimal, for
◦ single-edge deletions and general patterns
◦ single-edge insertions and DAG patterns
2 times faster than its batch counterpart for changes up to 10%

29 Semi-boundedness
Semi-bounded: the cost is a PTIME function f(|CHANGED|, |Q|)
Incremental simulation is in O(|∆G|(|Q||AFF| + |AFF|^2)) time for batch updates and general patterns
◦ |Q| is small
◦ Independent of |G|
Semi-boundedness is good enough!

30 Incremental simulation: optimal results
Unit deletions and general patterns: algorithm IncMatch-, optimal with respect to the size of changes
1. Identify ss edges: e = (v, v') such that v and v' are both matches
2. Find invalid matches
3. Propagate the affected area and refine matches (use a stack, upward propagation)
Linear time wrt. the size of changes
[Figure: pattern Q over CTO, DB and Bio; deleting edge e6 from G over Ann (CTO), Don (CTO), Pat (DB), Dan (DB), Bill (Bio) and Mat (Bio), with the affected area / ∆Gr marked in Gr]
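The three steps for a single edge deletion can be sketched in Python as below. This is a simplified illustration rather than the IncMatch algorithm itself: it rebuilds adjacency indexes from the whole edge set for readability, so it does not achieve time linear in the size of the changes, and the function name and data encoding are assumptions.

```python
def inc_delete(q_edges, g_edges, sim, deleted):
    """Update the simulation `sim` in place after deleting one graph edge.

    sim: dict mapping each pattern node to its matches before the deletion.
    Returns the set of (pattern node, graph node) pairs that stop matching.
    """
    g_edges.discard(deleted)
    succ, pred = {}, {}
    for a, b in g_edges:
        succ.setdefault(a, set()).add(b)
        pred.setdefault(b, set()).add(a)
    q_out, q_in = {}, {}
    for u, u2 in q_edges:
        q_out.setdefault(u, []).append(u2)
        q_in.setdefault(u2, []).append(u)

    def supported(u, x):
        # x still matches u if every pattern edge (u, u2) has a witness.
        return all(succ.get(x, set()) & sim[u2] for u2 in q_out.get(u, []))

    v, v2 = deleted
    # Step 1: only ss edges matter (both endpoints were matches).
    stack = [(u, v) for u in sim if v in sim[u]
             and any(v2 in sim[u2] for u2 in q_out.get(u, []))]
    removed = set()
    while stack:
        u, x = stack.pop()
        if x not in sim[u] or supported(u, x):
            continue
        # Step 2: x is an invalid match of u.
        sim[u].discard(x)
        removed.add((u, x))
        # Step 3: propagate upward to parents that relied on x.
        for p in pred.get(x, set()):
            for u0 in q_in.get(u, []):
                if p in sim[u0]:
                    stack.append((u0, p))
    if any(not m for m in sim.values()):
        # Some pattern node lost all matches: the relation becomes empty.
        for u in sim:
            removed |= {(u, x) for x in sim[u]}
            sim[u] = set()
    return removed
```

Starting from the old maximum relation is sound because deleting an edge can only shrink the maximum simulation, so refining the old result reaches the new one.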

31 Incremental simulation: optimal results
Unit insertions and DAG patterns: algorithm IncMatch+, optimal with respect to the size of changes
1. Identify cs and cc edges
◦ cs edge: e = (v, v') such that v' is a match and v is a candidate
◦ cc edge: e = (v, v') such that v' and v are both candidates
2. Find new valid matches
3. Propagate the affected area and refine matches
Linear time wrt. the size of changes
[Figure: pattern Q over CTO, DB and Bio; inserting edge e7 into G over Ann (CTO), Don (CTO), Pat (DB), Dan (DB), Bill (Bio) and Mat (Bio), with candidates marked in Gr]

32 Incremental subgraph isomorphism
Incremental matching via subgraph isomorphism
◦ Input: Q, G, M_iso(Q, G), ∆G
◦ Output: ∆M such that M_iso(Q, G ⊕ ∆G) = M_iso(Q, G) ⊕ ∆M
Incremental subgraph isomorphism
◦ Input: Q, G, M(Q, G), ∆G
◦ Question: whether there exists a subgraph in G ⊕ ∆G that is isomorphic to Q
Boundedness and complexity
◦ Incremental matching via subgraph isomorphism is unbounded even for unit updates over DAG graphs for path patterns
◦ Incremental subgraph isomorphism is NP-complete even when G is fixed
Neither bounded nor semi-bounded: not semi-bounded unless P = NP
What should we do?

33 Recall reachability queries
Reachability
◦ Input: a directed graph G, and a pair of nodes s and t in G
◦ Question: does there exist a path from s to t in G?
◦ O(|V| + |E|) time
Equivalence relation: the reachability relation R_e
◦ A node pair (u, v) ∈ R_e iff u and v have the same set of ancestors and descendants in G
◦ For any graph G, there is a unique maximum R_e, i.e., the reachability equivalence relation of G
Compress G by leveraging the equivalence relation
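As an illustration of the equivalence relation, the sketch below groups nodes that have the same ancestors and the same descendants. It is a deliberately naive version that runs one traversal per node; the function names are assumptions, and "ancestor" and "descendant" are taken to mean reachability by a path of at least one edge, which is also an assumption.

```python
def reach_sets(nodes, edges):
    """For each node, the set of nodes reachable by a path of length >= 1."""
    succ = {v: set() for v in nodes}
    for v, w in edges:
        succ[v].add(w)
    result = {}
    for s in nodes:
        seen, stack = set(), list(succ[s])
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(succ[x])
        result[s] = frozenset(seen)
    return result


def reachability_classes(nodes, edges):
    """Group nodes with identical ancestor and descendant sets."""
    desc = reach_sets(nodes, edges)
    anc = reach_sets(nodes, {(w, v) for v, w in edges})  # reversed graph
    groups = {}
    for v in nodes:
        groups.setdefault((anc[v], desc[v]), set()).add(v)
    return list(groups.values())
```

Each class can then be merged into a single node of the compressed graph, since its members answer every reachability query identically with respect to nodes outside the class.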

34 Incremental reachability preserving compression
Incremental reachability preserving compression (RCM)
◦ Unbounded even for unit updates, i.e., a single edge insertion or deletion (reduction from the single source reachability problem)
◦ RCM is solvable in O(|AFF||Gc|) time without decompressing Gc
Algorithm
1. Update the topological ranking, initialize AFF
2. (Iteratively) split/merge nodes and update Gc
[Figure: graph G over nodes FA 1, FA 2, C 1 and C 2, with the compressed graphs Gr, Gr' and Gr'']

35 Graph pattern matching by graph simulation
Input: a directed graph G, and a graph pattern Q
Output: the maximum simulation relation R
Bisimulation: a binary relation B over V of G, such that for each node pair (u, v) ∈ B,
◦ L(u) = L(v)
◦ for each edge (u, u') ∈ E, there exists (v, v') ∈ E such that (u', v') ∈ B
◦ for each edge (v, v') ∈ E, there exists (u, u') ∈ E such that (u', v') ∈ B
Equivalence relation Rb: the unique maximum bisimulation relation
Compress G by leveraging the equivalence relation
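The equivalence classes of the maximum bisimulation can be computed by partition refinement. The sketch below uses the simple signature-based variant: start from one block per label and repeatedly split blocks whose members see different sets of successor blocks. It is a naive illustration, not an optimized algorithm, and the function name and encoding are assumptions.

```python
def bisimulation_classes(labels, edges):
    """Equivalence classes of the maximum bisimulation (naive refinement).

    labels: dict mapping node -> label; edges: set of (source, target) pairs.
    """
    succ = {v: set() for v in labels}
    for v, w in edges:
        succ[v].add(w)
    block = dict(labels)             # initial partition: one block per label
    while True:
        ids, new_block = {}, {}
        for v in labels:
            # Signature: own block plus the set of successor blocks.
            sig = (block[v], frozenset(block[w] for w in succ[v]))
            new_block[v] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            break                    # no block was split: partition is stable
        block = new_block
    classes = {}
    for v, b in new_block.items():
        classes.setdefault(b, set()).add(v)
    return list(classes.values())
```

Including the node's own block in the signature guarantees that each round only splits blocks, so the loop stops as soon as the number of blocks stays the same.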

36 Incremental simulation preserving compression
Incremental pattern preserving compression (PCM) is unbounded even for unit updates
PCM is solvable in O(|AFF|^2 + |Gc|) time without decompressing Gc
1. Update the node ranking, initialize AFF
2. Iteratively split/merge nodes in Gc and update AFF
Incremental compression without recomputation
[Figure: graph G over BSA, MSA, FA and C nodes, its compressed graph Gq, and the affected area]

37 Incremental graph compression
Input: G, Gc = R(G), ∆G
Output: ∆Gc such that R(G ⊕ ∆G) = R(G) ⊕ ∆Gc
◦ Compressed once and incrementally maintained
◦ No need to decompress Gc
◦ Gc is computed once for all queries Q in L
Boundedness and complexity
◦ Unbounded even for unit updates
◦ In O(|AFF|^2 + |Gc|) time

38 Putting together
Bounded and semi-bounded incremental algorithms
◦ Prove (semi-)boundedness: develop a (semi-)bounded incremental algorithm
◦ Disprove (semi-)boundedness: by contradiction or reduction
Incremental graph simulation: semi-bounded
◦ Cyclic patterns and graphs
◦ Batch updates
Optimal for
◦ single-edge deletions and general patterns
◦ single-edge insertions and DAG patterns
Semi-bounded incremental algorithms for querying big data

39 Summing up

40 Making big data small
Yes, it is doable!
◦ Parallel query processing: divide and conquer
◦ Boundedly evaluable queries: dynamic reduction
◦ Query preserving compression: convert big data to small data
◦ Query answering using views: make big data small
◦ Bounded incremental query answering: depending on the size of the changes rather than the size of the original big data
◦ ...
Combinations of these are more effective
Including but not limited to graph queries
MapReduce is not the only way, and it is not the best way!
5.28 years * 365 * 24 * 3600 (EB) → 24 seconds!
Improvement:
◦ 28587 times (bounded evaluability), 60%
◦ 55 times (parallel processing via partial evaluation)
◦ 23 times (query answering using views)
◦ 2.3 times faster (compression)
◦ 2 times faster for changes up to 10% (incremental)

41 Summary and review
◦ What is query answering using views?
◦ What is query containment? What is the complexity of deciding query containment for relations? For XML? Graph pattern queries via graph simulation?
◦ What questions do we have to answer for answering graph queries using views?
◦ What is incremental query evaluation? What are the benefits?
◦ What is a unit update? Batch updates?
◦ When can we say that an incremental problem is bounded? Semi-bounded?
◦ How to show that an incremental problem is bounded? How to disprove it?

42 Project (1)
Recall graph pattern matching via subgraph isomorphism (Lecture 3), referred to as subgraph queries in the sequel.
◦ Develop a characterization (a sufficient and necessary condition) for deciding whether subgraph queries can be answered using views.
◦ Develop an algorithm for determining whether a subgraph query can be answered using views, based on your characterization.
◦ Develop an algorithm that, given a graph G, a set V of views and a subgraph query Q that can be answered using the views, computes Q(G) by using views in V.
◦ Give correctness and complexity analyses of your algorithms.
◦ Experimentally evaluate your algorithms, especially their scalability with the size of graphs.
A research and development project

43 Project (2)
Recall 2-hop covers for reachability queries (Lecture 2): for each node v in G, maintain 2hop(v) = (L_in(v), L_out(v)) such that a node s can reach t if and only if L_out(s) ∩ L_in(t) ≠ ∅
Study incremental maintenance of 2-hop covers, in response to
◦ node insertion
◦ node deletion
◦ edge insertion
◦ edge deletion
Develop an incremental algorithm in each of these settings.
Is the incremental problem bounded in each of the settings? If so, show that your incremental algorithm is bounded; otherwise disprove the boundedness of the incremental problem.
Implement your algorithms, and prove their correctness.
Experimentally evaluate your algorithms, especially their scalability.
A research and development project
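For reference, the reachability test that a 2-hop cover supports can be written in a few lines of Python. The labels in the example are hand-built for the path a → b → c and assume that each node appears in its own L_in and L_out; both the function name and that convention are assumptions made for illustration, not part of the project statement.

```python
def reaches(s, t, l_out, l_in):
    """s can reach t iff L_out(s) and L_in(t) share at least one hop node."""
    return bool(l_out[s] & l_in[t])


# Hand-built labels for the path a -> b -> c, using b as the hop node.
l_out = {"a": {"a", "b"}, "b": {"b"}, "c": {"c"}}
l_in = {"a": {"a"}, "b": {"b"}, "c": {"b", "c"}}
print(reaches("a", "c", l_out, l_in), reaches("c", "a", l_out, l_in))
```

An incremental algorithm for the project has to keep these label sets consistent with this test after each node or edge update.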

44 Project (3)
Recall strongly connected components (SCC, Lecture 2).
Study incremental maintenance of SCC, in response to
◦ node insertion
◦ node deletion
◦ edge insertion
◦ edge deletion
Develop an incremental algorithm in each of these settings.
Is the incremental problem bounded in each of the settings? If so, show that your incremental algorithm is bounded; otherwise disprove the boundedness of the incremental problem.
Implement your algorithms, and prove their correctness.
Experimentally evaluate your algorithms, especially their scalability.
A research and development project

45 Papers for you to review
◦ W. Le, S. Duan, A. Kementsietsidis, F. Li, and M. Wang. Rewriting queries on SPARQL views. In WWW, 2011. http://www.cs.fsu.edu/~lifeifei/papers/rdfview.pdf
◦ D. Saha. An incremental bisimulation algorithm. In FSTTCS, 2007. http://cs.famaf.unc.edu.ar/~rfervari/sites/all/files/readings/incremental-bis-07.pdf
◦ S. K. Shukla, E. K. Shukla, D. J. Rosenkrantz, H. B. Hunt III, and R. E. Stearns. The polynomial time decidability of simulation relations for finite state processes: A HORNSAT based approach. In DIMACS Ser. Discrete, 1997. (search Google Scholar)
◦ W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries using Views, ICDE 2014. (query answering using views)
◦ W. Fan, X. Wang, and Y. Wu. Incremental Graph Pattern Matching, TODS 38(3), 2013. (bounded incremental query answering)

