
CPT-S 580-06 Advanced Databases 11 Yinghui Wu EME 49.


1 CPT-S 580-06 Advanced Databases 11 Yinghui Wu EME 49

2 Query processing over large datasets

3 Querying Big Data: Theory and Practice
Theory
– Tractability revisited for querying big data
– Parallel scalability
– Bounded evaluability
Techniques
– Approximate query processing
– Parallel algorithms
– Bounded evaluability and access constraints
– Query-preserving compression
– Query answering using views
– Bounded incremental query processing

4 New theory for querying big data
– Tractability revisited for querying big data (BD-tractability)
– Parallel scalability (in a future lecture)
– Bounded evaluability (in a future lecture)
Fundamental question: to query big data, we must first determine whether doing so is feasible at all. For a class Q of queries, can we find an algorithm T such that, given any Q in Q and any big dataset D, T efficiently computes the answers Q(D) of Q in D within our available resources?

5 A review of P, NP and intractability

6 What is polynomial-time?

7 What are P and NP?
P problems
– (Original definition) Problems that can be solved by a deterministic Turing machine in polynomial time.
– (An equivalent definition) Problems that are solvable in polynomial time.
NP problems
– (Original definition) Problems that can be solved by a non-deterministic Turing machine in polynomial time.
– (An equivalent definition) Problems that are verifiable in polynomial time: given a candidate solution, there is a polynomial-time algorithm that tells whether the solution is correct.
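A minimal sketch of "verifiable in polynomial time", using Vertex Cover as the example NP problem; the graph, cover, and helper names below are illustrative, not from the slides.

```python
# Checking a proposed vertex cover takes O(|E| + |C|) time, even though no
# polynomial-time algorithm is known for finding a minimum cover.
def is_vertex_cover(edges, cover):
    """Return True iff every edge has at least one endpoint in `cover`."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)

edges = [(1, 2), (2, 3), (3, 4)]
print(is_vertex_cover(edges, {2, 3}))  # True  - a valid cover, verified quickly
print(is_vertex_cover(edges, {1, 4}))  # False - edge (2, 3) is uncovered
```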

8 What are P and NP?

9 What are NP-complete problems?
An NP-complete (NPC) problem is
– an NP problem, and
– at least as hard as the hardest NP problems.
In other words, NPC problems are the hardest NP problems. So far, no polynomial-time algorithm has been found for any NPC problem. However, whether P equals NP is currently unknown. Most computer scientists believe P ≠ NP (so the NPC problems lie outside P); however, it is not ruled out that P = NPC = NP.

10 How to cope with an NPC problem?
To prove that a problem A is NP-complete:
– show that A can be verified in PTIME (A is in NP) – upper bound;
– choose a known NPC (NP-hard) problem B – lower bound;
– develop a polynomial-time reduction algorithm that translates B to A;
– then, if A could be solved in polynomial time, B could be solved in polynomial time. Contradiction.
To solve an NPC problem:
– buy more expensive machines and wait (brute force) – could take 1000 years;
– turn to approximation algorithms: prove approximation hardness; design algorithms that produce near-optimal solutions (see the sketch below);
– turn to heuristics: prove sub-problems with performance guarantees; design algorithms with good features (one-pass, probabilistic bounds, bounded resources, etc.).
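A minimal sketch of "near-optimal solutions" for an NPC problem: the classic maximal-matching 2-approximation for Vertex Cover. The function and data names are illustrative.

```python
# The returned cover is guaranteed to be at most twice the size of an optimal cover.
def approx_vertex_cover(edges):
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            # take both endpoints of an uncovered edge (a maximal-matching edge)
            cover.update((u, v))
    return cover

edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
print(approx_vertex_cover(edges))  # e.g. {1, 2, 3, 4}; an optimal cover is {2, 4}
```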

11 BD-tractability

12 The good, the bad and the ugly
Traditional computational complexity theory of almost 50 years:
– The good: polynomial-time computable (PTIME)
– The bad: NP-hard (intractable)
– The ugly: PSPACE-hard, EXPTIME-hard, undecidable, …
What happens when it comes to big data? Polynomial-time queries become intractable on big data! Using an SSD with a 6 GB/s scan rate, a linear scan of a dataset D would take 1.9 days when |D| is 1 PB (10^15 B), and 5.28 years when |D| is 1 EB (10^18 B). O(n) time is already beyond reach on big data in practice!
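A quick back-of-the-envelope check of the scan-time figures above, assuming a sustained sequential read rate of 6 GB/s (6e9 bytes per second).

```python
RATE = 6e9  # bytes per second (assumed SSD scan rate)

for size, label in [(1e15, "1 PB"), (1e18, "1 EB")]:
    seconds = size / RATE
    print(f"{label}: {seconds / 86400:.2f} days = {seconds / (86400 * 365):.2f} years")
# 1 PB: 1.93 days = 0.01 years
# 1 EB: 1929.01 days = 5.28 years
```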

13 Complexity classes within P
Polynomial-time algorithms are no longer tractable on big data, so we may consider "smaller" complexity classes.
NC (Nick's class): highly parallel feasible – parallel polylog time with polynomially many processors; too restrictive to include many practical queries that are feasible on big data.
L: O(log n) space; NL: nondeterministic O(log n) space; polylog-space: log^c(n) space.
L ⊆ NL ⊆ polylog-space ⊆ P, and NC ⊆ P.
BIG open problem: P = NC? (as hard as P = NP)

14 Tractability revisited for queries on big data
A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that
– for any database D on which queries of Q are defined, D' = Π(D);
– for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D' in parallel polylog time (NC).
BD-tractable queries are feasible on big data. Does it work? If a "linear scan" of D could be done in log(|D|) time, it would take 15 seconds when D is 1 PB instead of 1.99 days, and 18 seconds when D is 1 EB rather than 5.28 years.

15 BD-tractable queries
A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that for any database D on which queries of Q are defined, D' = Π(D), and for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D' in parallel polylog time (NC).
Preprocessing is a common practice of database people:
– a one-time, offline process, done once for all queries in Q;
– indices, compression, views, incremental computation, …;
– it does not necessarily reduce the size of D.
BDTQ0: the set of all BD-tractable query classes.

16 What query classes are BD-tractable?
Some natural query classes are BD-tractable.
Boolean selection queries – Input: a dataset D. Query: does there exist a tuple t in D such that t[A] = c? Build a B+-tree on the A-column values in D; then all such selection queries can be answered in O(log |D|) time (a sketch follows below).
Relational algebra + set recursion on ordered relational databases (D. Suciu and V. Tannen: A query language for NC, PODS 1994).
What else? Graph reachability queries – Input: a directed graph G. Query: does there exist a path from node s to t in G?
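A minimal sketch of the Boolean selection example, using a sorted array (via Python's bisect) as a stand-in for the B+-tree on the slide; the attribute and function names are illustrative. Preprocessing is PTIME, and each query then runs in O(log |D|).

```python
import bisect

def preprocess(D, attr):
    """One-time, offline preprocessing: sort the A-column values."""
    return sorted(t[attr] for t in D)

def exists(index, c):
    """Boolean selection query answered in O(log |D|) via binary search."""
    i = bisect.bisect_left(index, c)
    return i < len(index) and index[i] == c

D = [{"A": 7, "B": "x"}, {"A": 3, "B": "y"}, {"A": 9, "B": "z"}]
idx = preprocess(D, "A")
print(exists(idx, 3), exists(idx, 5))  # True False
```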

17 Dealing with queries that are not BD-tractable
Many query classes are not BD-tractable. Example: Breadth-Depth Search (BDS).
Input: an undirected graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V.
Question: is u visited before v in the breadth-depth search of G?
BDS starts at a node s and visits all its children, pushing them onto a stack in the reverse order induced by the vertex numbering. After all of s's children are visited, it continues with the node on top of the stack, which then plays the role of s.
Is this problem (query class) BD-tractable? No. Preprocessing does not help us answer such queries.
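A runnable sketch of breadth-depth search as described above, assuming an adjacency-list representation of the graph; the function and variable names are illustrative, and the tie-breaking follows the slide's wording (children pushed in reverse numbering order, so the smallest-numbered child is continued with first).

```python
def breadth_depth_order(adj, start):
    order, seen, stack = [start], {start}, []
    node = start
    while True:
        children = [c for c in sorted(adj.get(node, ())) if c not in seen]
        for c in children:                 # visit all children of the current node
            seen.add(c)
            order.append(c)
        stack.extend(reversed(children))   # push in reverse numbering order
        if not stack:
            return order
        node = stack.pop()                 # top of the stack plays the role of s

def visited_before(adj, start, u, v):
    """The BDS query: is u visited before v?"""
    pos = {n: i for i, n in enumerate(breadth_depth_order(adj, start))}
    return pos[u] < pos[v]

adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
print(breadth_depth_order(adj, 1))   # [1, 2, 3, 4]
print(visited_before(adj, 1, 3, 4))  # True
```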

18 Fundamental problems for BD-tractability
BD-tractable queries help practitioners determine what query classes are tractable on big data. Are we done yet? No – a number of questions arise in connection with any complexity class:
– Reductions: how do we transform a problem to another in the class that we know how to solve, and hence make it BD-tractable?
– Complete problems: is there a natural problem (a class of queries) that is the hardest one in the complexity class, i.e., a problem to which all problems in the class can be reduced? (Analogous to our familiar NP-complete problems – name one NP-complete problem that you know.)
– How large is BDTQ0, compared to P and to NC?
Why do we care? These questions are fundamental to any complexity class: P, NP, …

19 Polynomial hierarchy revisited – tractability revisited for querying big data. The classical picture (P, NP and beyond) is refined within P: BD-tractable queries (parallel polylog time after preprocessing) versus queries that are not BD-tractable.

20 What can we get from BD-tractability?
Guidelines for the following (why we need to study theory for querying big data):
– What query classes are feasible on big data?
– What query classes can be made feasible to answer on big data?
– How to determine whether it is feasible to answer a class Q of queries on big data? Reduce Q to a complete problem Q_c for BDTQ.
– If so, how to answer queries in Q? Compose the reduction and the algorithm for answering queries of Q_c.

21 Techniques for querying big data: overview
– Approximate query processing
– Boundedly evaluable queries
– Query-preserving compression: convert big data to small data
– Query answering using views: make big data small
– Bounded incremental query answering: cost depends on the size of the changes rather than the size of the original big data
– Parallel query processing (MapReduce, GraphLab, etc.)

22 Approximate query answering
We cannot afford to always find exact query answers in big data:
– some queries are expensive (e.g., subgraph isomorphism);
– we may have constrained resources – we cannot afford unlimited resources;
– applications may demand real-time responses.
Two approaches: 1. query-driven approximation; 2. data-driven approximation: resource-bounded query answering – querying big data within our available resources.

23 Boundedly evaluating a query
Given a query Q, access constraints A and a big dataset D:
1. Decide whether Q is effectively bounded under A.
2. If so, generate a bounded query plan for Q.
3. Otherwise, do one of the following:
① extend the access schema or instantiate some parameters of Q, to make Q effectively bounded;
② use other tricks to make D small;
③ compute approximate query answers to Q in D.
This is very effective for conjunctive queries: 77% of conjunctive queries are boundedly evaluable (efficiency: 9 seconds vs. 14 hours with MySQL), and 60% of graph pattern queries (via subgraph isomorphism) are boundedly evaluable – an improvement of 4 orders of magnitude. A small illustration of a bounded query plan follows.
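A minimal sketch of the idea behind a bounded query plan. The access constraint, index, and all names are illustrative assumptions, not from the slides: suppose each person has at most MAX_FRIENDS friends and an index from person to friends exists; then the query can be answered by fetching a bounded amount of data, independent of |D|.

```python
MAX_FRIENDS = 5000  # bound guaranteed by the (assumed) access constraint

def bounded_query(friends_of, city_of, p, c):
    """Find friends of p who live in city c, via a bounded number of index accesses."""
    candidates = friends_of(p) or []       # one index access, <= MAX_FRIENDS tuples
    assert len(candidates) <= MAX_FRIENDS
    return [q for q in candidates if city_of(q) == c]   # bounded post-filtering

# Toy indexes standing in for index accesses against a big dataset D.
friends = {"alice": ["bob", "carol"], "bob": ["alice"]}
cities = {"alice": "NYC", "bob": "LA", "carol": "NYC"}
print(bounded_query(friends.get, cities.get, "alice", "NYC"))  # ['carol']
```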

24 Answering queries using views
The cost of query processing is f(|D|, |Q|). Can we compute Q(D) without accessing D, i.e., independent of |D|, so that the complexity is no longer a function of |D|?
Query answering using views: given a query Q in a language L and a set V of views, find another query Q' such that
– Q and Q' are equivalent: for any D, Q(D) = Q'(V(D)); and
– Q' only accesses V(D).
V(D) is often much smaller than D (4% – 12% on real-life data); improvement: 31 times faster for graph pattern matching.
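A minimal sketch of query answering using views; the relation, view, and attribute names are illustrative. The rewriting Q' touches only the materialized view V(D), never the full dataset D.

```python
D = [  # the (big) base relation: (name, dept, year_hired)
    ("ann", "CS", 2016), ("bob", "EE", 2012), ("eve", "CS", 2019),
]

def V(data):
    """View definition: CS employees only (materialized once, offline)."""
    return [t for t in data if t[1] == "CS"]

def Q_prime(view):
    """Rewriting of Q = 'names of CS employees hired after 2015' over the view."""
    return [name for (name, _, year) in view if year > 2015]

VD = V(D)                # |V(D)| is often a small fraction of |D|
print(Q_prime(VD))       # ['ann', 'eve'] == Q(D), without accessing D at query time
```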

25 Bounded evaluability using views
Input: a class Q of queries, a set V of views, an access constraint A.
Question: can we find, by using A, for any query Q ∈ Q and any (possibly big) dataset D,
– a fraction D_Q of D such that |D_Q| ≤ M,
– a rewriting Q' of Q using V, with Q(D) = Q'(D_Q, V(D)), and
– such that D_Q can be identified in time determined by Q, V, and A?
That is, access the views and, additionally, a bounded amount of data. Query Q may not be boundedly evaluable, but it may be boundedly evaluable with views!

26 Incremental query answering
Real-life data is dynamic – it constantly changes by ∆G (e.g., 5% per week in Web graphs). Should we re-compute Q(G ⊕ ∆G) from scratch? Better: compute Q(G) once, and then incrementally maintain it, minimizing unnecessary recomputation.
Incremental query processing:
– Input: Q, G, the old output Q(G), and the changes ∆G to the input.
– Output: the changes ∆M to the output such that Q(G ⊕ ∆G) = Q(G) ⊕ ∆M.
When the changes ∆G to the data G are small, typically so are the changes ∆M to the new output Q(G ⊕ ∆G). At least twice as fast as recomputation for pattern matching, for changes up to 10%. A small sketch follows.
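A minimal sketch of incremental query answering for a simple selection query; the query and names are illustrative. Instead of re-evaluating on the changed data from scratch, the new answer is computed from the old answer and the changes alone, so the cost depends on the size of the changes rather than on |D|.

```python
def eval_Q(D):
    """Q(D) = all tuples with value > 100 (the illustrative query)."""
    return {t for t in D if t > 100}

def delta_Q(old_answer, inserted, deleted):
    """Compute the new answer from the old answer and the changes only."""
    delta_plus = {t for t in inserted if t > 100}
    delta_minus = {t for t in deleted if t in old_answer}
    return (old_answer - delta_minus) | delta_plus

D = {5, 150, 200}
ans = eval_Q(D)                                        # computed once: {150, 200}
ans = delta_Q(ans, inserted={300, 7}, deleted={200})
print(ans)                                             # {150, 300} == eval_Q of the updated D
```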

27 Incremental bounded evaluability
Input: a class Q of queries, an access constraint A.
Question: can we find, by using A, for any query Q ∈ Q, any dataset D, and any changes ∆D to D,
– a fraction D_Q of D such that |D_Q| ≤ M,
– Q(D ⊕ ∆D) = Q(D) ⊕ ∆Q(∆D, D_Q), and
– such that D_Q can be identified in time determined by Q and A?
That is, access only an additional bounded amount of data. Query Q may not be boundedly evaluable, but it may be incrementally boundedly evaluable!

28 Parallel query processing
Divide and conquer:
– partition D into fragments (D_1, …, D_n) of manageable sizes, distributed to various sites;
– upon receiving a query Q, evaluate Q(D_i) in parallel on the smaller fragments D_i;
– collect the partial answers at a coordinator site and assemble them to find the answer Q(D) in the entire D.
Parallel processing = partial evaluation + message passing. A minimal sketch follows.
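A minimal sketch of the divide-and-conquer scheme above, using a local process pool as a stand-in for distributed sites; the fragmentation, the query, and the assembly step are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

def eval_Q(fragment):
    """Partial evaluation of Q on one fragment: here, count tuples with value > 100."""
    return sum(1 for t in fragment if t > 100)

def parallel_query(D, n_fragments=4):
    fragments = [D[i::n_fragments] for i in range(n_fragments)]   # partition D
    with ProcessPoolExecutor() as pool:
        partial = pool.map(eval_Q, fragments)                     # Q(D_i) in parallel
    return sum(partial)                                           # assemble at the coordinator

if __name__ == "__main__":
    D = list(range(1000))
    print(parallel_query(D))  # 899 tuples with value > 100
```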

29 Query-preserving compression
The cost of query processing is f(|D|, |Q|) – can we reduce the |D| parameter?
Query-preserving compression for a class L of queries: a compression function R and a post-processing function P such that
– for any data collection D, D_C = R(D) (compressing); and
– for any Q in L, Q(D) = P(Q, D_C) (answering Q directly on the compressed data).
In contrast to lossless compression, we retain only the information relevant for answering queries in L; there is no need to restore the original dataset D or to decompress the data, and the compression ratio is better. 18 times faster on average for reachability queries. A sketch for reachability follows.
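A minimal sketch of query-preserving compression for reachability queries, assuming the networkx library: collapsing strongly connected components preserves reachability, so queries are answered on the (usually much smaller) condensation without decompressing. This is one simple instance of the idea, not necessarily the scheme behind the slide's numbers.

```python
import networkx as nx

def compress(G):                       # R: compression function
    C = nx.condensation(G)             # one node per SCC; keeps a node -> SCC mapping
    return C, C.graph["mapping"]

def reachable(C, mapping, s, t):       # P: answer the query directly on D_C
    return nx.has_path(C, mapping[s], mapping[t])

G = nx.DiGraph([(1, 2), (2, 1), (2, 3), (4, 3)])
C, m = compress(G)
print(reachable(C, m, 1, 3))  # True: 1 -> 2 -> 3 in the original graph
print(reachable(C, m, 3, 4))  # False
```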

30 A principled approach: making big data small
– Boundedly evaluable queries
– Query-preserving compression: convert big data to small data
– Query answering using views: make big data small
– Bounded incremental query answering: cost depends on the size of the changes rather than the size of the original big data
– Parallel query processing (MapReduce, GraphLab, etc.)
– …
Combinations of these can do much better than MapReduce!

31 Approximate query processing

32 Approximate query answering
We cannot afford to always find exact query answers in big data: some queries are expensive (e.g., subgraph isomorphism), we may have constrained resources, and applications may demand real-time responses.
1. Query-driven approximation: feasible query models (from intractable to low polynomial time); top-k query answering.
2. Data-driven approximation: resource-bounded query answering – querying big data within our available resources.

33 Revised graph query model
Relaxing the semantics of queries (case study):
– Effectiveness: capture more sensible matches in social graphs.
– Efficiency: from intractable to low polynomial time.
Subgraph isomorphism is NP-complete, with exponentially many matches; the relaxed model admits a quadratic/cubic-time (polynomial-time) algorithm. Use "cheaper" queries whenever possible – this works better for social network analysis.

34 G-Ray: best-effort graph pattern matching (Hanghang Tong et al., KDD 09)
Input: an attributed data graph and a query graph. Output: a matching subgraph.

35 G-Ray: quick overview
A loop over: Step 1: SF; Step 2: NE; Step 3: BR; Step 4: NE; Step 5: BR; Step 6: NE; Step 7: BR; Step 8: BR.
SF: Seed-Finder; NE: Neighborhood-Expander; BR: Bridge.

36 G-Ray: example pattern and matches
Best-effort matching: not exact answers, no approximation guarantee, and topological information is lost, but it runs in time linear in |G|.

37 Top-k query answering
Traditional query answering computes Q(D). This is expensive when D is large, and the result Q(D) can be excessively large for users to inspect – even larger than D. (How many matches do you check when you use, e.g., Google?)
Top-k query answering: Input: query Q, database D and a positive integer k. Output: a top-ranked set of k matches of Q.
Early termination: return the top-k matches without computing all of Q(D).

38 Top-k graph querying
Input: graph G, query Q, integer k, an answer-quality measure. Output: a top-k answer set that maximizes the objective function F.
Top-k algorithms: exact top-k; approximate top-k; any-time top-k; early terminating.
Differences between top-k graph problems and top-k table aggregation: valid match expansion; joins (no monotonicity); hard to show instance optimality (top-k join queries are special cases!).

39 GraphTA: a template
Initialize a candidate list L for each node/edge of interest in Q;
for each list L: sort L with the ranking function; set a cursor to each list; set an upper bound U;
for each cursor c in each list L do
  generate a match that contains c; update Q(G, k);
  update the threshold H with the lowest score in Q(G, k);
  move all cursors one step ahead; update the upper bound U;
  if k matches are identified and H >= U then break;
return Q(G, k).
A runnable simplification of this threshold-style skeleton follows.
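For intuition, here is a runnable sketch of the classic threshold-style skeleton that GraphTA adapts, applied to top-k aggregation over sorted candidate lists. The names, the sum-aggregate, and the equal-length-lists assumption are illustrative: cursors advance in lockstep, a threshold bounds the score of any unseen object, and the scan stops early once the current top-k can no longer be beaten.

```python
import heapq

def threshold_topk(lists, k):
    """Top-k objects by summed score over several lists, each sorted by score
    descending; assumes all lists have the same length (sketch simplification)."""
    random_access = [dict(lst) for lst in lists]          # object -> score per list
    topk, seen = [], set()                                # min-heap of (score, object)
    for depth in range(len(lists[0])):                    # move all cursors one step ahead
        threshold = sum(lst[depth][1] for lst in lists)   # best any unseen object can do
        for lst in lists:
            obj = lst[depth][0]
            if obj not in seen:
                seen.add(obj)
                score = sum(ra.get(obj, 0.0) for ra in random_access)
                heapq.heappush(topk, (score, obj))
                if len(topk) > k:
                    heapq.heappop(topk)                   # keep only the current top-k
        if len(topk) == k and topk[0][0] >= threshold:
            break                                         # early termination
    return sorted(topk, reverse=True)

lists = [[("a", 0.9), ("b", 0.8), ("c", 0.1)],
         [("b", 0.9), ("a", 0.7), ("c", 0.2)]]
print(threshold_topk(lists, 1))   # [(1.7, 'b')], found without scanning the lists to the end
```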

40 Finding best candidates
Query: find good PM (project manager) candidates who have collaborated with a PRG (programmer), a DB (database developer/manager) and an ST (software tester). The pattern graph Q has "Project Manager" as its query focus and is matched against a collaboration network G.
The complete matching relation: (project manager, PM1)–(project manager, PM4); (programmer, PRG1)–(programmer, PRG4); (DB manager, DB1)–(DB manager, DB3); (tester, ST1)–(tester, ST4).
Querying collaborative networks: we just want the top-ranked PMs.

41 Graph pattern matching with an output node
Input: graph G = (V, E, f_A), pattern Q = (V_Q, E_Q, f_v, u_o) with output node u_o.
Output: Q(G, u_o) = { v | (u_o, v) ∈ Q(G) }, the matches of the output node – k nodes instead of the entire set Q(G).
Top-k query answering: Input: pattern Q, data graph G and a positive integer k. Output: the top-k matches in Q(G, u_o).
How to rank the answers?

42 Top-k answers
Top-k matching: the k matches that maximize the total relevance.
Relevant set R(u, v) for a match v of a query node u: all descendants of v that are matches of descendants of u – a unique, maximum relevance set.
Relevance function: the more reachable matches, the better.

43 Finding top-k matches for acyclic patterns
– Initialize a heap S and a vector for each candidate v (a match flag, the relevance set v.R, and the relevance bounds v.lower, v.upper).
– Compute a set of matches for some query nodes (those that can be determined without the following steps).
– Iteratively update the vectors of the other candidates by propagating the partial answers.
– Termination condition: (1) each v in S is a match of u_o, and (2) min_{v ∈ S} l(u_o, v) ≥ max_{v′ ∈ can(u_o)\S} h(u_o, v′), where l(u_o, v) and h(u_o, v) denote a lower and an upper bound on r(u_o, v).

44 Finding top-k matches: example
The example tracks the candidate vectors v.T over the collaboration network, after initialization and after propagation from DB2: a valid match is found whose relevant set includes the most matches compared with the others, and the early termination condition is met.

45 Reading
1. M. Arenas, L. E. Bertossi, J. Chomicki. Consistent Query Answers in Inconsistent Databases. PODS 1999. http://web.ing.puc.cl/~marenas/publications/pods99.pdf
2. I. Bhattacharya and L. Getoor. Collective Entity Resolution in Relational Data. TKDD, 2007. http://linqs.cs.umd.edu/basilic/web/Publications/2007/bhattacharya:tkdd07/bhattacharya-tkdd.pdf
3. P. Li, X. Dong, A. Maurino, and D. Srivastava. Linking Temporal Records. VLDB 2011. http://www.vldb.org/pvldb/vol4/p956-li.pdf
4. W. Fan and F. Geerts. Relative Information Completeness. PODS 2009.
5. Y. Cao, W. Fan, and W. Yu. Determining Relative Accuracy of Attributes. SIGMOD 2013.
6. P. Buneman, S. Davidson, W. Fan, C. Hara and W. Tan. Keys for XML. WWW 2001.

