Querying Big Data: Theory and Practice

Presentation transcript:

1 Querying Big Data: Theory and Practice Theory –Tractability revisited for querying big data –Parallel scalability –Bounded evaluability Techniques –Parallel algorithms –Bounded evaluability and access constraints –Query-preserving compression –Query answering using views –Bounded incremental query processing TDD: Topics in Distributed Databases

2 Tractability revised for querying big data Parallel scalability Bounded evaluability New theory for querying big data 2 To query big data, we have to determine whether it is feasible at all. For a class Q of queries, can we find an algorithm T such that given any Q in Q and any big dataset D, T efficiently computes the answers Q(D) of Q in D within our available resources? Fundamental question Is this feasible or not for Q ?

BD-tractability 3

4 The good, the bad and the ugly Traditional computational complexity theory of almost 50 years: The good: polynomial time computable (PTIME) The bad: NP-hard (intractable) The ugly: PSPACE-hard, EXPTIME-hard, undecidable… What happens when it comes to big data? Using SSD of 6G/s, a linear scan of a data set D would take 1.9 days when D is of 1PB (10^15 B), and 5.28 years when D is of 1EB (10^18 B). O(n) time is already beyond reach on big data in practice! Polynomial time queries become intractable on big data!
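The scan times quoted on this slide can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes "6G/s" means 6 x 10^9 bytes per second and a 365.25-day year, and the function name `scan_seconds` is made up for illustration.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25  # assumption: average year length

def scan_seconds(size_bytes, bytes_per_second=6e9):
    """Time in seconds for one sequential pass over the data."""
    return size_bytes / bytes_per_second

pb_days = scan_seconds(1e15) / SECONDS_PER_DAY                   # 1 PB
eb_years = scan_seconds(1e18) / SECONDS_PER_DAY / DAYS_PER_YEAR  # 1 EB

print(f"1 PB: {pb_days:.2f} days")
print(f"1 EB: {eb_years:.2f} years")
```

Under these assumptions the results round to the slide's figures of roughly 1.9 days and 5.28 years.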

5 Complexity classes within P Polynomial time algorithms are no longer tractable on big data. So we may consider “smaller” complexity classes. NC (Nick’s class): highly parallel feasible; parallel polylog time, i.e., parallel log^k(n) time; polynomially many processors. BIG open: P = NC? As hard as P = NP. L: O(log n) space. NL: nondeterministic O(log n) space. polylog-space: log^k(n) space. L ⊆ NL ⊆ polylog-space; NC ⊆ P. Too restrictive to include practical queries feasible on big data.

6 Tractability revisited for queries on big data A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that for any database D on which queries of Q are defined, D’ = Π(D), and for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D’ in parallel polylog time (NC), i.e., parallel log^k(|D|, |Q|) time. D is preprocessed into Π(D) once; then Q1(Π(D)), Q2(Π(D)), … are evaluated on it. BD-tractable queries are feasible on big data. Does it work? If a linear scan of D could be done in log(|D|) time: 15 seconds when D is of 1 PB instead of 1.9 days, 18 seconds when D is of 1 EB rather than 5.28 years.

7 BD-tractable queries A class Q of queries is BD-tractable if there exists a PTIME preprocessing function Π such that for any database D on which queries of Q are defined, D’ = Π(D), and for all queries Q in Q defined on D, Q(D) can be computed by evaluating Q on D’ in parallel polylog time (NC), in parallel with more resources. Preprocessing: a common practice of database people; a one-time process, offline, once for all queries in Q; indices, compression, views, incremental computation, …; it does not necessarily reduce the size of D. What is the maximum size of D’? BDTQ₀: the set of all BD-tractable query classes.

8 What query classes are BD-tractable? Some natural query classes are BD-tractable. Boolean selection queries Input: A dataset D Query: Does there exist a tuple t in D such that t[A] = c? Build a B+-tree on the A-column values in D. Then all such selection queries can be answered in O(log(|D|)) time. Graph reachability queries Input: A directed graph G Query: Does there exist a path from node s to t in G? NL-complete. What else? Relational algebra + set recursion on ordered relational databases (D. Suciu and V. Tannen: A query language for NC, PODS 1994).
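The Boolean selection example can be sketched in a few lines. This is an illustration under assumptions, not the slide's implementation: a sorted in-memory list stands in for the B+-tree, and the names `preprocess` and `exists` are hypothetical.

```python
from bisect import bisect_left

def preprocess(column_values):
    """One-time preprocessing: sort the distinct A-column values.
    (A sorted list stands in for the B+-tree here; that is an assumption.)"""
    return sorted(set(column_values))

def exists(index, c):
    """Answer 'is there a tuple t with t[A] = c?' by binary search,
    using O(log n) comparisons on the preprocessed index."""
    i = bisect_left(index, c)
    return i < len(index) and index[i] == c

index = preprocess([42, 7, 19, 7, 88, 3])
print(exists(index, 19), exists(index, 20))
```

The preprocessing cost is paid once; every later selection query touches only a logarithmic number of index entries rather than scanning D.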

9 Deal with queries that are not BD-tractable Many query classes are not BD-tractable. Breadth-Depth Search (BDS) Input: An undirected graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V Question: Is u visited before v in the breadth-depth search of G? The search starts at a node s and visits all its children, pushing them onto a stack in the reverse order induced by the vertex numbering. After all of s’ children are visited, it continues with the node on the top of the stack, which plays the role of s. Is this problem (query class) BD-tractable? No. With D empty and Q being (G, (u, v)), the problem is well known to be P-complete! We need PTIME to process each query (G, (u, v)), and preprocessing does not help us answer such queries. What is P-complete? Can we make it BD-tractable?

10 Make queries BD-tractable Factorization: partition instances to identify a data part D for preprocessing, and a query part Q for operations. Breadth-Depth Search (BDS) Input: An undirected graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V Question: Is u visited before v in the breadth-depth search of G? Factorization: D is G = (V, E), Q is (u, v). Preprocessing: Π(G) performs BDS on G, and returns a list M consisting of nodes in V in the same order as they are visited. For all queries (u, v), whether u occurs before v can be decided by a binary search on M, in log(|M|) time. BDTQ: the set of all query classes that can be made BD-tractable after proper factorization.
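The factorization for BDS can be sketched as follows. The code is a minimal illustration under assumptions: the graph is a dictionary from node number to neighbours, the search starts from a caller-supplied node and covers only its connected component, and the names `breadth_depth_order`, `build_lookup` and `visited_before` are hypothetical. To support binary search, the visit list M is stored as node numbers sorted alongside their visit positions.

```python
from bisect import bisect_left

def breadth_depth_order(adj, start):
    """Preprocessing: run breadth-depth search once, return nodes in visit order."""
    order, seen, stack = [start], {start}, []
    current = start
    while True:
        # Visit all unvisited children of the current node in numbering order.
        children = [w for w in sorted(adj[current]) if w not in seen]
        for w in children:
            seen.add(w)
            order.append(w)
        # Push in reverse order so the lowest-numbered child ends up on top.
        stack.extend(reversed(children))
        if not stack:
            return order
        current = stack.pop()

def build_lookup(order):
    """Sort (node, position) pairs by node number for binary search."""
    pairs = sorted((node, pos) for pos, node in enumerate(order))
    return [n for n, _ in pairs], [p for _, p in pairs]

def visited_before(lookup, u, v):
    """Query part: is u visited before v? Two binary searches on the lookup."""
    nodes, positions = lookup
    return positions[bisect_left(nodes, u)] < positions[bisect_left(nodes, v)]

graph = {1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2, 6], 5: [3], 6: [4]}
M = breadth_depth_order(graph, 1)
lookup = build_lookup(M)
print(M, visited_before(lookup, 6, 5))
```

On this small graph node 6 is visited before node 5, which a plain breadth-first search would not do; the expensive search runs once, and each query (u, v) afterwards costs only logarithmic time.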

11 Fundamental problems for BD-tractability BD-tractable queries help practitioners determine what query classes are tractable on big data. Are we done yet? No, a number of questions arise in connection with a complexity class! Reductions: how to transform a problem to another in the class that we know how to solve, and hence make it BD-tractable? Why do we need reduction? Complete problems: Is there a natural problem (a class of queries) that is the hardest one in the complexity class, i.e., a problem to which all problems in the complexity class can be reduced? Analogous to our familiar NP-complete problems. Name one NP-complete problem that you know. How large is BDTQ? BDTQ₀? Compared to P? NC? Why do we care? These questions are fundamental to any complexity class: P, NP, …

12 Reductions Departing from our familiar polynomial-time reductions, we need reductions that are in NC, and deal with both data D and query Q! They transform a given problem to one that we know how to solve. NC-factor reductions ≤_NC: a pair of NC functions that allow re-factorizations (repartitioning the data and query parts), for BDTQ; transformations for making queries BD-tractable. F-reductions ≤_F: a pair of NC functions that do not allow re-factorizations, for BDTQ₀; to determine whether a query class is BD-tractable. Properties: transitivity: if Q1 ≤_NC Q2 and Q2 ≤_NC Q3, then Q1 ≤_NC Q3 (also for ≤_F); compatibility: if Q1 ≤_NC Q2 and Q2 is in BDTQ, then so is Q1; if Q1 ≤_F Q2 and Q2 is in BDTQ₀, then so is Q1.

13 Complete problems for BDTQ A query class Q is complete for BDTQ if Q is in BDTQ, and moreover, for any query class Q’ in BDTQ, Q’ ≤_NC Q. A query class Q is complete for BDTQ₀ if Q is in BDTQ₀, and for any query class Q’ in BDTQ₀, Q’ ≤_F Q. Is there a complete problem for BDTQ? There exists a natural query class Q that is complete for BDTQ: Breadth-Depth Search (BDS). BDS is both P-complete and BDTQ-complete! What does this tell us?

14 Is there a complete problem for BDTQ₀? A query class Q is complete for BDTQ₀ if Q is in BDTQ₀, and for any query class Q’ in BDTQ₀, Q’ ≤_F Q. Unless P = NC, a query class complete for BDTQ₀ is a witness for P \ NC. Whether P = NC is as hard as whether P = NP. If we can find a complete problem for BDTQ₀ and show that it is not in NC, then we settle the big open question of whether P = NC. It is hard to find a complete problem for BDTQ₀: an open problem.

Comparing with P and NC How large is BDTQ? NC ⊆ BDTQ = P: all PTIME query classes can be made BD-tractable! We need proper factorizations to answer PTIME queries on big data. How large is BDTQ₀? Unless P = NC, NC ⊆ BDTQ₀ ⊊ P: BDTQ₀ is properly contained in P, i.e., not all PTIME (polynomial-time) query classes are BD-tractable. This separates BD-tractable from not BD-tractable query classes within PTIME. 15

16 Polynomial hierarchy revised Tractability revised for querying big data. [Figure: NP and beyond lie outside P; within P, query classes split into BD-tractable (parallel polylog time) and not BD-tractable.]

17 What can we get from BD-tractability? Why we need to study theory for querying big data: it gives guidelines for the following. What query classes are feasible on big data? (BDTQ₀) What query classes can be made feasible to answer on big data? (BDTQ) How to determine whether it is feasible to answer a class Q of queries on big data? Reduce Q to a complete problem Q_c for BDTQ via ≤_NC. If so, how to answer queries in Q? Identify factorizations (≤_NC reductions) such that Q ≤_NC Q_c, and compose the reduction and the algorithm for answering queries of Q_c.

Parallel scalability 18

Parallel query answering BD-tractability is hard to achieve. Parallel processing is widely used, given more resources. [Figure: a DB served by processor (P) and memory (M) pairs over an interconnection network; 10,000 processors.] Parallel scalable: the more processors, the “better”? Using SSD of 6G/s, a linear scan of D might take: 1.9 days/10000 = 16 seconds when D is of 1PB (10^15 B), 5.28 years/10000 = 4.63 hours when D is of 1EB (10^18 B). Only ideally! How to define “better”? 19

20 Degree of parallelism -- speedup Speedup: for a given task, TS/TL, where TS is the time taken by a traditional DBMS and TL is the time taken by a parallel system with more resources. TS/TL: more resources mean proportionally less time for a task. Linear speedup: the speedup is N while the parallel system has N times the resources of the traditional system. [Figure: speed (throughput, response time) plotted against resources; linear speedup.] Question: can we do better than linear speedup?

21 Degree of parallelism -- scaleup Scaleup: TS/TL, for a task Q and a task Q_N that is N times bigger than Q, and for a DBMS M_S and a parallel DBMS M_L that is N times larger. TS: time taken by M_S to execute Q. TL: time taken by M_L to execute Q_N. Linear scaleup: if TL = TS, i.e., the time is constant if the resources increase in proportion to the increase in problem size. [Figure: TS/TL plotted against resources and problem size.] Question: can we do better than linear scaleup?
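The two ratios from the speedup and scaleup slides can be written down directly. The timings below are invented purely for illustration, and the helper names are hypothetical.

```python
def speedup(ts, tl):
    """TS/TL for the same task: small system time over large system time."""
    return ts / tl

def scaleup(ts, tl):
    """TS/TL where the large system runs a task N times bigger."""
    return ts / tl

def is_linear_speedup(ts, tl, n, tol=1e-9):
    """Linear speedup: N times the resources give a speedup of N."""
    return abs(speedup(ts, tl) - n) <= tol * n

def is_linear_scaleup(ts, tl, tol=1e-9):
    """Linear scaleup: the elapsed time stays constant, so TS/TL = 1."""
    return abs(scaleup(ts, tl) - 1.0) <= tol

# Made-up numbers: 1200 s on one node, 150 s on 8 nodes for the same task.
print(speedup(1200, 150), is_linear_speedup(1200, 150, 8))
# Made-up numbers: an 8x bigger task on 8 nodes takes 1500 s instead of 1200 s.
print(scaleup(1200, 1500), is_linear_scaleup(1200, 1500))
```

In the second case the ratio falls below 1, the sub-linear scaleup that startup costs, interference and skew typically cause.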

22 Better than linear scaleup/speedup? NO, it is even hard to achieve linear speedup/scaleup! Give 3 reasons. Startup costs: initializing each process. Interference: competing for shared resources (network, disk, memory or even locks); think of the data shipment cost for shared-nothing architectures. Skew: it is difficult to divide a task into exactly equal-sized parts; the response time is determined by the largest part; think of blocking in MapReduce. Linear speedup is the best we can hope for -- optimal! In the real world, linear scaleup is too ideal to get! A weaker criterion: the more processors are available, the less response time it takes.

23 Parallel query answering Given a big graph G, and n processors S1, …, Sn G is partitioned into fragments (G1, …, Gn) G is distributed to n processors: Gi is stored at Si Performance guarantees: bounds on response time and data shipment Performance Parallel query answering Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q Output: Q(G), the answer to Q in G Response time (aka parallel computation cost): Interval from the time when Q is submitted to the time when Q(G) is returned Data shipment (aka network traffic): the total amount of data shipped between different processors, as messages What is it? Why do we need to worry about it?

24 Parallel scalability A distributed algorithm is useful if it is parallel scalable. Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q Output: Q(G), the answer to Q in G Complexity: t(|G|, |Q|) is the time taken by a sequential algorithm with a single processor; T(|G|, |Q|, n) is the time taken by a parallel algorithm with n processors. Parallel scalable: if T(|G|, |Q|, n) = O(t(|G|, |Q|)/n) + O((n + |Q|)^k), a polynomial reduction (including the cost of data shipment; k is a constant). When G is big, we can still query G by adding more processors if we can afford them.

25 linear scalability Querying big data by adding more processors An algorithm T for answering a class Q of queries Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q Output: Q(G), the answer to Q in G Algorithm T is linearly scalable in computation if its parallel complexity is a function of |Q| and |G|/n, and in data shipment if the total amount of data shipped is a function of |Q| and n The more processors, the less response time Independent of the size |G| of big G Is it always possible?

26 Graph pattern matching via graph simulation Input: a pattern graph Q and a graph G Output: Q(G) is a binary relation S on the nodes of Q and G such that each node u in Q is mapped to a node v in G with (u, v) ∈ S, and for each (u, v) ∈ S, each edge (u, u’) in Q is mapped to an edge (v, v’) in G with (u’, v’) ∈ S. O((|V| + |VQ|)(|E| + |EQ|)) time. Parallel scalable?
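A simple fixpoint computation illustrates the definition of graph simulation. It is a sketch under assumptions, not the algorithm behind the quoted time bound: nodes carry labels that must match (the slide does not show labels), the maximum relation is the one returned, and the function name `graph_simulation` is hypothetical.

```python
def graph_simulation(q_labels, q_edges, g_labels, g_edges):
    """Return the maximum simulation S as a set of (u, v) pairs, or an empty
    set if some pattern node has no match. Naive refinement, not optimised."""
    q_succ = {u: [] for u in q_labels}
    for u, u2 in q_edges:
        q_succ[u].append(u2)
    g_succ = {v: set() for v in g_labels}
    for v, v2 in g_edges:
        g_succ[v].add(v2)

    # Start from all label-compatible pairs (label matching is an assumption).
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]}
           for u in q_labels}

    changed = True
    while changed:
        changed = False
        for u in q_labels:
            for v in list(sim[u]):
                # Every pattern edge (u, u2) needs a graph edge (v, v2)
                # with v2 still matching u2; otherwise drop (u, v).
                if any(not (g_succ[v] & sim[u2]) for u2 in q_succ[u]):
                    sim[u].discard(v)
                    changed = True

    if any(not matches for matches in sim.values()):
        return set()
    return {(u, v) for u, matches in sim.items() for v in matches}

pattern_labels = {"qa": "A", "qb": "B"}
pattern_edges = [("qa", "qb")]
data_labels = {"a1": "A", "a2": "A", "b1": "B"}
data_edges = [("a1", "b1")]
S = graph_simulation(pattern_labels, pattern_edges, data_labels, data_edges)
print(sorted(S))
```

Node a2 has the right label but no edge to a B-node, so the refinement loop removes it from the matches of qa.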

27 Impossibility Nontrivial to develop parallel scalable algorithms There exists NO algorithm for distributed graph simulation that is parallel scalable in either computation, or data shipment Why? Pattern: 2 nodes Graph: 2n nodes, distributed to n processors Possibility: when G is a tree, parallel scalable in both response time and data shipment

28 Weak parallel scalability Algorithm T is weakly parallel scalable in computation if its parallel computation cost is a function of |Q|, |G|/n and |E_f|, and in data shipment if the total amount of data shipped is a function of |Q| and |E_f|, where E_f is the set of edges across different fragments. Rationale: we can partition G as preprocessing, such that |E_f| is minimized (an NP-complete problem, but there are effective heuristic algorithms), and when G grows, |E_f| does not increase substantially. The cost is not a function of |G| in practice. Doable: graph simulation is weakly parallel scalable.

29 MRC: Scalability of MapReduce algorithms Characterize scalable MapReduce algorithms in terms of disk usage, memory usage, communication cost, CPU cost and rounds. For a constant ε > 0, a data set D and |D|^(1-ε) machines, a MapReduce algorithm is in MRC if Disk: each machine uses O(|D|^(1-ε)) disk, O(|D|^(2-2ε)) in total. Memory: each machine uses O(|D|^(1-ε)) memory, O(|D|^(2-2ε)) in total. Data shipment: in each round, each machine sends or receives O(|D|^(1-ε)) amount of data, O(|D|^(2-2ε)) in total. CPU: in each round, each machine takes polynomial time in |D|. The number of rounds: polylog in |D|, that is, log^k(|D|). The larger D is, the more processors. The response time is still a polynomial in |D|.

30 MMC: a revision of MRC For a constant ε > 0, a data set D and n machines, a MapReduce algorithm is in MMC if Disk: each machine uses O(|D|/n) disk, O(|D|) in total. Memory: each machine uses O(|D|/n) memory, O(|D|) in total. Data shipment: in each round, each machine sends or receives O(|D|/n) amount of data, O(|D|) in total. CPU: in each round, each machine takes O(Ts/n) time, where Ts is the time to solve the problem on a single machine. The number of rounds: O(1), a constant number of rounds. Speedup: O(Ts/n) time; the more machines are used, the less time is taken. Compared with BD-tractability and parallel scalability: restricted to MapReduce. What algorithms are in MRC? Recursive computation?

Bounded evaluability 31

32 Scale independence Input: A class Q of queries Question: Can we find, for any query Q ∈ Q and any (possibly big) dataset D, a fraction D_Q of D such that |D_Q| ≤ M, and Q(D) = Q(D_Q)? Making the cost of computing Q(D) independent of |D|, the size of D! Particularly useful for a single dataset D, e.g., the social graph of Facebook. Minimum D_Q: the necessary amount of data for answering Q.

33 Facebook: Graph Search Find me restaurants in New York my friends have been to in 2013 select rid from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy) where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2013 How many tuples do we need to access? Facebook: 5000 friends per person Each year has at most 366 days Each person dines at most once per day pid is a key for relation person Access constraints (real-life limits) 33

34 Bounded query evaluation Find me restaurants in New York my friends have been to in 2013 select rid from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy) where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2013 A query plan: Fetch 5000 pid’s for friends of p0 (5000 friends per person). For each pid, check whether she lives in NYC (5000 person tuples). For each pid living in NYC, find restaurants where they dined in 2013 (5000 * 366 tuples at most). Accessing 5000 + 5000 + 5000 * 366 tuples in total. Contrast to Facebook: more than 1.38 billion nodes, and over 140 billion links.
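The query plan on this slide can be sketched over in-memory indices. Everything concrete here is an assumption for illustration: the dictionaries stand in for indices on friend, person and dine, the sample data is invented, and the function name is hypothetical. The counter shows that the plan touches only tuples reachable through the indices, never the whole dataset.

```python
MAX_FRIENDS = 5000        # friends per person
MAX_DINES_PER_YEAR = 366  # at most one dinner per day

# Worst-case number of tuples the plan can access, regardless of |D|.
WORST_CASE = MAX_FRIENDS + MAX_FRIENDS + MAX_FRIENDS * MAX_DINES_PER_YEAR

def restaurants_friends_visited(p0, friend_idx, person_idx, dine_idx,
                                city="NYC", year=2013):
    """Follow the three plan steps; return (restaurant ids, tuples accessed)."""
    accessed = 0
    friends = friend_idx.get(p0, [])           # step 1: pid1 -> pid2
    accessed += len(friends)
    result = set()
    for pid in friends:
        home = person_idx.get(pid)             # step 2: pid -> city (key lookup)
        if home is not None:
            accessed += 1
        if home != city:
            continue
        rids = dine_idx.get((pid, year), [])   # step 3: (pid, yy) -> rid
        accessed += len(rids)
        result.update(rids)
    return result, accessed

friend_idx = {"p0": ["p1", "p2", "p3"]}
person_idx = {"p1": "NYC", "p2": "LA", "p3": "NYC"}
dine_idx = {("p1", 2013): ["r1", "r2"], ("p2", 2013): ["r9"],
            ("p3", 2012): ["r3"], ("p3", 2013): ["r2"]}

rids, accessed = restaurants_friends_visited("p0", friend_idx, person_idx, dine_idx)
print(sorted(rids), accessed, WORST_CASE)
```

On the toy data the plan reads 9 tuples; under the stated limits it can never read more than 1,840,000, however large the underlying relations grow.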

35 Access constraints On a relation schema R: X → (Y, N), where X, Y are sets of attributes of R; for any X-value, there exist at most N distinct Y values; Index on X for Y: given an X value, find relevant Y values. Combining cardinality constraints and index. Access schema: a set of access constraints. Examples: friend(pid1, pid2): pid1 → (pid2, 5000), 5000 friends per person; dine(pid, rid, dd, mm, yy): pid, yy → (rid, 366), each year has at most 366 days and each person dines at most once per day; person(pid, name, city): pid → (city, 1), pid is a key for relation person.
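An access constraint X → (Y, N) pairs a cardinality bound with an index. The sketch below builds such an index and checks the bound on a relation instance; rows are plain dictionaries, the sample rows are invented, and the helper names `build_index` and `satisfies` are hypothetical.

```python
from collections import defaultdict

def build_index(rows, x_attrs, y_attrs):
    """Index on X for Y: map each X-value to its set of distinct Y-values."""
    index = defaultdict(set)
    for row in rows:
        x = tuple(row[a] for a in x_attrs)
        y = tuple(row[a] for a in y_attrs)
        index[x].add(y)
    return index

def satisfies(rows, x_attrs, y_attrs, n):
    """Check X -> (Y, N): every X-value has at most N distinct Y-values."""
    index = build_index(rows, x_attrs, y_attrs)
    return all(len(ys) <= n for ys in index.values())

friend = [{"pid1": "p0", "pid2": "p1"},
          {"pid1": "p0", "pid2": "p2"},
          {"pid1": "p1", "pid2": "p0"}]
person = [{"pid": "p0", "name": "Ann", "city": "NYC"},
          {"pid": "p1", "name": "Bob", "city": "LA"}]

print(satisfies(friend, ["pid1"], ["pid2"], 5000))  # friend: pid1 -> (pid2, 5000)
print(satisfies(person, ["pid"], ["city"], 1))      # person: pid -> (city, 1)
```

The same check with N = 1 on friend fails, since p0 has two friends; a functional dependency is the special case N = 1.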

36 Finding access schema On a relation schema R: X → (Y, N). How to find these? Functional dependencies X → Y: X → (Y, 1). Keys X: X → (R, 1). Domain constraints, e.g., each year has at most 366 days. Real-life bounds: 5000 friends per person (Facebook). The semantics of real-life data, e.g., accidents in the UK from dd, mm, yy → (aid, 610): at most 610 accidents in a day; aid → (vid, 192): at most 192 vehicles in an accident. Discovery: extension of functional dependency discovery, TANE. Bounded evaluability: only a small number of access constraints.

37 Bounded queries Input: A class Q of queries, an access schema A Question: Can we find by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction D_Q of D such that |D_Q| ≤ M, and Q(D) = Q(D_Q)? Boundedness: to decide whether it is possible to compute Q(D) by accessing a bounded amount of data at all. Examples: The graph search query at Facebook. All Boolean conjunctive queries are bounded. What are these? Boolean: Q(D) is true or false. Conjunctive: SPC (selection, projection, Cartesian product). But how to find D_Q?

38 Boundedly evaluable queries Input: A class Q of queries, an access schema A Question: Can we find by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction D_Q of D such that |D_Q| ≤ M, Q(D) = Q(D_Q), and moreover, D_Q can be identified (effectively found) in time determined by Q and A? Examples: The graph search query at Facebook. All Boolean conjunctive queries are bounded, but are not necessarily effectively bounded! If Q is boundedly evaluable, for any big D, we can efficiently compute Q(D) by accessing a bounded amount of data!

39 Deciding bounded evaluability Input: A query Q, an access schema A Question: Is Q boundedly evaluable under A? Conjunctive queries (SPC) with restricted query plans: yes, doable. Characterization: sound and complete rules. PTIME algorithms for checking effective boundedness and for generating query plans, in |Q| and |A|. Relational algebra (SQL): undecidable. What can we do? Special cases; sufficient conditions; parameterized queries in recommendation systems, even SQL. Many practical queries are in fact boundedly evaluable!

Techniques for querying big data 40

An approach to querying big data 41 Given a query Q, an access schema A and a big dataset D: 1. Decide whether Q is effectively bounded under A. 2. If so, generate a bounded query plan for Q. 3. Otherwise, do one of the following: ① Extend the access schema or instantiate some parameters of Q, to make Q effectively bounded; ② Use other tricks to make D small (to be seen shortly); ③ Compute approximate query answers to Q in D. Very effective for conjunctive queries: 77% of conjunctive queries are boundedly evaluable. Efficiency: 9 seconds vs. 14 hours of MySQL. 60% of graph pattern queries are boundedly evaluable (via subgraph isomorphism). Improvement: 4 orders of magnitude.

42 Bounded evaluability using views Input: A class Q of queries, a set of views V, an access schema A Question: Can we find by using A, for any query Q ∈ Q and any (possibly big) dataset D, a fraction D_Q of D such that |D_Q| ≤ M, a rewriting Q’ of Q using V, Q(D) = Q’(D_Q, V(D)), and D_Q can be identified in time determined by Q, V, and A? Access views, and additionally a bounded amount of data. Query Q may not be boundedly evaluable, but may be boundedly evaluable with views!

43 Incremental bounded evaluability Input: A class Q of queries, an access schema A Question: Can we find by using A, for any query Q ∈ Q, any dataset D, and any changes ∆D to D, a fraction D_Q of D such that |D_Q| ≤ M, Q(D ⊕ ∆D) = Q(D) ⊕ ∆Q(∆D, D_Q), and D_Q can be identified in time determined by Q and A? Here Q(D) is the old output, and computing ∆Q accesses an additional bounded amount of data. Query Q may not be boundedly evaluable, but may be incrementally boundedly evaluable!

Parallel query processing 44 Divide and conquer: partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites; upon receiving a query Q, evaluate Q(Gi) in parallel, i.e., evaluate Q on the smaller Gi; collect partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G. Parallel processing = Partial evaluation + message passing. Graph pattern matching in GRAPE: 21 times faster than MapReduce.
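The divide-and-conquer outline can be sketched for a deliberately easy case. The assumptions are substantial: worker threads stand in for sites, the graph is partitioned by source node so each fragment holds all outgoing edges of its nodes, and the query (nodes with at least k outgoing edges) decomposes over fragments, so no message passing between fragments is needed. Graph pattern queries generally do need it. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def local_eval(fragment, k):
    """Partial evaluation at one site: nodes with at least k outgoing edges."""
    return {v for v, succ in fragment.items() if len(succ) >= k}

def parallel_eval(fragments, k):
    """Evaluate on every fragment in parallel, then assemble at a coordinator."""
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = list(pool.map(lambda f: local_eval(f, k), fragments))
    answer = set()
    for partial in partials:
        answer |= partial  # assembling is a plain union for this query
    return answer

# Fragments G1, G2, G3: each maps a node to its successors (toy data).
fragments = [
    {"a": ["b", "c"], "b": ["c"]},
    {"c": ["a", "d", "e"]},
    {"d": [], "e": ["a", "b"]},
]
print(sorted(parallel_eval(fragments, 2)))
```

Each site works only on its own fragment, so the per-site cost depends on the fragment size rather than on the size of the whole graph.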

Query preserving compression 45 The cost of query processing: f(|G|, |Q|). Can we reduce the parameter |G|? Query preserving compression for a class L of queries: for any data collection G, Gc = R(G) (compressing); for any Q in L, Q(G) = P(Q, Gc) (post-processing). In contrast to lossless compression, retain only relevant information for answering queries in L. Query preserving! No need to restore the original graph G or decompress the data. Better compression ratio! 18 times faster on average for reachability queries.
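One simple instance of a compression that preserves reachability queries is to collapse each strongly connected component into a single node. This is a swapped-in textbook technique used for illustration, not necessarily the compression scheme the slide reports results for. The sketch assumes every node appears as a key of the adjacency dictionary; the names `compress` and `reachable` are hypothetical.

```python
def compress(graph):
    """R(G): collapse strongly connected components (Kosaraju, iterative).
    Returns the compressed graph Gc and the node -> component mapping."""
    order, seen = [], set()
    for root in graph:                      # pass 1: record DFS finish order
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()
    reverse = {v: [] for v in graph}
    for v, succ in graph.items():
        for w in succ:
            reverse[w].append(v)
    comp = {}
    for root in reversed(order):            # pass 2: sweep the reversed graph
        if root in comp:
            continue
        comp[root] = root
        todo = [root]
        while todo:
            v = todo.pop()
            for w in reverse[v]:
                if w not in comp:
                    comp[w] = root
                    todo.append(w)
    gc = {c: set() for c in set(comp.values())}
    for v, succ in graph.items():
        for w in succ:
            if comp[v] != comp[w]:
                gc[comp[v]].add(comp[w])
    return gc, comp

def reachable(gc, comp, s, t):
    """P(Q, Gc): rewrite (s, t) to component ids and search Gc only."""
    src, dst = comp[s], comp[t]
    seen, todo = {src}, [src]
    while todo:
        c = todo.pop()
        if c == dst:
            return True
        for d in gc[c]:
            if d not in seen:
                seen.add(d)
                todo.append(d)
    return False

G = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"], "f": ["a"]}
Gc, comp = compress(G)
print(len(G), len(Gc), reachable(Gc, comp, "a", "e"), reachable(Gc, comp, "e", "a"))
```

Queries are answered on Gc directly, without restoring G; here six nodes shrink to three while every reachability answer stays the same.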

Answering queries using views 46 The cost of query processing: f(|G|, |Q|). Can we compute Q(G) without accessing G, i.e., independent of |G|? Query answering using views: given a query Q in a language L and a set V of views, find another query Q’ such that Q and Q’ are equivalent, i.e., for any G, Q(G) = Q’(G), and Q’ only accesses V(G). The complexity is no longer a function of |G|. V(G) is often much smaller than G (4% -- 12% on real-life data). Improvement: 31 times faster for graph pattern matching.

Incremental query answering 47 Minimizing unnecessary recomputation Incremental query processing: Input: Q, G, Q(G), ∆G Output: ∆M such that Q(G ⊕ ∆G) = Q(G) ⊕ ∆M Changes to the output New output Changes to the input Old output When changes ∆G to the data G are small, typically so are the changes ∆M to the output Q(G ⊕ ∆G) Changes ∆G are typically small Compute Q(G) once, and then incrementally maintain it Real-life data is dynamic – constantly changes, ∆G Re-compute Q(G ⊕ ∆G) starting from scratch? 5%/week in Web graphs At least twice as fast for pattern matching for changes up to 10% 47
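A small example makes the input/output contract concrete. The sketch below is an assumption-laden illustration, not an algorithm from the slides: the query Q(G) is "the set of nodes reachable from a fixed source", ∆G is a single edge insertion, ⊕ on the output is set union, and the function name `insert_edge` is hypothetical. The traversal only explores nodes that become newly reachable, so the work depends on the change rather than on the whole graph.

```python
def insert_edge(graph, reached, a, b):
    """Apply the change (a, b) to the graph and return the output change:
    the set of nodes that become reachable from the source.
    `reached` is the old output Q(G) and is updated in place."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set())
    if a not in reached or b in reached:
        return set()               # the output does not change
    delta, todo = {b}, [b]
    while todo:
        v = todo.pop()
        for w in graph.get(v, ()):
            # Only explore nodes that were not reachable before.
            if w not in reached and w not in delta:
                delta.add(w)
                todo.append(w)
    reached |= delta               # new output = old output plus the change
    return delta

graph = {"s": {"a"}, "a": set(), "b": {"c"}, "c": set(), "d": set()}
reached = {"s", "a"}               # old output Q(G) for source "s"

print(sorted(insert_edge(graph, reached, "a", "b")))  # b and c become reachable
print(sorted(insert_edge(graph, reached, "d", "s")))  # d is unreachable: no change
print(sorted(reached))
```

Deletions are harder to handle this way and are left out of the sketch.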

Complexity of incremental problems Bounded: the cost is expressible as f(|CHANGED|, |Q|)? 48 Complexity analysis in terms of the size of changes Incremental query answering Input: Q, G, Q(G), ∆G Output: ∆M such that Q(G ⊕ ∆G) = Q(G) ⊕ ∆M The cost of query processing: a function of |G| and |Q| incremental algorithms: |CHANGED|, the size of changes in the input: ∆G, and the output: ∆M The updating cost that is inherent to the incremental problem itself Incremental algorithms? Incremental graph simulation: bounded 48

49 A principled approach: Making big data small Boundedly evaluable queries. Parallel query processing (MapReduce, GRAPE, etc). Query preserving compression: convert big data to small data. Query answering using views: make big data small. Bounded incremental query answering: depending on the size of the changes rather than the size of the original big data. ... Including but not limited to graph queries. Yes, MapReduce is useful, but it is not the only way! Combinations of these can do much better than MapReduce!

50 Summary and Review What is BD-tractability? Why do we care about it? What is parallel scalability? Name a few parallel scalable algorithms What is bounded evaluability? Why do we want to study it? How to make big data “small”? Is MapReduce the only way for querying big data? Can we do better than it? What is query preserving data compression? Query answering using views? Bounded incremental query answering? If a class of queries is known not to be BD-tractable, how can we process the queries in the context of big data?

51 Projects (1) Prove or disprove that one of the following query classes is BD-tractable, parallel scalable, or in MMC. If so, give an algorithm as a proof. Otherwise, prove the impossibility but identify practical sub-classes that are scalable. The query classes include (pick one of these): Distance queries on graphs; Graph pattern matching by subgraph isomorphism; Graph pattern matching by graph simulation; Subgraph isomorphism and graph simulation on trees. Experimentally evaluate your algorithms. Both impossibility and possibility results are useful!

52 Projects (2) Improve the performance of graph pattern matching via subgraph isomorphism via one of the following approaches: query-preserving graph compression query answering using views Prove the correctness of your algorithm, give complexity analysis and provide performance guarantees Experimentally evaluate your algorithm and demonstrate the improvement A research and development project 52

53 Projects (3) It is known that graph pattern matching via graph simulation can benefit from: query-preserving graph compression query answering using views W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression, SIGMOD, (query-preserving compression) W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries using Views, ICDE (query answering using views) Implement one of the algorithms Experimentally evaluate your algorithm and demonstrate the improvement Bonus: can you combine the two approaches and verify its benefit? A development project 53

54 Projects (4) Find an application with a set of SPC (conjunctive) queries and a dataset Identify access constraints on your dataset for your queries Implement an algorithm that, given a query in your class, decide whether the query is boundedly evaluable under your access constraints If so, generate a query plan to evaluate your queries by accessing a bounded amount of data Experimentally evaluate your algorithm and demonstrate the improvement A development project 54

55 Projects (5) Write a survey on techniques for querying big data, covering parallel query processing, data compression query answering using views incremental query processing … Develop a good understanding on the topic Survey: A set of 5-6 representative papers A set of criteria for evaluation Evaluate each model based on the criteria Make recommendation: what to use in different applications 55

56 Reading: data quality W. Fan and F. Geerts. Foundations of data quality management. Morgan & Claypool Publishers, (available upon request) –Data consistency (Chapter 2) –Entity resolution (record matching; Chapter 4) –Information completeness (Chapter 5) –Data currency (Chapter 6) –Data accuracy (SIGMOD 2013 paper) –Deducing the true values of objects in data fusion (Chapter 7)

Reading for the next week 1. M. Arenas, L. E. Bertossi, J. Chomicki: Consistent Query Answers in Inconsistent Databases, PODS. 2. Indrajit Bhattacharya and Lise Getoor. Collective Entity Resolution in Relational Data. TKDD, /bhattacharya-tkdd.pdf 3. P. Li, X. Dong, A. Maurino, and D. Srivastava. Linking Temporal Records. VLDB. 4. W. Fan and F. Geerts, Relative information completeness, PODS. 5. Y. Cao, W. Fan, and W. Yu. Determining relative accuracy of attributes. SIGMOD. 6. P. Buneman, S. Davidson, W. Fan, C. Hara and W. Tan. Keys for XML. WWW 2001.