Efficient Query Evaluation on Probabilistic Databases


1 Efficient Query Evaluation on Probabilistic Databases
Papers by Nilesh Dalvi, Dan Suciu, Chris Re

2 Outline Motivation Definitions through examples Evaluation Complexity

3 Motivation
Imprecise information on the web: partial information, contradictions
Imprecise queries

4 Imprecise Querying

5 Interpreting the ‘~’
For the actor's name we can use edit distance, frequency-based similarity measures, …
For the film's rating we can use user preferences, analysis of previous queries, …
But how do we combine them? And how do we assign a score to a tuple with respect to the entire query?

6 Probabilistic Independence
P(a) denotes the probability of an event a
P(S) is the probability that all events in a set S occur; P(B) is the probability of a Boolean expression B over events
Events a and b are independent iff P({a,b}) = P(a) * P(b)
For independent a and b, P(a or b) = 1 - (1 - P(a)) * (1 - P(b))
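The formula for the disjunction follows from independence by passing to complements. A short derivation, assuming a and b are independent (in which case their complements are independent as well):

```latex
\begin{align*}
P(a \lor b) &= 1 - P(\lnot a \land \lnot b) \\
            &= 1 - P(\lnot a)\,P(\lnot b) \\
            &= 1 - (1 - P(a))(1 - P(b))
\end{align*}
```

For instance, with P(a) = 0.8 and P(b) = 0.5 this gives 1 - 0.2 * 0.5 = 0.9.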

7 Probabilistic DB
Each tuple has a probability of appearing in the DB
Assume tuple independence
This induces a distribution over all possible DB instances
Possible Worlds Semantics

8 Example

9 Semantics
A query is evaluated on every possible world
Note that in each concrete world the query may have several answers
The probability of an answer is the sum of the probabilities of the worlds in which it appears in the set of answers
Example

10 Example (Join on B=C)
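The example tables are not reproduced in this transcript, so the following is only a minimal sketch of possible-worlds evaluation on assumed data. The relations S(A, B) and T(C, D), their tuples, and all probabilities are hypothetical; the function names are illustrative as well. The sketch enumerates every world, runs the join on B = C followed by a projection on D, and sums world probabilities per answer.

```python
from itertools import product

# Hypothetical tuple-independent database (illustrative values, not from the slides).
# Each entry maps a tuple to the probability that it appears.
S = {("m", 1): 0.8, ("n", 1): 0.5}   # S(A, B)
T = {(1, "p"): 0.6}                  # T(C, D)


def possible_worlds(relation):
    """Yield (set_of_tuples, probability) for one tuple-independent relation."""
    tuples = list(relation)
    for mask in product([False, True], repeat=len(tuples)):
        world = {t for t, keep in zip(tuples, mask) if keep}
        prob = 1.0
        for t, keep in zip(tuples, mask):
            prob *= relation[t] if keep else 1.0 - relation[t]
        yield world, prob


def query(s_world, t_world):
    """Join S and T on B = C, then project on D (ordinary, deterministic query)."""
    return {d for (_a, b) in s_world for (c, d) in t_world if b == c}


def answer_probabilities(S, T):
    """Sum, for each answer, the probabilities of the worlds that return it."""
    result = {}
    for s_world, ps in possible_worlds(S):
        for t_world, pt in possible_worlds(T):
            for answer in query(s_world, t_world):
                result[answer] = result.get(answer, 0.0) + ps * pt
    return result


print(answer_probabilities(S, T))  # 'p' gets probability of about 0.54
```

This brute-force approach is exponential in the number of tuples, so it serves as a reference semantics rather than an evaluation strategy.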

11 Another Example (join and projection on A)

12 Solution attempt
Obtain a query plan
Compute intermediate results along with their probabilities
A plan for our (first) example: first compute the join, then project on D

13 Evaluation of the plan

14 Wrong!
The tuples in the original DB were independent
The tuples in the intermediate DB are not!
Thus the multiplication (for the projection) is incorrect.
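A small numeric sketch of the mistake, under assumed data: two S tuples with events s1 and s2 both join with a single T tuple with event t1. The event names match the formulas used later in the deck, but the probabilities are hypothetical. The projection step wrongly treats the two joined tuples as independent even though both contain t1.

```python
# Hypothetical probabilities (illustrative, not from the slides).
p_s1, p_s2, p_t1 = 0.8, 0.5, 0.6


def p_or_independent(*ps):
    """P(e1 or ... or ek); valid ONLY when the events are independent."""
    none_true = 1.0
    for p in ps:
        none_true *= 1.0 - p
    return 1.0 - none_true


# Step 1: join. Each joined tuple gets the product of its two inputs,
# which is correct because s_i and t1 are independent.
p_join_1 = p_s1 * p_t1
p_join_2 = p_s2 * p_t1

# Step 2: project on D, treating the joined tuples as independent.
# This is the error: both joined tuples depend on t1.
p_unsafe = p_or_independent(p_join_1, p_join_2)

# Reference value: (s1 or s2) and t1, where s1, s2, t1 really are independent.
p_correct = p_or_independent(p_s1, p_s2) * p_t1

print(p_unsafe, p_correct)  # about 0.636 versus about 0.54
```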

15 The problem is hard
Theorem: Answering a query over a general probabilistic DB is #P-hard (data complexity)
#P-hardness is the analogue of NP-hardness for counting problems
E.g. #SAT: given a Boolean formula, compute how many satisfying assignments it has
Such problems are unlikely to have a polynomial-time solution
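To make #SAT concrete, here is a brute-force counter. It is only an illustration of the problem statement; the function name and the sample formula are chosen for this sketch. The link to probabilistic databases: if every variable is true independently with probability 1/2, the probability of the formula is the count divided by 2^n.

```python
from itertools import product


def count_satisfying(formula, num_vars):
    """Count assignments of num_vars Boolean variables that satisfy formula.

    Tries all 2**num_vars assignments, so it runs in exponential time.
    """
    return sum(
        1
        for bits in product([False, True], repeat=num_vars)
        if formula(*bits)
    )


# Sample formula: (x or y) and z
count = count_satisfying(lambda x, y, z: (x or y) and z, 3)
print(count)        # 3
print(count / 2**3)  # 0.375, the probability when each variable has probability 1/2
```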

16 Other plans
Some query plans are OK
These are plans that preserve independence
Let us represent the query as a logical formula
The tuples that support the answer ‘p’ satisfy: (s1 or s2) and t1

17 Plans and formulas
The query probability was P((s1 or s2) and t1)
"First join, then project" corresponds to P((s1 and t1) or (s2 and t1))
The two formulas are logically equivalent, so this rewriting is fine in a classic DB
But (s1 and t1) and (s2 and t1) are not independent events: both depend on t1

18 Safe Plan
A plan that preserves independence is called safe
In our example: first project S on B, and only then join with T, i.e. first compute the ‘OR’, then the ‘AND’
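The safe ordering can be sketched with two operators that multiply probabilities as if their inputs were independent. The relations, tuples, and probabilities below are hypothetical, and `project` and `join` are illustrative helpers rather than any system's API. Run in the safe order, every operator only ever combines independent events, so the result is exact.

```python
# Hypothetical data (illustrative values, not from the slides).
S = {("m", 1): 0.8, ("n", 1): 0.5}   # S(A, B)
T = {(1, "p"): 0.6}                  # T(C, D)


def project(relation, positions):
    """Independent projection: OR together tuples that agree on `positions`."""
    none_present = {}
    for tup, p in relation.items():
        key = tuple(tup[i] for i in positions)
        none_present[key] = none_present.get(key, 1.0) * (1.0 - p)
    return {key: 1.0 - q for key, q in none_present.items()}


def join(left, right, left_pos, right_pos):
    """Independent join: AND together tuples whose join columns match."""
    return {
        l + r: pl * pr
        for l, pl in left.items()
        for r, pr in right.items()
        if l[left_pos] == r[right_pos]
    }


# Safe order: project S on B first (the OR), then join with T (the AND).
s_on_b = project(S, [1])          # one tuple (1,) with P(s1 or s2)
joined = join(s_on_b, T, 0, 0)    # one tuple (1, 1, 'p') with P(s1 or s2) * P(t1)
answer = project(joined, [2])     # final projection on D merges nothing

print(answer)  # ('p',) with probability of about 0.54
```

Applying the same two operators in the other order (join first, then project on D) would OR two tuples that share t1, which is exactly the unsafe case.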

19 Safe Plan

20 Intuition on evaluation
Work with probabilistic events
Carry the events during evaluation

21 Probabilistic Events
Atomic events -> tuples in the original DB
Complex events (Boolean combinations of events) -> tuples in intermediate DBs
Translate a query plan to a complex event
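A minimal sketch of carrying events instead of probabilities. The encoding of complex events as nested tuples, the helper names, and the probabilities are all assumptions made for this illustration. Because the probability is computed from the event itself (here by enumerating the atomic events), any plan yields the same value, at exponential cost.

```python
from itertools import product


def atoms(event):
    """Collect the atomic event names occurring in a complex event."""
    if isinstance(event, str):
        return {event}
    _op, *args = event
    return set().union(*(atoms(a) for a in args))


def holds(event, world):
    """Evaluate a complex event in a world (dict: atomic event -> bool)."""
    if isinstance(event, str):
        return world[event]
    op, *args = event
    if op == "and":
        return all(holds(a, world) for a in args)
    if op == "or":
        return any(holds(a, world) for a in args)
    raise ValueError(f"unknown operator: {op}")


def probability(event, probs):
    """Exact P(event), assuming the atomic events are independent."""
    names = sorted(atoms(event))
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        if holds(event, world):
            weight = 1.0
            for name, present in world.items():
                weight *= probs[name] if present else 1.0 - probs[name]
            total += weight
    return total


# Hypothetical probabilities for three atomic events.
probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}

# The same answer tuple as produced by two different plans.
join_then_project = ("or", ("and", "s1", "t1"), ("and", "s2", "t1"))
project_then_join = ("and", ("or", "s1", "s2"), "t1")

print(probability(join_then_project, probs))  # about 0.54
print(probability(project_then_join, probs))  # about 0.54
```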

22 Translation

23 Translating events to probabilities (Works iff the DB preserves independence!)

24 Safe Plans
A relational algebra expression has multiple equivalent expressions
Each corresponds to a concrete execution plan
Some of these plans yield correct probabilistic computations, others incorrect ones
Let us try to detect what makes a plan safe.

25 Checking for safe plans
Attach a complex event E as an attribute of each tuple
For every relation R, Attr(R) -> R.E is an FD of R, where Attr(R) is the set of all other attributes of R

26 So what can we do?
1. Compute a safe plan when there is one
2. Compute an approximation when there is not

27 Approximation
The most common approach is Monte Carlo approximation
Originally by Karp, improved in [suciu07]
Guarantees convergence
After (4*n / e^2) * ln(2/d) sampling steps, the error is greater than e with probability less than d
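A sketch of a Karp-Luby style estimator for the probability of a DNF event, plus the step bound quoted on the slide. Several things here are assumptions: the slide does not define n, so the code takes it to be the number of clauses; the event is assumed to be a monotone DNF over independent atomic events; and the function names, data, and seed are illustrative. This is not claimed to be the exact variant used in [suciu07].

```python
import math
import random


def sample_count(n, eps, delta):
    """Number of steps (4*n / eps^2) * ln(2/delta), rounded up.

    Assumption: n is the number of clauses of the DNF.
    """
    return math.ceil(4 * n / eps**2 * math.log(2 / delta))


def karp_luby(clauses, probs, num_samples, rng):
    """Estimate P(C1 or ... or Cm) for clauses given as sets of atomic events."""
    # Weight of a clause = probability that all of its events are true.
    weights = []
    for clause in clauses:
        w = 1.0
        for event in clause:
            w *= probs[event]
        weights.append(w)
    total = sum(weights)

    events = sorted(probs)
    hits = 0
    for _ in range(num_samples):
        # Pick a clause proportionally to its weight ...
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # ... and sample a world conditioned on that clause being true.
        world = {
            e: (e in clauses[i]) or (rng.random() < probs[e]) for e in events
        }
        # Count the sample only if i is the first satisfied clause,
        # so that each world is counted for exactly one clause.
        first = next(
            j for j, c in enumerate(clauses) if all(world[e] for e in c)
        )
        if first == i:
            hits += 1
    return total * hits / num_samples


probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}          # hypothetical values
clauses = [{"s1", "t1"}, {"s2", "t1"}]              # (s1 and t1) or (s2 and t1)

steps = sample_count(len(clauses), eps=0.1, delta=0.05)
estimate = karp_luby(clauses, probs, 20000, random.Random(0))
print(steps, estimate)  # the estimate should be close to the exact value 0.54
```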

28 Functional Dependencies (FDs)
A functional dependency {A1,…,An} -> B holds for a relation R if the values of A1,…,An determine the value of B
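A minimal sketch of checking an FD against one relation instance. The representation of rows as dictionaries and the sample data are assumptions made for this illustration. Note that a check on an instance can only refute an FD; an FD is a constraint that should hold for every instance of R.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the attributes in lhs determine attribute rhs in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = row[rhs]
        # setdefault stores the first value seen for this key and returns it later.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical relation instance.
rows = [
    {"A": "m", "B": 1, "D": "p"},
    {"A": "n", "B": 1, "D": "p"},
]

print(fd_holds(rows, ["A"], "B"))  # True: each A value has a single B value
print(fd_holds(rows, ["B"], "A"))  # False: B = 1 occurs with both 'm' and 'n'
```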

29 Safe plans using FDs
Selections and joins (over conjunctive queries) are always safe (but may make later operations unsafe)
A projection on a1,…,ak of the result of q is safe if, for every relation R in q, the FD a1,…,ak, R.E -> Head(q) holds
Here Head(q) denotes the attributes in the result of q
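The projection test can be sketched with the standard attribute-closure algorithm for FDs. This is a sketch under stated assumptions: the condition checked is a1,…,ak, R.E -> Head(q) for every relation R in q; besides Attr(R) -> R.E, each atomic event is assumed to identify its tuple (R.E -> Attr(R)); and a join predicate B = C is modeled as the two FDs B -> C and C -> B. The relation names S(A, B) and T(C, D) and the helper names are hypothetical.

```python
def closure(attrs, fds):
    """All attributes determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def projection_is_safe(kept, head, event_attrs, fds):
    """Check kept + {R.E} -> head for the event attribute of every relation."""
    return all(
        set(head) <= closure(set(kept) | {event}, fds) for event in event_attrs
    )


# FDs for the join of S(A, B) and T(C, D) on B = C (assumed, see lead-in).
fds = [
    (("A", "B"), ("S.E",)), (("S.E",), ("A", "B")),
    (("C", "D"), ("T.E",)), (("T.E",), ("C", "D")),
    (("B",), ("C",)), (("C",), ("B",)),
]

# Projecting the join on D: T.E together with D does not determine A.
print(projection_is_safe(["D"], ["A", "B", "C", "D"], ["S.E", "T.E"], fds))  # False

# Projecting S alone on B: S.E determines A and B.
print(projection_is_safe(["B"], ["A", "B"], ["S.E"], fds[:2]))  # True
```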

30 Intuition
Projection over a1,…,an -> OR over all tuples that have the same values of {a1,…,an}
For the ORed events to be independent, each atomic event must be sufficient to distinguish the tuples that are ORed (otherwise it appears in more than one of them)
I.e. it uniquely determines the other atomic events appearing in the tuple
Hence the FD (valid only in combination with a1,…,an)

31 Safe and Unsafe queries
We can always compute an answer
But the computation might be exponential…
Computing P(e) for a general formula e is #P-hard (data complexity)
It’s hard when no safe plan exists

32 Conjunctive Queries and Union thereof
Whiteboard discussion

33 Algorithms
Many more details in the top-k talk
We’ll give just an overview here

34 Optimality
The data complexity of a query q is #P-complete iff the safe-plan algorithm fails on q

35 Safe Plan algorithm
Top-down
Push all safe projections late in the plan
When you can’t, split the query q into two sub-queries q1 and q2 whose join is q (when possible)
If stuck, the query is unsafe

36 (Union of) Conjunctive Queries by example
T(x) :- R(x,y), S(y,30)
T(x) :- P(x,y)
In relational algebra?
Multiple possible translations
They correspond to different orderings of the operations
Each option is called a “query plan”
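Two possible relational algebra translations of the example, written out as a sketch. The column names are assumptions: R and P are taken to have columns (x, y), and S columns (y, z), where z is a hypothetical name for the position holding the constant 30. The two plans differ only in whether S is projected on y before the join.

```latex
\begin{align*}
\text{Plan 1:}\quad T &= \Pi_{x}\bigl(R \bowtie_{y} \sigma_{z = 30}(S)\bigr) \;\cup\; \Pi_{x}(P) \\
\text{Plan 2:}\quad T &= \Pi_{x}\bigl(R \bowtie_{y} \Pi_{y}(\sigma_{z = 30}(S))\bigr) \;\cup\; \Pi_{x}(P)
\end{align*}
```

On a classic DB both plans return the same answers; on a probabilistic DB they may assign different probabilities, which is why the choice of plan matters.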

37 More notations
Head(q) is the set of head variables in q
FreeVar(q) is the set of free variables (i.e. non-head variables) in q
R.Key is the set of variables in the key positions of the relation R
R.NonKey is the set of variables in the non-key positions of the relation R
R.Pred is the predicate that q applies to R
For x in FreeVar(q), qx denotes a new query whose body is identical to that of q and where Head(qx) = Head(q) U {x}

41 Conclusion
Probabilistic DBs are a very strong tool
They combine the exact semantics of classic DBs with the capabilities of IR
Exact evaluation is sometimes hard
But there are good approximations (with bounds!)

