Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Modified by Veeranjaneyulu Sadhanala.


Outline
- Motivation
- Query Evaluation: Intensional, Extensional
- Query Optimization
- Complexity
- Unsafe Plans
- Extensions
- Conclusions

Databases Are Deterministic
- The databases we see today are deterministic.
- A tuple is either in the database or not.
- They do not deal with uncertainty.

Future of Data Management
- Uncertainties in data
- Uncertainties are represented as probabilities
- Extend data management tools to handle probabilistic data

Example
An accident happened in Pune or Nagpur.

Accidents table:
Car    | Area   | Likelihood
MH1234 | Pune   | 0.7
MH1234 | Nagpur | 0.3

Representing Uncertainty
- Tuple-existence uncertainty: all attributes in a tuple are known precisely; the existence of the tuple is uncertain. E.g. the previous slide. More later.
- Attribute-value uncertainty: tuples (identified by keys) exist for certain; the values of one or more attributes are uncertain. E.g. "Tomorrow, it may rain" (probability is 0.6).

Our Goal For Today
- Understand how queries can be evaluated efficiently on probabilistic databases.
- We will deal with tuple-level uncertainty only.
- We also assume independence among tuples, i.e. P(t1, t2) = P(t1) * P(t2).

Possible Worlds: Example 1
Probabilistic table:
Camera | Feature | p
C21    | Lens    | p1
C29    | Battery | p2
C31    | Lens    | p3

Some of the possible worlds (each is a deterministic Camera/Feature table):
- I1 = {} (empty): (1-p1)(1-p2)(1-p3)
- I2 = {(C21, Lens)}: p1(1-p2)(1-p3)
- I3 = {(C29, Battery)}: (1-p1)p2(1-p3)
- I4 = {(C21, Lens), (C31, Lens)}: p1(1-p2)p3
- I5 = {(C21, Lens), (C29, Battery), (C31, Lens)}: p1 p2 p3

Total number of worlds: 2^count_tuples. The probabilities of all worlds sum to 1: ∑ Pr(Ii) = 1.

Possible Worlds: Example 2
Table S:
   | A   | B | Prob.
s1 | ‘m’ | 1 | 0.8
s2 | ‘n’ | 1 | 0.5

Table T:
   | C | D   | Prob.
t1 | 1 | ‘p’ | 0.6

Possible worlds pwd(D^p):
World             | Prob.
D1 = {s1, s2, t1} | 0.24
D2 = {s1, t1}     | 0.24
D3 = {s2, t1}     | 0.06
D4 = {t1}         | 0.06
D5 = {s1, s2}     | 0.16
D6 = {s1}         | 0.16
D7 = {s2}         | 0.04
D8 = {}           | 0.04
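The world probabilities in this example follow directly from tuple independence. A minimal sketch of that enumeration is below; the names `tuple_probs` and `possible_worlds` are illustrative and not taken from the slides.

```python
from itertools import product

# Tuple probabilities from the example: S = {s1, s2}, T = {t1}.
tuple_probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}

def possible_worlds(probs):
    """Yield (world, probability) for every subset of independent tuples."""
    names = list(probs)
    for present in product([True, False], repeat=len(names)):
        world = frozenset(n for n, keep in zip(names, present) if keep)
        prob = 1.0
        for n, keep in zip(names, present):
            # A present tuple contributes p, an absent one contributes 1 - p.
            prob *= probs[n] if keep else 1.0 - probs[n]
        yield world, prob

worlds = dict(possible_worlds(tuple_probs))
for world, prob in worlds.items():
    print(sorted(world), round(prob, 2))
```

With three tuples this produces 2^3 = 8 worlds whose probabilities sum to 1.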

Query Evaluation
- Consider a query: S join T on B = C, project on D.
- Intuitively: execute the query on each possible world.
- The final result is a probabilistic relation: each possible answer, with the total probability of the worlds that produce it.

Query Evaluation: Example
Query: S join T on B = C, project on D

World             | Prob. | Result
D1 = {s1, s2, t1} | 0.24  | {‘p’}
D2 = {s1, t1}     | 0.24  | {‘p’}
D3 = {s2, t1}     | 0.06  | {‘p’}
D4 = {t1}         | 0.06  | {}
D5 = {s1, s2}     | 0.16  | {}
D6 = {s1}         | 0.16  | {}
D7 = {s2}         | 0.04  | {}
D8 = Φ            | 0.04  | {}

q^pwd(D^p) =
Answer | Prob.
{‘p’}  | 0.54
Φ      | 0.46
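The possible-worlds evaluation above can be reproduced with a brute-force sketch: run the query in every world and add up the world probabilities per answer. The encoding of S and T as dictionaries and the function names are assumptions made for illustration.

```python
from itertools import product

S = {"s1": ("m", 1), "s2": ("n", 1)}   # tuple id -> (A, B)
T = {"t1": (1, "p")}                   # tuple id -> (C, D)
probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}

def query(world):
    """Evaluate 'S join T on B = C, project on D' on one deterministic world."""
    return frozenset(
        d
        for s_id, (_, b) in S.items() if s_id in world
        for t_id, (c, d) in T.items() if t_id in world
        if b == c
    )

def answer_distribution():
    """Sum world probabilities per query answer (exponential in the tuple count)."""
    ids = list(probs)
    dist = {}
    for present in product([True, False], repeat=len(ids)):
        world = {i for i, keep in zip(ids, present) if keep}
        p = 1.0
        for i, keep in zip(ids, present):
            p *= probs[i] if keep else 1.0 - probs[i]
        result = query(world)
        dist[result] = dist.get(result, 0.0) + p
    return dist

dist = answer_distribution()
print({tuple(sorted(k)): round(v, 2) for k, v in dist.items()})
```

The loop visits every subset of tuples, which is exactly the exponential cost that makes this semantics impractical on real data.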

Query Evaluation
- Semantically correct.
- If the database has n tuples, there can be as many as 2^n possible worlds.
- Exponential complexity, thus impractical.
- Goal of the paper: evaluate queries efficiently.

Intensional Query Evaluation
- Define the complex event e^p(t) for each tuple t.
- For each intermediate tuple, associate an explicit (complex) event expression.
- Compute the actual probabilities at the end.
- For this talk, we will look only at select, join and project queries.

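Intensional Semantics
Each operator propagates event expressions:
- Selection σ: an input tuple v with event E yields v with event E.
- Join ×: input tuples v1 with event E1 and v2 with event E2 yield (v1, v2) with event E1 ∧ E2.
- Projection Π: input tuples with the same projected value v and events E1, E2, … yield v with event E1 ∨ E2 ∨ …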

Theorem (2)
The intensional semantics and the possible-worlds semantics on probabilistic databases are equivalent for conjunctive queries:
pwd(q^i(D^p)) = q^pwd(D^p)

Intensional Semantics: Example
Table S:
   | A   | B | Prob.
s1 | ‘m’ | 1 | 0.8
s2 | ‘n’ | 1 | 0.5

Table T:
   | C | D   | Prob.
t1 | 1 | ‘p’ | 0.6

S join T on B = C:
A   | B | C | D   | E
‘m’ | 1 | 1 | ‘p’ | s1 ∧ t1
‘n’ | 1 | 1 | ‘p’ | s2 ∧ t1

Project on D:
D   | E
‘p’ | (s1 ∧ t1) ∨ (s2 ∧ t1)

q^rank(D^p) =
D   | Rank
‘p’ | Pr(q)

Pr(q) = (0.8 * 0.6) + (0.5 * 0.6) - (0.8 * 0.5 * 0.6) = 0.48 + 0.3 - 0.24 = 0.54
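The final step of the intensional approach is computing the probability of the event expression. A small sketch, assuming the expression is kept in DNF as a list of sets of tuple ids (this representation and the name `dnf_probability` are illustrative), computes it exactly by enumerating truth assignments:

```python
from itertools import product

probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}

# Event of the answer 'p' in DNF: (s1 AND t1) OR (s2 AND t1).
event_p = [{"s1", "t1"}, {"s2", "t1"}]

def dnf_probability(clauses, probs):
    """Exact probability of a DNF over independent events, by enumeration."""
    variables = sorted(set().union(*clauses))
    total = 0.0
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # The DNF holds if at least one clause has all its events true.
        if any(all(assignment[v] for v in clause) for clause in clauses):
            weight = 1.0
            for v in variables:
                weight *= probs[v] if assignment[v] else 1.0 - probs[v]
            total += weight
    return total

print(round(dnf_probability(event_p, probs), 2))
```

The enumeration is exponential in the number of distinct events in the expression, which is consistent with the cost concerns raised for intensional evaluation.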

Intensional Semantics
- The final answer does not depend on the choice of plan.
- Impractical to use:
  - The event expressions can become very large due to projections.
  - For each tuple t, one has to compute Pr(e) for its event e, which is a #P-complete problem.
- Thus very expensive.

Extensional Semantics
- Work with probabilities directly instead of event expressions.
- Much more efficient.
- Assumes tuple independence.

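Extensional Semantics
Each operator propagates probabilities:
- Selection σ: an input tuple v with probability p yields v with probability p.
- Join ×: input tuples v1 with probability p1 and v2 with probability p2 yield (v1, v2) with probability p1 * p2.
- Projection Π: input tuples with the same projected value v and probabilities p1, p2, … yield v with probability 1 - (1 - p1)(1 - p2)…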

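Extensional Query Evaluation: Example
Plan: π_D(S join_{B=C} T)

Table S:
   | A   | B | Prob.
s1 | ‘m’ | 1 | 0.8
s2 | ‘n’ | 1 | 0.5

Table T:
   | C | D   | Prob.
t1 | 1 | ‘p’ | 0.6

S join T on B = C:
A   | B | C | D   | Prob.
‘m’ | 1 | 1 | ‘p’ | 0.48
‘n’ | 1 | 1 | ‘p’ | 0.30

Project on D:
D   | Prob.
‘p’ | 1 - (1 - 0.48) * (1 - 0.30) = 0.636

Wrong: the correct probability is 0.54. The two tuples in the join result both depend on t1, so they are no longer independent.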

Extensional: Alternate Query Plan
Plan: π_D(π_B(S) join_{B=C} T)

Project S on B:
B | Prob.
1 | 1 - (1 - 0.8) * (1 - 0.5) = 0.9

Join with T on B = C, project on D:
B | C | D   | Prob.
1 | 1 | ‘p’ | 0.9 * 0.6 = 0.54

Correct.
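The two plans can be compared with a small sketch of the extensional operators: the join multiplies probabilities and the projection merges duplicates as 1 - (1 - p1)(1 - p2)… The row encoding and the names `ext_join` and `ext_project` are assumptions for illustration.

```python
from collections import defaultdict

def ext_join(left, right, left_col, right_col):
    """Extensional join: merge matching rows and multiply their probabilities."""
    return [
        ({**lrow, **rrow}, lp * rp)
        for lrow, lp in left
        for rrow, rp in right
        if lrow[left_col] == rrow[right_col]
    ]

def ext_project(rows, cols):
    """Extensional projection: duplicates combine as 1 - prod(1 - p_i)."""
    miss = defaultdict(lambda: 1.0)   # probability that no duplicate is present
    for row, p in rows:
        key = tuple((c, row[c]) for c in cols)
        miss[key] *= 1.0 - p
    return [(dict(key), 1.0 - m) for key, m in miss.items()]

S = [({"A": "m", "B": 1}, 0.8), ({"A": "n", "B": 1}, 0.5)]
T = [({"C": 1, "D": "p"}, 0.6)]

# Plan 1: project after the join (treats dependent tuples as independent).
unsafe = ext_project(ext_join(S, T, "B", "C"), ["D"])
# Plan 2: project S on B first, then join with T, then project on D.
safe = ext_project(ext_join(ext_project(S, ["B"]), T, "B", "C"), ["D"])

print(unsafe, safe)
```

Both plans use exactly the same operator definitions; only the order of the projection and the join differs.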

Observation
Under extensional semantics, the answer depends on the query plan.

Notations
- R: a relation name
- D: an instance of a database schema
- Γ: set of functional dependencies
- E: set of all complex events
- q: a query
- PRels(q): the probabilistic relation names in q
- Attr(q): all attributes in all relations in q
- Head(q): the set of attributes in the output of the query q

Induced Functional Dependencies Γ^p(q)
- Some more FDs are added to Γ to get Γ^p(q).
- For each join predicate S.A = T.B, add the FDs S.A → T.B and T.B → S.A.
- Add R.E → Attr(R) for each probabilistic relation R in q.

Safe Plan
- A plan is safe if it computes the correct probabilities for the answer tuples.
- Formally, given a schema R^p, Γ^p, a plan P for a query q is safe if P^e(D^p) = q^rank(D^p) for all instances D^p of that schema.

Theorem (3): Safety rules
Consider a database schema where all the probabilistic relations are tuple-independent. Let q, q’ be conjunctive queries that do not share any relation name. Then:
- σ is always safe.
- × is always safe in q × q’.
- Π is safe in Π(q) iff A1, …, Ak, R.E → Head(q) holds for every probabilistic relation R in q, where A1, …, Ak are the projection attributes.

Example
Same example. Γ^p is:
- S.A, S.B → S.E
- T.C, T.D → T.E
- S.E → S.A, S.B
- T.E → T.C, T.D
- S.B → T.C and T.C → S.B (induced by the join predicate B = C)

Query: S join T on B = C, project on D
Plan: π_D(S join_{B=C} T)
- The join is safe. We need to check the safety of the projection.
- From Theorem 3, we need to check A1, …, Ak, R.E → Head(q), where A1, …, Ak is T.D, R.E is S.E and T.E, and Head(q) is S.A, S.B, T.C, T.D:
  - T.D, S.E → S.A, S.B, T.C, T.D (pass)
  - T.D, T.E → S.A, S.B, T.C, T.D (fails: T.E gives T.C and T.D, and the join gives S.B, but nothing determines S.A)

Example: Alternative Plan
Query: S join T on B = C, project on D
Plan: π_D(π_B(S) join_{B=C} T)
- The projection on B is safe. We need to check the safety of the projection on D.
- From Theorem 3, we need to check A1, …, Ak, R.E → Head(q), where A1, …, Ak is T.D, R.E is S.E and T.E, and Head(q) is S.B, T.C, T.D:
  - T.D, S.E → S.B, T.C, T.D
  - T.D, T.E → S.B, T.C, T.D
- Both hold, so the plan is safe.
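The safety test in these two examples is an FD implication check, which can be decided with the standard attribute-closure procedure. The sketch below is illustrative: the FD list encodes the example's Γ^p plus the join-induced FDs, and the function names are not from the slides.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under FDs given as (lhs, rhs) pairs of sets."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

FDS = [
    ({"S.A", "S.B"}, {"S.E"}), ({"S.E"}, {"S.A", "S.B"}),
    ({"T.C", "T.D"}, {"T.E"}), ({"T.E"}, {"T.C", "T.D"}),
    ({"S.B"}, {"T.C"}), ({"T.C"}, {"S.B"}),   # induced by the join B = C
]

def projection_is_safe(proj_attrs, head, event_attrs, fds=FDS):
    """Check A1..Ak, R.E -> Head(q) for every event attribute R.E."""
    return all(head <= closure(proj_attrs | {e}, fds) for e in event_attrs)

# Plan pi_D(S join T): the input of the projection has head {A, B, C, D}.
first_plan = projection_is_safe({"T.D"}, {"S.A", "S.B", "T.C", "T.D"}, ["S.E", "T.E"])
# Plan pi_D(pi_B(S) join T): the input of the projection has head {B, C, D}.
alt_plan = projection_is_safe({"T.D"}, {"S.B", "T.C", "T.D"}, ["S.E", "T.E"])

print(first_plan, alt_plan)
```

In the first plan the check for T.E fails because S.A never enters the closure; once A has been projected away early, both checks succeed.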

Separation
Let q be a conjunctive query.
- Two relations R1, R2 are called connected if the query contains a join condition R1.A = R2.B and either R1.A or R2.B is not in Head(q).
- The relations R1, R2 are called separate if they are not connected.
- Two sets of relations Y1 and Y2 are said to form a separation for query q iff:
  - they partition the set Rels(q), and
  - every pair R1 in Y1, R2 in Y2 is separate.

Intuitively, R1 and R2 are separate if either:
- the query does not contain a join condition on R1, R2, or
- the query has a join condition R1.A = R2.B and the output of the query contains both R1.A and R2.B.

Separation: Example
- Query: q(D) :- S(A,B), T(C,D), B = C
- q_BC(B,C,D) :- S(A,B), T(C,D), B = C
- Are S and T separate in q_BC? Head(q_BC) = {B, C, D}.

S join T on B = C (attributes kept by q_BC):
B | C | D
1 | 1 | ‘p’

Both B and C are present in Head(q_BC). Thus S and T are separate for this query.
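The definition of connected and separate relations translates into a short check. This is a sketch under assumptions: joins are given as pairs of qualified attribute names, the caller supplies Y1 and Y2 as a partition of the relations, and the function names are hypothetical.

```python
def connected(r1, r2, joins, head):
    """True if some join condition between r1 and r2 has an attribute outside head."""
    for left, right in joins:
        rels = {left.split(".")[0], right.split(".")[0]}
        if rels == {r1, r2} and not {left, right} <= head:
            return True
    return False

def is_separation(y1, y2, joins, head):
    """True if no relation in y1 is connected to a relation in y2.

    Assumes y1 and y2 already partition the relations of the query.
    """
    return not any(connected(a, b, joins, head) for a in y1 for b in y2)

joins = [("S.B", "T.C")]

# q(D): B and C are projected away, so S and T are connected.
sep_q = is_separation({"S"}, {"T"}, joins, {"T.D"})
# q_BC(B, C, D): both join attributes are in the head, so S and T are separate.
sep_q_bc = is_separation({"S"}, {"T"}, joins, {"S.B", "T.C", "T.D"})

print(sep_q, sep_q_bc)
```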

Finding Safe Plan
- The authors propose the SAFE-PLAN algorithm to find safe plans for a query.
- It tries to perform all safe projections as late as possible in the query plan.
- When no more safe projections are possible, it tries to perform a join by splitting q into q1 join q2.
- Since the join is then the last operation, all attributes of the join condition must be in Head(q).
- If a safe plan exists, the algorithm finds it.

Safe-Plan finding algorithm

Finding Safe Plan: Example
Processing: SAFE-PLAN(π_D(S join_{B=C} T))
- Head(q) = {D}
- Z = {A}
- q_A = π_{A,D}(S join_{B=C} T), Head(q_A) = {A, D}
- Is π_{Head(q)}(q_A) a safe operator? Conditions:
  - T.D, S.E → S.A, T.D (safe)
  - T.D, T.E → S.A, T.D (unsafe)

Finding Safe Plan: Example
Processing: SAFE-PLAN(π_D(S join_{B=C} T))
- Head(q) = {D}
- Z = {B}
- q_B = π_{D,B}(S join_{B=C} T), Head(q_B) = {B, D}
- Is π_{Head(q)}(q_B) a safe operator? Conditions:
  - T.D, S.E → S.B, T.D (safe)
  - T.D, T.E → S.B, T.D (safe)
- Return π_D(SAFE-PLAN(q_B))

Finding Safe Plan: Example
Processing: π_D(SAFE-PLAN(q_B))
- Head(q_B) = {B, D}
- Z = {A}
- q_AB = π_{D,A,B}(S join_{B=C} T), Head(q_AB) = {A, B, D}
- Is π_{Head(q_B)}(q_AB) a safe operator? Conditions:
  - T.D, S.E → S.A, S.B, T.D (safe)
  - T.D, T.E → S.A, S.B, T.D (unsafe)

Finding Safe Plan: Example
Processing: π_D(SAFE-PLAN(q_B))
- Head(q_B) = {B, D}
- Z = {C}
- q_BC = π_{D,B,C}(S join_{B=C} T), Head(q_BC) = {B, C, D}
- Is π_{Head(q_B)}(q_BC) a safe operator? Conditions:
  - T.D, S.E → T.C, S.B, T.D (safe)
  - T.D, T.E → T.C, S.B, T.D (safe)
- Return π_{B,D}(SAFE-PLAN(q_BC))

Finding Safe Plan: Example
Processing: π_D(π_{B,D}(SAFE-PLAN(q_BC)))
- Head(q_BC) = {B, C, D}
- Z = {A}
- q_ABC = π_{D,A,B,C}(S join_{B=C} T), Head(q_ABC) = {A, B, C, D}
- Is π_{Head(q_BC)}(q_ABC) a safe operator? Conditions:
  - T.D, S.E → S.A, T.C, S.B, T.D (safe)
  - T.D, T.E → S.A, T.C, S.B, T.D (unsafe)

Finding Safe Plan: Example
Processing: π_D(π_{B,D}(SAFE-PLAN(q_BC)))
- No projection is possible.
- q_BC = π_{D,B,C}(S join_{B=C} T), Head(q_BC) = {B, C, D}
- Split q_BC into q1 join_{B=C} q2, such that:
  - q1(B) :- S(A,B)
  - q2(C,D) :- T(C,D)
- We know that S and T are separate in query q_BC.
- Return SAFE-PLAN(q1) join_{B=C} SAFE-PLAN(q2)

Finding Safe Plan: Example
Processing: π_D(π_{B,D}(SAFE-PLAN(q1) join_{B=C} SAFE-PLAN(q2)))
- Head(q1) = {B}
- Z = {A}
- q_A = S(A,B), Head(q_A) = {A, B}
- Is π_{Head(q1)}(q_A) a safe operator? Condition:
  - S.B, S.E → S.A, S.B (safe)
- Return π_B(SAFE-PLAN(S(A,B))), i.e. π_B(S(A,B))

Finding Safe Plan: Example
- SAFE-PLAN(q2) = T(C,D)
- Thus, the final result is: π_D(π_{B,D}(π_B(S) join_{B=C} T))
- The SAFE-PLAN algorithm is sound and complete.

Unsafe Plans
- What if there is no safe plan?
- The authors propose two solutions:
  - Least unsafe plans
  - Monte-Carlo approximations

Least Unsafe Plans
- Minimize the error in computing the probabilities.
- Modify the SAFE-PLAN algorithm: when splitting a query q into two sub-queries q1 and q2, allow joins between q1 and q2 on attributes not in Head(q), then project out these attributes.
- These projections will be unsafe. Minimize their degree of unsafety.
- Pick q1, q2 to be a minimum cut of the graph (rather than a separation).
- The problem of finding a minimum cut is in PTIME.

Monte-Carlo Approximations
- Let q’ be the query obtained from q by making it return all the variables in its body.
- Evaluate q’ instead of q, without any probability calculations.
- Group the tuples by the values of the attributes in Head(q).
- The complex event expression of a group is in DNF, i.e. C1 ∨ C2 ∨ … ∨ Cn, where each Ci is of the form e1 ∧ e2 ∧ …
- The probability of this expression must then be computed.
- Evaluating the probability of a Boolean expression is #P-complete.

Monte-Carlo Approximations
- Given a DNF formula with n clauses and any ε and δ, the probability can be approximated in time O((n/ε^2) ln(1/δ)).
- The probability of the error being greater than ε is less than δ.
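One well-known estimator of this kind is the Karp-Luby coverage estimator for DNF formulas. The slides do not spell out the sampling procedure, so the sketch below is an illustration under that assumption; the function name, the clause encoding as sets of tuple ids, and the fixed sample count are all hypothetical choices.

```python
import random

def karp_luby_dnf(clauses, probs, samples, rng):
    """Estimate Pr(C1 or ... or Cn) over independent events by coverage sampling."""
    # Weight of a clause = probability that all of its events are true.
    weights = []
    for clause in clauses:
        w = 1.0
        for v in clause:
            w *= probs[v]
        weights.append(w)
    total = sum(weights)
    if total == 0.0:
        return 0.0
    variables = sorted(set().union(*clauses))
    hits = 0
    for _ in range(samples):
        # Pick a clause proportionally to its weight.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample an assignment conditioned on clause i being true.
        assignment = {
            v: (v in clauses[i]) or rng.random() < probs[v] for v in variables
        }
        # Count the sample only if i is the first satisfied clause.
        first = next(
            j for j, c in enumerate(clauses) if all(assignment[v] for v in c)
        )
        hits += first == i
    return total * hits / samples

probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}
event_p = [{"s1", "t1"}, {"s2", "t1"}]
estimate = karp_luby_dnf(event_p, probs, 20000, random.Random(0))
print(estimate)
```

For the running example the estimate should land close to the exact value 0.54. A real implementation would choose the number of samples from ε and δ rather than fixing it.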

Additional Operators
- Union, difference and group-by operators.
- Covers almost all queries with nested sub-queries, aggregates, group-by and existential/universal quantifiers.

Uncertain Predicates
- The ≈ predicate on a deterministic database.
- Syntactic closeness: string matching, e.g. certain ~ uncertain. Edit distances, q-grams etc.
- Semantic closeness: e.g. musical ~ opera. TF/IDF, ontologies from WordNet.
- Numeric closeness: similar numeric values, e.g. 25 ~ 26.
- Once distances are defined, they need to be meaningfully converted into probabilities:
  - Gaussian, Student-t, normal-gamma
  - The parameters can be learned (ideal case) or specified by the user.
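As a rough illustration of the last step, a distance can be mapped to a match score with a Gaussian-shaped kernel. The slides only name the distribution families, so the exact formula, the function name and the default `sigma` below are assumptions, not the authors' method.

```python
import math

def gaussian_match_probability(distance, sigma=1.0):
    """Map a non-negative distance to a score in (0, 1]; distance 0 gives 1.0.

    Assumed form: exp(-d^2 / (2 * sigma^2)). sigma controls how fast the
    score decays and would be learned or supplied by the user.
    """
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))

# Numeric closeness example from the slide: 25 ~ 26 has distance 1.
score_close = gaussian_match_probability(abs(25 - 26), sigma=2.0)
score_far = gaussian_match_probability(abs(25 - 40), sigma=2.0)
print(score_close, score_far)
```

Closer values get higher scores, and a larger `sigma` makes the predicate more tolerant.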

Conclusions
- Extensional semantics can be used to evaluate a certain class of queries in PTIME.
- #P-complete problems can be handled using approximation techniques.
- In practice, many queries have safe plans: of 10 TPC-H queries, 8 have safe plans.

Reference
Nilesh Dalvi, Dan Suciu. Efficient Query Evaluation on Probabilistic Databases. VLDB Journal (VLDBJ), vol. 16, no. 4, pp. 523-544, 2007.
