# Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1.



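**Probabilistic Databases**

- Purpose: to model and query uncertain data (sensor networks, information extraction, …).
- Possible worlds model:
  - Each possible world $W$ is a standard database instance and has a probability $P[W]$.
  - Compact representation $D$, assuming independence.

Example compact representation $D$ with relations $R$, $S$, $T$:

| Relation | Tuple | Probability |
|---|---|---|
| R | (a1) | 0.3 |
| R | (a2) | 0.4 |
| R | (a3) | 0.6 |
| S | (a1, b1) | 0.1 |
| S | (a2, b1) | 0.5 |
| S | (a3, b2) | 0.2 |
| S | (a3, b3) | 0.1 |
| T | (b1) | 0.7 |
| T | (b2) | 0.8 |
| T | (b3) | 0.4 |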

**Query Semantics**

Query semantics on probabilistic databases:

- Apply the query $q$ to each possible world $W$.
- Add up the probabilities of the worlds that give the same query answer $A$:

$$P[q(D) = A] = \sum_{W \,:\, q(W) = A} P[W]$$

Goal: efficiently evaluate $P[q(D) = A]$.

- Data complexity: we want time polynomial in $n = |D|$.

Can we always compute $P[q(D)]$ efficiently?

- No: in general it is #P-hard.
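The possible-worlds semantics can be sketched directly as a brute-force computation. The snippet below is only an illustration, not the method of the talk: it enumerates all $2^n$ worlds, so it is exponential in $n$. The instance values and the Boolean query `q() :- R(x), S(x, y), T(y)` are the running example of the talk; the function names (`possible_worlds`, `query_probability`) are hypothetical.

```python
from itertools import product

# Tuple-independent database: each tuple maps to its marginal probability.
R = {("a1",): 0.3, ("a2",): 0.4, ("a3",): 0.6}
S = {("a1", "b1"): 0.1, ("a2", "b1"): 0.5, ("a3", "b2"): 0.2, ("a3", "b3"): 0.1}
T = {("b1",): 0.7, ("b2",): 0.8, ("b3",): 0.4}


def q(r, s, t):
    """Boolean query q() :- R(x), S(x, y), T(y) on one deterministic world."""
    return any((x,) in r and (y,) in t for (x, y) in s)


def possible_worlds(*relations):
    """Yield (world, probability); a world holds one tuple set per relation."""
    tagged = [(i, tup, p) for i, rel in enumerate(relations) for tup, p in rel.items()]
    for present in product([False, True], repeat=len(tagged)):
        world = [set() for _ in relations]
        prob = 1.0
        for keep, (i, tup, p) in zip(present, tagged):
            # Independence: multiply p for a present tuple, 1 - p for an absent one.
            prob *= p if keep else 1.0 - p
            if keep:
                world[i].add(tup)
        yield world, prob


def query_probability(query, *relations):
    """P[q(D) = true]: sum of P[W] over all worlds W in which the query holds."""
    return sum(prob for world, prob in possible_worlds(*relations) if query(*world))


p_true = query_probability(q, R, S, T)
print(round(p_true, 6))
```

With 10 tuples this enumerates 1024 worlds, which is fine for a toy instance and shows why a polynomial-time method is the goal.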

**Query Answering in Two Steps**

Boolean query: `q() :- R(x), S(x, y), T(y)`

Introduce event variables for the tuples of $D$ ($P[w_1] = 0.3$, …):

| Relation | Tuple | Event variable | Probability |
|---|---|---|---|
| R | (a1) | $w_1$ | 0.3 |
| R | (a2) | $w_2$ | 0.4 |
| R | (a3) | $w_3$ | 0.6 |
| S | (a1, b1) | $v_1$ | 0.1 |
| S | (a2, b1) | $v_2$ | 0.5 |
| S | (a3, b2) | $v_3$ | 0.2 |
| S | (a3, b3) | $v_4$ | 0.1 |
| T | (b1) | $u_1$ | 0.7 |
| T | (b2) | $u_2$ | 0.8 |
| T | (b3) | $u_3$ | 0.4 |

- Step 1 (easy): compute the Boolean provenance of $q(D)$ [FR ’97, Z ’97]:

$$f = w_1 v_1 u_1 + w_2 v_2 u_1 + w_3 v_3 u_2 + w_3 v_4 u_3$$

- Step 2 (hard): compute $P[q(D)] = P[f]$ given $P[w_1] = 0.3$, $P[v_1] = 0.1$, …
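The two steps can be sketched as follows for the example instance. Step 1 builds one minterm per join result; Step 2 is done here by plain Shannon expansion, which is an assumption made for illustration only and takes exponential time in the worst case (consistent with Step 2 being the hard part). The function names are hypothetical.

```python
# Event variables annotate tuples (names and probabilities follow the example).
R = {"a1": "w1", "a2": "w2", "a3": "w3"}
S = {("a1", "b1"): "v1", ("a2", "b1"): "v2", ("a3", "b2"): "v3", ("a3", "b3"): "v4"}
T = {"b1": "u1", "b2": "u2", "b3": "u3"}
prob = {"w1": 0.3, "w2": 0.4, "w3": 0.6,
        "v1": 0.1, "v2": 0.5, "v3": 0.2, "v4": 0.1,
        "u1": 0.7, "u2": 0.8, "u3": 0.4}


def provenance():
    """Step 1: one minterm per join result of q() :- R(x), S(x, y), T(y)."""
    return [frozenset({R[x], v, T[y]}) for (x, y), v in S.items() if x in R and y in T]


def dnf_probability(dnf, prob):
    """Step 2 by Shannon expansion, one variable at a time (worst case exponential)."""
    if not dnf:
        return 0.0                      # the empty disjunction is false
    if any(len(term) == 0 for term in dnf):
        return 1.0                      # an empty minterm is true
    x = next(iter(dnf[0]))              # branch on a variable of the first minterm
    when_true = [term - {x} for term in dnf]
    when_false = [term for term in dnf if x not in term]
    return (prob[x] * dnf_probability(when_true, prob)
            + (1.0 - prob[x]) * dnf_probability(when_false, prob))


f = provenance()
p_f = dnf_probability(f, prob)
print(sorted(sorted(term) for term in f))
print(round(p_f, 6))
```

The minterms produced by `provenance()` correspond to the four products in $f$.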

**Probability Computation for Positive Queries**

- Dichotomy result [DS ’04, ’07; DSS ’10]: given $q$ as input, we can efficiently decide whether $q$ is
  - safe: safe plans run in poly-time on all instances, or
  - unsafe: #P-hard, e.g. `q() :- R(x), S(x, y), T(y)`.
- Instance-by-instance approach [SDG ’10, RPT ’11]:
  - Both $q$ and $D$ are given as input.
  - A poly-time algorithm computes $P[q(D)]$ for special cases even if $q$ is unsafe.

What about queries with difference?

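**Boolean Provenances for Difference**

Input relations with event variables:

| Relation | Tuple | Event variable |
|---|---|---|
| R | (b1, c1) | $u_1$ |
| R | (b2, c2) | $u_2$ |
| R | (b1, c3) | $u_3$ |
| S | (c1, a1) | $v_1$ |
| S | (c1, a2) | $v_2$ |
| S | (c2, a3) | $v_3$ |
| S | (c3, a2) | $v_4$ |
| T | (a1) | $w_1$ |
| T | (a2) | $w_2$ |
| T | (a3) | $w_3$ |

Queries:

- `q1(x) :- R(x, y), S(y, z)`
- `q2(x) :- R(x, y), S(y, z), T(z)`
- $q = q_1 - q_2$

| Query | Answer | Boolean provenance |
|---|---|---|
| $q_1$ | b1 | $u_1 (v_1 + v_2) + u_3 v_4$ |
| $q_1$ | b2 | $u_2 v_3$ |
| $q_2$ | b1 | $u_1 v_1 w_1 + u_1 v_2 w_2 + u_3 v_4 w_2$ |
| $q_2$ | b2 | $u_2 v_3 w_3$ |
| $q_1 - q_2$ | b1 | $(u_1 (v_1 + v_2) + u_3 v_4) \cdot \overline{(u_1 v_1 w_1 + u_1 v_2 w_2 + u_3 v_4 w_2)}$ |
| $q_1 - q_2$ | b2 | $(u_2 v_3) \cdot \overline{(u_2 v_3 w_3)}$ |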
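As a small worked example (not on the slide), the provenance of the answer b2 of $q_1 - q_2$ is the product of $u_2 v_3$ with the negation of $u_2 v_3 w_3$, and it can be simplified by hand. The only assumption is that the event variables are independent, as in the compact representation used throughout the talk:

$$
\begin{aligned}
(u_2 v_3) \cdot \overline{(u_2 v_3 w_3)}
  &= u_2 v_3 \cdot (\overline{u_2} + \overline{v_3} + \overline{w_3}) \\
  &= u_2 v_3 \overline{w_3}, \\
P\big[(u_2 v_3) \cdot \overline{(u_2 v_3 w_3)}\big]
  &= P[u_2] \, P[v_3] \, (1 - P[w_3]).
\end{aligned}
$$

The provenance of b1 does not collapse this way, because its negated part is a DNF with several minterms that share variables with the positive part.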

**Previous Work on Difference**

- FOR ’11: a framework for exact and approximate probability computation, but with no guarantee of polynomial running time.
- In fact, we show in this paper that with difference, in some cases no approximation exists (unless NP = RP).

How far can we go with difference in poly-time?

**A Quick Comparison**

| | With difference | Without difference |
|---|---|---|
| DNF of Boolean provenance | may be exponential in $n$ | poly-size ($n^{\lvert q \rvert}$) |
| $P[q(D)]$ | may not be approximable | always approximable (FPRAS) |

FPRAS (Fully Polynomial Randomized Approximation Scheme): with probability ≥ ¾, and in time polynomial in $n$ and $1/\varepsilon$, compute $p$ such that

$$p \in \big[(1-\varepsilon)\,P[q(D)],\ (1+\varepsilon)\,P[q(D)]\big]$$

**Our Results**

We study queries of the form $q_1 - q_2$ and their generalization:

- FPRAS: if $q_1$ is any UCQ and $q_2$ is any safe CQ⁻.
- #P-hardness: even if both $q_1$ and $q_2$ are safe CQ⁻.
- Inapproximability: even if $q_1$ is the trivial TRUE query and $q_2$ is a UCQ.

Our FPRAS result extends to a larger class of queries, of which $q_1 - q_2$ is a special case.

[CQ⁻: conjunctive queries without self-joins]

**Difference Rank**

Define the difference rank $\delta(q)$ of a query $q$ recursively:

- $\delta(R) = 0$
- $\delta(q_1 - q_2) = \delta(q_1) + \delta(q_2) + 1$
  - $R - S$: rank 1
- $\delta(q_1 \bowtie q_2) = \delta(q_1) + \delta(q_2)$
  - $(R - S_1) \bowtie (R - S_2)$: rank 2
  - $(R - T_1) \bowtie T_2$: rank 1
- $\delta(q_1 \cup q_2) = \max(\delta(q_1), \delta(q_2))$
  - $\big((R - S_1) \bowtie (R - S_2)\big) \cup \big((R - T_1) \bowtie T_2\big)$: rank 2
- Select, project: the rank remains the same.
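The recursive definition translates directly into code. The sketch below assumes a tiny, hypothetical query AST (`Rel`, `Diff`, `Join`, `Union`, `Select`, `Project`); these class names are illustrative and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:
    name: str


@dataclass(frozen=True)
class Diff:
    left: object
    right: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object


@dataclass(frozen=True)
class Union:
    left: object
    right: object


@dataclass(frozen=True)
class Select:
    child: object
    condition: str


@dataclass(frozen=True)
class Project:
    child: object
    attrs: tuple


def difference_rank(q):
    """Difference rank, following the recursive definition case by case."""
    if isinstance(q, Rel):
        return 0
    if isinstance(q, Diff):
        return difference_rank(q.left) + difference_rank(q.right) + 1
    if isinstance(q, Join):
        return difference_rank(q.left) + difference_rank(q.right)
    if isinstance(q, Union):
        return max(difference_rank(q.left), difference_rank(q.right))
    if isinstance(q, (Select, Project)):
        return difference_rank(q.child)  # select/project leave the rank unchanged
    raise TypeError(f"unknown query node: {q!r}")


R, S1, S2, T1, T2 = (Rel(n) for n in ("R", "S1", "S2", "T1", "T2"))
left = Join(Diff(R, S1), Diff(R, S2))   # (R - S1) join (R - S2)
right = Join(Diff(R, T1), T2)           # (R - T1) join T2
print(difference_rank(left), difference_rank(right), difference_rank(Union(left, right)))
```

The three printed values correspond to the three compound examples listed with the definition.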

**FPRAS for queries $q$ with $\delta(q) = 1$**, provided some conditions hold (inapproximable for $\delta(q) = 1$ in general)

**Steps in FPRAS**

- Step 1: Compute the Boolean provenance of $q(D)$ for any query $q$ with $\delta(q) = 1$.
- Step 2: Write the Boolean provenance in a “Probability Friendly Form” (if possible).
- Step 3: Run an FPRAS inspired by the Karp-Luby framework.

**Boolean Provenance for Queries $q$ s.t. $\delta(q) = 1$**

Lemma: For any $q$ with $\delta(q) = 1$, on any $D$, the provenance $f$ of $q(D)$ has the form [formula not preserved in the transcript]; $f$ is poly-size in $n = |D|$ and poly-time computable.

**Probability Friendly Form (PFF)**

- If $f$ is in PFF, we can approximate $P[f]$ using the Karp-Luby framework.
- $f$ is in PFF if the negated DNFs can be written as poly-size d-DNNFs (next slide).

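**d-DNNF** [Darwiche ’01, ’02; DM ’02]

deterministic Decomposable Negation Normal Form:

- Negation Normal Form: no internal node can be a negation (negation appears only at the leaves).
- Deterministic: under any assignment, at most one child of a +-node is satisfied.
- Decomposable: the children of a ·-node do not share variables.

In general, a d-DNNF can be a DAG. Its probability can be computed in time linear in its size.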
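A minimal sketch of the linear-time probability computation, assuming a hypothetical encoding of d-DNNF nodes as nested tuples: `("lit", var, positive)`, `("and", children)`, `("or", children)`. Determinism lets a +-node add the probabilities of its children, and decomposability lets a ·-node multiply them; the cache keyed on node identity keeps the work linear when the d-DNNF is a DAG with shared sub-expressions.

```python
def probability(node, prob, cache=None):
    """Probability of a d-DNNF under independent variable probabilities."""
    cache = {} if cache is None else cache
    key = id(node)
    if key in cache:                      # shared sub-DAG: evaluate only once
        return cache[key]
    kind = node[0]
    if kind == "lit":
        _, var, positive = node
        result = prob[var] if positive else 1.0 - prob[var]
    elif kind == "and":                   # decomposable: children are independent
        result = 1.0
        for child in node[1]:
            result *= probability(child, prob, cache)
    elif kind == "or":                    # deterministic: children are disjoint
        result = sum(probability(child, prob, cache) for child in node[1])
    else:
        raise ValueError(f"unknown node kind: {kind!r}")
    cache[key] = result
    return result


def lit(var, positive=True):
    return ("lit", var, positive)


# The negated DNF not(x * y), written as the d-DNNF  (not x) + (x * not y).
not_xy = ("or", [lit("x", False), ("and", [lit("x"), lit("y", False)])])
# The disjunction x + y, written as the d-DNNF  x + ((not x) * y).
x_or_y = ("or", [lit("x"), ("and", [lit("x", False), lit("y")])])

prob = {"x": 0.3, "y": 0.5}
print(probability(not_xy, prob), probability(x_or_y, prob))
```

Note that the plain DNF `x + y` is not deterministic (both children can be satisfied at once), which is why it is rewritten before the sum rule applies.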

**Karp-Luby Framework** [KL ’83]

Given Boolean expression DAGs $F_1, \dots, F_m$ with $f = F_1 + F_2 + \dots + F_m$, $P[f]$ can be approximated (FPRAS) in time polynomial in $m$ and $n$ if, for all $i$, each of the following can be done in poly-time:

1. $P[F_i]$ can be computed;
2. it can be checked whether a given assignment satisfies $F_i$;
3. a random satisfying assignment of $F_i$ can be sampled.

Well-studied special case: DNF counting, where $F_1, \dots, F_m$ are DNF minterms, e.g. $f = xyz + xyw + wuv$.
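For the DNF special case, the estimator can be sketched as below. This is a simplified illustration under stated assumptions: minterms are positive and given as sets of variable names, variables are independent, and the number of samples is chosen by hand rather than derived from $\varepsilon$. The function name `karp_luby` and its parameters are hypothetical.

```python
import random


def karp_luby(terms, prob, samples, rng):
    """Estimate P[F_1 + ... + F_m] where each F_i is a positive minterm."""
    variables = sorted(set().union(*terms))
    # Condition (1): P[F_i] is a product of marginals (independent variables).
    weights = []
    for term in terms:
        w = 1.0
        for x in term:
            w *= prob[x]
        weights.append(w)
    total = sum(weights)
    if total == 0.0:
        return 0.0
    hits = 0
    for _ in range(samples):
        # Pick a minterm with probability proportional to P[F_i].
        i = rng.choices(range(len(terms)), weights=weights)[0]
        # Condition (3): sample an assignment conditioned on F_i being true.
        assignment = {x: (x in terms[i]) or (rng.random() < prob[x]) for x in variables}
        # Condition (2): find the first minterm the assignment satisfies.
        first = next(j for j, t in enumerate(terms) if all(assignment[x] for x in t))
        # Count the sample only if F_i is that first minterm (avoids double counting).
        hits += first == i
    return total * hits / samples


terms = [{"x", "y", "z"}, {"x", "y", "w"}, {"w", "u", "v"}]   # f = xyz + xyw + wuv
prob = {name: 0.5 for name in "xyzwuv"}
estimate = karp_luby(terms, prob, samples=20000, rng=random.Random(0))
print(estimate)
```

For this $f$ with every variable true with probability 0.5, inclusion-exclusion gives the exact value $18/64 = 0.28125$, which the estimate should approach as the number of samples grows.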

**Conditions (1) and (2) hold for PFF**

The product of a minterm and a d-DNNF is another d-DNNF.

(Figure: example d-DNNF with the minterm's variables fixed, $w_2 = 1$, $z_1 = 1$.)

**Condition (3) also holds**

Lemma: Generating a random satisfying assignment of a d-DNNF can be done in poly-time.

1. Process the nodes in reverse topological order.
2. Generate a random satisfying assignment bottom-up.

(Figure: example d-DNNF over $v_1$, $v_2$ with partial assignments such as $v_1 = 0$ and $v_1 = 1, v_2 = 0$; the choice among the children of a +-node is made at random.)
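One way to realize the sampling step is sketched below. It is a top-down variant, not the bottom-up procedure listed above: at a +-node it picks one child with probability proportional to that child's probability (valid because the children are disjoint), at a ·-node it samples every child (valid because the children share no variables), and variables not fixed along the way are drawn from their prior. The tuple encoding of nodes and all function names are assumptions, and the formula is assumed to have non-zero probability.

```python
import random


def probability(node, prob):
    """Probability of a d-DNNF node (no caching; cache per node for linear time)."""
    kind = node[0]
    if kind == "lit":
        return prob[node[1]] if node[2] else 1.0 - prob[node[1]]
    values = [probability(child, prob) for child in node[1]]
    if kind == "or":
        return sum(values)
    result = 1.0                          # kind == "and"
    for value in values:
        result *= value
    return result


def sample_partial(node, prob, rng, out):
    """Fix the variables needed to satisfy `node`, writing them into `out`."""
    kind = node[0]
    if kind == "lit":
        out[node[1]] = node[2]
    elif kind == "and":
        for child in node[1]:             # decomposable: children fix disjoint variables
            sample_partial(child, prob, rng, out)
    else:                                 # "or": exactly one child, chosen at random
        weights = [probability(child, prob) for child in node[1]]
        chosen = rng.choices(node[1], weights=weights)[0]
        sample_partial(chosen, prob, rng, out)


def sample_satisfying(node, prob, rng):
    """Random satisfying assignment, distributed as the prior conditioned on `node`."""
    out = {}
    sample_partial(node, prob, rng, out)
    for var, p in prob.items():
        if var not in out:                # untouched variables keep their prior
            out[var] = rng.random() < p
    return out


# not(v1 * v2) as a d-DNNF: (v1 = 0) + (v1 = 1, v2 = 0)
formula = ("or", [("lit", "v1", False),
                  ("and", [("lit", "v1", True), ("lit", "v2", False)])])
prob = {"v1": 0.4, "v2": 0.5}
rng = random.Random(0)
draws = [sample_satisfying(formula, prob, rng) for _ in range(200)]
print(draws[0])
```

Every draw sets both variables and never sets $v_1 = v_2 = 1$, the only assignment that falsifies the formula.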

**Expressibility in PFF**

- So, if $f$ is in PFF, we can approximate $P[q(D)]$.
- But can we decide in poly-time whether some sub-expressions of a Boolean expression have poly-size d-DNNFs? → Not known.
- However, there are natural sufficient conditions that can be verified in poly-time:
  - certain sub-queries are safe and hence generate read-once expressions [OH ’08];
  - sub-queries generate poly-size OBDDs [JS ’11];
  - this extends to the instance-by-instance approach (both $q$ and $D$ given).

**#P-hardness for $q_1 - q_2$** when both $q_1$ and $q_2$ are safe CQ⁻

**#P-hardness: Steps in the proof**

- “Hard” query $q = q_1 - q_2$:
  - `q1() :- R1(x, y1), R2(x, y2), R3(x, y3), R4(x, y4)`
  - `q2() :- R1(x1, y), R2(x2, y), R3(x3, y), R4(x4, y)`
- Counting independent sets in 3-regular bipartite graphs [XZ ’06].
- Counting edge covers in bipartite graphs of degree ≤ 4, where the edge set can be partitioned into 4 disjoint matchings.

**Other Related Work**

- Semantics of probabilistic query answering: Fuhr-Rölleke ’97, Zimányi ’97
- Dichotomy of CQ⁻, CQ and UCQ queries: Dalvi-Suciu ’04, ’07, Dalvi-Schnaitter-Suciu ’10
- Knowledge compilation techniques: Olteanu-Huang ’08, Jha-Olteanu-Suciu ’10, Jha-Suciu ’11, Fink-Olteanu ’11
- Instance-by-instance approach: Sen-Deshpande-Getoor ’10, Roy-Perduca-Tannen ’11

**Conclusions and Future Work**

A step towards understanding the complexity of exact and approximate probability computation for queries with difference operations.

Future work:

- Dichotomy results that syntactically classify difference queries (similar to positive UCQs)?
- Extending the FPRAS to queries with difference rank > 1?
- Experimental evaluation of our algorithms.

Thank you. Questions?

