# Faster Query Answering in Probabilistic Databases using Read-Once Functions

Sudeepa Roy. Joint work with Vittorio Perduca and Val Tannen. University of Pennsylvania.


## Probabilistic Databases

Possible worlds model
- Each possible world w is a standard database instance and has a probability P[w]
- Compact representation D based on independence assumptions

Query semantics in probabilistic databases
- (wlog.) Boolean query q
- Traditional database: q(D) ∈ {true, false}
- Probabilistic database: $P[q(D)] = \sum_{w \,:\, q(w) = \text{true}} P[w]$

Goal: efficiently evaluate P[q(D)]
- Data complexity; want time polynomial in n = |D|

## Computation of P[q(D)]

Can we efficiently compute P[q(D)]?
- No: in general it is #P-hard

DalviSuciu '04, ff.: positive queries can be partitioned into
- Safe queries: safe plans run in poly-time on all instances
- Unsafe queries: data complexity is #P-hard
  - Includes very simple queries like R(x) S(x, y) T(y)
- Given q as input, we can efficiently decide whether q is safe

BUT:
- For unsafe queries, probabilities on some instances can be efficiently computed
- Our approach: take both q and D as input

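## Restrictions

Tuple-independent representation D
- Tuple t annotated by P[t]

D =

R

| x | Probability |
|----|-----|
| a1 | 0.3 |
| a2 | 0.4 |
| a3 | 0.6 |

S

| x | y | Probability |
|----|----|-----|
| a1 | b1 | 0.1 |
| a2 | b1 | 0.5 |
| a3 | b2 | 0.2 |
| a3 | b3 | 0.1 |

T

| y | Probability |
|----|-----|
| b1 | 0.7 |
| b2 | 0.8 |
| b3 | 0.4 |

w = a possible world, here containing only R(a1), S(a1, b1) and T(b1):

$$P[w] = \underbrace{0.3\,(1-0.4)(1-0.6)}_{R} \cdot \underbrace{0.1\,(1-0.5)(1-0.2)(1-0.1)}_{S} \cdot \underbrace{0.7\,(1-0.8)(1-0.4)}_{T}$$

Conjunctive query without self-join (CQ⁻)
- q() := R(x), S(x, y), T(y)
- (This is the H_0 query from Suciu's keynote)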
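The possible-worlds semantics can be checked by brute force on this small instance. The sketch below is illustrative only and exponential in |D|. It assumes the S tuples are (a1, b1), (a2, b1), (a3, b2), (a3, b3), as implied by the lineage used later in the talk; the function names are hypothetical.

```python
from itertools import product

# Tuple-independent tables: key -> marginal probability.
# The S keys are an assumption reconstructed from the lineage in the talk.
R = {"a1": 0.3, "a2": 0.4, "a3": 0.6}
S = {("a1", "b1"): 0.1, ("a2", "b1"): 0.5, ("a3", "b2"): 0.2, ("a3", "b3"): 0.1}
T = {"b1": 0.7, "b2": 0.8, "b3": 0.4}

# Flat list of (table name, key, probability); a world is one bit per entry.
tuples = (
    [("R", k, p) for k, p in R.items()]
    + [("S", k, p) for k, p in S.items()]
    + [("T", k, p) for k, p in T.items()]
)


def world_probability(present):
    """P[w]: multiply p for present tuples and (1 - p) for absent ones."""
    prob = 1.0
    for (_, _, p), bit in zip(tuples, present):
        prob *= p if bit else 1.0 - p
    return prob


def query_holds(present):
    """Evaluate q() := R(x), S(x, y), T(y) on the world given by the bit vector."""
    here = {(table, key) for (table, key, _), bit in zip(tuples, present) if bit}
    return any(
        ("R", x) in here and ("S", (x, y)) in here and ("T", y) in here
        for (x, y) in S
    )


def brute_force_probability():
    """P[q(D)] = sum of P[w] over all worlds w in which q is true."""
    worlds = product([False, True], repeat=len(tuples))
    return sum(world_probability(w) for w in worlds if query_holds(w))
```

With 10 tuples this enumerates 1024 worlds, which is fine for a sanity check but is exactly the blow-up the rest of the talk tries to avoid.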

## Query Answering in Two Steps: Example

q() := R(x), S(x, y), T(y)

D, with an event variable for each tuple:

R

| x | Event variable | Probability |
|----|----|-----|
| a1 | w1 | 0.3 |
| a2 | w2 | 0.4 |
| a3 | w3 | 0.6 |

S

| x | y | Event variable | Probability |
|----|----|----|-----|
| a1 | b1 | v1 | 0.1 |
| a2 | b1 | v2 | 0.5 |
| a3 | b2 | v3 | 0.2 |
| a3 | b3 | v4 | 0.1 |

T

| y | Event variable | Probability |
|----|----|-----|
| b1 | u1 | 0.7 |
| b2 | u2 | 0.8 |
| b3 | u3 | 0.4 |

- Step 1 (easy): event expression for q(D), or "lineage"
  - E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
  - The "form" of the expression depends on the query plan; here π_()((R ⋈ S) ⋈ T)
- Step 2 (hard): compute P[q(D)] = P[E]
  - given Pr[w1] = 0.3, Pr[v1] = 0.1, ...

This work: take advantage of read-once expressions
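Step 1 can be sketched for this specific query as a join that records which event variables were combined. This is a minimal illustration that hard-codes the plan (R ⋈ S) ⋈ T; the dictionary layout and function names are assumptions, not the representation used in the talk.

```python
# Event variable for each tuple of the running example.
R = {"a1": "w1", "a2": "w2", "a3": "w3"}
S = {("a1", "b1"): "v1", ("a2", "b1"): "v2", ("a3", "b2"): "v3", ("a3", "b3"): "v4"}
T = {"b1": "u1", "b2": "u2", "b3": "u3"}


def lineage(r, s, t):
    """DNF lineage of q() := R(x), S(x, y), T(y).

    Each joining triple contributes one conjunct (product) of three
    event variables; the final projection takes their disjunction (sum).
    """
    terms = []
    for (x, y), v in s.items():
        if x in r and y in t:
            terms.append((r[x], v, t[y]))
    return terms


def to_string(terms):
    """Render the DNF in the talk's notation: juxtaposition = AND, + = OR."""
    return " + ".join(" ".join(term) for term in terms)


print(to_string(lineage(R, S, T)))
```

Each term has exactly one variable per table because the query has no self-join.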

## Read-Once Boolean Expressions

Expression in read-once form: every variable occurs exactly once
- e.g. ((x+y)z + w)(u+v)
- Linear-time probability computation
  - P(x y) = P(x) P(y)
  - P(x + y) = 1 - (1 - P(x)) (1 - P(y))

Read-once expression: has an equivalent read-once form
- e.g. xzu + xzv + yzu + yzv + wu + wv [in DNF, as large as O(n^|q|)]
- xzu + xzv + (yz + w)(u+v) [not in DNF, can be much smaller]

Non-read-once expressions: no read-once form
- e.g. xy + yz + zx, xy + yz + zw
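The linear-time computation follows directly from the two rules above: since no variable repeats, sibling subexpressions are independent. A minimal sketch, assuming a hypothetical nested-tuple encoding of the expression tree:

```python
from math import prod


def probability(expr, p):
    """Probability of an expression in read-once form.

    expr is either a variable name (str) or a tuple ("and" | "or", child, ...).
    p maps variable names to independent marginal probabilities.
    Every node is visited once, so the running time is linear in the
    size of the expression.
    """
    if isinstance(expr, str):
        return p[expr]
    op, *children = expr
    child_probs = [probability(child, p) for child in children]
    if op == "and":
        # P(x y) = P(x) P(y) for independent x, y
        return prod(child_probs)
    if op == "or":
        # P(x + y) = 1 - (1 - P(x)) (1 - P(y)) for independent x, y
        return 1.0 - prod(1.0 - c for c in child_probs)
    raise ValueError(f"unknown operator: {op!r}")


# ((x + y) z + w)(u + v), with every variable at probability 0.5
expr = ("and", ("or", ("and", ("or", "x", "y"), "z"), "w"), ("or", "u", "v"))
print(probability(expr, {name: 0.5 for name in "xyzwuv"}))
```

The rules are only valid in read-once form; applying them to a DNF in which variables repeat would give wrong answers.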

## Read-Once Event Expressions

Safe plans for safe queries directly produce expressions in read-once form (OlteanuHuang '08)

Unsafe queries can also produce read-once expressions
- Our example is read-once
  - E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3 = (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)
- Corresponds to the unsafe query q() := R(x), S(x, y), T(y)
- No query plan can produce the read-once form directly
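As a worked check, the read-once form can be evaluated bottom-up with the two rules for independent subexpressions. The numbers assume the tuple probabilities of the running example (w1, w2, w3 = 0.3, 0.4, 0.6; v1, ..., v4 = 0.1, 0.5, 0.2, 0.1; u1, u2, u3 = 0.7, 0.8, 0.4):

```latex
\begin{aligned}
P[w_1 v_1 + w_2 v_2] &= 1 - (1 - 0.3 \cdot 0.1)(1 - 0.4 \cdot 0.5) = 1 - 0.97 \cdot 0.8 = 0.224 \\
P[(w_1 v_1 + w_2 v_2)\,u_1] &= 0.224 \cdot 0.7 = 0.1568 \\
P[v_3 u_2 + v_4 u_3] &= 1 - (1 - 0.2 \cdot 0.8)(1 - 0.1 \cdot 0.4) = 1 - 0.84 \cdot 0.96 = 0.1936 \\
P[w_3\,(v_3 u_2 + v_4 u_3)] &= 0.6 \cdot 0.1936 = 0.11616 \\
P[E] &= 1 - (1 - 0.1568)(1 - 0.11616) \approx 0.2547
\end{aligned}
```

Each variable is used exactly once, so the whole computation takes a number of arithmetic steps linear in the size of the expression.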

## Problem Definition

Given
- a boolean CQ⁻ query q,
- a tuple-independent database D,

can we efficiently decide whether the event expression corresponding to q(D) is read-once?
- If yes, can we compute the read-once form efficiently?
  - (then P[q(D)] can be computed efficiently)

## Read-once-ness: Only a Sufficient Condition

Read-once-ness is only a sufficient condition for computing P[q(D)] efficiently.
- e.g., E = x1 x2 + x2 x3 + x3 x4 + ...
  - Not read-once
  - P[E] can be computed in poly-time using dynamic programming
- Moreover, see the detailed analysis in JhaSuciu '11 using OBDD, FBDD, d-DNNF

E is read-once ⇒ the read-once form of E can be computed efficiently ⇒ P[E] can be computed efficiently
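The slide does not spell out the dynamic program, so the following is one possible version rather than the authors' formulation. It computes the probability that no two consecutive variables are both true, tracking only whether the last variable seen was true, and returns the complement. Function names are hypothetical.

```python
from itertools import product


def chain_probability(p):
    """P[x1 x2 + x2 x3 + ... + x(n-1) xn] for independent x_i with P[x_i] = p[i].

    State after processing x_i:
      last_false = P[no adjacent true pair so far, and x_i is false]
      last_true  = P[no adjacent true pair so far, and x_i is true]
    Runs in O(n) time.
    """
    if len(p) < 2:
        return 0.0
    last_false, last_true = 1.0 - p[0], p[0]
    for pi in p[1:]:
        last_false, last_true = (last_false + last_true) * (1.0 - pi), last_false * pi
    return 1.0 - (last_false + last_true)


def brute_force(p):
    """Exponential-time reference: sum the weights of all satisfying assignments."""
    total = 0.0
    for bits in product([0, 1], repeat=len(p)):
        if any(bits[i] and bits[i + 1] for i in range(len(p) - 1)):
            weight = 1.0
            for bit, pi in zip(bits, p):
                weight *= pi if bit else 1.0 - pi
            total += weight
    return total
```

The brute-force function is included only to cross-check the recurrence on small inputs.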

## Outline

Background
- Existing characterization of read-once expressions
- Co-occurrence graphs

Our contributions
- Co-table graph
- Step 1. Computation of co-table graph
- Step 2. Computation of read-once form

Related work, future work and conclusion


## Characterization of Read-once Expressions

A positive boolean expression is read-once if and only if its "co-occurrence graph" is P4-free (no simple induced path with four vertices) and "normal".
- Gurvich '77, '91
- Can be checked (and computed) in poly-time if the expression is given in DNF (GolumbicMR '06)

## Co-occurrence Graph G_CO

Graph with the variables of the expression as vertices
1. Express the boolean expression in irredundant DNF
   - xy + xyz + zx → xy + zx
2. Put an edge between two variables if they co-occur in a disjunct

Can be easily computed if the expression is in DNF
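The two steps above, plus the P4-freeness half of the characterization, can be sketched as follows. The P4 test is a naive check over all ordered 4-tuples of vertices, meant only to illustrate the definition; normality is not checked here, and all names are hypothetical.

```python
from itertools import combinations, permutations


def irredundant(terms):
    """Absorption: drop any disjunct that strictly contains another disjunct."""
    sets = {frozenset(term) for term in terms}
    return [s for s in sets if not any(other < s for other in sets)]


def co_occurrence_graph(terms):
    """Edge set of G_CO: two variables are adjacent iff they share a disjunct."""
    edges = set()
    for term in irredundant(terms):
        edges |= {frozenset(pair) for pair in combinations(sorted(term), 2)}
    return edges


def has_induced_p4(edges):
    """True iff some four vertices a-b-c-d form an induced path."""
    vertices = set().union(*edges)

    def adjacent(a, b):
        return frozenset((a, b)) in edges

    for a, b, c, d in permutations(vertices, 4):
        path = adjacent(a, b) and adjacent(b, c) and adjacent(c, d)
        chords = adjacent(a, c) or adjacent(a, d) or adjacent(b, d)
        if path and not chords:
            return True
    return False


# xy + yz + zw has co-occurrence graph x - y - z - w, which is itself a P4.
print(has_induced_p4(co_occurrence_graph(["xy", "yz", "zw"])))
```

For xy + yz + zx the graph is a triangle, which is P4-free, so by the characterization that expression must fail the normality condition instead.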


## Our Contributions

1. The DNF of the event expression is not needed for CQ⁻
   - G_CO can be directly computed from "provenance DAGs"
2. We do not need to compute G_CO
   - A subgraph of G_CO suffices: the "co-table graph" G_CT

## Our Framework

- (1) Compute G_CO, then use existing read-once testing algorithms
  - Uses Gurvich's characterization
- (2) Compute G_CT, then use our read-once testing algorithm
  - Uses an alternative
- (2) is faster than (1)

## Provenance DAG

Event expressions, called "lineage" (Suciu keynote), are a form of provenance (GreenKarvounarakisT '07). We use provenance DAGs (Green et al. '07).

- Query: q() := R(x), S(x, y), T(y)
- Query plan: π_()((R ⋈ S) ⋈ T)
- E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3

[Figure: provenance DAG over the leaves w1, w2, w3, v1, v2, v3, v4, u1, u2, u3]

## Co-Table Graph G_CT

Subgraph of G_CO: |G_CT| ≤ |G_CO|

Put an edge between two variables only if their tables share variables in q
- e.g. q() := R(x) S(y)
  - R and S have n tuples each; G_CO has n² edges, G_CT has zero!
- q() := R(x), S(x, y), T(y)
  - E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3

[Figure: G_CO and G_CT for E, drawn on the vertices w1, w2, w3, v1, v2, v3, v4, u1, u2, u3]
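The definition can be illustrated by filtering co-occurrence edges through the query's join pattern. This sketch starts from the DNF lineage for simplicity, whereas the talk computes G_CT from the provenance DAG without building the DNF. The `table_of` and `query_vars` mappings are hypothetical inputs, and the lineage is assumed to be irredundant, which holds for CQ⁻ since each term has exactly one variable per table.

```python
from itertools import combinations


def co_table_graph(terms, table_of, query_vars):
    """Edges of G_CT: co-occurring event variables whose tables share a query variable.

    terms      -- DNF lineage, one tuple of event variables per disjunct
    table_of   -- event variable -> name of the table its tuple belongs to
    query_vars -- table name -> set of query variables in that table's subgoal
    """
    edges = set()
    for term in terms:
        for a, b in combinations(term, 2):
            if query_vars[table_of[a]] & query_vars[table_of[b]]:
                edges.add(frozenset((a, b)))
    return edges


# Running example: q() := R(x), S(x, y), T(y)
E = [("w1", "v1", "u1"), ("w2", "v2", "u1"), ("w3", "v3", "u2"), ("w3", "v4", "u3")]
tables = {v: {"w": "R", "v": "S", "u": "T"}[v[0]] for term in E for v in term}
g_ct = co_table_graph(E, tables, {"R": {"x"}, "S": {"x", "y"}, "T": {"y"}})
print(len(g_ct))  # the w-u edges of G_CO are dropped because R and T share no variable
```

For the cross-product query q() := R(x) S(y), every R variable co-occurs with every S variable, yet no edge survives the filter.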

## Our Algorithm

Input: provenance DAG H
- Obtained from the query plan

Step 1: Compute G_CT
- (the same procedure can compute G_CO as well)

Step 2: Compute the read-once form (if possible)
- Otherwise output that the event expression is not read-once

## Step 1: Computing G_CT

Theorem: Two variables are adjacent in G_CT if and only if their least common ancestor set contains a product-node in the provenance DAG.
- Example: E = xy + xz
- The proof relies critically on the no-self-join assumption

## Step 2: Computing the Read-once Form

Input: G_CT

Alternate between
- Row decomposition: the same query q on each part of the data, E = E1 + E2 + E3
- Table decomposition: q splits into q1 and q2, E = E1 E2

Recursive computation
- Exactly one of the two can be done at a recursion level; otherwise the expression is not read-once
- The proof relies critically on the no-union assumption
- Sound and complete

## Example: Row Decomposition

q() := R(x), S(x, y), T(y)

E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3

[Figure: row decomposition of D into two parts whose expressions are combined with +. The first part is the sub-instance R1 = {a1: w1, a2: w2}, S1 = {(a1, b1): v1, (a2, b1): v2}, T1 = {b1: u1}.]

## Example: Table Decomposition

On the sub-instance R1 = {a1: w1, a2: w2}, S1 = {(a1, b1): v1, (a2, b1): v2}, T1 = {b1: u1}:

- q() := R(x), S(x, y), T(y) splits into
  - q1() := R(x), S(x, y1), with expression (w1 v1 + w2 v2)
  - q2() := T(y2), with expression u1
- Product: (w1 v1 + w2 v2) u1

Final expression: (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)
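The alternation between "+" and "·" steps can be illustrated with the classical recursive decomposition of a P4-free graph, applied here to G_CO. This is explicitly not the authors' G_CT algorithm: splitting into connected components plays the role of a "+" step, and splitting into connected components of the complement plays the role of a "·" step. The sketch does not check normality, which according to the related-work slide is not needed for CQ⁻ lineage. All names are hypothetical.

```python
from itertools import combinations


def co_occurrence_edges(terms):
    """Edge set of G_CO for a DNF assumed to be irredundant already."""
    return {frozenset(pair) for term in terms for pair in combinations(term, 2)}


def components(vertices, adjacent):
    """Connected components of the graph on `vertices` under `adjacent`."""
    remaining, parts = set(vertices), []
    while remaining:
        start = remaining.pop()
        part, stack = {start}, [start]
        while stack:
            v = stack.pop()
            neighbours = {u for u in remaining if adjacent(u, v)}
            remaining -= neighbours
            part |= neighbours
            stack.extend(neighbours)
        parts.append(part)
    return parts


def read_once_form(vertices, edges):
    """Return a nested ("or" | "and", ...) tree, or None if the graph has an induced P4."""
    vertices = set(vertices)
    if len(vertices) == 1:
        return next(iter(vertices))

    def adjacent(a, b):
        return frozenset((a, b)) in edges

    op, parts = "or", components(vertices, adjacent)
    if len(parts) == 1:
        # Graph is connected: try splitting its complement instead.
        op, parts = "and", components(vertices, lambda a, b: not adjacent(a, b))
        if len(parts) == 1:
            return None  # graph and complement both connected: not P4-free
    children = [read_once_form(part, edges) for part in parts]
    return None if any(c is None for c in children) else (op, *children)


def holds(expr, true_vars):
    """Evaluate a nested expression tree under a set of true variables."""
    if isinstance(expr, str):
        return expr in true_vars
    op, *children = expr
    results = [holds(child, true_vars) for child in children]
    return all(results) if op == "and" else any(results)
```

On the running example this yields a tree equivalent to (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3), up to the order of the children. Because each level recomputes components on G_CO, it is also slower than the bounds quoted in the talk.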

## Overall Time Complexity

Input: provenance DAG H

Step 1: Compute G_CT or G_CO
- Time complexity ≈ O(n m_H + W_H m_CO)
- m_H = #edges in H, W_H = width of H, m_CO = #edges in G_CO, m_CT = #edges in G_CT

Step 2: Compute the read-once form (if possible)
- Using our algorithm: O((m_CT + n) min(|q|, √n)); data complexity O(m_CT + n)
- Using existing algorithms: O(m_CO + n), where m_CT ≤ m_CO

## Summary

- The analysis uses a "charging argument"
  - Bound the recursion depth and the total time at each recursion level
- Step 1 is more expensive
- Step 2 is linear
  - In |G_CO| for existing algorithms
  - In |G_CT| for our algorithms
  - |G_CT| ≤ |G_CO|


## Related Work

SenDeshpandeGetoor '10
- Independent work, considers the same problem
- Shows that the "normality" check is not needed for CQ⁻
- Tests P4-freeness using "lineage-trees" without computing the co-occurrence graph

Our work:
- Computes the co-occurrence graph without DNF computation, so existing algorithms can be used
  - Was an open question in SenDeshpandeGetoor '10
- Obtains a faster and simpler algorithm
  - Time complexity comparison in the paper
  - Uses BFS/DFS, easier to implement
- Uses compact provenance DAGs instead of lineage trees

## Other Related Work

- Semantics of probabilistic query answering
  - Fuhr-Rollecke '97, Zimanyi '97
- Dichotomy of CQ⁻, CQ and UCQ queries
  - Dalvi-Suciu '04, '07, Dalvi-Schnaitter-Suciu '10
- Knowledge compilation techniques
  - Olteanu-Huang '08
  - Jha-Olteanu-Suciu '10
  - Jha-Suciu '11
  - Fink-Olteanu '11

## Conclusion and Future Work

Can the co-occurrence/co-table graph be computed as a pre-processing step?
- This is the more expensive step
- Akin to building indexes on databases, but depends on the query's "join pattern"
- Cache the already computed G_CT with the join pattern

How to handle
- Larger classes of queries (UCQ?) and database models (disjoint independent?)
- Other efficient knowledge-compilation forms

Thank you. Questions?
