Published by Dasia Burlin. Modified over 2 years ago.

1
Faster Query Answering in Probabilistic Databases using Read-Once Functions
Sudeepa Roy
Joint work with Vittorio Perduca and Val Tannen
University of Pennsylvania

2
Probabilistic Databases
Possible worlds model:
- Each possible world w is a standard database instance and has a probability P[w].
- A compact representation D is based on independence assumptions.
Query semantics in probabilistic databases (w.l.o.g. for a Boolean query q):
- Traditional database: q(D) ∈ {true, false}
- Probabilistic database: P[q(D)] = Σ_{w : q(w) = true} P[w]
Goal: efficiently evaluate P[q(D)]. We consider data complexity, so we want time polynomial in n = |D|.

3
Computation of P[q(D)]
Can we efficiently compute P[q(D)]? No: in general the problem is #P-hard.
Dalvi-Suciu ’04 and follow-up work: positive queries can be partitioned into
- Safe queries: safe plans run in polynomial time on all instances.
- Unsafe queries: data complexity is #P-hard. This class includes very simple queries such as q() := R(x), S(x, y), T(y).
Given q as input, we can efficiently decide whether q is safe.
But: even for unsafe queries, the probability can be computed efficiently on some instances.
Our approach: take both q and D as input.

4
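Restrictions
- Tuple-independent representation D: each tuple t is annotated with its probability P[t].
- Conjunctive queries without self-joins (CQ⁻), e.g. q() := R(x), S(x, y), T(y). (This is the H_0 query from Suciu’s keynote.)
[Figure: tables R, S, T of D with a probability for each tuple, and a possible world w containing the tuples R(a1), S(a1, b1), T(b1).]
The probability of a possible world w multiplies P[t] for each tuple present in w and 1 − P[t] for each tuple absent from w:
P[w] = 0.3 · (1 − 0.4) · (1 − 0.6) · 0.1 · (1 − 0.5) · (1 − 0.2) · (1 − 0.1) · 0.7 · (1 − 0.8) · (1 − 0.4)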
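The possible-worlds semantics under tuple independence can be sketched by brute force. This is only an illustration: the two-row instance and its probabilities below are hypothetical (they are not the values from the slide), and the enumeration is exponential in the number of tuples, which is exactly why faster methods are needed.

```python
from itertools import product

# Hypothetical tuple-independent instance: each tuple maps to its probability P[t].
R = {("a1",): 0.5, ("a2",): 0.4}
S = {("a1", "b1"): 0.9, ("a2", "b1"): 0.2}
T = {("b1",): 0.7}
tables = {"R": R, "S": S, "T": T}


def world_probability(present, tables):
    """P[w]: multiply P[t] for tuples in w and 1 - P[t] for tuples absent from w."""
    prob = 1.0
    for name, table in tables.items():
        for tup, p in table.items():
            prob *= p if (name, tup) in present else 1.0 - p
    return prob


def q(present):
    """Boolean query q() := R(x), S(x, y), T(y) evaluated on one world."""
    return any(
        ("R", (x,)) in present and ("T", (y,)) in present
        for name, tup in present
        if name == "S"
        for x, y in [tup]
    )


def query_probability(tables):
    """P[q(D)] as the sum of P[w] over all worlds w in which q is true."""
    all_tuples = [(name, tup) for name, table in tables.items() for tup in table]
    total = 0.0
    for bits in product([False, True], repeat=len(all_tuples)):
        present = {t for t, keep in zip(all_tuples, bits) if keep}
        if q(present):
            total += world_probability(present, tables)
    return total


p_q = query_probability(tables)
```

With five tuples the loop visits 2^5 = 32 worlds; each additional tuple doubles that count.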

5
Query Answering in Two Steps: Example
Query: q() := R(x), S(x, y), T(y). Each tuple of D carries an event variable and a probability:
R: a1 (w1), a2 (w2), a3 (w3)
S: (a1, b1) (v1), (a2, b1) (v2), (a3, b2) (v3), (a3, b3) (v4)
T: b1 (u1), b2 (u2), b3 (u3)
Step 1 (easy): compute the event expression for q(D), also called its “lineage”:
E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
The form of the expression depends on the query plan; here the plan is π_()((R ⋈ S) ⋈ T).
Step 2 (hard): compute P[q(D)] = P[E] given P[w1] = 0.3, P[v1] = 0.4, …
This work: take advantage of read-once expressions.
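Step 1 can be sketched as a join that multiplies event variables. The table contents below are an assumption: they are the ones implied by the lineage E on the slide (for instance, the monomial w3 v4 u3 implies a tuple S(a3, b3) annotated with v4). The function name `lineage` is hypothetical.

```python
# Tables of the running example: each tuple is annotated with its event variable.
R = {"a1": "w1", "a2": "w2", "a3": "w3"}
S = {("a1", "b1"): "v1", ("a2", "b1"): "v2", ("a3", "b2"): "v3", ("a3", "b3"): "v4"}
T = {"b1": "u1", "b2": "u2", "b3": "u3"}


def lineage(R, S, T):
    """Return the DNF lineage of q() := R(x), S(x, y), T(y) as a list of monomials."""
    monomials = []
    for (x, y), v in S.items():
        # A join result exists only if matching R and T tuples exist.
        if x in R and y in T:
            monomials.append((R[x], v, T[y]))
    return monomials


E = lineage(R, S, T)
print(" + ".join(" ".join(m) for m in E))
```

The printed expression is the DNF form; computing its probability is the part that is hard in general.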

6
Read-Once Boolean Expressions
Expression in read-once form: every variable occurs exactly once, e.g. ((x + y)z + w)(u + v).
- Probability computation takes linear time:
  P(xy) = P(x) P(y)
  P(x + y) = 1 − (1 − P(x))(1 − P(y))
Read-once expression: has an equivalent read-once form, e.g.
- xzu + xzv + yzu + yzv + wu + wv [in DNF; can be as large as O(n^|q|)]
- xzu + xzv + (yz + w)(u + v) [not in DNF; can be much smaller]
Non-read-once expressions: have no equivalent read-once form, e.g. xy + yz + zx and xy + yz + zw.
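The two rules above give a one-pass evaluation over the expression tree. The sketch below assumes independent variables and a hypothetical encoding (nested tuples tagged "and" / "or"); the probabilities are illustrative.

```python
def prob(expr, p):
    """Probability of an expression in read-once form over independent variables.

    expr is a variable name (str) or a tuple ("and" | "or", child, child, ...).
    The result is correct only if every variable occurs at most once in expr.
    """
    if isinstance(expr, str):
        return p[expr]
    op, *children = expr
    if op == "and":
        result = 1.0
        for child in children:
            result *= prob(child, p)              # P(x y) = P(x) P(y)
        return result
    if op == "or":
        none_true = 1.0
        for child in children:
            none_true *= 1.0 - prob(child, p)     # P(x + y) = 1 - (1 - P(x))(1 - P(y))
        return 1.0 - none_true
    raise ValueError(f"unknown operator: {op}")


# ((x + y) z + w)(u + v), the read-once form from the slide.
expr = ("and", ("or", ("and", ("or", "x", "y"), "z"), "w"), ("or", "u", "v"))
p = {name: 0.5 for name in "xyzwuv"}   # hypothetical probabilities
value = prob(expr, p)
```

Each node is visited once, so the running time is linear in the size of the expression. The same recursion applied to an expression with repeated variables would silently give a wrong answer, because the subexpressions would no longer be independent.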

7
Read-Once Event Expressions
- Safe plans for safe queries directly produce expressions in read-once form (Olteanu-Huang ’08).
- Unsafe queries can also produce read-once expressions. Our example is read-once:
  E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3 = (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)
- E corresponds to the unsafe query q() := R(x), S(x, y), T(y), and no query plan can produce the read-once form directly.
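The claimed equality between the DNF and its read-once form can be checked mechanically. The following sketch simply compares the two expressions on all 2^10 truth assignments; the function names are hypothetical.

```python
from itertools import product

VARS = ["w1", "w2", "w3", "v1", "v2", "v3", "v4", "u1", "u2", "u3"]


def dnf(a):
    """E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3."""
    return ((a["w1"] and a["v1"] and a["u1"]) or (a["w2"] and a["v2"] and a["u1"])
            or (a["w3"] and a["v3"] and a["u2"]) or (a["w3"] and a["v4"] and a["u3"]))


def read_once_form(a):
    """(w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)."""
    left = ((a["w1"] and a["v1"]) or (a["w2"] and a["v2"])) and a["u1"]
    right = a["w3"] and ((a["v3"] and a["u2"]) or (a["v4"] and a["u3"]))
    return left or right


# Compare both forms on every assignment of the ten event variables.
equivalent = all(
    dnf(dict(zip(VARS, bits))) == read_once_form(dict(zip(VARS, bits)))
    for bits in product([False, True], repeat=len(VARS))
)
```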

8
Problem Definition
Given a Boolean CQ⁻ query q and a tuple-independent database D:
- Can we efficiently decide whether the event expression corresponding to q(D) is read-once?
- If yes, can we compute the read-once form efficiently? (Then P[q(D)] can be computed efficiently.)

9
Read-once-ness is only a sufficient condition for computing P[q(D)] efficiently:
E is read-once ⇒ the read-once form of E can be computed efficiently ⇒ P[E] can be computed efficiently.
It is not a necessary condition: e.g., E = x1 x2 + x2 x3 + x3 x4 + … is not read-once, yet P[E] can be computed in polynomial time using dynamic programming.
For a detailed analysis using OBDDs, FBDDs, and d-DNNFs, see Jha-Suciu ’11.
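One possible dynamic program for the chain expression is sketched below. The slide only states that such a program exists; the particular state (whether the last variable read is true, restricted to assignments where E is still false) is an assumption chosen for illustration, and the probabilities are hypothetical.

```python
from itertools import product


def chain_probability(p):
    """P[E] for E = x1 x2 + x2 x3 + ... + x(n-1) xn with independent variables.

    p[i] is the probability that variable x(i+1) is true.
    """
    if len(p) < 2:
        return 0.0  # no monomial exists
    # State after reading x1..xi, restricted to assignments where E is still false:
    #   last_false: probability that xi is false
    #   last_true:  probability that xi is true
    last_false, last_true = 1.0 - p[0], p[0]
    for pi in p[1:]:
        last_false, last_true = (
            (last_false + last_true) * (1.0 - pi),  # xi false: any previous state
            last_false * pi,                        # xi true: previous must be false
        )
    return 1.0 - (last_false + last_true)


def brute_force(p):
    """Reference value by enumerating all 2^n assignments (exponential time)."""
    total = 0.0
    for bits in product([0, 1], repeat=len(p)):
        if any(bits[i] and bits[i + 1] for i in range(len(p) - 1)):
            weight = 1.0
            for bit, pi in zip(bits, p):
                weight *= pi if bit else 1.0 - pi
            total += weight
    return total


probs = [0.3, 0.5, 0.2, 0.9, 0.6]   # hypothetical probabilities
```

The loop runs once per variable, so the computation is linear in n even though the expression has no read-once form for n ≥ 4.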

10
Outline
- Background
  - Existing characterization of read-once expressions
  - Co-occurrence graphs
- Our contributions
  - Co-table graph
  - Step 1: computation of the co-table graph
  - Step 2: computation of the read-once form
- Related work, future work, and conclusion

12
Characterization of Read-once Expressions
- A positive Boolean expression is read-once if and only if its “co-occurrence graph” is P4-free (it has no induced simple path on four vertices) and “normal” (Gurvich ’77, ’91).
- This can be checked, and the read-once form computed, in polynomial time if the expression is given in DNF (Golumbic-Mintz-Rotics ’06).

13
Co-occurrence Graph G_CO
A graph with the variables of the expression as vertices:
1. Express the Boolean expression in irredundant DNF, e.g. xy + xyz + zx becomes xy + zx.
2. Put an edge between two variables if they co-occur in some disjunct.
G_CO can be easily computed if the expression is in DNF. For xy + zx, the graph is the path y – x – z.
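The construction above, together with the P4-free and normality conditions of the characterization, can be sketched by brute force for small expressions. This is an illustrative check only, not one of the polynomial-time algorithms cited in the talk: the normality test enumerates all vertex subsets, and it takes "normal" to mean that every clique of the graph is contained in some monomial. All function names are hypothetical.

```python
from itertools import combinations, permutations


def irredundant(clauses):
    """Drop monomials absorbed by a smaller one, e.g. xyz is absorbed by xy."""
    sets = {frozenset(c) for c in clauses}
    return {c for c in sets if not any(d < c for d in sets)}


def co_occurrence_graph(clauses):
    """Edges join two variables that appear together in some monomial."""
    edges = set()
    for c in irredundant(clauses):
        edges.update(frozenset(pair) for pair in combinations(sorted(c), 2))
    return edges


def is_p4_free(vertices, edges):
    """Brute-force check: no induced path a - b - c - d on four distinct vertices."""
    def adj(a, b):
        return frozenset((a, b)) in edges
    for a, b, c, d in permutations(vertices, 4):
        if (adj(a, b) and adj(b, c) and adj(c, d)
                and not adj(a, c) and not adj(a, d) and not adj(b, d)):
            return False
    return True


def is_normal(vertices, edges, clauses):
    """Brute-force check: every clique of the graph lies inside some monomial."""
    mons = irredundant(clauses)
    for size in range(2, len(vertices) + 1):
        for subset in combinations(sorted(vertices), size):
            clique = all(frozenset(p) in edges for p in combinations(subset, 2))
            if clique and not any(set(subset) <= m for m in mons):
                return False
    return True


def is_read_once(clauses):
    """Apply the characterization to a positive DNF given as a list of monomials."""
    vertices = set().union(*map(set, clauses))
    edges = co_occurrence_graph(clauses)
    return is_p4_free(vertices, edges) and is_normal(vertices, edges, clauses)
```

For example, `is_read_once(["xy", "yz", "zw"])` fails the P4 test (the graph is the path x, y, z, w), while `is_read_once(["xy", "yz", "zx"])` fails normality (the triangle x, y, z is a clique that no monomial contains).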

15
Our Contributions
1. The DNF of the event expression is not needed for CQ⁻: G_CO can be computed directly from “provenance DAGs”.
2. We do not need to compute G_CO at all: a subgraph of G_CO suffices, the “co-table graph” G_CT.
Our framework offers two options:
(1) Compute G_CO, then use existing read-once testing algorithms. This option uses Gurvich’s characterization.
(2) Compute G_CT, then use our read-once testing algorithm. This option uses an alternative.
Option (2) is faster than option (1).

16
Provenance DAG
- Event expressions, called “lineage” (Suciu keynote), are a form of provenance (Green-Karvounarakis-Tannen ’07).
- We use provenance DAGs (Green et al. ’07).
Example: for the query q() := R(x), S(x, y), T(y) and the query plan π_()((R ⋈ S) ⋈ T), the provenance DAG represents
E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
[Figure: provenance DAG over the leaves w1, w2, w3, v1, v2, v3, v4, u1, u2, u3.]
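A provenance DAG for this plan might be represented as below. This is a sketch under assumptions: the `Node` class, the operator labels, and the exact shape (one product node per join result, one sum node for the final projection) are illustrative and may differ from the definition used in the paper. Expanding the DAG recovers the DNF, which is the step the approach in this talk avoids.

```python
class Node:
    """A provenance DAG node: a leaf variable, or a '+' / '*' over child nodes."""

    def __init__(self, op, children=(), name=None):
        self.op = op                    # "var", "+", or "*"
        self.children = list(children)
        self.name = name                # set only for leaves


def var(name):
    return Node("var", name=name)


def expand(node):
    """Expand the DAG below node into DNF: a set of monomials (frozensets)."""
    if node.op == "var":
        return {frozenset([node.name])}
    parts = [expand(child) for child in node.children]
    if node.op == "+":
        return set().union(*parts)
    # "*": combine one monomial from each child.
    result = {frozenset()}
    for part in parts:
        result = {m | n for m in result for n in part}
    return result


w = {i: var(f"w{i}") for i in (1, 2, 3)}
v = {i: var(f"v{i}") for i in (1, 2, 3, 4)}
u = {i: var(f"u{i}") for i in (1, 2, 3)}

# R join S: one product node per joined pair of tuples.
rs = [Node("*", [w[1], v[1]]), Node("*", [w[2], v[2]]),
      Node("*", [w[3], v[3]]), Node("*", [w[3], v[4]])]
# (R join S) join T: the leaves u1 and w3 are shared by several parents.
rst = [Node("*", [rs[0], u[1]]), Node("*", [rs[1], u[1]]),
       Node("*", [rs[2], u[2]]), Node("*", [rs[3], u[3]])]
# Projection onto no attributes: a single sum node at the root.
root = Node("+", rst)

monomials = expand(root)
```

Because leaves such as u1 and w3 have several parents, the structure is a DAG rather than a tree, which keeps it compact.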

17
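Co-Table Graph G_CT
- G_CT is a subgraph of G_CO, so |G_CT| ≤ |G_CO|.
- Put an edge between two variables only if their tables share variables in q.
- Example: for q() := R(x), S(y), where R and S have n tuples each, G_CO has n² edges and G_CT has none.
- Example: for q() := R(x), S(x, y), T(y) and E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3, G_CT omits the edges between R-variables (w) and T-variables (u), because R and T share no query variable.
[Figure: G_CO and G_CT over the vertices w1, w2, w3, v1, v2, v3, v4, u1, u2, u3.]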
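The relationship between the two graphs can be illustrated on the running example. Note the assumption: this sketch derives G_CT by filtering G_CO, purely to show which edges disappear, whereas the point of the talk is that G_CT can be computed without building G_CO first. The dictionaries `ATOMS` and `TABLE_OF` are hypothetical encodings of the query and of the variable naming convention.

```python
from itertools import combinations

# Query atoms of q() := R(x), S(x, y), T(y): table name -> query variables.
ATOMS = {"R": {"x"}, "S": {"x", "y"}, "T": {"y"}}
# Table of each event variable, keyed by its first letter (w: R, v: S, u: T).
TABLE_OF = {"w": "R", "v": "S", "u": "T"}

E = [("w1", "v1", "u1"), ("w2", "v2", "u1"), ("w3", "v3", "u2"), ("w3", "v4", "u3")]


def co_occurrence_edges(monomials):
    """Edges between variables that appear together in some monomial."""
    edges = set()
    for m in monomials:
        edges.update(frozenset(pair) for pair in combinations(m, 2))
    return edges


def co_table_edges(monomials):
    """Keep a co-occurrence edge only if the two tables share a query variable."""
    kept = set()
    for edge in co_occurrence_edges(monomials):
        a, b = sorted(edge)
        if ATOMS[TABLE_OF[a[0]]] & ATOMS[TABLE_OF[b[0]]]:
            kept.add(edge)
    return kept


g_co = co_occurrence_edges(E)
g_ct = co_table_edges(E)
```

Here G_CO has 12 edges and G_CT has 8: the four dropped edges are w1–u1, w2–u1, w3–u2, and w3–u3.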

18
Our Algorithm
Input: a provenance DAG H, obtained from the query plan.
Step 1: Compute G_CT (the same procedure can compute G_CO as well).
Step 2: Compute the read-once form if one exists; otherwise report that the event expression is not read-once.

19
Step 1: Computing G_CT
Theorem: two variables are adjacent in G_CT if and only if their least common ancestor set in the provenance DAG contains a product node.
Example: E = xy + xz.
The proof relies critically on the no-self-join assumption.

20
Step 2: Computing the Read-once Form
Input: G_CT.
- The recursive computation alternates between row decomposition and table decomposition.
- Row decomposition: the same query q is evaluated on separate groups of rows, giving E = E_1 + E_2 + E_3.
- Table decomposition: the query splits into subqueries q_1 and q_2, giving E = E_1 E_2.
- Exactly one of the two can be done at each recursion level; otherwise the expression is not read-once.
- The proof relies critically on the no-union assumption.
- The algorithm is sound and complete.

21
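Example: Row Decomposition
For q() := R(x), S(x, y), T(y) and E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3, the variables split into two groups that share no monomial: {w1, w2, v1, v2, u1} and {w3, v3, v4, u2, u3}. The first group corresponds to the sub-tables
R1: a1 (w1), a2 (w2)
S1: (a1, b1) (v1), (a2, b1) (v2)
T1: b1 (u1)
so E = (w1 v1 u1 + w2 v2 u1) + (w3 v3 u2 + w3 v4 u3).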

22
Example: Table Decomposition
On the first group (R1, S1, T1), the query q() := R(x), S(x, y), T(y) splits into q1() := R(x), S(x, y1) and q2() := T(y2), with expressions w1 v1 + w2 v2 and u1. Their product is (w1 v1 + w2 v2) u1.
Final expression: (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)
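The alternation between the two decompositions can be imitated on the DNF itself. The sketch below is not the authors' algorithm on G_CT: it swaps in the classical cograph-style decomposition of the co-occurrence graph, where a "sum split" (connected components) plays the role of row decomposition and a "product split" (connected components of the complement graph) plays the role of table decomposition. It assumes a positive, irredundant DNF and favors clarity over speed; all names are hypothetical.

```python
def components(vertices, adjacent):
    """Connected components of the graph on vertices under an adjacency test."""
    remaining, comps = set(vertices), []
    while remaining:
        start = min(remaining)               # deterministic choice
        remaining.discard(start)
        comp, stack = {start}, [start]
        while stack:
            a = stack.pop()
            for b in sorted(remaining):
                if adjacent(a, b):
                    remaining.discard(b)
                    comp.add(b)
                    stack.append(b)
        comps.append(comp)
    return comps


def read_once(clauses):
    """Return a read-once form of a positive irredundant DNF as a string, or None."""
    clauses = {frozenset(c) for c in clauses}
    if len(clauses) == 1:
        return " ".join(sorted(next(iter(clauses))))
    variables = set().union(*clauses)
    edges = {frozenset((a, b)) for c in clauses for a in c for b in c if a != b}

    # Sum split: connected components of the co-occurrence graph.
    comps = components(variables, lambda a, b: frozenset((a, b)) in edges)
    if len(comps) > 1:
        parts = [read_once({c for c in clauses if c <= comp}) for comp in comps]
        return None if None in parts else " + ".join(parts)

    # Product split: connected components of the complement graph.
    cocomps = components(variables, lambda a, b: frozenset((a, b)) not in edges)
    if len(cocomps) == 1:
        return None                          # neither split applies
    projections = [{c & comp for c in clauses} for comp in cocomps]
    size = 1
    for proj in projections:
        if frozenset() in proj:
            return None                      # some monomial misses this factor
        size *= len(proj)
    if size != len(clauses):
        return None                          # DNF is not the product of its projections
    factors = [read_once(proj) for proj in projections]
    if None in factors:
        return None
    return " ".join(f"({f})" if " + " in f else f for f in factors)


E = [("w1", "v1", "u1"), ("w2", "v2", "u1"), ("w3", "v3", "u2"), ("w3", "v4", "u3")]
form = read_once(E)
```

On the running example, `form` is equivalent to the final expression above up to the order of factors and summands, and every variable appears in it exactly once. On xy + yz + zw and xy + yz + zx the function returns `None`.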

23
Overall Time Complexity
Input: provenance DAG H.
Notation: m_H = number of edges in H, W_H = width of H, m_CO = number of edges in G_CO, m_CT = number of edges in G_CT.
Step 1: compute G_CT or G_CO. Time complexity ≈ O(n m_H + W_H m_CO).
Step 2: compute the read-once form (if possible).
- Using our algorithm: O((m_CT + n) min(|q|, √n)); data complexity O(m_CT + n).
- Using existing algorithms: O(m_CO + n), where m_CT ≤ m_CO.
Summary:
- The analysis uses a “charging argument”: bound the recursion depth and the total time at each recursion level.
- Step 1 is the more expensive step.
- Step 2 is linear: in |G_CO| for existing algorithms and in |G_CT| for our algorithm, with |G_CT| ≤ |G_CO|.

25
Related Work
Sen-Deshpande-Getoor ’10: independent work that considers the same problem.
- Shows that the “normality” check is not needed for CQ⁻.
- Tests P4-freeness using “lineage trees” without computing the co-occurrence graph.
Our work:
- Computes the co-occurrence graph without computing the DNF, so existing algorithms can be used. This was an open question in Sen-Deshpande-Getoor ’10.
- Obtains a faster and simpler algorithm: the time complexity comparison is in the paper, and the algorithm uses BFS/DFS, which is easier to implement.
- Uses compact provenance DAGs instead of lineage trees.

26
Other Related Work
- Semantics of probabilistic query answering: Fuhr-Rölleke ’97, Zimányi ’97
- Dichotomy of CQ⁻, CQ, and UCQ queries: Dalvi-Suciu ’04, ’07; Dalvi-Schnaitter-Suciu ’10
- Knowledge compilation techniques: Olteanu-Huang ’08, Jha-Olteanu-Suciu ’10, Jha-Suciu ’11, Fink-Olteanu ’11

27
Conclusion and Future Work
- Can the co-occurrence/co-table graph be computed as a pre-processing step?
  - This is the more expensive step.
  - It is akin to building indexes on databases, but it depends on the query’s “join pattern”.
  - Idea: cache the already computed G_CT together with the join pattern.
- How can we handle larger classes of queries (UCQ?) and other database models (disjoint-independent?)?
- Other efficient knowledge-compilation forms.

28
Thank You. Questions?