Published by Octavio Heaton. Modified over 4 years ago.

1
Daniel Deutch (Tel Aviv Univ.), Tova Milo (Tel Aviv Univ.), Sudeepa Roy (Univ. of Washington), Val Tannen (Univ. of Pennsylvania)

2
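“Boolean Provenance/Lineage” as a Boolean formula
– Q is true on D $\Leftrightarrow$ $F_{Q,D}$ is true
– Poly-size, poly-time computable (data complexity)
– But Q is an RA+ query
This talk: What if Q is a Datalog program?
Database D (annotation variable in brackets):
– AsthmaPatient: Ann [$x_1$], Bob [$x_2$]
– Friend: (Ann, Joe) [$y_1$], (Ann, Tom) [$y_2$], (Bob, Tom) [$y_3$]
– Smoker: Joe [$z_1$], Tom [$z_2$]
Boolean query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$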

3
Provenance
– Reliability and repeatability
– View management and deletion propagation
– Trust and security management
– Query answering in probabilistic databases, …
Datalog
– Datalog is popular again! (two keynotes at this ICDT/EDBT)
– Data extraction on the Web, declarative networking
– Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)
Finding a suitable “Provenance for Datalog” is important
– Both from theoretical and practical viewpoints
How do we compute, store, and interpret provenance for Datalog programs efficiently and effectively?

4
Can we get poly-size Boolean formulas for Datalog provenance?
– No, even if we allow unbounded time
Do we have a solution?
– Yes! Use Boolean circuits!
What about general “provenance semirings” beyond Boolean provenance? (ref. [Green et al. ’07])
– It depends on the semiring

5
Outline
– Background
– Circuits for Boolean Provenance
– Circuits for General Provenance Semirings

6
Outline
– Background
– Circuits for Boolean Provenance
– Circuits for General Provenance Semirings

7
Datalog program for Transitive Closure and Single-source Reachability
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
EDB (Extensional Database, base) relation for edges: R
IDB (Intensional Database, derived) relations:
– Transitive closure (T)
– Single-source reachability from vertex ‘a’ (S)
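
The three rules above can be read as a bottom-up fixpoint computation. The following is a minimal sketch of naive evaluation, not the evaluation strategy of any particular Datalog engine; the edge relation `R` and the function names are hypothetical and chosen only for illustration.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of
    T(x, y) :- R(x, y)   and   T(x, y) :- R(x, z), T(z, y)."""
    t = set(edges)                      # rule 1: every edge is in T
    while True:
        # rule 2: join R(x, z) with T(z, y) on the shared vertex z
        new = {(x, y) for (x, z) in edges for (z2, y) in t if z == z2}
        if new <= t:                    # fixpoint: no new tuple was derived
            return t
        t |= new


def reachable_from(edges, source):
    """S(x) :- T(source, x)."""
    return {y for (x, y) in transitive_closure(edges) if x == source}


# Hypothetical EDB relation R (not taken from the slides).
R = {("a", "b"), ("b", "c"), ("d", "a")}
T = transitive_closure(R)
S = reachable_from(R, "a")
```

Here `T` is the IDB relation for transitive closure and `S` is single-source reachability from vertex `a`.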

8
PosBool(X)-database D
Tuples are annotated with variables from a set X
– Here $X = \{x_1, x_2, y_1, y_2, \dots\}$
For n tuples annotated with variables in X, there are $2^n$ possible worlds, given by assignments $\mu : X \to \{\mathrm{True}, \mathrm{False}\}$
Useful in query evaluation on incomplete or probabilistic databases
Database D (annotation variable in brackets):
– AsthmaPatient: Ann [$x_1$], Bob [$x_2$]
– Friend: (Ann, Joe) [$y_1$], (Ann, Tom) [$y_2$], (Bob, Tom) [$y_3$]
– Smoker: Joe [$z_1$], Tom [$z_2$]

9
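Annotation propagates from input to output
– Join = $\wedge$, Projection/Union = $\vee$
Output tuples are annotated by monotone Boolean formulas
– $F_{Q,D}$ is the annotation of the unique output tuple
PosBool(X)-database D (annotation variable in brackets):
– AsthmaPatient: Ann [$x_1$], Bob [$x_2$]
– Friend: (Ann, Joe) [$y_1$], (Ann, Tom) [$y_2$], (Bob, Tom) [$y_3$]
– Smoker: Joe [$z_1$], Tom [$z_2$]
RA+ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$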
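
As an illustration of how annotations propagate, the sketch below computes the provenance of the Boolean query over the example database in disjunctive normal form: each successful join contributes one conjunction, and the projection collects the conjunctions into a disjunction. The dictionary encoding and the function names are assumptions made for this example only.

```python
# Example database from the slide: tuple -> annotation variable.
asthma = {"Ann": "x1", "Bob": "x2"}
friend = {("Ann", "Joe"): "y1", ("Ann", "Tom"): "y2", ("Bob", "Tom"): "y3"}
smoker = {"Joe": "z1", "Tom": "z2"}


def provenance_dnf(asthma, friend, smoker):
    """Return the provenance as a set of clauses (each clause is a set of variables).

    Join corresponds to AND (variables collected in one clause);
    projection/union corresponds to OR (clauses collected in one set).
    """
    clauses = set()
    for (x, y), friend_var in friend.items():
        if x in asthma and y in smoker:
            clauses.add(frozenset({asthma[x], friend_var, smoker[y]}))
    return clauses


def evaluate(dnf, assignment):
    """Truth value of the DNF under a True/False assignment to the variables."""
    return any(all(assignment[v] for v in clause) for clause in dnf)


F = provenance_dnf(asthma, friend, smoker)
```

The three clauses in `F` correspond to the three conjunctions of the formula on the slide.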

10
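For all RA+ queries Q, databases D, and assignments $\mu$:
1. (Faithful representation) $Q(D_\mu) = \mu[Q(D)]$, where $D_\mu$ is the possible world of D selected by $\mu$
2. (Poly-size overhead) The size of $F_{Q,D}$ is polynomial in |D|, and $F_{Q,D}$ can be computed in poly-time
PosBool(X)-database D (annotation variable in brackets):
– AsthmaPatient: Ann [$x_1$], Bob [$x_2$]
– Friend: (Ann, Joe) [$y_1$], (Ann, Tom) [$y_2$], (Bob, Tom) [$y_3$]
– Smoker: Joe [$z_1$], Tom [$z_2$]
RA+ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$
Example: an assignment of True/False values to the tuple annotations under which $F_{Q,D}$ = False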
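
The faithfulness property can be checked exhaustively on the small example, since seven annotated tuples give only 128 possible worlds. The sketch below is a brute-force check under an assumed dictionary encoding of the database; it evaluates the query directly on each possible world and compares the result with the formula from the slide.

```python
from itertools import product

# Annotation variable -> tuple, for the example database on the slide.
ASTHMA = {"x1": "Ann", "x2": "Bob"}
FRIEND = {"y1": ("Ann", "Joe"), "y2": ("Ann", "Tom"), "y3": ("Bob", "Tom")}
SMOKER = {"z1": "Joe", "z2": "Tom"}
VARS = sorted([*ASTHMA, *FRIEND, *SMOKER])


def query_on_world(mu):
    """Evaluate Q directly on the possible world that keeps tuples with mu[var] == True."""
    asthma = {t for v, t in ASTHMA.items() if mu[v]}
    friend = {t for v, t in FRIEND.items() if mu[v]}
    smoker = {t for v, t in SMOKER.items() if mu[v]}
    return any(x in asthma and y in smoker for (x, y) in friend)


def formula(mu):
    """The provenance formula F_{Q,D} from the slide, evaluated under mu."""
    return ((mu["x1"] and mu["y1"] and mu["z1"])
            or (mu["x1"] and mu["y2"] and mu["z2"])
            or (mu["x2"] and mu["y3"] and mu["z2"]))


def all_worlds():
    """Enumerate every True/False assignment to the annotation variables."""
    for values in product([False, True], repeat=len(VARS)):
        yield dict(zip(VARS, values))


# Faithful representation: both sides agree in every possible world.
FAITHFUL = all(query_on_world(mu) == formula(mu) for mu in all_worlds())
```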

11
Semantics using derivation trees (Green et al. 2007)
Program:
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
EDB relation R: R(a, a) annotated $p$; R(a, b) annotated $q$
Annotation of T(a, b):
$\bigvee_{\text{trees } \tau}\; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t) = q \vee (p \wedge q) \vee (p \wedge p \wedge q) \vee \dots$
– Infinitely many trees
– But the annotation always has a finite equivalent form; here it is $q$
– But not necessarily poly-size
[Figure: derivation trees for T(a, b), each using R(a, b) once and R(a, a) zero or more times]

12
Theorem: Given a PosBool(X)-database D and a Datalog program P, the provenance of tuples in P(D) cannot, in general, be faithfully represented using Boolean formulas of size polynomial in |D|
Proof outline:
– st-connectivity on n nodes requires monotone Boolean formulas of size $n^{\Omega(\log n)}$ (Karchmer-Wigderson, 1988)
– Faithful representation requires: for all True/False assignments $\mu$ to X, $P(D_\mu) = \mu[P(D)]$
– Reduce to the hard instance with the right assignment when P = transitive closure
Solution: Boolean circuits!

13
Outline
– Background
– Circuits for Boolean Provenance or PosBool(X)
– Circuits for General Provenance Semirings

14
Circuit is a DAG
– Can share common subexpressions
– Boolean formula = tree
Leaf nodes:
– EDB vars in X
Internal nodes:
– $\wedge$: IDB/EDB vars used in one derivation
– $\vee$: alternative derivations
Roots:
– IDB vars
Program:
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
EDB relation R: R(a, a) annotated $p$; R(a, b) annotated $q$
[Figure: graph with vertices a and b, and the circuit for $X_{T(a,b)}$ built over the inputs $X_{R(a,a)} = p$ and $X_{R(a,b)} = q$]
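
To make the DAG-versus-tree distinction concrete, here is a small sketch of a monotone circuit with shared subexpressions. The `Circuit` class, its method names, and the example gate pattern are hypothetical and are not the construction from the talk; they only show that a shared node is stored once in a circuit but copied at every use when the circuit is unfolded into a formula.

```python
class Circuit:
    """Monotone Boolean circuit stored as a DAG with structural sharing."""

    def __init__(self):
        self.nodes = []     # node id -> (op, children); children precede parents
        self._index = {}    # (op, children) -> node id, so equal nodes are shared

    def _add(self, op, children):
        key = (op, children)
        if key not in self._index:
            self._index[key] = len(self.nodes)
            self.nodes.append(key)
        return self._index[key]

    def var(self, name):
        return self._add("var", (name,))

    def and_(self, *kids):
        return self._add("and", tuple(kids))

    def or_(self, *kids):
        return self._add("or", tuple(kids))

    def evaluate(self, root, assignment):
        """One pass over the nodes in topological order (linear in circuit size)."""
        value = []
        for op, kids in self.nodes:
            if op == "var":
                value.append(assignment[kids[0]])
            elif op == "and":
                value.append(all(value[k] for k in kids))
            else:
                value.append(any(value[k] for k in kids))
        return value[root]

    def formula_size(self, root):
        """Size of the tree obtained by unfolding the DAG below root."""
        op, kids = self.nodes[root]
        if op == "var":
            return 1
        return 1 + sum(self.formula_size(k) for k in kids)


c = Circuit()
p, q = c.var("p"), c.var("q")
g = c.or_(p, q)
for _ in range(10):
    # g is used twice per layer: shared in the circuit, duplicated in the formula
    g = c.and_(c.or_(g, p), c.or_(g, q))
```

In this example the circuit grows by three nodes per layer, while the unfolded formula roughly doubles in size per layer.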

15
Theorem: Given any PosBool(X)-database D and Datalog program P, the provenance of tuples in P(D) can be faithfully represented using monotone Boolean circuits of size polynomial in |D| (and these circuits can be computed in poly-time)

16
Two key ideas from previous work
1. Datalog provenance can be represented by a system of equations, obtained by instantiating the vars in the Datalog program P to EDB/IDB tuples [Green et al. 2007]
– EDB tuples $\to$ constants, IDB tuples $\to$ variables
– Iteratively solve this system of equations
– Fixpoint = provenance for all IDB tuples
2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011]
– N = #IDB tuples
– Build a circuit with N+1 layers from the system of equations

17
Program:
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
EDB relation R: R(a, a) annotated $p$; R(a, b) annotated $q$
Step 1: Build the system of equations by all possible instantiations x, y, z $\to$ a, b (here $p$, $q$ are constants and the $X$'s are variables):
$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$
Step 2: Build a circuit with 4 + 1 layers (N = 4)
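
Step 1 can be sketched as a grounding procedure for this specific three-rule program. The code below is an illustrative assumption, not the authors' implementation: it first finds the derivable T tuples, then emits one equation per derivable IDB tuple, where each disjunct is a tuple of atoms (a string for an EDB annotation, a tuple such as `("T", "a", "b")` for an IDB variable).

```python
def ground_equations(R, source="a"):
    """Ground the transitive-closure/reachability program over the EDB relation R.

    R maps each edge (x, y) to its annotation variable.
    Returns a dict: IDB tuple -> list of disjuncts (each a tuple of atoms).
    """
    # Derivable T tuples, by naive evaluation that ignores annotations.
    t = set(R)
    changed = True
    while changed:
        new = {(x, y) for (x, z) in R for (z2, y) in t if z == z2}
        changed = not new <= t
        t |= new

    eqs = {}
    for (x, y) in sorted(t):
        disjuncts = []
        if (x, y) in R:                         # T(x, y) :- R(x, y)
            disjuncts.append((R[(x, y)],))
        for (x2, z) in sorted(R):               # T(x, y) :- R(x, z), T(z, y)
            if x2 == x and (z, y) in t:
                disjuncts.append((R[(x, z)], ("T", z, y)))
        eqs[("T", x, y)] = disjuncts
    for (x, y) in sorted(t):                    # S(y) :- T(source, y)
        if x == source:
            eqs[("S", y)] = [(("T", x, y),)]
    return eqs


# EDB relation from the slide: R(a, a) annotated p, R(a, b) annotated q.
EQS = ground_equations({("a", "a"): "p", ("a", "b"): "q"})
```

On the slide's instance this yields four equations, matching N = 4.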

18
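[Figure: layered circuit. Level 0 holds the leaf IDB nodes $X_{T(a,a),0}$, $X_{T(a,b),0}$, $X_{S(a),0}$, $X_{S(b),0}$, all set to false, together with the EDB inputs $p$ and $q$. Level 1 holds $X_{T(a,a),1}$, $X_{T(a,b),1}$, $X_{S(a),1}$, $X_{S(b),1}$; Level 2 holds $X_{T(a,a),2}$, $X_{T(a,b),2}$, $X_{S(a),2}$, $X_{S(b),2}$.]
System of equations:
$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$
– Assign leaf IDB vars to false
– Multiple roots for multiple IDB vars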
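
The layered construction can be sketched as follows: layer 0 sets every IDB variable to false, and each further layer applies the right-hand sides of the equations to the nodes of the layer below. This is a minimal sketch under assumed names and an assumed node encoding; it applies N + 1 rounds above layer 0 and omits the optimizations discussed in the talk.

```python
from itertools import product

# The system of equations from the slide: IDB variable -> list of conjunctions.
EQUATIONS = {
    "T(a,a)": [("p",), ("p", "T(a,a)")],
    "T(a,b)": [("q",), ("p", "T(a,b)")],
    "S(a)": [("T(a,a)",)],
    "S(b)": [("T(a,b)",)],
}


def build_layered_circuit(equations):
    """Return (nodes, roots): nodes in topological order, one root per IDB variable."""
    nodes = [("false", ())]                 # node 0: the constant false

    def add(op, kids):
        nodes.append((op, tuple(kids)))
        return len(nodes) - 1

    edb = {}                                # EDB variable name -> input node id
    for disjuncts in equations.values():
        for conj in disjuncts:
            for atom in conj:
                if atom not in equations and atom not in edb:
                    edb[atom] = add("var", (atom,))

    level = {x: 0 for x in equations}       # layer 0: every IDB variable is false
    for _ in range(len(equations) + 1):     # N + 1 rounds
        nxt = {}
        for x, disjuncts in equations.items():
            ands = [add("and", [edb[a] if a in edb else level[a] for a in conj])
                    for conj in disjuncts]
            nxt[x] = add("or", ands)
        level = nxt
    return nodes, level


def evaluate(nodes, assignment):
    """Evaluate all nodes in one pass; assignment maps EDB variables to booleans."""
    value = []
    for op, kids in nodes:
        if op == "false":
            value.append(False)
        elif op == "var":
            value.append(assignment[kids[0]])
        elif op == "and":
            value.append(all(value[k] for k in kids))
        else:
            value.append(any(value[k] for k in kids))
    return value


NODES, ROOTS = build_layered_circuit(EQUATIONS)
```

Evaluating `NODES` under every assignment to `p` and `q` gives `q` at the root of T(a,b), in agreement with the finite equivalent form from the derivation-tree semantics.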

19
1. Store only two levels of the circuit instead of N+1 levels
– Evaluate iteratively
2. Embed circuit construction in semi-naïve evaluation
– Check for new derivations, not only new IDB variables
– Sound and complete
3. Remove self-dependency of IDB vars
– Works for PosBool(X) and also some other semirings…
System of equations:
$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$

20
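[Figure: the layered circuit before optimization. Level 0 holds $X_{T(a,a),0}$, $X_{T(a,b),0}$, $X_{S(a),0}$, $X_{S(b),0}$ (set to false) and the inputs $p$, $q$; Level 1 holds $X_{T(a,a),1}$, $X_{T(a,b),1}$, $X_{S(a),1}$, $X_{S(b),1}$; Level 2 holds $X_{T(a,a),2}$, $X_{T(a,b),2}$, $X_{S(a),2}$, $X_{S(b),2}$.]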

21
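With all these optimizations:
[Figure: two-level circuit. Bottom level: $X_{T(a,a),\mathrm{bottom}}$, $X_{T(a,b),\mathrm{bottom}}$, $X_{S(a),\mathrm{bottom}}$ with inputs $p$ and $q$. Top level: $X_{T(a,a),\mathrm{top}}$, $X_{T(a,b),\mathrm{top}}$, $X_{S(a),\mathrm{top}}$.]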

22
Linear-time deletion propagation (in circuit size)
Approximation for probabilistic databases
– Even when only the circuit (and not the database) is available
Circuits can be computed “offline”
– Only linear-time evaluation is required when needed (e.g. deletion propagation), compared to storing and solving a system of equations iteratively, or re-evaluating the Datalog program
Can use existing techniques for efficient and parallel circuit evaluation
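
Deletion propagation on a stored circuit amounts to setting the variables of the deleted input tuples to false and evaluating the circuit once. The sketch below uses a hypothetical five-node circuit for an output tuple with provenance e3 OR (e1 AND e2); the encoding and the function name are assumptions for illustration.

```python
# Hypothetical circuit in topological order: (op, argument).
# Node 4 is the root and represents e3 OR (e1 AND e2).
CIRCUIT = [
    ("var", "e1"),      # node 0
    ("var", "e2"),      # node 1
    ("var", "e3"),      # node 2
    ("and", (0, 1)),    # node 3
    ("or", (2, 3)),     # node 4
]


def survives(circuit, root, deleted):
    """True if the output tuple at root is still derivable after the deletions.

    Each node is visited once, so the cost is linear in the circuit size.
    """
    value = []
    for op, arg in circuit:
        if op == "var":
            value.append(arg not in deleted)    # deleted input tuple -> False
        elif op == "and":
            value.append(all(value[k] for k in arg))
        else:
            value.append(any(value[k] for k in arg))
    return value[root]
```

For instance, deleting only `e3` leaves the output derivable through `e1` and `e2`, while deleting `e3` and `e1` removes it.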

23
Outline
– Background
– Circuits for Boolean Provenance or PosBool(X)
– Circuits for General Provenance Semirings

24
Semiring $(K, +_K, \cdot_K, 0_K, 1_K)$
– Domain K
– $+_K$, $\cdot_K$: associative, commutative, with neutral elements $0_K$, $1_K$ respectively
– $\cdot_K$ distributes over $+_K$, i.e. $a \cdot_K (b +_K c) = a \cdot_K b +_K a \cdot_K c$
– $0_K$ cancels any element in K, i.e. $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$
Examples:
– $(\mathbb{B}, \vee, \wedge, \mathrm{False}, \mathrm{True})$: set semantics
– $(\mathbb{N}, +, \cdot, 0, 1)$: bag semantics
– $(\mathbb{N} \cup \{\infty\}, \min, +, \infty, 0)$: tropical semiring, to compute cost (e.g. cost of a shortest path)
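
The three example semirings can be written down directly and used to evaluate the same annotation expression under different semantics. This is a minimal sketch; the `Semiring` dataclass and the expression `e3 + e1 * e2` (an output derived either directly or through one join) are illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]     # alternative use
    times: Callable[[Any, Any], Any]    # joint use
    zero: Any                           # neutral for plus, annihilates times
    one: Any                            # neutral for times


BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)   # set semantics
BAG = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)                 # bag semantics
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)                    # cost


def annotate(K, e1, e2, e3):
    """Evaluate the annotation e3 + e1 * e2 in the semiring K."""
    return K.plus(e3, K.times(e1, e2))
```

With edge costs 2, 3 and 7, `annotate(TROPICAL, 2, 3, 7)` gives the cheaper of the direct cost 7 and the joined cost 2 + 3.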

25
Provenance semirings: a generalization of PosBool(X)
$(K, +_K, \cdot_K, 0_K, 1_K)$
– Tuples are annotated with variables from X
– K is of the form Prov(X)
– $+_K$ denotes alternative usage
– $\cdot_K$ denotes joint usage
Examples:
– $(\mathrm{PosBool}(X), \vee, \wedge, \mathrm{False}, \mathrm{True})$
– $(\mathrm{Lin}(X), \cup, \cup^{*}, \bot, \emptyset)$: tracks contributing tuples [Cui et al. ’00]
– $(\mathrm{Why}(X), \cup, \Cup, \emptyset, \{\emptyset\})$, where $\Cup$ is the pairwise union of subsets: tracks contributing tuples in alternative derivations [Buneman et al. ’01]

26
Specializing correctly
– Key property needed for applications like deletion propagation, trust management, cost computation, …
– Prov(X) specializes correctly to K if any valuation $v : X \to K$ extends uniquely to a homomorphism $h_v : \mathrm{Prov}(X) \to K$ (which correctly maps $+$, $\cdot$ of Prov(X) to those of K)
– Further, some provenance semirings are “more informative” than others
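
As an illustration of specialization, the sketch below extends a valuation of the variables to a polynomial with natural-number coefficients by mapping sums to the target semiring's plus and products to its times. The polynomial encoding and the function name are assumptions for this example; the polynomial 2pq + p² is hypothetical.

```python
import math


def specialize(poly, valuation, plus, times, zero, one):
    """Evaluate a polynomial in the semiring (plus, times, zero, one).

    poly maps a monomial, given as a tuple of (variable, exponent) pairs,
    to a positive integer coefficient. A coefficient n is read as the
    n-fold sum 1 + ... + 1 in the target semiring.
    """
    total = zero
    for monomial, coeff in poly.items():
        term = one
        for var, exp in monomial:
            for _ in range(exp):
                term = times(term, valuation[var])
        for _ in range(coeff):
            total = plus(total, term)
    return total


# Hypothetical provenance polynomial 2*p*q + p^2.
POLY = {(("p", 1), ("q", 1)): 2, (("p", 2),): 1}

count = specialize(POLY, {"p": 2, "q": 3},
                   lambda a, b: a + b, lambda a, b: a * b, 0, 1)
cost = specialize(POLY, {"p": 1, "q": 2},
                  min, lambda a, b: a + b, math.inf, 0)
exists = specialize(POLY, {"p": True, "q": False},
                    lambda a, b: a or b, lambda a, b: a and b, False, True)
```

The same polynomial yields a multiplicity under bag semantics, a minimum cost in the tropical semiring, and a truth value under set semantics.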

27
[Figure: provenance semirings N[X], Why(X), Lin(X), PosBool(X), and Sorp(X) (defined later), arranged from more informative to less informative, with “specializes correctly” arrows to the semirings Tropical, N (bag), Security, and Boolean (set).]

28
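PosBool(X): $\bigvee_{\text{trees } \tau}\; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$
General Prov(X): $\sum_{\text{trees } \tau}\; \prod_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$, where the sum uses $+_K$ and the product uses $\cdot_K$
– Infinite sums should be well-defined
– Need to consider “$\omega$-continuous semirings” and “$\omega$-continuous homomorphisms”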

29
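[Figure: the same semirings as before: N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, Boolean (set).]
– Finite semirings are $\omega$-continuous
– Need to add $\mathbb{N}^\infty[[X]]$ and $\mathbb{N}^\infty$
– $\mathbb{N}^\infty[[X]]$: most informative provenance semiring [Green et al. ’07]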

30
Poly-size overhead is not valid because of the infinite sum
But can outputs have finite annotations (with X, $\cdot$, $+$) that specialize correctly to semirings with finite domains?
– Finite annotations won’t specialize correctly to Why(X)
Theorem: It is not possible to annotate the output of Datalog programs following $\mathbb{N}^\infty[[X]]$-semantics with finite provenance expressions that specialize “correctly” to the semiring Why(X)
Theorem: However, we can generate poly-size circuits in poly-time directly for Why(X)
– Need more levels in the circuit built from the system of equations
– Need a different argument for correctness

31
We propose Sorp(X)
Goals:
1. Specializes correctly to interesting semirings
2. Outputs can be annotated by poly-size circuits
Sorp(X):
– Most general absorptive semiring: $a + a \cdot b = a$
– N[X], but keeping only the monomials that are not “absorbed” by the others, e.g. $pq + p^2 q^3 = pq$, whereas $p^2 q + p q^2$ stays $p^2 q + p q^2$
The same algorithm, proof, and optimizations to construct poly-size circuits hold
– These circuits are more general than Boolean circuits
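
The absorption rule can be sketched as a normalization step on sums of monomials: a monomial is dropped when another monomial in the sum divides it. The encoding below (a monomial as a dict from variable to positive exponent, coefficients omitted because a + a = a follows from absorption with b = 1) is an assumption for illustration, not the representation used in the talk.

```python
def divides(m1, m2):
    """True if monomial m1 divides m2, i.e. m2 = m1 * (some monomial)."""
    return all(m2.get(var, 0) >= exp for var, exp in m1.items())


def absorb(monomials):
    """Keep only the monomials that are not absorbed by a different monomial.

    Uses a + a*b = a: if m1 divides m2, then m1 + m2 = m1.
    Monomials are dicts with strictly positive exponents.
    """
    distinct = []
    for m in monomials:
        if m not in distinct:           # a + a = a
            distinct.append(m)
    return [m for m in distinct
            if not any(divides(other, m) for other in distinct if other != m)]


# pq + p^2 q^3 = pq
first = absorb([{"p": 1, "q": 1}, {"p": 2, "q": 3}])
# p^2 q + p q^2 is unchanged: neither monomial divides the other
second = absorb([{"p": 2, "q": 1}, {"p": 1, "q": 2}])
```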

32
[Figure: the hierarchy of semirings, now including Sorp(X): N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, Boolean (set).]

33
Related work
Data provenance
– e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08]
Circuits
– Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book)
Provenance for Datalog
– System of equations, derivation trees, infinite sum [Grahne ’91, Green et al. ’07]
– Poly-size c-tables with Boolean formulas for Datalog with contradictions [Abiteboul et al. 2014]

34
Conclusions
Circuits to represent and store Datalog provenance
– For PosBool(X) and other semirings
– Semantics, algorithms, limitations, applicability
– Preliminary experiments support our results: we compared circuits for deletion propagation with iteratively solving the system of equations and with re-evaluating the Datalog program from scratch
Future work:
– A complete implementation, evaluation, new applications

35
Thank You! Questions?
