Daniel Deutch Tel Aviv Univ. Tova Milo Tel Aviv Univ. Sudeepa Roy Univ. of Washington Val Tannen Univ. of Pennsylvania.

1 Daniel Deutch (Tel Aviv Univ.), Tova Milo (Tel Aviv Univ.), Sudeepa Roy (Univ. of Washington), Val Tannen (Univ. of Pennsylvania)

2 • “Boolean Provenance/Lineage” as a Boolean formula
– Q is true on D $\Leftrightarrow$ $F_{Q,D}$ is true
– Poly-size, poly-time computable (data complexity)
– But Q is an RA$^+$ query
• This talk: What if Q is a Datalog program?
Database D (tuple: annotation):
– AsthmaPatient: Ann: $x_1$; Bob: $x_2$
– Friend: (Ann, Joe): $y_1$; (Ann, Tom): $y_2$; (Bob, Tom): $y_3$
– Smoker: Joe: $z_1$; Tom: $z_2$
Boolean query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$

3 • Provenance
– Reliability and repeatability
– View management and deletion propagation
– Trust and security management
– Query answering in probabilistic databases, …
• Datalog
– Datalog is popular again! (two keynotes this ICDT/EDBT)
– Data extraction on the Web, declarative networking
– Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)
• Finding a suitable “Provenance for Datalog” is important
– Both from theoretical and practical viewpoints
• How do we compute, store, and interpret provenance for Datalog programs efficiently and effectively?

4 • Can we get poly-size Boolean formulas for Datalog provenance?
– No, even if we allow unbounded time
• Do we have a solution?
– Yes! Use Boolean circuits!
• What about general “provenance semirings” beyond Boolean provenance [Green et al. ’07]?
– It depends on the semiring

5 • Background
• Circuits for Boolean Provenance
• Circuits for General Provenance Semirings

6 • Background
• Circuits for Boolean Provenance
• Circuits for General Provenance Semirings

7 Datalog program for Transitive Closure and Single-source Reachability
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
• EDB (Extensional Database): base relation for edges, R
• IDB (Intensional Database): derived relations
– Transitive closure (T)
– Single-source reachability from vertex ‘a’ (S)
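As a rough illustration of how the program above is evaluated, the following Python sketch applies the rules bottom-up until no new tuples appear. The function names and the two-edge instance R(a, a), R(a, b) are assumptions chosen for illustration; they are not taken from the talk.

```python
def transitive_closure(edges):
    """Naive evaluation of T(x,y) :- R(x,y) and T(x,y) :- R(x,z), T(z,y)."""
    t = set(edges)  # first rule: every edge is in T
    while True:
        # second rule: join R(x, z) with the current T(z, y)
        new = {(x, y) for (x, z) in edges for (z2, y) in t if z == z2}
        if new <= t:  # fixpoint reached, nothing new was derived
            return t
        t |= new


def reachable_from(source, edges):
    """S(x) :- T(source, x)."""
    return {y for (x, y) in transitive_closure(edges) if x == source}


# Hypothetical EDB instance: a self-loop on a and an edge from a to b.
R = {("a", "a"), ("a", "b")}
T = transitive_closure(R)
S = reachable_from("a", R)
```

The loop stops at the least fixpoint, which is the standard meaning of a Datalog program; no provenance is tracked yet.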

8 • Tuples are annotated with variables from a set X
– Here X = {$x_1, x_2, y_1, y_2, \ldots$}
• For n variables in X, there are $2^n$ possible worlds, given by assignments $\mu : X \to \{\mathrm{True}, \mathrm{False}\}$
• Useful in query evaluation on incomplete or probabilistic databases
PosBool(X)-database D (tuple: annotation):
– AsthmaPatient: Ann: $x_1$; Bob: $x_2$
– Friend: (Ann, Joe): $y_1$; (Ann, Tom): $y_2$; (Bob, Tom): $y_3$
– Smoker: Joe: $z_1$; Tom: $z_2$

9 • Annotation propagates from input to output
– Join = $\wedge$, Projection/Union = $\vee$
• Output tuples are annotated by monotone Boolean formulas
– $F_{Q,D}$ is the annotation of the unique output tuple
PosBool(X)-database D (tuple: annotation):
– AsthmaPatient: Ann: $x_1$; Bob: $x_2$
– Friend: (Ann, Joe): $y_1$; (Ann, Tom): $y_2$; (Bob, Tom): $y_3$
– Smoker: Joe: $z_1$; Tom: $z_2$
RA$^+$ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$
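The propagation rule on this slide (join becomes AND, projection/union becomes OR) can be sketched in a few lines of Python. The dictionary encoding of the three relations and the names `provenance_dnf` and `evaluate` are assumptions made for this sketch; the formula is kept in DNF as a list of clauses.

```python
# Each relation maps a tuple to its annotation variable.
asthma = {"Ann": "x1", "Bob": "x2"}
friend = {("Ann", "Joe"): "y1", ("Ann", "Tom"): "y2", ("Bob", "Tom"): "y3"}
smoker = {"Joe": "z1", "Tom": "z2"}


def provenance_dnf(asthma, friend, smoker):
    """One clause per successful join (AND); the clause list is the OR."""
    clauses = []
    for (x, y), f_var in sorted(friend.items()):
        if x in asthma and y in smoker:
            clauses.append((asthma[x], f_var, smoker[y]))
    return clauses


def evaluate(clauses, assignment):
    """Truth value of the DNF under a True/False assignment to variables."""
    return any(all(assignment[v] for v in clause) for clause in clauses)


F = provenance_dnf(asthma, friend, smoker)
```

Here `F` holds the three clauses of the formula shown on the slide, one for each way of deriving the Boolean answer.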

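10 For all RA$^+$ queries Q, databases D, and assignments $\mu$:
1. (Faithful Representation) $Q(D^{\mu}) = [Q(D)]^{\mu}$
2. (Poly-size overhead) The size of $F_{Q,D}$ is polynomial in |D|, and $F_{Q,D}$ can be computed in poly-time.
PosBool(X)-database D (tuple: annotation):
– AsthmaPatient: Ann: $x_1$; Bob: $x_2$
– Friend: (Ann, Joe): $y_1$; (Ann, Tom): $y_2$; (Bob, Tom): $y_3$
– Smoker: Joe: $z_1$; Tom: $z_2$
RA$^+$ query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$
[Figure: the database D with one True/False assignment to its annotations, under which $F_{Q,D}$ evaluates to False]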
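Property 1 can be checked by brute force on the running example: for each of the $2^7$ assignments, evaluate Q on the surviving tuples and compare with the formula's truth value. This is only a sanity check on one small instance, not a proof; the encoding and function names are assumptions of the sketch.

```python
from itertools import product

ASTHMA = {"Ann": "x1", "Bob": "x2"}
FRIEND = {("Ann", "Joe"): "y1", ("Ann", "Tom"): "y2", ("Bob", "Tom"): "y3"}
SMOKER = {"Joe": "z1", "Tom": "z2"}
# The provenance formula from the slide, as DNF clauses.
F = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x2", "y3", "z2")]
VARIABLES = sorted(set(ASTHMA.values()) | set(FRIEND.values()) | set(SMOKER.values()))


def query_on_world(mu):
    """Evaluate Q directly on the world keeping only tuples mapped to True."""
    a = {name for name, v in ASTHMA.items() if mu[v]}
    f = {edge for edge, v in FRIEND.items() if mu[v]}
    s = {name for name, v in SMOKER.items() if mu[v]}
    return any(x in a and y in s for (x, y) in f)


def formula_under(mu):
    """Evaluate the provenance formula under the same assignment."""
    return any(all(mu[v] for v in clause) for clause in F)


def is_faithful():
    """Compare both sides of property 1 on every assignment."""
    for bits in product([False, True], repeat=len(VARIABLES)):
        mu = dict(zip(VARIABLES, bits))
        if query_on_world(mu) != formula_under(mu):
            return False
    return True
```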

11 • Semantics using derivation trees [Green et al. 2007]
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
EDB relation R (tuple: annotation): R(a, a): p; R(a, b): q (a self-loop on a and an edge from a to b)
• Annotation of T(a, b):
$\bigvee_{\text{trees } \tau} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t) = q \vee (p \wedge q) \vee (p \wedge p \wedge q) \vee \cdots = q$
– Infinitely many trees
– But always has a finite equivalent form
– But not necessarily poly-size
[Figure: derivation trees of T(a, b), using R(a, b) once and R(a, a) zero, one, two, … times]

12 Theorem: Given a PosBool(X)-database D and a Datalog program P, the provenance of tuples in P(D) cannot, in general, be faithfully represented using Boolean formulas of size polynomial in |D|.
Proof outline:
– st-connectivity on n nodes requires monotone Boolean formulas of size $n^{\Omega(\log n)}$ [Karchmer-Wigderson, 1988]
– Faithful representation requires: for all True/False assignments $\mu$ to X, $P(D^{\mu}) = [P(D)]^{\mu}$
– Reduce to the hard instance with the right $\mu$ when P = transitive closure
Solution: Boolean circuits!

13 • Background
• Circuits for Boolean Provenance or PosBool(X)
• Circuits for General Provenance Semirings

14 • A circuit is a DAG
– Shares common subexpressions
– A Boolean formula = a tree
• Leaf nodes:
– EDB vars in X
• Internal nodes:
– $\wedge$: IDB/EDB vars used in one derivation
– $\vee$: alternative derivations
• Roots:
– IDB vars
T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
EDB relation R (tuple: annotation): R(a, a): p; R(a, b): q
[Figure: circuit with root $X_{T(a,b)}$ built from $\wedge$/$\vee$ gates over the leaves $X_{R(a,a)} = p$ and $X_{R(a,b)} = q$]

15 Theorem: Given any PosBool(X)-database D and Datalog program P, the provenance of tuples in P(D) can be faithfully represented using monotone Boolean circuits of size polynomial in |D| (and the circuits can be computed in poly-time).

16 Two key ideas from previous work
1. Datalog provenance can be represented by a system of equations, obtained by instantiating the vars in the Datalog program P to EDB/IDB tuples [Green et al. 2007]
– EDB tuples $\to$ constants, IDB tuples $\to$ variables
– Iteratively solve this system of equations
– Fixpoint = provenance for all IDB tuples
2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011]
– N = #IDB tuples
– Build a circuit with N+1 layers from the system of equations
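To make "iteratively solve this system of equations" concrete, here is a small Python sketch that iterates a system over PosBool(X) for N+1 rounds, starting from False. The DNF-with-absorption representation, the string names, and the four hard-coded equations (for the instance R(a, a) = p, R(a, b) = q) are assumptions of this sketch rather than the authors' implementation.

```python
FALSE = frozenset()                # empty disjunction
TRUE = frozenset([frozenset()])    # one empty conjunction


def var(name):
    """The formula consisting of a single variable."""
    return frozenset([frozenset([name])])


def minimize(clauses):
    """Absorption: drop every clause that strictly contains another one."""
    clauses = set(clauses)
    return frozenset(c for c in clauses if not any(o < c for o in clauses))


def p_or(f, g):
    return minimize(f | g)


def p_and(f, g):
    return minimize(frozenset(a | b for a in f for b in g))


# IDB variable -> list of derivations; each derivation lists body symbols.
EQUATIONS = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)": [["T(a,a)"]],
    "S(b)": [["T(a,b)"]],
}
EDB_VARS = {"p", "q"}


def solve(equations, edb_vars):
    """Apply all equations simultaneously for N + 1 rounds, N = #IDB vars."""
    current = {x: FALSE for x in equations}
    for _ in range(len(equations) + 1):
        nxt = {}
        for x, derivations in equations.items():
            total = FALSE
            for body in derivations:
                term = TRUE
                for s in body:
                    term = p_and(term, var(s) if s in edb_vars else current[s])
                total = p_or(total, term)
            nxt[x] = total
        current = nxt
    return current


SOLUTION = solve(EQUATIONS, EDB_VARS)
```

On this instance the provenance of T(a, b) collapses to the single clause q, because p ∧ q is absorbed by q.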

17 T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)
EDB relation R (tuple: annotation): R(a, a): p; R(a, b): q
Step 1: Build the system of equations by all possible instantiations x, y, z $\to$ a, b (IDB tuples are variables, EDB annotations are constants):
$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$
Step 2: Build a circuit with layers (N = 4) …
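Step 1 can be sketched in Python as a grounding loop: instantiate each rule over the domain and keep only derivations whose body atoms are EDB tuples or already-derivable IDB tuples. The tuple encoding of atoms, the function name `ground_program`, and the hard-coding of the three rules (including the source vertex a) are assumptions made for illustration.

```python
from itertools import product

# EDB tuples with their annotations (the constants of the equations).
EDB = {("R", "a", "a"): "p", ("R", "a", "b"): "q"}
DOMAIN = ["a", "b"]


def ground_program(edb, domain):
    """Return {IDB tuple: [derivation, ...]}; a derivation is a tuple of body atoms."""
    derivable = set()
    while True:
        eqs = {}
        for x, y in product(domain, repeat=2):
            if ("R", x, y) in edb:                      # T(x,y) :- R(x,y)
                eqs.setdefault(("T", x, y), []).append((("R", x, y),))
            for z in domain:                            # T(x,y) :- R(x,z), T(z,y)
                if ("R", x, z) in edb and ("T", z, y) in derivable:
                    eqs.setdefault(("T", x, y), []).append(
                        (("R", x, z), ("T", z, y)))
        for x in domain:                                # S(x) :- T(a,x)
            if ("T", "a", x) in derivable:
                eqs.setdefault(("S", x), []).append((("T", "a", x),))
        if set(eqs) == derivable:   # no new IDB tuple: grounding is complete
            return eqs
        derivable = set(eqs)


EQUATIONS = ground_program(EDB, DOMAIN)
```

Each key of `EQUATIONS` is one variable of the system and each derivation is one disjunct of its right-hand side, so `len(EQUATIONS)` is the N used in Step 2.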

18 [Figure: layered circuit over the leaves p, q with nodes $X_{T(a,a),i}$, $X_{T(a,b),i}$, $X_{S(a),i}$, $X_{S(b),i}$ at levels i = 0, 1, 2; the level-0 nodes are set to false]
$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$
• Assign leaf IDB vars to false
• Multiple roots for multiple IDB vars
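The layered construction can be sketched as follows: level 0 holds a False constant per IDB variable, and each further level copies the equations, wiring IDB occurrences to the previous level. The gate encoding, the names, and the choice of levels 1 to N+1 above level 0 are assumptions of this sketch; the slides only say that N+1 layers suffice.

```python
EQUATIONS = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)": [["T(a,a)"]],
    "S(b)": [["T(a,b)"]],
}
EDB_VARS = {"p", "q"}


def build_layered_circuit(equations, edb_vars):
    """Return (gates, roots); gates are inserted in topological order."""
    n = len(equations)
    gates = {v: ("var", v) for v in sorted(edb_vars)}   # leaves
    for idb in equations:
        gates[(idb, 0)] = ("const", False)              # level 0: false
    for level in range(1, n + 2):                       # levels 1 .. N+1
        for idb, derivations in equations.items():
            and_ids = []
            for k, body in enumerate(derivations):
                # EDB symbols point to leaves, IDB symbols to the level below.
                children = [s if s in edb_vars else (s, level - 1) for s in body]
                gates[(idb, level, k)] = ("and", children)
                and_ids.append((idb, level, k))
            gates[(idb, level)] = ("or", and_ids)
    roots = {idb: (idb, n + 1) for idb in equations}
    return gates, roots


def evaluate(gates, assignment):
    """One pass over the gates in insertion (topological) order."""
    value = {}
    for node, (kind, arg) in gates.items():
        if kind == "var":
            value[node] = assignment[arg]
        elif kind == "const":
            value[node] = arg
        elif kind == "and":
            value[node] = all(value[c] for c in arg)
        else:
            value[node] = any(value[c] for c in arg)
    return value


GATES, ROOTS = build_layered_circuit(EQUATIONS, EDB_VARS)


def provenance(assignment):
    """Truth value of every IDB tuple under an assignment to p and q."""
    value = evaluate(GATES, assignment)
    return {idb: value[root] for idb, root in ROOTS.items()}
```

The circuit has a fixed number of gates per level and N+1 levels, so its size stays polynomial in the number of grounded equations, whereas unfolding it into a formula would duplicate shared sub-circuits.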

19 1. Store only two levels of the circuit instead of N+1 levels
– Evaluate iteratively
2. Embed circuit construction in semi-naïve evaluation
– Check for new derivations, not only new IDB variables
– Sound and complete
3. Remove self-dependency of IDB vars
– Works for PosBool(X) and also some other semirings…
$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(b)} = X_{T(a,b)}$
$X_{S(a)} = X_{T(a,a)}$

20 [Figure: layered circuit over the leaves p, q with nodes $X_{T(a,a),i}$, $X_{T(a,b),i}$, $X_{S(a),i}$, $X_{S(b),i}$ at levels i = 0, 1, 2; the level-0 nodes are set to false]

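21 With all these optimizations:
[Figure: two-level circuit over the leaves p, q, with bottom-level nodes $X_{T(a,a),\mathrm{bottom}}$, $X_{T(a,b),\mathrm{bottom}}$, $X_{S(a),\mathrm{bottom}}$ and top-level nodes $X_{T(a,a),\mathrm{top}}$, $X_{T(a,b),\mathrm{top}}$, $X_{S(a),\mathrm{top}}$]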

22 • Linear-time deletion propagation (in circuit size)
• Approximation for probabilistic databases
– Even when only the circuit (and not the database) is available
• Circuits can be computed “offline”
– Only linear-time evaluation is required when needed (e.g. deletion propagation)
– Compared to storing and solving a system of equations iteratively, or re-evaluating the Datalog program
• Can use existing techniques for efficient and parallel circuit evaluation
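Deletion propagation on a stored circuit amounts to setting the deleted tuples' variables to False and evaluating every gate once. The sketch below assumes a hand-written circuit for the instance R(a, a) = p, R(a, b) = q, listed in topological order; the gate names and the function `propagate_deletion` are hypothetical.

```python
# Gates in topological order: (name, kind, inputs).
CIRCUIT = [
    ("p", "var", []),
    ("q", "var", []),
    ("p_and_q", "and", ["p", "q"]),
    ("T(a,b)", "or", ["q", "p_and_q"]),
    ("T(a,a)", "or", ["p"]),
]


def propagate_deletion(circuit, deleted):
    """Visit each gate and each wire once: time linear in circuit size."""
    value = {}
    for name, kind, inputs in circuit:
        if kind == "var":
            value[name] = name not in deleted   # deleted EDB tuple -> False
        elif kind == "and":
            value[name] = all(value[i] for i in inputs)
        else:
            value[name] = any(value[i] for i in inputs)
    return value


after_deleting_p = propagate_deletion(CIRCUIT, {"p"})
```

An output tuple survives the deletion exactly when its gate still evaluates to True; neither the database nor the Datalog program is consulted.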

23 • Background
• Circuits for Boolean Provenance or PosBool(X)
• Circuits for General Provenance Semirings

24 • $(K, +_K, \cdot_K, 0_K, 1_K)$
– Domain K
– $+_K$, $\cdot_K$: associative, commutative, with neutral elements $0_K$, $1_K$ respectively
– $\cdot_K$ distributes over $+_K$, i.e. $a \cdot_K (b +_K c) = a \cdot_K b +_K a \cdot_K c$
– $0_K$ cancels any element in K, i.e. $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$
Examples:
– $(\mathbb{B}, \vee, \wedge, \mathrm{False}, \mathrm{True})$: set semantics
– $(\mathbb{N}, +, \cdot, 0, 1)$: bag semantics
– $(\mathbb{N} \cup \{\infty\}, \min, +, \infty, 0)$: tropical semiring to compute cost (e.g. cost of a shortest path)
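The three example semirings can be written down directly in Python, which makes it easy to see that one "sum of products over derivations" routine serves all of them. The `Semiring` class, the constant names, and `sum_of_products` are assumptions of this sketch; `math.inf` stands in for ∞.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any


BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
COUNTING = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)


def sum_of_products(k, derivations):
    """Combine each derivation with 'times', then all derivations with 'plus'."""
    total = k.zero
    for body in derivations:
        prod = k.one
        for a in body:
            prod = k.times(prod, a)
        total = k.plus(total, prod)
    return total
```

For instance, two derivations with tuple costs [3, 4] and [1, 10] give min(3 + 4, 1 + 10) = 7 in the tropical semiring, while the same call with multiplicities [2, 3] and [1, 1] gives 2·3 + 1·1 = 7 under bag semantics.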

25 • Generalization of PosBool(X)
• $(K, +_K, \cdot_K, 0_K, 1_K)$
– Tuples are annotated with variables from X
– K is of the form Prov(X)
– $+_K$ denotes alternative usage
– $\cdot_K$ denotes joint usage
• Examples:
– $(\mathrm{PosBool}(X), \vee, \wedge, \mathrm{False}, \mathrm{True})$
– $(\mathrm{Lin}(X), \cup, \cup, \bot, \emptyset)$: tracks contributing tuples [Cui et al. ’00]
– $(\mathrm{Why}(X), \cup, \Cup, \emptyset, \{\emptyset\})$, where $\Cup$ is the pairwise union of subsets: tracks contributing tuples in alternative derivations [Buneman et al. ’01]

26 • Key property needed for applications like deletion propagation, trust management, cost computation, …
• Prov(X) specializes correctly to K if any valuation $v : X \to K$ extends uniquely to a homomorphism $h_v : \mathrm{Prov}(X) \to K$ (which correctly maps $+$, $\cdot$ of Prov(X) to those of K)
• Further, some provenance semirings are “more informative” than others

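27 [Figure: hierarchy of semirings, from more informative to less informative, with arrows for “specializes correctly”. Provenance semirings: $\mathbb{N}[X]$, Why(X), Lin(X), PosBool(X), Sorp(X) (defined later). Application semirings: Tropical, $\mathbb{N}$ (bag), Security, Boolean (set)]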

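28 • PosBool(X): $\bigvee_{\text{trees } \tau} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$
• General Prov(X): $\sum_{\text{trees } \tau} \; \prod_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$, with sum and product taken in $+_K$ and $\cdot_K$
• Infinite sums should be well-defined
• Need to consider “$\omega$-continuous semirings” and “$\omega$-continuous homomorphisms”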

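29 [Figure: hierarchy of semirings: $\mathbb{N}[X]$, Why(X), Lin(X), PosBool(X), Sorp(X); Tropical, $\mathbb{N}$ (bag), Security, Boolean (set)]
• Finite semirings are $\omega$-continuous
• For $\mathbb{N}$ and $\mathbb{N}[X]$, need to add $\infty$: $\mathbb{N}^{\infty}$ and $\mathbb{N}^{\infty}[[X]]$
• $\mathbb{N}^{\infty}[[X]]$: most informative provenance semiring [Green et al. ’07]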

30 • Poly-size overhead is not valid because of the infinite sum
• But can outputs have finite annotations (with X, $\cdot$, $+$) that specialize correctly to semirings with finite domains?
Theorem: It is not possible to annotate, with finite provenance expressions, the output of Datalog programs following $\mathbb{N}^{\infty}[[X]]$-semantics so that the annotations specialize “correctly” to the semiring Why(X).
– Finite annotations won’t specialize correctly to Why(X)
Theorem: However, we can generate poly-size circuits in poly-time directly for Why(X).
– Need more levels in the circuit from the system of equations
– Need a different argument for correctness

31 • We propose Sorp(X)
– Most general absorptive semiring: $a + a \cdot b = a$
– $\mathbb{N}[X]$, but keeping only the monomials that are not “absorbed” by the others
– e.g. $pq + p^2q^3 \equiv pq$, whereas $p^2q + pq^2 \equiv p^2q + pq^2$
• The same algorithm, proof, and optimizations to construct poly-size circuits hold
– Circuits are more general than Boolean circuits
1. Specializes correctly to interesting semirings
2. Outputs can be annotated by poly-size circuits
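The absorption rule $a + a \cdot b = a$ can be illustrated on the two example polynomials with a short Python sketch. It assumes a monomial is a dict from variable to positive exponent (no zero exponents stored) and that coefficients are dropped, since absorption with b = 1 gives a + a = a; the function names are hypothetical.

```python
def absorbs(m1, m2):
    """m1 absorbs m2 when m2 = m1 * (some monomial), i.e. m1 divides m2."""
    return all(m2.get(v, 0) >= e for v, e in m1.items())


def absorb(monomials):
    """Keep only the monomials not absorbed by a different monomial."""
    unique = []
    for m in monomials:
        if m not in unique:          # a + a = a
            unique.append(m)
    return [m for m in unique
            if not any(o != m and absorbs(o, m) for o in unique)]


# pq + p^2 q^3 reduces to pq; p^2 q + p q^2 stays as it is.
first = absorb([{"p": 1, "q": 1}, {"p": 2, "q": 3}])
second = absorb([{"p": 2, "q": 1}, {"p": 1, "q": 2}])
```

Neither of p²q and pq² divides the other, so both survive, matching the second example on the slide.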

32 [Figure: hierarchy of semirings: $\mathbb{N}[X]$, Why(X), Lin(X), PosBool(X), Sorp(X); Tropical, $\mathbb{N}$ (bag), Security, Boolean (set)]

33 • Data Provenance
– e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08]
• Circuits
– Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book)
• Provenance for Datalog
– System of equations, derivation trees, infinite sum [Grahne ’91, Green et al. ’07]
– Poly-size c-tables with Boolean formulas for Datalog with contradictions [Abiteboul et al. 2014]

34 • Circuits to represent and store Datalog provenance
– For PosBool(X) and other semirings
– Semantics, algorithms, limitations, applicability
– Preliminary experiments support our results: we compared circuits for deletion propagation with iteratively solving the system of equations and with re-evaluating the Datalog program from scratch
• Future work:
– A complete implementation, evaluation, new applications

35 Thank You. Questions?

