Webdamlog and Contradictions
Daniel Deutch, Tel Aviv University
Joint work with Serge Abiteboul, Meghyn Bienvenu, Victor Vianu

Motivation
In a distributed setting, contradictions and uncertainty naturally arise, due to:
–Different/contradictory opinions
–Different viewpoints
–Partial information
–…

Example: Where is Alice
Consider an IsIn(Person,City,Peer) relation
–"Peer believes Person is in City"
There is a natural functional dependency {Peer, Person} → City
Now consider a datalog rule
IsIn(Person,City,p) :- IsIn(Person,City,p'), Friend(p,p')
How do we combine the contradictory opinions of two friends on the location of Alice? How do we do so if the opinions are uncertain?
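
A minimal sketch of the FD check behind this conflict, assuming IsIn facts encoded as (person, city, peer) triples (an illustrative encoding, not from the talk):

def fd_violations(isin_facts):
    # IsIn facts are (person, city, peer) triples; the FD {Peer, Person} -> City
    # says a peer may believe in at most one city per person.
    seen = {}        # (peer, person) -> first city recorded for that pair
    conflicts = []
    for person, city, peer in isin_facts:
        key = (peer, person)
        if key in seen and seen[key] != city:
            conflicts.append((key, seen[key], city))
        else:
            seen.setdefault(key, city)
    return conflicts

# Ben's two friends place Alice in different cities, so propagating both
# opinions to Ben violates the FD:
facts = [("Alice", "Paris", "Ben"), ("Alice", "London", "Ben")]
print(fd_violations(facts))   # [(('Ben', 'Alice'), 'Paris', 'London')]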

Roadmap
Centralized non-deterministic semantics
–For datalog in the presence of FDs
–We study properties of the semantics, as well as computational and representation issues
Quantifying non-determinism with probabilities
–Studying the computation of probabilities and the explanation of answers
Distributed settings

Centralized case
For the centralized case we use the datalog syntax:
Standard (safe) datalog rules R(X_1,…,X_n) :- R_1(X_11,…,X_1m), …, R_k(X_k1,…,X_ks)
Functional dependencies of the form R: 1,2 → 3
We will change the datalog semantics to account for FDs: datalog^FD

First Semantics (nfat)
Non-deterministic inflationary fact-at-a-time semantics
Re-define the immediate consequence operator such that
–A fact is derived only if it does not contradict facts already in the database
A possible world is a maximal consequence
A simple "stubborn" semantics
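
A minimal executable sketch of this semantics, assuming illustrative helpers one_step (all facts derivable by a single rule application) and consistent (the FD check); the fact encoding and names are not from the talk, and the instance is the one on the next slide:

import random

def nfat_fixpoint(db, one_step, consistent, rng=random.Random(0)):
    # Non-deterministic fact-at-a-time evaluation: repeatedly pick one newly
    # derivable fact that does not contradict the current database.
    db = set(db)
    while True:
        candidates = [f for f in one_step(db) - db if consistent(db, f)]
        if not candidates:
            return db                    # a maximal consequence (possible world)
        db.add(rng.choice(candidates))   # the non-deterministic choice

# Illustrative instantiation for the running example. Facts are tuples whose
# first component is the relation name.
def one_step(db):
    derived = set()
    for fact in db:
        if fact[0] != "IsIn":
            continue
        _, person, city, peer = fact
        for g in db:                     # IsIn(X,Y,P) :- Friend(P,P'), IsIn(X,Y,P')
            if g[0] == "Friend" and g[2] == peer:
                derived.add(("IsIn", person, city, g[1]))
        if person == "Alice":            # IsIn(Carol,Y,P) :- IsIn(Alice,Y,P)
            derived.add(("IsIn", "Carol", city, peer))
    return derived

def consistent(db, f):                   # FD {Peer, Person} -> City on IsIn
    return f[0] != "IsIn" or all(
        not (g[0] == "IsIn" and g[1] == f[1] and g[3] == f[3] and g[2] != f[2])
        for g in db)

I0 = {("IsIn", "Alice", "Paris", "Peter"), ("IsIn", "Carol", "London", "Tom"),
      ("Friend", "Ben", "Tom"), ("Friend", "Ben", "Peter")}
print(sorted(nfat_fixpoint(I0, one_step, consistent)))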

Example
Program:
IsIn(X,Y,P) :- Friend(P,P'), IsIn(X,Y,P')
IsIn(Carol,Y,P) :- IsIn(Alice,Y,P)
Database:
IsIn(Alice,Paris,Peter), IsIn(Carol,London,Tom), Friend(Ben,Tom), Friend(Ben,Peter)
IsIn(Alice,Paris,Peter) => IsIn(Alice,Paris,Ben) => IsIn(Carol,Paris,Ben)
IsIn(Carol,London,Tom) => IsIn(Carol,London,Ben)
In either case, IsIn(Alice,Paris,Ben) will be derived

Set-at-a-time Semantics (nsat)
Idea: the immediate consequence operator now selects a maximal consistent subset of the new facts that can be derived in one step of derivation
–Still inflationary: "old" facts always stay
The semantics gives "priority" to more "direct" derivations
–Intuitive, especially in a distributed setting
–Operational
Two types of non-determinism in nfat:
–Data non-determinism (choice between contradicting facts)
–Control non-determinism (choice of the order of rule activation)
In nsat only the first type remains
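
A corresponding sketch of the set-at-a-time variant, reusing the illustrative one_step and consistent helpers from the previous sketch:

import random

def nsat_fixpoint(db, one_step, consistent, rng=random.Random(0)):
    # Non-deterministic set-at-a-time evaluation: old facts always stay
    # (inflationary); only the choice among contradicting new facts remains.
    db = set(db)
    while True:
        new = list(one_step(db) - db)
        if not new:
            return db
        rng.shuffle(new)                 # data non-determinism
        added = False
        for f in new:                    # greedily keep a maximal consistent subset
            if consistent(db, f):
                db.add(f)
                added = True
        if not added:
            return db                    # nothing new can be consistently added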

Example
Program:
IsIn(X,Y,P) :- Friend(P,P'), IsIn(X,Y,P')
IsIn(Carol,Y,P) :- IsIn(Alice,Y,P)
Database:
IsIn(Alice,Paris,Peter), IsIn(Carol,London,Tom), Friend(Ben,Tom), Friend(Ben,Peter)
IsIn(Alice,Paris,Peter) => IsIn(Alice,Paris,Ben) => IsIn(Carol,Paris,Ben)
IsIn(Carol,London,Tom) => IsIn(Carol,London,Ben)
IsIn(Alice,Paris,Ben) will be derived

Expressive Power
Thm: nsat is strictly stronger than nfat
–We can simulate a program under the nfat semantics using a program under the nsat semantics
–But not the converse
Thm: datalog^FD with nsat captures NDB-PTIME
–Queries computable by a nondeterministic TM for which every computation is in PTIME

(Tuple) Possibility and Certainty
Possibility:
–Non-recursive: NP-complete for nsat, PTIME for nfat
–Recursive: NP-complete (for both)
Certainty:
–Non-recursive: CoNP-complete (for both)
–Recursive: CoNP-complete (for both)

Representation System
A      B       Conditions
Alice  London  X
Carol  London  X
Alice  Paris   NOT(X)
Carol  Paris   NOT(X)
Concrete c-table: variables appear only in the boolean condition formulas
General c-tables: variables may also appear in entries
Non-deterministic semantics = set of possible worlds
Can we capture it with compact (i.e. PSIZE) c-tables?
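
A small sketch of reading a concrete c-table as its set of possible worlds, with conditions modelled as predicates over a valuation of the boolean variables (an illustrative encoding):

from itertools import product

def possible_worlds(ctable, variables):
    # ctable: list of (tuple, condition) pairs; a condition maps a valuation
    # of the boolean variables to True/False. Each valuation yields one world.
    worlds = set()
    for values in product([False, True], repeat=len(variables)):
        valuation = dict(zip(variables, values))
        worlds.add(frozenset(t for t, cond in ctable if cond(valuation)))
    return worlds

ctable = [(("Alice", "London"), lambda v: v["X"]),
          (("Carol", "London"), lambda v: v["X"]),
          (("Alice", "Paris"),  lambda v: not v["X"]),
          (("Carol", "Paris"),  lambda v: not v["X"])]
for world in possible_worlds(ctable, ["X"]):
    print(sorted(world))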

Efficient Representation!
Theorem: Given a (possibly recursive) datalog^FD program P and an input instance I, one can compute in PTIME (w.r.t. |I|) a concrete c-table C such that:
–C encodes the fixpoint possible worlds + the empty relation
This holds for both the nsat and nfat semantics
–Different constructions, using formulas (with negation) to compactly encode the possible derivations
This also holds if, instead of a certain instance I, we start with an (arbitrary) c-table

Probabilistic Semantics
A probabilistic counterpart is introduced for both the nfat and nsat semantics over a probabilistic database:
–Choose a possible world for the base facts
–Repeatedly and uniformly choose one possible set of rule instantiations, applying the nfat/nsat immediate consequence operator
–This defines a distribution over fixpoint possible worlds
This allows us to capture voting
Extensions allow associating probabilities with rules
Can we compute the probability that a tuple appears in a fixpoint world?
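
A minimal sketch of estimating such a probability by sampling, assuming illustrative helpers sample_base (draws one world of the base facts) and run_semantics (one randomized nfat/nsat run to fixpoint, e.g. the sketches above):

def estimate_tuple_probability(goal, sample_base, run_semantics, n_samples=10_000):
    # Estimates Pr[goal appears in the fixpoint world] by repeated sampling:
    # draw a base-fact world, run the randomized semantics, check for the goal.
    hits = 0
    for _ in range(n_samples):
        hits += goal in run_semantics(sample_base())
    return hits / n_samples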

Probability Computation
Thm: Even if input instances are tuple-independent and the query is non-recursive and safe:
–Computing the exact tuple probability is #P-hard
–Even with one FD per relation
I.e., FDs introduce a new source of hardness
Thm: A PTIME absolute approximation exists for the general case
–Relative approximation is hard

Probabilistic Representation System
In pc-tables, probabilities are associated with the boolean variables
Theorem: For the non-recursive case, with one FD per relation:
–We can capture the possible worlds with their probabilities via a pc-table
–Even if starting from a pc-table instance
The general case is open

Top-k Supports
Top-k (minimal) subsets of facts that are most likely to occur in conjunction with Q (given Q)
The problem is PTIME with no recursion, no FDs, and tuple-independent DBs
–Compute a DNF for the result and rank clauses by their individual probabilities
Either FDs (even one FD) or recursion (even linear recursion) leads to #P-hardness
–Even approximation is hard
But, surprisingly, there is a PTIME exact solution for the Transitive Closure program
Future work: identify classes of easy inputs and practically efficient heuristics
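
A minimal sketch of the tractable case, assuming the DNF clauses and the facts' marginal probabilities are given (illustrative names and values):

from math import prod

def top_k_supports(clauses, prob, k):
    # clauses: iterable of frozensets of base facts (one clause per derivation);
    # prob: fact -> marginal probability, tuples assumed independent.
    scored = [(prod(prob[f] for f in clause), clause) for clause in clauses]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

prob = {"A": 0.9, "B": 0.5, "C": 0.4}
clauses = [frozenset({"A", "B"}), frozenset({"A", "C"}), frozenset({"B", "C"})]
print(top_k_supports(clauses, prob, 2))   # the two most likely supports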

Influence
A tuple is necessary if, without it, Q cannot be derived
–PTIME, even for the recursive case
A tuple is relevant if it is necessary in conjunction with some subset of the other facts
–PTIME for the non-recursive case
–NP-complete for the recursive case
We can further quantify influence as the change in the answer probability when removing the fact
Top-k facts based on their influence:
–Exact top-k is NP-hard even with no recursion and no FDs
–Approximate top-k is possible in PTIME for the general case
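
A rough sketch of ranking facts by influence, assuming an illustrative estimator estimate(base) for the probability of Q (e.g. the sampling sketch above); since exact top-k is hard, this is only the sampling-based approximation:

def top_k_influence(base_facts, estimate, k):
    # Scores each base fact by the drop in the goal's probability when the
    # fact is removed, and returns the k most influential facts.
    p_full = estimate(set(base_facts))
    scored = [(p_full - estimate(set(base_facts) - {fact}), fact)
              for fact in base_facts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]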

The distributed setting
We next extend the model to a distributed setting
A quick overview of some problems that are of interest in this setting
–There are many others

Webdamlog basics
Alphabet:
–Peer and relation names
Schema:
–A set of peer IDs
–Disjoint sets of extensional and intensional relations of the form m@p (with m a relation constant, p a peer ID)
–A typing function defines the arity and sorts of the components of each such relation
Facts are of the form m@p(a1,…,an)

Webdamlog basics (cont.)
Rules: M_{n+1}@Q_{n+1}(U_{n+1}) :- M_1@Q_1(U_1), …, M_n@Q_n(U_n)
where the M_i are relation terms, the Q_i are peer terms, and the U_i are tuples of terms
We focus on local and deductive rules, i.e.:
–(body) at peer p, Q_i = p for 1 ≤ i ≤ n
–(head) M_{n+1}@Q_{n+1}(U_{n+1}) is extensional

Webdamlog basics (example)
[the example rules were not captured in this transcript]

Webdamlog basics (semantics)
A local semantics, to be used at each peer, is defined; it then induces a global semantics based on moves and runs
In our restricted case, with the standard datalog semantics for each peer:
–A move of a peer p is: computing the fixpoint for its program, then alerting other peers of the derived facts that concern them
–A run is a sequence of moves which satisfies fairness, i.e. each peer p is invoked infinitely many times
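
A rough sketch of the induced global semantics, assuming an illustrative peer interface (local_fixpoint returning facts addressed to peers, and receive) and round-robin scheduling as one simple way to obtain fairness:

from itertools import cycle, islice

def run(peers, n_rounds):
    # peers: dict mapping peer id -> peer object. One round lets every peer
    # move once; a (finite prefix of a) fair run repeats this n_rounds times.
    schedule = islice(cycle(sorted(peers)), n_rounds * len(peers))
    for name in schedule:
        derived = peers[name].local_fixpoint()       # the peer's move
        for target, fact in derived:                 # alert the peers concerned
            if target != name:
                peers[target].receive({fact})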

With Contradictions
With respect to Webdamlog, we change the local semantics of peers to be nsat/nfat
–When activated, a peer runs the semantics until saturation
–The obtained system is (I,R,F) for (initial instance, rules, FDs)
A subtlety:
–Non-deterministic choices are made upon derivation at a peer p, but the derived facts are then added at a peer q
–So we need to explicitly make sure that subsequent choices at p are consistent
–We use a "memory" that p keeps throughout the run

Translation to the centralized case
Given a distributed system (I,P,F), the centralized system is (Ic,Pc,Fc):
–Ic is the union of all peer instances
–Pc consists of all possible instantiations of the peer variables in the rules of P with concrete peers (only instantiations respecting the typing and arity constraints)
–Fc is the union of all functional dependencies
If F is empty (no FDs), the systems are "equivalent"
But with FDs:
Theorem: There exists a Webdamlog system such that the sets of nsat possible worlds of the original and the centralized system are not contained in each other
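
A minimal sketch of the construction of Pc, assuming an illustrative rule encoding (atoms as (relation, peer, args), peer variables written '$X'); typing and arity checks are omitted:

from itertools import product

def centralize_rules(rules, peer_ids):
    # A rule is (head, body) where each atom is (relation, peer, args) and a
    # peer is either a concrete ID or a variable written as '$X'.
    grounded = []
    for head, body in rules:
        atoms = [head] + list(body)
        peer_vars = sorted({a[1] for a in atoms if a[1].startswith("$")})
        for choice in product(peer_ids, repeat=len(peer_vars)):
            sub = dict(zip(peer_vars, choice))
            inst = [(rel, sub.get(p, p), args) for rel, p, args in atoms]
            grounded.append((inst[0], inst[1:]))
    return grounded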

Probabilities and Voting
Probabilistic semantics: uniform choice of the peer to move, probabilistic local semantics for moves
Proposition: For acyclic networks, the probability of a peer inferring a fact is exactly its relative support at the peers it follows
We can also use probabilistic base facts to weigh opinions

Distributed Sampling
The idea is that each peer chooses a possible world for the base facts
And then simply executes the semantics
–Making probabilistic choices along the way
Some new subtleties in the procedure:
–Cooperation is needed for initiating the samples
–As well as in the convergence proof, due to the order of peer invocation

Related Work
–Datalog with negation, nondeterminism
–Witness
–Repair and probabilistic repair
–Integrity constraints via rules in data exchange
–Distributed datalog
–Probabilistic and incomplete databases

Conclusion
We have studied data management in the presence of contradictions
Defined semantics for the centralized and distributed cases
Provided a probabilistic modeling of the uncertainty that arises
Studied computational problems in these contexts
Future work: open questions, optimizations and implementation issues, additional semantics

Thank you!

More questions
How to distribute?
–In the independent case, this seems straightforward
–With dependencies?
How do the results change in the presence of contradictions?
–The proof theory introduced earlier
How to handle complex patterns?
–With negation, projection, join, …

Proof theory
A proof tree is a tree labeled with facts, such that leaves are labeled with facts from the original database
In the presence of FDs, we require that the facts at the tree nodes do not violate any FD
Theorem: The possible facts are exactly those that have proof trees
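
A small sketch of the FD check over a proof tree, assuming an illustrative encoding of trees as (fact, children) and FDs as (relation, lhs positions, rhs position):

def tree_facts(tree):
    # A proof tree is (fact, children); facts are tuples whose first component
    # is the relation name.
    fact, children = tree
    yield fact
    for child in children:
        yield from tree_facts(child)

def fd_consistent(tree, fds):
    # The facts labelling the tree must jointly satisfy every FD.
    for relation, lhs, rhs in fds:
        seen = {}
        for fact in tree_facts(tree):
            if fact[0] != relation:
                continue
            key = tuple(fact[i] for i in lhs)
            if seen.setdefault(key, fact[rhs]) != fact[rhs]:
                return False
    return True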

Example
C :- R(a,0), R(a,1)
R(a,0) :- A
R(a,1) :- B
DB = {A,B}, FD on R: 1→2
There are proof trees for both R(a,0) and R(a,1), but not for C

Refuting proof trees
No two nodes contradict each other. Every node is either:
1. A leaf node labeled with a fact from I.
2. Labeled B, with children labeled C1,…,Cn, where B :- C1,…,Cn is a rule in P.
3. Labeled ¬B, with a child that is labeled with a fact contradicting B.
4. Labeled ¬B, with children labeled ¬C1,…,¬Cn, where B is not in I and every rule in P that concludes B has some Ci in its body.
5. A leaf node labeled ¬B that has an ancestor also labeled ¬B.
Theorem: The certain facts are exactly the possible facts that do not have a refuting proof tree

Hardness Results
The existence of a proof theory does not imply tractability of possibility and certainty (finding the proof tree may be hard…)
Indeed:
Theorem:
–Possibility is NP-hard
–Certainty is coNP-hard

Connection to Datalog with negation
The nsat semantics is non-deterministic, while datalog is deterministic
We can "encode" the non-determinism in a new prefer relation, deciding for every non-deterministic choice the preference between facts
Then we can simulate nsat via inflationary datalog with negation and with the prefer relation
Details omitted

Connection to Witness
Thm: Non-recursive datalog^FD with a single FD per relation and the nsat semantics can express precisely the set of queries definable in FO⁺ + W

Query Evaluation
C-table representation?
–The idea is that the boolean events associated with, e.g., base facts are propagated to form a boolean expression
–The probability of this boolean expression can then be approximated
–Works well for query evaluation on "standard" probabilistic databases
Seems infeasible in our case:
–Peers may not disclose base-fact probabilities
–The expression may become huge, as it depends also on the ordering
–Even infinite-size if done without care

Possibility and Certainty
As the semantics is non-deterministic, there are multiple possible fixpoint states
–Referred to as possible worlds
Given an input instance:
–A fact is possible if it appears in some fixpoint state
–A fact is certain if it appears in all fixpoint states

Observations
1. An nsat possible world is an nfat possible world.
2. The converse of 1. does not hold in general.
3. All possible nsat facts are possible nfat facts.
4. Certain nfat facts are certain nsat facts.
5. The above inclusions may be strict.

C-tables
A c-table is an incomplete instance (variables may occur in tuple entries), with conditions associated with tuples
–Conditions are boolean combinations of equality predicates over variables and values
A concrete c-table is one where:
–No variables appear in tuples
–Conditions are boolean formulas over variables

Distributed Sampling
Peer p performs one round of sampling:
1. p asks the other peers to set their databases to the initial state.
2. p asks each peer in the system whether it can move. One of the peers that can move is chosen for the next move. If none can move, p samples the query.

Sampling Quality
(1) The probability of q computed by the algorithm converges (as the number of samples grows) to the probability of q according to the probabilistic semantics
(2) The number of samples required for obtaining the correct probability up to an additive error of ε, with probability at least 1 − δ, is O(ln(1/δ)/ε²)
(3) The expected time for producing each sample is polynomial in the size of the input instance
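
The bound in (2) is the usual Hoeffding-style argument; a sketch of the arithmetic, reconstructing the compressed expression on the slide:

\Pr\big[\,|\hat p - p| \ge \varepsilon\,\big] \;\le\; 2e^{-2N\varepsilon^{2}}
\quad\Longrightarrow\quad
N \;\ge\; \frac{\ln(2/\delta)}{2\varepsilon^{2}}
\ \text{ samples give } \Pr\big[\,|\hat p - p| < \varepsilon\,\big] \ge 1-\delta,
\ \text{ i.e. } N = O\!\left(\ln(1/\delta)/\varepsilon^{2}\right).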

Same Possible Worlds? NO
[the counterexample — a program at p subject to an FD, and a program at q — was not captured in this transcript]

Same Possible Worlds?
We could use nfat (rather than nsat) as the local semantics, but this would not generate all nfat worlds of the centralized case
Of course, a semantics that runs a single nfat step at each peer activation would capture them all
–But it is not very realistic to assume communication after every step

Introducing probabilities
So far we have a non-deterministic semantics
We next turn to a probabilistic semantics where the non-deterministic choices are associated with probabilities
–We will show how this allows us to capture voting
We will also capture uncertainty on base facts with probabilities
–This captures incomplete as well as weighted knowledge