10 Years of Probabilistic Querying – What Next? Martin Theobald University of Antwerp Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig,

10 Years of Probabilistic Querying – What Next? Martin Theobald University of Antwerp Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig, Andre Melo, Iris Miliaraki, Luc de Raedt, Mauro Sozio, Fabian Suchanek

The important thing is to not stop questioning... One cannot help but be in awe when contemplating the mysteries of eternity, of life, of the marvelous structure of reality. It is enough if one tries merely to comprehend a little of this mystery every day. - Albert Einstein, 1936 The Marvelous Structure of Reality Joseph M. Hellerstein Keynote at WebDB 2003, San Diego The Marvelous Structure of Reality Joseph M. Hellerstein Keynote at WebDB 2003, San Diego

Look, There is Structure! The important thing is to not stop questioning

Look, There is Structure! Plethora of natural-language- processing techniques & tools Part-Of-Speech (POS) Tagging Named-Entity Recognition & Disambiguation (NERD) Dependency Parsing Semantic Role Labeling Text is not just unstructured data C1

Look, There is Structure! Plethora of natural-language- processing techniques & tools Part-Of-Speech (POS) Tagging Named-Entity Recognition & Disambiguation (NERD) Dependency Parsing Semantic Role Labeling Text is not just unstructured data But: Even the best NLP tools frequently yield errors Facts found on the Web are logically inconsistent Web-extracted knowledge bases are inherently incomplete C1

bornOn(Jeff, 09/22/42) gradFrom(Jeff, Columbia) hasAdvisor(Jeff, Arthur) hasAdvisor(Surajit, Jeff) knownFor(Jeff, Theory) type(Jeff, Author) [0.9] author(Jeff, Drag_Book) [0.8] author(Jeff,Cind_Book) [0.6] worksAt(Jeff, Bell_Labs) [0.7] type(Jeff, CEO) [0.4] Information Extraction YAGO/DBpedia et al. New fact candidates >120 M facts for YAGO2 (mostly from Wikipedia infoboxes) 100s M additional facts from Wikipedia free-text

7 instanceOf http://www.mpi-inf.mpg.de/yago-naga/ YAGO Knowledge Base Entity Max_Planck Apr 23, 1858 Person City Country subclass Location subclass bornOn Max Planck means subclass Oct 4, 1947 diedOn Kiel bornIn Nobel Prize Erwin_Planck fatherOf hasWon Scientist means Max Karl Ernst Ludwig Planck Physicist subclass Biologist subclass Germany Politician Angela Merkel Schleswig- Holstein State Angela Dorothea Merkel Oct 23, 1944 diedOn Organization subclass Max_Planck Society instanceOf means instanceOf subclass means Angela Merkel means citizenOf locatedIn subclass 3 M entities, 120 M facts 100 relations, 200k classes 3 M entities, 120 M facts 100 relations, 200k classes accuracy 95% accuracy 95% subclass instanceOf

8 http://linkeddata.org/ Linked Open Data As of Sept. 2011: >200 linked-data sources >30 billion RDF triples >400 million owl:sameAs links As of Sept. 2011: >200 linked-data sources >30 billion RDF triples >400 million owl:sameAs links

9 Maybe Even More Importantly: Linked Vocabularies! Source: http://en.wikipedia.org/wiki/Linked_datahttp://en.wikipedia.org/wiki/Linked_data LinkedData.org Instance & class links between DBpedia, WordNet, OpenCyc, GeoNames, and many more… Schema.org Common vocabulary released by Google, Yahoo!, BING to annotate Web pages, incl. links to DBpedia. Micro-Formats: RDFa (W3C) <html xmlns="http://www.w3.org/1999/xhtml" xmlns:dc="http://purl.org/dc/elements/1.1/" version="XHTML+RDFa 1.0" xml:lang="en"> Martin's Home Page

10 Currently (Sept. 2011) > 5 Mio owl:sameAs links between DBpedia/YAGO/Freebase As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase

11 Application 1: Enrichment of Search Results Recent Advances in Structured Data and the Web. Alon Y. Halevy, Keynote at ICDE 2013, Brisbane Recent Advances in Structured Data and the Web. Alon Y. Halevy, Keynote at ICDE 2013, Brisbane

12 Its about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder. Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer. A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is the perfect victim for anyone who wished her ill." Application II: Machine Reading same uncleOf owns hires headOf affairWith enemyOf Etzioni, Banko, Cafarella: Machine Reading. AAAI06 Mitchell, Carlson et al.: Toward an Architecture for Never-Ending Language Learning. AAAI10 Etzioni, Banko, Cafarella: Machine Reading. AAAI06 Mitchell, Carlson et al.: Toward an Architecture for Never-Ending Language Learning. AAAI10

13 Application III: Natural-Language Question Answering evi.comevi.com (formerly trueknowledge.com)trueknowledge.com

14 Application III: Natural-Language Question Answering wolframalpha.com >10 trillion(!) facts >50,000 search algorithms >5,000 visualizations

15 IBM Watson: Deep Question Answering 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain This town is known as "Sin City" & its downtown is "Glitter Gulch" William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel As of 2010, this is the only former Yugoslav republic in the EU www.ibm.com/innovation/us/watson/index.htm Knowledge back-ends Question classification & decomposition D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.

16 http://greententacle.techfak.uni- bielefeld.de/~cunger/qald/ Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13 Natural-Language QA over Linked Data Which river does the Brooklyn Bridge cross? Welchen Fluss überspannt die Brooklyn Bridge? ¿Por qué río cruza la Brooklyn Bridge? Quale fiume attraversa il ponte di Brooklyn? Quelle cours d'eau est traversé par le pont de Brooklyn? Welke rivier overspant de Brooklyn Bridge? river, cross, Brooklyn Bridge Fluss, überspannen, Brooklyn Bridge río, cruza, Brooklyn Bridge fiume, attraversare, ponte di Brooklyn cours d'eau, pont de Brooklyn rivier, Brooklyn Bridge, overspant PREFIX dbo: PREFIX res: SELECT DISTINCT ?uri WHERE { res:Brooklyn_Bridge dbo:crosses ?uri. }

17 Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor? German politicians successor other stepped down before actual term name ancestor SELECT ?s ?s1 WHERE { ?s rdf:type. ?s1 ?s. FILTER FTContains (?s, "stepped down early"). } https://inex.mmci.uni- saarland.de/tracks/lod/ INEX Linked Data Track, CLEF 2012-13 Natural-Language QA over Linked Data

18 Outline Probabilistic Databases Stanfords Trio System: Data, Uncertainty & Lineage Handling Uncertain RDF Data: URDF (Max-Planck-Institute/U-Antwerp) Probabilistic & Temporal Databases Sequenced vs. Non-Sequenced Semantics Interval Alignment & Probabilistic Inference Probabilistic Programming Statistical Relational Learning Learning Interesting Deduction Rules Summary & Challenges

19 Probabilistic databases combine first-order logic and probability theory in an elegant way: Declarative: Queries formulated in SQL/Relational Algebra/Datalog, support for updates, transactions, etc. Deductive: Well-studied resolution algorithms for SQL/Relational Algebra/Datalog (top-down/bottom- up), indexes, automatic query optimization Scalable (?): Polynomial data complexity (SQL), but #P-complete for the probabilistic inference Probabilistic Databases: A Panacea to All of the Afore Tasks? C2

20 Special Cases: Query Semantics: (Marginal Probabilities) Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities of all instances D i where t exists. A probabilistic database D p (compactly) encodes a probability distribution over a finite set of deterministic database instances D i. 0.420.180.280.12 (1) Tuple-independent PDB(II) Block-independent PDB Note: (I) and (II) are not equivalent! Probabilistic Database

21 Stanford Trio System 1. Alternatives 2. ? (Maybe) Annotations 3. Confidence values 4. Lineage Uncertainty-Lineage Databases (ULDBs) [Widom: CIDR 2005]

22 Trios Data Model 1. Alternatives: uncertainty about value Saw (witness, color, car) Amy red, Honda red, Toyota orange, Mazda Three possible instances

23 Six possible instances Trios Data Model 1. Alternatives 2. ? (Maybe): uncertainty about presence ? Saw (witness, color, car) Amy red, Honda red, Toyota orange, Mazda Bettyblue, Acura

24 Trios Data Model 1. Alternatives 2. ? (Maybe) Annotations 3. Confidences: weighted uncertainty Still six possible instances, each with a probability ? Saw (witness, color, car) Amy red, Honda 0.5 red, Toyota 0.3 orange, Mazda 0.2 Betty blue, Acura 0.6

25 So Far: Model is Not Closed Saw (witness, car) Cathy Honda Mazda Drives (person, car) Jimmy, Toyota Jimmy, Mazda Billy, Honda Frank, Honda Hank, Honda Suspects Jimmy Billy Frank Hank Suspects = π person (Saw Drives) ? ? ? Does not correctly capture possible instances in the result CANNOT

26 Example with Lineage IDSaw (witness, car) 11Cathy Honda Mazda IDDrives (person, car) 21 Jimmy, Toyota Jimmy, Mazda 22 Billy, Honda Frank, Honda 23Hank, Honda IDSuspects 31Jimmy 32 Billy Frank 33Hank Suspects = π person (Saw Drives) ? ? ? λ (31) = (11,2) Λ (21,2) λ (32,1) = (11,1) Λ (22,1); λ (32,2) = (11,1) Λ (22,2) λ (33) = (11,1) Λ 23

27 Example with Lineage ID Saw (witness, car) 11Cathy Honda Mazda ID Drives (person, car) 21 Jimmy, Toyota Jimmy, Mazda 22 Billy, Honda Frank, Honda 23Hank, Honda ID Suspects 31Jimmy 32 Billy Frank 33Hank Suspects = π person (Saw Drives) ? ? ? λ (31) = (11,2) Λ (21,2) λ (32,1) = (11,1) Λ (22,1); λ (32,2) = (11,1) Λ (22,2) λ (33) = (11,1) Λ 23 (4)

28 Operational Semantics Closure: up-arrow always exists Closure: up-arrow always exists Completeness: any (finite) set of possible instances can be represented DpDp D 1, D 2,…, D n D 1, D 2, …, D m D p possible instances Q on each instance rep. of instances direct implementation But: data complexity is #P-complete!

29 Summary on Trios Data Model 1. Alternatives 2. ? (Maybe) Annotations 3. Confidence values 4. Lineage Uncertainty-Lineage Databases (ULDBs) Theorem: ULDBs are closed and complete. Formally studied properties like minimization, equivalence, approximation and membership based on lineage. [Benjelloun, Das Sarma, Halevy, Widom, Theobald: VLDB-J. 2008]

30 Basic Complexity Issue Theorem [Valiant:1979] For a Boolean expression E, computing Pr(E) is #P-complete NP = class of problems of the form is there a witness ? SAT #P = class of problems of the form how many witnesses ? #SAT NP = class of problems of the form is there a witness ? SAT #P = class of problems of the form how many witnesses ? #SAT The decision problem for 2CNF is in PTIME. The counting problem for 2CNF is already #P-complete. (will be coming back to this later again…) [Suciu & Dalvi: SIGMOD05 Tutorial on "Foundations of Probabilistic Answers to Queries"]

…back to Information Extraction bornIn(Barack, Honolulu) bornIn(Barack, Kenya)

Uncertain RDF (URDF): Facts & Rules Extensional Knowledge (the facts) High-confidence facts: existing knowledge base (ground truth) New fact candidates: extracted fact candidates with confidences Linked-Data & integration of various knowledge sources: Ontology merging or explicitly linked facts (owl:sameAs, owl:equivProp.) Large Probabilistic Database of RDF facts Intensional Knowledge (the rules) Soft rules: deductive grounding & lineage (Datalog/SLD resolution) Hard rules: consistency constraints (more general FOL rules) Propositional & probabilistic inference At query-time!

Soft Rules vs. Hard Rules (Soft) Deduction Rules vs. (Hard) Consistency Constraints People may live in more than one place livesIn(x,y) marriedTo(x,z) livesIn(z,y) livesIn(x,y) hasChild(x,z) livesIn(z,y) People are not born in different places/on different dates bornIn(x,y) bornIn(x,z) y=z bornOn(x,y) bornOn(x,z) y=z People are not married to more than one person (at the same time, in most countries?) marriedTo(x,y,t 1 ) marriedTo(x,z,t 2 ) y z disjoint(t 1,t 2 ) [0.8] [0.5]

[0.8] [0.5] Soft Rules vs. Hard Rules (Soft) Deduction Rules vs. (Hard) Consistency Constraints People may live in more than one place livesIn(x,y) marriedTo(x,z) livesIn(z,y) livesIn(x,y) hasChild(x,z) livesIn(z,y) People are not born in different places/on different dates bornIn(x,y) bornIn(x,z) y=z bornOn(x,y) bornOn(x,z) y=z People are not married to more than one person (at the same time, in most countries?) marriedTo(x,y,t 1 ) marriedTo(x,z,t 2 ) yz disjoint(t 1,t 2 ) Deductive Database: Datalog, core of SQL & Relational Algebra, RDF/S, OWL2-RL, etc. Deductive Database: Datalog, core of SQL & Relational Algebra, RDF/S, OWL2-RL, etc. More General FOL Constraints: Datalog plus constraints, X-tuples in PDBs, owl:FunctionalProperty, owl:disjointWith, etc. More General FOL Constraints: Datalog plus constraints, X-tuples in PDBs, owl:FunctionalProperty, owl:disjointWith, etc.

URDF Running Example Jeff Stanford University type [1.0] Surajit Princeton David Computer Scientist Computer Scientist worksAt [0.9] type [1.0] graduatedFrom [0.6] graduatedFrom [0.7] graduatedFrom [0.9] hasAdvisor [0.8] hasAdvisor [0.7] KB: RDF Base Facts Derived Facts gradFr(Surajit,Stanford) gradFr(David,Stanford) Derived Facts gradFr(Surajit,Stanford) gradFr(David,Stanford) graduatedFrom [?] Rules hasAdvisor(x,y) worksAt(y,z) graduatedFrom(x,z) [0.4] graduatedFrom(x,y) graduatedFrom(x,z) y=z Rules hasAdvisor(x,y) worksAt(y,z) graduatedFrom(x,z) [0.4] graduatedFrom(x,y) graduatedFrom(x,z) y=z

Basic Types of Inference MAP Inference Find the most likely assignment to query variables y under a given evidence x. Compute: arg max y P( y | x) (NP-complete for MaxSAT) Marginal/Success Probabilities Probability that query y is true in a random world under a given evidence x. Compute: y P( y | x ) (#P-complete already for conjunctive queries)

General Route: Grounding & MaxSAT Solving Query graduatedFrom(x, y) Query graduatedFrom(x, y) CNF (graduatedFrom(Surajit, Stanford) graduatedFrom(Surajit, Princeton)) (graduatedFrom(David, Stanford) graduatedFrom(David, Princeton)) (hasAdvisor(Surajit, Jeff) worksAt(Jeff, Stanford) graduatedFrom(Surajit, Stanford)) (hasAdvisor(David, Jeff) worksAt(Jeff, Stanford) graduatedFrom(David, Stanford)) worksAt(Jeff, Stanford) hasAdvisor(Surajit, Jeff) hasAdvisor(David, Jeff) graduatedFrom(Surajit, Princeton) graduatedFrom(Surajit, Stanford) graduatedFrom(David, Princeton) CNF (graduatedFrom(Surajit, Stanford) graduatedFrom(Surajit, Princeton)) (graduatedFrom(David, Stanford) graduatedFrom(David, Princeton)) (hasAdvisor(Surajit, Jeff) worksAt(Jeff, Stanford) graduatedFrom(Surajit, Stanford)) (hasAdvisor(David, Jeff) worksAt(Jeff, Stanford) graduatedFrom(David, Stanford)) worksAt(Jeff, Stanford) hasAdvisor(Surajit, Jeff) hasAdvisor(David, Jeff) graduatedFrom(Surajit, Princeton) graduatedFrom(Surajit, Stanford) graduatedFrom(David, Princeton) 1000 0.4 0.9 0.8 0.7 0.6 0.7 0.9 1) Grounding – Consider only facts (and rules) which are relevant for answering the query 2) Propositional formula in CNF, consisting of – Grounded soft & hard rules – Weighted base facts 3) Propositional Reasoning – Find truth assignment to facts such that the total weight of the satisfied clauses is maximized MAP inference: compute most likely possible world

[Theobald,Sozio,Suchanek,Nakashole: VLDS12] Find: arg max y P( y | x) Resolves to a variant of MaxSAT for propositional formulas URDF: MaxSAT Solving with Soft & Hard Rules { graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) } { graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) } { graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) } { graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) } (hasAdvisor(Surajit, Jeff) worksAt(Jeff, Stanford) graduatedFrom(Surajit, Stanford)) (hasAdvisor(David, Jeff) worksAt(Jeff, Stanford) graduatedFrom(David, Stanford)) worksAt(Jeff, Stanford) hasAdvisor(Surajit, Jeff) hasAdvisor(David, Jeff) graduatedFrom(Surajit, Princeton) graduatedFrom(Surajit, Stanford) graduatedFrom(David, Princeton) (hasAdvisor(Surajit, Jeff) worksAt(Jeff, Stanford) graduatedFrom(Surajit, Stanford)) (hasAdvisor(David, Jeff) worksAt(Jeff, Stanford) graduatedFrom(David, Stanford)) worksAt(Jeff, Stanford) hasAdvisor(Surajit, Jeff) hasAdvisor(David, Jeff) graduatedFrom(Surajit, Princeton) graduatedFrom(Surajit, Stanford) graduatedFrom(David, Princeton) 0.4 0.9 0.8 0.7 0.6 0.7 0.9 S: Mutex-const. Special case: Horn-clauses as soft rules & mutex-constraints as hard rules C: Weighted Horn clauses (CNF) Compute W 0 = clauses C w(C) P(C is satisfied); For each hard constraint S { For each fact f in S t { Compute W f+ t = clauses C w(C) P(C is sat. | f = true); } Compute W S- t = clauses C w(C) P(C is sat. | S t = false); Choose truth assignment to f in S t that maximizes W f+ t, W S- t ; Remove satisfied clauses C; t++; } Compute W 0 = clauses C w(C) P(C is satisfied); For each hard constraint S { For each fact f in S t { Compute W f+ t = clauses C w(C) P(C is sat. | f = true); } Compute W S- t = clauses C w(C) P(C is sat. | S t = false); Choose truth assignment to f in S t that maximizes W f+ t, W S- t ; Remove satisfied clauses C; t++; } Runtime: O(|S||C|) Approximation guarantee of 1/2 Runtime: O(|S||C|) Approximation guarantee of 1/2 MaxSAT Alg.

Experiment (I): MAP Inference URDF: Grounding & MaxSAT solving |C| - # literals in grounded soft rules |S| - # literals in grounded hard rules URDF MaxSAT vs. Markov Logic (MAP inference & MC-SAT) YAGO Knowledge Base: 2 Mio entities, 20 Mio facts Query Answering: Deductive grounding & MaxSAT solving for 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …) Asymptotic runtime checks via synthetic (random) soft rule expansions

[Yahya,Theobald: RuleML11 Dylla,Miliaraki,Theobald: ICDE13] Deductive Grounding with Lineage (SLD Resolution in Datalog/Prolog) \/ /\ graduatedFrom (Surajit, Princeton) [0.7] graduatedFrom (Surajit, Princeton) [0.7] hasAdvisor (Surajit,Jeff )[0.8] hasAdvisor (Surajit,Jeff )[0.8] worksAt (Jeff,Stanford )[0.9] worksAt (Jeff,Stanford )[0.9] graduatedFrom (Surajit, Stanford) [0.6] graduatedFrom (Surajit, Stanford) [0.6] Query graduatedFrom(Surajit, y) Query graduatedFrom(Surajit, y) CD AB A (B (C D)) graduatedFrom (Surajit, Princeton) graduatedFrom (Surajit, Princeton) graduatedFrom (Surajit, Stanford) graduatedFrom (Surajit, Stanford) Q1Q1 Q2Q2 Rules hasAdvisor(x,y) worksAt(y,z) graduatedFrom(x,z) [0.4] graduatedFrom(x,y) graduatedFrom(x,z) y=z Rules hasAdvisor(x,y) worksAt(y,z) graduatedFrom(x,z) [0.4] graduatedFrom(x,y) graduatedFrom(x,z) y=z Base Facts graduatedFrom(Surajit, Princeton) [0.7] graduatedFrom(Surajit, Stanford) [0.6] graduatedFrom(David, Princeton) [0.9] hasAdvisor(Surajit, Jeff) [0.8] hasAdvisor(David, Jeff) [0.7] worksAt(Jeff, Stanford) [0.9] type(Princeton, University) [1.0] type(Stanford, University) [1.0] type(Jeff, Computer_Scientist) [1.0] type(Surajit, Computer_Scientist) [1.0] type(David, Computer_Scientist) [1.0] Base Facts graduatedFrom(Surajit, Princeton) [0.7] graduatedFrom(Surajit, Stanford) [0.6] graduatedFrom(David, Princeton) [0.9] hasAdvisor(Surajit, Jeff) [0.8] hasAdvisor(David, Jeff) [0.7] worksAt(Jeff, Stanford) [0.9] type(Princeton, University) [1.0] type(Stanford, University) [1.0] type(Jeff, Computer_Scientist) [1.0] type(Surajit, Computer_Scientist) [1.0] type(David, Computer_Scientist) [1.0]

Lineage & Possible Worlds 1) Deductive Grounding Dependency graph of the query Trace lineage of individual query answers 2) Lineage DAG (not in CNF), consisting of Grounded soft & hard rules Probabilistic base facts 3) Probabilistic Inference Compute marginals: P(Q): sum up the probabilities of all possible worlds that entail the query answers lineage P(Q|H): drop impossible worlds \/ /\ graduatedFrom (Surajit, Princeton) [0.7] graduatedFrom (Surajit, Princeton) [0.7] hasAdvisor (Surajit,Jeff )[0.8] hasAdvisor (Surajit,Jeff )[0.8] worksAt (Jeff,Stanford )[0.9] worksAt (Jeff,Stanford )[0.9] graduatedFrom (Surajit, Stanford) [0.6] graduatedFrom (Surajit, Stanford) [0.6] Query graduatedFrom(Surajit, y) Query graduatedFrom(Surajit, y) 0.7x(1-0.888)=0.078(1-0.7)x0.888=0.266 1-(1-0.72)x(1-0.6) =0.888 0.8x0.9 =0.72 CD AB A (B (C D)) graduatedFrom (Surajit, Princeton) graduatedFrom (Surajit, Princeton) graduatedFrom (Surajit, Stanford) graduatedFrom (Surajit, Stanford) Q1Q1 Q2Q2 [Das Sarma,Theobald,Widom: ICDE08 Dylla,Miliaraki,Theobald: ICDE13]

Possible Worlds Semantics 1.0 0.2664 0.412 P(Q 2 )=0.2664 P(Q 2 |H)=0.2664 / 0.412 = 0.6466 P(Q 1 )=0.0784P(Q 1 |H)=0.0784 / 0.412 = 0.1903 0.0784 Hard rule H: A (B (C D))

Inference in Probabilistic Databases Safe query plans [Dalvi,Suciu: VLDB-J07] Can propagate confidences along with relational operators. Read-once functions [Sen,Deshpande,Getoor: PVLDB10] Can factorize Boolean formula (in polynomial time) into read-once form, where every variable occurs at most once. Knowledge compilation [Olteanu et al.: ICDT10, ICDT11] Can decompose Boolean formula into ordered binary decision diagram (OBDD), such that inference resolves to independent-and and independent-or operations over the decomposed formula. Top-k pruning [Ré,Davli,Suciu: ICDE07; Karp,Luby,Madras: J-Alg.89] Can return top-k answers based on lower and upper bounds, even without knowing their exact marginal probabilities. Multi-Simulation: run multiple Markov-Chain-Monte-Carlo (MCMC) simulations in parallel.

Monte Carlo Simulation (I) E = X 1 X 2 v X 1 X 3 v X 2 X 3 cnt = 0 repeat N times randomly choose X 1, X 2, X 3 {0,1} if E(X 1, X 2, X 3 ) = 1 then cnt = cnt+1 P = cnt/N return P /* estimate for true Pr(F) */ cnt = 0 repeat N times randomly choose X 1, X 2, X 3 {0,1} if E(X 1, X 2, X 3 ) = 1 then cnt = cnt+1 P = cnt/N return P /* estimate for true Pr(F) */ Theorem: If N (1/ Pr(E)) × (4 ln(2/ )/ 2 ) then: Pr[ | P/ Pr(E) - 1 | > ] < N may be very big for small Pr(E) X1X2X1X2 X1X2X1X2 X1X3X1X3 X1X3X1X3 X2X3X2X3 X2X3X2X3 Boolean formula: Zero/One-Estimator Theorem Works for any E (not in PTIME) Works for any E (not in PTIME) Naïve sampling: [Suciu & Dalvi: SIGMOD05 Tutorial on "Foundations of Probabilistic Answers to Queries" Karp,Luby,Madras: J-Alg.89]

cnt = 0; S = Pr(C 1 ) + … + Pr(C m ) repeat N times randomly choose i {1,2,…, m}, with prob. Pr(C i )/S randomly choose X 1, …, X n {0,1} s.t. C i = 1 if C 1 = 0 and C 2 = 0 and … and C i-1 = 0 then cnt = cnt+1 P = cnt/N return P /* estimate for true Pr(E) */ cnt = 0; S = Pr(C 1 ) + … + Pr(C m ) repeat N times randomly choose i {1,2,…, m}, with prob. Pr(C i )/S randomly choose X 1, …, X n {0,1} s.t. C i = 1 if C 1 = 0 and C 2 = 0 and … and C i-1 = 0 then cnt = cnt+1 P = cnt/N return P /* estimate for true Pr(E) */ Theorem: If N (1/m) × (4 ln(2/ )/ 2 ) then: Pr[ |P/Pr(E) - 1| > ] < Theorem: If N (1/m) × (4 ln(2/ )/ 2 ) then: Pr[ |P/Pr(E) - 1| > ] < E = C 1 v C 2 v... v C m Importance sampling: This is better! Only for E in DNF in PTIME Boolean formula in DNF: Monte Carlo Simulation (II) [Suciu & Dalvi: SIGMOD05 Tutorial on "Foundations of Probabilistic Answers to Queries" Karp,Luby,Madras: J-Alg.89]

Top-k Ranking by Marginal Probabilities \/ graduatedFrom (Surajit, Stanford) [0.6] graduatedFrom (Surajit, Stanford) [0.6] Query graduatedFrom(Surajit, y) Query graduatedFrom(Surajit, y) graduatedFrom (Surajit, Princeton) graduatedFrom (Surajit, Princeton) graduatedFrom (Surajit, Stanford) graduatedFrom (Surajit, Stanford) Q1Q1 Q2Q2 graduatedFrom (Surajit, Princeton )[0.7] graduatedFrom (Surajit, Princeton )[0.7] AB graduatedFrom (Surajit, y=Stanford) graduatedFrom (Surajit, y=Stanford) /\ hasAdvisor (Surajit,Jeff )[0.8] hasAdvisor (Surajit,Jeff )[0.8] worksAt (Jeff,Stanford )[0.9] worksAt (Jeff,Stanford )[0.9] CD Datalog/SLD resolution Top-down grounding allows us to compute lower and upper bounds on the marginal probabilities of answer candidates before rules are fully grounded. Subgoals may represent sets of answer candidates. First-order lineage formulas: Φ (Q 1 ) = A Φ (Q 2 ) = B y gradFrom(Surajit,y) Prune entire set of answer candidates represented by Φ. [Dylla,Miliaraki,Theobald: ICDE13]

Bounds for First-Order Formulas Theorem 1: Given a (partially grounded) first-order lineage formula Φ : Φ (Q 2 ) = B y gradFrom(S,y) Lower bound P low (for all query answers that can be obtained from grounding Φ) Substitute y gradFrom(S,y) with false (or true if negated). P low (Q 2 ) = P(B false) = P(B) = 0.6 Upper bound P up (for all query answers that can be obtained from grounding Φ) Substitute y gradFrom(S,y) with true (or false if negated). P up (Q 2 ) = P(B true) = P(true) = 1.0 Proof: (sketch) Substitution of a subformula with false reduces the number of models (possible worlds) that satisfy Φ ; substitution with true increases them. [Dylla,Miliaraki,Theobald: ICDE13]

Theorem II: Let Φ 1,…, Φ n be a series of first-order lineage formulas obtained from grounding Φ via SLD resolution, and let φ be the propositional lineage formula of an answer obtained from this grounding procedure. Then rewriting each Φ i according to Theorem 1 into P i,low and P i,up creates a monotonic series of lower and upper bounds that converges to P( φ ). 0 = P(false) P(B false) = 0.6 P(B (C D)) = 0.888 P(B true) = P(true) = 1 Proof: (sketch, via induction) Substitution of true with a formula reduces the number of models that satisfy Φ ; substitution of false with a formula increases this number. Convergence of Bounds [Dylla,Miliaraki,Theobald: ICDE13]

P 2,up (Q j ) P 2,low (Q j ) Top-k Pruning Fagins Algorithm Maintain two disjoint queues: Top-k queue sorted by P low and Candidates sorted by P up Return the top-k queue at the tth grounding step when: P i,low (Q k ) | Q k Top-k > P i,up (Q j ) | Q j Candidates Drop Q j from the Candidates queue. P 1,up (Q j ) P 1,low (Q j ) k-th lower bound P n,up (Q j ) P n,low (Q j ) #SLD steps t Marginal probability 1 0 [Fagin et al.01; Balke,Kießling02; Dylla,Miliaraki,Theobald: ICDE13]

Top-k Stopping Condition Fagins Algorithm Maintain two disjoint queues: Top-k queue sorted by P low and Candidates sorted by P up Return the top-k queue at the tth grounding step when: P i,low (Q k ) | Q k Top-k > P i,up (Q j ) | Q j Candidates Stop and return the top-2 query answers. 2-nd lower bound [Fagin et al.01; Balke,Kießling02; Dylla,Miliaraki,Theobald: ICDE13] k = 2 P t,up (Q 2 ) P t,low (Q 2 ) P t,up (Q 1 ) P t,low (Q 1 ) @SLD step t Marginal probability 1 0 P t,low (Q m ) P t,up (Q m )

Experiment (II): Computing Marginals IMDB data with 26 Mio facts about movies, directors, actors, etc. 4 query patterns, each instantiated to 1,000 queries (showing runtime averages) Q1 – safe, non-repeating hierarchical Q2 – unsafe, repeating hierarchical Q3 – unsafe, head-hierarchical Q4 – general unsafe

Experiment (II): Computing Marginals Runtime vs. number of top-k results; single join query Percentage of tuples scanned from input relations IMDB data set, 26 Mio facts

Probabilistic & Temporal Database Sequenced Semantics & Snapshot Reducibility: Built-in semantics: reduce temporal-relational operators to their non- temporal counterparts at each snapshot of the database. Coalesce/split tuples with consecutive time intervals based on their lineages. Non-Sequenced Semantics Queries can freely manipulate timestamps just like regular attributes. Single temporal operator T supports all of Allens 13 temporal relations. Deduplicate tuples with overlapping time intervals based on their lineages. A temporal-probabilistic database D Tp (compactly) encodes a probability distribution over a finite set of deterministic database instances D i and a finite time domain T. [Dignös, Gamper, Böhlen: SIGMOD12] [Dylla,Miliaraki,Theobald: PVLDB13]

Temporal Alignment & Deduplication Non-Sequenced Semantics: f1f1 193619761988 f 2 ¬f 3 f 2 f 3 f 1 ¬f 3 f 1 f 3 (f 1 f 3 ) (f 1 ¬f 3 ) (f 2 f 3 ) (f 2 ¬f 3 ) (f 1 f 3 ) (f 2 ¬f 3 ) T MarriedTo(X,Y)[T b1,t max ) Wedding(X,Y)[T b1,T e1 ) ¬Divorce(X,Y)[T b2,T e2 ) MarriedTo(X,Y)[T b1,T e2 ) Wedding(X,Y)[T b1,T e1 ) Divorce(X,Y)[T b2,T e2 ) T e1 T T b2 Base Facts Deduced Facts Dedupl. Facts Wedding(DeNiro,Abbott) Divorce(DeNiro,Abbott) t max f2f2 f3f3 t min

0.08 0.12 0.16 0.4 0.6 03 0507 playsFor(Beckham, Real, T 1 ) Base Facts Derived Facts 0.2 0.1 0.4 05000207 playsFor(Ronaldo, Real, T 2 ) 04 0304 07 05 playsFor(Beckham, Real, T 1 ) playsFor(Ronaldo, Real, T 2 ) overlaps(T 1,T 2, T 3 ) t 3 teamMates(Beckham, Ronaldo, t 3 ) teamMates(Beckham, Ronaldo, T 3 ) Inference in Temporal-Probabilistic Databases [Wang,Yahya,Theobald: MUD10; Dylla,Miliaraki,Theobald: PVLDB13]

0.4 0.6 03 0507 playsFor(Beckham, Real, T 1 ) Base Facts Derived Facts playsFor(Ronaldo, Real, T 2 ) 0.2 0.1 05000207 04 0.4 0.08 0.12 0.16 0304 07 05 playsFor(Zidane, Real, T 3 ) teamMates(Beckham, Zidane, T 5 ) teamMates(Ronaldo, Zidane, T 6 ) teamMates(Beckham, Ronaldo, T 4 ) Non-independent Independent Inference in Temporal-Probabilistic Databases [Wang,Yahya,Theobald: MUD10; Dylla,Miliaraki,Theobald: PVLDB13]

playsFor(Beckham, Real, T 1 ) Base Facts Derived Facts playsFor(Ronaldo, Real, T 2 ) playsFor(Zidane, Real, T 3 ) teamMates(Beckham, Zidane, T 5 ) teamMates(Ronaldo, Zidane, T 6 ) Non-independent Independent Closed and complete representation model (incl. lineage) Temporal alignment is linear in the number of input intervals Confidence computation per interval remains #P-hard In general requires Monte Carlo approximations (Luby-Karp for DNF, MCMC-style sampling), decompositions, or top-k pruning Closed and complete representation model (incl. lineage) Temporal alignment is linear in the number of input intervals Confidence computation per interval remains #P-hard In general requires Monte Carlo approximations (Luby-Karp for DNF, MCMC-style sampling), decompositions, or top-k pruning teamMates(Beckham, Ronaldo, T 4 ) Need Lineage! Inference in Temporal-Probabilistic Databases [Wang,Yahya,Theobald: MUD10; Dylla,Miliaraki,Theobald: PVLDB13]

Experiment (III): Temporal Alignment & Probabilistic Inference 1,827 base facts with temporal annotations Extracted from free-text biographies from Wikipedia, IMDB.com, biography.com 11 handcrafted temporal deduction rules, e.g.: MarriedTo(X,Y)[T b1,T e2 ) Wedding(X,Y)[T b1,T e1 ) Divorce(X,Y)[T b2,T e2 ) T e1 T T b2 21 handcrafted temporal consistency constraints, e.g.: BornIn(X,Y)[T b1,T e1 ) MarriedTo(X,Y)[T b2,T e2 ) T e1 T T b2

Statistical Relational Learning & Probabilistic Programming SRL combines first-order logic and probabilistic inference Employs relational data as input, but with a focus also on learning the relations (facts, rules & weights) Knowledge compilation for probabilistic inference Including recent techniques for lifted inference Markov Logic Networks (U-Washington) Grounding of weighted first-order rules over a function-free Herbrand base into an undirected graphical model ( Markov Random Field) Probabilistic Programming (ProbLog, KU-Leuven) Deductive grounding over a set of base facts into a directed graphical model (SLD proofs Bayesian Net)

Learning Soft Deduction Rules Inductive learning algorithm based on dynamic programming A-priori-style pre-filtering & pruning of low-support join patterns Adaptation of confidence and support measures from data mining Learning interesting rules with constants and type constraints Ground truth for bornIn(partially known) Specificity: avoid producing overly general rules Overly general Refine by types Ground truth for IivesIn (only partially known) Knowledge base for livesIn (known positive examples) Facts inferred for livesIn from the body of the rule bornIn (only partially correct) Goal: Inductively learn soft rule S: livesIn(x,y) :- bornIn(x,y) G KB R

Learning Interesting Deduction Rules (I) Plots for the distribution of income versus quarterOfBirth and educationLevel over actual US census data from Oct. 2009 (>1 billion RDF facts). Divergence fromOverall population shows strong correlation of income with educationLevel but not with quarterOfBirth. income re/. freq. Overall population QOB-1 st -quarter QOB-2 nd -quarter QOB-3 rd -quarter QOB-4 th -quarter Overall population QOB-1 st -quarter QOB-2 nd -quarter QOB-3 rd -quarter QOB-4 th -quarter income re/. freq. income(x, y), quarterOfBirth(x, z) income(x, y), educationLevel(x, z)

Learning Interesting Deduction Rules (II) Divergence measured using Kullback-Leibler or χ 2 betweenOverall population withNursery school to Grade 4 andProfessional school degree over discretized income domain. re/. freq. lowmediumhigh income(x, y) :- educationLevel(x, z) income(x, low) :- educationLevel(x, Nursery school to Grade 4) income(x, medium) :- educationLevel(x, Professional school degree) income(x, high) :- educationLevel(x, Professional school degree) – Overall population – Nursery school to Grade 4 – Professional school degree – Overall population – Nursery school to Grade 4 – Professional school degree income

ontological rigor human effort Names & PatternsEntities & Relations Open- Domain & Unsuper- vised Domain- Oriented Training Data/Facts < N. Portman, honored with, Academy Award>, < Jeff Bridges, expected to win, Oscar > < Bridges, nominated for, Academy Award> wonAward: Person Prize type (Meryl_Streep, Actor) wonAward (Meryl_Streep, Academy_Award) wonAward (Natalie_Portman, Academy_Award) wonAward (Ethan_Coen, Palme_dOr) Summary & Challenges (I) Web-Scale Information Extraction

ontological rigor human effort Names & PatternsEntities & Relations Open- Domain & Unsuper- vised Domain- Oriented Training Data/Facts Summary & Challenges (I) Web-Scale Information Extraction TextRunner ReadTheWeb / NELL Probase Freebase YAGO2DBpedia 3.8 Sofie / Prospera StatSnowball / EntityCube ? ----- WebTables / FusionTables

Summary & Challenges (II) RDF is Not Enough! HMMs, CRFs, PCFGs (not in this talk) yield much richer output structures than just triplets. Extraction of facts beliefs, modifiers, modalities, etc.. intensional knowledge (rules) More expressive but canonical representation of natural language: trees, graphs, objects, frames (F-logic, KL-one, CycL, OWL, etc.) All combined with structured probabilistic inference Extraction of facts beliefs, modifiers, modalities, etc.. intensional knowledge (rules) More expressive but canonical representation of natural language: trees, graphs, objects, frames (F-logic, KL-one, CycL, OWL, etc.) All combined with structured probabilistic inference

Summary & Challenges (III) Scalable Probabilistic Inference Domain-liftable FO formula X,Y People smokes(X) friends(X,Y) smokes(Y) Exact lifted inference via Weighted-First-Order-Model-Counting (WFOMC) Probability of a query depends only on the size(s) of the domain(s), a weight function for the first-order predicates, and the weighted model count over the FO d-DNNF. [Van den Broeck11]: Compilation rules and inference algorithms for FO d-DNNFs [Jha & Suciu11]: Classes of SQL queries which admit polynomial-size (propositional) d-DNNFs Approximate inference via Belief Propagation, MCMC-style sampling, etc. Scale-out via distributed grounding & inference: TrinityRDF (MSR), GraphLab2 (MIT) Corresponding FO d-DNNF circuit

Final Summary Text is not just unstructured data. Probabilistic databases combine first-order logic and probability theory in an elegant way. Natural-Language-Processing people, Database guys, and Machine-Learning folks: its about time to join your forces! C1C2C3

Demo! urdf.mpi-inf.mpg.de

References Maximilian Dylla, Iris Miliaraki, and Martin Theobald: A Temporal-Probabilistic Database Model for Information Extraction. PVLDB 6(14), 2013 (to appear) Maximilian Dylla, Iris Miliaraki, and Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non- Materialized Views. ICDE 2013, 2013 Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS 2012: 15-20 Mohamed Yahya, Martin Theobald: D2R2: Disk-Oriented Deductive Reasoning in a RISC-Style RDF Engine. RuleML America 2011: 81-96 Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive Reasoning in Uncertain RDF Knowledge Bases. CIKM 2011: 2557-2560 Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Scalable Knowledge Harvesting with High Precision and High Recall. WSDM 2011: 227-236 Maximilian Dylla, Mauro Sozio, Martin Theobald: Resolving Temporal Conflicts in Inconsistent RDF Knowledge Bases. BTW 2011: 474-493 Yafang Wang, Mohamed Yahya, Martin Theobald: Time-aware Reasoning in Uncertain Knowledge Bases. MUD 2010: 51-65 Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Find your Advisor: Robust Knowledge Gathering from the Web. WebDB 2010 Anish Das Sarma, Martin Theobald, Jennifer Widom: LIVE: A Lineage-Supported Versioned DBMS. SSDBM 2010: 416-433 Anish Das Sarma, Martin Theobald, Jennifer Widom: Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases. ICDE 2008: 1023-1032 Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with uncertainty and lineage. VLDB J. 17(2): 243-264 (2008)

10 Years of Probabilistic Querying – What Next? Martin Theobald University of Antwerp Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig,

Similar presentations

Presentation on theme: "10 Years of Probabilistic Querying – What Next? Martin Theobald University of Antwerp Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig,"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

10 Years of Probabilistic Querying – What Next? Martin Theobald University of Antwerp Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig,

Similar presentations

Presentation on theme: "10 Years of Probabilistic Querying – What Next? Martin Theobald University of Antwerp Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig,"— Presentation transcript:

Similar presentations

About project

Feedback