Probabilistic Database Repairing


1 Probabilistic Database Repairing
Benny Kimelfeld, Technion Data & Knowledge Lab. Collaborators: Christopher De Sa, Ihab Ilyas, Ester Livshits, Christopher Ré, Theodoros Rekatsinas, Sudeepa Roy. Acknowledgments:

2 Cleaning in Information Extraction
Text queries produce inconsistent results: data artifacts, developer limitations.
(Figure: overlapping extraction spans Person1, Person2, Address1 over the text "33 Martin Luther King Jr. Dr., SE, Atlanta, GA 30303".)
Systems provide mechanisms for post-extraction repair: SystemT "consolidators" [Chiticariu+ 10], GATE/JAPE "controls" [Cunningham 02], WHISK [Soderland 99], POSIX regex [Fowler 03].
[Fagin, K, Reiss, Vansummeren TODS 16]: a unifying concept via prioritized database repairs [Staworko+ 2012].

3 Inconsistency in the DBpedia KB
(Figure: conflicting DBpedia facts. Marion Jones, Cullen Douglas, and Irene Tedrow carry inconsistent dbo:height, dbo:birthPlace, and dbo:deathPlace values: 1.524, 1.778; dbr:California, dbr:Florida, dbr:Hollywood,_Los_Angeles, dbr:New_York_City. David Saxe and Melinda Saxe are linked by dbo:parent but have conflicting dbo:birthYear values 1965 and 1969.)

4 Sources of Inconsistent Data
Imprecise data sources: crowd, Web pages, social encyclopedias, sensors, …
Imprecise data generation: ETL, natural-language processing, sensor/signal processing, image recognition, …
Conflicts in data integration: crowd + enterprise data + KB + Web + …
Data staleness: entities change address, status, …
And so on …

5

6 Principled Declarative Approaches
Several principled approaches have been proposed for reasoning about inconsistent data. Concepts in declarative approaches:
Integrity constraints (or dependencies)
Inconsistent database: violates the constraints
Edit operations: delete/insert a tuple, update an attribute
Repairs: a consistent DB obtained by a legitimate edit
Clean formulation by Arenas, Bertossi, Chomicki, 1999.

7 Examples of Integrity Constraints
Key constraints: Person(ssn, name, birthCity, birthState)
Functional dependencies (FDs): birthCity ⟶ birthState
Conditional FDs: birthCity ⟶ birthState whenever country = "USA"
Denial constraints: not[ Parent(x,y) & Parent(y,x) ]
Referential constraints: inclusion dependencies
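As a concrete illustration of the simplest constraint above, an FD check fits in a few lines. This is a minimal sketch, not from the talk: the relation encoding as a list of dicts and the helper name `violates_fd` are assumptions for illustration.

```python
# Sketch (illustrative, not from the talk): detecting violations of an
# FD lhs -> rhs on a relation stored as a list of dicts.

def violates_fd(rows, lhs, rhs):
    """Return pairs of rows that jointly violate the FD lhs -> rhs."""
    first_seen = {}   # maps an lhs value to (rhs value, witnessing row)
    violations = []
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in first_seen and first_seen[key][0] != val:
            violations.append((first_seen[key][1], row))
        else:
            first_seen.setdefault(key, (val, row))
    return violations

people = [
    {"person": "Douglas", "birthCity": "LA",    "birthState": "CA"},
    {"person": "Douglas", "birthCity": "Miami", "birthState": "FL"},
]
# The two rows agree on person but disagree on birthCity,
# so the FD person -> birthCity is violated once.
print(len(violates_fd(people, ["person"], ["birthCity"])))
```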

8 Example: Inconsistent Database
person ⟶ birthCity
birthCity ⟶ birthState
Table (person | birthCity | birthState): Douglas LA CA, Miami FL, Tedrow NYC, Jones

9 birthCity ⟶ birthState
person ⟶ birthCity
birthCity ⟶ birthState
Table (person | birthCity | birthState): Douglas LA CA, Miami FL, Tedrow NYC, Jones
Subset repairs: a set-minimal collection of deleted tuples; minimizing the number of deleted tuples yields cardinality repairs.
Update repairs: a set-minimal collection of cell updates; alternatively, minimize the number of cell updates. Example updated tables: Douglas Miami CA, Tedrow LA NYC, Jones; and Douglas Miami FL, Tedrow LA CA, Jones.
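A cardinality repair as described above can be computed by brute force on small instances. This is a hedged sketch under the subset-repair semantics only; the FD encoding and the example rows are illustrative, and the exponential search is for exposition, not the talk's algorithms.

```python
# Sketch (illustrative): a cardinality repair keeps a maximum number of
# tuples while satisfying all FDs. Exponential search; tiny examples only.
from itertools import combinations

def satisfies(rows, fds):
    """Check whether every FD (lhs, rhs) holds on the given rows."""
    for lhs, rhs in fds:
        seen = {}
        for r in rows:
            key = tuple(r[a] for a in lhs)
            val = tuple(r[a] for a in rhs)
            if seen.setdefault(key, val) != val:
                return False
    return True

def cardinality_repair(rows, fds):
    """A consistent sub-instance with a maximum number of kept tuples."""
    for keep in range(len(rows), -1, -1):
        for idx in combinations(range(len(rows)), keep):
            cand = [rows[i] for i in idx]
            if satisfies(cand, fds):
                return cand
    return []

fds = [(("person",), ("birthCity",)), (("birthCity",), ("birthState",))]
rows = [
    {"person": "Douglas", "birthCity": "LA",    "birthState": "CA"},
    {"person": "Douglas", "birthCity": "Miami", "birthState": "FL"},
    {"person": "Tedrow",  "birthCity": "LA",    "birthState": "NY"},
]
repair = cardinality_repair(rows, fds)
print(len(repair))  # at most one tuple must be deleted here
```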

10 Studied Computational Problems
Repairing / cleaning: compute a (good/best) repair.
Consistent query answering: which query answers are not affected by inconsistency? Formally, find the tuples that belong to Q(J) for all repairs J.
Repair checking: test whether a given candidate is a repair. Formally, given I and J, is J a repair of I?
Repair counting: count the repairs (that satisfy a given constraint).

11 Biasing Repairs In traditional theory, all repairs are equally good
There are reasons to prefer one repair over another.
Different levels of reliability: enterprise DB vs. Web extraction, varying sensor qualities, varying extraction precisions, deceit, …
Staleness arguments: timestamps and version numbers are likely to increase, salaries are likely to grow, single becomes married/divorced (not vice versa), …
Common ways of expressing bias: a preference relationship (pairwise, partial order) or weights (pointwise, probability).

12 3 Levels of Preferences
Attribute-level preferences: [Fan, Geerts, Wijsen 2012], [Cao, Fan, Yu 2013], Llunatic system [Geerts+ 2013]
Tuple-level preferences: [van Nieuwenborgh, Vermeir 2002], [Staworko, Chomicki, Marcinkowski 2012], [Bienvenu, Bourgaux, Goasdoué 2014], [K, Livshits, Peterfreund 2017], [Livshits, K 2017]; captures IE cleaning policies [Fagin+ 2016]
Repair-level preferences: [Flesca, Furfaro, Parisi 2007]; probabilistic repairs: HoloClean [Rekatsinas, Chu, Ilyas, Ré 2017], most probable databases [Gribkoff, Van den Broeck, Suciu 2014]

13

14 Probabilistic Duplicates [Andritsos, Fuxman, Miller 06]
person ⟶ birthCity, birthState
Table (person | birthCity | birthState | p): Cullen Douglas LA CA 0.6, Tampa FL 0.4; Marion Jones 1.0; Irene Tedrow NYC NY 0.3, Hollywood 0.2, 0.1
Same as "Block Independent Disjoint" (BID) DBs [Dalvi, Ré, Suciu 09]


16 Probabilistic Duplicates [Andritsos, Fuxman, Miller 06]
person ⟶ birthCity
birthCity ⟶ birthState
Table (person | birthCity | birthState | p): Cullen Douglas LA CA 0.6, Tampa FL 0.4; Marion Jones 1.0; Irene Tedrow NYC NY 0.3, Hollywood 0.2, 0.1
How can we generalize beyond key constraints?
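The BID semantics above has a simple possible-worlds reading: tuples sharing a key form a block, alternatives within a block are mutually exclusive, and blocks are independent. A minimal sketch (the dict encoding is an illustrative assumption, and rows with cells elided on the slide are omitted):

```python
# Sketch (illustrative): probability of a possible world in a
# block-independent-disjoint (BID) table. Blocks are independent;
# alternatives within a block are disjoint events.

blocks = {
    # key: list of ((birthCity, birthState), probability) alternatives
    "Cullen Douglas": [(("LA", "CA"), 0.6), (("Tampa", "FL"), 0.4)],
    "Irene Tedrow":   [(("NYC", "NY"), 0.3)],  # further alternatives elided on the slide
}

def world_probability(choice):
    """Probability of the world that picks alternative choice[key] per block."""
    p = 1.0
    for key, i in choice.items():
        p *= blocks[key][i][1]
    return p

# Choosing (LA, CA) for Douglas and (NYC, NY) for Tedrow: 0.6 * 0.3
p = world_probability({"Cullen Douglas": 0, "Irene Tedrow": 0})
print(p)
```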

17 MPD [Gribkoff, Van den Broeck, Suciu 14]
= Most Probable Database
person ⟶ birthCity
birthCity ⟶ birthState
Table (person | birthCity | birthState | p): Cullen Douglas LA CA 0.6, Tampa FL 0.7; Marion Jones 0.9; Irene Tedrow NYC NY, Hollywood 0.5, 0.8
The most probable world, conditioned on dependency satisfaction.

18 MPD [Gribkoff, Van den Broeck, Suciu 14]
person ⟶ birthCity
birthCity ⟶ birthState
Table (person | birthCity | birthState | p): Cullen Douglas LA CA 0.6, Tampa FL 0.7; Marion Jones 0.9; Irene Tedrow NYC NY, Hollywood 0.5, 0.8
Per-tuple factors for the highlighted world: 0.6, 1−0.7, 1−0.9, 1−0.6, 1−0.5, 0.8
Objective: max over J of ∏_{t∈J} p_t × ∏_{t∉J} (1 − p_t)

19 MPD [Gribkoff, Van den Broeck, Suciu 14]
person ⟶ birthCity
birthCity ⟶ birthState
Table (person | birthCity | birthState | p): Cullen Douglas LA CA 0.6, Tampa FL 0.7; Marion Jones 0.9; Irene Tedrow NYC NY, Hollywood 0.5, 0.8
Per-tuple factors for the highlighted world: 1−0.6, 0.7, 0.9, 1−0.9, 1−0.5, 0.8
Objective: max over J of ∏_{t∈J} p_t × ∏_{t∉J} (1 − p_t)
Can it be computed efficiently?
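The MPD objective can be made concrete with a brute-force search: maximize ∏_{t∈J} p_t × ∏_{t∉J} (1 − p_t) over consistent subsets J. This sketch is purely illustrative of the objective (exponential enumeration, tiny examples only); it is not the talk's algorithmic contribution, which concerns when this can be done in polynomial time.

```python
# Sketch (illustrative): brute-force Most Probable Database for a
# tuple-independent table under FD constraints.
from itertools import combinations

def satisfies(rows, fds):
    """Check whether every FD (lhs, rhs) holds on the given rows."""
    for lhs, rhs in fds:
        seen = {}
        for r in rows:
            key = tuple(r[a] for a in lhs)
            val = tuple(r[a] for a in rhs)
            if seen.setdefault(key, val) != val:
                return False
    return True

def most_probable_database(rows, probs, fds):
    """argmax over consistent J of prod p_t (t in J) * prod (1 - p_t) (t not in J)."""
    best_J, best_p = [], -1.0
    n = len(rows)
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            J = [rows[i] for i in idx]
            if not satisfies(J, fds):
                continue
            chosen = set(idx)
            p = 1.0
            for i in range(n):
                p *= probs[i] if i in chosen else 1.0 - probs[i]
            if p > best_p:
                best_J, best_p = J, p
    return best_J, best_p

fds = [(("person",), ("birthCity",))]
rows = [
    {"person": "Douglas", "birthCity": "LA"},     # p = 0.6
    {"person": "Douglas", "birthCity": "Tampa"},  # p = 0.7
]
best_J, best_p = most_probable_database(rows, [0.6, 0.7], fds)
print(best_J, best_p)  # keeping only the Tampa tuple scores 0.4 * 0.7
```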

20 Tractable vs. Inapproximable FD Sets
Examples of FD sets (the slide sorts them into tractable and inapproximable columns):
person ⟶ birthCity birthState
person ⟶ birthCity; birthCity ⟶ birthState
person state ⟶ license#; license# ⟶ person state
birthPlace ⟶ BPType; person BPType ⟶ birthPlace
country person ⟶ voterID; country voterID ⟶ person

21 Dichotomy Theorem [Livshits, K, Roy 2018]
For any set of FDs, exactly one of two situations holds:
1. MPD can be found in polynomial time (tractable).
2. It is NP-hard to find even an approximate MPD, for any constant or polynomial ratio (inapproximable).
Moreover, we can efficiently test which of the two situations holds.
Example FD sets: person ⟶ birthCity birthState; person ⟶ birthCity, birthCity ⟶ birthState; person state ⟶ license#, license# ⟶ person state; birthPlace ⟶ BPType, person BPType ⟶ birthPlace; country person ⟶ voterID, country voterID ⟶ person.
The same dichotomy holds for computing a cardinality repair!

22 Dichotomy Theorem [Livshits, K, Roy 2018]
For any set of FDs, exactly one of two situations holds: MPD can be found in polynomial time, or it is NP-hard to find even an approximate MPD, for any constant or polynomial ratio. Moreover, we can efficiently test which of the two situations holds.
Yet, if we measure log-likelihood instead of probability, then a constant-factor approximation is feasible for every set of anti-monotone constraints.
The same dichotomy holds for computing a cardinality repair!

23

24 PUD: A Noisy Channel Model
Parameters via ML.
Pipeline: Probabilistic Data Generator → Intention → Probabilistic Noise Generator → Realization.
Traditional constraints fit in the probabilistic data generator; edit operations fit in the probabilistic noise generator.
We are given the realization; we reason about the intention and the generators.
HoloClean [Rekatsinas, Chu, Ilyas, Ré PVLDB 2017], [De Sa, Ilyas, K, Ré, Rekatsinas 2018]

25 PUD Example 1: MarkovLogic/Update
Pipeline: Probabilistic Data Generator → Intention → Probabilistic Noise Generator → Realization.
Data generator: i.i.d. tuple generator, i.i.d. value generator, Markov logic for cross-attribute & cross-tuple dependencies.
Noise generator: randomly change cell values.
"Generalizes" minimum update repairs.

26 Intention and Realization (ML/update example)
Weak constraints, with penalties −50 and −5:
t1.person = t2.person & t1.birthCity ≠ t2.birthCity
t1.birthCity = t2.birthCity & t1.birthCountry ≠ t2.birthCountry
Markov logic: Prob(I) ∝ exp(Σ penalties(I))
Intention I (person | birthCity | birthCountry): Douglas LA USA, Tampa; Khan Ghajar Lebanon, Israel
Realization J: Douglas LA USA, Tampa; Khan Ghajar Lebanon, Rajar Israel
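The Markov-logic score on the slide can be sketched directly: each violated grounding of a weak constraint contributes its penalty, and the instance probability is proportional to the exponentiated (negated) total. The weights 50 and 5 follow the slide's −50/−5 penalties, but their assignment to the two constraints and the example rows are illustrative assumptions.

```python
# Sketch (illustrative): Markov-logic-style scoring,
# Prob(I) proportional to exp(-sum of penalties of violated groundings).
import math

def penalty(rows):
    total = 0.0
    for t1 in rows:
        for t2 in rows:
            if t1 is t2:
                continue
            # weak constraint: one person should have one birth city
            # (penalty weight assumed to be the slide's 50)
            if t1["person"] == t2["person"] and t1["birthCity"] != t2["birthCity"]:
                total += 50
            # weak constraint: one city should lie in one country
            # (penalty weight assumed to be the slide's 5)
            if t1["birthCity"] == t2["birthCity"] and t1["birthCountry"] != t2["birthCountry"]:
                total += 5
    return total

def unnormalized_prob(rows):
    """Score proportional to the instance probability (normalizer omitted)."""
    return math.exp(-penalty(rows))

I = [
    {"person": "Douglas", "birthCity": "LA",    "birthCountry": "USA"},
    {"person": "Douglas", "birthCity": "Tampa", "birthCountry": "USA"},
]
# Both ordered pairs violate the person/birthCity constraint: total penalty 100.
print(penalty(I))
```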

27 PUD Example 2: MarkovLogic/Subset
Pipeline: Probabilistic Data Generator → Intention → Probabilistic Noise Generator → Realization.
Data generator: i.i.d. tuple generator, i.i.d. value generator, Markov logic for cross-attribute & cross-tuple dependencies.
Noise generator: randomly add new tuples.
"Generalizes" cardinality repairs.

28 Intention and Realization (ML/subset example)
Weak constraints, with penalties −50 and −5:
t1.person = t2.person & t1.birthCity ≠ t2.birthCity
t1.birthCity = t2.birthCity & t1.birthCountry ≠ t2.birthCountry
Markov logic: Prob(I) ∝ exp(Σ penalties(I))
Intention I (person | birthCity | birthCountry): Douglas LA USA, Tampa; Khan Ghajar Lebanon, Israel, NYC
Realization J: adds noisy tuples Douglas NYC NY and Khan Ghajar Syria.

29 Fundamental Problems
Pipeline: Probabilistic Data Generator → Intention I → Probabilistic Noise Generator → Realization J.
Most likely intent: argmax_I Pr(I | J); deterministic variant: cleaning / repair generation.
Probabilistic query answering: Pr(a ∈ Q(I) | J); deterministic variant: consistent query answering, repair counting.
Parameter learning.

30 Preliminary Results
Results on ML/update and ML/subset.
Connections between most-likely intent and minimum repairs: ML/subset generalizes cardinality repairs; ML/update generalizes minimum-update repairs.
Most-likely-intent algorithms for key constraints.
Learning results: convexity and algorithms for special cases; under certain assumptions, parameters can be learned from a dirty database without clean examples.

31 Thanks!
Collaborators: Christopher De Sa, Ihab Ilyas, Ester Livshits, Christopher Ré, Theodoros Rekatsinas, Sudeepa Roy

