Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)


1 Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute)

2 The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, Overview
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results

3 Overview
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results

4 Data Cleaning as Data Repairing
‣ Data cleaning is a very general term: standardization, entity resolution, data fusion…
‣ We consider one of its facets: data repairing
[Diagram: Dirty Table R (database table R) + Constraints C over R → Data Repairing → Cleaned Table R’]

5 A Motivating Example

Customers
      SSN  Name       Phone  PhConf  City  CC#
  t1  111  M. White                  NY
  t2  222  L. Lennon                 SF
  t3  222  L. Lennon                 SF

Master Data
      SSN  Name       Phone  Street   City
  tm  222  F. Lennon         Sky Dr.  SF

Treatments
      SSN  Salary  Insurance  Treat.     Date
  t4  111  10k     Abx        Dental     10/01/11
  t5  111  25k     Abx        Cholest.   08/12/12
  t6  222  30k     Med        Eye Surg.  06/10/12

Constraints
  fd1.  Cust: SSN, Name → Phone
  fd2.  Cust: SSN, Name → CC#
  fd3.  Treat: SSN → Salary
  cfd4. Treat: Insur[‘Abx’] → Tr[‘Dental’]
  cfd5. IF Treat: Insur[‘Abx’] THEN Cust: City[‘SF’]
  er6.  IF Cust.SSN = MD.SSN, Cust.Phone = MD.Phone → TAKE Name, Street, City from MD

Callouts: ?, INTERACTION!, PREFERRED VALUES
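As an illustration of what a violation of fd3 (Treat: SSN → Salary) looks like on the Treatments table, here is a minimal sketch. The helper name `fd_violations` and the tuple layout are assumptions made for this example and are not part of LLUNATIC.

```python
from collections import defaultdict

# Treatments rows as shown on the slide:
# (tuple id, SSN, Salary, Insurance, Treat., Date)
TREATMENTS = [
    ("t4", "111", "10k", "Abx", "Dental", "10/01/11"),
    ("t5", "111", "25k", "Abx", "Cholest.", "08/12/12"),
    ("t6", "222", "30k", "Med", "Eye Surg.", "06/10/12"),
]


def fd_violations(rows, lhs_idx, rhs_idx):
    """Return {lhs value: sorted tuple ids} for groups that disagree on the RHS.

    Hypothetical helper: checks a single-attribute FD lhs -> rhs.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs_idx]].append(row)
    return {
        key: sorted(r[0] for r in grp)
        for key, grp in groups.items()
        if len({r[rhs_idx] for r in grp}) > 1
    }


# fd3: SSN (index 1) -> Salary (index 2)
print(fd_violations(TREATMENTS, lhs_idx=1, rhs_idx=2))  # {'111': ['t4', 't5']}
```

Tuples t4 and t5 share SSN 111 but carry different salaries, so together they violate fd3; t6 is alone in its group and cannot violate it.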

6 Many Techniques
‣ Language of Constraints: FD - ID, Cond FD, Cond ID, Matching D, Edit Rules
‣ Choose Preferred Values: Currency, Confidence, Master Data
‣ Repair Selection: Cost-Minimality, Certain Fixes, Sampling
How to put everything together?

7 Many Techniques
‣ Language of Constraints: FD - ID, Cond FD, Cond ID, Matching D, Edit Rules
‣ Choose Preferred Values: Currency, Confidence, Master Data
‣ Repair Selection: Cost-Minimality, Certain Fixes, Sampling
Problem 1: missing semantics and missing repair algorithm

8 Many Techniques
‣ Language of Constraints: FD - ID, Cond FD, Cond ID, Matching D, Edit Rules
‣ Choose Preferred Values: Currency, Confidence, Master Data
‣ Repair Selection: Cost-Minimality, Certain Fixes, Sampling
Problem 2: missing formalism to handle user-specified preference rules

9 Many Techniques
‣ Language of Constraints: FD - ID, Cond FD, Cond ID, Matching D, Edit Rules
‣ Choose Preferred Values: Currency, Confidence, Master Data
‣ Repair Selection: Cost-Minimality, Certain Fixes, Sampling
Problem 3: no DBMS-based scalable implementations

10 System Comparison

       Depend. Language       Repair Strategy   Value Preference    Solution Selection
       FDs  CFDs  ERs  Tgds   RHS  LHS          Conf.  Curr.  MD    Cost  Cert.  Card.  Sampl.
  [1]  ✓    ✗     ✗    ✓      ✓    ✗            ✓      ✗      ✗     ✓     ✗      ✗      ✗
  [2]  ✓    ✓     ✗    ✗      ✓    ✓            ✓      ✓      ✗     ✓     ✗      ✗      ✗
  [3]  ✓    ✓     ✗    ✗      ✓    ✓            ✗      ✗      ✗     ✓     ✗      ✗      ✗
  [4]  ✗    ✗     ✓    ✗      ✓    ✗            ✗      ✗      ✓     ✗     ✓      ✗      ✗
  [5]  ✓    ✗     ✗    ✗      ✓    ✓            ✗      ✗      ✗     ✗     ✗      ✓      ✓

[1] Bohannon SIGMOD ’05 [2] Cong VLDB ’07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10

11 Our Goal

       Depend. Language       Repair Strategy   Value Preference    Solution Selection
       FDs  CFDs  ERs  Tgds   RHS  LHS          Conf.  Curr.  MD    Cost  Cert.  Card.  Sampl.
  [1]  ✓    ✗     ✗    ✓      ✓    ✗            ✓      ✗      ✗     ✓     ✗      ✗      ✗
  [2]  ✓    ✓     ✗    ✗      ✓    ✓            ✓      ✓      ✗     ✓     ✗      ✗      ✗
  [3]  ✓    ✓     ✗    ✗      ✓    ✓            ✗      ✗      ✗     ✓     ✗      ✗      ✗
  [4]  ✗    ✗     ✓    ✗      ✓    ✗            ✗      ✗      ✓     ✗     ✓      ✗      ✗
  [5]  ✓    ✗     ✗    ✗      ✓    ✓            ✗      ✗      ✗     ✗     ✗      ✓      ✓
  GOAL ✓    ✓     ✓    ✓      ✓    ✓            ✓      ✓      ✓     ✓     ✓      ✓      ✓

[1] Bohannon SIGMOD ’05 [2] Cong VLDB ’07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10

12 Overview
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results

13 Cleaning EGDs

Customers
      SSN  Name       Phone  PhConf  City  CC#
  t1  111  M. White                  NY
  t2  222  L. Lennon                 SF
  t3  222  L. Lennon                 SF

Master Data
      SSN  Name       Phone  Street   City
  tm  222  F. Lennon         Sky Dr.  SF

Treatments
      SSN  Salary  Insurance  Treat.     Date
  t4  111  10k     Abx        Dental     10/01/11
  t5  111  25k     Abx        Cholest.   08/12/12
  t6  222  30k     Med        Eye Surg.  06/10/12

Σ: Cleaning EGDs
  e1. Cust(ssn, n, ph, c, cc), Cust(ssn, n, ph’, c’, cc’) → ph = ph’
  e2. Cust(ssn, n, ph, c, cc), Cust(ssn, n, ph’, c’, cc’) → cc = cc’
  e3. Treat(ssn, sal, i, t, d), Treat(ssn, sal’, i’, t’, d’) → sal = sal’
  e4. Treat(ssn, s, ins, t, d), ins = ‘Abx’ → t = ‘Dental’
  e5. Cust(ssn, n, ph, c, cc), Treat(ssn, s, ins, t, d), ins = ‘Abx’ → c = ‘SF’
  e6. Cust(ssn, n, ph, c, cc), Master(ssn, n’, ph, s, c’) → n = n’, c = c’
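To make the egds with constants concrete, the following sketch checks e4 and e5 against the slide's data. Only the columns the two egds mention are kept, and the function names `e4_violations` and `e5_violations` are hypothetical, introduced for this example only.

```python
# Customers: (tuple id, SSN, City); the other columns are omitted in this sketch.
CUSTOMERS = [("t1", "111", "NY"), ("t2", "222", "SF"), ("t3", "222", "SF")]

# Treatments: (tuple id, SSN, Insurance, Treat.)
TREATMENTS = [
    ("t4", "111", "Abx", "Dental"),
    ("t5", "111", "Abx", "Cholest."),
    ("t6", "222", "Med", "Eye Surg."),
]


def e4_violations(treatments):
    """e4: a treatment insured by 'Abx' must be 'Dental'."""
    return [
        tid
        for tid, _ssn, ins, treat in treatments
        if ins == "Abx" and treat != "Dental"
    ]


def e5_violations(customers, treatments):
    """e5: a customer with an 'Abx' treatment (joined on SSN) must live in 'SF'."""
    return [
        (cid, tid)
        for cid, c_ssn, city in customers
        for tid, t_ssn, ins, _treat in treatments
        if c_ssn == t_ssn and ins == "Abx" and city != "SF"
    ]


print(e4_violations(TREATMENTS))             # ['t5']
print(e5_violations(CUSTOMERS, TREATMENTS))  # [('t1', 't4'), ('t1', 't5')]
```

Note that e5 spans two tables: customer t1 is clean when Customers is inspected alone and only becomes a violation once it is joined with the Abx treatments t4 and t5.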

14 Semantics
‣ Cell Groups
‣ Upgrades
‣ LLUNs
‣ Partial Order

15 The Partial Order
‣ Standard preference rules
    NULLS ≺ CONST (constants are preferred to nulls)
    source values are preferred to target values
‣ Ordering attribute
    Π(Cust.Phone) = {dom(Cust.PhConf), ≥}
    Π(Treat.Salary) = {dom(Treat.Date), ≥ time}
‣ No order
    Π(Cust.CC#) = ∅

Customers
      SSN  Name       Phone  PhConf  City  CC#
  t1  111  M. White                  NY
  t2  222  L. Lennon                 SF
  t3  222  L. Lennon                 SF

Master Data
      SSN  Name       Phone  Street   City
  tm  222  F. Lennon         Sky Dr.  SF

Treatments
      SSN  Salary  Insurance  Treat.     Date
  t4  111  10k     Abx        Dental     10/01/11
  t5  111  25k     Abx        Cholest.   08/12/12
  t6  222  30k     Med        Eye Surg.  06/10/12
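The ordering attribute for Treat.Salary can be read as "prefer the salary whose Date is most recent". A minimal sketch of that reading follows; the function name `preferred_salary` is hypothetical, and the day/month/year date format is an assumption (the two dates used here fall in different years, so the result does not depend on it).

```python
from datetime import datetime


def preferred_salary(rows):
    """rows: list of (salary, date string). Return the salary with the latest date.

    Assumes dd/mm/yy dates; this is a guess about the slide's format.
    """
    latest = max(rows, key=lambda r: datetime.strptime(r[1], "%d/%m/%y"))
    return latest[0]


# Salaries of SSN 111 from tuples t4 and t5 of the Treatments table.
print(preferred_salary([("10k", "10/01/11"), ("25k", "08/12/12")]))  # 25k
```

With Π(Cust.CC#) = ∅ there is no such attribute to consult, so two conflicting credit card numbers cannot be ranked against each other this way.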

16 Cell Groups
‣ We model partial repairs by cell groups
    a cell group is a set of target cells that must be changed together
    we change the value of all occurrences
    we do not disrupt this equality in the following
‣ Cell groups also carry lineage information (justifications)
‣ Partial order over cell groups

Customers
      SSN  Name       Phone  PhConf
  t1  222  L. Lennon
  t2  222  F. Lennon

Master Data
      Name       Phone
  tm  L. Lennon  999

  e1. Cust(ssn, n, ph, c, cc), Cust(ssn, n’, ph’, c’, cc’) → ph = ph’
  e2. Cust(s, n, ph), Master(n, ph’) → ph = ph’

[Figure: two example cell groups g1 and g2, with their parts labelled value, occurrences, and justifications]
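One way to picture a cell group as a data structure is sketched below. This is an illustrative assumption, not LLUNATIC's actual representation: the class `CellGroup`, the helpers `merge` and `apply_group`, and the `None` placeholders for phone values (which are not available in this transcript) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellGroup:
    """A value, the target cells that take it, and the source cells justifying it."""
    value: object
    occurrences: frozenset                    # target cells: (tuple id, attribute)
    justifications: frozenset = frozenset()   # source cells: (tuple id, attribute)


def merge(a, b, value):
    """Merge two cell groups so that all their occurrences change together."""
    return CellGroup(
        value,
        a.occurrences | b.occurrences,
        a.justifications | b.justifications,
    )


def apply_group(db, group):
    """Return a copy of db in which every occurrence holds the group's value."""
    new_db = {tid: dict(row) for tid, row in db.items()}
    for tid, attr in group.occurrences:
        new_db[tid][attr] = group.value
    return new_db


# Phone values are placeholders (None) because the slide's values are not available.
customers = {
    "t1": {"SSN": "222", "Name": "L. Lennon", "Phone": None},
    "t2": {"SSN": "222", "Name": "F. Lennon", "Phone": None},
}

# e2 equates t1.Phone with the master value 999; e1 equates t1.Phone and t2.Phone.
from_master = CellGroup("999", frozenset({("t1", "Phone")}), frozenset({("tm", "Phone")}))
other_phone = CellGroup(None, frozenset({("t2", "Phone")}))
group = merge(from_master, other_phone, "999")

repaired = apply_group(customers, group)
print(repaired["t1"]["Phone"], repaired["t2"]["Phone"])  # 999 999
```

Keeping the occurrences together means a later step cannot change one phone without the other, which is how the equality enforced earlier is preserved.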

17 Upgrades

Starting database J
      SSN  Name       Phone
  t2  222  L. Lennon  123
  t3  222  F. Lennon  000

Repair 1 (an upgrade of J)
      SSN  Name       Phone
  t2  222  L. Lennon  123
  t3  222  F. Lennon  123

Repair N (an upgrade of J)
      SSN  Name       Phone
  t2  222  L. Lennon  999
  t3  222  F. Lennon  999

18 LLUNs
‣ There are cases in which we don’t have any clear strategy to remove a violation
‣ LLUNs: a new class of symbols
    placeholders used to mark conflicts
    not only an unknown value but rather a “hypervalue”
    the opposite of a NULL, since it upgrades constants

  e3. Cust(ssn, n, ph, cc), Cust(ssn, n, ph’, cc’) → cc = cc’

      SSN  Name       Phone  CC#
  t2  222  L. Lennon         L0
  t3  222  L. Lennon         L0
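The idea that a LLUN sits above constants, just as constants sit above nulls, can be sketched with a three-level ranking. The class `Llun` and the functions `rank` and `upgrades` are hypothetical names for this illustration, and `None` stands in for NULL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Llun:
    """A labelled placeholder that marks a conflict (illustrative only)."""
    label: str


def rank(value):
    """0 for NULL (None), 1 for a constant, 2 for a LLUN."""
    if value is None:
        return 0
    if isinstance(value, Llun):
        return 2
    return 1


def upgrades(new, old):
    """True if `new` is strictly preferred to `old` by kind alone.

    Two different constants are incomparable here; ranking them would need
    an ordering attribute such as a confidence or a date.
    """
    return rank(new) > rank(old)


L0 = Llun("L0")
print(upgrades("781-658", None))   # True: a constant upgrades a NULL
print(upgrades(L0, "781-658"))     # True: a LLUN upgrades a constant
print(upgrades("784-659", "781-658"))  # False: two constants, no order
```

The credit card numbers in the usage lines are made-up sample constants, since the CC# values are not available in this transcript.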

19 Scenarios and Solutions
Cleaning Scenario C
‣ S: source schema, T: target schema
‣ Σ: set of cleaning EGDs
‣ Π: the partial order specification, i.e., the way to specify when one value is preferable to another
Solution: given C, an instance I of S, and an instance J of T, compute an instance J’ such that
‣ it is a repair, i.e., “I and J’ satisfy Σ”
‣ “J’ is an upgrade of J according to Π”

20 The Chase Algorithm
To compute solutions: a chase algorithm for chasing egds

  e1. Cust(ssn, n, ph, c, cc), Cust(ssn, n’, ph’, c’, cc’) → ph = ph’

Repair 1 (forward)
      SSN  Name       Phone
  t2  222  L. Lennon  123
  t3  222  F. Lennon  123

Repair 2 (backward)
      SSN  Name       Phone
  t2  L1   L. Lennon  123
  t3  222  F. Lennon  000

Repair 3 (backward)
      SSN  Name       Phone
  t2  222  L. Lennon  123
  t3  L2   F. Lennon  000
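The three repairs on this slide can be reproduced with a small sketch of a single chase step for e1. It is not the LLUNATIC implementation: the function names are hypothetical, LLUNs are modelled as plain strings, and the preferred phone value for the forward repair is passed in by the caller because the slide does not say how 123 was chosen.

```python
import copy


def violates_e1(db):
    """e1: two Cust tuples with the same SSN must have the same Phone."""
    rows = list(db.values())
    return any(
        a["SSN"] == b["SSN"] and a["Phone"] != b["Phone"]
        for i, a in enumerate(rows)
        for b in rows[i + 1:]
    )


def repairs_for_e1(db, t_a, t_b, preferred_phone, lluns):
    """Return [forward, backward on t_a, backward on t_b] for one e1 violation."""
    # Forward: make the conclusion true by equating the two phones.
    forward = copy.deepcopy(db)
    forward[t_a]["Phone"] = preferred_phone
    forward[t_b]["Phone"] = preferred_phone

    # Backward: make the premise false by replacing one SSN with a LLUN.
    backward = []
    for tid, llun in zip((t_a, t_b), lluns):
        repaired = copy.deepcopy(db)
        repaired[tid]["SSN"] = llun
        backward.append(repaired)
    return [forward] + backward


J = {
    "t2": {"SSN": "222", "Name": "L. Lennon", "Phone": "123"},
    "t3": {"SSN": "222", "Name": "F. Lennon", "Phone": "000"},
}

repair1, repair2, repair3 = repairs_for_e1(J, "t2", "t3", "123", ("L1", "L2"))
print(violates_e1(J))                                          # True
print([violates_e1(r) for r in (repair1, repair2, repair3)])   # [False, False, False]
```

Each backward repair leaves both phone numbers untouched and instead records, through the LLUN, that the SSN of one tuple is in doubt.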

21 A Few Results
Given a cleaning scenario C and instances I and J:
‣ C always has a solution for I and J
‣ The chase always terminates (it never fails)
‣ The chase computes all minimal solutions
‣ The number of minimal solutions is exponential in the size of J

22 Overview
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results

23 Chase Tree
[Figure: chase tree rooted at J. Edges are labelled with an egd and a step: (e0, b1), (e0, b2), (e0, f), (e1, b1), (e1, b2), (e1, f). Nodes R1 to R6 belong to the e0-e1 sequence and nodes R10 to R15 to the e1-e0 sequence.]
‣ Different orders of application give different results
‣ Our goal: to make this scalable!

24 Scalability Techniques
‣ Chase implementation based on equivalence classes
‣ Delta databases: a representation system for chase trees
‣ Cost managers: pluggable strategies to prune the chase tree
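The last two techniques can be pictured with a toy sketch: each chase-tree node stores only the cells it changes relative to its parent, and a pluggable function decides which nodes to keep. Everything here is an assumption for illustration (the class `ChaseNode`, the cost measure "number of changed cells", and the strategy `keep_cheapest`); the slides do not describe LLUNATIC's delta databases or cost managers at this level of detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class ChaseNode:
    """A chase-tree node stored as a delta against its parent."""
    delta: dict                           # {(tuple id, attribute): new value}
    parent: Optional["ChaseNode"] = None

    def path_deltas(self):
        """Deltas from the root down to this node."""
        node, chain = self, []
        while node is not None:
            chain.append(node.delta)
            node = node.parent
        return list(reversed(chain))

    def materialize(self, base):
        """Rebuild the full instance by replaying deltas over the base instance."""
        db = {tid: dict(row) for tid, row in base.items()}
        for delta in self.path_deltas():
            for (tid, attr), value in delta.items():
                db[tid][attr] = value
        return db

    def cost(self):
        """Number of distinct cells changed on the path from the root."""
        cells = set()
        for delta in self.path_deltas():
            cells.update(delta)
        return len(cells)


def keep_cheapest(nodes, k):
    """A hypothetical cost manager: keep the k nodes with the fewest changed cells."""
    return sorted(nodes, key=lambda n: n.cost())[:k]


base = {
    "t2": {"SSN": "222", "Name": "L. Lennon", "Phone": "123"},
    "t3": {"SSN": "222", "Name": "F. Lennon", "Phone": "000"},
}
root = ChaseNode({})
forward = ChaseNode({("t3", "Phone"): "123"}, root)
backward = ChaseNode({("t2", "SSN"): "L1"}, root)
deeper = ChaseNode({("t3", "Name"): "L. Lennon"}, forward)

print(deeper.materialize(base)["t3"])  # {'SSN': '222', 'Name': 'L. Lennon', 'Phone': '123'}
print([n.cost() for n in keep_cheapest([deeper, forward, backward], 2)])  # [1, 1]
```

Because a node holds only its own changes, siblings share the unchanged data of their ancestors, and swapping `keep_cheapest` for another function changes the pruning policy without touching the tree.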

25 Scalability
[Charts: running time (sec.); panels Customers, Hospital, Customers; x-axis ticks 100K, 250K, 400K, 550K, 700K, 850K, 1M; series LLUNATIC-FR-S10, LLUNATIC-FR-S1, LLUNATIC-FR-S1-FO, LLUNATIC-FR-S50]
LLUNATIC is the first scalable DBMS-based chase algorithm for data repairing

26 Quality of Repairs
[Chart: Hospital dataset; sizes 5k, 10k, 25k; series LLUNATIC-FR-S10, LLUNATIC-FR-S1, LLUNATIC-FR-S1-FO, Sampling-500, Vertex Cover, Min. Cost]

27 That’s all Folks!

