
1 Linking Records with Value Diversity
Pei Li, University of Milan – Bicocca
Advisor: Andrea Maurino
AT&T Labs - Research: Xin Luna Dong, Divesh Srivastava
October 2012

2 Some Statistics from DBLP
- How many Wei Wang’s are there?
- What are their authoring histories?

3 Some Statistics from YellowPages
- Are there any business chains?
- If yes, which businesses are their members?

4 Record Linkage
What is record linkage (entity resolution)?
- Input: a set of records
- Output: clustering of records
A critical problem in data integration and data cleaning
- “A reputation for world-class quality is profitable, a ‘business maker’.” – William E. Winkler
Current work (surveyed in [Elmagarmid, 07], [Koudas, 06]):
- assume that records of the same entities are consistent
- often focus on different representations of the same value, e.g., “IBM” and “International Business Machines”

5 New Challenges
In reality, we observe value diversity of entities:
- Values can evolve over time, e.g., Catholic Healthcare ( ) → Dignity Health (2012 -)
- Different records of the same group can have “local” values

| ID | Name | Address | Phone | URL |
|---|---|---|---|---|
| 001 | F.B. Insurance | Vernon TX | | txfb-ins.com |
| 002 | F.B. Insurance #1 | Lufkin TX | | txfb.org |
| 003 | F.B. Insurance #5 | Cibolo TX | | |

- Some sources may provide erroneous values

| ID | Name | URL | Source |
|---|---|---|---|
| 001 | Meekhof Tire Sales & Service Inc | www.meekhoftire.com | Src |
| | Meekhof Tire Sales & Service Inc | www.napaautocare.com | Src. 2 |

6 My Goal
To improve the linkage quality of integrated data with fairly high diversity:
- linking temporal records [VLDB ’11] [VLDB ’12 demo] [FCS Journal ’12]
- linking records of the same group [Under preparation for SIGMOD ’13]

7 Outline
- Motivation
- Linking temporal records
  - Decay
  - Temporal clustering
  - Demo
- Linking records of the same group
- Related work
- Conclusions & Future work

8
r1: Xin Dong, R. Polytechnic Institute
r2: Xin Dong, University of Washington
r7: Dong Xin, University of Illinois
r3: Xin Dong, University of Washington
r4: Xin Luna Dong, University of Washington
r8: Dong Xin, University of Illinois
r9: Dong Xin, Microsoft Research
r5: Xin Luna Dong, AT&T Labs-Research
r10: Dong Xin, University of Illinois
r11: Dong Xin, Microsoft Research
r6: Xin Luna Dong, AT&T Labs-Research
r12: Dong Xin, Microsoft Research
- How many authors?
- What are their authoring histories?

9 (Records r1 to r12 as on slide 8.)
- Ground truth: 3 authors

10 (Records r1 to r12 as on slide 8.)
- Solution 1: requiring high value consistency
- Result: 5 authors (false negatives)

11 (Records r1 to r12 as on slide 8.)
- Solution 2: matching records with similar names
- Result: 2 authors (false positives)

12 Opportunities

| ID | Name | Affiliation | Co-authors | Year |
|---|---|---|---|---|
| r1 | Xin Dong | R. Polytechnic Institute | Wozny | 1991 |
| r2 | Xin Dong | University of Washington | Halevy, Tatarinov | 2004 |
| r7 | Dong Xin | University of Illinois | Han, Wah | 2004 |
| r3 | Xin Dong | University of Washington | Halevy | 2005 |
| r4 | Xin Luna Dong | University of Washington | Halevy, Yu | 2007 |
| r8 | Dong Xin | University of Illinois | Wah | 2007 |
| r9 | Dong Xin | Microsoft Research | Wu, Han | 2008 |
| r10 | Dong Xin | University of Illinois | Ling, He | 2009 |
| r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2009 |
| r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009 |
| r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2010 |
| r12 | Dong Xin | Microsoft Research | He | 2011 |

- Smooth transition
- Seldom erratic changes
- Continuity of history

13 Intuitions
(Same records r1 to r12 as in the table on slide 12.)
- Less penalty on different values over time
- Less reward on the same value over time
- Consider records in time order for clustering

15 Disagreement Decay
Intuition: different values over a long time are not a strong indicator of referring to different entities.
- Example: University of Washington (01-07), AT&T Labs-Research (07-date)
Definition (Disagreement decay): the disagreement decay of attribute A over time ∆t is the probability that an entity changes its A-value within time ∆t.
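The definition above can be turned into a simple empirical estimate. The sketch below is one plausible estimator, not necessarily the one used in the talk: it assumes we have labeled "life spans" of attribute values, split into spans that ended with an observed change (`full_spans`) and spans still open at the last observation (`partial_spans`). Both input names and the sample numbers are hypothetical.

```python
def disagreement_decay(full_spans, partial_spans, dt):
    """Estimate P(an entity changes its value within dt years).

    full_spans:    lengths of value life spans that ended with a change.
    partial_spans: lengths of life spans still open when observation stopped.
    Inputs and estimator are illustrative assumptions.
    """
    # Life spans that ended within dt are observed changes.
    changed = sum(1 for s in full_spans if s <= dt)
    # An open span of at least dt is evidence of "no change within dt";
    # shorter open spans say nothing about a window of length dt.
    still_open = sum(1 for s in partial_spans if s >= dt)
    denom = len(full_spans) + still_open
    return changed / denom if denom else 0.0


# Hypothetical life spans (in years) of the affiliation attribute.
full = [2, 3, 6, 8]
partial = [1, 4, 10]
for dt in (1, 3, 5, 10):
    print(dt, round(disagreement_decay(full, partial, dt), 2))
```

On this toy input the estimate grows with ∆t, which matches the intuition that a change becomes more likely the longer we wait.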

16 Agreement Decay
Intuition: the same value over a long time is not a strong indicator of referring to the same entity.
- Example: Adam Smith ( ), Adam Smith (1965-)
Definition (Agreement decay): the agreement decay of attribute A over time ∆t is the probability that different entities share the same A-value within time ∆t.
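A rough way to estimate agreement decay from labeled data is to look at pairs of records that belong to different entities and lie at most ∆t apart, and count how often they share a value. This pair-based estimator, the record tuples, and the years used are assumptions for illustration; the talk does not spell out its estimator.

```python
from itertools import combinations


def agreement_decay(records, dt):
    """Estimate P(different entities share a value within dt years).

    records: iterable of (entity_id, value, year) tuples with known entities.
    The pair-based estimator is an illustrative assumption.
    """
    same = total = 0
    for (e1, v1, t1), (e2, v2, t2) in combinations(records, 2):
        # Only pairs of different entities within the time window count.
        if e1 == e2 or abs(t1 - t2) > dt:
            continue
        total += 1
        same += v1 == v2
    return same / total if total else 0.0


# Hypothetical labeled name records; entities "a" and "b" share a name.
labeled = [
    ("a", "Adam Smith", 1950),
    ("a", "Adam Smith", 1955),
    ("b", "Adam Smith", 1970),
    ("c", "John Doe", 1972),
]
print(agreement_decay(labeled, 5), agreement_decay(labeled, 25))
```

With a wider window more cross-entity pairs with the same name appear, so the estimate rises, which is why agreement on a name across a long gap should be rewarded less.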

17 Decay Curves
Decay curves of address learnt from European Patent data
[Figure: disagreement decay and agreement decay curves]
Patent records: 1871. Real-world inventors: 359. In years:

18 Applying Decay
E.g., r1 and r2:
- No decayed similarity: w(name) = w(affi.) = .5, so sim(r1, r2) = .5*1 + .5*0 = .5 (un-match)
- Decayed similarity: w(name, ∆t=5) = 1 - d_agree(name, ∆t=5) = .95 and w(affi., ∆t=5) = 1 - d_disagree(affi., ∆t=5) = .1, so sim(r1, r2) = (.95*1 + .1*0)/(.95 + .1) = .9 (match)
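The arithmetic on this slide is a weighted average of per-attribute similarities, with the weights taken from the decay values. A minimal sketch that reproduces the two numbers follows; the function name and dictionary layout are assumptions, while the weights and attribute similarities are the ones on the slide.

```python
def weighted_similarity(attr_sims, weights):
    """Weighted average of per-attribute similarities."""
    total = sum(weights[a] for a in attr_sims)
    return sum(weights[a] * attr_sims[a] for a in attr_sims) / total


# Names agree (similarity 1), affiliations differ (similarity 0).
attr_sims = {"name": 1.0, "affiliation": 0.0}

# Without decay: fixed, equal weights.
plain = weighted_similarity(attr_sims, {"name": 0.5, "affiliation": 0.5})

# With decay at dt = 5:
#   name agrees        -> weight 1 - d_agree(name, 5)     = .95
#   affiliation differs -> weight 1 - d_disagree(affi., 5) = .10
decayed = weighted_similarity(attr_sims, {"name": 0.95, "affiliation": 0.10})

print(round(plain, 2), round(decayed, 2))  # 0.5 0.9
```

The disagreeing affiliation barely counts once its weight has decayed, so the pair moves from .5 to about .9.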

19 Applying Decay
(Same records r1 to r12 as in the table on slide 12.)
- All records are merged into the same cluster!!
- Able to detect changes!

20 Decayed Similarity & Traditional Clustering
Decay improves recall over baselines by 23-67%
Patent records: 1871. Real-world inventors: 359. In years:

22 Early Binding
- Compare a new record with existing clusters
- Make eager merging decision for each record
- Maintain the earliest/latest timestamp for its last value
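The three bullets above can be sketched as a single pass over the records in time order. This is only an illustration under stated assumptions: `toy_sim`, the 0.8 threshold, and the cluster dictionary layout are hypothetical stand-ins, not the similarity function or data structures from the talk.

```python
def toy_sim(rec, cluster):
    """Hypothetical record-cluster similarity, for illustration only."""
    if rec["name"] != cluster["name"]:
        return 0.0
    if rec["affiliation"] == cluster["affiliation"]:
        return 1.0
    # The older the cluster's last value, the less a new value is penalized.
    gap = rec["year"] - cluster["latest"]
    return min(1.0, 0.5 + 0.1 * gap)


def early_binding(records, sim, threshold=0.8):
    """Process records in time order and merge each one eagerly."""
    clusters = []
    for rec in sorted(records, key=lambda r: r["year"]):
        best = max(clusters, key=lambda c: sim(rec, c), default=None)
        if best is not None and sim(rec, best) >= threshold:
            best["records"].append(rec["id"])
            if rec["affiliation"] == best["affiliation"]:
                best["latest"] = rec["year"]  # extend the last value's span
            else:
                # The value changed: start tracking the new last value.
                best.update(affiliation=rec["affiliation"],
                            earliest=rec["year"], latest=rec["year"])
        else:
            clusters.append({"records": [rec["id"]], "name": rec["name"],
                             "affiliation": rec["affiliation"],
                             "earliest": rec["year"], "latest": rec["year"]})
    return clusters


records = [
    {"id": "r1", "name": "Xin Dong", "affiliation": "R. Polytechnic Institute", "year": 1991},
    {"id": "r2", "name": "Xin Dong", "affiliation": "University of Washington", "year": 2004},
    {"id": "r7", "name": "Dong Xin", "affiliation": "University of Illinois", "year": 2004},
    {"id": "r3", "name": "Xin Dong", "affiliation": "University of Washington", "year": 2005},
]
clusters = early_binding(records, toy_sim)
for c in clusters:
    print(c["records"], c["affiliation"], c["earliest"], c["latest"])
```

Because every decision is final, a wrong early merge (here r1 joins r2 under the toy similarity) cannot be undone later, which is the weakness the following slides address.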

23 Early Binding

| ID | Name | Affiliation | Co-authors | From / To |
|---|---|---|---|---|
| r2 | Xin Dong | Univ. of Washington | Halevy, Tatarinov | 2004 |
| r3 | Xin Dong | Univ. of Washington | Halevy | |
| r1 | Xin Dong | R. P. Institute | Wozny | 1991 |
| r7 | Dong Xin | University of Illinois | Han, Wah | 2004 |
| r8 | Dong Xin | University of Illinois | Wah | |
| r4 | Xin Luna Dong | Univ. of Washington | Halevy, Yu | |
| r9 | Dong Xin | Microsoft Research | Wu, Han | 2008 |
| r10 | Dong Xin | University of Illinois | Ling, He | 2009 |
| r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | |
| r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | |
| r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | |
| r12 | Dong Xin | Microsoft Research | He | |

Clusters: C1, C2, C3
- Earlier mistakes prevent later merging!!
- Avoid a lot of false positives!

24 Adjusted Binding
- Compare earlier records with clusters created later
- Proceed in EM-style:
  1. Initialization: Start with the result of initialized clustering
  2. Estimation: Compute record-cluster similarity
  3. Maximization: Choose the optimal clustering
  4. Termination: Repeat until the results converge or oscillate

25 Adjusted Binding
Compute similarity by
- Consistency: consistency in evolution of values
- Continuity: continuity of records in time
[Figure: Cases 1 to 4, each placing the record time stamp r.t relative to the cluster time stamps C.early and C.late]
sim(r, C) = cont(r, C) * cons(r, C)
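A minimal sketch of the EM-style loop, using sim(r, C) = cont(r, C) * cons(r, C), might look as follows. The slides do not define the two factors precisely, so `continuity` and `consistency` below are hypothetical placeholders, and the iteration cap stands in for the "converge or oscillate" stopping rule.

```python
def continuity(rec, members):
    """Hypothetical: closer in time to another cluster member -> higher."""
    others = [m for m in members if m["id"] != rec["id"]]
    if not others:
        return 0.0
    gap = min(abs(rec["year"] - m["year"]) for m in others)
    return 1.0 / (1.0 + gap)


def consistency(rec, members):
    """Hypothetical: share of other members with the same affiliation."""
    others = [m for m in members if m["id"] != rec["id"]]
    if not others:
        return 0.0
    return sum(m["affiliation"] == rec["affiliation"] for m in others) / len(others)


def adjusted_binding(records, assignment, max_iters=10):
    """Re-assign records to clusters until nothing changes."""
    assignment = dict(assignment)  # 1. Initialization
    for _ in range(max_iters):     # 4. Termination (cap guards oscillation)
        clusters = {}
        for r in records:
            clusters.setdefault(assignment[r["id"]], []).append(r)
        new_assignment = {}
        for r in records:
            # 2. Estimation: record-cluster similarity.
            scores = {c: continuity(r, ms) * consistency(r, ms)
                      for c, ms in clusters.items()}
            # 3. Maximization: pick the best cluster for this record.
            best = max(scores, key=scores.get)
            new_assignment[r["id"]] = best if scores[best] > 0 else assignment[r["id"]]
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment


records = [
    {"id": "r7", "affiliation": "University of Illinois", "year": 2004},
    {"id": "r8", "affiliation": "University of Illinois", "year": 2007},
    {"id": "r9", "affiliation": "Microsoft Research", "year": 2008},
    {"id": "r10", "affiliation": "University of Illinois", "year": 2009},
    {"id": "r11", "affiliation": "Microsoft Research", "year": 2009},
]
# Hypothetical starting point in which r10 was bound to the wrong cluster.
initial = {"r7": "A", "r8": "A", "r9": "B", "r10": "B", "r11": "B"}
result = adjusted_binding(records, initial)
print(result)
```

On this toy input the loop moves r10 out of cluster B and then stops, since a second pass changes nothing.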

26 Adjusted Binding
[Figure: records and clusters C3, C4, C5]
- r10 has higher continuity with C4
- r8 has higher continuity with C4
- Once r8 is merged to C4, r7 has higher continuity with C4

27 Adjusted Binding
(Records r1 to r12 as in the table on slide 12, grouped into clusters C1, C2, C3.)
- Correctly cluster all records

28 Temporal Clustering
Patent records: 1871. Real-world inventors: 359. In years:
- Full algorithm has the best result
- Adjusted Clustering improves recall without reducing precision much

29 Experimental Results
Data sets:

| Data set | #Records | #Entities | Years |
|---|---|---|---|
| Patent | | | |
| DBLP-XD | | | |
| DBLP-WW | 738 | 18+potpourri | |

[Figures: (a) Results of XD data; (b) Results of WW data]

30 Demonstration
CHRONOS: Facilitating History Discovery by Linking Temporal Records
ITIS Lab

32
- Are there any business chains?
- If yes, which businesses are their members?

33
- Ground truth: 2 chains

34
- Solution 1: require high value consistency
- Result: 0 chains

35
- Solution 2: match records with the same name
- Result: 1 chain

36 Challenges

| ID | name | phone | state | URL domain |
|---|---|---|---|---|
| r1 | Taco Casa | | AL | tacocasa.com |
| r2 | Taco Casa | 900 | AL | tacocasa.com |
| r3 | Taco Casa | 900 | AL | tacocasa.com, tacocasatexas.com |
| r4 | Taco Casa | 900 | AL | |
| r5 | Taco Casa | 900 | AL | |
| r6 | Taco Casa | 701 | TX | tacocasatexas.com |
| r7 | Taco Casa | 702 | TX | tacocasatexas.com |
| r8 | Taco Casa | 703 | TX | tacocasatexas.com |
| r9 | Taco Casa | 704 | TX | |
| r10 | Elva’s Taco Casa | | TX | tacodemar.com |

- Erroneous values
- Different local values
- Scalability: 6.8M records

37 Two-Stage Linkage – Stage I
Stage I: Identify cores containing listings very likely to belong to the same chain
- Require strong robustness in presence of possibly erroneous values → Graph theory
- High scalability
(Records r1 to r10 as in the table on slide 36.)
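To see why robustness matters in Stage I, it helps to look at what a naive graph approach does. The sketch below is not the core-identification method from the talk: it simply links listings that share a name plus a phone or URL-domain value and takes connected components with union-find. The record layout and the helper names are assumptions; the values come from the Taco Casa table.

```python
from collections import defaultdict


def connected_groups(records):
    """Naive grouping: same name plus a shared phone or URL domain."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}
    for r in records:
        keys = [("phone", r["name"], r["phone"])] if r["phone"] else []
        keys += [("url", r["name"], d) for d in r["domains"]]
        for k in keys:
            if k in first_seen:
                parent[find(r["id"])] = find(first_seen[k])
            else:
                first_seen[k] = r["id"]

    groups = defaultdict(list)
    for r in records:
        groups[find(r["id"])].append(r["id"])
    return list(groups.values())


def rec(id_, name, phone, domains):
    return {"id": id_, "name": name, "phone": phone, "domains": domains}


records = [
    rec("r1", "Taco Casa", None, ["tacocasa.com"]),
    rec("r2", "Taco Casa", "900", ["tacocasa.com"]),
    rec("r3", "Taco Casa", "900", ["tacocasa.com", "tacocasatexas.com"]),
    rec("r4", "Taco Casa", "900", []),
    rec("r5", "Taco Casa", "900", []),
    rec("r6", "Taco Casa", "701", ["tacocasatexas.com"]),
    rec("r7", "Taco Casa", "702", ["tacocasatexas.com"]),
    rec("r8", "Taco Casa", "703", ["tacocasatexas.com"]),
    rec("r9", "Taco Casa", "704", []),
    rec("r10", "Elva's Taco Casa", None, ["tacodemar.com"]),
]
groups = connected_groups(records)

# Same data with r3's second domain dropped (assumed here to be the error).
cleaned = [dict(r, domains=["tacocasa.com"]) if r["id"] == "r3" else r
           for r in records]
groups_cleaned = connected_groups(cleaned)
print(groups)
print(groups_cleaned)
```

With the raw data, r3's second domain bridges the AL and TX listings into one group of eight; once that single value is removed, they separate. A core definition that tolerates one such bad value is what the "strong robustness" requirement asks for.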

38 Two-Stage Linkage – Stage II
Stage II: Cluster cores and remaining records into chains.
- Collect strong evidence from cores and leverage in clustering
- No penalty on local values
(Records r1 to r10 as in the table on slide 36.)
- Reward strong evidence

39 Two-Stage Linkage – Stage II (continued)
- Reward strong evidence

40 Two-Stage Linkage – Stage II (continued)
- Apply weak evidence

41 Two-Stage Linkage – Stage II (continued)
- No penalty on local values

42 Experimental Evaluation
Data set: 6.8M records from YellowPages.com
Effectiveness: Precision / Recall / F-measure (avg.): .96 / .96 / .96
Efficiency:
- 6.9 hrs for single-machine solution
- 40 mins for Hadoop solution
80K chains and 1M records in chains

| Chain name | # Stores |
|---|---|
| USPS - United States Post Office | 12,776 |
| SUBWAY | 11,278 |
| State Farm Insurance | 8,711 |
| McDonald's | 7,450 |
| Edward Jones | 6,781 |
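The slide does not say how precision, recall, and F-measure were computed. One common convention for clustering results is to score pairs of records placed in the same cluster; the sketch below follows that convention as an assumption, with made-up record IDs.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """All unordered record pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_prf(predicted, truth):
    """Pairwise precision, recall, and F-measure of a clustering."""
    p, t = same_cluster_pairs(predicted), same_cluster_pairs(truth)
    tp = len(p & t)
    precision = tp / len(p) if p else 1.0
    recall = tp / len(t) if t else 1.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f


# Hypothetical example: one record ("c") is attached to the wrong chain.
truth = [["a", "b", "c"], ["d", "e"]]
predicted = [["a", "b"], ["c", "d", "e"]]
print(pairwise_prf(predicted, truth))  # (0.5, 0.5, 0.5)
```

Two of the four predicted pairs are correct and two of the four true pairs are found, so all three scores are .5 here.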

43 Experimental Evaluation II

| Sample | #Records | #Chains | Chain size | #Single-biz records |
|---|---|---|---|---|
| Random | 2062 | 30 | [2, 308] | 503 |
| AI | | | | |
| UB | 322 | 7 | [2, 275] | 5 |
| FBIns | 1149 | 14 | [33, 269] | 0 |

44 Related Work
Record similarity:
- Probabilistic linkage
  - Classification-based approaches: classify records by probabilistic model [Fellegi, ’69]
- Deterministic linkage
  - Distance-based approaches: apply a distance metric to compute the similarity of each attribute, and take the weighted sum as record similarity [Dey, 08]
  - Rule-based approaches: apply domain knowledge to match records [Hernandez, 98]
Record clustering:
- Transitive rule [Hernandez, 98]
- Optimization problem [Wijaya, 09]
- …

45 Conclusions
- In some applications, record linkage needs to be tolerant of value diversity
- When linking temporal records, time decay allows tolerance of evolving values
- When linking group members, two-stage linkage allows leveraging strong evidence and tolerating different local values

46 Future Work
- Data Integration
- Temporal Database
- Data Quality

47 Thanks!

