Record Linkage with Uniqueness Constraints and Erroneous Values

Zhang Xiaojian 2010 November 26 WAMDM Group Meeting

Data integration process
Two challenges: heterogeneous sources (schema level, instance level) and conflicting data (value-level contradiction)
Stages and references:
Schema matching: E. Rahm, VLDBJ01
Data exchange: R. Fagin, TODS05
Duplicate detection / record linkage: A.K.E., TKDE07
Entity resolution: Tech Report, Stanford
Data fusion: X. Dong, VLDB09; Felix, WWW06; Felix, ACMC08
Output: cleaned data, consumed by Application1 and Application2
Example of duplicate records with uncertainty:
Name          Address         Age
John R Smith  16 Main Street  16
J R Smith     16 Main St      NULL

Contents
Motivation
Problem definition
Solution
Experimental results
Conclusions
Some problems drawn from the paper

Motivation
Four sources (s1 to s4) are integrated into cleaned data behind a search box. Each source table has the columns Src, Name, Phone, Address, City; no phone values are shown.

s1 (Src V):
A-Link Wireless, 2148 GLENDALE GALLERIA, GLENDALE
Abercrombie, 2229 GLENDALE GALLERIA
Abercrombie & Fitch, 2151 GLENDALE GALLERIA
Aeropostale, 2187 GLENDALE GALLERIA
Aerosoles, 1163 GLENDALE GALLERIA
Newtown Pizza Palace, 65 Church hill Rd, NEWTOWN
Pizza Palace Of Newtown

s2 (Src D):
Aerosoles, 1163 GLENDALE GALLERIA, GLENDALE
Aldo Shoes, 1157 GLENDALE GALLERIA
Newtown Pizza Palace, 65 Church hill Rd, Newtown
Pizza Palace of Newtown, Church Hill Rd

s3 (Src A):
A 24 Hour 1 A 1 Locksmith, 3210 GLENDALE GALLERIA, GLENDALE
A Link Wireless, 2148 GLENDALE GALLERIA
Abercrombie, 2229 GLENDALE GALLERIA
Abercrombie & Fitch, 2151 GLENDALE GALLERIA
Newtown Pizza Palace, 65 Church hill Rd, Newtown
Aldo Shoes, 2154 GLENDALE GALLERIA
Alert Cellular

s4 (Src T):
Newtown Piza Palace, 65 Church hill Rd, Newtown
Aldo Shoes, 2154 GLENDALE GALLERIA, GLENDALE
American Eagle Outfitters, 2182 GLENDALE GALLERIA
ANN TAYLOR, 2178 GLENDALE GALLERIA
Ann Taylor Stores, 1108 GLENDALE GALLERIA

Current Solution
Current two-step solution
Step 1: Record linkage. Link records that are likely to refer to the same real-world entity [A.K. Elmagarmid, TKDE'07], [W. Winkler, Tech Report'06].
Step 2: Data fusion. Merge the linked records and decide the correct values for each result entity in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys'08].
Uniqueness constraint
Many real-world entities have a unique value for an attribute, e.g. website (IP), phone, Facebook account.
The co-existence of conflicts and duplicates makes the problem hard to solve.

Limitations of Current Solution
[Table: records from sources s1 to s10 with columns SOURCE, NAME, PHONE, ADDRESS. Names: Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc. Phones: xxx-1255, xxx-9400, xxx-2255, xxx-0500. Addresses: 1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.]
(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)
Assume that Phone and Address satisfy uniqueness constraints.
Erroneous values may prevent correct matching.
Current solutions may fall short when uniqueness constraints exist (PHONE: 9400 missing).

Conclusions and Future work

Problem Definition
Input:
A set of records provided by a set of independent data sources
A set of (hard or soft) uniqueness constraints
Output:
Real-world entities
For each (hard or soft) uniqueness attribute of each entity, the true value

Concepts
Entity and attribute
Value vs. representations (e.g., the value New York City has the representations New York City, NYC, N.Y.C)
Constraint
Uniqueness constraint (hard constraint), over (D, A) pairs: (Business, Name), (Business, Phone), (Business, Address)
Soft uniqueness constraint (soft constraint), e.g. (D, A) = (Business, Phone) with p1 = 30%, p2 = 10%, where p1 is the upper bound on the probability of an entity having multiple values for A, and p2 is the upper bound on the probability of a value of A being shared by multiple entities.
Special case: key attribute
[Figure: the two example entities, (Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way) and (Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.), with edges labelled 1-p1 and 1-p2.]

K-Partite Graph Encoding
(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)
[Figure: the record from source s1 (Microsofe Corp., xxx-1255, 1 Microsoft Way) is encoded as nodes N1 (Microsofe Corp.), P1 (xxx-1255) and A1 (1 Microsoft Way), connected by edges labelled s(1).]
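The encoding on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `build_kpartite_graph`, the record layout, and the second record from "s2" are hypothetical and not taken from the deck. Each (attribute, value) pair becomes a node, and values of different attributes that co-occur in a record are joined by an edge labelled with the set of supporting sources.

```python
from collections import defaultdict
from itertools import combinations


def build_kpartite_graph(records):
    """Map each pair of co-occurring (attribute, value) nodes to its supporting sources.

    records: iterable of (source, {attribute: value}) pairs (hypothetical layout).
    """
    edges = defaultdict(set)
    for source, values in records:
        # Sort so every edge key is written in one canonical order.
        nodes = sorted(values.items())
        # Nodes of one record always belong to different attributes (parts).
        for u, v in combinations(nodes, 2):
            edges[(u, v)].add(source)
    return dict(edges)


records = [
    # Record shown on the slide for source s1.
    ("s1", {"name": "Microsofe Corp.", "phone": "xxx-1255",
            "address": "1 Microsoft Way"}),
    # Made-up second record, only to show how source sets accumulate.
    ("s2", {"name": "Microsofe Corp.", "phone": "xxx-1255",
            "address": "1 Microsoft Way"}),
]

graph = build_kpartite_graph(records)
name_phone = (("name", "Microsofe Corp."), ("phone", "xxx-1255"))
print(sorted(graph[name_phone]))
```

Keeping the source set on each edge, rather than a plain count, makes it possible to reproduce edge labels such as s(1) and to derive edge weights later.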

Encoding of the ideal solution
[Figure: K-partite graph of the ideal solution. Name nodes N1 to N4: Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc. Phone nodes P1 to P4: xxx-1255 (P1), xxx-9400, xxx-2255, xxx-0500 (P4). Address nodes A1 to A3: 1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.]
Pre-processing for the K-partite graph: clustering within every part (subset)

Clustering with Hard Constraint
Clustering the whole graph G(S)
[Figure: the nodes N1 to N4 (Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.), P1 to P4 (xxx-1255, xxx-9400, xxx-2255, xxx-0500) and A1 to A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.) are grouped into clusters C1, C2, C3, C4.]

Clustering w.r.t. hard constraint
An ideal clustering should meet two requirements:
High cohesion within each cluster
Low correlation between different clusters
Objective function for getting the "best" clustering: the Davies-Bouldin index [Davies and Bouldin, TPAMI79]
The goal is to minimize the Davies-Bouldin index Φ. In the index, the distance of a cluster to itself corresponds to the complement of cohesion (high cohesion means a small value), and the distance between two different clusters corresponds to the complement of correlation (low correlation means a large value).
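The formula itself is not readable in this transcript. As a hedged sketch, the standard Davies-Bouldin form averages, over all clusters, the worst-case ratio of within-cluster distance to between-cluster distance. The function name `davies_bouldin`, the callable `d`, and the toy distance table are assumptions for illustration; the deck's exact variant may differ.

```python
def davies_bouldin(clusters, d):
    """Standard Davies-Bouldin index; lower is better.

    clusters: list of cluster identifiers.
    d: callable where d(c, c) is the within-cluster distance (complement of
       cohesion) and d(ci, cj) the between-cluster distance (complement of
       correlation). d(ci, cj) must be non-zero for ci != cj.
    """
    n = len(clusters)
    if n < 2:
        raise ValueError("the index needs at least two clusters")
    total = 0.0
    for i in range(n):
        ci = clusters[i]
        # Worst (largest) ratio of summed scatter to separation for cluster i.
        total += max(
            (d(ci, ci) + d(cj, cj)) / d(ci, cj)
            for j, cj in enumerate(clusters)
            if j != i
        )
    return total / n


# Toy distances, made up for the example.
toy = {("A", "A"): 0.1, ("B", "B"): 0.2, ("A", "B"): 0.6, ("B", "A"): 0.6}
phi = davies_bouldin(["A", "B"], lambda x, y: toy[(x, y)])
print(round(phi, 3))
```

Tight, well-separated clusters give small numerators and large denominators, which is why minimizing the index favors high cohesion and low correlation.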

Computing cluster distance
The cluster distance function is built from two components:
dS is the similarity distance, measuring similarity between value representations of the same attribute.
dA is the association distance, measuring association between value representations of different attributes.
The key is how to calculate dS and dA for computing the cluster distance.

Similarity Distance
How to get the similarity distance dS
[Figure: cluster C1 contains N1 (Microsofe Corp.), N2 (Microsoft Corp.), N3 (MS Corp.), P1 (xxx-1255) and A1 (1 Microsoft Way); cluster C4 contains N4 (Macrosoft Inc.), P4 (xxx-0500), A2 (2 Sylvan Way) and A3 (2 Sylvan W.). Edge labels show name similarities 0.95, 0.65, 0.7, 0.4 and address similarity 0.9.]
Within the same cluster:
d1S(C1,C1) = 1 − (sum of name similarities)/3 = 0.25 (name)
d2S(C1,C1) = 0 (phone)
d3S(C1,C1) = 0 (address)
dS(C1,C1) = (0.25 + 0 + 0)/3 = 0.083
Between different clusters:
d1S(C1,C4) = 1 − (sum of name similarities)/3 = 0.4 (name)
d2S(C1,C4) = 1 − 0 = 1 (phone)
d3S(C1,C4) = 1 − 0 = 1 (address)
dS(C1,C4) = (0.4 + 1 + 1)/3 = 0.8
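The two totals on this slide can be checked with a short calculation. The sketch below assumes that dS is the plain average of the per-attribute distances (name, phone, address), which matches the division by 3 on the slide; the helper name `similarity_distance` is hypothetical.

```python
def similarity_distance(per_attribute_distances):
    """Average the per-attribute similarity distances (assumed aggregation)."""
    values = list(per_attribute_distances)
    return sum(values) / len(values)


# Per-attribute values (name, phone, address) as given on the slide.
ds_within_c1 = similarity_distance([0.25, 0.0, 0.0])
ds_c1_vs_c4 = similarity_distance([0.4, 1.0, 1.0])

print(round(ds_within_c1, 3))
print(round(ds_c1_vs_c4, 3))
```

The within-cluster value is small because the phone and address parts of C1 each hold a single representation, so only the name part contributes.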

Association Distance
How to get the association distance dA
[Figure: name nodes N1 to N4 (Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.), phone nodes P1 (xxx-1255) and P4 (xxx-0500), and address nodes A1 (1 Microsoft Way), A2 (2 Sylvan Way), A3 (2 Sylvan W.) in clusters C1 and C4. Edges are labelled with supporting sources: s(1-2), s(1-5,7,8), s(2-6), s(7-8), s(2-5), s(10), s(1-9), s(1), s(2-9), s(2-10).]
Within the same cluster:
d1,2A(C1,C1) = 1 − 7/9 = 0.22
d1,3A(C1,C1) = 1 − 8/9 = 0.11
d2,3A(C1,C1) = 1 − 7/8 = 0.125
dA(C1,C1) = (0.22 + 0.11 + 0.125)/3 = 0.15
Between different clusters:
d1,2A(C1,C4) = 1 − max(1/10, 0/10) = 0.9
d1,3A(C1,C4) = 0.9
d2,3A(C1,C4) = 1
dA(C1,C4) = (0.9 + 0.9 + 1)/3 = 0.93
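The arithmetic on this slide can be reproduced exactly with fractions. The sketch assumes each ratio (7/9, 8/9, 7/8, 1/10) is a count of supporting sources over a count of relevant sources, and that dA averages the three attribute pairs; the transcript does not spell out either definition, and the function name `association_distance` is hypothetical.

```python
from fractions import Fraction


def association_distance(supporting, total):
    """1 minus the fraction of sources supporting the association (assumed meaning)."""
    return 1 - Fraction(supporting, total)


# Within cluster C1: attribute pairs (name, phone), (name, address), (phone, address).
within = [
    association_distance(7, 9),
    association_distance(8, 9),
    association_distance(7, 8),
]
da_within_c1 = sum(within) / len(within)

# Between C1 and C4: the first term is computed as on the slide,
# the other two values (0.9 and 1) are copied from it.
first = 1 - max(Fraction(1, 10), Fraction(0, 10))
between = [first, Fraction(9, 10), Fraction(1)]
da_c1_vs_c4 = sum(between) / len(between)

print(float(da_within_c1))
print(float(da_c1_vs_c4))
```

Exact fractions avoid the small rounding drift that appears when 0.22 and 0.11 are averaged by hand.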

Greedy Algorithm: CLUSTER
Obtaining the optimal clustering is intractable [T.F. Gonzalez, 82], [J. Simal et al., 06]
Algorithm: CLUSTER
Step 1: Initialization. Cluster value representations according to their similarity distance and association distance.
Step 2: Adjustment. Move each node to the cluster that minimizes the Davies-Bouldin (DB) index.
Step 3: Convergence checking. Stop if Step 2 does not change the clustering result; otherwise, repeat Step 2.
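The three steps amount to a local search, which can be sketched as follows. This is not the authors' implementation: the function `greedy_cluster`, its parameters, and the toy objective `within_cluster_spread` (a stand-in for the Davies-Bouldin index) are assumptions made for the example, and the initial assignment is supplied by hand rather than computed in Step 1.

```python
from itertools import combinations


def greedy_cluster(nodes, initial, num_clusters, phi, max_rounds=100):
    """Move nodes between clusters while the objective phi keeps decreasing."""
    assignment = dict(initial)                      # Step 1: start from a given clustering
    for _ in range(max_rounds):
        changed = False
        for node in nodes:                          # Step 2: adjust one node at a time
            current = assignment[node]
            best_cluster, best_score = current, phi(assignment)
            for cluster in range(num_clusters):
                if cluster == current:
                    continue
                assignment[node] = cluster          # tentative move
                score = phi(assignment)
                if score < best_score:
                    best_cluster, best_score = cluster, score
            assignment[node] = best_cluster         # keep the best move, or revert
            if best_cluster != current:
                changed = True
        if not changed:                             # Step 3: converged
            break
    return assignment


def within_cluster_spread(assignment):
    """Toy objective: total distance between 1-D points sharing a cluster."""
    return sum(
        abs(a - b)
        for a, b in combinations(assignment, 2)
        if assignment[a] == assignment[b]
    )


points = [0, 1, 10, 11]
start = {0: 0, 1: 0, 10: 0, 11: 1}
result = greedy_cluster(points, start, num_clusters=2, phi=within_cluster_spread)
print(result)
```

A search of this kind only reaches a local optimum, so the quality of the initial clustering in Step 1 matters.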

[Figure: example run of CLUSTER over the nodes N1 to N4 (Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.), P1 to P4 (xxx-1255, xxx-9400, xxx-2255, xxx-0500) and A1 to A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.), ending in clusters C1, C2, C3, C4. Φ values shown on the slide: 0.94, 1.16, 0.93, 0.71, 1.15, 0.92, 0.89, 0.71, 0.45.]

Matching w.r.t. Soft Constraints
Graph transform
[Figure: after clustering, the graph is transformed so that clusters become nodes: name clusters NC1 and NC4, phone clusters PC1 to PC4, address clusters AC1 and AC4. Edges carry a weight and the supporting sources: 7 s(1-5,7,8); 1 s(6); 5 s(1-5); s(10); 9 s(1-9); 8 s(1-8).]
The next step is to find the best matching between the key attribute and the soft uniqueness attributes.
How to match?

Matching w.r.t. Soft Constraint
Goals:
Maximize the sum of the weights w(e) of the selected edges
Minimize the gap Gap(N) for each node
How to balance these two goals?
Define a score function that balances w(e) and Gap(N)
The "best" matching maximizes the score function
Greedy algorithm: MATCH
Computing Gap(N) and w(u,v)
[Figure: node N1 with edges 1 (s1), 9 (s2-s10) and 7 (s4-s10) to P1, P2, P3.]

Continue the example
[Figure: four candidate matchings between name nodes N1, N2 and phone nodes P1 to P4, each annotated "Greedily select". Edge weights with supporting sources: 1 (s1), 3 (s3-s5), 7 (s4-s10), 8 (s2-s9), 9 (s2-s10), 10 (s1-s10).]
Solution 1: w(N1,P1) = 1, w(N2,P2) = 3; Gap(N1) = 9, Gap(N2) = 5, Gap(P1) = 0, Gap(P2) = 4
Solution 2: w(N1,P2) = 7, w(N2,P4) = 8; Gap(N1) = 3, Gap(N2) = 0, Gap(P2) = 4, Gap(P4) = 2
Solution 3: w(N1,P3) = 9, w(N2,P2) = 8; Gap(N1) = 1, Gap(N2) = 0, Gap(P3) = 0, Gap(P4) = 2
Solution 4: w(N1,P4) = 10, w(N2,P2) = 8; Gap(N1) = 0, Gap(N2) = 0, Gap(P4) = 2, Gap(P4) = 2

Experiment Settings: Dataset
Business listings for two zip codes (07035, 07715) from multiple sources

Zip     #Businesses  #Sources  #Sources/business
07035   662          15        1 to 7
07715   149          6         1 to 3

Zip     #Records  #Names  #Phones  #Addresses  #(Error phones)
07035   1629      1154    839      735         72
07715   266       243     184      55          12

Experiment Settings: Implementation
Methods compared:
MATCH + CLUSTER
LINK: linkage only
FUSE: data fusion only
LINKFUSE: first LINK, then FUSE
Golden standard: by manual checking
Measures: precision, recall and F-measure, for
Matching of values of different attributes
Clustering of values of the same attribute
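The formulas for the three measures did not survive in the transcript. A minimal sketch using the standard set-based definitions is given below; treating each matching decision as a (name, phone) pair, the function name `precision_recall_f`, and the toy pairs are assumptions for illustration and not the deck's data.

```python
def precision_recall_f(predicted, gold):
    """Standard precision, recall and F-measure over two sets of pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up name-phone matchings: what a method returned vs. the golden standard.
predicted = {("N1", "P1"), ("N2", "P1"), ("N4", "P2")}
gold = {("N1", "P1"), ("N2", "P1"), ("N3", "P1"), ("N4", "P4")}

p, r, f = precision_recall_f(predicted, gold)
print(round(p, 3), round(r, 3), round(f, 3))
```

The same function applies to clustering quality if the pairs are taken to be pairs of values placed in the same cluster.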

Accuracy
[Figure panels: 07035 Matching (NAME-PHONE); 07035 Matching (NAME-ADDRESS); 07035 Clustering (NAME); 07715 Matching (NAME-PHONE); 07715 Matching (NAME-ADDRESS); 07715 Clustering (NAME)]

Efficiency and Scalability

Conclusions
In the real world, we need to resolve duplicates and conflicts at the same time.
We reduce the problem to a k-partite graph clustering and matching problem.
The approach combines linkage and fusion.
Experiments show high efficiency and scalability.

Thank You!
