Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved.


1 Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

2 Overview
Introduction to record linking
– What is record linking, what is it not, what is the theory?
Record linking: Applications and examples
– How do you do it, what do you need, what are the possible complications?
Examples of record linking
Do it yourself record linking

3 From Imputing to Linking
[Figure: four approaches arranged along two axes, "Precision of link" and "Availability of linked data"]
– “Statistical record linkage”: merge match with imperfect link variables
– “Massively imputed”: common variables/values, but datasets can’t be linked
– “Simulated data”: no common observations
– “Classical”: merge match by link variable

4 Definitions of Record Linkage
“a procedure to find pairs of records in two files that represent the same entity”
“identify duplicate records within a file”

5 Uses of Record Linkage
Merging two files for micro-data analysis
– CPS base survey to a supplement
– SIPP interviews to each other
– Merging years of Business Register
– Merging two years of CPS
– Merging financial info to firm survey
Updating/unduplicating a survey frame or an electoral list
– Based on business lists
– Based on tax records
Disclosure review of potential public use micro-data

6 Uses of Record Linkage (private-sector applications)
Merging two files …
– Credit scoring
– Customer lists after merger
– Internal files when consolidating/upgrading software
Updating/unduplicating junk mail lists
– Based on multiple sources of lists
Disclosure review of potential public use micro-data
– Not done… (Netflix case)

7 Types of Record Linkage
Merging two files for micro-data analysis
– CPS base survey to a supplement
– SIPP interviews to each other
– Merging years of Business Register
– Merging two years of CPS*
– Merging financial info to firm survey
Updating a survey frame or an electoral list
– Based on business lists
– Based on tax records
Disclosure review of potential public use micro-data
Linkage types labeled on the slide:
– Deterministic linkage: survey-provided IDs
– Probabilistic linkage: imperfect or no IDs
– Probabilistic linkage: no IDs
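
The deterministic case on this slide (a shared, survey-provided ID) amounts to an exact merge on that ID. The following is a minimal sketch, not the authors' code; the file contents and the field name `person_id` are made up for illustration, and the ID is assumed to be unique within the second file.

```python
# Hypothetical survey files; field names and values are invented for illustration.
base = [
    {"person_id": 101, "age": 34},
    {"person_id": 102, "age": 51},
    {"person_id": 103, "age": 27},
]
supplement = [
    {"person_id": 102, "earnings": 42000},
    {"person_id": 103, "earnings": 38500},
    {"person_id": 104, "earnings": 61000},
]


def deterministic_link(file_a, file_b, key):
    """Merge-match two files on an exact, shared identifier."""
    # Index the second file by ID (assumes the ID is unique in file_b).
    index_b = {rec[key]: rec for rec in file_b}
    linked = []
    for rec_a in file_a:
        rec_b = index_b.get(rec_a[key])
        if rec_b is not None:
            # Keep only records found in both files, combining their fields.
            linked.append({**rec_a, **rec_b})
    return linked


linked = deterministic_link(base, supplement, "person_id")
```

Records without a counterpart (IDs 101 and 104 here) simply drop out. Once the ID is missing or unreliable, exact lookup no longer works, which is where the probabilistic cases on the slide come in.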

8 Methods of Record Linkage
Probabilistic record linkage (PBRL)
– non-parametric methods
– regression-based methods
Distance-based record linkage (DBRL)
– Euclidean distance
– Mahalanobis distance
– Kernel-based distance

9 Need for Automated Record Linkage
RA time required for the following matching tasks:
– Finding financial records for
  Fortune 100: 200 hours (Abowd, 1989)
  50,000 small businesses: ??? hours
– Identifying miscoded SSNs
  on 60,000 wage records: several weeks
  on 500 million wage records: ????
– Unduplication of the U.S. Census survey frame (115,904,641 households): ????
– Longitudinally linking the 12 million establishments in the Business Register: ????

10 Basic Definitions and Notation
Entities
Associated files
Records on files
Matches
Nonmatches
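
The formulas that accompanied these terms are not part of the transcript. As a sketch only, the usual Fellegi–Sunter style notation would read as follows; the symbols A, B, a and b are assumptions, while M and U match the P(γ|M) and P(γ|U) used on later slides.

```latex
\begin{align*}
A,\ B &: \text{two files, with records } a \in A \text{ and } b \in B \\
M &= \{(a,b) \in A \times B : a \text{ and } b \text{ represent the same entity}\} \\
U &= \{(a,b) \in A \times B : a \text{ and } b \text{ represent different entities}\} \\
A \times B &= M \cup U, \qquad M \cap U = \emptyset
\end{align*}
```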

11 Comparisons
Comparison function maps comparison space into some domain:
Comparison vector
– PBRL: agreement pattern, finitely many values, typically {0,1}, but can be real-valued
– DBRL: distance (scalar)
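
The slide's own formulas are missing here. One way to write the comparison function and comparison vector, assuming two files A and B and K compared fields (all three symbols are assumptions, not taken from the slide):

```latex
\gamma : A \times B \to \Gamma, \qquad
\gamma(a,b) = \bigl(\gamma_1(a,b), \ldots, \gamma_K(a,b)\bigr)
```

In the PBRL case each component would typically satisfy \gamma_k(a,b) \in \{0,1\}; in the DBRL case the comparison collapses to a single scalar distance d(a,b) \ge 0.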

12 Linkage Rule
A linkage rule defines a record pair’s status based on its comparison value
– Link (L)
– Undecided (Clerical, C)
– Non-link (N)

13 In a perfect world… [two conditions joined by “and”; the equations are not preserved in this transcript]

14 Linkage Rules Depend on Context
PBRL:
– For matching: rank by agreement ratios, use cutoff values to classify into {L,C,N}
– For disclosure analysis: rank by agreement ratios, classify as {L} if true link (M) is among top j pairs
DBRL:
– Rank pairs by distance, link closest pairs

15 Probabilistic Record Linkage

16 Example Agreement Pattern
3 binary comparisons test whether
– γ_1: pair agrees on last name
– γ_2: pair agrees on first name
– γ_3: pair agrees on street name
Simple agreement pattern: γ=(1,0,1)
Complex agreement pattern: γ=(0.66,0,0.8)
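
Both kinds of pattern can be computed for a record pair along the following lines. This is a sketch under assumptions: the field names and sample records are invented, and `difflib.SequenceMatcher` from the Python standard library is only a stand-in for whatever string comparator a real linkage system would use.

```python
from difflib import SequenceMatcher

FIELDS = ("last_name", "first_name", "street")  # assumed field names


def simple_pattern(rec_a, rec_b):
    """Binary agreement pattern: 1 if a field agrees exactly, else 0."""
    return tuple(int(rec_a[f].lower() == rec_b[f].lower()) for f in FIELDS)


def complex_pattern(rec_a, rec_b):
    """Graded agreement pattern: string similarity in [0, 1] per field."""
    return tuple(
        round(SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio(), 2)
        for f in FIELDS
    )


# Hypothetical record pairs
a = {"last_name": "Smith", "first_name": "John", "street": "Main"}
b = {"last_name": "Smith", "first_name": "Mary", "street": "Main"}
c = {"last_name": "Smyth", "first_name": "Jon", "street": "Main"}

simple_ab = simple_pattern(a, b)    # last name and street agree, first name does not
complex_ac = complex_pattern(a, c)  # partial credit for near-identical spellings
```

The binary pattern treats "Smith" and "Smyth" as a plain disagreement, while the graded pattern gives partial credit, which is the distinction the slide draws between simple and complex patterns.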

17 Conditional Probabilities
Probability that a record pair has agreement pattern γ given that it is a match [nonmatch]: P(γ|M) [P(γ|U)]
Agreement ratio: R(γ) = P(γ|M) / P(γ|U)
This ratio will determine the distinguishing power of the comparison γ.
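
To make the ratio concrete, here is a small sketch that computes R(γ) for a binary agreement pattern. It assumes the fields agree or disagree independently of each other given match status, so that P(γ|M) and P(γ|U) factor into per-field terms; the per-field probabilities are hypothetical, not taken from the slides.

```python
import math

# Hypothetical per-field agreement probabilities (not from the slides):
# M_PROBS[k] = P(field k agrees | M), U_PROBS[k] = P(field k agrees | U)
M_PROBS = (0.95, 0.90, 0.85)
U_PROBS = (0.01, 0.05, 0.10)


def pattern_probability(gamma, probs):
    """P(gamma | class) under the conditional-independence assumption."""
    p = 1.0
    for agrees, q in zip(gamma, probs):
        p *= q if agrees == 1 else (1.0 - q)
    return p


def agreement_ratio(gamma):
    """R(gamma) = P(gamma | M) / P(gamma | U)."""
    return pattern_probability(gamma, M_PROBS) / pattern_probability(gamma, U_PROBS)


r = agreement_ratio((1, 0, 1))  # about 85 with the numbers above
weight = math.log2(r)           # log form, often easier to work with
```

With these made-up numbers, agreement on all three fields yields a far larger ratio than the pattern (1,0,1), and total disagreement yields a ratio well below 1.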

18 Error Rates
False match: a linked pair that is not a match (type II error)
False match rate: probability that a nonmatch is designated a link (L): μ=P(L|U)
False nonmatch: a nonlinked pair that is a match (type I error)
False nonmatch rate: probability that a match is designated a nonlink (N): λ=P(N|M)

19 Fundamental Theorem
1. Order the comparison vectors {γ_j} by R(γ)
2. Choose upper T_u and lower T_l cutoff values for R(γ)
3. Linkage rule: link (L) if R(γ) ≥ T_u; clerical review (C) if T_l < R(γ) < T_u; non-link (N) if R(γ) ≤ T_l
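
The three steps can be sketched in a few lines. The rule itself is assumed to take the usual form (link when R(γ) ≥ T_u, non-link when R(γ) ≤ T_l, clerical review in between); the ratios and cutoffs below are hypothetical.

```python
def linkage_rule(ratio, t_lower, t_upper):
    """Classify one agreement ratio as link (L), clerical (C) or non-link (N)."""
    if t_lower > t_upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    if ratio >= t_upper:
        return "L"
    if ratio <= t_lower:
        return "N"
    return "C"


def classify_patterns(ratios, t_lower, t_upper):
    """Order patterns by R(gamma), highest first, and apply the rule to each."""
    ordered = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
    return [(gamma, r, linkage_rule(r, t_lower, t_upper)) for gamma, r in ordered]


# Hypothetical agreement ratios for four patterns
ratios = {(0, 0, 0): 0.001, (1, 0, 1): 85.0, (1, 1, 1): 14535.0, (0, 1, 0): 0.5}
decisions = classify_patterns(ratios, t_lower=1.0, t_upper=100.0)
```

Moving the cutoffs closer together shrinks the clerical region at the cost of higher error rates, which is the trade-off the theorem is concerned with.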

20 Fundamental Theorem (cont.)
Error rates are
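
The expressions themselves are missing from the transcript. A reconstruction consistent with μ=P(L|U) and λ=P(N|M) from the Error Rates slide, assuming the rule links when R(γ) ≥ T_u and declares a non-link when R(γ) ≤ T_l, and ignoring any randomization at the cutoffs, would be:

```latex
\mu = P(L \mid U) = \sum_{\gamma :\, R(\gamma) \ge T_u} P(\gamma \mid U),
\qquad
\lambda = P(N \mid M) = \sum_{\gamma :\, R(\gamma) \le T_l} P(\gamma \mid M)
```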

21 Fundamental Theorem (3)
Fellegi & Sunter (JASA, 1969): If the error rates for the elements of the comparison vector are conditionally independent, then given the overall error rates (μ, λ), the linkage rule F minimizes the probability associated with an agreement pattern γ being placed in the clerical review set (optimal linkage rule).

22 Applying the Theory
The theory holds on any subset of match pairs (blocks)
Ratio R: matching weight or total agreement weight
Optimality of decision rule heavily dependent on the probabilities P(γ|M) and P(γ|U)
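
Because the theory holds within blocks, candidate pairs are commonly generated only among records that share a blocking key rather than across the full cross product. A minimal sketch, with an invented `zip` field as the blocking key:

```python
from collections import defaultdict


def block_pairs(file_a, file_b, block_key):
    """Yield only those (a, b) pairs whose records share a blocking key."""
    blocks_b = defaultdict(list)
    for rec in file_b:
        blocks_b[block_key(rec)].append(rec)
    for rec_a in file_a:
        for rec_b in blocks_b.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Hypothetical files; "zip" is an assumed blocking variable
file_a = [{"name": "Smith", "zip": "14850"}, {"name": "Jones", "zip": "10001"}]
file_b = [
    {"name": "Smyth", "zip": "14850"},
    {"name": "Smith", "zip": "14850"},
    {"name": "Jones", "zip": "20001"},
]

pairs = list(block_pairs(file_a, file_b, lambda rec: rec["zip"]))
```

Here 2 pairs are compared instead of all 6. The cost is that a true match whose blocking key is miscoded (the two Jones records above) is never examined.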

23 Distance-Based Record Linking

24 Distance-Based Record Linking
Distance between any pair of records can be generally defined as

25 DBRL: 4 cases
Mahalanobis distance, known covariance
Mahalanobis distance, unknown covariance
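
The slide's formulas are not in the transcript. As a hedged sketch, the standard Mahalanobis form for the difference between a record a from file A and a record b from file B is shown below. The particular choice of Σ is an assumption drawn from how this distance is commonly set up for record linkage, not something the transcript confirms.

```latex
d(a,b)^2 = (a-b)^{\top} \, \Sigma^{-1} \, (a-b),
\qquad
\Sigma = \operatorname{Var}(A) + \operatorname{Var}(B) - 2\operatorname{Cov}(A,B)
```

Under that reading, the "known covariance" case uses an available value of Cov(A,B), while the "unknown covariance" case has to substitute an assumption for it, for example Cov(A,B) = 0.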

26 DBRL: 4 cases
Euclidean distance, unstandardized inputs
Euclidean distance, standardized inputs
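
The practical difference between the two Euclidean cases is sensitivity to units. The sketch below uses invented records of (age, earnings) and standardizes each variable to mean 0 and standard deviation 1; it assumes every variable has nonzero spread.

```python
import math
from statistics import mean, pstdev


def euclidean(x, y):
    """Plain Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1.

    Assumes every column has nonzero spread.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]


# Hypothetical records: (age in years, earnings in dollars)
rows = [(30, 40000.0), (31, 90000.0), (60, 41000.0)]
# The same records with earnings expressed in thousands of dollars
rows_k = [(age, earn / 1000.0) for age, earn in rows]

raw = euclidean(rows[0], rows[1])        # dominated by the earnings gap
raw_k = euclidean(rows_k[0], rows_k[1])  # changes when the units change
z, z_k = standardize(rows), standardize(rows_k)
std = euclidean(z[0], z[1])
std_k = euclidean(z_k[0], z_k[1])        # same as std: unit-free
```

With unstandardized inputs, the variable measured on the largest scale drives the distance and a change of units changes the result. With standardized inputs the distance is the same whichever units are used.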

27 Linkage rules
Matching: Sort by distance, choose top j pairs as matches
Disclosure analysis: Sort by distance, identify true matches among top j pairs
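
Both rules come down to ranking candidates by distance. A minimal sketch for a single target record, assuming Euclidean distance (`math.dist`, Python 3.8+) and invented records whose `id` field is known only to the analyst doing the disclosure check:

```python
import math


def rank_candidates(target, candidates):
    """Sort candidate records from closest to farthest from the target."""
    return sorted(candidates, key=lambda c: math.dist(target["x"], c["x"]))


def true_match_in_top_j(target, candidates, j):
    """Disclosure check: is the true source record among the j closest?"""
    top_j = rank_candidates(target, candidates)[:j]
    return any(c["id"] == target["id"] for c in top_j)


# Hypothetical released record and candidate source records
target = {"id": 7, "x": (1.0, 2.0)}
candidates = [
    {"id": 3, "x": (1.1, 2.0)},
    {"id": 7, "x": (1.2, 2.1)},
    {"id": 9, "x": (5.0, 5.0)},
]

closest = rank_candidates(target, candidates)[0]
```

In this example the closest candidate is not the true source record, so a matching rule with j = 1 would produce a false link, while the disclosure check reports the true record within the top 2.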

28 Acknowledgements
This lecture is based in part on lectures given in 2000 and 2004 by William Winkler, William Yancey and Edward Porter at the U.S. Census Bureau.
Some portions draw on Winkler (1995), “Matching and Record Linkage,” in B.G. Cox et al. (eds.), Business Survey Methods, New York, J. Wiley, 355-384.
Some (non-confidential) portions are drawn from Abowd, Stinson, Benedetto (2006), “Final Report to Social Security Administration on the SIPP/SSA/IRS Public Use File Project.”

