© John M. Abowd and Lars Vilhuber 2005, all rights reserved Introduction to Probabilistic Record Linking John M. Abowd and Lars Vilhuber March 2005.



Overview
Introduction to probabilistic record linking
  – What is record linking, what is it not, what is the theory?
Probabilistic record linking: Applications and examples
  – How do you do it, what do you need, what are the possible complications?
Examples of record linking
Do-it-yourself record linking

From imputing to linking
[Figure: four approaches placed along two axes, "Precision of link" and "Availability of linked data"]
“Classical”: merge match by link variable
“Probabilistic record linkage”: merge match with imperfect link variables
“Massively imputed”: common variables/values, but datasets can’t be linked
“Simulated data”: no common variables, only moments transferred

Definitions of record linkage
“a procedure to find pairs of records in two files that represent the same entity”
“identify duplicate records within a file”

Uses of record linkage
Merging two files for microdata analysis
  – CPS base survey to a supplement
  – SIPP interviews to each other
  – Merging years of the Business Register
  – Merging two years of CPS
  – Merging financial info to firm survey
Updating a survey frame
  – Based on business lists
  – Based on tax records
Disclosure review of potential public use microdata
Data mining by businesses and law enforcement

Types of record linkage
Merging two files for microdata analysis
  – CPS base survey to a supplement
  – SIPP interviews to each other
  – Merging years of Business Register
  – Merging two years of CPS
  – Merging financial info to firm survey
Updating a survey frame
  – Based on business lists
  – Based on tax records
Disclosure review of potential public use microdata
Deterministic linkage: survey-provided IDs
Probabilistic linkage: imperfect or no IDs
Probabilistic linkage: no IDs

Basic definitions and notation
Sets of entities: A, B
Associated files: α(A), β(B)
Matches: M = {(a, b) ∈ A × B : a = b}
Nonmatches: U = {(a, b) ∈ A × B : a ≠ b}

Agreement patterns
Comparison space: Γ
Comparison vector: γ ∈ Γ, with components γ = (γ1, ..., γK)
Components of the comparison vector take on finitely many values, typically {0,1}

Example agreement pattern
Three binary comparisons test whether
  – γ1: pair agrees on last name
  – γ2: pair agrees on first name
  – γ3: pair agrees on street name
Sample agreement pattern: γ = (1,0,1)
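The example pattern can be reproduced with a few lines of Python. This is only a minimal sketch: the two records, the field names (`last_name`, `first_name`, `street_name`) and the choice of exact, case-insensitive comparison are assumptions made for illustration, not part of the lecture.

```python
def agrees(x, y):
    """Return 1 if two strings agree after trimming and lowercasing, else 0."""
    return int(x.strip().lower() == y.strip().lower())


def comparison_vector(rec_a, rec_b,
                      fields=("last_name", "first_name", "street_name")):
    """Build the binary agreement pattern gamma for one record pair."""
    return tuple(agrees(rec_a[f], rec_b[f]) for f in fields)


# Two fictitious records, one from each file (hypothetical values).
rec_a = {"last_name": "Smith", "first_name": "John", "street_name": "Main"}
rec_b = {"last_name": "SMITH", "first_name": "Jon", "street_name": "Main "}

gamma = comparison_vector(rec_a, rec_b)
print(gamma)  # (1, 0, 1)
```

Exact comparison is the simplest possible choice; a real application might compare standardized or approximately matching strings instead.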

Linkage rule
A linkage rule defines a record pair’s status based on its agreement pattern
  – Link (L)
  – Undecided (Clerical, C)
  – Non-link (N)

Conditional probabilities
Probability that a record pair has agreement pattern γ given that it is a match [nonmatch]: P(γ|M) [P(γ|U)]
Agreement ratio: R(γ) = P(γ|M) / P(γ|U)
This ratio will determine the distinguishing power of the comparison γ.
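A small numerical sketch may help fix ideas. It assumes that the three fields agree independently of each other within the match and nonmatch classes, and it uses made-up per-field agreement probabilities (`M_PROBS`, `U_PROBS`); neither the numbers nor the function names come from the lecture.

```python
import math

# Hypothetical per-field agreement probabilities (illustration only).
M_PROBS = (0.95, 0.85, 0.80)  # P(field agrees | pair is a match)
U_PROBS = (0.01, 0.05, 0.10)  # P(field agrees | pair is a nonmatch)


def pattern_prob(gamma, probs):
    """P(gamma | class), assuming fields agree independently within the class."""
    return math.prod(p if g == 1 else 1.0 - p for g, p in zip(gamma, probs))


def agreement_ratio(gamma):
    """R(gamma) = P(gamma | M) / P(gamma | U)."""
    return pattern_prob(gamma, M_PROBS) / pattern_prob(gamma, U_PROBS)


gamma = (1, 0, 1)
ratio = agreement_ratio(gamma)
print(round(ratio, 2))  # 120.0
```

With these assumed values, agreement on last name and street name outweighs disagreement on first name: the pattern is about 120 times more likely among matches than among nonmatches.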

Error rates
False match: a linked pair that is not a match (type II error)
False match rate: probability that a nonmatch is designated a link (L): μ = P(L|U)
False nonmatch: a nonlinked pair that is a match (type I error)
False nonmatch rate: probability that a match is designated a nonlink (N): λ = P(N|M)

Fundamental theorem
1. Order the comparison vectors {γj} by R(γ)
2. Choose upper (Tu) and lower (Tl) cutoff values for R(γ)
3. Linkage rule:
   F(γ) = L if R(γ) ≥ Tu
   F(γ) = C if Tl < R(γ) < Tu
   F(γ) = N if R(γ) ≤ Tl
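The three steps can be sketched as follows. The agreement ratios, the cutoff values and the treatment of ratios that fall exactly on a cutoff (a ratio equal to the upper cutoff is a link, a ratio equal to the lower cutoff is a non-link) are assumptions chosen for illustration.

```python
def linkage_rule(ratio, t_lower, t_upper):
    """Return "L" (link), "C" (clerical review) or "N" (non-link)."""
    if t_lower > t_upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    if ratio >= t_upper:
        return "L"
    if ratio <= t_lower:
        return "N"
    return "C"


# Hypothetical agreement ratios R(gamma) for a few agreement patterns.
ratios = {
    (1, 1, 1): 12920.0,
    (1, 0, 1): 120.0,
    (1, 0, 0): 3.3,
    (0, 0, 0): 0.002,
}

# Step 1: order the patterns by R(gamma), largest first.
ordered = sorted(ratios, key=ratios.get, reverse=True)

# Steps 2 and 3: pick cutoffs (hypothetical values) and apply the rule.
decisions = {g: linkage_rule(ratios[g], t_lower=1.0, t_upper=100.0)
             for g in ordered}

for g in ordered:
    print(g, ratios[g], decisions[g])
```

Moving the two cutoffs closer together shrinks the clerical review set at the cost of higher error rates; moving them apart does the opposite.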

Fundamental theorem (cont.)
Error rates are
   μ = P(L|U) = Σ P(γ|U), summed over the patterns γ designated links (L)
   λ = P(N|M) = Σ P(γ|M), summed over the patterns γ designated non-links (N)
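For a small comparison space the two error rates can be computed by enumerating every agreement pattern. The sketch below assumes three binary comparisons, conditional independence of the fields, and hypothetical agreement probabilities and cutoffs; all names and numbers are illustrative.

```python
import math
from itertools import product

# Hypothetical inputs (illustration only).
M_PROBS = (0.95, 0.85, 0.80)   # P(field agrees | match)
U_PROBS = (0.01, 0.05, 0.10)   # P(field agrees | nonmatch)
T_LOWER, T_UPPER = 1.0, 100.0  # cutoffs on the agreement ratio


def pattern_prob(gamma, probs):
    """P(gamma | class), assuming fields agree independently within the class."""
    return math.prod(p if g == 1 else 1.0 - p for g, p in zip(gamma, probs))


def decide(gamma):
    """Apply the cutoff rule to the agreement ratio of one pattern."""
    ratio = pattern_prob(gamma, M_PROBS) / pattern_prob(gamma, U_PROBS)
    if ratio >= T_UPPER:
        return "L"
    if ratio <= T_LOWER:
        return "N"
    return "C"


patterns = list(product((0, 1), repeat=3))  # all 8 agreement patterns

# False match rate: nonmatch probability mass that lands in the link set.
mu = sum(pattern_prob(g, U_PROBS) for g in patterns if decide(g) == "L")

# False nonmatch rate: match probability mass that lands in the non-link set.
lam = sum(pattern_prob(g, M_PROBS) for g in patterns if decide(g) == "N")

print(f"mu = {mu:.5f}, lambda = {lam:.5f}")
```

Under these assumed inputs the link set contains the three patterns that agree on last name and at least one other field, and the non-link set contains the three patterns that disagree on last name and agree on at most one other field.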

Fundamental theorem (3)
Fellegi & Sunter (JASA, 1969): If the error rates for the elements of the comparison vector are conditionally independent, then, given the overall error rates (μ, λ), the linkage rule F minimizes the probability that an agreement pattern γ is placed in the clerical review set (optimal linkage rule).

Applying the theory
The theory holds on any subset of match pairs (blocks)
Ratio R: matching weight or total agreement weight
Optimality of the decision rule depends heavily on the probabilities P(γ|M) and P(γ|U)
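Restricting comparisons to blocks can be sketched as follows: only record pairs that share a blocking key are generated and passed on to the comparison step. The blocking variable (`zip`), the record layout and the function name `block_pairs` are hypothetical choices for illustration.

```python
from collections import defaultdict


def block_pairs(file_a, file_b, key):
    """Yield only those (a, b) pairs whose blocking key values are equal."""
    index_b = defaultdict(list)
    for rec_b in file_b:
        index_b[rec_b[key]].append(rec_b)
    for rec_a in file_a:
        for rec_b in index_b.get(rec_a[key], []):
            yield rec_a, rec_b


# Fictitious files (hypothetical values).
file_a = [
    {"id": "a1", "last_name": "Smith", "zip": "14850"},
    {"id": "a2", "last_name": "Jones", "zip": "20233"},
]
file_b = [
    {"id": "b1", "last_name": "Smith", "zip": "14850"},
    {"id": "b2", "last_name": "Smyth", "zip": "14850"},
    {"id": "b3", "last_name": "Brown", "zip": "10001"},
]

pairs = [(a["id"], b["id"]) for a, b in block_pairs(file_a, file_b, "zip")]
print(pairs)  # [('a1', 'b1'), ('a1', 'b2')]
```

Here 2 pairs are compared instead of the 6 in the full cross product. The trade-off is that true matches whose blocking key disagrees (for example, after a move to a different ZIP code) are never compared at all.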

Acknowledgements
This lecture is based in part on lectures given in 2000 and 2004 by William Winkler, William Yancey and Edward Porter at the U.S. Census Bureau.
Some portions draw on Winkler (1995), “Matching and Record Linkage,” in B.G. Cox et al. (eds.), Business Survey Methods, New York: J. Wiley.
Examples are all purely fictitious, but inspired by true cases presented in the above lectures and in Abowd & Vilhuber (2004).