Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved.


Overview
Introduction to record linking
– What is record linking, what is it not, what is the theory?
Record linking: Applications and examples
– How do you do it, what do you need, what are the possible complications?
Examples of record linking
Do it yourself record linking

From Imputing to Linking
[Figure: four approaches placed along two axes, precision of link and availability of linked data]
– “Statistical record linkage”: merge match with imperfect link variables
– “Massively imputed”: common variables/values, but datasets can’t be linked
– “Simulated data”: no common observations
– “Classical”: merge match by link variable

Definitions of Record Linkage
“a procedure to find pairs of records in two files that represent the same entity”
“identify duplicate records within a file”

Uses of Record Linkage
Merging two files for micro-data analysis
– CPS base survey to a supplement
– SIPP interviews to each other
– Merging years of Business Register
– Merging two years of CPS
– Merging financial info to firm survey
Updating/unduplicating a survey frame or an electoral list
– Based on business lists
– Based on tax records
Disclosure review of potential public use micro-data

Uses of Record Linkage (private-sector applications)
Merging two files …
– Credit scoring
– Customer lists after merger
– Internal files when consolidating/upgrading software
Updating/unduplicating junk mail lists
– Based on multiple sources of lists
Disclosure review of potential public use micro-data
– Not done… (Netflix case)

Types of Record Linkage
Merging two files for micro-data analysis
– CPS base survey to a supplement
– SIPP interviews to each other
– Merging years of Business Register
– Merging two years of CPS*
– Merging financial info to firm survey
Updating a survey frame or an electoral list
– Based on business lists
– Based on tax records
Disclosure review of potential public use micro-data
Slide callouts:
– Deterministic linkage: survey-provided IDs
– Probabilistic linkage: imperfect or no IDs
– Probabilistic linkage: no IDs

Methods of Record Linkage
Probabilistic record linkage (PBRL)
– non-parametric methods
– regression-based methods
Distance-based record linkage (DBRL)
– Euclidean distance
– Mahalanobis distance
– Kernel-based distance

Need for Automated Record Linkage
RA time required for the following matching tasks:
– Finding financial records for
  Fortune 100: 200 hours (Abowd, 1989)
  50,000 small businesses: ??? hours
– Identifying miscoded SSNs
  on 60,000 wage records: several weeks
  on 500 million wage records: ????
– Unduplication of the U.S. Census survey frame (115,904,641 households): ????
– Longitudinally linking the 12 million establishments in the Business Register: ????

Basic Definitions and Notation
Entities
Associated files
Records on files
Matches
Nonmatches

Comparisons
Comparison function maps comparison space into some domain:
Comparison vector
PBRL: agreement pattern, finitely many values, typically {0,1}, but can be reals
DBRL: distance (scalar)

Linkage Rule
A linkage rule defines a record pair’s status based on its comparison value
– Link (L)
– Undecided (Clerical, C)
– Non-link (N)

In a perfect world…
[equation not preserved in the transcript] and [equation not preserved in the transcript]

Linkage Rules Depend on Context
PBRL:
– For matching: rank by agreement ratios, use cutoff values to classify into {L,C,N}
– For disclosure analysis: rank by agreement ratios, classify as {L} if true link (M) is among top j pairs
DBRL:
– Rank pairs by distance, link closest pairs

Probabilistic Record Linkage

Example Agreement Pattern
3 binary comparisons test whether
– γ_1: pair agrees on last name
– γ_2: pair agrees on first name
– γ_3: pair agrees on street name
Simple agreement pattern: γ = (1,0,1)
Complex agreement pattern: γ = (0.66,0,0.8)
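
The two kinds of agreement pattern above can be sketched in Python. This is a minimal illustration, not the lecture's method: the field names, the two records, and the use of `difflib.SequenceMatcher` as a graded string comparator are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical field names and records, for illustration only.
FIELDS = ("last_name", "first_name", "street")

def binary_pattern(rec_a, rec_b, fields=FIELDS):
    """Return 0/1 agreement indicators, one per field (case-insensitive)."""
    return tuple(
        int(rec_a[f].strip().lower() == rec_b[f].strip().lower()) for f in fields
    )

def graded_pattern(rec_a, rec_b, fields=FIELDS):
    """Return similarity scores in [0, 1], one per field.

    SequenceMatcher is an assumed stand-in for whatever string
    comparator a production linkage system would use.
    """
    return tuple(
        round(SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio(), 2)
        for f in fields
    )

a = {"last_name": "Smith", "first_name": "John", "street": "Main"}
b = {"last_name": "smith", "first_name": "Jon", "street": "Main"}

simple = binary_pattern(a, b)   # agrees on last name and street, not first name
graded = graded_pattern(a, b)   # partial credit for the near-miss on first name
```

The binary version discards the information that "John" and "Jon" are close, which is the motivation for real-valued patterns.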

Conditional Probabilities
Probability that a record pair has agreement pattern γ given that it is a match [nonmatch]: P(γ|M) [P(γ|U)]
Agreement ratio: R(γ) = P(γ|M) / P(γ|U)
This ratio determines the distinguishing power of the comparison γ.
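
As a rough sketch of how R(γ) might be computed for a binary pattern, the code below assumes conditional independence across fields, so that P(γ|M) and P(γ|U) factor into per-field terms. The per-field probabilities are made-up illustrative values, not estimates from the lecture.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
# m = P(field agrees | match), u = P(field agrees | nonmatch)
FIELDS = ("last_name", "first_name", "street")
M_PROB = {"last_name": 0.95, "first_name": 0.90, "street": 0.80}
U_PROB = {"last_name": 0.01, "first_name": 0.05, "street": 0.10}

def pattern_probability(gamma, probs):
    """P(gamma | class), assuming fields agree independently within a class."""
    p = 1.0
    for g, f in zip(gamma, FIELDS):
        p *= probs[f] if g == 1 else 1.0 - probs[f]
    return p

def agreement_ratio(gamma):
    """R(gamma) = P(gamma | M) / P(gamma | U)."""
    return pattern_probability(gamma, M_PROB) / pattern_probability(gamma, U_PROB)

def log2_weight(gamma):
    """Log of the agreement ratio; under independence it is a sum of field weights."""
    return math.log2(agreement_ratio(gamma))

r_example = agreement_ratio((1, 0, 1))
```

With these assumed values, the pattern (1,0,1) is far more likely among matches than among nonmatches, while (0,0,0) has a ratio well below 1.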

Error Rates
False match: a linked pair that is not a match (type II error)
False match rate: probability that a designated link (L) is a nonmatch: μ = P(L|U)
False nonmatch: a nonlinked pair that is a match (type I error)
False nonmatch rate: probability that a designated nonlink (N) is a match: λ = P(N|M)

Fundamental Theorem
1. Order the comparison vectors {γ_j} by R(γ)
2. Choose upper T_u and lower T_l cutoff values for R(γ)
3. Linkage rule: F(γ) = L if R(γ) ≥ T_u; C if T_l < R(γ) < T_u; N if R(γ) ≤ T_l
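
A cutoff-based linkage rule of this kind can be sketched as a small function that assigns L at or above the upper cutoff, N at or below the lower cutoff, and C in between. The cutoff values, the example ratios, and the treatment of ties at the boundaries are assumptions made for illustration.

```python
def classify(ratio, t_lower, t_upper):
    """Assign a pair to L (link), C (clerical review), or N (non-link)."""
    if t_lower > t_upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    if ratio >= t_upper:
        return "L"
    if ratio <= t_lower:
        return "N"
    return "C"

# Hypothetical agreement ratios for three record pairs.
ratios = {"pair_1": 13680.0, "pair_2": 80.0, "pair_3": 0.2}
statuses = {
    pair: classify(r, t_lower=1.0, t_upper=100.0) for pair, r in ratios.items()
}
```

Moving the two cutoffs closer together shrinks the clerical review set at the cost of more false matches or false nonmatches.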

Fundamental Theorem (cont.)
Error rates are
μ = Σ P(γ|U), summed over all γ with R(γ) ≥ T_u
λ = Σ P(γ|M), summed over all γ with R(γ) ≤ T_l
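
For a small binary comparison space, the error rates implied by a pair of cutoffs can be computed by enumeration: μ adds up P(γ|U) over patterns assigned to L, and λ adds up P(γ|M) over patterns assigned to N. The sketch below assumes three conditionally independent fields with made-up probabilities and a deterministic rule with no randomization at the cutoffs.

```python
from itertools import product

# Hypothetical per-field probabilities (illustrative, not from the lecture).
M_PROB = (0.95, 0.90, 0.80)   # P(field agrees | match)
U_PROB = (0.01, 0.05, 0.10)   # P(field agrees | nonmatch)

def prob(gamma, probs):
    """P(gamma | class) under assumed conditional independence."""
    p = 1.0
    for g, q in zip(gamma, probs):
        p *= q if g else 1.0 - q
    return p

def classify(ratio, t_lower, t_upper):
    """L at or above the upper cutoff, N at or below the lower, else C."""
    if ratio >= t_upper:
        return "L"
    if ratio <= t_lower:
        return "N"
    return "C"

def error_rates(t_lower, t_upper):
    """Return (mu, lam): false match rate and false nonmatch rate."""
    mu = lam = 0.0
    for gamma in product((0, 1), repeat=len(M_PROB)):
        p_m, p_u = prob(gamma, M_PROB), prob(gamma, U_PROB)
        status = classify(p_m / p_u, t_lower, t_upper)
        if status == "L":
            mu += p_u      # nonmatches that would be linked
        elif status == "N":
            lam += p_m     # matches that would be left unlinked
    return mu, lam

mu, lam = error_rates(t_lower=1.0, t_upper=100.0)
```

Widening the gap between the cutoffs lowers both error rates but sends more patterns to clerical review, which is the trade-off the theorem addresses.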

Fundamental Theorem (3)
Fellegi & Sunter (JASA, 1969): If the error rates for the elements of the comparison vector are conditionally independent, then given the overall error rates (μ, λ), the linkage rule F minimizes the probability associated with an agreement pattern γ being placed in the clerical review set. (optimal linkage rule)

Applying the Theory
The theory holds on any subset of match pairs (blocks)
Ratio R: matching weight or total agreement weight
Optimality of decision rule heavily dependent on the probabilities P(γ|M) and P(γ|U)
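
Because the theory applies within blocks, a common practical step is to compare only pairs that share a blocking key rather than every pair in the cross product. The sketch below is one simple way to generate such candidate pairs; the record layout and the choice of ZIP code as the blocking key are assumptions made for illustration.

```python
from collections import defaultdict

def candidate_pairs(file_a, file_b, block_key):
    """Yield (record_a, record_b) pairs whose blocking key values are equal."""
    blocks_b = defaultdict(list)
    for rec in file_b:
        blocks_b[block_key(rec)].append(rec)
    for rec_a in file_a:
        for rec_b in blocks_b.get(block_key(rec_a), []):
            yield rec_a, rec_b

# Hypothetical records; "zip" is an assumed blocking variable.
file_a = [{"id": "a1", "zip": "14850"}, {"id": "a2", "zip": "10001"}]
file_b = [
    {"id": "b1", "zip": "14850"},
    {"id": "b2", "zip": "14850"},
    {"id": "b3", "zip": "20233"},
]

pairs = list(candidate_pairs(file_a, file_b, lambda r: r["zip"]))
pair_ids = [(ra["id"], rb["id"]) for ra, rb in pairs]
```

Here blocking reduces six possible comparisons to two, at the risk of missing true matches whose blocking key is miscoded.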

Distance-Based Record Linking

Distance-Based Record Linking
Distance between any pair of records can be generally defined as
[equation not preserved in the transcript]

DBRL: 4 cases
Mahalanobis distance, known covariance
Mahalanobis distance, unknown covariance

DBRL: 4 cases
Euclidean distance, unstandardized inputs
Euclidean distance, standardized inputs
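
The difference between the two Euclidean cases can be seen in a short sketch. The variables (income, age), the reference population used to compute scales, and the use of the population standard deviation for standardization are all assumptions made for this example.

```python
import math
from statistics import pstdev

def euclidean(x, y):
    """Euclidean distance on the raw (unstandardized) inputs."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def column_scales(rows):
    """Standard deviation of each variable across the given rows."""
    return [pstdev(col) for col in zip(*rows)]

def standardized_euclidean(x, y, scales):
    """Euclidean distance after dividing each difference by its scale."""
    return math.sqrt(sum(((xi - yi) / s) ** 2 for xi, yi, s in zip(x, y, scales)))

# Hypothetical (income, age) records used only to set the scales.
population = [(30000, 20), (50000, 30), (70000, 40), (90000, 50), (110000, 60)]
scales = column_scales(population)

target = (50000, 30)
cand_1 = (51000, 60)   # close on income, far on age
cand_2 = (60000, 31)   # farther on income, close on age

raw = (euclidean(target, cand_1), euclidean(target, cand_2))
std = (
    standardized_euclidean(target, cand_1, scales),
    standardized_euclidean(target, cand_2, scales),
)
```

On raw inputs income dominates and the first candidate looks closest; after standardization the age gap matters and the second candidate is closer.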

Linkage rules
Matching: Sort by distance, choose top j pairs as matches
Disclosure analysis: Sort by distance, identify true matches among top j pairs
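
Both rules amount to ranking candidates by distance and keeping the nearest j. The sketch below uses plain Euclidean distance via `math.dist` (Python 3.8+); the record identifiers, coordinates, and the known true identifier used for the disclosure check are hypothetical.

```python
import math

def rank_candidates(target, candidates):
    """Return candidate ids sorted from nearest to farthest from the target."""
    return sorted(candidates, key=lambda cid: math.dist(target, candidates[cid]))

def link_top_j(target, candidates, j=1):
    """Matching rule: designate the j nearest candidates as links."""
    return rank_candidates(target, candidates)[:j]

def reidentified(target, candidates, true_id, j=1):
    """Disclosure check: is the known true match among the j nearest?"""
    return true_id in link_top_j(target, candidates, j)

# Hypothetical released record and candidate source records.
released = (41.0, 10.5)
source = {"r1": (40.0, 10.0), "r2": (55.0, 12.0), "r3": (42.0, 14.0)}

links = link_top_j(released, source, j=1)
at_risk_j1 = reidentified(released, source, true_id="r3", j=1)
at_risk_j2 = reidentified(released, source, true_id="r3", j=2)
```

In the disclosure setting the analyst already knows the true match, so the question is only how high it ranks, not which pair to link.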

Acknowledgements
This lecture is based in part on lectures given in 2000 and 2004 by William Winkler, William Yancey and Edward Porter at the U.S. Census Bureau.
Some portions draw on Winkler (1995), “Matching and Record Linkage,” in B.G. Cox et al. (ed.), Business Survey Methods, New York: J. Wiley.
Some (non-confidential) portions are drawn from Abowd, Stinson, Benedetto (2006), “Final Report to Social Security Administration on the SIPP/SSA/IRS Public Use File Project”.