The Conditional Independence Assumption in Probabilistic Record Linkage Methods
Stephen Sharp, National Records of Scotland, Ladywell Road, Edinburgh EH12 7TF


The record linkage problem
Given two files A and B, the aim is to find record pairs that refer to the same person. This is done on the basis of linking fields common to the two files, such as first name, last name, date of birth and postcode. The data matrix therefore looks like this:

With four linking fields

Source of record   Linking field 1   Linking field 2   Linking field 3   Linking field 4
File A             A1                A2                A3                A4
File B             B1                B2                B3                B4

What is the assumption of conditional independence?
The likelihood that the two records refer to the same person is measured by a log likelihood ratio.
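The formula from this slide is not in the transcript. As a sketch in conventional notation (the symbols are assumptions, not taken from the slide): let \gamma be the vector of comparison outcomes for a record pair across k linking fields, M the event that the pair is a match and N the event that it is a non-match. The log likelihood ratio can then be written as:

```latex
W(\gamma) = \log \frac{P(\gamma \mid M)}{P(\gamma \mid N)},
\qquad \gamma = (\gamma_1, \dots, \gamma_k)
```

Large positive values of W favour a match and large negative values favour a non-match.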

What is the assumption of conditional independence?
This is much easier to work out if the observations are independent conditional on match status, because the likelihood ratio then factorises into a product of one term per linking field.
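The slide's own formula is missing from the transcript. A sketch of the factorisation, using assumed notation (\gamma_i is the comparison outcome on linking field i of k, M denotes a match and N a non-match):

```latex
W(\gamma)
  = \log \frac{\prod_{i=1}^{k} P(\gamma_i \mid M)}{\prod_{i=1}^{k} P(\gamma_i \mid N)}
  = \sum_{i=1}^{k} \log \frac{P(\gamma_i \mid M)}{P(\gamma_i \mid N)}
```

Each field therefore contributes its own additive weight, which can be estimated separately from the others.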

Why is the assumption of conditional independence important?
- It keeps the number of parameters manageable: linear rather than exponential in the number of linking fields
- It enables the use of frequency-based agreement weights
- It speeds up computing time
- It improves the stability of parameter estimation
- But it is almost always wrong, e.g. gender is almost wholly predictable from first name
- But does it matter?
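The first point can be illustrated with a small count. The sketch below assumes each linking field yields a binary agree/disagree outcome and that a separate distribution is fitted for matches and for non-matches; the function names are hypothetical and chosen for illustration only.

```python
def params_full(k: int) -> int:
    """Free parameters for an unrestricted joint distribution over k binary
    agree/disagree fields, fitted once for matches and once for non-matches."""
    # 2**k agreement patterns per class, minus 1 because probabilities sum to 1.
    return 2 * (2 ** k - 1)


def params_independent(k: int) -> int:
    """Free parameters when fields are independent given match status:
    one agreement probability per field per class."""
    return 2 * k


# Compare the two counts for a few numbers of linking fields.
for k in (4, 7, 10):
    print(k, params_independent(k), params_full(k))
```

With four fields the independent model needs 8 parameters against 30 for the unrestricted one, and the gap widens quickly as fields are added.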

Who adopts the conditional independence assumption?
- Rec Link (US Census Bureau): yes
- Link Plus (US Centers for Disease Control and Prevention): yes
- GRLS/Fundy (Statistics Canada): yes
- ORLS: yes (probably)
- RELAIS (Italian Statistical Institute): no

Two questions
- To what extent is the assumption violated in real data sets?
- How much effect does it have on the output of linkage software?

What does the assumption look like in practice?
A = Agree, D = Disagree, M = Match, N = Non-match

Linkage score   Agreement pattern (Fields 1 to 4)   Match status
High            AAAA                                M
                AAAA                                M
                AAAA                                M
                AAA                                 M
                …                                   …
Medium          ADA                                 M
                DAA                                 N
                ADA                                 N
                DAA                                 M
                …                                   …
Low             DDDA                                N
                DDAD                                N
                ADDD                                N
                DDD                                 N
                DADD                                N
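To show how agreement patterns like these turn into high, medium and low linkage scores under conditional independence, here is a minimal sketch. The per-field probabilities are made-up illustrative values, not estimates from any of the runs described here, and the use of base-2 logarithms is likewise an assumption.

```python
import math

# Hypothetical values for illustration only:
# M_PROBS[i] = P(field i agrees | match), U_PROBS[i] = P(field i agrees | non-match).
M_PROBS = [0.95, 0.90, 0.85, 0.98]
U_PROBS = [0.05, 0.02, 0.10, 0.50]


def field_weight(agree: bool, m: float, u: float) -> float:
    """Log likelihood ratio contributed by a single field."""
    if agree:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def linkage_score(pattern: str) -> float:
    """Sum the per-field weights for a pattern such as 'ADAA'
    (A = agree, D = disagree), assuming conditional independence."""
    if len(pattern) != len(M_PROBS) or set(pattern) - {"A", "D"}:
        raise ValueError(f"expected {len(M_PROBS)} characters from 'A'/'D': {pattern!r}")
    return sum(
        field_weight(ch == "A", m, u)
        for ch, m, u in zip(pattern, M_PROBS, U_PROBS)
    )


for pattern in ("AAAA", "ADAA", "DDDD"):
    print(pattern, round(linkage_score(pattern), 2))
```

Patterns with full agreement score highest, mixed patterns fall in between, and full disagreement gives a negative score, which mirrors the high, medium and low bands in the table.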

Calculating the correlations between linkage fields
- Run 1 (Rec Link): a 10% sample of the 2001 Scottish Census and the 2001 Census Coverage Survey, with one blocking field and seven linkage fields
- Run 2 (Link Plus): a sample of the Scottish NHSCR database and HESA records of Scottish students studying in England or Wales
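The slides do not say how the tetrachoric correlations were computed. As a rough sketch, the classical "cosine-pi" approximation estimates the tetrachoric correlation between two agree/disagree indicators from their 2x2 table of counts. This is a swapped-in approximation for illustration, not necessarily the method used for the runs above, and the function name and example counts are hypothetical.

```python
import math


def tetrachoric_approx(a: int, b: int, c: int, d: int) -> float:
    """Cosine-pi approximation to the tetrachoric correlation.

    a = both fields agree, b = only the first agrees,
    c = only the second agrees, d = both disagree.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("counts must be non-negative")
    if b * c == 0:
        if a * d == 0:
            raise ValueError("correlation is undefined for this table")
        return 1.0  # no discordant pairs: limiting value of the formula
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1 + math.sqrt(odds_ratio)))


# Hypothetical counts for two fields among record pairs of one match status.
print(round(tetrachoric_approx(40, 10, 10, 40), 3))
```

Under conditional independence these correlations should be close to zero within the matches and within the non-matches, so values well away from zero indicate a violation.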

Run 1: tetrachoric correlations for matches in the Census/CCS data, medium linkage scores only
Matches, N < 1707
[Table: pairwise correlations between first name, last name, house number, year of birth, month of birth, day of birth, postcode and gender; values not preserved in the transcript]

Run 1: tetrachoric correlations for non-matches in the Census/CCS data, medium linkage scores only
Non-matches, N < 303
[Table: pairwise correlations between first name, last name, house number, year of birth, month of birth, day of birth, postcode and gender; values not preserved in the transcript]

Run 2: tetrachoric correlations for matches in the NHSCR/HESA data, medium linkage scores only
Matches, N < 450
[Table: pairwise correlations between first name, last name, date of birth, postcode and gender; values not preserved in the transcript]

Run 2: tetrachoric correlations for non-matches in the NHSCR/HESA data, medium linkage scores only
Non-matches, N < 131
[Table: pairwise correlations between first name, last name, date of birth, postcode and gender; values not preserved in the transcript]

So the assumption of independence is significantly violated. Does it matter?
Runs 3, 4 and 5 all use the census/CCS data and Link Plus, but with different treatments of the date of birth:
- Run 3: specific to date format, treating the date as one field (so not assuming independence) but with “intelligence”
- Run 4: day, month and year treated as three separate fields (and therefore as independent)
- Run 5: day, month and year concatenated and treated as one field (so not assuming independence) but with no “intelligence”
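The difference between the Run 4 and Run 5 treatments can be sketched as follows. This only illustrates the comparison outcomes each treatment produces; it is not Link Plus's implementation, and the function names are hypothetical. The date-specific "intelligence" of Run 3 is not modelled because the slides do not describe it.

```python
from datetime import date


def compare_separate(a: date, b: date) -> dict:
    """Run 4 style: day, month and year give three agreement indicators,
    each weighted as if independent of the others."""
    return {
        "day": a.day == b.day,
        "month": a.month == b.month,
        "year": a.year == b.year,
    }


def compare_concatenated(a: date, b: date) -> dict:
    """Run 5 style: the whole date is one field, so any difference
    counts as a single disagreement."""
    return {"dob": a.isoformat() == b.isoformat()}


# Two hypothetical dates of birth that differ only in the day.
x, y = date(1980, 3, 12), date(1980, 3, 21)
print(compare_separate(x, y))
print(compare_concatenated(x, y))
```

With separate fields the pair still earns agreement weight for month and year, whereas the concatenated field records a single outright disagreement.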

Is run 4 worse than runs 3 and 5?

Run 6 – the Clackmannanshire data

Conclusions
- This is work in progress, with limited amounts of data currently available
- There is no evidence that the assumption of conditional independence has negative effects on output quality
- Future intentions include bringing in more packages, such as RELAIS v2.2, and a wider variety of data sets where training data are available
- For the moment, any views on the methods used and/or the findings so far?
