The Conditional Independence Assumption in Probabilistic Record Linkage Methods Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12.


1 The Conditional Independence Assumption in Probabilistic Record Linkage Methods Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12 7TF stephen.sharp@gro-scotland.gsi.gov.uk

2 The record linkage problem
Given two files A and B, the aim is to find record pairs which refer to the same person. This is done on the basis of linking fields common to the two files, such as first name, last name, date of birth and postcode. The data matrix therefore looks like this:

3 With four linking fields

Source of record   Linking field 1   Linking field 2   Linking field 3   Linking field 4
File A             A1                A2                A3                A4
File B             B1                B2                B3                B4

4 What is the assumption of conditional independence?
The likelihood that the two records refer to the same person is measured by a log likelihood ratio.

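5 What is the assumption of conditional independence?
This is much easier to work out if the observations are independent conditional on match status, because the likelihood ratio then factorises into a product of per-field ratios and its logarithm becomes a sum of per-field terms.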
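The formulas shown on slides 4 and 5 are not present in this transcript. As a sketch only, in the usual Fellegi-Sunter style of notation (the symbols below are assumptions, not taken from the slides): let \gamma = (\gamma_1, ..., \gamma_k) be the agreement pattern over k linking fields, M a match and N a non-match. The first line is the general log likelihood ratio; the second is what it reduces to under conditional independence.

```latex
w(\gamma) = \log \frac{P(\gamma \mid M)}{P(\gamma \mid N)}

w(\gamma) = \log \prod_{i=1}^{k} \frac{P(\gamma_i \mid M)}{P(\gamma_i \mid N)}
          = \sum_{i=1}^{k} \log \frac{P(\gamma_i \mid M)}{P(\gamma_i \mid N)}
```

Each term in the sum depends on one field only, which is why per-field agreement weights can simply be added.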

6 Why is the assumption of conditional independence important?
- It keeps the number of parameters manageable: linear rather than exponential in the number of linking fields
- It enables the use of frequency-based agreement weights
- It speeds up computing time
- It improves the stability of parameter estimation
- But it is almost always wrong, e.g. gender is almost wholly predictable from first name
- But does it matter?
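The "linear rather than exponential" point can be made concrete with a small count. This is a sketch under stated assumptions: every field is compared as a binary agree/disagree outcome, there are two classes (match and non-match), and the overall match proportion is left out of the count. The function names are illustrative only.

```python
def n_params_full(k: int) -> int:
    """Free parameters with no independence assumption.

    Each class needs one probability per agreement pattern; there are 2**k
    patterns and the probabilities sum to 1, so 2**k - 1 are free per class.
    """
    return 2 * (2 ** k - 1)


def n_params_independent(k: int) -> int:
    """Free parameters under conditional independence.

    Each class needs only one agreement probability per field.
    """
    return 2 * k


if __name__ == "__main__":
    for k in (4, 7, 10):
        print(k, n_params_independent(k), n_params_full(k))
```

With four linking fields the independent model needs 8 probabilities against 30 for the unrestricted one, and the gap widens quickly as fields are added.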

7 Who adopts the conditional independence assumption?
- Rec Link (US Census Bureau): yes
- Link Plus (US Centers for Disease Control and Prevention): yes
- GRLS/Fundy (Statistics Canada): yes
- ORLS: yes (probably)
- RELAIS (Italian Statistical Institute): no

8 Two questions
- To what extent is the assumption violated in real data sets?
- How much effect does it have on the output of linkage software?

9 What does the assumption look like in practice?
A = Agree, D = Disagree, M = Match, N = Non-match

Linkage score   Pattern (Field 1, Field 2, Field 3, Field 4, Match status)
High            AAAAM
                AAAAM
                AAAAM
                AAAM
                ……
Medium          ADAM
                DAAN
                ADAN
                DAAM
                ……
Low             DDDAN
                DDADN
                ADDDN
                DDDN
                DADDN
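To illustrate how an agreement pattern turns into a high, medium or low linkage score under conditional independence, here is a minimal sketch. The m-probabilities (agreement given a match) and u-probabilities (agreement given a non-match) are invented for illustration and do not come from the presentation or from any of the packages it mentions.

```python
import math

# Hypothetical per-field probabilities of agreement (assumptions, not real estimates).
M_PROBS = [0.95, 0.95, 0.90, 0.98]  # P(agree | match) for fields 1 to 4
U_PROBS = [0.05, 0.01, 0.10, 0.50]  # P(agree | non-match) for fields 1 to 4


def linkage_score(pattern: str) -> float:
    """Sum of per-field log2 likelihood ratios for a pattern such as 'ADAA'."""
    if len(pattern) != len(M_PROBS):
        raise ValueError("pattern length must equal the number of fields")
    score = 0.0
    for outcome, m, u in zip(pattern, M_PROBS, U_PROBS):
        if outcome == "A":
            score += math.log2(m / u)              # agreement weight
        elif outcome == "D":
            score += math.log2((1 - m) / (1 - u))  # disagreement weight
        else:
            raise ValueError("pattern may contain only 'A' or 'D'")
    return score


if __name__ == "__main__":
    for p in ("AAAA", "ADAA", "DDDA", "DDDD"):
        print(p, round(linkage_score(p), 2))
```

Because the score is a plain sum, changing one field from A to D moves the total by a fixed amount regardless of what the other fields show, which is exactly what the independence assumption implies.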

10 Calculating the correlations between linkage fields
- Run 1 (Rec Link): a 10% sample of the 2001 Scottish Census and the 2001 census coverage survey (CCS), with one blocking field and seven linkage fields
- Run 2 (Link Plus): a sample of the Scottish NHSCR database and HESA records of Scottish students studying in England or Wales
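The correlations reported for these runs are tetrachoric, computed between the agree/disagree outcomes of pairs of fields. The presentation does not say how they were estimated. As a rough illustration only, the sketch below uses the well-known cosine approximation to the tetrachoric correlation on a 2x2 table of counts, rather than a full maximum-likelihood fit; the function name and the cell labels are assumptions.

```python
import math


def tetrachoric_approx(a: int, b: int, c: int, d: int) -> float:
    """Cosine approximation to the tetrachoric correlation of a 2x2 table.

    a = both fields agree, b = only field 1 agrees,
    c = only field 2 agrees, d = both fields disagree.
    Uses r = cos(pi / (1 + sqrt(a*d / (b*c)))).
    """
    if min(a, b, c, d) < 0:
        raise ValueError("counts must be non-negative")
    ad, bc = a * d, b * c
    if ad == 0 and bc == 0:
        raise ValueError("correlation is undefined for this table")
    if bc == 0:
        return 1.0  # odds ratio is infinite: limiting value of the formula
    return math.cos(math.pi / (1 + math.sqrt(ad / bc)))


if __name__ == "__main__":
    print(tetrachoric_approx(25, 25, 25, 25))  # no association, close to 0
    print(tetrachoric_approx(40, 10, 10, 40))  # positive association
    print(tetrachoric_approx(10, 40, 40, 10))  # negative association
```

Under conditional independence, the value computed within matches and within non-matches should be close to zero for every pair of fields.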

11 Run 1: tetrachoric correlations for matches in the Census/CCS data (medium linkage scores only)
Matches, N < 1707

                 first name  last name   house no   dob year    dob mon    dob day  post code     gender
first name             1.00      -0.11       0.09      -0.33      -0.46      -0.40      -0.17      -0.12
last name             -0.11       1.00       0.01      -0.27      -0.36      -0.40      -0.29      -0.08
house number           0.09       0.01       1.00      -0.43      -0.32      -0.41      -0.03      -0.13
year of birth         -0.33      -0.27      -0.43       1.00       0.17       0.24      -0.12      -0.02
month of birth        -0.46      -0.36      -0.32       0.17       1.00       0.47      -0.07      -0.04
day of birth          -0.40      -0.40      -0.41       0.24       0.47       1.00      -0.13      -0.01
post code             -0.17      -0.29      -0.03      -0.12      -0.07      -0.13       1.00      -0.05
gender                -0.12      -0.08      -0.13      -0.02      -0.04      -0.01      -0.05       1.00

12 Run 1: tetrachoric correlations for non-matches in the Census/CCS data (medium linkage scores only)
Non-matches, N < 303

                 first name  last name   house no   dob year    dob mon    dob day  post code     gender
first name             1.00       0.19      -0.34      -0.15       0.13      -0.25      -0.70       0.22
last name              0.19       1.00      -0.14      -0.32      -0.10      -0.46      -0.46      -0.11
house number          -0.34      -0.14       1.00       0.03      -0.40       0.00       0.22      -0.18
year of birth         -0.15      -0.32       0.03       1.00      -0.03       0.22       0.02      -0.13
month of birth         0.13      -0.10      -0.40      -0.03       1.00      -0.03      -0.33      -0.08
day of birth          -0.25      -0.46       0.00       0.22      -0.03       1.00       0.17      -0.16
post code             -0.70      -0.46       0.22       0.02      -0.33       0.17       1.00      -0.16
gender                 0.22      -0.11      -0.18      -0.13      -0.08      -0.16      -0.16       1.00

13 Run 2: tetrachoric correlations for matches in the NHSCR/HESA data (medium linkage scores only)
Matches, N < 450

                 first name  last name birth date  post code     gender
first name             1.00      -0.07      -0.07      -0.65       0.10
last name             -0.07       1.00      -0.03      -0.03      -0.01
date of birth         -0.07      -0.03       1.00      -0.26      -0.01
post code             -0.65      -0.03      -0.26       1.00      -0.13
gender                 0.10      -0.01      -0.01      -0.13       1.00

14 Run 2: tetrachoric correlations for non-matches in the NHSCR/HESA data (medium linkage scores only)
Non-matches, N < 131

                 first name  last name birth date  post code     gender
first name             1.00       0.01      -0.66      -0.15      -0.54
last name              0.01       1.00       0.24      -0.44      -0.06
date of birth         -0.66       0.24       1.00      -0.49       0.19
post code             -0.15      -0.44      -0.49       1.00      -0.07
gender                -0.54      -0.06       0.19      -0.07       1.00

15 So the assumption of independence is significantly violated. Does it matter?
Runs 3, 4 and 5 all use the census/CCS data with Link Plus, but treat the date of birth differently:
- Run 3: a comparison specific to the date format, treating the date as one field (so not assuming independence) but with “intelligence”
- Run 4: day, month and year treated as three separate fields (and therefore as independent)
- Run 5: day, month and year concatenated and treated as one field (so not assuming independence) but with no “intelligence”
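The difference between the Run 4 and Run 5 treatments can be sketched as two ways of comparing the same pair of dates. This is an illustration only: the function names are hypothetical, it does not reproduce how Link Plus compares dates, and the date-specific "intelligence" of Run 3 is not modelled because the presentation does not describe it.

```python
from datetime import date


def compare_separate(d1: date, d2: date) -> tuple:
    """Run 4 style: day, month and year are three separate agree/disagree fields."""
    parts = ((d1.day, d2.day), (d1.month, d2.month), (d1.year, d2.year))
    return tuple("A" if x == y else "D" for x, y in parts)


def compare_concatenated(d1: date, d2: date) -> str:
    """Run 5 style: the whole date is one field compared as a single string."""
    def as_string(d: date) -> str:
        return f"{d.year:04d}{d.month:02d}{d.day:02d}"
    return "A" if as_string(d1) == as_string(d2) else "D"


if __name__ == "__main__":
    a, b = date(1980, 5, 17), date(1980, 5, 18)
    print(compare_separate(a, b))      # one outcome each for day, month, year
    print(compare_concatenated(a, b))  # a single outcome for the whole date
```

With separate fields, a date that is wrong by one day still earns agreement weight for month and year; with a concatenated field it counts as a plain disagreement.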

16 Is run 4 worse than runs 3 and 5?

17 Run 6 – the Clackmannanshire data

18 Conclusions
- This is work in progress, with limited amounts of data currently available
- No evidence that the assumption of conditional independence has negative effects on output quality
- Future intentions include bringing in more packages, such as RELAIS v2.2, and a wider variety of data sets where training data is available
- For the moment, any views on the methods used and/or findings so far?


