Challenges in data linkage: error and bias


1 Challenges in data linkage: error and bias
 Katie Harron October 2014 UCL Institute of Child Health

2 The linkage problem
Match status: Match (pair from same individual) or Non-match (pair from different individuals). Link status: Link or Non-link.
Link, Match: Identified match
Link, Non-match: False match
Non-link, Match: Missed match
Non-link, Non-match: Identified non-match
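The four cells of this table can be turned into summary measures of linkage error when true match status is known. The following is a minimal sketch, assuming a gold-standard sample is available; the class name `LinkageCounts` and the example counts are hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass
class LinkageCounts:
    """Counts of record pairs in each cell of the link/match table."""
    identified_matches: int       # link, match
    false_matches: int            # link, non-match
    missed_matches: int           # non-link, match
    identified_non_matches: int   # non-link, non-match

    def sensitivity(self) -> float:
        # Proportion of true matches that were linked.
        return self.identified_matches / (
            self.identified_matches + self.missed_matches)

    def specificity(self) -> float:
        # Proportion of true non-matches that were left unlinked.
        return self.identified_non_matches / (
            self.identified_non_matches + self.false_matches)

    def positive_predictive_value(self) -> float:
        # Proportion of links that are true matches.
        return self.identified_matches / (
            self.identified_matches + self.false_matches)


# Made-up counts, for illustration only.
counts = LinkageCounts(identified_matches=90, false_matches=5,
                       missed_matches=10, identified_non_matches=895)
print(counts.sensitivity(), counts.specificity(),
      counts.positive_predictive_value())
```

Missed matches lower sensitivity, while false matches lower both specificity and the positive predictive value.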

3 Deterministic linkage in Hospital Episode Statistics (HES)
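Figure: a stepwise matching algorithm with steps 1 to 3, running from "few false matches" at step 1 to "more missed matches" at step 3.
Identifiers shown across the steps: Sex, Date of Birth, NHS Number, Postcode, Local Patient Identifier within Provider.
What kind of linkage is this?
Which patients might be missed?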
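Stepwise deterministic linkage of this kind can be sketched as a sequence of exact-agreement rules, where records left unlinked by one step are passed to the next. The rules in `STEPS`, the field names and the example records below are assumptions made for illustration; they are not the actual HES algorithm.

```python
# Illustrative rules only: each step lists identifiers that must all agree.
STEPS = [
    ("sex", "dob", "nhs_number"),
    ("sex", "dob", "postcode", "local_id"),
]


def key(record, fields):
    """Return the tuple of identifier values, or None if any is missing."""
    values = tuple(record.get(f) for f in fields)
    return None if any(v in (None, "") for v in values) else values


def link(primary, linking):
    """Return {primary index: (linking index, step number)}."""
    links = {}
    for step_no, fields in enumerate(STEPS, start=1):
        index = {}
        for j, rec in enumerate(linking):
            k = key(rec, fields)
            if k is not None:
                index.setdefault(k, j)  # keep the first record per key
        for i, rec in enumerate(primary):
            if i in links:
                continue  # already linked at an earlier step
            k = key(rec, fields)
            if k is not None and k in index:
                links[i] = (index[k], step_no)
    return links


linking = [
    {"sex": "F", "dob": "2010-01-02", "nhs_number": "111",
     "postcode": "AB1 2CD", "local_id": "X1"},
    {"sex": "M", "dob": "2011-03-04", "nhs_number": "222",
     "postcode": "EF3 4GH", "local_id": "Y2"},
    {"sex": "F", "dob": "2012-05-06", "nhs_number": "333",
     "postcode": "IJ5 6KL", "local_id": "Z3"},
]
primary = [
    # Agrees on NHS number: linked at step 1.
    {"sex": "F", "dob": "2010-01-02", "nhs_number": "111",
     "postcode": "ZZ9 9ZZ", "local_id": "Q0"},
    # NHS number missing, postcode and local ID agree: linked at step 2.
    {"sex": "M", "dob": "2011-03-04", "nhs_number": None,
     "postcode": "EF3 4GH", "local_id": "Y2"},
    # NHS number missing and postcode changed: missed match.
    {"sex": "F", "dob": "2012-05-06", "nhs_number": None,
     "postcode": "MN7 8OP", "local_id": "Z3"},
]
links = link(primary, linking)
print(links)
```

In this toy data the third patient is missed because the only rule that could link them requires an unchanged postcode, which is one way to think about the question of which patients might be missed.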

4 Quality of unique identifiers
166,406 records of admissions to paediatric intensive care (PICANet)
85,137 non-matches, of which 46 (0.1%) had the same NHS number
81,269 matches, of which 3,207 (4%) had different NHS numbers
Hagger-Johnson et al. Causes and consequences of data linkage errors: False and missed matches following linkage of hospital data (under review)

5 Deterministic linkage with pseudonymisation at source
Courtesy of Peter Jones, ONS Beyond 2011 programme

6 Probabilistic linkage
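Figure: a record in the primary file (Ronald Fisher) is compared with three candidate records in the linking file (Ronald Fisher, Karl Pearson, Carl Gauss), giving pairs 1 to 3. Each pair is placed on an axis running from low match weight to high match weight.
Highest weight is retained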

7 Probabilistic linkage
Highest weight is retained (figure: pair 3 on an axis running from low match weight to high match weight)
m-probability: m = P(γ=1 | M) = sensitivity, the probability of agreement given that the records are from the same subject
u-probability: u = P(γ=1 | U) = 1 - specificity, the probability of agreement given that the records are from different subjects
Log ratio: w = log2(m/u) if identifiers agree; w = log2[(1-m)/(1-u)] if identifiers disagree
Match weight: W = ∑ w_i, summed over identifiers i
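The weight calculation can be sketched in a few lines of Python. The m- and u-probabilities in `PROBS` are made-up values chosen for illustration, not estimates from the presentation, and the sketch assumes every identifier is either in agreement or in disagreement (missing values are not handled).

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """w = log2(m / u), used when an identifier agrees."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """w = log2((1 - m) / (1 - u)), used when an identifier disagrees."""
    return math.log2((1 - m) / (1 - u))


def match_weight(agreements: dict, probs: dict) -> float:
    """W = sum of w_i over identifiers i for one record pair."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = probs[field]
        if agrees:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total


# Hypothetical (m, u) values for three identifiers.
PROBS = {
    "sex": (0.98, 0.5),
    "dob": (0.95, 0.003),
    "nhs_number": (0.97, 0.0001),
}

all_agree = match_weight(
    {"sex": True, "dob": True, "nhs_number": True}, PROBS)
dob_disagrees = match_weight(
    {"sex": True, "dob": False, "nhs_number": True}, PROBS)
print(all_agree, dob_disagrees)
```

Agreement on a discriminating identifier such as NHS number (small u) contributes a large positive weight, whereas agreement on sex (u near 0.5) contributes very little.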

8 Probabilistic linkage
Figure: distributions of match weights for matches and non-matches on an axis running from low match weight to high match weight. Labels on the figure: agreement on NHS number; agreement on sex; disagreement on date of birth; agree on some ids, disagree on some ids.
Chance (same date of birth)
Recording errors
Missing data

9 Two thresholds
Figure: distributions of match weights for matches and non-matches on an axis running from low match weight to high match weight, with two thresholds. Labels on the figure: links, missed matches, false matches.
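Classification against two thresholds can be sketched as follows. The treatment of pairs that fall between the thresholds as "uncertain", the threshold values and the example weights are all assumptions made for illustration; the slide itself only shows the thresholds on the weight axis.

```python
def classify(weight: float, lower: float, upper: float) -> str:
    """Classify a record pair by its match weight using two thresholds."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if weight >= upper:
        return "link"
    if weight < lower:
        return "non-link"
    return "uncertain"  # assumed handling of the region between thresholds


def error_counts(pairs, threshold: float):
    """Count (false matches, missed matches) at a single threshold.

    `pairs` is a list of (match weight, true match status) tuples, so
    this needs data where true match status is known.
    """
    false_matches = sum(1 for w, is_match in pairs
                        if w >= threshold and not is_match)
    missed_matches = sum(1 for w, is_match in pairs
                         if w < threshold and is_match)
    return false_matches, missed_matches


# Made-up weights and match status, for illustration only.
pairs = [(14.2, True), (9.1, True), (6.5, False),
         (3.0, True), (-2.4, False), (-7.8, False)]

print([classify(w, lower=0.0, upper=8.0) for w, _ in pairs])
for t in (0.0, 5.0, 8.0):
    print(t, error_counts(pairs, t))
```

In this toy data, raising the threshold removes the false match but leaves a missed match, and lowering it does the reverse.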

10 Evaluating linkage quality
Small amounts of linkage error can result in substantially biased results
The impact of linkage error on results is rarely reported
Linkage error affects different types of analysis in different ways

11 Why it’s important to evaluate linkage error
Schmidlin et al (2013) Impact of unlinked deaths and coding changes on mortality trends in the Swiss National Cohort. BMC Med Inform Decis Mak 13 (1):1

12 Why it’s important to evaluate linkage error
Hobbs, G. & Vignoles, A., Is free school meal status a valid proxy for socio-economic status (in schools research)? Centre for the Economics of Education; London School of Economics and Political Science.

13 Why it’s important to evaluate linkage error
Figure labels: Highly sensitive; Highly specific
Lariscy (2011). Differential Record Linkage by Hispanic Ethnicity and Age in Linked Mortality Studies: Implications for the Epidemiologic Paradox. J Aging Health 23(8):

14 Why it’s important to evaluate linkage error
Ford et al Characteristics of unmatched maternal and baby records in linked birth records and hospital discharge data. Paediatric and Perinatal Epidemiology 20(4):

15 Evaluating linkage quality
i) Sensitivity analysis using different linkage criteria
ii) Subset of gold-standard data to quantify linkage bias
iii) Comparisons of linked and unlinked data
iv) Imputation for uncertain links
Figure labels: Highly sensitive; Highly specific

16 Imputation for linkage
Table (layout lost in extraction): records 1 to n in the primary file, their candidate records in the linking file, and the variable of interest (high, med or low). A record with an exact match takes the linked value; records with several candidates (for example, match weights of 10 and 1, or of 5, 4 and 3) are handled by prior-informed imputation.
Goldstein et al Stat Med 2012;31(28):
Harron et al BMC Med Res Method 2014;14(1):36

17 Implications for data providers
Evaluation methods:
i) Sensitivity analysis using different thresholds
ii) Subset of gold-standard data to quantify linkage bias
iii) Comparisons of linked and unlinked data
iv) Imputation for uncertain links
Requirements:
Availability of all candidate records (linked and unlinked)
Subset of data where true match status is known (gold-standard)
Harron et al Opening the black box of record linkage. J Epidemiol Commun H 66(12):1198

18 Summary
Data linkage is a powerful tool for enhancing administrative data
Linkage error has important effects on analyses
Results vary according to choice of thresholds and methods
Taking error into account is possible without releasing identifiable data
Communication between linkers and data users is vital

19 Acknowledgements and funding
Harvey Goldstein, Ruth Gilbert, Gareth Hagger-Johnson and Angie Wade, UCL Institute of Child Health
Berit Muller-Pebody, Public Health England
Roger Parslow, Tom Fleming, Lee Norman and the PICANet team, University of Leeds
This work was supported by funding from the National Institute for Health Research Health Technology Assessment (NIHR HTA) programme (project number 08/13/47). The views and opinions expressed therein are those of the authors and do not necessarily reflect those of the HTA programme, NIHR, NHS or the Department of Health. The authors state no conflicts of interest.

