Record Linkage Simulation. Biolink Meeting, June 3 2013. Adelaide Ariel.

Record Linkage Simulation. Biolink Meeting, June 3 2013. Adelaide Ariel.

2 Overview
- Introduction
- Our approach
- Simulation
- Results
- Conclusions

3 Introduction
Factors influencing the performance of record linkage:
- The number of identifiers used as linkage variables: in general, the more identifiers used, the better.
- The discriminative power of the identifiers: identifiers should have high discriminative power, although this can be a threat to privacy.
- The quality of the identifiers: all linked datasets should have the same high quality level of identifiers; superior quality in one dataset cannot compensate for lower quality in another.
- The size of the population in the corresponding datasets: the bigger the population, the more likely false positives become.

4 Our approach
Main assumption: the effect of errors in one record can be reduced by using only a subset of identifiers.
The benefits are twofold:
- Data management: identify minimum requirements for a dataset. It is useful for data owners to know which identifiers should have at least a certain quality level, and a checklist can be developed to assess whether a new dataset meets the requirements to be linked to existing datasets.
- Efficiency: recognize which identifiers should be used to obtain an acceptable level of correct links. Current practice in deterministic record linkage is to relax the number of identifiers in order to obtain more links, which can lead to linkages of lower quality.

5 Goals of the simulation study
The goals:
- To evaluate which linkage keys provide an acceptable level of correct links.
- To observe which linkage keys produce similar results.
- To assess the extent to which the probabilistic method outperforms the deterministic method (with careful interpretation).
- To examine which subpopulation groups are affected by the linkage (relevant only when real datasets are linked).

6 Simulation study
Some considerations for the simulation:
- Data size and population covered: the datasets in the Biolink project vary from 500 to 8M records (average: fewer than 10,000) and are dominated by cohorts (thus similar age, or same sex).
- Identifiers may contain errors: simple errors (typographical) and complex errors (determined by the value of the identifiers).
- Methods used: deterministic and probabilistic methods are each designed to cope with certain situations and hence should be compared carefully.

7 Simulation data development
The following simulation datasets are used to represent registry and biobank populations:
- A dataset reflecting the general population (e.g., a broad spread of age, sex, postcode, and ethnicity). We use information on the Dutch population obtained from the Statistics Netherlands site (Statline). Size: 160,000 records.
- A dataset reflecting a specific population (e.g., a short age interval, little variation in ethnicity). We use information on the Dutch cancer population obtained from the NKR site. Size: 16,000 records.
- A dataset reflecting a very specific population (e.g., nearly homogeneous in ethnicity, a certain age group, same sex). We use information on a female cohort of the NKI. Size: 1,600 records.
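As a sketch, a toy version of such a synthetic registry could be generated as follows. The surname list, value ranges, and uniform draws are illustrative assumptions only; the study's datasets mirrored the Statline/NKR/NKI distributions, which are not reproduced here.

```python
import random

def make_population(n, seed=0, sexes=("M", "F")):
    # Toy synthetic registry with the four identifiers used in the study:
    # surname, date of birth, sex, postcode. Value distributions are
    # uniform placeholders, not the real Statline/NKR distributions.
    rng = random.Random(seed)
    surnames = ["JANSEN", "DE VRIES", "BAKKER", "VISSER", "SMIT"]
    records = []
    for i in range(n):
        records.append({
            "id": i,
            "surname": rng.choice(surnames),
            "dob": "%04d-%02d-%02d" % (rng.randint(1920, 1999),
                                       rng.randint(1, 12),
                                       rng.randint(1, 28)),
            "sex": rng.choice(sexes),
            "postcode": "%04d%s" % (rng.randint(1000, 9999),
                                    "".join(rng.choice("ABCDEF")
                                            for _ in range(2))),
        })
    return records

# A 'very specific' population: a small all-female cohort.
cohort = make_population(1600, seed=1, sexes=("F",))
```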

8 Simulation data errors
Errors are added to replace the correct values of the identifiers surname, date of birth, and postcode. Both random and systematic errors are introduced:
- Random errors occur in any record, regardless of the identifier values: insert, delete, substitute, or swap characters.
- Systematic errors occur in certain records, depending on the identifier values: foreigners are more likely to be assigned a generic date of birth; females may use their partner's name; young people are more likely to change address; urban people are more likely to move within the neighbourhood area.
We need such information from, e.g., Palga to create a registry with a specific population, and from the NKI for a registry with a very specific population (e.g., a breast cancer cohort).
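A minimal sketch of the random-error step (insert, delete, substitute, swap) might look like the following. The helper names and the per-record corruption scheme are illustrative assumptions; the systematic rules above would additionally require the auxiliary Palga/NKI information and are not modelled here.

```python
import random

def perturb(value, rng, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    # Apply one random typographical error: insert, delete,
    # substitute, or swap two adjacent characters.
    chars = list(value)
    op = rng.choice(["insert", "delete", "substitute", "swap"])
    pos = rng.randrange(len(chars))
    if op == "insert":
        chars.insert(pos, rng.choice(alphabet))
    elif op == "delete" and len(chars) > 1:
        del chars[pos]
    elif op == "substitute":
        chars[pos] = rng.choice(alphabet)
    elif op == "swap" and len(chars) > 1:
        pos = min(pos, len(chars) - 2)
        chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)

def add_random_errors(records, error_rate, field="surname", seed=0):
    # Corrupt `field` in roughly `error_rate` of the records;
    # remaining records are copied unchanged.
    rng = random.Random(seed)
    out = []
    for rec in records:
        rec = dict(rec)
        if rng.random() < error_rate:
            rec[field] = perturb(rec[field], rng)
        out.append(rec)
    return out
```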

9 Linking methods and evaluation
Candidates for linkage keys:
- Linkage key according to the 'rule' [*]
- Linkage keys currently applied [**]
- Other linkage keys [***]
Linkage keys chosen for evaluation:
- Baseline: all identifiers (surname, dob, sex, postcode)
- Linkage key 1: surname4, dob, sex, postcode4 [**]
- Linkage key 2: surname4, dob, postcode [***]
- Linkage key 3: surname4, dob, sex [*]
- Linkage key 4: surname4, sex, postcode [***]
- Linkage key 5: surname4, sex, postcode4 [***]
- Linkage key 6: dob, sex, postcode [**]
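The truncated-key convention (surname4 and postcode4 denoting the first four characters of the identifier) and a deterministic exact-match linkage on such keys can be sketched as follows. The helper functions and field names are an illustration, not the SAS code actually used in the study.

```python
def linkage_key(rec, fields):
    # Build a deterministic linkage key; a trailing '4' in a field name
    # (e.g. 'surname4', 'postcode4') means 'first four characters'.
    parts = []
    for f in fields:
        if f.endswith("4"):
            parts.append(rec[f[:-1]][:4])
        else:
            parts.append(rec[f])
    return "|".join(parts)

def deterministic_link(file_a, file_b, fields):
    # Link two files on exact agreement of the chosen key;
    # returns the candidate record pairs.
    index = {}
    for rec in file_b:
        index.setdefault(linkage_key(rec, fields), []).append(rec)
    return [(ra, rb) for ra in file_a
            for rb in index.get(linkage_key(ra, fields), [])]
```

For example, under linkage key 3 (surname4, dob, sex), a record with surname "JANSSEN" contributes the key "JANS|…|…", so it would still match a counterpart misspelled as "JANSSON", while the full-surname baseline would not.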

10 Linking methods and evaluation
Simulate a series of record linkages under the following conditions:
- Different overlap levels (10%, 60%, 90%)
- Various error levels (10%, 20%, 30%)
- Total: 40 datasets
Evaluation criteria:
- A = true positives obtained / total true links (sensitivity)
- B = true positives obtained / total links obtained (precision)
- For ease of comparison we use C = (A + B) / 2.
- Maximum A = maximum B = 1, hence maximum C = 1: the higher the C, the better.
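With the true pairs known from the simulation, the two criteria and their combination are straightforward to compute. A sketch, assuming links are represented as (record-id, record-id) pairs:

```python
def evaluate(links_obtained, true_links):
    # A = sensitivity, B = precision, C = (A + B) / 2.
    links, truth = set(links_obtained), set(true_links)
    tp = len(links & truth)                  # true positives obtained
    A = tp / len(truth)                      # TP / total true links
    B = tp / len(links) if links else 0.0    # TP / total links obtained
    return A, B, (A + B) / 2
```

For instance, if 3 links are obtained of which 2 are among 3 true pairs, then A = B = 2/3 and C = 2/3.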

11 Linking methods and evaluation
Software used:
- R to create the simulation datasets and errors
- SAS 9.2 to link the datasets

12 Evaluation results

13 Evaluation results

14 Conclusions (tentative)
We observed the following indications:
- Linkage key surname4, dob, sex gives the best result.
- Linkage key dob, sex, postcode gives a nearly similar result.
- The probabilistic method performs up to 5% better than the deterministic method when more identifiers were used as a linkage key and when the population groups in the datasets were more similar.
- The probabilistic method (all identifiers) can be used to validate the deterministic method.
These indications still need to be verified on real datasets.