Record Linkage Simulation
Biolink Meeting, June 3 2013
Adelaide Ariel
2 Overview
- Introduction
- Our approach
- Simulation
- Result
- Conclusion
3 Introduction
Factors that influence the performance of record linkage:
- The number of identifiers used as linkage variables
  - In general, the more identifiers used, the better.
- The discriminative power of the identifiers
  - Identifiers should have high discriminative power.
  - High discriminative power can be a threat to privacy.
- The quality of the identifiers
  - All linked datasets should have the same high level of identifier quality.
  - Superior quality in one dataset cannot compensate for lower quality in the other dataset.
- The size of the population in the corresponding datasets
  - The bigger the population, the more likely false positives become.
4 Our approach
Main assumption: the effect of errors in a record can be ‘reduced’ by using only a subset of the identifiers.
The benefits are twofold:
- Data management: able to identify minimum requirements for a dataset
  - It is handy for data owners to know which identifiers should have at least a certain quality level.
  - A checklist can be developed to assess whether a new dataset meets the requirements for linking to existing datasets.
- Efficiency: recognize which identifiers should be used to get an acceptable level of correct links
  - Current practice in deterministic record linkage is to reduce the number of identifiers to obtain more links.
  - This can lead to linkages of lower quality.
5 Goal of simulation study
The goals:
- To evaluate which linkage keys provide an acceptable level of correct links
- To observe which linkage keys produce similar results
- To assess the extent to which the probabilistic method outperforms the deterministic method (with careful interpretation)
- To examine which subpopulation groups are affected by the linkage (relevant only when real datasets are linked)
6 Simulation study
Some considerations for the simulation:
- Data size and population covered
  - The datasets in the Biolink project vary from 500 to 8M records
  - Average: less than 10,000
  - Dominated by cohorts (thus similar age, or same sex)
- Identifiers may contain errors
  - Simple errors (typographical errors)
  - Complex errors (determined by the value of the identifiers)
- Methods used
  - The deterministic and probabilistic methods are designed to cope with certain situations and hence should be compared carefully
7 Simulation data development
The following simulation datasets are used to represent registry and biobank populations:
- Dataset reflecting a general population (e.g., a broad spread of age, sex, postcode, ethnicities)
  - We use information on the Dutch population obtained from the Statistics Netherlands site (Statline).
  - Size: 160,000 records
- Dataset reflecting a specific population (e.g., a short age interval, little variation in ethnicity)
  - We use information on the Dutch cancer population obtained from the NKR site.
  - Size: 16,000 records
- Dataset reflecting a very specific population (e.g., almost homogeneous in ethnicity, a single age group, same sex)
  - We use information on a female cohort of NKI.
  - Size: 1,600 records
8 Simulation data errors
Errors are added to replace the correct values of the identifiers surname, date of birth, and postcode. Both random and systematic errors are introduced:
- Random errors: occur in any record regardless of the identifier’s value
  - Insert, delete, substitute, or swap characters
- Systematic errors: occur in certain records depending on the identifier’s value
  - Foreigners are more likely to be assigned a generic date of birth
  - Females can use their partner’s name
  - Young people are more likely to change address
  - Urban people are more likely to move within the neighborhood
We need such information from, e.g., Palga to create a registry with a specific population, and from NKI for a registry with a very specific population (e.g., a breast cancer cohort).
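The random error types listed on this slide can be illustrated with a short sketch. The study itself created its errors in R; the Python below is only an illustration, and the function names `add_random_error` and `corrupt`, the lowercase alphabet, and the "one error per affected value" rule are all assumptions rather than the authors' procedure. Systematic errors are not covered here.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: errors draw from lowercase letters

def add_random_error(value, rng):
    """Return `value` with one random typographical error applied."""
    if len(value) < 2:
        return value  # too short to edit safely; leave unchanged
    op = rng.choice(["insert", "delete", "substitute", "swap"])
    if op == "swap":
        # Swap two adjacent characters (a no-op if they happen to be equal).
        i = rng.randrange(len(value) - 1)
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]
    i = rng.randrange(len(value))
    if op == "insert":
        return value[:i] + rng.choice(ALPHABET) + value[i:]
    if op == "delete":
        return value[:i] + value[i + 1:]
    # Substitute: pick a replacement that differs from the current character.
    replacement = rng.choice([c for c in ALPHABET if c != value[i].lower()])
    return value[:i] + replacement + value[i + 1:]

def corrupt(values, error_rate, rng):
    """Apply a random error to each value with probability `error_rate`."""
    return [add_random_error(v, rng) if rng.random() < error_rate else v
            for v in values]

# Example: corrupt roughly 20% of a list of surnames, reproducibly.
surnames = ["jansen", "devries", "bakker", "visser", "smit"]
noisy = corrupt(surnames, 0.20, random.Random(42))
```

Passing an explicit `random.Random` instance keeps each simulated dataset reproducible, which matters when the same error level has to be regenerated for several linkage keys.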
9 Linking methods and evaluation
Candidates for linkage keys:
- Linkage key according to the ‘rule’ [*]
- Linkage keys currently applied [**]
- Other linkage keys [***]
Linkage keys chosen for evaluation:
- Baseline: all identifiers (surname, dob, sex, postcode)
- Linkage key 1: surname4, dob, sex, postcode4 [**]
- Linkage key 2: surname4, dob, postcode [***]
- Linkage key 3: surname4, dob, sex [*]
- Linkage key 4: surname4, sex, postcode [***]
- Linkage key 5: surname4, sex, postcode4 [***]
- Linkage key 6: dob, sex, postcode [**]
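As an illustration of how such keys drive deterministic linkage, the sketch below builds each key as a tuple and links two record lists on exact key agreement. The slides do not define `surname4` or `postcode4`; the code assumes they mean the first four characters of the surname and the four-digit part of a Dutch postcode. The record layout (dictionaries with `id`, `surname`, `dob`, `sex`, `postcode`) and the function names are hypothetical, and the study itself performed the linkage in SAS.

```python
def surname4(surname):
    # Assumption: "surname4" is the first four characters of the surname.
    return surname[:4].lower()

def postcode4(postcode):
    # Assumption: "postcode4" is the four-digit part of a Dutch postcode.
    return postcode.replace(" ", "")[:4]

# Each linkage key maps a record to a tuple of identifier values.
LINKAGE_KEYS = {
    "baseline": lambda r: (r["surname"].lower(), r["dob"], r["sex"], r["postcode"]),
    "key1": lambda r: (surname4(r["surname"]), r["dob"], r["sex"], postcode4(r["postcode"])),
    "key2": lambda r: (surname4(r["surname"]), r["dob"], r["postcode"]),
    "key3": lambda r: (surname4(r["surname"]), r["dob"], r["sex"]),
    "key4": lambda r: (surname4(r["surname"]), r["sex"], r["postcode"]),
    "key5": lambda r: (surname4(r["surname"]), r["sex"], postcode4(r["postcode"])),
    "key6": lambda r: (r["dob"], r["sex"], r["postcode"]),
}

def deterministic_link(records_a, records_b, key):
    """Return (id_a, id_b) pairs whose linkage key values agree exactly."""
    index = {}
    for rec in records_b:
        index.setdefault(key(rec), []).append(rec["id"])
    return [(rec["id"], id_b)
            for rec in records_a
            for id_b in index.get(key(rec), [])]

# Two records for the same person, with errors in surname and postcode.
a = [{"id": 1, "surname": "Jansen", "dob": "1970-01-31", "sex": "F", "postcode": "1234 AB"}]
b = [{"id": 10, "surname": "Jansenn", "dob": "1970-01-31", "sex": "F", "postcode": "1234 AC"}]

links_baseline = deterministic_link(a, b, LINKAGE_KEYS["baseline"])  # no link
links_key1 = deterministic_link(a, b, LINKAGE_KEYS["key1"])          # one link
```

In this toy example the baseline misses the pair because of the errors at the end of the surname and postcode, while linkage key 1 recovers it, since truncation discards the corrupted characters.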
10 Linking methods and evaluation
Simulate a series of record linkages under the following conditions:
- Different degrees of overlap (10%, 60%, 90%)
- Various error levels (10%, 20%, 30%)
- Total: 40 datasets
Evaluation criteria:
- A = true positives obtained / total true links (sensitivity)
- B = true positives obtained / total links obtained (precision)
- For ease of comparison we use C = (A + B)/2
- Maximum A = maximum B = 1, hence maximum C = 1
- The higher the C, the better.
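The three criteria can be computed directly from the set of links found and the set of true links. The sketch below follows the definitions of A, B, and C on this slide; the function name `evaluate`, the representation of links as `(id_a, id_b)` pairs, and the choice to return 0 when a denominator is empty are assumptions.

```python
def evaluate(found_links, true_links):
    """Return (A, B, C): sensitivity, precision, and their mean."""
    found, truth = set(found_links), set(true_links)
    true_positives = len(found & truth)
    # Assumption: score 0.0 when there are no true links or no links found.
    a = true_positives / len(truth) if truth else 0.0
    b = true_positives / len(found) if found else 0.0
    return a, b, (a + b) / 2

# Four true links; the linkage finds three pairs, two of them correct.
truth = [(1, 10), (2, 20), (3, 30), (4, 40)]
found = [(1, 10), (2, 20), (5, 50)]
a, b, c = evaluate(found, truth)
# A = 2/4, B = 2/3, C = (A + B)/2 = 7/12
```

Because C is a plain average, a key with high sensitivity and low precision can score the same as one with the opposite profile, so A and B are worth reporting alongside C.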
11 Linking methods and evaluation
Software used:
- R to create the simulation datasets and errors
- SAS 9.2 to link the datasets
14 Conclusions (tentative)
We observed the following indications:
- Linkage key surname4, DOB, and sex gives the best result
- Linkage key DOB, sex, and postcode gives a similar result
- The probabilistic method performs up to 5% better than the deterministic method when:
  - More identifiers were used as a linkage key.
  - The population groups in the datasets were more similar.
- Probabilistic linkage (all identifiers) can be used to validate the deterministic method.
- These indications need to be verified on real datasets.