THRio Database Linkage and THRio Database Issues.

1 THRio Database Linkage and THRio Database Issues

2 Database matching
- There are several systems that do not “talk” to each other
  - SINAN: reportable diseases (TB, AIDS)
  - SIM: mortality
  - SICOM: pharmaceutical database (ARVs)
  - THRio: our DB
- Original plan
  - Match THRio with the 3 other DBs above

3 Database matching
- Problems
  - There is no unique identifier common to all systems
  - We use name, gender, DOB and mother’s name as surrogates
  - The information is not uniform: many missing variables, especially mother’s name
- THRio
  - Standardization of name abbreviations
  - Double data entry
  - Not enough: names are misspelled
- The other databases: even worse
  - No QC

4 Database matching
- Proposed strategy
  - Compare different approaches
    - Translated SOUNDEX
    - Reclink (probabilistic linkage)
    - Other algorithms
  - Apply to different examples and get sensitivity/specificity for each one
- SICOM
  - Sequential matching
  - Match TB before doing the sequential
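One way to get the sensitivity and specificity mentioned in the strategy is to compare the pairs an approach links against a reference set of true pairs. The sketch below assumes each link is an (id_a, id_b) tuple and that the total number of compared pairs is known; the function name and arguments are hypothetical, not part of the THRio code.

```python
def linkage_metrics(predicted, truth, n_pairs):
    """Return (sensitivity, specificity) for a set of predicted links.

    predicted, truth: sets of (id_a, id_b) tuples (hypothetical layout).
    n_pairs: total number of record pairs that were compared.
    """
    tp = len(predicted & truth)   # links found that are real
    fp = len(predicted - truth)   # links found that are not real
    fn = len(truth - predicted)   # real links that were missed
    tn = n_pairs - tp - fp - fn   # pairs correctly left unlinked
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Because true non-matches vastly outnumber true matches in linkage, specificity tends to sit very close to 1, so sensitivity is usually the more informative number when comparing approaches.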

5 Database matching
- The project was split:
  - ARV database revisited
  - Development of a new algorithm for database linkage

6 Database matching
- ARV database revisited
  - Consistency problems (as pointed out before)
  - First HAART abstracted for THRio
  - Inconsistency confirmed
    - Dates did not match (40%)
    - Drugs did not match
  - Now all the ART history will be collected (since HAART only)
  - Should we insist and compare the database with the whole history?

7 Database matching
- Development of algorithm for database linkage
  - Using Python to implement the interface
  - Adapted soundex algorithm
  - “Gestalt” algorithm (rather hyperbolic)
  - Direct field comparisons
  - Including a hierarchical structure for searching and comparing records
    - Means taking advantage of differences in amount of information available
- Computational problems
  - Optimization
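Python's standard difflib module documents its matcher as related to Ratcliff and Obershelp's "gestalt pattern matching", a name the documentation itself calls hyperbolic. Assuming that is the “Gestalt” algorithm meant here, a name similarity score could be sketched as follows; the function name and the normalization step are illustrative assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity score between 0 and 1 for two names.

    Case and repeated whitespace are ignored before comparing
    (an assumption; the slides do not describe the normalization).
    """
    norm_a = " ".join(a.upper().split())
    norm_b = " ".join(b.upper().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()
```

The score is twice the number of matched characters divided by the combined length of both strings, so a single wrong letter in a short name costs more than the same error in a long one.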

8 Database matching
- Blocking
  - Speeds up computation
  - I’ll be concerned with records that are a little similar to begin with
  - Soundex
    - First and last names
    - Mother’s first and last names
    - First name and mother’s last name
  - Needed to expand to account for errors in the first letter of the first and last names
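The blocking step can be sketched as below. This uses the standard American Soundex rules rather than the translated or adapted variant from the slides, and the record layout (dicts with "name" and "mother" keys) is a hypothetical stand-in. The expansion for errors in the first letter is not implemented.

```python
import unicodedata
from collections import defaultdict

# Standard American Soundex digit groups (not the adapted variant).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character Soundex code, or "" if there are no letters."""
    # Strip accents (e.g. "João" -> "JOAO") and keep only A-Z.
    decomposed = unicodedata.normalize("NFKD", name.upper())
    letters = [c for c in decomposed if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset prev, so repeated codes count again
    return (code + "000")[:4]


def blocking_keys(record):
    """Yield the three Soundex block keys listed on the slide."""
    name = (record.get("name") or "").split()
    mother = (record.get("mother") or "").split()
    if name:
        yield ("N", soundex(name[0]), soundex(name[-1]))
    if mother:
        yield ("M", soundex(mother[0]), soundex(mother[-1]))
    if name and mother:
        yield ("X", soundex(name[0]), soundex(mother[-1]))


def candidate_pairs(left, right):
    """Return index pairs (i, j) whose records share at least one block key."""
    index = defaultdict(set)
    for j, rec in enumerate(right):
        for key in blocking_keys(rec):
            index[key].add(j)
    pairs = set()
    for i, rec in enumerate(left):
        for key in blocking_keys(rec):
            for j in index.get(key, ()):
                pairs.add((i, j))
    return pairs
```

Only the pairs returned by candidate_pairs would go on to the full comparison, which is where the speed-up comes from. Records with a missing mother's name still get a key from their own name.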

9 Database matching
- Full comparison
  - All fields exactly the same
  - Small error in DOB
  - Similar names (gestalt): generates scores
  - A combination of the above
- Several “levels” created
- Have to choose 2 cutoffs
  - Not a match
  - Definitely a match
  - Have to manually decide
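The two cutoffs split the score range into three zones. A minimal sketch, assuming scores between 0 and 1 and purely illustrative cutoff values (the slides do not give the actual ones):

```python
def classify(score, lower=0.70, upper=0.92):
    """Assign a pair to one of three zones using two cutoffs.

    lower and upper are hypothetical values chosen for illustration.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("cutoffs must satisfy 0 <= lower <= upper <= 1")
    if score >= upper:
        return "match"
    if score < lower:
        return "non-match"
    return "manual review"


def triage(scored_pairs, lower=0.70, upper=0.92):
    """Group (pair, score) tuples by the zone their score falls in."""
    zones = {"match": [], "manual review": [], "non-match": []}
    for pair, score in scored_pairs:
        zones[classify(score, lower, upper)].append(pair)
    return zones
```

Moving the cutoffs closer together reduces the manual workload at the cost of more automatic errors, so the choice depends on how much manual checking is affordable.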

10 Database matching
- Computational problems (testing phase)
  - Using PostgreSQL and Python
  - Too slow when matching with the TB database
    - > 100,000 records
  - Changed the algorithm to Python only
- Computational times (currently)
  - THRio x SIM (12,689 x 2,922): 3-4 minutes
  - THRio x TB (12,689 x 102,919): 100-105 minutes

11 Database matching
- Results
  - First we chose a sample of the mortality database
    - Year 2005
    - AIDS only
    - 871 records
  - Matched with THRio database
    - 10,344 records at the time

12 Database matching
- Compared Manual x Reclink x Algorithm
  - We were going to use the manual linkage as the gold standard
  - The algorithm found 13 additional correct matches
  - We used the combination of those as the standard

13 Database matching

14 The algorithm outperformed both RecLink and manual check
- But after some adjustments
- That was just the “training phase”
- The only mistake still needs checking: it may actually be a twin brother
  - Full info and only one different letter in the first name
- We still have to test it again with a different sample and with TB

15 Database matching
- THRio (latest) x SIM (2003-2005)
  - 340 matches (total)
  - Only 79 (23%) to be manually checked
    - This means that both DBs have good quality, at least in terms of completeness
  - Ended up with 273 matches and one possible mistake
- When we actually implement it…
  - Extra check with date of last annotation in the chart

16 Database matching
- Challenge
  - TB database
    - Data quality is much poorer than SIM
    - Might lead to lower sensitivity
    - Will lead to much more manual checking
  - Development of interface to help work

17 Database matching
- THRio (latest) x TB (1995-2005)
  - 6,453 matches (total)
  - 3,870 (60%) to be manually checked
  - 721 (11%) with names only
- Quality is much worse than SIM
  - Many duplicates
- Proposed solutions:
  - Reduce time frame (for prospective TB cases only)
  - Use date of TB diagnosis to exclude duplicates
  - GUI to help

18 Database matching
- Further discussion for mortality:
  - What database to use?
  - All causes x HIV-AIDS as a basic cause
    - Patients may be dying of other causes
  - Municipality x State
    - Patients may live in other cities
    - Municipality just records deaths that occurred in the city

19 Data analysis issues

20 Complex structure
- Currently 17 tables with information
- Dates are not date fields
  - We need dates!!!
- We don’t collect information about specific visits
  - It is the information since last annotation up to the current one, which could mean multiple visits
- Definitions are hard to make

21 Data analysis issues
- All the events have to be based on dates
- Partial missing dates
  - In general I’ll accept missing days, set to the 15th
- What to use as a surrogate?
  - For data collected under the study: date of last annotation
  - What about baseline data?
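The rule for partial dates (accept a missing day and set it to the 15th) could look like the sketch below. It assumes dates are stored as text in "YYYY-MM-DD" or "YYYY-MM" form; the actual storage format in THRio is not stated, and the function name is hypothetical.

```python
from datetime import date


def parse_partial_date(text, default_day=15):
    """Parse "YYYY-MM-DD" or "YYYY-MM" text into a date.

    A missing day is replaced by default_day (15, mid-month).
    Returns None when the year or month is missing or the value is invalid.
    """
    parts = [p for p in text.strip().split("-") if p]
    try:
        if len(parts) == 3:
            year, month, day = (int(p) for p in parts)
        elif len(parts) == 2:
            year, month = (int(p) for p in parts)
            day = default_day
        else:
            return None
        return date(year, month, day)
    except ValueError:
        return None
```

Mid-month imputation bounds the error at roughly two weeks in either direction, which is why a missing day is tolerable while a missing month is not.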

22 Data analysis issues
- Definition of baseline data
  - Study begins on September 1st, 2005
  - Baseline data collection finished in June 2006
  - “Baseline form” doesn’t mean baseline information
  - Is it baseline for the study or for the patient?
  - What about new patients? Do they have baseline data?

23 Data analysis issues
- Definition of a new patient
  - We have two “candidate” dates
    - Date of enrollment in the clinic
      - Could be long before HIV diagnosis
    - Date of HIV diagnosis
      - Could be long before enrollment in that clinic
  - A “new” patient is not necessarily new, depending on what we want
    - Do we need newly diagnosed or newly enrolled?
    - Should we use both?

24 Data analysis issues
- Several possible outcomes
  - Primary outcome of study (TB)
  - Secondary outcome (death)
  - Operational outcomes
    - Waiting for PPD
    - PPD placed and read
    - Reactive PPD
    - INH started
- How to deal with all of these?

25 Data analysis issues
- General output for data analysis
  - For each patient, look for baseline status
    - As of Sept 2005 or at enrollment
  - Look for all changes in time
    - Need the dates!!!
  - Set up like a database for survival analysis
    - For every change repeat records with
      - Initial status
      - Initial date
      - Final status
      - Final date
  - Possible to customize for specific outcomes
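The survival-analysis layout described above, one record per change with initial and final status and date, can be sketched as follows. The input format (a list of (date, status) tuples per patient) and all names are assumptions made for illustration.

```python
from datetime import date
from typing import NamedTuple


class Interval(NamedTuple):
    patient_id: str
    initial_status: str
    initial_date: date
    final_status: str
    final_date: date


def build_intervals(patient_id, events):
    """Turn dated status observations into one row per status change.

    events: iterable of (date, status) tuples, in any order.
    Repeated observations of the same status do not create a row.
    """
    ordered = sorted(events)
    rows = []
    if not ordered:
        return rows
    start_date, status = ordered[0]
    for when, new_status in ordered[1:]:
        if new_status != status:
            rows.append(Interval(patient_id, status, start_date, new_status, when))
            start_date, status = when, new_status
    return rows
```

The last status a patient is observed in produces no row here; an analysis would normally add a final censored interval ending at the date of last annotation.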

26 Thank you!

