SQL for Cleaning Data Farrokh Alemi, Ph.D.


1 SQL for Cleaning Data Farrokh Alemi, Ph.D.
This section presents SQL code for cleaning data. This brief presentation was organized by Dr. Alemi.

2 Cleaning Data
These slides provide a number of SQL code snippets for removing data that do not make sense.

3 Cleaning Data Not All of Code is Shown
These code snippets are not necessarily connected to each other. You should use these code snippets to understand the ideas for how data can be cleaned and then use what works with your data.

4 Remove Duplicated Data We begin with code for removing duplicate data.

5
    DROP TABLE #NoDuplicate
    SELECT id, icd9, AgeAtDx
    INTO #NoDuplicate
    FROM dbo.final
    GROUP BY id, icd9, AgeAtDx
In this code snippet we remove duplicate entries for the same person and the same diagnosis reported at exactly the same age. This is done by the GROUP BY command. If there are multiple entries for a diagnosis of a patient at a certain age, then only one will be kept.
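To see the effect of the GROUP BY on a small case, here is a minimal sketch. The table #final_demo and its rows are hypothetical stand-ins for dbo.final, the Copies column is an addition that is not in the slides, and DROP TABLE IF EXISTS assumes SQL Server 2016 or later.

```sql
-- Hypothetical sample data standing in for dbo.final
DROP TABLE IF EXISTS #final_demo;
CREATE TABLE #final_demo (id INT, icd9 VARCHAR(10), AgeAtDx DECIMAL(5,1));
INSERT INTO #final_demo (id, icd9, AgeAtDx) VALUES
    (1, '250.00', 54.2),
    (1, '250.00', 54.2),   -- exact duplicate of the row above
    (1, '250.00', 55.0),   -- same diagnosis at a different age: kept
    (2, '401.9',  61.7);

-- GROUP BY collapses identical (id, icd9, AgeAtDx) combinations into one row.
-- COUNT(*) records how many copies each combination had before cleaning.
DROP TABLE IF EXISTS #NoDuplicate;
SELECT id, icd9, AgeAtDx, COUNT(*) AS Copies
INTO #NoDuplicate
FROM #final_demo
GROUP BY id, icd9, AgeAtDx;

SELECT id, icd9, AgeAtDx, Copies
FROM #NoDuplicate
ORDER BY id, AgeAtDx;
```

With these four input rows the cleaned table holds three rows, and the Copies column shows that only the first combination was duplicated. Keeping the count is optional, but it gives a quick view of how much duplication the data contain.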

6 Remove Zombies & Un-borns
Data often contain an erroneously entered date of death, so that the patient appears to have died before some visits. Occasionally this makes sense, because some health care services are provided shortly after the patient’s death; an autopsy, for example, occurs after death. Repeated outpatient or inpatient visits after a date of death, however, suggest a problem with the date of death. In this snippet we show you how to remove patients whose date of death precedes one or more visits. The same problem occurs with date of birth: if the patient is reported to have a visit before being born, then the date of birth is probably wrong. This snippet removes patients with such errors in date of death or date of birth.

7 HAVING (Min(AgeAtDeath)>=Max(AgeAtDx) OR Min(AgeAtDeath) is null )
    DROP TABLE #nonZ
    SELECT Id
    INTO #nonZ
    FROM dbo.data
    GROUP BY ID
    HAVING (Min(AgeAtDeath) >= Max(AgeAtDx) OR Min(AgeAtDeath) is null)
       AND Min(AgeAtDx) > 0
In this code snippet we keep only patients with reasonable dates of birth and death.

8 HAVING (Min(AgeAtDeath)>=Max(AgeAtDx) OR Min(AgeAtDeath) is null )
Date of death is checked by requiring that the patient’s minimum reported age at death be greater than or equal to the maximum age at diagnosis. In addition, patients who have a null value for age at death, meaning they have not died, are included in the cleaned data file.

9 HAVING (Min(AgeAtDeath)>=Max(AgeAtDx) OR Min(AgeAtDeath) is null )
The date of birth is checked by the condition Min(AgeAtDx)>0, which requires the minimum age at diagnosis to be larger than zero. You may want to set this minimum higher if you want to include only adults in your cleaned data.

10 HAVING (Min(AgeAtDeath)>=Max(AgeAtDx) OR Min(AgeAtDeath) is null )
Note also that the entire code selects only the IDs of patients. These are the patients whose data are OK. To get the rest of the data for these patients, that is, the variables besides ID, the temporary file must be merged with the original data file.
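The merge itself is not shown in the slides. One way it might look is sketched below, assuming an inner join on ID. The table #data_demo, its columns VisitId and icd9, and its rows are hypothetical stand-ins for dbo.data, and DROP TABLE IF EXISTS assumes SQL Server 2016 or later.

```sql
-- Hypothetical sample data standing in for dbo.data
DROP TABLE IF EXISTS #data_demo;
CREATE TABLE #data_demo (
    ID INT, VisitId INT, icd9 VARCHAR(10),
    AgeAtDx DECIMAL(5,1), AgeAtDeath DECIMAL(5,1)
);
INSERT INTO #data_demo (ID, VisitId, icd9, AgeAtDx, AgeAtDeath) VALUES
    (1, 101, '250.00', 54.2, NULL),   -- alive, plausible ages
    (1, 102, '401.9',  55.0, NULL),
    (2, 201, '428.0',  68.0, 70.0),
    (2, 202, '428.0',  72.0, 70.0),   -- visit after death
    (3, 301, '486',    -1.0, NULL),   -- visit before birth
    (3, 302, '486',    40.0, NULL);

-- Same patient-level filter as in the slides, run on the sample data
DROP TABLE IF EXISTS #nonZ;
SELECT ID
INTO #nonZ
FROM #data_demo
GROUP BY ID
HAVING (MIN(AgeAtDeath) >= MAX(AgeAtDx) OR MIN(AgeAtDeath) IS NULL)
   AND MIN(AgeAtDx) > 0;

-- Merge: keep every column of the original rows for the retained patients.
-- #nonZ has one row per ID, so the join does not multiply rows.
SELECT d.ID, d.VisitId, d.icd9, d.AgeAtDx, d.AgeAtDeath
FROM #data_demo AS d
INNER JOIN #nonZ AS k
    ON k.ID = d.ID;
```

In this sample only patient 1 survives the filter, so the final query returns that patient's two visits with all of their fields.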

11 WHERE AgeAtDeath >= AgeAtDx or ageatdx>0 GROUP BY ID
    DROP TABLE #nonZ
    SELECT Id
    INTO #nonZ
    FROM dbo.data
    WHERE AgeAtDeath >= AgeAtDx or ageatdx>0
    GROUP BY ID
    HAVING (Min(AgeAtDeath) >= Max(AgeAtDx) OR Min(AgeAtDeath) is null)
       AND Min(AgeAtDx) > 0
Note that if we had used a WHERE command instead of the HAVING command, we would have deleted the individual visits that have unreasonable dates but not eliminated the entire record of the patient. Since problems with date of death and date of birth affect the age at the time of every visit, the entire record should be eliminated.
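The difference between the two commands can be seen by running both filters on the same rows. This is a sketch only: #visits_demo and its rows are hypothetical, and the row-level WHERE condition is one possible analogue of the HAVING condition, written for this illustration rather than taken from the slides.

```sql
-- Hypothetical sample data: patient 2 has one visit recorded after death
DROP TABLE IF EXISTS #visits_demo;
CREATE TABLE #visits_demo (
    ID INT, VisitId INT, AgeAtDx DECIMAL(5,1), AgeAtDeath DECIMAL(5,1)
);
INSERT INTO #visits_demo (ID, VisitId, AgeAtDx, AgeAtDeath) VALUES
    (1, 101, 54.2, NULL),
    (2, 201, 68.0, 70.0),
    (2, 202, 72.0, 70.0);   -- visit after death

-- Row-level filter (assumed analogue): WHERE tests each visit on its own.
-- Only visit 202 is dropped; patient 2 stays in the data with visit 201.
SELECT ID, VisitId
FROM #visits_demo
WHERE (AgeAtDeath >= AgeAtDx OR AgeAtDeath IS NULL)
  AND AgeAtDx > 0;

-- Patient-level filter: HAVING tests the aggregates of all visits per patient.
-- Patient 2 fails the test and is removed entirely; only patient 1 remains.
SELECT ID
FROM #visits_demo
GROUP BY ID
HAVING (MIN(AgeAtDeath) >= MAX(AgeAtDx) OR MIN(AgeAtDeath) IS NULL)
   AND MIN(AgeAtDx) > 0;
```

The first query still returns visit 201 for patient 2, even though that patient's date of death is suspect. The second query returns only patient 1.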

12 Remove Visits with Unusual Dates
This snippet of code removes visits with an unusual date of visit, meaning a date that is missing or out of range. Out-of-range analysis should be done on all variables and not just dates.

13 WHERE AgeAtDx between 18 and 105 and AgeAtDx is not null
    DROP TABLE #InRange
    SELECT ID, VisitId
    INTO #InRange
    FROM dbo.data
    WHERE AgeAtDx between 18 and 105
      and AgeAtDx is not null

14 WHERE AgeAtDx between 18 and 105 and AgeAtDx is not null
Here we check if the age at diagnosis is between 18 and 105. The BETWEEN operator is useful for checking values that should be in a particular range; both endpoints are included in the range.

15 WHERE AgeAtDx between 18 and 105 and AgeAtDx is not null
We are also excluding visits where the age at diagnosis is missing. A null value never satisfies BETWEEN, so the explicit "is not null" test is redundant, but it makes the intent clear.

16 WHERE AgeAtDx between 18 and 105 and AgeAtDx is not null
If missing values are not occurring at random, then they must be kept and treated as a separate binary variable. To check whether missing values are occurring at random, assign 1 to every missing value and 0 to all other values. Then check whether this binary variable is related to the outcome.
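A rough first check along these lines can be done in SQL by comparing the outcome rate between rows with and without the missing value. In this sketch the table #missing_demo, its rows, and the 0/1 outcome column Died are all hypothetical, and DROP TABLE IF EXISTS assumes SQL Server 2016 or later.

```sql
-- Hypothetical sample data: Died is an assumed 0/1 outcome column
DROP TABLE IF EXISTS #missing_demo;
CREATE TABLE #missing_demo (ID INT, AgeAtDx DECIMAL(5,1), Died INT);
INSERT INTO #missing_demo (ID, AgeAtDx, Died) VALUES
    (1, 54.2, 0),
    (2, 61.7, 0),
    (3, 70.3, 1),
    (4, NULL, 1),
    (5, NULL, 1),
    (6, NULL, 0);

-- AgeMissing is 1 where the age at diagnosis is missing and 0 otherwise.
-- Comparing the outcome rate across the two groups shows whether
-- missingness appears to be related to the outcome.
SELECT
    CASE WHEN AgeAtDx IS NULL THEN 1 ELSE 0 END AS AgeMissing,
    COUNT(*)                                    AS Records,
    AVG(CAST(Died AS FLOAT))                    AS OutcomeRate
FROM #missing_demo
GROUP BY CASE WHEN AgeAtDx IS NULL THEN 1 ELSE 0 END;
```

A large difference in OutcomeRate between the two groups is a hint, not proof, that values are not missing at random; a formal statistical test on real data would be needed to confirm it.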

17 WHERE AgeAtDx between 18 and 105 and AgeAtDx is not null
In this snippet we keep only those visits, and their related diagnoses, that meet both criteria. Note that the elimination is not for the entire record of the patient but only for the portion of the record that has an out-of-range or missing value. This is why we need to keep both the ID of the patient and the ID of the visit.

18 WHERE AgeAtDx between 18 and 105 and AgeAtDx is not null
Even though we do not show it here, the two IDs should be used to merge the calculated temporary file with the original data so that all relevant fields, not just the IDs, are available for analysis.
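A minimal sketch of that merge follows, joining on both IDs. The table #visit_demo, its icd9 column, and its rows are hypothetical stand-ins for dbo.data, and the sketch assumes that each combination of ID and VisitId appears on only one row.

```sql
-- Hypothetical sample data standing in for dbo.data
DROP TABLE IF EXISTS #visit_demo;
CREATE TABLE #visit_demo (
    ID INT, VisitId INT, icd9 VARCHAR(10), AgeAtDx DECIMAL(5,1)
);
INSERT INTO #visit_demo (ID, VisitId, icd9, AgeAtDx) VALUES
    (1, 101, '250.00', 54.2),
    (1, 102, '401.9',  12.0),   -- out of range
    (2, 201, '428.0',  NULL);   -- missing

-- Same visit-level filter as in the slides, run on the sample data
DROP TABLE IF EXISTS #InRange;
SELECT ID, VisitId
INTO #InRange
FROM #visit_demo
WHERE AgeAtDx BETWEEN 18 AND 105
  AND AgeAtDx IS NOT NULL;

-- Merge on both keys so that only the retained visits come back,
-- now with all of their fields.
SELECT d.ID, d.VisitId, d.icd9, d.AgeAtDx
FROM #visit_demo AS d
INNER JOIN #InRange AS r
    ON  r.ID      = d.ID
    AND r.VisitId = d.VisitId;
```

Here only visit 101 is returned. If a visit can span several rows in the real data, for example one row per diagnosis, #InRange would contain repeated ID and VisitId pairs and the join would multiply rows; selecting distinct pairs into #InRange would avoid that.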

19 Remove Inconsistent Data
This snippet of code removes inconsistent data. Some combinations of data are unreasonable and may need to be removed.

20 WHERE Not (Gender = 'Male' and Pregnant = 'Yes');
    SELECT Id
    FROM dbo.data
    WHERE Not (Gender = 'Male' and Pregnant = 'Yes');
Here we are keeping only patients who are not pregnant males. Think through your data and remove all patients who have a combination of fields that should not occur.
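One point to watch with this kind of filter is how it treats missing values. Under SQL's three-valued logic, a male patient with a null value for Pregnant makes the condition unknown, and the row is dropped along with the truly inconsistent ones. The sketch below illustrates this; #gender_demo and its rows are hypothetical, and the ISNULL rewrite is one possible workaround, not part of the slides.

```sql
-- Hypothetical sample data
DROP TABLE IF EXISTS #gender_demo;
CREATE TABLE #gender_demo (Id INT, Gender VARCHAR(10), Pregnant VARCHAR(3));
INSERT INTO #gender_demo (Id, Gender, Pregnant) VALUES
    (1, 'Female', 'Yes'),
    (2, 'Male',   'No'),
    (3, 'Male',   'Yes'),   -- inconsistent: should be removed
    (4, 'Male',   NULL);    -- pregnancy status missing

-- Filter as in the slides: returns Ids 1 and 2.
-- Id 4 is dropped too, because NOT (TRUE AND UNKNOWN) is UNKNOWN.
SELECT Id
FROM #gender_demo
WHERE NOT (Gender = 'Male' AND Pregnant = 'Yes');

-- Variant that treats a missing value as "not inconsistent":
-- returns Ids 1, 2 and 4, removing only Id 3.
SELECT Id
FROM #gender_demo
WHERE NOT (ISNULL(Gender, '') = 'Male' AND ISNULL(Pregnant, '') = 'Yes');
```

Which behavior is right depends on the analysis. If missing values should be handled separately, the second form keeps them in the data so that the decision can be made explicitly.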

21 Remove Complications?
Sometimes the available data are reasonable but should be ignored in the context of the planned analysis. The data are correct and nothing is wrong with them, yet they should still be left out. If we are studying the impact of treatment on survival, statisticians require one to drop complications from multivariate models. In electronic health records, complications are diagnoses that occur after treatment. The same diagnosis is considered medical history if it occurs before treatment, a comorbidity if it occurs at the time of treatment, and a complication if it occurs after treatment. This requires us to drop some of the diagnoses and retain others.

22 [Diagram with nodes: Overweight, Infection, Antibiotic Treatment, Diabetes, Survival]
Here we see an example where, because the patient had an infection and was overweight, a large dose of antibiotic was given, which distorted the microbes in the patient’s gut, and the patient developed diabetes. If we keep diabetes in our multivariate models, then the effect of the antibiotic on survival will be distorted. In these situations, we want to keep comorbidities, i.e. overweight and infection, but not complications, which here is diabetes. Inside electronic health records, both complications and comorbidities could have the same International Classification of Diseases code. So, we have to examine whether the disease occurred before or after treatment to know if it is a complication, medical history, or a comorbidity. Complications occur after treatment.

23 WHERE AgeAtDX <= AgeAtTreatment GROUP BY ID, Diagnosis
    SELECT ID, Diagnosis
    FROM dbo.data
    WHERE AgeAtDX <= AgeAtTreatment
    GROUP BY ID, Diagnosis
This requires code that drops diagnoses that occur after treatment. In this code, we remove complications of treatment and retain all diagnoses that occur before or at the time of treatment. The WHERE command states that the age at diagnosis should be less than or equal to the age at the start of treatment. The GROUP BY command keeps one row for each combination of patient and diagnosis.

24 WHERE AgeAtDX <= AgeAtTreatment GROUP BY ID, Diagnosis;
One reason to remove complications from an analysis of the impact of treatment on outcome is that, at the time the treatment is selected, the complications are not yet known.
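Before dropping complications, it can help to label each diagnosis by its timing relative to treatment. The sketch below does this with a CASE expression, following the definitions given earlier: before treatment is medical history, at the time of treatment is comorbidity, after treatment is complication. The table #dx_demo, its rows, and the Timing column are hypothetical.

```sql
-- Hypothetical sample data for one patient treated at age 54.2
DROP TABLE IF EXISTS #dx_demo;
CREATE TABLE #dx_demo (
    ID INT, Diagnosis VARCHAR(20),
    AgeAtDx DECIMAL(5,1), AgeAtTreatment DECIMAL(5,1)
);
INSERT INTO #dx_demo (ID, Diagnosis, AgeAtDx, AgeAtTreatment) VALUES
    (1, 'Overweight', 50.0, 54.2),
    (1, 'Infection',  54.2, 54.2),
    (1, 'Diabetes',   55.1, 54.2);

-- Label each diagnosis by when it occurred relative to treatment.
-- Rows with a missing age fall through to 'Unknown' rather than
-- being silently counted as complications.
SELECT ID, Diagnosis,
       CASE
           WHEN AgeAtDx < AgeAtTreatment THEN 'Medical history'
           WHEN AgeAtDx = AgeAtTreatment THEN 'Comorbidity'
           WHEN AgeAtDx > AgeAtTreatment THEN 'Complication'
           ELSE 'Unknown'
       END AS Timing
FROM #dx_demo;
```

In this sample, diabetes is labeled a complication and would be excluded by the condition AgeAtDX <= AgeAtTreatment, while overweight and infection are retained.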

25 Errors Could Be Informative
Be careful in eliminating data. What you think is not reasonable may have specific meaning. Even data entry errors can be used to predict future events. If errors are not occurring at random, create binary variables to indicate where they occur and see what variables are associated with errors.
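One way to follow this advice is to flag suspect records instead of deleting them, and then tabulate the flag against other variables. The sketch below flags patients with a visit after the recorded death and counts them by gender. The table #err_demo, its rows, and the names DeathDateError and #flags are hypothetical; Gender is used only as an example variable.

```sql
-- Hypothetical sample data
DROP TABLE IF EXISTS #err_demo;
CREATE TABLE #err_demo (
    ID INT, Gender VARCHAR(10),
    AgeAtDx DECIMAL(5,1), AgeAtDeath DECIMAL(5,1)
);
INSERT INTO #err_demo (ID, Gender, AgeAtDx, AgeAtDeath) VALUES
    (1, 'Female', 54.2, NULL),
    (2, 'Male',   68.0, 70.0),
    (2, 'Male',   72.0, 70.0),   -- visit after death
    (3, 'Male',   61.0, 65.0);

-- One row per patient: DeathDateError is 1 when any visit falls after death.
-- Patients with no recorded death get 0, since the comparison is unknown.
DROP TABLE IF EXISTS #flags;
SELECT ID,
       MIN(Gender) AS Gender,
       CASE WHEN MIN(AgeAtDeath) < MAX(AgeAtDx) THEN 1 ELSE 0 END AS DeathDateError
INTO #flags
FROM #err_demo
GROUP BY ID;

-- Tabulate the error flag against another variable
SELECT Gender,
       COUNT(*)            AS Patients,
       SUM(DeathDateError) AS WithError
FROM #flags
GROUP BY Gender;
```

If the errors cluster in particular groups, the flag can be kept as a binary variable in the analysis rather than used only as a reason to delete records.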

26 SQL can be used to Remove data that do not make sense
These slides have shown snippets of SQL code that can be used to remove data that do not make sense or can lead to errors in our analysis.

