SQL for Cleaning Data Farrokh Alemi, Ph.D.


This section provides SQL for cleaning data. This brief presentation was organized by Dr. Alemi.

Cleaning Data: Not All of the Code Is Shown
These slides provide a number of SQL code snippets for removing data that do not make sense. The snippets are not necessarily connected to each other. Use them to understand the ideas behind data cleaning, then apply what works with your data.

Remove Duplicated Data
We begin with code for removing duplicate data.

    -- Raises an error on the first run if #NoDuplicate does not exist yet
    DROP TABLE #NoDuplicate
    SELECT id, icd9, AgeAtDx
    INTO #NoDuplicate
    FROM dbo.final
    GROUP BY id, icd9, AgeAtDx

In this code snippet we remove duplicate rows for the same person with the same diagnosis reported at exactly the same time, recorded here as age at diagnosis. This is done by the GROUP BY command: if a patient has multiple entries for a diagnosis at a given age, only one is kept.
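To see the GROUP BY de-duplication at work, here is a minimal sketch that runs the same query on a few invented rows. It assumes Python's built-in sqlite3 module as a stand-in for SQL Server, so the T-SQL `SELECT ... INTO #NoDuplicate` is written as `CREATE TEMP TABLE ... AS SELECT`. The table contents are made up for illustration.

```python
import sqlite3

# Toy stand-in for dbo.final; every row here is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE final (id INTEGER, icd9 TEXT, AgeAtDx REAL)")
conn.executemany(
    "INSERT INTO final VALUES (?, ?, ?)",
    [
        (1, "250.00", 54.2),
        (1, "250.00", 54.2),  # exact duplicate of the row above
        (1, "250.00", 55.0),  # same diagnosis at a later age, so it is kept
        (2, "401.9", 61.3),
    ],
)

# SQLite spelling of SELECT ... INTO #NoDuplicate (an assumption of this sketch).
conn.execute(
    """
    CREATE TEMP TABLE NoDuplicate AS
    SELECT id, icd9, AgeAtDx
    FROM final
    GROUP BY id, icd9, AgeAtDx
    """
)

rows_before = conn.execute("SELECT COUNT(*) FROM final").fetchone()[0]
rows_after = conn.execute("SELECT COUNT(*) FROM NoDuplicate").fetchone()[0]
print(rows_before, rows_after)  # prints: 4 3
```

Only the exact repeat is collapsed. The same diagnosis recorded at a different age counts as a separate entry.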

Remove Zombies & Un-borns
Often the data contain an erroneously entered date of death, one that falls before some of the patient's visits. Occasionally this makes sense, because some health care services are provided shortly after death; an autopsy, for example. Repeated outpatient or inpatient visits after a date of death, however, suggest a problem with the date of death. The same reasoning applies to date of birth: if the patient is reported to have a visit before being born, then the date of birth is probably wrong. The snippet that follows removes patients whose date of death precedes one or more visits, or whose first visit precedes birth.

    DROP TABLE #nonZ
    SELECT Id
    INTO #nonZ
    FROM dbo.data
    GROUP BY ID
    HAVING (Min(AgeAtDeath) >= Max(AgeAtDx) OR Min(AgeAtDeath) is null)
       AND Min(AgeAtDx) > 0

In this code snippet we keep only patients with a reasonable date of birth and date of death.

Date of death is checked by making sure that the patient's minimum reported age at death is greater than or equal to the maximum age at diagnosis. In addition, all patients who have a null value for age at death, meaning they have not died, are included in the cleaned data.

Date of birth is checked by requiring the minimum age at diagnosis to be greater than zero. You may want to set this minimum higher if you want only adults in your cleaned data.

Note also that the code selects only the IDs of patients whose data are OK. To get the other variables for these patients, the temporary table must be merged with the original data.
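The merge back to the original data is not shown on the slides. One way it could look is sketched below, assuming Python's built-in sqlite3 module in place of SQL Server (so `SELECT ... INTO #nonZ` becomes `CREATE TEMP TABLE ... AS SELECT`) and a handful of invented rows. The `VisitId` values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (ID INTEGER, VisitId INTEGER, AgeAtDx REAL, AgeAtDeath REAL)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        (1, 1, 40.0, None),   # alive, plausible ages
        (1, 2, 45.0, None),
        (2, 1, 70.0, 71.0),
        (2, 2, 72.0, 71.0),   # visit after death: a "zombie"
        (3, 1, -1.0, None),   # visit before birth: an "un-born"
        (3, 2, 30.0, None),
        (4, 1, 80.0, 80.5),   # died after the last visit
    ],
)

# Patient-level filter: one row per patient whose dates are plausible.
conn.execute(
    """
    CREATE TEMP TABLE nonZ AS
    SELECT ID
    FROM data
    GROUP BY ID
    HAVING (MIN(AgeAtDeath) >= MAX(AgeAtDx) OR MIN(AgeAtDeath) IS NULL)
       AND MIN(AgeAtDx) > 0
    """
)

# Merge the kept IDs back with the original table to recover every column.
clean = conn.execute(
    """
    SELECT d.ID, d.VisitId, d.AgeAtDx, d.AgeAtDeath
    FROM data AS d
    INNER JOIN nonZ AS n ON n.ID = d.ID
    ORDER BY d.ID, d.VisitId
    """
).fetchall()

kept_ids = sorted({row[0] for row in clean})
print(kept_ids)  # prints: [1, 4]
```

Patients 2 and 3 disappear entirely, including their visits that looked fine on their own.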

    -- Counter-example: a row-level WHERE in place of the group-level HAVING
    DROP TABLE #nonZ
    SELECT Id
    INTO #nonZ
    FROM dbo.data
    WHERE (AgeAtDeath >= AgeAtDx OR AgeAtDeath is null) AND AgeAtDx > 0
    GROUP BY ID

Note that if we had used a WHERE command instead of the HAVING command, we would have deleted only the visits with unreasonable dates and not the patient's entire record. Since problems with date of death or date of birth affect the age at every visit, the entire record should be eliminated.
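The difference between the two commands can be checked on toy data. The sketch below assumes Python's built-in sqlite3 module and invented rows, and it writes the row-level condition as the per-visit analogue of the HAVING test. It lists which patients survive each filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (ID INTEGER, AgeAtDx REAL, AgeAtDeath REAL)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        (1, 40.0, None),
        (1, 45.0, None),
        (2, 70.0, 71.0),   # this visit looks fine in isolation
        (2, 72.0, 71.0),   # this one falls after the recorded death
        (3, -1.0, None),   # before birth
        (3, 30.0, None),   # looks fine in isolation
    ],
)

# Row-level filter: each visit is judged by itself.
where_ids = [
    r[0]
    for r in conn.execute(
        """
        SELECT DISTINCT ID
        FROM data
        WHERE (AgeAtDeath >= AgeAtDx OR AgeAtDeath IS NULL) AND AgeAtDx > 0
        ORDER BY ID
        """
    )
]

# Group-level filter: the patient is judged on all visits together.
having_ids = [
    r[0]
    for r in conn.execute(
        """
        SELECT ID
        FROM data
        GROUP BY ID
        HAVING (MIN(AgeAtDeath) >= MAX(AgeAtDx) OR MIN(AgeAtDeath) IS NULL)
           AND MIN(AgeAtDx) > 0
        ORDER BY ID
        """
    )
]

print(where_ids, having_ids)  # prints: [1, 2, 3] [1]
```

With WHERE, patients 2 and 3 stay in the data because at least one of their visits passes the test. With HAVING, a single implausible visit removes the whole patient.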

Remove Visits with Unusual Dates
This snippet of code removes visits with an unusual date of visit, meaning a date that is missing or out of range. Out-of-range analysis should be done on all variables, not just dates.

    DROP TABLE #InRange
    SELECT ID, VisitId
    INTO #InRange
    FROM dbo.data
    WHERE AgeAtDx between 18 and 105 and AgeAtDx is not null

Here we check whether the age at diagnosis is between 18 and 105. The BETWEEN operator is useful for checking values that should fall in a particular range; both end points are included.

We are also excluding missing date entries. BETWEEN already rejects null values, so the explicit "is not null" test mainly documents the intent.

If missing values are not occurring at random, then they must be kept and treated as a separate binary variable. To check whether missing values occur at random, assign 1 to every missing value and 0 to every other value, then see whether this binary variable is related to the outcome.
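The missing-value check described above can be sketched as follows. This assumes Python's built-in sqlite3 module and invented rows. The outcome column `Died` and the flag name `AgeMissing` are hypothetical names chosen for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (ID INTEGER, AgeAtDx REAL, Died INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        (1, 45.0, 0), (2, 60.0, 0), (3, 71.0, 1), (4, 33.0, 0),  # age recorded
        (5, None, 1), (6, None, 1), (7, None, 1), (8, None, 0),  # age missing
    ],
)

# Assign 1 to missing ages and 0 otherwise, then compare the outcome rate.
summary = conn.execute(
    """
    SELECT CASE WHEN AgeAtDx IS NULL THEN 1 ELSE 0 END AS AgeMissing,
           COUNT(*)  AS n,
           AVG(Died) AS death_rate
    FROM data
    GROUP BY CASE WHEN AgeAtDx IS NULL THEN 1 ELSE 0 END
    ORDER BY AgeMissing
    """
).fetchall()

print(summary)  # prints: [(0, 4, 0.25), (1, 4, 0.75)]
```

In these made-up rows the outcome rate is three times higher where age is missing, which would be a hint that the values are not missing at random. On real data the gap should be examined with a proper statistical test before drawing that conclusion.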

In this snippet we keep only those visits, and their related diagnoses, that meet both criteria. Note that the elimination applies not to the patient's entire record but only to the portion of the record that has an out-of-range or missing value. This is why we need to keep both the ID of the patient and the ID of the visit.

Even though we do not show it here, the two IDs should be used to merge the temporary table with the original data so that all relevant fields, not just the IDs, are available for analysis.
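A possible form of that two-key merge is sketched here, assuming Python's built-in sqlite3 module and invented rows. The `Diagnosis` column is a hypothetical extra field, and visit numbers are assumed to restart for each patient.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (ID INTEGER, VisitId INTEGER, AgeAtDx REAL, Diagnosis TEXT)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        (1, 10, 45.0, "250.00"),
        (1, 11, 150.0, "401.9"),   # out of range
        (2, 10, None, "486"),      # missing age
        (2, 11, 17.0, "493.90"),   # below the lower bound
        (2, 12, 18.0, "428.0"),    # boundary value, BETWEEN is inclusive
    ],
)

# Visit-level filter: keep the pair (ID, VisitId) for each acceptable visit.
conn.execute(
    """
    CREATE TEMP TABLE InRange AS
    SELECT ID, VisitId
    FROM data
    WHERE AgeAtDx BETWEEN 18 AND 105 AND AgeAtDx IS NOT NULL
    """
)

# Merge on both keys; VisitId alone is not unique across patients here.
kept_visits = conn.execute(
    """
    SELECT d.ID, d.VisitId, d.AgeAtDx, d.Diagnosis
    FROM data AS d
    INNER JOIN InRange AS r
            ON r.ID = d.ID AND r.VisitId = d.VisitId
    ORDER BY d.ID, d.VisitId
    """
).fetchall()

print(kept_visits)
```

Joining on `VisitId` alone would bring back patient 2's visit 10, because patient 1 has an acceptable visit with the same number.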

Remove Inconsistent Data
This snippet of code removes inconsistent data. Some combinations of data are unreasonable and may need to be removed.

    SELECT Id
    FROM dbo.data
    WHERE Not (Gender = 'Male' and Pregnant = 'Yes');

Here we keep only patients who are not pregnant males. Think through your data and remove all patients who have a combination of fields that should not occur.
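One detail to check in your own data is how this filter treats missing values. The sketch below assumes Python's built-in sqlite3 module and invented rows. It runs the filter as written, then a variant that uses COALESCE so that rows with a missing field are kept. Whether such rows should be kept is a judgment call. The variant is one option, not the slides' method.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (Id INTEGER, Gender TEXT, Pregnant TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        (1, "Female", "Yes"),
        (2, "Male", "No"),
        (3, "Male", "Yes"),    # the inconsistent combination
        (4, "Male", None),     # pregnancy status not recorded
        (5, "Female", None),
    ],
)

# Filter as written on the slide. For patient 4 the test evaluates to
# unknown rather than true, so the row is silently dropped as well.
as_written = [
    r[0]
    for r in conn.execute(
        """
        SELECT Id FROM data
        WHERE NOT (Gender = 'Male' AND Pregnant = 'Yes')
        ORDER BY Id
        """
    )
]

# Variant: treat a missing field as "not matching" so the row is kept.
null_tolerant = [
    r[0]
    for r in conn.execute(
        """
        SELECT Id FROM data
        WHERE NOT (COALESCE(Gender, '') = 'Male'
                   AND COALESCE(Pregnant, '') = 'Yes')
        ORDER BY Id
        """
    )
]

print(as_written, null_tolerant)  # prints: [1, 2, 5] [1, 2, 4, 5]
```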

Remove Complications?
Sometimes the available data are reasonable but should be ignored in the context of the planned analysis. The data are correct and nothing is wrong with them, yet they should still be set aside. If we are studying the impact of treatment on survival, statisticians require one to drop complications from multivariate models. In electronic health records, complications are diagnoses that occur after treatment. The same diagnosis is considered medical history if it occurs before treatment, a comorbidity if it occurs at the time of treatment, and a complication if it occurs after treatment. This requires us to drop some diagnoses and retain others.

[Figure: diagram relating Overweight, Infection, Antibiotic Treatment, Diabetes, and Survival]
Here we see an example in which, because the patient had an infection and was overweight, a large dose of antibiotic was given. The antibiotic distorted the microbes in the patient's gut, and the patient developed diabetes. If we keep diabetes in our multivariate models, then the estimated effect of the antibiotic on survival will be distorted. In these situations we want to keep comorbidities, i.e. overweight and infection, but not complications, here diabetes. Inside electronic health records, complications and comorbidities can have the same International Classification of Diseases code. So we have to examine whether the disease occurred before or after treatment to know whether it is a complication, medical history, or a comorbidity. Complications occur after treatment.

This requires code that drops diagnoses that occur after treatment.

    SELECT ID, Diagnosis
    FROM dbo.data
    WHERE AgeAtDX <= AgeAtTreatment
    GROUP BY ID, Diagnosis;

In this code, we remove complications of treatment and retain all diagnoses that occur before or at the time of treatment. The WHERE command states that the age at diagnosis should be less than or equal to the age at the start of treatment.

One reason to remove complications from an analysis of the impact of treatment on outcome is that complications are not yet known at the time the treatment is selected.
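Here is a small sketch of the complication filter applied to the overweight, infection, and diabetes example. It assumes Python's built-in sqlite3 module and invented ages, and it writes diagnoses as plain words rather than codes to keep the rows readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (ID INTEGER, Diagnosis TEXT, AgeAtDX REAL, AgeAtTreatment REAL)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        (1, "Overweight", 45.0, 50.0),  # before treatment: medical history
        (1, "Infection", 48.0, 50.0),   # before treatment: medical history
        (1, "Infection", 50.0, 50.0),   # at treatment: comorbidity
        (1, "Diabetes", 52.0, 50.0),    # after treatment: complication
        (2, "Diabetes", 55.0, 60.0),    # same diagnosis, but before treatment
    ],
)

# Keep diagnoses at or before treatment; GROUP BY leaves one row per
# patient and diagnosis.
kept = conn.execute(
    """
    SELECT ID, Diagnosis
    FROM data
    WHERE AgeAtDX <= AgeAtTreatment
    GROUP BY ID, Diagnosis
    ORDER BY ID, Diagnosis
    """
).fetchall()

print(kept)  # prints: [(1, 'Infection'), (1, 'Overweight'), (2, 'Diabetes')]
```

Diabetes is dropped for patient 1, where it follows treatment, and kept for patient 2, where it precedes treatment. The diagnosis alone does not decide whether a row is a complication.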

Errors Could Be Informative
Be careful in eliminating data. What you think is unreasonable may have a specific meaning. Even data entry errors can be used to predict future events. If errors are not occurring at random, create binary variables to indicate where they occur and see which variables are associated with the errors.

SQL Can Be Used to Remove Data That Do Not Make Sense
These slides have shown snippets of SQL code that can be used to remove data that do not make sense or that can lead to errors in our analysis.