1 Getting to Know Your Data: Basic Data Cleaning Principles
© 2009, KAISER PERMANENTE CENTER FOR HEALTH RESEARCH

2 Getting to Know Your Data
- Believe it or not, most good data analysts probably spend the majority of their time cleaning data and only a relatively small percentage doing formal statistical analyses.
- Regardless of how good your quality control is, errors creep into datasets.
- In addition, missing data and skip patterns need to be dealt with, especially when creating new variables.

3 Things to look for
- impossible values
- improbable values
- obvious outliers
- do the data make sense?
- are there inconsistent or illogical patterns?
- are there missing data? If yes and due to skip patterns, are there logical codes we can assign?
- are there text or alpha variables?
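As a rough illustration of these checks, here is a minimal sketch in Python/pandas; the file name and the columns id, age, height_cm, and num_hosp are hypothetical, not taken from the slides:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical dataset

    # Impossible and improbable values, and obvious outliers
    print(df.loc[df["age"] < 0, ["id", "age"]])     # impossible
    print(df.loc[df["age"] > 110, ["id", "age"]])   # improbable
    print(df["height_cm"].describe())               # min/max help flag outliers

    # Missing data: how many per variable?
    print(df.isna().sum())

    # Entries in num_hosp that are missing or cannot be parsed as numbers
    print(pd.to_numeric(df["num_hosp"], errors="coerce").isna().sum())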

4 Strategies for exploring your data
- simple frequencies of all categorical variables
- univariate stats (mean, SD, percentiles, minimum and maximum values) for all continuous variables
- selected crosstabs, especially for nested questions (i.e., if "yes" to Q1, then ask Q2)
- listings of selected variables
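These strategies might look something like the following in pandas; this is only a sketch, and q1, q2, fev1, and the other column names are made up for illustration:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical dataset

    # Simple frequencies of a categorical variable (keep missing values visible)
    print(df["q1"].value_counts(dropna=False))

    # Univariate stats: mean, SD, percentiles, min and max
    print(df["fev1"].describe())

    # Crosstab for nested questions (if "yes" to Q1, then ask Q2)
    print(pd.crosstab(df["q1"], df["q2"]))

    # Listing of selected variables for records that look suspicious
    print(df.loc[df["fev1"] > 6, ["id", "age", "fev1"]])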

5 Sample frequency table
Consider the following frequency table for the number of asthma hospitalizations in the past year. Are the "5" and "10" values real? How might you analyze such data?

6 Sample univariate stats
Does anything strike you as peculiar or suspect about this variable? The 4.42 was a data entry error; it should have been 7.62.

7 Sample univariate stats
Does anything catch your attention? The 25.5 should have been 77.

8 Sample univariate stats
Does this table suggest any problems? What if I said this was from a study of survival in patients with > 6 months on long-term oxygen therapy (LTOT)?

9 Dealing with missing data
Consider the following table. How might we resolve the 3 people who answered both questions? What about the 7 folks who skipped Q5c but shouldn't have?

10 Dealing with missing data
How would we define the following variable? Are there any problems with the following?
Smoke = 1 if Q5 (current smoker) = yes
Smoke = 2 if Q5c (ever smoker) = yes
Smoke = 3 otherwise

11 Dealing with missing data
What might be a better definition of smoke that properly deals with missing data?
Smoke = 1 if Q5 (current smoker) = yes
Smoke = 2 if Q5c (ever smoker) = yes
Smoke = 3 if Q5 = no and Q5c = no
Smoke = "." otherwise
Even this doesn't work if we still have to deal with the 3 inconsistent responses!
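A rough pandas version of this improved recode, assuming purely for illustration that Q5 and Q5c are coded 1 = yes, 0 = no, NaN = missing, and that Q5c is only supposed to be asked when Q5 = no:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical dataset

    def recode_smoke(row):
        if row["q5"] == 1:                        # current smoker
            return 1
        if row["q5c"] == 1:                       # former/ever smoker
            return 2
        if row["q5"] == 0 and row["q5c"] == 0:    # explicit "no" to both
            return 3
        return np.nan                             # otherwise leave missing

    df["smoke"] = df.apply(recode_smoke, axis=1)

    # Inconsistent responders: answered Q5c even though it should have been skipped
    print(df.loc[(df["q5"] == 1) & df["q5c"].notna()])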

12 Imputing values for logical skip patterns
Consider the following two questions: Q3a will be skipped, and hence be missing, for everyone who answers "no" to Q3. Is there a logical value to assign in this case? What are the merits of assigning "0" (no) vs. NA?
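In pandas, the two options for Q3a could be sketched like this; the 1 = yes, 0 = no coding of Q3/Q3a is an assumption made for the example:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical dataset

    # Option 1: assign the logical value 0 ("no") to Q3a wherever Q3 = no,
    # so these respondents remain in analyses of Q3a.
    df.loc[df["q3"] == 0, "q3a"] = 0

    # Option 2: leave Q3a as NA and restrict Q3a analyses to the Q3 = yes group.
    q3a_among_yes = df.loc[df["q3"] == 1, "q3a"]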

13 Listing data to check recodes

14 The Bottom Line: Garbage In = Garbage Out!
Spending the time getting to know and understand your data will pay off in the long run.


16 Statistics: Inside the Black Box


18 Statistics: Inside the Black Box
- Statistics can be said to be about estimating quantities of interest (e.g., the prevalence of TB in a Rio favela or the rate of decline of lung function with age) and then making inferences about these quantities (e.g., does TB prevalence vary by HIV status?).
- We will focus on the "estimation step," including model building and interpreting the coefficients in your models.

19 Statistical Estimation Step: Maximum Likelihood Estimation - 1
Most people have heard of the normal distribution. When we say that some variable is normally distributed with mean μ and variance σ², we are tacitly assuming that we can write an equation describing the probability (or likelihood) of the observed data as a function of μ and σ². The values of μ and σ² that maximize this probability are termed "maximum likelihood estimates" (MLEs).
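To make this concrete: for n independent observations y₁, …, yₙ from a normal distribution, the likelihood being maximized is

    L(μ, σ²) = ∏ᵢ [ 1 / √(2πσ²) ] · exp( −(yᵢ − μ)² / (2σ²) ),

and the maximizing values turn out to be the familiar sample quantities μ̂ = ȳ and σ̂² = (1/n) Σᵢ (yᵢ − ȳ)².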

20 Statistical Estimation Step: Maximum Likelihood Estimation - 2
Whenever you fit a regression model, you are asking the computer to generate maximum likelihood estimates. However, rather than simply estimate a single overall mean, μ, we typically want to describe the mean in terms of other explanatory variables. For example:
mean FEV₁ = β₀ + β₁·Age + β₂·Height
The coefficients in this model (the βs) are also MLEs!
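A minimal sketch of fitting such a model in Python with statsmodels; the file and the column names fev1, age, and height are hypothetical:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("lung_function.csv")  # hypothetical dataset

    # For a normal (linear) model, ordinary least squares gives the MLEs of the betas
    fit = smf.ols("fev1 ~ age + height", data=df).fit()
    print(fit.params)  # beta-hat: the maximum likelihood estimates
    print(fit.bse)     # their standard errors, used to test H0: beta = 0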

21 Statistical Estimation Step: Maximum Likelihood Estimation - 3
MLEs have two very desirable properties for statisticians:
- If the source data are normally distributed, then the MLEs will be normally distributed.
- Even if the source data are not normally distributed, the MLEs derived from such data will be approximately normally distributed for large enough sample sizes.
We use these properties to test specific hypotheses of interest (e.g., H₀: β₁ = 0).

22 Statistical Estimation Step: Maximum Likelihood Estimation - 4
In addition to the normal distribution, other common distributions used in the medical literature are:
- the binomial distribution, which forms the basis for logistic regression and is used to analyze binary (yes/no) data
- the Poisson distribution, useful for modeling rates of occurrence, and
- the Cox proportional hazards model, used to analyze time-to-event data.

23 Statistical Estimation Step: Maximum Likelihood Estimation - 5
Each distribution gives rise to an equation that relates a basic parameter of the model to a collection of predictor variables. For example:
normal: μ = β₀ + β₁·Age + β₂·Height
binomial: ln[P/(1−P)] = β₀ + β₁·Age + β₂·Male
Cox: ln[λ(t)] = ln[λ₀(t)] + β₁·Pkyrs + β₂·Male
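As one example, the binomial (logistic) equation above could be fit like so, with hypothetical column names, a 0/1 outcome, and a 0/1 male indicator:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("study.csv")  # hypothetical dataset

    # ln[P/(1-P)] = b0 + b1*Age + b2*Male for a binary (0/1) outcome
    fit = smf.logit("outcome ~ age + male", data=df).fit()
    print(fit.params)           # coefficients on the log-odds scale
    print(np.exp(fit.params))   # exponentiate to get odds ratios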

24 Statistical Estimation Step: Maximum Likelihood Estimation - 6
- We will teach you a systematic way to use these equations to help you interpret the coefficients in your model.
- We will also teach you how to construct your models so as to test specific biological hypotheses of interest.

