© 2009, KAISER PERMANENTE CENTER FOR HEALTH RESEARCH Getting to Know Your Data Basic Data Cleaning Principles.


Getting to Know Your Data
• Believe it or not, most good data analysts probably spend the majority of their time cleaning data and only a relatively small percentage doing formal statistical analyses.
• No matter how good your quality control is, errors creep into datasets.
• In addition, missing data and skip patterns need to be dealt with, especially when creating new variables.

Getting to Know Your Data: Things to look for
• impossible values
• improbable values
• obvious outliers
• do the data make sense?
• are there inconsistent or illogical patterns?
• are there missing data? If yes and due to skip patterns, are there logical codes we can assign?
• are there text or alpha variables?

Getting to Know Your Data: Strategies for exploring your data
• simple frequencies of all categorical variables
• univariate stats (mean, SD, percentiles, minimum and maximum values) for all continuous variables
• selected crosstabs, especially for nested questions (e.g., if “yes” to Q1, then ask Q2)
• listings of selected variables
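The exploration strategies above can be sketched in a few lines of Python. This is only an illustration: the records, the variable names (q1, q2, age) and the helper functions are hypothetical and not taken from the slides. The data are made up so that one impossible age and one broken skip pattern show up in the output.

```python
from collections import Counter
import statistics

# Hypothetical survey records; None marks a missing answer.
records = [
    {"q1": "yes", "q2": "yes", "age": 34},
    {"q1": "yes", "q2": "no", "age": 51},
    {"q1": "no", "q2": None, "age": 47},    # Q2 correctly skipped
    {"q1": "no", "q2": "yes", "age": 29},   # Q2 answered but should have been skipped
    {"q1": "yes", "q2": None, "age": 340},  # Q2 skipped in error; impossible age
    {"q1": None, "q2": None, "age": 62},
]

def frequencies(rows, var):
    """Simple frequency table for one categorical variable (missing included)."""
    return Counter(row[var] for row in rows)

def univariate(rows, var):
    """Basic univariate statistics for one continuous variable."""
    values = [row[var] for row in rows if row[var] is not None]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }

def crosstab(rows, row_var, col_var):
    """Counts for every observed (row_var, col_var) pair."""
    return Counter((row[row_var], row[col_var]) for row in rows)

print(frequencies(records, "q1"))
print(univariate(records, "age"))
print(crosstab(records, "q1", "q2"))
```

In the printed output, the maximum age of 340 and the ("no", "yes") cell of the crosstab are the kinds of values worth listing and checking against the source forms.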

Getting to Know Your Data: Sample frequency table
Consider the following frequency table for the number of asthma hospitalizations in the past year. Are the “5” and “10” values real? How might you analyze such data?

Getting to Know Your Data: Sample univariate stats
Does anything strike you as peculiar or suspect with this variable? The 4.42 was a data entry error; it should have been 7.62.

Getting to Know Your Data: Sample univariate stats
Does anything catch your attention? The 25.5 should have been 77.

Getting to Know Your Data: Sample univariate stats
Does this table suggest any problems? What if I said this was from a study of survival in patients with more than 6 months on long-term oxygen therapy (LTOT)?

Getting to Know Your Data: Dealing with missing data
Consider the following table. How might we resolve the 3 people who answered both questions? What about the 7 folks who skipped Q5c but shouldn’t have?

Getting to Know Your Data: Dealing with missing data
How would we define the following variable? Are there any problems with the following?
  Smoke = 1 if Q5 (current smoker) = yes
  Smoke = 2 if Q5c (ever smoker) = yes
  Smoke = 3 otherwise

Getting to Know Your Data: Dealing with missing data
What might be a better definition of smoke that properly deals with missing data?
  Smoke = 1 if Q5 (current smoker) = yes
  Smoke = 2 if Q5c (ever smoker) = yes
  Smoke = 3 if Q5 = no and Q5c = no
  Smoke = “.” otherwise
Even this doesn’t work if we still have to deal with the 3 inconsistent responses!
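One possible way to code this definition is sketched below in Python. The function name, the string codes and the review flags are hypothetical. The sketch also assumes that Q5c is only meant to be asked when Q5 = no, so a “yes” to Q5 together with any answer to Q5c is treated as inconsistent. It flags those responses for review and does not try to resolve them.

```python
def derive_smoke(q5, q5c):
    """Return (smoke, flag) for one respondent.

    q5 and q5c are "yes", "no", or None (missing).
    smoke is 1 (current), 2 (ever), 3 (never), or None (the "." code).
    """
    # Assumption: Q5c should be skipped when Q5 = "yes", so answering
    # both is inconsistent and is left unresolved for manual review.
    if q5 == "yes" and q5c is not None:
        return None, "inconsistent"
    if q5 == "yes":
        return 1, None
    if q5c == "yes":
        return 2, None
    if q5 == "no" and q5c == "no":
        return 3, None
    # Everything else, including Q5 = "no" with Q5c skipped in error.
    return None, "missing"

# A few hypothetical respondents.
for answers in [("yes", None), ("no", "yes"), ("no", "no"), ("no", None), ("yes", "no")]:
    print(answers, "->", derive_smoke(*answers))
```

Returning a flag alongside the value keeps the two kinds of missing result (never answered versus contradictory) distinguishable when the recode is later checked in a listing.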

Getting to Know Your Data: Imputing values for logical skip patterns
Consider the following two questions: Q3a will be skipped, and hence be missing, for everyone who answers “no” to Q3. Is there a logical value to assign in this case? What are the merits of assigning “0” (no) vs. NA?
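The choice between “0” and NA can be made explicit in code. The following Python sketch is an assumption-laden illustration: the function name, the skip_value parameter and the status labels are hypothetical, and the slides do not say which option is preferred.

```python
def impute_q3a(q3, q3a, skip_value=0):
    """Return (value, status) for Q3a given the answer to the gate question Q3.

    q3 is "yes", "no", or None; q3a is the recorded answer or None.
    skip_value is what to assign on a logical skip: 0 for "no",
    or None to leave it as NA.
    """
    if q3a is not None:
        return q3a, "observed"
    if q3 == "no":
        # Q3a was skipped by design, not forgotten.
        return skip_value, "logical skip"
    return None, "missing"

print(impute_q3a("no", None))                   # logical skip coded as 0
print(impute_q3a("no", None, skip_value=None))  # logical skip left as NA
print(impute_q3a("yes", None))                  # truly missing
```

Keeping the status next to the value lets later analyses tell an assigned 0 apart from an observed 0, whichever coding is chosen.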

Getting to Know Your Data: Listing data to check recodes

Getting to Know Your Data: The Bottom Line
Garbage In = Garbage Out!
Spending the time getting to know and understand your data will pay off in the long run.

Statistics Inside the Black Box

Statistics: Inside the Black Box
• Statistics can be said to be about estimating quantities of interest (e.g., the prevalence of TB in a Rio favela or the rate of decline of lung function with age) and then making inferences about these quantities (e.g., does TB prevalence vary by HIV status?).
• We will focus on the “estimation step”, including model building, and interpreting the coefficients in your models.

Statistical Estimation Step: Maximum Likelihood Estimation - 1
Most people have heard of the normal distribution. When we say that some variable is normally distributed with mean $\mu$ and variance $\sigma^2$, we are tacitly assuming that we can write an equation describing the probability (or likelihood) of the observed data as a function of $\mu$ and $\sigma^2$. The values of $\mu$ and $\sigma^2$ that maximize the probability are termed “maximum likelihood estimates” (MLEs).
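A small numerical sketch may make this concrete. For normally distributed data, the MLE of the mean is the sample mean and the MLE of the variance is the average squared deviation (dividing by n, not n - 1). The Python below uses made-up values and hypothetical function names, and checks that moving away from these estimates lowers the log-likelihood.

```python
import math

def normal_log_likelihood(data, mu, sigma2):
    """Log-likelihood of the data under a normal(mu, sigma2) model."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)

def normal_mle(data):
    """Closed-form maximum likelihood estimates of mu and sigma2."""
    n = len(data)
    mu_hat = sum(data) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n  # divides by n
    return mu_hat, sigma2_hat

# Illustrative (made-up) measurements.
values = [2.9, 3.4, 3.1, 3.8, 2.6, 3.3]
mu_hat, sigma2_hat = normal_mle(values)
best = normal_log_likelihood(values, mu_hat, sigma2_hat)

print("MLEs:", mu_hat, sigma2_hat)
print("log-likelihood at the MLEs:", best)
print("shifted mean:", normal_log_likelihood(values, mu_hat + 0.1, sigma2_hat))
print("doubled variance:", normal_log_likelihood(values, mu_hat, 2 * sigma2_hat))
```

Both perturbed log-likelihoods come out smaller than the value at the MLEs, which is what “maximum likelihood” means in practice.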

Statistical Estimation Step: Maximum Likelihood Estimation - 2
Whenever you fit a regression model, you are asking the computer to generate maximum likelihood estimates. However, rather than simply estimate a single overall mean, $\mu$, we typically want to describe the mean in terms of other explanatory variables. For example,
$\text{mean FEV}_1 = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Height}$
The coefficients in this model (the $\beta$s) are also MLEs!
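As a rough illustration, the Python sketch below fits a simplified version of this model with a single predictor (Age only), using the closed-form least-squares formulas. When the errors are assumed to be normally distributed, the least-squares coefficients coincide with the maximum likelihood estimates. The data and the function name are made up for the example.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Illustrative (made-up) ages in years and FEV1 values in litres.
age = [30, 40, 50, 60, 70]
fev1 = [3.9, 3.6, 3.4, 3.0, 2.8]

b0, b1 = fit_line(age, fev1)
print("intercept:", b0)
print("slope per year of age:", b1)
```

With these made-up numbers the slope is negative, i.e. the fitted mean FEV1 falls with each additional year of age. A model with both Age and Height would need a multiple-regression routine, which is beyond this sketch.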

Statistical Estimation Step: Maximum Likelihood Estimation - 3
MLEs have two very desirable properties for statisticians:
• If the source data are normally distributed, then the MLEs will be normally distributed.
• Even if the source data are not normally distributed, the MLEs derived from such data will be approximately normally distributed for large enough sample sizes.
We use these properties to test specific hypotheses of interest (e.g., $H_0: \beta_1 = 0$).

Statistical Estimation Step: Maximum Likelihood Estimation - 4
In addition to the normal distribution, other common distributions used in the medical literature are:
• the binomial distribution, which forms the basis for logistic regression and is used to analyze binary (yes/no) data,
• the Poisson distribution, useful for modeling rates of occurrence, and
• the Cox proportional hazards model, used to analyze time to event data.

Statistical Estimation Step: Maximum Likelihood Estimation - 5
Each distribution gives rise to an equation that relates a basic parameter of the model to a collection of predictor variables, e.g.,
normal: $\mu = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Height}$
binomial: $\ln[P/(1-P)] = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Male}$
Cox: $\ln[\lambda(t)] = \ln[\lambda_0(t)] + \beta_1\,\text{Pkyrs} + \beta_2\,\text{Male}$
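To show how the binomial (logistic) equation can be read, the Python sketch below converts the log-odds back to a probability and checks that exponentiating a coefficient gives an odds ratio. The coefficient values are hypothetical and chosen only for illustration. They do not come from any fitted model in these slides.

```python
import math

def predicted_probability(b0, b1, b2, age, male):
    """Invert ln[P/(1-P)] = b0 + b1*Age + b2*Male to get P."""
    log_odds = b0 + b1 * age + b2 * male
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only.
b0, b1, b2 = -4.0, 0.05, 0.7

p_male = predicted_probability(b0, b1, b2, age=60, male=1)
p_female = predicted_probability(b0, b1, b2, age=60, male=0)

# At any fixed age, the male:female odds ratio equals exp(b2).
print("P(male, 60):", p_male)
print("P(female, 60):", p_female)
print("odds ratio from probabilities:", odds(p_male) / odds(p_female))
print("exp(b2):", math.exp(b2))
```

The same reading applies to the Cox equation, where exponentiating a coefficient gives a hazard ratio in place of an odds ratio.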

Statistical Estimation Step: Maximum Likelihood Estimation - 6
• We will teach you a systematic way to use these equations to help you interpret the coefficients in your model.
• We will also teach you how to construct your models so as to test specific biological hypotheses of interest.