Improving the quality of data through imputing missing values (Part One: Introduction to types of missing data) Saeid Shahraz MD, PhD Student Heller School.


Improving the quality of data through imputing missing values (Part One: Introduction to types of missing data) Saeid Shahraz MD, PhD Student Heller School of Social Policy and Management 4/10/2017 Saeid Shahraz

Basic questions
What does “missing data” mean? What does “imputation” mean? What does “data improvement” mean? How much missingness is acceptable? Is missing data a common problem? Is imputation always the right solution?

What does “missing data” mean?
Please look at Table 1 on the next slide. This ultra-small data set has 5 observations and, as you can see, observations 3 and 5 have missing values on the variable “number of follow-up rehabilitation visits”.

Table 1 - Two values are missing
Id   Gender   Age   Rehab visits
1             12    7
2             13    6
3             16    (missing)
4             67    13
5             72    (missing)

What does “imputation” mean?
If we work out what the missing values are and put them in the empty cells, we have done imputation. Please look at Table 2, in which the missing values have been imputed. Do not worry about how the imputation was carried out; I simply put in some arbitrary numbers.

Table 2 - Two values imputed
Id   Gender   Age   Rehab visits
1             12    7
2             13    6
3             16    6
4             67    13
5             72    15

What does “data improvement” mean?
Please look at Table 3. It has three columns for the number of visits. The left column is the actual (complete) variable, the middle column is the same variable with missing values, and the rightmost column is the variable with imputed values. The last row shows the average number of visits for the actual data, the data with missing values, and the imputed data. The average of the imputed column (9.4) is closer to the actual average (10.2) than the average of the column with missing values (8.7). In this sense, imputation improved the quality of the data.

Table 3 - Data improvement
Id   Gender   Age   Rehab visits (actual)   Rehab visits (missing)   Rehab visits (imputed)
1             12    7                       7                        7
2             13    6                       6                        6
3             16    8                       (missing)                6
4             67    13                      13                       13
5             72    17                      (missing)                15
Average of Rehab variable   10.2            8.7                      9.4
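The averages in Table 3 can be checked with a few lines of Python. This is only a sketch: `None` marks a missing value, the function name `mean_of_observed` is made up for illustration, and the visit count for Id 4 (13) and the imputed count for Id 3 (6) are taken as the values consistent with the reported averages, which is an assumption.

```python
from statistics import mean

# Rehab visits for Id 1..5. None marks a missing value.
# Assumption: 13 (Id 4) and the imputed 6 (Id 3) are the values that
# reproduce the averages reported in Table 3.
actual = [7, 6, 8, 13, 17]
with_missing = [7, 6, None, 13, None]
imputed = [7, 6, 6, 13, 15]


def mean_of_observed(values):
    """Average the values that are present, skipping missing ones."""
    observed = [v for v in values if v is not None]
    return mean(observed)


for label, column in [("actual", actual),
                      ("missing", with_missing),
                      ("imputed", imputed)]:
    print(f"{label:8s} {mean_of_observed(column):.1f}")
```

The column with missing values is averaged over the three observed rows only, which is why its mean drifts away from the actual one.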

How much missingness is acceptable?
As with the significance threshold for p-values, there is no empirical answer to this question; any cut-off is a convention. Leong and Austin (2006), for instance, suggested 5%. I have personally seen social science and health services research that accepted 10% missingness. So, for now, let us agree on a tolerance level of 5%.
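As a rough illustration of applying such a tolerance level, the share of missing values on a variable can be computed and compared with the 5% cut-off. The sketch below uses the rehab-visits variable from the small example data set; the helper name `missing_fraction` is hypothetical, and the value 13 for Id 4 is an assumption that does not affect the result.

```python
TOLERANCE = 0.05  # the 5% level adopted on this slide


def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


# Rehab visits for the five observations; None marks a missing value.
rehab_visits = [7, 6, None, 13, None]

rate = missing_fraction(rehab_visits)
print(f"{rate:.0%} missing; within tolerance: {rate <= TOLERANCE}")
```

With two of five values missing, this toy variable is far above the 5% level.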

Is missing data a common problem?
Yes. In most administrative data sets I have worked with, a considerable number of values on the variables I needed were missing. We need to take a substantial amount of missing data seriously even when a data set has a reputation for being clean and complete. An example of the latter is the Demographic and Health Surveys, better known as DHS. These data sets carry a lot of invaluable information, but missing data is sometimes a prohibitive factor for researchers using them.

Is imputation always the right solution?
With some exceptions, yes. But I would like you to answer this question once we are done with the whole series of presentations.

TYPES OF MISSING DATA (RUBIN’S TYPOLOGY)
MISSING COMPLETELY AT RANDOM (MCAR)
MISSING AT RANDOM (MAR)
MISSING NOT AT RANDOM (MNAR)

Missing Completely At Random (MCAR)
The cause of missingness cannot be found by looking at the other observed variables. The cause of missingness is also independent of the values of the variable that has missing data. This is the NO-NO condition: missingness depends neither on the missing (unobserved) values nor on the other observed variables.

MCAR: EXAMPLE ONE: Lab samples thrown out
Imagine that blood samples from a randomly selected population, taken to test fasting blood sugar, have been sent to 3 labs. One of the labs reports that all of its samples have been accidentally thrown out. So, a portion of the data on the variable blood sugar level will be missing from the final data set. Here, the event that causes the missingness is exogenous to the data-gathering process and to the characteristics of the population (the likelihood of missing is independent of the observed information). Also, the missingness was independent of whether blood sugar was high or low.

MCAR-1 Missing Completely At Random
Example 1: Lab samples thrown out
1. Variable with considerable missing values: Blood sugar
2. Other observed variables: Age, sex, and weight, for example
3. Does missingness depend on the missing (unobserved) values? Did higher or lower blood sugar have an effect on the probability of missing blood sugar? No
4. Does missingness depend on other variables? Did age, sex, or weight increase or decrease the probability of missing on blood sugar? No
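The lab example can be mimicked with a small simulation. This is a sketch under stated assumptions, not real data: the blood sugar values are drawn from a made-up normal distribution, and each sample is assumed to be assigned to one of the three labs purely at random. Because losing a lab's samples is unrelated to the blood sugar values, the mean of the surviving samples should stay close to the mean of all samples.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical fasting blood sugar values (mg/dL) for 30,000 people.
blood_sugar = [random.gauss(100, 15) for _ in range(30_000)]

# Assumption: each sample is sent to one of three labs at random.
lab = [random.choice([1, 2, 3]) for _ in blood_sugar]

# Lab 3 throws out all of its samples, so those values are missing.
observed = [x for x, assigned in zip(blood_sugar, lab) if assigned != 3]

missing_share = 1 - len(observed) / len(blood_sugar)
print(f"share missing:  {missing_share:.2f}")
print(f"mean, all data: {mean(blood_sugar):.1f}")
print(f"mean, observed: {mean(observed):.1f}")
```

About a third of the values are lost, yet the two means should differ only by sampling noise, which is what makes MCAR the least harmful pattern.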

MCAR: EXAMPLE TWO: Coin tossing
This example is the familiar coin toss used in sport to decide which team gets the ball first. There are two possible results: head and tail. Imagine that our data set records the age of the referee and the type of sport, and that some of the coin-toss results are missing. Obviously, whether a result is missing does not depend on the observed variables (age of the referee and type of sport) or on the missing (unobserved) values. To elaborate on the latter: having 70% of the observed coin-toss results as heads does not imply that 70%, or even the majority, of the missing values have to be heads.

MCAR-2 Missing Completely At Random
Example 2: Coin tossing in sport
1. Variable with considerable missing values: Result of the coin toss
2. Other observed variables: Type of sport and age of the referee
3. Does missingness depend on the missing (unobserved) values? Did the result of the toss (head or tail) affect the probability that the result was missing? No
4. Does missingness depend on other variables? Did the type of sport or the age of the referee affect the probability that the result was missing? No

Missing At Random (MAR)
The cause of the missing values is independent of the missing (unobserved) values, but it can be predicted from other observed values. This is the NO-YES condition.

MAR: EXAMPLE ONE: Females and kidney donation
The example is a study of the effect of kidney donation on the donor’s household income. Suppose the study finds that female donors are more likely than male donors to refuse to answer the income question, while women with low and high incomes respond to the income question with the same probability. The missing pattern on the income variable is then called Missing At Random, or MAR. In other words, the missingness is independent of the missing (unobserved) values but depends on an observed variable (sex of the donor).

MAR-1 Missing At Random
Example 1: Females and kidney donation
1. Variable with considerable missing values: Income of the family that donated a kidney
2. Other observed variables: Sex, age, and ethnicity of the donor
3. Does missingness depend on the missing (unobserved) values? Did women with high income, as opposed to women with low income, have a greater chance of refusing to answer the income question? No
4. Does missingness depend on other variables? Did the sex of the donor affect the probability of responding to the income question? Yes
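The kidney donation example can be sketched as a simulation as well. All numbers here are hypothetical: incomes come from a made-up normal distribution, and the refusal probabilities (0.4 for women, 0.1 for men) are chosen only to make the pattern visible. The refusal probability depends on sex alone and never on income, which is the MAR assumption, so within the group of women the observed mean income should stay close to the mean of all women.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

donors = []
for _ in range(20_000):
    sex = random.choice(["female", "male"])
    income = random.gauss(50, 10)  # hypothetical income, in thousands
    # Refusal depends only on sex, never on income (the MAR assumption).
    p_refuse = 0.4 if sex == "female" else 0.1
    answered = random.random() >= p_refuse
    donors.append((sex, income, answered))


def refusal_rate(sex):
    """Share of donors of the given sex who did not answer."""
    group = [answered for s, _, answered in donors if s == sex]
    return 1 - sum(group) / len(group)


women_all = [inc for s, inc, _ in donors if s == "female"]
women_obs = [inc for s, inc, ans in donors if s == "female" and ans]

print(f"refusal rate, women: {refusal_rate('female'):.2f}")
print(f"refusal rate, men:   {refusal_rate('male'):.2f}")
print(f"mean income, all women:       {mean(women_all):.1f}")
print(f"mean income, responding women: {mean(women_obs):.1f}")
```

Sex predicts who is missing, but among women the responders look like the non-responders, which is what allows the observed variable to be used when imputing.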

MAR: EXAMPLE TWO: Attitudes toward having social insurance
This is a study of attitudes toward implementing a universal social welfare insurance program. It was found that people affiliated with a particular political party tended not to respond to the insurance question. In this example, the pattern of missing responses on having social insurance is MAR because at least one observed variable (political party) partly determined the likelihood of the response being missing. A positive or negative response toward having social insurance was assumed to be independent of the missing pattern. This means that the probability of a missing answer to the insurance question was the same for people who would have answered negatively and for those who would have answered positively.

MAR-2 Missing At Random
Example 2: Attitudes toward having social insurance
1. Variable with considerable missing values: Yes/no answer to having universal social insurance
2. Other observed variables: Political party affiliation
3. Does missingness depend on the missing (unobserved) values? Did a positive or negative response on the necessity of having the insurance affect the likelihood of missing? No
4. Does missingness depend on other variables? Did the political affiliation of the person predict the likelihood of missingness? Yes

Missing Not At Random (MNAR)
The cause of the missing values depends on the missing (unobserved) values themselves, and it can usually also be predicted from other observed values. This is the YES-YES condition.
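The NO-NO, NO-YES, and YES-YES conditions can be summarized as a small decision rule. The function below is a sketch with a hypothetical name; it takes the answers to the two questions (does missingness depend on the missing values, and does it depend on other observed variables) and returns the type. One assumption goes beyond the slides: a YES-NO pattern is also labelled MNAR here, since under Rubin's typology any dependence on the unobserved values makes the data MNAR.

```python
def missingness_type(depends_on_missing_values: bool,
                     depends_on_observed: bool) -> str:
    """Map the two yes/no answers to MCAR, MAR, or MNAR."""
    if depends_on_missing_values:
        # YES-YES on the slides; YES-NO is treated the same way here.
        return "MNAR"
    if depends_on_observed:
        return "MAR"   # NO-YES
    return "MCAR"      # NO-NO


print(missingness_type(False, False))  # lab samples thrown out
print(missingness_type(False, True))   # female donors and income
print(missingness_type(True, True))    # dependence on the missing values
```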

MNAR: EXAMPLE ONE: Synthetic insulin and blood sugar reduction time
The first scenario is a research study of the effect of a new type of synthetic insulin on blood sugar reduction time in humans. The protocol mandates that, if the reduction time is greater than one third of the standard reduction time (defined in the protocol), the researchers stop the treatment and refer the patient to the emergency department. These patients leave the study, and their final result on reduction time is missing. In this example, the likelihood of missing depends directly on the unobserved (missing) values. This means that the reduction time itself (the variable with a considerable number of missing cases) determines whether or not the value is missing.

MNAR-1 Missing Not At Random
Example 1: Synthetic insulin and blood sugar reduction time
1. Variable with considerable missing values: Blood sugar reduction time
2. Other observed variables: Sex, age, and ethnicity of the patient
3. Does missingness depend on the missing (unobserved) values? Did the likelihood of missing depend on the value of the reduction time? Yes
4. Does missingness depend on other variables? Did the demographics of the patient affect the likelihood of missing? Likely
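A simulation in the spirit of the insulin example shows why MNAR is the troublesome case. Everything numeric here is an assumption for illustration: the reduction times come from a made-up normal distribution, and `CUTOFF` is a hypothetical protocol limit rather than the one-third rule in the protocol. Patients whose reduction time exceeds the limit leave the study, so only the shorter times are recorded.

```python
import random
from statistics import mean

random.seed(2)  # fixed seed so the run is repeatable

CUTOFF = 40.0  # hypothetical protocol limit, in minutes

# Hypothetical blood sugar reduction times for 1,000 patients.
reduction_time = [random.gauss(30, 10) for _ in range(1_000)]

# Patients above the limit leave the study; their value is never recorded.
recorded = [t if t <= CUTOFF else None for t in reduction_time]
observed = [t for t in recorded if t is not None]

print(f"patients with a missing value: {len(reduction_time) - len(observed)}")
print(f"mean, all patients:  {mean(reduction_time):.1f}")
print(f"mean, observed only: {mean(observed):.1f}")
```

Because the long reduction times are exactly the ones that go missing, the observed mean understates the true mean, and nothing in the observed values alone reveals by how much.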

MNAR: EXAMPLE TWO: A new pain killer and experience with pain
The second scenario is a study in which a new pain killer is given to patients with migraine headache, and they are asked the next day how much their pain was reduced. It was found that missing values on the variable “how much pain was reduced” were much more frequent among patients who experienced severe pain.

MNAR-2 Missing Not At Random
Example 2: A new pain killer and experience with pain
1. Variable with considerable missing values: Amount of pain reduction
2. Other observed variables: Sex, age, and having mood disorders
3. Does missingness depend on the missing (unobserved) values? Did the likelihood of missing depend on the amount of pain reduction? Yes
4. Does missingness depend on other variables? Did the demographics of the participant and his or her history of mood disorder affect the likelihood of missing? Likely

Thank you. I look forward to seeing you at the next session. Please email me your questions at sshahraz@yahoo.com