
The Robust Approach Dealing with real data

Estimating Population Parameters Four properties are considered desirable in a population estimator:
 Sufficiency
 Unbiasedness
 Efficiency
 Resistance

Sampling Distribution In order to examine the properties of a statistic we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample. We can then look at the distribution of the statistic across these samples and ask a variety of questions about it.
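This resampling idea can be sketched directly. The population parameters (mean 100, SD 15), the sample size of 25, and the number of replications are all illustrative choices, not values from the slides:

```python
import random
import statistics

random.seed(1)

# A stand-in population; mean 100 and SD 15 are illustrative choices.
population = [random.gauss(100, 15) for _ in range(10_000)]

# Take repeated samples and compute the statistic of interest on each one.
sample_means = [statistics.mean(random.sample(population, 25))
                for _ in range(2_000)]

# The center of this distribution should sit near the population mean, and
# its spread (the standard error) near sigma / sqrt(n) = 15 / 5 = 3.
print(statistics.mean(sample_means), statistics.stdev(sample_means))
```

Swapping in the median, variance, or a trimmed mean for `statistics.mean` inside the loop lets you examine the sampling distribution of any statistic the same way.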

Properties of a Statistic
 Sufficiency: A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter. For example, this property makes the mean more attractive as a measure of central tendency than the mode or median.
 Unbiasedness: A statistic is an unbiased estimator if its expected value (the mean of the statistic over repeated samples) equals the population parameter it estimates. Using the resampling procedure just described, the mean can be shown to be an unbiased estimator.

Properties of a Statistic
 Efficiency: The efficiency of a statistic is reflected in the variance observed when one examines the statistic over independently chosen samples (its standard error). The smaller that variance, the more efficient the statistic is said to be.
 Resistance: The resistance of an estimator refers to the degree to which the estimate is affected by extreme values (i.e., outliers); for a resistant estimator, small changes in the data result in only small changes in the estimate.
 Finite-sample breakdown point: a measure of resistance to contamination, defined as the smallest proportion of observations that, when altered sufficiently, can render the statistic arbitrarily large or small.
 Median: approximately 1/2
 Trimmed mean: the trimming proportion
 Mean: 1/n
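A quick sketch of those breakdown points in action: altering a single observation (1/n of the data) can drag the mean anywhere, while the median and a 20% trimmed mean barely notice. The data are hypothetical:

```python
import statistics

def trimmed_mean(xs, prop=0.20):
    """Average after dropping prop of the observations from each tail."""
    xs = sorted(xs)
    g = int(prop * len(xs))
    return statistics.mean(xs[g:len(xs) - g] if g else xs)

clean = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
contaminated = clean[:-1] + [10_000]   # one observation (1/n of the data) goes wild

# The mean is dragged away; the resistant estimators stay put.
print(statistics.mean(clean), statistics.mean(contaminated))
print(statistics.median(clean), statistics.median(contaminated))
print(trimmed_mean(clean), trimmed_mean(contaminated))
```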

The Problems Nonnormality Arbitrarily small departures from normality can have tremendous influence on mean and variance estimates, resulting in:
 Low power
 Underestimated effect size
 Inability to accurately assess correlation
 Problematic inference

The problems Heterogeneity of variances Among groups, unequal variances lead to low power and biased results. In the heteroscedastic situation in regression, the bias may be even worse.

The problems Communication Those who have known about these problems for some time (statisticians) have been unable to get their findings across to the larger audience of applied researchers. Standard methods still dominate, and who knows how many findings have been lost (Type II errors) or spuriously found (Type I errors) due to problematic data.

Measures of Central Tendency What we want:
 A statistic whose standard error is not grossly affected by small departures from normality
 Power comparable to that of the mean and SD when dealing with a normal population
 A value that is fairly stable when dealing with non-normal distributions
Two classes to speak of:
 Trimmed means
 M-estimators

Trimmed mean You are already familiar with this idea in the form of the median, for which essentially everything but the middle value is trimmed. Here, though, we want to retain as much of the data as possible for good performance, while trimming enough to ensure resistance to outliers. How much to trim? About 20%, from both sides. Example: with 15 values, .2 * 15 = 3, so remove the 3 largest and the 3 smallest values.
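The 20% trim just described can be sketched in a few lines; the sample of 15 values below is hypothetical:

```python
def trimmed_mean(xs, prop=0.20):
    xs = sorted(xs)
    g = int(prop * len(xs))          # .2 * 15 = 3 observations per tail here
    return sum(xs[g:len(xs) - g]) / (len(xs) - 2 * g)

data = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 40, 90]   # 15 hypothetical values
print(trimmed_mean(data))   # averages the middle 9 values
```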

Trimmed mean How does it perform? In non-normal situations it will perform better than the mean. We already know it will be resistant to outliers, and it will have a reduced standard error as well.

Trimmed mean How does it perform? Under normal conditions, about as well as the mean (slightly less efficient). With a symmetric population, the mean, median, trimmed mean, etc. will all have the same value; but as the population becomes skewed, the mean is much more affected.

Trimmed mean It may be difficult at first to get used to the idea of trimming your data. One way to start getting over it is to ask yourself whether you ever had a problem with the median as a measure of location. The gains in using the trimmed mean (accuracy of inference, resistance, efficiency) have been shown to offset the loss of sufficiency. Also consider: when conducting analyses, do you qualify your inferences as generalizing to outliers specifically, or do you think of them as applying to the groups in general?

M-estimators M-estimators are another robust measure of location. They involve the notion of a 'loss function':
 If we minimize squared errors from a particular value, the resulting measure of central tendency is the mean
 If we minimize absolute errors, the result is the median
M-estimators are more mathematically complex, but the gist is that less weight is given to values that are further from the 'center', and different M-estimators give different weights to deviating values.

M-estimators A Wilcox example in more detail, to show the 'gist' of the calculation. Data = 3, 4, 8, 16, 24, 53. We start with a measure of 'outlierness', the absolute deviation from the median divided by MADN, where:
 M = the median
 MAD = the median absolute deviation: order the absolute deviations from the median, then take the median of those deviations
 MADN = MAD/.6745; dividing by .6745 makes this measure of spread estimate the population standard deviation under normality
So basically it is the old 'Z score > x' approach, just made resistant to outliers.

M-estimators Median = 12. The absolute deviations from the median are 9, 8, 4, 4, 12, 41; their median, the MAD, is 8.5, and 8.5/.6745 = 12.6 (the MADN). So if the absolute deviation from the median divided by 12.6 is greater than 1.28, we will call the value an outlier. In this case the value 53 is an outlier: (53 - 12)/12.6 = 3.25.
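The calculation above, sketched with the slide's data:

```python
import statistics

data = [3, 4, 8, 16, 24, 53]
M = statistics.median(data)                          # 12
MAD = statistics.median([abs(x - M) for x in data])  # 8.5
MADN = MAD / 0.6745                                  # about 12.6

# Flag values whose resistant 'z score' exceeds 1.28.
outliers = [x for x in data if abs(x - M) / MADN > 1.28]
print(M, MAD, round(MADN, 1), outliers)
```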

M-estimators
 L = the number of outliers less than the median (for our data, none qualify)
 U = the number of outliers greater than the median (for our data, 1 value is an upper outlier)
 B = the sum of the values that are not outliers
The one-step estimate is then (1.28 × MADN × (U − L) + B) / (n − L − U). Notice that if there are no outliers, this defaults to the mean.
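A sketch of the full one-step calculation, assuming Wilcox's one-step M-estimator with Huber's constant K = 1.28 (the formula is reconstructed from that formulation, since the slide's equation did not survive):

```python
import statistics

def one_step_m(xs, K=1.28):
    """One-step M-estimator of location (Huber's K = 1.28), per Wilcox."""
    M = statistics.median(xs)
    MADN = statistics.median([abs(x - M) for x in xs]) / 0.6745
    L = sum(1 for x in xs if (M - x) / MADN > K)        # outliers below the median
    U = sum(1 for x in xs if (x - M) / MADN > K)        # outliers above the median
    B = sum(x for x in xs if abs(x - M) / MADN <= K)    # sum of non-outliers
    return (K * MADN * (U - L) + B) / (len(xs) - L - U)

data = [3, 4, 8, 16, 24, 53]
print(round(one_step_m(data), 2))   # pulled only a little above the median of 12
```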

M-estimators They perform pretty much the same as the trimmed mean whether the data are normal or non-normal. However, either might be more accurate in some situations, so it is best to compare both. And as with the trimmed mean, M-estimators will outperform the mean when there are outliers.

Inferential use of robust statistics In general, the approach will be the same using robust statistics as we have with regular ones as far as hypothesis testing and interval estimation Of particular concern will be estimating the standard error and the relative merits of the robust approach

The Trimmed Mean Consider the typical t statistic for the one-sample case, t = (x̄ − μ)/(s/√n). The same form holds using a trimmed mean, except that we use the values remaining after trimming, and our inference concerns the population trimmed mean rather than the population mean.

The Trimmed Mean The problem is calculating the standard error. Because trimming is based on the ordered observations, the remaining values are no longer independent, which introduces bias into the usual standard error calculation (our 'area under the curve' would no longer equal 1). The way around this problem is to winsorize, rather than trim, when calculating the standard error.

Example: Winsorized Mean Make some percentage of the most extreme values the same as the nearest non-extreme value. Think of the 20% winsorized mean as affecting the same number of values as 20% trimming, except that the sample size stays at n.
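Winsorizing can be sketched as follows; the ten-value sample is hypothetical:

```python
def winsorized_mean(xs, prop=0.20):
    """Pull the g most extreme values in each tail in to the nearest kept value."""
    xs = sorted(xs)
    g = int(prop * len(xs))
    core = xs[g:len(xs) - g]
    wins = [core[0]] * g + core + [core[-1]] * g   # n stays the same
    return sum(wins) / len(wins)

data = [1, 3, 4, 5, 6, 7, 8, 9, 10, 120]   # hypothetical sample
print(winsorized_mean(data))
```

Unlike trimming, the winsorized sample keeps all n positions filled, which is what makes it usable for the standard error calculation.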

Trimmed mean So what we do is winsorize the data to calculate the standard error, and this solves our technical issues. We then calculate CIs in the same fashion as always, just in the trimmed setting. To determine the df, and thus the critical value, subtract 2 × (number of values trimmed) from the regular n − 1 degrees of freedom.
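Putting the pieces together, a CI for the trimmed mean can be sketched as below. The standard-error form s_w / ((1 − 2γ)√n) is assumed from the Tukey–McLaughlin approach, the data are hypothetical, and the t critical value (2.571 for the df of 5 that this example produces) is taken from standard tables:

```python
import statistics

def trimmed_mean_ci(xs, prop=0.20, t_crit=2.571):
    """CI for the 20% trimmed mean; t_crit is the tabled value for df = n - 2g - 1."""
    xs = sorted(xs)
    n = len(xs)
    g = int(prop * n)
    core = xs[g:n - g]
    wins = [core[0]] * g + core + [core[-1]] * g   # winsorized copy of the data
    tm = sum(core) / len(core)
    se = statistics.stdev(wins) / ((1 - 2 * prop) * n ** 0.5)  # winsorized-SD-based SE
    return tm - t_crit * se, tm + t_crit * se

data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 55]   # hypothetical: n = 10, g = 2, df = 5
lo, hi = trimmed_mean_ci(data)
print(round(lo, 2), round(hi, 2))
```

Note how the interval stays centered near the bulk of the data even though the raw mean would be dragged upward by the value 55.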

Trimmed means The two sample case Here again, the concept remains the same

Trimmed means Calculating the variance term for the group 1 trimmed mean: d1 = (n1 − 1)·s²w1 / (h1·(h1 − 1)), where h refers to the n remaining after trimming and s²w1 is the group's winsorized variance. Do the same for group 2. Note that this formulation also works for unequal sample sizes.
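The two-sample statistic can then be sketched as below. The variance term d = (n − 1)·s²w / (h·(h − 1)) is assumed from Yuen's formulation of the trimmed-means test, and the two groups are hypothetical:

```python
import statistics

def yuen_parts(xs, prop=0.20):
    """Trimmed mean and its squared-SE term d = (n-1) * s_w^2 / (h * (h-1))."""
    xs = sorted(xs)
    n = len(xs)
    g = int(prop * n)
    core = xs[g:n - g]
    h = len(core)                       # n remaining after trimming
    wins = [core[0]] * g + core + [core[-1]] * g
    d = (n - 1) * statistics.variance(wins) / (h * (h - 1))
    return sum(core) / h, d

g1 = [5, 6, 7, 8, 9, 10, 11, 12, 13, 60]   # hypothetical groups
g2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tm1, d1 = yuen_parts(g1)
tm2, d2 = yuen_parts(g2)
t_yuen = (tm1 - tm2) / (d1 + d2) ** 0.5    # compared to t with estimated df
print(round(t_yuen, 2))
```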

Trimmed means While these approaches work well for normally distributed data, as mentioned, the typical t approach is unsatisfactory with non-normal data. In that case, use the bootstrap approaches described previously.

M-estimators We can make inferences using M-estimators as well, but while we would like to proceed as with the trimmed mean, it really can't be done well outside of a bootstrap approach. In particular, use the percentile bootstrap (as opposed to the percentile t) and do your hypothesis testing with confidence intervals.
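The percentile bootstrap can be sketched as below. For brevity the median stands in for an M-estimator (any robust location estimate can be dropped into the loop); the data come from the earlier slide, and the number of bootstrap samples B is an illustrative choice:

```python
import random
import statistics

random.seed(7)
data = [3, 4, 8, 16, 24, 53]
B = 2_000

# Resample with replacement, recompute the location estimate each time,
# and read the CI straight off the percentiles of the bootstrap distribution.
boot = sorted(statistics.median(random.choices(data, k=len(data)))
              for _ in range(B))
lo, hi = boot[int(0.025 * B)], boot[int(0.975 * B)]
print(lo, hi)
```

If the null-hypothesized value falls outside (lo, hi), reject at the corresponding alpha level; no standard-error formula is needed.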

Effect size Since Cohen's d is a sample statistic, use the robust analogs in the trimmed case: calculate Cohen's d from the trimmed means and the winsorized variance/SD. With M-estimators, the same general approach applies.

Summary Given the issues regarding means, variances, and inferences based on them, a robust approach is appropriate and preferred when dealing with outliers/non-normal data:
 Increased power
 More accurate assessment of group tendencies and differences
 More accurate assessment of effect size
If we want the best estimates and the best inference in not-so-normal situations, standard methods simply don't work well. We now have the methods and the computing capacity to take a more robust approach to data analysis, and should not be afraid to use them.