WINSORIZING: What is it and why could it be inappropriate? Kyle Allen & Matthew Whitledge, May 7, 2013.

Presentation transcript:


WHAT IS WINSORIZING?
• What it isn’t:
  • Trimming
  • Truncating
  • Any other method that completely removes observations from the data
• Term first used in 1960
  • John W. Tukey; W. J. Dixon
• “Numerical value of a wild observation is untrustworthy”
  • However, its direction of deviation is important
• Decreasing the magnitude of the deviation while retaining its direction
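The contrast between trimming and winsorizing can be sketched in a few lines. This is only an illustration on made-up numbers: the function names `trim` and `winsorize`, the parameter `k` (how many observations to treat in each tail), and the sample values are assumptions for this sketch, not taken from the slides.

```python
def trim(values, k):
    """Trimming: drop the k smallest and k largest observations."""
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]


def winsorize(values, k):
    """Winsorizing: pull the k smallest and k largest observations in to
    the nearest retained value, so the sample size is unchanged."""
    ordered = sorted(values)
    low = ordered[k]                      # smallest value that is kept as-is
    high = ordered[len(ordered) - k - 1]  # largest value that is kept as-is
    return [min(max(v, low), high) for v in ordered]


sample = [2, 5, 6, 7, 8, 9, 40]    # hypothetical data with one wild value
trimmed = trim(sample, 1)          # 5 observations remain
winsorized = winsorize(sample, 1)  # still 7 observations
print(trimmed)
print(winsorized)
```

The wild value 40 stays on the high side of the sample after winsorizing (its direction is retained), but its magnitude is reduced to that of its nearest neighbour.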

WINSORIZING: AN EXAMPLE
• Order the observations by value
  • X_i1, X_i2, …, X_i100, where i denotes the i-th regressor
• If winsorizing at 1% and 99%, then
  • The value for X_i1 will be replaced by the value for X_i2
  • The value for X_i100 will be replaced by the value for X_i99
Another example:
• X_i1, X_i2, …, X_i100
• Winsorize at 10% (5% from the bottom and 5% from the top)
• Beginning sample:
  • X_i1, X_i2, X_i3, X_i4, X_i5, X_i6, …, X_i95, X_i96, X_i97, X_i98, X_i99, X_i100
• Winsorized sample:
  • X_i5, X_i5, X_i5, X_i5, X_i5, X_i6, …, X_i95, X_i96, X_i96, X_i96, X_i96, X_i96
[Table: Winsorized at 5% and 95%. Columns: Obs., Original, Winsorized; rows X_i1 through X_i100.]
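Both replacement patterns on this slide can be reproduced by clamping the data to two order statistics. The sketch below is one possible way to write that. The function name `winsorize_at_ranks` and its 1-based rank parameters are assumptions made for illustration, and the integers 1 to 100 merely stand in for X_i1 through X_i100.

```python
def winsorize_at_ranks(values, low_rank, high_rank):
    """Clamp every value to the range spanned by the low_rank-th and
    high_rank-th smallest observations (ranks are 1-based).
    The original ordering of the observations is preserved."""
    ordered = sorted(values)
    if not 1 <= low_rank <= high_rank <= len(ordered):
        raise ValueError("need 1 <= low_rank <= high_rank <= n")
    low = ordered[low_rank - 1]
    high = ordered[high_rank - 1]
    return [min(max(v, low), high) for v in values]


data = list(range(1, 101))  # stand-in for X_i1 ... X_i100, already ordered

# First example on the slide: X_i1 -> X_i2 and X_i100 -> X_i99
first = winsorize_at_ranks(data, 2, 99)

# Second example: everything below X_i5 -> X_i5, everything above X_i96 -> X_i96
second = winsorize_at_ranks(data, 5, 96)

print(first[:3], first[-3:])
print(second[:6], second[-6:])
```

Passing ranks rather than percentages sidesteps the question of how a percentage maps to a cut-off observation, a detail on which software conventions differ.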

WINSORIZING: ALTERNATIVES
• Are the observations really outliers?
  • Look at Cook’s D measure
• Transform the variables
  • Take the log or square root of the variable
  • This shouldn’t be done only to increase significance
• Median-based estimation
  • Quantile regression
  • Median absolute deviation
• Nonparametric methods
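As an illustration of the median absolute deviation (MAD) item, the sketch below flags observations using a modified z-score built from the median and the MAD. The 0.6745 scaling constant and the 3.5 cut-off are a commonly used rule of thumb rather than anything specified in the slides, and the function name and data are hypothetical.

```python
import statistics


def mad_outlier_flags(values, threshold=3.5):
    """Return True for each value whose modified z-score exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        raise ValueError("MAD is zero; modified z-scores are undefined")
    # 0.6745 rescales the MAD so that it is comparable to a standard
    # deviation when the data are normally distributed.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


days_lost = [0, 1, 2, 2, 3, 3, 4, 5, 27]  # hypothetical values
flags = mad_outlier_flags(days_lost)
print([v for v, flagged in zip(days_lost, flags) if flagged])
```

Because the median and MAD are barely moved by a single extreme value, this check answers "is it really an outlier?" without the outlier distorting its own yardstick.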

WINSORIZING: A SAS EXAMPLE
Lift Index Data
• Workers perform lifting tasks
• Each lift has an amount of stress associated with it
• Measuring the number of days an employee missed based on the lift they were performing
• 206 observations

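WINSORIZING: SAS CODE

    /* Plot both days-lost variables (dayslost, dayslost1) against alr */
    proc sgplot data=isqsdata.lilesmerge;
      scatter y=dayslost x=alr;
      scatter y=dayslost1 x=alr;
    run;

    /* Winsorize by hand: set days lost to 27 for subjects 6 and 35 */
    data isqsdata.lileswin;
      set isqsdata.lileswin;
      if subject = 6 then dayslost = 27;
      if subject = 35 then dayslost = 27;
    run;

    /* Censored regression (lower bound 0) on the original data */
    proc qlim data=isqsdata.liles;
      model dayslost = alr;
      endogenous dayslost ~ censored(lb=0);
    run;

    /* Same censored regression on the winsorized data set */
    proc qlim data=isqsdata.lileswin;
      model dayslost1 = alr;
      endogenous dayslost1 ~ censored(lb=0);
    run;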

WINSORIZING: LOOK AT YOUR DATA

PROC QLIM (NON-WINSORIZED)

PROC QLIM (WINSORIZED)

WINSORIZING: IMPLICATIONS
• May impact significance
• The standard errors will decrease
• Depending on how symmetrical the data are, the mean may increase or decrease
  • For example, winsorizing an extremely positive outlier will decrease the mean
• Significance will be determined by the proportionate change in the estimated coefficient relative to the change in the standard error
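The effect on the mean and its standard error can be checked on a small made-up sample. The sketch below uses the standard error of a sample mean as a simple stand-in for the regression standard errors discussed on the slide. The data and the helper name `mean_and_se` are assumptions for illustration.

```python
import math
import statistics


def mean_and_se(values):
    """Sample mean and its standard error (sample std. dev. / sqrt(n))."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


original = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # one extreme positive value
winsorized = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]  # 100 pulled in to the next value

mean_o, se_o = mean_and_se(original)
mean_w, se_w = mean_and_se(winsorized)
print(mean_o, se_o)
print(mean_w, se_w)
```

Here the mean falls from 14.5 to 5.4 and the standard error shrinks as well. Whether a test statistic ends up larger or smaller depends on which of the two changes proportionately more.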

WINSORIZING: WHY COULD IT BE INAPPROPRIATE?
• May be appropriate for
  • Ratios
    • Book to Market
    • Other measures in which the denominator can be extremely small
• Never winsorize valid observations
  • Investment returns
  • R&D expenditures
• Truly exceptional observations
  • Large number of biological elements
  • Extremely low stress tolerances for mechanical implements
• Model should produce data we could actually see

WINSORIZING: BIBLIOGRAPHY
• Brillinger, David R. “John W. Tukey: His Life and Professional Contributions.” The Annals of Statistics 30 (2002).
• Dixon, W. J. “Simplified Estimation from Censored Normal Samples.” The Annals of Mathematical Statistics 31 (1960).
• Kafadar, Karen. “John Tukey and Robustness.” Proceedings of the Annual Meeting of the American Statistical Association.
• Kruskal, William, Thomas Ferguson, John W. Tukey, E. J. Gumbel, and F. J. Anscombe. “Discussion of the Papers of Messrs. Anscombe and Daniel.” Technometrics 2 (1960).
• Tukey, John W. and Donald H. McLaughlin. “Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1.” The Indian Journal of Statistics 25 (1963).
• Westfall, Peter H. and Kevin S. S. Henning. Understanding Advanced Statistical Methods. Boca Raton, FL: CRC Publishing.