WINSORIZING: What is it and why could it be inappropriate?
Kyle Allen & Matthew Whitledge
May 7, 2013
WHAT IS WINSORIZING?
- What it isn’t:
  - Trimming
  - Truncating
  - Any other method that completely removes observations from the data
- Term first used in 1960 (John W. Tukey; W. J. Dixon)
- “Numerical value of a wild observation is untrustworthy”
- However, its direction of deviation is important
- Winsorizing decreases the magnitude of the deviation while retaining its direction
WINSORIZING: AN EXAMPLE
- Order the observations by value: X_i1, X_i2, …, X_i100, where i denotes the i-th regressor
- If winsorizing at 1% and 99%, then:
  - The value for X_i1 will be replaced by the value for X_i2
  - The value for X_i100 will be replaced by the value for X_i99
- Another example: X_i1, X_i2, …, X_i100, winsorized at 10% (5% from the bottom and 5% from the top)
  - Beginning sample: X_i1, X_i2, X_i3, X_i4, X_i5, X_i6, …, X_i95, X_i96, X_i97, X_i98, X_i99, X_i100
  - Winsorized sample: X_i5, X_i5, X_i5, X_i5, X_i5, X_i6, …, X_i95, X_i96, X_i96, X_i96, X_i96, X_i96
- [Table: observations X_i1 … X_i100 with columns Obs., Original, Winsorized; winsorized at 5% and 95%. Cell values not recoverable from the transcript.]
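The replacement rule in the first example (X_i1 takes the value of X_i2, X_i100 takes the value of X_i99) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `winsorize` and its `lower`/`upper` parameters are hypothetical, and it assumes the k most extreme values on each side are replaced by the next value inward, with k = floor(fraction × n). Conventions differ on exactly which order statistic becomes the bound, so other software may give slightly different cutoffs.

```python
def winsorize(values, lower=0.01, upper=0.01):
    """Return a winsorized copy of `values`, keeping the original order.

    Assumption: the k smallest values are replaced by the (k+1)-th smallest
    and the m largest by the (m+1)-th largest, where k = floor(lower * n)
    and m = floor(upper * n). No observation is removed.
    """
    n = len(values)
    if n == 0:
        return []
    # Small tolerance guards against floating-point results such as 6.999999.
    k = int(lower * n + 1e-9)
    m = int(upper * n + 1e-9)
    if k + m >= n:
        raise ValueError("winsorizing fractions leave no untouched observations")
    ordered = sorted(values)
    low_bound = ordered[k]            # (k+1)-th smallest value
    high_bound = ordered[n - 1 - m]   # (m+1)-th largest value
    # Pull extreme values in to the bounds; direction of deviation is kept.
    return [min(max(v, low_bound), high_bound) for v in values]


sample = list(range(1, 101))          # stands in for X_i1 ... X_i100
w = winsorize(sample, 0.01, 0.01)     # 1% and 99%
print(w[:3], w[-3:])                  # [2, 2, 3] [98, 99, 99]
```

Unlike trimming, the output has the same length as the input: the extreme observations stay in the sample with less extreme values.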
WINSORIZING: ALTERNATIVES
- Are the observations really outliers?
  - Look at Cook’s D measure
- Transform the variables
  - Take the log or square root of the variable
  - This shouldn’t be done only to increase significance
- Median-based estimation
  - Quantile regression
  - Median absolute deviation
- Nonparametric methods
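As one way to check whether an observation really is influential before altering it, Cook’s D can be computed directly for a simple regression. The Python sketch below is illustrative only: the function name `cooks_distance` is hypothetical, the data are made up, and it assumes a single regressor with an intercept, so the leverage has the closed form 1/n + (x − x̄)² / Sxx.

```python
def cooks_distance(x, y):
    """Cook's D for each observation in the simple regression y = a + b*x."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        distances.append(e ** 2 / (p * mse) * leverage / (1 - leverage) ** 2)
    return distances


# Made-up data: roughly y = 2x, except the last point.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.8, 16.1, 18.0, 40.0]
for xi, d in zip(x, cooks_distance(x, y)):
    print(xi, round(d, 3))
```

A common rule of thumb treats values above 1 (or above 4/n) as worth a closer look. A large value signals influence on the fit, not that the observation is invalid.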
WINSORIZING: A SAS EXAMPLE
- Lift Index data
  - Workers perform lifting tasks
  - Each lift has an amount of stress associated with it
  - Measures the number of days an employee missed, based on the lift they were performing
- 206 observations
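WINSORIZING: SAS CODE

/* Plot days lost against alr for both series.
   dayslost1 is assumed to hold the winsorized values. */
proc sgplot data=isqsdata.lilesmerge;
  scatter y=dayslost x=alr;
  scatter y=dayslost1 x=alr;
run;

/* Winsorize two observations (subjects 6 and 35) by setting days lost to 27 */
data isqsdata.lileswin;
  set isqsdata.lileswin;
  if subject = 6 then dayslost = 27;
  if subject = 35 then dayslost = 27;
run;

/* Censored regression (lower bound 0) on the original data */
proc qlim data=isqsdata.liles;
  model dayslost = alr;
  endogenous dayslost ~ censored(lb=0);
run;

/* Same model on the winsorized data */
proc qlim data=isqsdata.lileswin;
  model dayslost1 = alr;
  endogenous dayslost1 ~ censored(lb=0);
run;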
WINSORIZING: LOOK AT YOUR DATA
PROC QLIM (NON-WINSORIZED)
PROC QLIM (WINSORIZED)
WINSORIZING: IMPLICATIONS
- May impact significance
  - The standard errors will generally decrease
  - Depending on how symmetrical the data is, the mean may increase or decrease
    - For example, if there is an extremely positive outlier, winsorizing it will decrease the mean
  - The significance will be determined by the proportionate change in the estimated coefficient, relative to the change in the standard error
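The effect on the mean and its standard error can be seen on a small made-up sample with one extreme positive value. The Python sketch below is illustrative only: the helper names `mean_and_se` and `winsorize_one_each_side` are hypothetical, and it replaces just the single smallest and single largest value with their nearest neighbors.

```python
import math
import statistics


def mean_and_se(values):
    """Sample mean and its standard error, s / sqrt(n)."""
    n = len(values)
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


def winsorize_one_each_side(values):
    """Replace the smallest value by the 2nd smallest and the largest by the 2nd largest."""
    ordered = sorted(values)
    low, high = ordered[1], ordered[-2]
    return [min(max(v, low), high) for v in values]


data = [4, 5, 5, 6, 6, 7, 7, 8, 9, 60]   # made-up; 60 is the extreme positive value
wins = winsorize_one_each_side(data)

mean_o, se_o = mean_and_se(data)
mean_w, se_w = mean_and_se(wins)
print("original:   mean", round(mean_o, 2), "SE", round(se_o, 3), "ratio", round(mean_o / se_o, 2))
print("winsorized: mean", round(mean_w, 2), "SE", round(se_w, 3), "ratio", round(mean_w / se_w, 2))
```

In this sample both the mean and the standard error fall, but the standard error falls proportionately more, so the mean-to-SE ratio rises. With other data the balance could go the other way, which is why the proportionate changes matter.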
WINSORIZING: WHY COULD IT BE INAPPROPRIATE?
- May be appropriate for:
  - Ratios
    - Book to market
    - Other measures in which the denominator can be extremely small
- Never winsorize valid observations
  - Investment returns
  - R&D expenditures
- Truly exceptional observations
  - Large number of biological elements
  - Extremely low stress tolerances for mechanical implements
- Model should produce data we could actually see
WINSORIZING: BIBLIOGRAPHY
- Brillinger, David R. “John W. Tukey: His Life and Professional Contributions.” The Annals of Statistics. 30(2002):
- Dixon, W. J. “Simplified Estimation from Censored Normal Samples.” The Annals of Mathematical Statistics. 31(1960):
- Kafadar, Karen. “John Tukey and Robustness.” Proceedings of the Annual Meeting of the American Statistical Association
- Kruskal, William, Thomas Ferguson, John W. Tukey, E. J. Gumbel, and F. J. Anscombe. “Discussion of the Papers of Messrs. Anscombe and Daniel.” Technometrics. 2(1960):
- Tukey, John W. and Donald H. McLaughlin. “Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1.” The Indian Journal of Statistics. 25(1963):
- Westfall, Peter H. and Kevin S. S. Henning. Understanding Advanced Statistical Methods. Boca Raton, FL: CRC Publishing,