Presentation is loading. Please wait.

Presentation is loading. Please wait.

Topic 9: Remedies.

Similar presentations

Presentation on theme: "Topic 9: Remedies."— Presentation transcript:

1 Topic 9: Remedies

2 Outline Review diagnostics for residuals Discuss remedies
Nonlinear relationship Nonconstant variance Non-Normal distribution Outliers

3 Diagnostics for residuals
Look at residuals to find serious violations of the model assumptions nonlinear relationship nonconstant variance non-Normal errors presence of outliers a strongly skewed distribution

4 Recommendations for checking assumptions
Plot Y vs X (is it a linear relationship?) Look at distribution of residuals Plot residuals vs X, time, or any other potential explanatory variable Use the i=sm## in symbol statement to get smoothed curves

5 Plots of Residuals Plot residuals vs Look for nonrandom patterns
Time (order) X or predicted value (b0+b1X) Look for nonrandom patterns outliers (unusual observations)

6 Residuals vs Order Pattern in plot suggests dependent errors / lack of indep Pattern usually a linear or quadratic trend and/or cyclical If you are interested read KNNL pgs

7 Residuals vs X Can look for nonconstant variance
nonlinear relationship outliers somewhat address Normality of residuals

8 Tests for Normality H0: data are an i.i.d. sample from a Normal population Ha: data are not an i.i.d. sample from a Normal population KNNL (p 115) suggest a correlation test that requires a table look-up

9 Tests for Normality We have several choices for a significance testing procedure Proc univariate with the normal option provides four proc univariate normal; Shapiro-Wilk is a common choice

10 All P-values > 0.05…Do not reject H0
Tests for Normality Test Statistic p Value Shapiro-Wilk W Pr < W 0.8626 Kolmogorov-Smirnov D Pr > D >0.1500 Cramer-von Mises W-Sq Pr > W-Sq >0.2500 Anderson-Darling A-Sq Pr > A-Sq All P-values > 0.05…Do not reject H0

11 Other tests for model assumptions
Durbin-Watson test for serially correlated errors (KNNL p 114) Modified Levene test for homogeneity of variance (KNNL p ) Breusch-Pagan test for homogeneity of variance (KNNL p 118) For SAS commands see

12 Plots vs significance test
Plots are more likely to suggest a remedy Significance tests results are very dependent on the sample size; with sufficiently large samples we can reject most null hypotheses

13 Default graphics with SAS 9.3
proc reg data=toluca; model hours=lotsize; id lotsize; run;

14 Will discuss these diagnostics more in multiple regression
Provides rule of thumb limits Questionable observation (30,273)

15 Additional summaries Rstudent: Studentized residual…almost all should be between ± 2 Leverage: “Distance” of X from center…helps determine outlying X values in multivariable setting…outlying X values may be influential Cooks’D: Influence of ith case on all predicted values



18 Lack of fit When we have repeat observations at different values of X, we can do a significance test for nonlinearity Browse through KNNL Section 3.7 Details of approach discussed when we get to KNNL 17.9, p 762 Basic idea is to compare two models Gplot with a smooth is a better (i.e., simpler) approach

19 SAS code and output proc reg data=toluca;
model hours=lotsize / lackfit; run; Analysis of Variance Source DF Sum of Squares Mean Square F Value Pr > F Model 1 252378 105.88 <.0001 Error 23 54825 Lack of Fit 9 17245 0.71 0.6893 Pure Error 14 37581 Corrected Total 24 307203

20 Nonlinear relationships
We can model many nonlinear relationships with linear models, some have several explanatory variables (i.e., multiple linear regression) Y = β0 + β1X + β2X2 + e (quadratic) Y = β0 + β1log(X) + e

21 Nonlinear Relationships
Sometimes can transform a nonlinear equation into a linear equation Consider Y = β0exp(β1X) + e Can form linear model using log log(Y) = log(β0) + β1X + log(e) Note that we have changed our assumption about the error

22 Nonlinear Relationship
We can perform a nonlinear regression analysis KNNL Chapter 13 SAS PROC NLIN

23 Nonconstant variance Sometimes we model the way in which the error variance changes may be linearly related to X We can then use a weighted analysis KNNL 11.1 Use a weight statement in PROC REG

24 Non-Normal errors Transformations often help
Use a procedure that allows different distributions for the error term SAS PROC GENMOD

25 Generalized Linear Model
Possible distributions of Y: Binomial (Y/N or percentage data) Poisson (Count data) Gamma (exponential) Inverse gaussian Negative binomial Multinomial Specify a link function for E(Y)

26 Ladder of Reexpression (transformations)
1.5 p Transformation is xp 1.0 0.5 0.0 -0.5 -1.0

27 Circle of Transformations
X down, Y up X up, Y up Y X X up, Y down X down, Y down

28 Box-Cox Transformations
Also called power transformations These transformations adjust for non-Normality and nonconstant variance Y´ = Y or Y´ = (Y - 1)/ In the second form, the limit as  approaches zero is the (natural) log

29 Important Special Cases
 = 1, Y´ = Y1, no transformation  = .5, Y´ = Y1/2, square root  = -.5, Y´ = Y-1/2, one over square root  = -1, Y´ = Y-1 = 1/Y, inverse  = 0, Y´ = (natural) log of Y

30 Box-Cox Details We can estimate  by including it as a parameter in a non-linear model Y = β0 + β1X + e and using the method of maximum likelihood Details are in KNNL p SAS code is in

31 Box-Cox Solution Standardized transformed Y is K1(Y - 1) if  ≠ 0
K2log(Y) if  = 0 where K2 = ( Yi)1/n (the geometric mean) and K1 = 1/ ( K2 -1) Run regressions with X as explanatory variable estimated  minimizes SSE

32 Example data a1; input age plasma @@; cards;
4 6.23 ;


34 Box Cox Procedure *Procedure that will automatically find the Box-Cox transformation; proc transreg data=a1; model boxcox(plasma)=identity(age); run;

35 Lambda R-Square Log Like -2.50 0.76 -17.0444 -2.00 0.80 -12.3665
Transformation Information for BoxCox(plasma) Lambda R-Square Log Like * < * < - Best Lambda * - Confidence Interval + - Convenient Lambda

36 *The first part of the program gets the geometric mean;
data a2; set a1; lplasma=log(plasma); proc univariate data=a2 noprint; var lplasma; output out=a3 mean=meanl;

37 data a4; set a2; if _n_ eq 1 then set a3; keep age yl l; k2=exp(meanl); do l = -1.0 to 1.0 by .1; k1=1/(l*k2**(l-1)); yl=k1*(plasma**l -1); if abs(l) < 1E-8 then yl=k2*log(plasma); output; end;

38 proc sort data=a4 out=a4;
by l; proc reg data=a4 noprint outest=a5; model yl=age; data a5; set a5; n=25; p=2; sse=(n-p)*(_rmse_)**2; proc print data=a5; var l sse;

39 Obs l sse

40 symbol1 v=none i=join; proc gplot data=a5; plot sse*l; run;


42 data a1; set a1; tplasma = plasma**(-.5); tage = (age+.5)**(-.5); symbol1 v=circle i=sm50; proc gplot; plot tplasma*age; proc sort; by tage; proc gplot; plot tplasma*tage; run;



45 Background Reading Sections describe significance tests for assumptions (read it if you are interested). Box-Cox transformation is in Read sections 4.1, 4.2, 4.4, 4.5, and 4.6

Download ppt "Topic 9: Remedies."

Similar presentations

Ads by Google