Lecture 25 Multiple Regression Diagnostics (Sections )


1 Lecture 25 Multiple Regression Diagnostics (Sections 19.4-19.5)
Polynomial Models (Section 20.2)

2 19.4 Regression Diagnostics - II
The conditions required for the model assessment to apply must be checked.
Is the error variable normally distributed? Draw a histogram of the residuals.
Is the regression function correctly specified as a linear function of x1,…,xk? Plot the residuals versus x and ŷ.
Is the error variance constant? Plot the residuals versus ŷ.
Are the errors independent? Plot the residuals versus the time periods.
Can we identify outliers and influential observations?
Is multicollinearity a problem?
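The quantities behind these residual checks can be computed in a few lines. The following is a minimal sketch, assuming NumPy is available; the function name `fit_and_residuals` and the synthetic data are illustrative assumptions, not part of the lecture, which uses JMP.

```python
import numpy as np

def fit_and_residuals(X, y):
    """Least-squares fit with an intercept; returns (coefficients, fitted, residuals)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return coef, fitted, y - fitted

# Synthetic data (an assumption for illustration, not from the lecture).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

coef, fitted, resid = fit_and_residuals(X, y)

# Histogram counts of the residuals (input for the normality check).
counts, edges = np.histogram(resid, bins=8)

# Pairs to plot: residuals versus fitted values and versus observation order.
resid_vs_fitted = np.column_stack([fitted, resid])
resid_vs_time = np.column_stack([np.arange(len(resid)), resid])
```

Any plotting library could then draw `resid_vs_fitted` and `resid_vs_time` as scatter plots; that step is left out to keep the sketch self-contained.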

3 Effects of Violated Assumptions
Curvature: slopes are no longer meaningful. (Potential remedy: transformations of responses and predictors.) Violations of other assumptions: tests, p-values, and CIs are no longer accurate. That is, inference is invalidated. (Remedies may be difficult.)

4 Influential Observation
Influential observation: An observation is influential if removing it would markedly change the results of the analysis. To be influential, a point must either (i) be an outlier in terms of the relationship between its y and x’s, or (ii) have unusually distant x’s (high leverage) and not fall exactly into the relationship between y and x’s that the rest of the data follows.

5 Simple Linear Regression Example
Data in salary.jmp. Y=Weekly Salary, X=Years of Experience.

6 Identification of Influential Observations
Cook’s distance is a measure of the influence of a point – the effect that omitting the observation has on the estimated regression coefficients. Use Save Columns, Cook’s D Influence to obtain Cook’s Distance. Plot Cook’s Distances: Graph, Overlay Plot, put Cook’s D Influence in Y and leave X blank (plots Cook’

7 Cook’s Distance
Rule of thumb: An observation with Cook’s Distance (Di) > 1 has high influence. You should also be concerned about any observation that has Di < 1 but has a much bigger Di than any other observation. Ex. 19.2:
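Separately from the textbook example, here is a minimal sketch of how Cook's distance could be computed outside JMP for a simple linear regression. It assumes the standard formula Di = (ei^2 / (p s^2)) * hi / (1 - hi)^2 with p = 2 estimated coefficients; the function name and the data are made up for illustration.

```python
def cooks_distance(x, y):
    """Cook's distance for each point in a simple linear regression y = b0 + b1*x + e."""
    n = len(x)
    p = 2  # number of estimated coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # mean squared error
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * s2)) * h / (1 - h) ** 2
        for e, h in zip(resid, leverage)
    ]

# Made-up data: the last point has a distant x and falls off the trend of the rest.
x = [1, 2, 3, 4, 5, 15]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
d = cooks_distance(x, y)

# Apply the rule of thumb: flag observations with Di > 1.
flagged = [i for i, di in enumerate(d) if di > 1]
```

In this made-up data only the last observation is flagged, because it combines high leverage with a nonzero residual.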

8 Strategy for dealing with influential observations/outliers
Do the conclusions change when the observation is deleted?
If no: Proceed with the observation included. Study the observation to see if anything can be learned.
If yes: Is there reason to believe the case belongs to a population other than the one under investigation?
  If yes: Omit the case and proceed.
  If no: Does the case have unusually “distant” independent variables?
    If yes: Omit the case and proceed. Report conclusions for the reduced range of explanatory variables.
    If no: Not much can be said. More data are needed to resolve the questions.

9 Multicollinearity
Multicollinearity: Condition in which independent variables are highly correlated. Exact collinearity: Y=Weight, X1=Height in inches, X2=Height in feet. Because X1 = 12 X2, different combinations of coefficients on X1 and X2 provide the same predictions. Multicollinearity causes two kinds of difficulties: The t statistics appear to be too small. The b coefficients cannot be interpreted as “slopes”.
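The exact-collinearity case can be illustrated with a small sketch. The coefficient values, heights, and function name below are hypothetical; the point is only that height in inches is always 12 times height in feet, so several coefficient sets give identical predicted weights.

```python
import math

def predict(b0, b1, b2, inches):
    """Predicted weight from height in inches (X1) and the same height in feet (X2)."""
    feet = inches / 12
    return b0 + b1 * inches + b2 * feet

heights = [60, 66, 72]

# Three different (hypothetical) coefficient sets (b0, b1, b2).
fits = [(10, 2, 0), (10, 0, 24), (10, 1, 12)]
predictions = [[predict(*b, h) for h in heights] for b in fits]

# All three rows are identical, so the data cannot distinguish among the fits.
same = all(
    math.isclose(a, b)
    for row in predictions[1:]
    for a, b in zip(predictions[0], row)
)
```

Since no data set can separate these fits, the individual coefficients on X1 and X2 have no unique value and cannot be read as slopes.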

10 Multicollinearity Diagnostics
High correlation between independent variables. Counterintuitive signs on regression coefficients. Low values for t-statistics despite a significant overall fit, as measured by the F statistic.
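One way to quantify the first diagnostic is sketched here, assuming NumPy. It computes the pairwise correlation matrix and, as an addition not mentioned in the slides, the variance inflation factor VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The variable names and synthetic data are illustrative only and are not the Xm19-02 data.

```python
import numpy as np

def vif(X):
    """Variance inflation factor 1 / (1 - R_j^2) for each column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

# Synthetic predictors: "bedrooms" is built to track "size" closely; "lot" is unrelated.
rng = np.random.default_rng(1)
size = rng.normal(size=100)
bedrooms = size + rng.normal(scale=0.1, size=100)
lot = rng.normal(size=100)
X = np.column_stack([size, bedrooms, lot])

corr = np.corrcoef(X, rowvar=False)  # pairwise correlations between predictors
factors = vif(X)                     # large values indicate multicollinearity
```

A common informal convention treats a VIF above about 10 as a warning sign; that threshold is a rule of thumb from general practice, not from this lecture.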

11 Diagnostics: Multicollinearity
Example 19.2: Predicting house price (Xm19-02) A real estate agent believes that a house’s selling price can be predicted using the house size, number of bedrooms, and lot size. A random sample of 100 houses was drawn and data recorded. Analyze the relationship among the four variables.

12 Diagnostics: Multicollinearity
The proposed model is PRICE = b0 + b1BEDROOMS + b2H-SIZE + b3LOTSIZE + e. The model is valid, but no variable is significantly related to the selling price?!

13 Diagnostics: Multicollinearity
Multicollinearity is found to be a problem. Multicollinearity causes two kinds of difficulties: The t statistics appear to be too small. The b coefficients cannot be interpreted as “slopes”.

14 Remedying Violations of the Required Conditions
Nonnormality or heteroscedasticity can be remedied using transformations on the y variable. The transformations can improve the linear relationship between the dependent variable and the independent variables. Many computer software systems allow us to make the transformations easily.

15 Reducing Nonnormality by Transformations
A brief list of transformations:
y’ = log y (for y > 0): Use when the error standard deviation s_e increases with y, or when the error distribution is positively skewed.
y’ = y^2: Use when the error variance s_e^2 is proportional to E(y), or when the error distribution is negatively skewed.
y’ = y^(1/2) (for y > 0): Use when the error variance s_e^2 is proportional to E(y).
y’ = 1/y: Use when the error variance s_e^2 increases significantly when y increases beyond some critical value.
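The four transformations listed above are simple to apply before refitting a model. This is a minimal sketch; the dictionary keys and the function name are assumptions for illustration, and the natural logarithm is assumed for "log".

```python
import math

TRANSFORMS = {
    "log": math.log,                # y' = log y, requires y > 0 (natural log assumed)
    "square": lambda y: y ** 2,     # y' = y^2
    "sqrt": math.sqrt,              # y' = y^(1/2); the slide restricts this to y > 0
    "reciprocal": lambda y: 1 / y,  # y' = 1/y, requires y != 0
}

def transform(ys, kind):
    """Apply one of the listed transformations to every response value."""
    return [TRANSFORMS[kind](y) for y in ys]

roots = transform([1, 4, 9], "sqrt")
inverses = transform([1, 2, 4], "reciprocal")
logs = transform([1.0, math.e], "log")
```

After transforming y, the regression would be refit on y’ and the residual checks repeated to see whether the violation has been reduced.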

16 Durbin - Watson Test: Are the Errors Autocorrelated?
This test detects first order autocorrelation between consecutive residuals in a time series. If autocorrelation exists, the error variables are not independent.

17 Positive First Order Autocorrelation
[Figure: residuals plotted against time.] Positive first order autocorrelation occurs when consecutive residuals tend to be similar. Then, the value of d is small (less than 2).

18 Negative First Order Autocorrelation
[Figure: residuals plotted against time.] Negative first order autocorrelation occurs when consecutive residuals tend to markedly differ. Then, the value of d is large (greater than 2).
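The behaviour of d in the two cases can be checked with a short sketch. It assumes the standard definition of the Durbin-Watson statistic, d = sum over t of (e_t - e_(t-1))^2 divided by the sum of e_t^2, which the slides use but do not write out; the residual series are made up.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Consecutive residuals tend to be similar: positive autocorrelation.
similar = [1.0, 1.2, 0.9, 1.1, -1.0, -1.2, -0.9, -1.1]

# Consecutive residuals flip sign every period: negative autocorrelation.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

d_similar = durbin_watson(similar)          # well below 2
d_alternating = durbin_watson(alternating)  # well above 2
```

The statistic always lies between 0 and 4, with values near 2 indicating little first-order autocorrelation.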

19 Durbin-Watson Test in JMP
H0: No first-order autocorrelation. H1: First-order autocorrelation. Use row diagnostics, Durbin-Watson test in JMP after fitting the model. Autocorrelation is an estimate of correlation between errors.

20 Testing the Existence of Autocorrelation, Example
Example 19.3 (Xm19-03) How does the weather affect the sales of lift tickets in a ski resort? Data on ticket sales over the past 20 years, along with the total snowfall and the average temperature during Christmas week in each year, were collected. The hypothesized model was TICKETS = b0 + b1SNOWFALL + b2TEMPERATURE + e. Regression analysis yielded the following results:

21 20.1 Introduction
Regression analysis is one of the most commonly used techniques in statistics. It is considered powerful for several reasons. It can cover a variety of mathematical models: linear relationships, non-linear relationships, and nominal independent variables. It provides efficient methods for model building.

22 Curvature: Midterm Problem 10

23 Remedy I: Transformations
Use Tukey’s Bulging Rule to choose a transformation.

24 Remedy II: Polynomial Models
Multiple regression with p predictors: y = b0 + b1x1 + b2x2 + … + bpxp + e
Polynomial model of order p in one predictor: y = b0 + b1x + b2x^2 + … + bpx^p + e
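A polynomial model is still linear in its coefficients, so it can be fitted by ordinary least squares on the columns 1, x, x^2, …, x^p. The sketch below assumes NumPy; the function name and the noise-free data are made up for illustration.

```python
import numpy as np

def fit_polynomial(x, y, p):
    """Least-squares estimates (b0, b1, ..., bp) for y = b0 + b1 x + ... + bp x^p + e."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.vander(x, N=p + 1, increasing=True)  # columns: 1, x, x^2, ..., x^p
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Noise-free quadratic with known coefficients, made up for illustration.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2

b = fit_polynomial(x, y, p=2)  # recovers approximately (1, 2, 3)
```

With real data the same design matrix would be used, and the usual t tests on the highest-order coefficient indicate whether that term is needed.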

25 Quadratic Regression

26 Polynomial Models with One Predictor Variable
First order model (p = 1): y = b0 + b1x + e
Second order model (p = 2): y = b0 + b1x + b2x^2 + e
[Figure: second order curves for b2 < 0 (opens downward) and b2 > 0 (opens upward).]

27 Polynomial Models with One Predictor Variable
Third order model (p = 3): y = b0 + b1x + b2x^2 + b3x^3 + e
[Figure: third order curves for b3 < 0 and b3 > 0.]

28 Interaction
Two independent variables x1 and x2 interact if the effect of x1 on y is influenced by the value of x2. Interaction can be brought into the multiple linear regression model by including the independent variable x1*x2. Example:

29 Interaction Cont.
“Slope” for x1 = E(y|x1+1,x2) - E(y|x1,x2) = b1 + b3x2 in the interaction model y = b0 + b1x1 + b2x2 + b3x1x2 + e.
Is the expected income increase from an extra year of education higher for people with IQ 100 or with IQ 130 (or is it the same)?
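A small numeric sketch can make the question concrete. It assumes the interaction model y = b0 + b1x1 + b2x2 + b3x1x2 + e with x1 = years of education and x2 = IQ, for which the difference E(y|x1+1,x2) - E(y|x1,x2) reduces to b1 + b3x2. The coefficient values and function names are hypothetical, not estimates from any data.

```python
import math

def expected_y(x1, x2, b0, b1, b2, b3):
    """E(y | x1, x2) under the interaction model."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def slope_x1(x2, b1, b3):
    """Change in E(y) for a one-unit increase in x1, holding x2 fixed."""
    return b1 + b3 * x2

# Hypothetical coefficients, chosen only to illustrate the arithmetic.
b0, b1, b2, b3 = 5.0, 1.0, 0.2, 0.05

gain_iq_100 = slope_x1(100, b1, b3)
gain_iq_130 = slope_x1(130, b1, b3)

# Direct check of the definition at x1 = 12 years of education and IQ 100.
direct = expected_y(13, 100, b0, b1, b2, b3) - expected_y(12, 100, b0, b1, b2, b3)
```

With a positive b3, as assumed here, the gain from an extra year of education is larger at IQ 130 than at IQ 100; with b3 = 0 the two gains would be the same.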

30 Polynomial Models with Two Predictor Variables
First order model: y = b0 + b1x1 + b2x2 + e. The effect of one predictor variable on y is independent of the effect of the other predictor variable on y. Plotted against x1, the lines for different values of x2 are parallel:
X2 = 1: [b0+b2(1)] + b1x1
X2 = 2: [b0+b2(2)] + b1x1
X2 = 3: [b0+b2(3)] + b1x1
First order model, two predictors, and interaction: y = b0 + b1x1 + b2x2 + b3x1x2 + e. The two variables interact to affect the value of y. Plotted against x1, the slope now depends on x2:
X2 = 1: [b0+b2(1)] + [b1+b3(1)]x1
X2 = 2: [b0+b2(2)] + [b1+b3(2)]x1
X2 = 3: [b0+b2(3)] + [b1+b3(3)]x1

31 Polynomial Models with Two Predictor Variables
Second order model: y = b0 + b1x1 + b2x2 + b3x1^2 + b4x2^2 + e. Plotted against x1 for fixed values of x2:
X2 = 1: y = [b0+b2(1)+b4(1^2)] + b1x1 + b3x1^2 + e
X2 = 2: y = [b0+b2(2)+b4(2^2)] + b1x1 + b3x1^2 + e
X2 = 3: y = [b0+b2(3)+b4(3^2)] + b1x1 + b3x1^2 + e
Second order model with interaction: y = b0 + b1x1 + b2x2 + b3x1^2 + b4x2^2 + b5x1x2 + e

