 # Class 16: Thursday, Nov. 4 Note: I will e-mail you some info on the final project this weekend and will discuss in class on Tuesday.


Predicting Emergency Calls to the AAA Club

R-Squared: As in simple linear regression, R-squared measures the proportion of variability in Y explained by the regression of Y on these X’s. It lies between 0 and 1; values nearer to 1 indicate more variability explained. Don’t get excited that R-squared has increased when you add more variables to the model: adding another explanatory variable never decreases R-squared, and in practice almost always increases it. The right question is not whether R-squared has increased when we add an explanatory variable, but whether it has increased by a useful amount. The t-statistic and the associated p-value for the t-test for each coefficient answer this question.
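The behavior of R-squared when a useless variable is added can be checked numerically. The following is a minimal sketch on simulated data (the data, the seed, and the helper name `r_squared` are illustrative assumptions, not part of the AAA example): a pure-noise predictor is added to a model and R-squared does not go down.

```python
import numpy as np

def r_squared(X, y):
    """R-squared from the least squares regression of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])      # design matrix with intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares coefficients
    resid = y - A @ beta
    sse = np.sum(resid ** 2)                       # unexplained variability
    sst = np.sum((y - y.mean()) ** 2)              # total variability
    return 1.0 - sse / sst

rng = np.random.default_rng(0)    # arbitrary seed, for reproducibility only
n = 50
x1 = rng.normal(size=n)
noise_x = rng.normal(size=n)      # unrelated to y by construction
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

r2_one = r_squared(x1.reshape(-1, 1), y)
r2_two = r_squared(np.column_stack([x1, noise_x]), y)
print(f"R-squared with x1 only:       {r2_one:.4f}")
print(f"R-squared with x1 and noise:  {r2_two:.4f}")
```

The second value is at least as large as the first even though `noise_x` carries no information about `y`, which is why an increase in R-squared alone is not evidence that a variable is useful.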

Overall F-test: Test of whether any of the predictors are useful: $H_0: \beta_1 = \cdots = \beta_K = 0$ vs. $H_a$: at least one of $\beta_1, \ldots, \beta_K$ does not equal zero. Tests whether the model provides better predictions than the sample mean of Y. p-value for the test: Prob>F in the Analysis of Variance table. Here p-value = 0.005, strong evidence that at least one of the predictors is useful for predicting ERS for the New York AAA club.
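The overall F statistic can be written in terms of R-squared through the standard identity $F = \dfrac{R^2 / K}{(1 - R^2)/(n - K - 1)}$, where $K$ is the number of explanatory variables and $n$ the number of observations. A small sketch follows; the function name and the input values are hypothetical and are not taken from the AAA output. The p-value (Prob>F) would then come from an F distribution with $K$ and $n - K - 1$ degrees of freedom, which is left out here to keep the example free of extra libraries.

```python
def overall_f_statistic(r_squared, n, k):
    """Overall F statistic for a regression with k explanatory variables and n observations."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    if k < 1 or n - k - 1 <= 0:
        raise ValueError("need k >= 1 and n > k + 1")
    explained_per_df = r_squared / k                       # numerator: explained share per predictor
    unexplained_per_df = (1.0 - r_squared) / (n - k - 1)   # denominator: unexplained share per residual df
    return explained_per_df / unexplained_per_df

# Hypothetical inputs, for illustration only.
f_stat = overall_f_statistic(r_squared=0.5, n=28, k=2)
print(f_stat)
```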

Assumptions of Multiple Linear Regression Model

1. Linearity: $E(Y \mid X_1, \ldots, X_K) = \beta_0 + \beta_1 X_1 + \cdots + \beta_K X_K$.
2. Constant variance: The standard deviation of Y for the subpopulation of units with $X_1 = x_1, \ldots, X_K = x_K$ is the same for all subpopulations.
3. Normality: The distribution of Y for the subpopulation of units with $X_1 = x_1, \ldots, X_K = x_K$ is normally distributed for all subpopulations.
4. Independence: The observations are independent.

Assumptions for linear regression and their importance to inferences

| Inference | Assumptions that are important |
| --- | --- |
| Point prediction, point estimation | Linearity, independence |
| Confidence interval for slope, hypothesis test for slope, confidence interval for mean response | Linearity, constant variance, independence, normality (only if n<30) |
| Prediction interval | Linearity, constant variance, independence, normality |

Checking Linearity Plot residuals versus each of the explanatory variables. Each of these plots should look like random scatter, with no pattern in the mean of the residuals. If residual plots show a problem, then we could try to transform the x-variable and/or the y-variable.

Residual Plots in JMP After Fit Model, click red triangle next to Response, click Save Columns and click Residuals. Use Fit Y by X with Y=Residuals and X the explanatory variable of interest. Fit Line will draw a horizontal line with intercept zero. It is a property of the residuals from multiple linear regression that a least squares regression of the residuals on an explanatory variable has slope zero and intercept zero.
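The zero-slope, zero-intercept property of the residuals can be verified outside JMP. This is a minimal sketch on simulated data (the variables `x1`, `x2` and the coefficients are made up for illustration): fit a multiple regression, then regress the saved residuals on one of the explanatory variables.

```python
import numpy as np

rng = np.random.default_rng(1)    # arbitrary seed
n = 40
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 5, size=n)
y = 1.0 + 0.5 * x1 - 2.0 * x2 + rng.normal(size=n)

# Multiple regression of y on x1 and x2 (with intercept).
A = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(A, y, rcond=None)[0]
residuals = y - A @ beta

# Simple regression of the residuals on x1 (the "Fit Line" step).
B = np.column_stack([np.ones(n), x1])
intercept, slope = np.linalg.lstsq(B, residuals, rcond=None)[0]
print(f"intercept = {intercept:.2e}, slope = {slope:.2e}")
```

Both numbers are zero up to floating-point rounding, so the fitted line itself says nothing about linearity. What matters in the plot is whether the residuals show a curved or otherwise systematic pattern around that horizontal line.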

Residual by Predicted Plot: Fit Model displays the Residual by Predicted Plot automatically in its output. The plot shows the residuals versus the predicted Y’s. We can think of the predicted Y’s as summarizing all the information in the X’s. As usual, we would like this plot to show random scatter. Pattern in the mean of the residuals as the predicted Y’s increase: indicates a problem with linearity. Look at residual plots versus each explanatory variable to isolate the problem and consider transformations. Pattern in the spread of the residuals: indicates a problem with constant variance.

Checking Normality As with simple linear regression, make histogram of residuals and normal quantile plot of residuals. Normality appears to be violated: several points are outside the confidence bands. Distribution of Residuals is skewed to the right.

Transformations to Remedy Constant Variance and Normality

Nonconstant variance: When the variance of $Y \mid X_1, \ldots, X_K$ increases with $E(Y \mid X_1, \ldots, X_K)$, try transforming Y to log Y or Y to $\sqrt{Y}$. When the variance of $Y \mid X_1, \ldots, X_K$ decreases with $E(Y \mid X_1, \ldots, X_K)$, try transforming Y to 1/Y or Y to $Y^2$.

Nonnormality: When the distribution of the residuals is skewed to the right, try transforming Y to log Y. When the distribution of the residuals is skewed to the left, try transforming Y to $Y^2$.

Influential Points, High Leverage Points, Outliers: As in simple linear regression, we identify high leverage and high influence points by checking the leverages and Cook’s distances (use Save Columns to save Cook’s D Influence and Hats). High influence points: Cook’s distance > 1. High leverage points: Hat greater than (3*(# of explanatory variables + 1))/n. Use the same guidelines for dealing with influential observations as in simple linear regression. A point that has an unusual Y given its explanatory variables is a point with a residual that is more than 3 RMSEs away from zero.
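The three rules of thumb above can be sketched in code. The example below uses simulated data with one deliberately planted unusual point (index 0); the data, seed, and the helper name `regression_diagnostics` are assumptions for illustration. Leverages are the diagonal of the hat matrix, and Cook's distance uses the usual formula $D_i = \dfrac{e_i^2}{p \cdot \text{MSE}} \cdot \dfrac{h_i}{(1 - h_i)^2}$ with $p$ = number of explanatory variables + 1.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Return leverages (hats), Cook's distances, residuals and RMSE for an OLS fit with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    p = A.shape[1]                                  # number of explanatory variables + 1
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    resid = y - A @ beta
    hats = np.diag(A @ np.linalg.pinv(A))           # diagonal of the hat matrix
    mse = np.sum(resid ** 2) / (n - p)
    cooks_d = resid ** 2 / (p * mse) * hats / (1.0 - hats) ** 2
    return hats, cooks_d, resid, np.sqrt(mse)

rng = np.random.default_rng(2)    # arbitrary seed
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
x1[0] = 8.0                       # planted: unusual explanatory value
y[0] += 10.0                      # planted: also far from the trend

X = np.column_stack([x1, x2])
hats, cooks_d, resid, rmse = regression_diagnostics(X, y)

k = X.shape[1]
high_leverage = hats > 3 * (k + 1) / n      # Hat > 3*(# explanatory variables + 1)/n
high_influence = cooks_d > 1                # Cook's distance > 1
unusual_y = np.abs(resid) > 3 * rmse        # residual more than 3 RMSEs from zero

print("high leverage:", np.flatnonzero(high_leverage))
print("high influence:", np.flatnonzero(high_influence))
print("unusual Y:", np.flatnonzero(unusual_y))
```

Note that a high-leverage point pulls the fitted surface toward itself, so its residual can be small even when the point is influential. That is one reason to check leverage and Cook's distance in addition to the residuals.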

Scatterplot Matrix: Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each explanatory variable. These can suggest transformations of the explanatory variables that need to be done, as well as potential outliers and influential points. Scatterplot matrix in JMP: Click Analyze, Multivariate Methods and Multivariate, and then put the response variable first in the Y, columns box, followed by the explanatory variables in the same Y, columns box.

In order to evaluate benefits of a proposed irrigation scheme in Egypt, the relation of yield Y of wheat to rainfall is investigated over several years (see rainfall.JMP). How can regression analysis help?

| Year | Yield (Bu./Acre), Y | Total Spring Rainfall, R | Average Spring Temperature, T |
| --- | --- | --- | --- |
| 1963 | 60 | 8 | 56 |
| 1964 | 50 | 10 | 47 |
| 1965 | 70 | 11 | 53 |
| 1966 | 70 | 10 | 53 |
| 1967 | 80 | 9 | 56 |
| 1968 | 50 | 9 | 47 |
| 1969 | 60 | 12 | 44 |
| 1970 | 40 | 11 | 44 |

Simple Linear Regression of Yield on Rainfall Rainfall reduces yield!? Is irrigation a bad idea?

Interpretation of coefficient of rainfall: The change in the mean yield that is associated with a one inch increase in rainfall. Other important variables (lurking variables) are not held fixed and might tend to change as rainfall increases. Temperature tends to decrease as rainfall increases.

Controlling for Known Lurking Variables: Multiple Regression. To evaluate the benefits of the irrigation scheme, we want to know how changes in rainfall are associated with changes in yield when all other important variables (lurking variables), such as temperature, are held fixed. Multiple regression provides this. Coefficient on rainfall in the multiple regression of yield on rainfall and temperature = change in the mean yield that is associated with a one inch increase in rainfall when temperature is held fixed.

Multiple Regression Analysis Rainfall is estimated to be beneficial once temperature is held fixed.
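The sign reversal can be reproduced from the eight yearly observations listed earlier (1963 to 1970). The sketch below uses NumPy rather than JMP, and the helper name `fit` is an illustrative assumption. It fits the simple regression of yield on rainfall, then the multiple regression of yield on rainfall and temperature, and reports the correlation between rainfall and temperature.

```python
import numpy as np

# Data from the rainfall example, 1963-1970.
yield_bu = np.array([60, 50, 70, 70, 80, 50, 60, 40], dtype=float)      # Y, bushels per acre
rainfall = np.array([8, 10, 11, 10, 9, 9, 12, 11], dtype=float)         # R, total spring rainfall
temperature = np.array([56, 47, 53, 53, 56, 47, 44, 44], dtype=float)   # T, average spring temperature

def fit(y, *xs):
    """Least squares coefficients (intercept first) for y regressed on the given variables."""
    A = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(A, y, rcond=None)[0]

simple = fit(yield_bu, rainfall)                  # [intercept, rainfall]
multiple = fit(yield_bu, rainfall, temperature)   # [intercept, rainfall, temperature]
corr_rt = np.corrcoef(rainfall, temperature)[0, 1]

print(f"simple regression, rainfall coefficient:   {simple[1]:.3f}")
print(f"multiple regression, rainfall coefficient: {multiple[1]:.3f}")
print(f"multiple regression, temperature coeff.:   {multiple[2]:.3f}")
print(f"correlation of rainfall and temperature:   {corr_rt:.3f}")
```

The rainfall coefficient is negative in the simple regression and positive once temperature is included. The negative correlation between rainfall and temperature is what lets temperature act as a lurking variable in the simple regression.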
