Stat 112: Lecture 9 Notes Homework 3: Due next Thursday

Uploaded: 2017-12-09T19:23:13+00:00

Prediction Intervals for Multiple Regression (Chapter 4.5). Multicollinearity (Chapter 4.6).

Summary of F tests
Partial F tests are used to test whether a subset of the slopes in multiple regression are zero. The whole-model F test (the test of the usefulness of the model) tests whether the slopes on all variables in the multiple regression are zero, i.e., it tests whether the multiple regression is more useful for prediction than ignoring the X's and using the sample mean $\bar{y}$ to predict Y. For testing whether one slope in multiple regression is zero, we can use the t-test. In fact, the partial F test for one slope being zero is equivalent to the t-test (it gives the same p-value and the same decision). Why use the F test, rather than two or more separate t-tests, to test whether a group of slopes are all zero? The F test is more powerful. This will be illustrated later in the lecture.
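For reference, the partial F statistic described above can be written out as follows. This is the standard textbook form, not a formula taken from the slides; the symbols $q$ (number of slopes being tested), $SSE_{reduced}$ and $SSE_{full}$ (error sums of squares without and with those variables), $n$ (sample size) and $K$ (number of explanatory variables in the full model) are notation assumed here.

```latex
F = \frac{\left(SSE_{reduced} - SSE_{full}\right)/q}{SSE_{full}/(n - K - 1)},
\qquad
F \sim F_{q,\; n-K-1} \ \text{under } H_0 .
% Special case q = 1: the partial F statistic equals the square of the
% t statistic for that slope, F = t^2, so the two tests give the same p-value.
```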

Prediction in Automobile Example
The design team is planning a new car with the following characteristics: horsepower = 200, weight = 4000 lb, cargo = 18 ft³, seating = 5 adults. What is a 95% prediction interval for the GPM1000 of this car?

Prediction with Multiple Regression Equation
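Prediction interval for an individual with x1,…,xK: a 95% interval takes the form $\hat{y} \pm t_{0.025,\,n-K-1} \times SE(\text{prediction})$, where $\hat{y} = b_0 + b_1 x_1 + \dots + b_K x_K$ is the predicted value from the fitted regression.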

Finding Prediction Interval in JMP
Enter a row with the independent variables x1,…,xK for the new individual. Do not enter a y for the new individual. Fit the model. Because the new individual does not have a y, JMP will not include the new individual when calculating the least squares fit. Click the red triangle next to the response and click Save Columns. To find the predicted value $\hat{y}$, click Predicted Values; this creates a column with $\hat{y}$. To find the 95% PI, click Indiv Confid Interval; this creates columns with the lower and upper endpoints of the 95% PI.

Prediction in Automobile Example
The design team is planning a new car with the following characteristics: horsepower = 200, weight = 4000 lb, cargo = 18 ft³, seating = 5 adults. From JMP, the 95% prediction interval for GPM1000 is (37.86, 52.31).
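As a quick check on the JMP output, the interval endpoints can be used to back out the point prediction and the margin of error. The sketch below assumes only that the interval is the usual t-based prediction interval, which is symmetric about the predicted value; the variable names are illustrative.

```python
# Endpoints of the 95% prediction interval reported by JMP in the notes.
lower, upper = 37.86, 52.31

# A t-based prediction interval is symmetric about the point prediction,
# so the midpoint recovers y-hat and half the width is the margin of error.
point_prediction = (lower + upper) / 2
margin = (upper - lower) / 2

print(f"point prediction: {point_prediction:.3f}")
print(f"margin of error:  {margin:.3f}")
```

The margin is large relative to the prediction itself, which is typical of prediction intervals for a single new individual.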

Multicollinearity
DATA: A real estate agent wants to develop a model to predict the selling price of a home. The agent takes a random sample of 100 homes that were recently sold and records the selling price (y), the number of bedrooms (x1), the size in square feet (x2), and the lot size in square feet (x3). The data are in houseprice.JMP.

Note: These results illustrate how the F test is more powerful for testing whether
a group of slopes in multiple regression are all zero than individual t tests.

Multicollinearity
Multicollinearity: The explanatory variables are highly correlated with each other, which often makes it hard to determine their individual regression coefficients. In the house price data, there is very little information about what would happen if we fixed house size and changed lot size.

Since house size and lot size are highly correlated, lot size does not vary much among homes of a given house size. The standard error for estimating the coefficient of lot size is therefore large, and consequently the coefficient may not be significant. The same holds for the coefficient of house size. So, although it seems that at least one of the coefficients is significant (see the ANOVA table), you cannot tell which variable is the useful one.

Consequences of Multicollinearity
Standard errors of the regression coefficients are large. As a result, t statistics for testing the population regression coefficients are small. Regression coefficient estimates are unstable. Signs of coefficients may be the opposite of what is intuitively reasonable (e.g., a negative sign on lot size). Dropping or adding one variable in the regression causes large changes in the estimated coefficients of other variables.
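The link between correlated predictors and large standard errors can be made explicit with the standard expression for the standard error of a slope. This is a textbook identity rather than something stated on the slides; $s_e$ (root mean square error), $s_{x_j}$ (sample standard deviation of $x_j$) and $R_j^2$ (the $R^2$ from regressing $x_j$ on the other explanatory variables) are notation assumed here.

```latex
SE(b_j) \;=\; \frac{s_e}{\sqrt{n-1}\; s_{x_j}} \cdot \sqrt{\frac{1}{1 - R_j^2}}
% The second factor is the square root of the variance inflation factor:
% VIF_j = 1 / (1 - R_j^2).
% As x_j becomes more predictable from the other explanatory variables,
% R_j^2 -> 1, the factor grows without bound, and t = b_j / SE(b_j) shrinks.
```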

Detecting Multicollinearity
Pairwise correlations between explanatory variables are high. The overall F statistic for testing the usefulness of the predictors is large, but the individual t statistics are small. Variance inflation factors (VIFs) are large.

Using VIFs
To obtain VIFs, after Fit Model, go to Parameter Estimates, right-click, click Columns, and click VIFs. Detecting multicollinearity with VIFs: any individual VIF greater than 10 indicates multicollinearity. An average of all VIFs considerably greater than 1 also indicates multicollinearity.
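To make the rule of thumb concrete, here is a minimal sketch of a VIF calculation outside JMP. It uses the standard definition VIF_j = 1 / (1 - R_j^2) and covers only the two-predictor case, where R_j^2 is just the squared correlation between the two predictors. The house and lot sizes are made-up numbers for illustration, not the houseprice.JMP data, and the function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_from_r_squared(r_squared):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors."""
    return 1.0 / (1.0 - r_squared)

# Hypothetical house sizes and lot sizes in square feet (illustrative only).
house_size = [1500.0, 1800.0, 2100.0, 2400.0, 2700.0, 3000.0]
lot_size = [6100.0, 7000.0, 8300.0, 9400.0, 10900.0, 12000.0]

r = pearson_r(house_size, lot_size)
# With exactly two predictors, both share the same VIF because R_j^2 = r^2.
vif = vif_from_r_squared(r ** 2)
print(f"r = {r:.4f}, VIF = {vif:.1f}")
```

With three or more predictors, each R_j^2 has to come from a regression of x_j on all the others, which is what JMP reports in the VIF column.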

Problems caused by multicollinearity
If interest is in predicting y, multicollinearity is not particularly problematic as long as the pattern of multicollinearity continues for the observations where forecasts are desired (e.g., house size and lot size are either both large, both medium, or both small). If interest is in obtaining individual regression coefficients, there is no good solution in the face of multicollinearity. If interest is in predicting y for observations where the pattern of multicollinearity differs from that in the sample (e.g., large house size, small lot size), there is no good solution either (this would be extrapolation).

Dealing with Multicollinearity
Suffer: If prediction within the range of the data is the only goal, not the interpretation of the coefficients, then leave the multicollinearity alone.
Combine: In some cases, it may be possible to combine variables to reduce multicollinearity (see next slide).
Omit: Multicollinearity can be reduced by removing one of the highly correlated variables. However, if one wants to estimate the partial slope of one variable holding the other variables fixed, omitting a variable is not an option, as it changes the interpretation of the slope.

Combining horsepower and weight in cars data

Multiple Regression Example: California Test Score Data
The California Standardized Testing and Reporting (STAR) data set californiastar.JMP contains data on test performance, school characteristics and student demographic backgrounds from Average Test Score is the average of the reading and math scores for a standardized test administered to 5th grade students. One interesting question: What would be the causal effect of decreasing the student-teacher ratio by one student per teacher?

Multiple Regression and Causal Inference
Goal: Figure out what the causal effect on average test score would be of decreasing the student-teacher ratio while keeping everything else in the world fixed. Lurking variable: A variable that is associated with both average test score and student-teacher ratio. In order to figure out whether a drop in student-teacher ratio causes higher test scores, we want to compare mean test scores among schools with different student-teacher ratios but the same values of the lurking variables, i.e., we want to hold the values of the lurking variables fixed. If we include all of the lurking variables in the multiple regression model, the coefficient on student-teacher ratio represents the change in the mean of test scores that is caused by a one-unit increase in student-teacher ratio.

Omitted Variables Bias
Schools with many English learners tend to have worse resources. The multiple regression that shows how mean test score changes when student-teacher ratio changes but percent of English learners is held fixed gives a better idea of the causal effect of the student-teacher ratio than the simple linear regression that does not hold percent of English learners fixed. Omitted variables bias: bias in estimating the causal effect of a variable from omitting a lurking variable from the multiple regression. Omitted variables bias of omitting percentage of English learners = -2.28 - (-1.10) = -1.18.
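The mechanism behind omitted variables bias can be sketched with a small noise-free example. Everything below except the two slopes quoted in the notes (-2.28 and -1.10) is hypothetical: the data values, the coefficient on percent English learners, and the helper name `slope` are made up for illustration and are not from californiastar.JMP. The sketch shows that when a lurking variable is left out, the simple-regression slope absorbs the lurking variable's effect in proportion to how strongly it moves with the included variable.

```python
def slope(x, y):
    """Least-squares slope from a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical schools: student-teacher ratio and percent English learners.
str_ratio = [16.0, 18.0, 19.0, 20.0, 22.0, 24.0]
pct_el = [5.0, 12.0, 10.0, 20.0, 25.0, 30.0]

# Hypothetical "true" partial slopes used to generate noise-free scores.
b_str, b_el = -1.10, -0.65
score = [700.0 + b_str * s + b_el * e for s, e in zip(str_ratio, pct_el)]

simple_slope = slope(str_ratio, score)   # regression that omits pct_el
delta = slope(str_ratio, pct_el)         # how pct_el moves with str_ratio
bias = simple_slope - b_str              # omitted variables bias

# The bias equals (effect of the omitted variable) x (its slope on the included one).
print(f"simple slope = {simple_slope:.3f}, bias = {bias:.3f}, b_el * delta = {b_el * delta:.3f}")

# The same subtraction applied to the two slopes quoted in the notes.
notes_bias = -2.28 - (-1.10)
print(f"bias from the notes' slopes = {notes_bias:.2f}")
```

Because the hypothetical schools with larger classes also have more English learners, and English learners lower scores in this setup, the simple regression overstates the harm of a larger student-teacher ratio.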

Key Warning About Multiple Regression
Even if we have included many lurking variables in the multiple regression, we may have failed to include one or not have enough data to include one. There will then be omitted variables bias. The best way to study causal effects is to do a randomized experiment.
