Multiple Regression
We use the least squares approach again:
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$
Find $(b_0, b_1, b_2, \dots, b_k)$ to minimize the sum of squared residuals, where $k$ is the number of independent variables.
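A minimal sketch of the least squares fit, assuming NumPy is available. The data are hypothetical and noise-free (generated from known coefficients 1, 2 and -0.5), so the fit should recover those coefficients; nothing here comes from the salary data set used in the slides.

```python
import numpy as np

# Hypothetical data: 6 observations, k = 2 independent variables.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]   # true b0, b1, b2 (assumed)

# Prepend a column of ones so that b[0] plays the role of the intercept b0.
A = np.column_stack([np.ones(len(y)), X])

# lstsq returns the b that minimizes the sum of squared residuals ||y - A b||^2.
b, *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - A @ b
print("coefficients:", b)
print("sum of squared residuals:", float(residuals @ residuals))
```

With real data the residuals would not be zero; the coefficients are then the values that make the sum of squared residuals as small as possible.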
Things that are pretty much the same as in simple regression: $R^2$, $S$ (the standard error of the regression), and t-tests.
Additional: the F-test and adjusted $R^2$.
Multiple Regression Output
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.709
R Square             0.503
Adjusted R Square    0.461
Standard Error       17748
Observations         52
F Stat               11.889
Sig F                0.00000

              Coefficients   Standard Error   t Stat
Intercept         16442.25          8874.13     1.85
Education2         3742.75         12549.92     0.30
Education4        31321.11          9486.85     3.30
Education6        37762.75         10147.96     3.72
Education8        81236.42         13555.46     5.99
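Each t Stat in the output is the coefficient divided by its standard error. A short check in Python, using only the numbers reported in the output (the variable names are illustrative):

```python
# Coefficient and standard error pairs copied from the summary output.
output = {
    "Intercept":  (16442.25, 8874.13),
    "Education2": (3742.75, 12549.92),
    "Education4": (31321.11, 9486.85),
    "Education6": (37762.75, 10147.96),
    "Education8": (81236.42, 13555.46),
}

# t statistic = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se) in output.items()}

for name, t in t_stats.items():
    print(f"{name:<11} t = {t:.2f}")
```

The results agree with the reported t Stat column to two decimals. Education2, with a t statistic of about 0.30, is the one coefficient that is clearly not significant.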
The F-test:
$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ (none of the Xs help explain Y)
$H_1$: not all $\beta$s are 0 (at least one X is useful)
$H_0: R^2 = 0$ is an equivalent hypothesis.
Adjusted $R^2$ allows for the comparison of models with different numbers of variables. It "penalizes", or adjusts, the regular $R^2$ for the number of variables used.
For the simple education regression, the model using the Education8 variable had an $R^2$ of 27% and an adjusted $R^2$ of 25.2%. For the multiple education regression, the model had an $R^2$ of 50% and an adjusted $R^2$ of 46.1%. Because the adjusted $R^2$ is higher even after the penalty for the extra variables, the multiple regression model explains more than the simple regression model.
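As a rough check, the adjusted R2 of the multiple regression can be reproduced from its regular R2. The formula used below, 1 - (1 - R2)(n - 1)/(n - k - 1), is the standard definition and is an assumption here, since the slides do not state it. The inputs (R2 = 0.503, n = 52 observations, k = 4 education variables) are taken from the regression output.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Standard adjusted R^2: penalizes R^2 for the k variables used."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Values from the multiple education regression output.
adj_multiple = adjusted_r2(r2=0.503, n=52, k=4)
print(f"adjusted R^2 = {adj_multiple:.3f}")   # about 0.461, matching the output
```

The penalty grows with k, so adding a variable raises the adjusted R2 only if it improves the fit by more than the lost degree of freedom costs.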
Explained variation = $R^2$, with $k$ degrees of freedom.
Unexplained variation = $1 - R^2$, with $n - k - 1$ degrees of freedom.
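These two pieces combine into the F statistic: each variation is divided by its degrees of freedom and the ratio is taken. The sketch below assumes the usual form F = (R2 / k) / ((1 - R2) / (n - k - 1)) and uses the rounded values from the regression output (R2 = 0.503, n = 52, k = 4), so it agrees with the reported F Stat of 11.889 only up to rounding.

```python
def f_statistic(r2: float, n: int, k: int) -> float:
    """Overall F statistic computed from R^2."""
    explained = r2 / k                        # explained variation per dof
    unexplained = (1.0 - r2) / (n - k - 1)    # unexplained variation per dof
    return explained / unexplained

f_stat = f_statistic(r2=0.503, n=52, k=4)
print(f"F = {f_stat:.2f}")   # close to the reported 11.889
```

A large F means the explained variation per degree of freedom is large relative to the unexplained variation, which is evidence against the hypothesis that none of the Xs help explain Y.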
So far we have considered the education variables individually (5 separate regressions) and all education variables simultaneously (1 regression). We need a method for considering various subsets of variables when building a general multiple-variable model. With 11 variables, there are $2^{11} - 1 = 2047$ possible regressions. It is generally impractical to consider all possible combinations.
Model Selection
1. Regress Y on each of the k potential X variables.
2. Determine the best single-variable model.
3. Regress Y on the best variable and each of the remaining k-1 variables.
4. Determine the best model that includes the previous best variable and one new variable.
5. If the adjusted $R^2$ declines, the standard error of the regression increases, the F-test fails, the t-statistic of the best variable is insignificant, or the coefficients are theoretically inconsistent, STOP and use the previous best model.
Otherwise, repeat steps 2-4, adding one variable at a time, until stopped or until the all-variable model has been reached.
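The procedure above can be sketched in Python. This is a simplified illustration, not the full method: it stops only when the adjusted R2 fails to improve and leaves out the t-test, F-test, standard error and theoretical-consistency checks, which need judgment or extra statistics. The helper names `fit_adj_r2` and `forward_select` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_adj_r2(X, y):
    """Least squares fit with an intercept; returns the adjusted R^2."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def forward_select(X, y):
    """Add one variable per round; stop when adjusted R^2 stops rising."""
    remaining = list(range(X.shape[1]))
    chosen, best_adj = [], -np.inf
    while remaining:
        # Try each remaining variable alongside the ones already chosen.
        scores = [(fit_adj_r2(X[:, chosen + [j]], y), j) for j in remaining]
        adj, j = max(scores)
        if adj <= best_adj:        # simplified stopping rule (assumption)
            break
        chosen.append(j)
        remaining.remove(j)
        best_adj = adj
    return chosen, best_adj

# Synthetic example: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(52, 5))
y = 3.0 + 2.0 * X[:, 0] - 4.0 * X[:, 2] + rng.normal(scale=0.5, size=52)

chosen, best_adj = forward_select(X, y)
print("variables in order of entry:", chosen)
print("adjusted R^2 of final model:", round(best_adj, 3))
```

In practice each round would also inspect the t-statistics and the signs of the coefficients before accepting the new variable, as step 5 requires.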
The starting point for choosing the best regression model is the $R^2$. The model with the highest $R^2$ is the best model if two conditions are met: the t-statistics indicate that all of the coefficients are statistically significant, and the coefficients and the model are consistent with a desired theoretical model. If the conditions are not met, move to the model with the next-highest $R^2$.
The best candidate model uses the variable Age: its $R^2$ is 83%, the t-statistics of the coefficients are all statistically significant, and the sign and scale of the coefficients are plausible. However, it is unlikely that the company would, as a matter of theory, choose to build a salary model based on age. The next best candidate model uses the variable Executive: its $R^2$ is 76%, the t-statistics are all statistically significant, and the sign and scale of the coefficients are plausible. There is no conceptual business objection to this model.
Keeping the Executive variable, add each of the remaining variables in turn to construct a two-variable model. This requires ten separate regressions. The process continues until a stopping criterion is met: the adjusted $R^2$ declines, the standard error increases, the F-test fails for all models, the t-statistics for the best model are not significant, the sign and scale of the coefficients are conceptually contradictory, or an irreducible theoretical contradiction is reached. The maximum number of regressions required for this process is $11 + 10 + 9 + \dots + 1 = 66$. This is a vast improvement over 2047.
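The two regression counts can be checked with a few lines of Python. The sketch assumes k = 11 candidate variables, as in the text; the exhaustive count is every non-empty subset of the variables, and the stepwise count is the worst case in which a variable is added in every round.

```python
from math import comb

k = 11  # number of candidate variables

# Exhaustive search: one regression per non-empty subset of the k variables.
all_subsets = 2**k - 1
# Same count, built up by subset size (choose m of k, for m = 1..k).
all_subsets_by_size = sum(comb(k, m) for m in range(1, k + 1))

# Stepwise search, worst case: k regressions in round 1, k-1 in round 2, ...
stepwise_max = sum(range(1, k + 1))

print("all possible regressions:", all_subsets)
print("maximum stepwise regressions:", stepwise_max)
```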