
1 business analytics II ▌appendix – regression performance: the R²
Managerial Economics & Decision Sciences Department | developed for business analytics II
week 7 ▌appendix – regression performance: the R², multicollinearity and the F-test

2 learning objectives
► statistics & econometrics
• define and interpret the R²
• understand the connection between multicollinearity and the R²
• the vif and testparm commands
readings
► (MSN) Chapter 7
► (CS) Dubuque Hot Dog

3 variations around the mean
► The diagram below shows a very simple case (just for illustrative purposes) of a linear regression of y (dependent variable) on x (independent variable). There are only three observations in the sample: the pairs (x₁, y₁), (x₂, y₂) and (x₃, y₃).
• the mean of the dependent variable is ȳ = (y₁ + y₂ + y₃)/3
• the regression line is ŷ = b₀ + b₁·x, which generates the pairs (x₁, ŷ₁), (x₂, ŷ₂) and (x₃, ŷ₃)
► We can identify two types of variations around the mean:
• dependent variable variation around its own mean: yᵢ − ȳ
• model-based estimated variable variation around the mean: ŷᵢ − ȳ
► Remember that the linear regression is supposed to explain how the mean of the dependent variable depends on x.
[Figure: the three observations and the regression line, with the dependent variable y on the vertical axis and the independent variable x on the horizontal axis.]

4 ► For the overall regression it holds that:
► For the overall regression it holds that TSS = MSS + RSS.
► The equation above says that the total variation (TSS) in the dependent variable is the sum of two components: one that is explained by the regression (MSS) and one that is left unexplained by the regression model, the residual variation (RSS).
• TSS (total sum of squares): variation of the dependent variable around the mean
• MSS (model sum of squares): variation of the model-based estimated variable around the mean
• RSS (residual sum of squares): variation of the dependent variable around the model-based estimated variable
key concept: the R²
► R² is the fraction of the variation in the y variable, i.e. variation of y around its own mean, that is explained by the x-variables used in the regression, i.e. “explained by the regression model”: R² = MSS/TSS.
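Written out for a sample of n observations, with ŷᵢ the fitted values and ȳ the sample mean, the decomposition reads:

TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad MSS = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \quad RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

TSS = MSS + RSS \quad\Rightarrow\quad R^2 = \frac{MSS}{TSS} = 1 - \frac{RSS}{TSS}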

5 ► Let’s look back at the Dubuque regression:
Figure 1. Results for regression of MKTDUB on pdub, poscar, pbpreg and pbpbeef
[Stata output: ANOVA block (Model/Residual/Total SS, df, MS), summary statistics (Number of obs = 113, F(4, 108), Prob > F, R-squared = 0.5263, Adj R-squared, Root MSE) and the coefficient table for pdub, poscar, pbpreg, pbpbeef and _cons; the numeric entries did not survive in this transcript.]
Remark. The definition R² = MSS/TSS implies that 52.63% of the variation in MKTDUB is explained by the variation generated by the independent variables. Pretty impressive!!! As a check (and a way to calculate R² “manually”), notice the numbers in the top left of the table: MSS, RSS and TSS, with MSS/TSS = 0.5263.
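A minimal Stata sketch of the “manual” check, assuming hotdog.dta with these variable names is loaded; e(mss) and e(rss) are the model and residual sums of squares that regress stores after estimation:

. regress MKTDUB pdub poscar pbpreg pbpbeef
. display "R2 = " e(mss)/(e(mss) + e(rss))    // MSS/TSS, should match the reported R-squared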

6 ► High R² does not guarantee accurate predictions
► High R² does not guarantee accurate predictions
• R² is only a relative, not an absolute, measure of accuracy. If there is a lot of variation in y, i.e. TSS is large, then even a regression with a high R² might have wide prediction intervals
• “in-sample” vs. “out-of-sample” performance: the computed value of R² is based on the observations included in your regression (see the sketch below)
- R² tells you how well the model fits the data used to run the regression and obtain the R²
- this may be useful if the model that generates future observations is the same as the model that generated your data
- if this is not the case, then your model may have a high R² but may be worthless for prediction
► High R² is not a sign of accomplishment
• regressions with a high R² may be uninformative if your regression has picked up a well-known and uninteresting trend (regress annual per capita income on a time trend; have you actually learned anything?)
• if we interpret “trend” loosely, there are many situations where you can get a high R² without learning anything (say you run the regression Rebounds = β₀ + β₁·Height; with β₁ > 0 the recommendation is to grow taller!)
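A minimal Stata sketch of an in-sample vs. out-of-sample comparison, assuming hotdog.dta is loaded; the split at observation 80 is an arbitrary illustrative choice, not part of the case:

. regress MKTDUB pdub poscar pbpreg pbpbeef if _n <= 80    // fit on the first 80 observations only
. predict yhat                                             // fitted values for all 113 observations
. correlate MKTDUB yhat if _n > 80                         // how well the model does on the held-out data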

7 ► Low R², by itself, is not a sign of failure
► R² will increase (for sure!) by adding randomly chosen variables as independent variables
• add the following variable to the hotdog.dta file: z taking values 1, 2, …, 113 (see the sketch below)
• surprise: R² increases! Yet variable z has nothing to do with (no explanatory power for) the MKTDUB. For this reason we should use an adjusted R² that basically adjusts for the number of independent variables (k) included in the model: adjusted R² = 1 − (1 − R²)·(n − 1)/(n − k − 1)
► Low R², by itself, is not a sign of failure
• if there is not much variation in y, i.e. TSS is small, then even a regression with a low R² might have small prediction intervals
• sometimes even a little extra predictive power can be valuable: think of the stock market, or forecasting landfall of hurricanes
• the regression may be very useful in learning about the deterministic portion of a model if you can obtain precise coefficient estimates. A coefficient may have a small standard error and be highly statistically significant even if R² is low
► When deciding on how many independent variables to include you can use the adjusted R²; nevertheless, the specific questions motivating your analysis usually suggest better, more relevant tools than R² (or its adjusted version)
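A minimal Stata sketch of the junk-variable experiment, assuming hotdog.dta (113 observations) is loaded; e(r2) and e(r2_a) are the R-squared and adjusted R-squared stored by regress:

. generate z = _n                               // z takes the values 1, 2, ..., 113
. regress MKTDUB pdub poscar pbpreg pbpbeef
. display "R2 = " e(r2) "  adj R2 = " e(r2_a)
. regress MKTDUB pdub poscar pbpreg pbpbeef z
. display "R2 = " e(r2) "  adj R2 = " e(r2_a)   // R2 is mechanically higher; adjusted R2 need not be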

8 multicollinearity revisited
► Back to the Dubuque Hot Dogs regression (to be able to run the following commands make sure the initial regression is the last regression run).
► We saw that pbpreg and pbpbeef are likely to be correlated and that this might inflate the standard errors of the coefficients. The command vif delivers a list of “variance inflation factors”, one for each coefficient:
Figure 2. Results for the vif command
. vif
[Stata output: a table with columns Variable | VIF | 1/VIF and rows pbpreg, pbpbeef, poscar and pdub, plus the Mean VIF; the numeric entries did not survive in this transcript.]
► How are the variance inflation factors calculated?
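vif is a postestimation command, so it must directly follow the regression you want to diagnose; a minimal sketch assuming hotdog.dta is loaded:

. regress MKTDUB pdub poscar pbpreg pbpbeef
. vif                                           // one VIF (and 1/VIF) per regressor, plus the mean VIF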

9 multicollinearity revisited
► Let’s keep this table available for easy reference:
VIF results for the full regression
[Variable | VIF | 1/VIF rows for pbpreg, pbpbeef, poscar and pdub, plus the Mean VIF; numeric entries not preserved.]
► The independent variables are obviously pdub, poscar, pbpreg and pbpbeef.
► The 1/VIF for variable pdub is equal to 1 − R², where R² is the R-squared from the regression of pdub on all the other independent variables: pdub = c₀ + c₁·poscar + c₂·pbpreg + c₃·pbpbeef
. regress pdub poscar pbpreg pbpbeef
[Stata output: ANOVA block and summary statistics (Number of obs = 113, F(3, 109), R-squared); numeric entries not preserved.]
1 − R² = 0.734, so VIF(pdub) = 1/0.734 ≈ 1.36
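A minimal sketch of this manual calculation, assuming hotdog.dta is loaded; the same pattern applies to every regressor (pbpreg is worked out on the next slide):

. regress pdub poscar pbpreg pbpbeef            // pdub regressed on the other independent variables
. display "1 - R2 = " 1 - e(r2)                 // should match the 1/VIF column for pdub
. display "VIF    = " 1/(1 - e(r2))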

10 multicollinearity revisited
► Let’s keep this table available for easy reference:
VIF results for the full regression
[Variable | VIF | 1/VIF rows for pbpreg, pbpbeef, poscar and pdub, plus the Mean VIF; numeric entries not preserved.]
► The 1/VIF for variable pbpreg is equal to 1 − R², where R² is the R-squared from the regression of pbpreg on all the other independent variables: pbpreg = d₀ + d₁·pdub + d₂·poscar + d₃·pbpbeef
. regress pbpreg pdub poscar pbpbeef
[Stata output: ANOVA block and summary statistics (Number of obs = 113, F(3, 109), R-squared); numeric entries not preserved.]
1 − R² = 0.038, so VIF(pbpreg) = 1/0.038 ≈ 26.3, a clear sign of multicollinearity

11 multicollinearity revisited
► The fact that we detect inflated standard errors does not automatically guarantee that we have detected multicollinearity. To identify multicollinearity we use the F-test.
► The F-test tells us whether one or more variables add predictive power to a regression:
hypothesis
H₀: all of the regression coefficients (β) on the variables you are testing equal 0
Hₐ: at least one of the regression coefficients (β) is different from 0
► In plain language: you are basically testing whether these variables are no more related to y than junk variables.
Remark. The F-test for a single variable returns the same significance level as the t-test.
► The F-test for a group of variables can be executed in STATA using the test or testparm command, listing the variables you wish to test after running a regression (see the sketch below):
testparm xvar1 xvar2 … xvark
Remark. After the STATA command testparm you should list the variables you want to test; the null hypothesis is that their coefficients are all equal to zero.
Remark. What if we include all the independent variables in the list for the F-test? What are we testing? We are actually testing the null H₀: all coefficients in the regression are zero. If we reject the null then we know that at least one of the variables adds some predictive value. If we cannot reject the null then it means that we are really using variables with no explanatory/predictive power for the variation in the dependent variable.
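A minimal sketch for the Dubuque regression, assuming hotdog.dta is loaded; the second testparm call tests only the group pbpreg and pbpbeef, the two regressors suspected of being collinear:

. regress MKTDUB pdub poscar pbpreg pbpbeef
. testparm pdub poscar pbpreg pbpbeef           // H0: all four coefficients are zero
. testparm pbpreg pbpbeef                       // H0: the pbpreg and pbpbeef coefficients are both zero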

12 multicollinearity revisited
► For the Dubuque Hot Dogs full regression (just the upper part of the table):
[Stata output: ANOVA block (Model/Residual/Total SS, df, MS) and summary statistics (Number of obs = 113, F(4, 108), Prob > F, R-squared, Adj R-squared, Root MSE); the numeric entries did not survive in this transcript.]
► You can also run:
. testparm pdub poscar pbpreg pbpbeef
( 1) pdub = 0
( 2) poscar = 0
( 3) pbpreg = 0
( 4) pbpbeef = 0
F( 4, 108) =
Prob > F =
► The regression table implicitly provides the joint test that all regression coefficients are zero: the F(4, 108) and Prob > F reported in its header are exactly the statistics this testparm command computes.
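The overall F statistic in the regression header is built from the same sums of squares as R²; with k = 4 regressors and n = 113 observations (hence n − k − 1 = 108 denominator degrees of freedom):

F = \frac{MSS/k}{RSS/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)}

Plugging in the earlier R² = 0.5263 gives F = (0.5263/4)/(0.4737/108) ≈ 30.0.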

