Class 23
– The most over-rated statistic
– The four assumptions
– The most important hypothesis test yet
– Using yes/no variables in regressions
Adjusted R-square (Pg 9-12, Pfeifer note)
[Excel descriptive statistics for Hours, n = 15: median 7.08, mode 7.17, range 13.08, minimum 2, maximum 15.08; other entries not legible]
Our better method of forecasting hours would use a mean of 7.9 and standard deviation of 3.89 (and the t-distribution with 14 dof). This is the variation in Hours that regression will try to explain.
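A minimal sketch (Python/SciPy, not part of the course materials) of what that probability forecast amounts to, using only the numbers above:

```python
from scipy import stats

# "Better method" forecast of Hours: mean 7.9, sd 3.89, t-distribution with 14 dof.
mean, sd, dof = 7.9, 3.89, 14
t_crit = stats.t.ppf(0.975, dof)                 # two-sided 95% cutoff
low, high = mean - t_crit * sd, mean + t_crit * sd
print(f"95% probability forecast for Hours: {low:.1f} to {high:.1f}")
```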
Our better method of forecasting hours for job A would use a mean equal to the regression's predicted Hours for job A's MSF and a standard deviation of 2.77 (and the t-distribution with 13 dof). This is the variation in Hours that regression leaves unexplained.
[Excel regression of Hours on MSF, n = 15: Standard Error 2.77; ANOVA df: Regression 1, Residual 13, Total 14; coefficients for Intercept and MSF; other entries not legible]
From the Pfeifer note: adjusted R-square runs from 0.0 (when the regression's standard error equals s, the standard deviation of Y) through 0.5 up to 1.0 (when the standard error equals 0).
Why Pfeifer says R² is over-rated
There is no standard for how large it should be.
– In some situations an adjusted R² of 0.05 would be FANTASTIC. In others, an adjusted R² of 0.96 would be DISAPPOINTING.
It has no real use.
– Unlike the "standard error", which is needed to make probability forecasts.
It is usually redundant.
– When comparing models, lower standard errors mean higher adjusted R².
– The correlation coefficient (which shares the same sign as b) ≈ the square root of adjusted R².
(A quick check of this redundancy follows.)
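A minimal sketch of the redundancy, using the Hours example numbers above (s = 3.89, regression standard error = 2.77). It relies on the standard identity adjusted R² = 1 − (standard error / s)², which is also the relationship behind the 0-to-1 scale shown above:

```python
# Adjusted R-square is fully determined by the regression standard error and
# the sample standard deviation of Y -- it carries no extra information.
s_y = 3.89          # sample standard deviation of Hours (n = 15)
se  = 2.77          # regression standard error (13 dof)

adj_r2 = 1 - (se / s_y) ** 2
print(f"adjusted R-square ~ {adj_r2:.2f}")       # ~ 0.49
print(f"|correlation|    ~ {adj_r2 ** 0.5:.2f}") # ~ square root of adj R-square
```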
The Coal Pile Example
The firm needed a way to estimate the weight of a coal pile (based on its dimensions).
[Excel multiple regression of W on D, h, and d, n = 10: ANOVA df: Regression 3, Residual 6, Total 9; coefficients for Intercept, D, h, and d; other entries not legible]
___% of the variation in W is explained by this regression. We just used MULTIPLE regression.
The Coal Pile Example
Engineer Bob calculated the Volume of each pile and used simple regression…
100% of the variation in W is explained by this regression. The standard error went from 20.6 to 2.8!!!
[Excel simple regression of W on Vol, n = 10: Standard Error 2.8; ANOVA df: Regression 1, Residual 8, Total 9; coefficients for Intercept and Vol; other entries not legible]
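A rough sketch of Bob's idea in Python. The dimensions, weights, and the conical-frustum volume formula below are all made up for illustration; the case does not reproduce the data or say exactly how Bob computed volume:

```python
import numpy as np
from scipy import stats

# Hypothetical pile dimensions (feet); stand-ins for the 10 piles in the case.
D = np.array([60, 72, 55, 80, 65, 70, 58, 75, 62, 68.0])   # base diameter
d = np.array([20, 25, 18, 30, 22, 24, 19, 28, 21, 23.0])   # top diameter
h = np.array([15, 18, 14, 20, 16, 17, 14, 19, 15, 16.0])   # height

# Assumed conical-frustum volume -- one engineered feature replacing three inputs.
vol = np.pi * h / 12.0 * (D**2 + D*d + d**2)
W = 0.0006 * vol + np.random.default_rng(1).normal(0, 2, 10)   # fake weights

fit = stats.linregress(vol, W)      # SIMPLE regression: W on the single feature Vol
print(f"r^2 = {fit.rvalue**2:.3f}, slope = {fit.slope:.5f}")
```

The point of the example survives the fake numbers: a well-chosen single feature can explain more variation (and give a smaller standard error) than three raw dimensions in a multiple regression.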
The Four Assumptions (Sec 5 of Pfeifer note, Sec 12.4 of EMBS)
To use the regression forecast for job A (mean = the regression's predicted Hours, standard deviation 2.77, and the t-distribution with 13 dof), four assumptions must hold (Sec 5 of Pfeifer note, Sec 12.4 of EMBS):
– Linearity
– Independence (all 15 points count equally)
– Homoskedasticity
– Normality
(A sketch of how to check them follows.)
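A minimal sketch (Python/SciPy/matplotlib) of how the assumptions might be checked with residual plots; the data below are hypothetical stand-ins for the 15 (MSF, Hours) pairs:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical (MSF, Hours) pairs standing in for the 15 jobs in the example.
rng = np.random.default_rng(0)
msf = rng.uniform(1, 10, 15)
hours = 2 + 0.8 * msf + rng.normal(0, 2, 15)

fit = stats.linregress(msf, hours)
resid = hours - (fit.intercept + fit.slope * msf)

# Linearity and homoskedasticity: residuals vs X should show no curve and a
# roughly constant spread across the range of X.
plt.scatter(msf, resid)
plt.axhline(0, color="gray")
plt.title("Residuals vs MSF")
plt.show()

# Normality: the residuals should fall close to the line in a normal probability plot.
stats.probplot(resid, plot=plt)
plt.show()

# Independence is judged from how the data were collected (each of the 15 jobs
# counted once, no job influencing another), not from a plot.
```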
Hypotheses (P 13 of Pfeifer note, Sec 12.5 of EMBS)
– H0: P = 0.5 (LTT, wunderdog)
– H0: Independence (supermarket job and response, treatment and heart attack, light and myopia, tosser and outcome)
– H0: μ = 100 (IQ)
– H0: μM = μF (heights, weights, batting average)
– H0: μcompact = μmid = μlarge (displacement)
The most important hypothesis test yet: H0: b = 0 (P 13 of Pfeifer note, Sec 12.5 of EMBS)
Testing b = 0 is EASY!!! (P 13 of Pfeifer note, Sec 12.5 of EMBS)
[Excel coefficient table for the Hours-on-MSF regression: for Intercept and MSF it reports the Coefficient, Standard Error, t Stat, P-value, and Lower/Upper 95% limits]
The output gives you the standard error of the coefficient, the t-stat to test b = 0 (the coefficient divided by its standard error), and the 2-tailed p-value.
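A minimal sketch of the same test outside Excel, with hypothetical data standing in for the 15 jobs; the only ingredients are the coefficient, its standard error, and the t-distribution with n − 2 dof:

```python
import numpy as np
from scipy import stats

# Hypothetical (MSF, Hours) data standing in for the 15 jobs.
rng = np.random.default_rng(1)
msf = rng.uniform(1, 10, 15)
hours = 1.5 + 0.7 * msf + rng.normal(0, 2.5, 15)

fit = stats.linregress(msf, hours)
t_stat = fit.slope / fit.stderr                            # coefficient / its standard error
p_two_tail = 2 * stats.t.sf(abs(t_stat), df=len(msf) - 2)  # 2-tailed p, n - 2 dof
print(t_stat, p_two_tail, fit.pvalue)                      # fit.pvalue reports the same test
```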
Using Yes/No variables in Regression (Sec 8 of Pfeifer note, Sec 13.7 of EMBS)
Does MPG "depend" on fuel type? Fuel Type is categorical, Hwy MPG is numerical, n = 60.

Car  Class    Displacement  Fuel Type  Hwy MPG
1    Midsize  3.5           R          28
2    Midsize  3             R          26
3    Large    3             P          26
4    Large    3.5           P          …
…    Compact  6             P          20
59   Midsize  2.5           R          30
60   Midsize  2             R          32
Fuel type (yes/no) and MPG (numerical) (Sec 8 of Pfeifer note, Sec 13.7 of EMBS)
H0: μP = μR, or H0: μP – μR = 0
Un-stack the data so there are two columns of MPG data, then use Data Analysis, t-Test: Two-Sample Assuming Equal Variances.
[Excel output for columns P and R: means, variances; Observations 36 (P) and 24 (R); Pooled Variance; Hypothesized Mean Difference 0; df 58; t Stat; one-tail and two-tail P(T<=t) and t Critical values]
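A minimal sketch of the same pooled two-sample t-test in Python; the MPG values are hypothetical, but the group sizes match the output above:

```python
import numpy as np
from scipy import stats

# Hypothetical un-stacked MPG columns: 36 premium-fuel and 24 regular-fuel cars.
rng = np.random.default_rng(2)
mpg_P = rng.normal(25, 3, 36)
mpg_R = rng.normal(28, 3, 24)

# equal_var=True gives the pooled-variance test, as in the Excel tool (df = 58).
t_stat, p_two_tail = stats.ttest_ind(mpg_P, mpg_R, equal_var=True)
print(t_stat, p_two_tail)
```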
Using Yes/No variables in Regression (Sec 8 of Pfeifer note, Sec 13.7 of EMBS)
1. Convert the categorical variable into a 1/0 DUMMY variable.
– Use an if statement to do this.
– It won't matter which category is assigned 1 and which is assigned 0.
– It doesn't even matter which two numbers you assign to the two categories (regression will adjust).
2. Regress MPG (numerical) on DUMMY (1/0 numerical).
3. Test H0: b = 0 using the regression output.
(A minimal sketch of these three steps follows.)
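A minimal sketch of steps 1-3 in Python; the fuel-type and MPG values are hypothetical stand-ins for the 60 cars:

```python
import numpy as np
from scipy import stats

# A handful of hypothetical rows standing in for the 60 cars.
fuel = np.array(["R", "R", "P", "P", "P", "P", "R", "R"])
mpg  = np.array([28, 26, 26, 24, 21, 20, 30, 32.0])

dprem = np.where(fuel == "P", 1, 0)        # step 1: dummy = 1 if Premium, 0 if Regular
fit = stats.linregress(dprem, mpg)         # step 2: regress MPG on the dummy
print(fit.slope, fit.pvalue)               # step 3: the p-value tests H0: b = 0
# The intercept estimates mean MPG for Regular cars; intercept + slope estimates
# the mean for Premium cars, so the slope is the estimated P-minus-R difference.
```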
Using Yes/No variables in Regression (Sec 8 of Pfeifer note, Sec 13.7 of EMBS)
[Data with the new dummy column: Fuel Type R → Dprem 0, Fuel Type P → Dprem 1, alongside Hwy MPG]
[Excel regression of Hwy MPG on Dprem, n = 60: the Dprem P-value and the ANOVA Significance F are both on the order of E-04, so we reject H0: b = 0; the Intercept P-value is on the order of E-44]
Regression with one Dummy variable
H0: μP = μR, or H0: μP – μR = 0, or H0: b = 0.
These are all the same hypothesis: the pooled two-sample t-test and the t-test of the dummy coefficient lead to the same conclusion.
What we learned today
– We learned about "adjusted R square", the most over-rated statistic of all time.
– We learned the four assumptions required to use regression to make a probability forecast of Y│X, and how to check each of them.
– We learned how to test H0: b = 0, and why this is such an important test.
– We learned how to use a yes/no variable in a regression: create a dummy variable.