Multiple Regression: W&W, Chapter 13, 15(3-4)
Introduction Multiple regression is an extension of bivariate regression to take into account more than one independent variable. The simplest multivariate model can be written as: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$. We make the same assumptions about our error ($\varepsilon$) that we did in the bivariate case.
Example Suppose we examine the impact of fertilizer on crop yields, but this time we want to control for another factor that we think affects yield levels, rainfall. We collect the following data.
Data
Y (yield)   X_1 (fertilizer)   X_2 (rainfall)
40          100                10
50          200                20
50          300                10
70          400                30
65          500                20
65          600                20
80          700                30
Multiple Regression Partial Slope Coefficients $\beta_1$ is geometrically interpreted as the marginal effect of fertilizer ($X_1$) on yield ($Y$), holding rainfall ($X_2$) constant. The OLS model is estimated as: $Y = b_0 + b_1 X_1 + b_2 X_2 + e$, with predicted value $Y_p = b_0 + b_1 X_1 + b_2 X_2$. Solving for $b_0$, $b_1$, and $b_2$ becomes more complicated than it was in the bivariate model because we have to consider the relationships between $X_1$ and $Y$, $X_2$ and $Y$, and $X_1$ and $X_2$.
Finding the Slopes We would solve the following equations simultaneously for this problem:
$\sum (X_1 - M_{x1})(Y - M_y) = b_1 \sum (X_1 - M_{x1})^2 + b_2 \sum (X_1 - M_{x1})(X_2 - M_{x2})$
$\sum (X_2 - M_{x2})(Y - M_y) = b_1 \sum (X_1 - M_{x1})(X_2 - M_{x2}) + b_2 \sum (X_2 - M_{x2})^2$
$b_0 = M_y - b_1 M_{x1} - b_2 M_{x2}$
Here $M_y$, $M_{x1}$, and $M_{x2}$ are the sample means of $Y$, $X_1$, and $X_2$. These are called the normal or estimating equations.
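As an illustration, the two slope equations can be solved directly for the fertilizer data on the Data slide. The sketch below is a minimal pure-Python version that uses Cramer's rule on the two normal equations; the function and variable names are our own and are not from W&W.

```python
# Fertilizer/rainfall data from the Data slide.
Y = [40, 50, 50, 70, 65, 65, 80]           # yield
X1 = [100, 200, 300, 400, 500, 600, 700]   # fertilizer
X2 = [10, 20, 10, 30, 20, 20, 30]          # rainfall


def mean(values):
    return sum(values) / len(values)


def cross_dev(a, b):
    """Sum of cross-products of deviations from the means."""
    m_a, m_b = mean(a), mean(b)
    return sum((ai - m_a) * (bi - m_b) for ai, bi in zip(a, b))


def solve_two_regressor_ols(y, x1, x2):
    """Solve the two normal equations for b1 and b2, then recover b0."""
    s11 = cross_dev(x1, x1)
    s22 = cross_dev(x2, x2)
    s12 = cross_dev(x1, x2)
    s1y = cross_dev(x1, y)
    s2y = cross_dev(x2, y)
    # The determinant is zero when X1 and X2 are perfectly correlated.
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b0, b1, b2


b0, b1, b2 = solve_two_regressor_ols(Y, X1, X2)
print(f"b0 = {b0:.3f}, b1 = {b1:.4f}, b2 = {b2:.4f}")
```

With more than two regressors the same idea is usually written in matrix form and handed to a linear-algebra routine rather than expanded by hand.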
Hypothesis Testing for $\beta$ We can calculate a confidence interval: $\beta = b \pm t_{\alpha/2}\,(se_b)$, with $df = n - k - 1$, where $k$ = # of regressors. We can also use a t-test for each independent variable to test the following hypotheses (as one- or two-tailed tests): $H_0: \beta_1 = 0$ against $H_A: \beta_1 \neq 0$, and $H_0: \beta_2 = 0$ against $H_A: \beta_2 \neq 0$, where $t = b_i / se_{b_i}$.
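A small sketch of the interval and the t-ratio follows. The slope and standard error are assumed, rounded values of roughly the size the fertilizer data produce for $b_1$; the critical value 2.776 is the two-tailed 5% point of the t distribution with 4 degrees of freedom, hard-coded here to avoid depending on a statistics library.

```python
def t_ratio(b, se_b):
    """t statistic for H0: beta = 0."""
    return b / se_b


def confidence_interval(b, se_b, t_crit):
    """b +/- t_crit * se_b."""
    half_width = t_crit * se_b
    return b - half_width, b + half_width


n, k = 7, 2            # 7 observations, 2 regressors
df = n - k - 1         # degrees of freedom

# Assumed illustrative inputs (rounded), not taken from the slides.
b1, se_b1 = 0.0381, 0.0058
T_CRIT_95_DF4 = 2.776  # two-tailed 5% critical value, t with 4 df

t1 = t_ratio(b1, se_b1)
low, high = confidence_interval(b1, se_b1, T_CRIT_95_DF4)
reject_null = abs(t1) > T_CRIT_95_DF4

print(f"df = {df}, t = {t1:.2f}, 95% CI = ({low:.4f}, {high:.4f})")
print("reject H0: beta_1 = 0" if reject_null else "fail to reject H0")
```

The two-tailed test and the interval agree by construction: the null is rejected at the 5% level exactly when the 95% interval excludes zero.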
Dropping Regressors We may be tempted to throw out variables that are insignificant, but doing so might bias the remaining coefficients in the model. The bias that results from omitting a relevant variable that is correlated with the remaining regressors is called omitted variable bias. If you have a strong theoretical reason to include a variable, then you should keep it in the model. One way to minimize such bias is to use randomized assignment of the treatment variables.
Interpreting the Coefficients In the bivariate regression model, the slope (b) represents a change in Y that accompanies a one unit change in X. In the multivariate regression model, each slope coefficient (b i ) represents the change in Y that accompanies a one unit change in the regressor (X i ) if all other regressors remain constant. This is like taking a partial derivative in calculus, which is why we refer to these as partial slope coefficients.
Partial Correlation Partial correlation calculates the correlation between $Y$ and $X_i$ with the other regressors held constant: Partial $r = \dfrac{b/se_b}{\sqrt{(b/se_b)^2 + (n-k-1)}} = \dfrac{t}{\sqrt{t^2 + (n-k-1)}}$
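The formula converts a t-ratio into a partial correlation with one line of arithmetic. The sketch below uses an assumed t-ratio of 6.5 with n = 7 and k = 2 purely for illustration; the function name is ours.

```python
import math


def partial_r(t, n, k):
    """Partial correlation implied by a coefficient's t-ratio."""
    df = n - k - 1
    return t / math.sqrt(t ** 2 + df)


# Assumed illustrative values: t = 6.5, 7 observations, 2 regressors.
r_partial = partial_r(6.5, n=7, k=2)
print(f"partial r = {r_partial:.4f}")
```

Because the denominator is always larger than |t|, the result stays strictly between -1 and 1 and carries the sign of the coefficient.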
Calculating Adjusted $R^2$ $R^2 = SSR/SST$ (regression sum of squares over total sum of squares). Problem: $R^2$ never decreases as $k$ increases, so some people advocate the use of the adjusted $R^2$: $R^2_A = \dfrac{(n-1)R^2 - k}{n-k-1}$. We subtract $k$ in the numerator as a "penalty" for increasing the size of $k$ (# of regressors).
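To see the penalty at work, the sketch below evaluates the adjusted $R^2$ formula and checks that it agrees with the other common way of writing it, $1 - (1 - R^2)(n-1)/(n-k-1)$. The value $R^2 = 0.90$ is a made-up input for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 in the form used on the slide."""
    return ((n - 1) * r2 - k) / (n - k - 1)


def adjusted_r2_alt(r2, n, k):
    """Algebraically equivalent form: 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical R^2 of 0.90 with n = 7 observations.
two_regressors = adjusted_r2(0.90, n=7, k=2)
three_regressors = adjusted_r2(0.90, n=7, k=3)

print(f"k = 2: adjusted R^2 = {two_regressors:.3f}")
print(f"k = 3: adjusted R^2 = {three_regressors:.3f}")
```

Holding $R^2$ fixed, adding a regressor lowers the adjusted value, so an extra variable has to raise $R^2$ by enough to offset the penalty.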
Stepwise Regression W&W talk about stepwise regression (pages 499-500). This is an atheoretical procedure that selects variables on the basis of how they increase R 2. Don’t use this technique because it is not theoretically driven and R 2 is a very problematic statistic (as you will learn later).
Standard error of the estimate A better measure of model fit is the standard error of the estimate: $s = \sqrt{\dfrac{\sum (Y - Y_p)^2}{n-k-1}}$. This is just the square root of the SSE divided by its degrees of freedom. A model with a smaller standard error of the estimate is better. See Chris Achen's Sage monograph on regression for a good discussion of this measure.
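A minimal sketch of this calculation for the fertilizer data on the Data slide is given below. It fits the two-regressor model by solving the normal equations, then applies the formula for $s$ to the residuals; all function names are our own.

```python
import math

Y = [40, 50, 50, 70, 65, 65, 80]           # yield
X1 = [100, 200, 300, 400, 500, 600, 700]   # fertilizer
X2 = [10, 20, 10, 30, 20, 20, 30]          # rainfall


def mean(values):
    return sum(values) / len(values)


def cross_dev(a, b):
    """Sum of cross-products of deviations from the means."""
    m_a, m_b = mean(a), mean(b)
    return sum((ai - m_a) * (bi - m_b) for ai, bi in zip(a, b))


def fit(y, x1, x2):
    """OLS with two regressors via the normal equations."""
    s11, s22, s12 = cross_dev(x1, x1), cross_dev(x2, x2), cross_dev(x1, x2)
    s1y, s2y = cross_dev(x1, y), cross_dev(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return mean(y) - b1 * mean(x1) - b2 * mean(x2), b1, b2


def standard_error_of_estimate(y, y_pred, k):
    """sqrt(SSE / (n - k - 1)), where SSE is the sum of squared residuals."""
    n = len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    return math.sqrt(sse / (n - k - 1))


b0, b1, b2 = fit(Y, X1, X2)
Y_pred = [b0 + b1 * a + b2 * b for a, b in zip(X1, X2)]
s = standard_error_of_estimate(Y, Y_pred, k=2)
print(f"s = {s:.4f}")
```

Unlike $R^2$, $s$ is expressed in the units of $Y$, so here it reads as a typical prediction error in yield units.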
Multicollinearity An additional assumption we must make in multiple regression is that none of the independent variables are perfectly correlated with each other. In the simple multivariate model, for example: $Y = b_0 + b_1 X_1 + b_2 X_2 + e$, we require $|r_{12}| \neq 1$.
Multicollinearity With perfect multicollinearity, you cannot estimate the partial slope coefficients. To see why this is so, rewrite the estimate for $b_1$ in the model with two independent variables as: $b_1 = \dfrac{r_{y1} - r_{12}\, r_{y2}}{1 - r_{12}^2} \cdot \dfrac{s_y}{s_1}$, where $r_{y1}$ = correlation between $Y$ and $X_1$, $r_{12}$ = correlation between $X_1$ and $X_2$, $r_{y2}$ = correlation between $Y$ and $X_2$, $s_y$ = standard deviation of $Y$, and $s_1$ = standard deviation of $X_1$.
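The correlation form of $b_1$ can be checked against the fertilizer data on the Data slide. The sketch below computes the three correlations and the two standard deviations by hand, then tabulates how the factor $1/(1 - r_{12}^2)$ grows as $r_{12}$ approaches 1; the helper names and the list of trial correlations are our own choices.

```python
import math

Y = [40, 50, 50, 70, 65, 65, 80]           # yield
X1 = [100, 200, 300, 400, 500, 600, 700]   # fertilizer
X2 = [10, 20, 10, 30, 20, 20, 30]          # rainfall


def mean(values):
    return sum(values) / len(values)


def cross_dev(a, b):
    """Sum of cross-products of deviations from the means."""
    m_a, m_b = mean(a), mean(b)
    return sum((ai - m_a) * (bi - m_b) for ai, bi in zip(a, b))


def pearson_r(a, b):
    return cross_dev(a, b) / math.sqrt(cross_dev(a, a) * cross_dev(b, b))


def std_dev(a):
    """Sample standard deviation (n - 1 in the denominator)."""
    return math.sqrt(cross_dev(a, a) / (len(a) - 1))


def b1_from_correlations(y, x1, x2):
    r_y1, r_y2, r_12 = pearson_r(y, x1), pearson_r(y, x2), pearson_r(x1, x2)
    return (r_y1 - r_12 * r_y2) / (1 - r_12 ** 2) * std_dev(y) / std_dev(x1)


b1 = b1_from_correlations(Y, X1, X2)
r_12 = pearson_r(X1, X2)
print(f"r_12 = {r_12:.4f}, b1 = {b1:.4f}")

# The denominator 1 - r_12^2 shrinks toward zero as r_12 approaches 1.
for r in (0.5, 0.9, 0.99, 0.999):
    print(f"r_12 = {r}: 1 / (1 - r_12^2) = {1 / (1 - r ** 2):.1f}")
```

The slope from this formula matches the one obtained from the normal equations, since the two are algebraically the same estimator.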
Multicollinearity We can see that if $r_{12} = 1$ or $-1$ you are dividing by zero, so $b_1$ is undefined. Often, if $r_{12}$ is high but not equal to one in absolute value, you will get a good overall model fit (high $R^2$, significant F-statistic) but insignificant t-ratios. You should always examine the correlations between your independent variables to determine if this might be an issue.
Multicollinearity Multicollinearity does not bias our estimates, but it inflates the variance and thus the standard error of the parameters (or increases inefficiency). This is why we get insignificant t-ratios: because $t = b/se_b$, as we inflate $se_b$ we depress the t-ratio, making it less likely that we will reject the null hypothesis.
Standard error for $b_i$ We can calculate the standard error for $b_1$, for example, as: $se_1 = \dfrac{s}{\sqrt{\sum (X_1 - M_{x1})^2 \,\bigl[1 - R_1^2\bigr]}}$, where $R_1$ = the multiple correlation of $X_1$ with all the other regressors. As $R_1$ increases, our standard error increases. Note that for bivariate regression, the term $[1 - R_1^2]$ drops out.
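The sketch below applies this formula to fertilizer ($X_1$) from the Data slide. With only two regressors, $R_1^2$ is simply the squared correlation between $X_1$ and $X_2$. The value of $s$ is an assumed input (about 2.31, roughly what the two-regressor fit to these data gives), and the comparison holds $s$ fixed to isolate the effect of the $[1 - R_1^2]$ term.

```python
import math

X1 = [100, 200, 300, 400, 500, 600, 700]   # fertilizer
X2 = [10, 20, 10, 30, 20, 20, 30]          # rainfall


def mean(values):
    return sum(values) / len(values)


def cross_dev(a, b):
    """Sum of cross-products of deviations from the means."""
    m_a, m_b = mean(a), mean(b)
    return sum((ai - m_a) * (bi - m_b) for ai, bi in zip(a, b))


def se_slope(s, sum_sq_dev, r_sq_other):
    """s / sqrt( sum (X - M_x)^2 * (1 - R^2 with the other regressors) )."""
    return s / math.sqrt(sum_sq_dev * (1 - r_sq_other))


S_ASSUMED = 2.3146  # assumed standard error of the estimate, not from the slides

s11 = cross_dev(X1, X1)
r1_sq = cross_dev(X1, X2) ** 2 / (s11 * cross_dev(X2, X2))

se_b1 = se_slope(S_ASSUMED, s11, r1_sq)
se_b1_uncorrelated = se_slope(S_ASSUMED, s11, 0.0)  # what R_1 = 0 would give
inflation = se_b1 / se_b1_uncorrelated

print(f"R_1^2 = {r1_sq:.4f}")
print(f"se_b1 = {se_b1:.5f} (vs {se_b1_uncorrelated:.5f} if R_1 = 0)")
print(f"inflation factor = {inflation:.3f}")
```

The inflation factor depends only on $R_1^2$: it equals $1/\sqrt{1 - R_1^2}$, which is why highly correlated regressors can produce small t-ratios even when the overall fit is good.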