Christopher Dougherty EC220 - Introduction to econometrics (chapter 6) Slideshow: variable misspecification I: omitted variable bias


1 Christopher Dougherty EC220 - Introduction to econometrics (chapter 6) Slideshow: variable misspecification i: omitted variable bias Original citation: Dougherty, C. (2012) EC220 - Introduction to econometrics (chapter 6). [Teaching Resource] © 2012 The Author This version available at: Available in LSE Learning Resources Online: May 2012 This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License. This license allows the user to remix, tweak, and build upon the work even for commercial purposes, as long as the user credits the author and licenses their new creations under the identical terms.

2 VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE In this sequence and the next we will investigate the consequences of misspecifying the regression model in terms of explanatory variables. [Table: Consequences of variable misspecification, true model against fitted model]

3 To keep the analysis simple, we will assume that there are only two possibilities. Either Y depends only on X2, or it depends on both X2 and X3.

4 If Y depends only on X2, and we fit a simple regression model, we will not encounter any problems, assuming of course that the regression model assumptions are valid.

5 Likewise we will not encounter any problems if Y depends on both X2 and X3 and we fit the multiple regression.

6 In this sequence we will examine the consequences of fitting a simple regression when the true model is multiple.

7 In the next one we will do the opposite and examine the consequences of fitting a multiple regression when the true model is simple.

8 The omission of a relevant explanatory variable causes the regression coefficients to be biased and the standard errors to be invalid. [Table: Consequences of variable misspecification. Fitted model matches the true model: correct specification, no problems. Simple regression fitted when the true model is multiple: coefficients are biased (in general), standard errors are invalid.]

9 In the present case, the omission of X3 causes b2 to be biased by the term highlighted in yellow. We will explain this first intuitively and then demonstrate it mathematically.
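The slide's equation is not reproduced in this transcript. Assuming the usual notation for this setup (true model Y = β1 + β2 X2 + β3 X3 + u, fitted model Ŷ = b1 + b2 X2), the result being described is presumably the standard omitted variable bias expression:

```latex
E(b_2) = \beta_2 + \beta_3 \,
  \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}
       {\sum (X_{2i} - \bar{X}_2)^2}
```

The second term on the right is the bias, and is most likely the term the slide highlights.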

10 [Diagram: Y depends on X2 and X3. Arrows: direct effect of X2, holding X3 constant (β2); effect of X3 (β3); apparent effect of X2, acting as a mimic for X3.] The intuitive reason is that, in addition to its direct effect β2, X2 has an apparent indirect effect as a consequence of acting as a proxy for the missing X3.

11 The strength of the proxy effect depends on two factors: the strength of the effect of X3 on Y, which is given by β3, and the ability of X2 to mimic X3.

12 The ability of X2 to mimic X3 is determined by the slope coefficient obtained when X3 is regressed on X2, the term highlighted in yellow.
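The proxy story has an exact in-sample counterpart that can be checked numerically. The sketch below uses simulated data with hypothetical parameter values (none of them come from the slides) and plain Python. It shows that the slope from the simple regression of Y on X2 equals the multiple-regression slope of X2 plus the multiple-regression slope of X3 times the slope from regressing X3 on X2. Note that this identity is stated in terms of the estimates, whereas the slides' bias expression uses the true β3.

```python
import random


def mean(v):
    return sum(v) / len(v)


def slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def two_regressor_slopes(x2, x3, y):
    """OLS slopes (b2, b3) from regressing y on x2 and x3 (with intercept)."""
    m2, m3, my = mean(x2), mean(x3), mean(y)
    s22 = sum((a - m2) ** 2 for a in x2)
    s33 = sum((b - m3) ** 2 for b in x3)
    s23 = sum((a - m2) * (b - m3) for a, b in zip(x2, x3))
    s2y = sum((a - m2) * (c - my) for a, c in zip(x2, y))
    s3y = sum((b - m3) * (c - my) for b, c in zip(x3, y))
    det = s22 * s33 - s23 ** 2
    b2 = (s2y * s33 - s3y * s23) / det
    b3 = (s3y * s22 - s2y * s23) / det
    return b2, b3


# Hypothetical data-generating process, chosen only for illustration.
rng = random.Random(0)
n = 200
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [0.6 * a + rng.gauss(0, 1) for a in x2]          # X2 can partly mimic X3
y = [1.0 + 2.0 * a + 3.0 * b + rng.gauss(0, 1) for a, b in zip(x2, x3)]

b2_multi, b3_multi = two_regressor_slopes(x2, x3, y)  # direct effects
b2_simple = slope(x2, y)                              # X3 omitted
mimic = slope(x2, x3)                                 # slope of X3 regressed on X2
proxy_effect = b3_multi * mimic                       # apparent indirect effect

print(b2_simple, b2_multi + proxy_effect)             # the two values coincide
```

The two printed values agree up to floating-point rounding, because the identity follows directly from the normal equations.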

13 We will now derive the expression for the bias mathematically. It is convenient to start by deriving an expression for the deviation of Yi about its sample mean. It can be expressed in terms of the deviations of X2, X3, and u about their sample means.

14 Although Y really depends on X3 as well as X2, we make a mistake and regress Y on X2 only. The slope coefficient is therefore as shown.

15 We substitute for the Y deviations and simplify.

16 Hence we have demonstrated that b2 has three components.
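The equations for these steps are missing from the transcript. A reconstruction under the assumed true model Y = β1 + β2 X2 + β3 X3 + u, which should match the slides up to notation, is:

```latex
\begin{align*}
Y_i - \bar{Y} &= \beta_2 (X_{2i} - \bar{X}_2) + \beta_3 (X_{3i} - \bar{X}_3) + (u_i - \bar{u}) \\[4pt]
b_2 &= \frac{\sum (X_{2i} - \bar{X}_2)(Y_i - \bar{Y})}{\sum (X_{2i} - \bar{X}_2)^2} \\[4pt]
    &= \beta_2
     + \beta_3 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2}
     + \frac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{\sum (X_{2i} - \bar{X}_2)^2}
\end{align*}
```

The three components are the true value β2, the term arising from the omitted X3, and the error term.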

17 To investigate biasedness or unbiasedness, we take the expected value of b2. The first two terms are unaffected because they contain no random components. Thus we focus on the expectation of the error term.

18 X2 is nonstochastic, so the denominator of the error term is nonstochastic and may be taken outside the expression for the expectation.

19 In the numerator the expectation of a sum is equal to the sum of the expectations (first expected value rule).

20 In each product, the factor involving X2 may be taken out of the expectation because X2 is nonstochastic.

21 By Assumption A.3, the expected value of u is 0. It follows that the expected value of the sample mean of u is also 0. Hence the expected value of the error term is 0.

22 Thus we have shown that the expected value of b2 is equal to the true value plus a bias term. Note: the definition of a bias is the difference between the expected value of an estimator and the true value of the parameter being estimated.
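The expectation argument can be illustrated with a small Monte Carlo experiment. The sketch below is not from the slides: the parameter values, sample size and number of replications are all hypothetical. The regressors are drawn once and then held fixed (nonstochastic), a fresh disturbance u is drawn in each replication, and the average of the simple-regression slope is compared with β2 plus the bias term.

```python
import random

rng = random.Random(42)
n = 50
beta1, beta2, beta3 = 1.0, 2.0, 3.0     # hypothetical true parameters

# Regressors are generated once and held fixed across replications.
x2 = [rng.uniform(0, 10) for _ in range(n)]
x3 = [0.5 * a + rng.uniform(0, 5) for a in x2]

m2 = sum(x2) / n
m3 = sum(x3) / n
s22 = sum((a - m2) ** 2 for a in x2)
s23 = sum((a - m2) * (b - m3) for a, b in zip(x2, x3))

# Bias term: beta3 times the slope from regressing X3 on X2.
predicted_bias = beta3 * s23 / s22


def simple_slope(y):
    """Slope from regressing y on x2 only, omitting x3."""
    my = sum(y) / n
    return sum((a - m2) * (c - my) for a, c in zip(x2, y)) / s22


reps = 2000
estimates = []
for _ in range(reps):
    # Only the disturbance term changes from one replication to the next.
    y = [beta1 + beta2 * a + beta3 * b + rng.gauss(0, 2)
         for a, b in zip(x2, x3)]
    estimates.append(simple_slope(y))

mean_b2 = sum(estimates) / reps
print("average b2:          ", mean_b2)
print("beta2 + bias term:   ", beta2 + predicted_bias)
```

The average of b2 across replications should settle close to β2 plus the bias term rather than to β2 itself, which is what the derivation predicts.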

23 As a consequence of the misspecification, the standard errors, t tests and F test are invalid.

24 [Stata output: . reg S ASVABC SM; numeric values not shown] We will illustrate the bias using an educational attainment model. To keep the analysis simple, we will assume that in the true model S depends only on ASVABC and SM. The output above shows the corresponding regression using EAEF Data Set 21.

25 We will run the regression a second time, omitting SM. Before we do this, we will try to predict the direction of the bias in the coefficient of ASVABC.

26 It is reasonable to suppose, as a matter of common sense, that β3 is positive. This assumption is strongly supported by the fact that its estimate in the multiple regression is positive and highly significant.

27 The correlation between ASVABC and SM is positive, so the numerator of the bias term must be positive. The denominator is automatically positive since it is a sum of squares and there is some variation in ASVABC. Hence the bias should be positive. [Stata output: . cor SM ASVABC (obs=540); correlation values not shown]
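The sign argument can be written as a tiny helper. This is only an illustrative sketch, and the function name `bias_direction` is made up for this example. It relies on the fact that the denominator of the bias term is a sum of squares, so the direction of the bias is the sign of β3 multiplied by the sign of the correlation between the included and the omitted variable.

```python
def bias_direction(beta3_sign, correlation_sign):
    """Expected direction of the bias in b2 when X3 is omitted.

    Both arguments are +1, -1 or 0. Only the sign of beta3 and the sign of
    the sample correlation between X2 and X3 matter, because the denominator
    of the bias term is a sum of squares and therefore positive.
    """
    product = beta3_sign * correlation_sign
    if product > 0:
        return "upward"
    if product < 0:
        return "downward"
    return "none"


# Schooling example: the effect of SM is assumed positive, and ASVABC and SM
# are positively correlated.
print(bias_direction(+1, +1))   # upward
```

The same helper returns "downward" when the omitted variable has a positive effect but is negatively correlated with the included one.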

28 [Stata output: . reg S ASVABC; numeric values not shown] Here is the regression omitting SM.

29 [Stata output: coefficient tables for . reg S ASVABC SM and . reg S ASVABC; numeric values not shown] As you can see, the coefficient of ASVABC is indeed higher when SM is omitted. Part of the difference may be due to pure chance, but part is attributable to the bias.

30 [Stata output: . reg S SM; numeric values not shown] Here is the regression omitting ASVABC instead of SM. We would expect b3 to be upwards biased. We anticipate that β2 is positive and we know that both the numerator and the denominator of the other factor in the bias expression are positive.

31 [Stata output: coefficient tables for . reg S ASVABC SM and . reg S SM; numeric values not shown] In this case the bias is quite dramatic. The coefficient of SM has more than doubled. The reason for the bigger effect is that the variation in SM is much smaller than that in ASVABC, while β2 and β3 are similar in size, judging by their estimates.
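Writing the two bias terms side by side makes the role of the variation clearer. This is a sketch in assumed notation, with X2 standing for ASVABC and X3 for SM:

```latex
\begin{align*}
\text{omitting } X_3:\quad E(b_2) - \beta_2
  &= \beta_3 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2} \\[4pt]
\text{omitting } X_2:\quad E(b_3) - \beta_3
  &= \beta_2 \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{3i} - \bar{X}_3)^2}
\end{align*}
```

The numerators are identical. If β2 and β3 are similar in size, the bias is larger in whichever case has the smaller sum of squared deviations in the denominator, which here is the one for SM.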

32 [Stata output: summary statistics for . reg S ASVABC SM, . reg S ASVABC and . reg S SM; numeric values not shown] Finally, we will investigate how R² behaves when a variable is omitted. In the simple regression of S on ASVABC, R² is 0.34, and in the simple regression of S on SM it is 0.13.

33 Does this imply that ASVABC explains 34% of the variance in S and SM 13%? No, because the multiple regression reveals that their joint explanatory power is 0.35, not 0.47.

34 In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its apparent explanatory power. Similarly, in the third regression, SM is partly acting as a proxy for ASVABC, again inflating its apparent explanatory power.
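The same pattern is easy to reproduce with simulated data. The sketch below does not use EAEF Data Set 21; the data-generating process is hypothetical, with two positively correlated regressors that both have positive coefficients. It compares R² from the two simple regressions with R² from the multiple regression.

```python
import random


def dev(v):
    """Deviations of a list of values about their sample mean."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def r2_simple(x, y):
    """R-squared from regressing y on x alone (the squared correlation)."""
    dx, dy = dev(x), dev(y)
    return dot(dx, dy) ** 2 / (dot(dx, dx) * dot(dy, dy))


def r2_multiple(x2, x3, y):
    """R-squared from regressing y on x2 and x3 together."""
    d2, d3, dy = dev(x2), dev(x3), dev(y)
    s22, s33, s23 = dot(d2, d2), dot(d3, d3), dot(d2, d3)
    s2y, s3y = dot(d2, dy), dot(d3, dy)
    det = s22 * s33 - s23 ** 2
    b2 = (s2y * s33 - s3y * s23) / det
    b3 = (s3y * s22 - s2y * s23) / det
    # Explained sum of squares divided by total sum of squares.
    return (b2 * s2y + b3 * s3y) / dot(dy, dy)


# Hypothetical data: x3 is positively correlated with x2.
rng = random.Random(1)
n = 500
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [0.6 * a + 0.8 * rng.gauss(0, 1) for a in x2]
y = [a + b + rng.gauss(0, 1.5) for a, b in zip(x2, x3)]

r2_x2 = r2_simple(x2, y)
r2_x3 = r2_simple(x3, y)
r2_both = r2_multiple(x2, x3, y)
print(round(r2_x2, 3), round(r2_x3, 3), round(r2_both, 3))
```

In this setup the two simple R² values add up to more than the multiple-regression R², because each regressor is credited with part of the other's explanatory power.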

35 [Stata output: . reg LGEARN S EXP; numeric values not shown] However, it is also possible for omitted variable bias to lead to a reduction in the apparent explanatory power of a variable. This will be demonstrated using a simple earnings function model, supposing the logarithm of hourly earnings to depend on S and EXP.

36 If we omit EXP from the regression, the coefficient of S should be subject to a downward bias. β3 is likely to be positive. The numerator of the other factor in the bias term is negative since S and EXP are negatively correlated. The denominator is positive. [Stata output: . cor S EXP (obs=540); correlation values not shown]

37 For the same reasons, the coefficient of EXP in a simple regression of LGEARN on EXP should be downwards biased.

38 [Stata output: coefficient tables for . reg LGEARN S EXP, . reg LGEARN S and . reg LGEARN EXP; numeric values not shown] As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions.

39 [Stata output: summary statistics for . reg LGEARN S EXP, . reg LGEARN S and . reg LGEARN EXP; numeric values not shown] A comparison of R² for the three regressions shows that the sum of R² in the simple regressions is actually less than R² in the multiple regression.

40 This is because the apparent explanatory power of S in the second regression has been undermined by the downwards bias in its coefficient. The same is true for the apparent explanatory power of EXP in the third equation.
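The opposite case can also be sketched with simulated data. Again, this is not the earnings data from the slides: the regressors below are hypothetical, negatively correlated, and both have positive coefficients. The sketch checks that each simple-regression slope falls below its multiple-regression counterpart and that the simple R² values sum to less than the multiple R².

```python
import random


def dev(v):
    """Deviations of a list of values about their sample mean."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Hypothetical data: x3 is negatively correlated with x2 (about -0.5).
rng = random.Random(2)
n = 500
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [-0.5 * a + 0.866 * rng.gauss(0, 1) for a in x2]
y = [a + b + rng.gauss(0, 1) for a, b in zip(x2, x3)]   # both true slopes are 1

d2, d3, dy = dev(x2), dev(x3), dev(y)
s22, s33, s23 = dot(d2, d2), dot(d3, d3), dot(d2, d3)
s2y, s3y, syy = dot(d2, dy), dot(d3, dy), dot(dy, dy)

# Multiple regression of y on x2 and x3 (normal equations in deviations).
det = s22 * s33 - s23 ** 2
b2_multi = (s2y * s33 - s3y * s23) / det
b3_multi = (s3y * s22 - s2y * s23) / det
r2_both = (b2_multi * s2y + b3_multi * s3y) / syy

# Simple regressions, each omitting the other variable.
b2_simple = s2y / s22
b3_simple = s3y / s33
r2_x2 = s2y ** 2 / (s22 * syy)
r2_x3 = s3y ** 2 / (s33 * syy)

print("slopes:", round(b2_simple, 3), round(b2_multi, 3),
      round(b3_simple, 3), round(b3_multi, 3))
print("R2:    ", round(r2_x2, 3), round(r2_x3, 3), round(r2_both, 3))
```

With negatively correlated regressors, each variable partly offsets the other when it stands in for it, so both the slopes and the apparent explanatory power are understated in the simple regressions.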

41 Copyright Christopher Dougherty These slideshows may be downloaded by anyone, anywhere for personal use. Subject to respect for copyright and, where appropriate, attribution, they may be used as a resource for teaching an econometrics course. There is no need to refer to the author. The content of this slideshow comes from Section 6.2 of C. Dougherty, Introduction to Econometrics, fourth edition 2011, Oxford University Press. Additional (free) resources for both students and instructors may be downloaded from the OUP Online Resource Centre Individuals studying econometrics on their own and who feel that they might benefit from participation in a formal course should consider the London School of Economics summer school course EC212 Introduction to Econometrics or the University of London International Programmes distance learning course 20 Elements of Econometrics

