Presentation is loading. Please wait.

Presentation is loading. Please wait.

© Christopher Dougherty 1999–2006 VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE We will now investigate the consequences of misspecifying.

Similar presentations


Presentation on theme: "© Christopher Dougherty 1999–2006 VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE We will now investigate the consequences of misspecifying."— Presentation transcript:

1 © Christopher Dougherty 1999–2006 VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE We will now investigate the consequences of misspecifying the regression model in terms of explanatory variables. To keep the analysis simple, we will assume that there are only two possibilities. Either Y depends only on X 2, or it depends on both X 2 and X 3. A.If Y depends only on X 2, and we fit a simple regression model, we will not encounter any problems, assuming of course that the regression model assumptions are valid. B.Likewise we will not encounter any problems if Y depends on both X 2 and X 3 and we fit the multiple regression.

2 © Christopher Dougherty 1999–2006 A.In this sequence we will examine the consequences of fitting a simple regression when the true model is multiple.  The omission of a relevant explanatory variable causes the regression coefficients to be biased and the standard errors to be invalid. B.In the next one we will do the opposite and examine the consequences of fitting a multiple regression when the true model is simple.  Including an irrelevant variable – In this case, the coefficients in general remain unbiased, but they are inefficient. The standard errors remain valid, but are needlessly large. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

3 © Christopher Dougherty 1999–2006 Consequences of Variable Misspecification True Model Fitted Model The omission of a relevant explanatory variable causes the regression coefficients to be biased and the standard errors to be invalid. Correct specification, no problems Correct specification, no problems Coefficients are biased (in general). Standard errors are invalid. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

4 © Christopher Dougherty 1999–2006 In the present case, the omission of X 3 causes b 2 to be biased by the term highlighted in yellow. We will explain this first intuitively and then demonstrate it mathematically. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

5 © Christopher Dougherty 1999–2006 Y X3X3 X2X2 direct effect of X 2, holding X 3 constant effect of X 3 apparent effect of X 2, acting as a mimic for X 3 22 33 The intuitive reason is that, in addition to its direct effect  2, X 2 has an apparent indirect effect as a consequence of acting as a proxy for the missing X 3. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

6 © Christopher Dougherty 1999–2006 Y X3X3 X2X2 direct effect of X 2, holding X 3 constant effect of X 3 apparent effect of X 2, acting as a mimic for X 3 22 33 The strength of the proxy effect depends on two factors: the strength of the effect of X 3 on Y, which is given by  3, and the ability of X 2 to mimic X 3. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

7 © Christopher Dougherty 1999–2006 Y X3X3 X2X2 direct effect of X 2, holding X 3 constant effect of X 3 apparent effect of X 2, acting as a mimic for X 3 22 33 The ability of X 2 to mimic X 3 is determined by the slope coefficient obtained when X 3 is regressed on X 2, the term highlighted in yellow. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

8 © Christopher Dougherty 1999–2006 We will now derive the expression for the bias mathematically. It is convenient to start by deriving an expression for the deviation of Y i about its sample mean. It can be expressed in terms of the deviations of X 2, X 3, and u about their sample means. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

9 © Christopher Dougherty 1999–2006 Although Y really depends on X 3 as well as X 2, we make a mistake and regress Y on X 2 only. The slope coefficient is therefore as shown. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

10 © Christopher Dougherty 1999–2006 We substitute for the Y deviations and simplify. Hence we have demonstrated that b 2 has three components. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

11 © Christopher Dougherty 1999–2006 To investigate biasedness or unbiasedness, we take the expected value of b 2. The first two terms are unaffected because they contain no random components. Thus we focus on the expectation of the error term. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

12 © Christopher Dougherty 1999–2006 X 2 is nonstochastic, so the denominator of the error term is nonstochastic and may be taken outside the expression for the expectation. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

13 © Christopher Dougherty 1999–2006 In the numerator the expectation of a sum is equal to the sum of the expectations (first expected value rule). VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

14 © Christopher Dougherty 1999–2006 In each product, the factor involving X 2 may be taken out of the expectation because X 2 is nonstochastic. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

15 © Christopher Dougherty 1999–2006 By Assumption A.3, the expected value of u is 0. It follows that the expected value of the sample mean of u is also 0. Hence the expected value of the deviation from mean for the disturbance term is also 0. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

16 © Christopher Dougherty 1999–2006 Thus we have shown that the expected value of b 2 is equal to the true value plus a bias term. Note 1: The definition of a bias is the difference between the expected value of an estimator and the true value of the parameter being estimated.) Note 2: The bias will be zero if the sample correlation between X2 and X3 is zero, or  3 = 0. As a consequence of the misspecification, the standard errors, t tests and F test are invalid. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

17 © Christopher Dougherty 1999–2006. reg S ASVABC SM Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 2, 537) = 147.36 Model | 1135.67473 2 567.837363 Prob > F = 0.0000 Residual | 2069.30861 537 3.85346109 R-squared = 0.3543 -------------+------------------------------ Adj R-squared = 0.3519 Total | 3204.98333 539 5.94616574 Root MSE = 1.963 ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.1328069.0097389 13.64 0.000.1136758.151938 SM |.1235071.0330837 3.73 0.000.0585178.1884963 _cons | 5.420733.4930224 10.99 0.000 4.452244 6.389222 ------------------------------------------------------------------------------ We will illustrate the bias using an educational attainment model. To keep the analysis simple, we will assume that in the true model S depends only on ASVABC and SM. The output above shows the corresponding regression using EAEF Data Set 21. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

18 © Christopher Dougherty 1999–2006. reg S ASVABC SM Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 2, 537) = 147.36 Model | 1135.67473 2 567.837363 Prob > F = 0.0000 Residual | 2069.30861 537 3.85346109 R-squared = 0.3543 -------------+------------------------------ Adj R-squared = 0.3519 Total | 3204.98333 539 5.94616574 Root MSE = 1.963 ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.1328069.0097389 13.64 0.000.1136758.151938 SM |.1235071.0330837 3.73 0.000.0585178.1884963 _cons | 5.420733.4930224 10.99 0.000 4.452244 6.389222 ------------------------------------------------------------------------------ We will run the regression a second time, omitting SM. Before we do this, we will try to predict the direction of the bias in the coefficient of ASVABC. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

19 © Christopher Dougherty 1999–2006. reg S ASVABC SM Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 2, 537) = 147.36 Model | 1135.67473 2 567.837363 Prob > F = 0.0000 Residual | 2069.30861 537 3.85346109 R-squared = 0.3543 -------------+------------------------------ Adj R-squared = 0.3519 Total | 3204.98333 539 5.94616574 Root MSE = 1.963 ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.1328069.0097389 13.64 0.000.1136758.151938 SM |.1235071.0330837 3.73 0.000.0585178.1884963 _cons | 5.420733.4930224 10.99 0.000 4.452244 6.389222 ------------------------------------------------------------------------------ It is reasonable to suppose, as a matter of common sense, that  3 is positive. This assumption is strongly supported by the fact that its estimate in the multiple regression is positive and highly significant. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

20 © Christopher Dougherty 1999–2006. reg S ASVABC SM Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 2, 537) = 147.36 Model | 1135.67473 2 567.837363 Prob > F = 0.0000 Residual | 2069.30861 537 3.85346109 R-squared = 0.3543 -------------+------------------------------ Adj R-squared = 0.3519 Total | 3204.98333 539 5.94616574 Root MSE = 1.963 ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.1328069.0097389 13.64 0.000.1136758.151938 SM |.1235071.0330837 3.73 0.000.0585178.1884963 _cons | 5.420733.4930224 10.99 0.000 4.452244 6.389222 ------------------------------------------------------------------------------ The correlation between ASVABC and SM is positive, so the numerator of the bias term must be positive. The denominator is automatically positive since it is a sum of squares and there is some variation in ASVABC. Hence the bias should be positive.. cor SM ASVABC (obs=540) | SM ASVABC --------+------------------ SM| 1.0000 ASVABC| 0.4202 1.0000 VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

21 © Christopher Dougherty 1999–2006. reg S ASVABC Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 1, 538) = 274.19 Model | 1081.97059 1 1081.97059 Prob > F = 0.0000 Residual | 2123.01275 538 3.94612035 R-squared = 0.3376 -------------+------------------------------ Adj R-squared = 0.3364 Total | 3204.98333 539 5.94616574 Root MSE = 1.9865 ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.148084.0089431 16.56 0.000.1305165.1656516 _cons | 6.066225.4672261 12.98 0.000 5.148413 6.984036 ------------------------------------------------------------------------------ Here is the regression omitting SM. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

22 © Christopher Dougherty 1999–2006. reg S ASVABC SM ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.1328069.0097389 13.64 0.000.1136758.151938 SM |.1235071.0330837 3.73 0.000.0585178.1884963 _cons | 5.420733.4930224 10.99 0.000 4.452244 6.389222 ------------------------------------------------------------------------------. reg S ASVABC ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.148084.0089431 16.56 0.000.1305165.1656516 _cons | 6.066225.4672261 12.98 0.000 5.148413 6.984036 ------------------------------------------------------------------------------ As you can see, the coefficient of ASVABC is indeed higher when SM is omitted. Part of the difference may be due to pure chance, but part is attributable to the bias. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

23 © Christopher Dougherty 1999–2006. reg S SM Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 1, 538) = 80.93 Model | 419.086251 1 419.086251 Prob > F = 0.0000 Residual | 2785.89708 538 5.17824736 R-squared = 0.1308 -------------+------------------------------ Adj R-squared = 0.1291 Total | 3204.98333 539 5.94616574 Root MSE = 2.2756 ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- SM |.3130793.0348012 9.00 0.000.2447165.3814422 _cons | 10.04688.4147121 24.23 0.000 9.232226 10.86153 ------------------------------------------------------------------------------ Here is the regression omitting ASVABC instead of SM. We would expect b 3 to be upwards biased. We anticipate that  2 is positive and we know that both the numerator and the denominator of the other factor in the bias expression are positive. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

24 © Christopher Dougherty 1999–2006 In this case the bias is quite dramatic. The coefficient of SM has more than doubled. The reason for the bigger effect is that the variation in SM is much smaller than that in ASVABC, while  2 and  3 are similar in size, judging by their estimates (see equation on previous slide).. reg S ASVABC SM ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.1328069.0097389 13.64 0.000.1136758.151938 SM |.1235071.0330837 3.73 0.000.0585178.1884963 _cons | 5.420733.4930224 10.99 0.000 4.452244 6.389222 ------------------------------------------------------------------------------. reg S SM ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- SM |.3130793.0348012 9.00 0.000.2447165.3814422 _cons | 10.04688.4147121 24.23 0.000 9.232226 10.86153 ------------------------------------------------------------------------------ VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

25 © Christopher Dougherty 1999–2006. reg S ASVABC SM Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 2, 537) = 147.36 Model | 1135.67473 2 567.837363 Prob > F = 0.0000 Residual | 2069.30861 537 3.85346109 R-squared = 0.3543 -------------+------------------------------ Adj R-squared = 0.3519 Total | 3204.98333 539 5.94616574 Root MSE = 1.963. reg S ASVABC Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 1, 538) = 274.19 Model | 1081.97059 1 1081.97059 Prob > F = 0.0000 Residual | 2123.01275 538 3.94612035 R-squared = 0.3376 -------------+------------------------------ Adj R-squared = 0.3364 Total | 3204.98333 539 5.94616574 Root MSE = 1.9865. reg S SM Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 1, 538) = 80.93 Model | 419.086251 1 419.086251 Prob > F = 0.0000 Residual | 2785.89708 538 5.17824736 R-squared = 0.1308 -------------+------------------------------ Adj R-squared = 0.1291 Total | 3204.98333 539 5.94616574 Root MSE = 2.2756 Finally, we will investigate how R 2 behaves when a variable is omitted. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

26 © Christopher Dougherty 1999–2006 Finally, we will investigate how R 2 behaves when a variable is omitted. In the simple regression of S on ASVABC, R 2 is 0.34, and in the simple regression of S on SM it is 0.13. Does this imply that ASVABC explains 34% of the variance in S and SM 13%? No, because the multiple regression reveals that their joint explanatory power is 0.35, not 0.47. In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its apparent explanatory power. Similarly, in the third regression, SM is partly acting as a proxy for ASVABC, again inflating its apparent explanatory power. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

27 © Christopher Dougherty 1999–2006. reg LGEARN S EXP Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 2, 537) = 100.86 Model | 50.9842581 2 25.492129 Prob > F = 0.0000 Residual | 135.723385 537.252743734 R-squared = 0.2731 -------------+------------------------------ Adj R-squared = 0.2704 Total | 186.707643 539.34639637 Root MSE =.50274 ------------------------------------------------------------------------------ LGEARN | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- S |.1235911.0090989 13.58 0.000.1057173.141465 EXP |.0350826.0050046 7.01 0.000.0252515.0449137 _cons |.5093196.1663823 3.06 0.002.1824796.8361596 ------------------------------------------------------------------------------ However, it is also possible for omitted variable bias to lead to a reduction in the apparent explanatory power of a variable. This will be demonstrated using a simple earnings function model, supposing the logarithm of hourly earnings to depend on S and EXP. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

28 © Christopher Dougherty 1999–2006. reg LGEARN S EXP Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 2, 537) = 100.86 Model | 50.9842581 2 25.492129 Prob > F = 0.0000 Residual | 135.723385 537.252743734 R-squared = 0.2731 -------------+------------------------------ Adj R-squared = 0.2704 Total | 186.707643 539.34639637 Root MSE =.50274 ------------------------------------------------------------------------------ LGEARN | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- S |.1235911.0090989 13.58 0.000.1057173.141465 EXP |.0350826.0050046 7.01 0.000.0252515.0449137 _cons |.5093196.1663823 3.06 0.002.1824796.8361596 ------------------------------------------------------------------------------ If we omit EXP from the regression, the coefficient of S should be subject to a downward bias.  3 is likely to be positive. The numerator of the other factor in the bias term is negative since S and EXP are negatively correlated. The denominator is positive.. cor S EXP (obs=540) | S EXP --------+------------------ S| 1.0000 EXP| -0.2179 1.0000 VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

29 © Christopher Dougherty 1999–2006. reg LGEARN S EXP Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 2, 537) = 100.86 Model | 50.9842581 2 25.492129 Prob > F = 0.0000 Residual | 135.723385 537.252743734 R-squared = 0.2731 -------------+------------------------------ Adj R-squared = 0.2704 Total | 186.707643 539.34639637 Root MSE =.50274 ------------------------------------------------------------------------------ LGEARN | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- S |.1235911.0090989 13.58 0.000.1057173.141465 EXP |.0350826.0050046 7.01 0.000.0252515.0449137 _cons |.5093196.1663823 3.06 0.002.1824796.8361596 ------------------------------------------------------------------------------ For the same reasons, the coefficient of EXP in a simple regression of LGEARN on EXP should be downwards biased.. cor S EXP (obs=540) | S EXP --------+------------------ S| 1.0000 EXP| -0.2179 1.0000 VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

30 © Christopher Dougherty 1999–2006. reg LGEARN S EXP ------------------------------------------------------------------------------ LGEARN | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- S |.1235911.0090989 13.58 0.000.1057173.141465 EXP |.0350826.0050046 7.01 0.000.0252515.0449137 _cons |.5093196.1663823 3.06 0.002.1824796.8361596. reg LGEARN S ------------------------------------------------------------------------------ LGEARN | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- S |.1096934.0092691 11.83 0.000.0914853.1279014 _cons | 1.292241.1287252 10.04 0.000 1.039376 1.545107. reg LGEARN EXP ------------------------------------------------------------------------------ LGEARN | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- EXP |.0202708.0056564 3.58 0.000.0091595.031382 _cons | 2.44941.0988233 24.79 0.000 2.255284 2.643537 As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

31 © Christopher Dougherty 1999–2006. reg LGEARN S EXP Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 2, 537) = 100.86 Model | 50.9842581 2 25.492129 Prob > F = 0.0000 Residual | 135.723385 537.252743734 R-squared = 0.2731 -------------+------------------------------ Adj R-squared = 0.2704 Total | 186.707643 539.34639637 Root MSE =.50274. reg LGEARN S Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 1, 538) = 140.05 Model | 38.5643833 1 38.5643833 Prob > F = 0.0000 Residual | 148.14326 538.275359219 R-squared = 0.2065 -------------+------------------------------ Adj R-squared = 0.2051 Total | 186.707643 539.34639637 Root MSE =.52475. reg LGEARN EXP Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 1, 538) = 12.84 Model | 4.35309315 1 4.35309315 Prob > F = 0.0004 Residual | 182.35455 538.338948978 R-squared = 0.0233 -------------+------------------------------ Adj R-squared = 0.0215 Total | 186.707643 539.34639637 Root MSE =.58219 A comparison of R 2 for the three regressions shows that the sum of R 2 in the simple regressions is actually less than R 2 in the multiple regression. This is because the apparent explanatory power of S in the second regression has been undermined by the downwards bias in its coefficient. The same is true for the apparent explanatory power of EXP in the third equation. VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

32 Consequences of Variable Misspecification True Model Fitted Model Correct specification, no problems Correct specification, no problems Coefficients are biased (in general). Standard errors are invalid. VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE Now, we will investigate the effects of including an irrelevant variable in a regression model ( these are different from those of omitted variable misspecification.) In this case the coefficients in general remain unbiased, but they are inefficient. The standard errors remain valid, but are needlessly large. Coefficients are unbiased (in general), but inefficient. Standard errors are valid (in general)

33 © Christopher Dougherty 1999–2006 Rewrite the true model adding X 3 as an explanatory variable, with a coefficient of 0. Now the true model and the fitted model coincide. Hence b 2 will be an unbiased estimator of  2 and b 3 will be an unbiased estimator of 0. VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

34 © Christopher Dougherty 1999–2006 However, the variance of b 2 will be larger than it would have been if the correct simple regression had been run because it includes the factor 1 / (1 – r 2 ), where r is the correlation between X 2 and X 3. VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

35 © Christopher Dougherty 1999–2006 The estimator b 2 using the multiple regression model will therefore be less efficient than the alternative using the simple regression model. The intuitive reason for this is that the simple regression model exploits the information that X 3 should not be in the regression, while with the multiple regression model you find this out from the regression results. The standard errors remain valid, because the model is formally correctly specified, but they will tend to be larger than those obtained in a simple regression, reflecting the loss of efficiency. These are the results in general. Note that if X 2 and X 3 happen to be uncorrelated, there will be no loss of efficiency after all. VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

36 © Christopher Dougherty 1999–2006. reg LGFDHO LGEXP LGSIZE Source | SS df MS Number of obs = 868 ---------+------------------------------ F( 2, 865) = 460.92 Model | 138.776549 2 69.3882747 Prob > F = 0.0000 Residual | 130.219231 865.150542464 R-squared = 0.5159 ---------+------------------------------ Adj R-squared = 0.5148 Total | 268.995781 867.310260416 Root MSE =.388 ------------------------------------------------------------------------------ LGFDHO | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- LGEXP |.2866813.0226824 12.639 0.000.2421622.3312003 LGSIZE |.4854698.0255476 19.003 0.000.4353272.5356124 _cons | 4.720269.2209996 21.359 0.000 4.286511 5.154027 ------------------------------------------------------------------------------ The analysis will be illustrated using a regression of LGFDHO, the logarithm of annual household expenditure on food eaten at home, on LGEXP, the logarithm of total annual household expenditure, and LGSIZE, the logarithm of the number of persons in the household. The source of the data was the 1995 US Consumer Expenditure Survey. The sample size was 868. VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

37 © Christopher Dougherty 1999–2006. reg LGFDHO LGEXP LGSIZE LGHOUS Source | SS df MS Number of obs = 868 ---------+------------------------------ F( 3, 864) = 307.22 Model | 138.841976 3 46.2806586 Prob > F = 0.0000 Residual | 130.153805 864.150640978 R-squared = 0.5161 ---------+------------------------------ Adj R-squared = 0.5145 Total | 268.995781 867.310260416 Root MSE =.38812 ------------------------------------------------------------------------------ LGFDHO | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- LGEXP |.2673552.0370782 7.211 0.000.1945813.340129 LGSIZE |.4868228.0256383 18.988 0.000.4365021.5371434 LGHOUS |.0229611.0348408 0.659 0.510 -.0454214.0913436 _cons | 4.708772.2217592 21.234 0.000 4.273522 5.144022 ------------------------------------------------------------------------------ Now add LGHOUS, the logarithm of annual expenditure on housing services. It is safe to assume that LGHOUS is an irrelevant variable and, not surprisingly, its coefficient is not significantly different from zero. VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

38 © Christopher Dougherty 1999–2006. reg LGFDHO LGEXP LGSIZE LGHOUS Source | SS df MS Number of obs = 868 ---------+------------------------------ F( 3, 864) = 307.22 Model | 138.841976 3 46.2806586 Prob > F = 0.0000 Residual | 130.153805 864.150640978 R-squared = 0.5161 ---------+------------------------------ Adj R-squared = 0.5145 Total | 268.995781 867.310260416 Root MSE =.38812 ------------------------------------------------------------------------------ LGFDHO | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- LGEXP |.2673552.0370782 7.211 0.000.1945813.340129 LGSIZE |.4868228.0256383 18.988 0.000.4365021.5371434 LGHOUS |.0229611.0348408 0.659 0.510 -.0454214.0913436 _cons | 4.708772.2217592 21.234 0.000 4.273522 5.144022 ------------------------------------------------------------------------------ It is however highly correlated with LGEXP (correlation coefficient 0.81), and also, to a lesser extent, with LGSIZE (correlation coefficient 0.33).. cor LGHOUS LGEXP LGSIZE (obs=869) | LGHOUS LGEXP LGSIZE --------+--------------------------- lGHOUS| 1.0000 LGEXP| 0.8137 1.0000 LGSIZE| 0.3256 0.4491 1.0000 VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

39 © Christopher Dougherty 1999–2006. reg LGFDHO LGEXP LGSIZE ------------------------------------------------------------------------------ LGFDHO | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- LGEXP |.2866813.0226824 12.639 0.000.2421622.3312003 LGSIZE |.4854698.0255476 19.003 0.000.4353272.5356124 _cons | 4.720269.2209996 21.359 0.000 4.286511 5.154027 ------------------------------------------------------------------------------. reg LGFDHO LGEXP LGSIZE LGHOUS ------------------------------------------------------------------------------ LGFDHO | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- LGEXP |.2673552.0370782 7.211 0.000.1945813.340129 LGSIZE |.4868228.0256383 18.988 0.000.4365021.5371434 LGHOUS |.0229611.0348408 0.659 0.510 -.0454214.0913436 _cons | 4.708772.2217592 21.234 0.000 4.273522 5.144022 ------------------------------------------------------------------------------ Its inclusion does not cause the coefficients of those variables to be biased. VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

40 © Christopher Dougherty 1999–2006. reg LGFDHO LGEXP LGSIZE ------------------------------------------------------------------------------ LGFDHO | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- LGEXP |.2866813.0226824 12.639 0.000.2421622.3312003 LGSIZE |.4854698.0255476 19.003 0.000.4353272.5356124 _cons | 4.720269.2209996 21.359 0.000 4.286511 5.154027 ------------------------------------------------------------------------------. reg LGFDHO LGEXP LGSIZE LGHOUS ------------------------------------------------------------------------------ LGFDHO | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- LGEXP |.2673552.0370782 7.211 0.000.1945813.340129 LGSIZE |.4868228.0256383 18.988 0.000.4365021.5371434 LGHOUS |.0229611.0348408 0.659 0.510 -.0454214.0913436 _cons | 4.708772.2217592 21.234 0.000 4.273522 5.144022 ------------------------------------------------------------------------------ But it does increase their standard errors, particularly that of LGEXP, as you would expect, reflecting the loss of efficiency. VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

41 © Christopher Dougherty 1999–2006 PROXY VARIABLES Suppose that a variable Y is hypothesized to depend on a set of explanatory variables X 2,..., X k as shown above, and suppose that for some reason there are no data on X 2. As we have seen, a regression of Y on X 3,..., X k would yield biased estimates of the coefficients and invalid standard errors and tests.

42 © Christopher Dougherty 1999–2006 Sometimes, however, these problems can be reduced or eliminated by using a proxy variable in the place of X 2. A proxy variable is one that is hypothesized to be linearly related to the missing variable. In the present example, Z could act as a proxy for X 2. The validity of the proxy relationship must be justified on the basis of theory, common sense, or experience. It cannot be checked directly because there are no data on X 2. PROXY VARIABLES

43 © Christopher Dougherty 1999–2006 If a suitable proxy has been identified, the regression model can be rewritten as shown. PROXY VARIABLES

44 © Christopher Dougherty 1999–2006 We thus obtain a model with all variables observable. If the proxy relationship is an exact one, and we fit this relationship, most of the regression results will be rescued. 1.The estimates of the coefficients of X 3,..., X k will be the same as those that would have been obtained if it had been possible to regress Y on X 2,..., X k. 2.The standard errors and t statistics of the coefficients of X 3,..., X k will be the same as those that would have been obtained if it had been possible to regress Y on X 2,..., X k. 3.R 2 will be the same as it would have been if it had been possible to regress Y on X 2,..., X k. PROXY VARIABLES

45 © Christopher Dougherty 1999–2006 3.3. R 2 will be the same as it would have been if it had been possible to regress Y on X 2,..., X k. 4.The coefficient of Z will be an estimate of  2 , and so it will not be possible to obtain an estimate of  2, unless you are able to guess the value of . 5.However the t statistic for Z will be the same as that which would have been obtained for X 2 if it had been possible to regress Y on X 2,..., X k, and so you are able to assess the significance of X 2, even if you are not able to estimate its coefficient. PROXY VARIABLES

46 © Christopher Dougherty 1999–2006 6.It will not be possible to obtain an estimate of  1 since the intercept in the revised model is (  1 +  2 ), but usually  1 is of relatively little interest, anyway. It is more realistic to hypothesize that the relation between X 2 and Z is approximate, rather than exact. In that case the results listed above will hold approximately. However, if Z is a poor proxy for X 2, the results will effectively be subject to measurement error. Further, it is possible that some of the other X variables will try to act as proxies for X 2, and there will still be a problem of omitted variable bias.

47 © Christopher Dougherty 1999–2006 PROXY VARIABLES The use of a proxy variable will be illustrated with an educational attainment model. We will suppose that educational attainment depends jointly on cognitive ability and family background. As usual, ASVABC will be used as the measure of cognitive ability. However, there is no ‘family background’ variable in the data set. Indeed, it is difficult to conceive how such a variable might be defined.

48 © Christopher Dougherty 1999–2006 Instead, we will try to find a proxy. One obvious variable is the mother's educational attainment, SM. However, father's educational attainment, SF, may also be relevant. So we will hypothesize that the family background index depends on both. PROXY VARIABLES

49 © Christopher Dougherty 1999–2006 Thus we obtain a relationship expressing S as a function of ASVABC, SM, and SF. PROXY VARIABLES

50 © Christopher Dougherty 1999–2006. reg S ASVABC SM SF Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 3, 536) = 104.30 Model | 1181.36981 3 393.789935 Prob > F = 0.0000 Residual | 2023.61353 536 3.77539837 R-squared = 0.3686 -------------+------------------------------ Adj R-squared = 0.3651 Total | 3204.98333 539 5.94616574 Root MSE = 1.943 ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.1257087.0098533 12.76 0.000.1063528.1450646 SM |.0492424.0390901 1.26 0.208 -.027546.1260309 SF |.1076825.0309522 3.48 0.001.04688.1684851 _cons | 5.370631.4882155 11.00 0.000 4.41158 6.329681 ------------------------------------------------------------------------------ Here is the corresponding regression using EAEF Data Set 21. PROXY VARIABLES

51 © Christopher Dougherty 1999–2006. reg S ASVABC Source | SS df MS Number of obs = 540 -------------+------------------------------ F( 1, 538) = 274.19 Model | 1081.97059 1 1081.97059 Prob > F = 0.0000 Residual | 2123.01275 538 3.94612035 R-squared = 0.3376 -------------+------------------------------ Adj R-squared = 0.3364 Total | 3204.98333 539 5.94616574 Root MSE = 1.9865 ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.148084.0089431 16.56 0.000.1305165.1656516 _cons | 6.066225.4672261 12.98 0.000 5.148413 6.984036 ------------------------------------------------------------------------------ Here is the regression of S on ASVABC alone. PROXY VARIABLES

52 © Christopher Dougherty 1999–2006. reg S ASVABC SM SF ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.1257087.0098533 12.76 0.000.1063528.1450646 SM |.0492424.0390901 1.26 0.208 -.027546.1260309 SF |.1076825.0309522 3.48 0.001.04688.1684851 _cons | 5.370631.4882155 11.00 0.000 4.41158 6.329681 ------------------------------------------------------------------------------. reg S ASVABC ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.148084.0089431 16.56 0.000.1305165.1656516 _cons | 6.066225.4672261 12.98 0.000 5.148413 6.984036 ------------------------------------------------------------------------------ A comparison of the regressions indicates that the coefficient of ASVABC is biased upwards if we make no attempt to control for family background. PROXY VARIABLES

53 © Christopher Dougherty 1999–2006. reg S ASVABC SM SF ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.1257087.0098533 12.76 0.000.1063528.1450646 SM |.0492424.0390901 1.26 0.208 -.027546.1260309 SF |.1076825.0309522 3.48 0.001.04688.1684851 _cons | 5.370631.4882155 11.00 0.000 4.41158 6.329681 ------------------------------------------------------------------------------. reg S ASVABC ------------------------------------------------------------------------------ S | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- ASVABC |.148084.0089431 16.56 0.000.1305165.1656516 _cons | 6.066225.4672261 12.98 0.000 5.148413 6.984036 ------------------------------------------------------------------------------ This is what we should expect. Both SM and SF are likely to have positive effects on educational attainment, and they are both positively correlated with ASVABC.. cor ASVABC SM SF (obs=570) | ASVABC SM SF --------+--------------------------- ASVABC| 1.0000 SM| 0.4202 1.0000 SF| 0.4090 0.6241 1.0000 PROXY VARIABLES


Download ppt "© Christopher Dougherty 1999–2006 VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE We will now investigate the consequences of misspecifying."

Similar presentations


Ads by Google