© Christopher Dougherty 1999–2006

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

We will now investigate the consequences of misspecifying the regression model in terms of its explanatory variables. To keep the analysis simple, we will assume that there are only two possibilities: either Y depends only on X2, or it depends on both X2 and X3.

A. If Y depends only on X2 and we fit a simple regression model, we will not encounter any problems, assuming of course that the regression model assumptions are valid.

B. Likewise, we will not encounter any problems if Y depends on both X2 and X3 and we fit the multiple regression.

A. In this sequence we will examine the consequences of fitting a simple regression when the true model is multiple. The omission of a relevant explanatory variable causes the regression coefficients to be biased and the standard errors to be invalid.

B. In the next sequence we will do the opposite and examine the consequences of fitting a multiple regression when the true model is simple, that is, of including an irrelevant variable. In that case the coefficients in general remain unbiased, but they are inefficient. The standard errors remain valid, but are needlessly large.

Consequences of variable misspecification:

True model Y = β1 + β2X2 + u:
- fitted model Ŷ = b1 + b2X2: correct specification, no problems.
- fitted model Ŷ = b1 + b2X2 + b3X3: considered in the next sequence.

True model Y = β1 + β2X2 + β3X3 + u:
- fitted model Ŷ = b1 + b2X2: coefficients are biased (in general); standard errors are invalid.
- fitted model Ŷ = b1 + b2X2 + b3X3: correct specification, no problems.

In the present case, the omission of X3 causes b2 to be biased by the second term in the expression for E(b2) below (highlighted in yellow on the original slide). We will explain this first intuitively and then demonstrate it mathematically.
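The equation displayed at this point did not survive the transcript. The standard result it showed, reconstructed here for reference:

\[
\mathrm{E}(b_2) \;=\; \beta_2 \;+\; \beta_3\,\frac{\sum_i (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum_i (X_{2i}-\bar{X}_2)^2}
\]

The second term is the bias.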

[Path diagram: X2 → Y, the direct effect of X2 holding X3 constant (β2); X3 → Y, the effect of X3 (β3); and an apparent effect of X2 on Y, acting as a mimic for X3.]

The intuitive reason is that, in addition to its direct effect β2, X2 has an apparent indirect effect as a consequence of acting as a proxy for the missing X3.

The strength of the proxy effect depends on two factors: the strength of the effect of X3 on Y, which is given by β3, and the ability of X2 to mimic X3.

The ability of X2 to mimic X3 is determined by the slope coefficient obtained when X3 is regressed on X2, that is, the ratio multiplying β3 in the bias term above (the term highlighted in yellow on the slide).

We will now derive the expression for the bias mathematically. It is convenient to start by deriving an expression for the deviation of Yi about its sample mean. It can be expressed in terms of the deviations of X2, X3, and u about their sample means.
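Reconstructing the slide's equation from the true model Yi = β1 + β2X2i + β3X3i + ui (the intercept cancels when the sample mean is subtracted):

\[
Y_i - \bar{Y} \;=\; \beta_2 (X_{2i}-\bar{X}_2) \;+\; \beta_3 (X_{3i}-\bar{X}_3) \;+\; (u_i-\bar{u})
\]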

Although Y really depends on X3 as well as X2, we make a mistake and regress Y on X2 only. The slope coefficient is therefore as shown below.
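The OLS slope estimator for the misspecified simple regression, reconstructed:

\[
b_2 \;=\; \frac{\sum_i (X_{2i}-\bar{X}_2)(Y_i-\bar{Y})}{\sum_i (X_{2i}-\bar{X}_2)^2}
\]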

We substitute for the Y deviations and simplify. Hence we have demonstrated that b2 has three components.
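Carrying out the substitution (a reconstruction of the slide's algebra):

\[
b_2 \;=\; \beta_2 \;+\; \beta_3\,\frac{\sum_i (X_{2i}-\bar{X}_2)(X_{3i}-\bar{X}_3)}{\sum_i (X_{2i}-\bar{X}_2)^2} \;+\; \frac{\sum_i (X_{2i}-\bar{X}_2)(u_i-\bar{u})}{\sum_i (X_{2i}-\bar{X}_2)^2}
\]

The three components are the true value, a term reflecting the influence of the omitted variable, and an error component.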

To investigate biasedness or unbiasedness, we take the expected value of b2. The first two terms are unaffected because they contain no random components. Thus we focus on the expectation of the error term.

X2 is nonstochastic, so the denominator of the error term is nonstochastic and may be taken outside the expectation.

In the numerator, the expectation of a sum is equal to the sum of the expectations (first expected value rule).

In each product, the factor involving X2 may be taken out of the expectation because X2 is nonstochastic.

By Assumption A.3, the expected value of u is 0. It follows that the expected value of the sample mean of u is also 0. Hence the expected value of the deviation of the disturbance term from its mean is also 0.
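Putting these steps together (reconstructed):

\[
\mathrm{E}\!\left[\frac{\sum_i (X_{2i}-\bar{X}_2)(u_i-\bar{u})}{\sum_i (X_{2i}-\bar{X}_2)^2}\right]
\;=\; \frac{1}{\sum_i (X_{2i}-\bar{X}_2)^2}\,\sum_i (X_{2i}-\bar{X}_2)\,\mathrm{E}(u_i-\bar{u}) \;=\; 0
\]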

Thus we have shown that the expected value of b2 is equal to the true value plus a bias term. (Note 1: the definition of a bias is the difference between the expected value of an estimator and the true value of the parameter being estimated. Note 2: the bias will be zero if the sample correlation between X2 and X3 is zero, or if β3 = 0.) As a consequence of the misspecification, the standard errors, t tests, and F test are invalid.
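A minimal simulation sketch of this result. The data-generating process, coefficient values, and seed here are illustrative, not from the original slides:

* Hypothetical simulation of omitted variable bias
clear
set seed 12345
set obs 1000
generate X2 = rnormal()
generate X3 = 0.5*X2 + rnormal()          // X2 and X3 positively correlated
generate Y = 1 + 2*X2 + 3*X3 + rnormal()  // true model: beta2 = 2, beta3 = 3
regress Y X2 X3   // correct specification: coefficient on X2 close to 2
regress Y X2      // X3 omitted: coefficient on X2 close to 2 + 3*0.5 = 3.5

With this design the slope of X3 regressed on X2 is about 0.5, so the bias term is roughly 3 × 0.5 = 1.5 and the coefficient in the misspecified regression centres on about 3.5 rather than 2.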

. reg S ASVABC SM

[Stata output: regression of S on ASVABC and SM, 540 observations; the numerical results did not survive this transcript.]

We will illustrate the bias using an educational attainment model. To keep the analysis simple, we will assume that in the true model S depends only on ASVABC and SM. The output above shows the corresponding regression using EAEF Data Set 21.

We will run the regression a second time, omitting SM. Before we do this, we will try to predict the direction of the bias in the coefficient of ASVABC.

It is reasonable to suppose, as a matter of common sense, that β3 is positive. This assumption is strongly supported by the fact that its estimate in the multiple regression is positive and highly significant.

The correlation between ASVABC and SM is positive, so the numerator of the bias term must be positive. The denominator is automatically positive, since it is a sum of squares and there is some variation in ASVABC. Hence the bias should be positive.

. cor SM ASVABC
(obs=540)

[Correlation matrix not preserved in this transcript.]

. reg S ASVABC

[Stata output: simple regression of S on ASVABC; numerical results not preserved.]

Here is the regression omitting SM.

As you can see, the coefficient of ASVABC is indeed higher when SM is omitted. Part of the difference may be due to pure chance, but part is attributable to the bias.

. reg S SM

[Stata output: simple regression of S on SM; numerical results not preserved.]

Here is the regression omitting ASVABC instead of SM. We would expect b3 to be upwards biased. We anticipate that β2 is positive, and we know that both the numerator and the denominator of the other factor in the bias expression are positive.

In this case the bias is quite dramatic: the coefficient of SM has more than doubled. The reason for the bigger effect is that the variation in SM is much smaller than that in ASVABC, while β2 and β3 are similar in size, judging by their estimates in the multiple regression.

[Stata output: goodness-of-fit statistics for the three regressions: S on ASVABC and SM, S on ASVABC, and S on SM; numerical results not preserved.]

Finally, we will investigate how R² behaves when a variable is omitted.

In the simple regression of S on ASVABC, R² is 0.34, and in the simple regression of S on SM it is 0.13. Does this imply that ASVABC explains 34% of the variance in S and SM 13%? No, because the multiple regression reveals that their joint explanatory power is 0.35, not 0.47. In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its apparent explanatory power. Similarly, in the third regression, SM is partly acting as a proxy for ASVABC, again inflating its apparent explanatory power.

. reg LGEARN S EXP

[Stata output: regression of LGEARN on S and EXP, 540 observations; numerical results not preserved.]

However, it is also possible for omitted variable bias to lead to a reduction in the apparent explanatory power of a variable. This will be demonstrated using a simple earnings function model, supposing the logarithm of hourly earnings to depend on S and EXP.

If we omit EXP from the regression, the coefficient of S should be subject to a downward bias: β3 is likely to be positive, the numerator of the other factor in the bias term is negative, since S and EXP are negatively correlated, and the denominator is positive.

. cor S EXP
(obs=540)

[Correlation matrix not preserved in this transcript.]

For the same reasons, the coefficient of EXP in a simple regression of LGEARN on EXP should also be downwards biased.

. reg LGEARN S EXP
. reg LGEARN S
. reg LGEARN EXP

[Stata output: coefficient tables for the three regressions; numerical results not preserved.]

As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions.

[Stata output: goodness-of-fit statistics for the same three regressions; numerical results not preserved.]

A comparison of R² for the three regressions shows that the sum of R² in the simple regressions is actually less than R² in the multiple regression. This is because the apparent explanatory power of S in the second regression has been undermined by the downwards bias in its coefficient. The same is true for the apparent explanatory power of EXP in the third regression.

VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

Now we will investigate the effects of including an irrelevant variable in a regression model; these are different from those of omitted-variable misspecification. In this case the coefficients in general remain unbiased, but they are inefficient. The standard errors remain valid, but are needlessly large.

Consequences of variable misspecification:

True model Y = β1 + β2X2 + u:
- fitted model Ŷ = b1 + b2X2: correct specification, no problems.
- fitted model Ŷ = b1 + b2X2 + b3X3: coefficients are unbiased (in general) but inefficient; standard errors are valid (in general) but needlessly large.

True model Y = β1 + β2X2 + β3X3 + u:
- fitted model Ŷ = b1 + b2X2: coefficients are biased (in general); standard errors are invalid.
- fitted model Ŷ = b1 + b2X2 + b3X3: correct specification, no problems.

Rewrite the true model adding X3 as an explanatory variable, with a coefficient of 0. Now the true model and the fitted model coincide. Hence b2 will be an unbiased estimator of β2, and b3 will be an unbiased estimator of 0.
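Reconstructing the rewritten model:

\[
Y \;=\; \beta_1 + \beta_2 X_2 + u \;=\; \beta_1 + \beta_2 X_2 + 0\cdot X_3 + u
\]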

However, the variance of b2 will be larger than it would have been if the correct simple regression had been run, because it includes the factor 1/(1 − r²), where r is the correlation between X2 and X3.
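The variance expression on the slide, reconstructed (it is the standard formula for the variance of a slope coefficient in a two-regressor model):

\[
\sigma_{b_2}^2 \;=\; \frac{\sigma_u^2}{\sum_i (X_{2i}-\bar{X}_2)^2}\cdot\frac{1}{1-r_{X_2X_3}^2}
\]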

The estimator b2 using the multiple regression model will therefore be less efficient than the alternative using the simple regression model. The intuitive reason is that the simple regression model exploits the information that X3 should not be in the regression, while with the multiple regression model you find this out from the regression results. The standard errors remain valid, because the model is formally correctly specified, but they will tend to be larger than those obtained with the simple regression, reflecting the loss of efficiency. These are the results in general. Note that if X2 and X3 happen to be uncorrelated, there will be no loss of efficiency after all.
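A companion simulation sketch, again with illustrative names and values rather than anything from the original:

* Hypothetical simulation: including an irrelevant regressor
clear
set seed 12345
set obs 1000
generate X2 = rnormal()
generate X3 = 0.5*X2 + rnormal()   // correlated with X2, but irrelevant for Y
generate Y = 1 + 2*X2 + rnormal()  // true model involves X2 only
regress Y X2       // correct simple regression
regress Y X2 X3    // coefficient on X2 still close to 2, but larger standard error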

. reg LGFDHO LGEXP LGSIZE

[Stata output: regression of LGFDHO on LGEXP and LGSIZE, 868 observations; numerical results not preserved.]

The analysis will be illustrated using a regression of LGFDHO, the logarithm of annual household expenditure on food eaten at home, on LGEXP, the logarithm of total annual household expenditure, and LGSIZE, the logarithm of the number of persons in the household. The source of the data was the 1995 US Consumer Expenditure Survey. The sample size was 868.

. reg LGFDHO LGEXP LGSIZE LGHOUS

[Stata output: regression adding LGHOUS; numerical results not preserved.]

Now add LGHOUS, the logarithm of annual expenditure on housing services. It is safe to assume that LGHOUS is an irrelevant variable and, not surprisingly, its coefficient is not significantly different from zero.

LGHOUS is, however, highly correlated with LGEXP (correlation coefficient 0.81) and also, to a lesser extent, with LGSIZE (correlation coefficient 0.33).

. cor LGHOUS LGEXP LGSIZE
(obs=869)

[Correlation matrix not preserved in this transcript.]

The inclusion of LGHOUS does not cause the coefficients of LGEXP and LGSIZE to be biased.

But it does increase their standard errors, particularly that of LGEXP, as you would expect, reflecting the loss of efficiency.

PROXY VARIABLES

Suppose that a variable Y is hypothesized to depend on a set of explanatory variables X2, ..., Xk, and suppose that for some reason there are no data on X2. As we have seen, a regression of Y on X3, ..., Xk would yield biased estimates of the coefficients and invalid standard errors and tests.

Sometimes, however, these problems can be reduced or eliminated by using a proxy variable in place of X2. A proxy variable is one that is hypothesized to be linearly related to the missing variable. In the present example, Z could act as a proxy for X2. The validity of the proxy relationship must be justified on the basis of theory, common sense, or experience. It cannot be checked directly, because there are no data on X2.

If a suitable proxy has been identified, the regression model can be rewritten as shown below.
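The rewritten model did not survive the transcript. Writing the linear proxy relationship as X2 = λ + μZ (a reconstruction of the notation), substitution into the original model gives

\[
Y \;=\; \beta_1 + \beta_2(\lambda + \mu Z) + \beta_3 X_3 + \dots + \beta_k X_k + u
\;=\; (\beta_1 + \beta_2\lambda) + \beta_2\mu Z + \beta_3 X_3 + \dots + \beta_k X_k + u
\]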

We thus obtain a model with all variables observable. If the proxy relationship is an exact one and we fit this relationship, most of the regression results will be rescued.

1. The estimates of the coefficients of X3, ..., Xk will be the same as those that would have been obtained if it had been possible to regress Y on X2, ..., Xk.

2. The standard errors and t statistics of the coefficients of X3, ..., Xk will be the same as those that would have been obtained if it had been possible to regress Y on X2, ..., Xk.

3. R² will be the same as it would have been if it had been possible to regress Y on X2, ..., Xk.

4. The coefficient of Z will be an estimate of β2μ, and so it will not be possible to obtain an estimate of β2 unless you are able to guess the value of μ.

5. However, the t statistic for Z will be the same as that which would have been obtained for X2 if it had been possible to regress Y on X2, ..., Xk, so you are able to assess the significance of X2 even if you are not able to estimate its coefficient.

6. It will not be possible to obtain an estimate of β1, since the intercept in the revised model is (β1 + β2λ), but usually β1 is of relatively little interest anyway.

It is more realistic to hypothesize that the relationship between X2 and Z is approximate rather than exact. In that case the results listed above will hold approximately. However, if Z is a poor proxy for X2, the results will effectively be subject to measurement error. Further, it is possible that some of the other X variables will try to act as proxies for X2, and there will still be a problem of omitted variable bias.

The use of a proxy variable will be illustrated with an educational attainment model. We will suppose that educational attainment depends jointly on cognitive ability and family background. As usual, ASVABC will be used as the measure of cognitive ability. However, there is no 'family background' variable in the data set. Indeed, it is difficult to conceive how such a variable might be defined.

Instead, we will try to find a proxy. One obvious variable is the mother's educational attainment, SM. However, the father's educational attainment, SF, may also be relevant, so we will hypothesize that the family background index depends on both.

Thus we obtain a relationship expressing S as a function of ASVABC, SM, and SF, as sketched below.
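A reconstruction of this relationship, writing the unobservable family background index as FB and hypothesizing FB = δ1 + δ2 SM + δ3 SF (the δ notation is introduced here for illustration):

\[
S \;=\; \beta_1 + \beta_2\,\mathit{ASVABC} + \beta_3\,\mathit{FB} + u
\;=\; (\beta_1 + \beta_3\delta_1) + \beta_2\,\mathit{ASVABC} + \beta_3\delta_2\,\mathit{SM} + \beta_3\delta_3\,\mathit{SF} + u
\]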

. reg S ASVABC SM SF

[Stata output: regression of S on ASVABC, SM, and SF, 540 observations; numerical results not preserved.]

Here is the corresponding regression using EAEF Data Set 21.

. reg S ASVABC

[Stata output: simple regression of S on ASVABC; numerical results not preserved.]

Here is the regression of S on ASVABC alone.

A comparison of the regressions indicates that the coefficient of ASVABC is biased upwards if we make no attempt to control for family background.

This is what we should expect: both SM and SF are likely to have positive effects on educational attainment, and they are both positively correlated with ASVABC.

. cor ASVABC SM SF
(obs=570)

[Correlation matrix not preserved in this transcript.]