Presentation on theme: "5. Endogenous right hand side variables"— Presentation transcript:
5. Endogenous right hand side variables
5.1 The problem of endogeneity bias
5.2 The basic idea underlying the use of instrumental variables
5.3 When the endogenous right hand side variable is continuous
5.4 When the endogenous right hand side variable is binary

5.1 Endogeneity bias
Consider a simple OLS regression:
  Yit = a0 + a1 X1it + uit
Recall that our estimate of a1 will be unbiased only if we can assume that X1it is uncorrelated with the error term (uit).
We have discussed two ways to help ensure that this assumption is true.
First, we should control for any observable variables that affect Yit and which are correlated with X1it. For example, we should control for X2it if X2it affects Yit and X2it is correlated with X1it (see Chapter 2):
  Yit = a0 + a1 X1it + a2 X2it + uit

5.1 Endogeneity bias
Second, if we have panel data, we can control for any unobservable firm-specific characteristics (ui) that affect Yit and which are correlated with the X variables.
From Chapter 4:
  Yit = a0 + a1 X1it + a2 X2it + ui + eit
We control for the correlations between ui and the X variables by estimating fixed effects models.
Our estimates of a1 and a2 are unbiased if the X variables are uncorrelated with eit. In this case, we say that the X variables are “exogenous”.

5.1 Endogeneity bias
Unfortunately, multiple regression and fixed effects models do not always ensure that the X variables are uncorrelated with the error term:
- If we do not observe all the variables that affect Y and that are correlated with X, multiple regression will not solve the problem.
- If we do not have panel data, the fixed effects models cannot be estimated.
- Even if we have panel data, the Y and X variables may display little variation over time, in which case the fixed effects models can be unreliable (Zhou, 2001).
- Even if we have panel data and the Y and X variables display sufficient variation over time, the unobservable variables that are correlated with X may not be constant over time, in which case the fixed effects models will not solve the problem.

A variable is more likely to be correlated with the error term if it is “endogenous”.
“Endogenous” means that the variable is determined within the economic model that we are trying to estimate.
For example, suppose that Y2it is an endogenous explanatory variable:
  Y1it = a0 + a1 Y2it + a2 Xit + uit (1)
  Y2it = b0 + b1 Xit + b2 Zit + vit (2)
Equations (1) and (2) have a “triangular” structure since Y2it is assumed to affect Y1it, but Y1it is assumed not to affect Y2it.
Given this triangular structure, the OLS estimate of a1 in equation (1) is unbiased only if vit is uncorrelated with uit.
If vit is correlated with uit, then Y2it is correlated with uit, which means that the OLS estimate of a1 would be biased.
To avoid this bias, we must estimate equation (1) using “instrumental variables” (IV) regression rather than OLS.

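The bias described on this slide can be seen in a small simulation. The do-file sketch below generates data from equations (1) and (2) with correlated error terms and compares OLS with IV. The parameter values, sample size, seed, and lowercase variable names (y1, y2, x, z) are illustrative assumptions and are not part of the course data.

```stata
* Illustrative simulation of the triangular system in eqs. (1) and (2).
* All numeric values below are assumptions chosen for the example.
clear
set seed 12345
set obs 10000
generate double x = rnormal()
generate double z = rnormal()
generate double u = rnormal()
* v is correlated with u (correlation 0.8), so y2 is endogenous in eq. (1)
generate double v = 0.8*u + 0.6*rnormal()
* eq. (2) with b0 = 1, b1 = 0.5, b2 = 1
generate double y2 = 1 + 0.5*x + 1*z + v
* eq. (1) with a0 = 2, a1 = 1, a2 = 0.5
generate double y1 = 2 + 1*y2 + 0.5*x + u
* OLS ignores the correlation between y2 and u
regress y1 y2 x
scalar a1_ols = _b[y2]
* IV uses z, which is excluded from eq. (1), as the instrument
ivregress 2sls y1 x (y2 = z)
scalar a1_iv = _b[y2]
display "true a1 = 1, OLS = " a1_ols ", IV = " a1_iv
```

Under these assumed values the OLS estimate converges to about 1.4 in large samples (the bias is cov(z + v, u) / var(z + v) = 0.8 / 2), while the IV estimate should be close to the true value of 1.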
Equations (1) and (2) are called “structural” equations because they describe the economic relationship between Y1it and Y2it.
We can obtain a “reduced-form” equation by substituting eq. (2) into eq. (1):
  Y1it = a0 + a1 (b0 + b1 Xit + b2 Zit + vit) + a2 Xit + uit
In this “reduced-form” equation, all the explanatory variables (Xit and Zit) are exogenous.
The basic idea underlying IV regression is to remove vit from the Y1it model so that our estimate of a1 is unbiased.

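Collecting terms in the substituted equation makes the reduced form explicit. The block below only rearranges the equation on this slide; it introduces no new symbols.

```latex
\begin{align*}
Y_{1it} &= a_0 + a_1\,(b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it} \\
        &= (a_0 + a_1 b_0) + (a_1 b_1 + a_2)\,X_{it} + a_1 b_2\,Z_{it} + (a_1 v_{it} + u_{it})
\end{align*}
```

Written this way, the coefficient on Zit is a1 b2 and the composite error term is a1 vit + uit, which is why removing vit from the Y1it model matters.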
5.2 The basic idea underlying the use of instrumental variables
Note that vit is removed from the Y1it model if we use the predicted rather than the actual values of Y2it on the right hand side.
We predict Y2it using all the exogenous variables in the system (in our example, we use the two exogenous variables Xit and Zit).

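5.2 The basic idea
We then use the predicted rather than the actual values of Y2it when estimating the Y1it model:
  Y1it = a0 + a1 (b0 + b1 Xit + b2 Zit + vit) + a2 Xit + uit (3)
  Y1it = a0 + a1 (b0 + b1 Xit + b2 Zit) + a2 Xit + uit (4)
The a1 estimate is biased in eq. (3) but it is unbiased in eq. (4) because the vit term has been removed.
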
In eq. (4) the estimated coefficient for the Zit variable is a1 b2.
We already know the value of b2 from eq. (2).
Therefore:
  a1 = (a1 b2) / b2
It is important to note that the a1 coefficient can be estimated only if there is at least one exogenous variable in the structural model for Y2it that is excluded from the structural model for Y1it.
This is the Zit variable in eq. (2).

In eq. (4) the a1 coefficient is “just” identified because there is only one exogenous variable (Zit) that is in the Y2it model and that is excluded from the Y1it model.

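The prediction-and-substitution logic of section 5.2 can be reproduced by hand. The do-file sketch below simulates the just-identified system of equations (1) and (2), then computes a1 three ways: by a manual two-step regression, by the ratio (a1 b2) / b2, and with ivregress. The data-generating values and variable names are assumptions made for illustration.

```stata
* Illustrative data from eqs. (1) and (2); all numeric values are assumptions.
clear
set seed 2024
set obs 10000
generate double x = rnormal()
generate double z = rnormal()
generate double u = rnormal()
generate double v = 0.8*u + 0.6*rnormal()
generate double y2 = 1 + 0.5*x + 1*z + v
generate double y1 = 2 + 1*y2 + 0.5*x + u
* Step 1: predict y2 from all exogenous variables (x and z)
regress y2 x z
scalar b2_hat = _b[z]
predict double y2hat, xb
* Step 2: replace y2 with its predicted value in the y1 model
regress y1 y2hat x
scalar a1_twostep = _b[y2hat]
* Ratio form: coefficient on z in the reduced form for y1, divided by b2
regress y1 x z
scalar a1_ratio = _b[z] / b2_hat
* One-step 2SLS for comparison
ivregress 2sls y1 x (y2 = z)
scalar a1_2sls = _b[y2]
display "two-step = " a1_twostep ", ratio = " a1_ratio ", ivregress = " a1_2sls
```

In a just-identified model the three point estimates coincide. The standard errors from the manual second step are not valid, because they are based on residuals computed with the predicted rather than the actual y2; ivregress makes that correction.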
Suppose we had included Zit in both models:
  Y1it = a0 + a1 Y2it + a2 Xit + a3 Zit + uit
  Y2it = b0 + b1 Xit + b2 Zit + vit
In this case, the a1 coefficient cannot be identified because we estimate only (a3 + a1 b2) and b2.
In other words, we cannot determine whether the effect of Zit on Y1it is a main effect (a3) or an indirect effect through Y2it (a1 b2).
Here we say that the system of equations is “under-identified”.

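Suppose we had included two exogenous variables in the Y2it model and we excluded both these variables from the Y1it model.
Now we have estimates of a1 b2, a1 b3, b2, and b3, where b2 and b3 are the coefficients of the two excluded variables in the Y2it model.
Therefore a1 can be obtained in two ways: a1 = (a1 b2) / b2 and a1 = (a1 b3) / b3.
Here we say that the system of equations is “over-identified”.
In this example, the system is “triangular” because there are two equations and one endogenous right-hand side variable.
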
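The over-identified case can be written out in the same way as the reduced form. The names Z1it and Z2it for the two excluded variables, and the shorthand π1 and π2 for their reduced-form coefficients, are assumptions introduced here for illustration.

```latex
\begin{align*}
Y_{2it} &= b_0 + b_1 X_{it} + b_2 Z_{1it} + b_3 Z_{2it} + v_{it} \\
Y_{1it} &= (a_0 + a_1 b_0) + (a_1 b_1 + a_2)\,X_{it}
          + \underbrace{a_1 b_2}_{\pi_1} Z_{1it}
          + \underbrace{a_1 b_3}_{\pi_2} Z_{2it}
          + (a_1 v_{it} + u_{it}) \\
a_1 &= \frac{\pi_1}{b_2} = \frac{\pi_2}{b_3}
\end{align*}
```

With one endogenous regressor and two excluded variables there is one more restriction than is needed to recover a1. In a finite sample the two ratios will generally differ, and it is this surplus restriction that makes a test of instrument validity possible.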
5.3 When the endogenous right hand side variable is continuous
When the models have a triangular structure, they can be estimated using the ivregress command.
The models can be estimated using 2SLS, LIML, or GMM.
2SLS is more commonly used in practice.

5.3.1 Estimating triangular models using 2SLS (ivregress)
Go to MySite.
Open up the housing.dta file, which provides data from 50 U.S. states (1980 Census):
  use "J:\phd\housing.dta", clear
pct_population_urban = the % of the population that lives in urban areas
family_income = median annual family income
housing_value = median value of private housing
rent = median monthly housing rental payments
region1 - region4 = dummy variables for four regions in the U.S.

Suppose we want to estimate the following:
  rent = a0 + a1 pct_population_urban + a2 housing_value + u
  housing_value = b0 + b1 family_income + b2 region2 + b3 region3 + b4 region4 + v
This is a triangular system because there are two equations and one endogenous right hand side variable (housing_value).
If u and v are correlated, the OLS estimate of a2 will be biased in the rent model.

If we ignore the endogeneity problem, we estimate the rent model using simple OLS:
  reg rent housing_value pct_population_urban
To take account of the potential endogeneity problem, we use the ivregress command:
  ivregress estimator depvar1 [varlist1] (depvar2 = varlistiv)
estimator is either 2sls, liml, or gmm
depvar1 is the dependent variable for the model which has an endogenous regressor
varlist1 are the exogenous variables in the model that has the endogenous regressor
depvar2 is the endogenous regressor
varlistiv are the exogenous variables that are believed to affect the endogenous regressor

The models that we want to estimate are:
  rent = a0 + a1 pct_population_urban + a2 housing_value + u
  housing_value = b0 + b1 family_income + b2 region2 + b3 region3 + b4 region4 + v
The rent model has an endogenous regressor:
  ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
  ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4)
  ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4)
The housing_value model can be estimated using OLS as there are no endogenous regressors:
  reg housing_value family_income region2 region3 region4

We should test whether:
- our chosen instruments are exogenous (i.e., they should be uncorrelated with the error term), and
- it is valid to exclude some of them from the model that has the endogenous regressor.
If they are not exogenous or they should not be excluded, they are not valid instruments.

The tests for instrument validity are also known as tests of “over-identifying” restrictions because the tests can only be performed if the model with the endogenous regressor is over-identified.
The tests assume that at least one of the chosen instruments is valid (unfortunately this assumption cannot be tested).
In our example, the instrumented housing_value variable is over-identified because four of the exogenous variables (family_income region2 region3 region4) are excluded from the rent model.
If we had excluded only one of these variables, the instrumented housing_value variable would have been “just” identified, in which case it would not be possible to test for instrument validity.

We obtain the tests for instrument validity by typing estat overid after we run ivregress:
  ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
  estat overid
These tests are statistically significant, so we reject the null hypothesis that the over-identifying restrictions hold. This means the chosen instruments are not valid.

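To see what estat overid is measuring, the Sargan statistic can be computed by hand: regress the 2SLS residuals on all the exogenous variables and multiply the R-squared by the sample size. The do-file sketch below does this on simulated data in which the three instruments are valid by construction. The data, variable names, and parameter values are assumptions made for illustration, so the result will not match the housing example.

```stata
* Illustrative over-identified model with three valid instruments (z1-z3).
* All numeric values are assumptions chosen for the example.
clear
set seed 99
set obs 5000
generate double x  = rnormal()
generate double z1 = rnormal()
generate double z2 = rnormal()
generate double z3 = rnormal()
generate double u  = rnormal()
generate double v  = 0.8*u + 0.6*rnormal()
generate double y2 = 1 + 0.5*x + z1 + 0.5*z2 + 0.5*z3 + v
generate double y1 = 2 + y2 + 0.5*x + u
ivregress 2sls y1 x (y2 = z1 z2 z3)
* Built-in tests of the over-identifying restrictions
estat overid
* By hand: regress the 2SLS residuals on all exogenous variables
predict double uhat, residuals
regress uhat x z1 z2 z3
* N*R-squared is chi-squared with
* (excluded instruments - endogenous regressors) = 3 - 1 = 2 degrees of freedom
scalar sargan = e(N) * e(r2)
scalar p_sargan = chi2tail(2, sargan)
display "Sargan chi2(2) = " sargan ", p = " p_sargan
```

The hand-computed value should agree with the Sargan line reported by estat overid. If the residuals can be explained by the instruments, at least one instrument is correlated with the error term or has been wrongly excluded.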
This is not surprising because we did not have good reason to assume that they are exogenous and validly excluded from the rent model.
For example:
- family_income is endogenous if family incomes depend on housing values and rents. Why would this be true?
- rents may be different across the four regions, so the region dummies should not be excluded from the rent model.

We obtain different statistics for the tests of instrument validity if the models are estimated using LIML or GMM.
However, the conclusions are the same as in our previous example:
  ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4)
  estat overid
  ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4)
  estat overid

Note that we cannot test for instrument validity when the endogenous regressor is just identified.
This is because the test statistics are obtained under the assumption that at least one of the instruments is valid.
For example:
  ivregress 2sls rent pct_population_urban (housing_value = family_income)
  estat overid
  ivregress liml rent pct_population_urban (housing_value = family_income)
  ivregress gmm rent pct_population_urban (housing_value = family_income)

We can also test whether the coefficient of the “endogenous” regressor is biased under OLS.
We obtain two Hausman tests for endogeneity bias by typing estat endogenous after we run ivregress:
  ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
  estat endogenous
(The Durbin statistic uses an estimate of the error term’s variance assuming that the variable being tested is exogenous, whereas the Wu-Hausman statistic assumes that the variable being tested is endogenous.)
Given these results, we may be tempted to reject the hypothesis that housing_value is exogenous.
However, the Hausman tests for endogeneity bias are only reliable if the chosen instruments are valid. In our example they are not, and so we cannot draw conclusions about the potential for endogeneity bias.

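The idea behind the endogeneity tests can be sketched with an augmented regression: save the first-stage residuals, add them to the structural model, and test whether their coefficient is zero. The do-file below is a minimal illustration on simulated data with valid instruments; the data, names, and parameter values are assumptions, and the regression-based test is shown as a close relative of the Wu-Hausman statistic rather than as what estat endogenous computes internally.

```stata
* Illustrative data; all numeric values are assumptions chosen for the example.
clear
set seed 777
set obs 5000
generate double x  = rnormal()
generate double z1 = rnormal()
generate double z2 = rnormal()
generate double u  = rnormal()
generate double v  = 0.8*u + 0.6*rnormal()
generate double y2 = 1 + 0.5*x + z1 + 0.5*z2 + v
generate double y1 = 2 + y2 + 0.5*x + u
* First stage: regress the suspect regressor on all exogenous variables
regress y2 x z1 z2
predict double vhat, residuals
* Augmented regression: add the first-stage residuals to the y1 model
regress y1 y2 x vhat
scalar c_vhat = _b[vhat]
* A significant coefficient on vhat is evidence that y2 is endogenous
test vhat
* Built-in tests for comparison
ivregress 2sls y1 x (y2 = z1 z2)
estat endogenous
```

Here the coefficient on vhat estimates cov(u, v) / var(v), which is 0.8 under the assumed values, so the test should reject exogeneity. As the slide notes, the conclusion is only as good as the instruments used in the first stage.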
Class exercise 5a
Using the fees.dta file, estimate the following models for audit fees and company size:
  lnaf = a0 + a1 lnta + a2 big6 + u
  lnta = b0 + b1 ln_age + b2 listed + v
where lnaf is the log of audit fees, lnta is the log of total assets, ln_age is the log of the company’s age in years, and listed is a dummy variable indicating whether the company’s shares are publicly traded on a market.
- Is the instrumented lnta variable over-identified, just-identified, or under-identified? Explain.
- Estimate the audit fee model using 2SLS.
- Test the validity of the chosen instrumental variables.
- Test whether the lnta variable is affected by endogeneity bias.
- Verify that the test for instrument validity is not available if you change the model so that it is just-identified.

The key to estimating IV models is to find one or more “exogenous” variables that explain the endogenous regressor and that can be safely excluded from the main equation.
Unfortunately, most accounting studies that use IV regression do not attempt to justify why their chosen instruments are exogenous or why they can be excluded from the structural model.
As a result, Larcker and Rusticus (2010) criticize the way in which accounting studies have applied IV regression.
A key problem is that the IV results can be very sensitive to the researcher’s choice of which variables to exclude from the structural model and, in many studies, these variables have been chosen in a very arbitrary way.

Larcker and Rusticus (2010) recommend that researchers justify their chosen instruments using theory or economic intuition.
The estat overid test should not be used to select instruments on purely statistical grounds, particularly as the test is invalid if all of the chosen instruments are invalid.
When testing instrument validity (estat overid) and endogeneity bias (estat endogenous), it is also important to consider your sample size:
- in large samples, the tests may reject a null hypothesis that is “nearly true”.
- in small samples, the tests may fail to reject a null hypothesis that is “very false”.

5.3.2 Estimating simultaneous equations using 3SLS (reg3)
So far we have been examining a triangular system. For example, Y2it affects Y1it but Y1it does not affect Y2it:
  Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit
  Y2it = b0 + b2 Xit + b3 Z1it + vit
In a simultaneous system, both dependent variables affect each other:
  Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit
  Y2it = b0 + b1 Y1it + b2 Xit + b3 Z1it + vit

  Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit (1)
  Y2it = b0 + b1 Y1it + b2 Xit + b3 Z1it + vit (2)
In this case, the OLS estimates are biased because:
- Eq. (1) shows that uit affects Y1it while eq. (2) shows that Y1it affects Y2it. As a result, it must be true that uit is correlated with Y2it in eq. (1). Therefore, the OLS estimate of a1 would be biased in eq. (1).
- Eq. (2) shows that vit affects Y2it while eq. (1) shows that Y2it affects Y1it. As a result, it must be true that vit is correlated with Y1it in eq. (2). Therefore, the OLS estimate of b1 would be biased in eq. (2).

For example, it seems reasonable to argue that housing values depend on rents as well as rents depending on housing values:
  rent = a0 + a1 housing_value + a2 pct_population_urban + u
  housing_value = b0 + b1 rent + b2 family_income + b3 region2 + b4 region3 + b5 region4 + v
Note that for identification, each equation must contain at least one exogenous variable that is not included in the other equation. These are:
- pct_population_urban in the rent model
- family_income, region2 - region4 in the housing_value model

We estimate this kind of model using the reg3 command:
  reg3 (depvar1 varlist1) (depvar2 varlist2)
  use "J:\phd\housing.dta", clear
  reg3 (rent = housing_value pct_population_urban) (housing_value = rent family_income region2 region3 region4)
Unfortunately, the estat overid and estat endogenous commands are not currently available with reg3.

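A simulated simultaneous system shows why OLS fails here and what reg3 recovers. The do-file sketch below builds data in which y1 and y2 affect each other, with one excluded exogenous variable per equation. The coefficient values, variable names, and the omission of constants are assumptions made to keep the example short.

```stata
* Illustrative simultaneous system; all numeric values are assumptions.
* Structural equations:
*   y1 = 0.5*y2 + x + z2 + u
*   y2 = 0.3*y1 + x + z1 + v
clear
set seed 31
set obs 10000
generate double x  = rnormal()
generate double z1 = rnormal()
generate double z2 = rnormal()
generate double u  = rnormal()
generate double v  = rnormal()
* Solve the two equations for y1 (its reduced form), then build y2
generate double y1 = (1.5*x + 0.5*z1 + z2 + u + 0.5*v) / (1 - 0.5*0.3)
generate double y2 = 0.3*y1 + x + z1 + v
* OLS on the y1 equation is biased because y2 depends on u through y1
regress y1 y2 x z2
scalar a1_ols = _b[y2]
* 3SLS treats both dependent variables as endogenous
reg3 (y1 y2 x z2) (y2 y1 x z1)
scalar a1_3sls = [y1]_b[y2]
scalar b1_3sls = [y2]_b[y1]
display "a1: true 0.5, OLS " a1_ols ", 3SLS " a1_3sls
display "b1: true 0.3, 3SLS " b1_3sls
```

The errors u and v are independent in this design, so the OLS bias comes purely from the feedback between y1 and y2. Identification relies on z2 appearing only in the y1 equation and z1 only in the y2 equation, mirroring the exclusion pattern described for the housing example.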
5.4 When the endogenous right hand side variable is binary
So far we have been dealing with the case where the endogenous regressor is continuous.
We may want to estimate a model in which the endogenous regressor is binary.
This brings us to a special class of models which are known as “self-selection” or “Heckman” models.
“Selectivity” = “Endogeneity” where the endogenous regressor is binary.
The basic idea is similar to the instrumental variable techniques that we have already discussed.

Examples of endogenous binary variables in accounting:
- Companies decide whether to use hedge contracts (Barton, 2001; Pincus and Rajgopal, 2002).
- Companies decide whether to grant stock options (Core and Guay, 1999).
- Companies decide whether to hire Big 5 or non-Big 5 auditors (e.g., Chaney et al., 2004).
- Governments decide whether to fully or partially privatize (Guedhami and Pittman, 2006).
- Companies decide whether to follow international financial reporting strategy (Leuz and Verrecchia, 2000).
- Companies decide whether to recognize financial instruments at fair value or disclose (Ahmed et al., 2006).
- Companies decide whether or not to go private (Engel et al., 2002).

Selection model
Concerns about selectivity arise when the RHS dummy variable (D) is endogenous. Here u is the error term in the model for Y and v is the error term in the model for D.
Endogeneity results in bias if E(u | D) ≠ 0.
If u and v are correlated, then E(u | D) ≠ 0, in which case the OLS estimate of the effect of D on Y would be biased.

Selection model
The intuition underlying Heckman is to estimate and then control for E(u | D).
First, model the choice of D as a function of exogenous variables.
Z is a vector of exogenous variables that affect D but have no direct effect on Y.

Selection model
Estimate E(u | D) and include it as a control variable on the RHS of the Y model:
  E(u | D) = ρσ IMR
where ρ captures the correlation between u and v, σ is the standard deviation of u, and IMR is the inverse Mills ratio obtained from the model for D.

Selection model
The MILLS variable is added as a “control for selectivity” in the Y model.
The OLS estimate of the effect of D on Y is now unbiased because E(ε | D) = 0.
The D and Y models can be estimated in two steps or estimated jointly using maximum likelihood (ML).
ML yields separate estimates of ρ and σ.
The two-step method yields an estimate of ρσ.
Under the null of no selectivity bias, ρ = 0 and ρσ = 0.

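For reference, one standard way to write the model described on the selection-model slides is shown below. The equations on the original slides did not survive in this transcript, so the coefficient names (a0, a1, a2, γ0, γ1, γ2) and the index w are assumptions; the formulas also assume that u and v are bivariate normal with Var(v) = 1.

```latex
\begin{align*}
Y_i &= a_0 + a_1 D_i + a_2 X_i + u_i \\
D_i &= 1 \ \text{if}\ w_i + v_i > 0, \quad D_i = 0 \ \text{otherwise},
      \qquad w_i = \gamma_0 + \gamma_1 X_i + \gamma_2 Z_i \\
E(u_i \mid D_i = 1) &= \rho\sigma\,\frac{\phi(w_i)}{\Phi(w_i)}, \qquad
E(u_i \mid D_i = 0) = -\rho\sigma\,\frac{\phi(w_i)}{1 - \Phi(w_i)} \\
Y_i &= a_0 + a_1 D_i + a_2 X_i + \rho\sigma\,\mathrm{MILLS}_i + \varepsilon_i
\end{align*}
```

Here φ and Φ are the standard normal density and distribution function, and MILLS takes the first ratio when D = 1 and the second (negative) ratio when D = 0. The coefficient on MILLS is ρσ, which is zero when u and v are uncorrelated.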
Class exercise 5b
We are going to look at a fictional dataset on 2,000 women.
  use "J:\phd\heckman.dta", clear
  sum age education married children wage
Suppose we believe that older and more highly educated women earn higher wages. Why would it be wrong to estimate the following model?
  reg wage age education
Estimate a probit model to test whether women are more likely to be employed if they are married, have children, are older and more highly educated.

5.4 When the endogenous right hand side variable is binary (heckman)
It is easy to estimate the two-step Heckman model in Stata:
  heckman depvar1 [varlist1], select(depvar2 = varlist2) twostep
where depvar1 is the dependent variable in the main equation, depvar2 is the dependent variable in the selection model, and varlist2 are the explanatory variables in the selection model.
Going back to our dataset on female wages:
  heckman wage education age, select(emp = married children education age) twostep

The 657 censored observations are the women who are not in employment.
The Wald chi2 tests the overall significance of the model.
Women’s wages are higher if they are older and more highly educated.
The probit model of employment is exactly the same as what we had before.
Women are more likely to be in employment if they are married, have children, are more highly educated or older.

The lambda variable is simply the IMR that was estimated from the emp model.
The IMR coefficient is 4.00 and statistically significant, so there is statistically significant evidence of a selection effect.
The IMR coefficient is the product of rho and sigma (ρσ).
Thus, 4.00 ≈ 0.67 × 5.95.

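The two steps that heckman performs can be reproduced manually, which makes clear where the lambda coefficient comes from. The do-file sketch below uses simulated data rather than heckman.dta; the variable names (x, z, emp, wage) and parameter values are assumptions, with ρ = 0.7 and σ = 2 so that the true ρσ is 1.4.

```stata
* Illustrative selection model; all numeric values are assumptions.
clear
set seed 4321
set obs 20000
generate double x = rnormal()
generate double z = rnormal()
generate double v = rnormal()
* u has standard deviation sigma = 2 and correlation rho = 0.7 with v
generate double u = 2*(0.7*v + sqrt(1 - 0.7^2)*rnormal())
* Selection equation: z affects selection but not the outcome
generate byte emp = (0.5 + 0.5*x + z + v > 0)
* The outcome is observed only for the selected observations
generate double wage = 1 + x + u if emp == 1
* Step 1: probit for selection, then the inverse Mills ratio
probit emp x z
predict double zg, xb
generate double imr = normalden(zg) / normal(zg)
* Step 2: add the IMR to the outcome regression on the selected sample
regress wage x imr if emp == 1
scalar lambda_manual = _b[imr]
display "manual lambda = " lambda_manual " (true rho*sigma = 1.4)"
* Built-in two-step estimator: compare its lambda row with lambda_manual
heckman wage x, select(emp = x z) twostep
```

The manual coefficient on imr should match the lambda reported by heckman. The manual standard errors should not be used, because the second-step regression treats the estimated IMR as if it were observed data.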
Class exercise 5c
Estimate the following audit fee models separately for Big 6 and Non-Big 6 audit clients:
  lnaf = a0 + a1 lnta + u (1)
  lnaf = a0 + a1 lnsales + u (2)
where lnaf = log of audit fees, lnta = log of total assets, lnsales = log of sales.
Use the heckman command to “control” for endogeneity with respect to the company’s selected auditor. Your auditor choice models are as follows:
  big6 = b0 + b1 lnsales + b2 lnta + v
  nbig6 = c0 + c1 lnsales + c2 lnta + w
where big6 = 1 (big6 = 0) if the company chooses a Big 6 (Non-Big 6) auditor; and nbig6 = 1 (nbig6 = 0) if the company chooses a Non-Big 6 (Big 6) auditor.

Class exercise 5c
- What exclusion restrictions are you imposing in equations (1) and (2)?
- Is there statistically significant evidence of selectivity?
- For the two different specifications of the audit fee model: what are the signs of the MILLS coefficients? What are the signs of rho?

Treatment effects model
In exercise 5c, we estimated the audit fee models separately for the Big 6 and non-Big 6 audit clients.
To do this, we used the heckman command.
Suppose that we want to estimate one audit fee model with Big 6 on the right hand side of the equation (i.e., we assume that the X coefficients have the same slope in the two equations).

Treatment effects model
We can estimate this model using the treatreg command (renamed etregress as of Stata 13):
  treatreg lnaf lnta, treat(big6 = lnta lnsales) twostep
  treatreg lnaf lnsales, treat(big6 = lnta lnsales) twostep
If we don’t specify the twostep option, we will get the ML estimates.
Sometimes the ML model will not converge due to a nonconcave likelihood function:
  treatreg lnaf lnta, treat(big6 = lnta lnsales)

Treatment effects model
The results for both the treatment effects and Heckman models can be very sensitive to the model specification.
For example, the Big 6 fee premium can easily flip signs from positive to negative:
  treatreg lnaf lnta, treat(big6 = lnta lnsales) twostep
  treatreg lnaf lnta lnsales, treat(big6 = lnta lnsales) twostep
Note that there are no exclusion restrictions (Z variables) in the second specification since lnta and lnsales appear in both the first stage and second stage models.

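The two-step treatment effects estimator can also be written out by hand, which shows how the MILLS term differs from the Heckman case: it takes one form for treated observations and another for untreated ones. The do-file sketch below uses simulated data with a valid excluded variable z; all names and parameter values are assumptions (true treatment effect 2, ρ = 0.6, σ = 1).

```stata
* Illustrative treatment effects model; all numeric values are assumptions.
clear
set seed 555
set obs 20000
generate double x = rnormal()
generate double z = rnormal()
generate double v = rnormal()
* u has sigma = 1 and correlation rho = 0.6 with v
generate double u = 0.6*v + 0.8*rnormal()
* Treatment choice: z affects d but has no direct effect on y
generate byte d = (0.5*x + z + v > 0)
* Outcome: the true effect of d on y is 2
generate double y = 1 + 2*d + x + u
* Naive OLS ignores the correlation between d and u
regress y d x
scalar b_ols = _b[d]
* Step 1: probit for the treatment choice
probit d x z
predict double w, xb
* MILLS takes a different form for treated and untreated observations
generate double mills = cond(d == 1, normalden(w)/normal(w), -normalden(w)/(1 - normal(w)))
* Step 2: add MILLS to the outcome regression on the full sample
regress y d x mills
scalar b_twostep = _b[d]
display "true effect = 2, OLS = " b_ols ", two-step = " b_twostep
```

The point estimates from the second step should correspond to what treatreg with the twostep option reports for the same specification, although the manual standard errors are not corrected for the estimated MILLS term. Dropping z from the probit in this sketch is a quick way to see how much the estimate of the treatment effect depends on the exclusion restriction.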
Exclusion restrictions
Lennox, Francis & Wang (2012) argue that many accounting studies have estimated the Heckman and treatment effects models incorrectly.
It is well recognized (in economics) that exogenous Z variables from the first stage choice model need to be validly excluded from the second stage outcome regression (Little, 1985; Little and Rubin, 1987; Manning et al., 1987).
Accounting studies have generally failed to: (a) impose exclusion restrictions, or (b) provide compelling grounds for the validity of the exclusion restrictions.

Exclusion restrictions
Economists recognize that it is important to justify why the Z’s can be validly excluded from the Y model.
For example, Angrist (1990) examines how military service affects the earnings of veteran soldiers after they are discharged from the army.
This involves a selection issue because individuals are more likely to join the military if they have poor wage offers in other types of job.
Angrist (1990) tackles the selectivity issue using data from the Vietnam era, when military service was partly determined by a draft lottery.

Exclusion restrictions
D = military service
Z = random lottery
Y = civilian earnings

Exclusion restrictions
Angrist and Evans (1998) test whether child bearing reduces female participation in the labor market.
Selectivity is an issue because women are more likely to have children rather than enter the labor market if their wage offers would be low (i.e., lower opportunity cost).
They use the sex composition of the first two children as an instrument for the decision to have a third child.

Angrist and Evans (1998): Exclusion restriction
D = decision to have a third child
Z = sex composition of first two children
Y = female participation in labor market

Exclusion restrictions
In accounting, many studies fail to justify why Z has no direct impact on Y.
Many studies do not report results for the D model, so the reader cannot evaluate the power of the Z variables for identifying selectivity.
Some studies estimate models in which there are no nominated Z variables.

Exclusion restrictions
When there are no exclusion restrictions, identification of the MILLS coefficients relies solely on the assumed non-linearity of the MILLS term.
MILLS will capture any misspecification of the functional relation between X and Y (e.g., non-linearity) in addition to any selectivity bias.

Exclusion restrictions
Little (1985): Relying on nonlinearities to identify selectivity bias is “unappealing” because it is very difficult to distinguish empirically between selectivity and misspecification of the model’s functional form.
Stata manual: “Theoretically, one does not need such identifying variables, but without them, one is depending on functional form to identify the model. It would be difficult to take such results seriously since the functional-form assumptions have no firm basis in theory.”
A failure to nominate any Z variables can worsen the problems of multicollinearity (Manning et al., 1987; Puhani, 2000; Leung and Yu, 2000).

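The multicollinearity concern can be made concrete with a short simulation. When the selection model contains only variables that also appear in the outcome model, the IMR is a smooth function of those same variables and is close to linear over much of its range. The do-file sketch below measures how much of the IMR is explained linearly by x when there is no Z variable; the data and parameter values are assumptions.

```stata
* Illustrative case with no exclusion restriction; all values are assumptions.
clear
set seed 8080
set obs 20000
generate double x = rnormal()
generate double v = rnormal()
* Selection depends on x only: there is no excluded Z variable
generate byte d = (0.5*x + v > 0)
probit d x
predict double w, xb
generate double imr = normalden(w) / normal(w)
* How much of the IMR is explained linearly by x in the selected sample?
regress imr x if d == 1
scalar r2_imr = e(r2)
display "R-squared of IMR on x = " r2_imr ", implied VIF = " 1/(1 - r2_imr)
```

An R-squared close to 1 means the IMR adds almost no independent variation to an outcome regression that already includes x, so its coefficient is identified only by the curvature of the normal distribution. Adding a strong, validly excluded Z to the probit would lower this R-squared.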
Example: Chaney, Jeter and Shivakumar (2004)
D = BIG5 (company hires a Big 5 or non-Big 5 auditor)
Y = Audit fees
Z = null set

Example: Leuz and Verrecchia (2000)
D = IR97 (international reporting)
Z = ROA, Capital intensity, UK/US listing
Y = Cost of capital

Leuz and Verrecchia (2000)
- Are the tests for selectivity bias powerful?
- Are the results sensitive to functional form? (See the free float variable.)
- LV do not report results using OLS.
- LV do not report whether their results are sensitive to alternative model specifications.

Going forward
Researchers need to be aware that Heckman and treatment effects models can provide results that are extremely fragile. Sensitivity primarily affects the RHS variable that is assumed to be endogenous (D) and the IMRs.
Studies need to discuss:
- why the Z’s are exogenous
- why the Z’s have no direct effect on Y
- whether the Z’s are powerful predictors of D
The signs and significance of the IMRs alone do not provide compelling evidence as to the direction or existence of selectivity bias.
Selection studies should routinely report tests for multicollinearity problems.

Summary
When the endogenous regressor is continuous, you can “control” for endogeneity using the ivregress or reg3 commands.
When the endogenous regressor is binary, you can “control” for endogeneity using the heckman or treatreg commands.
If you want to control for endogeneity, it is vitally important that you have a good justification for your chosen exclusion restrictions.
Choosing arbitrary exclusion restrictions will probably give you garbage results.