5. Endogenous right hand side variables

5.1 The problem of endogeneity bias
5.2 The basic idea underlying the use of instrumental variables
5.3 When the endogenous right hand side variable is continuous
5.4 When the endogenous right hand side variable is binary

5.1 Endogeneity bias
Consider a simple OLS regression:
Yit = a0 + a1 X1it + uit
Recall that our estimate of a1 will be unbiased only if we can assume that X1it is uncorrelated with the error term (uit).
We have discussed two ways to help ensure that this assumption holds.
First, we should control for any observable variables that affect Yit and are correlated with X1it. For example, we should control for X2it if X2it affects Yit and X2it is correlated with X1it (see Chapter 2):
Yit = a0 + a1 X1it + a2 X2it + uit

5.1 Endogeneity bias
Second, if we have panel data, we can control for any unobservable firm-specific characteristics (ui) that affect Yit and are correlated with the X variables. From Chapter 4:
Yit = a0 + a1 X1it + a2 X2it + ui + eit
We control for the correlations between ui and the X variables by estimating fixed effects models.
Our estimates of a1 and a2 are unbiased if the X variables are uncorrelated with eit. In this case, we say that the X variables are “exogenous”.

5.1 Endogeneity bias
Unfortunately, multiple regression and fixed effects models do not always ensure that the X variables are uncorrelated with the error term:
- If we do not observe all the variables that affect Y and that are correlated with X, multiple regression will not solve the problem.
- If we do not have panel data, the fixed effects models cannot be estimated.
- Even if we have panel data, the Y and X variables may display little variation over time, in which case the fixed effects models can be unreliable (Zhou, 2001).
- Even if we have panel data and the Y and X variables display sufficient variation over time, the unobservable variables that are correlated with X may not be constant over time, in which case the fixed effects models will not solve the problem.

A variable is more likely to be correlated with the error term if it is “endogenous”.
“Endogenous” means that the variable is determined within the economic model that we are trying to estimate.
For example, suppose that Y2it is an endogenous explanatory variable:
Y1it = a0 + a1 Y2it + a2 Xit + uit (1)
Y2it = b0 + b1 Xit + b2 Zit + vit (2)
Equations (1) and (2) have a “triangular” structure since Y2it is assumed to affect Y1it, but Y1it is assumed not to affect Y2it.
Given this triangular structure, the OLS estimate of a1 in equation (1) is unbiased only if vit is uncorrelated with uit.
If vit is correlated with uit, then Y2it is correlated with uit, which means that the OLS estimate of a1 would be biased.
To avoid this bias, we must estimate equation (1) using “instrumental variables” (IV) regression rather than OLS.
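The bias described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the parameter values (a0 = b0 = 0, a1 = 1, b2 = 1), the normal errors, the omission of Xit, and the helper names `cov`, `ols_slope` and `simulate_ols_a1` are all hypothetical choices made for illustration, not part of the course material.

```python
import random

def cov(a, b):
    """Sample covariance (divides by n, which is fine for large n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with an intercept)."""
    return cov(x, y) / cov(x, x)

def simulate_ols_a1(rho, n=20000, seed=1):
    """Simulate a triangular system and return the OLS estimate of a1.

    Hypothetical values: a1 = 1 in eq. (1), b2 = 1 in eq. (2), no X variable.
    The error in eq. (1) is u = rho * v + e, so rho controls cov(u, v).
    """
    rng = random.Random(seed)
    y1, y2 = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)            # exogenous variable in eq. (2)
        v = rng.gauss(0, 1)            # error term in eq. (2)
        u = rho * v + rng.gauss(0, 1)  # error term in eq. (1)
        y2_i = 1.0 * z + v             # eq. (2)
        y1_i = 1.0 * y2_i + u          # eq. (1)
        y2.append(y2_i)
        y1.append(y1_i)
    return ols_slope(y2, y1)

a1_uncorrelated = simulate_ols_a1(rho=0.0)  # u and v uncorrelated
a1_correlated = simulate_ols_a1(rho=0.8)    # u and v correlated
print(round(a1_uncorrelated, 2), round(a1_correlated, 2))
```

Under these assumptions the OLS slope converges to a1 + cov(Y2, u) / var(Y2) = 1 + 0.8 / 2 = 1.4 when u and v are correlated, and to the true value of 1 when they are not.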

Equations (1) and (2) are called “structural” equations because they describe the economic relationship between Y1it and Y2it.
We can obtain a “reduced-form” equation by substituting eq. (2) into eq. (1):
Y1it = a0 + a1 (b0 + b1 Xit + b2 Zit + vit) + a2 Xit + uit
In this “reduced-form” equation, all the explanatory variables (Xit and Zit) are exogenous.
The basic idea underlying IV regression is to remove vit from the Y1it model so that our estimate of a1 is unbiased.
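Collecting terms in the reduced-form equation shows which combinations of structural parameters it identifies. The grouping below is a worked rearrangement of the substitution of eq. (2) into eq. (1), using only the symbols already defined:

```latex
\begin{align*}
Y_{1it} &= a_0 + a_1 (b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it} \\
        &= (a_0 + a_1 b_0) + (a_2 + a_1 b_1) X_{it} + a_1 b_2 Z_{it} + (u_{it} + a_1 v_{it})
\end{align*}
```

The coefficient on Zit is the product a1 b2, and the composite error term contains a1 vit, which is why vit matters for the estimate of a1.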

5.2 The basic idea underlying the use of instrumental variables
Note that vit is removed from the Y1it model if we use the predicted rather than the actual values of Y2it on the right hand side.
We predict Y2it using all the exogenous variables in the system (in our example, the two exogenous variables Xit and Zit):
$\hat{Y}_{2it} = \hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it}$

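5.2 The basic idea
We then use the predicted rather than the actual values of Y2it when estimating the Y1it model:
$Y_{1it} = a_0 + a_1 (b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it}$ (3)
$Y_{1it} = a_0 + a_1 (\hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it}) + a_2 X_{it} + u_{it}$ (4)
The a1 estimate is biased in eq. (3) but it is unbiased in eq. (4) because the vit term has been removed.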

In eq. (4) the estimated coefficient for the Zit variable is $a_1 \hat{b}_2$.
We already know the value of $\hat{b}_2$ from eq. (2). Therefore:
$\hat{a}_1 = (a_1 \hat{b}_2) / \hat{b}_2$
It is important to note that the a1 coefficient can be estimated only if there is at least one exogenous variable in the structural model for Y2it that is excluded from the structural model for Y1it.
This is the Zit variable in eq. (2).

In eq. (4) the a1 coefficient is “just” identified because there is only one exogenous variable (Zit) that is in the Y2it model and that is excluded from the Y1it model.
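The just-identified case can be sketched numerically. The simulation below is illustrative only: the true values (a1 = 1, b2 = 0.5), the omission of Xit, and the helper names are assumptions. It compares OLS with two ways of recovering a1: regressing Y1 on the predicted values of Y2, and dividing the coefficient on Z in the Y1 reduced form by the first-stage coefficient on Z.

```python
import random

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    """Sample covariance (divides by n)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

rng = random.Random(2)
n = 20000
A1, B2 = 1.0, 0.5  # hypothetical true values of a1 and b2

z, y1, y2 = [], [], []
for _ in range(n):
    z_i = rng.gauss(0, 1)
    v = rng.gauss(0, 1)
    u = 0.8 * v + rng.gauss(0, 1)  # u is correlated with v, so OLS is biased
    y2_i = B2 * z_i + v            # eq. (2) without X
    z.append(z_i)
    y2.append(y2_i)
    y1.append(A1 * y2_i + u)       # eq. (1) without X

# First stage: regress Y2 on Z and form the predicted values of Y2.
b2_hat = cov(z, y2) / cov(z, z)
b0_hat = mean(y2) - b2_hat * mean(z)
y2_hat = [b0_hat + b2_hat * z_i for z_i in z]

# Second stage: regress Y1 on the predicted values of Y2.
a1_two_stage = cov(y2_hat, y1) / cov(y2_hat, y2_hat)

# Ratio approach: the coefficient on Z in the Y1 reduced form estimates a1*b2.
a1b2_hat = cov(z, y1) / cov(z, z)
a1_ratio = a1b2_hat / b2_hat

# OLS on the actual values of Y2, for comparison.
a1_ols = cov(y2, y1) / cov(y2, y2)

print(round(a1_ols, 2), round(a1_two_stage, 2), round(a1_ratio, 2))
```

With one instrument and one endogenous regressor, the two-stage and ratio estimates coincide up to floating-point error, while the OLS estimate is pulled away from the true value of 1.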

Suppose we had included Zit in both models:
Y1it = a0 + a1 Y2it + a2 Xit + a3 Zit + uit
Y2it = b0 + b1 Xit + b2 Zit + vit
In this case, the a1 coefficient cannot be identified because we estimate only (a3 + a1 b2) and b2.
In other words, we cannot determine whether the effect of Zit on Y1it is a main effect (a3) or an indirect effect through Y2it (a1 b2).
Here we say that the system of equations is “under-identified”.
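The under-identification argument can be written out by substitution. The sketch below assumes, as the text describes, that Zit enters the Y1it model with coefficient a3 while eq. (2) is unchanged:

```latex
\begin{align*}
Y_{1it} &= a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{it} + u_{it} \\
        &= (a_0 + a_1 b_0) + (a_2 + a_1 b_1) X_{it} + (a_3 + a_1 b_2) Z_{it} + (u_{it} + a_1 v_{it})
\end{align*}
```

The data deliver one number for the coefficient on Zit, namely a3 + a1 b2, and one number for b2. That is a single equation in the two unknowns a1 and a3, so neither can be recovered separately.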

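Suppose we had included two exogenous variables (Z1it and Z2it) in the Y2it model and excluded both of these variables from the Y1it model:
$Y_{2it} = b_0 + b_1 X_{it} + b_2 Z_{1it} + b_3 Z_{2it} + v_{it}$
Now we have estimates of $a_1 b_2$, $a_1 b_3$, $b_2$, and $b_3$. Therefore a1 can be recovered in two ways:
$a_1 = (a_1 b_2) / b_2 = (a_1 b_3) / b_3$
Here we say that the system of equations is “over-identified”.
In this example, the system is “triangular” because there are two equations and one endogenous right-hand side variable.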

5.3 When the endogenous right hand side variable is continuous
When the models have a triangular structure, they can be estimated using the ivregress command.
The models can be estimated using 2SLS, LIML or GMM.
2SLS is more commonly used in practice.

5.3.1 Estimating triangular models using 2SLS (ivregress)
Go to MySite.
Open up the housing.dta file, which provides data from 50 U.S. states (1980 Census):
    use "J:\phd\housing.dta", clear
- pct_population_urban = the % of the population that lives in urban areas
- family_income = median annual family income
- housing_value = median value of private housing
- rent = median monthly housing rental payments
- region1 to region4 = dummy variables for four regions in the U.S.

Suppose we want to estimate the following:
rent = a0 + a1 pct_population_urban + a2 housing_value + u
housing_value = b0 + b1 family_income + b2 region2 + b3 region3 + b4 region4 + v
This is a triangular system because there are two equations and one endogenous right hand side variable (housing_value).
If u and v are correlated, the OLS estimate of a2 will be biased in the rent model.

If we ignore the endogeneity problem and estimate the rent model using simple OLS:
    reg rent housing_value pct_population_urban
To take account of the potential endogeneity problem we use the ivregress command:
    ivregress estimator depvar1 [varlist1] (depvar2 = varlistiv)
- estimator is either 2sls, liml or gmm
- depvar1 is the dependent variable for the model that has an endogenous regressor
- varlist1 are the exogenous variables in the model that has the endogenous regressor
- depvar2 is the endogenous regressor
- varlistiv are the exogenous variables that are believed to affect the endogenous regressor

The models that we want to estimate are:
rent = a0 + a1 pct_population_urban + a2 housing_value + u
housing_value = b0 + b1 family_income + b2 region2 + b3 region3 + b4 region4 + v
The rent model has an endogenous regressor:
    ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
    ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4)
    ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4)
The housing_value model can be estimated using OLS as there are no endogenous regressors:
    reg housing_value family_income region2 region3 region4

We should test whether:
- our chosen instruments are exogenous (i.e., they should be uncorrelated with the error term), and
- it is valid to exclude some of them from the model that has the endogenous regressor.
If the instruments are not exogenous, or they should not be excluded, they are not valid instruments.

The tests for instrument validity are also known as tests of “over-identifying” restrictions because:
- the tests can only be performed if the model with the endogenous regressor is over-identified
- the tests assume that at least one of the chosen instruments is valid (unfortunately this assumption cannot be tested)
In our example, the instrumented housing_value variable is over-identified because four of the exogenous variables (family_income region2 region3 region4) are excluded from the rent model.
If we had excluded only one of these variables, the instrumented housing_value variable would have been “just” identified, in which case it would not be possible to test for instrument validity.

We obtain the tests for instrument validity by typing estat overid after we run ivregress
    ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
    estat overid
The test statistics are statistically significant, so we reject the null hypothesis that the chosen instruments are valid.
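To give some intuition for what an over-identification test measures, the sketch below computes a Sargan-type statistic by hand on simulated data: n times the R-squared from regressing the 2SLS residuals on the instruments. Everything here is an assumption made for illustration (two instruments, no exogenous controls, the parameter values, the helper names), and it is not claimed to reproduce the exact statistics that estat overid reports.

```python
import random

def cov(a, b):
    """Sample covariance (divides by n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def ols2(x1, x2, y):
    """Slopes from an OLS regression of y on x1 and x2 (intercept implied)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

def sargan_statistic(direct_effect, n=5000, seed=3):
    """n * R^2 from regressing 2SLS residuals on the two instruments.

    direct_effect is the (hypothetical) coefficient of z2 in the Y1 model;
    a non-zero value means z2 is wrongly excluded, i.e. an invalid instrument.
    """
    rng = random.Random(seed)
    z1 = [rng.gauss(0, 1) for _ in range(n)]
    z2 = [rng.gauss(0, 1) for _ in range(n)]
    y1, y2 = [], []
    for a, b in zip(z1, z2):
        v = rng.gauss(0, 1)
        u = 0.8 * v + rng.gauss(0, 1)
        y2_i = a + b + v
        y2.append(y2_i)
        y1.append(1.0 * y2_i + direct_effect * b + u)
    # 2SLS: first stage on both instruments, then second stage on fitted values.
    c1, c2 = ols2(z1, z2, y2)
    y2_hat = [c1 * a + c2 * b for a, b in zip(z1, z2)]
    a1 = cov(y2_hat, y1) / cov(y2_hat, y2_hat)
    # Residuals use the ACTUAL Y2; the intercept is dropped because cov() demeans.
    resid = [p - a1 * q for p, q in zip(y1, y2)]
    g1, g2 = ols2(z1, z2, resid)
    fitted = [g1 * a + g2 * b for a, b in zip(z1, z2)]
    return n * cov(fitted, fitted) / cov(resid, resid)

CHI2_1_CRITICAL_5PCT = 3.84  # one over-identifying restriction
stat_valid = sargan_statistic(direct_effect=0.0)
stat_invalid = sargan_statistic(direct_effect=0.5)
print(round(stat_valid, 2), round(stat_invalid, 2), CHI2_1_CRITICAL_5PCT)
```

With two instruments and one endogenous regressor there is one over-identifying restriction, so the statistic is compared with a chi-squared distribution with one degree of freedom. When one instrument has a direct effect on Y1, the residuals remain correlated with the instruments and the statistic becomes large.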

This is not surprising because we did not have good reason to assume that they are exogenous and validly excluded from the rent model. For example:
- family_income is endogenous if family incomes depend on housing values and rents. Why would this be true?
- rents may be different across the four regions, so the region dummies should not be excluded from the rent model.

We obtain different statistics for the tests of instrument validity if the models are estimated using LIML or GMM.
However, the conclusions are the same as in our previous example:
    ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4)
    estat overid
    ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4)
    estat overid

Note that we cannot test for instrument validity when the endogenous regressor is just identified
This is because the test statistics are obtained under the assumption that at least one of the instruments is valid. For example:
    ivregress 2sls rent pct_population_urban (housing_value = family_income)
    estat overid
    ivregress liml rent pct_population_urban (housing_value = family_income)
    estat overid
    ivregress gmm rent pct_population_urban (housing_value = family_income)
    estat overid

We can also test whether the coefficient of the “endogenous” regressor is biased under OLS.
We obtain two Hausman tests for endogeneity bias by typing estat endogenous after we run ivregress:
    ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
    estat endogenous
(The Durbin statistic uses an estimate of the error term’s variance assuming that the variable being tested is exogenous, whereas the Wu-Hausman statistic assumes that the variable being tested is endogenous.)
Given these results, we may be tempted to reject the hypothesis that housing_value is exogenous.
However, the Hausman tests for endogeneity bias are only reliable if the chosen instruments are valid. In our example they are not, and so we cannot draw conclusions about the potential for endogeneity bias.
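One regression-based way to think about an endogeneity test is the control-function idea: add the first-stage residual to the Y1 regression and look at its coefficient. The sketch below is an illustration under assumed parameter values and hypothetical helper names; it does not compute the Durbin or Wu-Hausman statistics themselves, only the coefficient on the first-stage residual.

```python
import random

def cov(a, b):
    """Sample covariance (divides by n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def ols2(x1, x2, y):
    """Slopes from an OLS regression of y on x1 and x2 (intercept implied)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

def control_function(rho, n=20000, seed=4):
    """Return (a1 from control function, coefficient on v_hat, a1 from IV)."""
    rng = random.Random(seed)
    z, y1, y2 = [], [], []
    for _ in range(n):
        z_i = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        u = rho * v + rng.gauss(0, 1)  # rho = 0 means Y2 is exogenous
        y2_i = z_i + v
        z.append(z_i)
        y2.append(y2_i)
        y1.append(1.0 * y2_i + u)
    # First-stage residual (the intercept is dropped because cov() demeans).
    b_hat = cov(z, y2) / cov(z, z)
    v_hat = [q - b_hat * p for p, q in zip(z, y2)]
    # Regress Y1 on Y2 and the first-stage residual.
    a1_cf, theta = ols2(y2, v_hat, y1)
    a1_iv = cov(z, y1) / cov(z, y2)
    return a1_cf, theta, a1_iv

a1_cf_endog, theta_endog, a1_iv_endog = control_function(rho=0.8)
a1_cf_exog, theta_exog, a1_iv_exog = control_function(rho=0.0)
print(round(theta_endog, 2), round(theta_exog, 2))
```

The coefficient on the residual is close to zero when Y2 is exogenous and clearly non-zero when u and v are correlated. The coefficient on Y2 in this augmented regression matches the IV estimate, which is why a test on the residual is also a comparison of OLS with IV.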

Class exercise 5a
Using the fees.dta file, estimate the following models for audit fees and company size:
lnaf = a0 + a1 lnta + a2 big6 + u
lnta = b0 + b1 ln_age + b2 listed + v
where lnaf is the log of audit fees, lnta is the log of total assets, ln_age is the log of the company’s age in years, and listed is a dummy variable indicating whether the company’s shares are publicly traded on a market.
- Is the instrumented lnta variable over-identified, just-identified, or under-identified? Explain.
- Estimate the audit fee model using 2SLS.
- Test the validity of the chosen instrumental variables.
- Test whether the lnta variable is affected by endogeneity bias.
- Verify that the test for instrument validity is not available if you change the model so that it is just-identified.

The key to estimating IV models is to find one or more “exogenous” variables that explain the endogenous regressor and that can be safely excluded from the main equation.
Unfortunately, most accounting studies that use IV regression do not attempt to justify why their chosen instruments are exogenous or why they can be excluded from the structural model.
As a result, Larcker and Rusticus (2010) criticize the way in which accounting studies have applied IV regression.
A key problem is that the IV results can be very sensitive to the researcher’s choice of which variables to exclude from the structural model and, in many studies, these variables have been chosen in a very arbitrary way.

Larcker and Rusticus (2010) recommend that researchers justify their chosen instruments using theory or economic intuition.
The estat overid test should not be used to select instruments on purely statistical grounds, particularly as the test is invalid if all of the chosen instruments are also invalid.
When testing instrument validity (estat overid) and endogeneity bias (estat endogenous), it is also important to consider your sample size:
- in large samples, the tests may reject a null hypothesis that is “nearly true”.
- in small samples, the tests may fail to reject a null hypothesis that is “very false”.

5.3.2 Estimating simultaneous equations using 3SLS (reg3)
So far we have been examining a triangular system. For example, Y2it affects Y1it but Y1it does not affect Y2it:
Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit
Y2it = b0 + b2 Xit + b3 Z1it + vit
In a simultaneous system, both dependent variables affect each other:
Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit
Y2it = b0 + b1 Y1it + b2 Xit + b3 Z1it + vit

Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit (1)
Y2it = b0 + b1 Y1it + b2 Xit + b3 Z1it + vit (2)
In this case, the OLS estimates are biased because:
- Eq. (1) shows that uit affects Y1it while eq. (2) shows that Y1it affects Y2it. As a result, it must be true that uit is correlated with Y2it in eq. (1). Therefore, the OLS estimate of a1 would be biased in eq. (1).
- Eq. (2) shows that vit affects Y2it while eq. (1) shows that Y2it affects Y1it. As a result, it must be true that vit is correlated with Y1it in eq. (2). Therefore, the OLS estimate of b1 would be biased in eq. (2).
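Simultaneity bias can be sketched with simulated data, even when uit and vit are independent of each other. The example below is illustrative only: it drops Xit and the intercepts, picks hypothetical values (a1 = b1 = 0.5, a3 = b3 = 1), and uses single-equation two-stage least squares rather than the 3SLS estimator.

```python
import random

def cov(a, b):
    """Sample covariance (divides by n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def ols2(x1, x2, y):
    """Slopes from an OLS regression of y on x1 and x2 (intercept implied)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

A1, A3 = 0.5, 1.0  # hypothetical coefficients in the Y1 equation
B1, B3 = 0.5, 1.0  # hypothetical coefficients in the Y2 equation
rng = random.Random(5)
n = 20000
z1, z2, y1, y2 = [], [], [], []
for _ in range(n):
    z1_i, z2_i = rng.gauss(0, 1), rng.gauss(0, 1)
    u, v = rng.gauss(0, 1), rng.gauss(0, 1)  # independent errors
    # Solve the two structural equations jointly for Y1 and Y2.
    d = 1.0 - A1 * B1
    y1_i = (A1 * (B3 * z1_i + v) + A3 * z2_i + u) / d
    y2_i = (B1 * (A3 * z2_i + u) + B3 * z1_i + v) / d
    z1.append(z1_i)
    z2.append(z2_i)
    y1.append(y1_i)
    y2.append(y2_i)

# OLS of Y1 on Y2 and Z2: biased because Y2 depends on u through Y1.
a1_ols, _ = ols2(y2, z2, y1)

# Two-stage estimate: Z1 is excluded from the Y1 equation and serves as the instrument.
c1, c2 = ols2(z1, z2, y2)
y2_hat = [c1 * p + c2 * q for p, q in zip(z1, z2)]
a1_2sls, _ = ols2(y2_hat, z2, y1)

print(round(a1_ols, 2), round(a1_2sls, 2))
```

Under these assumptions the OLS estimate of a1 converges to roughly 0.67 rather than 0.5, whereas the two-stage estimate is close to 0.5. The instrument for Y2 in the Y1 equation is Z1, the exogenous variable that appears only in the Y2 equation.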

For example, it seems reasonable to argue that housing values depend on rents as well as rents depending on housing values:
rent = a0 + a1 housing_value + a2 pct_population_urban + u
housing_value = b0 + b1 rent + b2 family_income + b3 region2 + b4 region3 + b5 region4 + v
Note that for identification, each equation must contain at least one exogenous variable that is not included in the other equation. These are:
- pct_population_urban in the rent model
- family_income, region2 to region4 in the housing_value model

We estimate this kind of model using the reg3 command
    reg3 (depvar1 varlist1) (depvar2 varlist2)
    use "J:\phd\housing.dta", clear
    reg3 (rent = housing_value pct_population_urban) (housing_value = rent family_income region2 region3 region4)
Unfortunately, the estat overid and estat endogenous commands are not currently available with reg3.

5.4 When the endogenous right hand side variable is binary
So far we have been dealing with the case where the endogenous regressor is continuous.
We may want to estimate a model in which the endogenous regressor is binary.
This brings us to a special class of models which are known as “self-selection” or “Heckman” models.
“Selectivity” = “Endogeneity” where the endogenous regressor is binary.
The basic idea is similar to the instrumental variable techniques that we have already discussed.

Examples of endogenous binary variables in accounting:
- Companies decide whether to use hedge contracts (Barton, 2001; Pincus and Rajgopal, 2002).
- Companies decide whether to grant stock options (Core and Guay, 1999).
- Companies decide whether to hire Big 5 or non-Big 5 auditors (e.g., Chaney et al., 2004).
- Governments decide whether to fully or partially privatize (Guedhami and Pittman, 2006).
- Companies decide whether to follow an international financial reporting strategy (Leuz and Verrecchia, 2000).
- Companies decide whether to recognize financial instruments at fair value or disclose (Ahmed et al., 2006).
- Companies decide whether or not to go private (Engel et al., 2002).

Selection model Concerns about selectivity arise when the RHS dummy variable (D) is endogenous: Endogeneity results in bias if E(u | D) ≠ 0. If u and v are correlated, then E(u | D) ≠ 0, in which case the OLS estimate of the effect of D on Y would be biased.

Selection model The intuition underlying Heckman is to estimate and then control for E(u | D). First model the choice of D: Z is a vector of exogenous variables that affect D but have no direct effect on Y.

Selection model (diagram): Z affects D, and D affects Y; Z has no direct effect on Y.

Selection model
Estimate E(u | D) and include it as a control variable on the RHS of the Y model:
E(u | D) = ρσ IMR
where ρ captures the correlation between u and v, σ is the standard deviation of u, and IMR is the inverse Mills ratio.
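For reference, the inverse Mills ratio can be computed as below. This is a sketch that assumes the standard probit first stage, in which D = 1 when an index (written here as `index`, the fitted value of the Z variables times their coefficients) plus a standard normal error v is positive. The function name and argument names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def inverse_mills_ratio(index, d):
    """IMR for one observation under an assumed probit model for D.

    index: the first-stage linear index for the observation.
    d:     the observed choice (1 or 0).
    E(v | D = 1) = pdf(index) / cdf(index)
    E(v | D = 0) = -pdf(index) / (1 - cdf(index))
    """
    pdf = _STD_NORMAL.pdf(index)
    cdf = _STD_NORMAL.cdf(index)
    if d == 1:
        return pdf / cdf
    return -pdf / (1.0 - cdf)

imr_chosen = inverse_mills_ratio(0.0, 1)
imr_not_chosen = inverse_mills_ratio(0.0, 0)
print(round(imr_chosen, 4), round(imr_not_chosen, 4))
```

The ratio is positive for observations with D = 1 and negative for those with D = 0, and it shrinks towards zero as the observed choice becomes more predictable from the index.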

Selection model
The MILLS variable is added as a “control for selectivity” in the Y model.
The OLS estimate of the effect of D on Y is now unbiased because E(ε | D) = 0.
The D and Y models can be estimated in two steps or estimated jointly using maximum likelihood (ML).
ML yields separate estimates of ρ and σ. The two-step yields an estimate of ρσ.
Under the null of no selectivity bias, ρ = 0 and ρσ = 0.
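The two-step logic can be sketched on simulated data. To keep the example short and dependency-free, it makes a strong simplifying assumption: the first-stage coefficient on Z is treated as known (equal to 1) instead of being estimated by probit. The outcome model (Y = δD + u with δ = 1 and no X variables), the value ρσ = 0.7 and all names are hypothetical.

```python
import random
from statistics import NormalDist

N01 = NormalDist()

def cov(a, b):
    """Sample covariance (divides by n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def ols2(x1, x2, y):
    """Slopes from an OLS regression of y on x1 and x2 (intercept implied)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

DELTA, RHO_SIGMA = 1.0, 0.7  # hypothetical true effect of D and true rho*sigma
rng = random.Random(6)
n = 40000
d, mills, y = [], [], []
for _ in range(n):
    z = rng.gauss(0, 1)                    # excluded variable: affects D, not Y
    v = rng.gauss(0, 1)
    e = rng.gauss(0, (1 - RHO_SIGMA ** 2) ** 0.5)
    u = RHO_SIGMA * v + e                  # sd(u) = 1, corr(u, v) = 0.7
    d_i = 1 if z + v > 0 else 0            # choice model with known coefficient 1
    pdf, cdf = N01.pdf(z), N01.cdf(z)
    mills.append(pdf / cdf if d_i == 1 else -pdf / (1.0 - cdf))
    d.append(d_i)
    y.append(DELTA * d_i + u)

delta_naive = cov(d, y) / cov(d, d)           # OLS of Y on D only
delta_corrected, mills_coef = ols2(d, mills, y)  # OLS of Y on D and MILLS
print(round(delta_naive, 2), round(delta_corrected, 2), round(mills_coef, 2))
```

Under these assumptions the naive estimate overstates the effect of D, the corrected estimate is close to the true value of 1, and the coefficient on the MILLS variable is close to ρσ = 0.7. In practice the index is estimated, so the second-stage standard errors need adjusting, which this sketch ignores.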

Class exercise 5b
We are going to look at a fictional dataset on 2,000 women.
    use "J:\phd\heckman.dta", clear
    sum age education married children wage
Suppose we believe that older and more highly educated women earn higher wages.
Why would it be wrong to estimate the following model?
    reg wage age education
Estimate a probit model to test whether women are more likely to be employed if they are married, have children, are older and more highly educated.

5.4 When the endogenous right hand side variable is binary (heckman)
It is easy to estimate the two-step Heckman model in Stata:
    heckman depvar1 [varlist1], select(depvar2 = varlist2) twostep
where depvar1 is the dependent variable in the main equation, depvar2 is the dependent variable in the selection model, and varlist2 contains the explanatory variables in the selection model.
Going back to our dataset on female wages:
    heckman wage education age, select(emp = married children education age) twostep

The 657 censored observations are the women who are not in employment.
The Wald chi2 tests the overall significance of the model.
Women’s wages are higher if they are older and more highly educated.
The probit model of employment is exactly the same as what we had before.
Women are more likely to be in employment if they are married, have children, are more highly educated or older.

The lambda variable is simply the IMR that was estimated from the emp model.
The IMR coefficient is 4.00 and statistically significant, so there is statistically significant evidence of a selection effect.
The IMR coefficient is the product of rho and sigma (ρσ).
Thus, 4.00 ≈ 0.67 * 5.95 (using the rounded values of rho and sigma).

Class exercise 5c
Estimate the following audit fee models separately for Big 6 and Non-Big 6 audit clients:
lnaf = a0 + a1 lnta + u (1)
lnaf = a0 + a1 lnsales + u (2)
where lnaf = log of audit fees, lnta = log of total assets, lnsales = log of sales.
Use the heckman command to “control” for endogeneity with respect to the company’s selected auditor. Your auditor choice models are as follows:
big6 = b0 + b1 lnsales + b2 lnta + v
nbig6 = c0 + c1 lnsales + c2 lnta + w
where big6 = 1 (big6 = 0) if the company chooses a Big 6 (Non-Big 6) auditor; and nbig6 = 1 (nbig6 = 0) if the company chooses a Non-Big 6 (Big 6) auditor.

Class exercise 5c
- What exclusion restrictions are you imposing in equations (1) and (2)?
- Is there statistically significant evidence of selectivity?
- For the two different specifications of the audit fee model: what are the signs of the MILLS coefficients? What are the signs of rho?

Treatment effects model
In exercise 5c, we estimated the audit fee models separately for the Big 6 and non-Big 6 audit clients.
To do this, we used the heckman command.
Suppose that we want to estimate one audit fee model with Big 6 on the right hand side of the equation (i.e., we assume that the X coefficients have the same slope in the two equations).

Treatment effects model
We can estimate this model using the treatreg command:
    treatreg lnaf lnta, treat(big6 = lnta lnsales) twostep
    treatreg lnaf lnsales, treat(big6 = lnta lnsales) twostep
If we don’t specify the twostep option we will get the ML estimates. Sometimes the ML model will not converge due to a nonconcave likelihood function:
    treatreg lnaf lnta, treat(big6 = lnta lnsales)

Treatment effects model
The results for both the treatment effects and Heckman models can be very sensitive to the model specification.
For example, the Big 6 fee premium can easily flip signs from positive to negative:
    treatreg lnaf lnta, treat(big6 = lnta lnsales) twostep
    treatreg lnaf lnta lnsales, treat(big6 = lnta lnsales) twostep
Note that there are no exclusion restrictions (Z variables) in the second specification since lnta and lnsales appear in both the first stage and second stage models.

Exclusion restrictions
Lennox, Francis & Wang (2012) argue that many accounting studies have estimated the Heckman and treatment effects models incorrectly.
It is well recognized (in economics) that exogenous Z variables from the first stage choice model need to be validly excluded from the second stage outcome regression (Little, 1985; Little and Rubin, 1987; Manning et al., 1987).
Accounting studies have generally failed to:
(a) impose exclusion restrictions, or
(b) provide compelling grounds for the validity of the exclusion restrictions.

Exclusion restrictions
Economists recognize that it is important to justify why the Z’s can be validly excluded from the Y model. For example, Angrist (1990) examines how military service affects the earnings of veteran soldiers after they are discharged from the army. This involves a selection issue because individuals join the military if they have poor wage offers in other types of job. Angrist (1990) tackles the selectivity issue using data from the Vietnam era, when military service was partly determined by a draft lottery.

Exclusion restrictions
D = military service Z = Random lottery Y = civilian earnings

Exclusion restrictions
Angrist and Evans (1998) test whether child bearing reduces female participation in the labor market.
Selectivity is an issue because women are more likely to have children rather than enter the labor market if their wage offers would be low (i.e., lower opportunity cost).
They use the sex composition of the first two children as an instrument for the decision to have a third child.

Angrist and Evans (1998): Exclusion restriction
D = decision to have a third child Z = Sex composition of first two children Y = female participation in labor market

Exclusion restrictions
In accounting, many studies fail to justify why Z has no direct impact on Y. Many studies do not report results for the D model, so the reader cannot evaluate the power of the Z variables for identifying selectivity. Some studies estimate models in which there are no nominated Z variables.

Exclusion restrictions
When there are no exclusion restrictions, identification of the MILLS coefficients relies on the assumed non-linearity of the IMR.
MILLS will then capture any misspecification of the functional relation between X and Y (e.g., non-linearity) in addition to any selectivity bias.

Exclusion restrictions
Little (1985): Relying on nonlinearities to identify selectivity bias is “unappealing” because it is very difficult to distinguish empirically between selectivity and misspecification of the model’s functional form.
STATA manual: “Theoretically, one does not need such identifying variables, but without them, one is depending on functional form to identify the model. It would be difficult to take such results seriously since the functional-form assumptions have no firm basis in theory.”
A failure to nominate any Z variables can worsen the problems of multicollinearity (Manning et al., 1987; Puhani, 2000; Leung and Yu, 2000).
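The multicollinearity concern can be made concrete with a short calculation. The sketch below assumes a sample-selection setting in which the first-stage index depends only on a single X variable that also appears in the outcome model (so there is no Z), with a hypothetical first-stage coefficient of 1 and X drawn from a standard normal distribution.

```python
import random
from statistics import NormalDist

N01 = NormalDist()

def cov(a, b):
    """Sample covariance (divides by n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

rng = random.Random(7)
x = [rng.gauss(0, 1) for _ in range(10000)]

# With no excluded Z variable, the IMR is a function of X alone.
imr = [N01.pdf(x_i) / N01.cdf(x_i) for x_i in x]

corr_x_imr = cov(x, imr) / (cov(x, x) * cov(imr, imr)) ** 0.5
print(round(corr_x_imr, 3))
```

The IMR is a non-linear function of X, but over the range where most of the data lie it is close to linear, so the two regressors are very highly (negatively) correlated. That leaves little independent variation with which to separate selectivity from the direct effect of X.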

Example: Chaney, Jeter and Shivakumar (2004)
D = BIG5 (company hires a Big 5 or non-Big 5 auditor) Y = Audit fees Z = null set

Example: Leuz and Verrecchia (2000)
D = IR97 (international reporting) Z = ROA, Capital intensity, UK/US listing. Y = Cost of capital

Leuz and Verrecchia (2000)
Is it valid to assume that ROA, Capital intensity, and UK/US listing have no direct effect on the cost of capital?
Are these Z variables really exogenous?

Leuz and Verrecchia (2000)
Are the tests for selectivity bias powerful?
Are the results sensitive to functional form? (See the free float variable.)
LV do not report results using OLS.
LV do not report whether their results are sensitive to alternative model specifications.

Going forward
Researchers need to be aware that Heckman and treatment effects models can provide results that are extremely fragile.
Sensitivity primarily affects the RHS variable that is assumed to be endogenous (D) and the IMRs.
Studies need to discuss:
- why the Z’s are exogenous
- why the Z’s have no direct effect on Y
- whether the Z’s are powerful predictors of D
The signs and significance of the IMRs alone do not provide compelling evidence as to the direction or existence of selectivity bias.
Selection studies should routinely report tests for multicollinearity problems.

Summary
When the endogenous regressor is continuous, you can “control” for endogeneity using the ivregress or reg3 commands.
When the endogenous regressor is binary, you can “control” for endogeneity using the heckman or treatreg commands.
If you want to control for endogeneity, it is vitally important that you have a good justification for your chosen exclusion restrictions.
Choosing arbitrary exclusion restrictions will probably give you garbage results.