
1 5. Endogenous right hand side variables
5.1 The problem of endogeneity bias
5.2 The basic idea underlying the use of instrumental variables
5.3 When the endogenous right hand side variable is continuous
5.4 When the endogenous right hand side variable is binary

2 5.1 Endogeneity bias
Consider a simple OLS regression:
$Y_{it} = a_0 + a_1 X_{1it} + u_{it}$
Recall that our estimate of $a_1$ will be unbiased only if we can assume that $X_{1it}$ is uncorrelated with the error term ($u_{it}$).
We have discussed two ways to help ensure that this assumption is true.
First, we should control for any observable variables that affect $Y_{it}$ and that are correlated with $X_{1it}$. For example, we should control for $X_{2it}$ if $X_{2it}$ affects $Y_{it}$ and $X_{2it}$ is correlated with $X_{1it}$ (see Chapter 2):
$Y_{it} = a_0 + a_1 X_{1it} + a_2 X_{2it} + u_{it}$

3 5.1 Endogeneity bias
Second, if we have panel data, we can control for any unobservable firm-specific characteristics ($u_i$) that affect $Y_{it}$ and that are correlated with the X variables.
From Chapter 4:
$Y_{it} = a_0 + a_1 X_{1it} + a_2 X_{2it} + u_i + e_{it}$
We control for the correlations between $u_i$ and the X variables by estimating fixed effects models.
Our estimates of $a_1$ and $a_2$ are unbiased if the X variables are uncorrelated with $e_{it}$. In this case, we say that the X variables are exogenous.

4 5.1 Endogeneity bias
Unfortunately, multiple regression and fixed effects models do not always ensure that the X variables are uncorrelated with the error term:
- if we do not observe all the variables that affect Y and that are correlated with X, multiple regression will not solve the problem.
- if we do not have panel data, the fixed effects models cannot be estimated.
- even if we have panel data, the Y and X variables may display little variation over time, in which case the fixed effects models can be unreliable (Zhou, 2001).
- even if we have panel data and the Y and X variables display sufficient variation over time, the unobservable variables that are correlated with X may not be constant over time, in which case the fixed effects models will not solve the problem.

5 A variable is more likely to be correlated with the error term if it is endogenous.
Endogenous means that the variable is determined within the economic model that we are trying to estimate.
For example, suppose that $Y_{2it}$ is an endogenous explanatory variable:
$Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + u_{it}$ (1)
$Y_{2it} = b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}$ (2)
Equations (1) and (2) have a triangular structure since $Y_{2it}$ is assumed to affect $Y_{1it}$, but $Y_{1it}$ is assumed not to affect $Y_{2it}$.
Given this triangular structure, the OLS estimate of $a_1$ in equation (1) is unbiased only if $v_{it}$ is uncorrelated with $u_{it}$.
If $v_{it}$ is correlated with $u_{it}$, then $Y_{2it}$ is correlated with $u_{it}$, which means that the OLS estimate of $a_1$ would be biased.
To avoid this bias, we must estimate equation (1) using instrumental variables (IV) regression rather than OLS.
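The bias described on this slide can be illustrated with simulated data. The following is a minimal sketch, not part of the original slides: the variable names (y1, y2, x, z, u, v), the sample size, the seed and all parameter values are assumptions chosen for illustration. The true value of $a_1$ is set to 2, and u is built to be correlated with v.

```stata
* Simulated triangular system (all names and parameter values are hypothetical)
clear
set seed 12345
set obs 10000

generate x = rnormal()              // exogenous variable in both equations
generate z = rnormal()              // exogenous variable excluded from the y1 equation
generate v = rnormal()              // error term in the y2 equation
generate u = 0.8*v + rnormal()      // error term in the y1 equation, correlated with v

generate y2 = 1 + x + z + v         // eq. (2): b0 = b1 = b2 = 1
generate y1 = 1 + 2*y2 + x + u      // eq. (1): a0 = 1, a1 = 2, a2 = 1

* OLS ignores the correlation between y2 and u.
* With this design the OLS slope drifts toward 2 + 0.8/2 = 2.4 in large samples.
regress y1 y2 x
scalar b_ols = _b[y2]

* IV (2SLS) uses z as the excluded instrument and should land near the true value 2
ivregress 2sls y1 x (y2 = z)
scalar b_iv = _b[y2]

display "OLS: " b_ols "   IV: " b_iv
```

Setting the 0.8 in the definition of u to zero removes the correlation between u and v, in which case OLS and IV should both land near 2.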

6 Equations (1) and (2) are called structural equations because they describe the economic relationship between $Y_{1it}$ and $Y_{2it}$.
We can obtain a reduced-form equation by substituting eq. (2) into eq. (1):
$Y_{1it} = a_0 + a_1 (b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it}$
In this reduced-form equation, all the explanatory variables ($X_{it}$ and $Z_{it}$) are exogenous.
The basic idea underlying IV regression is to remove $v_{it}$ from the $Y_{1it}$ model so that our estimate of $a_1$ is unbiased.
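Collecting terms in the reduced-form equation may make the later identification argument easier to follow. This is only an algebraic rearrangement of the equation on the slide; no new symbols are introduced.

```latex
\begin{aligned}
Y_{1it} &= a_0 + a_1\,(b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it} \\
        &= (a_0 + a_1 b_0) + (a_1 b_1 + a_2)\,X_{it} + a_1 b_2\,Z_{it} + (a_1 v_{it} + u_{it})
\end{aligned}
```

The coefficient on $Z_{it}$ is $a_1 b_2$, and the composite error term is $a_1 v_{it} + u_{it}$.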

7 5.2 The basic idea underlying the use of instrumental variables
Note that $v_{it}$ is removed from the $Y_{1it}$ model if we use the predicted rather than the actual values of $Y_{2it}$ on the right hand side.
We predict $Y_{2it}$ using all the exogenous variables in the system (in our example, we use the two exogenous variables $X_{it}$ and $Z_{it}$).

8 5.2 The basic idea
We then use the predicted rather than the actual values of $Y_{2it}$ when estimating the $Y_{1it}$ model.
The $a_1$ estimate is biased in eq. (3) but it is unbiased in eq. (4) because the $v_{it}$ term has been removed.
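The equations labelled (3) and (4) on this slide do not appear in the transcript. One reconstruction that is consistent with eqs. (1) and (2) and with the surrounding discussion is sketched below; the exact form used on the slide is an assumption. Hats denote OLS estimates from eq. (2), and $\hat{v}_{it} = Y_{2it} - \hat{Y}_{2it}$ is the first-stage residual.

```latex
\begin{aligned}
\text{(3)}\quad Y_{1it} &= a_0 + a_1\,(b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it} \\
\text{(4)}\quad Y_{1it} &= a_0 + a_1\,(\hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it}) + a_2 X_{it} + (u_{it} + a_1 \hat{v}_{it}) \\
                        &= a_0 + a_1\,\hat{Y}_{2it} + a_2 X_{it} + (u_{it} + a_1 \hat{v}_{it})
\end{aligned}
```

In (3) the regressor $Y_{2it}$ still contains $v_{it}$. In (4) the regressor $\hat{Y}_{2it}$ depends only on $X_{it}$ and $Z_{it}$, and the first-stage residual has moved into the error term.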

9 In eq. (4) the estimated coefficient for the $Z_{it}$ variable is $a_1 b_2$.
We already know the value of $b_2$ from eq. (2).
Therefore $a_1 = (a_1 b_2) / b_2$.
It is important to note that the $a_1$ coefficient can be estimated only if there is at least one exogenous variable in the structural model for $Y_{2it}$ that is excluded from the structural model for $Y_{1it}$.
This is the $Z_{it}$ variable in eq. (2).

10 In eq. (4) the $a_1$ coefficient is just identified because there is only one exogenous variable ($Z_{it}$) that is in the $Y_{2it}$ model and that is excluded from the $Y_{1it}$ model.

11 Suppose we had included $Z_{it}$ in both models.
In this case, the $a_1$ coefficient cannot be identified because we estimate $b_2$ and $a_1 b_2 + a_3$.
In other words, we cannot determine whether the effect of $Z_{it}$ on $Y_{1it}$ is a main effect ($a_3$) or an indirect effect through $Y_{2it}$ ($a_1 b_2$).
Here we say that the system of equations is under-identified.

12 Suppose we had included two exogenous variables in the $Y_{2it}$ model and we excluded both these variables from the $Y_{1it}$ model.
Now we have estimates of both variables' coefficients in the $Y_{2it}$ model and of both variables' reduced-form coefficients in the $Y_{1it}$ model.
Therefore $a_1$ can be recovered in two ways, once from each excluded variable.
Here we say that the system of equations is over-identified.
In this example, the system is triangular because there are two equations and one endogenous right-hand side variable.
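The over-identified case can be written out as follows. The symbols on the original slide are missing from the transcript, so the labels $Z_{1it}$, $Z_{2it}$, $b_3$, $\pi_1$ and $\pi_2$ below are hypothetical names introduced only for this sketch.

```latex
\begin{aligned}
Y_{2it} &= b_0 + b_1 X_{it} + b_2 Z_{1it} + b_3 Z_{2it} + v_{it} \\
Y_{1it} &= (a_0 + a_1 b_0) + (a_1 b_1 + a_2)\,X_{it}
           + \underbrace{a_1 b_2}_{\pi_1} Z_{1it}
           + \underbrace{a_1 b_3}_{\pi_2} Z_{2it}
           + (a_1 v_{it} + u_{it}) \\
a_1 &= \frac{\pi_1}{b_2} = \frac{\pi_2}{b_3}
\end{aligned}
```

There are two routes to $a_1$ but only one parameter, so the model implies the restriction $\pi_1 / b_2 = \pi_2 / b_3$. Restrictions of this kind are what tests of over-identifying restrictions examine.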

13 5.3 When the endogenous right hand side variable is continuous
When the models have a triangular structure, the models can be estimated using the ivregress command.
The models can be estimated using 2SLS, LIML or GMM.
2SLS is more commonly used in practice.

14 Estimating triangular models using 2SLS (ivregress)
Go to MySite.
Open the housing.dta file, which provides data from 50 U.S. states (1980 Census):
    use "J:\phd\housing.dta", clear
pct_population_urban = the % of the population that lives in urban areas
family_income = median annual family income
housing_value = median value of private housing
rent = median monthly housing rental payments
region1 to region4 = dummy variables for four regions in the U.S.

15 Suppose we want to estimate the following:
$\mathit{rent} = a_0 + a_1\,\mathit{pct\_population\_urban} + a_2\,\mathit{housing\_value} + u$
$\mathit{housing\_value} = b_0 + b_1\,\mathit{family\_income} + b_2\,\mathit{region2} + b_3\,\mathit{region3} + b_4\,\mathit{region4} + v$
This is a triangular system because there are two equations and one endogenous right hand side variable (housing_value).
If u and v are correlated, the OLS estimate of $a_2$ will be biased in the rent model.

16 If we ignore the endogeneity problem, we estimate the rent model using simple OLS:
    reg rent housing_value pct_population_urban
To take account of the potential endogeneity problem we use the ivregress command:
    ivregress estimator depvar1 [varlist1] (depvar2 = varlist_iv)
estimator is either 2sls, liml or gmm
depvar1 is the dependent variable for the model that has an endogenous regressor
varlist1 are the exogenous variables in the model that has the endogenous regressor
depvar2 is the endogenous regressor
varlist_iv are the exogenous variables that are believed to affect the endogenous regressor

17 The models that we want to estimate are:
$\mathit{rent} = a_0 + a_1\,\mathit{pct\_population\_urban} + a_2\,\mathit{housing\_value} + u$
$\mathit{housing\_value} = b_0 + b_1\,\mathit{family\_income} + b_2\,\mathit{region2} + b_3\,\mathit{region3} + b_4\,\mathit{region4} + v$
The rent model has an endogenous regressor:
    ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
    ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4)
    ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4)
The housing_value model can be estimated using OLS as there are no endogenous regressors:
    reg housing_value family_income region2 region3 region4
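To connect the ivregress 2sls command with the "predicted values" idea from section 5.2, the two stages can be run by hand. This is a minimal sketch that assumes the housing.dta file from the slides is available at the path shown there; hv_hat and b_manual are hypothetical names. The first stage here uses all the exogenous variables in the system, so it also includes pct_population_urban. The point estimate should match ivregress 2sls, but the standard errors from the manual second stage are not correct, so the manual version is for intuition only.

```stata
use "J:\phd\housing.dta", clear

* First stage: regress the endogenous regressor on ALL exogenous variables
regress housing_value family_income region2 region3 region4 pct_population_urban
predict double hv_hat, xb           // predicted housing_value (hypothetical name)

* Second stage: replace housing_value with its prediction
* (these standard errors ignore that hv_hat is itself estimated)
regress rent pct_population_urban hv_hat
scalar b_manual = _b[hv_hat]

* One-step 2SLS: same point estimate, correct standard errors
ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)

display "manual: " b_manual "   ivregress: " _b[housing_value]
```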

18 We should test whether:
- our chosen instruments are exogenous (i.e., they should be uncorrelated with the error term), and
- it is valid to exclude some of them from the model that has the endogenous regressor.
If they are not exogenous or they should not be excluded, they are not valid instruments.

19 The tests for instrument validity are also known as tests of over-identifying restrictions because:
- the tests can only be performed if the model with the endogenous regressor is over-identified
- the tests assume that at least one of the chosen instruments is valid (unfortunately this assumption cannot be tested)
In our example, the instrumented housing_value variable is over-identified because four of the exogenous variables (family_income region2 region3 region4) are excluded from the rent model.
If we had excluded only one of these variables, the instrumented housing_value variable would have been just identified, in which case it would not be possible to test for instrument validity.

20 We obtain the tests for instrument validity by typing estat overid after we run ivregress:
    ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
    estat overid
These tests are statistically significant, so we reject the null hypothesis that the chosen instruments are valid.

21 This is not surprising because we did not have good reason to assume that they are exogenous and validly excluded from the rent model.
For example:
- family_income is endogenous if family incomes depend on housing values and rents. Why would this be true?
- rents may be different across the four regions, so the region dummies should not be excluded from the rent model.

22 We obtain different statistics for the tests of instrument validity if the models are estimated using LIML or GMM.
However, the conclusions are the same as in our previous example:
    ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4)
    estat overid
    ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4)
    estat overid

23 Note that we cannot test for instrument validity when the endogenous regressor is just identified.
This is because the test statistics are obtained under the assumption that at least one of the instruments is valid; with a single excluded instrument there are no over-identifying restrictions left to test.
For example:
    ivregress 2sls rent pct_population_urban (housing_value = family_income)
    estat overid
    ivregress liml rent pct_population_urban (housing_value = family_income)
    estat overid
    ivregress gmm rent pct_population_urban (housing_value = family_income)
    estat overid

24 We can also test whether the coefficient of the endogenous regressor is biased under OLS.
We obtain two Hausman tests for endogeneity bias by typing estat endogenous after we run ivregress:
    ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
    estat endogenous
(The Durbin statistic uses an estimate of the error term's variance assuming that the variable being tested is exogenous, whereas the Wu-Hausman statistic assumes that the variable being tested is endogenous.)
Given these results, we may be tempted to reject the hypothesis that housing_value is exogenous.
However, the Hausman tests for endogeneity bias are only reliable if the chosen instruments are valid. In our example they are not, and so we cannot draw conclusions about the potential for endogeneity bias.
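The logic behind the endogeneity tests can be seen in a regression-based version: if housing_value is exogenous, the first-stage residual should add nothing to the rent model. The sketch below is an illustration and not a replacement for estat endogenous; it assumes the housing.dta file from the slides, and v_hat is a hypothetical variable name. As noted above, the result is only informative if the instruments are valid.

```stata
use "J:\phd\housing.dta", clear

* First stage on all exogenous variables; keep the residual
regress housing_value family_income region2 region3 region4 pct_population_urban
predict double v_hat, residuals     // estimate of v (hypothetical name)

* Augmented rent model: add the first-stage residual as an extra regressor
regress rent housing_value pct_population_urban v_hat

* A significant coefficient on v_hat is evidence against exogeneity of housing_value
test v_hat
```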

25 Class exercise 5a
Using the fees.dta file, estimate the following models for audit fees and company size:
$\mathit{lnaf} = a_0 + a_1\,\mathit{lnta} + a_2\,\mathit{big6} + u$
$\mathit{lnta} = b_0 + b_1\,\mathit{ln\_age} + b_2\,\mathit{listed} + v$
where lnaf is the log of audit fees, lnta is the log of total assets, ln_age is the log of the company's age in years, and listed is a dummy variable indicating whether the company's shares are publicly traded on a market.
- Is the instrumented lnta variable over-identified, just-identified, or under-identified? Explain.
- Estimate the audit fee model using 2SLS.
- Test the validity of the chosen instrumental variables.
- Test whether the lnta variable is affected by endogeneity bias.
- Verify that the test for instrument validity is not available if you change the model so that it is just-identified.

26 The key to estimating IV models is to find one or more exogenous variables that explain the endogenous regressor and that can be safely excluded from the main equation.
Unfortunately, most accounting studies that use IV regression do not attempt to justify why their chosen instruments are exogenous or why they can be excluded from the structural model.
As a result, Larcker and Rusticus (2010) criticize the way in which accounting studies have applied IV regression.
A key problem is that the IV results can be very sensitive to the researcher's choice of which variables to exclude from the structural model and, in many studies, these variables have been chosen in a very arbitrary way.

30 Larcker and Rusticus (2010) recommend that researchers justify their chosen instruments using theory or economic intuition.
The estat overid test should not be used to select instruments on purely statistical grounds, particularly as the test is invalid if all of the chosen instruments are invalid.
When testing instrument validity (estat overid) and endogeneity bias (estat endogenous), it is also important to consider your sample size:
- in large samples, the tests may reject a null hypothesis that is nearly true.
- in small samples, the tests may fail to reject a null hypothesis that is very false.

31 Estimating simultaneous equations using 3SLS (reg3)
So far we have been examining a triangular system. For example, $Y_{2it}$ affects $Y_{1it}$ but $Y_{1it}$ does not affect $Y_{2it}$:
$Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{2it} + u_{it}$
$Y_{2it} = b_0 + b_2 X_{it} + b_3 Z_{1it} + v_{it}$
In a simultaneous system, both dependent variables affect each other:
$Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{2it} + u_{it}$
$Y_{2it} = b_0 + b_1 Y_{1it} + b_2 X_{it} + b_3 Z_{1it} + v_{it}$

32 $Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{2it} + u_{it}$ (1)
$Y_{2it} = b_0 + b_1 Y_{1it} + b_2 X_{it} + b_3 Z_{1it} + v_{it}$ (2)
In this case, the OLS estimates are biased because:
- Eq. (1) shows that $u_{it}$ affects $Y_{1it}$ while eq. (2) shows that $Y_{1it}$ affects $Y_{2it}$. As a result, it must be true that $u_{it}$ is correlated with $Y_{2it}$ in eq. (1). Therefore, the OLS estimate of $a_1$ would be biased in eq. (1).
- Eq. (2) shows that $v_{it}$ affects $Y_{2it}$ while eq. (1) shows that $Y_{2it}$ affects $Y_{1it}$. As a result, it must be true that $v_{it}$ is correlated with $Y_{1it}$ in eq. (2). Therefore, the OLS estimate of $b_1$ would be biased in eq. (2).

33 For example, it seems reasonable to argue that housing values depend on rents as well as rents depending on housing values:
$\mathit{rent} = a_0 + a_1\,\mathit{housing\_value} + a_2\,\mathit{pct\_population\_urban} + u$
$\mathit{housing\_value} = b_0 + b_1\,\mathit{rent} + b_2\,\mathit{family\_income} + b_3\,\mathit{region2} + b_4\,\mathit{region3} + b_5\,\mathit{region4} + v$
Note that for identification, each equation must contain at least one exogenous variable that is not included in the other equation. These are:
- pct_population_urban in the rent model
- family_income, region2 to region4 in the housing_value model

34 We estimate this kind of model using the reg3 command:
    reg3 (depvar1 varlist1) (depvar2 varlist2)
    use "J:\phd\housing.dta", clear
    reg3 (rent = housing_value pct_population_urban) (housing_value = rent family_income region2 region3 region4)
Unfortunately, the estat overid and estat endogenous commands are not currently available with reg3.
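As a rough robustness check, the same system can be estimated both by three-stage least squares (the reg3 default) and by equation-by-equation 2SLS using reg3's 2sls option. This sketch assumes the housing.dta file from the slides; comparing the two sets of estimates is a suggestion added here, not something the slides prescribe.

```stata
use "J:\phd\housing.dta", clear

* Default estimator: three-stage least squares, which uses the
* cross-equation error correlation
reg3 (rent = housing_value pct_population_urban) (housing_value = rent family_income region2 region3 region4)

* Same system estimated equation by equation with 2SLS
reg3 (rent = housing_value pct_population_urban) (housing_value = rent family_income region2 region3 region4), 2sls
```

Large differences between the two sets of coefficients may be a sign that one of the equations is misspecified, since 3SLS lets a problem in one equation spill over into the other.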

36 Examples of endogenous binary variables in accounting:
- Companies decide whether to use hedge contracts (Barton, 2001; Pincus and Rajgopal, 2002).
- Companies decide whether to grant stock options (Core and Guay, 1999).
- Companies decide whether to hire Big 5 or non-Big 5 auditors (e.g., Chaney et al., 2004).
- Governments decide whether to fully or partially privatize (Guedhami and Pittman, 2006).
- Companies decide whether to follow international financial reporting strategy (Leuz and Verrecchia, 2000).
- Companies decide whether to recognize financial instruments at fair value or disclose (Ahmed et al., 2006).
- Companies decide whether or not to go private (Engel et al., 2002).

37 Selection model
Concerns about selectivity arise when the RHS dummy variable (D) is endogenous.
Endogeneity results in bias if $E(u \mid D) \neq 0$.
If u and v are correlated, then $E(u \mid D) \neq 0$, in which case the OLS estimate of the effect of D on Y would be biased.

38 Selection model
The intuition underlying Heckman is to estimate and then control for $E(u \mid D)$. First model the choice of D.
Z is a vector of exogenous variables that affect D but have no direct effect on Y.

39 Selection model
[Diagram relating Z, D and Y]

40 Selection model
Estimate $E(u \mid D)$ and include it as a control variable on the RHS of the Y model:
$E(u \mid D) = \rho\sigma \cdot \mathit{IMR}$
where $\rho$ captures the correlation between u and v, $\sigma$ is the standard deviation of u, and IMR is the inverse Mills ratio.
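The model equations and the definition of the IMR on these slides are missing from the transcript. A standard textbook formulation that is consistent with the surrounding text is sketched below. It assumes a probit choice model with joint normality of u and v and the variance of v normalised to one; the coefficient vector $\gamma$ is a hypothetical label, and $\phi$ and $\Phi$ denote the standard normal density and distribution function.

```latex
\begin{aligned}
Y &= a_0 + a_1 D + u, \qquad D = 1 \text{ if } Z\gamma + v > 0, \quad D = 0 \text{ otherwise} \\
\operatorname{corr}(u, v) &= \rho, \qquad \operatorname{sd}(u) = \sigma, \qquad v \sim N(0, 1) \\
E(u \mid D = 1, Z) &= \rho\sigma\,\frac{\phi(Z\gamma)}{\Phi(Z\gamma)} \\
E(u \mid D = 0, Z) &= -\rho\sigma\,\frac{\phi(Z\gamma)}{1 - \Phi(Z\gamma)}
\end{aligned}
```

The ratios on the right-hand side are the inverse Mills ratios for the two groups. When $\rho = 0$ both conditional expectations are zero, which is the case of no selectivity bias.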

41 Selection model
The MILLS variable is added as a control for selectivity in the Y model.
The OLS estimate of the effect of D on Y is now unbiased because $E(\varepsilon \mid D) = 0$.
The D and Y models can be estimated in two steps or estimated jointly using maximum likelihood (ML).
ML yields separate estimates of $\rho$ and $\sigma$.
The two-step yields an estimate of $\rho\sigma$.
Under the null of no selectivity bias, $\rho = 0$ and $\rho\sigma = 0$.

42 Class exercise 5b
We are going to look at a fictional dataset on 2,000 women.
    use "J:\phd\heckman.dta", clear
    sum age education married children wage
Suppose we believe that older and more highly educated women earn higher wages. Why would it be wrong to estimate the following model?
    reg wage age education
Estimate a probit model to test whether women are more likely to be employed if they are married, have children, are older and more highly educated.

43 When the endogenous right hand side variable is binary (heckman)
It is easy to estimate the two-step Heckman model in STATA:
    heckman depvar1 [varlist1], select(depvar2 = varlist2) twostep
where depvar1 is the dependent variable in the main equation, varlist1 are its explanatory variables, depvar2 is the dependent variable in the selection model, and varlist2 are the explanatory variables in the selection model.
Going back to our dataset on female wages:
    heckman wage education age, select(emp = married children education age) twostep
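The two steps that heckman performs can be run by hand to see where the selectivity control comes from. This is a minimal sketch assuming the heckman.dta file and the emp variable used on the slide; zg, imr and b_imr are hypothetical names. The coefficient on imr should match the lambda coefficient reported by heckman with the twostep option, but the standard errors from the manual second step are not corrected for the estimated IMR, so heckman should be used for inference.

```stata
use "J:\phd\heckman.dta", clear

* Step 1: probit model of employment
probit emp married children education age
predict double zg, xb               // linear index from the probit (hypothetical name)

* Inverse Mills ratio for women who are in employment
generate double imr = normalden(zg) / normal(zg)

* Step 2: wage model on the employed sample, with the IMR as an extra control
regress wage education age imr if emp == 1
scalar b_imr = _b[imr]

display "manual IMR coefficient: " b_imr
```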

45 Women's wages are higher if they are older and more highly educated.
The probit model of employment is exactly the same as what we had before.
Women are more likely to be in employment if they are married, have children, are more highly educated or older.
The 657 censored observations are the women who are not in employment.
The Wald chi2 tests the overall significance of the model.

46
- The lambda variable is simply the IMR that was estimated from the emp model.
- The IMR coefficient is 4.00 and statistically significant, so there is statistically significant evidence of a selection effect.
- The IMR coefficient is the product of rho and sigma (ρσ).
- Thus, 4.00 ≈ 0.67 × 5.95.
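The link between the IMR coefficient and the product of rho and sigma can be checked with a small simulation. The following is a minimal Python sketch (standard library only, not part of the original Stata material); the sample size, the seed and the standard normal selection index are assumptions made purely for illustration, and only the values 0.67 and 5.95 come from the slide.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()
rng = random.Random(12345)   # fixed seed so the run is reproducible

RHO, SIGMA = 0.67, 5.95      # rho and sigma quoted on the slide
N = 100_000                  # simulated sample size (assumption)

imr_values, outcome_errors = [], []
for _ in range(N):
    index = rng.gauss(0.0, 1.0)   # selection index (assumed standard normal)
    v = rng.gauss(0.0, 1.0)       # error in the selection equation
    e = rng.gauss(0.0, 1.0)       # independent noise
    # Outcome error with standard deviation SIGMA and correlation RHO with v.
    u = SIGMA * (RHO * v + (1.0 - RHO ** 2) ** 0.5 * e)
    if index + v > 0:             # the outcome is observed only when selected
        imr_values.append(std_normal.pdf(index) / std_normal.cdf(index))
        outcome_errors.append(u)


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


slope = ols_slope(imr_values, outcome_errors)
print("rho * sigma:", round(RHO * SIGMA, 4))
print("slope on IMR in the selected sample:", round(slope, 2))
```

The product of the rounded values is 3.9865 rather than exactly 4.00; the small gap presumably reflects rounding of rho and sigma in the reported output.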

47 Class exercise 5c
- Estimate the following audit fee models separately for Big 6 and Non-Big 6 audit clients:
    lnaf = a0 + a1 lnta + u (1)
    lnaf = a0 + a1 lnsales + u (2)
  where lnaf = log of audit fees, lnta = log of total assets, lnsales = log of sales
- Use the heckman command to control for endogeneity with respect to the company's selected auditor. Your auditor choice models are as follows:
    big6 = b0 + b1 lnsales + b2 lnta + v
    nbig6 = c0 + c1 lnsales + c2 lnta + w
  where big6 = 1 (big6 = 0) if the company chooses a Big 6 (Non-Big 6) auditor; and nbig6 = 1 (nbig6 = 0) if the company chooses a Non-Big 6 (Big 6) auditor.

48 Class exercise 5c
- What exclusion restrictions are you imposing in equations (1) and (2)?
- Is there statistically significant evidence of selectivity?
- For the two different specifications of the audit fee model:
  - what are the signs of the MILLS coefficients?
  - what are the signs of rho?

49 Treatment effects model
- In exercise 5c, we estimated the audit fee models separately for the Big 6 and non-Big 6 audit clients.
- To do this, we used the heckman command.
- Suppose that we want to estimate one audit fee model with Big 6 on the right hand side of the equation (i.e., we assume that the coefficients on the X variables are the same in the two equations).

50 Treatment effects model
- We can estimate this model using the treatreg command:
    treatreg lnaf lnta, treat(big6 = lnta lnsales) twostep
    treatreg lnaf lnsales, treat(big6 = lnta lnsales) twostep
- If we don't specify the twostep option we will get the ML estimates.
  - Sometimes the ML model will not converge due to a nonconcave likelihood function:
    treatreg lnaf lnta, treat(big6 = lnta lnsales)

51 Treatment effects model
- The results for both the treatment effects and Heckman models can be very sensitive to the model specification.
- For example, the Big 6 fee premium can easily flip signs from positive to negative:
    treatreg lnaf lnta, treat(big6 = lnta lnsales) twostep
    treatreg lnaf lnta lnsales, treat(big6 = lnta lnsales) twostep
- Note that there are no exclusion restrictions (Z variables) in the second specification, since lnta and lnsales appear in both the first stage and second stage models.

52 Exclusion restrictions
- Lennox, Francis & Wang (2012) argue that many accounting studies have estimated the Heckman and treatment effects models incorrectly.
- It is well recognized (in economics) that exogenous Z variables from the first stage choice model need to be validly excluded from the second stage outcome regression (Little, 1985; Little and Rubin, 1987; Manning et al., 1987).
- Accounting studies have generally failed to: (a) impose exclusion restrictions, or (b) provide compelling grounds for the validity of the exclusion restrictions.

53 Exclusion restrictions
- Economists recognize that it is important to justify why the Zs can be validly excluded from the Y model.
- For example, Angrist (1990) examines how military service affects the earnings of veteran soldiers after they are discharged from the army.
- This involves a selection issue because individuals join the military if they have poor wage offers in other types of jobs.
- Angrist (1990) tackles the selectivity issue using data from the Vietnam era, when military service was partly determined by a draft lottery.

54 Exclusion restrictions
- D = military service
- Z = random lottery
- Y = civilian earnings
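To see why a randomly assigned Z helps, consider a small simulated version of the D, Z, Y setup. This is a minimal Python sketch (standard library only, not part of the original Stata material). Every number in it is invented for illustration and none is taken from Angrist (1990); the variable `ability` stands in for the unobserved wage prospects that drive selection into service.

```python
import random

rng = random.Random(2024)   # fixed seed so the run is reproducible
TRUE_EFFECT = -2.0          # assumed effect of D on Y (illustrative only)
N = 100_000                 # simulated sample size (assumption)

rows = []
for _ in range(N):
    ability = rng.gauss(0.0, 1.0)        # unobserved; raises civilian earnings
    z = 1 if rng.random() < 0.5 else 0   # random lottery, unrelated to ability
    # People with poor outside wage offers (low ability) are more likely to serve.
    d = 1 if z - ability + rng.gauss(0.0, 1.0) > 0.5 else 0
    # Z enters Y only through D: this is the exclusion restriction.
    y = 10.0 + TRUE_EFFECT * d + 2.0 * ability + rng.gauss(0.0, 1.0)
    rows.append((z, d, y))


def diff_in_means(data, group_col, value_col):
    """Mean of value_col where group_col == 1 minus the mean where it == 0."""
    ones = [r[value_col] for r in data if r[group_col] == 1]
    zeros = [r[value_col] for r in data if r[group_col] == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


naive = diff_in_means(rows, 1, 2)         # OLS of Y on D alone (biased by selection)
first_stage = diff_in_means(rows, 0, 1)   # effect of Z on D
reduced_form = diff_in_means(rows, 0, 2)  # effect of Z on Y
wald = reduced_form / first_stage         # simple IV (Wald) estimate

print("true effect:", TRUE_EFFECT)
print("naive comparison:", round(naive, 2))
print("IV estimate using Z:", round(wald, 2))
```

The naive comparison mixes the effect of D with the lower ability of those who select into D, whereas the IV estimate relies only on the variation in D that comes from the lottery.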

55 Exclusion restrictions
- Angrist and Evans (1998) test whether child bearing reduces female participation in the labor market.
- Selectivity is an issue because women are more likely to have children rather than enter the labor market if their wage offers would be low (i.e., lower opportunity cost).
- They use the sex composition of the first two children as an instrument for the decision to have a third child.

56 Angrist and Evans (1998): Exclusion restriction
- D = decision to have a third child
- Z = sex composition of first two children
- Y = female participation in labor market

57 Exclusion restrictions
- In accounting, many studies fail to justify why Z has no direct impact on Y.
- Many studies do not report results for the D model, so the reader cannot evaluate the power of the Z variables for identifying selectivity.
- Some studies estimate models in which there are no nominated Z variables.

58 Exclusion restrictions
- When there are no exclusion restrictions, identification of the MILLS coefficients relies on the assumed non-linearity of MILLS as a function of the first stage variables.
- MILLS will capture any misspecification of the functional relation between X and Y (e.g., non-linearity) in addition to any selectivity bias.

59 Exclusion restrictions
- Little (1985): Relying on nonlinearities to identify selectivity bias is unappealing because it is very difficult to distinguish empirically between selectivity and misspecification of the model's functional form.
- Stata manual: "Theoretically, one does not need such identifying variables, but without them, one is depending on functional form to identify the model. It would be difficult to take such results seriously since the functional-form assumptions have no firm basis in theory."
- A failure to nominate any Z variables can worsen the problems of multicollinearity (Manning et al., 1987; Puhani, 2000; Leung and Yu, 2000).
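The multicollinearity concern can be made concrete with a short calculation. The following minimal Python sketch (standard library only, not part of the original Stata material) assumes a single X variable that enters the first stage linearly and a moderate range for the probit index; both choices are illustrative assumptions. With no Z variables, MILLS is a function of X alone, and over such a range that function is close to linear.

```python
from statistics import NormalDist

std_normal = NormalDist()


def imr(index):
    """Inverse Mills ratio at a given probit index (standard definition)."""
    return std_normal.pdf(index) / std_normal.cdf(index)


def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


# Probit index on an evenly spaced grid from -1 to 2 (assumed range).
# With no Z variables the index is just a linear function of X.
grid = [-1.0 + 0.01 * i for i in range(301)]
mills_values = [imr(w) for w in grid]

r = correlation(grid, mills_values)
vif = 1.0 / (1.0 - r ** 2)   # variance inflation factor for MILLS given X

print("correlation between the index and MILLS:", round(r, 3))
print("variance inflation factor:", round(vif, 1))
```

A correlation this close to -1 means the second stage has little independent variation with which to separate the MILLS coefficient from the X coefficient, which is one reason the estimates can be fragile.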

60 Example: Chaney, Jeter and Shivakumar (2004)
- D = BIG5 (company hires a Big 5 or non-Big 5 auditor)
- Z = null set
- Y = audit fees

61 Example: Leuz and Verrecchia (2000)
- D = IR97 (international reporting)
- Z = ROA, capital intensity, UK/US listing
- Y = cost of capital

63 Leuz and Verrecchia (2000)
- Is it valid to assume that ROA, capital intensity, and UK/US listing have no direct effect on the cost of capital?
- Are these Z variables really exogenous?

65 Leuz and Verrecchia (2000)
- Are the tests for selectivity bias powerful?
- Are the results sensitive to functional form? (see the free float variable).
- LV do not report results using OLS.
- LV do not report whether their results are sensitive to alternative model specifications.

66 Going forward
- Researchers need to be aware that Heckman and treatment effects models can provide results that are extremely fragile. Sensitivity primarily affects the RHS variable that is assumed to be endogenous (D) and the IMRs.
- Studies need to discuss:
  - why the Zs are exogenous
  - why the Zs have no direct effect on Y
  - whether the Zs are powerful predictors of D
- The signs and significance of the IMRs alone do not provide compelling evidence as to the direction or existence of selectivity bias.
- Selection studies should routinely report tests for multicollinearity problems.

67 Summary
- When the endogenous regressor is continuous, you can control for endogeneity using the ivregress or reg3 commands.
- When the endogenous regressor is binary, you can control for endogeneity using the heckman or treatreg commands.
- If you want to control for endogeneity, it is vitally important that you have a good justification for your chosen exclusion restrictions.
- Choosing arbitrary exclusion restrictions will probably give you garbage results.

