Instrumental Variables

Saralyn J Miller EDU 7314

Overview of Presentation
Understanding IV
  History
  Defined
  Assumptions
  Endogeneity
  Exogenous Variable (Instrument)
  Angrist example paralleled with an education example
Statistical Understanding of IV
  Present 2 equations
Card Example
  Overview of article
  Replicate his study in R
In-class Example
Other Examples of IV in Education

History of IV
Historically, IV has mostly been used by economists and statisticians (Angrist & Krueger, 2001).
Philip G. Wright (econometrician) vs. Sewall Wright (biologist) (Wright, 1928):
  Philip had written about the problem of endogenous variation in previous papers.
  Sewall had discovered the use of an instrument, but the variables were already exogenous, so the analysis was unnecessary.
  Stylometric analysis of their writing (Stock & Trebbi, 2003): the authors found Philip to be the writer and founder of IV.
1940s: IV was rediscovered.
1953: Theil introduced the two-stage least squares method for computing IV.

Instrumental Variables Defined
Causality is difficult to prove, even in experimental research. In education, randomization is the standard way to establish causality. However, we can’t always randomize or create a true experiment. The IV method is a quasi-experimental research method used to estimate causal relationships when randomization is not available.

Regression Assumption
One of the assumptions about the error term in a regression analysis is that the errors must be independent and identically distributed:
  Error variance is the same for all values.
  Error is not related to other error values.
  Error is normally distributed.
Use IV when the independent variable is correlated with the unobservable error. There are 3 reasons why this might happen:
  Omitted variable bias: an unobserved variable explains part of the dependent variable but is not in your model. The variables you have included pick up some of its effect, when the unobserved variable needs to be accounted for on its own.
  Measurement error: the independent variable is recorded with error when the data are collected, so its estimated effect is distorted.
  Reverse causality: the direction of causality is not determined, because the outcome may itself influence the independent variable.
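Omitted variable bias can be illustrated with a small simulation. The sketch below is not from the presentation: the variable names (`ability`, `educ`, `wage`) and all coefficient values are hypothetical, chosen only so that the true effect of `educ` is known to be 0.10.

```r
# Simulated illustration of omitted variable bias.
# All names and numbers here are hypothetical, not taken from any real data set.
set.seed(42)
n <- 10000

ability <- rnorm(n)                                   # unobserved in practice
educ    <- 12 + 2 * ability + rnorm(n)                # schooling depends on ability
wage    <- 1 + 0.10 * educ + 0.50 * ability + rnorm(n) # true effect of educ is 0.10

# Model that omits ability: educ is correlated with the error term
short_model <- lm(wage ~ educ)
# Model that includes ability: only possible here because the data are simulated
long_model  <- lm(wage ~ educ + ability)

b_short <- unname(coef(short_model)["educ"])
b_long  <- unname(coef(long_model)["educ"])

cat("educ coefficient with ability omitted: ", round(b_short, 3), "\n")
cat("educ coefficient with ability included:", round(b_long, 3), "\n")
```

With `ability` left out, the `educ` coefficient absorbs part of the effect of ability and drifts away from 0.10. With `ability` included, it stays close to the true value.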

Endogeneity
When an independent variable correlates with the unobservable error, we call this endogeneity.
Endogenous variables: variables that are correlated with the error term. You can’t say that the independent variables cause the dependent variable.
Often the factors that affect an outcome depend on that outcome (reverse causality).
Example: The more shots Kobe Bryant takes, the lower the percentage of wins for the Lakers. Does an increase in the shots that Kobe takes cause the Lakers to lose? Or do the loss of the game and the fact that teammates are not making shots cause Kobe to take more shots? (http://drbseconomicblog.blogspot.com/2009/01/kobe-and-reverse-causality.html )

Endogeneity
Sometimes in a linear model some of the variables are endogenous, meaning the regressors are correlated with the error term.
Ex: Effect of military service on future earnings (Angrist, 1990). Military service is endogenous.
Does military service cause a soldier’s future earnings to be a certain amount when he or she leaves the service? Or are there certain characteristics of those who join the military that influence future earnings?
An individual’s choice to enter the service might be indicative of the individual’s expected future earnings. Some individuals choose to go into the military because their expected future earnings are low, so those who join might on average have lower future earnings regardless of service.
Also, veterans have certain observed and unobserved characteristics that affect their decision to enroll, and these could be related to earnings.

What do we do when we have an endogenous variable?
An exogenous variable, or instrument, can “fix” endogeneity. These variables are correlated with the regressors, but are uncorrelated with the error term. We call these exogenous variables instruments.
Ex: Because the decision to serve depends on other things, such as expected earnings, Angrist (1990) used the Vietnam draft as an instrument. It is correlated with entering the service, but is not correlated with the other determinants of earnings. The draft system is exogenous.

Qualities of an Instrument – Exogenous Variable
It must be correlated with the endogenous independent variable.
It must be uncorrelated with the error term of the outcome equation.
Assumption of IV: the instrument must be exogenous.
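These two requirements can be written compactly. The sketch below borrows the notation the presentation introduces later (K for the endogenous independent variable, IV for the instrument, e for the error of the outcome equation). The labels "relevance" and "exogeneity" are the usual textbook names and are an addition here, not terms from the slides.

```latex
\[
\operatorname{Cov}(IV, K) \neq 0 \quad \text{(relevance)},
\qquad
\operatorname{Cov}(IV, e) = 0 \quad \text{(exogeneity)}
\]
```

The first condition can be checked in the data. The second involves the unobservable error, so it has to be argued from theory and study design.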

Example: Joshua Angrist’s 1990 work
He analyzed the difference in earnings between veterans and non-veterans. But analyzing this difference does not tell us the causal impact of military service on future earnings.
In education, we “fix” this problem by randomly placing students into treatment and control conditions. But we can’t always randomize.
What if we gave students a choice on whether they wanted to attend tutoring sessions (Reardon, 2010) because we could not randomly assign students to a condition?

Example Continued
A young person’s decision to enter the military could be affected by his/her expectations of future earnings. This is an endogeneity problem: does military service affect future earnings, or does the prospect of future earnings affect the decision to enter the military?
Veterans have observed and unobserved characteristics that affect their reason for entering the military. We cannot control for the unobserved characteristics.
Tutoring session example (Reardon, 2010): A student’s decision to attend tutoring could be affected by his/her expectations of how it will affect academic achievement. Does tutoring affect achievement, or does the prospect of future grades affect the decision to go to tutoring?

What did Angrist do?
He used the Vietnam draft lottery as an instrument (exogenous variable).
  The draft lottery is correlated with serving in the military.
  The draft lottery is correlated with future earnings only through enrollment in the military.
The tutoring sessions could use a lottery system too.
  The lottery would be correlated with who goes to tutoring.
  The lottery would be correlated with future grades only through attendance at the tutoring program.

Problem
What about those who were drafted and avoided the draft? Or those who were not drafted, but felt compelled to fight anyway?
What about the students who were picked in the lottery, but chose not to go because they didn’t think it would help? Or those who were not picked, but really felt like they needed the help?

Answer
The IV estimate does not capture a treatment effect for the people described previously, because the instrument did not change what they did. It is not an average treatment effect for the whole sample, but a local average treatment effect (LATE).
Military earnings example: it only tells you the treatment effect for those who pulled a “bad” number and served and those who pulled a “good” number and did not serve.
Tutoring example: it only tells you the treatment effect for those who were picked for tutoring and attended and those who were not picked for tutoring and did not attend.
Therefore we are only measuring a treatment effect for compliers, which makes this method less generalizable.
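The idea that IV recovers an effect for compliers only can be sketched with simulated lottery data. Everything below is a hypothetical illustration: the compliance types, their shares, and the effect sizes (2 for compliers, 5 for those who always take the treatment) are assumptions made for the example, not values from Angrist (1990) or Reardon (2010).

```r
# Simulated lottery with three hypothetical compliance types.
set.seed(1)
n <- 20000

type <- sample(c("complier", "always", "never"), n,
               replace = TRUE, prob = c(0.5, 0.2, 0.3))
z <- rbinom(n, 1, 0.5)                     # lottery outcome (the instrument)

# Treatment actually received: only compliers follow the lottery
d <- ifelse(type == "always", 1,
     ifelse(type == "never", 0, z))

# Heterogeneous treatment effects (assumed for illustration)
effect <- ifelse(type == "complier", 2,
          ifelse(type == "always", 5, 0))
y <- 10 + effect * d + rnorm(n)

# Naive comparison of treated and untreated
naive <- mean(y[d == 1]) - mean(y[d == 0])

# IV (Wald) estimate: effect of the lottery on the outcome,
# divided by the effect of the lottery on treatment received
itt         <- mean(y[z == 1]) - mean(y[z == 0])
first_stage <- mean(d[z == 1]) - mean(d[z == 0])
wald        <- itt / first_stage

cat("Naive difference:", round(naive, 2), "\n")
cat("Share of compliers (first stage):", round(first_stage, 2), "\n")
cat("IV (Wald) estimate:", round(wald, 2), "\n")
```

In this setup the IV estimate lands near the complier effect of 2, while the naive comparison mixes in the larger effect for the always-takers. The IV estimate says nothing about the effect for the other two groups.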

LATE
Estimates can be biased when the treatment is not a binary choice but an ordered choice (use LIV to correct).
There is not usually a theoretical model that the relationships are based on, except when a natural experiment is created.
Only generalizable to those whose treatment status is changed by the instrument.
Advantages
Can be used to estimate a causal relationship when randomization is not feasible.

Statistical Understanding of IV
Think of IV models as 2 separate equations.
  Y is the outcome variable
  K is the variable related to the instrument
  IV is the instrument related to K
  e is the error
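A common textbook way to write the two equations, using the symbols defined above, is sketched below. It assumes a single instrument and no additional covariates. The coefficient names (π, β) and the first-stage error v are notation introduced here for illustration.

```latex
\begin{align}
K &= \pi_0 + \pi_1\, IV + v && \text{(first stage)} \\
Y &= \beta_0 + \beta_1\, K + e && \text{(outcome equation)}
\end{align}
% The second stage estimates the outcome equation with K replaced by its
% first-stage fitted value \hat{K} = \hat{\pi}_0 + \hat{\pi}_1\, IV.
% With one instrument and no covariates this gives
\[
\hat{\beta}_1^{IV} = \frac{\widehat{\operatorname{Cov}}(IV, Y)}{\widehat{\operatorname{Cov}}(IV, K)}
\]
```

Because the fitted value of K varies only with the instrument, the part of K that is correlated with e does not enter the second stage.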

Typical Regression
[Diagram: exogenous X1, endogenous X2, outcome DV, error e1]

Instrumental Variable Regression
[Diagram: exogenous X1, endogenous X2, error e1]

How do we find a good instrument and test the instrument’s validity?
You can use theory and past research to provide evidence for an instrument.
Hausman test.
Check the correlation between the independent variable and the instrument.
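Both checks can be sketched in base R on simulated data. This is an illustrative sketch, not the presentation's own procedure: the data-generating values are hypothetical, and the Hausman-type check is the regression-based (control function) version, in which the first-stage residuals are added to the outcome regression and their coefficient is tested.

```r
# Hypothetical simulated data: z is the instrument, x is endogenous
# because it shares the unobserved component u with y.
set.seed(7)
n <- 5000
z <- rnorm(n)
u <- rnorm(n)
x <- 0.8 * z + u + rnorm(n)
y <- 1 + 0.5 * x + u + rnorm(n)

# Check 1: is the instrument correlated with the independent variable?
first_stage <- lm(x ~ z)
fs_F <- unname(summary(first_stage)$fstatistic["value"])
cat("Correlation of x and z:", round(cor(x, z), 3), "\n")
cat("First-stage F statistic:", round(fs_F, 1), "\n")

# Check 2: regression-based Hausman-type test for endogeneity of x.
# A significant coefficient on the first-stage residuals suggests x is endogenous.
v_hat <- resid(first_stage)
aug   <- lm(y ~ x + v_hat)
hausman_p <- summary(aug)$coefficients["v_hat", "Pr(>|t|)"]
b_x_aug   <- unname(coef(aug)["x"])
cat("p-value on first-stage residuals:", signif(hausman_p, 3), "\n")
```

A frequently quoted rule of thumb treats a first-stage F statistic below about 10 as a sign of a weak instrument. Neither check shows that the instrument is exogenous, which still has to be argued from theory and past research.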

Example in R: Card data
Explanation of the Card (1993) study.
Replicate the study using the Card data (Card, 1993; Hamersma, 2009).

Using Geographic Variation in College Proximity to Estimate the Return to Schooling (Card, 1993)
Does level of education or number of years of schooling affect wages or earnings? You would think yes!
BUT, the studies that show earnings gains are controversial because educational levels are NOT randomly assigned. Individuals choose their level of education. Education is endogenous.
The effect of schooling is difficult to determine, and you cannot randomly assign some children to school. The author needs an exogenous variable.
Card uses geographic differences in proximity to a college.
Overall finding: When college proximity is used as an instrument for education, the author finds that the return to education is approximately 50% higher than the OLS estimate.

Why is Education Endogenous to Earnings?
Ability bias: if some individuals have an ability (e.g., IQ) that raises earnings regardless of education, and those individuals also obtain more schooling, then the estimated return to schooling will be biased upward.
Measurement error: all of the data were self-reported. We could argue that there is a negative correlation between the earnings error and observed schooling.

Is College Proximity Exogenous?
Card proposes college proximity as an exogenous variable.
College proximity needs to be related to wages, but only through education.
If you are poor, the likelihood of attending college increases if you live near one, so proximity is related to education.
He checked this by looking at the effect of college proximity on predicted education given other demographic variables.
The biggest effect was for men with a low chance of continuing their education. (If you live near a college, the cost of higher education is lower, so there is a bigger effect on the education outcomes of poorer children.)

Recap
We’re trying to estimate the effect of schooling on wages.
Education is our key independent variable, and it is endogenous.
Wage (log of wages) is our dependent variable.
College proximity is our exogenous instrument.

Variables Used in Card analysis
lwage = log(wages)
educ = years of schooling, 1976
exper = age – educ – 6
expersq
black = 1 if black
south = 1 if in south, 1976
smsa = 1 if in metropolitan area, 1976
reg661-reg668 = 1 for region lived in, 1966
smsa66 = 1 if in metropolitan area, 1966
nearc4 = 1 if near 4 year college, 1966

3 Step Process for Replicating Card’s Findings (Card, 1993; Hamersma, 2009)

    ### Load Stata file ###
    library(foreign)                    # provides read.dta()
    card.data <- read.dta("card.dta")
    attach(card.data)                   # lets the models below refer to columns by name
    head(card.data)

Columns in card.data: id, nearc2, nearc4, educ, age, fatheduc, motheduc, weight, momdad14, sinmom14, step14, reg661, reg662, reg663, reg664, reg665, reg666, reg667, reg668, reg669, south66, black, smsa, south, smsa66, wage, enroll, kww, iq, married, libcrd14, exper, lwage, expersq

Step 1: OLS Estimate without Instrument
We find education is statistically significant, but we can make the case that it is endogenous.

    m1 <- lm(lwage ~ educ + exper + expersq + black + south + smsa +
             reg661 + reg662 + reg663 + reg664 + reg665 + reg666 +
             reg667 + reg668 + smsa66)
    summary(m1)

    Coefficients:   Pr(>|t|)
    (Intercept)     < 2e-16    ***
    educ            < 2e-16    ***
    exper           < 2e-16    ***
    expersq         …e-13      ***
    black           < 2e-16    ***
    south           …e-08      ***
    smsa            …e-11      ***
    reg661                     **
    reg662
    reg663
    reg664
    reg665
    reg666
    reg667
    reg668                     ***
    smsa66
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Residual degrees of freedom: 2994
    F-statistic on 15 and 2994 DF, p-value: < 2.2e-16

What do we know so far?
Education is the key variable and is statistically significant, but education is endogenous and is not accounting for individual ability.
Card uses college proximity as an instrument to correct for the endogeneity.
College proximity is correlated with wages, but only through education.
We want to check whether college proximity is correlated with education.

Step 2: Is college proximity an exogenous determinant of wages?
First, regress education on college proximity (nearc4) and the other variables.

    m2 <- lm(educ ~ exper + expersq + black + south + smsa +
             reg661 + reg662 + reg663 + reg664 + reg665 + reg666 +
             reg667 + reg668 + smsa66 + nearc4)
    summary(m2)

    Coefficients:   Pr(>|t|)
    (Intercept)     < 2e-16    ***
    exper           < 2e-16    ***
    expersq
    black           < 2e-16    ***
    south
    smsa                       ***
    reg661
    reg662                     *
    reg663
    reg664
    reg665                     *
    reg666                     *
    reg667                     *
    reg668
    smsa66
    nearc4                     ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Residual degrees of freedom: 2994
    F-statistic on 15 and 2994 DF, p-value: < 2.2e-16

Step 2: Is college proximity an exogenous determinant of wages?
Next, regress log wages on college proximity (nearc4) and the other variables, leaving education out.

    m3 <- lm(lwage ~ exper + expersq + black + south + smsa +
             reg661 + reg662 + reg663 + reg664 + reg665 + reg666 +
             reg667 + reg668 + smsa66 + nearc4)
    summary(m3)

    Coefficients:   Pr(>|t|)
    (Intercept)     < 2e-16    ***
    exper           …e-15      ***
    expersq         …e-11      ***
    black           < 2e-16    ***
    south           …e-08      ***
    smsa            …e-14      ***
    reg661                     **
    reg662
    reg663
    reg664
    reg665
    reg666
    reg667
    reg668                     **
    smsa66
    nearc4                     *
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Residual degrees of freedom: 2994
    F-statistic on 15 and 2994 DF, p-value: < 2.2e-16

Step 3: Does education affect wages when college proximity is used as the instrument?

    library(AER)    # provides ivreg()
    # Left of "|": the outcome model. Right of "|": the instrument (nearc4)
    # plus all exogenous regressors.
    m4 <- ivreg(lwage ~ educ + exper + expersq + black + south + smsa +
                reg661 + reg662 + reg663 + reg664 + reg665 + reg666 +
                reg667 + reg668 + smsa66 |
                nearc4 + exper + expersq + black + south + smsa +
                reg661 + reg662 + reg663 + reg664 + reg665 + reg666 +
                reg667 + reg668 + smsa66)
    summary(m4)

    Coefficients:   Pr(>|t|)
    (Intercept)     …e-05      ***
    educ                       *
    exper           …e-06      ***
    expersq         …e-12      ***
    black                      **
    south           …e-07      ***
    smsa                       ***
    reg661                     **
    reg662
    reg663
    reg664
    reg665
    reg666
    reg667
    reg668                     ***
    smsa66
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Residual degrees of freedom: 2994
    Wald test on 15 and 2994 DF, p-value: < 2.2e-16
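To see what a two-stage least squares fit does internally, the two stages can be run by hand with `lm()`. The sketch below uses simulated data rather than the Card file. The variables (`z` as a binary instrument in the role of nearc4, `w` as an exogenous control, `u` as unobserved ability) and all coefficients are hypothetical, with the true effect of `x` set to 0.10.

```r
# Manual two-stage least squares on hypothetical simulated data.
set.seed(123)
n <- 10000
z <- rbinom(n, 1, 0.5)          # binary instrument (plays the role of nearc4)
w <- rnorm(n)                   # exogenous control
u <- rnorm(n)                   # unobserved, makes x endogenous
x <- 12 + 1.0 * z + 0.5 * w + u + rnorm(n)
y <- 1 + 0.10 * x + 0.3 * w + 0.5 * u + rnorm(n)   # true effect of x is 0.10

# OLS ignores the endogeneity of x
ols   <- lm(y ~ x + w)
b_ols <- unname(coef(ols)["x"])

# Stage 1: predict the endogenous variable from the instrument and controls
stage1 <- lm(x ~ z + w)
x_hat  <- fitted(stage1)

# Stage 2: replace x with its fitted value
stage2 <- lm(y ~ x_hat + w)
b_2sls <- unname(coef(stage2)["x_hat"])

cat("OLS estimate: ", round(b_ols, 3), "\n")
cat("2SLS estimate:", round(b_2sls, 3), "\n")
```

The point estimate from the second stage matches what a dedicated IV routine would report for the same model, but the standard errors printed by `summary(stage2)` are not valid, because they ignore that `x_hat` was itself estimated. In practice a function such as `ivreg()` should be used for inference. In this simulation OLS is biased upward, which is only a feature of the made-up data and differs from the direction Card reports.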

Compare OLS to IV Estimator

OLS:
    lm(formula = lwage ~ educ + exper + expersq + black + south + smsa +
       reg661 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 +
       reg668 + smsa66)

IV:
    ivreg(formula = lwage ~ educ + exper + expersq + black + south + smsa +
          reg661 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 +
          reg668 + smsa66 | nearc4 + exper + expersq + black + south + smsa +
          reg661 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 +
          reg668 + smsa66)

    Significance    OLS    IV
    (Intercept)     ***    ***
    educ            ***    *
    exper           ***    ***
    expersq         ***    ***
    black           ***    **
    south           ***    ***
    smsa            ***    ***
    reg661          **     **
    reg662-reg667
    reg668          ***    ***
    smsa66
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Both models: 2994 residual degrees of freedom; overall test (F-statistic for OLS, Wald test for IV) on 15 and 2994 DF, p-value: < 2.2e-16

The estimated effect of education increased from the OLS model to the IV model.
Card (1993): “The implied instrumental variables estimates of the earnings gain per year of additional schooling at 10-14% are substantially above the earnings gains estimated by a conventional ordinary least squares procedure (7.3%)”

Example 2
Does cigarette smoking have an effect on child birth weight (Wooldridge, 2002)?
What is the dependent variable?
What is the independent variable?
Do we have an endogeneity problem?
This example uses cigarette prices as the exogenous variable, that is, as the instrument in the analysis.

Insert Data into R

    bwght <- read.dta("bwght.dta")    # read.dta() comes from the foreign package
    head(bwght)
    attach(bwght)

Columns in bwght: faminc, cigtax, cigprice, bwght, fatheduc, motheduc, parity, male, white, cigs, lbwght, bwghtlbs, packs, lfaminc

Step 1: What is the first regression analysis we should calculate?

Step 2: Check the instrument
Are cigarette prices correlated with the number of cigarettes smoked per day while pregnant?

What did we find?

Other Examples of IV (Angrist & Krueger, 2001)

IV in Educational Research
Tutoring voucher system
Remediation programs
Schooling effects
Effects of absences on achievement
Effects of attendance on earnings
Effects of class size on achievement
Effects of hours spent in algebra on math achievement

References
Angrist, J. (1990). Lifetime earnings and the Vietnam era draft lottery: Evidence from social security administrative records. American Economic Review, 80(3).
Angrist, J. D. & Krueger, A. B. (2001). Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15(4).
Bauchet, J. (2009). Of instrumental variables and sample definition. Financial Access Initiative. Retrieved November 1, 2010.
Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling. NBER Working Paper Series, 4483, 1-37.
Hamersma, S. (2009). Homework # 2: ECO 7427 answer key.
Reardon, S. (2010, March). Using instrumental variables in educational research. Presentation at Society for Research on Educational Effectiveness.
Shepherd, B. (2008). Session 1: Dealing with endogeneity.
Stock, J. H. & Trebbi, F. (2003). Retrospective: Who invented instrumental variable regression? Journal of Economic Perspectives, 17(3).
Wilson, B. (2009). Kobe and reverse causality. Brooks Wilson’s Economics Blog. Retrieved November 1, 2010.
Wooldridge, J. (2002). Introductory econometrics: A modern approach. (2nd Ed?) South-Western College Pub, City?.
Wright, P. G. (1928). The tariff on animal and vegetable oils. New York: Macmillan.