Overview of Regression Analysis. Conditional Mean We all know what a mean or average is. E.g. The mean annual earnings for 25-44 year old working males.


1 Overview of Regression Analysis

2 Conditional Mean We all know what a mean or average is. E.g., the mean annual earnings for 25-44 year old working males is $54,648 (March 2010). We are also often interested in how this mean differs by other individual characteristics. E.g., how do mean earnings differ between black and non-black workers? Mean earnings for working non-black males ages 25-44 = $56,614. Mean earnings for working black males ages 25-44 = $39,380. These are known as Conditional Means (the mean conditioned on some other characteristic, in this case race). So without controlling for anything else, 25-44 year old black working males earn on average $17,234 less annually, or 30% less, than similarly aged non-black working males.
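The arithmetic behind the raw gap can be checked in a few lines. This is only a sketch using the two conditional means quoted on the slide; the variable names are illustrative.

```python
# Conditional means quoted on the slide (March 2010, working males ages 25-44).
mean_non_black = 56_614
mean_black = 39_380

raw_gap = mean_non_black - mean_black        # dollars per year
relative_gap = raw_gap / mean_non_black      # gap as a share of the non-black mean

print(f"raw gap: ${raw_gap:,}")
print(f"relative gap: {relative_gap:.1%}")
```

Note that the 30% figure uses the non-black mean as the base; dividing by a different base gives a different percentage.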

3 Conditional Means When testing a theory though, we often want to know how much of a given mean difference can be attributed to a particular observable variable, after controlling for other observable differences. For example, we also know that earnings are closely tied to schooling, and there is a significant racial gap in schooling, so we might want to know how large the racial earnings gap is net of racial differences in years of schooling (i.e., controlling for schooling).

4 Conditional Means One way to do this is to calculate even more complicated conditional means. E.g.,
Non-black males 25-44 w/out HS degree = $25,278
Black males 25-44 w/out HS degree = $22,275
Non-black males 25-44 w/ HS degree = $39,922
Black males 25-44 w/ HS degree = $32,670
Non-black males 25-44 w/ college degree = $80,295
Black males 25-44 w/ college degree = $61,136

5 Conditional Means Then, we can find how much less blacks earn than whites, after controlling for education, via the following weighted mean formula: gap = Σ_i (n_b,i / n_b) * (earnings_w,i - earnings_b,i), where i corresponds to the three education categories, n_b,i / n_b corresponds to the fraction of black male workers in education category i, earnings_b,i corresponds to the mean earnings for black workers in education category i, and earnings_w,i corresponds to the mean earnings for white workers in education category i. Doing so, we find that according to the above conditional mean calculations, black male workers earn about $11,064, or 11,064/54,648 = 20 percent, less than white male workers with similar education characteristics. So conditioning on years of education explains about 33% of the racial earnings gap ([0.30 - 0.20]/0.30 = 0.33).
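The weighted mean described above, read here as the sum over education categories of the black share in category i times the white-minus-black earnings gap in that category, can be sketched as follows. The conditional means come from the slides, but the education shares are not reported there, so the shares below are hypothetical placeholders.

```python
# Conditional means from the slides (working males ages 25-44), by education.
non_black = {"no_hs": 25_278, "hs": 39_922, "college": 80_295}
black = {"no_hs": 22_275, "hs": 32_670, "college": 61_136}

# HYPOTHETICAL shares of black male workers in each education category
# (n_b,i / n_b). The slides do not report these; they must sum to 1.
black_share = {"no_hs": 0.15, "hs": 0.60, "college": 0.25}


def weighted_gap(ref_means, group_means, group_share):
    """Sum over categories of share_i * (reference mean_i - group mean_i)."""
    return sum(
        group_share[c] * (ref_means[c] - group_means[c]) for c in group_share
    )


gap = weighted_gap(non_black, black, black_share)
print(f"education-adjusted gap: ${gap:,.0f}")
```

Because the shares are made up, the result is not expected to reproduce the $11,064 figure; with the actual shares the same function would.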

6 Conditional Means It can be quite cumbersome to compute all these conditional means though, especially if we start adding in more categories for education (e.g., only up to 10th grade, only up to 11th grade, only up to 12th grade, 1 year of college, 2 years of college, 3 years of college, etc.). Moreover, what if we are also interested in the impact of another year of schooling on earnings, after controlling for race? That would require a whole new set of calculations.

7 Regression This is why a regression model is often a simpler way to describe conditional means. earnings_i = α + β1*black_i + β2*yrs of school_i + e_i. Often, the left-hand-side variable is called the dependent variable, and the right-hand-side variables are called “control” variables or “regressors” (or sometimes “independent” variables, but I don’t like that). α is known as the intercept, the β’s are (slope) coefficients, and e_i is the “residual.” Estimating a regression amounts to finding the intercept and slope coefficients that minimize the sum of the squared e_i terms across the sample (i.e., finding the best “fit”). So the intercept and coefficients essentially account for the variation in the dependent variable (earnings) that is common across all people with respect to the control variables, while the residual is the individual-specific variation, or how each individual differs from the average. Graphically?
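As a rough illustration of what "finding the coefficients that minimize the sum of squared residuals" means in practice, here is a minimal sketch using NumPy's least-squares solver. The data are synthetic and the "true" coefficients are made up; nothing here is the data behind the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# SYNTHETIC data: the "true" coefficients below are made up for illustration.
black = rng.binomial(1, 0.2, size=n)
school = rng.integers(8, 21, size=n)            # 8 to 20 years of schooling
noise = rng.normal(0, 5_000, size=n)
earnings = -70_000 - 10_000 * black + 9_000 * school + noise

# Design matrix: a column of ones (intercept), then the two regressors.
X = np.column_stack([np.ones(n), black, school])

# Least squares picks the coefficients minimizing the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)
alpha_hat, beta1_hat, beta2_hat = coef

residuals = earnings - X @ coef
print("intercept, black, schooling:", alpha_hat, beta1_hat, beta2_hat)
print("sum of squared residuals:", float(residuals @ residuals))
```

Any other choice of intercept and slopes gives a sum of squared residuals at least as large on the same sample, which is the sense in which this is the best "fit."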

8 Regression [Figure: Earnings (vertical axis) plotted against Yrs of Schooling (horizontal axis); intercepts α and α + β1 are marked, and the slope is β2.]

9 Regression When I estimate this model I get: earnings_i = -70,003 - 10,381*black_i + 8,888*yrs of schooling_i + e_i

10 Regression If we simply take the coefficients, this can be referred to as our estimated linear conditional expectation function. E[earnings_i] = -70,003 - 10,381*black_i + 8,888*yrs of schooling_i. Evaluating the equation for particular characteristics gives the “expected,” or mean, earnings for a person with those characteristics. So for a non-black worker with 12 years of schooling, expected earnings are: -70,003 - 10,381*0 + 8,888*12 = $36,653 (compared with $39,922 when we computed this directly using actual conditional means).
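Plugging characteristics into the estimated function is easy to mechanize. The sketch below hard-codes the point estimates quoted on the slide; the function and argument names are illustrative.

```python
def expected_earnings(black, years_of_schooling):
    """Estimated linear conditional expectation function from the slide."""
    return -70_003 - 10_381 * black + 8_888 * years_of_schooling


non_black_12 = expected_earnings(black=0, years_of_schooling=12)
black_12 = expected_earnings(black=1, years_of_schooling=12)

print("non-black, 12 years:", non_black_12)
print("black, 12 years:", black_12)
```

The first value reproduces the $36,653 on the slide; the second differs from it by exactly the coefficient on the black indicator.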

11 Regression Often we are simply interested in the coefficients, i.e., how each right-hand-side variable is associated with the dependent variable. Interpreting the coefficients (i.e., “Betas”): consider a generic linear function y = 100 + 0.5*x1 - 12*x2. How do we determine the change in y associated with a one unit change in x2, holding everything else constant? Now suppose we want to know the expected change in earnings due to a one year increase in schooling, holding all other variables constant. Recall our estimated linear conditional expectation function: E[earnings_i] = -70,003 - 10,381*black_i + 8,888*yrs of schooling_i

12 Regression So, if one is interested in the expected change in a dependent variable associated with a one unit change in one of the control variables, simply take the derivative of the estimated conditional expectation function with respect to that control variable. So, given our estimate, earnings are “expected” to increase by $8,888 for each additional year of schooling.
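For a linear function, the derivative with respect to a variable equals the change from a one-unit increase in that variable with everything else held fixed. A minimal sketch of that check, using the generic function and the earnings estimates from the slides (the helper `unit_change` is a hypothetical name introduced here):

```python
def generic_y(x1, x2):
    """The generic linear function from the slides."""
    return 100 + 0.5 * x1 - 12 * x2


def expected_earnings(black, years_of_schooling):
    """Estimated linear conditional expectation function from the slides."""
    return -70_003 - 10_381 * black + 8_888 * years_of_schooling


def unit_change(f, args, name):
    """Change in f when the argument `name` rises by one, others held fixed."""
    bumped = dict(args, **{name: args[name] + 1})
    return f(**bumped) - f(**args)


dy_dx2 = unit_change(generic_y, {"x1": 3.0, "x2": 7.0}, "x2")
d_school = unit_change(
    expected_earnings, {"black": 0, "years_of_schooling": 12}, "years_of_schooling"
)

print("change in y per unit of x2:", dy_dx2)
print("change in expected earnings per year of schooling:", d_school)
```

The starting values (x1 = 3, x2 = 7, 12 years of schooling) are arbitrary; in a linear model the answer does not depend on them.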

13 Regression Consider again our estimated linear conditional expectation function for earnings: E[earnings_i] = -70,003 - 10,381*black_i + 8,888*yrs of schooling_i. Taking the derivative with respect to the “black” indicator variable we get -10,381. This means that, holding everything else equal (i.e., years of education), on average, black workers earn $10,381 less than white workers (i.e., expected earnings for black workers are $10,381 less than expected earnings for white workers with similar education). This is close to the $11,064 conditional pay differential we computed before, but still a little different. Why?

14 Regression Often, when we run regressions, we aren’t really interested in “point estimates” (i.e. specific coefficient estimates), but rather in using these estimates to test hypotheses. For example, what if what we are really interested in is whether black workers have a different return to an additional year of schooling than white workers. How could we test this?

15 Regression What if I added in an “interaction” term between schooling and race? E[earnings_i] = α + β1*black_i + β2*yrs of school_i + β3*black_i*yrs of school_i. Doing this estimation I get: E[earnings_i] = -47,011 + 1,381*black_i + 7,321*yrs of school_i - 982*black_i*yrs of school_i. How do we interpret these coefficients? What is the average impact of another year of schooling on a white worker’s earnings? What is the average impact of another year of schooling on a black worker’s earnings? So how do we test whether the return to an additional year of schooling is different for blacks than for whites?
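One way to read the interaction estimates is to compute the implied return to schooling for each group and the implied racial gap at a given level of schooling. This sketch only rearranges the point estimates quoted above; the function names are illustrative.

```python
# Point estimates from the interaction model on the slide.
ALPHA, B_BLACK, B_SCHOOL, B_INTERACT = -47_011, 1_381, 7_321, -982


def expected_earnings(black, school):
    return ALPHA + B_BLACK * black + B_SCHOOL * school + B_INTERACT * black * school


def return_to_schooling(black):
    """Derivative of expected earnings with respect to years of schooling."""
    return B_SCHOOL + B_INTERACT * black


def racial_gap(school):
    """Expected black-minus-white earnings at a given level of schooling."""
    return B_BLACK + B_INTERACT * school


print("return per year, white:", return_to_schooling(0))
print("return per year, black:", return_to_schooling(1))
print("gap at 12 years:", racial_gap(12))
print("gap at 16 years:", racial_gap(16))
```

With an interaction term, the coefficient on the black indicator alone is the gap at zero years of schooling, so it should not be read as the overall gap; the gap now depends on schooling.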

16 Regression Precision/Significance of estimates: Consider again the previous estimates. What we are testing is whether the coefficient of interest is “significantly” different from zero (i.e., how likely is it that we would have gotten this large an estimate by chance even if the true value were zero). To test a hypothesis, we must compare the size of the coefficient to its standard error. A good rule of thumb is that an estimate is significant when the absolute magnitude of the coefficient is close to or above twice its standard error. So what will generally impact whether an estimate is significant?
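The rule of thumb amounts to checking whether the ratio of the coefficient to its standard error is at least about 2 in absolute value. A minimal sketch, assuming hypothetical standard errors (the slides do not report any):

```python
def t_ratio(coef, std_err):
    """Coefficient divided by its standard error."""
    return coef / std_err


def roughly_significant(coef, std_err, threshold=2.0):
    """Rule of thumb: |coefficient| is at least about twice its standard error."""
    return abs(t_ratio(coef, std_err)) >= threshold


interaction_coef = -982      # interaction estimate quoted in the slides
hypothetical_se_small = 400  # HYPOTHETICAL standard error, not from the slides
hypothetical_se_large = 600  # HYPOTHETICAL standard error, not from the slides

print(roughly_significant(interaction_coef, hypothetical_se_small))
print(roughly_significant(interaction_coef, hypothetical_se_large))
```

The same point estimate passes the rule with the smaller standard error and fails with the larger one, which is why precision matters as much as magnitude.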

17 Specification form Often when doing regressions researchers will use the natural log of earnings rather than simply earnings as the dependent variable: ln(earnings_i) = α + β1*black_i + β2*schooling_i + e_i. This is done for two reasons: 1. This specification often “fits” the data better, as a log transformation makes a variable with a highly skewed distribution closer to a normal distribution, which generally helps the regression fit. 2. The coefficients can be roughly interpreted as the percentage change in the dependent variable associated with a unit change in the corresponding control variable (i.e., a semi-elasticity), rather than as the change in the level of the dependent variable given a unit change in the corresponding control variable.
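The word "roughly" matters here: when the dependent variable is in logs, a coefficient β corresponds to an exact proportional change of exp(β) - 1, which is close to β only when β is small. A quick sketch with hypothetical coefficient values (no log-specification estimates are given in the slides):

```python
import math


def exact_pct_change(beta):
    """Exact proportional change in y implied by a coefficient in a log-y model."""
    return math.exp(beta) - 1


# HYPOTHETICAL coefficients, chosen only to show the size of the approximation error.
small_beta = 0.10
large_beta = -0.25

print(f"beta = {small_beta}: exact change = {exact_pct_change(small_beta):.2%}")
print(f"beta = {large_beta}: exact change = {exact_pct_change(large_beta):.2%}")
```

For a coefficient of 0.10 the approximation is off by about half a percentage point; for larger coefficients, such as those on indicator variables, the gap grows.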

18 Specification form

19 Omitted variables If we are really interested in the wage gap between black workers and white workers after conditioning on years of education, what are we missing from the basic specification that might obscure the answer we are really looking for? E[ln(earnings_i)] = α + β1*black_i + β2*schooling_i

20 Omitted variables ln(earnings_i) = α + β1*black_i + β2*Hispanic_i + β3*schooling_i + e_i. What will this likely do to the coefficient on the black indicator?

21 Omitted variables E[ln(earnings_i)] = α + β1*black_i + β2*Hispanic_i + β3*schooling_i. What will this likely do to the coefficient on the black indicator?

22 Omitted variables What about other things like age and region? These things are surely associated with earnings, therefore don’t they need to be included?

23 Omitted variables What about other things like age and region? These things are surely associated with earnings, therefore don’t they need to be included?

24 Omitted variables In the end, it is not necessary to control for every possible thing that can affect the dependent (y, or left-hand-side) variable. What to control for depends on your question of interest. Robustness: a finding is said to be relatively robust if the basic qualitative finding is unchanged by the inclusion of further variables, by adding more interaction terms (i.e., the combination of two existing variables, such as the term black*years of school), or by changes in specification form (e.g., a log transformation of the dependent variable).

25 Selection Be wary of making causal inferences from significant correlations. In particular, there are often issues of sample selection/endogeneity/omitted variables. Many characteristics are the products of choice (often called endogenous characteristics). In such cases it is hard to identify how the outcome of interest depends on that endogenous characteristic, versus other unobserved/omitted characteristics that determined that choice. Consider the Brooklyn Bridge “effect” on wages.

26 Selection More specifically, what if we wanted to estimate the effect of being in a gang on individual criminality, or the effect of being married on criminality? Suppose we estimated E[y] = α + β1*Gang + β2*x2 + β3*x3, or E[y] = α + β1*Marriage + β2*x2 + β3*x3. Will this tell us what we want to know? What if we further controlled for income, neighborhood, education, and lots and lots of other stuff in the other x’s?

27 Selection

28 In general, we are often interested in estimating the expected effect of increasing some variable x1 on some outcome variable y. But it is often the case that x1 isn’t randomly determined for each person; rather, it is chosen (gang status, marriage). Moreover, people who choose different amounts of x1 might be expected to have different values of y even if they hadn’t chosen a different x1. Essentially, there is some unobservable variable z that may impact both an individual’s expected value of y and his expected value of x1. When we estimate E[y] = α + β1*x1 + β2*x2 + β3*x3, β1 reflects both the impact of x1 on y and the impact of the unobserved z on y (since z affects x1). A basic regression can’t separately identify these two mechanisms.
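A small simulation can make this concrete. The sketch below uses synthetic data with made-up parameters: an unobserved z raises both x1 and y, and the true effect of x1 on y is set to 2. Regressing y on x1 alone then picks up part of the effect of z.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# SYNTHETIC data with made-up parameters. z plays the unobserved variable.
z = rng.normal(size=n)
x1 = z + rng.normal(size=n)                    # z pushes people toward higher x1
y = 2.0 * x1 + 3.0 * z + rng.normal(size=n)    # true effect of x1 on y is 2


def ols(columns, outcome):
    """Least-squares coefficients: intercept first, then one per column."""
    X = np.column_stack([np.ones(len(outcome)), *columns])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef


beta_short = ols([x1], y)[1]       # z omitted, as in practice
beta_long = ols([x1, z], y)[1]     # z included (not feasible when z is unobserved)

print("coefficient on x1 without z:", beta_short)
print("coefficient on x1 with z:", beta_long)
```

In this setup the coefficient without z converges to 2 + 3 * cov(x1, z) / var(x1) = 3.5 rather than 2, so the regression overstates the effect of x1 even with a very large sample.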

29 Selection One way to handle selection is to use what is referred to as an Instrument or Instrumental Variable. The idea is to find something that is essentially random, or at least not a choice made by the individuals in the sample, that impacts the individual’s value of x1. Consider Job Corps. We want to know the impact of participating in Job Corps (x1) on earnings (y). The problem: it isn’t random who participates in the program (i.e., who gets x1 = 1). Consider the following simple model:

30 Selection Two types of people, A’s and B’s, are eligible for Job Corps:

Type   Job Corps   No Job Corps
A’s    $40,000     $34,000
B’s    $31,000     $30,000

Suppose that, if given the chance, only A’s would enroll, not B’s (which everyone knows). So the true impact of the program on those who would participate is $6,000. The researcher wants to uncover this effect, but doesn’t know the above information and can’t observe each person’s type. If Job Corps offered access to all eligibles, what would the estimated impact be if one just compared Job Corps participants to non-participants?

31 Selection Two types of people, A’s and B’s, are eligible for Job Corps:

Type   Job Corps   No Job Corps
A’s    $40,000     $34,000
B’s    $31,000     $30,000

Now suppose the researcher randomized who was given access to enroll (this randomized access can be called an Instrumental Variable). Amongst winners, half would enroll (A’s) and half wouldn’t (B’s). Average earnings would be 0.5*40,000 + 0.5*30,000 = $35,000. Amongst losers, none would enroll. Average earnings would be 0.5*34,000 + 0.5*30,000 = $32,000. Comparing winners vs. losers, estimated impact? Comparing winner participants vs. losers, estimated impact? Instrumental Variable (IV), estimated impact?
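The three comparisons asked about above can be worked through numerically. This sketch uses the earnings table on this slide and the equal shares of A’s and B’s that the slide’s own averages assume. The IV estimate is computed as the standard Wald ratio (difference in mean earnings between winners and losers, divided by the difference in enrollment rates), which is an assumption about the estimator intended here. The last line adds the simple participant vs. non-participant comparison under universal access, for contrast.

```python
# Earnings by (type, participates in Job Corps), from the table on this slide.
earnings = {
    ("A", True): 40_000, ("A", False): 34_000,
    ("B", True): 31_000, ("B", False): 30_000,
}
share = {"A": 0.5, "B": 0.5}                   # equal shares, as the slide's averages assume
enrolls_if_offered = {"A": True, "B": False}   # only A's enroll when given access


def mean_earnings(offered):
    """Average earnings in a group that was (or was not) given access."""
    return sum(
        share[t] * earnings[(t, offered and enrolls_if_offered[t])] for t in share
    )


winners = mean_earnings(offered=True)
losers = mean_earnings(offered=False)

winners_vs_losers = winners - losers
participants_vs_losers = earnings[("A", True)] - losers

# Wald ratio: earnings difference divided by the difference in enrollment rates
# (losers cannot enroll, so their enrollment rate is zero).
take_up = sum(share[t] for t in share if enrolls_if_offered[t])
iv_estimate = winners_vs_losers / take_up

# Universal access: participants are all A's, non-participants are all B's.
naive_estimate = earnings[("A", True)] - earnings[("B", False)]

print("winners vs. losers:", winners_vs_losers)
print("winner participants vs. losers:", participants_vs_losers)
print("IV (Wald) estimate:", iv_estimate)
print("participants vs. non-participants under universal access:", naive_estimate)
```

Only the IV ratio recovers the $6,000 effect on those who would participate. Comparing all winners to all losers understates it because half the winners never enroll, and comparisons that single out participants overstate it because A’s earn more than B’s even without the program.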

32 Selection In the Job Corps study we read, randomization into “treatment” and “control” was explicit, but it was effectively an IV. Often researchers can’t randomize explicitly, so they get creative. Look for “natural experiments” that effectively do the randomization. Ask: what could impact an individual’s realization of x1, but should not be at all correlated with that individual’s expected outcome y?

33 Summary In summary, the coefficient on a given variable tells you the expected change in the outcome of interest due to a one unit change in that variable, after controlling for all of the other included characteristics. Little credence should be given to imprecisely estimated coefficients (i.e., those with standard errors large enough that they are not statistically different from zero), especially when hypothesis testing. One of the key details of a paper is the “empirical strategy” it uses to deal with selection effects. Much of this class will be spent discussing the various empirical strategies authors use in the papers we read. In the end, use your empirical intuition---can this data really answer the question of interest?

