
Overview of Regression Analysis

Conditional Mean We all know what a mean or average is. E.g., the mean annual earnings for 25-44 year old working males is $54,648 (March 2010). We are also often interested in how this mean differs by other individual characteristics. E.g., how do mean earnings differ between black and non-black workers? Mean earnings for working non-black males ages 25-44 = $56,614. Mean earnings for working black males ages 25-44 = $39,380. These are known as Conditional Means (the mean conditioned on some other characteristic, in this case race). So without controlling for anything else, 25-44 yr old black working males earn on average $17,234 less annually, or 30% less, than similarly aged white working males.

Conditional Means When testing a theory, though, we often want to know how much of a given mean difference can be attributed to a particular observable variable, after controlling for other observable differences. For example, we know that earnings are highly tied to schooling, and there is a significant racial gap in schooling, so we might want to know how large the racial earnings gap is net of racial differences in years of schooling (i.e., controlling for schooling).

Conditional Means One way to do this is to calculate even more complicated conditional means. E.g.:
Non-black males between 25 and 44 w/out hs degree = $25,278
Black males between 25 and 44 w/out hs degree = $22,275
Non-black males between 25 and 44 w/ hs degree = $39,922
Black males between 25 and 44 w/ hs degree = $32,670
Non-black males between 25 and 44 w/ college degree = $80,295
Black males between 25 and 44 w/ college degree = $61,136

Conditional Means Then, we can find how much less blacks earn than whites, after controlling for education, via the following weighted mean formula:

gap = Σ_i (n_b,i / n_b) * (earnings_w,i - earnings_b,i)

where i corresponds to the three education categories, n_b,i / n_b corresponds to the fraction of black male workers in education category i, earnings_b,i corresponds to the mean earnings for black workers in education category i, and earnings_w,i corresponds to the mean earnings for white workers in education category i. Doing so we find that, according to the above conditional mean calculations, black male workers earn about $11,064, or 11,064/54,648 = 20 percent, less than white male workers with similar education characteristics. So conditioning on years of education explains about 33% of the racial earnings gap ([0.30 - 0.20]/0.30 = 0.33).
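
As a check on the arithmetic, the weighted-mean calculation can be sketched in Python. The six group means are from the slide; the education shares for black workers are made-up placeholders, since the actual shares aren't shown on the slide (with the true shares the gap comes out to about $11,064):

```python
# Hypothetical education shares n_b,i / n_b for black male workers
# (assumed values for illustration; not from the slide).
shares = {"no_hs": 0.20, "hs": 0.50, "college": 0.30}

# Group means from the slide.
earnings_black = {"no_hs": 22_275, "hs": 32_670, "college": 61_136}
earnings_white = {"no_hs": 25_278, "hs": 39_922, "college": 80_295}

# Weighted mean gap: sum_i (n_b,i / n_b) * (earnings_w,i - earnings_b,i)
gap = sum(shares[i] * (earnings_white[i] - earnings_black[i]) for i in shares)
```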

Conditional Means It can be quite cumbersome to compute all these conditional means, though, especially if we start adding in more categories for education, e.g., only up to 10th grade, only up to 11th grade, only up to 12th grade, 1 yr of college, 2 years of college, 3 years of college, etc. Moreover, what if we are also interested in the impact of another year of schooling on earnings, after controlling for race? That would require a whole new set of calculations.

Regression This is why a regression model is often a simpler way to describe conditional means:

earnings_i = α + β1*black_i + β2*yrs_of_school_i + e_i

Often, the left-hand-side variable is called the dependent variable; the right-hand-side variables are called "control" variables or "regressors" (or sometimes "independent" variables, but I don't like that). α is known as the intercept, the β's are (slope) coefficients, and e_i is the "residual." Estimating a regression amounts to finding the intercept and slope coefficients that minimize the sum of the squared e_i terms across the sample (i.e., finding the best "fit"). So the intercept and coefficients essentially account for the variation in the dependent variable (earnings) that is common across all people with respect to the control variables, while the residual is the individual-specific variation, or how each individual differs from the average. Graphically?
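
A minimal sketch of what "finding the best fit" means, using simulated data (all coefficient values here are invented for illustration) and NumPy's least-squares solver, which picks the intercept and slopes minimizing the sum of squared residuals:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Simulated sample (made-up parameters, not the slide's estimates).
black = rng.integers(0, 2, size=n)      # 0/1 race indicator
school = rng.integers(8, 17, size=n)    # years of schooling
e = rng.normal(0, 5_000, size=n)        # individual-specific residual
earnings = 10_000 - 8_000 * black + 4_000 * school + e

# OLS: choose (alpha, b1, b2) minimizing the sum of squared residuals.
X = np.column_stack([np.ones(n), black, school])
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)
alpha_hat, b1_hat, b2_hat = coef
```

With a large sample the estimates land close to the parameters used to generate the data.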

Regression [Figure: earnings plotted against years of schooling; two parallel lines with slope β2, with intercept α for whites and α + β1 for blacks.]

Regression When I estimate this model I get:

earnings_i = -70,003 - 10,381*black_i + 8,888*yrs_of_school_i + e_i

Regression If we simply take the coefficients, this can be referred to as our estimated linear conditional expectation function:

E[earnings_i] = -70,003 - 10,381*black_i + 8,888*yrs_of_school_i

Computing the equation for particular characteristics gives the "expected," or mean, earnings for a person with those characteristics. So for a non-black with 12 years of schooling, expected earnings are: -70,003 - 10,381*0 + 8,888*12 = $36,653 (compared to $39,922 when we computed this directly using actual conditional means).
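
The plug-in calculation can be written as a small helper (the function name is mine; the coefficients are the estimates from the slide):

```python
# Fitted conditional expectation function from the slide's estimates.
def expected_earnings(black: int, yrs_school: int) -> int:
    return -70_003 - 10_381 * black + 8_888 * yrs_school

non_black_12 = expected_earnings(black=0, yrs_school=12)  # → 36653
black_12 = expected_earnings(black=1, yrs_school=12)      # → 26272
```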

Regression Often we are simply interested in the coefficients, i.e., how each right-hand-side variable is associated with the dependent variable. Interpreting the coefficients (the "betas"): consider a generic linear function y = β0 + β1*x1 - 12*x2. How do we determine the change in y associated with a one-unit change in x2, holding everything else constant? Now suppose we want to know the expected change in earnings due to a one-year increase in schooling, holding all other variables constant. Recall our estimated linear conditional expectation function: earnings_i = -70,003 - 10,381*black_i + 8,888*yrs_of_school_i

Regression So, if one is interested in the expected change in a dependent variable associated with a one-unit change in one of the control variables, simply take the derivative of the estimated conditional expectation function with respect to that control variable. So, given our estimate, earnings are "expected" to increase by $8,888 for each additional year of schooling.

Regression Consider again our estimated linear conditional expectation function for earnings:

E[earnings_i] = -70,003 - 10,381*black_i + 8,888*yrs_of_school_i

Taking the derivative with respect to the "black" indicator variable we get -10,381. This means that, holding everything else equal (i.e., yrs of education), black workers on average earn $10,381 less than white workers (i.e., expected earnings for black workers are $10,381 less than expected earnings for white workers with similar education). This compares similarly to the $11,064 conditional pay differential we computed before, but is still a little different. Why?

Regression Often, when we run regressions, we aren't really interested in "point estimates" (i.e., specific coefficient estimates), but rather in using these estimates to test hypotheses. For example, what if what we are really interested in is whether black workers have a different return to an additional year of schooling than white workers? How could we test this?

Regression What if I added in an "interaction" term between schooling and race?

E[earnings_i] = α + β1*black_i + β2*yrs_of_school_i + β3*black_i*yrs_of_school_i

Doing this estimation I get:

E[earnings_i] = -47,… + …*black_i + 7,321*yrs_of_school_i - 982*black_i*yrs_of_school_i

How do we interpret these coefficients? What is the avg impact of another year of schooling on a white worker's earnings? What is the avg impact of another year of schooling on a black worker's earnings? So how do we test whether the return to an additional year of schooling is different for blacks than whites?
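
One way to read the interaction: the marginal return to a year of schooling is β2 + β3*black, so it is β2 for whites and β2 + β3 for blacks. Using the two schooling coefficients that appear on the slide:

```python
# Schooling coefficients from the slide's interaction model.
beta2 = 7_321   # yrs_of_school coefficient
beta3 = -982    # black * yrs_of_school coefficient

# d(earnings)/d(yrs_of_school) = beta2 + beta3 * black
return_white = beta2 + beta3 * 0   # → 7321 per extra year
return_black = beta2 + beta3 * 1   # → 6339 per extra year
```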

Regression Precision/significance of estimates: consider again the previous estimates. What we are testing is whether a coefficient of interest is "significantly" different from zero (i.e., how likely is it that we would have gotten this large an estimate by chance even if it were really equal to zero). To test a hypothesis, we compare the size of the coefficient to its standard error. A good rule of thumb is that the absolute magnitude of the coefficient should be close to or above twice the standard error. So what will generally impact whether an estimate is significant?
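
The rule of thumb amounts to checking whether |coefficient| ≥ 2 × standard error (equivalently, |t| ≥ 2). A tiny sketch; the helper name and both standard errors below are hypothetical, just to illustrate the check:

```python
# Rough significance screen: |coefficient| at least twice its
# standard error (standard errors here are made-up examples).
def roughly_significant(coef: float, se: float) -> bool:
    return abs(coef) >= 2 * se

big_t = roughly_significant(-10_381, 2_100)   # |t| ≈ 4.9 → True
small_t = roughly_significant(-982, 1_400)    # |t| ≈ 0.7 → False
```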

Specification form Often when doing regressions researchers will use the natural log of earnings, rather than earnings itself, as the dependent variable:

ln(earnings_i) = α + β1*black_i + β2*schooling_i + e_i

This is done for two reasons: 1. This specification often "fits" the data better, as the log transformation makes a variable with a highly skewed distribution closer to a normal distribution, which generally helps the regression fit. 2. The coefficients can be roughly interpreted as the percentage change in the dependent variable associated with a one-unit change in the corresponding control variable (a semi-elasticity), rather than how the level of the dependent variable changes given a unit change in the corresponding control variable.
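
A quick sketch of why log coefficients read as percentage changes: a coefficient b on a control variable implies a proportional change in earnings of exp(b) - 1, which is close to b itself when b is small. The coefficient value below is an arbitrary example, not an estimate from the slides:

```python
import math

# Assumed schooling coefficient in a log-earnings specification.
b = 0.08

# Exact proportional change implied by a one-unit increase.
exact_pct = math.exp(b) - 1   # ≈ 0.083, close to b = 0.08
```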

Specification form

Omitted variables If we are really interested in the wage gap between black workers and white workers after conditioning on years of education, what are we missing from the basic specification that might obscure the answer we are really looking for?

E[ln(earnings_i)] = α + β1*black_i + β2*schooling_i

Omitted variables E[ln(earnings_i)] = α + β1*black_i + β2*Hispanic_i + β3*schooling_i + e_i What will this likely do to the coefficient on the black indicator?

Omitted variables What about other things like age and region? These things are surely associated with earnings, therefore don’t they need to be included?

Omitted variables In the end, it is not necessary to control for every possible thing that can affect the dependent (y, or left-hand-side) variable. What to control for depends on your question of interest. Robustness: a finding is said to be relatively robust if the basic qualitative finding is unchanged by the inclusion of further variables, by adding more interaction terms (i.e., combinations of two existing variables, such as the term black*yrs_of_school), or by changes in specification form (e.g., log transformation of the dependent variable).

Selection Be wary of making causal inferences from significant correlations. In particular, there are often issues of sample selection/endogeneity/omitted variables. Many characteristics are the products of choice (often called endogenous characteristics). In such cases it is hard to identify how the outcome of interest depends on that endogenous characteristic, versus other unobserved/omitted characteristics that determined that choice. Consider the Brooklyn Bridge "effect" on wages.

Selection More specifically, what if we wanted to estimate the effect of being in a gang on individual criminality, or the effect of marriage on criminality? Suppose we estimated:

E[y] = α + β1*Gang + β2*x2 + β3*x3
E[y] = α + β1*Marriage + β2*x2 + β3*x3

Will this tell us what we want to know? What if we further controlled for income, neighborhood, education, and lots and lots of other stuff in the other x's?

Selection

In general, we are often interested in estimating the expected effect of increasing some variable x1 on some outcome variable y. But it is often the case that x1 isn't randomly determined for each person; rather, it is chosen (gang status, marriage). Moreover, people who choose a different amount of x1 might be expected to have different values of y even if they didn't choose a different x1. Essentially, there is some unobservable variable z that may impact both an individual's expected value of y and his expected value of x1. When we estimate E[y] = α + β1*x1 + β2*x2 + β3*x3, β1 reflects both the impact of x1 on y and the impact of the unobserved z on y (since z affects x1). A basic regression can't separately identify these two mechanisms.

Selection One way to handle selection is to use what is referred to as an instrument, or instrumental variable. The idea is to find something that is essentially random, or at least not a choice made by the individuals in the sample, that impacts the individual's value of x1. Consider Job Corps: we want to know the impact of participating in Job Corps (x1) on earnings (y). The problem is that it isn't random who participates in the program (i.e., who gets x1 = 1). Consider the following simple model:

Selection Two types of people, A's and B's, are eligible for Job Corps:

Type   Job Corps   No Job Corps
A's    $40,000     $34,000
B's    $31,000     $30,000

Suppose that, if given the chance, only A's would enroll, not B's (which everyone knows). So the true impact of the program on those who would participate is $6,000. A researcher wants to uncover this effect, but doesn't know the above info and can't observe each person's type. If Job Corps offered access to all eligibles, what would be the estimated impact if one just compared Job Corps participants to non-participants?

Selection Two types of people, A's and B's, are eligible for Job Corps:

Type   Job Corps   No Job Corps
A's    $40,000     $34,000
B's    $31,000     $30,000

Now suppose the researcher randomized access to enrollment (this access can be called an instrumental variable). Amongst winners, half would enroll (A's) and half wouldn't (B's); average earnings would be 0.5*40,000 + 0.5*30,000 = $35,000. Amongst losers, none would enroll; average earnings would be 0.5*34,000 + 0.5*30,000 = $32,000. Comparing winners vs. losers, what is the estimated impact? Comparing winner participants vs. losers? The instrumental variable (IV) estimated impact?
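
The lottery arithmetic on this slide amounts to a Wald/IV estimate: the winner-loser difference in average earnings, divided by the winner-loser difference in enrollment rates, recovers the true $6,000 effect on participants.

```python
# Averages from the slide's lottery example.
mean_winners = 0.5 * 40_000 + 0.5 * 30_000   # winners: A's enroll, B's don't
mean_losers = 0.5 * 34_000 + 0.5 * 30_000    # losers: nobody enrolls
enroll_winners, enroll_losers = 0.5, 0.0     # enrollment rates by lottery arm

# Wald/IV estimate: earnings gap scaled by the enrollment gap.
iv_estimate = (mean_winners - mean_losers) / (enroll_winners - enroll_losers)
# → 6000.0, the true impact on participants
```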

Selection In the Job Corps study we read, randomization into "treatment" and "control" was explicit, but it was effectively an IV. Often researchers can't randomize explicitly, so they get creative and look for "natural experiments" that effectively do the randomization. Ask: what could impact an individual's realization of x1, but should not be at all correlated with the individual's expected outcome y?

Summary In summary: The coefficient on a given variable tells you the expected change in the outcome of interest due to a one-unit change in that variable, after controlling for all of the other included characteristics. Little credence should be given to imprecisely estimated coefficients (i.e., those with standard errors large enough that they are not statistically different from zero), especially when hypothesis testing. A key part of any paper is the "empirical strategy" it uses to deal with selection effects; much of this class will be spent discussing the various empirical strategies authors use in the papers we read. In the end, use your empirical intuition: can this data really answer the question of interest?