Presentation on theme: "Multiple Regression Analysis: Specification And Data Issues"— Presentation transcript:
1 Multiple Regression Analysis: Specification And Data Issues Chapter 9
2 I. Introduction
Failure of the zero conditional mean assumption: correlation between the error, u, and one or more explanatory variables.
Why variables can be endogenous.
Possible remedies.
Functional form misspecification: if an omitted variable is a function of an explanatory variable in the model, the model suffers from functional form misspecification.
Using proxy variables to address omitted variable bias.
Measurement error: not all variables are measured accurately.
3 II. Functional Form
A regression model can suffer from misspecification when it doesn’t properly account for the relationship between the dependent and explanatory variables.
wage = b0 + b1educ + b2exper + u
Omit exper^2 or exper*educ: omitting such a variable can lead to biased coefficient estimates on all regressors.
Use wage rather than log(wage) (the latter satisfies GM): using the wrong form of a variable to relate LHS and RHS can lead to biased coefficient estimates on all regressors.
4 II. Functional Form
We can move away from a linear relationship by:
using logs on the RHS, the LHS, or both;
using quadratic forms of the x’s;
using interactions of the x’s.
How do we know if we’ve gotten the right functional form for our model?
Use an F-test of joint exclusion restrictions to detect misspecification.
5 II. Functional Form
Ex: Model of Crime. Quadratics or not?
The squared terms are each individually significant and are jointly significant (F = 31.37, df = 3; 2,713).
Adding squares makes interpretation more difficult:
Before, the intuitive (–) sign on pcnv suggested that the conviction rate deters crime.
Now the level term is positive and the quadratic term is negative: at low levels conviction has no deterrent effect; it is only effective at high levels.
Note: Don’t square qemp86, because it’s a discrete variable taking only a few values.
6 II. Functional Form
How do you know what to try?
Use economic theory to guide you.
Think about the interpretation.
Does it make more sense for x to affect y in percentage terms (use logs) or in absolute terms?
Does it make more sense for the effect of x1 on y (the derivative of y with respect to x1) to vary with x1 (quadratic), to vary with x2 (interactions), or to be fixed?
7 II. Ramsey’s RESET
We know how to test joint exclusion restrictions for higher-order terms or interactions.
It can be tedious to add and test extra terms.
We may find that a squared term matters when using logs would really be even better.
A general test of functional form is Ramsey’s regression specification error test (RESET).
Intuition: if the specification is okay, no nonlinear functions of the independent variables should be significant when added to the original equation.
Cost: degrees of freedom.
8 II. Ramsey’s RESET
RESET relies on a trick similar to the special form of the White test.
Instead of adding functions of the x’s directly, we add and test functions of ŷ:
y = b0 + b1x1 + … + bkxk + d1ŷ^2 + d2ŷ^3 + error
Don’t use the above equation for parameter estimates, only to test the inclusion of the extra terms.
H0: d1 = 0, d2 = 0, using F ~ F(2, n-k-3).
A significant F-statistic suggests there’s some sort of functional form problem.
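The RESET procedure above can be sketched in a few lines of NumPy. This is only an illustration on simulated data: the helper names `ols_ssr` and `reset_f`, the sample size, and the data-generating process are assumptions made for the example, not part of the lecture. It fits the original model, adds ŷ^2 and ŷ^3, and forms the F statistic for their joint exclusion with (2, n-k-3) degrees of freedom.

```python
import numpy as np

def ols_ssr(X, y):
    """Return the sum of squared residuals and the fitted values from OLS of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    return float(resid @ resid), fitted

def reset_f(X, y):
    """RESET F statistic (up to yhat^3). X must already contain a constant column."""
    n, p = X.shape                                  # p = k + 1 (k slopes plus intercept)
    ssr_r, yhat = ols_ssr(X, y)                     # restricted: the original model
    X_aug = np.column_stack([X, yhat**2, yhat**3])  # unrestricted: add yhat^2, yhat^3
    ssr_ur, _ = ols_ssr(X_aug, y)
    df_denom = n - p - 2                            # equals n - k - 3
    return ((ssr_r - ssr_ur) / 2) / (ssr_ur / df_denom)

# Hypothetical simulated data: one outcome is truly linear, the other omits x1^2.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
u = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

y_linear = 1 + 2 * x1 - x2 + u
y_quad = 1 + 2 * x1 - x2 + 1.5 * x1**2 + u

f_ok = reset_f(X, y_linear)
f_mis = reset_f(X, y_quad)
print(f"RESET F, correctly specified model: {f_ok:.2f}")
print(f"RESET F, model omitting x1^2:       {f_mis:.2f}")
```

The F statistic would then be compared with a critical value from the F(2, n-k-3) distribution. In the misspecified case the statistic should be far larger than in the correctly specified one.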
9 II. Ramsey’s RESET
Ex: Housing Price Equation (n = 88)
price = b0 + b1lotsize + b2sqrft + b3bdrms + u
RESET statistic (up to ŷ^3) = 4.67, distributed F(2, 82), with p-value .012.
Evidence of functional form misspecification.
lprice = b0 + b1llotsize + b2lsqrft + b3bdrms + u
RESET statistic (up to ŷ^3) = 2.56, distributed F(2, 82), with p-value .084.
No evidence of functional form misspecification at the 5% level.
On the basis of RESET, the log equation is preferred.
But just because the log equation “passed” RESET, does that mean it’s the right specification?
Should still use economic theory to determine if the functional form makes sense.
10 III. Proxy Variables
Previously, we assumed functional form misspecification could be resolved because you had the relevant data.
What if the model is misspecified because no data are available on an important x variable?
log(wage) = b0 + b1educ + b2exper + b3abil + u
We would like to hold ability fixed, but have no measure of it.
Excluding it causes the parameter estimates to be biased.
Potential solution: obtain a proxy variable for the omitted variable.
11 III. Proxy Variables
A proxy variable is something related to the unobserved variable that we’d like to control for in our analysis but can’t.
Ex: IQ as a proxy for ability.
x3* = d0 + d3x3 + v3, where * implies unobserved.
v3 signals that x3 and x3* are not exactly related.
d0 allows different scales to be compared (i.e. the IQ scale may not be how ability is measured).
Just substitute x3 for x3* in y = b0 + b1x1 + b2x2 + b3x3* + u.
12 III. Proxy Variables
What do we need for this solution to give us unbiased estimates of b1 and b2?
We need assumptions on u and v3:
1.) u is uncorrelated with x1, x2, x3* (standard).
This also requires u to be uncorrelated with x3: once x1, x2, x3* are included, x3 is irrelevant (i.e. x3 doesn’t directly affect y other than through x3*).
2.) v3 is uncorrelated with x1, x2, x3.
For v3 to be uncorrelated with x1 and x2, x3 must be a good proxy for x3*.
Formally, this means E(x3* | x1, x2, x3) = E(x3* | x3) = d0 + d3x3.
Once x3 is controlled for, x3* does not depend on x1 or x2.
13 III. Proxy Variables
In the wage example: E(abil | educ, exper, IQ) = E(abil | IQ) = d0 + d3IQ.
This implies average ability only changes with IQ, and not with educ and exper (once IQ is included).
So we are really running:
y = (b0 + b3d0) + b1x1 + b2x2 + b3d3x3 + (u + b3v3)
with a redefined intercept, error term, and x3 coefficient.
Can rewrite as: y = a0 + a1x1 + a2x2 + a3x3 + e
We get unbiased estimates of a0, b1 = a1, b2 = a2, and a3.
We won’t get the original b0 or b3.
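A small simulation can make the proxy-variable algebra concrete. The sketch below is hypothetical: all parameter values, the variable names, and the way educ is made to depend on IQ are assumptions chosen so that the proxy assumptions hold (v3 is drawn independently of educ, exper and IQ). It compares the educ coefficient when ability is simply omitted with the coefficient when IQ stands in for ability.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical "true" parameters of log(wage) = b0 + b1*educ + b2*exper + b3*abil + u
b0, b1, b2, b3 = 1.0, 0.06, 0.02, 0.10
# Hypothetical proxy relation: abil = d0 + d3*IQ + v3
d0, d3 = -5.0, 0.05

iq = rng.normal(100, 15, size=n)
v3 = rng.normal(0, 0.5, size=n)              # independent of educ, exper, IQ
abil = d0 + d3 * iq + v3                     # unobserved by the analyst
educ = 12 + 0.1 * (iq - 100) + rng.normal(0, 2, size=n)  # related to abil only via IQ
exper = rng.uniform(0, 30, size=n)
u = rng.normal(0, 0.3, size=n)
lwage = b0 + b1 * educ + b2 * exper + b3 * abil + u

def ols(y, *cols):
    """OLS coefficients of y on a constant plus the given regressors."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

beta_omit = ols(lwage, educ, exper)          # ability omitted, no proxy
beta_proxy = ols(lwage, educ, exper, iq)     # IQ used as proxy for ability

print("educ coefficient, ability omitted:", round(beta_omit[1], 4))
print("educ coefficient, IQ as proxy:    ", round(beta_proxy[1], 4))
print("IQ coefficient (estimates b3*d3): ", round(beta_proxy[3], 4))
```

With the proxy included, the educ coefficient should sit close to b1, while the IQ coefficient estimates the product b3·d3 rather than b3 itself, matching the redefined coefficients on the slide.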
14 III. Proxy Variables
IQ as a proxy for ability. Want to estimate the return to education:
6.5% when the regression is run without the ability proxy;
5.4% when IQ is included.
Interacting educ*IQ allows for the possibility that returns to education differ across ability levels. The interaction is not significant, though.
15 III. Proxy Variables
A proxy variable can still lead to bias if the assumptions are not satisfied.
Say x3* = d0 + d1x1 + d2x2 + d3x3 + v3 (violation).
Then we are running:
y = (b0 + b3d0) + (b1 + b3d1)x1 + (b2 + b3d2)x2 + b3d3x3 + (u + b3v3)
The bias will depend on the signs of b3 and dj.
In the wage example we can safely assume d1 > 0 and b3 > 0, so that the return to education is upward biased even when using the proxy variable.
This bias may be smaller than the omitted variable bias, though (if x3* and x1 are correlated less than x3 and x1).
16 III. Lagged Dependent Variables
What if there are unobserved variables, and you can’t find reasonable proxy variables?
Can include a lagged dependent variable to account for omitted variables that contribute to both past and current levels of y.
You must think past and current y are related for this to make sense.
This allows you to account for historical factors that cause current differences in the dependent variable.
17 III. Lagged Dependent Variables
Ex: Model of Crime: effect of expenditure on crime.
crime = b0 + b1 unem + b2 expend + u
The concern is that cities with lots of crime react by spending more on crime prevention, which biases the estimates.
The coefficients on unem and expend are not intuitive.
crime = b0 + b1 unem + b2 expend + b3 crime_{-1} + u, where crime_{-1} is the lagged crime rate.
The lagged value controls for the fact that cities with high historical crime rates may spend more on crime prevention.
The coefficient estimates are now more intuitive.
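The crime example can be mimicked with simulated data. Everything in this sketch is an assumption made for illustration: an unobserved "historical" factor `hist` raises both past and current crime and also drives spending, and the true effect of expend on crime is set to -1. The point is only to show how adding the lagged dependent variable can flip a counterintuitive sign.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

hist = rng.normal(size=n)                          # unobserved historical crime factors
crime_lag = hist + rng.normal(0, 0.5, size=n)      # last period's crime rate
expend = hist + rng.normal(size=n)                 # high-crime cities spend more
unem = rng.normal(size=n)
true_b2 = -1.0                                     # assumed true effect of expend
crime = 0.5 * unem + true_b2 * expend + 3.0 * hist + rng.normal(size=n)

def ols(y, *cols):
    """OLS coefficients of y on a constant plus the given regressors."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_nolag = ols(crime, unem, expend)                 # omits the historical factor entirely
b_lag = ols(crime, unem, expend, crime_lag)        # lagged crime stands in for it

print("expend coefficient without lagged crime:", round(b_nolag[2], 3))
print("expend coefficient with lagged crime:   ", round(b_lag[2], 3))
```

In this setup the lagged value is only a noisy stand-in for the historical factor, so the expend coefficient moves toward the assumed true value of -1 without reaching it.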
18 IV. Properties of OLS under Measurement Error
Sometimes we have the variable we want, but we think it is measured with error.
Examples: how many hours did you work last year; how many weeks did you use child care when your child was young.
When we use an imprecise measure of a variable in our regression, the model contains measurement error (m.e.).
Consequences of m.e.:
The problem is similar in structure to omitted variable bias.
Often the variable with measurement error is the one we’re most interested in.
There are some conditions under which we still get unbiased results.
Measurement error in y is different from measurement error in x.
19 IV. Measurement Error in a Dependent Variable
Let y* denote the variable we’d like to explain, like annual savings.
Model: y* = b0 + b1x1 + … + bkxk + u
Most often, respondents are not perfect in their reporting; reported savings is denoted y.
Define measurement error as observed minus actual: e0 = y - y*
Thus, we are really estimating: y = b0 + b1x1 + … + bkxk + u + e0
20 IV. Measurement Error in a Dependent Variable
When will OLS produce unbiased results?
We have assumed u has zero mean and that xj and u are uncorrelated.
We need to assume that e0 also has zero mean (otherwise this just biases b0), but more importantly that e0 and xj are uncorrelated.
That is, the measurement error in y is statistically independent of each explanatory variable.
As a result, the estimates are unbiased.
Generally (with e0 and u uncorrelated) we find Var(u + e0) = σu^2 + σe0^2 > σu^2.
With m.e. in the LHS variable, we get larger variances for the OLS estimators.
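A short Monte Carlo sketch illustrates both claims: with measurement error in y that is independent of x, the slope estimate stays centered on the true value, but its sampling spread grows. The numbers used here (500 replications, n = 200, error standard deviations of 1 and 2) are arbitrary assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
reps, n = 500, 200
b0, b1 = 1.0, 2.0

def slope(x, y):
    """Simple-regression OLS slope, computed row by row (one row per replication)."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / (xc**2).sum(axis=1)

x = rng.normal(size=(reps, n))
u = rng.normal(size=(reps, n))
e0 = rng.normal(0, 2, size=(reps, n))   # measurement error in y, independent of x
y_star = b0 + b1 * x + u                # true (unobserved) dependent variable
y = y_star + e0                         # reported dependent variable

slopes_true = slope(x, y_star)
slopes_obs = slope(x, y)

print("mean slope using y*:", round(slopes_true.mean(), 3),
      " sd:", round(slopes_true.std(), 3))
print("mean slope using y: ", round(slopes_obs.mean(), 3),
      " sd:", round(slopes_obs.std(), 3))
```

Both averages should be close to the assumed b1 = 2, while the standard deviation across replications is larger when the mismeasured y is used.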
21 IV. Measurement Error in a Dependent Variable
Savings function:
sav* = b0 + b1inc + b2size + b3educ + b4age + u
e0 = sav - sav*
Is the m.e. correlated with the RHS variables?
We may think families with higher incomes or more education are more likely to report savings accurately.
We never know if that’s true, so we assume there is no systematic relationship: i.e. the wealthy or more educated are just as likely to misreport as the non-wealthy or less educated.
Scrap rates:
log(scrap*) = b0 + b1grant + u
The error is assumed to be multiplicative: y = (y*)·a0, where e0 = log(a0).
log(scrap) = log(scrap*) + e0
log(scrap) = b0 + b1grant + u + e0
It’s possible that measurement error is more likely at firms that receive a grant: they may underreport the scrap rate to make the grant look more effective, so as to get more in the future.
We can’t verify whether this is true, so we assume no relationship: i.e. the measurement error is not correlated with grant.
22 IV. Measurement Error in an Explanatory Variable
It is more complicated when measurement error occurs in the explanatory variable(s).
Model: y = b0 + b1x1* + u
x1* is not observed; instead we only observe x1.
Define the m.e. as e1 = observed - actual = x1 - x1*.
Assume:
E(e1) = 0 (not a strong assumption);
E(y | x1*, x1) = E(y | x1*), which means x1 doesn’t affect y after controlling for x1*, so u is uncorrelated with x1 and x1*. This is similar to the proxy variable assumption.
Now we are estimating: y = b0 + b1x1 + (u - b1e1)
23 IV. Measurement Error in an Explanatory Variable
What kind of results will OLS give us?
It depends on our assumption about the correlation between e1 and x1.
Suppose Cov(x1, e1) = 0:
OLS remains unbiased.
Variances are larger (since Var(u - b1e1) = σu^2 + b1^2·σe1^2).
The assumption that Cov(x1, e1) = 0 is analogous to the proxy variable assumption.
24 IV. Measurement Error in an Explanatory Variable
What if that’s not the case?
Suppose only that Cov(x1*, e1) = 0.
This is called the classical errors-in-variables (CEV) assumption.
It is a more realistic assumption than Cov(x1, e1) = 0.
This means:
Cov(x1, e1) = E(x1e1) - E(x1)E(e1) = E[(x1* + e1)e1] = E(x1*e1) + E(e1^2) = 0 + σe1^2 ≠ 0.
So x1 is correlated with the error term u - b1e1, and the OLS estimate is biased and inconsistent.
25 IV. Measurement Error in an Explanatory Variable
Under classical errors-in-variables in the simple regression, plim(b1-hat) = b1·Var(x1*)/Var(x1), where Var(x1) = Var(x1*) + Var(e1).
Notice that the multiplicative portion Var(x1*)/Var(x1) < 1.
This means the estimate is biased toward zero, which is called attenuation bias.
This is true regardless of whether b1 is (+) or (-).
A larger Var(x1*)/Var(x1) suggests the inconsistency of OLS is small, because the variation in the “noise” (a.k.a. m.e.) is small relative to the variation in the true value.
It’s more complicated in a multiple regression, but we can still expect attenuation bias when we assume classical errors-in-variables.
Economics 20 - Prof. Anderson
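Attenuation bias is easy to see in a simulation. The sketch below assumes classical errors-in-variables with hypothetical values: the true regressor and the measurement error both have variance 1, so Var(x1*)/Var(x1) is about 0.5 and the slope on the mismeasured regressor should be roughly half the true b1.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
b0, b1 = 1.0, 2.0

x_star = rng.normal(0, 1, size=n)     # true regressor (unobserved in practice)
e1 = rng.normal(0, 1, size=n)         # classical m.e.: independent of x_star and u
u = rng.normal(0, 1, size=n)
x = x_star + e1                       # observed, mismeasured regressor
y = b0 + b1 * x_star + u

def slope(x, y):
    """Simple-regression OLS slope: sample Cov(x, y) / sample Var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_true = slope(x_star, y)         # infeasible regression on the true regressor
slope_obs = slope(x, y)               # feasible regression on the noisy regressor
ratio = np.var(x_star, ddof=1) / np.var(x, ddof=1)

print("slope using x1*:            ", round(slope_true, 3))
print("slope using x1:             ", round(slope_obs, 3))
print("b1 * Var(x1*)/Var(x1):      ", round(b1 * ratio, 3))
```

Raising the variance of e1 in this sketch shrinks the ratio and pulls the estimated slope further toward zero.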
26 IV. Measurement Error in an Explanatory Variable
y = b0 + b1x1* + b2x2 + b3x3 + u
Assume u is uncorrelated with x1*, x1, x2, x3.
If we assume e1 is uncorrelated with x1, x2, x3, then we get y = b0 + b1x1 + b2x2 + b3x3 + u - b1e1 and consistent estimates.
But if e1 is uncorrelated with x2, x3 but not necessarily with x1, we get attenuation bias in the estimate of b1.
If x1* is uncorrelated with x2, x3, we get consistent estimates of b2, b3.
If this doesn’t hold, then the other estimates will be inconsistent (size and direction are indeterminate).
27 IV. Measurement Error in an Explanatory Variable
Ex: GPA with measurement error
colGPA = b0 + b1faminc* + b2hsGPA + b3SAT + b4smoke + u
faminc* is actual annual family income; faminc = faminc* + e1.
Assuming CEV holds, we get an OLS estimator of b1 that is attenuated (biased toward zero).
colGPA = b0 + b1faminc + b2hsGPA + b3SAT + b4smoke* + u
smoke = smoke* + e1
CEV is unlikely to hold, because those who don’t smoke are really unlikely to misreport. Those who do smoke can misreport, so that the error and the actual number of times smoked (smoke*) are correlated.
Deriving the implications of measurement error when CEV doesn’t hold is difficult and beyond the scope of the text.
28 V. Missing Data, Nonrandom Samples, Outlying Observations
Introduction to data problems that can violate MLR.2 of the G-M assumptions.
There are cases when data problems have no effect on OLS estimates.
There are other cases when we get biased estimates.
Missing data:
We generally collect data from a random sample of observations (people, schools, firms).
We then discover that information on key variables is missing for some of these observations.
29 V. Missing Data – Is it a Problem?
Consequences:
If any observation is missing data on one of the variables in the model, it can’t be used.
Data missing at random:
If data are missing at random, using a sample restricted to observations with no missing values will be fine.
This simply reduces the sample size, thus reducing the precision of the estimates.
30 V. Missing Data – Is it a Problem?
Data not missing at random:
A problem can arise if the data are missing systematically.
High-income individuals refuse to provide income data.
Low-education people generally don’t report education.
People with high IQ are more likely to report IQ.
When missing data does not lead to bias:
The sample is chosen on the basis of the independent variables.
Ex: savings, income, age, size for the population of people 35 years and older.
No bias, because E(savings | income, age, size) is the same for any subset of the population described by income, age, size.
31 V. Nonrandom Samples
When missing data leads to bias:
If the sample is chosen on the basis of the y variable, then we have sample selection bias.
Ex: estimating wealth based on education, experience, and age.
Only those with wealth below 250k are included.
OLS gives biased estimates because E(wealth | educ, exper, age) is not the same as the expected value conditional on wealth being less than 250k.
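The contrast between selecting on an independent variable and selecting on y can be checked with simulated data. The setup below is hypothetical: wealth (in thousands) depends on educ alone with an assumed slope of 15, one subsample keeps only educ of at least 12 (selection on x), and the other keeps only wealth below 250 (selection on y, as in the slide's example).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
true_slope = 15.0

educ = rng.uniform(8, 20, size=n)
wealth = 50 + true_slope * educ + rng.normal(0, 60, size=n)   # in thousands

def slope(x, y):
    """Simple-regression OLS slope: sample Cov(x, y) / sample Var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

keep_x = educ >= 12          # sample chosen on an explanatory variable
keep_y = wealth < 250        # sample chosen on the dependent variable

slope_full = slope(educ, wealth)
slope_sel_x = slope(educ[keep_x], wealth[keep_x])
slope_sel_y = slope(educ[keep_y], wealth[keep_y])

print("slope, full sample:           ", round(slope_full, 2))
print("slope, selected on educ >= 12:", round(slope_sel_x, 2))
print("slope, selected on wealth<250:", round(slope_sel_y, 2))
```

Selecting on educ only costs precision, while truncating on wealth pulls the estimated slope well below the assumed true value.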
32 V. Outliers / Influential Observations
Sometimes an individual observation can be very different from the others.
An observation is “influential” for the estimates if dropping it from the analysis changes the key OLS estimates by a lot.
This is particularly important with small data sets.
OLS is susceptible to outliers because, by definition, it minimizes the sum of squared residuals, and an outlier will have a “large” residual.
Causes of outliers:
errors in data entry (one reason why looking at summary statistics is important);
sometimes the observation will just truly be very different from the others.
33 V. Outliers / Influential Observations
Example: R&D Intensity & Firm Size
When the outlying firm is dropped, the coefficient on sales more than triples and is now statistically significant.
34 V. Outliers
Not unreasonable to fix observations where it’s clear there was just an extra zero entered or left off, etc.
Not unreasonable to drop observations that appear to be extreme outliers, although readers may prefer to see estimates with and without the outliers.
Can use Stata to investigate outliers graphically.
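The slides point to Stata for this; as a language-neutral illustration, here is a minimal Python sketch of a leave-one-out check for influential observations. The data are simulated and entirely hypothetical: 30 well-behaved points plus one data-entry error in which a y value of 6 was recorded as 60. For each observation it records how much the slope changes when that observation is dropped.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30

x = rng.uniform(0, 10, size=n)
y = 1 + 0.5 * x + rng.normal(0, 1, size=n)

# Append one hypothetical data-entry error: y should be about 6 but is typed as 60.
x_all = np.append(x, 10.0)
y_all = np.append(y, 60.0)

def slope(x, y):
    """Simple-regression OLS slope: sample Cov(x, y) / sample Var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_all = slope(x_all, y_all)

# Change in the slope from dropping each observation in turn.
changes = np.array([
    slope(np.delete(x_all, i), np.delete(y_all, i)) - slope_all
    for i in range(len(x_all))
])
most_influential = int(np.argmax(np.abs(changes)))

print("slope with all observations:   ", round(slope_all, 3))
print("slope without the last point:  ", round(slope_all + changes[-1], 3))
print("index of most influential obs.:", most_influential)
```

Reporting the estimates both with and without the flagged observation, as the slide suggests, lets readers judge how much the conclusion depends on it.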