Advanced Quantitative Techniques

Lab 9: Regression diagnostics II, Nov 10th 2016

Today
Recap of assumptions of OLS:
    Errors are normally distributed around the mean.
    Homoskedasticity (errors should not get larger as X gets larger).
    Errors are independent from each other (no autocorrelation).
Diagnostics: homoskedasticity, multicollinearity, linearity

OLS requirements (assumptions) and their diagnostic tests
1. Errors are normally distributed around the mean.
    Diagnostics: plot and identify studentized residuals, leverage, Cook’s D, DFITS.
2. Homoskedasticity (errors should not get larger as X gets larger).
    Diagnostic: rvfplot.
3. Errors are independent from each other (no autocorrelation).
    Diagnostic: Durbin–Watson test (not now!).
4. Variables are interval-level numbers (most important for the dependent variable).
    Use a logit or probit regression for 0/1 dependent variables (or other models, such as Poisson for counts or ordered logit for ranked outcomes; not covered here).
5. Linearity.
    Use another form of regression if the relationship with the dependent variable is non-linear. Some dependent variables (like income) make more sense as ln(Y).

Prep dataset
Download from the IADB website. Open in Excel. Format in pivot tables.
    gen walktrans = estimatedpercentageofcommuterswh + estimatedpercentageofjourneytowo
BUT... this only creates a sum for observations that have BOTH data points. Use this instead:
    egen greencommute = rowtotal(estimatedpercentageofcommuterswh estimatedpercentageofjourneytowo)
It assumes zero for missing values. Not perfect, but better for what we need.
    replace greencommute = . if greencommute == 0
This makes sure that we don’t count cities with no data as zero % of ‘green’ commuters!
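The difference between adding with + and using rowtotal() can be seen on a three-row toy dataset. This is only a sketch: the variable names walk and transit and the values are made up for illustration and are not the IADB variables. The missing option of rowtotal() is shown as one possible alternative to recoding zeros afterwards.

```stata
* Toy data: row 1 has both values, row 2 has one, row 3 has none
clear
input walk transit
10 20
5 .
. .
end

* Plain addition: the result is missing unless BOTH values are present
gen sum_plus = walk + transit

* rowtotal() treats missing values as zero, so row 3 becomes 0
egen sum_row = rowtotal(walk transit)

* With the missing option, a row is set to missing only when ALL inputs are missing
egen sum_miss = rowtotal(walk transit), missing

list
```

With the missing option, a city that truly reports 0% stays at zero, while a city with no data at all stays missing.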

REGRESS
Does the level of ‘green’ commuters relate to density, service distribution (schools), safety, centrality (proximity to transit stop), and average travel time?
    reg greencommute maximumallowabledensityinnewhous whatistheaveragetraveltimeinminu theshareoftheareaofthecityinneig theestimatedpercentageofthecityw theaveragetimeofthejourneytowork

Predicted values and residuals
    predict pr
    list pr greencommute in 1/10
    predict res, residual
    list res in 1/10

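Residual-versus-predictor plot
rvpplot takes one of the predictors in the model (not the dependent variable), for example:
    rvpplot theaveragetimeofthejourneytowork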

Studentized Residuals
Studentized residuals are a type of standardized residual that can be used to identify outliers.
    predict r, rstudent
    sort r
    list id r in 1/10
    list id r in -10/l
    list r id bwt age lwt smoke ht ui ftv black if abs(r) > 2
    display 189*0.05
As a rough guide, expect about 5% of N observations with |r| > 2 and about 1% of N with |r| > 3.
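A minimal sketch of the same check on a dataset that ships with Stata. The auto data and the model price on mpg and weight are assumptions chosen only so the example runs anywhere; they are not the lab's data.

```stata
sysuse auto, clear
regress price mpg weight

* Studentized residuals from the model just fitted
predict r, rstudent

* How many observations fall outside +/- 2?
count if abs(r) > 2 & !missing(r)

* Compare with the roughly 5% of N we would expect by chance
display "flagged: " r(N) "   5% of N: " e(N)*0.05
```

The !missing(r) condition matters because Stata treats missing values as larger than any number, so abs(r) > 2 alone would also pick up observations with no residual.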

Leverage
Leverage is a measure of how far an observation's predictor values deviate from the means of those predictors.
    predict lev, leverage
Generally, a point with leverage greater than (2k+2)/n should be carefully examined, where k = number of predictors (in our example 7) and n = number of observations (in our example 189).
    display (2*7+2)/189
    list bwt age lwt smoke ht ui ftv black id lev if lev > (2*7+2)/189
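The (2k+2)/n cut-off can also be built from the stored regression results rather than typed by hand. The sketch below assumes the auto dataset that ships with Stata and a two-predictor model; the scalar name cutoff is hypothetical.

```stata
sysuse auto, clear
regress price mpg weight

* Leverage (diagonal of the hat matrix) for each observation
predict lev, leverage

* e(df_m) holds k (number of predictors) and e(N) holds n
scalar cutoff = (2*e(df_m) + 2)/e(N)
display "leverage cut-off: " cutoff

* Observations that deserve a closer look
list make lev if lev > cutoff & !missing(lev)
```

Using e(df_m) and e(N) keeps the cut-off in step with the model if predictors are added or observations are dropped.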

Cook’s D
An overall measure that combines information on the residual and the leverage. The lowest value that Cook's D can assume is zero, and the higher the Cook's D is, the more influential the point. The conventional cut-off point for undue influence from a single observation, as measured through Cook’s D, is 4/n.
    predict d, cooksd
    list id bwt age lwt smoke ht ui ftv black d if d>4/189

DFITS
DFITS is similar to Cook’s D except that it is scaled differently; the two give similar answers. DFITS can be either positive or negative, with numbers close to zero corresponding to points with small or zero influence. The cut-off point for DFITS is 2*sqrt(k/n).
    predict dfit, dfits
    list id bwt age lwt smoke ht ui ftv black dfit if abs(dfit)>2*sqrt(7/189)
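To see how far the two influence measures agree, both flags can be computed side by side. This is a sketch on the auto dataset that ships with Stata (an assumption, not the lab's data); the names cut_d, cut_dfits, flag_d and flag_dfits are hypothetical.

```stata
sysuse auto, clear
regress price mpg weight

predict d, cooksd
predict dfit, dfits

* Conventional cut-offs: 4/n for Cook's D, 2*sqrt(k/n) for DFITS
scalar cut_d = 4/e(N)
scalar cut_dfits = 2*sqrt(e(df_m)/e(N))

* 0/1 indicators for each rule
gen flag_d = d > cut_d if !missing(d)
gen flag_dfits = abs(dfit) > cut_dfits if !missing(dfit)

* Cross-tabulate the two rules and list the flagged points
tabulate flag_d flag_dfits
list make d dfit if flag_d == 1 | flag_dfits == 1
```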

We find that id = 226 is an observation that has both a large residual and large leverage. Such points are potentially the most influential. Re-run the regression without it:
    regress bwt age lwt smoke ht ui ftv black if id!=226

Diagnostics 3: Checking Homoscedasticity of Residuals
A commonly used graphical method is to plot the residuals versus the predicted (fitted) values. If the model is well-fitted, there should be no pattern to the residuals plotted against the fitted values. If the variance of the residuals is non-constant, then the residual variance is said to be "heteroscedastic". We do this with the rvfplot command; the yline(0) option puts a reference line at y=0.
    rvfplot, yline(0)
We see that the pattern of the data points gets a little narrower towards the right end, which is an indication of heteroscedasticity. In our case the narrowing in the error bandwidth is minor.
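A minimal sketch of the plot on the auto dataset that ships with Stata (an assumption, not the lab's data). The estat hettest line is an addition that the lab does not use: it runs the Breusch-Pagan / Cook-Weisberg test as a formal complement to reading the plot by eye.

```stata
sysuse auto, clear
regress price mpg weight

* Residuals against fitted values, with a reference line at zero
rvfplot, yline(0)

* Formal test; the null hypothesis is constant error variance
estat hettest
```

A small p-value from estat hettest points the same way as a fan or funnel shape in the plot.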

Diagnostics 4: Checking for Multicollinearity
Multicollinearity will arise if we have put in too many variables that measure the same thing. Rule of thumb: the VIF (variance inflation factor) of each predictor should be below 10.
    estat vif
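The VIF of a predictor is 1/(1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below, on the auto dataset that ships with Stata (an assumption), reproduces one VIF by hand so the number reported by estat vif is less of a black box.

```stata
sysuse auto, clear
regress price mpg weight length

* VIF for every predictor in the model
estat vif

* The same quantity by hand for weight:
* regress it on the remaining predictors and use that R-squared
quietly regress weight mpg length
display "VIF(weight) = " 1/(1 - e(r2))
```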

Diagnostics 5: Checking Linearity
We will try to illustrate some of the techniques that you can use.
Bivariate regression: overlay a linear fit and a lowess smoother on the scatterplot.
    twoway (scatter bwt lwt) (lfit bwt lwt) (lowess bwt lwt)

Diagnostics 5: Checking Linearity
Multiple regression: the most straightforward thing to do is to plot the residuals against each of the predictor variables in the regression model. If there is a clear nonlinear pattern, there is a problem of nonlinearity. Otherwise, we should see for each of the plots just a random scatter of points.
    scatter res age
    scatter res lwt
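A lowess smoother laid over each residual plot can make a nonlinear pattern easier to spot than the raw scatter. This is a sketch on the auto dataset that ships with Stata (an assumption, not the lab's data).

```stata
sysuse auto, clear
regress price mpg weight
predict res, residuals

* Residuals against each predictor, with a lowess curve and a zero line.
* A curve that stays close to zero suggests no obvious nonlinearity.
twoway (scatter res mpg) (lowess res mpg), yline(0)
twoway (scatter res weight) (lowess res weight), yline(0)
```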

Diagnostics 6: autocorrelation
Durbin-Watson test
The D-W statistic is close to 2.0 if there is no autocorrelation, approaches 0 if there is perfect positive autocorrelation, and approaches 4.0 if there is perfect negative autocorrelation.
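In Stata the statistic is available through estat dwatson once the data have been declared as a time series. The sketch below uses simulated data with independent errors, so the statistic should land near 2; the variable names, seed, and coefficients are all made up for illustration.

```stata
* Simulated series with independent errors
clear
set obs 100
set seed 12345
gen t = _n
gen x = rnormal()
gen y = 1 + 2*x + rnormal()

* Declare the time variable, fit the model, then run the test
tsset t
regress y x
estat dwatson
```

The test needs an ordering of the observations, which is why tsset comes first; it is not meaningful for cross-sectional data with no natural order.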

Evaluate importance of independent variables
Standardized betas
Convert change into SD units.
    regress y x1 x2 x3, beta
F statistics
Compare the F statistic of the full model with the F statistic of the model with or without the variable of interest.
Partitioning variance
    nestreg: regress y x1 x2 x3
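Both approaches can be tried on the auto dataset that ships with Stata (an assumption, not the lab's data). In this sketch the nestreg blocks are written in parentheses, so weight enters first and mpg second; the order of the blocks is a choice made for illustration.

```stata
sysuse auto, clear

* Standardized coefficients: effect of a one-SD change in each predictor,
* measured in SDs of the dependent variable
regress price mpg weight, beta

* Nested regressions: add blocks one at a time and report the
* block F test and the change in R-squared at each step
nestreg: regress price (weight) (mpg)
```

Because the R-squared change credited to a block depends on what entered before it, it is worth re-running nestreg with the blocks in a different order before drawing conclusions about importance.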
