# Autocorrelation in Regression Analysis

Autocorrelation in Regression Analysis
- Tests for Autocorrelation
- Examples
- Durbin-Watson Tests
- Modeling Autoregressive Relationships

What causes autocorrelation?
- Misspecification
- Data manipulation
  - Before receipt
  - After receipt
- Event inertia
- Spatial ordering

Checking for Autocorrelation
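Test: Durbin-Watson statistic. The d-statistic runs from 0 to 4 and falls into one of five zones:

| Range of d | Zone | Interpretation |
|---|---|---|
| 0 to d-lower | Positive autocorrelation | Autocorrelation is clearly evident |
| d-lower to d-upper | Zone of indecision | Ambiguous; cannot rule out autocorrelation |
| d-upper to 4 − d-upper | No autocorrelation | Autocorrelation is not evident |
| 4 − d-upper to 4 − d-lower | Zone of indecision | Ambiguous; cannot rule out autocorrelation |
| 4 − d-lower to 4 | Negative autocorrelation | Autocorrelation is clearly evident |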
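For readers who want to see the arithmetic behind the zones, here is a minimal Python sketch (the lecture itself uses Stata). It computes d as the sum of squared first differences of the residuals divided by the sum of squared residuals, then places d on the 0-to-4 scale. The function names and the residual values are made up for illustration; the bounds must come from a Durbin-Watson table, and the ones used here are simply the values quoted in this lecture.

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum of squared first differences / sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def dw_zone(d, d_lower, d_upper):
    """Place d in one of the five zones of the 0-to-4 scale."""
    if d < d_lower:
        return "positive autocorrelation"
    if d <= d_upper:
        return "indecision"
    if d < 4 - d_upper:
        return "no autocorrelation"
    if d <= 4 - d_lower:
        return "indecision"
    return "negative autocorrelation"

# Made-up residuals that drift slowly from positive to negative,
# the pattern that positive autocorrelation produces.
e = np.array([1.0, 0.8, 0.9, 0.5, 0.2, -0.1, -0.4, -0.6, -0.9, -0.7])
d = durbin_watson(e)
print(round(d, 3), dw_zone(d, d_lower=1.643, d_upper=1.704))
```

Residuals that flip sign every period push d toward 4 instead, which lands in the negative-autocorrelation zone.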

Consider the following regression:
[Stata regression output: price regressed on ice and quantity, with a constant; F(2, 325). The numeric values were not preserved in this transcript.]

Because these are time-series data, we should consider the possibility of autocorrelation. To run the Durbin-Watson test, first declare the data as time series with the tsset command, then run the dwstat command.

Durbin-Watson d-statistic(3, 328) = [value not preserved]

Find the D-upper and D-lower
Check a Durbin-Watson table for the values of d-upper and d-lower. For n=20 and k=2, α = .05 the values are:

- Lower = 1.643
- Upper = 1.704

[Stata output: Durbin's alternative test for autocorrelation, lags(p) = 1, H0: no serial correlation. The chi2 statistic, degrees of freedom, and p-value were not preserved in this transcript.]

Alternatives to the d-statistic
- The d-statistic is not valid in models with a lagged dependent variable.
- In the case of a lagged LHS variable you must use the Durbin-a test (the command is durbina in Stata).
- Also, the d-statistic tests only for first-order autocorrelation. For other orders you may use the Durbin-a test.
- Why would you suspect other than first-order autocorrelation?

The Runs Test
An alternative to the D-W test is a formalized examination of the signs of the residuals. We would expect the signs of the residuals to be random in the absence of autocorrelation. The first step is to estimate the model and predict the residuals.

Runs continued
Next, order the signs of the residuals against time (or spatial ordering in the case of cross-sectional data) and see if there are excessive “runs” of positives or negatives. Alternatively, you can graph the residuals and look for the same trends.

Runs test continued
The final step is to use the expected mean and standard deviation of the number of runs in a standard significance test. Stata does this automatically with the runtest command!
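The three steps of the runs test can be sketched in Python as follows. This is an illustration under stated assumptions, not Stata's implementation: it uses the usual large-sample (Wald-Wolfowitz) formulas for the expected number of runs and its variance, drops residuals that are exactly zero, and assumes both signs occur at least once. The function name and the example data are made up.

```python
import math
import numpy as np

def runs_test(residuals):
    """Return (runs, expected_runs, z) for the signs of the residuals."""
    e = np.asarray(residuals, dtype=float)
    signs = e[e != 0] > 0                 # True for positive, False for negative
    n1 = int(np.sum(signs))               # number of positive residuals
    n2 = int(signs.size - n1)             # number of negative residuals
    n = n1 + n2
    # A new run starts every time the sign changes.
    runs = 1 + int(np.sum(signs[1:] != signs[:-1]))
    expected = 2.0 * n1 * n2 / n + 1.0
    variance = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - expected) / math.sqrt(variance)
    return runs, expected, z

# Made-up residuals: five positives followed by five negatives (only 2 runs).
runs, expected, z = runs_test([1, 2, 1, 3, 1, -1, -2, -1, -3, -1])
print(runs, expected, round(z, 2))
```

Too few runs (a large negative z) points to positive autocorrelation; too many runs points to negative autocorrelation.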

Visual diagnosis of autocorrelation (in a single series)
A correlogram is a good tool for identifying whether a series is autocorrelated.
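The numbers plotted in a correlogram are the sample autocorrelations at each lag. A minimal Python sketch of that calculation is below; the function name, the simulated series, and the rough ±1.96/√n band (valid only under a white-noise assumption) are illustrative choices, not taken from the slides.

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelations r_1 ... r_max_lag of a single series."""
    y = np.asarray(series, dtype=float)
    dev = y - y.mean()
    denom = np.sum(dev ** 2)
    return np.array([np.sum(dev[k:] * dev[:-k]) / denom
                     for k in range(1, max_lag + 1)])

# Simulated AR(1) series with rho = 0.8 (made-up data for illustration).
rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.8 * y[t - 1] + rng.normal()

acf = sample_acf(y, max_lag=10)
band = 1.96 / np.sqrt(len(y))   # rough band, assuming white noise under H0
for lag, r in enumerate(acf, start=1):
    flag = "*" if abs(r) > band else ""
    print(f"lag {lag:2d}: {r: .3f} {flag}")
```

An autocorrelated series shows spikes outside the band that die away slowly as the lag grows.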

Dealing with autocorrelation
- D-W is not appropriate for autoregressive (AR) models, where Yi depends on its own lagged value Yi-1. In this case, we use the Durbin alternative test.
- For AR models, we need to explicitly estimate the correlation between Yi and Yi-1 as a model parameter.
- Techniques:
  - AR1 models (closest to regression; first order only)
  - ARIMA (any order)

Dealing with Autocorrelation
There are several approaches to resolving problems of autocorrelation:

- Lagged dependent variables
- Differencing the dependent variable
- GLS
- ARIMA

Lagged dependent variables
The most common solution.

- Simply create a new variable that equals Y at t-1, and use it as a RHS variable.
- To do this in Stata, use the generate command with the new variable equal to L.variable: `gen lagy = L.y` for the first lag and `gen laglagy = L2.y` for the second.
- This correction should be based on a theoretical belief about the specification.
- It may cause more problems than it solves.
- It also costs a degree of freedom (a lost observation).
- There are several advanced techniques for dealing with this as well.
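To make the lost observation concrete, here is a small Python sketch of the same idea as `L.y`: shift the series by one period, then regress Y on its lag and a covariate. Everything in it is an assumption for illustration: the helper names `lag` and `ols`, the simulated data, and the true coefficients (1.0, 0.5, 2.0) are not from the lecture.

```python
import numpy as np

def lag(series, k=1):
    """Shift a series back k periods; the first k entries become NaN."""
    y = np.asarray(series, dtype=float)
    out = np.full(y.shape, np.nan)
    out[k:] = y[:-k]
    return out

def ols(y, *regressors):
    """OLS with an intercept; rows containing NaN are dropped first."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), *regressors])
    keep = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return coef, int(keep.sum())

# Simulated data: y_t = 1.0 + 0.5 * y_{t-1} + 2.0 * x_t + noise.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 1.0 + 0.5 * y[t - 1] + 2.0 * x[t] + rng.normal()

coef, used = ols(y, lag(y), x)
print("intercept, lagged y, x:", np.round(coef, 2))
print("observations used:", used, "of", n)
```

The regression runs on one fewer observation than the original series because the first period has no lagged value.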

Differencing
Differencing is simply the act of subtracting the previous observation's value from the current observation.

- To do this in Stata, again use the generate command, with a capital D (instead of the L for lags).
- This process is effective; however, it is an EXPENSIVE correction.
- This technique “throws away” long-term trends.
- It assumes that rho = 1 exactly.
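One way to see why differencing amounts to assuming rho = 1 is the following sketch. It assumes a single regressor $x_t$ and first-order autocorrelated errors; the symbols are illustrative and not taken from the slides. Subtracting $\rho$ times the previous period's equation from the current one gives

$$
y_t - \rho\, y_{t-1} = \beta_0 (1 - \rho) + \beta_1 (x_t - \rho\, x_{t-1}) + \varepsilon_t .
$$

Setting $\rho = 1$ turns both sides into first differences and removes the intercept:

$$
\Delta y_t = y_t - y_{t-1} = \beta_1\, \Delta x_t + \varepsilon_t .
$$

The level information carried by $\beta_0$ disappears, which is the sense in which the long-term trend is thrown away.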

GLS and ARIMA

- GLS approaches use maximum likelihood to estimate rho and correct the model.
- These are good corrections, and can be replicated in OLS.
- ARIMA is an acronym for Autoregressive Integrated Moving Average.
- This process is a univariate “filter” used to cleanse variables of a variety of pathologies before analysis.

Corrections based on Rho
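There are several ways to estimate rho, the simplest being to calculate it from the residuals. We then estimate the regression by transforming the dependent variable and the regressors so that:

$$y^*_t = y_t - \hat{\rho}\, y_{t-1}$$

and

$$x^*_t = x_t - \hat{\rho}\, x_{t-1}$$

This gives the regression:

$$y^*_t = \beta_0 (1 - \hat{\rho}) + \beta_1 x^*_t + \varepsilon_t$$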
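The rho-based correction can be sketched in a few lines of Python. This is a single-pass illustration, assuming one regressor and first-order autocorrelation: rho is taken as the slope of the OLS residuals on their own lag, y and x are each replaced by their value minus rho times their previous value, and the regression is rerun on the transformed data. The function names and the simulated data are hypothetical.

```python
import numpy as np

def fit_ols(y, x):
    """Intercept and slope from OLS of y on a single regressor x."""
    X = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def rho_from_residuals(e):
    """Slope of e_t on e_{t-1} (no intercept)."""
    return float(np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2))

def quasi_difference_fit(y, x):
    """One pass: estimate rho from OLS residuals, transform, re-estimate."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    b0, b1 = fit_ols(y, x)
    rho = rho_from_residuals(y - b0 - b1 * x)
    y_star = y[1:] - rho * y[:-1]          # first observation is lost
    x_star = x[1:] - rho * x[:-1]
    a0, a1 = fit_ols(y_star, x_star)
    # The transformed intercept estimates b0 * (1 - rho); rescale it.
    return rho, a0 / (1.0 - rho), a1

# Simulated data: y = 3 + 2x + u, with u following an AR(1) with rho = 0.7.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 3.0 + 2.0 * x + u

rho, intercept, slope = quasi_difference_fit(y, x)
print(round(rho, 2), round(intercept, 2), round(slope, 2))
```

Repeating the estimate-transform-refit cycle until rho stops changing gives an iterated version of the same idea.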

High tech solutions

- Stata also offers the option of estimating the model with the AR (with multiple ways of estimating rho).
- There is also what is known as a Prais-Winsten regression, which generates values for the lost observation.
- For the truly adventurous, there is also the option of doing a full ARIMA model.
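As a brief sketch of how Prais-Winsten keeps the first observation, assuming a single regressor and an estimated first-order $\hat{\rho}$ (the notation is illustrative, not from the slides): rather than dropping period 1, it rescales that observation,

$$
y^*_1 = \sqrt{1 - \hat{\rho}^{\,2}}\; y_1, \qquad x^*_1 = \sqrt{1 - \hat{\rho}^{\,2}}\; x_1 ,
$$

while later periods use the usual quasi-differences,

$$
y^*_t = y_t - \hat{\rho}\, y_{t-1}, \qquad x^*_t = x_t - \hat{\rho}\, x_{t-1}, \qquad t \ge 2 .
$$

The constant term is transformed the same way, so its column equals $\sqrt{1 - \hat{\rho}^{\,2}}$ in period 1 and $1 - \hat{\rho}$ afterwards.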

Prais-Winsten regression
[Stata output: Prais-Winsten AR(1) regression, iterated estimates; price regressed on ice and quantity, with a constant; F(2, 325). The output also reports rho and the Durbin-Watson statistic for the original and the transformed model. The numeric values were not preserved in this transcript.]

ARIMA
The ARIMA model allows us to test the hypothesis of autocorrelation and remove it from the data. This is an iterative process akin to the purging we did when creating the ystar variable.

The model
[Stata output: ARIMA regression of price with a constant and a first-order autoregressive term (ARMA, ar, L1.), with OPG standard errors; Wald chi2(1). Slide annotations mark the L1. coefficient as the estimate of rho and flag it as a significant lag. The numeric values were not preserved in this transcript.]

The residuals of the ARIMA model
There are a few significant lags a ways back. Generally we should expect some, but this mess is probably an indicator of a seasonal trend (well beyond the scope of this lecture)!

ARIMA with a covariate
[Stata output: ARIMA regression of price on ice and quantity, with a constant and a first-order autoregressive term (ARMA, ar, L1.), with OPG standard errors; Wald chi2(3). The numeric values were not preserved in this transcript.]

Final thoughts
Each correction has a “best” application.

- If we wanted to evaluate a mean shift (a dummy-variable-only model), calculating rho will not be a good choice. Then we would want to use the lagged dependent variable.
- Also, where we want to test the effect of inertia, it is probably better to use the lag.

Final Thoughts Continued
- In small N, calculating rho tends to be more accurate.
- ARIMA is one of the best options; however, it is very complicated!
- When dealing with time, the number of time periods and the spacing of the observations are VERY IMPORTANT!
- When using estimates of rho, a good rule of thumb is to make sure you have time points at a minimum. More if the observations are too close for the process you are observing!

Next Time: Review for Exam Exam Posting Plenary Session
Available after class Wednesday
