
REGRESSION MODEL ASSUMPTIONS

The Regression Model
We have hypothesized that: y = β0 + β1x + ε. So far we focused on the regression part: getting the best estimates for the β's. Here we focus on the error term, ε.

THE RANDOM VARIABLE, ε
The error term, ε, is a random variable that describes how the observed values, y_i, vary around the regression line. For any value of x, ε has a distribution with a mean and a standard deviation. At any x value x_i, the observed value of the error term is called its residual, given by: e_i = y_i - ŷ_i, where ŷ_i is the value predicted by the fitted regression line at x_i.

STEP 3: 4 ASSUMPTIONS ABOUT ε
The remainder of our discussion about linear regression assumes the following about ε:
(1) DISTRIBUTION: ε is distributed normally.
(2) MEAN: the errors average out to 0, i.e. E(ε), or μ_ε, = 0.
(3) STANDARD DEVIATION: σ is the same at all values of x.
(4) INDEPENDENCE: the errors are independent of each other.

What Do These Assumptions Imply About y?
y = β0 + β1x + ε. β0 + β1x is a constant for a given value of x. ε is normally distributed with mean 0 and standard deviation σ. Thus y is normally distributed with standard deviation σ and mean E(y):
E(y) = E(β0 + β1x + ε) = E(β0 + β1x) + E(ε) = β0 + β1x + 0 = β0 + β1x

BEST ESTIMATE FOR σ
The true value of σ is unknown. It can be estimated by s as follows: s = sqrt(SSE/(n-2)), where SSE is the sum of the squared residuals, (y_i - ŷ_i)².

Hand Calculation of SSE
i     x_i     y_i       ŷ_i          (y_i - ŷ_i)²
1     1200    101000    109567.57    73403214.02
2     800     92000     88540.54     11967859.75
3     1000    110000    99054.05     119813732.7
4     1300    120000    114824.32    26787618.7
5     700     90000     83283.78     45107560.26
6     800     82000     88540.54     42778670.56
7     1000    93000     99054.05     36651570.49
8     600     75000     78027.03     9162892.622
9     900     91000     93797.30     7824872.169
10    1100    105000    104310.81    474981.7385
SUM (= SSE)                          373972972.97

[Figure: regression output annotated with s and, in the Residual (error) row, SSE and SSE/(n-2) = s²]
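The hand calculation of SSE and s can be reproduced in a few lines of Python. This is only a sketch: the x and y values are the ten Dollar Only observations from the table, while the function name `fit_line` and the variable names are illustrative choices, not part of the original slides.

```python
import math

# Dollar Only observations (x_i, y_i) taken from the hand-calculation table
x = [1200, 800, 1000, 1300, 700, 800, 1000, 600, 900, 1100]
y = [101000, 92000, 110000, 120000, 90000, 82000, 93000, 75000, 91000, 105000]


def fit_line(x, y):
    """Return the least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]                  # fitted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # e_i = y_i - y_hat_i
sse = sum(e ** 2 for e in residuals)                # sum of squared residuals
s = math.sqrt(sse / (len(x) - 2))                   # estimate of sigma

print(f"SSE = {sse:.2f}, s = {s:.2f}")
```

The fitted values and SSE should agree with the table up to rounding; s is then the square root of SSE/(n-2) with n = 10.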

Checking the Assumptions
The assumptions are often simply taken for granted. We now show how to check them.

Residuals
The assumptions for ε can be checked using RESIDUAL ANALYSIS. A residual, e_i, is the observation of ε at an observed value of x, x_i. For example, in the Dollar Only example: y_1 = 101,000 when x_1 = 1200, and ŷ_1 = 109,567.57, so e_1 = 101,000 - 109,567.57 = -8,567.57.

Standardized Residuals
Is a residual of -8,567.57 large? It depends on the size of the standard error, s. Standardized residual = e_i / (standard error of e_i for x_i). Standardized residuals are easier to use to test the assumptions. There are two typical ways of calculating the standard error of e_i for a particular x_i value; both approaches yield substantially the same results.

Standardized Residuals in Excel
Excel uses its own formula for the standard error of e_i. This still gives approximately the same values as the other methods. We will use the ones generated by Excel to check the assumptions.
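The formulas for the standard error of e_i did not survive in this transcript, so the sketch below uses three common textbook choices, all of which are assumptions rather than the slides' own formulas: (1) dividing every residual by s, (2) dividing by s·sqrt(1 - h_i), where h_i = 1/n + (x_i - x̄)²/Σ(x_j - x̄)² is the leverage of observation i, and (3) dividing by the sample standard deviation of the residuals, sqrt(SSE/(n-1)), which is widely reported to be what Excel's Regression tool does. The data are the Dollar Only observations; all variable names are illustrative.

```python
import math

# Dollar Only observations (x_i, y_i)
x = [1200, 800, 1000, 1300, 700, 800, 1000, 600, 900, 1100]
y = [101000, 92000, 110000, 120000, 90000, 82000, 93000, 75000, 91000, 105000]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))

# Variant 1 (assumed): use s as the standard error of every residual
simple = [e / s for e in residuals]

# Variant 2 (assumed): adjust s for the leverage h_i of each observation
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
adjusted = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# Variant 3 (assumed Excel-style): divide by the sample standard deviation
# of the residuals, i.e. sqrt(SSE / (n - 1)), since the residuals average 0
excel_style = [e / math.sqrt(sse / (n - 1)) for e in residuals]

for i, (a, b, c) in enumerate(zip(simple, adjusted, excel_style), start=1):
    print(f"{i:2d}  {a:7.3f}  {b:7.3f}  {c:7.3f}")
```

The three columns differ only in the denominator, which is why they give similar, though not identical, values for each observation.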

Checking to See if Errors (Residuals) Appear to Come From a Normal Distribution
TWO WAYS TO CHECK
1. Construct a plot of standardized residuals and see if they look normal.
   - Could use Histogram from Data Analysis.
   - A "quick check": standardized residuals are like z-values. Check to see if about 68% are between ±1, 95% between ±2, and virtually all between ±3.
2. Look at a normal probability plot. These are statistical plots to check for "normality". A "perfect" normal distribution would be a straight line on such a plot.
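The "quick check" amounts to counting how many standardized residuals fall within ±1, ±2 and ±3. A minimal sketch follows; the function name `quick_normal_check` and the example values are hypothetical and are not taken from the Dollar Only data.

```python
def quick_normal_check(std_residuals):
    """Return the share of standardized residuals within +/-1, +/-2 and +/-3."""
    n = len(std_residuals)
    return {
        k: sum(1 for z in std_residuals if abs(z) <= k) / n
        for k in (1, 2, 3)
    }


# Hypothetical standardized residuals, for illustration only
example = [-1.4, -0.9, -0.5, -0.2, 0.0, 0.1, 0.4, 0.8, 1.1, 2.3]
shares = quick_normal_check(example)

for k, share in shares.items():
    print(f"within +/-{k}: {share:.0%}")
```

The shares are then compared with the roughly 68%, 95% and nearly 100% expected under normality. With only a handful of observations, exact agreement should not be expected.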

Checking to see if σ Is Constant
Look at the residual plot to see if the points seem more spread out at some x's than at others. In the Dollar Only example, it did not appear so on the Excel residual plot. Constant σ is called homoscedasticity! If the points had looked like the next page, with less variation for lower values of x than for higher values, the constant variation assumption would have been violated. This is called heteroscedasticity!

[Figure: residual plot of e against x. Heteroscedasticity: nonconstant variance]
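The slides check constant variance by eye on the residual plot. As a rough numeric complement (not a method from the slides, and not a formal test), one can compare the spread of the residuals for the lower half of the x values with the spread for the upper half. The function name `spread_by_half` and the data below are hypothetical.

```python
import statistics


def spread_by_half(x, residuals):
    """Return the residual standard deviation for the low-x and high-x halves."""
    pairs = sorted(zip(x, residuals))          # order observations by x
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[half:]]
    return statistics.pstdev(low), statistics.pstdev(high)


# Hypothetical residuals that fan out as x grows, for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.2, 0.2, -0.1, 1.0, -2.0, 2.5, -3.0]

low_spread, high_spread = spread_by_half(x, residuals)
print(f"low-x spread: {low_spread:.2f}, high-x spread: {high_spread:.2f}")
```

A much larger spread in one half than in the other points in the same direction as a fan-shaped residual plot; similar spreads are consistent with constant σ.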

Checking Independence
This is mainly for time series data (i.e. the x-axis is time) used in forecasting. But basically, if the data looks like the next slide, the errors are not independent.
- In this case, whether you have a positive or negative error (residual) depends on the x-value.
- This is called autocorrelation.

[Figure: plot of Y against X = time. Example of autocorrelation (errors are dependent on x)]
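One simple way to put a number on this pattern, offered here as an illustration rather than as the slides' method, is the lag-1 autocorrelation of the residuals taken in time order: values near 0 are consistent with independent errors, while values near +1 mean that neighbouring errors tend to share the same sign. The function name and both example series are hypothetical.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one just before it in time."""
    n = len(residuals)
    mean = sum(residuals) / n
    num = sum(
        (residuals[t] - mean) * (residuals[t - 1] - mean) for t in range(1, n)
    )
    den = sum((e - mean) ** 2 for e in residuals)
    return num / den


# Hypothetical residuals in time order, for illustration only
runs = [2, 3, 2, 1, -1, -2, -3, -2, 1, 2]           # long runs of one sign
alternating = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]   # sign flips every period

r_runs = lag1_autocorrelation(runs)
r_alternating = lag1_autocorrelation(alternating)
print(f"runs: {r_runs:.2f}, alternating: {r_alternating:.2f}")
```

The first series mimics the wave-like picture on the autocorrelation slide and gives a positive value; the second gives a strongly negative value, which is also a form of dependence.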

Residual Analysis in Excel
CHECK: Residuals, Standardized Residuals, Residual Plots, Normal Probability Plots

Standardized Residuals
70% are between ±1 and 100% are between ±2, which is "close" to the expected normal values. Residual values appear to average out to 0 everywhere. There is no discernible pattern for the errors.

Normal Probability Plot
The following is the normal probability plot generated by Excel. Again, Excel does it "slightly wrong", but it should give us a good idea. It looks close to a straight line, so the normality assumption appears valid.
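For comparison, the coordinates of a conventional normal probability plot can be computed directly: sort the values and pair each one with the standard normal quantile at its plotting position. The sketch below assumes the plotting positions (i - 0.5)/n, which is one common convention among several; the function name `normal_plot_points` and the sample values are hypothetical.

```python
from statistics import NormalDist


def normal_plot_points(values):
    """Pair each sorted value with its theoretical standard normal quantile."""
    n = len(values)
    ordered = sorted(values)
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Hypothetical standardized residuals, for illustration only
points = normal_plot_points([0.5, -1.0, 0.0, 1.2, -0.3])
for q, v in points:
    print(f"{q:7.3f}  {v:6.2f}")
```

Plotting the second column against the first should give roughly a straight line when the values come from a normal distribution.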

Review
4 assumptions about ε
1. ε is normal.
2. μ_ε = E(ε) = 0.
3. σ is the same for all values of x.
4. Errors are independent.
Checking the Assumptions
- Check the residual plot to see if variation changes for different values of x.
- Check the normality assumption by a normal probability plot or by creating a histogram of standardized residuals. Does it appear normal and centered around 0? Are about 68% between ±1, 95% between ±2, almost all between ±3?
