# Regression Diagnostics: Checking Assumptions and Data


Questions What is the linearity assumption? How can you tell if it seems met? What is homoscedasticity (heteroscedasticity)? How can you tell if it’s a problem? What is an outlier? What is leverage? What is a residual? How can you use residuals in assuring that the regression model is a good representation of the data? What is a studentized residual?

Linear Model Assumptions: linear relation between X and Y; independent errors; normally distributed errors (and hence normally distributed Y at each level of X); equal variance of errors, i.e. homoscedasticity (constant spread of error in Y across levels of X).

Good-Looking Graph No apparent departures from line.

Problem with Linearity

Problem with Heteroscedasticity: a common problem when Y is a dollar amount.

Outliers Outlier = pathological point

Residual Plots: histogram of residuals; residuals vs. fitted values; residuals vs. predictor variable; normal Q-Q plot; studentized or standardized residuals.

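Residuals. Standardized residual: the residual divided by the residual standard deviation, $z_i = e_i / s$, where $e_i$ is residual $i$ and $s$ is the standard deviation of the residuals. Look for large values (some say $|z_i| > 2$). Studentized residual: the studentized residual also considers the distance of the point from the mean of X, through the leverage $h_{ii}$: $r_i = e_i / (s\sqrt{1 - h_{ii}})$. The farther X is from the mean, the smaller the standard error of the residual and the larger the studentized residual. Look for large values.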
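The two kinds of residual can be compared on a small example. The following is a minimal sketch for simple regression (one predictor), assuming the leverage of point i is 1/n + (x_i - mean of x)^2 / Sxx; the function name `studentized_residuals` and the data are made up for illustration.

```python
import math

def studentized_residuals(x, y):
    """Fit y = b0 + b1*x by least squares and return three lists:
    raw residuals, standardized residuals (e_i / s), and
    studentized residuals (e_i / (s * sqrt(1 - h_ii)))."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    # Leverage in simple regression grows with the distance of x_i from the mean.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    standardized = [e / s for e in resid]
    studentized = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, leverage)]
    return resid, standardized, studentized

# Hypothetical data: y = 1 + 2x exactly, except the fifth point is shifted up by 9.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3, 5, 7, 9, 20, 13, 15, 17, 19, 21]
resid, standardized, studentized = studentized_residuals(x, y)

# Indices of points whose studentized residual exceeds 2 in absolute value.
flagged = [i for i, r in enumerate(studentized) if abs(r) > 2]
```

Because the divisor shrinks as leverage grows, each studentized residual is at least as large in absolute value as the corresponding standardized residual.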

Residual Plots

Abnormal Patterns in Residual Plots. Figures a), b): non-linearity. Figure c): autocorrelation. Figure d): heteroscedasticity.

Patterns of Outliers. a) Outlier is extreme in both X and Y but not in pattern; removal is unlikely to alter the regression line. b) Outlier is extreme in both X and Y as well as in the overall pattern; inclusion will strongly influence the regression line. c) Outlier is extreme for X, nearly average for Y. d) Outlier is extreme in Y, not in X. e) Outlier is extreme in pattern, but not in X or Y.

Influence Analysis. Leverage: $h_{ii}$ (see page 8). Leverage is an index of the importance of an observation to a regression analysis. It is a function of X only; observations with large deviations from the mean of X are potentially influential; its maximum is 1 and its minimum is 1/n; it is considered large if it exceeds 3p/n (p = number of predictors including the constant).
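A minimal sketch of the leverage check, assuming simple regression (one predictor plus the constant, so p = 2), where leverage reduces to 1/n + (x_i - mean of x)^2 / Sxx. The helper name `leverages` and the data are hypothetical.

```python
def leverages(x):
    """Leverage h_ii of each observation in a simple linear regression.
    Depends on the predictor values only, never on y."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical predictor values; the last one lies far from the rest.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
h = leverages(x)

p = 2                    # one predictor plus the constant
cutoff = 3 * p / len(x)  # the 3p/n rule of thumb quoted above
high_leverage = [i for i, hi in enumerate(h) if hi > cutoff]
```

The leverages always sum to p, so a cutoff of 3p/n flags points with more than three times the average leverage.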

Cook’s distance measures the influence of a data point on the regression equation, i.e. it measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage tend to have a large Cook’s distance. Cook’s D > 1 requires careful checking (such points are influential); D > 4 suggests potentially serious outliers.
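Cook's distance combines the residual and the leverage of a point into one number. The sketch below assumes simple regression and the usual closed form D_i = e_i^2 h_ii / (p s^2 (1 - h_ii)^2), which gives the same value as refitting without observation i; the function name `cooks_distance` and the data are made up for illustration.

```python
def cooks_distance(x, y):
    """Cook's distance for every observation of a simple linear regression."""
    n = len(x)
    p = 2  # slope plus the constant
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Large residual and/or high leverage both push D_i up.
    return [e * e * h / (p * s2 * (1 - h) ** 2) for e, h in zip(resid, lev)]

# Hypothetical data: y = 1 + 2x for the first nine points; the last point
# is far out in x and well below the line (the line would predict 61).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y = [3, 5, 7, 9, 11, 13, 15, 17, 19, 20]
d = cooks_distance(x, y)

needs_checking = [i for i, di in enumerate(d) if di > 1]
```

In this example only the last point exceeds the D > 1 threshold, even though its raw residual is not the largest, because its leverage is so high.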

Sensitivity in Inference. All tests and intervals are very sensitive to even minor departures from independence. All tests and intervals are sensitive to moderate departures from equal variance. The hypothesis tests and confidence intervals for $\beta_0$ and $\beta_1$ are fairly "robust" (that is, forgiving) against departures from normality. Prediction intervals are quite sensitive to departures from normality.

Remedies If important predictor variables are omitted, see whether adding the omitted predictors improves the model. If there are unequal error variances, try transforming the response and/or predictor variables or use "weighted least squares regression." If an outlier exists, try using a "robust estimation procedure." If error terms are not independent, try fitting a "time series model."
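As one illustration of the weighted least squares remedy, here is a minimal sketch for a single predictor. It assumes the error standard deviation is proportional to x, so each point is weighted by 1/x^2; that variance model, the function name `weighted_line_fit`, and the data are assumptions made for the example.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares for y = b0 + b1*x.
    Minimizes sum(w_i * (y_i - b0 - b1*x_i)**2); returns (b0, b1)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data whose scatter around the line widens as x grows.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5.1, 6.8, 9.4, 10.5, 13.9, 13.6, 18.6, 16.9]

# Assumed variance model: error SD proportional to x, so weight = 1 / x**2.
w = [1 / xi ** 2 for xi in x]
b0_wls, b1_wls = weighted_line_fit(x, y, w)

# Equal weights reproduce ordinary least squares.
b0_ols, b1_ols = weighted_line_fit(x, y, [1.0] * len(x))
```

Points with small assumed variance get more say in the fit, so noisy observations at large x pull the line less than they do under ordinary least squares.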

If the mean of the response is not a linear function of the predictors, try a different function. For example, polynomial regression involves transforming one or more predictor variables while remaining within the multiple linear regression framework. For another example, applying a logarithmic transformation to the response variable also allows for a nonlinear relationship between the response and the predictors.
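The logarithmic transformation mentioned above can be sketched as follows: fit a straight line to log(Y), then map predictions back with the exponential. The function name `fit_log_linear` and the data (an exact exponential curve) are hypothetical, chosen so the recovered coefficients are easy to check.

```python
import math

def fit_log_linear(x, y):
    """Least squares fit of log(y) = b0 + b1*x; returns (b0, b1).
    Requires every y value to be positive."""
    log_y = [math.log(yi) for yi in y]
    n = len(x)
    x_bar = sum(x) / n
    ly_bar = sum(log_y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (li - ly_bar) for xi, li in zip(x, log_y))
    b1 = sxy / sxx
    b0 = ly_bar - b1 * x_bar
    return b0, b1

# Hypothetical data following y = 2 * exp(0.5 * x): curved in y, linear in log(y).
x = [0, 1, 2, 3, 4, 5]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

b0, b1 = fit_log_linear(x, y)

def predict(x_new):
    """Back-transform the fitted line to the original scale of y."""
    return math.exp(b0 + b1 * x_new)
```

The model stays linear in its coefficients, so the usual regression machinery applies to log(Y), while the fitted relationship between Y and X is nonlinear.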

Data Transformation. The usual approach for dealing with nonconstant variance, when it occurs, is to apply a variance-stabilizing transformation. For some distributions, the variance is a function of E(Y). Box-Cox transformation: $y^{(\lambda)} = (y^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$, and $y^{(\lambda)} = \ln y$ for $\lambda = 0$.
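A minimal sketch of the Box-Cox family, assuming its standard form: (y^λ - 1)/λ for λ ≠ 0 and ln(y) for λ = 0, defined for positive y only. The function name `box_cox` is made up for this example, and choosing λ (usually by maximum likelihood) is not shown.

```python
import math

def box_cox(y, lam):
    """Apply the Box-Cox transformation with parameter lam to a list of
    positive values. lam = 0 is the log transform (the limit as lam -> 0)."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

# Hypothetical response values spanning several orders of magnitude.
y = [1.0, 4.0, 9.0, 100.0, 2500.0]

log_y = box_cox(y, 0)      # lam = 0: natural log
sqrt_y = box_cox(y, 0.5)   # lam = 0.5: shifted and scaled square root
same_y = box_cox(y, 1)     # lam = 1: y - 1, shape unchanged
```

Values of λ below 1 compress large responses more than small ones, which is what stabilizes a variance that grows with E(Y).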