Stat 112 Notes 16 Today: –Outliers and influential points in multiple regression (Chapter 6.7)

Outliers and Influential Observations in Simple Regression Outlier: Any really unusual observation. Outlier in the X direction (called a high leverage point): Has the potential to influence the regression line. Outlier in the direction of the scatterplot: An observation that deviates from the overall pattern of the relationship between Y and X. Typically has a residual that is large in absolute value. Influential observation: A point whose removal would markedly change the statistical analysis. For simple linear regression, points that are outliers in the X direction are often influential.

Influential Points, High Leverage Points, Outliers in Multiple Regression As in simple linear regression, we identify high leverage and high influence points by checking the leverages and Cook’s distances (use Save Columns to save Cook’s D Influence and Hats). High influence points: Cook’s distance > 1. High leverage points: Points whose explanatory variables are outlying in a multidimensional sense. A point with a hat value greater than (3*(# of explanatory variables + 1))/n has high leverage. Use the same guidelines for dealing with influential observations as in simple linear regression. Point that has unusual Y given its explanatory variables: A point with a residual that is more than 3 RMSEs away from zero.
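
The three rules of thumb above can be sketched outside JMP. The following is a minimal illustration in Python with NumPy, not the course's JMP workflow; the function name `influence_diagnostics` and the simulated data are assumptions made for the example, with one row deliberately moved far from the rest.

```python
import numpy as np

def influence_diagnostics(X, y):
    """Hat values, Cook's distances, residuals and RMSE for OLS of y on X (intercept added)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    p = design.shape[1]                      # number of explanatory variables + 1
    Q, _ = np.linalg.qr(design)              # reduced QR; hat matrix is Q Q^T
    hats = np.sum(Q**2, axis=1)              # diagonal of the hat matrix
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    mse = resid @ resid / (n - p)
    cooks = resid**2 / (p * mse) * hats / (1 - hats) ** 2
    return hats, cooks, resid, np.sqrt(mse)

# Simulated data (hypothetical): 60 observations, 5 explanatory variables.
rng = np.random.default_rng(0)
n, k = 60, 5
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.ones(k) + rng.normal(size=n)
X[0] = 6.0     # row 0 is far from the others in the X direction
y[0] = -20.0   # and its Y does not follow the overall pattern

hats, cooks, resid, rmse = influence_diagnostics(X, y)
leverage_cutoff = 3 * (k + 1) / n                       # 3*(# explanatory + 1)/n
high_leverage = np.flatnonzero(hats > leverage_cutoff)
high_influence = np.flatnonzero(cooks > 1)
unusual_y = np.flatnonzero(np.abs(resid) > 3 * rmse)
print(high_leverage, high_influence, unusual_y)
```

The hat values always sum to the number of explanatory variables plus one, which is a quick sanity check on any leverage computation.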

Multiple regression, modeling and outliers, leverage and influential points Pollution Example Data set pollution2.JMP provides information about the relationship between pollution and mortality for 60 cities between 1959 and 1961. The variables are y (MORT)=total age-adjusted mortality in deaths per 100,000 population; PRECIP=mean annual precipitation (in inches); EDUC=median number of school years completed for persons 25 and older; NONWHITE=percentage of 1960 population that is nonwhite; NOX=relative pollution potential of NOx (related to the amount in tons of NOx emitted per day per square kilometer); SO2=log of relative pollution potential of SO2.

Multiple Regression: Steps in Analysis 1. Preliminaries: Define the question of interest. Review the design of the study. Correct errors in the data. 2. Explore the data. Use graphical tools, e.g., scatterplot matrix; consider transformations of explanatory variables; fit a tentative model; check for outliers and influential points. 3. Formulate an inferential model. Word the questions of interest in terms of model parameters.

Multiple Regression: Steps in Analysis Continued 4. Check the model. (a) Check the model assumptions of linearity, constant variance, normality. (b) If needed, return to step 2 and make changes to the model (such as transformations or adding terms for interaction and curvature). 5. Infer the answers to the questions of interest using appropriate inferential tools (e.g., confidence intervals, hypothesis tests, prediction intervals). 6. Presentation: Communicate the results to the intended audience.

Scatterplot Matrix Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each explanatory variable. These can suggest transformations of the explanatory variables that need to be done, as well as potential outliers and influential points. Scatterplot matrix in JMP: Click Analyze, Multivariate Methods, and Multivariate, and then put the response variable first in the Y, Columns box, followed by the explanatory variables in the same Y, Columns box.

Crunched Variables When an X variable is “crunched” (most of its values are bunched together and a few are far apart), there will be influential points. To reduce the effects of crunching, it is a good idea to transform the variable to the log of the variable. When a Y variable is crunched, it is often also useful to transform it to log Y.
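
To see why the log transformation helps, one can compare the largest simple-regression leverage before and after taking logs. This is a minimal sketch in Python; the values in `x` are made up to look crunched and are not the NOX data, and `simple_leverages` is a helper name chosen for the example.

```python
import numpy as np

def simple_leverages(x):
    """Leverages for simple linear regression: 1/n + (x_i - mean)^2 / Sxx."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return 1.0 / x.size + dev**2 / np.sum(dev**2)

# Hypothetical crunched variable: most values are small, two are far out.
x = np.array([1, 2, 2, 3, 3, 4, 4, 5, 6, 8, 9, 11, 15, 21, 38, 171, 319],
             dtype=float)

max_raw = simple_leverages(x).max()          # leverage of the most extreme raw value
max_log = simple_leverages(np.log(x)).max()  # same point after the log transform
print(f"max leverage, raw x: {max_raw:.2f}")
print(f"max leverage, log x: {max_log:.2f}")
```

On the log scale the far-apart values are pulled toward the rest, so no single point dominates the fit to the same degree.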

2. (a) From the scatterplot of MORT vs. NOX, we see that the NOX values are crunched very tightly. A log transformation of NOX is needed. (b) There seems to be an approximately linear relationship between MORT and the other variables.

New Orleans has Cook’s distance greater than 1, so New Orleans may be influential. 3 RMSEs = 108. No points are outliers in residuals.

Labeling Observations To have points identified by a certain column, go to the column, click Columns, and click Label (click Unlabel to remove the label). To label a row, go to the row, click Rows, and click Label.

Dealing with New Orleans New Orleans is influential. New Orleans also has high leverage: hat = 0.45 > (3*6/60) = 0.3. Thus, it is reasonable to exclude New Orleans from the analysis, report that we excluded New Orleans, and note that our model does not apply to cities with explanatory variables in the range of New Orleans’.

Leverage Plots A “simple regression view” of a multiple regression coefficient. For x_j: residual of y (from the regression without x_j) vs. residual of x_j (from the regression of x_j on the rest of the x’s); both axes are recentered. Slope in leverage plot: the coefficient for that variable in the multiple regression. Distances from the points to the LS line are the multiple regression residuals. Distance from a point to the horizontal line is the residual if the explanatory variable is not included in the model. Useful to identify (for x_j) outliers, leverage, and influential points. (Use them the same way as in a simple regression to identify the effect of points on the regression coefficient of a particular variable.) Leverage plots are particularly useful for finding points that are influential for a particular coefficient in the regression.
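
The two properties of the leverage plot stated above (its slope equals the multiple regression coefficient, and its residuals equal the multiple regression residuals) can be checked numerically. This is a minimal sketch in Python on simulated data; the helper `ols_residuals` and the data are assumptions for illustration, and the recentering of the axes that JMP applies is omitted because it does not change the slope.

```python
import numpy as np

def ols_residuals(A, b):
    """Residuals from the least squares regression of b on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return b - A @ coef

# Simulated data (hypothetical): 60 observations, 3 explanatory variables.
rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 3))
y = 2.0 + X @ np.array([1.5, -0.5, 0.8]) + rng.normal(size=n)

design = np.column_stack([np.ones(n), X])
full_coef, *_ = np.linalg.lstsq(design, y, rcond=None)
full_resid = y - design @ full_coef

j = 1                                        # build the leverage plot for X[:, j]
others = np.delete(design, j + 1, axis=1)    # intercept and the other x's
ry = ols_residuals(others, y)                # vertical axis: y without x_j
rx = ols_residuals(others, X[:, j])          # horizontal axis: x_j vs the rest
slope = (rx @ ry) / (rx @ rx)                # LS slope through the plotted points
plot_resid = ry - slope * rx                 # distances to the LS line

print(slope, full_coef[j + 1])
```

Plotting `ry` against `rx` gives the leverage plot itself; a point with extreme `rx` is highly leveraged for that one coefficient even if its overall hat value is modest.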

The enlarged observation, New Orleans, is an outlier for estimating each coefficient and is highly leveraged for estimating the coefficients of interest on log NOX and SO2. Since New Orleans is both highly leveraged and an outlier, we expect it to be influential.

Analysis without New Orleans

Checking the Model

Linearity, constant variance and normality assumptions all appear reasonable.

Inference About Questions of Interest Strong evidence that mortality is positively associated with SO2 for fixed levels of precipitation, education, nonwhite, NOX. No strong evidence that mortality is associated with NOX for fixed levels of precipitation, education, nonwhite, SO2.

Multiple Regression and Causal Inference Goal: Figure out what the causal effect on mortality would be of decreasing air pollution (and keeping everything else in the world fixed). Lurking variable: A variable that is associated with both air pollution in a city and mortality in a city. In order to figure out whether air pollution causes mortality, we want to compare mean mortality among cities with different air pollution levels but the same values of the lurking (confounding) variables. If we include all of the lurking variables in the multiple regression model, the coefficient on air pollution represents the change in the mean of mortality that is caused by a one-unit increase in air pollution. If we omit some of the lurking variables, then there is omitted variables bias, i.e., the multiple regression coefficient on air pollution does not measure the causal effect of air pollution.
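
Omitted variables bias can be illustrated with a small simulation. This is a sketch in Python under made-up assumptions: the variable names, the true causal effect of 1.0, and the strength of the lurking variable are all chosen for the example and have nothing to do with the pollution data.

```python
import numpy as np

def ols_coefficients(columns, y):
    """Least squares coefficients for y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Hypothetical data-generating process: the lurking variable drives both
# pollution and mortality; the true causal effect of pollution is 1.0.
rng = np.random.default_rng(2)
n = 5000
lurking = rng.normal(size=n)
pollution = 0.8 * lurking + rng.normal(size=n)
mortality = 1.0 * pollution + 2.0 * lurking + rng.normal(size=n)

coef_with = ols_coefficients([pollution, lurking], mortality)[1]   # lurking included
coef_without = ols_coefficients([pollution], mortality)[1]         # lurking omitted

print(f"pollution coefficient, lurking variable included: {coef_with:.2f}")
print(f"pollution coefficient, lurking variable omitted:  {coef_without:.2f}")
```

With the lurking variable in the model the pollution coefficient is close to the true effect; leaving it out inflates the coefficient because pollution also picks up the lurking variable's effect on mortality.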
