Stat 112: Lecture 15 Notes. Finish Chapter 6: –Review on Checking Assumptions (Sections 6.4-6.6) –Outliers and Influential Points (Section 6.7). Homework 4.


Stat 112: Lecture 15 Notes. Finish Chapter 6: –Review on Checking Assumptions (Sections 6.4-6.6) –Outliers and Influential Points (Section 6.7). Homework 4 is due this Thursday. Please let me know of any ideas you want to discuss for the final project.

Review of Checking and Remedying Assumptions
1. Linearity: Check the residual-by-predicted plot and the residual plot for each variable for a pattern in the mean of the residuals. Remedies: transformations and polynomials. To see whether a remedy works, check the new residual plots for a pattern in the mean of the residuals.
2. Constant variance: The standard deviation of Y is the same for every subpopulation of units defined by the explanatory variables. Check the residual-by-predicted plot for a pattern in the spread of the residuals. Remedy: transformation of Y. To see whether the remedy works, check the residual-by-predicted plot for the transformed-Y regression.
3. Normality: The distribution of Y within each such subpopulation is normally distributed. Check the histogram of the residuals for a bell shape and the normal quantile plot of the residuals for an approximately straight line. Remedy: transformation of Y. To see whether the remedy works, check the histogram and normal quantile plot of the residuals from the transformed-Y regression.

Checking Whether a Transformation of Y Remedies Nonconstant Variance
1. Create a new column with the transformation of the Y variable by right-clicking in the new column, clicking Formula, and entering the appropriate formula for the transformation (note: Log is contained in the class of transcendental functions).
2. Fit the regression of the transformed Y on the X variables.
3. Check the residual-by-predicted plot to see whether the spread of the residuals appears constant over the range of predicted values.
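Outside JMP, the same before-and-after check can be sketched numerically. The following is a minimal numpy sketch with made-up data (the multiplicative-error setup and the spread measure are my own illustration, not from the lecture): when Y has multiplicative errors, the size of the residuals grows with the fitted values, and a log transformation removes that pattern.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up example: Y has multiplicative errors, so its spread grows with its mean.
x = np.linspace(1.0, 10.0, 200)
y = np.exp(1.0 + 0.3 * x + rng.normal(scale=0.25, size=x.size))

def spread_vs_fitted(response, x):
    """Correlation of |residual| with fitted value; near 0 suggests constant spread."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    fitted = X @ beta
    return np.corrcoef(np.abs(response - fitted), fitted)[0, 1]

# The raw-scale fit shows spread increasing with the predictions;
# after the log transformation the pattern should largely disappear.
print(spread_vs_fitted(y, x) > spread_vs_fitted(np.log(y), x))
```

This mirrors the visual check in step 3: a residual-by-predicted plot with a fan shape corresponds to a clearly positive correlation here.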

Outliers in Residuals. Standardized residual: under the normality assumption, about 95% of standardized residuals should fall between -2 and 2, and about 99% between -3 and 3. An observation with a standardized residual above 3 or below -3 is considered an outlier in its residual, i.e., its Y value is unusual given its explanatory variables. It is worth looking further at such an observation to see whether any reason for the large-magnitude residual can be identified.
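As a rough numerical illustration (the data are made up; the formula r_i = e_i / (RMSE * sqrt(1 - h_i)) is the usual internally standardized residual), a point whose Y is far off the line is flagged while the others stay well inside the ±2 band:

```python
import numpy as np

# Toy data: points on y = 2x, except one observation with an unusual Y.
x = np.arange(1.0, 21.0)
y = 2.0 * x
y[9] = 35.0                                    # x = 10 should give y = 20

X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
resid = y - X @ beta

n, p = X.shape
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages (hat values)
rmse = np.sqrt(resid @ resid / (n - p))        # root mean squared error
std_resid = resid / (rmse * np.sqrt(1.0 - h))  # standardized residuals

print(np.where(np.abs(std_resid) > 3)[0])      # -> [9]
```

Only the altered observation exceeds the |standardized residual| > 3 cutoff; its Y is unusual given its x value.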

philacrimerate.JMP: outliers in residuals (JMP output shown on slide).

Influential Points and Leverage Points. Influential observation: a point whose removal would markedly change the statistical analysis. For simple linear regression, points that are outliers in the X direction are often influential. Leverage point: a point that is an outlier in the X direction and therefore has the potential to be influential; it will be influential if its residual is of moderately large magnitude.

Which Observations Are Influential? Center City Philadelphia is influential; Gladwyne is not. In general, points that have high leverage are more likely to be influential.

Excluding Observations from Analysis in JMP To exclude an observation from the regression analysis in JMP, go to the row of the observation, click Rows and then click Exclude/Unexclude. A red circle with a diagonal line through it should appear next to the observation. To put the observation back into the analysis, go to the row of the observation, click Rows and then click Exclude/Unexclude. The red circle should no longer appear next to the observation.

Formal Measures of Leverage and Influence. Leverage: “hat values” (JMP calls them Hats). Influence: Cook’s distance (JMP calls it Cook’s D Influence). To obtain them in JMP, click Analyze, Fit Model, put the Y variable in the Y box and the X variables in the Model Effects box, and click Run Model. After the model is fit, click the red triangle next to Response, click Save Columns, then click Hats for the leverages and Cook’s D Influence for the Cook’s distances. To sort observations by Cook’s distance or leverage, click Tables, Sort, and put the variable you want to sort by in the By box.

Center City Philadelphia has both high influence (Cook’s distance much greater than 1) and high leverage (hat value > 3*2/99 = 0.06). No other observations have high influence or high leverage.

Rules of Thumb for High Leverage and High Influence. High leverage: any observation with a leverage (hat value) > (3 * # of coefficients in the regression model)/n has high leverage, where the # of coefficients in the regression model is 2 for simple linear regression and n is the number of observations. High influence: any observation with a Cook’s distance greater than 1 has high influence.
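Both rules of thumb can be sketched with a small made-up data set. The Cook's distance formula used below, D_i = e_i^2 h_i / (p * MSE * (1 - h_i)^2), is the standard one, with p = 2 coefficients for simple regression:

```python
import numpy as np

# Toy data: points on y = 2x plus one point far out in X with a discrepant Y.
x = np.append(np.arange(1.0, 11.0), 30.0)
y = np.append(2.0 * np.arange(1.0, 11.0), 20.0)    # at x = 30, y = 20 is far below the line

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

n, p = X.shape                                     # p = 2 coefficients here
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)      # leverages (hat values)
mse = resid @ resid / (n - p)
cooks_d = resid**2 * h / (p * mse * (1.0 - h)**2)  # Cook's distances

high_leverage = h > 3 * p / n                      # rule of thumb: hat > 3*(# coefficients)/n
high_influence = cooks_d > 1                       # rule of thumb: Cook's D > 1
print(np.where(high_leverage)[0], np.where(high_influence)[0])   # -> [10] [10]
```

The point at x = 30 is flagged by both rules: it is an outlier in the X direction (high leverage) and its discrepant Y makes removing it change the fit markedly (high influence).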

What to Do About Suspected Influential Observations? See the flowchart attached at the end of the slides. Does removing the observation change the substantive conclusions? If not, you can say something like: “Observation X has high influence relative to all other observations, but we tried refitting the regression without Observation X and our main conclusions didn’t change.”

If removing the observation does change the substantive conclusions, is there any reason to believe the observation belongs to a population other than the one under investigation? –If yes, omit the observation and proceed. –If no, does the observation have high leverage (i.e., is it an outlier in the explanatory variables)? If yes, omit the observation and proceed, but report that the conclusions apply only to a limited range of the explanatory variables. If no, not much can be said; more data (or clarification of the influential observation) are needed to resolve the question.

General Principles for Dealing with Influential Observations General principle: Delete observations from the analysis sparingly – only when there is good cause (observation does not belong to population being investigated or is a point with high leverage). If you do delete observations from the analysis, you should state clearly which observations were deleted and why.

Influential Points, High Leverage Points, Outliers in Multiple Regression. As in simple linear regression, we identify high-leverage and high-influence points by checking the leverages and Cook’s distances (use Save Columns to save Cook’s D Influence and Hats). High-influence points: Cook’s distance > 1. High-leverage points: hat value greater than (3*(# of explanatory variables + 1))/n. These are points whose explanatory variables are outliers in a multidimensional sense. Use the same guidelines for dealing with influential observations as in simple linear regression. A point has an unusual Y given its explanatory variables if its residual is more than 3 RMSEs away from zero.
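A sketch with simulated data (all names and values made up) illustrates the multiple regression cutoff. Two useful facts: the leverages always sum to the number of coefficients, and a point far from the bulk of the X's jointly exceeds the 3*(k+1)/n rule of thumb:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical multiple regression with two explanatory variables.
n, k = 60, 2
X_raw = rng.normal(size=(n, k))
X_raw[0] = [6.0, -6.0]                         # outlier in the X's in a multidimensional sense
y = 5 + X_raw @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), X_raw])       # add intercept column
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages; they sum to k + 1 coefficients

cutoff = 3 * (k + 1) / n                       # rule of thumb: 3*(# explanatory vars + 1)/n
print(np.where(h > cutoff)[0])                 # row 0 should be among those flagged
```

Note that y plays no role in the leverage calculation: leverage depends only on the explanatory variables, which is why a high-leverage point need not be influential.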

Multiple Regression, Modeling and Outliers, Leverage and Influential Points: Pollution Example. The data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities. The variables are: MORT (y) = total age-adjusted mortality in deaths per 100,000 population; PRECIP = mean annual precipitation (in inches); EDUC = median number of school years completed for persons 25 and older; NONWHITE = percentage of the 1960 population that is nonwhite; NOX = relative pollution potential of NOx (related to the tons of NOx emitted per day per square kilometer); SO2 = relative pollution potential of SO2.

Scatterplot Matrix. Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each explanatory variable. These can suggest transformations of the explanatory variables as well as potential outliers and influential points. Scatterplot matrix in JMP: click Analyze, Multivariate Methods, Multivariate, and then put the response variable first in the Y, Columns box, followed by the explanatory variables.

Scatterplot Matrix (figure shown on slide).

Crunched Variables. When an X variable is “crunched” – meaning that most of its values are bunched together and a few are far apart – there will be influential points. To reduce the effects of crunching, it is a good idea to transform the variable to its log.
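A small made-up illustration of crunching: values that span several orders of magnitude leave most points bunched at the low end, and taking logs spreads them out evenly. The largest gap as a fraction of the total range is used below as a crude crunching measure (this measure is my own illustration, not from the lecture):

```python
import numpy as np

# A "crunched" variable: powers of 2, so most values sit at the low end.
nox = 2.0 ** np.arange(11)            # 1, 2, 4, ..., 1024
log_nox = np.log(nox)                 # the log spreads the values out evenly

def gap_ratio(v):
    """Largest gap between sorted values, as a fraction of the total range."""
    s = np.sort(v)
    return np.max(np.diff(s)) / (s[-1] - s[0])

print(round(gap_ratio(nox), 2), round(gap_ratio(log_nox), 2))   # -> 0.5 0.1
```

On the raw scale, half the range is one empty gap (512 to 1024), so the top value dominates any fit; after the log transformation the points are equally spaced and no single point has outsized leverage.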

2. a) From the scatterplot of MORT vs. NOX, we see that the NOX values are crunched very tightly, so a log transformation of NOX is needed. b) The curvature in MORT vs. SO2 indicates that a log transformation of SO2 may be suitable. After the two transformations we have the correlations shown on the slide.

New Orleans has Cook’s Distance greater than 1 – New Orleans may be influential.

Labeling Observations. To have points identified by a certain column, go to the column, click Columns, and click Label (click Unlabel to unlabel). To label a row, go to the row, click Rows, and click Label.

Leverage Plots. A “simple regression view” of a multiple regression coefficient. For x_j: plot the residual of y (from regressing y on all the x’s except x_j) against the residual of x_j (from regressing x_j on the rest of the x’s); both axes are recentered. The slope of the least-squares line in this plot is the coefficient of x_j in the multiple regression. The distances from the points to the LS line are the multiple regression residuals, and the distance from a point to the horizontal line is its residual if x_j is not included in the model. Leverage plots are useful for identifying outliers, leverage, and influential points for x_j; use them the same way as in simple regression to identify the effect of points on the regression coefficient of a particular variable.
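The key fact behind leverage plots (the Frisch-Waugh-Lovell theorem) can be checked numerically: regressing the two sets of residuals on each other reproduces the multiple regression coefficient exactly. A sketch with simulated data (all names made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: y depends on x1 and x2; build the leverage plot for x2 by hand.
n = 50
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)           # correlated with x1
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.3, size=n)

def resid(target, X):
    """Residuals of target regressed on X (with intercept)."""
    Xd = np.column_stack([np.ones(len(target)), X])
    beta, *_ = np.linalg.lstsq(Xd, target, rcond=None)
    return target - Xd @ beta

ry = resid(y, x1)                            # y with the effect of the other x's removed
rx = resid(x2, x1)                           # x2 with the effect of the other x's removed

slope = (rx @ ry) / (rx @ rx)                # slope of the leverage plot for x2
full_beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)
print(np.isclose(slope, full_beta[2]))       # -> True: matches the multiple regression coef
```

This is why a point's position in the leverage plot tells you about its effect on that one coefficient: the plot literally is the simple regression whose slope is the multiple regression coefficient.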

The enlarged observation, New Orleans, is an outlier for estimating each coefficient and has high leverage for estimating the coefficients of interest on log NOX and log SO2. Since New Orleans both has high leverage and is an outlier, we expect it to be influential.