Stat 112: Lecture 16 Notes. Finish Chapter 6: –Influential Points for Multiple Regression (Section 6.7) –Assessing the Independence Assumption and Remedies for Its Violation (Section 6.8)

Stat 112: Lecture 16 Notes. Finish Chapter 6: –Influential Points for Multiple Regression (Section 6.7) –Assessing the Independence Assumption and Remedies for Its Violation (Section 6.8) Homework 5 is due next Thursday; I will e-mail it tonight. Please let me know of any ideas you want to discuss for the final project.

Multiple regression, modeling and outliers, leverage and influential points. Pollution Example: The data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities. The variables are: y (MORT) = total age-adjusted mortality in deaths per 100,000 population; PRECIP = mean annual precipitation (in inches); EDUC = median number of school years completed for persons 25 and older; NONWHITE = percentage of the 1960 population that is nonwhite; NOX = relative pollution potential of NOx (related to tons of NOx emitted per day per square kilometer); SO2 = relative pollution potential of SO2.

Multiple Regression: Steps in Analysis
1. Preliminaries: Define the question of interest. Review the design of the study. Correct errors in the data.
2. Explore the data. Use graphical tools, e.g., a scatterplot matrix; consider transformations of explanatory variables; fit a tentative model; check for outliers and influential points.
3. Formulate an inferential model. Word the questions of interest in terms of model parameters.

Multiple Regression: Steps in Analysis, Continued
4. Check the model. (a) Check the model assumptions of linearity, constant variance, and normality. (b) If needed, return to step 2 and make changes to the model (such as transformations or adding terms for interaction and curvature). (c) Drop variables from the model that are not of central interest and are not significant.
5. Infer the answers to the questions of interest using appropriate inferential tools (e.g., confidence intervals, hypothesis tests, prediction intervals).
6. Presentation: Communicate the results to the intended audience.

Scatterplot Matrix. Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each explanatory variable. These can suggest transformations of the explanatory variables that need to be done, as well as potential outliers and influential points. Scatterplot matrix in JMP: click Analyze, Multivariate Methods, Multivariate, and then put the response variable first in the Y, Columns box, followed by the explanatory variables.
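For readers working outside JMP, a minimal Python sketch produces the same display (the CSV file name and column names are hypothetical stand-ins for pollution.JMP):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV export of pollution.JMP; column names are assumed.
df = pd.read_csv("pollution.csv")
cols = ["MORT", "PRECIP", "EDUC", "NONWHITE", "NOX", "SO2"]

# Response (MORT) listed first, matching the JMP "Y, Columns" ordering.
pd.plotting.scatter_matrix(df[cols], figsize=(10, 10), diagonal="hist")
plt.show()
```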

Scatterplot Matrix

Crunched Variables. When an X variable is "crunched" – meaning that most of its values are crunched together and a few are far apart – there will be influential points. To reduce the effects of crunching, it is a good idea to transform the variable to its logarithm.

2. (a) From the scatterplot of MORT vs. NOX, we see that the NOX values are crunched very tightly; a log transformation of NOX is needed. (b) The curvature in MORT vs. SO2 indicates that a log transformation of SO2 may be suitable. After the two transformations we have the following correlations:

Influential Points, High Leverage Points, Outliers in Multiple Regression. As in simple linear regression, we identify high leverage and high influence points by checking the leverages and Cook's distances (use Save Columns to save Cook's D Influence and Hats).
–High influence points: Cook's distance > 1.
–High leverage points: hat value greater than 3*(number of explanatory variables + 1)/n. These are points whose explanatory variables are outliers in a multidimensional sense.
–A point that has an unusual Y given its explanatory variables is one whose residual is more than 3 RMSEs away from zero.
Use the same guidelines for dealing with influential observations as in simple linear regression. (A code sketch of these checks follows.)
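A sketch of the same diagnostics in Python with statsmodels (the CSV export and column names are assumptions; JMP's Save Columns produces the same quantities):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical CSV export of pollution.JMP; column names are assumed.
df = pd.read_csv("pollution.csv")
df["logNOX"] = np.log(df["NOX"])
df["logSO2"] = np.log(df["SO2"])

X = sm.add_constant(df[["PRECIP", "EDUC", "NONWHITE", "logNOX", "logSO2"]])
fit = sm.OLS(df["MORT"], X).fit()

infl = fit.get_influence()
leverage = infl.hat_matrix_diag       # JMP's "Hats"
cooks_d = infl.cooks_distance[0]      # JMP's "Cook's D Influence"

n, p_plus_1 = X.shape                 # p + 1 counts the intercept
rmse = np.sqrt(fit.mse_resid)

flag = (
    (leverage > 3 * p_plus_1 / n)                 # high leverage
    | (cooks_d > 1)                               # high influence
    | (np.abs(fit.resid.to_numpy()) > 3 * rmse)   # unusual Y (outlier)
)
print(df.loc[flag])
```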

New Orleans has a Cook's distance greater than 1, so New Orleans may be influential. 3 RMSEs = 108; no points are outliers in the residuals.

Labeling Observations. To have points identified by a certain column, go to the column, click Columns, and click Label (click Unlabel to unlabel). To label a row, go to the row, click Rows, and click Label.

Dealing with New Orleans. New Orleans is influential. New Orleans also has high leverage: hat = 0.45 > 3*6/60 = 0.3. Thus, it is reasonable to exclude New Orleans from the analysis, report that we excluded New Orleans, and note that our model does not apply to cities with explanatory variables in the range of New Orleans'.

Leverage Plots. A "simple regression view" of a multiple regression coefficient. For x_j: plot the residual of y (regressed on all x's except x_j) vs. the residual of x_j (regressed on the rest of the x's), with both axes recentered.
–The slope in the leverage plot is the coefficient for that variable in the multiple regression.
–The distances from the points to the least squares line are the multiple regression residuals.
–The distance from a point to the horizontal line is its residual if the explanatory variable is not included in the model.
Leverage plots are useful for identifying, for x_j, outliers, leverage, and influential points (use them the same way as in simple regression to identify the effect of points on the regression coefficient of a particular variable). They are particularly useful for finding points that are influential for a particular coefficient in the regression. (See the sketch below.)
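In statsmodels, added-variable (partial regression) plots play the role of JMP's leverage plots; a sketch, under the same assumed file and column names as above:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Refit the pollution model (hypothetical CSV export of pollution.JMP).
df = pd.read_csv("pollution.csv")
df["logNOX"] = np.log(df["NOX"])
df["logSO2"] = np.log(df["SO2"])
X = sm.add_constant(df[["PRECIP", "EDUC", "NONWHITE", "logNOX", "logSO2"]])
fit = sm.OLS(df["MORT"], X).fit()

# One added-variable panel per coefficient; the least squares slope in
# each panel equals that variable's coefficient in the full regression.
fig = sm.graphics.plot_partregress_grid(fit)
fig.tight_layout()
plt.show()
```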

The enlarged observation, New Orleans, is an outlier for estimating each coefficient and is highly leveraged for estimating the coefficients of interest on log NOX and log SO2. Since New Orleans is both highly leveraged and an outlier, we expect it to be influential.

Analysis without New Orleans

Checking the Model

The linearity, constant variance, and normality assumptions all appear reasonable.

Inference About Questions of Interest. There is strong evidence that mortality is positively associated with SO2 for fixed levels of precipitation, education, nonwhite, and NOX. There is no strong evidence that mortality is associated with NOX for fixed levels of precipitation, education, nonwhite, and SO2.

Multiple Regression and Causal Inference. Goal: figure out what the causal effect on mortality would be of decreasing air pollution (keeping everything else in the world fixed). Lurking variable: a variable that is associated with both air pollution in a city and mortality in a city. In order to figure out whether air pollution causes mortality, we want to compare mean mortality among cities with different air pollution levels but the same values of the lurking (confounding) variables. If we include all of the lurking variables in the multiple regression model, the coefficient on air pollution represents the change in the mean of mortality that is caused by a one-unit increase in air pollution. If we omit some of the lurking variables, then there is omitted variables bias, i.e., the multiple regression coefficient on air pollution does not measure the causal effect of air pollution. (The simulation sketched below illustrates this bias.)
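A small simulation makes omitted variables bias concrete (all numbers here are invented for illustration): a lurking variable drives both pollution and mortality, and the regression that omits it overstates pollution's causal effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000

# Invented data-generating process: a lurking variable (say, industrial
# activity) raises both pollution and mortality; the true causal effect
# of pollution on mortality is 1.0.
lurking = rng.normal(size=n)
pollution = 0.8 * lurking + rng.normal(size=n)
mortality = 1.0 * pollution + 2.0 * lurking + rng.normal(size=n)

# Omitting the lurking variable biases the pollution coefficient upward.
short = sm.OLS(mortality, sm.add_constant(pollution)).fit()
# Including it recovers the causal effect (about 1.0).
full = sm.OLS(mortality,
              sm.add_constant(np.column_stack([pollution, lurking]))).fit()

print(short.params[1], full.params[1])  # roughly 2.0 vs. roughly 1.0
```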

Time Series Data and Autocorrelation. When Y is a variable collected for the same entity (person, state, country) over time, we call the data time series data. For time series data, we need to check the independence assumption of the simple and multiple regression model. Independence assumption: the residuals are independent of one another. This means that if the residual is positive this year, the residual next year must be equally likely to be positive or negative, i.e., there is no autocorrelation. Positive autocorrelation: positive residuals are more likely to be followed by positive residuals than by negative residuals. Negative autocorrelation: positive residuals are more likely to be followed by negative residuals than by positive residuals. (A simple numerical check is sketched below.)
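One quick check for autocorrelation is the sample lag-1 correlation of the residual series; a minimal sketch:

```python
import numpy as np

def lag1_autocorr(resid):
    """Sample lag-1 autocorrelation of a residual series.

    Values near 0 are consistent with independence; clearly positive
    (negative) values suggest positive (negative) autocorrelation.
    """
    r = np.asarray(resid, dtype=float)
    r = r - r.mean()
    return np.sum(r[1:] * r[:-1]) / np.sum(r ** 2)
```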

Example: Melanoma Incidence. Is the incidence of melanoma (skin cancer) increasing over time? Is melanoma related to solar radiation? We address these questions by looking at melanoma incidence among males from the Connecticut Tumor Registry, starting in 1936. The data are in melanoma.JMP.

Residuals suggest positive autocorrelation.

Test of Independence. The Durbin-Watson test is a test of whether the residuals are independent. The null hypothesis is that the residuals are independent, and the alternative hypothesis is that the residuals are (either positively or negatively) autocorrelated. To compute the Durbin-Watson test in JMP, after Fit Model, click the red triangle next to Response, click Row Diagnostics, and click Durbin-Watson Test. Then click the red triangle next to Durbin-Watson to get the p-value. For the melanoma data, the Durbin-Watson output (shown on the slide) indicates the positive autocorrelation suggested by the residuals. Remedy for autocorrelation: add a lagged value of Y to the model.
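Outside JMP, the same test and remedy can be sketched with statsmodels (melanoma.csv and its column names are assumptions standing in for melanoma.JMP):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Hypothetical CSV export of melanoma.JMP; column names are assumed.
mel = pd.read_csv("melanoma.csv")

fit = sm.OLS(mel["INCIDENCE"], sm.add_constant(mel["YEAR"])).fit()
# A statistic near 2 is consistent with independence; well below 2
# suggests positive autocorrelation (statsmodels reports no p-value).
print(durbin_watson(fit.resid))

# Remedy: add the lagged value of Y as a predictor.
mel["LAG_INCIDENCE"] = mel["INCIDENCE"].shift(1)
lagged = mel.dropna()
fit2 = sm.OLS(lagged["INCIDENCE"],
              sm.add_constant(lagged[["YEAR", "LAG_INCIDENCE"]])).fit()
print(durbin_watson(fit2.resid))
```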