Lecture 13: Diagnostics in MLR
Added variable plots; Identifying outliers; Variance Inflation Factor
BMTRY 701 Biostatistical Methods II

Recall the added variable plots
- These can help check the adequacy of the model.
- Is there curvature between Y and X after adjusting for the other X's?
- They are "refined" residual plots.
- They show the marginal importance of an individual predictor.
- They help figure out a good form for the predictor.

Example: SENIC
- Recall the difficulty determining the form for INFRISK in our regression model.
- Last time, we settled on including one term, INFRISK^2.
- But, we could use an added variable plot approach. How?
- We want to know: adjusting for all else in the model, what is the right form for INFRISK?

R code

    av1 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
    av2 <- lm(INFRISK ~ AGE + XRAY + CENSUS + factor(REGION))
    resy <- av1$residuals
    resx <- av2$residuals
    plot(resx, resy, pch=16)
    abline(lm(resy ~ resx), lwd=2)
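
For comparison, the car package can produce the same plot in one call; a minimal sketch, noting that car is an assumption here (it is not used elsewhere in these notes):

    # assumes the car package is installed and the SENIC variables are attached
    library(car)
    mlr.full <- lm(logLOS ~ INFRISK + AGE + XRAY + CENSUS + factor(REGION))
    avPlots(mlr.full, terms = ~ INFRISK)   # added-variable plot for INFRISK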

Added Variable Plot

What does that show?
- The relationship between logLOS and INFRISK if you added INFRISK to the regression.
- But, is that what we want to see?
- How about looking at residuals versus INFRISK (before including INFRISK in the model)?

R code

    mlr8 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
    smoother <- lowess(INFRISK, mlr8$residuals)
    plot(INFRISK, mlr8$residuals)
    lines(smoother)

R code

    > infrisk.star <- ifelse(INFRISK > 4, INFRISK - 4, 0)
    > mlr9 <- lm(logLOS ~ INFRISK + infrisk.star + AGE + XRAY +
    +            CENSUS + factor(REGION))
    > summary(mlr9)

[Coefficient table garbled in transcription. Recoverable detail: infrisk.star, AGE, and XRAY were significant at the 0.05 level; CENSUS and two of the three REGION contrasts were highly significant, the third significant at 0.05; the intercept p-value was < 2e-16. Residual standard error on 104 degrees of freedom; F-statistic on 8 and 104 DF, overall p-value < 2.2e-16.]

Residual Plots [two panels: SPLINE FOR INFRISK; INFRISK^2]

Which is better?
- Cannot compare via ANOVA because they are not nested!
- But, we can compare statistics qualitatively.
- R-squared: MLR7: 0.60; MLR9: 0.62
- Partial R-squared: MLR7: 0.17; MLR9: 0.19
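
Since an F-test is unavailable for non-nested models, an information criterion is another way to compare them. A minimal sketch, assuming mlr7 is last lecture's model with the single INFRISK^2 term (its exact form is an assumption here):

    mlr7 <- lm(logLOS ~ I(INFRISK^2) + AGE + XRAY + CENSUS + factor(REGION))
    AIC(mlr7, mlr9)   # lower AIC suggests the better trade-off of fit and complexity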

Identifying Outliers
- Harder to do in the MLR setting than in the SLR setting.
- Recall two concepts that make outliers important:
  - Leverage is a function of the explanatory variable(s) alone and measures the potential for a data point to affect the model parameter estimates.
  - Influence is a measure of how much a data point actually does affect the estimated model.
- Leverage and influence both may be defined in terms of matrices.

"Hat" matrix
- We must do some matrix work to understand this.
- Section 6.2 presents MLR in matrix terms.
- Notation for a MLR with p predictors and data on n patients.
- The data:

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1p} \\ 1 & X_{21} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{np} \end{pmatrix}$$

Matrix Format for the MLR model
- More notation:

$$\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

- THE MODEL: $Y = X\beta + \varepsilon$
- What are the dimensions of each? ($Y$ is $n \times 1$; $X$ is $n \times (p+1)$; $\beta$ is $(p+1) \times 1$; $\varepsilon$ is $n \times 1$.)

"Transpose" and "Inverse"
- X-transpose: $X'$ or $X^T$
- X-inverse: $X^{-1}$
- Hat matrix: $H = X(X'X)^{-1}X'$
- Why is H important? It transforms Y's to $\hat{Y}$'s: $\hat{Y} = HY$

Estimating $\sigma^2$, based on the fitted model:

$$\hat{\sigma}^2 = MSE = \frac{SSE}{n-p-1} = \frac{\sum_{i=1}^n e_i^2}{n-p-1}$$

Other uses of H ($I$ = identity matrix)
- Variance-Covariance Matrix of residuals: $\sigma^2\{e\} = \sigma^2 (I - H)$, estimated by $MSE\,(I - H)$
- Variance of the ith residual: $\sigma^2\{e_i\} = \sigma^2 (1 - h_{ii})$
- Covariance of the ith and jth residuals: $\sigma\{e_i, e_j\} = -\sigma^2\, h_{ij}$

Property of the $h_{ij}$'s: when the model includes an intercept, the columns of $X$ span the constant vector, so $H\mathbf{1} = \mathbf{1}$, i.e.

$$\sum_{j=1}^{n} h_{ij} = 1 \quad \text{for each row } i$$

- This means that each row of H sums to 1.
- And, because H is symmetric, each column of H sums to 1.
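
A minimal R check of these properties, assuming the SENIC variables and the mlr9 fit from earlier are still in the workspace:

    X <- model.matrix(mlr9)                      # n x (p+1) design matrix
    H <- X %*% solve(t(X) %*% X) %*% t(X)        # H = X (X'X)^{-1} X'
    all.equal(as.vector(H %*% logLOS),
              unname(fitted(mlr9)))              # H transforms Y into Yhat
    range(rowSums(H))                            # every row sums to 1
    all.equal(unname(diag(H)),
              unname(hatvalues(mlr9)))           # diagonal elements are the leverages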

Other use of H: it identifies points of leverage.

Using the Hat Matrix to identify outliers
- Look at $h_{ii}$ to see if a data point is an outlier.
- Large values of $h_{ii}$ imply small values of $var(e_i)$.
- As $h_{ii}$ gets close to 1, $var(e_i)$ approaches 0.
- Note that $\hat{Y} = HY$ implies $\hat{y}_i = h_{ii}\, y_i + \sum_{j \neq i} h_{ij}\, y_j$.
- As $h_{ii}$ approaches 1, $\hat{y}_i$ approaches $y_i$.
- This gives $h_{ii}$ the name "leverage".
- HIGH HAT VALUE IMPLIES POTENTIAL FOR OUTLIER!

R code

    hat <- hatvalues(reg)
    plot(1:102, hat)
    highhat <- ifelse(hat > 0.10, 1, 0)
    plot(x, y)
    points(x[highhat==1], y[highhat==1], col=2, pch=16, cex=1.5)

Hat values versus index

Identifying points with high $h_{ii}$

Does a high hat value mean the case has a large residual?
- No. $h_{ii}$ measures leverage, not influence.
- Recall what $h_{ii}$ is made of:
  - it depends ONLY on the X's
  - it does not depend on the actual Y value
- Look back at the plot: which of these is probably most "influential"?
- Standard cutoffs for "large" $h_{ii}$: greater than $2p/n$; values above 0.5 are very high, values between 0.2 and 0.5 are high.
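
A short sketch applying the 2p/n rule to mlr9, where p counts all estimated coefficients (intercept included), which matches the fact that the leverages sum to p:

    h <- hatvalues(mlr9)
    p <- length(coef(mlr9))           # number of estimated coefficients
    n <- length(h)
    which(h > 2 * p / n)              # candidate high-leverage cases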

Let's look at our MLR9: any outliers?

Using the hat matrix in MLR
- Studentized residuals
- Acknowledge that each residual has a different variance: the magnitude of a residual should be judged relative to its variance (or standard deviation).
- Studentized residuals recognize these differences in sampling error.

Defining Studentized Residuals
- From slide 15, $var(e_i) = \sigma^2 (1 - h_{ii})$, estimated by $s^2\{e_i\} = MSE\,(1 - h_{ii})$.
- We then define

$$r_i = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$$

- Comparing $e_i$ and $r_i$: the $e_i$ have different variances due to sampling variation; the $r_i$ have constant variance.
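
In R, rstandard() returns exactly these internally studentized residuals; a quick check of the formula against the built-in:

    e <- residuals(mlr9)
    h <- hatvalues(mlr9)
    MSE <- summary(mlr9)$sigma^2
    r <- e / sqrt(MSE * (1 - h))
    all.equal(r, rstandard(mlr9))     # TRUE: same quantity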

Deleted Residuals
- Influence is more intuitively quantified by how things change when an observation is in versus out of the estimation process.
- It would be more useful to have residuals for the situation in which the observation is removed.
- Example: if $Y_i$ is far out, it may be very influential in the regression, and its residual will be small. But if that case is removed before estimating, and the residual is then calculated from the resulting fit, the residual will be large.

Deleted Residuals, $d_i$
- Process:
  - delete the ith case
  - fit the regression with all other cases
  - obtain the estimate of $E(Y_i)$ based on its X's and the fitted model, $\hat{Y}_{i(i)}$
- Then $d_i = Y_i - \hat{Y}_{i(i)}$.

Deleted Residuals, $d_i$
- Nice result: you don't actually have to refit without the ith case!

$$d_i = \frac{e_i}{1 - h_{ii}}$$

where $e_i$ is the 'plain' residual from the ith case and $h_{ii}$ is the hat value. Both are from the regression INCLUDING the case.
- For small $h_{ii}$: $e_i$ and $d_i$ will be similar.
- For large $h_{ii}$: $e_i$ and $d_i$ will be different.
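
The shortcut in R, plus the PRESS statistic (the sum of squared deleted residuals) as one common use of the $d_i$:

    e <- residuals(mlr9)
    h <- hatvalues(mlr9)
    d <- e / (1 - h)                  # deleted residuals, no refitting needed
    sum(d^2)                          # PRESS statistic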

Studentized Deleted Residuals
- Recall the need to standardize, based on knowledge of the variance:

$$t_i = \frac{d_i}{s\{d_i\}} = \frac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}$$

- The difference between $t_i$ and $r_i$? $t_i$ uses $MSE_{(i)}$, the error variance estimated with the ith case removed; $r_i$ uses the MSE from the full fit.

Another nice result
- You can calculate $MSE_{(i)}$ without refitting the model:

$$(n-p-1)\,MSE = (n-p-2)\,MSE_{(i)} + \frac{e_i^2}{1 - h_{ii}}$$

(with p predictors plus an intercept, so the error degrees of freedom are $n-p-1$).
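
Combining the two results gives the studentized deleted residuals directly from the full fit; a sketch checking the identity against R's built-in rstudent():

    e <- residuals(mlr9)
    h <- hatvalues(mlr9)
    n <- length(e)
    p1 <- length(coef(mlr9))          # p predictors plus the intercept
    SSE <- sum(e^2)
    t.del <- e * sqrt((n - p1 - 1) / (SSE * (1 - h) - e^2))
    all.equal(t.del, rstudent(mlr9))  # TRUE: matches the built-in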

Testing for outliers
- Outlier = a Y observation whose studentized deleted residual is large in absolute value.
- $t_i \sim t$ with $n-p-2$ degrees of freedom (the error df $n-p-1$ minus one, since case i is set aside).
- Two examples: simulated data; mlr9.
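
A hedged sketch of the test with a Bonferroni correction for examining all n cases (the cutoff form is standard; the 0.05 level is an assumption):

    t.del <- rstudent(mlr9)
    n <- length(t.del)
    p1 <- length(coef(mlr9))
    crit <- qt(1 - 0.05 / (2 * n), df = n - p1 - 1)   # Bonferroni critical value
    which(abs(t.del) > crit)          # cases flagged as outliers, if any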