Stat 6601 Project: Regression Diagnostics (V&R 6.3)


1 Stat 6601 Project: Regression Diagnostics (V&R 6.3)
Presenters: Anthony Britto, Kathy Fung, Kai Koo September 20, 2018 Regression Diagnostics

2 Basic Definition of Regression Diagnostics
An old robust method, developed to measure and iteratively detect possibly wrong data and to reject them through analysis of a globally fitted model.

3 Regression Diagnostics
Goal: Detect possibly wrong data through analysis of a globally fitted model.
Typical approach:
  (1) Determine an initial fitted model
  (2) Compute the residuals
  (3) Reject / identify outliers
  (4) Rebuild the model or track down the source of the errors
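The four steps above can be sketched in R as follows. This is a minimal illustration, not the presenters' code: the data are simulated, one response value is corrupted on purpose, and the cutoff of 3 on the Studentized residuals is an assumed rule of thumb.

```r
# Simulated data (an assumption for illustration): a straight line plus noise,
# with one response value deliberately corrupted.
set.seed(1)
dat <- data.frame(x = 1:20)
dat$y <- 2 + 3 * dat$x + rnorm(20)
dat$y[18] <- dat$y[18] + 40

# (1) Determine an initial fitted model
fit0 <- lm(y ~ x, data = dat)

# (2) Compute the residuals (externally Studentized here)
t_res <- rstudent(fit0)

# (3) Identify candidate outliers; the cutoff of 3 is an assumed rule of thumb
suspects <- which(abs(t_res) > 3)

# (4) Rebuild the model without the flagged rows (or keep fit0 if none)
fit1 <- if (length(suspects) > 0) lm(y ~ x, data = dat[-suspects, ]) else fit0

rbind(initial = coef(fit0), rebuilt = coef(fit1))
```

In practice step (4) is a judgment call: a flagged observation should be checked against its source before it is dropped.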

4 Influence and Leverage (1)
Influence: An observation is influential if the estimates change substantially when this observation is omitted.
Leverage: The "horizontal" distance of the x-value from the mean of x. The further from the mean, the more leverage an observation has.
y-discrepancy: The vertical distance between y_obs and y_predicted.
Conceptual formula: Influence = Leverage × y-Discrepancy

5 Influence and Leverage (2)
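[Figure: a high-influence point (5, 60) compared with a low-influence point (30, 105). Each point is annotated with (x - mean of x)^2 and y_obs - y_pred; of these annotations only the values 15 and 45 survive in the transcript, and which point they belong to is not recoverable.]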

6 Regression Diagnostics
Detecting Outliers
Distinguish between two types of outliers:
  1st type: outliers in the response variable represent model failure. Such observations are called outliers.
  2nd type: outliers with respect to the predictors are called leverage points.
Both types can affect the regression model. Leverage points, however, may almost uniquely determine the regression coefficients. They may also cause the standard errors of the regression coefficients to be much smaller than they would be if the observation were excluded.

7 Methods to detect outliers in R
Outliers can unduly affect the results of a regression analysis and can be present in the predictors, x, and in the response, Y.
Outliers in the predictors can often be detected by simply examining the distribution of the predictors:
  Dot plots
  Stem-and-leaf plots
  Box plots
  Histograms
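As a rough sketch, the four displays listed above are all available in base R. The predictor values below are hypothetical, with 40 planted as an unusual x-value.

```r
# Hypothetical predictor values; 40 is planted as an unusual x-value.
x <- c(1, 3, 4, 5, 7, 6, 2, 4, 5, 40)

# Null graphics device so the script also runs non-interactively;
# drop this line (and dev.off) to see the plots on screen.
pdf(NULL)

stripchart(x, method = "stack", main = "Dot plot")  # dot plot
stem(x)                                             # stem-and-leaf plot (text output)
boxplot(x, horizontal = TRUE, main = "Box plot")    # box plot
hist(x, main = "Histogram")                         # histogram

invisible(dev.off())

# Values beyond the box-plot whiskers (1.5 * IQR rule)
flagged <- boxplot.stats(x)$out
flagged
```

The box-plot rule gives a numeric list of candidates, which is convenient when there are too many predictors to inspect by eye.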

8 Regression Diagnostics
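Linear Model
\[ Y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + e \]
Matrix form
\[ Y = Xb + e \]
\[
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \quad
b = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{pmatrix}, \quad
e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}
\]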

9 R Functions for Regression Diagnostics
Package   Function                Description
Base      plot(model)             Basic diagnostic plots
Base      ls.diag(lsfit(x,y))     Diagnostic tool
car       cr.plots(model)         Partial residual plots
car       av.plots(model)         Partial regression plots
car       hatvalues(model)        Hat values
car       outlier.test(model)     Test for largest residual
car       df.betas(model)         DfBetas measure of influence
car       cookd(model)            Cook's D measure of influence
car       rstudent(model)         Studentized residuals
car       vif(model)              VIF or GVIF for each term in the model
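Several function names in this table date from an early release of car. As a hedged sketch, the block below uses functions from the stats package shipped with current R (hatvalues, rstudent, dfbetas, cooks.distance) to obtain the same kinds of quantities. The mtcars data set and the model formula are assumptions chosen only so the example runs.

```r
# Built-in data and an arbitrary model, used only for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)

h     <- hatvalues(fit)        # hat values (leverages)
t_res <- rstudent(fit)         # externally Studentized residuals
dfb   <- dfbetas(fit)          # scaled change in each coefficient per deleted case
cd    <- cooks.distance(fit)   # Cook's D

diag_table <- data.frame(hat = h, rstudent = t_res, cooks_d = cd)

# Three cases with the largest Cook's D
head(diag_table[order(-diag_table$cooks_d), ], 3)

# Not run here: with a recent car installed, crPlots(fit), avPlots(fit),
# outlierTest(fit) and vif(fit) cover the plots and tests listed for car.
```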

10 R function for Robust Regression
Package   Function          Description
MASS      rlm(y ~ x)        M-estimation
lqs       ltsreg(y ~ x)     Least trimmed squares
lqs       lmsreg(y ~ x)     Least median of squares regression
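A minimal sketch of these robust fits next to ordinary least squares. The data are simulated with one corrupted response, which is an assumption for illustration. In current R the lqs code is distributed as part of MASS, so a single library call is enough.

```r
library(MASS)  # recommended package bundled with standard R installations

# Simulated line with noise and one corrupted response (illustration only).
set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)
y[18] <- y[18] + 40

fit_ols <- lm(y ~ x)       # ordinary least squares, for comparison
fit_m   <- rlm(y ~ x)      # M-estimation (Huber psi by default)
fit_lts <- ltsreg(y ~ x)   # least trimmed squares
fit_lms <- lmsreg(y ~ x)   # least median of squares

rbind(ols = coef(fit_ols), m = coef(fit_m),
      lts = coef(fit_lts), lms = coef(fit_lms))
```

The true slope in this simulation is 3; the least-squares slope is pulled away from it by the single corrupted point, while the M-estimate stays close.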

11 Example: Linear regression (one independent variable) 1
Matrix form: Y = Xb + e
R / S-plus script:
> xd <- c(rep(1,5),1,3,4,5,7)
> yd <- c(6,14,10,14,26)
> x <- matrix(xd,5,2, byrow=FALSE)
> y <- matrix(yd,5,1, byrow=TRUE)
> xtrp <- t(x)                      # Matrix transpose
> xxtrp <- xtrp %*% x               # Matrix multiplication
> inxxtrp <- solve(xxtrp)           # Matrix inversion
> b.hat <- inxxtrp %*% xtrp %*% y
> b.hat
     [,1]
[1,]    2
[2,]    3
> H <- x %*% inxxtrp %*% xtrp       # hat matrix
> H                                 # prints the 5 x 5 hat matrix
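The same computation can be written more compactly and cross-checked against lm. This is a sketch using the slide's five data points; the object names (X, XtX_inv, b_hat, lev) are choices made here, not the presenters'.

```r
# The five observations from the slide.
xd <- c(1, 3, 4, 5, 7)
yd <- c(6, 14, 10, 14, 26)

X <- cbind(1, xd)                         # design matrix with an intercept column
XtX_inv <- solve(crossprod(X))            # (X'X)^{-1}
b_hat <- XtX_inv %*% crossprod(X, yd)     # least-squares estimate (X'X)^{-1} X'y
H <- X %*% XtX_inv %*% t(X)               # hat matrix: y_hat = H y
lev <- diag(H)                            # leverages are the diagonal of H

fit <- lm(yd ~ xd)                        # reference fit for comparison
cbind(manual = as.numeric(b_hat), lm = unname(coef(fit)))
cbind(manual = lev, lm = unname(hatvalues(fit)))
```

The hat matrix is idempotent (H %*% H equals H), and its diagonal sums to the number of estimated coefficients, here 2.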

12 Regression Diagnostics
Example: Linear regression (one independent variable) 2
Extraction of leverages and predicted values
Leverage of the ith observation (hii) (for one independent variable; n = # of obs.; p = 1)
hij = leverage of (xi, yi) if i = j
> n <- 5
> lev <- numeric(n)
> for (i in 1:n) {
+   lev[i] <- H[i,i]
+ }
> lev
> h <- lm.influence(lm(y ~ x[,2]))$hat   # x[,2] only: lm adds its own intercept
> h
> ls.diag(lsfit(x[,2],y))$hat
> y1.pred <- 0
> for (i in 1:n) {
+   y1.pred <- y1.pred + H[1,i]* y[i]
+ }
> y1.pred   # y1.pred = 3 (slope) * 1 (x1) + 2 (intercept)
[1] 5
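For reference, the standard closed forms for the one-predictor case can be written as below. These are textbook expressions supplied here as an aid, with \(\bar{x}\) denoting the mean of x; the worked line uses the slide's data (x = 1, 3, 4, 5, 7, so \(\bar{x} = 4\) and \(\sum_j (x_j - \bar{x})^2 = 20\)).

```latex
\[
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2},
\qquad
\hat{y}_i = \sum_{j=1}^{n} h_{ij}\, y_j
\]
\[
h_{11} = \frac{1}{5} + \frac{(1 - 4)^2}{20} = 0.65
\]
```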

13 Example: linear regression (measurement of residuals)
From y-discrepancy to influence
  Raw residual value (y-discrepancy)
  Standardized residual value (influence)
  Studentized residual value (influence)
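The usual definitions of these three residuals are sketched below. The notation is an assumption chosen to match the R code in this deck: \(e_i\) is the raw residual, \(h_{ii}\) the leverage, \(s\) the residual standard deviation (Sy in the script, with \(p = 2\) estimated coefficients in simple regression), and \(s_{(i)}\) the same quantity recomputed with observation i deleted.

```latex
\[
e_i = y_i - \hat{y}_i
\qquad \text{(raw residual)}
\]
\[
r_i = \frac{e_i}{s \sqrt{1 - h_{ii}}},
\quad s^2 = \frac{\sum_{j=1}^{n} e_j^2}{n - p}
\qquad \text{(standardized residual)}
\]
\[
t_i = \frac{e_i}{s_{(i)} \sqrt{1 - h_{ii}}},
\quad s_{(i)}^2 = \frac{\sum_{j=1}^{n} e_j^2 - e_i^2 / (1 - h_{ii})}{n - p - 1}
\qquad \text{(Studentized residual)}
\]
```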

14 Influence, leverage and discrepancy
The influence of observations can be determined by their residual values and leverages.
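One common way to combine the two is Cook's D, which multiplies a squared standardized residual by a leverage term. The sketch below computes it by hand for the five-point example and compares it with cooks.distance from stats; the variable names are choices made here.

```r
# The five observations used in the one-variable example.
xd <- c(1, 3, 4, 5, 7)
yd <- c(6, 14, 10, 14, 26)
fit <- lm(yd ~ xd)

h <- hatvalues(fit)        # leverage of each observation
r <- rstandard(fit)        # standardized residual (the discrepancy part)
p <- length(coef(fit))     # number of estimated coefficients (2 here)

# Cook's D = (r_i^2 / p) * h_ii / (1 - h_ii)
cook_manual <- (r^2 / p) * h / (1 - h)

cbind(leverage = h, std_res = r,
      cook_manual = cook_manual, cook_stats = cooks.distance(fit))
```

A point needs both a sizable residual and sizable leverage to obtain a large D, which matches the conceptual formula Influence = Leverage × y-Discrepancy.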

15 Calculation of residual values
# Do it yourself in R
> y.pred <- numeric(n)
> for (i in 1:n) {
+   for (j in 1:n) {
+     y.pred[i] <- y.pred[i] + H[i,j]* yd[j]
+   }
+ }
> res <- yd-y.pred
> Sy <- sqrt(sum(res^2)/(n-2))
> resstd <- res/(Sy*sqrt(1-lev))
> resstd
# Using ls.diag to get residuals
> ls.diag(lsfit(x[,2],y))$std.res    # standardized residuals
> ls.diag(lsfit(x[,2],y))$stud.res   # Studentized residuals
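A vectorized variant of the same calculation, offered as a sketch: the double loop is replaced by a matrix product, and the results can be compared with rstandard and rstudent from stats. The leave-one-out formula for s_del is the standard identity, not something taken from the slides.

```r
# The five observations from the one-variable example.
xd <- c(1, 3, 4, 5, 7)
yd <- c(6, 14, 10, 14, 26)

X <- cbind(1, xd)
H <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix
n <- length(yd)
p <- ncol(X)                              # 2 coefficients: intercept and slope

y_pred <- as.vector(H %*% yd)             # replaces the double loop
res    <- yd - y_pred                     # raw residuals
lev    <- diag(H)                         # leverages

s       <- sqrt(sum(res^2) / (n - p))     # residual standard deviation (Sy)
res_std <- res / (s * sqrt(1 - lev))      # standardized residuals

# Residual standard deviation with observation i left out
s_del    <- sqrt((sum(res^2) - res^2 / (1 - lev)) / (n - p - 1))
res_stud <- res / (s_del * sqrt(1 - lev)) # Studentized residuals

fit <- lm(yd ~ xd)
cbind(res_std, rstandard = unname(rstandard(fit)),
      res_stud, rstudent = unname(rstudent(fit)))
```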

16 Example: Multiple regression
Variables:
Variable    Description
price       in X*1000 / acre
county      0 = San Mateo, 1 = Santa Clara
size        acres
elevation   Average elevation in feet above sea level
sewer       Distance to the sewer connection
date        Date of sale, counted backward from now (months)
flood       Flooding by tidal action = 1, otherwise = 2
distance    Distance from current project

First model: log10(price) = b0 + b1 elevation + b2 date + b3 flood + b4 distance + e

R / S-plus script:
project.data <- read.csv("projdata.csv")
price <- project.data$price
elevation <- project.data$elevation
date <- project.data$date
flood <- project.data$flood
distance <- project.data$distance
log10price <- log10(price)
model1 <- glm(log10price ~ elevation + date + flood + distance, data = project.data)
summary(model1)

R output ([...] marks numeric values that are not preserved in the transcript):
Call:
glm(formula = log10price ~ elevation + date + flood + distance, data = project.data)

Deviance Residuals:
Min  1Q  Median  3Q  Max
[...]

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   [...] e-13 ***
elevation     [...] ***
date          [...] e-07 ***
flood         [...] ***
distance      [...] **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for gaussian family taken to be [...])
Null deviance: [...] on 30 degrees of freedom
Residual deviance: [...] on 26 degrees of freedom
AIC: [...]
Number of Fisher Scoring iterations: 2
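Because glm is called with its default gaussian family, the fit is an ordinary least-squares fit, and lm returns the same coefficients. The sketch below shows this on a simulated stand-in for projdata.csv: the column names follow the variable table, but every value and coefficient is made up for illustration.

```r
# Simulated stand-in for projdata.csv (all values are hypothetical).
set.seed(6601)
n <- 31
sim <- data.frame(
  elevation = runif(n, 0, 20),
  date      = sample(1:100, n, replace = TRUE),
  flood     = sample(1:2, n, replace = TRUE),
  distance  = runif(n, 0, 15)
)
sim$log10price <- 1 + 0.02 * sim$elevation - 0.005 * sim$date +
  0.3 * sim$flood + 0.03 * sim$distance + rnorm(n, sd = 0.1)

fit_glm <- glm(log10price ~ elevation + date + flood + distance, data = sim)
fit_lm  <- lm(log10price ~ elevation + date + flood + distance, data = sim)

all.equal(coef(fit_glm), coef(fit_lm))              # same estimates
c(null_df = fit_glm$df.null, resid_df = fit_glm$df.residual)
```

With 31 observations and five coefficients, the degrees of freedom come out as 30 and 26, the same as in the output on this slide. An lm object also works directly with the diagnostic helpers (hatvalues, rstudent, cooks.distance).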

17 Regression Diagnostics
Example: Multiple regression (measurement of influence using R / S-plus)
R / S-plus script:
# Measurement of influence
y <- matrix(log10price, 31, 1, byrow=TRUE)
x <- matrix(c(elevation, date, flood, distance), 31, 4, byrow=FALSE)
lesi <- ls.diag(lsfit(x,y))   # Regression diagnostics
lesi$stud.res                 # Extraction of Studentized residuals
plot(lesi$stud.res, ylab="Studentized residuals", xlab="obs #")
lesi$cooks                    # Extraction of Cook's distances
[Output: 31 Cook's distance values; only the exponents survive in the transcript]
[Figure: residual plot]
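Once the Studentized residuals and Cook's distances are extracted, observations can be flagged against cutoffs. The sketch below uses simulated data with one planted outlier, and the cutoffs (|Studentized residual| > 2, Cook's D > 4/n) are common rules of thumb assumed here, not values taken from the slides.

```r
# Simulated data with a planted outlier at observation 7 (illustration only).
set.seed(42)
n <- 31
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
y[7] <- y[7] + 10

diag_out <- ls.diag(lsfit(x, y))   # same helper as in the slide script

# Assumed rule-of-thumb cutoffs
flag_stud <- which(abs(diag_out$stud.res) > 2)
flag_cook <- which(diag_out$cooks > 4 / n)

list(studentized = flag_stud, cooks = flag_cook)
```

Flagged observations are candidates for inspection; the cutoffs only rank them and do not decide whether a point is wrong.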

18 Example: Multiple regression (SAS)
SAS script:
data land(drop=county sewer);
  infile "c:\stat 6401\projdata.csv" delimiter=',' firstobs=2;
  input price county size elevation sewer date flood distance;
  log10price=log10(price);
run;
proc reg data=land;
  model log10price=elevation size date flood /r ;
  plot rstudent.*log10price='+';
  output out=pred pred=phat;
  title 'linear regression for housing prices';
run;

Output ([...] marks numeric values that are not preserved in the transcript):
The REG Procedure
Model: MODEL1
Dependent Variable: log10price

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             [...]                                          <.0001
Error             [...]
Corrected Total   [...]

Root MSE         [...]   R-Square   [...]
Dependent Mean   [...]   Adj R-Sq   [...]
Coeff Var        [...]

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   [...]                                                <.0001
size        [...]
elevation   [...]
flood       [...]
date        [...]                                                <.0001

19 Example: Multiple regression (SAS)
Output Statistics
Columns: Obs, Dep Var log10price, Predicted Value, Std Error Mean Predict, Residual, Std Error Residual, Student Residual, Cook's D
[Table: per-observation values and the character-based Student residual chart; the numeric values are not preserved in the transcript]

20 Example: Multiple regression (SAS)
[Figures: Residual plot; Studentized Residual plot]

21 Further studies for regression analysis
Analysis of models
  Multicollinearity
  Heteroscedasticity
  Autocorrelation
Validation of models
Website of R Function for modern regression

22 Regression Diagnostics
The End

