Lecture 17: Regression Diagnostics II

Residuals

Regression Diagnostics
We’ve examined how to check the proportional hazards assumption:
- Graphical approaches
- Regression approaches
- Time-varying covariates
- Schoenfeld residual test
Once we have verified that this assumption is met, we also want to examine:
- Model goodness of fit
- Functional form of covariates
- Outliers or influential points

Using Residuals for Diagnostics
Several types:
- Schoenfeld residuals
- Cox-Snell residuals
- Martingale residuals
- Deviance residuals (similar to Martingale)
What are residuals?
- Linear regression: observed minus fitted response, $e_i = y_i - \hat{y}_i$
- What is the interpretation of a residual in Cox regression? Not necessarily the same

Cox-Snell “residuals”
$$r_j = \hat{H}_0(T_j)\,\exp\left(\sum_k Z_{jk} b_k\right)$$
Where
- $\hat{H}_0(T_j)$ = baseline cumulative hazard at time $T_j$
- $Z_{jk}$ is the kth covariate value for the jth person
- $b_k$ is the coefficient estimate for the kth covariate
Interpretation
- The expected number of events for each observation
- Think of as expected counts (not residuals)
Theory
- If the model fits, the $r_j$'s should look like a censored sample from a unit exponential distribution (i.e. $\lambda = 1$)
- That is, deviations from expected should be small

Why? Some Theory Behind This…
Assume X has survival distribution $S_X(x) = P(X > x)$
Recall the definitions:
$$H_X(x) = -\log S_X(x), \qquad S_X(x) = \exp\{-H_X(x)\}$$

Why? Some Theory Behind This…
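If we define $Y = H_X(X)$, the cumulative hazard of X evaluated at X itself, then for continuous X
$$P(Y > y) = P\{H_X(X) > y\} = P\{X > H_X^{-1}(y)\} = S_X\{H_X^{-1}(y)\} = \exp[-H_X\{H_X^{-1}(y)\}] = e^{-y}$$
Thus Y ~ exp(1) regardless of the distribution of X
And so $H_X(X)$ should look ~ exp(1)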
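A quick simulation can make this result concrete. The sketch below is illustrative only: the Weibull shape and scale, the sample size, and the seed are arbitrary choices, not values from the lecture. It draws X from a Weibull distribution, applies the known Weibull cumulative hazard, and checks that the transformed values behave like a unit exponential.

```r
# Simulation sketch: if X has cumulative hazard H, then Y = H(X) ~ Exp(1).
set.seed(17)                       # arbitrary seed, for reproducibility only
shape <- 2; scale <- 3             # hypothetical Weibull parameters
x <- rweibull(10000, shape = shape, scale = scale)

# Weibull cumulative hazard: H(x) = (x / scale)^shape
y <- (x / scale)^shape

# Mean and variance of an Exp(1) variable are both 1
print(c(mean = mean(y), var = var(y)))

# Empirical survival function of Y against the Exp(1) survival exp(-y)
grid <- seq(0, 4, by = 0.5)
emp  <- sapply(grid, function(g) mean(y > g))
print(round(cbind(y = grid, empirical = emp, exp1 = exp(-grid)), 3))
```

Changing the shape or scale should leave the comparison essentially unchanged, which is the point of the theory: the result does not depend on the distribution of X.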

Cox-Snell Residuals
How do we use residuals in linear regression? To assess:
- Model fit
- Model assumptions
- Shape of covariates
What should we compare these Cox-Snell residuals to?

Empirical vs. Fitted
- rj = Cox-Snell residuals
- CS residuals are always >0
- Hcs(t) = Nelson-Aalen H(t) = empirical cumulative hazard
- Estimate by fitting a Cox model with the residuals as the time variable and δj as the event indicator
- If the model fits, i.e. assumptions are met and covariates are appropriately modeled, then: plot of rj’s vs. Hcs(t) should be…

Implementation
Get Cox-Snell residuals:
- Get linear predictor(s) Zb
- Get baseline cumulative hazard
- Multiply
- OR… get them from R
Get cumulative hazard estimates:
- Estimate SNA(t) using KM approach with event time = rj and event indicator = δj
- Transform to Hcs(t) scale, Hcs(t) = -log SNA(t)
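The two routes above (by hand vs. "get them from R") can be sketched side by side. This is a minimal illustration, not the lecture's code: it assumes the `lung` data shipped with the `survival` package and a hypothetical model with `age` and `sex`, and it uses Breslow ties so the hand calculation matches the residuals R returns.

```r
library(survival)

# Hypothetical example data and model (not the lecture's BMT data)
dat <- na.omit(lung[, c("time", "status", "age", "sex")])
fit <- coxph(Surv(time, status) ~ age + sex, data = dat, ties = "breslow")

# Step 1: linear predictor Zb (uncentered covariates)
lp <- drop(model.matrix(fit) %*% coef(fit))

# Step 2: baseline cumulative hazard H0(t), i.e. at Z = 0
bh <- basehaz(fit, centered = FALSE)
H0 <- stepfun(bh$time, c(0, bh$hazard))   # right-continuous step function

# Step 3: multiply to get r_j = H0(T_j) * exp(Z_j b)
cs.manual <- H0(dat$time) * exp(lp)

# Shortcut: r_j = delta_j - M_j, with M_j the martingale residual
delta    <- fit$y[, "status"]
cs.short <- delta - resid(fit, type = "martingale")

# The two versions should agree up to numerical error
print(all.equal(unname(cs.manual), unname(cs.short), tolerance = 1e-6))
```

The shortcut is what the lecture code uses (`event - reg$residuals`); the hand calculation is mainly useful for seeing where the residual comes from.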

Getting Residuals
# Fit regression
reg<-coxph(st~dx+fab+ttrans+mtx+dnage+ptage+dnage*ptage, method="breslow")
# Get Cox-Snell residuals (event indicator minus martingale residual)
par(mfrow=c(1, 2))
cs.res<-event-reg$residuals
# Plot of residuals vs. cumulative hazard
fitres<-survfit(coxph(Surv(cs.res, bmt$Either)~1, method="breslow"), type="aalen")
plot(fitres$time, -log(fitres$surv), type="s", xlab="Cox-Snell Residuals",
     ylab="Estimated Cumulative Hazard Function", lwd=2)
abline(0, 1, col=2, lwd=2)
## Alternatively, compute the Nelson-Aalen estimate directly
reg2<-survfit(Surv(cs.res, bmt$Either)~1)
Htilde<-cumsum(reg2$n.event/reg2$n.risk)
plot(reg2$time, Htilde, type="s", xlab="Cox-Snell Residuals",
     ylab="Estimated Cumulative Hazard Function", lwd=2, col=4)

Diagnostic Plots

What about MTX?
- Recall MTX did not meet the proportional hazards assumption
- Let’s look at each MTX group to see how this affects fit
- Also look at MTX stratified model

R Code: MTX Groups
# Split Cox-Snell residuals and event indicators by MTX group
cs.res0<-cs.res[which(bmt$MTX==0)]; event0<-event[which(bmt$MTX==0)]
cs.res1<-cs.res[which(bmt$MTX==1)]; event1<-event[which(bmt$MTX==1)]
fitres0<-survfit(coxph(Surv(cs.res0,event0)~1,method="breslow"),type="aalen")
fitres1<-survfit(coxph(Surv(cs.res1,event1)~1,method="breslow"),type="aalen")
plot(fitres0$time, -log(fitres0$surv), type="s", xlab="Cox-Snell Residuals",
     ylab="Estimated Cumulative Hazard Function", col=3, lwd=2, lty=2,
     xlim=c(0, 3), ylim=c(0, 3))
lines(fitres1$time, -log(fitres1$surv), type="s", col=4, lwd=2, lty=2)
abline(0, 1, col=2, lwd=2)
legend(2, .5, c("MTX=0","MTX=1"), col=3:4, lty=2, lwd=2, bty="n")

R: MTX Groups

R Code: MTX Stratified
st<-Surv(dfs, event)
# Stratify on MTX instead of including it as a covariate
reg.strat<-coxph(st~dx+fab+ttrans+dnage+ptage+dnage*ptage+strata(mtx), method="breslow")
cs.strat<-event-reg.strat$resid
cs.strat0<-cs.strat[which(bmt$MTX==0)]
cs.strat1<-cs.strat[which(bmt$MTX==1)]
fitres.strat0<-survfit(coxph(Surv(cs.strat0,event0)~1,method="breslow"),type="aalen")
fitres.strat1<-survfit(coxph(Surv(cs.strat1,event1)~1,method="breslow"),type="aalen")
plot(fitres.strat0$time, -log(fitres.strat0$surv), type="s", xlab="Cox-Snell Residuals",
     ylab="Estimated Cumulative Hazard Function", lwd=2, xlim=c(0, 3), ylim=c(0, 3))
lines(fitres.strat1$time, -log(fitres.strat1$surv), type="s", col=5, lwd=2)
abline(0, 1, col=2, lwd=2)
legend(2, .5, c("MTX=0","MTX=1"), col=c(1,5), lwd=2, bty="n")

MTX Stratified Model

Alternative Plots for CS Residuals
There are alternative plots you can consider, but they tend to require larger N
We can plot:
- CS residuals vs. Exp(1)
- CS residuals vs. CS-NA cumulative hazard estimate
Let’s consider a larger dataset…

Large Data Set
- Study examining factors that impact the time until first-time mothers weaned their infants
- Data includes information on 927 mothers
- Variables in the data include:
  - Race (white, black, other)
  - Mother in poverty
  - Smoking at childbirth
  - Alcohol use at childbirth
  - Age of the mother
  - Years of school
  - Prenatal care after the 3rd month

Model
> data(bfeed)
> mod<-coxph(Surv(duration, delta)~factor(race)+poverty+smoke+alcohol+agemth+yschool+pc3mth, data=bfeed)
> mod
Call:
coxph(formula = Surv(duration, delta) ~ factor(race) + poverty + smoke + yschool + pc3mth, data = bfeed)
               coef exp(coef) se(coef) z p
factor(race)
factor(race)
poverty
smoke
alcohol
yschool
Likelihood ratio test=29.7 on 6 df, p=4.51e-05
n= 927, number of events= 892

CS As Step Function vs. 45° Line
### Obtaining CS residuals
cs.resid<-bfeed$delta-mod$resid
### Fitting Cox model for residuals
### Compare to 45° line
fitres<-survfit(coxph(Surv(cs.resid,bfeed$delta)~1, method="breslow"), type="aalen")
plot(fitres$time, -log(fitres$surv), type="s", xlab="Cox-Snell Residuals",
     ylab="Estimated Cumulative Hazard Function", lwd=2, ylim=c(0, 8))
abline(0, 1, col=2, lwd=2)

Comparison to Exp(1)
### Obtaining CS residuals
cs.resid<-bfeed$delta-mod$resid
### Also compare to exponential
efit<-survfit(Surv(duration, delta)~1, data=bfeed)
exp1<-rexp(10000, 1)   # random draws from a unit exponential
plot(density(exp1), lwd=2, col=1)
lines(density(cs.resid), col=2, lwd=2, lty=2)

Comparing CS and NA
### Comparing NA to CS
NAe <- -log(efit$surv)
cs<-cbind(cs.resid, bfeed$duration)
na<-cbind(NAe, efit$time)
# Merge on the time column (column 2 of each matrix)
all<-merge(cs, na, by=2, all=TRUE)
cs<-all$cs.resid[-927]
na<-all$NAe[-927]
plot(cs, na, pch=16, xlab="Cox-Snell", ylab="Cox-Snell - Nelson-Aalen")
abline(0, 1, col=2, lwd=2)
fit1<-lm(na~cs)
abline(fit1, lwd=2, col=3)

Problem with Cox-Snell Approach
- CS residuals can diagnose that the model does not fit
- But they don’t help figure out why or where
- Note, the overall pattern can be helpful (e.g. CS > NA or vice versa)
- Martingale residuals are better

Martingale Residuals
Mj = δj − rj, where δj is the event indicator and rj is the Cox-Snell residual
- When model is correct, E(Mj) = 0
- Range between -∞ and 1
- Difference over time between observed and expected number of events
- Mj tends to be negative if estimated cumulative hazard is too large
- Mj tends to be positive if estimated cumulative hazard is too small

Martingale Residuals
- Average martingale residual can be computed for different values of a covariate
- Or for a range of covariate values
- Determines if the Mj's tend to be positive or negative in the range
- Helps to find improper specification of the effect of a covariate on the hazard

Use of Martingale Residuals
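To examine the best functional form of a given covariate Z1
Approach:
- Assume optimal model is: $h(t \mid Z^*, Z_1) = h_0(t)\,\exp(\beta^{*\prime} Z^*)\,\exp\{f(Z_1)\}$, where Z* holds the other covariates and f is the unknown functional form
- Fit model with only Z*
- Save Martingale residuals from Z* model
- Plot Martingale residuals versus Z1
- Use smoother to help find best transformation
Only works on continuous or ordinal variables!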
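The approach above can be sketched in a few lines. This is an illustration under stated assumptions rather than the lecture's own code: it uses the `pbc` data from the `survival` package (first 312 rows, the randomized patients), treats death (`status == 2`) as the event, takes age and edema as Z*, and examines serum bilirubin as Z1.

```r
library(survival)

# Assumed setup: randomized PBC patients, death as the event
dat <- pbc[1:312, ]
dat$dead <- as.numeric(dat$status == 2)

# Fit the model with only Z* (here: age and edema), leaving out Z1 = bilirubin
fit0 <- coxph(Surv(time, dead) ~ age + factor(edema), data = dat)

# Save martingale residuals from the Z* model
mres <- resid(fit0, type = "martingale")

# Plot residuals against Z1 on two candidate scales, with a lowess smoother;
# a roughly linear smooth suggests that scale is a reasonable functional form
par(mfrow = c(1, 2))
plot(dat$bili, mres, pch = 16, xlab = "Bilirubin", ylab = "Martingale Residuals")
lines(lowess(dat$bili, mres), col = 2, lwd = 2)
plot(log(dat$bili), mres, pch = 16, xlab = "log(Bilirubin)", ylab = "Martingale Residuals")
lines(lowess(log(dat$bili), mres), col = 2, lwd = 2)
```

Comparing the two panels is the whole diagnostic: the scale on which the smoother looks closest to a straight line is the candidate transformation.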

Example
### Martingale residuals: NHL vs. HOD in BMT
library(survival)
library(KMsurv)   # provides the hodg data
data(hodg)
# fit1 leaves out waiting time; fit2 leaves out Karnofsky score
fit1<-coxph(Surv(time, delta)~factor(dtype)+factor(gtype)+score+factor(dtype)*factor(gtype), data=hodg)
res1<-resid(fit1, type="martingale")
fit2<-coxph(Surv(time, delta)~factor(dtype)+factor(gtype)+wtime+factor(dtype)*factor(gtype), data=hodg)
res2<-resid(fit2, type="martingale")
par(mfrow=c(1,2))
# Residuals from each model vs. the covariate that model omits
plot(hodg$wtime, res1, xlab="Waiting Time (months)", ylab="Martingale Residuals", pch=16)
lines(lowess(hodg$wtime, res1), col=2, lwd=2)
lines(hodg$wtime, fitted(lm(res1~hodg$wtime)), col=4, lwd=2)
plot(hodg$score, res2, xlab="Karnofsky Score", ylab="Martingale Residuals", pch=16)
lines(lowess(hodg$score, res2), col=2, lwd=2)
lines(hodg$score, fitted(lm(res2~hodg$score)), col=4, lwd=2)

Martingale Plots

Model with “Inappropriate” Waiting Time
> fit<-coxph(Surv(time, cens)~wait+factor(dis)+factor(graft)+karn+
+   factor(dis)*factor(graft), data=bmt2)
> summary(fit)
Call:
coxph(formula = Surv(time, cens) ~ wait + factor(dis) + factor(graft) + karn + factor(dis) * factor(graft), data = bmt2)
n= 43, number of events= 26
               coef exp(coef) se(coef) z Pr(>|z|)
wait
factor(dis)                                      **
factor(graft)
karn                                        e-05 ***
dis*graft                                        *

How Do We Find Best Cutpoint?
Want the cutoff that gives the largest difference between individuals in the two data-defined groups
- Clinically chosen value (i.e. what do clinicians find meaningful?)
- Choose based on data (often a good choice): Contal and O’Quigley
- Keep in mind this may bias the model towards inclusion of the covariate

Outcome-Oriented Choice
Contal and O’Quigley steps:
- Identify possible unique cut points
- Construct dichotomized predictor for all cut points
- Conduct log-rank test for each dichotomized version of the variable
- Choose cutoff based on largest log-rank statistic
Based on this procedure, waiting time of 84 months is the “best” cut point
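The scanning steps above can be sketched as a loop over candidate cut points. This is a simplified illustration, not the Contal and O’Quigley procedure itself: it uses the ordinary log-rank chi-square from `survdiff` instead of their standardized statistic, it omits their p-value adjustment, and it assumes the `lung` data from the `survival` package with `age` as the covariate. Restricting candidates to the middle 80% of the covariate is an arbitrary choice made here to avoid very small groups.

```r
library(survival)

# Hypothetical example: find a data-driven cut point for age in the lung data
dat <- na.omit(lung[, c("time", "status", "age")])

# Step 1: candidate cut points = unique covariate values (middle 80% only)
cuts <- sort(unique(dat$age))
cuts <- cuts[cuts > quantile(dat$age, 0.10) & cuts <= quantile(dat$age, 0.90)]

# Steps 2-3: dichotomize at each cut point and run a log-rank test
chisq <- sapply(cuts, function(cp) {
  grp <- as.numeric(dat$age >= cp)          # 1 if at or above the cut point
  survdiff(Surv(time, status) ~ grp, data = dat)$chisq
})

# Step 4: choose the cut point with the largest log-rank statistic
best <- cuts[which.max(chisq)]
print(best)
plot(cuts, chisq, type = "b", xlab = "Candidate cut point", ylab = "Log-rank chi-square")
```

Because the cut point is chosen to maximize the statistic, the unadjusted p-value at the selected cut point is too small; that is why an adjustment for the many comparisons is needed.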

Model with Dichotomized Waiting Time
> # Model with dichotomized waiting time
> iwait<-ifelse(84<=bmt2$wait, 1, 0)
> fit.d<-coxph(Surv(time, cens)~iwait+factor(dis)+factor(graft)+karn+
+   factor(dis)*factor(graft), data=bmt2)
> summary(fit.d)
Call:
coxph(formula = Surv(time, cens) ~ iwait + factor(dis) + factor(graft) + karn + factor(dis) * factor(graft), data = bmt2)
n= 43, number of events= 26
               coef exp(coef) se(coef) z Pr(>|z|)
iwait                                            *
factor(dis)                                      **
factor(graft)
karn                                        e-06 ***
dis:graft                                        *
NOTE: we cannot use the p-value for our waiting time indicator. We should adjust for multiple comparisons because we consider MANY cut points for waiting time (pg. 273 in text).
- Here the adjusted p-value is 0.679

What About Other Transformations
- Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver (1974 to 1984)
- 312 PBC patients randomized to placebo or D-penicillamine
- Clinical, biochemical, and histologic measures also collected
- Goal: develop a natural history model (ignoring treatment) to determine how baseline status impacts survival

PBC Survival

PBC Example Covariates of interest
- Age
- Albumin
- Prothrombin time (i.e. clotting time)
- Presence of edema
- Serum bilirubin (mg/dL)
Edema is a factor variable and is used “as is”
What about the other variables?

Where to Start
1. Fit a model with age and edema
2. Get the martingale residuals from this fit
3. Plot the martingale residuals vs. age, vs. albumin, vs. bilirubin, and vs. prothrombin time
4. Check possible transformations where necessary

Age and Albumin

Albumin

Bilirubin

The Problem Child… Clotting Time

The Problem Child… Clotting Time
- Log transformation is a good first guess but it doesn’t always work
- Deviations in the plot don’t necessarily lead us easily to the best functional form
- There are many we can try:
  - Z×log(Z)
  - exp{Z}
  - Power transformations (think Box-Cox)
- So let’s “explore” a little

Try Z×ln(Z) and exp{Z}

Power Transformations?

The Point?
- Sometimes it is difficult to find a good transformation
- Choose among the set of possibilities
- Is one transformation more interpretable?
- Does a particular transformation make clinical sense?
Add log(bilirubin) and log(albumin) to the model with age and edema to see if this helps

Model and Residuals
> ### Model including bilirubin and albumin
> fit<-coxph(Surv(time, status)~age + factor(edema)+log(bili)+log(albumin))
> summary(fit)
Call:
coxph(formula = Surv(time, status) ~ age + factor(edema) + log(bili) + log(albumin))
n= 393, number of events= 161
               coef exp(coef) se(coef) z Pr(>|z|)
age                                         e-06 ***
factor(edema)
factor(edema)                                    ***
log(bili)                                < 2e-16 ***
log(albumin)                                e-05 ***
> res1<-fit$resid

Looking Again at Clotting Time
### Model with JUST age and edema
fit<-coxph(Surv(time, status)~age + factor(edema))
res1<-resid(fit, type="martingale")   # residuals must come from this fit
plot(protime, res1, xlab="Clotting Time", ylab="Martingale Residuals", pch=16,
     main="Model w/ Age & Edema")
lines(lowess(protime, res1), col=2, lwd=2)
lines(protime, fitted(lm(res1~protime)), col=4, lwd=4)
### Model including bilirubin and albumin
fit<-coxph(Surv(time, status)~age + factor(edema)+log(bili)+log(albumin))
res2<-resid(fit, type="martingale")
plot(protime, res2, xlab="Clotting Time", ylab="Martingale Residuals", pch=16,
     main="Model with 4 covariates")
lines(lowess(protime, res2), col=2, lwd=2)
lines(protime, fitted(lm(res2~protime)), col=4, lwd=4)

Compare Residual Plots

Transformations in the Models?

What to Conclude? Transformations better but still not great
> pt1.5<- (protime)^(-1.5)
> fit<-coxph(Surv(time, status)~age + factor(edema)+log(bili)+log(albumin)+pt1.5)
> fit
Call:
coxph(formula = Surv(time, status) ~ age + factor(edema) + log(bili) + log(albumin) + pt1.5)
               coef exp(coef) se(coef) z p
age
factor(edema)
factor(edema)
log(bili)                                < 2e-16
log(albumin)                                e-05
pt1.5
Likelihood ratio test=229 on 6 df, p=0
n= 418, number of events= 186

Concerns with Martingale Residuals?
- One problem with Martingale residuals… they tend to be asymmetric
- Range from -∞ to 1
- These are therefore best used to assess covariate form, NOT general goodness of fit
- Also note, there is susceptibility to overfitting when playing around with functional form

Outliers
- Defined in survival as an unusual observed failure time given the covariate value Zj
- Martingale residuals do measure the degree to which the jth subject is an outlier
- BUT as we mentioned, the distribution is heavily skewed
- Makes it hard to identify outliers

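Deviance Residuals
$$D_j = \operatorname{sign}(M_j)\,\sqrt{-2\,[\,M_j + \delta_j \log(\delta_j - M_j)\,]}$$
- Deviance residuals are a transformation of Martingale residuals
- Better behaved than Martingale residuals
- More like ~ N(0,1)
- Helpful for determining outliers
- Negative for survival times that are longer than expected (Mj < 0, i.e. fewer events observed than the model expects); positive for subjects who fail sooner than expected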
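To see the transformation at work, the sketch below computes deviance residuals directly from martingale residuals and compares them with what `resid(..., type="deviance")` returns. The data set (`lung` from the `survival` package) and the covariates are assumptions chosen only for illustration.

```r
library(survival)

# Hypothetical example model
fit   <- coxph(Surv(time, status) ~ age + sex, data = lung)
m     <- resid(fit, type = "martingale")
delta <- fit$y[, "status"]                 # event indicator (0/1)

# D_j = sign(M_j) * sqrt(-2 * [M_j + delta_j * log(delta_j - M_j)]);
# for censored subjects (delta_j = 0) the log term is taken to be 0
dev.manual  <- sign(m) * sqrt(-2 * (m + ifelse(delta == 0, 0, delta * log(delta - m))))
dev.builtin <- resid(fit, type = "deviance")

print(all.equal(unname(dev.manual), unname(dev.builtin)))

# The sign is inherited from the martingale residual
print(table(sign(m) == sign(dev.builtin)))
```

Since the sign carries over from Mj, a negative deviance residual marks a subject whose accumulated expected events exceed the observed count, i.e. someone who survived longer than the model predicts.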

Deviance vs. Martingale Residuals
- Deviance residuals have shorter left and longer right tails
- Distribution more closely resembles ~N(0,1)
- Because deviance residuals are ~N, we can think of outliers as values outside the range (-3, 3)
- More conservative? (-2.5, 2.5)

Compare to ~ N(0, 1)
### DEVIANCE RESIDUALS ###
fit2<-coxph(Surv(time, delta) ~ factor(dtype) + factor(gtype) + score + wtime +
            factor(dtype)*factor(gtype), data=hodg)
# Comparing density of Martingale and deviance residuals to ~N(0, 1)
par(mfrow=c(1,2))
mart.res<-resid(fit2, type="martingale")
plot(density(mart.res), main="Martingale Residuals", lwd=2)
lines(seq(-3,3,0.1), dnorm(seq(-3,3,0.1)), col=2, lwd=2)
dev.res<-resid(fit2, type="deviance")
plot(density(dev.res), main="Deviance Residuals", lwd=2, ylim=c(0, 0.4))
# Overlay the N(0, 1) density on the second panel as well
lines(seq(-3,3,0.1), dnorm(seq(-3,3,0.1)), col=2, lwd=2)

Compare to ~ N(0, 1)

Martingale vs. Deviance Residuals
#### Compare deviance to martingale residuals
par(mfrow=c(2,2))
# Model without waiting time: plot residuals vs. waiting time
fit1<-coxph(Surv(time, delta)~factor(dtype)+factor(gtype)+score+factor(dtype)*factor(gtype), data=hodg)
mart.res<-resid(fit1, type="martingale")
plot(hodg$wtime, mart.res, xlab="Time to Transplant (months)", ylab="Martingale Residuals", pch=16)
lines(lowess(hodg$wtime, mart.res), col=2, lwd=2)
lines(hodg$wtime, fitted(lm(mart.res~hodg$wtime)), col=4, lwd=2)
dev.res<-resid(fit1, type="deviance")
plot(hodg$wtime, dev.res, xlab="Time to Transplant (months)", ylab="Deviance Residuals", pch=16)
lines(lowess(hodg$wtime, dev.res), col=2, lwd=2)
lines(hodg$wtime, fitted(lm(dev.res~hodg$wtime)), col=4, lwd=2)
# Model without Karnofsky score: plot residuals vs. score
fit2<-coxph(Surv(time, delta)~factor(dtype)+factor(gtype)+wtime+factor(dtype)*factor(gtype), data=hodg)
mart.res<-resid(fit2, type="martingale")
plot(hodg$score, mart.res, xlab="Karnofsky Score", ylab="Martingale Residuals", pch=16)
lines(lowess(hodg$score, mart.res), col=2, lwd=2)
lines(hodg$score, fitted(lm(mart.res~hodg$score)), col=4, lwd=2)
dev.res<-resid(fit2, type="deviance")
plot(hodg$score, dev.res, xlab="Karnofsky Score", ylab="Deviance Residuals", pch=16)
lines(lowess(hodg$score, dev.res), col=2, lwd=2)
lines(hodg$score, fitted(lm(dev.res~hodg$score)), col=4, lwd=2)

Back to Outliers
In order to use our deviance residuals to determine potential outliers:
- Plot Dj versus the risk score, $\sum_k b_k Z_{jk}$
- Again, flag anything outside of (-3, 3), or to be even more conservative, (-2.5, 2.5)

R Code
> fit<-coxph(Surv(time, status) ~ age + factor(edema) + log(bili) + log(albumin))
> fit
Call:
coxph(formula = Surv(time, status) ~ age + factor(edema) + log(bili) + log(albumin))
               coef exp(coef) se(coef) z p
age
factor(edema)
factor(edema)
log(bili)                                < 2e-16
log(albumin)                                e-05
Likelihood ratio test=190 on 5 df, p=0, n= 312, number of events= 144
> dev.res<-resid(fit, type="deviance")
> lp<-predict(fit, type="lp")
> plot(lp, dev.res, xlab="Risk Score", ylab="Deviance Residual", pch=16)
> abline(h=c(-2.5, 2.5), col="red", lwd=2)
> abline(h=c(-3, 3), col=4, lwd=2)

Outlier Plot

Investigating Outliers
> summary(dev.res)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> summary(cbind(time, status, age, log(albumin), log(bili), edema))
Entries of the printed summary that survive in the transcript:
            time    status     age    edema
Min.          41     0.000   26.28
1st Qu.     1093                      1.000
Median      1730     0.000   51.00    1.000
Mean        1918     0.445   50.74
3rd Qu.     2614                      1.000
Max.        4795     1.000   78.44    3.000

Investigating Outliers
> cbind(dev.res, cbind(time, status, age, log(albumin), log(bili), edema))[abs(dev.res) >= 2.5,]
   dev.res time status age log(alb) log(bili) edema

Caveat with Deviance Residuals
- As we’ve seen, deviance residuals can be helpful for identifying outliers
- However, given that we are assuming a normal approximation for our residuals, we need to think about sample size
- In data with a large number of censored observations (>25%), deviance residuals will tend to be too large

Influence
Consider only fixed-time covariates
- High leverage: an unusual observation with respect to the covariate vector Zi
- High influence: an observation for which the combination of the degree to which it is an outlier and its leverage gives it a strong influence on the estimates of β

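Delta-Betas
- Let $\hat\beta$ be the estimate of $\beta$ from all the data
- Let $\hat\beta_{(i)}$ be the estimate of $\beta$ from data with the ith subject removed
- Then the delta-beta is $\Delta\beta_i = \hat\beta - \hat\beta_{(i)}$
- This is a measure of the influence of the ith subject on the estimate of $\beta$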

Delta-Betas
- However, this is computationally intensive: fit the model n times
- There is an approximation that uses score residuals and the estimated variance-covariance matrix to calculate $\Delta\beta_i$
- Each subject has one delta-beta for each covariate in the model
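The trade-off between the exact and approximate delta-betas can be checked on a small example. The sketch below is an illustration under assumptions, not the lecture's analysis: it uses the `lung` data from the `survival` package with `age` and `sex`, refits the model n times for the exact values, and compares them with `resid(..., type="dfbeta")`. The sign convention (full-data estimate minus leave-one-out estimate) is assumed here; a negative correlation in the output would indicate the opposite convention.

```r
library(survival)

# Hypothetical example data and model
dat <- na.omit(lung[, c("time", "status", "age", "sex")])
fit <- coxph(Surv(time, status) ~ age + sex, data = dat)

# Approximate delta-betas: one row per subject, one column per coefficient
approx.db <- resid(fit, type = "dfbeta")

# Exact delta-betas: refit the model n times, dropping one subject each time
exact.db <- t(sapply(seq_len(nrow(dat)), function(i) {
  fit.i <- coxph(Surv(time, status) ~ age + sex, data = dat[-i, ])
  coef(fit) - coef(fit.i)
}))

# Agreement between the two versions, coefficient by coefficient
print(diag(cor(approx.db, exact.db)))

# Subjects with the largest approximate influence on the age coefficient
print(head(order(abs(approx.db[, 1]), decreasing = TRUE)))
```

With a few hundred subjects the n refits are still fast, but the cost grows with both n and model size, which is why the approximation is the usual choice in practice.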

Assessing Influence
> ### A look at delta-betas for influential points
> fit<-coxph(Surv(time, status)~age + factor(edema)+log(bili)+log(albumin))
> dfbeta<-residuals(fit, type="dfbeta")
> colnames(dfbeta)<-names(fit$coef)
> head(round(dfbeta, 5))
   age edema=0.5 edema=1 log(bili) log(albumin)

Influence Plots
> plot(pbc$id[-ids], dfbeta[,4], xlab="Patient ID", ylab="log(bilirubin) delta-beta", pch=16)
> pbc[dfbeta[,"log(bili)"] < -.029, c(1,2,3,5,10,11,13)]
   id time status age edema log(bili) log(albumin)

Assessment of Influence
- Subject 81 is older and has high serum bilirubin (2 sd on log scale)
- Bilirubin is an important predictor of high risk, but these subjects are in the upper 40th percentile of survival times
- We may want to do a sensitivity analysis with and without observations 81 and 362
- BUT unless we have a very good reason (e.g. a data entry error) to remove these observations, we should not delete them
