BIO503: Lecture 4 Statistical models in R --- Recap --- Stefan Bentink

Presentation on theme: "BIO503: Lecture 4 Statistical models in R --- Recap --- Stefan Bentink"— Presentation transcript:

BIO503: Lecture 4 Statistical models in R --- Recap --- Stefan Bentink bentink@jimmy.harvard.edu

Linear Regression Models residual error regression coefficient dependent variable intercept independent variable Using the methods of least squares, we can derive the following estimators: Our goal is to test the hypothesis: We can do this with a T test: under the null hypothesis, this follows a T distribution with (n-1) df.

my.model <- lm(y ~ x)

Some important functions my.model <- lm(x~y) summary(my.model) anova(my.model) predict(my.model,new.data)

Specifying Models In R we use model formula to specify the model we want to fit to our data. y ~ x Simple Linear Regression y ~ x – 1 Simple Linear Regression without the intercept (line goes through origin) y ~ x1 + x2 + x3 Multiple Regression y ~ x + I(x^2) Quadratic Regression log(y) ~ x1 + x2 Multiple Regression of Transformed Variable For factors A, B: y ~ A 1-way ANOVA y ~ A + B 2-way ANOVA y ~ A*B 2-way ANOVA + interaction term

ANOVA Example Let's use a different dataset: > library(MASS) > data(ChickWeight) > attach(ChickWeight) The factor Diet has 4 levels. > levels(Diet) > anova(lm(weight ~ Diet, data=ChickWeight)) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) Diet 3 155863 51954 10.81 6.433e-07 Residuals 574 2758693 4806

Two-way ANOVA We can fit a two-way ANOVA: > anova(lm(weight ~ Diet + Chick, data=ChickWeight)) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) Diet 3 155863 51954 11.5045 2.571e-07 Chick 46 374243 8136 1.8015 0.001359 Residuals 528 2384450 4516 The interpretation of the model output is sequential, from the bottom to the top. This line tests the model: weight ~ Diet + Chick This line tests the model: weight ~ Diet vs weight ~ Diet + Chick.

Generalized Linear Models Linear regression models hinge on the assumption that the response variable follows a Normal distribution. Generalized linear models are able to handle non-Normal response variables and transformations to linearity.

Logistic Regression When faced with a binary response Y = (0,1), we use logistic regression. where Logit

Fit the Logistic Regression Model > anes.logit <- glm(move ~ conc, family=binomial(link=logit), data=anesthetic) The output summary looks like this: > summary(anes.logit) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 6.469 2.418 2.675 0.00748 ** conc -5.567 2.044 -2.724 0.00645 ** Estimates of P(Y=1) are given by: > fitted.values(anes.logit)

Update models and model selection Some handy functions to know about: new.model <- update(old.model, new.formula) Model Selection functions available in the MASS package drop1, dropterm add1, addterm step stepAIC

SURVIVAL ANALYSIS

Problem 5 – Survival Analysis 1.Read in the data file aml.txt. This data stores the survival data on patients with Acute Myelogenous Leukemia. 2.Compute the Kaplan-Meier estimate for all patients in this data. Compute the corresponding Kaplan-Meier plot. Construct Kaplan-Meier plots grouped by chemotherapy status. 3.Using a log-rank test, test if the two survival curves (patients on maintenance chemotherapy, patients who are not) are identical. 4.Fit a Cox proportional hazards model to the data set. 5.Plot these survival functions for patients from the different groups.

Survival Analysis library(survival) Example: aml leukemia data Kaplan-Meier curve fit1 <- survfit(Surv(aml\$time,aml\$status)~1) summary(fit1) plot(fit1) Log-rank test survdiff(Surv(time, status)~x, data=aml)

Survival analysis > cp <- coxph(Surv(aml\$time, + aml\$status)~x,data=aml) > > summary(cp) > > plot(survfit(Surv(aml\$time,aml\$status)~x, + data=aml),col=c("red","green"),lwd=2)

Download ppt "BIO503: Lecture 4 Statistical models in R --- Recap --- Stefan Bentink"

Similar presentations