Presentation on theme: "BIO503: Lecture 4 Statistical models in R --- Recap --- Stefan Bentink"— Presentation transcript:
BIO503: Lecture 4 Statistical models in R --- Recap --- Stefan Bentink
Linear Regression Models residual error regression coefficient dependent variable intercept independent variable Using the methods of least squares, we can derive the following estimators: Our goal is to test the hypothesis: We can do this with a T test: under the null hypothesis, this follows a T distribution with (n-1) df.
my.model <- lm(y ~ x)
Some important functions my.model <- lm(x~y) summary(my.model) anova(my.model) predict(my.model,new.data)
Specifying Models In R we use model formula to specify the model we want to fit to our data. y ~ x Simple Linear Regression y ~ x – 1 Simple Linear Regression without the intercept (line goes through origin) y ~ x1 + x2 + x3 Multiple Regression y ~ x + I(x^2) Quadratic Regression log(y) ~ x1 + x2 Multiple Regression of Transformed Variable For factors A, B: y ~ A 1-way ANOVA y ~ A + B 2-way ANOVA y ~ A*B 2-way ANOVA + interaction term
ANOVA Example Let's use a different dataset: > library(MASS) > data(ChickWeight) > attach(ChickWeight) The factor Diet has 4 levels. > levels(Diet) > anova(lm(weight ~ Diet, data=ChickWeight)) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) Diet e-07 Residuals
Two-way ANOVA We can fit a two-way ANOVA: > anova(lm(weight ~ Diet + Chick, data=ChickWeight)) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) Diet e-07 Chick Residuals The interpretation of the model output is sequential, from the bottom to the top. This line tests the model: weight ~ Diet + Chick This line tests the model: weight ~ Diet vs weight ~ Diet + Chick.
Generalized Linear Models Linear regression models hinge on the assumption that the response variable follows a Normal distribution. Generalized linear models are able to handle non-Normal response variables and transformations to linearity.
Logistic Regression When faced with a binary response Y = (0,1), we use logistic regression. where Logit
Fit the Logistic Regression Model > anes.logit <- glm(move ~ conc, family=binomial(link=logit), data=anesthetic) The output summary looks like this: > summary(anes.logit) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) ** conc ** Estimates of P(Y=1) are given by: > fitted.values(anes.logit)
Update models and model selection Some handy functions to know about: new.model <- update(old.model, new.formula) Model Selection functions available in the MASS package drop1, dropterm add1, addterm step stepAIC
Problem 5 – Survival Analysis 1.Read in the data file aml.txt. This data stores the survival data on patients with Acute Myelogenous Leukemia. 2.Compute the Kaplan-Meier estimate for all patients in this data. Compute the corresponding Kaplan-Meier plot. Construct Kaplan-Meier plots grouped by chemotherapy status. 3.Using a log-rank test, test if the two survival curves (patients on maintenance chemotherapy, patients who are not) are identical. 4.Fit a Cox proportional hazards model to the data set. 5.Plot these survival functions for patients from the different groups.
Survival Analysis library(survival) Example: aml leukemia data Kaplan-Meier curve fit1 <- survfit(Surv(aml$time,aml$status)~1) summary(fit1) plot(fit1) Log-rank test survdiff(Surv(time, status)~x, data=aml)