BIO503: Lecture 4 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics

BIO503: Lecture 4 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics jmar@hsph.harvard.edu

Contributed R Packages Because R is an open source language, anyone can make their own changes. As a result, the software is continually improved and rigorously tested by a large and knowledgeable community. Specifically, users have contributed R packages. These packages usually contain software developed for a specialized method or task. Some packages also contain data sets. E.g. mixture models, Bayesian statistics, non-parametric approaches, time series algorithms, genetic analyses. Etc. There are currently 1628 packages!1628 packages

Downloading & Installing R Packages Contributed R packages can be downloaded within the R GUI. Select the Packages toolbar, then Install package(s)… Alternatively (on Windows), you can download the.zip file for the package. Select the Packages toolbar, then Install package(s) from local zip files… Once installed, use the library function to load the package into your R session.

Example: Installing ISwR Package Please install the ISwR package on your computer. This package contains all the data sets used in Peter Dalgaard's book Introductory Statistics with R. To load the package into your current R session: > library(ISwR) To find out more information, including what objects are contained in a package: > library(help=ISwR)

Linear Regression Models residual error regression coefficient dependent variable intercept independent variable Using the methods of least squares, we can derive the following estimators: Our goal is to test the hypothesis: We can do this with a T test: under the null hypothesis, this follows a T distribution with (n-1) df.

Example Dataset tlc We'll be using the dataset tlc that exists in ISwR package. To load this dataset: > data(tlc) # tlc = total lung capacity What kind of object is tlc ? > class(tlc) To learn about this dataset: > help(tlc) By using the attach command, we release the columns of the data.frame into the workspace. > attach(tlc) > age

Linear Regression with lm Is there a linear relationship between height and Total Lung Capacity (TLC)? > lmObject <- lm(tlc ~ height, data=tlc) What kind of object is lmObject ? > class(lmObject) The model object represents an object that encapsulates the results of a model fit. The desired quantities can be obtained using extractor functions. A basic extractor function is summary : > summary(lmObject)

Interpreting the Output from lm Call:... Stores the call to the lm function. Handy for keeping track of what model you fit. Residuals:... Coefficients:... Summary stats of the residuals from the model. Recall: residuals should follow approximately a standard Normal distribution if the model is a good fit. Estimates from the model. Standard error. T statistics. P-value. Residual standard error:... Residual variance Plug in estimates, to get RSE:

Interpreting the Output from lm Multiple R- squared:... Pearson correlation coefficient 2 of y, x (e.g. tlc, height). Do cor(tlc[,4], height))^2 to check. F statistic: This is the F test that the regression coefficient is zero. Notice: F = (T 2 ) for the regression coefficient. Adjust R- squared:...

Visualizing Our Fitted Model So what does it really mean, to have fit a linear model of TLC and height? Plot the data: > TLC <- tlc[,4] > plot(height, TLC) Add the regression line we fit with our model: > abline(lmObject)

Values Fitted by the Linear Model Retrieve the fitted values using the extractor function fitted: > fitted(lmObject) Convince yourself that these are the points that fall on the line we just made in the previous plot. > plot(height, TLC) > abline(lmObject) > points(height, fitted(lmObject), pch=20, col="red") To grab the residual values: > resid(lmObject)

Eyeballing Good Fit We can use the information from the residuals and fitted values to create plots to see how good our model fit is. > plot(height, TLC) > abline(lmObject) Use the segments function to draw vertical lines from the fitted values to the real data values. > segments(height, fitted(lmObject), height, TLC) We can also take a look at the residuals. > plot(height, resid(lmObject)) And use a QQ plot to see if the residuals are normally distributed. > qqplot(resid(lmObject))

Confidence Interval Bands Confidence bands are added to regression lines to reflect the uncertainty about a true parameter that cannot be observed i.e. the true regression coefficient . The more narrow the confidence band is, suggests a well- determined line. Using the predict function without any other input arguments yields the fitted values predicted by the model. > predict(lmObject) To compute the confidence interval bands for the fitted values, you need to specify the interval argument: > predict(lmObject, interval="confidence")

Visualizing Confidence Interval Bands Go ahead and plot these values: > pp <- predict(lmObject, interval="confidence") > plot(height, pp[,1], type="l", ylim=range(pp, height)) > lines(height, pp[,2], col="red", lwd=3) > lines(height, pp[,3], col="blue", lwd=3) What's the problem?

Visualizing Confidence Interval Bands

Predicting on New Data Our problem will be solved if height was ordered sequentially. One solution: predict fitted values on new (ordered) height values. Create a sequence of numbers that go from min(height) to max(height) approximately. > range(height) > new <- data.frame(height = seq(from=120, to=200, by=2)) Compute new predictions: > pp.new <- predict(lmObject, new, interval="confidence")

Plotting Confidence Interval Bands Now we can make the plot. First the fitted values: > plot(new$height, pp.new[,1], type="l") Then the upper interval band: > lines(new$height, pp.new[,2], col="red", lwd=2) Finally the lower interval band: > lines(new$height, pp.new[,3], col="blue", lwd=2)

Prediction Interval Bands Prediction interval bands reflect the uncertainty about future observations. Prediction bands are generally wider than the confidence interval bands. > predict(lmObject, interval="prediction") Plot the fitted data with the prediction interval bands and confidence interval bands superimposed. > pp.pred <- predict(lmObject, new, interval="prediction") > pp.conf <- predict(lmObject, new, interval="confidence")

Visualizing Fitted Values, Prediction and Confidence Bands Note: instead of using lines individually, we can also use the function matlines which plots columns of a matrix. > help(matlines) > plot(new$height, pp.pred[,1], ylim=range(pp.pred, pp.conf)) Add the prediction bands: > matlines(new$height, pp.pred[,-1], type=c("l", "l"), lwd=c(3,3), col=c("red", "blue"), lty=c(1,1)) Add the confidence bands: > matlines(new$height, pp.conf[,-1], type=c("l", "l"), lwd=c(3,3), col=c("red", "blue"), lty=c(2,2))

Tutorial 3

ANOVA Models The (one-way) ANOVA model generalizes linear regression. –One factor that has G levels. –For each level we have J replicates. j = 1 j = J J replicates i = 1i = Gi = 2… … G levels ANOVA Model:

Analysis of Variance Models j = 1 j = J J replicates i = 1i = Gi = 2… … G levels Total variation: Variation between groups: Variation within groups: ANOVA is all about splitting variation up.

ANOVA Models Question: is there a significant relationship between our factor A and the response variable Y? If yes, then ideally the variation within groups should be small. The variation between groups should be large. Variation between groups: Variation within groups: Our test statistic: General idea: Under the null hypothesis that the factor A has no effect: F = 1. Large values of F indicate the factor is significant. B = # levels W = n - B

ANOVA Example Let's use a different dataset: > library(MASS) > data(ChickWeight) > attach(ChickWeight) The factor Diet has 4 levels. > levels(Data) > anova(lm(weight ~ Diet, data=ChickWeight)) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) Diet 3 155863 51954 10.81 6.433e-07 Residuals 574 2758693 4806

Two-way ANOVA We can fit a two-way ANOVA: > anova(lm(weight ~ Diet + Chick, data=ChickWeight)) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) Diet 3 155863 51954 11.5045 2.571e-07 Chick 46 374243 8136 1.8015 0.001359 Residuals 528 2384450 4516 The interpretation of the model output is sequential, from the bottom to the top. This line tests the model: weight ~ Diet + Chick This line tests the model: weight ~ Diet vs weight ~ Diet + Chick.

Specifying Models In R we use model formula to specify the model we want to fit to our data. y ~ x Simple Linear Regression y ~ x – 1 Simple Linear Regression without the intercept (line goes through origin) y ~ x1 + x2 + x3 Multiple Regression y ~ x + I(x^2) Quadratic Regression log(y) ~ x1 + x2 Multiple Regression of Transformed Variable For factors A, B: y ~ A 1-way ANOVA y ~ A + B 2-way ANOVA y ~ A*B 2-way ANOVA + interaction term

BIO503: Lecture 4 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics

Similar presentations

Presentation on theme: "BIO503: Lecture 4 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

BIO503: Lecture 4 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics

Similar presentations

Presentation on theme: "BIO503: Lecture 4 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics"— Presentation transcript:

Similar presentations

About project

Feedback