Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. Slides available.

Similar presentations


Presentation on theme: "1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. Slides available."— Presentation transcript:

1 1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. hb@maths.lth.se Slides available @ http://www.maths.lth.se/bioinformatics/calendar/20031113/

2 2 of 40 Why not R? Currently there are no good image analysis packages. Matlab is still better here. Currently S-Plus (commercial software similar to R) can handle larger data set....yeah, why not?

3 3 of 40 Why R? R was written by scientists to be used by scientists. R is a great language. Runs on Windows, Mac, Unix, Linux,... Easy to install. R is free. Hundreds(!) of packages available. Papers are often published with a R package. Great help and documentation. 24/7 online mail support. A great community – friendly and helpful people... Easy to call Fortran, C, Java,... libraries Everyone else is and will be using R (-Yes, I‘m willing to bet!)

4 4 of 40 Starting and quitting R On Windows click on the R icon or as on Unix type R: markov{hb}: R R : Copyright 2003, The R Development Core Team Version 1.7.1 (2003-06-16) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type `license()' or `licence()' for distribution details. R is a collaborative project with many contributors. Type `contributors()' for more information. Type `demo()' for some demos, `help()' for on-line help, or `help.start()' for a HTML browser interface to help. Type `q()' to quit R. > 1+2+3 [1] 6 > q() Save workspace image? [y/n/c]: n markov{hb}: markov{hb}: R R : Copyright 2003, The R Development Core Team Version 1.7.1 (2003-06-16) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type `license()' or `licence()' for distribution details. R is a collaborative project with many contributors. Type `contributors()' for more information. Type `demo()' for some demos, `help()' for on-line help, or `help.start()' for a HTML browser interface to help. Type `q()' to quit R. > 1+2+3 [1] 6 > q() Save workspace image? [y/n/c]: n markov{hb}:

5 5 of 40 Getting help > help.start() > help.start() Browse the help > help(lm) > ?lm Getting help on a specific function > help.search(“linear model”) > help.search(“linear model”) Searching for help > help.search(“linear model”) Help files with alias or title matching 'linear model' using fuzzy matching: anova.glm(base) Analysis of Deviance for Generalized Linear Model Fits anova.lm(base) ANOVA for Linear Model Fits glm(base) Fitting Generalized Linear Models family.glm(base) Accessing Generalized Linear Model Fits summary.glm(base) Summarizing Generalized Linear Model Fits lm(base) Fitting Linear Models family.lm(base) Accessing Linear Model Fits summary.lm(base) Summarizing Linear Model Fits lm.fit(base) Fitter Functions for Linear Models loglin(base) Fitting Log-Linear Models predict.lm(base) Predict method for Linear Model Fits cv.glm(boot) Cross-validation for Generalized Linear Models glm.diag.plots(boot) Diagnostics plots for generalized linear > help.search(“linear model”) Help files with alias or title matching 'linear model' using fuzzy matching: anova.glm(base) Analysis of Deviance for Generalized Linear Model Fits anova.lm(base) ANOVA for Linear Model Fits glm(base) Fitting Generalized Linear Models family.glm(base) Accessing Generalized Linear Model Fits summary.glm(base) Summarizing Generalized Linear Model Fits lm(base) Fitting Linear Models family.lm(base) Accessing Linear Model Fits summary.lm(base) Summarizing Linear Model Fits lm.fit(base) Fitter Functions for Linear Models loglin(base) Fitting Log-Linear Models predict.lm(base) Predict method for Linear Model Fits cv.glm(boot) Cross-validation for Generalized Linear Models glm.diag.plots(boot) Diagnostics plots for generalized linear

6 6 of 40 Running help examples > help(lm) > example(lm) lm> ctl <- c(4.17, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14) lm> trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69) lm> group <- gl(2, 10, 20, labels = c("Ctl", "Trt")) lm> weight <- c(ctl, trt) lm> anova(lm.D9 <- lm(weight ~ group)) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) group 1 0.6882 0.6882 1.4191 0.249 Residuals 18 8.7293 0.4850 lm> summary(lm.D90 <- lm(weight ~ group - 1)) Call: lm(formula = weight ~ group - 1) Residuals: Min 1Q Median 3Q Max -1.0710 -0.4938 0.0685 0.2462 1.3690 Coefficients: Estimate Std. Error t value Pr(>|t|) groupCtl 5.0320 0.2202 22.85 9.55e-15 *** groupTrt 4.6610 0.2202 21.16 3.62e-14 *** --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 0.6964 on 18 degrees of freedom Multiple R-Squared: 0.9818, Adjusted R-squared: 0.9798 F-statistic: 485.1 on 2 and 18 DF, p-value: < 2.2e-16 lm> summary(resid(lm.D9) - resid(lm.D90)) Min. 1st Qu. Median Mean 3rd Qu. Max. -6.661e-16 -6.245e-17 2.776e-17 1.544e-17 1.110e-16 3.331e-16 lm> opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0)) lm> plot(lm.D9, las = 1) lm> par(opar) > help(lm) > example(lm) lm> ctl <- c(4.17, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14) lm> trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69) lm> group <- gl(2, 10, 20, labels = c("Ctl", "Trt")) lm> weight <- c(ctl, trt) lm> anova(lm.D9 <- lm(weight ~ group)) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) group 1 0.6882 0.6882 1.4191 0.249 Residuals 18 8.7293 0.4850 lm> summary(lm.D90 <- lm(weight ~ group - 1)) Call: lm(formula = weight ~ group - 1) Residuals: Min 1Q Median 3Q Max -1.0710 -0.4938 0.0685 0.2462 1.3690 Coefficients: Estimate Std. Error t value Pr(>|t|) groupCtl 5.0320 0.2202 22.85 9.55e-15 *** groupTrt 4.6610 0.2202 21.16 3.62e-14 *** --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 0.6964 on 18 degrees of freedom Multiple R-Squared: 0.9818, Adjusted R-squared: 0.9798 F-statistic: 485.1 on 2 and 18 DF, p-value: < 2.2e-16 lm> summary(resid(lm.D9) - resid(lm.D90)) Min. 1st Qu. Median Mean 3rd Qu. Max. -6.661e-16 -6.245e-17 2.776e-17 1.544e-17 1.110e-16 3.331e-16 lm> opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0)) lm> plot(lm.D9, las = 1) lm> par(opar)

7 7 of 40 r-help mailing list Ask questions on the r-help mailing list. Get answers 24/7 (including Xmas day!) Sign up on http://www.r-project.org/http://www.r-project.org/ Things to think of before sending a question: –Make sure you read the help/docs first. –Try to add a code example what you are trying to do. –Tell what version of R you are using.

8 8 of 40 A first example > x <- c(6,5,4,3,2,1) > x [1] 6 5 4 3 2 1 > sum(x) [1] 21 > x[c(1,3,5)] [1] 6 4 2 > x <- 6:1 > x [1] 6 5 4 3 2 1 > x[1:3] [1] 6 5 4 > x[1:3]+x[6:4] [1] 7 7 7 > y <- c(0,10) > x+y [1] 6 15 4 13 2 11 > x <- c(6,5,4,3,2,1) > x [1] 6 5 4 3 2 1 > sum(x) [1] 21 > x[c(1,3,5)] [1] 6 4 2 > x <- 6:1 > x [1] 6 5 4 3 2 1 > x[1:3] [1] 6 5 4 > x[1:3]+x[6:4] [1] 7 7 7 > y <- c(0,10) > x+y [1] 6 15 4 13 2 11 > x <- matrix(1:18, nrow=3) > x [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1 4 7 10 13 16 [2,] 2 5 8 11 14 17 [3,] 3 6 9 12 15 18 > x[,3] [1] 7 8 9 > x[-1,4:6] [,1] [,2] [,3] [1,] 11 14 17 [2,] 12 15 18 > str(x) int [1:3, 1:6] 1 2 3 4 5 6 7 8 9 10... > x <- matrix(1:18, nrow=3) > x [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1 4 7 10 13 16 [2,] 2 5 8 11 14 17 [3,] 3 6 9 12 15 18 > x[,3] [1] 7 8 9 > x[-1,4:6] [,1] [,2] [,3] [1,] 11 14 17 [2,] 12 15 18 > str(x) int [1:3, 1:6] 1 2 3 4 5 6 7 8 9 10... Vectors:Matrices (default is to fill column by column) :

9 9 of 40 R is multiplying elementwise! Matrices: > X <- matrix(1:9, ncol=3, byrow=TRUE) > X [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 [3,] 7 8 9 > I <- diag(1, nrow=3) > I [,1] [,2] [,3] [1,] 1 0 0 [2,] 0 1 0 [3,] 0 0 1 > I * X [,1] [,2] [,3] [1,] 1 0 0 [2,] 0 5 0 [3,] 0 0 9 > I %*% X [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 [3,] 7 8 9 > X <- matrix(1:9, ncol=3, byrow=TRUE) > X [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 [3,] 7 8 9 > I <- diag(1, nrow=3) > I [,1] [,2] [,3] [1,] 1 0 0 [2,] 0 1 0 [3,] 0 0 1 > I * X [,1] [,2] [,3] [1,] 1 0 0 [2,] 0 5 0 [3,] 0 0 9 > I %*% X [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 [3,] 7 8 9 Why? Because it turns out that it is more common than vector and matrix multiplications.

10 10 of 40 R is looping over missed values - really useful indeed! > x <- matrix(1:16, nrow=3) Warning message: Replacement length not a multiple of the elements to replace in matrix(...) > x [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1 4 7 10 13 16 [2,] 2 5 8 11 14 1 [3,] 3 6 9 12 15 2 > y <- c(-1,1) # Multiply (“moving down the columns”) > x * y [,1] [,2] [,3] [,4] [,5] [,6] [1,] -1 4 -7 10 -13 16 [2,] 2 -5 8 -11 14 -1 [3,] -3 6 -9 12 -15 2 > x <- matrix(1:16, nrow=3) Warning message: Replacement length not a multiple of the elements to replace in matrix(...) > x [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1 4 7 10 13 16 [2,] 2 5 8 11 14 1 [3,] 3 6 9 12 15 2 > y <- c(-1,1) # Multiply (“moving down the columns”) > x * y [,1] [,2] [,3] [,4] [,5] [,6] [1,] -1 4 -7 10 -13 16 [2,] 2 -5 8 -11 14 -1 [3,] -3 6 -9 12 -15 2 Matrices:Vectors: > x <- 7:1 > x [1] 7 6 5 4 3 2 1 > y <- 1:3 > y [1] 1 2 3 > x + y [1] 8 8 8 5 5 5 2 > x <- 7:1 > x [1] 7 6 5 4 3 2 1 > y <- 1:3 > y [1] 1 2 3 > x + y [1] 8 8 8 5 5 5 2 R will expand the shortest vector before adding! # Zero out every 2nd: > x <- 3:8 > y <- c(1,0) > x * y [1] 3 0 5 0 7 0 # Zero out every 2nd: > x <- 3:8 > y <- c(1,0) > x * y [1] 3 0 5 0 7 0

11 11 of 40 > col <- c(“red”, “blue”) > plot(x,y, col=col, cex=1:3, lwd=4) > col <- c(“red”, “blue”) > plot(x,y, col=col, cex=1:3, lwd=4) Loop over plot attributes: A first plot example > x <- seq(from=1, to=2*pi, length=41) > y <- sin(x) # Scatter plot, with red data points of size 2 > plot(x,y, col="red", cex=2) # A line with double width through data points > lines(x,y, lwd=2) > x <- seq(from=1, to=2*pi, length=41) > y <- sin(x) # Scatter plot, with red data points of size 2 > plot(x,y, col="red", cex=2) # A line with double width through data points > lines(x,y, lwd=2) > dev.print(png, "sin.png") > dev.print(postscript, “sin.ps”) > dev.print(png, "sin.png") > dev.print(postscript, “sin.ps”) Write last plot to file: (also in menus on Windows)

12 12 of 40 Questions for Clarification?! (2 minutes)

13 13 of 40 Putting names on things - helps you avoid stupid bugs etc. Vectors: > x <- c(87,76.3,1.67) > x [1] 87.0 76.3 1.67 > names(x) <- c(“age”, “weight”, “height”) > x age weight height 87.0 76.3 1.67 # Alternatively > x <- c(age=87, weight=76.3, height=1.67) > x <- c(87,76.3,1.67) > x [1] 87.0 76.3 1.67 > names(x) <- c(“age”, “weight”, “height”) > x age weight height 87.0 76.3 1.67 # Alternatively > x <- c(age=87, weight=76.3, height=1.67) Why? > x[“age”] [1] 87.0 > bmi <- x[“weight”]/x[“height”]^2 > x[“age”] [1] 87.0 > bmi <- x[“weight”]/x[“height”]^2 > x[1] [1] 87.0 > bmi <- x[2]/x[3]^2 > x[1] [1] 87.0 > bmi <- x[2]/x[3]^2 cf.

14 14 of 40 Putting names on things Matrices: > x1 <- c(87,76.3,1.67) > x2 <- c(78,96.3,1.84) > x3 <- c(45,62.9,1.54) > x <- matrix(c(x1,x2,x3), nrow=3, byrow=TRUE) > x [,1] [,2] [,3] [1,] 87 76.3 1.67 [2,] 78 96.3 1.84 [3,] 45 62.9 1.54 > colnames(x) <- c("age", "weight", "height") > rownames(x) <- c(“jon”, “kim”, “dan”) > x age weight height jon 87 76.3 1.67 kim 78 96.3 1.84 dan 45 62.9 1.54 > x[“jon”,] age weight height 87.00 76.30 1.67 > x[,c("weight","age")] weight age jon 76.3 87 kim 96.3 78 dan 62.9 45 > bmi <- x[,"weight"]/x[,"height"]^2 > bmi jon kim dan 27.35846 28.44400 26.52218 > x1 <- c(87,76.3,1.67) > x2 <- c(78,96.3,1.84) > x3 <- c(45,62.9,1.54) > x <- matrix(c(x1,x2,x3), nrow=3, byrow=TRUE) > x [,1] [,2] [,3] [1,] 87 76.3 1.67 [2,] 78 96.3 1.84 [3,] 45 62.9 1.54 > colnames(x) <- c("age", "weight", "height") > rownames(x) <- c(“jon”, “kim”, “dan”) > x age weight height jon 87 76.3 1.67 kim 78 96.3 1.84 dan 45 62.9 1.54 > x[“jon”,] age weight height 87.00 76.30 1.67 > x[,c("weight","age")] weight age jon 76.3 87 kim 96.3 78 dan 62.9 45 > bmi <- x[,"weight"]/x[,"height"]^2 > bmi jon kim dan 27.35846 28.44400 26.52218

15 15 of 40 Lists > x <- c(a=1:2, b=3, c=5:8) > x a1 a2 b c1 c2 c3 c4 1 2 3 5 6 7 8 > x <- list(a=1:2, b=3, c=5:8) > x $a [1] 1 2 $b [1] 3 $c [1] 5 6 7 8 > x$a [1] 1 2 > x[[“a”]] [1] 1 2 > x[2:3] $b [1] 3 $c [1] 5 6 7 8 > x[c(“b”,”c”)] $b [1] 3 $c [1] 5 6 7 8 > x <- c(a=1:2, b=3, c=5:8) > x a1 a2 b c1 c2 c3 c4 1 2 3 5 6 7 8 > x <- list(a=1:2, b=3, c=5:8) > x $a [1] 1 2 $b [1] 3 $c [1] 5 6 7 8 > x$a [1] 1 2 > x[[“a”]] [1] 1 2 > x[2:3] $b [1] 3 $c [1] 5 6 7 8 > x[c(“b”,”c”)] $b [1] 3 $c [1] 5 6 7 8 > x <- list(a=1:4, b=4:6+2i, src=“GenePix”) > x $a [1] 1 2 3 4 $b [1] 4+2i 5+2i 6+2i $src [1] "GenePix“ > x <- list(a=1:4, b=4:6+2i, src=“GenePix”) > x $a [1] 1 2 3 4 $b [1] 4+2i 5+2i 6+2i $src [1] "GenePix“ Vectors and matrices can only contain one type of data at the time, e.g. either numbers or strings, but lists can carry mixed types:

16 16 of 40 Data frames > df <- data.frame(name=c(“jon”,”kim”,”dan”), age=c(87,78,45), weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54)) > df name age weight height 1 jon 87 76.3 1.67 2 kim 78 96.3 1.84 3 dan 45 62.9 1.54 > df$weight [1] 76.3 96.3 62.9 > df[,c("name","age")] name age 1 jon 87 2 kim 78 3 dan 45 > as.matrix(df) name age weight height 1 "jon" "87" "76.3" "1.67" 2 "kim" "78" "96.3" "1.84" 3 "dan" "45" "62.9" "1.54" > df <- data.frame(name=c(“jon”,”kim”,”dan”), age=c(87,78,45), weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54)) > df name age weight height 1 jon 87 76.3 1.67 2 kim 78 96.3 1.84 3 dan 45 62.9 1.54 > df$weight [1] 76.3 96.3 62.9 > df[,c("name","age")] name age 1 jon 87 2 kim 78 3 dan 45 > as.matrix(df) name age weight height 1 "jon" "87" "76.3" "1.67" 2 "kim" "78" "96.3" "1.84" 3 "dan" "45" "62.9" "1.54" Data frames are very powerful. Simply speaking you can treat them as both lists and matrices:

17 17 of 40 Importing data > df <- read.table(“foo.txt”, header=TRUE) > dim(df) [1] 221952 6 > summary(df) slide spot R Min. :1.00 Min. : 1 Min. : 30.0 1st Qu.:1.75 1st Qu.:13873 1st Qu.: 60.0 Median :2.50 Median :27745 Median : 102.0 Mean :2.50 Mean :27745 Mean : 487.5 3rd Qu.:3.25 3rd Qu.:41616 3rd Qu.: 233.0 Max. :4.00 Max. :55488 Max. :65211.0 G Rb Gb Min. : 24.0 Min. : 32.00 Min. : 24.00 1st Qu.: 38.0 1st Qu.: 40.00 1st Qu.: 26.00 Median : 59.0 Median : 49.00 Median : 32.00 Mean : 194.7 Mean : 51.85 Mean : 31.92 3rd Qu.: 110.0 3rd Qu.: 64.00 3rd Qu.: 36.00 Max. :62341.0 Max. :1234.00 Max. :255.00 > str(df) `data.frame': 221952 obs. of 6 variables: $ slide: int 1 1 1 1 1 1 1 1 1 1... $ spot : int 1 2 3 4 5 6 7 8 9 10... $ R : int 4416 335 39 568 42 43 56 40 7912 51... $ G : int 1533 155 47 211 50 110 64 45 4535 65... $ Rb : int 39 38 39 39 39 39 39 40 39 39... $ Gb : int 43 42 43 42 42 44 42 43 42 48... > df <- read.table(“foo.txt”, header=TRUE) > dim(df) [1] 221952 6 > summary(df) slide spot R Min. :1.00 Min. : 1 Min. : 30.0 1st Qu.:1.75 1st Qu.:13873 1st Qu.: 60.0 Median :2.50 Median :27745 Median : 102.0 Mean :2.50 Mean :27745 Mean : 487.5 3rd Qu.:3.25 3rd Qu.:41616 3rd Qu.: 233.0 Max. :4.00 Max. :55488 Max. :65211.0 G Rb Gb Min. : 24.0 Min. : 32.00 Min. : 24.00 1st Qu.: 38.0 1st Qu.: 40.00 1st Qu.: 26.00 Median : 59.0 Median : 49.00 Median : 32.00 Mean : 194.7 Mean : 51.85 Mean : 31.92 3rd Qu.: 110.0 3rd Qu.: 64.00 3rd Qu.: 36.00 Max. :62341.0 Max. :1234.00 Max. :255.00 > str(df) `data.frame': 221952 obs. of 6 variables: $ slide: int 1 1 1 1 1 1 1 1 1 1... $ spot : int 1 2 3 4 5 6 7 8 9 10... $ R : int 4416 335 39 568 42 43 56 40 7912 51... $ G : int 1533 155 47 211 50 110 64 45 4535 65... $ Rb : int 39 38 39 39 39 39 39 40 39 39... $ Gb : int 43 42 43 42 42 44 42 43 42 48... Data in text (ASCII) files can be imported using read.table() :

18 18 of 40 Boxplots and density plots > df <- read.table(“foo.txt”, header=TRUE) > dim(df) [1] 221952 6 > boxplot(df[,c("R","G","Rb","Gb")], col=c("red","green")) # Ooops > boxplot(df[,c("R","G","Rb","Gb")], col=c("red","green"), ylim=c(0,300)) # Better! > df <- read.table(“foo.txt”, header=TRUE) > dim(df) [1] 221952 6 > boxplot(df[,c("R","G","Rb","Gb")], col=c("red","green")) # Ooops > boxplot(df[,c("R","G","Rb","Gb")], col=c("red","green"), ylim=c(0,300)) # Better! > plot(density(log(df$G,base=2)), col="green", main="Densities of R and G") > lines(density(log(df$R,base=2)), col="red") > plot(density(log(df$G,base=2)), col="green", main="Densities of R and G") > lines(density(log(df$R,base=2)), col="red")

19 19 of 40 Summary of “data types” Vectors (only one type at the time) : > x <- 7:1 > x [1] 7 6 5 4 3 2 1 > x <- c("a",3,"e") > x [1] "a" “3" "e" > x <- 7:1 > x [1] 7 6 5 4 3 2 1 > x <- c("a",3,"e") > x [1] "a" “3" "e" > x <- matrix(1:18, nrow=3) > x [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1 4 7 10 13 16 [2,] 2 5 8 11 14 17 [3,] 3 6 9 12 15 18 > x <- matrix(letters[1:12], nrow=3) > x [,1] [,2] [,3] [,4] [1,] "a" "d" "g" "j" [2,] "b" "e" "h" "k" [3,] "c" "f" "i" "l" > x <- matrix(1:18, nrow=3) > x [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1 4 7 10 13 16 [2,] 2 5 8 11 14 17 [3,] 3 6 9 12 15 18 > x <- matrix(letters[1:12], nrow=3) > x [,1] [,2] [,3] [,4] [1,] "a" "d" "g" "j" [2,] "b" "e" "h" "k" [3,] "c" "f" "i" "l" Matrices (rectangular, only one type at the time) : > x <- list(a=1:4, b=4:6+2i, src=“GenePix”) > x $a [1] 1 2 3 4 $b [1] 4+2i 5+2i 6+2i $src [1] "GenePix“ > x <- list(a=1:4, b=4:6+2i, src=“GenePix”) > x $a [1] 1 2 3 4 $b [1] 4+2i 5+2i 6+2i $src [1] "GenePix“ Lists (any shape, mixed type) : > x <- data.frame(name=c(“jon”,”kim”,”dan”), age=c(87,78,45), weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54)) > x name age weight height 1 jon 87 76.3 1.67 2 kim 78 96.3 1.84 3 dan 45 62.9 1.54 > x <- data.frame(name=c(“jon”,”kim”,”dan”), age=c(87,78,45), weight=c(76.3,96.3,62.9), height=c(1.67,1.84,1.54)) > x name age weight height 1 jon 87 76.3 1.67 2 kim 78 96.3 1.84 3 dan 45 62.9 1.54 Data frames (rectangular, mixed type) :

20 20 of 40 Questions for Clarification?! (2 minutes)

21 21 of 40 Installing R on Windows

22 22 of 40 Random numbers > x <- rnorm(10000, mean=2, sd=4) > length(x) [1] 10000 > mean(x) [1] 2.007904 > sd(x) [1] 3.969784 > summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. -12.670 -0.607 2.000 2.008 4.701 15.850 > hist(x) # or use probabilities (not counts) on y-axis > hist(x, probability=TRUE) > x <- rnorm(10000, mean=2, sd=4) > length(x) [1] 10000 > mean(x) [1] 2.007904 > sd(x) [1] 3.969784 > summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. -12.670 -0.607 2.000 2.008 4.701 15.850 > hist(x) # or use probabilities (not counts) on y-axis > hist(x, probability=TRUE)

23 23 of 40 Defining functions and scripts > cubic <- function(x) 1 + x^2 - 0.5*x^3 > cubic(0:4) [1] 1.0 1.5 1.0 -3.5 -15.0 > cubic <- function(x) 1 + x^2 - 0.5*x^3 > cubic(0:4) [1] 1.0 1.5 1.0 -3.5 -15.0 # Read R code from file > source(“cubic.R”) > cubic(0:4) [1] 1.0 1.5 1.0 -3.5 -15.0 > cubic function(x) { 1 + x^2 - 0.5*x^3 } # Read R code from file > source(“cubic.R”) > cubic(0:4) [1] 1.0 1.5 1.0 -3.5 -15.0 > cubic function(x) { 1 + x^2 - 0.5*x^3 }

24 24 of 40 Adding data points to an existing plot > f <- function(x) x^2 > x <- seq(-10,10, by=0.01) > eps <- rnorm(length(x), sd=4) > y <- f(x) + eps # plot() creates a new plot > plot(x,y) # points() add data points to # an existing plot > points(x,f(1.1*x), col="blue") # Same for abline() and others > abline(h=0, col="green") > abline(v=0, col="purple") > abline(a=0, b=1, col="orange") # Same for curve() with add=TRUE > curve(f, col="red“, add=TRUE) (On request by Linda)

25 25 of 40 Packages On CRAN - Comprehensive R Archive Network – there are today 300+ packages published! Browse CRAN at http://www.r-project.org/ http://www.r-project.org/ Find what you want. Dirt simple to install package! At the Centre for Mathematical Sciences we try to keep install and update all package on our system.

26 26 of 40 Install a package On Windows extremely easy! On all systems: > install.packages(c("adapt","maptools")) trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/PACKAGES' Content type `text/plain; charset=iso-8859-1' length 12674 bytes opened URL downloaded 12Kb trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/adapt_1.0-3.zip' Content type `application/zip' length 39304 bytes opened URL downloaded 38Kb trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/maptools_0.3-2.zip ' Content type `application/zip' length 129634 bytes opened URL downloaded 126Kb Delete downloaded files (y/N)? y updating HTML package descriptions > library(maptools)

27 27 of 40 Update packages On all systems: > update.packages() trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/PACKAGES' Content type `text/plain; charset=iso-8859-1' length 12674 bytes opened URL downloaded 12Kb cluster : Version 1.7.3 in c:/PROGRA~1/R/rw1071/library Version 1.7.6 on CRAN Update (y/N)? y foreign : Version 0.6-1 in c:/PROGRA~1/R/rw1071/library Version 0.6-3 on CRAN Update (y/N)? y... trying URL `http://cran.r-project.org/bin/windows/contrib/1.7/foreign_0.6-3.zip' Content type `application/zip' length 109855 bytes opened URL downloaded 107Kb Delete downloaded files (y/N)? y updating HTML package descriptions >

28 28 of 40 Using/loading a package On all systems: > library(maptools) # Some package loads other package too > library(com.braju.sma) Loading required package: R.oo R.oo v0.44 (2003/10/29) was successfully loaded. Loading required package: R.io R.io v0.44 (2003/10/29) was successfully loaded. Loading required package: R.graphics R.graphics v0.44 (2003/10/29) was successfully loaded. com.braju.sma v0.64 (2003/10/31) was successfully loaded. > example(MAData)

29 29 of 40 Questions for Clarification?! (2 minutes)

30 30 of 40 Fit a linear model (i) > help.search(“linear model”) Help files with alias or title matching 'linear model' using fuzzy matching: anova.glm(base) Analysis of Deviance for Generalized Linear Model Fits anova.lm(base) ANOVA for Linear Model Fits glm(base) Fitting Generalized Linear Models family.glm(base) Accessing Generalized Linear Model Fits summary.glm(base) Summarizing Generalized Linear Model Fits lm(base) Fitting Linear Models family.lm(base) Accessing Linear Model Fits summary.lm(base) Summarizing Linear Model Fits lm.fit(base) Fitter Functions for Linear Models loglin(base) Fitting Log-Linear Models predict.lm(base) Predict method for Linear Model Fits cv.glm(boot) Cross-validation for Generalized Linear Models glm.diag.plots(boot) Diagnostics plots for generalized linear models 1. Find function to use: It says that there is a function called lm() in a package called base. (Package base is always loaded when R is started)

31 31 of 40 Fit a linear model (ii) > help(lm) 2. Look at the help:

32 32 of 40 Fit a linear model (iii) # Data with noise > f <- function(x) 3.3 + 0.85*x > x <- seq(0,10, by=0.01) > eps <- rnorm(length(x), sd=2) > y <- f(x) + eps # Plot data > plot(x,y) # Fit a linear model. See ?lm > fit <- lm(y ~ x) > fit Call: lm(formula = y ~ x) Coefficients: (Intercept) x 3.2780 0.8386 # Plot fitted line > abline(fit, col=“red") 3. Test it on our data:

33 33 of 40 Fit a linear model (iv) 4. Investigate fitted line in detail: > summary(fit) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -8.2449 -1.3234 0.1234 1.2670 6.5321 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.27797 0.12608 26.00 <2e-16 *** x 0.83863 0.02183 38.41 <2e-16 *** --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 1.996 on 999 degrees of freedom Multiple R-Squared: 0.5963, Adjusted R-squared: 0.5959 F-statistic: 1476 on 1 and 999 DF, p-value: < 2.2e-16

34 34 of 40 Fit a linear model (v) 5. What does lm() return? > str(fit) List of 12 $ coefficients : Named num [1:2] 3.357 0.824..- attr(*, "names")= chr [1:2] "(Intercept)" "x" $ residuals : Named num [1:1001] 1.09757 -0.00305 - 0.95878 1.64874 -2.59366.....- attr(*, "names")= chr [1:1001] "1" "2" "3" "4"... $ effects : Named num [1:1001] -236.55 75.33 -0.99 1.62 -2.63.....- attr(*, "names")= chr [1:1001] "(Intercept)" "x" "" ""... $ rank : int 2 $ fitted.values: Named num [1:1001] 3.36 3.36 3.37 3.38 3.39.....- attr(*, "names")= chr [1:1001] "1" "2" "3" "4"... $ assign : int [1:2] 0 1 $ qr :List of 5..$ qr : num [1:1001, 1:2] -31.6386 0.0316 0.0316 0.0316 0.0316.......- attr(*, "dimnames")=List of 2......$ : chr [1:1001] "1" "2" "3" "4"... [ SNIP ]......- attr(*, "order")= int 1......- attr(*, "intercept")= int 1......- attr(*, "response")= int 1......- attr(*, ".Environment")=length 10......- attr(*, "predvars")= language list(y, x) attr(*, "class")= chr "lm"

35 35 of 40 Fit a linear model (vi) 6. Related functions etc. > res <- residuals(fit) > hist(res) > qqnorm(res)

36 36 of 40 Questions for Clarification?! (2 minutes)

37 37 of 40 Local Polynomial Regression Fitting (LOESS) # Read microarray data > df <- read.table("foo.txt", header=TRUE) # Use only R & G from slide 1 > df <- df[df$slide == 1, c(“R”,”G”)] # Get the log-ratios and log-intensities > M <- log(df$R/df$G, base=2) > A <- log(df$R*df$G, base=2)/2 # Plot M vs A > plot(A,M) # Fit a LOESS curve > fit <- loess(M ~ A) > Mcurve <- predict(fit, A) > points(A, Mcurve, col="red") # Remove trend > M <- M - Mcurve > points(A,M, col=”blue”)

38 38 of 40 Column-by-column standardization of matrix # Subract the mean and scale by the std. dev. > zscore <- function(y,...) { + (y - mean(y,...)) / sd(y,...) + } # Generate data > x <- matrix(rnorm(1000, mean=2, sd=10), ncol=2) # Standardize data. If there are NA values, pass argument # na.rm=TRUE to both mean() and sd() via..., which tells # them to remove NAs first) > z <- apply(x, MARGIN=2, FUN=zscore, na.rm=TRUE) # Create a vector with the mean and the std. dev. of the first column > c(mean(x[,1]),sd(x[,1])) [1] 2.44114 10.25328 > c(mean(z[,1]),sd(z[,1])) [1] -1.616485e-16 1.000000e+00 (On request by Halfdan)

39 39 of 40 Links http://www.r-project.org/ - homepage of the R project - download and install R - sign up on the r-help mailing list - more manuals and examples. - additional packages (CRAN) - conferences and other large-scale projects -... http://www.maths.lth.se/help/R/ - Help @ MC - Local packages Although I’m finishing up my thesis... I can help out, but please make sure you read the help first. Don’t forget about the r-help list. Thanks!

40 40 of 40 Session open for any questions (remaining time)


Download ppt "1 of 40 Henrik Bengtsson Get started with R Mathematical Statistics, Centre for Mathematical Sciences, Lund University. Slides available."

Similar presentations


Ads by Google