Welcome to lecture 6: An introduction to data analysis with R IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of Chemistry & Biochemistry, UCLA

We're moving ahead a bit...
- The majority of the class does some type of microarray analysis
  - Microarray analysis uses the same programming concepts we've been exploring
- We've covered variables, control structures, data structures, and functions in Perl
  - Now we'll introduce a new language, R, which is particularly suited to numerical analysis
  - We'll learn about exploratory data analysis
  - We'll write our own functionality in this new language

R (A data-structure focused introduction)
- Every R intro I've read tends towards statistical tools first, with data structures and programming concepts as an afterthought
- I don't think this is the best way to learn a language
  - After all, we'll be doing complex data analysis
  - Going beyond one-off biostatistics learning
- I'm going to introduce R the same way I introduced Perl
- I hope to show you that R is just as friendly…

What is R?
From the R-project webpage (www.r-project.org):
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

Where to get R?
- Available for a wide variety of platforms (handled well on Windows too!)
- http://www.r-project.org
- Libraries are available via CRAN (like the CPAN we used before)
- Bioconductor is the R counterpart of BioPerl; it is used extensively for routine and experimental microarray analysis

Let's begin
- The default mode for R is an interactive CLI (with emacs keybindings!)
- A query-response line mode
  - An overgrown calculator: R evaluates commands through function calls
  - A programming language: complex data structures, control blocks, objects!
> 2+2
[1] 4

Symbolic variables
- Variable assignment is handled via arrow notation: <-
- Variables can be examined by simply typing the variable name
- The index of the first element on each output line is given in brackets: [1]
- Scalar elements can be numerical, character, or boolean
- Character values must be quoted
> x<-2
> x
[1] 2
> x + x
[1] 4
> x<-"ACTCGATCGACT"
> x
[1] "ACTCGATCGACT"
> x<-T
> x
[1] TRUE

Vectors
- R handles vectors as single objects
- The three vector types used here:
  - numerical vectors
  - character vectors
  - logical vectors
- Vectors are created (and treated) as a concatenation of scalar elements, using c()
> x<-c(1,2,3,4,5)
> x
[1] 1 2 3 4 5
> x<-c("ACT","TCA","GGA","CCG")
> x
[1] "ACT" "TCA" "GGA" "CCG"
> x<-c(T,T,F,T)
> x
[1]  TRUE  TRUE FALSE  TRUE

Vector element access
- Very similar to Perl array element access, except that R indices start at 1
- Access by index
  - The index can itself be a vector of positions
  - It can be an expression that evaluates to a logical vector
  - Negative indices denote exclusion
> x<-seq(1,20,2)
> x[5]
[1] 9
> x[c(1,3,4)]
[1] 1 5 7
> x[x>10]
[1] 11 13 15 17 19
> x[c(-1,-2,-3,-4,-5)]
[1] 11 13 15 17 19
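The indexing styles above can be sketched side by side. This is only an illustration; the names idx and keep are made up here and are not part of the lecture.

```r
# Odd numbers from 1 to 19, as on the slide
x <- seq(1, 20, 2)

# Index by position: a vector of positions returns several elements
idx <- c(1, 3, 4)
x[idx]

# Index by expression: the comparison yields a logical vector,
# and only the positions holding TRUE are kept
keep <- x > 10
x[keep]

# which() converts the logical vector into the matching positions
which(keep)

# Negative indices drop elements; -(1:5) is shorthand for c(-1,-2,-3,-4,-5)
x[-(1:5)]
```

Logical indexing is the form that tends to matter most in data analysis, since it lets a condition on the data select the data.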

Vector functions: seq (sequence)
- Creates a range of values in a vector
- LETTERS is a built-in character vector, so it is indexed with brackets
> x<-seq(1,10,1)
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x<-4:12
> x
[1] 4 5 6 7 8 9 10 11 12
> x<-LETTERS[1:3]
> x
[1] "A" "B" "C"

Vector functions: rep (replicate)
- Generates repeated values
- Can be used to generate complex patterns
- Can be used to generate data grouping codes
> x<-c(10,100,1000)
> rep(x,3)
[1] 10 100 1000 10 100 1000 10 100 1000
> rep(x,1:3)
[1] 10 100 100 1000 1000 1000
> rep(1:2,c(5,10))
[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

Vector functions: sort
- Returns a sorted copy of a vector; the original is unchanged unless you reassign it
> x<-c(10000,10,1000)
> sort(x)
[1] 10 1000 10000
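A small sketch of how sort() behaves, with order() as its companion. The labels vector is made up for illustration.

```r
x <- c(10000, 10, 1000)

# sort() returns a new, sorted vector; x itself is left unchanged
sorted <- sort(x)

# Reassign if the sorted order should replace the original:
# x <- sort(x)

# decreasing = TRUE reverses the order
sort(x, decreasing = TRUE)

# order() returns the permutation of indices that sorts x,
# which is handy for reordering a second vector in step with x
labels <- c("big", "small", "medium")   # hypothetical labels
labels[order(x)]
```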

Vector functions: factor
- Grouping for categorical data
> x<-c(0,1,2,1,2)
> fx<-factor(x,levels=0:2)
> levels(fx)<-c("low","middle","grande")
> fx
[1] low    middle grande middle grande
Levels: low middle grande
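Once a grouping factor exists, it can drive counts and per-group summaries. The sketch below assumes a second, made-up vector expr of measurements; it is not from the lecture.

```r
x <- c(0, 1, 2, 1, 2)

# levels and labels can be given in one call
fx <- factor(x, levels = 0:2, labels = c("low", "middle", "grande"))

# table() counts the observations in each level
table(fx)

# tapply() applies a function to a second vector within each group;
# 'expr' is made-up data for illustration
expr <- c(1.0, 2.0, 4.0, 3.0, 6.0)
group.means <- tapply(expr, fx, mean)
group.means
```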

Matrices
- Simply two-dimensional arrays (arrays in general can have n dimensions)
  - in R, most everything is an array
- Can hold elements of any single type
- Dimensions can be set and changed dynamically
  - by default, a matrix is filled by columns
> x<-seq(1,12)
> dim(x)<-c(3,4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> matrix(1:12,nrow=3,byrow=T)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
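Matrix elements are addressed as [row, column], and apply() extends the vector idea to whole rows or columns. A minimal sketch, with variable names chosen here for illustration:

```r
# Fill by columns (the default) and by rows
m.col <- matrix(1:12, nrow = 3)
m.row <- matrix(1:12, nrow = 3, byrow = TRUE)

dim(m.col)        # rows, then columns

# Element access is [row, column]; leaving one side blank selects all of it
m.row[2, 3]       # single element
m.row[2, ]        # whole second row
m.row[, 3]        # whole third column

# apply() runs a function over rows (1) or columns (2)
row.sums <- apply(m.row, 1, sum)
col.sums <- apply(m.row, 2, sum)
```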

Matrix functions: t (transpose)
- Swaps rows and columns
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> t(x)
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

Matrix functions: rownames
- Assigns names to the row indices (like hash keys)
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> rownames(x)<-c("one","two","three")
> x
      [,1] [,2] [,3] [,4]
one      1    2    3    4
two      5    6    7    8
three    9   10   11   12

Matrix functions: colnames
- Assigns names to the column indices (like hash keys)
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> colnames(x)<-c("one","two","three","four")
> x
     one two three four
[1,]   1   2     3    4
[2,]   5   6     7    8
[3,]   9  10    11   12

Matrix functions: cbind
- Joins columns together side by side (in the agglomerative sense), like Excel
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> cbind(x,c(111,222,333))
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4  111
[2,]    5    6    7    8  222
[3,]    9   10   11   12  333

Matrix functions: rbind
- Stacks rows together (in the agglomerative sense), like Excel
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> rbind(x,c(111,222,333,444))
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]  111  222  333  444

Object functions: list
- Combines collections into composite objects
- Objects are treated as vectors in R, plus methods
- Matrices are collections of vectors
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> y<-c(1:5)
> z<-list(matrix=x,vector=y)
> z
$matrix
     [,1] [,2] [,3] [,4]
...
$vector
[1] 1 2 3 4 5

list (object) functions: indexing
- Since a list is itself a vector of components, we can obtain each component by name and then index into it
> z$vector
[1] 1 2 3 4 5
> z$vector[5]
[1] 5
> z$matrix[1,3]
[1] 3
> z$vector[z$vector>3]
[1] 4 5
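Lists have two bracket forms that are easy to confuse. A brief sketch, rebuilding the same z so it runs on its own; the component name label is made up for illustration.

```r
x <- matrix(1:12, nrow = 3, byrow = TRUE)
y <- 1:5
z <- list(matrix = x, vector = y)

names(z)             # the component names

# $name and [["name"]] both return the component itself
z$vector
z[["vector"]]

# Single brackets return a shorter list, not the component
sub <- z["vector"]
class(sub)

# Components can be added or replaced by assignment
z$label <- "example"   # 'label' is a made-up component name
length(z)
```

The double-bracket form is the one to use when the component name is held in a variable.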

list (object) functions: data.frame
- If the vectors are the same length, we can agglomerate them in a special data matrix
- The data is paired, and has unique row names
> x<-c(1:5)
> y<-c(6:10)
> z<-data.frame(x,y)
> z
  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10

Reading data from files: data.frame / read.table / read.csv / read.delim
> myData<-read.table("example.txt", header=T)
> myData
  field_one field_two
1      10.3      1.08
2      11.2      0.97
3         …         …
- The data frame is ideal for handling delimited files
  - header=T says the first line holds the column names
  - (if header is omitted, a header is assumed only when the first line has one entry fewer than the other lines)
- Can handle a wide variety of interfaces with outputs
  - Tab- and comma-delimited txt files
  - SPSS, SAS, Stata, Minitab, S-PLUS v3 files
  - Works well with DB interface calls as well
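To try this without a data file at hand, one option is to write a tiny file first and read it back. The file contents below are made up, and the temporary path stands in for example.txt.

```r
# Write a small tab-delimited file to a temporary location so the
# example is self-contained; the values are made up
path <- tempfile(fileext = ".txt")
writeLines(c("field_one\tfield_two",
             "10.3\t1.08",
             "11.2\t0.97"), path)

# header = TRUE takes the column names from the first line
myData <- read.table(path, header = TRUE, sep = "\t")

str(myData)            # column names and types
myData$field_one       # columns are accessed like list components

# read.delim() is read.table() with tab-delimited defaults
same <- read.delim(path)

unlink(path)           # remove the temporary file
```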

Persistence: save.image() / .RData / ls()
- The workspace is dynamic
  - Variables and functions are created or loaded
  - objects() or ls() shows availability of both
- Can be saved to a local .RData file using save.image()
- .RData is loaded by default upon startup
- Can specify the .RData (or whatever you name it) workspace using load() (may have to specify the pathname!)
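A minimal round trip, using save() on a single object rather than save.image() on the whole workspace so that the sketch does not touch your real .RData. The object counts and the temporary path are made up for illustration.

```r
counts <- c(12, 7, 31)                 # made-up object to persist
path <- tempfile(fileext = ".RData")   # temporary file, for illustration

# save() writes the named objects; save.image(file = path) would write
# the whole workspace instead
save(counts, file = path)

rm(counts)                 # remove it from the workspace
exists("counts")           # check whether it is still visible

# load() restores objects under their original names and
# returns those names as a character vector
restored <- load(path)
restored
counts

unlink(path)               # clean up the temporary file
```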

data frame (object) functions: subset
- Allows extraction of a portion of a data frame
> x<-c(1:5)
> y<-c(6:10)
> z<-data.frame(x,y)
> subset(z,x>2)
  x  y
3 3  8
4 4  9
5 5 10

data frame (object) functions: transform
- Allows extension of a data frame with new columns
> x<-c(1:5)
> y<-c(6:10)
> z<-data.frame(x,y)
> transform(z,x.log=log(x))
  x  y     x.log
1 1  6 0.0000000
2 2  7 0.6931472
3 3  8 1.0986123
4 4  9 1.3862944
5 5 10 1.6094379

data frame (object) functions: split
- Splits a vector into a list of vectors, one per group
- The first argument is the data, the second is the grouping
> x<-c(1:5)
> y<-c(6:10)
> z<-data.frame(x,y)
> h<-split(z$y,z$x)
> h
$`1`
[1] 6
$`2`
[1] 7
$`3`
[1] 8
$`4`
[1] 9
$`5`
[1] 10
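split() is more interesting when groups hold more than one value. The sketch below uses made-up measurements and group codes; none of it comes from the lecture data.

```r
# Made-up measurements with a grouping code for each one
value <- c(2.1, 3.4, 1.9, 4.0, 3.8, 2.2)
group <- rep(c("control", "treated"), times = 3)

# split() returns a list with one vector per group
by.group <- split(value, group)
names(by.group)

# sapply() then summarises each group in one call
group.means <- sapply(by.group, mean)
group.means
```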

data frame (object) functions: lapply
- Implicit looping over the components of a list (here, the columns of a data frame)
> x<-c(1:5)
> y<-c(6:10)
> z<-data.frame(x,y)
> lapply(z, mean)
$x
[1] 3
$y
[1] 8
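For comparison, here is lapply() next to sapply() and an explicit loop. This is only a sketch; the helper names (means.list, spans, and so on) are made up.

```r
z <- data.frame(x = 1:5, y = 6:10)

# lapply() always returns a list, one component per column
means.list <- lapply(z, mean)

# sapply() does the same work but simplifies to a named vector
means.vec <- sapply(z, mean)

# An anonymous function works too: here, the spread of each column
spans <- sapply(z, function(col) max(col) - min(col))

# The same result written as an explicit for loop
spans.loop <- numeric(0)
for (name in names(z)) {
  spans.loop[name] <- max(z[[name]]) - min(z[[name]])
}
```

The apply-style calls say what is computed per column and leave the looping to R, which is usually shorter and harder to get wrong.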

Functions in R
- Very similar to what we've seen in Perl!
- Blocks are the same
- Takes arguments
- Uses control structures (for, if, while loops, ...)
- The value of the last expression in the block is returned
> x<-c(1:5)
> my.function<-function(x) { u<-mean(x); u }
> y<-my.function(x)
> y
[1] 3

Control structures: for loop
- Loops over a set range
> myfunction<-function(x) {
+   for (i in 1:10) {
+     # do something here
+   }
+ }
- The variable i will take the values of the sequence in turn
- The range is specified by the sequence

A stupid function example
Just to illustrate passing args back and forth…
> myfun<-function(x)
+ {
+   X<-x
+   for (i in 1:10)
+   {
+     X<-c(X,i)
+   }
+   X
+ }
> myfun(0)
[1] 0 1 2 3 4 5 6 7 8 9 10

A better function example
A function to calculate the two sample t-statistic, showing all the steps. (From http://cran.r-project.org/doc/manuals/R-intro.html)
> twosam <- function(y1, y2) {
+   n1 <- length(y1); n2 <- length(y2)
+   yb1 <- mean(y1); yb2 <- mean(y2)
+   s1 <- var(y1); s2 <- var(y2)
+   s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
+   tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))
+   tst
+ }
With this function defined, you could perform two sample t-tests using a call such as:
> tstat <- twosam(data$male, data$female); tstat
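One way to check a hand-written statistic is to compare it with the built-in test. The sketch below repeats twosam so it runs on its own, and uses simulated data (the male and female vectors and the seed are made up, not the lecture's data).

```r
twosam <- function(y1, y2) {
  n1 <- length(y1); n2 <- length(y2)
  yb1 <- mean(y1);  yb2 <- mean(y2)
  s1 <- var(y1);    s2 <- var(y2)
  s <- ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)   # pooled variance
  (yb1 - yb2) / sqrt(s * (1 / n1 + 1 / n2))
}

set.seed(1)                      # arbitrary seed, for reproducibility
male <- rnorm(20, mean = 1)      # simulated data, not from the lecture
female <- rnorm(25, mean = 0)

tstat <- twosam(male, female)

# t.test() with var.equal = TRUE uses the same pooled-variance statistic
builtin <- t.test(male, female, var.equal = TRUE)
tstat
unname(builtin$statistic)
```

Note that t.test() defaults to var.equal = FALSE (the Welch test), so the two numbers agree only when the pooled form is requested.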

Control structures: while loop
- Loops while an evaluation returns boolean TRUE
> myfunction<-function(x) {
+   while (x>10) {
+     # do something here
+   }
+ }
- The evaluation is tested at the beginning of the loop
- Note that in this case, the block may never be executed

Control structures: repeat loop
- Loops until told to stop by break
> myfunction<-function(x) {
+   repeat {
+     # do something here
+     if (x>10) break
+   }
+ }
- Uses a conditional if statement
- break is called when the boolean evaluation is TRUE, and the block is exited
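The difference between the two loop forms shows up when the condition is already settled on entry. A runnable sketch, with made-up function names and arbitrary loop bodies:

```r
# while: the condition is checked before each pass
count.while <- function(x) {
  steps <- 0
  while (x > 10) {
    x <- x / 2            # shrink x so the loop eventually stops
    steps <- steps + 1
  }
  steps
}

# repeat: the body always runs at least once; break ends the loop
count.repeat <- function(x) {
  steps <- 0
  repeat {
    x <- x * 2            # grow x so the break is eventually reached
    steps <- steps + 1
    if (x > 10) break
  }
  steps
}

count.while(5)      # condition is FALSE at the start, so the body never runs
count.while(100)
count.repeat(1)
count.repeat(20)    # already above 10, but the body still runs once
```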

Descriptive statistics: summary()
- Summary statistics related to a numeric variable
> x<-rnorm(100)
> summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-3.08800 -0.70850 -0.11480 -0.04413  0.76510  2.89100

Descriptive statistics: plot()
- Simple x vs. y diagram (scatterplot)
> x<-rnorm(100)
> y<-rnorm(100)
> plot(x,y)
> plot(rnorm(500))
> lines(rnorm(500))

Descriptive statistics: heatmap generation (image)
- A grid whose cells are colored according to intensity…
- Very useful for microarray analysis (we'll see next time…)
- Can be used with dendrogram generation

Statistics of populations
- The equations so far are for sample statistics
  - a statistic is a single number estimated from a sample
- We use the sample to make inferences about the population
  - a parameter is a single number that summarizes some quality of a variable in a population
  - the term for the population mean is μ (mu), and Ȳ (Y bar) is a sample estimator of μ
  - the term for the population standard deviation is σ (sigma), and s is a sample estimator of σ
- Note that μ and σ are both parameters of the normal probability curve
Source: http://www.bsos.umd.edu/socy/smartin/601/

Measuring probabilities under the normal curve
- We can make transformations by scaling everything with respect to the mean and standard deviation
- Let z = the number of standard deviations above or below the population mean: z = (y - μ) / σ
  - z = 0: y = μ
  - z = 1: y = μ +/- σ (p = 0.68)
  - z = 2: y = μ +/- 2σ (p = 0.95)
  - z = 3: y = μ +/- 3σ (p = 0.997)
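These probabilities can be reproduced with pnorm(), the cumulative distribution function of the normal curve. A small sketch; the helper p.within and the example values of mu, sigma and y are made up for illustration.

```r
# Probability of falling within k standard deviations of the mean
p.within <- function(k) pnorm(k) - pnorm(-k)   # hypothetical helper
round(p.within(1:3), 3)

# Converting a raw value to a z-score, with made-up mu and sigma
mu <- 100
sigma <- 15
y <- 130
z <- (y - mu) / sigma
z

# Probability of a value at least this far above the mean
1 - pnorm(z)
```

The slide's 0.95 for z = 2 is the usual rounding; the value exactly 0.95 corresponds to about 1.96 standard deviations, which qnorm(0.975) returns.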

Plotting using hist() and curve()
> y<-hist(x,plot=F)
> ylim<-range(0,y$density,dnorm(0))
> hist(x,freq=F,ylim=ylim)
> curve(dnorm(x),add=T)

Difficult to integrate… but probabilities have been mapped out to this curve. Transformations from other curves are possible…

Plotting using qqnorm() > qqnorm(x)

Box plots (box and whiskers plots, Tukey, 1977)
[Figure: annotated box plot labelling the outliers, fences / whiskers, IQR, Q3, Q1, and median]
- Upper fence / whisker: min(Q3 + 1.5(IQR), largest X)
- Lower fence / whisker: max(Q1 - 1.5(IQR), smallest X)
Plotting using boxplot()
> boxplot(x)
> boxplot(log(x))
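The fence rule can be worked through by hand before handing the data to boxplot(). This sketch uses simulated data with one planted outlier; the seed and variable names are made up. It uses quantile() for Q1 and Q3, whereas boxplot() computes the box from Tukey's hinges, so the two can differ slightly on small samples.

```r
set.seed(2)                       # arbitrary seed
x <- c(rnorm(50), 6)              # simulated data plus one planted outlier

q <- quantile(x, c(0.25, 0.75))   # Q1 and Q3
iqr <- IQR(x)

# Whiskers reach the most extreme data points that lie inside the fences
upper <- max(x[x <= q[2] + 1.5 * iqr])
lower <- min(x[x >= q[1] - 1.5 * iqr])

# Anything beyond the whiskers is drawn as an individual outlier point
outliers <- x[x > upper | x < lower]
outliers
```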

My advice
- First learn to program in R. Then use the R libraries.
- Everything in R can be built up piecewise
  - The data is made of component parts: it's extremely useful to know how to handle the objects
  - The graphics are made of component parts: this allows extreme fine-tuning of your visualization!
- Go beyond scatterplots and barplots to describe complex data well and visualize hidden trends
- A good reference is Edward Tufte's work on data visualization (for example, The Visual Display of Quantitative Information).

Homework
A simple problem, but one we may use frequently: use lapply (or sapply) to simulate the result of taking the mean of 100 random numbers from the normal distribution for 10 independent samples.
