Welcome to lecture 6: An introduction to data analysis with R IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D.

Presentation transcript:

Welcome to lecture 6: An introduction to data analysis with R IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of Chemistry & Biochemistry, UCLA

We're moving ahead a bit...
The majority of the class does some type of microarray analysis
– Microarray analysis uses the same programming concepts we've been exploring
We've covered variables, control structures, data structures, and functions in Perl
– Now we'll introduce a new language called R, particularly suited for numerical analysis
– We'll learn about exploratory data analysis
– We'll write our own functions in this new language

R (A data-structure focused introduction)
– Every R intro I've read tends towards statistical tools first, with data structures and programming concepts as an afterthought
  I don't think this is the best way to learn a language
– After all, we'll be doing complex data analysis
– Going beyond one-off biostatistics learning
I'm going to introduce R the same way I introduced Perl
I hope to show you that R is just as friendly…

What is R?
From the R-project webpage:
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

Where to get R?
Available for a wide variety of platforms... (handled well in Windows too!)
Libraries are available via CRAN (like the CPAN we used before)
The R counterpart of BioPerl is Bioconductor, used extensively for routine and experimental microarray analysis

Let's begin
The default mode for R is an interactive CLI (with emacs keybindings!)
A query – response line mode
– An overgrown calculator
  R evaluates commands through function calls
> 2+2
[1] 4
– A programming language
  Complex data structures
  Control blocks
  Objects!

Symbolic variables
Variable assignment is handled via arrow notation <-
Variables can be examined by simply typing the variable name
The index of the first element shown on each output line is given in brackets, e.g. [1]
Scalar elements can be numerical, character, or boolean
> x<-2
> x
[1] 2
> x + x
[1] 4
> x<-"ACTCGATCGACT"
> x
[1] "ACTCGATCGACT"
> x<-T
> x
[1] TRUE

Vectors
R handles vectors as single objects
R's most common vector types are:
– numerical vectors
– character vectors
– logical vectors
Vectors are created (and treated) as concatenations of scalar elements:
> x<-c(1,2,3,4,5)
> x
[1] 1 2 3 4 5
> x<-c("ACT","TCA","GGA","CCG")
> x
[1] "ACT" "TCA" "GGA" "CCG"
> x<-c(T,T,F,T)
> x
[1]  TRUE  TRUE FALSE  TRUE

Vector element access
Very similar to Perl array element access, except that R indices start at 1
Access by index
– The index itself can be a vector
– Can be an expression (for example a logical test)
– Negative indices denote exclusion
> x<-seq(1,20,2)
> x[5]
[1] 9
> x[c(1,3,4)]
[1] 1 5 7
> x[x>10]
[1] 11 13 15 17 19
> x[c(-1,-2,-3,-4,-5)]
[1] 11 13 15 17 19
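
The indexing rules on this slide can be tried out directly. The following is a minimal sketch rather than part of the original deck; the variable names (fifth, picked, large, tail_part) are made up for illustration.

```r
# Odd numbers from 1 to 19, as on the slide
x <- seq(1, 20, 2)

fifth     <- x[5]            # a single position (R counts from 1)
picked    <- x[c(1, 3, 4)]   # a vector of positions
large     <- x[x > 10]       # a logical expression used as the index
tail_part <- x[-(1:5)]       # negative indices drop those positions

print(fifth)
print(picked)
print(large)
print(tail_part)
```

Note that -(1:5) is simply a shorter way of writing c(-1,-2,-3,-4,-5).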

Vector functions
seq (sequence)
Creates a range of values in a vector
> x<-seq(1,10,1)
> x
[1]  1  2  3  4  5  6  7  8  9 10
> x<-4:12
> x
[1]  4  5  6  7  8  9 10 11 12
> x<-LETTERS[1:3]
> x
[1] "A" "B" "C"

Vector functions
rep (replicate)
Generates repeated values
- Can be used to generate complex patterns
- Can be used to generate data grouping codes
> x<-c(10,100,1000)
> rep(x,3)
[1]   10  100 1000   10  100 1000   10  100 1000
> rep(x,1:3)
[1]   10  100  100 1000 1000 1000
> rep(1:2,c(5,10))
[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
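
As a rough illustration of the "grouping codes" use mentioned above, the sketch below builds group labels with rep(). The labels "control" and "treated" and the group sizes are hypothetical, chosen only for the example.

```r
x <- c(10, 100, 1000)

whole   <- rep(x, times = 3)     # repeat the whole vector three times
doubled <- rep(x, each = 2)      # repeat each element twice in a row
ramp    <- rep(x, times = 1:3)   # element i is repeated i times

# Hypothetical grouping codes: 5 control samples followed by 10 treated ones
groups <- rep(c("control", "treated"), times = c(5, 10))
print(table(groups))
```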

Vector functions
sort
Returns a sorted copy of a vector (the original is not modified; assign the result to keep it)
> x<-c(10000,10,1000)
> sort(x)
[1]    10  1000 10000
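
A small sketch of how sort() behaves in practice: it returns a new vector and leaves its argument untouched. The related base function order(), which is not on the slide, returns the permutation of indices instead of the sorted values.

```r
x <- c(10000, 10, 1000)

sorted     <- sort(x)                     # ascending copy
descending <- sort(x, decreasing = TRUE)  # descending copy
ord        <- order(x)                    # indices that would sort x

print(x)        # unchanged
print(sorted)
print(x[ord])   # same values as sort(x)
```

order() is handy when a second vector (for example gene names) has to be rearranged in the same way.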

Vector functions
factor
Grouping for categorical data
> x<-c(0,1,2,1,2)
> fx<-factor(x,levels=0:2)
> levels(fx)<-c("low","middle","grande")
> fx
[1] low    middle grande middle grande
Levels: low middle grande
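
The same factor can be built in one step with the labels argument of factor(). This is a sketch of an equivalent approach, not the slide's own code; counts and codes are illustrative names.

```r
x <- c(0, 1, 2, 1, 2)

# levels gives the raw codes, labels gives the names attached to them
fx <- factor(x, levels = 0:2, labels = c("low", "middle", "grande"))

counts <- table(fx)       # how many observations fall in each category
codes  <- as.integer(fx)  # underlying integer codes (1-based)

print(fx)
print(counts)
print(codes)
```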

Matrices
Simply two-dimensional arrays (arrays in general can have n dimensions)
– in R, most everything is an array
Can hold elements of any one type
Can dynamically set and change dimensions
– by default a matrix is filled by columns
> x<-seq(1,12)
> dim(x)<-c(3,4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> matrix(1:12,nrow=3,byrow=T)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
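
The difference between column-wise and row-wise filling, and element access by [row, column], can be sketched as follows. The names by_col, by_row and v are illustrative.

```r
by_col <- matrix(1:12, nrow = 3)                # default: filled column by column
by_row <- matrix(1:12, nrow = 3, byrow = TRUE)  # filled row by row

# Setting dim() on a plain vector gives the same result as the default fill
v <- 1:12
dim(v) <- c(3, 4)

print(dim(by_row))    # rows, then columns
print(by_row[2, 3])   # one element: row 2, column 3
print(by_row[, 1])    # a whole column
print(by_row[2, ])    # a whole row
```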

Matrix functions
t (transpose)
Swaps rows and columns
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> t(x)
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

Matrix functions
rownames
Assigns names to the row indices (like hash keys)
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> rownames(x)<-c("one","two","three")
> x
      [,1] [,2] [,3] [,4]
one      1    2    3    4
two      5    6    7    8
three    9   10   11   12

Matrix functions
colnames
Assigns names to the column indices (like hash keys)
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> colnames(x)<-c("one","two","three","four")
> x
     one two three four
[1,]   1   2     3    4
[2,]   5   6     7    8
[3,]   9  10    11   12

Matrix functions
cbind
Joins columns side by side (in the agglomerative sense), as in Excel
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> cbind(x,c(111,222,333))
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4  111
[2,]    5    6    7    8  222
[3,]    9   10   11   12  333

Matrix functions
rbind
Stacks rows on top of each other (in the agglomerative sense), as in Excel
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> rbind(x,c(111,222,333,444))
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]  111  222  333  444
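
To make the contrast between the two binding functions concrete, here is a short sketch that applies both to the same matrix. The names wider and taller are illustrative.

```r
x <- matrix(1:12, nrow = 3, byrow = TRUE)

# cbind needs one value per row (3 here); rbind needs one value per column (4 here)
wider  <- cbind(x, c(111, 222, 333))
taller <- rbind(x, c(111, 222, 333, 444))

print(dim(wider))    # 3 rows, 5 columns
print(dim(taller))   # 4 rows, 4 columns
print(taller)
```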

Object functions
list
Combines collections into composite objects
- Objects are treated as vectors in R, plus methods
- Matrices are collections of vectors
> x<-matrix(1:12,nrow=3,byrow=T)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> y<-c(1:5)
> z<-list(matrix=x,vector=y)
> z
$matrix
     [,1] [,2] [,3] [,4]
...
$vector
[1] 1 2 3 4 5

list (object) functions
indexing
Since it's a vector, we can obtain the elements
> z$vector
[1] 1 2 3 4 5
> z$vector[5]
[1] 5
> z$matrix[1,3]
[1] 3
> z$vector[z$vector>3]
[1] 4 5
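
Besides the $ notation shown above, list components can be reached with double and single brackets. The sketch below, which rebuilds the list z from the previous slide, shows the difference: [[ ]] returns the component itself, while [ ] returns a shorter list.

```r
z <- list(matrix = matrix(1:12, nrow = 3, byrow = TRUE), vector = 1:5)

by_dollar <- z$vector        # component by name
by_double <- z[["vector"]]   # same component, name given as a string
by_single <- z["vector"]     # a list of length 1 that contains the component

print(names(z))
print(by_double)
print(class(by_single))      # "list"
print(z$matrix[1, 3])        # indexing continues inside the component
```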

list (object) functions
data.frame
If the vectors are the same length, we can agglomerate them in a special data matrix
- the data is paired, and has unique row names
> x<-c(1:5)
> y<-c(6:10)
> z<-data.frame(x,y)
> z
  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10

Reading data from files
data.frame / read.table / read.csv / read.delim
> myData<-read.table("example.txt", header=T)
> myData
  field_one field_two
  …         …
– The data frame is ideal for handling delimited files
  By default, read.table assumes a header only when the first line has one fewer entry than the remaining lines; otherwise pass header=T
– Can handle a wide variety of interfaces with outputs
  Tab- and comma-delimited text files
  SPSS, SAS, Stata, Minitab, S-PLUS v3 files
– Works well with DB interface calls as well
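
A self-contained sketch of reading a delimited file follows. Because example.txt is not available here, the code first writes a small tab-delimited file to a temporary location; the file contents and the column values are made up for illustration.

```r
# Write a small, hypothetical tab-delimited file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("field_one\tfield_two",
             "1\tACT",
             "2\tTCA",
             "3\tGGA"), path)

# header = TRUE: the first line holds the column names
myData <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
str(myData)

# read.delim is read.table with defaults suited to tab-delimited files
same <- read.delim(path, stringsAsFactors = FALSE)

unlink(path)  # remove the temporary file
```

Passing stringsAsFactors = FALSE keeps the text column as character data regardless of the R version in use.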

Persistence
save.image() / .RData / ls()
– The workspace is dynamic
  Variables and functions are created or loaded
– objects() or ls() lists both
– The workspace can be saved to a local .RData file using save.image()
– An .RData file in the working directory is loaded by default upon startup
– A specific .RData file (or whatever you named it) can be loaded using load() (you may have to specify the pathname!)
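
The round trip of saving and reloading objects can be sketched as below. To avoid overwriting a real workspace, this example uses save() on two named objects and a temporary file rather than save.image(), which writes the whole workspace to .RData; that substitution is a choice made for this example only.

```r
path <- tempfile(fileext = ".RData")

x <- seq(1, 20, 2)
my.function <- function(v) mean(v)

save(x, my.function, file = path)  # write just these two objects
rm(x, my.function)                 # remove them from the workspace

loaded <- load(path)               # restores the objects, returns their names
print(loaded)
print(my.function(x))

unlink(path)                       # remove the temporary file
```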

data frame (object) functions
subset
Allows extraction of a portion of a data frame
> x<-c(1:5)
> y<-c(6:10)
> z<-data.frame(x,y)
> subset(z,x>2)
  x  y
3 3  8
4 4  9
5 5 10

data frame (object) functions
transform
Allows extension of a data frame
> x<-c(1:5)
> y<-c(6:10)
> z<-data.frame(x,y)
> transform(z,x.log=log(x))
  x  y     x.log
1 1  6 0.0000000
2 2  7 0.6931472
3 3  8 1.0986123
4 4  9 1.3862944
5 5 10 1.6094379

data frame (object) functions
split
Splits a vector into a list of vectors, one per group
> x<-c(1:5)
> y<-c(6:10)
> z<-data.frame(x,y)
> h<-split(z$y,z$x)
> h
$`1`
[1] 6
$`2`
[1] 7
$`3`
[1] 8
$`4`
[1] 9
$`5`
[1] 10
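
split() is easier to appreciate when groups have more than one member. The following sketch uses invented expression values and group labels (expr, group) purely for illustration: the first argument holds the values, the second the grouping.

```r
# Hypothetical measurements and their group labels
expr  <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.8)
group <- c("ctl", "trt", "ctl", "trt", "trt", "ctl")

by_group <- split(expr, group)    # a list with one vector per group

print(by_group)
print(by_group$ctl)               # the values labelled "ctl"
print(sapply(by_group, length))   # group sizes
```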

data frame (object) functions
lapply
Implicit looping over group members
> x<-c(1:5)
> y<-c(6:10)
> z<-data.frame(x,y)
> lapply(z, mean)
$x
[1] 3
$y
[1] 8
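
A brief sketch comparing lapply() with its sibling sapply(): both loop over the columns of the data frame, but lapply() always returns a list while sapply() simplifies the result to a vector when it can. The anonymous function computing a range is only an example of supplying your own function.

```r
z <- data.frame(x = 1:5, y = 6:10)

as_list <- lapply(z, mean)   # list with components $x and $y
as_vec  <- sapply(z, mean)   # named numeric vector

# Any function can be applied, including one written on the spot
ranges <- sapply(z, function(col) max(col) - min(col))

print(as_list)
print(as_vec)
print(ranges)
```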

Functions in R
Very similar to what we've seen in Perl!
- Blocks are the same
- Takes arguments
- Uses control structures (for, if, while loops, ...)
- Returns the value of the last expression evaluated
> x<-c(1:5)
> my.function<-function(x) { u<-mean(x); u }
> y<-my.function(x)
> y
[1] 3

Control structures
for loop
Loops over a set range
> myfunction<-function(x) {
+   for (i in 1:10) {
+     # do something here
+   }
+ }
The variable i will take the values of the sequence in turn
The range is specified by the sequence

A stupid function example
Just to illustrate passing args back and forth…
> myfun<-function(x)
+ {
+   X<-x
+   for (i in 1:10)
+   {
+     X<-c(X,i)
+   }
+   X
+ }
> myfun(0)
[1]  0  1  2  3  4  5  6  7  8  9 10

A better function example
A function to calculate the two-sample t-statistic, showing all the steps. (From
> twosam <- function(y1, y2) {
+   n1 <- length(y1); n2 <- length(y2)
+   yb1 <- mean(y1); yb2 <- mean(y2)
+   s1 <- var(y1); s2 <- var(y2)
+   s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
+   tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))
+   tst
+ }
With this function defined, you could perform two-sample t-tests using a call such as:
> tstat <- twosam(data$male, data$female); tstat
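
One way to check a hand-written statistic like this is to compare it with R's built-in t.test(). The sketch below assumes simulated data (the vectors male and female, their sizes and means are invented) and uses var.equal = TRUE, because twosam pools the two variances.

```r
twosam <- function(y1, y2) {
  n1 <- length(y1); n2 <- length(y2)
  yb1 <- mean(y1); yb2 <- mean(y2)
  s1 <- var(y1); s2 <- var(y2)
  s <- ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)  # pooled variance
  (yb1 - yb2) / sqrt(s * (1 / n1 + 1 / n2))
}

set.seed(1)                       # make the simulated data reproducible
male   <- rnorm(20, mean = 10)    # hypothetical sample 1
female <- rnorm(25, mean = 9)     # hypothetical sample 2

tstat <- twosam(male, female)
ref   <- t.test(male, female, var.equal = TRUE)

print(tstat)
print(unname(ref$statistic))      # should agree with tstat
```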

Control structures
while loop
Loops while an evaluation returns boolean TRUE
> myfunction<-function(x) {
+   while (x>10) {
+     # do something here
+   }
+ }
The evaluation is tested at the beginning of the loop
Note that in this case, the block may never be executed

Control structures
repeat loop
Loops until told to stop by break
> myfunction<-function(x) {
+   repeat {
+     # do something here
+     if (x>10) break
+   }
+ }
Uses a conditional if statement
The break is called when the boolean evaluation is true, and the block is exited
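
The three loop forms can be compared side by side with runnable bodies in place of the "do something here" placeholders. The following is an illustrative sketch; the doubling and summing tasks and all variable names are made up.

```r
# for: the loop variable takes each value of the sequence in turn
total <- 0
for (i in 1:10) {
  total <- total + i
}

# while: the test comes first, so the body may run zero times
x <- 1
steps <- 0
while (x <= 10) {
  x <- x * 2
  steps <- steps + 1
}

never <- 0
while (x <= 10) {        # x is already above 10, so this body is skipped
  never <- never + 1
}

# repeat: the body always runs at least once and must break out itself
y <- 1
repeat {
  y <- y * 2
  if (y > 10) break
}

print(c(total = total, x = x, steps = steps, never = never, y = y))
```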

Descriptive statistics
summary()
Summary statistics for a numeric variable
> x<-rnorm(100)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

Descriptive statistics
plot()
Simple x vs. y plot (scatterplot)
> x<-rnorm(100)
> y<-rnorm(100)
> plot(x,y)
> plot(rnorm(500))
> lines(rnorm(500))

Descriptive statistics
heatmap generation (image)
A scatterplot grid, color-weighted by intensities…
- very useful for microarray analysis (we'll see next time…)
- can be used with dendrogram generation

Statistics of populations
The equations so far are for sample statistics
– A statistic is a single number estimated from a sample
We use the sample to make inferences about the population.
– A parameter is a single number that summarizes some quality of a variable in a population.
– The term for the population mean is μ (mu), and Ȳ (Y bar) is a sample estimator of μ.
– The term for the population standard deviation is σ (sigma), and s is a sample estimator of σ.
Note that μ and σ are both parameters of the normal probability curve.
Source:

Measuring probabilities under the normal curve
We can make transformations by scaling everything with respect to the mean and standard deviation.
Let z = the number of standard deviations above or below the population mean.
– z = 0: y = μ
– z = 1: y = μ ± σ (p ≈ 0.68)
– z = 2: y = μ ± 2σ (p ≈ 0.95)
– z = 3: y = μ ± 3σ (p ≈ 0.997)
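
These probabilities can be reproduced with pnorm(), the cumulative distribution function of the normal curve. The sketch below also shows the z transformation on a made-up example (mean 100, standard deviation 15, observation 130), which is an assumption for illustration only.

```r
# Probability of falling within 1, 2 and 3 standard deviations of the mean
p1 <- pnorm(1) - pnorm(-1)
p2 <- pnorm(2) - pnorm(-2)
p3 <- pnorm(3) - pnorm(-3)
print(round(c(p1, p2, p3), 4))

# z transformation for a hypothetical variable
mu    <- 100
sigma <- 15
y     <- 130
z     <- (y - mu) / sigma         # number of standard deviations above the mean

# The same probability, computed on the original scale and on the z scale
p_raw <- pnorm(y, mean = mu, sd = sigma)
p_z   <- pnorm(z)
print(c(z = z, p_raw = p_raw, p_z = p_z))
```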

Plotting using hist() and curve()
> y<-hist(x,plot=F)
> ylim<-range(0,y$density,dnorm(0))
> hist(x,freq=F,ylim=ylim)
> curve(dnorm(x),add=T)

Difficult to integrate…
But probabilities have been mapped out for this curve.
Transformations from other curves are possible…

Plotting using qqnorm()
> qqnorm(x)

Box plots (box-and-whisker plots, Tukey, 1977)
Figure labels: outliers; upper fence / whisker; Q3; median; Q1; lower fence / whisker; the box spans the IQR
– Upper fence / whisker: min(Q3 + 1.5(IQR), largest X)
– Lower fence / whisker: max(Q1 - 1.5(IQR), smallest X)
Plotting using boxplot()
> boxplot(x)
> boxplot(log(x))
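
The fence rule can be computed by hand and compared with what R reports. This is a sketch under stated assumptions: the data are simulated with one planted extreme value, quartiles come from quantile(), and boxplot.stats() (the function behind boxplot()) uses hinges that can differ slightly from those quartiles in small samples.

```r
set.seed(42)
x <- c(rnorm(50), 6)              # simulated data plus one planted extreme value

q   <- quantile(x, c(0.25, 0.75)) # Q1 and Q3
iqr <- IQR(x)                     # Q3 - Q1

upper_fence <- q[2] + 1.5 * iqr
lower_fence <- q[1] - 1.5 * iqr

# Points beyond the fences are drawn individually as outliers
outliers <- x[x > upper_fence | x < lower_fence]

# Whiskers stop at the most extreme data points still inside the fences
upper_whisker <- max(x[x <= upper_fence])
lower_whisker <- min(x[x >= lower_fence])

stats <- boxplot.stats(x)         # R's own computation, no plot drawn
print(outliers)
print(stats$out)
```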

My advice
First learn to program in R. Then use the R libraries.
Everything in R can be built up piecewise
– The data is made of component parts
  It's extremely useful to know how to handle the objects
– The graphics are made of component parts
  This allows extreme fine-tuning of your visualization!
Go beyond scatterplots and barplots to describe complex data well and visualize hidden trends
A good reference on data visualization is the work of Edward Tufte (for example, The Visual Display of Quantitative Information).

Homework
A simple problem, but one we may use frequently:
Use lapply (or sapply) to simulate the result of taking the mean of 100 random numbers from the normal distribution, for 10 independent samples.