A gentle introduction to R – how to load in data and produce summary statistics BRC MH Bioinformatics group.

Tutorial outline
How to install R on your own computer
– It's free
– But it's already installed on these computers
Loading data from Excel
Plotting
Summary statistics

Files
Data and slides on: http://core.brc.iop.kcl.ac.uk/brc-bioinformatics-workshop-october-2012

Show file extensions

Uncheck "Hide extensions for known file types", then click Apply.

Installing R – skip as already installed

Follow the operating-system-specific installation instructions.

Starting R on these computers

Help files

Loading help files
A useful function is read.table(). It allows you to read data from spreadsheets into R. To see its help file you can use:
    ?read.table
You can use ?function_name for any function to see a help file.

Loading data into R from Excel

From Excel
Open testdata.xls.

From Excel
You need to save it as a comma-separated values file (.csv): go to File > Save As > Other Formats.

R working directory
To open a file you will need to point R towards the folder that contains it. You can do this with setwd(), but we'll do it using the mouse. Suppose you have the file in My Documents.

Browsing folders
To check that you are in the right folder type:
    getwd()
To see files in this folder you can type:
    list.files()
To list the current variables type:
    ls()
Nothing should be loaded yet.
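The same steps can be done entirely from the keyboard with setwd(). The sketch below is only an illustration: it uses a temporary folder (an assumption, standing in for wherever your data actually lives) so that it runs on any machine.

```r
# Hypothetical folder for illustration: a temporary directory stands in
# for the folder that holds your data (e.g. My Documents).
data.dir <- tempdir()

# setwd() changes the working directory and returns the previous one.
old.dir <- setwd(data.dir)

getwd()        # confirm the current working directory
list.files()   # files in the working directory
ls()           # variables currently defined in the R session

# Switch back to where we started.
setwd(old.dir)
```

In your own session you would replace data.dir with the path to the folder containing testdata.csv.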

Loading data
To follow along with this section, make sure your R working directory is the folder that contains the tutorial data.

Read the contents of file testdata.csv into an R variable my.data with:
    my.data <- read.csv("testdata.csv")
read.csv is a wrapper for read.table. read.table lets you specify more details about your file, e.g.:
    my.data <- read.table("testdata.csv", sep=",", header=TRUE)

read.table()
sep: Column separator
header: Does the first row of the file contain column headers?
skip: Number of rows to skip at the top of the file
See ?read.table for other useful parameters.
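To see how sep, header and skip work together, here is a small sketch. The file contents are made up for illustration and written to a temporary file, so nothing here depends on the tutorial data.

```r
# Write a small made-up file: two lines of notes, then a header and data.
demo.file <- tempfile(fileext = ".csv")
writeLines(c("exported from a spreadsheet",
             "units: cm and kg",
             "height,weight,gender",
             "170,65,F",
             "182,80,M"),
           demo.file)

# skip = 2 ignores the first two lines; header = TRUE uses the next line
# as column names; sep = "," splits fields on commas.
demo.data <- read.table(demo.file, sep = ",", header = TRUE, skip = 2)
demo.data
```

Without skip = 2, read.table would try to treat the notes as data and most likely stop with an error about the number of columns.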

Looking at loaded data

Check your new variable is in the R environment:
    ls()
Take a look at the top couple of lines:
    head(my.data)
Generate some basic summary stats:
    summary(my.data)

Check the dimensions of your dataset:
    dim(my.data)
Number of rows and columns:
    nrow(my.data)
    ncol(my.data)
Row and column names:
    rownames(my.data)
    colnames(my.data)

Subsetting Data

Look at the first row:
    my.data[1,]
Look at the first column:
    my.data[,1]
Look at the third column of row 10:
    my.data[10,3]

Look at the first column for rows 100 to 110:
    my.data[100:110,1]
Same as above, but save to a variable:
    my.subset <- my.data[100:110,1]
Look at rows 30, 40, 50 and 60:
    my.data[c(30,40,50,60),]
Same as above, but pre-defining the index vector:
    my.indices <- c(30, 40, 50, 60)
    my.data[my.indices,]

You can subset on names instead of indices.
Look at the column named 'weight' for row 1:
    my.data[1,"weight"]
Look at the columns named 'height' and 'weight' for row 1:
    my.data[1,c("weight","height")]
Same as above, but pre-defining the colnames vector:
    cols <- c("weight","height")
    my.data[1,cols]

Negative indices exclude elements.
Look at all columns except the second for row 1:
    my.data[1,-2]
Extract all rows except 1-100:
    my.new.data <- my.data[-1:-100,]
Extract all rows except 35, 67, 101:
    my.indices <- -1 * c(35, 67, 101)
    my.new.data <- my.data[my.indices,]
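The subsetting rules above can be tried on a toy data frame. The values below are invented for illustration; only the column names (height, weight, gender) follow the tutorial data.

```r
# Toy data frame, invented for illustration only.
toy <- data.frame(height = c(1.60, 1.75, 1.82, 1.68),
                  weight = c(55, 70, 85, 62),
                  gender = c("F", "M", "M", "F"))

toy[2, ]                             # second row, all columns
toy[, "weight"]                      # one column, selected by name
toy[c(1, 3), c("height", "weight")]  # rows 1 and 3, two named columns
toy[-2, ]                            # every row except the second
toy[-(1:2), ]                        # every row except rows 1 and 2

# Combine subsetting with which.max() to find the tallest of rows 2 to 4.
rows <- 2:4
tallest <- rows[which.max(toy$height[rows])]
tallest
```

which.max() returns a position within the subset, so it is mapped back through rows to get the row number in the full data frame.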

How tall is the person in the 7th row?
What gender is the person in the 300th row?
For the people in rows 20-30, who is the heaviest?
For the people in rows 110, 350, 219, 74, who is the tallest?
Save all rows except 500-600 in a variable my.new.data.
How many males and females are in this new dataset?

Formatting problems

Data isn't comma-separated?
Specify the separator in read.table. Tab-delimited text is another common format, for which you can use sep="\t".
Load "testdata.txt", a tab-delimited version of the data.

Data has extra header information at the top?
Either delete this data in Excel before exporting to csv, or use the skip=N argument to read.table.
Have a look at "testdata_1.csv" in Excel and then load it into R using read.table.

Factors are inconsistently named
R will just read in the data you give it. If you aren't consistent when naming the levels of your factors, it will see them as different levels. R is case sensitive: 'MyLevel' != 'mylevel'.
Load the data from testdata_2.csv and have a look at the gender variable. Try to fix the problems in Excel and reload.
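The exercise asks you to fix the labels in Excel. As an alternative (not part of the original exercise), the same clean-up can be sketched in R. The labels below are made up, and the actual problems in testdata_2.csv may differ.

```r
# Made-up gender column with inconsistent capitalisation and stray spaces.
gender.raw <- c("M", "m", "F", "f ", " M", "F")
table(gender.raw)   # R counts each distinct spelling separately

# Strip surrounding whitespace and convert to upper case so labels agree.
gender.clean <- factor(toupper(trimws(gender.raw)))
table(gender.clean)
levels(gender.clean)
```

Comparing table() before and after is a quick way to check that only the intended levels remain.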

Measurements and units in a single column
If you store values like 10kg, R will not interpret this as a numeric column.
Try loading file 'testdata_3.csv'. What has happened to the weight and height information?
Try loading again so that the two are loaded as character vectors.
Have a look at the sub() function and see if you can fix the problem.
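One possible approach with sub() is sketched below. The data are invented and the unit suffixes ("kg" and "m") are assumptions; check what testdata_3.csv actually contains before adapting it.

```r
# Made-up data in which each value carries its unit (assumed suffixes).
csv.text <- "weight,height\n70kg,1.75m\n55kg,1.60m"

# stringsAsFactors = FALSE keeps the columns as character vectors.
units.data <- read.csv(text = csv.text, stringsAsFactors = FALSE)

# sub() replaces the first match of a pattern; here it deletes the unit.
# The "$" anchors the pattern to the end of the string.
units.data$weight <- as.numeric(sub("kg$", "", units.data$weight))
units.data$height <- as.numeric(sub("m$", "", units.data$height))

str(units.data)   # both columns are now numeric
```

If any value fails to convert, as.numeric() returns NA with a warning, which is a useful signal that some entries use a different format.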

Excel has just screwed up your data
Older versions of Excel have a limit of 65536 rows. If you open a larger dataset in Excel it will be truncated. If you then save this dataset you will be saving the truncated version. Avoid opening large datasets in Excel; use R instead.
Excel tries to be helpful by formatting elements for you. Try the following, then open the file in Excel, save as csv and reload into R. What has happened?
    my.genes <- c('MASH1','SOX2','OCT4')
    write.csv(my.genes, file='mygenes.csv')

Plotting

Drawing histograms
    hist(my.data$weight)
Optional exercises:
1) Try drawing a histogram of height
2) Try to label the x axis [hint: read the help file]

Drawing normal QQ plots
    qqnorm(my.data$weight)
    qqline(my.data$weight)

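Drawing scatterplots
    plot(height~weight,data=my.data)
Optional exercise: try this. Do you understand this plot?
    plot(height~weight,data=my.data,col=as.numeric(factor(gender)))
The factor() call is needed when gender has been read in as text, which is the default in R 4.0 and later.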

Drawing boxplots
    boxplot(height~gender,data=my.data)

Saving plots
JPEGs:
    jpeg("boxplot.jpg")
    boxplot(height~gender,data=my.data)
    dev.off()
PDFs:
    pdf("boxplot.pdf")
    boxplot(height~gender,data=my.data)
    dev.off()
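The pattern is always the same: open a graphics device, draw, then close the device with dev.off(). Here is a self-contained sketch using made-up data and a temporary file name (both assumptions for illustration).

```r
# Toy data, invented for illustration only.
toy <- data.frame(height = c(1.60, 1.75, 1.82, 1.68, 1.79, 1.63),
                  gender = factor(c("F", "M", "M", "F", "M", "F")))

# Open a PDF device; everything drawn next goes into this file.
plot.file <- file.path(tempdir(), "toy_boxplot.pdf")
pdf(plot.file)
boxplot(height ~ gender, data = toy, xlab = "Gender", ylab = "Height")

# The file is only completed once the device is closed.
dev.off()

file.exists(plot.file)
```

Forgetting dev.off() is a common cause of empty or unreadable plot files.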

Summary statistics

Functions covered
read.table(), head(), dim(), write.table(), mean(), sd(), cor(), cor.test(), t.test(), shapiro.test(), wilcox.test(), kruskal.test(), lm(), anova(), coefficients(), fitted(), residuals()
NB: to find help type ?function, e.g. ?cor
http://www.statmethods.net/index.html

Writing tables
    my.data <- read.table("testdata.csv", header=TRUE, sep=",")
Here we have height and weight for males and females. We now want to calculate Body Mass Index (BMI) and save the data as *.csv:
    my.data$BMI <- my.data$weight/(my.data$height^2)
    head(my.data)
    write.table(my.data, file="my_testdata.csv", sep=",", quote=FALSE, row.names=FALSE)
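As a quick check of the BMI formula and the write.table() arguments, here is a sketch on two made-up people. It assumes weight is in kilograms and height in metres, which is the usual convention for BMI but is not stated for the tutorial data.

```r
# Toy data, invented for illustration (assumed units: metres and kg).
toy <- data.frame(height = c(1.60, 1.80),
                  weight = c(64, 81),
                  gender = c("F", "M"))

# BMI = weight / height^2
toy$BMI <- toy$weight / toy$height^2

# quote = FALSE leaves text unquoted; row.names = FALSE drops the row index.
out.file <- tempfile(fileext = ".csv")
write.table(toy, file = out.file, sep = ",", quote = FALSE, row.names = FALSE)

# Read the file back in to confirm it round-trips.
check <- read.csv(out.file)
check
```

Reading the file straight back in is a cheap way to confirm that the separator and header settings produced what you expected.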

Calculate mean and SD
    mean_height <- mean(my.data$height)
    sd_height <- sd(my.data$height)
    mean_height
    sd_height
Try this with the other phenotypes.
Now let's get the mean and SD for just the males:
    mean_height_M <- mean(my.data$height[my.data$gender=="M"])
    sd_height_M <- sd(my.data$height[my.data$gender=="M"])
    mean_height_M
    sd_height_M
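Repeating the subsetting for every group gets tedious. tapply() is one way to compute a statistic for each group in a single call. It is not used in the slides, so treat this as an optional extra; the data are made up.

```r
# Toy data, invented for illustration only.
toy <- data.frame(height = c(1.60, 1.70, 1.80, 1.90),
                  gender = c("F", "F", "M", "M"))

# tapply(values, groups, function) applies the function within each group.
group.means <- tapply(toy$height, toy$gender, mean)
group.sds   <- tapply(toy$height, toy$gender, sd)

group.means
group.sds
```

The result is named by group, so group.means["M"] gives the same number as the subsetting approach for males.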

Correlate phenotypes and test for group differences
You can use the cor() function to produce correlations and t.test() to test for group differences:
    cor(my.data$height, my.data$weight)
    cor.test(my.data$height, my.data$weight)
The t-test assesses whether the means of two groups are statistically different from each other:
    t.test(height~gender, data=my.data)
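Both cor.test() and t.test() return objects whose parts can be extracted by name, which is handy when you want the p-value or estimate on its own. A small sketch on made-up data:

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  height = c(1.58, 1.62, 1.65, 1.70, 1.74, 1.78, 1.82, 1.85),
  weight = c(52, 57, 60, 66, 72, 75, 83, 88),
  gender = factor(c("F", "F", "F", "F", "M", "M", "M", "M")))

# Correlation coefficient on its own, then the full test.
r  <- cor(toy$height, toy$weight)
ct <- cor.test(toy$height, toy$weight)
ct$estimate   # the same correlation coefficient
ct$p.value    # p-value for the null hypothesis of zero correlation

# Two-sample t-test (Welch's version by default) comparing heights by gender.
tt <- t.test(height ~ gender, data = toy)
tt$estimate   # mean height in each group
tt$p.value
```

names(tt) lists everything else that is stored in the result.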

It is always important to check model assumptions before making statistical inferences.
Test for normality (try this for all phenotypes):
    shapiro.test(my.data$height)
Non-parametric alternatives to the t-test
R provides functions for carrying out the Mann-Whitney U, Wilcoxon signed-rank and Kruskal-Wallis tests:
    wilcox.test(height~gender, data=my.data)
    kruskal.test(height~gender, data=my.data)

Linear regression
    fit <- lm(height~gender + BMI, data=my.data)
    summary(fit)
Anova table:
    anova(fit)
Other useful functions:
    coefficients(fit)         # model coefficients
    confint(fit, level=0.95)  # CIs for model parameters
    fitted(fit)               # predicted values
    residuals(fit)            # residuals
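To see how these helper functions relate to each other, here is a sketch that fits the same kind of model to made-up data. The numbers are invented and carry no biological meaning.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  height = c(1.58, 1.62, 1.65, 1.70, 1.74, 1.78, 1.82, 1.85),
  weight = c(52, 57, 60, 66, 72, 75, 83, 88),
  gender = factor(c("F", "F", "F", "F", "M", "M", "M", "M")))
toy$BMI <- toy$weight / toy$height^2

# Same model formula as in the tutorial.
fit <- lm(height ~ gender + BMI, data = toy)

coefficients(fit)          # intercept, effect of being male, slope for BMI
confint(fit, level = 0.95) # 95% confidence intervals for each coefficient

# Residuals are the observed values minus the fitted (predicted) values.
manual.resid <- toy$height - fitted(fit)
all.equal(unname(residuals(fit)), unname(manual.resid))
```

The coefficient named genderM is the estimated difference between males and the reference level (F, the first level alphabetically), holding BMI fixed.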
