A gentle introduction to R – how to load in data and produce summary statistics BRC MH Bioinformatics group.

1 A gentle introduction to R – how to load in data and produce summary statistics BRC MH Bioinformatics group

2 Tutorial outline
- How to install R on your own computer
  - It's free
  - But it's already installed on these computers
- Loading data from Excel
- Plotting
- Summary statistics

3 Files
Data and slides on: http://core.brc.iop.kcl.ac.uk/brc-bioinformatics-workshop-october-2012

4 Show file extensions

5 Uncheck "Hide extensions for known file types", then click Apply

6 Installing R – skip as already installed

7

8

9 Installing R – skip as already installed
And follow the operating-system-specific installation instructions

10 Starting R on these computers

11 Help files

12 Loading help files
A useful function is read.table(). It allows you to read data from spreadsheets into R. To see its help file you can use:
    ?read.table
You can use ?function_name for any function to see a help file.

13 Loading data into R from Excel

14 From Excel
Open testdata.xls

15 From Excel
You need to save it as a comma-separated values file (.csv): go to File > Save As > Other Formats

16 From Excel

17 R working directory
To open a file you will need to point R towards the folder that contains it. You can do this with setwd(), but we'll do it using the mouse. Suppose you have the file in My Documents.

18 Browsing folders
To check that you are in the right folder, type:
    getwd()
To see the files in this folder, type:
    list.files()
To list the current variables, type:
    ls()
Nothing should be loaded yet.
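The same steps can also be done from the keyboard. This is a minimal sketch, assuming a temporary folder stands in for My Documents; the file name example.csv is made up so that there is something to list.

```r
# Hypothetical folder: a temporary directory stands in for "My Documents".
demo.dir <- tempdir()
old.dir <- setwd(demo.dir)   # setwd() returns the previous working directory

getwd()                      # confirm where R is now looking

# Made-up file, written only so that list.files() has something to show.
writeLines("height,weight", "example.csv")
csv.files <- list.files(pattern = "\\.csv$")  # files ending in .csv
csv.files

ls()                         # variables currently defined in this session

setwd(old.dir)               # go back to where we started
```

Saving the value returned by setwd() makes it easy to return to the original folder afterwards.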

19 Loading data
To follow along with this section, make sure your R working directory is the folder that contains the tutorial data

20 Read the contents of file testdata.csv into an R variable my.data with:
    my.data <- read.csv("testdata.csv")
read.csv is a wrapper for read.table, which lets you specify more details about your file, e.g.:
    my.data <- read.table("testdata.csv", sep=",", header=TRUE)

21 read.table()
- sep: column separator
- header: does the first row of the file contain column headers?
- skip: number of rows to skip at the top of the file
See ?read.table for other useful parameters.
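To see how sep and header relate read.csv to read.table, here is a small self-contained sketch. The file contents are made up and written to a temporary file, so this does not use the tutorial's testdata.csv.

```r
# Made-up data written to a temporary file so the example is self-contained.
csv.file <- tempfile(fileext = ".csv")
writeLines(c("height,weight,gender",
             "1.80,75,M",
             "1.65,60,F"), csv.file)

# read.csv() assumes a comma separator and a header row...
via.csv <- read.csv(csv.file)

# ...which we can spell out explicitly for read.table().
via.table <- read.table(csv.file, sep = ",", header = TRUE)

# Both calls should give the same columns and values for this file.
all.equal(via.csv, via.table)
colnames(via.table)
```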

22 Looking at loaded data

23 Check your new variable is in the R environment:
    ls()
Take a look at the top couple of lines:
    head(my.data)
Generate some basic summary stats:
    summary(my.data)

24 Check the dimensions of your dataset:
    dim(my.data)
Number of rows and columns:
    nrow(my.data)
    ncol(my.data)
Row and column names:
    rownames(my.data)
    colnames(my.data)

25 Subsetting Data

26 Look at the first row:
    my.data[1,]
Look at the first column:
    my.data[,1]
Look at the third column of row 10:
    my.data[10,3]

27 Look at the first column for rows 100 to 110:
    my.data[100:110,1]
Same as above, but save to a variable:
    my.subset <- my.data[100:110,1]
Look at rows 30, 40, 50 and 60:
    my.data[c(30,40,50,60),]
Same as above, but pre-defining the index vector:
    my.indices <- c(30, 40, 50, 60)
    my.data[my.indices,]

28 You can subset on names instead of indices.
Look at the column named 'weight' for row 1:
    my.data[1,"weight"]
Look at the columns named 'height' and 'weight' for row 1:
    my.data[1,c("weight","height")]
Same as above, but pre-defining the colnames vector:
    cols <- c("weight","height")
    my.data[1,cols]

29 Negative indices exclude elements.
Look at all columns except the second for row 1:
    my.data[1,-2]
Extract all rows except 1-100:
    my.new.data <- my.data[-1:-100,]
Extract all rows except 35, 67, 101:
    my.indices <- -1 * c(35, 67, 101)
    my.new.data <- my.data[my.indices,]
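The subsetting rules from the last few slides can be tried on a small data frame. This sketch assumes a toy data frame with made-up values; only the column names (height, weight, gender) follow the tutorial data.

```r
# Toy data frame (made-up values) using the tutorial's column names.
toy <- data.frame(height = c(1.80, 1.65, 1.72, 1.90, 1.58),
                  weight = c(75, 60, 68, 88, 52),
                  gender = c("M", "F", "F", "M", "F"))

toy[2, ]                       # one row, all columns
toy[, 1]                       # one column, all rows (returned as a vector)
toy[2:4, "weight"]             # rows 2 to 4 of the column named "weight"
toy[1, c("weight", "height")]  # several named columns for row 1
toy[-2, ]                      # everything except row 2
toy[-1 * c(1, 5), ]            # everything except rows 1 and 5

kept <- toy[-1:-3, ]           # drop rows 1 to 3, keep the rest
nrow(kept)
```

The row index goes before the comma and the column index after it; leaving one side empty selects everything along that dimension.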

30 Quiz!

31 Quiz questions
- How tall is the person in the 7th row?
- What gender is the person in the 300th row?
- For the people in rows 20-30, who is the heaviest?
- For the people in rows 110, 350, 219, 74, who is the tallest?
- Save all rows except 500-600 in a variable my.new.data
- How many males and females are in this new dataset?

32 Formatting problems

33 Data isn't comma-separated?
Specify the separator in read.table. Tab-delimited text is another common format, for which you can use sep="\t".
Load "testdata.txt", a tab-delimited version of the data.

34 Data has extra header information at the top?
Either delete this data in Excel before exporting to csv, or use the skip=N argument to read.table.
Have a look at "testdata_1.csv" in Excel and then load it into R using read.table.
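A short sketch of both fixes together: a tab-delimited file with two extra lines above the column headers. The file contents here are made up and written to a temporary file; the tutorial's own files may be laid out differently.

```r
# Made-up tab-delimited file with two lines of notes before the header row.
tsv.file <- tempfile(fileext = ".txt")
writeLines(c("Exported from a made-up spreadsheet",
             "Collected 2012",
             "height\tweight\tgender",
             "1.80\t75\tM",
             "1.65\t60\tF"), tsv.file)

# skip = 2 ignores the two note lines; sep = "\t" splits columns on tabs.
messy <- read.table(tsv.file, sep = "\t", header = TRUE, skip = 2)
dim(messy)
colnames(messy)
```

Opening the file in a text editor first is an easy way to count how many lines need skipping.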

35 Factors are inconsistently named
R will just read in the data you give it. If you aren't consistent in naming the levels of your factors, R will see them as different levels.
R is case sensitive: 'MyLevel' != 'mylevel'
Load the data from testdata_2.csv and have a look at the gender variable. Try to fix the problems in Excel and reload.
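The slide suggests fixing the labels in Excel; as an alternative, the same clean-up can be sketched in R. The inconsistent entries below are made up for illustration and may not match the problems in testdata_2.csv.

```r
# Made-up gender column with inconsistent case, a stray space and a long form.
gender <- c("M", "m", "F", "f ", "Male", "F")
table(gender)                        # every spelling is counted separately

cleaned <- toupper(trimws(gender))   # strip stray spaces, unify the case
cleaned[cleaned == "MALE"] <- "M"    # map the long form onto the short code
cleaned <- factor(cleaned)

levels(cleaned)                      # only the intended levels remain
table(cleaned)
```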

36 Measurements and units in a single column
If you store values like 10kg, R will not interpret this as a numeric column.
Try loading file 'testdata_3.csv'. What has happened to the weights and heights information?
Try loading again so that the two are loaded as character vectors. Have a look at the sub() function and see if you can fix the problem.
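One possible approach with sub() is sketched below. The values and the unit suffixes (cm, kg) are assumptions made for illustration; check what testdata_3.csv actually contains before reusing the patterns.

```r
# Made-up file in which each measurement carries its unit.
csv.file <- tempfile(fileext = ".csv")
writeLines(c("height,weight",
             "180cm,75kg",
             "165cm,60kg"), csv.file)

# Load the columns as character vectors rather than factors.
raw <- read.csv(csv.file, stringsAsFactors = FALSE)
is.numeric(raw$weight)        # the units stop R treating this as a number

# sub() replaces the first match of a pattern; "$" anchors it to the end.
raw$weight <- as.numeric(sub("kg$", "", raw$weight))
raw$height <- as.numeric(sub("cm$", "", raw$height))
summary(raw)
```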

37 Excel has just screwed up your data
Older versions of Excel have a limit of 65536 rows. If you open a larger dataset in Excel it will be truncated, and if you then save it you will be saving the truncated version. Avoid opening large datasets in Excel; use R instead.
Excel also tries to be helpful by formatting elements for you. Try the following, then open the file in Excel, save as csv and reload into R. What has happened?
    my.genes <- c('MASH1','SOX2','OCT4')
    write.csv(my.genes, file='mygenes.csv')

38 Plotting

39 Drawing histograms
    hist(my.data$weight)
Optional exercises:
1) Try drawing a histogram of height
2) Try to label the x axis [hint: read the help file]

40 Drawing normal QQ plots
    qqnorm(my.data$weight); qqline(my.data$weight)

41 Drawing scatterplots
    plot(height~weight,data=my.data)
Optional exercise: try this one too. Do you understand the plot?
    plot(height~weight,data=my.data,col=as.numeric(factor(gender)))

42 Drawing boxplots
    boxplot(height~gender,data=my.data)

43 Saving plots
JPEGs:
    jpeg("boxplot.jpg")
    boxplot(height~gender,data=my.data)
    dev.off()
PDFs:
    pdf("boxplot.pdf")
    boxplot(height~gender,data=my.data)
    dev.off()
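The pattern is always: open a file device, draw, then close the device with dev.off(). Here is a self-contained sketch using a toy data frame with made-up values and a temporary output file rather than the tutorial data.

```r
# Toy data (made-up values) with the tutorial's column names.
toy <- data.frame(height = c(1.80, 1.65, 1.72, 1.90, 1.58, 1.76),
                  gender = factor(c("M", "F", "F", "M", "F", "M")))

out.file <- tempfile(fileext = ".pdf")  # temporary path, used only for the demo

pdf(out.file)                           # open the PDF device
boxplot(height ~ gender, data = toy)    # the plot is drawn into the file
dev.off()                               # close the device so the file is written

file.exists(out.file)
```

If dev.off() is forgotten, the file usually stays incomplete and later plots keep going to it instead of the screen.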

44 Summary statistics

45 Functions covered
read.table(), head(), dim(), write.table(), mean(), sd(), cor(), cor.test(), t.test(), shapiro.test(), wilcox.test(), kruskal.test(), lm(), anova(), coefficients(), fitted(), residuals()
NB: to find help type ?function, e.g. ?cor
http://www.statmethods.net/index.html

46 Writing tables
    my.data <- read.table("testdata.csv",header=TRUE,sep=",")
Here we have height and weight for males and females. We now want to calculate Body Mass Index and save the data as *.csv:
    my.data$BMI <- my.data$weight/(my.data$height^2)
    head(my.data)
    write.table(my.data,file="my_testdata.csv",sep=",",quote=FALSE,row.names=FALSE)
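A quick way to check that a table was written correctly is to read it straight back in. This sketch uses made-up values and assumes height is in metres and weight in kilograms, which the slides do not state.

```r
# Toy data (made-up); units are an assumption: metres and kilograms.
toy <- data.frame(height = c(1.80, 1.65),
                  weight = c(75, 60))

# Body Mass Index = weight / height squared.
toy$BMI <- toy$weight / (toy$height^2)

# Write to a temporary csv file, without quotes or row names.
out.file <- tempfile(fileext = ".csv")
write.table(toy, file = out.file, sep = ",", quote = FALSE, row.names = FALSE)

# Read the file back and compare it with what we wrote.
reloaded <- read.csv(out.file)
all.equal(reloaded$BMI, toy$BMI)
```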

47 Calculate mean and SD
    mean_height <- mean(my.data$height)
    sd_height <- sd(my.data$height)
    mean_height
    sd_height
Try this with the other phenotypes.
Now let's get the mean and SD for just the males:
    mean_height_M <- mean(my.data$height[my.data$gender=="M"])
    sd_height_M <- sd(my.data$height[my.data$gender=="M"])
    mean_height_M
    sd_height_M
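Subsetting one group at a time works well for two groups. As an alternative that is not covered in the slides, base R's tapply() computes a statistic for every group in one call. The sketch below uses made-up values and checks that both routes agree.

```r
# Toy data (made-up values) with the tutorial's column names.
toy <- data.frame(height = c(1.80, 1.65, 1.72, 1.90, 1.58, 1.76),
                  gender = c("M", "F", "F", "M", "F", "M"))

# The approach from the slide: subset first, then summarise.
mean.M <- mean(toy$height[toy$gender == "M"])
sd.M   <- sd(toy$height[toy$gender == "M"])

# Alternative: apply the function to each gender group in one go.
group.means <- tapply(toy$height, toy$gender, mean)
group.sds   <- tapply(toy$height, toy$gender, sd)
group.means
group.sds

# Both routes should give the same answer for the males.
all.equal(as.numeric(group.means["M"]), mean.M)
```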

48 Correlate phenotypes and test for group differences
You can use the cor() function to produce correlations and t.test() to test for group differences.
    cor(my.data$height,my.data$weight)
    cor.test(my.data$height,my.data$weight)
The t-test assesses whether the means of two groups are statistically different from each other:
    t.test(height~gender,data=my.data)

49 It is always important to check model assumptions before making statistical inferences
Test for normality (try this for all phenotypes):
    shapiro.test(my.data$height)
Non-parametric alternatives to the t-test: R provides functions for carrying out the Mann-Whitney U, Wilcoxon signed-rank and Kruskal-Wallis tests.
    wilcox.test(height~gender,data=my.data)
    kruskal.test(height~gender,data=my.data)
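Each of these test functions returns an object whose parts, such as the p-value, can be pulled out with $. The sketch below runs on simulated heights; the group means and standard deviations are made-up numbers chosen only for illustration.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated heights for 20 males and 20 females (made-up parameters).
toy <- data.frame(height = c(rnorm(20, mean = 1.78, sd = 0.07),
                             rnorm(20, mean = 1.65, sd = 0.06)),
                  gender = rep(c("M", "F"), each = 20))

# Normality check on the phenotype.
norm.check <- shapiro.test(toy$height)
norm.check$p.value

# Parametric test and its non-parametric alternative on the same data.
t.res <- t.test(height ~ gender, data = toy)
w.res <- wilcox.test(height ~ gender, data = toy)
c(t.test = t.res$p.value, wilcoxon = w.res$p.value)
```

With two groups, wilcox.test() with a formula performs the Mann-Whitney U test (also called the Wilcoxon rank-sum test).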

50 Linear regression
    fit <- lm(height~gender+BMI,data=my.data)
    summary(fit)
Anova table:
    anova(fit)
Other useful functions:
    coefficients(fit)         # model coefficients
    confint(fit, level=0.95)  # CIs for model parameters
    fitted(fit)               # predicted values
    residuals(fit)            # residuals
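To see how these helper functions relate to each other, here is a sketch on simulated data. The effect sizes and noise levels are made up and say nothing about the tutorial dataset.

```r
set.seed(42)  # make the simulated data reproducible
n <- 40

# Simulated predictors (made-up parameters).
toy <- data.frame(gender = rep(c("F", "M"), each = n / 2),
                  BMI = rnorm(n, mean = 24, sd = 3))

# Simulated outcome: males are taller on average, plus random noise.
toy$height <- 1.65 + 0.12 * (toy$gender == "M") + rnorm(n, sd = 0.05)

fit <- lm(height ~ gender + BMI, data = toy)

coefficients(fit)           # intercept, gender effect and BMI slope
confint(fit, level = 0.95)  # confidence intervals for each coefficient

# Fitted values plus residuals give back the observed heights.
all.equal(unname(fitted(fit) + residuals(fit)), toy$height)
```

The coefficient named genderM is the estimated difference between males and the reference level F, holding BMI fixed.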

