1 Introduction to R: A Language and Environment for Statistical Computing, Graphics & Bioinformatics. Lecture 4

Michael Shmoish (mshmoish@cs.technion.ac.il)
Bioinformatics Knowledge Unit, The Lorry I. Lokey Interdisciplinary Center for Life Sciences and Engineering, Technion - IIT

2 R as a set of statistical tables
Distribution     R name   Additional arguments
Normal           norm     mean, sd
Uniform          unif     min, max
Hypergeometric   hyper    m, n, k
Poisson          pois     lambda
Student's t      t        df, ncp
...

3 4 functions: d-, p-, q-, r-
• For each probability distribution available in R there are at least 4 basic functions whose names differ only in the first character:
• dnorm, dunif, dhyper, … - density
• pnorm, punif, phyper, … - cumulative distribution function (CDF); p- stands for probability, and these are the functions used to compute p-values
• qnorm, qunif, qhyper, … - quantile (inverse of the CDF)
• rnorm, runif, rhyper, … - random (simulate random deviates)
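The four prefixes can be tried side by side on one distribution. The following is a minimal sketch using the normal and uniform families from the table; the variable names p, q and u and the seed value are illustrative choices, not part of the lecture.

```r
# d-: density, the height of the standard normal curve at x = 0
dnorm(0, mean = 0, sd = 1)

# p-: CDF, i.e. P(X <= 1.96) for a standard normal variable
p <- pnorm(1.96)

# q-: quantile, the inverse of the CDF, so this recovers 1.96
q <- qnorm(p)

# r-: random deviates, five draws from Uniform(0, 1)
# (set.seed() is only here to make the draws reproducible)
set.seed(1)
u <- runif(5, min = 0, max = 1)
```

Because q- inverts p-, qnorm(pnorm(z)) returns z up to floating-point error, which makes a quick sanity check for any of the families listed.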

4 Hypergeometric: enrichment

5 Example 1.
Given 100 genomic sequences, of which 50 are of viral origin and 50 are non-viral. Searching ('grep'!) for a certain motif of interest, a researcher found it in 25 of the 100 sequences, and 18 of those 25 turned out to be viral genes. Is there any evidence that this motif is over-represented in viral genes?
> dhyper(18, 50, 50, 25)   ### the probability of getting exactly 18 white balls when drawing 25 balls blindly from a basket holding 50 white and 50 black balls (without putting them back, unlike the binomial case)
[1] 0.007435557
### Is that all? Not enough. We have to check the probability of getting '18 or more'.

6 Hypergeometric: dhyper and phyper
To check the probability of getting '18 or more':
> dhyper(18:25, 50, 50, 25)   ### use 'round(…, 4)' to get nice numbers
[1] 7.435557e-03 1.992302e-03 4.117425e-04 6.393517e-05 7.172611e-06 5.457421e-07 2.505959e-08 5.212394e-10
> sum(dhyper(18:25, 50, 50, 25))
[1] 0.009911281
What we've just computed is the upper tail of the distribution, i.e. 1 - CDF(17):
> phyper(17, 50, 50, 25, lower.tail = FALSE)   ### p-value for getting more than 17 (i.e., '18 or more')
[1] 0.009911281
> barplot(dhyper(0:25, 50, 50, 25), ylim = c(0, 0.2), main = "Prob. of Overlap")
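The '18 or more' calculation can be wrapped in a small helper so that the off-by-one in phyper (pass x - 1, not x) is written only once. This is a sketch: enrichment_p is a hypothetical name introduced here for illustration, not an R or package function.

```r
# Hypothetical helper (not part of R): P(X >= x) for a hypergeometric X,
# where m = white balls, n = black balls, k = number of balls drawn.
enrichment_p <- function(x, m, n, k) {
  # lower.tail = FALSE gives P(X > x - 1), which equals P(X >= x)
  phyper(x - 1, m, n, k, lower.tail = FALSE)
}

# Example 1: 18 or more viral genes among the 25 motif-containing sequences
p_tail  <- enrichment_p(18, 50, 50, 25)

# Two equivalent ways of getting the same number
p_sum   <- sum(dhyper(18:25, 50, 50, 25))
p_compl <- 1 - phyper(17, 50, 50, 25)

all.equal(p_tail, p_sum)
all.equal(p_tail, p_compl)
```

For very small tails, the lower.tail = FALSE form is generally preferable to 1 - phyper(…), since the subtraction can lose precision.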

7 Hypergeometric: dhyper

8 Exercise: dhyper and phyper
1) Under the conditions of Example 1:
a) What is the probability of getting exactly 13 viral genes?
b) What is the probability of getting exactly 13 non-viral genes?
c) What is the probability of getting '5 or fewer' viral genes? Compute it both with phyper and with sum(dhyper(…)), and then compare.
2) Generate 100 uniform random values in the range [0, 1], keep them in a vector runi and draw them as a time series.

9 Statistical tests

10 Student’s t-test

11 Student's t-test
> x <- rnorm(50)
> y <- runif(30)
> t.test(x, y)   ### by default: unpaired, unequal variance (Welch), two-sided
Welch Two Sample t-test
data: x and y
t = -4.1209, df = 68.356, p-value = 0.0001042
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.7818587 -0.2717309
….
> names(t.test(x, y))
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "alternative" "method" "data.name"
> t.test(x, y)$p.val
[1] 0.0001041911
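Since t.test returns a list, it is usually convenient to store the result once and then pick out its components. The sketch below assumes a fixed seed (42, chosen arbitrarily) so that it can be rerun; the numbers will therefore differ from those on the slide.

```r
set.seed(42)          # arbitrary seed, for reproducibility only
x <- rnorm(50)
y <- runif(30)

# Store the result instead of calling t.test() repeatedly
tt <- t.test(x, y)    # default: Welch test, unpaired, two-sided

tt$statistic          # the t statistic
tt$parameter          # the (fractional) Welch degrees of freedom
tt$p.value            # full name; $p.val works through partial matching
tt$conf.int           # 95 percent confidence interval for the difference in means

# The classical Student's test assumes equal variances
tt_eq <- t.test(x, y, var.equal = TRUE)
tt_eq$method
tt_eq$parameter       # n1 + n2 - 2 = 78 degrees of freedom
```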

12 Kolmogorov-Smirnov test

13 Kolmogorov-Smirnov test
> x <- rnorm(50)
> y <- runif(30)
> kst = ks.test(x, y)   # Do x and y come from the same distribution?
> kst
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.56, p-value = 6.303e-06
alternative hypothesis: two-sided
> names(kst)
[1] "statistic" "p.value" "alternative" "method" "data.name"
> kst$p.val
[1] 6.303116e-06
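Besides comparing two samples, ks.test can compare one sample against a named theoretical CDF. The following sketch shows both forms; the seed and the variable names ks2 and ks1 are illustrative assumptions, and the printed values will not match the slide.

```r
set.seed(42)                   # arbitrary seed, for reproducibility only
x <- rnorm(50)
y <- runif(30)

# Two-sample test: do x and y come from the same distribution?
ks2 <- ks.test(x, y)
ks2$statistic                  # D, the largest gap between the two empirical CDFs
ks2$p.value

# One-sample test: is y compatible with Uniform(0, 1)?
# The second argument names a CDF function, here "punif".
ks1 <- ks.test(y, "punif")
ks1$p.value
```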

14 Overview of R-package installation
• Open the R console
• Open the 'Packages' drop-down list under RGui
• Choose 'Set CRAN mirror' (then choose a mirror and click OK)
> chooseCRANmirror()   ### automatically appears
• Choose repositories: CRAN is the default, usually one adds 'BioC software' etc., then click OK (clicking 'Cancel' prompts the dialog within the console)
> setRepositories()   ### automatically appears
• Install a package from repositories (could take time!)
• Update a package
• Install a package from a local zip file
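The menu steps above have console equivalents. Below is a minimal sketch: install_if_missing is a hypothetical helper written for this example, and the mirror URL is an assumption (any CRAN mirror would do). The update and local-file variants are left as comments because they need network access or a real file path.

```r
# Hypothetical helper (not part of R): install a package only if it is absent,
# then report whether it can be loaded.
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)   # downloads from the chosen mirror
  }
  requireNamespace(pkg, quietly = TRUE)
}

# 'stats' ships with R, so nothing is downloaded here
install_if_missing("stats")

# Other console equivalents of the menu items (not run here):
# install_if_missing("seqinr")                         # install from CRAN
# update.packages()                                    # update installed packages
# install.packages("path/to/pkg.zip", repos = NULL)    # local zip; placeholder path
```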

15 Package installation (mirror)

16 Package installation (mirror)

17 Package installation (repository)

18 Package installation (repository)

19 Package installation (install)

20 Package installation (seqinR)

21 Package loading (seqinR)
> library(seqinr)   ### signals an error if the package is not installed
OR
> require(seqinr)   ### returns FALSE (with a warning) if the package is not installed; designed for use inside functions
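The difference between the two loaders shows up only when the package is missing. The sketch below uses a deliberately non-existent package name (notARealPackage123 is made up for the demonstration) so that it behaves the same on any machine.

```r
pkg <- "notARealPackage123"   # made-up name: assumed not to be installed

# require() returns FALSE instead of stopping, so it can be tested in an if()
loaded <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)
loaded

# library() signals an error, which has to be caught to keep a script running
msg <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "library() signalled an error"
)
msg
```

Inside a function, a typical pattern is if (!require(...)) stop("please install the package first"), which turns the FALSE into an informative message.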

22 Package loading

23 Some R-packages for Bioinformatics
• 'limma', 'affy', 'marray' (Bioconductor project), 'lumi', 'beadarray' - for microarrays
• 'ape' - phylogenetics
• 'seqinr' - manipulation of biosequences
• 'BioNet' - for network integration
• 'mseq', 'DEGseq' - next-generation sequencing

24 Getting help with R-packages
• "Task Views" at http://cran.r-project.org/ : ClinicalTrials, Genetics, Cluster, Pharmacokinetics, etc.
• 'sos' package: its function findFn produces an HTML page per keyword (e.g. "protein"):
> findFn("protein", maxPages = 2)

25 seqinR PACKAGE

26 seqinR : read Fasta files

27 seqinR : read fasta files (cont) 'read.fasta'
> ff <- system.file("sequences/someORF.fsa", package = "seqinr")
> fs <- read.fasta(file = ff)
> names(fs)
[1] "YAL001C" "YAL002W" "YAL003W" "YAL005C" "YAL007C" "YAL008W" "YAL009W"
> count(fs[[1]], 2)
 aa  ac  ag  at  ca  cc  cg  ct  ga  gc  gg  gt  ta  tc  tg  tt
802 266 370 524 317 147 145 290 405 182 223 259 437 304 331 570
> seqAA <- read.fasta(file = system.file("sequences/seqAA.fasta", package = "seqinr"), seqtype = "AA")
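To see what a FASTA reader has to do, here is a rough base-R sketch that parses a tiny made-up file. It is not how read.fasta is implemented; the records seq1 and seq2 are invented toy data, and the parser assumes every header line is followed by at least one sequence line.

```r
# Write a toy FASTA file (invented records) to a temporary location
fasta_lines <- c(">seq1 toy record", "acgggtacgg", "tcccatcgaa",
                 ">seq2", "ttgaca")
tmp <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, tmp)

# Parse it: header lines start with ">", all other lines are sequence
lines     <- readLines(tmp)
is_header <- startsWith(lines, ">")
record_id <- cumsum(is_header)          # which record each line belongs to

# The name is the first word after ">"
seq_names <- sub("^>([^[:space:]]+).*", "\\1", lines[is_header])

# Glue the sequence lines of each record together
seqs <- tapply(lines[!is_header], record_id[!is_header], paste, collapse = "")
seqs <- setNames(as.vector(seqs), seq_names)
seqs

# A vector of single characters per record, similar in spirit to what
# read.fasta returns for each sequence
seq_chars <- lapply(seqs, function(s) strsplit(s, "")[[1]])
```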

28 seqinR: 's2c' and 'c2s' functions
• Given a sequence (character string), how to get a vector of individual characters? The generic R solution is non-intuitive: unlist(strsplit(…, ""))
• In the seqinR package this is very simple:
> s2c("acgggtacggtcccatcgaa")
[1] "a" "c" "g" "g" "g" "t" "a" "c" "g" "g" "t" "c" "c" "c" "a" "t" "c" "g" "a" "a"
> a <- s2c("acgggtacggtcccatcgaa")
> a
[1] "a" "c" "g" "g" "g" "t" "a" "c" "g" "g" "t" "c" "c" "c" "a" "t" "c" "g" "a" "a"
• The inverse operation is done by the function 'c2s':
> c2s(a)
[1] "acgggtacggtcccatcgaa"
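For comparison, the same round trip can be written in base R alone. This sketch does not use seqinr; the names s, chars and back are illustrative.

```r
s <- "acgggtacggtcccatcgaa"

# String to characters: strsplit() returns a list, so take its first element
chars <- strsplit(s, "")[[1]]
chars

# Characters back to a string
back <- paste(chars, collapse = "")
back

identical(back, s)
```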

29 seqinR: 'comp' function
> rev(a)   # a function from package base
[1] "a" "a" "g" "c" "t" "a" "c" "c" "c" "t" "g" "g" "c" "a" "t" "g" "g" "g" "c" "a"
> c2s(rev(a))
[1] "aagctaccctggcatgggca"
> ar = c2s(rev(a))
• How to get a reverse complement? comp expects a vector of characters, not a string:
> comp(ar)
Error in s2n(seq) : sequence is not a vector of chars
> comp(rev(a))
[1] "t" "t" "c" "g" "a" "t" "g" "g" "g" "a" "c" "c" "g" "t" "a" "c" "c" "c" "g" "t"
> print( arc <- c2s(comp(rev(a))) )   ### both assignment and printing
[1] "ttcgatgggaccgtacccgt"
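The reverse complement can be cross-checked in base R with chartr, which translates characters one for one. This is a sketch for lowercase a, c, g, t only; it does not handle ambiguity codes the way a dedicated function might.

```r
a <- strsplit("acgggtacggtcccatcgaa", "")[[1]]

# Complement each base (a<->t, c<->g), then reverse the order
rc_chars <- rev(chartr("acgt", "tgca", a))

# Collapse to a single string
rc <- paste(rc_chars, collapse = "")
rc
```

The result agrees with the c2s(comp(rev(a))) output shown on the slide.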

30 seqinR: 'count' function
> a <- s2c("acgggtacggtcccatcgaa")
# To count dinucleotide occurrences in sequence a:
> count(a, 2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 1  2  0  1  1  2  3  0  1  0  3  2  1  2  0  0
# To count trinucleotide occurrences in sequence a, in frame 2 (frame counting starts from 0):
> count(a, 3, 2)
aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat cca ccc ccg
  0   0   0   0   0   0   1   0   0   0   0   0   0   1   0   0   0   0   0   1   1   1   0
cct cga cgc cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta gtc
  0   1   0   1   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   1   2   1   1
gtg gtt taa tac tag tat tca tcc tcg tct tga tgc tgg tgt tta ttc ttg ttt
  0   0   0   1   0   0   0   1   1   0   0   0   0   0   0   0   0   0
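The counts above can be reproduced with a short base-R function, which also makes the meaning of the third argument concrete: the output on the slide corresponds to overlapping words taken from every position, beginning at the given zero-based offset. The function count_kmers and its argument names are hypothetical, written only for this cross-check, and it assumes the sequence is longer than start + k.

```r
# Hypothetical helper (not from seqinr): count overlapping words of length k,
# sliding one position at a time, beginning at zero-based offset 'start'.
count_kmers <- function(chars, k, start = 0, alphabet = c("a", "c", "g", "t")) {
  s     <- paste(chars, collapse = "")
  first <- seq(from = start + 1, to = length(chars) - k + 1)
  words <- substring(s, first, first + k - 1)
  # Every possible word, in alphabetical order, so that absent words show as 0
  grid      <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  all_words <- sort(do.call(paste0, grid))
  table(factor(words, levels = all_words))
}

a <- strsplit("acgggtacggtcccatcgaa", "")[[1]]

di  <- count_kmers(a, 2)             # 19 overlapping dinucleotides
tri <- count_kmers(a, 3, start = 2)  # 16 overlapping trinucleotides from offset 2

di
tri[tri > 0]                         # show only the words that occur
```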

31 seqinR: 'count' function
• To count dinucleotide frequencies in sequence a:
> count(a, 2, freq = TRUE)
        aa         ac         ag         at         ca         cc         cg         ct
0.05263158 0.10526316 0.00000000 0.05263158 0.05263158 0.10526316 0.15789474 0.00000000
        ga         gc         gg         gt         ta         tc         tg         tt
0.05263158 0.00000000 0.15789474 0.10526316 0.05263158 0.10526316 0.00000000 0.00000000
> round(count(a, 2, freq = TRUE), 3)
   aa    ac    ag    at    ca    cc    cg    ct    ga    gc    gg    gt    ta    tc    tg    tt
0.053 0.105 0.000 0.053 0.053 0.105 0.158 0.000 0.053 0.000 0.158 0.105 0.053 0.105 0.000 0.000
> ?permutation

32 Seqinr: 'AAstat' function
> seqAA <- read.fasta(file = system.file("sequences/seqAA.fasta", package = "seqinr"), seqtype = "AA")
The function AAstat of package 'seqinr' returns a list with simple information about a protein sequence, including the number of residues, the percentage of physico-chemical classes and the theoretical isoelectric point:
> AAstat(seqAA[[1]])
$Compo
 A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
 8  6  6 18  6  8  1  9 14 29  5  7 10  9 13 16  7  6  3  1
…
$Prop$Aliphatic
[1] 0.2404372
$Prop$Aromatic
[1] 0.06010929
...
$Pi
[1] 8.534902

33 seqinr ‘AAstat’ function (cont.)

34 Exercise
1) Read your favorite DNA sequence
a) Find the dinucleotide composition in the natural frame
b) Find the frequencies of dinucleotides in the 1st frame
c) The same for trinucleotides
2) Read your favorite AA sequence and learn its composition with the AAstat function
