
1 Hyun Seok Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine Lecture 9. Introduction to Statistical Methods and R-programming MES7594-01 Genome Informatics I (2015 Spring)

2 Notice This class introduces the basic statistical concepts and methods required for omics data analysis, and gives students an opportunity to start R programming. Take ‘Introduction to Biomedical Statistics’ (생의학통계입문) if you want an in-depth understanding of statistical methods for biomedical data analysis. Take ‘BCU Medical Statistics Education’ (BCU 의학통계교육) if you need broader applications of R.

3 Summary Statistics Central value: - Mean: sum divided by number of observations - Median: middle value. Less sensitive to outliers

4 Summary Statistics Dispersion: - Variance: the average of the squared deviations of the observations from their mean (the sample variance divides by n - 1 rather than n) - Standard deviation (SD): square root of the variance - Standard error of the mean = SD/sqrt(n) - Interquartile range: difference between Q3 (75th percentile) and Q1 (25th percentile).
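
The central-value and dispersion measures above can be sketched in a few lines. This is a minimal Python sketch using only the standard library; the measurements are hypothetical and include one deliberate outlier to show why the median moves less than the mean.

```python
import math
import statistics

# Hypothetical measurements, chosen only for illustration (9.8 is an outlier).
values = [4.1, 4.5, 4.7, 5.0, 5.2, 5.3, 9.8]

mean = statistics.mean(values)
median = statistics.median(values)          # less sensitive to the outlier
variance = statistics.variance(values)      # sample variance, divides by n - 1
sd = statistics.stdev(values)               # square root of the sample variance
sem = sd / math.sqrt(len(values))           # standard error of the mean
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1                               # interquartile range

print(f"mean={mean:.3f} median={median:.3f}")
print(f"variance={variance:.3f} sd={sd:.3f} sem={sem:.3f} iqr={iqr:.3f}")
```

Quartile conventions differ between tools, so the IQR reported here may not match other software digit for digit.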

5 Summary Statistics Noise vs. Signal: - Coefficient of variation = SD/mean; can be used to detect outliers among biological replicates (e.g. contamination) - Signal-to-noise ratio = (μA - μB)/(σA + σB), where μ and σ are the mean and standard deviation of groups A and B; used in GSEA for comparing two groups
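
A short Python sketch of both ratios, assuming the signal-to-noise ratio takes the form (mean of A - mean of B)/(SD of A + SD of B). The function names and the numbers are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """SD divided by the mean; unitless measure of relative noise."""
    return statistics.stdev(values) / statistics.mean(values)

def signal_to_noise(group_a, group_b):
    """(mean_A - mean_B) / (SD_A + SD_B) for two groups of measurements."""
    difference = statistics.mean(group_a) - statistics.mean(group_b)
    noise = statistics.stdev(group_a) + statistics.stdev(group_b)
    return difference / noise

# Hypothetical expression values for one gene.
replicates = [10.2, 9.8, 10.1]   # biological replicates of one condition
treated = [8.0, 9.0, 10.0]
control = [4.0, 5.0, 6.0]

print(f"CV of replicates: {coefficient_of_variation(replicates):.3f}")
print(f"Signal-to-noise:  {signal_to_noise(treated, control):.3f}")
```

A replicate set with an unusually high coefficient of variation is a candidate for closer inspection; the threshold to use is a judgment call and is not given in the lecture.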

6 What is P-value? The probability, under the null hypothesis, of an outcome at least as extreme as the one observed. The probability of each outcome is computed under the null hypothesis. Example: You flip a coin 5 times and get 5 heads. Can you say the coin is fair?

7 What is P-value? (continued) Confidence interval
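
The coin example can be worked out directly. A minimal Python sketch, assuming the null hypothesis is a fair coin (p = 0.5) and that "at least as extreme" for the two-sided case means any outcome no more probable than the observed one:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n flips with head probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_flips, n_heads = 5, 5
fair = 0.5  # null hypothesis: the coin is fair

# One-sided: probability of at least as many heads as observed.
p_one_sided = sum(binomial_pmf(k, n_flips, fair)
                  for k in range(n_heads, n_flips + 1))

# Two-sided: add up every outcome that is no more probable than the observed one.
observed = binomial_pmf(n_heads, n_flips, fair)
p_two_sided = sum(pmf for pmf in (binomial_pmf(k, n_flips, fair)
                                  for k in range(n_flips + 1))
                  if pmf <= observed * (1 + 1e-9))

print(p_one_sided)  # 0.03125 (= 1/32)
print(p_two_sided)  # 0.0625  (5 heads or 5 tails)
```

At the conventional 0.05 threshold the one-sided result would reject fairness while the two-sided result would not, so five flips alone are weak evidence either way.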

8 Probability Distributions Discrete values: - Binomial distribution: probability of exactly m events in n independent trials when the probability p of the event in any given trial is known (e.g. dice). Binomial probability distribution: P(X = m) = C(n, m) p^m (1 - p)^(n - m)

9 Probability Distributions Discrete values: - Poisson distribution: frequency distribution of the number of times a rare event occurs (e.g. # phone calls/hr). Poisson probability distribution: P(X = k) = λ^k e^(-λ) / k!, where k: number of occurrences, λ: expected value of X - Reading material: http://www.life.illinois.edu/mcb/432/Handouts/Binomial_and_Poisson.pdf

10 Probability Distributions Continuous values: - Normal distribution: bell-shaped distribution of continuous values - Normal (Gaussian) probability density function: f(x) = 1/(σ sqrt(2π)) exp(-(x - μ)^2 / (2σ^2)), where μ is the mean and σ is the standard deviation.

11 Probability Distributions Z score (standard score): - Measures how many standard deviations a value lies from the mean: Z = (x - μ)/σ - Z score cutoff of (+/-) 2: values beyond it fall outside the central ~95% of a normal distribution (1.96 for exactly 95%) => significant hits
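
To see where the cutoff of 2 comes from, the two-sided tail probability of a Z score can be read off the standard normal distribution. A minimal Python sketch, assuming the data are normally distributed; the function names are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def two_sided_p(z):
    """Probability of a value at least |z| SDs from the mean, in either tail."""
    return 2.0 * (1.0 - standard_normal.cdf(abs(z)))

print(f"{two_sided_p(1.96):.4f}")  # about 0.05
print(f"{two_sided_p(2.0):.4f}")   # about 0.0455
print(f"{two_sided_p(3.0):.4f}")   # about 0.0027
```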

12 Student’s t-test Compares the means of two populations when both follow a normal distribution. Example: test drug effects in single cells
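
A sketch of the t statistic itself, assuming the classic equal-variance (pooled) form of Student's test. It returns the statistic and the degrees of freedom only; turning them into a P value needs the t distribution, which the Python standard library does not provide (in R, `t.test(x, y, var.equal = TRUE)` reports it). The data are hypothetical.

```python
import math
import statistics

def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance, plus its df."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error
    return t, df

# Hypothetical single-cell readouts with and without drug.
treated = [8.0, 9.0, 10.0]
control = [4.0, 5.0, 6.0]

t, df = student_t(treated, control)
print(f"t = {t:.3f}, df = {df}")
```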

13 Wilcoxon Rank-sum Test (Mann-Whitney Test) Compares the distributions of two populations. Nonparametric alternative to the t-test: it ranks all the values from low to high and computes a P value that depends on the discrepancy between the mean ranks of the two groups. Because it uses ranks, it is more robust to outliers. Example: test a drug effect in lung cancer cells vs. breast cancer cells
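
The ranking step can be sketched as follows. This is an illustrative Python version that uses the large-sample normal approximation for the P value, without tie or continuity corrections, so it will differ slightly from exact implementations on small samples. All names and data are hypothetical.

```python
import math
from statistics import NormalDist

def average_ranks(values):
    """1-based ranks from low to high; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def rank_sum_test(sample_a, sample_b):
    """Mann-Whitney U for sample_a and an approximate two-sided P value."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    mean_u = n_a * n_b / 2.0
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u_a - mean_u) / sd_u
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return u_a, p

# Hypothetical drug responses in two cell types.
lung = [1.0, 2.0, 3.0, 4.0, 5.0]
breast = [6.0, 7.0, 8.0, 9.0, 10.0]
u, p = rank_sum_test(lung, breast)
print(f"U = {u}, approximate p = {p:.4f}")
```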

14 Kolmogorov-Smirnov (K-S) test Nonparametric test that compares the cumulative distributions of two data sets and computes a P value that depends on the largest discrepancy between the distributions. Sensitive to any difference between the two distributions, e.g. shape, spread, median. Example: identify gene sets responsive to a drug when a subset of genes in a gene set responds differently from all other genes in the genome.

15 Kolmogorov-Smirnov (K-S) test Cumulative Distribution Function (CDF) plot. Example: customer waiting time in a fast food restaurant. Red: expected. Blue: one-day sample (N = 15). No difference in the distribution.
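
The "largest discrepancy" is the maximum vertical gap between the two empirical CDFs. A minimal Python sketch of that statistic for two samples; it does not compute the P value, and the waiting times are hypothetical rather than the data behind the plot.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so check each of them.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Hypothetical waiting times in minutes.
usual_day = [2.0, 3.5, 4.0, 5.5, 6.0, 7.5]
sample_day = [2.5, 3.0, 4.5, 5.0, 6.5, 7.0]
print(f"D = {ks_statistic(usual_day, sample_day):.3f}")
```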

16 Hypergeometric Test Probability of k successes in n draws, without replacement, from a finite population of size N (all the balls) containing exactly K successes (red balls), wherein each draw is either a success or a failure. Example: test for significant enrichment of hits in a particular gene set, as in Gene Ontology (GO) analysis.
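
For enrichment, the quantity of interest is usually the upper tail: the probability of k or more successes. A minimal Python sketch using the same N, K, n, k notation; the gene counts are hypothetical.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k successes in n draws without replacement."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def enrichment_p(k, N, K, n):
    """Upper-tail P value: probability of k or more successes."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(K, n) + 1))

# Hypothetical screen: 20,000 genes, 200 of them in the gene set,
# 100 hits overall, 5 of which fall in the gene set (1 expected by chance).
p = enrichment_p(k=5, N=20000, K=200, n=100)
print(f"enrichment p = {p:.5f}")
```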

17 Fisher’s Exact Test for Count Data For analysis of contingency tables. Example: BrdU incorporation assay with a lung cancer cell line after exposure to gemcitabine. The one-tailed Fisher’s exact test is equivalent to the hypergeometric test.

                Vehicle (DMSO)   Gemcitabine   Total
BrdU + cells          15              3          18
BrdU - cells          11             12          23
Total                 26             15          41
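
The BrdU table can be tested with the hypergeometric upper tail. This is a minimal Python sketch of a one-tailed test, assuming the question is whether BrdU + cells are over-represented in the vehicle column; the function name is hypothetical.

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """One-tailed P value for a 2x2 table [[a, b], [c, d]].

    Probability of a count >= a in the top-left cell with all margins fixed.
    """
    row1 = a + b            # BrdU + total
    col1 = a + c            # vehicle total
    total = a + b + c + d
    upper = min(row1, col1)
    return sum(comb(row1, k) * comb(total - row1, col1 - k)
               for k in range(a, upper + 1)) / comb(total, col1)

# Counts from the table: rows BrdU +/-, columns vehicle/gemcitabine.
p = fisher_one_tailed(15, 3, 11, 12)
print(f"one-tailed p = {p:.4f}")
```

With these counts the one-tailed P value comes out at about 0.02.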

18 Pearson’s correlation test Tests for positive or negative (+/-) correlation between paired samples, using the Pearson product-moment correlation coefficient (r). Example: dose-dependent increase of drug metabolizing enzyme level
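
A sketch of r and of the test statistic usually derived from it, t = r sqrt((n - 2)/(1 - r^2)) with n - 2 degrees of freedom. Written in Python for illustration; the dose and enzyme values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of paired samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(ss_x * ss_y)

def correlation_t(r, n):
    """t statistic for testing r against zero (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical drug doses and enzyme levels.
dose = [0.0, 1.0, 2.0, 4.0, 8.0]
enzyme = [1.0, 1.4, 2.1, 3.9, 8.2]

r = pearson_r(dose, enzyme)
print(f"r = {r:.3f}, t = {correlation_t(r, len(dose)):.2f}")
```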

19 Permutation Test Introduced in the 1930s by R. A. Fisher. A common statistical tool when you have a data set without any known error model. The distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points.

20 Permutation Test, a toy example
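
A small worked case, sketched in Python. It is not the example from the slides; it assumes the test statistic is the absolute difference in group means and enumerates every relabeling, which is only practical for small samples. The data are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_test(group_a, group_b):
    """Exact two-sided permutation P value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    total = 0
    at_least_as_extreme = 0
    # Every way of choosing which pooled values carry the "A" label.
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        relabeled_a = [pooled[i] for i in chosen]
        relabeled_b = [pooled[i] for i in indices if i not in chosen_set]
        difference = abs(mean(relabeled_a) - mean(relabeled_b))
        total += 1
        if difference >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total

treated = [8, 9, 10]
control = [4, 5, 6]
print(permutation_test(treated, control))  # 0.1 (2 of 20 relabelings)
```

With three values per group there are only 20 relabelings, so the smallest attainable two-sided P value is 0.1; larger groups are needed to reach conventional significance levels.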


29 Multiple Testing Correction Multiple hypothesis testing problem: Suppose we consider the safety of a drug in terms of the occurrences of different types of side effects. As more types of side effects are considered, it becomes more likely that the new drug will appear to be less safe than existing drugs in terms of at least one side effect by random chance alone (Wikipedia). In a typical genomic assay, you get > 20K p-values from > 20K hypotheses. These p-values need to be corrected (= adjusted) for the inflated type I error (= false positives). Popular methods: Bonferroni correction (most stringent; yields the fewest hits) and false discovery rate (FDR; most popular, less stringent). FDR 10%: about 10% of the discoveries are expected to be false (= about 90% of the discoveries are expected to be true).
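
Both corrections are short enough to write out. A minimal Python sketch, assuming the FDR method meant here is the Benjamini-Hochberg procedure (the lecture does not name one); the p-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(p_values))          # approximately [0.04, 0.16, 0.12, 0.02]
print(benjamini_hochberg(p_values))  # approximately [0.02, 0.04, 0.04, 0.02]
```

At a 0.05 threshold Bonferroni keeps two of the four hypothetical hits, while the FDR adjustment keeps all four, which illustrates the difference in stringency.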

30 R programming Download and install the latest versions of R and RStudio on your PC: http://cran.r-project.org, http://www.rstudio.com/products/rstudio/download/ R tutorials: http://cran.r-project.org/doc/manuals/r-release/R-intro.html, http://www.cyclismo.org/tutorial/R/

31 Basic R grammar Math: +, -, *, /, %% (modulo), ^ Missing value: NA Function: sqrt(), abs(), log(), exp() String: "" Logical values: TRUE, FALSE Comparison: >, <, ==, != Variables: x <- 3 Vector: c(), rep(), seq(), :, [], names(), sum(), max(), min(), mean(), sd(), sort(), append(), rm() Matrix: matrix(), dim(), colMeans(), rowMeans() File: source("filename.R"), read.csv("filename.txt"), write.table(), save(), load() Library: install.packages(), library()

32 Basic R grammar ‘if’ statement ‘for’ loop ‘while’ loop Make your own ‘function()’

33 Display commands Dot plot Barplot Box plot Matrix display: contour(), persp(), image()

34 Statistical tests Student’s t-test: t.test() Wilcoxon rank-sum test: wilcox.test() K-S test: ks.test() Hypergeometric test: phyper() Fisher’s exact test: fisher.test() Pearson correlation test: cor.test() Multiple testing correction: p.adjust()

35 Poisson Example: shRNA pooled screen An shRNA pooled screen identifies essential genes in cell lines. The pooled shRNA plasmid library is packaged into lentivirus and used to infect cells. You have to avoid multiple infections per cell.

36 Poisson Example: shRNA pooled screen 1. Cherry-pick 1800 hairpins from the TRC library. 2. Isolate plasmids in pool. 3. Package into lentivirus. 4. Determine MOI & selection condition of each cell line. 5. Infect cell lines and culture in selection condition for 16 passages in triplicate. 6. Harvest gDNA at 48 hrs & 16 passages after infection. 7. PCR-amplify hairpins with barcode tagging. 8. P5/P7 (sequencing primers) tagging by PCR. 9. NGS sequencing.

37 Poisson Example: MOI MOI (multiplicity of infection): the ratio of the number of virus particles to the number of target cells. Question: for pooled shRNA screens, what is the proper MOI? It is preferable for one cell to be transduced by one shRNA hairpin, not by multiple hairpins. Poisson probability mass function: P(n) = m^n e^(-m) / n!, where m is the MOI, n is the number of virus particles infecting a cell, and P(n) is the probability that n virus particles infect a cell. Draw a surface map of the probability for varying MOI and n.
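
Before drawing the surface, the same grid can be tabulated. A minimal Python sketch, assuming infections per cell follow the Poisson probability mass function P(n) = m^n e^(-m) / n!; the MOI values and helper names are chosen only for illustration.

```python
import math

def poisson_pmf(n, m):
    """Probability that exactly n virus particles infect a cell at MOI m."""
    return m**n * math.exp(-m) / math.factorial(n)

def single_among_infected(m):
    """Fraction of infected cells that carry exactly one virus particle."""
    infected = 1.0 - poisson_pmf(0, m)
    return poisson_pmf(1, m) / infected

print("MOI    P(0)   P(1)   P(2)   P(3)  single/infected")
for moi in (0.1, 0.3, 0.5, 1.0, 2.0):
    row = "  ".join(f"{poisson_pmf(n, moi):.3f}" for n in range(4))
    print(f"{moi:<4}  {row}  {single_among_infected(moi):.3f}")
```

Lower MOI leaves more cells uninfected but makes multiply infected cells rarer among those that are infected, which is the trade-off behind the question on the slide.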

38 Example: Whole genome siRNA screen [Figure: cell viability at 96 hrs] The siRNA screen is done in an array format (well-by-well). Comparison with shRNA screens: - Acute and transient knockdown - Easier data processing

39 Example: Z score estimation Question: Let’s say you have finished siRNA toxicity screens in a cancer cell line. Which genes are cancer-essential at a Z score cutoff of -3?
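
One way to approach the question, sketched in Python: convert each gene's viability to a Z score across the whole screen and keep genes at or below the cutoff. The gene names and viability values are hypothetical, and the mean and SD are taken over all genes in the screen, which is an assumption about how the screen is normalized.

```python
import statistics

def essential_genes(viability_by_gene, cutoff=-3.0):
    """Genes whose viability Z score is at or below the cutoff."""
    values = list(viability_by_gene.values())
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return sorted(gene for gene, value in viability_by_gene.items()
                  if (value - mu) / sd <= cutoff)

# Hypothetical screen: 29 genes with near-normal viability, one strong hit.
screen = {f"GENE{i}": 1.0 + 0.01 * ((i % 5) - 2) for i in range(29)}
screen["GENE_X"] = 0.2  # knockdown kills most cells

print(essential_genes(screen))  # ['GENE_X']
```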

40 Notice 1. TA for systems biology course: Hyosil Kim, Ph.D. (hyosil7979@gmail.com). 2. You are welcome to bring your own laptop for the lab session.

