Canadian Bioinformatics Workshops www.bioinformatics.ca.

Slides:



Advertisements
Similar presentations
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Review of the Basic Logic of NHST Significance tests are used to accept or reject the null hypothesis. This is done by studying the sampling distribution.
Introduction to Microarry Data Analysis - II BMI 730
Data mining with the Gene Ontology Josep Lluís Mosquera April 2005 Grup de Recerca en Estadística i Bioinformàtica GOing into Biological Meaning.
Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.
Using Gene Ontology Models and Tests Mark Reimers, NCI.
Differentially expressed genes
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Cbio course, spring 2005, Hebrew University (Alignment) Score Statistics.
Biological Interpretation of Microarray Data Helen Lockstone DTC Bioinformatics Course 9 th February 2010.
Statistical Methods in Computer Science Hypothesis Testing I: Treatment experiment designs Ido Dagan.
Significance Tests P-values and Q-values. Outline Statistical significance in multiple testing Statistical significance in multiple testing Empirical.
Lecture 9 Today: –Log transformation: interpretation for population inference (3.5) –Rank sum test (4.2) –Wilcoxon signed-rank test (4.4.2) Thursday: –Welch’s.
Statistical Methods in Computer Science Hypothesis Testing I: Treatment experiment designs Ido Dagan.
Student’s t statistic Use Test for equality of two means
On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach Published by Steven L. Salzberg Presented by Prakash Tilwani MACS 598 April 25 th.
Definitions In statistics, a hypothesis is a claim or statement about a property of a population. A hypothesis test is a standard procedure for testing.
Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Review for Exam 2 Some important themes from Chapters 6-9 Chap. 6. Significance Tests Chap. 7: Comparing Two Groups Chap. 8: Contingency Tables (Categorical.
Gene Set Enrichment Analysis Petri Törönen petri(DOT)toronen(AT)helsinki.fi.
Different Expression Multiple Hypothesis Testing STAT115 Spring 2012.
Nonparametrics and goodness of fit Petter Mostad
Chapter 15 Nonparametric Statistics
Hypothesis Tests and Confidence Intervals in Multiple Regressors
Inferential Statistics: SPSS
Hypothesis Testing:.
1Module 2: Analyzing Gene Lists Canadian Bioinformatics Workshops
Multiple testing correction
Multiple testing in high- throughput biology Petter Mostad.
NONPARAMETRIC STATISTICS
Gene Set Enrichment Analysis (GSEA)
Essential Statistics in Biology: Getting the Numbers Right
Non-parametric Tests. With histograms like these, there really isn’t a need to perform the Shapiro-Wilk tests!
Introduction To Biological Research. Step-by-step analysis of biological data The statistical analysis of a biological experiment may be broken down into.
Jesse Gillis 1 and Paul Pavlidis 2 1. Department of Psychiatry and Centre for High-Throughput Biology University of British Columbia, Vancouver, BC Canada.
Suppose we have analyzed total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified - call them significant)
BIOL 582 Lecture Set 17 Analysis of frequency and categorical data Part II: Goodness of Fit Tests for Continuous Frequency Distributions; Tests of Independence.
Biostatistics, statistical software VII. Non-parametric tests: Wilcoxon’s signed rank test, Mann-Whitney U-test, Kruskal- Wallis test, Spearman’ rank correlation.
Fall 2002Biostat Nonparametric Tests Nonparametric tests are useful when normality or the CLT can not be used. Nonparametric tests base inference.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
1 Chapter 10: Introduction to Inference. 2 Inference Inference is the statistical process by which we use information collected from a sample to infer.
Hypothesis Testing State the hypotheses. Formulate an analysis plan. Analyze sample data. Interpret the results.
Nonparametric Statistics
Jeopardy Hypothesis Testing t-test Basics t for Indep. Samples Related Samples t— Didn’t cover— Skip for now Ancient History $100 $200$200 $300 $500 $400.
Statistical Inference for the Mean Objectives: (Chapter 9, DeCoursey) -To understand the terms: Null Hypothesis, Rejection Region, and Type I and II errors.
Statistical Testing with Genes Saurabh Sinha CS 466.
Suppose we have T genes which we measured under two experimental conditions (Ctl and Nic) in n replicated experiments t i * and p i are the t-statistic.
1 URBDP 591 A Lecture 12: Statistical Inference Objectives Sampling Distribution Principles of Hypothesis Testing Statistical Significance.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
DTC Quantitative Methods Bivariate Analysis: t-tests and Analysis of Variance (ANOVA) Thursday 14 th February 2013.
Inferential Statistics Introduction. If both variables are categorical, build tables... Convention: Each value of the independent (causal) variable has.
The Broad Institute of MIT and Harvard Differential Analysis.
GO enrichment and GOrilla
Statistical Inference Statistical inference is concerned with the use of sample data to make inferences about unknown population parameters. For example,
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 –Multiple hypothesis testing Marshall University Genomics.
A Quantitative Overview to Gene Expression Profiling in Animal Genetics Armidale Animal Breeding Summer Course, UNE, Feb Analysis of (cDNA) Microarray.
Hypothesis Tests u Structure of hypothesis tests 1. choose the appropriate test »based on: data characteristics, study objectives »parametric or nonparametric.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Chapter 11: Categorical Data n Chi-square goodness of fit test allows us to examine a single distribution of a categorical variable in a population. n.
Micro array Data Analysis. Differential Gene Expression Analysis The Experiment Micro-array experiment measures gene expression in Rats (>5000 genes).
1 A Discussion of False Discovery Rate and the Identification of Differentially Expressed Gene Categories in Microarray Studies Ames, Iowa August 8, 2007.
Canadian Bioinformatics Workshops
Module 2: Analyzing gene lists: over-representation analysis
Canadian Bioinformatics Workshops
Presentation transcript:

Canadian Bioinformatics Workshops

2Module #: Title of Module

3Module 2: Analyzing Gene Lists Module 2: Analyzing gene lists: over-representation analysis Interpreting Genes from OMICS Studies Quaid Morris

4Module 2: Analyzing Gene Lists Examples of sources of gene lists Source Eisen et al. (1998) PNAS 95 Clustering Genes Thresholding a gene “score” Gene list Source: Gerber et al. (2006) PNAS103 Genes Time Examples of gene scores

5Module 2: Analyzing Gene Lists Over-representation analysis (ORA) in a nutshell Given: 1.Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast), or Gene Scores: RRP6 (4.0), MRD1 (3.0) etc 2.Gene annotations: e.g. Gene ontology, transcription factor binding sites in promoter ORA Question: Are any of the gene annotations surprisingly enriched in the gene list? Details: –How to assess “surprisingly” (statistics) –How to correct for repeating the tests

6Module 2: Analyzing Gene Lists Overview Theory: –Review: What is a P-value? The good ole’ T-test. –Fisher’s Exact Test, the bread and butter of ORA –Enrichment analysis with gene rankings –Correcting for multiple testing Practice –Lab #1: Funspec: ORA using Fisher’s Exact Test –Lab #2: GoMiner: ORA for Gene Ontology –Lab #3: Using GSEA to evaluate ranked lists

7Module 2: Analyzing Gene Lists Overview Theory: –Review: What is a P-value? The good ole’ T-test. –Fisher’s Exact Test, the bread and butter of ORA –Enrichment analysis with gene rankings –Correcting for multiple testing Practice –Lab #1: Funspec: ORA using Fisher’s Exact Test –Lab #2: GoMiner: ORA for Gene Ontology –Lab #3: Using GSEA to evaluate ranked lists

What is a P-value? The P-value is (a bound) on the probability that the “null hypothesis” is true, Calculated by calculating statistics using the data and testing the probability of observing those statistics, or ones more extreme, given a sample of the same size distributed according to the null hypothesis, Intuitively: P-value is the probability of a false positive result (aka “Type I error”) 8Module 2: Analyzing Gene Lists

9Module 2: Analyzing Gene Lists The good old T-test Gene scores Gene score distributions Question: How likely are the observed differences between the two distributions due to chance?

10Module 2: Analyzing Gene Lists ORA using the T-test Gene score distributions Answer: Two-tailed T-test Black: N 1 =500 Red: N 2 =4500 Mean: m 1 = 1.1 Std: s 1 = 0.9 T-statistic = Mean: m 1 = 4.9 Std: s 1 = 1.0 = Formal Question: What is the probability of observing the T-statistic or one more extreme if the means of the two distributions were the same?

11Module 2: Analyzing Gene Lists ORA using the T-test Gene score distributions T-statistic = = T-distribution (d.o.f. N) Probability density T-statistic 0 P-value = shaded area * Formal Question: What is the probability of observing the T-statistic or one more extreme if the means of the two distributions were the same?

12Module 2: Analyzing Gene Lists Overview Theory: –Review: What is a P-value? The good ole’ T-test. –Fisher’s Exact Test, the bread and butter of ORA –Enrichment analysis with gene rankings –Correcting for multiple testing Practice –Lab #1: Funspec: ORA using Fisher’s Exact Test –Lab #2: GoMiner: ORA for Gene Ontology –Lab #3: Using GSEA to evaluate ranked lists

13Module 2: Analyzing Gene Lists Fisher’s exact test: the bread and butter of ORA a.k.a., the hypergeometric test Background population: 500 black genes, 5000 red genes Gene list RRP6 MRD1 RRP7 RRP43 RRP42 Formal question: What is the probability of finding 4 or more black genes in a random sample of 5 genes?

14Module 2: Analyzing Gene Lists Fisher’s exact test cont. Background population: 500 black genes, 5000 red genes Gene list RRP6 MRD1 RRP7 RRP43 RRP42 P-value Null distribution Answer = 4.6 x 10 -4

15Module 2: Analyzing Gene Lists Important details To test for under-enrichment of “black”, test for over- enrichment of “red”. Need to choose “background population” appropriately, e.g., if only portion of the total gene complement is queried (or available for annotation), only use that population as background. To test for enrichment of more than one independent types of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type. ***More on this later***

16Module 2: Analyzing Gene Lists What have we learned? Fisher’s exact test is used for ORA of gene lists for a single type of annotation, P-value for Fisher’s exact test –is “the probability that a random draw of the same size as the gene list from the background population would produce the observed number of annotations in the gene list or more.”, –and depends on size of both gene list and background population as well and # of black genes in gene list and background.

17Module 2: Analyzing Gene Lists Overview Theory: –Review: What is a P-value? The good ole’ T-test. –Fisher’s Exact Test, the bread and butter of ORA –Enrichment analysis with gene rankings –Correcting for multiple testing Practice –Lab #1: Funspec: ORA using Fisher’s Exact Test –Lab #2: GoMiner: ORA for Gene Ontology –Lab #3: Using GSEA to evaluate ranked lists

18Module 2: Analyzing Gene Lists Break for lab #1 Try out an over-representation analysis using Fisher’s exact test Funspec: –

19Module 2: Analyzing Gene Lists Funspec: Simple ORA for yeast Paste gene list here Bonferroni correct? YES! Choose sources of annotation Cavaets: yeast only, last updated 2002

20Module 2: Analyzing Gene Lists Overview Theory: –Review: What is a P-value? The good ole’ T-test. –Fisher’s Exact Test, the bread and butter of ORA –Enrichment analysis with gene rankings –Correcting for multiple testing Practice –Lab #1: Funspec: ORA using Fisher’s Exact Test –Lab #2: GoMiner: ORA for Gene Ontology –Lab #3: Using GSEA to evaluate ranked lists

21Module 2: Analyzing Gene Lists ORA with gene rankings: overview –Why can’t I use the T-test? –The Wilcoxon-Mann-Whitney (WMW) test: a T-test on ranks –The Kolmogorov-Smirnov (KS) test: testing for arbitrary differences between gene score distributions. –GSEA: finding the

22Module 2: Analyzing Gene Lists Reminder: ORA using the T-test Gene score distributions T-statistic = = T-distribution (d.o.f. N) Probability density T-statistic 0 P-value = shaded area * Formal Question: What is the probability of observing the T-statistic or one more extreme if the means of the two distributions were the same?

23Module 2: Analyzing Gene Lists Why can’t we use the T-test? 1.Assumes black and red gene score distributions are both approximately Gaussian (i.e. normal) –Score distribution assumption is often true for: Log ratios from microarrays –Score distribution assumption is rarely true for: Peptide counts, sequence tags (SAGE or NextGen sequencing), transcription factor binding sites hits 2.Tests for significance of difference in means of two distribution but does not test for other differences between distributions.

24Module 2: Analyzing Gene Lists Examples of inappropriate score distributions for T-tests Probability density Gene score  0 Gene scores are positive and have increasing density near zero, e.g. sequence counts Probability density Gene score  Distributions with gene score outliers, or “heavy- tailed” distributions Probability density Gene score  Bimodal “two-bumped” distributions. Solutions: 1) Robust test for difference of medians (WMW) 2) Direct test of difference of distributions (K-S)

25Module 2: Analyzing Gene Lists Wilcoxon-Mann-Whitney (WMW) test aka Mann-Whitney U-test, Wilcoxon rank-sum test 1) Rank gene scores, calculate R B, sum of ranks of black gene scores N 2 red gene scores N 1 black gene scores ranks R B = 21 Z Probability density Gene score  Formal Question: Are the medians of the two distributions significantly different?

26Module 2: Analyzing Gene Lists Wilcoxon-Mann-Whitney (WMW) test aka Mann-Whitney U-test, Wilcoxon rank-sum test R B = 21 Z Probability density Gene score  Formal Question: Are the medians of the two distributions significantly different? 2) Calculate Z-score: mean rank Z Normal distribution Probability density 0 P-value = shaded area * ) Calculate P-value: = -1.4

27Module 2: Analyzing Gene Lists WMW test details Described method is only applicable for large N 1 and N 2 and when there are no tied scores, WMW software uses a tied rank correction In most cases the WMW test is simply a T- test applied to the ranks of the gene scores WMW test is robust to (a few) outliers

28Module 2: Analyzing Gene Lists Kolmogorov-Smirnov (K-S) test Probability density Gene score  0 Question: Are the red and black distributions significantly different? Cumulative probability Gene score  Cumulative distribution 1) Calculate cumulative distributions of red and black

29Module 2: Analyzing Gene Lists Kolmogorov-Smirnov (K-S) test Probability density Gene score  0 Question: Are the red and black distributions significantly different? Cumulative probability Gene score  Cumulative distribution 1) Calculate cumulative distributions of red and black

30Module 2: Analyzing Gene Lists Kolmogorov-Smirnov (K-S) test Probability density Gene score  0 Formal question: Is the length of largest difference between the “empirical distribution functions” statistically significant? Cumulative probability Gene score  Cumulative distribution Length = 0.4

31Module 2: Analyzing Gene Lists WMW and K-S test caveats Neither tests is as sensitive as the T-test, ie they require more data points to detect the same amount of difference, so use the T-test whenever it is valid. K-S test and WMW can give you different answers: K-S detects difference of distributions, WMW detects difference of medians Rare problem: Tied scores and small # of observations can be a problem for some implementations of the WMW test

32Module 2: Analyzing Gene Lists Proper tests for different distributions Probability density Gene score  0 Gene scores are positive and have increasing density near zero, e.g. sequence counts Probability density Gene score  Distributions with gene score outliers, or “heavy- tailed” distributions Probability density Gene score  Bimodal “two-bumped” distributions. WMW or K-S K-S onlyWMW or K-S Recommended test:

A GSEA overview illustrating the method Subramanian A. et.al. PNAS;2005;102: ©2005 by National Academy of Sciences

Testing Gene-set Enrichment: GSEA (Gene-Set Enrichment Analysis) 1.Define a differentiality statistic (e.g. t-test) 2.Rank the genes (e.g. up and down) 3.Compute the Enrichment Score (EM) Navigate the ranked genes Gene in the gene-set  positive score Gene not in the gene-set  negative score Cumulative sum Enriched in UP Enriched in DOWN

Testing Gene-set Enrichment: GSEA (Gene-Set Enrichment Analysis) 4.Pick the EM value with maximum (absolute value) 5.Weight by position on the ranked list 6.Estimate significance (p-value, FDR) by permutations Enriched in UPEnriched in DOWN

36Module 2: Analyzing Gene Lists What have we learned? T-test is not valid when one or both of the score distributions is not normal, If need a “robust” test, or to test for difference of medians use WMW test or GSEA, To test for overall difference between two distributions, use K-S test.

37Module 2: Analyzing Gene Lists Other common tests and distributions Chi-squared (contingency table) test –Useful if there are >2 values of annotation (e.g. red genes, black genes, and blue genes) –Used as an approximation to Fisher’s Exact Test but is inaccurate for small gene lists Binomial test –Tests if gene scores for red and black either come from either N flips of the same coin or different coins. –E.g. black genes are “expressed” in, on average, 5 out of 12 conditions and red genes are expressed in, on average, 2 out of 12 conditions, is the probability of being expressed significantly different for the black and red genes?

38Module 2: Analyzing Gene Lists Overview Theory: –Review: What is a P-value? The good ole’ T-test. –Fisher’s Exact Test, the bread and butter of ORA –Enrichment analysis with gene rankings –Correcting for multiple testing Practice –Lab #1: Funspec: ORA using Fisher’s Exact Test –Lab #2: GoMiner: ORA for Gene Ontology –Lab #3: Using GSEA to evaluate ranked lists

39Module 2: Analyzing Gene Lists Break for lab #3 Try out an over-representation analysis on gene ranks using GSEA GSEA: –

40Module 2: Analyzing Gene Lists Overview Theory: –Review: What is a P-value? The good ole’ T-test. –Fisher’s Exact Test, the bread and butter of ORA –Enrichment analysis with gene rankings –Correcting for multiple testing Practice –Lab #1: Funspec: ORA using Fisher’s Exact Test –Lab #2: GoMiner: ORA for Gene Ontology –Lab #3: Using GSEA to evaluate ranked lists

41Module 2: Analyzing Gene Lists Correcting for multiple testing: overview –Why do we need to correct? Winning the P-value lottery. –Controlling the Family-wise Error Rate (FWER) with the Bonferroni-correction –Controlling the false-discovery rate (FDR): Benjamini-Hochberg, Storey-Tibshirani, Q-values and all that

42Module 2: Analyzing Gene Lists How to win the P-value lottery, part 1 Background population: 500 black genes, 5000 red genes Random draws … 7,834 draws later … Expect a random draw with observed enrichment once every 1 / P-value draws

43Module 2: Analyzing Gene Lists How to win the P-value lottery, part 2 Keep the gene list the same, evaluate different annotations Observed draw RRP6 MRD1 RRP7 RRP43 RRP42 Different annotations RRP6 MRD1 RRP7 RRP43 RRP42

44Module 2: Analyzing Gene Lists ORA tests need correction From the Gene Ontology website: Current ontology statistics: terms biological process 2101 cellular component 8280 molecular function

45Module 2: Analyzing Gene Lists Two types of multiple test corrections Controlling the Family-Wise Error Rate (FWER) controls the probability that any test is a false positive Controlling the False Discovery Rate (FDR) controls the proportion of positive tests (i.e. rejections of the null hypothesis) that are false positives

46Module 2: Analyzing Gene Lists Controlling Family-Wise Error Rate using the Bonferroni correction If M = # of annotations tested: Corrected P-value = M x original P-value Corrected P-value is greater than or equal to the probability that any single one of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”

47Module 2: Analyzing Gene Lists Bonferroni correction caveats Bonferroni correction is very stringent and can “wash away” real enrichments. Often users are willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.

48Module 2: Analyzing Gene Lists False discovery rate (FDR) FDR is the expected proportion of the observed enrichments that are due to random chance. Compare to Bonferroni correction which is the probability that any one of the observed enrichments is due to random chance.

49Module 2: Analyzing Gene Lists Benjamini-Hochberg (B-H) FDR If  is the desired FDR (ie level of significance), then choose the corresponding cutoff for the original P-values as follows: 1) Rank all “M” P-values P-valueRank … …M1234…M 2) Test each P-value against q =  x (M-Rank+1) / M e.g. Let M = 100,  q Is P-value < q? 0.05 X x X x x 0.01 No Yes … No 3) New P-value cutoff, i.e. “  ”, is lowest ranked P- value to pass the test. P-value cutoff of 0.04 ensures FDR < 0.05

50Module 2: Analyzing Gene Lists Reducing multiple test correction stringency The correction to the P-value threshold  depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be Can control the stringency by reducing the number of tests: e.g. use GO slim or restrict testing to the appropriate GO annotations.

51Module 2: Analyzing Gene Lists Overview Theory: –Review: What is a P-value? The good ole’ T-test. –Fisher’s Exact Test, the bread and butter of ORA –Enrichment analysis with gene rankings –Correcting for multiple testing Practice –Lab #1: Funspec: ORA using Fisher’s Exact Test –Lab #2: GOminer: ORA for Gene Ontology –Lab #3: Using GSEA to evaluate ranked lists

52Module 2: Analyzing Gene Lists What have we learned When testing multiple annotations, need to correct the P-values (or, equivalently,  ) to avoid winning the P-value lottery. There are two types of corrections: –Bonferroni controls the probability any one test is due to random chance (aka FWER) and is very stringent –B-H controls the FDR, i.e., expected proportion of “hits” that are due to random chance Can control stringency by carefully choosing which annotation categories to test.

53Module 2: Analyzing Gene Lists GoMiner, part Click “web interface” 2. Upload names of background genes 3. Upload gene list 4. Choose organism 5. Choose evidence code (All or Level 1)

54Module 2: Analyzing Gene Lists GoMiner, part 2 6. Restrict # of tests via category size 7. Restrict # of tests via GO hierarchy 8. Results ed to this address, in a few minutes

55Module 2: Analyzing Gene Lists DAVID, part 1 Paste list here Choose ID type List type: list or background? DAVID automatically detects organism

56Module 2: Analyzing Gene Lists DAVID, part 2

57Module 2: Analyzing Gene Lists BINGO, an ORA cytoscape plugin Links represent parent-child relationships in GO ontology Colours represent significance of enrichment Nodes represent GO categories

58Module 2: Analyzing Gene Lists What have we learned Web-based ORA tools for gene lists: –Funspec: easy tool for yeast, not maintained, uses GO annotations and some annotations (e.g. protein complexes) –GoMiner: Uses GO annotations, covers many organisms, needs a background set of genes Cytoscape-based ORA tools for gene lists: –BINGO: Does GO annotations and displays enrichment results graphically and visually organizes related categories

59Module 2: Analyzing Gene Lists Break for lab #2 Try out an over-representation analysis with multiple test correction using GOminer

60Module 2: Analyzing Gene Lists Questions?