Module 2: Analyzing gene lists: over-representation analysis

Slides:



Advertisements
Similar presentations
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Data mining with the Gene Ontology Josep Lluís Mosquera April 2005 Grup de Recerca en Estadística i Bioinformàtica GOing into Biological Meaning.
Using Gene Ontology Models and Tests Mark Reimers, NCI.
Continuous Random Variables & The Normal Probability Distribution
Differentially expressed genes
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Cbio course, spring 2005, Hebrew University (Alignment) Score Statistics.
Biological Interpretation of Microarray Data Helen Lockstone DTC Bioinformatics Course 9 th February 2010.
On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach Published by Steven L. Salzberg Presented by Prakash Tilwani MACS 598 April 25 th.
Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Review for Exam 2 Some important themes from Chapters 6-9 Chap. 6. Significance Tests Chap. 7: Comparing Two Groups Chap. 8: Contingency Tables (Categorical.
Different Expression Multiple Hypothesis Testing STAT115 Spring 2012.
1Module 2: Analyzing Gene Lists Canadian Bioinformatics Workshops
Multiple testing correction
Multiple testing in high- throughput biology Petter Mostad.
GO::TermFinder Gavin Sherlock Department of Genetics Stanford University
Gene Set Enrichment Analysis (GSEA)
Essential Statistics in Biology: Getting the Numbers Right
Sections 6-1 and 6-2 Overview Estimating a Population Proportion.
Non-parametric Tests. With histograms like these, there really isn’t a need to perform the Shapiro-Wilk tests!
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
Comp. Genomics Recitation 3 The statistics of database searching.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Module networks Sushmita Roy BMI/CS 576 Nov 18 th & 20th, 2014.
Statistical Testing with Genes Saurabh Sinha CS 466.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
The Broad Institute of MIT and Harvard Differential Analysis.
GO enrichment and GOrilla
Chapter 13 Understanding research results: statistical inference.
A Quantitative Overview to Gene Expression Profiling in Animal Genetics Armidale Animal Breeding Summer Course, UNE, Feb Analysis of (cDNA) Microarray.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Class Seven Turn In: Chapter 18: 32, 34, 36 Chapter 19: 26, 34, 44 Quiz 3 For Class Eight: Chapter 20: 18, 20, 24 Chapter 22: 34, 36 Read Chapters 23 &
Micro array Data Analysis. Differential Gene Expression Analysis The Experiment Micro-array experiment measures gene expression in Rats (>5000 genes).
1 A Discussion of False Discovery Rate and the Identification of Differentially Expressed Gene Categories in Microarray Studies Ames, Iowa August 8, 2007.
Clench 2.0 A program for cluster enrichment analysis and integrated visualization of expression, annotation and transcription factor binding site data.
Canadian Bioinformatics Workshops
I. ANOVA revisited & reviewed
Introduction to Marketing Research
Data measurement, probability and Spearman’s Rho
a Cytoscape plugin to assess enrichment of
Clustering Manpreet S. Katari.
DTC Quantitative Methods Bivariate Analysis: t-tests and Analysis of Variance (ANOVA) Thursday 20th February 2014  
Statistics in psychology
Statistical Testing with Genes
Association between two categorical variables
Canadian Bioinformatics Workshops
Testing for a difference
CHOOSING A STATISTICAL TEST
Overview and Basics of Hypothesis Testing
Inferential Statistics
SA3202 Statistical Methods for Social Sciences
Chapter 9 Hypothesis Testing.
Sequence comparison: Multiple testing correction
Elementary Statistics
Hypothesis testing. Chi-square test
Chapter 10 Analyzing the Association Between Categorical Variables
Chi Square (2) Dr. Richard Jackson
Writing the IA Report: Analysis and Evaluation
Chi2 (A.K.A X2).
Data measurement, probability and statistical tests
Sequence comparison: Multiple testing correction
False discovery rate estimation
The Omics Dashboard.
Nonparametric Statistics
Varying Intolerance of Gene Pathways to Mutational Classes Explain Genetic Convergence across Neuropsychiatric Disorders  Shahar Shohat, Eyal Ben-David,
Statistical Testing with Genes
Presentation transcript:

Module 2: Analyzing gene lists: over-representation analysis Interpreting Genes from OMICS Studies Quaid Morris Module 2: Analyzing Gene Lists

Overview The basics of over-representation analysis Lab #1 Gene list statistics: A taxonomy of tests for over-representation Correcting for multiple tests Easy-to-use software tools for over-representation analysis, Lab #2 Module 2: Analyzing Gene Lists

Overview The basics of over-representation analysis Lab #1 Gene list statistics: A taxonomy of tests for over-representation Correcting for multiple tests Easy-to-use software tools for over-representation analysis, Lab #2 Module 2: Analyzing Gene Lists

Over-representation analysis (ORA) in a nutshell Given: Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast) Gene annotations: e.g. Gene ontology, transcription factor binding sites in promoter ORA Question: Are any of the gene annotations surprisingly enriched in the gene list? Details: How to assess “surprisingly” (statistics) How to correct for repeating the tests Module 2: Analyzing Gene Lists

ORA example: Fisher’s exact test a.k.a., the hypergeometric test Gene list RRP6 MRD1 RRP7 RRP43 RRP42 Formal question: What is the probability of finding 4 or more black genes in a random sample of 5 genes? Background population: 500 black genes, 5000 red genes Module 2: Analyzing Gene Lists

ORA example: Fisher’s exact test Gene list Null distribution RRP6 MRD1 RRP7 RRP43 RRP42 P-value Answer = 4.6 x 10-4 Background population: 500 black genes, 5000 red genes Module 2: Analyzing Gene Lists

Important details To test for under-enrichment of “black”, test for over-enrichment of “red”. Need to choose “background population” appropriately, e.g., if only portion of the total gene complement is queried (or available for annotation), only use that population as background. To test for enrichment of more than one independent types of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type. ***More on this later*** Module 2: Analyzing Gene Lists

What have we learned? Over-representation analysis (ORA) detects surprising enrichment of gene annotations in a gene list. Fisher’s exact test is used for ORA of gene lists for a single type of annotation, P-value for Fisher’s exact test is “the probability that a random draw of the same size as the gene list from the background population would produce the observed number of annotations in the gene list or more.”, and depends on size of both gene list and background population as well and # of black genes in gene list and background. Module 2: Analyzing Gene Lists

Overview The basics of over-representation analysis Lab #1 Gene list statistics: A taxonomy of tests for over-representation Correcting for multiple tests Easy-to-use software tools for over-representation analysis, Lab #2 Module 2: Analyzing Gene Lists

Break for lab #1 Try out an over-representation analysis using Fisher’s exact test Funspec: http://funspec.med.utoronto.ca/ Module 2: Analyzing Gene Lists

Overview The basics of over-representation analysis Lab #1 Gene list statistics: A taxonomy of tests for over-representation Correcting for multiple tests Easy-to-use software tools for over-representation analysis, Lab #2 Module 2: Analyzing Gene Lists

Examples of sources of gene lists Clustering Thresholding a gene “score” Gene list Gene list Genes Genes Examples of gene scores Time Source Eisen et al. (1998) PNAS 95 Source: Gerber et al. (2006) PNAS103 Module 2: Analyzing Gene Lists

ORA using gene scores Gene scores 6 7 5 Gene score distributions 1 2 Question: How likely are the differences between the two distributions due to chance? Module 2: Analyzing Gene Lists

ORA using the T-test Answer: Two-tailed T-test Black: N1=500 Gene score distributions Black: N1=500 Mean: m1 = 1.1 Std: s1 = 0.9 Red: N2=4500 Mean: m1 = 4.9 Std: s1 = 1.0 T-statistic = Formal Question: Are the means of the two distributions significantly different? = -88.5 Module 2: Analyzing Gene Lists

ORA using the T-test P-value = shaded area * 2 -88.5 Gene score distributions T-distribution Probability density T-statistic T-statistic = Formal Question: Are the means of the two distributions significantly different? = -88.5 Module 2: Analyzing Gene Lists

T-test caveats (also see next slide) Assumes black and red gene score distributions are both approximately Gaussian (i.e. normal) Score distribution assumption is often true for: Log ratios from microarrays Score distribution assumption is rarely true for: Peptide counts, sequence tags (SAGE or NextGen sequencing), transcription factor binding sites hits Tests for significance of difference in means of two distribution but does not test for other differences between distributions. I’m Module 2: Analyzing Gene Lists

Examples of inappropriate score distributions for T-tests Probability density Gene score  Distributions with gene score outliers, or “heavy-tailed” distributions Probability density Gene score  Bimodal “two-bumped” distributions. Probability density Gene score  Gene scores are positive and have increasing density near zero, e.g. sequence counts Solutions: 1) Robust test for difference of medians (WMW) 2) Direct test of difference of distributions (K-S) Module 2: Analyzing Gene Lists

Wilcoxon-Mann-Whitney (WMW) test aka Mann-Whitney U-test, Wilcoxon rank-sum test 1) Rank gene scores, calculate RB, sum of ranks of black gene scores ranks 3.2 1.7 6.5 4.5 0.1 2.1 5.6 -1.1 -2.5 -0.5 N2 red gene scores N1 black gene 6.5 5.6 4.5 3.2 2.1 1.7 0.1 -1.1 -2.5 -0.5 1 2 3 4 5 6 7 8 9 10 Probability density RB = 21 Gene score  Formal Question: Are the medians of the two distributions significantly different? Module 2: Analyzing Gene Lists Z

Wilcoxon-Mann-Whitney (WMW) test aka Mann-Whitney U-test, Wilcoxon rank-sum test 2) Calculate Z-score: mean rank RB = 21 = -1.4 Probability density Normal distribution Probability density P-value = shaded area * 2 -1.4 3) Calculate P-value: Gene score  Formal Question: Are the medians of the two distributions significantly different? Module 2: Analyzing Gene Lists Z Z

WMW test details Described method is only applicable for large N1 and N2 and when there are no tied scores Note: WMW test calculates the significance of the difference of medians, T-test calculates the significance of the difference of means WMW test is robust to (a few) outliers Module 2: Analyzing Gene Lists

Kolmogorov-Smirnov (K-S) test Cumulative distribution Probability density Gene score  1.0 Cumulative probability 0.5 Gene score  Question: Are the red and black distributions significantly different? 1) Calculate cumulative distributions of red and black Module 2: Analyzing Gene Lists

Kolmogorov-Smirnov (K-S) test Cumulative distribution Probability density Gene score  1.0 Cumulative probability 0.5 Gene score  Question: Are the red and black distributions significantly different? 1) Calculate cumulative distributions of red and black Module 2: Analyzing Gene Lists

Kolmogorov-Smirnov (K-S) test Cumulative distribution Probability density Gene score  1.0 Cumulative probability 0.5 Length = 0.4 Gene score  Formal question: Is the length of largest difference between the “empirical distribution functions” statistically significant? Module 2: Analyzing Gene Lists

WMW and K-S test caveats Neither tests is as sensitive as the T-test, ie they require more data points to detect the same amount of difference, so use the T-test whenever it is valid. K-S test and WMW can give you different answers: K-S detects difference of distributions, WMW detects difference of medians Rare problem: Tied scores and small # of observations can be a problem for some implementations of the WMW test Module 2: Analyzing Gene Lists

Proper tests for different distributions Probability density Gene score  Distributions with gene score outliers, or “heavy-tailed” distributions Probability density Gene score  Bimodal “two-bumped” distributions. Probability density Gene score  Gene scores are positive and have increasing density near zero, e.g. sequence counts First example goes last Recommended test: WMW or K-S K-S only WMW or K-S Module 2: Analyzing Gene Lists

What have we learned? T-test is not valid when one or both of the score distributions is not normal, If need a “robust” test, or to test for difference of medians use WMW test, To test for overall difference between two distributions, use K-S test. Module 2: Analyzing Gene Lists

Other common tests and distributions Chi-squared (contingency table) test Useful if there are >2 values of annotation (e.g. red genes, black genes, and blue genes) Used as an approximation to Fisher’s Exact Test but is inaccurate for small gene lists Binomial test Tests if gene scores for red and black either come from either N flips of the same coin or different coins. E.g. black genes are “expressed” in, on average, 5 out of 12 conditions and red genes are expressed in, on average, 2 out of 12 conditions, is the probability of being expressed significantly different for the black and red genes? Module 2: Analyzing Gene Lists

Overview The basics of over-representation analysis Lab #1 Gene list statistics: A taxonomy of tests for over-representation Correcting for multiple tests Easy-to-use software tools for over-representation analysis, Lab #2 Module 2: Analyzing Gene Lists

How to win the P-value lottery, part 1 Random draws Expect a random draw with observed enrichment once every 1 / P-value draws … 7,834 draws later … Whenever you do multiple tests, need to correct the p-values Remind audience of p-value definition Motivating example: Do 20 tests at a p-value of 0.05, expect one false positive Background population: 500 black genes, 5000 red genes Module 2: Analyzing Gene Lists

How to win the P-value lottery, part 2 Keep the gene list the same, evaluate different annotations Observed draw Different annotations RRP6 MRD1 RRP7 RRP43 RRP42 RRP6 MRD1 RRP7 RRP43 RRP42 Use different annotations. “evaluate different annotations” Module 2: Analyzing Gene Lists

ORA tests need correction From the Gene Ontology website: Current ontology statistics: 25206 terms 14825 biological process 2101 cellular component 8280 molecular function Module 2: Analyzing Gene Lists

Simple P-value correction: Bonferroni If M = # of annotations tested: Corrected P-value = M x original P-value Corrected P-value is greater than or equal to the probability that any single one of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)” Module 2: Analyzing Gene Lists

Bonferroni correction caveats Bonferroni correction is very stringent and can “wash away” real enrichments. Often users are willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments. Module 2: Analyzing Gene Lists

False discovery rate (FDR) FDR is the expected proportion of the observed enrichments that are due to random chance. Compare to Bonferroni correction which is the probability that any one of the observed enrichments is due to random chance. Module 2: Analyzing Gene Lists

Benjamini-Hochberg (B-H) FDR If a is the desired FDR (ie level of significance), then choose the corresponding cutoff for the original P-values as follows: 2) Test each P-value against q = a x (M-Rank+1) / M e.g. Let M = 100, a = 0.05 q Is P-value < q? 0.05 X 1.00 0.05 x 0.99 0.05 X 0.98 0.05 x 0.97 ... 0.05 x 0.01 No Yes … 1) Rank all “M” P-values P-value Rank 0.9 0.7 0.5 0.04 … 0.005 1 2 3 4 … M 3) New P-value cutoff, i.e. “a”, is first P-value to pass the test. Have numerical example. P-value cutoff of 0.04 ensures FDR < 0.05 Module 2: Analyzing Gene Lists

Reducing multiple test correction stringency The correction to the P-value threshold a depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be Can control the stringency by reducing the number of tests: e.g. use GO slim or restrict testing to the appropriate GO annotations. Module 2: Analyzing Gene Lists

What have we learned When testing multiple annotations, need to correct the P-values (or, equivalently, a) to avoid winning the P-value lottery. There are two types of corrections: Bonferroni controls the probability any one test is due to random chance (aka FWER) and is very stringent B-H controls the FDR, i.e., expected proportion of “hits” that are due to random chance Can control stringency by carefully choosing which annotation categories to test. Module 2: Analyzing Gene Lists

Overview The basics of over-representation analysis Lab #1 Gene list statistics: A taxonomy of tests for over-representation Correcting for multiple tests Easy-to-use software tools for over-representation analysis, Lab #2 Module 2: Analyzing Gene Lists

Funspec: Simple ORA for yeast http://funspec.med.utoronto.ca/ Cavaets: yeast only, last updated 2002 Choose sources of annotation Paste gene list here Bonferroni correct? YES! Module 2: Analyzing Gene Lists

GoMiner, part 1 http://discover.nci.nih.gov/gominer 1. Click “web interface” 2. Upload names of background genes 3. Upload gene list 4. Choose organism 5. Choose evidence code (All or Level 1) Module 2: Analyzing Gene Lists

GoMiner, part 2 6. Restrict # of tests via category size 7. Restrict # of tests via GO hierarchy 8. Results emailed to this address, in a few minutes Module 2: Analyzing Gene Lists

DAVID, part 1 http://david.abcc.ncifcrf.gov/ Paste list here DAVID automatically detects organism Choose ID type List type: list or background? Module 2: Analyzing Gene Lists

DAVID, part 2 http://david.abcc.ncifcrf.gov/ Module 2: Analyzing Gene Lists

BINGO, an ORA cytoscape plugin http://www. psb. ugent Links represent parent-child relationships in GO ontology Colours represent significance of enrichment Nodes represent GO categories Module 2: Analyzing Gene Lists

Other tools GSEA: Gene Set Enrichment Analysis http://www.broad.mit.edu/gsea/ More complex tool that allows gene scores to be analyzed for enrichment Has extensive gene annotations available Module 2: Analyzing Gene Lists

What have we learned Web-based ORA tools for gene lists: Funspec: easy tool for yeast, not maintained, uses GO annotations and some annotations (e.g. protein complexes) GoMiner: Uses GO annotations, covers many organisms, needs a background set of genes Cytoscape-based ORA tools for gene lists: BINGO: Does GO annotations and displays enrichment results graphically and visually organizes related categories Module 2: Analyzing Gene Lists

Overview The basics of over-representation analysis Lab #1 Gene list statistics: A taxonomy of tests for over-representation Correcting for multiple tests Easy-to-use software tools for over-representation analysis, Lab #2 Module 2: Analyzing Gene Lists

Lab #2 Use GoMiner to analyze a yeast gene list. Protocol: Step 1: Get list of all yeast genes from Biomart http://www.biomart.org/biomart/martview Step 2: Translate gene list IDs into gene symbols using Synergizer http://llama.med.harvard.edu/cgi/synergizer/translate Step 3: Do an enrichment analysis using GoMiner http://discover.nci.nih.gov/gominer/ Module 2: Analyzing Gene Lists

Questions? Module 2: Analyzing Gene Lists