Canadian Bioinformatics Workshops www.bioinformatics.ca.

Slides:



Advertisements
Similar presentations
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Introduction to Microarry Data Analysis - II BMI 730
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.
Using Gene Ontology Models and Tests Mark Reimers, NCI.
Differentially expressed genes
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Biological Interpretation of Microarray Data Helen Lockstone DTC Bioinformatics Course 9 th February 2010.
Significance Tests P-values and Q-values. Outline Statistical significance in multiple testing Statistical significance in multiple testing Empirical.
On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach Published by Steven L. Salzberg Presented by Prakash Tilwani MACS 598 April 25 th.
Gene Set Enrichment Analysis Petri Törönen petri(DOT)toronen(AT)helsinki.fi.
Decision Tree Models in Data Mining
Different Expression Multiple Hypothesis Testing STAT115 Spring 2012.
Inferential Statistics
Microarray Data Analysis Illumina Gene Expression Data Analysis Yun Lian.
INFERENTIAL STATISTICS – Samples are only estimates of the population – Sample statistics will be slightly off from the true values of its population’s.
False Discovery Rate (FDR) = proportion of false positive results out of all positive results (positive result = statistically significant result) Ladislav.
1Module 2: Analyzing Gene Lists Canadian Bioinformatics Workshops
Multiple testing correction
Multiple testing in high- throughput biology Petter Mostad.
Hypothesis Testing.
1 Using the TFOE transcriptional regulation network spreadsheet tool Tige Rustad, Senior Scientist at Seattle Biomed
Differential Analysis & FDR Correction
Gene Set Enrichment Analysis (GSEA)
Essential Statistics in Biology: Getting the Numbers Right
EGAN: Exploratory Gene Association Networks by Jesse Paquette Biostatistics and Computational Biology Core Helen Diller Family Comprehensive Cancer Center.
StAR web server tutorial for ROC Analysis. ROC Analysis ROC Analysis: This module allows the user to input data for several classifiers to be tested.
Jesse Gillis 1 and Paul Pavlidis 2 1. Department of Psychiatry and Centre for High-Throughput Biology University of British Columbia, Vancouver, BC Canada.
Suppose we have analyzed total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified - call them significant)
Basic features for portal users. Agenda - Basic features Overview –features and navigation Browsing data –Files and Samples Gene Summary pages Performing.
Fundamentals of Data Analysis Lecture 10 Management of data sets and improving the precision of measurement pt. 2.
Differential Gene Expression Dennis Kostka, Christine Steinhoff Slides adapted from Rainer Spang.
ANOVA (Analysis of Variance) by Aziza Munir
Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics Lab v1 | Saurabh Sinha1 Powerpoint by Casey Hanson.
Course on Functional Analysis
Copyright OpenHelix. No use or reproduction without express written consent1.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
BIOS6660 shRNAseq Gene Set Enrichment Analysis Tzu L Phang PhD Robert Stearman PhD April 16, 2014.
Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics | Saurabh Sinha | PowerPoint by Casey Hanson.
Statistical Testing with Genes Saurabh Sinha CS 466.
Suppose we have T genes which we measured under two experimental conditions (Ctl and Nic) in n replicated experiments t i * and p i are the t-statistic.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
DTC Quantitative Methods Bivariate Analysis: t-tests and Analysis of Variance (ANOVA) Thursday 14 th February 2013.
Getting the story – biological model based on microarray data Once the differentially expressed genes are identified (sometimes hundreds of them), we need.
The Broad Institute of MIT and Harvard Differential Analysis.
GO enrichment and GOrilla
Copyright OpenHelix. No use or reproduction without express written consent1.
CuffDiff ran successfully. Output files include gene_exp.diff What are the next steps? Use Navigation bar to find files; they may be under DNA Subway if.
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 –Multiple hypothesis testing Marshall University Genomics.
Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.
Gene Set Analysis using R and Bioconductor Daniel Gusenleitner
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Microarray Data Analysis Xuming He Department of Statistics University of Illinois at Urbana-Champaign.
Micro array Data Analysis. Differential Gene Expression Analysis The Experiment Micro-array experiment measures gene expression in Rats (>5000 genes).
Canadian Bioinformatics Workshops
Estimating the False Discovery Rate in Genome-wide Studies BMI/CS 576 Colin Dewey Fall 2008.
Gene Set Enrichment Analysis. GSEA: Key Features Ranks all genes on array based on their differential expression Identifies gene sets whose member genes.
Canadian Bioinformatics Workshops
Module 2: Analyzing gene lists: over-representation analysis
a Cytoscape plugin to assess enrichment of
Differential Gene Expression
Canadian Bioinformatics Workshops
Functional enrichment of differentially expressed genes.
Presentation transcript:

Canadian Bioinformatics Workshops

2Module #: Title of Module

Module 2 Finding over-represented pathways in gene lists

Module 2 bioinformatics.ca Outline Introduction to enrichment analysis Demo of DAVID, a web tool for doing enrichment analysis Theory of enrichment analysis: – Fisher’s Exact Test – Multiple test corrections – Other enrichment tests (GSEA) DAVID Lab

Module 2 bioinformatics.ca Enrichment Test Spindle Apoptosis Microarray Experiment (gene expression table) Gene-set Databases ENRICHMENT TEST ENRICHMENT TEST Enrichment Table

Module 2 bioinformatics.ca 6 Enrichment Analysis Given: 1.Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast) 2.Gene sets or annotations: e.g. Gene ontology, transcription factor binding sites in promoter Question: Are any of the gene annotations surprisingly enriched in the gene list? Details: – Where do the gene lists come from? – How to assess “surprisingly” (statistics) – How to correct for repeating the tests

Module 2 bioinformatics.ca Two-class Design Expression Matrix Class-1Class-2 Genes Ranked by Differential Statistic E.g.: - Fold change - Log (ratio) - t-test -Significance analysis of microarrays UP DOWN UP DOWN Selection by Threshold

Module 2 bioinformatics.ca Time-course Design Expression Matrix t1t1 t2t2 t3t3 …tntn Gene Clusters E.g.: - K-means - K-medoids - SOM

Module 2 bioinformatics.ca Enrichment Test Gene-set Databases Microarray Experiment (gene expression table) Gene list (e.g UP-regulated) Background (all genes on the array)

Module 2 bioinformatics.ca Enrichment Test Gene-set Databases Microarray Experiment (gene expression table) Gene list (e.g UP-regulated) Background (all genes on the array) Gene-set

Module 2 bioinformatics.ca Enrichment Test The P-value assesses the probability that the overlap is at least as large as observed by random sampling the array genes. Random samples of array genes The output of an enrichment test is a P-value

Module 2 bioinformatics.ca DAVID demo

Module 2 bioinformatics.ca Step 1: Define your gene list Either – (a) Copy and paste your list in – (b) Upload a gene list file – (c) Choose an example gene list (so, click “demolist1” on next slide)

Module 2 bioinformatics.ca Step 1: Define your gene list If you choose the wrong identifier, you may need to use the conversion tool Click here to define type of list (now we are doing a gene list, next slide we will define the background)

Module 2 bioinformatics.ca Step 2: Choose background You can either upload a background list (in the upload tab, see previous step) or choose one of the background sets (shown here) Example list is from the Human U95A array, select it here.

Module 2 bioinformatics.ca Step 3: Check list & background

Module 2 bioinformatics.ca Step 4: Run Enrichment Analysis Click here

Module 2 bioinformatics.ca Step 5: Select categories of gene lists to test Red indicates default selections, click check marks if you want to change Click here once you’ve selected your sets Click here to expand

Module 2 bioinformatics.ca Step 6: View enrichment EASE P-value 4.0E-4 means: 4.0 x Gene set name # of genes in set with annotation

Module 2 bioinformatics.ca Step 7: Change parameters if desired Set “count” to 1 and “EASE” to 1 if you want maximum # of categories. Beware, only corrects within category.

Module 2 bioinformatics.ca Step 8: Download results Download spreadsheet (in tab- delimited text format). Beware: if you click you may get text in a browser window, if this happens, “right-click” to save as a file.

Module 2 bioinformatics.ca Step 8a: What can happen if no right-click

Module 2 bioinformatics.ca Step 8b: How it looks in a spreadsheet

Module 2 bioinformatics.ca Outline of theory component Fisher’s Exact Test for calculating enrichment P-values (also used for calculating EASE score) Multiple test corrections: – Bonferroni – Benjamini-Hochberg FDR Other enrichment tests – GSEA for ranked lists (with a short intro)

Module 2 bioinformatics.ca 25 Fisher’s exact test a.k.a., the hypergeometric test Background population: 500 black genes, 4500 red genes Gene list RRP6 MRD1 RRP7 RRP43 RRP42 Null hypothesis: List is a random sample from population Alternative hypothesis: More black genes than expected

Module 2 bioinformatics.ca 26 Background population: 500 black genes, 4500 red genes Gene list RRP6 MRD1 RRP7 RRP43 RRP42 P-value Null distribution Answer = 4.6 x Fisher’s exact test a.k.a., the hypergeometric test

Module 2 bioinformatics.ca 27 Background population: 500 black genes, 4500 red genes Gene list RRP6 MRD1 RRP7 RRP43 RRP42 2x2 contingency table for Fisher’s Exact Test In gene listNot in gene list In gene set4496 Not in gene set14499 e.g.:

Module 2 bioinformatics.ca 28 Important details To test for under-enrichment of “black”, test for over- enrichment of “red”. The EASE score used by DAVID subtracts one from the observed overlap between gene list and gene set to ensure >1 from the list is in the gene set. Need to choose “background population” appropriately, e.g., if only portion of the total gene complement is queried (or available for annotation), only use that population as background. To test for enrichment of more than one independent types of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type. ***More on this later***

Module 2 bioinformatics.ca Multiple test corrections

Module 2 bioinformatics.ca How to win the P-value lottery, part 1 Background population: 500 black genes, 4500 red genes Random draws … 7,834 draws later … Expect a random draw with observed enrichment once every 1 / P-value draws

Module 2 bioinformatics.ca How to win the P-value lottery, part 2 Keep the gene list the same, evaluate different annotations Observed draw RRP6 MRD1 RRP7 RRP43 RRP42 Different annotation RRP6 MRD1 RRP7 RRP43 RRP42

Module 2 bioinformatics.ca Simple P-value correction: Bonferroni If M = # of annotations tested: Corrected P-value = M x original P-value Corrected P-value is greater than or equal to the probability that one or more of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”

Module 2 bioinformatics.ca Bonferroni correction caveats Bonferroni correction is very stringent and can “wash away” real enrichments. Often one is willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.

Module 2 bioinformatics.ca False discovery rate (FDR) FDR is the expected proportion of the observed enrichments due to random chance. Compare to Bonferroni correction which is a bound on the probability that any one of the observed enrichments could be due to random chance. Typically FDR corrections are calculated using the Benjamini-Hochberg procedure. FDR threshold is often called the “q-value”

Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRankFDR / Q-value … … Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation Sort P-values of all tests in decreasing order

Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRankFDR / Q-value … Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation x 53/1 = x 53/2 = x 53/3 = 0.35 … 0.04 x 53/50 = x 53/51 = x 53/52 = x 53/53 = 0.07 Adjusted P-value = P-value X [# of tests] / Rank …

Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRankFDR / Q-value … Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation x 53/1 = x 53/2 = x 53/3 = 0.35 … 0.04 x 53/50 = x 53/51 = x 53/52 = x 53/53 = 0.07 Q-value = minimum adjusted P-value at given rank or below … …

Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRank FDR / Q-value … Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation x 53/1 = x 53/2 = x 53/3 = 0.35 … 0.04 x 53/50 = x 53/51 = x 53/52 = x 53/53 = 0.07 P-value threshold is highest ranking P-value for which corresponding Q-value is below desired significance threshold … P-value threshold for FDR < 0.05 FDR < 0.05? YYY…YNNNYYY…YNNN …

Module 2 bioinformatics.ca Reducing multiple test correction stringency The correction to the P-value threshold  depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.

Module 2 bioinformatics.ca Multiple test correction in DAVID In DAVID, the “Benjamini-Hochberg” column corresponds to the false discovery rate as it is typically defined. It is unclear what the FDR means. DAVID does multiple test correction separately within each category of gene sets, so adding more categories does not change the FDRs or P-values. Be careful how you report these numbers.

Module 2 bioinformatics.ca Other enrichment tests UP DOWN ENRICHMENT TEST ENRICHMENT TEST Gene list Fisher’s Exact Test, Binomial and Chi- squared. Gene list Fisher’s Exact Test, Binomial and Chi- squared. Ranked list (semi- quantitative) GSEA, Wilcoxon ranksum, Mann-Whitney U, Kolmogorov-Smirnov Ranked list (semi- quantitative) GSEA, Wilcoxon ranksum, Mann-Whitney U, Kolmogorov-Smirnov UP DOWN

Module 2 bioinformatics.ca Why test enrichment in ranked lists? Possible problems with Fisher’s Exact Test – No “natural” value for the threshold – Different results at different threshold settings – Possible loss of statistical power due to thresholding No resolution between significant signals with different strengths Weak signals neglected

Module 2 bioinformatics.ca GSEA Enrichment Test Gene-setp-valueFDR Spindle Apoptosis Gene-set Databases GSEA Enrichment Table Ranked Gene List

Module 2 bioinformatics.ca GSEA: Method Steps 1.Calculate the ES score 2.Generate the ES distribution for the null hypothesis using permutations see permutation settings 3.Calculate the empirical p-value 4.Calculate the FDR Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A Oct 25;102(43)

Module 2 bioinformatics.ca GSEA: Method Where are the gene-set genes located in the ranked list? Is there distribution random, or is there an enrichment in either end? ES score calculation

Module 2 bioinformatics.ca GSEA: Method Every present gene (black vertical bar) gives a positive contribution, every absent gene (no vertical bar) gives a negative contribution ES score Gene rank ES score calculation

Module 2 bioinformatics.ca GSEA: Method MAX (absolute value) running ES score --> Final ES Score ES score calculation

Module 2 bioinformatics.ca GSEA: Method High ES score High local enrichment ES score calculation

Module 2 bioinformatics.ca GSEA: Method Empirical p-value estimation (for every gene-set) 1.Generate null-hypothesis distribution from randomized data (see permutation settings) Distribution of ES from N permutations (e.g. 2000) Counts ES Score

Module 2 bioinformatics.ca GSEA: Method Estimate empirical p-value by comparing observed ES score to null-hypothesis distribution from randomized data (for every gene-set) Distribution of ES from N permutations (e.g. 2000) Counts ES Score Real ES score value

Module 2 bioinformatics.ca GSEA: Method Estimate empirical p-value by comparing observed ES score to null-hypothesis distribution from randomized data (for every gene-set) Distribution of ES from N permutations (e.g. 2,000) Counts ES Score Real ES score value Randomized with ES ≥ real: 4 / 2,000 --> Empirical p-value = 0.002

Module 2 bioinformatics.ca Using GSEA Installation Launch Desktop Application from: Notes: – if you have sufficient RAM (*), go for the 1Gb option – running GSEA will take some time – (2-5 hrs depending on the system and the memory setting) – you need an internet connection to run GSEA (*)WIN: check using ALT+CTRL+CANC/Task Manager MAC: check using Applications/Utilities/Activity Monitor

Module 2 bioinformatics.ca Summary Enrichment analysis: – Statistical tests Gene list: Fisher’s Exact Test Gene rankings: GSEA, also see Wilcoxon ranksum, Mann-Whitney U-test, Kolmogorov-Smirnov test – Multiple test correction Bonferroni: stringent, controls probability of at least one false positive* FDR: more forgiving, controls expected proportion of false positives* -- typically uses Benjamini-Hochberg * Type 1 error, aka probability that observed enrichment if no association

Module 2 bioinformatics.ca Lab Time Part 1 – Try out DAVID – use the yeast gene list and Module 2 gene list – Follow Lab 2 protocol – Part 2 – Continue with the Cytoscape protocol – Cytoscape Part 3 – Try out enrichment map – load the plugin from the plugin manager – Load DAVID results – or - load the GSEA enrichment analysis file - EM_EstrogenMCF7_TestData.zip (unzip) available at

Module 2 bioinformatics.ca If DAVID doesn’t recognize your genes, it can try to detect the correct identifiers to use Upload your gene list

Module 2 bioinformatics.ca Step 1 Gene ID mapping results

Module 2 bioinformatics.ca Run the enrichment analysis

Module 2 bioinformatics.ca Run the enrichment analysis

Module 2 bioinformatics.ca We are on a Coffee Break & Networking Session