Presentation is loading. Please wait.

Presentation is loading. Please wait.

Canadian Bioinformatics Workshops www.bioinformatics.ca.

Similar presentations


Presentation on theme: "Canadian Bioinformatics Workshops www.bioinformatics.ca."— Presentation transcript:

1 Canadian Bioinformatics Workshops www.bioinformatics.ca

2 2Module #: Title of Module

3 Module 2 Finding over-represented pathways in gene lists http://morrislab.med.utoronto.ca

4 Module 2 bioinformatics.ca Outline Introduction to enrichment analysis Demo of DAVID, a web tool for doing enrichment analysis Theory of enrichment analysis: – Fisher’s Exact Test – Multiple test corrections – Other enrichment tests (GSEA) DAVID Lab

5 Module 2 bioinformatics.ca Enrichment Test Spindle0.00001 Apoptosis0.00025 Microarray Experiment (gene expression table) Gene-set Databases ENRICHMENT TEST ENRICHMENT TEST Enrichment Table

6 Module 2 bioinformatics.ca 6 Enrichment Analysis Given: 1.Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast) 2.Gene sets or annotations: e.g. Gene ontology, transcription factor binding sites in promoter Question: Are any of the gene annotations surprisingly enriched in the gene list? Details: – Where do the gene lists come from? – How to assess “surprisingly” (statistics) – How to correct for repeating the tests

7 Module 2 bioinformatics.ca Two-class Design Expression Matrix Class-1Class-2 Genes Ranked by Differential Statistic E.g.: - Fold change - Log (ratio) - t-test -Significance analysis of microarrays UP DOWN UP DOWN Selection by Threshold

8 Module 2 bioinformatics.ca Time-course Design Expression Matrix t1t1 t2t2 t3t3 …tntn Gene Clusters E.g.: - K-means - K-medoids - SOM

9 Module 2 bioinformatics.ca Enrichment Test Gene-set Databases Microarray Experiment (gene expression table) Gene list (e.g UP-regulated) Background (all genes on the array)

10 Module 2 bioinformatics.ca Enrichment Test Gene-set Databases Microarray Experiment (gene expression table) Gene list (e.g UP-regulated) Background (all genes on the array) Gene-set

11 Module 2 bioinformatics.ca Enrichment Test The P-value assesses the probability that the overlap is at least as large as observed by random sampling the array genes. Random samples of array genes The output of an enrichment test is a P-value

12 Module 2 bioinformatics.ca DAVID demo http://david.abcc.ncifcrf.gov/tools.jsp

13 Module 2 bioinformatics.ca Step 1: Define your gene list Either – (a) Copy and paste your list in – (b) Upload a gene list file – (c) Choose an example gene list (so, click “demolist1” on next slide)

14 Module 2 bioinformatics.ca Step 1: Define your gene list If you choose the wrong identifier, you may need to use the conversion tool Click here to define type of list (now we are doing a gene list, next slide we will define the background)

15 Module 2 bioinformatics.ca Step 2: Choose background You can either upload a background list (in the upload tab, see previous step) or choose one of the background sets (shown here) Example list is from the Human U95A array, select it here.

16 Module 2 bioinformatics.ca Step 3: Check list & background

17 Module 2 bioinformatics.ca Step 4: Run Enrichment Analysis Click here

18 Module 2 bioinformatics.ca Step 5: Select categories of gene lists to test Red indicates default selections, click check marks if you want to change Click here once you’ve selected your sets Click here to expand

19 Module 2 bioinformatics.ca Step 6: View enrichment EASE P-value 4.0E-4 means: 4.0 x 10 -4 Gene set name # of genes in set with annotation

20 Module 2 bioinformatics.ca Step 7: Change parameters if desired Set “count” to 1 and “EASE” to 1 if you want maximum # of categories. Beware, only corrects within category.

21 Module 2 bioinformatics.ca Step 8: Download results Download spreadsheet (in tab- delimited text format). Beware: if you click you may get text in a browser window, if this happens, “right-click” to save as a file.

22 Module 2 bioinformatics.ca Step 8a: What can happen if no right-click

23 Module 2 bioinformatics.ca Step 8b: How it looks in a spreadsheet

24 Module 2 bioinformatics.ca Outline of theory component Fisher’s Exact Test for calculating enrichment P-values (also used for calculating EASE score) Multiple test corrections: – Bonferroni – Benjamini-Hochberg FDR Other enrichment tests – GSEA for ranked lists (with a short intro)

25 Module 2 bioinformatics.ca 25 Fisher’s exact test a.k.a., the hypergeometric test Background population: 500 black genes, 4500 red genes Gene list RRP6 MRD1 RRP7 RRP43 RRP42 Null hypothesis: List is a random sample from population Alternative hypothesis: More black genes than expected

26 Module 2 bioinformatics.ca 26 Background population: 500 black genes, 4500 red genes Gene list RRP6 MRD1 RRP7 RRP43 RRP42 P-value Null distribution Answer = 4.6 x 10 -4 Fisher’s exact test a.k.a., the hypergeometric test

27 Module 2 bioinformatics.ca 27 Background population: 500 black genes, 4500 red genes Gene list RRP6 MRD1 RRP7 RRP43 RRP42 2x2 contingency table for Fisher’s Exact Test In gene listNot in gene list In gene set4496 Not in gene set14499 e.g.: http://www.graphpad.com/quickcalcs/contingency1.cfm

28 Module 2 bioinformatics.ca 28 Important details To test for under-enrichment of “black”, test for over- enrichment of “red”. The EASE score used by DAVID subtracts one from the observed overlap between gene list and gene set to ensure >1 from the list is in the gene set. Need to choose “background population” appropriately, e.g., if only portion of the total gene complement is queried (or available for annotation), only use that population as background. To test for enrichment of more than one independent types of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type. ***More on this later***

29 Module 2 bioinformatics.ca Multiple test corrections

30 Module 2 bioinformatics.ca How to win the P-value lottery, part 1 Background population: 500 black genes, 4500 red genes Random draws … 7,834 draws later … Expect a random draw with observed enrichment once every 1 / P-value draws

31 Module 2 bioinformatics.ca How to win the P-value lottery, part 2 Keep the gene list the same, evaluate different annotations Observed draw RRP6 MRD1 RRP7 RRP43 RRP42 Different annotation RRP6 MRD1 RRP7 RRP43 RRP42

32 Module 2 bioinformatics.ca Simple P-value correction: Bonferroni If M = # of annotations tested: Corrected P-value = M x original P-value Corrected P-value is greater than or equal to the probability that one or more of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”

33 Module 2 bioinformatics.ca Bonferroni correction caveats Bonferroni correction is very stringent and can “wash away” real enrichments. Often one is willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.

34 Module 2 bioinformatics.ca False discovery rate (FDR) FDR is the expected proportion of the observed enrichments due to random chance. Compare to Bonferroni correction which is a bound on the probability that any one of the observed enrichments could be due to random chance. Typically FDR corrections are calculated using the Benjamini-Hochberg procedure. FDR threshold is often called the “q-value”

35 Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRankFDR / Q-value 1 2 3 … 50 51 52 53 0.001 0.01 0.02 … 0.04 0.05 0.06 0.07 Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation Sort P-values of all tests in decreasing order

36 Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRankFDR / Q-value 1 2 3 … 50 51 52 53 Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation 0.001 x 53/1 = 0.053 0.01 x 53/2 = 0.27 0.02 x 53/3 = 0.35 … 0.04 x 53/50 = 0.042 0.05 x 53/51 = 0.052 0.06 x 53/52 = 0.061 0.07 x 53/53 = 0.07 Adjusted P-value = P-value X [# of tests] / Rank 0.001 0.01 0.02 … 0.04 0.05 0.06 0.07

37 Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRankFDR / Q-value 1 2 3 … 50 51 52 53 Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation 0.001 x 53/1 = 0.053 0.01 x 53/2 = 0.27 0.02 x 53/3 = 0.35 … 0.04 x 53/50 = 0.042 0.05 x 53/51 = 0.052 0.06 x 53/52 = 0.061 0.07 x 53/53 = 0.07 Q-value = minimum adjusted P-value at given rank or below 0.042 … 0.042 0.052 0.061 0.07 0.001 0.01 0.02 … 0.04 0.05 0.06 0.07

38 Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRank FDR / Q-value 1 2 3 … 50 51 52 53 Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation 0.001 x 53/1 = 0.053 0.01 x 53/2 = 0.27 0.02 x 53/3 = 0.35 … 0.04 x 53/50 = 0.042 0.05 x 53/51 = 0.052 0.06 x 53/52 = 0.061 0.07 x 53/53 = 0.07 P-value threshold is highest ranking P-value for which corresponding Q-value is below desired significance threshold 0.042 … 0.042 0.052 0.061 0.07 P-value threshold for FDR < 0.05 FDR < 0.05? YYY…YNNNYYY…YNNN 0.001 0.01 0.02 … 0.04 0.05 0.06 0.07

39 Module 2 bioinformatics.ca Reducing multiple test correction stringency The correction to the P-value threshold  depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.

40 Module 2 bioinformatics.ca Multiple test correction in DAVID In DAVID, the “Benjamini-Hochberg” column corresponds to the false discovery rate as it is typically defined. It is unclear what the FDR means. DAVID does multiple test correction separately within each category of gene sets, so adding more categories does not change the FDRs or P-values. Be careful how you report these numbers.

41 Module 2 bioinformatics.ca Other enrichment tests UP DOWN ENRICHMENT TEST ENRICHMENT TEST Gene list Fisher’s Exact Test, Binomial and Chi- squared. Gene list Fisher’s Exact Test, Binomial and Chi- squared. Ranked list (semi- quantitative) GSEA, Wilcoxon ranksum, Mann-Whitney U, Kolmogorov-Smirnov Ranked list (semi- quantitative) GSEA, Wilcoxon ranksum, Mann-Whitney U, Kolmogorov-Smirnov UP DOWN

42 Module 2 bioinformatics.ca Why test enrichment in ranked lists? Possible problems with Fisher’s Exact Test – No “natural” value for the threshold – Different results at different threshold settings – Possible loss of statistical power due to thresholding No resolution between significant signals with different strengths Weak signals neglected

43 Module 2 bioinformatics.ca GSEA Enrichment Test Gene-setp-valueFDR Spindle0.00010.01 Apoptosis0.0250.09 Gene-set Databases GSEA Enrichment Table Ranked Gene List

44 Module 2 bioinformatics.ca GSEA: Method Steps 1.Calculate the ES score 2.Generate the ES distribution for the null hypothesis using permutations see permutation settings 3.Calculate the empirical p-value 4.Calculate the FDR Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct 25;102(43)

45 Module 2 bioinformatics.ca GSEA: Method Where are the gene-set genes located in the ranked list? Is there distribution random, or is there an enrichment in either end? ES score calculation

46 Module 2 bioinformatics.ca GSEA: Method Every present gene (black vertical bar) gives a positive contribution, every absent gene (no vertical bar) gives a negative contribution ES score Gene rank 110014001 ES score calculation

47 Module 2 bioinformatics.ca GSEA: Method MAX (absolute value) running ES score --> Final ES Score ES score calculation

48 Module 2 bioinformatics.ca GSEA: Method High ES score High local enrichment ES score calculation

49 Module 2 bioinformatics.ca GSEA: Method Empirical p-value estimation (for every gene-set) 1.Generate null-hypothesis distribution from randomized data (see permutation settings) Distribution of ES from N permutations (e.g. 2000) Counts ES Score

50 Module 2 bioinformatics.ca GSEA: Method Estimate empirical p-value by comparing observed ES score to null-hypothesis distribution from randomized data (for every gene-set) Distribution of ES from N permutations (e.g. 2000) Counts ES Score Real ES score value

51 Module 2 bioinformatics.ca GSEA: Method Estimate empirical p-value by comparing observed ES score to null-hypothesis distribution from randomized data (for every gene-set) Distribution of ES from N permutations (e.g. 2,000) Counts ES Score Real ES score value Randomized with ES ≥ real: 4 / 2,000 --> Empirical p-value = 0.002

52 Module 2 bioinformatics.ca Using GSEA Installation Launch Desktop Application from: http://www.broadinstitute.org/gsea/msigdb/downloads.jsp Notes: – if you have sufficient RAM (*), go for the 1Gb option – running GSEA will take some time – (2-5 hrs depending on the system and the memory setting) – you need an internet connection to run GSEA (*)WIN: check using ALT+CTRL+CANC/Task Manager MAC: check using Applications/Utilities/Activity Monitor

53 Module 2 bioinformatics.ca Summary Enrichment analysis: – Statistical tests Gene list: Fisher’s Exact Test Gene rankings: GSEA, also see Wilcoxon ranksum, Mann-Whitney U-test, Kolmogorov-Smirnov test – Multiple test correction Bonferroni: stringent, controls probability of at least one false positive* FDR: more forgiving, controls expected proportion of false positives* -- typically uses Benjamini-Hochberg * Type 1 error, aka probability that observed enrichment if no association

54 Module 2 bioinformatics.ca Lab Time Part 1 – Try out DAVID – use the yeast gene list and Module 2 gene list – Follow Lab 2 protocol – http://david.abcc.ncifcrf.gov/ Part 2 – Continue with the Cytoscape protocol – http://opentutorials.rbvi.ucsf.edu/index.php/Tutorial:Introduction_to_ Cytoscape Part 3 – Try out enrichment map – load the plugin from the plugin manager – Load DAVID results – or - load the GSEA enrichment analysis file - EM_EstrogenMCF7_TestData.zip (unzip) available at http://baderlab.org/Software/EnrichmentMap

55 Module 2 bioinformatics.ca If DAVID doesn’t recognize your genes, it can try to detect the correct identifiers to use Upload your gene list

56 Module 2 bioinformatics.ca Step 1 Gene ID mapping results

57 Module 2 bioinformatics.ca Run the enrichment analysis

58 Module 2 bioinformatics.ca Run the enrichment analysis

59 Module 2 bioinformatics.ca We are on a Coffee Break & Networking Session


Download ppt "Canadian Bioinformatics Workshops www.bioinformatics.ca."

Similar presentations


Ads by Google