Gene Set Enrichment Analysis (GSEA)

Slides:



Advertisements
Similar presentations
Microarray statistical validation and functional annotation
Advertisements

Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Detecting active subnetworks in molecular interaction networks with missing data Luke Hunter Texas A&M University SHURP 2007 Student.
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Gene Set Enrichment Analysis (GSEA)
CAVEAT 1 MICROARRAY EXPERIMENTS ARE EXPENSIVE AND COMPLICATED. MICROARRAY EXPERIMENTS ARE THE STARTING POINT FOR RESEARCH. MICROARRAY EXPERIMENTS CANNOT.
Gene Ontology John Pinney
Data mining with the Gene Ontology Josep Lluís Mosquera April 2005 Grup de Recerca en Estadística i Bioinformàtica GOing into Biological Meaning.
Introduction to Functional Analysis J.L. Mosquera and Alex Sanchez.
Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.
Using Gene Ontology Models and Tests Mark Reimers, NCI.
Gene ontology & hypergeometric test Simon Rasmussen CBS - DTU.
Differentially expressed genes
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Sequence Comparison Intragenic - self to self. -find internal repeating units. Intergenic -compare two different sequences. Dotplot - visual alignment.
Biological Interpretation of Microarray Data Helen Lockstone DTC Bioinformatics Course 9 th February 2010.
Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Gene Set Enrichment Analysis Petri Törönen petri(DOT)toronen(AT)helsinki.fi.
26/11/2009Bioinformatics Workshop, Brno Pathway and Gene Set Analysis of Microarray Data Claus-D. Mayer Biomathematics & Statistics Scotland (BioSS)
Pathway analysis Daniel Hurley Pathway analysis: summary A popular buzzword… but what does it mean? A popular buzzword… but what does it mean? How do.
Microarray Gene Expression Data Analysis A.Venkatesh CBBL Functional Genomics Chapter: 07.
Statistical hypothesis testing – Inferential statistics I.
>>> Korean BioInformation Center >>> KRIBB Korea Research institute of Bioscience and Biotechnology GS2PATH: Linking Gene Ontology and Pathways Jin Ok.
Bayesian integration of biological prior knowledge into the reconstruction of gene regulatory networks Dirk Husmeier Adriano V. Werhli.
Multiple testing correction
Multiple testing in high- throughput biology Petter Mostad.
A Multivariate Biomarker for Parkinson’s Disease M. Coakley, G. Crocetti, P. Dressner, W. Kellum, T. Lamin The Michael L. Gargano 12 th Annual Research.
Metagenomic Analysis Using MEGAN4
GO::TermFinder Gavin Sherlock Department of Genetics Stanford University
1 Identifying differentially expressed sets of genes in microarray experiments Lecture 23, Statistics 246, April 15, 2004.
Frédéric Schütz Statistics and bioinformatics applied to –omics technologies Part II: Integrating biological knowledge Center.
DNA microarray technology allows an individual to rapidly and quantitatively measure the expression levels of thousands of genes in a biological sample.
Using Bayesian Networks to Analyze Expression Data N. Friedman, M. Linial, I. Nachman, D. Hebrew University.
EGAN: Exploratory Gene Association Networks by Jesse Paquette Biostatistics and Computational Biology Core Helen Diller Family Comprehensive Cancer Center.
Jesse Gillis 1 and Paul Pavlidis 2 1. Department of Psychiatry and Centre for High-Throughput Biology University of British Columbia, Vancouver, BC Canada.
Networks and Interactions Boo Virk v1.0.
Reconstructing gene networks Analysing the properties of gene networks Gene Networks Using gene expression data to reconstruct gene networks.
Analysis of Complex Proteomic Datasets Using Scaffold Free Scaffold Viewer can be downloaded at:
UBio Training Courses Micro-RNA web tools Gonzalo
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
Bioinformatics lectures at Rice University Li Zhang Lecture 9: Networks and integrative genomic analysis
Problem Limited number of experimental replications. Postgenomic data intrinsically noisy. Poor network reconstruction.
Summarizing Differential Expression Using Mann-Whitney U-tests.
Statistical Testing with Genes Saurabh Sinha CS 466.
Gene set analyses of genomic datasets Andreas Schlicker Jelle ten Hoeve Lodewyk Wessels.
1 ArrayTrack Demonstration National Center for Toxicological Research U.S. Food and Drug Administration 3900 NCTR Road, Jefferson, AR
While gene expression data is widely available describing mRNA levels in different cancer cells lines, the molecular regulatory mechanisms responsible.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Getting the story – biological model based on microarray data Once the differentially expressed genes are identified (sometimes hundreds of them), we need.
Spatial Smoothing and Multiple Comparisons Correction for Dummies Alexa Morcom, Matthew Brett Acknowledgements.
The Broad Institute of MIT and Harvard Differential Analysis.
Day 10 Analysing usability test results. Objectives  To learn more about how to understand and report quantitative test results  To learn about some.
Discovering functional interaction patterns in Protein-Protein Interactions Networks   Authors: Mehmet E Turnalp Tolga Can Presented By: Sandeep Kumar.
A Bayesian Method for Rank Agreggation Xuxin Liu, Jiong Du, Ke Deng, and Jun S Liu Department of Statistics Harvard University.
Copyright OpenHelix. No use or reproduction without express written consent1 1.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 –Multiple hypothesis testing Marshall University Genomics.
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
Tools in Bioinformatics Ontologies and pathways. Why are ontologies needed? A free text is the best way to describe what a protein does to a human reader.
URBDP 591 A Lecture 16: Research Validity and Replication Objectives Guidelines for Writing Final Paper Statistical Conclusion Validity Montecarlo Simulation/Randomization.
Gene Set Analysis using R and Bioconductor Daniel Gusenleitner
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Module 2: Analyzing gene lists: over-representation analysis
GO : the Gene Ontology & Functional enrichment analysis
Statistical Testing with Genes
Genesets and Enrichment
GSEA-Pro Tutorial Gene Set Enrichment Analysis for Prokaryotes
Statistical Testing with Genes
Presentation transcript:

Gene Set Enrichment Analysis (GSEA)

Gene expression analysis (Microarray & RNA-seq) Condition A (untreated) Condition B treated Gene expression matrix k genes (p)

Typical results: biological relevance? If we are lucky, some of the top genes mean something to us But what if they don’t? And how what are the results for other genes with similar biological functions

Gene Set Enrichment Analysis (GSEA)? Using prior knowledge about the genes to infer new information from a gene expression analysis experiment Gene set: a set of genes! All genes involved in a pathway are an example of a Gene Set All genes corresponding to a Gene Ontology term are a Gene Set All genes mentioned in a paper might form a Gene Set The aim is to give one number (score or p-value) to a Gene Set as a whole Are many genes in the pathway differentially expressed (up-regulated/down-regulated)? Can we give a number (p-value) to the probability of observing these changes just by chance?

What is a pathway? No clear definition Metabolic pathways are series of chemical reactions occurring within a cell. These pathways describe enzymes and metabolites. Extended to other biological processes, e.g. signalling pathways; gene regulatory networks; protein complexes In all cases a pathway describes a biological function / process very specifically

Overview Where to get gene sets: Pathway and Gene Set data resources GO, KeGG, Wikipathways, MSigDB, etc Self contained vs competitive tests Examples

Gene Set data resources The Gene Ontology (GO) database http://www.geneontology.org/ GO offers a relational/hierarchical database Parent nodes: more general terms Child nodes: more specific terms At the end of the hierarchy there are genes/proteins At the top there are 3 parent nodes: biological process, molecular function and cellular component Example: we search the database for the term “inflammation”

The genes on our array that code for one of the 44 gene products would form the corresponding “inflammation” gene set

KEGG pathway database KEGG = Kyoto Encyclopedia of Genes and Genomes http://www.genome.jp/kegg/pathway.html The pathway database gives far more detailed information than GO Relationships between genes and gene products But: this detailed information is only available for selected organisms and processes Example: Adipocytokine signaling pathway

Wikipathways http://www.wikipathways.org A wikipedia for pathways One can see and download pathways But also edit and contribute pathways The project is linked to the GenMAPP and Pathvisio analysis/visualisation tools

MSigDB MSigDB = Molecular Signature Database http://www.broadinstitute.org/gsea/msigdb Related to the the analysis program GSEA MSigDB offers gene sets based on various groupings Pathways GO terms Chromosomal position,…

GSEA Reminder: The aim is to give one number (score, p-value) to a Gene Set/Pathway Are many genes in the pathway differentially expressed (up-regulated/down-regulated)? Can we give a number (p-value) to the probability of observing these changes just by chance? Similar to single gene analysis, statistical hypothesis testing methods are often used

General differences between analysis tools Self contained vs competitive test The distinction between “self-contained” and “competitive” methods goes back to Goeman and Buehlman (2007) A self-contained method only uses the values for the genes of a gene set The nullhypothesis here is: H = {“No genes in the Gene Set are differentially expressed”} A competitive method compares the genes within the gene set with the other genes on the arrays Here we test against H: {“The genes in the Gene Set are not more differentially expressed than other genes”}

Example: Analysis for the GO-Term “inflammatory response” (GO:0006954)

Using Bioconductor software we can find 96 probesets on the array corresponding to this term 8 out of these have a p-value < 5% How many significant genes would we expect by chance? Depends on how we define “by chance”

The “self-contained” version By chance (i.e. if it is NOT differentially expressed) a gene should be significant with a probability of 5% We would expect 96 x 5% = 4.8 significant genes Using the binomial distribution we can calculate the probability of observing 8 or more significant genes as p = 10.8%, i.e. not quite significant

The “competitive” version: Overall 1272 out of 12639 genes are significant in this data set (10.1%) If we randomly pick 96 genes we would expect 96 x 10.1% = 9.7 genes to be significant “by chance” A p-value can be calculated based on the 2x2 table Tests for asscociation: Chi-Square-Test or Fisher’s exact test P-value from Fisher’s exact test (one-sided): 73.3%, i.e very far from being significant

Competitive results depend highly on how many genes are on the array and previous filtering On a small targeted array where all genes are changed, a competitive method might detect no differential Gene Sets at all Competitive tests can also be used with small sample sizes, even for n=1 BUT: The result gives no indication of whether it holds for a wider population of subjects, the p-value concerns a population of genes! Competitive tests typically give less significant results than self-contained (see our example) Fisher’s exact test (competitive) is probably the most widely used method!

Some general issues Direction of change Multiple Testing In our example we didn’t differentiate between up or down-regulated genes That can be achieved by repeating the analysis for p-values from one-sided test Eg. we could find GO-Terms that are significantly up-regulated With most software both approaches are possible Multiple Testing As we are testing many Gene Sets, we expect some significant findings “by chance” (false positives) Controlling the false discovery rate is tricky: The gene sets do overlap, so they will not be independent! Even more tricky in GO analysis where certain GO terms are subset of others The Bonferroni-Method is most conservative, but always works!

Dependence between genes All tests we discussed so far assumed that genes within the gene set are statistically independent That is highly unlikely! If genes are correlated the p-values of the gene set tests (eg. Fisher’s exact test) will be incorrect This can be addressed by resampling methods Reshuffle the group labels (Condition A vs. B) Repeat analysis Compare reshuffled with observed data Note: reshuffling the genes does not solve the problem!

Table of methods (from Nam & Kim, Brief in Bioinfo, 2008)

Table of software (from Nam & Kim)

Gene Set Enrichment Analysis (GSEA) http://www.broadinstitute.org/gsea/index.jsp GSEA allows to analyse any kind of gene set: pathways, GO terms, etc… It is available as a standalone program, but there are also versions of GSEA available within R/Bioconductor GSEA has many options and is a mix of a competitive and self-contained method The main idea is to use a Kolmogorov Smirnov-type statistic to test the distribution of the gene set in the ranked gene list (competitive) Typically that statistic (“enrichment score”) is tested by permuting/reshuffling the group labels (self-contained)

http://www.broadinstitute.org/gsea/doc/desktop_tutorial.jsp