Gene Set Analysis using R and Bioconductor Daniel Gusenleitner


Why Gene Sets? Phenotypic characteristics or clinical diseases can only rarely be attributed to a single gene. Most diseases are complex and involve multiple genes. Genes usually do not work independently; they work as parts of a functional unit.

Genes and Proteins talk in Pathways

Definition of Gene Sets: Gene sets are loosely defined as groups of genes that share biological mechanisms or characteristics. They represent the distilled base of biological knowledge and act as an aid for theoretical and experimental research.

There are different kinds of gene sets. Data-driven gene sets are usually derived from high-throughput experiments that identify sets of related genes. Knowledge-driven gene sets require expert knowledge to construct and are usually specific to a domain of interest.

Resources for Gene Sets

1. Search PubMed with pre-defined search criteria
2. Extract gene signatures from tables, figures or supplement
3. Annotate each gene signature
4. Map all mappable identifiers to genome to create standardized gene lists

Name: Description
PMID: PubMed identifier
Tissue: Name of search term set used to search PubMed
Organism: Species common name (human, mouse, etc.)
Platform: Name of microarray or other experimental technique used to derive gene signature
Platform Description: Description of platform
Genes: Number of genes in gene signature
Sig ID: Signature identifier, in the format PMID-XXX, where XXX is the gene signature table, figure or supplementary file, e.g. Table3
Sig Name: Name of gene signature, in the format Tissue_AuthorYear_NumberofGenes_Description. Description is optional. E.g. Breast_Bertucci08_75genes
Sig Description: Description of gene signature, typically extracted from table or figure legend (free text)
File Associated: Name of tab-delimited gene signature file. Format is SigID.txt
URL: URL from where gene signature was downloaded
Column Mappings: Content of each column in gene signature file (selection from constrained list in Table 1b)

Gene Ontology (GO): an initiative to unify the representation of gene and gene product attributes across all species. It maintains a controlled vocabulary of gene and gene product attributes, provides annotations for genes and gene products, and provides tools for easy access to all aspects of the data provided by the project.

Cellular component: the parts of a cell or its extracellular environment
Molecular function: the elemental activities of a gene product at the molecular level, such as binding or catalysis
Biological process: operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units

Connects known information on molecular interaction networks. It contains: genes and proteins; biochemical compounds and reactions; pathways and complexes.

Molecular Signatures Database (MSigDB), Version 3.0 (September 2010): warehouse of 6769 annotated gene sets

Divided into 5 major collections:
– C1: positional gene sets for each human chromosome and cytogenetic band (326 gene sets)
– C2: curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts (3272 gene sets)
– C3: motif gene sets based on conserved cis-regulatory motifs (836 gene sets)
– C4: computational gene sets defined by mining large collections of cancer-oriented microarray data (881 gene sets)
– C5: GO gene sets consisting of genes annotated by the same GO terms (1454 gene sets)

Gene Set Analysis

Statistical Methods
Fisher’s Exact Test
EASE: the Expression Analysis Systematic Explorer
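To make the Fisher's exact test idea concrete, here is a minimal sketch of a one-sided over-representation test, written in Python using only the standard library. The function name, argument names and the toy numbers are assumptions made for illustration; they do not come from the slides. EASE, which is also listed above, is not reproduced here.

```python
from math import comb


def fisher_upper_tail(hits, list_size, set_size, universe_size):
    """One-sided Fisher's exact test p-value for over-representation.

    Probability of seeing at least `hits` gene-set members among `list_size`
    genes drawn without replacement from `universe_size` genes, of which
    `set_size` belong to the gene set (hypergeometric upper tail).
    """
    max_hits = min(list_size, set_size)
    total = comb(universe_size, list_size)
    favourable = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(max(hits, 0), max_hits + 1)
    )
    return favourable / total


# Hypothetical toy numbers: 20 genes on the array, 5 in the gene set,
# 5 genes called differentially expressed, 3 of them in the gene set.
p_value = fisher_upper_tail(hits=3, list_size=5, set_size=5, universe_size=20)
print(f"over-representation p-value: {p_value:.4f}")
```

A small p-value suggests that the gene set contributes more genes to the differentially expressed list than chance alone would explain.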

Gene Set Analysis (GSA) using Gene Expression Data: shifts the analysis toward biology-driven approaches; uses functionally related groups of genes to analyze gene expression datasets; is more robust than single-gene analyses.

Competitive vs. Self-Contained Hypothesis. GSA methods differ in the definition of the null hypothesis. Self-contained tests consider only the expression of the genes within the gene set across the given samples. Competitive tests compare the differential expression of the gene set to that of either all genes or the complement of the gene set among the genes represented on the microarray.

Gene Set Enrichment Analysis (GSEA)
Mootha et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 2003, 34-3.
Subramanian et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 2005.
Oron et al. Gene set enrichment analysis using linear models and diagnostics. Bioinformatics, 2008.
Bioconductor Package: GSEAlm - Linear Model Toolset for Gene Set Enrichment Analysis

Aims of a Gene Set Enrichment Analysis: looking for up- or down-regulated gene sets between two tested classes; testing whether a gene set of interest is differentially regulated between two tested phenotypes.

Testing different Phenotypes

Sample: Disease Type
S1: Normal breast tissue
S2: Normal breast tissue
S3: Normal breast tissue
S4: Low grade cancer (Luminal A)
S5: Low grade cancer (Luminal A)
S6: Low grade cancer (Luminal A)
S7: High grade cancer (Basal)
S8: High grade cancer (Basal)
S9: High grade cancer (Basal)

Pair-wise Tests: Normal versus Low grade; Normal versus High grade; Low grade versus High grade
Combined Tests: Normal versus Low/High grade; Normal/Low grade versus High grade

[Figure labels: Gene Expression Data; Clinical Data]

Gene Set Enrichment Analysis (GSEA) I.) Ranking the genes according to differential expression, using a t-test or linear models
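The ranking step can be sketched as follows. This is an illustration only: it uses Welch's t-statistic as the ranking metric, and the function names, class labels and toy expression values are all hypothetical rather than taken from the presentation.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for group_a versus group_b (unequal variances)."""
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0.0:
        # Degenerate case: no variation in either group.
        return 0.0 if diff == 0.0 else float("inf") * (1 if diff > 0 else -1)
    return diff / se


def rank_genes(expression, labels, case, control):
    """Return (gene, t) pairs sorted from most up- to most down-regulated."""
    ranked = []
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == case]
        b = [v for v, lab in zip(values, labels) if lab == control]
        ranked.append((gene, welch_t(a, b)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: three normal and three tumor samples.
labels = ["normal"] * 3 + ["tumor"] * 3
expression = {
    "up": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    "flat": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],
    "down": [3.0, 3.1, 2.9, 1.0, 1.2, 0.9],
}
ranked = rank_genes(expression, labels, case="tumor", control="normal")
for gene, t in ranked:
    print(gene, round(t, 2))
```

A moderated statistic from a linear model could replace `welch_t` without changing the rest of the sketch.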

II.) Include gene set membership information

Enrichment Score (ES): reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L, increasing a running sum when a gene in S is encountered and decreasing it otherwise. The enrichment score is the maximum deviation from zero encountered in this random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic.
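A minimal sketch of this running-sum calculation is given below. It assumes the genes are already sorted by their ranking metric, and it weights each hit by the absolute metric raised to an exponent `p`, following the weighted Kolmogorov–Smirnov-like form described above. The function name, the parameter names and the toy data are illustrative assumptions.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Maximum deviation from zero of the weighted running sum.

    ranked_genes: gene identifiers, sorted by decreasing ranking metric.
    scores:       ranking metric for each gene, in the same order.
    gene_set:     collection of gene identifiers forming the set S.
    p:            weight exponent; p = 0 gives the unweighted KS-like walk.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    norm = sum(abs(s) ** p for s, hit in zip(scores, hits) if hit)
    if norm == 0.0:
        raise ValueError("all gene-set members have a zero ranking metric")

    running, best = 0.0, 0.0
    for s, hit in zip(scores, hits):
        if hit:
            running += abs(s) ** p / norm   # step up, weighted by the metric
        else:
            running -= 1.0 / n_miss         # step down by a constant amount
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical toy ranking: six genes with decreasing scores.
genes = ["A", "B", "C", "D", "E", "F"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, scores, {"A", "B"})
es_bottom = enrichment_score(genes, scores, {"E", "F"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list yields a positive score, and a set concentrated at the bottom yields a negative one.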

Gene Set Enrichment Analysis (GSEA)

Subramanian A et al. PNAS 2005;102:

Permutation test to estimate the significance. The significance of the ES has to be estimated. Two options: class label permutation versus gene label permutation. Calculate the ES of the gene set for the permuted data; repeating this generates a null distribution for the ES. The empirical, nominal P value of the observed ES is then calculated relative to this null distribution.
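The sketch below illustrates the gene label variant, which is the simpler of the two options: random gene sets of the same size are drawn from the ranked list to build the null distribution. Class label permutation would instead shuffle the sample labels and re-rank the genes each time. All names, the add-one correction and the toy scores are assumptions made for this example, and the ES weight exponent is fixed at 1.

```python
import random


def enrichment_score(ranked_scores, in_set):
    """Weighted running-sum ES for a ranked list (weight exponent fixed at 1)."""
    norm = sum(abs(s) for s, hit in zip(ranked_scores, in_set) if hit) or 1.0
    n_miss = in_set.count(False)
    running = best = 0.0
    for s, hit in zip(ranked_scores, in_set):
        running += abs(s) / norm if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def gene_permutation_pvalue(ranked_scores, set_positions, n_perm=1000, seed=0):
    """Observed ES and its empirical p-value under gene label permutation."""
    rng = random.Random(seed)
    n = len(ranked_scores)

    def es_for(positions):
        chosen = set(positions)
        return enrichment_score(ranked_scores, [i in chosen for i in range(n)])

    observed = es_for(set_positions)
    null = [es_for(rng.sample(range(n), len(set_positions))) for _ in range(n_perm)]
    # Compare against the tail of the null that matches the sign of the ES.
    if observed >= 0:
        extreme = sum(es >= observed for es in null)
    else:
        extreme = sum(es <= observed for es in null)
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical toy ranking: 20 genes, gene set = the three top-ranked genes.
scores = [10 - i for i in range(20)]
observed_es, p_value = gene_permutation_pvalue(scores, [0, 1, 2])
print(observed_es, p_value)
```

The number of permutations bounds the smallest p-value that can be reported, so `n_perm` should be chosen with the later multiple-testing correction in mind.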

Gene Set Enrichment Analysis (GSEA)

Interpretation of the results

Correction for Multiple Testing. When an entire database of gene sets is evaluated, we have to adjust the estimated significance level to account for multiple hypothesis testing. Control for the false discovery rate (FDR): the FDR is the estimated probability that a set with a given ES represents a false positive finding.
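As an illustration of FDR control across many gene sets, the sketch below applies the Benjamini-Hochberg adjustment to a list of nominal p-values. This is a substitute technique chosen for brevity: GSEA itself estimates the FDR from normalized enrichment scores, which is not reproduced here. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical nominal p-values for four gene sets.
nominal = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(nominal)
for raw, adj in zip(nominal, adjusted):
    print(raw, round(adj, 4))
```

Gene sets whose adjusted value falls below the chosen FDR threshold would be reported as significant.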

Interpretation of the Results

Tutorial