Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.

Slides:



Advertisements
Similar presentations
Parsimony Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Parsimony Small Parsimony and Search Algorithms Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Microarray Data Analysis Day 2
1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
D ISCOVERING REGULATORY AND SIGNALLING CIRCUITS IN MOLECULAR INTERACTION NETWORK Ideker Bioinformatics 2002 Presented by: Omrit Zemach April Seminar.
Clustering approaches for high- throughput data Sushmita Roy BMI/CS 576 Nov 12 th, 2013.
Gene Ontology John Pinney
Data mining with the Gene Ontology Josep Lluís Mosquera April 2005 Grup de Recerca en Estadística i Bioinformàtica GOing into Biological Meaning.
Gene function analysis Stem Cell Network Microarray Course, Unit 5 May 2007.
Introduction to Functional Analysis J.L. Mosquera and Alex Sanchez.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Gene ontology & hypergeometric test Simon Rasmussen CBS - DTU.
1 Using Gene Ontology. 2 Assigning (or Hypothesizing About) Biological Meaning to Clusters What do you want to be able to to? –Identify over-represented.
COG and GO tutorial.
Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Biological Interpretation of Microarray Data Helen Lockstone DTC Bioinformatics Course 9 th February 2010.
Demonstration Trupti Joshi Computer Science Department 317 Engineering Building North (O)
Internet tools for genomic analysis: part 2
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
Protein and Function Databases
Analysis of GO annotation at cluster level by H. Bjørn Nielsen Slides from Agnieszka S. Juncker.
 2 Outline  Review of major computational approaches to facilitate biological interpretation of  high-throughput microarray  and RNA-Seq experiments.
Systematic Analysis of Interactome: A New Trend in Bioinformatics KOCSEA Technical Symposium 2010 Young-Rae Cho, Ph.D. Assistant Professor Department of.
1Module 2: Analyzing Gene Lists Canadian Bioinformatics Workshops
Daniel Rico, PhD. Daniel Rico, PhD. ::: Introduction to Functional Analysis Course on Functional Analysis Bioinformatics Unit.
Automatic methods for functional annotation of sequences Petri Törönen.
Getting the story – biological model based on microarray data Once the differentially expressed genes are identified (sometimes hundreds of them), we need.
Gene Set Enrichment Analysis (GSEA)
A systems biology approach to the identification and analysis of transcriptional regulatory networks in osteocytes Angela K. Dean, Stephen E. Harris, Jianhua.
EGAN: Exploratory Gene Association Networks by Jesse Paquette Biostatistics and Computational Biology Core Helen Diller Family Comprehensive Cancer Center.
The aims of the Gene Ontology project are threefold: - to compile vocabularies to describe components, functions and processes - to produce tools to query.
Ontology and Phylogeny: Ontologies as research tools linking phylogenies, systematics, phenotypes, and genomics Brent D. Mishler University of California,
Suppose we have analyzed total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified - call them significant)
GENE ONTOLOGY FOR THE NEWBIES Suparna Mundodi, PhD The Arabidopsis Information Resources, Stanford, CA.
Gene expression analysis
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Mining Biological Data. Protein Enzymatic ProteinsTransport ProteinsRegulatory Proteins Storage ProteinsHormonal ProteinsReceptor Proteins.
Functional Annotation and Functional Enrichment. Annotation Structural Annotation – defining the boundaries of features of interest (coding regions, regulatory.
Analysis of GO annotation at cluster level by Agnieszka S. Juncker.
Statistical Testing with Genes Saurabh Sinha CS 466.
Gene set analyses of genomic datasets Andreas Schlicker Jelle ten Hoeve Lodewyk Wessels.
1 ArrayTrack Demonstration National Center for Toxicological Research U.S. Food and Drug Administration 3900 NCTR Road, Jefferson, AR
Scope of the Gene Ontology Vocabularies. Compile structured vocabularies describing aspects of molecular biology Describe gene products using vocabulary.
341- INTRODUCTION TO BIOINFORMATICS Overview of the Course Material 1.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Flat clustering approaches
Getting the story – biological model based on microarray data Once the differentially expressed genes are identified (sometimes hundreds of them), we need.
GO enrichment and GOrilla
Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation Bioinformatics, July 2003 P.W.Load,
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Tools in Bioinformatics Ontologies and pathways. Why are ontologies needed? A free text is the best way to describe what a protein does to a human reader.
A Quantitative Overview to Gene Expression Profiling in Animal Genetics Armidale Animal Breeding Summer Course, UNE, Feb Analysis of (cDNA) Microarray.
Gene Set Analysis using R and Bioconductor Daniel Gusenleitner
2/3/2005 Gene Ontology (GO) The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions.
Canadian Bioinformatics Workshops
The TDR Targets Database Prioritizing potential drug targets in complete genomes.
Gene Annotation & Gene Ontology May 24, Gene lists from RNAseq analysis What do you do with a list of 100s of genes that contain only the following.
Module 2: Analyzing gene lists: over-representation analysis
Clustering Manpreet S. Katari.
Statistical Testing with Genes
Genome Annotation Continued
Analysis of GO annotation at cluster level by Agnieszka S. Juncker
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Statistical Testing with Genes
Presentation transcript:

Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

 The parsimony principle:  Find the tree that requires the fewest evolutionary changes!  A fundamentally different method:  Search rather than reconstruct  Parsimony algorithm 1.Construct all possible trees 2.For each site in the alignment and for each tree count the minimal number of changes required 3.Add sites to obtain the total number of changes required for each tree 4.Pick the tree with the lowest score A quick review

 Small vs. large parsimony  Fitch’s algorithm: 1.Bottom-up phase: Determine the set of possible states 2.Top-down phase: Pick a state for each internal node  Searching the tree space:  Exhaustive search, branch and bound  Hill climbing with Nearest-Neighbor Interchange  Branch confidence and bootstrap support A quick review – cont’

From sequence to function Which molecular processes/functions are involved in a certain phenotype - disease, response, development, etc. (what is the cell doing vs. what it could possibly do) Gene expression profiling

 Measuring gene expression:  (Northern blots and RT-qPCR)  Microarray  RNA-Seq  Experimental conditions:  Disease vs. control  Across tissues  Across time  Across environments  Many more … Gene expression profiling

Different techniques, same structure “conditions” “genes”

1.Find the set of differentially expressed genes. 2.Survey the literature to obtain insights about the functions that differentially expressed genes are involved in. 3.Group together genes with similar functions. 4.Identify functional categories with many differentially expressed genes. Conclude that these functions are important in disease/condition under study Back in the good old days …

Time-consuming Not systematic Extremely subjective No statistical validation The good old days were not so good!

What do we need?  A shared functional vocabulary  Systematic linkage between genes and functions  A way to identify genes relevant to the condition under study  Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study)  A way to identify “related” genes

What do we need?  A shared functional vocabulary  Systematic linkage between genes and functions  A way to identify genes relevant to the condition under study  Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study)  A way to identify “related” genes Gene Ontology Annotation Fold change, Ranking, ANOVA Clustering, classification Enrichment analysis, GSEA

 A major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases.  Three goals: 1.Maintain and further develop its controlled vocabulary of gene and gene product attributes 2.Annotate genes and gene products, and assimilate and disseminate annotation data 3.Provide tools to facilitate access to all aspects of the data provided by the Gene Ontology project The Gene Ontology (GO) Project

 The Gene Ontology (GO) is a controlled vocabulary, a set of standard terms (words and phrases) used for indexing and retrieving information. GO terms

 GO also defines the relationships between the terms, making it a structured vocabulary.  GO is structured as a directed acyclic graph, and each term has defined relationships to one or more other terms. Ontology structure

 Three ontology domains: 1.Molecular function: basic activity or task e.g. catalytic activity, calcium ion binding 2.Biological process: broad objective or goal e.g. signal transduction, immune response 3.Cellular component: location or complex e.g. nucleus, mitochondrion  Genes can have multiple annotations: For example, the gene product cytochrome c can be described by the molecular function term oxidoreductase activity, the biological process termsoxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane. GO domains

Go domains Molecular function Biological process Cellular component

Ontology and annotation databases Clusters of Orthologous Groups (COG) eggNOG “The nice thing about standards is that there are so many to choose from” Andrew S. Tanenbaum

 A shared functional vocabulary  Systematic linkage between genes and functions  A way to identify genes relevant to the condition under study  Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study)  A way to identify “related” genes What do we need?  A shared functional vocabulary  Systematic linkage between genes and functions  A way to identify genes relevant to the condition under study GO annotation

Picking “relevant” genes  In most cases, we will consider differential expression as a marker:  Fold change cutoff (e.g., > two fold change)  Fold change rank (e.g., top 10%)  Significant differential expression (e.g., ANOVA) (don’t forget to correct for multiple testing, e.g., Bonferroni or FDR) Gene study set

Enrichment analysis Functional category # of genes in the study set % Signaling Metabolism Others Trans factors Transporters Proteases Protein synthesis Adhesion Oxidation Cell structure Secretion6 2.0 Detoxification6 2.0 Signalling category contains 27.6% of all genes in the study set - by far the largest category. Reasonable to conclude that signaling may be important in the condition under study

Enrichment analysis – the wrong way Functional category # of genes in the study set % Signaling Metabolism Others Trans factors Transporters Proteases Protein synthesis Adhesion Oxidation Cell structure Secretion6 2.0 Detoxification6 2.0 Signaling category contains 27.6% of all genes in the study set - by far the largest category. Reasonable to conclude that signaling may be important in the condition under study

 What if ~27% of the genes on the array are involved in signaling?  The number of signaling genes in the set is what expected by chance.  We need to consider not only the number of genes in the set for each category, but also the total number on the array.  We want to know which category is over-represented (occurs more times than expected by chance). Enrichment analysis – the wrong way Functional category # of genes in the study set % on array Signaling %26% Metabolism %15% Others %11% Trans factors28 9.4%10% Transporters26 8.8%2% Proteases20 6.7%7% Protein synthesis19 6.4%7% Adhesion16 5.4%6% Oxidation13 4.4%4% Cell structure10 3.4%8% Secretion6 2.0%2% Detoxification6 2.0%2%

Enrichment analysis – the right way Say, the microarray contains 50 genes, 10 of which are annotated as ‘signaling’. Your expression analysis reveals 8 differentially expressed genes, 4 of which are annotated as ‘signaling’. Is this significant? A statistical test, based on a null model Assume the study set has nothing to do with the specific function at hand and was selected randomly, would we be surprised to see this number of genes annotated with this function in the study set? The “urn” version: You pick a ranndon set of 8 balls from an urn that contains 50 balls: 40 white and 10 blue. How surprised will you be to find that 4 of the balls you picked are blue?

A quick review: Modified Fisher's exact test Differentially expressed (DE) genes/balls 4 out of 810 out of 50 2 out of 8 4 out of 81 out of 82 out of 85 out of 83 out of 8 Null model: the 8 genes/balls are selected randomly … So, if you have 50 balls, 10 of them are blue, and you pick 8 balls randomly, what is the probability that k of them are blue? Do I have a surprisingly high number of blue genes? Genes/balls

A quick review: Modified Fisher's exact test m=50, m t =10, n=8 Hypergeometric distribution So … do I have a surprisingly high number of blue genes? What is the probability of getting at least 4 blue genes in the null model? P(σ t >=4) Probability k

Modified Fisher's Exact Test  Let m denote the total number of genes in the array and n the number of genes in the study set.  Let m t denote the total number of genes annotated with function t and n t the number of genes in the study set annotated with this function.  We are interested in knowing the probability of seeing n t or more annotated genes! (This is equivalent to a one-sided Fisher exact test)

 A shared functional vocabulary  Systematic linkage between genes and functions  A way to identify genes relevant to the condition under study  Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study)  A way to identify “related” genes So … what do we have so far?  A shared functional vocabulary  Systematic linkage between genes and functions  A way to identify genes relevant to the condition under study  Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study)

Still far from being perfect!  A shared functional vocabulary  Systematic linkage between genes and functions  A way to identify genes relevant to the condition under study  Statistical analysis (combining all of the above to identify cellular functions that contributed to the disease or condition under study)  A way to identify “related” genes Arbitrary! Considers only a few genes Simplistic null model! Ignores links between GO categories Limited hypotheses