Asking translational research questions using ontology enrichment analysis Nigam Shah



High throughput data
“High throughput” is one of those fuzzy terms that is never really defined anywhere.
Genomics data is considered high throughput if you cannot “look” at your data to interpret it.
Generally speaking, this means about 1,000 or more genes and 20 or more samples.
There are about 40 different high throughput genomics data generation technologies.
DNA, mRNA, proteins, metabolites: all can be measured.

How do ontologies help?
An ontology provides an organizing framework for creating “abstractions” of the high throughput data.
The simplest ontologies (i.e. terminologies, controlled vocabularies) provide the most bang for the buck. The Gene Ontology (GO) is the prime example.
More structured ontologies, such as those that represent pathways and higher-order biological concepts, still have to demonstrate real utility.

Analyzing microarray data
Raw data
Preprocessing: spike normalization, flagging ‘bad’ spots, handling duplicates, filtering, transformations
Black box of analysis
Lists of “significantly changing” genes
End up: ‘story telling’

Gene Ontology to interpret microarray data

What is Gene Ontology?
An ontology is a specification of the concepts and relationships that can exist in a domain of discourse. (There are different ontologies for various purposes.)
The Gene Ontology (GO) project is an effort to provide consistent descriptions of gene products.
The project began as a collaboration between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD), and the Mouse Genome Database (MGD). Since then, the GO Consortium has grown to include most model organism databases.
GO creates terms for Biological Process (BP), Molecular Function (MF), and Cellular Component (CC).

Structure of GO relationships

Generic GO based analysis routine
1. Get annotations for each gene in the list.
2. Count the occurrence (x) of each annotation term.
3. Count (or look up) the occurrence (y) of that term in some background set (whole genome?).
4. Estimate how “surprising” it is to find x, given y.
5. Present the results visually.
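
The routine above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular tool: the function names, the `annotations` dictionary (gene to set of term labels) and the choice of the hypergeometric tail as the measure of "surprise" are assumptions made for the example.

```python
from collections import Counter
from math import comb


def hypergeom_tail(x, cluster_size, y, background_size):
    """P(X >= x) when cluster_size genes are drawn without replacement
    from background_size genes, y of which carry the term."""
    total = comb(background_size, cluster_size)
    hits = sum(
        comb(y, k) * comb(background_size - y, cluster_size - k)
        for k in range(x, min(cluster_size, y) + 1)
    )
    return hits / total


def enrich(cluster_genes, background_genes, annotations):
    """Return (term, x, y, p) tuples sorted by p-value.

    annotations is a hypothetical dict: gene -> set of term labels.
    """
    background = set(background_genes)
    cluster = set(cluster_genes) & background
    # y: occurrences of each term in the background set
    y_counts = Counter(t for g in background for t in annotations.get(g, ()))
    # x: occurrences of each term in the gene list
    x_counts = Counter(t for g in cluster for t in annotations.get(g, ()))
    rows = [
        (term, x, y_counts[term],
         hypergeom_tail(x, len(cluster), y_counts[term], len(background)))
        for term, x in x_counts.items()
    ]
    return sorted(rows, key=lambda row: row[3])
```

Calling `enrich(cluster, background, annotations)` returns one row per term seen in the list, most surprising first. The last step of the routine (visual presentation) is left out of the sketch.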

GO based analysis tools: time line
Khatri and Draghici, Bioinformatics, vol 21, no. 18, 2005, pg

Clench inputs
1. A list of ‘background genes’, one per line.
2. A list of ‘cluster genes’, one per line.
3. A FASTA format file containing the promoter sequences of the genes under study.
4. A tab delimited file containing the TF sites (consensus sequence) to search for in the promoters of genes.
5. A tab delimited file containing the expression data for the cluster genes.
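
To make inputs 3 and 4 concrete, here is a rough sketch of searching promoter sequences for TF consensus sites. The slides do not say how Clench performs the match, so everything here is an assumption for illustration: the function names, the use of IUPAC ambiguity codes in the consensus, and scanning the forward strand only.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def parse_fasta(text):
    """Parse FASTA text into a dict: sequence name -> upper-case sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def consensus_to_regex(consensus):
    """Compile a consensus into a regex; the lookahead reports overlapping hits."""
    body = "".join(IUPAC[c] for c in consensus.upper())
    return re.compile(f"(?=({body}))")


def find_sites(promoters, tf_sites):
    """Return dict: TF name -> {gene -> list of 0-based match positions}."""
    hits = {}
    for tf, consensus in tf_sites.items():
        pattern = consensus_to_regex(consensus)
        for gene, seq in promoters.items():
            positions = [m.start() for m in pattern.finditer(seq)]
            if positions:
                hits.setdefault(tf, {})[gene] = positions
    return hits
```

Here `tf_sites` stands in for the tab delimited TF file, already read into a dictionary of name to consensus. A fuller version would also scan the reverse complement.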

P-values and False Discovery Rates
Clench uses a theoretical distribution to estimate: “How surprising is it that n genes from my cluster are annotated as ‘yyyy’ when m genes are annotated as ‘yyyy’ in the background set?”
Clench uses the hypergeometric, chi-square and binomial distributions.
Clench performs simulations to estimate the False Discovery Rate (FDR) at a given p-value cutoff. If the FDR is too high, Clench reduces the p-value cutoff until the FDR is acceptable.
The FDR can also be reduced by using GO Slim.
[Diagram labelled M, N, m, n]
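
One way to picture the simulation step is sketched below. The slides do not describe Clench's actual procedure, so this is an assumed scheme: random clusters of the same size are drawn from the background, the terms that pass the cutoff by chance are counted, and that count is compared with the number of terms that pass for the real cluster. All function names, the halving factor and the default number of simulations are hypothetical.

```python
import random
from collections import Counter
from math import comb


def tail_p(n, cluster_size, m, background_size):
    """Hypergeometric P(X >= n): n annotated genes in the cluster,
    m annotated genes in the background."""
    total = comb(background_size, cluster_size)
    hits = sum(
        comb(m, k) * comb(background_size - m, cluster_size - k)
        for k in range(n, min(cluster_size, m) + 1)
    )
    return hits / total


def count_significant(cluster, background, annotations, cutoff):
    """Number of terms whose enrichment p-value is at or below the cutoff."""
    m_counts = Counter(t for g in background for t in annotations.get(g, ()))
    n_counts = Counter(t for g in cluster for t in annotations.get(g, ()))
    return sum(
        tail_p(n, len(cluster), m_counts[t], len(background)) <= cutoff
        for t, n in n_counts.items()
    )


def estimate_fdr(cluster, background, annotations, cutoff, n_sim=200, seed=0):
    """Average chance hits in random clusters divided by observed hits."""
    observed = count_significant(cluster, background, annotations, cutoff)
    if observed == 0:
        return 0.0  # nothing is called significant, so nothing can be false
    rng = random.Random(seed)
    pool = sorted(background)
    false_hits = 0
    for _ in range(n_sim):
        fake = rng.sample(pool, len(cluster))
        false_hits += count_significant(fake, background, annotations, cutoff)
    return min(1.0, (false_hits / n_sim) / observed)


def tighten_cutoff(cluster, background, annotations, cutoff, max_fdr,
                   factor=0.5, floor=1e-12, n_sim=200, seed=0):
    """Shrink the p-value cutoff until the estimated FDR is acceptable."""
    fdr = estimate_fdr(cluster, background, annotations, cutoff, n_sim, seed)
    while fdr > max_fdr and cutoff * factor >= floor:
        cutoff *= factor
        fdr = estimate_fdr(cluster, background, annotations, cutoff, n_sim, seed)
    return cutoff, fdr
```

The sketch assumes the cluster genes are unique and all belong to the background list.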

Results

DAG of GO terms
The graph shows relations between enriched GO terms.
Red: enriched terms.
Cyan: informative high-level terms with a large number of genes, but not statistically enriched.
White: non-informative terms (defined as an ‘ignore list’ by the user).

GO – TermFinder

Lots of assumptions!
1. That the GO categories are independent. Which they are not.
2. That statistically “surprising” is biologically meaningful.
3. That annotations are complete and accurate. There is a lot of annotation bias.
4. Multiple functions and context-dependent functions are ignored.
5. “Quality” of annotation is ignored.

Paper about the “null” assumption

Teasers and food for thought

What about the temporal dimension?
Overlay time course data onto the GO tree. See how the ‘enriched’ categories change over time.

What about 3D structure?

How about time and structure?

Side note: GO to analyze literature

How does the GO help?
If we explicitly articulate ‘what is known’ in an organizing framework, it serves as a reference for integrating new data with prior knowledge.
Such a framework allows formulation of more specific queries to the available data, which return more specific results and increase our ability to fit the results into the “big picture”.

The Gene Ontology provides “structure” to annotations

A bit more structure than GO…

“Functional” Grouping

… still more structure ? in

Between-ontology structure

Literature is the ultimate source of annotations … but it is unstructured!

Text mining for “interpreting” data
The goal is to analyze a body of text to find disproportionately high co-occurrences of known terms and gene names.
Or analyze a body of text and hope that the group of genes as a whole gets associated with a list of terms that identify themes about the genes.
[Figure: genes A to E (for example XPA and ERCC1) linked to labels such as mismatch repair, nucleotide excision repair, recombination, xeroderma pigmentosum and DNA repair]
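
The first approach can be illustrated with a small co-occurrence count. This is only a sketch under simple assumptions: each document is a plain string, a gene or term "occurs" if it matches as a whole phrase (case-insensitive), and "disproportionately high" is measured as the ratio of observed co-occurrences to the number expected if gene and term were independent. The function names are hypothetical.

```python
import re
from itertools import product


def mentions(text, phrase):
    """True if the phrase appears in the text as whole words, ignoring case."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def cooccurrence_table(docs, genes, terms):
    """Return (gene, term, observed, expected, ratio) rows, highest ratio first."""
    n_docs = len(docs)
    gene_docs = {g: {i for i, d in enumerate(docs) if mentions(d, g)} for g in genes}
    term_docs = {t: {i for i, d in enumerate(docs) if mentions(d, t)} for t in terms}
    rows = []
    for g, t in product(genes, terms):
        observed = len(gene_docs[g] & term_docs[t])
        # Expected count if gene and term were spread independently over documents.
        expected = len(gene_docs[g]) * len(term_docs[t]) / n_docs
        if observed:
            rows.append((g, t, observed, expected, observed / expected))
    return sorted(rows, key=lambda row: row[4], reverse=True)
```

For instance, with four made-up sentences in which XPA appears twice, both times alongside "nucleotide excision repair", the pair is observed twice against an expectation of one, giving a ratio of 2. A ratio on a handful of documents is not a significance test; a real analysis would need a proper statistic and far more text.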

Pathway analysis