Automatic methods for functional annotation of sequences Petri Törönen.



What, why, how?
- Functional annotation of a sequence (seq.)
  - Definition of the description line
  - Mapping the seq. to functional categories
- Simple solutions are error-prone
- Some of the available tools are reviewed in the exercises

Old, simple way
- Do a sequence search (SS), such as BLAST, with your sequence
- Find the best match
- Transfer all the information from the best match to your sequence
- Everything done? Finished?

Problems
- The first hit is an unknown seq.
- The first hit is a misannotated seq. (an increasing problem!)
- No significant matches are found
- Strong but only local matches => impurities in the search
- Impurities in the query seq.

Why is manual analysis hard?
- Large size of the gene lists (the SS result list)
- False positives among the observed results

Why is manual analysis hard?
- Each gene can have multiple functions, so the important common theme among the genes can easily go unnoticed
- Requires detailed knowledge of the genes
- Varying representations of the same function in the description lines
- Objectivity

Gene Ontology (GO)
- A controlled vocabulary of gene product roles in cells and of the associations between the roles
- The roles can be applied to all organisms
- Three main hierarchies: biological process, cellular component and molecular function
- Currently includes about 19,000 classes (= roles)
  - Usually only a small portion of these classes is in use for any one organism (example: chloroplast-related functions are important only in plants)

Structure of GO
- GO graph: a hierarchical structure of linked nodes
  - Each node represents one class that is part of its parent class
- Directed acyclic graph (DAG)
  - A tree-like structure where branches can also merge when going from parent nodes to child nodes
- Genes can be linked to many classes in the GO structure
- Figure labels: starting node (root of the hierarchical structure); more detailed classes; less detailed classes
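The DAG idea can be illustrated with a small sketch. The class names and parent links below are hypothetical, made up for illustration, and are not real GO identifiers. The point is only that one node can have two parents, and that a gene linked to a detailed class is implicitly linked to all of its less detailed ancestors.

```python
# Toy GO-like DAG: each class maps to its parent classes.
# All class names here are hypothetical, not real GO terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic activity": ["root"],
    "nucleotide binding": ["binding"],
    "kinase activity": ["catalytic activity"],
    # A node with two parents: this is where two branches merge.
    "ATP-dependent kinase activity": ["kinase activity", "nucleotide binding"],
}


def ancestors(go_class, parents=PARENTS):
    """Return every class reachable by following parent links upwards."""
    seen = set()
    stack = list(parents[go_class])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def propagate(direct_classes, parents=PARENTS):
    """Expand a gene's direct classes with all of their ancestor classes."""
    full = set(direct_classes)
    for go_class in direct_classes:
        full |= ancestors(go_class, parents)
    return full


if __name__ == "__main__":
    print(sorted(propagate(["ATP-dependent kinase activity"])))
```

Propagating annotations upwards like this is a common preprocessing step before counting class members, because a gene in a detailed class should also count for its parent classes.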

How GO helps
- GO provides a terminology for presenting the known information about a gene
- GO classifies genes according to their known or predicted functions
- Classes represent varying levels of detail
- The classifications can be used to find over-represented functions in the results

How GO helps
- Look for over-represented GO classes in the gene list
- We would like to ask: what is the probability of observing, purely by chance, the number of class members that we have in the cluster?
- The solution from statistics is sampling without replacement
- Sampling without replacement answers questions such as: in how many ways can 8 balls be selected from the whole data so that two of them are white and the rest are black?
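Sampling without replacement leads to the hypergeometric distribution. The sketch below is a minimal illustration using only the Python standard library. The parameter names (N genes in total, K of them in the GO class, n genes in the list, k class members observed in the list) and the example numbers are assumptions chosen for illustration; they do not come from the slides.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(exactly k class members in a list of n genes drawn without
    replacement from N genes, K of which belong to the class)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_p_value(k, N, K, n):
    """P(k or more class members in the list): over-representation p-value."""
    upper = min(K, n)
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, upper + 1))


if __name__ == "__main__":
    # Ball analogy with assumed numbers: 20 balls, 5 of them white,
    # draw 8, and ask about getting exactly 2 white (and 6 black).
    ways = comb(5, 2) * comb(15, 6)          # number of such selections
    print(ways, hypergeom_pmf(2, N=20, K=5, n=8))
    print(enrichment_p_value(2, N=20, K=5, n=8))
```

The count of selections in the ball analogy is the numerator of the probability; dividing by the number of all possible selections of 8 balls gives the probability itself. For over-representation, the tail sum ("k or more") is usually the quantity of interest rather than the single-point probability.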

Methods that predict protein function
- Methods that summarize the SS result list
- Methods that use profile searches
- Methods that use sequence features
- Methods based on sequence patterns
- Methods based on sequence phylogeny

SS list summarization
- Consensus analysis of the SS list
  - Do the SS
  - Look for repeatedly occurring descriptions / GO classes
  - Over-representation of GO classes (BLAST2GO)
- Tools performing this:
  - Our method PANNZER (Koskinen et al. unpubl.)
  - BLAST2GO
  - ConFunc
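The consensus idea can be sketched as a simple count of how often each GO class recurs among the hits. This is only an illustration under assumed inputs: the data layout (pairs of hit identifier and GO classes) and the `min_fraction` threshold are hypothetical, and the sketch does not reproduce the scoring used by PANNZER, BLAST2GO or ConFunc.

```python
from collections import Counter


def consensus_classes(hits, min_fraction=0.5):
    """Return (GO class, fraction of hits) pairs for classes annotated to
    at least min_fraction of the hits, most frequent first.

    hits: list of (hit_id, iterable_of_go_classes) pairs from a sequence search.
    """
    if not hits:
        return []
    counts = Counter()
    for _hit_id, go_classes in hits:
        counts.update(set(go_classes))  # count each class once per hit
    n = len(hits)
    return [(go, c / n) for go, c in counts.most_common() if c / n >= min_fraction]


if __name__ == "__main__":
    # Hypothetical search result: hit identifiers and class labels are made up.
    example_hits = [
        ("hit1", {"A", "B"}),
        ("hit2", {"A"}),
        ("hit3", {"A", "C"}),
        ("hit4", {"B"}),
    ]
    print(consensus_classes(example_hits))
```

A plain frequency count ignores how common each class is in the whole database, which is why over-representation statistics are usually preferred over raw counts for the final ranking.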

Profile search methods
- Use profile searches instead of a plain SS
- Some positions in the seq. are more conserved than others
- PFAM
- ConFunc

ConFunc in detail
- BLAST search with the query seq.
- Obtain a result list
- The seqs in the result list are clustered into groups of seqs with similar function (same GO classes)
- Each cluster is used as a seed for a profile search
- Test how well the query seq. matches each profile
- Use link:
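The steps above can be mimicked with a toy sketch. It is not ConFunc's implementation: the real method runs a profile search seeded by each cluster, whereas this sketch assumes the hit sequences are already aligned and of equal length, and replaces the profile search with simple per-column residue frequencies. All function names and example sequences are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_by_go(hits):
    """Group hit sequences by GO class; a hit with several classes joins several clusters."""
    clusters = defaultdict(list)
    for seq, go_classes in hits:
        for go in go_classes:
            clusters[go].append(seq)
    return dict(clusters)


def column_profile(aligned_seqs):
    """Per-column residue frequencies for equal-length, already aligned sequences."""
    n = len(aligned_seqs)
    return [
        {res: count / n for res, count in Counter(column).items()}
        for column in zip(*aligned_seqs)
    ]


def profile_score(query, profile):
    """Mean frequency of the query's residue in each column (0 = none match, 1 = all match)."""
    return sum(col.get(res, 0.0) for res, col in zip(query, profile)) / len(profile)


def rank_classes(query, hits):
    """Score the query against the profile of every GO-class cluster, best first."""
    scores = {
        go: profile_score(query, column_profile(seqs))
        for go, seqs in cluster_by_go(hits).items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical aligned hits with made-up class labels.
    example_hits = [("ACDE", {"X"}), ("ACDF", {"X"}), ("WWWW", {"Y"})]
    print(rank_classes("ACDE", example_hits))
```

The design point carried over from the slide is that the query is compared against one profile per functional cluster rather than against individual hits, so a class supported by several consistent sequences can outrank a single strong but isolated hit.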

Sequence feature methods
- Look for sequence features
- Features: secondary structure, protein domains
- Compare sequences by looking at which features they have in common
- Methods that do this: FACT
- Limited search possibilities with FACT

Sequence pattern methods
- Pattern => a frequently observed short motif from a seq. DB
- InterProScan
- BioDictionary from IBM Computational Biology
  - Extraction of most of the patterns from Swiss-Prot
  - Linking of each pattern to the keywords seen in the seqs where the pattern was found
  - The query seq. is linked to keywords via the patterns it has
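The pattern-to-keyword linking can be sketched with regular expressions. The two patterns below are simplified illustrations loosely modeled on well-known motif types, and the keyword table is hypothetical; neither is taken from the BioDictionary or from InterProScan.

```python
import re
from collections import Counter

# Hypothetical pattern-to-keyword table (regex -> keywords).
# The patterns are simplified illustrations, not curated database entries.
PATTERN_KEYWORDS = {
    r"G.{4}GK[ST]": ["ATP-binding", "Nucleotide-binding"],
    r"C.{2}C.{12}H.{3}H": ["Zinc-finger", "DNA-binding"],
}


def keywords_for(query_seq, table=PATTERN_KEYWORDS):
    """Collect keyword votes from every pattern that occurs in the query sequence."""
    votes = Counter()
    for pattern, keywords in table.items():
        if re.search(pattern, query_seq):
            votes.update(keywords)
    return votes


if __name__ == "__main__":
    # Made-up query sequence containing the first motif.
    print(keywords_for("MKGAAAAGKSTLLN"))
```

With a large pattern collection, a keyword supported by several independent patterns receives more votes, which gives a rough confidence ordering for the suggested annotations.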

Phylogeny based methods
- In short: include the species tree in the annotation of the sequences
- Evolutionary distance is taken into account
- Compara from ENSEMBL

Tip for testing the tools
- For testing with a purely random sequence
- For testing with a partially random sequence
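Test sequences of both kinds can be generated with a few lines of code. The sketch below is one possible approach and rests on assumptions: residues are drawn uniformly from the 20 standard amino acids, and "partially random" is interpreted as replacing a chosen fraction of positions in a real sequence. The slides do not specify how the test sequences were produced.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_protein(length, rng=random):
    """Purely random sequence: every residue drawn uniformly (an assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def partially_randomize(seq, fraction, rng=random):
    """Replace a random `fraction` of positions with uniformly drawn residues.

    A replaced position may by chance receive its original residue again,
    so the number of visible changes can be slightly below the target.
    """
    n_replace = round(len(seq) * fraction)
    positions = rng.sample(range(len(seq)), n_replace)
    residues = list(seq)
    for pos in positions:
        residues[pos] = rng.choice(AMINO_ACIDS)
    return "".join(residues)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the test sequences are reproducible
    original = random_protein(60, rng)
    print(original)
    print(partially_randomize(original, 0.3, rng))
```

A purely random sequence should receive no confident annotation from a well-behaved tool, while a partially randomized real sequence shows how quickly the predictions degrade as the signal weakens.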