Protein function and classification www.ebi.ac.uk/interpro Hsin-Yu Chang www.ebi.ac.uk.

Presentation transcript:

Protein function and classification Hsin-Yu Chang

Greider and Blackburn discovered telomerase in 1984 and were later awarded the Nobel Prize. Which model organism did they use for this study? 1. Tetrahymena 2. Saccharomyces cerevisiae 3. Mouse 4. Human

A single Tetrahymena cell has 40,000 telomeres, whereas a human cell has far fewer.
Timeline (Gilson and Ségal-Bendirdjian, Biochimie, 2010):
Discovery of telomerase, Greider and Blackburn
1989: Telomere hypothesis of cell senescence, Szostak
1995: Clone hTR
1995/1997: Clone hTERT
1997: Telomerase knockout mouse
1998: Ectopic expression of telomerase in normal fibroblasts and epithelial cells bypasses the Hayflick limit
1999/2000…: Telomerase/telomere dysfunctions and cancer

Therefore, protein classification could help scientists to gain information about protein functions.

In the lab, what do we usually do to analyse protein sequences and find out their functions?

What I used to do:
Protein BLAST
Publications: textbooks or papers
UniProt
PDB
Specialized protein databases such as SGD, the Human Protein Atlas, etc.

BLAST it?
Advantages: relatively fast; user friendly; very good at recognising similarity between closely related sequences.
Drawbacks: sometimes struggles with multi-domain proteins; less useful for weakly similar sequences (e.g., divergent homologues).

Using BLAST to find clues to protein function: when it goes well

Pairwise alignment of two proteins: CD4 from two closely related species

Using BLAST to find clues to protein function: when it does not give you much information

Because BLAST performs local pairwise alignment, it cannot encode the information found in a multiple sequence alignment, which shows you conserved sites.

60S acidic ribosomal protein P0: multiple sequence alignment. Using pairwise alignment alone could miss conserved residues.
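A minimal Python sketch of the point above: identities seen in one pairwise comparison are not necessarily conserved across the whole family. The toy alignment reuses the four example pattern sequences quoted on the Patterns slide (it is not the ribosomal protein P0 alignment), and the function names are made up for illustration.

```python
from collections import Counter

def conserved_columns(alignment, threshold=1.0):
    """Return (column, residue, fraction) for columns whose most common
    residue is present in at least `threshold` of the sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    hits = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        residue, count = Counter(column).most_common(1)[0]
        fraction = count / len(alignment)
        if residue != "-" and fraction >= threshold:
            hits.append((i, residue, fraction))
    return hits

def pairwise_identities(a, b):
    """Return the columns at which two aligned sequences are identical."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == y and x != "-"]

# Toy alignment: the example sequences from the Patterns slide.
alignment = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]

print(pairwise_identities(alignment[0], alignment[1]))  # [0, 2]
print(conserved_columns(alignment))                     # [(2, 'V', 1.0)]
```

The first two sequences agree at columns 0 and 2, but only column 2 (the valine) is conserved in every sequence; a single pairwise comparison cannot tell these two cases apart.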

An alternative approach: protein signature search
Model the pattern of conserved amino acids at specific positions within a multiple sequence alignment.
Use these models to infer relationships with the characterised sequences (from which the alignment was constructed).
This is the approach taken by protein signature databases.

Three different protein signature approaches: patterns (single motif methods); fingerprints (multiple motif methods); profiles and hidden Markov models (HMMs) (full alignment methods).

Patterns
A motif is taken from the sequence alignment and written as a regular expression, giving a pattern signature (e.g. PS00000): [AC]-x-V-x(4)-{ED}
Pattern sequences: ALVKLISG, AIVHESAT, CHVRDLSC, CPVESTIS
Patterns are usually directed against functional sequence features such as active sites, binding sites, etc.

Patterns
Advantages: can anchor the match to the extremity of a sequence, e.g. <M-R-[DE]-x(2,4)-[ALT]-{AM}; strict, so a pattern with very little variability and forbidden residues can produce highly accurate matches.
Drawbacks: simple but less flexible.
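The two patterns quoted on these slides can be tried out by translating them into ordinary regular expressions. The sketch below is a minimal converter, assuming only the syntax elements that appear in the slides (single residues, `x`, `[...]`, `{...}`, repeat counts, and the `<` / `>` anchors); the function name `prosite_to_regex` and the two test sequences at the end are hypothetical, and this is not the scanning code used by any signature database.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count such as (4) or (2,4).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    cleaned = pattern.replace("\u2013", "-").replace(" ", "")
    prefix = suffix = ""
    if cleaned.startswith("<"):      # anchor to the N-terminus
        prefix, cleaned = "^", cleaned[1:]
    if cleaned.endswith(">"):        # anchor to the C-terminus
        suffix, cleaned = "$", cleaned[:-1]
    parts = []
    for element in cleaned.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":              # any residue
            core = "."
        elif core.startswith("{"):   # forbidden residues
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return prefix + "".join(parts) + suffix

regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)  # [AC].V.{4}[^ED]
for seq in ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]:
    print(seq, bool(re.search(regex, seq)))  # True for all four

anchored = prosite_to_regex("<M-R-[DE]-x(2,4)-[ALT]-{AM}")
print(anchored)  # ^MR[DE].{2,4}[ALT][^AM]
# Hypothetical test sequences: the motif at the start, then shifted by one residue.
print(bool(re.search(anchored, "MRDKKAG")), bool(re.search(anchored, "GMRDKKAG")))  # True False
```

The second pair of results shows the anchoring advantage: the same motif is rejected as soon as it no longer sits at the start of the sequence.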

Fingerprints: a multiple motif approach. Starting from a sequence alignment, define motifs (motif 1, motif 2, motif 3); the motif sequences are turned into weight matrices, which together form the fingerprint signature (e.g. PR00000).

The significance of motif context (order and interval)
Identify small conserved regions in proteins.
Several motifs characterise a family.
Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours.

Fingerprints
Good at modelling the often small differences between closely related proteins.
Distinguish individual subfamilies within protein families, allowing functional characterisation of sequences at a high level of specificity.
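To make the role of motif order concrete, here is a small Python sketch of a fingerprint-style search. It is a simplification under stated assumptions: each motif is reduced to a plain residue-frequency matrix, motifs are located greedily from left to right, and the motifs, sequences and the 0.6 score cut-off are all invented for illustration. It is not the PRINTS scanning algorithm.

```python
def frequency_matrix(motif_seqs):
    """Per-position residue frequencies for equal-length motif sequences."""
    n = len(motif_seqs)
    matrix = []
    for i in range(len(motif_seqs[0])):
        counts = {}
        for seq in motif_seqs:
            counts[seq[i]] = counts.get(seq[i], 0) + 1
        matrix.append({aa: c / n for aa, c in counts.items()})
    return matrix

def best_hit(matrix, sequence, start=0):
    """Return (score, position) of the best-scoring window at or after `start`."""
    width = len(matrix)
    best = (0.0, -1)
    for pos in range(start, len(sequence) - width + 1):
        window = sequence[pos:pos + width]
        score = sum(col.get(aa, 0.0) for col, aa in zip(matrix, window)) / width
        if score > best[0]:
            best = (score, pos)
    return best

def match_fingerprint(matrices, sequence, min_score=0.6):
    """Require every motif to be found, in order; return positions or None."""
    positions, start = [], 0
    for matrix in matrices:
        score, pos = best_hit(matrix, sequence, start)
        if score < min_score:
            return None
        positions.append(pos)
        start = pos + len(matrix)  # the next motif must lie downstream
    return positions

# Hypothetical motifs 1, 2 and 3, each taken from three aligned sequences.
matrices = [
    frequency_matrix(["GDSG", "GDSG", "GESG"]),
    frequency_matrix(["HWAL", "HWVL", "HWAL"]),
    frequency_matrix(["CKND", "CKND", "CRND"]),
]

in_order = "MAGDSGTTTHWALPPPCKNDQ"   # motifs appear as 1, 2, 3
shuffled = "MACKNDTTTHWALPPPGDSGQ"   # same motifs, order 3, 2, 1

print(match_fingerprint(matrices, in_order))  # [2, 9, 16]
print(match_fingerprint(matrices, shuffled))  # None
```

Both sequences contain all three motifs, yet only the one with the expected order is accepted, which is the extra context a single-motif method cannot use.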

Profiles & HMMs: starting from a sequence alignment, define the coverage (an entire domain or the whole protein), use the entire alignment of the domain or protein family to build a model, and obtain a profile or HMM signature.

Profiles
Start with a multiple sequence alignment.
Amino acids at each position in the alignment are scored according to the frequency with which they occur.
Scores are weighted according to evolutionary distance using a BLOSUM matrix.
Good at identifying homologues.

HMMs
Start with a multiple sequence alignment.
Amino acid frequencies at each position in the alignment, and their transition probabilities, are encoded.
Insertions and deletions are also modelled.
Very good at identifying evolutionarily distant homologues.
Can model very divergent regions of an alignment.
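The frequency-based scoring described above can be sketched in a few lines of Python. Note the simplification: instead of BLOSUM weighting, this sketch uses a flat pseudocount and a uniform background, so it only illustrates the "score by observed frequency" step. The alignment reuses the example pattern sequences from the Patterns slide, the unrelated sequence is hypothetical, and gaps or non-standard residues are not handled.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Build a position-specific scoring matrix of log-odds scores.

    Simplification: a flat pseudocount and a uniform background stand in
    for the BLOSUM-based weighting mentioned in the slide.
    """
    n = len(alignment)
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for i in range(len(alignment[0])):
        column = [seq[i] for seq in alignment]
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / (n + pseudocount * len(AMINO_ACIDS))
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile

def score_sequence(profile, sequence):
    """Sum the per-position scores for an ungapped sequence of profile length."""
    return sum(column[aa] for column, aa in zip(profile, sequence))

alignment = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]
profile = build_profile(alignment)

member = score_sequence(profile, "ALVKLISG")     # a sequence from the alignment
unrelated = score_sequence(profile, "WWWWWWWW")  # hypothetical unrelated sequence
print(round(member, 2), round(unrelated, 2))
print(max(profile[2], key=profile[2].get))       # highest-scoring residue in column 3
```

A family member scores well above an unrelated sequence, and the fully conserved valine dominates its column; a real profile would additionally give partial credit to residues that substitute well for the observed ones.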
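As a rough illustration of what an HMM encodes on top of a profile, the Python sketch below reads match, insert and delete states off a gapped alignment and counts the transitions between them. Everything here is an assumption for illustration: the gapped alignment is a hypothetical variant of the Patterns example sequences, columns with at least 50% gaps are treated as insert columns, no pseudocounts are used, and transition counts are pooled over all positions, whereas a real profile HMM keeps position-specific transitions.

```python
from collections import Counter

def match_columns(alignment, max_gap_fraction=0.5):
    """Columns with fewer gaps than the cut-off become match states."""
    n = len(alignment)
    return [i for i in range(len(alignment[0]))
            if sum(seq[i] == "-" for seq in alignment) / n < max_gap_fraction]

def state_path(seq, match_cols):
    """Translate one aligned sequence into Begin/Match/Insert/Delete/End states."""
    match_set = set(match_cols)
    path = ["B"]
    for i, residue in enumerate(seq):
        if i in match_set:
            path.append("M" if residue != "-" else "D")  # residue or deletion
        elif residue != "-":
            path.append("I")                             # residue in an insert column
    path.append("E")
    return path

def estimate_parameters(alignment):
    """Return match columns, pooled transition counts and match emissions."""
    cols = match_columns(alignment)
    transitions = Counter()
    for seq in alignment:
        path = state_path(seq, cols)
        transitions.update(zip(path, path[1:]))
    emissions = []
    for i in cols:
        residues = [seq[i] for seq in alignment if seq[i] != "-"]
        counts = Counter(residues)
        emissions.append({aa: c / len(residues) for aa, c in counts.items()})
    return cols, transitions, emissions

# Hypothetical gapped alignment: one insertion (column 3) and one deletion (column 6).
alignment = ["ALV-KLISG", "AIV-HESAT", "CHVGRDLSC", "CPV-ES-IS"]
cols, transitions, emissions = estimate_parameters(alignment)

print(cols)                       # [0, 1, 2, 4, 5, 6, 7, 8]
print(transitions[("M", "M")], transitions[("M", "I")], transitions[("M", "D")])  # 25 1 1
print(emissions[2])               # {'V': 1.0}
```

Dividing each count by the total number of transitions leaving the same state would give the transition probabilities; the insert and delete transitions are what let an HMM tolerate the length variation that a fixed-width pattern or profile cannot.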

Three different protein signature approaches: patterns (single motif methods); fingerprints (multiple motif methods); profiles and hidden Markov models (HMMs) (full alignment methods).

InterPro The aim of InterPro

What is InterPro?
InterPro is an integrated sequence analysis resource.
It combines predictive models (known as signatures) from different databases to provide functional analysis of protein sequences by classifying them into families and predicting domains and important sites.

Facts about InterPro
First release in partner databases
Forms part of the automated system that adds annotation to UniProtKB/TrEMBL
Provides matches to over 80% of UniProtKB
Source of >60 million Gene Ontology (GO) mappings to >17 million distinct UniProtKB sequences
50,000 unique visitors to the web site per month
>2 million sequences searched online per month, plus offline searches with the downloadable version of the software

Structural domains; functional annotation of families/domains; protein features (sites). Hidden Markov models; fingerprints; profiles; patterns; HAMAP.

InterPro signature integration process
Signatures are provided by member databases.
They are scanned against the UniProt database to see which sequences they match.
Curators manually inspect the matches before integrating the signatures into InterPro.
Signatures representing the same entity are integrated together.
Relationships between entries are traced, where possible.
Curators add literature-referenced abstracts, cross-references to other databases, and GO terms.

Search using protein sequences

Family

Type

InterPro entry types
Family: proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure.
Domain: distinct functional, structural or sequence units that may exist in a variety of biological contexts.
Repeats: short sequences typically repeated within a protein.
Sites: PTM, active site, binding site, conserved site.

Type Name Identifier Contributing signatures Description GO terms References

Type Name Identifier Contributing signatures Description References Relationships

InterPro family and domain relationships

Family relationships in InterPro: Interleukin-15/Interleukin-21 family; Interleukin-15 avian; Interleukin-15 fish; Interleukin-15 mammal

Relationships

InterPro relationships: domains. Protein kinase-like domain; Protein kinase catalytic domain; Serine/threonine kinase catalytic domain; Tyrosine kinase catalytic domain
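Parent/child relationships of this kind are easy to represent as a small tree. The Python sketch below uses the four domain names from the slide; the nesting (kinase-like domain above the catalytic domain, with the serine/threonine and tyrosine kinase domains as its children) is an assumption read off the slide order, and the helper names are made up for illustration.

```python
# Child -> parent links. The nesting is an assumption based on the order
# in which the entries are listed on the slide.
PARENT = {
    "Protein kinase catalytic domain": "Protein kinase-like domain",
    "Serine/threonine kinase catalytic domain": "Protein kinase catalytic domain",
    "Tyrosine kinase catalytic domain": "Protein kinase catalytic domain",
}

def lineage(entry):
    """Return the entry followed by its ancestors, most specific first."""
    chain = [entry]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def children(entry):
    """Return the direct, more specific children of an entry."""
    return sorted(child for child, parent in PARENT.items() if parent == entry)

print(lineage("Tyrosine kinase catalytic domain"))
print(children("Protein kinase catalytic domain"))
```

Walking up the lineage is what lets a match to a specific child entry also be reported as a match to its more general parents.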

A brief diversion into the Gene Ontology...

Gene Ontology
Allows cross-species and/or cross-database comparisons.
Unifies the representation of gene and gene product attributes across species.

The Gene Ontology
A way to capture biological knowledge in a written and computable form.
A set of concepts and their relationships to each other, arranged as a hierarchy from less specific concepts to more specific concepts.

The Concepts in GO
1. Molecular Function, e.g. protein kinase activity, insulin receptor activity
2. Biological Process, e.g. cell cycle, microtubule cytoskeleton organisation
3. Cellular Component

GO: Immune response GO: membrane

Summary
InterPro is a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites.
It uses protein signatures based on different methodologies from different member databases.
Its member databases all have their particular niche or focus, but InterPro offers a combination of all their areas of expertise!

Why use InterPro?
Large amounts of manually curated data: 35,634 signatures integrated into 25,214 entries; cites 38,877 PubMed publications.
Large coverage of protein sequence space.
Regularly updated: ~8 week release schedule; new signatures added; scanned against the latest version of UniProtKB.

Caution
InterPro is a predictive protein signature database: results are predictions, and should be treated as such.
InterPro entries are based on signatures supplied to us by our member databases. This means no signature, no entry!
And one more thing… We need your feedback: missing or additional references, problem reports, requests. EBI support page.

The InterPro Team: Amaia Sangrador Craig McAnulla Matthew Fraser Maxim Scheremetjew Siew-Yit Yong Alex Mitchell Sebastien Pesseat Sarah Hunter Gift Nuka Hsin-Yu Chang Louise Daugherty

Database | Basis | Institution | Built from | Focus | URL
Pfam | HMM | Sanger Institute | Sequence alignment | Family & Domain based on conserved sequence |
Gene3D | HMM | UCL | Structure alignment | Structural Domain | c.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial Functional Family Classification | arch/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification |
PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | www/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | r.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & Profiles | SIB | Sequence alignment | Functional annotation |
HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | ap/
ProDom | Sequence clustering | PRABI: Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | m/current/html/home.php

Thank you! Facebook: EMBLEBI YouTube: EMBLMedia