Protein function and classification Hsin-Yu Chang www.ebi.ac.uk.

Classifying proteins into families and identifying protein homologues can help scientists to characterise unknown proteins.

Greider and Blackburn discovered telomerase in 1984 and were later awarded a Nobel Prize. Which model organism did they use for this study?
1. Tetrahymena thermophila
2. Saccharomyces cerevisiae
3. Mouse
4. Human

A single Tetrahymena thermophila cell has 40,000 telomeres, whereas a human cell has far fewer.
Timeline (Gilson and Ségal-Bendirdjian, Biochimie, 2010):
Discovery of telomerase (Greider and Blackburn)
1989: Telomere hypothesis of cell senescence (Szostak)
1995: Clone hTR
1995/1997: Clone hTERT
1997: Telomerase knockout mouse
1998: Ectopic expression of telomerase in normal human epithelial cells causes the extension of their lifespan
1999/2000…: Telomerase/telomere dysfunctions and cancer

Can we identify human telomerase from the Tetrahymena protein sequence?

Let’s pretend that human telomerase has not been identified and we only know the protein sequences of Tetrahymena telomerase. How can we find the human telomerase?

BLAST (Basic Local Alignment Search Tool): compares protein sequences to sequence databases and calculates the statistical significance of matches.

BLAST
Advantages: relatively fast; user friendly; very good at recognising similarity between closely related sequences.
Drawbacks: sometimes struggles with multi-domain proteins; less useful for weakly similar sequences (e.g., divergent homologues).

Using Tetrahymena telomerase protein sequences as a query in BLAST, you will find a few human proteins that have very low identity.

Tetrahymena telomerase and the putative human telomerase (AAC ) show a poor protein sequence match.

Can we presume this protein is a telomerase homologue from humans? Can we find more information about it before pursuing it further?

Search for protein signatures (such as domains) in AAC: Telomerase ribonucleoprotein complex - RNA binding domain; Reverse transcriptase domain.

Plan experiments and find out more! AAC shares 23% identity with Tetrahymena telomerase. It also contains the same domains as telomerase.
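The identity figure quoted above is a property of a pairwise alignment. As a rough sketch of how such a percentage can be computed, the snippet below counts identical columns over the full alignment length, gaps included. That denominator is one common convention and is an assumption here; the two aligned strings are made-up toy data, not the telomerase sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns in which both sequences carry the same residue.

    Both inputs must already be aligned (same length, '-' marks a gap).
    Gap columns count towards the denominator but never as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Toy alignment (hypothetical): 5 identical columns out of 7.
identity = percent_identity("HKIY-LQ", "HRIYALQ")
```

A value in the low twenties, as reported for the candidate here, is too weak to establish homology on its own, which is why the slides turn to domain signatures next.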

But, where can we search for information about the protein domains?

Protein databases that use signature approaches (figure): member databases grouped by focus (structural domains; functional annotation of families/domains; protein features (sites)) and by method (hidden Markov models, fingerprints, profiles, patterns), including HAMAP.

Construction of protein signatures
1. Construct a multiple sequence alignment (MSA) from characterised protein sequences.
2. Model the pattern of conserved amino acids at specific positions within the MSA.
3. Use these models to infer relationships with the characterised sequences.

Three different protein signature approaches (all start from a sequence alignment)
Patterns: single motif methods
Fingerprints: multiple motif methods
Profiles & Hidden Markov Models (HMMs): full alignment methods

Patterns

Sequence alignment, then motif, then pattern signature (PS00000).
Pattern sequences: ALVKLISG, AIVHESAT, CHVRDLSC, CPVESTIS
Regular expression: [AC]-x-V-x(4)-{ED}
Patterns are usually directed against functional sequence features such as active sites, binding sites, etc.

Tubulin signature (PDOC00199): [SAG]-G-G-T-G-[SA]-G, a conserved motif in tubulins.
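The pattern notation above maps closely onto ordinary regular expressions. The sketch below translates a small subset of the notation (single residues, `x` wildcards, `[..]` allowed sets, `{..}` excluded sets and fixed repeat counts such as `x(4)`) into a Python regex. The function name `prosite_to_regex` is hypothetical, and the full PROSITE syntax has further features (ranges, anchors) that are not handled here.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    element_re = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?")
    pieces = []
    for element in pattern.replace(" ", "").split("-"):
        match = element_re.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, count = match.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any residue except those listed
        # Single letters and [..] sets are already valid regex syntax.
        pieces.append(core + ("{" + count + "}" if count else ""))
    return "".join(pieces)


# The example pattern and the four pattern sequences from the slide.
example = re.compile(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))
hits = [s for s in ("ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS")
        if example.fullmatch(s)]

# The tubulin signature becomes a plain substring-style search.
tubulin = re.compile(prosite_to_regex("[SAG]-G-G-T-G-[SA]-G"))
```

Because a pattern either matches or does not, a single substitution at a constrained position is enough to lose the hit, which is the inflexibility noted in the drawbacks.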

Patterns
Advantages: strict; a pattern with very little variability can produce highly accurate matches.
Drawbacks: simple but less flexible.

Fingerprints

Fingerprints: a multiple motif approach
Figure: sequence alignment; define motifs (Motif 1, Motif 2, Motif 3); motif sequences; weight matrices; fingerprint signature (PR00000).

Telomerase signature (PR01365): Motif 1, Motif 2, Motif 3, Motif 4

The significance of motif context: order and interval
Identify small conserved regions in proteins.
Several motifs characterise a family.
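The idea that several motifs must appear in the right order can be illustrated with a deliberately simplified sketch. Real fingerprints score motifs with weight matrices; here plain regular expressions stand in for them, the interval between motifs is not checked, and the motifs and sequences are invented for illustration only.

```python
import re
from typing import List, Optional


def match_fingerprint(sequence: str, motifs: List[str]) -> Optional[List[int]]:
    """Return the start position of each motif if all occur in the given order.

    Each motif is searched for only after the end of the previous hit, so the
    hits cannot overlap or appear out of order. Returns None if any motif is
    missing. Assumes fixed-length motifs, for which taking the leftmost hit
    is sufficient.
    """
    positions = []
    cursor = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, cursor)
        if hit is None:
            return None
        positions.append(hit.start())
        cursor = hit.end()
    return positions


# Hypothetical three-motif fingerprint.
toy_motifs = ["GDS", "W.K", "HP[ST]"]
in_order = match_fingerprint("AAGDSLLLWAKTTTHPSQ", toy_motifs)
out_of_order = match_fingerprint("AAWAKLLGDSTTHPSQ", toy_motifs)
```

The second sequence contains all three motifs but in the wrong order, so it is rejected even though each motif would match individually.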

Fingerprints
Good at modelling the often small differences between closely related proteins.
Distinguish individual subfamilies within protein families, allowing functional characterisation of sequences at a high level of specificity.
Figure labels: amino acids relatively well conserved across all chloride channel protein family members; amino acids uniquely conserved in chloride channel protein 3 subfamily members.

Profiles & HMMs

Profiles & HMMs (figure): sequence alignment; define coverage (entire domain or whole protein); use the entire alignment of the domain or protein family; build model (profile or HMM); profile or HMM signature.

Profiles
Start with a multiple sequence alignment.
Amino acids at each position in the alignment are scored according to the frequency with which they occur.
Scores are weighted according to evolutionary distance using a BLOSUM matrix.
Good at identifying homologues.
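The frequency-based scoring described above can be sketched in a few lines. This is a minimal illustration, not how any particular database builds its profiles: it uses log-odds scores against a uniform background with a pseudocount, and it omits the BLOSUM-based weighting entirely. The alignment reuses the four example sequences from the patterns slide; all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build a position-specific log-odds profile from an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)   # uniform background (an assumption)
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile: List[Dict[str, float]], sequence: str) -> float:
    """Sum the per-position scores of a sequence as long as the profile."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must equal profile length")
    return sum(column[res] for column, res in zip(profile, sequence))


toy_profile = build_profile(["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"])
member_score = score(toy_profile, "ALVKLISG")
unrelated_score = score(toy_profile, "WWWWWWWW")
```

Unlike a pattern, a profile gives every sequence a score, so a residue not seen in the alignment lowers the score rather than ruling the sequence out.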

HMMs
Start with a multiple sequence alignment.
Amino acid frequency at each position in the alignment and their transition probabilities are encoded.
Insertions and deletions are also modelled.
Advantages: very good at identifying evolutionarily distant homologues; can model very divergent regions of alignment.

Three different protein signature approaches
Patterns: single motif methods
Fingerprints: multiple motif methods
Profiles & hidden Markov models (HMMs): full alignment methods


The aim of InterPro
Protein sequences are linked to:
Family entry: description, proteins matched and more information.
Domain entry: description, proteins matched and more information.
Site entry: description, proteins matched and more information.

What is InterPro?
InterPro is an integrated sequence analysis resource.
It combines predictive models (known as signatures) from different databases.
It provides functional analysis of protein sequences by classifying them into families and predicting domains and important sites.

Facts about InterPro
First release in partner databases
Adds annotation to UniProtKB/TrEMBL
Provides matches to over 80% of UniProtKB
Source of >85 million Gene Ontology (GO) mappings to >24 million distinct UniProtKB sequences
50,000 unique visitors to the web site per month
>2 million sequences searched online per month, plus offline searches with the downloadable version of the software

InterPro signature integration process
Signatures are provided by member databases.
They are scanned against the UniProt database to see which sequences they match.
InterPro curators manually inspect the matches before integrating the signatures into InterPro.

InterPro signature integration process
Signatures representing the same entity are integrated together.
Relationships between entries are traced, where possible.
Curators add literature-referenced abstracts, cross-references to other databases, and GO terms.

Search using protein sequences

Family

Type

InterPro entry types
Family: proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure. Ex. Telomerase family.
Domain: distinct functional, structural or sequence units that may exist in a variety of biological contexts. Ex. DNA binding domain.
Repeats: short sequences typically repeated within a protein. Ex. Tubulin binding repeats in microtubule associated protein Tau.
Sites (PTM, active site, binding site, conserved site): Ex. Phosphorylation sites, ion binding sites, tubulin conserved site.

Type Name Identifier Contributing signatures Description GO terms References

Type Name Identifier Contributing signatures Description References Relationships

InterPro family and domain relationships

Family relationships in InterPro:
Interleukin-15/Interleukin-21 family (IPR003443)
  Interleukin-15 (IPR020439)
    Interleukin-15 Avian (IPR020451)
    Interleukin-15 Fish (IPR020410)
    Interleukin-15 Mammal (IPR020466)
  Interleukin-21 (IPR028151)
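Family relationships of this kind form a tree, which can be held as a simple child-to-parent mapping. The sketch below uses the accessions listed above; the parent assignments are an assumption inferred from the entry names (the three taxon-specific entries under Interleukin-15, and both interleukin entries under the combined family), and the `lineage` helper is hypothetical rather than part of any InterPro tool.

```python
from typing import Dict, List

NAMES: Dict[str, str] = {
    "IPR003443": "Interleukin-15/Interleukin-21 family",
    "IPR020439": "Interleukin-15",
    "IPR020451": "Interleukin-15 Avian",
    "IPR020410": "Interleukin-15 Fish",
    "IPR020466": "Interleukin-15 Mammal",
    "IPR028151": "Interleukin-21",
}

# Child -> parent; the nesting is assumed from the entry names.
PARENT: Dict[str, str] = {
    "IPR020439": "IPR003443",
    "IPR028151": "IPR003443",
    "IPR020451": "IPR020439",
    "IPR020410": "IPR020439",
    "IPR020466": "IPR020439",
}


def lineage(accession: str) -> List[str]:
    """Return the path from the root family down to the given entry."""
    path = [accession]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


mammal_lineage = [NAMES[acc] for acc in lineage("IPR020466")]
```

Walking up the tree like this shows why a match to a specific child entry also implies membership of every broader family above it.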

Relationships

InterPro relationships: domains
Protein kinase-like domain
  Protein kinase domain
    Serine/threonine kinase catalytic domain
    Tyrosine kinase catalytic domain

Gene Ontology
Allows cross-species and/or cross-database comparisons.
Unifies the representation of gene and gene product attributes across species.

The Concepts in GO
1. Molecular Function (e.g. protein kinase activity, insulin receptor activity)
2. Biological Process (e.g. cell cycle, microtubule cytoskeleton organisation)
3. Cellular Component

GO: DNA binding
GO: telomeric template RNA reverse transcriptase activity
GO: Nucleus

Search using keywords

Summary
Protein classification can help scientists to gain information about protein functions.
BLAST is fast and easy to use but has its drawbacks.
Alternative approach: protein signature databases build models (protein signatures) using different methods (patterns, fingerprints, profiles and HMMs).
InterPro integrates these signatures from 11 member databases. It serves as a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites.

Why use InterPro?
Large amounts of manually curated data: 35,634 signatures integrated into 25,214 entries; cites 38,877 PubMed publications.
Large coverage of protein sequence space.
Regularly updated: ~8 week release schedule; new signatures added; scanned against the latest version of UniProtKB.

Caution
InterPro is a predictive protein signature database: results are predictions, and should be treated as such.
InterPro entries are based on signatures supplied to us by our member databases. This means no signature, no entry!
And one more thing: we need your feedback! Missing/additional references, reporting problems, requests (EBI support page).

The InterPro Team: Amaia Sangrador, Craig McAnulla, Matthew Fraser, Maxim Scheremetjew, Siew-Yit Yong, Alex Mitchell, Sebastien Pesseat, Sarah Hunter, Gift Nuka, Hsin-Yu Chang

Database | Basis | Institution | Built from | Focus | URL
Pfam | HMM | Sanger Institute | Sequence alignment | Family & domain based on conserved sequence |
Gene3D | HMM | UCL | Structure alignment | Structural domain | c.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial functional family classification | arch/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification |
PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | www/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | r.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & Profiles | SIB | Sequence alignment | Functional annotation |
HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | ap/
ProDom | Sequence clustering | PRABI: Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | m/current/html/home.php

Thank you! Facebook: EMBLEBI YouTube: EMBLMedia

The BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences.