EBI is an Outstation of the European Molecular Biology Laboratory. Amaia Sangrador InterPro curator Introduction to InterPro.

Presentation transcript:

EBI is an Outstation of the European Molecular Biology Laboratory. Amaia Sangrador InterPro curator Introduction to InterPro

What is InterPro?
DIAGNOSTICS RESOURCE: InterPro uses signatures from several different databases (referred to as member databases) to predict information about proteins.
- Provides functional analysis of proteins by classifying them into families and predicting domains and important sites
- Adds information about the signatures and the types of proteins they match

InterPro Consortium: a consortium of 11 major signature databases

Why do we need predictive annotation tools?

What is UniProt?
- Collaboration between EBI, SIB and PIR
- Based on the original work on PIR, Swiss-Prot and TrEMBL
- The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

UniProt databases:
- UniProtKB (protein knowledgebase): UniProtKB/Swiss-Prot (reviewed; high-quality manual annotation) and UniProtKB/TrEMBL (unreviewed; automatic annotation)
- UniRef (sequence clusters): UniRef100, UniRef90, UniRef50
- UniParc (sequence archive): current and obsolete sequences
- UniMES: metagenomic and environmental sample sequences
Sources: EMBL/GenBank/DDBJ, Ensembl, RefSeq, PDB, other resources

Annotation using InterPro
[Diagram labels: Swiss-Prot (manually annotated sequence); groups of related proteins (same family or share domains); protein signatures; InterPro; automatic annotation pipeline; TrEMBL (uncharacterised sequence)]

Protein family classification
Given a set of sequences, we usually want to know:
- what are these proteins; to what family do they belong?
- what is their function; how can we explain this in structural terms?

Protein family classification: BLAST (pairwise comparisons)

Protein family classification: BLAST

Limitations with pairwise comparisons
BLAST alignment of 2 proteins: 60S acidic ribosomal protein P0 from 2 species

Limitations with Pairwise comparisons

Protein family classification: signature databases
Alternatively, we can seek ‘patterns’ that will allow us to infer relationships with previously-characterised sequences. This is the approach taken by ‘signature’ databases.

Protein signatures: more sensitive homology searches
Each member database creates signatures using different methods:
- manually created sequence alignments
- automatic processes with some human input and correction
- entirely automatic processes

What are protein signatures?
[Diagram labels: protein family/domain; multiple sequence alignment; build model; search UniProt; significant match; it.; mature model; protein analysis]
Example aligned sequences:
ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK

Member databases
METHODS: Hidden Markov Models, Fingerprints, Profiles, Patterns, Sequence Clusters, Structural Domains
- Functional annotation of families/domains
- Prediction of conserved domains
- Protein features (active sites…)

Diagnostic approaches (sequence-based)
- Single motif methods: Regex patterns (PROSITE)
- Multiple motif methods: Identity matrices (PRINTS)
- Full domain alignment methods: Profiles (Profile Library), HMMs (Pfam)

Patterns
[Diagram labels: sequence alignment; motif; define pattern; extract pattern sequences (xxxxxx); build regular expression; pattern signature (PS 00000)]
Example pattern: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Patterns
Patterns are mostly directed against functional residues: active sites, PTM, disulfide bridges, binding sites.
Advantages:
- Anchoring the match to the extremity of a sequence, e.g. <M-R-[DE]-x(2,4)-[ALT]-{AM}
- Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies
- Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins (very short toxins with few conserved cysteines): C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C
Drawbacks:
- Simple but less powerful

Prosite patterns: pattern/motif in sequence -> regular expression
>sp|P29197|CH60A_ARATH Chaperonin CPN60, mitochondrial OS=Arabidopsis thaliana
MYRFASNLASKARIAQNARQVSSRMSWSRNYAAKEIKFGVEARALMLKGVEDLADAVKVT MGPKGRNVVIEQSWGAPKVTKDGVTVAKSIEFKDKIKNVGASLVKQVANATNDVAGDGTT CATVLTRAIFAEGCKSVAAGMNAMDLRRGISMAVDAVVTNLKSKARMISTSEEIAQVGTI SA NGEREIGELIAKAMEKVGKEGVITIQDGKTLFNELEVVEGMKLDRGYTSPYFITNQKT QKCE LDDPLILIHEKKISSINSIVKVLELALKRQRPLLIVSEDVESDALATLILNKLRAG IKVCAIKAPGF GENRKANLQDLAALTGGEVITDELGMNLEKVDLSMLGTCKKVTVSKDDT VILDGAGDKKGI EERCEQIRSAIELSTSDYDKEKLQERLAKLSGGVAVLKIGGASEAEVG EKKDRVTDALNATK AAVEEGILPGGG VALLYAARELEKLPTANFDQKIGVQIIQNALKTP VYTIASNAGVEGA VIVGKLLEQDNPDLGYDAAKGEYVDMVKAGIIDPLKVIRTALVDAAS VSSLLTTTEAVVVDLP KDESESGAAGAGMGGMGGMDY
EXAMPLE: PS00296; Chaperonins cpn60 signature (PATTERN)
A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]
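The step from a PROSITE-style pattern to a regular expression can be sketched in a few lines of Python. This is only an illustrative converter, not the PROSITE or InterProScan implementation: the function names `prosite_to_regex` and `find_matches` are made up for this example, and the sketch only covers the syntax seen on these slides (`x`, `[..]`, `{..}`, `(n)`, `(n,m)` and the `<` / `>` anchors).

```python
import re
from typing import List, Tuple

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or range (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax (hypothetical helper)."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):      # anchor to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchor to the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):           # any residue
            core = "."
        elif core.startswith("{"):       # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]
```

Applied to the PS00296 pattern, `find_matches` should pick out the fragment AAVEEGILPGGG, which is set apart by spaces in the CH60A_ARATH sequence above.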

Fingerprints
[Diagram labels: sequence alignment; define motifs (Motif 1, Motif 2, Motif 3); extract motif sequences (xxxxxx); weight matrices; fingerprint signature (PR): motifs 1, 2, 3 in correct order with correct spacing]

The significance of motif context (order, interval)
- Identify small conserved regions in proteins
- Several motifs characterise a family
- Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours

PRINTS families are hierarchical
Different motifs describe subfamilies:
- G protein-coupled receptors: rhodopsin-like, secretin-like, cAMP receptors, metabotropic glutamate receptors, etc
- rhodopsin-like: adenosine receptors, opsin receptors, dopamine receptors, somatostatin receptors, histamine receptors, etc
- somatostatin receptors: somatostatin receptor type 1, somatostatin receptor type 2, somatostatin receptor type 3, etc
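The idea of motif context (order and interval) can be sketched as a simple check on where the individual motifs hit a sequence. This is a deliberately simplified illustration, not the PRINTS scoring method: the function name, the `(start, end)` hit representation and the per-gap limits are all assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple

Hit = Tuple[int, int]  # (start, end) of one motif hit, end exclusive


def is_fingerprint_match(hits: Sequence[Optional[Hit]],
                         max_gaps: Sequence[int]) -> bool:
    """Accept only if every motif hits, in order, within the allowed spacing.

    hits[i] is the hit for motif i+1 (None if that motif was not found);
    max_gaps[i] is the largest allowed distance between motif i+1 and motif i+2.
    """
    if len(max_gaps) != len(hits) - 1:
        raise ValueError("need exactly one gap limit per adjacent motif pair")
    if any(hit is None for hit in hits):
        return False                       # a motif is missing
    for previous, following, max_gap in zip(hits, hits[1:], max_gaps):
        gap = following[0] - previous[1]
        if gap < 0:                        # wrong order or overlapping motifs
            return False
        if gap > max_gap:                  # motifs too far apart
            return False
    return True
```

A sequence where all three motifs are found but in the wrong order is rejected, which is the extra diagnostic context that a single-motif method cannot provide.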

Profiles & HMMs
[Diagram labels: sequence alignment; define coverage (entire domain or whole protein); use entire alignment for domain or protein; build model (models insertions and deletions); profile or HMM signature]

Hidden Markov Models (HMM): models insertions and deletions; more flexible (can use partial alignments). Profiles: built using weight matrices. More sophisticated algorithm.
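To make "built using weight matrices" concrete, here is a minimal sketch of a position-specific weight matrix built from an ungapped alignment. It is a toy under stated assumptions: a uniform background over the 20 amino acids, a simple pseudocount, and no modelling of insertions or deletions (which real profiles and HMMs do handle). The function names are hypothetical, and the two aligned sequences are the ones shown on the "What are protein signatures?" slide.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(alignment: Sequence[str],
                        pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Return one column of log-odds scores per alignment position."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for position in range(length):
        counts = Counter(seq[position] for seq in alignment)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def score(matrix: List[Dict[str, float]], segment: str) -> float:
    """Sum the per-position scores of a segment as long as the matrix."""
    if len(segment) != len(matrix):
        raise ValueError("segment length must equal matrix length")
    return sum(column[aa] for column, aa in zip(matrix, segment))


alignment = ["ITWKGPVCGLDGKTYRNECALL", "AVPRSPVCGSDDVTYANECELK"]
matrix = build_weight_matrix(alignment)
```

Residues conserved in a column (for example the P shared at the sixth position) get a higher score than residues never seen there, so sequences resembling the alignment score higher than unrelated ones.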

PROSITE and HAMAP profiles: a functional annotation perspective
- PROSITE domains: high quality manually curated seeds (using biologically characterized UniProtKB/Swiss-Prot entries), documentation and annotation rules. Oriented toward functional domain discrimination.
- HAMAP families: manually curated bacterial, archaeal and plastid protein families (represented by profiles and associated rules), covering some highly conserved proteins and functions.

HMM databases
Sequence-based:
- PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship
- PANTHER: families/subfamilies model the divergence of specific functions
- TIGRFAM: microbial functional family classification
- PFAM: families & domains based on conserved sequence
- SMART: functional domain annotation
Structure-based:
- SUPERFAMILY: models correspond to SCOP domains
- GENE3D: models correspond to CATH domains

Why we created InterPro
By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database
- to simplify & rationalise protein analysis
- to facilitate automatic functional annotation of uncharacterised proteins
- to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases

InterPro entry

The InterPro entry: types
- Family: proteins share a common evolutionary origin, as reflected in their related functions, sequences or structure
- Domain: distinct functional, structural or sequence units that may exist in a variety of biological contexts
- Repeats: short sequences typically repeated within a protein
- Sites: PTM, Active Site, Binding Site, Conserved Site

InterPro Entry
- Groups similar signatures together: quality control; removes redundancy
- Adds extensive annotation
- Links to other databases
- Structural information and viewers

InterPro Entry
- Groups similar signatures together: hierarchical classification
- Adds extensive annotation
- Links to other databases
- Structural information and viewers

InterPro hierarchies: Families
FAMILIES can have parent/child relationships with other Families.
Parent/Child relationships are based on:
- Comparison of protein hits: child should be a subset of parent; siblings should not have matches in common
- Existing hierarchies in member databases
- Biological knowledge of curators

InterPro hierarchies: Domains
DOMAINS can have parent/child relationships with other domains
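The two protein-hit rules (a child's matches are a subset of the parent's; siblings share no matches) map directly onto set operations. The sketch below is illustrative only: the function names and the entry and protein labels used in any call are placeholders, and in practice curators also weigh member-database hierarchies and biological knowledge, as the slide notes.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def child_is_subset(parent_hits: Set[str], child_hits: Set[str]) -> bool:
    """True if every protein matched by the child is also matched by the parent."""
    return child_hits <= parent_hits


def overlapping_siblings(
    siblings: Dict[str, Set[str]]
) -> List[Tuple[str, str, Set[str]]]:
    """List sibling pairs that share protein matches, with the shared proteins."""
    conflicts = []
    for (name_a, hits_a), (name_b, hits_b) in combinations(sorted(siblings.items()), 2):
        shared = hits_a & hits_b
        if shared:
            conflicts.append((name_a, name_b, shared))
    return conflicts
```

An empty result from `overlapping_siblings` means the candidate children are mutually exclusive, which is what the hierarchy rules ask for.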

Domains and Families may be linked through Domain Organisation Hierarchy

InterPro Entry
- Groups similar signatures together
- Adds extensive annotation
- Links to other databases
- Structural information and viewers

InterPro Entry
- Groups similar signatures together
- Adds extensive annotation: the Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics
- Links to other databases
- Structural information and viewers

InterPro Entry
- Groups similar signatures together
- Adds extensive annotation
- Links to other databases: UniProt, KEGG, Reactome, IntAct, UniProt taxonomy, PANDIT, MEROPS, Pfam clans, PubMed
- Structural information and viewers

InterPro Entry
- Groups similar signatures together
- Adds extensive annotation
- Links to other databases
- Structural information and viewers: PDB (3-D structures), SCOP (structural domains), CATH (structural domain classification)

Understanding signatures:

Two very different signatures both describing the same thing!
- Non-overlapping signatures can describe the same thing
- It is not always possible to use signature overlap to determine how family signatures are related
- e.g. High molecular weight glutenins (PF protein hits, PR protein hits)

Some signatures give us similar, but complementary information
- PFAM shows the domain is composed of two types of repeated sequence motifs
- SUPERFAMILY shows the potential domain boundaries

Discontinuous Signatures Require Interpretation
1) Signature method: e.g. PRINTS, discrete motifs
2) Duplicated domains: e.g. SSF, duplication consisting of 2 domains with same fold
3) Repeated elements: e.g. Kringle, WD40
4) Non-contiguous domains: structural domains can consist of non-contiguous sequence

Searching InterPro:

WHEN TO USE INTERPRO
Use InterPro to predict family, domain or active site information for a given protein or amino acid sequence.
You can search InterPro if you have:
- a protein sequence
- a UniProtKB protein identifier
- a Gene Ontology term
- a protein structure code
- a general search term (keyword or short phrase)
and require further information regarding your protein of interest.

Search tools include:
- Text Search
- InterProScan (sequence search)
- BioMart (builds queries)
Beta version:

InterPro Search (wwwdev.ebi.ac.uk/interpro)
Search using: text, protein ID, InterPro ID, GO term
Example GO term: ID: GO: Name: apoptosis

InterPro Search
Search results for GO: (apoptosis)

InterPro Search wwwdev.ebi.ac.uk/interpro protein ID

InterPro Search Results
[Screenshot labels: Family; Domains and sites; Unintegrated signatures; Structural data; Link to PDBe]

Structural information
- CATH and SCOP divide PDB structures into domains
- Swiss-Model and ModBase can predict structure for regions not covered by PDB
- Note that one domain is discontiguous

Searching InterPro: InterProScan

InterProScan: Searching New Sequence (wwwdev.ebi.ac.uk/interpro)
- Paste in unknown sequence
- Additional options

InterProScan New Search Results
[Screenshot labels: Links to signature databases; Link to InterPro entry]

Searching InterPro: BioMart

BioMart Search
BioMart allows more powerful and flexible queries:
- Large volumes of data can be queried efficiently
- The interface is shared with many other bioinformatics resources
- It allows federation with other databases: PRIDE (mass spectrometry-derived proteins and peptides) and REACTOME (biological pathways)

BioMart Search
1) Choose Dataset
   a. Choose InterPro BioMart
   b. Choose InterPro entries or protein matches
2) Choose Filters: search specific entries, signatures or proteins (e.g. filter by specific proteins)
3) Choose Attributes: what results you want
4) Choose additional Dataset (optional): this is where you link results to PRIDE and Reactome

BioMart Search Results
Click to view results. Output formats:
- HTML = web-formatted table
- CSV = comma-separated values
- TSV = tab-separated values
- XLS = Excel spreadsheet
User manual
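Once results are exported as TSV, they can be post-processed with a few lines of Python. The sketch below groups protein accessions by entry. The column names `entry_id` and `protein_accession` and the function name are assumptions for illustration only; a real export uses whatever attribute names were chosen in the "Choose Attributes" step.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def proteins_by_entry(tsv_text: str,
                      entry_column: str = "entry_id",
                      protein_column: str = "protein_accession") -> Dict[str, List[str]]:
    """Group protein accessions by entry from tab-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in reader:
        grouped[row[entry_column]].append(row[protein_column])
    return dict(grouped)
```

For a file on disk, the same function can be fed the result of `open(path, newline="").read()`; the path and column names would need to match the actual export.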

InterPro: the numbers
- Our member databases all have their particular niche or focus... but InterPro is a combination of all their areas of expertise!
- InterPro 32.0: entries and signatures covering 85.5% of UniProtKB
- Frequent releases: both protein and method updates
- unique visitors per month
- The database has grown almost 10-fold in ~11 years

Caveats
- InterPro is a predictive protein signature database.
- InterPro entries are based on signatures supplied to us by our member databases... this means no signature, no entry!
- Small changes with a large impact may not be well represented, for example inactive peptidases such as Q8N3Z0 and Q9W3H0.
We need your feedback!
- missing/additional references
- reporting problems
- requests
EBI support page.

Acknowledgements
InterPro Team: Amaia Sangrador, David Lonsdale, Craig McAnulla, Matthew Fraser, Anthony Quinn, Maxim Scheremetjew, Phil Jones, Siew-Yit Yong, Alex Mitchell, Sebastien Pesseat, Prudence Mutowo, Sarah Hunter, Christopher Hunter