
EBI is an Outstation of the European Molecular Biology Laboratory. Amaia Sangrador InterPro curator Introduction to InterPro.


1 EBI is an Outstation of the European Molecular Biology Laboratory. Amaia Sangrador InterPro curator amaia@ebi.ac.uk Introduction to InterPro

2 What is InterPro? A diagnostic resource: InterPro uses signatures from several different databases (referred to as member databases) to predict information about proteins.
* Provides functional analysis of proteins by classifying them into families and predicting domains and important sites
* Adds information about the signatures and the types of proteins they match

3 InterPro Consortium Consortium of 11 major signature databases

4 Why do we need predictive annotation tools?

5 What is UniProt? Based on the original work on PIR, Swiss-Prot and TrEMBL. A collaboration between EBI, SIB and PIR. The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

6 (Diagram) Sources: EMBL/GenBank/DDBJ, Ensembl, RefSeq, PDB, other resources.
* UniParc: sequence archive (current and obsolete sequences)
* UniProtKB: protein knowledgebase, comprising UniProtKB/Swiss-Prot (reviewed; high-quality manual annotation) and UniProtKB/TrEMBL (unreviewed; automatic annotation)
* UniRef: sequence clusters (UniRef100, UniRef90, UniRef50)
* UniMES: metagenomic and environmental sample sequences

7 Annotation using InterPro (diagram labels): manually annotated sequences (Swiss-Prot); groups of related proteins (same family or share domains); protein signatures; InterPro; automatic annotation pipeline; uncharacterised sequences (TrEMBL), e.g. CGCGCCTGTACGC TGAACGCTCGTGA CGTGTAGTGCGCG.

8 Protein family classification Given a set of sequences, we usually want to know: –what are these proteins; to what family do they belong? –what is their function; how can we explain this in structural terms?

9 Protein family classification: BLAST (pairwise comparisons)

10 Protein family classification: BLAST

11 Limitations with pairwise comparisons. BLAST alignment of 2 proteins: 60S acidic ribosomal protein P0 from 2 species.

12 Limitations with Pairwise comparisons

13 Protein family classification: signature databases. Alternatively, we can seek ‘patterns’ that will allow us to infer relationships with previously-characterised sequences. This is the approach taken by ‘signature’ databases.

14 Protein signatures. More sensitive homology searches. Each member database creates signatures using different methods and methodologies:
* manually-created sequence alignments
* automatic processes with some human input and correction
* entirely automatically.

15 What are protein signatures? (diagram labels): protein family/domain; multiple sequence alignment (e.g. ITWKGPVCGLDGKTYRNECALL and AVPRSPVCGSDDVTYANECELK); build model; search UniProt (it.); significant match; mature model; protein analysis.
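As a rough illustration of the alignment step on this slide, the sketch below marks the columns that are identical in the two example sequences. It assumes the two sequences are already aligned column by column without gaps, which the slide does not state, and the function names are illustrative only.

```python
def conserved_columns(alignment):
    """Return the shared residue at each fully conserved column, '.' elsewhere."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(
        column[0] if len(set(column)) == 1 else "."
        for column in zip(*alignment)
    )


def fraction_identical(alignment):
    """Fraction of alignment columns in which every sequence has the same residue."""
    marks = conserved_columns(alignment)
    return sum(ch != "." for ch in marks) / len(marks)


# The two sequences from the slide; treating them as an ungapped
# alignment is an assumption made for this sketch.
ALIGNMENT = [
    "ITWKGPVCGLDGKTYRNECALL",
    "AVPRSPVCGSDDVTYANECELK",
]

if __name__ == "__main__":
    print(conserved_columns(ALIGNMENT))
    print(fraction_identical(ALIGNMENT))
```

Under that assumption, half of the columns are identical, including both cysteines in each sequence. Conserved positions of this kind are what a signature is built around.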

16 Member databases. Methods: hidden Markov models, fingerprints, profiles, patterns, sequence clusters, structural domains. Functional annotation of families/domains; prediction of conserved domains; protein features (active sites…).

17 Diagnostic approaches (sequence-based).
* Single motif methods: regex patterns (PROSITE)
* Multiple motif methods: identity matrices (PRINTS)
* Full domain alignment methods: profiles (Profile Library), HMMs (Pfam)

18 Patterns (diagram labels): sequence alignment; motif; define pattern; extract pattern sequences (xxxxxx); build regular expression, e.g. C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C; pattern signature (PS00000).

19 Patterns. Patterns are mostly directed against functional residues: active sites, PTM, disulfide bridges, binding sites.
Advantages:
* Anchoring the match to the extremity of a sequence: <M-R-[DE]-x(2,4)-[ALT]-{AM}
* Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies
* Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins (very short toxins with few conserved cysteines): C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C
Drawbacks:
* Simple but less powerful
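The pattern syntax used on this slide maps closely onto ordinary regular expressions. The following is a minimal sketch of such a translation, not PROSITE's own software: the function name is hypothetical, and it only handles the elements that appear in these slides (single residues, `[...]` allowed sets, `{...}` forbidden sets, `x`, repeat counts, and the `<` and `>` anchors).

```python
import re

# One pattern element: x, a residue, an allowed set [..] or a forbidden
# set {..}, optionally followed by a repeat count (n) or range (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    prefix, suffix = "", ""
    if pattern.startswith("<"):      # anchored to the start of the sequence
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):        # anchored to the end of the sequence
        suffix, pattern = "$", pattern[:-1]

    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element.strip())
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":              # any residue
            core = "."
        elif core.startswith("{"):   # forbidden residues
            core = "[^" + core[1:-1] + "]"
        if low is not None:          # repeat count or range
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return prefix + "".join(parts) + suffix


if __name__ == "__main__":
    print(prosite_to_regex("<M-R-[DE]-x(2,4)-[ALT]-{AM}"))
    print(prosite_to_regex("C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C"))
```

The resulting string can be passed to `re.search`; because of the `^` anchor, the first pattern can only match at the very start of a sequence.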

20 Prosite patterns. Pattern/motif in sequence, expressed as a regular expression.
EXAMPLE: PS00296; Chaperonins cpn60 signature (PATTERN): A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]
>sp|P29197|CH60A_ARATH Chaperonin CPN60, mitochondrial OS=Arabidopsis thaliana
MYRFASNLASKARIAQNARQVSSRMSWSRNYAAKEIKFGVEARALMLKGVEDLADAVKVT
MGPKGRNVVIEQSWGAPKVTKDGVTVAKSIEFKDKIKNVGASLVKQVANATNDVAGDGTT
CATVLTRAIFAEGCKSVAAGMNAMDLRRGISMAVDAVVTNLKSKARMISTSEEIAQVGTI
SANGEREIGELIAKAMEKVGKEGVITIQDGKTLFNELEVVEGMKLDRGYTSPYFITNQKT
QKCELDDPLILIHEKKISSINSIVKVLELALKRQRPLLIVSEDVESDALATLILNKLRAG
IKVCAIKAPGFGENRKANLQDLAALTGGEVITDELGMNLEKVDLSMLGTCKKVTVSKDDT
VILDGAGDKKGIEERCEQIRSAIELSTSDYDKEKLQERLAKLSGGVAVLKIGGASEAEVG
EKKDRVTDALNATKAAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTP
VYTIASNAGVEGAVIVGKLLEQDNPDLGYDAAKGEYVDMVKAGIIDPLKVIRTALVDAAS
VSSLLTTTEAVVVDLPKDESESGAAGAGMGGMGGMDY
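To make the PS00296 example concrete, the sketch below writes the pattern by hand as a Python regular expression and searches one 60-residue stretch of the sequence shown on this slide. The constant and function names are illustrative, and only a fragment of the protein is used to keep the example short.

```python
import re

# PS00296, A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA], rewritten by hand:
# {..} becomes a negated character class, x becomes ".", G(2) becomes G{2}.
PS00296 = re.compile(r"A[AS][^L][DEQ]E[^A][^Q][^R].G{2}[GA]")

# One 60-residue stretch of the CH60A_ARATH sequence from the slide
# (the stray spaces in the transcript are removed).
FRAGMENT = "EKKDRVTDALNATKAAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTP"


def find_signature(sequence):
    """Return (start, end, matched_text) for the first match, or None."""
    match = PS00296.search(sequence)
    if match is None:
        return None
    return match.start(), match.end(), match.group()


if __name__ == "__main__":
    print(find_signature(FRAGMENT))
```

The match covers the twelve residues AAVEEGILPGGG, which is the stretch set apart by spaces in the original slide text.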

21 Fingerprints (diagram labels): sequence alignment; define motifs (motif 1, motif 2, motif 3); extract motif sequences (xxxxxx); weight matrices; fingerprint signature (PR00000) with motifs 1, 2, 3 in the correct order and with the correct spacing.

22 The significance of motif context (order, interval). Identify small conserved regions in proteins. Several motifs together characterise a family. They offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours.
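The idea that motif order and interval carry diagnostic weight can be sketched in a few lines. This is a deliberately simplified illustration and not how PRINTS scores fingerprints: motifs are written here as regular expressions rather than weight matrices, the search is greedy (it takes the first occurrence of each motif), and the function name, toy motifs and toy sequence are all hypothetical.

```python
import re


def ordered_motif_hits(sequence, motifs, max_gap):
    """Return the start of each motif if all occur in order, else None.

    max_gap is the largest number of residues allowed between the end of
    one motif and the start of the next.
    """
    starts = []
    search_from = 0
    for index, motif in enumerate(motifs):
        match = re.search(motif, sequence[search_from:])
        if match is None:
            return None  # motif missing, or only present out of order
        if index > 0 and match.start() > max_gap:
            return None  # motif found, but too far from its neighbour
        starts.append(search_from + match.start())
        search_from += match.end()
    return starts


# Toy data, invented for this sketch.
TOY_SEQUENCE = "AAACGTAAAAWWYAAKKR"
TOY_MOTIFS = ["CGT", "WWY", "KKR"]

if __name__ == "__main__":
    print(ordered_motif_hits(TOY_SEQUENCE, TOY_MOTIFS, max_gap=5))
    print(ordered_motif_hits(TOY_SEQUENCE, ["WWY", "CGT", "KKR"], max_gap=5))
```

The same three motifs are rejected when they are listed in a different order or when the allowed interval is too small, which is the extra context a single motif cannot provide.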

23 PRINTS families are hierarchical. Different motifs describe subfamilies. Example: G protein-coupled receptors divide into rhodopsin-like, secretin-like, cAMP receptors, metabotropic glutamate receptors, etc.; rhodopsin-like receptors divide into adenosine receptors, opsin receptors, dopamine receptors, somatostatin receptors, histamine receptors, etc.; somatostatin receptors divide into somatostatin receptor type 1, type 2, type 3, etc.

24 Profiles & HMMs (diagram labels): sequence alignment; define coverage (entire domain or whole protein); use entire alignment for domain or protein; build model (models insertions and deletions); profile or HMM signature.

25 Hidden Markov Models (HMM): models insertions and deletions; more flexible (can use partial alignments); more sophisticated algorithm. Profiles: built using weight matrices.
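To show what "built using weight matrices" can mean in practice, here is a minimal sketch of a position-specific weight matrix built from an ungapped alignment. It is a teaching simplification and not the algorithm used by PROSITE profiles or by HMM software: it assumes a uniform background frequency, a fixed pseudocount, no insertions or deletions, and only the 20 standard residue letters. All names are hypothetical, and the two training sequences are the ones shown on slide 15, treated as an ungapped alignment.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for column in zip(*alignment):
        counts = Counter(column)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def score(matrix, segment):
    """Sum the per-column scores of a segment as long as the matrix."""
    return sum(column[aa] for column, aa in zip(matrix, segment))


def best_hit(matrix, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(matrix)
    return max(
        (score(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


TRAINING = ["ITWKGPVCGLDGKTYRNECALL", "AVPRSPVCGSDDVTYANECELK"]

if __name__ == "__main__":
    matrix = build_weight_matrix(TRAINING)
    print(best_hit(matrix, "MMM" + TRAINING[0] + "MMM"))
```

Unlike a pattern, every column contributes to the score, so a sequence can still be detected when a few positions differ from the training alignment.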

26 PROSITE and HAMAP profiles: a functional annotation perspective.
* PROSITE domains: high quality manually curated seeds (using biologically characterized UniProtKB/Swiss-Prot entries), documentation and annotation rules. Oriented toward functional domain discrimination.
* HAMAP families: manually curated bacterial, archaeal and plastid protein families (represented by profiles and associated rules), covering some highly conserved proteins and functions.

27 HMM databases.
Sequence-based:
* PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship
* PANTHER: families/subfamilies model the divergence of specific functions
* TIGRFAM: microbial functional family classification
* PFAM: families & domains based on conserved sequence
* SMART: functional domain annotation
Structure-based:
* SUPERFAMILY: models correspond to SCOP domains
* GENE3D: models correspond to CATH domains

28 Why we created InterPro. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database
–to simplify & rationalise protein analysis
–to facilitate automatic functional annotation of uncharacterised proteins
–to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases

29 InterPro entry

30

31 The InterPro entry: types.
* Family: proteins share a common evolutionary origin, as reflected in their related functions, sequences or structure
* Domain: distinct functional, structural or sequence units that may exist in a variety of biological contexts
* Repeats: short sequences typically repeated within a protein
* Sites: PTM, active site, binding site, conserved site

32 InterPro Entry: groups similar signatures together (quality control; removes redundancy); adds extensive annotation; links to other databases; structural information and viewers.

33 InterPro Entry: groups similar signatures together (hierarchical classification); adds extensive annotation; links to other databases; structural information and viewers.

34 InterPro hierarchies: Families. FAMILIES can have parent/child relationships with other families. Parent/child relationships are based on:
* Comparison of protein hits: a child should be a subset of its parent; siblings should not have matches in common
* Existing hierarchies in member databases
* Biological knowledge of curators
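The two protein-hit rules on this slide (a child is a subset of its parent, and siblings share no matches) are simple set relations. The sketch below checks them for one parent and its children; it is only an illustration of the rules, not InterPro's curation tooling, and the function name, entry names and protein identifiers are invented.

```python
def check_family_hierarchy(parent_hits, children_hits):
    """Return a list of rule violations; an empty list means both rules hold.

    parent_hits   -- set of protein identifiers matched by the parent entry
    children_hits -- dict mapping each child entry name to its set of hits
    """
    problems = []
    names = sorted(children_hits)

    # Rule 1: every child should be a subset of the parent.
    for name in names:
        outside_parent = children_hits[name] - parent_hits
        if outside_parent:
            problems.append(
                f"{name}: not a subset of parent, extra hits {sorted(outside_parent)}"
            )

    # Rule 2: siblings should not have matches in common.
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = children_hits[first] & children_hits[second]
            if shared:
                problems.append(
                    f"{first} and {second}: siblings share hits {sorted(shared)}"
                )
    return problems


if __name__ == "__main__":
    parent = {"P1", "P2", "P3", "P4"}                      # hypothetical hits
    children = {"child_a": {"P1", "P5"}, "child_b": {"P1"}}
    for problem in check_family_hierarchy(parent, children):
        print(problem)
```

In practice the slide lists two further inputs, existing member database hierarchies and curator knowledge, which a set comparison like this cannot replace.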

35 InterPro hierarchies: Domains. DOMAINS can have parent/child relationships with other domains.

36 Domains and Families may be linked through Domain Organisation Hierarchy

37 InterPro Entry: groups similar signatures together; adds extensive annotation; links to other databases; structural information and viewers.

38 InterPro Entry: adds extensive annotation. The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics.

39 InterPro Entry: links to other databases: UniProt, KEGG, Reactome, IntAct, UniProt taxonomy, PANDIT, MEROPS, Pfam clans, PubMed.

40 InterPro Entry: structural information and viewers: PDB (3-D structures), SCOP (structural domains), CATH (structural domain classification).

41 Understanding signatures:

42 Non-overlapping signatures can be describing the same thing. It is not always possible to use signature overlap to determine how family signatures are related, e.g. high molecular weight glutenins: PF03157 (336 protein hits) and PR00210 (331 protein hits). Two very different signatures both describing the same thing!

43 Some signatures give us similar, but complementary information. PFAM shows the domain is composed of two types of repeated sequence motifs. SUPERFAMILY shows the potential domain boundaries. www.ebi.ac.uk/interpro

44 Discontinuous Signatures Require Interpretation: 1) Signature method 2) Duplicated domains 3) Repeated elements 4) Non-contiguous domains

45 Discontinuous Signatures Require Interpretation. 1) Signature method: e.g. PRINTS – discrete motifs

46 Discontinuous Signatures Require Interpretation. 2) Duplicated domains: e.g. SSF - duplication consisting of 2 domains with same fold

47 Discontinuous Signatures Require Interpretation. 3) Repeated elements: e.g. Kringle, WD40

48 Discontinuous Signatures Require Interpretation. 4) Non-contiguous domains: structural domains can consist of non-contiguous sequence

49 1) Signature method 2) Duplicated domains 3) Repeats 4) Non-contiguous domains

50 Searching InterPro:

51 WHEN TO USE INTERPRO. Use InterPro to predict family, domain or active site information for a given protein or amino acid sequence. You can search InterPro if you have a protein sequence, a UniProtKB protein identifier, a Gene Ontology term, a protein structure code, or a general search term (keyword or short phrase) and require further information regarding your protein of interest.

52 http://www.ebi.ac.uk/interpro/ Search tools include: Text Search; InterProScan (sequence search); BioMart (builds queries). Beta version: http://wwwdev.ebi.ac.uk/interpro/

53 InterPro Search (wwwdev.ebi.ac.uk/interpro). Search using: text, protein ID, InterPro ID, GO term (ID: GO:0006915, Name: apoptosis)

54 InterPro Search. Search results for GO:0006915 (apoptosis)

55 InterPro Search (wwwdev.ebi.ac.uk/interpro): protein ID

56 InterPro Search Results (screenshot labels): family; domains and sites; unintegrated signatures; structural data; link to PDBe.

57 Structural information. CATH and SCOP divide PDB structures into domains. Swiss-Model and ModBase can predict structure for regions not covered by PDB. Note that one domain is discontiguous.

58 Searching InterPro: InterProScan

59 InterProScan – Searching New Sequence (wwwdev.ebi.ac.uk/interpro): paste in unknown sequence; additional options.

60 InterProScan New Search Results: links to signature databases; link to InterPro entry.

61 Searching InterPro: BioMart

62 BioMart Search. BioMart allows more powerful and flexible queries. Large volumes of data can be queried efficiently. The interface is shared with many other bioinformatics resources. It allows federation with other databases:
* PRIDE (mass spectrometry-derived proteins and peptides)
* REACTOME (biological pathways)

63 BioMart Search. 1) Choose Dataset: a. Choose InterPro BioMart

64 BioMart Search. 1) Choose Dataset: a. Choose InterPro BioMart; b. Choose InterPro entries or protein matches

65 BioMart Search. 2) Choose Filters: search specific entries, signatures or proteins

66 BioMart Search. 2) Choose Filters: e.g. filter by specific proteins

67 BioMart Search. 3) Choose Attributes: what results you want

68 BioMart Search. 4) Choose additional Dataset (optional): this is where you link results to PRIDE and Reactome

69 BioMart Search Results. Click to view results. Output formats: HTML = web-formatted table; CSV = comma-separated values; TSV = tab-separated values; XLS = Excel spreadsheet. User manual.

70 InterPro – the numbers. Our member databases all have their particular niche or focus... but InterPro is a combination of all their areas of expertise!
* InterPro 32.0: 21516 entries; 101175 signatures; covering 85.5% of UniProtKB
* Frequent releases – both protein and method updates
* 45 000 unique visitors per month
* The database has grown almost 10-fold in ~11 years

71 Caveats.
* InterPro is a predictive protein signature database.
* InterPro entries are based on signatures supplied to us by our member databases... this means no signature, no entry!
* Small changes with a large impact may not be well represented, for example inactive peptidases such as Q8N3Z0 and Q9W3H0.
* We need your feedback (missing/additional references, reporting problems, requests): EBI support page.

72 Acknowledgements. InterPro Team: Amaia Sangrador, David Lonsdale, Craig McAnulla, Matthew Fraser, Anthony Quinn, Maxim Scheremetjew, Phil Jones, Siew-Yit Yong, Alex Mitchell, Sebastien Pesseat, Prudence Mutowo, Sarah Hunter, Christopher Hunter.

