Proteins to Proteomes The InterPro Database


1 Proteins to Proteomes The InterPro Database

2 Origins of InterPro
Raw data: UniProt
Swiss-Prot: 290K, annotated
TrEMBL: 5M, ???
InterPro: automated annotation

3 Curated Annotation in InterPro
TrEMBL: uncharacterised sequence
Swiss-Prot: annotated sequence
Multiple signatures → InterPro: groups of related proteins (same family or share domains)
Feed back common annotation to TrEMBL

4 Finding Conserved Signatures
Pattern (simplest, limited) → Fingerprint → Sequence clustering → HMM (more information)

5 Patterns
Pattern/motif in sequence → regular expression
Can define important sites:
Enzyme catalytic site
Prosthetic group attachment
Metal ion binding site
Cysteines for disulphide bonds
Protein or molecule binding
EXAMPLE: Insulin (cysteines form the disulphide bonds)
B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx

8 Patterns
Pattern/motif in sequence → regular expression
Can define important sites
EXAMPLE: PS00262 Insulin family signature
B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQ CCTSICSLYQLENYC N
Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
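
The pattern on this slide can be checked against the sequence with an ordinary regular-expression engine. The sketch below is a minimal translation of the PROSITE syntax used here (`x` for any residue, `{P}` for "anything except P", `(n)` for repeats) into Python `re` syntax. The function name `prosite_to_regex` is made up for illustration, and the sketch does not handle the PROSITE anchors `<` and `>`.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                       # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {P} = any residue except P
        # Single letters and [..] classes are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

# Sequence and pattern copied from the slide.
PREPROINSULIN = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
    "GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)
regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")
match = re.search(regex, PREPROINSULIN)

print(regex)
print(match.group() if match else "no match")
```

The matched substring is the stretch that the slide sets apart with spaces near the C-terminus of the sequence.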

9 Patterns
Sequence alignment → define pattern (insulin family motif) → extract pattern sequences → build regular expression → pattern signature PS00000
Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

10 Fingerprints
Several motifs → characterise family
Identify small conserved regions in divergent proteins
Different combinations of motifs describe subfamilies
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE

13 Fingerprints
Several motifs → characterise family
Identify small conserved regions in divergent proteins
Different combinations of motifs describe subfamilies
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAET GIHARPATLLVQTASK FNSDINLEY KGKSVNLK SIMGVMSL GVGQGSDVTITVDGADE AEGMAAIVETLQKEGLAE
Highlighted regions: His phosphorylation site, Ser phosphorylation site, conserved site

14 Fingerprints
Several motifs → characterise family
Identify small conserved regions in divergent proteins
Different combinations of motifs describe subfamilies
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAET GIHARPATLLVQTASK FNSDINLEY KGKSVNLK SIMGVMSL GVGQGSDVTITVDGADE AEGMAAIVETLQKEGLAE
3-motif fingerprint:
1) GIHARPATLLVQTASKF
2) KGKSVNLKSIMGVMSL
3) LGVGQGSDVTITVDGADE

15 Fingerprints
Sequence alignment → define motifs (His phosphorylation site, Ser phosphorylation site, conserved site) → extract motif sequences → fingerprint signature PR00000
Motifs 1, 2, 3 must occur in the correct order with the correct spacing
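
The "correct order, correct spacing" requirement can be illustrated with a short sketch. This is a simplification: it uses exact string matching of the three motifs from the PR00107 example, whereas a real fingerprint search scores each motif against residue frequencies. The function name `match_fingerprint` and the `max_gap` limit are assumptions made for illustration only.

```python
# Sequence and motifs copied from the PR00107 example slides.
SEQUENCE = (
    "MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSL"
    "GVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE"
)
MOTIFS = ["GIHARPATLLVQTASKF", "KGKSVNLKSIMGVMSL", "LGVGQGSDVTITVDGADE"]

def match_fingerprint(sequence, motifs, max_gap=50):
    """Return motif start positions if all motifs occur in order, else None.

    max_gap is a hypothetical limit on the distance between the starts of
    consecutive motifs (the "correct spacing" check).
    """
    starts = []
    search_from = 0
    for motif in motifs:
        pos = sequence.find(motif, search_from)
        if pos == -1:
            return None                      # motif missing or out of order
        if starts and pos - starts[-1] > max_gap:
            return None                      # motifs too far apart
        starts.append(pos)
        search_from = pos + 1                # next motif must start further along
    return starts

starts = match_fingerprint(SEQUENCE, MOTIFS)
print(starts)                                   # 0-based start of each motif
print(match_fingerprint(SEQUENCE, MOTIFS[::-1]))  # reversed order does not match
```

Motifs 2 and 3 share one residue in this sequence, which is why the search for the next motif resumes one position after the previous start and not after the previous end.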

16 Sequence clustering
Automatic clustering of homologous domains
Known domain families → recruit homologous domains (PSI-BLAST) → automatic clustering (MKDOM2) → align domain families (ProDomAlign)
** Rarely covers entire domain (conserved core)
** Signature size can change with release

17 Hidden Markov Models (HMM)
Can characterise protein over entire length
Models conserved and divergent regions (position-specific scoring)
Models insertions and deletions
Outperform in sensitivity and specificity
More flexible (can use partial alignments)

18 Profile
Sequence alignment (Sequences 1 to 7; the aligned residues are not in the transcript) → scoring matrix (residue frequency at each position in alignment) → profile

19 Phe, Tyr and Leu found at position 1 of alignment
Sequences 1 to 7 (the aligned residues are not in the transcript)
Phe most conserved → highest match value

20 Probability method gauges scoring parameters
Sequences 1 to 7 (the aligned residues are not in the transcript)
Tyr and Leu found at equal frequency at position 1
Tyr closer to Phe than Leu
Scores: F > Y > L
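
One way to see why Tyr can outscore Leu at equal frequency is to weight the observed residues by how similar they are to the candidate residue. The sketch below uses an average-score profile for a single column. The column contents (five Phe, one Tyr, one Leu) and the similarity values are assumptions chosen for illustration, since the slide's alignment is not available in the transcript, and this is not necessarily the scoring scheme the presenter had in mind.

```python
from collections import Counter

# Hypothetical alignment column (position 1) for seven sequences.
COLUMN = ["F", "F", "F", "F", "F", "Y", "L"]

# Illustrative residue similarity scores, in the style of a substitution matrix.
SIMILARITY = {
    ("F", "F"): 6, ("F", "Y"): 3, ("F", "L"): 0,
    ("Y", "Y"): 7, ("Y", "L"): -1, ("L", "L"): 4,
}

def similarity(a, b):
    """Symmetric lookup in the similarity table."""
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a)))

def profile_scores(column, alphabet):
    """Score each candidate residue as the frequency-weighted mean similarity
    to the residues observed in the column."""
    freqs = Counter(column)
    total = len(column)
    return {
        res: sum(n * similarity(res, other) for other, n in freqs.items()) / total
        for res in alphabet
    }

scores = profile_scores(COLUMN, "FYL")
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

With these assumed inputs, Phe ranks first because it dominates the column, and Tyr ranks above Leu because it is scored as more similar to Phe, even though Tyr and Leu each occur once.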

21 Hidden Markov Models (HMM)
Sequence alignment → Begin → M1 → M2 → M3 → M4 → End
M = match state

22 Hidden Markov Models (HMM)
Begin → M1 → M2 → M3 → M4 → End, with insert states I2, I3 and delete states D1, D2, D3, D4
M = match state, I = insert state, D = delete state
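
The match, insert, and delete states can be written down as a small transition graph. The sketch below builds the textbook profile-HMM topology for four match states and checks whether a state path is allowed. It models structure only, with no emission or transition probabilities. It also includes insert states at every position (I0 to I4), which is an assumption; the slide draws only I2 and I3.

```python
def build_transitions(n_match):
    """Allowed transitions of a textbook profile HMM with n_match match states."""
    def advance(k):
        # States reachable by moving from column k to column k + 1.
        return {f"M{k + 1}", f"D{k + 1}"} if k < n_match else {"End"}

    trans = {"Begin": advance(0) | {"I0"}}
    for k in range(0, n_match + 1):
        trans[f"I{k}"] = advance(k) | {f"I{k}"}   # insert states can loop
    for k in range(1, n_match + 1):
        trans[f"M{k}"] = advance(k) | {f"I{k}"}
        trans[f"D{k}"] = advance(k) | {f"I{k}"}
    return trans

def is_valid_path(path, trans):
    """True if the path runs from Begin to End using only allowed transitions."""
    return (
        path[0] == "Begin"
        and path[-1] == "End"
        and all(b in trans.get(a, set()) for a, b in zip(path, path[1:]))
    )

trans = build_transitions(4)
plain = ["Begin", "M1", "M2", "M3", "M4", "End"]
gapped = ["Begin", "M1", "D2", "M3", "I3", "I3", "M4", "End"]
skipping = ["Begin", "M2", "M3", "M4", "End"]

print(is_valid_path(plain, trans))
print(is_valid_path(gapped, trans))
print(is_valid_path(skipping, trans))
```

The `gapped` path corresponds to a sequence that lacks the residue for column 2 (delete state D2) and carries two extra residues between columns 3 and 4 (insert state I3 visited twice). The `skipping` path is rejected because a column can only be skipped through its delete state.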

23 SAM Profile HMMs
Homologous structural superfamilies
Start with single seed sequence
Create 1 model for every protein in superfamily → combine results
Few proteins in family have PDB structures
Proteins in superfamily may have low sequence identity

24 Specialisation of Databases
PRINTS: Describe sibling families
PROSITE: Identify binding and active sites
PRODOM: Describe conserved core of domains
PFAM: Wide coverage of domains & families
SMART: Signalling, extracellular & nuclear domains
TIGRFAM: Functional classification of families
PIRSF: Families conserved in domain composition
PANTHER: Functional classification of families
GENE3D: Structural-based domain classification
Superfam: Structural-based domain classification

25 Foundations of InterPro
InterPro:
Integration of signatures
Manual curation

26 InterPro Entry
Groups similar signatures together
Links related signatures
Adds extensive annotation
Linked to other databases
Structural information and viewers

27 Assigning Type
Family: Full-length signatures grouping related proteins
Domain: Biological units with defined boundaries
Repeat: Signature repeated as a series of short motifs
Site: Protein feature described by a Prosite pattern
Region: Any signature that doesn't fit the above

28 Grouping Signatures Together
1) Same positions, same protein hits: PFAM (100) and PROSITE (100) → one entry, IPR000001
2) Same positions, different protein hits: PFAM (100) and PROSITE (50) → separate entries, IPR000001 and IPR000002
3) Different positions, same protein hits: PFAM (100) and PROSITE (100) → separate entries, IPR000001 and IPR000002
4) Different positions: PFAM (100) and PROSITE → separate entries, IPR000001 and IPR000002
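
The four cases reduce to one rule: two signatures share an entry only if they hit the same proteins at the same positions. The sketch below expresses that rule for signatures given as a mapping from protein accession to match coordinates. InterPro integration is done by curators, so the function `same_entry`, the Jaccard measure, and the 0.75 threshold are illustrative assumptions, not the criteria actually applied.

```python
def jaccard(a: set, b: set) -> float:
    """Shared fraction of two protein sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def position_overlap(r1, r2) -> float:
    """Overlap of two inclusive (start, end) ranges as a fraction of their span."""
    (s1, e1), (s2, e2) = r1, r2
    inter = max(0, min(e1, e2) - max(s1, s2) + 1)
    span = max(e1, e2) - min(s1, s2) + 1
    return inter / span

def same_entry(hits_a: dict, hits_b: dict, threshold: float = 0.75) -> bool:
    """hits_*: protein accession -> (start, end) of the signature match.
    threshold is a hypothetical cut-off used for both tests."""
    if jaccard(set(hits_a), set(hits_b)) < threshold:
        return False                       # different protein hits
    shared = set(hits_a) & set(hits_b)
    return all(
        position_overlap(hits_a[p], hits_b[p]) >= threshold for p in shared
    )

# Hypothetical matches for a PFAM signature and three PROSITE candidates.
pfam = {"P1": (10, 110), "P2": (5, 100)}
prosite_same = {"P1": (12, 108), "P2": (8, 100)}
prosite_elsewhere = {"P1": (200, 230), "P2": (150, 180)}
prosite_other_proteins = {"P1": (12, 108), "P9": (8, 100)}

print(same_entry(pfam, prosite_same))            # case 1
print(same_entry(pfam, prosite_other_proteins))  # case 2
print(same_entry(pfam, prosite_elsewhere))       # case 3
```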

29 Link related signatures - relationships
1) Parent - Child (subgroup of more closely related proteins) *
Parent: PFAM Protein kinase (100)
Children: SMART Serine kinase (75); PROSITE Tyrosine kinase (25)
No proteins in common between the SMART and PROSITE children
* Applies to domains and families
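
In set terms, a child signature matches a strict subset of the proteins matched by its parent, and sibling children need not share any proteins. The sketch below assumes the bracketed numbers on the slide are protein counts and uses made-up accessions; `is_child` is a hypothetical helper, not an InterPro function.

```python
def is_child(child_hits: set, parent_hits: set) -> bool:
    """A child matches a non-empty, strict subset of the parent's proteins."""
    return bool(child_hits) and child_hits < parent_hits

# Made-up accessions: 100 protein kinases, split 75 / 25 between two children.
protein_kinase = {f"P{i:03d}" for i in range(100)}    # PFAM parent
serine_kinase = {f"P{i:03d}" for i in range(75)}      # SMART child
tyrosine_kinase = {f"P{i:03d}" for i in range(75, 100)}  # PROSITE child

print(is_child(serine_kinase, protein_kinase))
print(is_child(tyrosine_kinase, protein_kinase))
print(serine_kinase.isdisjoint(tyrosine_kinase))   # no proteins in common
print(is_child(protein_kinase, serine_kinase))     # the parent is not a child
```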

30 Link related signatures - relationships
2) Contains – Found in (describes domain composition)
PFAM Receptor family contains: SMART N-terminal domain, PROSITE C-terminal domain
SMART N-terminal domain and PROSITE C-terminal domain are found in: PFAM Receptor family
Both families and domains can contain domains

31 Link related signatures - relationships
2) Contains – Found in: coverage
Signature must cover the entire (>90%) sequence of contained signature
Covered: PFAM contains SMART; SMART found in PFAM
Otherwise: overlapping
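
The coverage rule can be sketched for two matches on the same protein, each given as an inclusive (start, end) range. The 90% threshold comes from the slide. The function names and the coordinates are assumptions for illustration, and the sketch does not settle the case where both signatures cover each other (it simply reports the first direction tested).

```python
def coverage(container, contained) -> float:
    """Fraction of the contained match that lies inside the container match."""
    (cs, ce), (ds, de) = container, contained
    inter = max(0, min(ce, de) - max(cs, ds) + 1)
    return inter / (de - ds + 1)

def relationship(a, b, threshold: float = 0.9) -> str:
    """Classify two matches using the >90% coverage rule from the slide."""
    if coverage(a, b) > threshold:
        return "a contains b"
    if coverage(b, a) > threshold:
        return "a found in b"
    if coverage(a, b) > 0:
        return "overlapping"
    return "separate"

# Hypothetical coordinates on one protein.
pfam = (1, 300)
smart = (20, 120)

print(relationship(pfam, smart))
print(relationship(smart, pfam))
print(relationship((1, 100), (60, 200)))
print(relationship((1, 50), (80, 120)))
```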

32 Relationships – evolutionary context
Criteria / Signature / InterPro relationship:
Structural family / GENE3D / Grandparent
Sequence families / PFAM / Parents
Functional families / TIGRFAM / Children
Unique to InterPro

33 Extensive Annotation
Annotation Fields in InterPro:
Name and short name
Entry type (family, domain, site)
Relationships (links related signatures)
GO mapping (large scale classification)
Abstract
Taxonomy (search/download using taxonomy)
Examples
Publications

34 Extensive Annotation
Annotation Fields in InterPro (same list as the previous slide)
Taxonomy (search/download using taxonomy): select species-specific protein sets

35 Links to Other Databases
Annotation Fields in InterPro:
Blocks (family alignments)
IntEnz (enzymes)
Prosite documents
COME (bioinorganic motifs)
CAZy (carbohydrate-active enzymes)
IUPHAR (GPCR receptors)
CluSTr (protein clusters)
Pandit (phylogenetic trees of PFAMs)
Merops (peptidases & inhibitors)

36 Structural information
Structures: PDB
Classification: CATH, SCOP
Homology Models: Swiss-Model, ModBase

37 Sequence-Structure Display
Signatures predictive of protein annotation
Structural data for specific proteins
AstexViewer® for structure

38 Structure Viewer
Manipulate structures
Navigate between structure and sequence

39 Other Features – splice variants

40 Other Features – domain architecture
Each 'balloon' represents a linked InterPro domain
Select data set of these proteins

41 Other Features – protein-protein interactions
Lists proteins in entry known to be involved in protein-protein interactions
IntAct database of interactions

42 Protein Sequence Coverage
InterPro signatures cover:
95% of UniProt/Swiss-Prot proteins
79% of UniProt/TrEMBL proteins
>4 million matches in InterPro
>50,000 signature methods
>16,000 InterPro entries

43 Searching InterPro
Search tools include:
Text Search
InterProScan (sequence search)

44 InterPro Text Search
Text search box
Search using: text, protein ID, InterPro ID, GO term
Search results: direct links to entry

45 InterProScan Search
Paste in sequence (protein/nucleotide)
Member database search engines
Use ftp site to run multiple sequences simultaneously

46 InterProScan Search Results
Single InterPro entry
Direct links to entry
Direct links to signature databases

