Proteins to Proteomes: The InterPro Database
Origins of InterPro
- UniProt: Swiss-Prot (290K, annotated) and TrEMBL (5M, raw data)
- InterPro: automated annotation
Curated Annotation in InterPro
- Annotated sequences (Swiss-Prot) and uncharacterised sequences (TrEMBL) are matched by multiple signatures
- InterPro groups related proteins (same family or shared domains)
- Common annotation is fed back to the uncharacterised TrEMBL sequences
Finding Conserved Signatures
(ordered from simplest, and most limited, to most information)
- Pattern
- Fingerprint
- Sequence clustering
- HMM
Patterns
- Pattern/motif in sequence, written as a regular expression
- Can define important sites:
  - Enzyme catalytic site
  - Prosthetic group attachment
  - Metal ion binding site
  - Cysteines for disulphide bonds
  - Protein or molecule binding
- EXAMPLE: Insulin (PS00262 Insulin family signature)
  B chain: xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
  A chain: xxxxxCCxxxCxxxxxxxxCx
  Sequence: MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
  Highlighted region: CCTSICSLYQLENYC
  Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
Patterns: building a signature
1) Sequence alignment
2) Define pattern (insulin family motif)
3) Extract pattern sequences
4) Build regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
5) Pattern signature (PS00000)
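
The "build regular expression" step can be sketched in Python. This is a minimal illustration, not the PROSITE scanning software: the helper name `prosite_to_regex` is hypothetical, and it only handles the syntax elements that appear on the slides (single residues, `x`, `[...]`, `{...}` and `(n)` or `(n,m)` repeats). The sequence is the one shown on the insulin slide.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            token = "."                      # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"  # any residue except these
        else:
            token = core                     # literal residue or [allowed set]
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


PATTERN = "C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C"
INSULIN_SEQUENCE = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQ"
    "VGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)

regex = prosite_to_regex(PATTERN)
match = re.search(regex, INSULIN_SEQUENCE)
print(regex)
print(match.group() if match else "no match")
```

On the slide sequence the only hit is the A-chain region CCTSICSLYQLENYC, which is the region highlighted on the slide.
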
Fingerprints
- Several motifs characterise a family
- Identify small conserved regions in divergent proteins
- Different combinations of motifs describe subfamilies
- EXAMPLE: PR00107 Phosphocarrier HPr signature
  PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
- Annotated sites: His phosphorylation site, Ser phosphorylation site, conserved site
- 3-motif fingerprint:
  1) GIHARPATLLVQTASKF
  2) KGKSVNLKSIMGVMSL
  3) LGVGQGSDVTITVDGADE
Fingerprints: building a signature
1) Sequence alignment
2) Define motifs (His phosphorylation site, Ser phosphorylation site, conserved site)
3) Extract motif sequences
4) Fingerprint signature (PR00000): motifs 1, 2 and 3 in the correct order with the correct spacing
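
The "correct order, correct spacing" idea can be sketched as follows. This is a deliberate simplification: the motifs are matched here as exact strings, whereas a real fingerprint search scores each motif more tolerantly. The function names are hypothetical; the sequence and motifs are those from the PR00107 example.

```python
def locate_motifs(sequence, motifs):
    """Return the start index of each motif in the sequence (-1 if absent)."""
    return [sequence.find(motif) for motif in motifs]


def matches_fingerprint(sequence, motifs):
    """True if every motif is present and the motifs occur in the listed order."""
    starts = locate_motifs(sequence, motifs)
    all_present = all(start >= 0 for start in starts)
    in_order = all(a < b for a, b in zip(starts, starts[1:]))
    return all_present and in_order


PTHP_ENTFA = (
    "MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSL"
    "GVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE"
)
MOTIFS = ["GIHARPATLLVQTASKF", "KGKSVNLKSIMGVMSL", "LGVGQGSDVTITVDGADE"]

starts = locate_motifs(PTHP_ENTFA, MOTIFS)
# Distance between consecutive motif starts; a fuller check would compare
# these against the spacing seen in the family alignment.
gaps = [b - a for a, b in zip(starts, starts[1:])]
print(starts, gaps, matches_fingerprint(PTHP_ENTFA, MOTIFS))
```

Reversing the motif list makes the order check fail, which is how the same motifs in a different arrangement would be rejected.
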
Sequence clustering
- Automatic clustering of homologous domains
- Rarely covers the entire domain (conserved core)
- Signature size can change with each release
- Workflow:
  1) Known domain families
  2) Recruit homologous domains (PSI-BLAST)
  3) Automatic clustering (MKDOM2)
  4) Align domain families (ProDomAlign)
Hidden Markov Models (HMM)
- Can characterise a protein over its entire length
- Model conserved and divergent regions (position-specific scoring)
- Model insertions and deletions
- Outperform simpler signature methods in sensitivity and specificity
- More flexible (can use partial alignments)
Profile
- Built from a sequence alignment (Sequences 1 to 7; alignment figure not reproduced)
- Scoring matrix: residue frequency at each position in the alignment
- Phe, Tyr and Leu are found at position 1 of the alignment
- Phe is the most conserved, so it gets the highest match value
- Tyr and Leu are found at equal frequency at position 1, but Tyr is closer to Phe than Leu is
- Scores: F > Y > L
- A probability method gauges the scoring parameters
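
One way to obtain the F > Y > L ranking is to weight a substitution score by the residue frequencies in the column. The sketch below does that for position 1 only. It is an assumption about what such a scoring scheme could look like, not the method used by any particular member database: the column composition is hypothetical (the slide's alignment is not reproduced), and the pairwise scores are the BLOSUM62 values for F, Y and L.

```python
from collections import Counter

# Pairwise substitution scores for F, Y and L (BLOSUM62 values).
SUBSTITUTION = {
    ("F", "F"): 6, ("F", "Y"): 3, ("F", "L"): 0,
    ("Y", "Y"): 7, ("Y", "L"): -1, ("L", "L"): 4,
}


def pair_score(a, b):
    """Symmetric lookup in the substitution table."""
    return SUBSTITUTION.get((a, b), SUBSTITUTION.get((b, a)))


def column_scores(column, alphabet):
    """Score each candidate residue as the frequency-weighted average of its
    substitution scores against the residues observed in the column."""
    counts = Counter(column)
    total = len(column)
    return {
        residue: sum(counts[seen] / total * pair_score(residue, seen) for seen in counts)
        for residue in alphabet
    }


# Hypothetical position 1 of a 7-sequence alignment:
# Phe dominates, Tyr and Leu occur once each.
COLUMN = "FFFFFYL"
scores = column_scores(COLUMN, "FYL")
for residue, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(residue, round(score, 2))
```

Tyr and Leu have the same frequency in this column, but Tyr scores higher because it substitutes for the dominant Phe more readily than Leu does.
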
Hidden Markov Models (HMM)
- Built from a sequence alignment
- The model runs from Begin to End through a chain of states:
  - M = match state (M1, M2, M3, M4)
  - I = insert state (I2, I3)
  - D = delete state (D1, D2, D3, D4)
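
The state layout can be sketched by listing which transitions are allowed. The function below generates the fully connected textbook topology for a profile HMM, which is an assumption: the slide's diagram shows only some of these states (for example, only I2 and I3), and real packages restrict or extend the topology in their own ways. The function name is hypothetical.

```python
def profile_hmm_transitions(n_match):
    """Allowed transitions in a textbook profile HMM with n_match match states."""
    transitions = {}
    for k in range(0, n_match + 1):
        # Position 0 is Begin; it has an insert state I0 but no delete state.
        sources = ["Begin", "I0"] if k == 0 else [f"M{k}", f"I{k}", f"D{k}"]
        if k < n_match:
            # Move on to the next match, stay in or enter the insert state,
            # or skip the next position via its delete state.
            targets = [f"M{k + 1}", f"I{k}", f"D{k + 1}"]
        else:
            targets = ["End", f"I{k}"]
        for source in sources:
            transitions[source] = list(targets)
    return transitions


transitions = profile_hmm_transitions(4)
for state, targets in transitions.items():
    print(state, "->", ", ".join(targets))
```
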
SAM Profile HMMs
- Homologous structural superfamilies
- Start with a single seed sequence
- Create one model for every protein in the superfamily, then combine the results
- Few proteins in a family have PDB structures
- Proteins in a superfamily may have low sequence identity
Specialisation of Databases
- PRINTS: Describe sibling families
- PROSITE: Identify binding and active sites
- PRODOM: Describe conserved core of domains
- PFAM: Wide coverage of domains & families
- SMART: Signalling, extracellular & nuclear domains
- TIGRFAM: Functional classification of families
- PIRSF: Families conserved in domain composition
- PANTHER: Functional classification of families
- GENE3D: Structural-based domain classification
- Superfam: Structural-based domain classification
Foundations of InterPro
- Integration of signatures
- Manual curation
InterPro Entry
- Groups similar signatures together
- Links related signatures
- Adds extensive annotation
- Linked to other databases
- Structural information and viewers
Assigning Type
- Family: Full-length signatures grouping related proteins
- Domain: Biological units with defined boundaries
- Repeat: Signature repeated as a series of short motifs
- Site: Protein feature described by a Prosite pattern
- Region: Any signature that doesn’t fit the above
Grouping Signatures Together
1) Same positions, same protein hits: PFAM and PROSITE (100) are grouped in one entry, IPR000001
2) Same positions, different protein hits: PFAM (100) and PROSITE (50) go into separate entries, IPR000001 and IPR000002
3) Different positions, same protein hits: PFAM and PROSITE (100) go into separate entries, IPR000001 and IPR000002
4) Different positions: PFAM and PROSITE (100) go into separate entries, IPR000001 and IPR000002
Link related signatures: relationships
1) Parent - Child (subgroup of more closely related proteins)
- Applies to domains and families
- Parent: Protein kinase (PFAM, 100)
- Children: Serine kinase (SMART, 75) and Tyrosine kinase (PROSITE, 25)
- The children have no proteins in common
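
The set logic behind the parent-child example can be sketched as below. It covers only the protein-hit comparison implied by the slide and does not reproduce InterPro's manual curation process. The protein identifiers and the function name are hypothetical; the counts (100, 75, 25) follow the slide.

```python
def is_child_of(child_hits, parent_hits):
    """A child signature matches a proper subset of the parent's proteins."""
    return child_hits < parent_hits


# Hypothetical protein identifiers standing in for real accessions.
protein_kinase = {f"P{i:03d}" for i in range(100)}        # parent, 100 proteins
serine_kinase = {f"P{i:03d}" for i in range(75)}          # child, 75 proteins
tyrosine_kinase = {f"P{i:03d}" for i in range(75, 100)}   # child, 25 proteins

print(is_child_of(serine_kinase, protein_kinase))
print(is_child_of(tyrosine_kinase, protein_kinase))
print(serine_kinase.isdisjoint(tyrosine_kinase))  # siblings share no proteins
```
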
Link related signatures: relationships
2) Contains - Found in (describes domain composition)
- Both families and domains can contain domains
- Receptor family (PFAM) contains the N-terminal domain (SMART) and the C-terminal domain (PROSITE)
- The N-terminal domain (SMART) and the C-terminal domain (PROSITE) are found in the Receptor family (PFAM)
Link related signatures: relationships
2) Contains - Found in
- Coverage: a signature must cover the entire (>90%) sequence of the signature it contains
- Diagram cases (PFAM and SMART): Contains / Found in, and Overlapping
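
The coverage criterion can be sketched as an interval check. This is an illustration under stated assumptions: match positions are taken to be 1-based and inclusive, the 0.9 threshold comes from the ">90%" on the slide, and the function names and example coordinates are hypothetical.

```python
def coverage(outer, inner):
    """Fraction of the inner signature's residues covered by the outer signature.

    Each signature match is a (start, end) pair, 1-based and inclusive.
    """
    outer_start, outer_end = outer
    inner_start, inner_end = inner
    overlap = min(outer_end, inner_end) - max(outer_start, inner_start) + 1
    return max(overlap, 0) / (inner_end - inner_start + 1)


def contains(outer, inner, threshold=0.9):
    """True if the outer signature covers more than `threshold` of the inner one."""
    return coverage(outer, inner) > threshold


pfam_match = (10, 300)        # hypothetical PFAM match on a protein
smart_inside = (50, 150)      # lies entirely within the PFAM match
smart_overlap = (250, 400)    # only partly overlaps the PFAM match

print(contains(pfam_match, smart_inside))
print(contains(pfam_match, smart_overlap))
```

The first case qualifies as Contains / Found in; the second is merely overlapping, because only about a third of the SMART match is covered.
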
Criteria for Signature InterPro Relationship
Relationships: evolutionary context
- Grandparent: GENE3D (structural family)
- Parents: PFAM (sequence families)
- Children: TIGRFAM (functional families)
- Unique to InterPro
Extensive Annotation
Annotation Fields in InterPro
- Name and short name
- Entry type (family, domain, site)
- Relationships (links related signatures)
- GO mapping (large scale classification)
- Abstract
- Taxonomy (search/download using taxonomy): select species-specific protein sets
- Examples
- Publications
Links to Other Databases
Annotation Fields in InterPro
- Blocks (family alignments)
- IntEnz (enzymes)
- Prosite documents
- COME (bioinorganic motifs)
- CAZy (carbohydrate-active enzymes)
- IUPHAR (GPCR receptors)
- CluSTr (protein clusters)
- Pandit (phylogenetic trees of PFAMs)
- Merops (peptidases & inhibitors)
Structural information
- Structures: PDB
- Classification: CATH, SCOP
- Homology Models: Swiss-Model, ModBase
Sequence-Structure Display
- Signatures predictive of protein annotation
- Structural data for specific proteins
- AstexViewer® for structure
Structure Viewer
- Manipulate structures
- Navigate between structure and sequence
Other Features – splice variants
Other Features – domain architecture
- Each ‘balloon’ represents a linked InterPro domain
- Select data set of these proteins
Other Features – protein-protein interactions
- Lists proteins in the entry known to be involved in protein-protein interactions
- IntAct database of interactions
Protein Sequence Coverage
InterPro signatures cover:
- 95% of UniProt/Swiss-Prot proteins
- 79% of UniProt/TrEMBL proteins
- >4 million matches in InterPro
- >50,000 signature methods
- >16,000 InterPro entries
Searching InterPro
Search tools include:
- Text Search
- InterProScan (sequence search)
http://www.ebi.ac.uk/interpro/
InterPro Text Search
- Text search box: search using text, protein ID, InterPro ID or GO term
- Search results: direct links to entry
InterProScan Search
- Paste in sequence (protein/nucleotide)
- Member database search engines
- Use the ftp site to run multiple sequences simultaneously
InterProScan Search Results
- Single InterPro entry
- Direct links to entry
- Direct links to signature databases