InterPro Database: Protein Functional Analysis. Jennifer McDowall, Ph.D., Senior InterPro Curator. EBI is an Outstation of the European Molecular Biology Laboratory.
EBI Sequence Databases: nucleotide sequences (e.g. CGCGCCTGTACGC, TGAACGCTCGTGA, CGTGTAGTGCGCG) held in EMBL (with GenBank and DDBJ) are translated into protein sequences in UniProtKB/TrEMBL (>7M), which manual annotation turns into UniProtKB/Swiss-Prot entries (>400,000). InterPro provides protein annotation through protein signatures, which describe groups of related proteins (same family or shared domains).
InterPro Protein Coverage of UniProtKB (~80%): signature matches cover ~370,000 of ~400,000 UniProt/Swiss-Prot proteins and >5.3M of >7M UniProt/TrEMBL proteins. UniMES metagenomic proteins (>6M): available 2009.
What are protein signatures? A signature is built from a multiple sequence alignment and describes the pattern of a set of conserved residues in a group of proteins. A signature can define a protein family or a protein feature (domain or conserved site).
What value are signatures? 1) More sensitive homology searches: find more distant homologues than BLAST. 2) Classification of proteins: associate proteins that share function, domains, sequence or structure. 3) Annotation of protein sequences: define conserved regions of a protein, e.g. the location and type of domains and key structural or functional sites. 4) Transfer of additional (automatic) annotation: associate TrEMBL proteins with well-annotated Swiss-Prot proteins and transfer the annotation.
Signature methods: Pattern, Fingerprint, Sequence clustering, HMM, SAM.
Patterns: a pattern (motif) in the sequence is written as a regular expression. Patterns can define important sites: enzyme catalytic sites, prosthetic group attachment sites, metal ion binding sites, cysteines for disulphide bonds, and protein or molecule binding sites.
EXAMPLE: PS00262 Insulin family signature. Conserved cysteines (C) in the two insulin chains: B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx; A chain xxxxxCCxxxCxxxxxxxxCx. Sequence: MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN, in which the pattern matches CCTSICSLYQLENYC. Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
Patterns, understanding a regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C. Dashes separate the positions. A single letter (e.g. C) is a strictly conserved site; only that amino acid is accepted at the position. Square brackets list the amino acids that can occur at a single position. Curly brackets list the amino acids that cannot occur at a single position. x means any amino acid can occur at the position; x(2) means any amino acid can occur at each of the next two positions.
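The notation rules above map almost one-to-one onto ordinary regular expressions. The following is a minimal sketch, not PROSITE's own scanner: `prosite_to_regex` is a hypothetical helper name, and it only handles the elements that appear on these slides (single residues, `x`, square and curly brackets, and repeat counts). Anchors and other PROSITE syntax are assumed to be out of scope.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = []
    for element in pattern.replace(" ", "").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                         # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # residues that cannot occur
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        regex.append(core)
    return "".join(regex)


# Insulin example from the slides (PS00262).
INSULIN = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLV"
    "EALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPG"
    "AGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)
PS00262 = "C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C"

insulin_regex = prosite_to_regex(PS00262)
hit = re.search(insulin_regex, INSULIN)
print(insulin_regex)
print(hit.group() if hit else "no match")
```

Running the sketch on the insulin sequence finds the same A-chain segment that the slide highlights.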
Patterns, building a signature: sequence alignment; define pattern (insulin family motif); extract pattern sequences; build regular expression (C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C); pattern signature (PS00000).
Fingerprints: several motifs characterise a family, and different combinations of motifs describe subfamilies. Fingerprints identify small conserved regions in divergent proteins.
EXAMPLE: PR00107 Phosphocarrier HPr signature, a 3-motif fingerprint. PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE. Motifs: 1) GIHARPATLLVQTASKF (His phosphorylation site); 2) KGKSVNLKSIMGVMSL (Ser phosphorylation site); 3) LGVGQGSDVTITVDGADE (conserved site).
Fingerprints, building a signature: sequence alignment; define motifs (His phosphorylation site, Ser phosphorylation site, conserved site); extract motif sequences; fingerprint signature (PR00000) made of motifs 1, 2 and 3 in the correct order and with the correct spacing.
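The idea that a fingerprint is a set of motifs that must all be present, in the correct order, can be illustrated with the HPr example. This is a deliberately simplified sketch: it uses exact string matching and checks order only, whereas a real fingerprint search scores motifs against divergent sequences and also considers spacing. The function names are hypothetical.

```python
def locate_motifs(sequence, motifs):
    """Return the 0-based start of each motif, or -1 where a motif is absent."""
    return [sequence.find(motif) for motif in motifs]


def matches_fingerprint(sequence, motifs):
    """True if every motif is found and the start positions strictly increase."""
    starts = locate_motifs(sequence, motifs)
    all_found = all(start >= 0 for start in starts)
    in_order = all(a < b for a, b in zip(starts, starts[1:]))
    return all_found and in_order


# PTHP_ENTFA sequence and the three PR00107 motifs shown on the slides.
HPR = (
    "MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLK"
    "SIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE"
)
HPR_MOTIFS = [
    "GIHARPATLLVQTASKF",   # motif 1: His phosphorylation site
    "KGKSVNLKSIMGVMSL",    # motif 2: Ser phosphorylation site
    "LGVGQGSDVTITVDGADE",  # motif 3: conserved site
]

starts = locate_motifs(HPR, HPR_MOTIFS)
print(starts, matches_fingerprint(HPR, HPR_MOTIFS))
```

Note that motif 3 begins on the last residue of motif 2, so the check compares start positions and does not require motifs to be non-overlapping.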
Sequence clustering: automatic clustering of homologous domains. Start from known domain families, recruit homologous domains (PSI-BLAST), cluster automatically (MKDOM2), then align domain families (ProDomAlign). Note: a signature rarely covers the entire domain (conserved core only), and signature size can change with each release.
Hidden Markov Models (HMM): can characterise a protein over its entire length; model conserved and divergent regions (position-specific scoring); model insertions and deletions; outperform other methods in sensitivity and specificity; are more flexible (can use partial alignments).
Hidden Markov Models (HMM): a sequence alignment (Sequences 1 to 7) gives a scoring matrix (residue frequency at each position in the alignment), i.e. a profile. Scoring is probabilistic (Bayesian statistics).
Hidden Markov Models (HMM): the conserved positions of the alignment become match states (M1 to M10); insert states (I1 to I9) model insertions and delete states (D2 to D9) model deletions. M = match state, I = insert state, D = delete state.
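The scoring matrix step (residue frequency at each position in the alignment) can be sketched as follows. This is an illustration of position-specific scoring only, under stated assumptions: the toy alignment is invented, the background is uniform, a simple pseudocount is used, and there are no insert or delete states, so it is a plain profile and not a full profile HMM.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=1.0):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, segment, background=1.0 / len(AMINO_ACIDS)):
    """Sum of log2(frequency / background) over an ungapped segment."""
    if len(segment) != len(profile):
        raise ValueError("segment length must equal profile length")
    return sum(math.log2(col[aa] / background) for col, aa in zip(profile, segment))


# Hypothetical toy alignment of seven sequences (one row per sequence).
toy_alignment = ["GIHARP", "GLHARP", "GIHSRP", "GVHARP", "GIHAKP", "GIHARA", "GLHARP"]
profile = column_frequencies(toy_alignment)

print(round(log_odds_score(profile, "GIHARP"), 2))
print(round(log_odds_score(profile, "WWWWWW"), 2))
```

A segment that follows the conserved columns scores well above zero, while an unrelated segment scores below zero, which is the behaviour position-specific scoring is meant to capture.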
Hidden Markov Models (HMM), HMM databases. Families conserved in sequence: PIR SUPERFAMILY, PANTHER, TIGRFAM. Domains conserved in sequence: PFAM, SMART. Domains conserved in structure: SUPERFAMILY, GENE3D.
SAM profile HMMs: model homologous structural superfamilies, in which proteins may have low sequence identity and few proteins have PDB structures. Start with a single seed sequence, create one model for every protein in the superfamily, and combine the results.
SAM profile models (T99 script): a single seed sequence (e.g. GIHARPATLLVQTASKF) is used in a WU-BLASTP search; close homologues build the initial model; low identity matches are then added to give a new, larger alignment and the final model.
Signature methods (Pattern, Fingerprint, Sequence clustering, HMM, SAM): describe protein features (active sites, binding sites...); describe families and sibling subfamilies; predict conserved domains.
Signature methods (Pattern, Fingerprint, Sequence clustering, HMM, SAM): functional classification of families; functional domain annotation; structural domain annotation.
Comprehensive annotation: InterPro removes redundancy. Domain annotation: SWIB/MDM2 domain, RanBP2-type zinc finger, RING-type zinc finger. Annotate features: conserved site within zinc finger. Family classification: Mdm2/Mdm4 family (parent), Mdm4 subfamily (child).
Domain Boundaries: Gene3D (and SSF) determine structural domain boundaries; Pfam trims domains to regions of good sequence conservation; ProDom displays the shortest conserved sequence.
Fragmented Signatures: 1) Signature method, e.g. PRINTS, which uses discrete motifs. 2) Duplicated domains, e.g. SSF, a duplication consisting of 2 domains with the same fold. 3) Repeated elements, e.g. Kringle, WD40. 4) Non-contiguous domains: structural domains can consist of non-contiguous sequence.
Complementary Annotation: the sequence-based signature (Pfam, beta-propeller repeat) shows that the domain is made up of repeating sequence elements; the structure-based signature (SSF, 7-blade beta-propeller) shows the boundaries of the structural domain.
Complementary Annotation: PFAM shows that the domain is composed of two types of repeated sequence motifs; SUPERFAMILY shows the potential domain boundaries.
Complementary Annotation: GENE3D shows that these domains share a homologous structure; PFAM and SMART show 2 domains from distinct sequence families.
Searching InterPro: InterProScan sequence search
Searching InterPro: search tools include Text Search and InterProScan (sequence search).
InterPro Text Search: use the text search box to search by text, protein ID, InterPro ID or GO term. Search results give direct links to the entry.
InterProScan Search: paste in a sequence (protein or nucleotide) to run the member database search engines. Use the ftp site to run multiple sequences simultaneously.
InterProScan Search Results: each single InterPro entry has direct links to the entry and direct links to the signature databases.
EXERCISE 1
Exploring InterPro entries
InterPro Entry: groups similar signatures together; adds extensive annotation; is linked to other databases; provides structural information and viewers; links related signatures.
Grouping Signatures Together (figure comparing a PFAM and a PROSITE signature, protein hit counts in brackets, and the resulting IPR entries): 1) same positions, same protein hits (100); 2) same positions, different protein hits (100 and 50); 3) different positions, same protein hits (100); 4) different positions (100).
Extensive Annotation, annotation fields in InterPro: name and short name (short names appear in UniProt entries); list of signatures (links to member databases); entry type (family, domain, site); relationships (links related signatures); GO mapping (large-scale classification); abstract; taxonomy (search/download using taxonomy); examples; publications.
Extensive Annotation, entry types. Family: full-length signatures grouping related proteins. Domain: biological units with defined boundaries. Repeat: signature repeated as a series of short motifs. Site: protein feature described by a Prosite pattern. Region: any signature that doesn’t fit the above.
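As an aid to reading an entry page, the annotation fields listed above can be pictured as a simple record. This is only an illustrative sketch: the class and attribute names are hypothetical and do not reflect InterPro's actual data model or file formats, and the accessions in the example are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntryType(Enum):
    """Entry types named on the slides."""
    FAMILY = "Family"
    DOMAIN = "Domain"
    REPEAT = "Repeat"
    SITE = "Site"
    REGION = "Region"


@dataclass
class InterProEntry:
    """Hypothetical record mirroring the annotation fields of an entry."""
    accession: str
    name: str
    short_name: str
    entry_type: EntryType
    signatures: list = field(default_factory=list)    # member database signatures
    parents: list = field(default_factory=list)       # relationships
    children: list = field(default_factory=list)
    go_terms: list = field(default_factory=list)      # GO mapping
    abstract: str = ""
    publications: list = field(default_factory=list)


# Placeholder accessions, in the style of the PS00000 / PR00000 slides.
example_entry = InterProEntry(
    accession="IPR000000",
    name="Example domain",
    short_name="Example_dom",
    entry_type=EntryType.DOMAIN,
    signatures=["PF00000", "PS00000"],
)
print(example_entry.accession, example_entry.entry_type.value)
```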
Links to Other Databases, additional annotation from: Blocks (family alignments); IntEnz (enzymes); Prosite documents; COME (bioinorganic motifs); CAZy (carbohydrate-active enzymes); IUPHAR (GPCR receptors); CluSTr (protein clusters); Pandit (phylogenetic trees of PFAMs); Merops (peptidases & inhibitors).
Links to Structural Databases: SCOP and CATH (structural classification of proteins) give links to structural classification; PDB (protein structure databank) gives the list of proteins with structural data, with links to the PDB database of structures.
Links to Interaction Databases: IntAct (protein-protein interactions) lists the proteins in the entry known to be involved in protein-protein interactions, with links to the IntAct database of interactions.
EXERCISE 2
Exploring InterPro relationships
InterPro Relationships. Parent/Child: hierarchical subdivision into more closely related groups. Contains/Found in: domain/subdomain composition. Overlapping: remaining relationships.
Link related signatures - relationships 1) Parent - Child (subgroup of more closely related proteins): the PFAM Protein kinase signature (100 proteins) is the parent (IPR000001); the SMART Serine kinase signature (75) and the PROSITE Tyrosine kinase signature (25) are its children (IPR000002, IPR000003). The two children have no proteins in common.
Relationships, evolutionary context (InterPro relationship, signature, criteria): Grandparent, GENE3D, structural family; Parents, PFAM, sequence families; Children, TIGRFAM, functional families. Unique to InterPro.
Example hierarchy (figure). InterPro entries shown: Protein kinase-like; PI 3/4 kinase; Protein kinase; Tyr kinase; Ser/Thr kinase-rel; TNK1 kin; ATMRK kin; APH kinase; ABC-1; EF2 kinase; Actin-fragmin kin; CHK kinase; Ser/Thr kin; GCN2; Hrmn Rcpt; Activin Rcpt; TGFb2 Rcpt; BMPRII; MAPK3 kin; IL1 kin; ERK3 MAPK; PSKH kin; Ca-dep kin4; Ca-dep kin1; Leu zip kin; Plant kin; MAPKKK4; MAPKKK3; MAPKKK1; Pak kin; Myosin kin; JNK kin; RIO-like kin; RIO kin; Choline kinase; ERK1 kin; Hydroxyurea kin; MethylTR kin; Thiamine kin; Lipopoly syn; DUF; Put kinase.
Parent/child, evolutionary context: the different entries are not redundant. The hierarchy runs from the superfamily classification down to the most specific subfamily classification.
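The parent/child idea (a child is a subgroup of the proteins matched by its parent, and sibling children share no proteins) can be expressed with sets. This is a sketch of the principle shown in the protein kinase example, not InterPro's curation procedure; the function names and protein identifiers are invented for illustration.

```python
def is_child_of(child_hits, parent_hits):
    """A child matches a proper subset of the proteins matched by its parent."""
    return child_hits < parent_hits


def are_siblings(hits_a, hits_b, parent_hits):
    """Two children of the same parent that have no proteins in common."""
    return (
        is_child_of(hits_a, parent_hits)
        and is_child_of(hits_b, parent_hits)
        and hits_a.isdisjoint(hits_b)
    )


# Invented protein identifiers reproducing the 100 / 75 / 25 example.
protein_kinase = {f"P{i:03d}" for i in range(100)}        # parent, 100 proteins
serine_kinase = {f"P{i:03d}" for i in range(75)}          # child, 75 proteins
tyrosine_kinase = {f"P{i:03d}" for i in range(75, 100)}   # child, 25 proteins

print(is_child_of(serine_kinase, protein_kinase))
print(are_siblings(serine_kinase, tyrosine_kinase, protein_kinase))
```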
Link related signatures - relationships 2) Contains / Found in (describes domain composition): the PFAM Receptor family signature contains the SMART N-terminal domain and the PROSITE C-terminal domain. The family entry (Pfam) "contains" the domains; the domain entries (Smart and Prosite) are "found in" the family.
Link related signatures - relationships 2) Contains / Found in, coverage: the signature must cover the entire (>90%) sequence of the contained signature.
Link related signatures - relationships 3) Overlapping: all remaining relationships, e.g. a PROSITE and a SMART signature that overlap.
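The >90% coverage rule for Contains / Found in can be sketched on match coordinates. The functions below are hypothetical and work on a single pair of match positions in one protein; how InterPro aggregates this over many proteins is not stated on the slides, so that part is left out.

```python
def covered_fraction(outer, inner):
    """Fraction of `inner` lying within `outer`; both are 1-based inclusive (start, end)."""
    outer_start, outer_end = outer
    inner_start, inner_end = inner
    overlap = min(outer_end, inner_end) - max(outer_start, inner_start) + 1
    return max(overlap, 0) / (inner_end - inner_start + 1)


def relationship(a, b, threshold=0.9):
    """Classify match `a` relative to match `b` using the >90% coverage rule."""
    if covered_fraction(a, b) > threshold:
        return "contains"
    if covered_fraction(b, a) > threshold:
        return "found in"
    if covered_fraction(a, b) > 0:
        return "overlapping"
    return "none"


# Hypothetical coordinates: a family signature spanning residues 1-300
# and a domain signature spanning residues 10-100.
print(relationship((1, 300), (10, 100)))
print(relationship((10, 100), (1, 300)))
print(relationship((1, 100), (51, 150)))
```

Two matches that cover each other almost completely would both pass the first test; in the slides' terms such signatures would normally be grouped into the same entry, so that case is not treated separately here.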
EXERCISE 3
Exploring InterPro taxonomy
InterPro taxonomy: select species-specific protein sets.
EXERCISE 4
Exploring protein structure in InterPro
Structural information. Structures: PDB. Classification: CATH, SCOP. Homology models: Swiss-Model, ModBase.
Structural information: CATH and SCOP divide PDB structures into domains; Swiss-Model and ModBase predict structure for regions not covered by PDB. Note that one domain is discontiguous.
Sequence-Structure Display: signatures are predictive of protein annotation; structural data are for specific proteins; AstexViewer® displays the structure.
Structure Viewer: navigate between structure and sequence; manipulate structures.
EXERCISE 5
Exploring splice variants in InterPro
Other Features: splice variants.
EXERCISE 6
Exploring InterPro Domain Architecture
Other Features: domain architecture. Each ‘balloon’ represents a linked InterPro domain. Select the data set of these proteins.
EXERCISE 7
Protein Sequence Coverage: InterPro signatures cover 95% of UniProt/Swiss-Prot proteins and 79% of UniProt/TrEMBL proteins. InterPro holds >5 million matches, ~17,000 InterPro entries and >57,500 signature methods.
Acknowledgements. InterPro Team (team leader: Sarah Hunter): David Lonsdale, Louise Daugherty, Jennifer McDowall, Craig McAnulla, David Binns, Ujjwal Das, Anthony Quinn, John Maslen, Manjula Thimma, Phil Jones. InterPro Consortium.