EBI is an Outstation of the European Molecular Biology Laboratory. InterPro Database Protein Functional Analysis Jennifer McDowall, Ph.D. Senior InterPro.

Presentation transcript:

EBI is an Outstation of the European Molecular Biology Laboratory. InterPro Database Protein Functional Analysis Jennifer McDowall, Ph.D. Senior InterPro Curator

EBI Sequence Databases
EMBL (with GenBank, DDBJ): nucleotide sequence, e.g. CGCGCCTGTACGC TGAACGCTCGTGA CGTGTAGTGCGCG
translate → UniProtKB/TrEMBL: protein sequence (>7M)
manual annotation → UniProtKB/Swiss-Prot (>400,000)
InterPro: protein signatures and protein annotation; groups of related proteins (same family or share domains)

UniProtKB: signature matches in InterPro (~80% protein coverage)
UniProt/Swiss-Prot proteins: ~400,000, of which ~370,000 in InterPro
UniProt/TrEMBL proteins: >7M, of which >5.3M in InterPro
UniMES metagenomic proteins: >6M (available 2009)

What are protein signatures?
A signature describes the pattern of a set of conserved residues in a group of proteins (taken from a multiple sequence alignment)
→ Define a protein family
→ Define a protein feature (domain or conserved site)

What value are signatures?
More sensitive homology searches: find more distant homologues than BLAST
Classification of proteins: associate proteins that share function, domains, sequence or structure
Annotation of protein sequences: define conserved regions of a protein, e.g. location and type of domains, key structural or functional sites
Transfer of additional (automatic) annotation: associate TrEMBL proteins with well-annotated Swiss-Prot proteins

Signature methods: Pattern, Fingerprint, Sequence clustering, HMM, SAM

Patterns
Pattern/motif in sequence → regular expression
Can define important sites: enzyme catalytic site, prosthetic group attachment, metal ion binding site, cysteines for disulphide bonds, protein or molecule binding
EXAMPLE: PS00262 Insulin family signature
B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx
Sequence: MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLV EALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPG AGSLQPLALEGSLQKRGIVEQ CCTSICSLYQLENYC N
Matched region: CCTSICSLYQLENYC
Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Patterns – understanding a regular expression
C - C - {P} - x(2) - C - [STDNEKPI] - x(3) - [LIVMFS] - x(3) - C
C: strictly conserved site; only one amino acid is accepted at this position
{P}: curly brackets denote amino acids that cannot occur at this position
x: any amino acid can occur at this position
x(2): any amino acid can occur at each of the next two positions
[STDNEKPI]: square brackets denote the set of amino acids that can occur at this position
There are dashes between each position
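The reading rules above map almost directly onto ordinary regular-expression syntax. The following is a minimal sketch, not PROSITE's own scanning software: the helper name `prosite_to_regex` is hypothetical, and it only handles the element types that appear in this pattern (single residues, `x`, `[...]`, `{...}` and repeat counts). The sequence is the preproinsulin sequence shown on the slides.

```python
import re

# One pattern element: a residue spec plus an optional repeat such as (2) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression (hypothetical helper)."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element.strip())
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        spec, low, high = m.groups()
        if spec == "x":
            core = "."                      # any amino acid
        elif spec.startswith("{"):
            core = "[^" + spec[1:-1] + "]"  # any amino acid except those listed
        else:
            core = spec                     # single residue or [allowed set]
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


INSULIN_PATTERN = "C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C"
PREPROINSULIN = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLV"
    "EALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPG"
    "AGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)

regex = prosite_to_regex(INSULIN_PATTERN)
match = re.search(regex, PREPROINSULIN)
print(regex)                         # CC[^P].{2}C[STDNEKPI].{3}[LIVMFS].{3}C
print(match.group(), match.start())  # CCTSICSLYQLENYC 94 (0-based start)
```

The match is the same CCTSICSLYQLENYC region highlighted in the insulin example.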

Patterns
Sequence alignment → define pattern (insulin family motif) → extract pattern sequences → build regular expression → pattern signature (PS 00000)
C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Fingerprints
Several motifs → characterise family
Different combinations of motifs describe subfamilies
Identify small conserved regions in divergent proteins
EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLK SIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
3-motif fingerprint:
1) GIHARPATLLVQTASKF (His phosphorylation site)
2) KGKSVNLKSIMGVMSL (Ser phosphorylation site)
3) LGVGQGSDVTITVDGADE (Conserved site)

Fingerprints
Sequence alignment → define motifs (His phosphorylation site, Ser phosphorylation site, conserved site) → extract motif sequences → fingerprint signature (motifs 1, 2, 3; PR 00000)
Motifs must occur in the correct order with the correct spacing
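The "correct order" requirement can be illustrated with a deliberately simplified sketch. It assumes exact string matching of the three motifs listed for PTHP_ENTFA, which is a simplification of how real fingerprint methods score motifs, and it does not check spacing. The function name `find_fingerprint` is hypothetical.

```python
def find_fingerprint(sequence, motifs):
    """Return the 0-based start of each motif if all occur in the given order, else None."""
    positions = []
    search_from = 0
    for motif in motifs:
        pos = sequence.find(motif, search_from)
        if pos == -1:
            return None              # motif missing, or only present out of order
        positions.append(pos)
        search_from = pos + 1        # the next motif must start further along
    return positions


PTHP_ENTFA = (
    "MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLK"
    "SIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE"
)
HPR_MOTIFS = [
    "GIHARPATLLVQTASKF",    # motif 1: His phosphorylation site
    "KGKSVNLKSIMGVMSL",     # motif 2: Ser phosphorylation site
    "LGVGQGSDVTITVDGADE",   # motif 3: conserved site
]

print(find_fingerprint(PTHP_ENTFA, HPR_MOTIFS))        # [12, 37, 52]
print(find_fingerprint(PTHP_ENTFA, HPR_MOTIFS[::-1]))  # None: wrong order
```

Motifs 2 and 3 share one residue in this sequence, which is why the sketch only requires each motif to start after the previous one starts.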

Sequence clustering
Automatic clustering of homologous domains
Known domain families → recruit homologous domains (PSI-BLAST) → automatic clustering (MKDOM2) → align domain families (ProDomAlign)
Rarely covers entire domain (conserved core)
Signature size can change with release

Hidden Markov Models (HMM)
Can characterise protein over entire length
Models conserved and divergent regions (position-specific scoring)
Models insertions and deletions
→ Outperform in sensitivity and specificity
→ More flexible (can use partial alignments)

Hidden Markov Models (HMM)
Sequence alignment (Sequences 1 to 7) → scoring matrix (residue frequency at each position in alignment) → profile
Bayesian statistics, probability scoring
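The "residue frequency at each position" idea can be sketched as a position-specific scoring matrix. This is a simplified illustration rather than a profile HMM: it has no insert or delete states, it assumes an ungapped alignment, a uniform background and a pseudocount of 1, and the four aligned sequences are made up for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores (log2) against a uniform background; ungapped input assumed."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the column scores for a segment of the same length as the profile."""
    if len(segment) != len(profile):
        raise ValueError("segment length must equal profile length")
    return sum(column[aa] for column, aa in zip(profile, segment))


# Hypothetical toy alignment (not taken from any database).
toy_alignment = ["GIHARP", "GLHARP", "GIHSRP", "GVHARP"]
profile = build_profile(toy_alignment)
print(score(profile, "GIHARP") > score(profile, "WWWWWW"))  # True
```

Residues seen often in a column score above zero and unseen residues score below zero, so a sequence resembling the alignment accumulates a higher total.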

Hidden Markov Models (HMM)
Sequence alignment (Sequences 1 to 7)
M = match state: M1, M2, M3, M4, M5, M6, M7, M8, M9, M10
I = insert state: I1, I2, I3, I4, I5, I6, I7, I8, I9
D = delete state: D2, D3, D4, D5, D6, D7, D8, D9

Hidden Markov Models (HMM)
HMM databases:
Families conserved in sequence: PIR SUPERFAMILY, PANTHER, TIGRFAM
Domains conserved in sequence: PFAM, SMART
Domains conserved in structure: SUPERFAMILY, GENE3D

SAM Profile HMMs
Homologous structural superfamilies
Start with single seed sequence
Proteins in superfamily may have low sequence identity
Few proteins in family have PDB structures
Create 1 model for every protein in superfamily → combine results

SAM Profile models
T99 script: single seed sequence (GIHARPATLLVQTASKF) → WU-BLASTP search (close homologues, low identity matches) → initial model → new larger alignment → final model

Signature Methods: Pattern, Fingerprint, Sequence clustering, HMM, SAM
Describe protein features: active sites, binding sites…
Describe families and sibling subfamilies
Predict conserved domains

Signature Methods Pattern Fingerprint Sequence clustering HMM SAM Functional classification of families Functional domain annotation Structural domain annotation

Comprehensive annotation InterPro removes redundancy SWIB/MDM2 domain RanBP2-type zinc finger RING-type zinc finger Domain annotation

Comprehensive annotation Conserved site within zinc finger Annotate features

Comprehensive annotation Mdm2/Mdm4 family Mdm4 subfamily Parent Child Family classification

Domain Boundaries Gene3D (and SSF) determines domain structural boundaries Pfam trims domains to regions of good sequence conservation ProDom displays shortest conserved sequence

Fragmented Signatures
1) Signature method: e.g. PRINTS – discrete motifs
2) Duplicated domains: e.g. SSF – duplication consisting of 2 domains with same fold
3) Repeated elements: e.g. Kringle, WD40
4) Non-contiguous domains: structural domains can consist of non-contiguous sequence

Complementary Annotation
→ Sequence-based signature (Pfam) shows that the domain is made up of repeating sequence elements: beta-propeller repeat
→ Structure-based signature (SSF) shows boundaries of structural domain: 7-blade beta-propeller

Complementary Annotation PFAM shows domain is composed of two types of repeated sequence motifs SUPERFAMILY shows the potential domain boundaries

Complementary Annotation GENE3D shows that these domains share homologous structure PFAM/SMART show 2 domains from distinct sequence families

Searching InterPro: InterProScan sequence search

Searching InterPro Search tools include: Text Search InterProScan (sequence search)

InterPro Text Search Text search box Search using: text protein ID InterPro ID GO term Search results Direct links to entry

InterProScan Search Use ftp site to run multiple sequences simultaneously Member database search engines Paste in sequence (protein/nucleotide)

InterProScan Search Results single InterPro entry Direct links to entry Direct links to signature databases

EXERCISE 1

Exploring InterPro entries

InterPro Entry Groups similar signatures together Adds extensive annotation Linked to other databases Structural information and viewers Links related signatures

Grouping Signatures Together (PFAM and PROSITE signatures assigned to IPR entries)
1) Same positions, same protein hits: PFAM, PROSITE (100)
2) Same positions, different protein hits: PFAM (100), PROSITE (50)
3) Different positions, same protein hits: PFAM, PROSITE (100)
4) Different positions: PFAM, PROSITE (100)

Extensive Annotation
Annotation Fields in InterPro:
Name and short name (short names appear in UniProt entries)
List of signatures (links to member databases)
Entry type (family, domain, site)
Relationships (links related signatures)
GO mapping (→ large scale classification)
Abstract
Taxonomy (search/download using taxonomy)
Examples
Publications

Entry types:
Family: full-length signatures grouping related proteins
Domain: biological units with defined boundaries
Repeat: signature repeated as a series of short motifs
Site: protein feature described by a Prosite pattern
Region: any signature that doesn't fit the above


Links to Other Databases
Additional annotation from databases:
Blocks (family alignments)
IntEnz (enzymes)
Prosite documents
COME (bioinorganic motifs)
CAZy (carbohydrate-active enzymes)
IUPHAR (GPCR receptors)
CluSTr (protein clusters)
Pandit (phylogenetic trees of PFAMs)
Merops (peptidases & inhibitors)

Links to Structural Databases
SCOP (structural classification of proteins): links to structural classification
CATH (structural classification of proteins): links to structural classification
PDB (protein structure databank): list of proteins with structural data; PDB database of structures

Links to Interaction Databases IntAct (protein-protein interactions) Lists proteins in entry known to be involved in protein-protein interactions IntAct database of interactions

EXERCISE 2

Exploring InterPro relationships

InterPro Relationships
Parent/Child: hierarchical subdivision into more closely related groups
Contains/Found in: domain/subdomain composition
Overlapping: remaining relationships

Link related signatures - relationships
1) Parent - Child (subgroup of more closely related proteins)
Parent: PFAM Protein kinase (100), IPR000001
Children: SMART Serine kinase (75), IPR000002; PROSITE Tyrosine kinase (25), IPR000003
* The children have no proteins in common
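The parent/child example can be expressed with sets of matched proteins. This is only an illustrative sketch: it assumes a child is a signature whose hits form a proper subset of the parent's hits, which is a simplification of the curated relationship, and the protein identifiers and function name are invented for the example.

```python
def is_parent_child(parent_hits, child_hits):
    """True if every protein hit by the child is also hit by the parent, which hits more."""
    return child_hits < parent_hits   # proper-subset test on Python sets


# Hypothetical protein identifiers mirroring the 100 / 75 / 25 split in the example.
protein_kinase = {f"P{i:03d}" for i in range(1, 101)}   # PFAM, parent
serine_kinase = {f"P{i:03d}" for i in range(1, 76)}     # SMART, child
tyrosine_kinase = {f"P{i:03d}" for i in range(76, 101)} # PROSITE, child

print(is_parent_child(protein_kinase, serine_kinase))    # True
print(is_parent_child(protein_kinase, tyrosine_kinase))  # True
print(serine_kinase.isdisjoint(tyrosine_kinase))         # True: no proteins in common
```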

Relationships – evolutionary context
InterPro relationship: criteria for signature
Grandparent: GENE3D, structural family
Parents: PFAM, sequence families
Children: TIGRFAM, functional families
Unique to InterPro

Example hierarchy: IPR Protein kinase-like; IPR PI 3/4 kinase; IPR Protein kinase; IPR Tyr kinase; IPR Ser/Thr kinase-rel; IPR TNK1 kin; IPR ATMRK kin; IPR APH kinase; IPR ABC-1; IPR EF2 kinase; IPR Actin-fragmin kin; IPR CHK kinase; IPR Ser/Thr kin; IPR GCN2; IPR Hrmn Rcpt; IPR Activin Rcpt; IPR TGFb2 Rcpt; IPR BMPRII; IPR MAPK3 kin; IPR IL1 kin; IPR ERK3 MAPK; IPR PSKH kin; IPR Ca-dep kin4; IPR Ca-dep kin1; IPR Leu zip kin; IPR Plant kin; IPR MAPKKK4; IPR MAPKKK3; IPR MAPKKK1; IPR Pak kin; IPR Myosin kin; IPR JNK kin; IPR RIO-like kin; IPR RIO kin; IPR Choline kinase; IPR ERK1 kin; IPR Hydroxyurea kin; IPR MethylTR kin; IPR Thiamine kin; IPR Lipopoly syn; IPR DUF; IPR Put kinase

Parent/child – evolutionary context
Different entries → not redundant

Most specific subfamily classification Superfamily classification Parent/child – evolutionary context

Link related signatures - relationships
2) Contains – Found in (describes domain composition)
Contains: PFAM Receptor family contains SMART N-terminal domain and PROSITE C-terminal domain
Found in: SMART N-terminal domain and PROSITE C-terminal domain are found in PFAM Receptor family

Link related signatures - relationships
2) Contains – Found in
Coverage: the containing signature must cover the entire (>90%) sequence of the contained signature
Example: PFAM contains SMART; SMART is found in PFAM
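The >90% coverage rule can be written down as a small interval calculation. This is a sketch under stated assumptions: each signature match is treated as a single 1-based, inclusive (start, end) interval on one protein, and the function names and coordinates are hypothetical.

```python
def coverage(container, contained):
    """Fraction of the contained interval that lies inside the container interval."""
    c_start, c_end = container
    d_start, d_end = contained
    overlap = min(c_end, d_end) - max(c_start, d_start) + 1
    return max(overlap, 0) / (d_end - d_start + 1)


def contains(container, contained, threshold=0.9):
    """Apply the 'must cover >90% of the contained signature' rule to one protein."""
    return coverage(container, contained) > threshold


receptor_family = (1, 300)     # hypothetical family-level match
n_terminal_domain = (20, 120)  # hypothetical domain match inside it

print(contains(receptor_family, n_terminal_domain))  # True: fully covered
print(contains((50, 300), n_terminal_domain))        # False: only 71 of 101 residues covered
```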

3) Overlapping Link related signatures - relationships All remaining relationships PROSITE SMART Overlapping

EXERCISE 3

Exploring InterPro taxonomy

InterPro taxonomy
Select species-specific protein sets

EXERCISE 4

Exploring protein structure in InterPro

Structural information
Structures: PDB
Classification: CATH, SCOP
Homology models: Swiss-Model, ModBase

Structural information CATH and SCOP divide PDB structures into domains Swiss-Model and ModBase predict structure for regions not covered by PDB Note that one domain is discontiguous

Sequence-Structure Display Signatures predictive of protein annotation Structural data for specific proteins AstexViewer® for structure

Structure Viewer Navigate between structure and sequence Manipulate structures

EXERCISE 5

Exploring splice variants in InterPro

Other Features – splice variants Splice variants

EXERCISE 6

Exploring InterPro Domain Architecture

Other Features – domain architecture Select data set of these proteins Each ‘balloon’ represents a linked InterPro domain

EXERCISE 7

Protein Sequence Coverage
InterPro signatures cover:
95% of UniProt/Swiss-Prot proteins
79% of UniProt/TrEMBL proteins
>5 million matches in InterPro
~17,000 InterPro entries
>57,500 signature methods

Acknowledgements
InterPro Team: Sarah Hunter (team leader), David Lonsdale, Louise Daugherty, Jennifer McDowall, Craig McAnulla, David Binns, Ujjwal Das, Anthony Quinn, John Maslen, Manjula Thimma, Phil Jones
InterPro Consortium