Proteins to Proteomes The InterPro Database


Origins of InterPro: raw sequence data flows into UniProt. Swiss-Prot holds about 290K annotated entries, while TrEMBL holds about 5M entries of unknown function (???); InterPro supplies automated annotation for them.

Curated Annotation in InterPro: multiple signatures define groups of related proteins (same family or shared domains). Annotation common to the annotated Swiss-Prot sequences in a group is fed back to the uncharacterised TrEMBL sequences in that group.

Finding Conserved Signatures: pattern (simplest, limited), fingerprint, sequence clustering, HMM (more information).

Patterns: a pattern/motif in a sequence is written as a regular expression. Patterns can define important sites: enzyme catalytic site; prosthetic group attachment; metal ion binding site; cysteines for disulphide bonds; protein or molecule binding.

EXAMPLE: PS00262 Insulin family signature
B chain xxxxxxCxxxxxxxxxxxxCxxxxxxxxx
A chain xxxxxCCxxxCxxxxxxxxCx
Sequence: MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
Matched region: CCTSICSLYQLENYC
Regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
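The PS00262 pattern above can be checked against the slide's sequence with an ordinary regular expression engine. The following is a minimal sketch, not PROSITE's own scanner: the helper name `prosite_to_regex` is hypothetical, and it only handles the syntax that appears on the slide (single residues, `x`, `[..]`, `{..}` and repeat counts).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Supports single residues, x (any residue), [..] (allowed residues),
    {..} (forbidden residues) and (n) or (n,m) repeat counts. The terminal
    anchors '<' and '>' are not handled in this sketch.
    """
    element_re = re.compile(
        r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
    )
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = element_re.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

# Pattern and sequence copied from the slide.
INSULIN_PATTERN = "C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C"
SEQUENCE = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQ"
    "VGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)

regex = prosite_to_regex(INSULIN_PATTERN)
match = re.search(regex, SEQUENCE)
print(regex)
print(match.group() if match else "no match")
```

The match falls on the A-chain region that the slide highlights, which is where the cysteines of the pattern sit.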

Patterns (workflow): sequence alignment → define pattern (insulin family motif) → extract pattern sequences → build regular expression (C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C) → pattern signature (PS00000).

Fingerprints: several motifs together characterise a family. They identify small conserved regions in divergent proteins, and different combinations of motifs describe subfamilies.

EXAMPLE: PR00107 Phosphocarrier HPr signature
PTHP_ENTFA: MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
3-motif fingerprint:
1) GIHARPATLLVQTASKF (His phosphorylation site)
2) KGKSVNLKSIMGVMSL (Ser phosphorylation site)
3) LGVGQGSDVTITVDGADE (conserved site)
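The idea that a fingerprint needs all of its motifs, in order, can be illustrated on the PR00107 example. This is a deliberately simplified sketch: it uses exact substring matches and ignores spacing, whereas a real fingerprint scan scores motifs more tolerantly. The function name `fingerprint_hits` is hypothetical.

```python
def fingerprint_hits(sequence, motifs):
    """Return the 0-based start of each motif, or None unless every motif
    is present and the motifs occur in the order given."""
    starts = []
    for motif in motifs:
        pos = sequence.find(motif)
        if pos == -1:
            return None          # a motif is missing
        starts.append(pos)
    if any(a >= b for a, b in zip(starts, starts[1:])):
        return None              # motifs are out of order
    return starts

# Sequence and motifs copied from the slide.
PTHP_ENTFA = (
    "MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGVMSL"
    "GVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE"
)
MOTIFS = ["GIHARPATLLVQTASKF", "KGKSVNLKSIMGVMSL", "LGVGQGSDVTITVDGADE"]

starts = fingerprint_hits(PTHP_ENTFA, MOTIFS)
print(starts)
```

Motifs 2 and 3 share one residue in this sequence, so the order check compares start positions rather than requiring the motifs to be disjoint.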

Fingerprints (workflow): sequence alignment → define motifs (1, 2, 3) → extract motif sequences → fingerprint signature (PR00000). A match requires the motifs in the correct order with the correct spacing.

Sequence clustering: automatic clustering of homologous domains. Known domain families → recruit homologous domains (PSI-BLAST) → automatic clustering (MKDOM2) → align domain families (ProDomAlign).
**Rarely covers entire domain (conserved core)
**Signature size can change with release

Hidden Markov Models (HMM): can characterise a protein over its entire length; model conserved and divergent regions (position-specific scoring); model insertions and deletions; outperform in sensitivity and specificity; more flexible (can use partial alignments).

Profile: sequence alignment (Sequences 1 to 7) → scoring matrix (residue frequency at each position in alignment).

Phe, Tyr and Leu are found at position 1 of the alignment. Phe is the most conserved, so it gets the highest match value.

Tyr and Leu are found at equal frequency at position 1, but Tyr is closer to Phe than Leu is. Scores: F > Y > L. The probability method gauges the scoring parameters.
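The "residue frequency at each position" step can be sketched as follows. The seven-sequence alignment is made up for illustration (the slide's own alignment is not in the transcript); it is only arranged so that position 1 holds Phe, Tyr and Leu with Phe most frequent.

```python
from collections import Counter

# Hypothetical alignment, NOT the one shown on the slide.
ALIGNMENT = ["FKL", "FRL", "FKI", "FKL", "FRV", "YKL", "LKL"]

def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: c / n for res, c in counts.items()})
    return profile

profile = column_frequencies(ALIGNMENT)
for position, freqs in enumerate(profile, start=1):
    print(position, sorted(freqs.items(), key=lambda kv: -kv[1]))
```

Raw frequencies alone leave Tyr and Leu tied at position 1. The ordering F > Y > L on the slide comes from the scoring method also taking residue similarity into account, which this sketch does not attempt.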

Hidden Markov Models (HMM): sequence alignment → Begin → M1 → M2 → M3 → M4 → End (M = match state).

Hidden Markov Models (HMM): Begin → M1 → M2 → M3 → M4 → End, with insert states I2, I3 and delete states D1, D2, D3, D4 (M = match state, I = insert state, D = delete state).
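To make the match/insert/delete layout concrete, the sketch below lists which state can follow which in a simplified profile HMM. The topology and the state numbering are assumptions chosen for illustration and need not match the slide's diagram or any particular HMM package; no probabilities are modelled.

```python
def profile_hmm_transitions(n_match):
    """Allowed transitions for a simplified profile HMM with n_match
    match states. Insert state Ik sits between Mk and Mk+1."""
    transitions = {"Begin": ["M1", "D1"]}
    for k in range(1, n_match + 1):
        last = k == n_match
        next_match = "End" if last else f"M{k + 1}"
        match_targets = [next_match]
        delete_targets = [next_match]
        if not last:
            match_targets += [f"I{k}", f"D{k + 1}"]
            delete_targets += [f"D{k + 1}"]
            # An insert state can loop on itself or return to the next match.
            transitions[f"I{k}"] = [f"I{k}", f"M{k + 1}"]
        transitions[f"M{k}"] = match_targets
        transitions[f"D{k}"] = delete_targets
    return transitions

hmm = profile_hmm_transitions(4)
for state, targets in hmm.items():
    print(state, "->", ", ".join(targets))
```

A sequence that follows only match states reproduces the alignment columns; detours through I or D states account for insertions and deletions relative to the model.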

SAM Profile HMMs: model homologous structural superfamilies. Start with a single seed sequence; create 1 model for every protein in the superfamily → combine results. Few proteins in a family have PDB structures, and proteins in a superfamily may have low sequence identity.

Specialisation of Databases
PRINTS: describe sibling families
PROSITE: identify binding and active sites
PRODOM: describe conserved core of domains
PFAM: wide coverage of domains & families
SMART: signalling, extracellular & nuclear domains
TIGRFAM: functional classification of families
PIRSF: families conserved in domain composition
PANTHER: functional classification of families
GENE3D: structural-based domain classification
Superfam: structural-based domain classification

Foundations of InterPro: integration of signatures; manual curation.

InterPro Entry: groups similar signatures together; links related signatures; adds extensive annotation; linked to other databases; structural information and viewers.

Assigning Type
Family: full-length signatures grouping related proteins
Domain: biological units with defined boundaries
Repeat: signature repeated as a series of short motifs
Site: protein feature described by a Prosite pattern
Region: any signature that doesn’t fit the above

Grouping Signatures Together
1) PFAM and PROSITE (100): same positions, same protein hits → one entry, IPR000001
2) PFAM (100) and PROSITE (50): same positions, different protein hits → IPR000001 and IPR000002
3) PROSITE and PFAM (100): different positions, same protein hits → IPR000001 and IPR000002
4) PFAM and PROSITE (100): different positions → IPR000001 and IPR000002
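Cases 1) to 4) amount to a rule that two signatures share an entry only when they hit the same proteins at the same positions. A rough sketch of that rule is below. The data layout, the function name `same_entry` and the 50% overlap threshold used to decide "same position" are all assumptions made for illustration; InterPro integration is done by curators, not by this test.

```python
def same_entry(matches_a, matches_b, min_overlap=0.5):
    """matches_* map protein ID -> (start, end), inclusive coordinates.

    Returns True only if both signatures hit exactly the same proteins
    and their matches overlap sufficiently on every one of them.
    """
    if set(matches_a) != set(matches_b):
        return False                      # different protein hits
    for protein, (a_start, a_end) in matches_a.items():
        b_start, b_end = matches_b[protein]
        shared = min(a_end, b_end) - max(a_start, b_start) + 1
        shorter = min(a_end - a_start, b_end - b_start) + 1
        if shared / shorter < min_overlap:
            return False                  # different positions
    return True

# Hypothetical matches for illustration.
pfam = {"P1": (10, 120), "P2": (5, 110)}
prosite_same = {"P1": (15, 118), "P2": (8, 105)}
prosite_elsewhere = {"P1": (200, 260), "P2": (180, 240)}
prosite_fewer = {"P1": (15, 118)}

print(same_entry(pfam, prosite_same))       # case 1
print(same_entry(pfam, prosite_fewer))      # case 2
print(same_entry(pfam, prosite_elsewhere))  # case 3
```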

Link related signatures - relationships
1) Parent - Child (subgroup of more closely related proteins). Example: PFAM Protein kinase (100) is the parent; SMART Serine kinase (75) and PROSITE Tyrosine kinase (25) are the children, with no proteins in common. Applies to domains and families.

Link related signatures - relationships
2) Contains – Found in (describes domain composition). Example: the PFAM Receptor family contains the SMART N-terminal domain and the PROSITE C-terminal domain; each domain is found in the Pfam family. Both families and domains can contain domains.

Link related signatures - relationships
2) Contains – Found in: coverage. The containing signature must cover the entire (>90%) sequence of the contained signature. Diagram labels: Contains (PFAM), Found in (SMART), Overlapping.
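The >90% coverage criterion can be written as a small interval calculation. This is an illustrative sketch with hypothetical function names and made-up coordinates; it compares a single pair of matches on one protein, which is a simplification of how the criterion would be applied across many proteins.

```python
def coverage(container, contained):
    """Fraction of the contained match (start, end; inclusive) that lies
    inside the container match."""
    c_start, c_end = container
    s_start, s_end = contained
    shared = min(c_end, s_end) - max(c_start, s_start) + 1
    return max(shared, 0) / (s_end - s_start + 1)

def contains(container, contained, threshold=0.9):
    """True if the container covers more than `threshold` of the contained match."""
    return coverage(container, contained) > threshold

# Hypothetical coordinates for illustration.
family_match = (1, 300)
domain_match = (50, 150)
overlapping_match = (250, 350)

print(contains(family_match, domain_match))        # fully covered
print(contains(family_match, overlapping_match))   # only partly covered
```

A signature that merely overlaps another, as in the second call, falls short of the threshold and so would not qualify for a Contains / Found in link under this rule.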

Relationships – evolutionary context (criteria for signature InterPro relationship; unique to InterPro): structural family as grandparent (GENE3D); sequence families as parents (PFAM); functional families as children (TIGRFAM).

Extensive Annotation: annotation fields in InterPro
Name and short name
Entry type (family, domain, site)
Relationships (links related signatures)
GO mapping (large scale classification)
Abstract
Taxonomy (search/download using taxonomy; select species-specific protein sets)
Examples
Publications

Links to Other Databases (annotation fields in InterPro)
Blocks (family alignments)
IntEnz (enzymes)
Prosite documents
COMe (bioinorganic motifs)
CAZy (carbohydrate-active enzymes)
IUPHAR (GPCR receptors)
CluSTr (protein clusters)
Pandit (phylogenetic trees of PFAMs)
Merops (peptidases & inhibitors)

Structural information
Structures: PDB
Classification: CATH, SCOP
Homology models: Swiss-Model, ModBase

Sequence-Structure Display: signatures predictive of protein annotation; structural data for specific proteins; AstexViewer® for structure.

Structure Viewer: manipulate structures; navigate between structure and sequence.

Other Features – splice variants

Other Features – domain architecture: each ‘balloon’ represents a linked InterPro domain. Select data set of these proteins.

Other Features – protein-protein interactions: lists proteins in the entry known to be involved in protein-protein interactions (IntAct database of interactions).

Protein Sequence Coverage
InterPro signatures cover:
95% of UniProt/Swiss-Prot proteins
79% of UniProt/TrEMBL proteins
>4 million matches in InterPro
>50,000 signature methods
>16,000 InterPro entries

Searching InterPro: search tools include text search and InterProScan (sequence search). http://www.ebi.ac.uk/interpro/

InterPro Text Search: text search box. Search using text, protein ID, InterPro ID or GO term. Search results give direct links to each entry.

InterProScan Search: paste in sequence (protein/nucleotide); member database search engines; use the ftp site to run multiple sequences simultaneously.

InterProScan Search Results: single InterPro entry; direct links to entry; direct links to signature databases.