Sandra Orchard EMBL-EBI

Presentation transcript:

Sandra Orchard EMBL-EBI This presentation introduces the protein database UniProt. I’ll start with an introduction to UniProt, giving a bit of background and describing what we’re trying to achieve. Then I’ll go through the various sections of a UniProtKB/Swiss-Prot entry to show you what kind of biological information we capture and present. I’ll explain our Automatic Annotation systems, which are the way we copy data from well-studied proteins to those which haven’t been studied but which we predict to have a similar function. [Drop this for patent talks - I’ll talk a little bit about proteomics: what a proteome is and how to access/download a proteome from UniProt.] Then, to finish off, I’ll go through the functionality of the uniprot.org website.

Background of UniProt Since 2002 a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD. Prior to 2002 there were three separate protein databases located around the globe, each with its own data and rules for annotation. In 2002 these databases merged to provide a unified source with common annotation standards and datasets. UniProt is funded mainly by…. Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database

We Aim To Provide… A high quality protein sequence database A non-redundant protein database, with maximal coverage including splice isoforms, disease variants and PTMs. Sequence archiving essential. Easy protein identification Stable identifiers and consistent nomenclature/controlled vocabularies Thorough protein annotation Detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources

The Two Sides of UniProtKB UniProtKB/TrEMBL 1 entry per nucleotide submission UniProtKB/Swiss-Prot 1 entry per protein A very important aspect of UniProtKB is that there are two sides to the database: UniProt entries which have been looked at by a Curator and those which haven’t. Those which haven’t been manually reviewed by a Curator reside in TrEMBL. These are automatic translations of ENA, the European Nucleotide Archive (formerly known as EMBL). The European Nucleotide Archive is structured such that each sequence is owned by the person who submitted it, so sequences cannot be merged. This is why, when sequences are translated and incorporated into UniProt, there can be redundancy, by which I mean more than one UniProt entry per protein. TrEMBL entries are also unreviewed, so they have no annotation added from the literature, only minimal annotation from computational algorithms. By contrast, once a Curator has reviewed an entry and added information from all relevant literature, the entry moves to the Swiss-Prot side of the UniProt database. This side is non-redundant, as all entries for a protein are merged. Indicators of which part of UniProt an entry belongs to include the colour of the stars and the ID.... Redundant, automatically annotated - unreviewed Non-redundant, high-quality manual annotation - reviewed

Curation of a UniProt/SwissProt entry Sequence UniProt/TrEMBL References Sequence variants Literature Annotations Nomenclature Ontologies Sequence features We’ll look at this in more detail after the break, but briefly, UniProt is the gold-standard resource for information on proteins. Every entry initially receives automatic annotation, so it’s not just a bare sequence, and there’s a team of Curators who undertake manual curation using the literature and sequence analysis. We also use in-house bioinformatics tools for protein classification and domain prediction. Data from other databases is imported and cross-referenced. UniProt comprises three different databases, but I haven’t shown all three here for the sake of simplicity. UniProtKB is the central database of protein sequences with accurate, consistent, and rich sequence and functional annotation. It comprises the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section. The UniProt archive is an archive of all the protein sequences in the public domain, and the UniRef databases are a series of three databases that store sequences of 100%, 90% and 50% identity in the same records to speed up searching without losing information. UniProtKB contains more than 29 million cross references to over 100 other data resources; a few key ones are shown here. UniProt/SwissProt 12.11.2018

Searching UniProt – Simple Search Text-based searching Logical operators ‘&’ (and), ‘|’ (or)

Searching UniProt – Advanced Search

Searching UniProt – Search Results Each linked to the UniProt entry

Searching UniProt – Search Results

Searching UniProt – Search Results

Searching UniProt – Blast Search

As on slide Just go through data types: > Entry Name – if the entry has been reviewed then the first part should represent the gene name. The second part denotes the organism; usually it consists of the first 3 letters of the genus name and the first 2 letters of the species name, so for the fruit fly Drosophila melanogaster all entries will be *****_DROME. Exceptions are HUMAN, MOUSE and RAT. > Accession – an entry can have more than one accession if it is a Swiss-Prot entry. > Entry history – (pretty self-explanatory) the Complete history takes you to a page where any two versions from each UniProt release can be selected and compared. > Entry status – whether it’s Reviewed (UniProtKB/Swiss-Prot) or Unreviewed (UniProtKB/TrEMBL). > Annotation project – usually species specific. > Disclaimer – only required for human entries, in case users try to self-medicate based on the information we provide.
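
The entry-name convention above can be sketched in a few lines of Python. This is only an illustration: the function names and the exceptions table are my own and are not part of any UniProt tool, and the full organism names behind HUMAN, MOUSE and RAT are an assumption about which species the slide means. The genus/species rule only "usually" holds, so treat the result as a guess, not a lookup.

```python
from typing import Tuple

# Exceptions named on the slide. The organism names are an assumption
# about which species are meant; the table is illustrative only.
SPECIES_EXCEPTIONS = {
    ("Homo", "sapiens"): "HUMAN",
    ("Mus", "musculus"): "MOUSE",
    ("Rattus", "norvegicus"): "RAT",
}


def species_code(genus: str, species: str) -> str:
    """Usual rule: first 3 letters of the genus + first 2 of the species."""
    exception = SPECIES_EXCEPTIONS.get((genus, species))
    if exception is not None:
        return exception
    return (genus[:3] + species[:2]).upper()


def split_entry_name(entry_name: str) -> Tuple[str, str]:
    """Split an entry name such as 'XXXXX_DROME' at the first underscore."""
    protein_part, separator, species_part = entry_name.partition("_")
    if not separator or not protein_part or not species_part:
        raise ValueError(f"not a PROTEIN_SPECIES entry name: {entry_name!r}")
    return protein_part, species_part


print(species_code("Drosophila", "melanogaster"))  # DROME
print(split_entry_name("P53_HUMAN"))               # ('P53', 'HUMAN')
```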

Protein names – There’s always a recommended name, and if the protein is an enzyme the Enzyme Commission number is given, which is a numerical classification scheme for enzymes based on the chemical reactions they catalyze. Alternative names found in the literature are also added to aid finding the entry from a search. Gene names – usually tries to represent the protein name, but this is not always possible. Organism – with link to Complete proteome. Tax ID – Numerical UniProt taxon number and link to NCBI taxonomy browser. Taxonomic lineage.

Annotation comments FUNCTION PTM SUBCELLULAR LOCATION RNA EDITING ALTERNATIVE PRODUCTS MASS SPECTROMETRY TISSUE SPECIFICITY DOMAIN DEVELOPMENTAL STAGE POLYMORPHISM INDUCTION DISRUPTION PHENOTYPE SIMILARITY ALLERGEN CATALYTIC ACTIVITY DISEASE COFACTOR TOXIC DOSE ENZYME REGULATION BIOTECHNOLOGY BIOPHYSICOCHEMICAL PROPERTIES PHARMACEUTICAL MISCELLANEOUS PATHWAY CAUTION SUBUNIT SEQUENCE CAUTION INTERACTION WEB RESOURCE There’s a wide range of topics so that we can capture as much information as possible from the literature; not every topic is present in every entry. The green ones are very general, the blue ones are used a lot for enzymes and the pink ones are relevant to proteins involved in pathology/medicine.

Evidence tags to show source Controlled vocabularies used whenever possible It’s in this section that all the information obtained from the literature is summarized and organized via specific fields. Where possible we try to be consistent between proteins with the same function; this is aided by the use of controlled vocabularies. Evidence tags are listed at the end of the section to show the source of the information.

As far as I am aware UniProt is unique in annotating features to areas of amino acid sequence. As shown above we show regions of biological interest, such as those required for an interaction or involved in subcellular localization. We also show smaller sites, such as single amino acids that are modified post-translationally.

Automatic Annotation for UniProtKB/TrEMBL Automatic annotation is a method of copying annotation to similar proteins, so those proteins that have not been studied gain some information.

UniProtKB/Swiss-Prot Manually annotated UniProtKB/TrEMBL Computationally annotated These graphs illustrate the need for automatic annotation. The red dot puts the Swiss-Prot figure in the context of the TrEMBL graph. The message here is that there are a lot more entries in TrEMBL than in Swiss-Prot, so we need to find a way to transfer annotation from Swiss-Prot entries to those that haven’t been reviewed in TrEMBL. Web pages for up-to-date stats: UniProt/TrEMBL - http://www.ebi.ac.uk/swissprot/sptr_stats/index.html UniProt/SwissProt - http://www.expasy.org/sprot/relnotes/relstat.html Last updated 21/02/2012

InterPro

Automatic Annotation UniProtKB employs two prediction programs, which are referred to as UniRule and SAAS. SAAS, the Statistical Automatic Annotation System, generates a new set of decision trees with every UniProtKB release using data mining. UniRule maintains a set of manually established and maintained annotation rules. Automatic annotation is produced by two methods. One method, named SAAS (Statistical Automatic Annotation System), just uses a computer program to generate rules. The other method is called UniRule and is a collection of rules created by scientists to propagate a specific set of data based on defined criteria. Both of these systems use Swiss-Prot and InterPro as training sets. Swiss-Prot InterPro
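
To make the idea of a manually written annotation rule concrete, here is a toy Python sketch of "if an unreviewed entry meets the rule's criteria, copy the rule's annotation onto it". Everything in it is hypothetical: the `Rule` and `Entry` classes, the signature identifiers `SIG_A` and `SIG_B`, the accessions, and the choice to match on signatures alone are assumptions for illustration and do not reflect the real UniRule format or its conditions.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set


@dataclass
class Rule:
    """Hypothetical rule: criteria (signatures) plus annotation to propagate."""
    required_signatures: FrozenSet[str]
    annotations: Dict[str, str]


@dataclass
class Entry:
    """Hypothetical protein entry with its signature matches."""
    accession: str
    signatures: Set[str]
    reviewed: bool
    annotations: Dict[str, str] = field(default_factory=dict)


def apply_rules(entry: Entry, rules: List[Rule]) -> Entry:
    # Assumption of this sketch: reviewed (manually curated) entries are
    # left untouched; only unreviewed entries receive propagated annotation.
    if entry.reviewed:
        return entry
    for rule in rules:
        # The rule fires only when every required signature is present.
        if rule.required_signatures <= entry.signatures:
            for topic, text in rule.annotations.items():
                # Never overwrite annotation the entry already has.
                entry.annotations.setdefault(topic, text)
    return entry


rules = [Rule(frozenset({"SIG_A"}), {"FUNCTION": "Example function text"})]
unreviewed = Entry("X00001", {"SIG_A", "SIG_B"}, reviewed=False)
print(apply_rules(unreviewed, rules).annotations)
```

The design point is that the rule only copies annotation when its criteria are met, which mirrors the description above of propagating a specific set of data based on defined criteria.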

Help / Feedback help@uniprot.org Stuck? Just ask – active help and support team Feedback – if you find something incorrect, outdated, missing etc please tell us. help@uniprot.org

Introduction to Protein Signatures & InterPro Introduction to InterPro

Protein Signatures Protein Signature = an amino acid sequence (not necessarily consecutive) associated with a protein characteristic. Basically introduce the concept of protein signatures

What value are signatures? Find more distant homologues than BLAST. Better at finding proteins with common function. Classification of proteins: associate proteins that share function, domains, sequence or structure. Annotation of protein sequences: define conserved regions of a protein, e.g. location and type of domains, key structural or functional sites.

How are protein signatures made? Protein family/domain -> Multiple sequence alignment -> Build model -> Protein signature -> Search -> Significant matches -> Refine. Example alignment: ITWKGPVCGLDGKTYRNECALL, AVPRSPVCGSDDVTYANECELK, SVPRSPVCGSDGVTYGTECDLK, HPPPGPVCGTDGLTYDNRCELR. Example significant matches: E-value 1e-49, E-value 3e-42, E-value 5e-39, E-value 6e-10.
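
As a very rough illustration of the "alignment to model" step, the Python sketch below takes the four aligned sequences from the slide, counts the residues in each column and reports which columns are fully conserved. This is not how any member database actually builds its signatures; real models (patterns, profiles, HMMs) are far richer. The function names are my own.

```python
from collections import Counter
from typing import List

# The four aligned sequences shown on the slide (all the same length).
ALIGNMENT = [
    "ITWKGPVCGLDGKTYRNECALL",
    "AVPRSPVCGSDDVTYANECELK",
    "SVPRSPVCGSDGVTYGTECDLK",
    "HPPPGPVCGTDGLTYDNRCELR",
]


def column_counts(alignment: List[str]) -> List[Counter]:
    """Count how often each residue occurs in every alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*alignment)]


def conserved_positions(alignment: List[str]) -> List[int]:
    """0-based indices of columns where every sequence has the same residue."""
    return [i for i, counts in enumerate(column_counts(alignment))
            if len(counts) == 1]


def conserved_sketch(alignment: List[str]) -> str:
    """Conserved residues kept, every other column shown as 'x'."""
    return "".join(next(iter(counts)) if len(counts) == 1 else "x"
                   for counts in column_counts(alignment))


print(conserved_positions(ALIGNMENT))  # [5, 6, 7, 8, 10, 13, 14, 18, 20]
print(conserved_sketch(ALIGNMENT))     # xxxxxPVCGxDxxTYxxxCxLx
```

The conserved columns (the PVCG run, the two cysteines and so on) are the kind of positions a signature is built around; the variable columns are where a real model would record which residues are tolerated.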

Types of Protein signatures (sequence based) Multiple protein alignment

Types of Protein signatures (sequence based) Single motif methods: Regular expression patterns, e.g. C - C - {P} - x(2) - C - [STDNEKPI] - C, where a plain letter = must be this AA, x = any AA, ( ) = number of AAs, { } = cannot be any of these, [ ] = any of these.
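
The pattern notation on this slide maps almost directly onto an ordinary regular expression. The Python sketch below converts the example pattern and tries it on two made-up test sequences. It is a minimal converter written for this example only: the function name is my own, and it handles just the elements shown on the slide (plain residues, x, repeat counts in parentheses, { } and [ ]), not the full pattern syntax.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a pattern such as 'C - {P} - x(2) - [ST]' to a Python regex."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        repeat = ""
        # Optional repeat count, e.g. x(2) or x(2,4) -> {2} or {2,4}.
        if element.endswith(")"):
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                           # any amino acid
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"    # none of these residues
        else:
            core = element                       # literal residue or [..] set
        parts.append(core + repeat)
    return "".join(parts)


regex = prosite_to_regex("C - C - {P} - x(2) - C - [STDNEKPI] - C")
print(regex)  # CC[^P].{2}C[STDNEKPI]C

# Made-up test sequences: the second has P in the forbidden position.
print(bool(re.search(regex, "AACCAGGCSCAA")))  # True
print(bool(re.search(regex, "AACCPGGCSCAA")))  # False
```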

Types of Protein signatures (sequence based) Single motif methods: Regular expression patterns. Multiple motif methods: Identity matrices, Fingerprints.

Types of Protein signatures (sequence based) Single motif methods: Regular expression patterns. Multiple motif methods: Identity matrices, Fingerprints. Full domain alignment methods: Profiles (Profile Library); Hidden Markov Models, a mathematical model of amino acid probability (diagram states M1 M2 M3 M4, I1 I2 I3, D2 D3).

CONTRIBUTING MEMBER DATABASES Models built on either sequence or structural alignments Each MDB has its own focus Hidden Markov Models Fingerprints Profiles Patterns Sequence Clusters InterPro uses signatures from several different databases (referred to as member databases) to predict information about proteins, such as possible function and the potential location of functionally important sites and domains. Each member database creates signatures in different ways: some groups build them from manually-created sequence alignments, some use automatic processes with some human input and correction and others build their signatures entirely automatically. The signatures are represented using a variety of different model types (HMMs, Profiles, Regular Expressions, etc.). The member databases all have their own particular niche or focus; at InterPro we aim to be a combination of their individual strengths. To do this we integrate signatures from the member databases that represent the same protein family, domain or site into a single InterPro entry. We check the biological accuracy of the individual signatures and add concise information about the signatures and the types of proteins they match, including consistent names, descriptive abstracts (with links to original publications) and GO terms. Protein features (active sites…) Prediction of conserved domains Structural Domains Functional annotation of families/domains

Database | Basis | Institution | Built from | Focus | URL
Pfam | HMM | Sanger Institute | Sequence alignment | Family & Domain based on conserved sequence | http://pfam.sanger.ac.uk/
Gene3D | HMM | UCL | Structure alignment | Structural Domain | http://gene3d.biochem.ucl.ac.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | | Evolutionary domain relationships | http://supfam.cs.bris.ac.uk/SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | | Functional domain annotation | http://smart.embl-heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | | Microbial Functional Family Classification | http://www.jcvi.org/cms/research/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | | Family functional classification | http://www.pantherdb.org/
PIRSF | HMM | PIR, Georgetown, Washington D.C. | | Functional classification | http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | | | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & Profiles | SIB | | Functional annotation | http://expasy.org/prosite/
HAMAP | Profiles | SIB | | Microbial protein family classification | http://expasy.org/sprot/hamap/
ProDom | Sequence clustering | PRABI : Rhône-Alpes Bioinformatics Center | | Conserved domain prediction | http://prodom.prabi.fr/prodom/current/html/home.php

Foundations of InterPro Integration of signatures InterPro Manual curation

InterPro Entry Groups similar signatures together Links related signatures Adds extensive annotation Linked to other databases Structural information and viewers

Link related signatures - relationships Parent - Child (subgroup of more closely related proteins). Applies to domains and families. Example: Parent: Protein kinase, PFAM (100). Children: Serine kinase, SMART (75); Tyrosine kinase, PROSITE (25). SMART and PROSITE: no proteins in common.

The InterPro entry types Family: proteins share a common evolutionary origin, as reflected in their related functions, sequences or structure. Domain: biological units with defined boundaries. Repeat: short sequences typically repeated within a protein. Sites: PTM, Active Site, Binding, Conserved.

InterPro Search protein ID wwwdev.ebi.ac.uk/interpro

InterPro Search Results Family Domains and sites Unintegrated signatures Structural data Link to PDBe

InterProScan – Searching New Sequence Additional options Paste in unknown sequence wwwdev.ebi.ac.uk/interpro

InterProScan New Search Results Link to InterPro entry Links to signature databases MENTION ALSO SUMMARY TABLE

Contact and help We need your feedback! missing/additional references reporting problems requests help - questions answered NOTE: Worth mentioning that sometimes we cannot infer function, only homology to unknown proteins: this is the case for Proteins of unknown function and Domains of unknown function.

Thanks for your attention