UniProt - The Universal Protein Resource

UniProt - The Universal Protein Resource Claire O’Donovan

Pre-UniProt
- Swiss-Prot: created in 1986; since 1987, a collaboration of the SIB and the EMBL/EBI.
- TrEMBL: created at the EBI in 1996 as a computer-annotated protein sequence database supplementing Swiss-Prot. It was introduced to deal with the increased data flow from genome projects.

UniProt Consortium

UniProt Consortium activities

The three-layered approach
- The UniProt Archive (UniParc): UniProtKB + all other protein sequences publicly available. Completeness.
- The UniProt Reference Clusters (UniRef): non-redundant views of UniProtKB + selected UniParc sets. Speed.
- The UniProt Knowledgebase (UniProtKB): central database of annotated protein sequences and functional information. UniProtKB/Swiss-Prot + UniProtKB/TrEMBL.

The three-layered approach: interrelationship between the UniProt databases
This is a graphical representation of the relationships between the UniProt databases, as a summary of the previous slide:
- The UniProt Archive (UniParc) provides comprehensive coverage of all protein sequences, including new and revised protein sequences, from many sources.
- UniProtKB is a subset of the protein sequences from UniParc, selected on the basis of quality annotation and sequence stability.
- The UniProt Reference Clusters (UniRef), derived by automatic procedures from the UniProt Knowledgebase and the UniProt Archive, provide complete coverage of sequence space while hiding redundant sequences from view.
For each organism with a completely sequenced genome we construct a non-redundant proteome reference set. The availability of complete non-redundant proteome sets for all organisms whose genomes have been fully sequenced is a prerequisite for many types of proteome-wide analysis. For example, it is essential for high-throughput mass-spectrometric identification experiments and, in general, for any type of comparative genomic study. As described in B.1.2.4.4, UniProt already constructs complete non-redundant proteome sets using a number of approaches. Each set and its analysis is made available shortly after the newly completed genome sequence appears in the nucleotide sequence databases. A standard procedure is used to create, from the UniProt Knowledgebase, proteome sets for bacterial, archaeal and non-metazoan eukaryotic genomes. For most microbial and fungal genomes these sets are obtained via the list of predicted coding sequences submitted by the genome centers as bona fide protein-coding “CDS

UniProt Archive
- UniParc is a non-redundant archive of protein sequences from the public databases.
- It contains only protein sequences (no annotations).
- It provides cross-references to the source databases.
With these database goals, the database tasks are summarised as follows:

UniProt Archive: Principles
- UniParc is non-redundant.
- Each unique protein sequence is stored only once and is assigned a unique, stable UniParc identifier (e.g. UPI0000000356).
- UniParc provides cross-references to the original source: active or retired.
- UniParc provides sequence versions.
UniParc stores each unique sequence only once and assigns it a unique UniParc identifier. UniParc handles all sequences simply as strings: all sequences that are 100% identical over their whole length are merged, regardless of whether they come from the same or from different species. Furthermore, UniParc cross-references the accession numbers of the source databases and provides sequence versions that are incremented each time the underlying sequence changes, making it possible to observe sequence changes in all source databases. The status flag “active” means that the sequence still exists in the source database, while the status flag “obsolete” indicates that the sequence no longer exists in the source database. UniParc records carry no annotation, since annotation is only true in the real context of the sequence: proteins with the same sequence may have different functions depending on species, tissue, developmental stage, etc. All this context-dependent information (if known at all) cannot be present in UniParc, and it is the purpose of the UniProt Knowledgebase to provide it.
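The archive principles above can be sketched in a few lines of Python. This is a toy illustration only, not UniParc's actual implementation: the class names, the field layout and the way identifiers are numbered are all assumptions made for the example. It shows one record per unique sequence string, cross-references back to the source accession, a version that goes up when the source sequence changes, and an active flag that is cleared when a source no longer carries a sequence.

```python
from dataclasses import dataclass, field


@dataclass
class CrossRef:
    database: str        # source database name
    accession: str       # accession number in the source database
    version: int         # incremented whenever the source sequence changes
    active: bool = True  # False once the source no longer holds this sequence


@dataclass
class Archive:
    """Toy sequence archive (hypothetical): one record per unique sequence."""
    ids: dict = field(default_factory=dict)     # sequence -> archive identifier
    xrefs: dict = field(default_factory=dict)   # identifier -> list of CrossRef
    latest: dict = field(default_factory=dict)  # (database, accession) -> (identifier, version)

    def load(self, database, accession, sequence):
        # Sequences are compared purely as strings; species plays no role.
        # The identifier format is made up for this sketch.
        upi = self.ids.setdefault(sequence, f"UPI{len(self.ids) + 1:010X}")
        key = (database, accession)
        previous = self.latest.get(key)
        if previous and previous[0] == upi:
            return upi  # the source sequence has not changed
        version = 1
        if previous:
            # The source sequence changed: flag the old link and bump the version.
            old_upi, old_version = previous
            for ref in self.xrefs[old_upi]:
                if (ref.database, ref.accession) == key:
                    ref.active = False
            version = old_version + 1
        self.xrefs.setdefault(upi, []).append(CrossRef(database, accession, version))
        self.latest[key] = (upi, version)
        return upi


archive = Archive()
first = archive.load("EMBL", "X1", "MKT")
second = archive.load("RefSeq", "Y1", "MKT")   # same string, same identifier
changed = archive.load("EMBL", "X1", "MKTA")   # X1 now points at a new sequence
```

After these three loads, `first` and `second` are the same identifier, while the cross-reference from X1 to the old sequence is flagged inactive and the new one carries version 2.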

UniProt Reference Clusters: Principles
- It provides non-redundant reference data collections.
- It allows faster and more informative sequence similarity searches.
- It includes the UniProtKB and some data from UniParc.
- It merges across different species.
The concept of non-redundancy for the UniProt Reference Clusters is similar to that discussed before for the UniProt Knowledgebase. However, there are some differences. UniRef merges sequences automatically across different species and includes some data from UniParc (like translations from the highly unstable gene predictions), while merging in the Knowledgebase is restricted to curator-assisted merging of reliable and stable sequence data derived from a certain gene from a certain species. UniRef100 is based on all UniProt Knowledgebase records, as well as on the UniParc records that represent Ensembl protein translations, RefSeq data, and some other smaller data sets. The production of UniRef100 starts with the clustering of all these records by sequence identity. Identical sequences and subfragments are presented as a single UniRef100 entry containing the accession numbers of all the merged entries, the protein sequence, and links to the corresponding UniProt Knowledgebase and archive records. UniRef90 and UniRef50 are built from UniRef100 (see section C.2.3) to provide non-redundant sequence collections for the scientific user community to perform faster homology searches. They yield a size reduction of approximately 40% and 65%, respectively.

UniProt Reference Clusters: Principles
- UniRef100: it merges identical sequences and subfragments.
- UniRef90: size reduction of 40%.
- UniRef50: size reduction of 65%.
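The UniRef100 idea of merging identical sequences and subfragments can be illustrated with a small sketch. It is an assumption-laden toy: the function names are hypothetical, a "subfragment" is taken to mean a contiguous substring of the representative sequence, and the quadratic scan would not scale to a real database. The second helper simply expresses a size reduction as the fraction of entries removed.

```python
def cluster_identical(records):
    """Group (accession, sequence) pairs into clusters.

    A record joins an existing cluster when its sequence is identical to,
    or a contiguous subfragment of, that cluster's representative sequence.
    """
    clusters = []  # list of (representative_sequence, [accessions])
    # Process longest sequences first so that every representative is
    # created before any of its fragments is examined.
    for accession, sequence in sorted(records, key=lambda r: (-len(r[1]), r[0])):
        for representative, members in clusters:
            if sequence in representative:
                members.append(accession)
                break
        else:
            clusters.append((sequence, [accession]))
    return clusters


def size_reduction(n_before, n_after):
    """Fraction of entries removed by clustering."""
    return 1 - n_after / n_before


records = [
    ("A", "MKTAYIAK"),
    ("B", "KTAY"),       # subfragment of A
    ("C", "MKTAYIAK"),   # identical to A
    ("D", "MSSS"),
]
clusters = cluster_identical(records)
reduction = size_reduction(len(records), len(clusters))
```

Here four records collapse into two clusters, a reduction of one half. In these terms, the 40% figure quoted for UniRef90 would mean that it holds roughly 60% as many entries as UniRef100.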

UniProtKB/Swiss-Prot and UniProtKB/TrEMBL
2007
UniProtKB/Swiss-Prot
- Non-redundant
- High level of integration
- High level of manual curation
- Contains 241,242 entries
UniProtKB/TrEMBL
- Translations of CDS in EMBL/GenBank/DDBJ
- Automatic annotation
- Contains 3,313,265 entries
2010
- UniProtKB/Swiss-Prot: 515,000
- UniProtKB/TrEMBL: 11,000,000
The consortium is involved in various activities, including UniParc and UniRef, but the major curation effort is directed towards the UniProt Knowledgebase. So what is the UniProt Knowledgebase? It consists of two sections called Swiss-Prot and TrEMBL. The core distinguishing features of each of these sections are those listed on the slide.

UniProtKB/TrEMBL
- Automatically generated in a biweekly cycle from the data present in EMBL/GenBank/DDBJ and some other sources such as TAIR/SGD
- Exclusions: pseudogenes, synthetic, immunoglobulins, patents, small sequences <8
- /product, /gene, /locus_tag
- RefSeq and Ensembl
UniProtKB/TrEMBL represents the great majority of the to-do list of the UniProt curators, so it is worth taking a few minutes to review what this section is. The main source of this database is the CDS features in the nucleotide sequence databases, to which the Sanger Centre is of course a major contributor. We parse various data from the CDS feature into the appropriate line types in TrEMBL, the most obvious being the /product, /gene and /locus_tag qualifiers. Please contact us for more information if you would like to optimise the data transfer. Some sequences are not available in the nucleotide sequence databases for various reasons, so we also try to capture this data from available sources such as TAIR/SGD. We are also working with RefSeq and Ensembl to set up a pipeline that determines what data is missing or different between the resources, in order to build consensus sets across the databases. Exclusions are applied at the point of entry from the nucleotide sequence databases, and excluded entries are also deleted from the database when curators discover them. We welcome feedback on this.
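The step from CDS feature to TrEMBL entry described above can be sketched as a simple filter and mapping. Everything about the data layout here is an assumption made for illustration: the dictionary keys, the flag names and the output field names are hypothetical, and the "<8" threshold on the slide is read as "fewer than 8 residues", which the slide itself does not state.

```python
# Assumption: "<8" on the slide is taken to mean fewer than 8 amino acid residues.
MIN_LENGTH = 8

# Hypothetical flag names for the exclusion classes listed on the slide.
EXCLUDED_FLAGS = {"pseudogene", "synthetic", "immunoglobulin", "patent"}


def to_entry(cds):
    """Turn one CDS feature (a dict) into a draft entry, or None if excluded.

    Expected keys (hypothetical layout): 'translation', 'qualifiers', 'flags'.
    """
    if EXCLUDED_FLAGS & set(cds.get("flags", ())):
        return None  # excluded at the point of entry
    translation = cds.get("translation", "")
    if len(translation) < MIN_LENGTH:
        return None  # too short to keep
    qualifiers = cds.get("qualifiers", {})
    # Copy the three qualifiers named on the slide into entry fields.
    return {
        "sequence": translation,
        "protein_name": qualifiers.get("product"),
        "gene_name": qualifiers.get("gene"),
        "locus_name": qualifiers.get("locus_tag"),
    }


kept = to_entry({
    "translation": "MKTAYIAKQRQISFVK",
    "qualifiers": {"product": "example protein", "gene": "exA", "locus_tag": "EX_0001"},
})
dropped = to_entry({"translation": "MKTAYIAKQRQISFVK", "flags": ["pseudogene"]})
```

`kept` is a dictionary with the three qualifier values copied across, and `dropped` is `None`.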

UniProtKB/TrEMBL
- Proteome annotation
- Cross-references to other databases
- Addition of relevant publications (e.g. PDB)
- Redundancy
- Automatic annotation
- Future plans for manual annotation, e.g. human proteome project
After the TrEMBL entries are created from the underlying nucleotide sequence database entries, various post-processing steps are carried out to enhance the entries prior to manual annotation. Annotation is standardised according to the manual rules in the proteome database. Cross-references are added to 72 other databases. Where these cross-references facilitate the addition of relevant citations (e.g. from PDB), this is also done. Redundancy procedures merge records from the same species which are 100% identical. Automatic annotation procedures based on the RuleBase and Spearmint systems are applied; they are used very conservatively, leading to the enhancement of 25% of the database. There are future plans for partial manual annotation in TrEMBL to facilitate the enhancement of protein and gene names, e.g. for the human proteome project.
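The redundancy step, which merges records only when they are 100% identical and come from the same species, differs from archive-style merging in that the species is part of the grouping key. A minimal sketch, assuming each record is a hypothetical (accession, taxon identifier, sequence) tuple:

```python
from collections import defaultdict


def merge_within_species(records):
    """Merge records whose sequences are identical AND share a species.

    records: iterable of (accession, taxon_id, sequence) tuples
    (a hypothetical layout chosen for this example).
    """
    groups = defaultdict(list)
    for accession, taxon_id, sequence in records:
        # The species is part of the key, so identical sequences
        # from different organisms stay in separate entries.
        groups[(taxon_id, sequence)].append(accession)
    return [
        {"taxon_id": taxon_id, "sequence": sequence, "accessions": accessions}
        for (taxon_id, sequence), accessions in groups.items()
    ]


merged = merge_within_species([
    ("A", 9606, "MKT"),
    ("B", 9606, "MKT"),    # same species, same sequence: merged with A
    ("C", 10090, "MKT"),   # same sequence, different species: kept apart
])
```

The three input records yield two entries: one holding accessions A and B, and a separate one for C.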

- Literature
- Other databases
- Analysis tools
- External expertise
After the initial work by the TrEMBL team, the UniProt curators proceed with the manual annotation. This involves combining information from many sources: the literature, of course, but also other databases such as Ensembl and HAVANA, together with analysis tools and external expertise. The UniProt curators also actively search for proteins in sources other than the nucleotide sequence databases, e.g. submissions and journal scans.

Capturing the correct sequence
- Archive collections
- Each sequence report stored in its own entry
- Merging at 100% identity
- Still some redundancy
The first step is to capture the correct sequence. The nucleotide sequence databases are archives by nature, and they store each sequence report in its own entry. Reports that are 100% identical have already been merged at the TrEMBL level, as mentioned previously, but some redundancy remains.

Sequence similarity searches
- Identify potential merge candidates
- Identify similar already curated entries
The curators begin with a sequence similarity search in order to identify potential merge candidates, and also to identify similar, already curated entries in Swiss-Prot in order to speed up the annotation process.

Sequence comparison
- Sequence alignments
- Identification of sequence differences
- Helps in identifying underlying causes
Sequence alignments allow the identification of sequence differences and also help in identifying the underlying causes.
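Locating the differences between two versions of a sequence can be sketched with Python's standard `difflib` module. This is only a stand-in for a proper biological sequence alignment: `SequenceMatcher` finds matching blocks without any substitution matrix or gap penalties, and the function name and the example sequences below are made up for illustration.

```python
from difflib import SequenceMatcher


def sequence_differences(a, b):
    """List the regions where sequence b differs from sequence a.

    Returns tuples (operation, position_in_a, segment_of_a, segment_of_b),
    with positions counted from 1.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, i1 + 1, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# A single-residue difference, as a polymorphism or sequencing error might give.
substitution = sequence_differences("MKTAYIAKQR", "MKTAYLAKQR")

# A missing internal stretch, as a splice variant might give.
deletion = sequence_differences("MKTAYIAKQR", "MKTAYQR")
```

The first call reports a replacement of I by L at position 6; the second reports the deletion of the three residues IAK starting at position 6. The report only says where the sequences differ; deciding the underlying cause remains the curator's job.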

Causes of sequence differences
- Polymorphisms, disease variants
- Splice variants
- Sequencing errors
- Incorrect predictions
The usual causes of sequence differences include naturally occurring events, such as polymorphisms and splice variants, and artefacts of the sequencing process itself, such as sequencing errors and incorrect predictions. These artefacts have of course become less frequent in recent years, but because the nucleotide sequence databases are archives, the older errors still exist in the databases.

Sequence analysis
- Range of sequence analysis tools used to predict important sequence features
- Use of most appropriate programs
- Development of new predictive methods

Evidence attribution
System which allows linking of all information in an entry to its original source. It allows users:
- to trace the origin of all data
- to differentiate easily between literature-derived and computational data
- to assess data reliability
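One way to picture evidence attribution is as a data structure in which every piece of annotation carries a pointer to where it came from. The sketch below is hypothetical throughout: the class names, the two evidence kinds and the placeholder source labels are assumptions for illustration and do not reflect how UniProtKB actually encodes evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    kind: str    # "literature" or "computational" (labels assumed for this sketch)
    source: str  # e.g. a citation identifier or the name of a prediction method


@dataclass(frozen=True)
class Annotation:
    field: str          # which part of the entry this value belongs to
    value: str          # the annotated information itself
    evidence: Evidence  # where the information came from


def by_kind(annotations, kind):
    """Select the annotations backed by a given kind of evidence."""
    return [a for a in annotations if a.evidence.kind == kind]


entry = [
    Annotation("function", "placeholder function text", Evidence("literature", "reference-1")),
    Annotation("signal peptide", "1-22", Evidence("computational", "predictor-X")),
]
from_literature = by_kind(entry, "literature")
predicted = by_kind(entry, "computational")
```

Because the source travels with each value, a user can trace any statement back to its origin and separate literature-derived data from computational predictions with a single filter.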

EBI curation projects
- Submissions
- Journal scanning
- Species-specific curation: human, mouse, rat, C. elegans, Drosophila, Xenopus, zebrafish, S. cerevisiae, S. pombe
- Protein family curation: kinases, keratins
- UniProtKB-MSD collaboration
- PTM standardisation

UniProt distribution
- Biweekly distribution
- Website access: www.uniprot.org
- FTP access
- DVD of UniProtKB (datalib@ebi.ac.uk)

UniProt Web