Presentation on theme: "UniProt - The Universal Protein Resource"— Presentation transcript:
1 UniProt - The Universal Protein Resource Claire O’Donovan
2 Pre-UniProt
- Swiss-Prot: created in 1986; since 1987, a collaboration of the SIB and the EMBL/EBI.
- TrEMBL: created at the EBI in 1996 as a computer-annotated protein sequence database supplementing Swiss-Prot. It was introduced to deal with the increased data flow from genome projects.
5 The three-layered approach
- The UniProt Archive (UniParc): UniProtKB + all other publicly available protein sequences. Completeness.
- The UniProt Reference Clusters (UniRef): non-redundant views of UniProtKB + selected UniParc sets. Speed.
- The UniProt Knowledgebase (UniProtKB): central database of annotated protein sequences and functional information. UniProtKB/Swiss-Prot + UniProtKB/TrEMBL.
6 The three-layered approach
Interrelationship between the UniProt databases
This is a graphical representation of the relationships between the UniProt databases, as a summary of the previous slide:
- The UniProt Archive (UniParc) provides comprehensive coverage of all protein sequences, including new and revised protein sequences, from many sources.
- UniProtKB is a subset of the protein sequences from UniParc, based on annotation quality and sequence stability.
- The UniProt Reference Clusters (UniRef), derived by automatic procedures from the UniProt Knowledgebase and the UniProt Archive, provide complete coverage of sequence space while hiding redundant sequences from view.
For each organism with a completely sequenced genome we construct a non-redundant proteome reference set. The availability of complete non-redundant proteome sets for all organisms whose genomes have been fully sequenced is a prerequisite for many types of proteome-wide analysis. For example, it is essential for high-throughput mass-spectrometric identification experiments and, in general, for any type of comparative genomic study. As described in B , UniProt already constructs complete non-redundant proteome sets using a number of approaches. Each set and its analysis are made available shortly after the newly completed genome sequence appears in the nucleotide sequence databases. A standard procedure is used to create, from the UniProt Knowledgebase, proteome sets for bacterial, archaeal and non-metazoan eukaryotic genomes. For most microbial and fungal genomes these sets are obtained via the list of predicted coding sequences submitted by the genome centers as bona-fide protein-coding “CDS
7 UniProt Archive
- UniParc is a non-redundant archive of protein sequences from the public databases
- It contains only protein sequences (no annotations)
- It provides cross-references to the source databases
With these database goals, the summary of database tasks is:
8 UniProt Archive: Principles
- UniParc is non-redundant: each unique protein sequence is stored only once and is assigned a unique, stable UniParc identifier (e.g. UPI )
- UniParc provides cross-references to the original source: active or retired
- UniParc provides sequence versions
UniParc stores each unique sequence only once and assigns it a unique UniParc identifier. UniParc treats all sequences purely as strings: sequences that are 100% identical over their whole length are merged, regardless of whether they come from the same or from different species. Furthermore, UniParc cross-references the accession numbers of the source databases and provides sequence versions that are incremented each time the underlying sequence changes, thus making it possible to observe sequence changes in all source databases. The status flag “active” means that the sequence still exists in the source database, while the status flag “obsolete” indicates that the sequence no longer exists in the source database.
UniParc records carry no annotation, since annotation is only true in the real context of the sequence: proteins with the same sequence may have different functions depending on species, tissue, developmental stage, etc. All this context-dependent information (if known at all) cannot be present in UniParc; providing it is the purpose of the UniProt Knowledgebase.
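The archive behaviour described in these notes can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not UniParc's implementation: the class names, the `load`/`retire` methods and the identifier format are all hypothetical, and marking the old cross-reference inactive when a source record changes its sequence is a modelling choice made here.

```python
from dataclasses import dataclass, field


@dataclass
class CrossRef:
    database: str
    accession: str
    version: int = 1
    active: bool = True  # False plays the role of the "obsolete" status flag


@dataclass
class ArchiveEntry:
    identifier: str
    sequence: str
    xrefs: list = field(default_factory=list)


class Archive:
    def __init__(self):
        self._by_sequence = {}  # exact sequence string -> ArchiveEntry
        self._latest = {}       # (database, accession) -> (ArchiveEntry, CrossRef)

    def load(self, database, accession, sequence):
        """Register a source record; identical strings share one entry."""
        entry = self._by_sequence.get(sequence)
        if entry is None:
            # Illustrative identifier format only (assumption).
            identifier = f"UPI{len(self._by_sequence) + 1:010X}"
            entry = ArchiveEntry(identifier, sequence)
            self._by_sequence[sequence] = entry
        key = (database, accession)
        previous = self._latest.get(key)
        if previous is not None and previous[0] is entry:
            previous[1].active = True  # same sequence seen again: nothing changed
            return entry
        version = 1
        if previous is not None:
            # The source record changed its sequence: flag the old link, bump the version.
            previous[1].active = False
            version = previous[1].version + 1
        xref = CrossRef(database, accession, version)
        entry.xrefs.append(xref)
        self._latest[key] = (entry, xref)
        return entry

    def retire(self, database, accession):
        """Flag a record that no longer exists in its source database."""
        previous = self._latest.get((database, accession))
        if previous is not None:
            previous[1].active = False
```

Because the lookup key is the sequence string alone, two records from different species with the same sequence end up in one entry, which is why no annotation can sensibly be attached at this level.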
9 UniProt Reference Clusters Principles
- It provides non-redundant reference data collections
- It allows faster and more informative sequence similarity searches
- It includes the UniProtKB and some data from UniParc
- It merges across different species
The concept of non-redundancy for the UniProt Reference Clusters is similar to that discussed before for the UniProt Knowledgebase. However, there are some differences. UniRef merges sequences automatically across different species and includes some data from UniParc (like translations from the highly unstable gene predictions), while merging in the Knowledgebase is restricted to curator-assisted merging of reliable and stable sequence data derived from a certain gene from a certain species.
UniRef100 is based on all UniProt Knowledgebase records, as well as on the UniParc records that represent Ensembl protein translations, RefSeq data, and some other smaller data sets. The production of UniRef100 starts with the clustering of all these records by sequence identity. Identical sequences and subfragments are presented as a single UniRef100 entry containing the accession numbers of all the merged entries, the protein sequence, and links to the corresponding UniProt Knowledgebase and archive records.
UniRef90 and UniRef50 are built from UniRef100 (see section C.2.3) to provide non-redundant sequence collections for the scientific user community to perform faster homology searches. They yield a size reduction of approximately 40% and 65%, respectively.
10 UniProt Reference Clusters Principles
- UniRef100: merges identical sequences and subfragments
- UniRef90: size reduction of 40%
- UniRef50: size reduction of 65%
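The UniRef100 idea of presenting identical sequences and subfragments as a single entry can be sketched as follows. This is a naive illustration, not the UniRef production procedure: the function name and record layout are hypothetical, a fragment that fits several longer sequences simply joins the first one found, and the pairwise substring test would not scale to a real database.

```python
def cluster_identical_and_subfragments(records):
    """Group (accession, sequence) pairs so that identical sequences and
    exact subfragments share one cluster, represented by the longest sequence."""
    clusters = []
    # Longest sequences first, so every representative is seen before its fragments.
    for accession, sequence in sorted(records, key=lambda r: (-len(r[1]), r[0])):
        for cluster in clusters:
            if sequence in cluster["sequence"]:  # identical or a contiguous subfragment
                cluster["members"].append(accession)
                break
        else:
            clusters.append({"sequence": sequence, "members": [accession]})
    return clusters


records = [("A", "MKTAYIAK"), ("B", "KTAY"), ("C", "MKTAYIAK"), ("D", "GGG")]
clusters = cluster_identical_and_subfragments(records)
# Two clusters: MKTAYIAK with members A, C and B, and GGG with member D.
```

No species information is consulted, which mirrors the point that UniRef merges automatically across different species.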
11 UniProtKB/Swiss-Prot UniProtKB/TrEMBL
2007
UniProtKB/Swiss-Prot
- Non-redundant
- High level of integration
- High level of manual curation
- Contains 241,242 entries
UniProtKB/TrEMBL
- Translations of CDS in EMBL/GenBank/DDBJ
- Automatic annotation
- Contains 3,313,265 entries
2010
- UniProtKB/Swiss-Prot: 515,000
- UniProtKB/TrEMBL: 11,000,000
The consortium is involved in various activities, including UniParc and UniRef, but the major curation effort is directed towards the UniProt Knowledgebase. So what is the UniProt Knowledgebase? It consists of two sections called Swiss-Prot and TrEMBL. The core distinguishing features of each of these sections are those listed above.
12 UniProtKB/TrEMBL
- Automatically generated in a biweekly cycle from the data present in EMBL/GenBank/DDBJ and some other sources such as TAIR/SGD
- Exclusions: pseudogenes, synthetic, immunoglobulins, patents, small sequences <8
- /product, /gene, /locus_tag
- RefSeq and Ensembl
Basically, UniProtKB/TrEMBL represents the great majority of the to-do list of the UniProt curators, so it is worth taking a few minutes to review what this section is. The main source of this database is the CDS features in the nucleotide sequence databases, to which the Sanger Centre is of course a major contributor. We parse various data from the CDS feature into the appropriate line types in TrEMBL, the most obvious being the /product, /gene and /locus_tag qualifiers. Please contact us for more information about this if you would like to optimise the data transfer. Some sequences are not available in the nucleotide sequence databases for various reasons, so we also try to capture this data from available sources such as TAIR/SGD. We are also working with RefSeq and Ensembl on a pipeline to establish what data is missing or different between the resources, in order to build consensus sets across the database.
Exclusions: sequences are excluded at the point of entry from the nucleotide sequence databases and are also deleted from the database after discovery by curators. We welcome feedback on this.
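The conversion step described in these notes, applying the exclusions and then carrying the /product, /gene and /locus_tag qualifiers over into the protein entry, might look roughly like this. Everything about the data layout is an assumption made for illustration: the input dictionary keys (`kind`, `translation`, `qualifiers`), the output field names and the reading of "<8" as fewer than 8 residues are hypothetical and do not reflect the actual TrEMBL line types or pipeline.

```python
MIN_LENGTH = 8  # assumption: the slide's "<8" is read as fewer than 8 residues
EXCLUDED_KINDS = {"pseudogene", "synthetic", "immunoglobulin", "patent"}


def cds_to_entry(cds):
    """Turn one CDS feature (a plain dict, hypothetical layout) into a draft
    protein entry, or return None if the feature falls under an exclusion."""
    if cds.get("kind") in EXCLUDED_KINDS:
        return None
    if len(cds["translation"]) < MIN_LENGTH:
        return None
    qualifiers = cds.get("qualifiers", {})
    return {
        "protein_name": qualifiers.get("product"),   # from /product
        "gene_name": qualifiers.get("gene"),         # from /gene
        "locus_tag": qualifiers.get("locus_tag"),    # from /locus_tag
        "sequence": cds["translation"],
    }


feature = {
    "kind": "CDS",
    "translation": "MKTAYIAKQR",
    "qualifiers": {"product": "example kinase", "gene": "abc1", "locus_tag": "b0001"},
}
draft = cds_to_entry(feature)
```

Missing qualifiers simply become `None` in this sketch, which is one reason well-populated qualifiers in the submitted CDS features make the resulting entries more useful.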
13 UniProtKB/TrEMBL
- Proteome annotation
- Cross-references to other databases
- Addition of relevant publications (e.g. PDB)
- Redundancy
- Automatic annotation
- Future plans for manual annotation, e.g. human proteome project
After the TrEMBL entries are created from the underlying nucleotide sequence database entries, various post-processing steps are carried out to enhance the entries prior to manual annotation. Annotation is standardised according to the manual rules in the proteome database. Cross-references are added to 72 other databases. Where these cross-references facilitate the addition of relevant citations, this is also done, e.g. for PDB. Redundancy procedures merge records from the same species which are 100% identical. Automatic annotation procedures are applied based on the RuleBase and Spearmint systems. These are used very conservatively, leading to the enhancement of 25% of the database. There are future plans for partial manual annotation in TrEMBL to facilitate enhancement of protein and gene names, e.g. the human proteome project.
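The redundancy step, merging records from the same species that are 100% identical, can be illustrated by grouping on the pair (species, sequence). This is a minimal sketch with a hypothetical record layout (`taxon`, `sequence`, `accession`), not the TrEMBL procedure itself.

```python
from collections import defaultdict


def merge_identical_within_species(records):
    """Merge records that share both species and exact sequence.
    Identical sequences from different species stay separate."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["taxon"], record["sequence"])].append(record["accession"])
    return [
        {"taxon": taxon, "sequence": sequence, "accessions": accessions}
        for (taxon, sequence), accessions in groups.items()
    ]


records = [
    {"taxon": "Homo sapiens", "sequence": "MKTAYIAK", "accession": "X1"},
    {"taxon": "Homo sapiens", "sequence": "MKTAYIAK", "accession": "X2"},
    {"taxon": "Mus musculus", "sequence": "MKTAYIAK", "accession": "X3"},
]
merged = merge_identical_within_species(records)
# Two merged entries: one for Homo sapiens (X1, X2) and one for Mus musculus (X3).
```

Including the species in the grouping key is what distinguishes this from a purely string-based archive, where all three records would collapse into one.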
14 Literature, Other databases, Analysis tools, External expertise
So after the initial work by the TrEMBL team, the UniProt curators proceed with the manual annotation. This involves combining information from many sources: the literature, of course, but also other databases such as Ensembl and HAVANA, as well as using analysis tools and drawing on external expertise. The UniProt curators also actively search for proteins in sources other than the nucleotide sequence databases, e.g. submissions and journal scans.
15 Capturing the correct sequence
- Archive collections
- Each sequence report stored in its own entry
- Merging at 100% identity
- Still some redundancy
The first step is to capture the correct sequence. The nucleotide sequence databases are archives, and they store each sequence report in its own entry. Reports that are 100% identical are already merged at the TrEMBL level, as mentioned previously, but there will still be redundancy.
16 Sequence similarity searches
- Identify potential merge candidates
- Identify similar already curated entries
So the curators begin with a sequence similarity search in order to identify potential merge candidates and also to identify similar, already curated entries in Swiss-Prot, which speeds up the annotation process.
17 Sequence comparison
- Sequence alignments
- Identification of sequence differences
- Helps in identifying underlying causes
Sequence alignments allow the identification of sequence differences and also help in identifying the underlying causes.
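As a rough illustration of how differences between two reports of the same protein can be listed, the sketch below uses Python's standard `difflib.SequenceMatcher`. This is a generic string comparison standing in for a proper biological sequence alignment, and the function name is hypothetical; it is not a tool the curators are stated to use.

```python
from difflib import SequenceMatcher


def sequence_differences(a, b):
    """List the regions where two sequences differ as
    (operation, 1-based start in a, residues in a, residues in b)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, i1 + 1, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


differences = sequence_differences("MKTAYIAKQR", "MKTAYLAKQR")
# A single substitution at position 6: I in the first sequence, L in the second.
```

A single-residue replacement like this one might point to a polymorphism or a sequencing error, while a long inserted or deleted block is more suggestive of a splice variant or an incorrect gene prediction; deciding between them still needs curator judgement.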
18 Causes of sequence differences
- Polymorphisms, disease variants
- Splice variants
- Sequencing errors
- Incorrect predictions
The usual causes of sequence differences cover both naturally occurring events, such as polymorphisms and splice variants, and events due to the sequencing process itself, such as sequencing errors and incorrect predictions, which of course have lessened in recent years. However, because the nucleotide sequence databases are archives, these errors still exist in the databases.
20 Sequence analysis
- Range of sequence analysis tools used to predict important sequence features
- Use of most appropriate programs
- Development of new predictive methods
21 Evidence attribution
System which allows linking of all information in an entry to its original source.
Allows users:
- to trace origin of all data
- to differentiate easily between literature-derived and computational data
- to assess data reliability
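One way to picture evidence attribution is as a data structure in which every piece of annotation carries a pointer to its source. The sketch below is a hypothetical model, not the UniProt evidence format: the class names, the two evidence kinds and the example source labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    kind: str    # "literature" or "computational" (assumed categories)
    source: str  # e.g. a citation label or the name of a prediction program


@dataclass(frozen=True)
class Annotation:
    topic: str
    value: str
    evidence: Evidence


def by_kind(annotations, kind):
    """Select the annotations backed by one kind of evidence."""
    return [a for a in annotations if a.evidence.kind == kind]


entry_annotations = [
    Annotation("function", "example kinase activity", Evidence("literature", "reference-1")),
    Annotation("signal peptide", "1-22", Evidence("computational", "example-predictor")),
]
literature_backed = by_kind(entry_annotations, "literature")
computational = by_kind(entry_annotations, "computational")
```

Keeping the source next to each value is what lets a user trace where a statement came from and weigh a predicted feature differently from one reported in a paper.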