1 EMBL Outstation — The European Bioinformatics Institute Removing redundancy in SWISS-PROT and TrEMBL.


Slide 1: Removing redundancy in SWISS-PROT and TrEMBL

Slide 2: SWISS-PROT
- is a curated protein sequence data bank established in 1986 by Amos Bairoch in Geneva and maintained collaboratively with EMBL since 1987
- currently contains 75,000 protein sequence entries

Slide 3: Essential criteria for a sequence data bank
- it must be complete with minimal redundancy
- it must contain as much up-to-date information as possible on each sequence
- all the information items must be retrievable by computer programs in a consistent manner
- it should be integrated (cross-referenced) with other sequence-related data banks

Slide 4: The Bottleneck: Annotation

Slide 5: Annotation consists of the description of:
- Function(s) of the protein
- Post-translational modification(s)
- Domains and sites
- Secondary structure
- Quaternary structure
- Similarities to other proteins
- Disease(s) associated with deficiencies in the protein
- Sequence conflicts, variants, etc.

Slide 6: TrEMBL
- is a computer-annotated supplement to SWISS-PROT
- consists of entries in SWISS-PROT format
- contains translations of CDS in the Nucleotide Sequence Database that are not in SWISS-PROT
- the translation tools used are based on the program trembl, written by Thure Etzold at the EMBL in Heidelberg
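As a rough illustration of what "translation of a CDS" means here, the following minimal Python sketch turns an in-frame coding sequence into a protein sequence. It is not the trembl program: the function name `translate_cds` and the example sequence are hypothetical, and the sketch assumes the standard genetic code and a CDS that starts in frame.

```python
from itertools import product

# Standard genetic code, with codons enumerated in base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate an in-frame CDS, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA letters
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognised codon
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCAAGTAA"))  # MAK
```

A real pipeline would also have to handle alternative genetic codes, partial CDS features and joins across exons, none of which is attempted here.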

Slide 7: TrEMBLNEW
- Weekly update of TrEMBL which contains protein coding sequences derived from EMBLNEW
- TrEMBLNEW entries are moved into TrEMBL during the quarterly release building procedure

Slide 8: The Production of TrEMBL
- Translation and entry creation
- Sorting the entries
- Automated post-processing of the SP-TrEMBL entries

Slide 9: Automated post-processing of TrEMBL entries
- Redundancy removal: currently affects >10% of the entries
- Improvements to annotation: currently affects >20% of the entries

Slide 10: Removing Redundancy
- Causes of redundancy and the detection of redundancy
- Removing redundancy

Slide 11: Causes of redundancy
- Different literature and sequence reports for the same protein
- Subfragments of longer sequences
- Mutations, polymorphisms, variations and conflicts of a sequence are often given as separate entries in EMBL

Slide 12: Redundancy detection
- The Cyclic Redundancy Check (CRC32) calculates a nearly unique and very compact checksum for each sequence
- The Boyer-Moore sequence comparison algorithm for fast string searching
- An algorithm that finds strings with errors (Landau-Vishkin)
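The first two detection steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EBI code: `group_identical` and `horspool_find` are hypothetical names, the entry IDs and sequences are made up, and the search routine is the Horspool simplification of Boyer-Moore rather than the full algorithm. Because CRC32 is only nearly unique, the sketch confirms every checksum match with a real sequence comparison. The Landau-Vishkin step (matching with errors) is not shown.

```python
import zlib
from collections import defaultdict


def group_identical(entries):
    """Group entry IDs with identical sequences, using CRC32 as a fast pre-filter."""
    buckets = defaultdict(list)
    for entry_id, seq in entries.items():
        buckets[zlib.crc32(seq.encode("ascii"))].append(entry_id)
    groups = []
    for ids in buckets.values():
        # Equal checksums do not guarantee equal sequences: compare for real.
        by_sequence = defaultdict(list)
        for entry_id in ids:
            by_sequence[entries[entry_id]].append(entry_id)
        groups.extend(g for g in by_sequence.values() if len(g) > 1)
    return groups


def horspool_find(text, pattern):
    """Return the first index of pattern in text, or -1 (Boyer-Moore-Horspool)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Shift for a character = distance from its last occurrence
    # (ignoring the final pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Skip ahead based on the text character under the pattern's last position.
        pos += shift.get(text[pos + m - 1], m)
    return -1


entries = {"A": "MKTAYIAKQR", "B": "MKTAYIAKQR", "C": "MKTAYIAKQV"}
print(group_identical(entries))             # [['A', 'B']]
print(horspool_find("MKTAYIAKQR", "AYIA"))  # 3
```

The checksum pass is cheap enough to run over every entry, while the substring search is only needed for candidate subfragments.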

Slide 13: Removing Redundancy
- Identical full-length proteins are merged into one entry
- Identical fragment proteins and subfragments of longer sequences from the same organism are merged
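The same-organism subfragment rule can be sketched as follows. This is a simplified illustration, not the production procedure: the `Entry` type, the function `find_subfragments` and the example IDs are hypothetical, and plain substring containment stands in for the actual matching tools.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    entry_id: str
    organism: str
    sequence: str


def find_subfragments(entries):
    """Map each redundant entry ID to the same-organism entry that contains it."""
    absorbed = {}
    # Longest first, so fragments are assigned to the longest containing sequence.
    ordered = sorted(entries, key=lambda e: len(e.sequence), reverse=True)
    for i, small in enumerate(ordered):
        for big in ordered[:i]:
            if (big.organism == small.organism
                    and big.entry_id not in absorbed  # only merge into surviving entries
                    and small.sequence in big.sequence):
                absorbed[small.entry_id] = big.entry_id
                break
    return absorbed


entries = [
    Entry("Q1", "Homo sapiens", "MKTAYIAKQRQISFVK"),
    Entry("Q2", "Homo sapiens", "AYIAKQR"),
    Entry("Q3", "Mus musculus", "AYIAKQR"),  # same fragment, different organism
    Entry("Q4", "Homo sapiens", "AYIAKQR"),
]
print(find_subfragments(entries))  # {'Q2': 'Q1', 'Q4': 'Q1'}
```

Identical sequences are covered by the same check, since a sequence trivially contains an equal one; the mouse entry is kept because the rule only applies within one organism.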

Slide 14: Removing Redundancy
- The 'MERGE' procedure
  - match CRC32
    - match TrEMBLNEW vs TrEMBLNEW (automatic merge)
    - match TrEMBLNEW vs TrEMBL (automatic merge)
    - match TrEMBLNEW vs SWISS-PROT (manual merge)
  - Subfragment assembly (LASSAP)
    - match TrEMBLNEW vs TrEMBLNEW (automatic merge and manual check)
    - match TrEMBLNEW vs TrEMBL (automatic merge and manual check)
    - match TrEMBLNEW vs SWISS-PROT (manual merge)
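The decision table behind the 'MERGE' procedure can be written down compactly. The sketch below only encodes the six cases listed on the slide; the stage and target labels and the function name `merge_action` are hypothetical and do not correspond to any real tool or LASSAP interface.

```python
# Hypothetical labels for the two matching stages and the three data sets.
STAGES = ("CRC32 match", "subfragment assembly")
TARGETS = ("TrEMBLNEW", "TrEMBL", "SWISS-PROT")


def merge_action(stage, target):
    """Return how a TrEMBLNEW match against `target` is handled at `stage`."""
    if stage not in STAGES or target not in TARGETS:
        raise ValueError(f"unknown stage or target: {stage!r}, {target!r}")
    if target == "SWISS-PROT":
        return "manual merge"  # both stages list a manual merge for SWISS-PROT
    if stage == "CRC32 match":
        return "automatic merge"
    return "automatic merge and manual check"


# Walk through the cases in the order given on the slide.
for stage in STAGES:
    for target in TARGETS:
        print(f"{stage}: TrEMBLNEW vs {target} -> {merge_action(stage, target)}")
```

Read this way, the pattern is that exact checksum matches are trusted more than subfragment matches, and anything touching the curated SWISS-PROT entries goes through a person.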

Slide 15: [Diagram: production workflow. Recoverable labels: PID Check; EMBLNEW; trembl; SP + TREMBL PIDs (Work Release); Day 1, Day 2, ... Day n; TREMBLNEW; Week 1, Week 2, ... Week n; TREMBLNEW Updates; Replace PIDs in SP + TREMBL; SP; TREMBL; Merge; Between releases; Building Release]

Slide 16: Results
EMBL Nucleotide Sequence Database (rel 55) has 326,000 CDS
SWISS-PROT (rel 36) has 74,019 entries
TrEMBL (rel 7) has 193,860 entries
- 110,000 CDS were already in 74,000 SWISS-PROT entries
- 207,000 CDS were in 194,000 TrEMBL entries
- 9,000 currently being processed due to redundancy procedures

Slide 17: Results
- Results of redundancy removal within TrEMBL 7 production
  - 743 were already in SWISS-PROT
  - 3,380 were merged due to CRC32 matches
  - 4,736 were removed by subfragment matches
- 8,859 entries were removed
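The figures on the two Results slides can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers from the slides; the variable names are illustrative.

```python
# Figures copied from the two Results slides.
total_cds = 326_000
cds_in_swissprot = 110_000
cds_in_trembl = 207_000
cds_in_process = 9_000

# The three CDS categories account for the whole of EMBL release 55.
assert cds_in_swissprot + cds_in_trembl + cds_in_process == total_cds

removed = {
    "already in SWISS-PROT": 743,
    "CRC32 matches": 3_380,
    "subfragment matches": 4_736,
}
total_removed = sum(removed.values())
assert total_removed == 8_859  # matches the total stated on the slide

# Share of each cause among the removed entries.
for reason, count in removed.items():
    print(f"{reason}: {count} ({count / total_removed:.1%})")
```

Both totals add up exactly, and subfragment matches turn out to be the largest single cause of removal.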

Slide 18: Credits
SWISS-PROT at EBI
- Rolf Apweiler
- Sergio Contrino
- Wolfgang Fleischmann
- Henning Hermjakob
- Viv Junker
- Fiona Lang
- Claire O'Donovan
- Michele Magrane
- Maria Jesus Martin
- Nicoletta Mitaritonna
- Steffen Moeller
- Youla Karavidopoulou
- Gill Fraser
- Evguenia Kriventseva
Collaborators
- Amos Bairoch
- Eric Glemet
- Jean-Jacques Codani

