ABSTRACT

WormBase is a freely available information resource primarily for the nematode Caenorhabditis elegans, but it progressively includes data from other related nematodes (C. briggsae, for example). WormBase operates a fortnightly release cycle in which each build is made available on the website (http://www.wormbase.org). Over the past year we have continued to improve the data resource and to incorporate new data sets, large and small. WormBase encourages users to submit their data and observations via email and online submission forms (http://www.wormbase.org/db/curate/base) as we strive to improve the usability and content of the data resource. WormBase is constantly evolving and has recently introduced Frozen Release Versions to better accommodate the different needs of the scientific community. WormBase works very closely with the CGC (Caenorhabditis Genetics Center) to adhere to the established naming conventions and improve genetic data. WormBase's user community continues to grow as more data resources are incorporated into the database. To maintain the consistent browsing experience users enjoy, a second mirror site has been established to relieve load on the main site (http://caltech.wormbase.org).

WormBase, A Resource for Nematode Biology: 2003-2004
Paul Davis, The Sanger Institute

WHAT'S NEW?

24 new releases last year: WS97 (07 Mar 2003) to WS121 (12 Mar 2004).
Three ‘frozen’ releases (WS100/110/120).
More and more data has been integrated into WormBase. This includes:
  ~42,000 new C. elegans ESTs
  ~60,000 new other nematode ESTs
  Orfeome data (Reboul et al. 2003, Nature Genetics)
  Brigpep (Stein et al. 2003, PLoS Biol.)
  Anatomy names & terms
  Antibody
  Gene regulation data
  New set of deletion alleles (National Bioresource Project, Japan)
  Protein 3D Structures (NESG, Northeast Structural Genomics Consortium)
Extra stability: the build process is much more robust, with more testing of data on the development website.

Growth of locus names

                        WS75 (02)   WS97 (03)   WS120 (04)
connected to sequence   2549        3723        4987

How Has the Genome Changed:

New sequence data is derived from a variety of sources:
  3rd party clone data.
  Repeat assembly data.
  Transcript data.
  Resolution of N’s (St Louis).
  Genomic Sequence Errors.
Gene predictions may have an incorrect structure compared to available experimental data.
Curators check through generated lists of potential problems:
  Introns confirmed by transcript data but not in a prediction.
  Small Introns.
  Transcript data matching introns.
WormBase users email in structure corrections, observations, reports of falsely assigned pseudogenes, and family structure studies, some of which help to identify sequencing errors.

Data Increase since last year: (figure not reproduced)

Needs of Different Users.

Within the worm community there are different needs from the sequenced genome:
  Researchers requiring the latest, accurate sets of gene predictions.
  Bioinformatics groups wanting stability to perform global analyses.

Catering for Different Needs.

WormBase 2 week release cycle:
  Good for research groups interested in subsets of genes, as it allows quick turnaround of corrections and data.
Introduction of WormBase “Frozen” release versions:
  Take place every 10 releases since WS100.
  Hosted on a separate website (http://ws100.wormbase.org/).
  Remain available on the ftp site.
  Provide stability and insulation from constant changes.
  Let groups coordinate research and reference a specific archived version.

Contact WormBase: email@example.com firstname.lastname@example.org

TRACKING GENES IN WormBase.
The Problem:
  No way of tracking gene name changes.
  No unique, stable identifier for a ‘gene’.
  Worm genes first existed as Locus objects, e.g. dpy-1.
  Then genes existed as Sequence objects, e.g. F31D4.3.
  Some genes exist as both Locus and Sequence objects.
  Gene names change frequently.

Genomic Change 2003-2004 (figure; labels: Resolution of Repeat misassemblies, Frozen Release, Latest Sequence update + Other Sequence corrections)

Sequence Updates.

A list of potential sequencing errors was compiled over the last 2 years.
Archived projects were checked (Sanger) or genomic PCR was conducted over the problem region (GSC).
New sequence files were created.
These were incorporated into clone linkage groups.
The database was rebuilt.
Sequencing errors were resolved, meaning a large number of gene predictions were revised.

Known sequencing errors still exist in problematic clones.
  Incomplete archive: some early clone projects are not available.
  Poor quality, unfinished projects.
Strategy for resolving this issue:
  Genomic PCR of known problems.

WHAT WE HAVE DONE.

New Gene database class to store genes.
New Gene_name class to aid querying.
Devolution of the Sequence class (this required a large number of changes to acedb):
  RNA sequences transferred to the new Transcript class.
  Protein-coding sequences go in the new CDS class.
  Pseudogenes get their own class.
Introduced Gene objects (April):
  Part 1: replaced use of Locus objects (WS124). WS124/125 will remain in testing.
  Part 2: replace links to CDS where relevant (WS126). WS126 will be the 1st release to be available on the main site.
All genes will have a stable ID.

FUTURE.

Transfer Gene IDs to a MySQL database to store details on genes:
  Track merges, splits etc.
  Add history information to genes.
  Located at Sanger, accessible to all in WormBase.
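
The idea of a stable gene ID that survives renames, merges and splits can be illustrated with a small in-memory sketch. This is not the WormBase implementation: the poster describes an acedb Gene class and a planned MySQL database, whereas the code below is plain Python. The class names GeneRegistry and Gene, the "GENE" ID prefix, the ID width and the example gene names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Gene:
    """One gene record with a stable ID (hypothetical structure)."""
    gene_id: str
    names: List[str] = field(default_factory=list)    # current names only
    live: bool = True                                 # False once merged away
    merged_into: Optional[str] = None                 # ID of the surviving gene
    history: List[str] = field(default_factory=list)  # human-readable event log


class GeneRegistry:
    """Issues stable IDs and records renames, merges and splits."""

    def __init__(self, prefix: str = "GENE") -> None:
        self._prefix = prefix            # hypothetical prefix, not a WormBase format
        self._counter = 0
        self._genes: Dict[str, Gene] = {}
        self._name_index: Dict[str, str] = {}   # current name -> gene ID

    def _attach(self, gene: Gene, name: str) -> None:
        gene.names.append(name)
        self._name_index[name] = gene.gene_id
        gene.history.append(f"name added: {name}")

    def create(self, *names: str) -> Gene:
        """Create a gene with a fresh ID; IDs are never reused."""
        self._counter += 1
        gene = Gene(gene_id=f"{self._prefix}{self._counter:08d}")
        gene.history.append("created")
        self._genes[gene.gene_id] = gene
        for name in names:
            self._attach(gene, name)
        return gene

    def rename(self, gene_id: str, old: str, new: str) -> None:
        """Swap one name for another; the ID stays the same."""
        gene = self._genes[gene_id]
        gene.names.remove(old)
        del self._name_index[old]
        gene.history.append(f"name removed: {old}")
        self._attach(gene, new)

    def merge(self, keep_id: str, retire_id: str) -> Gene:
        """Fold one gene into another; the retired ID keeps pointing at the survivor."""
        if keep_id == retire_id:
            raise ValueError("cannot merge a gene with itself")
        keep, retired = self._genes[keep_id], self._genes[retire_id]
        for name in list(retired.names):
            retired.names.remove(name)
            self._attach(keep, name)
        retired.live = False
        retired.merged_into = keep_id
        retired.history.append(f"merged into {keep_id}")
        keep.history.append(f"absorbed {retire_id}")
        return keep

    def split(self, gene_id: str, *new_names: str) -> Gene:
        """Create a new gene from part of an existing one and link their histories."""
        parent = self._genes[gene_id]
        child = self.create(*new_names)
        child.history.append(f"split from {gene_id}")
        parent.history.append(f"split, producing {child.gene_id}")
        return child

    def get(self, gene_id: str) -> Gene:
        """Resolve an ID, following merges to the surviving gene."""
        gene = self._genes[gene_id]
        while gene.merged_into is not None:
            gene = self._genes[gene.merged_into]
        return gene

    def find(self, name: str) -> Gene:
        """Look up a gene by one of its current names."""
        return self.get(self._name_index[name])


if __name__ == "__main__":
    registry = GeneRegistry()
    locus = registry.create("abc-1")       # made-up locus-style name
    seq = registry.create("C01X1.1")       # made-up sequence-style name
    registry.merge(locus.gene_id, seq.gene_id)
    for event in registry.find("C01X1.1").history:
        print(event)
```

In this sketch a retired ID is never deleted, so a reference recorded against an old release can still be resolved to the current gene, which is the property the stable-ID scheme is meant to provide.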