Building WormBase database(s). SAB 2008. Wellcome Trust Sanger Institute, Cold Spring Harbor Laboratory, California Institute of Technology.

Presentation transcript:

Building WormBase database(s)

SAB 2008
The WormBase Consortium
● Wellcome Trust Sanger Institute
● Cold Spring Harbor Laboratory
● California Institute of Technology
● Washington University in St. Louis

Literature Curation
● RNAi
● Microarray
● Anatomy / Cell
● Homology groups
● SAGE data
● Gene Ontology
● Papers / References
● Person / Author
● Detailed Functional Annotation
● Expression Patterns
● PCR_products / Oligos
● 3D structures

Website and tools

Gene prediction annotation
Comparative analysis

Genetic Data
– Alleles
– Gene name info (incl. unique IDs)
– Strains

Data Integration and analysis

Gene Structure curation
● Gene prediction annotation
● SNPs

Build Process
● 99% Perl scripts
● Continued improvements in
– modularisation
– logging and error checking
– de-eleganisation, e.g. Species modules
● Species modules
– Inherited classes, 1 per species
– Access to names, sequences, paths etc.
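The "one inherited class per species" idea can be illustrated with a short sketch. The real build scripts are Perl; this is Python purely for illustration, and the class names, the `announce` step, and the `/build` directory layout are all hypothetical rather than taken from the WormBase code.

```python
from pathlib import PurePosixPath


class Species:
    """Base class holding everything a build step may ask about a species."""

    genus = ""
    species = ""

    def __init__(self, base_dir="/build"):  # "/build" is a made-up default
        self.base_dir = PurePosixPath(base_dir)

    @property
    def full_name(self):
        return f"{self.genus} {self.species}"

    @property
    def short_name(self):
        return f"{self.genus[0]}. {self.species}"

    def sequence_path(self):
        # Hypothetical layout: <base_dir>/<species>/sequences
        return str(self.base_dir / self.species / "sequences")


class Elegans(Species):
    genus, species = "Caenorhabditis", "elegans"


class Briggsae(Species):
    genus, species = "Caenorhabditis", "briggsae"


def announce(sp):
    """A species-independent step: it only uses the Species interface."""
    return f"Building {sp.full_name} from {sp.sequence_path()}"


if __name__ == "__main__":
    for sp in (Elegans(), Briggsae()):
        print(announce(sp))
```

Because `announce` never mentions a particular species, adding another one would only need a new subclass, which is roughly what "de-eleganisation" appears to aim for.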

Build Overview
● Initiate
– FTP uploads from other sites
● Recreate primary databases
– Class by class extraction
– Load to fresh database
● Blat
– Align cDNAs etc. to genome
● Transcript building
– Use alignments etc. to construct coding transcripts
– Generate UTRs and genespans

Build stages shown on the slide: INITIALISE, MAPPING, BLAT, BLAST PIPELINE, FINAL CHECK, COMPARA, BUILD TRANSCRIPTS, GFF POST-PROCESS, RELEASE, ONTOLOGY, CLEAN UP
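As a rough illustration of what "Generate UTRs and genespans" involves, the sketch below derives UTR segments from a transcript's exons and its coding range, and a gene span from all transcripts of a gene. It assumes 1-based inclusive coordinates and a forward-strand gene; the function names are invented for this example and are not those of the build scripts.

```python
def split_utrs(exons, cds_start, cds_end):
    """Return (five_prime, three_prime) UTR segments for a forward-strand transcript.

    exons is a list of (start, end) pairs; cds_start/cds_end bound the coding region.
    """
    five_prime, three_prime = [], []
    for start, end in sorted(exons):
        if start < cds_start:
            # Part of the exon lying upstream of the coding region.
            five_prime.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            # Part of the exon lying downstream of the coding region.
            three_prime.append((max(start, cds_end + 1), end))
    return five_prime, three_prime


def gene_span(transcripts):
    """Smallest interval covering every exon of every transcript of a gene."""
    starts = [start for exons in transcripts for start, _ in exons]
    ends = [end for exons in transcripts for _, end in exons]
    return min(starts), max(ends)


if __name__ == "__main__":
    exons = [(100, 200), (300, 400), (500, 600)]
    print(split_utrs(exons, cds_start=150, cds_end=550))
    print(gene_span([exons, [(90, 200), (300, 450)]]))
```

A reverse-strand gene would swap the roles of the two UTR lists; that case is left out to keep the sketch short.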

Build Overview
● BLAST Pipeline
– Genomic DNA: RepeatMasker, Blastx
– Human, fly, yeast, other worms, SwissProt/TrEMBL
– Proteins: Blastp, PFAM, InterPro, TMHMM
– Ensembl mysql databases using Ensembl schema and code
– Results dumped as ace or GFF3
● Compara
– Provides gene families and multi-genome alignments.

Build Overview
● Mapping
– Ensure correct location of features and experimental data on the genome sequence regardless of changes.
– Ensure connection to the correct genes even after gene model changes.
– Done for e.g. RNAi, Variations, PCR_products
– We have also developed a publicly available tool to easily transform coordinates between any pair of releases.
● Ontology
– Infer GO terms from InterPro domains and phenotypes
– Write out files for ?
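The coordinate-transformation idea can be sketched as follows. This is not the publicly available WormBase tool; it is a minimal illustration that assumes the differences between two releases are available as a list of `(old_start, old_len, new_len)` records for one chromosome, a format invented here for the example.

```python
def remap(pos, changes):
    """Map a 1-based position from the old release to the new one.

    Each change replaces old_len bases starting at old_start with new_len bases.
    old_len == 0 means an insertion placed just before old_start.
    Returns None when pos falls inside a region that was itself changed.
    """
    shift = 0
    for old_start, old_len, new_len in sorted(changes):
        if pos < old_start:
            break  # remaining changes lie downstream and cannot affect pos
        old_end = old_start + old_len - 1
        if old_len > 0 and pos <= old_end:
            return None  # position sits inside altered sequence
        shift += new_len - old_len
    return pos + shift


if __name__ == "__main__":
    # Hypothetical edits: 5 bases inserted before 100, bases 200-209 deleted.
    changes = [(100, 0, 5), (200, 10, 0)]
    for old in (50, 150, 205, 210):
        print(old, "->", remap(old, changes))
```

Transforming between any pair of releases would then amount to chaining such per-release change lists, with a feature flagged for manual review as soon as one step returns `None`.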

Build Overview
● GFF Processing
– Add extra info to GFF files to enhance the genome browser, e.g. gene names to CDS, landmark genes, species info to transcript alignments
● Final Checks
– Consistency between GFF and acedb.
– Class counts of loaded objects.
● Release
– Autogenerate release notes
– FTP and websites
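A minimal sketch of the "gene names to CDS" post-processing step is shown below. It assumes GFF3-style `key=value` attributes in column 9, an `ID` attribute on each CDS line, and a lookup table from CDS ID to gene name; the `gene_name` attribute key and the example IDs are hypothetical.

```python
def add_gene_names(gff_lines, cds_to_gene):
    """Append a gene_name attribute to every CDS line whose ID is in cds_to_gene."""
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Leave comments, blank lines and non-CDS features untouched.
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            out.append(line)
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        name = cds_to_gene.get(attrs.get("ID", ""))
        if name:
            cols[8] = cols[8].rstrip(";") + f";gene_name={name}"
        out.append("\t".join(cols))
    return out


if __name__ == "__main__":
    lines = [
        "##gff-version 3",
        "I\tcurated\tCDS\t100\t200\t.\t+\t0\tID=CDS:example.1",
        "I\tcurated\texon\t100\t200\t.\t+\t.\tID=exon:example.1.e1",
    ]
    for row in add_gene_names(lines, {"CDS:example.1": "abc-1"}):
        print(row)
```

Working line by line keeps memory use flat, which matters for whole-genome GFF files.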

Building other species databases
● All tierII species stored as acedb databases.
● All build scripts are (will be) species independent.
● All tierII can be rebuilt exactly the same as C. elegans.
● Update frequency: why not every release?
– Effort : value

Build Process

What’s the point?
● 10% of our time.
● Faster builds, no “dead time”.
● No chance of missing things out.
● Better use of system resource.
● Forces better coding & error checking.

What’s the hold up?
● Tighten up error reporting
– Differentiate “show stoppers” from undefined variables.
● Make sure of dependencies.
● LSF conversion to LSF::JobManager for parallel work.
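The two concerns above, separating "show stoppers" from minor warnings and respecting dependencies between stages, can be sketched together. The example uses Python's standard `graphlib` module; the stage names echo the slides, but the dependency edges, the `ShowStopper` exception, and the `run_build` function are assumptions made for illustration, not the actual build logic.

```python
from graphlib import TopologicalSorter


class ShowStopper(Exception):
    """An error serious enough to halt the build."""


def run_build(stages, deps):
    """Run stages in dependency order.

    stages maps a stage name to a callable returning a list of warning strings.
    deps maps a stage name to the set of stages that must finish first.
    Returns (completed_stages, warnings, failure_message_or_None).
    """
    unknown = {d for needed in deps.values() for d in needed} - set(stages)
    if unknown:
        raise ValueError(f"dependencies on undefined stages: {sorted(unknown)}")

    graph = {name: deps.get(name, set()) for name in stages}
    completed, warnings = [], []
    for name in TopologicalSorter(graph).static_order():
        try:
            # Minor problems are collected and reported, not fatal.
            warnings.extend(f"{name}: {w}" for w in stages[name]())
        except ShowStopper as err:
            return completed, warnings, f"{name}: {err}"
        completed.append(name)
    return completed, warnings, None


if __name__ == "__main__":
    def failing_stage():
        raise ShowStopper("missing alignments")

    stages = {
        "initialise": lambda: [],
        "blat": lambda: ["undefined variable"],
        "build_transcripts": failing_stage,
    }
    deps = {"blat": {"initialise"}, "build_transcripts": {"blat"}}
    print(run_build(stages, deps))
```

Stages with no path between them in the graph are the ones that could be submitted in parallel, which is where a job manager would come in.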

TierIII Builds
● No acedb database, all stored in Ensembl mysql databases.
● All automatic annotation (blasts, protein domains)
● GFF3 dumping process improved to add extra info, e.g. GO_terms
● Will be included in comparative analyses
● Syntenic regions determined where applicable (closely related species)

TierIII Collaborations
● Sanger Institute Pathogens group.
– Managing the sequencing projects.
– Initial gene predictions.
– Community links.
– Ongoing annotation and gene improvement.
● WormBase help with Ensembl infrastructure
– Alignment and comparative pipelines.
– Automatic protein alignments.
– Some gene prediction assessment.
– Integrated and linked genome browsers.

TierIII Collaborations
● Ensembl-metazoa
– New Ensembl-branded websites covering a much wider range of organisms as a replacement for Genome Reviews.
– Display in Ensembl environment
– Link to other EBI resources, e.g. UniProt
● Proposed model of data providers within established communities.
– Shared data to ensure consistency