VectorBase BRC4 2006: VectorBase annotation metrics. Daniel Lawson, VectorBase-EBI, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton.

Presentation transcript:

VectorBase BRC VectorBase annotation metrics Daniel Lawson VectorBase-EBI, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton UK

Topics
Annotation metrics
– Numbers (gene numbers & xrefs)
– Data types (availability & integration)
Annotation SOPs
– Genome specific
– Gene specific
– Gene build profile & prediction confidence

                        AaegL1.1          AgamP3.3          Yeast     Worm      Fly       Human
Gene
  Gene count            16,691            13,765            7,098     21,105    14,752    31,206
  Protein-coding        15,419 (92.4 %)   13,277 (96.5 %)   6,680     20,060    14,086    23,245
  Other                 1,272 (7.6 %)     488 (3.5 %)       418       1,045     666       7,961
Transcript
  Transcript count      18,061            14,127            -         -         -         -
  Protein-coding        16,789 (93.0 %)   13,639 (96.5 %)   -         -         -         -
  Other                 1,272 (7.0 %)     488 (3.5 %)       -         -         -         -
Manual effort
  Manually reviewed     0 (0.0 %)         261 (1.9 %)       6,680     20,060    14,086    6,995
  Community input       0 (0.0 %)         667 (4.9 %)       4,684     7,228     9,945     16,887
Orthologs
  Combined              11,487 (74.5 %)   9,782 (73.7 %)    -         -         -         -
  A. aegypti            n/a               8,907 (67.1 %)    2,202     4,416     7,991     6,590
  A. gambiae            9,923 (54.9 %)    n/a               2,228     4,444     7,702     6,612
  C. elegans            4,923 (29.5 %)    4,442 (33.4 %)    2,185     n/a       4,598     6,121
  D. melanogaster       9,078 (50.3 %)    7,649 (57.6 %)    2,228     4,543     n/a       6,654
  H. sapiens            5,510 (33.0 %)    5,046 (38.0 %)    2,326     4,473     5,109     n/a
  S. cerevisiae         2,520 (15.1 %)    2,350 (17.7 %)    n/a       2,349     2,470     3,265
Functional annotation
  GO terms              9,335 (51.7 %)    7,601 (55.7 %)    4,176     11,334    10,226    17,000
  EC numbers            2,950 (16.3 %)    2,230 (16.4 %)    4,103 *   5,240 *   4,009 *   13,245 *
  InterPro              11,536 (74.8 %)   9,869 (72.4 %)    4,611     14,730    10,475    18,199
Expression evidence
  Combined              12,350 (80.0 %)   7,557 (55.4 %)    -         -         -         -
  cDNA/EST              9,270 (60.1 %)    7,557 (55.4 %)    -         -         -         -
  Microarray            9,143 (59.2 %)†   0 (0.0 %)‡        -         -         -         -
  MPSS                  3,984 (25.8 %)†   n/a               -         -         -         -

Considerations
– Importance of calculating all metrics using similar methodology from the same data set
– Metrics calculated from Ensembl using BioMart & raw SQL queries
– GO terms: many ways of calculating (InterPro2GO, projection from Drosophila orthologs)
– No VectorBase capability to automatically assign EC numbers
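The slide says the metrics were calculated from Ensembl with BioMart and raw SQL queries. As a minimal sketch of the SQL route, the example below counts genes per biotype and reports each count as a share of all genes. It assumes a table named `gene` with a `biotype` column, as in an Ensembl core database, but runs against a toy in-memory SQLite table so it is self-contained. The toy rows and the `biotype_counts` helper are hypothetical and are not the queries actually used for the slides.

```python
import sqlite3


def biotype_counts(conn):
    """Return {biotype: (count, percent of all genes)} from a `gene` table."""
    total = conn.execute("SELECT COUNT(*) FROM gene").fetchone()[0]
    rows = conn.execute(
        "SELECT biotype, COUNT(*) FROM gene GROUP BY biotype ORDER BY biotype"
    ).fetchall()
    # Every percentage uses the same denominator (all genes in this table).
    return {biotype: (n, round(100.0 * n / total, 1)) for biotype, n in rows}


# Toy stand-in for a core database: 9 protein-coding genes and 1 tRNA gene.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, biotype TEXT)")
toy_rows = [("protein_coding",)] * 9 + [("tRNA",)]
conn.executemany("INSERT INTO gene (biotype) VALUES (?)", toy_rows)

metrics = biotype_counts(conn)
for biotype, (n, pct) in metrics.items():
    print(f"{biotype}: {n} ({pct} %)")
```

Taking every count and its denominator from one connection in one pass is one way to follow the first consideration above: all figures then come from the same data set and the same method.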

                      AaegL1.1                                     AgamP3.3
Data type             Available  Integration                      Available  Integration
Sequence              Yes        Download, search, visualization  Yes        Download, search, visualization
Polymorphisms         No         n/a                              Yes        Search, visualization
Genetic maps          Yes        Not integrated                   Yes        Visualization
Syntenic alignment    Yes        Visualization                    Yes        Visualization
cDNAs & ESTs          Yes        Download, search, visualization  Yes        Download, search, visualization
SAGE tags             No         n/a                              No         n/a
Microarrays           Yes        Visualization                    Yes        Visualization
MPSS                  Yes        Not integrated                   No         n/a
Proteomics            No         n/a                              No         n/a
Structures            No         n/a                              No         n/a
Interactome data      No         n/a                              No         n/a
Pathways              No         n/a                              No         n/a
Orthology profiles    Yes        Visualization                    Yes        Visualization
Essentiality data     No         n/a                              No         n/a

VectorBase gene prediction pipeline (SOP)
[Pipeline diagram. Boxes: Blessed predictions; Community submissions; Manual annotations; Species-specific predictions; Similarity predictions; Transcript based predictions; Ab initio gene predictions; Protein family HMMs; ncRNA predictions; Canonical gene set. SOP labels: VB:SOP001; VB:SOP002 & SOP003; VB:SOP004; VB:SOP005; VB:SOP007; VB:SOP008; VB:SOP009; VB:SOP010.]

Assignment of SOPs to VectorBase genes: AgamP3.3

SOP         Description                             No. genes
VB:SOP001   Confirmed                               674
VB:SOP002   Protein-based with transcript support   3765
VB:SOP003   Protein-based                           4830
VB:SOP004   Transcript-based                        2857
VB:SOP005   Supported ab initio                     585
VB:SOP006   ab initio                               0
VB:SOP007   Manual annotation                       928
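To make the gene build profile easier to read, the counts on this slide can be turned into shares of the total. The sketch below does only that arithmetic. The counts are copied from the slide; the variable names are illustrative, and expressing the result as percentages is an addition, not something shown on the slide.

```python
# SOP assignments for AgamP3.3, copied from the slide: (description, count).
sop_genes = {
    "VB:SOP001": ("Confirmed", 674),
    "VB:SOP002": ("Protein-based with transcript support", 3765),
    "VB:SOP003": ("Protein-based", 4830),
    "VB:SOP004": ("Transcript-based", 2857),
    "VB:SOP005": ("Supported ab initio", 585),
    "VB:SOP006": ("ab initio", 0),
    "VB:SOP007": ("Manual annotation", 928),
}

total = sum(count for _, count in sop_genes.values())

# Share of each SOP as a percentage of all assigned predictions.
shares = {
    sop: round(100.0 * count / total, 1)
    for sop, (_, count) in sop_genes.items()
}

for sop, (description, count) in sop_genes.items():
    print(f"{sop}  {description:<40} {count:>6}  {shares[sop]:>5} %")
print(f"Total: {total}")
```

The counts sum to 13,639, the same figure the metrics table gives for AgamP3.3 protein-coding transcripts.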

Display of Metrics & SOPs
Metrics
– VectorBase wiki
– Species-page containing the three tables available from the VectorBase species homepage
– Expansion of documents relating to genomic resources (citations, links to primary data where possible)
– Single collated table for BRC as separate download
SOPs
– VectorBase wiki
– ‘Documents’ section of main site

Manual annotation progress

                      Protein-coding   VectorBase        Community
                      gene no.         manual            submission
Anopheles gambiae
  AgamP3.3            13,277           261 (2.0 %)       667 (5.0 %)
  current                              2,474 (18.6 %)    667* (5.0 %)
Aedes aegypti
  AaegL1.1            15,419           0 (0.0 %)
  current                              0 (0.0 %)         341 (2.2 %)

Merging gene sets
Inputs: Gene set #1, Gene set #2
– Reduce to single predictions per locus
– Compare exon/intron structures
– Comparison categories: Identical structures, Compatible structures, Different structures, Merge/Split structures, Complex, No Map
– Add isoform predictions based on EST/Peptide data
Result: Canonical gene set
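The slide names the comparison categories but does not define them, so the sketch below is one possible reading, not the VectorBase procedure. It assumes each prediction is a list of `(start, end)` exon coordinates on the same sequence and strand. "Compatible" is taken to mean an identical intron chain with different outer exon boundaries, and "merge/split" to mean one prediction overlapping several in the other set. The "Complex" category is left out, and the function names are hypothetical.

```python
def introns(exons):
    """Intron chain as (end of exon, start of next exon) pairs."""
    exons = sorted(exons)
    return [(left[1], right[0]) for left, right in zip(exons, exons[1:])]


def overlaps(a, b):
    """True if the genomic spans of two exon lists overlap."""
    a, b = sorted(a), sorted(b)
    return a[0][0] <= b[-1][1] and b[0][0] <= a[-1][1]


def classify(pred, other_set):
    """Classify one prediction against all predictions in the other gene set."""
    hits = [other for other in other_set if overlaps(pred, other)]
    if not hits:
        return "no map"
    if len(hits) > 1:
        return "merge/split"
    hit = hits[0]
    if sorted(pred) == sorted(hit):
        return "identical"
    # Assumption: same introns but different outer boundaries is "compatible".
    if introns(pred) and introns(pred) == introns(hit):
        return "compatible"
    return "different"


# Toy gene set #2: one two-exon gene and two neighbouring single-exon genes.
set2 = [[(100, 200), (300, 400)], [(1000, 1100)], [(1150, 1300)]]

print(classify([(100, 200), (300, 400)], set2))  # same exons
print(classify([(50, 200), (300, 450)], set2))   # same intron, longer ends
print(classify([(100, 250), (300, 400)], set2))  # different intron
print(classify([(1050, 1200)], set2))            # spans two genes in set #2
print(classify([(5000, 5100)], set2))            # nothing overlapping
```

Under this reading, identical and compatible pairs could be reduced to a single prediction automatically, while the remaining categories would need further rules or manual review.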