Importing Community annotations into VectorBase
Aims
- Provide the VectorBase community with tools for improving genome annotation
- Tools must have low entry requirements, be scalable, and be (relatively) simple to use
Genome annotation
First-pass genome annotation is almost always based on “automatic” computational approaches:
- Ab initio
- Similarity based
  - Transcript (ESTs, RNAseq)
  - Protein (nr protein database)
Genome annotation - building a pipeline
- Genome assembly
- Map repeats
- Genefinding
- Protein-coding genes
- Map transcripts
- Map peptides
- nc-RNAs
- Functional annotation
- Submission to archival databases
- (Release)
Current VectorBase annotation pipeline
- MAKER-based automatic annotation
  - Includes SNAP training and ab initio prediction
  - RNAseq-based transcript similarity prediction
  - Taxonomically constrained peptide similarity prediction
  - 2 rounds of prediction refinement; the final round includes all peptide similarity
- Community annotation phase
  - Capture gene structure changes
  - Metadata associated with locus (symbol, description, citation)
- Submission to INSDC, propagation to UniProt
- Presentation through VectorBase
Start -> 1.0 set (automatic) -> 1.1 set (published)
Processing submissions
4 phases:
1. Capture
2. Moderation
3. Storage
4. Integration
Capture: Community annotation decision tree
Community annotation decision tree
Tool of choice: WebApollo
- Web-based
- Eliminates the main drawback of the deprecated CAP system: GFF3 format validation
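To illustrate the kind of GFF3 format validation that file-based submissions required, here is a minimal sketch. It is not the validator used by CAP or VectorBase; the function names are hypothetical, and it checks only a few rules from the public GFF3 specification (nine tab-separated columns, coordinate order, strand, phase, and key=value attributes).

```python
VALID_STRANDS = {"+", "-", ".", "?"}
VALID_PHASES = {"0", "1", "2", "."}


def validate_gff3_line(line):
    """Return a list of problems found in one GFF3 feature line (empty if none)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-separated columns, found {len(cols)}"]
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    problems = []
    # Coordinates are 1-based integers with start <= end.
    if not (start.isdecimal() and end.isdecimal()):
        problems.append("start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("start must be >= 1 and <= end")
    # Score is a number, or '.' when undefined.
    if score != ".":
        try:
            float(score)
        except ValueError:
            problems.append("score must be a number or '.'")
    if strand not in VALID_STRANDS:
        problems.append(f"invalid strand {strand!r}")
    if phase not in VALID_PHASES:
        problems.append(f"invalid phase {phase!r}")
    elif ftype == "CDS" and phase == ".":
        problems.append("CDS features require a phase of 0, 1 or 2")
    # Attributes are ';'-separated key=value pairs.
    if attrs != ".":
        for pair in attrs.split(";"):
            if pair and "=" not in pair:
                problems.append(f"attribute {pair!r} is not in key=value form")
    return problems


def validate_gff3(text):
    """Validate GFF3 text; return {line_number: [problems]} for failing lines."""
    report = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        problems = validate_gff3_line(line)
        if problems:
            report[number] = problems
    return report
```

Calling `validate_gff3(open(path).read())` on a submitted file would give a per-line report that could be returned to the submitter. A web-based editor avoids this step because users never hand-edit the file.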
WebApollo example
Community annotation decision tree
Tool of choice: Web forms
Moderation & Storage
- Gene metadata is captured through forms into spreadsheets
- Batch submissions use a similar spreadsheet format
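A minimal sketch of how rows from such a spreadsheet might be pre-screened before a moderator reviews them. The column names (`gene_id`, `symbol`, `description`, `citation`), the CSV export, and the rule that a citation is a numeric PubMed ID are all assumptions made for illustration; the slides do not specify the actual spreadsheet layout or moderation rules.

```python
import csv
import io

# Hypothetical column names; the real form/spreadsheet layout is not given.
REQUIRED = ("gene_id", "symbol", "description", "citation")


def moderate_rows(csv_text):
    """Split spreadsheet rows into (accepted, rejected) lists.

    Each rejected entry is a (row, reasons) tuple for the moderator.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    accepted, rejected = [], []
    for raw in reader:
        # Short rows yield None for absent cells; normalise to empty strings.
        row = {key: (raw[key] or "").strip() for key in REQUIRED}
        reasons = []
        if not row["gene_id"]:
            reasons.append("missing gene_id")
        if not (row["symbol"] or row["description"]):
            reasons.append("no symbol or description supplied")
        if " " in row["symbol"]:
            reasons.append("symbol contains whitespace")
        # Assumption: citations are supplied as numeric PubMed IDs.
        if row["citation"] and not row["citation"].isdecimal():
            reasons.append("citation is not a numeric PubMed ID")
        if reasons:
            rejected.append((row, reasons))
        else:
            accepted.append(row)
    return accepted, rejected
```

Because form submissions and batch submissions share a similar layout, one check of this kind could serve both routes.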
Integration: Dataflow for ‘patch’ build
[Dataflow diagram. Labels: Users, CAP, GFF3, WebApollo, Google Form, Metadata, Google Fusion Table, Reference core, Updated geneset, TXT, Patch, Stable IDs, Reports, Updated core IDs, Release core, Xrefs, Release Xrefs, Commit]
Presentation of community annotation
Usage (as of )
- 31 WebApollo instances (organisms)
  - 3,407 gene models
- Gene metadata (protein-coding loci)
  - 4,987 gene symbols
  - 512 gene synonyms
  - 57,878 gene descriptions
  - 910 loci citations from 208 publications
Supplementing annotations
- Community jamborees
  - ‘Standard’ improvement (e.g. Sandfly, snail communities)
  - Glossina community (e.g. March 2015, Kenya)
- VectorBase
  - Default Xref run includes symbol/description assignment via UniProt
  - Projection of gene description via orthology from key marker species (e.g. An. gambiae). Due to be deployed for June (VB ) release.
- Supplemental data from genome papers (e.g. 16 Anopheles spp, Musca)
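The projection of gene descriptions via orthology could be sketched as follows. This is an illustration under stated assumptions, not the VectorBase implementation: the function name and inputs are hypothetical, only one-to-one orthologs are used, and existing descriptions on the target species are never overwritten.

```python
from collections import Counter


def project_descriptions(orthologs, reference_desc, target_desc):
    """Project descriptions from a reference species onto a target species.

    orthologs      -- iterable of (target_gene, reference_gene) pairs
    reference_desc -- {reference_gene: description} for the marker species
    target_desc    -- {target_gene: description} already held for the target
    Returns {target_gene: projected_description} for newly described genes.
    """
    pairs = list(orthologs)
    target_counts = Counter(target for target, _ in pairs)
    ref_counts = Counter(ref for _, ref in pairs)
    projected = {}
    for target, ref in pairs:
        # Assumption: only one-to-one orthologs are trusted for projection.
        if target_counts[target] != 1 or ref_counts[ref] != 1:
            continue
        # Never overwrite a description the target gene already has.
        if target_desc.get(target):
            continue
        description = reference_desc.get(ref)
        if description:
            projected[target] = description
    return projected
```

Restricting projection to one-to-one relationships is a conservative design choice: it leaves paralogous families undescribed rather than risk giving several genes the same, possibly wrong, description.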
Deprecated CAP system example