VectorBase Frank Collins, Scott Emrich, Dan Lawson, Greg Madey BRC PI/PM Meeting Bethesda, MD April 27, 2012.


1 VectorBase Frank Collins, Scott Emrich, Dan Lawson, Greg Madey BRC PI/PM Meeting Bethesda, MD April 27, 2012

2

3 Genome Sizes
  Pediculus humanus: ~110 Mb, N50 = 488 kb
  Anopheles gambiae S: ~260 Mb, N50 = 1,505 kb
  Culex quinquefasciatus: ~580 Mb, N50 = 487 kb
  Aedes aegypti: ~1.3 Gb, N50 = 1,500 kb
  Ixodes scapularis: ~1.8 Gb, N50 = 72 kb
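The N50 values on the genome sizes slide summarize assembly contiguity: N50 is the scaffold length L such that scaffolds of length L or longer hold at least half of the assembled bases. A minimal Python sketch of that calculation follows. The function name `n50` and the toy scaffold lengths are illustrative assumptions, not VectorBase code or data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L
    together contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)  # longest scaffold first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in kb), not taken from the slides.
toy_scaffolds = [800, 500, 300, 200, 100, 100]
print(n50(toy_scaffolds))  # prints 500
```

Sorting from longest to shortest and accumulating until half of the total is reached keeps the calculation linear after the sort, which matters little here but scales to assemblies with many thousands of scaffolds.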

4 Future genomes
White papers
  Sandflies: Lutzomyia longipalpis, Phlebotomus papatasi
  Anopheles (AGCC): Anopheles arabiensis, Anopheles quadriannulatus, Anopheles merus, Anopheles melas, Anopheles christyi, Anopheles epiroticus, Anopheles stephensi, Anopheles maculatus, Anopheles funestus, Anopheles minimus, Anopheles culicifacies, Anopheles farauti, Anopheles dirus, Anopheles atroparvus, Anopheles albimanus
  Glossina: Glossina palpalis, Glossina fuscipes, Glossina pallidipes, Glossina brevipalpis, Glossina austeni
  Stomoxys calcitrans
  Musca domestica
  Simulium: Simulium vittatum, Simulium sirbanum, Simulium damnosum, Simulium ochraceum, Simulium squamosum, Simulium thyolense, Simulium sanctipauli, Simulium woodi, Simulium exiguum, Simulium yahense
  Tick & Mites: Leptotrombidium deliense, Ixodes scapularis*, Dermacentor variabilis, Ornithodoros turicata
  Anopheles: Anopheles darlingi*, Anopheles stephensi
Others
  Aedes: Aedes albopictus
  i5K initiative

5 First New Release in New Contract

6

7

8 Challenges of vector genomes
  Relatively large genomes from species that are hard to inbreed
  Heterozygosity in sequencing samples (up to 80 different males were sequenced for the new gambiae genomes) causes dubious scaffolds
  Inversions and heterochromatic regions induce gaps
  Newer-generation sequencing has reduced cost but has not yet matched earlier overall quality
  Non-trivial annotations

9 An. gambiae forms
M-form
  More permanent
  Available year-round
  Allows slower development
  Predator-rich
S-form
  Ephemeral
  Rainy-season dependent
  Requires rapid development
  Largely predator-free

10 Divergence across chromosome arms (2L, 2R, X, 3R, 3L). C. Cheng et al., unpublished

11 Optical mapping DBP: Wisconsin

12 Size matters
  Genome          MB optically mapped    Genes found
  S Sanger        145,837.97             14162
  S Illumina      58,192.13              14124
  PEST            60,239.6               14324
  Sanger + Ill    204,030.1              14224

13 Annotation strategies
Speeding up computational annotation
  Use of MAKER system
  Prediction by projection from 'high quality' reference
Expanded use of RNA-Seq
  Scripture, Trinity & Cufflinks/Bowtie
Community engagement
  Primarily deployed for new genomes (Glossina, Rhodnius)
  Works for all other VectorBase genomes

14 de novo annotation
MAKER with RNA-Seq & reference proteomes
  Aim: Gene prediction pipeline for the masses
  Used for a number of arthropod genome projects
  Touted as the default pipeline for many more (part of the GMOD toolkit)
Overview
  ab-initio gene predictions from SNAP, Augustus & FGENESH
  Final gene models from MAKER
  EST alignments from both EXONERATE and BLASTN
  Protein alignments from EXONERATE and BLASTX
  Repeats from RepeatFinder & RepeatMasker
  Additional data sets integrated via GFF3 files (RNA-Seq)
  Uses MPI for parallelization over a compute farm
  Optimization for long scaffolds
Summary
  Iterative runs give acceptable reference gene sets
  Used for Glossina and An. stephensi
  Used by others for Strigamia, Manduca, published ant genomes

15 Community annotation
Simple tool to capture community annotation
  Makes gene prediction and evidence available as GFF3
  Compatible with Artemis and Apollo tools
Submissions in GFF3 format
  Gene structure corrections
  Gene metadata (symbol, description, citations)
Glossina annotation effort (Nov 11 – Apr 12)
  790 GFF submissions
  2670 items of metadata
    Gene symbols, descriptions
    Structure confirmation
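Community submissions carry gene metadata such as symbols and descriptions in GFF3. A rough Python sketch of how one such gene line could be assembled is shown here. The mapping of symbol to `Name` and description to `Note` uses attribute tags reserved by the GFF3 specification, but whether the VectorBase tool expects exactly these tags is an assumption. The function `gff3_gene_line`, the source label `community`, and the example identifiers are all hypothetical.

```python
from urllib.parse import quote


def gff3_gene_line(seqid, start, end, strand, gene_id,
                   symbol=None, description=None):
    """Build one tab-separated GFF3 'gene' line with metadata in column 9."""
    if start > end:
        raise ValueError("GFF3 requires start <= end")

    def esc(text):
        # Percent-encode characters that are reserved in GFF3 column 9
        # (such as ';', '=', '&', ',') while leaving spaces readable.
        return quote(text, safe=" :|/()")

    attrs = [f"ID={esc(gene_id)}"]
    if symbol:
        attrs.append(f"Name={esc(symbol)}")       # assumed tag for the symbol
    if description:
        attrs.append(f"Note={esc(description)}")  # assumed tag for the description
    cols = [seqid, "community", "gene", str(start), str(end),
            ".", strand, ".", ";".join(attrs)]
    return "\t".join(cols)


# Hypothetical submission; the gene ID, symbol, and description are invented.
line = gff3_gene_line("scf7180000638805", 52, 605, "+", "gene0001",
                      symbol="hypA", description="putative kinase; curated")
print(line)
```

Escaping the semicolon inside the description keeps it from being read as an attribute separator when the file is loaded back into a genome editor.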

16 ARTEMIS / APOLLO
GFF3
  scf7180000638805 ptn2genome ptn_match 52 605 892 + . ID=xxxx;Name=tr|Q3UIQ2|
  scf7180000638805 ptn2genome ptn_match 78 205 960 + . ID=xxxx2;Name=tr|Q3TIU7|
  scf7180000638805 ptn2genome ptn_match 52 305 696 + . ID=xxxx3;Name=sp|Q91VD9|
  scf7180000638805 ptn2genome ptn_match 78 205 950 + . ID=xxxx2;Name=tr|Q3VIU732|
FASTA
  >MY SUPERCONTIG
  ATATATGCGTTGAGCTGCGTTACGTTCGG
  GATGCGTTAGGCTTGTGAGCTGGATCGGT
  CCTGCCTGCGTCGATATAAACGACCT…
Steps: Identify gene; Modify model; Submit (CAP)
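The evidence lines on this slide follow the nine-column GFF3 layout (seqid, source, type, start, end, score, strand, phase, attributes). A minimal Python sketch of reading one such line is given here. It assumes the columns are tab-separated, as the GFF3 specification requires, even though the transcript shows them separated by spaces. The names `Feature` and `parse_gff3_line` are illustrative, not part of Artemis, Apollo, or VectorBase.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line into a Feature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = unquote(value)  # undo percent-encoding
    return Feature(seqid, source, ftype, int(start), int(end),
                   None if score == "." else float(score),
                   strand, phase, attributes)


# First evidence line from the slide, rebuilt with tabs (an assumption).
example = "\t".join(["scf7180000638805", "ptn2genome", "ptn_match",
                     "52", "605", "892", "+", ".",
                     "ID=xxxx;Name=tr|Q3UIQ2|"])
feat = parse_gff3_line(example)
print(feat.start, feat.end, feat.attributes["Name"])  # prints 52 605 tr|Q3UIQ2|
```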

17 Population biology
Chado Natural diversity schema
  183 projects, 15,190 samples
  Incorporates IRbase samples
Ensembl variation schema
  1,511,335 SNP calls
Visualization through browser
Data downloads through browser
Queries via BioMart interface

