Overview of the Drosophila modENCODE hybrid assemblies Wilson Leung01/2014.

Slides:



Advertisements
Similar presentations
Targeted Data Introduction  Many mapping, alignment and variant calling algorithms  Most of these have been developed for whole genome sequencing and.
Advertisements

Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.
Greg Phillips Veterinary Microbiology
Some new sequencing technologies. Molecular Inversion Probes.
Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut.
Evaluation of PacBio sequencing to improve the sunflower genome assembly Stéphane Muños & Jérôme Gouzy Presented by Nicolas Langlade Sunflower Genome Consortium.
Genome sequencing and assembling
Reminder: Class on Friday, Discussion of Li et al. Proposal/Projects CAMERA feedback?
Genome Assembly Bonnie Hurwitz Graduate student TMPL.
Delon Toh. Pitfalls of 2 nd Gen Amplification of cDNA – Artifacts – Biased coverage Short reads – Medium ~100bp for Illumina – 700bp for 454.
Next generation sequencing Xusheng Wang 4/29/2010.
Considerations for Analyzing Targeted NGS Data Introduction Tim Hague, CTO.
GeVab: Genome Variation Analysis Browsing Server Korean BioInformation Center, KRIBB InCoB2009 KRIBB
Mouse Genome Sequencing
Molecular Microbial Ecology
CUGI Pilot Sequencing/Assembly Projects Christopher Saski.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
Introduction to next generation sequencing Rolf Sommer Kaas.
 GEP Digital Laboratory Notebook Nick Reeves, Mt. San Jacinto Community College.
The Strange Case of the Dot Chromosome of Drosophila Sarah C R Elgin Bio 4342 Copyright 2014, Washington University.
Annotation of Drosophila GEP Workshop – August 2015 Wilson Leung and Chris Shaffer.
Next Generation DNA Sequencing
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
The Changing Face of Sequencing
Programs and Web Tools Status Update GEP Alumni Workshop Wilson Leung 08/05/2011.
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
Current Challenges in Metagenomics: an Overview Chandan Pal 17 th December, GoBiG Meeting.
BNFO 615 Usman Roshan. Short read alignment Input: – Reads: short DNA sequences (upto a few hundred base pairs (bp)) produced by a sequencing machine.
Bombus terrestris, the buff-tailed bumble bee Native to Europe A managed pollinator Commercially available Reared in greenhouses Important pollinator in.
Introduction to ab initio and evidence-based gene finding Wilson Leung08/2015.
Searching for Transcription Start Sites in Drosophila
Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015.
Lettuce/Sunflower EST CGPDB project. Data analysis, assembly visualization and validation. Alexander Kozik, Brian Chan, Richard Michelmore. Department.
Annotation of Drosophila primer
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
billion-piece genome puzzle
Genomics Education Partnership: a flexible approach to implement Genomic teachings and research in the classroom Matthew W. Wadsworth and Consuelo J. Alvarez,
Annotation of Drosophila virilis Chris Shaffer GEP workshop, 2006.
Primer on Annotation of Drosophila Genes GEP Workshop – January 2016 Wilson Leung and Chris Shaffer.
Short read alignment BNFO 601. Short read alignment Input: –Reads: short DNA sequences (upto a few hundred base pairs (bp)) produced by a sequencing machine.
__________________________________________________________________________________________________ Fall 2015GCBA 815 __________________________________________________________________________________________________.
Mojavensis: Issues of Polymorphisms Chris Shaffer GEP 2009 Washington University.
Current Data And Future Analysis Thomas Wieland, Thomas Schwarzmayr and Tim M Strom Helmholtz Zentrum München Institute of Human Genetics Geneva, 16/04/12.
1. Assembly by alignment Instead of overlap-layout-consensus we use alignment-consensus 2.
Drosophila Genomics Where are we now? Where are we going? Christopher Shaffer, Wilson Leung, Sarah Elgin Dept of Biology; Washington University in St.
Meet the ants Camponotus floridanus Carpenter ant Harpegnathos saltator Jumping ant Solenopsis invicta Red imported fire ant Pogonomyrmex barbatus Harvester.
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
A high-resolution map of human evolutionary constraints using 29 mammals Kerstin Lindblad-Toh et al Presentation by Robert Lewis and Kaylee Wells.
JERI DILTS SUZANNA KIM HEMA NAGRAJAN DEEPAK PURUSHOTHAM AMBILY SIVADAS AMIT RUPANI LEO WU Genome Assembly Final Results
Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.
De Novo Assembly of Mitochondrial Genomes from Low Coverage Whole-Genome Sequencing Reads Fahad Alqahtani and Ion Mandoiu University of Connecticut Computer.
Searching for Transcription Start Sites in Drosophila
Web Databases for Drosophila
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on
Annotation of Drosophila
Metabolic Networks OF Fruit-fly
MGmapper A tool to map MetaGenomics data
Quality Control & Preprocessing of Metagenomic Data
Denovo genome assembly of Moniliophthora roreri
Professors: Dr. Gribskov and Dr. Weil
Fig. 1. (a) Relationship of codon adaptation index (CAI) and the number of substitutions per fourfold-degenerate site (dS) between D. melanogaster.
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
GEP Annotation Workflow
2nd (Next) Generation Sequencing
Identify D. melanogaster ortholog
Bringing Genomics Research into the Undergraduate Curriculum
Introduction to Sequencing
Nora Pierstorff Dept. of Genetics University of Cologne
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Presentation transcript:

Overview of the Drosophila modENCODE hybrid assemblies Wilson Leung01/2014

Agenda Overview of the modENCODE species Problems with the version 1 genome assembly Improvements in the version 2 genome assembly Pipeline used to create the sequence improvement projects Sequence improvement goals for hybrid assemblies

New Drosophila species sequenced by the modENCODE project D. melanogaster D. simulans D. sechellia D. yakuba D. erecta D. ficusphila D. eugracilis D. biarmipes D. takahashii D. elegans D. rhopaloa D. kikkawai D. ananassae D. bipectinata D. pseudoobscura D. persimilis D. willistoni D. mojavensis D. virilis D. grimshawi Reference Sequence improvement projects for New Drosophila species sequenced by modENCODE Phylogeny from the modENCODE white paper (produced by Artyom Kopp at UC Davis)

The Drosophila modENCODE sequencing project Selected 8 additional Drosophila species for sequencing based on the ideal evolutionarily distances for motif finding Sequenced and assembled by the Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC) in 2011 BCM-HGSC also produced RNA-seq data for each newly sequenced species Adult males, adult females, and mixed embryos Initial assembly based on only 454 reads ~45x coverage 15x unpaired reads with 454 XLR 30x paired end sequencing of 3kb and 8kb inserts

Version 1 assembly statistics D. biarmipes has the highest N50 among the 8 species N50 = Total size of scaffolds this size or larger that account for half of the total length of the entire assembly Quality of the D. rhopaloa assembly is much lower than the other 7 genomes

Error profile for 454 reads Gilles et al. Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics 2011, 12 :245 Length of each 454 read depends on the number of cycles and base composition of the read Generally trim the end of 454 reads because they are low quality Most of the remaining errors are base insertions or deletions (indels)

Indels error profile for 454 reads 454 sequencing technology has high error rates in long (> 5bp) mononucleotide runs See the Next Generation Sequencing video on the GEP web site for details Base insertions or deletions in the consensus sequence could cause frame- shifts and introduce in-frame stop codons within coding regions Luo et al. Direct Comparisons of Illumina vs. Roche 454 Sequencing Technologies on the Same Microbial Community DNA Sample. PLoS One. 2012;7(2)

blastx alignment of D. melanogaster CDS Thd1:1_1636_0 against D. biarmipes contig32 Frame shifts introduced by base deletions in the consensus sequence

Version 2 D. biarmipes assembly published by BCM-HGSC in 2013

Assembly statistics for version 2 of the D. biarmipes assembly VersionTotal Length Scaffold Count Total Gap Length Scaffold N50 Spanned Gaps Version 1169,375,3875,528787,6483,385,6222,336 Version 2169,378,5995,523776,7913,386,1212,333 Difference3, , New assembly did not significantly change the overall assembly statistics D. biarmipes Muller F element scaffold scf (version 2) Net alignment between the two assemblies show large number of differences Genes Differences with version 1

Most of insertions in the version 2 assembly are A’s and T’s Estimated total number of insertions: 27,345 Number of insertions less than 10bp: 26,884 (98.3%) Insertion Size Count 126,

The new hybrid assembly fixed many errors in the consensus Version 1 Version 2

Some problems remained in the new D. biarmipes assembly

Statistics on D. biarmipes CDS’s with potential consensus errors Identify overlapping and partial CDS alignments Number of CDS’s with potential consensus errors Entire version 2 assembly: 309 CDS’s (250 genes) Muller F element scaffolds: 7 CDS’s (7 genes) CG2316, CG33521, MED26, Thd1, bip2, eIF4G, plexB

Use Consed to examine regions with potential consensus errors Reference Custom Track Illumina and 454 Reads

Pipeline used to construct the sequence improvement projects The version 2 assembly is constructed using a custom pipeline developed by BCM-HGSC No read placement information available for hybrid assemblies Download raw data from the NCBI Sequence Reads Archive (SRA) and map these reads against the version 2 assembly Map Illumina reads using bowtie2 Map 454 reads using Newbler 454 Coverage Illumina Coverage Genes D. biarmipes Muller F element scaffold: scf

Construct Consed project package Identify Muller F and Muller D element scaffolds ~95% of the genes remained on the same Muller element across the different Drosophila species Partition Muller F and Muller D element scaffolds into 100kb chunks (with 10kb overlap) Extract 454 and Illumina reads mapped to each chunk using samtools Construct Consed project packages Create initial ace file using consensus sequence of the extracted region from the version 2 assembly ( ref_.fasta ) Use cross_match to map extracted 454 reads against the consensus ( ace.1 ) Use cross_match to map extracted Illumina reads against the consensus ( ace.2 )

Order genomic DNA for PCR reactions Genomic DNA are available at the UC San Diego Drosophila Species Stock Center ( Chris can send you the genomic DNA if you would like to do the sequencing reactions in class as part of a wet lab

Sequence improvement goals for hybrid assemblies Primary goal: Correct errors within mononucleotide runs Secondary goal: Close gaps and correct regions with low consensus quality Optional goal: Identify regions with putative polymorphisms Recommended protocols and training materials for improving hybrid assemblies available on the GEP web site under the GEP Sequence Improvement Projects Issues section GEP Sequence Improvement Projects Issues GEP Sequence Improvement Projects Issues

Questions?