Genome Assembly
Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick



Outline
- Input Data
- Sequence read data
- Pipeline Review
- Un-processed data
- Assemblers
- Preliminary data: assembler comparison
- Visualization
- Future

Input Data: V. navarrensis, V. vulnificus

Vibrio navarrensis: 454
SequenceID:
Min. Read Length: 21 bp | 25 bp | 19 bp | 28 bp
Max. Read Length: 738 bp | 573 bp | 704 bp
Avg. Read Length: (± bp) | (± bp) | (± bp) | (± bp)
Total Reads: 160,560 | 13,854 | 303,434 | 218,021
Coverage: 15x | 1.23x | 28.06x | 20.51x

Vibrio vulnificus: 454
SequenceID: 2009V
Min. Read Length: 26 bp | 21 bp | 23 bp | 22 bp | 18 bp
Max. Read Length: 593 bp | 597 bp | 723 bp | 594 bp | 736 bp
Avg. Read Length: (± bp) | (± bp) | (± bp) | (± bp) | (± bp)
Total Reads: 191,280 | 786,944 | 352,726 | 173,538 | 777,228
Coverage: 17x | 65x | 32x | 16x | 63x

Vibrio navarrensis: Illumina
SequenceID:
Min. Read Length: 76 bp
Max. Read Length: 76 bp
Avg. Read Length: 76 bp
Total Reads: 19,316,659 | 29,414,237 | 126,298,691 | 92,338,634
Coverage: 326x | 496x | 250x | 237x

Vibrio vulnificus: Illumina
SequenceID: 2009V
Min. Read Length: 76 bp
Max. Read Length: 76 bp
Avg. Read Length: 76 bp
Total Reads: 15,764,329 | 14,562,252 | 15,343,648 | 16,007,895 | 15,495,709
Coverage: ~250x
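The coverage rows in the read tables can be related to read count and read length with the usual estimate, coverage ≈ (number of reads × read length) / genome size. The sketch below is only an illustration: the slides do not say how coverage was computed or which genome size was used, so the function names and the back-calculated genome size are assumptions.

```python
def estimate_coverage(total_reads: int, read_length: float, genome_size: float) -> float:
    """Approximate depth of coverage as sequenced bases divided by genome size."""
    return total_reads * read_length / genome_size


def implied_genome_size(total_reads: int, read_length: float, coverage: float) -> float:
    """Invert the estimate: the genome size implied by a reported coverage."""
    return total_reads * read_length / coverage


# First V. navarrensis Illumina sample from the table: 19,316,659 reads of 76 bp at 326x.
size = implied_genome_size(19_316_659, 76, 326)
print(f"implied genome size: {size / 1e6:.2f} Mbp")
```

For that sample the implied genome size comes out at roughly 4.5 Mbp.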

Pipeline overview (flowchart)
PRE-PROCESSING
- Input: 454 raw reads, Illumina raw reads
- Pre-processing: Fastqc, Prinseq, NGS QC
- Output: 454 reads, Illumina reads
- Statistical analysis: read stats
REFERENCE SELECTION
- Published genomes from public databases: V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O
- Align Illumina against the reference: bwa
- Compare mapping statistics: samstats
- Output: reference genome
DENOVO ASSEMBLY (Illumina / 454 / hybrid)
- 454 DeNovo: Newbler, CABOG, SUTTA
- Illumina DeNovo: Allpaths LG, SOAP DeNovo, Velvet, Taipan, SUTTA
- Hybrid DeNovo: Ray, MIRA
- contigs * 3
- Align Illumina reads against 454 contigs (MacVector, CLC wb): contigs, unmapped reads
- Evaluation: GAGE, Hawk-eye
REFERENCE BASED ASSEMBLY
- Illumina/(454?) reference based assembly: AMOScmp
- Output: contigs, unmapped reads
- Reference evaluation: DNA Diff, MUMmer
CONTIG MERGING
- Parameter optimization
- All possible combinations of the best 3
- Tools: Minimus, MAIA, PAGIT, Mauve
GENOME FINISHING
- Gap filling
- Nucleotide identity: MUMmer, GRASS
- Scaffolds: GAGE
- Output: draft/finished genome
LEGEND: built-in process, 454, Illumina, hybrid, info., chosen ref., assemblers

Vibrio vulnificus: 454
Metrics:
- Per Base Seq. Quality
- Per Seq. Quality Score
- Per Base Seq. Content
- Per Base GC Content
- Per Seq. GC Content
- Per Base N Content
- Seq. Length Dist.
- Seq. Dup. Levels
- Overrepresented Seqs.
- Kmer Content

Vibrio navarrensis: 454; unprocessed data
Metrics:
- Per Base Seq. Quality
- Per Seq. Quality Score
- Per Base Seq. Content
- Per Base GC Content
- Per Seq. GC Content
- Per Base N Content
- Seq. Length Dist.
- Seq. Dup. Levels
- Overrepresented Seqs.
- Kmer Content

Vibrio vulnificus: Illumina; unprocessed data
SequenceID: 2009V
Metrics:
- Per Base Seq. Quality
- Per Seq. Quality Score
- Per Base Seq. Content
- Per Base GC Content
- Per Seq. GC Content
- Per Base N Content
- Seq. Length Dist.
- Seq. Dup. Levels
- Overrepresented Seqs.
- Kmer Content

Vibrio navarrensis: Illumina; unprocessed data
Metrics:
- Per Base Seq. Quality
- Per Seq. Quality Score
- Per Base Seq. Content
- Per Base GC Content
- Per Seq. GC Content
- Per Base N Content
- Seq. Length Dist.
- Seq. Dup. Levels
- Overrepresented Seqs.
- Kmer Content

Per base sequence quality (figure panels: vul_454_, nav_454_, vul_ill_, nav_ill_)

Per base sequence content (figure panels: vul_454_, vul_ill_, nav_ill_, nav_454_)

Seq. duplicate levels (figure panels: vul_454_, nav_454_, vul_ill_, nav_ill_)

Pre-processing stats
Parameter | Value
Total sequences | 15,343,648
Good sequences | 9,775,116
Bad sequences | 5,568,532
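As a quick consistency check on the pre-processing numbers, the short sketch below confirms that the good and bad counts add up to the total and reports the fraction of sequences kept. The variable names are illustrative; only the three counts come from the slide.

```python
total_sequences = 15_343_648
good_sequences = 9_775_116
bad_sequences = 5_568_532

# The two categories should account for every input sequence.
assert good_sequences + bad_sequences == total_sequences

retained = good_sequences / total_sequences
print(f"retained after pre-processing: {retained:.1%}")
```

Roughly 64% of the sequences pass the filter, so about a third of this sample is discarded before assembly.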


Assemblers
Name | Platform | Source file | Installation | Usage
Allpaths LG | Illumina
SOAP DeNovo | Illumina
Velvet | Illumina
SUTTA | Hybrid
RAY | Hybrid
CLC genomics workbench | Hybrid
Newbler | 454
CABOG | 454

CLC Genomics
- Word Size: Automatic Word Size
  CLC bio's de novo assembly algorithm works by using de Bruijn graphs. It makes a table of all sub-sequences of a certain length (called words) found in the reads.
- Bubble Size: Automatic Bubble Size
  A bubble is defined as a bifurcation in the graph where a path splits into two nodes and then merges back into one.
- Minimum Contig Length: 200
- Mismatch cost: 2
  The cost of a mismatch between the read and the reference sequence.
- Insertion cost: 3
  The cost of an insertion in the read (causing a gap in the reference sequence).
- Deletion cost: 3
  The cost of having a gap in the read. The score for a match is always 1.
- Length fraction: 0.5
  The minimum fraction of a read's length that must match the reference sequence. A value of 0.5 means that at least half the read needs to match the reference sequence for the read to be included in the final mapping.
- Similarity: 0.8
  The minimum fraction of identity between the read and the reference sequence. For example, to require at least 90% identity with the reference sequence for a read to be included in the final mapping, set this value to 0.9.
- Update contigs based on mapped reads
  The original contig sequences produced by the de novo assembly are updated to reflect the mapping of the reads.
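To make the mapping parameters above concrete, here is a minimal sketch of how the costs and the two thresholds could be applied to a single read alignment. It is not CLC's implementation: the slide does not define exactly how the fractions are computed, so the `Alignment` class, its fields, and the choice to measure similarity over the aligned part of the read are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Values taken from the slide.
MATCH_SCORE = 1
MISMATCH_COST = 2
INSERTION_COST = 3
DELETION_COST = 3
LENGTH_FRACTION = 0.5
SIMILARITY = 0.8


@dataclass
class Alignment:
    """Hypothetical summary of one read aligned to a reference or contig."""
    read_length: int
    matches: int
    mismatches: int
    insertions: int  # extra bases present in the read
    deletions: int   # reference bases missing from the read

    @property
    def aligned_read_bases(self) -> int:
        # Read bases taking part in the alignment (deletions consume no read bases).
        return self.matches + self.mismatches + self.insertions

    def score(self) -> int:
        return (MATCH_SCORE * self.matches
                - MISMATCH_COST * self.mismatches
                - INSERTION_COST * self.insertions
                - DELETION_COST * self.deletions)

    def passes_filters(self) -> bool:
        aligned = self.aligned_read_bases
        if aligned == 0:
            return False
        long_enough = aligned / self.read_length >= LENGTH_FRACTION
        similar_enough = self.matches / aligned >= SIMILARITY
        return long_enough and similar_enough


good = Alignment(read_length=76, matches=60, mismatches=4, insertions=1, deletions=1)
short = Alignment(read_length=76, matches=30, mismatches=2, insertions=0, deletions=0)
print(good.score(), good.passes_filters(), short.passes_filters())
```

In this sketch the second read is rejected because less than half of it aligns, even though the aligned part is highly similar.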

Velvet
- De Bruijn graph assembler
- Max k-mer length: 31, default 29
- Commands
  velveth <directory> <k-mer length> -<read type> -<file format> <filename>
  velvetg VAssemILL -exp_cov auto -cov_cutoff auto
- exp_cov: lets the system infer the expected coverage of unique regions
- cov_cutoff: lets the system infer the coverage cutoff below which low-coverage nodes are removed
- Designed for very short reads (25-50 bp)
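The de Bruijn approach used by Velvet (and described for CLC as a table of words) can be illustrated with a toy example. The sketch below counts k-mers and links each (k-1)-mer prefix to its suffix. It is a teaching illustration under simplifying assumptions (error-free reads, no reverse complements, no coverage cutoff) and not Velvet's actual data structure; the function names are made up for this example.

```python
from collections import defaultdict


def kmer_table(reads, k):
    """Count every k-mer (word) of length k found in the reads."""
    counts = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(counts)


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for kmer in kmer_table(reads, k):
        graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


reads = ["ACGTAC", "CGTACG"]
table = kmer_table(reads, k=3)
graph = de_bruijn_edges(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

Each node here has a single successor, so the path through the graph is unambiguous; repeats and sequencing errors are what create the branches and bubbles that real assemblers must resolve.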

Newbler
- De novo OLC (overlap-layout-consensus) assembler
- Uses k-mer based hashing
- Command: runAssembly [ filename ]
- Designed for longer reads (454)
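The overlap step of an overlap-layout-consensus assembler can be sketched with exact suffix-prefix matching. This is a brute-force toy, not Newbler's algorithm (which, per the slide, uses k-mer based hashing to find candidate overlaps); the function names and the `min_overlap` default are assumptions for illustration.

```python
def overlap_length(a: str, b: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_overlap bases exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def merge(a: str, b: str, min_overlap: int = 3) -> str:
    """Join two reads on their best exact overlap (layout and consensus in miniature)."""
    n = overlap_length(a, b, min_overlap)
    if n == 0:
        raise ValueError("reads do not overlap by at least min_overlap bases")
    return a + b[n:]


left, right = "ACGTTGCA", "TGCAGGT"
print(overlap_length(left, right), merge(left, right))
```

Real 454 reads contain errors, especially in homopolymer runs, so production assemblers score approximate overlaps instead of requiring exact matches.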

SOAP DeNovo2
- Short-read de novo assembler
- Designed to assemble Illumina GAII short reads
- Command: SOAPdenovo-127mer all -s test.config -K 30 -R -p 4 -N o test_OP 1>ass.log 2>ass.err
- Parameters specified:
  Insert_size: 0, single end reads
  Kmer_size: 23, default
  asm_flag: both contigs and scaffold

Assembler comparison: 454
Samples: nav_454_, vul_454_
Columns: Tool | N50 | No. of contigs | Avg. contig length | No. of large contigs | Largest contig | Read usage %
First table:
CLC Genomics wb. | 93, ,107 NA
Newbler | 194, , ,
Second table:
CLC Genomics wb. | 84, ,828 NA
Newbler | 111, , ,

Assembler comparison: Illumina
Samples: nav_ill_, vul_ill_
Columns: Tool | N50 | No. of contigs | Avg. contig length | Read usage % | Largest contig | Median coverage depth
First table:
SOAP DeNovo | 1,077 | 28,760 | 184 | NA
Velvet | 17,408 | 1,402 | 3, ,
CLC Genomics wb | 56, , ,565 NA
Second table:
SOAP DeNovo | 1,094 | 26,773 | 207 | NA
Velvet | 15,699 | 1,253 | 3, ,
CLC Genomics wb | 87, , ,510 NA
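The N50 column in the assembler comparison tables is the contig length at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly size. A minimal sketch of that calculation follows; the function name and the example lengths are illustrative and do not come from the slides.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


example = [80, 70, 50, 40, 30, 20]
print(n50(example))
```

A higher N50 with fewer contigs, as in the Velvet rows compared with the SOAP DeNovo rows, generally indicates a more contiguous assembly, although N50 alone says nothing about correctness.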

Pipeline review (flowchart, updated): same stages and tools as the earlier pipeline slide, except that MIRA and Taipan no longer appear among the de novo assemblers and MUMmer no longer appears under reference evaluation.

Reference Genomes
- V. vulnificus MO6-24/O
- V. vulnificus YJ016
- V. vulnificus CMCP6

Reference vs. all contigs: 454
Samples: nav_454_, vul_454_
References: CMCP6, YJ016, MO6-24/O
Columns per reference: Aligned contigs % | Aligned bases %
First table: CLC Genomics wb. (n = 313); Newbler (n = 347)
Second table: CLC Genomics wb. NA; Newbler (n = 142)

Reference vs. all contigs: Illumina
Samples: nav_ill_, vul_ill_
References: CMCP6, YJ016, MO6-24/O
Columns per reference: Aligned contigs % | Aligned bases %
First table: SOAP DeNovo (n = 28,760); Velvet (n = 1,402)
Second table: SOAP DeNovo (n = 26,773); Velvet (n = 1,253)

Visualization

Road ahead...
- Get all the tools working
- Optimize tool parameters
- Use Illumina reads to finish 454 contigs
- Performance considerations for the tool

Questions???