Genome Assembly
Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick
Outline: Input Data; Sequence read data; Pipeline Review; Un-processed data; Assemblers; Preliminary data – assembler comparison; Visualization; Future
Input Data: V. navarrensis, V. vulnificus
Input Data / Sequence Read Data / Pipeline Review / Un-processed data / Assemblers / Preliminary Data / Visualization / Future
Vibrio navarrensis - 454
SequenceID:
Min. Read Length: 21 bp; 25 bp; 19 bp; 28 bp
Max. Read Length: 738 bp; 573 bp; 704 bp
Avg. Read Length: (± bp); (± bp); (± bp); (± bp)
Total Reads: 160,560; 13,854; 303,434; 218,021
Coverage: 15x; 1.23x; 28.06x; 20.51x
Vibrio vulnificus - 454
SequenceID: 2009V
Min. Read Length: 26 bp; 21 bp; 23 bp; 22 bp; 18 bp
Max. Read Length: 593 bp; 597 bp; 723 bp; 594 bp; 736 bp
Avg. Read Length: (± bp); (± bp); (± bp); (± bp); (± bp)
Total Reads: 191,280; 786,944; 352,726; 173,538; 777,228
Coverage: 17x; 65x; 32x; 16x; 63x
Vibrio navarrensis - Illumina
SequenceID:
Min. Read Length: 76 bp
Max. Read Length: 76 bp
Avg. Read Length: 76 bp
Total Reads: 19,316,659; 29,414,237; 126,298,691; 92,338,634
Coverage: 326x; 496x; 250x; 237x
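The coverage figures in these tables relate read count, read length, and genome size. A minimal sketch of that arithmetic follows. The genome size of about 4.5 Mb is an assumption (the slides do not state it), and `estimate_coverage` is a hypothetical helper name used only for illustration.

```python
def estimate_coverage(total_reads: int, avg_read_length: float, genome_size: int) -> float:
    """Expected depth of coverage = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_reads * avg_read_length / genome_size


# ASSUMPTION: a genome of roughly 4.5 Mb; this value is not given on the slides.
ASSUMED_GENOME_SIZE = 4_500_000

# First Illumina V. navarrensis sample from the table: 19,316,659 reads of 76 bp.
depth = estimate_coverage(19_316_659, 76, ASSUMED_GENOME_SIZE)
print(f"{depth:.0f}x")  # prints 326x under the assumed genome size
```

Under that assumed genome size, the first sample reproduces the 326x shown in the table; other samples may differ if their genome size or usable read count differs.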
Vibrio vulnificus - Illumina
SequenceID: 2009V
Min. Read Length: 76 bp
Max. Read Length: 76 bp
Avg. Read Length: 76 bp
Total Reads: 15,764,329; 14,562,252; 15,343,648; 16,007,895; 15,495,709
Coverage: ~250x
Pipeline Review (flowchart)
PRE-PROCESSING: 454 raw reads; Illumina raw reads; pre-processing (Fastqc, Prinseq, NGS QC); statistical analysis (read stats); 454 reads; Illumina reads
REFERENCE SELECTION: published genomes from public databases (V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O); align Illumina against the reference (bwa); compare mapping statistics (samstats); reference genome
DENOVO ASSEMBLY (Illumina / 454 / hybrid):
  Hybrid DeNovo: Ray, MIRA
  454 DeNovo: Newbler, CABOG, SUTTA
  Illumina DeNovo: Allpaths LG, SOAP DeNovo, Velvet, Taipan, SUTTA
  Mac vector, CLC wb
  Align Illumina reads against 454 contigs
  Outputs: contigs * 3; unmapped reads
  Evaluation: GAGE, Hawk-eye
REFERENCE BASED ASSEMBLY: Illumina/(454?) reference based assembly (AMOScmp); contigs; unmapped reads; reference evaluation (DNA Diff, MUMmer); parameter optimization
CONTIG MERGING: all possible combinations of the best 3; Minimus, MAIA, PAGIT, Mauve
GENOME FINISHING: scaffolds; GAGE; gap filling; nucleotide identity; MUMmer; GRASS; draft/finished genome
LEGEND: Built-in; Process; 454; Illumina; hybrid; Info.; Chosen Ref.; Assemblers
Vibrio vulnificus - 454
Metric: Per Base Seq. Quality; Per Seq. Quality Score; Per Base Seq. Content; Per Base GC Content; Per Seq. GC Content; Per Base N Content; Seq. Length Dist.; Seq. Dup. Levels; Overrepresented Seqs.; Kmer Content
Vibrio navarrensis - 454; unprocessed data
Metric: Per Base Seq. Quality; Per Seq. Quality Score; Per Base Seq. Content; Per Base GC Content; Per Seq. GC Content; Per Base N Content; Seq. Length Dist.; Seq. Dup. Levels; Overrepresented Seqs.; Kmer Content
Vibrio vulnificus - Illumina; unprocessed data
SequenceID: 2009V
Metric: Per Base Seq. Quality; Per Seq. Quality Score; Per Base Seq. Content; Per Base GC Content; Per Seq. GC Content; Per Base N Content; Seq. Length Dist.; Seq. Dup. Levels; Overrepresented Seqs.; Kmer Content
Vibrio navarrensis - Illumina; unprocessed data
Metric: Per Base Seq. Quality; Per Seq. Quality Score; Per Base Seq. Content; Per Base GC Content; Per Seq. GC Content; Per Base N Content; Seq. Length Dist.; Seq. Dup. Levels; Overrepresented Seqs.; Kmer Content
Per base sequence quality (figure panels: vul_454_, nav_454_, vul_ill_, nav_ill_)
Per base sequence content (figure panels: vul_454_, vul_ill_, nav_ill_, nav_454_)
Seq. duplicate levels (figure panels: vul_454_, nav_454_, vul_ill_, nav_ill_)
Pre-processing stats
Parameter: Value
Total sequences: 15,343,648
Good sequences: 9,775,116
Bad sequences: 5,568,532
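The pre-processing counts can be sanity-checked and turned into a retention rate with a few lines of arithmetic. The sketch below uses only the three numbers from the slide; `read_retention` is a hypothetical helper name.

```python
def read_retention(good: int, bad: int) -> float:
    """Fraction of input reads kept after pre-processing."""
    total = good + bad
    if total == 0:
        raise ValueError("no reads to evaluate")
    return good / total


total_sequences = 15_343_648
good_sequences = 9_775_116
bad_sequences = 5_568_532

# Good and bad reads should together account for every input read.
assert good_sequences + bad_sequences == total_sequences

retained = read_retention(good_sequences, bad_sequences)
print(f"retained: {retained:.1%}, discarded: {1 - retained:.1%}")
# prints: retained: 63.7%, discarded: 36.3%
```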
Assemblers
Columns: Name; Platform; Source file; Installation; Usage
Allpaths LG: Illumina
SOAP DeNovo: Illumina
Velvet: Illumina
SUTTA: Hybrid
RAY: Hybrid
CLC genomics workbench: Hybrid
Newbler: 454
CABOG: 454
CLC Genomics
Word Size: Automatic Word Size. CLC bio's de novo assembly algorithm uses de Bruijn graphs. It builds a table of all sub-sequences of a certain length (called words) found in the reads.
Bubble Size: Automatic Bubble Size. A bubble is a bifurcation in the graph where a path splits into two nodes and then merges back into one.
Minimum Contig Length: 200
Mismatch cost: 2. The cost of a mismatch between the read and the reference sequence.
Insertion cost: 3. The cost of an insertion in the read (causing a gap in the reference sequence).
Deletion cost: 3. The cost of having a gap in the read. The score for a match is always 1.
Length fraction: 0.5. The minimum fraction of a read's length that must match the reference sequence. A value of 0.5 means that at least half the read must match the reference sequence for the read to be included in the final mapping.
Similarity: 0.8. The minimum fraction of identity between the read and the reference sequence. For example, to require at least 90% identity with the reference sequence for a read to be included in the final mapping, set this value to 0.9.
Update contigs based on mapped reads: the original contig sequences produced by the de novo assembly are updated to reflect the mapping of the reads.
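The mapping parameters above can be read as a scoring rule plus two inclusion filters. The sketch below illustrates that reading; it is not CLC's implementation. The `ReadAlignment` type and both functions are hypothetical, and computing similarity over the aligned part of the read is an assumption.

```python
from dataclasses import dataclass

MATCH_SCORE = 1       # "The score for a match is always 1"
MISMATCH_COST = 2
INSERTION_COST = 3
DELETION_COST = 3
LENGTH_FRACTION = 0.5
SIMILARITY = 0.8


@dataclass
class ReadAlignment:
    """Hypothetical summary of one read's alignment against a contig."""
    read_length: int
    matches: int
    mismatches: int
    insertions: int   # read bases with no counterpart in the reference
    deletions: int    # reference bases missing from the read

    @property
    def aligned_read_bases(self) -> int:
        # Read bases that take part in the alignment.
        return self.matches + self.mismatches + self.insertions


def alignment_score(aln: ReadAlignment) -> int:
    """Matches add to the score; mismatches and gaps subtract their costs."""
    return (MATCH_SCORE * aln.matches
            - MISMATCH_COST * aln.mismatches
            - INSERTION_COST * aln.insertions
            - DELETION_COST * aln.deletions)


def passes_filters(aln: ReadAlignment) -> bool:
    """Apply the length-fraction and similarity thresholds."""
    aligned = aln.aligned_read_bases
    if aligned == 0:
        return False
    length_ok = aligned / aln.read_length >= LENGTH_FRACTION
    # ASSUMPTION: identity is measured over the aligned portion of the read.
    similarity_ok = aln.matches / aligned >= SIMILARITY
    return length_ok and similarity_ok


good = ReadAlignment(read_length=100, matches=58, mismatches=2, insertions=0, deletions=0)
short = ReadAlignment(read_length=100, matches=30, mismatches=10, insertions=0, deletions=0)
print(alignment_score(good), passes_filters(good))   # 54 True
print(passes_filters(short))                         # False: only 40% of the read aligns
```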
Velvet
De Bruijn graph assembler
Max k-mer length: 31, default 29
Commands:
velveth <directory> <k-mer length> -<read type> -<file format> <filename>
velvetg VAssemILL -exp_cov auto -cov_cutoff auto
-exp_cov auto: allow the system to infer the expected coverage of unique regions
-cov_cutoff auto: allow the system to infer the coverage cutoff used to remove low-coverage nodes
Designed for very short reads (25-50 bp)
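The k-mer idea behind a de Bruijn assembler can be sketched in a few lines. This is a toy illustration only, not Velvet's data structure. `build_de_bruijn` is a hypothetical name, and the edge counts are a rough stand-in for the per-node coverage that a coverage cutoff would act on.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Return {prefix (k-1)-mer: Counter of following (k-1)-mers}.

    Every k-mer in every read contributes one edge from its first k-1
    bases to its last k-1 bases; the count records how often it was seen.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


reads = ["ACGTAC", "CGTACG"]          # toy reads, not real data
graph = build_de_bruijn(reads, k=4)
for prefix in sorted(graph):
    print(prefix, dict(graph[prefix]))
# ACG {'CGT': 1}
# CGT {'GTA': 2}
# GTA {'TAC': 2}
# TAC {'ACG': 1}
```

Edges seen only once here would be the first candidates for removal under a coverage cutoff, while edges seen in both reads are better supported.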
Newbler
De novo OLC (overlap-layout-consensus) assembler
Uses k-mer based hashing
Command: runAssembly [ filename ]
Designed for longer reads (454)
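The overlap step of an OLC assembler, seeded by k-mer hashing, can be illustrated with a small sketch. This is not Newbler's algorithm: it handles exact suffix-prefix matches only, and `index_prefix_kmers` and `find_overlaps` are hypothetical names.

```python
from collections import defaultdict


def index_prefix_kmers(reads, k):
    """Hash the first k bases of each read to the indices of reads starting with them."""
    index = defaultdict(list)
    for idx, read in enumerate(reads):
        if len(read) >= k:
            index[read[:k]].append(idx)
    return index


def find_overlaps(reads, k, min_overlap):
    """Return {(i, j): n} where the last n bases of reads[i] equal the first n of reads[j]."""
    index = index_prefix_kmers(reads, k)
    overlaps = {}
    for i, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            seed = read[start:start + k]
            for j in index.get(seed, ()):
                if j == i:
                    continue
                suffix = read[start:]
                # The seed hit is only a candidate; verify the full suffix.
                if len(suffix) >= min_overlap and reads[j].startswith(suffix):
                    key = (i, j)
                    overlaps[key] = max(overlaps.get(key, 0), len(suffix))
    return overlaps


reads = ["ACGTTGCA", "TTGCAGGA", "GGATCC"]   # toy reads, not real data
print(find_overlaps(reads, k=3, min_overlap=3))
# {(0, 1): 5, (1, 2): 3}
```

Hashing the seeds avoids comparing every read against every other read, which is the point of a k-mer index in the overlap phase.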
SOAP DeNovo2
Short-read de novo assembler
Designed for Illumina GAII short reads
Command: SOAPdenovo-127mer all -s test.config -K 30 -R -p 4 -N o test_OP 1>ass.log 2>ass.err
Parameters specified:
Insert_size: 0, single end reads
Kmer_size: 23, default
asm_flag: both contigs and scaffold
Assembler comparison - 454
Columns: Tool; N50; No. of contigs; Avg. contig length; No. of large contigs; Largest contig; Read usage %
First table: CLC Genomics wb.: 93, ,107 NA; Newbler: 194, , ,
Second table: CLC Genomics wb.: 84, ,828 NA; Newbler: 111, , ,
Samples: nav_454_, vul_454_
Assembler comparison - Illumina
Columns: Tool; N50; No. of contigs; Avg. contig length; Read usage %; Largest contig; Median coverage depth
First table: SOAP DeNovo: 1,077; 28,760; 184; NA. Velvet: 17,408; 1,402; 3, , . CLC Genomics wb: 56, , ,565 NA
Second table: SOAP DeNovo: 1,094; 26,773; 207; NA. Velvet: 15,699; 1,253; 3, , . CLC Genomics wb: 87, , ,510 NA
Samples: nav_ill_, vul_ill_
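N50, the headline metric in these comparisons, is the contig length at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size. A minimal sketch with made-up contig lengths (the values below are illustrative and do not come from the slides):

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths, for illustration only.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))  # 70: 80 + 70 = 150 is the first running total >= 290 / 2
```

A higher N50 with fewer contigs generally indicates a more contiguous assembly, which is why the tables report both columns side by side.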
Pipeline Review (flowchart, revised)
PRE-PROCESSING: 454 raw reads; Illumina raw reads; pre-processing (Fastqc, Prinseq, NGS QC); statistical analysis (read stats); 454 reads; Illumina reads
REFERENCE SELECTION: published genomes from public databases (V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O); align Illumina against the reference (bwa); compare mapping statistics (samstats); reference genome
DENOVO ASSEMBLY (Illumina / 454 / hybrid):
  Hybrid DeNovo: Ray
  454 DeNovo: Newbler, CABOG, SUTTA
  Illumina DeNovo: Allpaths LG, SOAP DeNovo, Velvet, SUTTA
  Mac vector, CLC wb
  Align Illumina reads against 454 contigs
  Outputs: contigs * 3; unmapped reads
  Evaluation: GAGE, Hawk-eye
REFERENCE BASED ASSEMBLY: Illumina/454? reference based assembly (AMOScmp); contigs; unmapped reads; reference evaluation (DNA Diff); parameter optimization
CONTIG MERGING: all possible combinations of the best 3; Minimus, MAIA, PAGIT, Mauve
GENOME FINISHING: scaffolds; GAGE; gap filling; nucleotide identity; MUMmer; GRASS; draft/finished genome
LEGEND: Built-in; Process; 454; Illumina; hybrid; Info.; Chosen Ref.; Assemblers
Reference Genomes: V. vulnificus MO6-24/O; V. vulnificus YJ016; V. vulnificus CMCP6
Reference vs. all contigs - 454
Columns: Tool; then for each reference (CMCP6, YJ016, MO6-24/O): Aligned contigs %; Aligned bases %
First table: CLC Genomics wb. (n = 313); Newbler (n = 347)
Second table: CLC Genomics wb.: NA; Newbler (n = 142)
Samples: nav_454_, vul_454_
Reference vs. all contigs - Illumina
Columns: Tool; then for each reference (CMCP6, YJ016, MO6-24/O): Aligned contigs %; Aligned bases %
First table: SOAP DeNovo (n = 28,760); Velvet (n = 1,402)
Second table: SOAP DeNovo (n = 26,773); Velvet (n = 1,253)
Samples: nav_ill_, vul_ill_
Visualization
Road ahead
Get all the tools working
Optimize tool parameters
Use Illumina reads to finish 454 contigs
Performance considerations for the tool
Questions???