JERI DILTS, SUZANNA KIM, HEMA NAGRAJAN, DEEPAK PURUSHOTHAM, AMBILY SIVADAS, AMIT RUPANI, LEO WU
Genome Assembly Final Results
Outline
  Pipeline for evaluation
  Quantitative evaluation
  Qualitative evaluation
  Choosing the BEST assembly
  Final results
  Demo
Pipeline for evaluation
Strategy: key alterations
  Prinseq preprocessing: unnecessary, since the assemblers have built-in capabilities; use Prinseq for data statistics
  Error correction: does not fit our methods
    Coral is based on overlap-layout-consensus and works best with de Bruijn graph assemblers
    Echo has never been tested on 454 data
  Final assemblers: Newbler, Mira, Celera, AMOScmp
  Discarded assemblers: ABySS, Velvet, and Pcap454
  MAIA hybrid assembly: needs a phylogenetically close reference genome
Quantitative Evaluation
Metrics
  No. of contigs: the fewer, the better
  N50: the higher, the better
  Assembly size: the closer to the estimated genome size, the better
Quantitative assembly score
$$\text{Score} = \frac{\text{N50} \times \text{Assembly size}}{\text{No. of contigs}}$$
The higher the score, the better!
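The quantitative score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `n50` and `quantitative_score` and the toy contig lengths are hypothetical, and N50 is taken in its usual sense (the contig length at which the length-sorted contigs first cover half of the assembly).

```python
from typing import Iterable, List


def n50(contig_lengths: Iterable[int]) -> int:
    """Length of the contig at which the cumulative size, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    lengths: List[int] = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return lengths[-1]  # only reached for degenerate input


def quantitative_score(contig_lengths: Iterable[int]) -> float:
    """Score = N50 * assembly size / number of contigs (higher is better)."""
    lengths = list(contig_lengths)
    return n50(lengths) * sum(lengths) / len(lengths)


# Hypothetical contig lengths, not taken from any real assembly run.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs), quantitative_score(toy_contigs))
```

Because the contig count sits in the denominator, two assemblies with the same N50 and total size are separated by how fragmented they are.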
M Evaluation
  Table columns: Runs, # Contigs, N50, Total Size, Score
  Runs: Newbler, Mira, Celera, Newbler_Mira, Newbler_Celera, Newbler_Mira_Celera
Qualitative Evaluation
  Strategy: align the assembly contigs to the original reference genome and compute the differences
  Challenge: no original reference genome exists for our data set
  Approach: create simulated 454 read datasets from a completely sequenced genome
  Tools used: FlowSim, 454Sim, Art-454
FlowSim
A simulation pipeline based on real data; lets you model each step of the pyrosequencing process
Utilities:
  Clonesim: simulates the shearing step
    Usage: clonesim -c count -l dist input.fasta
  Gelfilter: selects a certain range of clone lengths
    Usage: gelfilter min max
  Kitsim: attaches the A and B adaptors
    Usage: kitsim -k key -a adapter input.fasta -o output.fasta
  Mutator: introduces random substitutions and indels into the sequences
    Usage: mutator -i indel_rate -s subst_rate input.fasta -o output.fasta
  Duplicator: generates artificial duplicates of many clones
    Usage: duplicator dup_prob
  Flowsim: simulates the actual pyrosequencing process
    Usage: flowsim -G generation input.fasta -o output.sff
Example:
  clonesim -c -l "Normal " input.fasta | gelfilter | kitsim | mutator | duplicator 0.03 | flowsim -G Titanium -o output.sff
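To make the first stages of this pipeline concrete, here is a toy Python sketch of shearing, size selection and duplication. It is not FlowSim's implementation and does not call any FlowSim utility; the function names, the normal length distribution, and all parameter values are assumptions chosen for illustration.

```python
import random
from typing import List


def shear(genome: str, count: int, mean_len: float, sd_len: float,
          rng: random.Random) -> List[str]:
    """Toy stand-in for the shearing step: draw `count` fragments whose
    lengths follow a normal distribution, at uniform random positions."""
    fragments: List[str] = []
    while len(fragments) < count:
        length = round(rng.gauss(mean_len, sd_len))
        if not 1 <= length <= len(genome):
            continue  # redraw lengths that cannot fit in the genome
        start = rng.randrange(len(genome) - length + 1)
        fragments.append(genome[start:start + length])
    return fragments


def gel_filter(fragments: List[str], min_len: int, max_len: int) -> List[str]:
    """Toy size selection: keep fragments with min_len <= length <= max_len."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


def duplicate(fragments: List[str], dup_prob: float,
              rng: random.Random) -> List[str]:
    """Toy duplication: each fragment gains one extra copy with dup_prob."""
    out: List[str] = []
    for fragment in fragments:
        out.append(fragment)
        if rng.random() < dup_prob:
            out.append(fragment)
    return out


# Hypothetical 2 kb random genome and parameters, fixed seed for repeatability.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(2000))
clones = shear(genome, count=200, mean_len=500, sd_len=80, rng=rng)
selected = gel_filter(clones, min_len=400, max_len=600)
library = duplicate(selected, dup_prob=0.03, rng=rng)
print(len(clones), len(selected), len(library))
```

Each stage takes a list of fragments and returns a new list, which mirrors how the command-line utilities are chained with pipes.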
454Sim
454 simulation at higher speed and accuracy
USP: configurable statistical models
Supports GS FLX, Titanium and GS 20
Utilities:
  fragsim: simulates shearing
    Usage: fragsim -c l 1000 genome.fasta > genome.fragments.fasta
  454sim: simulates the sequencing step
    Usage: 454sim -o genome.sff genome.fragments.fasta
Example:
  fragsim -c l 1000 genome.fasta | 454sim -g FLX -o genome.sff
ART-454
Supports Illumina, 454 and SOLiD read simulation
Used for the 1000 Genomes Project
Usage:
  Single-end reads: art_454 Input.fasta Output_prefix Fold_coverage
  Paired-end reads: art_454 Input.fasta Output_prefix Fold_coverage Mean_Frag_Len Std_Deviation
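The fold-coverage argument relates read count, read length and genome size. A minimal sketch of that arithmetic follows; the helper names and the numbers (a 2 Mb genome, 400 bp mean read length, 20x coverage) are hypothetical and not taken from the runs described in these slides.

```python
import math


def reads_for_coverage(genome_size: int, fold_coverage: float,
                       mean_read_len: float) -> int:
    """Number of reads needed so that total read bases / genome size
    reaches the requested fold coverage."""
    return math.ceil(fold_coverage * genome_size / mean_read_len)


def achieved_coverage(n_reads: int, mean_read_len: float,
                      genome_size: int) -> float:
    """Fold coverage = total sequenced bases / genome size."""
    return n_reads * mean_read_len / genome_size


# Hypothetical values for illustration only.
n_reads = reads_for_coverage(genome_size=2_000_000, fold_coverage=20,
                             mean_read_len=400)
print(n_reads, achieved_coverage(n_reads, 400, 2_000_000))
```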
Running pipeline on simulated reads
  Reference: Haemophilus influenzae F3047 (NC_014922)
  Ran 454Sim, FlowSim and Art-454 to generate reads
  Ran de novo assemblers: Newbler, Mira3 and Celera (CABOG)
  Merged assemblies using Minimus2
  Evaluate assembly accuracy (how?)
Assembly Accuracy
  Challenge: alignment of contigs to the reference genome
  Approach
    Local alignment (BLAST, bwa, bowtie)
    Whole-genome alignment (Mauve, MUMmer)
  Align the assembly to the reference genome
  Compute nucleotide differences, gaps and rearranged segments
Mauve
  Uses positional-homology genome alignment
  Each site in the assembly maps to at most one site on the reference
  Optimized contiguity, e.g. progressiveMauve
  Ordering of contigs: Mauve Contig Mover algorithm
  Compare to identify differences
Mauve Genome Aligner
After Ordering of Contigs
Mauve Assembly Metrics
  Basecalling accuracy
    Count and location of miscalled bases
    Direction of miscalling, e.g. A->G
    Count and location of bases predicted to exist but left uncalled
  Genome content accuracy
    Count and location of bases missing from the assembly
    Count and location of extra bases in the assembly
    Size distribution of the missing and extra fragments
  Genome structure accuracy
    Estimate of misassembly count
Example
  Miscalls: 2 (C->G and G->A)
  Uncalled bases: 1 (N)
  Extra bases: 1 (insertion of C)
  Missing bases: 2 (deletion of GC)
  Missing segments: 1
  Extra segments: 1
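Counts like these can be derived column by column from a gapped pairwise alignment of assembly against reference. The sketch below is a simplified illustration, not Mauve's scoring code: the function `score_alignment`, its category names, and the toy alignment strings are hypothetical, constructed only so that the tallies match the example above.

```python
from collections import Counter
from typing import Tuple


def score_alignment(ref_aln: str, asm_aln: str) -> Tuple[Counter, Counter]:
    """Classify each column of a gapped reference/assembly alignment.

    '-' in the reference marks an extra base in the assembly, '-' in the
    assembly marks a missing base, 'N' in the assembly is an uncalled base,
    and any other disagreement is a miscall (recorded as ref->asm).
    Consecutive extra or missing columns are counted as one segment.
    """
    if len(ref_aln) != len(asm_aln):
        raise ValueError("aligned strings must have equal length")
    counts: Counter = Counter()
    directions: Counter = Counter()
    previous = None
    for r, a in zip(ref_aln.upper(), asm_aln.upper()):
        if r == "-" and a == "-":
            continue  # column carries no information
        if r == "-":
            kind = "extra"
        elif a == "-":
            kind = "missing"
        elif a == "N":
            kind = "uncalled"
        elif r != a:
            kind = "miscall"
            directions[f"{r}->{a}"] += 1
        else:
            kind = "match"
        counts[kind] += 1
        if kind in ("extra", "missing") and kind != previous:
            counts[kind + "_segments"] += 1
        previous = kind
    return counts, directions


# Hypothetical toy alignment (reference on top, assembly below).
ref = "ACGTA-TGCAGTA"
asm = "AGGTNCT--AATA"
counts, directions = score_alignment(ref, asm)
print(dict(counts), dict(directions))
```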
Scoring simulated reads with Mauve
  Reference: Haemophilus influenzae F3047 (NC_014922)
  Ran 454Sim, FlowSim and Art-454 to generate reads
  Ran de novo assemblers: Newbler, Mira3 and Celera (CABOG)
  Merged assemblies using Minimus2
  Ran Mauve to align the assemblies back to the reference genome
  Computed assembly metrics
Miscalled Bases
Uncalled bases
Total missing bases
Total extra segments
Choosing the BEST assembly
  Quantitative metrics: N50, contig count, assembly size
  Qualitative metrics: miscalled bases, uncalled bases, missing bases, extra bases
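Assembly Scores
Quantitative score
$$\text{Quantitative score} = \frac{\text{N50} \times \text{Assembly size}}{\text{No. of contigs}}$$
Qualitative score (% accuracy)
$$\text{Qualitative score} = 1 - \frac{\text{Miscalls} + \text{Uncalled} + \text{Missing} + \text{Extra} + \text{Gaps in ref} + \text{Gaps in assembly}}{\text{Reference size}}$$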
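As a small worked illustration of the qualitative score, the sketch below sums the six error categories and divides by the reference size. The function name, argument names and all counts are hypothetical placeholders, not results from the assemblies evaluated here.

```python
def qualitative_score(reference_size: int, miscalls: int, uncalled: int,
                      missing: int, extra: int, gaps_in_ref: int,
                      gaps_in_assembly: int) -> float:
    """Accuracy = 1 - (sum of all error categories) / reference size."""
    if reference_size <= 0:
        raise ValueError("reference_size must be positive")
    errors = (miscalls + uncalled + missing + extra
              + gaps_in_ref + gaps_in_assembly)
    return 1 - errors / reference_size


# Hypothetical counts for a 1 Mb reference: 3000 error bases in total.
accuracy = qualitative_score(reference_size=1_000_000, miscalls=500,
                             uncalled=100, missing=2000, extra=300,
                             gaps_in_ref=50, gaps_in_assembly=50)
print(f"{accuracy:.4f}")
```

An assembly with no differences from the reference scores exactly 1, and every error base lowers the score by 1 / reference size.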
Metrics Summary – Art-454 (columns: Assembly Score, Quality Score)
Assembly spec. vs Accuracy plot – 454Sim
Assembly spec. vs Accuracy plot - Art-454
Assembly spec. vs Accuracy plot – FlowSim
Assembly spec. vs Accuracy plot – M21709
Inference
  Striking a balance is critical
  We chose Newbler + MIRA for H. haemolyticus and Newbler + AMOScmp for H. influenzae
  Universally applicable pipeline: adopt the most consistent tool/pipeline (conservative approach): NEWBLER
  Assembling specific genomes/strains: choose the one that strikes the best balance for your genome: NEWBLER + (CELERA/MIRA)
Final Results
Key take-aways
  Understand your data: platform, long/short reads, coverage, paired/non-paired, quality of basecalling, etc.
  Evaluate the need for error correction
  Choose a set of "best" assemblers: de novo/reference assembly, DBG/OLC algorithm
  Merge assemblies
  Ordering and scaffolding
  Finishing
  Evaluate your assembly at every step to ensure that you are on the right track!
Coming next >>> Demo