JERI DILTS, SUZANNA KIM, HEMA NAGRAJAN, DEEPAK PURUSHOTHAM, AMBILY SIVADAS, AMIT RUPANI, LEO WU. Genome Assembly Final Results, 02-22-2012.


JERI DILTS SUZANNA KIM HEMA NAGRAJAN DEEPAK PURUSHOTHAM AMBILY SIVADAS AMIT RUPANI LEO WU Genome Assembly Final Results

Outline
- Pipeline for evaluation
- Quantitative evaluation
- Qualitative evaluation
- Choosing the BEST assembly
- Final results
- Demo

Pipeline for evaluation

Strategy – Key alterations
Prinseq preprocessing
- Unnecessary; the assemblers have built-in capabilities
- Use Prinseq for data statistics
Error correction
- Does not fit our methods
- Coral is based on overlap-layout-consensus and works best with de Bruijn graph assemblers
- Echo has never been tested on 454 data
Final assemblers
- Newbler, Mira, Celera, AmosCMP
Discarded assemblers
- Abyss, Velvet, and Pcap454
MAIA hybrid assembly
- Needs a phylogenetically close reference genome


Quantitative Evaluation
Metrics
- No. of contigs: the fewer, the better
- N50: the higher, the better
- Assembly size: the closer to the estimated genome size, the better
Quantitative assembly score
$$\text{Quantitative assembly score} = \frac{\text{N50} \times \text{Assembly size}}{\text{No. of contigs}}$$
The higher the score, the better!
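The score on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual script: the function names `n50` and `quantitative_score` are hypothetical, and the input is assumed to be a plain list of contig lengths in base pairs. N50 is taken in its usual sense, the contig length at which the running total of the longest contigs first reaches half of the assembly size.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths (in bp).

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total reaches half the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


def quantitative_score(contig_lengths):
    """Score = N50 * assembly size / number of contigs (higher is better)."""
    lengths = list(contig_lengths)
    if not lengths:
        raise ValueError("contig_lengths must not be empty")
    return n50(lengths) * sum(lengths) / len(lengths)


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    example = [100, 80, 60, 40, 20]
    print(n50(example), quantitative_score(example))
```

Because the score divides by the contig count, two assemblies with the same N50 and total size are separated by how fragmented they are.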

M Evaluation
Columns: Runs, # Contigs, N50, Total Size, Score
Runs: Newbler, Mira, Celera, Newbler_Mira, Newbler_Celera, Newbler_Mira_Celera


Qualitative Evaluation
Strategy
- Align the assembly contigs to the original reference genome and compute differences
Challenges
- No original reference genome for our data set
Approach
- Create simulated 454 read datasets from a completely sequenced genome
Tools used
- FlowSim
- 454Sim
- Art-454

FlowSim
A simulation pipeline based on real data
Lets you model each step of the pyrosequencing process
Utilities:
- Clonesim: to simulate the shearing step
  Usage: clonesim -c count -l dist input.fasta
- Gelfilter: to select a certain range of clone lengths
  Usage: gelfilter min max
- Kitsim: to attach A and B adaptors
  Usage: kitsim -k key -a adapter input.fasta -o output.fasta
- Mutator: to introduce random substitutions and indels in the sequences
  Usage: mutator -i indel_rate -s subst_rate input.fasta -o output.fasta
- Duplicator: to generate artificial duplicates of many clones
  Usage: duplicator dup_prob
- Flowsim: to simulate the actual pyrosequencing process
  Usage: flowsim -G generation input.fasta -o output.sff
Example: clonesim -c -l "Normal " input.fasta | gelfilter | kitsim | mutator | duplicator 0.03 | flowsim -G Titanium -o output.sff

454Sim
454 simulation at higher speed and accuracy
USP: configurable statistical models
Supports GS FLX, Titanium and GS 20
Utilities:
- fragsim: to simulate shearing
  Usage: fragsim -c l 1000 genome.fasta > genome.fragments.fasta
- 454sim: to simulate the sequencing step
  Usage: 454sim -o genome.sff genome.fragments.fasta
Example:
- fragsim -c l 1000 genome.fasta | 454sim -g FLX -o genome.sff

ART-454
Supports Illumina, 454 and SOLiD read simulation
Used for the 1000 Genomes Project
Usage:
- art_454 Input.fasta Output_prefix Fold_coverage (single-end reads)
- art_454 Input.fasta Output_prefix Fold_coverage Mean_Frag_Len Std_Deviation (paired-end reads)

Running pipeline on simulated reads
- Reference: Haemophilus influenzae F3047 (NC_014922)
- Ran 454Sim, FlowSim and Art-454 to generate reads
- Ran de novo assemblers: Newbler, Mira3 and Celera (CABOG)
- Merged assemblies using Minimus2
- Evaluate assembly accuracy (How?)

Assembly Accuracy
Challenges
- Alignment of contigs to the reference genome
Approach
- Local alignment (BLAST, bwa, bowtie)
- Whole genome alignment (Mauve, MUMmer)
- Align the assembly to the reference genome
- Compute nucleotide differences, gaps and rearranged segments

Mauve
Uses positional homology genome alignment
- Each site in the assembly maps to at most one site on the reference
- Optimized contiguity
- E.g. progressiveMauve
Ordering of contigs: Mauve Contig Mover algorithm
Compare to identify differences

Mauve Genome Aligner

After Ordering of Contigs

Mauve Assembly Metrics
Basecalling accuracy
- Count and location of bases called wrongly
- Direction of miscalling, e.g. A->G
- Count and location of bases predicted to exist, but uncalled
Genome content accuracy
- Count and location of bases missing from the assembly
- Count and location of extra bases in the assembly
- Size distribution of the missing and extra fragments
Genome structure accuracy
- Estimate of misassembly count

Example
- Miscalls: 2 (C->G and G->A)
- Uncalled bases: 1 (N)
- Extra bases: 1 (insertion of C)
- Missing bases: 2 (deletion of GC)
- Missing segments: 1
- Extra segments: 1
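To make the counts in this example concrete, here is a minimal Python sketch that tallies the same categories from a pair of already-aligned, gapped sequences. It is not how Mauve computes its metrics (Mauve works from a whole-genome alignment); the function name `alignment_metrics`, the `-` gap convention, and the two toy sequences are assumptions chosen only to reproduce the counts on the slide. An `N` in the assembly is treated as uncalled, and a run of consecutive gap columns counts as one segment.

```python
def alignment_metrics(reference, assembly):
    """Count per-column differences between two aligned sequences.

    Both strings must have the same length and use '-' for gaps.
    A gap in the reference is an extra base in the assembly; a gap in
    the assembly is a base missing from the assembly.
    """
    if len(reference) != len(assembly):
        raise ValueError("aligned sequences must have equal length")
    counts = {
        "miscalls": 0, "uncalled": 0,
        "extra_bases": 0, "missing_bases": 0,
        "extra_segments": 0, "missing_segments": 0,
    }
    previous = None  # category of the previous informative column
    for ref_base, asm_base in zip(reference.upper(), assembly.upper()):
        if ref_base == "-" and asm_base == "-":
            continue  # column carries no information
        if ref_base == "-":
            kind = "extra"
            counts["extra_bases"] += 1
            if previous != "extra":
                counts["extra_segments"] += 1  # a new run of extra bases
        elif asm_base == "-":
            kind = "missing"
            counts["missing_bases"] += 1
            if previous != "missing":
                counts["missing_segments"] += 1  # a new run of missing bases
        elif asm_base == "N":
            kind = "uncalled"
            counts["uncalled"] += 1
        elif ref_base != asm_base:
            kind = "miscall"
            counts["miscalls"] += 1
        else:
            kind = "match"
        previous = kind
    return counts


if __name__ == "__main__":
    # Hypothetical aligned pair mirroring the slide: C->G, G->A, one N,
    # an inserted C, and a deleted GC.
    ref = "ACGTA-GTGCATT"
    asm = "AGATNCGT--ATT"
    print(alignment_metrics(ref, asm))
```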

Scoring simulated reads with Mauve
- Reference: Haemophilus influenzae F3047 (NC_014922)
- Ran 454Sim, FlowSim and Art-454 to generate reads
- Ran de novo assemblers: Newbler, Mira3 and Celera (CABOG)
- Merged assemblies using Minimus2
- Ran Mauve to align the assemblies back to the reference genome
- Computed assembly metrics

Miscalled Bases

Uncalled bases

Total missing bases

Total extra segments


Choosing the BEST assembly
Quantitative metrics
- N50
- Contig count
- Assembly size
Qualitative metrics
- Miscalled bases
- Uncalled bases
- Missing bases
- Extra bases

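Assembly Scores
Quantitative score
$$\text{Quantitative score} = \frac{\text{N50} \times \text{Assembly size}}{\text{No. of contigs}}$$
Qualitative score (% accuracy)
$$\text{Qualitative score} = 1 - \frac{\text{Miscalls} + \text{Uncalled} + \text{Missing} + \text{Extra} + \text{Gaps in ref} + \text{Gaps in assembly}}{\text{Reference size}}$$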
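The qualitative score is one minus the summed error counts divided by the reference size. The sketch below is a hedged illustration: the function name `qualitative_score`, the dictionary keys, and the numbers in the example are hypothetical and are not results from the presentation. It returns a fraction; multiply by 100 for the percentage shown on the slide.

```python
def qualitative_score(error_counts, reference_size):
    """Return 1 - (sum of all error counts) / reference size.

    error_counts maps each error category (miscalls, uncalled, missing,
    extra, gaps in reference, gaps in assembly) to a base count.
    """
    if reference_size <= 0:
        raise ValueError("reference_size must be positive")
    total_errors = sum(error_counts.values())
    return 1 - total_errors / reference_size


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    counts = {
        "miscalls": 10,
        "uncalled": 5,
        "missing": 20,
        "extra": 5,
        "gaps_in_ref": 5,
        "gaps_in_assembly": 5,
    }
    score = qualitative_score(counts, reference_size=1000)
    print(f"accuracy: {score * 100:.1f}%")
```

Dividing by the reference size rather than the assembly size keeps scores comparable across assemblers that produce different total lengths from the same reads.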

Metrics Summary – Art-454 (assembly score and quality score)

Assembly spec. vs Accuracy plot – 454Sim

Assembly spec. vs Accuracy plot - Art-454

Assembly spec. vs Accuracy plot – FlowSim

Assembly spec. vs Accuracy plot – M21709

Inference
Striking a balance is critical
We chose
- Newbler + MIRA for H. haemolyticus
- Newbler + AMOScmp for H. influenzae
Universally applicable pipeline
- Adopt the most consistent tool/pipeline (conservative approach): NEWBLER
Assembling specific genomes/strains
- Choose the one that strikes the best balance for your genome: NEWBLER + (CELERA/MIRA)


Final Results

Key take-aways
Understand your data
- Platform, long/short reads, coverage, paired/non-paired, quality of basecalling, etc.
Evaluate the need for error correction
Choose a set of "best" assemblers
- De novo/reference assembly, DBG/OLC algorithm
Merge assemblies
Ordering and scaffolding
Finishing
Evaluate your assembly at every step to ensure that you are on the right track!

Coming next >>> Demo