Henrik Lantz - BILS/SciLife/Uppsala University


Genome assembly

De novo genome project workflow
- Extracting DNA (and RNA): as much DNA as possible!
- Choosing the best sequencing technology for the project
- Sequencing
- Quality assessment and other pre-assembly investigations
- Assembly
- Assembly validation
- Assembly comparisons
- Repeat masking?
- Annotation

Genome assembly - things to think about
- Genome specifics: size of genome, number of chromosomes, repeat content, heterozygosity
- Which assembly programs can run on “my” genome?
- What kind of data do these programs need?
- How much data do I need? Will I have enough coverage? Do I need to subsample?
- Are there closely related organisms that have already had their genomes sequenced?
- Do I need additional data for post-assembly?
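The coverage and subsampling questions above come down to simple arithmetic. Here is a minimal sketch, assuming the usual definition of average depth (total sequenced bases divided by genome size). The function names and the numbers are hypothetical and only illustrate the calculation.

```python
def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def subsample_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to keep to reach the target depth (never more than 1)."""
    if current_coverage <= 0:
        raise ValueError("current_coverage must be positive")
    return min(1.0, target_coverage / current_coverage)


# Hypothetical numbers, for illustration only.
depth = expected_coverage(n_reads=10_000_000, read_length=100, genome_size=5_000_000)
keep = subsample_fraction(depth, target_coverage=50)
print(f"depth {depth:.0f}x, keep {keep:.0%} of reads")
```

The genome size is itself an estimate in a de novo project, so the resulting depth is only a rough guide.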

Genome assembly programs: Abyss, Allpaths-LG, CABOG (a.k.a. Celera), HGAP, Masurca, Mira, Newbler, SGA, SoapDeNovo, Spades, Velvet

Genome assembly programs
Name | Algorithm | Data
Abyss | De Bruijn | Illumina
Allpaths-LG | | Illumina/PacBio
CABOG (Celera) | OLC | All
HGAP | | PacBio
Masurca | De Bruijn/OLC |
Mira | “OLC” |
Newbler | | 454/Illumina/Torrent
SGA | String |
SoapDeNovo | |
Spades | |
Velvet | |

OLC vs. de Bruijn

de Bruijn

Sequence Assembly via De Bruijn Graphs
The first step in a de Bruijn graph-based assembly is to construct the de Bruijn graph from the sequence reads. Each read is decomposed into substrings of a specified length k; each such substring is called a k-mer. In this example, k is set to 5, so every 5-mer is extracted from the read. An ordered list of k-mers is generated by sliding a window of length k along the read, so each k-mer overlaps the next by exactly k-1 bases.
Then a de Bruijn graph is constructed by making each unique k-mer a node and connecting immediately overlapping k-mers by an edge. This is a very effective and compact way of representing the sequence data: even when hundreds of millions of reads are sequenced, identical sequence regions within reads are compressed into single nodes of the graph. Where related sequences diverge due to allelic polymorphisms, splicing variations, repeats, or sequencing errors, the graph branches and can form bulges or loops.
From Martin & Wang, Nat. Rev. Genet. 2011
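The construction described above can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not how any of the assemblers discussed here is implemented: reads are plain strings, reverse complements and k-mer counts are ignored, and the function names are hypothetical.

```python
def kmers(read: str, k: int) -> list[str]:
    """Ordered k-mers obtained by sliding a window of length k along the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each unique k-mer (node) to the set of k-mers that directly follow it."""
    graph: dict[str, set[str]] = {}
    for read in reads:
        words = kmers(read, k)
        for word in words:
            graph.setdefault(word, set())  # every k-mer becomes a node
        for left, right in zip(words, words[1:]):
            graph[left].add(right)  # consecutive k-mers overlap by k-1 bases
    return graph


# Two overlapping toy reads share the 5-mer GTACG, which is stored only once.
graph = build_de_bruijn(["ACGTACG", "GTACGTT"], k=5)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because identical k-mers collapse into one node, the size of the graph depends on the number of distinct k-mers rather than on the number of reads.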

After building the graph from all the reads, the graph is typically pruned to remove bubbles and other structures that likely stem from sequencing errors, and then compacted by collapsing nodes that form linear, unbranched chains of overlapping k-mers. For example, a linear chain of k-mers is compressed into a single node in the compacted graph.
From Martin & Wang, Nat. Rev. Genet. 2011
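The compaction step can be sketched as follows. This is a minimal illustration that assumes the graph is a dictionary mapping each k-mer to the set of k-mers that follow it, and that only reports the sequence of each compacted node. Error pruning is not shown, and chains that form an isolated cycle are skipped. The function name is hypothetical.

```python
def compact_chains(graph: dict[str, set[str]]) -> list[str]:
    """Collapse linear unbranched chains of k-mers into single sequences."""
    # Build the predecessor sets so that in-degree can be checked.
    preds: dict[str, set[str]] = {node: set() for node in graph}
    for node, successors in graph.items():
        for succ in successors:
            preds[succ].add(node)

    sequences = []
    for node in graph:
        # Skip nodes that sit inside a chain: exactly one predecessor,
        # and that predecessor has no other successor.
        if len(preds[node]) == 1:
            (pred,) = preds[node]
            if len(graph[pred]) == 1:
                continue
        # Extend the chain while the path stays unbranched.
        sequence = node
        current = node
        while len(graph[current]) == 1:
            (nxt,) = graph[current]
            if len(preds[nxt]) != 1:
                break
            sequence += nxt[-1]  # each step adds one new base
            current = nxt
        sequences.append(sequence)
    return sequences


# Toy graph with k = 3: one linear stretch followed by a branch.
toy = {"AAB": {"ABC"}, "ABC": {"BCD", "BCE"}, "BCD": set(), "BCE": set()}
print(sorted(compact_chains(toy)))
```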

Now, to reconstruct transcripts, paths are traversed across the graph. In this example, there are four possible paths from the beginning to the end of the graph, each traced in a different color, and each path generates a different transcript sequence. By taking into account the paths that the reads trace through the graph, along with any mate-pairing information, constraints can be placed so that not all possible path combinations are reported, but only those paths that are best supported by the RNA-seq reads.
From Martin & Wang, Nat. Rev. Genet. 2011
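Path traversal can be illustrated with a small depth-first search. This sketch only enumerates every path between two nodes and spells out the sequence of a k-mer path. The filtering by read and mate-pair support described above is not implemented, and the node labels in the toy graph are hypothetical placeholders.

```python
def enumerate_paths(graph: dict[str, set[str]], source: str, sink: str) -> list[list[str]]:
    """Return every cycle-free path from source to sink."""
    paths = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            paths.append(path)
            continue
        for nxt in sorted(graph.get(node, ())):
            if nxt not in path:  # do not walk around loops
                stack.append((nxt, path + [nxt]))
    return paths


def spell(path: list[str]) -> str:
    """Sequence spelled by a path of k-mers that overlap by k-1 bases."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Two consecutive two-way branches give 2 x 2 = 4 start-to-end paths.
toy = {
    "start": {"a1", "a2"}, "a1": {"mid"}, "a2": {"mid"},
    "mid": {"b1", "b2"}, "b1": {"end"}, "b2": {"end"}, "end": set(),
}
for path in enumerate_paths(toy, "start", "end"):
    print(" -> ".join(path))
```

The number of paths grows multiplicatively with each branch, which is why read support is needed to decide which paths to report.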

De Bruijn
Pros: Computationally efficient, can work with high-coverage short-read datasets
Cons: Sensitive to sequencing errors, the connection between assembly and reads is lost, does not work as well with longer reads

OLC
Pros: Utilizes longer reads well
Cons: Time consuming, high memory requirements
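The overlap step of OLC (overlap-layout-consensus) hints at where the time and memory go. Below is a naive sketch, assuming exact suffix-prefix matches and an all-against-all comparison. Real OLC assemblers tolerate sequencing errors and use indexes to avoid comparing every pair; the function names are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if shorter than min_len)."""
    start = 0
    while True:
        # Look for the first min_len bases of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # Check whether the rest of a, from that point on, matches the start of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_graph(reads: list[str], min_len: int) -> dict[tuple[str, str], int]:
    """All ordered read pairs with an overlap of at least min_len bases."""
    edges = {}
    for a, b in permutations(reads, 2):  # quadratic in the number of reads
        length = overlap(a, b, min_len)
        if length > 0:
            edges[(a, b)] = length
    return edges


print(overlap_graph(["ACGTAC", "TACGGA", "GGATTT"], min_len=3))
```

Every read is compared with every other read here, so the work grows with the square of the number of reads.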

Assemblathon 2
Uses 454, Illumina, and PacBio data for three large eukaryote genomes: a bird, a fish, and a snake
- Bird: Illumina (14 libraries), 454, PacBio
- Fish: Illumina (8 libraries)
- Snake: Illumina (4 libraries)
Teams take the data, perform assemblies with whatever tools they wish, and then submit their results => teams are evaluated more than individual programs!
GigaScience 2013, 2:10

Assemblathon 2

Assemblathon 2 - Bird vs. Snake

Assemblathon 2 - Bird

CEGMA

Assemblathon 2 - Validation measures

GAGE-B
- Uses Illumina (HiSeq and MiSeq) data for a number of bacteria
- One team runs all programs => assembly programs are compared, not teams
- High-quality reference assemblies are available => errors/misassemblies can be quantified

Comparison of N50 contig size (in kilobases) on the y-axis, versus depth of coverage on the x-axis, for the eight assemblers used in this study. All datasets were 100 bp HiSeq reads from B. cereus.
Magoc T et al. Bioinformatics 2013;29:1718-1725
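For reference, N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. Here is a minimal sketch using that common definition; the function name and the contig lengths are hypothetical.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths (in kb), for illustration only.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A few long contigs can raise N50 even if they are misassembled, which is one reason to look at more than one statistic.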

GAGE-B

GAGE-B statistics

Genome assembly programs - pros and cons: Abyss, Allpaths-LG, CABOG (a.k.a. Celera), Masurca, Mira, Newbler, SGA, SoapDeNovo, Spades, Velvet

Allpaths-LG
Pros: Produces contigs and scaffolds with high N50 values, can use PacBio data for scaffolding, can run on large genomes with high coverage
Cons: Only accepts Illumina data, needs very specific libraries to work at all (180 bp + 3 kbp), needs very high coverage (100x), takes a long time to run and requires a lot of memory

Assemblathon 2 - Bird vs. Snake

Assemblathon 2 - Bird

Masurca
Pros: Can accept any type of data, is a true hybrid assembler, usable for very large genomes, produces top results in comparisons of assembly statistics
Cons: Takes a long time to run, unstable(?)

GAGE-B statistics

MIRA
Pros: Can accept any type of data, is a true hybrid assembler, produces good assemblies for smaller genomes, excellent documentation
Cons: Only useful for smaller genomes (bacteria, fungi), cannot use high-coverage data (prefers max 50x), takes a long time to run, no scaffolding

GAGE-B statistics

SoapDeNovo
Pros: Usable on large genomes, easy to use and runs fairly quickly, can use high-coverage data
Cons: Only accepts Illumina data, middling results in comparisons of assembly statistics

Assemblathon 2 - Bird vs. Snake

Assemblathon 2 - Bird

GAGE-B statistics

Spades
Pros: Designed to work with amplified data, performs very well in GAGE-B with MiSeq data
Cons: Only accepts Illumina data, only for smaller genomes

GAGE-B statistics

Some recommendations
- Large eukaryote genome, Illumina data: Allpaths-LG (needs specific libraries), SoapDeNovo, SGA, Masurca
- Large eukaryote genome, additional longer reads: Masurca, Newbler, CABOG
- Small eukaryote or prokaryote genome, Illumina data: Spades, Masurca, SoapDeNovo, Abyss, Velvet
- Small eukaryote or prokaryote genome, mixed data: MIRA, Masurca, Newbler
- Need to run in parallel: Abyss
- Amplified data (Single Cell Genomics): Spades

Assemblathon 2 recommendations
Based on the findings of Assemblathon 2, we make a few broad suggestions to someone looking to perform a de novo assembly of a large eukaryotic genome:
1. Don’t trust the results of a single assembly. If possible, generate several assemblies (with different assemblers and/or different assembler parameters). Some of the best assemblies entered for Assemblathon 2 were the evaluation assemblies rather than the competition entries.
2. Do not place too much faith in a single metric. It is unlikely that we would have considered SGA to have produced the highest ranked snake assembly if we had only considered a single metric.
3. Potentially choose an assembler that excels in the area you are interested in (e.g., coverage, continuity, or number of error free bases).
4. If you are interested in generating a genome assembly for the purpose of genic analysis (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs), then it may not be necessary to be concerned by low N50/NG50 values or by a small assembly size.
5. Assess the levels of heterozygosity in your target genome before you assemble (or sequence) it and set your expectations accordingly.

Post-assembly considerations
- External scaffolders: SSPACE (commercial), SGA (see Hunt et al. in Genome Biology 2014:15, R42)
- Gap closers (use with caution!): IMAGE, PILON, GapCloser
- Error correction: Nesoni, PILON
- Assembly validation is extremely important!

Abyss
Pros: Only assembler that can run in parallel on different nodes => does not need a single huge memory node, fast, can run on large genomes with high coverage
Cons: Only accepts Illumina data, does not excel in any statistic

Assemblathon 2 - Bird vs. Snake

CABOG (Celera)
Pros: Can accept any type of data, is a true hybrid assembler, output can easily be analyzed in the assembly validation toolkit AMOSvalidate, usable for large genomes
Cons: Does not perform particularly well on any statistic in comparisons

GAGE-B statistics

Newbler
Pros: Easy to run, works very well on 454 and Ion Torrent data, can use many types of data, usable for larger genomes, produces competitive assemblies if longer reads are available
Cons: Requires longer reads to perform well

Assemblathon 2 - Bird vs. Snake

Assemblathon 2 - Bird

SGA
Pros: Usable on large genomes, memory-efficient
Cons: Only accepts Illumina data, does not perform well in comparisons of assembly statistics

GAGE-B statistics

Assemblathon 2 - Bird

Velvet
Pros: Easy to use, runs quickly
Cons: Only accepts Illumina data, only for smaller genomes, does not excel in any comparison of assembly statistics

GAGE-B statistics