Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es.


Sequencing project
[Figure: an unknown sequence is sampled as reads 1 to 7 (the experimental evidence), which are assembled into the result.]

Computational requirements
Main requirements: memory and time. The requirements of k-mer-based assemblers depend on the genome complexity and not so much on the reads.
(Image credit: mykl_roventine@flickr)

Read cleaning
Some assemblers choke on bad-quality stretches, adapters, low-complexity regions and contamination.
[Figure: a read with bad-quality and vector stretches flanking the unknown sequence; labels: Trim, Trim or mask.]

The assembly problem
A single read is usually not enough to produce the complete sequence. Even if the target sequence is short, a single read might contain sequencing errors.
[Figure: reads from the unknown sequence are assembled into contigs and scaffolds (contig1 to contig4, plus contig5 for a repeat). Reasons for the gaps: repeated regions and lack of coverage.]

Long reads

Mate pairs and paired-ends
Read length is critical but constrained by the sequencing technology; what we can do is sequence both ends of a molecule. This is useful for dealing with repetitive genomic DNA and with complex transcriptome structure.
Paired-ends: Illumina can sequence from both ends of the molecules (150-500 bp).
Mate-pairs: can be generated from libraries with different lengths (2-20 kb).
BAC-ends: usually sequenced with Sanger.
[Figure: a clone (template) and its end reads.]

Library types
Single reads
Illumina paired ends (150-500 bp)
Illumina mate pairs (2-10 kb)

Short reads are harder to assemble Taken from Jason Miller

The need for paired reads Taken from Jason Miller

Algorithm I Overlap – Layout - Consensus

Overlap – Layout - Consensus
Algorithm:
1. All-against-all, pairwise read comparison.
2. Construction of an overlap graph with an approximate read layout.
3. Multiple sequence alignment determines the precise layout and the consensus sequence.
[Figure: clean reads -> overlap detection (all vs all alignment) -> contigs -> consensus.]
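The first step (all-against-all comparison feeding an overlap graph) can be sketched in a few lines of Python. This is only an illustration under strong assumptions: overlaps are exact suffix/prefix matches, whereas real assemblers align reads and tolerate errors. The function names, the `min_len` threshold and the third read are made up for the example.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads, min_len=3):
    """All-against-all comparison: maps (i, j) to the overlap length of read i -> read j."""
    graph = {}
    for i, j in permutations(range(len(reads)), 2):
        length = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if length:
            graph[(i, j)] = length
    return graph


# Hypothetical toy reads; the third one is invented to extend the chain.
reads = ["AACCGG", "CCGGTT", "GGTTAC"]
graph = overlap_graph(reads)
```

Every ordered pair of reads is compared, so the work grows with the square of the number of reads.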

Overlap problems
Two reads are similar if they have a good overlap. We assume that overlap implies a common origin in the genome. How good an overlap is depends on the overlap quality and similarity. But several things might go wrong.
[Figure labels: vector contamination or repetitive sequence, too many sequencing errors, chimeric clone, short overlap; outcomes: false positive, false negative.]

Overlap problems
False positives will be induced by chance and by repeats. Avoid false positives with stringent criteria:
Overlaps must be long enough.
Sequence similarity must be high (identity threshold).
Overlaps must reach the ends of both reads.
Ignore high-frequency overlaps (repetitive regions in genomic DNA?).
But stringency will induce false negatives.
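The four criteria above can be read as a simple filter over candidate overlaps. The sketch below is hypothetical: the `Overlap` record, its field names and all threshold values are assumptions chosen for illustration, not settings of any particular assembler.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    length: int              # overlap length in bases
    identity: float          # fraction of matching bases, between 0 and 1
    reaches_both_ends: bool  # True if the alignment extends to the ends of both reads
    partner_count: int       # how many reads overlap this same region


def passes_stringency(ov, min_length=40, min_identity=0.95, max_partners=100):
    """Apply the four criteria; thresholds are illustrative defaults only."""
    return (
        ov.length >= min_length
        and ov.identity >= min_identity
        and ov.reaches_both_ends
        and ov.partner_count <= max_partners  # high-frequency overlaps look like repeats
    )


good = Overlap(length=120, identity=0.99, reaches_both_ends=True, partner_count=12)
short = Overlap(length=15, identity=1.0, reaches_both_ends=True, partner_count=12)
repeat_like = Overlap(length=120, identity=0.99, reaches_both_ends=True, partner_count=5000)
```

Raising any threshold removes more false positives but also rejects more true overlaps, which is the trade-off the slide points out.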

Assembly terminology
Reads & pairing
Overlaps
Contigs = reads + overlaps
Scaffold = contigs + pairings (gap lengths derived from pairings)
Consensus: ACTGATTAGCTGACGNNNNNNNNNNTCGTATCTAGCATT
Taken from Jason Miller

Unitigs
Unitigs are high-confidence contigs: maximal contigs with no contradiction (or almost no contradiction) in the data. They are ungapped and capture the unique stretches of the genome. Usually, where unitigs end, repeats begin.
Example (figure): the ideal contig is A, B, C. C, D and E have a repetitive sequence.

Scaffolds
Built with mate pairs. Mate pairs help resolve ambiguity in overlap patterns.
Every contig is a scaffold.
Every scaffold contains one or more contigs.
Scaffolds can contain sequence gaps of any size (including negative).

Overlap-Layout-Consensus limitation
30 million 5 kb (long) reads.
Number of alignments: 30 M reads x 30 M reads = 900 trillion alignments.
It does not scale.
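The figure on this slide is easy to check. The short calculation below reproduces it and also shows the count of unordered read pairs, which is roughly half as large but still quadratic. Variable names are only for illustration.

```python
n_reads = 30_000_000

# Every read against every read, as counted on the slide.
all_vs_all = n_reads * n_reads

# Comparing each unordered pair of distinct reads once.
unordered_pairs = n_reads * (n_reads - 1) // 2

print(f"{all_vs_all:,}")       # 900,000,000,000,000
print(f"{unordered_pairs:,}")  # 449,999,985,000,000
```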

Algorithm II kmer-based

K-mer based assembly
K-mer graphs are de Bruijn graphs. Nodes represent all K-mers. Edges represent fixed-length overlaps between consecutive K-mers. Assembly is reduced to a graph reduction problem.
Example with K-mer length = 4:
Read 1: AACCGG
Read 2: CCGGTT
Graph: AACC -> ACCG -> CCGG -> CGGT -> GGTT
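The example graph can be rebuilt with a minimal Python sketch that follows the slide's convention: nodes are K-mers and an edge joins two K-mers that are consecutive in a read. The function names are assumptions, and the sketch ignores reverse complements and K-mer counts.

```python
from collections import defaultdict


def kmers(seq, k):
    """All k-mers of `seq`, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_graph(reads, k):
    """Map each k-mer node to the sorted list of k-mers that follow it in some read."""
    edges = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for left, right in zip(ks, ks[1:]):
            edges[left].add(right)  # consecutive k-mers overlap by k-1 bases
    return {node: sorted(successors) for node, successors in edges.items()}


graph = kmer_graph(["AACCGG", "CCGGTT"], k=4)
# {'AACC': ['ACCG'], 'ACCG': ['CCGG'], 'CCGG': ['CGGT'], 'CGGT': ['GGTT']}
```

Both reads contribute the node CCGG, which is what links them without any read-to-read alignment.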

K-mer based assembly
If we sequence billions of short reads uniformly representing the entire genome and construct a giant de Bruijn graph from those short reads, that de Bruijn graph will look identical to the de Bruijn graph of the genome. Therefore, our challenge will be to go back to the genome sequence from its de Bruijn graph.
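In the ideal case, with no repeats and no errors, going back from the graph to the sequence is just a walk along a single unbranched path. The sketch below assumes a graph stored as a dict from each K-mer to its list of successors, with exactly one start node; the function name is hypothetical, and real graphs branch, which is where the difficulty lies.

```python
def spell_unbranched_path(graph):
    """Spell the sequence along a path of k-mer nodes that never branches."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one node without incoming edges")

    node = starts[0]
    sequence = node
    seen = {node}
    # Follow the path while the current node has exactly one successor.
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in seen:  # guard against cycles
            break
        seen.add(node)
        sequence += node[-1]  # each step adds one new base
    return sequence


# K-mer graph (K = 4) of the two toy reads AACCGG and CCGGTT.
graph = {"AACC": ["ACCG"], "ACCG": ["CCGG"], "CCGG": ["CGGT"], "CGGT": ["GGTT"]}
print(spell_unbranched_path(graph))  # AACCGGTT
```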

K-mer bubbles
If some K-mers have low occurrence, they likely originate from sequencing errors.
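A minimal sketch of this idea: count every K-mer in the reads and set aside those seen fewer than some minimum number of times. The reads, the value of K and the `min_count` threshold are invented for illustration; in practice the threshold would be chosen from the K-mer distribution.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_occurrence(counts, min_count):
    """Separate well-supported k-mers from low-occurrence, probably erroneous ones."""
    solid = {kmer for kmer, n in counts.items() if n >= min_count}
    weak = set(counts) - solid
    return solid, weak


# Five identical reads plus one read carrying a single substitution (G -> A).
reads = ["AACCGGTT"] * 5 + ["AACCGATT"]
counts = count_kmers(reads, k=4)
solid, weak = split_by_occurrence(counts, min_count=2)
```

Here `weak` contains exactly the three 4-mers that cover the substituted base.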

Repetitive K-mers
If almost all nodes of the de Bruijn graph are visited by 50 short reads, but a small subset of nodes is visited by 200 short reads, one can argue that the underlying sequence near that subset of nodes constitutes repetitive regions of the genome, and that these regions are present 4 times in the genome (200 / 50 = 4). So, instead of avoiding repeats, as done by traditional assemblers, de Bruijn graphs allow one to estimate the repeat frequencies of repetitive regions during assembly.
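The 200 / 50 reasoning can be written as a small estimate. The sketch assumes that most nodes are single-copy, so the median node coverage approximates the single-copy coverage; the node names and coverage values are made up.

```python
from statistics import median


def estimate_copy_numbers(node_coverage):
    """Estimate copies per node as coverage divided by the typical (median) coverage."""
    single_copy = median(node_coverage.values())
    return {node: round(cov / single_copy) for node, cov in node_coverage.items()}


# Hypothetical per-node read coverage; n4 is the repetitive node.
coverage = {"n1": 50, "n2": 52, "n3": 48, "n4": 200, "n5": 50}
copies = estimate_copy_numbers(coverage)
# {'n1': 1, 'n2': 1, 'n3': 1, 'n4': 4, 'n5': 1}
```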

Graph problems
Repeats, sequencing errors and polymorphisms increase graph complexity, leading to tangles that are difficult to resolve.
K-mer graphs are more sensitive to repeats and sequencing errors than overlap-based methods.
Optimal graph reduction algorithms are NP-hard, so assemblers use heuristic algorithms.
[Figure: graph structures caused by a polymorphism, a sequencing error and a repeat.]

K-mer analysis
21-mer distribution of the sea urchin (Strongylocentrotus purpuratus) genome (900 Mbase long).
994,401,729 21-mers were present only once in the genome; that is the value at x=1 in the distribution plot. 122,161,264 21-mers were present twice (x=2) and 20,077,544 were present three times (x=3).
[Figure: 21-mer distribution, with the unique and repetitive parts labelled.]
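The distribution described here answers the question "how many distinct K-mers occur x times?". A toy version in Python, assuming forward-strand K-mers only and a made-up sequence, looks like this:

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Map each multiplicity x to the number of distinct k-mers seen exactly x times."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return dict(sorted(Counter(counts.values()).items()))


# Hypothetical 10-base sequence: ACG and CGT occur twice, four other 3-mers once.
spectrum = kmer_spectrum("ACGTACGTTT", k=3)
print(spectrum)  # {1: 4, 2: 2}
```

For the sea urchin genome with k=21, the entry at x=1 would be the 994,401,729 quoted on the slide.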

K-mer analysis
[Figure: K-mer distribution of the genome compared with that of simulated, error-free reads at 50X coverage; labelled peaks: unique, repetitive (duplication).]

K-mer analysis
[Figure: K-mer distribution of real reads (with errors) at 50X coverage, compared with the genome and with simulated reads (no errors); the noise and signal peaks are labelled.]
Why is the noise peak so big? Think about this: every time a read has one nucleotide error, it can result in 21 erroneous 21-mers. Thus, when we look at the k-mer distribution of all reads, errors appear magnified k times. There is also a memory cost to having those errors, so a good strategy for cleaning data would be to reduce the noise peak as much as possible. When we trim 3′ edges, it makes sense to check what trimming is doing to the noise peak. If cutting one additional nucleotide does not reduce the amount of noise, it is not helping the assembly. I think checking the k-mer distribution is a better measure than using the pre-assigned quality scores from the sequencer.
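The "magnified k times" claim can be checked by counting how many k-mer windows change when one base of a read is altered. The sketch below uses an invented 100-base read and marks the error with N so that the substituted base always differs; errors near a read end touch fewer windows, which is why the slide says "can result in".

```python
def changed_kmer_windows(true_read, observed_read, k):
    """Count k-mer windows whose content differs between the two reads."""
    if len(true_read) != len(observed_read):
        raise ValueError("reads must have the same length")
    return sum(
        true_read[i:i + k] != observed_read[i:i + k]
        for i in range(len(true_read) - k + 1)
    )


def substitute(read, pos, base="N"):
    """Return a copy of `read` with the base at `pos` replaced."""
    return read[:pos] + base + read[pos + 1:]


true_read = "ACGT" * 25  # hypothetical 100-base read
k = 21

middle = changed_kmer_windows(true_read, substitute(true_read, 50), k)  # 21
near_start = changed_kmer_windows(true_read, substitute(true_read, 10), k)  # 11
first_base = changed_kmer_windows(true_read, substitute(true_read, 0), k)  # 1
```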

Final remarks

N50
N50 is defined as the contig length such that contigs of equal or greater length contain half the bases of the assembly. The N50 statistic is a measure of the average length of a set of sequences, with greater weight given to longer sequences. It is widely used in genome assembly. (Taken from Wikipedia)
Example statistics for one assembly:
N50: 10829
N95: 20
minimum: 100
maximum: 146,370
average: 1618.01
num. seqs.: 141503
Contig length distribution:
[   100 ,   7414[ (131545): ************************************************************************
[  7414 ,  14728[ (  6247): ***
[ 14728 ,  22042[ (  2353): *
[ 22042 ,  29356[ (   842):
[ 29356 ,  36670[ (   324):
[ 36670 ,  43984[ (   118):
[ 43984 ,  51298[ (    42):
[ 51298 ,  58612[ (    13):
[ 58612 ,  65926[ (    10):
[ 65926 ,  73240[ (     4):
[ 73240 ,  80554[ (     0):
[ 80554 ,  87868[ (     1):
[ 87868 ,  95182[ (     0):
[ 95182 , 146380[ (     4):
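The definition translates directly into code: sort the contigs from longest to shortest and add up lengths until half of the total is reached. The sketch below measures "half" against the total length of the contigs supplied (not an independent genome size estimate); the function name and the toy lengths are illustrative.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Toy assembly: 32 bases in total, so half is 16; the two 8-base contigs reach it.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

Because long contigs are counted first, a handful of long contigs can dominate the value even when most contigs are short, as in the distribution shown on the slide.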

Common assembly errors
Collapsed repeats: reads from distinct (polymorphic) repeat copies are aligned together; fewer repeats in the assembly than in the genome; fewer tandem repeat copies in the assembly than in the genome.
Missed joins: missed overlaps due to sequencing error; contradictory evidence from overlaps; contradictory or insufficient evidence from pairs.
Chimera: the assembly enters a repeat at copy 1 but exits it at copy 2; the assembly joins unrelated sequences.

Ingredients for Good Assembly
Coverage and read length: 20X for (corrected) PacBio, 150X for Illumina.
Paired reads: read lengths long enough to place uniquely; inserts longer than long repeats; pair density sufficient to traverse repeat clusters; tight insert size variance; diversity of insert sizes.
Read quality: sequencing errors; no vector contamination; contamination: mitochondrial or chloroplastic.
The requirements depend greatly on the quality sought: a draft is not the same as a high-quality genome.

Common assemblers
Genomic: SOAPdenovo (Illumina); CABOG, the Celera assembler (454 and a few Illumina reads); Staden (Sanger reads).
Transcriptomic assemblers are specialized software.

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Jose Blanca COMAV institute bioinf.comav.upv.es