Sequence assembly using paired-end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.

Sequence assembly using paired-end short tags. Pramila Ariyaratne, Genome Institute of Singapore. SOC-FOS-SICS Joint Workshop on Computational Analysis of DNA, 13 July 2009.

Overview Genome sequencing – Interrogating the genome of a particular species to discover its constituent DNA sequence. – Has both wet-lab and dry-lab (bioinformatics) components.

Overview A complete chromosome can range from a few thousand bp to a few hundred million bp. The maximum sequenceable fragment (read) length is ~500-1,000 bp. Because no single read can span a chromosome, a whole-genome shotgun sequencing approach is needed.

Overview Whole genome shotgun sequencing. Illustration from

Traditional approach – Sequence shotgun fragments of length 600 bp using Sanger capillary sequencing. – ~10x coverage (sequencing depth). – Assemble using the overlap-layout-consensus approach.

Traditional approach Overlap-layout-consensus method for assembly. – Build an overlap graph where each node represents a read. An edge exists between two reads if they overlap. – Traverse the graph to find unambiguous paths which form contigs. Illustration from
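The overlap graph described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular assembler: the function names, the `min_overlap` threshold and the three example reads are assumptions, exact (error-free) suffix/prefix matches are assumed, and the all-pairs comparison is quadratic in the number of reads.

```python
from itertools import permutations


def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such match is shorter than min_overlap.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Map each read to {overlapping read: overlap length}.

    Assumes the reads are distinct strings (they are used as dict keys).
    """
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_overlap)
        if length:
            graph[a][b] = length
    return graph


# Hypothetical example reads, chosen so that they chain into one contig.
reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
graph = build_overlap_graph(reads)
```

Here every read has at most one outgoing edge, so following the edges from the first read gives a single unambiguous path, which is what the traversal step looks for.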

Traditional approach Sanger capillary sequencing is very slow. – 384 sequences / day (0.4 million bp) – 10x coverage of the human genome: ~30 Gbp
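As a rough illustration of why this throughput is limiting, the sketch below divides the ~30 billion bp target by the 0.4 million bp per day quoted on this slide. It assumes a single instrument running every day; the variable names are illustrative and not from the talk.

```python
# Figures taken from the slide; "one instrument, every day" is an assumption.
TARGET_BP = 30_000_000_000    # ~10x coverage of the human genome
SANGER_BP_PER_DAY = 400_000   # 384 sequences / day, ~0.4 million bp

days_needed = TARGET_BP / SANGER_BP_PER_DAY
years_needed = days_needed / 365

print(f"{days_needed:,.0f} instrument-days (~{years_needed:,.0f} instrument-years)")
```

Under these assumptions the total is about 75,000 instrument-days, which is why such projects needed many instruments running in parallel.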

Next-generation sequencing Alternative sequencing technologies to capillary sequencing, introduced in the mid-2000s. Systems by Illumina (Solexa) and ABI (SOLiD). – Much higher throughput (1-4 Gbp / day) – Lower cost per base pair – Very short fragment (read) lengths (25-75 bp) – High error rate – Inherent ability to do paired-end (mate-pair) sequencing.

Next-generation sequencing Paired-end sequencing (mate pairs) – Sequence the two ends of a fragment of known size. – Currently the fragment length (insert size) can range from 200 bp to 10,000 bp.
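To make the idea concrete, here is a minimal Python sketch that samples one read pair from the two ends of a fragment whose length lies between a minimum and a maximum insert size. The function name `sample_pair`, the uniform choice of insert size, and the orientation (first read forward, second read reverse-complemented) are assumptions for illustration; real libraries differ in orientation and size distribution.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_pair(genome, read_len, min_span, max_span, rng):
    """Sample one fragment and return (read1, read2, start, span).

    read1 is the first read_len bases of the fragment; read2 is the
    reverse complement of its last read_len bases (assumed orientation).
    """
    span = rng.randint(min_span, max_span)          # insert size, inclusive
    start = rng.randrange(len(genome) - span + 1)   # fragment start
    fragment = genome[start:start + span]
    return fragment[:read_len], reverse_complement(fragment[-read_len:]), start, span


# Hypothetical toy genome and parameters.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(1000))
read1, read2, start, span = sample_pair(genome, 25, 200, 300, rng)
```

The assembler only sees `read1` and `read2`; the known range of `span` is what lets it constrain where the second read can lie relative to the first.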

Next-generation sequencing The data are challenging to assemble. – Short fragment length means very small overlaps, and therefore many false overlaps. – Sequencing at up to 100x coverage increases the data size. – A large number of reads, short overlaps and a higher error rate make the traditional overlap-layout-consensus approach impractical.

Current approaches Euler / De Bruijn approach. – Introduced as an alternative to the overlap-layout-consensus approach in capillary sequencing. – More suited to short-read assembly. – Based on the De Bruijn graph. – Implemented in Velvet [1], the most widely used short-read assembly method at present. [1] Daniel Zerbino and Ewan Birney. Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. Genome Res. 18:

De Bruijn graph method – Break each read into overlapping fragments of size k (k-mers). – Form a De Bruijn graph in which each (k-1)-mer is a node. – An edge exists from node a to node b iff there is a k-mer whose (k-1)-prefix is a and whose (k-1)-suffix is b. – Traverse unambiguous paths in the graph to form contigs.
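The construction above can be sketched as follows. This is a simplified illustration rather than Velvet's implementation: edges are stored as sets (so k-mer multiplicity is dropped), reads are assumed error-free, and the function names and example reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def unambiguous_path(graph, start):
    """Extend from start while the path is unambiguous.

    A step is taken only if the current node has exactly one successor
    and that successor has exactly one predecessor.
    """
    indegree = defaultdict(int)
    for successors in graph.values():
        for nxt in successors:
            indegree[nxt] += 1

    contig, node, visited = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in visited:
            break
        contig += nxt[-1]   # each step adds one new base
        visited.add(nxt)
        node = nxt
    return contig


# Hypothetical example reads; k = 4 as on the following slide.
reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
de_bruijn = build_de_bruijn(reads, k=4)
contig = unambiguous_path(de_bruijn, "ACG")
```

With these reads no (k-1)-mer repeats, so the walk from `ACG` recovers the whole 16-base sequence; a repeated (k-1)-mer would create a branch and end the contig there.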

De Bruijn graph K = 4

De Bruijn graph method / Velvet – Elegant way of representing the problem. – Very fast execution. – Error correction can be handled in the graph. – De Bruijn graph size can be huge: ~200 GB for human genomes. – Does not use pair information in the initial phase, resulting in overly complicated graphs. Therefore we devised our own approach.

Our approach Based on ‘overlap extension’ – Similar to SSAKE and VCAKE, but with support for paired-end reads. Strictly paired-end sequences – Insert size: between MIN_SPAN and MAX_SPAN. 3-step procedure – Seed building & extension – Contig ordering – Gap filling

Our approach Overlap extension

Seed building Seed = initial sequence of length MAX_SPAN. – Start with a single read as the current sequence. – Do overlap extension. – Keep track of ‘pools’ of paired-end data. – Resolve ambiguities using these ‘pools’.
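The overlap-extension step can be sketched as a greedy, one-base-at-a-time extension. This is an illustration only and not the PE-Assembler code: the function names and `min_overlap` are assumptions, reads are scanned linearly instead of through a compressed suffix array, and the sketch simply stops when the next base is ambiguous, whereas the talk resolves such cases with the paired-end ‘pools’.

```python
def next_base_candidates(sequence, reads, min_overlap):
    """Bases proposed as the next base by reads overlapping the 3' end."""
    candidates = set()
    for read in reads:
        # Longest overlap first; the read must contribute at least one new base.
        for overlap in range(len(read) - 1, min_overlap - 1, -1):
            if sequence.endswith(read[:overlap]):
                candidates.add(read[overlap])
                break
    return candidates


def overlap_extend(start_read, reads, min_overlap=4, max_length=1000):
    """Extend start_read base by base until a dead end, an ambiguity,
    or max_length (MAX_SPAN would play this role for a seed)."""
    sequence = start_read
    while len(sequence) < max_length:
        candidates = next_base_candidates(sequence, reads, min_overlap)
        if len(candidates) != 1:
            break   # 0 candidates: dead end; more than 1: ambiguity
        sequence += candidates.pop()
    return sequence


# Hypothetical error-free reads.
reads = ["ACGATTAC", "GATTACAA", "TACAATAG", "AATAGGTT"]
extended = overlap_extend("ACGATTAC", reads)

# Two reads disagree on the next base, so extension stops immediately.
stalled = overlap_extend("AACGT", ["ACGTA", "ACGTC"])
```

Extending one base at a time keeps the ambiguity check simple: a branch shows up as more than one candidate base, which is the point where paired-end information would be consulted.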

Seed building Resolving ambiguities

Seed building Seed verification – Check whether the assembled seed represents a contiguous region of the target genome. – Carried out once the seed reaches length MAX_SPAN. – Unverified seeds are discarded.

Seed extension – Based on overlap extension. – Always look for anchored reads. – Possible complication

Seed building & extension Repeat seed building, verification and extension steps until we have used (or tried to use) all read sequences. Order resulting contigs in next step.

Contig ordering Use paired-end information to order contigs. There is a potential gap between every pair of adjacent contigs.
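One simple way to turn pair information into an order, not necessarily the one used in the talk, is to count how many read pairs link each pair of contigs and greedily accept the best-supported links. In the sketch below the function name, the `min_support` threshold and the input format (one `(upstream, downstream)` tuple per read pair) are assumptions. Contig orientation is ignored, and contigs whose accepted links form a cycle are not reported.

```python
from collections import Counter


def order_contigs(links, min_support=2):
    """Chain contigs using read-pair links.

    links: iterable of (upstream_contig, downstream_contig), one per pair.
    Returns a list of chains, each a list of contig names in order.
    """
    support = Counter(links)
    successor = {}
    has_predecessor = set()

    # Accept the best-supported links first; each contig gets at most
    # one successor and one predecessor.
    for (left, right), count in support.most_common():
        if count < min_support:
            break
        if left == right or left in successor or right in has_predecessor:
            continue
        successor[left] = right
        has_predecessor.add(right)

    contigs = {contig for link in support for contig in link}
    chains = []
    for start in sorted(contigs - has_predecessor):
        chain = [start]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        chains.append(chain)
    return chains


# Hypothetical links: c1-c2 seen 3 times, c2-c3 twice, c1-c3 once.
links = [("c1", "c2")] * 3 + [("c2", "c3")] * 2 + [("c1", "c3")]
chains = order_contigs(links)
```

The single `c1`-`c3` link falls below `min_support` and is ignored, which is a crude guard against links caused by chimeric or mis-mapped pairs.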

Gap filling – Fill the gap between two adjacent contigs using paired-end information. – The length of the gap can be estimated from read pairs that map to both sides. – Overlap extension uses only the set of ‘supported’ reads.
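The gap-length estimate can be illustrated with simple arithmetic: for a pair that spans the gap, the expected insert size minus the bases the pair covers in the left contig and in the right contig leaves the gap. The sketch below averages this over all spanning pairs. The function name, the coordinate convention and the use of a single `expected_span` (for example, a value between MIN_SPAN and MAX_SPAN) are assumptions, not details from the talk.

```python
from statistics import mean


def estimate_gap(left_contig_len, spanning_pairs, expected_span):
    """Estimate the gap between two adjacent contigs.

    spanning_pairs: list of (start_in_left, end_in_right), where
    start_in_left is the 0-based start of the first read in the left
    contig and end_in_right is the end coordinate of its mate in the
    right contig (number of right-contig bases the fragment covers).
    """
    estimates = [
        expected_span - (left_contig_len - start_in_left) - end_in_right
        for start_in_left, end_in_right in spanning_pairs
    ]
    return mean(estimates)


# Hypothetical numbers: a 1,000 bp left contig and three spanning pairs.
pairs = [(900, 50), (850, 10), (950, 90)]
gap = estimate_gap(1000, pairs, expected_span=300)
```

The three pairs give individual estimates of 150, 140 and 160 bp, so `gap` is 150. A negative estimate would suggest that the two contigs actually overlap.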

Implementation – Implemented the current approach in C++. – Used a compressed suffix array for overlap searching.

Implementation Simulated data – A strain of E. coli – 4.6 million bp length – 25 bp tags – Insert size of – 40x coverage – 1% sequencing errors – 0.5% ligation errors

Implementation Real data – A strain of Neisseria meningitidis – ~2.2 million bp length – 25 bp tags – Insert size of – ~40x coverage

Results Simulated data

Results Real data

To Do – Improve speed. – Allow multiple libraries with different insert sizes. – Add multi-CPU support.

Acknowledgement Ken Sung, Christina Nilsson, Lim Yan Wei, Ruan Yijun