PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.

Outline
Method
– Read screening
– Seed building
– Contig extension
– Scaffolding
– Gap filling
Result

Data sets used
Single-end reads
Paired-end reads
– ReadLength: from 25bp to 100bp
– Insert size varies from MinSpan to MaxSpan
– Most of the information comes from this data set.

Overview
– The read screening step selects a set of reads as starting points.
– The seed building step extends these reads using single-end reads until they are longer than MaxSpan. Successfully extended regions are called seeds.
– The contig extension step tries to extend every seed using paired-end reads. The resulting sequences are called contigs.

Read screening
Get all k-mers from all the reads.
– A k-mer that is expected to occur in the actual genome is called a ‘solid’ k-mer.
– A k-mer that is expected to occur within a repeat region is called a ‘repeat’ k-mer.
Repeat region:
– ACTTTGACACACACACAC……ACACACACGTTGAG

Read screening (graph showing the two threshold cut-offs; figure not included in this transcript)

A read is a solid read if:
– All its k-mers are within the two threshold cut-offs.
Example:
– Two cut-offs [42, 120] from the previous graph.
– K = 5
– Read: ACCGTATA
– k-mers: ACCGT, CCGTA, CGTAT, GTATA
– Counts: 100, 70, 90, 140
– Not a solid read, because GTATA occurs 140 times, which is above the upper cut-off of 120.

Read screening
Example:
– Two cut-offs [42, 120] from the previous graph.
– K = 5
– Read: ACCGTATG
– k-mers: ACCGT, CCGTA, CGTAT, GTATG
– Counts: 100, 70, 90, 70
– A solid read, because every count lies within [42, 120].
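The two examples above can be checked with a short sketch. The function names (`kmers`, `count_kmers`, `is_solid_read`) are hypothetical, and treating both cut-offs as inclusive is an assumption; the slides do not say whether the bounds are inclusive.

```python
from collections import Counter


def kmers(read, k):
    """Return every length-k substring of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def is_solid_read(read, k, counts, low, high):
    """A read is solid if all of its k-mer counts lie within [low, high].

    Inclusive bounds are an assumption made for this sketch.
    """
    return all(low <= counts[km] <= high for km in kmers(read, k))


# Counts taken from the slide examples (not computed from real reads).
slide_counts = Counter(
    {"ACCGT": 100, "CCGTA": 70, "CGTAT": 90, "GTATA": 140, "GTATG": 70}
)

print(is_solid_read("ACCGTATA", 5, slide_counts, 42, 120))  # False
print(is_solid_read("ACCGTATG", 5, slide_counts, 42, 120))  # True
```

In a real run the counts would come from `count_kmers` over the whole read set rather than from a hand-written table.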

Seed Building
Try to extend the solid read using all overlapping reads.

Seed Building
Because of sequencing errors or small repeats, there may be multiple feasible candidates.

Seed Building
To resolve ambiguities due to sequencing errors, we extend every candidate base up to ReadLength.
– If only one candidate path reaches the full distance ReadLength, that path is assumed to be the correct extension.
– If no path or more than one path is found, try the other side.
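A minimal sketch of this look-ahead rule follows. It is not the PE-Assembler implementation: the names `next_bases`, `paths_reaching` and `extend_with_lookahead`, the `min_overlap` parameter, and the exact-match overlap criterion are all assumptions made for illustration. Only extension to the right is shown; returning `None` stands for "try the other side".

```python
def next_bases(seq, reads, min_overlap):
    """Bases proposed by reads whose prefix exactly matches a suffix of seq."""
    bases = set()
    for read in reads:
        # Try every overlap length that still leaves one new base in the read.
        for ov in range(min_overlap, min(len(read) - 1, len(seq)) + 1):
            if seq.endswith(read[:ov]):
                bases.add(read[ov])
    return bases


def paths_reaching(seq, reads, min_overlap, depth):
    """All extensions of exactly `depth` bases supported by overlapping reads."""
    if depth == 0:
        return [""]
    found = []
    for base in sorted(next_bases(seq, reads, min_overlap)):
        for tail in paths_reaching(seq + base, reads, min_overlap, depth - 1):
            found.append(base + tail)
    return found


def extend_with_lookahead(seq, reads, min_overlap, lookahead):
    """Extend seq to the right; return None if the extension is ambiguous or stuck.

    `lookahead` plays the role of ReadLength in the slides.
    """
    bases = next_bases(seq, reads, min_overlap)
    if len(bases) == 1:
        return seq + next(iter(bases))
    paths = paths_reaching(seq, reads, min_overlap, lookahead)
    if len(paths) == 1:
        return seq + paths[0]
    return None


# Error-free reads from a toy sequence, plus one read with a sequencing error.
reads = ["ACGTT", "CGTTG", "GTTGC", "TTGCA", "TGCAA",
         "GCAAG", "CAAGG", "AAGGC", "CGTAG"]
print(extend_with_lookahead("ACGT", reads, 3, 5))  # ACGTTGCAA
```

The erroneous read `CGTAG` proposes `A` after `ACGT`, but that branch dies after two bases, so only the path `TGCAA` reaches the full look-ahead distance. The recursion is exponential in the worst case, which is acceptable for a sketch but not for real data.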

Seed Building
Finally, when the sequence reaches MaxSpan (it is then called a seed), do a verification:
– At least one paired-end read must overlap with this seed within the expected span [MinSpan, MaxSpan].
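The verification can be sketched as below, under simplifying assumptions that are not in the slides: both mates are given in the orientation of the seed, mates must match the seed exactly, and the span is measured from the start of the left mate to the end of the right mate. The function name `seed_is_verified` is hypothetical.

```python
def seed_is_verified(seed, pairs, min_span, max_span):
    """Return True if at least one read pair maps onto the seed with a span
    inside [min_span, max_span].

    Assumptions for this sketch: exact matching, both mates already in seed
    orientation, span = start of left mate to end of right mate.
    """
    for left, right in pairs:
        start = seed.find(left)
        while start != -1:
            pos = seed.find(right, start)
            while pos != -1:
                span = pos + len(right) - start
                if min_span <= span <= max_span:
                    return True
                pos = seed.find(right, pos + 1)
            start = seed.find(left, start + 1)
    return False


seed = "AAACCCGGGTTTACGT"
pairs = [("AAAC", "ACGT")]
print(seed_is_verified(seed, pairs, 10, 20))  # True
print(seed_is_verified(seed, pairs, 5, 10))   # False
```

Here the pair spans the whole 16-base seed, so it passes for the window [10, 20] and fails for [5, 10].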

Contig Extension
This step aims to extend each verified seed into a longer contig using paired-end reads.
Multiple feasible candidates may arise for 3 reasons:
– First, sequencing errors.
– Second, short tandem repeats. These are handled in the gap filling step.
– Third, long repeats, which are longer than MaxSpan.

Scaffolding
Find the correct ordering of the resulting set of contigs.
Gao Song is currently working on it.

Gap filling
The gap filling step assembles the gap region between two adjacent contigs to form a longer contig.

Simulated data results.
Results are compared using:
– Average length of all contigs.
– N50 and N90 of contigs. Bigger is better.
– Coverage.
– Large misassemblies: accuracy is much more important than the other measures.
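The slides do not define N50 and N90. The sketch below uses the common definition: Nx is the length of the shortest contig in the smallest set of longest contigs that together cover at least x% of the total assembled length. The function name `nx` and the contig lengths are made up for illustration.

```python
def nx(lengths, x):
    """Return the Nx statistic (e.g. x=50 for N50) of a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    return 0  # empty input


contig_lengths = [80, 70, 50, 40, 30, 20]  # hypothetical lengths, total 290
print(nx(contig_lengths, 50))  # 70
print(nx(contig_lengths, 90))  # 30
```

With these lengths, the two longest contigs (80 + 70 = 150) already cover half of the 290 bases, so N50 is 70. Reaching 90% takes the five longest contigs, so N90 is 30.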

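Simulated data results.

Data set      Libraries                Assembler   Coverage   Segment maps
E. coli       200bp + 10kbp            PA          99.89%     99.20%
E. coli       200bp + 10kbp            Allpaths2   99.83%     99.27%
E. coli       200bp + 10kbp            Velvet      99.60%     95.00%
E. coli       200bp + 1kbp + 10kbp     PA          100.00%    99.68%
E. coli       200bp + 1kbp + 10kbp     Allpaths2   99.85%     99.18%
S. pombe      200bp + 10kbp            PA          97.56%     96.02%
S. pombe      200bp + 10kbp            Allpaths2   98.62%     96.78%
S. pombe      200bp + 10kbp            Velvet      98.95%     93.38%
S. pombe      200bp + 1kbp + 10kbp     PA          97.78%     96.42%
S. pombe      200bp + 1kbp + 10kbp     Allpaths2   98.60%     96.83%
HG18 chr10    (not given)              PA          94.20%     90.48%

Rows whose values are not preserved in this transcript:
– Contig statistics: Contigs (>200bp), Average length (kb), Maximum length (kb), Contig N50 size (kb), Contig N90 size (kb)
– Evaluation: Large misassemblies
– Performance: Execution time (min), Memory usage (gb)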

Thank you for your attention.