billion-piece genome puzzle

Presentation transcript:

ALLPATHS-LG: a new standard for assembling a billion-piece genome puzzle

CS 681
High-quality draft assemblies of mammalian genomes from massively parallel sequence data
ALLPATHS-LG, by Sante Gnerre et al. (20 authors)
Jan 25th, 2011
Presented by Ömer Köksal

Agenda
Introduction
Results
  Model for Input Data
  Sequencing Data
  ALLPATHS-LG Assembly Method
  Uncertainty in Assemblies
  Human and Mouse Assemblies
  Human Genome
  Mouse Genome
  Segmental Duplications
  Understanding Gaps
Discussion

Introduction
High-quality assembly of a genome sequence is critical.
Assembly is particularly challenging for large, repeat-rich genomes such as those of mammals.
With traditional capillary-based sequencing (reads of >700 bases), such assemblies were produced for multiple mammals at a cost of tens of millions of dollars each.
New massively parallel technologies were expected to lower the cost dramatically, but so far they have not, because their reads are:
  short (~100 bases in length)
  less accurate
  difficult to assemble

Introduction (cont’d)
ALLPATHS-LG: de novo assembly of large (and small) genomes.
It should be possible to generate high-quality draft assemblies of large genomes at ~1000-fold lower cost than a decade ago.
Previous versions: ALLPATHS 1.0 (2008), ALLPATHS 2.0 (2009)

Results
  Model for Input Data
  Sequencing Data
  ALLPATHS-LG Assembly Method
  Uncertainty in Assemblies
  Human and Mouse Assemblies
  Human Genome
  Mouse Genome
  Segmental Duplications
  Understanding Gaps

Results - Model for Input Data
De novo genome assembly depends on:
  the computational methods
  the nature and quantity of the sequence data used
The fairly standard model for capillary-based sequencing was modified.
The model sets a target of 100-fold sequence coverage to compensate for shorter reads and nonuniform coverage.
Despite the higher coverage, the proposed model is dramatically cheaper, since the per-base cost is ~10,000-fold lower than for capillary sequencing.
Illumina sequencing was used (Table 1).
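The coverage and cost reasoning on this slide can be checked with a little arithmetic. The sketch below is only an illustration: the genome size (3 Gb) and the capillary coverage (8-fold) are assumed values that do not appear on the slides, and the function names are hypothetical.

```python
import math

def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Reads required so that total sequenced bases = coverage x genome size."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)

def overall_fold_cheaper(coverage_new, coverage_old, per_base_fold_cheaper):
    """Net cost advantage once the extra coverage has been paid for."""
    return per_base_fold_cheaper / (coverage_new / coverage_old)

GENOME_SIZE = 3_000_000_000   # assumption: roughly a mammalian genome
READ_LENGTH = 100             # from the slides: ~100-base reads
TARGET_COVERAGE = 100         # from the slides: 100-fold target
CAPILLARY_COVERAGE = 8        # assumption, for illustration only

print(reads_needed(GENOME_SIZE, READ_LENGTH, TARGET_COVERAGE))
# -> 3000000000
print(overall_fold_cheaper(TARGET_COVERAGE, CAPILLARY_COVERAGE, 10_000))
# -> 800.0
```

Under these assumed numbers, a ~10,000-fold cheaper base combined with 12.5 times more coverage gives a net saving of several hundred fold, which is the same order of magnitude as the ~1000-fold figure quoted in the introduction.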

Results - Model for Input Data (cont’d) Table 1 – Provisional sequencing model for de novo assembly

Results – Sequencing Data
Using the model above, sequence data were generated for a human genome and a mouse genome.
Human genome: GM12878 (Coriell Institute), from the 1000 Genomes Pilot Project
  NCBI Short Read Archive: Human_NA_12878_Genome_on_illumina
Mouse genome: C57BL/6J female DNA
  NCBI Short Read Archive: Mouse_B6_Genome_on_illumina

Results - ALLPATHS-LG Assembly Method
Previous versions were improved extensively.
It can also assemble small genomes.
Freely available at: http://www.broadinstitute.org/science/programs/genome-biology/crd

Results - ALLPATHS-LG Assembly Method (cont’d)
Some key innovations in ALLPATHS-LG:
Handling of repetitive sequences
  more resilient to repeats
Error correction
  for every 24-mer, the algorithm examines the stack of all reads containing that 24-mer
  the incidence of incorrect error correction was reduced
Use of jumping data
  it can work even with such data: it trims bases beyond junction points and treats each read pair as belonging to one of two possible distributions
Efficient memory usage
  can assemble human genomes on commercial servers (48 processors and 512 GB RAM) in a few weeks (3 weeks for mouse and 3.5 weeks for human)
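The idea of examining the stack of reads that share a k-mer can be illustrated with a toy example. This is a simplified sketch, not the ALLPATHS-LG implementation: it ignores quality scores and reverse complements, the thresholds `min_depth` and `min_fraction` are hypothetical, and the usage example uses a tiny k so the data fit on a slide.

```python
from collections import Counter, defaultdict

def build_stacks(reads, k):
    """Map each k-mer to the (read index, offset) pairs where it occurs."""
    stacks = defaultdict(list)
    for i, read in enumerate(reads):
        for off in range(len(read) - k + 1):
            stacks[read[off:off + k]].append((i, off))
    return stacks

def correct_reads(reads, k=24, min_depth=3, min_fraction=0.8):
    """Toy correction: align the reads of each stack on their shared k-mer,
    take a per-column majority vote, and propose the majority base wherever
    a read disagrees with a strong consensus."""
    proposals = [defaultdict(Counter) for _ in reads]
    for members in build_stacks(reads, k).values():
        if len(members) < min_depth:
            continue  # too few reads to trust this stack
        columns = defaultdict(Counter)  # column = position relative to k-mer start
        for i, off in members:
            for pos, base in enumerate(reads[i]):
                columns[pos - off][base] += 1
        for i, off in members:
            for pos, base in enumerate(reads[i]):
                col = columns[pos - off]
                best, count = col.most_common(1)[0]
                depth = sum(col.values())
                if best != base and depth >= min_depth and count / depth >= min_fraction:
                    proposals[i][pos][best] += 1
    corrected = []
    for read, per_pos in zip(reads, proposals):
        bases = list(read)
        for pos, counter in per_pos.items():
            bases[pos] = counter.most_common(1)[0][0]
        corrected.append("".join(bases))
    return corrected

# Five error-free copies of a read plus one copy with a single substitution.
reads = ["ACGTTGCAAG"] * 5 + ["ACGTTGCTAG"]
print(correct_reads(reads, k=4)[-1])
# -> ACGTTGCAAG
```

Requiring both a minimum depth and a strong majority before changing a base is one simple way to limit incorrect corrections, at the price of leaving low-coverage errors untouched.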

Results – Uncertainty in Assemblies
The goal of assembly is to reconstruct the genome as accurately as possible.
However, in some locations the data may be compatible with more than one solution.
Rather than making an arbitrary choice (and introducing errors), the algorithm retains the alternatives.
ALLPATHS-LG generates an assembly graph whose edges are sequences and whose branches represent alternate choices:
  ATC{A,T}GGTTTTTTT{T,TT}ACAC
Variant Call Format (.VCF file)
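To see what the brace notation encodes, the example string from this slide can be expanded into every sequence it is compatible with. The parser below is a minimal sketch written for this illustration; `expand_alternatives` is a hypothetical helper, not part of ALLPATHS-LG, and it assumes braces are never nested.

```python
import itertools
import re

def expand_alternatives(pattern):
    """Expand 'ATC{A,T}GG' style notation into all sequences it represents."""
    # re.split with a capture group alternates literal text and brace contents.
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = []
    for idx, part in enumerate(parts):
        if idx % 2 == 0:
            choices.append([part])  # literal sequence
        else:
            choices.append([alt.strip() for alt in part.split(",")])
    return ["".join(combo) for combo in itertools.product(*choices)]

for seq in expand_alternatives("ATC{A,T}GGTTTTTTT{T,TT}ACAC"):
    print(seq)
```

The example has two ambiguous sites with two options each, so it expands to four candidate sequences: a single-base ambiguity (A or T) combined with a homopolymer-length ambiguity (one or two extra T bases).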

Results – Uncertainty in Assemblies (cont’d)
NOTE: The current version of ALLPATHS-LG captures only single-base and simple-sequence indel uncertainties.
Better ways to capture alternatives are needed (many alternatives are still lost in the current version, giving rise to errors).
It would be desirable to assign a probability to each alternative.

Results – Human & Mouse Assemblies
The resulting assemblies provide good coverage of the human and mouse genomes.
The ALLPATHS-LG assemblies were compared with previously published assemblies:
  capillary-based sequencing
  SOAP (massively parallel sequencing)

Results – Human & Mouse Assemblies (cont’d)

Results – Human Genome
N50 contig length of 24 kb
Scaffold length of 11.5 Mb
Contiguity is > 4 fold longer than SOAP algorithm
Connectivity is > 25 fold longer than SOAP algorithm
Assembled sequence contains 91.1% of the reference genome (SOAP: 74.3%)
Assembled sequence contains 95.1% of the reference genome (SOAP: 81.2%)
Results are similar to capillary-based assemblies
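The N50 statistic quoted here is the length L such that contigs (or scaffolds) of length L or longer together contain at least half of all assembled bases. A minimal sketch of the calculation, using made-up lengths rather than data from the assemblies:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of all bases
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical contig lengths, for illustration only.
print(n50([2, 3, 4, 5, 6]))
# -> 5
```

In the example the total is 20, and the two longest contigs (6 and 5) already cover 11 bases, so the N50 is 5. Because the statistic is weighted by length, many tiny contigs barely move it, which is why it is a common summary of contiguity.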

Results – Human Genome (cont’d)
Local assembly error: 3.5% (Capillary: 4.1%, SOAP: 6.2%)
Long-range accuracy: 99.1% (Capillary: 99.7%, SOAP: 99.5%)

Results – Mouse Genome
Results are broadly similar for the mouse genome:
N50 contig length of 16 kb
Scaffold length of 7.2 Mb
Connectivity is > 20 fold larger than SOAP algorithm
Results approach those of the capillary assembly (contig: 25 kb, scaffold: 16.9 Mb)
Assembled sequence contains 88.7% of the reference genome (Capillary: 94.2%)
Assembled sequence contains 96.7% of the reference genome (Capillary: 97.3%)
Results are considerably better than SOAP

Results – Mouse Genome (cont’d)
Local assembly error: 3.0% (Capillary: 2.7%, SOAP: 14.2%)
Long-range accuracy: 99.0% (Capillary: 99.1%, SOAP: 98.8%)

Results – Segmental Duplications
Segmental duplications pose a challenge.
The ALLPATHS-LG assemblies (human and mouse) cover only ~40% of segmental duplications (Capillary: 60%, SOAP: 12%).
NOTE: Clearly, additional work is needed here.

Results – Understanding Gaps
Roughly three quarters of the gaps are captured; the remaining gaps are not spanned.
The majority of the sequence within the gaps consists of repetitive elements (61.9% of gaps):
  For mouse: LINE elements are the major contributors to gaps
  For human: LINE and SINE elements

Discussion
High-quality vertebrate genomes have provided an essential foundation for comparative analysis of the human genome.
Each cost tens of millions of dollars to generate with capillary-based sequencing.
In this work, ALLPATHS-LG was presented, lowering the cost ~1000-fold.

Discussion (cont’d)
ALLPATHS-LG provides:
  good long-range connectivity
  good accuracy
  good coverage
relative to capillary-based sequencing, and better than SOAP.
Compared with SOAP, the quality of the assemblies is considerably better: scaffolds are > 25 times longer.

Discussion (cont’d)
ALLPATHS-LG is anticipated to yield even better results in improved versions.
ALLPATHS-LG introduced a preliminary syntax for expressing alternatives: TTTT{T, TT}
Computational hardware requirements:
  SOAP is faster (takes 3 days) but its accuracy is low
  ALLPATHS-LG is slower but produces high-quality assemblies
ALLPATHS-LG is anticipated to be sped up by algorithmic improvements (keeping in mind the trade-off between speed and accuracy).

Thank you. Questions?