billion-piece genome puzzle


1 billion-piece genome puzzle
ALLPATHS-LG: a new standard for assembling a billion-piece genome puzzle

2 Sante Gnerre et al. (20 Authors)
CS 681. High-quality draft assemblies of mammalian genomes from massively parallel sequence data (ALLPATHS-LG), by Sante Gnerre et al. (20 authors). Jan 25th, 2011. Presented by Ömer Köksal.

3 Agenda
Introduction. Results: Model for Input Data; Sequencing Data; ALLPATHS-LG Assembly Method; Uncertainty in Assemblies; Human and Mouse Assemblies (Human Genome, Mouse Genome); Segmental Duplications; Understanding Gaps. Discussion.

4 Introduction
High-quality assembly of a genome sequence is critical. It is particularly challenging for large, repeat-rich genomes such as those of mammals. Using traditional capillary-based sequencing (reads of >700 bases), such assemblies were produced for multiple mammals at a cost of tens of millions of dollars each. New massively parallel technologies were expected to lower the cost dramatically, but they had not done so, because their reads are short (~100 bases in length), less accurate, and difficult to assemble.

5 Introduction (cont’d)
ALLPATHS-LG performs de novo assembly of large (and small) genomes. With it, it should be possible to generate high-quality draft assemblies of large genomes at ~1000-fold lower cost than a decade ago. Previous versions: ALLPATHS 1.0 (2008) and ALLPATHS 2.0 (2009).

6 Results
Model for Input Data; Sequencing Data; ALLPATHS-LG Assembly Method; Uncertainty in Assemblies; Human and Mouse Assemblies (Human Genome, Mouse Genome); Segmental Duplications; Understanding Gaps.

7 Results - Model for Input Data
De novo genome assembly depends on both the computational methods and the nature and quantity of the sequence data used. The fairly standard model for capillary-based sequencing was modified. The new model sets a target of 100-fold sequence coverage to compensate for shorter reads and nonuniform coverage. Despite the higher coverage, the proposed model is dramatically cheaper, since the per-base cost is ~10,000-fold lower than for capillary sequencing. Illumina sequencing was used (Table 1).
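
As a rough illustration of what a 100-fold coverage target implies, the sketch below uses the usual relation coverage = (number of reads × read length) / genome size. The ~3 Gb genome size is an assumption made here for illustration and is not taken from the slides; the ~100-base read length is the figure quoted in the introduction. The function names are hypothetical.

```python
def reads_needed(genome_size_bp, read_length_bp, target_coverage):
    """Smallest number of reads whose total bases reach the target coverage."""
    total_bases = target_coverage * genome_size_bp
    # Integer ceiling division, so the target is met rather than just missed.
    return (total_bases + read_length_bp - 1) // read_length_bp


def coverage(num_reads, read_length_bp, genome_size_bp):
    """Average fold coverage given a number of reads of equal length."""
    return num_reads * read_length_bp / genome_size_bp


GENOME_SIZE_BP = 3_000_000_000  # assumption: roughly mammalian-sized genome
READ_LENGTH_BP = 100            # ~100-base reads, as quoted in the slides
TARGET_COVERAGE = 100           # 100-fold target from the sequencing model

n = reads_needed(GENOME_SIZE_BP, READ_LENGTH_BP, TARGET_COVERAGE)
print(n, coverage(n, READ_LENGTH_BP, GENOME_SIZE_BP))
```

This ignores library types, read pairing, and nonuniform coverage, all of which the actual sequencing model has to account for.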

8 Results - Model for Input Data (cont’d)
Table 1 – Provisional sequencing model for de novo assembly

9 Results – Sequencing Data
Using the model above, sequence data were generated for a human genome and a mouse genome. Human genome: GM12878 (Coriell Institute), from the 1000 Genomes Pilot Project; NCBI Short Read Archive: Human_NA_12878_Genome_on_illumina. Mouse genome: C57BL/6J female DNA; NCBI Short Read Archive: Mouse_B6_Genome_on_illumina.

10 Results - ALLPATHS-LG Assembly Method
Previous versions were improved extensively. ALLPATHS-LG can also assemble small genomes. Freely available at:

11 Results - ALLPATHS-LG Assembly Method (cont’d)
Some key innovations in ALLPATHS-LG. Handling repetitive sequences: the algorithm is more resilient to repeats. Error correction: for every 24-mer, the algorithm examines the stack of all reads containing that 24-mer; the incidence of incorrect error correction was reduced. Use of jumping data: the algorithm can work even with such data; it trims bases beyond junction points and treats each read pair as belonging to one of two possible distributions. Efficient memory usage: it can assemble human and mouse genomes on commercial servers (48 processors and 512 GB RAM) in a few weeks (3 weeks for mouse and 3.5 weeks for human).
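
The idea of examining a "stack" of reads that share a k-mer can be illustrated with a toy majority vote. This is only a sketch under simplifying assumptions and not the ALLPATHS-LG error-correction algorithm: it uses k = 4 instead of 24, ignores quality scores and reverse complements, and the function name `stack_disagreements` and its `min_support` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def stack_disagreements(reads, kmer, min_support=3):
    """Flag bases that disagree with the column consensus of a k-mer stack.

    Returns a list of (read_index, position_in_read, observed, consensus).
    """
    k = len(kmer)
    # The stack: every (read index, offset) at which the k-mer occurs.
    stack = [(i, off)
             for i, read in enumerate(reads)
             for off in range(len(read) - k + 1)
             if read[off:off + k] == kmer]

    # Align reads on the shared k-mer and count bases per column.
    columns = defaultdict(Counter)
    for i, off in stack:
        for pos, base in enumerate(reads[i]):
            columns[pos - off][base] += 1

    flagged = []
    for i, off in stack:
        for pos, base in enumerate(reads[i]):
            consensus, votes = columns[pos - off].most_common(1)[0]
            # Only trust a consensus backed by enough reads.
            if base != consensus and votes >= min_support:
                flagged.append((i, pos, base, consensus))
    return flagged


# Toy reads; the last one carries a single substitution at position 6.
reads = ["ACGTACGGA", "CGTACGGAT", "GTACGGATC", "ACGTACTGA"]
print(stack_disagreements(reads, "GTAC"))
```

Requiring a minimum number of supporting reads is one simple way to avoid "correcting" a base on thin evidence, which is in the spirit of the reduced incidence of incorrect error correction mentioned above.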

12 Results – Uncertainty in Assemblies
The goal of assembly is to reconstruct the genome as accurately as possible. However, in some locations the data may be compatible with more than one solution. Rather than making an arbitrary choice (and introducing errors), the algorithm retains the alternatives. ALLPATHS-LG generates an assembly graph whose edges are sequences and whose branches represent alternate choices, for example: ATC{A,T}GGTTTTTTT{T,TT}ACAC. Variant Call Format (.VCF file).
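
To make the brace notation concrete, here is a minimal sketch that expands a string such as ATC{A,T}GGTTTTTTT{T,TT}ACAC into every linear sequence it is compatible with. It assumes un-nested braces with comma-separated alternatives, which is all the slides show; the function name `expand_alternatives` is hypothetical and is not part of ALLPATHS-LG.

```python
import itertools
import re


def expand_alternatives(assembly):
    """List every sequence encoded by a string with {x,y} alternatives."""
    # Splitting on a capturing group keeps the brace contents:
    # even indices are fixed sequence, odd indices are alternative lists.
    parts = re.split(r"\{([^{}]*)\}", assembly)
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            choices.append([part])
        else:
            choices.append([alt.strip() for alt in part.split(",")])
    return ["".join(combo) for combo in itertools.product(*choices)]


for sequence in expand_alternatives("ATC{A,T}GGTTTTTTT{T,TT}ACAC"):
    print(sequence)
```

Two sites with two alternatives each yield four candidate sequences. The count multiplies with every ambiguous site, which is one reason to keep alternatives in a compact graph form instead of enumerating them.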

13 Results – Uncertainty in Assemblies (cont’d)
NOTE: The current version of ALLPATHS-LG captures only single-base and simple-sequence indel uncertainties. Better ways to capture alternatives are needed; many alternatives are still lost in the current version, giving rise to errors. It would be desirable to assign probabilities to each alternative.

14 Results – Human & Mouse Assemblies
The resulting genome assemblies provide good coverage of the human and mouse genomes. The ALLPATHS-LG assemblies were compared with previously published assemblies: capillary-based sequencing and SOAP (massively parallel sequencing).

15 Results – Human & Mouse Assemblies (cont’d)

16 Results – Human Genome
N50 contig length of 24 kb and scaffold length of 11.5 Mb. Contiguity is >4-fold higher than with the SOAP algorithm. Connectivity is >25-fold higher than with the SOAP algorithm. Assembled sequence contains 91.1% of the reference genome (SOAP: 74.3%). Assembled sequence contains 95.1% of the reference genome (SOAP: 81.2%). Results are similar to capillary-based assemblies.
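
For reference, N50 is the length L such that contigs (or scaffolds) of length L or longer contain at least half of all assembled bases. A minimal sketch of the computation follows; the lengths in the example are made up for illustration and are not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest piece down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Made-up contig lengths: total 20, and 6 + 5 = 11 covers at least half.
print(n50([2, 3, 4, 5, 6]))  # 5
```

Because N50 is weighted by length, a few long scaffolds dominate it, which is why it is used as a measure of contiguity and connectivity instead of the mean length.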

17 Results – Human Genome (cont’d)
Local assembly error: 3.5% (capillary: 4.1%; SOAP: 6.2%). Long-range accuracy: 99.1% (capillary: 99.7%; SOAP: %).

18 Results – Mouse Genome
Results are broadly similar for the mouse genome. N50 contig length of 16 kb and scaffold length of 7.2 Mb. Connectivity is >20-fold higher than with the SOAP algorithm. These values approach the capillary results (contig: 25 kb, scaffold: 16.9 Mb). Assembled sequence contains 88.7% of the reference genome (capillary: 94.2%). Assembled sequence contains 96.7% of the reference genome (capillary: 97.3%). Results are considerably better than SOAP.

19 Results – Mouse Genome (cont’d)
Local assembly error: 3.0% (capillary: 2.7%; SOAP: 14.2%). Long-range accuracy: 99.0% (capillary: 99.1%; SOAP: %).

20 Results – Segmental Duplications
Segmental duplications pose a challenge. The ALLPATHS-LG assemblies (human and mouse) cover only ~40% of segmental duplications (capillary: 60%; SOAP: 12%). NOTE: Clearly, additional work is needed here.

21 Results – Understanding Gaps
Roughly three quarters of the gaps are captured; the remaining gaps are not spanned. The majority of the sequence within the gaps consists of repetitive elements (61.9% of gaps). For mouse, LINE elements are the major contributors to gaps; for human, LINE and SINE elements.

22 Discussion
High-quality vertebrate genomes have provided an essential foundation for comparative analysis of the human genome, but they cost tens of millions of dollars each to generate with capillary-based sequencing. In this work, ALLPATHS-LG was presented, lowering the cost ~1000-fold.

23 Discussion (cont’d)
ALLPATHS-LG offers good long-range connectivity, good accuracy, and good coverage relative to capillary-based sequencing, and it is better than SOAP. The quality of the assemblies is considerably better: scaffolds are >25 times longer.

24 Discussion (cont’d)
ALLPATHS-LG is anticipated to yield even better results in improved versions. ALLPATHS-LG introduced a preliminary syntax for expressing alternatives: TTTT{T, TT}. Computational hardware requirements: SOAP is faster (it takes 3 days) but its accuracy is low; ALLPATHS-LG is slower but produces high-quality assemblies. ALLPATHS-LG is anticipated to be sped up with algorithmic improvements (bearing in mind the trade-off between speed and accuracy).

25 Thank you. Questions?

