Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Song Gao, Niranjan Nagarajan, Wing-Kin Sung, National University of Singapore.


1 Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences
Song Gao, Niranjan Nagarajan, Wing-Kin Sung
National University of Singapore, Genome Institute of Singapore

2 Outline
➢ Overview
Methods
- 1. Pre-Processing
- 2. A Special Case
- 3. Full Algorithm
- 4. Graph Contraction
- 5. Gap Estimation
Results
Ongoing Work

3 [Figure labels: Biological Entity (Genome, Transcripts, Microbial Community); Sequencing Machine; Reads (ACGTTTAACAGG…, TTACGATTCGATGA…, GCCATAATGCAAG…, CTTAGAATCGGATAGAC…, AGGCATAGACTAGAG…); Analysis; Data Entity (Genomic Sequence, Transcript Assembly, Metagenome)]

4 Sequence Assembly
[Figure labels: Reads, Contigs, Scaffolds; Paired-end Reads; (I), (II)]
Related Research Works
Contig Level
- OLC Framework: Celera Assembler [Myers et al, 2000], Edena [Hernandez et al, 2008], Arachne [Batzoglou et al, 2002], PE Assembler [Ariyaratne et al, 2011]
- De Bruijn Graph: EULER [Pevzner et al, 2001], Velvet [Zerbino et al, 2008], ALLPATHS [Butler et al, 2008], SOAPdenovo [Li et al, 2010]
Scaffold Level
- Comparative Assembly: AMOScmp [Pop, 2004], ABBA [Salzberg, 2008]
- Embedded Module: EULER [Pevzner et al, 2001], Arachne [Batzoglou et al, 2002], Celera Assembler [Myers et al, 2000], Velvet [Zerbino, 2008]
- Standalone Module: Bambus [Pop et al, 2004], SOPRA [Dayarian et al, 2010]

5 Scaffolding Problem [Huson et al, 2002]
Value Addition
- Gap Filling: GapCloser Module of SOAPdenovo
- Repeat Resolution
- Long-Range Genomic Structure
[Figure labels: 1k, 3k, 2.5k; Discordant Read; Paired-end Read; Scaffold; Contig]
* Huson, D. H., Reinert, K., Myers, E. W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)

6 Data
➢ Sequencing Errors
➢ Read Length
➢ Coverage
Analysis
- Long Insert vs. Long Read [Chaisson, 2009; Zerbino, 2009]
- Statistics of Assembled Genomes [Schatz et al, 2010]

Organism    Genome Size  # of Contigs  Contig N50  # of Scaffolds  Scaffold N50
Grapevine   500Mb        58,611        18.2kb      2,093           1.33Mb
Panda       2.4Gb        200,604       36.7kb      81,469          1.22Mb
Strawberry  220Mb        16,487        28.1kb      3,263           1.44Mb
Turkey      1.1Gb        128,271       12.6kb      26,917          1.5Mb

* N50: Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L >= N.
* Zerbino, D.R.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS ONE 4(12) (2009)
* Chaisson, M.J., Brinza, D., Pevzner, P.A.: De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Research 19, 336-346 (2009)
* Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Research 20(9), 1165-1173 (2010)
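The N50 definition on this slide can be made concrete with a short sketch. The function name `n50` is illustrative and not taken from the talk; it returns the largest length N such that sequences of length at least N hold half of all bases.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half of all bases are covered.
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")
```

For example, for lengths 2 through 10 (54 bases in total), the sequences of length 10, 9 and 8 together hold 27 bases, so the N50 is 8.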

7 NP-Complete [Huson et al, 2002]

8 Heuristic Methods
- Celera Assembler [Myers et al, 2000]
- Euler [Pevzner et al, 2001]
- Jazz [Chapman et al, 2002]
- Arachne [Batzoglou et al, 2002]
- Velvet [Zerbino et al, 2008]
- Bambus [Pop et al, 2004]
"True Complexity"
- Phase transition based on parameters [Hayes, 1996]: 3-SAT Problem
- Parametric Complexity [Downey et al, 1999]: Vertex Cover Problem, fixed-parameter tractability
* Hayes, B.: Can't get no satisfaction. American Scientist 85, 108-112 (1996)
* Downey, R. G., et al.: Parameterized Complexity: A Framework for Systematically Confronting Computational Intractability. DIMACS, Vol. 49 (1999)

9 Outline
Overview
➢ Methods
- 1. Pre-Processing
- 2. A Special Case
- 3. Full Algorithm
- 4. Graph Contraction
- 5. Gap Estimation
Results
Ongoing Work

10 1. Pre-Processing
Paired-end Reads -> Clusters [Huson et al, 2002]
Chimeric Noise
- Filtered by simulation
* Upper Bound of Paired-end Reads 3
[Figure label: Chimera]
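A minimal sketch of the read-to-cluster step, under stated assumptions: read pairs linking the same two contigs in the same orientation are merged with inverse-variance weighting, and clusters with too few supporting pairs are dropped as probable chimeric noise. The `min_support` threshold and the tuple layout are hypothetical; the slide says only that the noise bound is obtained by simulation.

```python
from collections import defaultdict
from math import sqrt

def bundle_links(links, min_support=2):
    """Merge paired-end links into clusters.

    links: iterable of (contig_a, contig_b, orientation, distance, std_dev).
    Returns {(contig_a, contig_b, orientation): (mean, std_dev, support)}.
    """
    groups = defaultdict(list)
    for a, b, orientation, dist, sd in links:
        groups[(a, b, orientation)].append((dist, sd))

    clusters = {}
    for key, members in groups.items():
        if len(members) < min_support:
            continue  # assumed noise filter: too little support
        q = sum(1.0 / sd ** 2 for _, sd in members)
        p = sum(dist / sd ** 2 for dist, sd in members)
        # Inverse-variance weighted mean and its standard deviation.
        clusters[key] = (p / q, 1.0 / sqrt(q), len(members))
    return clusters
```

Two links of 1000 and 1200 bases between A and B, each with standard deviation 100, merge into one cluster with mean 1100 and a smaller standard deviation of about 70.7; a lone link between A and C is discarded.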

11 2. A Special Case
No discordant clusters in final scaffold
Naïve Solution
- Exponential Time
[Figure: search tree over contigs A, B, C, D with nodes +A; +A+B, +A-B, +A+C, +A-C; +A+B+C, +A+B-C, +A-C+B, +A-C-B; …]
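To see why the naive search takes exponential time, the following sketch enumerates every ordering of the contigs with every orientation, using the same +A-C+B notation as the slide. The function names are illustrative. The count includes each scaffold together with its reverse complement, which an actual scaffolder would treat as equivalent.

```python
from itertools import permutations, product
from math import factorial

def signed_orderings(contigs):
    """Yield every ordering of the contigs with every choice of orientation."""
    for order in permutations(contigs):
        for signs in product("+-", repeat=len(order)):
            yield "".join(sign + name for sign, name in zip(signs, order))

def count_signed_orderings(n):
    """Closed form: n! orderings times 2**n orientation choices."""
    return factorial(n) * 2 ** n
```

With only three contigs there are already 48 candidates, and with ten there are more than 3.7 billion, which is why the talk moves on to dynamic programming.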

12 Dynamic Programming
- Scaffold Tail is Sufficient
- Analogous to Bandwidth Problem [Saxe, 1980]
[Figure labels: Orientation of Nodes; Direction of Edges; Discordant Edges; …; width (w); Upper Bound]
* Saxe, J.: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods 1(4), 363-369 (1980)

13 Equivalence class of scaffolds
S1 and S2 have the same tail -> They are in the same class
Feature of equivalence class:
- Use of the same set of contigs;
- All or none of them can be extended to a solution
[Figure labels: Tail; +A-B+C +D+E; -A+C +D+E+F; …]
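A rough sketch of how partial scaffolds might be grouped into equivalence classes. The slides do not define the tail precisely, so this assumes the tail is the shortest suffix whose contigs span at least the library upper bound `w`, and that the class key also records the set of contigs used. All names here are hypothetical.

```python
from collections import defaultdict

def tail(scaffold, lengths, w):
    """Shortest suffix of the scaffold whose contigs span at least w bases.

    scaffold: list of (sign, contig_name); lengths: {contig_name: length}.
    """
    total = 0
    for i in range(len(scaffold) - 1, -1, -1):
        total += lengths[scaffold[i][1]]
        if total >= w:
            return tuple(scaffold[i:])
    return tuple(scaffold)  # the whole scaffold is shorter than w

def group_by_tail(scaffolds, lengths, w):
    """Group partial scaffolds by (set of contigs used, tail)."""
    classes = defaultdict(list)
    for scaffold in scaffolds:
        used = frozenset(name for _, name in scaffold)
        classes[(used, tail(scaffold, lengths, w))].append(scaffold)
    return classes
```

Under these assumptions, a dynamic program needs to keep only one representative per class, since every member of a class can be extended to a solution or none can.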

14 3. Full Algorithm
Consider discordant clusters
- Equivalence Class
- Number of Discordant Edges (p)
[Figure labels: Chimeric Reads; ACCAAAATTT / ACCAAGAATTT; Sequencing Errors; CTAGAA / CAAGAA ?; Mapping Errors]

15 4. Graph Contraction 20k

16 4. Graph Contraction

17

18 5. Gap Estimation
Utility
- Genome finishing (Genome Size Estimation)
- Scaffold Correctness
Calculate Gap Sizes
- Maximum Likelihood
- Quadratic Function
- Solved through quadratic programming [Goldfarb et al, 1983]
- Polynomial Time
[Figure labels: g1, g2, g3; μ, σ]
* Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming 27 (1983)
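A sketch of why maximum-likelihood gap sizing leads to a quadratic function, assuming each cluster linking contigs i and j gives a Gaussian distance estimate with mean μ and standard deviation σ. The negative log-likelihood is then a weighted sum of squared differences between μ and the predicted distance (the gaps plus the contigs in between). The code below solves the unconstrained normal equations by Gaussian elimination; it is not the Goldfarb-Idnani dual method cited on the slide, and the function name and input layout are hypothetical.

```python
def estimate_gaps(contig_lengths, links):
    """Weighted least-squares gap sizes for one scaffold.

    contig_lengths: lengths of contigs 0..n in scaffold order.
    links: (i, j, mu, sigma) with i < j, where mu estimates the distance
    from the end of contig i to the start of contig j.
    Gap k lies between contig k and contig k + 1.
    """
    n = len(contig_lengths) - 1
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for i, j, mu, sigma in links:
        weight = 1.0 / sigma ** 2
        # Subtract the contigs strictly between i and j from the estimate.
        target = mu - sum(contig_lengths[i + 1:j])
        for r in range(i, j):
            b[r] += weight * target
            for c in range(i, j):
                A[r][c] += weight
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("gap sizes are not uniquely determined by the links")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    gaps = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(A[r][c] * gaps[c] for c in range(r + 1, n))
        gaps[r] = (b[r] - rest) / A[r][r]
    return gaps
```

For three contigs of length 1000, 500 and 800 with links (0, 1, 200), (1, 2, 300) and (0, 2, 1000), all with the same σ, the estimates agree exactly and the sketch returns gaps of 200 and 300.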

19 Outline
Overview
Methods
- 1. Pre-Processing
- 2. A Special Case
- 3. Full Algorithm
- 4. Graph Contraction
- 5. Gap Estimation
➢ Results
Ongoing Work

20 Runtime Comparison

          E. coli ◆  B. pseudomallei ★  S. cerevisiae ◆  D. melanogaster ◆
Bambus    50s        16m                2m               3m
SOPRA     49m        -                  2h               5h
Opera     4s         7m                 11s              30s

Coverage of 300bp insert library: >20X
Coverage of 10kbp insert library: 2X
Contigs assembled using Velvet
◆ Simulated data set using MetaSim
★ In house data

21 Scaffold Contiguity

22 Scaffold Correctness

23 Scaffold Correctness (columns: E. coli, S. cerevisiae, D. melanogaster) Opera 134 Bambus 1955423

24 Ongoing Work
A Rodent Genome
          Genome Size  N50
Opera     ~2Gbp        765.5Kbp
SSpace                 281.7Kbp
A Tree Genome
          Genome Size  N50       Max Length
Opera     ~300Mbp      209.9Kbp  921.8Kbp

25 Ongoing Work
- Repeats
- Lower bounds and better scaffold
- Multiple Libraries
Other applications
- Metagenomics
- Cancer Genomics
Link: https://sourceforge.net/projects/operasf/

26 Acknowledgement
- Wing-Kin Sung
- Niranjan Nagarajan
- Pramila N. Ariyaratne
Funding:
- A*STAR of Singapore
- Ministry of Education, Singapore
- NUS Graduate School for Integrative Sciences and Engineering (NGS)
Questions?

