
1 Improving the Accuracy of Genome Assemblies. July 17th, 2012. Roy Ronen (1,*), Christina Boucher (1,*), Hamidreza Chitsaz (2) and Pavel Pevzner (1). (1) University of California, San Diego. (2) Wayne State University, Michigan. (*) Contributed equally to this work.

2 ≈ $ billions, ≈ several years, ≈ hundreds of people. ≈ $ thousands, ≈ several weeks, ≈ two people.

3 High Throughput Sequencing Assemblies

4 Draft Genome from HTS. [Pipeline diagram: Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → Analysis.]

5 HTS assemblies (contigs) still contain an abundance of errors: 20-30 substitution errors per 100 kbp with SOAPdenovo; 5-20 substitution errors per 100 kbp with Velvet; small (<50 bp) INDEL errors; misassemblies, large INDELs, etc. [Pipeline diagram repeated.]
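
To give a feel for the per-100 kbp rates above, here is a small arithmetic sketch. The genome size of 4.6 Mbp (roughly that of E. coli) is an assumption chosen for illustration and is not stated on the slide; `expected_errors` is a hypothetical helper name.

```python
def expected_errors(rate_per_100kbp, genome_bp):
    """Scale an error rate given per 100 kbp to a whole genome."""
    return rate_per_100kbp * genome_bp / 100_000

# Assumed genome size for illustration only (about the size of E. coli).
GENOME_BP = 4_600_000

# Ranges quoted on the slide, in substitution errors per 100 kbp.
soapdenovo = (expected_errors(20, GENOME_BP), expected_errors(30, GENOME_BP))
velvet = (expected_errors(5, GENOME_BP), expected_errors(20, GENOME_BP))

print("SOAPdenovo:", soapdenovo)
print("Velvet:", velvet)
```

Under that assumed genome size, the quoted rates correspond to hundreds to over a thousand substitution errors per assembly.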

6 Errors in the assembled contigs will profoundly affect any downstream analysis. [Pipeline diagram repeated.]

7 [Pipeline diagram with SEQuel: Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → SEQuel → Refined Contigs → Analysis.]

8 De Bruijn Graph for Fragment Assembly

9 De Bruijn Graph (Pevzner, Tang, Waterman 2001). [Figure: three 3-mer paths, GCC → CCA → CAT → ATT → TTA; GCC → CCT → CTT → TTT → TTA; CCT → CTA → TAT → ATT.]

10-13 De Bruijn Graph (Pevzner, Tang, Waterman 2001). [Figure, built up over four slides: the same 3-mers, with repeated 3-mers such as GCC and CCT appearing fewer times on each successive slide.]

14 De Bruijn Graph
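
A minimal sketch of the construction pictured on the slides above, assuming a node-centric de Bruijn graph in which nodes are 3-mers and an edge joins 3-mers that are consecutive in a read. The three reads are reconstructed from the 3-mer paths on slide 9 and are an assumption, as are the function names; this is not the authors' code.

```python
def kmers(seq, k):
    """Return the k-mers of seq in left-to-right order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def de_bruijn(reads, k):
    """Build a graph whose nodes are k-mers. Identical k-mers coming from
    different reads collapse into a single node; an edge joins two k-mers
    that are adjacent in at least one read."""
    graph = {}
    for read in reads:
        path = kmers(read, k)
        for node in path:
            graph.setdefault(node, set())
        for left, right in zip(path, path[1:]):
            graph[left].add(right)
    return {node: sorted(successors) for node, successors in graph.items()}

# Reads spelled by the three 3-mer paths on slide 9 (reconstructed).
reads = ["GCCATTA", "GCCTTTA", "CCTATT"]
graph = de_bruijn(reads, 3)

for node in sorted(graph):
    print(node, "->", graph[node])
```

The 14 k-mer occurrences across the three reads collapse into 10 nodes, and GCC and CCT each end up with two outgoing edges.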

15 Challenges

16 [Figure: genome ...GCCTAGGAC...CACTTGGCA...; reads GCCTAGGAC and CACTTGGCA; 3-mer paths GCC → CCT → CTA → TAG → AGG → GGA → GAC and CAC → ACT → CTT → TTG → TGG → GGC → GCA.]

17 Sequencing errors cause bulges in the de Bruijn graph. [Figure: the same genome and graph, with an additional erroneous read GCCTTGGAC alongside GCCTAGGAC and CACTTGGCA; labels shown: CCTT, TGGA, CTTG, TTGA.]

18 Sequencing errors cause bulges in the de Bruijn graph. [Figure: the graph over GCC, CCT, CTA, TAG, AGG, GGA, GAC, CAC, ACT, CTT, TTG, TGG, GGC, GCA with reads GCCTAGGAC, GCCTTGGAC and CACTTGGCA; numbers shown on the graph: 2 2 2 2 3 1 4 4 1 3 3 3 3 3.]

19 Sequencing errors cause bulges in the de Bruijn graph. [Figure: the graph without the CTA, TAG and AGG nodes; numbers shown: 3 1 4 4 1 3 3 3 3 3; resulting sequences ...CACTTGGCA... and ...GCCTTGGAC...]
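
The bulge in these slides can be reproduced with a short sketch: adding the erroneous read GCCTTGGAC to the two correct reads creates branch points in the 3-mer graph. The helper name `branch_nodes` and the choice to report nodes with more than one successor are assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the k-mers of seq in left-to-right order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def branch_nodes(reads, k):
    """Return the k-mers that have more than one distinct successor,
    i.e. the places where the de Bruijn graph forks."""
    successors = {}
    for read in reads:
        path = kmers(read, k)
        for left, right in zip(path, path[1:]):
            successors.setdefault(left, set()).add(right)
    return sorted(n for n, nxt in successors.items() if len(nxt) > 1)

correct_reads = ["GCCTAGGAC", "CACTTGGCA"]
erroneous_read = "GCCTTGGAC"  # GCCTAGGAC with A replaced by T at position 5

without_error = branch_nodes(correct_reads, 3)
with_error = branch_nodes(correct_reads + [erroneous_read], 3)

print("forks without the erroneous read:", without_error)
print("forks with the erroneous read:   ", with_error)
```

With only the correct reads there is no fork. The single substitution adds the edges CCT → CTT and TGG → GGA, so the paths through CTA, TAG, AGG and through CTT, TTG, TGG become two alternative routes between CCT and GGA.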

20 The SEQuel Algorithm

21 [Pipeline diagram with SEQuel: Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → SEQuel → Refined Contigs → Analysis.]

22 The SEQuel Algorithm. Permissively aligned read-pair: a read-pair for which at least one read aligned uniquely. [Figure: read-pairs placed at numbered positions.]
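
The definition above translates into a simple filter. The sketch below assumes each read-pair record carries the number of alignment locations found for each mate; the `ReadPair` class and its field names are hypothetical and do not come from SEQuel or any particular aligner.

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    name: str
    hits_first: int   # hypothetical: alignment locations found for mate 1
    hits_second: int  # hypothetical: alignment locations found for mate 2

def is_permissively_aligned(pair):
    """True when at least one mate of the pair aligned uniquely."""
    return pair.hits_first == 1 or pair.hits_second == 1

pairs = [
    ReadPair("both_unique", 1, 1),
    ReadPair("one_unique_one_repeat", 1, 3),
    ReadPair("one_unique_one_unaligned", 0, 1),
    ReadPair("both_repeat", 2, 5),
    ReadPair("both_unaligned", 0, 0),
]

kept = [p.name for p in pairs if is_permissively_aligned(p)]
print(kept)
```

Under this reading, a pair is kept even when its second mate is ambiguous or unaligned, as long as the other mate has a unique placement.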

23 Positional De Bruijn Graph

24 Positional De Bruijn Graph. Positional k-mer: a pair (k-mer, position), e.g. (GCCA, 111). [Figure: positional k-mer paths (GCC,111) → (CCA,112) → (CAT,113) → (ATT,114) → (TTA,115); (CCT,112) → (CTT,113) → (TTT,114) → (TTA,115); (GCC,975) → (CCT,976) → (CTA,977) → (TAT,978) → (ATT,979).]

25 Positional De Bruijn Graph. [Figure: the positional k-mers of slide 24, with (CCA,112), (CAT,113), (ATT,114) and (ATT,979) shown again.]

26 Positional De Bruijn Graph. [Figure.]
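
A minimal sketch of a positional de Bruijn graph built from the positional 3-mers on slide 24. The gluing rule used here (merge occurrences of the same k-mer whose positions lie within `delta` of each other) and the value `delta = 2` are assumptions for illustration; the slides only define a positional k-mer as a (k-mer, position) pair. The read sequences and start positions are reconstructed from the figure.

```python
from collections import defaultdict

def positional_kmers(read, start, k):
    """Return (k-mer, position) pairs for a read aligned at `start`."""
    return [(read[i:i + k], start + i) for i in range(len(read) - k + 1)]

def glue(pos_kmers, delta):
    """Map each positional k-mer to a representative node. Occurrences of
    the same k-mer are chained into one node while consecutive sorted
    positions differ by at most delta; the node keeps the first position."""
    positions_of = defaultdict(set)
    for kmer, pos in pos_kmers:
        positions_of[kmer].add(pos)
    node_of = {}
    for kmer, positions in positions_of.items():
        anchor, previous = None, None
        for pos in sorted(positions):
            if previous is None or pos - previous > delta:
                anchor = pos  # start a new node for this k-mer
            node_of[(kmer, pos)] = (kmer, anchor)
            previous = pos
    return node_of

# (read, alignment start) pairs reconstructed from slide 24.
aligned = [("GCCATTA", 111), ("CCTTTA", 112), ("GCCTATT", 975)]
K, DELTA = 3, 2  # DELTA is an assumed tolerance

all_pk = [pk for read, start in aligned for pk in positional_kmers(read, start, K)]
node_of = glue(all_pk, DELTA)
nodes = set(node_of.values())

edges = set()
for read, start in aligned:
    path = [node_of[pk] for pk in positional_kmers(read, start, K)]
    edges.update(zip(path, path[1:]))

print(len(nodes), "nodes,", len(edges), "edges")
```

In this sketch ATT at position 114 and ATT at position 979 stay separate nodes, whereas a plain de Bruijn graph would merge them. Only (TTA,115), which occurs in two reads at the same position, is glued.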

27 The SEQuel Algorithm. Partial contig #1: GCCATTA. Partial contig #2: GCCTATT. Original contig: GTATTCCGAGGACCACTGGATTATGA.

28 The SEQuel Algorithm. Contig: GTATTCCGAGGACCACTGGATTATGA.

29-30 The SEQuel Algorithm. Contig (aligned, with gap): GTATTCCGAGGACCAC---TGGATTATGA. Partial contigs: CAAATGGATTACGA and GCGGGCCGAGGA.

31-32 The SEQuel Algorithm. Contig: GCGGGCCGAGGACCAC---TGGATTATGA. Partial contigs: CAAATGGATTACGA and GCGGGCCGAGGA.

33 The SEQuel Algorithm. Contig: GCGGGCCGAGGACCACAAATGGATTACGA. Partial contigs: CAAATGGATTACGA and GCGGGCCGAGGA.

34 The SEQuel Algorithm. Contig: GCGGGCCGAGGACCACAAATGGATTACGA. Repeat for all contigs.
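
The refinement walked through on slides 28 to 34 can be sketched as overwriting the aligned contig with each partial contig. The sketch assumes the alignment has already been computed and simply takes the offsets (0 and 15 in gapped coordinates) read off the slides; `apply_partial_contigs` is a hypothetical name, and the alignment step itself is not shown.

```python
def apply_partial_contigs(gapped_contig, placements):
    """Overwrite the gapped contig with each partial contig at its offset
    (in gapped-alignment coordinates), then drop any leftover gap symbols."""
    refined = list(gapped_contig)
    for offset, partial in placements:
        refined[offset:offset + len(partial)] = partial
    return "".join(refined).replace("-", "")

# Original contig as aligned on slide 29, with a 3-base gap.
gapped_contig = "GTATTCCGAGGACCAC---TGGATTATGA"

# (offset, partial contig) pairs; offsets are read off the slide alignment.
placements = [
    (0, "GCGGGCCGAGGA"),
    (15, "CAAATGGATTACGA"),
]

refined = apply_partial_contigs(gapped_contig, placements)
print(refined)
```

The first partial contig replaces the contig's prefix (slide 31), and the second fills the gap and corrects a substitution near the end (slide 33).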

35 Results. Data: standard and single-cell E. coli; 100 bp paired-end Illumina (GAII) reads; mean coverage ≈ 600x. Assemblies were compared to the reference with and without SEQuel.

36 Standard E. coli

37 Standard E. coli

38 Single Cell Sequencing (Chitsaz et al., 2011). [Figure panels: Standard; Single Cell.]

39 Single Cell E. coli

40 Single Cell E. coli

41 Summary. SEQuel removed 35% to 96% of small-scale assembly errors. Introduced the positional de Bruijn graph for contig refinement. Demonstrated utility in hard (single-cell) assembly. SEQuel can be used in combination with any assembler. Freely available at: http://bix.ucsd.edu/SEQuel

42 Acknowledgments: 3P41RR024851-02S1, CCF-1115206.

