Presentation is loading. Please wait.

Presentation is loading. Please wait.

Advisors: Dr. Qunfeng Dong(CGB) Prof. Haixu Tang(SOI) INDIANA UNIVERSITY, BLOOMINGTON DEVELOPING A COMPUTATIONAL METHOD TO HELP SOLVE THE SEQUENCING GAP.

Similar presentations


Presentation on theme: "Advisors: Dr. Qunfeng Dong(CGB) Prof. Haixu Tang(SOI) INDIANA UNIVERSITY, BLOOMINGTON DEVELOPING A COMPUTATIONAL METHOD TO HELP SOLVE THE SEQUENCING GAP."— Presentation transcript:

1 Advisors: Dr. Qunfeng Dong(CGB) Prof. Haixu Tang(SOI) INDIANA UNIVERSITY, BLOOMINGTON DEVELOPING A COMPUTATIONAL METHOD TO HELP SOLVE THE SEQUENCING GAP CLOSURE PROBLEM Vivek Krishnakumar M.S. Bioinformatics CAPSTONE

2 A Typical Sequencing Project Sequencing Assembly Finishing Annotation Steps in a whole genome shotgun sequencing project Claire M. Fraser, Jonathan A. Eisen and Steven L. Salzberg. Microbial genome sequencing. Nature 406, 799-803(2000)

3 Genome Assembly  Putting together a large number of short DNA sequences called reads to create a representation of the original genome.  Relies on the assumption that reads sharing the same string of letters, originated from the same place on the genome  Ideally, at an 8-10 fold coverage, most of the genome will be assembled into a small number of contigs (around 5 for a 1Mbp genome) 1 1. Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis, Genomics 2(3): 231-239 (1988)

4 Assembly Algorithms  Assembly is a complex computational problem (NP-hard)  Several approximation algorithms to solve this problem.  Types  Simple greedy algorithms  Graph based algorithms Overlap-layout-consensus (Phrap, TIGR, VCAKE, etc.) Eulerian Path (Newbler, Euler, ALLPATHS, etc.)

5 Assembly Algorithms – Graph based  k-mers are computed  k-mers are represented by edges of a deBruijn graph  reads are represented as paths through k-mers  reconstructing the genome involves finding the path that makes use of all the edges Overlap-layout- consensus 1 Eulerian Path 2  sequences represented as nodes  edges correspond to read overlap  use the overlap between sequence reads to create a link between them  contigs are eventually formed by reading along the links as far as possible Daniel MacLean, Jonathan D. G. Jones & David J. Studholme. Application of 'next-generation' sequencing technologies to microbial genetics. Nature Reviews Microbiology 7, 287-296 (2009) 1. J.D. Kececioglu and E.W. Myers, Combinatorial Algorithms for DNA Sequence Assembly, Algorithmica,vol. 13, 1995, pp. 7-51. 2. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. PNAS August 14, 2001 vol. 98 no. 17 9748-9753

6 Finishing (Post-assembly)  No matter the type of assembly algorithm, in practice, the assemblers produce not one, but hundreds or thousands of unordered contiguous sequences. Reasons:  no or low coverage regions  repeats  sequencing errors  The task of gap closure consists of several steps:  Orienting the contigs  Ordering the contigs  Gap closure

7 Gap-closure problem C F A E B DG Un-ordered & un-oriented contigs ACBD F G E AC Ordered & oriented Gap-closing (by PCR or walking)

8 Existing methods for contig ordering (Reference genome based) For biologists, it is imperative to move beyond the assembly step and elucidate the order of the contigs for effective gap-closure. Existing computational methods provide the biologists with tools to identify the order of the contigs, which enables them to move that much closer to reconstructing the genome Projector 1 Projector 2 2 CAAT-Box or Contigs-Assembly and Annotation Tool-Box 3 Pheromone trail-based genetic algorithm 4 PGAAS or Prokaryotic Genome Assembly Assistant System 5 As observed, the existing methods are based on similar working principles, i.e. they assume that there are reference genomes which are closely related to the organism in consideration 1. van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector: automatic contig mapping for gap closure purposes. Nucleic Acids Res. 2003 Nov 15;31(22):e144. 2. van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W560-6 3. Frangeul L, Glaser P, Rusniok C, Buchrieser C, Duchaud E, Dehoux P, Kunst F. CAAT-Box, Contigs-Assembly and Annotation Tool-Box for genome sequencing projects. Bioinformatics. 2004 Mar 22;20(5):790-7. Epub 2004 Jan 29. 4. Fangqing Zhao, Fanggeng Zhao, Tao Li and Donald A. Bryant. A new pheromone trail-based genetic algorithm for comparative genome assembly. Nucleic Acids Research, 2008, Vol. 36, No. 10 3455-3462 5. Yu Z, Li T, Zhao J, Luo J. PGAAS: a prokaryotic genome assembly assistant system. Bioinformatics. 2002 May;18(5):661-5.

9 Rationale Post-sequencing and assembly of a novel genome (one with no known nearest phylogenetic neighbor), the biologists are faced with a set of unordered contigs It is not feasible to run gap closing Polymerase Chain Reactions (PCR) for each possible pair of contigs Thus, we propose a computational method, employing graph theoretical concepts, to identify non-repeat contig neighborhoods (determining possible pairs of contigs and the corresponding gap sizes)

10 Proposed method C F A E B DG CA AC FA AF GD DG Exhausting all possible pairs of contigs for gap- closing PCR C REPEAT E G BF D Using the available data, employing a graph theoretical approach, we generate probable non- repeat contig neighborhoods contigs connected directly with each other connected through repeat contigs Set of unordered, un-oriented contigs generated after assembly

11 Graph Theory Graph theory is a mathematical/computational concept of studying graphs. It is used to model pair-wise relationships between objects from a certain collection of such objects. Technically, a graph consists of:  nodes/vertices (representing the object) and  edges (representing the connection between two objects) – they are either directed or undirected. For our study, we explored the use of the shortest path algorithm (Dijkstra’s Algorithm) node undirected edge directed edge

12 Simulation studies Before analyzing the real world data, we conducted a simulation study to explore a method for identifying possible pairs of contigs Escheirichia coli 536 genome used for simulation GENOME CONTIG 1CONTIG N+1 GAP 1GAP N Generate reads at different coverages (10x, 20x, 30x). Introduce N random length gaps into the genome, producing N+1 contigs CONTIG 2 GAP 2 CONTIG N METHOD 1 CONTIG 1CONTIG N+1 GAP 1GAP N Splice the genome with N copies of the same DNA sequence generate reads at different coverages CONTIG 2 GAP 2 CONTIG N METHOD 2 Repeat 1Repeat 2Repeat N

13 Classifying the reads From this population of reads we classify and extract two types of reads: Reads that belong to the gap regions Reads that partially overlap with the edges of the contigs CONTIG 1CONTIG N+1 GAP 1GAP N CONTIG 2 GAP 2 CONTIG N CONTIG 1 GAP 1 CONTIG 2 Map the shortest path from a suffix to a prefix through a set of junction reads and unassembled reads suffixprefix

14 Running the simulation Junction Reads Reads from the gap region Contig edges (suffix|prefix) Large-scale sequence alignment using vmatch Traverse the graph and find the shortest path between a suffix and prefix Compute all possible overlaps. Generate an overlap graph where the reads form the nodes and the overlaps form the edges. suffix prefix suffix prefix

15 Results – METHOD 1 Number of gaps introduced: 10 Number of contiguous sequences: 11 (C1 through C11) Simulated read lengths: 400 bpCoverage: 10x, 20x and 30x All computed paths (at 20x coverage):

16 Results – METHOD 2 (Repeat-induced gaps) Start NodeEnd Node suffix_C2prefix_C2 suffix_C2suffix_C1 suffix_C2prefix_C3 suffix_C1prefix_C2 suffix_C1prefix_C3 prefix_C2prefix_C3 Number of repeat-induced gaps introduced: 2 Number of contiguous sequences: 3 (C1, C2 and C3) Simulated read lengths: 400 bpCoverage: 10x, 20x and 30x All computed paths (at 30x coverage): Repeat 1Repeat 2C1C2C3

17 Exploring the real world data Eventually we want to help the biologists develop PCR primers to close the gaps between the contigs in the genome. For this study, we obtained assembly data from Prof. Yves Bruns’ Lab in the Biology department for the organism Brevundimonas diminuta. Genome sequencing method: 454 (without paired ends) Sequence Assembly program:Newbler We intend to derive information from the.ace file produced by the Newbler assembler

18 Assembly data statistics Brevundimonas diminuta STATISTICS Total Number Of Reads613425 Number of contigs476 Length of contigs3325300 Avg. Length of contigs6985 ± 10125 Largest contig77275 bp Smallest contig102 bp Number of N50 contigs61 Length of N50 contigs14580

19 .ace file format .ace file created by the Newbler assembler contains the following assembly information: 1. the contig sequences 2. quality values 3. read alignment information. 4. contig linking information AS 476 537972 CO contig00001 755 63 28 U TTAC**AGCT*CCC*GCC*A**GGTC*TT*G*CCGG*TCATGCCTTCATA GTTGGTCGCGTGATAGACGCCGCGCGCCACGGCGCGGGCCAGGACGTCGG BQ 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 AF EYKMWLY02IBXNT.12-1.fm55 C -58 AF EYKMWLY01AHEVV.94-111.fm55 U -92 AF EYKMWLY01E0SVI.79-107.fm55 U -77 AF EYKMWLY01BQQ0K.153-181.fm55 U -151 AF EYKMWLY02GTKZH U 377 AF EYKMWLY01AUK2G U 466 AF EYKMWLY02IS15X C 466 AF EYKMWLY02IYK7T.265-2 C 485 AF EYKMWLY02IT5PF C 511 AF EYKMWLY01B4PII.51-1 C 500 AF EYKMWLY01D8FPB C 520 AF EYKMWLY02HKKY C 600 RD EYKMWLY01AHEVV.94-111.fm55 117 0 0 GGGGCTGAAGGCGCTGAGCGAGACGCTGCCGGTCGCGCAGGCGGTGCCGG CCGGGCGCGCCGATTTAATCAACCCTCTCCCGGCGGGAGAGGGTTAC**A GCTACCC*GCC*A**GG

20 Inferring information from the.ace file  From the.ace file, we extract information that suggests which reads a particular contig shares with which other contig(s) in the assembly CO contig00016 10537 1778 1241 U AF EYKMWLY02JASI2.1-151.to428 U 10379 AF EYKMWLY01CDRT0.1-130 U 10400 AF EYKMWLY02IF22Q.1-105 U 10426 AF EYKMWLY02HA9IP.1-105 U 10426 AF EYKMWLY01CV3A4.1-107 U 10426 AF EYKMWLY01BNJV1.1-91.to428 U 10440 AF EYKMWLY01D4P1Q C 10445 AF EYKMWLY02FFW9C.1-81 U 10450 AF EYKMWLY01C1SKF.1-75 U 10457 AF EYKMWLY01A3C1V.1-68 U 10465 AF EYKMWLY01CCJHV.1-66.to428 U 10465 AF EYKMWLY01D7WLL.1-52.to428 U 10480 AF EYKMWLY01CPNZS.1-49.to428 U 10484 AF EYKMWLY01CBBAK.216-196.to428 C 10508 AF EYKMWLY01DWD7X.196-176.to428 C 10506 AF EYKMWLY01A27AI.239-214.to428 C 10510 AF EYKMWLY01A14U9.208-187.to428 C 10513 AF EYKMWLY01A4PVY.242-220.to428 C 10513 CO contig00428 21606 2746 2071 U AF EYKMWLY01CPNZS.73-75.fm16 U -71 AF EYKMWLY01D7WLL.76-83.fm16 U -74 AF EYKMWLY02JASI2.175-195.fm16 U -173 AF EYKMWLY01CCJHV.90-117.fm16 U -88 AF EYKMWLY01BNJV1.115-149.fm16 U -113 AF EYKMWLY02JIVGZ.41-1.fm16 C -23 AF EYKMWLY02HY7R3.48-1.fm16 C -25 AF EYKMWLY02FW7CZ.58-1.fm16 C -27 AF EYKMWLY01B9IJ6 U 1 AF EYKMWLY01D4UXN.80-1.fm16 C -34 AF EYKMWLY01DBA88.146-1 C -93 AF EYKMWLY02HKXAA.152-1 C -96 AF EYKMWLY01DWD7X.152-1.fm16 C -47 AF EYKMWLY01B9SVF.62-221 U -60 AF EYKMWLY01A14U9.163-1.fm16 C -44 AF EYKMWLY01CBBAK.172-1.fm16 C -45 AF EYKMWLY01A27AI.190-1.fm16 C -48 AF EYKMWLY01A4PVY.196-1.fm16 C -45

21 Classification of contigs  Based on the number of recorded connections, we try to classify the contigs into so-called:  Repeat contigs; and  Non-repeat contigs  After classification, we go on to build non-repeat contig neighborhoods (pairs of non-repeat contigs connected through one or more repeat contigs) CONTIG XCONTIG YREPEATCONTIG XCONTIG YREPEAT AREPEAT B

22 Results Total number of contigs In the dataset 476 After classification: Repeat contigs 31 Non-repeat contigs 445 Number of non-repeat contig pairs (through one or more repeat contigs) 184

23 Generated networks

24 Output from this method  On analysis of the assembly data, biologists are provided with the output in the following format suffix_C466 prefix_C4111269 suffix_C20prefix_C439 833 suffix_C20 prefix_C18 833 suffix_C34 prefix_C81442 suffix_C466 prefix_C514095 suffix_C39prefix_C425 1532 suffix_C42prefix_C17 1532 suffix_C457 prefix_C18 2087  This output can be easily plugged into a primer design program such as Primer3 for developing unique PCR primers for each possible pair

25 Summary  Using this method, we were able to generate probable non- repeat contig neighborhoods that can be effectively utilized by biologists for gap-closing by PCR  Limitations  The simulation studies conducted represent simple cases of the gap closure problem. In reality, we may encounter more complicated repeat structures that cannot be fully resolved by the simple method proposed here. We will conduct more simulation experiments that resemble the real-world sequencing data.  For the contigs that do not possess any connectivity information, further directed experiments can be conducted to close the physical gaps  Future Work  The generated contig-neighborhoods can be aligned to the available optical mapping data to improve the assembly

26 References  Claire M. Fraser, Jonathan A. Eisen and Steven L. Salzberg. Microbial genome sequencing. Nature 406, 799-803(2000)  Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis, Genomics 2(3): 231-239 (1988)  Daniel MacLean, Jonathan D. G. Jones & David J. Studholme. Application of 'next-generation' sequencing technologies to microbial genetics. Nature Reviews Microbiology 7, 287-296 (2009)  J.D. Kececioglu and E.W. Myers, Combinatorial Algorithms for DNA Sequence Assembly, Algorithmica,vol. 13, 1995, pp. 7-51.  Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. PNAS August 14, 2001 vol. 98 no. 17 9748-9753  van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector: automatic contig mapping for gap closure purposes. Nucleic Acids Res. 2003 Nov 15;31(22):e144.  van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W560-6  Frangeul L, Glaser P, Rusniok C, Buchrieser C, Duchaud E, Dehoux P, Kunst F. CAAT-Box, Contigs-Assembly and Annotation Tool-Box for genome sequencing projects. Bioinformatics. 2004 Mar 22;20(5):790-7. Epub 2004 Jan 29.  Fangqing Zhao, Fanggeng Zhao, Tao Li and Donald A. Bryant. A new pheromone trail-based genetic algorithm for comparative genome assembly. Nucleic Acids Research, 2008, Vol. 36, No. 10 3455-3462  Yu Z, Li T, Zhao J, Luo J. PGAAS: a prokaryotic genome assembly assistant system. Bioinformatics. 2002 May;18(5):661-5.  Mark J. Chaisson1 and Pavel A. Pevzner. Short read fragment assembly of bacterial genomes. Genome Res. 2008. 18: 324-330

27 Acknowledgments  Dr. Qunfeng Dong  Prof. Haixu Tang  Prof. Rajarshi Guha  Jeong-Hyeon (Justin) Choi – Genomics  Prof. Yves Bruns’ Lab  Colleagues at CGB  Faculty of the School of Informatics  CGB Computing Team – For the technical support and resources  Linda Hostetter & Rachel Lawmaster

28 THANK YOU!


Download ppt "Advisors: Dr. Qunfeng Dong(CGB) Prof. Haixu Tang(SOI) INDIANA UNIVERSITY, BLOOMINGTON DEVELOPING A COMPUTATIONAL METHOD TO HELP SOLVE THE SEQUENCING GAP."

Similar presentations


Ads by Google