An Eulerian path approach to DNA fragment assembly


1 An Eulerian path approach to DNA fragment assembly
Presented by: Zohar Barak and Dor Alon. Based on: Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748-9753, 2001.

2 What are we going to see
* Background and motivation
* De Bruijn graph
* Eulerian path approach
* Error and data correction
* Eulerian Superpath Problem
* Using DB data
* Results and conclusions

3 Some Terminology
* l-tuples = k-mers = strings of constant size (l for l-tuple, k for k-mer)
* EP or EPP: Eulerian Path or Eulerian Path Problem
* ES or ESP: Eulerian Superpath or Eulerian Superpath Problem
* NM: Neisseria meningitidis, whose sequence data is used for examples throughout the article. It was considered one of the most “difficult-to-assemble” and “repeat-rich” bacterial genomes completed.
* EULER: the assembler that uses the methods discussed in this article

4 Background and motivation – overlap-layout-consensus (OLC)
For the 20 years before the article was published (2001), fragment assembly in DNA sequencing followed the “overlap-layout-consensus” paradigm, which was used in all available assembly tools. The OLC algorithm has 3 steps:
* Overlap: find overlaps of at least some minimum size between reads and build the overlap graph
* Layout: bundle stretches of the overlap graph into contigs
* Consensus: pick the most likely nucleotide sequence for each contig
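
The overlap step can be sketched as follows. This is a minimal illustration, not the method of any particular OLC assembler: it only finds exact suffix-prefix overlaps, and the names `overlap`, `overlap_graph`, and `min_length` are assumptions made for the example.

```python
def overlap(a, b, min_length):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_length exists.
    """
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads, min_length):
    """Map each ordered pair of distinct reads to its overlap length."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_length)
                if k > 0:
                    graph[(a, b)] = k
    return graph


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = overlap_graph(reads, 3)
```

Every pair of reads is compared, which hints at why the graph becomes expensive to build for large read sets.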

5 Problems with OLC
* Huge graph: it consumes a lot of memory and takes a long time to build.
* Long time to assemble: even given a well-built graph, we have to spend a lot of time solving an NP-complete problem, the HamPath problem of finding a Hamiltonian path.
* Problems caused by repeats.

6

7 Many problems encountered in OLC
Suggesting a new approach: an Eulerian path approach. As we saw, OLC has many problems. The next step is to propose a new approach to the problem. The article calls it the Eulerian path approach, and it works by building a de Bruijn graph and finding an Eulerian path in it.

8 Reminder of de Bruijn Graph
* Vertex: a k-mer (x_1, …, x_k)
* Edges (directed): (x_1, …, x_k) → (x_2, …, x_k, x_{k+1})
* For every vertex v: d^+(v) = d^-(v) = |Σ|
* We use the name “de Bruijn graph” also for subgraphs induced by a subset of the vertices
* De Bruijn sequence: an Euler cycle on G; it includes each (k+1)-mer exactly once (each 4-mer when k = 3)

9 An easy example

10 Building the de Bruijn Graph
Given a set of reads S = {s1, …, sn}, define S_l as the set of all l-tuples from S. Given S_l, we can construct the de Bruijn graph G(S_l) with vertex set S_{l−1} (the set of all (l − 1)-tuples from S) as follows: an (l − 1)-tuple v ∈ S_{l−1} is joined by a directed edge with an (l − 1)-tuple w ∈ S_{l−1} if S_l contains an l-tuple whose first l − 1 nucleotides coincide with v and whose last l − 1 nucleotides coincide with w. Each l-tuple from S_l corresponds to an edge in G. We ‘glue’ usages of the same edge (same l-tuple) together. (Speaker note: go over these points.)
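
A minimal sketch of this construction, assuming reads are plain strings over A, C, G, T. The function names are hypothetical, and edge multiplicities are ignored here: each distinct l-tuple gives exactly one edge.

```python
from collections import defaultdict


def l_tuples(read, l):
    """Return every length-l substring of a read, in order of position."""
    return [read[i:i + l] for i in range(len(read) - l + 1)]


def de_bruijn_graph(reads, l):
    """Build G(S_l): vertices are (l-1)-tuples, one edge per distinct l-tuple."""
    spectrum = {t for read in reads for t in l_tuples(read, l)}  # S_l as a set
    graph = defaultdict(set)
    for t in spectrum:
        graph[t[:-1]].add(t[1:])  # prefix (l-1)-tuple -> suffix (l-1)-tuple
    return {v: sorted(ws) for v, ws in graph.items()}


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, 3)
```

Because the spectrum is a set, the l-tuples shared by overlapping reads are glued into a single edge automatically.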

11 Comparison
We can see that, first of all, we got something more compact and an “easier” problem. Each edge represents an l-tuple from some read. If an l-tuple appears k times in a read, we split it into k parallel edges.

12 Eulerian Path Finding
We’ll solve the assembly problem by finding an Eulerian path in the graph. The EP problem can be solved in linear time.
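
One standard linear-time method is Hierholzer's algorithm. The slides do not say which algorithm EULER uses, so the sketch below is only an illustration. It assumes an Eulerian path exists, and the edge list is a made-up example built from the 3-tuples of ATGGCGTGCA.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return a vertex sequence that uses every directed edge exactly once.

    Assumes such a path exists (the graph is connected and balanced,
    apart from at most one start and one end vertex).
    """
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for v, w in edges:
        out_edges[v].append(w)
        balance[v] += 1
        balance[w] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # no unused edge left: emit the vertex
    return path[::-1]


edges = [("AT", "TG"), ("TG", "GG"), ("GG", "GC"), ("GC", "CG"),
         ("CG", "GT"), ("GT", "TG"), ("TG", "GC"), ("GC", "CA")]
path = eulerian_path(edges)
```

This graph has more than one Eulerian path, because the repeated (l − 1)-tuples TG and GC can be traversed in two orders. That ambiguity is what the later slides on repeats and tangles address.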

13 Error Correction
Unlike OLC assemblers, EULER makes the consensus step (error correction) the first step. If the genome G is known, then error correction in a read s can be done by aligning s against G. Catch-22! To assemble a genome, we’d like to correct errors in reads first, but to correct errors in reads, one has to assemble the genome first.

14 Error Correction – Bypassing Catch-22
* G_l: the set of all l-tuples in genome G
* Solid l-tuple: an l-tuple that belongs to more than M reads, for some threshold M. If an l-tuple is not solid, we call it weak.
* We use an approximation of G_l rather than G to correct sequencing errors. A natural approximation of G_l is the set of all solid l-tuples.
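
A minimal sketch of picking out the solid l-tuples. The function name and the toy reads are assumptions for illustration; the threshold M is a parameter, as in the definition.

```python
from collections import Counter


def solid_tuples(reads, l, M):
    """Return the l-tuples that occur in more than M reads."""
    counts = Counter()
    for read in reads:
        # Use a set so a tuple repeated inside one read is counted once.
        counts.update({read[i:i + l] for i in range(len(read) - l + 1)})
    return {t for t, c in counts.items() if c > M}


reads = ["ACGTAC", "CGTACG", "ACGTTC"]
solid = solid_tuples(reads, 3, 1)
```

In this toy input, GTT and TTC each occur in only one read, so they are weak.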

15 Spectral Alignment Problem
Let T be a collection of l-tuples called a spectrum. A string s is called a T-string if all its l-tuples belong to T.
Spectral Alignment Problem: given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string.
In the context of error correction, it makes sense to use this method only if the number of mutations is relatively small. If so, the method can be used even with large l.
Aligning a read with the set of solid l-tuples changes the sets of weak and solid l-tuples. Iteratively aligning all reads with the set of solid l-tuples gradually reduces the number of weak l-tuples and increases the number of solid l-tuples, which eliminates many read errors.
Usually, reads with poor alignments (a low score) represent errors in the reads or in the mapping process, and it is common to ignore them. But we can do better!
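
To make the problem statement concrete, here is a brute-force sketch that only considers substitutions and gives up after `max_mutations` changes. It is not the algorithm used by EULER. The function names and the `max_mutations` cutoff are assumptions for the example.

```python
from itertools import combinations, product


def is_t_string(s, spectrum, l):
    """True if every l-tuple of s belongs to the spectrum."""
    return all(s[i:i + l] in spectrum for i in range(len(s) - l + 1))


def spectral_alignment(s, spectrum, l, max_mutations=2):
    """Smallest number of substitutions that turns s into a T-string.

    Tries 0, 1, 2, ... substitutions in turn, so the first hit is minimal.
    Returns (count, corrected string), or None if the cutoff is reached.
    """
    for k in range(max_mutations + 1):
        for positions in combinations(range(len(s)), k):
            for letters in product("ACGT", repeat=k):
                candidate = list(s)
                for p, c in zip(positions, letters):
                    candidate[p] = c
                candidate = "".join(candidate)
                if is_t_string(candidate, spectrum, l):
                    return k, candidate
    return None


spectrum = {"ACG", "CGT", "GTA", "TAC"}
result = spectral_alignment("ACGTTC", spectrum, 3)
```

The search is exponential in the number of mutations, which fits the remark that the approach only makes sense when few mutations are expected.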

16 Error Correction Problem
Given a collection of reads S = {s1, …, sn} and an integer l, the spectrum of S is the set S_l of all l-tuples from the reads s1, …, sn and their reverse complements s̄1, …, s̄n.
Δ: an upper bound on the number of errors in each DNA read.
Error Correction Problem: given S, l, and Δ, introduce up to Δ corrections in each read in S so that |S_l| is minimized.
An error in a single read s affects at most l l-tuples in s and l l-tuples in s̄, and usually creates 2l erroneous l-tuples that point to the same error (2d for positions within d < l from the endpoints).
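
The sketch below computes the spectrum over both strands and counts how many l-tuples of one strand contain a given position. The helper names are hypothetical, and positions are 0-based.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Complement every nucleotide, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]


def spectrum(reads, l):
    """All l-tuples from the reads and from their reverse complements."""
    tuples = set()
    for read in reads:
        for strand in (read, reverse_complement(read)):
            tuples.update(strand[i:i + l] for i in range(len(strand) - l + 1))
    return tuples


def tuples_covering(position, read_length, l):
    """Number of l-tuples of one strand that contain the given position."""
    first = max(0, position - l + 1)        # leftmost tuple start covering it
    last = min(position, read_length - l)   # rightmost tuple start covering it
    return max(0, last - first + 1)
```

For a read of length 100 and l = 20, an error in the middle lies in 20 tuples per strand, while an error at position 2 lies in only 3. Doubling these for the reverse complement gives the 2l and 2d counts above.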

17 Error Correction Problem
EULER uses a greedy approach and looks for error corrections in the reads that reduce |S_l| by 2l. We do this by changing low-multiplicity l-tuples to match l-tuples with higher multiplicity. This eliminates 86.5% of the errors in the reads. Then another, more involved procedure is conducted, after which 97.7% of the errors are eliminated. It transformed the original test data, with 4.8 errors per read on average, into almost error-free data with 0.11 errors per read on average.

18 Data correction or Data corruption?
However, this procedure is not perfect when deciding which nucleotide (for example, A or T) is correct in a given l-tuple within a read. If the correct nucleotide is A, but T is also present in some reads from the same region, we might assign T instead of A to all reads, introducing an error rather than correcting one. These errors are easy to fix later, and it is much more important for the reads of the same region to be consistent, which reduces the complexity of the de Bruijn graph. For the NM sequencing project, EULER eliminated 234,410 errors and introduced 1,452 errors. In this way, we eliminate false edges in our graph and deal with the introduced errors later: the correct nucleotides are easily reconstructed either by a majority rule or by a variation of the Churchill–Waterman algorithm.

19 Back to graphs

20 Repeats
A path v1 … vn in the de Bruijn graph is called a repeat if indegree(v1) > 1, outdegree(vn) > 1, and indegree(vn) = outdegree(vi) = 1 for 1 ≤ i ≤ n − 1. Edges entering the vertex v1 are called entrances into a repeat, whereas edges leaving the vertex vn are called exits from a repeat.

21 A repeat v1 … vn and a system of paths overlapping with this repeat.
[Figure labels: branching vertices, entrances, exits.] The uppermost path contains the repeat and defines the correct pairing between the corresponding entrance and exit. If this path were not present, the repeat v1 … vn would become a tangle.

22 Tangles
A repeat is called a tangle if there is no read-path containing this repeat. Tangles create problems in fragment assembly, because pairings of entrances and exits in a tangle cannot be resolved via the analysis of read-paths. To address this issue, we formulate the following generalization of the Eulerian Path Problem: the Eulerian Superpath Problem.

23 Eulerian Superpath Problem (ESP)
ESP: given an Eulerian graph G and a collection of paths 𝒫 in this graph, find an Eulerian path in this graph that contains all these paths as subpaths. To solve the Eulerian Superpath Problem, we transform both the graph G and the system of paths 𝒫 in this graph into a new graph G1 with a new system of paths 𝒫1. Such a transformation is called equivalent if there exists a one-to-one correspondence between Eulerian superpaths in (G, 𝒫) and (G1, 𝒫1).

24 Eulerian Superpath Problem (ESP)
Our goal is to make a series of equivalent transformations (G, 𝒫) → (G1, 𝒫1) → … → (Gk, 𝒫k) that lead to a system of paths 𝒫k with every path being a single edge. Because all transformations on the way from (G, 𝒫) to (Gk, 𝒫k) are equivalent, every solution of the Eulerian Path Problem in (Gk, 𝒫k) provides a solution of the Eulerian Superpath Problem in (G, 𝒫). We’ll look into 2 cases:
* The graph has multiple edges
* The graph doesn’t have multiple edges

25 Case: no multiple edges
 Let x = (vin, vmid) and y = (vmid, vout) be two consecutive edges in graph G, and let 𝒫x,y be a collection of all paths from 𝒫 that include both these edges as a subpath. Define 𝒫→x as a collection of paths from 𝒫 that end with x and 𝒫y→ as a collection of paths from 𝒫 that start with y.

26 Case: no multiple edges
The x,y-detachment is a transformation that adds a new edge z = (vin, vout) and deletes the edges x and y from G. This detachment alters the system of paths 𝒫 as follows:
* substitute z for x, y in all paths from 𝒫x,y
* substitute z for x in all paths from 𝒫→x
* substitute z for y in all paths from 𝒫y→
Because every detachment reduces the number of edges in G, the detachments will eventually shorten all paths from 𝒫 to single edges and will reduce the Eulerian Superpath Problem to the Eulerian Path Problem.
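
The detachment can be sketched on a toy representation, for the simple case with no multiple edges. Edges are assumed to be stored in a dictionary from edge name to (tail, head), and paths as lists of edge names. These data structures and all names are illustrative choices, not taken from the article.

```python
def detach(edges, paths, x, y, z):
    """Apply an x,y-detachment and return the new edges and paths."""
    v_in, v_mid = edges[x]
    v_mid_again, v_out = edges[y]
    assert v_mid == v_mid_again, "x must end where y starts"
    # Delete x and y from the graph and add the new edge z = (vin, vout).
    new_edges = {name: ends for name, ends in edges.items() if name not in (x, y)}
    new_edges[z] = (v_in, v_out)
    new_paths = []
    for path in paths:
        rewritten, i = [], 0
        while i < len(path):
            if path[i] == x and i + 1 < len(path) and path[i + 1] == y:
                rewritten.append(z)  # x, y inside the path becomes z
                i += 2
            elif (path[i] == x and i == len(path) - 1) or (path[i] == y and i == 0):
                rewritten.append(z)  # path ends with x, or starts with y
                i += 1
            else:
                rewritten.append(path[i])
                i += 1
        new_paths.append(rewritten)
    return new_edges, new_paths


edges = {"a": ("u", "vin"), "x": ("vin", "vmid"),
         "y": ("vmid", "vout"), "b": ("vout", "w")}
paths = [["a", "x", "y"], ["a", "x"], ["y", "b"]]
new_edges, new_paths = detach(edges, paths, "x", "y", "z")
```

After the call, the graph has one edge fewer and every path is no longer than before, which is what drives the reduction to the Eulerian Path Problem.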

27 Case: multiple edges
In the case of graphs with multiple edges, the detachment procedure may lead to errors, because “directing” all paths from the set 𝒫→x through a new edge z may not be an equivalent transformation. In this case, the edge x may be visited many times in the Eulerian path, and it may or may not be followed by the edge y on some of these visits.

28 For illustration purposes, let us consider a simple case when the vertex vmid has the only incoming edge x = (vin, vmid) with multiplicity 2 and two outgoing edges y1 = (vmid, vout1) and y2 = (vmid, vout2), each with multiplicity 1. In this case, the Eulerian path visits the edge x twice; in one case, it is followed by y1, and in another case, it is followed by y2. Consider an x,y1-detachment that adds a new edge z = (vin, vout1) after deleting the edge y1 and one of two copies of the edge x. This detachment:
* shortens all paths in 𝒫x,y1 by substituting a single edge z for x, y1
* substitutes z for y1 in every path from 𝒫y1→

29 If 𝒫→x is not empty, it is not clear whether the last edge of a path P ∈ 𝒫→x should be assigned to the edge z or to the (remaining copy of) edge x. To resolve this dilemma, one has to analyze every path P ∈ 𝒫→x and decide whether it “relates” to 𝒫x,y1 (in this case, it should be directed through z) or to 𝒫x,y2 (in this case, it should be directed through x). By “relates” to 𝒫x,y1 (𝒫x,y2), we mean that every Eulerian superpath visits y1 (y2) immediately after visiting P.

30 More definitions…
Two paths are called consistent if their union is a path again. A path P is consistent with a set of paths 𝒫 if it is consistent with all paths in 𝒫, and inconsistent otherwise (i.e., if it is inconsistent with at least one path in 𝒫).
* p1 = (a,b),(b,c) and p2 = (b,c),(c,d) are consistent because p1 ∪ p2 = (a,b),(b,c),(c,d) is a path.
* p1 = (a,b),(b,c) and p2 = (a,b),(b,d) are not consistent because p1 ∪ p2 = (a,b),(b,c),(b,d) is not a path.
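
The two examples can be checked with a small sketch. It assumes paths are lists of (tail, head) edges and that no vertex is visited twice, which is a simplification: paths in a de Bruijn graph may revisit vertices. The function names are hypothetical.

```python
def is_path(edge_set):
    """True if the directed edges form one simple path (no branching, no cycle)."""
    succ, pred = {}, {}
    for u, v in edge_set:
        if u in succ or v in pred:  # a vertex with two exits or two entrances
            return False
        succ[u], pred[v] = v, u
    starts = [u for u in succ if u not in pred]
    if len(starts) != 1:
        return False
    # Walk from the unique start; every edge must be reached.
    steps, v = 0, starts[0]
    while v in succ:
        v, steps = succ[v], steps + 1
    return steps == len(edge_set)


def consistent(p1, p2):
    """Two paths are consistent if the union of their edges is again a path."""
    return is_path(set(p1) | set(p2))


def consistent_with_all(p, paths):
    """A path is consistent with a set of paths if it is consistent with each."""
    return all(consistent(p, q) for q in paths)
```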

31 There are three possibilities:
(a) P is consistent with exactly one of the sets 𝒫x,y1 and 𝒫x,y2
(b) P is inconsistent with both 𝒫x,y1 and 𝒫x,y2
(c) P is consistent with both 𝒫x,y1 and 𝒫x,y2

32 Case (a): Consistent with one
In the first case, the path P is called resolvable, because it can be unambiguously related to either 𝒫x,y1 or 𝒫x,y2. An edge x is called resolvable if all paths in 𝒫→x are resolvable. If the edge x is resolvable, then the described x, y-detachment is an equivalent transformation after the correct assignments of last edges in every path from 𝒫→x. In the analysis of the NM project, the researchers found that 18,026 among 18,962 edges in the de Bruijn graph are resolvable.

33 Case (b): Inconsistent with both
The second condition implies that the Eulerian Superpath Problem has no solution, because P, 𝒫x,y1, and 𝒫x,y2 impose three different scenarios for just two visits of the edge x. After discarding the poor-quality and chimeric reads, the researchers did not encounter this condition in the NM project. Chimeric reads – We call a read chimeric when it merges together sequences from two or more distant regions of the genome. We declare a read to be chimeric when it matches in two disjoint pieces in the genome better than any single match of the whole read. Such reads typically cannot be corrected and have to be trimmed or discarded.

34 Case (c): Consistent with both
The last condition (P is consistent with both 𝒫x,y1 and 𝒫x,y2) corresponds to the most difficult situation. If this condition holds for at least one path in 𝒫→x , the edge x is called unresolvable, and we postpone analysis of this edge until all resolvable edges are analyzed. The researchers observed that equivalent transformation of other resolvable edges often resolves previously unresolvable edges.  However, some edges cannot be resolved even after the detachments of all resolvable edges are completed. Such situations usually correspond to tangles, and they have to be addressed by another equivalent transformation called a cut.

35 Cuts
Consider a fragment of graph G with 5 edges and 4 paths y3 − x, y4 − x, x − y1, and x − y2. In this symmetric situation, x is a tangle, and there is no information available to relate any of paths y3 − x and y4 − x to any of paths x − y1 and x − y2. An edge x = (v, w) is removable if:
(i) it is the only outgoing edge for v and the only incoming edge for w
(ii) x is either the initial or the terminal edge for every path P ∈ 𝒫 containing x
An x-cut transforms 𝒫 into a new system of paths by simply removing x from all paths in 𝒫→x and 𝒫x→ without affecting the graph G itself. An x-cut is an equivalent transformation if x is a removable edge.
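
The 5-edge example can be written out as a small sketch. As before, edges are stored as a dictionary from edge name to (tail, head) and paths as lists of edge names. This representation, the vertex names, and the function names are assumptions for illustration.

```python
def is_removable(x, edges, paths):
    """Check conditions (i) and (ii) for the edge x."""
    tail, head = edges[x]
    only_out = sum(1 for t, _ in edges.values() if t == tail) == 1
    only_in = sum(1 for _, h in edges.values() if h == head) == 1
    # x may appear only as the first or last edge of a path, never inside it.
    at_ends = all(x not in p[1:-1] for p in paths)
    return only_out and only_in and at_ends


def x_cut(x, paths):
    """Remove x from the start or end of every path; the graph is unchanged."""
    cut = []
    for p in paths:
        if p and p[0] == x:
            p = p[1:]
        if p and p[-1] == x:
            p = p[:-1]
        if p:
            cut.append(p)
    return cut


edges = {"y3": ("a", "v"), "y4": ("b", "v"), "x": ("v", "w"),
         "y1": ("w", "c"), "y2": ("w", "d")}
paths = [["y3", "x"], ["y4", "x"], ["x", "y1"], ["x", "y2"]]
removable = is_removable("x", edges, paths)
cut_paths = x_cut("x", paths)
```

After the cut, every path is a single edge, so what remains is an ordinary Eulerian Path Problem on this fragment.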

36 Detachments and cuts proved to be powerful techniques to untangle the de Bruijn graph and to reduce the fragment assembly to the Eulerian Path Problem for all studied bacterial genomes. However, there is still a gap in the theoretical analysis of the Eulerian Superpath Problem in the case when the system of paths is not amenable to either detachments or cuts.

37 Using DB data
We still have tangles to resolve, and we use DB data to resolve them.
DB data: double-barreled data. Mate-pair (r1, r2): two reads at the two ends of an unknown sequence of roughly known length l(r1, r2).
[Figure: reads r1 and r2 at the ends of a fragment of length l(r1, r2).]

38 Using DB data
EULER-DB maps every read into some edge(s) of the de Bruijn graph. After this mapping, most mate-pairs of reads correspond to paths that connect the positions of these reads in the de Bruijn graph. We call a path that connects a mate-pair a mate-path. We define d(r1, r2) to be the distance between the reads in the de Bruijn graph (the length of the mate-path of r1, r2).

39 Using DB data
We compare d(r1, r2) to l(r1, r2) and use this information to our advantage. In most cases such a path is unique and its length approximately matches the distance between the mate-pair (d(r1, r2) ≈ l(r1, r2)).

40 Using DB data
In the case of multiple paths between r1 and r2 in the de Bruijn graph, there are three possibilities:
* The difference between d(r1, r2) and l(r1, r2) is beyond an acceptable variation for every path. This is most likely an error in the DB data.
* l(r1, r2) matches the length of exactly one path between r1 and r2. We transform the mate-pair into a mate-read.
* l(r1, r2) matches the length of more than one path between r1 and r2. In this case there is not (yet) sufficient information to form a mate-read. The mate-pair is kept in the DB data in the hope that it will be resolved in the following iterations.
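
The three-way decision can be sketched as below. The slides do not say how large the acceptable variation is, so `tolerance` is a made-up parameter, and the function name and return labels are also illustrative.

```python
def classify_mate_pair(path_lengths, expected_length, tolerance):
    """Decide what to do with a mate-pair given its candidate path lengths.

    path_lengths: lengths d(r1, r2) of the candidate paths between the reads
    expected_length: the roughly known length l(r1, r2)
    tolerance: assumed acceptable difference between d and l
    """
    matches = [d for d in path_lengths if abs(d - expected_length) <= tolerance]
    if not matches:
        return "likely DB error"   # no path has a plausible length
    if len(matches) == 1:
        return "mate-read"         # exactly one path fits
    return "unresolved"            # keep the mate-pair for a later iteration
```

For example, with an expected length of 1,600 and a tolerance of 300, candidate paths of lengths 1,500 and 4,200 give a mate-read, while lengths 1,500 and 1,700 leave the pair unresolved.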

41 Last case example Red mate-pair corresponds to 2 different paths
Blue corresponds to a unique path. Paths generated lead to resolution of both tangles

42 Insert length is the length of the sequence in between a pair of reads.
Sequencers are supplied DNA samples in fragments of a known length, and each end is sequenced (generally in a 5′ to 3′ direction from both ends). In the NM project (with insert length up to 1,800), EULER left only 5 unresolved tangles, of lengths 3,610, 3,215, 2,741, 2,503, and 1,831. After completing EULER-DB, we build scaffolds by using DB data as “bridges” between different contigs (EULER-SF). EULER-SF combines the 91 contigs into 60 scaffolds, thus closing most gaps that are shorter than the insert length.

43

44 More results

45

46 Conclusions
* EULER bypasses the “repeat problem,” because the Eulerian Superpath approach transforms most repeats into different paths in the de Bruijn graph. As a result, EULER does not even notice repeats unless they are long perfect repeats.
* Tangles caused by long repeats may be resolved by DB information (EULER-DB).
* EULER has excellent scaling potential for eukaryotic genomes, because there exist linear-time algorithms for the Eulerian Path Problem. Eukaryotes are organisms whose cells have a nucleus enclosed within membranes. Animals and plants are the most familiar eukaryotes. The typical multicellular eukaryotic genome is much larger than that of a bacterium.
* EULER’s high accuracy reduces biologists’ work of mapping, verifying, and finishing, and saves time and money.

47 Thank you for listening!

