"An Eulerian path approach to global multiple alignment for DNA sequences” by Y. Zhang and M. Waterman * “An Eulerian path approach to local multiple alignment.

"An Eulerian path approach to global multiple alignment for DNA sequences" by Y. Zhang and M. Waterman* and "An Eulerian path approach to local multiple alignment for DNA sequences" by Y. Zhang and M. Waterman**
Presented by Jaehee Jung, Mar 4 2005, CPSC 689-604
*Journal of Computational Biology 10-6, pp. 803-819 (2003).
**Proc. National Academy of Science of USA 102-5, pp. 1285-1290 (2005).

2 Outline
Motivation
–Hamiltonian & Eulerian path
–Superpath problem
Global Alignment
–Global Alignment Algorithm
–Probability Analysis
–Complexity
–Discussion
Local Alignment
–Local Alignment Algorithm
–Significance Estimation
–Complexity
–Discussion

3 Motivation - Hamiltonian path
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
[Figure: graph whose vertices are the l-tuples of S; the paths shown spell ATGCGTGGCA and ATGGCGTGCA]
The Hamiltonian path problem is NP-complete

4 Motivation - Eulerian path
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
Vertices correspond to (l-1)-tuples
Edges correspond to l-tuples from the spectrum
[Figure: de Bruijn graph on the vertices AT, TG, GG, GC, GT, CG, CA; its Eulerian paths spell ATGGCGTGCA and ATGCGTGGCA]
Eulerian path: visiting every edge exactly once corresponds to sequence reconstruction
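
The reconstruction on this slide can be sketched in Python as follows. This is only an illustration, assuming the spectrum admits an Eulerian path; the function name `eulerian_path_sequence` and the choice of Hierholzer's algorithm are assumptions made here, not details from the papers.

```python
from collections import defaultdict

def eulerian_path_sequence(spectrum):
    """Spell one sequence whose l-tuples are exactly the given spectrum."""
    graph = defaultdict(list)   # (l-1)-tuple -> successor (l-1)-tuples
    balance = defaultdict(int)  # out-degree minus in-degree of each vertex
    for tup in spectrum:
        left, right = tup[:-1], tup[1:]
        graph[left].append(right)
        balance[left] += 1
        balance[right] -= 1
    # Start at the vertex with one surplus outgoing edge, if there is one.
    start = next((v for v, d in balance.items() if d == 1), spectrum[0][:-1])
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Consecutive vertices overlap in all but their last letter.
    return path[0] + "".join(v[-1] for v in path[1:])

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
seq = eulerian_path_sequence(S)
print(seq)
```

For the spectrum S of the slide, the result is one of the two reconstructions ATGCGTGGCA or ATGGCGTGCA, depending on the order in which edges are followed.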

5 Global multiple alignment
–Entire sequences are aligned into one configuration
–Time and memory cost depend on L (sequence length) and N (number of sequences)
Multiple sequence alignment
–Many heuristic algorithms
Progressive alignment strategies
–Align the closest pair of sequences
–Then align the next closest pair of sequences
»Ex: MULTAL, CLUSTALW, T-COFFEE

6 Global multiple alignment
–Many heuristic algorithms (cont'd)
Iterative refinement strategies
–Local alignment to construct a multiple alignment based on segment-to-segment comparison
–Refine the initial alignment iteratively by local alignment
»Ex: DIALIGN
–Iteratively divide the sequences into two groups and realign them
»Ex: PRRP
–Stochastic iterative strategies
»Ex: HMMT, SAM
ISSUE
–Robust only under certain conditions
–Local optimum problem (iterative methods)
=> Goal: efficiency in time and memory space

7 Motivation
EULER [1][2]
–Fragment assembly in DNA sequencing using the Eulerian superpath approach
–The Eulerian path problem is easy to solve in a de Bruijn graph
–Contribution: discards the traditional "overlap-layout-consensus" approach
–"Error-free" data obtained by an error-correction procedure
EulerAlign [3]
–Global multiple DNA sequence alignment problem using Eulerian paths
–Similar to the Star method
–Assumes all input sequences are derived from a common ancestral sequence

8 Star Alignment Example
x1: MPE   x2: MKE   x3: MSKE   x4: SKE
[Figure: star with s2 at the center and s1, s3, s4 as leaves]
Pairwise alignments against x2:
  MPE    MSKE    SKE
  MKE    -MKE    MKE
Merged step by step:
  MPE    -MPE    -MPE
  MKE    -MKE    -MKE
         MSKE    MSKE
                 -SKE
–Compute the alignments of all sequence pairs
–Pick one sequence among the N sequences as the consensus
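
A hedged sketch of the center-picking step: the slide does not state the scoring scheme, so unit-cost edit distance is assumed here as a stand-in for the pairwise alignment score, and the names `edit_distance` and `star_center` are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (assumed stand-in for the pairwise alignment score)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def star_center(seqs):
    """Pick the sequence with the smallest total distance to all the others."""
    return min(seqs, key=lambda s: sum(edit_distance(s, t) for t in seqs))

seqs = ["MPE", "MKE", "MSKE", "SKE"]
center = star_center(seqs)
print(center)
```

Under this assumed scoring, the four example sequences give MKE (x2) as the center, which matches the pairwise alignments on the slide.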

9 Motivation - Eulerian Superpath
Superpath Problem – EULER [2]
–Given an Eulerian graph and a collection of paths in this graph, find an Eulerian path in this graph that contains all these paths as subpaths
–Solution: transform the graph G and the system of paths P into G_1 and P_1
–Make a series of equivalent transformations (G, P) -> (G_1, P_1) -> (G_2, P_2) -> ... -> (G_k, P_k)

10 Motivation - Eulerian Superpath
Equivalent transformation
–x,y detachment
[Figure: before, consecutive edges x and y through the vertices v_in, v_mid, v_out, with path sets P_->x, P_y-> and P_x,y; after, a new edge z from v_in to v_out]

11 Motivation - Eulerian Superpath
Equivalent transformation
–x,y detachment
[Figure: path P together with P_x,y1 and P_x,y2]
P is consistent with P_x,y1 but inconsistent with P_x,y2: P is resolvable

12 Motivation - Eulerian Superpath
Equivalent transformation
–x,y detachment
[Figure: path P together with P_x,y1 and P_x,y2]
P is inconsistent with both P_x,y1 and P_x,y2: the problem has no solution (not encountered in the NM project*)
*NM project: "difficult-to-assemble" and "repeat-rich" bacterial genomes

13 Motivation - Eulerian Superpath
Equivalent transformation
–x,y detachment
[Figure: path P together with P_x,y1 and P_x,y2]
P is consistent with both P_x,y1 and P_x,y2: difficult situation
–Keep analyzing until all resolvable edges have been analyzed

14 Motivation - Eulerian Superpath
Equivalent transformation
–x-cut: changes the paths in P_->x and P_x-> without affecting the graph G
[Figure: edge x at v_in with neighboring edges y1, y2, y3, y4 and path sets P_->x and P_x->, shown before and after the cut]

15 Eulerian global alignment - the algorithm
1. Construct a directed de Bruijn graph
2. Transform the de Bruijn graph into a DAG
3. Extract a consensus path from the DAG according to the edge weights
4. Do fast pairwise alignment between the consensus path and each input sequence
5. Construct the final multiple alignment from the pairwise alignments

16 (1) – (2) – (3) – (4) – (5) Construct a directed de Bruijn graph
CCTTAG is split into the k-tuples CCTTA and CTTAG
CCTTA gives the edge CCTT -> CTTA; CTTAG gives the edge CTTA -> TTAG
Merging the vertices "CTTA" yields CCTT -> CTTA -> TTAG
Construction of the de Bruijn graph for CCTTAG and k=5
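
A small sketch of this construction, assuming each k-tuple becomes one edge from its (k-1)-prefix to its (k-1)-suffix and that multiplicity is simply the number of occurrences over all input sequences. The function name `de_bruijn_edges` is hypothetical.

```python
from collections import Counter

def de_bruijn_edges(sequences, k):
    """Map each edge (prefix, suffix) of a k-tuple to its multiplicity."""
    edges = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            tup = seq[i:i + k]
            # Identical (k-1)-tuples are merged because they share one dict key.
            edges[(tup[:-1], tup[1:])] += 1
    return edges

single = de_bruijn_edges(["CCTTAG"], k=5)
double = de_bruijn_edges(["CCTTAG", "CCTTAG"], k=5)
print(single)
print(double)
```

With one copy of CCTTAG the two edges CCTT -> CTTA and CTTA -> TTAG have multiplicity 1; with two identical input sequences both edges have multiplicity 2.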

17 de Bruijn Graph Construction
Assume that there are no sequencing errors. Construct the de Bruijn graph, taking all (k – 1)-mers appearing in the set of fragments as vertices.
[Figure: de Bruijn graph example for k = 3]
Sequencing errors have to be corrected before construction of the de Bruijn graph.
[Figure: error-correction example with the read ACGGCTAT, the other reads CTAACTGC, CTGCTA, AACTGCT, and one corrected base]

18 (1) – (2) – (3) – (4) – (5) Construct a directed de Bruijn graph
[Figure: an example of the initial de Bruijn graph, with vertices numbered 0 to 9 and edges labeled by multiplicity]

19 (1) – (2) – (3) – (4) – (5) Transform the de Bruijn graph into a DAG
–Tangle: a vertex that has more than one incoming or outgoing edge
Created by random matches, repeats, and mutations in the DNA sequences
Results in cycles
–Goal: eliminate tangles, because they cause many cycles
[Figure: tangle at vertex v_i]

20 (1) – (2) – (3) – (4) – (5) Transform the de Bruijn graph into a DAG
Claim
–E->vi: a left edge of vertex v_i, defined as an edge that points to v_i
–If a vertex v_i has two or more left edges {En->vi}, n = 1, 2, 3, ..., that are contained in the same sequence path, there must exist a cycle in the graph
Proof
–v_i is visited when the path traverses E1->vi and visited again when it traverses E2->vi, so the path returns to v_i and closes a cycle
[Figure: vertex v_i]
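
The claim suggests a simple check, sketched here under the assumption that a sequence path is the ordered list of (k-1)-tuples of one input sequence: any vertex that the same path visits more than once lies on a cycle. The helper name `repeated_vertices` is hypothetical.

```python
from collections import Counter

def repeated_vertices(seq, k):
    """Return the (k-1)-tuple vertices that the path of seq visits more than once."""
    n_vertices = len(seq) - k + 2  # number of (k-1)-tuples in seq
    visits = Counter(seq[i:i + k - 1] for i in range(n_vertices))
    return {v for v, n in visits.items() if n > 1}

print(repeated_vertices("ATGATG", 3))  # path AT, TG, GA, AT, TG revisits AT and TG
print(repeated_vertices("CCTTAG", 5))  # path CCTT, CTTA, TTAG has no revisits
```

Vertices reported by such a check are the ones where the cycle-removing transformation of the following slides would have to be applied.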

21 (1) – (2) – (3) – (4) – (5) Transform the de Bruijn graph into a DAG
Rule of transformation
–The sequence information in Evi-> is partitioned between two superedges E1->vi-> and E2->vi->
–The multiplicities of the superedges E1->vi-> and E2->vi-> are computed
[Figure: left edges E1->vi and E2->vi enter v_i and edge Evi-> leads to v_j; after the transformation, the superedges E1->vi-> and E2->vi-> pass through v_i and its copy v_i']
A tangle at v_i is eliminated by making a copy v_i' of vertex v_i and separating the edges

22 (1) – (2) – (3) – (4) – (5) Transform the de Bruijn graph into a DAG
Rule of transformation
[Figure: left edges E1->vi and E2->vi and right edges E1vi-> and E2vi-> at v_i; after the transformation, the superedges E1->vi-> and E2->vi-> pass through v_i and its copy v_i']
A tangle at v_i is eliminated by making a copy v_i' of vertex v_i

23 (1) – (2) – (3) – (4) – (5) Transform the de Bruijn graph into a DAG
Safe transformation
–Does not introduce a loss of similarity
[Figure: tangle at v_i with left edges E1->vi, E2->vi and right edges E1vi->, E2vi->, labeled with multiplicities 2 and 1, split into the superedges E1->vi-> and E2->vi-> through v_i and v_i']

24 (1) – (2) – (3) – (4) – (5) Transform the de Bruijn graph into a DAG
Unsafe transformation
–Introduces a loss of similarity
[Figure: tangle at v_i with left edges E1->vi, E2->vi and right edges E1vi->, E2vi->, labeled with multiplicities, split into the superedges E1->vi-> and E2->vi-> through v_i and v_i']

25 (1) – (2) – (3) – (4) – (5) Transform the de Bruijn graph into a DAG
–Remove cycles by performing safe transformations first
–Leave all unsafe transformations for later
[Figure: the example graph with vertices numbered 0 to 9 and edges labeled by multiplicity]
–Make a DAG: heaviest consensus path

26 (1) – (2) – (3) – (4) – (5) Extract a consensus path from the DAG
Greedy algorithm
–Used to find a heaviest path in linear time
–The result is not guaranteed to be optimal, but it is satisfactory
–Weight of each edge: proportional to its multiplicity and length

27 (1) – (2) – (3) – (4) – (5) Fast pairwise alignment
Banded pairwise alignment algorithm
–The positional shifts between two candidate letters in the two sequences are bounded by a constant
Align the consensus sequence with each input sequence
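
A minimal sketch of a banded global alignment, assuming match/mismatch/gap scores of +1/-1/-1 (the slide gives no scoring scheme) and a hypothetical function name `banded_alignment_score`. Only cells with |i - j| <= band are filled, which is one way to bound the positional shift by a constant.

```python
def banded_alignment_score(a, b, band, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only cells with |i - j| <= band."""
    if abs(len(a) - len(b)) > band:
        raise ValueError("band is too narrow for these sequence lengths")
    # Row 0: aligning the empty prefix of a with prefixes of b inside the band.
    prev = {j: j * gap for j in range(min(band, len(b)) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            candidates = []
            if j == 0:
                candidates.append(i * gap)            # a[:i] against nothing
            if j > 0 and (j - 1) in prev:             # diagonal move
                s = match if a[i - 1] == b[j - 1] else mismatch
                candidates.append(prev[j - 1] + s)
            if j in prev:                             # gap in b
                candidates.append(prev[j] + gap)
            if j > 0 and (j - 1) in cur:              # gap in a
                candidates.append(cur[j - 1] + gap)
            cur[j] = max(candidates)
        prev = cur
    return prev[len(b)]

print(banded_alignment_score("ACGT", "ACGT", band=1))
print(banded_alignment_score("ACGT", "AGT", band=1))
```

Each row touches at most 2·band + 1 cells, so the work is proportional to band times the sequence length rather than to the product of the two lengths.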

28 (1) – (2) – (3) – (4) – (5) Construct the final multiple alignment
Combine the pairwise alignments to construct the final multiple alignment

29 Probability Analysis
Assume: all input sequences are derived from a common ancestral sequence S_0
–N sequences derived from S_0
–N: number of sequences
–L: average sequence length
–k: size of the k-tuple
–mutation rate
No mutation: the N sequences are exactly the same as S_0, so the multiplicity of each edge is N
With mutation: the weight of each edge in S_0 depends on the mutation rate
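
As a hedged illustration of this setting: if each position is assumed to mutate independently with rate `eps` (an assumption and a name introduced here, not taken from the slide), a k-tuple of S_0 survives unchanged in one sequence with probability (1 - eps)^k, and its multiplicity over N sequences is binomial.

```python
from math import comb

def edge_survival_prob(k, eps):
    """Probability that a k-tuple of S_0 is mutation-free in one sequence (assumed model)."""
    return (1 - eps) ** k

def prob_multiplicity_at_least(n_seqs, k, eps, m):
    """P(multiplicity >= m) when multiplicity ~ Binomial(n_seqs, (1 - eps)^k)."""
    p = edge_survival_prob(k, eps)
    return sum(comb(n_seqs, i) * p ** i * (1 - p) ** (n_seqs - i)
               for i in range(m, n_seqs + 1))

# Without mutation every edge of S_0 keeps the full multiplicity N.
print(prob_multiplicity_at_least(10, 5, 0.0, 10))
# With mutation the expected multiplicity is N * (1 - eps)^k.
print(10 * edge_survival_prob(5, 0.05))
```

This only restates the two cases of the slide numerically; it is not the large deviation bound used in the paper.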

30 Probability Analysis
Large Deviation Theorem (L.D.T.) for the binomial estimate
If [formula not preserved], then the consensus path exists and is accurate

31 Computational complexity
–Construction and transformation of the graph: [formula not preserved]
–Find the heaviest path: [formula not preserved]
–Banded pairwise alignment: [formula not preserved]

32 Discussion
Choice of k-tuple size
–The larger k, the lower the multiplicity of each edge; suited to larger N
–The smaller k, the less likely a k-tuple is unique in the sequence; for small N, a small k gives high multiplicity
–Estimate k using the L.D.T.
Graph transformation may lose information
–Unsafe transformations lose similarity information
Arbitrary scoring function

33 Local multiple alignment
Difficulty
–Locations, sizes, structures, and number of conserved regions
Local multiple alignment
–PIMA, MACAW, DIALIGN
Subproblem of local alignment
–Motif finding
»Ex: Gibbs motif sampler, MEME
Limitation
–Size of the data, length of the motif

34 Local multiple alignment
Another specific problem of local alignment
–Entire genome sequences
–Large-scale sequence comparison
Local alignment
–Using pairwise sequence comparison
Not accurate: errors accumulate and ruin the final result
–Comparing each sequence with a database
Finds only conserved regions

35 Local Alignment Algorithm
1. Construct a de Bruijn graph from overlapping k-tuples
2. Cut "thin" edges by estimating the statistical significance of each edge with a Poisson heuristic
3. Resolve cycles in the graph
4. Extract a heaviest path as the consensus
5. Construct and output a multiple alignment from pairwise alignments
6. Declump the de Bruijn graph and return to step 5 to find other patterns

36 (1) – (2) – (3) – (4) – (5) – (6) Construct de Bruijn graph
ATGT -> ATG, TGT
ATGC -> ATG, TGC
CTGT -> CTG, TGT
Each 3-tuple is an edge between 2-tuple vertices: AT -> TG -> GT, AT -> TG -> GC, CT -> TG -> GT
3-tuple de Bruijn graph obtained by "gluing" identical edges and vertices: vertices AT, CT, TG, GT, GC; edges ATG, CTG, TGT, TGC

37 (1) – (2) – (3) – (4) – (5) – (6) Cut "thin" edges
Uninteresting edges
–A huge number of edges are thin, i.e. have small multiplicity
–Decide whether to remove an edge by estimating its probability
[Figure: (a) graph before removing thin edges; (b) graph after removing thin edges]
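
A deliberately simplified sketch of the cut: the slide derives the decision from a probability estimate, whereas this example assumes a fixed multiplicity threshold supplied by the caller. The names `cut_thin_edges` and `min_multiplicity` are hypothetical.

```python
def cut_thin_edges(edges, min_multiplicity):
    """Keep only edges whose multiplicity reaches the (assumed) threshold."""
    return {edge: m for edge, m in edges.items() if m >= min_multiplicity}

edges = {("AT", "TG"): 9, ("TG", "GT"): 8, ("TG", "GC"): 1}
kept = cut_thin_edges(edges, min_multiplicity=2)
print(kept)
```

In the method described on the slide the threshold would come from the significance estimate rather than being chosen by hand.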

38 (1) – (2) – (3) – (4) – (5) – (6) Resolve cycles in graph
Tandem repeats
–A repeat appears as a cycle in the graph
–It is ambiguous how many times the cycle should be traversed
–Solved with the superpath solution

39 (1) – (2) – (3) – (4) – (5) – (6) Extract a heaviest path as the consensus
Heaviest path
–Shortest-path algorithm with negative edge weights, using topological sort
–Costs linear time (acyclic graph)
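
A sketch of the heaviest-path step on an acyclic graph, processing vertices in topological order (Kahn's algorithm). It maximizes positive weights directly, which is equivalent to a shortest-path computation on negated weights; the function name `heaviest_path` and the edge-dictionary input format are assumptions made for this example.

```python
from collections import defaultdict, deque

def heaviest_path(edges):
    """Return (weight, path) of a heaviest path in a DAG given as {(u, v): weight}."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for (u, v), w in edges.items():
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0, []
    best = {n: 0 for n in nodes}  # a path may start at any vertex (weights assumed positive)
    back = {}
    queue = deque(n for n in nodes if indeg[n] == 0)
    visited = 0
    while queue:                  # vertices leave the queue in topological order
        u = queue.popleft()
        visited += 1
        for v, w in succ[u]:
            if best[u] + w > best[v]:
                best[v], back[v] = best[u] + w, u
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if visited != len(nodes):
        raise ValueError("graph contains a cycle")
    end = max(nodes, key=best.get)
    path = [end]
    while path[-1] in back:
        path.append(back[path[-1]])
    return best[end], path[::-1]

dag = {("A", "B"): 3, ("B", "D"): 1, ("A", "C"): 1, ("C", "D"): 1, ("D", "E"): 2}
print(heaviest_path(dag))
```

Every vertex and edge is handled once, which is where the linear running time on an acyclic graph comes from.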

40 (1) – (2) – (3) – (4) – (5) – (6) Construct and output a multiple alignment
Find the consensus
–Banded version of local pairwise alignment
–Declumping algorithm to find segments similar to the consensus
Optimal alignment has p > p_0
–p_0: assumes the Poisson distribution

41 (1) – (2) – (3) – (4) – (5) – (6) Construct and output a multiple alignment
Declumping algorithm
[Figure: declumping example showing several alternative gapped alignments of the same segments]

42 (1) – (2) – (3) – (4) – (5) – (6) Declump the graph
–Remove the information of previously output local alignments
–This allows additional patterns to be found
Ex: XYZ and PYQ
–Do not remove the edges of Y
–Reduce their multiplicity
Repeat
–Find consensus, align to consensus, declump graph
Until no significant local alignments are left
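
One possible reading of "reduce their multiplicity", sketched under the assumption that each reported segment contributed one occurrence to every k-tuple edge it contains. The function name `declump` and the input format are hypothetical.

```python
from collections import Counter

def declump(edges, reported_segments, k):
    """Subtract the occurrences contributed by already reported segments."""
    remaining = Counter(edges)  # copy, so the input graph is left untouched
    for seg in reported_segments:
        for i in range(len(seg) - k + 1):
            tup = seg[i:i + k]
            edge = (tup[:-1], tup[1:])
            if remaining[edge] > 0:
                remaining[edge] -= 1
    return +remaining  # unary plus drops edges whose count fell to zero

graph = Counter({("AT", "TG"): 3, ("TG", "GC"): 1})
after = declump(graph, ["ATGC"], k=3)
print(after)
```

An edge shared with another pattern, like Y in the XYZ and PYQ example, stays in the graph with a lower multiplicity instead of disappearing.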

43 Significance Estimation
Estimate the P value of a local multiple alignment
–Remove thin edges formed by random matches
–Rank multiple outputs by statistical significance
Estimate the minimum multiplicity of mutation-free edges
–Local alignment is more complicated than the global case, because of the positions and orders of the conserved regions in each sequence

44 Poisson clumping heuristic
Pairwise alignment
–H is the optimal clump score
–p_(2) is the probability that two letters are identical
–L_1, L_2 are the adjusted lengths of the two sequences
–L_1 L_2 p_(2)^x is an approximation to the expected number of clumps with score x
Multiple alignment: [formula not preserved]
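
A hedged numerical sketch of the pairwise case. It takes the slide's expression L_1 L_2 p_(2)^x as the expected number of clumps and applies the usual Poisson approximation, P(at least one clump) = 1 - exp(-expected). Any constant factor lost from the transcript is not included, and the name `clump_p_value` is hypothetical.

```python
from math import exp, expm1

def clump_p_value(x, len1, len2, p_match):
    """Poisson approximation to the chance of seeing at least one clump of score x."""
    expected_clumps = len1 * len2 * p_match ** x
    # -expm1(-t) equals 1 - exp(-t) but stays accurate for very small t.
    return -expm1(-expected_clumps)

for score in (10, 15, 20):
    print(score, clump_p_value(score, 1000, 1000, 0.25))
```

The value drops quickly as the score x grows, which is the property that lets alignments be ranked by significance.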

45 Computation Efficiency
k: tuple size
l: pattern length found in each iteration
N: number of sequences
L: average sequence length
Time
–Graph construction and transformation
–Pairwise alignment with declumping
Space
–The size of the alignment matrix

46 Discussion
–Tuple size (10~20)
–How to detect a true pattern rather than a concatenation of different patterns
–The current version focuses on DNA, not protein sequences

47 Assignment #5
The de Bruijn graph and Eulerian path approach is applied here only to DNA, whose alphabet consists of the four nucleotides A, C, G, T. Give an efficient algorithm that uses a graph to compute a multiple sequence alignment for proteins (an alphabet of 20 characters).
–Hint: do not use the de Bruijn graph or the Eulerian graph; the graph structure is embedded in the dynamic programming algorithm
If you have questions, contact me at jhjung@cs.tamu.edu

48 Reference
[1] "A new algorithm for DNA sequence assembly" by R. Idury and M. Waterman, Journal of Computational Biology 2, pp. 291-306 (1995).
[2] "An Eulerian path approach to DNA fragment assembly" by P.A. Pevzner, H. Tang, and M. Waterman, Proc. National Academy of Science of USA, pp. 9748-9753 (2001).
[3] "An Eulerian path approach to global multiple alignment for DNA sequences" by Y. Zhang and M. Waterman, Journal of Computational Biology 10-6, pp. 803-819 (2003).
[4] "An Eulerian path approach to local multiple alignment for DNA sequences" by Y. Zhang and M. Waterman, Proc. National Academy of Science of USA 102-5, pp. 1285-1290 (2005).