Greedy method for inferring tandem duplication history Louxin Zhang, Bin Ma, Lusheng Wang and Ying Xu. BIOINFORMATICS 2003 reference: 1.Elemento,O.,(2002)

1 Greedy method for inferring tandem duplication history. Louxin Zhang, Bin Ma, Lusheng Wang and Ying Xu. Bioinformatics, 2003. References: 1. Elemento, O. (2002) Reconstructing the duplication history of tandemly repeated genes, Mol. Biol. Evol. 2. Tang, M., Waterman, M. (2001) Zinc finger gene clusters and tandem gene duplication, RECOMB. Reporters: r92922054 李明翰, b88506020 黃寶萱, b90902020 蔡明潔

2 Outline: Duplication model; Constructing a duplication model from a phylogeny (double duplication model, arbitrary duplication model); Discussion

3 Duplication A duplication replaces a stretch of DNA containing several repeats with two identical and adjacent copies of itself. If the stretch contains k repeats, the duplication is called a k-duplication.
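The definition above can be illustrated with a minimal sketch. It assumes the repeats are stored in a Python list; the function name `k_duplication` and the 0-based `start` index are hypothetical choices made for this illustration, not notation from the slides.

```python
def k_duplication(repeats, start, k):
    """Return a new list in which the stretch repeats[start:start+k]
    is replaced by two identical, adjacent copies of itself."""
    if k < 1 or start < 0 or start + k > len(repeats):
        raise ValueError("the stretch must lie inside the repeat list")
    stretch = repeats[start:start + k]
    # Everything before the stretch, the stretch twice, everything after it.
    return repeats[:start] + stretch + stretch + repeats[start + k:]


# A 2-duplication of the stretch "b c" inside the family a b c d.
family = ["a", "b", "c", "d"]
duplicated = k_duplication(family, 1, 2)
```

A k-duplication always lengthens the family by exactly k repeats, which is why a 1-duplication adds a single copy next to the original.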

4

5

6 DM ( duplication model ) A duplication model M for tandemly repeated sequence is a directed graph. A duplication model contains nodes, edges and blocks.

7 Phylogeny & DM

8 Node & Edge A node in DM represents a repeat. A directed edge (u,v) indicates that v is a child of u, which also means that u is an ancestor of v. Nodes are classified as the root, leaves and internal nodes.

9 Block A block in DM represents a duplication. Each internal node appears in a unique block. No node is an ancestor of another in a block. We draw a block representing a k-duplication only when k ≥ 2.

10

11 Block (Cont.) lc(v) means the left child of v. rc(v) means the right child of v. If the block corresponds to a k-duplication, then it contains k nodes $v_1, v_2, \dots, v_k$ from left to right. Their children are then placed from left to right in the order $lc(v_1), lc(v_2), \dots, lc(v_k), rc(v_1), rc(v_2), \dots, rc(v_k)$.

12 Cont. Hence, for any i and j with $1 \le i < j \le k$, the edges $(v_i, rc(v_i))$ and $(v_j, lc(v_j))$ cross each other. The left-to-right order of leaves in the model is identical to the order of the sequences on a chromosome.

13 Example For the block $[v_1, v_3, v_4]$, the children appear in the order $lc(v_1), lc(v_3), lc(v_4), rc(v_1), rc(v_3), rc(v_4)$. An ordered phylogenetic tree for the sequences {1, 2, …, n} is a rooted phylogeny whose leaves are listed from left to right in increasing order.
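The child ordering and the crossing property can be checked with a short sketch. It assumes a block is given as a list of node names from left to right; the function names and the string labels such as "lc(v1)" are hypothetical and only serve this illustration.

```python
def block_children_order(block):
    """Left-to-right order of the 2k children of a k-duplication block:
    all left children first, then all right children."""
    return [f"lc({v})" for v in block] + [f"rc({v})" for v in block]


def edges_cross(block, i, j):
    """For 0-based positions i < j in the block, report whether the edge
    (v_i, rc(v_i)) crosses the edge (v_j, lc(v_j))."""
    if not 0 <= i < j < len(block):
        raise ValueError("need 0 <= i < j < len(block)")
    position = {name: idx for idx, name in enumerate(block_children_order(block))}
    # v_i is left of v_j, so the edges cross when rc(v_i) lands right of lc(v_j).
    return position[f"rc({block[i]})"] > position[f"lc({block[j]})"]


order = block_children_order(["v1", "v3", "v4"])
```

Because every right child is placed after every left child, the check succeeds for every pair i < j, which matches the statement on the slide.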

14 LEMMA 1: Let $l^*c(u)$ and $r^*c(u)$ denote the leftmost and the rightmost leaf in the subtree $T_M(u)$ rooted at u, respectively. For each internal node u in $T_M$, $r^*c(u) > r^*c(lc(u))$ and $l^*c(u) < l^*c(rc(u))$. Moreover, $r^*c(u)$ and $l^*c(u)$ are the biggest and smallest labels in the subtree $T_M(u)$.

15 Constructing a duplication model from a phylogeny

16 Features: A duplication model M has a unique associated phylogeny $T_M$. A phylogeny is not necessarily associated with a duplication model.

17 Problem: Reconstruct the duplication model M in linear time. Input: a phylogeny T. Output: the duplication model M associated with T, if one exists.

18 Problem (Cont.) Note: To represent a duplication model, we only need to list all non-single duplication blocks on the associated phylogeny, for example $[v_1, v_3, v_4]$, $[v_5, v_6]$, $[v_7, v_8]$.

19 Double duplication models Given a phylogeny T on a sequence family F = {1, 2, …, n}, associate a pair $(L_v, R_v)$ of indices with each node v in T as follows: 1. If v is the i-th leaf: $(L_v, R_v) = (i, i)$. 2. If v is an internal node: $(L_v, R_v) = (l^*c(v), r^*c(v))$.

20 Example (figure): each node of the tree is labeled with its pair $(L_v, R_v)$. Root: r (1,10). Internal nodes: $v_1$ (1,6), $v_2$ (2,10), $v_3$ (2,9), $v_4$ (3,10), $v_5$ (2,4), $v_6$ (3,5), $v_7$ (7,9), $v_8$ (8,10). Leaves: i (i,i) for i = 1, …, 10.

21 Bottom-up computation of $(L_v, R_v)$: $L_v = \min\{L_{lc(v)}, L_{rc(v)}\}$ and $R_v = \max\{R_{lc(v)}, R_{rc(v)}\}$, computed recursively from the leaves up. Since T contains 2n−1 nodes, this takes linear time.
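The bottom-up rule can be sketched as follows. The tree encoding is an assumption made for this example: internal nodes are keys of a dictionary `children` mapping to a `(lc, rc)` pair, and leaves are the integers 1..n. The function name `index_pairs` is hypothetical.

```python
def index_pairs(children, root):
    """Compute (L_v, R_v) for every node with one post-order traversal.

    children maps each internal node to (lc, rc); any node that is not a
    key of children is treated as a leaf labeled by an integer.
    """
    pairs = {}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if node not in children:
            # The i-th leaf gets (i, i).
            pairs[node] = (node, node)
        elif not expanded:
            # Revisit this node after both children have been processed.
            stack.append((node, True))
            lc, rc = children[node]
            stack.append((rc, False))
            stack.append((lc, False))
        else:
            lc, rc = children[node]
            pairs[node] = (min(pairs[lc][0], pairs[rc][0]),
                           max(pairs[lc][1], pairs[rc][1]))
    return pairs


# A small hypothetical tree with four leaves.
tree = {"r": ("a", "b"), "a": (1, 3), "b": (2, 4)}
result = index_pairs(tree, "r")
```

Each node is pushed at most twice and combined once, so the work is proportional to the 2n−1 nodes of the tree. An explicit stack is used instead of recursion only to avoid Python's recursion limit on deep trees.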

22 Constructing DDM from phylogeny Double duplication model (DDM): a duplication model in which every duplication is a 1-duplication or a 2-duplication. By Lemma 1, the leftmost and rightmost leaves in T are 1 and n respectively. Where is leaf 2 located? It must be right next to leaf 1 in the DDM.

23 Let (1) $v_0 = r, v_1, v_2, \dots, v_{p-1}, v_p = 1$ be the leftmost path from the root r to leaf 1, and (2) $u_1 = rc(v_i), u_2, \dots, u_{q-1}, u_q = 2$ be the leftmost path from $rc(v_i)$ to leaf 2, where $q \ge p - i$. LEMMA 2. M must contain $p-i-1$ double duplications $[v_{i+1}, u_{j_1}], [v_{i+2}, u_{j_2}], \dots, [v_{p-1}, u_{j_{p-i-1}}]$. Example (figure): i = 2, p = 5, q = 6.

24 LEMMA 2. (Cont.) Since $j_{p-i-1} \le q - 1$, it follows that $q \ge p - i$. PROOF. If $v_{i+k}$ does not belong to a double duplication block in M, the leaf labeled 2 cannot be placed before the leftmost leaf in the subtree rooted at $rc(v_{i+k})$, contradicting the fact that 2 is right next to 1 in M. Hence, $v_{i+k}$ must appear in a double duplication block for each k, $1 \le k \le p - i - 1$. This finishes the proof.

25 Note: $R_{u_1} > R_{u_2} > \dots > R_{u_{q-1}} > R_{u_q}$ and $R_{v_{i+1}} > R_{v_{i+2}} > \dots > R_{v_{p-1}}$. For each block $[v_{i+k}, u_{j_k}]$, $R_{v_{i+k}}$ lies between $R_{u_{j_k}}$ and $R_{u_{j_k+1}}$. We can therefore determine all the $u_{j_k}$'s in at most $p - i + q \le 2q$ comparisons.

26 After all the duplication blocks $[v_{i+k}, u_{j_k}]$ are placed on T, leaf 2 should be right next to leaf 1.

27 Derive a rooted binary tree T’’ from the subtree $T(u_1)$ by inserting a new node $v_k$ in the edge $(u_{j_k}, u_{j_k+1})$ for each $1 \le k \le p - i - 1$ and assigning the subtree $T(rc(v_{i+k}))$ rooted at $rc(v_{i+k})$ as the right subtree of $v_k$. Note: the left child of $v_k$ is now $u_{j_k+1}$. Then form the new phylogeny T’ from T by replacing the subtree $T(v_i)$ with T’’.
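One way to read the note above is as a merge of two decreasing sequences. The sketch below is an illustration under stated assumptions, not the authors' implementation: the R values are passed as plain lists, "between" is read as the strict inequality $R_{u_{j_k}} > R_{v_{i+k}} > R_{u_{j_k+1}}$, and the indices $j_k$ are required to be strictly increasing because each node lies in a unique block. The function name `match_blocks` is hypothetical.

```python
def match_blocks(r_u, r_v):
    """Find 1-based indices j_1 < j_2 < ... such that
    r_u[j_k] > r_v[k] > r_u[j_k + 1] (in 1-based terms) for every k.

    r_u holds R_{u_1}, ..., R_{u_q}; r_v holds R_{v_{i+1}}, ..., R_{v_{p-1}}.
    Both are assumed strictly decreasing. Returns None if no match exists.
    """
    q = len(r_u)
    indices = []
    j = 0  # 0-based cursor into r_u; it never moves backwards
    for value in r_v:
        # Skip u nodes whose successor is still above the current value.
        while j + 1 < q and r_u[j + 1] > value:
            j += 1
        # The value must sit strictly between two consecutive u values.
        if j + 1 >= q or not (r_u[j] > value > r_u[j + 1]):
            return None
        indices.append(j + 1)  # report the 1-based index j_k
        j += 1                 # the next block must use a later u node
    return indices


# Hypothetical R values: q = 5 nodes on the u path, two nodes on the v path.
matched = match_blocks([10, 8, 6, 4, 2], [9, 5])
```

The cursor only advances, so the total number of comparisons is proportional to the combined length of the two lists, in line with the $p - i + q \le 2q$ bound stated on the slide.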

28

29

30 Linear time (Analysis) Since we can charge the comparisons made in different recursive steps to disjoint left paths in the input tree T, the whole algorithm takes at most 2×2n comparisons to determine all the duplication blocks, so it runs in linear time. Each internal node is compared once while it is on a q path (the path next to the leftmost path) and once while it is on a p path (the leftmost path), and each comparison uses its pair $(R_v, L_v)$. Therefore, at most 2×2n comparisons are needed.

31 Arbitrary duplication models

32 Now, we generalize the above algorithm to arbitrary duplication models. Again, we assume the leftmost paths leading to leaf 1 and leaf 2 in T are given in (1) and (2) respectively.

33 Observation: Assume a phylogeny T is associated with a duplication model M. Then there exist $p - i - 1$ double duplication blocks $[v_{i+k}, u_{j_k}]$ ($1 \le k \le p - i - 1$) such that, after these duplications are placed in T, leaf 2 is right next to leaf 1. However, these double duplication blocks may not be in M.

34 Recall that there are two types of nodes on the leftmost path of T’. Some nodes are original ones in the input tree T ; some are inserted due to duplication blocks we have examined so far. To extend the existing duplication blocks to larger ones, we associate a flag to each original node on the leftmost path of T’, which indicates whether the node is in an existing duplication block or not.

35 Let x be an original node on the leftmost path P of T’ that appears in a duplication block $[x_1, x_2, \dots, x_t, x]$ of size t + 1 so far. Then there are t inserted nodes $x'_i$ right below x on the path P, which correspond to $x_i$ for $i \le t$. To determine whether $[x_1, x_2, \dots, x_t, x]$ can be extended to a larger duplication block in the model with which the original tree T is associated, we need to consider x and all the $x'_i$ ($1 \le i \le t$) simultaneously. For this purpose, we introduce the concept of hyper-double (duplication) blocks.

36 We say that x and y form a hyper-double block [x, y] in T’ if the following three conditions hold: (i) x is a node in some non-single duplication block that we have obtained so far; (ii) neither x nor y is an ancestor of the other; (iii) the block $[x_1, x_2, \dots, x_t, x]$ can be extended to a block $[x_1, x_2, \dots, x_t, x, y]$ of size t + 2 in the original tree T.

37 Hence, when we place a hyper-double block [x, y] in the current tree T’, the edge (y, lc(y)) crosses not only the edge (x, rc(x)) but also the edges $(x'_i, rc(x'_i))$, $1 \le i \le t$. So a phylogeny T is associated with a duplication model if and only if: (i) there exist $p - i - 1$ double duplication blocks $[v_{i+k}, u_{j_k}]$ ($1 \le k \le p - i - 1$) in T such that, after these duplication blocks are placed in T, leaf 2 is right next to leaf 1, and (ii) the tree T’ constructed above is associated with a duplication model once hyper-double duplication blocks are allowed.

38 To make the algorithm run in linear time, we refine it in two respects. First, we assign a pair $(R'_x, R''_x)$ of indices to each node x on the leftmost path of T in each recursive step: if x is in a duplication block $[x_1, x_2, \dots, x_t, x]$ at the current stage, we set $R'_x = R_{x_1}$ and $R''_x = R_x$, which are defined in Section 2.2.1. Since $R'_x < R_{x_i} < R''_x$ for $2 \le i \le t$, only $R'_x$ and $R''_x$ need to be examined to determine whether x is in a hyper-double block in the next step.

39 Second, if the duplication block $[x_1, x_2, \dots, x_t, x]$ is extended into a larger hyper-double block $[x_1, x_2, \dots, x_t, x, y]$ in a step, the binary tree T’ for the next step is constructed by inserting the right subtrees of the $x_i$'s and x into the edge between y and its left child lc(y). To do these insertions, we point the left child of $x_1$ to lc(y), and then point the left child of y to x. In this way, all the subtrees are inserted with only two pointer operations.

40 DS: $[v_1, v_2]$ $[v_3, v_5]$ $[v_8, v_6]$

41 DS: $[v_1, v_2]$ $[v_3, v_5, v_4]$ $[v_8, v_6]$

42 DS: $[v_1, v_2]$ $[v_3, v_5, v_4, v_7]$ $[v_8, v_6]$

