




1. Phylogenetic Trees (2), Lecture 12. Based on: Durbin et al., Sections 7.3 and 7.8; Gusfield, Algorithms on Strings, Trees, and Sequences, Section 17.

2. Recall: The Four-Point Condition. Theorem: A set M of L objects is additive iff any subset of four objects can be labeled i,j,k,l so that: d(i,k) + d(j,l) = d(i,l) + d(j,k) ≥ d(i,j) + d(k,l). We call {{i,j},{k,l}} the "split" of {i,j,k,l}. The four-point condition implies an O(L^4) algorithm to decide whether a set is additive. The most common methods for constructing trees for additive sets use neighbor joining methods, which we study next.
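Since the condition involves only quadruples, the O(L^4) check can be written down directly. A minimal sketch (Python; not part of the original slides):

```python
from itertools import combinations

def is_additive(d, n):
    """d[i][j] = distance between objects i and j; n = number of objects."""
    for i, j, k, l in combinations(range(n), 4):
        # the three ways of pairing {i,j,k,l} into two pairs
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        # four-point condition: the two largest of the three sums must be equal
        if sums[1] != sums[2]:
            return False
    return True
```

On the 4-object example matrix D used at the end of this lecture, the three pair sums are 9, 15, 15, so the check passes.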

3. Constructing additive trees: the neighbor joining problem. Let M be an additive set, let i, j be neighboring leaves in the implied tree, let k be their parent, and let m be any other vertex. Since d(i,m) + d(j,m) = d(i,j) + 2d(k,m), we can compute the distances of k to all other leaves: d(k,m) = ½(d(i,m) + d(j,m) - d(i,j)). This suggests the following method to construct a tree from a distance matrix: 1. Find neighboring leaves i,j in the tree. 2. Replace i,j by their parent k and recursively construct a tree T for the smaller set. 3. Add i,j as children of k in T.
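As a small illustration of that distance update (a sketch; d is assumed to be a dict of dicts of pairwise distances, and the names are mine):

```python
def parent_distances(d, i, j, others):
    """Distances from the parent k of neighboring leaves i, j to every other leaf m,
    using d(k,m) = (d(i,m) + d(j,m) - d(i,j)) / 2."""
    return {m: (d[i][m] + d[j][m] - d[i][j]) / 2 for m in others}
```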

4. Neighbor Finding. How can we find, from distances alone, a pair of nodes which are neighboring leaves (called "cherries")? The closest pair of nodes is not necessarily a cherry (the slide illustrates this with a four-leaf tree on A, B, C, D). Next we show one way to find neighbors from distances.

5. Neighbor Finding: the Saitou & Nei method (1987). Definitions (as in Durbin et al., Section 7.3): for each leaf i let r_i = (1/(L-2)) Σ_k d(i,k), and let D(i,j) = d(i,j) - (r_i + r_j). Theorem (Saitou & Nei): Assume all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.
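A sketch of this criterion in code (the function name and data layout are mine; it evaluates D(i,j) for every pair and returns a minimizing pair, which by the theorem is a cherry):

```python
def saitou_nei_pair(d, leaves):
    """d: dict of dicts of pairwise distances; leaves: list of at least 4 leaf labels."""
    L = len(leaves)
    r = {i: sum(d[i][k] for k in leaves if k != i) / (L - 2) for i in leaves}
    pairs = [(i, j) for a, i in enumerate(leaves) for j in leaves[a + 1:]]
    return min(pairs, key=lambda p: d[p[0]][p[1]] - (r[p[0]] + r[p[1]]))
```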

6. Saitou & Nei proof. Definitions: path(i,j) = the path from leaf i to leaf j; d(u, path(i,j)) = the distance in T from vertex u to path(i,j). [Figure: a tree with leaves i, j and a vertex u, illustrating path(i,j) and d(u, path(i,j)).]

7. Saitou & Nei proof (cont.): proof of the Claim. [The algebraic derivation on this slide is lost in the transcript; it involves the quantities r_i, r_j and the term -2d(u, path(i,j)).]

8. Saitou & Nei proof (cont.). For a vertex i and an edge e, define N_i(e) = |{u : e is on path(i,u)}|. Then Q(i,j) can be written as a sum of edge weights w(e), each counted with a multiplicity determined by N_i(e) and N_j(e) (the exact expression is on the slide). Note: if e' is a "leaf edge", then w(e') is added exactly once to Q(i,j). [Figure: leaves i, j, a vertex u, an edge e, and the rest of T.]

9. Saitou & Nei proof (cont.). Assume for contradiction that Q(i,j) is maximized for a pair i,j which are not neighboring leaves. Let (see the figure): path(i,j) = (i, ..., k, j); T_1 = the subtree rooted at k, where WLOG T_1 has at most L/2 leaves; T_2 = T \ T_1. Let i',j' be any two neighboring leaves in T_1. We will show that Q(i',j') > Q(i,j).

10. Saitou & Nei proof (cont.). Proof that Q(i',j') > Q(i,j): each leaf edge e adds w(e) both to Q(i,j) and to Q(i',j'), so we can ignore the contribution of leaf edges to both quantities and compare only the contributions of internal edges.

11. Saitou & Nei proof (end). Contribution of internal edges to Q(i,j) and to Q(i',j'), according to the location of the internal edge e:
- e on path(i,j): w(e) is added once to Q(i,j), and N_{i'}(e) ≥ 2 times to Q(i',j').
- e on path(i',j): w(e) is added N_i(e) < L/2 times to Q(i,j), and N_{i'}(e) ≥ L/2 times to Q(i',j').
- e in T \ path(i,i'): w(e) is added N_i(e) times to Q(i,j) and N_{i'}(e) times to Q(i',j'), and N_i(e) = N_{i'}(e).
Since there is at least one internal edge e on path(i,j), Q(i',j') > Q(i,j). QED

12. A simpler neighbor-finding method: Select an arbitrary node r. For leaves i, j, d(r, path(i,j)) = ½(d(r,i) + d(r,j) - d(i,j)). Claim (from the final exam, Winter 02-03): Let i, j be such that d(r, path(i,j)) is maximized. Then i and j are neighboring leaves.

13. Neighbor Joining Algorithm.
- If L = 3, return a tree of three vertices.
- Set M to contain all leaves, and select a root r.
- Compute, for all i,j ≠ r: C(i,j) = (d(r,i) + d(r,j) - d(i,j))/2, i.e. C(i,j) = d(r, path(i,j)).
Iteration:
- Choose i,j such that C(i,j) is maximal.
- Create a new vertex k, and set d(k,m) = ½(d(i,m) + d(j,m) - d(i,j)) for every other object m, d(i,k) = d(r,i) - C(i,j), and d(j,k) = d(r,j) - C(i,j).
- Remove i,j, and add k to M.
- Recursively construct a tree on the smaller set, then add i,j as children of k, at distances d(i,k) and d(j,k).
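A sketch of the whole procedure with this criterion (illustrative code, not the lecture's reference implementation; the distance table d is a dict of dicts that is extended as new internal vertices are created, the result maps each internal vertex to its children, and edge lengths are omitted for brevity, since they can be recovered as d(i,k) = d(r,i) - C(i,j)):

```python
def neighbor_join(d, objects, children=None):
    """Recursive neighbor joining using C(i,j) = (d(r,i) + d(r,j) - d(i,j)) / 2."""
    if children is None:
        children = {}
    objects = list(objects)
    if len(objects) <= 3:
        root = ("root",)
        children[root] = objects                 # base case: a star on the last <= 3 objects
        return root, children
    r, rest = objects[0], objects[1:]
    C = lambda i, j: (d[r][i] + d[r][j] - d[i][j]) / 2   # = d(r, path(i,j))
    i, j = max(((x, y) for x in rest for y in rest if x != y), key=lambda p: C(*p))
    k = ("node", i, j)                           # new parent of the cherry {i, j}
    d[k] = {}
    for m in objects:
        if m not in (i, j):
            d[k][m] = d[m][k] = (d[i][m] + d[j][m] - d[i][j]) / 2
    children[k] = [i, j]
    smaller = [m for m in objects if m not in (i, j)] + [k]
    return neighbor_join(d, smaller, children)
```

As written this is the naive O(L^3) variant analyzed on the next slide; the heap-based bookkeeping described two slides below brings it to O(L^2 log L).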

14. Complexity of the Neighbor Joining Algorithm (using the simpler neighbor-finding method). Naive implementation:
Initialization: Θ(L^2) to compute d(r,i) and C(i,j) for all leaves i,j.
Each iteration:
- O(L^2) to find the maximal C(i,j).
- O(L) to compute C(m,k) for all objects m, for the new node k.
Total: O(L^3).

15. Complexity of the Neighbor Joining Algorithm, using a heap to store the C(i,j)'s.
Input: distance matrix D = d(i,j), and an arbitrary object r.
Initialization: Θ(L^2) to compute and heapify the C(i,j)'s in a heap H.
Each iteration:
- O(log L) to find and delete the maximal C(i,j) from H.
- O(L) to add the values {d(k,m)} to D, for all objects m.
- O(L) to delete {d(m,i), d(m,j)} from D (for all m).
- O(L log L) to delete {C(i,m), C(j,m)} from H and to add the values C(k,m), for all objects m.
Total: O(L^2 log L). (Implementation details are omitted.)

16. Some remarks on the Neighbor Joining Algorithm.
- It is applicable to matrices which are not additive.
- It is known to work well in practice (with the original neighbor-finding method).
- The algorithm and its variants are the most widely used distance-based algorithms today.
Next we'll learn a more efficient algorithm for constructing trees from distances, which is based on ultrametric trees.

17. Ultrametric trees. Definition: an ultrametric tree is a rooted weighted tree all of whose leaves are at the same depth. Basic property: define the height of the leaves to be 0; then the edge weights can be represented by the heights of the internal vertices. [Figure: a tree with leaves A, E, D, C, B, internal-vertex heights 8, 5, 3, 3, and edge weights 3, 3, 3, 3, 2, 5, 5, 3.]

18. Least common ancestors and distances in an ultrametric tree. Let LCA(i,j) denote the least common ancestor of leaves i and j, let height(LCA(i,j)) be its distance from the leaves, and let dist(i,j) be the distance from i to j. Observation: for any pair of leaves i, j in an ultrametric tree, height(LCA(i,j)) = 0.5 dist(i,j). For the tree in the previous figure, the heights of the LCAs are:
      A  B  C  D  E
  A   0  8  8  5  3
  B      0  3  8  8
  C         0  8  8
  D            0  5
  E               0
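A small sketch of the observation in code. It assumes the ultrametric tree is represented as nested (height, children) pairs with leaves given as plain labels; the reconstruction of the example tree below is my own reading of the figure. The LCA heights of all leaf pairs are collected in one traversal, and dist(i,j) is then 2·height(LCA(i,j)).

```python
def lca_heights(tree):
    """Return {(i, j): height of LCA(i, j)} for all ordered pairs of distinct leaves."""
    heights = {}

    def leaves_below(node):
        if not isinstance(node, tuple):            # a leaf label
            return [node]
        height, subtrees = node
        groups = [leaves_below(child) for child in subtrees]
        # leaves taken from two different subtrees meet exactly at this node
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                for x in groups[a]:
                    for y in groups[b]:
                        heights[(x, y)] = heights[(y, x)] = height
        return [x for g in groups for x in g]

    leaves_below(tree)
    return heights

# the tree of the figure, with internal heights 8, 5, 3, 3:
T = (8, [(5, [(3, ["A", "E"]), "D"]), (3, ["B", "C"])])
# lca_heights(T)[("A", "D")] == 5, lca_heights(T)[("B", "C")] == 3, etc.
```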

19. Ultrametric matrices. Definition: a distance matrix* U of dimension L × L is ultrametric iff for every three indices i, j, k: U(i,j) ≤ max {U(i,k), U(j,k)}. Equivalently, among the three values U(i,j), U(i,k), U(j,k) the maximum is attained at least twice (e.g. U(i,j) = 9, U(j,k) = 9, U(i,k) = 6). Theorem: the following conditions are equivalent for an L × L distance matrix U: 1. U is an ultrametric matrix. 2. There is an ultrametric tree with L leaves such that for each pair of leaves i,j: U(i,j) = height(LCA(i,j)) = ½ dist(i,j). * Recall: a distance matrix is a symmetric matrix with positive non-diagonal entries, 0 diagonal entries, and satisfying the triangle inequality.
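The definition translates directly into an O(L^3) check; a minimal sketch:

```python
from itertools import combinations

def is_ultrametric(U, n):
    """U[i][j]: symmetric matrix with zero diagonal, indexed 0..n-1."""
    for i, j, k in combinations(range(n), 3):
        a, b, c = U[i][j], U[i][k], U[j][k]
        # the maximum of the three values must be attained at least twice
        if a > max(b, c) or b > max(a, c) or c > max(a, b):
            return False
    return True
```

For instance, it accepts the matrix U of the final example slide and rejects the matrix D there, which is additive but not ultrametric.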

20. Ultrametric tree ⇒ ultrametric matrix. If there is an ultrametric tree such that U(i,j) = ½ dist(i,j), then U is an ultrametric matrix, by the properties of least common ancestors in trees: in the figure, leaves j and k lie in one subtree and leaf i in another, so U(k,i) = U(j,i) ≥ U(k,j).

21. Ultrametric matrix ⇒ ultrametric tree. We start with two observations. Definition: let U be an L × L matrix, and let S ⊆ {1,...,L}. U[S] is the submatrix of U consisting of the rows and columns with indices from S. Observation 1: U is ultrametric iff for every S ⊆ {1,...,L}, U[S] is ultrametric. Observation 2: if U is ultrametric and max_{i,j} U(i,j) = M, then M appears in every row of U (e.g., if U(j,k) = M, then at least one of U(i,j), U(i,k) must be M).

22. Ultrametric matrix ⇒ ultrametric tree: proof by induction. U is an ultrametric matrix ⇒ U has an ultrametric tree. By induction on L, the size of U. Basis: L = 1: T is a single leaf. L = 2: for U = [[0, 9], [9, 0]], say, T is a root labeled 9 whose two children are the leaves i and j.

23. Induction step: L > 2. Use the 1st row to split the set {1,...,L} into two subsets: S_1 = {i : U(1,i) = M}, S_2 = {1,...,L} - S_1 (note: 0 < |S_i| < L). Example: if the first row of U is (0, 8, 2, 8, 5), then M = 8, S_1 = {2,4}, and S_2 = {1,3,5}.

24. Induction step (cont.). By Observation 1, U[S_1] and U[S_2] are ultrametric. By induction, there is a tree T_1 for S_1 with root labeled M_1 ≤ M, and a tree T_2 for S_2 with root labeled M_2 < M (M_2 is the 2nd largest element in row 1; if M_2 = 0 then T_2 is a leaf). Join T_1 and T_2 into T with a root labeled M. [Figure: the construction in the case M_1 = M; the root of T is labeled M = M_1, and T_2, whose root is labeled M_2 < M, is joined below it.]
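A sketch of this inductive construction (my own rendering: U is a dict of dicts, the returned tree uses the nested (height, children) form from before, and when M_1 = M the two roots are kept as separate vertices joined by a zero-length edge rather than being identified):

```python
def ultrametric_tree(U, objects):
    """Build an ultrametric tree for the ultrametric matrix U restricted to `objects`."""
    objects = list(objects)
    if len(objects) == 1:
        return objects[0]                              # a single leaf
    first, rest = objects[0], objects[1:]
    M = max(U[first][i] for i in rest)                 # largest entry in the first row
    S1 = [i for i in rest if U[first][i] == M]         # split by the first row
    S2 = [first] + [i for i in rest if U[first][i] < M]
    T1 = ultrametric_tree(U, S1)
    T2 = ultrametric_tree(U, S2)
    return (M, [T1, T2])                               # join under a root labeled M
```

On the 4-object matrix U of the final example slide, this produces the tree with internal heights 9, 7, 4.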

25. Correctness proof. Need to prove: T is an ultrametric tree for U, i.e., U(i,j) is the label of the LCA of i and j in T. If i and j are in the same subtree, this holds by induction. Else the label of LCA(i,j) is M (since they are in different subtrees), and indeed [U(1,i) = M and U(1,j) ≠ M] ⇒ U(i,j) = M: the maximum of U(1,i), U(1,j), U(i,j) must be attained at least twice, and since U(1,j) < M this forces U(i,j) = M.

26. Complexity analysis. Let f(L) be the time complexity for an L × L matrix. f(1) ≤ f(2) = constant. For L > 2:
- Constructing S_1 and S_2: O(L). Let |S_1| = k, |S_2| = L - k.
- Constructing T_1 and T_2: f(k) + f(L-k).
- Joining T_1 and T_2 into T: constant.
Thus f(L) ≤ max_k [f(k) + f(L-k)] + cL, for 0 < k < L, and f(L) = cL^2 satisfies this recurrence. (An appropriate data structure is needed to achieve these per-step costs.)
The condition U(i,j) ≤ max {U(i,k), U(j,k)} is easier to check than the four-point condition. Therefore the theorem implies that ultrametric additive sets are easier to characterize than arbitrary additive sets.
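To verify that f(L) = cL^2 satisfies the recurrence (a step not spelled out on the slide): for 0 < k < L, f(k) + f(L-k) + cL = c(k^2 + (L-k)^2 + L) = c(L^2 - 2k(L-k) + L) ≤ cL^2, because 2k(L-k) ≥ 2(L-1) ≥ L whenever L ≥ 2.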

27. Additive trees via ultrametric trees. Recent (and more efficient) ways of constructing and identifying additive trees use ultrametric trees. Idea: reduce the problem to constructing a tree from the "heights" of the internal nodes; for leaves i,j, U(i,j) represents the "height" of the common ancestor of i and j. [Figure: the ultrametric tree on leaves A, E, D, C, B with internal heights 8, 5, 3, 3.]

28. Farris transform of weighted trees to ultrametric trees. First we set the height of all leaves to 0, by transforming the weighted tree T into an ultrametric tree T' as follows. Step 1: pick a node r as a root, and "hang" the tree at r. [Figure: a weighted tree on leaves a, b, c, d with edge weights 1, 2, 2, 3, 4, shown before and after hanging it at r = a.]

29. Transforming weighted trees to ultrametric trees (cont.). Step 2: let M = max_i d(i,r); M is taken to be the height of T'. Label the root by M, and label each internal node j by M - d(r,j). [Figure: for the example tree with r = a we get M = 9, and the labels of the root and the two internal nodes are 9, 7, 4.]

30. Transforming weighted trees to ultrametric trees (cont.). Step 3 (and last): "stretch" the edges of the leaves so that all leaves are at distance M from the root. [Figure: in the example (M = 9), the leaf edges of a, b, d, c are stretched by 9, 6, 2 and 0 respectively, so every leaf ends up at distance 9 from the root.]
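A sketch of the three steps in code (the adjacency-list representation and the internal vertex names u, v are my own; the slides leave the internal vertices unnamed):

```python
def farris_transform(adj, r):
    """adj: undirected weighted tree as {vertex: {neighbor: weight}}; r: the chosen root.
    Returns (M, height) where height[v] = M - d(r, v); giving every leaf height 0,
    i.e. stretching its leaf edge by M - d(r, leaf), then yields the ultrametric tree T'."""
    # Step 1: hang the tree at r, computing d(r, v) for every vertex.
    dist, stack = {r: 0}, [r]
    while stack:
        u = stack.pop()
        for v, w in adj[u].items():
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    # Step 2: M = largest distance from r to a leaf; label each vertex v by M - d(r, v).
    leaves = [v for v in adj if len(adj[v]) == 1 and v != r]
    M = max(dist[v] for v in leaves)
    return M, {v: M - dist[v] for v in adj}

# The example tree (leaves a, b, c, d; internal vertices named u, v here):
adj = {"a": {"u": 2}, "b": {"u": 1}, "c": {"v": 4}, "d": {"v": 2},
       "u": {"a": 2, "b": 1, "v": 3}, "v": {"u": 3, "c": 4, "d": 2}}
# farris_transform(adj, "a") returns M = 9 and heights 9, 7, 4 for a, u, v (6, 0, 2 for b, c, d).
```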

31. Reconstructing the weighted tree from the ultrametric tree (M = 9). The weight of an internal edge is the difference between the labels of its endpoints. The weight of the edge to leaf i is obtained by subtracting M - d(r,i) from its current weight. [Figure: in the example, the leaf edges of weights 7, 9, 4, 4 (at b, a, d, c) are reduced by 6, 9, 2, 0 respectively, recovering the original tree with edge weights 2, 1, 3, 4, 2 and a 0-weight edge at the root r = a.]

32. Solving the additive tree problem via the ultrametric problem: outline. We solve the additive tree problem by reducing it to the ultrametric problem as follows. Given an input matrix D = D(i,j) of distances:
1. Select an arbitrary object r as a root.
2. Transform D into a matrix U = U(i,j), where U(i,j) is the height of the LCA of i and j in the corresponding ultrametric tree T_U.
3. Construct the ultrametric tree T_U for U.
4. Reconstruct the additive tree T from T_U.

33. How U is constructed from D. U(i,j) should be the height of the least common ancestor m of i and j in T_U, the ultrametric tree hung at r. Thus U(i,j) = M - d(r,m), where d(r,m) = d(r, path(i,j)) = ½(d(r,i) + d(r,j) - d(i,j)). For r = a, i = b, j = c in the example tree, we have: U(b,c) = 9 - ½(3 + 9 - 8) = 7.
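In code (a sketch; D and U are dicts of dicts keyed by the object labels, and D is assumed to have zero diagonal entries):

```python
def distances_to_ultrametric(D, r, objects):
    """Build U(i,j) = M - (d(r,i) + d(r,j) - d(i,j)) / 2 for the chosen root r."""
    M = max(D[r][x] for x in objects if x != r)
    U = {i: {} for i in objects}
    for i in objects:
        for j in objects:
            U[i][j] = 0 if i == j else M - (D[r][i] + D[r][j] - D[i][j]) / 2
    return M, U
```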

34. The transformation D → U → T_U → T, on the running example (r = a, M = 9).
D:
      a  b  c  d
  a   0  3  9  7
  b   3  0  8  6
  c   9  8  0  6
  d   7  6  6  0
U:
      a  b  c  d
  a   0  9  9  9
  b   9  0  7  7
  c   9  7  0  4
  d   9  7  4  0
T_U is the ultrametric tree with internal heights 9, 7, 4, and T is the additive tree on a, b, c, d with edge weights 2, 1, 3, 4, 2.
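Running the two sketches from earlier (distances_to_ultrametric and ultrametric_tree; both illustrative, not the lecture's code) on this example reproduces the tables and the tree above:

```python
D = {"a": {"a": 0, "b": 3, "c": 9, "d": 7},
     "b": {"a": 3, "b": 0, "c": 8, "d": 6},
     "c": {"a": 9, "b": 8, "c": 0, "d": 6},
     "d": {"a": 7, "b": 6, "c": 6, "d": 0}}
M, U = distances_to_ultrametric(D, "a", ["a", "b", "c", "d"])
# M == 9; U["b"]["c"] == U["b"]["d"] == 7, U["c"]["d"] == 4, and U["a"][x] == 9 for x != "a"
T_U = ultrametric_tree(U, ["a", "b", "c", "d"])
# T_U corresponds to (9, [(7, [(4, ['d', 'c']), 'b']), 'a']), i.e. internal heights 9, 7, 4
```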




