. Distance-Based Phylogenetic Reconstruction (part II), Tutorial #11 © Ilan Gronau
. Phylogenetic Reconstruction We’d like to study the evolutionary history of species
. Distance-Based Reconstruction
Given ML-estimated pairwise (evolutionary) distances between species, find the edge-weighted tree that best describes this metric.
The input: a distance matrix D satisfying
- D(i,i) = 0
- D(i,j) = D(j,i)
- [ D(i,j) ≤ D(i,k) + D(k,j) ]
The output: an edge-weighted tree T
- If D is additive, then D_T = D, where D_T(i,j) is the length of the path between i and j in T.
- Otherwise, return a tree that best ‘fits’ the input D.
Note: ML-estimated pairwise distances are usually not additive, but they are ‘close’ to some additive metric.
[Table: pairwise distance matrix over Bear, Raccoon, Weasel, Seal, Dog.]
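The relation between a tree T and its metric D_T can be made concrete with a small sketch. The function names (`tree_metric`, `is_distance_matrix`), the dict-of-dicts matrix layout and the example tree are assumptions made for illustration; they are not part of the tutorial.

```python
from collections import defaultdict
from itertools import permutations

def tree_metric(edges, leaves):
    """Return D_T: path lengths between leaves of an edge-weighted tree.

    edges  -- (u, v, weight) triples of an unrooted tree (hypothetical format)
    leaves -- list of taxa whose pairwise distances are wanted
    """
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def distances_from(src):
        # Depth-first traversal; in a tree every vertex is reached by one path.
        dist = {src: 0.0}
        stack = [src]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        return dist

    return {a: {b: d for b, d in distances_from(a).items() if b in leaves}
            for a in leaves}

def is_distance_matrix(D, tol=1e-9):
    """Check the three input properties: zero diagonal, symmetry, triangle inequality."""
    taxa = list(D)
    zero_diag = all(abs(D[i][i]) <= tol for i in taxa)
    symmetric = all(abs(D[i][j] - D[j][i]) <= tol for i in taxa for j in taxa)
    triangle = all(D[i][j] <= D[i][k] + D[k][j] + tol
                   for i, j, k in permutations(taxa, 3))
    return zero_diag and symmetric and triangle

# Made-up quartet tree: leaves a,b hang off x; leaves c,d hang off y.
edges = [("a", "x", 1), ("b", "x", 2), ("x", "y", 3), ("c", "y", 4), ("d", "y", 5)]
D_T = tree_metric(edges, ["a", "b", "c", "d"])
print(D_T["a"]["c"], is_distance_matrix(D_T))
```

Any matrix produced this way is additive by construction, so it is a convenient test input for the reconstruction algorithms discussed in the following slides.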
. Neighbor-Joining Algorithms
Agglomerative (bottom-up) approach:
1. Find a pair of neighboring taxa i,j.
2. Connect them to a new internal vertex v (define the edge weights).
3. Remove i,j from the taxon set and add v (define the distances from v).
4. Return to (1).
When only 2 taxa are left, connect them.
Neighbors: taxa connected by a 2-edge path.
Consistency: given an additive metric D_T,
- we always choose a pair of neighbors in T (stage 1);
- the reduced distance matrix is consistent with the reduced tree (stage 3).
By induction, we eventually reconstruct T.
. UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
UPGMA algorithm:
1. Find a pair of taxa i,j of minimal distance.
2. Connect them to a new internal vertex v.
3. Remove i,j from the taxon set and add v, with D(v,k) = αD(i,k) + (1-α)D(j,k).
4. Return to (1).
When only 2 taxa are left, connect them.
α and 1-α are proportional to the number of ‘original’ taxa that i and j represent.
Consistency? Given an additive metric D_T, do we always choose a pair of neighbors in T?
[Example: distance matrix over taxa a, b, c, d and the corresponding tree. Recoverable entries: D(b,c) = 3, D(b,d) = 15, D(c,d) = 14.]
UPGMA chooses b,c: the closest taxon is not necessarily a neighbor.
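The four UPGMA stages can be sketched as follows. This is a naive illustration, not an optimized implementation: the dict-of-dicts input, the nested-tuple names for internal vertices, the placement of each new vertex at height D(i,j)/2, and the matrix `A` are all assumptions of this sketch. `A` is a made-up additive metric in which a,b and c,d are the neighbor pairs, yet a,c is the closest pair.

```python
from itertools import combinations

def upgma(D):
    """Return (edges, height); edges are (parent, child, weight) triples."""
    dist = {a: dict(row) for a, row in D.items()}  # working copy of D
    size = {a: 1 for a in D}       # number of original taxa each cluster represents
    height = {a: 0.0 for a in D}   # distance from a cluster's vertex down to its leaves
    edges = []
    while len(dist) > 1:
        # Stage 1: pair of current taxa at minimal distance.
        i, j = min(combinations(dist, 2), key=lambda p: dist[p[0]][p[1]])
        # Stage 2: new internal vertex, placed halfway between i and j.
        v = (i, j)
        height[v] = dist[i][j] / 2.0
        edges.append((v, i, height[v] - height[i]))
        edges.append((v, j, height[v] - height[j]))
        # Stage 3: reduction, with alpha proportional to cluster size.
        alpha = size[i] / (size[i] + size[j])
        row = {k: alpha * dist[i][k] + (1 - alpha) * dist[j][k]
               for k in dist if k != i and k != j}
        del dist[i], dist[j]
        for k in dist:
            dist[k].pop(i, None)
            dist[k].pop(j, None)
            dist[k][v] = row[k]
        dist[v] = row
        size[v] = size[i] + size[j]
    return edges, height

# Made-up additive metric of the tree a-x:1, b-x:10, x-y:1, c-y:1, d-y:10.
A = {"a": {"a": 0, "b": 11, "c": 3, "d": 12},
     "b": {"a": 11, "b": 0, "c": 12, "d": 21},
     "c": {"a": 3, "b": 12, "c": 0, "d": 11},
     "d": {"a": 12, "b": 21, "c": 11, "d": 0}}
first_join = upgma(A)[0][0][0]
print(first_join)  # the first pair UPGMA joins on this input
```

On `A` the first join is a,c, which are not neighbors in the generating tree, so the inductive consistency argument of the agglomerative scheme does not go through for UPGMA on general additive input.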
. Ultrametric Trees
Ultrametric trees are edge-weighted trees that have a point (the root) equidistant from all leaves.
Additive metrics consistent with an ultrametric tree are called ultrametrics.
A distance matrix is ultrametric iff it obeys the 3-point condition:
“Any subset of three taxa can be labelled i,j,k such that d(i,j) ≤ d(j,k) = d(i,k)”
[Figure: ultrametric tree drawn against a time axis.]
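The 3-point condition translates directly into a brute-force check over all triples: the two largest of the three pairwise distances must be equal. The function name, the tolerance parameter and both example matrices below are assumptions made for illustration.

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """3-point condition: in every triple the two largest distances coincide."""
    for i, j, k in combinations(list(D), 3):
        _, mid, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if abs(largest - mid) > tol:
            return False
    return True

# Made-up ultrametric: a,b join at height 1, c joins them at height 3.
U = {"a": {"a": 0, "b": 2, "c": 6},
     "b": {"a": 2, "b": 0, "c": 6},
     "c": {"a": 6, "b": 6, "c": 0}}

# Made-up additive, non-ultrametric example (triple distances 3, 11, 12).
N = {"a": {"a": 0, "b": 11, "c": 3},
     "b": {"a": 11, "b": 0, "c": 12},
     "c": {"a": 3, "b": 12, "c": 0}}

print(is_ultrametric(U), is_ultrametric(N))
```

The check runs in Θ(n^3) time, which is fine for validating small inputs.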
. UPGMA
Additional notes:
- In the reduction formula D(v,k) = αD(i,k) + (1-α)D(j,k), D(v,k) can be set to any value within the interval defined by D(i,k) and D(j,k).
- In particular, D(v,k) = ½(D(i,k) + D(j,k)) gives the WPGMA algorithm.
- If we use D(v,k) = min{D(i,k), D(j,k)}, we get the ‘closest’ ultrametric from below (the unique subdominant ultrametric).
Run-time analysis:
- Naïve implementation: Θ(n^3)
- By keeping a sorted version of each row in D: Θ(n^2 log(n))
- The third variant can be executed in Θ(n^2)
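A minimal sketch of the min-reduction variant. It records, for every pair of original taxa, the distance at which their clusters are joined; the resulting matrix is the ultrametric the slide describes as closest from below. This is the naive cubic-time loop, not the Θ(n^2) version, and the function name, matrix layout and example matrix are assumptions.

```python
from itertools import combinations

def subdominant_ultrametric(D):
    """Agglomerate with D(v,k) = min{D(i,k), D(j,k)} and return the induced ultrametric."""
    dist = {a: dict(row) for a, row in D.items()}  # working copy of D
    members = {a: [a] for a in D}                  # original taxa inside each cluster
    U = {a: {a: 0} for a in D}
    while len(dist) > 1:
        i, j = min(combinations(dist, 2), key=lambda p: dist[p[0]][p[1]])
        # Every pair split between the two clusters gets the joining distance.
        for x in members[i]:
            for y in members[j]:
                U[x][y] = U[y][x] = dist[i][j]
        v = (i, j)
        row = {k: min(dist[i][k], dist[j][k]) for k in dist if k != i and k != j}
        del dist[i], dist[j]
        for k in dist:
            dist[k].pop(i, None)
            dist[k].pop(j, None)
            dist[k][v] = row[k]
        dist[v] = row
        members[v] = members[i] + members[j]
    return U

# Made-up non-ultrametric input.
M = {"a": {"a": 0, "b": 2, "c": 5},
     "b": {"a": 2, "b": 0, "c": 6},
     "c": {"a": 5, "b": 6, "c": 0}}
S = subdominant_ultrametric(M)
print(S["a"]["c"], S["b"]["c"])
```

In this example D(b,c) = 6 is lowered to 5, while no entry is raised: the output satisfies the 3-point condition and lies entrywise below the input.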
. Consistent distance-based reconstruction: given an additive metric D, find the unique tree T such that D_T = D.
Reminder: a metric is additive iff it obeys the 4-point condition:
“Any subset of four taxa can be labelled i,j,k,l such that d(i,j) + d(k,l) ≤ d(i,l) + d(j,k) = d(i,k) + d(j,l)”
Next Time …
[Figure: nested classes of matrices: distance matrices ⊃ additive matrices ⊃ ultrametric matrices.]
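The 4-point condition can be checked by brute force over all quartets: of the three pairings of four taxa, the two largest sums must be equal. The function names, tolerance and example matrix are assumptions made for illustration; the triple (68, 78, 71) consists of the three sums quoted in the non-additive Bear/Raccoon/Weasel/Seal example later in this tutorial.

```python
from itertools import combinations

def four_point_holds(s1, s2, s3, tol=1e-9):
    """True iff the two largest of the three pairing sums are equal."""
    _, mid, largest = sorted((s1, s2, s3))
    return abs(largest - mid) <= tol

def is_additive(D, tol=1e-9):
    """4-point condition over every quartet of taxa (Θ(n^4) brute force)."""
    return all(
        four_point_holds(D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k], tol)
        for i, j, k, l in combinations(list(D), 4)
    )

# Made-up additive metric of the tree a-x:1, b-x:10, x-y:1, c-y:1, d-y:10.
A = {"a": {"a": 0, "b": 11, "c": 3, "d": 12},
     "b": {"a": 11, "b": 0, "c": 12, "d": 21},
     "c": {"a": 3, "b": 12, "c": 0, "d": 11},
     "d": {"a": 12, "b": 21, "c": 11, "d": 0}}

print(is_additive(A), four_point_holds(68, 78, 71))
```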
. Saitou & Nei’s Neighbor Joining
S&N algorithm (n is the current number of taxa, and r(i) = Σ_k D(i,k)):
1. Find a pair of taxa i,j maximizing Q(i,j) = r(i) + r(j) - (n-2)D(i,j).
2. Connect them to a new internal vertex v with edges of weights:
   w(i,v) = ½D(i,j) + (r(i) - r(j)) / (2(n-2)),  w(j,v) = D(i,j) - w(i,v)
3. Remove i,j from the taxon set and add v, with D(v,k) = ½(D(i,k) + D(j,k) - D(i,j)).
4. Return to (1).
When only 2 taxa are left, connect them with an edge of length D(i,j).
If D is additive (consistent with some tree T), then, as shown in class:
- Q(i,j) is maximized for neighbor pairs.
- If i,j are neighbors, then stages (2,3) are consistent.
Conclusion: in such a case, given D, NJ returns T.
[Figure: neighbors i,j joined at the new vertex v, with a third taxon k.]
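A naive Θ(n^3) sketch of the S&N loop. The edge weights use the standard Saitou-Nei formula w(i,v) = ½D(i,j) + (r(i) - r(j)) / (2(n-2)), which is an assumption here because the weights are not spelled out in the slide text. The dict-of-dicts layout, the `("v", id)` names for internal vertices and the example matrix `A` are also assumptions.

```python
from itertools import combinations

def neighbor_joining(D):
    """Return the edges (u, v, weight) of the tree built by the S&N loop (needs >= 2 taxa)."""
    dist = {a: dict(row) for a, row in D.items()}  # working copy of D
    edges = []
    next_id = 0
    while len(dist) > 2:
        n = len(dist)
        r = {a: sum(dist[a][b] for b in dist if b != a) for a in dist}
        # Stage 1: pair maximizing Q(i,j) = r(i) + r(j) - (n-2) D(i,j).
        i, j = max(combinations(dist, 2),
                   key=lambda p: r[p[0]] + r[p[1]] - (n - 2) * dist[p[0]][p[1]])
        # Stage 2: new vertex v and the two new edge weights (assumed standard formula).
        v = ("v", next_id)
        next_id += 1
        w_i = 0.5 * dist[i][j] + (r[i] - r[j]) / (2.0 * (n - 2))
        w_j = dist[i][j] - w_i
        edges += [(i, v, w_i), (j, v, w_j)]
        # Stage 3: D(v,k) = 1/2 (D(i,k) + D(j,k) - D(i,j)).
        row = {k: 0.5 * (dist[i][k] + dist[j][k] - dist[i][j])
               for k in dist if k != i and k != j}
        del dist[i], dist[j]
        for k in dist:
            dist[k].pop(i, None)
            dist[k].pop(j, None)
            dist[k][v] = row[k]
        dist[v] = row
    a, b = dist  # the last two taxa are connected directly
    edges.append((a, b, dist[a][b]))
    return edges

# Made-up additive metric of the tree a-x:1, b-x:10, x-y:1, c-y:1, d-y:10.
A = {"a": {"a": 0, "b": 11, "c": 3, "d": 12},
     "b": {"a": 11, "b": 0, "c": 12, "d": 21},
     "c": {"a": 3, "b": 12, "c": 0, "d": 11},
     "d": {"a": 12, "b": 21, "c": 11, "d": 0}}
nj_edges = neighbor_joining(A)
for edge in nj_edges:
    print(edge)
```

On this input the closest pair is a,c, but the Q criterion selects a true neighbor pair first, and the returned edge weights match the generating tree.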
. Saitou & Nei’s Neighbor Joining: Complexity Analysis
Run-time analysis:
- In each iteration we need to recalculate r(∙) for all taxa.
- Q(∙,∙) values are ‘scrambled’ in each iteration, so stage (1) takes O(n^2).
- Total complexity over the O(n) iterations: O(n^3).
- There is no known way to speed this up significantly.
Note: There are consistent reconstruction algorithms which run in O(n^2) or even O(n∙log(n)) time.
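. S&N’s NJ on Non-Additive Data
Example:
[Table: distance matrix D over Bear, Raccoon, Weasel, Seal, Dog.]
Checking the 4-point condition on Bear, Raccoon, Weasel, Seal:
- D(B,R) + D(W,S) = 68
- D(B,W) + D(R,S) = 78
- D(B,S) + D(R,W) = 71
The two largest sums (78 and 71) are not equal, so D is not additive.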
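. S&N’s NJ Example: 1st iteration
[Tables: D, r and Q over B, R, W, S, D. Recoverable entries: D(S,D) = 50, Q(S,D) = 198.]
[Figure: tree over Bear, Dog, Raccoon, Weasel, Seal with new internal vertex B-D; edge labels 6 and 26.]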
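. S&N’s NJ Example: 2nd iteration
[Tables: D, r and Q over B-D, R, W, S. Recoverable entry: Q(W,S) = 136.]
Calculate the difference from the old values to the new ones.
[Figure: tree over Bear, Dog, Raccoon, Weasel, Seal with internal vertices B-D (edge labels 6 and 26) and B-D-R.]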
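. S&N’s NJ Example: 3rd iteration
[Tables: D, r and Q over B-D-R, W, S. Recoverable entry: D(W,S) = 44.]
Reconstruct the unique tree over 3 taxa.
[Figure: tree over Bear, Dog, Raccoon, Weasel, Seal with internal vertices B-D (edge labels 6 and 26), B-D-R and W-S; one further edge label, 1.5.]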
. How Good Is The Tree?
We observe the perturbations from the input matrix to the one implied by the output tree.
[Figure: output tree over Bear, Dog, Raccoon, Weasel, Seal with internal vertices B-D (edge labels 6 and 26), B-D-R and W-S.]
[Tables: D, D_T and |D-D_T| over B, R, W, S, D. Recoverable entry: D(S,D) = 50.]
How good is this?
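. How Good Is The Tree?
Compare with other algorithms: NJ and UPGMA.
[Figure: NJ tree over Bear, Dog, Raccoon, Weasel, Seal (internal vertices B-D with edge labels 6 and 26, B-D-R, W-S) and UPGMA tree over Bear, Raccoon, Seal, Weasel, Dog (internal vertices BR, BRS, BRSW, BRSWD; one label, 13).]
[Tables: |D-D_T1| and |D-D_T2| over B, R, W, S, D.]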
. Can we do better?
Given a distance matrix D, find an edge-weighted tree T which minimizes ||D - D_T||_p.
- For p = 1,2,∞ this task was shown to be NP-hard.
- For p = 1,2 this task was shown to be NP-hard for ultrametric trees as well.
- For p = ∞:
  - this task is easy (O(n^2) algorithm) for ultrametric trees;
  - there is a 3-approximation algorithm for general trees.
No algorithm gives any good guarantees for non-additive data.
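The objective ||D - D_T||_p can be evaluated for a candidate tree metric with a few lines. This sketch assumes the norm is taken over unordered pairs of distinct taxa, a convention the slides do not state. The function name and both example matrices are made up.

```python
import math
from itertools import combinations

def lp_distance(D, D_T, p):
    """L_p distance between two matrices over the same taxa (assumed: pairs i < j only)."""
    diffs = [abs(D[a][b] - D_T[a][b]) for a, b in combinations(list(D), 2)]
    if math.isinf(p):
        return max(diffs, default=0.0)       # p = infinity: largest perturbation
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Made-up input matrix and tree metric differing by 1, 0 and 2 on the three pairs.
D = {"a": {"a": 0, "b": 3, "c": 4},
     "b": {"a": 3, "b": 0, "c": 5},
     "c": {"a": 4, "b": 5, "c": 0}}
D_T = {"a": {"a": 0, "b": 2, "c": 4},
       "b": {"a": 2, "b": 0, "c": 7},
       "c": {"a": 4, "b": 7, "c": 0}}

for p in (1, 2, math.inf):
    print(p, lp_distance(D, D_T, p))
```

Evaluating the objective for a given tree is cheap; the hardness results above concern finding the tree that minimizes it.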