PLGW01 - September
Inferring Phylogenies from LCA Distances
(back to the basics of distance-based phylogenetic reconstruction)
  Ilan Gronau, Shlomo Moran
  Technion, Israel

Distance-Based Phylogenetic Reconstruction
  - Compute distances between all taxon pairs
  - Find an edge-weighted tree that best describes the distances

Distance-Based Reconstruction
  Basics (sanity check):
  - Consistency: a reconstruction algorithm should recover the true tree from accurate (i.e., additive) distances.
  Essential extras:
  - Robustness to noise: reconstruct the correct tree (or parts of it) from noisy distances.
  - Efficiency: low time and space complexity.

Neighbor Joining Methods
  An agglomerative clustering approach:
  - A taxon pair i, j is chosen and replaced by a new taxon v
  - i and j are connected to v (i.e., they form a cherry in the reconstructed tree)
  - The method is applied recursively to the reduced matrix

The Two Basic Components of NJ Methods
  At each iteration the algorithm performs:
  1. Selection: select neighboring taxa
     Consistency: if the input is additive, the selected taxa form a cherry in the corresponding tree
  2. Reduction: compute distances from the new taxon
     Consistency: the reduced matrix should fit the reduced tree (this can usually be achieved in more than one way)

Saitou & Nei's NJ Algorithm (1987)
  - Robustness: considered highly reliable in practice
  - Time complexity: Θ(n^3)
  - ~13,000 citations (Science Citation Index)
  - Implemented in numerous phylogenetic packages
  [Formula image: Saitou & Nei's selection criterion]
  Questions:
  - What makes Saitou & Nei's neighbor-selection criterion so good?
  - Is there any simpler consistent neighbor-selection criterion?

Simple Selection Criterion: LCA Distances
  - In a rooted tree, LCA(i,j) is the distance between the root and the least common ancestor of i and j
  - The taxon pair with the deepest LCA are neighbors
  - So is any pair i, j with a "locally deepest" LCA. For neighbors i, j with parent v: [Figure: rooted trees with root r, taxa i, j, k, and parent node v]
  - Consistent (and complete) neighbor-selection criterion

Deepest LCA Neighbor Joining Algorithm: Phase I
  Calculate LCA-distances:
  - Choose a root taxon r
  - Calculate LCA-distances from r using the Farris transform:
    L(i,j) = ½ ( D(r,i) + D(r,j) - D(i,j) )
  [Figure: tree with root taxon r and taxa i, j]
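
The Farris transform of Phase I can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `farris_transform`, the list-of-lists matrix layout, and the four-taxon example matrix are assumptions made here.

```python
def farris_transform(D, r):
    """Return the LCA-distance matrix L for root taxon r.

    D is a symmetric n x n distance matrix (list of lists).
    L[i][j] = (D[r][i] + D[r][j] - D[i][j]) / 2, as in Phase I.
    """
    n = len(D)
    return [[0.5 * (D[r][i] + D[r][j] - D[i][j]) for j in range(n)]
            for i in range(n)]


# Hypothetical additive distances for the quartet tree ((0,1),(2,3)):
# every leaf edge has weight 1 and the internal edge has weight 2.
D = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
L = farris_transform(D, r=0)
```

With taxon 0 as root, the pair (2, 3) gets the largest off-diagonal entry of L, matching the fact that 2 and 3 are neighbors in this tree. The whole matrix is filled in Θ(n^2) time.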

Deepest LCA Neighbor Joining Algorithm: Phase II
  n-1 neighbor-joining iterations. At each iteration:
  - Selection: choose a taxon pair i, j such that L(i,j) = max_{i'≠j'} L(i',j'), and connect i, j to a new taxon v
  - Reduction: replace i, j with the new taxon v and reduce L:
    for k ≠ v, L(v,k) = α L(i,k) + (1-α) L(j,k)
    (α is the reduction parameter; it may be redefined at each iteration)
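
The selection and reduction steps can be put together in a deliberately naive sketch (cubic time, globally deepest pair). The function name `dlca_join_order`, the dictionary representation of L, the node-labelling scheme, and the convention of keeping the root taxon out of the matrix and attaching the last cluster to it afterwards are all assumptions of this illustration; the authors' bookkeeping may differ.

```python
def dlca_join_order(D, r, alpha=0.5):
    """Naive sketch of the two DLCA phases; returns the list of joins.

    Each join is (i, j, v, depth): i and j are connected to a new node v
    whose LCA-distance from the root taxon r is `depth`.
    Labels 0..n-1 are input taxa; new nodes are labelled n, n+1, ...
    """
    n = len(D)
    taxa = [t for t in range(n) if t != r]

    # Phase I: Farris transform with root taxon r (stored symmetrically).
    L = {}
    for a in taxa:
        for b in taxa:
            if a != b:
                L[a, b] = 0.5 * (D[r][a] + D[r][b] - D[a][b])

    # Phase II: repeatedly join the globally deepest pair.
    active = list(taxa)
    joins = []
    next_label = n
    while len(active) > 1:
        pairs = ((a, b) for a in active for b in active if a < b)
        i, j = max(pairs, key=lambda p: L[p])
        v = next_label
        next_label += 1
        joins.append((i, j, v, L[i, j]))
        active = [k for k in active if k not in (i, j)]
        # Reduction: L(v,k) = alpha * L(i,k) + (1 - alpha) * L(j,k).
        for k in active:
            L[v, k] = L[k, v] = alpha * L[i, k] + (1 - alpha) * L[j, k]
        active.append(v)
    return joins  # the last remaining cluster hangs off the root taxon r


# Hypothetical additive input: quartet tree ((0,1),(2,3)), leaf edges 1,
# internal edge 2.
D = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
joins = dlca_join_order(D, r=0)
```

On this input the first join is the cherry (2, 3) at depth 3, and the second joins taxon 1 with the new node at depth 1.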

Simple and Optimal Θ(n^2) Implementation of DLCA
  - Calculating LCA-distances (the matrix L): Θ(n^2) time
  - Neighbor-joining phase, n-1 iterations:
    - The reduction step takes O(n) time per iteration
    - The bottleneck is neighbor selection
  - An amortized Θ(n^2) implementation of neighbor selection: join a "locally deepest" pair, not necessarily the "globally deepest" pair, using the "Nearest Neighbor Chain" clustering technique [Benzecri 82, Juan 82, Murtagh 84, +]
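
The chain idea can be illustrated as follows: start from any taxon, keep moving to the current taxon's deepest partner, and stop when two taxa choose each other. The sketch below shows only this search for one locally deepest pair. The name `find_locally_deepest_pair` and the dictionary layout of L are assumptions, and the bookkeeping that reuses the chain across iterations (which is what yields the amortized Θ(n^2) bound) is omitted.

```python
def find_locally_deepest_pair(L, active, start):
    """Follow a chain of deepest partners until two taxa pick each other.

    L maps ordered pairs (a, b) to LCA-distances and is symmetric;
    `active` holds the current taxa (at least two); `start` is any of them.
    Returns a pair (i, j) such that L[i, j] >= L[i, k] and
    L[i, j] >= L[j, k] for every other active taxon k.
    """
    chain = [start]
    while True:
        a = chain[-1]
        prev = chain[-2] if len(chain) > 1 else None
        # Deepest partner of a. Ties are resolved in favour of the previous
        # chain element, so depths along the chain strictly increase and
        # the walk must terminate.
        best = prev
        for b in active:
            if b == a:
                continue
            if best is None or L[a, b] > L[a, best]:
                best = b
        if best == prev:
            return prev, a
        chain.append(best)


# Hypothetical LCA-distances for three non-root taxa (symmetric).
L = {}
for (a, b), depth in {(1, 2): 1.0, (1, 3): 1.0, (2, 3): 3.0}.items():
    L[a, b] = L[b, a] = depth

pair = find_locally_deepest_pair(L, active=[1, 2, 3], start=1)
```

Starting from taxon 1, the chain visits 1, 2, 3 and stops because 3's deepest partner is 2 again, so the pair (2, 3) is returned.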

DLCA: Intermediate Summary
  - A simple and intuitive consistent neighbor-selection criterion
  - Implemented in optimal time complexity (faster than NJ)
  What about the noise?!
  Robustness to noise: we consider two theoretical criteria for robustness:
  - Reconstruction of "Buneman edges"
  - Atteson's "edge-reconstruction radius"

Buneman's Edges [Buneman '71]
  - An edge e induces a split (P|Q) of the taxon set
  - e is a "Buneman edge" (w.r.t. the distance matrix D) iff every taxon quartet (i,j,k,l) that "crosses" e (i.e., i,j ∈ P and k,l ∈ Q) agrees with e's split:
    D(i,j) + D(k,l) < min{ D(i,k) + D(j,l), D(i,l) + D(j,k) }
  - "Buneman robustness criterion": the algorithm should reconstruct all the Buneman edges.
  [Figure: edge e separating taxa i, j in P from taxa k, l in Q]
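
The quartet condition translates directly into a brute-force check. The sketch below is only an illustration of the definition; the function name `is_buneman_edge` and the example matrix are assumptions. A split with fewer than two taxa on one side has no crossing quartets, so the check returns True for it vacuously.

```python
from itertools import combinations


def is_buneman_edge(D, P, Q):
    """True iff every quartet with i, j in P and k, l in Q satisfies
    D(i,j) + D(k,l) < min(D(i,k) + D(j,l), D(i,l) + D(j,k))."""
    for i, j in combinations(P, 2):
        for k, l in combinations(Q, 2):
            within = D[i][j] + D[k][l]
            across = min(D[i][k] + D[j][l], D[i][l] + D[j][k])
            if not within < across:
                return False
    return True


# Hypothetical additive distances for the quartet tree ((0,1),(2,3)).
D = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
true_split = is_buneman_edge(D, [0, 1], [2, 3])
wrong_split = is_buneman_edge(D, [0, 2], [1, 3])
```

The split {0,1}|{2,3} of the true tree passes the test, while the conflicting split {0,2}|{1,3} fails it.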

Atteson's Edge-Reconstruction Radius [Atteson '99]
  - Edge-reconstruction radius: an algorithm A has an edge-reconstruction radius of ε if, for each edge e of the tree T:
    if ||D - D_T||∞ < ε∙w(e), then A correctly reconstructs e.
  - Atteson: edge-reconstruction radius ≤ ½
  - If A satisfies Buneman's criterion, then A has the optimal edge-reconstruction radius of ½
  [Figure: edge e of weight w(e); noise ≤ ε w(e) for all distances]
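
One way to see why satisfying Buneman's criterion gives a radius of ½ is the short calculation below. It is a sketch under the stated assumptions, not a quotation from the talk: D_T is taken to be the additive metric of T, and (i,j,k,l) is any quartet crossing e with i,j on one side and k,l on the other.

```latex
% Assumption: D_T is additive on T, and e lies on the path separating
% {i,j} from {k,l}, so that path has length at least w(e).
\begin{align*}
D_T(i,k) + D_T(j,l) &\ge D_T(i,j) + D_T(k,l) + 2\,w(e), \\
\lVert D - D_T \rVert_\infty < \varepsilon\, w(e)
  &\;\Longrightarrow\;
  D(i,k) + D(j,l) - D(i,j) - D(k,l) > (2 - 4\varepsilon)\, w(e).
\end{align*}
```

The same bound holds for D(i,l) + D(j,k). For ε = ½ the right-hand side is 0, so every crossing quartet agrees with e's split under D, e is a Buneman edge of D, and an algorithm that reconstructs all Buneman edges reconstructs e.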

Robustness of NJ and DLCA
  NJ:
  - Edge-reconstruction radius = ¼ [Atteson '99, Mihaescu et al. '06] (hence it does not satisfy the Buneman criterion)
  DLCA (using "conservative reductions"):
  - Satisfies the Buneman criterion
  - Hence it has edge-reconstruction radius = ½
  By these criteria, DLCA is also more robust than NJ.
  And in practice...???

Testing on Simulated Data
  [Figure: pipeline T → sequences (ATTCG…, ATACG…, ACTGG…, …) → DNAdist from PHYLIP → D → DLCA / NJ → T']
  - Compare the topologies of T and T' through RF-distance
  - Note that DLCA may produce n different trees, one for each root taxon.
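
The topology comparison can be illustrated with a small sketch of the RF (Robinson-Foulds) distance, computed here as the number of non-trivial splits found in exactly one of the two trees. The nested-tuple tree encoding and the names `splits` and `rf_distance` are assumptions of this example; it is not the tool used in the experiments, and some definitions halve or normalize this count.

```python
def splits(tree, all_taxa):
    """Collect the non-trivial splits of a tree written as nested tuples."""
    all_taxa = frozenset(all_taxa)
    found = set()

    def leaves_below(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a leaf
        below = frozenset().union(*(leaves_below(child) for child in node))
        other = all_taxa - below
        if len(below) > 1 and len(other) > 1:
            # Store the split without orientation so both sides compare equal.
            found.add(frozenset([below, other]))
        return below

    leaves_below(tree)
    return found


def rf_distance(t1, t2, taxa):
    """Size of the symmetric difference between the two split sets."""
    return len(splits(t1, taxa) ^ splits(t2, taxa))


# Two hypothetical five-taxon topologies that share only the split de|abc.
t1 = (("a", "b"), ("c", ("d", "e")))
t2 = (("a", "c"), ("b", ("d", "e")))
distance = rf_distance(t1, t2, "abcde")
```

Here t1 contributes the split ab|cde and t2 the split ac|bde, so the distance is 2. Since DLCA can return a different tree for each root taxon, such a comparison would be repeated per root.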

DLCA vs. Saitou & Nei's NJ
  [Figure: comparison plots; panel labels L(i,k) ← max{L(i,k), L(j,k)} and L(i,k) ← ½(L(i,k) + L(j,k))]
  trees - 1 simulation per tree
  Tree source: The Methods and Algorithms in Bioinformatics (MAB) lab, LIRMM.

Robustness of DLCA: a Summary
  - DLCA is superior to NJ by the Buneman and Atteson criteria, but is (on average) inferior to NJ on simulated data.
  - What is the reason for this "conflict"?
  - Take another look at Saitou & Nei's selection criterion

Saitou & Nei's Selection Criterion... expressed by LCA distances
  [Formula image: the selection criterion written in terms of LCA distances]
  i.e., NJ tends to select taxon pairs whose LCA is deepest on average
  - Averaging "smooths" noise
  - Averaging does not affect worst-case noise (the bound of ¼ on the reconstruction radius of NJ uses a highly improbable scenario)
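
The "average deepest LCA" reading can be checked with a short algebraic identity. The formula shown in the talk is not preserved here, so the rewriting below is an independent sketch and not necessarily the authors' form. It assumes the usual Q-form of the NJ criterion (NJ joins the pair minimizing Q) and writes L_k(i,j) for the LCA-distance of i, j when taxon k is taken as root.

```latex
% Assumptions: Q(i,j) is the standard Q-form of the NJ selection criterion;
% L_k(i,j) is the Farris transform of D with root taxon k.
\begin{align*}
L_k(i,j) &= \tfrac12\bigl(D(k,i) + D(k,j) - D(i,j)\bigr), \\
\sum_{k \ne i,j} L_k(i,j)
  &= \tfrac12\Bigl(\sum_k D(i,k) + \sum_k D(j,k) - n\,D(i,j)\Bigr), \\
Q(i,j) &= (n-2)\,D(i,j) - \sum_k D(i,k) - \sum_k D(j,k)
        = -2\Bigl(\sum_{k \ne i,j} L_k(i,j) + D(i,j)\Bigr).
\end{align*}
```

Minimizing Q is therefore the same as maximizing the sum of the LCA-distances of i, j over all n-2 possible roots plus the single term D(i,j), which fits the hedged wording "tends to".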

Future Directions
  - Use the pivotal nature of DLCA to achieve better results:
    - Pre-processing: use "good" taxa as roots
    - Post-processing: return the "best" tree among the n possible outputs
  - Find robustness criteria that explain the robustness of NJ: instead of considering worst-case noise (as in Atteson's criterion), consider stochastic noise.

For more information...
  - "Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances" (JCB 14(1), pp. 1-15, 2007)
  - "Optimal Implementations of UPGMA and Other Common Clustering Algorithms" (to appear in IPL)
  Our websites:

Thank You