Plgw03, 17/12/07 1 On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities Ilan Gronau Shlomo Moran Technion – Israel Institute of Technology.

1 On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities
Ilan Gronau, Shlomo Moran
Technion – Israel Institute of Technology, Haifa, Israel

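2 Pairwise-Distance Based Reconstruction
Input sequences: Butterfly (B) …AAGT…, Eagle (E) …CAGA…, Gorilla (G) …CCGT…, Human (H) …AACG…, Lion (L) …AATA…, Mouse (M) …CGCG…
Calculate: compute the pairwise dissimilarity matrix $D$ over the taxa B, E, G, H, L, M.
Reconstruct: build an edge-weighted tree $T$ with leaves B, E, G, H, L, M; $T$ induces the tree metric $D_T$.
[Figure: the matrix $D$, the reconstructed tree $T$ with its edge weights, and the matrix $D_T$.]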

3 Optimization Criteria
We want the tree metric $D_T$ to approximate all the pairwise distances in $D$ simultaneously: $D_T$ should be "close" to $D$.
Two "closeness" measures are studied here:
Maximal Difference ($\ell_\infty$)
Maximal Distortion
[Figure: the matrices $D$ and $D_T$ over B, E, G, H, L, M.]

4 Maximal Difference ($\ell_\infty$) vs. Maximal Distortion
Goal: find an optimal tree $T$, one that minimizes the maximal difference (or the maximal distortion) between $D$ and $D_T$.
[Figure: the matrices $D$ and $D_T$ over B, E, G, H, L, M.]
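As a rough illustration of the two criteria, the sketch below computes the maximal difference and a maximal distortion for two small tables. The transcript does not preserve the slide's formulas, so the distortion used here (largest ratio $D_T/D$ times largest ratio $D/D_T$, with 0/0 treated as 1) is an assumption, and the function names and toy values are hypothetical.

```python
def max_difference(D, DT):
    """l-infinity distance: largest absolute entrywise gap between D and DT."""
    return max(abs(D[pair] - DT[pair]) for pair in D)


def _ratio(a, b):
    """a / b with the convention 0/0 = 1; a nonzero value over 0 is infinite."""
    if a == 0 and b == 0:
        return 1.0
    if b == 0:
        return float("inf")
    return a / b


def max_distortion(D, DT):
    """ASSUMED definition: (max DT/D) * (max D/DT) over all taxon pairs."""
    expansion = max(_ratio(DT[pair], D[pair]) for pair in D)
    contraction = max(_ratio(D[pair], DT[pair]) for pair in D)
    return expansion * contraction


# Hypothetical toy input: dissimilarities D and a candidate tree metric DT.
D = {("B", "E"): 4.0, ("B", "G"): 6.0, ("E", "G"): 5.0}
DT = {("B", "E"): 5.0, ("B", "G"): 6.0, ("E", "G"): 4.0}

if __name__ == "__main__":
    print(max_difference(D, DT))   # largest absolute gap
    print(max_distortion(D, DT))   # expansion times contraction
```

Under this assumed definition a perfect fit has difference 0 and distortion 1, and rescaling $D_T$ by a constant changes the difference but not the distortion.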

5 Previous Work on Approximating Dissimilarities by Tree Distances
Negative results (NP-hardness):
Closest tree metric (even ultrametric) to a dissimilarity matrix under $\ell_1$ and $\ell_2$ [Day '87].
Closest tree metric to a dissimilarity matrix under $\ell_\infty$ [ABFPT99]. It is hard to approximate within a factor better than 1.125. Implicit: the closest MaxDist tree is hard to approximate within any constant factor.
Positive results:
Closest ultrametric to a dissimilarity matrix under $\ell_\infty$ [Krivanek '88].
3-approximation of the closest additive metric to a given metric [ABFPT99] (implicitly, a 6-approximation for general dissimilarity matrices).

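6 This Work: Triplet-Distances, the Distances to Triplet Midpoints
For taxa $i, j, k$ in a tree $T$, let $C(i,j,k)$ be the midpoint of the triplet, the point where the three paths between them meet. $\tau_T(i; jk)$ is the distance in $T$ from $i$ to $C(i,j,k)$.
$\tau_T(i; jk) = \tau_T(i; kj)$
$\tau_T(i; ij) = 0$
$\tau_T(i; jj) = D_T(i, j)$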

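7 Triplet-Distances Defined by 2-Distances
Each distance matrix $D$ defines 3-trees: $\tau(i; jk) = \tfrac{1}{2}\,[D(i,j) + D(i,k) - D(j,k)]$.
Any metric on 3 taxa is realizable by a 3-tree. Example from the slide: pairwise distances 7, 8, 9 among $i, j, k$ are realized by a 3-tree whose three edges, meeting at $C(i,j,k)$, have lengths 3, 4, 5.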
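The formula on this slide can be checked against the slide's own numbers. The sketch below assumes the distances 7, 8, 9 are assigned as D(i,j) = 7, D(i,k) = 8, D(j,k) = 9 (the transcript does not say which pair gets which value); the helper names are hypothetical.

```python
def symmetric(pairs):
    """Build a full symmetric table with a zero diagonal from a dict of pairs."""
    taxa = sorted({t for pair in pairs for t in pair})
    D = {a: {b: 0.0 for b in taxa} for a in taxa}
    for (a, b), d in pairs.items():
        D[a][b] = D[b][a] = float(d)
    return D


def triplet_distance(D, i, j, k):
    """tau(i; jk) = 1/2 [D(i,j) + D(i,k) - D(j,k)]."""
    return 0.5 * (D[i][j] + D[i][k] - D[j][k])


# ASSUMED assignment of the slide's three distances to the three pairs.
D = symmetric({("i", "j"): 7, ("i", "k"): 8, ("j", "k"): 9})

if __name__ == "__main__":
    # Edge lengths of the 3-tree: the distance from each taxon to the midpoint.
    for a, b, c in (("i", "j", "k"), ("j", "i", "k"), ("k", "i", "j")):
        print(f"tau({a}; {b}{c}) =", triplet_distance(D, a, b, c))
```

The same function also reproduces the degenerate cases: with a repeated taxon, `triplet_distance(D, "i", "i", "j")` is 0 and `triplet_distance(D, "i", "j", "j")` is D(i,j).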

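8 Triplet-Distance Based Reconstruction
Input sequences: B …AAGT…, E …CAGA…, G …CCGT…, H …AACG…, L …AATA…, M …CGCG…
Calculate: a table $\tau$ of triplet dissimilarities, with one row per taxon (B, E, G, H, L, M) and one column per taxon pair (BB, BE, BG, …, LL, LM, MM).
Reconstruct: an edge-weighted tree $T$ whose triplet distances $\tau_T$, tabulated the same way, approximate $\tau$.
$\tau(i; jk) = \tfrac{1}{2}\,[D(i,j) + D(i,k) - D(j,k)]$.
[Figure: the table $\tau$, the reconstructed tree $T$ with its edge weights, and the table $\tau_T$.]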

9 Why Use Triplet-Distances?
1. They enable more accurate estimates of 2-distances.
2. They are used (de facto) by known reconstruction algorithms.

10 Improved Estimations of Pairwise Distances
Input sequences: Butterfly …AAGT…, Eagle …CAGA…, Gorilla …CCGT…, Human …AACG…, Lion …AATA…, Mouse …CGCG…
Calculating $D(H,E)$ by maximum likelihood uses only the two sequences Human …AACG… and Eagle …CAGA….
"Information loss": in calculating $D(H,E)$, all other taxa are ignored.
[Figure: the matrix $D$ over B, E, G, H, L, M, and the 2-tree on H and E (label: 13).]

11 Improved Estimations (cont.)
Estimate $D(H,E)$ by calculating all the 3-trees on $\{H, E, X\}$ for $X \ne H, E$.
(Or: calculate just one 3-tree, for a "trusted" third taxon $X$: V. Ranwez, O. Gascuel, Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets, Mol. Biol. Evol. 19(11), pp. 1952–1963 (2002).)
[Figure: four 3-trees, each joining H = (..AACG..) and E = (..CAGA..) to a third taxon through an internal node (..****..). The number pairs shown are 3, 2 for B = (..AAGT..); 3, 3 for M = (..CGCG..); 1, 5 for G = (..CCGT..); 2, 4 for L = (..AATA..).]

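12 (Implicit) Use of Triplet-Distances in 2-Distance Reconstruction Algorithms
A 2-distance matrix $D$ over B, E, G, H, L, M implicitly defines a triplet table (rows B, E, G, H, L, M; columns BB, BE, BG, …, LL, LM, MM) through $\tau(i; jk) = \tfrac{1}{2}\,[D(i,j) + D(i,k) - D(j,k)]$, and the tree $T$ is reconstructed from that table.
[Figure: the matrix $D$, the induced triplet table, and the reconstructed tree $T$ with its edge weights.]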

13 1st use: "Triplet Distances from a Single Source"
Fix a taxon $r$, and construct a tree $T$ which minimizes: [objective over the triplets $(r; ij)$; the formula is not preserved in the transcript].
An optimal solution can be computed in $O(n^2)$ time, and is used e.g. in:
(FKW95): optimal approximation of distances by ultrametric trees.
(ABFPT99): the best known approximation of distances by general trees.
(BB99): fast construction of Buneman trees.
[Figure: the taxa $i$, $j$ and the source $r$.]
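For a fixed source $r$, the triplet distance $\tau_T(r; ij)$ in a tree equals the depth, measured from $r$, of the point where the paths to $i$ and $j$ split (the least common ancestor when $T$ is rooted at $r$). The sketch below checks this on a made-up rooted tree; the tree, its weights and all function names are hypothetical and are not taken from the cited papers.

```python
# Hypothetical toy tree rooted at the source taxon r: child -> (parent, edge weight).
PARENT = {
    "u": ("r", 2.0),
    "v": ("u", 1.0),
    "a": ("v", 3.0),
    "b": ("v", 4.0),
    "c": ("u", 5.0),
}


def depth(x):
    """Weighted distance from the root r down to vertex x."""
    d = 0.0
    while x in PARENT:
        x, w = PARENT[x]
        d += w
    return d


def ancestors(x):
    """x followed by all of its ancestors, ending at the root."""
    chain = [x]
    while x in PARENT:
        x = PARENT[x][0]
        chain.append(x)
    return chain


def lca(x, y):
    """Deepest vertex lying on both root paths."""
    seen = set(ancestors(x))
    for v in ancestors(y):
        if v in seen:
            return v
    raise ValueError("vertices are not in the same tree")


def tree_distance(x, y):
    """Path length between x and y in the tree."""
    return depth(x) + depth(y) - 2.0 * depth(lca(x, y))


def tau_from_root(i, j):
    """tau_T(r; ij) = 1/2 [D_T(r,i) + D_T(r,j) - D_T(i,j)]."""
    return 0.5 * (tree_distance("r", i) + tree_distance("r", j) - tree_distance(i, j))


if __name__ == "__main__":
    for i, j in (("a", "b"), ("a", "c"), ("b", "c")):
        print(i, j, tau_from_root(i, j), depth(lca(i, j)))
```

Each printed row shows the same value twice, once from the distance formula and once as an LCA depth.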

14 2nd use: Saitou & Nei Neighbor Joining
The neighbor-selection criterion of NJ selects a taxon pair $i, j$ which maximizes the sum: [the formula is not preserved in the transcript].
[Figure: the pair $i$, $j$ together with the remaining taxa, each labeled $r$.]

15 Previous Works on Triplet-Dissimilarities/Distances
I. Gronau, S. Moran, Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances, Journal of Computational Biology 14(1), pp. 1-15 (2007).
Works which use the total weights of 3-trees:
S. Joly, G. Le Calvé, Three-Way Distances, Journal of Classification 12, pp. 191-205 (1995).
L. Pachter, D. Speyer, Reconstructing Trees from Subtree Weights, Applied Mathematics Letters 17, pp. 615-621 (2004).
D. Levy, R. Yoshida, L. Pachter, Beyond pairwise distances: Neighbor-joining with phylogenetic diversity estimates, Mol. Biol. Evol. 23(3), pp. 491–498 (2006).

16 Summary of Results
Results for Maximal Difference ($\ell_\infty$):
1. The decision problem is NP-hard: is there a tree $T$ s.t. $\|\tau, \tau_T\|_\infty \le \Delta$?
2. Hardness of approximation of the optimization problem: finding a tree $T$ s.t. $\|\tau, \tau_T\|_\infty \le 1.4\,\|\tau, \tau_{OPT}\|_\infty$.
3. A 15-approximation algorithm, using the 6-approximation algorithm for 2-dissimilarities from [ABFPT99].
Result for Maximal Distortion: hardness of approximation within any constant factor.

17 NP-Hardness of the Decision Problem
We use a reduction from 3SAT (the problem of determining whether a 3CNF formula is satisfiable).
[Figure: an example 3CNF formula with its clauses and literals marked, and a satisfying assignment.]
We show: if one can determine for $(\tau, \Delta)$ whether there exists a tree $T$ s.t. $\|\tau, \tau_T\|_\infty \le \Delta$, then one can determine for every 3CNF formula $\varphi$ whether it is satisfiable.

18 The Reduction
Given a 3CNF formula $\varphi$, we define triplet distances $\tau$ and an error bound $\Delta$ which force the output tree to imply a satisfying assignment to $\varphi$.
The set of taxa:
Taxa T, F.
A taxon for every literal.
3 taxa for every clause $C_j$: $y_j^1, y_j^2, y_j^3$.
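The taxa set can be sketched directly from a formula. The encoding below is an assumption made for illustration: clauses are triples of nonzero integers (i for $x_i$, -i for its negation), and the taxon names are hypothetical labels, not the slide's notation.

```python
def build_taxa(num_vars, clauses):
    """Taxa for the reduction: T, F, one per literal, three per clause."""
    taxa = ["T", "F"]
    for i in range(1, num_vars + 1):
        # One taxon for the positive literal and one for the negative literal.
        taxa += [f"x{i}", f"~x{i}"]
    for j in range(1, len(clauses) + 1):
        # Three auxiliary taxa y_j^1, y_j^2, y_j^3 for clause C_j.
        taxa += [f"y{j}_1", f"y{j}_2", f"y{j}_3"]
    return taxa


# Hypothetical formula: (x1 or not x2 or x3) and (not x1 or x2 or x3).
CLAUSES = [(1, -2, 3), (-1, 2, 3)]

if __name__ == "__main__":
    taxa = build_taxa(3, CLAUSES)
    print(len(taxa), taxa)
```

For n variables and m clauses this gives 2 + 2n + 3m taxa, so the instance stays polynomial in the size of the formula.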

19 Properties Enforced by the Input $(\tau, \Delta)$
One of the following can be enforced on each taxa triplet $(u, v, w)$:
1. taxon $u$ is close to Path$(v, w)$, or
2. taxon $u$ is far from Path$(v, w)$.
[Figure: taxon $u$ and the path between $v$ and $w$.]

20 Enforcing a Truth Assignment
A truth assignment to $\varphi$ is implied by the following:
1. T is far from F.
2. For each $i$, the two literal taxa for $x_i$ (their symbols are not preserved in the transcript) are far from each other, and both are close to Path(T, F).
Thus we set $x_i$ = T iff $x_i$ is close to T.
[Figure: the taxa T and F.]

21 Enforcing Clause Satisfaction
A clause $C = (l_1 \lor l_2 \lor l_3)$ is satisfied iff at least one literal $l_i$ is true, i.e. is close to T.
[Figure: all three literals $l_1, l_2, l_3$ placed next to F.] $(l_1 \lor l_2 \lor l_3)$ is satisfied iff the tree does not look like this.
We need to guarantee, through the close/far relations, that all clauses avoid this configuration.

22 Clause Satisfaction (cont.)
$(l_1 \lor l_2 \lor l_3)$ is satisfied iff, out of the three paths Path$(l_1, l_2)$, Path$(l_1, l_3)$, Path$(l_2, l_3)$, at least two are close to T.
But we don't know which two paths.
[Figure: T, F and the literals $l_1, l_2, l_3$.]
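The "at least two paths" rule can be checked by brute force over the eight truth assignments of a clause. The sketch assumes a simplified picture in which a true literal sits next to T and a false one next to F, so that a path between two literals passes near T exactly when at least one endpoint is true; the function names are hypothetical.

```python
from itertools import combinations, product


def close_paths(values):
    """Count the literal pairs whose connecting path passes near T."""
    # ASSUMPTION: Path(a, b) is close to T iff at least one of a, b is true.
    return sum(1 for a, b in combinations(values, 2) if a or b)


def check_clause_rule():
    """Map each assignment of (l1, l2, l3) to its number of close paths."""
    return {values: close_paths(values)
            for values in product([False, True], repeat=3)}


if __name__ == "__main__":
    for values, count in check_clause_rule().items():
        # The last two columns agree on every row.
        print(values, count, any(values), count >= 2)
```

A single true literal already lies on two of the three paths, which is why the threshold is two rather than one.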

23 Clause Satisfaction (cont.)
We attach a taxon to each such path:
$y_1$ is close to Path$(l_2, l_3)$
$y_2$ is close to Path$(l_1, l_3)$
$y_3$ is close to Path$(l_1, l_2)$
$(l_1 \lor l_2 \lor l_3)$ is satisfied iff at least two $y_i$'s can be located close to T…
[Figure: T, F, the literals $l_1, l_2, l_3$ and the attached taxa $y_1, y_2, y_3$.]

24 Clause Satisfaction (end)
… and at least two of the $y_i$'s can be located close to T iff Path$(y_2, y_3)$, Path$(y_1, y_3)$ and Path$(y_1, y_2)$ are all close to T.
So $(l_1 \lor l_2 \lor l_3)$ is satisfied iff all the above paths are close to T.
[Figure: T, F, the literals $l_1, l_2, l_3$ and the attached taxa $y_1, y_2, y_3$.]

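25 Construction Example
$\varphi$ is satisfiable $\Rightarrow$ there is a tree $T$ which satisfies all bounds:
A1: $\tau_T(\mathrm{T}, \mathrm{F}) \ge 2\alpha + 2\beta$
A2: for $i = 1..n$: $\tau_T(\mathrm{T}; \cdot) \le \alpha$; $\tau_T(\mathrm{F}; \cdot) \le \alpha$ (the second arguments are not preserved in the transcript)
B1: for $j = 1..m$: $\tau_T(y_j^1; l_j^2 l_j^3) \le \alpha$; $\tau_T(y_j^2; l_j^1 l_j^3) \le \alpha$; $\tau_T(y_j^3; l_j^1 l_j^2) \le \alpha$
B2: for $j = 1..m$: $\tau_T(y_j^1; \mathrm{T}\mathrm{F}) \ge \alpha$; $\tau_T(y_j^2; \mathrm{T}\mathrm{F}) \ge \alpha$; $\tau_T(y_j^3; \mathrm{T}\mathrm{F}) \ge \alpha$
B3: for $j = 1..m$: $\tau_T(\mathrm{T}; y_j^2 y_j^3) \le \alpha$; $\tau_T(\mathrm{T}; y_j^1 y_j^3) \le \alpha$; $\tau_T(\mathrm{T}; y_j^1 y_j^2) \le \alpha$
[Figure: example tree with internal nodes $v_T$ and $v_F$, the taxa T and F, and the clause taxa $y_1^1, y_1^2, y_1^3, y_2^1, y_2^2, y_2^3$; the edge weights shown are $\alpha$ and $2\beta$.]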

26 Hardness of Approximation Results
By "stretching" the close/far restrictions, the following problems are also shown NP-hard:
Approximating Maximal Difference: finding a tree $T$ s.t. $\|\tau, \tau_T\|_\infty \le 1.4\,\|\tau, \tau_{OPT}\|_\infty$.
Approximating Maximal Distortion: finding a tree $T$ s.t. $\mathrm{MaxDist}(\tau, \tau_T) \le C \cdot \mathrm{MaxDist}(\tau, \tau_{OPT})$, for any constant $C$.
Details in: I. Gronau and S. Moran, On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities, Theoretical Computer Science 389(1-2), December 2007, pp. 44-55.

27 Open Problems / Further Research
Extend the hardness results to 3-diss tables induced by 2-diss matrices ($\tau(i; jk) = \tfrac{1}{2}\,[D(i,j) + D(i,k) - D(j,k)]$).
Extend the hardness results to "naturally looking" trees (binary trees with constant-bounded edge weights).
Check the performance of NJ when the neighbor-selection formula is computed from "real" 3-distances.
Devise algorithms which use 3-distances as input.
Does optimization of 3-diss lead to good topological accuracy under accepted models of sequence evolution? (It is known that optimization of 2-diss doesn't lead to good topological accuracy.)

28 Thank You

29 Distance-Based Phylogenetic Reconstruction
Compute distances between all taxon pairs.
Find an edge-weighted tree that best describes the distances.
[Figure: an example edge-weighted tree.]

30 Optimization Criteria
Known measures of closeness: $\ell_\infty$, $\ell_p$ and MaxDist (where $0/0 \equiv 1$). [The defining formulas are not preserved in the transcript.]

31 The Reduction
A 3CNF formula $\varphi$ is mapped to a 3-diss table $\tau$ and a bound $\Delta$ such that: $\varphi$ is satisfiable $\Leftrightarrow$ there is a tree $T$ s.t. $\|\tau, \tau_T\|_\infty \le \Delta$.
If one can determine for $(\tau, \Delta)$ whether there exists a tree $T$ s.t. $\|\tau, \tau_T\|_\infty \le \Delta$, then one can determine for every 3CNF formula $\varphi$ whether it is satisfiable.

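32 The Reduction
Define a set of lower and upper bounds:
A1: $\tau_T(\mathrm{T}, \mathrm{F}) \ge 2\alpha + 2\beta$
A2: for $i = 1..n$: $\tau_T(\mathrm{T}; \cdot) \le \alpha$; $\tau_T(\mathrm{F}; \cdot) \le \alpha$ (the second arguments are not preserved in the transcript)
B1: for $j = 1..m$: $\tau_T(y_j^1; l_j^2 l_j^3) \le \alpha$; $\tau_T(y_j^2; l_j^1 l_j^3) \le \alpha$; $\tau_T(y_j^3; l_j^1 l_j^2) \le \alpha$
B2: for $j = 1..m$: $\tau_T(y_j^1; \mathrm{T}\mathrm{F}) \ge \alpha$; $\tau_T(y_j^2; \mathrm{T}\mathrm{F}) \ge \alpha$; $\tau_T(y_j^3; \mathrm{T}\mathrm{F}) \ge \alpha$
B3: for $j = 1..m$: $\tau_T(\mathrm{T}; y_j^2 y_j^3) \le \alpha$; $\tau_T(\mathrm{T}; y_j^1 y_j^3) \le \alpha$; $\tau_T(\mathrm{T}; y_j^1 y_j^2) \le \alpha$
[Figure: $\varphi$ is mapped to lower bounds $\tau_l$ and upper bounds $\tau_u$, with $2\Delta$ marked between them.]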
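The bounds A1 and B1 to B3 can be generated mechanically from the clause list. The sketch below is an illustration under stated assumptions: literals are plain strings, keys are tuples read as (s, t) for a 2-distance and (i, j, k) for $\tau_T(i; jk)$, and all names are hypothetical. A2 is left out because its second arguments are not preserved in the transcript.

```python
def build_bounds(clauses, alpha, beta):
    """Return (lower, upper) bound tables for A1, B1, B2 and B3."""
    lower, upper = {}, {}
    lower[("T", "F")] = 2 * alpha + 2 * beta          # A1 (a 2-distance)
    for j, literals in enumerate(clauses, start=1):
        y = [f"y{j}_{a}" for a in (1, 2, 3)]
        for a in range(3):
            b, c = [t for t in range(3) if t != a]    # the other two indices
            upper[(y[a], literals[b], literals[c])] = alpha   # B1
            lower[(y[a], "T", "F")] = alpha                   # B2
            upper[("T", y[b], y[c])] = alpha                  # B3
    return lower, upper


# Hypothetical formula with two clauses over x1, x2, x3.
CLAUSES = [("x1", "~x2", "x3"), ("~x1", "x2", "x3")]

if __name__ == "__main__":
    lower, upper = build_bounds(CLAUSES, alpha=4, beta=1)
    for key, value in lower.items():
        print("lower", key, value)
    for key, value in upper.items():
        print("upper", key, value)
```

With m clauses this produces 1 + 3m lower bounds and 6m upper bounds; every entry not listed is unconstrained by these rules.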

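33 The Reduction
A 3CNF formula $\varphi$ is mapped to lower bounds $\tau_l$ and upper bounds $\tau_u$ ($2\Delta$ is marked between them in the figure) such that: $\varphi$ is satisfiable $\Leftrightarrow$ there is a tree $T$ s.t. $\tau_l \le \tau_T \le \tau_u$.
If one can determine for $(\tau, \Delta)$ whether there exists a tree $T$ s.t. $\|\tau, \tau_T\|_\infty \le \Delta$, then one can determine for every 3CNF formula $\varphi$ whether it is satisfiable.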

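34 The Reduction
1. Define the set of taxa.
2. Define a set of lower and upper bounds on some entries of $\tau_T$. [$\varphi$ is satisfiable $\Leftrightarrow$ there is a tree $T$ which satisfies all bounds]
3. Define $\Delta$ according to the slackness required for the proof of [direction symbol not preserved in the transcript].
[Figure: $\varphi$ is mapped to lower bounds $\tau_l$ and upper bounds $\tau_u$, with $2\Delta$ marked between them.]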

35 The Reduction
Define the set of taxa:
Taxa T, F.
A taxon for every literal.
3 taxa for every clause: $y_j^1, y_j^2, y_j^3$.
[Figure: $\varphi$ is mapped to lower bounds $\tau_l$ and upper bounds $\tau_u$, with $2\Delta$ marked between them.]

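36 The Analysis
A1: $\tau_T(\mathrm{T}, \mathrm{F}) \ge 2\alpha + 2\beta$
A2: for $i = 1..n$: $\tau_T(\mathrm{T}; \cdot) \le \alpha$; $\tau_T(\mathrm{F}; \cdot) \le \alpha$ (the second arguments are not preserved in the transcript)
Trees satisfying A1 and A2 imply a truth assignment to $x_1, \dots, x_n$.
[Figure: the taxa T and F, the nodes $v_T$, $v_F$ and $v_\varphi$, and the labels $\beta$, $\beta$, $\ge \alpha$, $\le \alpha$.]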

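37 The Analysis
There is a tree $T$ which satisfies all bounds $\Rightarrow$ $\varphi$ is satisfiable.
B1: for $j = 1..m$: $\tau_T(y_j^1; l_j^2 l_j^3) \le \alpha$; $\tau_T(y_j^2; l_j^1 l_j^3) \le \alpha$; $\tau_T(y_j^3; l_j^1 l_j^2) \le \alpha$
B2: for $j = 1..m$: $\tau_T(y_j^1; \mathrm{T}\mathrm{F}) \ge \alpha$; $\tau_T(y_j^2; \mathrm{T}\mathrm{F}) \ge \alpha$; $\tau_T(y_j^3; \mathrm{T}\mathrm{F}) \ge \alpha$
B3: for $j = 1..m$: $\tau_T(\mathrm{T}; y_j^2 y_j^3) \le \alpha$; $\tau_T(\mathrm{T}; y_j^1 y_j^3) \le \alpha$; $\tau_T(\mathrm{T}; y_j^1 y_j^2) \le \alpha$
B1 and B2 imply that $y_j^a$ is determined by $l_j^b$ and $l_j^c$ for $\{a, b, c\} = \{1, 2, 3\}$ (the slide writes this as an equation whose operator between $l_j^b$ and $l_j^c$ is not preserved in the transcript).
B3 implies that at least two of $y_j^1, y_j^2, y_j^3$ are satisfied.
[Figure: the taxon F, the nodes $v_F$ and $v_\varphi$, the literals $l_1$, $l_2$ and the taxon $y_3$.]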

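38 The Reduction: $\tau(\varphi)$
Bounds to be satisfied by the tree:
A1: $\tau_T(\mathrm{T}, \mathrm{F}) \ge 2\alpha + 2\beta$
A2: for $i = 1..n$: $\tau_T(\mathrm{T}; \cdot) \le \alpha$; $\tau_T(\mathrm{F}; \cdot) \le \alpha$ (the second arguments are not preserved in the transcript)
B1: for $j = 1..m$: $\tau_T(y_j^1; l_j^2 l_j^3) \le \alpha$; $\tau_T(y_j^2; l_j^1 l_j^3) \le \alpha$; $\tau_T(y_j^3; l_j^1 l_j^2) \le \alpha$
B2: for $j = 1..m$: $\tau_T(y_j^1; \mathrm{T}\mathrm{F}) \ge \alpha$; $\tau_T(y_j^2; \mathrm{T}\mathrm{F}) \ge \alpha$; $\tau_T(y_j^3; \mathrm{T}\mathrm{F}) \ge \alpha$
B3: for $j = 1..m$: $\tau_T(\mathrm{T}; y_j^2 y_j^3) \le \alpha$; $\tau_T(\mathrm{T}; y_j^1 y_j^3) \le \alpha$; $\tau_T(\mathrm{T}; y_j^1 y_j^2) \le \alpha$
Input values of $\tau$:
A1: $\tau(\mathrm{T}, \mathrm{F}) = 2\alpha + 3\beta$
A2: for $i = 1..n$: $\tau(\mathrm{T}; \cdot) = \alpha - \beta$; $\tau(\mathrm{F}; \cdot) = \alpha - \beta$
B1: for $j = 1..m$: $\tau(y_j^1; l_j^2 l_j^3) = \alpha - \beta$; $\tau(y_j^2; l_j^1 l_j^3) = \alpha - \beta$; $\tau(y_j^3; l_j^1 l_j^2) = \alpha - \beta$
B2: for $j = 1..m$: $\tau(y_j^1; \mathrm{T}\mathrm{F}) = \alpha + \beta$; $\tau(y_j^2; \mathrm{T}\mathrm{F}) = \alpha + \beta$; $\tau(y_j^3; \mathrm{T}\mathrm{F}) = \alpha + \beta$
B3: for $j = 1..m$: $\tau(\mathrm{T}; y_j^2 y_j^3) = \alpha - \beta$; $\tau(\mathrm{T}; y_j^1 y_j^3) = \alpha - \beta$; $\tau(\mathrm{T}; y_j^1 y_j^2) = \alpha - \beta$
Other 2-distances: $\tau(s, t) = 2\alpha + 2\beta$
Other 3-distances: $\tau(s; tu) = \alpha + 2\beta$
In our constructed tree: all 2-distances are in $[2\alpha, 2\alpha + 2\beta]$; all 3-distances are in $[\alpha, \alpha + 2\beta]$.
$\Rightarrow \Delta = \beta$.
[Figure: example tree with internal nodes $v_T$ and $v_F$, the taxa T and F, and the clause taxa $y_1^1, y_1^2, y_1^3, y_2^1, y_2^2, y_2^3$; the edge weights shown are $\alpha$ and $2\beta$.]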
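The step from the input values to the bounds can be written out as a short derivation. It uses only the values listed on this slide and the choice $\Delta = \beta$: any tree $T$ with $\|\tau, \tau_T\|_\infty \le \beta$ has every entry of $\tau_T$ within $\beta$ of the corresponding entry of $\tau$, which gives back A1 to B3.

```latex
\begin{align*}
\text{A1:}\quad & \tau_T(\mathrm{T},\mathrm{F}) \;\ge\; \tau(\mathrm{T},\mathrm{F}) - \beta
    \;=\; (2\alpha + 3\beta) - \beta \;=\; 2\alpha + 2\beta \\
\text{A2, B1, B3:}\quad & \tau_T(\cdot\,;\cdot) \;\le\; \tau(\cdot\,;\cdot) + \beta
    \;=\; (\alpha - \beta) + \beta \;=\; \alpha \\
\text{B2:}\quad & \tau_T(y_j^a;\mathrm{T}\mathrm{F}) \;\ge\; \tau(y_j^a;\mathrm{T}\mathrm{F}) - \beta
    \;=\; (\alpha + \beta) - \beta \;=\; \alpha
\end{align*}
```

So each "close" requirement is encoded by an input value one $\beta$ below its upper bound, and each "far" requirement by an input value one $\beta$ above its lower bound.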

