Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees Yufeng Wu Dept. of Computer Science & Engineering University.


1 Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees
Yufeng Wu, Dept. of Computer Science & Engineering, University of Connecticut, USA
ISMB 2010

2 Reticulate Networks
Gene trees: phylogenetic trees built from gene sequences
- Assumed to be binary and rooted
- Topologies can differ between genes
Reticulate evolution: one explanation for the differing topologies
- Hybrid speciation, horizontal gene transfer
Reticulate network: a directed acyclic graph displaying each of the gene trees
Reticulation event(s): nodes with in-degree two or more
Example sequences:
Gene A: 1: TCG, 2: TCA, 3: CGG, 4: CCG
Gene B: 1: AGC, 2: TGT, 3: AAC, 4: AGT
[Figure: gene trees T_A and T_B on taxa 1-4 and a reticulate network displaying both. Keeping the two red edges yields one tree; keeping the two black edges yields the other.]

3 The Minimum Reticulation Problem
Given: a set G of K gene trees.
Problem: reconstruct a reticulate network that displays each gene tree using Rmin(G) reticulation events, the minimum number possible.
NP-complete, even for K=2.
Current approaches:
- exact methods for the K=2 case (see Semple et al.)
- impose topological constraints (e.g. galled networks, see Huson et al.)
- or work only on small-scale topologies
For simplicity: a reticulation node has exactly two incoming edges (our approach allows the more general case).
[Figure: gene trees T1, T2, T3 on taxa 1-4 and a network N displaying all three with 2 reticulation events, which is the minimum.]

4 Close Lower and Upper Bounds for Minimum Reticulation of Multiple Gene Trees
Key idea: develop novel lower and upper bounds for Rmin(G), where G is the set of K gene trees.
RH(G) ≤ Rmin(G) ≤ SIT(G)
- Rmin(G): the minimum. Challenging to compute for K ≥ 3.
- RH(G): lower bound. Novel: the first non-trivial bound.
- SIT(G): upper bound. Works for any K.
The bounds provide a range for Rmin(G). If RH(G) = SIT(G), then Rmin(G) = RH(G) = SIT(G).

5 Pairwise Distance
Pairwise reticulation distance of T1 and T2: d(T1,T2), the minimum number of reticulation events in any reticulate network displaying T1 and T2.
Pairwise distances for T1, T2 and T3: all three are 1, so Rmin(T1,T2,T3) ≥ max(1,1,1) = 1.
Question: can Rmin(T1,T2,T3) = 1?
Consider an imaginary network with one reticulation node v. Choosing the same reticulate edge at v gives the same gene tree, so a single reticulation node with two incoming edges can display at most two distinct trees.
Hence Rmin(T1,T2,T3) ≥ 2.
[Figure: gene trees T1, T2, T3 on taxa 1-4, with pairwise distances of 1 between each pair.]

6 Display Vector
When a tree T is displayed in a network, it has a display vector V_T that records how T is displayed.
- One bit per reticulation node, so the length of the display vector equals the number of reticulation nodes in the network.
- Value 0/1: at each reticulation node, which incoming edge (the 0-edge or the 1-edge) is kept for T.
Example: in a network with reticulation nodes v1 and v2, V_T = 0 1 and V_T' = 1 0.
Intuition: display vectors cannot be too similar.
Lemma: D(V_T1, V_T2) ≥ d(T1,T2) for any network displaying T1 and T2.
- D(V_T1, V_T2): Hamming distance of V_T1 and V_T2
- d(T1,T2): pairwise reticulation distance of T1 and T2
[Figure: trees T and T' on taxa 1-4 and a network with reticulation nodes v1 and v2 whose incoming edges are labeled 0 and 1.]
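As a small illustration of the lemma, the Python sketch below computes the Hamming distance between two display vectors and checks it against a pairwise reticulation distance. The function names and the tuple encoding of display vectors are assumptions made for this example; they are not taken from PIRN.

```python
def hamming(u, v):
    """Number of positions at which two equal-length 0/1 vectors differ."""
    if len(u) != len(v):
        raise ValueError("display vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def consistent_with_lemma(v_t1, v_t2, d_t1_t2):
    """True if D(V_T1, V_T2) >= d(T1, T2), as the lemma requires."""
    return hamming(v_t1, v_t2) >= d_t1_t2


# Display vectors from the slide: V_T = 0 1 and V_T' = 1 0.
v_t, v_t_prime = (0, 1), (1, 0)
print(hamming(v_t, v_t_prime))
```

Two trees with identical display vectors in the same network are the same tree, which is why a positive pairwise distance forces the vectors apart.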

7 The RH Lower Bound
Key: if a network with R reticulation events exists, then there exist K display vectors of length R satisfying the distance constraints: Hamming distance D(V_T, V_T') ≥ d(T,T') for every pair of trees T, T'.
Analogy: select K points on the R-dimensional binary hypercube such that the points are not too close to each other.
If no such K points exist, then at least R+1 reticulation events are needed.
RH lower bound: the smallest R such that K points can be selected on the R-dimensional hypercube satisfying the distance constraints.
- No polynomial-time algorithm is known for the general HPP problem. We use integer linear programming to solve it.
- A closed-form formula for the RH bound exists for K=3.
Example: T1, T2, T3 with pairwise distances 2, 3 and 2.
Question: can Rmin(T1,T2,T3) = 3? No suitable point for the third tree exists on the 3-dimensional hypercube, so Rmin(T1,T2,T3) ≥ 4.
[Figure: triangle on T1, T2, T3 with edge labels 2, 3, 2, and a hypercube on which T1 and T3 are placed but T2 cannot be.]
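The definition of the RH bound can be sketched as an exhaustive search over hypercube points. This is only a brute-force illustration for very small K and R, not the integer linear program the talk describes; the function names, the `dist` dictionary layout, and the `max_r` cutoff are assumptions for this example.

```python
from itertools import combinations, product


def hamming(u, v):
    """Number of positions at which two equal-length 0/1 vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def feasible(k, r, dist):
    """Can k points on the r-dimensional binary hypercube satisfy
    hamming(p_i, p_j) >= dist[(i, j)] for every pair i < j?"""
    points = list(product((0, 1), repeat=r))
    # Flipping bits preserves Hamming distances, so the first point
    # can be fixed to the all-zero vector without loss of generality.
    first = points[0]
    for rest in product(points, repeat=k - 1):
        chosen = (first,) + rest
        if all(hamming(chosen[i], chosen[j]) >= dist[(i, j)]
               for i, j in combinations(range(k), 2)):
            return True
    return False


def rh_bound_bruteforce(k, dist, max_r=8):
    """Smallest r (up to max_r) for which the distance constraints are feasible."""
    for r in range(max_r + 1):
        if feasible(k, r, dist):
            return r
    raise ValueError("no feasible R found up to max_r")


# Pairwise distances 1, 1, 1 (slide 5) and 2, 3, 2 (slide 7).
# Which pair carries the 3 is an arbitrary choice here.
print(rh_bound_bruteforce(3, {(0, 1): 1, (0, 2): 1, (1, 2): 1}))
print(rh_bound_bruteforce(3, {(0, 1): 2, (0, 2): 3, (1, 2): 2}))
```

For three points, each coordinate contributes either 0 or 2 to the sum of the three pairwise Hamming distances, so that sum is at most 2R. Distances 2, 3 and 2 sum to 7, which rules out R = 3 and agrees with the slide's conclusion that at least 4 reticulation events are needed.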

8 Upper Bound
Problem: how to reconstruct a network for T1, T2, …, TK using a small (possibly not minimum) number U of reticulation events? U is then an upper bound.
Key idea: sequentially insert the gene trees Ti into a growing network N.
- Each step inserts one tree into N.
- New reticulation events may be needed to display Ti in N.
- Minimize the number of new reticulation events at each step.
[Figure: gene trees T1, T2, T3 on taxa 1-4 and three stages of the growing network N.]

9 SIT Upper Bound: Stepwise Insertion of Trees
Insertion of a tree into a network: given a reticulate network N and a gene tree T, grow N by adding the minimum number of reticulation events needed to make T displayed in N.
- NP-complete
- Practical computation using integer linear programming
SIT bound:
- Try all orderings of T1, T2, …, TK.
- For each ordering, insert the trees Ti one at a time and count the reticulation events needed for each insertion. This gives one network per ordering.
- SIT bound = the smallest number of reticulation events among these networks.
Heuristics are used when K is large or the trees are large and very different.
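The outer loop of the SIT bound can be sketched as follows. The insertion step itself is NP-complete and is solved by integer linear programming in the talk, so here it is passed in as a callback; `sit_bound`, `insert_tree`, `toy_insert` and the distance table `TOY_DIST` are all hypothetical names and made-up data for illustration only.

```python
from itertools import permutations


def sit_bound(trees, insert_tree, empty_network):
    """Smallest total number of reticulation events over all insertion orders.

    insert_tree(network, tree) must return (new_network, events_added).
    """
    best = None
    for order in permutations(trees):
        network, total = empty_network, 0
        for tree in order:
            network, added = insert_tree(network, tree)
            total += added
        if best is None or total < best:
            best = total
    return best


# Toy stand-in for the insertion step (NOT the ILP-based insertion):
# a "network" is the tuple of trees inserted so far, and inserting a tree
# costs its smallest made-up distance to any tree already present.
TOY_DIST = {frozenset("AB"): 1, frozenset("AC"): 3, frozenset("BC"): 1}


def toy_insert(network, tree):
    if not network:
        return (tree,), 0
    cost = min(TOY_DIST[frozenset((tree, other))] for other in network)
    return network + (tree,), cost


print(sit_bound("ABC", toy_insert, ()))
```

Even in this toy setting the order matters: inserting A, C, B costs 4 while A, B, C costs 2. Because the loop visits all K! orderings, it is only practical for small K, which is consistent with the slide's note that heuristics are needed for larger inputs.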

10 Simulation
Generation of simulation data: a two-stage approach
- Simulate a reticulate network N backwards in time for n species.
- Randomly select K trees embedded in N.
Evaluation criteria:
- How often the exact minimum is found, i.e. how often the lower and upper bounds match
- The gap between the lower and upper bounds
- Average running time (see paper)
PIRN: a downloadable open-source software tool, implemented in C++ with GNU GLPK (and CPLEX).

11 Performance of PIRN: % of Datasets Optimal Solution Found
Lower and upper bounds match for many datasets.
[Figure: percentage of datasets where the lower bound equals the upper bound (vertical axis, % LB=UB) against number of taxa (horizontal axis). K: number of gene trees; r: level of reticulation. Averaged over 100 datasets.]

12 Performance of PIRN: Gap of Bounds
The gap between the lower and upper bounds is small for many datasets.
[Figure: gap between lower and upper bounds (vertical axis) against number of taxa (horizontal axis). K: number of gene trees.]

13 Reticulate Network for Five Poaceae Trees
Gene trees: rpoC2, phyB, rbcL, ndhF, ITS
RH bound: 11
SIT bound: 13

14 Reticulate Network for Five Poaceae Trees
SIT bound: 13 reticulation events used in the network
[Figure: reticulate network for the five Poaceae trees; ITS is labeled.]

15 Acknowledgement
More information available at: http://www.engr.uconn.edu/~ywu
Research supported by the National Science Foundation

