CIS786, Lecture 4 Usman Roshan.

1 http://creativecommons.org/licenses/by-sa/2.0/

2 CIS786, Lecture 4 Usman Roshan

3 Iterated local search: escape local optima by perturbation. [Figure: local search climbs to a local optimum; a perturbation kicks the tree away, and local search resumes from the output of the perturbation]
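A minimal sketch of the loop this diagram describes. The callables `local_search`, `perturb`, and `score` are hypothetical placeholders for a real MP heuristic, a tree perturbation (e.g. a ratchet-style reweighting round), and the parsimony score; this is an illustration of the idea, not any particular implementation.

```python
def iterated_local_search(start, local_search, perturb, score, iterations=100):
    """Generic ILS loop: climb, kick, re-climb, keep the better tree."""
    best = local_search(start)               # climb to a first local optimum
    for _ in range(iterations):
        kicked = perturb(best)               # kick the tree out of its basin
        candidate = local_search(kicked)     # local search from the kicked tree
        if score(candidate) < score(best):   # lower MP score is better
            best = candidate
    return best
```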

4 ILS for MP: We saw that the ratchet improves upon iterative improvement, and that TNT's sophisticated, faster implementation outperforms the ratchet and PAUP* implementations. But can we do even better?

5 Disk Covering Methods (DCMs): DCMs are divide-and-conquer booster methods. They divide the dataset into small subproblems, compute subtrees using a given base method, merge the subtrees, and refine the supertree. DCMs to date: DCM1, for improving the statistical performance of distance-based methods; DCM2, for improving heuristic search for MP and ML; DCM3, the latest, fastest, and best DCM (in accuracy and optimality).

6 DCM2 technique for speeding up MP searches: 1. decompose the sequences into overlapping subproblems; 2. compute subtrees using a base method; 3. merge the subtrees using the Strict Consensus Merge (SCM); 4. refine to make the tree binary.
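A schematic sketch of these four phases as a single pipeline. The callables `decompose`, `base_method`, `merge`, and `refine` are hypothetical placeholders for the decomposition, the base MP heuristic, the Strict Consensus Merge, and the refinement step; this is not a real API.

```python
def dcm2_boost(sequences, decompose, base_method, merge, refine):
    """Four-phase DCM2 boosting pipeline (schematic)."""
    # 1. Decompose the taxa into overlapping subsets.
    subsets = decompose(sequences)
    # 2. Solve each subproblem with the base method (e.g. an MP heuristic).
    subtrees = [base_method({taxon: sequences[taxon] for taxon in s})
                for s in subsets]
    # 3. Merge the subtrees with the Strict Consensus Merge (SCM).
    supertree = merge(subtrees)
    # 4. Refine the (possibly unresolved) supertree into a binary tree.
    return refine(supertree, sequences)
```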

7 DCM1 and DCM2 decompositions. DCM1 decomposition: NJ gets better accuracy on small-diameter subproblems. DCM2 decomposition: getting a smaller number of smaller subproblems speeds up the solution.

8 Supertree Methods

9 Strict Consensus Merger. [Figure: example of merging two subtrees that share taxa 1–4 into a single supertree on taxa 1–7]

10 Tree Refinement. [Figure: an unresolved supertree on taxa a–h is refined into a binary tree]

11 The big question: Why DCMs? Can DCMs improve upon existing methods such as neighbor joining, PAUP*, or TNT?

12 Improving sequence length requirements of NJ Can DCM1 improve upon NJ? We examine this question under simulation

13 DCM1(NJ)

14

15 Computing tree for one threshold

16 Recall simulation studies

17 Experimental results: true tree selection (phase II of DCM1); uniformly random trees; birth-death random trees; sequence length requirements on birth-death random trees.

18 Comparing tree selection techniques

19 Error rates on uniform random trees

20 Error as a function of evolutionary rate. [Plot: error rates of NJ vs. DCM1-NJ+MP]

21 Sequence length requirements as a function of evolutionary rates: 100 taxa, 90% accuracy.

22 400 taxa, 90% accuracy

23 Sequence length requirements as a function of the number of taxa. [Plot: DCM1-NJ+MP vs. NJ]

24 Conclusion: DCM1-NJ+MP improves upon NJ in large and divergent settings. Why did it work? Smaller datasets with low evolutionary diameters AND a reliable supertree method → accurate subtrees (on the subsets) → an accurate supertree.

25 Conclusion

26 Previously we saw a comparison of DCM components for solving MP: the DCM2 decomposition is better than the DCM1 decomposition; SCM is better than MRP (in the DCM context); constrained refinement is better than the Inferred Ancestral States technique; higher thresholds take longer but can produce better trees.

27 Comparison of DCM components for solving MP

28 I. Comparison of DCMs (1,322 sequences) Base method is the TNT-ratchet.

29 I. Comparison of DCMs (1,322 sequences) Base method is the TNT-ratchet.

30 I. Comparison of DCMs (4583 sequences) Base method is the TNT-ratchet.

31 I. Comparison of DCMs (4583 sequences) Base method is the TNT-ratchet. DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets.

32 DCM2 decomposition on 500 rbcL genes (Zilla dataset). Blue: separator; red: subset 1; pink: subset 2. Visualization produced by the graphviz program, which draws the graph according to specified distances. Nodes: species in the dataset; distances: p-distances (Hamming) between the DNA sequences. Observations: 1. the separator is very large; 2. the subsets are very large; 3. the subsets are scattered.
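For concreteness, a small sketch of the p-distance (normalized Hamming distance) used here, computed on two aligned DNA sequences. The function name and the example sequences are illustrative only.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    assert len(seq_a) == len(seq_b), "sequences must be aligned to equal length"
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)

# Two aligned 8-base sequences differing at 2 of 8 sites -> p-distance 0.25
print(p_distance("ACGTACGT", "ACGTTCGA"))
```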

33 Doesn’t look anything like this

34 DCM3 decomposition. DCM2 – Input: distance matrix d, threshold q, sequences S. Algorithm: 1a. compute a threshold graph G using q and d; 1b. perform a minimum-weight triangulation of G. DCM3 – Input: guide tree T on S, sequences S. Algorithm: 1. compute a short quartet graph G using T (the graph G is provably triangulated). Steps 2 and 3 are shared: 2. find a separator X in G which minimizes max_i |X ∪ C_i|, where the C_i are the connected components of G − X; 3. output the subproblems X ∪ C_i. DCM3 advantage: it is faster and produces smaller subproblems than DCM2.
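A hedged sketch of two pieces of the DCM2 side of this slide: building the threshold graph and scoring a candidate separator X by max_i |X ∪ C_i|. It assumes the distance matrix is a dict of dicts keyed by taxon names, uses networkx as an assumed helper library, and omits the minimum-weight triangulation and the search over candidate separators.

```python
import itertools
import networkx as nx  # assumed helper library for graph operations

def threshold_graph(distances, q):
    """Threshold graph G: edge (i, j) whenever d(i, j) <= q."""
    g = nx.Graph()
    g.add_nodes_from(distances)
    for i, j in itertools.combinations(distances, 2):
        if distances[i][j] <= q:
            g.add_edge(i, j)
    return g

def separator_score(g, X):
    """max_i |X ∪ C_i| over the connected components C_i of G - X."""
    rest = g.copy()
    rest.remove_nodes_from(X)
    components = list(nx.connected_components(rest))
    if not components:            # X covers the whole graph
        return len(set(X))
    return max(len(set(X) | c) for c in components)
```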

35 DCM3 decomposition - example

36 Approximate centroid-edge DCM3 decomposition – example. 1. Locate the centroid edge e (O(n) time); 2. set the closest leaves around e to be the separator (O(n) time); 3. the remaining leaves in the subtrees around e form the subsets (each unioned with the separator).
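A rough sketch of step 1 above (locating the centroid edge) on a guide tree stored as an undirected adjacency dict {node: set of neighbors}, where leaves are the degree-1 nodes. The helper names are illustrative, the separator and subset construction around the chosen edge is left out, and this exhaustive version is quadratic rather than the linear-time approximation discussed on the next slide.

```python
from collections import deque

def leaves_on_side(tree, start, blocked):
    """Leaves reachable from `start` in `tree` without crossing `blocked`."""
    seen, queue, found = {blocked, start}, deque([start]), []
    while queue:
        node = queue.popleft()
        if len(tree[node]) == 1:          # degree-1 node = leaf
            found.append(node)
        for nbr in tree[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return found

def centroid_edge(tree):
    """Edge whose removal splits the leaf set most evenly."""
    n_leaves = sum(1 for v in tree if len(tree[v]) == 1)
    best_edge, best_gap = None, None
    for u in tree:
        for v in tree[u]:
            if str(u) < str(v):           # visit each undirected edge once
                side = len(leaves_on_side(tree, u, v))
                gap = abs(n_leaves - 2 * side)
                if best_gap is None or gap < best_gap:
                    best_edge, best_gap = (u, v), gap
    return best_edge
```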

37 Time to compute DCM3 decompositions: an optimal DCM3 decomposition takes O(n³) time to compute – the same as for DCM2. The centroid-edge DCM3 decomposition can be computed in O(n²) time, and an approximate centroid-edge decomposition can be computed in O(n) time. (From here on we assume we are using the approximate centroid-edge decomposition.)

38 DCM2 decomposition on 500 rbcL genes (Zilla dataset). Blue: separator; red: subset 1; pink: subset 2. Visualization produced by the graphviz program, which draws the graph according to specified distances. Nodes: species in the dataset; distances: p-distances (Hamming) between the DNA sequences. Observations: 1. the separator is very large; 2. the subsets are very large; 3. the subsets are scattered.

39 DCM3 decomposition on 500 rbcL genes (Zilla dataset). Blue: separator (and subset); red: subset 2; pink: subset 3; yellow: subset 4. Visualization produced by the graphviz program, which draws the graph according to specified distances. Nodes: species in the dataset; distances: p-distances (Hamming) between the DNA sequences. Observations: 1. the separator is small; 2. the subsets are small; 3. the subsets are compact.

40 Dataset: 4583 actinobacteria ssu rRNA from RDP. Base method is the TNT-ratchet. DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets. DCM3 followed by the TNT-ratchet doesn't improve over TNT; Recursive-DCM3 followed by the TNT-ratchet doesn't improve over TNT. [Plot: Comparison of DCMs – average MP score above optimal, shown as a percentage of the optimal, vs. hours (0–24) for TNT, DCM2, DCM3, and Rec-DCM3]

41 Local optima are a problem. [Figure: cost landscape over the space of phylogenetic trees, showing a global optimum and a local optimum]

42 Local optima are a problem. [Plot: average MP score above optimal, shown as a percentage of the optimal, vs. hours]

43 Iterated local search: escape local optima by perturbation. [Figure: local search climbs to a local optimum; a perturbation kicks the tree away, and local search resumes from the output of the perturbation]

44 Iterated local search: Recursive-Iterative-DCM3. [Figure: as in the ILS diagram, but the perturbation step is replaced by Recursive-DCM3; local search resumes from the output of Recursive-DCM3]
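A sketch of this iteration viewed as ILS, under the assumption that the perturbation step is a full Recursive-DCM3 round (decompose around the current tree, solve the subproblems with the base method, merge, refine). `recursive_dcm3_round`, `local_search`, and `mp_score` are hypothetical placeholders, and keeping only the better-scoring tree is one common ILS acceptance rule, not necessarily the exact choice made in the Rec-I-DCM3 software.

```python
def rec_i_dcm3(start_tree, sequences, recursive_dcm3_round, local_search,
               mp_score, iterations=10):
    """Rec-I-DCM3 as iterated local search (schematic)."""
    best = local_search(start_tree, sequences)
    for _ in range(iterations):
        # "Perturbation": a Recursive-DCM3 round guided by the current tree.
        rebuilt = recursive_dcm3_round(best, sequences)
        # Global local search (e.g. TNT-ratchet) from the rebuilt tree.
        candidate = local_search(rebuilt, sequences)
        if mp_score(candidate, sequences) < mp_score(best, sequences):
            best = candidate
    return best
```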

45 Rec-I-DCM3(TNT-ratchet) improves upon the unboosted TNT-ratchet. [Plot: Comparison of DCMs for solving MP – average MP score above optimal, shown as a percentage of the optimal, vs. hours (0–24) for TNT, DCM2, DCM3, Rec-DCM3, and Rec-I-DCM3]

46 I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet.

47 I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet.

48 I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet.

49 I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet.

50 I. Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet. Note the improvement in DCMs as we move from the default to recursion to iteration to recursion+iteration.

51 Improving upon TNT: But what happens after 24 hours? We studied boosting the TNT-ratchet; other TNT heuristics are actually better, and improving upon them may not be possible. Can we improve upon the default TNT search?

52 Improving upon TNT

53 2000 Eukaryotes rRNA

54 6722 3-domain+2-org rRNA

55 13921 Proteobacteria rRNA

56 Improving upon TNT: What about better TNT heuristics? Can Rec-I-DCM3 improve upon them? Rec-I-DCM3 improves upon the default TNT search, but we don't know what happens with the better TNT heuristics. Therefore, for a large-scale analysis, first figure out the best settings of the software (e.g. TNT or PAUP*) on the dataset, and then use it in conjunction with Rec-I-DCM3 with various subset sizes.

