CIS786, Lecture 3 Usman Roshan

Maximum Parsimony Character-based method NP-hard (equivalent to a Steiner tree problem in a hypercube) Widely used in phylogenetics Slower than NJ but more accurate Faster than ML Assumes sites evolve i.i.d.

Maximum Parsimony Input: Set S of n aligned sequences of length k Output: A phylogenetic tree T –leaf-labeled by the sequences in S –with additional sequences of length k labeling the internal nodes of T –such that the sum of Hamming distances over all edges of T (the MP score) is minimized
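A minimal sketch of this objective (Python; the node ids, function names, and tree encoding are illustrative only): once every node of the tree, leaves and internal nodes alike, carries a sequence, the MP score is just the sum of Hamming distances over the edges.

def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))

def mp_score(labels, edges):
    # labels: node id -> sequence; edges: list of (u, v) node-id pairs.
    return sum(hamming(labels[u], labels[v]) for u, v in edges)

# The optimal tree from the four-sequence example below scores 4:
labels = {1: "ACT", 2: "ACA", 3: "GTT", 4: "GTA", 5: "ACA", 6: "GTA"}
edges = [(1, 5), (2, 5), (5, 6), (6, 3), (6, 4)]
print(mp_score(labels, edges))  # -> 4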

Maximum parsimony (example) Input: four sequences –ACT –ACA –GTT –GTA Question: which of the three possible trees has the best MP score?

Maximum Parsimony [Figure: the three possible unrooted tree topologies on the four sequences ACT, ACA, GTT, GTA]

Maximum Parsimony [Figure: the three topologies with internal-node labelings; their MP scores are 5, 7, and 4, and the tree with score 4 is the optimal MP tree]

Maximum Parsimony: computational complexity Finding the optimal MP tree is NP-hard The optimal labeling of a fixed tree (and hence its MP score) can be computed in O(nk) time, linear in the input size
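The O(nk) claim is usually realized with Fitch's algorithm; here is a sketch under the assumption that the tree is given as nested tuples of leaf sequences (a representation chosen only for brevity).

def fitch_score(tree):
    # tree: a leaf sequence (str) or a (left, right) pair of subtrees.
    def post(node):
        # Returns (state set per site, substitutions forced so far).
        if isinstance(node, str):
            return [{c} for c in node], 0
        (ls, lcost), (rs, rcost) = post(node[0]), post(node[1])
        sets, cost = [], lcost + rcost
        for a, b in zip(ls, rs):
            if a & b:
                sets.append(a & b)      # children agree at this site
            else:
                sets.append(a | b)      # disagreement: one substitution needed
                cost += 1
        return sets, cost
    return post(tree)[1]

# Rooting the optimal topology from the example along its internal edge:
print(fitch_score((("ACT", "ACA"), ("GTT", "GTA"))))  # -> 4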

Local search strategies [Figure: cost over the space of phylogenetic trees, with a global optimum and a local optimum marked]

Local search for MP Determine a candidate solution s While s is not a local minimum –Find a neighbor s' of s such that MP(s') < MP(s) –If found, set s = s' –Else return s and exit Time complexity: unknown---the search may finish quickly or run for a very long time, depending on the starting tree and the local move Need to specify how to construct the starting tree and which local move to use
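A sketch of this loop in Python; mp and neighbors are placeholders for an MP scorer and a tree-rearrangement generator (e.g. NNI, introduced below), not a real API.

def local_search(start, mp, neighbors):
    # Hill-climb until no neighbor improves the MP score (a local minimum).
    current, current_score = start, mp(start)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current):
            score = mp(candidate)
            if score < current_score:     # first-improvement move
                current, current_score = candidate, score
                improved = True
                break
    return current, current_score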

Starting tree for MP Random phylogeny---O(n) time Greedy-MP

Greedy-MP takes O(n^3k) time
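One common reading of Greedy-MP is stepwise addition: insert one taxon at a time at the placement giving the lowest parsimony score. A sketch under that assumption follows (score_fn stands for a fixed-tree scorer such as the fitch_score sketch above); n insertions, O(n) placements each, and an O(nk) score per placement give the O(n^3 k) bound.

def insertions(tree, new_leaf):
    # Yield every tree obtained by attaching new_leaf to one edge of tree.
    yield (tree, new_leaf)                    # attach above the current root
    if not isinstance(tree, str):
        left, right = tree
        for t in insertions(left, new_leaf):
            yield (t, right)
        for t in insertions(right, new_leaf):
            yield (left, t)

def greedy_mp(sequences, score_fn):
    # sequences: list of aligned strings; returns a nested-tuple tree.
    tree = (sequences[0], sequences[1])
    for seq in sequences[2:]:
        tree = min(insertions(tree, seq), key=score_fn)
    return tree

# Example: greedy_mp(["ACT", "ACA", "GTT", "GTA"], fitch_score)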

Faster Greedy MP 3-way labeling If we can assign optimal labels to each internal node rooted in each possible way, we can speed up the computation by a factor of n Optimal 3-way labeling –Sort all 3n subtrees using bucket sort in O(n) –Starting from the small subtrees, compute optimal labelings –For each subtree rooted at v, the optimal labelings of the children are already computed –Total time: O(nk) With the optimal labelings precomputed it takes constant time to compute the MP score for each candidate edge, so the total Greedy-MP time is O(n^2 k)

Local moves for MP: NNI (nearest neighbor interchange) For each internal edge we get two alternative topologies Neighborhood size is 2n-6
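A sketch of NNI enumeration on the same nested-tuple representation (rooted, for simplicity; the 2n-6 count above is for unrooted trees): every internal edge yields two alternative arrangements of the four subtrees around it.

def nni_neighbors(tree):
    # Yield every tree one NNI move away from tree (nested tuples).
    if isinstance(tree, str):
        return
    left, right = tree
    if not isinstance(left, str):            # internal edge: this node -> left
        a, b = left
        yield ((a, right), b)                # swap b with its sibling's subtree
        yield ((b, right), a)
    if not isinstance(right, str):           # internal edge: this node -> right
        a, b = right
        yield (a, (left, b))
        yield (b, (left, a))
    for sub in nni_neighbors(left):          # recurse into both subtrees
        yield (sub, right)
    for sub in nni_neighbors(right):
        yield (left, sub)

# Plugging nni_neighbors and an MP scorer into the local_search sketch
# above gives a basic NNI hill-climber.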

Local moves for MP: SPR (subtree prune and regraft) Neighborhood size is quadratic in the number of taxa Computing the minimum number of SPR moves between two rooted phylogenies is NP-hard

Local moves for MP: TBR (tree bisection and reconnection) Neighborhood size is cubic in the number of taxa Computing the minimum number of TBR moves between two phylogenies is NP-hard

Tree Bisection and Reconnection (TBR) Step 1: delete an edge, bisecting the tree into two subtrees Step 2: reconnect the two subtrees with a new edge that subdivides an edge in each subtree

Local optima are a problem

Iterated local search: escape local optima by perturbation [Figure: starting from a local optimum, a perturbation moves the search elsewhere, and local search from the perturbed tree reaches a new local optimum]

ILS for MP: Ratchet, Iterative-DCM3, TNT

Ratchet Perturbation input: an alignment and a phylogeny –Sample with replacement p% of the sites and reweight them to weight w –Perform local search on the modified dataset, starting from the input phylogeny –Reset the alignment to the original weights after completion and output the local minimum
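A sketch of one ratchet perturbation in Python; local_search (taking a start tree and a scoring function) and weighted_mp are placeholders, and the defaults p = 0.25 and w = 2 are illustrative values, not figures from the slides.

import random

def ratchet_perturbation(tree, num_sites, local_search, weighted_mp,
                         p=0.25, w=2):
    # Sample with replacement p% of the sites and upweight them to w.
    weights = [1] * num_sites
    for site in random.choices(range(num_sites), k=int(p * num_sites)):
        weights[site] = w
    # Local search on the reweighted data, starting from the input tree.
    perturbed = local_search(tree, lambda t: weighted_mp(t, weights))
    # The caller then resets to the original (unit) weights and continues
    # the iterated local search from the perturbed tree returned here.
    return perturbed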

Ratchet: escaping local minima by data perturbation [Figure: local search reaches a local optimum; the ratchet search escapes it, and local search continues from the ratchet's output] But how well does this perform? We have to examine this experimentally on real data

Experimental methodology for MP on real data Collect alignments of real datasets –Usually constructed using ClustalW –Followed by manual (by-eye) adjustments –Must be reliable to get a sensible tree! Run the methods for a fixed time period Compare MP scores as a function of time –Examine how scores improve over time –Compare the rate of convergence of the different methods (as a function of running time, not sequence length)

Experimental methodology for MP on real data We use rRNA and DNA alignments obtained from researchers and public databases We run iterative improvement and the ratchet, each for 24 hours, beginning from a randomized greedy MP tree Each method was run five times and average scores were plotted We use PAUP*---a very widely used software package for many types of phylogenetic analysis

500 aligned rbcL sequences (Zilla dataset)

854 aligned rbcL sequences

2000 aligned Eukaryotes

7180 aligned 3domain

13921 aligned Proteobacteria

Comparison of MP heuristics What about other techniques for escaping local minima? TNT: a combination of divide-and-conquer, simulated annealing, and genetic algorithms –Sectorial search (random): construct ancestral sequence states using parsimony; randomly select a subset of nodes; compute iterative-improvement trees and, if a better tree is found, replace the current one –Genetic algorithm (fuse): exchange subtrees between two trees to see if better ones are found –Default search: (1) do sectorial search starting from five randomized greedy MP trees; (2) apply the genetic algorithm to find better trees; (3) output the best tree How does this compare to PAUP*-ratchet?

Experimental methodology for MP on real data We use rRNA and DNA alignments Obtained from researchers and public databases We run PAUP*-ratchet, TNT-default, and TNT-ratchet each for 24 hours beginning from randomized greedy MP trees Each method was run five times on each dataset and average scores were plotted

500 aligned rbcL sequences (Zilla dataset)

854 aligned rbcL sequences

2000 aligned Eukaryotes

7180 aligned 3domain

13921 aligned Proteobacteria

Can we do even better? Yes! But first let’s look at Disk-Covering Methods

Disk Covering Methods (DCMs) DCMs are divide-and-conquer booster methods: they divide the dataset into small subproblems, compute subtrees using a given base method, merge the subtrees, and refine the resulting supertree DCMs to date –DCM1: for improving the statistical performance of distance-based methods –DCM2: for improving heuristic search for MP and ML –DCM3: the latest, fastest, and best (in accuracy and optimality) DCM

DCM2 technique for speeding up MP searches 1. Decompose sequences into overlapping subproblems 2. Compute subtrees using a base method 3. Merge subtrees using the Strict Consensus Merge (SCM) 4. Refine to make the tree binary
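The four steps, written as a pipeline skeleton; every callable here (decompose, base_method, merge_scm, refine) is a placeholder for the corresponding component rather than a real API, and sequences is assumed to be a dict from taxon name to aligned sequence with each subset a set of taxon names.

def dcm2_boost(sequences, distances, threshold,
               decompose, base_method, merge_scm, refine):
    subsets = decompose(distances, threshold)             # 1. overlapping subproblems
    subtrees = [base_method({x: sequences[x] for x in s})  # 2. solve each subproblem
                for s in subsets]
    supertree = merge_scm(subtrees)                        # 3. strict consensus merge
    return refine(supertree, sequences)                    # 4. resolve to a binary tree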

DCM2 decomposition Input: distance matrix d, threshold q, sequences S Algorithm: 1a. Compute a threshold graph G using q and d 1b. Perform a minimum-weight triangulation of G 2. Find a separator X in G which minimizes max_i |A_i ∪ X|, where A_1, ..., A_m are the connected components of G − X 3. Output the subproblems A_i ∪ X

Threshold graph Add edges (in order of increasing distance) until the graph is connected Perform a minimum-weight triangulation –NP-hard –A triangulated (chordal) graph has a perfect elimination ordering (PEO) –Its maximal cliques can be determined in linear time –Use a greedy triangulation heuristic: build a PEO by repeatedly choosing the vertex whose elimination adds the smallest maximum fill-edge weight –Worst case is O(n^3) but fast in practice
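A sketch of these two steps, assuming a symmetric distance matrix given as a list of lists; the greedy triangulation follows the heuristic named above (eliminate the vertex whose fill-in adds the smallest maximum edge weight).

import itertools

def is_connected(adj):
    # Depth-first search from an arbitrary vertex.
    seen, stack = set(), [next(iter(adj))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - seen)
    return len(seen) == len(adj)

def threshold_graph(d):
    # Add edges in order of increasing distance until the graph is connected.
    n = len(d)
    adj = {v: set() for v in range(n)}
    for u, v in sorted(itertools.combinations(range(n), 2),
                       key=lambda e: d[e[0]][e[1]]):
        adj[u].add(v); adj[v].add(u)
        if is_connected(adj):
            break
    return adj

def greedy_triangulation(adj, d):
    # Build a PEO by repeatedly eliminating the vertex whose fill-in adds
    # the smallest maximum edge weight; add the fill edges as we go.
    adj = {v: set(nb) for v, nb in adj.items()}
    remaining = set(adj)
    while remaining:
        def worst_fill(v):
            nbrs = adj[v] & remaining
            fills = [d[a][b] for a, b in itertools.combinations(sorted(nbrs), 2)
                     if b not in adj[a]]
            return max(fills, default=0)
        v = min(remaining, key=worst_fill)
        for a, b in itertools.combinations(sorted(adj[v] & remaining), 2):
            adj[a].add(b); adj[b].add(a)        # make v's neighbors a clique
        remaining.discard(v)
    return adj                                   # chordal supergraph of the input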

Finding the DCM2 separator 1. Find the separator X in G which minimizes max_i |A_i ∪ X|, where A_1, ..., A_m are the connected components of G − X 2. Output the subproblems A_i ∪ X 3. This takes O(n^3) worst-case time: perform a depth-first search over the components (O(n^2)) for each of the O(n) candidate separators
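A sketch of this separator search; candidate_separators is assumed to be supplied by the caller (in DCM2, for example, the maximal cliques of the triangulated graph), and the graph is the adjacency-set dictionary from the previous sketch.

def components_minus(adj, X):
    # Connected components of the graph after deleting the vertex set X.
    X = set(X)
    remaining, comps = set(adj) - X, []
    while remaining:
        comp, stack = set(), [next(iter(remaining))]
        while stack:
            v = stack.pop()
            if v in remaining and v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp - X)
        comps.append(comp)
        remaining -= comp
    return comps

def best_separator(adj, candidate_separators):
    # Keep the separator X minimizing max_i |A_i union X| over the
    # components A_1..A_m of G - X; return X and the subproblems.
    def width(X):
        comps = components_minus(adj, X)
        return max((len(c) + len(X) for c in comps), default=len(X))
    X = min(candidate_separators, key=width)
    return X, [c | set(X) for c in components_minus(adj, X)]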

DCM2 subsets

DCM3 decomposition - example

DCM1 vs DCM2 DCM1 decomposition: NJ gets better accuracy on small-diameter subproblems (we return to this later) DCM2 decomposition: a smaller number of smaller subproblems speeds up the solution

We saw how the decomposition takes place; now on to supertree methods 1. Decompose sequences into overlapping subproblems 2. Compute subtrees using a base method 3. Merge subtrees using the Strict Consensus Merge (SCM) 4. Refine to make the tree binary

Supertree Methods

Optimization problems Subtree Compatibility: given a set of trees T_1, ..., T_k, does there exist a tree T such that each T_i is the subtree of T induced by its leaf set (we say T contains T_i)? NP-hard (Steel 1992) Special cases are polynomial-time (rooted trees, DCM decompositions) MRP: also NP-hard

Direct supertree methods Strict consensus supertrees, MinCut supertrees

Indirect supertree methods MRP, Average consensus

MRP---Matrix Representation with Parsimony (very popular)
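A sketch of the standard MRP (Baum/Ragan) coding, again on illustrative rooted nested-tuple trees with string leaves: each clade of each source tree becomes one binary character, with '?' for taxa absent from that tree; the resulting matrix is then handed to an MP heuristic.

def leaves(tree):
    return {tree} if isinstance(tree, str) else leaves(tree[0]) | leaves(tree[1])

def clades(tree):
    # All clades below the root (the root clade is the full, uninformative set).
    out = []
    def walk(node, at_root=False):
        if isinstance(node, str):
            return
        if not at_root:
            out.append(leaves(node))
        walk(node[0]); walk(node[1])
    walk(tree, at_root=True)
    return out

def mrp_matrix(trees):
    taxa = sorted(set().union(*(leaves(t) for t in trees)))
    columns = []
    for t in trees:
        present = leaves(t)
        for clade in clades(t):
            columns.append({x: '1' if x in clade else '0' if x in present else '?'
                            for x in taxa})
    return taxa, ["".join(col[x] for col in columns) for x in taxa]

taxa, rows = mrp_matrix([(("a", "b"), ("c", "d")), ((("a", "c"), "d"), "e")])
print(dict(zip(taxa, rows)))    # one row of 0/1/? characters per taxon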

Strict Consensus Merger---faster and used in DCMs

Strict Consensus Merger: compatible subtrees

Strict Consensus Merger: compatible but collision

Strict Consensus Merger: incompatible subtrees

Strict Consensus Merger: incompatible and collision

Strict Consensus Merger: difference from Gordon’s SC method

We saw how the subtrees are merged; now on to tree refinement 1. Decompose sequences into overlapping subproblems 2. Compute subtrees using a base method 3. Merge subtrees using the Strict Consensus Merge (SCM) 4. Refine to make the tree binary

Tree Refinement Challenge: given an unresolved tree, find a binary refinement with optimal parsimony score NP-hard

Tree Refinement [Figure: example refinements of an unresolved tree on leaves a-h into fully resolved binary trees]

The full pipeline once more; now on to experimental comparisons 1. Decompose sequences into overlapping subproblems 2. Compute subtrees using a base method 3. Merge subtrees using the Strict Consensus Merge (SCM) 4. Refine to make the tree binary

Comparing DCM decompositions

Study of DCM decompositions DCM2 is faster and better than DCM1 [Figure panels: comparison of MP scores; comparison of running times]

Best DCM (DCM2) vs Random DCM2 is better than RANDOM w.r.t. MP scores and running times [Figure panels: comparison of MP scores; comparison of running times]

DCM2 (comparing two different thresholds) [Figure panels: comparison of MP scores; comparison of running times]

Threshold selection techniques Biological dataset of 503 rRNA sequences The threshold value at which we get two subproblems gives the best MP score

Comparing supertree methods

MRP vs. SCM SCM is better than MRP [Figure panels: comparison of MP scores; comparison of running times]

Comparing tree refinement techniques

Study of tree refinement techniques Constrained tree search had the best MP scores but is slower than the other methods [Figure panels: comparison of MP scores; comparison of running times]

Next time DCM1 for improving NJ Recursive-Iterative-DCM3: state of the art in solving MP and ML