CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.

Slides:



Advertisements
Similar presentations
A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
Advertisements

Ortholog vs. paralog? 1. Collect Sequence Data Good Dataset
BALANCED MINIMUM EVOLUTION. DISTANCE BASED PHYLOGENETIC RECONSTRUCTION 1. Compute distance matrix D. 2. Find binary tree using just D. Balanced Minimum.
Large-Scale Phylogenetic Analysis Tandy Warnow Associate Professor Department of Computer Sciences Graduate Program in Evolution and Ecology Co-Director.
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Molecular Evolution Revised 29/12/06
. Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.
Probabilistic methods for phylogenetic trees (Part 2)
Lecture 13 – Performance of Methods Folks often use the term “reliability” without a very clear definition of what it is. Methods of assessing performance.
Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.
Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Terminology of phylogenetic trees
BINF6201/8201 Molecular phylogenetic methods
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Bayes estimators for phylogenetic reconstruction Ruriko Yoshida.
Benjamin Loyle 2004 Cse 397 Solving Phylogenetic Trees Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO.
394C: Algorithms for Computational Biology Tandy Warnow Sept 9, 2013.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
Bayes estimators for phylogenetic reconstruction Ruriko Yoshida.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Phylogeny and Genome Biology Andrew Jackson Wellcome Trust Sanger Institute Changes: Type program name to start Always Cd to phyml directory before starting.
Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.
Introduction to Phylogenetic Estimation Algorithms Tandy Warnow.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
Understanding sets of trees CS 394C September 10, 2009.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
Selecting Genomes for Reconstruction of Ancestral Genomes Louxin Zhang Department of Mathematics National University of Singapore.
GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.
CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.
Full modeling versus summarizing gene- tree uncertainty: Method choice and species-tree accuracy L.L. Knowles et al., Molecular Phylogenetics and Evolution.
The Tree of Life: Algorithmic and Software Challenges Tandy Warnow The University of Texas at Austin.
Absolute Fast Converging Methods CS 598 Algorithmic Computational Genomics.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
Distance-based phylogeny estimation
The Disk-Covering Method for Phylogenetic Tree Reconstruction
New Approaches for Inferring the Tree of Life
394C, Spring 2012 Jan 23, 2012 Tandy Warnow.
Statistical tree estimation
Distance based phylogenetics
Multiple Sequence Alignment Methods
Techniques for MSA Tandy Warnow.
Algorithm Design and Phylogenomics
Phylogenetic Inference
Bayesian inference Presented by Amir Hadadi
CIPRES: Enabling Tree of Life Projects
Tandy Warnow The University of Illinois
BNFO 602 Phylogenetics Usman Roshan.
Absolute Fast Converging Methods
CS 581 Tandy Warnow.
CS 581 Algorithmic Computational Genomics
Chapter 19 Molecular Phylogenetics
Ultra-Large Phylogeny Estimation Using SATé and DACTAL
Recent Breakthroughs in Mathematical and Computational Phylogenetics
The Most General Markov Substitution Model on an Unrooted Tree
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Algorithms for Inferring the Tree of Life
Tandy Warnow The University of Texas at Austin
Tandy Warnow The University of Texas at Austin
But what if there is a large amount of homoplasy in the data?
Advances in Phylogenomic Estimation
Presentation transcript:

CS 598AGB What simulations can tell us

Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method is statistically consistent) are about what happens as amount of data increases to infinity. Therefore, simulations cannot prove or disprove statistical consistency. (You need a mathematical proof for those.)

Questions simulations can address Every simulation is by necessity limited to its model condition(s). Therefore, try to avoid over- enthusiastic interpretations of results. Even so… Simulations can address: – Relative performance of methods – Factors that impact performance Simulations can also indicate gaps in theoretical understanding, and suggest theoretical results that you can then try to prove (or disprove).

Examples Many studies on four-leaf trees: – Success of Phylogenetic Methods in the Four- Taxon Case ( Huelsenbeck & Hillis, Systematic Biology 1993 – Hobgoblin of Phylogenetics (Hillis, Huelsenbeck & Swofford, Nature 1994) Main objectives: comparing methods, determining model conditions under which methods such as parsimony seem to be consistent.

Phylogeny Estimation Methods Distance-based methods – Naïve Quartet Method and other quartet-based methods – Neighbor joining (NJ), BioNJ, FastME, Weighbor, etc. – DCM NJ, HGT+FP, etc. (“afc” methods) Maximum Parsimony (MP) and Maximum Compatibility (MC) Maximum Likelihood (ML) Bayesian MCMC (e.g., MrBayes) What do you know about these methods, and what would you predict about the relative accuracy of these methods, based on what you know?

Papers Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor joining, St. John et al., J. Algorithms, 2003 (paper #61) Designing Fast Converging Phylogenetic Methods, Nakhleh et al., ISMB 2001 and Bioinformatics 2001 (paper #44) – see also paper #40 for background theory The Performance of Phylogenetic Methods on Trees of Bounded Diameter, Nakhleh et al., WABI 2001 (paper #45) The Accuracy of Fast Phylogenetic Methods for Large Datasets, Nakhleh et al., PSB 2002 (paper #48)

Quartet-based methods vs. NJ Paper: St. John et al., J. Algorithms, 2003 (paper #61) Basic question: how accurate are trees computed using quartet-based phylogeny estimation methods, compared to Neighbor Joining (NJ)? Quartet-based methods (all better than NQM): – Q* (aka “the Buneman tree”) – Quartet-cleaning (Berry and Gascuel 2000, Berry et al., SODA 2000) – Quartet Puzzling (Strimmer and von Haeseler, MBE 1996)

Observations and Conclusions Quartet-based methods are generally less accurate than NJ, but the differences depend on the rate of evolution (and may depend on other model parameters as well) The theoretical guarantees for NJ and NQM (and hence for the quartet-based methods studied) are identical, so theory does not explain differences in performance. Main hypothesis: independently estimating quartet trees is less powerful than estimating a large tree – “taxon sampling” improves tree estimation. “NJ should be regarded as a universal lowest common denominator in phylogenetics…We suggest that a proposed method should be compared with NJ and abandoned if it does not offer a demonstrable advantage over NJ for substantial subproblem families.” (St. John et al., J. Alg., 2001)

Absolute Fast Converging Methods NJ, Maximum Likelihood, and many other phylogeny estimation methods are statistically consistent. But how much data (sequence length) do they need to reconstruct the true tree with high probability? This can be studied theoretically or empirically

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

From Nakhleh et al., ISMB 2001 and Bioinformatics 2001; see also Warnow et al., SODA 2001 (paper #40) Absolute Fast Converging Methods

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001] Theorem (Atteson): NJ is accurate with high probability given exponential sequence lengths NJ Error Rate No. Taxa

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001] DCM1-boosting makes distance-based methods more accurate Theoretical guarantees that DCM1-NJ converges to the true tree from polynomial length sequences NJ DCM1-NJ Error Rate

Figure 9 from Nakhleh et al., ISMB 2001 and Bioinformatics 2001 AFC: HGT-FP DCM*-NJ NOT AFC: DCM-NJ+MP NJ

Figure 3 from Nakhleh et al., ISMB 2001 and Bioinformatics 2001

Observations Several AFC methods (HGT-FP, DCM*-NJ, short quartet methods), but they are not all equally accurate – and may not even be more accurate than NJ (or other non-AFC methods) under some model conditions. AFC is a theoretical statement about asymptotic performance, but like statements about being polynomial time – the constants are hidden. Even so – the theory predicts difficulties for NJ given *high evolutionary distances*, and less error for AFC methods under those conditions.

Other questions Impact of rate of evolution? Impact of sequence length? Impact of number of taxa? Impact of substitution model (Jukes-Cantor, K2P, GTR, General Markov Model, etc.)? Comparison to MP, ML, Bayesian MCMC, etc?

Figure 3 from Nakhleh et al., WABI 2001

Figure 5(a) from Nakhleh et al., PSB 2002

Figure 4 from Nakhleh et al., PSB 2002

Figure 3(a), Nakhleh et al., PSB 2002

Figure 1 from Nakhleh et al., PSB 2002

Figure 6 from Nakhleh et al., PSB 2002

Things to think about What did you see about performance of MP and other methods that are not guaranteed to be statistically consistent? What was the best performing method so far? What would you suggest as a method to compute 4- taxon trees? What would you suggest as a method to compute a large tree with low evolutionary diameter? What would you suggest as a method to compute a large tree with high evolutionary diameter? Are there gaps between theoretical understanding and empirical performance for any methods?

Questions we did not address Performance of ML and Bayesian MCMC Impact of model misspecification What is easier – small trees or big trees? Impact of alignment error Species tree estimation from multiple genes

Are the big easy? Inferring complex phylogenies (Hillis, Nature 1996) Follow-up articles and letters, including – Are big trees indeed easy? (Trends in Ecology & Evolution, Volume 12, Issue 9, August 1997, Page 357, Ziheng Yang, Nick Goldman)