Algorithms for Inferring the Tree of Life


Tandy Warnow
Dept. of Computer Science
The University of Texas at Austin

Phylogeny
[Figure: tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona.]
A phylogeny is a tree representation of the evolutionary history relating the species we are interested in. The figure is an example of a phylogeny. At each leaf of the tree is a species, also called a taxon in phylogenetics (plural: taxa). The taxa are all distinct. Each internal node corresponds to a speciation event in the past. When reconstructing the phylogeny, we compare characteristics of the taxa, such as their appearance, physiological features, or the composition of their genetic material.

Evolution informs about everything in biology
Big genome sequencing projects just produce data. So what?
Evolutionary history relates all organisms and genes, and helps us understand and predict:
  interactions between genes (genetic networks)
  drug design
  functions of genes
  influenza vaccine development
  origins and spread of disease
  origins and migrations of humans

Reconstructing the “Tree” of Life
Handling large datasets: millions of species
NP-hard problems
Lots of computer science research to do

Steps in a phylogenetic analysis
1. Gather data
2. Align sequences
3. Estimate phylogeny on the multiple alignment
4. Estimate the reliable aspects of the evolutionary history (using bootstrapping, consensus trees, or other methods)
5. Perform post-tree analyses

DNA Sequence Evolution
[Figure: sequences evolving down a tree along a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). Sequences shown at the nodes include AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT, and TAGCCCA.]

Phylogeny Problem
[Figure: five present-day sequences (AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT) labeled U, V, W, X, Y, and a tree with leaves U, V, W, X, Y to be inferred from them.]

Phylogenetic reconstruction methods
Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)
[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]
Polynomial time distance-based methods: UPGMA, Neighbor Joining, FastME, Weighbor, etc.

Performance criteria
Running time.
Space.
Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution.
“Topological accuracy” with respect to the underlying true tree. Typically studied in simulation.
Accuracy with respect to a particular criterion (e.g. tree length or likelihood score), on real data.

How can we infer evolution?
While there are more than two sequences, DO
  Find the “closest” pair of sequences and make them siblings
  Replace the pair by a single sequence

That was called “UPGMA”
Advantages: UPGMA is polynomial time and works well under the “strong molecular clock” hypothesis.
Disadvantages: UPGMA does not work well in simulations, perhaps because the molecular clock hypothesis does not generally apply.
Other polynomial time methods, also distance-based, work better. One of the best of these is Neighbor Joining.
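The pairing loop described above can be sketched in Python. This is a minimal illustration, not the code behind any result in this talk: the function name `upgma`, the nested-tuple tree representation, and the toy distances are all assumptions made for the example. Distances to a merged cluster are updated by size-weighted averaging, which is the usual UPGMA rule, and merging continues until a single rooted tree remains.

```python
from itertools import combinations


def upgma(names, dist):
    """Return a rooted tree as nested tuples.

    names: list of taxon labels.
    dist:  dict mapping (a, b) pairs of labels to a distance (hypothetical input format).
    """
    size = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Find the "closest" pair of clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)  # make them siblings
        sa, sb = size.pop(a), size.pop(b)
        # Replace the pair by a single cluster with size-weighted average distances.
        for c in size:
            d[frozenset((merged, c))] = (
                sa * d[frozenset((a, c))] + sb * d[frozenset((b, c))]
            ) / (sa + sb)
        size[merged] = sa + sb
    (tree,) = size
    return tree


# Toy distances, invented for illustration only.
toy = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], toy))  # ('D', ('C', ('A', 'B')))
```

The toy distances are clock-like (every leaf is equally far from the root), which is the setting where UPGMA is expected to behave well.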

Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: example tree comparison with an FP edge marked; 50% error rate]
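One common way to make these rates concrete is to compare the internal edges of the true and inferred trees as bipartitions of the taxon set. The sketch below does that under stated assumptions: trees are nested tuples of leaf names, both trees are on the same taxa, and the helper names (`leaves`, `bipartitions`, `error_rates`) and the five-taxon example are invented for illustration.

```python
def leaves(tree):
    """Set of leaf names below a node (a leaf is a string, an internal node a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of the tree, each stored as the side without a reference taxon."""
    all_taxa = leaves(tree)
    ref = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if ref in side:
            side = all_taxa - side
        # Keep only internal edges: both sides must contain at least two taxa.
        if 1 < len(side) < len(all_taxa) - 1:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def error_rates(true_tree, est_tree):
    """Return (FN rate, FP rate); assumes each tree has at least one internal edge."""
    t, e = bipartitions(true_tree), bipartitions(est_tree)
    return len(t - e) / len(t), len(e - t) / len(e)


true_tree = (("A", "B"), "C", ("D", "E"))
est_tree = (("A", "C"), "B", ("D", "E"))
print(error_rates(true_tree, est_tree))  # (0.5, 0.5)
```

In this toy case one of the two true internal edges is missing and one of the two inferred internal edges is incorrect, so both rates are 50%.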

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]
Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees.
[Figure: NJ error rate (y-axis, ticks 0.2 to 0.8) versus number of taxa (x-axis, ticks 400 to 1600)]

Other standard polynomial time methods don’t improve substantially on NJ (and have the same problem with large diameter datasets). What about trying to “solve” maximum parsimony or maximum likelihood?

Maximum Parsimony
Input: Set S of n aligned sequences of length k
Output: A phylogenetic tree T leaf-labeled by the sequences in S, with additional sequences of length k labeling the internal nodes of T, such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized, where E(T) is the edge set of T and H(i,j) denotes the Hamming distance between the sequences at nodes i and j
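The quantity being minimized is simply the sum of Hamming distances over the edges of a fully labeled tree. A minimal sketch of that objective follows; the edge-list representation, the node names, and the particular internal labels are assumptions chosen for illustration.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t))


def tree_length(edges, label):
    """Sum of H(i, j) over all edges (i, j), given a sequence label for every node."""
    return sum(hamming(label[i], label[j]) for i, j in edges)


# A four-leaf tree with internal nodes u and v (hypothetical node names).
edges = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
label = {
    "a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA",  # leaf sequences
    "u": "ACA", "v": "GTA",                          # one choice of internal labels
}
print(tree_length(edges, label))  # 4
```

Maximum parsimony asks for the tree and the internal labels that make this sum as small as possible.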

Maximum parsimony (example)
Input: Four sequences
  ACT
  ACA
  GTT
  GTA
Question: which of the three trees has the best MP score?

Maximum Parsimony
[Figure: the three possible trees on the leaves ACT, ACA, GTT, and GTA]

Maximum Parsimony
[Figure: the three trees with sequences labeling the internal nodes and Hamming distances on the edges. Scores shown: MP score = 7, MP score = 5, and MP score = 4 (the optimal MP tree).]

Maximum Parsimony: computational complexity
[Figure: the optimal tree on ACT, ACA, GTT, and GTA, with MP score = 4]
Finding the optimal MP tree is NP-hard
For a given tree, an optimal labeling can be computed in linear time O(nk)
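The fixed-tree computation can be sketched with the bottom-up pass of what is commonly called Fitch's algorithm, a name the slide itself does not use. The sketch assumes a rooted binary tree written as nested pairs of leaf names (an unrooted tree can be rooted on any edge without changing the score) and computes only the optimal score, not the internal labels. The function name `fitch` and the leaf names `s1` to `s4` are illustrative.

```python
def fitch(node, seqs):
    """Return (state sets per site, parsimony cost) for the subtree at node.

    node: a leaf name (str) or a pair (left, right) of subtrees.
    seqs: dict mapping leaf names to aligned sequences.
    """
    if isinstance(node, str):
        return [{ch} for ch in seqs[node]], 0
    left, right = node
    left_sets, left_cost = fitch(left, seqs)
    right_sets, right_cost = fitch(right, seqs)
    sets, cost = [], left_cost + right_cost
    for a, b in zip(left_sets, right_sets):
        common = a & b
        if common:
            sets.append(common)   # children agree on some state: no change needed
        else:
            sets.append(a | b)    # children disagree: one change at this site
            cost += 1
    return sets, cost


seqs = {"s1": "ACT", "s2": "ACA", "s3": "GTT", "s4": "GTA"}
best_tree = (("s1", "s2"), ("s3", "s4"))
print(fitch(best_tree, seqs)[1])  # 4
```

Each node is visited once and does a constant amount of set work per site for a fixed alphabet, which is where the O(nk) bound comes from.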

Solving NP-hard problems exactly is … unlikely
Number of (unrooted) binary trees on n leaves is (2n-5)!!
#leaves    #trees
4          3
5          15
6          105
7          945
8          10395
9          135135
10         2027025
20         2.2 x 10^20
100        1.7 x 10^182
1000       1.9 x 10^2860
If each tree on 1000 taxa could be analyzed in 0.001 seconds, examining all of them to find the best tree would take on the order of 10^2847 millennia
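The counts in this table follow directly from the (2n-5)!! formula and can be recomputed exactly with Python's arbitrary-precision integers. The sketch below is illustrative: the function names are invented, and the `sci` helper truncates rather than rounds the leading digits.

```python
def num_unrooted_binary_trees(n):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), the number of unrooted binary trees on n leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


def sci(x):
    """Rough scientific notation for a large positive integer (leading digits truncated)."""
    digits = str(x)
    return f"{digits[0]}.{digits[1]} x 10^{len(digits) - 1}"


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_binary_trees(n))
for n in (20, 100, 1000):
    print(n, sci(num_unrooted_binary_trees(n)))
```

Multiplying the count for 1000 leaves by 0.001 seconds per tree gives the running time of exhaustive search, which is why exact enumeration is out of the question at this scale.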

Approaches for “solving” MP/ML
Hill-climbing heuristics (which can get stuck in local optima)
Randomized algorithms for getting out of local optima
Approximation algorithms for MP (based upon Steiner Tree approximation algorithms).
[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]
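The local-optimum problem can be seen on a deliberately tiny example. The sketch below is a toy, not a phylogenetic search: the "solutions" are positions on an invented one-dimensional cost landscape and neighbors are adjacent positions, whereas in MP/ML search the solutions would be trees and the neighbors would be trees one rearrangement away. The restart points are listed explicitly to keep the run reproducible; a randomized method would draw them at random.

```python
def hill_climb(cost, start, neighbors):
    """Move to the best neighbor until no neighbor improves the cost."""
    current = start
    while True:
        best = min(neighbors(current), key=cost)
        if cost(best) >= cost(current):
            return current  # local optimum: no neighbor is strictly better
        current = best


def with_restarts(cost, neighbors, starts):
    """Run hill climbing from several starting points and keep the best result."""
    return min((hill_climb(cost, s, neighbors) for s in starts), key=cost)


# Invented landscape: position 1 is a local optimum, position 5 the global one.
landscape = [5, 3, 4, 6, 2, 1, 7]


def cost(i):
    return landscape[i]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]


print(hill_climb(cost, 0, neighbors))             # 1 (stuck in a local optimum)
print(with_restarts(cost, neighbors, [0, 3, 6]))  # 5 (global optimum)
```

A single run from position 0 stops at cost 3, while restarting from other positions reaches cost 1.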

Problems with current techniques for MP
Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.
[Figure: Performance of TNT with time]

Observations
The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets.
Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.
Apparent convergence can be misleading.

Empirical problems with existing methods
Heuristics for Maximum Parsimony (MP) and Maximum Likelihood (ML) cannot handle large datasets (take too long!): we need new heuristics for MP/ML that can analyze large datasets
Polynomial time methods have poor topological accuracy on large diameter datasets: we need better polynomial time methods

My research
Focused on the design and analysis of algorithms for large-scale phylogeny reconstruction and multiple sequence alignment.
Objective: the design of new algorithms with better performance than existing algorithms, as evidenced by mathematical theory, experiment, or empirical studies.
Collaborations with biologists for modelling and data analysis.
Current group: four PhD students, one postdoc, and two undergrads.

What happens after the analysis?
The result of a phylogenetic analysis is often thousands (or tens of thousands) of equally good trees.
How should we analyze the set of trees? How can we store the set of trees?
Current approaches use consensus methods, as well as other techniques, to try to infer the likely characteristics of the “true tree”.
Current techniques use too much space, take too much time, and are not sufficiently informative.

General comments
There is interesting computer science research to be done in computational phylogenetics, with a tremendous potential for impact.
New algorithms must be tested on both real and simulated data.
The interplay between data, stochastic models of evolution, optimization problems, and algorithms is important and instructive.