Barking Up the Wrong Treelength Kevin Liu, Serita Nelesen, Sindhu Raghavan, C. Randal Linder, and Tandy Warnow IEEE TCBB 2009.

Minimizing Treelength. Generalized problem. Input: a set S of sequences and a function f(s, s') giving the edit distance between sequences s and s'. Output: a tree T, leaf-labelled by S, with additional sequences labelling the internal nodes of T, so as to minimize treelength (the total edit distance over the edges of the tree). Fixed-tree variant: the tree T is also given as input.
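
To make the treelength definition concrete, here is a minimal Python sketch that scores one fully labelled tree. It assumes unit costs for mismatches and indels, and the names edit_distance, treelength, edges, and labels are illustrative only; they are not taken from POY. It evaluates a given labelling and does not search for the best internal sequences or the best tree.

```python
def edit_distance(s, t, mismatch=1, indel=1):
    """Standard dynamic-programming edit distance (assumed unit costs)."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * indel]
        for j in range(1, len(t) + 1):
            sub = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            cur.append(min(sub, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


def treelength(edges, labels, dist=edit_distance):
    """Sum the edit distance over every edge of a fully labelled tree."""
    return sum(dist(labels[u], labels[v]) for u, v in edges)


# Hypothetical example: a star tree with one internal node x and leaves a, b, c.
labels = {"a": "ACGT", "b": "ACT", "c": "AGGT", "x": "ACGT"}
edges = [("x", "a"), ("x", "b"), ("x", "c")]
print(treelength(edges, labels))  # 0 + 1 + 1 = 2
```

Changing the sequence at x changes the score, which is why the optimization problem searches over internal labellings as well as trees.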

POY. POY (from the American Museum of Natural History, Ward Wheeler and colleagues) is the main software for minimizing treelength. Minimizing treelength is also known as “Direct Optimization”. POY has passionate adherents who believe in treelength. POY has also been heavily criticized.

POY. Input: a set S of sequences (unaligned), gap-open cost, gap-extend cost, and transition/transversion ratio. Default settings for gap-open and gap-extend in POY are “simple” (gap-open cost is 0). POY can also be used to score a fixed input tree under the desired treelength definition.

Ogden and Rosenberg 2007. The Ogden and Rosenberg study compared POY 3.0 to MP(ClustalW). Model conditions: mostly 16 taxa (some 64-taxon trees), K2P substitution model, short gaps (expected length 4). Optimization problem: multiple edit distances, all based on simple gap penalties (gap-open cost is 0). Performance metrics: tree errors and alignment errors; no mention of treelength. Result: MP(ClustalW) much more accurate than POY.

O&R concluded that Treelength is BAD! The O&R simulation study showed that POY alignments were worse than ClustalW alignments more than 99% of the time, and that POY trees were less accurate than those based on ClustalW on average. “Therefore, traditional multiple sequence alignment approaches appear to vastly outperform direct optimization-like approaches in terms of alignment accuracy, at least for the data sets and parameter settings that have been examined thus far.” (Ogden and Rosenberg 2007)

Treelength is BAD! “Although our data represents a fairly simple case, for data sets similar to these the traditional two-step approach will almost always give a more accurate alignment and will most likely recover equally or more accurate phylogenetic relationships than direct optimization as implemented in POY.”  Ogden and Rosenberg 2007

Our question: Does minimizing treelength work poorly in general, or is it minimizing treelength under simple gap penalties that works poorly?

Gap penalties. Simple: a gap of length k costs kC. Affine: a gap of length k costs C_open + k·C_extend. Other types of penalties are possible.
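
The difference between the two penalty types shows up when one long gap is compared with several short ones. The Python sketch below illustrates this; the function names and the numeric costs are chosen only for the example.

```python
def simple_gap_cost(k, c):
    """Simple penalty: a gap of length k costs k * C."""
    return k * c


def affine_gap_cost(k, c_open, c_extend):
    """Affine penalty: a gap of length k costs C_open + k * C_extend."""
    return c_open + k * c_extend if k > 0 else 0


# One gap of length 6 versus six separate gaps of length 1.
print(simple_gap_cost(6, 1), 6 * simple_gap_cost(1, 1))        # 6 6
print(affine_gap_cost(6, 4, 1), 6 * affine_gap_cost(1, 4, 1))  # 10 30
```

Under the simple penalty the two arrangements cost the same, so nothing discourages scattering many short gaps. Under the affine penalty each new gap pays the opening cost again, so a single long gap is much cheaper.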

“Treelength not so bad!” (paraphrasing Liu et al. 2009). Liu et al. 2009 show that treelength can be a good criterion if it is based upon an affine gap penalty. We developed POY*: a version of POY which uses a particular affine gap penalty and a particular starting tree.

Our Study 2008. Our study compares POY 4.0 to multiple methods. Model conditions: 25 and 100 taxa, GTR+Gamma substitution model, short and long gaps. Optimization problem: multiple edit distances, based upon both simple and affine gap penalties. Results: tree error, alignment error, and treelength.

Gap cost functions we studied. Simple1: all mismatches and indels cost 1. Simple2: indels cost 2, transversions cost 2, and transitions cost 1. Affine: a gap of length k costs 4 + k, transversions cost 2, and transitions cost 1.
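
As a rough illustration of how the three cost functions differ, the Python sketch below scores one fixed pairwise alignment under each of them. The names SCHEMES and alignment_cost and the example alignment are hypothetical, and the sketch assumes each indel position costs the stated indel price; it is not POY's implementation.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Costs as listed on the slide; the dictionary layout is an assumption.
SCHEMES = {
    "simple1": dict(transition=1, transversion=1, gap_open=0, gap_extend=1),
    "simple2": dict(transition=1, transversion=2, gap_open=0, gap_extend=2),
    "affine": dict(transition=1, transversion=2, gap_open=4, gap_extend=1),
}


def is_transition(a, b):
    """A transition swaps two purines or two pyrimidines."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def alignment_cost(row1, row2, scheme):
    """Cost of a gapped pairwise alignment ('-' marks a gap position)."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    p = SCHEMES[scheme]
    cost = 0
    gap_side = None  # which row the current run of gaps is in, if any
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            continue  # all-gap column carries no information
        if a == "-" or b == "-":
            side = 1 if a == "-" else 2
            if side != gap_side:  # a new gap starts here
                cost += p["gap_open"]
                gap_side = side
            cost += p["gap_extend"]
        else:
            gap_side = None
            if a != b:
                key = "transition" if is_transition(a, b) else "transversion"
                cost += p[key]
    return cost


# One transition (C/T), one gap of length 2, one transversion (A/C).
for name in SCHEMES:
    print(name, alignment_cost("ACGTTA", "AT--TC", name))
# simple1 4, simple2 7, affine 9
```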

Simulation Study Overview Model trees  Birth-death  Deviation from ultrametricity Sequence evolution Estimation of trees and alignments Statistics

Simulation Study Overview Model trees Sequence evolution  GTR model of evolution from Tree of Life project  Gamma-distributed rates across sites  Gap model Estimation of trees and alignments Statistics

Simulation Study Overview Model trees Sequence evolution Estimation of trees and alignments  POY  POY* - POY with particular starting tree (Probtree, using a particular Affine gap penalty  Several two-phase methods (best alignments followed by MP and ML)‏  PS (POY-score) on various trees Statistics

Simulation Study Overview Model trees Sequence evolution Estimation of trees and alignments Statistics 1. Alignment error 2. Tree error 3. Treelength under each gap cost function

Simulation Study Model Conditions 4 model conditions 80 replicate datasets apiece Different numbers of taxa allow us to explore taxonomic sampling effects

Results – Alignment Errors. Simple vs. affine penalties. Note: the story changes for affine penalties, especially under the long gap event distribution.

Alignment Error: ClustalW vs. POY*. POY* is better than ClustalW more than 50% of the time in (b), and 90% of the time in (a). Compare with Ogden and Rosenberg, who find ClustalW better than POY 99.9% of the time.

Results – Alignment Errors. PS is POY used to estimate alignments on various trees. Note: PS produces worse alignments than ClustalW if simple gap cost functions are used, even if applied to the true tree.

Tree error. POY and POY* both use the same gap penalty (affine). Results shown on 100-taxon short-gap simulated datasets (results for other models similar).

How well does POY solve its optimization problem? We examine the treelength found by POY for various model conditions. We let treelength be defined by simple1, simple2, or affine. We compare treelengths found by POY to treelengths achievable in each model condition (as produced by scoring the true tree and other trees).

Results – Simple Treelength Criteria

Results – Affine Treelength Criterion

Results - Treelengths. POY search finds short trees for simple gap penalties, but not for affine. Can we propose a better POY search for affine penalties? POY*.

How well does POY solve its optimization problem? Simple gap penalties: excellent performance. Affine gap penalties: poor performance. But POY* optimizes both well. The difference is just the starting tree.

Is it a good idea to optimize treelength? Simple gap penalties: NO! Worse trees and worse alignments. Affine gap penalties: Let’s see.

POY vs. POY* using affine gap

Insights. Simple gap penalties were a main cause behind Ogden and Rosenberg's findings: we were unable to obtain accurate POY alignments and trees under a simple treelength criterion. Using affine penalties, POY*: obtains alignments that are more accurate than ClustalW on 90% of long gap datasets, 75% of medium, and 55% of short; has tree accuracy that is comparable to the best two-phase method (ML on good alignments); but produces poorer alignments than the best alignment methods (e.g., Probtree).

Conclusions. Distinguish between the optimization problem and the heuristic methods used for that problem. The treelength optimization criterion chosen has a significant impact on tree and alignment error. Under simple criteria, alignments and trees aren't competitive with two-phase methods, and improving simple-criterion treelengths doesn't give better trees. The story for affine criteria is still open: Can we find shorter trees than two-phase trees? How accurate are such shorter trees?

Questions?