Using traveling salesman problem algorithms for evolutionary tree construction Chantal Korostensky and Gaston H. Gonnet Presentation by: Ben Snider.

The problem… Construction of optimal evolutionary trees is NP-complete. We want heuristics!

Definition 1 A phylogenetic tree T = (V, E): ‘V’ are the vertices or nodes, ‘E’ are the edges, and T(S) is a tree with leaf set S.

Let’s break it down A tree defined by T(S) contains a set of sequences. The set of sequences is defined as S = {s_1, …, s_n}. The root of the tree has no relevance in our context. In a phylogenetic tree T = (V, E), the internal nodes in ‘V’ represent (usually unknown) ancestor sequences.

Scoring schemes: parsimony and compatibility methods; distance-based methods; maximum likelihood methods.

Parsimony Count the number of amino acid or nucleotide substitutions in a weighted or unweighted manner. Take a multiple sequence alignment (MSA) as input and minimize the number of changes needed to explain the evolutionary tree. Constructing an optimal MSA is also NP-complete. RATS!

Parsimony drawback Many algorithms for calculating an MSA need an evolutionary tree as input. You are only as good as your last model. DOUBLE RATS!

Distance Matrix Methods Fit a tree to a matrix of pairwise distances between the sequences. Usually use some form of weighted or unweighted least-squares measure.

Distance Matrix Drawbacks The goal is to find distances such that the score of the tree is minimized. To be truly assured of a minimum value you must try all tree topologies. The number of possible tree topologies grows super-exponentially as you add leaves to your tree.
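To make the growth concrete, here is a minimal sketch that counts unrooted binary tree topologies on n labeled leaves using the standard double-factorial formula (2n-5)!!. The formula is not stated on the slides, and the function name is chosen only for illustration.

```python
def num_unrooted_topologies(n_leaves: int) -> int:
    """Count unrooted binary tree topologies on n labeled leaves: (2n-5)!!."""
    if n_leaves < 3:
        return 1  # with fewer than three leaves there is only one shape
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_topologies(n))
```

Even at 20 leaves the count is far beyond what an exhaustive search over topologies could visit, which is why the slides look for a heuristic.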

Maximum Likelihood Method Choose a tree which maximizes the probability that the observed data would have occurred. Generate all possible topologies and use the lengths of the edges that maximize the likelihood.

And the salesman chooses… Maximum Likelihood Method. Input: a set of unaligned amino acid sequences. Output: a tree with a minimum score. Error checking: the tree is correct if the error in each distance is no greater than x/2, where x is the length of the shortest edge in the tree.

Definition 3 Let T be the set of all possible trees that can be generated for a given set of sequences ‘S’ = {s_1, …, s_n}. The optimal tree ‘t*’ is a tree such that F(t*) is the minimum of F(t) over all trees t in T. Think golf. The lower the score the better. If your function gives a higher score to better alignments, then multiply it by -1.

Definition 4 The optimal pairwise alignment of two sequences (s_1, s_2) is an alignment with the maximum score where a probabilistic scoring method is used. Use PAM distances.

Definition 5 One PAM unit of evolution changes, on average, 1% of the amino acids. PAM(s_1, s_2) is the PAM distance that maximizes the score of the optimal pairwise alignment of s_1 and s_2.

Why not use the Sum of Pairs? Sum of Pairs is a well-known scoring function for MSAs. If we were to add ‘ticks’ each time an edge is used when calculating the sum of the edges, we would get this…

Sum of Pairs Example

Sum of Pairs Drawbacks There is no theoretical justification to weigh some branches more than others. It is not simply the root that is weighted more than others. Sum of Pairs methods are intrinsically problematic from an evolutionary perspective for scoring MSAs.
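The uneven weighting can be checked by counting "ticks" directly. The sketch below uses a hypothetical five-leaf tree (leaves A to E, internal nodes u, v, w; none of this comes from the slides) and counts, for every edge, how many leaf-to-leaf paths cross it when all pairs are summed.

```python
from itertools import combinations

# Hypothetical unrooted tree, chosen only for illustration.
EDGES = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"),
         ("v", "w"), ("D", "w"), ("E", "w")]
LEAVES = ["A", "B", "C", "D", "E"]


def build_adjacency(edges):
    """Map each node to the list of its neighbours."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def path_edges(adj, start, goal):
    """Return the edges on the unique tree path from start to goal."""
    stack = [(start, None, [])]
    while stack:
        node, parent, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, path + [frozenset((node, nxt))]))
    raise ValueError("nodes are not connected")


def sum_of_pairs_edge_counts(edges, leaves):
    """Count how often each edge is used over all pairs of leaves."""
    adj = build_adjacency(edges)
    counts = {frozenset(e): 0 for e in edges}
    for a, b in combinations(leaves, 2):
        for edge in path_edges(adj, a, b):
            counts[edge] += 1
    return counts


for edge, ticks in sum_of_pairs_edge_counts(EDGES, LEAVES).items():
    print(sorted(edge), ticks)
```

In this example the two internal edges collect six ticks each while every leaf edge collects four, so some branches count more than others for no evolutionary reason.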

So we grab the salesman… (Definition 6) A circular order C(T) of a set of sequences (S) is a tour through the tree T(S) where each edge is traversed exactly twice and each leaf is visited only once. More pictures…

The Tour
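One way to read a circular order off a known tree is a depth-first walk that records the leaves in the order they are reached. This is a minimal sketch on a hypothetical five-leaf tree (labels A to E and u, v, w are assumptions, not from the slides); a walk like this crosses each edge once going down and once coming back, which matches the "exactly twice" condition in Definition 6.

```python
# Hypothetical unrooted tree, chosen only for illustration.
EDGES = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"),
         ("v", "w"), ("D", "w"), ("E", "w")]


def build_adjacency(edges):
    """Map each node to the list of its neighbours, in input order."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def circular_order(adj, start):
    """Return the leaves in the order a depth-first tour first reaches them."""
    order = []

    def visit(node, parent):
        if len(adj[node]) == 1:  # a node with one neighbour is a leaf
            order.append(node)
        for nxt in adj[node]:
            if nxt != parent:
                visit(nxt, node)

    visit(start, None)
    return order


print(circular_order(build_adjacency(EDGES), "A"))
```

A tree has more than one circular order: visiting the neighbours of an internal node in a different sequence gives another valid tour.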

Therefore we score our tree… The scoring function is based on the circular order. Add all the PAM distances (represented by the edges) from our circular path. Divide by two, because we want to count each edge only once.
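The scoring rule can be sketched in a few lines. The distances below are hypothetical: they are the leaf-to-leaf path lengths of a made-up five-leaf tree whose edge lengths sum to 18, standing in for PAM distances. The names `PAIR_DISTANCES` and `circular_score` are illustrative only.

```python
# Hypothetical pairwise distances taken from a made-up tree with
# total edge length 18; they stand in for measured PAM distances.
PAIR_DISTANCES = {
    ("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 6, ("A", "E"): 10,
    ("B", "C"): 8, ("B", "D"): 7, ("B", "E"): 11,
    ("C", "D"): 7, ("C", "E"): 11,
    ("D", "E"): 6,
}
DIST = {frozenset(pair): d for pair, d in PAIR_DISTANCES.items()}


def circular_score(order, dist):
    """Sum the distances between neighbours in the circular order, then halve."""
    n = len(order)
    total = sum(dist[frozenset((order[i], order[(i + 1) % n]))]
                for i in range(n))
    return total / 2  # every edge was crossed twice on the tour


print(circular_score(["A", "B", "C", "D", "E"], DIST))  # a correct circular order
print(circular_score(["A", "C", "B", "D", "E"], DIST))  # a wrong order scores higher
```

With exact distances, a correct circular order recovers the total edge length of the tree (18 here), while the wrong order gives 19 because it crosses some edge more than twice.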

Does this save time? The problem is basically a symmetric Traveling Salesman Problem (TSP). The problem is to find the shortest route where each city is visited exactly once. *Our cities are amino acid sequences and our distances are the PAM distances of the pairwise alignments.* TSP optimal solutions can be calculated in a few hours for up to 1000 cities. For up to 100 cities it only takes a few seconds. We will rarely have more than 100 amino acid sequences to compare at any single time.
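To show the reduction, here is a brute-force sketch that finds the shortest tour over a small hypothetical distance matrix (five sequences, distances standing in for PAM distances). Brute force is used only because it is short and obviously correct; it is not the exact TSP solver the slides have in mind and it would not scale to 100 cities.

```python
from itertools import permutations

# Hypothetical symmetric distance matrix for sequences A..E
# (leaf-to-leaf path lengths of a made-up tree with total length 18).
NAMES = ["A", "B", "C", "D", "E"]
DIST = [
    [0, 5, 7, 6, 10],
    [5, 0, 8, 7, 11],
    [7, 8, 0, 7, 11],
    [6, 7, 7, 0, 6],
    [10, 11, 11, 6, 0],
]


def shortest_tour(dist):
    """Try every tour that starts at city 0 and return the shortest one."""
    n = len(dist)
    best_tour, best_len = None, None
    for rest in permutations(range(1, n)):
        tour = (0,) + rest
        length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if best_len is None or length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len


tour, length = shortest_tour(DIST)
print([NAMES[i] for i in tour], length, length / 2)
```

The shortest tour has length 36, and halving it gives the tree score of 18, so solving the TSP on the pairwise distances yields a circular order without enumerating tree topologies.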

What about error? Determine how large the distance measurement error may be, such that we still get a correct order. Do the opposite and determine the smallest possible error such that we get a wrong circular order. This means at least one edge was traversed more than twice. That edge is the smallest edge because we want to find the smallest possible error.

What about error?

If the output of the TSP algorithm is a wrong circular order, then the following inequality must be satisfied…

What about error?

Figure 4

Conclusion: The tree is correct if the error in each distance is no greater than x/2, where x is the length of the shortest edge in the tree. Using the TSP as a heuristic saves us some time, but is not always correct. But it’s better than looking at every possible tree topology!

The End Questions? Comments? Concerns?