Character-Based Phylogeny Reconstruction

Character-Based Phylogeny Reconstruction Tanya Berger-Wolf CS502: Algorithms in Computational Biology February 28, 2017

Character-based methods for constructing phylogenies. In this approach, trees are constructed by comparing the characters of the corresponding species. Characters may be morphological (e.g. teeth structure) or molecular (e.g. nucleotides in homologous DNA sequences). One common approach is Maximum Parsimony. Common assumptions: characters are independent (no correlations), and the best tree is the one in which the fewest changes take place.

Character-based methods: Input. The input is a matrix with one row per species (dog, horse, frog, human, pig) and one column per character C1, C2, C3, C4, …, Cm; for example, the row for dog begins A C G T. Each character (column) is processed independently. The green character (highlighted on the original slide) will separate the human and pig from frog, horse and dog. The red character will separate the dog and pig from frog, horse and human. We seek a tree that best explains all characters simultaneously.

1. Maximum Parsimony: a character-based method. Input: h sequences (one per species), all of length k. Goal: Find a tree with the input sequences at its leaves, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized.

Example. Input: four nucleotide sequences, AAG, AAA, GGA and AGA, taken from four species. By the parsimony principle, we seek a tree that has the minimum total number of substitutions of symbols between species and their ancestors in the phylogenetic tree. Here is one possible tree. [Figure: a tree with leaves AGA, AAA, GGA, AAG; total #substitutions = 4]

There are many assignments for this tree. For example: [Figure: two assignments of internal sequences to a tree with leaves AGA, GGA, AAA, AAG; left: total #substitutions = 3; right: total #substitutions = 4] The left tree is preferred over the right tree. The total number of changes is called the parsimony score.
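The parsimony score of a fully labeled tree can be computed by summing the substitutions along every edge. The following is a minimal sketch, not the slide's own figure: the topology, the node names (`root`, `u`, `v`, `s1`..`s4`) and the two internal assignments are hypothetical, chosen only so that the four example sequences sit at the leaves.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def parsimony_score(edges, labels) -> int:
    """Sum of substitutions over all (parent, child) edges of a labeled tree."""
    return sum(hamming(labels[p], labels[c]) for p, c in edges)


# Hypothetical rooted topology; the node names are illustrative only.
edges = [("root", "u"), ("root", "v"),
         ("u", "s1"), ("u", "s2"),
         ("v", "s3"), ("v", "s4")]
leaves = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}

# Two assumed assignments of sequences to the internal nodes.
assignment_1 = {**leaves, "root": "AAA", "u": "AAA", "v": "AGA"}
assignment_2 = {**leaves, "root": "AAA", "u": "AAA", "v": "GGA"}

score_1 = parsimony_score(edges, assignment_1)  # 3 substitutions
score_2 = parsimony_score(edges, assignment_2)  # 4 substitutions
```

On this assumed topology the first assignment is the more parsimonious of the two, which is the kind of comparison the slide describes.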

Example with one-letter sequences. Suppose we have five species, such that three have ‘C’ and two have ‘T’ at a specified position. A minimal tree has only one evolutionary change, between T and C. [Figure: a tree on the five leaves with a single change between T and C]

Parsimony-Based Reconstruction. Two separate components:
(1) A procedure to find the minimum number of changes needed to explain the data for a given tree topology, where species are assigned to leaves.
(2) A search through the space of trees.
We will see efficient algorithms for (1). (2) is hard.

Example of input for a given tree. Species: Aardvark, Bison, Chimp, Dog, Elephant.
A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA
The tree and the assignment of strings to the leaves are given, and we need only assign strings to the internal vertices.

Fitch Algorithm. Input: A rooted binary tree with characters at the leaves. Output: A most parsimonious assignment of states to internal vertices. Work on each position independently. Make one pass from the leaves to the root, and another pass from the root to the leaves. [Figure: example tree with leaf states and candidate state sets such as A/T and A/C at internal vertices]

Fitch’s Algorithm. (1) Traverse the tree from leaves to root and fix a set of possible states (e.g. nucleotides) for each internal vertex. (2) Traverse the tree from root to leaves and pick a unique state for each internal vertex.

Fitch’s Algorithm – Phase 1. Do a post-order (from leaves to root) traversal of the tree and assign to each vertex a set of possible states. Each leaf has a unique possible state, given by the input. The set of possible states $R_i$ of internal node i with children j and k is given by: $R_i = R_j \cap R_k$ if $R_j \cap R_k \neq \emptyset$, and $R_i = R_j \cup R_k$ otherwise.

Fitch’s Algorithm – Phase 1. [Figure: example tree with leaf states C, T, G, C, A, T and internal state sets GC, CT, AGC] The number of substitutions in an optimal solution equals the number of union operations.

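Fitch’s Algorithm – Phase 2. Do a pre-order (from root to leaves) traversal of the tree. Select the state $r_j$ of internal node j with parent i as follows: $r_j = r_i$ if $r_i \in R_j$; otherwise $r_j$ is an arbitrary state in $R_j$. At the root, which has no parent, any state in its set may be chosen.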

Fitch’s Algorithm – Phase 2. [Figure: the Phase 1 example tree with leaf states C, T, G, C, A, T, internal state sets GC, CT, AGC, and the selected states] The algorithm could also select C as the assignment to the root. All other assignments are unique. Complexity: O(nk), where n is the number of leaves and k is the number of states. For m characters the complexity is O(nmk).
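Both phases can be sketched for a single character as follows. This is an illustrative sketch under stated assumptions: the nested-tuple tree encoding, the function name `fitch`, and the alphabetical tie-break (`min`) in Phase 2 are choices made here, not part of the slides. The example topology for the five-species C/T case is also assumed.

```python
def fitch(tree):
    """Run both phases of Fitch's algorithm for one character.

    `tree` is either a leaf state (a one-character string) or a pair
    (left, right). Returns (labeled_tree, substitutions); in labeled_tree
    every internal node is written as (state, left, right).
    """
    unions = 0

    def bottom_up(node):
        # Phase 1: post-order pass computing the candidate state sets.
        nonlocal unions
        if isinstance(node, str):
            return (frozenset(node), None, None)
        left, right = bottom_up(node[0]), bottom_up(node[1])
        common = left[0] & right[0]
        if common:
            states = common
        else:
            states = left[0] | right[0]
            unions += 1  # each union corresponds to one substitution
        return (states, left, right)

    def top_down(node, parent_state):
        # Phase 2: pre-order pass; keep the parent's state when possible.
        states, left, right = node
        # Assumed tie-break: alphabetically smallest state in the set.
        state = parent_state if parent_state in states else min(states)
        if left is None:
            return state
        return (state, top_down(left, state), top_down(right, state))

    return top_down(bottom_up(tree), None), unions


# Five species, three with C and two with T (topology is hypothetical).
labeled, cost = fitch(((("C", "C"), "C"), ("T", "T")))
```

For sequences with several positions, the same function would be applied to each position independently and the costs added up.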

Generalization: Weighted Parsimony Weighted Parsimony score: Each change is weighted by a score c(a,b). The weighted parsimony score reduces to the parsimony score when c(a,a)=0 and c(a,b)=1 for all b other than a.

Weighted Parsimony on a Given Tree. Each position is independent and is computed by itself. Use dynamic programming: if i is a node with children j and k, then $S(i,a) = \min_{b} \big( S(j,b) + c(a,b) \big) + \min_{b'} \big( S(k,b') + c(a,b') \big)$, where $S(j,b)$ is the optimal score of the subtree rooted at j when j has the character b. [Figure: node i with children j and k, annotated with S(i,a), S(j,b) and S(k,b')]

Evaluating Parsimony Scores (Sankoff’s algorithm). Dynamic programming on a given tree. Initialization: for each leaf i, set $S(i,a) = 0$ if i is labeled by a, otherwise $S(i,a) = \infty$. Iteration: if i is a node with children j and k, then $S(i,a) = \min_{x} \big( S(j,x) + c(a,x) \big) + \min_{y} \big( S(k,y) + c(a,y) \big)$. Termination: the cost of the tree is $\min_{x} S(r,x)$, where r is the root.

Cost of Evaluating Parsimony for binary trees. For a tree with n nodes and a single character with k values, the complexity is $O(nk^2)$. When there are m such characters, it is $O(nmk^2)$.
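The dynamic program can be sketched directly from the initialization, iteration and termination steps. This is a minimal sketch for one character; the nested-tuple tree encoding, the names `sankoff` and `unit_cost`, and the example topology are assumptions made for illustration.

```python
import math


def sankoff(tree, states, cost):
    """Return {a: S(root, a)} for one character on a rooted binary tree.

    `tree` is a leaf state (one-character string) or a pair (left, right);
    `cost(a, b)` is the weight c(a, b) of changing a into b.
    """
    if isinstance(tree, str):
        # Initialization: 0 for the observed state, infinity otherwise.
        return {a: (0 if a == tree else math.inf) for a in states}
    left, right = (sankoff(child, states, cost) for child in tree)
    # Iteration: best child state chosen independently for each child.
    return {
        a: min(left[x] + cost(a, x) for x in states)
        + min(right[y] + cost(a, y) for y in states)
        for a in states
    }


def unit_cost(a, b):
    """Unweighted parsimony: every change costs 1."""
    return 0 if a == b else 1


states = "ACGT"
tree = ((("C", "C"), "C"), ("T", "T"))  # hypothetical topology
root_scores = sankoff(tree, states, unit_cost)
tree_cost = min(root_scores.values())  # Termination: min over root states
```

Each internal node evaluates k states against k child states, which is where the $k^2$ factor in the complexity comes from. With `unit_cost` the result agrees with the unweighted parsimony score.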

Approximating Parsimony

2. Finding the right tree: The Perfect Phylogeny Problem The algorithms of Fitch (and Sankoff) assume that the tree is known. Finding the optimal tree is harder. Recall the general problem: Input: A set of species, specified by strings of characters. Output: A tree T, and assignment of species to the leaves of T, with minimum parsimony score. A restricted variant of this problem is the Perfect Phylogeny problem.

The Perfect Phylogeny Problem Basic assumption for the perfect phylogeny problem: A character is a significant property, which distinguishes between species (e.g. dental structure). Hence, characters in evolutionary trees should be “Homoplasy free”, as we define next.

Homoplasy-free characters 1. Characters in phylogenetic trees should avoid reversal transitions: a species regains a state its direct ancestor has lost. Well-known reversals: teeth in birds, legs in snakes.

Homoplasy-free characters 2. …and also avoid convergence transitions: two species possess the same state while their least common ancestor possesses a different state. Well-known convergence: the marsupials.

Characters as Colorings. A coloring of a tree T=(V,E) is a mapping $C: V \to$ [set of colors]. A partial coloring of T is a mapping defined on a subset of the vertices $U \subseteq V$: $C: U \to$ [set of colors].

Characters as Colorings (2) Each character defines a (partial) coloring of the corresponding phylogenetic tree: Species ≡ Vertices States ≡ Colors

Convex Colorings (and Characters). Let T=(V,E) be a colored tree, and d be a color. The d-carrier is the minimal subtree of T containing all vertices colored d. Definition: A (partial/total) coloring of a tree is convex iff all d-carriers are disjoint.

Convexity ⇔ Homoplasy Freedom. A character is homoplasy free (avoids reversal and convergence transitions) if and only if the corresponding (partial) coloring is convex.
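Convexity of a given (partial) coloring can be checked by computing each d-carrier and testing that no vertex lies in two carriers. The sketch below is one possible way to do this, assuming the tree is given as an undirected adjacency dictionary; the names `carrier` and `is_convex` and the six-vertex example tree are hypothetical.

```python
from collections import deque


def carrier(adj, targets):
    """Vertices of the minimal subtree of the tree `adj` spanning `targets`.

    Repeatedly prunes leaves that are not in `targets`.
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    queue = deque(v for v in adj if degree[v] <= 1 and v not in targets)
    while queue:
        v = queue.popleft()
        alive.discard(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] <= 1 and u not in targets:
                    queue.append(u)
    return alive


def is_convex(adj, coloring):
    """True iff the carriers of all colors are pairwise vertex-disjoint."""
    by_color = {}
    for vertex, color in coloring.items():
        by_color.setdefault(color, set()).add(vertex)
    seen = set()
    for vertices in by_color.values():
        c = carrier(adj, vertices)
        if c & seen:
            return False
        seen |= c
    return True


# Hypothetical tree: leaves 1, 2 hang off u; leaves 3, 4 hang off v.
adj = {1: ["u"], 2: ["u"], 3: ["v"], 4: ["v"],
       "u": [1, 2, "v"], "v": [3, 4, "u"]}
convex = is_convex(adj, {1: "R", 2: "R", 3: "B", 4: "B"})
not_convex = is_convex(adj, {1: "R", 3: "R", 2: "B", 4: "B"})
```

In the second coloring both carriers contain the internal vertices u and v, so the character would require a reversal or a convergence on this tree.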

The Perfect Phylogeny Problem Input: a set of species, and many characters. Question: is there a tree T containing the species as vertices, in which all the characters (colorings) are convex?

The Perfect Phylogeny Problem (pure graph-theoretic setting). Input: Partial colorings (C1,…,Ck) of a set of vertices U (in the example: 3 total colorings: left, center, right, each by two colors). Problem: Is there a tree T=(V,E) such that $U \subseteq V$ and, for i=1,…,k, Ci is a convex (partial) coloring of T? [Figure: four vertices with colorings RBR, RRR, BBR, RRB] The problem is NP-hard in general, and in P for some special cases. Next we show a polynomial-time algorithm for the case of binary characters.