CSCI2950-C Lecture 8 Molecular Phylogeny: Parsimony and Likelihood

Slides:



Advertisements
Similar presentations
Parsimony Small Parsimony and Search Algorithms Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

Phylogenetic Tree A Phylogeny (Phylogenetic tree) or Evolutionary tree represents the evolutionary relationships among a set of organisms or groups of.
PHYLOGENETIC TREES Bulent Moller CSE March 2004.
. Phylogenetic Trees (2) Lecture 13 Based on: Durbin et al 7.4, Gusfield , Setubal&Meidanis 6.1.
. Class 9: Phylogenetic Trees. The Tree of Life Evolution u Many theories of evolution u Basic idea: l speciation events lead to creation of different.
Parsimony based phylogenetic trees Sushmita Roy BMI/CS 576 Sep 30 th, 2014.
. Phylogenetic Trees (2) Lecture 13 Based on: Durbin et al 7.4, Gusfield , Setubal&Meidanis 6.1.
Phylogenetic reconstruction
Molecular Evolution Revised 29/12/06
Tree Reconstruction.
1 Discrete Structures & Algorithms Graphs and Trees: II EECE 320.
. Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.
UPGMA and FM are distance based methods. UPGMA enforces the Molecular Clock Assumption. FM (Fitch-Margoliash) relieves that restriction, but still enforces.
CISC667, F05, Lec14, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (I) Maximum Parsimony.
CSE182-L17 Clustering Population Genetics: Basics.
Phylogeny Tree Reconstruction
. Comput. Genomics, Lecture 5b Character Based Methods for Reconstructing Phylogenetic Trees: Maximum Parsimony Based on presentations by Dan Geiger, Shlomo.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter : Strings and.
Phylogenetic Networks of SNPs with Constrained Recombination D. Gusfield, S. Eddhu, C. Langley.
Probabilistic methods for phylogenetic trees (Part 2)
Building Phylogenies Parsimony 2.
Building Phylogenies Parsimony 1. Methods Distance-based Parsimony Maximum likelihood.
Perfect Phylogeny MLE for Phylogeny Lecture 14
Phylogenetic Tree Construction and Related Problems Bioinformatics.
. Phylogenetic Trees (2) Lecture 13 Based on: Durbin et al 7.4, Gusfield , Setubal&Meidanis 6.1.
Phylogenetic trees Sushmita Roy BMI/CS 576
What Is Phylogeny? The evolutionary history of a group.
Terminology of phylogenetic trees
Parsimony and searching tree-space Phylogenetics Workhop, August 2006 Barbara Holland.
Phylogenetics Alexei Drummond. CS Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) (B)
Phylogenetic Analysis
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Phylogenetics II.
Introduction to Bioinformatics Algorithms Molecular Evolution.
Phylogenetic Prediction Lecture II by Clarke S. Arnold March 19, 2002.
Introduction to Bioinformatics Algorithms Clustering and Molecular Evolution.
Using traveling salesman problem algorithms for evolutionary tree construction Chantal Korostensky and Gaston H. Gonnet Presentation by: Ben Snider.
Evolutionary tree reconstruction (Chapter 10). Early Evolutionary Studies Anatomical features were the dominant criteria used to derive evolutionary relationships.
Ch.6 Phylogenetic Trees 2 Contents Phylogenetic Trees Character State Matrix Perfect Phylogeny Binary Character States Two Characters Distance Matrix.
Evolutionary tree reconstruction
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
1 Population Genetics Basics. 2 Terminology review Allele Locus Diploid SNP.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2015.
Maximum Likelihood Given competing explanations for a particular observation, which explanation should we choose? Maximum likelihood methodologies suggest.
Phylogeny Ch. 7 & 8.
1 Alignment Matrix vs. Distance Matrix Sequence a gene of length m nucleotides in n species to generate an… n x m alignment matrix n x n distance matrix.
Phylogenetic Trees - Parsimony Tutorial #13
. Perfect Phylogeny Tutorial #10 © Ilan Gronau Original slides by Shlomo Moran.
Parsimony and searching tree-space. The basic idea To infer trees we want to find clades (groups) that are supported by synapomorpies (shared derived.
Probabilistic methods for phylogenetic tree reconstruction BMI/CS 576 Colin Dewey Fall 2015.
Probabilistic Approaches to Phylogenies BMI/CS 576 Sushmita Roy Oct 2 nd, 2014.
. Perfect Phylogeny MLE for Phylogeny Lecture 14 Based on: Setubal&Meidanis 6.2, Durbin et. Al. 8.1.
Phylogenetic Trees - Parsimony Tutorial #12
Molecular Evolution and Phylogeny
Phylogenetic basis of systematics
394C, Spring 2012 Jan 23, 2012 Tandy Warnow.
Character-Based Phylogeny Reconstruction
Multiple Alignment and Phylogenetic Trees
CSE 5290: Algorithms for Bioinformatics Fall 2009
CSE 5290: Algorithms for Bioinformatics Fall 2011
BNFO 602 Phylogenetics Usman Roshan.
CS 581 Tandy Warnow.
Molecular Evolution.
Outline Cancer Progression Models
Phylogeny.
Analysis of Algorithms CS 477/677
Perfect Phylogeny Tutorial #10
Presentation transcript:

CSCI2950-C Lecture 8 Molecular Phylogeny: Parsimony and Likelihood http://cs.brown.edu/courses/csci2950-c/

Phylogenetic Trees How are these trees built from DNA sequences? 1 4 3 2 5 Leaves represent existing species Internal vertices represent ancestors Root represents the oldest evolutionary ancestor

Phylogenetic Trees How are these trees built from DNA sequences? 1 4 3 2 5 Methods Distance Parsimony Minimum number of mutations Likelihood Probabilistic model of mutations

Outline Last Lecture: distance-based Methods Additive distances 4 Point condition UPGMA & Neighbor joining Today: Parsimony-based methods Sankoff + Fitch’s algorithms Likelihood Methods Perfect Phylogeny

Weighted Small Parsimony Problem: Formulation Input: Tree T with each leaf labeled by elements of a k-letter alphabet and a k x k scoring matrix (ij) Output: Labeling of internal vertices of the tree T minimizing the weighted parsimony score

Sankoff Algorithm Dynamic Programming Calculate and keep track of a score for every possible label at each vertex st(v) = minimum parsimony score of the subtree rooted at vertex v if v has character t The score at each vertex is based on scores of its children: st(parent) = mini {si( left child ) + i, t} + minj {sj( right child ) + j, t}

Sankoff Algorithm (cont.) Begin at leaves: If leaf has the character in question, score is 0 Else, score is 

Sankoff Algorithm (cont.) st(v) = mini {si(u) + i, t} + minj{sj(w) + j, t} si(u) i, A sum A T  3 G 4 C 9 sA(v) = 0 sA(v) = mini{si(u) + i, A} + minj{sj(w) + j, A}

Sankoff Algorithm (cont.) st(v) = mini {si(u) + i, t} + minj{sj(w) + j, t} sj(u) j, A sum A  T 3 G 4 C 9 sA(v) = 0 sA(v) = mini{si(u) + i, A} + minj{sj(w) + j, A} + 9 = 9

Sankoff Algorithm (cont.) st(v) = mini {si(u) + i, t} + minj{sj(w) + j, t} Repeat for T, G, and C

Sankoff Algorithm (cont.) Repeat for right subtree

Sankoff Algorithm (cont.) Repeat for root

Sankoff Algorithm (cont.) Smallest score at root is minimum weighted parsimony score In this case, 9 – so label with T

Sankoff Algorithm: Traveling down the Tree The scores at the root vertex have been computed by going up the tree After the scores at root vertex are computed the Sankoff algorithm moves down the tree and assign each vertex with optimal character.

Sankoff Algorithm (cont.) 9 is derived from 7 + 2 So left child is T, And right child is T

Sankoff Algorithm (cont.) And the tree is thus labeled…

Fitch’s Algorithm Solves Small Parsimony problem Published 4 years before Sankoff (1971) Assigns a set of letters to every vertex in the tree, S(v) S(l) = observed character for each leaf l

Fitch’s Algorithm: Example {a,c} {t,a} c t a a a a a a {a,c} {t,a} a a c t a a c t

Fitch Algorithm 1) Assign a set of possible letters Sv to every vertex vertex v, traversing the tree from leaves to root For vertex v with children u and w: Sv = Su “intersect” Sw if non-empty intersection Su “union” Sw , otherwise E.g. if the node we are looking at has a left child labeled {A, C} and a right child labeled {A, T}, the node will be given the set {A, C, T}

Fitch Algorithm (cont.) 2) Assign labels to each vertex, traversing the tree from root to leaves Assign root arbitrarily from its set of letters For all other vertices, if its parent’s label is in its set of letters, assign it its parent’s label Else, choose an arbitrary letter from its set as its label

Fitch Algorithm (cont.)

Fitch vs. Sankoff Both have an O(nk) runtime Are they actually different? Let’s compare …

Fitch As seen previously:

Comparison of Fitch and Sankoff As seen earlier, the scoring matrix for the Fitch algorithm is merely: So let’s do the same problem using Sankoff algorithm and this scoring matrix A T G C 1

Sankoff

Sankoff vs. Fitch The Sankoff algorithm gives the same set of optimal labels as the Fitch algorithm For Sankoff algorithm, character t is optimal for vertex v if st(v) = min1<i<ksi(v) Let Sv = set of optimal letters for v. Then Sv = Su “intersect” Sw if non-empty intersection Su “union” Sw , otherwise This is also the Fitch recurrence The two algorithms are identical

Large Parsimony Problem Input: An n x m matrix M describing n species, each represented by an m-character string Output: A tree T with n leaves labeled by the n rows of matrix M, and a labeling of the internal vertices such that the parsimony score is minimized over all possible trees and all possible labelings of internal vertices

Large Parsimony Problem (cont.) Possible search space is huge, especially as n increases (2n – 3)!! possible rooted trees (2n – 5)!! possible unrooted trees Problem is NP-complete Exhaustive search only possible w/ small n(< 10) Hence, branch and bound or heuristics used

Nearest Neighbor Interchange A Greedy Algorithm A Branch Swapping algorithm Only evaluates a subset of all possible trees Defines a neighbor of a tree as one reachable by a nearest neighbor interchange A rearrangement of the four subtrees defined by one internal edge Only three different rearrangements per edge

Nearest Neighbor Interchange

Nearest Neighbor Interchange Start with an arbitrary tree and check its neighbors Move to a neighbor if it provides the best improvement in parsimony score No way of knowing if the result is the most parsimonious tree Could be stuck in local optimum

Nearest Neighbor Interchange

Subtree Pruning and Regrafting Another Branch Swapping Algorithm http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/SPR.gif

Tree Bisection and Reconnection Another Branch Swapping Algorithm Most extensive swapping routine

Homoplasy Given: 1: CAGCAGCAG 2: CAGCAGCAG 3: CAGCAGCAGCAG 4: CAGCAGCAG 5: CAGCAGCAG 6: CAGCAGCAG 7: CAGCAGCAGCAG Most would group 1, 2, 4, 5, and 6 as having evolved from a common ancestor, with a single mutation leading to the presence of 3 and 7

Homoplasy But what if this was the real tree?

Homoplasy 6 evolved separately from 4 and 5 Parsimony groups 4, 5, and 6 together as having evolved from a common ancestor Homoplasy: Independent (or parallel) evolution of same/similar characters Parsimony results minimize homoplasy, so if homoplasy is common, parsimony may give wrong results

Contradicting Characters An evolutionary tree is more likely to be correct when it is supported by multiple characters Human Lizard MAMMALIA Hair Single bone in lower jaw Lactation etc. Frog Dog Note: In this case, tails are homoplastic

Perfect Phylogeny Evolutionary model Binary characters {0,1} Each character changes state only once in evolutionary history (no homoplasy!). Tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. For simplicity, assume root = (0, 0, 0, 0, 0) How can one reconstruct such a tree? 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0 species traits 1

The 4-gamete condition A column i partitions the set of species into two sets i0, and i1 A column is homogeneous w.r.t a set of species, if it has the same value for all species. Otherwise, it is heterogeneous. Example: i is heterogeneous w.r.t {A,D,E} i A 0 B 0 C 0 D 1 E 1 F 1 i0 i1

4 Gamete Condition There exists a perfect phylogeny if and only if for all pair of columns (i, j), j is homogenous w.r.t i0 or i1. Equivalently, There exists a perfect phylogeny if and only if for all pairs of columns (i, j), the following 4 rows do not exist (0,0), (0,1), (1,0), (1,1) i A 0 B 0 C 0 D 1 E 1 F 1 i0 i1

4-gamete condition: proof (only if) Every perfect phylogeny satisfies the 4-gamete condition Depending on which edge the mutation j occurs, either i0, or i1 should be homogenous. (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist? Need to give an algorithm… i0 i1 i

An algorithm for constructing a perfect phylogeny We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later. In any tree, each node (except the root) has a single parent. It is sufficient to construct a parent for every node. In each step, we add a column and refine some of the nodes containing multiple children. Stop if all columns have been considered.

Inclusion Property For any pair of columns i, j: i < j if and only if i1  j1 Note that if i < j then the edge containing i is an ancestor of the edge containing j i j

Example r A B C D E 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0 Initially, there is a single clade r, and each node has r as its parent

Sort columns Sort columns according to the inclusion property: i < j if and only if i1  j1 This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0

Add first column In adding column i 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0 In adding column i Check each edge and decide which side you belong. Finally add a node if you can resolve a clade r u B D A C E

Adding other columns 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0 Add other columns on edges using the ordering property r 1 3 E 2 B 5 4 D A C

Unrooted case Switch the values in each column, so that 0 is the majority element. Apply the algorithm for the rooted case

Problems with Parsimony Ignores branch lengths on trees A A A A A A A C A A A A A C Same parsimony score. Mutation “more likely” on longer branch.

Maximum Likelihood See Class Notes

Algorithm Summary Method Input Output Neighbor Joining Distance matrix D T, B UPGMA Sankoff’s & Fitch’s Alg. Characters, T A, B Perfect Phylogeny Characters A, B, T Felsenstein Characters, T, B A Distance based Parsimony Probabilistic T = tree topology B = branch lengths A = ancestral states

Gene Tree vs. Species Tree

Non-tree evolution Recombination, hybridization, horizontal gene transfer

Using Multiple Methods Important to keep in mind that reliance on purely one method for phylogenetic analysis provides incomplete picture When different methods (parsimony, distance-based, etc.) all give same result, more likely that the result is correct

How Many Times Evolution Invented Wings? Whiting, et. al. (2003) looked at winged and wingless stick insects

Reinventing Wings Previous studies had shown winged  wingless transitions Wingless  winged transition much more complicated (need to develop many new biochemical pathways) Used multiple tree reconstruction techniques, all of which required re-evolution of wings

Most Parsimonious Evolutionary Tree of Winged and Wingless Insects The evolutionary tree is based on both DNA sequences and presence/absence of wings Most parsimonious reconstruction gave a wingless ancestor

Will Wingless Insects Fly Again? Since the most parsimonious reconstructions all required the re-invention of wings, it is most likely that wing developmental pathways are conserved in wingless stick insects

Phylogenetic Analysis of HIV Virus Lafayette, Louisiana, 1994 – A woman claimed her ex-lover (who was a physician) injected her with HIV+ blood Records show the physician had drawn blood from an HIV+ patient that day But how to prove the blood from that HIV+ patient ended up in the woman?

HIV Transmission HIV has a high mutation rate, which can be used to trace paths of transmission Two people who got the virus from two different people will have very different HIV sequences Three different tree reconstruction methods (including parsimony) were used to track changes in two genes in HIV (gp120 and RT)

HIV Transmission Took multiple samples from the patient, the woman, and controls (non-related HIV+ people) In every reconstruction, the woman’s sequences were found to be evolved from the patient’s sequences, indicating a close relationship between the two Nesting of the victim’s sequences within the patient sequence indicated the direction of transmission was from patient to victim This was the first time phylogenetic analysis was used in a court case as evidence (Metzker, et. al., 2002)

Evolutionary Tree Leads to Conviction

Current popular methods HUNDREDS of programs available! http://evolution.genetics.washington.edu/phylip/software.html#methods Some recommended programs: Discrete—Parsimony-based Rec-1-DCM3 http://www.cs.utexas.edu/users/tandy/mp.html Tandy Warnow and colleagues Probabilistic SEMPHY http://www.cs.huji.ac.il/labs/compbio/semphy/ Nir Friedman and colleagues

Sources Metzker et al. Molecular evidence of HIV-1 transmission in a criminal case. PNAS 2002. Whiting et al. “Loss and recovery of wings in stick insects” Nature 421, 264-267 Serafim Batzoglou http://ai.stanford.edu/~serafim/CS262_2006/ (Phylogeny slides) http://bioalgorithms.info (Phylogeny slides) V. Bafna (Perfect Phylogeny slides)