An introduction to maximum parsimony and compatibility

Trevor Bruen PhD Candidate McGill Centre for Bioinformatics

Overview
The point of this talk is to give a sense of how discrete mathematics enters into phylogenetic and genetic inference. I will illustrate these ideas by describing two approaches in detail, namely maximum compatibility and maximum parsimony. I will also show how ideas from these two criteria can be used to develop applications such as bounds and tests for recombination. My goal is to provide a basis for further study in this area and to give greater insight into these methods.

Outline
Introduction to compatibility and parsimony
Overview of basic notation/concepts
Compatibility
  Compatibility as a graph theory problem
  Compatibility for pairs of characters
  Interpretation of compatibility
Parsimony
  Parsimony score with connections to graph theory
  Connections between parsimony and compatibility
  Homoplasy
  Parsimony for pairs of characters
  Connections between SPRs/TBRs and parsimony
Applications to recombination
Parsimony as a consensus method

Introduction
Maximum parsimony and maximum compatibility are used in phylogenetics, linguistics and population genetics.
In phylogenetics, the goal is to infer an evolutionary tree.
In linguistics, the goal is often the same.
Population genetics uses compatibility for recombination analysis.
For general phylogenetic inference with molecular data, likelihood (probability-based) methods are generally preferred.
BUT compatibility and parsimony are computationally tractable.
ALSO the mathematics behind parsimony and compatibility is very well developed.
We can show that parsimony = likelihood in certain circumstances (Tuffley and Steel 1997).
This gives us insight into where to go in terms of research.

Formalism
A character is a mapping chi from a set of taxa X to a set of states C.
In this case, X = {S1, S2, S3, S4} and C = {A, C}.
Informally, a character is a “column” in a multiple sequence alignment.

Binary Character / Splits
If a character has two states then it induces a split of the taxa set.
Example: Let X be the taxa set {S1,S2,S3,S4}. Let C be the state set {A,C}. Then {S1,S2} | {S3,S4} is the split induced by the first character.
In general, a character induces a partition of X into equivalence classes, one for each state.
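As a minimal sketch of the definition above (not taken from the slides), a character can be stored as a Python dict from taxa to states. The function names `blocks` and `split` and the concrete states assigned to S1..S4 are assumptions chosen to reproduce the split {S1,S2} | {S3,S4}.

```python
from collections import defaultdict

def blocks(character):
    """Group taxa by state: the partition of X induced by a character."""
    groups = defaultdict(set)
    for taxon, state in character.items():
        groups[state].add(taxon)
    return {state: frozenset(taxa) for state, taxa in groups.items()}

def split(character):
    """Return the split A | B induced by a two-state character."""
    parts = sorted(blocks(character).values(), key=sorted)
    if len(parts) != 2:
        raise ValueError("a split requires exactly two states")
    return tuple(parts)

# Hypothetical first alignment column: S1 and S2 carry A, S3 and S4 carry C.
chi = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
first_split = split(chi)
```

A character with more than two states has no split, but `blocks` still returns its equivalence classes.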

Tree and Labeling
Informally, we would like to be able to describe a tree and a labeling structure mathematically.
In graph theory, a tree T=(V,E) is a connected graph with no cycles.
Informally, we would also like to be able to attach taxa (members of X) to our tree (actually to its leaves).
Define a labeling function phi from X to V(T) such that the leaves of T are labeled by members of X.

X-Trees
An X-tree consists of a pair (T, phi), where phi is a labeling function that labels the leaves of T with members of X.

Extensions
Informally, we have an X-tree consisting of the pair (T, phi). We also have a character chi. We need to relate the character to the tree.
Define an extension of the character chi as a function chi-bar from V(T) to C that is consistent with chi at the leaves: chi-bar(phi(x)) = chi(x) for every x in X.
Informally, an extension provides a description of how the internal vertices are labeled.

Quick Summary
Summary so far:
X-trees are trees along with functions labeling the leaves with members of X.
A character is a function from X into a state set C.
An extension is a labeling of the vertices of T with states of C that agrees with the character at the leaves.

Compatibility - Definition
A character is compatible with a tree if and only if there exists an extension of the character to the tree so that the subgraphs induced by each of the states are connected.
Example: In the first tree, the character is compatible with the tree. In the second tree, the character is incompatible, since the two A’s are disconnected.
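One way to test this definition in code uses a standard equivalent formulation: a character is compatible with a tree exactly when the minimal subtrees spanning the leaves of each state are pairwise vertex-disjoint. The sketch below assumes the tree is given as an edge list whose leaf names are the taxa themselves. The helper names and the four-taxon example tree are hypothetical.

```python
def adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def spanning_subtree(adj, targets):
    """Vertices of the minimal subtree of the tree that connects `targets`."""
    keep = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    # Repeatedly prune leaves of the remaining tree that are not targets.
    stack = [v for v in adj if degree[v] <= 1 and v not in targets]
    while stack:
        v = stack.pop()
        if v not in keep:
            continue
        keep.discard(v)
        for u in adj[v]:
            if u in keep:
                degree[u] -= 1
                if degree[u] <= 1 and u not in targets:
                    stack.append(u)
    return keep

def is_compatible(adj, character):
    """True if the state classes span pairwise vertex-disjoint subtrees."""
    by_state = {}
    for leaf, state in character.items():
        by_state.setdefault(state, set()).add(leaf)
    seen = set()
    for leaves in by_state.values():
        subtree = spanning_subtree(adj, leaves)
        if seen & subtree:
            return False
        seen |= subtree
    return True

# Hypothetical quartet tree: S1 and S2 hang off u, S3 and S4 hang off v.
tree = adjacency([("S1", "u"), ("S2", "u"), ("u", "v"), ("S3", "v"), ("S4", "v")])
good = is_compatible(tree, {"S1": "A", "S2": "A", "S3": "C", "S4": "C"})
bad = is_compatible(tree, {"S1": "A", "S2": "C", "S3": "A", "S4": "C"})
```

In the second call both state classes need the internal edge between u and v, so their subtrees overlap and no extension can keep both states connected.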

Compatibility
Problem definition: Given a sequence of characters chi_1, ..., chi_k, determine whether there exists a tree with which all of the characters are compatible.
Related problem: Given a sequence of characters chi_1, ..., chi_k, determine the largest set of characters that are compatible with some tree.

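Intersection Graph
Suppose we have a sequence of characters chi_1, ..., chi_k on X.
Then each character chi_i induces a partition of X, i.e. the set of blocks on which chi_i is constant.
Create a graph whose vertex set consists of the pairs (chi_i, A), where A is a block of the partition induced by chi_i.
There is an edge between two vertices if and only if the intersection of the two corresponding subsets is non-empty.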
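The construction can be sketched as follows, assuming every character is a dict defined on the same taxa. Each vertex is represented as a (character index, state) pair, which is a hypothetical encoding of a block chosen for this example.

```python
from itertools import combinations

def intersection_graph(characters):
    """Partition intersection graph of a list of characters (dicts taxon -> state).

    Vertices are (character index, state) pairs, one per block; two vertices
    are adjacent when their blocks share at least one taxon.
    """
    block_of = {}
    for i, chi in enumerate(characters):
        for taxon, state in chi.items():
            block_of.setdefault((i, state), set()).add(taxon)
    edges = {
        frozenset((a, b))
        for a, b in combinations(block_of, 2)
        if block_of[a] & block_of[b]
    }
    return set(block_of), edges

# Two hypothetical binary characters on four taxa.
chi1 = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
chi2 = {"S1": "A", "S2": "C", "S3": "A", "S4": "C"}
vertices, edges = intersection_graph([chi1, chi2])
```

Blocks of the same character are disjoint, so edges only ever join blocks of different characters.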

Intersection Graph
Whether a sequence of characters chi_1, ..., chi_k is compatible can be determined directly from the intersection graph.
First we need to define two concepts: a chordal graph and a restricted chordal completion of the intersection graph.

Chordal Graphs
A graph G=(V,E) is a chordal graph if every cycle with at least four vertices contains a chord (an edge connecting two non-consecutive vertices of the cycle).
A chordalization of a graph G=(V,E) is a graph G’=(V,E’), where E is a subset of E’, such that G’ is chordal.
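Chordality can be checked with a short routine. The sketch below relies on a standard fact that is not stated on the slide: a graph is chordal exactly when its vertices can be deleted one at a time so that each deleted vertex is simplicial, meaning its remaining neighbours are pairwise adjacent. The function names are hypothetical, and the simple search is meant for small graphs only.

```python
def is_simplicial(v, adj, remaining):
    """True if the remaining neighbours of v are pairwise adjacent."""
    nbrs = adj[v] & remaining
    return all(b in adj[a] for a in nbrs for b in nbrs if a != b)

def is_chordal(vertices, edges):
    """Check chordality by repeatedly deleting simplicial vertices."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(vertices)
    while remaining:
        v = next((v for v in remaining if is_simplicial(v, adj, remaining)), None)
        if v is None:
            return False  # no simplicial vertex left, so a chordless cycle exists
        remaining.remove(v)
    return True

four_cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]
plain = is_chordal({1, 2, 3, 4}, four_cycle)
with_chord = is_chordal({1, 2, 3, 4}, four_cycle + [(1, 3)])
```

Adding the chord (1, 3) to the four-cycle is the smallest example of a chordalization.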

Restricted Chordal Completions
Imagine the vertices of our graph G=(V,E) are colored. Then a restricted chordalization of G is a chordalization G’=(V,E’) of G in which all edges connect vertices of different colors.

Restricted chordal completions
A restricted chordal completion of the intersection graph is a chordalization in which there is no edge between vertices that belong to the same character. In this case, the “colors” correspond to characters.

Main Theorem for Compatibility
Let chi_1, ..., chi_k be a collection of characters. Then the collection is compatible if and only if there is a restricted chordal completion of its intersection graph.

Pairs of Characters
A simple corollary of the main theorem arises when we restrict our attention to two characters.
Corollary: Two characters are compatible if and only if their intersection graph G is acyclic.
Proof: (backwards direction) If G is acyclic then it is chordal, so G is its own restricted chordal completion and the characters are compatible.
(forward direction) Suppose G contains a cycle. Then any chordal completion of G also contains this cycle and, being chordal, must contain a three-cycle. But no restricted completion of G can contain a three-cycle, because with only two characters two of its three vertices would belong to the same character. So no restricted chordal completion exists and the characters are incompatible.
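The corollary gives a direct pairwise test. In this sketch, which assumes both characters are dicts over the same taxa, each taxon contributes one edge joining its block in the first character to its block in the second, and a union-find structure detects whether any edge closes a cycle. The name `pairwise_compatible` is hypothetical.

```python
def pairwise_compatible(chi1, chi2):
    """Two characters are compatible iff their intersection graph is a forest."""
    # Blocks intersect exactly when some taxon carries both states.
    edges = {((0, chi1[x]), (1, chi2[x])) for x in chi1}
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # this edge closes a cycle
        parent[root_a] = root_b
    return True

aacc = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
acac = {"S1": "A", "S2": "C", "S3": "A", "S4": "C"}
aaac = {"S1": "A", "S2": "A", "S3": "A", "S4": "C"}
```

For two binary characters this reduces to the familiar four-gamete test: the graph has a cycle only when all four state combinations occur.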

Interpretation
Recall: a character is compatible with an X-tree if and only if there exists an extension of the character to the tree so that the subgraphs induced by each of the states are connected.
Informally speaking, this is a very strict condition. It corresponds to an “all or nothing” condition: either a character is compatible with a tree or it isn’t.
Relaxing this condition is the subject of the next section.

Parsimony
Informally: given a leaf-labeled tree and a character, how can we define the fit of the character to the tree?
Consider a character chi, along with an extension chi-bar to a leaf-labeled tree T. Then the length of the extension is the number of edges {u,v} of T where chi-bar(u) differs from chi-bar(v).
Define the parsimony score of a character on a tree as the length of a minimal extension of the character to the tree. Denote this value by l(chi, T).
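The length of a minimal extension can be computed without enumerating extensions. The sketch below uses unit-cost Sankoff dynamic programming, a standard technique that the slides do not name, on a tree written as nested tuples with taxon names at the leaves. This representation and the function name are assumptions. The tree is rooted arbitrarily, which does not affect the score.

```python
def parsimony_score(tree, character):
    """Length of a minimal extension of `character` to `tree`.

    `tree` is a nested tuple whose leaves are taxon names; `character`
    maps taxon names to states.
    """
    states = set(character.values())

    def cost(node):
        # cost(node)[s] = fewest changes in the subtree if node has state s
        if isinstance(node, str):
            return {s: 0 if s == character[node] else float("inf") for s in states}
        total = dict.fromkeys(states, 0)
        for child in node:
            child_cost = cost(child)
            cheapest = min(child_cost.values())
            for s in states:
                # keep state s along the edge, or pay one change
                total[s] += min(child_cost[s], cheapest + 1)
        return total

    return min(cost(tree).values())

quartet = (("S1", "S2"), ("S3", "S4"))
score_aacc = parsimony_score(quartet, {"S1": "A", "S2": "A", "S3": "C", "S4": "C"})
score_acac = parsimony_score(quartet, {"S1": "A", "S2": "C", "S3": "A", "S4": "C"})
```

On this quartet the first character needs a single change on the internal edge, while the second needs two changes.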

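Parsimony
Then the maximum parsimony score for a set of characters chi_1, ..., chi_k on a tree T is defined as the sum of the parsimony scores of the individual characters: l(chi_1, T) + ... + l(chi_k, T).
The tree that minimizes this score is referred to as the maximum parsimony tree.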
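For four taxa there are only three unrooted binary topologies, so the maximum parsimony tree can be found by brute force. This sketch scores each character with Fitch's algorithm, the classic method for binary trees (named here as an assumption, since the slides do not specify an algorithm). The three alignment columns are hypothetical.

```python
def fitch(tree, chi):
    """Return (state set, changes) for a rooted binary tree of nested pairs."""
    if isinstance(tree, str):
        return {chi[tree]}, 0
    left_set, left_cost = fitch(tree[0], chi)
    right_set, right_cost = fitch(tree[1], chi)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def total_score(tree, characters):
    """Sum of the parsimony scores of all characters on one tree."""
    return sum(fitch(tree, chi)[1] for chi in characters)

taxa = ["S1", "S2", "S3", "S4"]
# The three quartet topologies, each rooted on its internal edge.
quartets = [
    (("S1", "S2"), ("S3", "S4")),
    (("S1", "S3"), ("S2", "S4")),
    (("S1", "S4"), ("S2", "S3")),
]
columns = ["AACC", "AACC", "ACAC"]  # hypothetical alignment columns
characters = [dict(zip(taxa, column)) for column in columns]
best = min(quartets, key=lambda tree: total_score(tree, characters))
```

Exhaustive search is only feasible for very small taxon sets, because the number of topologies grows super-exponentially with the number of taxa.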

Parsimony and graph theory
A minimal cut-set for a leaf-labeled tree T=(V,E) and a character chi is a minimal set of edges whose removal ensures that if chi(x) differs from chi(y) then the leaves labeled x and y are in different components.
Claim: There is a bijection between the set of minimal cut-sets and the set of minimal extensions. So the cardinality of a minimal cut-set is equal to the parsimony score.

Parsimony and Graph Theory
Recall Menger’s Theorem (1927): Let G=(V,E) be a graph with V1 and V2 as two disjoint subsets of V. Then the minimum number of edges whose removal from G leaves the vertices of V1 and V2 in different components is equal to the maximum number of edge-disjoint paths between V1 and V2.
Corollary: For a binary character, the maximum number of edge-disjoint paths between the leaves in one state and the leaves in the other state equals the parsimony score.

Compatibility and parsimony
Recall: Let chi_1, ..., chi_k be a collection of characters. Then the collection is compatible if and only if there is a restricted chordal completion of its intersection graph.
Question: How can we characterize parsimony with respect to an intersection graph?

Compatibility Graph
Recall: Each character chi induces a partition of X, i.e. the set of its blocks. A block for a character chi is a maximal subset of taxa on which chi is constant.
Thus we may identify the blocks of the characters with the vertices of the intersection graph.

Character Refinement
A character chi’ refines another character chi if chi’(x) = chi’(y) implies chi(x) = chi(y) for all x, y in X.
Thus characters that refine other characters correspond to refinements of the partition.
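Refinement is easy to check in code. This sketch assumes the usual definition, namely that one character refines another when any two taxa sharing a state in the first also share a state in the second, so every block of the finer character sits inside one block of the coarser one. The function name and both example characters are hypothetical.

```python
def refines(chi, other):
    """True if chi refines other: chi(x) == chi(y) implies other(x) == other(y)."""
    image = {}
    for taxon, state in chi.items():
        # Every block of chi must sit inside a single block of other.
        if image.setdefault(state, other[taxon]) != other[taxon]:
            return False
    return True

coarse = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
fine = {"S1": "A", "S2": "A", "S3": "C", "S4": "G"}
```

Here `fine` splits the block {S3, S4} of `coarse` into two blocks, so it refines `coarse` but not the other way round.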

Compatibility and Parsimony
Recall: Let chi_1, ..., chi_k be a collection of characters. Then the collection is compatible if and only if there is a restricted chordal completion of its intersection graph.
Main:

Special Case: Two characters
Recall: Two characters are compatible if and only if their intersection graph G is acyclic.
Using the previous theorem we can show that the parsimony score for two characters corresponds to |E(G)| + k - 2, where |E(G)| is the number of edges and k is the number of components in the graph.
Note: This score corresponds to the best (minimum) parsimony score over all trees, i.e. the score of the maximum parsimony tree.

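Homoplasy
Recall: The parsimony score of a character on a tree corresponds to the minimum number of changes of the character on the tree.
Informally: What is an intuitive way to think about the parsimony score?
Define the homoplasy of a character chi on a tree T as h(chi, T) = l(chi, T) - (r - 1), where l(chi, T) is the parsimony score and r is the number of states of chi.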

Homoplasy
Note that the homoplasy h(chi, T) of a character chi on a tree T satisfies h(chi, T) >= 0, with equality if and only if chi is convex on T (that is, compatible with T).
Informally: Homoplasy corresponds to the number of “extra” mutations of the character on the tree. These “extra” mutations correspond to recurrent mutations.
Informally: Thus a character is not compatible with a tree iff it cannot be placed on the tree without “extra” mutations.

Homoplasy For Two Characters
Recall: The parsimony score for a pair of characters can be found directly from the bipartite intersection graph G.
Recall: This score corresponds to an optimum over all trees.
Thus for two characters chi_1 and chi_2, we can define a pairwise homoplasy score as h(chi_1, chi_2) = |E(G)| - |V(G)| + k, where k is the number of components of G.
Recall: Up to now homoplasy refers to “extra” mutations on a tree.
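A sketch of the pairwise score follows. It assumes the score is the number of independent cycles of the intersection graph, |E| - |V| + k, where k is the number of components; this formula is an assumption here, since the slide's equation did not survive. The function name is hypothetical, and both characters must be defined on the same taxa.

```python
def pairwise_homoplasy(chi1, chi2):
    """|E| - |V| + k for the bipartite intersection graph of two characters."""
    # One edge per distinct pair of intersecting blocks.
    edges = {((0, chi1[x]), (1, chi2[x])) for x in chi1}
    parent = {v: v for edge in edges for v in edge}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)
    components = len({find(v) for v in parent})
    return len(edges) - len(parent) + components

aacc = {"S1": "A", "S2": "A", "S3": "C", "S4": "C"}
acac = {"S1": "A", "S2": "C", "S3": "A", "S4": "C"}
```

Under this assumption the score is zero exactly when the graph is a forest, which agrees with the acyclicity criterion for compatibility of two characters.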

A second look at homoplasy
Example: Two characters with a pairwise homoplasy score equal to one.
Informally: We have seen that the homoplasy corresponds to the number of “extra” mutations on a tree. But in certain situations, this is biologically implausible. The state 1 may correspond to a mutation that has only arisen once. In this case, the fact that the pair of characters is incompatible can be explained by a recombination event. This will be defined more precisely later.

A quick aside - tree distances.
Differences between leaf-labeled trees can be measured using various metrics, e.g. the subtree prune and regraft (SPR) distance.
A “subtree prune and regraft” corresponds to a specific rearrangement of a tree.
For two leaf-labeled trees T1 and T2, dSPR(T1, T2) is the minimum number of SPRs needed to transform T1 into T2.

Homoplasy for two characters
Theorem: If chi_1 and chi_2 are two characters, then their pairwise homoplasy score corresponds to the minimum number of SPRs from any leaf-labeled tree with which chi_1 is compatible to any leaf-labeled tree with which chi_2 is compatible!
Informally: Thus we have a whole new interpretation of homoplasy.

Application - Testing for Recombination
If recombination has occurred, sites will have different histories.
Nearby sites will tend to have “greater” genealogical correlation than distant sites.
Idea: If recombination has occurred, genealogical correlation will be partially reflected by a tendency for pairs of closely linked sites to have less homoplasy than distant sites.

Test for Recombination
Idea: We would like to distinguish between two possibilities: recurrent mutation and recombination.
Idea: Use the previous observations to develop a test for recombination.
H0: A single history describes all sites.
H0’: Nearby sites share no more compatibility than arbitrary pairs of sites.
Use a statistic to capture this information and solve analytically for p-values.
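The idea can be sketched with a permutation test, which is swapped in here for the analytic p-values mentioned above. The statistic is the mean pairwise incompatibility of sites at most `w` positions apart, and shuffling the site order simulates H0’. The formula |E| - |V| + k for pairwise incompatibility, the window parameter `w`, and all function names are assumptions for illustration. At least two columns are required.

```python
import random

def incompatibility(col1, col2):
    """|E| - |V| + k for two alignment columns (strings over the same taxa)."""
    edges = {((0, a), (1, b)) for a, b in zip(col1, col2)}
    parent = {v: v for edge in edges for v in edge}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)
    components = len({find(v) for v in parent})
    return len(edges) - len(parent) + components

def window_statistic(columns, w):
    """Mean incompatibility over pairs of sites at most w positions apart."""
    n = len(columns)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, min(i + w, n - 1) + 1)]
    return sum(incompatibility(columns[i], columns[j]) for i, j in pairs) / len(pairs)

def permutation_p_value(columns, w, trials=1000, seed=0):
    """Share of random site orders whose statistic is at most the observed one."""
    rng = random.Random(seed)
    observed = window_statistic(columns, w)
    shuffled = list(columns)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if window_statistic(shuffled, w) <= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

# Hypothetical alignment: two blocks of sites with conflicting histories.
columns = ["AACC", "AACC", "ACAC", "ACAC"]
observed = window_statistic(columns, w=1)
p_value = permutation_p_value(columns, w=1, trials=200)
```

A small p-value suggests that neighbouring sites are more compatible than arbitrary pairs, which is the pattern expected under recombination rather than recurrent mutation.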

Application: Parsimony and supertrees
Supertree: MRP (matrix representation with parsimony) is parsimony with characters that represent trees. What does homoplasy mean in this context? Courtesy of TREE 12:

Parsimony as a consensus tree
Recall: If chi_1 and chi_2 are two characters, then their pairwise homoplasy score corresponds to the minimum number of SPRs from any leaf-labeled tree with which chi_1 is compatible to any leaf-labeled tree with which chi_2 is compatible.
Informally: This can be generalized to show that the maximum parsimony tree for a set of characters minimizes the SPR distance to the sets of trees with which each character is compatible…

Acknowledgements
Thanks for listening!
Background and further reading:
Phylogenetics, Semple and Steel (book, 2003).
Some results I presented are not in this book; they are from my own work. Please talk to me if you are interested.
I have many other references; please see me if interested.