Perfect Phylogeny; MLE for Phylogeny. Lecture 14. Based on: Setubal & Meidanis 6.2, Durbin et al. 8.1.


1 Perfect Phylogeny; MLE for Phylogeny. Lecture 14. Based on: Setubal & Meidanis 6.2, Durbin et al. 8.1

2 Final Exam Details
The final exam will take place on Thursday, 3.2.04, at 09:00, in Taub 4. Allowed material: course and tutorial slides, plus the course textbooks (Durbin et al., Setubal & Meidanis, Gusfield).

3 2. The perfect phylogeny problem
• A character is assumed to be a property which distinguishes between species (e.g. dental structure).
• A character's state is a value of the character (e.g. human dental structure).
• Problem: Given a set of species, specified by their characters, reconstruct their evolutionary tree.

4 Characters as Colorings
A coloring of a tree T=(V,E) is a mapping C: V → [set of colors].
A partial coloring of T is a mapping defined on a subset of the vertices U ⊆ V: C: U → [set of colors].

5 Characters as Colorings (2)
Each character defines a (partial) coloring of the corresponding phylogenetic tree:
Species ≡ Vertices
States ≡ Colors

6 Convex Colorings (and Characters)
Let T=(V,E) be a partially colored tree, and d be a color. The d-carrier is the minimal subtree of T containing all vertices colored d.
Definition: A (partial/total) coloring of a tree is convex iff its d-carriers are mutually disjoint.
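The definition above can be checked mechanically. The following is a minimal sketch, assuming the tree is given as an adjacency dictionary and the partial coloring as a vertex-to-color dictionary; the names `carrier` and `is_convex` are illustrative and do not come from the lecture.

```python
def carrier(adj, colored):
    """Vertices of the minimal subtree of the tree `adj` containing `colored`.

    Repeatedly prunes leaves that are not in `colored`.
    """
    keep = set(adj)
    deg = {v: len(adj[v]) for v in adj}
    stack = [v for v in adj if deg[v] <= 1 and v not in colored]
    while stack:
        v = stack.pop()
        if v not in keep:
            continue  # already pruned
        keep.discard(v)
        for w in adj[v]:
            if w in keep:
                deg[w] -= 1
                if deg[w] <= 1 and w not in colored:
                    stack.append(w)
    return keep


def is_convex(adj, coloring):
    """True iff the carriers of all colors are mutually disjoint."""
    seen = set()
    for color in set(coloring.values()):
        vertices = {v for v, c in coloring.items() if c == color}
        c = carrier(adj, vertices)
        if seen & c:
            return False
        seen |= c
    return True


# Path a - b - c - d
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(is_convex(path, {"a": "R", "b": "R", "d": "B"}))  # carriers {a,b} and {d}
print(is_convex(path, {"a": "R", "c": "R", "b": "B"}))  # R-carrier passes through b
```

In the second coloring the R-carrier is the path a, b, c, which contains the B-colored vertex b, so the coloring is not convex.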

7 Convexity ⇔ Homoplasy Freedom
A character is homoplasy free (avoids reversal and convergence transitions)
↕
The corresponding (partial) coloring is convex

8 The Perfect Phylogeny Problem
• Input: a set of species, and many characters.
• Question: is there a tree T containing the species as vertices, in which all the characters (colorings) are convex? (This is always possible for one character.)

9 The Perfect Phylogeny Problem (pure graph theoretic setting)
Input: Partial colorings (C_1,…,C_k) of a set of vertices U (in the example: 3 total colorings, left, center and right, each using two colors).
Problem: Is there a tree T=(V,E) such that U ⊆ V and, for i=1,…,k, C_i is a convex (partial) coloring of T?
NP-hard in general; in P for some special cases.
[Figure: three colorings of the same vertex set, each with colors R and B]

10 Perfect Phylogeny for a 0-1 Matrix
Rows correspond to objects, columns to characters. Each character has two states: 0 (absent) or 1 (present). A tree T is a perfect phylogeny for the matrix iff it has the following properties:
A. Each of the n objects corresponds to a leaf of T.
B. Each of the m characters labels exactly one edge of T.
C. Object p has character i ⇔ i labels an edge on the path from p to the root.
Note: [B and C] ⇒ [each character is convex on T].

    C1 C2 C3 C4 C5
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  0  1
D    0  0  1  1  0
E    0  1  0  0  0

[Figure: tree with leaves A, B, C, D, E and edges labeled C1 to C5]

11 Perfect Phylogeny for a 0-1 Matrix
By the definition, for each character C there is exactly one edge on which it changes from 0 to 1. In the tree below, the edge on which character C2 changes to 1 is marked. The resulting tree is convex for this character.

    C2
A    1
B    0
C    1
D    0
E    1

[Figure: tree with leaves A, B, C, D, E; the edge labeled C2 is marked]

12 The (Binary) Perfect Phylogeny Problem
Problem: Given a 0-1 matrix M, determine if it has a perfect phylogeny in which the root has 0 for all characters, and construct one if it does. (Note: edges are labeled by characters; an edge labeled by i represents a change of character i's state from 0 to 1.) As we show below, the answer is yes for our matrix:

    C1 C2 C3 C4 C5
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  0  1
D    0  0  1  1  0
E    0  1  0  0  0

[Figure: tree with leaves A, B, C, D, E and edges labeled C1 to C5]

13 Efficient algorithm for the Binary Perfect Phylogeny Problem
Definition: Given a 0-1 matrix M, O_k = {j : M_jk = 1}, i.e. O_k is the set of objects that have character C_k.
Theorem: M has a perfect phylogenetic tree iff the sets {O_i} are laminar, i.e. for all i, j, either O_i and O_j are disjoint, or one includes the other.

Laminar:
    C1 C2 C3 C4 C5
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  0  1
D    0  0  1  1  0
E    0  1  0  0  0

Not laminar:
    C1 C2 C3 C4 C5
A    1  1  0  0  0
B    0  0  1  0  1
C    1  1  0  0  1
D    0  0  1  1  0
E    0  1  0  0  1
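The laminarity condition of the theorem can be tested directly from the definition. This is a naive sketch that compares every pair of columns (it is not the O(mn) method of the later slides); the matrices are the two from this slide, and the function names are illustrative.

```python
def object_sets(rows, n_chars):
    """O_k for each column k: the set of objects whose row has a 1 in column k."""
    return [{obj for obj, bits in rows.items() if bits[k] == "1"}
            for k in range(n_chars)]


def is_laminar(sets):
    """True iff every two sets are disjoint or one includes the other."""
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            a, b = sets[i], sets[j]
            if a & b and not (a <= b or b <= a):
                return False
    return True


laminar = {"A": "11000", "B": "00100", "C": "11001", "D": "00110", "E": "01000"}
not_laminar = {"A": "11000", "B": "00101", "C": "11001", "D": "00110", "E": "01001"}

print(is_laminar(object_sets(laminar, 5)))      # True
print(is_laminar(object_sets(not_laminar, 5)))  # False
```

In the second matrix, O_3 = {B, D} and O_5 = {B, C, E} share B while neither includes the other, which is why the check fails.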

14 Proof (⇒)
Assume M has a perfect phylogeny, and let Ci, Cj be given. Consider the edges labeled Ci and Cj.
Case 1: There is a root-to-leaf path containing both edges. Then one of O_i, O_j includes the other (C2 and C1 in the tree).
Case 2: Not case 1. Then O_i and O_j are disjoint (C2 and C3).
[Figure: tree with leaves A, B, C, D, E and edges labeled C1 to C5]

15 Proof (cont.) (⇐)
Assume that for all i, j, either O_i and O_j are disjoint, or one includes the other. We prove by induction on the number of characters that M has a perfect phylogenetic tree.
Basis: one character. Then there are at most two distinct objects, one with and one without this character.

    C1
A    1
B    0

[Figure: tree with leaves A and B]

16 Proof (cont.) (⇐)
Induction step: Assume correctness for n-1 characters, and consider a matrix with n characters (non-zero columns). WLOG assume that O_1 is not properly contained in O_j for any j > 1. Let S_1 be the set of objects j for which M_j1 = 1, and S_2 be the remaining objects. Then each other character belongs to objects in S_1 or in S_2, but not both (prove!). By induction there are trees T_1 and T_2 for S_1 and S_2. Combining them as below gives the desired tree.

    C1 C2 C3 C4 C5
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  0  1
D    0  0  1  1  0
E    1  0  0  0  0

[Figure: T_1 hangs below an edge labeled 1; T_2 is attached at the root]

17 Efficient Implementation (1)
1. Sort the columns (characters) by decreasing value when considered as binary numbers. (Time complexity: O(mn), using radix sort.)
Claim: If the binary value of column i is larger than that of column j, then O_i is not a proper subset of O_j.
Proof: If O_i were a proper subset of O_j, column j would have a 1 in every row where column i has one, and at least one more, so its binary value would be the larger one.

Before sorting:
    C1 C2 C3 C4 C5
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  0  1
D    0  0  1  1  0
E    0  1  0  0  0

After sorting:
    C2 C1 C3 C5 C4
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  1  0
D    0  0  1  0  1
E    1  0  0  0  0
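The sorting step can be sketched as follows. For brevity this sketch uses Python's built-in comparison sort on the integer value of each column rather than the radix sort mentioned on the slide, so it does not achieve the O(mn) bound; `sort_columns` is an illustrative name.

```python
def sort_columns(names, rows):
    """Reorder columns by decreasing binary value (top row = most significant bit).

    names: list of column labels; rows: list of '0'/'1' strings, one per object.
    Returns the reordered labels and rows.
    """
    m = len(names)
    columns = ["".join(row[k] for row in rows) for k in range(m)]
    order = sorted(range(m), key=lambda k: int(columns[k], 2), reverse=True)
    new_names = [names[k] for k in order]
    new_rows = ["".join(row[k] for k in order) for row in rows]
    return new_names, new_rows


names = ["C1", "C2", "C3", "C4", "C5"]
rows = ["11000", "00100", "11001", "00110", "01000"]  # objects A..E
sorted_names, sorted_rows = sort_columns(names, rows)
print(sorted_names)  # ['C2', 'C1', 'C3', 'C5', 'C4']
print(sorted_rows)   # ['11000', '00100', '11010', '00101', '10000']
```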

18 Efficient Implementation (2)
2. Make a backwards linked list of the 1's in each row (the leftmost 1 in each row points at itself). Time complexity: O(mn).

    C2 C1 C3 C5 C4
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  1  0
D    0  0  1  0  1
E    1  0  0  0  0

Claim: If the columns are sorted, then the set of columns is laminar iff for each column i, all the links leaving column i point at the same column. This can be checked in O(mn) time.

19 Examples

Laminar:
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  1  0
D    0  0  1  0  1
E    1  0  0  0  0

Not laminar:
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  1  0
D    0  0  1  0  1
E    1  0  1  1  0
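The backward-link test can be tried on these two examples. The sketch below assumes the columns are already sorted by decreasing binary value, and it records the links implicitly (as the target column index of each 1) instead of building an explicit linked list; `links_consistent` is an illustrative name.

```python
def links_consistent(rows):
    """Laminarity test for a column-sorted 0-1 matrix given as '0'/'1' strings.

    Each 1 links to the previous 1 in its row; the leftmost 1 links to itself.
    Returns True iff all links leaving any one column share the same target.
    """
    m = len(rows[0])
    target = [None] * m  # target[k]: the column that links from column k point at
    for row in rows:
        prev = None
        for k, bit in enumerate(row):
            if bit == "1":
                t = k if prev is None else prev
                if target[k] is None:
                    target[k] = t
                elif target[k] != t:
                    return False  # two links from column k disagree
                prev = k
    return True


laminar_rows = ["11000", "00100", "11010", "00101", "10000"]
bad_rows = ["11000", "00100", "11010", "00101", "10110"]
print(links_consistent(laminar_rows))  # True
print(links_consistent(bad_rows))      # False
```

In the second matrix the third column has links from rows B and D that point at the column itself, while the link from row E points at the first column, so the test fails.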

20 Efficient Implementation (3)
3. When the matrix is laminar, the tree edges corresponding to characters are defined by the backwards links in the matrix. The remaining edges and leaves are determined by the characters of each object. This needs O(mn) time.

    C2 C1 C3 C5 C4
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  1  0
D    0  0  1  0  1
E    1  0  0  0  0

[Figure: resulting tree with leaves A, B, C, D, E and edges labeled by the characters]
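One possible way to read the tree off the links is sketched below, assuming the matrix is column-sorted, already known to be laminar, and has no all-zero columns. Each character edge hangs below the character its links point at (or below the root when a link points at itself), and each object becomes a leaf below the last character in its row. The representation (two dictionaries) and the name `build_tree` are choices made for this sketch only.

```python
def build_tree(names, objects, rows):
    """Return (parent, leaf_under) for a column-sorted laminar 0-1 matrix.

    parent[c]     : character (or 'root') directly above the edge labeled c
    leaf_under[o] : character edge (or 'root') below which object o is attached
    """
    parent = {}
    leaf_under = {}
    for obj, row in zip(objects, rows):
        prev = "root"
        for name, bit in zip(names, row):
            if bit == "1":
                parent[name] = prev  # same value for every row, by laminarity
                prev = name
        leaf_under[obj] = prev
    return parent, leaf_under


names = ["C2", "C1", "C3", "C5", "C4"]
objects = ["A", "B", "C", "D", "E"]
rows = ["11000", "00100", "11010", "00101", "10000"]
parent, leaf_under = build_tree(names, objects, rows)
print(parent)      # {'C2': 'root', 'C1': 'C2', 'C3': 'root', 'C5': 'C1', 'C4': 'C3'}
print(leaf_under)  # {'A': 'C1', 'B': 'C3', 'C': 'C5', 'D': 'C4', 'E': 'C2'}
```

For every object, the characters on the path from its leaf to the root are exactly the 1's in its row, for example C5, C1, C2 for object C.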

21 A scenario where Maximum Parsimony (and Perfect Phylogeny) are misleading
Consider a model with 4 letters (DNA), where the probability of a substitution is proportional to time. In the following topology, leaves 2 and 3 are likely to be like the origin, but leaves 1 and 4 can be different. In this case, Maximum Parsimony is misleading.
[Figure: four-leaf tree with origin A and leaves 1, 2, 3, 4]

22 Parsimony may be useless/misleading
For leaves 1 and 4 there are 4 combinations of substitutions:
I. Uninformative
II. Uninformative
III. Uninformative
IV. Misinformative
In the first three, all three topologies obtain the same parsimony score. In the fourth, a wrong topology scores best.
[Figure: the four-leaf tree with the four possible combinations of letters at leaves 1 and 4]

23 Parsimony may be useless: Case I
All four leaves are A. Each of the three topologies, ((1,2),(3,4)), ((1,3),(2,4)) and ((1,4),(2,3)), has score 0.

24 Parsimony may be useless: Case II
Leaves 1, 2 and 3 are A, and leaf 4 is G. Each of the three topologies, ((1,2),(3,4)), ((1,3),(2,4)) and ((1,4),(2,3)), has score 1.

25 Parsimony may be useless: Case III
Leaves 1 and 4 carry two different new letters (C and G), and leaves 2 and 3 are A. Each of the three topologies, ((1,2),(3,4)), ((1,3),(2,4)) and ((1,4),(2,3)), has score 2.

26 Parsimony may be misleading: Case IV
Leaves 1 and 4 are both C, and leaves 2 and 3 are A. The topologies ((1,2),(3,4)) and ((1,3),(2,4)) have score 2, while the wrong topology ((1,4),(2,3)) has score 1.
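The scores in the four cases can be recomputed with a short script. The slides do not say how the parsimony score is computed; this sketch uses Fitch's small-parsimony algorithm on each four-leaf topology, rooted on its central edge, and assumes the leaf assignments are as follows: leaves 2 and 3 are A, and only leaves 1 and 4 vary. For case III it is further assumed that leaf 1 is C and leaf 4 is G.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one site on a binary tree.

    tree: nested 2-tuples of leaf labels; states: dict leaf label -> letter.
    """
    def visit(node):
        if not isinstance(node, tuple):
            return {states[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


topologies = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
cases = {
    "I":   {1: "A", 2: "A", 3: "A", 4: "A"},
    "II":  {1: "A", 2: "A", 3: "A", 4: "G"},
    "III": {1: "C", 2: "A", 3: "A", 4: "G"},
    "IV":  {1: "C", 2: "A", 3: "A", 4: "C"},
}
scores = {name: [fitch_score(t, leaves) for t in topologies]
          for name, leaves in cases.items()}
print(scores)  # {'I': [0, 0, 0], 'II': [1, 1, 1], 'III': [2, 2, 2], 'IV': [2, 2, 1]}
```

Only case IV separates the topologies, and it favors the one that groups leaves 1 and 4.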

27 Parsimony may be misleading
Parsimony will infer the correct topology only in the rare case of a change on the central edge, or in the even rarer case of a parallel change from A to C on the pendant edges to 1 and 2.
[Figure: the four-leaf tree in these two cases]

28 3. Maximum Likelihood Approach
Consider the phylogenetic tree to be a stochastic process. The likelihood of a transition from character a to character b is given by parameters θ_b|a. The likelihood of a letter a in the root is q_a. Given the complete tree, its probability is defined by the values of the θ_b|a's and the q_a's.
[Figure: tree whose nodes carry the sequences AGA, GGA, AAA, AAG, AAA, AGA, AAA]

29 Maximum Likelihood Approach (2)
When the data consists only of the leaf sequences (but the topology is fixed):
• Write down the likelihood of the data (leaf sequences) given the tree.
• Use EM to estimate the θ_b|a parameters.
When the tree is not given:
• Search for the tree that maximizes Prob(data | Tree, θ_EM).
[Figure: tree whose leaves carry the sequences AGA, GGA, AAA, AAG]
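For a single site, the likelihood of the leaves given a fixed tree can be written as a sum, over all assignments of letters to the internal nodes, of the complete-tree probability (root term q times one transition term per edge). The sketch below does this by brute-force enumeration, which is only practical for very small trees, and it does not perform the EM step. The dictionary layouts (`parent`, `q`, and `theta` keyed as `(b, a)` for the probability of b given a), the function names, and the numeric parameter values are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"


def complete_prob(parent, states, q, theta):
    """Probability of a fully labeled tree: q at the root, theta[(b, a)] per edge a -> b."""
    p = 1.0
    for node, par in parent.items():
        if par is None:
            p *= q[states[node]]
        else:
            p *= theta[(states[node], states[par])]
    return p


def leaf_likelihood(parent, leaf_states, q, theta):
    """Sum complete_prob over every labeling of the internal (unobserved) nodes."""
    internal = [v for v in parent if v not in leaf_states]
    total = 0.0
    for combo in product(ALPHABET, repeat=len(internal)):
        states = dict(leaf_states)
        states.update(zip(internal, combo))
        total += complete_prob(parent, states, q, theta)
    return total


# Hypothetical parameters: uniform root distribution, 0.7 to stay, 0.1 per change.
q = {a: 0.25 for a in ALPHABET}
theta = {(b, a): (0.7 if a == b else 0.1) for a in ALPHABET for b in ALPHABET}

# Tiny tree: a root with two leaf children, both observed as A.
parent = {"root": None, "x": "root", "y": "root"}
likelihood = leaf_likelihood(parent, {"x": "A", "y": "A"}, q, theta)
print(likelihood)
```

With these illustrative values the sum is 0.25 × (0.7² + 3 × 0.1²) = 0.13, up to floating-point rounding.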

