Molecular Evolution and Phylogenetic Tree Reconstruction


Phylogenetic Trees
- Nodes: species
- Edges: time of independent evolution
- Edge length represents evolution time, also known as genetic distance
- Edge length is not necessarily chronological time

Inferring Phylogenetic Trees
Trees can be inferred by several criteria:
- Morphology of the organisms (can lead to mistakes!)
- Sequence comparison
Example:
  Mouse:  ACAGTGACGCCCCAAACGT
  Rat:    ACAGTGACGCTACAAACGT
  Baboon: CCTGTGACGTAACAAACGA
  Chimp:  CCTGTGACGTAGCAAACGA
  Human:  CCTGTGACGTAGCAAACGA

Inferring Phylogenetic Trees
- Sequence-based methods
  - Deterministic (Parsimony)
  - Probabilistic (SEMPHY)
- Distance-based methods
  - UPGMA
  - Neighbor-Joining
  - Can compute distances from sequences

Distance Between Two Sequences
Basic principles:
- The degree of sequence difference is proportional to the length of independent sequence evolution
- Only use positions where the alignment is certain; avoid areas with (too many) gaps

Distance Between Two Sequences
Given sequences $x_i$, $x_j$, define $d_{ij}$ = distance between the two sequences.
One possible definition: $d_{ij}$ = fraction $f$ of sites $u$ where $x_i[u] \ne x_j[u]$.
Better scores are derived by modeling evolution as a continuous change process.
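The simple definition above (fraction of differing sites) can be sketched in a few lines. This is only an illustration: the function name `p_distance` is hypothetical, the sequences are the ones from the earlier example slide, and the inputs are assumed to be already aligned and gap-free.

```python
def p_distance(x: str, y: str) -> float:
    """Fraction of aligned sites u at which x[u] != y[u]."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    if not x:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


# Sequences taken from the example slide.
mouse = "ACAGTGACGCCCCAAACGT"
rat = "ACAGTGACGCTACAAACGT"
chimp = "CCTGTGACGTAGCAAACGA"
human = "CCTGTGACGTAGCAAACGA"

d_mouse_rat = p_distance(mouse, rat)      # two differing sites out of 19
d_chimp_human = p_distance(chimp, human)  # identical sequences
```

Mouse and rat differ at 2 of the 19 sites, while the chimp and human sequences shown are identical, so their distance under this definition is 0.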

Outline
- Molecular Evolution
- Distance Methods
  - UPGMA / Average Linkage
  - Neighbor-Joining
- Sequence Methods
  - Deterministic (Parsimony)
  - Probabilistic (SEMPHY)

Molecular Evolution
Q: How can we model evolution at the nucleotide level? (Ignore gaps, focus on substitutions.)
A: Consider what happens at a specific position over a small time interval $\Delta t$.
- $P(t)$ = vector of probabilities of {A, C, G, T} at time $t$
- $\mu_{AC}$ = rate of transition from A to C per unit time
- $\mu_A = \mu_{AC} + \mu_{AG} + \mu_{AT}$ = rate of transition out of A
- $p_A(t + \Delta t) = p_A(t) - p_A(t)\,\mu_A\,\Delta t + p_C(t)\,\mu_{CA}\,\Delta t + \ldots$

Molecular Evolution
In matrix/vector notation, we get
$P(t + \Delta t) = P(t) + Q\,P(t)\,\Delta t$
where $Q$ is the substitution rate matrix.

Molecular Evolution
This is a differential equation: $P'(t) = Q\,P(t)$
A substitution rate matrix $Q$ implies a probability distribution over {A, C, G, T} at each position, including stationary (equilibrium) frequencies $\pi_A$, $\pi_C$, $\pi_G$, $\pi_T$.
Each $Q$ is an evolutionary model (some work better than others).
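To make the update rule concrete, here is a minimal numerical sketch that iterates $P(t + \Delta t) = P(t) + Q\,P(t)\,\Delta t$ for a Jukes-Cantor rate matrix and compares the result with the standard closed-form Jukes-Cantor probabilities. The function names, the rate `alpha = 0.1`, the time `t = 2.0`, and the step count are illustrative assumptions, not values from the slides.

```python
import math


def jukes_cantor_Q(alpha):
    """Jukes-Cantor rate matrix: rate alpha between distinct nucleotides,
    -3 * alpha on the diagonal (assumed illustrative parameterization)."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def evolve(P0, Q, t, steps=10000):
    """Iterate P(t + dt) = P(t) + Q P(t) dt with dt = t / steps."""
    dt = t / steps
    P = list(P0)
    for _ in range(steps):
        QP = [sum(Q[i][j] * P[j] for j in range(4)) for i in range(4)]
        P = [P[i] + QP[i] * dt for i in range(4)]
    return P


alpha, t = 0.1, 2.0
# Start with certainty in the first nucleotide (say A).
P = evolve([1.0, 0.0, 0.0, 0.0], jukes_cantor_Q(alpha), t)

# Closed-form Jukes-Cantor probabilities for comparison.
closed_same = 0.25 * (1 + 3 * math.exp(-4 * alpha * t))
closed_diff = 0.25 * (1 - math.exp(-4 * alpha * t))
```

With small steps the iterated vector approaches the closed form: the first entry tends to `closed_same` and each of the other three to `closed_diff`.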

Evolutionary Models
- Jukes-Cantor
- Kimura
- Felsenstein
- HKY

Estimating Distances
Solve the differential equation and compute the expected evolutionary time given the sequences.
- Jukes-Cantor
- Kimura
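The distance formulas themselves did not survive in this transcript. As a hedged sketch, the standard textbook corrections are $d = -\frac{3}{4}\ln(1 - \frac{4}{3}f)$ for Jukes-Cantor (with $f$ the fraction of differing sites) and $d = -\frac{1}{2}\ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q)$ for the Kimura two-parameter model (with $P$, $Q$ the fractions of transitions and transversions). The function names below are hypothetical, and the slides may have used a different notation.

```python
import math

PURINES = {"A", "G"}


def jukes_cantor_distance(x, y):
    """Standard Jukes-Cantor correction of the fraction f of differing sites."""
    f = sum(a != b for a, b in zip(x, y)) / len(x)
    if f >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * f / 3)


def kimura_distance(x, y):
    """Standard Kimura two-parameter distance from transition and transversion fractions."""
    n = len(x)
    # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
    transitions = sum(a != b and ((a in PURINES) == (b in PURINES)) for a, b in zip(x, y))
    # Transversion: purine <-> pyrimidine.
    transversions = sum(a != b and ((a in PURINES) != (b in PURINES)) for a, b in zip(x, y))
    P, Q = transitions / n, transversions / n
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("sequences too divergent for the Kimura correction")
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


mouse = "ACAGTGACGCCCCAAACGT"
rat = "ACAGTGACGCTACAAACGT"
jc = jukes_cantor_distance(mouse, rat)
k2p = kimura_distance(mouse, rat)
```

Both corrected distances come out slightly larger than the raw fraction 2/19, because they account for multiple substitutions at the same site.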

A simple clustering method for building a tree
UPGMA (unweighted pair group method using arithmetic averages), also called the Average Linkage Method.
Given two disjoint clusters $C_i$, $C_j$ of sequences,
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
Claim: if $C_k = C_i \cup C_j$, then the distance to another cluster $C_l$ is
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$

Algorithm: Average Linkage
Initialization:
- Assign each $x_i$ to its own cluster $C_i$
- Define one leaf per sequence, at height 0
Iteration:
- Find two clusters $C_i$, $C_j$ such that $d_{ij}$ is minimal
- Let $C_k = C_i \cup C_j$
- Define a node connecting $C_i$, $C_j$, and place it at height $d_{ij}/2$
- Delete $C_i$, $C_j$
Termination:
- When two clusters $i$, $j$ remain, place the root at height $d_{ij}/2$
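The average linkage procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `upgma`, the representation of distances as a dictionary keyed by `frozenset` pairs, and the four-leaf example matrix are all hypothetical and not taken from the slides.

```python
from itertools import combinations


def upgma(names, dist):
    """names: leaf labels; dist: dict mapping frozenset({a, b}) -> distance.
    Returns (tree, root_height), with the tree as nested tuples."""
    # cluster id -> (subtree, cluster size, height of its node)
    clusters = {name: (name, 1, 0.0) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Find the two clusters with minimal distance.
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (ti, ni, _), (tj, nj, _) = clusters.pop(i), clusters.pop(j)
        dij = d[frozenset((i, j))]
        k = (i, j)  # id of the merged cluster
        # Size-weighted average distance to every remaining cluster.
        for l in clusters:
            d[frozenset((k, l))] = (
                d[frozenset((i, l))] * ni + d[frozenset((j, l))] * nj
            ) / (ni + nj)
        clusters[k] = ((ti, tj), ni + nj, dij / 2)
    (tree, _, height), = clusters.values()
    return tree, height


# Hypothetical ultrametric example with four leaves.
example = {
    frozenset(("a", "b")): 2.0,
    frozenset(("c", "d")): 4.0,
    frozenset(("a", "c")): 8.0,
    frozenset(("a", "d")): 8.0,
    frozenset(("b", "c")): 8.0,
    frozenset(("b", "d")): 8.0,
}
tree, root_height = upgma(["a", "b", "c", "d"], example)
```

In this example a and b are merged first (height 1), then c and d (height 2), and the root is placed at height 4, which is half the distance between the two final clusters.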

Average Linkage Example
[Figure: successive distance matrices for leaves v, w, x, y, z as clusters are merged (v, w, x, yz; then v, w, xyz; then vw, xyz), with the resulting tree.]

Ultrametric Distances and Molecular Clock
Definition: A distance function $d(\cdot,\cdot)$ is ultrametric if, for any three distances $d_{ij} \le d_{ik} \le d_{jk}$, it is true that $d_{ij} \le d_{ik} = d_{jk}$.
The Molecular Clock: The evolutionary distance between species x and y is 2 × the Earth time to reach the nearest common ancestor. That is, the molecular clock has a constant rate in all species.
The molecular clock results in ultrametric distances.
[Figure: tree with leaves 1-5 drawn against a time axis in years.]
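The ultrametric condition can be checked directly: for every triple of leaves, the two largest of the three pairwise distances must be equal. The sketch below assumes distances stored in a dictionary keyed by `frozenset` pairs; the function name and both example matrices are hypothetical.

```python
from itertools import combinations


def is_ultrametric(names, d, tol=1e-9):
    """True if, for every triple, the two largest pairwise distances are equal."""
    for i, j, k in combinations(names, 3):
        _, middle, largest = sorted(
            (d[frozenset((i, j))], d[frozenset((i, k))], d[frozenset((j, k))])
        )
        if abs(largest - middle) > tol:
            return False
    return True


# Hypothetical clock-like distances: a, b are close; c is equally far from both.
clock_like = {
    frozenset(("a", "b")): 2.0,
    frozenset(("a", "c")): 6.0,
    frozenset(("b", "c")): 6.0,
}

# Hypothetical distances where b has evolved faster than a.
fast_b = {
    frozenset(("a", "b")): 4.0,
    frozenset(("a", "c")): 6.0,
    frozenset(("b", "c")): 8.0,
}

ok = is_ultrametric(["a", "b", "c"], clock_like)
not_ok = is_ultrametric(["a", "b", "c"], fast_b)
```

The second matrix fails the check because the two largest distances (6 and 8) differ, which is what happens when one lineage evolves faster than the others.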

Ultrametric Distances & Average Linkage
Average Linkage is guaranteed to correctly reconstruct a binary tree with ultrametric distances.
Proof: Exercise

Weakness of Average Linkage
Molecular clock: all species evolve at the same rate (Earth time).
However, certain species (e.g., mouse, rat) evolve much faster.
Example where UPGMA messes up:
[Figure: the correct tree and the tree produced by average linkage (AL), both on leaves 1-4.]

Additive Distances
Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of the lengths of the edges connecting them.
Given a tree T and additive distances $d_{ij}$, we can uniquely reconstruct the edge lengths:
- Find two neighboring leaves $i$, $j$, with common parent $k$
- Place parent node $k$ at distance $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ from any node $m \ne i, j$
[Figure: example tree with nodes numbered 1-13; the path between leaves 1 and 4 has length $d_{1,4}$.]

Additive Distances
For any four leaves x, y, z, w, consider the three sums
- d(x, y) + d(z, w)
- d(x, z) + d(y, w)
- d(x, w) + d(y, z)
One of them is smaller than the other two, which are equal. When x, y lie on one side of the middle edge and z, w on the other:
d(x, y) + d(z, w) < d(x, z) + d(y, w) = d(x, w) + d(y, z)
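The four-point condition doubles as a test for which leaves are paired. The sketch below computes the three sums for one quartet and returns the pairing with the smallest sum, or `None` if the two larger sums are not equal. The function name is hypothetical; the six distances are the v, w, x, y entries that appear in the worked example of this deck.

```python
def four_point_split(d, x, y, z, w, tol=1e-9):
    """Return the pairing with the smallest sum if the four-point condition holds, else None."""
    def dist(a, b):
        return d[frozenset((a, b))]

    sums = sorted(
        [
            (dist(x, y) + dist(z, w), ((x, y), (z, w))),
            (dist(x, z) + dist(y, w), ((x, z), (y, w))),
            (dist(x, w) + dist(y, z), ((x, w), (y, z))),
        ],
        key=lambda s: s[0],
    )
    smallest, middle, largest = sums
    if abs(largest[0] - middle[0]) > tol:
        return None  # not additive on this quartet
    return smallest[1]


# Distances among v, w, x, y from the worked example.
D = {
    frozenset(("v", "w")): 10, frozenset(("v", "x")): 17, frozenset(("v", "y")): 16,
    frozenset(("w", "x")): 15, frozenset(("w", "y")): 14, frozenset(("x", "y")): 9,
}
split = four_point_split(D, "v", "w", "x", "y")
```

Here the sums are 19, 31, and 31, so v, w are paired against x, y.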

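Reconstructing Additive Distances Given T
If we know T and D, but do not know the length of each edge, we can reconstruct those lengths.
Tree T: leaves v, w have parent a; leaves x, y have parent b; a, b, and leaf z meet at node c.
Edge lengths: v-a = 6, w-a = 4, a-c = 3, c-z = 7, b-c = 3, b-x = 5, b-y = 4.
Distance matrix D:
     w   x   y   z
v   10  17  16  16
w       15  14  14
x            9  15
y               14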

Reconstructing Additive Distances Given T
[Figure: the same tree T with its edge lengths unknown, next to the distance matrix D over the leaves v, w, x, y, z.]

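Reconstructing Additive Distances Given T
Replace the neighboring leaves v, w with their parent a, and compute the distances from a to the remaining leaves:
$d_{ax} = \frac{1}{2}(d_{vx} + d_{wx} - d_{vw})$
$d_{ay} = \frac{1}{2}(d_{vy} + d_{wy} - d_{vw})$
$d_{az} = \frac{1}{2}(d_{vz} + d_{wz} - d_{vw})$
Reduced distance matrix D1:
     x   y   z
a   11  10  10
x        9  15
y           14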
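The reduction step on this slide is a direct application of $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$. A minimal sketch, assuming distances in a dictionary keyed by `frozenset` pairs: the function name is hypothetical, and the two z entries (16 and 14) are inferred from the edge-length equations later in the example rather than read directly from the transcript.

```python
def replace_neighbors(d, i, j, others):
    """Distances from the common parent of neighbors i, j to each node m in others."""
    dij = d[frozenset((i, j))]
    return {
        m: 0.5 * (d[frozenset((i, m))] + d[frozenset((j, m))] - dij)
        for m in others
    }


# Entries of D involving v and w (z entries inferred, see the note above).
D = {
    frozenset(("v", "w")): 10,
    frozenset(("v", "x")): 17, frozenset(("w", "x")): 15,
    frozenset(("v", "y")): 16, frozenset(("w", "y")): 14,
    frozenset(("v", "z")): 16, frozenset(("w", "z")): 14,
}
from_a = replace_neighbors(D, "v", "w", ["x", "y", "z"])
```

The result gives the first row of D1: the parent a is at distance 11 from x and 10 from both y and z.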

Reconstructing Additive Distances Given T
Repeating the reduction replaces x, y with their parent b (matrix D2), and then leaves the pair a, c (matrix D3).
D2:
     b   z
a    6  10
b       10
D3: d(a, c) = 3
The edge lengths follow:
d(a, c) = 3
d(b, c) = d(a, b) - d(a, c) = 3
d(c, z) = d(a, z) - d(a, c) = 7
d(b, x) = d(a, x) - d(a, b) = 5
d(b, y) = d(a, y) - d(a, b) = 4
d(a, w) = d(z, w) - d(a, z) = 4
d(a, v) = d(z, v) - d(a, z) = 6
Correct!!!

Neighbor-Joining
- Guaranteed to produce the correct tree if the distance is additive
- May produce a good tree even when the distance is not additive
Step 1: Finding neighboring leaves
Define $D_{ij} = (N - 2)\,d_{ij} - \sum_{k \ne i} d_{ik} - \sum_{k \ne j} d_{jk}$
Claim: The above “magic trick” ensures that $D_{ij}$ is minimal iff i, j are neighbors
[Figure: four-leaf tree with leaves 1, 2, 3, 4 and edge lengths 0.1 and 0.4.]

Algorithm: Neighbor-Joining
Initialization:
- Define T to be the set of leaf nodes, one per sequence
- Let L = T
Iteration:
- Pick i, j such that $D_{ij}$ is minimal
- Define a new node k, and set $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for all $m \in L$
- Add k to T, with edges of lengths $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$, where $r_i = (N - 2)^{-1} \sum_{k \ne i} d_{ik}$
- Remove i, j from L; add k to L
Termination:
- When L consists of two nodes i, j, add the edge between them, of length $d_{ij}$
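A compact sketch of the neighbor-joining loop follows. It is an illustration rather than a reference implementation: the function name, the `frozenset`-keyed distance dictionary, and the generated internal node names (`node0`, `node1`, ...) are hypothetical, and N is taken to be the current size of L at each iteration, which is the usual convention but is an assumption about the slide's notation. The input is the five-leaf matrix from the additive-distance example, with the z entries inferred from that example's edge lengths.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Returns a list of edges (u, v, length). dist maps frozenset({a, b}) -> distance."""
    d = dict(dist)
    L = list(names)
    edges = []
    next_id = 0
    while len(L) > 2:
        n = len(L)
        # total[i] = sum over k != i of d_ik, so r_i = total[i] / (n - 2).
        total = {i: sum(d[frozenset((i, k))] for k in L if k != i) for i in L}
        # Pick the pair minimizing D_ij = (n - 2) d_ij - total_i - total_j.
        i, j = min(
            combinations(L, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        dij = d[frozenset((i, j))]
        k = f"node{next_id}"  # hypothetical name for the new internal node
        next_id += 1
        dik = 0.5 * (dij + (total[i] - total[j]) / (n - 2))
        edges += [(i, k, dik), (j, k, dij - dik)]
        L.remove(i)
        L.remove(j)
        for m in L:
            d[frozenset((k, m))] = 0.5 * (d[frozenset((i, m))] + d[frozenset((j, m))] - dij)
        L.append(k)
    i, j = L
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


pairs = {
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
}
D = {frozenset(p): float(v) for p, v in pairs.items()}
leaves = ["v", "w", "x", "y", "z"]
edges = neighbor_joining(leaves, D)
# Length of the edge attached to each leaf.
leaf_length = {n: l for u, v, l in edges for n in (u, v) if n in leaves}
```

On these additive distances the sketch recovers the leaf edge lengths of the worked example (v: 6, w: 4, x: 5, y: 4, z: 7) plus two internal edges of length 3.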

Parsimony
One of the most popular methods.
GIVEN a multiple alignment, FIND a tree and a history of substitutions explaining the alignment.
Idea: Find the tree that explains the observed sequences with a minimal number of substitutions.
Two computational subproblems:
- Find the parsimony cost of a given tree (easy)
- Search through all tree topologies (hard)

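Example: Parsimony Cost of One Column
[Figure: a tree for one alignment column with leaf sets {A}, {B}, {A}, {A}; an internal node whose daughters have disjoint sets receives {A, B} and the cost is incremented (C++).]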

Parsimony Scoring
Given a tree and an alignment column u, label the internal nodes to minimize the number of required substitutions.
Initialization:
- Set cost C = 0; node k = 2N - 1 (the root)
Iteration:
- If k is a leaf, set $R_k = \{ x_k[u] \}$ // $R_k$ is simply the character of the kth species
- If k is not a leaf, let i, j be the daughter nodes and compute $R_i$, $R_j$ first; then
  - set $R_k = R_i \cap R_j$ if the intersection is nonempty
  - set $R_k = R_i \cup R_j$, and increment C, if the intersection is empty
Termination:
- Minimal cost of the tree for column u = C
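The scoring pass can be written as a short recursion. In this sketch the tree for one column is represented as nested pairs with single-character strings at the leaves; that representation and the function name are assumptions made for illustration.

```python
def parsimony_score(tree):
    """Return (R, C): the candidate set at this node and the substitution count below it.
    A leaf is a one-character string; an internal node is a pair (left, right)."""
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    Ri, ci = parsimony_score(left)
    Rj, cj = parsimony_score(right)
    common = Ri & Rj
    if common:
        # Nonempty intersection: no extra substitution needed at this node.
        return common, ci + cj
    # Empty intersection: take the union and count one substitution.
    return Ri | Rj, ci + cj + 1


# One column with leaves A, B, A, A.
root_set, cost = parsimony_score((("A", "B"), ("A", "A")))
```

For this column the node above A and B gets {A, B} at a cost of 1, and the root gets {A}, so the total cost is 1.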

Example
[Figure: a larger tree with leaves labeled A and B, annotated with the sets {A}, {B}, and {A, B} computed at each node.]

Parsimony Traceback
Traceback:
- Choose an arbitrary nucleotide from $R_{2N-1}$ for the root
- Having chosen nucleotide r for parent k:
  - if $r \in R_i$, choose r for daughter i
  - else, choose an arbitrary nucleotide from $R_i$
It is easy to see that this traceback produces some assignment of cost C.
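A possible sketch of the traceback, combined with the bottom-up pass it depends on. The nested-pair tree representation and all function names are hypothetical, and the "arbitrary" choice is made deterministic here by taking the alphabetically smallest candidate.

```python
def annotate(tree):
    """Bottom-up pass. Returns (R, cost, children), with children None for a leaf."""
    if isinstance(tree, str):
        return ({tree}, 0, None)
    left, right = annotate(tree[0]), annotate(tree[1])
    common = left[0] & right[0]
    below = left[1] + right[1]
    if common:
        return (common, below, (left, right))
    return (left[0] | right[0], below + 1, (left, right))


def traceback(node, parent_choice=None):
    """Top-down pass: keep the parent's character if it is in R, else pick one from R."""
    R, _, children = node
    r = parent_choice if parent_choice in R else min(R)
    if children is None:
        return r
    return (r, traceback(children[0], r), traceback(children[1], r))


def count_substitutions(labeled):
    """Count edges whose two endpoints carry different characters."""
    if isinstance(labeled, str):
        return 0
    r, left, right = labeled
    total = 0
    for child in (left, right):
        child_label = child if isinstance(child, str) else child[0]
        total += (child_label != r) + count_substitutions(child)
    return total


annotated = annotate((("A", "B"), ("A", "A")))
labeled = traceback(annotated)
```

For the column A, B, A, A every internal node is labeled A, and the only substitution is on the edge leading to the leaf B, matching the cost C = 1 from the scoring pass.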

Traceback Example
[Figure: alternative labelings of the same tree with substitutions marked by x; some are admissible with traceback, and one is still optimal but inadmissible with traceback.]

Another Parsimony Algorithm
Let C(v) be the cost for the subtree rooted at node v.
Let C(v, x) be the cost for the subtree rooted at v if we force v to have value x.
Initialization: for each leaf v,
- C(v) = 0
- C(v, x) = 0 if x is the input character that labels v; C(v, x) = ∞ otherwise
Iteration: let u, w be the children of v; then
- $C(v, x) = \min(C(u) + 1,\; C(u, x)) + \min(C(w) + 1,\; C(w, x))$
- $C(v) = \min_x C(v, x)$
Termination: the minimal cost is C(root).
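This dynamic program can be sketched as a recursion that returns the table C(v, x) for each node. The nested-pair tree representation, the function names, and the fixed alphabet `"ACGT"` are assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"  # assumed alphabet for this sketch


def cost_table(tree):
    """Return {x: C(v, x)} for the subtree rooted at this node.
    A leaf is a one-character string; an internal node is a pair (u, w)."""
    if isinstance(tree, str):
        return {x: (0 if x == tree else math.inf) for x in ALPHABET}
    u, w = (cost_table(child) for child in tree)
    best_u, best_w = min(u.values()), min(w.values())  # C(u) and C(w)
    return {
        x: min(best_u + 1, u[x]) + min(best_w + 1, w[x])
        for x in ALPHABET
    }


def parsimony_cost(tree):
    """C(root) = min over x of C(root, x)."""
    return min(cost_table(tree).values())


cost_one = parsimony_cost((("A", "C"), ("A", "A")))
cost_two = parsimony_cost((("A", "C"), ("A", "C")))
```

Each child either keeps the parent's character x at cost C(child, x), or takes its own best character at cost C(child) + 1 for the substitution on the connecting edge.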

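Probabilistic Methods
A more refined measure of evolution along a tree than parsimony.
[Figure: root $x_{root}$ with children $x_1$, $x_2$ on branches of length $t_1$, $t_2$.]
$P(x_1, x_2, x_{root} \mid t_1, t_2) = P(x_{root})\, P(x_1 \mid t_1, x_{root})\, P(x_2 \mid t_2, x_{root})$
If we use Jukes-Cantor, for example, and $x_1 = x_{root} = A$, $x_2 = C$, $t_1 = t_2 = 1$:
$P = p_A \cdot \frac{1}{4}(1 + 3e^{-4\alpha}) \cdot \frac{1}{4}(1 - e^{-4\alpha}) = \left(\frac{1}{4}\right)^3 (1 + 3e^{-4\alpha})(1 - e^{-4\alpha})$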

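Probabilistic Methods
[Figure: tree with leaves $x_1, x_2, \ldots, x_N$, internal nodes $x_u$, and root $x_{root}$.]
If we know all internal labels $x_u$,
$P(x_1, x_2, \ldots, x_N, x_{N+1}, \ldots, x_{2N-1} \mid T, t) = P(x_{root}) \prod_{j \ne root} P(x_j \mid x_{parent(j)}, t_{j, parent(j)})$
Usually we don't know the internal labels, therefore
$P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_{x_{N+1}} \sum_{x_{N+2}} \cdots \sum_{x_{2N-1}} P(x_1, x_2, \ldots, x_{2N-1} \mid T, t)$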

Felsenstein’s Likelihood Algorithm
Let $P(L_k \mid a)$ denote the probability of all the leaves below node k, given that the residue at k is a.
To calculate $P(x_1, x_2, \ldots, x_N \mid T, t)$:
Initialization:
- Set k = 2N - 1
Iteration: compute $P(L_k \mid a)$ for all residues $a \in \{A, C, G, T\}$
- If k is a leaf node: set $P(L_k \mid a) = 1(a = x_k)$
- If k is not a leaf node:
  1. Compute $P(L_i \mid b)$, $P(L_j \mid b)$ for all b, for daughter nodes i, j
  2. Set $P(L_k \mid a) = \sum_{b, c} P(b \mid a, t_i)\, P(L_i \mid b)\, P(c \mid a, t_j)\, P(L_j \mid c)$
Termination:
- Likelihood at this column $= P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_a P(L_{2N-1} \mid a)\, P(a)$
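The recursion can be sketched for a single column as follows. This is an illustration under stated assumptions: substitution probabilities come from the Jukes-Cantor model with rate `alpha = 1.0`, the prior P(a) is uniform (1/4), and a tree is represented as either a leaf character or a pair of (subtree, branch length) tuples. The function names and the representation are hypothetical.

```python
import math

ALPHABET = "ACGT"


def jc_prob(b, a, t, alpha=1.0):
    """Jukes-Cantor P(b | a, t): probability that a becomes b over time t."""
    e = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * e) if a == b else 0.25 * (1 - e)


def partial_likelihoods(node):
    """Return {a: P(L_k | a)} for this node.
    A leaf is a character; an internal node is ((left, t_left), (right, t_right))."""
    if isinstance(node, str):
        return {a: 1.0 if a == node else 0.0 for a in ALPHABET}
    (left, ti), (right, tj) = node
    Li, Lj = partial_likelihoods(left), partial_likelihoods(right)
    # The double sum over b, c factorizes into a product of two single sums.
    return {
        a: sum(jc_prob(b, a, ti) * Li[b] for b in ALPHABET)
        * sum(jc_prob(c, a, tj) * Lj[c] for c in ALPHABET)
        for a in ALPHABET
    }


def column_likelihood(tree):
    """Sum over root residues a of P(L_root | a) P(a), with uniform P(a) = 1/4."""
    root = partial_likelihoods(tree)
    return sum(0.25 * root[a] for a in ALPHABET)


# Two leaves A and C, each on a branch of length 1 below the root.
tree = (("A", 1.0), ("C", 1.0))
likelihood = column_likelihood(tree)
root_is_A = 0.25 * partial_likelihoods(tree)["A"]
```

With the root fixed to A, the term `root_is_A` equals the value $(\frac{1}{4})^3 (1 + 3e^{-4\alpha})(1 - e^{-4\alpha})$ from the two-leaf example; summing over all four root residues gives the column likelihood.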

Felsenstein’s Likelihood Algorithm
Define: [formula not preserved in the transcript]
and recursively compute: [formula not preserved in the transcript]

Felsenstein’s Likelihood Algorithm
Now using u and U we can compute: [formulas not preserved in the transcript]

Probabilistic Methods
Given M (ungapped) alignment columns of N sequences, define the likelihood of a tree:
$L(T, t) = P(\text{Data} \mid T, t) = \prod_{m=1}^{M} P(x_{1m}, \ldots, x_{Nm} \mid T, t)$
Maximum Likelihood Reconstruction: Given data $X = (x_{ij})$, find a topology T and length vector t that maximize the likelihood L(T, t).

Current popular methods
HUNDREDS of programs available!
http://evolution.genetics.washington.edu/phylip/software.html#methods
Some recommended programs:
- Discrete, parsimony-based: Rec-1-DCM3, http://www.cs.utexas.edu/users/tandy/mp.html (Tandy Warnow and colleagues)
- Probabilistic: SEMPHY, http://www.cs.huji.ac.il/labs/compbio/semphy/ (Nir Friedman and colleagues)