Distance based phylogenetics


1 Distance based phylogenetics
Usman Roshan

2 Phylogenetics
Study of how species relate to each other
“Nothing in biology makes sense, except in the light of evolution”, Theodosius Dobzhansky, Am. Biol. Teacher (1973)
Rich in computational problems
Fundamental tool in comparative bioinformatics

3 Why phylogenetics?
Study of evolution
  Origin and migration of humans
  Origin and spread of disease
Many applications in comparative bioinformatics
  Sequence alignment
  Motif detection (phylogenetic motifs, evolutionary trace, phylogenetic footprinting)
  Correlated mutation (useful for structural contact prediction)
  Protein interaction
  Gene networks
  Vaccine development
  And many more…

4 Phylogeny Problem
[Figure: five taxa U, V, W, X, Y with the sequences AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT at the leaves; the problem is to reconstruct the tree that relates them]

5 Bipartitions Phylogenies are equivalent to bipartitions

6 Topological differences

7 Phylogeny Problem
Two main methodologies:
Alignment first and phylogeny second
  Construct an alignment using one of the many alignment programs in the literature
  Do manual (eye) adjustments if necessary
  Apply a phylogeny reconstruction method
  Fast but not biologically realistic
  The phylogeny is highly dependent on the accuracy of the alignment (but so is the alignment on the phylogeny!)
Simultaneous alignment and phylogeny reconstruction
  Output both an alignment and a phylogeny
  Computationally much harder
  Biologically more realistic, as insertions, deletions, and mutations occur during the evolutionary process

8 First methodology
Compute an alignment (for now we assume we are given an alignment)
Construct a phylogeny (two approaches)
Distance-based methods
  Input: distance matrix of pairwise statistical distance estimates between the aligned sequences
  Output: phylogenetic tree
  Fast but less accurate
Character-based methods
  Input: sequence alignment
  Output: phylogenetic tree
  Accurate but computationally very hard

9 Distance-based methods

10 Evolution on a single edge
Poisson process
  The number of changes in a fixed time interval t is independent of the number of changes in any other non-overlapping time interval u
  The expected number of changes in a time interval t is proportional to the length of the interval
  No changes occur in a time interval of length 0
Let X be the number of nucleotide changes on a single edge e. We assume X follows a Poisson process with mean $\lambda_e$. Probability dictates that
$$P(X = k) = \frac{e^{-\lambda_e}\lambda_e^{k}}{k!}, \quad k = 0, 1, 2, \ldots$$

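11 Evolution on a single edge
We want to compute $p_e$ (the probability of observing a nucleotide change on edge e), where $\lambda_e$ is the expected number of changes on e
The probability of observing a change is the sum of the probabilities of k changes over all odd values of k (even values are excluded because, with two states, an even number of changes restores the original state and cannot be seen)
$$p_e = \sum_{k\ \mathrm{odd}} \frac{e^{-\lambda_e}\lambda_e^{k}}{k!} = \frac{1}{2}\left(1 - e^{-2\lambda_e}\right)$$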
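The sum over odd k can be checked numerically. The sketch below assumes the two-state (0/1) setting, in which the closed form is (1 - e^{-2λ})/2; the function names and the number of summed terms are illustrative choices, not part of the slides.

```python
import math

def prob_observed_change(lam: float, terms: int = 60) -> float:
    """Sum Poisson probabilities over odd k (truncated at `terms`)."""
    return sum(
        math.exp(-lam) * lam**k / math.factorial(k)
        for k in range(1, terms, 2)  # k = 1, 3, 5, ...
    )

def closed_form(lam: float) -> float:
    """Closed form of the odd-k sum under a two-state model."""
    return 0.5 * (1.0 - math.exp(-2.0 * lam))

# Hypothetical values of the expected number of changes on an edge.
for lam in (0.1, 0.5, 2.0):
    print(lam, prob_observed_change(lam), closed_form(lam))
```

Both columns should agree to floating-point precision, and the probability approaches 1/2 as λ grows, which is why long edges carry little signal.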

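12 Evolution on a single edge
The expected number of nucleotide changes on a given edge e, in terms of the probability $p_e$ of observing a change on e, is given by
$$\lambda_e = -\frac{1}{2}\ln\left(1 - 2p_e\right)$$
Key: $\lambda_e$ is additive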

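13 Additivity
Assume we have a path of k edges and that p1, p2,…, pk are the probabilities of change on each edge of the path
Using induction we can show that the probability of change along the whole path is
$$p = \frac{1}{2}\left(1 - \prod_{i=1}^{k}\left(1 - 2p_i\right)\right)$$
The multiplicative term is hard to deal with and does not easily decompose into a product or sum of the pi’s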

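14 Additivity
But the expected number of nucleotide changes on the path is elegant: it is the sum of the expected numbers of changes on its edges
$$\lambda_{\mathrm{path}} = \sum_{i=1}^{k}\lambda_i = -\frac{1}{2}\sum_{i=1}^{k}\ln\left(1 - 2p_i\right)$$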
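A small numerical check of additivity, assuming the two-state relations p = (1 - Π(1 - 2p_i))/2 for a path and λ = -(1/2) ln(1 - 2p) for the expected number of changes. The edge probabilities are hypothetical values chosen only for illustration.

```python
import math

def path_change_prob(edge_probs):
    """Probability that the two ends of a path differ (two-state model)."""
    prod = 1.0
    for p in edge_probs:
        prod *= 1.0 - 2.0 * p
    return 0.5 * (1.0 - prod)

def expected_changes(p):
    """Expected number of changes given the probability p of an observed change."""
    return -0.5 * math.log(1.0 - 2.0 * p)

edge_probs = [0.05, 0.10, 0.20]  # hypothetical per-edge change probabilities

p_path = path_change_prob(edge_probs)
lam_path = expected_changes(p_path)
lam_sum = sum(expected_changes(p) for p in edge_probs)

print("p on path:", p_path)
print("lambda from path p:", lam_path)
print("sum of edge lambdas:", lam_sum)
```

The change probability of the path is not the sum of the edge probabilities, but the two λ values printed at the end coincide, which is the property distance methods rely on.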

15 Evolutionary models
Simple 0,1 alphabet evolutionary model
  i.i.d. model
  Uniformly random root sequence
Jukes-Cantor
  Uniformly random root sequence

16 Evolutionary models
General Markov Model
  Uniformly random root sequence
  i.i.d. model
For time-reversible models

17 Variation across sites
The standard assumption about how sites vary is that each site has its own multiplicative rate scaling factor
Typically these scaling factors are drawn from a Gamma distribution (or Gamma plus invariant sites)

18 Special issues
Molecular clock: the expected number of changes for a site is proportional to time
No-common-mechanism model: there is a random variable for every combination of edge and site

19 Evolutionary distance estimation

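20 Estimating evolutionary distances
For sequences A and B, what is the evolutionary distance under the Jukes-Cantor model?
A: ACCTGTGGGTAACCACCC
B: ACCTGAGGGATAGGTCCG
$$\lambda_{AB} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$
But we don’t know what $p$ (the probability that a site differs between A and B) is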

21 Estimating evolutionary distances
Assume nucleotide changes are Bernoulli trials (i.i.d. trials of success or failure)
$p$ is the probability of a head (a changed site) in each of n Bernoulli trials (n is the sequence length)
Compute a maximum likelihood estimate for $p$
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG

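22 Estimating evolutionary distance
We want to find the value of p that maximizes the probability of observing k differing sites out of n:
$$P = \binom{n}{k}p^{k}(1 - p)^{n - k}$$
Set dP/dp to 0 and solve for p to get
$$\hat{p} = \frac{k}{n}$$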

23 Estimating evolutionary distances
$\hat{p}$ = 7/18 (the two sequences below differ at 7 of their 18 sites)
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
Continuing in this manner, we estimate $\hat{p}$, and from it the distance, for all pairs of sequences in the alignment
We now have a distance matrix under a biologically sound evolutionary model
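The estimate can be reproduced directly from the two example sequences. This is a minimal sketch: the function names are illustrative, and the correction used is the standard Jukes-Cantor formula d = -(3/4) ln(1 - (4/3) p). Counting mismatches in the sequences as printed gives 7 differing sites out of 18.

```python
import math

def p_distance(a: str, b: str) -> float:
    """Maximum likelihood estimate of p: the fraction of differing sites (k/n)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)

def jukes_cantor_distance(p: float) -> float:
    """Expected changes per site under Jukes-Cantor, given p."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 3/4 (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

seq_a = "ACCTGTGGGTAACCACCC"
seq_b = "ACCTGAGGGATAGGTCCG"

p_hat = p_distance(seq_a, seq_b)
print("p_hat =", p_hat)
print("JC distance =", jukes_cantor_distance(p_hat))
```

Repeating this for every pair of sequences in an alignment fills the distance matrix that UPGMA or neighbor joining takes as input.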

24 Distance methods

25 Distance methods
UPGMA: similar to hierarchical clustering but not additive
Neighbor-joining: more sophisticated and additive
What is additivity?

26 Additivity

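27 UPGMA
UPGMA is not additive but works for ultrametric trees. Takes O(n^3) time
Distance matrix:
     A    B    C    D
A    -    6   26   26
B         -   26   26
C              -    6
D                   -
[Figure: tree with leaves A, B, C, D; branch labels 3, 3, 3, 3 on the leaf edges and 10, 10 on the internal edges]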

28 UPGMA
1. Initialize n clusters, where cluster i contains sequence i
2. Find the closest pair of clusters i, j, using the distances in matrix D
3. Make them neighbors in the tree by adding a new node (ij), and set the distance from (ij) to i and j as $D_{ij}/2$
4. Update the distance matrix D: for all clusters k compute the following ($n_i$ and $n_j$ are the sizes of clusters i and j respectively)
$$D_{(ij),k} = \frac{n_i D_{ik} + n_j D_{jk}}{n_i + n_j}$$
5. Delete the columns and rows for i and j in D and add new ones corresponding to cluster (ij), with distances as computed above
6. Go to step 2 until only one cluster is left
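The steps above can be sketched in a few lines. This is an illustrative, unoptimized version rather than a reference implementation: the dictionary-of-pairs matrix representation, the cluster naming scheme, and the returned list of merges are all assumptions made for the example. The two matrices are the ones shown on the UPGMA slides, and the size-weighted average is the usual UPGMA update.

```python
from itertools import combinations

def upgma(labels, d):
    """Return the merges as (cluster_i, cluster_j, height) in the order performed."""
    clusters = {name: 1 for name in labels}  # cluster name -> number of leaves
    dist = {frozenset(pair): float(v) for pair, v in d.items()}
    merges = []
    while len(clusters) > 1:
        # Step 2: closest pair of clusters (ties go to the first pair found).
        i, j = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        dij = dist[frozenset((i, j))]
        new = f"({i},{j})"
        ni, nj = clusters[i], clusters[j]
        # Step 4: size-weighted average distance to every other cluster.
        for k in clusters:
            if k not in (i, j):
                dist[frozenset((new, k))] = (
                    ni * dist[frozenset((i, k))] + nj * dist[frozenset((j, k))]
                ) / (ni + nj)
        # Step 5: replace i and j by the merged cluster.
        del clusters[i], clusters[j]
        clusters[new] = ni + nj
        merges.append((i, j, dij / 2))  # Step 3: new node sits at D_ij / 2
    return merges

labels = ["A", "B", "C", "D"]
ultrametric = {("A", "B"): 6, ("A", "C"): 26, ("A", "D"): 26,
               ("B", "C"): 26, ("B", "D"): 26, ("C", "D"): 6}
non_ultrametric = {("A", "B"): 13, ("A", "C"): 16, ("A", "D"): 26,
                   ("B", "C"): 12, ("B", "D"): 19, ("C", "D"): 13}

print(upgma(labels, ultrametric))
print(upgma(labels, non_ultrametric))
```

On the first matrix the merges are A with B, C with D, then the two pairs. On the second, B and C are joined first at height 6 and A is attached at height 7.25, which matches the values on the slide where UPGMA produces the wrong tree.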

29 UPGMA
Distance matrix:
     A    B    C    D
A    -    6   26   26
B         -   26   26
C              -    6
D                   -
[Figure: UPGMA tree with leaves A, B, C, D; branch labels 13, 13 and 3, 3, 3, 3]

30 UPGMA
Doesn’t work (in general) for non-ultrametric trees
Distance matrix:
     A    B    C    D
A    -   13   16   26
B         -   12   19
C              -   13
D                   -
[Figure: tree with leaves A, B, C, D; branch labels 3, 3, 3, 3 and 10, 10]

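31 UPGMA
UPGMA constructs an incorrect tree here
Distance matrix:
     A    B    C    D
A    -   13   16   26
B         -   12   19
C              -   13
D                   -
[Figure: UPGMA tree in which B and C are joined first (branch labels 6, 6), followed by A (branch labels 7.25), then D]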

32 UPGMA
Bipartition (BC,AD) is not in the true tree
[Figure: the true tree (branch labels 3, 3, 3, 3, 10, 10) beside the UPGMA tree (branch labels 7.25, 7.25, 6, 6)]

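33 Neighbor joining
Additive and O(n^3) time
1. Initialization: same as UPGMA
2. For each species i compute (n is the current number of clusters)
$$u_i = \frac{1}{n - 2}\sum_{k \neq i} D_{ik}$$
3. Select i and j for which $D_{ij} - u_i - u_j$ is minimum
4. Make them neighbors in the tree by adding a new node (ij), and set the distances from (ij) to i and j as
$$d_{i,(ij)} = \frac{1}{2}D_{ij} + \frac{1}{2}\left(u_i - u_j\right), \qquad d_{j,(ij)} = \frac{1}{2}D_{ij} + \frac{1}{2}\left(u_j - u_i\right)$$

34 Neighbor joining
5. Update the distance matrix D: for all clusters k compute
$$D_{(ij),k} = \frac{1}{2}\left(D_{ik} + D_{jk} - D_{ij}\right)$$
6. Delete the columns and rows for i and j in D and add new ones corresponding to cluster (ij), with distances as computed above
7. Go to step 2 until two nodes/clusters are left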
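A compact sketch of neighbor joining, using the common textbook formulation: u_i is the row sum divided by n - 2, the pair minimizing D_ij - u_i - u_j is joined, and the new row is (D_ik + D_jk - D_ij)/2. The matrix representation, node naming, and edge-list output are assumptions for the example. The input matrix is hypothetical, not the one from the slides: it was generated from a tree with leaf edges A = 10, B = 3, C = 3, D = 10 and an internal edge of length 3 separating {A, B} from {C, D}.

```python
from itertools import combinations

def neighbor_joining(labels, d):
    """Return the tree as a list of edges (node, neighbor, length)."""
    nodes = list(labels)
    dist = {frozenset(pair): float(v) for pair, v in d.items()}
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Average-like divergence of each node from all the others.
        u = {i: sum(dist[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        # Pair minimizing D_ij - u_i - u_j (ties go to the first pair found).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: dist[frozenset(p)] - u[p[0]] - u[p[1]])
        dij = dist[frozenset((i, j))]
        new = f"({i},{j})"
        edges.append((i, new, 0.5 * dij + 0.5 * (u[i] - u[j])))
        edges.append((j, new, 0.5 * dij + 0.5 * (u[j] - u[i])))
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                dist[frozenset((new, k))] = 0.5 * (
                    dist[frozenset((i, k))] + dist[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, dist[frozenset((a, b))]))  # connect the last two nodes
    return edges

labels = ["A", "B", "C", "D"]
# Hypothetical additive matrix derived from the tree described above.
additive_example = {("A", "B"): 13, ("A", "C"): 16, ("A", "D"): 23,
                    ("B", "C"): 9, ("B", "D"): 16, ("C", "D"): 13}

for edge in neighbor_joining(labels, additive_example):
    print(edge)
```

In this matrix the closest pair is B and C, so a closest-pair rule would join them first, while the u_i correction lets neighbor joining pair A with B and C with D and recover the generating edge lengths.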

35 NJ
NJ constructs the correct tree for additive matrices
Distance matrix:
     A    B    C    D
A    -   13   16   26
B         -   12   19
C              -   13
D                   -
[Figure: tree with leaves A, B, C, D; branch labels 3, 3, 3, 3 and 10, 10]

36 Simulation studies

37 Simulation studies
The true evolutionary tree is never known in practice. Simulation allows us to study the accuracy of methods under biologically realistic scenarios
The mathematics behind phylogenetics is often complex and challenging. Simulation allows us to study algorithms when theoretical analysis is not possible, and to examine algorithm performance under various conditions such as different evolutionary rates, sequence lengths, or numbers of taxa

38 Statistical consistency
As sequence lengths tend to infinity, the distance estimation improves and eventually leads to the true additive matrix
If a method like NJ is then applied, we get the true tree
In practice, however, we have limited sequence length. Therefore we want to know how much sequence length a method requires to achieve low error
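A toy simulation illustrates the first point for a single pair of sequences. It is a sketch under stated assumptions, not one of the studies summarized in the following slides: sites evolve independently under Jukes-Cantor along one edge, the true distance of 0.3 expected changes per site and the random seed are arbitrary choices, and the function names are illustrative.

```python
import math
import random

def simulate_pair(n_sites, d, rng):
    """Evolve a uniformly random root sequence along one edge of length d."""
    # Probability that a site differs at the end of the edge under Jukes-Cantor.
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    root = [rng.choice("ACGT") for _ in range(n_sites)]
    child = [
        rng.choice([x for x in "ACGT" if x != c]) if rng.random() < p else c
        for c in root
    ]
    return root, child

def jc_estimate(a, b):
    """Jukes-Cantor distance from the observed fraction of differing sites."""
    p_hat = sum(x != y for x, y in zip(a, b)) / len(a)
    if p_hat >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)

true_d = 0.3            # hypothetical true distance
rng = random.Random(0)  # fixed seed so the run is repeatable

for n in (100, 1000, 10000, 100000):
    a, b = simulate_pair(n, true_d, rng)
    print(n, round(jc_estimate(a, b), 4))
```

The printed estimates should scatter widely for short sequences and settle near 0.3 as n grows. Simulation studies of tree reconstruction ask the analogous question for whole trees: how long the sequences must be before the inferred tree is reliably correct.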

39 Convergence rates
Can be studied experimentally or theoretically
Theoretical results offer loose bounds
Experiments (under simulation) provide more realistic bounds on sequence lengths

40 Sequence length requirements

41 Sequence length requirements

42 Typical performance study

43 Sequence lengths for NJ
Sequence lengths required to obtain 90% accuracy

44 Error rate of NJ

