 14.4. Tue Introduction to models (Jarno)  16.4. Thu Distance-based methods (Jarno)  17.4. Fri ML analyses (Jarno)  20.4. Mon Assessing hypotheses.

1

2  14.4. Tue Introduction to models (Jarno)
   16.4. Thu Distance-based methods (Jarno)
   17.4. Fri ML analyses (Jarno)
   20.4. Mon Assessing hypotheses (Jarno)
   21.4. Tue Problems with molecular data (Jarno)
   23.4. Thu Problems with molecular data (Jarno)
   Phylogenomics
   24.4. Fri Search algorithms, visualization, and other computational aspects (Jarno)

3  Character based
     ◦ Parsimony
     ◦ Model based analyses
         maximum likelihood
         Bayesian methods
   Similarity based
     ◦ Distance methods

4  Distance estimates attempt to estimate the mean number of changes per site since two taxa last shared a common ancestor, based upon a model of how the sequences may have evolved

5 J

6

7

8  Number of changes between two sequences.
   Amino acid sequences (similarly also for nucleotides):
       KIMMO      KIMMO
       KIMMA      TI-MO
       d_H = 1    d_H = 1
   Hamming distance does not count gaps.
   Sometimes used in parsimony methods.

9  Edit distance does count gaps (-):
       KIMMO      KIMMO
       KIMMA      TI-MO
       d_H = 1    d_H = 1
       d_E = 1    d_E = 2
   Often used in parsimony methods.

10  The p-distance is a normalized Hamming or edit distance.
      ◦ Normalized to the length of the sequence alignment
    p_d = n_d / n,
      ◦ where p_d is the distance
      ◦ n_d is the number of differing nucleotides between the (aligned) sequences (Hamming distance)
      ◦ n is the total length of the alignment

11     KIMMO              KIMMO
       KIMMA              TI-MO
       d_H = 1            d_H = 1
       d_E = 1            d_E = 2
       d_p = 1/5 = 0.2    d_p = 1/4 = 0.25 (the gapped column is not counted)
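The three distances on the slides above can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's code: the function names are made up for this example, and "edit distance" is taken in the slides' sense (mismatches plus gap columns on an existing alignment), which is an assumption and not the general Levenshtein distance.

```python
def hamming_distance(a, b, gap="-"):
    """Count mismatching columns, skipping any column that contains a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if gap not in (x, y) and x != y)


def gap_counting_distance(a, b):
    """Count mismatching columns, treating a gap as one more difference."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def p_distance(a, b, gap="-"):
    """Hamming distance divided by the number of gap-free columns."""
    compared = sum(1 for x, y in zip(a, b) if gap not in (x, y))
    return hamming_distance(a, b, gap) / compared


for first, second in [("KIMMO", "KIMMA"), ("KIMMO", "TI-MO")]:
    print(first, second,
          hamming_distance(first, second),
          gap_counting_distance(first, second),
          p_distance(first, second))
```

Normalising by the number of gap-free columns reproduces the 1/4 of the second example; dividing by the full alignment length would give 1/5 instead.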

12  Distance models are often based upon some of the same assumptions as the models in ML but they are implemented in a different way ◦ Jukes Cantor model: assumes all changes equally likely ◦ General time reversible model (GTR): assigns different probabilities to each type of change ◦ LogDet (Paralinear) distance model: was devised to deal with unequal base frequencies in different sequences

13  All models include a correction for multiple substitutions at the same site
    All (except LogDet distances) can be modified to include a gamma correction for site rate heterogeneity
    [Figure: Seq 1 and Seq 2 compared site by site, with the number of changes at each site]

14  Jukes & Cantor model: d_xy = -(3/4) ln(1 - (4/3) D)
    d_xy = distance between sequence x and sequence y, expressed as the number of changes per site
    (note d_xy = r/n, where r is the number of replacements and n is the total number of sites. This assumes all sites can vary; when unvaried sites are present in two sequences it will underestimate the amount of change which has occurred at variable sites)
    D = the observed proportion of nucleotides which differ between two sequences (fractional dissimilarity)
    ln = natural log function to correct for superimposed substitutions

15  The 3/4 and 4/3 terms reflect that there are four types of nucleotides and three ways in which a second nucleotide may not match a first - with all types of change being equally likely (i.e. unrelated sequences should be 25% identical by chance alone)

16  The natural logarithm ln is used to correct for superimposed changes at the same site
    If two sequences are 95% identical they are different at 5% or 0.05 (D) of sites, thus:
      d_xy = -(3/4) ln(1 - (4/3) * 0.05) = 0.0517
    Note that the observed dissimilarity 0.05 increases only slightly to an estimated 0.0517 - this makes sense because in two very similar sequences one would expect very few changes to have been superimposed at the same site in the short time since the sequences diverged
    However, if two sequences are only 50% identical they are different at 50% or 0.50 (D) of sites, thus:
      d_xy = -(3/4) ln(1 - (4/3) * 0.5) = 0.824
    For dissimilar sequences, which may have diverged a long time ago, the use of ln implies that a much larger number of superimposed changes have occurred at the same site
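The two worked values above can be reproduced with a short function. This is a minimal sketch of the Jukes & Cantor correction as written on the slides; the function name and the range check are additions for this example.

```python
import math


def jukes_cantor_distance(D):
    """Jukes & Cantor distance from the observed proportion of differing sites D."""
    # The correction is only defined while 1 - (4/3) D stays positive,
    # i.e. for D below 0.75 (the level expected for unrelated sequences).
    if not 0 <= D < 0.75:
        raise ValueError("D must satisfy 0 <= D < 0.75")
    return -0.75 * math.log(1 - (4.0 / 3.0) * D)


print(round(jukes_cantor_distance(0.05), 4))  # 0.0517
print(round(jukes_cantor_distance(0.50), 3))  # 0.824
```

As D approaches 0.75 the argument of the logarithm approaches zero and the estimated distance grows without bound, which is why the correction becomes unreliable for very dissimilar sequences.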

17  The most common additional parameters are: ◦ A correction to allow different substitution rates for each type of nucleotide change ◦ A correction for the proportion of sites which are unable to change ◦ A correction for variable site rates at those sites which can change

18  LogDet (paralinear) distances were designed to deal with unequal base frequencies in each pairwise sequence comparison - thus they (putatively) allow base compositions to vary over the tree!
    This distinguishes them from the GTR distance model, which takes the average base composition and applies it to all comparisons

19  LogDet distances assume all sites can vary - thus it is important to remove those sites which cannot change  The proportion of such sites is typically slightly smaller than the observed number of constant sites and is estimated using ML  Invariable sites are removed according to the base composition of constant sites (rather than the base composition of all sites - which may be different) in order to preserve the correct base frequencies among remaining constant sites

20  LogDet distances: d_xy = -ln(det F_xy)
    d_xy = estimated distance between sequence x and sequence y
    ln = natural log function to correct for superimposed substitutions
    F_xy = 4 x 4 (there are four bases in DNA) divergence matrix for sequences x and y - this matrix summarises the relative frequencies of bases in a given pairwise comparison
    det = the determinant (a unique mathematical value) of the matrix

21  For sequences A and B, for 900 sequence positions, this matrix summarises pairwise site by site comparisons (it uses the data very efficiently)

                        Sequence B
                      a     c     g     t
    Sequence A   a   224     5    24     8
                 c     3   149     1    16
                 g    24     5   230     4
                 t     5    19     8   175

22  The matrix F_xy expresses this data as the proportions (e.g. 224/900 = 0.249) of sites:

                 a      c      g      t
             a  .249   .006   .027   .009
    F_xy =   c  .003   .166   .001   .018
             g  .027   .006   .256   .004
             t  .006   .021   .009   .194

    d_xy = -ln [det F_xy] = -ln [.002] = 6.216 (the LogDet distance between sequences A and B)
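The calculation above can be checked numerically. The sketch below uses the simplified formula exactly as given on the slides, d_xy = -ln(det F_xy), and the 900-site count matrix for sequences A and B. The determinant routine and the function names are written for this example only.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [[float(value) for value in row] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
    return det


def logdet_distance(counts):
    """-ln(det F), where F holds the counts as proportions of all sites."""
    total = sum(sum(row) for row in counts)
    f = [[c / total for c in row] for row in counts]
    return -math.log(determinant(f))


# Rows: base in sequence A (a, c, g, t); columns: base in sequence B.
counts = [
    [224, 5, 24, 8],
    [3, 149, 1, 16],
    [24, 5, 230, 4],
    [5, 19, 8, 175],
]
print(round(logdet_distance(counts), 3))  # 6.216
```

Working from the raw counts rather than the rounded proportions matters here: the determinant is close to 0.002, and -ln(0.002) on its own would give about 6.215.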

23  Very good for situations where base compositions vary significantly between sequences  Even when base compositions do not appear to vary the LogDet distances model performs at least as well as other distance methods  A drawback is that it assumes sites evolve identically and rates are equal for all sites  However, a correction whereby a proportion of invariable sites are removed prior to analysis appears to work very well in simulations

24  Site rate heterogeneity occurs when different sites in a molecule evolve at different rates due to different functional constraints
    Many models (Jukes Cantor, LogDet, some ML models) assume all sites can vary and all evolve at the same rate
    Ignoring rate heterogeneity underestimates the amount of change that has occurred - and thus distances between sequences - leading to incorrect trees
    A gamma correction for site rate heterogeneity can be included - if the model allows this (many do)

25
                     O. eremita  O. italicum  O. cristinae  O. lassallei  O. barnabita¹  O. barnabita
    O. eremita       -
    O. italicum      0.038       -
    O. cristinae     0.049       0.044        -
    O. lassallei     0.101       0.098        0.100         -
    O. barnabita¹    0.106       0.098        0.087         0.062         -
    O. barnabita     0.115       0.115        0.105         0.068         0.006          -

    This summary of the data is then used to infer the phylogenetic relationships of taxa

26  Four mathematical conditions must be satisfied:
      ◦ d(x,y) >= 0
      ◦ d(x,x) = 0 and d(y,y) = 0
      ◦ d(x,y) = d(y,x)  # symmetric
      ◦ d(x,z) <= d(x,y) + d(y,z)  # metric
    The last is also called the triangle inequality
    If the aforementioned conditions are not satisfied then d is not a distance, but a dissimilarity
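The four conditions above can be tested mechanically on a square matrix. The following is a minimal sketch; the function name, the tolerance parameter and both example matrices are hypothetical and chosen only to show one passing and one failing case.

```python
from itertools import permutations


def is_metric(d, tol=1e-9):
    """Check non-negativity, zero diagonal, symmetry and the triangle inequality."""
    n = len(d)
    for i in range(n):
        if abs(d[i][i]) > tol:
            return False
        for j in range(n):
            if d[i][j] < -tol or abs(d[i][j] - d[j][i]) > tol:
                return False
    # Triangle inequality for every ordered triple of distinct points.
    return all(d[x][z] <= d[x][y] + d[y][z] + tol
               for x, y, z in permutations(range(n), 3))


# Hypothetical example matrices.
distance = [[0, 4, 4],
            [4, 0, 6],
            [4, 6, 0]]
dissimilarity = [[0, 1, 5],
                 [1, 0, 1],
                 [5, 1, 0]]  # 5 > 1 + 1, so the triangle inequality fails

print(is_metric(distance), is_metric(dissimilarity))  # True False
```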

27  Euclidean distance (green)  Taxicab, city block or Manhattan distance (other colors) J

28  A metric is an ultrametric if:
      ◦ d(x, z) ≤ max(d(x, y), d(y, z)),
      ◦ which means that points can never fall between other points.
    If the distances between the sequences are ultrametric, then the tree formed by certain clustering methods (UPGMA) will be an accurate ultrametric tree. This results in an accurately rooted tree.
    If the distances are not ultrametric, the resulting tree will be an "inaccurate ultrametric tree".
    In practice, ultrametric distances would mean that a molecular clock exists!
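The ultrametric condition can be checked in the same way as the triangle inequality. This is a small sketch with a hypothetical function name and hypothetical example matrices; the first behaves like clock-like data, the second does not.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-9):
    """Three-point condition: d(x,z) <= max(d(x,y), d(y,z)) for all triples."""
    n = len(d)
    return all(d[x][z] <= max(d[x][y], d[y][z]) + tol
               for x, y, z in permutations(range(n), 3))


# Hypothetical matrices: in the first, the two largest distances of every
# triple are equal, as expected if all tips are equally far from the root.
clock_like = [[0, 2, 6],
              [2, 0, 6],
              [6, 6, 0]]
not_clock_like = [[0, 17, 21],
                  [17, 0, 12],
                  [21, 12, 0]]

print(is_ultrametric(clock_like), is_ultrametric(not_clock_like))  # True False
```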

29  Additive trees are generalizations of ultrametric trees.
    An additive metric is one for which:
      ◦ d(x,y)+d(u,v) <= max(d(x,u)+d(y,v), d(x,v)+d(y,u))
    An additive tree is further restricted by:
      ◦ d(x,y)+d(u,v) <= d(x,u)+d(y,v) = d(x,v)+d(y,u)
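The additive (four-point) condition says that, for any four taxa, the two largest of the three pairwise sums must be equal. A minimal sketch of that check follows; the function name and both example matrices are hypothetical illustrations.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Four-point condition: the two largest of the three sums are equal."""
    for x, y, u, v in combinations(range(len(d)), 4):
        sums = sorted([d[x][y] + d[u][v],
                       d[x][u] + d[y][v],
                       d[x][v] + d[y][u]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Hypothetical four-taxon matrices; the second differs in one distance only.
tree_like = [[0, 4, 4, 8],
             [4, 0, 6, 10],
             [4, 6, 0, 8],
             [8, 10, 8, 0]]
perturbed = [[0, 4, 4, 8],
             [4, 0, 6, 11],
             [4, 6, 0, 8],
             [8, 11, 8, 0]]

print(is_additive(tree_like), is_additive(perturbed))  # True False
```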

30 Additive distances:  If we could determine exactly the true evolutionary distance implied by a given amount of observed sequence change, between each pair of taxa under study, these distances would have the useful property of tree additivity.

31 Additive trees:  A phylogenetic tree is additive if the evolutionary distance separating any two points on a tree is equal to the total of the lengths of the branches that join the two points. Obtaining a tree using pairwise distances

32
         A    B    C    D
    A    -    4    4    8
    B    4    -    6   10
    C    4    6    -    8
    D    8   10    8    -

    [Figure: additive tree for A, B, C and D with terminal branches A = 1, B = 3, C = 2, D = 6 and an internal branch of 1]
    Note that the distances in the matrix and the tree path lengths match perfectly - this is a single unique additive tree

33  Unfortunately due to the finite amount of available data, stochastic (random) errors will cause deviation of the estimated distances from perfect tree additivity even when evolution proceeds exactly according to the distance model used  Poor estimates obtained using an inappropriate model will compound the problem  How can we identify the tree which best fits the experimental data from the many possible trees?

34 We have uncertain data that we want to fit to a particular mathematical model (an additive tree) and find the optimal value for the adjustable parameters (branching pattern and branch lengths)

35  Seeks to minimise the squared deviation of the tree path length distances from the distance estimates

36  Least squares methods
    Minimise discrepancy between observed D_ij and inferred d_ij

    Observed D_ij:
          A     B     C     D     E
    A     0    0.23  0.16  0.20  0.17
    B    0.23   0    0.23  0.17  0.24
    C    0.16  0.23   0    0.15  0.11
    D    0.20  0.17  0.15   0    0.21
    E    0.17  0.24  0.11  0.21   0

    Inferred d_ij: [Figure: unrooted tree for A, B, C, D and E with branch lengths v1 to v7]
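The least squares criterion compares the observed distances D_ij with the path lengths d_ij implied by a candidate tree. The sketch below evaluates that sum of squared deviations for the observed matrix above. The candidate is a deliberately simple star tree with made-up branch lengths (an assumption for illustration, not the seven-branch tree in the figure), so the resulting score is not a result from the lecture.

```python
from itertools import combinations


def sum_of_squares(observed, inferred):
    """Sum over all pairs i < j of (D_ij - d_ij)^2."""
    n = len(observed)
    return sum((observed[i][j] - inferred[i][j]) ** 2
               for i, j in combinations(range(n), 2))


def star_tree_path_lengths(branches):
    """Path lengths on a star tree: d_ij = v_i + v_j for i != j."""
    n = len(branches)
    return [[0.0 if i == j else branches[i] + branches[j] for j in range(n)]
            for i in range(n)]


observed = [
    [0.00, 0.23, 0.16, 0.20, 0.17],
    [0.23, 0.00, 0.23, 0.17, 0.24],
    [0.16, 0.23, 0.00, 0.15, 0.11],
    [0.20, 0.17, 0.15, 0.00, 0.21],
    [0.17, 0.24, 0.11, 0.21, 0.00],
]

# Hypothetical terminal branch lengths for taxa A to E.
candidate = star_tree_path_lengths([0.10, 0.13, 0.05, 0.09, 0.07])
print(sum_of_squares(observed, candidate))
```

A tree search would repeat this evaluation over many topologies, optimising the branch lengths for each, and keep the tree with the smallest score.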

37  For 20 taxa there are ~2 x 10^20 unrooted trees (close to Avogadro's constant)
    For 50 taxa there are ~3 x 10^74 unrooted trees (number of electrons in the universe?)
    How can we find the best tree?
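These counts follow from the standard formula for the number of unrooted, fully bifurcating trees on n labelled taxa, (2n-5)!! = 3 x 5 x ... x (2n-5). The formula itself is not stated on the slide; the sketch below assumes it, and the function name is chosen for this example.

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 20, 50):
    print(n, f"{num_unrooted_trees(n):.1e}")
```

Python integers have arbitrary precision, so the exact count is available even for 50 taxa; the formatting above only shortens the printout.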

38 Minimum Evolution Method  For each possible alternative tree one can estimate the length of each branch from the estimated pairwise distances between taxa and then compute the sum (S) of all branch length estimates  The minimum evolution criterion is to choose the tree with the smallest S value

39  Clustering methods do not optimize a criterion ◦ apply a particular algorithm to the observed data to come up with a tree  UPGMA and Neighbour-joining

40  Unweighted Pair Group Method using Arithmetic averages
    Assumes that sequences evolve under a "molecular clock"
    Always connects those taxa or clusters of taxa that have the shortest distance
    Gives an ultrametric tree

41

42 Distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to Mon-Hum and we'll calculate the new distances.
   [Figure: taxa Monkey, Human, Spinach, Mosquito and Rice, with Monkey and Human joined as Mon-Hum]

43 After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:
   Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55
   [Figure: Monkey and Human joined as Mon-Hum, with Spinach still to be joined]

44 [Figure: Mosquito joins Mon-Hum to form Mos-(Mon-Hum); Spinach and Rice are still separate]

45 [Figure: Spinach and Rice join to form Spin-Rice]

46 [Figure: Spin-Rice and Mos-(Mon-Hum) join to form (Spin-Rice)-(Mos-(Mon-Hum)), completing the tree]

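47  True tree: [Figure: tree for A, B, C and D with terminal branches A = 13, B = 4, C = 4, D = 10 and an internal branch of 2 + 2]

    Distance matrix:
         A    B    C    D
    A    0   17   21   27
    B   17    0   12   18
    C   21   12    0   14
    D   27   18   14    0

    UPGMA tree: [Figure: B and C joined first (branches 6 and 6), then D (branches 2 and 8), then A (branches 2.833 and 10.833)]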
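The UPGMA tree for this matrix can be reproduced with a short clustering loop. The sketch below is a plain, unoptimised illustration: the function name and the (cluster, cluster, height) return format are choices made for this example. The distance between two clusters is the average over all pairs of their members, and each new node is placed at half that distance.

```python
from itertools import combinations


def upgma(labels, dist):
    """Return the merges as (members_a, members_b, node_height) tuples."""
    index = {name: i for i, name in enumerate(labels)}
    clusters = [(name,) for name in labels]
    merges = []

    def cluster_distance(a, b):
        # Arithmetic average over every pair of leaves, one from each cluster.
        pairs = [(x, y) for x in a for y in b]
        return sum(dist[index[x]][index[y]] for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b) / 2))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


labels = ["A", "B", "C", "D"]
matrix = [
    [0, 17, 21, 27],
    [17, 0, 12, 18],
    [21, 12, 0, 14],
    [27, 18, 14, 0],
]
for a, b, height in upgma(labels, matrix):
    print(a, b, round(height, 3))
```

The node heights 6, 8 and 10.833 correspond to the branch lengths in the UPGMA tree (8 - 6 = 2 and 10.833 - 8 = 2.833). B and C are joined first, although in the true tree B groups with A, which shows how unequal rates mislead UPGMA.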

48  Does not assume a molecular clock  Approximates the minimum evolution method  Guaranteed (supposedly) to recover the true tree if the distance matrix is an exact reflection of the tree

49  Calculate a corrected distance matrix ◦ Adjust distance of each pair of taxa with average distance to all other taxa  Join two least distant taxa together to create a new node  Calculate branch lengths from node to each taxon separately taking into account average distance to all other taxa  Combine joined taxa and calculate corrected distances to remaining taxa and go through cycle again

50  Distance matrix (d_ij):
         A    B    C    D    E
    B    5
    C    4    7
    D    7   10    7
    E    6    9    6    5
    F    8   11    8    9    8

    1. Compute net divergences for every node:
       rA = 5+4+7+6+8 = 30    rB = 5+7+10+9+11 = 42    rC = 32
       rD = 38    rE = 34    rF = 44

    2. Compute rate corrected distance matrix: M_ij = d_ij - (r_i + r_j)/(N-2)
       M_AB = d_AB - (rA + rB)/(N-2) = 5 - (30+42)/4 = -13
       M_AC etc. etc.

          A       B       C       D       E
    B   -13
    C   -11.5   -11.5
    D   -10     -10     -10.5
    E   -10     -10     -10.5   -13
    F   -10.5   -10.5   -11     -11.5   -11.5

    [Figure: star tree with terminals A, B, C, D, E and F]
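Steps 1 and 2 above are easy to verify in code. The sketch below computes the net divergences and the rate corrected matrix M for the six-taxon example; the function names are chosen for this illustration, and the matrix is entered in full symmetric form.

```python
def net_divergences(d):
    """r_i = sum of the distances from node i to every other node."""
    return [sum(row) for row in d]


def rate_corrected_matrix(d):
    """M_ij = d_ij - (r_i + r_j) / (N - 2), with the diagonal left at zero."""
    n = len(d)
    r = net_divergences(d)
    return [[0.0 if i == j else d[i][j] - (r[i] + r[j]) / (n - 2)
             for j in range(n)]
            for i in range(n)]


taxa = ["A", "B", "C", "D", "E", "F"]
d = [
    [0, 5, 4, 7, 6, 8],
    [5, 0, 7, 10, 9, 11],
    [4, 7, 0, 7, 6, 8],
    [7, 10, 7, 0, 5, 9],
    [6, 9, 6, 5, 0, 8],
    [8, 11, 8, 9, 8, 0],
]

print(dict(zip(taxa, net_divergences(d))))
for name, row in zip(taxa, rate_corrected_matrix(d)):
    print(name, row)
```

The smallest entries are M_AB and M_DE (both -13), so either pair is a valid first join; the worked example continues with A and B.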

51  3. Join neighbours A and B to form node U and
    4. compute their branch lengths:
       S_AU = d_AB/2 + (rA - rB)/(2(N-2)) = 5/2 + (30-42)/(2(6-2)) = 1
       S_BU = d_AB - S_AU = 4
    5. Distance from U to remaining terminals:
       d_CU = (d_AC + d_BC - d_AB)/2 = 3
       d_DU = (d_AD + d_BD - d_AB)/2 = 6
       d_EU = (d_AE + d_BE - d_AB)/2 = 5
       d_FU = (d_AF + d_BF - d_AB)/2 = 7

         U    C    D    E
    C    3
    D    6    7
    E    5    6    5
    F    7    8    9    8

    Repeat steps 1-5

    1. Compute net divergences for every node:
       rU = 3+6+5+7 = 21    rC = 24    rD = 27    rE = 24    rF = 32
    2. Compute rate corrected distance matrix: M_ij = d_ij - (r_i + r_j)/(N-2)

          U       C       D       E
    C   -12
    D   -10     -10
    E   -10     -10     -12
    F   -10.7   -10.7   -10.7   -10.7

    3. Join neighbours U and C to form node V and
    4. compute their branch lengths:
       S_UV = d_CU/2 + (rU - rC)/(2(N-2)) = 1
       S_CV = d_CU - S_UV = 2
    5. Distance from V to remaining terminals:
       d_DV = (d_DU + d_CD - d_CU)/2 = 5
       d_EV = (d_EU + d_CE - d_CU)/2 = 4
       d_FV = (d_FU + d_CF - d_CU)/2 = 6

    Repeat steps 1-5

    [Figure: tree so far, with branch lengths A-U = 1, B-U = 4, U-V = 1, C-V = 2; D, E and F not yet joined]

52  Distance matrix:
         V    D    E
    D    5
    E    4    5
    F    6    9    8

    1. Compute net divergences for every node:
       rV = 5+4+6 = 15    rD = 19    rE = 17    rF = 23
    2. Compute rate corrected distance matrix: M_ij = d_ij - (r_i + r_j)/(N-2)

          V     D     E
    D   -12
    E   -12   -13
    F   -13   -12   -12

    3. Join neighbours D and E to form node W and
    4. compute their branch lengths:
       S_DW = d_DE/2 + (rD - rE)/(2(N-2)) = 3
       S_EW = d_DE - S_DW = 2
    5. Distance from W to remaining terminals:
       d_VW = (d_DV + d_EV - d_DE)/2 = 2
       d_FW = (d_DF + d_EF - d_DE)/2 = 6

    Repeat steps 1-5

    [Figure: tree so far, with branch lengths A-U = 1, B-U = 4, U-V = 1, C-V = 2, D-W = 3, E-W = 2; F not yet joined]

53  Distance matrix:
         V    W
    W    2
    F    6    6

    1. Compute net divergences for every node:
       rV = 2+6 = 8    rW = 8    rF = 12
    2. Compute rate corrected distance matrix: M_ij = d_ij - (r_i + r_j)/(N-2)

          V     W
    W   -14
    F   -14   -14

    3. Join neighbours F and V to form node X and
    4. compute their branch lengths:
       S_VX = d_FV/2 + (rV - rF)/(2(N-2)) = 1
       S_FX = d_FV - S_VX = 5
    5. Distance from X to the remaining node W:
       d_WX = (d_FW + d_VW - d_FV)/2 = 1

         W
    X    1

    [Figure: final tree, with branch lengths A-U = 1, B-U = 4, U-V = 1, C-V = 2, V-X = 1, F-X = 5, W-X = 1, D-W = 3, E-W = 2]
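One way to check the finished neighbour-joining tree is to confirm that its path lengths reproduce the starting distance matrix. The sketch below does this with the branch lengths computed in the steps above (S_AU, S_BU, S_UV, S_CV, S_DW, S_EW, S_VX, S_FX and d_WX); the data layout and function name are choices made for this example.

```python
# Branch lengths of the final tree, keyed by the two nodes each branch joins.
edges = {
    ("A", "U"): 1, ("B", "U"): 4, ("U", "V"): 1, ("C", "V"): 2,
    ("V", "X"): 1, ("F", "X"): 5, ("W", "X"): 1,
    ("D", "W"): 3, ("E", "W"): 2,
}

# Original pairwise distances between the six terminals.
original = {
    ("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 7, ("A", "E"): 6, ("A", "F"): 8,
    ("B", "C"): 7, ("B", "D"): 10, ("B", "E"): 9, ("B", "F"): 11,
    ("C", "D"): 7, ("C", "E"): 6, ("C", "F"): 8,
    ("D", "E"): 5, ("D", "F"): 9, ("E", "F"): 8,
}


def path_lengths(edges, start):
    """Distance from start to every node, summing branches along the tree."""
    graph = {}
    for (u, v), length in edges.items():
        graph.setdefault(u, []).append((v, length))
        graph.setdefault(v, []).append((u, length))
    dist = {start: 0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour, length in graph[node]:
            if neighbour not in dist:  # a tree has exactly one path to each node
                dist[neighbour] = dist[node] + length
                stack.append(neighbour)
    return dist


mismatches = [(a, b) for (a, b), d_ab in original.items()
              if path_lengths(edges, a)[b] != d_ab]
print(mismatches)  # [] when the tree reproduces every distance
```

An empty list means the input distances were perfectly additive, which is the situation in which neighbour joining is expected to recover the tree exactly.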

54  True tree: [Figure: tree for A, B, C and D with terminal branches A = 13, B = 4, C = 4, D = 10 and an internal branch of 2 + 2]

         A    B    C    D
    A    0   17   21   27
    B   17    0   12   18
    C   21   12    0   14
    D   27   18   14    0

    Compute net divergences for every node:
       rA = 17+21+27 = 65
       rB = 17+12+18 = 47
       rC = 21+12+14 = 47
       rD = 27+18+14 = 59
    Compute rate corrected distance matrix: M_ij = d_ij - (r_i + r_j)/(N-2)
       M_AB = d_AB - (rA + rB)/(N-2) = 17 - (65+47)/2 = -39
       M_AC = d_AC - (rA + rC)/(N-2) = 21 - (65+47)/2 = -35
       M_AD = d_AD - (rA + rD)/(N-2) = 27 - (65+59)/2 = -35
       M_BC = d_BC - (rB + rC)/(N-2) = 12 - (47+47)/2 = -35
       M_BD = d_BD - (rB + rD)/(N-2) = 18 - (47+59)/2 = -35
       M_CD = d_CD - (rC + rD)/(N-2) = 14 - (47+59)/2 = -39

          A     B     C     D
    A     0
    B   -39     0
    C   -35   -35     0
    D   -35   -35   -39     0

    The smallest values (M_AB = M_CD = -39) pair A with B and C with D, as in the true tree.
    Beware! There are cases where NJ does not work!

55  Fast when using clustering algorithms - suitable for analysing data sets which are too large for ML  A large number of models are available with many parameters - improves estimation of distances

56

57

58  Distance estimates are only correct if model used is correct  Rate variations in different parts of a tree are intractable for distance measures ◦ Information on variation in characters is lost once sequence differences are converted to distances

59  Information is lost - given only the distances it is impossible to derive the original sequences  Only through character based analyses (ML, parsimony) can the most informative positions be inferred  Generally outperformed by Maximum likelihood methods in choosing the correct tree in computer simulations (but logDet is better in some situations)

60  "Nothing in biology makes sense except in the light of evolution" - Dobzhansky 1973
    "Nothing in evolution makes sense except in the light of phylogeny" - Savage 1997

61  The study of character evolution  The study of historical biogeography  The study of the temporal framework of evolution and diversification  The study of molecular evolution

62  Ease of data generation for large numbers of taxa  Ease of generating a large number of independent data sets for given taxa  Molecular characters behind the morphological characters we see

63  The butterfly family Nymphalidae

64 Wahlberg et al (2009) Proc R Soc 276: 4295-4302

65 [Figure: dated tree with ages of 104, 94 and 65 mya marked; clades shown: Libytheinae, Danaini, Tellervini+Ithomiini, Limenitidinae, Heliconiinae, Pseudergolinae, Apaturinae, Biblidinae, Cyrestinae, Nymphalinae, Calinaginae, Charaxinae, Satyrinae] Wahlberg et al (2009) Proc R Soc 276: 4295-4302

66 Peña & Wahlberg (2008) Biology Letters 4: 274-278. satyrine clade

67 [Figure: tree labelled with geographic regions: Widespread; Neotropics and/or Oriental; Neotropics, Oriental, Australia; Oriental; Oriental and/or Afrotropics; Oriental and/or Neotropics; Neotropics, Oriental, or widespread including Afrotropics; Neotropics, Oriental, Afrotropics; Oriental (and Neotropics?)] Wahlberg et al (2009) Proc R Soc 276: 4295-4302

68 90 Mya

69 80 Mya

70 70 Mya

71 60 Mya

72 Wahlberg et al (2009) Proc R Soc 276: 4295-4302

