Lecture 7: Difficult problems... and solutions. Platypus (Ornithorhynchus anatinus)

1 Lecture 7: Difficult problems... and solutions. Platypus (Ornithorhynchus anatinus)

2 Non-homogeneous evolution
Mutations at some sites are lethal, so those sites are invariant.
Taxon1 ACGTAAGTCATCGTAGC
Taxon2 ATGGAAATTATCGCGGT
Taxon3 ACATAAATCATCGTAGA
Taxon4 ACGCAAGTCATCGAAGT
[Figure: two trees of taxa 1 to 4, one inferred assuming equal substitution rates across sites, the other allowing some sites to be invariant.]
Allowing some sites to be invariant reveals more parallel evolution among the variable sites.

3 Rates can also differ among the variable sites due to fitness effects, differential mutability and codon bias, again leading homogeneous models to underestimate parallel change. Such rate variation can often be accommodated by assuming a gamma distribution of rates across sites in the likelihood (or distance) model.

4 Non-homogeneous data partitions
[Figure: Partition 1 and Partition 2, reconstructed under a single likelihood model. Kolaczkowski and Thornton (Nature, 2004)]
Codon pos.  123123123123
Rifleman    GTAACACTAGCC
Broadbill   GTCACACTAGCC
Flycatcher  GTTACATTAGCC
Lyrebird    GTTACTTTAGCA
Indigobird  GTAACCCTAGCC
ZebraFinch  GTAACCTTAGCA
Rook        GTAACTCTAGCA
Variable sites are shown in red on the slide; most change is at 3rd codon positions.

5 Competing hypotheses for the interrelations of the mammalian subclasses: Marsupionta (monotremes + marsupials) versus Theria (marsupials + placentals). [Figure: two trees of reptiles, monotremes, marsupials and placentals.]

6 Janke et al. (PNAS, 1997): ML analysis of complete mitochondrial genome protein-coding sequences. [Figure: tree supporting Marsupionta.]

7 Grouping of protein-coding and RNA-coding genes based on observed constant-site proportions and purine base frequency.
[Figure: plot with axes "Purine base frequency" and "ppn. constant sites". Plotted groups: RNA loops; RNA stems; COI; NADH6; ATPase8, NADH2, NADH4L; ATPase6, NADH1, NADH3, NADH4, NADH5; COII, COIII, Cytb.]
Model                     df    AIC
TN93+I+Γ (concatenated)   40    162260.5
TN93+I+Γ (partitioned)    480   158054.3

8 Partitioned ML: Theria is favoured. KH-test p-value: [value missing]. Phillips et al. (MPE, 2003). [Figure: tree of marsupials, monotremes, placentals and reptiles, with Theria labelled.]

9 Compositional heterogeneity
Stationarity: a standard assumption of most phylogeny reconstruction methods is that the underlying substitution processes are the same across the tree. When this assumption is violated, biases arise that provide signals in the data that can overwhelm the true phylogenetic signal. Shifting substitution processes (e.g. A→G being favoured in some branches but G→A in others) can result in signals for relationships that arise from similar DNA or protein sequence composition rather than shared ancestry.

10 Extreme example: NJ tree of mt 3rd codon positions, transitions only. Branch thickness is proportional to the T:C ratio. [Figure: NJ tree of Elephant, Platypus, Opossum, Bandicoot, Aardvark, Rook, Hippopotamus, Rhea, Vidua, Wallaroo, Brushtail Possum, Fin Whale, Mole, Armadillo, Green Turtle, Painted Turtle and Ostrich; values shown on the tree: 61, 53, 52, 68.]

11 Composition $\chi^2$ test (stochastic test)
Taxon          A        C        G        T
---------------------------------------------
Rifleman      165      154       82       95
Broadbill     203      142       48      103
Flycatcher    195      115       60      126
Lyrebird      138      142      127       89
Indigobird    137      144      128       87
Zebra Finch   141      143      124       88
Rook          145      144      118       89
Expected      160.57   140.57    98.14    96.71
$\chi^2 = \sum (\mathrm{Exp} - \mathrm{Obs})^2 / \mathrm{Exp} = 119.211273$, df $= (n-1)(t-1) = 18$, $P < 0.0001$
The test tells only of the presence of a bias and is unreliable when most of the variation occurs among a small number of character states.
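The arithmetic behind the slide's test statistic can be checked with a short script. This is a minimal sketch, assuming the usual contingency-table expectation (row total × column total / grand total); the function name `composition_chi_square` is illustrative and not part of the lecture.

```python
# Base counts per taxon in the order (A, C, G, T), copied from the slide's table.
counts = {
    "Rifleman":    (165, 154, 82, 95),
    "Broadbill":   (203, 142, 48, 103),
    "Flycatcher":  (195, 115, 60, 126),
    "Lyrebird":    (138, 142, 127, 89),
    "Indigobird":  (137, 144, 128, 87),
    "Zebra Finch": (141, 143, 124, 88),
    "Rook":        (145, 144, 118, 89),
}

def composition_chi_square(table):
    """Return (chi-square, degrees of freedom) for a taxon-by-base count table."""
    rows = list(table.values())
    n_taxa, n_states = len(rows), len(rows[0])
    grand_total = sum(sum(r) for r in rows)
    col_totals = [sum(r[j] for r in rows) for j in range(n_states)]
    chi2 = 0.0
    for r in rows:
        row_total = sum(r)
        for j, obs in enumerate(r):
            # Expected count if every taxon shared the pooled base composition.
            exp = row_total * col_totals[j] / grand_total
            chi2 += (exp - obs) ** 2 / exp
    df = (n_states - 1) * (n_taxa - 1)
    return chi2, df

chi2, df = composition_chi_square(counts)
print(f"chi-square = {chi2:.6f}, df = {df}")
```

Because every taxon here has 496 sites, the expected counts are the same for each row and equal the column means shown in the "Expected" row.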

12 Relative compositional variability (magnitude metric)
Allows the magnitude of compositional heterogeneity to be compared between sequences or coding regimes (for the same taxa), where $A_i$ is the observed frequency of adenine for taxon $i$, $A^*$ is the average frequency of adenine across all taxa, $n$ is the number of taxa and $t$ is the number of sites.
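As a rough illustration, the sketch below assumes the form in which RCV is usually written, $RCV = \sum_{i=1}^{n} (|A_i - A^*| + |T_i - T^*| + |C_i - C^*| + |G_i - G^*|) / (n \cdot t)$, and applies it to the base counts tabulated for the composition test on slide 11. The formula is an assumption here, and the function name `rcv` is hypothetical.

```python
# Base counts per taxon in the order (A, C, G, T), from the slide 11 table.
counts = {
    "Rifleman":    (165, 154, 82, 95),
    "Broadbill":   (203, 142, 48, 103),
    "Flycatcher":  (195, 115, 60, 126),
    "Lyrebird":    (138, 142, 127, 89),
    "Indigobird":  (137, 144, 128, 87),
    "Zebra Finch": (141, 143, 124, 88),
    "Rook":        (145, 144, 118, 89),
}

def rcv(table):
    """Relative compositional variability under the assumed formula above."""
    rows = list(table.values())
    n_taxa = len(rows)
    n_sites = sum(rows[0])  # assumes every taxon has the same number of sites
    n_states = len(rows[0])
    # Average count of each base across taxa (A*, C*, G*, T*).
    means = [sum(r[j] for r in rows) / n_taxa for j in range(n_states)]
    total_dev = sum(abs(obs - means[j]) for r in rows for j, obs in enumerate(r))
    return total_dev / (n_taxa * n_sites)

print(f"RCV = {rcv(counts):.4f}")
```

Dividing by the number of taxa and sites is what makes the value comparable between data sets of different size for the same taxa.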

13 Accounting for compositional heterogeneity
1. LogDet distances: recover additive distances between sequences when base composition varies.
For each pair of DNA sequences x and y, a 4×4 matrix $F_{xy}$ holds the proportion of sites showing each possible pair of states.
Olithodiscus (x) against Euglena (y), states ordered A, C, G, T. Counts:
     A     C     G     T
A   224     5    24     8
C     3   149     1    16
G    24     5   230     4
T     5    19     8   175
As proportions ($F_{xy}$):
     A       C       G       T
A   0.249   0.006   0.027   0.009
C   0.003   0.166   0.001   0.018
G   0.027   0.006   0.256   0.004
T   0.006   0.021   0.009   0.194
$D_{xy} = -\ln[\det F_{xy}] = 6.216$
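The distance quoted on the slide can be reproduced from the count matrix. This is a minimal sketch of the slide's simplified formula $D_{xy} = -\ln[\det F_{xy}]$ only; the helper names `determinant` and `logdet_distance` are illustrative, and the determinant is computed by plain Gaussian elimination to keep the example dependency-free.

```python
import math

# Pairwise state counts for Olithodiscus (x) and Euglena (y), from the slide.
pair_counts = [
    [224, 5, 24, 8],
    [3, 149, 1, 16],
    [24, 5, 230, 4],
    [5, 19, 8, 175],
]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def logdet_distance(counts):
    """-ln det F, where F is the count matrix scaled to proportions."""
    total = sum(sum(row) for row in counts)
    f_xy = [[c / total for c in row] for row in counts]
    return -math.log(determinant(f_xy))

d_xy = logdet_distance(pair_counts)
print(f"D_xy = {d_xy:.3f}")
```

Working from the raw counts rather than the rounded proportions avoids compounding the three-decimal rounding in the printed matrix.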

14 Rates-across-sites LogDet has yet to be developed, so this method is often inconsistent due to poor branch-length estimation. [Figure: trees of Euglena, Liverwort, Chlamydomonas, Rice, Tobacco, Anacystis, Chlorella and Olithodiscus from (a) Jukes-Cantor distances and (b) LogDet distances; key: chlorophyll a/b, chlorophyll a/c, phycobilin, uncertain. Lockhart et al. (MBE, 1994)]

15 2. Non-homogeneous base composition maximum likelihood. Galtier and Gouy (MBE, 1998)
[Figure: tree with the root labelled ω and Φ, and each branch labelled with its own λ and θ (λ1, θ1 to λ7, θ7).]
Parameter          Symbol   Number
root G+C%          ω        1
branch length      λ        2n-3
root location      Φ        1
Ts/Tv ratio        κ        1
equilibrium G+C%   θ        2n-2
Limitations: 1. restricted to GC vs. AT bias; 2. computer-time intensive.

16 3. Character state re-coding
Often much of the compositional heterogeneity arises within specific classes of character state, e.g. purine and pyrimidine transitions. These can be re-coded: RY-coding involves A,G → R and C,T → Y. Similarly, amino acids can be lumped into functionally similar groups, e.g. valine, leucine and isoleucine as a single category of mid-sized aliphatic amino acids.
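RY-coding is a simple character mapping, sketched below on one of the bird sequences from slide 4. The function name `ry_code` is illustrative; characters other than A, C, G and T (gaps, ambiguity codes) are left unchanged, which is an assumption of this sketch rather than something the lecture specifies.

```python
# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY_MAP = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})

def ry_code(seq):
    """Return the RY-coded version of a DNA sequence."""
    return seq.upper().translate(RY_MAP)

rifleman = "GTAACACTAGCC"
print(rifleman)
print(ry_code(rifleman))
```

After recoding, transitions (A↔G, C↔T) are invisible, so only transversions contribute to the signal.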

17 Nardi et al. (Science, 2003) found Hexapoda to be paraphyletic

18 Delsuc et al. (Science, 2003): 1st and 3rd codon positions RY-coded. $RCV_{nt} = 0.1064$; $RCV_{ry} = 0.0413$. [Figure: tree with Hexapoda labelled.]

19 Mistaking precision for accuracy
106 nuclear genes: different methods provide conflicting yeast topologies, each with 100% bootstrap support. The results underline the importance of understanding how non-phylogenetic signals will bias inference under the model used. Phillips et al. (MBE, 2004)

20 Not enough phylogenetic signal to resolve the tree
Branch lengths too short. Ans.: increase gene sequencing.
Signal erosion with time. Ans.: use high-value (often slower-evolving) characters.
Long unbroken branches make for noisier data. Ans.: increase taxon sampling.

21 Stemminess (Fiala and Sokal: Evol., 1985) on uncorrected distance trees indicates the relative extent of phylogenetic signal erosion among alternative sequences (or coding regimes) for the same taxa.
Stemminess = 1 - (Σ external branch lengths / total tree length), i.e. the proportion of the tree length that lies on internal branches.
Greater phylogenetic signal retention for slower-evolving genes results in higher stemminess.

22 12 mitochondrial protein-coding genes: stemminess = 0.086. 5 nuclear protein-coding genes: stemminess = 0.440. [Figure: two trees, each of Tigercat, Dunnart, Wombat, Brushtail, Wallaroo, Monodelphis, Opossum, Spiny Bandicoot and Northern Brown Bandicoot.]

23 Saturation – the problem of multiple changes at the same sites Theory, simulations, and practical experience all indicate that the sequences must eventually lose information about events that were long ago. Part of the problem with using DNA sequence alignments to infer deep events is that the state space is small {A,C,G,T}

24 Other sorts of characters
In an idealised situation where each site had an infinite state space, there would be no parallel changes or reversals and our character matrices would be homoplasy-free. It is therefore worth looking for characters that are closer to this ideal than DNA sequences.

25 SINEs and LINEs
SINEs (and LINEs) are short (or long) interspersed nuclear elements: retrotransposed DNA elements that are copied into the genome. The same retrotransposon sequence is unlikely to insert independently at exactly the same position, so these are low-homoplasy markers.

26 Insertion event 1 into chromosome A. The SINE/LINE is copied from locus 1 on chromosome A to locus 2 on chromosome B.
Locus 2 sequence:
Taxon1 ATGCT-------//-------GTCTAGT
Taxon2 AGGCTGTTATGT//TCTCTAGGTCAAGT
Taxon3 ATGCTGCTATGT//TCTCTAGGTCTATT
Taxon4 ATACT-------//-------GTATAGT
Taxon3: present at loci 1 and 2. Taxon2: present at loci 1 and 2. Taxon4: present only at locus 1. Taxon1: not present at locus 1 or locus 2.

27 Competing hypotheses for the position of the whales

28 SINEs and LINEs provide homoplasy-free support for the position of the whales as the sister group to the hippos.

29 Genome-order based phylogeny: large state space
DNA sequences: 4 states per site.
Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site.
Circular genomes (1 site): with 37 genes, $2.56 \times 10^{52}$ states; with 120 genes, $3.70 \times 10^{232}$ states.
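The two state counts on this slide follow directly from the formula $2^{n-1}(n-1)!$ for signed circular gene orders. The short check below is a sketch; the function name `signed_circular_orders` is illustrative.

```python
import math

def signed_circular_orders(n_genes):
    """Number of distinct signed circular gene orders: 2^(n-1) * (n-1)!."""
    return 2 ** (n_genes - 1) * math.factorial(n_genes - 1)

for n in (37, 120):
    # Python integers are exact; convert to float only for display.
    print(n, f"{float(signed_circular_orders(n)):.2e}")
```

For comparison, a single DNA site has 4 states, so one gene-order "site" for a 37-gene mitochondrial genome has a vastly larger state space.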

30 Genome rearrangements. [Figure: a reference sequence compared with an inversion (of orange and blue), a transposition (of grey) and an inverted transposition (of grey); a legend symbol indicates sequence read direction.]

31 Breakpoint distance
1  2  3  4  5  6  7  8  9  10
1 -3 -2  4  5  9  6  7  8  10
Breakpoint distance = 5
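The count of 5 can be reproduced by listing the adjacencies of the rearranged genome that do not occur in the reference. This is a minimal sketch that treats both gene orders as linear and signed (an assumption; the circular wrap-around adjacency is ignored), with an illustrative function name.

```python
def breakpoint_distance(reference, genome):
    """Count adjacencies in `genome` that are absent from `reference`."""
    conserved = set()
    for a, b in zip(reference, reference[1:]):
        conserved.add((a, b))
        conserved.add((-b, -a))  # the same adjacency read on the opposite strand
    return sum((a, b) not in conserved for a, b in zip(genome, genome[1:]))

reference = list(range(1, 11))
rearranged = [1, -3, -2, 4, 5, 9, 6, 7, 8, 10]
print(breakpoint_distance(reference, rearranged))
```

The breakpoints found are between 1 and -3, -2 and 4, 5 and 9, 9 and 6, and 8 and 10; the pair (-3, -2) is conserved because it is the adjacency (2, 3) read in reverse.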

32 Minimum inversion distance
1  2  3  4  5  6  7  8  9  10
1  2  3 -8 -7 -6 -5 -4  9  10
1  8 -3 -2 -7 -6 -5 -4  9  10
1  8 -3  7  2 -6 -5 -4  9  10
Inversion distance = 3
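The three steps on this slide can be replayed with a small helper that reverses a segment and flips the sign of each gene in it. This sketch only reproduces the path shown; it does not search for the minimum number of inversions, and the function name and 0-based half-open indices are choices made for the example.

```python
def invert(genome, start, end):
    """Reverse genome[start:end] and flip the sign of every gene in it."""
    segment = [-g for g in reversed(genome[start:end])]
    return genome[:start] + segment + genome[end:]

genome = list(range(1, 11))
# Segments to invert, as 0-based half-open index ranges.
steps = [(3, 8), (1, 4), (3, 5)]
for start, end in steps:
    genome = invert(genome, start, end)
    print(genome)
```

Each printed line corresponds to one row of the slide after the reference order.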

33 Distance-based methods Tandy Warnow, UT-Austin

34 Maximum Parsimony on Rearranged Genomes (MPRG)
The leaves are rearranged genomes. Find the tree that minimizes the total number of rearrangement events. [Figure: example with genomes A, B, C, D and a tree with nodes A to F and edge lengths 3, 6, 2, 3, 4; total length = 18.] Tandy Warnow, UT-Austin

35 Mitochondrial genome rearrangement maximum parsimony. Fritzsch et al. (J. Theor. Biol., 2006). Data choice and analytical methods are in their infancy. Note the non-monophyly of Nematoda and Mollusca. Well-resolved sequence and morphology clades?

36 An additional possibility is that there are multiple signals: 1. biases in the data (e.g. compositional heterogeneity); 2. genes have different histories (e.g. lineage sorting or hybridization). If a gene has a long coalescent time, then its relationships among taxa may differ from the species tree. [Figure: gene tree and species tree for taxa A, B, C, D.]

37 Molecular dating: the molecular clock, e.g. Zuckerkandl and Pauling (J. Theor. Biol., 1965). [Figure: genetic change against time since divergence; genetic divergence, observed and corrected for saturation, against time since divergence, for Human-Chimpanzee, Human-Mouse and Human-Bird.]

38 Is the data clock-like? Can the deviation from an ultrametric tree be explained by the stochastic nature of substitution (sampling error), or do substitution rates differ across the tree?

39 Relative rates tests
$H_0$: two sister taxa are evolving at the same rate (by comparison with an outgroup). Hebsgaard et al. (TIM, 2005)

40 Molecular clock likelihood ratio test
$H_0$: a clock model explains the data as well as a non-clock model.
1. Optimise the likelihood of the (unrooted) tree under a non-clock model ($\ln L_n$).
2. Optimise the likelihood of the (rooted) tree under a clock model ($\ln L_c$).
3. Calculate the test statistic $2(\ln L_n - \ln L_c)$, which is non-negative because the clock model is nested within the non-clock model.
4. Compare this with a $\chi^2$ distribution critical value, where the degrees of freedom are the difference in the number of free parameters estimated by the two models ($= n - 2$).
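The four steps can be sketched as follows, taking the statistic as twice the log-likelihood gain of the non-clock model so that it is non-negative. The log-likelihood values and the taxon count in the example are hypothetical, and `chi2_sf` is a small hand-rolled chi-square upper-tail function for integer degrees of freedom, included only to keep the sketch free of external libraries.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # closed form for df = 1
    # Step up two degrees of freedom at a time using the incomplete gamma recurrence.
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def clock_lrt(lnl_nonclock, lnl_clock, n_taxa):
    """Return (statistic, df, p-value) for the molecular clock LRT."""
    stat = 2.0 * (lnl_nonclock - lnl_clock)
    df = n_taxa - 2
    return stat, df, chi2_sf(stat, df)

# Hypothetical optimised log-likelihoods for a 10-taxon data set.
stat, df, p = clock_lrt(lnl_nonclock=-12345.6, lnl_clock=-12360.1, n_taxa=10)
print(f"statistic = {stat:.2f}, df = {df}, p = {p:.2g}")
```

A small p-value rejects the clock, meaning the rate differences across the tree are larger than sampling error alone would explain.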

41 Linearized trees: Takezaki et al. (MBE, 1995)
Prune the taxa that are the most non-clock-like until the molecular clock likelihood ratio test is passed.
Concerns: 1. removing any branches reduces the power of the test (and so increases the probability of passing); 2. the remaining branches may hide complementary rate shifts that cancel out.

42 Relaxing the molecular clock
1. Local clocks: relies on the identification of rate classes with respect to clades. [Figure: tree with rate classes r1 to r3.]
2. Autocorrelated rate evolution: each rate $r_i$ is a function of the rate of its parent branch. Many different models of rate change have been applied, including quadratic, lognormal, exponential, gamma and Ornstein-Uhlenbeck. [Figure: tree with branch rates r1 to r10.]

43 3. Uncorrelated rate evolution
Method of Drummond et al. (PLoS Biol., 2006). Rates $r_i$ do not depend on the rate of their parent branch, but are drawn from a lognormal or exponential distribution that maximises the posterior probability of the tree. [Figure: tree with branch rates r1 to r10.]

44 Performance of correlated rates methods on trees simulated under uncorrelated rates among branches

45 Calibrating molecular clocks
Biogeographical divergences: e.g. New Zealand split from Gondwana about 80 million years ago, and so did some of New Zealand's endemic fauna.
Fossils that post-date divergences. [Figure: tree of penguins, albatross and ducks with a 61 Ma calibration and a 90 Ma estimate. Slack et al. (MBE, 2006)]

46 [Figure: calibration types on a time axis: point calibration; calibration bounds (upper and lower); flat prior; normal prior.]

47 Using a lognormal calibration (19 Ma to 25 Ma upper 95%, mean = 21 Ma) for cats/hyaenas. Barnett et al. (Curr. Biol., 2005). [Figure: dated tree; time axis from 25 to 0 millions of years ago.]
