Accuracy and precision of phylogenomic divergence-time estimates

Michael Matschiner, University of Zurich, @m_matschiner

Hi everybody. My name is Michael Matschiner, I’m a postdoc at the University of Zurich, and I’m going to talk about [quote]. Feel free to tweet about my talk if you like; my Twitter ID is shown here, and you can find more about my work at evoinformatics.eu.

Divergence-time estimation
So, you probably know that in principle, phylogenetic divergence-time estimation is quite simple: you use sequence data to infer the relationships among species, and fossil information to convert branch lengths into an absolute time scale, usually in millions of years.

Imprecise estimate
When you do that in a Bayesian framework, you obtain probabilistic confidence intervals, and the precision of these - the width of the confidence intervals - often depends on the amount of data that you use for the analysis. So with a small amount of data as shown here, you’re likely to end up with very imprecise age estimates, but fortunately, genetic data are no longer so limited,…

Precise estimate
…so we should in principle be able to use larger genetic datasets - genomic datasets - to achieve greater precision in the age estimates.

Inaccurate estimate
However, with greater precision, model violations can more easily lead to inaccurate estimates, which in my view are more problematic than imprecise estimates because they may lead us to draw conclusions that are not supported by the data.

Tree discordance Species tree Gene trees
One cause of model violation can be tree discordance, the frequency of which has become increasingly apparent with the use of genome-scale datasets. So when we have a species tree, as is now shown in gray, some genomic regions may have histories - gene trees - that are concordant with the species tree, meaning that they have the exact same topology as the species tree.

Tree discordance Incomplete lineage sorting
Other gene trees, however, can be discordant, for example due to incomplete lineage sorting, which can occur when two versions of the same gene remain in a population over the entire duration of an internal branch in the phylogeny.

Tree discordance Introgression
Discordant gene trees can also arise from introgression as the result of hybridization; however, that is something that I’m going to ignore in my talk.

Tree discordance Recombination
Each gene tree is associated with a certain region of the genome - which is here illustrated as a single chromosome - and these regions are separated by recombination breakpoints.

Tree discordance c-gene
The regions in between these breakpoints have been termed c-genes - coalescent genes - by Doyle in 1995, and I’m going to use the word c-gene often in my talk, so maybe keep in mind what it means. So this region is a c-gene,…
Doyle (1995) Syst Bot

Tree discordance c-gene
…this region is another c-gene,…

Tree discordance c-gene c-gene
…and on this chromosome there are four c-genes in total, corresponding to the four gene trees.

Tree discordance Alignment
For phylogenetic inference, we usually assume that the alignments used for the inference lie completely within such c-genes, and that they therefore result from a single evolutionary history. However, whether this assumption is true…

How long are c-genes?
…depends directly on this question: How long are these c-genes? Are they even long enough to fit any alignments? We may expect that the lengths of c-genes depend on the recombination rate and on the degree of incomplete lineage sorting, which in turn depends on population sizes and the lengths of internal branches of the species tree, but the absolute lengths of c-genes are usually very uncertain.
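As a back-of-the-envelope illustration of why c-gene lengths are expected to shrink with larger population sizes and higher recombination rates, the following minimal sketch uses a standard coalescent result: for n lineages sampled from one diploid population of size Ne, the expected total branch length of a gene tree is 4·Ne·(1 + 1/2 + … + 1/(n-1)) generations. It then assumes that every recombination event on that tree creates a breakpoint. This is a single-population simplification, not a species-tree simulation and not the method used in the talk; the function names are hypothetical, and its output is not expected to reproduce the simulated values.

```python
def expected_total_branch_length(ne: float, n_lineages: int) -> float:
    """Expected total branch length (generations) of a coalescent tree for
    n_lineages sampled from ONE diploid population of size ne (assumption:
    no species-tree structure)."""
    harmonic = sum(1.0 / i for i in range(1, n_lineages))
    return 4.0 * ne * harmonic


def approx_mean_breakpoint_spacing(ne: float, r: float, n_lineages: int) -> float:
    """Rough mean distance (bp) between recombination breakpoints, assuming
    every recombination event on the local tree produces a breakpoint."""
    breakpoints_per_bp = r * expected_total_branch_length(ne, n_lineages)
    return 1.0 / breakpoints_per_bp


# Illustrative values only: 20 lineages, r = 1e-8 per site per generation.
for ne in (50_000, 100_000, 200_000):
    spacing = approx_mean_breakpoint_spacing(ne, r=1e-8, n_lineages=20)
    print(f"Ne = {ne:>7,}: roughly {spacing:.0f} bp between breakpoints")
```

Under these assumptions the spacing is inversely proportional to both Ne and r, which is the qualitative dependence described above. In a species tree, long internal branches add further branch length, so the spacing would be expected to differ from this single-population figure.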

Simulations
So I ran some simulations to test this. I simulated species trees with an age of 5 million years and 20 extant species, which is comparable to some rapidly radiating groups like stick spiders on Hawaii or notothenioid fishes in Patagonia.
Stick spiders: Gillespie et al. (2018) Curr Biol; notothenioid fishes: Ceballos et al. (2019) BMC Evol Biol

Simulations
In total, I simulated 20 species trees with this age and species richness.

Simulations Ne = 100,000
And for each of the simulated species trees, I then simulated population processes with the software msprime, assuming population sizes of 100,000 individuals,…
msprime: Kelleher et al. (2016) PLoS Comput Biol

…50,000 individuals,…

…or 200,000 individuals, and…

Simulations Ne = 100,000 r = 5×10^-9/g
…per-site recombination rates of 5 times ten to the minus 9,…

Simulations Ne = 100,000 r = 10^-8/g
…10 to the minus 8,…

Simulations Ne = 100,000 r = 2×10^-8/g
…or 2 times ten to the minus 8 per generation.

c-gene sizes
[Figure: mean c-gene size (bp) for Ne = 50,000, 100,000, and 200,000, with r = 10^-8/g. c-genie: Malinsky & Matschiner (2019)]
Across the twenty species trees, the resulting mean c-gene sizes were already as short as […] bp when the population size was 50,000, and they further decreased with increasing population sizes, so that with a population size of 200,000, mean c-gene sizes are on the order of 7-8 bp. These results are for simulations with a recombination rate of 10 to the minus 8 per generation,…

c-gene sizes
[Figure: mean c-gene size (bp) for r = 5×10^-9/g, 10^-8/g, and 2×10^-8/g, with Ne = 100,000]
…but when I fix the population size and vary the recombination rate instead, the results are on a similar scale.
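One way to see why varying the population size and varying the recombination rate might give results on a similar scale is the population-scaled recombination rate, ρ = 4·Ne·r per site, a standard quantity in coalescent theory. The sketch below tabulates ρ for the parameter combinations shown on the slides. It assumes that only these combinations matter for the comparison, and the function name is hypothetical. In a species-tree setting Ne also controls the amount of incomplete lineage sorting, so equal ρ does not imply identical outcomes.

```python
import math


def scaled_recombination_rate(ne: float, r: float) -> float:
    """Population-scaled recombination rate per site, rho = 4 * Ne * r."""
    return 4.0 * ne * r


# Parameter combinations as they appear on the slides.
vary_ne = [(50_000, 1e-8), (100_000, 1e-8), (200_000, 1e-8)]
vary_r = [(100_000, 5e-9), (100_000, 1e-8), (100_000, 2e-8)]

for label, combos in (("varying Ne", vary_ne), ("varying r", vary_r)):
    for ne, r in combos:
        rho = scaled_recombination_rate(ne, r)
        print(f"{label}: Ne = {ne:,}, r = {r:.0e}/g, rho = {rho:.3f}")

# Doubling Ne and doubling r change rho by the same factor.
doubled_ne = scaled_recombination_rate(200_000, 1e-8)
doubled_r = scaled_recombination_rate(100_000, 2e-8)
print(math.isclose(doubled_ne, doubled_r))
```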

c-genes are short.*
*in rapidly diverging groups
So from this we learn that c-genes can in fact be extremely short, at least in rapidly diverging groups with species trees like those that I simulated.

Tree discordance c-gene
However, one could argue that it is actually not the size of c-genes that is relevant for the accuracy of phylogenetic analyses, but that we can ignore those recombination breakpoints at which only node ages change but not the topology.

Tree discordance Single-topology tract
This would mean that the lengths of what I call “single-topology tracts” are more relevant than those of c-genes.
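The relationship between c-genes and single-topology tracts can be sketched in a few lines: neighbouring c-genes that share a topology are merged, while a change in topology starts a new tract. This is a minimal illustration, not the implementation used for the talk. The input format (a list of lengths paired with a topology label) and the function name are assumptions, and the labels are assumed to be written in a canonical form so that identical topologies compare equal as strings.

```python
from itertools import groupby


def single_topology_tracts(c_genes):
    """Merge neighbouring c-genes that share a topology.

    c_genes: list of (length_bp, topology_label) tuples in chromosome order.
    Returns a list of (tract_length_bp, topology_label) tuples.
    """
    tracts = []
    for topology, group in groupby(c_genes, key=lambda segment: segment[1]):
        tracts.append((sum(length for length, _ in group), topology))
    return tracts


# Hypothetical example: four c-genes; the first two differ only in node ages.
c_genes = [
    (12, "((A,B),C)"),
    (7, "((A,B),C)"),
    (20, "((A,C),B)"),
    (5, "((A,B),C)"),
]
tracts = single_topology_tracts(c_genes)
print(tracts)
```

Only adjacent c-genes are merged, so the last c-gene in the example stays a separate tract even though its topology matches the first one.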

Single-topology tract sizes
[Figure: mean single-topology tract size (bp) for Ne = 50,000, 100,000, and 200,000, with r = 10^-8/g]
These tracts are somewhat longer than c-genes, with mean sizes on the order of […] bp with a population size of 50,000, around […] bp with a population size of 100,000, and around 20 bp with a population size of 200,000.

Single-topology tract sizes
[Figure: mean single-topology tract size (bp) for r = 5×10^-9/g, 10^-8/g, and 2×10^-8/g, with Ne = 100,000]
Again, if I modify the recombination rate instead of the population size, the results are similar. Notably, in all cases, the probability that an alignment of 5,000 bp lies completely within one c-gene or one single-topology tract is below 1%.
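To get a feeling for how such a probability relates to the mean tract size, here is a rough sketch that assumes breakpoints are scattered along the chromosome as a Poisson process with rate 1 / (mean tract size). Under that assumption, the probability that a randomly placed alignment of L bp contains no breakpoint is exp(-L / mean). The values reported in the talk come from the simulations, not from this approximation, and the function names and the example mean of 600 bp are purely illustrative.

```python
import math


def prob_alignment_within_one_tract(alignment_bp: float, mean_tract_bp: float) -> float:
    """P(no breakpoint inside the alignment), assuming Poisson-distributed
    breakpoints with rate 1 / mean_tract_bp per bp."""
    return math.exp(-alignment_bp / mean_tract_bp)


def mean_tract_needed(alignment_bp: float, target_prob: float) -> float:
    """Mean tract size (bp) at which the probability reaches target_prob."""
    return alignment_bp / -math.log(target_prob)


# Illustrative only: a 5,000 bp alignment and a hypothetical 600 bp mean tract.
p = prob_alignment_within_one_tract(5_000, 600)
print(f"P(alignment within one tract) = {p:.2e}")

# Mean tract size that would be required to reach a 1% chance.
print(f"Mean tract size for 1%: {mean_tract_needed(5_000, 0.01):.0f} bp")
```

Under this approximation, mean tract sizes of a few hundred bp leave a 5,000 bp alignment well below the 1% mark.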

How bad is this?
So, how bad is this for phylogenetic inference, and in particular for divergence-time estimation?

To find out, I used the simulated datasets despite their short c-genes for phylogenetic divergence-time estimation, and I estimated the divergence times with three different strategies.

Concatenation Long alignment (100,000 bp)
First, I ignored all recombination breakpoints and simply concatenated long regions of 100,000 bp, which I then used for divergence-time estimation with the software BEAST2.
BEAST2.5: Bouckaert et al. (2019) PLoS Comput Biol

Concatenation
[Figure: estimated node age versus true node age for Ne = 50,000 and Ne = 200,000, with r = 10^-8/g]
The results are shown here, as a comparison of the true node ages, which I knew from the simulations, and the estimated node ages obtained with BEAST2. The orange dots indicate mean node ages and the vertical bars show the Bayesian 95% confidence intervals. So this doesn’t look so bad, as there is clearly a correlation between the two, but you can see that, particularly with a population size of 200,000, all of the younger dots are above the diagonal, meaning that their ages are overestimated.

Concatenation
[Figure: ratio of estimated to true node age versus true node age for Ne = 50,000 and Ne = 200,000; values above 1 are overestimated, values below 1 are underestimated]
A better illustration may be this type of plot, showing the ratio between the estimated and true node ages for the same set of simulations. So if this is above 1, the age is overestimated, and if it’s below 1, the age is underestimated. So this shows again that young ages are overestimated, and that the degree of this overestimation increases with the population size.
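The ratio used in this type of plot is simple to compute. The sketch below works with the base-2 logarithm of the ratio, so that a twofold overestimate and a twofold underestimate are equally far from zero, which corresponds to the logarithmic axis suggested by the tick marks at 0.5, 1, 2, and 4. The node ages are invented for illustration and the function names are hypothetical.

```python
import math


def log2_age_ratio(estimated_age: float, true_age: float) -> float:
    """log2 of estimated / true node age; 0 means a perfect estimate."""
    return math.log2(estimated_age / true_age)


def classify(estimated_age: float, true_age: float) -> str:
    """Label a point estimate relative to the true node age."""
    ratio = estimated_age / true_age
    if ratio > 1.0:
        return "overestimated"
    if ratio < 1.0:
        return "underestimated"
    return "exact"


# Hypothetical (true age, estimated age) pairs in millions of years.
nodes = [(0.5, 1.0), (2.0, 2.1), (4.0, 3.8)]
for true_age, estimated_age in nodes:
    print(
        f"true {true_age:.1f}, estimated {estimated_age:.1f}: "
        f"log2 ratio {log2_age_ratio(estimated_age, true_age):+.2f} "
        f"({classify(estimated_age, true_age)})"
    )
```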

Concatenation
[Figure: ratio of estimated to true node age versus true node age for r = 5×10^-9/g and r = 2×10^-8/g, with Ne = 100,000]
When I vary the recombination rate instead of the population size, I find a similar degree of overestimation, but this degree does not increase with higher recombination rates.

Gene tree / species tree “Gene” alignments (20 × 5,000 bp)
As a second approach for divergence-time estimation, I used StarBEAST2, a gene-tree / species-tree approach based on the multi-species coalescent, with 20 alignments that were each 5,000 bp long. And because the c-genes were so short that each of these alignments contained a large number of them, I expected to see the same age overestimation as with concatenation,…
StarBEAST2: Ogilvie et al. (2018) Mol Biol Evol

[Figure: ratio of estimated to true node age versus true node age for Ne = 50,000 and Ne = 200,000, with r = 10^-8/g]
…and that was indeed the case. As before with concatenation, young node ages are overestimated, and this is particularly so when the population size is large.

[Figure: ratio of estimated to true node age versus true node age for r = 5×10^-9/g and r = 2×10^-8/g, with Ne = 100,000]
…and the overestimation occurred again when I varied the recombination rate instead of the population size, and this time a small increase in the overestimation could be observed with the higher recombination rate.

SNAPP Individual SNPs (5,000 SNPs)
As a third approach for divergence-time estimation, I applied SNAPP, which also implements the multi-species coalescent model just like StarBEAST2, but instead of sequence alignments, it uses individual SNPs to infer the species tree, and it should therefore be entirely robust to recombination.
SNAPP: Bryant et al. (2012) Mol Biol Evol; Stange et al. (2018) Syst Biol

SNAPP
[Figure: ratio of estimated to true node age versus true node age for Ne = 50,000 and Ne = 200,000, with r = 10^-8/g]
This in fact seems to be the case, because regardless of population size…

SNAPP
[Figure: ratio of estimated to true node age versus true node age for r = 5×10^-9/g and r = 2×10^-8/g, with Ne = 100,000]
…or recombination rate, the ratio between estimated and true node age is almost always very close to 1. So that’s good.

Precision
[Figure: mean precision by true node age (bins 0-1, 1-2, 2-3, 3-4, 4-5) for concatenation, gene tree / species tree, and SNAPP]
When I directly compare the mean precision - the mean width of confidence intervals - of node age estimates across the three approaches, I find that young ages always have the smallest confidence intervals and therefore the best precision,…

Accuracy
[Figure: mean accuracy by true node age (bins 0-1, 1-2, 2-3, 3-4, 4-5) for concatenation, gene tree / species tree, and SNAPP, with Ne = 200,000 highlighted]
…however, young ages also have the lowest accuracy, at least when concatenation or the gene-tree / species-tree approach was used. With the largest simulated population size of 200,000, the accuracy was even zero for nodes with ages up to 1 or 2 million years, meaning that none of their confidence intervals included the true node age. In contrast, node ages remained accurate when they were estimated with SNAPP, regardless of population size or recombination rate. So there’s some of the bright side of phylogenetics, I’d say.
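The two summary statistics can be written down compactly. In the sketch below, precision is taken as the mean width of the 95% intervals and accuracy as the proportion of intervals that contain the true node age, both per bin of true node age. Using the absolute interval width (as opposed to, say, a width relative to the node age) is an assumption, as are the function name and the example values, which are made up.

```python
from collections import defaultdict


def summarize_by_age_bin(nodes, bin_width=1.0):
    """Per-bin mean interval width (precision) and coverage (accuracy).

    nodes: iterable of (true_age, ci_lower, ci_upper) tuples.
    Returns {bin_index: (mean_width, proportion_containing_truth)}, where
    bin_index 0 covers true ages in [0, bin_width), and so on.
    """
    widths = defaultdict(list)
    hits = defaultdict(list)
    for true_age, lower, upper in nodes:
        bin_index = int(true_age // bin_width)
        widths[bin_index].append(upper - lower)
        hits[bin_index].append(lower <= true_age <= upper)
    return {
        bin_index: (
            sum(widths[bin_index]) / len(widths[bin_index]),
            sum(hits[bin_index]) / len(hits[bin_index]),
        )
        for bin_index in sorted(widths)
    }


# Hypothetical nodes: narrow but wrong intervals for young ages,
# a wider but correct interval for an older node.
nodes = [(0.5, 0.8, 1.0), (0.7, 0.9, 1.3), (3.5, 3.0, 4.0)]
summary = summarize_by_age_bin(nodes)
for bin_index, (mean_width, accuracy) in summary.items():
    print(f"age bin {bin_index}-{bin_index + 1}: mean width {mean_width:.2f}, accuracy {accuracy:.2f}")
```

In this made-up example the youngest bin has the narrowest intervals and an accuracy of zero, which is the pattern described for concatenation and the gene-tree / species-tree approach with large population sizes.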

The Bright Side of Phylogenetics
And before I finish, I’d like to highlight what I think will also become an important part of the Bright Side of Phylogenetics in the future. And that has to do with two preprints with the humble names [quote] and [quote]. Both of these preprints are available on bioRxiv and will soon come out in a high-profile journal.

Ancestral recombination graph Relate / tsinfer
The two preprints describe amazing progress in the development of so-called “ancestral recombination graph methods”, implemented in the programs Relate by Speidel et al. and tsinfer by Kelleher et al. What’s so awesome about these methods is that they infer all gene trees jointly with the positions of the recombination breakpoints, and that they are extremely fast in doing so. I haven’t had the time yet to apply these methods to my simulated datasets, but that’s something that I’ll do in the near future.
Relate: Speidel et al. (2019) bioRxiv; tsinfer: Kelleher et al. (2018) bioRxiv

Thanks
Milan Malinsky, University of Basel, Switzerland
Marcelo Sanchez, University of Zurich, Switzerland
With this, I’ll finish, and I would like to thank Milan Malinsky, who developed c-genie together with me, the tool with which I calculated c-gene size distributions from msprime output, and Marcelo Sanchez, who hosts me at the University of Zurich. I am currently employed by the University of Oslo, and my funding comes from the Research Council of Norway.

Code Slides https://github.com/mmatschiner/evol2019
The code for my analyses is available on GitHub, and my slides can be downloaded from my webpage, evoinformatics.eu/presentations.htm. Thank you for listening!
