Keystone Symposia. Genome Sequence Variation. Jan 08 – Jan 13, 2006

Estimating divergence times and testing for migration using multi-locus polymorphism data

Céline Becquet (a), Andrea S. Putnam (b), Peter Andolfatto (b), and Molly Przeworski (a)
(a) Dept. of Human Genetics, Chicago, IL, USA, 60637; (b) University of California at San Diego, La Jolla, CA, 92093

Abstract

Population divergence times are of interest in many contexts, from human genetics to conservation biology. These times can be estimated from polymorphism data. However, existing approaches make a number of assumptions (e.g., no recombination within loci or no migration since the split) that limit their applicability. To overcome these limitations, we developed an Approximate Bayesian Computation approach to estimate population parameters for a simple split model, allowing for migration as well as intralocus recombination. Application to simulated data suggests that the approach provides fairly accurate estimates of population sizes and divergence times and has high power to detect migration since the split. We illustrate the potential of the method by applying it to polymorphism data from five highly recombining loci surveyed in two closely related species of Lepidoptera (Papilio glaucus and P. canadensis).

Background

The demographic events experienced by populations influence their genealogical history and therefore the pattern of neutral polymorphism observable within and between extant populations (see Figure 1). For example, the number of alleles shared between very closely related species depends on the time at which the species split and whether gene flow occurred since the split (see Figure 2). Thus, polymorphism data can be used to estimate the demographic parameters describing the history of two incipient species (see Figure 3).

Here, we consider a simple model in which two populations split T generations ago and the number of migrants exchanged between them is M per generation. Na, N1 and N2 are the effective population sizes for the ancestral, first and second descendant populations, respectively. We denote the set of parameters by Θ. Our goal is to estimate the posterior distribution of Θ given the data D.

Rather than using all the data to estimate these parameters, we summarize the data for each locus by four statistics known to be sensitive to the parameters of interest (see Figure 1 for details). Given a genealogy G, the probability of obtaining these statistics can be calculated explicitly. We therefore take the following approach to obtain an estimate of the posterior distribution of the parameters:

$$p(\Theta \mid D) \;\propto\; p(\Theta) \int p(D \mid G, \Theta)\, p(G \mid \Theta)\, dG$$

Here p(Θ | D) is the posterior distribution, p(D | G, Θ) is calculated explicitly, the integral over genealogies G is estimated from coalescent simulations, and p(Θ) is the product of the prior distributions on the parameters.

Specifically, we pick a set of parameters independently from prior distributions, then simulate a genealogical history for each locus and calculate u = p(D | G, Θ). We then weight the values of the parameters by u to obtain an estimate of their posterior probability.

Table 1 – Comparison between methods

| Paper | Speed | Advantages | Drawbacks |
| Wakeley & Hey 1997 | Fast | Method of moments estimator; allows for recombination; multiple loci; can use genotype data | Summary of data; low accuracy; S_s & S_f > 0 required; model with no migration |
| Nielsen & Wakeley 2001 | Slow | Uses all the data; allows for uncertainty in nuisance parameters; more general model (allows for migration) | Uses one locus; recombination not allowed; haplotype data required |
| Hey & Nielsen 2004 | Slow | Same as Nielsen & Wakeley 2001; multiple loci | Recombination not allowed; haplotype data required |
| Leman et al. 2005 | Fast | Approximate Bayesian Computation approach; no need for S_s & S_f > 0; can use genotype data | Summary of data; uses one locus; recombination not allowed; model with no migration |
| Our method | Slow | Same as Leman et al. 2005; more general model; allows for recombination; multiple loci | Summary of data |

Fig-1. Summary statistics used for estimation
An example of polymorphism data at a locus in three sequences sampled from each of two populations. The horizontal lines represent aligned sequences; the colored squares, disc and ovals stand for segregating sites. We use the following summaries of the polymorphism data at each locus: the number of segregating sites specific to sample one (S_1), specific to sample two (S_2), shared between samples from both populations (S_shared) and fixed in either population sample (S_fixed). In the example shown, S_1 = 1, S_2 = 2, S_shared = 1 and S_fixed = 1.

Fig-2. Effects of divergence time and migration on polymorphism data
Examples of genealogical histories for three sequences sampled from each of two closely related populations, under different models. The patterns of polymorphism and divergence expected under each model are indicated for each panel. For simplicity, we present a single genealogy, but for recombining loci, there may be many histories within a single region (i.e. there is an ancestral recombination graph, rather than a tree). The vertical branches represent ancestral lineages for the six sequences; they are colored according to whether a mutation would lead to a fixed, shared or unique polymorphism in the sample (see Figure 1). In c, gene flow occurred (yellow line), thus sequence 3 was sampled in population one but its ancestor came from population two.
(Panel labels: sequences 1, 2, 3 and a, b, c; split time T; population sizes N1, N2 and Na; migration M in panel c.)
a. A gene genealogy for a recent divergence time without migration. Excess of shared polymorphisms (occurring along the red branch) and few fixed sites (purple branch).
b. A gene genealogy for an old divergence time without migration. Few shared polymorphisms (none here) and an excess of fixed sites.
c. A gene genealogy for an old divergence time with migration. Excess of shared polymorphisms and few fixed sites.

Fig-3. Performance on a small simulated data set
Mean of the divergence time (a) and the ratio of ancestral to current population size (b). The estimates are based on polymorphism data from ten simulated loci of 1 kb, generated with: a sample size of 20 individuals from each population, the population mutation rates θ1 = θ2 = θa = 0.001, T = 5 × 10^4 generations and M = 5. Each vertical line corresponds to one data set (Y-axis), the red line indicates the true value and the X-axis range corresponds to the range of the prior distribution. As can be seen, the divergence times tend to be over-estimated, while the ancestral population size estimates are more accurate.

Future directions

Our current method is relatively slow when using data from multiple loci because it is searching a huge space of possible histories and parameters. We would like to speed up the method and extend it to more complex models. To do so, we will need to account for two sources of variance: in the genealogies and in the parameters. We therefore plan to generate many genealogies for the same set of parameters in order to improve the accuracy of our estimate of p(D | Θ), and to use Markov Chain Monte Carlo in order to better explore the parameter space.

Application to two Papilio species

Fig-4. Ranges of P. glaucus and P. canadensis. A narrow hybrid zone forms where the ranges meet. Female mimetic morph of P. glaucus is shown with yellow morphs.

We applied our method to data from five highly recombining loci sampled in two species of Lepidoptera (Papilio glaucus and P. canadensis). These two species are known to exchange migrants and experience high levels of recombination. In order to examine the sensitivity to assumptions about migration, we compared the parameter estimates obtained in models with and without gene flow: the time of divergence appears to be under-estimated and the ancestral population size over-estimated when migration is ignored (see Table 2).
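
Applying the method to a data set starts by reducing each locus to the four summaries defined in Fig-1. The sketch below illustrates one way to count them from aligned sequences. It is not the authors' code: the function name `summary_statistics`, the 0/1 character encoding and the toy alignment are assumptions made for illustration, and every site is assumed to carry at most two alleles. The toy alignment was constructed so that it reproduces the counts quoted for the Fig-1 example.

```python
def summary_statistics(sample1, sample2):
    """Count S_1, S_2, S_shared and S_fixed for one locus.

    sample1, sample2: lists of equal-length strings, one per sequence,
    with one character per aligned site (hypothetical 0/1 encoding).
    Assumes biallelic sites.
    """
    counts = {"S1": 0, "S2": 0, "S_shared": 0, "S_fixed": 0}
    n1 = len(sample1)
    # Walk through the alignment one site (column) at a time.
    for column in zip(*(sample1 + sample2)):
        alleles1 = set(column[:n1])   # alleles seen in sample one
        alleles2 = set(column[n1:])   # alleles seen in sample two
        polymorphic1 = len(alleles1) > 1
        polymorphic2 = len(alleles2) > 1
        if polymorphic1 and polymorphic2:
            counts["S_shared"] += 1   # segregating in both samples
        elif polymorphic1:
            counts["S1"] += 1         # segregating in sample one only
        elif polymorphic2:
            counts["S2"] += 1         # segregating in sample two only
        elif alleles1 != alleles2:
            counts["S_fixed"] += 1    # fixed difference between samples
    return counts


# Toy alignment (made up): three sequences per population, six sites.
population_one = ["100010", "000110", "000010"]
population_two = ["010100", "001100", "001000"]
print(summary_statistics(population_one, population_two))
# {'S1': 1, 'S2': 2, 'S_shared': 1, 'S_fixed': 1}
```

Because the classification only asks whether each sample is polymorphic at a site, it needs no phase information within a site, which is consistent with the poster's remark that summary-based methods can use genotype data.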
Table 2 – Effect of model on estimation

| Estimator | Model | N1 | N2 | Na | T (generations) | Migration rate* |
| Mean | Migration | 2.65E+05 | 1.80E+05 | 1.72E+05 | 4.38E+05 | 0.154 |
| Mean | No migration | 2.59E+05 | 1.75E+05 | 6.25E+05 | 1.46E+05 | - |
| Median | Migration | 2.55E+05 | 1.66E+05 | 1.51E+05 | 4.45E+05 | 0.148 |
| Median | No migration | 2.52E+05 | 1.62E+05 | 5.84E+05 | 1.43E+05 | - |

* The posterior probability of migration is > 0.999, while the prior probability is only 0.5. Thus, there is strong support for gene flow, as expected.
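
The Mean and Median rows of Table 2 are summaries of a weighted sample of parameter draws. The toy sketch below shows how such weighted estimates can arise from the scheme described in the Background: draw parameters from the prior, simulate a genealogy, and weight each draw by u = p(D | G, Θ). It is deliberately far simpler than the model on this poster and should not be read as the authors' implementation. It assumes a single population, one non-recombining locus, a single parameter θ = 4Nμ with a uniform prior, and the total number of segregating sites as the only summary statistic. All function names and numerical settings are hypothetical.

```python
import math
import random


def total_tree_length(n, rng):
    """Total branch length of a standard coalescent genealogy for n
    sequences, in units of 2N generations."""
    length = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0             # coalescence rate with k lineages
        length += k * rng.expovariate(rate)  # k branches last until next event
    return length


def prob_data_given_tree(s_obs, theta, length):
    """u = p(D | G, theta): Poisson probability of s_obs segregating sites
    on a genealogy with the given total length."""
    mean = theta * length / 2.0
    if mean <= 0.0:
        return 1.0 if s_obs == 0 else 0.0
    return math.exp(s_obs * math.log(mean) - mean - math.lgamma(s_obs + 1))


def weighted_draws(s_obs, n, prior_max, draws, seed=0):
    """Return (theta, u) pairs: prior draws with their weights."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(draws):
        theta = rng.uniform(0.0, prior_max)  # draw from the (assumed) prior
        length = total_tree_length(n, rng)   # simulate a genealogy G
        pairs.append((theta, prob_data_given_tree(s_obs, theta, length)))
    return pairs


def weighted_mean(pairs):
    total = sum(u for _, u in pairs)
    return sum(theta * u for theta, u in pairs) / total


def weighted_median(pairs):
    total = sum(u for _, u in pairs)
    if total <= 0.0:
        raise ValueError("all weights are zero")
    running = 0.0
    for theta, u in sorted(pairs):
        running += u
        if running >= total / 2.0:
            return theta
    return max(theta for theta, _ in pairs)  # guard against rounding


# Hypothetical data: 10 segregating sites observed in 10 sequences.
pairs = weighted_draws(s_obs=10, n=10, prior_max=20.0, draws=20000)
print("posterior mean of theta:  ", weighted_mean(pairs))
print("posterior median of theta:", weighted_median(pairs))
```

In this toy each parameter draw is paired with a single simulated genealogy, so the weights are noisy. Averaging u over many genealogies per parameter draw, as proposed under Future directions, would reduce that noise at the cost of more simulation.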