Bayesian Estimators of Time to Most Recent Common Ancestry
Bayesian Estimators of Time to Most Recent Common Ancestry
Bruce Walsh, Ecology and Evolutionary Biology
Adjunct appointments: Molecular and Cellular Biology, Plant Sciences, Epidemiology & Biostatistics, Animal Sciences

Definitions
MRCA: Most Recent Common Ancestor
TMRCA: Time to Most Recent Common Ancestor
Question: Given molecular marker information from a pair of individuals, what is the estimated time back to their most recent common ancestor?
With even a small number of highly polymorphic autosomal markers, it is trivial to assess a TMRCA of zero (the same subject or biological sample) and one (parent-offspring).

Problems with Autosomal Markers
Often we are very interested in MRCAs that are modest (5-10 generations) or large (100s to 10,000s of generations) in age. Unlinked autosomal markers simply do not work over these time scales. Reason: IBD probabilities for individuals sharing a MRCA 5 or more generations ago are extremely small and hence very hard to estimate (a VERY large number of markers would be needed).

MRCA-I vs. MRCA-G
We need to distinguish between the MRCA for a pair of individuals (MRCA-I) and the MRCA for a particular genetic marker G (MRCA-G). MRCA-G varies between any two individuals across recombination units of the genome. For example, for a pair of relatives we could easily have:
MRCA (mtDNA) = 180 generations
MRCA (Y) = 350 generations
MRCA (one β-globin allele) = 90 generations
MRCA (other β-globin allele) = 400 generations

[Figure: MRCA-G > MRCA-I. For a given marker, the gene lineage passing through MRCA-I can be lost, so the marker's MRCA-G predates MRCA-I.]

mtDNA and Y Chromosomes
So how can we accurately estimate TMRCA over modest to large numbers of generations? Answer: use a set of completely linked markers. With autosomes, unlinked markers assort each generation, leaving only a small amount of IBD information at each marker, which we must then multiply together; this IBD information decays on the order of 1/2 each generation. With completely linked marker loci, IBD information does not assort away via recombination; it decays only on the order of the mutation rate.

Y chromosome microsatellite mutation rates - I
Estimates of u by source: Y chromosome (Kayser et al.); Y chromosome (Heyer et al.); autosomal chromosomes (Wong & Weber 1993; Brinkmann 1998). Estimates of the human microsatellite mutation rate are fairly consistent over both the Y and the autosomes.

Basic Structure of the Problem
What is the probability that the two marker alleles at a haploid locus from two related individuals agree, given that their MRCA was t generations ago? Phrased another way, what is their probability of identity in state (IBS), given that they are identical by descent (IBD) with a TMRCA of t generations?

Infinite Alleles Model
The first step in answering this question is to assume a particular mutational model. Our (initial) assumption will be the infinite alleles model (IAM). The key assumption of this model (originally due to Kimura and Crow, 1964) is that each new mutation gives rise to a new allele. The IAM was the first population-genetics model to attempt to formally incorporate the structure of DNA.

Key: Under the infinite alleles model, two alleles that are identical in state and also IBD have not experienced any mutations since their MRCA.
Let q(t) = probability that two alleles with a MRCA t generations ago are identical in state. If u is the per-generation mutation rate, then
Pr(no mutation from MRCA to A) = (1-u)^t
Pr(no mutation from MRCA to B) = (1-u)^t
so that q(t) = (1-u)^(2t).

Building the Likelihood Function for n Loci
For any single marker locus, the probability of IBS given a TMRCA of t generations is
q(t) = (1-u)^(2t) ≈ e^(-2ut) = e^(-τ), with τ = 2ut.
The probability that k of n marker loci are IBS is just a binomial distribution with success parameter q(t):
Pr(k of n match | t) = C(n,k) q(t)^k [1 - q(t)]^(n-k),
which is the likelihood function for t given k of n matches.
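As a minimal sketch (not code from the slides), the IAM match probability and the binomial likelihood above can be computed directly:

```python
from math import comb

def q(t, u):
    """Probability two alleles with MRCA t generations ago are IBS
    under the infinite alleles model: q(t) = (1 - u)^(2t)."""
    return (1.0 - u) ** (2 * t)

def likelihood(t, k, n, u):
    """Binomial likelihood of k IBS matches out of n linked loci
    given TMRCA = t generations."""
    qt = q(t, u)
    return comb(n, k) * qt**k * (1.0 - qt) ** (n - k)

# Illustration: 9 of 10 loci match, microsatellite rate u = 0.002
print(likelihood(25, 9, 10, 0.002))
```

At t = 0 with a complete match (k = n) the likelihood is exactly one, which foreshadows the boundary problem discussed below.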

ML Analysis of TMRCA
It would seem that we now have all the pieces in hand for a likelihood analysis of TMRCA given the marker data (k of n matches). With τ = 2ut, the likelihood function is
L(t) ∝ q(t)^k [1 - q(t)]^(n-k), with q(t) ≈ e^(-2ut).
The MLE for t is the solution of ∂ ln L/∂t = 0. With p = k/n the fraction of matches, this gives q(t̂) = p.

In particular, the MLE for t becomes
t̂ = ln(p) / [2 ln(1-u)] ≈ -ln(p)/(2u).
Likewise, the precision of this estimator follows from the (negative) inverse of the second derivative of the log-likelihood function evaluated at the MLE:
Var(t̂) = (1-p) / [n p (2 ln(1-u))^2] ≈ (1-p)/(4 n u^2 p).
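The MLE and its large-sample variance can be sketched as follows; the variance here is the standard inverse-Fisher-information form for the binomial likelihood, reconstructed since the slide's formula was garbled in transcription:

```python
from math import log

def tmrca_mle(k, n, u):
    """MLE of TMRCA: solve (1-u)^(2t) = p, with p = k/n."""
    p = k / n
    return log(p) / (2.0 * log(1.0 - u))

def tmrca_var(k, n, u):
    """Large-sample variance from the Fisher information:
    Var(t_hat) = (1-p) / (n * p * (2*ln(1-u))^2) ~= (1-p)/(4*n*u^2*p)."""
    p = k / n
    return (1.0 - p) / (n * p * (2.0 * log(1.0 - u)) ** 2)

# Illustration: 9 of 10 matches, u = 0.002
print(tmrca_mle(9, 10, 0.002), tmrca_var(9, 10, 0.002) ** 0.5)
```

Note that at k = n this gives t̂ = 0 and Var = 0, which is exactly the pathology the slides turn to next.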

Likewise, we can easily (numerically) find one-LOD support intervals for t and hence construct approximate 95% confidence intervals for TMRCA. Finally, hypothesis testing, say H0: TMRCA = t0, is easily accomplished by comparing -2 times the natural log of the ratio of the likelihood at t = t0 to the likelihood at the MLE t = t̂. The resulting log-likelihood ratio LR is (asymptotically) distributed as a chi-square with one degree of freedom.

Trouble in Paradise
The ML machinery seems to have done its job, giving us an estimate, its approximate sampling error, approximate confidence intervals, and a scheme for hypothesis testing. Hence, all seems well. Problem: look at k = n (a complete match at all markers):
MLE(TMRCA) = 0 (independent of n)
Var(MLE) = 0 (ouch!)

With n = k, the value of the likelihood function is
L(t) = (1-u)^(2tn) ≈ e^(-2tun).
What about one-LOD support intervals (approximate 95% CIs)? L has its maximum value of one at the MLE, so any value of t that gives a likelihood of 0.1 or larger is in the one-LOD support interval. Solving, the one-LOD support interval runs from t = 0 to
t = (1/2n)[-ln(10)/ln(1-u)] ≈ (1/n)[ln(10)/(2u)].
For u = 0.002, the interval is (0, 575/n).
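The upper end of this support interval is a one-liner; as a sketch (the function name is mine, not the slides'):

```python
from math import log

def one_lod_upper(n, u):
    """Upper end of the one-LOD support interval when all n loci match:
    solve (1-u)^(2tn) = 0.1 for t."""
    return -log(10.0) / (2.0 * n * log(1.0 - u))

# For u = 0.002 this reproduces the slides' ~575/n rule
print(one_lod_upper(10, 0.002))
```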

With n = k, the likelihood function reduces to L(t) = (1-u)^(2tn) ≈ e^(-2tun).
[Figure: plots of L(t) for u = 0.002 and n = 5, 10, 20. MLE(t) = 0 for all values of n. The likelihood crosses 0.1 of its maximum value (1), the one-LOD limit, at roughly t = 115 for n = 5, t = 58 for n = 10, and t = 29 for n = 20.]

What about hypothesis testing? Again recall that for k = n the likelihood at t = t0 is L(t0) ≈ exp(-2 t0 u n). Hence, the LR test statistic for H0: t = t0 is just
LR = -2 ln[L(t0)/L(0)] = -2 ln[exp(-2 t0 u n)/1] = 4 t0 u n.
Thus the p-value for the test that TMRCA = t0 is just Pr(χ²(1) > 4 t0 u n).
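The chi-square tail probability with one degree of freedom has a closed form via the complementary error function, Pr(χ²(1) > x) = erfc(√(x/2)), so the test needs no tables; a sketch:

```python
from math import erfc, sqrt

def chi2_1_sf(x):
    """Upper-tail probability of a chi-square with 1 df:
    Pr(chi2_1 > x) = erfc(sqrt(x/2))."""
    return erfc(sqrt(x / 2.0))

def lr_pvalue(t0, u, n):
    """p-value for H0: TMRCA = t0 when all n loci match (LR = 4*t0*u*n)."""
    return chi2_1_sf(4.0 * t0 * u * n)

# Illustration: testing t0 = 100 with n = 10 fully matching loci, u = 0.002
print(lr_pvalue(100, 0.002, 10))
```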

The problem(s) with ML
The expressions developed for the sampling variance, approximate confidence intervals, and hypothesis testing are all large-sample approximations.
Problem 1: Here the sample size is the number of markers scored in the two individuals, which is not likely to be large.
Problem 2: These expressions are obtained by taking appropriate limits of the likelihood function. If the MLE sits exactly at a boundary of the admissible space of the likelihood surface, these limits may not formally exist, and hence the above approximations are incorrect.

The solution? “Ain’t Too Proud to Bayes” -- Brad Carlin

Why Go Bayesian?
An extension of likelihood is Bayesian statistics:
p(θ | x) = C × l(x | θ) × p(θ)
where p(θ | x) is the posterior distribution of θ given the data x, l(x | θ) is the likelihood function for θ given the data x, p(θ) is the prior distribution for θ, and C is the constant that makes the posterior integrate to one. Instead of simply producing a point estimate (e.g., the MLE), the goal is to estimate the entire distribution of the unknown parameter θ given the data x.
Why Bayesian? Exact for any sample size; marginal posteriors; efficient use of any prior information; MCMC (such as Gibbs sampling) methods.

The Prior on TMRCA
The first step in any Bayesian analysis is the choice of an appropriate prior distribution p(t), reflecting our beliefs about the distribution of TMRCA in the absence of any marker data. The standard approach is a flat or uninformative prior, with p(t) constant over the admissible range of the parameter; this can cause problems if the likelihood function integrates to infinity (the posterior is then improper). In our case, population-genetic theory provides the prior: under very general settings, the TMRCA for a pair of individuals follows a geometric distribution.

In particular, for a haploid gene, TMRCA follows a geometric distribution with parameter λ = 1/Ne. Hence, our prior is just
p(t) = λ(1-λ)^t ≈ λ e^(-λt), with λ = 1/Ne,
so we can use an exponential prior with hyperparameter (the parameter fully characterizing the distribution) λ = 1/Ne. The posterior thus becomes the previous likelihood function (ignoring constants that cancel when we compute the normalizing factor C) times the prior:
p(t | k) ∝ q(t)^k [1 - q(t)]^(n-k) e^(-λt).
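A simple grid approximation makes this posterior concrete; this is an illustrative sketch (the discretization is mine), normalizing the likelihood-times-prior numerically:

```python
from math import exp

def unnorm_posterior(t, k, n, u, lam):
    """Unnormalized posterior: binomial likelihood with q = e^(-2ut)
    times the exponential prior e^(-lam*t) (constants dropped)."""
    q = exp(-2.0 * u * t)
    return q**k * (1.0 - q) ** (n - k) * exp(-lam * t)

def posterior(tmax, k, n, u, lam, step=0.1):
    """Normalized posterior on a grid over [0, tmax]."""
    ts = [i * step for i in range(int(tmax / step) + 1)]
    w = [unnorm_posterior(t, k, n, u, lam) for t in ts]
    total = sum(w) * step
    return ts, [v / total for v in w]

# Illustration: 9 of 10 matches, u = 0.002, prior with Ne = 5000
ts, post = posterior(2000.0, 9, 10, 0.002, 1.0 / 5000.0)
```

The posterior mode lands near the MLE (about t = 26 here), but unlike the ML machinery the full distribution is exact for any number of markers.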

The Normalizing Constant
The constant is C = 1/I, where
I = ∫₀^∞ l(k | t) p(t) dt
ensures that the posterior distribution integrates to one, and hence is formally a probability distribution.

What is the effect of the hyperparameter? If 2uk >> λ, there is essentially no dependence on the actual value of λ chosen. Hence, if 2 Ne u k >> 1, there is essentially no dependence on the hyperparameter assumptions of the prior. For a typical microsatellite rate of u = 0.002, this is just Ne k >> 250, which is a very weak assumption. For example, with k = 10 matches, Ne >> 25; even with only one match (k = 1), we just require Ne >> 250.

Closed-form Solutions for the Posterior Distribution
Complete analytic solutions for the posterior can be obtained by a series (binomial) expansion of the [1 - e^(-τ)]^(n-k) term, with τ = 2ut:
[1 - e^(-2ut)]^(n-k) = Σ_j C(n-k, j) (-1)^j e^(-2utj), summed over j = 0, …, n-k.
Each term in the resulting sum is just a·e^(bt), which is easily integrated.

With the assumption of a flat prior (λ = 0), this reduces to a simpler closed form.

Hence, the complete analytic solution of the posterior follows. Suppose k = n (no mismatches). In this case, the posterior is simply an exponential distribution with parameter 2un + λ:
p(t | k = n) = (2un + λ) e^(-(2un+λ)t).

Analysis of the n = k Case
Mean TMRCA and its variance:
E[t | k = n] = 1/(2un + λ), Var(t | k = n) = 1/(2un + λ)².
Cumulative probability:
Pr(t ≤ T) = 1 - e^(-(2un+λ)T).
In particular, the time T_α satisfying Pr(t < T_α) = α is
T_α = -ln(1 - α)/(2un + λ).
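These closed forms for the exponential posterior are trivial to evaluate; a sketch (function names are mine):

```python
from math import log

def posterior_mean(n, u, lam):
    """Posterior mean TMRCA when all n loci match: exponential
    posterior with rate 2*u*n + lam, so the mean is 1/(2*u*n + lam)."""
    return 1.0 / (2.0 * u * n + lam)

def credible_upper(alpha, n, u, lam):
    """T_alpha with Pr(t < T_alpha) = alpha: -ln(1-alpha)/(2*u*n + lam)."""
    return -log(1.0 - alpha) / (2.0 * u * n + lam)

# Illustration: n = 10 fully matching loci, u = 0.002, flat prior (lam = 0)
print(posterior_mean(10, 0.002, 0.0), credible_upper(0.95, 10, 0.002, 0.0))
```

With a flat prior this reproduces the 95% upper bound of roughly 749/n discussed on the next slide.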

For a flat prior (λ = 0), the 95% (one-sided) credible interval is thus given by -ln(0.05)/(2nu) ≈ 1.50/(nu). Hence, under a Bayesian analysis for u = 0.002, the 95% upper credible bound is ≈ 749/n. Recall that the one-LOD support interval (approximate 95% CI) under the ML analysis was ≈ 575/n. The ML solution's asymptotic approximation significantly underestimates the true interval relative to the exact analysis under the full Bayesian model.

Why the difference? Under ML, we plot the likelihood function and look for the 0.1 value. Under a Bayesian analysis, we look at the posterior probability distribution (the likelihood rescaled to integrate to one) and find the values that enclose an area of 0.95.
[Figure: Pr(TMRCA < t) versus t. For n = 20, the area to the left of t = 38 is 0.95 (t_0.95 = 38); for n = 10, the area to the left of t = 75 is 0.95 (t_0.95 = 75).]

[Figure: sample posterior distributions p(t | k) as a function of the time t to MRCA, for n = 10, 20, and 100 markers.]

Key points
By using an appropriate number of markers we can get accurate estimates of TMRCA for even just a few generations; a modest number of markers will do. By using markers on a non-recombining chromosomal section, we can estimate TMRCA over much, much longer time scales than with unlinked autosomal markers. Hence, we have a fairly large window of resolution for TMRCA when using a modest number of completely linked markers.

Extensions I: Different Mutation Rates
Let marker locus k have mutation rate u_k. Code the observations as x_k = 1 for a match and x_k = 0 otherwise. With q_k(t) ≈ e^(-2 u_k t), the posterior becomes
p(t | x) ∝ e^(-λt) Π_k q_k(t)^(x_k) [1 - q_k(t)]^(1 - x_k).
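The per-locus likelihood is a direct product; as an illustrative sketch (the rate values below are hypothetical, not from the slides):

```python
from math import exp

def likelihood_hetero(t, x, u):
    """Likelihood over loci with locus-specific rates u[k];
    x[k] = 1 for a match, 0 otherwise; q_k(t) = exp(-2*u[k]*t)."""
    L = 1.0
    for xk, uk in zip(x, u):
        qk = exp(-2.0 * uk * t)
        L *= qk if xk == 1 else (1.0 - qk)
    return L

# Four matches and one mismatch, with illustrative per-locus rates
print(likelihood_hetero(50.0, [1, 1, 1, 1, 0],
                        [0.001, 0.002, 0.002, 0.003, 0.002]))
```

Multiplying by the exponential prior e^(-λt) and renormalizing gives the posterior exactly as in the equal-rate case.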

Stepwise Mutation Model (SMM)
The infinite alleles model (IAM) is not especially realistic for microsatellite data unless the fraction of matches is very high. Microsatellite allelic variants are scored by their number of repeat units, so two "matching" alleles can actually hide multiple mutations (and hence more time to the MRCA). For example, one mutation on each lineage that leaves both alleles at the same repeat count would be scored as a match under the IAM, and hence as no mutations, when in reality there are two.

Y chromosome microsatellite mutation rates - II
The SMM is an attempt to correct for multiple hits by fully accounting for the mutational structure. It gives a good fit to array sizes in natural populations when assuming the symmetric single-step model: equal probability of a one-step move up or down. In direct pedigree studies of (Y chromosome) microsatellites, 35 of 37 detected mutations were single-step; the other 2 were two-step.

SMM0 model: match/no match under the SMM
The simplest implementation of the SMM is to simply replace the match probabilities q(t) under the IAM with those for the SMM, still coding the marker loci as match/no match. We refer to this as the SMM0 model.

Formally, the SMM assumes the following transition probabilities: each mutation moves the allele up one repeat with probability 1/2 or down one repeat with probability 1/2. Note that two alleles can match only if they have experienced an even number of mutations in total between them. In such cases, the match probability given 2m total mutations is C(2m, m)(1/2)^(2m).

Summing this match probability over the (Poisson-distributed) number of mutations, we obtain
q(t) = e^(-2ut) I₀(2ut),
where I₀ is the zero-order modified Type I Bessel function.
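I₀ has a simple series expansion, so the SMM match probability can be sketched without special-function libraries (this is my illustrative implementation, not the slides' code):

```python
from math import exp, factorial

def bessel_i0(x, terms=60):
    """Zero-order modified Bessel function of the first kind,
    via the series I0(x) = sum_m (x/2)^(2m) / (m!)^2."""
    return sum((x / 2.0) ** (2 * m) / factorial(m) ** 2 for m in range(terms))

def q_smm(t, u):
    """Match probability under the symmetric stepwise model:
    q(t) = e^(-2ut) * I0(2ut)."""
    tau = 2.0 * u * t
    return exp(-tau) * bessel_i0(tau)

# Illustration: t = 100 generations, u = 0.002
print(q_smm(100, 0.002))
```

Since I₀(x) ≥ 1, the SMM match probability always exceeds the IAM value e^(-2ut): hidden back-mutations make matches more likely, and hence matching alleles less informative about a recent MRCA.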

Under the SMM, the prior hyperparameter λ can now become important. This is the case when the number n of markers is small and/or k/n is not very close to one. Why? Under the prior, TMRCA is forced toward a geometric with parameter 1/Ne. Under the IAM, for most values this is still much more time than the likelihood function predicts from the marker data. Under the SMM, the likelihood alone predicts a much longer time, so the forcing effect of the geometric prior can come into play.

[Figure: Pr(TMRCA < t) versus t for n = 5, k = 3, u = 0.02, comparing the IAM (flat prior and Ne = 5000 give the same curve), SMM0 with Ne = 5000, SMM0 with a flat prior, and the prior alone with Ne = 5000.]

An Exact Treatment: SMME
With a little work we can show that the probability that two sites differ by j steps is just
Pr(j | t) = e^(-2ut) I_j(2ut),
where I_j is the jth-order modified Type I Bessel function. The resulting likelihood is
l(t) = Π_j [Pr(j | t)]^(n_j),
where n_j is the number of sites that differ by j (observed) steps.

With this likelihood, the resulting general posterior under the exact SMM model (SMME) becomes
p(t | n₀, n₁, …) ∝ e^(-λt) e^(-2unt) Π_j [I_j(2ut)]^(n_j),
where n₀ is the number of exact matches and n_j the number of j-step differences.
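The SMME likelihood can be sketched with a series form of I_j (my illustrative implementation; the `counts` layout is an assumption about how one might encode the data):

```python
from math import exp, factorial

def bessel_i(j, x, terms=60):
    """Modified Bessel function I_j(x) of the first kind, via the
    series I_j(x) = sum_m (x/2)^(2m+j) / (m! * (m+j)!)."""
    return sum((x / 2.0) ** (2 * m + j) / (factorial(m) * factorial(m + j))
               for m in range(terms))

def smme_likelihood(t, counts, u):
    """counts[j] = number of loci differing by j repeat steps;
    Pr(j | t) = e^(-2ut) * I_j(2ut)."""
    tau = 2.0 * u * t
    L = 1.0
    for j, nj in counts.items():
        L *= (exp(-tau) * bessel_i(j, tau)) ** nj
    return L

# Lemba/Cohen-style data: 4 exact matches, one 1-step, one 2-step difference
print(smme_likelihood(200.0, {0: 4, 1: 1, 2: 1}, 0.002))
```

Multiplying by the exponential prior and normalizing numerically (as in the grid sketch earlier) gives the SMME posterior.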

Example
Consider comparing haplotypes 1 and 3 from Thomas et al.'s (2000) study of the Lemba and Cohen Y chromosome modal haplotypes. Here six markers were used: four match exactly, one differs by one repeat, and the other by two repeats. Hence n = 6, k = 4 for the IAM and SMM0 models, while n₀ = 4, n₁ = 1, n₂ = 1, n = 6 under the SMME model. Assume Hammer's value of Ne = 5000 for the prior.

[Figure: posterior distributions P(t | markers) of the time to MRCA under the IAM, SMM0, and SMME models.]
[Table: TMRCA for the Lemba and Cohen Y, reporting the mean, median, and 2.5%/97.5% quantiles under each of the IAM, SMM0, and SMME models.]

[Figure: cumulative probabilities Pr(TMRCA < t) versus the time t to MRCA under the IAM, SMM0, and SMME models.]

Technology Transfer
Family Tree DNA (ftDNA) provides Y chromosome marker kits for genealogical studies. So far, ftDNA has processed over 80,000 such kits, amounting to a rough gross of around 8 million dollars. The expressions developed above have direct commercial applications.

Forensic applications of the Y
A not uncommon situation is that the only DNA available is from fingernail scrapings. The result is a mixture in which the victim's DNA often overwhelms the DNA of the perpetrator. Result: only a modest match probability, as many autosomal markers cannot be detected. One solution: use Y chromosome markers, which are easily amplified over the (female) background.

Problem: How do we combine a Y match with an autosomal match? The NRC 1996 recommendations (for autosomal loci) use the product rule across markers with a population-substructure correction within markers, giving Prob(Y match) × Prob(autosomal match). Problem: Y markers may provide information about population-substructure membership. For example, a particular haplotype may be restricted to a certain subpopulation, e.g., Native Americans.

Correcting for Y substructure
Let y denote the observed Y haplotype and A the multilocus autosomal marker genotype, so that P(y, A) = P(A | y) × P(y). Simple approach: knowledge of y indicates membership in a particular subpopulation, and P(A) is computed using allele frequencies for that subpopulation. Suggestion: multiply freq(y) by the maximum of freq(A) over subgroups.

A more precise accounting
Suppose two individuals share the same Y haplotype. What is their average coancestry θ? Balding and Nichols give expressions for autosomal single-locus genotype frequencies given that the population shows structure with coancestry θ. Second approach: compute θ from haplotype matching, then use this θ value in the Balding-Nichols expressions to compute (single-locus) autosomal frequencies.

With a MRCA of t generations, θ = (1/2)^(2t+1). For a match at all n markers with a prior of λ = 1/Ne, the posterior distribution is the exponential posterior derived earlier, so E[θ] follows by averaging (1/2)^(2t+1) over that posterior.
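Since (1/2)^(2t+1) = (1/2)e^(-2t ln 2), the posterior expectation has a closed form under the exponential posterior; this derivation sketch is mine (the slides' own numeric E[θ] values were lost in transcription):

```python
from math import log

def expected_theta(n, u, lam):
    """E[theta] for a match at all n markers: theta(t) = (1/2)^(2t+1)
    averaged over the exponential posterior with rate r = 2*u*n + lam
    gives E[theta] = (r/2) / (r + 2*ln 2)."""
    r = 2.0 * u * n + lam
    return 0.5 * r / (r + 2.0 * log(2.0))

# Illustration: n = 11 markers, u = 0.002, the three prior strengths
for lam in (1 / 5000, 1 / 500, 1 / 50):
    print(lam, expected_theta(11, 0.002, lam))
```

For these parameter values E[θ] stays small, consistent with the slide's conclusion that the Balding-Nichols frequencies are essentially Hardy-Weinberg unless allele frequencies are very low.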

The typical situation is one where we can exclude father-son and paternal half-sib relationships, so t ≥ 2. Typical values: n = 11 markers, λ = 1/500. For λ = 1/5000, 1/500, and 1/50, the resulting values of E[θ] are all small. For these values, unless p_i < 0.01, the Balding-Nichols expressions are essentially Hardy-Weinberg.

Formal procedure
Estimate P(y) from a database (counting methods, Bayesian estimators). Compute multilocus autosomal frequencies for each major ethnic group as the product of the single-locus genotype frequencies computed using group-specific allele frequencies and the θ correction. The conservative estimate is P(y, A) = P(y) × max P(A).