Presentation is loading. Please wait.

Presentation is loading. Please wait.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayesian Inference Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.

Similar presentations


Presentation on theme: "CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayesian Inference Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical."— Presentation transcript:

1 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayesian Inference Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical University of Denmark (DTU)

2 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayes Theorem Reverend Thomas Bayes (1702-1761) P(D|M K ): Probability of data given model k = likelihood P(M K ): Prior probability of model P(D): Essentially a normalizing constant so posterior will sum to one P(M K |D): Posterior probability of model k

3 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayesians vs. Frequentists Meaning of probability:Meaning of probability: –Frequentist: long-run frequency of event in repeatable experiment –Bayesian: degree of belief, way of quantifying uncertainty Finding probabilistic models from empirical data:Finding probabilistic models from empirical data: –Frequentist: parameters are fixed constants whose true values we are trying to find good (point) estimates for. trying to find good (point) estimates for. –Bayesian: uncertainty concerning parameter values expressed by means of probability distribution over possible parameter values of probability distribution over possible parameter values

4 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayes theorem: example We have three cookie jars: #1 P(chocolate) = 0.25 #2 P(chocolate) = 0.50 #3 P(chocolate) = 0.75 Bill (blindfolded) picks random bowl and random cookie. He gets a plain (non-chocolate) cookie. What is the probability that he picked bowl #1

5 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayes theorem: example II Investigated hypotheses (models): M 1 : Bowl #1 was chosen M 2 : Bowl #2 was chosen M 3 : Bowl #3 was chosen Prior probabilities: P(M 1 ) = 0.33 P(M 2 ) = 0.33 P(M 3 ) = 0.33 Likelihoods: P(plain|M 1 ) = 0.75 P(chocolate|M 1 ) = 0.25 P(plain|M 2 ) = 0.50 P(chocolate|M 2 ) = 0.50 P(plain|M 3 ) = 0.25 P(chocolate|M 3 ) = 0.75

6 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayes theorem: example III P(M 1 | plain) = 0.5 P(M 2 | plain) = 0.33 P(M 3 | plain) = 0.17 Likelihoods: P(plain|M 1 ) = 0.75 P(chocolate|M 1 ) = 0.25 P(plain|M 2 ) = 0.50 P(chocolate|M 2 ) = 0.50 P(plain|M 3 ) = 0.25 P(chocolate|M 3 ) = 0.75 Prior probabilities: P(M 1 ) = 0.33 P(M 2 ) = 0.33 P(M 3 ) = 0.33

7 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS MCMC: Markov chain Monte Carlo Problem: for complicated models parameter space is enormous.  Not easy/possible to find posterior distribution analytically Solution: MCMC = Markov chain Monte Carlo Start in random position on probability landscape. Attempt step of random length in random direction. (a) If move ends higher up: accept move (b) If move ends below: accept move with probability P (accept) = P LOW /P HIGH Note parameter values for accepted moves in file. After many, many repetitions points will be sampled in proportion to the height of the probability landscape

8 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS MCMCMC: Metropolis-coupled Markov Chain Monte Carlo Problem: If there are multiple peaks in the probability landscape, then MCMC may get stuck on one of them Solution: Metropolis-coupled Markov Chain Monte Carlo = MCMCMC = MC 3 MC 3 essential features: Run several Markov chains simultaneouslyRun several Markov chains simultaneously One chain “cold”: this chain performs MCMC samplingOne chain “cold”: this chain performs MCMC sampling Rest of chains are “heated”: move faster across valleysRest of chains are “heated”: move faster across valleys Each turn the cold and warm chains may swap position (swap probability is proportional to ratio between heights)Each turn the cold and warm chains may swap position (swap probability is proportional to ratio between heights)  More peaks will be visited More chains means better chance of visiting all important peaks, but each additional chain increases run-time

9 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS MCMCMC for inference of phylogeny Result of run: Substitution parameters (a)Tree topologies

10 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Posterior probability distributions of substitution parameters

11 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Posterior Probability Distribution over Trees MAP (maximum a posteriori) estimate of phylogeny: tree topology occurring most often in MCMCMC outputMAP (maximum a posteriori) estimate of phylogeny: tree topology occurring most often in MCMCMC output Clade support: posterior probability of group = frequency of clade in sampled trees.Clade support: posterior probability of group = frequency of clade in sampled trees. 95% credible set of trees: order trees from highest to lowest posterior probability, then add trees with highest probability until the cumulative posterior probability is 0.9595% credible set of trees: order trees from highest to lowest posterior probability, then add trees with highest probability until the cumulative posterior probability is 0.95


Download ppt "CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayesian Inference Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical."

Similar presentations


Ads by Google