Presentation is loading. Please wait.

Presentation is loading. Please wait.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.

Similar presentations


Presentation on theme: "CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological."— Presentation transcript:

1 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical University of Denmark (DTU)

2 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS What is a model? Mathematical models are: IncomprehensibleIncomprehensible UselessUseless No fun at allNo fun at all

3 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS What is a model? Mathematical models are: IncomprehensibleIncomprehensible UselessUseless No fun at allNo fun at all

4 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS What is a model? Model = hypothesis !!!Model = hypothesis !!! Hypothesis (as used in most biological research):Hypothesis (as used in most biological research): –Precisely stated, but qualitative –Allows you to make qualitative predictions Arithmetic model:Arithmetic model: –Mathematically explicit (parameters) –Allows you to make quantitative predictions …

5 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS The Scientific Method Observation of data Model of how system works Prediction(s) about system behavior (simulation)

6 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS The maximum likelihood approach I Starting point:Starting point: You have some observed data and a probabilistic model for how the observed data was produced Having a probabilistic model of a process means you are able to compute the probability of any possible outcome (given a set of specific values for the model parameters). Example:Example: –Data: result of tossing coin 10 times - 7 heads, 3 tails –Model: coin has probability p for heads, 1-p for tails. The probability of observing h heads among n tosses is: Goal:Goal: You want to find the best estimate of the (unknown) parameter values based on the observations. (here the only parameter is “p”) P(h heads) = p h (1-p) n-h hnhn   

7 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS The maximum likelihood approach II  Likelihood = Probability (Data | Model)  Maximum likelihood: Best estimate is the set of parameter values which gives the highest possible likelihood.

8 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Maximum likelihood: coin tossing example Data: result of tossing coin 10 times - 7 heads, 3 tails Model: coin has probability p for heads, 1-p for tails. P(h heads) = p h (1-p) 10-h h 10   

9 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling applied to phylogeny Observed data: multiple alignment of sequencesObserved data: multiple alignment of sequences H.sapiens globinA G G G A T T C A H.sapiens globinA G G G A T T C A M.musculus globinA C G G T T T - A M.musculus globinA C G G T T T - A R.rattus globinA C G G A T T - A R.rattus globinA C G G A T T - A Probabilistic model parameters (simplest case):Probabilistic model parameters (simplest case): –Nucleotide frequencies:  A,  C,  G,  T –Tree topology and branch lengths –Nucleotide-nucleotide substitution rates (or substitution probabilities):

10 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Other models of nucleotide substitution

11 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS More elaborate models of evolution Codon-codon substitution ratesCodon-codon substitution rates (64 x 64 matrix of codon substitution rates) Different mutation rates at different sites in the geneDifferent mutation rates at different sites in the gene (the “gamma-distribution” of mutation rates) Molecular clocksMolecular clocks (same mutation rate on all branches of the tree). Different substitution matrices on different branches of the treeDifferent substitution matrices on different branches of the tree (e.g., strong selection on branch leading to specific group of animals) Etc., etc.Etc., etc.

12 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Different rates at different sites: the gamma distribution

13 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Computing the probability of one column in an alignment given tree topology and other parameters A C G G A T T C A A C G G T T T - A A A G G A T T - A A G G G T T T - A A A C CA G Columns in alignment contain homologous nucleotides Assume tree topology, branch lengths, and other parameters are given. For now, assume ancestral states were A and A (we’ll get to the full computation on next slide). Start computation at any internal or external node. Pr =  C P CA (t 1 ) P AC (t 2 ) P AA (t 3 ) P AG (t 4 ) P AA (t 5 ) t4t4 t5t5 t3t3 t1t1 t2t2

14 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Computing the probability of an entire alignment given tree topology and other parameters Probability must be summed over all possible combinations of ancestral nucleotides. (Here we have two internal nodes giving 16 possible combinations) Probability of individual columns are multiplied to give the overall probability of the alignment, i.e., the likelihood of the model. Often the log of the probability is used (log likelihood)

15 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Tree 1Tree 2 Node 1Node 2LikelihoodNode 1Node 2Likelihood AA0.0011549AA AC0.0012411AC0.0000247 AG0.0013101AG0.0001564 AT0.0000690AT0.0000082 CA0.0000029CA0.0001482 CC0.0001593CC CG0.0000280CG0.0001681 CT0.0000015CT0.0000088 GA0.0000039GA0.0000329 GC0.0000354GC0.0000059 GG0.0001774GG GT0.0000020GT TA0.0000010TA0.0000082 TC0.0000088TC0.0000015 TG0.0000093TG TT0.0000080TT Sum0.0042126Sum0.0020738

16 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Tree 1Tree 2 Node 1Node 2LikelihoodNode 1Node 2Likelihood AA0.0011549AA AC0.0012411AC0.0000247 AG0.0013101AG0.0001564 AT0.0000690AT0.0000082 CA0.0000029CA0.0001482 CC0.0001593CC CG0.0000280CG0.0001681 CT0.0000015CT0.0000088 GA0.0000039GA0.0000329 GC0.0000354GC0.0000059 GG0.0001774GG GT0.0000020GT TA0.0000010TA0.0000082 TC0.0000088TC0.0000015 TG0.0000093TG TT0.0000080TT Sum0.0042126Sum0.0020738

17 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Tree 1 Node 1Node 2LikelihoodSum AA0.0011549 0.0037751 AC0.0012411 AG0.0013101 AT0.0000690 CA0.0000029 0.0001917 CC0.0001593 CG0.0000280 CT0.0000015 GA0.0000039 0.0002187 GC0.0000354 GG0.0001774 GT0.0000020 TA0.0000010 0.000027 TC0.0000088 TG0.0000093 TT0.0000080 Sum0.0042126 Ancestral reconstruction: Node1 = A

18 CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Maximum likelihood phylogeny Data:Data: –sequence alignment Model parameters:Model parameters: –nucleotide frequencies, nucleotide substitution rates, tree topology, branch lengths. Choose random initial values for all parameters, compute likelihood Choose random initial values for all parameters, compute likelihood Change parameter values slightly in a direction so likelihood improves Change parameter values slightly in a direction so likelihood improves Repeat until maximum found Repeat until maximum found Results: Results: (1) ML estimate of tree topology (2) ML estimate of branch lengths (3) ML estimate of other model parameters (4) Measure of how well model fits data (likelihood). (likelihood).


Download ppt "CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological."

Similar presentations


Ads by Google