Phylogenetic Estimation using Maximum Likelihood. By: Jimin Zhu, Xin Gong, Sravanti Polsani, Rama Sharma, Shlomit Klopman.


1 Phylogenetic Estimation using Maximum Likelihood. By: Jimin Zhu, Xin Gong, Sravanti Polsani, Rama Sharma, Shlomit Klopman

2 The Scope of the Presentation
Introduction
Maximum Likelihood and Coin Tossing
The Phylogenetic Tree
Maximum Likelihood and DNA Substitution
Advantages and Disadvantages of Maximum Likelihood

3 Introduction
Phylogeny: the study of relationships between life forms.
Phylogenetics is part of the field of taxonomy and systematics.
Phylogenetics received a huge push forward thanks to modern computers.
Various phylogenetic methods are used to explain the evolutionary process, and they often give contradicting results!

4 Introduction (cont.)
Scientists agree that a correct species lineage should be determined using statistics.
Maximum Likelihood is the method of choice for establishing the most realistic phylogenetic tree for a given data set.
The Maximum Likelihood method was introduced in 1922 by R.A. Fisher, an English statistician.

5 Maximum Likelihood in a Nutshell
The method depends on:
–A complete data set
–A probabilistic model that describes the data
–Explicitly expressing the likelihood function
The likelihood of a data set is the probability of obtaining it, given the chosen probability distribution model.
We seek the values of the parameters that maximize the sample likelihood.

6 Maximum Likelihood approach using Coin Tossing Experiment
Find the parameter value(s) that make the observed data most likely. Basically, choose the value of the parameter that maximizes the probability of observing the data.
Probability: knowing parameters → prediction of outcome
Likelihood: observation of data → estimation of parameters
Parameters describe the characteristics of a population. Their values are estimated from samples collected from that population.

7 Simple Coin Tossing Experiment
Binomial probability distribution
The probability of observing h heads out of n tosses can be described as:
$\Pr[h \mid p, n] = \frac{n!}{h!\,(n-h)!}\, p^{h} (1-p)^{n-h}$
where p is the probability of heads and (1-p) is the probability of tails.

8 Simple Coin Tossing Experiment
Suppose I told you we tossed a coin 10 times and got 4 heads and 6 tails; then the probability would be
$P(4\ \text{heads}, 6\ \text{tails}) = \frac{10!}{4!\,6!}\, p^{4} (1-p)^{6}$
The whole notion of maximum likelihood estimation is that we choose p to be the value that makes the probability of getting our set of observations the largest possible. The factor $\frac{10!}{4!\,6!}$ does not depend on p, so this means maximizing $p^{4}(1-p)^{6}$. Our likelihood function is therefore:
$L(p) = p^{4} (1-p)^{6}$

9 Two ways to find the MLE
1. Take the first derivative of the likelihood function with respect to each parameter, set the resulting equations equal to 0, and solve for the parameter estimates.
Taking the log of the likelihood $L(p) = p^{h}(1-p)^{n-h}$:
$\log L(p) = h \log p + (n-h) \log(1-p)$
Taking the first derivative with respect to p and setting it to zero:
$\frac{h}{p} - \frac{n-h}{1-p} = 0$
Solving for p, we get $p = h/n$. This value maximizes the likelihood function and is the MLE.

10 Find the maximum using numeric search procedures
2. Plug different values of p into the probability model and calculate the likelihood.
Let's take a sample with n = 100, h = 56.
Imagine that p was 0.5. Plugging this value into our probability model gives:
$L(p = 0.5 \mid \text{data}) = \frac{100!}{56!\,44!}\, 0.5^{56}\, 0.5^{44} = 0.0389$
But what if p was 0.52 instead?
$L(p = 0.52 \mid \text{data}) = \frac{100!}{56!\,44!}\, 0.52^{56}\, 0.48^{44} = 0.0581$

11 So from this we can conclude that p is more likely to be 0.52 than 0.5. We can tabulate the likelihood for different parameter values to find the maximum likelihood estimate of p:

p       L         p       L
------  -------   ------  -------
0.48    0.0222    0.50    0.0389
0.52    0.0581    0.54    0.0739
0.56    0.0801    0.58    0.0738
0.60    0.0576    0.62    0.0378
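The tabulation on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `binomial_likelihood` and the grid construction are illustrative choices, and the grid simply mirrors the p values listed in the table.

```python
from math import comb

def binomial_likelihood(p, n, h):
    """Binomial probability of h heads in n tosses when P(heads) = p."""
    return comb(n, h) * p**h * (1 - p) ** (n - h)

n, h = 100, 56

# Same p values as the slide's table: 0.48, 0.50, ..., 0.62
grid = [round(0.48 + 0.02 * i, 2) for i in range(8)]
table = {p: binomial_likelihood(p, n, h) for p in grid}

for p, like in table.items():
    print(f"{p:.2f}  {like:.4f}")

# The grid point with the largest likelihood is the numeric MLE
p_hat = max(table, key=table.get)
print("grid maximum at p =", p_hat)
```

A finer grid, or a general-purpose optimizer, would be the natural next step for models where no closed-form estimate is available.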

12 The maximum likelihood estimate for p seems to be exactly at 0.56, which is h/n = 56/100.

13 MLE: Sample Graphs (using Mathematica)

14 Simple Coin Tossing Experiment
The best estimate for p from any one sample is clearly going to be the proportion of heads observed in that sample.
For an example this simple, the full MLE approach is overkill for evaluating p.
But not all problems are this simple! The more complex the model and the greater the number of parameters, the more difficult it becomes to make even reasonable guesses at the MLEs.

15 Phylogenetic Tree
A phylogenetic tree is a data structure, characterized by:
–its topology (form)
–its branch lengths
It stores information regarding the relationship of several species or sequences.
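One way to picture "topology plus branch lengths" as a data structure is a recursive node type. The sketch below is only an illustration: the class name `Node`, its fields, and the example branch lengths are assumptions, and the Newick text format is used here merely as a convenient way to print the tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """A tree node: the nesting of children is the topology,
    branch_length is the length of the branch leading to this node."""
    name: Optional[str] = None
    branch_length: float = 0.0
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> List[str]:
        """Names of all leaves below this node, left to right."""
        if self.is_leaf():
            return [self.name]
        names = []
        for child in self.children:
            names.extend(child.leaves())
        return names

    def to_newick(self) -> str:
        """Serialize the subtree in Newick style (without the final ';')."""
        label = self.name or ""
        if self.children:
            inner = ",".join(child.to_newick() for child in self.children)
            label = f"({inner}){label}"
        return f"{label}:{self.branch_length}"

# Hypothetical four-species tree; the branch lengths are made up.
tree = Node(children=[
    Node(branch_length=0.05, children=[Node("a", 0.1), Node("b", 0.2)]),
    Node(branch_length=0.05, children=[Node("c", 0.3), Node("d", 0.4)]),
])
print(tree.leaves())
print(tree.to_newick() + ";")
```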

16 Types of Phylogenetic Trees
Rooted tree: has an assumed ancestral state; "d" is the root species.
Unrooted tree: no implicit "directionality", but is a measure of similarity between species.
[Figure: a rooted and an unrooted tree on species a, b, c, d, with leaf, branch, and root labeled.]

17 Molecular phylogenetic methods use a given set of aligned sequences to construct a phylogenetic tree.
(1) A G G C U C C A A
(2) A G G U U C G A A
(3) A G C C C A G A A
(4) A U U U C G G A A
[Figure: a tree whose leaves are sequence 1, sequence 2, sequence 3, sequence 4.]
There are several ways to construct phylogenetic trees. The Maximum Likelihood method picks the tree that best represents the true tree, i.e. the tree under which the observed sequences are most probable.

18 The Maximum Likelihood Approach
Sites (alignment columns) are numbered 1 … j … N:
(1) A G G C T C C A A … A
(2) A G G T T C G A A … A
(3) A G C C C A G A A … A
(4) A T T T C G G A A … C
1. Assumes that the sites are independent of one another, so each column of the alignment can be treated separately.

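19 The Maximum Likelihood Approach (cont.)
2. The log-likelihood is computed for a given topology by using a particular probability model (binomial, multinomial, Poisson, …).
[Figure: a tree with leaves C, A, C, G and internal nodes x and y; panels a), b), c) show different states assigned to x and y.]
The likelihood of site j is a sum of probabilities, one for each assignment of states to the internal nodes:
L(j) = Prob(…) + … + Prob(…)
Because the sites are independent, the log-likelihoods of the N sites add up:
$\ln L = \ln L(1) + \ln L(2) + \dots + \ln L(j) + \dots + \ln L(N) = \sum_{i=1}^{N} \ln L(i)$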
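The per-site sum and the sum of log-likelihoods over sites can be sketched in Python. Several things here are assumptions made for illustration and are not stated on the slide: the topology ((1,2),(3,4)) with internal nodes x and y, a single shared branch length `d`, equal base frequencies, and the Jukes-Cantor (JC69) substitution probabilities. The alignment is the first nine columns from slide 18.

```python
from itertools import product
from math import exp, log

BASES = "ACGT"

def jc69_prob(i, j, d):
    """P(base i becomes base j) along a branch of length d
    (expected substitutions per site) under the JC69 model."""
    e = exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def site_likelihood(s1, s2, s3, s4, d=0.1):
    """Likelihood of one alignment column on the tree ((1,2)x,(3,4)y):
    sum over all states of the internal nodes x and y."""
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (0.25                                      # frequency of state x
                  * jc69_prob(x, s1, d) * jc69_prob(x, s2, d)
                  * jc69_prob(x, y, d)                      # internal branch x-y
                  * jc69_prob(y, s3, d) * jc69_prob(y, s4, d))
    return total

def log_likelihood(alignment, d=0.1):
    """ln L = sum over sites of ln L(site), using site independence."""
    return sum(log(site_likelihood(*column, d=d)) for column in zip(*alignment))

# First nine columns of the slide-18 alignment
alignment = ["AGGCTCCAA", "AGGTTCGAA", "AGCCCAGAA", "ATTTCGGAA"]
print(log_likelihood(alignment))
```

A full analysis would also optimize every branch length and repeat the calculation for each candidate topology; the sketch fixes both to keep the per-site sum visible.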

20 The Maximum Likelihood Approach (cont.)
3. After the procedure is done for all possible topologies, the topology that shows the highest likelihood is chosen as the true (realistic) tree.
How many topologies do we have to go through for n sequences?
#Rooted trees = [formula image]
#Unrooted trees = [formula image]
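The formulas on this slide did not survive in the transcript. The sketch below uses the standard textbook counts for strictly bifurcating trees, (2n-5)!! unrooted and (2n-3)!! rooted topologies for n sequences; treating these as the slide's formulas is an assumption. The function names are illustrative.

```python
def double_factorial(k):
    """k!! = k * (k-2) * (k-4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def unrooted_topologies(n):
    """Number of unrooted bifurcating topologies for n >= 3 sequences."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    return double_factorial(2 * n - 5)

def rooted_topologies(n):
    """Number of rooted bifurcating topologies for n >= 2 sequences."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return double_factorial(2 * n - 3)

for n in (4, 5, 10, 20):
    print(n, unrooted_topologies(n), rooted_topologies(n))
```

The counts grow so quickly with n that evaluating every topology is only feasible for a handful of sequences.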

21 The Maximum Likelihood Approach (cont.)

22

23 The result is consistent, but the method is time-consuming!

24 DNA: THE BASIS OF MOLECULAR PHYLOGENETICS
The DNA molecule (polymer) is made of monomer units called nucleotides.
Each nucleotide consists of:
–a 5-carbon sugar
–a phosphate group
–a nitrogen base

25 There are two groups of nitrogen bases:
–Purines
–Pyrimidines

26 There are 4 different types of nucleotides in DNA, differing only in the nitrogen base.
The four nitrogen bases are each given a one-letter abbreviation (the first letter of their name):
–"A"denine
–"G"uanine
–"C"ytosine
–"T"hymine

27 Purines are the larger molecules of the two groups.
Adenine and Guanine belong to the purine group.

28 Pyrimidines are the smaller molecules of the two groups.
Cytosine and Thymine belong to the pyrimidine group.

29 The DNA backbone is a polymer with an alternating sugar-phosphate sequence.

30 Adenine forms 2 hydrogen bonds with thymine on the opposite strand.
This is a fixed pairing.

31 Guanine forms 3 hydrogen bonds with cytosine on the opposite strand.
This is also a fixed pairing.

32 Changes in DNA sequences occur through mutations.
There are two kinds of mutations between nucleotides:
–Transition
–Transversion

33 Transition
A mutation between two nucleotides from the same nitrogen base group:
–Purine transition: G ↔ A
–Pyrimidine transition: C ↔ T
Transversion
A mutation between any two nucleotides belonging to different groups (purines ↔ pyrimidines), for example:
–T ↔ A
–C ↔ G
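The two definitions translate directly into a small classifier. This is an illustrative sketch; the function name `classify_substitution` and the "identical" label for unchanged bases are choices made here, not part of the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(a, b):
    """Return 'transition', 'transversion', or 'identical' for bases a -> b."""
    a, b = a.upper(), b.upper()
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if a == b:
        return "identical"
    # Same group (both purines or both pyrimidines) means transition
    same_group = (a in PURINES) == (b in PURINES)
    return "transition" if same_group else "transversion"

for pair in [("G", "A"), ("C", "T"), ("T", "A"), ("C", "G")]:
    print(pair, classify_substitution(*pair))
```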

34 Two basic elements of DNA substitution
π: Composition
r: The process

35 π: Composition. The composition is just the proportion of the four nucleotides, e.g. π = [0.1, 0.4, 0.2, 0.3]; the entries of π sum to 1.
r: The process. It can be described by a matrix of numbers, describing how the nucleotides change from one to another.

36 DNA substitution can be described by a time-homogeneous Poisson process.

37 DNA substitution model
Rate of change from the row nucleotide to the column nucleotide:

      A         C         G         T
A     .         π_C r2    π_G r4    π_T r6
C     π_A r1    .         π_G r8    π_T r10
G     π_A r3    π_C r7    .         π_T r12
T     π_A r5    π_C r9    π_G r11   .
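As a rough illustration, the matrix on this slide can be assembled from a composition π and twelve relative rates. Several details below are assumptions rather than content from the slides: the order A, C, G, T for the example π from slide 35, rates keyed by (from, to) pairs instead of r1 … r12, all rates set to 1.0, and the usual convention that each diagonal entry (shown as "." on the slide) is minus the sum of the rest of its row.

```python
BASES = "ACGT"

def rate_matrix(pi, r):
    """Build Q[i][j] = pi[j] * r[(i, j)] for i != j.
    Diagonal entries are set so that every row sums to zero (assumed convention)."""
    q = {}
    for i in BASES:
        row = {j: pi[j] * r[(i, j)] for j in BASES if j != i}
        row[i] = -sum(row.values())
        q[i] = row
    return q

# Example composition from slide 35, assumed to be in A, C, G, T order
pi = dict(zip(BASES, [0.1, 0.4, 0.2, 0.3]))

# Hypothetical relative rates: all twelve equal to 1.0
r = {(i, j): 1.0 for i in BASES for j in BASES if i != j}

Q = rate_matrix(pi, r)
for i in BASES:
    print(i, [round(Q[i][j], 3) for j in BASES])
```

Restricting the twelve rates (for example, one rate for transitions and one for transversions) or fixing π to equal frequencies gives the simpler special-case models.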

38 The Likelihood of two DNA sequences
The JC69 model assumes:
π: [¼, ¼, ¼, ¼]
The rate of change is equal for all nucleotides.
n1: the number of sites that remain the same
n2: the number of sites that change
t: the distance from node A to B

39 Sequence A: CCGGCCGCGCG
Sequence B: CGGGCCGGCCG
Length = 11; n1 = 8; n2 = 3; rate of change = 0.007
Similarity between A and B is n1/(n1 + n2) = 73%
From the following plot we find that the maximum likelihood is 1.4E-14, where the distance is 17.
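A pairwise JC69 likelihood curve of this kind can be sketched as follows. The slide's exact likelihood formula is not in the transcript, so the parametrization below is an assumption: `rate` is taken as the rate of change to each specific other nucleotide, and the probability that a site differs after distance t is ¾(1 − e^(−4·rate·t)). Absolute values printed by this sketch therefore need not match the numbers read off the authors' plot.

```python
from math import exp, log

def count_sites(seq_a, seq_b):
    """Return (n1, n2): numbers of identical and differing aligned sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    n2 = sum(a != b for a, b in zip(seq_a, seq_b))
    return len(seq_a) - n2, n2

def jc69_pair_log_likelihood(n1, n2, rate, t):
    """Log-likelihood of the pair at distance t > 0 under the assumed JC69 form."""
    e = exp(-4.0 * rate * t)
    p_same = 0.25 + 0.75 * e   # site keeps its base
    p_diff = 0.25 - 0.25 * e   # site changes to one particular other base
    return (n1 + n2) * log(0.25) + n1 * log(p_same) + n2 * log(p_diff)

def jc69_mle_distance(n1, n2, rate):
    """Closed-form maximizer of the log-likelihood above."""
    p = n2 / (n1 + n2)
    if p >= 0.75:
        raise ValueError("sequences too divergent for a finite estimate")
    return -log(1.0 - 4.0 * p / 3.0) / (4.0 * rate)

seq_a = "CCGGCCGCGCG"
seq_b = "CGGGCCGGCCG"
n1, n2 = count_sites(seq_a, seq_b)
rate = 0.007

t_hat = jc69_mle_distance(n1, n2, rate)
print("n1, n2 =", n1, n2)
print("estimated distance:", t_hat)
print("log-likelihood at the estimate:", jc69_pair_log_likelihood(n1, n2, rate, t_hat))
```

Evaluating `jc69_pair_log_likelihood` over a range of t values gives a curve with a single peak at the estimated distance, which is the kind of plot the slide refers to.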

40 High similarity vs. low similarity: the higher the similarity, the shorter the distance.

41 Long sequences vs. short sequences: longer input sequences produce a sharper likelihood curve.

42 Big vs. small rate of change: a slower rate of change gives a longer distance.

43 Multiple DNA sequences as input
PAUP* is designed for the reconstruction of phylogenetic trees based on nucleic acid alignments. It is available at http://www.sinauser.com

44 Example output from PAUP*

45 DNA Substitution Models
All models are special cases of the general model.
The unknown parameters are:
–Nucleotide frequency
–Rate of change (mutation)
Simplest model: equal mutation rates and equal nucleotide frequencies.
Other models assume unequal nucleotide frequencies and/or different mutation rates.

46 Likelihood & Phylogenetics
The Maximum Likelihood method helps us:
–Determine the most probable tree for a set of DNA sequences
–Determine the best DNA substitution model to describe our data

47 Advantages of the Maximum Likelihood Method
The method can be used in a wide range of estimation problems, and produces consistent results.
When the data set is large, the parameter estimates have a very small variance and come very close to the true value.
–This allows us to draw conclusions about the evolutionary process

48 Disadvantages of the Maximum Likelihood Method
The likelihood equations need to be worked out for a given distribution, and they are usually very complicated.
–Fortunately, Maximum Likelihood software is becoming common
Maximum Likelihood estimates can be very biased for small samples.

