Phylogenetic Estimation using Maximum Likelihood
By: Jimin Zhu, Xin Gong, Sravanti Polsani, Rama Sharma, Shlomit Klopman


The Scope of the Presentation
–Introduction
–Maximum Likelihood and Coin Tossing
–The Phylogenetic Tree
–Maximum Likelihood and DNA Substitution
–Advantages and Disadvantages of Maximum Likelihood

Introduction
Phylogeny: the study of relationships between life forms.
Phylogenetics is part of the field of taxonomy and systematics.
Phylogenetics received a huge push forward thanks to modern computers.
Various phylogenetic methods are used to explain the evolutionary process, and they often give contradictory results!

Introduction (cont.)
Scientists agree that a correct species lineage should be determined using statistics.
Maximum Likelihood is the method of choice for establishing the most realistic phylogenetic tree for a given data set.
The Maximum Likelihood method was introduced in 1922 by R. A. Fisher, an English statistician.

Maximum Likelihood in a Nutshell
The method depends on:
–A complete data set
–A probabilistic model that describes the data
–Explicitly expressing the likelihood function
The likelihood of a data set is the probability of obtaining it, given the chosen probability distribution model.
We seek the values of the parameters that maximize the sample likelihood.

Maximum Likelihood Approach Using a Coin Tossing Experiment
Find the parameter value(s) that make the observed data most likely. In other words, choose the value of the parameter that maximizes the probability of observing the data.
Probability: knowing the parameters → prediction of the outcome
Likelihood: observation of the data → estimation of the parameters
Parameters describe the characteristics of a population. Their values are estimated from samples collected from that population.

Simple Coin Tossing Experiment
Binomial probability distribution
The probability of observing h heads out of n tosses can be described as:
$$\Pr[h \mid p, n] = \frac{n!}{h!\,(n-h)!}\, p^{h} (1-p)^{n-h}$$
where p is the probability of heads and (1-p) is the probability of tails.

Simple Coin Tossing Experiment
Suppose I told you we tossed a coin 10 times and got 4 heads and 6 tails. Then the probability would be:
$$P(4\ \text{heads}, 6\ \text{tails}) = \frac{10!}{4!\,6!}\, p^{4} (1-p)^{6}$$
The whole notion of maximum likelihood estimation is that we choose p to be the value that makes the probability of getting our set of observations the largest possible, i.e. we maximize $p^{4}(1-p)^{6}$. The factorial term does not depend on p, so our likelihood function is:
$$L(p) = p^{4} (1-p)^{6}$$
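
As a rough illustration of the two formulas on this slide, the sketch below evaluates the binomial probability and the likelihood kernel for 4 heads in 10 tosses at a few candidate values of p. The function names are illustrative and do not come from the presentation.

```python
from math import comb


def binomial_probability(h, n, p):
    """Probability of exactly h heads in n independent tosses with P(heads) = p."""
    return comb(n, h) * p**h * (1 - p) ** (n - h)


def coin_likelihood(p, h, n):
    """Likelihood kernel p^h (1-p)^(n-h).

    The binomial coefficient is dropped because it does not depend on p,
    so it does not change where the maximum lies.
    """
    return p**h * (1 - p) ** (n - h)


if __name__ == "__main__":
    # Compare a few candidate values of p for 4 heads in 10 tosses.
    for p in (0.2, 0.4, 0.5, 0.6):
        print(p, binomial_probability(4, 10, p), coin_likelihood(p, 4, 10))
```

Both functions rank the candidate values of p in the same order, which is why the slide can drop the factorial term.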

Two ways to find the MLE
1. Take the first derivative of the likelihood function with respect to each parameter, set the resulting equations equal to 0, and solve for the parameter estimates.
Taking the log of the likelihood $L(p) = p^{h}(1-p)^{n-h}$:
$$\log L(p) = h \log p + (n-h) \log(1-p)$$
Taking the first derivative with respect to p and setting it to zero:
$$\frac{h}{p} - \frac{n-h}{1-p} = 0$$
Solving for p gives $h(1-p) = (n-h)p$, so
$$\hat{p} = \frac{h}{n}$$
This value maximizes the likelihood function and is the MLE.

Find the maximum using numeric search procedures
2. Plug different values of p into the probability model and calculate the likelihood.
Let's take a sample with n = 100 and h = 56.
Imagine that p was 0.5. Plugging this value into our probability model:
$$L(p = 0.5 \mid \text{data}) = \frac{100!}{56!\,44!}\, 0.5^{56}\, 0.5^{44} \approx 0.0389$$
But what if p was 0.52 instead?
$$L(p = 0.52 \mid \text{data}) = \frac{100!}{56!\,44!}\, 0.52^{56}\, 0.48^{44} \approx 0.0581$$

So from this we can conclude that p is more likely to be 0.52 than 0.5. We can tabulate the likelihood for different parameter values to find the maximum likelihood estimate of p:
[Table: likelihood L for a range of values of p]

The maximum likelihood estimate for p appears to be 0.56, the observed proportion of heads (56 out of 100).
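
The numeric search described above can be sketched as a simple grid search. This is a minimal sketch using the n = 100, h = 56 sample; the grid spacing of 0.01 and the function names are assumptions made for illustration.

```python
from math import comb


def likelihood(p, h=56, n=100):
    """Binomial likelihood of p given h heads in n tosses."""
    return comb(n, h) * p**h * (1 - p) ** (n - h)


def grid_search(h=56, n=100, steps=100):
    """Evaluate the likelihood at p = 1/steps, 2/steps, ... and return the best p.

    The endpoints p = 0 and p = 1 are skipped because the likelihood is 0 there
    whenever both heads and tails were observed.
    """
    grid = [i / steps for i in range(1, steps)]
    best_p = max(grid, key=lambda p: likelihood(p, h, n))
    return best_p, likelihood(best_p, h, n)


if __name__ == "__main__":
    # Tabulate the likelihood around the observed proportion of heads.
    for p in (0.48, 0.50, 0.52, 0.54, 0.56, 0.58, 0.60, 0.62):
        print(f"p = {p:.2f}   L = {likelihood(p):.4f}")
    print("grid maximum:", grid_search())
```

A finer grid, or a proper optimizer, would be needed for models where the maximum does not fall on a convenient grid point.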

MLE: Sample Graphs (using Mathematica)

Simple Coin Tossing Experiment
The best estimate for p from any one sample is clearly going to be the proportion of heads observed in that sample.
For an example this simple, the full MLE approach is more machinery than is needed to estimate p.
But not all problems are this simple! The more complex the model and the greater the number of parameters, the harder it becomes to make even reasonable guesses at the MLEs.

Phylogenetic Tree
A phylogenetic tree is a data structure, characterized by:
–its topology (form)
–its branch lengths
It stores information regarding the relationships among several species or sequences.
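
One possible way to hold a tree's topology and branch lengths in code is sketched below. The `Node` class, its field names, and the choice of Newick text output are assumptions for illustration, not something the presentation specifies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A tree node: the children give the topology, branch_length the edge to the parent."""

    name: Optional[str] = None          # species or sequence label (leaves)
    branch_length: float = 0.0          # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> List[str]:
        """Names of all leaves below this node, left to right."""
        if self.is_leaf():
            return [self.name]
        names = []
        for child in self.children:
            names.extend(child.leaves())
        return names

    def to_newick(self, is_root: bool = True) -> str:
        """Serialize the subtree in Newick notation."""
        label = self.name or ""
        if self.children:
            inner = ",".join(c.to_newick(is_root=False) for c in self.children)
            label = f"({inner}){label}"
        return f"{label};" if is_root else f"{label}:{self.branch_length}"


if __name__ == "__main__":
    # Four species a, b, c, d; a and b share an internal branch of length 0.05.
    ab = Node(branch_length=0.05, children=[Node("a", 0.1), Node("b", 0.2)])
    tree = Node(children=[ab, Node("c", 0.3), Node("d", 0.4)])
    print(tree.leaves())
    print(tree.to_newick())
```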

Types of Phylogenetic Trees
Rooted tree: the assumed ancestral state "d" is the root species.
Unrooted tree: no implicit "directionality", but it is a measure of similarity between species.
[Figure: a rooted tree and an unrooted tree on a, b, c and d, with leaf, branch and root labeled]

(1) A G G C U C C A A
(2) A G G U U C G A A
(3) A G C C C A G A A
(4) A U U U C G G A A
Molecular phylogenetic methods use a given set of aligned sequences to construct a phylogenetic tree.
[Figure: tree with leaves sequence 1, sequence 2, sequence 3 and sequence 4]
There are several ways to construct phylogenetic trees. The Maximum Likelihood method picks the tree under which the observed sequences are most probable, and takes it as the best estimate of the true tree.

The Maximum Likelihood Approach
(1) A G G C T C C A A ... A
(2) A G G T T C G A A ... A
(3) A G C C C A G A A ... A
(4) A T T T C G G A A ... C
Sites (alignment columns) are indexed 1, 2, ..., j, ..., N.
1. Each site is assumed to evolve independently of the others.

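The Maximum Likelihood Approach (cont.)
2. The log-likelihood is computed for a given topology by using a particular probability model (binomial, multinomial, Poisson, ...).
The likelihood of site j is a sum of probabilities over the possible states of the unobserved internal nodes x and y:
$$L(j) = \text{Prob}(\ldots) + \cdots + \text{Prob}(\ldots)$$
Because sites are independent, the log-likelihood of the whole alignment is the sum over the N sites:
$$\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(j) + \cdots + \ln L(N) = \sum_{i=1}^{N} \ln L(i)$$
[Figure: tree topologies a), b) and c) with internal nodes x and y]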
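
A minimal sketch of this computation for four sequences follows. It makes several assumptions that the slide does not state: the JC69 model supplies the substitution probabilities, every branch has the same length `d` (in expected substitutions per site), and the tree groups the sequences into two pairs joined by internal nodes x and y. All function names are illustrative.

```python
from itertools import product
from math import exp, log

BASES = "ACGT"


def jc69_prob(a, b, d):
    """P(a -> b) along a branch of length d under JC69 (assumed model)."""
    e = exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def site_likelihood(s1, s2, s3, s4, d=0.1):
    """Likelihood of one column for the tree ((s1,s2),(s3,s4)).

    Internal node x joins s1 and s2, internal node y joins s3 and s4.
    The site likelihood sums over all 16 possible states of (x, y).
    """
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (0.25                                   # prior on state x
                  * jc69_prob(x, s1, d) * jc69_prob(x, s2, d)
                  * jc69_prob(x, y, d)
                  * jc69_prob(y, s3, d) * jc69_prob(y, s4, d))
    return total


def log_likelihood(seqs, pairing, d=0.1):
    """Sum of per-site log-likelihoods; pairing = ((i, j), (k, l)) indexes seqs."""
    (i, j), (k, l) = pairing
    return sum(
        log(site_likelihood(seqs[i][c], seqs[j][c], seqs[k][c], seqs[l][c], d))
        for c in range(len(seqs[0]))
    )


if __name__ == "__main__":
    # First nine columns of the four example sequences.
    seqs = ["AGGCTCCAA", "AGGTTCGAA", "AGCCCAGAA", "ATTTCGGAA"]
    topologies = {
        "a": ((0, 1), (2, 3)),
        "b": ((0, 2), (1, 3)),
        "c": ((0, 3), (1, 2)),
    }
    for name, pairing in topologies.items():
        print(name, log_likelihood(seqs, pairing))
```

Working with sums of logs rather than products of probabilities avoids numerical underflow when N is large. A real analysis would also optimize each branch length instead of fixing them all to one value.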

The Maximum Likelihood Approach (cont.)
3. After the procedure is done for all possible topologies, the topology that shows the highest likelihood is chosen as the true (realistic) tree.
How many topologies do we have to go through for n sequences?
#Rooted trees = [formula not preserved in the transcript]
#Unrooted trees = [formula not preserved in the transcript]
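
The slide's own formulas did not survive in the transcript. As a stand-in, the sketch below uses the standard counts for strictly bifurcating trees with n labeled leaves: (2n-5)!! unrooted trees and (2n-3)!! rooted trees. Whether the slide used exactly this form is an assumption.

```python
def odd_double_factorial(k):
    """Product k * (k-2) * (k-4) * ... * 1 for odd k; returns 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n):
    """Number of unrooted bifurcating topologies for n >= 3 labeled sequences."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    return odd_double_factorial(2 * n - 5)


def num_rooted_trees(n):
    """Number of rooted bifurcating topologies for n >= 2 labeled sequences."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return odd_double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

The counts grow faster than exponentially, which is why an exhaustive search over topologies is only practical for a handful of sequences.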

The Maximum Likelihood Approach (cont.)

The result is consistent, but the procedure is time consuming!

DNA – THE BASIS OF MOLECULAR PHYLOGENETICS
The DNA molecule (polymer) is made of monomer units called nucleotides.
Each nucleotide consists of:
–a 5-carbon sugar
–a phosphate group
–a nitrogen base

There are two groups of nitrogen bases:
–Purines
–Pyrimidines

There are 4 different types of nucleotides in DNA, differing only in the nitrogen base.
The four nucleotides are each given a one-letter abbreviation (the first letter of the base's name):
–“A”denine
–“G”uanine
–“C”ytosine
–“T”hymine

Purines are the larger molecules of the two groups.
Adenine and Guanine belong to the purine group.

Pyrimidines are the smaller molecules of the two groups.
Cytosine and Thymine belong to the pyrimidine group.

The DNA backbone is a polymer with an alternating sugar-phosphate sequence.

Adenine forms 2 hydrogen bonds with Thymine on the opposite strand.
This is a fixed pairing.

Guanine forms 3 hydrogen bonds with Cytosine on the opposite strand.
This is also a fixed pairing.

Changes in DNA sequences occur through mutations.
There are two kinds of mutations between nucleotides:
–Transition
–Transversion

Transition
A mutation between two nucleotides from the same nitrogen base group:
–Purine transition: G ↔ A
–Pyrimidine transition: C ↔ T
Transversion
A mutation between any two nucleotides belonging to different groups (Purines ↔ Pyrimidines), for example:
–T ↔ A
–C ↔ G
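
The two definitions above translate directly into a small classifier. This is an illustrative sketch; the function names and the "none" label for identical bases are assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Return 'none', 'transition' or 'transversion' for a pair of DNA bases."""
    a, b = a.upper(), b.upper()
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if a == b:
        return "none"
    # Same group (both purines or both pyrimidines) means a transition.
    same_group = (a in PURINES) == (b in PURINES)
    return "transition" if same_group else "transversion"


def count_substitutions(seq1, seq2):
    """Count site-by-site differences between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"none": 0, "transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        counts[classify_substitution(a, b)] += 1
    return counts


if __name__ == "__main__":
    print(classify_substitution("G", "A"))
    print(classify_substitution("T", "A"))
    print(count_substitutions("AGGCTCCAA", "AGGTTCGAA"))
```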

Two basic elements of DNA substitution
π: Composition
r: The process

π: Composition. The composition is just the proportions of the four nucleotides, for example π = [0.1, 0.4, 0.2, 0.3]; the entries of π sum to 1.
r: The process. It can be described by a matrix of numbers describing how the nucleotides change from one to another.

DNA substitution can be described by a time-homogeneous Poisson process.

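DNA substitution model
Rows are the original nucleotide, columns are the nucleotide it changes to, and dots mark the diagonal:
$$
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & \cdot & \pi_C r_2 & \pi_G r_4 & \pi_T r_6 \\
C & \pi_A r_1 & \cdot & \pi_G r_8 & \pi_T r_{10} \\
G & \pi_A r_3 & \pi_C r_7 & \cdot & \pi_T r_{12} \\
T & \pi_A r_5 & \pi_C r_9 & \pi_G r_{11} & \cdot
\end{array}
$$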
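
A matrix of this shape can be assembled from a composition and a set of relative rates, as in the sketch below. Two assumptions are made here: each off-diagonal entry is the target nucleotide's frequency times a relative rate, and each diagonal entry is set so that its row sums to zero (the usual convention for rate matrices, which the slide leaves as dots). The dictionary layout and names are illustrative.

```python
BASES = "ACGT"


def build_rate_matrix(pi, r):
    """Build a 4x4 substitution rate matrix as nested dicts q[from][to].

    pi: dict mapping each base to its frequency (composition).
    r:  dict mapping (from_base, to_base) to a relative rate, 12 entries.
    Off-diagonal: q[i][j] = pi[j] * r[(i, j)].
    Diagonal:     q[i][i] = -(sum of the other entries in row i)  (assumed).
    """
    q = {}
    for i in BASES:
        row = {j: pi[j] * r[(i, j)] for j in BASES if j != i}
        row[i] = -sum(row.values())
        q[i] = row
    return q


if __name__ == "__main__":
    # Example composition from the text, assumed to be in the order A, C, G, T.
    pi = dict(zip(BASES, [0.1, 0.4, 0.2, 0.3]))
    # All twelve relative rates equal: only the composition shapes the matrix.
    r = {(i, j): 1.0 for i in BASES for j in BASES if i != j}
    q = build_rate_matrix(pi, r)
    for i in BASES:
        print(i, [round(q[i][j], 3) for j in BASES])
```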

The Likelihood of Two DNA Sequences
The JC69 model is assumed:
π: [¼, ¼, ¼, ¼]
Rate of change: equal for all nucleotides
n1: the number of sites that remain the same
n2: the number of sites that change
t: the distance from node A to node B

Sequence A: CCGGCCGCGCG
Sequence B: CGGGCCGGCCG
Length = 11; n1 = 8; n2 = 3; rate of change = 0.007
Similarity between A and B is n1/(n1 + n2) = 73%.
From the following plot we find that the maximum likelihood is 1.4E-14, where the distance is 17.
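
The search for the best distance can be sketched as follows. The transcript does not preserve the exact likelihood formula, so this sketch assumes a common JC69 parameterization in which a site stays the same with probability 1/4 + 3/4·exp(-4·rate·t) and changes to one specific other base with probability 1/4 - 1/4·exp(-4·rate·t). Under a different parameterization the numbers will differ from those read off the slide's plot. All function names are illustrative.

```python
from math import exp, log


def count_sites(seq_a, seq_b):
    """Return (n1, n2): the numbers of identical and differing sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    n_same = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return n_same, len(seq_a) - n_same


def jc69_log_likelihood(n_same, n_diff, rate, t):
    """Log-likelihood of distance t (t > 0) under the assumed JC69 form."""
    e = exp(-4.0 * rate * t)
    p_same = 0.25 + 0.75 * e
    p_diff = 0.25 - 0.25 * e
    return ((n_same + n_diff) * log(0.25)
            + n_same * log(p_same) + n_diff * log(p_diff))


def jc69_mle_distance(n_same, n_diff, rate):
    """Closed-form maximizer of the log-likelihood above."""
    p_hat = n_diff / (n_same + n_diff)
    if p_hat >= 0.75:
        raise ValueError("sequences too divergent for a finite estimate")
    return -log(1.0 - 4.0 * p_hat / 3.0) / (4.0 * rate)


def grid_search_distance(n_same, n_diff, rate, t_max=100.0, step=0.1):
    """Numeric search over t = step, 2*step, ..., t_max."""
    grid = [i * step for i in range(1, int(round(t_max / step)) + 1)]
    return max(grid, key=lambda t: jc69_log_likelihood(n_same, n_diff, rate, t))


if __name__ == "__main__":
    n1, n2 = count_sites("CCGGCCGCGCG", "CGGGCCGGCCG")
    print("n1, n2 =", n1, n2)
    print("grid estimate of t:", grid_search_distance(n1, n2, 0.007))
    print("closed-form estimate of t:", jc69_mle_distance(n1, n2, 0.007))
```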

High similarity vs. low similarity
Higher similarity gives a shorter distance.

Long sequences vs. short sequences
Longer input sequences produce a sharper likelihood curve.

Big vs. small rate of change
A slower rate of change gives a longer distance.

Multiple DNA sequences as input
PAUP* is designed for reconstructing phylogenetic trees from nucleic acid alignments.

Example output from PAUP*

DNA Substitution Models
All models are special cases of the general model.
The unknown parameters are:
–Nucleotide frequency
–Rate of change (mutation)
Simplest model: equal mutation rates and equal nucleotide frequencies.
Other models assume unequal nucleotide frequencies and/or different mutation rates.

Likelihood & Phylogenetics
The Maximum Likelihood method helps us:
–Determine the most probable tree for a set of DNA sequences
–Determine the best DNA substitution model to describe our data

Advantages of the Maximum Likelihood Method
The method can be used in a wide range of estimation problems, and produces consistent results.
When the data set is large, the parameter estimates have a very small variance and come very close to the true value.
–This allows us to draw conclusions about the evolutionary process
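
The shrinking variance can be checked exactly for the coin-tossing MLE p̂ = h/n from earlier in the presentation. The sketch below enumerates every possible number of heads and computes the mean and variance of the estimate; the true value p = 0.56 and the sample sizes are arbitrary choices for illustration.

```python
from math import comb


def mle_sampling_distribution(n, p):
    """All possible estimates h/n with their binomial probabilities."""
    return [(h / n, comb(n, h) * p**h * (1 - p) ** (n - h)) for h in range(n + 1)]


def mean_and_variance(n, p):
    """Exact mean and variance of the estimate h/n for n tosses."""
    dist = mle_sampling_distribution(n, p)
    mean = sum(est * prob for est, prob in dist)
    variance = sum((est - mean) ** 2 * prob for est, prob in dist)
    return mean, variance


if __name__ == "__main__":
    for n in (10, 100, 400):
        mean, variance = mean_and_variance(n, 0.56)
        print(f"n = {n:4d}   mean = {mean:.4f}   variance = {variance:.6f}")
```

The variance falls in proportion to 1/n, so larger samples concentrate the estimate around the true value.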

Disadvantages of the Maximum Likelihood Method
The likelihood equations need to be worked out for a given distribution, and they are usually very complicated.
–Fortunately, Maximum Likelihood software is becoming common
Maximum Likelihood estimates can be very biased for small samples.