CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.

Slides:



Advertisements
Similar presentations
Maximum Likelihood:Phylogeny Estimation Neelima Lingareddy.
Advertisements

Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
Phylogenetic Trees Lecture 4
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Molecular Evolution Revised 29/12/06
Tree Reconstruction.
“Inferring Phylogenies” Joseph Felsenstein Excellent reference
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Model Selection Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayesian Inference Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
Maximum Likelihood. Historically the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community.
Distance Methods. Distance Estimates attempt to estimate the mean number of changes per site since 2 species (sequences) split from each other Simply.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
Maximum Likelihood Flips usage of probability function A typical calculation: P(h|n,p) = C(h, n) * p h * (1-p) (n-h) The implied question: Given p of success.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
. Maximum Likelihood (ML) Parameter Estimation with applications to reconstructing phylogenetic trees Comput. Genomics, lecture 6b Presentation taken from.
Phylogenetic Estimation using Maximum Likelihood By: Jimin Zhu Xin Gong Xin Gong Sravanti polsani Sravanti polsani Rama sharma Rama sharma Shlomit Klopman.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Distance Matrix Methods: Models of Evolution Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayesian Inference Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
Class 3: Estimating Scoring Rules for Sequence Alignment.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Consensus Trees Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
The concept of likelihood refers given some data D, a decision must be made about an adequate explanation of the data. In the phylogenetic framework, one.
Markov models and applications Sushmita Roy BMI/CS 576 Oct 7 th, 2014.
Probabilistic methods for phylogenetic trees (Part 2)
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Model Selection Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
The Human Genome (Harding & Sanger) * *20  globin (chromosome 11) 6*10 4 bp 3*10 9 bp *10 3 Exon 2 Exon 1 Exon 3 5’ flanking 3’ flanking 3*10 3.
Finding Regulatory Motifs in DNA Sequences. Motifs and Transcriptional Start Sites gene ATCCCG gene TTCCGG gene ATCCCG gene ATGCCG gene ATGCCC.
Phylogenetic Analysis
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
.. . Maximum Likelihood (ML) Parameter Estimation with applications to inferring phylogenetic trees Comput. Genomics, lecture 6a Presentation taken from.
Molecular basis of evolution. Goal – to reconstruct the evolutionary history of all organisms in the form of phylogenetic trees. Classical approach: phylogenetic.
Tree Inference Methods
1 Dan Graur Molecular Phylogenetics Molecular phylogenetic approaches: 1. distance-matrix (based on distance measures) 2. character-state.
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Available at DNA variation in Ecology and Evolution DNA variation in Ecology and Evolution IV- Clustering methods and Phylogenetic.
Outline More exhaustive search algorithms Today: Motif finding
MOLECULAR PHYLOGENETICS Four main families of molecular phylogenetic methods :  Parsimony  Distance methods  Maximum likelihood methods  Bayesian methods.
Comp. Genomics Recitation 3 The statistics of database searching.
Calculating branch lengths from distances. ABC A B C----- a b c.
Identifying and Modeling Selection Pressure (a review of three papers) Rose Hoberman BioLM seminar Feb 9, 2004.
MS Sequence Clustering
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Rooting Phylogenetic Trees with Non-reversible Substitution Models Von Bing Yap* and Terry Speed § *Statistics and Applied Probability, National University.
CZ5226: Advanced Bioinformatics Lecture 6: HHM Method for generating motifs Prof. Chen Yu Zong Tel:
Maximum Likelihood Given competing explanations for a particular observation, which explanation should we choose? Maximum likelihood methodologies suggest.
Phylogeny Ch. 7 & 8.
Markov Chain Models BMI/CS 576 Colin Dewey Fall 2015.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Measuring genetic change Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Section 5.2.
1 CAP5510 – Bioinformatics Phylogeny Tamer Kahveci CISE Department University of Florida.
Lecture 15: Reconstruction of Phylogeny Adaptive characters: 1.May indicate derived character (special adaptation) e.g. Raptorial forelegs in mantids 2.May.
Molecular Evolution Distance Methods Biol. Luis Delaye Facultad de Ciencias, UNAM.
Bioinf.cs.auckland.ac.nz Juin 2008 Uncorrelated and Autocorrelated relaxed phylogenetics Michaël Defoin-Platel and Alexei Drummond.
1 Repeats!. 2 Introduction  A repeat family is a collection of repeats which appear multiple times in a genome.  Our objective is to identify all families.
Building Phylogenies Maximum Likelihood. Methods Distance-based Parsimony Maximum likelihood.
Methods in Phylogenetic Inference Chris Castorena Thornton Lab.
Building Phylogenies. Phylogenetic (evolutionary) trees Human Gorilla Chimp Gibbon Orangutan Describe evolutionary relationships between species Cannot.
Markov Chain Models BMI/CS 776
Phylogenetic basis of systematics
Lecture 6B – Optimality Criteria: ML & ME
Molecular basis of evolution.
Recitation 5 2/4/09 ML in Phylogeny
Lecture 6B – Optimality Criteria: ML & ME
Presentation transcript:

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical University of Denmark (DTU)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS What is a model? Mathematical models are: IncomprehensibleIncomprehensible UselessUseless No fun at allNo fun at all

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS What is a model? Mathematical models are: IncomprehensibleIncomprehensible UselessUseless No fun at allNo fun at all

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS What is a model? Model = hypothesis !!!Model = hypothesis !!! Hypothesis (as used in most biological research):Hypothesis (as used in most biological research): –Precisely stated, but qualitative –Allows you to make qualitative predictions Arithmetic model:Arithmetic model: –Mathematically explicit (parameters) –Allows you to make quantitative predictions …

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS The Scientific Method Observation of data Model of how system works Prediction(s) about system behavior (simulation)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS The maximum likelihood approach I Starting point:Starting point: You have some observed data and a probabilistic model for how the observed data was produced Having a probabilistic model of a process means you are able to compute the probability of any possible outcome (given a set of specific values for the model parameters). Example:Example: –Data: result of tossing coin 10 times - 7 heads, 3 tails –Model: coin has probability p for heads, 1-p for tails. The probability of observing h heads among n tosses is: Goal:Goal: You want to find the best estimate of the (unknown) parameter values based on the observations. (here the only parameter is “p”) P(h heads) = p h (1-p) n-h hnhn   

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS The maximum likelihood approach II  Likelihood = Probability (Data | Model)  Maximum likelihood: Best estimate is the set of parameter values which gives the highest possible likelihood.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Maximum likelihood: coin tossing example Data: result of tossing coin 10 times - 7 heads, 3 tails Model: coin has probability p for heads, 1-p for tails. P(h heads) = p h (1-p) 10-h h 10   

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Probabilistic modeling applied to phylogeny Observed data: multiple alignment of sequencesObserved data: multiple alignment of sequences H.sapiens globinA G G G A T T C A H.sapiens globinA G G G A T T C A M.musculus globinA C G G T T T - A M.musculus globinA C G G T T T - A R.rattus globinA C G G A T T - A R.rattus globinA C G G A T T - A Probabilistic model parameters (simplest case):Probabilistic model parameters (simplest case): –Nucleotide frequencies:  A,  C,  G,  T –Tree topology and branch lengths –Nucleotide-nucleotide substitution rates (or substitution probabilities):

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Other models of nucleotide substitution

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS More elaborate models of evolution Codon-codon substitution ratesCodon-codon substitution rates (64 x 64 matrix of codon substitution rates) Different mutation rates at different sites in the geneDifferent mutation rates at different sites in the gene (the “gamma-distribution” of mutation rates) Molecular clocksMolecular clocks (same mutation rate on all branches of the tree). Different substitution matrices on different branches of the treeDifferent substitution matrices on different branches of the tree (e.g., strong selection on branch leading to specific group of animals) Etc., etc.Etc., etc.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Different rates at different sites: the gamma distribution

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Computing the probability of one column in an alignment given tree topology and other parameters A C G G A T T C A A C G G T T T - A A A G G A T T - A A G G G T T T - A A A C CA G Columns in alignment contain homologous nucleotides Assume tree topology, branch lengths, and other parameters are given. For now, assume ancestral states were A and A (we’ll get to the full computation on next slide). Start computation at any internal or external node. Pr =  C P CA (t 1 ) P AC (t 2 ) P AA (t 3 ) P AG (t 4 ) P AA (t 5 ) t4t4 t5t5 t3t3 t1t1 t2t2

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Computing the probability of an entire alignment given tree topology and other parameters Probability must be summed over all possible combinations of ancestral nucleotides. (Here we have two internal nodes giving 16 possible combinations) Probability of individual columns are multiplied to give the overall probability of the alignment, i.e., the likelihood of the model. Often the log of the probability is used (log likelihood)

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Tree 1Tree 2 Node 1Node 2LikelihoodNode 1Node 2Likelihood AA AA AC AC AG AG AT AT CA CA CC CC CG CG CT CT GA GA GC GC GG GG GT GT TA TA TC TC TG TG TT TT Sum Sum

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Tree 1Tree 2 Node 1Node 2LikelihoodNode 1Node 2Likelihood AA AA AC AC AG AG AT AT CA CA CC CC CG CG CT CT GA GA GC GC GG GG GT GT TA TA TC TC TG TG TT TT Sum Sum

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Tree 1 Node 1Node 2LikelihoodSum AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT Sum Ancestral reconstruction: Node1 = A

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Maximum likelihood phylogeny Data:Data: –sequence alignment Model parameters:Model parameters: –nucleotide frequencies, nucleotide substitution rates, tree topology, branch lengths. Choose random initial values for all parameters, compute likelihood Choose random initial values for all parameters, compute likelihood Change parameter values slightly in a direction so likelihood improves Change parameter values slightly in a direction so likelihood improves Repeat until maximum found Repeat until maximum found Results: Results: (1) ML estimate of tree topology (2) ML estimate of branch lengths (3) ML estimate of other model parameters (4) Measure of how well model fits data (likelihood). (likelihood).