MODELLING EVOLUTION TERESA NEEMAN STATISTICAL CONSULTING UNIT ANU.

Slides:



Advertisements
Similar presentations
Introduction to Molecular Evolution
Advertisements

Phylogenetic Tree A Phylogeny (Phylogenetic tree) or Evolutionary tree represents the evolutionary relationships among a set of organisms or groups of.
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
Phylogenetic Trees Lecture 4
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Phylogenetic reconstruction
Current challenges for molecular phylogenetics Barbara Holland School of Mathematics & Physics University of Tasmania Mostly statistical.
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
. Maximum Likelihood (ML) Parameter Estimation with applications to inferring phylogenetic trees Comput. Genomics, lecture 7a Presentation partially taken.
Maximum Likelihood. Historically the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community.
Distance Methods. Distance Estimates attempt to estimate the mean number of changes per site since 2 species (sequences) split from each other Simply.
Molecular Evolution with an emphasis on substitution rates Gavin JD Smith State Key Laboratory of Emerging Infectious Diseases & Department of Microbiology.
Dispersal models Continuous populations Isolation-by-distance Discrete populations Stepping-stone Island model.
Probabilistic Approaches to Phylogeny Wouter Van Gool & Thomas Jellema.
Molecular Evolution, Part 2 Everything you didn’t want to know… and more! Everything you didn’t want to know… and more!
Probabilistic methods for phylogenetic trees (Part 2)
Lecture 13 – Performance of Methods Folks often use the term “reliability” without a very clear definition of what it is. Methods of assessing performance.
Chapter 2 Opener How do we classify organisms?. Figure 2.1 Tracing the path of evolution to Homo sapiens from the universal ancestor of all life.
The Human Genome (Harding & Sanger) * *20  globin (chromosome 11) 6*10 4 bp 3*10 9 bp *10 3 Exon 2 Exon 1 Exon 3 5’ flanking 3’ flanking 3*10 3.
. Phylogenetic Trees Lecture 13 This class consists of parts of Prof Joe Felsenstein’s lectures 4 and 5 taken from:
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
1 Introduction to Bioinformatics 2 Introduction to Bioinformatics. LECTURE 5: Variation within and between species * Chapter 5: Are Neanderthals among.
Molecular phylogenetics
Molecular evidence for endosymbiosis Perform blastp to investigate sequence similarity among domains of life Found yeast nuclear genes exhibit more sequence.
Phylogenetics Alexei Drummond. CS Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) (B)
Lecture 3: Markov models of sequence evolution Alexei Drummond.
Tree Inference Methods
Lecture 25 - Phylogeny Based on Chapter 23 - Molecular Evolution Copyright © 2010 Pearson Education Inc.
BINF6201/8201 Molecular phylogenetic methods
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Building and visualizing phylogeny Henrik Lantz Dept. of Medical Biochemistry and Microbiology, BMC, Uppsala University.
Phylogenetic Prediction Lecture II by Clarke S. Arnold March 19, 2002.
Chapter 8 Molecular Phylogenetics: Measuring Evolution.
Building phylogenetic trees. Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances  UPGMA method (+ an example)
Lecture 2: Principles of Phylogenetics
Lab3: Bayesian phylogenetic Inference and MCMC Department of Bioinformatics & Biostatistics, SJTU.
Lecture 10 – Models of DNA Sequence Evolution Correct for multiple substitutions in calculating pairwise genetic distances. Derive transformation probabilities.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Identifying and Modeling Selection Pressure (a review of three papers) Rose Hoberman BioLM seminar Feb 9, 2004.
Patterns of divergent selection from combined DNA barcode and phenotypic data Tim Barraclough, Imperial College London.
An MLE-based clustering method on DNA barcode Ching-Ray Yu Statistics Department Rutgers University, USA 07/07/2006.
Phylogenetic Analysis Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics Figures from Higgs & Attwood.
Rooting Phylogenetic Trees with Non-reversible Substitution Models Von Bing Yap* and Terry Speed § *Statistics and Applied Probability, National University.
Chapter 10 Phylogenetic Basics. Similarities and divergence between biological sequences are often represented by phylogenetic trees Phylogenetics is.
Why do trees?. Phylogeny 101 OTUsoperational taxonomic units: species, populations, individuals Nodes internal (often ancestors) Nodes external (terminal,
Phylogeny Ch. 7 & 8.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Measuring genetic change Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Section 5.2.
Evolutionary Models CS 498 SS Saurabh Sinha. Models of nucleotide substitution The DNA that we study in bioinformatics is the end(??)-product of evolution.
Evolutionary Genome Biology Gabor T. Marth, D.Sc. Department of Biology, Boston College
Bioinf.cs.auckland.ac.nz Juin 2008 Uncorrelated and Autocorrelated relaxed phylogenetics Michaël Defoin-Platel and Alexei Drummond.
Modelling evolution Gil McVean Department of Statistics TC A G.
Comparative methods wrap-up and “key innovations”.
From: On the Origin of Darwin's Finches
Phylogeny and the Tree of Life
Models for DNA substitution
Inferring a phylogeny is an estimation procedure.
Maximum likelihood (ML) method
Pipelines for Computational Analysis (Bioinformatics)
Models of Sequence Evolution
Goals of Phylogenetic Analysis
Summary and Recommendations
Reading Phylogenetic Trees
Chapter 19 Molecular Phylogenetics
The Most General Markov Substitution Model on an Unrooted Tree
Evidence for Evolution
Summary and Recommendations
Presentation transcript:

MODELLING EVOLUTION TERESA NEEMAN STATISTICAL CONSULTING UNIT ANU

DRIVERS OF EVOLUTION: RANDOM PROCESS + SELECTION/ADAPTATION Bacteria undergo hypermutation in response to stress Adaptive immune system undergoes somatic hypermutation to mount a specific immune response

DERIVING STOCHASTIC MODELS OF MOLECULAR EVOLUTION COMPARE HOMOLOGOUS GENES ACROSS SPECIES USING SEQUENCE ALIGNMENTS

PHYLOGENETICS: TOOL FOR MODELLING EVOLUTION What is the evolutionary relationship between a set of taxa? (identify the tree) Given a tree, what is the evolutionary distance along the branches?

CONTINUOUS TIME MARKOV PROCESSES TO MODEL MOLECULAR EVOLUTION Underlying substitution rate matrix Q Initial nucleotide distribution, e.g. π = (.25,.25,.25,.25) Up to 12 free parameters

A BRIEF HISTORY OF CONTINUOUS TIME MARKOV PROCESSES FOR MODELLING EVOLUTION Model Substitution Rate #Parameters Jukes-Cantor(1969) all equal 1 Kimura (1980) transitions > transversions 2 Felsenstein (1981) estimate initial state 3 Gen. Time-reversible (1986) exchangeable 9

PROPERTIES OF ALL OF THESE MODELS Stationarity: the distribution of bases (A,C,G,T) unchanged over evolutionary time Time-reversibility: the process looks the same whether run forwards or backwards

NESTED MODELS IN COMMON MODELS OF EVOLUTION Jukes-Cantor(1969) Felsenstein(1981) Kimura(1980) HKY(1984)GTR(1986) DATA (sufficient statistic) ? 1 df 2 df 3 df 5 df 9 df (4^k -1) df

ASSESSING AN EVOLUTIONARY MODEL Choice of model should reflect evident patterns in data Stationary models: A/C/G/T content should be similar across taxa Note: time reversible models are stationary Models should “fit” the data Likelihood ratio tests against the “saturated model” GOLDMAN, N Statistical tests of models of DNA substitution. J Mol Evol, 36,

FULLY GENERALISED CONTINUOUS TIME MARKOV MODEL 12 free parameters for each edge Initial nucleotide distribution, to be estimated by the data Up to 12 free parameters

COMPARING NESTED MODELS IN 3-TAXA (UNROOTED) TREE 4000 SEQUENCE ALIGNMENTS (NUCLEUS) HUMAN OPOSSUM MOUSE TIME-REVERSIBLE Q (GTR) 9 df GENERAL MODEL Q H, Q M, Q O 39 df “FULLY SATURATED” MODEL (MULTINOMIAL) 63 df Q QHQHQHQH QMQMQMQM QOQOQOQO KAEHLER, B. et al Genetic distance for a general non- stationary markov substitution process. Syst Biol, 64,

HOW OFTEN ARE THE MARKOV MODELS “ADEQUATE” RELATIVE TO THE SATURATED MODEL? “ADEQUATE” = LIKELIHOOD RATIO TEST (p >0.05) GENERAL MODEL – 94% TIME-REVERSIBLE MODEL - 18% The saturated model can be approximated using a Markov model!

MODEL FIT VS MODEL COMPLEXITY MARKOV MODELS WITH N TAXA: GTR model: 9 free parameters General model: (2N-3)* free parameters Can we reduce the model complexity without sacrificing model fit?

DATA: SUBSTITUTION RATE ESTIMATES FOR 4000 NUCLEAR SEQUENCE ALIGNMENTS: X 12 GeneedgeT.AT.GC.TC.Aetc. 1MOUSE MOUSE HUMAN HUMAN OPOSSUM OPOSSUM

DIMENSIONS OF VARIATION (PCA) IN SUBSTITUTION RATE 4000 RATE MATRICES: HUMAN TWO dimensions explains 87% total variation between rate matrices Strand symmetry: A>G, T>C are most variable substitutions Transitions: A>G, T>C, C>T and G>A are most common substitutions

DIMENSIONS OF VARIATION IN SUBSTITUTION RATE 4000 RATE MATRICES: OPOSSUM TWO dimensions explains 69% total variation between rate matrices Strand symmetry: A>G, T>C are most variable substitutions Transitions: A>G, T>C, C>T and G>A are most common substitutions Issue: Opossum is the outgroup, models are not time-reversible.

DIMENSIONS OF VARIATION IN (NORMALISED) SUBSTITUTION RATE 4000 RATE MATRICES: HUMAN FOUR dimensions explains 56% total variation between rate matrices Strand symmetry: for all substitutions Transitions: A>G, T>C, C>T and G>A are most common substitutions Could strand-symmetric models be “as good as” general model?

PRELIMINARY CONCLUSIONS Continuous time Markov models can be good models of molecular evolution General Markov models trump traditional evolutionary models General Markov models can be too parameter-rich We can fit general models to sequence alignment data and look for lower dimensional alternatives GTR(1986) DATA (sufficient statistic) Fully General Markov ?

ACKNOWLEDGEMENTS AND REFERENCES Gavin Huttley, JCSMR ANU Ben Kaehler, JCSMR, ANU KAEHLER, B. et al Genetic distance for a general non-stationary markov substitution process. Syst Biol, 64, GOLDMAN, N Statistical tests of models of DNA substitution. J Mol Evol, 36,

NEXT DIRECTIONS: USING PYTHON FOCUS ON FEATURES For each alignment, fit the general model and reduced model (e.g. strand symmetric models) Compare models using likelihood ratio tests FOCUS ON PROJECTIONS Fit general model to all 4000 alignments: Q_all for each species Project Q_all matrices to Q0 using singular value decomposition Compare models using likelihood ratio tests