Monkey Business Bioinformatics Research Center University of Aarhus Thomas Mailund Joint work with Asger Hobolth, Ole F. Christensen and Mikkel H. Schierup.

Paper:

Dating speciation events Relationship between great apes: How do we know this is the relationship? How do we date the speciation events?

Fossil record Jobling et al Common ancestor human-chimp? Good fossil record for humans, less so for other apes... Enter: molecular genetics methods!

Outline of talk ● Introduction to molecular evolution and population genetics ● Mathematical modeling of the problem: coalescence hidden Markov models ● Inference and results

Mutations and a molecular clock ● Mutations enter DNA when ➔ cells replicate ➔ chromosomes are modified by the cell's “chemical soup” ● Error correction and various “proof reading” mechanisms make mutations extremely rare ➔ But there is a lot of DNA in each individual (2 x ~3 billion nucleotides) ➔ Things that happen about once every thousand years happen a lot over millions of years

Mutations and a molecular clock The number of mutations is linear in the number of (germ line cell) generations... and for our purposes, generations are linear in time, so the number of mutations is linear in time.

Mutations and a molecular clock HIV sequence: Accumulation of mutations over time

Mutations and a molecular clock

Estimation using observed number of mutations

Mutations and a molecular clock Back mutations complicate matters slightly, but we can compensate for this by solving a simple differential equation...

Mutations and a molecular clock The time estimate is known up to a mutation-rate factor... that can be calibrated using, e.g., fossil evidence.
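
As a rough illustration of the correction and calibration described above, here is a minimal Python sketch. It assumes the Jukes-Cantor (JC69) substitution model, which the slides do not name; the function names and the example numbers are hypothetical.

```python
import math

def jc69_distance(p):
    """Expected substitutions per site, given the observed fraction p of differing sites.

    Assumes the Jukes-Cantor model (all substitutions equally likely).
    The correction accounts for back mutations and multiple hits.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def divergence_time(p, mu_per_site_per_year):
    """Time since two sequences split, in years.

    Mutations accumulate along both lineages, hence the factor 2.
    mu_per_site_per_year is the calibration factor (e.g. from fossils).
    """
    return jc69_distance(p) / (2.0 * mu_per_site_per_year)

# Hypothetical numbers, for illustration only.
p_observed = 0.012  # fraction of sites that differ between two sequences
mu = 1e-9           # assumed substitutions per site per year
print(divergence_time(p_observed, mu))
```

The corrected distance is always at least as large as the raw fraction of differing sites, and the two agree closely when few sites differ.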

Dating divergence... So we can estimate pairwise divergence in units of time... although this over-estimates the number of mutations (it does not take shared mutations into account). [Figure labels: 14 My, 36 My]

Dating divergence... It is possible to construct a tree and infer branch lengths from this... or take a statistical approach and deal with unobserved sequences in a mathematical model. [Figure: tree with the sequences aactg, agctg, aggtg, atatg, agctg, aactg]

Dating divergence... The likelihood of the observed sequences, given the tree and its branch lengths, can be computed efficiently using a dynamic programming algorithm. Time parameters can then be estimated by maximizing the likelihood. [Figure: tree with the sequences aactg, agctg, aggtg, atatg, agctg, aactg and branch lengths t1, t2, t3]
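
The dynamic programming step can be sketched as follows. This is a minimal version of Felsenstein's pruning algorithm under the Jukes-Cantor model; the slides do not say which substitution model or tree encoding was used, so the nested-tuple tree format, the branch lengths and all function names here are assumptions made for illustration.

```python
import math

NUCLEOTIDES = "acgt"

def jc69_transition(d):
    """Return (P(same nucleotide), P(one specific other nucleotide)) after d substitutions per site."""
    same = 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * d / 3.0)
    return same, diff

def partial_likelihoods(node, site):
    """Return [P(data below node | node has nucleotide x) for x in NUCLEOTIDES]."""
    if isinstance(node, str):  # leaf: an observed sequence
        return [1.0 if node[site] == x else 0.0 for x in NUCLEOTIDES]
    left, d_left, right, d_right = node  # internal node with two children
    result = [1.0] * 4
    for child, d in ((left, d_left), (right, d_right)):
        below = partial_likelihoods(child, site)
        same, diff = jc69_transition(d)
        for i in range(4):
            result[i] *= sum((same if i == j else diff) * below[j] for j in range(4))
    return result

def log_likelihood(tree, n_sites):
    """Sum over sites of the log probability of the leaf nucleotides."""
    total = 0.0
    for site in range(n_sites):
        root = partial_likelihoods(tree, site)
        total += math.log(sum(0.25 * x for x in root))  # uniform root distribution
    return total

# Hypothetical three-leaf tree using three of the sequences from the slide.
tree = (("aactg", 0.1, "agctg", 0.1), 0.2, "atatg", 0.3)
print(log_likelihood(tree, 5))
```

To estimate the time parameters, one would pass the branch lengths to a numerical optimizer and maximize `log_likelihood`; that step is left out here.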

Mutations in a population Wright-Fisher model ➔ Discrete, non-overlapping generations ➔ Constant population size ➔ Each individual in one generation is a random copy of an individual from the previous generation [Figure label: N_e individuals]
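
The three assumptions above are enough to simulate the model. The sketch below traces sampled lineages backwards in time until they share a common ancestor. The function name and the small population size are hypothetical, and `n_copies` stands for the number of gene copies per generation (2N_e for a diploid population).

```python
import random

def generations_to_mrca(n_samples, n_copies, rng):
    """Trace n_samples lineages back through Wright-Fisher generations.

    Each lineage picks its parent uniformly at random among the n_copies
    gene copies of the previous generation. Lineages that pick the same
    parent coalesce. Returns the number of generations until one lineage remains.
    """
    lineages = n_samples
    generations = 0
    while lineages > 1:
        parents = {rng.randrange(n_copies) for _ in range(lineages)}
        lineages = len(parents)
        generations += 1
    return generations

rng = random.Random(1)
n_copies = 200  # small hypothetical population so the simulation runs quickly
times = [generations_to_mrca(2, n_copies, rng) for _ in range(2000)]
print(sum(times) / len(times))  # for two lineages the expected value is n_copies
```

For two lineages the waiting time is geometric with success probability 1/n_copies, which is where the exponential waiting times of the coalescent come from when the population is large.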

Populations and species [Figure: lineages of individuals traced back within populations; some lineages join at the population join time, others join much later than the populations]

Populations and species A funny thing happens if the population size is large relative to the splitting time... For speciation, the time is too long... and no human is more closely related to chimps than any other human!

Populations and species ● With k lineages, the time until the next coalescence is exponentially distributed with rate k(k-1)/2, in units of 2N_e generations. ➔ Generation time ~25 years ➔ N_e for humans ~10000
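
Plugging the numbers from this slide into the exponential waiting times gives a feel for the time scale. The sketch below sums the mean waiting times from k lineages down to one; the function name is hypothetical.

```python
def expected_tmrca_years(k, n_e, generation_years):
    """Expected time for k lineages to reach one common ancestor, in years.

    While j lineages remain, the waiting time to the next coalescence is
    exponential with rate j(j-1)/2 in units of 2*n_e generations,
    so its mean is 2/(j(j-1)) of those units.
    """
    unit_years = 2 * n_e * generation_years
    coalescent_units = sum(2.0 / (j * (j - 1)) for j in range(2, k + 1))
    return coalescent_units * unit_years

print(expected_tmrca_years(2, 10_000, 25))     # 500000.0
print(expected_tmrca_years(1000, 10_000, 25))  # close to 1 million years
```

Even for a large sample the expected time stays below 2 x 2N_e generations, about a million years with these values, which is well short of a speciation time of several million years.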

Populations and species ➔ We will all have coalesced before we meet the chimps [Figure labels: Humans, Chimps]

Recombination ● But there is recombination! ➔ Breaks up DNA into segments with separate histories ➔ A single extant sequence reaches an equilibrium back in time ➔ We will have several segments meeting the chimp!

Recombination [Figure: divergence time plotted against physical position along the genome, with the genome and species divergence times marked. Adapted from Patterson et al. 2006]

Species and segment trees Human Chimp Gorilla Case A: Only Human and Chimp can coalesce here. Always consistent with species tree. Case B: All species can coalesce here. Only a third consistent with species tree. We can get the probability of A or B based on split times and population size. [Figure labels: t1, t1+t2, N_HC, N_HCG]

Species and segment trees We can express the probability of either of these trees in terms of splitting times and effective population sizes... or alternatively get the time parameters from the tree probabilities.
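
Both directions can be sketched in a few lines. The code assumes the standard coalescent result that two lineages coalesce at rate 1 per 2N generations, so the chance that the human and chimp lineages coalesce within the ancestral human-chimp population (case A) depends only on t2, N_HC and the generation time. The function names, the state label B2 and the example numbers are assumptions, not taken from the slides.

```python
import math

def tree_probabilities(t2_years, n_hc, generation_years=25):
    """Probabilities of case A and of the three equally likely case B trees.

    t2_years is the time between the two speciation events and n_hc the
    effective size of the ancestral human-chimp population.
    """
    tau = t2_years / (2 * n_hc * generation_years)  # t2 in coalescent units
    p_a = 1.0 - math.exp(-tau)   # human and chimp coalesce before reaching gorilla
    p_b_each = (1.0 - p_a) / 3   # otherwise all three topologies are equally likely
    return {"A": p_a, "B1": p_b_each, "B2": p_b_each, "B3": p_b_each}

def t2_from_prob_a(p_a, n_hc, generation_years=25):
    """Invert the relation above: recover t2 (in years) from the probability of case A."""
    return -2 * n_hc * generation_years * math.log(1.0 - p_a)

# Hypothetical parameter values, for illustration only.
probs = tree_probabilities(t2_years=2_000_000, n_hc=50_000)
print(probs)
print(t2_from_prob_a(probs["A"], n_hc=50_000))
```

Note that the inversion needs a value for the ancestral population size; with tree probabilities alone, only the ratio of t2 to N_HC is identified in this simplified setting.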

Species and segment trees We obtain the tree probabilities from an approximation to the coalescence process: a hidden Markov model

Markov models A Markov model is an automaton where transitions are probabilistic, i.e. each transition to the next state is taken with a certain probability. A run is a sequence of states generated by the model.

Hidden Markov models (HMMs) A hidden Markov model in addition has emission symbols and probabilities. In each state it emits symbols with certain probabilities. A run is a sequence of both states and emissions. We only observe the emissions.
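
To make the definition concrete, here is a minimal sketch that generates a run from a small HMM. The two states, the emission symbols and every probability are made-up toy values, and `sample_run` is a hypothetical helper.

```python
import random

def sample_run(initial, transitions, emissions, length, rng):
    """Generate a run: a list of hidden states and the list of symbols they emit."""
    states, symbols = [], []
    state = rng.choices(list(initial), weights=list(initial.values()))[0]
    for _ in range(length):
        states.append(state)
        emit = emissions[state]
        symbols.append(rng.choices(list(emit), weights=list(emit.values()))[0])
        nxt = transitions[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return states, symbols

# Toy model: all numbers are invented for illustration.
initial = {"A": 0.5, "B": 0.5}
transitions = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
emissions = {"A": {"same": 0.95, "diff": 0.05}, "B": {"same": 0.80, "diff": 0.20}}

states, symbols = sample_run(initial, transitions, emissions, 10, random.Random(0))
print(states)   # hidden: not available to an observer
print(symbols)  # observed emissions
```

Inference goes the other way: given only `symbols`, the task is to reason about `states` and about the model parameters.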

A coalescence HMM ● States correspond to the four trees ● Emission probabilities are computed using the dynamic programming algorithm ➔ Branch lengths are mean branch lengths for each tree type

A coalescence HMM [Animation: a run of the model built up along the alignment, one hidden state per column, ending with the state sequence A B1 B1 B1 B3 B3 B3 A A A; each state emits a column of nucleotides (a, c, g, t)]

Inference in the CoalHMM ● We use standard algorithms for estimating the HMM parameters ➔ Transition probabilities give us: ➔ State probabilities, which give us: ➔ Coalescence process parameters, which give us: ➔ Speciation times!

Results [Figure: estimates of the time parameters t1 and t2]

Annotation in the CoalHMM Posterior probabilities of each tree (probability of being in a particular tree given the sequence data) along a genomic region:
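
Posterior probabilities of this kind are usually computed with the forward-backward algorithm. The sketch below is a generic scaled implementation, not the authors' code; the two-state toy model, its symbols and its probabilities are invented, whereas the actual CoalHMM has four tree states and emits alignment columns. It assumes a non-empty observation sequence.

```python
def posterior_decoding(obs, states, initial, trans, emit):
    """Return, for each position, a dict of P(state | all observations)."""
    n = len(obs)
    forward, scales, prev = [], [], None
    for t, symbol in enumerate(obs):
        row = {}
        for s in states:
            if t == 0:
                reach = initial[s]
            else:
                reach = sum(prev[r] * trans[r][s] for r in states)
            row[s] = reach * emit[s][symbol]
        scale = sum(row.values())  # rescale each column to avoid underflow
        row = {s: v / scale for s, v in row.items()}
        forward.append(row)
        scales.append(scale)
        prev = row
    backward = [None] * n
    backward[-1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        backward[t] = {
            s: sum(trans[s][r] * emit[r][obs[t + 1]] * backward[t + 1][r]
                   for r in states) / scales[t + 1]
            for s in states
        }
    return [{s: forward[t][s] * backward[t][s] for s in states} for t in range(n)]

# Toy model: all numbers are invented for illustration.
states = ["A", "B"]
initial = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"same": 0.95, "diff": 0.05}, "B": {"same": 0.80, "diff": 0.20}}

for column in posterior_decoding(["same", "same", "diff", "diff", "same"],
                                 states, initial, trans, emit):
    print(column)
```

Plotting one such probability per state along the genome gives the kind of annotation shown on the slide.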

Summary ● Dating the speciation between humans, chimps, and gorillas ➔ Expressed as a molecular evolution / population genetics problem ➔ Approximated by a machine learning / statistical model (HMM) ➔ Estimated HMM parameters give us speciation times ● Puts the human-chimp speciation ~4 My ago and the (human-chimp)-gorilla speciation ~5-6 My ago

The end Thank you!

Cheers! See you in the bar!