Molecular Evolution: Plan for week Monday 3.11: Basics of Molecular Evolution Lecture 1: 9-10.30 Molecular Basis and Models I (JH) Computer : 11-12.30.


1 Molecular Evolution: Plan for week

Monday 3.11: Basics of Molecular Evolution
  Lecture 1: 9-10.30 Molecular Basis and Models I (JH)
  Computer: 11-12.30 PAUP: Distance/Parsimony/Compatibility (JH/IH)
  Lecture 2: 13.30-15 Molecular Basis and Models II (JH)
  Lecture 3: 15.30-17 The Origin of Life (JH/Miklos)

Tuesday 4.11: Tree of Life
  Lecture 1: 9-10.30 Molecular Evolution of Eukaryote Pathogens (Day/Barry)
  Lecture 2: 11-12.30 Molecular Evolution of Prokaryote Pathogens (Maiden)
  Computer: 13.30-15 Analysis of Viral Data (Taylor)
  Lecture 3: 15.30-17 Molecular Evolution of Virus (E. Holmes)

Wednesday 5.11: Stochastic Models of Evolution & Phylogenies
  Computer: 9-10.30 PAUP/Mr. Bayes: Likelihood (JH/IH)
  Lecture 1: 11-12.30 The Evolution of Protein Structures (Deane)
  Computer: 13.30-15 PAML: Testing Evolutionary Models (JH/Lyngsoe)
  Lecture 2: 15.30-17 Molecular Evolution & Function/Structure/Selection (Meyer)

Thursday 6.11: More Phylogenies
  Computer: 9-10.30 Molecular Evolution on the web (JH/Lyngsoe)
  Lecture 2: 11-12.30 Beyond Phylogenies: Networks & Recombination (Song/JH)
  Computer: 13.30-15 Beyond Phylogenies (Song)
  Lecture 3: 15.30-17 Molecular Evolution and the Genomes (JH/Lunter)

Friday 7.11: Results, Advanced Topics and article discussion
  Computer: 9-10.30 Statistical Alignment (JH/IM)
  Lecture: 11-12.30 Article Discussion/Presentation by students
  The Last Lunch

2 Two Discussion Articles
1. Timing the ancestor of the HIV-1 pandemic strains. Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn BH, Wolinsky S, Bhattacharya T. Science. 2000 Jun 9;288(5472):1789-96.
2. Sequencing and comparison of yeast species to identify genes and regulatory elements. Kellis, M., N. Patterson, M. Endrizzi & E. Lander. Nature May 15 2003 vol 423.241-

3 The Data & its growth.
1976/79 The first viral genome: MS2/ΦX174
1995 The first prokaryotic genome: H. influenzae
1996 The first unicellular eukaryotic genome: Yeast
1997 The first multicellular eukaryotic genome: C. elegans
2001 The human genome
2002 The mouse genome
1.5.03: Known: >1000 viral genomes, 96 prokaryotic genomes, 16 archaebacterial genomes.
A series of multicellular genomes is coming.
A general increase in data involving higher structures and dynamics of biological systems.

4 The Nucleotides http://www.accessexcellence.org/AB/GG/ Pyrimidines Purines Transversions Transitions

5 The Amino Acids/Codons/Genes http://www.accessexcellence.org/AB/GG/ {nucleotides}^3 → {amino acids, stop}

6 Major Application Areas of Molecular Evolution Phylogenies and Classification Rates of Evolution & The Molecular Clock Dating Functional Constraint – Negative Selection. Positive/Diversifying Selection Structure RNA Structure Gene Finding Homing in on Important Genes Homology Searches Disease Gene Mapping

7 The Tree (?) of Life
[Figure: tree from the Origin of Life through LUCA to Prokaryotes, Archaea and Eukaryotes (Plants, Fungi, Animals); Viruses ??]

8 Tree of Life. Science vol.300 June 2003

9 The Origin of Life
When did life originate? Is the present structure a necessity or is it a random accident? How frequent is life in the Universe?
"+": Self replication easy. Self assembly easy. Many extrasolar planets.
"-": Hard to make proper polymerisation. No convincing scenario. No testability.
Increased Origin Research: In preparation of future NASA expeditions. The rise of nano biology. The ability to simulate larger molecular systems.

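10 Central Principles of Phylogeny Reconstruction: Parsimony, Distance, Likelihood
[Figure: the same four-leaf tree (s1, s2, s3, s4) shown under each criterion. Parsimony: sequences TTCAGT, TCCAGT, GCCAAT with changes 0, 1, 1, 2, 0 on the branches, Total Weight: 4. Distance: values 1, 1, 2, 3, 2, 1. Likelihood: values 0.4, 0.6, 0.3, 0.7, 1.5, L = 3.1*10^-7, parameter estimates.]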

11 From Distance to Phylogenies
What is the relationship of a, b, c, d & e?

      a    b    c    d    e
a     -   22   10   22   22
b     6    -   22   16   14
c     7    3    -   22   22
d    13    9    8    -   16
e     6    8    9   15    -

Panel labels: Molecular clock; No molecular clock.
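One way to read the two triangles of this matrix is to test them against the three-point (ultrametric) and four-point (additive) conditions. The sketch below does that; the helper names are illustrative, and the interpretation that the upper triangle is the clock-like data set is an assumption drawn from the numbers, not something the slide states.

```python
from itertools import combinations

LABELS = "abcde"

# Upper triangle of the slide's matrix (read row by row).
UPPER = {("a", "b"): 22, ("a", "c"): 10, ("a", "d"): 22, ("a", "e"): 22,
         ("b", "c"): 22, ("b", "d"): 16, ("b", "e"): 14,
         ("c", "d"): 22, ("c", "e"): 22, ("d", "e"): 16}
# Lower triangle of the slide's matrix, stored with the same key order.
LOWER = {("a", "b"): 6, ("a", "c"): 7, ("a", "d"): 13, ("a", "e"): 6,
         ("b", "c"): 3, ("b", "d"): 9, ("b", "e"): 8,
         ("c", "d"): 8, ("c", "e"): 9, ("d", "e"): 15}


def dist(d, x, y):
    """Symmetric lookup in a dictionary keyed by ordered pairs."""
    return d[(x, y)] if (x, y) in d else d[(y, x)]


def is_ultrametric(d):
    """Three-point condition: in every triple the two largest distances are equal."""
    for x, y, z in combinations(LABELS, 3):
        s = sorted([dist(d, x, y), dist(d, x, z), dist(d, y, z)])
        if s[1] != s[2]:
            return False
    return True


def is_additive(d):
    """Four-point condition: of the three pairwise sums, the two largest are equal."""
    for w, x, y, z in combinations(LABELS, 4):
        s = sorted([dist(d, w, x) + dist(d, y, z),
                    dist(d, w, y) + dist(d, x, z),
                    dist(d, w, z) + dist(d, x, y)])
        if s[1] != s[2]:
            return False
    return True


if __name__ == "__main__":
    print("upper: ultrametric", is_ultrametric(UPPER), "additive", is_additive(UPPER))
    print("lower: ultrametric", is_ultrametric(LOWER), "additive", is_additive(LOWER))
```

An ultrametric matrix fits a rooted tree with a molecular clock; a matrix that is additive but not ultrametric still fits a tree, but only without a clock.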

12 Enumerating Trees: Unrooted & valency 3
[Figure: the unrooted trees with leaves labelled 1 to 3, 1 to 4 and 1 to 5.]

n      4    5    6     7     8       9          10         15          20
T_n    3    15   105   945   10395   1.4*10^5   2.0*10^6   7.9*10^12   2.2*10^20

Recursion: T_n = (2n-5) T_{n-1}
Initialisation: T_1 = T_2 = T_3 = 1
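The recursion on this slide is easy to check numerically. A minimal sketch (the function name is illustrative):

```python
def unrooted_binary_trees(n):
    """Number of unrooted trees with n labelled leaves and all internal nodes of valency 3.

    Uses T_n = (2n - 5) * T_{n-1} with T_1 = T_2 = T_3 = 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5  # leaf k can be attached to any of the 2k - 5 existing branches
    return count


if __name__ == "__main__":
    for n in (4, 5, 6, 7, 8, 9, 10, 15, 20):
        print(n, unrooted_binary_trees(n))
```

The rapid growth of T_n is the reason exhaustive search over tree space is only feasible for a handful of sequences.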

13 Heuristic Searches in Tree Space
Nearest Neighbour Interchange. Subtree regrafting. Subtree rerooting and regrafting.
[Figure: the rearrangements illustrated on a tree with subtrees T1, T2, T3, T4 and on a tree with leaves s1 to s6 and subtrees T3, T4.]

14 Assignment to internal nodes: The simple way.
[Figure: tree with leaves C A C C A C T G and unknown internal nodes ? ? ? ? ? ?]
What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function d(N1,N2)? If there are k leaves, there are k-2 internal nodes and 4^(k-2) possible assignments of nucleotides. For k=22, this is more than 10^12.

15 5S RNA Alignment & Phylogeny (Hein, 1990)
10 tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcgaacttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-gggggccct-gcggaaaaatagctcgatgccagga--ta
17 t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaacttggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagcccg-atggaaaaatagctcgatgccagga--t-
9 t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaacttggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagcccg-atggaaaaatagctcgacgccagga--t-
14 t----ctggtggccatggcgtagaggaaacaccccatcccataccgaactcggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagcccg-ctgggaaaataggacgctgccag-a--t-
3 t----ctggtgatgatggcggaggggacacacccgttcccataccgaacacggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcagggag-ccgggagagtaggacgtcgccag-g--c-
11 t----ctggtggcgatggcgaagaggacacacccgttcccataccgaacacggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtccg-ctgggagagtaggacgctgccag-g--c-
4 t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaacacggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtccc-ctgtgagagtaggacgctgccag-g--c-
15 g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaactcggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagaccgcctgggaaacctggatgctgcaag-c--t-
8 g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatctcggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacctcctgggaataccgggtgctgtagg-ct-t-
12 g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagaccgcctgggaatcctgggtgctgtagg-c--t-
7 g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatctggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacggcctgggaatcctggatgttgtaag-c--t-
16 g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagaccgcctgggaatcctgggtgctgtagg-c--t-
1 a----tccacggccataggactctgaaagcactgcatcccgt-ccgatctgcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggaccacgcgggaatcctgggtgctgt-gg-t--t-
18 a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatctgcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
2 a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatctgcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
5 g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacctcccgggaagtcctggtgccgcacc-c--c-
13 g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacctcctgggaagtcctgatgctgcacc-c--t-
6 g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacctcctgggaagtcctaatattgcacc-c-tt-
[Figure: phylogeny of the sequences, with node labels 9 11 10 6 8 7 5 4 3 1 2 17 16 15 14 13 12.]
Transitions 2, transversions 5. Total weight 843.

16 Cost of a history - minimizing over internal states
[Figure: a node with the four possible states A, C, G, T; the contribution of one choice is d(C,G) + w_C(left subtree).]

17 Cost of a history – leaves (initialisation). A C G T G A Empty Cost 0 Empty Cost 0 Initialisation: leaves Cost(N)= 0 if N is at leaf, otherwise infinity

18 Fitch-Hartigan-Sankoff Algorithm
For each node and each nucleotide, e.g. "C": the cost of the cheapest tree hanging from this node given there is a "C" at this node.
[Figure: tree with weights 2 and 5. Leaf cost vectors (A,C,G,T): (*, 0, *, *), (*, *, *, 0), (*, *, 0, *). Internal node cost vectors (A,C,G,T): (10,2,10,2) and (9,7,7,7).]
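The cost vectors on this slide can be reproduced with a short recursive version of the algorithm. This is a sketch under stated assumptions: transitions cost 2 and transversions cost 5 (the weights quoted on the 5S RNA slide), a leaf cost of "*" is treated as infinity, and the tree shape ((C,T),G) is inferred from the numbers rather than given in the transcript. The function names are illustrative.

```python
import math

NUCS = "ACGT"
PURINES = {"A", "G"}


def subst_cost(x, y, transition=2, transversion=5):
    """Symmetric distance d(x, y): 0 if equal, else transition or transversion weight."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


def sankoff(tree):
    """Return {nucleotide: cost of the cheapest subtree given that nucleotide at the node}.

    A leaf is a one-letter string; an internal node is a tuple (left, right).
    """
    if isinstance(tree, str):
        # Initialisation: cost 0 for the observed nucleotide, infinity otherwise.
        return {n: (0 if n == tree else math.inf) for n in NUCS}
    left, right = (sankoff(child) for child in tree)
    return {
        n: min(subst_cost(n, m) + left[m] for m in NUCS)
        + min(subst_cost(n, m) + right[m] for m in NUCS)
        for n in NUCS
    }


if __name__ == "__main__":
    print(sankoff(("C", "T")))         # cost vector above the leaves C and T
    print(sankoff((("C", "T"), "G")))  # cost vector one level further up
```

The minimum of the root vector is the parsimony cost of the tree; the work is linear in the number of nodes instead of exponential as in the brute-force enumeration of internal assignments.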

19 The Felsenstein Zone (Felsenstein, Cavender, 1979)
[Figure: True Tree on s1, s2, s3, s4 and Reconstructed Tree with a different grouping of s1, s2, s3, s4.]
Patterns (16, only 8 shown):
0 1 0 0 0 0 0 0
0 0 1 0 0 1 0 1
0 0 0 1 0 1 1 0
0 0 0 0 1 0 1 1

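20 Bootstrapping (Felsenstein, 1985)
[Figure: the columns of an alignment (example row ATCTGTAGTCT) are resampled with replacement; in the example the columns are drawn 1 0 2 3 0 1 0 1 2 0 1 times. A tree on leaves 1 2 3 4 is built from each resampled data set (?????????? for the unknown replicates), and a branch of the original tree is annotated with a count out of the replicates (figure numbers: 2, 1500).]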
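The resampling step of the bootstrap can be sketched as follows. Only the row ATCTGTAGTCT comes from the slide; the second row of the example alignment and the function name are hypothetical, and the tree-building step that would follow each resample is left out.

```python
import random
from collections import Counter


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: list of equal-length strings (one per sequence).
    Returns (resampled alignment, number of times each original column was drawn).
    """
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    counts = Counter(picks)
    weights = [counts.get(i, 0) for i in range(n_cols)]
    resampled = ["".join(seq[i] for i in picks) for seq in alignment]
    return resampled, weights


if __name__ == "__main__":
    example = ["ATCTGTAGTCT",   # row shown on the slide
               "ATCTGTAGTCA"]   # hypothetical second sequence
    resampled, weights = bootstrap_columns(example, random.Random(2003))
    print(weights)    # multiplicities, analogous to the slide's 1 0 2 3 0 1 0 1 2 0 1
    print(resampled)
```

Repeating this many times and rebuilding the tree on each pseudo-sample gives, for every branch of the original tree, the fraction of replicates in which it reappears.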

21 The Molecular Clock First noted by Zuckerkandl & Pauling (1964) as an empirical fact. How can one detect it? Known Ancestor, a, at Time t s1 s2 a Unknown Ancestors s1 s2 s3 ??

22 Rootings
Purpose: 1) To give time direction in the phylogeny & most ancient point. 2) To be able to define concepts such as a monophyletic group.
Methods:
1) Outgroup: Enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data.
2) Midpoint: Find the midpoint of the longest path in the tree.
3) Assume a Molecular Clock.

23 Rooting the 3 kingdoms
3 billion years ago: no reliable clock, no outgroup.
Given 2 sets of homologous proteins, e.g. MDH & LDH, can the archaea, prokaryotes and eukaryotes be rooted?
[Figure: unrooted tree of E, P, A with "Root??"; combined LDH/MDH tree in which the LDH copy of (E, P, A) and the MDH copy of (E, P, A) serve as outgroups for each other.]

24 Non-contemporaneous leaves.
[Figure, from Drummond: time axis with 1980, 1990, 2000; a contemporary sample with no time structure versus a serial sample with time structure.]
RNA viruses like HIV evolve fast enough that you can’t ignore the time structure.
(A. Rambaut (2000): Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 16.4.395-399)

25 HIV-1 (env) evolution in nine infected individuals. Shankarappa et al (1999). From Drummond.
[Figure: phylogeny with reference sequences HIV1U36148, HIV1U36015, HIV1U35980, HIV1U36073, HIV1U35926, HIVU95460 and patient clades Pt.1, Pt.2, Pt.3, Pt.5, Pt.6, Pt.7, Pt.8, Pt.9; Patient #6 from Wolinsky et al.; scale bar 10%. Plot of Viral Divergence (2% to 10%) against Years Post Seroconversion (0 to 10).]

26 A tree sampled from the posterior distribution of Shankarappa Patient. From Drummond.
[Figure: tree with Lineage A and Lineage B and a ‘ladder-like’ appearance.]
210 sequences collected over a period of 9.5 years.
660 nucleotides from env: C2-V5 region. Only the first 285 (no alignment ambiguities) were used in this analysis.
Effective population size and mutation rate were co-estimated using Bayesian MCMC: N_e = [4000, 6300], Mu = [0.8% – 1%] per site per year.

27 Models of Amino Acid, Nucleotide & Codon Evolution Amino Acids, Nucleotides & Codons Continuous Time Markov Processes Specific Models Special Issues Context Dependence Rate Variation

28 The Purpose of Stochastic Models. 1.Molecular Evolution is Stochastic. 2. To estimate evolutionary parameters, not observable directly: i. Real number of events in evolutionary history. ii. Rates of different kinds of events in evolutionary history. iii. Strength of selection against amino acid changing nucleotide substitutions. iv. Estimate importance of different biological factors. 3.Survive a goodness of fit test. 4. Serve these purposes as simply as possible.

29 ACGTC Central Problems: History cannot be observed, only end products. Comment: Even if History could be observed, the underlying process couldn’t ACGCC AGGCC AGGCT AGGTT ACGTC ACGCC AGGCC AGGCT AGGTT AGGGC AGTGC

30 Principle of Inference: Likelihood
Likelihood function L(): the probability of the data as a function of the parameters: L(θ,D).
LogLikelihood function l(): ln(L(θ,D)).
If the data is a series of independent experiments, L() will become a product of the Likelihoods of each experiment, and l() will become the sum of the LogLikelihoods of each experiment.
In Likelihood analysis the parameter is not viewed as a random variable.

31 Likelihood and logLikelihood of Coin Tossing From Edwards (1991) Likelihood
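A coin-tossing likelihood can be written down in a few lines. The sketch below uses made-up counts (7 heads, 3 tails), not the data from Edwards; it illustrates that the log-likelihood of independent experiments is the sum of their log-likelihoods and that the maximum lies at the observed frequency.

```python
import math


def log_likelihood(p, heads, tails):
    """l(p) = ln L(p, D) for independent tosses with P(head) = p (constant factor dropped)."""
    return heads * math.log(p) + tails * math.log(1.0 - p)


def likelihood(p, heads, tails):
    """L(p, D) = p^heads * (1 - p)^tails."""
    return math.exp(log_likelihood(p, heads, tails))


def maximum_likelihood_estimate(heads, tails):
    """The value of p that maximises L: the observed frequency of heads."""
    return heads / (heads + tails)


if __name__ == "__main__":
    heads, tails = 7, 3  # hypothetical data
    for p in (0.3, 0.5, 0.7, 0.9):
        print(p, likelihood(p, heads, tails), log_likelihood(p, heads, tails))
    print("MLE:", maximum_likelihood_estimate(heads, tails))
```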

32 Principle of Inference: Bayesian Analysis
In Bayesian Analysis the parameters are viewed as stochastic variables that have a prior distribution before the data are observed. The data depend on the parameters, and after observing the data the parameters have a posterior distribution.

33 Simplifying Assumptions I
Data: s1=TCGGTA, s2=TGGTT
1) Only substitutions.
[Figure: s1 TCGGTA / s2 TGGT-T is replaced by the gap-free pair s1 TCGGA / s2 TGGTT.]
2) Processes in different positions of the molecule are independent, so the probability for the whole alignment will be the product of the probabilities of the individual patterns.
[Figure: Biological setup: unknown ancestor a = a1 a2 a3 a4 a5 with descendants TGGTT and TCGGA; Probability of Data.]

34 Simplifying Assumptions II
3) The evolutionary process is the same in all positions.
4) Time reversibility: Virtually all models of sequence evolution are time reversible, i.e. π_i P_{i,j}(t) = π_j P_{j,i}(t), where π_i is the stationary probability of state i and P_{i,j}(t) is the probability that state i has changed into state j after time t. This implies that
  Σ_a π_a P_{a,N1}(l1) * P_{a,N2}(l2) = π_{N1} P_{N1,N2}(l1+l2),
so the unknown ancestor a can be summed out and the two branches replaced by a single branch of length l1+l2.
[Figure: ancestor a with branches l1 and l2 leading to N1 and N2, equal to a single branch of length l1+l2 between N1 and N2.]

35 Simplifying Assumptions III
5) The nucleotide at any position evolves following a continuous time Markov chain P_{i,j}(t) on the state space {A,C,G,T}.
6) The rate matrix, Q, for the continuous time Markov chain is the same at all times (and often all positions). However, it is possible to let the rate of events, r_i, vary from site to site; the term for elapsed time, t, is then replaced by r_i*t.

Q - rate matrix (rows FROM, columns TO):

FROM\TO  A                          C                          G                          T
A        -(q_A,C+q_A,G+q_A,T)       q_A,C                      q_A,G                      q_A,T
C        q_C,A                      -(q_C,A+q_C,G+q_C,T)       q_C,G                      q_C,T
G        q_G,A                      q_G,C                      -(q_G,A+q_G,C+q_G,T)       q_G,T
T        q_T,A                      q_T,C                      q_T,G                      -(q_T,A+q_T,C+q_T,G)

[Figure: a path C, C, A over the time intervals t1 and t2.]

36 Q and P(t)
What is the probability of going from i (C?) to j (G?) in time t with rate matrix Q?
i. P(0) = I.
ii. P(ε) is close to I + εQ for ε small.
iii. P'(0) = Q.
iv. lim P(t) for t → ∞ has the equilibrium frequencies of the 4 nucleotides in each row.
v. Waiting time in state j, T_j: P(T_j > t) = e^(q_jj t), where q_jj is negative.
vi. QE = 0, where E_ij = 1 (all i,j).
vii. PE = E.
viii. If AB = BA, then e^(A+B) = e^A e^B.
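The properties above can be checked numerically by computing P(t) as the matrix exponential of Qt. The sketch below is a plain-Python illustration (Taylor series with scaling and squaring, chosen to avoid external libraries); the function names are illustrative, and a Jukes-Cantor matrix with rate α = 1 is used only as a convenient test case.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_exp(q, t, terms=20, squarings=8):
    """P(t) = exp(Q t), via a Taylor series on Q t / 2^squarings followed by repeated squaring."""
    n = len(q)
    scaled = [[q[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def jukes_cantor_q(alpha):
    """Rate matrix with every off-diagonal entry alpha and rows summing to zero."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


if __name__ == "__main__":
    q = jukes_cantor_q(1.0)
    for t in (0.0, 0.1, 1.0, 10.0):
        p = mat_exp(q, t)
        print(t, [round(x, 4) for x in p[0]])  # first row drifts from (1,0,0,0) towards (1/4,...)
```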

37 Jukes-Cantor 69: Total Symmetry
Rate-matrix, R (rows FROM, columns TO):

FROM\TO  A     C     G     T
A        -3α   α     α     α
C        α     -3α   α     α
G        α     α     -3α   α
T        α     α     α     -3α

Transition prob. after time t, a = α*t:
P(equal) = ¼(1 + 3e^(-4a)) ~ 1 - 3a
P(diff.) = ¾(1 - e^(-4a)) ~ 3a
Stationary Distribution: (1,1,1,1)/4.

38 Geometric/Exponential Distributions
The Geometric Distribution on {0,1,..}, Geo(p): P{Z=j} = p^j (1-p), P{Z≥j} = p^j, E(Z) = p/(1-p).
The Exponential Distribution on R+, Exp(α): density f(t) = α e^(-αt), P(X>t) = e^(-αt).
Properties, for X ~ Exp(α) and Y ~ Exp(β) independent:
i. P(X>t2 | X>t1) = P(X>t2-t1) (t2 > t1): Markov (memoryless) process.
ii. E(X) = 1/α.
iii. P(Z≥t) = P(X>t) for integer t when p = e^(-α), so the geometric distribution approximates the exponential for small α.
iv. P(X < Y) = α/(α+β).
v. min(X,Y) ~ Exp(α+β).
[Figure: example distribution with mean 2.5.]

39 Comparison of Pairs of Nucleotides/Sequences
[Figure: ways of relating C and G: all evolutionary paths, the shortest path, and sample paths according to their probability. The same for the sequences CTACGT and GTATAT (all evolutionary paths). Example lineages: Higher Cells, Chimp, Mouse, Fish, E.coli, with sequences ATTGTGTATATAT….CAG and ATTGCGTATCTAT….CCG.]

40 From Q to P for Jukes-Cantor
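The step from Q to P(t) for Jukes-Cantor can be written out as below. This is a sketch assuming every off-diagonal rate equals α, with E the all-ones matrix and the rule e^(A+B) = e^A e^B for commuting matrices from the "Q and P(t)" slide; the original derivation on this slide is not preserved in the transcript.

```latex
\begin{aligned}
Q &= \alpha\,(E - 4I), \qquad E^{2} = 4E \;\Rightarrow\; E^{k} = 4^{k-1}E \quad (k \ge 1),\\[4pt]
e^{\alpha t E} &= I + \sum_{k \ge 1} \frac{(\alpha t)^{k}\,4^{k-1}}{k!}\,E
               \;=\; I + \tfrac14\left(e^{4\alpha t} - 1\right)E,\\[4pt]
P(t) = e^{Qt} &= e^{-4\alpha t I}\,e^{\alpha t E}
               \;=\; e^{-4\alpha t}\,I + \tfrac14\left(1 - e^{-4\alpha t}\right)E .
\end{aligned}
```

With a = αt, the diagonal entries are e^(-4a) + ¼(1 - e^(-4a)) = ¼(1 + 3e^(-4a)) and each off-diagonal entry is ¼(1 - e^(-4a)), so the probability of ending in any of the three other nucleotides is ¾(1 - e^(-4a)).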

41 Kimura 2-parameter model
Q (rows FROM, columns TO):

FROM\TO  A          C          G          T
A        -(α+2β)    β          α          β
C        β          -(α+2β)    β          α
G        α          β          -(α+2β)    β
T        β          α          β          -(α+2β)

a = α*t, b = β*t
[Figure: transitions (α) and transversions (β) leaving a start nucleotide; panels Q and P(t).]

42 Felsenstein 81 & Hasegawa, Kishino & Yano 85
Unequal base composition (Felsenstein, 1981):
  Q_{i,j} = C*π_j, i unequal j
Transition/transversion & composition bias (Hasegawa, Kishino & Yano, 1985):
  Q_{i,j} = (α/β)*C*π_j   if i->j is a transition
  Q_{i,j} = C*π_j         if i->j is a transversion

43 Dayhoff's empirical approach (1970)
Take a set of closely related proteins, count all differences and make a symmetric difference matrix, since time direction cannot be observed.
If q_ij = q_ji, then the equilibrium frequencies, π_i, are all the same. Under the transformation q_ij --> π_i q_ij / π_j, the equilibrium frequencies will be π_i.

44 Measuring Selection
Certain events have functional consequences and will be selected out. The strength and localization of this selection is of great interest.
The selection criteria could in principle be anything, but the selection against amino acid changes is without comparison the most important.
[Figure: a history of a two-codon sequence with translations, as labelled on the slide: ThrSer ACGTCA; Pro; ThrPro ACGCCA; ThrSer ACGCCG; ArgSer AGGCCG; ThrSer ACTCTG; AlaSer GCTCTG; AlaSer GCACTG.]

45 The Genetic Code
i. 3 classes of sites: 4, 2-2, 1-1-1-1.
Problems: i. Not all sites fit into those categories. ii. A change in one site can change the status of another.
[Figure: genetic code table with examples: 4 (3rd position), 1-1-1-1 (3rd position), T → A (2nd position).]

46 Possible events if the genetic code (remade from Li, 1997)

Substitutions            Number   Percent
Total in all codons      549      100
Synonymous               134      25
Nonsynonymous            415      75
  Missense               392      71
  Nonsense               23       4

Possible number of substitutions: 61 (codons) * 3 (positions) * 3 (alternative nucleotides).
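The counts in this table can be recomputed by enumerating every single-nucleotide change of every sense codon. The sketch below assumes the standard genetic code, written as the usual 64-letter string in TCAG order; the names are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitutions():
    """Count all single-nucleotide substitutions out of the 61 sense codons by effect."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, amino_acid in CODE.items():
        if amino_acid == "*":
            continue  # stop codons are not starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new = CODE[mutant]
                if new == "*":
                    counts["nonsense"] += 1
                elif new == amino_acid:
                    counts["synonymous"] += 1
                else:
                    counts["missense"] += 1
    return counts


if __name__ == "__main__":
    result = classify_substitutions()
    print(result, "total:", sum(result.values()))
```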

47 Synonymous (silent) & Non-synonymous (replacement) substitutions
Ser Thr Glu Met Cys Leu Met Gly Thr
TCA ACT GAG ATG TGT TTA ATG GGG ACG
(differences marked on the slide: *** * * * * * * **)
GGG ACA GGG ATA TAT CTA ATG GGT AGC
Ser Thr Gly Ile Tyr Leu Met Gly Ser
K_s: Number of Silent Events in Common History. K_a: Number of Replacement Events in Common History. N_s: Silent positions. N_a: Replacement positions.
Rates per pos: (K_s/N_s)/2T
Example: K_s = 100, N_s = 300, T = 10^8 years. Silent rate: (100/300)/(2*10^8) = 1.66*10^-9 /year/pos.
[Figure: the two paths between Thr ACG and Ser AGC, via Arg AGG or via Thr ACC.] Miyata: use the most silent path for calculations.

48 Kimura’s 2 parameter model & Li’s Model.
[Figure: transitions (α) and transversions (β) leaving a start nucleotide; panels Rates and Probabilities.]
Selection on the 3 kinds of sites, (a,b) → (?,?):
1-1-1-1: (f*α, f*β)
2-2: (α, f*β)
4: (α, β)

49 alpha-globin from rabbit and mouse.
Ser Thr Glu Met Cys Leu Met Gly Gly
TCA ACT GAG ATG TGT TTA ATG GGG GGA
(differences marked on the slide: * * * * * * * **)
TCG ACA GGG ATA TAT CTA ATG GGT ATA
Ser Thr Gly Ile Tyr Leu Met Gly Ile

Sites     Total   Conserved     Transitions   Transversions
1-1-1-1   274     246 (.8978)   12 (.0438)    16 (.0584)
2-2       77      51 (.6623)    21 (.2727)    5 (.0649)
4         78      47 (.6026)    16 (.2051)    15 (.1923)

Z(αt,βt) = .50[1 + exp(-2βt) - 2exp(-t(α+β))] (transition)
Y(αt,βt) = .25[1 - exp(-2βt)] (transversion)
X(αt,βt) = .25[1 + exp(-2βt) + 2exp(-t(α+β))] (identity)

L(observations,a,b,f) = C(429,274,77,78) * {X(a*f,b*f)^246 * Y(a*f,b*f)^12 * Z(a*f,b*f)^16} * {X(a,b*f)^51 * Y(a,b*f)^21 * Z(a,b*f)^5} * {X(a,b)^47 * Y(a,b)^16 * Z(a,b)^15}, where a = αt and b = βt.

Estimated Parameters: a = 0.3003, b = 0.1871, 2*b = 0.3742, (a + 2*b) = 0.6745, f = 0.1663

Sites     Transitions     Transversions
1-1-1-1   a*f = 0.0500    2*b*f = 0.0622
2-2       a = 0.3004      2*b*f = 0.0622
4         a = 0.3004      2*b = 0.3741

Expected number of: replacement substitutions 35.49, synonymous 75.93.
Replacement sites: 246 + (0.3742/0.6744)*77 = 314.72
Silent sites: 429 - 314.72 = 114.28
K_s = .6644, K_a = .1127

50 HIV2 Analysis
Hasegawa, Kishino & Yano Substitution Model Parameters:

α*t     β*t     π_A     π_C     π_G     π_T
0.350   0.105   0.361   0.181   0.236   0.222
0.015   0.005   0.004   0.003   0.003

Selection Factors:
GAG 0.385 (s.d. 0.030)
POL 0.220 (s.d. 0.017)
VIF 0.407 (s.d. 0.035)
VPR 0.494 (s.d. 0.044)
TAT 1.229 (s.d. 0.104)
REV 0.596 (s.d. 0.052)
VPU 0.902 (s.d. 0.079)
ENV 0.889 (s.d. 0.051)
NEF 0.928 (s.d. 0.073)

Estimated Distance per Site: 0.194

51 Examples of rates (remade from Li, 1997)

Organism          Gene            Syno/year     Non-Syno/year
RNA Virus
Influenza A       Hemagglutinin   13.1*10^-3    3.6*10^-3
Hepatitis C       E               6.9*10^-3     0.3*10^-3
HIV 1             gag             2.8*10^-3     1.7*10^-3
DNA virus
Hepatitis B       P               4.6*10^-5     1.5*10^-5
Herpes Simplex    Genome          3.5*10^-8
Nuclear Genes
Mammals           c-mos           5.2*10^-9     0.9*10^-9
Mammals           α-globin        3.9*10^-9     0.6*10^-9
Mammals           histone 3       6.2*10^-9     0.0

52 Codon based Models (Goldman, Yang + Muse, Gaut)
i. Codons as the basic unit.
ii. A codon based matrix would have (61*61)-61 (= 3660) off-diagonal entries.
Features to account for: i. Bias in nucleotide usage. ii. Bias in codon usage. iii. Bias in amino acid usage. iv. Synonymous/non-synonymous distinction. v. Amino acid distance. vi. Transition/transversion bias.
If codon i and codon j differ by one nucleotide, then
  q_{i,j} = α p_j exp(-d_{i,j}/V)   if they differ by a transition
  q_{i,j} = β p_j exp(-d_{i,j}/V)   if they differ by a transversion.
d_{i,j} is a physico-chemical difference between amino acid i and amino acid j. V is a factor that reflects the variability of the gene involved.

53 Rate variation between sites: iid each site
i) The rate at each position is drawn independently from a distribution, typically a Γ (or lognormal) distribution. Γ(α,β) has density α^β x^(β-1) e^(-αx) / Γ(β), where α is called the scale parameter and β the form parameter.
Let L(p_i, θ, t) be the likelihood for observing the i'th pattern, t all time lengths, θ the parameters describing the process and f(r_i) the continuous distribution of rate(s). Then
  L = Π_i ∫ L(p_i, θ, r_i*t) f(r_i) dr_i.

54 Rate variation between sites: Hidden Markov Chains
1) Different positions in the molecule evolve at different rates, for instance fast, r_F, or slow, r_S.
2) Neighbouring positions tend to evolve at the same rate.
[Figure: observed columns O1 O2 O3 O4 O5 O6 O7 O8 O9 O10, each with a hidden rate state F or S.]
What is the probability of the data? What is the most probable ”hidden” configuration? What is the probability of a specific ”hidden” state?
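The first of these questions, the probability of the data, is answered by the forward algorithm. The sketch below is illustrative only: the per-column likelihoods under the fast and slow rate, the switching probability and the initial distribution are all made-up numbers, and the brute-force sum over every hidden path is included purely as a check.

```python
from itertools import product


def forward(site_lik, trans, init):
    """Probability of the data, summed over all hidden rate paths.

    site_lik[i][s]: likelihood of column i given hidden state s (0 = fast, 1 = slow).
    trans[r][s]:    probability that the next position is in state s given state r.
    init[s]:        probability of state s at the first position.
    """
    states = range(len(init))
    f = [init[s] * site_lik[0][s] for s in states]
    for col in site_lik[1:]:
        f = [col[s] * sum(f[r] * trans[r][s] for r in states) for s in states]
    return sum(f)


def brute_force(site_lik, trans, init):
    """Same quantity by explicit enumeration of every hidden configuration (exponential cost)."""
    total = 0.0
    for path in product(range(len(init)), repeat=len(site_lik)):
        p = init[path[0]] * site_lik[0][path[0]]
        for i in range(1, len(path)):
            p *= trans[path[i - 1]][path[i]] * site_lik[i][path[i]]
        total += p
    return total


if __name__ == "__main__":
    # Hypothetical likelihoods of five alignment columns under the fast and slow rate.
    site_lik = [(0.02, 0.10), (0.03, 0.12), (0.08, 0.01), (0.07, 0.02), (0.02, 0.09)]
    trans = [[0.9, 0.1], [0.1, 0.9]]  # neighbours usually share the same rate class
    init = [0.5, 0.5]
    print(forward(site_lik, trans, init), brute_force(site_lik, trans, init))
```

Replacing the sums by maxima in the same recursion gives the most probable hidden configuration.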

55 Statistical Test of Models (Goldman, 1990)
Data: 3 sequences of length L
  ACGTTGCAA...
  AGCTTTTGA...
  TCGTTTCGA...
A. Likelihood (free multinomial model, 63 free parameters):
  L1 = p_AAA^#AAA * ... * p_AAC^#AAC * ... * p_TTT^#TTT, where p_N1N2N3 = #(N1 N2 N3)/L
B. Jukes-Cantor and unknown branch lengths (l1, l2, l3):
  L2 = p_AAA(l1',l2',l3')^#AAA * ... * p_TTT(l1',l2',l3')^#TTT
Test statistics: I. Σ (expected-observed)^2/expected, or II. -2 lnQ = 2(lnL1 - lnL2).
JC69 Jukes-Cantor: 3 parameters => χ2 with 60 d. of freedom.
Problems: i. Too few observations per pattern. ii. Many competing hypotheses.
Parametric bootstrap: i. Maximum likelihood to estimate the parameters. ii. Simulate with the estimated model. iii. Make the simulated distribution of -2 lnQ. iv. Where is the real -2 lnQ in this distribution?

56 Episodic Evolution
Poisson Process: i. The waiting times T_i are independent, exponentially distributed with the same parameter (λ). ii. Variance and Mean of the number of events are both λ.
Empirical Observations: i. Variance/Mean > 1 (clumpy process) for non-synonymous events.
Possible explanations: i. Selective Avalanches. ii. Gene conversions from pseudogenes.

57 Assignment to internal nodes: The simple way.
[Figure: tree with leaves C A C C A C T G and unknown internal nodes ? ? ? ? ? ?]
If branch lengths and the evolutionary process are known, what is the probability of the nucleotides at the leaves?
Cctacggccatacca a ccctgaaagcaccccatcccgt
Cttacgaccatatca c cgttgaatgcacgccatcccgt
Cctacggccatagca c ccctgaaagcaccccatcccgt
Cccacggccatagga c ctctgaaagcactgcatcccgt
Tccacggccatagga a ctctgaaagcaccgcatcccgt
Ttccacggccatagg c actgtgaaagcaccgcatcccg
Tggtgcggtcatacc g agcgctaatgcaccggatccca
Ggtgcggtcatacca t gcgttaatgcaccggatcccat

58 Probability of leaf observations - summing over internal states
[Figure: a node with the four possible states A, C, G, T; the contribution of one choice is P(C→G) * P_C(left subtree).]
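The sum over internal states follows the same recursion as the parsimony cost, with minima replaced by sums and costs by probabilities. The sketch below assumes the Jukes-Cantor transition probabilities with a = α*t as the branch parameter and a uniform root distribution; the tree encoding and function names are illustrative.

```python
import math

NUCS = "ACGT"


def jc_prob(x, y, a):
    """Jukes-Cantor probability of x ending as y along a branch with a = alpha * t."""
    e = math.exp(-4.0 * a)
    return 0.25 * (1.0 + 3.0 * e) if x == y else 0.25 * (1.0 - e)


def partial(tree):
    """Return {state: P(leaves below this node | state at this node)}.

    A leaf is a one-letter string; an internal node is ((left, a_left), (right, a_right)).
    """
    if isinstance(tree, str):
        return {n: (1.0 if n == tree else 0.0) for n in NUCS}
    (left, a_left), (right, a_right) = tree
    p_left, p_right = partial(left), partial(right)
    return {
        n: sum(jc_prob(n, m, a_left) * p_left[m] for m in NUCS)
        * sum(jc_prob(n, m, a_right) * p_right[m] for m in NUCS)
        for n in NUCS
    }


def likelihood(tree):
    """Probability of the leaf nucleotides, averaging the root over the stationary distribution."""
    return sum(0.25 * p for p in partial(tree).values())


if __name__ == "__main__":
    cherry = (("C", 0.1), ("T", 0.1))
    print(likelihood(((cherry, 0.05), ("G", 0.2))))
```

For a two-leaf tree the result equals π_N1 * P_{N1,N2}(l1+l2), which is the time-reversibility identity from "Simplifying Assumptions II".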

59 Output from Likelihood Method.
[Figure: two trees on s1, s2, s3, s4, s5. Molecular Clock tree: time axis from Duplication Times to Now, n-1 heights estimated. No Molecular Clock tree: Amount of Evolution, 2n-3 lengths estimated.]
Molecular Clock: Likelihood 7.9*10^-14; parameter estimates 0.31, 0.18.
No Molecular Clock: Likelihood 6.2*10^-12; parameter estimates 0.34, 0.16.
Estimates shown on the trees: 23 ±5.2, 12 ±2.2, 11.1 ±1.8, 5.9 ±1.2, 6.9 ±1.3, 11.4 ±1.9, 3.9 ±0.8, 10.9 ±2.1, 9.9 ±1.2, 11.6 ±2.1, 4.1 ±0.7.
Test: -2[ln(7.9*10^-14) - ln(6.2*10^-12)] is χ2-distributed with (n-2) degrees of freedom.
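The clock test on this slide can be carried through with the two likelihoods it quotes. The sketch assumes that 7.9*10^-14 belongs to the clock model (the nested model cannot have the higher likelihood) and that n = 5 sequences, so that (2n-3) - (n-1) = 3 degrees of freedom; the closed-form tail probability is specific to 3 degrees of freedom, and the function names are illustrative.

```python
import math


def lrt_statistic(lik_null, lik_alt):
    """Twice the log-likelihood difference between the general and the nested model."""
    return 2.0 * (math.log(lik_alt) - math.log(lik_null))


def chi2_sf_df3(x):
    """Upper tail probability of a chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


if __name__ == "__main__":
    clock, no_clock = 7.9e-14, 6.2e-12  # likelihoods quoted on the slide
    stat = lrt_statistic(clock, no_clock)
    print("statistic:", stat)
    print("p-value (3 d.f.):", chi2_sf_df3(stat))
```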

60 The generation/year-time clock (Langley-Fitch, 1973)
[Figure: three-species tree on s1, s2, s3 with branch lengths l1, l2, l3, unrooted and rooted by some rooting technique; clock constraint l1 = l2, shown as {l1 = l2 < l3}. Panels: Absolute Time Clock and Generation Time Clock, with the example Elephant and Mouse, 100 Myr, and the labels variable and constant.]

61 The generation/year-time clock (Langley-Fitch, 1973)
Can the generation time clock be tested?
Assume a data set: 3 species, 2 sequences each.
[Figure: trees on s1, s2, s3 for each sequence; under the Generation Time Clock any tree is possible.]

62 The generation/year-time clock (Langley-Fitch, 1973)
[Figure: trees on s1, s2, s3 with branch lengths l1, l2, l3 for one sequence and c*l1, c*l2, c*l3 for the other; clock tree with l1 = l2 and l3.]
Degrees of freedom (dg):
k=3: dg: 3 and dg: 2.
k: dg: 2k-3 and dg: k-1.
k=3, t=2: dg = 4.
k, t: dg = (2k-3)+(t-1).

63 α- and β-globin, cytochrome c, fibrinopeptide A & generation time clock (Langley-Fitch, 1973)
[Figure: Fibrinopeptide A phylogeny of Human, Gorilla, Gibbon, Monkey, Rabbit, Rat, Donkey, Horse, Pig, Llama, Cow, Goat, Sheep, Dog.]
Relative rates:
α-globin 0.342
β-globin 0.452
cytochrome c 0.069
fibrinopeptide A 0.137

64 Almost Clocks
I Smoothing a non-clock tree onto a clock tree (Sanderson).
II Rate of Evolution of the rate of Evolution (Thorne et al.). The rate of evolution can change at each bifurcation.
III Relaxed Molecular Clock (Huelsenbeck et al.). At random points in time, the rate changes by multiplication with a random variable (gamma distributed).
Comment: Makes perfect sense. Testing no clock versus perfect clock is choosing between two unrealistic extremes.
References: MJ Sanderson (1997) “A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy” Mol.Biol.Evol. 14.12.1218-31; J.L. Thorne et al. (1998) “Estimating the Rate of Evolution of the Rate of Evolution.” Mol.Biol.Evol. 15(12).1647-57; JP Huelsenbeck et al. (2000) “A compound Poisson Process for Relaxing the Molecular Clock” Genetics 154.1879-92.

65 Summary
Phylogeny:
  Principles of Phylogenies
  Rates of Molecular Evolution and the Molecular Clock
  Rooting Phylogenies
  The Generation Time Clock
  Almost Clocks
  Non-Contemporaneous Leaves (Viruses & Ancient DNA)
Stochastic Models:
  The Purpose of Stochastic Models
  The Assumptions of Stochastic Models
  The Central Models
  Measuring Selection
  Variation among sites
  Testing Models

66 History of Phylogenetic Methods & Stochastic Models
1958 Sokal and Michener publish the UPGMA method for making distance trees with a clock.
1964 Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.
1962-65 Zuckerkandl and Pauling introduce the notion of a Molecular Clock.
1967 First large molecular phylogenies by Fitch and Margoliash.
1969 Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.
1969 Jukes and Cantor propose a simple model for nucleotide evolution.
1970 Neyman analyzes a three-sequence stochastic model with Jukes-Cantor substitution.
1971-73 Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.
1973 Sankoff treats alignment and phylogenies as one general problem: phylogenetic alignment.
1979 Cavender and Felsenstein independently come up with the same evolutionary model where parsimony is inconsistent. Later called the “Felsenstein Zone”.
1979 Kimura introduces transition/transversion bias in a nucleotide model in response to the publication of mitochondrial sequences.
1981 Felsenstein: Maximum Likelihood Model & Program DNAML (in the program package PHYLIP). Simple nucleotide model with equilibrium bias.

67 1981 Parsimony tree problem is shown to be NP-Complete.
1985 Felsenstein introduces bootstrapping as confidence interval on phylogenies.
1985 Hasegawa, Kishino and Yano combine transition/transversion bias with unequal equilibrium frequencies.
1986 Bandelt and Dress introduce split decomposition as a generalization of trees.
1985- Many authors (Sawyer, Hein, Stephens, M. Smith) try to address the problem of recombinations in phylogenies.
1991 Gillespie’s book proposes “lumpy” evolution.
1994 Goldman & Yang + Muse & Gaut introduce codon based models.
1997-9 Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.
2000 Rambaut (and others) make methods that can find trees with non-contemporaneous leaves.
2000 Complex Context Dependent Models by Jensen & Pedersen. Dinucleotide and overlapping reading frames.
2001- Major rise in the interest in phylogenetic statistical alignment.
2001- Comparative genomics underlines the functional importance of molecular evolution.

68 References: Books & Journals
Books
Joseph Felsenstein, “Inferring Phylogenies”, 660 pages, Sinauer 2003. Excellent: focus on methods and conceptual issues.
Masatoshi Nei, Sudhir Kumar, “Molecular Evolution and Phylogenetics”, 336 pages, Oxford University Press Inc, USA 2000.
R.D.M. Page, E. Holmes, “Molecular Evolution: A Phylogenetic Approach”, 352 pages, Blackwell Science (UK) 1998.
Dan Graur, Wen-Hsiung Li, “Fundamentals of Molecular Evolution”, 439 pages, Sinauer Associates Incorporated 1999.
Margulis, L and K.V. Schwartz (1998), “Five Kingdoms”, 500 pages, Freeman. A grand illustrated tour of the tree of life.
Semple, C and M. Steel, “Phylogenetics”, 230 pages, Oxford University Press 2002. Very mathematical.
Journals
Journal of Molecular Evolution: http://www.nslij-genetics.org/j/jme.html
Molecular Biology and Evolution: http://mbe.oupjournals.org/
Molecular Phylogenetics and Evolution: http://www.elsevier.com/locate/issn/1055-7903
Systematic Biology: http://systbiol.org/
J. of Classification: http://www.pitt.edu/~csna/joc.html

69 References: www-pages
Tree of Life on the WWW
http://tolweb.org/tree/phylogeny.html
http://www.treebase.org/treebase/
Software
http://evolution.genetics.washington.edu/phylip.html
http://paup.csit.fsu.edu/
http://morphbank.ebc.uu.se/mrbayes/
http://evolve.zoo.ox.ac.uk/beast/
http://abacus.gene.ucl.ac.uk/software/paml.html
Data & Genome Centres
http://www.ncbi.nih.gov/Entrez/
http://www.sanger.ac.uk

70 Next Classification of Viruses * Overhead with considerations model  > data. Example : HMM variation in rates, gamma rates. Example: Almost clock Example: Episodic clock Example: Bootstrapping. *

