Combining genes in phylogeny And How to test phylogeny methods … Tal Pupko Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences,


1 Combining genes in phylogeny And How to test phylogeny methods … Tal Pupko Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University talp@post.tau.ac.il

2 Multiple sequence alignment (vWF)
Rat      QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK
Rabbit   QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN
Gorilla  QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR
Cat      REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR

3 From sequences to a phylogenetic tree (VWF)
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG

4 Multiple multiple sequence alignment [Figure: the Rat/Rabbit/Gorilla/Cat alignment of slide 2 shown six times, standing for the alignments of several genes.]

5 Phylogenetic studies are now based on the analysis of multiple genes. Example: Murphy et al. (2001b), 19 nuclear genes + 3 mitochondrial genes (16,400 bp).

6 Consensus trees

7 Consensus tree A consensus tree summarizes information common to two or more trees. [Figure: three example trees on the taxa a, b, c, d, e.]

8 Strict consensus Strict consensus includes only those groups that occur in all the trees being considered. [Figure: three trees on the taxa a, b, c, d, e and their strict consensus.]

9 Problem: the split {ab} is found in 2 of the 3 trees, but the strict consensus does not show it. [Figure: three trees on the taxa a, b, c, d, e and their strict consensus.]

10 Majority-rule consensus Majority-rule consensus: splits that are found in the majority of the trees are shown. [Figure: three trees on the taxa a, b, c, d, e and their majority-rule consensus.]

11 The percentage of trees supporting each split is indicated. [Figure: the three trees and their majority-rule consensus, with split supports of 100 and 67.]
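The two consensus rules above can be sketched in a few lines. This is a minimal illustration, not the method of any particular program: each tree is represented as a set of clades (groups of taxa), and the three trees below are hypothetical, chosen only so that {a,b} appears in 2 of 3 trees as on slide 9.

```python
from collections import Counter


def clade_counts(trees):
    """Count in how many trees each clade (a frozenset of taxa) appears."""
    return Counter(clade for tree in trees for clade in tree)


def strict_consensus(trees):
    """Clades present in every tree."""
    n = len(trees)
    return {clade for clade, k in clade_counts(trees).items() if k == n}


def majority_rule_consensus(trees):
    """Clades present in more than half of the trees, with percent support."""
    n = len(trees)
    return {
        clade: round(100 * k / n)
        for clade, k in clade_counts(trees).items()
        if k > n / 2
    }


# Hypothetical rooted trees on taxa a-e, written as sets of clades.
tree1 = {frozenset("ab"), frozenset("de"), frozenset("cde")}
tree2 = {frozenset("ab"), frozenset("de"), frozenset("cde")}
tree3 = {frozenset("de"), frozenset("cde"), frozenset("bcde")}
trees = [tree1, tree2, tree3]

strict = strict_consensus(trees)
majority = majority_rule_consensus(trees)
```

With these inputs the strict consensus drops {a,b}, while the majority-rule consensus keeps it with 67% support next to the clades found in all trees (100%).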

12 Problem with majority-rule consensus [Figure: two trees on the taxa a, b, c, d, e for which the majority-rule consensus equals the strict consensus.] However, if we consider only {b,c,d}, then in both trees b and c are closer to each other than either is to d, and the consensus does not show this.

13 Adams consensus Adams consensus gives the subtrees that are common to all trees. It is useful when one or more sequences have unclear positions but a subset of the sequences has relationships that are common to all trees. [Figure: example trees on the taxa a, b, c, d, e and their Adams consensus.]

14 Networks A network is sometimes used to represent trees in which recombination occurred. [Figure: a network on the taxa a, b, c, d, e.]

15 Maximum Likelihood [Figure: a tree with labels A, C, X, S and branch lengths t1, t2, t3.]

16 Multiple genes analysis: concatenate analysis, e.g., Murphy et al. (2001)
Gene 1 + Gene 2 + Gene 3
Sp1: TCTGT…AACTCTTT…GAATCGTT…GCC
Sp2: TCTGC…GACTCGCT…GGAACGCT…CCC
Sp3: CTTAT…GATCTATT…GGAATATT…CGA
Sp4: CCTAT…GATCCATT…GGACCATT…CCA
[Figure: the concatenated alignment is analyzed with one evolutionary model, giving one tree for Sp1, Sp2, Sp3, Sp4.]

17 Multiple genes analysis: concatenate analysis, e.g., Murphy et al. (2001)
Gene 1
Sp1: TCTGT…AAC
Sp2: TCTGC…GAC
Sp3: CTTAT…GAT
Sp4: CCTAT…GAT
Gene 2
Sp1: TCTTT…GAA
Sp2: TCGCT…GGA
Sp3: CTATT…GGA
Sp4: CCATT…GGA
Gene 3
Sp1: TCGTT…GCC
Sp2: ACGCT…CCC
Sp3: ATATT…CGA
Sp4: CCATT…CCA
[Figure: each gene is shown with a tree for Sp1, Sp2, Sp3, Sp4 and the same evolutionary model.]
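As an illustration of the concatenation step, the sketch below joins per-gene alignments into one long alignment per species and records where each gene starts and ends. The sequences are made up for the example (the slide sequences are abbreviated), and the function name `concatenate` is a hypothetical helper, not part of any package.

```python
def concatenate(genes, species):
    """Join per-gene alignments end to end.

    genes   : list of dicts mapping species name -> aligned sequence
    species : list of species names, all present in every gene
    Returns the concatenated alignment and the (start, end) column
    range of each gene within it.
    """
    supermatrix = {sp: "" for sp in species}
    partitions = []
    start = 0
    for gene in genes:
        length = len(gene[species[0]])
        for sp in species:
            if len(gene[sp]) != length:
                raise ValueError("sequences within a gene must be aligned")
            supermatrix[sp] += gene[sp]
        partitions.append((start, start + length))
        start += length
    return supermatrix, partitions


# Made-up toy alignments for two species and three genes.
gene1 = {"Sp1": "TCTG", "Sp2": "TCTC"}
gene2 = {"Sp1": "AAC", "Sp2": "GAC"}
gene3 = {"Sp1": "GCCTA", "Sp2": "CCCTA"}

supermatrix, partitions = concatenate([gene1, gene2, gene3], ["Sp1", "Sp2"])
```

Keeping the partition boundaries makes it possible to go back from the concatenated alignment to per-gene analyses later.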

18 What are branch lengths? Branch lengths correspond to evolutionary distance: d = AA replacements/site = [AA replacements/(site*year)] * years = evolutionary rate * years

19 Multiple genes analysis: separate analysis, e.g., Nikaido et al. (2001)
Gene 1 (Evolutionary model 1)
Sp1: TCTGT…AAC
Sp2: TCTGC…GAC
Sp3: CTTAT…GAT
Sp4: CCTAT…GAT
Gene 2 (Evolutionary model 2)
Sp1: TCTTT…GAA
Sp2: TCGCT…GGA
Sp3: CTATT…GGA
Sp4: CCATT…GGA
Gene 3 (Evolutionary model 3)
Sp1: TCGTT…GCC
Sp2: ACGCT…CCC
Sp3: ATATT…CGA
Sp4: CCATT…CCA
[Figure: each gene has its own tree for Sp1, Sp2, Sp3, Sp4.]

20 Multiple genes analysis: number of parameters
Number of species = n; number of genes = g; number of parameters in the model = m
Analysis      Number of parameters   Example (n = 44, g = 22, m = 0)
Concatenate   m + (2n-3)             85
Separate      g*(m + (2n-3))         1870

21 Multiple genes analysis: number of parameters. Both an oversimplified model and over-parameterization may lead to wrong phylogenetic conclusions.

22 Multiple genes analysis: proportional analysis
Gene 1 (Evolutionary model 1)
Sp1: TCTGT…AAC
Sp2: TCTGC…GAC
Sp3: CTTAT…GAT
Sp4: CCTAT…GAT
Gene 2 (Evolutionary model 2)
Sp1: TCTTT…GAA
Sp2: TCGCT…GGA
Sp3: CTATT…GGA
Sp4: CCATT…GGA
Gene 3 (Evolutionary model 3)
Sp1: TCGTT…GCC
Sp2: ACGCT…CCC
Sp3: ATATT…CGA
Sp4: CCATT…CCA
[Figure: each gene has its own evolutionary model and a relative rate (Rate=1, Rate=0.5, Rate=1.5) applied to a tree for Sp1, Sp2, Sp3, Sp4.]
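A small sketch of the idea behind the proportional analysis: all genes share one set of branch lengths, and each gene multiplies them by its own relative rate. The branch lengths and the assignment of the rates 1, 0.5 and 1.5 to particular genes are assumptions made for the example.

```python
def gene_branch_lengths(shared_lengths, rate):
    """Scale the shared branch lengths by one gene's relative rate."""
    return {branch: length * rate for branch, length in shared_lengths.items()}


# Hypothetical shared branch lengths (replacements per site).
shared = {"Sp1": 0.10, "Sp2": 0.20, "internal": 0.04}

# Relative rates as on the slide; which gene gets which rate is assumed.
rates = {"gene_a": 1.0, "gene_b": 0.5, "gene_c": 1.5}

per_gene = {gene: gene_branch_lengths(shared, r) for gene, r in rates.items()}
```

Every gene keeps the same proportions between branches; only the overall scale differs, which is why one extra rate parameter per gene is enough.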

23 Multiple genes analysis: number of parameters
Number of species = n; number of genes = g; number of parameters in the model = m
Analysis       Number of parameters    Example (n = 44, g = 22, m = 0)
Concatenate    m + (2n-3)              85
Separate       g*(m + (2n-3))          1870
Proportional   g-1 + gm + (2n-3)       106
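The three parameter counts can be checked directly. The sketch below simply encodes the formulas from the slide (2n-3 is the number of branches of an unrooted tree with n species); the function names are hypothetical.

```python
def params_concatenate(n, g, m):
    """One model and one set of branch lengths for all genes."""
    return m + (2 * n - 3)


def params_separate(n, g, m):
    """Each gene has its own model and its own branch lengths."""
    return g * (m + (2 * n - 3))


def params_proportional(n, g, m):
    """Shared branch lengths, g-1 relative rates, one model per gene."""
    return (g - 1) + g * m + (2 * n - 3)


# Example from the slide: n = 44 species, g = 22 genes, m = 0.
n, g, m = 44, 22, 0
counts = {
    "concatenate": params_concatenate(n, g, m),
    "separate": params_separate(n, g, m),
    "proportional": params_proportional(n, g, m),
}
```

The proportional analysis adds only g-1 + (g-1)m parameters to the concatenate analysis, far fewer than the separate analysis.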

24 Aims of our study To compare 3 types of multiple-gene analysis (concatenate analysis, separate analysis, proportional analysis) on 3 protein datasets:
Mitochondrial dataset [56 species, 12 genes]
Nuclear dataset (“short genes”, based on the Murphy dataset) [46 species, 6 genes]
Nuclear dataset (“long genes”) [28 species, 4 genes]

25 Comparing topologies Morphological topology (based on McKenna and Bell, 1997). [Figure: tree with the groups Archonta, Glires, Ungulata, Carnivora, Insectivora, Xenarthra.]

26 Mitochondrial topology [Figure: tree with the groups Perissodactyla, Carnivora, Cetartiodactyla, Rodentia 1, Hedgehogs, Rodentia 2, Primates, Chiroptera, Moles + Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia.]

27 Nuclear topology (Madsenl tree) [Figure: tree with the groups Cetartiodactyla, Afrotheria, Chiroptera, Eulipotyphla, Glires, Xenarthra, Carnivora, Perissodactyla, Scandentia + Dermoptera, Pholidota, Primate.]

28 Comparing different models using the Akaike Information Criterion (AIC): the model that minimizes the AIC is considered the most appropriate model.

29 Results: the best multiple gene analysis (Mitochondrial tree, N-Gamma rate model)
Analysis       df     Ln(L)       AIC
Separate       1320   -89921.78   182483.55
Concatenate    121    -91188.71   182619.42
Proportional   132    -90999.30   182262.60
The proportional analysis is the best for the mitochondrial dataset.
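The AIC values on this slide can be reproduced from the log-likelihoods and the degrees of freedom, assuming the standard definition AIC = -2 Ln(L) + 2 df (the slides do not spell the formula out). The pairing of each Ln(L) with its df and analysis is also an assumption, taken to be the one under which this arithmetic matches the reported AIC values.

```python
def aic(log_likelihood, df):
    """Akaike Information Criterion: -2 ln L + 2 * (number of parameters)."""
    return -2.0 * log_likelihood + 2.0 * df


# Mitochondrial dataset, N-Gamma rate model: (Ln(L), df) per analysis.
fits = {
    "separate": (-89921.78, 1320),
    "concatenate": (-91188.71, 121),
    "proportional": (-90999.30, 132),
}

scores = {name: aic(lnl, df) for name, (lnl, df) in fits.items()}

# The preferred analysis is the one with the smallest AIC.
best = min(scores, key=scores.get)
```

The separate analysis has the highest likelihood, but its 1320 parameters are penalized enough that the proportional analysis ends up with the lowest AIC.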

30 Results: the best multiple gene method (Murphy dataset, Madsenl tree, N-Gamma rate model)
Analysis       df    Ln(L)       AIC
Separate       540   -11192.12   23464.23
Concatenate    95    -11618.67   23427.33
Proportional   100   -11543.87   23287.74
The proportional analysis is the best for the nuclear dataset (“short genes”).

31 Results: the best multiple gene method (Madsen dataset, Murphyl tree, N-Gamma rate model)
Analysis       df    Ln(L)       AIC
Separate       216   -31153.28   62738.56
Concatenate    57    -31519.10   63152.21
Proportional   60    -31406.81   62933.63
The separate analysis is the best for the nuclear dataset (“long genes”).

32 Conclusion: the best multiple gene method 1- The concatenate model is always the worst way to analyze multiple genes. 2- The choice between the separate analysis and the proportional analysis depends on the data considered: the proportional model is better suited to short genes, the separate model to longer sequences.

33 Results: mammalian phylogeny The morphological tree is always rejected (K-H test, P < 0.05), whatever the model used and whatever the dataset.

34 Results: mammalian phylogeny The mitochondrial tree is the best tree for the mitochondrial dataset. But we cannot reject the nuclear tree. The nuclear tree is the best for the nuclear datasets, and we can reject the mitochondrial tree. Conclusion (Topology): It seems that the nuclear tree is the best tree among the 3 alternative trees.

35 Modeling of site rate variation: the gamma distribution. Homogeneous model: $F(t+x) = F(t)\,P(x)$. Gamma model: $F(t+x) = \sum_{n=1}^{c} \frac{1}{n}\,F(t)\,P(x R_n)$. [Figure: gamma distribution of site proportions f(r) over substitution rates (R).]

36 Likelihoods with rate variation [Figure: two trees with tips A, C, G and branch lengths d1, d2, d3, one labeled Continuous and one labeled Discrete.]

37 Results: the best site-rate variation model. Mitochondrial data set (Mitochondrial tree, proportional analysis)
Model               df    Ln(L)       AIC
Homogeneous model   120   -98998.68   198237.37
1-Gamma model       121   -91094.30   182430.61
N-Gamma model       132   -90999.30   182262.60

38 Conclusion: the best site-rate variation model The N-Gamma model is always the best site-rate variation model.

39 Combining Multiple Genes Collaborations: Dorothee Huchon (Florida State University), Masami Hasegawa (Institute of Statistical Mathematics), Norihiro Okada (Tokyo Institute of Technology), Ying Cao (ISM).

40 Known phylogenies

41 The best way to test different methods of phylogenetic reconstruction is on trees that are known to be true from other sources… Problem: known phylogenies are very rare. Known phylogenies: laboratory animals, crop plants (and even those are often suspect). Also, their evolutionary rate is very small…

42 Known phylogenies David Hillis and colleagues have created “experimental” phylogenies in the lab.

43 Known phylogenies They used bacteriophage T7 and subdivided cultures of it in the presence of a mutagen. They then sequenced a marker gene from the final cultures and gave the sequences as input to a few phylogenetic methods. The output of the tree-building methods was compared to the true tree.
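One common way to compare an inferred tree with a known true tree is to count the groups found in one tree but not in the other (a Robinson-Foulds style distance). The slides do not say how the comparison was made, so this is only an illustrative sketch; the trees and taxon names are hypothetical.

```python
def clade_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees.

    Each tree is a set of clades, each clade a frozenset of taxon names.
    A distance of 0 means the two trees contain the same groups.
    """
    return len(tree_a ^ tree_b)


# Hypothetical true tree on four lineages: ((p,q),(r,s)).
true_tree = {frozenset("pq"), frozenset("rs")}

# Two hypothetical reconstructions.
inferred_same = {frozenset("pq"), frozenset("rs")}
inferred_wrong = {frozenset("pr"), frozenset("prs")}

d_same = clade_distance(true_tree, inferred_same)
d_wrong = clade_distance(true_tree, inferred_wrong)
```

A method that recovers the true tree exactly gets a distance of 0; larger values mean more groups are missed or wrongly proposed.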

44 Known phylogenies In fact, they used restriction-site data to infer the phylogeny, with MP, NJ, UPGMA and other methods. All methods reconstructed the true tree.

45 Known phylogenies They also compared outputs of ancestral sequence reconstruction, using MP. 97.3% of the ancestral states were correctly reconstructed. Encouraging!

46 Known phylogenies Criticism: (1) The true tree was very easy to infer, because it was well balanced and all nodes were accompanied by numerous changes. (2) Mutations induced by a single mutagen do not reflect reality.

47 Thank You…

