Combining genes in phylogeny And How to test phylogeny methods … Tal Pupko Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences,

Presentation transcript:

Combining genes in phylogeny And How to test phylogeny methods … Tal Pupko Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University

Multiple sequence alignment (vWF)
Rat      QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK
Rabbit   QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN
Gorilla  QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR
Cat      REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR

From sequences to a phylogenetic tree (vWF)
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG

Multiple multiple sequence alignment
Rat      QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK
Rabbit   QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN
Gorilla  QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR
Cat      REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR
[Figure: this four-species alignment is shown six times on the slide.]

Phylogenetic studies are now based on the analysis of multiple genes
Murphy et al. (2001b): 19 nuclear genes + 3 mitochondrial genes (16,400 bp)

Consensus trees

Consensus tree
A consensus tree summarizes information common to two or more trees.
[Figure: three trees on the taxa a, b, c, d, e.]

Strict consensus
The strict consensus includes only those groups that occur in all the trees being considered.
[Figure: three trees on the taxa a, b, c, d, e and their strict consensus.]

Problem: the split {ab} is found in 2 out of 3 trees, and this is not shown in the strict consensus.
[Figure: three trees on the taxa a, b, c, d, e and their strict consensus.]

Majority-rule consensus
The majority-rule consensus shows the splits that are found in the majority of the trees.
[Figure: three trees on the taxa a, b, c, d, e and their majority-rule consensus.]

The percentage of trees supporting each split is indicated.
[Figure: three trees on the taxa a, b, c, d, e and their majority-rule consensus, with support values of 100 and 67 on its splits.]
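The two consensus rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each tree is represented as a set of its non-trivial groups, and the three input trees below are hypothetical, chosen so that the group {a, b} occurs in 2 of 3 trees as in the slides. The function names are not taken from any phylogenetics package.

```python
from collections import Counter

def group_counts(trees):
    """Count in how many trees each group (set of taxa) occurs."""
    return Counter(group for tree in trees for group in tree)

def strict_consensus(trees):
    """Groups that occur in every tree."""
    counts = group_counts(trees)
    return {g for g, c in counts.items() if c == len(trees)}

def majority_rule_consensus(trees):
    """Groups found in more than half of the trees, with % support."""
    counts = group_counts(trees)
    return {g: round(100 * c / len(trees))
            for g, c in counts.items() if c > len(trees) / 2}

# Hypothetical trees on taxa a..e, each given by its non-trivial groups.
trees = [
    {frozenset("ab"), frozenset("de")},
    {frozenset("ab"), frozenset("de")},
    {frozenset("bc"), frozenset("de")},
]

strict = strict_consensus(trees)            # only {d, e}
majority = majority_rule_consensus(trees)   # {a, b}: 67, {d, e}: 100
```

With these inputs the strict consensus drops {a, b}, while the majority-rule consensus keeps it with 67% support.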

Problem with majority-rule consensus
Here the majority-rule consensus equals the strict consensus. However, if we consider only {b, c, d}, then in both trees b and c are closer to each other than b is to d or c is to d.
[Figure: two trees on the taxa a, b, c, d, e and their consensus.]

Adams consensus
The Adams consensus gives the subtrees that are common to all trees. It is useful when one or more sequences have unclear positions but there is a subset of sequences whose relationships are common to all trees.
[Figure: two trees on the taxa a, b, c, d, e and their Adams consensus.]

Networks
A network is sometimes used to represent a tree in which recombination occurred.
[Figure: a network on the taxa a, b, c, d, e.]

Maximum Likelihood
[Figure: a tree with nodes labelled A, C, X and S and branch lengths t1, t2, t3.]

Multiple genes analysis: concatenate analysis (e.g., Murphy et al. (2001))
Gene 1 + Gene 2 + Gene 3
Sp1: TCTGT…AACTCTTT…GAATCGTT…GCC
Sp2: TCTGC…GACTCGCT…GGAACGCT…CCC
Sp3: CTTAT…GATCTATT…GGAATATT…CGA
Sp4: CCTAT…GATCCATT…GGACCATT…CCA
[Figure: one tree on Sp1, Sp2, Sp3, Sp4 and one evolutionary model for the concatenated alignment.]

Multiple genes analysis: concatenate analysis (e.g., Murphy et al. (2001))
Gene 1
Sp1: TCTGT…AAC
Sp2: TCTGC…GAC
Sp3: CTTAT…GAT
Sp4: CCTAT…GAT
Gene 2
Sp1: TCTTT…GAA
Sp2: TCGCT…GGA
Sp3: CTATT…GGA
Sp4: CCATT…GGA
Gene 3
Sp1: TCGTT…GCC
Sp2: ACGCT…CCC
Sp3: ATATT…CGA
Sp4: CCATT…CCA
[Figure: the tree on Sp1, Sp2, Sp3, Sp4 drawn for each gene, each with the same evolutionary model.]
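Building the concatenated alignment itself is mechanical, as the following minimal Python sketch shows. It assumes each gene alignment is a dictionary from species name to an aligned sequence, and it uses only the first five sites shown on the slide for each gene as toy data (the elided remainder of each sequence is not reproduced). The function name `concatenate` is illustrative.

```python
def concatenate(gene_alignments):
    """Join per-gene alignments into one alignment, species by species."""
    species = list(gene_alignments[0])
    for alignment in gene_alignments:
        # Every gene must be sampled for exactly the same species.
        if set(alignment) != set(species):
            raise ValueError("all genes must cover the same species")
    return {sp: "".join(alignment[sp] for alignment in gene_alignments)
            for sp in species}

# Toy data: the leading five sites of each gene as shown on the slide.
gene1 = {"Sp1": "TCTGT", "Sp2": "TCTGC", "Sp3": "CTTAT", "Sp4": "CCTAT"}
gene2 = {"Sp1": "TCTTT", "Sp2": "TCGCT", "Sp3": "CTATT", "Sp4": "CCATT"}
gene3 = {"Sp1": "TCGTT", "Sp2": "ACGCT", "Sp3": "ATATT", "Sp4": "CCATT"}

supermatrix = concatenate([gene1, gene2, gene3])
```

The resulting alignment is then analysed as if it were a single gene, with one model and one set of branch lengths.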

What are branch lengths?
Branch lengths correspond to evolutionary distance:
d = AA replacements/site = [AA replacements/(site*year)] * years = evolutionary rate * years

Multiple genes analysis: separate analysis (e.g., Nikaido et al. (2001))
Gene 1 (evolutionary model 1)
Sp1: TCTGT…AAC
Sp2: TCTGC…GAC
Sp3: CTTAT…GAT
Sp4: CCTAT…GAT
Gene 2 (evolutionary model 2)
Sp1: TCTTT…GAA
Sp2: TCGCT…GGA
Sp3: CTATT…GGA
Sp4: CCATT…GGA
Gene 3 (evolutionary model 3)
Sp1: TCGTT…GCC
Sp2: ACGCT…CCC
Sp3: ATATT…CGA
Sp4: CCATT…CCA
[Figure: a separate tree on Sp1, Sp2, Sp3, Sp4 for each gene.]

Multiple genes analysis: number of parameters
Number of species = n
Number of genes = g
Number of parameters in the model = m

Analysis               Number of parameters
Concatenate analysis   m + (2n-3)
Separate analysis      g*(m + (2n-3))

Example: n = 44; g = 22; m =

Multiple genes analysis: number of parameters
Both an oversimplified model and over-parameterization may lead to wrong phylogenetic conclusions.

Multiple genes analysis: proportional analysis
Gene 1 (evolutionary model 1)
Sp1: TCTGT…AAC
Sp2: TCTGC…GAC
Sp3: CTTAT…GAT
Sp4: CCTAT…GAT
Gene 2 (evolutionary model 2)
Sp1: TCTTT…GAA
Sp2: TCGCT…GGA
Sp3: CTATT…GGA
Sp4: CCATT…GGA
Gene 3 (evolutionary model 3)
Sp1: TCGTT…GCC
Sp2: ACGCT…CCC
Sp3: ATATT…CGA
Sp4: CCATT…CCA
[Figure: the tree on Sp1, Sp2, Sp3, Sp4 drawn for each gene with the same relative branch lengths, scaled by gene-specific rates: Rate=1, Rate=0.5, Rate=1.5.]
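The idea of the proportional analysis, one shared set of branch lengths multiplied by a rate for each gene, can be sketched as follows. The rates 1, 0.5 and 1.5 are the ones shown on the slide, but their assignment to particular genes and the shared branch lengths are hypothetical values chosen for illustration.

```python
def proportional_branch_lengths(shared_lengths, gene_rates):
    """Scale one shared set of branch lengths by each gene's relative rate."""
    return {gene: [rate * length for length in shared_lengths]
            for gene, rate in gene_rates.items()}

# Hypothetical shared branch lengths: an unrooted tree of n = 4 species
# has 2n - 3 = 5 branches.
shared = [0.10, 0.20, 0.05, 0.30, 0.15]

# Rates from the slide; which gene gets which rate is an assumption.
rates = {"gene_a": 1.0, "gene_b": 0.5, "gene_c": 1.5}

per_gene = proportional_branch_lengths(shared, rates)
```

Because every gene reuses the same five branch lengths, each additional gene costs only one extra rate parameter (plus its own model parameters) instead of a full new set of branch lengths.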

Multiple genes analysis: number of parameters
Number of species = n
Number of genes = g
Number of parameters in the model = m

Analysis                Number of parameters
Concatenate analysis    m + (2n-3)
Separate analysis       g*(m + (2n-3))
Proportional analysis   g-1 + g*m + (2n-3)

Example: n = 44; g = 22; m =
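The three formulas are easy to evaluate for the example with n = 44 species and g = 22 genes. The value of m is not given in the transcript, so the sketch below takes it as an argument and uses m = 1 purely as a placeholder assumption.

```python
def parameter_counts(n, g, m):
    """Number of free parameters for each multiple-gene analysis.

    n: number of species, g: number of genes,
    m: number of parameters in the evolutionary model.
    An unrooted tree on n species has 2n - 3 branches.
    """
    branches = 2 * n - 3
    return {
        "concatenate": m + branches,
        "separate": g * (m + branches),
        "proportional": (g - 1) + g * m + branches,
    }

# n and g are from the slide; m = 1 is a placeholder, not the slide's value.
counts = parameter_counts(n=44, g=22, m=1)
```

With this placeholder m, the concatenate analysis has 86 parameters, the proportional analysis 128, and the separate analysis 1892, which shows how quickly the separate analysis grows with the number of genes.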

Aims of our study
To compare 3 types of multiple-gene analysis:
- Concatenate analysis
- Separate analysis
- Proportional analysis
On 3 protein datasets:
- Mitochondrial dataset [56 species, 12 genes]
- Nuclear dataset (“short genes”) [46 species, 6 genes] (based on the Murphy dataset)
- Nuclear dataset (“long genes”) [28 species, 4 genes]

Comparing topologies: morphological topology (based on McKenna and Bell, 1997)
[Figure: tree with the groups Archonta, Glires, Ungulata, Carnivora, Insectivora, Xenarthra.]

Mitochondrial topology
[Figure: tree with the groups Perissodactyla, Carnivora, Cetartiodactyla, Rodentia 1, Hedgehogs, Rodentia 2, Primates, Chiroptera, Moles + Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia.]

Nuclear topology (Madsen tree)
[Figure: tree with the groups Cetartiodactyla, Afrotheria, Chiroptera, Eulipotyphla, Glires, Xenarthra, Carnivora, Perissodactyla, Scandentia + Dermoptera, Pholidota, Primates.]

Comparing different models using the Akaike Information Criterion (AIC)
The model that minimizes the AIC is considered to be the most appropriate model.
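For reference, the standard definition is AIC = -2 ln(L) + 2k, where ln(L) is the maximized log-likelihood and k is the number of free parameters (the df column of the result tables). The sketch below applies it to three models; the log-likelihoods are made-up numbers for illustration only, not results from the study, and the parameter counts reuse a placeholder model size.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 ln(L) + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params

def best_model(models):
    """Return the name of the model with the smallest AIC.

    models maps a name to (log_likelihood, n_params).
    """
    return min(models, key=lambda name: aic(*models[name]))

# Hypothetical (ln L, df) pairs, NOT values from the presentation.
models = {
    "concatenate": (-1050.0, 86),
    "separate": (-900.0, 1892),
    "proportional": (-940.0, 128),
}

scores = {name: aic(*values) for name, values in models.items()}
winner = best_model(models)
```

In this made-up example the separate analysis has the highest likelihood but pays a large penalty for its parameters, so the proportional analysis has the lowest AIC.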

Results: the best multiple gene analysis
The proportional analysis is the best for the mitochondrial dataset (mitochondrial tree, N-Gamma rate model).
[Table: df, ln(L) and AIC for the separate, concatenate and proportional analyses.]

Results: the best multiple gene method
The proportional analysis is the best for the nuclear dataset (“short genes”) (Murphy dataset, Madsen tree, N-Gamma rate model).
[Table: df, ln(L) and AIC for the separate, concatenate and proportional analyses.]

Results: the best multiple gene method
The separate analysis is the best for the nuclear dataset (“long genes”) (Madsen dataset, Murphy tree, N-Gamma rate model).
[Table: df, ln(L) and AIC for the separate, concatenate and proportional analyses.]

Conclusion: the best multiple gene method
1- The concatenate model is always the worst way to analyze multiple genes.
2- The choice between the separate analysis and the proportional analysis depends on the data considered: the proportional model is better suited to short genes, the separate model to longer sequences.

Results: mammalian phylogeny
The morphological tree is always rejected (K-H test, P < 0.05), whatever the model used and whatever the dataset.

Results: mammalian phylogeny
The mitochondrial tree is the best tree for the mitochondrial dataset, but we cannot reject the nuclear tree.
The nuclear tree is the best for the nuclear datasets, and we can reject the mitochondrial tree.
Conclusion (topology): it seems that the nuclear tree is the best tree among the 3 alternative trees.

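Modeling of site-rate variation
Homogeneous model: $F(t+x) = F(t)\,P(x)$
Gamma model: $F(t+x) = \sum_{n=1}^{c} \frac{1}{c}\,F(t)\,P(x R_n)$, where $R_n$ is the substitution rate of rate category $n$ and $c$ is the number of categories.
[Figure: the gamma distribution, with site proportions f(r) plotted against substitution rates (R).]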

Likelihoods with rate variation
[Figure: a tree with leaves A, C, G and branch lengths d1, d2, d3, shown once for the continuous and once for the discrete treatment of rate variation.]
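The discrete treatment can be sketched for the three-leaf tree in the figure. The code below is an illustration under several assumptions that are not in the slides: it uses the Jukes-Cantor (JC69) nucleotide model, equal base frequencies at the internal node, equally weighted rate categories, and hypothetical category rates and branch lengths. The site likelihood is computed once per rate category with all branches scaled by that rate, and the results are averaged.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, d):
    """JC69 probability of base i changing to base j over distance d."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def site_likelihood(leaves, branch_lengths, rate=1.0):
    """Likelihood of one site on a star tree, summing over the unknown
    base at the internal node; every branch is scaled by `rate`."""
    total = 0.0
    for x in BASES:
        term = 0.25  # equal base frequencies at the internal node
        for leaf, d in zip(leaves, branch_lengths):
            term *= jc69_prob(x, leaf, d * rate)
        total += term
    return total

def site_likelihood_discrete(leaves, branch_lengths, rates):
    """Average the site likelihood over equally weighted rate categories."""
    return sum(site_likelihood(leaves, branch_lengths, r)
               for r in rates) / len(rates)

# Hypothetical branch lengths d1, d2, d3 and category rates (mean 1).
d = (0.1, 0.3, 0.2)
rates = [0.1, 0.5, 1.0, 2.4]

homogeneous = site_likelihood("ACG", d)
with_rate_variation = site_likelihood_discrete("ACG", d, rates)
```

In a real analysis the category rates would be derived from a gamma distribution with an estimated shape parameter; here they are fixed by hand to keep the example self-contained.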

Results: the best site-rate variation model
Mitochondrial dataset (mitochondrial tree, proportional analysis).
[Table: df, ln(L) and AIC for the homogeneous model, the 1-Gamma model and the N-Gamma model.]

Conclusion: the best site-rate variation model
The N-Gamma model is always the best site-rate variation model.

Combining Multiple Genes: collaborations
Dorothee Huchon (Florida State University)
Masami Hasegawa (Institute of Statistical Mathematics)
Norihiro Okada (Tokyo Institute of Technology)
Ying Cao (ISM)

Known phylogenies

The best way to test different methods of phylogenetic reconstruction is on trees that are known to be true from other sources. Problem: known phylogenies are very rare. Examples of known phylogenies: laboratory animals and crop plants (and even those are often suspect). Also, their evolutionary rate is very low.

Known phylogenies
David Hillis and colleagues have created “experimental” phylogenies in the lab.

Known phylogenies
They used bacteriophage T7 and subdivided cultures of it in the presence of a mutagen. They then sequenced a marker gene from the final cultures and gave the sequences as input to several phylogenetic methods. The output of the tree-building methods was compared to the true tree.

Known phylogenies
In fact, they used restriction-site data to infer the phylogeny, with MP, NJ, UPGMA and other methods. All methods reconstructed the true tree.

Known phylogenies
They also compared the outputs of ancestral sequence reconstruction, using MP. 97.3% of the ancestral states were correctly reconstructed. Encouraging!

Known phylogenies
Criticism:
(1) The true tree was very easy to infer, because it was well balanced and all nodes were accompanied by numerous changes.
(2) Mutations induced by a single mutagen do not reflect reality.

Thank You…