Phylogenetic analysis in the context of multigene sequences. Sudhindra R. Gadagkar, University of Dayton
DNA Evolution: The unifying force of all life on earth is DNA. Bases: Adenine (A), Cytosine (C), Guanine (G), Thymine (T). Example sequence: ATGGCATACGTGCAGTTCATCGGCTAGTGTGACATGA
DNA sequence evolution. [Figure: a sequence at time t0 (ATGGCATACGTGCA) and the sequences at time t1 (ATGGTATAGGTGCA and ATGGCATACGTGAA).]
A phylogenetic tree: a pattern of branching events, with each branching point showing a speciation (or divergence) event. [Figure: tree of Taxon A, Taxon B, and Taxon C with branch lengths 3.5, 3.5, 7.5, and 4. Labels: nodes (extinct ancestors), tips (living species), branches (amount of evolution). Taxon (pl. taxa).]
What is phylogenetic inference? Reconstruction of the evolutionary relationships among “taxa”, and their representation in a graphical form. [Figure: example tree with scale bar 0.05. Taxa: M Fin Whale, M Blue Whale, M Cow, M Rat, M Mouse, M Opossum, B Chicken, A Xenopus, F Rainbow Trout, F Loach, F Carp, L Lamprey, S Sea urchin.]
Parts of a tree. Tree size: the number of taxa in the phylogeny. Interior branch: partitions an unrooted tree into 2 subtrees, each containing at least 2 taxa. Cluster size: the smaller of the two subtree sizes partitioned by an interior branch. Depth of a branch: defined in terms of the number of taxa clustered by it. [Figure: rooted tree of F. Whale, B. Whale, Cow, Rat, Mouse, and Opossum, labeled with root, node, internal branch, external branch, and outgroup.]
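As a small illustration of the cluster-size definition on this slide, the sketch below takes the taxa on one side of an interior branch and returns the smaller of the two subtree sizes. The function name `cluster_size` and the set-based representation of a branch are assumptions made for this example, not something given in the slides.

```python
def cluster_size(side, taxa):
    """Cluster size of an interior branch: the smaller of the two subtree sizes.

    side: taxa on one side of the branch; taxa: all taxa in the tree.
    (Hypothetical helper; the set representation is an assumption.)
    """
    side, taxa = set(side), set(taxa)
    if not side <= taxa:
        raise ValueError("side must be a subset of taxa")
    smaller = min(len(side), len(taxa) - len(side))
    if smaller < 2:
        # An interior branch leaves at least 2 taxa on each side.
        raise ValueError("not an interior branch")
    return smaller


taxa = {"F. Whale", "B. Whale", "Cow", "Rat", "Mouse", "Opossum"}
whales = cluster_size({"F. Whale", "B. Whale"}, taxa)
half = cluster_size({"F. Whale", "B. Whale", "Cow"}, taxa)
```

Passing either side of the branch gives the same answer, since only the smaller side counts.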
Example of a 6-sequence tree. [Figure: a rooted tree and an unrooted tree of F. Whale, B. Whale, Cow, Rat, Mouse, and Opossum.]
Phylogenetic analysis using DNA sequences. [Figure: a sequence at time t0 (ATGGCATACGTGCA) and the sequences at time t1 (ATGGTATAGGTGCA and ATGGCATACGTGAA).]
Gene Sequences: homologous (orthologous) gene sequences
D. melanogaster   ATGTCGTTGACCAACAAGAACGTGATTTTCGTGGCCGGTCT...
D. pseudoobscura  ATGTCTCTCACCAACAAGAACGTCGTTTTCGTGGCCGGTCT...
D. crassifemur    ATGTTCATCGCTGGCAAGAACATCATCTTTGTCGCTGGTCT...
D. mulleri        ATGGCCATCGCTAACAAGAACATCATCTTCGTCGCTGGACT...
Distance Matrix
         D.me  D.ps  D.cr  D.mu
[D.me]
[D.ps]   0.14
[D.cr]   0.24  0.24
[D.mu]   0.21  0.20  0.21
[Figure: tree of D. melanogaster, D. pseudoobscura, D. mulleri, and D. crassifemur.]
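The step from aligned sequences to a distance matrix can be sketched as follows. This is a minimal example assuming the simplest choices: the proportion of differing sites (p-distance) and the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3). The slide does not say which distance measure produced its matrix, and the sequences shown there are truncated, so this code is not expected to reproduce the slide's values. The function names are made up for the example.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc_distance(a, b):
    """Jukes-Cantor corrected distance (assumes equal rates among all bases)."""
    p = p_distance(a, b)
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


# Only the first 41 sites shown on the slide (the full genes are elided there).
seqs = {
    "D.me": "ATGTCGTTGACCAACAAGAACGTGATTTTCGTGGCCGGTCT",
    "D.ps": "ATGTCTCTCACCAACAAGAACGTCGTTTTCGTGGCCGGTCT",
    "D.cr": "ATGTTCATCGCTGGCAAGAACATCATCTTTGTCGCTGGTCT",
    "D.mu": "ATGGCCATCGCTAACAAGAACATCATCTTCGTCGCTGGACT",
}
matrix = {(a, b): jc_distance(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)}
```

The corrected distance is always at least as large as the p-distance, because it accounts for multiple substitutions at the same site.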
Expected (species) tree and realized tree for gene X. [Figure: two trees of F. Whale, B. Whale, Cow, Rat, Mouse, and Opossum, one labeled "Expected or Species tree" and one labeled "Realized tree for gene X".]
Two-fold Challenge Today’s challenge is the flood of data, in two ways: 1. The increasing number of taxa (say, species) for which molecular data is available. 2. The increasing amount of molecular data that is available for each taxon.
Why is reconstructing the evolutionary history of a large number of taxa a challenge? The number of possible trees increases enormously as the number of taxa increases.
Number of rooted trees: The number of bifurcating rooted trees is given by the following formula, where m is the number of taxa: (2m - 3)! / [2^(m-2) (m - 2)!]. Source: Nei and Kumar, 2000. Molecular Evolution and Phylogenetics
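To get a feel for how fast this number grows, the sketch below evaluates the standard count of bifurcating rooted trees, (2m - 3)! / [2^(m-2) (m - 2)!], and the corresponding unrooted count, which equals the rooted count for m - 1 taxa. The function names are chosen for this example.

```python
from math import factorial


def rooted_trees(m):
    """Number of bifurcating rooted trees for m taxa (m >= 2)."""
    if m < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))


def unrooted_trees(m):
    """Number of bifurcating unrooted trees for m taxa (m >= 3)."""
    if m < 3:
        raise ValueError("need at least 3 taxa")
    return rooted_trees(m - 1)


for m in (3, 4, 10, 20, 50):
    print(m, rooted_trees(m))
```

Three taxa give 3 rooted trees and four taxa give 15; by ten taxa the count already exceeds 34 million.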
3 taxa Source: Brian Golding, Reconstructing Phylogenies http://helix.biology.mcmaster.ca/721/phylo/phylo.html
4 taxa Source: Brian Golding, Reconstructing Phylogenies http://helix.biology.mcmaster.ca/721/phylo/phylo.html
More taxa Source: Brian Golding, Reconstructing Phylogenies http://helix.biology.mcmaster.ca/721/phylo/phylo.html
So many trees! [Figure: number of possible trees versus number of sequences, with reference magnitudes: 10^79 atoms in the universe; 10^37 atoms in the bodies of all humans by year 2035; 5 x 10^30 prokaryotes living today; 5 x 10^11 stars in the Milky Way.] How many trees represent the true relationship?
Only ONE out of all possible trees is the true tree!
Which is the true tree? Choose a criterion (optimality criterion). Score the fit of the data to a given tree for that criterion. The tree with the optimal score is chosen as the best tree. The optimal tree found in this way is expected to be closest to the true tree.
Optimality Criteria: Minimum Evolution (ME). Branch lengths are computed for each tree using pairwise distances obtained from the sequences. The sum of branch lengths (S) is used as the optimality score. The tree with the smallest S-value is chosen. [Diagram labels: data, substitution model, distance computer, topology, branch lengths computer, sum of branch lengths.]
The Neighbor-Joining method ( Saitou and Nei, Mol. Biol.Evol. 4: 406 - 425, 1987 )
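A minimal sketch of the neighbor-joining procedure, using the commonly taught Q-matrix formulation: at each step, join the pair (i, j) that minimizes (n - 2) d(i, j) - r(i) - r(j), where r is the row sum of distances, then replace the pair with a new node. For brevity it returns only the clusters that are joined (the topology) and skips branch-length estimation. The data layout (a dict keyed by `frozenset` pairs) and the internal labels `node0`, `node1`, ... are assumptions of this example.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the clusters (sets of leaf names) joined by NJ, in joining order.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Stops when 3 nodes remain, since they form a single unrooted star.
    """
    nodes = {name: frozenset([name]) for name in names}  # label -> leaves below
    d = dict(dist)
    joins = []
    while len(nodes) > 3:
        n = len(nodes)
        # Row sums of the current distance matrix.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Pair with the smallest Q value (first such pair on ties).
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        u = f"node{len(joins)}"  # hypothetical label; must not clash with a taxon
        dij = d[frozenset((i, j))]
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes[u] = nodes.pop(i) | nodes.pop(j)
        joins.append(nodes[u])
    return joins


# Distances from the Drosophila matrix shown earlier in the talk.
pairs = {
    ("D.me", "D.ps"): 0.14, ("D.me", "D.cr"): 0.24, ("D.ps", "D.cr"): 0.24,
    ("D.me", "D.mu"): 0.21, ("D.ps", "D.mu"): 0.20, ("D.cr", "D.mu"): 0.21,
}
dist = {frozenset(k): v for k, v in pairs.items()}
joins = neighbor_joining(["D.me", "D.ps", "D.cr", "D.mu"], dist)
```

With four taxa there is a single interior branch, so one join fixes the whole unrooted topology.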
Properties of the NJ method: Computationally efficient. Desirable statistical properties. Accuracy. Performance with large phylogenies?
Research Problem: Performance of the NJ optimality criterion in inferring large trees. Is performance worse with more sequences? Is it more difficult to infer deep branches than shallow ones? Are branches at similar depths reconstructed with the same efficiency in large and small trees?
Model trees and their features: 4 basic 6-taxon trees (topologies A to D). Equal interior branch lengths. Trees stacked to make larger trees (e.g., D_x = x trees of type D stacked). [Figure: model trees A to D with branch-length multipliers, and stacked trees built from type D.] Kumar and Gadagkar, 2000, J. Mol. Evol., 51:544-553 (Fig. 1)
Additional model topology - the rbcL tree (From Hillis, Nature, 383:130-131, 1996)
Tree parameters. Rate: up to 10-fold differences in rate. Sequence length: up to 10 multiples of 100 sites. Tree size: A_x, B_x, C_x, D_x, where x varied from 1 to 10, 16, and 32.
Simulating Evolutionary Change: A starting point or “root” is chosen. A random ancestral sequence is generated for the root. The branch length is randomly obtained from a Poisson distribution with mean = expected no. of substitutions (evolutionary rate x sequence length x multiplier). [Figure: model tree with branch multipliers.]
Simulating Evolutionary Change (contd.): Equal probability of transition from one state to another. The process is carried out for all branches. The resulting data are sequences for the taxa for that “gene”. These sequences are used to infer back the evolutionary relationships using NJ. 1000 replications (A to D trees; up to 60 taxa), 100 replications (> 60 taxa, rbcL tree).
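The simulation steps described on these two slides can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' program: the tree is encoded as nested `(node, multiplier)` tuples, the Poisson sampler is Knuth's simple multiplication method, and each substitution changes a randomly chosen site to one of the other three bases with equal probability. All names and the example tree are made up.

```python
import math
import random

BASES = "ACGT"


def poisson(mean, rng):
    """Draw from a Poisson distribution (Knuth's method; fine for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def evolve(seq, expected_subs, rng):
    """Apply a Poisson-distributed number of substitutions to a sequence."""
    seq = list(seq)
    for _ in range(poisson(expected_subs, rng)):
        site = rng.randrange(len(seq))
        # Equal probability of changing to any of the other three bases.
        seq[site] = rng.choice([b for b in BASES if b != seq[site]])
    return "".join(seq)


def simulate(children, seq, rate, rng, out=None):
    """Evolve seq down every branch; return {taxon: sequence} for the tips.

    children: list of (node, multiplier); node is a taxon name or a child list.
    Expected substitutions per branch = rate * sequence length * multiplier.
    """
    out = {} if out is None else out
    for node, mult in children:
        child_seq = evolve(seq, rate * len(seq) * mult, rng)
        if isinstance(node, str):
            out[node] = child_seq
        else:
            simulate(node, child_seq, rate, rng, out)
    return out


rng = random.Random(1)
root = "".join(rng.choice(BASES) for _ in range(200))  # random ancestral sequence
tree = [([("A", 1), ("B", 1)], 4), ("C", 5)]           # hypothetical 3-taxon tree
seqs = simulate(tree, root, rate=0.01, rng=rng)
```

Repeating the call with different seeds yields the replicate data sets on which the inference method is then evaluated.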
Accurate Inference of Complete Trees Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Table 1)
Effect of 0-length branches on NJ performance. [Figure: percent branches correct versus sequence length (s), 0 to 1000 sites; curves P_0, P_Model, and P_Realized.] Kumar and Gadagkar, 2000, J. Mol. Evol., 51:544-553 (Fig. 3)
Reconstruction efficiency of 6-taxon monophyletic clusters. [Figure: % efficiency of inferring monophyletic groups (% correct replicates) versus number of sequences (0 to 200), for 200, 500, and 1000 sites.]
Branch depth and NJ efficiency Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Fig. 5B)
Shallow versus deep branches: results from the rbcL tree. [Figure: reconstruction efficiency versus branch depth.] Kumar and Gadagkar, 2000, J. Mol. Evol., 51:544-553 (Fig. 8B)
Branch depth and efficiency for different inference methods (JC simulations ) Rosenberg and Kumar, 2001, Mol. Biol. Evol.,18:1823-1827 (Fig. 1)
Branch depth and efficiency for different inference methods (HKY simulations ) Rosenberg and Kumar, 2001, Mol. Biol. Evol.,18:1823-1827 (Fig. 2)
The Challenge of Multi-Gene Sequences Multi-Gene/Whole Genome sequences increasingly available for many taxa. How best to obtain phylogenetic information from these multiple sequences?
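Concatenation vs Consensus. Concatenation approach: the sequences of the individual genes (e.g., ATGCTGACTG and ATGTCGTCAGTC) are joined end to end, and one tree of taxa A to E is inferred from the combined sequence. Consensus approach: a tree is inferred separately from each gene, and the gene trees are summarized into a single consensus tree. [Figure: gene trees of taxa A to E and the resulting concatenation and consensus trees.]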
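The difference between the two strategies can be sketched in a few lines. In this illustration, concatenation joins the per-gene alignments taxon by taxon before any tree is inferred, while consensus combines already-inferred gene trees; a majority-rule consensus is assumed here because the slides do not say which consensus method was used. Trees are represented simply as sets of clusters, and all names and toy data are made up.

```python
from collections import Counter


def concatenate(alignments):
    """Join per-gene alignments ({taxon: sequence}) into one long alignment."""
    taxa = alignments[0].keys()
    return {t: "".join(a[t] for a in alignments) for t in taxa}


def majority_rule(gene_trees):
    """Keep the clusters found in more than half of the gene trees.

    gene_trees: list of trees, each a set of frozenset clusters.
    """
    counts = Counter(cluster for tree in gene_trees for cluster in tree)
    return {c for c, k in counts.items() if k > len(gene_trees) / 2}


gene1 = {"A": "ATGCTGACTG", "B": "ATGCTGACTA"}
gene2 = {"A": "ATGTCGTCAGTC", "B": "ATGTCGTCAGTT"}
supermatrix = concatenate([gene1, gene2])

ab, cd, de = frozenset("AB"), frozenset("CD"), frozenset("DE")
consensus = majority_rule([{ab, de}, {ab, cd}, {ab, de}])
```

In the concatenation route a single inference (for example NJ) is then run on `supermatrix`; in the consensus route the inference has already been run once per gene.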
The worst-case scenario approach The worst-case scenario is when all the available genes yield highly incorrect phylogenetic reconstructions. When faced with such sequences, which strategy to employ: consensus or concatenation?
Simulation with estimated parameters Model tree based on the phylogenetic relationships among 66 mammals from Murphy et al., (Nature 409:614-618, 2001).
Source: Fig. 1 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
Simulation with estimated parameters: Sequences for 448 genes were downloaded from HOVERGEN (Duret et al., Nucleic Acids Res. 22: 2360-2365, 1994). Sequence parameters (length, L; substitution rate, r; transition-transversion rate ratio; and G+C content) were estimated from the data.
Simulation with estimated parameters (contd.) For each of the 448 genes, 100 replicate sequences generated by computer simulation, using the estimated parameters and the HKY model of evolution.
Simulation with estimated parameters (contd.) Phylogenetic inference was done on each of the 44,800 simulation replicates using the NJ-JC and NJ-TN methods. The accuracy of each tree was recorded in terms of the number of incorrect branches when compared to the model tree.
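One way to count incorrect branches is to compare interior branches as bipartitions of the taxa: a model-tree branch counts as incorrect when no branch of the inferred tree splits the taxa the same way. The sketch below assumes that definition and a set-based representation; the exact bookkeeping used in the study is not given in the slides, and the names are made up.

```python
def normalize(split, taxa, ref):
    """Represent a bipartition by the side that does not contain ref."""
    split = frozenset(split)
    return taxa - split if ref in split else split


def incorrect_branches(model_splits, inferred_splits, taxa):
    """Number of model-tree interior branches missing from the inferred tree."""
    taxa = frozenset(taxa)
    ref = min(taxa)  # any fixed taxon works as the reference
    model = {normalize(s, taxa, ref) for s in model_splits}
    inferred = {normalize(s, taxa, ref) for s in inferred_splits}
    return len(model - inferred)


taxa = "ABCDE"
model = [{"A", "B"}, {"D", "E"}]
inferred = [{"C", "D", "E"}, {"C", "D"}]  # {C, D, E} is the same split as {A, B}
errors = incorrect_branches(model, inferred, taxa)
```

Normalizing against a reference taxon makes the two ways of writing the same split compare equal.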
Simulation lets us play God! In computer simulation, evolution is simulated based on a model tree, and replicate sequences are obtained. These replicate sequences are then used to infer back the true tree. Therefore, for the 100 simulation replicates for each of the 448 genes, we know the worst performing replicate.
Simulation lets us play God! [Figure: tree of D. melanogaster, D. pseudoobscura, D. mulleri, and D. crassifemur, with the starting point marked "Start".]
The Two-Gene Case. Data: for each of NJ-JC and NJ-TN, we picked (1) 10,000 pairs of worst replicates and (2) 10,000 pairs of randomly chosen replicates.
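The pair selection can be sketched as below. This is only an illustration: it assumes that each pair is drawn from two distinct genes chosen uniformly at random, and that the "worst" replicate of a gene is the one with the most incorrect branches. The slides do not spell out either detail, and the function names and toy error counts are made up.

```python
import random


def worst_replicate(errors):
    """Index of the replicate with the most incorrect branches (first on ties)."""
    return max(range(len(errors)), key=errors.__getitem__)


def sample_pairs(errors_by_gene, n_pairs, rng, worst=True):
    """Draw pairs of (gene, replicate index) from two distinct genes.

    errors_by_gene: {gene: [incorrect-branch count per replicate]}.
    worst=True picks each gene's worst replicate; otherwise a random one.
    """
    genes = sorted(errors_by_gene)
    pairs = []
    for _ in range(n_pairs):
        g1, g2 = rng.sample(genes, 2)
        picked = []
        for g in (g1, g2):
            if worst:
                picked.append((g, worst_replicate(errors_by_gene[g])))
            else:
                picked.append((g, rng.randrange(len(errors_by_gene[g]))))
        pairs.append(tuple(picked))
    return pairs


toy_errors = {"gene1": [3, 9, 1], "gene2": [0, 2, 7], "gene3": [5, 5, 4]}
worst_pairs = sample_pairs(toy_errors, 5, random.Random(0), worst=True)
random_pairs = sample_pairs(toy_errors, 5, random.Random(0), worst=False)
```

The two sequences in each pair would then be analyzed by concatenation and by consensus, and the resulting trees compared with the model tree.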
Two-gene concatenation Source: Table 1 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
Comparison of the number of incorrect inferred branches (NJ-JC). [Figure panels: worst replicate pairs; random replicate pairs.]
Quality of second gene. [Figure panels: worst case; random case.] Source: Fig. 2 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
Progressive addition of genes Source: Fig. 3 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
When all 448 genes were used Source: Fig. 4 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
Effect of neighboring branches Source: Fig. 5 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
Summary & Conclusions: Heck of a lot of data available, in two dimensions: number of species, and number of sequences per species. Many methods are available to infer phylogenies from a large number of species. Neighbor-joining (NJ), a fast, distance-based algorithm, works well and infers trees correctly as long as there are no polytomies (multifurcations) in the true tree. NJ also infers shallow and deep branches with good and equal efficiency.
Summary and Conclusions (contd.): Multigene data are available for many species. How best to obtain phylogenetic information from these sequences: consensus or concatenation? Our simulation results, with biologically realistic parameters and the worst-case approach, show that concatenation is better. However, the concatenation approach appears excessively prone to certain systematic errors.
Acknowledgements. Co-authors: Sudhir Kumar, Michael Rosenberg. Help: Roman Johnson, Tushar Gadagkar, Sankar Subramanian, Balaji Ramanujam. Arizona State University; University of Dayton.
To find out more about our graduate programs, please visit our Biology Department at: http://biology.udayton.edu. Apply online for free at: http://gradadmission.udayton.edu