New methods for estimating species trees from genome-scale data
Tandy Warnow
The University of Illinois


Phylogeny (evolutionary tree)
[Figure: tree relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

Sampling multiple genes from multiple species
[Figure: Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity

This talk
- Gene tree heterogeneity due to incomplete lineage sorting, modelled by the multi-species coalescent (MSC)
- Statistically consistent estimation of species trees under the MSC, and the impact of gene tree estimation error
- "Statistical binning" (Science 2014): improving gene tree estimation, and hence species tree estimation
- Open questions

Gene trees inside the species tree (Coalescent Process)
[Figure: a gene tree drawn inside the species tree, with a time axis running from Past to Present. Courtesy James Degnan]
Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.
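The discordance in the figure can be quantified in the simplest case. The sketch below is an illustration, not part of the talk. It assumes a rooted three-species tree ((A,B),C) whose internal branch has length t in coalescent units, and uses the standard MSC result that the gene tree matches the species tree with probability 1 - (2/3)e^(-t), while each of the two discordant topologies has probability (1/3)e^(-t). The function names and the "AB|C" labels are hypothetical.

```python
import math
import random


def triplet_probabilities(t):
    """Closed-form MSC probabilities for species tree ((A,B),C).

    t is the internal branch length in coalescent units (an assumed input).
    Returns (P(gene tree matches), P(each single discordant topology)).
    """
    p_discordant = math.exp(-t) / 3.0
    return 1.0 - 2.0 * p_discordant, p_discordant


def simulate_triplet(t, rng):
    """Draw one rooted gene tree topology for species tree ((A,B),C)."""
    # The A and B lineages coalesce inside the internal branch
    # with probability 1 - exp(-t): the gene tree is then concordant.
    if rng.random() < 1.0 - math.exp(-t):
        return "AB|C"
    # Otherwise all three lineages reach the root population uncoalesced
    # (incomplete lineage sorting) and each pair is equally likely to join first.
    return rng.choice(["AB|C", "AC|B", "BC|A"])


if __name__ == "__main__":
    rng = random.Random(1)
    for t in (0.1, 1.0, 3.0):
        draws = [simulate_triplet(t, rng) for _ in range(10000)]
        observed = draws.count("AB|C") / len(draws)
        expected, _ = triplet_probabilities(t)
        print(f"t={t}: simulated {observed:.3f}, closed form {expected:.3f}")
```

Short internal branches push the three topologies toward equal frequency, which is the regime where gene tree heterogeneity is strongest.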

Incomplete Lineage Sorting (ILS)
- Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc.
- There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.

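Two competing approaches
[Figure: alignments for gene 1, gene 2, ..., gene k across the species are either combined and analyzed together (Concatenation), or analyzed separately and the resulting gene trees combined by a Summary Method]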

Statistical Consistency
[Figure: plot of error against amount of data]
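The property plotted on this slide can be stated as a formula. The notation below is assumed for illustration and does not appear on the slide: T* is the true species tree and T̂_n is the tree a method returns from n units of data (sites for concatenation, gene trees for a summary method).

```latex
% Assumed notation: T^{*} = true species tree,
% \hat{T}_n = tree estimated from n units of data generated under the model.
\[
  \text{a method is statistically consistent under the model}
  \quad\Longleftrightarrow\quad
  \lim_{n \to \infty} \Pr\bigl[\hat{T}_n = T^{*}\bigr] = 1 .
\]
```

In words: the probability of returning anything other than the true tree goes to zero as the amount of data grows.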

What about summary methods?
Techniques:
- Most frequent gene tree?
- Consensus of gene trees?
- Other?
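The first technique listed can be sketched in a few lines. This is a minimal illustration under stated assumptions: rooted gene trees are given as nested Python tuples of leaf names (a hypothetical input format, not one named in the talk), and two trees count as the same topology when they differ only in the order of children.

```python
from collections import Counter


def canonical(tree):
    """Turn a nested-tuple rooted tree into nested frozensets.

    Child order is ignored, so (("A", "B"), "C") and ("C", ("B", "A"))
    map to the same canonical topology.
    """
    if isinstance(tree, str):
        return tree
    return frozenset(canonical(child) for child in tree)


def most_frequent_topology(gene_trees):
    """Return (canonical topology, count) for the most common gene tree."""
    counts = Counter(canonical(tree) for tree in gene_trees)
    return counts.most_common(1)[0]


if __name__ == "__main__":
    # Toy input, invented for illustration only.
    trees = [
        (("A", "B"), "C"),
        ("C", ("B", "A")),
        (("A", "C"), "B"),
    ]
    topology, count = most_frequent_topology(trees)
    print(topology, count)
```

Whether such a simple vote is statistically consistent under ILS is the question the slide raises.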

Statistically consistent under ILS?
Coalescent-based summary methods:
- MP-EST (Liu et al. 2010): maximum pseudo-likelihood estimation of rooted species tree based on rooted triplet tree distribution: YES
- BUCKy-pop (Ané and Larget 2010): quartet-based Bayesian species tree estimation: YES
- And many others (ASTRAL, ASTRID, NJst, GLASS, etc.)
Co-estimation methods:
- *BEAST (Heled and Drummond 2009): Bayesian co-estimation of gene trees and species trees: YES
Single-site methods (SVDquartets, METAL, SNAPP, and others): YES
CA-ML (concatenation using unpartitioned maximum likelihood): NO
MDC: NO
GC (Greedy Consensus): NO
MRP (supertree method): NO
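The list describes MP-EST as working from the rooted triplet tree distribution. The sketch below only tabulates that distribution from a set of rooted gene trees; it is not MP-EST and makes no attempt at pseudo-likelihood estimation. Trees are assumed to be nested tuples of leaf names, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def clades(tree):
    """Return (leaf set, list of clades) for a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)  # the clade rooted at this internal node
    return leaves, found


def triplet_counts(gene_trees):
    """Count how often each rooted triplet xy|z is displayed by the gene trees.

    Keys are ((x, y), z) with x < y. A tree displays xy|z when some clade
    contains x and y but not z.
    """
    counts = Counter()
    for tree in gene_trees:
        leaves, tree_clades = clades(tree)
        for trio in combinations(sorted(leaves), 3):
            for pair in combinations(trio, 2):
                other = (set(trio) - set(pair)).pop()
                if any(pair[0] in c and pair[1] in c and other not in c
                       for c in tree_clades):
                    counts[(pair, other)] += 1
    return counts


if __name__ == "__main__":
    # Toy gene trees, invented for illustration only.
    trees = [
        (("A", "B"), ("C", "D")),
        ((("A", "B"), "C"), "D"),
        ((("A", "C"), "B"), "D"),
    ]
    for key, value in sorted(triplet_counts(trees).items()):
        print(key, value)
```

Under the MSC the most probable rooted triplet for any three species agrees with the species tree, which is why triplet frequencies are a usable signal for summary methods.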

Results on 11-taxon datasets with weak ILS
- *BEAST more accurate than summary methods (MP-EST, BUCKy, etc.)
- CA-ML (concatenated analysis) most accurate
Datasets from Chung and Ané, 2011
Bayzid & Warnow, Bioinformatics 2013

Problem: poor gene trees
- Summary methods combine estimated gene trees, not true gene trees.
- The individual gene sequence alignments in the 11-taxon datasets have poor phylogenetic signal, and result in poorly estimated gene trees.
- Species trees obtained by combining poorly estimated gene trees have poor accuracy.

TYPICAL PHYLOGENOMICS PROBLEM: many poor gene trees

THIS IS A KEY ISSUE IN THE DEBATE ABOUT HOW TO COMPUTE SPECIES TREES

Statistical Consistency for summary methods
[Figure: plot of error against amount of data]
Data are gene trees, presumed to be randomly sampled true gene trees.

Avian Phylogenomics Project
E. Jarvis, HHMI; G. Zhang, BGI; MTP Gilbert, Copenhagen; S. Mirarab, UT-Austin; Md. S. Bayzid, UT-Austin; T. Warnow, UT-Austin; plus many, many other people…
Approx. 50 species, whole genomes, 14,000 loci
Published in Science 2014
Challenges:
- Massive gene tree conflict suggestive of ILS
- Coalescent-based analysis using MP-EST produced a tree that conflicted with the concatenation analysis
- Most gene trees had very low bootstrap support, suggestive of gene tree estimation error

Avian Phylogenomics Project
Solution: Statistical Binning
- Improves coalescent-based species tree estimation by improving gene trees (Mirarab, Bayzid, Boussau, and Warnow, Science 2014)
- Avian species tree estimated using Statistical Binning with MP-EST (Jarvis, Mirarab, et al., Science 2014)

Gene Tree Estimation Error can be due to insufficient data
[Figure: plot of error against amount of data]
Data are sites in an alignment for a c-gene.

Unweighted statistical binning (Science 2014)
Given multiple sequence alignments for a set of loci:
1. Estimate ML gene trees with bootstrap support
2. Bin genes based on gene tree compatibility after collapsing low support branches, producing "supergene alignments"
3. Compute "supergene trees" (one for each bin), using fully partitioned maximum likelihood
4. Apply coalescent-based summary method to the supergene trees, requiring that the summary method be statistically consistent under the MSC
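Step 2 of the pipeline above can be sketched as follows. This is a simplified stand-in, not the published binning procedure: the slide does not say how bins are chosen, so a first-fit greedy rule is assumed here. Each gene is represented by the set of bipartitions (splits) that survive collapsing low support branches, with each split stored as the frozenset of taxa on one side. All names are hypothetical.

```python
from itertools import product


def splits_compatible(a, b, taxa):
    """Two splits of the same taxon set are compatible iff at least one of
    the four pairwise intersections of their sides is empty."""
    a_sides = (a, taxa - a)
    b_sides = (b, taxa - b)
    return any(not (x & y) for x, y in product(a_sides, b_sides))


def genes_compatible(splits_1, splits_2, taxa):
    """Genes are compatible when every pair of supported splits is compatible."""
    return all(splits_compatible(a, b, taxa)
               for a in splits_1 for b in splits_2)


def greedy_bins(supported_splits, taxa):
    """Assign each gene to the first bin whose members it is compatible with.

    supported_splits maps gene name -> set of frozensets (supported splits).
    ASSUMPTION: first-fit greedy assignment, chosen only for illustration.
    """
    bins = []
    for gene in sorted(supported_splits):
        for members in bins:
            if all(genes_compatible(supported_splits[gene],
                                    supported_splits[m], taxa)
                   for m in members):
                members.append(gene)
                break
        else:
            bins.append([gene])  # conflicts with every existing bin
    return bins


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    taxa = frozenset("ABCDE")
    supported = {
        "g1": {frozenset("AB")},
        "g2": {frozenset("AB"), frozenset("DE")},
        "g3": {frozenset("AC")},
        "g4": set(),  # every branch collapsed: compatible with everything
    }
    print(greedy_bins(supported, taxa))
```

The alignments of the genes in each bin would then be concatenated into one supergene alignment for step 3. A gene whose branches all have low support carries no conflict, so it can join any bin.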

Unweighted statistical binning pipelines are not statistically consistent under GTR+MSC
Easy proof:
- As the number of sites per locus increases:
  - All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014).
  - For every bin, with probability converging to 1, the genes in the bin have the same tree topology.
  - Fully partitioned GTR ML analysis of each bin converges to a tree with the common topology of the genes in the bin.
- As the number of loci increases, every gene tree topology appears with probability converging to 1.
- Each bin contributes a single supergene tree, so the frequencies of the gene tree topologies are lost. Cannot infer the species tree from the flat distribution of gene trees!

Weighted statistical binning (PLOS One 2015)
Given multiple sequence alignments for a set of loci:
1. Estimate ML gene trees with bootstrap support
2. Bin genes based on gene tree compatibility after collapsing low support branches, producing "supergene alignments"
3. Compute "supergene trees" (one for each bin), using fully partitioned maximum likelihood
4. Replace original gene tree by the new supergene tree (equivalently, replicate supergene trees by the size of each bin)
5. Apply coalescent-based summary method to the supergene trees, requiring that the summary method be statistically consistent under the MSC
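The only difference from the unweighted pipeline is step 4, which can be sketched as below. The function name and the Newick strings are hypothetical; the bins and supergene trees are assumed to come from steps 2 and 3.

```python
def summary_method_input(bins, supergene_trees, weighted=True):
    """Build the list of trees handed to the summary method.

    bins            -- list of bins, each a list of gene names
    supergene_trees -- one tree per bin, in the same order as bins
    weighted        -- True: replicate each supergene tree once per gene in
                       its bin (weighted statistical binning);
                       False: one copy per bin (unweighted)
    """
    trees = []
    for members, tree in zip(bins, supergene_trees):
        copies = len(members) if weighted else 1
        trees.extend([tree] * copies)
    return trees


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    bins = [["g1", "g2", "g4"], ["g3"]]
    supergene_trees = ["((A,B),C);", "((A,C),B);"]
    print(summary_method_input(bins, supergene_trees, weighted=True))
    print(summary_method_input(bins, supergene_trees, weighted=False))
```

With weighting, a bin of three genes casts three votes, so the relative frequencies of topologies among the loci are preserved in the input to the summary method.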

WSB pipelines are statistically consistent under GTR+MSC
Easy proof:
- As the number of sites per locus increases:
  - All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014).
  - For every bin, with probability converging to 1, the genes in the bin have the same tree topology.
  - Fully partitioned GTR ML analysis of each bin converges to a tree with the common topology of the genes in the bin.
- Hence as the number of sites per locus and number of loci both increase, WSB (weighted statistical binning) followed by a statistically consistent summary method will converge in probability to the true species tree. Q.E.D.

Statistical binning vs. unbinned
Mirarab, et al., Science 2014 (Unweighted statistical binning)
Binning produces bins with approximately 5 to 7 genes each
Datasets: 11-taxon strong ILS datasets with 50 genes, Chung and Ané, Systematic Biology

Comparing Binned and Un-binned MP-EST on the Avian Dataset
- Unbinned MP-EST strongly rejects Columbea, a major finding by Jarvis, Mirarab, et al.
- Binned MP-EST is largely consistent with the ML concatenation analysis.
- The trees presented in Science 2014 were the ML concatenation and Binned MP-EST.

Summary
- Unpartitioned concatenation using maximum likelihood is statistically inconsistent under the MSC (Roch and Steel 2014, see discussion in Warnow PLOS Currents 2015)
- Gene tree estimation error impacts species tree estimation (multiple papers)
- Statistical binning (Mirarab et al. Science 2014) improves coalescent-based species tree estimation from multiple genes, used in Avian Tree (Jarvis, Mirarab, et al. Science 2014).
- Weighted statistical binning pipelines are statistically consistent under GTR+MSC, but unweighted statistical binning pipelines are not (Bayzid et al., PLOS One 2015)

Bounded number of sites per locus?
Do any summary methods converge to the species tree as the number of loci increases, but where each locus has only a constant number of sites?
Roch & Warnow, Systematic Biology 2015:
- Yes under the strong molecular clock (even for a single site per locus)
- Very limited results otherwise

Open Questions
- Is fully partitioned ML statistically consistent or inconsistent under the MSC? (Note: proof by Roch and Steel for unpartitioned ML will not easily extend to fully partitioned)
- Are any of the standard summary methods statistically consistent for bounded number of sites per locus, but unbounded number of loci?
- Are the co-estimation methods (e.g., *BEAST and BEST) statistically consistent for bounded number of sites per locus but unbounded number of loci?

Open Questions
- Why does concatenation using ML (whether unpartitioned or partitioned) produce such good accuracy under many conditions?
- Why does statistical binning improve accuracy under many conditions?

Acknowledgments
PhD students: Siavash Mirarab* (now Assistant Professor at UCSD ECE) and Md. S. Bayzid**
Bastien Boussau (CNRS, Lyon)
Sébastien Roch (Wisconsin)
Funding: Guggenheim Foundation, Packard, NSF, Microsoft Research New England, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), and GEBI.
TACC and UTCS computational resources
* Supported by HHMI Predoctoral Fellowship
** Supported by Fulbright Foundation Predoctoral Fellowship