Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.

Slides:



Advertisements
Similar presentations
Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.
Advertisements

Phylogenomics Symposium and Software School Tandy Warnow Departments of Computer Science and Bioengineering The University of Illinois at Urbana-Champaign.
Multiple sequence alignment methods: evidence from data CS/BioE 598 Tandy Warnow.
Estimating species trees from multiple gene trees in the presence of ILS Tandy Warnow Joint work with Siavash Mirarab, Md. S. Bayzid, and others.
Recent breakthroughs in mathematical and computational phylogenetics
CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
IPAM Workshop on Multiple Sequence Alignment Tandy Warnow The University of Illinois at Urbana-Champaign.
Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.
Introduction to Phylogenomics and Metagenomics Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Phylogenomics Symposium and Software School Co-Sponsored by the SSB and NSF grant
Computational Phylogenomics and Metagenomics Tandy Warnow Departments of Bioengineering and Computer Science The University of Illinois at Urbana-Champaign.
Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.
New techniques that “boost” methods for large-scale multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science.
CS 173, Lecture B August 25, 2015 Professor Tandy Warnow.
Ultra-large Multiple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign
CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012.
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science.
Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science The University of.
Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Species Tree.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
394C, October 2, 2013 Topics: Multiple Sequence Alignment
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
BBCA: Improving the scalability of *BEAST using random binning Tandy Warnow The University of Illinois at Urbana-Champaign Co-authors: Theo Zimmermann.
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
Family of HMMs Nam Nguyen University of Texas at Austin.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
The Mathematics of Estimating the Tree of Life Tandy Warnow The University of Illinois.
Three approaches to large- scale phylogeny estimation: SATé, DACTAL, and SEPP Tandy Warnow Department of Computer Science The University of Texas at Austin.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
TIPP: Taxon Identification using Phylogeny-Aware Profiles Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign.
CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science
Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.
Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Tandy Warnow Department of Computer Science The University of Texas at.
Approaching multiple sequence alignment from a phylogenetic perspective Tandy Warnow Department of Computer Sciences The University of Texas at Austin.
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
Progress and Challenges for Large-Scale Phylogeny Estimation Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for.
Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
Ensembles of HMMs and their use in biomolecular sequence analysis Nam-phuong Nguyen Carl R. Woese Institute for Genomic Biology University of Illinois.
Advancing Genome-Scale Phylogenomic Analysis Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for Genomic Biology.
Scaling BAli-Phy to Large Datasets June 16, 2016 Michael Nute 1.
CS 466 and BIOE 498: Introduction to Bioinformatics
Advances in Ultra-large Phylogeny Estimation
Multiple Sequence Alignment Methods
Techniques for MSA Tandy Warnow.
Algorithm Design and Phylogenomics
Mathematical and Computational Challenges in Reconstructing Evolution
Large-Scale Multiple Sequence Alignment
Mathematical and Computational Challenges in Reconstructing Evolution
CS 581 Algorithmic Computational Genomics
Tandy Warnow Founder Professor of Engineering
New methods for simultaneous estimation of trees and alignments
Texas, Nebraska, Georgia, Kansas
Ultra-Large Phylogeny Estimation Using SATé and DACTAL
CS 394C: Computational Biology Algorithms
Taxonomic identification and phylogenetic profiling
Algorithms for Inferring the Tree of Life
Sequence alignment CS 394C Tandy Warnow Feb 15, 2012.
New methods for simultaneous estimation of trees and alignments
Ultra-large Multiple Sequence Alignment
Advances in Phylogenomic Estimation
Advances in Phylogenomic Estimation
Scaling Species Tree Estimation to Large Datasets
Presentation transcript:

Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign

From the Tree of the Life Website, University of Arizona Phylogeny (evolutionary tree)

Two Two dimensions: number of genes and number of species

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly identify orthologs Compute multiple sequence alignments for each locus, and construct gene trees Compute species tree or network: – Combine the estimated gene trees, OR – Estimate a tree from a concatenation of the multiple sequence alignments Get statistical support on each branch (e.g., bootstrapping) Estimate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly identify orthologs Compute multiple sequence alignments for each locus, and construct gene trees Compute species tree or network: – Combine the estimated gene trees, OR – Estimate a tree from a concatenation of the multiple sequence alignments Get statistical support on each branch (e.g., bootstrapping) Estimate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

1KP: Thousand Transcriptome Project First publication: Wickett, Mirarab, et al., PNAS, 2014 Used SATé (Liu et al., Science 2009 and Syst Biol 2012) to compute multiple sequence alignments and trees Used ASTRAL (Mirarab et al., Bioinf 2014 and 2015) to compute the species tree G. Ka-Shu Wong U Alberta N. Wickett Northwestern J. Leebens-Mack U Georgia N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen UT-Austin UT-Austin UT-Austin Upcoming Challenge: Multiple sequence alignment and gene tree estimation on 100,000 sequences. Many sequences are highly fragmentary.

Multiple Sequence Alignment (MSA): a scientific grand challenge 1 S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- … Sn = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC … Sn = TCACGACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation 1 Frontiers in Massive Data Analysis, National Academies Press, 2013

Divide-and-Conquer Divide-and-conquer is a basic algorithmic trick for solving problems! Three steps: – divide a dataset into two or more sets, – solve the problem on each set, and – combine solutions.

Computational Phylogenetics (2005) Courtesy of the Tree of Life web project, tolweb.org Current methods can use months to estimate trees on 1000 DNA sequences Our objective: More accurate trees and alignments on 500,000 sequences in under a week

Computational Phylogenetics (2015) Courtesy of the Tree of Life web project, tolweb.org 2012: Computing accurate trees (almost) without multiple sequence alignments : Co-estimation of multiple sequence alignments and gene trees, now on 1,000,000 sequences in under two weeks : Species tree estimation from whole genomes in the presence of massive gene tree heterogeneity

…ACGGTGCAGTTACC-A… …AC----CAGTCACCTA… The true multiple alignment –Reflects historical substitution, insertion, and deletion events –Defined using transitive closure of pairwise alignments computed on edges of the true tree … ACGGTGCAGTTACCA … Substitution Deletion … ACCAGTCACCTA … Insertion

Phylogenetic Tree Estimation S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Alignment S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 2: Construct tree S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 S4 S2 S3

Quantifying Error FN: false negative (missing edge) FP: false positive (incorrect edge) FN FP 50% error rate

Evaluation of MSA methods (Science 2009) Alignment methods Clustal MAFFT Muscle Prank (PNAS 2005, Science 2008) Opal (ISMB and Bioinf. 2007) Phylogeny estimation: Maximum likelihood using RAxML Datasets: 1000-taxon simulated datasets under varying rates of evolution Biological datasets with structural alignments Liu et al., Science 2009

1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)

Observations Large datasets can be easy to align with high accuracy if there is not too much heterogeneity. Poor alignments produce poor trees.

Observations Highly accurate alignments are easy if the dataset is not too heterogeneous. We can use phylogenies to decompose datasets into smaller, less heterogeneous datasets.

Re-aligning on a tree A B D C Merge sub- alignments Estimate ML tree on merged alignment Decompose dataset AB CD Align subproblems AB CD ABCD

SATé and PASTA Input: set of unaligned sequences Output: multiple sequence alignment and tree SATé: Liu et al., Science 2009 (up to 10,000 sequences) and Systematic Biology 2012 (up to 50,000 sequences) PASTA: Mirarab et al., J. Comp Biol 2015 (up to 1,000,000 sequences)

SATé and PASTA Algorithms Tree Obtain initial alignment and estimated ML tree Use tree to compute new alignment

SATé and PASTA Algorithms Tree Obtain initial alignment and estimated ML tree Use tree to compute new alignment Alignment

SATé and PASTA Algorithms Estimate ML tree on new alignment Tree Obtain initial alignment and estimated ML tree Use tree to compute new alignment Alignment

SATé and PASTA Algorithms Estimate ML tree on new alignment Tree Obtain initial alignment and estimated ML tree Use tree to compute new alignment Alignment Repeat until termination condition, and return the alignment/tree pair with the best ML score

1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009) 24-hour SATé analysis, on desktop machines (Similar improvements for biological datasets) SATé: 24-hour co-estimation of highly accurate alignments and trees on 1000 sequences

(Liu et al., Syst Biol 61(1):90-106, 2012) SATé-2: even more accurate!

Simulated RNASim datasets from 10K to 200K taxa Limited to 24 hours using 12 CPUs Not all methods could run (missing bars could not finish) PASTA, Mirarab et al., J Comp Biol 22(5): (2015) PASTA: even more accurate, and can scale to 1,000,000 sequences

Main Points Innovative algorithm design can improve accuracy as well as reduce running time. Divide-and-conquer is a key algorithmic technique that has dramatically changed the toolkit for biologists!

Acknowledgments Funding: HHMI (to Siavash Mirarab) Guggenheim Foundation Packard Foundation NSF Microsoft Research New England David Bruton Jr. Centennial Professorship Grainger Foundation TACC (Texas Advanced Computing Center)