Three approaches to large-scale phylogeny estimation: SATé, DACTAL, and SEPP
Tandy Warnow, Department of Computer Science, The University of Texas at Austin

Presentation transcript:


[Figure: DNA sequences AGAT, TAGACTTTGCACAATGCGCTT, AGGGCATGA; taxon labels U, V, W, X, Y]

Input: Unaligned Sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Multiple Sequence Alignment
Unaligned input:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Aligned:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Phase 2: Construct Tree
Unaligned input:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Aligned:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: tree with leaves S1, S4, S2, S3]

Quantifying Error
– FN: false negative (missing edge)
– FP: false positive (incorrect edge)
Example shown: 50% error rate
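The FN and FP rates can be sketched in a few lines if each tree is represented by the set of bipartitions (splits) its internal edges induce. This is a minimal illustration, not code from the talk: the function names `split` and `error_rates`, the five-taxon trees, and the choice to normalize FN by the true tree's edges and FP by the estimated tree's edges are all assumptions made for the example.

```python
def split(side, taxa):
    """Canonical form of a bipartition: the side containing the smallest taxon."""
    side = frozenset(side)
    other = frozenset(taxa) - side
    return side if min(taxa) in side else other


def error_rates(true_splits, est_splits):
    """Return (FN rate, FP rate) for two sets of bipartitions."""
    missing = true_splits - est_splits      # true edges absent from the estimate
    incorrect = est_splits - true_splits    # estimated edges absent from the truth
    fn_rate = len(missing) / len(true_splits) if true_splits else 0.0
    fp_rate = len(incorrect) / len(est_splits) if est_splits else 0.0
    return fn_rate, fp_rate


taxa = "ABCDE"
# Hypothetical true tree ((A,B),C,(D,E)): internal edges AB|CDE and DE|ABC
true_splits = {split("AB", taxa), split("DE", taxa)}
# Hypothetical estimated tree ((A,B),D,(C,E)): internal edges AB|CDE and CE|ABD
est_splits = {split("AB", taxa), split("CE", taxa)}

fn_rate, fp_rate = error_rates(true_splits, est_splits)
print(fn_rate, fp_rate)
```

In this toy case one of the two true edges is missing and one of the two estimated edges is wrong, so both rates are one half.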

1000 taxon models, ordered by difficulty (Liu et al., 2009)

Problems
– Large datasets with high rates of evolution are hard to align accurately, and phylogeny estimation methods produce poor trees when alignments are poor.
– Many phylogeny estimation methods have poor accuracy on large datasets (even if given correct alignments).
– Potentially useful genes are often discarded if they are difficult to align.
These issues seriously impact large-scale phylogeny estimation (and Tree of Life projects).

Co-estimation methods
– POY, and other “treelength” methods, are controversial. Liu and Warnow (PLoS One, 2012) showed that although the gap penalty affects tree accuracy, even with a good affine gap penalty, treelength optimization gives poorer accuracy than maximum likelihood on good alignments.
– Likelihood-based methods built on statistical models of evolution that include indels as well as substitutions (BAli-Phy, Alifritz, StatAlign, and others) offer potential improvements in accuracy. These target small datasets (at most a few hundred sequences). BAli-Phy is the fastest of these methods.

This talk
– SATé: Simultaneous Alignment and Tree Estimation
– DACTAL: Divide-and-conquer trees (almost) without alignments
– SEPP: SATé-enabled phylogenetic placement (analyses of large numbers of fragmentary sequences)

Part I: SATé
Simultaneous Alignment and Tree Estimation (for nucleotide or amino-acid analysis)
References: Liu, Nelesen, Raghavan, Linder, and Warnow, Science, 2009; Liu et al., Systematic Biology, 2012
Public software distribution (open source) through the University of Kansas (Mark Holder)

1000 taxon models, ordered by difficulty (Liu et al., 2009)

SATé Algorithm
– Obtain initial alignment and estimated ML tree
– Use tree to compute new alignment
– Estimate ML tree on new alignment
The last two steps form a cycle: each new tree is used to compute the next alignment.
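The alternation between alignment and tree estimation can be sketched as a loop over pluggable steps. This is only an outline of the control flow under stated assumptions: `iterate_alignment_and_tree` and its parameters are hypothetical names, the fixed iteration count is an assumption, and the toy stand-ins below (gap padding, a name ordering in place of a tree) are placeholders, not real alignment or maximum-likelihood methods.

```python
def iterate_alignment_and_tree(seqs, initial_align, estimate_tree,
                               realign_on_tree, iterations=3):
    """Alternate re-alignment and tree estimation; return every (alignment, tree)."""
    alignment = initial_align(seqs)           # initial alignment
    tree = estimate_tree(alignment)           # initial tree on that alignment
    history = [(alignment, tree)]
    for _ in range(iterations):
        alignment = realign_on_tree(seqs, tree)   # use tree to compute new alignment
        tree = estimate_tree(alignment)           # estimate tree on new alignment
        history.append((alignment, tree))
    return history


# Toy stand-ins so the sketch runs; none of these is a real method.
def pad_align(seqs):
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}


def toy_tree(alignment):
    # Placeholder "tree": taxon names ordered by gap count, NOT an ML estimate.
    return tuple(sorted(alignment, key=lambda n: (alignment[n].count("-"), n)))


def toy_realign(seqs, tree):
    return pad_align(seqs)


seqs = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "S4": "TCACGACCGACA",
}
history = iterate_alignment_and_tree(seqs, pad_align, toy_tree, toy_realign,
                                     iterations=2)
print(len(history), history[-1][1])
```

In practice the three callables would wrap external alignment and tree-estimation tools; the loop itself stays the same.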

Re-aligning on a Tree
– Decompose dataset: the tree on A, B, C, D is split into subsets AB and CD
– Align subproblems: AB and CD
– Merge sub-alignments: ABCD
– Estimate ML tree on merged alignment
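The decomposition step can be illustrated with a small sketch that removes the edge splitting the leaf set most evenly. The slide does not say how the edge is chosen, so the balanced ("centroid") edge rule here is an assumption, as are the function names and the adjacency-dictionary tree with hypothetical internal nodes `u` and `v`.

```python
def leaves_on_side(adj, start, blocked):
    """Leaves reachable from `start` without crossing the edge to `blocked`."""
    seen, stack, leaves = {start, blocked}, [start], set()
    while stack:
        node = stack.pop()
        if len(adj[node]) == 1:          # degree-1 nodes are leaves
            leaves.add(node)
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return leaves


def centroid_split(adj):
    """Split the leaf set by deleting the edge that balances the two sides best."""
    all_leaves = {n for n in adj if len(adj[n]) == 1}
    best_side, best_imbalance = None, None
    for u in adj:
        for v in adj[u]:
            if u < v:                    # visit each undirected edge once
                side = leaves_on_side(adj, u, v)
                imbalance = abs(len(all_leaves) - 2 * len(side))
                if best_imbalance is None or imbalance < best_imbalance:
                    best_side, best_imbalance = side, imbalance
    return best_side, all_leaves - best_side


# Tree ((A,B),(C,D)) with internal nodes u and v
adj = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"], "v": ["C", "D", "u"],
}
left, right = centroid_split(adj)
print(sorted(left), sorted(right))
```

Each resulting subset would then be aligned separately and the sub-alignments merged, as in the steps above.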

1000 taxon models, ordered by difficulty (Liu et al., 2009)

1000 taxon models ranked by difficulty

SATé Summary
– Improved tree and alignment accuracy compared to two-phase methods, on both simulated and biological data.
– Public software distribution (open source) through the University of Kansas (Mark Holder)
– Workshops Monday and Tuesday
References: Liu, Nelesen, Raghavan, Linder, and Warnow, Science, 2009; Liu et al., Systematic Biology, 2012

Limitations
[Figure: the re-alignment cycle: decompose dataset (A, B, C, D into AB and CD), align subproblems, merge sub-alignments (ABCD), estimate ML tree on merged alignment]

Part II: DACTAL (Divide-And-Conquer Trees (Almost) without alignments)
– Input: set S of unaligned sequences
– Output: tree on S (but no alignment)
Nelesen, Liu, Wang, Linder, and Warnow, In Press, ISMB 2012 and Bioinformatics 2012

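DACTAL
– Unaligned sequences are decomposed into overlapping subsets (BLAST-based decomposition)
– A tree is estimated for each subset with an existing method: RAxML(MAFFT)
– The subset trees are combined into a tree for the entire dataset with a new supertree method: SuperFine
– pRecDCM3 decomposes the current tree into overlapping subsets for the next iteration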
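The data flow of this pipeline (overlapping subsets, one tree per subset, a combined tree, then a new decomposition) can be sketched with pluggable steps. Everything here is a toy under stated assumptions: `dactal_like`, `overlapping_windows`, and `toy_supertree` are hypothetical names, sliding windows stand in for the BLAST-based and pRecDCM3 decompositions, and a name ordering stands in for real subset trees and for SuperFine.

```python
def overlapping_windows(ordered_taxa, size, overlap):
    """Toy decomposition: sliding windows that share `overlap` taxa."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    subsets = []
    for start in range(0, len(ordered_taxa), step):
        subsets.append(ordered_taxa[start:start + size])
        if start + size >= len(ordered_taxa):
            break                         # last window reached the end
    return subsets


def dactal_like(taxa, decompose, subset_tree, supertree, iterations):
    """Iterate: decompose, estimate a tree per subset, combine the subset trees."""
    tree = None
    for _ in range(iterations):
        subsets = decompose(taxa, tree)   # first pass has no tree to guide it
        subtrees = [subset_tree(s) for s in subsets]
        tree = supertree(subtrees)
    return tree


def toy_supertree(subtrees):
    # Placeholder merge: concatenate names in first-seen order, NOT a supertree method.
    merged = []
    for subtree in subtrees:
        for name in subtree:
            if name not in merged:
                merged.append(name)
    return tuple(merged)


taxa = list("abcdefg")
windows = overlapping_windows(taxa, size=4, overlap=2)
result = dactal_like(
    taxa,
    decompose=lambda t, tree: overlapping_windows(list(tree) if tree else t, 4, 2),
    subset_tree=tuple,                    # placeholder for a per-subset tree estimate
    supertree=toy_supertree,
    iterations=2,
)
print(windows, result)
```

The overlap between neighboring subsets is what lets the subset trees be combined; disjoint subsets would give the merge step nothing to anchor on.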

DACTAL: Better results than 2-phase methods
– Three 16S datasets from Gutell’s database (CRW) with 6,323 to 27,643 sequences
– Reference alignments based on secondary structure
– Reference trees are 75% RAxML bootstrap trees
– DACTAL (shown in red) run for 5 iterations starting from FT(Part)
– FastTree (FT) and RAxML are ML methods

DACTAL is Flexible
Any tree estimation method, any kind of data
[Figure: unaligned sequences, overlapping subsets, a tree for each subset, a tree for the entire dataset]
– non-homogeneous models
– heterogeneous data

Part III: SEPP
SEPP: SATé-enabled phylogenetic placement
Mirarab, Nguyen, and Warnow. Pacific Symposium on Biocomputing, 2012.

NGS and metagenomic data
Fragmentary data (e.g., short reads):
– How to align? How to insert into trees?
Unknown taxa:
– How to identify the species, genus, family, etc.?

Phylogenetic Placement
– Input: Backbone alignment and tree on full-length sequences, and a set of query sequences (short fragments)
– Output: Placement of query sequences on backbone tree

Align Sequence
[Figure: backbone tree with leaves S1, S4, S2, S3]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC

Align Sequence
[Figure: backbone tree with leaves S1, S4, S2, S3]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = T-A--AAAC

Place Sequence
[Figure: backbone tree with leaves S1, S4, S2, S3 and query Q1 placed on it]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = T-A--AAAC

Phylogenetic Placement
Align each query sequence to backbone alignment
– HMMALIGN (Eddy, Bioinformatics 1998)
– PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
Place each query sequence into backbone tree
– pplacer (Matsen et al., BMC Bioinformatics, 2011)
– EPA (Berger and Stamatakis, Systematic Biology 2011)
Note: pplacer and EPA use maximum likelihood

HMMER vs. PaPaRa Alignments
[Figure; x-axis: increasing rate of evolution]

SEPP Key insight: HMMs are not very good at modelling MSAs on large, divergent datasets. Approach: insert fragments into taxonomy using estimated alignment of full-length sequences, and multiple HMMs (on different subsets of taxa).
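The idea of using several models built on different subsets of taxa, and letting each fragment go to the one that fits it best, can be sketched with a much simpler model than a profile HMM. In this illustration a per-column base-frequency profile with ungapped scoring stands in for the HMM (an assumption made purely to keep the example short), and the function names, subset names, and sequences are all hypothetical.

```python
import math
from collections import Counter


def build_profile(rows, pseudocount=1.0):
    """Per-column log-probabilities over A, C, G, T (gap characters are ignored)."""
    profile = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c in "ACGT")
        total = sum(counts.values()) + 4 * pseudocount
        profile.append(
            {b: math.log((counts[b] + pseudocount) / total) for b in "ACGT"}
        )
    return profile


def best_score(profile, fragment):
    """Best ungapped log-score of the fragment over all start columns."""
    n_starts = len(profile) - len(fragment) + 1
    return max(
        sum(profile[start + i][base] for i, base in enumerate(fragment))
        for start in range(n_starts)
    )


def pick_subset(subsets, fragment):
    """Score the fragment against one profile per subset; return the best subset."""
    scores = {
        name: best_score(build_profile(rows), fragment)
        for name, rows in subsets.items()
    }
    return max(scores, key=scores.get), scores


# Two hypothetical subsets of an aligned backbone
subsets = {
    "left": ["ACGTACGT", "ACGTACGA"],
    "right": ["TTGGCCAA", "TTGGCCAT"],
}
chosen, scores = pick_subset(subsets, "GGCC")
print(chosen)
```

A model trained on a small, closely related subset fits its own sequences more tightly than one model trained on everything, which is the motivation the slide gives for using multiple HMMs.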

SEPP: SATé-enabled Phylogenetic Placement

SEPP (10%-rule) on Simulated Data
[Figure; x-axis: increasing rate of evolution]

SEPP (10%) on Biological Data
16S.B.ALL dataset, 13k curated backbone tree, 13k total fragments
For 1 million fragments:
– PaPaRa+pplacer: ~133 days
– HMMALIGN+pplacer: ~30 days
– SEPP 1000/1000: ~6 days


Summary
– SATé improves accuracy for large-scale alignment and tree estimation
– DACTAL enables phylogeny estimation for very large datasets, and may be robust to model violations
– SEPP is useful for phylogenetic placement of short reads
Main observation: divide-and-conquer can make base methods more accurate (and maybe even faster)

References
For papers, see and note numbers listed below
– SATé: Science 2009 (papers #89, #99)
– DACTAL: To appear, Bioinformatics (special issue for ISMB 2012)
– SEPP: Pacific Symposium on Biocomputing (#104)
For software, see

Research Projects
Theory:
– Phylogenetic estimation under statistical models
Method development:
– “Absolute fast converging” methods
– Very large-scale multiple sequence alignment and phylogeny estimation
– Estimating species trees and networks from gene trees
– Supertree methods
– Comparative genomics (genome rearrangement phylogenetics)
– Metagenomic taxon identification
– Alignment and Phylogenetic Placement of NGS data
Dataset analyses:
– Avian Phylogeny: 50 species and genes
– Thousand Transcriptome (1KP) Project: 1000 species and 1000 genes
– Chloroplast genomics

Acknowledgments
Guggenheim Foundation Fellowship, Microsoft Research New England, National Science Foundation: Assembling the Tree of Life (ATOL), ITR, and IGERT grants, and David Bruton Jr. Professorship
Collaborators:
– SATé: Mark Holder, Randy Linder, Kevin Liu, Siavash Mirarab, Serita Nelesen, Sindhu Raghavan, Li-San Wang, and Jiaye Yu
– DACTAL: Serita Nelesen, Kevin Liu, Li-San Wang, and Randy Linder
– SEPP/TIPP: Siavash Mirarab and Nam Nguyen
Software: see