

0 Fast and Accurate Reconstruction of Evolutionary Trees: a Model-based Study Ming-Yang Kao Department of Computer Science Northwestern University Evanston, Illinois U. S. A.

1 Perspectives Use biology ideas to solve computer science problems. Use computer science tools to solve biology problems (this talk).

2 Use Biology to Solve CS Problems DNA computing, DNA self-assembly, genetic algorithms, neural networks, others.

3 Use CS to Solve Biology Problems Bioinformatics or computational biology: data mining (this talk). Related fields: computational neuroscience, computational ecology, medical informatics, … many more...

4 Example Research Areas of Bioinformatics DNA sequencing; DNA microarray analysis; DNA self-assembly for nano-structures; DNA word design; RNA secondary structure prediction; protein sequencing (my talk #4); proteomics; protein database search; protein sequence design (my talk #3); protein landscape analysis; phylogeny reconstruction (this talk); phylogeny comparison (my talk #1).

5 Evolutionary Trees Definition: a tree with distinct labels at leaves. Leaf labels: species, organisms, DNAs, RNAs, proteins, features, etc. (Figure: ancestral species at the internal nodes; present-day species bird, plum, peach, rice, wheat at the leaves. Just a joke!)

6 Evolutionary Trees Leaf labels: DNA sequences. (Figure: bird AAGT, plum CCAG, peach CCAT, rice CGGG, wheat CGGC. Just a joke!)

7 Problem Formulation Input: DNA sequences of present-day species. Output: the true evolutionary tree. Question: What is “true”? Need a model! (Figure: bird AAGT, plum CCAG, peach CCAT, rice CGGG, wheat CGGC. Just a joke!)

8 A Fundamental Problem of Biology Problem, since the time of Charles Darwin: reconstruct the evolutionary history of all known species. Importance: intellectually fascinating; practical benefits in medicine, food, … (Figure: Charles Robert Darwin, Origin of Species.)

9 Main Difficulties Availability of data: hundreds of millions of species, unlikely to be all available any time soon or ever. But DNA sequences of more and more species are becoming available. Extracting information from data: the focus of this talk.

10 Today’s Technical Focus Input: DNA sequences of present-day species. Output: the true evolutionary tree. Question: What is “true”? Need a model! Collaborators: Csuros & Kim. (Figure: bird AAGT, plum CCAG, peach CCAT, rice CGGG, wheat CGGC.)

11 Main Result An algorithm that constructs an evolutionary tree from biomolecular sequences with: provable high accuracy; short sequence length; optimal running time; optimal memory space.

12 Outline of Technical Discussion 1. Define the model of evolution. 2. Formulate the computational problem. 3. Discuss the theoretical performance of our algorithm. 4. Discuss the empirical performance. 5. Describe and analyze the algorithm. 6. Further research.

13 Outline of Technical Discussion (1) 1. Define the model of evolution. 2. Formulate the computational problem. 3. Discuss the theoretical performance of our algorithm. 4. Discuss the empirical performance. 5. Describe and analyze the algorithm. 6. Further research.

14 Model of Evolution: Intuitions Mutation occurs probabilistically. 1. edge length ~ time 2. edge length ~ mutation probability 3. edge length ~ dissimilarity (or distance) (Figure: a tree labeled with the sequences ACGTACT AGGAGAA CAGGAGTTTTAA AGTTCCT.)

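15 Jukes-Cantor Model of Evolution (1) Edge mutation probability: a character A at one end of an edge becomes X at the other end. No insertion or deletion. Example with edge mutation probability 0.6: X = A with probability 0.4; X = C, G, or T with probability 0.6/3 = 0.2 each.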

16 Jukes-Cantor Model of Evolution (2) Independent mutations along all edges. (Figure: a tree with node characters A, A, C, G, G.)

17 Jukes-Cantor Model of Evolution (3) i.i.d. mutations at every character. (Figure: a tree with node sequences AAGT, AGTT, CAGG, GGTG, GTTG.)
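
The three Jukes-Cantor slides above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the nested-tuple tree encoding, the helper names `mutate` and `evolve`, the toy topology, and the single mutation probability shared by all edges are hypothetical choices made for illustration, not details taken from the talk.

```python
import random

BASES = "ACGT"

def mutate(seq, p, rng):
    """Evolve seq along one edge: each character mutates independently
    with probability p to one of the other three bases (no indels)."""
    out = []
    for ch in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != ch]))
        else:
            out.append(ch)
    return "".join(out)

def evolve(tree, seq, p, rng, leaves=None):
    """tree is a leaf name (str) or a tuple of subtrees (assumed encoding).
    seq is the sequence at the current node; returns {leaf name: sequence}."""
    if leaves is None:
        leaves = {}
    if isinstance(tree, str):
        leaves[tree] = seq
    else:
        for child in tree:
            # Each edge to a child applies an independent round of mutation.
            evolve(child, mutate(seq, p, rng), p, rng, leaves)
    return leaves

rng = random.Random(0)
root = "".join(rng.choice(BASES) for _ in range(20))
toy_tree = ("bird", (("plum", "peach"), ("rice", "wheat")))  # toy topology
seqs = evolve(toy_tree, root, 0.1, rng)
for name, s in seqs.items():
    print(name, s)
```

Only the leaf sequences in `seqs` would be handed to a reconstruction algorithm; the root sequence and the topology stay hidden.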

18 Outline of Technical Discussion (2) 1. Define the model of evolution. 2. Formulate the computational problem. 3. Discuss the theoretical performance of our algorithm. 4. Discuss the empirical performance. 5. Describe and analyze the algorithm. 6. Further research.

19 Problem Formulation True tree (not known to the algorithm): pick any sequence for the root (also unknown to the algorithm), then generate the other sequences. Input: the sequences at the leaves, but not the other sequences, nor the tree. Output: the unrooted tree. (Figure: a tree labeled with the sequences AGTGT GGTAC CGTTT CAGGT GTACT TGGAC CAGGT CGTGTATCGT.)

20 Computational Objectives Input: DNA sequences Output: Minimize: running time memory space probability of incorrect output sample size, i.e., length of the input sequences

21 Outline of Technical Discussion (3) 1. Define the model of evolution. 2. Formulate the computational problem. 3. Discuss the theoretical performance of our algorithm. 4. Discuss the empirical performance. 5. Describe and analyze the algorithm. 6. Further research.

22 Triplets A triplet is a set of three leaves X, Y, Z. P is the center of XYZ. (Figure: leaves X, Y, Z joined at the internal node P.)

23 G-depth of Triplet # of edges between X and Y (Figure: triplet X, Y, Z with pairwise edge counts 5, 8, 7.)

24 G-depth of a Tree The smallest d such that the triplets of g-depth at most d cover the entire tree. The best case: g-depth = 4.

25 G-depth of a Tree The smallest d such that the triplets of g-depth at most d cover the entire tree. The worst case: g-depth = 2 log n.

26 G-depth of a Tree The smallest d such that the triplets of g-depth at most d cover the entire tree. It is at most 2 log n and can be O(1).

27 Our New Result (1)

28 Our New Result (2) polynomial sample size

29 Our New Result (3) polynomial sample size provable high accuracy

30 Our New Result (4) polynomial sample size provable high accuracy optimal time & space

31 Comparison with Previous Results this talk

32 Outline of Technical Discussion (4) 1. Define the model of evolution. 2. Formulate the computational problem. 3. Discuss the theoretical performance of our algorithm. 4. Discuss the empirical performance. 5. Describe and analyze the algorithm. 6. Further research.

33 Experimental Study Design Step 1 -- Pick a model tree T. Step 2 -- Use T to generate sequences. Step 3 -- Use an algorithm to reconstruct a tree T’ from the sequences (without knowing T). Step 4 -- Compare T’ and T.

34 Wrong and Right Edges (Figure: a true tree and a reconstructed tree on leaves X1 to X5, with the reconstructed edges marked good or bad.)
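
Step 4 of the study design (compare T’ and T) can be sketched by comparing the leaf bipartitions induced by internal edges. This is a sketch under assumptions: the slides do not say how "percentage of wrong edges" is computed, so the definition below (reconstructed internal edges whose bipartition is absent from the true tree) is a guess at the metric, and the two five-leaf trees are hypothetical rather than the ones drawn on the slide.

```python
def leaf_set(tree):
    """Leaves of a nested-tuple tree (a leaf is a str)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(c) for c in tree))

def splits(tree):
    """Non-trivial leaf bipartitions induced by the internal edges.
    Each split is stored as the side that excludes a fixed reference leaf,
    so the result does not depend on where the tuple encoding is rooted."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(c) for c in node))
        side = all_leaves - below if ref in below else below
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    walk(tree)
    return found

def wrong_edge_fraction(reconstructed, true):
    """Fraction of reconstructed internal edges not present in the true tree."""
    rec, tru = splits(reconstructed), splits(true)
    return len(rec - tru) / len(rec) if rec else 0.0

true_tree = (("X1", "X2"), "X3", ("X4", "X5"))   # hypothetical
recon_tree = (("X1", "X3"), "X2", ("X4", "X5"))  # hypothetical
print(wrong_edge_fraction(recon_tree, true_tree))
```

Here the edge separating {X4, X5} is shared by both trees, while the edge grouping X1 with X3 has no counterpart in the true tree, so one of the two internal edges counts as wrong.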

35 Experiment #1 the 135-taxon African-Eve tree (courtesy of Huson and Maddison) algorithms compared: HGT and bioNJ (Olivier Gascuel) parameters: sequence length and percentage of wrong edges edge mutation probabilities: between 0.47 and # of simulations = 20 per sequence length more experiments in progress

36 135-taxon African Eve Tree

37 Results of Experiment #1

38 Experiment #2 a 1892-taxon tree of eukaryotes algorithms compared: HGT and bioNJ parameters: sequence length and percentage of wrong edges edge mutation probabilities: between 0.47 and # of simulations = 20 per sequence length more experiments in progress several variants of the basic HGT

39 Results of Experiment #2

40 Results of Experiment #2

41 Results of Experiment #2

42 Outline of Technical Discussion (5) 1. Define the model of evolution. 2. Formulate the computational problem. 3. Discuss the theoretical performance of our algorithm. 4. Discuss the empirical performance. 5. Describe and analyze the algorithm. 6. Further research.

43 Our New Result (4) polynomial sample size provable high accuracy optimal time & space

44 Outline of Technical Discussion (5) 1. Describe the HGT algorithm. 2. Prove the sample size bound (and high probability for accuracy). 3. Prove the optimal time & space.

45 Outline of Technical Discussion (5/1) 1. Describe the HGT algorithm. 2. Prove the sample size bound (and high probability for accuracy). 3. Prove the optimal time & space.

46 Closeness and Distance of Two Leaves The larger the closeness, the more accurately we can estimate the distance. Closeness is multiplicative. Distance is additive!!! (Figure: leaves X and Y on a tree with node sequences AAGT, AGTT, CAGG, GGTG, GTTG.)

47 Closeness = Cube Root of Determinant (Figure: a matrix indexed by A, C, G, T for the sequence pair AAGT and CAGG.)
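
For the Jukes-Cantor model, the 4 × 4 transition matrix of an edge with mutation probability p has eigenvalue 1 once and 1 − 4p/3 three times, so the cube root of its determinant is 1 − 4p/3. The sketch below uses that closed form. Two parts are assumptions not stated on the slides: p is estimated as the fraction of mismatching positions, and distance is taken to be −ln(closeness), which is the natural way to turn a multiplicative quantity into an additive one.

```python
import math

def mismatch_fraction(s, t):
    """Fraction of positions where two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s, t)) / len(s)

def closeness_from_p(p):
    """Cube root of det of the Jukes-Cantor transition matrix: 1 - 4p/3."""
    return 1.0 - 4.0 * p / 3.0

def closeness(s, t):
    return closeness_from_p(mismatch_fraction(s, t))

def distance(s, t):
    """Assumed definition: -ln(closeness); infinite if closeness <= 0."""
    c = closeness(s, t)
    return math.inf if c <= 0 else -math.log(c)

def compose(p1, p2):
    """Mismatch probability across two consecutive Jukes-Cantor edges."""
    return p1 + p2 - 4.0 * p1 * p2 / 3.0

# The pair from the slide: 2 of 4 positions differ, so closeness = 1/3.
print(closeness("AAGT", "CAGG"), distance("AAGT", "CAGG"))

# Closeness multiplies along a path, so -ln(closeness) adds.
p = compose(0.1, 0.2)
print(closeness_from_p(p), closeness_from_p(0.1) * closeness_from_p(0.2))
```

With only four characters the estimate is very noisy, which is why closeness values near zero give unreliable distances and why sequence length matters for accuracy.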

48 Closeness of Triplet AAGT AGTT X CAGG GGTG Y GTTG Z The larger the closeness, the more accurately we can estimate the three pairwise distances.

49 Assemble Triplets Into Tree via Distance Additivity (I) XA Y b P a c XA Y 3 P 25 6
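
Distance additivity fixes the three edge lengths from the leaves to the center P once the pairwise distances are known: with a, b, c the distances from X, Y, Z to P, we have d(X,Y) = a + b, d(X,Z) = a + c, and d(Y,Z) = b + c. A minimal sketch of solving this system follows; the function name and the numbers in the example are illustrative only.

```python
def center_distances(d_xy, d_xz, d_yz):
    """Solve a+b=d_xy, a+c=d_xz, b+c=d_yz for the leaf-to-center lengths."""
    a = (d_xy + d_xz - d_yz) / 2  # X to P
    b = (d_xy + d_yz - d_xz) / 2  # Y to P
    c = (d_xz + d_yz - d_xy) / 2  # Z to P
    return a, b, c

# Pairwise distances 5, 8, 7 are consistent with edge lengths 3, 2, 5.
print(center_distances(5, 8, 7))
```

With estimated rather than exact distances the three values are only approximate, and a negative result signals that the estimates are not consistent with any triplet.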

50 Assemble Triplets Into Tree via Distance Additivity (II) XYA B B X X Y Y A Q P P Q

51 How to Choose Triplets to Minimize Errors? XZ Y 3 P 25 6 The larger the closeness, the more accurately we can estimate the three pairwise distances. Greedy Strategy! Harmonic Greedy Triplet (HGT)

52 Over-Simplified Outline of HGT Stage 1: T’ ← ABC with the largest closeness. Stage 2: Repeat the following steps until T’ contains all the leaves. Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not. Step 2(2): Incorporate XYZ into T’ to add Z into T’.
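
The greedy selection in Stage 1 and Step 2(1) can be sketched as below. Several things here are assumptions: triplet closeness is taken to be the harmonic mean of the three pairwise closeness values, which the name "Harmonic Greedy Triplet" suggests but these slides do not spell out; Step 2(2), the actual insertion of Z into T’, is omitted; and the brute-force search is for illustration only and is not the optimal-time procedure the talk describes. The function names and the closeness table are hypothetical.

```python
from itertools import combinations

def triplet_closeness(c, x, y, z):
    """Assumed definition: harmonic mean of the three pairwise closenesses."""
    pairs = [c[frozenset(p)] for p in ((x, y), (x, z), (y, z))]
    return 3.0 / sum(1.0 / v for v in pairs)

def greedy_triplet_order(leaves, c):
    """Return the triplets in the order a greedy pass would pick them."""
    # Stage 1: start from the triplet with the largest closeness.
    first = max(combinations(leaves, 3), key=lambda t: triplet_closeness(c, *t))
    in_tree = list(first)
    picked = [first]
    remaining = [leaf for leaf in leaves if leaf not in first]
    # Stage 2: repeatedly add the leaf Z reachable by the closest triplet XYZ
    # with X and Y already in the tree.
    while remaining:
        best = max(
            ((x, y, z) for x, y in combinations(in_tree, 2) for z in remaining),
            key=lambda t: triplet_closeness(c, *t),
        )
        picked.append(best)
        in_tree.append(best[2])
        remaining.remove(best[2])
    return picked

# Hypothetical pairwise closeness values for four leaves.
c = {
    frozenset("AB"): 0.81, frozenset("AC"): 0.504, frozenset("AD"): 0.36,
    frozenset("BC"): 0.504, frozenset("BD"): 0.36, frozenset("CD"): 0.35,
}
order = greedy_triplet_order(["A", "B", "C", "D"], c)
print(order)
```

The list returned records which triplet brings each new leaf into the tree; a full implementation would also use the triplet's estimated distances to decide where the new leaf attaches.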

53 Outline of Technical Discussion (5/2) 1. Describe the HGT algorithm. 2. Prove the sample size bound (and high probability for accuracy). 3. Prove the optimal time & space.

54 Our New Result (4/1) polynomial sample size provable high accuracy

55 Over-Simplified Outline of HGT Stage 1: T’ ← ABC with the largest closeness. Stage 2: Repeat the following steps until T’ contains all the leaves. Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not. Step 2(2): Incorporate XYZ into T’ to add Z into T’.

56 Polynomial Sequence Length (1) Lemma 1: The g-depth of the last triplet used in HGT is the g-depth of the true tree T. Proof: The closeness of the last triplet is the largest closeness such that the triplets with the same or larger closeness cover the true tree T. Its g-depth is the smallest g-depth such that the triplets with the same or smaller g-depths cover the true tree T. (Figure labels: larger, smaller.)

57 Polynomial Sequence Length (2) g-depth of tree Lemma 1: The g-depth of the last triplet used in HGT is the g-depth of the true tree T. Lemma 2: sequence length needed where XYZ is the last triplet used.

58 Outline of Technical Discussion (5/3) 1. Describe the HGT algorithm. 2. Prove the sample size bound (and high probability for accuracy). 3. Prove the optimal time & space.

59 Our New Result (4/2) optimal time & space

60 Over-Simplified Outline of HGT Stage 1: T’ ← ABC with the largest closeness. Stage 2: Repeat the following steps until T’ contains all the leaves. Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not. Step 2(2): Incorporate XYZ into T’ to add Z into T’.

61 Optimal Time/Space for the First Triplet Stage 1: Fix an arbitrary leaf A. T’ ← ABC with the largest closeness. Stage 2: Repeat the following steps until T’ contains all the leaves. Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not. Step 2(2): Incorporate XYZ into T’.

62 Optimal Time/Space for the Other Leaves partially reconstructed tree not yet recovered Y X Z XYZ A B C ABC P Q only need to consider the triplets formed by one of X, Y, one of B, C, and one of

63 Outline of Technical Discussion (6) 1. Define the model of evolution. 2. Formulate the computational problem. 3. Discuss the theoretical performance of our algorithm. 4. Discuss the empirical performance. 5. Describe and analyze the algorithm. 6. Further research.

64 Further Research More general models of evolution. Practical implementations.

65 Main Difficulties Availability of data: hundreds of millions of species, unlikely to be all available any time soon or ever. But DNA sequences of more and more species are becoming available. Extracting information from data: the focus of this talk.

66 Beyond All Computational Considerations Do the genomes of all green plants contain enough information for the reconstruction of their evolutionary tree? (genome size of eukaryotes: base pairs; # of green plant species: several) If so, does this impose any necessary structure on the information or the tree? If so, how do we determine and use that structure? What do you think? The End. Thank You!

67 Data Mining Flowchart true tree (unknown) collect & process individual sequences compare & align multiple sequences tree reconstruction algorithms tree verification (compare & refine) evolution models generate sequences further process parameters distance or characters trees information refine infer today’s focus parameters