
1 0 Fast and Accurate Reconstruction of Evolutionary Trees: a Model-based Study Ming-Yang Kao Department of Computer Science Northwestern University Evanston, Illinois U. S. A.

2 1 Perspectives: (1) Use biology ideas to solve computer science problems. (2) Use computer science tools to solve biology problems (this talk).

3 2 Use Biology to Solve CS Problems: DNA computing; DNA self-assembly; genetic algorithms; neural networks; others.

4 3 Use CS to Solve Biology Problems: bioinformatics or computational biology; data mining (this talk). Related fields: computational neuroscience, computational ecology, medical informatics, and many more.

5 4 Example Research Areas of Bioinformatics: DNA sequencing; DNA microarray analysis; DNA self-assembly for nano-structures; DNA word design; RNA secondary structure prediction; protein sequencing (my talk #4); proteomics; protein database search; protein sequence design (my talk #3); protein landscape analysis; phylogeny reconstruction (this talk); phylogeny comparison (my talk #1).

6 5 Evolutionary Trees. Definition: a tree with distinct labels at its leaves. Leaf labels: species, organisms, DNAs, RNAs, proteins, features, etc. [Figure: a tree with ancestral species at the internal nodes and the present-day species bird, plum, peach, rice, wheat at the leaves. (Just a joke!)]

7 6 Evolutionary Trees. Leaf labels: DNA sequences. [Figure: the tree with leaves bird, plum, peach, rice, wheat and the DNA sequences AAGT, CCAG, CCAT, CGGG, CGGC. (Just a joke!)]

8 7 Problem Formulation. Input: DNA sequences of present-day species. Output: the true evolutionary tree. Question: What is “true”? Need a model! [Figure: leaves bird, plum, peach, rice, wheat with the DNA sequences AAGT, CCAG, CCAT, CGGG, CGGC. (Just a joke!)]

9 8 A Fundamental Problem of Biology. Problem (since the time of Charles Darwin): reconstruct the evolutionary history of all known species. Importance: intellectually fascinating; practical benefits in medicine, food, and more. [Caption: Charles Robert Darwin, 1809-1882; Origin of Species, 1859.]

10 9 Main Difficulties. (1) Availability of data: there are hundreds of millions of species, unlikely to be all available any time soon or ever, but DNA sequences of more and more species are becoming available. (2) Extracting information from data: the focus of this talk.

11 10 Today’s Technical Focus. Input: DNA sequences of present-day species. Output: the true evolutionary tree. Question: What is “true”? Need a model! Collaborators: Csuros & Kim. [Figure: leaves bird, plum, peach, rice, wheat with the DNA sequences AAGT, CCAG, CCAT, CGGG, CGGC.]

12 11 Main Result: an algorithm that constructs an evolutionary tree from biomolecular sequences, with provable high accuracy, short sequence length, optimal running time, and optimal memory space.

13 12 Outline of Technical Discussion: 1. Define the model of evolution. 2. Formulate the computational problem. 3. Discuss the theoretical performance of our algorithm. 4. Discuss the empirical performance. 5. Describe and analyze the algorithm. 6. Further research.

14 13 Outline of Technical Discussion (1): 1. Define the model of evolution.

15 14 Model of Evolution: Intuitions. Mutation occurs probabilistically. 1. edge length ~ time 2. edge length ~ mutation probability 3. edge length ~ dissimilarity (or distance). [Figure: a tree whose nodes carry the example sequences ACGTACT, AGGAGAA, CAGGAGTTTTAA, AGTTCCT.]

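16 15 Jukes-Cantor Model of Evolution (1): Edge Mutation Probability. [Figure: an edge from a node labeled A to a node labeled X.] No insertion or deletion. With edge mutation probability 0.6: X = A with probability 1 - 0.6 = 0.4; X = C, G, or T, each with probability 0.6/3 = 0.2.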

17 16 Jukes-Cantor Model of Evolution (2): Independent Mutations along All Edges. [Figure: a tree with node labels A, A, C, G, G and edge mutation probabilities 0.2, 0.7, 0.65, 0.6.]

18 17 Jukes-Cantor Model of Evolution (3): i.i.d. mutations at every character. [Figure: the same tree with node sequences AAGT, AGTT, CAGG, GGTG, GTTG and edge mutation probabilities 0.2, 0.7, 0.65, 0.6.]
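
A minimal Python sketch of the Jukes-Cantor edge mutation described on these three slides, using the 0.6 example probability. The function names `mutate_site` and `mutate_sequence` are illustrative and do not come from the talk; the sketch assumes characters mutate independently and that there are no insertions or deletions.

```python
import random

ALPHABET = "ACGT"

def mutate_site(base, p, rng):
    """Keep `base` with probability 1 - p; otherwise switch to one of the
    three other bases, each with probability p / 3."""
    if rng.random() < p:
        return rng.choice([b for b in ALPHABET if b != base])
    return base

def mutate_sequence(seq, p, rng):
    """Apply the same edge mutation probability independently at every character."""
    return "".join(mutate_site(b, p, rng) for b in seq)

rng = random.Random(0)
parent = "A" * 100_000
child = mutate_sequence(parent, 0.6, rng)

# Empirical fraction of characters that stayed 'A'; the model predicts about 0.4.
frac_unchanged = child.count("A") / len(child)
print(round(frac_unchanged, 2))
```

Because the sequence length never changes, the child can be compared with the parent position by position, which is what the later distance estimates rely on.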

19 18 Outline of Technical Discussion (2): 2. Formulate the computational problem.

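20 19 Problem Formulation. True tree (not known to the algorithm): pick any sequence for the root (also unknown to the algorithm), then generate the other sequences along the edges. Input: the sequences at the leaves, but not the other sequences, nor the tree. Output: the unrooted tree. [Figure: example tree with sequences AGTGT, GGTAC, CGTTT, CAGGT, GTACT, TGGAC, CAGGT, CGTGTATCGT and edge mutation probabilities 0.2, 0.6, 0.7, 0.3, 0.2, 0.5, 0.7, 0.1.]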
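
A hedged sketch of how the input could be generated under this formulation: start from a root sequence, mutate along every edge, and hand only the leaf sequences to the reconstruction algorithm. The tree shape, node names, and the helper `generate_leaf_sequences` are made up for illustration; only the root sequence AGTGT and the edge probabilities 0.2, 0.6, 0.7, 0.3 are borrowed from the slide's example values.

```python
import random

def mutate(seq, p, rng):
    # Jukes-Cantor edge: each character changes with probability p,
    # to one of the three other bases chosen uniformly.
    out = []
    for base in seq:
        if rng.random() < p:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

def generate_leaf_sequences(children, edge_prob, root, root_seq, rng):
    """Evolve root_seq down a rooted tree and return only the leaf sequences.

    children:  dict mapping a node to the list of its children
    edge_prob: dict mapping (parent, child) to that edge's mutation probability
    """
    seqs = {root: root_seq}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            seqs[child] = mutate(seqs[node], edge_prob[(node, child)], rng)
            stack.append(child)
    # The algorithm sees the leaves only, not internal sequences or the tree.
    return {n: s for n, s in seqs.items() if not children.get(n)}

# Hypothetical tree: root r with children u and c; u has leaves a and b.
children = {"r": ["u", "c"], "u": ["a", "b"]}
edge_prob = {("r", "u"): 0.2, ("r", "c"): 0.6, ("u", "a"): 0.7, ("u", "b"): 0.3}
leaves = generate_leaf_sequences(children, edge_prob, "r", "AGTGT", random.Random(1))
print(leaves)
```

Rooting the tree is only a convenience for generation; since the root is unknown to the algorithm, the output it is asked for is the unrooted tree.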

21 20 Computational Objectives. Input: DNA sequences. Output: the evolutionary tree. Minimize: running time; memory space; probability of incorrect output; sample size, i.e., length of the input sequences.

22 21 Outline of Technical Discussion (3): 3. Discuss the theoretical performance of our algorithm.

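23 22 Triplets. A triplet is a set of three leaves X, Y, Z. The center P of triplet XYZ is the internal node where the paths between X, Y, and Z meet. [Figure: leaves X, Y, Z joined at P.]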

24 23 G-depth of Triplet. [Figure: a triplet X, Y, Z annotated with the number of edges between pairs of leaves (e.g., between X and Y): 5, 8, 7.]

25 24 G-depth of a Tree: the smallest d such that the triplets of g-depth at most d cover the entire tree. Best case (shown): g-depth = 4.

26 25 G-depth of a Tree: the smallest d such that the triplets of g-depth at most d cover the entire tree. Worst case (shown): g-depth = 2 log n.

27 26 G-depth of a Tree: the smallest d such that the triplets of g-depth at most d cover the entire tree. It is at most 2 log n and can be O(1).

28 27 Our New Result (1)

29 28 Our New Result (2) polynomial sample size

30 29 Our New Result (3) polynomial sample size provable high accuracy

31 30 Our New Result (4) polynomial sample size provable high accuracy optimal time & space

32 31 Comparison with Previous Results this talk

33 32 Outline of Technical Discussion (4): 4. Discuss the empirical performance.

34 33 Experimental Study Design Step 1 -- Pick a model tree T. Step 2 -- Use T to generate sequences. Step 3 -- Use an algorithm to reconstruct a tree T’ from the sequences (without knowing T). Step 4 -- Compare T’ and T.

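35 34 Wrong and Right Edges. [Figure: the true tree on leaves X1, X2, X4, X3, X5 next to a reconstructed tree on leaves X3, X2, X4, X1, X5; edges of the reconstructed tree are marked "bad" (wrong) or "good" (right).]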
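
One common way to make "wrong edge" precise, assumed here because the slide only shows a picture, is to compare the leaf bipartitions (splits) induced by internal edges: an edge of the reconstructed tree is wrong if its split does not occur in the true tree. The sketch below computes the percentage-of-wrong-edges measure under that assumption; the two five-leaf trees and all function names are hypothetical.

```python
def adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def leaf_splits(adj, leaves):
    """Return the nontrivial leaf bipartitions induced by the tree's edges."""
    leaves = frozenset(leaves)

    def leaves_beyond(node, came_from):
        # Leaves reachable from `node` without going back through `came_from`.
        found = {node} if node in leaves else set()
        for nxt in adj[node]:
            if nxt != came_from:
                found |= leaves_beyond(nxt, node)
        return found

    result = set()
    for u in adj:
        for v in adj[u]:
            part = frozenset(leaves_beyond(v, u))
            rest = leaves - part
            if len(part) > 1 and len(rest) > 1:  # skip edges that touch a leaf
                result.add(frozenset((part, rest)))
    return result

def wrong_edge_fraction(true_adj, recon_adj, leaves):
    """Fraction of internal edges of the reconstruction absent from the true tree."""
    recon = leaf_splits(recon_adj, leaves)
    if not recon:
        return 0.0
    return len(recon - leaf_splits(true_adj, leaves)) / len(recon)

leaves = ["X1", "X2", "X3", "X4", "X5"]
true_tree = adjacency([("X1", "P"), ("X2", "P"), ("P", "Q"), ("X3", "Q"),
                       ("Q", "R"), ("X4", "R"), ("X5", "R")])
recon_tree = adjacency([("X1", "P"), ("X3", "P"), ("P", "Q"), ("X2", "Q"),
                        ("Q", "R"), ("X4", "R"), ("X5", "R")])
fraction = wrong_edge_fraction(true_tree, recon_tree, leaves)
print(fraction)  # one of the two internal edges is wrong
```

The recursive traversal is fine for small examples; trees with thousands of taxa would need an iterative traversal to avoid Python's recursion limit.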

36 35 Experiment #1. Tree: the 135-taxon African-Eve tree (courtesy of Huson and Maddison). Algorithms compared: HGT and bioNJ (Olivier Gascuel). Parameters: sequence length and percentage of wrong edges. Edge mutation probabilities: between 0.088 and 0.47. Number of simulations: 20 per sequence length. More experiments in progress.

37 36 135-taxon African Eve Tree

38 37 Results of Experiment #1

39 38 Experiment #2. Tree: a 1892-taxon tree of eukaryotes. Algorithms compared: HGT (several variants of the basic HGT) and bioNJ. Parameters: sequence length and percentage of wrong edges. Edge mutation probabilities: between 0.088 and 0.47. Number of simulations: 20 per sequence length. More experiments in progress.

40 39 Results of Experiment #2

41 40 Results of Experiment #2

42 41 Results of Experiment #2

43 42 Outline of Technical Discussion (5): 5. Describe and analyze the algorithm.

44 43 Our New Result (4) polynomial sample size provable high accuracy optimal time & space

45 44 Outline of Technical Discussion (5): 1. Describe the HGT algorithm. 2. Prove the sample size bound (and high probability for accuracy). 3. Prove the optimal time & space.

46 45 Outline of Technical Discussion (5/1): 1. Describe the HGT algorithm.

47 46 Closeness and Distance of Two Leaves. The larger the closeness, the more accurately we can estimate the distance. Closeness is multiplicative; distance is additive. [Figure: tree with node sequences AAGT, AGTT, CAGG, GGTG, GTTG, two leaves marked X and Y, and edge mutation probabilities 0.2, 0.7, 0.65.]

48 47 Closeness = Cubic Root of Determinant. [Figure: sequences AAGT and CAGG with a matrix whose rows and columns are indexed by A, C, G, T.]
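
A sketch of why "cubic root of determinant" fits the Jukes-Cantor model, assuming the determinant is that of an edge's 4x4 transition matrix (the slide's own matrix did not survive extraction). For mutation probability p, that matrix has determinant (1 - 4p/3)^3, so the closeness is 1 - 4p/3, and taking distance = -ln(closeness) turns the multiplicative quantity into an additive one. All function names are illustrative.

```python
import math

def jc_matrix(p):
    """4x4 Jukes-Cantor transition matrix for edge mutation probability p."""
    return [[1 - p if i == j else p / 3 for j in range(4)] for i in range(4)]

def det(m):
    """Determinant by cofactor expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def closeness(p):
    """Closed form of the cubic root of det(jc_matrix(p)), valid for p < 3/4."""
    return 1 - 4 * p / 3

def distance(p):
    return -math.log(closeness(p))

def compose(p1, p2):
    """Mutation probability across two consecutive Jukes-Cantor edges."""
    return p1 + p2 - 4 * p1 * p2 / 3

p1, p2 = 0.2, 0.6
print(det(jc_matrix(p1)) ** (1 / 3), closeness(p1))
print(closeness(compose(p1, p2)), closeness(p1) * closeness(p2))
print(distance(compose(p1, p2)), distance(p1) + distance(p2))
```

In practice p between two leaves is unknown and would be estimated from the fraction of positions where their sequences differ; a closeness near zero makes the logarithm, and hence the distance estimate, very sensitive to sampling noise.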

49 48 Closeness of Triplet. The larger the closeness, the more accurately we can estimate the three pairwise distances. [Figure: tree with node sequences AAGT, AGTT, CAGG, GGTG, GTTG, three leaves marked X, Y, Z, and edge mutation probabilities 0.2, 0.7, 0.65.]

50 49 Assemble Triplets Into Tree via Distance Additivity (I). [Figure: a triplet on leaves X, A, Y with center P and edge lengths a, b, c; numeric labels as extracted: 3, 25, 6.]
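
Distance additivity pins down where a triplet's center sits: if the three pairwise distances are sums of edge lengths along paths, the three distances from the leaves to the center follow from a small linear system. The sketch below solves it; the function name and the example lengths (3, 2, 6) are illustrative, not read off the slide.

```python
def center_edges(d_xy, d_xz, d_yz):
    """Distances from X, Y, Z to the triplet center P, given additive
    pairwise distances d_xy = x + y, d_xz = x + z, d_yz = y + z."""
    x = (d_xy + d_xz - d_yz) / 2
    y = (d_xy + d_yz - d_xz) / 2
    z = (d_xz + d_yz - d_xy) / 2
    return x, y, z

# Hypothetical triplet whose true leaf-to-center lengths are 3, 2, and 6.
edges = center_edges(d_xy=3 + 2, d_xz=3 + 6, d_yz=2 + 6)
print(edges)
```

With estimated rather than exact distances the recovered lengths carry error, which is why the choice of triplets matters in the slides that follow.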

51 50 Assemble Triplets Into Tree via Distance Additivity (II). [Figure: triplets on leaves X, Y, A, B with centers P and Q combined into one tree; numeric labels as extracted: 3 2 10 6 3 and 25 6 15 2 16.]

52 51 How to Choose Triplets to Minimize Errors? The larger the closeness, the more accurately we can estimate the three pairwise distances. Greedy strategy: Harmonic Greedy Triplet (HGT). [Figure: a triplet X, Y, Z with center P; numeric labels as extracted: 3, 25, 6.]

53 52 Over-Simplified Outline of HGT. Stage 1: T’ ← the triplet ABC with the largest closeness. Stage 2: Repeat the following steps until T’ contains all the leaves. Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not. Step 2(2): Incorporate XYZ into T’ to add Z into T’.
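
A brute-force sketch of the greedy selection order only. Two assumptions are not stated on the slide: the closeness of a triplet is taken to be the harmonic mean of its three pairwise closenesses (suggested by the name "Harmonic"), and the pairwise closeness values below are invented. Step 2(2), which actually attaches Z to T’, is omitted, and the exhaustive search here is far from the optimal time bound claimed in the talk.

```python
from itertools import combinations

def triplet_closeness(c, x, y, z):
    """Assumed definition: harmonic mean of the three pairwise closenesses."""
    pairs = [(x, y), (x, z), (y, z)]
    return 3 / sum(1 / c[frozenset(p)] for p in pairs)

def greedy_order(c, leaves):
    """Return the triplets in the order the over-simplified outline picks them."""
    # Stage 1: the triplet with the largest closeness overall.
    first = max(combinations(sorted(leaves), 3),
                key=lambda t: triplet_closeness(c, *t))
    in_tree = set(first)
    remaining = set(leaves) - in_tree
    order = [first]
    # Stage 2: repeatedly pick the best XYZ with X, Y in the tree and Z outside.
    while remaining:
        candidates = [(x, y, z)
                      for x, y in combinations(sorted(in_tree), 2)
                      for z in sorted(remaining)]
        best = max(candidates, key=lambda t: triplet_closeness(c, *t))
        order.append(best)
        in_tree.add(best[2])
        remaining.remove(best[2])
    return order

# Hypothetical pairwise closeness values for four leaves.
c = {frozenset(k): v for k, v in {
    ("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.7,
    ("A", "D"): 0.3, ("B", "D"): 0.4, ("C", "D"): 0.5}.items()}
order = greedy_order(c, "ABCD")
print(order)
```

A harmonic mean is dominated by its smallest term, so under this assumption a triplet scores well only when all three of its pairwise closenesses are reasonably large.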

54 53 Outline of Technical Discussion (5/2): 2. Prove the sample size bound (and high probability for accuracy).

55 54 Our New Result (4/1) polynomial sample size provable high accuracy

56 55 Over-Simplified Outline of HGT. Stage 1: T’ ← the triplet ABC with the largest closeness. Stage 2: Repeat the following steps until T’ contains all the leaves. Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not. Step 2(2): Incorporate XYZ into T’ to add Z into T’.

57 56 Polynomial Sequence Length (1). Lemma 1: The g-depth of the last triplet used in HGT is the g-depth of the true tree T. Proof idea: larger closeness corresponds to smaller g-depth. The closeness of the last triplet is the largest closeness such that the triplets with the same or larger closeness cover the true tree T; correspondingly, its g-depth is the smallest g-depth such that the triplets with the same or smaller g-depths cover T.

58 57 Polynomial Sequence Length (2) g-depth of tree Lemma 1: The g-depth of the last triplet used in HGT is the g-depth of the true tree T. Lemma 2: sequence length needed where XYZ is the last triplet used.

59 58 Outline of Technical Discussion (5/3): 3. Prove the optimal time & space.

60 59 Our New Result (4/2) optimal time & space

61 60 Over-Simplified Outline of HGT. Stage 1: T’ ← the triplet ABC with the largest closeness. Stage 2: Repeat the following steps until T’ contains all the leaves. Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not. Step 2(2): Incorporate XYZ into T’ to add Z into T’.

62 61 Optimal Time/Space for the First Triplet. Stage 1: Fix an arbitrary leaf A. T’ ← the triplet ABC with the largest closeness. Stage 2: Repeat the following steps until T’ contains all the leaves. Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not. Step 2(2): Incorporate XYZ into T’.

63 62 Optimal Time/Space for the Other Leaves. [Figure: the partially reconstructed tree, containing triplets XYZ and ABC with centers P and Q, and the part of the tree not yet recovered.] Only need to consider the triplets formed by one of X, Y, one of B, C, and one of

64 63 Outline of Technical Discussion (6): 6. Further research.

65 64 Further Research: more general models of evolution; practical implementations.

66 65 Main Difficulties. (1) Availability of data: there are hundreds of millions of species, unlikely to be all available any time soon or ever, but DNA sequences of more and more species are becoming available. (2) Extracting information from data: the focus of this talk.

67 66 Beyond All Computational Considerations. Do the genomes of all green plants contain enough information for the reconstruction of their evolutionary tree? Genome size of eukaryotes: base pairs. Number of green plant species: several. If so, does this impose any necessary structure on the information or the tree? If so, how do we determine and use that structure? What do you think? The End. Thank You!

68 67 Data Mining Flowchart. [Figure: flowchart with boxes "true tree (unknown)", "collect & process individual sequences", "compare & align multiple sequences", "tree reconstruction algorithms", "tree verification (compare & refine)", "evolution models", "generate sequences", and "further process"; arrow labels: parameters, distance or characters, trees, information, refine, infer; one part is marked "today’s focus".]

