How to See a Tree for a Forest? Combining Phylogenetic Trees – Reasons, Methods, and Consequences Tanya Y. Berger-Wolf Laboratory for High-Performance.


1 How to See a Tree for a Forest? Combining Phylogenetic Trees – Reasons, Methods, and Consequences Tanya Y. Berger-Wolf Laboratory for High-Performance Algorithm Engineering and Computational Biology Dept. of Computer Science University of New Mexico www.compbio.unm.edu

2 Phylogeny Reconstruction [Figure: Orangutan, Chimpanzee, Human, Gorilla]

3 Phylogeny Reconstruction
1. Get an estimate of evolutionary distance between species
2. Treat the species as a set of points with a pairwise distance measure
3. Find a tree that optimizes {parsimony, likelihood, function of your choice} on that set of points

4 Overview of My Research
Computational Phylogeny
–Comparison of methods that combine trees (greed is bad)
–Topological accuracy of maximum parsimony: Is optimal necessary? How to know when “good enough”?
–Online consensus and other statistics
–Heterogeneous data in phylogeny
Controlled animal breeding strategies

5 Talk Overview
Part I: Getting a good tree fast
–Consensus methods
–Experiment overview
–Results and possibilities
–Online consensus
Part II: Heterogeneous data
–Molecular vs. other: cons and pros
–Constraint-based reconstruction
–Consensus as constraint-based reconstruction methods
The Big Future

6 Computational Pitfalls
Resulting optimization problems are hard
Existing heuristics are expensive on large datasets
Same score – many topologies
True tree is unknown
⇓
When to stop and what to return?

7 Consensus Methods
[Figure: two trees on leaves A, B, C, D, E combined (+) into one consensus tree (=)]
"Consensus is what many people say in chorus but do not believe as individuals." Abba Eban (1915 - 2002), Israeli diplomat, in "The New Yorker," 23 Apr 1990

8 Consensus Methods: Strict
McMorris et al. (83)
Input trees on leaves A, B, C, D, E, listed by their clades:
Tree 1: AB, CD, ABCD, ABCDE
Tree 2: AB, ABC, DE, ABCDE
Tree 3: BCD, ABCD, ABCDE
Strict: contains clades common to all trees

9 Consensus Methods: Majority
Margush & McMorris (81), McMorris et al. (83), Barthelemy & McMorris (86)
Input trees on leaves A, B, C, D, E, listed by their clades:
Tree 1: AB, CD, ABCD, ABCDE
Tree 2: AB, ABC, DE, ABCDE
Tree 3: BCD, ABCD, ABCDE
Majority: contains clades common to a majority of the trees
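The strict and majority rules on the last two slides can be checked on the three example trees with a short Python sketch. The representation is an assumption made for illustration: each tree is a set of clades and each clade is a frozenset of leaf names. The helper names (clades, strict_consensus, majority_consensus, show) are hypothetical and not from the talk.

```python
from collections import Counter

def clades(*names):
    """Build a tree's clade set from strings such as "AB" (leaves A and B)."""
    return {frozenset(name) for name in names}

# The three input trees from the slides, listed by their clades.
trees = [
    clades("AB", "CD", "ABCD", "ABCDE"),
    clades("AB", "ABC", "DE", "ABCDE"),
    clades("BCD", "ABCD", "ABCDE"),
]

def strict_consensus(trees):
    """Clades present in every input tree."""
    return set.intersection(*trees)

def majority_consensus(trees):
    """Clades present in strictly more than half of the input trees."""
    counts = Counter(c for tree in trees for c in tree)
    return {c for c, k in counts.items() if 2 * k > len(trees)}

def show(clade_set):
    """Render clades as strings, smallest clades first."""
    names = ("".join(sorted(c)) for c in clade_set)
    return sorted(names, key=lambda s: (len(s), s))

print(show(strict_consensus(trees)))    # ['ABCDE']
print(show(majority_consensus(trees)))  # ['AB', 'ABCD', 'ABCDE']
```

On this input the strict consensus keeps only the root clade, while the majority consensus also keeps AB and ABCD, each of which appears in two of the three trees.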

10 Stopping Maximum Parsimony (joint work with T. Williams, B.M.E. Moret, U. Roshan, T. Warnow)
If we return the majority consensus of the top-scoring trees, how early can we stop without changing the outcome? Which stopping criteria should we use?
Biological datasets:
three567: “three-gene” (rbcL, atpB, and 18S) DNA sequences (Soltis et al., 2000)
aster328: ITS RNA sequences from the plant family Asteraceae (Gutell Lab, ICMB, UT Austin)
ocho854: rbcL DNA sequences (Goloboff, 1999)
lipsc439: rDNA sequences of Eukaryotes (Goloboff, 1999)
john921: Avian Cytochrome b DNA sequences (Johnson, 2001)
eern476: Metazoan DNA sequences (Goloboff, 1999)
will2000: Eukaryotic sRNA sequences (Gutell Lab, ICMB, UT Austin)
rbcL500: rbcL DNA sequences (Rice et al., 1997)
mari2594: rbcL DNA sequences (Källersjö et al., 1998)

11 Experiment Design
Biological dataset (sequences such as ATTCGGAAGCGATAGCTGA, ATCGATCGATCGTATTACGT, TAGCTAGTATGCAGCGGAG)
Run parsimony ratchet (PAUP*): 500 iterations, 5 repetitions
Save the tree at each iteration
Majority consensus of optimal trees (PAUP*)
Output consensus tree
Optimal: best-scoring trees in all repetitions
Majority consensus of best and second best so far

12 Results

13

14 Online Consensus
Input: T_1, T_2, …, T_k with n leaves, one at a time
Output: Majority consensus tree M_i of T_1, …, T_i
Solution: Maintain a set of clades C with counters. When T_i arrives, need to consider only the clades in T_i and M_{i-1}, a total of 2n.
Data structure               Time        Space
Self-balancing binary tree   O(n lg n)   O(|C|)
Hash table, h = O(n^2)       O(n)        O(n^2)
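A minimal Python sketch of the counter-based update described on this slide, assuming clades are stored as frozensets in an ordinary dict. The class and method names (OnlineMajority, add_tree) are hypothetical. The sketch only illustrates why the clades of T_i and M_{i-1} are enough to look at; it does not reproduce the time bounds in the table, since hashing a frozenset is not a constant-time operation.

```python
class OnlineMajority:
    """Keeps one counter per clade so that M_i is available after each tree."""

    def __init__(self):
        self.counts = {}       # clade -> number of trees seen so far that contain it
        self.num_trees = 0     # i, the number of trees seen so far
        self.majority = set()  # M_i, the current majority consensus clades

    def add_tree(self, tree_clades):
        self.num_trees += 1
        for clade in tree_clades:
            self.counts[clade] = self.counts.get(clade, 0) + 1
        # A clade outside T_i keeps its old count, so it can be in M_i only
        # if it was already in M_{i-1}; no other clades need to be checked.
        candidates = set(tree_clades) | self.majority
        self.majority = {
            c for c in candidates if 2 * self.counts[c] > self.num_trees
        }
        return self.majority


def clades(*names):
    """Build a tree's clade set from strings such as "AB" (leaves A and B)."""
    return {frozenset(name) for name in names}

online = OnlineMajority()
m1 = online.add_tree(clades("AB", "CD", "ABCD", "ABCDE"))
m2 = online.add_tree(clades("AB", "ABC", "DE", "ABCDE"))
m3 = online.add_tree(clades("BCD", "ABCD", "ABCDE"))
print(sorted("".join(sorted(c)) for c in m3))  # ['AB', 'ABCD', 'ABCDE']
```

Note that ABCD drops out of the majority after the second tree and re-enters after the third, which is why the clades of the arriving tree must be checked along with those of the previous consensus.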

15 Conclusions and Future
Evidence that the parsimony search can be stopped early
Need simulations and more data to verify
Collect statistics other than consensus
Other stopping criteria
Different representation of final sets of trees
Other methods

16 Wait! There is more! Part II: Heterogeneous Data (joint work with Tandy Warnow)

17 Heterogeneous Data
Molecular data: DNA and genomes
Pros: Have distance measure; Unambiguous; Many characters
Cons: No data for extinct species; Difficulties with ancient evolutionary events; Recombination, repeated evolution

18 Heterogeneous Data
Paleontological, morphological, geographical, historical data
Pros: Easy to sample; Sometimes is the only available information; Has been used for a century
Cons: Character states hard to determine; Genetic basis not known; No distance measure; Subjective

19 Data As Constraints
Constraints, not distance!
Positive: these species are together (phylogenetic trees, presence of a morphological character)
Negative: these species are not together (above + geography, fossils)
Temporal: these events happened in this order (fossils, history)
Frequency: this event happens more often than another (adaptation mechanisms)

20 Consensus Methods: Greedy
Input trees on leaves A, B, C, D, E, listed by their clades:
Tree 1: AB, CD, ABCD, ABCDE
Tree 2: AB, ABC, DE, ABCDE
Tree 3: BCD, ABCD, ABCDE
Greedy: resolves majority by adding compatible clades
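A sketch of the greedy rule in Python, using the usual formulation: visit clades from most to least frequent and keep each one that is compatible with everything kept so far. Two clades are compatible when they are disjoint or one contains the other. The tie-breaking order among equally frequent clades (smaller clades first, then alphabetical) is an assumption made here so the output is deterministic; the slide does not specify one, and a different order can give a different tree.

```python
from collections import Counter

def clades(*names):
    """Build a tree's clade set from strings such as "AB" (leaves A and B)."""
    return {frozenset(name) for name in names}

trees = [
    clades("AB", "CD", "ABCD", "ABCDE"),
    clades("AB", "ABC", "DE", "ABCDE"),
    clades("BCD", "ABCD", "ABCDE"),
]

def compatible(a, b):
    """Two clades fit in one tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a

def greedy_consensus(trees):
    counts = Counter(c for tree in trees for c in tree)
    # Most frequent first; ties broken by size, then name (an assumption).
    order = sorted(
        counts, key=lambda c: (-counts[c], len(c), "".join(sorted(c)))
    )
    chosen = []
    for clade in order:
        if all(compatible(clade, kept) for kept in chosen):
            chosen.append(clade)
    return set(chosen)

def show(clade_set):
    """Render clades as strings, smallest clades first."""
    names = ("".join(sorted(c)) for c in clade_set)
    return sorted(names, key=lambda s: (len(s), s))

result = greedy_consensus(trees)
print(show(result))  # ['AB', 'CD', 'ABCD', 'ABCDE']
```

With this tie-breaking the majority clades AB, ABCD, and ABCDE are kept first, CD is then added, and DE, ABC, and BCD are rejected because each conflicts with a clade already chosen.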

21 Consensus Methods: AMT
Phillips & Warnow (95)
Input trees on leaves A, B, C, D, E, listed by their clades:
Tree 1: AB, CD, ABCD, ABCDE
Tree 2: AB, ABC, DE, ABCDE
Tree 3: BCD, ABCD, ABCDE
Asymmetric Median Tree: maximum (weighted) collection of compatible clades
Distinct input clades: AB, ABC, ABCD, BCD, DE, CD
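The "maximum weighted collection of compatible clades" can be illustrated by brute force on the three example trees. This is only a sketch under stated assumptions: the weight of a clade is taken to be the number of input trees that contain it, compatibility means disjoint or nested, and every subset of the input clades is enumerated, which is feasible only for tiny inputs. It is not the algorithm of Phillips & Warnow, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def clades(*names):
    """Build a tree's clade set from strings such as "AB" (leaves A and B)."""
    return {frozenset(name) for name in names}

trees = [
    clades("AB", "CD", "ABCD", "ABCDE"),
    clades("AB", "ABC", "DE", "ABCDE"),
    clades("BCD", "ABCD", "ABCDE"),
]

def compatible(a, b):
    """Two clades fit in one tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a

def max_weight_compatible(trees):
    """Return the best total weight and every clade set that achieves it."""
    weight = Counter(c for tree in trees for c in tree)  # assumed weights
    candidates = list(weight)
    best_total, best_sets = -1, []
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            if not all(compatible(a, b) for a, b in combinations(subset, 2)):
                continue
            total = sum(weight[c] for c in subset)
            if total > best_total:
                best_total, best_sets = total, [set(subset)]
            elif total == best_total:
                best_sets.append(set(subset))
    return best_total, best_sets

def show(clade_set):
    """Render clades as strings, smallest clades first."""
    names = ("".join(sorted(c)) for c in clade_set)
    return sorted(names, key=lambda s: (len(s), s))

best_total, best_sets = max_weight_compatible(trees)
print(best_total)
for s in sorted(show(s) for s in best_sets):
    print(s)
```

Under these assumptions the example has two optimal collections of total weight 8: the majority clades AB, ABCD, ABCDE together with either CD or ABC.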

22 Consensus of Positive Constraints
Formalize constraint, go through existing consensus methods, see if satisfies or can be extended
Columns: Positive Constraints, Strict, + res, Maj, + res, Grdy, AMT, Input
All input have isomorphic T ⇒ all output have T
One input has isomorphic T, no contradictions ⇒ output have T ⇒
All input have clade ⇒ all output have
One input has clade, no contradictions ⇒ output have ⇒
Partially from Steel et al. 2000

23 Consensus of Negative Constraints
1. a and b are separated by C
2. C is closer to a than b – same as positive
Columns: Negative Constraints, Strict, + res, Maj, + res, Grdy, AMT, Input
All input have 1 ⇒ all output have 1
One input has 1, no contradictions ⇒ output have 1 ⇒ ⇒

24 Conclusions and Future (Part II) Existing methods are insufficient (Consensus with respect to temporal, frequency constraints) Developing new methods that preserve 4 types of constraints Network phylogeny Error measure and evaluation of quality

25 Even Bigger Future Phylogeny Getting good reconstructions fast Heterogeneous data Network phylogeny Epidemiology Flu SIR model, combining data Vaccination strategies Population biology Discrete methods for small populations (esp. conservation)

26 Work is supported by the National Science Foundation postdoctoral fellowship grant EIA 02-03584 Thank you

27 Controlled Breeding (joint work with Cris Moore and Jared Saia) Given an initial population of animals design a mating strategy that achieves a breeding goal (within shortest time)

28 Controlled Breeding: Background Conservation Biology and Agriculture Breeding strategies: designed and evaluated empirically or using stochastic time-step modeling Empirical evaluation – too slow! Stochastic modeling – mathematically and biologically inappropriate. Classic algorithm design problem

29 Breeding All Possible Animals
Given k binary strings of length n, design an algorithm that produces all possible strings with the smallest expected number of matings
Greedy: mate the two animals with the highest probability of producing a new string
Upper bound: 2.322 n

30 Breeding a Target Animal
Given k strings of length n, design an algorithm that produces a target string with the smallest expected number of matings
Alg 1: breed for one trait at a time, O(n lg n)
Alg 2: breed the animals closest to the target, O(n^2)

31 Algorithm: One Trait at a Time
AddOneTrait(11…100…0, 00…010…0)
  x = 11…100…0
  y = 00…010…0
  While (y has < i+1 ones) do
    Mate x and y twice
    y = string with 1 in bit (i+1)
  Return y
The Algorithm(e_1, e_2, …, e_n)
  x = e_1
  For i = 2..n do
    x = AddOneTrait(x, e_i)
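The one-trait-at-a-time procedure can be simulated with a short Python sketch. The slides do not state the mating model, so the one used here is an assumption: each bit of the offspring is copied from one of the two parents with equal probability. The rule for choosing the next y (the offspring that keeps the new trait and has the most ones) is likewise one reading of "string with 1 in bit (i+1)". Indices are 0-based and all function names are hypothetical.

```python
import random

def mate(x, y, rng):
    """Assumed model: every offspring bit comes from x or y with equal chance."""
    return tuple(rng.choice(pair) for pair in zip(x, y))

def unit(n, i):
    """The string e_i: a single one in position i (0-based)."""
    return tuple(1 if j == i else 0 for j in range(n))

def add_one_trait(x, y, i, rng):
    """x has ones in bits 0..i-1 and y starts as unit(n, i).
    Returns a string with ones in bits 0..i and the number of matings."""
    matings = 0
    while sum(y) < i + 1:
        children = [mate(x, y, rng) for _ in range(2)]  # mate x and y twice
        matings += 2
        keepers = [c for c in children if c[i] == 1]    # still carry trait i
        if keepers:
            # x is all ones on bits 0..i-1, so a keeper never loses a one of y
            y = max(keepers, key=sum)
    return y, matings

def breed_all_traits(n, rng):
    """Accumulate traits 0..n-1 one at a time, starting from e_0."""
    x, total = unit(n, 0), 0
    for i in range(1, n):
        x, used = add_one_trait(x, unit(n, i), i, rng)
        total += used
    return x, total

rng = random.Random(0)  # fixed seed so the run is repeatable
target, matings = breed_all_traits(8, rng)
print(target, matings)
```

Under this model a kept offspring always has at least as many ones as the current y, so each call makes monotone progress toward the string with i+1 ones.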

32 More Realistic Breeding Gender Variable probability of outcome Deaths Minimize number of generations Goal: maximum diversity On-line: maintain the distribution

