Inferring Trees from Trees Consensus and Supertree Methods

Presentation transcript:

Inferring Trees from Trees Consensus and Supertree Methods Mark Wilkinson Department of Zoology, The Natural History Museum mw@bmnh.org

CONSENSUS “general or widespread agreement” Consensus tree – a tree depicting agreement among a set of trees - a representation of a set of trees - a phylogenetic inference from a set of trees Consensus method – a technique for producing consensus trees (of a particular type) Consensus index – a measure of the agreement among a set of trees (based on their consensus tree)

Uses of Consensus Trees Consensus trees are used to represent (or make inferences from) multiple trees Agreement (conservative) Central tendency (liberal) There are a number of different contexts in which this may be of interest (sets of trees can be obtained in a variety of ways) The ultimate aims may be quite different Different methods may be more or less appropriate given the aim/context

Mathematician’s Perspective Subsequently there has been an amazing proliferation of consensus methods and consensus indices: a proliferation stimulated by confusions, disagreements, and uncertainties concerning what consensus methods depict and what consensus indices measure. Thus, for example, consensus indices for trees are understood to measure agreement, balance, information, resolution, shape, similarity, and symmetry. One has the impression that taxonomists do not know (or cannot agree on) what consensus objects should depict or how it should be depicted; they do not know (or cannot agree on) what consensus indices should measure or how it should be quantified. Consequently, taxonomists may not appreciate (or do not articulate) the relationships that might or should exist between consensus method and consensus index. Day and McMorris 1985

Strict (Component) CM Uniquely defined in terms of two properties Pareto - if a component is present in all the input trees it is in the consensus Strict - if a component is in the consensus it is present in all the input trees Can also be defined in terms of an algorithm or an objective function

Strict CM(s) Require complete agreement across all the input trees and show relationships that would be true if any input tree were true With MPTs they show only those relationships that are unambiguously supported by the parsimonious interpretation of the data The commonest method focuses on components (clusters, groups, splits, clades or monophyletic groups) This method produces a consensus tree that includes all (Pareto) and only (strict) the common clades Other relationships (those in which the input trees disagree) are shown as unresolved polytomies Component version widely implemented
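A minimal sketch of the component version, assuming each rooted input tree has already been summarised as its set of non-trivial clades. The representation, the function name `strict_consensus`, and the two example trees are hypothetical illustrations, not taken from the slides. Under that assumption the strict component consensus is simply the intersection of the clade sets:

```python
def strict_consensus(trees):
    """Return the clades found in every input tree (all = Pareto, only = strict)."""
    common = set(trees[0])
    for tree in trees[1:]:
        common &= set(tree)
    return common


# Hypothetical rooted trees on taxa A-E, each written as its non-trivial clades.
tree1 = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}    # (((A,B),C),(D,E))
tree2 = {frozenset("AB"), frozenset("ABC"), frozenset("ABCD")}  # ((((A,B),C),D),E)

consensus = strict_consensus([tree1, tree2])
for clade in sorted(consensus, key=len):
    print("".join(sorted(clade)))
```

Clades on which the trees disagree (here DE and ABCD) are simply absent from the result, which corresponds to drawing them as an unresolved polytomy.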

Strict CM(s) [Figure: two input trees on taxa A-G and their strict (component) consensus tree]

Interpreting Polytomies Polytomies in trees have alternative interpretations. The hard interpretation ‘multiple speciation’ The soft interpretation ‘uncertain resolution’ is appropriate for strict component consensus trees The consensus permits all resolutions of the polytomy (i.e. it does not conflict with any resolution)

Semi-strict CM(s) Semi-strict methods require assertion of a relationship by one or more trees and non-contradiction by any tree The commonest method focuses on components/clades Produces a tree including all components that are present and uncontradicted in the input trees - all that could be true if any input tree were true Generally, similar to the strict but may be more resolved when the input trees include polytomies It is based on the soft interpretation of polytomies Other relationships are shown as unresolved polytomies Component version implemented in e.g. PAUP
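The "asserted by at least one tree, contradicted by none" rule can be sketched as follows, again assuming (for illustration only) that trees share one leaf set and are stored as sets of non-trivial clades. The helper names `compatible` and `semi_strict_consensus` and the example trees are hypothetical. Two clades are treated as compatible when they are disjoint or nested:

```python
def compatible(x, y):
    """Two clades can coexist in one rooted tree if they are disjoint or nested."""
    return x.isdisjoint(y) or x <= y or y <= x


def semi_strict_consensus(trees):
    """Keep every clade asserted by some tree and contradicted by no tree."""
    asserted = set().union(*trees)
    return {c for c in asserted if all(compatible(c, d) for d in asserted)}


# Hypothetical input trees on taxa A-E containing (soft) polytomies.
tree1 = {frozenset("AB")}                     # ((A,B),C,D,E)
tree2 = {frozenset("ABC"), frozenset("DE")}   # ((A,B,C),(D,E))
result = semi_strict_consensus([tree1, tree2])

# A third tree asserting (B,C) contradicts (A,B), so both of those clades drop out.
tree3 = {frozenset("BC")}                     # (A,(B,C),D,E)
result_with_conflict = semi_strict_consensus([tree1, tree2, tree3])
```

In the first case the strict component consensus would be completely unresolved, while the semi-strict consensus combines AB, ABC and DE, which illustrates why it tends to be more resolved when input trees include polytomies.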

Semi-strict (component) CM TWO INPUT TREES

Properties of Semi-strict CM(s) Tend to produce more resolved consensus trees Reasonable when combining trees based on different data sets With trees based on a single data set extra resolution is of relationships that are not true of all the optimal trees: the consensus includes relationships that are not supported by all best interpretations of the data It may include relationships that cannot be simultaneously supported by any parsimonious (or other) interpretation of the data These relationships might reasonably be considered less well supported (if supported at all)

Semi-strict (component) CM TWO INPUT TREES SEMI-STRICT (component) CONSENSUS TREE

Equally Optimal Trees Many phylogenetic analyses yield multiple equally optimal trees Multiple trees are due to either: Alternative equally optimal interpretations of conflicting data Missing data Or both We can further select among these trees with additional (secondary) criteria, but a consensus tree may be needed to represent or draw conclusions about the set of MPTs Typically phylogeneticists are interested in relationships common to all the optimal trees (they want to know that a relationship in the consensus is in all the trees - STRICT)

Loss of Resolution Generally, as the number of optimal trees increases the resolution of ‘strict consensus trees’ decreases. In the extreme, the ‘strict consensus tree’ may be completely unresolved/uninformative. This extreme is sometimes met in practice (e.g. fossils). The ‘consensus tree’ can also be poorly resolved when there are few optimal trees and, furthermore, the optimal trees need not differ greatly. Thus lack of resolution in a ‘strict consensus tree’ is not always a good guide to the level of agreement among the optimal trees.

Optimal trees need not differ greatly for ‘the consensus tree’ to be unresolved

Adams-2 Consensus Adams’ (1972) method was defined by a recursive algorithm which, beginning at the root, identifies common sub-clusters using intersection rules (product partitions) The basal splits in these trees yield the four clusters ABC, EFD, AC & BEFD. Their intersections yield AC, B & EFD and these produce the three branches at the base of the consensus tree. The procedure is repeated for the subtrees induced by E, F and D, which in this case are identical.

Adams Consensus and Nesting Adams (1972) described the first consensus methods; only one of his methods is in use Adams (1986) characterised his method in terms of nestings A group X (e.g. AB) nests within another Y (e.g. A-D) if the last common ancestor of Y is an ancestor of the last common ancestor of X The Adams consensus tree includes all those nestings that are in all the input trees (Pareto) and for all clusters displayed by the Adams tree there is a corresponding nesting in each input tree (strict). They show only those nestings that are unambiguously supported by the parsimonious interpretation of the data Implemented in e.g. PAUP

Adams Consensus and Nesting (1) taxa A & C are more closely related to each other than either is to taxa D, E, or F; (2) taxa E & F are more closely related to each other than either is to taxa A, C, or D; (3) taxon D is more closely related to E & F than it is to either A or C. Swofford (1991) But - Not quite right!

Adams Consensus (1) taxa A & C are more closely related to each other than either is to taxa D, E, or F; In each tree A & C are more closely related to each other than they are to D-F and/or B (AC)D-F and/or (AC)B (2) taxa E & F are more closely related to each other than either is to taxa A, C, or D; (EF)D (3) taxon D is more closely related to E & F than it is to either A or C. (D-F)AC and/or (D-F)B

Adams polytomies are cladistically ambiguous What can be inferred from the consensus? (A,B)CD - No (AB)C - No (AB)D - No (AB)C and/or (AB)D INPUT TREES

Note on the meaning of cladistic relationship Cladistic relationships are based on recency of common ancestry (& dependent on rooted trees). Two taxa are more closely related to each other than either is to a third iff they share a more recent common ancestor - e.g. (AB)CDE. Nestings are also based on common ancestry but nestings are ambiguous with respect to cladistic relationships - e.g. {AB}CD = (AB)C and/or (AB)D
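To make the "and/or" reading concrete, here is a small sketch that tests three-taxon statements on rooted trees stored as sets of clades. The function name `displays_triplet` and both trees are hypothetical. It relies on the definition above: (ab)c holds when a and b share an ancestor that is not also an ancestor of c, that is, when some clade contains a and b but not c.

```python
def displays_triplet(clades, a, b, c):
    """True if the rooted tree asserts (ab)c: some clade holds a and b but not c."""
    return any(a in cl and b in cl and c not in cl for cl in clades)


# Hypothetical input trees on taxa A-F, written as their non-trivial clades.
tree1 = {frozenset("AB"), frozenset("ABC"), frozenset("EF"), frozenset("DEF")}
tree2 = {frozenset("AC"), frozenset("EF"), frozenset("DEF"), frozenset("BDEF")}

for name, tree in (("tree1", tree1), ("tree2", tree2)):
    print(name,
          "(AC)B:", displays_triplet(tree, "A", "C", "B"),
          "(AC)D:", displays_triplet(tree, "A", "C", "D"))
```

Here (AC)D holds in both trees but (AC)B holds only in the second, so a group AC in an Adams consensus of these trees could only be read as the nesting {AC}, not as a statement that A and C are closer to each other than to every other taxon.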

Properties of Adams Consensus Trees Adams consensus trees are more topologically sensitive to shared structure in input trees than is the strict component consensus, but... Care must be taken in the interpretation of their ‘elastic’ polytomies Adams consensus trees can include groups that don’t occur in any input tree (Rohlf groups) They exist only for rooted trees

Greatest Agreement Subtrees [Figure: two input trees on taxa A-G. Their strict component consensus is completely unresolved; the GAS/LCP tree on taxa A-F excludes taxon G.]
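A brute-force sketch of the idea, assuming rooted trees stored as sets of non-trivial clades. The function names and the two seven-taxon trees are hypothetical stand-ins for the figure, in which a single taxon G occupies very different positions. The search tries the largest taxon subsets first and stops at the first subset on which all pruned trees are identical. It is exponential in the number of taxa, so it is only meant for small examples.

```python
from itertools import combinations


def restrict(clades, keep):
    """Clades of the tree after pruning to `keep`; trivial clades are dropped."""
    pruned = set()
    for clade in clades:
        sub = clade & keep
        if 1 < len(sub) < len(keep):
            pruned.add(sub)
    return pruned


def greatest_agreement_subtree(trees, taxa):
    """Return (kept taxa, shared clades) for one largest subset the trees agree on."""
    for size in range(len(taxa), 0, -1):
        for subset in combinations(sorted(taxa), size):
            keep = frozenset(subset)
            pruned = [restrict(tree, keep) for tree in trees]
            if all(p == pruned[0] for p in pruned):
                return keep, pruned[0]


# tree1: ((((((A,B),C),D),E),F),G)    tree2: ((((((A,G),B),C),D),E),F)
tree1 = {frozenset(s) for s in ("AB", "ABC", "ABCD", "ABCDE", "ABCDEF")}
tree2 = {frozenset(s) for s in ("AG", "ABG", "ABCG", "ABCDG", "ABCDEG")}

kept, shared = greatest_agreement_subtree([tree1, tree2], frozenset("ABCDEFG"))
print(sorted(kept), len(shared))
```

The two trees share no clade at all, so their strict component consensus is a bush, yet pruning the single taxon G leaves a fully resolved six-taxon tree.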

Strict Reduced CM [Figure: two input trees on taxa A-G, their strict component consensus, and the strict reduced consensus tree / agreement subtrees on taxa A-F. Taxon G is excluded.]

Rhynchosaurs

Fossil & Recent Arthropods

Majority-rule CM(s) Majority-rule consensus methods require agreement across a majority of the input trees The commonest method focuses on components/clades This method produces a consensus tree that includes all and only those clades found in a majority (>50%) of the input trees Majority components are necessarily mutually compatible Other relationships are shown as unresolved polytomies Of particular use in bootstrapping, jackknifing, quartet puzzling, Bayesian inference (with e.g. average branch lengths). Component version widely implemented
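As a rough sketch of the component version, assume each input tree is stored as its set of non-trivial clades. The function name `majority_rule_consensus` and the three example trees are hypothetical. Counting how often each clade occurs and keeping those above 50% gives the consensus together with its clade frequencies:

```python
from collections import Counter


def majority_rule_consensus(trees):
    """Map each clade found in more than half of the trees to its frequency."""
    counts = Counter(clade for tree in trees for clade in tree)
    n = len(trees)
    return {clade: count / n for clade, count in counts.items() if count > n / 2}


# Hypothetical rooted trees on taxa A-E.
trees = [
    {frozenset("AB"), frozenset("ABC"), frozenset("DE")},    # (((A,B),C),(D,E))
    {frozenset("AB"), frozenset("ABC"), frozenset("ABCD")},  # ((((A,B),C),D),E)
    {frozenset("AB"), frozenset("CD"), frozenset("ABCD")},   # (((A,B),(C,D)),E)
]

majority = majority_rule_consensus(trees)
for clade, freq in sorted(majority.items(), key=lambda item: len(item[0])):
    print("".join(sorted(clade)), f"{freq:.0%}")
```

Because two clades that each occur in more than half of the trees must co-occur in at least one tree, the retained clades never conflict and no compatibility check is needed.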

Majority-rule (component) CM [Figure: three input trees on taxa A-G and their majority-rule (component) consensus. Numbers indicate frequency of clades in the input trees; the values shown are 100 and 66.]

Properties of Majority-rule Tend to produce more resolved consensus trees Extra resolution is of relationships that are not true of all the optimal trees In the context of equally optimal trees, this means the consensus includes relationships that are not supported by all the best interpretations of the data These relationships might reasonably be considered less well supported (if supported at all) Related to the Median consensus (objective function minimises the sum of the symmetric differences between the consensus and input trees)
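The median objective can be checked numerically on a toy case. In this sketch trees are sets of non-trivial clades, the symmetric difference is the number of clades found in exactly one of two trees, and all names and trees are hypothetical. It compares the summed distance for the majority-rule tree against the strict consensus and against one of the input trees:

```python
def symmetric_difference(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(tree_a ^ tree_b)


def total_distance(candidate, trees):
    """Objective function: summed symmetric difference to all input trees."""
    return sum(symmetric_difference(candidate, tree) for tree in trees)


# Hypothetical rooted trees on taxa A-E.
t1 = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
t2 = {frozenset("AB"), frozenset("ABC"), frozenset("ABCD")}
t3 = {frozenset("AB"), frozenset("CD"), frozenset("ABCD")}
trees = [t1, t2, t3]

majority_tree = {frozenset("AB"), frozenset("ABC"), frozenset("ABCD")}
strict_tree = {frozenset("AB")}

scores = {
    "majority-rule": total_distance(majority_tree, trees),
    "strict": total_distance(strict_tree, trees),
    "input tree 1": total_distance(t1, trees),
}
print(scores)
```

On this example the majority-rule tree has the lowest total, consistent with its description as a median tree; the sketch only evaluates three candidates and is not a proof.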

Adding minority components Further resolution can sometimes be achieved by adding relationships that occur in a minority of trees. These must be compatible with the majority relationships Two approaches Greedy (PAUP) Frequency-difference (TNT)
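A sketch of the greedy approach under simple assumptions: trees are sets of non-trivial clades on one leaf set, minority clades are tried in order of decreasing frequency, and each is added only if it is compatible with everything accepted so far. This is an illustration of the idea, not a description of how PAUP implements it, and the function names, tie-breaking rule and example trees are all hypothetical.

```python
from collections import Counter


def compatible(x, y):
    """Two clades can coexist in one rooted tree if they are disjoint or nested."""
    return x.isdisjoint(y) or x <= y or y <= x


def greedy_consensus(trees):
    """Majority-rule clades plus compatible minority clades, most frequent first."""
    counts = Counter(clade for tree in trees for clade in tree)
    n = len(trees)
    accepted = {clade for clade, count in counts.items() if count > n / 2}
    minority = [clade for clade in counts if clade not in accepted]
    # Ties are broken alphabetically here; that choice is an arbitrary assumption.
    minority.sort(key=lambda clade: (-counts[clade], sorted(clade)))
    for clade in minority:
        if all(compatible(clade, other) for other in accepted):
            accepted.add(clade)
    return accepted


# Five hypothetical trees on taxa A-E: AB occurs in 3, DE in 2, CD, AC, BC in 1 each.
trees = [
    {frozenset("AB"), frozenset("DE")},
    {frozenset("AB"), frozenset("DE")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC")},
    {frozenset("BC")},
]
result = greedy_consensus(trees)
print(sorted("".join(sorted(clade)) for clade in result))
```

Only AB reaches a majority; DE is then added because it conflicts with nothing accepted, while CD, AC and BC are rejected because each overlaps an accepted clade without nesting in it.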

Majority-rule

Other Consensus methods A variety of other consensus methods have been devised but few implemented These include: Other intersection rules based on cluster height Nelson, Asymmetric Median and other clique consensus methods Other matrix representation methods, e.g. MRP Average consensus

A Consensus Classification Consensus trees vary with respect to: The kind of agreement (components, triplets, nestings, subtrees) The level of agreement (strict, semi-strict, majority-rule, largest clique)

                   Full       Adams        Reduced    LCP / GAS
                   (Splits)   (Nestings)   (Splits)   (Subtrees)
Strict             Yes        Yes          Yes        Yes
Semi-strict        Yes        ?            Yes        ?
Majority-rule      Yes        ?            Yes        ?
Nelson (clique)    Yes        ?            Yes        ?

Consensus methods Use strict methods to identify those relationships unambiguously supported by parsimonious interpretation of the data Use more liberal (semi-strict, majority-rule) consensus methods for taxonomic congruence Use majority-rule methods in bootstrapping etc. Use Adams consensus when strict component consensus is poorly resolved - if Adams is better resolved use strict reduced consensus Use reduced methods where consensus trees are poorly resolved Avoid over-interpreting results from methods which have ambiguous interpretations

Input Trees Consensus Trees More or less Conservative More or less Liberal

Input Trees SuperTrees More or less Conservative More or less Liberal

Biologists want (Big) Trees “Nothing in Biology makes sense except in the light of evolution” Dobzhansky, 1973 The Tree of Life: Holy Grail of Systematics Bigger Trees: more powerful comparative analyses Adaptation Biogeography Speciation and diversification Conservation

When Input Trees Conflict
Semi-strict
Gene Tree Parsimony
MinCut (modified Aho)
Quartet puzzling
Matrix representations: Splits (standard MRP, MRC, MRF), Sister groups (Purvis MRP), Triplets, Quartets, Distances (MRD)
Analysed with: Parsimony (MRP), Clique (MRC), MinFlip (MRF), Least squares

MRP

                     Component        Purvis           Triplet
                     Tree 1  Tree 2   Tree 1  Tree 2
Fecampiida           111     111      111     111      1??11??111
Neodermata           ???     111      ???     111      ??????????
Tricladida           111     100      111     10?      ?1???11111
Lecithoepitheliata   110     000      110     0??      ??111110??
Polycladida          000     ???      00?     ???      0000?0??0?
Kalyptorhyncha       100     110      1??     110      111?0?0??0
MRP-outgroup         000     000      000     000      0000000000
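The standard component coding can be sketched as follows, assuming each source tree is given as a pair of its leaf set and its non-trivial clades. The function name `mrp_matrix`, the column ordering and the two small source trees are hypothetical and are not the flatworm trees above. Each clade becomes one character: 1 for taxa inside the clade, 0 for other taxa in that source tree, ? for taxa missing from it, plus an all-zero outgroup row.

```python
def mrp_matrix(source_trees):
    """Build component-coded MRP rows from (leaf_set, clades) pairs."""
    all_taxa = sorted(set().union(*(leaf_set for leaf_set, _ in source_trees)))
    rows = {taxon: "" for taxon in all_taxa}
    rows["MRP-outgroup"] = ""
    for leaf_set, clades in source_trees:
        # One matrix column per clade; smaller clades first for a stable order.
        for clade in sorted(clades, key=lambda c: (len(c), sorted(c))):
            for taxon in all_taxa:
                if taxon not in leaf_set:
                    rows[taxon] += "?"   # taxon absent from this source tree
                elif taxon in clade:
                    rows[taxon] += "1"
                else:
                    rows[taxon] += "0"
            rows["MRP-outgroup"] += "0"
    return rows


# Two hypothetical source trees with overlapping but different leaf sets.
tree1 = (frozenset("ABCD"), {frozenset("AB"), frozenset("ABC")})   # (((A,B),C),D)
tree2 = (frozenset("BCDE"), {frozenset("DE"), frozenset("CDE")})   # (B,(C,(D,E)))

matrix = mrp_matrix([tree1, tree2])
for taxon, codes in matrix.items():
    print(f"{taxon:<13}{codes}")
```

The resulting matrix would then be analysed with parsimony to obtain the supertree; that step is outside this sketch.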

MRP Components unordered - A,B,C irreversible - C Triplets - A Quartets - A & C

MRP – an unusual consensus

MRP, total evidence and taxonomic congruence

Majority-rule Goloboff and Pol (2002), Goloboff (2006) Majority rule supertrees desirable in principle Fundamental problem in generalising frequency of occurrence of groups MRP is a (poor) surrogate

Majority-rule Majority-rule consensus is also a median tree for the symmetric difference Alternative basis for generalising beyond consensus How to compute symmetric difference for trees with different leaf sets? Convert into trees with identical leaf sets Prune leaves from supertree Graft leaves onto supertree
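The "prune leaves from supertree" option can be sketched like this, assuming trees are stored as sets of non-trivial clades and each source tree also carries its leaf set. All names and example trees are hypothetical. The candidate supertree is pruned to each source tree's taxa before the symmetric difference is taken:

```python
def prune(clades, keep):
    """Clades remaining after pruning the tree to the taxa in `keep`."""
    pruned = set()
    for clade in clades:
        sub = clade & keep
        if 1 < len(sub) < len(keep):   # drop single leaves and the full leaf set
            pruned.add(sub)
    return pruned


def distance_to_source(supertree, source_leaves, source_clades):
    """Symmetric difference between the pruned supertree and one source tree."""
    return len(prune(supertree, source_leaves) ^ source_clades)


# Hypothetical candidate supertree on taxa A-E: (((A,B),C),(D,E))
supertree = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}

# Hypothetical source trees on different leaf sets.
sources = [
    (frozenset("ABCD"), {frozenset("AB"), frozenset("ABC")}),   # (((A,B),C),D)
    (frozenset("ACDE"), {frozenset("DE"), frozenset("CDE")}),   # (A,(C,(D,E)))
]

distances = [distance_to_source(supertree, leaf_set, clades) for leaf_set, clades in sources]
print(distances, sum(distances))
```

A majority-rule supertree in this sense would be a tree minimising the summed distance over all sources; searching for that tree is not attempted here.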

ML Supertrees

‘Taxonomic Congruence’