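Cladogram Building - 1
• How complex is this problem anyway?
• NP-complete: no polynomial-time exact algorithm is known, so the time needed to find the optimal solution increases exponentially with the size of the problem: $t = c^n$ (c a constant, n the problem size).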
Computational Complexity
• How do we proceed?
  - Exact and exhaustive: enumeration, branch and bound
  - (Maybe) off-target and incomplete: heuristics
• What about the quality of the solution?
  - Optimality criterion
Optimality - 1
• Parsimony analysis comprises a group of related methods, united by the goal of optimizing some evolutionarily significant quantity but differing in their underlying evolutionary assumptions.
Optimality - 2
• How good is the solution: what is its score (relative to alternatives)?
• Relation of score to evolutionary assumptions:
  - Fitch and Wagner Parsimony
  - Dollo Parsimony
  - Camin-Sokal Parsimony
  - Generalized Parsimony
  - Constrained Parsimony
  - Group / Component Compatibility
  - Character Compatibility
Exact and Exhaustive
• Enumeration is computationally infeasible if the number of taxa is over, say, 10.
• Branch and bound is computationally feasible for over 20 taxa (50 may even work).
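To make the enumeration limit concrete, here is a minimal sketch that counts the distinct unrooted, fully bifurcating cladograms for n labelled taxa. The formula (2n-5)!! is a standard result that the slide does not state; the function name is a hypothetical one chosen for illustration.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted, fully bifurcating trees for n_taxa labelled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1  # three taxa admit exactly one unrooted tree
    # The k-th taxon can be attached to any of the 2k - 5 branches
    # of a tree that already holds k - 1 taxa.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count


for n in (5, 10, 20, 50):
    print(n, count_unrooted_trees(n))
```

With 10 taxa there are already 2,027,025 trees to score, which is why exhaustive enumeration stops being practical at about that size.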
(Maybe) Off-Target and Incomplete
• Heuristics:
  - Step-wise Addition
  - Star Decomposition
  - Branch Swapping
Step-wise Addition - 1
[Figure: cladograms for taxa A-E built up by adding one taxon at a time; the diagrams are not recoverable from the transcript.]
Step-wise Addition - 2
• Dependent on taxon sequence in data matrix.
• Excessively greedy.
• Susceptible to local optima.
Branch Swapping
• Local rearrangements of parts of the cladogram:
  - Nearest Neighbor Interchange (NNI)
  - Subtree Pruning and Regrafting (SPR)
  - Tree Bisection and Reconnection (TBR)
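As a rough illustration of the simplest of these rearrangements, the sketch below lists the two nearest-neighbor interchanges around a single internal branch. The nested-tuple representation and the function names are assumptions made for this example, not part of the slides.

```python
def nni_neighbours(a, b, c, d):
    """Return the two NNI rearrangements of the branch separating (a, b) from (c, d).

    a, b, c, d are the four subtrees (or tip names) that hang off one
    internal branch of the current cladogram ((a, b), (c, d)).
    """
    return [((a, c), (b, d)), ((a, d), (b, c))]


def count_nni_neighbours(n_taxa: int) -> int:
    """An unrooted binary tree has n - 3 internal branches, each giving 2 neighbours."""
    return 2 * (n_taxa - 3)


print(nni_neighbours("A", "B", "C", "D"))
print(count_nni_neighbours(10))
```

SPR and TBR explore larger neighbourhoods of the current cladogram, at a correspondingly higher cost per round of swapping.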
Optimality - 3: Kinds of Scores
• Length (number of steps)
• Consistency Index (CI)
• Retention Index (RI)
• Corrected Extra Length (CEL)
• Redundancy Quotient
• AUCC
• HDR
• CCSI
• …
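For the two most widely used scores after length, a minimal sketch follows. It assumes the usual definitions, which the slide does not spell out: CI = m / s and RI = (g - s) / (g - m), where m is the minimum conceivable number of steps, s the number of steps observed on the cladogram, and g the maximum conceivable number of steps. The numbers in the example are made up.

```python
def consistency_index(min_steps: int, observed_steps: int) -> float:
    """CI = m / s; equals 1 when the cladogram requires no homoplasy."""
    return min_steps / observed_steps


def retention_index(min_steps: int, observed_steps: int, max_steps: int) -> float:
    """RI = (g - s) / (g - m); undefined when g == m (uninformative characters)."""
    if max_steps == min_steps:
        raise ValueError("RI is undefined when max_steps equals min_steps")
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Hypothetical data set: m = 10, s = 16, g = 30
print(consistency_index(10, 16))
print(retention_index(10, 16, 30))
```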
Fitch & Wagner
• Characters:
  - W (Wagner): binary, ordered multistate, continuous
  - F (Fitch): unordered multistate
• Transformation:
  - Free reversibility: root position and cladogram length are decoupled.
  - Change in either direction is equally probable (symmetry).
  - W: intermediate states are always involved. Thus 1 -> 3 implies 2 steps.
  - F: any state can transform directly into any other. Thus 1 -> 3 implies 1 step.
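The difference between the two transformation rules can be written as two small cost functions. This is only a sketch for integer-coded states; the function names are hypothetical.

```python
def wagner_cost(i: int, j: int) -> int:
    """Ordered states: every intermediate state is passed through."""
    return abs(i - j)


def fitch_cost(i: int, j: int) -> int:
    """Unordered states: any change costs a single step."""
    return 0 if i == j else 1


print(wagner_cost(1, 3))  # 2 steps, as on the slide
print(fitch_cost(1, 3))   # 1 step, as on the slide
```

Both functions are symmetric in i and j, which is the free reversibility mentioned above.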
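Wagner: Cladogram length - 1
[Figure: cladogram of taxa A-E with terminal states 0, 2, 1, 3 and 0; internal nodes are labelled with the state sets (0,2), (1,3) and (1,2), and with "?" where still undetermined. Diagram not recoverable from the transcript.]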
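Wagner: Cladogram length - 2
[Figure: the same cladogram of taxa A-E (terminal states 0, 2, 1, 3, 0) with internal-node states assigned and step counts on the branches. Diagram not recoverable from the transcript.]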
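Fitch: Cladogram length
[Figure: cladogram of taxa A-E with terminal states 0, 2, 0, 3 and 2; internal nodes are first labelled with the state sets (0,2), 0 and (0,3), then with resolved states and step counts on the branches. Diagram not recoverable from the transcript.]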
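Since the figure itself did not survive, here is a minimal sketch of the Fitch counting pass for one unordered character. The topology and the assignment of states to taxa are hypothetical (the slide's own tree cannot be recovered), and the tree is rooted arbitrarily, which does not affect the length under free reversibility.

```python
def fitch_length(tree, tip_states):
    """Return (state_set, steps) for a binary tree given as nested 2-tuples of tip names."""
    if isinstance(tree, str):
        return {tip_states[tree]}, 0
    left_set, left_steps = fitch_length(tree[0], tip_states)
    right_set, right_steps = fitch_length(tree[1], tip_states)
    common = left_set & right_set
    if common:
        # Descendants agree on at least one state: no extra step is needed.
        return common, left_steps + right_steps
    # No shared state: take the union and count one step.
    return left_set | right_set, left_steps + right_steps + 1


# Hypothetical topology and state assignment for taxa A-E
tree = ((("A", "B"), "C"), ("D", "E"))
tip_states = {"A": 0, "B": 2, "C": 0, "D": 3, "E": 2}

root_set, length = fitch_length(tree, tip_states)
print(root_set, length)
```

On this example the pass counts three steps: one at the (A, B) node, one at the (D, E) node and one at the root.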
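Dollo: Multiple origins not allowed
[Figure: cladogram of taxa A-E with a binary character (terminal states 0, 1, 0, 1, 0) and the internal-node states that let state 1 originate only once. Diagram not recoverable from the transcript.]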
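Generalized Parsimony
Step (cost) matrices, read here with rows as the "from" state and columns as the "to" state:

Wagner
     a   b   c   d
a    -   1   2   3
b    1   -   1   2
c    2   1   -   1
d    3   2   1   -

Fitch
     a   b   c   d
a    -   1   1   1
b    1   -   1   1
c    1   1   -   1
d    1   1   1   -

Dollo (M is a large weight)
     a   b   c   d
a    -   M   2M  3M
b    1   -   M   2M
c    2   1   -   M
d    3   2   1   -

T-sition/T-version (transition/transversion)
     A   C   G   T
A    -   5   1   5
C    5   -   5   1
G    1   5   -   5
T    5   1   5   -

Gain vs Loss
     0   1
0    -   1
1    1   -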
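A cost matrix of this kind can be plugged into a dynamic-programming pass over the cladogram. The sketch below uses Sankoff's algorithm, which the slides do not name, so treat it as one possible way to compute generalized parsimony lengths. The tree, the tip states and the function names are hypothetical.

```python
import math


def sankoff(tree, tip_states, states, cost):
    """Return {state: minimum cost of the subtree if its root node has that state}.

    tree: nested tuples of tip names; cost(s, t): cost of changing s -> t.
    """
    if isinstance(tree, str):
        return {s: (0 if s == tip_states[tree] else math.inf) for s in states}
    child_tables = [sankoff(child, tip_states, states, cost) for child in tree]
    return {
        s: sum(min(cost(s, t) + table[t] for t in states) for table in child_tables)
        for s in states
    }


def wagner(s, t):
    return abs(s - t)          # ordered states


def fitch(s, t):
    return 0 if s == t else 1  # unordered states


# Hypothetical topology and state assignment for taxa A-E
tree = ((("A", "B"), "C"), ("D", "E"))
tip_states = {"A": 0, "B": 2, "C": 0, "D": 3, "E": 2}
states = [0, 1, 2, 3]

wagner_length = min(sankoff(tree, tip_states, states, wagner).values())
fitch_length_sankoff = min(sankoff(tree, tip_states, states, fitch).values())
print(wagner_length, fitch_length_sankoff)
```

Swapping in a different `cost` function, for example one that returns 5 for transversions and 1 for transitions, changes the optimality criterion without touching the tree traversal.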
Models of Evolutionary Change
• Molecular data
• Maximum Likelihood: “Given the phylogeny, what is the probability of observing the data that I found?”
  - Substitution types
  - Substitution probabilities
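Models: Substitution Types
[Figure: hierarchy of nested substitution models: GTR, TrN, SYM, HKY, F84, K3ST, F81, K2P, JC. The links between models are labelled with the restriction applied: "T-versions; 2 T-sition class", "T-versions vs T-sitions", "single substitution type", "equal base frequencies" (T-version = transversion, T-sition = transition). Diagram not recoverable from the transcript.]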
Substitution Types: What do they all mean?
• GTR, e.g., stands for General Time Reversible, meaning that the overall rate of change from base i to base j in a given length of time is the same as the overall rate of change from base j to base i.
• Each type corresponds to a table of substitution rates for all pairs of the nucleotides A, C, G, and T.
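Substitution Rate Table
$$Q = \mu R \Pi, \qquad R = \begin{pmatrix} \cdot & a & b & c \\ g & \cdot & d & e \\ h & j & \cdot & f \\ i & k & l & \cdot \end{pmatrix}, \qquad \Pi = \operatorname{diag}(\pi_A, \pi_C, \pi_G, \pi_T)$$
(rows and columns in the order A, C, G, T; the diagonal of Q is set so that each row sums to zero)
• $\pi_A, \pi_C, \pi_G, \pi_T$ = frequency parameters; $\mu$ = mean instantaneous substitution rate (SR).
• a, …, k, l = relative rate parameters.
• All models can be obtained by restricting the parameters in R.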
Models: Substitution Rates
• GTR: a = g, b = h, …, e = k, f = l
• TrN: a = c = d = f
• K3ST: $\pi_A = \pi_C = \pi_G = \pi_T = 1/4$
• JC: a = b = c = d = e = f = 1; $\pi_A = \pi_C = \pi_G = \pi_T = 1/4$
(a, …, l are the relative rate parameters and $\pi_A$, …, $\pi_T$ the frequency parameters of the Substitution Rate Table.)
Models: Substitution Probabilities
• $P(t) = e^{Qt}$
• P is evaluated by decomposing Q into its eigenvalues and eigenvectors.
• We have a P for every branch (of length t) in the cladogram.
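The relation $P(t) = e^{Qt}$ can be checked numerically. The slide evaluates it by eigendecomposition; to stay within the Python standard library, the sketch below swaps in a different technique, a truncated power series with scaling and squaring, and compares the result with the known closed form for the Jukes-Cantor model. All names and default parameters are assumptions for this example.

```python
import math


def mat_mul(A, B):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transition_probabilities(Q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by scaling and squaring a truncated Taylor series."""
    n = len(Q)
    scale = t / (2 ** squarings)
    M = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]  # identity = series term 0
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, M)]  # M^k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)  # undo the scaling: exp(A) = exp(A / 2)^2
    return P


# Jukes-Cantor Q, scaled to one expected substitution per site per unit time
Q_JC = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
t = 0.5
P = transition_probabilities(Q_JC, t)

# Closed form for Jukes-Cantor under the same scaling
p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
p_diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
print(P[0][0], p_same)
print(P[0][1], p_diff)
```

For general models an eigendecomposition is usually preferred, because it is computed once per Q and then reused for every branch length t.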
Rate vs Time
• All models: P(i -> j) depends on $\mu$ and t only through the product $\mu t$.
• A branch can be long because it represents a long period of time OR because the rate of substitution has been high. The two are impossible to tell apart, unless there is a perfect molecular clock.
Rate + Time = Branch Length
• If: the mean substitution rate $\mu$ is set to 1,
• And: the relative rate parameters a, b, …, f are scaled so that the average rate at equilibrium = 1,
• Then: branch length = expected number of substitutions per site.
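The scaling step can be sketched as follows for the reversible (GTR) case, where the six relative rates a, …, f apply to the pairs AC, AG, AT, CG, CT and GT. The function name and the example parameter values are hypothetical; the normalisation divides Q by the equilibrium average rate, $-\sum_i \pi_i Q_{ii}$.

```python
def gtr_rate_matrix(rel_rates, freqs):
    """Build a GTR rate matrix Q scaled to one expected substitution per site per unit time.

    rel_rates: (a, b, c, d, e, f) for the pairs AC, AG, AT, CG, CT, GT.
    freqs: equilibrium base frequencies (pi_A, pi_C, pi_G, pi_T).
    """
    a, b, c, d, e, f = rel_rates
    R = [[0, a, b, c],
         [a, 0, d, e],
         [b, d, 0, f],
         [c, e, f, 0]]
    # Off-diagonal entries: relative rate times frequency of the target base.
    Q = [[R[i][j] * freqs[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        Q[i][i] = -sum(Q[i][j] for j in range(4) if j != i)  # rows sum to zero
    mean_rate = -sum(freqs[i] * Q[i][i] for i in range(4))
    return [[q / mean_rate for q in row] for row in Q]


# Jukes-Cantor as the most restricted case
Q_jc = gtr_rate_matrix((1, 1, 1, 1, 1, 1), (0.25, 0.25, 0.25, 0.25))

# An HKY-like example: transitions (AG, CT) four times as fast, unequal frequencies
freqs = (0.1, 0.2, 0.3, 0.4)
Q_hky = gtr_rate_matrix((1, 4, 1, 1, 4, 1), freqs)
print(-sum(freqs[i] * Q_hky[i][i] for i in range(4)))
```

After this scaling, a branch of length t in $P(t) = e^{Qt}$ is measured directly in expected substitutions per site.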
Recap
• Evolution of DNA sequences is modeled by a stochastic process in which each site evolves in time (t) independently of all other sites, according to a Poisson process with rate $\mu$. Because the rate only occurs in products of the form $\mu t$, the absolute value of $\mu$ is arbitrary.
• Thus, all times should be considered relative to one another, and not as absolute values. Products of the form $\mu t$ represent expected amounts of change.
Likelihood of a Cladogram - 1
• If: sites in the sequence evolve independently,
• Then: the data represent a multinomial sample.
• Thus: an overall goodness-of-fit statistic is applicable (Log Likelihood Ratio Test).
Likelihood of a Cladogram - 2
• Cladogram is given: how good is it?
• Likelihood of the cladogram: likelihoods of occurrence of each state at each node, as a function of cladogram topology and branch lengths.
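Likelihood of a Cladogram - 3
• The conditional likelihood of state i at sequence position j in taxon A (a node whose descendants B and C are joined to it by branches of length $t_{AB}$ and $t_{AC}$) is:
$$L(A_j = i) = \Big[\sum_k P_{ik}(t_{AB})\, L(B_j = k)\Big] \cdot \Big[\sum_l P_{il}(t_{AC})\, L(C_j = l)\Big]$$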
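The recursion above can be sketched for a single site as follows. To keep the example self-contained it assumes the Jukes-Cantor model, with branch lengths in expected substitutions per site; the tree encoding (a tip is its observed base, an internal node is a 4-tuple) and all function names are hypothetical.

```python
import math

BASES = "ACGT"


def jc_prob(i, j, t):
    """Jukes-Cantor P_ij(t), with t in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def conditional_likelihoods(node):
    """Return [L(node = A), L(node = C), L(node = G), L(node = T)] for one site.

    A tip is the observed base (a one-character string); an internal node is
    a tuple (left, t_left, right, t_right) with branch lengths t.
    """
    if isinstance(node, str):
        return [1.0 if base == node else 0.0 for base in BASES]
    left, t_left, right, t_right = node
    L_left = conditional_likelihoods(left)
    L_right = conditional_likelihoods(right)
    out = []
    for i in range(4):
        from_left = sum(jc_prob(i, k, t_left) * L_left[k] for k in range(4))
        from_right = sum(jc_prob(i, l, t_right) * L_right[l] for l in range(4))
        out.append(from_left * from_right)
    return out


def site_likelihood(root, freqs=(0.25, 0.25, 0.25, 0.25)):
    """Weight the root's conditional likelihoods by the base frequencies."""
    return sum(pi * L for pi, L in zip(freqs, conditional_likelihoods(root)))


# Hypothetical three-taxon example for one site: ((A:0.1, A:0.1):0.05, G:0.2)
tree = (("A", 0.1, "A", 0.1), 0.05, "G", 0.2)
print(site_likelihood(tree))
```

Under site independence, the likelihood of the whole cladogram is the product of such per-site values, usually accumulated as a sum of logarithms.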
Likelihood of a Cladogram - 4
• See figure 10 in SOWH.
Maximum Likelihood
• Pro: Consistency. As the number of items of data (n) increases, the probability that the estimator is far from the true value of the parameter (cladogram structure) decreases to zero.
• But: Inferential consistency depends on the model. In practice only finite amounts of data are considered, so a "long-term" property is not necessarily needed.
Maximum Likelihood - 2
• “Anyone who considers this model (Poisson Process Model of DNA substitution) complex should bear in mind that it is the simplest mathematical model of state change with constant probabilities per unit time, and that a particular case (that of a very low rate of change) is used to justify parsimony methods.”
• The model does not allow for insertions, deletions, and inversions.
When does ML = Parsimony?
• They estimate different parameters, therefore the estimates cannot match exactly.
• For cladogram structure alone: if the Poisson Process Model (PPM) is correct, and we assume the expected amount of change, $\mu t$, to be very small, then the probability structures become the same. For realistic values of $\mu t$, the two models do not behave identically.
Extensions of ML
• Rate heterogeneity among sites
• Other data types (other than sequences):
  - gene frequencies
  - restriction sites
• Pairwise Distance Methods:
  - immunological data
  - DNA-DNA hybridizations