Clustering methods Tree building methods for distance-based trees

For a set of sequences, all possible pairwise distances are calculated using your method of choice (JC69, K2P, GTR, etc.). This provides measures of dissimilarity. Now, how do you build a tree? The distances can be clustered according to several major tree-building algorithms:
- U(W)PGMA - Unweighted (Weighted) Pair Group Method with Arithmetic Means
- Minimum Evolution
- Neighbor Joining
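As a rough illustration of the first step, the sketch below computes JC69 distances for every pair of aligned sequences. The sequences and the names `p_distance`, `jc69_distance` and `pairwise_distances` are made up for this example; gaps and ambiguous bases are not handled.

```python
import math
from itertools import combinations


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of aligned sites that differ between two sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def jc69_distance(seq1: str, seq2: str) -> float:
    """Jukes-Cantor correction: d = -(3/4) * ln(1 - (4/3) * p)."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        raise ValueError("p-distance too large for the JC69 correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def pairwise_distances(seqs: dict) -> dict:
    """Map every unordered pair of sequence names to its JC69 distance."""
    return {(a, b): jc69_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}


# Hypothetical toy alignment, for illustration only.
seqs = {"tax1": "ACGTACGT", "tax2": "ACGTACGA", "tax3": "ACGAACGA"}
distances = pairwise_distances(seqs)
for pair, d in distances.items():
    print(pair, round(d, 4))
```

The resulting dictionary plays the role of the distance matrix that the tree-building methods start from.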

Clustering methods U(W)PGMA - Unweighted (Weighted) Pair Group Method with Arithmetic Means
A single best rooted tree is built by using the calculated distances to group pairs of OTUs.
A molecular clock is assumed, so all terminal nodes are equidistant from the root.
Summary:
1. Obtain the distance matrix.
2. Group the two most closely related taxa.
3. Find the mean of their distances to each of the other taxa.
4. Treat these two as a single new OTU.
5. Continue until you run out of sequences.


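Clustering methods U(W)PGMA Details
Start with your distance matrix.
Group the taxa with the shortest distance, here A and B (dAB = 2).
The depth of the divergence between A and B is their distance divided by 2, so A and B each get a branch of length 1.
Recalculate the distances between AB and the other taxa.
[Figure: distance matrix for taxa A-F, and the first cluster joining A and B with branch lengths of 1]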

Clustering methods U(W)PGMA Details
Recalculate the distances between AB and the other taxa:
d(AB)C = (dAC + dBC)/2 = 4
d(AB)D = (dAD + dBD)/2 = 6, etc.
Repeat with the next closest cluster, D & E. (We could just as easily group C with AB.)
[Figure: reduced distance matrix over AB, C, D, E and F; partial trees joining A with B (branch lengths 1) and D with E (branch lengths 2)]
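The "U" and "W" variants differ only in this recalculation step once clusters of unequal size are merged. A minimal sketch follows; the function names are made up here, and the second pair of input distances (6 and 9) is hypothetical rather than taken from the example matrix.

```python
def upgma_update(d_ik: float, d_jk: float, n_i: int, n_j: int) -> float:
    """UPGMA: average weighted by the number of taxa in clusters i and j."""
    return (n_i * d_ik + n_j * d_jk) / (n_i + n_j)


def wpgma_update(d_ik: float, d_jk: float) -> float:
    """WPGMA: simple average, regardless of cluster sizes."""
    return (d_ik + d_jk) / 2


# Merging two single taxa (A and B): both variants agree.
# dAC = dBC = 4 is assumed here, consistent with d(AB)C = 4 above.
print(upgma_update(4, 4, 1, 1), wpgma_update(4, 4))  # 4.0 4.0

# Hypothetical later step: a 2-taxon cluster (distance 6 to some taxon k)
# merged with a single taxon (distance 9 to k). The variants now differ.
print(upgma_update(6, 9, 2, 1), wpgma_update(6, 9))  # 7.0 7.5
```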

Recalculate the distances between DE and the other taxa.
Join the next closest group, AB & C.
[Figure: reduced distance matrix over AB, C, DE and F; partial trees for (A, B), C and (D, E)]

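Recalculate the distances between ABC and the other taxa.
How long should the branch joining ABC to DE be? The distance between D (or E) and A, B or C should be a total of 6, so the node joining ABC and DE sits at a depth of 3. The ABC and DE nodes are each at a depth of 2, which leaves a branch of length 1 on either side.
Join DE to ABC.
[Figure: reduced distance matrix over ABC, DE and F; tree joining ABC with DE]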

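This leaves only F to add to the tree.
The total distance between F and any other taxon should be 8, so F joins at a depth of 4: F gets a branch of length 4, and a branch of length 1 connects the ABCDE node to the root.
[Figure: final distance matrix over ABCDE and F; completed UPGMA tree]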
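The whole procedure can be sketched in a few lines. The matrix below is an assumption: it is an ultrametric matrix consistent with the values quoted in the walkthrough (2, 4, 6, 8), since the original table did not survive intact. The function name `upgma` and the cluster labels are also made up for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the merges as (cluster_a, cluster_b, node_depth) tuples.

    `dist` maps frozenset({a, b}) to the distance between clusters a and b.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of taxa
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair (ties are broken by iteration order).
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        depth = d[frozenset((a, b))] / 2
        na, nb = sizes.pop(a), sizes.pop(b)
        new = "".join(sorted(a + b))
        # Size-weighted mean distance from the new cluster to the others.
        for c in sizes:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[new] = na + nb
        merges.append((a, b, depth))
    return merges


# Assumed lower-triangular matrix (rows B..F against columns A..E).
names = ["A", "B", "C", "D", "E", "F"]
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6],
        "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {frozenset((row, names[col])): value
        for row, values in rows.items() for col, value in enumerate(values)}

for a, b, depth in upgma(names, dist):
    print(f"join {a} + {b} at depth {depth}")
```

Because C-to-AB and D-to-E are tied at 4, this sketch happens to join C with AB before joining D with E; the node depths (1, 2, 2, 3, 4) are the same either way.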

Clustering methods U(W)PGMA
Weakness: the constant molecular clock assumption.
Note that taxa 3 and 5 are actually less divergent from one another than 4 and 5. This can't be depicted in the tree using this method.
Non-homogeneous rates can introduce serious problems with tree reconstruction when some taxa evolve much faster than others.
[Figure: two trees relating taxa 1-5]

Clustering methods U(W)PGMA is rarely used anymore
Newer methods avoid its main weakness: the ultrametricity imposed by the constant molecular clock assumption.
- Newer methods do not require ultrametricity (they still require additivity).
- They require that the distance between any pair of OTUs equal the sum of the lengths of the branches connecting them, not the average.
- This allows the molecular clock to vary among taxa.
[Figure: distance matrix for taxa A-F and the corresponding additive tree with unequal branch lengths]
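The difference between the two properties can be checked directly. The sketch below tests the three-point condition (ultrametricity) and the four-point condition (additivity) on the A-F matrix used in the NJ example; the matrix is reconstructed from the flattened table and agrees with the net divergences quoted for it (30, 42, 32, 38, 34, 44). The function names are made up here.

```python
from itertools import combinations

NAMES = "ABCDEF"
# Lower-triangular distances (rows B..F against columns A..E).
LOWER = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
D = {frozenset((row, NAMES[col])): value
     for row, values in LOWER.items() for col, value in enumerate(values)}


def d(x, y):
    return D[frozenset((x, y))]


def is_ultrametric(names, tol=1e-9):
    """Three-point condition: the two largest distances in every triple are equal."""
    for x, y, z in combinations(names, 3):
        _, mid, top = sorted([d(x, y), d(x, z), d(y, z)])
        if abs(top - mid) > tol:
            return False
    return True


def is_additive(names, tol=1e-9):
    """Four-point condition: the two largest of the three pair sums are equal."""
    for w, x, y, z in combinations(names, 4):
        sums = sorted([d(w, x) + d(y, z), d(w, y) + d(x, z), d(w, z) + d(x, y)])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


print(is_ultrametric(NAMES))  # False: no single clock fits these distances
print(is_additive(NAMES))     # True: the distances still fit a tree exactly
```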

Clustering methods Minimum Evolution Method
The tree that minimizes the total tree length (the sum of the lengths of all branches) is the best tree.
Reasonably good at finding the best tree.
Major drawback: all possible trees must be examined, which is computationally onerous when dealing with large numbers of taxa.
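To get a feel for "computationally onerous", the number of unrooted bifurcating trees for n taxa is (2n - 5)!!, the product of the odd numbers 3, 5, ..., 2n - 5. A small sketch (the function name is made up here):

```python
def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n >= 3 labelled taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 6, 10, 20):
    print(n, num_unrooted_trees(n))
```

Ten taxa already give 2,027,025 candidate trees, which is why exhaustive search quickly becomes impractical.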

Clustering methods Neighbor Joining Method (Saitou and Nei, 1987)
A heuristic method for determining the best distance-based tree.
Heuristic methods explore a subset of all possible trees in the hope that the best tree lies within that subset.
Heuristic methods often fall into the 'hill-climbing' category:
- Start with a tree and alter it in some way.
- If the alteration makes it worse (using some optimality criterion), abandon it and try again.
- If the alteration makes it better, keep it and continue.
Heuristic methods include:
- Stepwise addition
- Branch swapping
(More on these later.)
NJ is a star decomposition heuristic.

Clustering methods The major drawback to heuristics is the risk of getting trapped at a local optimum in tree space
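A toy illustration of the problem, not a tree search: the sketch below hill-climbs along a one-dimensional list of made-up scores, which stands in for tree space. The starting point determines whether the climb reaches the best score or stalls on a smaller peak.

```python
def hill_climb(scores, start):
    """Step to the better adjacent position until no neighbour improves the score."""
    pos = start
    while True:
        neighbours = [i for i in (pos - 1, pos + 1) if 0 <= i < len(scores)]
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[pos]:
            return pos  # no neighbour is better: a (possibly local) optimum
        pos = best


# Hypothetical scores: a small peak at index 1, the global peak at index 4.
scores = [1, 3, 2, 5, 9, 4]
print(hill_climb(scores, 0))  # 1 -> stuck on the local optimum (score 3)
print(hill_climb(scores, 2))  # 4 -> reaches the global optimum (score 9)
```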

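Clustering methods NJ
Combines computational speed with unique results (you always get a single best tree).
The NJ algorithm
Compute the net divergence, r, for every end node (the sum of its distances to all other end nodes):
rA = 5 + 4 + 7 + 6 + 8 = 30
rB = 5 + 7 + 10 + 9 + 11 = 42
rC = 32, rD = 38, rE = 34, rF = 44
Create a rate-corrected distance matrix:
Mij = dij - (ri + rj)/(N - 2), where N = number of end nodes
MAB = 5 - (30 + 42)/4 = -13
MAC = 4 - (30 + 32)/4 = -11.5
Distance matrix:
       A      B      C      D      E
B      5
C      4      7
D      7     10      7
E      6      9      6      5
F      8     11      8      9      8
Rate-corrected distance matrix:
       A      B      C      D      E
B    -13
C    -11.5  -11.5
D    -10    -10    -10.5
E    -10    -10    -10.5  -13
F    -10.5  -10.5  -11    -11.5  -11.5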
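These two calculations can be reproduced with a short sketch. The matrix is the A-F example, reconstructed so that it reproduces the net divergences quoted above; the function names are made up here.

```python
from itertools import combinations

NAMES = "ABCDEF"
# Lower-triangular distances (rows B..F against columns A..E).
LOWER = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
D = {frozenset((row, NAMES[col])): value
     for row, values in LOWER.items() for col, value in enumerate(values)}


def dist(x, y):
    return D[frozenset((x, y))]


def net_divergence(names):
    """r_i: sum of the distances from node i to every other end node."""
    return {x: sum(dist(x, y) for y in names if y != x) for x in names}


def rate_corrected(names):
    """M_ij = d_ij - (r_i + r_j) / (N - 2) for every pair of end nodes."""
    r = net_divergence(names)
    n = len(names)
    return {(x, y): dist(x, y) - (r[x] + r[y]) / (n - 2)
            for x, y in combinations(names, 2)}


r = net_divergence(NAMES)
M = rate_corrected(NAMES)
print(r)              # {'A': 30, 'B': 42, 'C': 32, 'D': 38, 'E': 34, 'F': 44}
print(M[("A", "B")])  # -13.0
print(M[("A", "C")])  # -11.5
```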

Clustering methods NJ The NJ algorithm
Draw your star phylogeny.
[Figure: star phylogeny with end nodes A-F radiating from a single central node, shown beside the rate-corrected distance matrix]

Clustering methods NJ The NJ algorithm
Define a new node, U, that groups the pair of taxa with the smallest rate-corrected distance. Either AB or DE qualifies (both are -13); here U = AB.
Determine the branch lengths from U to A and B:
SAU = dAB/2 + (rA - rB)/(2(N - 2))
SAU = 5/2 + (30 - 42)/(2(6 - 2)) = 1
Because the distances are additive, SBU = dAB - SAU = 4
[Figure: star phylogeny with A and B pulled out onto a new node U, with branch lengths 1 (A) and 4 (B), shown beside the distance matrix, the rate-corrected matrix and the net divergences rA = 30, rB = 42, rC = 32, rD = 38, rE = 34, rF = 44]
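The branch-length step is a single formula, sketched here with the values for A and B (dAB = 5, rA = 30, rB = 42, N = 6). The function name is made up for this example.

```python
def branch_lengths(d_ij: float, r_i: float, r_j: float, n: int):
    """Lengths of the branches from the new node to i and to j.

    S_i = d_ij / 2 + (r_i - r_j) / (2 * (N - 2)),  S_j = d_ij - S_i
    """
    s_i = d_ij / 2 + (r_i - r_j) / (2 * (n - 2))
    return s_i, d_ij - s_i


s_au, s_bu = branch_lengths(d_ij=5, r_i=30, r_j=42, n=6)
print(s_au, s_bu)  # 1.0 4.0
```

B's larger net divergence is what gives it the longer branch: the 5 units between A and B are not split evenly, as they would be under a molecular clock.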

Redefine the distances based on the new node:
dCU = (dAC + dBC - dAB)/2
dCU = (4 + 7 - 5)/2 = 3, etc.
Reduced distance matrix:
       U      C      D      E
C      3
D      6      7
E      5      6      5
F      7      8      9      8
Repeat the previous steps.
[Figure: star phylogeny with A and B joined at node U (branch lengths 1 and 4)]
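A sketch of this reduction step, applied to the A-F example matrix (reconstructed from the flattened table). The helper name `join` is made up here.

```python
from itertools import combinations

NAMES = list("ABCDEF")
# Lower-triangular distances (rows B..F against columns A..E).
LOWER = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
D = {frozenset((row, NAMES[col])): value
     for row, values in LOWER.items() for col, value in enumerate(values)}


def join(dist, names, i, j, new):
    """Replace end nodes i and j by `new`: d(k, new) = (d_ik + d_jk - d_ij) / 2."""
    d_ij = dist[frozenset((i, j))]
    rest = [k for k in names if k not in (i, j)]
    # Distances among the untouched nodes are carried over unchanged.
    reduced = {frozenset((a, b)): dist[frozenset((a, b))]
               for a, b in combinations(rest, 2)}
    for k in rest:
        reduced[frozenset((new, k))] = (
            dist[frozenset((i, k))] + dist[frozenset((j, k))] - d_ij
        ) / 2
    return rest + [new], reduced


names_u, d_u = join(D, NAMES, "A", "B", "U")
for k in "CDEF":
    print(f"d({k},U) =", d_u[frozenset((k, "U"))])
# d(C,U) = 3.0, d(D,U) = 6.0, d(E,U) = 5.0, d(F,U) = 7.0
```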

Calculate the net divergence for every end node (U now counts as an end node, so N = N - 1 = 5):
rU = 21, rC = 24, rD = 27, rE = 24, rF = 32
Compute the rate-corrected distances.
Create a new node, V, from either UC or DE; here V = UC.
Calculate the branch lengths from node V to C and U:
SUV = dCU/2 + (rU - rC)/(2(N - 2))
SUV = 3/2 + (21 - 24)/(2(5 - 2)) = 1
SCV = dCU - SUV = 2
Compute the new distance matrix from V to all remaining end nodes:
       V      D      E
D      5
E      4      5
F      6      9      8
[Figure: rate-corrected matrix and distance matrix over U, C, D, E and F; tree before and after joining U and C at node V (branch lengths 1 and 2)]

Clustering methods NJ The NJ algorithm
And so on…
[Figure: successive NJ trees as nodes U, V and W are added, ending with the fully resolved unrooted tree]
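Putting the steps together, here is a minimal sketch of the full loop, run on the A-F example matrix (reconstructed from the flattened table). Internal nodes are labelled N0, N1, ... rather than U, V, W; the function name and edge format are made up here. Ties are broken by iteration order, which on this matrix picks AB first and then C with the AB node, as in the walkthrough.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the edges (node, parent, branch_length) of the unrooted NJ tree."""
    names, d, edges, next_id = list(names), dict(dist), [], 0
    while len(names) > 3:
        n = len(names)
        r = {x: sum(d[frozenset((x, y))] for y in names if y != x) for x in names}
        # Pair with the smallest rate-corrected distance M_ij.
        i, j = min(combinations(names, 2),
                   key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]) / (n - 2))
        d_ij = d[frozenset((i, j))]
        s_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        new = f"N{next_id}"
        next_id += 1
        edges += [(i, new, s_i), (j, new, d_ij - s_i)]
        rest = [k for k in names if k not in (i, j)]
        for k in rest:
            d[frozenset((new, k))] = (
                d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij) / 2
        names = rest + [new]
    # Three nodes remain: attach each one to a final central node.
    x, y, z = names
    dxy, dxz, dyz = d[frozenset((x, y))], d[frozenset((x, z))], d[frozenset((y, z))]
    centre = f"N{next_id}"
    edges += [(x, centre, (dxy + dxz - dyz) / 2),
              (y, centre, (dxy + dyz - dxz) / 2),
              (z, centre, (dxz + dyz - dxy) / 2)]
    return edges


NAMES = "ABCDEF"
LOWER = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
D = {frozenset((row, NAMES[col])): value
     for row, values in LOWER.items() for col, value in enumerate(values)}

edges = neighbor_joining(NAMES, D)
for node, parent, length in edges:
    print(f"{node} -> {parent}: {length}")
```

The first edges reproduce the branch lengths worked out by hand (A = 1, B = 4, C = 2, and 1 for the branch between the first two internal nodes).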

Note that this is an unrooted tree. Where do we put the root? Using external information, we can determine where the root belongs.
[Figure: the unrooted NJ tree and a rooted version of the same tree with branch lengths]

Clustering methods UPGMA vs NJ

Clustering methods Alternatives to classical NJ and software
Alternatives to classical NJ:
- BIONJ (Gascuel, 1997)
- Generalized neighbor joining (Pearson et al., 1999)
- Weighted neighbor joining (Bruno et al., 2000)
- Multi-neighbor-joining (Silva et al., 2005)
- Relaxed neighbor joining (Evans et al., 2006)
Software:
- PHYLIP
- MEGA
- PAUP
- DAMBE
- TREECON

Clustering methods Pros and cons
Pros:
- Easy to understand
- Easy to implement
- Fast
- Produce a single, best tree
- Use an explicit substitution model
- Topology AND branch lengths are calculated (NJ)
Cons:
- Reduce most of the data to a single value (reducing the amount of information)
- Different sets of sequences can yield the same distance matrix
