UPGMA Algorithm
Main idea: Group the taxa into clusters and repeatedly merge the two closest clusters until only one cluster remains.
Algorithm
1. Add a leaf to the tree for each taxon.
2. Initially, make each taxon its own cluster.
3. Find the two closest clusters and connect them with a new node in the tree, placed at equal distance from both clusters.
4. Repeat the previous step until all clusters are connected.
[Figure: taxa x_1, ..., x_5 and a rooted tree with leaves x_3, x_5, x_1, x_2, x_4]
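The algorithm above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `upgma`, the nested-tuple tree representation, and the three-taxon distances are all assumptions made for the example, and cluster distances are simply recomputed as averages over all taxon pairs.

```python
from itertools import combinations


def upgma(taxa, dist):
    """Sketch of UPGMA. `dist[a][b]` is the distance between taxa a and b.

    Returns (tree, height): the tree as nested tuples of taxon names and
    the height of the root above the leaves.
    """
    # Every taxon starts as its own cluster, represented by a leaf.
    clusters = {frozenset([t]): t for t in taxa}
    height = 0.0

    def average(a, b):
        # Average distance over all pairs with one taxon from each cluster.
        return sum(dist[p][q] for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        # The new node is placed at equal distance from both clusters.
        height = average(a, b) / 2
        # Connect the two subtrees under the new node and merge the clusters.
        clusters[a | b] = (clusters.pop(a), clusters.pop(b))

    (tree,) = clusters.values()
    return tree, height


# Hypothetical distances between three taxa, for illustration only.
dist = {
    "a": {"b": 2, "c": 6},
    "b": {"a": 2, "c": 6},
    "c": {"a": 6, "b": 6},
}
tree, height = upgma(["a", "b", "c"], dist)
print(tree, height)
```

Here the taxa a and b are joined first at height 1, and c is attached at height 3. Recomputing every average from scratch keeps the sketch short; a practical implementation would update a distance matrix instead.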
UPGMA Clustering
The algorithm needs to compute the distance between clusters.
The distance between clusters C_i and C_j is defined to be the average distance between all pairs of taxa, one taken from C_i and one from C_j:
d(C_i, C_j) = (1 / (|C_i| |C_j|)) * (sum of d(p, q) over all p in C_i and q in C_j)
Shortcut when combining C_i and C_j to form a new cluster C_k: the distance from C_k to any other cluster C_l is the size-weighted average of the distances already known,
d(C_k, C_l) = (|C_i| d(C_i, C_l) + |C_j| d(C_j, C_l)) / (|C_i| + |C_j|)
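As a quick sanity check of the shortcut, the sketch below compares a size-weighted update with the full average over all taxon pairs. The size-weighted update is the standard UPGMA one and is assumed here to be the shortcut meant above; the function names and the four-taxon distances are hypothetical.

```python
def average_distance(ci, cj, dist):
    """Average distance over all pairs with one taxon in ci and one in cj."""
    total = sum(dist[frozenset((p, q))] for p in ci for q in cj)
    return total / (len(ci) * len(cj))


def shortcut_distance(size_i, d_il, size_j, d_jl):
    """Distance from the merged cluster C_k (C_i joined with C_j) to C_l.

    Uses only the cluster sizes and the known distances d(C_i, C_l) and
    d(C_j, C_l): (|C_i| d_il + |C_j| d_jl) / (|C_i| + |C_j|).
    """
    return (size_i * d_il + size_j * d_jl) / (size_i + size_j)


# Hypothetical pairwise distances, for illustration only.
dist = {
    frozenset(("a", "d")): 4,
    frozenset(("b", "d")): 7,
    frozenset(("c", "d")): 10,
}
ci, cj, cl = ["a"], ["b", "c"], ["d"]

# Full average over all pairs between the merged cluster and C_l.
direct = average_distance(ci + cj, cl, dist)
# Same quantity from the two distances that were already known.
via_shortcut = shortcut_distance(
    len(ci), average_distance(ci, cl, dist),
    len(cj), average_distance(cj, cl, dist),
)
print(direct, via_shortcut)
```

Both values agree, which is why the distance matrix can be updated after each merge without revisiting the individual taxa.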
UPGMA Example
Assume the following distance matrix:
[Distance matrix over x_1, ..., x_5; the only entry recoverable here is d(x_2, x_4) = 8]
Closest pair is {x_3, x_5}, so cluster them: C_1 = {x_3, x_5}
Compute the distance from C_1 to the rest:
d(C_1, x_1) = 1/2 (d(x_3, x_1) + d(x_5, x_1)) = 6
d(C_1, x_2) = 1/2 (d(x_3, x_2) + d(x_5, x_2)) = 16
d(C_1, x_4) = 1/2 (d(x_3, x_4) + d(x_5, x_4)) = 16
Add a new node for x_3, x_5 at height d(x_3, x_5) / 2 = 1
[Tree so far: x_3 and x_5 joined by branches of length 1]
The updated distance matrix (C_1 replaces x_3, x_5):
        x_1   x_2   x_4   C_1
x_1      -    16    16     6
x_2     16     -     8    16
x_4     16     8     -    16
C_1      6    16    16     -
Closest pair is {x_1, C_1}, so cluster them: C_2 = {x_1, C_1}
Compute the distances from C_2 to the rest:
d(C_2, x_2) = 1/3 (d(x_1, x_2) + d(x_3, x_2) + d(x_5, x_2)) = 16
d(C_2, x_4) = 1/3 (d(x_1, x_4) + d(x_3, x_4) + d(x_5, x_4)) = 16
Add a new node for x_1, C_1 at height d(x_1, C_1) / 2 = 3
[Tree so far: x_3 and x_5 joined at height 1; x_1 joined with them at height 3, by a branch of length 3 to x_1 and a branch of length 2 to the {x_3, x_5} node]
The updated distance matrix (C_2 replaces x_1, C_1):
        x_2   x_4   C_2
x_2      -     8    16
x_4      8     -    16
C_2     16    16     -
Closest pair is {x_2, x_4}, so cluster them: C_3 = {x_2, x_4}
Compute the distances from C_3 to the rest:
d(C_3, C_2) = 1/6 (d(x_2, x_1) + d(x_2, x_3) + d(x_2, x_5) + d(x_4, x_1) + d(x_4, x_3) + d(x_4, x_5)) = 16
Add a new node for x_2, x_4 at height d(x_2, x_4) / 2 = 4
[Tree so far: the {x_1, x_3, x_5} subtree, plus x_2 and x_4 joined by branches of length 4]
The updated distance matrix (C_3 replaces x_2, x_4):
        C_2   C_3
C_2      -    16
C_3     16     -
Closest pair is {C_2, C_3}, so cluster them: C_4 = {C_2, C_3}
Add a new node for C_2, C_3 at height d(C_2, C_3) / 2 = 8
[Final tree: the root at height 8 joins C_2 = {x_1, x_3, x_5} and C_3 = {x_2, x_4}]
Done! Double-check whether the original distances between taxa are preserved (this is not guaranteed).
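The double-check can be sketched as follows. In a UPGMA tree, the distance between two taxa implied by the tree is twice the height of the node at which they are first joined. The merge list below records the heights from this example; the function name and the data layout are assumptions made for illustration.

```python
from itertools import product


def tree_distances(merges):
    """Pairwise distances implied by a UPGMA tree.

    `merges` lists (left_taxa, right_taxa, height) for every internal node.
    Two taxa on opposite sides of a node are first joined there, so the
    tree places them at distance 2 * height.
    """
    implied = {}
    for left, right, height in merges:
        for p, q in product(left, right):
            implied[frozenset((p, q))] = 2 * height
    return implied


# Internal nodes of the example tree with their heights.
merges = [
    (["x3"], ["x5"], 1),                    # C1
    (["x1"], ["x3", "x5"], 3),              # C2
    (["x2"], ["x4"], 4),                    # C3
    (["x1", "x3", "x5"], ["x2", "x4"], 8),  # C4 (root)
]
implied = tree_distances(merges)
for pair, d in sorted((sorted(k), v) for k, v in implied.items()):
    print(pair, d)
```

Each printed value can then be compared with the corresponding entry of the original distance matrix; any mismatch is a distance the tree does not preserve.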
UPGMA Summary
Distance-based algorithm that produces rooted trees.
Assumes that all species evolve at the same rate (molecular clock hypothesis).
An implication of the molecular clock hypothesis is that the distance from the root to every taxon is the same.
The final tree may not preserve the original distances between the taxa.
[Figure: the final rooted tree from the example, with leaves x_3, x_5, x_1, x_2, x_4]
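The equal root-to-taxon distance can be illustrated on the example tree. In the sketch below, each branch length is taken as the parent's height minus the child's height, with leaves at height 0; the `(height, left, right)` tuple encoding and the function name are assumptions made for illustration.

```python
def root_to_leaf(node, parent_height, travelled=0):
    """Yield (taxon, distance from the root) for every leaf below `node`.

    A leaf is a taxon name at height 0; an internal node is a tuple
    (height, left, right). A branch is as long as the height difference
    between its two ends.
    """
    if isinstance(node, str):
        # Final branch: from the parent's height down to the leaf at 0.
        yield node, travelled + parent_height
        return
    height, left, right = node
    # Branch from the parent down to this internal node.
    travelled += parent_height - height
    yield from root_to_leaf(left, height, travelled)
    yield from root_to_leaf(right, height, travelled)


# The example tree, with node heights 1 (C1), 3 (C2), 4 (C3) and 8 (root).
tree = (8, (3, "x1", (1, "x3", "x5")), (4, "x2", "x4"))
depths = dict(root_to_leaf(tree, parent_height=8))
print(depths)
```

Every taxon ends up at the same distance from the root, namely the root's height, which is what the molecular clock assumption builds into every UPGMA tree.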