Cluster analysis


Cluster analysis

Species  Sequence
P.sym    A AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTTCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.xan    A AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTAATATTCCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.pola   A AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTCCGTATGCTATGTAGCTGGAGGGTACTGACGGTAG
C.plat   A AATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTAAGGGTACTGATTTTAG
C.grad   A AATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTCCGGGTACTGATTTTAG
D.sym    T TATGCGAGACGTGAAAAATCTTTAGGGCTAAGGTGATTATTTCGGTTGCTATGTAGAGGAAGGGTACTGACGGTAG

A cluster analysis is a two-step process that includes the choice of (a) a distance metric and (b) a linkage algorithm.

Cluster analysis tries to minimize within-cluster distances and to maximize between-cluster distances.

The distance metric

         P.sym  P.xan  P.pola  C.plat  C.grad  D.sym
P.sym      0      2      3       7       9      13
P.xan      2      0      4      11      11      15
P.pola     3      4      0      10      10      12
C.plat     7     11     10       0       2      19
C.grad     9     11     10       2       0      19
D.sym     13     15     12      19      19       0

In the simplest case, a distance matrix counts the number of differences between two data sets. The matrix is symmetric, so each species pair appears twice.
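Counting differences between aligned sequences can be sketched as follows. This is a minimal illustration, not the procedure used for the slide: the function names and the short toy sequences are hypothetical and were chosen only to show the counting.

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def distance_matrix(sequences: dict) -> list:
    """Symmetric matrix of pairwise difference counts, in dict order."""
    names = list(sequences)
    return [[hamming_distance(sequences[r], sequences[c]) for c in names]
            for r in names]


# Hypothetical toy sequences (not the species data from the slide).
toy = {"sp1": "AATGCC", "sp2": "AATGCA", "sp3": "TATGAA"}
matrix = distance_matrix(toy)
for name, row in zip(toy, matrix):
    print(name, row)
```

The diagonal is zero by construction and the matrix is symmetric, as in the slide's species matrix.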

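Species presence-absence matrix A

         Site 1  Site 2  Site 3  Site 4
P.sym      1       0       1       1
P.xan      1       0       0       1
P.pola     0       1       0       1
C.plat     0       1       1       1
C.grad     1       0       0       0
D.sym      1       0       1       1
Sum        4       2       3       5

Distance matrix D = A^T A (diagonal: number of species per site; off-diagonal: number of species shared by two sites)

         Site 1  Site 2  Site 3  Site 4
Site 1     4       0       2       3
Site 2     0       2       1       2
Site 3     2       1       3       3
Site 4     3       2       3       5

Two indices are derived from D: the Soerensen index and the Jaccard index. The Soerensen index for each pair of sites is:

         Site 1    Site 2    Site 3    Site 4
Site 1   1         0         0.571429  0.666667
Site 2   0         1         0.4       0.571429
Site 3   0.571429  0.4       1         0.75
Site 4   0.666667  0.571429  0.75      1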
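The presence-absence example can be reproduced with a few lines of code. The sketch below uses the matrix A from the slide. The index formulas are the standard textbook definitions and are assumed here, because the slide's own formulas did not survive in the transcript: Soerensen = 2a / (n_i + n_j) and Jaccard = a / (n_i + n_j - a), with a the number of shared species and n_i, n_j the species numbers of the two sites. The function names are illustrative.

```python
# Presence-absence matrix A: rows are species, columns are sites 1 to 4.
A = [
    [1, 0, 1, 1],  # P.sym
    [1, 0, 0, 1],  # P.xan
    [0, 1, 0, 1],  # P.pola
    [0, 1, 1, 1],  # C.plat
    [1, 0, 0, 0],  # C.grad
    [1, 0, 1, 1],  # D.sym
]


def cooccurrence(a):
    """D = A^T A: species per site on the diagonal, shared species elsewhere."""
    n_sites = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n_sites)]
            for i in range(n_sites)]


def soerensen(d, i, j):
    """Soerensen similarity of sites i and j (assumed standard definition)."""
    return 2 * d[i][j] / (d[i][i] + d[j][j])


def jaccard(d, i, j):
    """Jaccard similarity of sites i and j (assumed standard definition)."""
    return d[i][j] / (d[i][i] + d[j][j] - d[i][j])


D = cooccurrence(A)
print(D)
print(round(soerensen(D, 0, 2), 6))  # sites 1 and 3
print(jaccard(D, 0, 2))
```

For sites 1 and 3 this gives a Soerensen value of 4/7, which matches the 0.571429 in the slide's table.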

Abundance data

         Site 1  Site 2  Site 3  Site 4
P.sym     0.31    0.12    0.24    0.05
P.xan     0.20    0.65    0.54    0.44
P.pola    0.38    0.81    0.28    0.52
C.plat    0.35    0.69    0.86    0.30
C.grad    0.07    0.99    0.64    0.84
D.sym     0.43    0.78    0.73    0.21
Sum       1.75    4.04    3.30    2.36

Correlation distance matrix

         Site 1    Site 2    Site 3    Site 4
Site 1    1        -0.27534  -0.04805  -0.71587
Site 2   -0.27534   1         0.519139  0.807173
Site 3   -0.04805   0.519139  1         0.157251
Site 4   -0.71587   0.807173  0.157251  1

Euclidean distance: because the differences are squared, Euclidean distances put particular weight on outliers. The metric needs a linear scale.

Manhattan distance: the Manhattan distance needs linear scales. Despite a large distance, the metric might be zero.

Correlation distance: correlations are sensitive to non-linearities in the data.

Bray-Curtis distance: the Bray-Curtis distance is equivalent to the Soerensen index for presence-absence data. It suffers from the same shortcoming as the Manhattan distance.
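The four metrics for abundance data can be written compactly. The sketch below uses the usual textbook definitions, which are an assumption because the slide's formulas are not preserved. The correlation function returns Pearson's r itself, as in the slide's matrix (diagonal of 1). The two site vectors are the rounded columns for Site 1 and Site 2, so results computed from them may deviate slightly from the slide's values.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def bray_curtis(x, y):
    """Summed absolute differences divided by the summed abundances."""
    return manhattan(x, y) / sum(a + b for a, b in zip(x, y))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Rounded abundance columns for Site 1 and Site 2 from the slide.
site1 = [0.31, 0.20, 0.38, 0.35, 0.07, 0.43]
site2 = [0.12, 0.65, 0.81, 0.69, 0.99, 0.78]
for metric in (euclidean, manhattan, bray_curtis, pearson_r):
    print(metric.__name__, round(metric(site1, site2), 4))
```

For presence-absence vectors the Bray-Curtis dissimilarity equals one minus the Soerensen similarity: for [1, 0, 1] and [1, 1, 0] both sites hold two species and share one, so Soerensen is 0.5 and Bray-Curtis is 0.5 as well.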

Linkage algorithm

We first combine the species that are nearest to each other to form an inner cluster. In the next step we look for a species or a cluster that is closest to the average distance of the initial cluster. We continue this procedure until all species are grouped. The single linkage algorithm tends to produce many small clusters.

(Dendrogram of P.sym, P.xan, P.pola, C.plat, C.grad and D.sym.)

Sequential versus simultaneous algorithms
In simultaneous algorithms the final solution is obtained in a single step and not stepwise as in the single linkage above.

Agglomeration versus division algorithms
Agglomerative procedures operate bottom up, division procedures top down.

Monothetic versus polythetic algorithms
Polythetic procedures use several descriptors of linkage; monothetic procedures use the same one at each step (for instance maximum association).

Hierarchical versus non-hierarchical algorithms
Hierarchical methods proceed in a non-overlapping way. During the linkage process all members of lower clusters are members of the next higher cluster. Non-hierarchical methods proceed by optimizing within-group homogeneity. Hence they might include members not contained in higher-order clusters.

The single linkage algorithm uses the minimum distance between the members of two clusters as the measure of cluster distance. It favours chains of small clusters.

The average linkage uses average distances between clusters. It frequently gives larger clusters. The most often used average linkage algorithm is the Unweighted Pair Group Method with Arithmetic Mean (UPGMA).

The Ward algorithm calculates the total sum of squared deviations from the mean of a cluster and assigns members so as to minimize this sum. The method often gives clusters of rather equal size.

Median clustering tries to minimize within-cluster variance.
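The difference between single and average linkage can be made concrete with a small agglomerative sketch. It is a naive illustration, not an optimized implementation. It uses the species distance matrix from the distance-metric slide, with the cells that are blank in the transcript filled in from the symmetric entries (an assumption). The names `agglomerate` and `cluster_distance` are hypothetical.

```python
from itertools import combinations

LABELS = ["P.sym", "P.xan", "P.pola", "C.plat", "C.grad", "D.sym"]
# Species distance matrix; blank cells filled by symmetry (assumption).
DIST = [
    [0, 2, 3, 7, 9, 13],
    [2, 0, 4, 11, 11, 15],
    [3, 4, 0, 10, 10, 12],
    [7, 11, 10, 0, 2, 19],
    [9, 11, 10, 2, 0, 19],
    [13, 15, 12, 19, 19, 0],
]


def cluster_distance(c1, c2, dist, linkage):
    """Distance between two clusters given as tuples of item indices."""
    pair_distances = [dist[i][j] for i in c1 for j in c2]
    if linkage == "single":    # minimum distance between members
        return min(pair_distances)
    if linkage == "average":   # mean over all member pairs (UPGMA)
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(dist, linkage="single"):
    """Merge the two closest clusters until one is left; return the merges."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], dist, linkage))
        d = cluster_distance(a, b, dist, linkage)
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merges.append((merged, d))
    return merges


single = agglomerate(DIST, "single")
average = agglomerate(DIST, "average")
for members, d in single:
    print([LABELS[i] for i in members], d)
```

Both variants join the same clusters in the same order here, but the average linkage merges the later clusters at larger distances than the single linkage does.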

To check the performance of different cluster algorithms and distance metrics we use a matrix of random numbers. Which clusters to accept?

Different cluster algorithms give different results. We accept those clusters that are stable irrespective of algorithm. In the case of our random numbers clustering is very unstable.

Two methods detected the clusters OP and ABC. All other items are not clearly separated. The position of item F remains unclear.

Clustering using a predefined number of clusters: K-means

(Plot of the items A to P.)

K-means clustering starts from a predefined number of clusters and then arranges the items in such a way that the distances between clusters are maximized with respect to the distances within the clusters. Technically, the algorithm first randomly assigns cluster means and then places items (each time calculating new cluster means) until an optimal solution (convergence) has been reached. K-means always uses Euclidean distances.
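The K-means procedure can be sketched in a few lines. The version below is the common batch variant (Lloyd's algorithm), which reassigns all items before recalculating the means. That is an assumption, since the slide does not say which variant it refers to. The function name, the optional `initial_means` argument and the example points are hypothetical. The start means are fixed here only to make the run reproducible.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, initial_means=None, seed=0, max_iter=100):
    """Return (cluster index per point, list of cluster means)."""
    rng = random.Random(seed)
    # Random start means unless explicit ones are supplied.
    means = list(initial_means) if initial_means is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Place every item in the cluster with the nearest mean.
        new_assignment = [min(range(k), key=lambda c: squared_distance(p, means[c]))
                          for p in points]
        if new_assignment == assignment:  # nothing moved: converged
            break
        assignment = new_assignment
        # Recalculate the cluster means.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = mean_point(members)
    return assignment, means


# Hypothetical example: two well separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, means = k_means(points, k=2, initial_means=[(0, 0), (0, 1)])
print(assignment, means)
```

Even though both start means lie in the first group, the second mean is pulled towards the distant points after one update and the two groups separate.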

Neighbour joining

Neighbour joining is used particularly to generate phylogenetic trees.

Dissimilarities: you need dissimilarities (phylogenetic distances) between all pairs of elements X and Y.
Select the pair with the lowest value of Q.
Calculate the new dissimilarities.
Calculate the distances from the new node.
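One joining step can be sketched as follows. The formulas are the standard neighbour-joining formulas (Saitou and Nei) and are assumed here, because the slide's own formulas are not preserved in the transcript: Q(i,j) = (n - 2) d(i,j) - sum_k d(i,k) - sum_k d(j,k); the distance from i to the new node is d(i,j)/2 + (sum_k d(i,k) - sum_k d(j,k)) / (2(n - 2)); and the new dissimilarity to any other element k is (d(i,k) + d(j,k) - d(i,j)) / 2. The five-taxon matrix and the function names are hypothetical.

```python
def q_matrix(d):
    """Q(i, j) = (n - 2) * d(i, j) - total(i) - total(j) for i != j."""
    n = len(d)
    totals = [sum(row) for row in d]
    return [[(n - 2) * d[i][j] - totals[i] - totals[j] if i != j else 0
             for j in range(n)] for i in range(n)]


def join_step(d):
    """Pick the pair with the lowest Q and compute the resulting distances."""
    n = len(d)
    totals = [sum(row) for row in d]
    q = q_matrix(d)
    i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda pair: q[pair[0]][pair[1]])
    # Distances from the two joined elements to the new node.
    limb_i = d[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    limb_j = d[i][j] - limb_i
    # New dissimilarities between the new node and all remaining elements.
    new_dist = {k: (d[i][k] + d[j][k] - d[i][j]) / 2
                for k in range(n) if k not in (i, j)}
    return (i, j), (limb_i, limb_j), new_dist


# Hypothetical dissimilarities between five taxa.
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, limbs, new_dist = join_step(d)
print(pair, limbs, new_dist)
```

Repeating the step on the reduced matrix (the new node plus the remaining elements) until three elements are left yields the complete tree.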

Ordination

Ordination comprises a number of techniques to classify data according to predefined standards. The simplest ordination technique is cluster analysis. An easy but powerful technique is principal component analysis (PCA).

Factor analysis

Is it possible to group the variables according to their values for the countries?

Variables: T (Jan), T (July), Mean T, Diff T, GDP, GDP/C, Elev
Factors: Factor 1, Factor 2, Factor 3
Links between variables and factors: correlations

The task is to find the coefficients of correlation between the original variables and the extracted factors from the analysis of the coefficients of correlation between the original variables.

Because the f values are also Z-transformed we have Eigenvalue

How to compute the factor loadings?

The dot product of an orthonormal matrix with its transpose gives the unity (identity) matrix.

Fundamental theorem of factor analysis

Z-transformed factor values b (cases n, factors F)

Case  F1   F2
1     f11  f12
2     f21  f22
3     f31  f32
4     f41  f42
5     f51  f52
6     f61  f62

Factors are new variables. They have factor values (independent of loadings) for each case. These factors can now be used in further analysis, for instance in regression analysis.

We are looking for a new x,y system where the data are closest to the longest axis. PCA in fact rotates the original data set to find a solution where the data are closest to the axes. PCA leaves the number of axes unchanged. Only a few of these rotated axes can be interpreted from the distances to the original axes. We interpret the new axes on the basis of their distance (measured by their angle) to the original axes. The new axes are the principal axes (eigenvectors) of the dispersion matrix obtained from raw data.

(Figure: original axes X1, Y1 and rotated axes X'1, Y'1.)

PCA is an eigenvector method: the principal axes are eigenvectors.
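For two variables the rotation can be written in closed form, which makes the idea easy to check by hand. The sketch below assumes that the dispersion matrix is the sample covariance matrix (divisor n - 1). The function name and the three example points are hypothetical. The rotation angle of the first principal axis is 0.5 * atan2(2 s_xy, s_xx - s_yy).

```python
import math


def principal_axes_2d(points):
    """Rotate centred 2-D data onto the eigenvectors of its covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2 x 2 dispersion (sample covariance) matrix.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Angle between the original x axis and the first principal axis.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    # Eigenvalues = variances along the two principal axes.
    half_spread = math.hypot((sxx - syy) / 2, sxy)
    eigenvalues = ((sxx + syy) / 2 + half_spread, (sxx + syy) / 2 - half_spread)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    rotated = [((x - mean_x) * cos_a + (y - mean_y) * sin_a,
                -(x - mean_x) * sin_a + (y - mean_y) * cos_a)
               for x, y in points]
    return angle, eigenvalues, rotated


# Hypothetical data lying exactly on the line y = x.
angle, eigenvalues, rotated = principal_axes_2d([(1, 1), (2, 2), (3, 3)])
print(math.degrees(angle), eigenvalues, rotated)
```

The first principal axis lies at 45 degrees to the original x axis and carries all of the variance, so the second coordinate of every rotated point is zero.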

The programs differ in the direction of eigenvectors. This does not change the results but might pose problems with the interpretation of factors according to the original variables.

Principal coordinate analysis

PCoA uses different metrics to generate the dispersion matrix.

Using PCA or PCoA to group cases

A factor might be interpreted if more than two variables have loadings higher than 0.7. A factor might be interpreted if more than four variables have loadings higher than 0.6. A factor might be interpreted if more than 10 variables have loadings higher than 0.4.

Correspondence analysis (reciprocal averaging, seriation, contingency table analysis)

Correspondence analysis ordinates rows and columns of matrices simultaneously according to their principal axes. It uses χ² distances instead of correlation coefficients or Euclidean distances.

(Tables: contingency table and χ² distances.)

We take the transposed raw data matrix and calculate eigenvectors in the same way. Correspondence analysis is row and column ordination.

(Figure: joint plot.)

The plots are similar but differ numerically and in orientation. The orientation problem comes again from the way Excel calculates eigenvalues. Row and column eigenvectors differ in scale. For a joint plot the vectors have to be rescaled.

Reciprocal averaging

Sorting rows and columns according to their scores on the row and column eigenvectors rearranges the matrix so that the largest values are near the matrix diagonal.

Seriation using reciprocal averaging

Random start scores: =los() (the function is called RAND in English-language Excel)
Weighted mean: =(B85*B$97+C85*C$97+D85*D$97+E85*E$97)/$F85
Z-transformed weighted means: =(H85-H$94)/H$95

Repeat until the scores become stable.
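The spreadsheet steps translate directly into a short loop. The sketch below is one possible reading of the procedure, not a transcription of the worksheet. It uses the population standard deviation for the Z-transformation, which is an assumption, as the transcript does not show which Excel function fills cell H95. The function names and the small example matrix are hypothetical.

```python
import random
import statistics


def z_transform(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def weighted_means(matrix, scores):
    """For each row: mean of `scores` weighted by the entries of that row."""
    return [sum(v * s for v, s in zip(row, scores)) / sum(row) for row in matrix]


def reciprocal_averaging(matrix, iterations=200, seed=1):
    """Alternate row and column averaging, starting from random column scores."""
    rng = random.Random(seed)
    col_scores = [rng.random() for _ in matrix[0]]
    transposed = [list(col) for col in zip(*matrix)]
    for _ in range(iterations):
        row_scores = z_transform(weighted_means(matrix, col_scores))
        col_scores = z_transform(weighted_means(transposed, row_scores))
    return row_scores, col_scores


# Hypothetical matrix whose rows are out of order along a gradient.
matrix = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
row_scores, col_scores = reciprocal_averaging(matrix)
order = sorted(range(len(matrix)), key=lambda i: row_scores[i])
print(row_scores, order)
```

Sorting the rows by their final scores places the third row between the other two, which moves the filled cells towards the matrix diagonal. The sign of the scores, and therefore the direction of the sorted order, depends on the random start.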
