
Cluster Analysis. Hal Whitehead, BIOL4062/5062.


1 Cluster Analysis Hal Whitehead BIOL4062/5062

2 What is cluster analysis?
Non-hierarchical cluster analysis
  – K-means
Hierarchical divisive cluster analysis
Hierarchical agglomerative cluster analysis
  – Linkage: single, complete, average, …
  – Cophenetic correlation coefficient
Additive trees
Problems with cluster analyses

3 Cluster Analysis
“Classification”
Maximize within-cluster homogeneity (similar individuals within cluster)
“The Search for Discontinuities”
Discontinuities: places to put divisions between clusters?

4 Discontinuities
Discontinuities generally present:
  – taxonomy
  – social organization
  – community ecology??

5 Types of cluster analysis
Uses: data, dissimilarity, or similarity matrix
Non-hierarchical
  – K-means
Hierarchical
  – Hierarchical divisive (repeated K-means, network methods)
  – Hierarchical agglomerative (single linkage, average linkage, ...)
Additive trees

6 Non-hierarchical Clustering Techniques: K-Means
Uses data matrix with Euclidean distances
Maximizes between-cluster variance for a given number of clusters
  – i.e. chooses clusters to maximize the F-ratio in a 1-way MANOVA

7 K-Means
Works iteratively:
1. Choose the number of clusters
2. Assign points to clusters (randomly, or using some other clustering technique)
3. Move each point to the other clusters in turn: does between-cluster variance increase?
4. Repeat step 3 until no improvement is possible
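The four steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the routine used to produce the output on the following slides: the function names `within_ss` and `kmeans_transfer`, the random start, and the tolerance are all assumptions. Because the total sum of squares is fixed, reducing the within-cluster sum of squares is equivalent to increasing the between-cluster variance.

```python
import numpy as np

def within_ss(X, labels, k):
    """Total within-cluster sum of squared distances to the cluster means."""
    total = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts) > 0:
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def kmeans_transfer(X, k, seed=0, max_passes=100):
    """K-means by moving single points between clusters (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 2: random initial assignment that uses every cluster at least once.
    labels = rng.permutation(np.arange(len(X)) % k)
    best = within_ss(X, labels, k)
    for _ in range(max_passes):          # Step 4: repeat until no improvement
        improved = False
        for i in range(len(X)):          # Step 3: try each point in turn
            home = labels[i]
            if np.sum(labels == home) == 1:
                continue                 # never empty a cluster
            for c in range(k):
                if c == home:
                    continue
                labels[i] = c
                trial = within_ss(X, labels, k)
                if trial < best - 1e-12:
                    best, home, improved = trial, c, True
                labels[i] = home         # keep the best assignment found so far
        if not improved:
            break
    return labels, best

# Two well-separated groups of three points each (made-up data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, wss = kmeans_transfer(X, k=2)
```

A different `seed` may give a different labelling, which is the local-optimum behaviour noted under the disadvantages of K-means.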

8 K-means with three clusters

9
Variable      Between SS   df   Within SS   df   F-ratio
X                  0.536    2       0.007    7   256.163
Y                  0.541    2       0.050    7    37.566
** TOTAL **        1.078    4       0.058   14
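As a rough check on the table above, each F-ratio is the between-cluster mean square divided by the within-cluster mean square. The sketch below recomputes it from the printed sums of squares; the helper name `f_ratio` is hypothetical. With 10 cases in 3 clusters, the degrees of freedom are 3 - 1 = 2 and 10 - 3 = 7, as in the table. The recomputed values only approximate the printed ones, presumably because the sums of squares are shown rounded to three decimals.

```python
def f_ratio(between_ss, df_between, within_ss, df_within):
    """Between-cluster mean square over within-cluster mean square."""
    return (between_ss / df_between) / (within_ss / df_within)

n_cases, n_clusters = 10, 3
df_between = n_clusters - 1        # 2
df_within = n_cases - n_clusters   # 7

f_x = f_ratio(0.536, df_between, 0.007, df_within)  # about 268
f_y = f_ratio(0.541, df_between, 0.050, df_within)  # about 37.9
```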

10 K-means with three clusters

Cluster 1 of 3 contains 4 cases
Members            | Statistics
Case      Distance | Variable  Minimum  Mean  Maximum  St.Dev.
Case 1        0.02 | X            0.41  0.45     0.49     0.04
Case 2        0.11 | Y            0.03  0.19     0.27     0.11
Case 3        0.06 |
Case 4        0.05 |

Cluster 2 of 3 contains 4 cases
Members            | Statistics
Case      Distance | Variable  Minimum  Mean  Maximum  St.Dev.
Case 7        0.06 | X            0.11  0.15     0.19     0.03
Case 8        0.03 | Y            0.61  0.70     0.77     0.07
Case 9        0.02 |
Case 10       0.06 |

Cluster 3 of 3 contains 2 cases
Members            | Statistics
Case      Distance | Variable  Minimum  Mean  Maximum  St.Dev.
Case 5        0.01 | X            0.77  0.77     0.78     0.01
Case 6        0.01 | Y            0.33  0.35     0.36     0.02

11 Disadvantages of K-means
Reaches an optimum, but not necessarily the global optimum
Must choose the number of clusters before the analysis
  – How many clusters?

12 Example: Sperm whale codas
Patterned series of clicks:
|    |    |    |    |
 ic1  ic2  ic3  ic4
For 5-click codas: 681 x 4 data set

13 5-click codas:
|    |    |    |    |
 ic1  ic2  ic3  ic4
93% of variance in 2 PCs

14 5-click codas: K-means with 10 clusters

15 Hierarchical Cluster Analysis
Usually represented by:
  – Dendrogram or tree diagram

16 Hierarchical Cluster Analysis
Hierarchical Divisive Cluster Analysis
Hierarchical Agglomerative Cluster Analysis

17 Hierarchical Divisive Cluster Analysis
Starts with all units in one cluster, then successively splits them
  – Successive use of K-means, or some other divisive technique, with n = 2 clusters
  – Either: each time, split the cluster with the greatest sum of squared distances
  – Or: split every cluster each time
Hierarchical divisive methods are good techniques, but are rarely used outside network analysis
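The "Either" strategy above (repeatedly split the cluster with the greatest sum of squared distances into two) might be sketched as below. The helper names `two_means`, `sum_sq` and `divisive`, the deterministic starting centres, and the toy data are assumptions made for illustration. The sketch also assumes the requested number of clusters does not exceed the number of distinct points.

```python
import numpy as np

def two_means(X, n_iter=50):
    """Split X into two groups with a basic 2-means (Lloyd-style) iteration."""
    # Deterministic start: the point farthest from the overall mean,
    # and the point farthest from that one.
    first = X[np.argmax(((X - X.mean(axis=0)) ** 2).sum(axis=1))]
    second = X[np.argmax(((X - first) ** 2).sum(axis=1))]
    centres = np.array([first, second], dtype=float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels

def sum_sq(X):
    """Sum of squared distances of the rows of X to their mean."""
    return ((X - X.mean(axis=0)) ** 2).sum()

def divisive(X, n_clusters):
    """Return a list of index arrays, one per cluster."""
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        # Split the cluster with the greatest sum of squared distances.
        worst = max(range(len(clusters)), key=lambda j: sum_sq(X[clusters[j]]))
        idx = clusters.pop(worst)
        labels = two_means(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

# Three made-up groups of three points each.
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
              [5.0, 5.0], [5.2, 5.0], [5.0, 5.2],
              [20.0, 0.0], [20.2, 0.0], [20.0, 0.2]])
clusters = divisive(X, 3)
```

Recording the order of the splits, rather than only the final groups, would give the hierarchy that a dendrogram displays.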

18 Hierarchical Agglomerative Cluster Analysis
Starts with each individual unit occupying its own cluster
The clusters are then gradually merged until just one is left
The most common cluster analyses

19 Hierarchical Agglomerative Cluster Analysis
Works on a dissimilarity matrix or a negative similarity matrix
  – may be Euclidean, Penrose, … distances
At each step:
1. There is a symmetric matrix of dissimilarities between clusters
2. The two clusters with the least dissimilarity are merged
3. The dissimilarity between the new (merged) cluster and all others is calculated
Different techniques do step 3 in different ways:

20 Hierarchical Agglomerative Cluster Analysis

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

First link A and D

     AD    B     C     E
AD   0     .     .     .
B    ?     0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

How to calculate new dissimilarities?

21 Hierarchical Agglomerative Cluster Analysis
Single Linkage

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

     AD    B     C     E
AD   0     .     .     .
B    0.35  0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

d(AD,B) = Min{d(A,B), d(D,B)}

22 Hierarchical Agglomerative Cluster Analysis
Complete Linkage

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

     AD    B     C     E
AD   0     .     .     .
B    0.45  0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

d(AD,B) = Max{d(A,B), d(D,B)}

23 Hierarchical Agglomerative Cluster Analysis
Average Linkage

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

     AD    B     C     E
AD   0     .     .     .
B    0.40  0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

d(AD,B) = Mean{d(A,B), d(D,B)}
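The single, complete and average linkage updates on slides 21 to 23 can be checked with a short sketch on the same five-unit dissimilarity matrix. The helper name `merged_dissimilarities` is hypothetical. The sketch covers only the first merge, where both clusters are single units; for later merges, average linkage is usually taken over all pairs of members, which this simple two-value mean does not handle.

```python
import numpy as np

labels = ["A", "B", "C", "D", "E"]
lower = [
    [0],
    [0.35, 0],
    [0.45, 0.67, 0],
    [0.11, 0.45, 0.57, 0],
    [0.22, 0.56, 0.78, 0.19, 0],
]

# Build the full symmetric matrix from the lower triangle.
D = np.zeros((5, 5))
for i, row in enumerate(lower):
    for j, v in enumerate(row):
        D[i, j] = D[j, i] = v

# Step 2: find the pair with the least dissimilarity (ignoring the diagonal).
M = D.copy()
np.fill_diagonal(M, np.inf)
i, j = np.unravel_index(np.argmin(M), M.shape)
closest = (labels[i], labels[j])

def merged_dissimilarities(D, labels, a, b, rule):
    """Step 3: dissimilarity of merged cluster {a, b} to every other unit."""
    i, j = labels.index(a), labels.index(b)
    return {lab: float(rule([D[i, k], D[j, k]]))
            for k, lab in enumerate(labels) if k not in (i, j)}

single = merged_dissimilarities(D, labels, "A", "D", min)
complete = merged_dissimilarities(D, labels, "A", "D", max)
average = merged_dissimilarities(D, labels, "A", "D", np.mean)
```

Here `closest` is the pair A and D, and the B entries of `single`, `complete` and `average` match the 0.35, 0.45 and 0.40 shown on the slides.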

24 Hierarchical Agglomerative Cluster Analysis
Centroid Clustering (uses data matrix, or true distance matrix)

     V1    V2    V3
A    0.11  0.75  0.33
B    0.35  0.99  0.41
C    0.45  0.67  0.22
D    0.11  0.71  0.37
E    0.22  0.56  0.78
F    0.13  0.14  0.55
G    0.55  0.90  0.21

V1(AD) = Mean{V1(A), V1(D)}

     V1    V2    V3
AD   0.11  0.73  0.35
B    0.35  0.99  0.41
C    0.45  0.67  0.22
E    0.22  0.56  0.78
F    0.13  0.14  0.55
G    0.55  0.90  0.21
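The centroid update on this slide can be reproduced with a few lines of Python. The helper name `merge_centroid` is hypothetical, and weighting by cluster size is an assumption added so that later merges of unequal clusters still give the mean of all members. The slide itself only shows the simple mean of two single units.

```python
import numpy as np

data = {
    "A": [0.11, 0.75, 0.33],
    "B": [0.35, 0.99, 0.41],
    "C": [0.45, 0.67, 0.22],
    "D": [0.11, 0.71, 0.37],
    "E": [0.22, 0.56, 0.78],
    "F": [0.13, 0.14, 0.55],
    "G": [0.55, 0.90, 0.21],
}
sizes = {name: 1 for name in data}  # every unit starts as its own cluster

def merge_centroid(data, sizes, a, b):
    """Replace clusters a and b by their size-weighted centroid."""
    na, nb = sizes[a], sizes[b]
    centroid = (na * np.asarray(data[a]) + nb * np.asarray(data[b])) / (na + nb)
    merged = {k: np.asarray(v, dtype=float)
              for k, v in data.items() if k not in (a, b)}
    merged[a + b] = centroid
    new_sizes = {k: n for k, n in sizes.items() if k not in (a, b)}
    new_sizes[a + b] = na + nb
    return merged, new_sizes

merged, new_sizes = merge_centroid(data, sizes, "A", "D")
# merged["AD"] is approximately [0.11, 0.73, 0.35], as in the second table.
```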

25 Hierarchical Agglomerative Cluster Analysis
Ward’s Method
Minimizes within-cluster sum of squares
Similar to centroid clustering
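One common way to write Ward's criterion, given here as a standard formulation rather than something taken from the slides, is as the increase in total within-cluster sum of squares caused by merging clusters i and j. The symbols are notation introduced for this note: n_i and n_j are the cluster sizes, and the barred x terms are the cluster centroids. At each step the pair with the smallest increase is merged. The dependence on the distance between centroids is what makes the method similar to centroid clustering.

```latex
\Delta_{ij} \;=\; \frac{n_i \, n_j}{n_i + n_j} \, \left\lVert \bar{x}_i - \bar{x}_j \right\rVert^{2}
```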

26
 1   1.00
 2   0.00  1.00
 4   0.53  0.00  1.00
 5   0.18  0.05  0.00  1.00
 9   0.22  0.09  0.13  0.25  1.00
11   0.36  0.00  0.17  0.40  0.33  1.00
12   0.00  0.37  0.18  0.00  0.13  0.00  1.00
14   0.74  0.00  0.30  0.20  0.23  0.17  0.00  1.00
15   0.53  0.00  0.30  0.00  0.36  0.00  0.26  0.56  1.00
19   0.00  0.00  0.17  0.21  0.43  0.32  0.29  0.09  0.09  1.00
20   0.04  0.00  0.17  0.00  0.14  0.10  0.35  0.00  0.18  0.25  1.00
     1     2     4     5     9     11    12    14    15    19    20


28 Hierarchical Agglomerative Clustering Techniques
Single Linkage
  – Produces “straggly” clusters
  – Not recommended if much experimental error
  – Used in taxonomy
  – Invariant to monotonic transformations of the dissimilarity measure
Complete Linkage
  – Produces “tight” clusters
  – Not recommended if much experimental error
  – Invariant to monotonic transformations of the dissimilarity measure
Average Linkage, Centroid, Ward’s
  – Most likely to mimic input clusters
  – Not invariant to transformations of the dissimilarity measure

29 Cophenetic Correlation Coefficient (CCC)
Correlation between the original dissimilarity matrix and the dissimilarities inferred from the cluster analysis
CCC above about 0.8 indicates a good match
CCC below about 0.8: dendrogram is not a good representation
  – probably should not be displayed
Use CCC to choose the best linkage method (highest coefficient)
Similarity matrix: as on slide 26
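A minimal sketch of choosing a linkage method by CCC, using SciPy's `linkage` and `cophenet` on the similarity matrix from slide 26. Converting similarity s to dissimilarity as 1 - s is an assumption made here, as is the choice of the three methods compared. The sketch is not expected to reproduce the CCC values reported in the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Lower triangle of the similarity matrix (units 1, 2, 4, 5, 9, 11, 12, 14, 15, 19, 20).
lower = [
    [1.00],
    [0.00, 1.00],
    [0.53, 0.00, 1.00],
    [0.18, 0.05, 0.00, 1.00],
    [0.22, 0.09, 0.13, 0.25, 1.00],
    [0.36, 0.00, 0.17, 0.40, 0.33, 1.00],
    [0.00, 0.37, 0.18, 0.00, 0.13, 0.00, 1.00],
    [0.74, 0.00, 0.30, 0.20, 0.23, 0.17, 0.00, 1.00],
    [0.53, 0.00, 0.30, 0.00, 0.36, 0.00, 0.26, 0.56, 1.00],
    [0.00, 0.00, 0.17, 0.21, 0.43, 0.32, 0.29, 0.09, 0.09, 1.00],
    [0.04, 0.00, 0.17, 0.00, 0.14, 0.10, 0.35, 0.00, 0.18, 0.25, 1.00],
]
n = len(lower)
S = np.zeros((n, n))
for i, row in enumerate(lower):
    for j, v in enumerate(row):
        S[i, j] = S[j, i] = v

D = 1.0 - S                  # assumed conversion from similarity to dissimilarity
condensed = squareform(D)    # condensed vector expected by linkage()

ccc = {}
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    c, _ = cophenet(Z, condensed)   # c is the cophenetic correlation coefficient
    ccc[method] = float(c)

best = max(ccc, key=ccc.get)        # linkage method with the highest CCC
```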

30 [Figure: dendrograms with CCC=0.83, CCC=0.75, CCC=0.77, CCC=0.80]

31 Additive trees
Dendrogram in which path lengths represent dissimilarities
Computation quite complex (cross between agglomerative techniques and multidimensional scaling)
Good when data are measured as dissimilarities
Often used in taxonomy and genetics

     A    B    C    D    E
A    .    .    .    .    .
B    14   .    .    .    .
C    6    12   .    .    .
D    8    17   13   .    .
E171616.

32 Problems with Cluster Analysis
Are there really biologically meaningful clusters in the data?
Does the dendrogram represent biological reality (web-of-life versus tree-of-life)?
How many clusters to use?
  – stopping rules are arbitrary
Which method to use?
  – best technique is data-dependent
Dendrograms become messy with many units

33 Social Structure of 160 northern bottlenose whales

34 Clustering Techniques

Type                         Technique                           Use
Non-hierarchical             K-means                             Dividing data sets
Hierarchical divisive        Repeated K-means                    Good technique on small data sets
                             Network methods                     ...
Hierarchical agglomerative   Single linkage                      Taxonomy
                             Complete linkage                    Tighter clusters
                             Average linkage, Centroid, Ward’s   Usually preferred
Hierarchical                 Additive trees                      Excellent for displaying dissimilarity; taxonomy, genetics


