Multivariate Statistical Methods


1 Multivariate Statistical Methods
Cluster Analysis By Jen-pei Liu, PhD Division of Biometry, Department of Agronomy, National Taiwan University and Division of Biostatistics and Bioinformatics National Health Research Institutes 2018/12/10 Copyright by Jen-pei Liu, PhD

2 Copyright by Jen-pei Liu, PhD
Cluster Analysis
- Introduction
- Measures of Similarity
- Hierarchical Clustering
- K-means Clustering
- Summary

3 Copyright by Jen-pei Liu, PhD
Introduction
- A sample of n objects, each with measurements on p variables
- Use the measurements on the p variables to devise a scheme for grouping the n objects into classes
- Similar objects fall in the same class

4 Copyright by Jen-pei Liu, PhD
Introduction
- In general, the number of clusters is not known in advance: cluster analysis is an unsupervised method
- In discriminant analysis, by contrast, the number of classes is pre-specified and classification is based on a prediction function: a supervised method

5 Copyright by Jen-pei Liu, PhD
Introduction
Examples
- Clusters of depressed patients
- Data reduction
- Marketing: test markets involve a large number of cities; form a small number of groups of similar cities and select one member from each group for testing
- Microarray: clusters of genes; clusters of subjects

6 Copyright by Jen-pei Liu, PhD
Introduction
Types of clustering methods
- Hierarchical clustering: finds a series of partitions; a bottom-up (agglomerative) approach
- Partitional methods: produce a single partition of the objects; a top-down approach

7 Copyright by Jen-pei Liu, PhD
Introduction
Example: test scores of six students on Chinese (X1) and Math (X2) (table shown on the slide)

8 Measures of Similarity

9 Measures of Similarity
Euclidean distance matrix for the 6 students (shown on the slide)

10 Measures of Similarity

11 Measures of Similarity
The Manhattan (city-block) distance: the sum of the absolute differences between coordinates, d(x, y) = Σ|xi − yi|
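Both distances can be sketched in a few lines of Python. The two score vectors below are the Chinese/Math scores of students 1 and 4 from the centroid example later in the deck; `euclidean` and `manhattan` are hypothetical helper names, not the lecture's code.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(np.abs(x - y)))

# Scores (Chinese, Math) of students 1 and 4
a, b = [85, 82], [90, 95]
print(euclidean(a, b))  # sqrt(25 + 169) = sqrt(194) ≈ 13.93
print(manhattan(a, b))  # 5 + 13 = 18
```

The Euclidean value ≈ 13.93 matches the first merge distance reported in the centroid-method example later in the deck.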

12 Measures of Similarity

13 Measures of Similarity

14 Measures of Similarity
- Correlation coefficient: a measure of association, not a measure of similarity (agreement)
- Euclidean distance: a measure of agreement, not a measure of association

15 Measures of Similarity
Example (data shown on the slide): three cases of paired variables (X1, X2)
Case I: r = 1, d² = 0;  Case II: r = 1, d² = 30;  Case III: r = 1, d² = 270
The correlation is 1 in all three cases even though the Euclidean distances differ.
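The contrast can be reproduced on made-up profiles (the slide's own X1/X2 values are not reproduced here): a perfect correlation does not mean the two profiles agree.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
shift = x + 3.0   # same pattern, shifted upward
scale = 2.0 * x   # same pattern, rescaled

def r(a, b):
    """Pearson correlation coefficient."""
    return float(np.corrcoef(a, b)[0, 1])

def d2(a, b):
    """Squared Euclidean distance."""
    return float(np.sum((a - b) ** 2))

print(r(x, x), d2(x, x))          # r = 1, d² = 0
print(r(x, shift), d2(x, shift))  # r = 1, d² = 5 * 3² = 45
print(r(x, scale), d2(x, scale))  # r = 1, d² = 1+4+9+16+25 = 55
```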

16 Hierarchical Clustering
General steps for n objects
Step 1: Start with n clusters, each object being its own cluster; compute all pairwise distances among clusters
Step 2: Find the minimum distance and merge the corresponding two clusters into one cluster
Step 3: With n-1 clusters, compute all pairwise distances among the n-1 clusters
Step 4: Find the minimum distance and merge the corresponding two clusters into one cluster
Step 5: Repeat Steps 2-4 until all n objects are merged into one big cluster
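The five steps can be sketched as a generic agglomerative loop. This is a minimal illustration, not the lecture's code: `agglomerate` is a hypothetical helper, and the matrix is the 5-object example used in the single-linkage slides (objects 0..4 here correspond to objects 1..5 on the slides).

```python
import numpy as np

def agglomerate(D, link=min):
    # Steps 1-5 for a full symmetric distance matrix D; `link` combines
    # the object-level distances between two clusters (min gives single
    # linkage, max gives complete linkage).
    clusters = [frozenset([i]) for i in range(D.shape[0])]  # Step 1
    merges = []
    while len(clusters) > 1:
        # Steps 2-4: find the closest pair of clusters and merge them
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = link(D[a, b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append((d, merged))
    return merges  # Step 5 ends with one big cluster

D = np.array([[0, 9, 3, 6, 11],
              [9, 0, 7, 5, 10],
              [3, 7, 0, 9, 2],
              [6, 5, 9, 0, 8],
              [11, 10, 2, 8, 0]], dtype=float)
print([float(m[0]) for m in agglomerate(D)])  # [2.0, 3.0, 5.0, 6.0]
```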

17 Hierarchical Clustering

18 Hierarchical Clustering
Single Linkage (Nearest-neighbor) Method: use the minimum distance between clusters
Distance matrix for 5 objects:
       1    2    3    4    5
  1    0
  2    9    0
  3    3    7    0
  4    6    5    9    0
  5   11   10    2    8    0

19 Hierarchical Clustering
Single Linkage Method
Step 1: 5 clusters: {1},{2},{3},{4},{5}
Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
Step 3: Find the minimum distances among {35},{1},{2},{4}:
d{35}1 = min[d31, d51] = min[3, 11] = 3
d{35}2 = min[d32, d52] = min[7, 10] = 7
d{35}4 = min[d34, d54] = min[9, 8] = 8

20 Hierarchical Clustering
Single Linkage Method: update the distance matrix
        {35}    1    2    4
  {35}    0
  1       3    0
  2       7    9    0
  4       8    6    5    0

21 Hierarchical Clustering
Single Linkage Method
Step 4: The minimum distance is 3, between {35} and {1}; merge {35} and {1} into {135}
Step 5: Find the distances between {135} and {2}, {4}:
d{135}2 = min[d{35}2, d12] = min[7, 9] = 7
d{135}4 = min[d{35}4, d14] = min[8, 6] = 6

22 Hierarchical Clustering
Single Linkage Method: update the distance matrix
        {135}    2    4
  {135}    0
  2        7    0
  4        6    5    0
The minimum distance is 5, between {2} and {4}; merge {2} and {4} into {24}

23 Hierarchical Clustering
Single Linkage Method
Find the minimum distance between {135} and {24}:
d{135}{24} = min[d{135}2, d{135}4] = min[7, 6] = 6
Update the distance matrix
         {135}  {24}
  {135}    0
  {24}     6     0

24 Hierarchical Clustering
Single Linkage Method: merge history
Distance   Clusters
2          {35},{1},{2},{4}
3          {135},{2},{4}
5          {135},{24}
6          {12345}
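The single-linkage walkthrough above can be checked mechanically. A minimal sketch using SciPy, assuming scipy is available (object labels in SciPy are 0-based):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Distance matrix for the 5 objects (rows/columns = objects 1..5)
D = np.array([[0, 9, 3, 6, 11],
              [9, 0, 7, 5, 10],
              [3, 7, 0, 9, 2],
              [6, 5, 9, 0, 8],
              [11, 10, 2, 8, 0]], dtype=float)

# linkage() takes the condensed (upper-triangle) distance vector
Z = linkage(squareform(D), method='single')
print(Z[:, 2])  # merge heights [2. 3. 5. 6.], matching the table above
```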

25 Hierarchical Clustering
Dendrograms
- A two-dimensional tree structure, rooted at the top
- One dimension is the distance measure; the other shows the clustering results
- The height of the vertical (or horizontal) line represents the distance between the two clusters it merges
- Greater height represents greater distance

26 Hierarchical Clustering
Complete Linkage (Farthest-neighbor) Method: use the maximum distance between clusters
Distance matrix for 5 objects:
       1    2    3    4    5
  1    0
  2    9    0
  3    3    7    0
  4    6    5    9    0
  5   11   10    2    8    0

27 Hierarchical Clustering

28 Hierarchical Clustering
Complete Linkage Method
Step 1: 5 clusters: {1},{2},{3},{4},{5}
Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
Step 3: Find the maximum distances among {35},{1},{2},{4}:
d{35}1 = max[d31, d51] = max[3, 11] = 11
d{35}2 = max[d32, d52] = max[7, 10] = 10
d{35}4 = max[d34, d54] = max[9, 8] = 9

29 Hierarchical Clustering
Complete Linkage Method: update the distance matrix
        {35}    1    2    4
  {35}    0
  1      11    0
  2      10    9    0
  4       9    6    5    0

30 Hierarchical Clustering
Complete Linkage Method
Step 4: The minimum distance is 5, between {2} and {4}; merge {2} and {4} into {24}
Step 5: Find the maximum distances:
d{24}{35} = max[d2{35}, d4{35}] = max[10, 9] = 10
d{24}1 = max[d21, d41] = max[9, 6] = 9

31 Hierarchical Clustering
Complete Linkage Method: update the distance matrix
        {35}  {24}    1
  {35}    0
  {24}   10    0
  1      11    9    0
The minimum distance is 9, between {1} and {24}; merge {1} and {24} into {124}

32 Hierarchical Clustering
Complete Linkage Method
Find the maximum distance between {124} and {35}:
d{124}{35} = max[d1{35}, d{24}{35}] = max[11, 10] = 11
Update the distance matrix
         {35}  {124}
  {35}     0
  {124}   11     0

33 Hierarchical Clustering
Complete Linkage Method: merge history
Distance   Clusters
2          {35},{1},{2},{4}
5          {35},{1},{24}
9          {35},{124}
11         {12345}
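As with single linkage, the complete-linkage steps can be verified by running the same distance matrix through SciPy (a sketch, assuming scipy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Same 5-object distance matrix as in the single-linkage example
D = np.array([[0, 9, 3, 6, 11],
              [9, 0, 7, 5, 10],
              [3, 7, 0, 9, 2],
              [6, 5, 9, 0, 8],
              [11, 10, 2, 8, 0]], dtype=float)

Z = linkage(squareform(D), method='complete')
print(Z[:, 2])  # merge heights [2. 5. 9. 11.], matching the table above
```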

34 Hierarchical Clustering

35 Copyright by Jen-pei Liu, PhD
Average Clustering: the Average Linkage Method uses the average distance between clusters

36 Copyright by Jen-pei Liu, PhD
Average Clustering: the Average Linkage Method uses the average distance between clusters
Distance matrix for 5 objects:
       1    2    3    4    5
  1    0
  2    9    0
  3    3    7    0
  4    6    5    9    0
  5   11   10    2    8    0

37 Hierarchical Clustering
Average Linkage Method
Step 1: 5 clusters: {1},{2},{3},{4},{5}
Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
Step 3: Find the average distances among {35},{1},{2},{4}:
d{35}1 = (d31 + d51)/(2×1) = (3 + 11)/2 = 7
d{35}2 = (d32 + d52)/(2×1) = (7 + 10)/2 = 8.5
d{35}4 = (d34 + d54)/(2×1) = (9 + 8)/2 = 8.5

38 Hierarchical Clustering
Average Linkage Method: update the distance matrix
        {35}    1    2    4
  {35}    0
  1       7    0
  2     8.5    9    0
  4     8.5    6    5    0

39 Hierarchical Clustering
Average Linkage Method
Step 4: The minimum distance is 5, between {2} and {4}; merge {2} and {4} into {24}
Step 5: Find the average distances:
d{24}{35} = (d23 + d25 + d43 + d45)/(2×2) = (7 + 10 + 9 + 8)/4 = 8.5
d{24}1 = (d21 + d41)/(2×1) = (9 + 6)/2 = 7.5

40 Hierarchical Clustering
Average Linkage Method: update the distance matrix
        {35}  {24}    1
  {35}    0
  {24}  8.5    0
  1       7   7.5    0
The minimum distance is 7, between {1} and {35}; merge {1} and {35} into {135}

41 Hierarchical Clustering
Average Linkage Method
Find the average distance between {24} and {135}:
d{24}{135} = (d12 + d14 + d32 + d34 + d52 + d54)/(3×2) = (9 + 6 + 7 + 9 + 10 + 8)/6 = 8.17
Update the distance matrix
         {135}  {24}
  {135}    0
  {24}   8.17    0

42 Hierarchical Clustering
Average Linkage Method: merge history
Distance   Clusters
2          {35},{1},{2},{4}
5          {35},{1},{24}
7          {135},{24}
8.17       {12345}
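The average-linkage steps can also be verified with SciPy on the same distance matrix (a sketch, assuming scipy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Same 5-object distance matrix as in the single- and complete-linkage examples
D = np.array([[0, 9, 3, 6, 11],
              [9, 0, 7, 5, 10],
              [3, 7, 0, 9, 2],
              [6, 5, 9, 0, 8],
              [11, 10, 2, 8, 0]], dtype=float)

Z = linkage(squareform(D), method='average')
print(Z[:, 2])  # merge heights 2, 5, 7, 49/6 ≈ 8.17, matching the table above
```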

43 Hierarchical Clustering

44 Hierarchical Clustering
Example from Manly (2005): distance matrix of 5 objects (shown on the slide)

45 Hierarchical Clustering
Single Linkage Method: merge history
Distance   Clusters
2          {12},{3},{4},{5}
3          {12},{3},{45}
4          {12},{345}
5          {12345}
The same results are obtained with the complete and average linkage methods

46 Copyright by Jen-pei Liu, PhD

47 Hierarchical Clustering
Example: canine groups by single linkage clustering
Distance   Clusters                          # of clusters
0.72       {MD,PD},GJ,CW,IW,CU,DI            6
1.38       {MD,PD,CU},GJ,CW,IW,DI            5
1.68       {MD,PD,CU,DI},GJ,CW,IW            4
2.07       {MD,PD,CU,DI,GJ},CW,IW            3
2.31       {MD,PD,CU,DI,GJ},{CW,IW}          2
2.37       {MD,PD,CU,DI,GJ,CW,IW}            1

48 Copyright by Jen-pei Liu, PhD

49 Results of single linkage method for European employment data

50 Hierarchical Clustering
Centroid (Center or Average) Method
- Start with each object being its own cluster
- Merge the two clusters with the shortest distance
- Compute the centroid of the new cluster as the average of all variables over its members, and update the distance matrix using the cluster centroids
- Repeat the above steps until all objects form one cluster

51 Copyright by Jen-pei Liu, PhD
Introduction
Example: test scores of six students on Chinese (X1) and Math (X2) (table shown on the slide)

52 Hierarchical Clustering
Centroid Method: Euclidean distance matrix of the 6 students (shown on the slide)

53 Hierarchical Clustering
Centroid Method
- The shortest distance is between student {1} and student {4}; merge {1} and {4} into {14}
- Compute the averages for Chinese and Math:
  Average Chinese = (85 + 90)/2 = 87.5
  Average Math = (82 + 95)/2 = 88.5

54 Hierarchical Clustering
Centroid Method: update the Euclidean distance matrix (shown on the slide)

55 Hierarchical Clustering
Centroid Method
- The shortest distance is between {2} and {5}; merge {2} and {5} into {25}
- The average Chinese score of {25} is 32.5; the average Math score of {25} is 31.0

56 Hierarchical Clustering
Centroid Method: update the Euclidean distance matrix over {14}, {25}, {3}, {6} (shown on the slide)

57 Hierarchical Clustering
Centroid Method
- The shortest distance is between {3} and {6}; merge {3} and {6} into {36}
- The average Chinese score of {36} is 62.5; the average Math score of {36} is 62.5

58 Hierarchical Clustering
Centroid Method: update the Euclidean distance matrix over {14}, {25}, {36} (shown on the slide)

59 Hierarchical Clustering
Centroid Method
- The shortest distance is between {14} and {36}; merge {14} and {36} into {1346}
- Cluster means:
  Cluster   Chinese   Math
  {25}      32.5      31.0
  {1346}    75.0      75.5

60 Hierarchical Clustering
Centroid Method
The distance between {25} and {1346} is 61.53
Distance   Clusters
13.93      {14},{2},{3},{5},{6}
15.13      {14},{25},{3},{6}
15.81      {14},{25},{36}
36.07      {1346},{25}
61.53      {123456}
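The final centroid-method distance can be checked from the cluster means: {25} has mean scores (32.5, 31.0), and the {1346} means (75.0, 75.5) follow from averaging the {14} centroid (87.5, 88.5) and the {36} centroid (62.5, 62.5), each cluster having two members.

```python
import numpy as np

c25 = np.array([32.5, 31.0])    # centroid of {25}
c1346 = np.array([75.0, 75.5])  # centroid of {1346}
d = float(np.linalg.norm(c25 - c1346))  # Euclidean distance between centroids
print(round(d, 2))  # 61.53, matching the final merge distance above
```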

61 Hierarchical Clustering

62 Hierarchical Clustering
Application to gene expression data from microarray experiments
- Number of genes >> number of subjects
- Clustering in two directions: clusters of subjects (patients); clusters of genes

63 Copyright by Jen-pei Liu, PhD

64 Copyright by Jen-pei Liu, PhD

65 Hierarchical Clustering

66 Hierarchical Clustering

67 Hierarchical Clustering

68 Hierarchical Clustering
- The complexity of a bottom-up (agglomerative) method varies between O(n²) and O(n³), depending on the linkage chosen
- The complexity of a top-down (divisive) method varies between O(n log n) and O(n²), depending on the linkage chosen

69 Hierarchical Clustering
Determination of the number of clusters
Criteria:
- Root-mean-square total-sample standard deviation (RMSSTD)
- Semipartial R-square (SPRSQ)
- R-square (RSQ)
- Minimum distance (MD)

70 Copyright by Jen-pei Liu, PhD

71 Hierarchical Clustering
Determination of the number of clusters
Example: test scores of the 6 students (table of RMSSTD, SPRSQ, RSQ, and MD by number of clusters shown on the slide)
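Two of these criteria can be sketched for a single partition. This is a hedged illustration on made-up 2-D scores, not the slide's actual table; `rsq_and_rmsstd` is a hypothetical helper. RSQ = 1 − (pooled within-cluster SS)/(total SS); RMSSTD = sqrt(pooled within-cluster SS / pooled degrees of freedom).

```python
import numpy as np

def rsq_and_rmsstd(X, labels):
    """RSQ and RMSSTD for one partition of the rows of X."""
    X = np.asarray(X, float)
    n, p = X.shape
    total_ss = np.sum((X - X.mean(axis=0)) ** 2)  # total sum of squares
    within_ss, df = 0.0, 0
    for g in np.unique(labels):
        Xg = X[labels == g]
        within_ss += np.sum((Xg - Xg.mean(axis=0)) ** 2)  # within-cluster SS
        df += p * (len(Xg) - 1)                           # pooled df
    rsq = 1.0 - within_ss / total_ss
    rmsstd = np.sqrt(within_ss / df) if df > 0 else 0.0
    return float(rsq), float(rmsstd)

# Hypothetical 2-D scores forming two tight, well-separated clusters
X = np.array([[85, 82], [90, 95], [30, 28], [35, 34]], float)
labels = np.array([0, 0, 1, 1])
rsq, rmsstd = rsq_and_rmsstd(X, labels)
print(rsq > 0.9)  # True: the two-cluster split explains most of the variation
```

A small RMSSTD and a large RSQ both indicate compact, well-separated clusters, which is why the criteria are tracked across candidate numbers of clusters.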

72 Copyright by Jen-pei Liu, PhD
K-means Clustering
Step 1: Select the number of clusters K, and choose a distance measure such as Euclidean distance or 1 − Pearson correlation coefficient
Step 2: Divide the n objects into K clusters, either randomly or based on a preliminary hierarchical clustering
Step 3: Compute the centroid of each cluster and calculate the distance of each object to every cluster centroid

73 Copyright by Jen-pei Liu, PhD
K-means Clustering
Step 4: For each object, find the minimum distance and reallocate the object to the corresponding cluster
Step 5: Update the clusters and their centroids
Step 6: Repeat Steps 3-5 until no reallocation of objects among clusters occurs
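The steps above can be sketched as a minimal K-means implementation. This is an illustration on made-up 2-D data, not the lecture's code; `kmeans` is a hypothetical helper that seeds the centroids with K randomly chosen objects.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal Lloyd-style sketch of Steps 1-6."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    # Step 2 (variant): seed centroids with K distinct objects
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Step 3: distance of every object to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new = d.argmin(axis=1)                 # Step 4: nearest centroid
        if np.array_equal(new, labels):
            break                              # Step 6: no reallocation
        labels = new
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(K)])  # Step 5: update centroids
    return labels, centroids

# Two obvious clusters of two points each
X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9.5]], float)
labels, _ = kmeans(X, K=2)
print(labels[0] == labels[1], labels[2] == labels[3])  # True True
```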

74 Copyright by Jen-pei Liu, PhD
K-means Clustering

75 Copyright by Jen-pei Liu, PhD
K-means Clustering

76 Copyright by Jen-pei Liu, PhD
K-means Clustering

77 Copyright by Jen-pei Liu, PhD
K-means Clustering

78 Copyright by Jen-pei Liu, PhD
K-means Clustering
The number of computations can be written as c·p, where c depends on the number of iterations and p is the number of variables (e.g., the number of genes)

79 Copyright by Jen-pei Liu, PhD
K-means Clustering
- The number of clusters is selected to maximize the between-cluster sum of squares (variation) and minimize the within-cluster sum of squares (variation)
- The best-of-10 partition: apply the K-means method 10 times with 10 different randomly chosen sets of initial clusters, and choose the result that minimizes the within-cluster sum of squares
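The best-of-10 idea can be sketched directly: run K-means from 10 random starts and keep the partition with the smallest within-cluster sum of squares. The helpers and data below are hypothetical illustrations, not the lecture's code.

```python
import numpy as np

def within_ss(X, labels, K):
    """Pooled within-cluster sum of squares for one partition."""
    return float(sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                     for k in range(K) if np.any(labels == k)))

def one_kmeans(X, K, rng, n_iter=100):
    """One Lloyd-style run seeded with K randomly chosen objects."""
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new = d.argmin(axis=1)
        if np.array_equal(new, labels):
            break  # no reallocation: converged
        labels = new
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return labels

rng = np.random.default_rng(0)
X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9.5]], float)
# Best-of-10: 10 random starts, keep the partition with minimal within-cluster SS
runs = [one_kmeans(X, 2, rng) for _ in range(10)]
best = min(runs, key=lambda lab: within_ss(X, lab, 2))
print(within_ss(X, best, 2))  # 2.25 for this toy data
```

Multiple starts guard against a poor local optimum from an unlucky initial partition; on this toy data every start converges to the same two-cluster split.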

80 Issues and Limitations
Issues and Limitations
- With considerable overlap between the initial groups, cluster analysis may produce a result quite different from the true situation
- Different approaches may give different results
- The dendrogram itself is almost never the answer to the research question
- Hierarchical diagrams convey information only in their topology

81 Copyright by Jen-pei Liu, PhD

82 Issues and Limitations
Issues and Limitations
Cluster shape can create difficulty for cluster analysis (panels shown on the slide):
- (a) and (b): recovered by any reasonable algorithm
- (c): some methods fail because of overlapping points
- (d), (e), and (f): great challenges for most clustering algorithms

83 Issues and Limitations
Issues and Limitations
- Anything can be clustered
- Clustering algorithms applied to the same data may produce different results
- Dendrograms are often read while ignoring the magnitudes of the distance measures
- The positions of the patterns within the clusters do not reflect their relationships in the input space

84 Copyright by Jen-pei Liu, PhD

85 Copyright by Jen-pei Liu, PhD

86 Copyright by Jen-pei Liu, PhD

87 Copyright by Jen-pei Liu, PhD
Summary
- Goals
- Methods
- Hierarchical methods: single, complete, average, centroid
- K-means clustering
- Limitations

