Presentation on theme: "Basic techniques for cluster detection"— Presentation transcript:
Slide 1 (Chapter Four): Basic Techniques for Cluster Detection
Slide 2: Chapter Overview
- The problem of cluster detection
- Measuring proximity between data objects
- The K-means cluster detection method
- The agglomeration cluster detection method
- Performance issues of the basic methods
- Cluster evaluation and interpretation
- Undertaking a clustering task in Weka
Slide 3: Problem of Cluster Detection
What is cluster detection?
- A cluster is a group of objects known as members; the centre of a cluster is known as the centroid.
- Members of a cluster are similar to each other; members of different clusters are different.
- Clustering is a process of discovering clusters and their centroids.
Slide 4: Problem of Cluster Detection
Outputs of the cluster detection process:
- An assigned cluster tag for each member of a cluster
- A cluster summary: size, centroid, variations, etc.
Example:
- Cluster 1: size 6; centroid (154, 90); variation: bodyHeight = 5.16, bodyWeight = 5.32
- Cluster 2: size 5; centroid (130, 51); variation: bodyHeight = 10, bodyWeight = 14.48
Slide 5: Problem of Cluster Detection
Basic elements of a clustering solution:
- A sensible measure of similarity, e.g. Euclidean distance
- An effective and efficient clustering algorithm, e.g. K-means
- A goodness-of-fit function for evaluating the quality of the resulting clusters, e.g. SSE
[Figure: two candidate clusterings compared on internal variation and inter-cluster distance: good or bad?]
Slide 6: Problem of Cluster Detection
Requirements for clustering solutions:
- Scalability
- Ability to deal with different types of attributes
- Ability to discover clusters of arbitrary shapes
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input data records
- Ability to deal with high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
Slide 7: Measures of Proximity: Basics
- Proximity between two data objects is represented by either similarity or dissimilarity.
- Similarity is a numeric measure of the degree of alikeness between two objects; dissimilarity is a numeric measure of their degree of difference.
- Similarity and dissimilarity measures are often interconvertible; normally dissimilarity is preferred.
- Measuring dissimilarity involves two steps: measuring the difference between values of the corresponding attributes, then combining those differences into an overall measure.
Slide 8: Measures of Proximity: Distance Functions
Metric properties of a distance function d:
- d(x, y) ≥ 0 and d(x, x) = 0, for all data objects x and y (non-negativity)
- d(x, y) = d(y, x), for all data objects x and y (symmetry)
- d(x, y) ≤ d(x, z) + d(z, y), for all data objects x, y and z (triangle inequality)
The difference between values of a single attribute depends directly on the domain type of the attribute, so it is important to consider which operations are applicable to that type. Some measure is better than no measure at all.
Slide 9: Measures of Proximity: Difference between Attribute Values
- Difference between nominal values: if two names are the same, the difference is 0; otherwise it is the maximum, e.g. diff("John", "John") = 0, diff("John", "Mary") = 1 (the maximum).
- The same applies to binary values, e.g. diff(Yes, No) = 1.
- Difference between ordinal values: different degrees of proximity can be compared, e.g. diff(A, B) < diff(A, D). Converting ordinal values to consecutive integers, e.g. A: 5, B: 4, C: 3, D: 2, E: 1, gives A − B = 1 and A − D = 3.
- Interval and ratio attributes use a distance measure over the numeric values.
- Difference between values that may be unknown: diff(NULL, v) = |v|, diff(NULL, NULL) = 1 (the maximum, since nothing is known about either value).
A sketch of these per-attribute differences follows below.
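To make the conventions above concrete, here is a minimal Python sketch. It assumes values are normalised so that the maximum possible difference is 1; the scaling of ordinal ranks into [0, 1] and the diff(NULL, NULL) = 1 convention are illustrative assumptions rather than slide content.

```python
# Sketch of per-attribute difference measures, assuming values are
# normalised so that the maximum possible difference is 1.

def diff_nominal(a, b):
    """Nominal (and binary) values: 0 if identical, else the maximum (1)."""
    return 0.0 if a == b else 1.0

# Ordinal values mapped to consecutive integers, as on the slide.
GRADE_RANK = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

def diff_ordinal(a, b, rank=GRADE_RANK):
    """Ordinal values: difference of ranks, scaled into [0, 1]."""
    span = max(rank.values()) - min(rank.values())
    return abs(rank[a] - rank[b]) / span

def diff_interval(a, b):
    """Interval/ratio values, with the slide's convention for unknowns:
    diff(NULL, v) = |v|."""
    if a is None and b is None:
        return 1.0        # assumption: maximum difference when both unknown
    if a is None:
        return abs(b)
    if b is None:
        return abs(a)
    return abs(a - b)

print(diff_nominal("John", "Mary"))   # 1.0
print(diff_ordinal("A", "D"))         # 0.75 (rank difference 3 over span 4)
```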
Slide 10: Measures of Proximity: Distance between Data Objects
Ratio of mismatched features for nominal attributes: given two data objects i and j described by p nominal attributes, let m be the number of attributes where the values of the two objects match. Then
d(i, j) = (p − m) / p
e.g. two objects agreeing on 1 of 3 attributes have d = (3 − 1) / 3 ≈ 0.67 (see the sketch below).
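A minimal sketch of the mismatch ratio, assuming objects are given as tuples of nominal values; the example attributes are made up for illustration:

```python
# Ratio of mismatched features for two objects described by p nominal
# attributes: d(i, j) = (p - m) / p, where m is the number of matches.

def mismatch_ratio(i, j):
    assert len(i) == len(j)
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

# e.g. two customers described by (gender, region, segment)
print(mismatch_ratio(("F", "North", "Gold"),
                     ("F", "South", "Gold")))   # 1/3 ~ 0.333
```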
Slide 11: Measures of Proximity: Distance between Data Objects
Minkowski function for interval/ratio attributes:
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
Special cases:
- Manhattan distance (q = 1)
- Euclidean distance (q = 2)
- Supremum/Chebyshev distance (q → ∞)
A sketch of all three cases follows below.
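The three special cases can be checked with a few lines of Python; this is an illustrative sketch, not any particular library's implementation:

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two numeric vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def chebyshev(x, y):
    """Limit case q -> infinity: the largest single-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1.0, 2.0), (4.0, 6.0)
print(minkowski(x, y, 1))   # Manhattan: 7.0
print(minkowski(x, y, 2))   # Euclidean: 5.0
print(chebyshev(x, y))      # Supremum/Chebyshev: 4.0
```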
Slide 12: Measures of Proximity: Distance between Data Objects
[Figure: Minkowski function example: Manhattan, Euclidean and Chebyshev distances illustrated on a plot of Revenue (200-1000) against Tenure/No. of Trans (10-50)]
Slide 13: Measures of Proximity: Distance between Data Objects
For binary attributes, given two data objects i and j of p binary attributes:
- f00: the number of attributes where i is 0 and j is 0
- f01: the number of attributes where i is 0 and j is 1
- f10: the number of attributes where i is 1 and j is 0
- f11: the number of attributes where i is 1 and j is 1
Simple mismatch coefficient (SMC) for symmetric values:
d(i, j) = (f01 + f10) / (f00 + f01 + f10 + f11)
The Jaccard coefficient (JC), defined for asymmetric values, ignores the 0-0 matches:
d(i, j) = (f01 + f10) / (f01 + f10 + f11)
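A small sketch of both coefficients as distances, consistent with the formulas above; the example vectors are made up to mimic the mostly-zero case discussed on the next slide:

```python
def binary_counts(i, j):
    """Count the four agreement/disagreement cases over p binary attributes."""
    f00 = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    f01 = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    f11 = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    return f00, f01, f10, f11

def smc_distance(i, j):
    """Simple mismatch coefficient: mismatches over all attributes."""
    f00, f01, f10, f11 = binary_counts(i, j)
    return (f01 + f10) / (f00 + f01 + f10 + f11)

def jaccard_distance(i, j):
    """Jaccard distance: ignores the 0-0 matches (asymmetric attributes)."""
    f00, f01, f10, f11 = binary_counts(i, j)
    return (f01 + f10) / (f01 + f10 + f11)

# Mostly-zero vectors: SMC says "similar", Jaccard says "quite different".
i = (1, 0, 0, 0, 0, 0, 0, 0)
j = (0, 1, 0, 0, 0, 0, 0, 0)
print(smc_distance(i, j))      # 0.25
print(jaccard_distance(i, j))  # 1.0
```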
Slide 14: Measures of Proximity: Distance between Data Objects
For binary attributes (example):
- First pair: SMC not that different, but JC very different: a two-word (out of 3) difference.
- Second pair: SMC very similar, but JC still quite different: a one-word (out of 2) difference.
Slide 15: Measures of Proximity: Similarity between Data Objects
Cosine similarity function:
- Treats two data objects as vectors.
- Similarity is measured through the angle θ between the two vectors: similarity is 1 when θ = 0° and 0 when θ = 90°.
- Similarity function: cos(x, y) = (x · y) / (||x|| ||y||)
Slide 16: Measures of Proximity: Similarity between Data Objects
Cosine similarity function (illustrated). Given two data objects x = (3, 2, 0, 5) and y = (1, 0, 0, 0):
- x · y = 3×1 + 2×0 + 0×0 + 5×0 = 3
- ||x|| = sqrt(3² + 2² + 0² + 5²) = sqrt(38) ≈ 6.16
- ||y|| = sqrt(1² + 0² + 0² + 0²) = 1
Then the similarity between x and y is cos(x, y) = 3 / (6.16 × 1) ≈ 0.49, and the dissimilarity between x and y is 1 − cos(x, y) ≈ 0.51.
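The same worked example in a few lines of Python, as a quick check of the arithmetic:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (3, 2, 0, 5)
y = (1, 0, 0, 0)
sim = cosine_similarity(x, y)
print(round(sim, 2))        # 0.49
print(round(1 - sim, 2))    # 0.51 (dissimilarity)
```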
Slide 17: Measures of Proximity: Distance between Data Objects
Combining heterogeneous attributes, based on the principle of the ratio of mismatched features:
- For the kth attribute, compute the dissimilarity d_k in [0, 1].
- Set the indicator variable δ_k as follows: δ_k = 0 if the kth attribute is an asymmetric binary attribute and both objects have value 0 for the attribute; δ_k = 1 otherwise.
- Compute the overall distance between i and j as:
d(i, j) = (Σ_k δ_k d_k) / (Σ_k δ_k)
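A minimal sketch of the combination step, assuming the per-attribute dissimilarities d_k and indicators δ_k have already been computed; the example values are made up:

```python
# Combined distance over mixed attribute types, following the slide's
# indicator scheme: delta_k = 0 when the kth attribute is asymmetric
# binary and both objects are 0, else delta_k = 1; d_k is the per-attribute
# dissimilarity already scaled into [0, 1].

def combined_distance(d, delta):
    """d and delta are parallel lists: per-attribute dissimilarities in
    [0, 1] and their 0/1 indicator variables."""
    weight = sum(delta)
    if weight == 0:
        return 0.0   # assumption: no informative attributes -> no distance
    return sum(dk * dek for dk, dek in zip(d, delta)) / weight

# e.g. three attributes: one nominal mismatch (d = 1), one asymmetric
# binary 0-0 pair (excluded), one scaled numeric difference of 0.2
print(combined_distance([1.0, 0.0, 0.2], [1, 0, 1]))   # 0.6
```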
Slide 18: Measures of Proximity: Distance between Data Objects
Attribute scaling is needed when values are not directly comparable:
- on the same attribute, when data from different data sources are merged
- on different attributes, when data objects are projected into the N-space
Normalising variables into comparable ranges:
- divide each value by the mean
- divide each value by the range
- z-score
Attribute weighting uses a weighted overall dissimilarity function, e.g.
d(i, j) = Σ_k w_k d_k, where the weights w_k reflect the importance of each attribute.
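A short sketch of z-score normalisation and attribute weighting; the convention that the weights sum to 1 is an illustrative assumption, since the slide's exact weighted formula is not shown:

```python
def z_scores(values):
    """Normalise one attribute's values to zero mean and unit spread."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def weighted_dissimilarity(d, w):
    """Weighted combination of per-attribute dissimilarities d_k,
    assuming the weights w_k sum to 1."""
    return sum(dk * wk for dk, wk in zip(d, w))

print(z_scores([170, 180, 190]))                       # [-1.22..., 0.0, 1.22...]
print(weighted_dissimilarity([0.5, 0.1], [0.8, 0.2]))  # 0.42
```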
Slide 19: K-means, a Basic Clustering Method
Outline of main steps (sketched in code below):
1. Define the number of clusters (k).
2. Choose k data objects at random to serve as the initial centroids for the k clusters.
3. Assign each data object to the cluster represented by its nearest centroid.
4. Find a new centroid for each cluster by calculating the mean vector of its members.
5. Release the current memberships of all data objects, go back to step 3, and repeat the process until cluster membership no longer changes or a maximum number of iterations is reached.
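A minimal, self-contained K-means sketch following these five steps; the use of squared Euclidean distance and a fixed random seed for the initial centroids are illustrative choices:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch. points: list of numeric tuples;
    returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # step 2: random initial centroids
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign each object to its nearest centroid.
        new_labels = [min(range(k),
                          key=lambda c: sum((p - q) ** 2
                                            for p, q in zip(pt, centroids[c])))
                      for pt in points]
        if new_labels == labels:             # memberships no longer change
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean vector of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, labels

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = kmeans(data, k=2)
print(centroids)   # roughly [(1.33, 1.33), (8.33, 8.33)] in some order
print(labels)
```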
Slide 20: K-means, a Basic Clustering Method
[Figure: illustration of the method]
Slide 21: K-means, a Basic Clustering Method
Strengths:
- Simple and easy to implement
- Quite efficient
Weaknesses:
- The value of k must be specified, but the right value may not be known beforehand.
- Sensitive to the choice of the initial k centroids: the result can be non-deterministic.
- Sensitive to noise.
- Applicable only when the mean is meaningful for the given data set.
Slide 22: K-means, a Basic Clustering Method
Overcoming the weaknesses:
- Use cluster quality to determine the value of k.
- Improve how the initial k centroids are chosen: run the clustering a number of times and select the result with the highest quality; use hierarchical clustering to locate the centres; find centres that are farther apart.
- Deal with noise: remove outliers before clustering, or use the K-medoid method, which takes the nearest data object to the virtual centre as the centroid.
- When the mean cannot be defined: use the K-mode method, calculating the mode instead of the mean for the centre of the cluster.
Slide 23: K-means, a Basic Clustering Method
Value of k and cluster quality:
[Figure: scree plot of cluster errors (e.g. SSE) against the number of clusters; the point where the error curve flattens suggests a suitable k]
Slide 24: K-means, a Basic Clustering Method
Choosing the initial k centroids:
- Run the clustering many times (only trial and error).
- Use hierarchical clustering to locate the centres (why partition-based, then?).
- Find centres that are farther apart.
Slide 25: K-means, a Basic Clustering Method
- K-medoid
- Bisecting K-means
Slide 26: The Agglomeration Method
Outline of main steps (sketched in code below):
1. Take all n data objects as individual clusters and build an n × n dissimilarity matrix, which stores the distance between every pair of data objects.
2. While the number of clusters > 1:
   a. Find the pair of data objects/clusters with the minimum distance.
   b. Merge the two data objects/clusters into a bigger cluster.
   c. Replace the matrix entries for the original clusters or objects with the cluster tag of the newly formed cluster.
   d. Re-calculate the relevant distances and update the matrix.
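A minimal single-link sketch of these steps; for clarity it recomputes pairwise distances on each pass instead of maintaining the dissimilarity matrix, so it illustrates the logic rather than the efficient matrix-based procedure:

```python
# Minimal agglomeration sketch (single link). Clusters are sets of point
# indices; on each pass the two closest clusters are merged, as in the
# steps above. This is an O(n^3) illustration, not an efficient version.

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(points, target_clusters=1):
    clusters = [{i} for i in range(len(points))]
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the minimum single-link distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]      # merge cluster b into cluster a
        del clusters[b]
    return clusters

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(single_link(data, target_clusters=2))
# e.g. [{0, 1, 2}, {3, 4, 5}]
```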
Slide 27: The Agglomeration Method
[Figure: illustration of the method]
Slide 28: The Agglomeration Method
[Figure: illustration of the method as a dendrogram; cutting the tree at different heights yields from 1 to 10 clusters]
Slide 29: The Agglomeration Method
Agglomeration schemes (definitions of inter-cluster distance):
- Single link: the distance between the two closest points
- Complete link: the distance between the two farthest points
- Group average: the average of all pair-wise distances
- Centroids: the distance between the centroids
Slide 30: The Agglomeration Method
Strengths:
- Deterministic results
- Multiple possible versions of the clustering from a single run
- No need to specify a value of k beforehand
- Can create clusters of arbitrary shapes (single link)
Weaknesses:
- Does not scale up to large data sets
- Cannot undo a membership decision once a merge is made, unlike K-means
- Problems with the agglomeration schemes (see Chapter 5)
Slide 31: Cluster Evaluation & Interpretation
Cluster quality principle: high similarity (low variation) within a cluster, and high dissimilarity between clusters.
The measures, with C_k denoting cluster k and r_k its centroid (sketched in code below):
- Cohesion: the sum of squared errors of a cluster, SSE(C_k) = Σ_{x ∈ C_k} d(x, r_k)², and the sum of the SSEs over all clusters, WC = Σ_k SSE(C_k).
- Separation: the sum of distances between cluster centroids, BC = Σ_{j < k} d(r_j, r_k)².
Combining cohesion and separation, the ratio BC/WC is a good indicator of overall quality.
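A short sketch of SSE, WC and BC as defined above, assuming clusters are lists of numeric tuples and distances are Euclidean:

```python
def centroid(cluster):
    """Mean vector of a cluster's members."""
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sse(cluster):
    """Sum of squared errors of one cluster about its centroid."""
    r = centroid(cluster)
    return sum(sq_dist(x, r) for x in cluster)

def wc(clusters):
    """Within-cluster cohesion: the sum of the clusters' SSEs."""
    return sum(sse(c) for c in clusters)

def bc(clusters):
    """Between-cluster separation: sum of squared centroid distances."""
    cents = [centroid(c) for c in clusters]
    return sum(sq_dist(cents[i], cents[j])
               for i in range(len(cents)) for j in range(i + 1, len(cents)))

clusters = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
print(bc(clusters) / wc(clusters))   # a large ratio indicates good quality
```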
Slide 32: Cluster Evaluation & Interpretation
[Figure: cluster quality illustrated with two clusters; C1 is a better quality cluster than C2]
Slide 33: Cluster Evaluation & Interpretation
Using cluster quality for clustering (a K-means sketch follows below):
With K-means:
- Add an outer loop over different values of k (from low to high).
- At each iteration, conduct K-means clustering using the current k.
- Measure the overall cluster quality and decide whether the resulting quality is acceptable.
- If not, increase the value of k by 1 and repeat the process.
With agglomeration:
- Traverse the hierarchy level by level from the root.
- At each level, evaluate the overall quality of the clusters.
- If the quality is acceptable, take the clusters at that level as the final result; if not, move to the next level and repeat the process.
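A sketch of the outer loop over k, assuming the kmeans() and wc() sketches from the earlier slides are available in the same session; the acceptance threshold is a made-up user parameter:

```python
def choose_k(points, k_max, threshold):
    """Increase k until the within-cluster error WC drops to an
    acceptable level; returns the chosen k and its clusters."""
    for k in range(2, k_max + 1):
        _, labels = kmeans(points, k)
        clusters = [[p for p, lab in zip(points, labels) if lab == c]
                    for c in range(k)]
        clusters = [c for c in clusters if c]    # drop any empty cluster
        if wc(clusters) <= threshold:
            return k, clusters
    return k_max, clusters

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
k, clusters = choose_k(data, k_max=4, threshold=3.0)
print(k)   # 2 for this toy data
```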
Slide 34: Cluster Evaluation & Interpretation
Cluster tendency: do clusters really exist in the data at all?
Measures of tendency:
- Quality measure: when BC and WC are similar, clusters are unlikely to exist.
- The Hopkins statistic, defined over:
  - P: a set of n randomly generated data points
  - S: a sample of n data points from the data set
  - t_p: the distance from each point p in P to its nearest neighbour in the data set
  - t_m: the distance from each point m in S to its nearest neighbour in the data set
  H = Σ_m t_m / (Σ_p t_p + Σ_m t_m); a value around 0.5 suggests no clustering tendency, while a value close to 0 suggests that clusters exist.
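A sketch of the Hopkins statistic under the reconstruction above; the bounding-box sampling of the random points and the exclusion of a point from its own neighbour search are assumptions, and the code assumes the data points are distinct:

```python
import random

def nn_distance(point, others):
    """Euclidean distance to the nearest neighbour, excluding the point
    itself (assumes distinct points)."""
    return min(sum((a - b) ** 2 for a, b in zip(point, o)) ** 0.5
               for o in others if o != point)

def hopkins(data, n, seed=0):
    rng = random.Random(seed)
    dims = list(zip(*data))
    lows, highs = [min(d) for d in dims], [max(d) for d in dims]
    # P: n points generated uniformly over the data's bounding box
    randoms = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
               for _ in range(n)]
    sample = rng.sample(data, n)        # S: n points sampled from the data
    t_p = sum(nn_distance(p, data) for p in randoms)
    t_m = sum(nn_distance(m, data) for m in sample)
    return t_m / (t_p + t_m)

clustered = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(hopkins(clustered, n=3))   # typically below 0.5 for clustered data
```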
Slide 35: Cluster Evaluation & Interpretation
Cluster interpretation:
- Within a cluster: how the values of the clustering attributes are distributed; how the values of supplementary attributes are distributed.
- Outside the cluster: exceptions and anomalies.
- Between clusters: a comparative view.
[Figure: value distributions for the cluster compared against value distributions for the population]
Slide 36: K-means & Agglomeration in Weka
Clustering in Weka: Preprocess page
- Specify "No Class"
- Specify all attributes for clustering
Slide 37: K-means & Agglomeration in Weka
Clustering in Weka: Cluster page
1. Choose a clustering solution.
2. Set its parameters.
3. Execute the chosen solution.
4. Observe the results.
5. Select "Visualise Cluster Assignment".
Slide 38: K-means & Agglomeration in Weka
Clustering in Weka: SimpleKMeans
- Specify the distance function used
- Specify the value of K
- Specify the maximum number of iterations
- Specify the random seed affecting the initial random selection of the K centroids
Slide 39: K-means & Agglomeration in Weka
Clustering in Weka: SimpleKMeans
- Visualise the cluster membership
- Save the membership into a file
Slide 40: K-means & Agglomeration in Weka
Clustering in Weka: Agglomeration
- Select Cobweb
- The result is displayed as a tree-shaped dendrogram
Slide 41: Chapter Summary
- A clustering solution must provide a sensible proximity function, an effective algorithm and a cluster evaluation function.
- Proximity is normally measured by a distance function that combines measures of value differences over the attributes.
- The K-means method continues to refine prototype partitions until membership changes no longer occur.
- The agglomeration method successively groups individual data objects into a hierarchy of clusters.
- Good clustering results mean high similarity among members of a cluster and low similarity between members of different clusters.
- The normal procedure for clustering in Weka has been explained.
Slide 42: References
Read Chapter 4 of Data Mining Techniques and Applications.
Useful further reference:
- Tan, P.-N., Steinbach, M. and Kumar, V. (2006), Introduction to Data Mining, Addison-Wesley, Chapters 2 (Section 2.4) and 8.