1
Clustering I: Data Mining, Soongsil University
2
What is clustering ?
3
What is a natural grouping among these objects?
4
What is a natural grouping among these objects?
Clustering is subjective
5
What is Similarity? The quality or state of being similar; likeness, resemblance, as in a similarity of features. Similarity is hard to define, but we know it when we see it. The real meaning of similarity is a philosophical question; we will take a more pragmatic approach.
6
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1,O2)
7
Unsupervised Learning: Clustering
(Diagram: unseen data is fed into a black-box clustering procedure.)
8
(Figure: a 2-dimensional clustering with axes Age and Income, showing three data clusters.)
9
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups. Inter-cluster distances are maximized; intra-cluster distances are minimized.
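To make the two criteria concrete, here is a minimal Python sketch; the helper names and toy points are illustrative only, measuring intra-cluster distance as the mean pairwise distance within a cluster and inter-cluster distance as the distance between cluster centroids:

```python
import numpy as np

def intra_cluster_distance(points):
    """Mean pairwise distance between points in one cluster (lower is better)."""
    dists = [np.linalg.norm(a - b) for i, a in enumerate(points) for b in points[i + 1:]]
    return float(np.mean(dists)) if dists else 0.0

def inter_cluster_distance(cluster_a, cluster_b):
    """Distance between the centroids of two clusters (higher is better)."""
    return float(np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0)))

# Toy example: two well-separated 2-D clusters
a = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
b = np.array([[8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
print(intra_cluster_distance(a))      # small
print(inter_cluster_distance(a, b))   # large
```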
10
What Is A Good Clustering?
High intra-class similarity and low inter-class similarity, which depend on the similarity measure used. The ability to discover some or all of the hidden patterns.
11
Requirements of Clustering
Scalability
Ability to deal with various types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
12
Requirements of Clustering
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to handle high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
13
A technique demanded by many real-world tasks
Biology: taxonomy of living things such as kingdom, phylum, class, order, family, genus, and species
Information retrieval: document and multimedia data clustering
Land use: identification of areas of similar land use in an earth observation database
Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
City planning: identify groups of houses according to their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Climate: understand the earth's climate; find patterns in atmospheric and ocean data
Social network mining: special-interest group discovery
15
Data Matrix (for memory-based clustering)
Also called an object-by-variable structure: represents n objects with p variables (attributes, measures); essentially a relational table.
16
Dissimilarity Matrix (for memory-based clustering)
Also called an object-by-object structure: stores the proximities of pairs of objects, where d(i,j) is the dissimilarity between objects i and j. Values are nonnegative, and values close to 0 indicate similar objects.
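For illustration, a minimal sketch (the function name and the toy age/income records are mine, not from the slides) that builds such an object-by-object matrix from an n x p data matrix using Euclidean distance:

```python
import numpy as np

def dissimilarity_matrix(X):
    """Build the n x n object-by-object matrix of pairwise Euclidean distances.

    X is the n x p data (object-by-variable) matrix; d[i, j] = 0 on the diagonal,
    values close to 0 mean similar objects, and the matrix is symmetric.
    """
    n = X.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = np.linalg.norm(X[i] - X[j])
    return d

X = np.array([[30, 40000], [32, 42000], [55, 90000]], dtype=float)  # e.g. (age, income)
print(dissimilarity_matrix(X))
```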
17
How Good Is A Clustering?
Dissimilarity/similarity depends on the distance function, and different applications call for different functions. Judging clustering quality is typically highly subjective.
18
Types of Attributes
There are different types of attributes:
Nominal: examples are ID numbers, eye color, zip codes
Ordinal: examples are rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval: examples are calendar dates, temperatures in Celsius or Fahrenheit
Ratio: examples are length, time, counts
19
Types of Data in Clustering
Interval-scaled variables Binary variables Nominal, ordinal, and ratio variables Variables of mixed types
20
Similarity and Dissimilarity Between Objects
Distances are the most commonly used measures. The Minkowski distance is a generalization: if q = 2, d is the Euclidean distance; if q = 1, d is the Manhattan distance. A weighted distance assigns a weight to each variable (the formulas are written out below).
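The formulas themselves did not survive extraction; in standard notation, for objects i and j described by p variables,
\[
d(i,j) = \left( \sum_{f=1}^{p} \lvert x_{if} - x_{jf} \rvert^{q} \right)^{1/q}, \qquad q \ge 1,
\]
and the weighted variant is
\[
d_w(i,j) = \left( \sum_{f=1}^{p} w_f \, \lvert x_{if} - x_{jf} \rvert^{q} \right)^{1/q}.
\]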
21
Properties of Minkowski Distance
Nonnegative: d(i,j) ≥ 0. The distance of an object to itself is 0: d(i,i) = 0. Symmetric: d(i,j) = d(j,i). Triangle inequality: d(i,j) ≤ d(i,k) + d(k,j).
22
Categories of Clustering Approaches (1)
Partitioning algorithms: partition the objects into k clusters, then iteratively reallocate objects to improve the clustering.
Hierarchical algorithms: agglomerative (each object starts as its own cluster; clusters are merged to form larger ones) or divisive (all objects start in one cluster, which is split into smaller clusters).
23
Partitional Clustering
(Figure: a set of original points and a partitional clustering of them.)
24
Hierarchical Clustering
(Figure: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram.)
25
Categories of Clustering Approaches (2)
Density-based methods: based on connectivity and density functions; filter out noise and find clusters of arbitrary shape.
Grid-based methods: quantize the object space into a grid structure.
Model-based methods: use a model to find the best fit of the data.
26
Partitioning Algorithms: Basic Concepts
Partition n objects into k clusters so as to optimize the chosen partitioning criterion. Global optimum: would require examining all possible partitions, whose number grows exponentially with n; too expensive! Heuristic methods: k-means and k-medoids. K-means: each cluster is represented by its center. K-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster.
27
Overview of K-Means Clustering
K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters. Algorithm: initialize the K cluster centers randomly, then repeat until convergence. Cluster assignment step: assign each data point x to the cluster X_l whose center μ_l has the minimum L2 distance to x. Center re-estimation step: re-estimate each cluster center as the mean of the points assigned to that cluster.
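As a concrete illustration, here is a minimal NumPy sketch of this procedure; the function name, parameters, and convergence test are my own choices, not taken from the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: random init, assignment step, re-estimation step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initialize K centers randomly
    for _ in range(n_iters):
        # Cluster assignment: each point goes to the nearest center (L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation: each center becomes the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return centers, labels
```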
28
K-Means Objective Function
Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers (the objective is written out below). Initialization of the K cluster centers: totally random; random perturbation from the global mean; a heuristic to ensure well-separated centers. Source: J. Ye 2006
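Written out, with \(\mu_l\) denoting the center of cluster \(X_l\), the objective being locally minimized is
\[
J = \sum_{l=1}^{K} \; \sum_{x_i \in X_l} \lVert x_i - \mu_l \rVert^{2}.
\]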
29
K Means Example
30
K Means Example Randomly Initialize Means
31
Semi-Supervised Clustering Example
32
Semi-Supervised Clustering Example
33
Second Semi-Supervised Clustering Example
34
Second Semi-Supervised Clustering Example
35
Pros and Cons of K-means
Relatively efficient: O(tkn), where n = # objects, k = # clusters, t = # iterations; normally k, t << n.
Often terminates at a local optimum.
Applicable only when a mean is defined; what about categorical data?
Need to specify the number of clusters in advance.
Unable to handle noisy data and outliers.
Not suited to discovering clusters with non-convex shapes.
36
Variations of the K-means
Aspects of variation: selection of the initial k means, dissimilarity calculations, and strategies to calculate cluster means.
Handling categorical data: k-modes, which uses the mode instead of the mean (the mode is the most frequent item(s)).
A mixture of categorical and numerical data: the k-prototype method.
37
Categorical Values
Handling categorical data: k-modes (Huang '98) replaces the means of clusters with modes. The mode of an attribute is its most frequent value; the mode of a set of instances takes, for each attribute, the most frequent value. K-modes otherwise follows the same iterative procedure as k-means, using a frequency-based method to update the modes of clusters. A mixture of categorical and numerical data is handled by the k-prototype method.
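For illustration, a minimal sketch of the mode computation that replaces the mean update in k-modes; the helper name and the toy categorical records are mine, not from the slides:

```python
from collections import Counter

def cluster_mode(instances):
    """Mode of a set of categorical instances: per attribute, the most frequent value."""
    n_attrs = len(instances[0])
    return tuple(Counter(row[a] for row in instances).most_common(1)[0][0]
                 for a in range(n_attrs))

# Toy cluster of categorical records (eye color, region)
cluster = [("brown", "north"), ("blue", "north"), ("brown", "south")]
print(cluster_mode(cluster))   # ('brown', 'north') becomes the cluster's representative
```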
38
A Problem of K-Means
K-means is sensitive to outliers. An outlier is an object with extremely large values, which may substantially distort the distribution of the data. K-medoids addresses this by representing each cluster by its most centrally located object rather than by the mean.
39
PAM: A K-medoids Method
PAM: Partitioning Around Medoids
Arbitrarily choose k objects as the initial medoids.
Repeat until no change:
(Re)assign each object to the cluster of its nearest medoid.
Randomly select a non-medoid object o' and compute the total cost S of swapping a medoid o with o'.
If S < 0, swap o with o' to form the new set of k medoids.
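For illustration, a minimal sketch of the swap idea; unlike the slide's PAM, which samples a random non-medoid, this simplified variant exhaustively tries every medoid/non-medoid swap, and all names are my own:

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from every object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam_swap_step(X, medoid_idx):
    """One pass: try swapping each medoid with each non-medoid, keep the best improvement."""
    best = list(medoid_idx)
    best_cost = total_cost(X, best)
    for m in range(len(medoid_idx)):
        for o in range(len(X)):
            if o in medoid_idx:
                continue
            candidate = list(medoid_idx)
            candidate[m] = o
            c = total_cost(X, candidate)
            if c < best_cost:              # S < 0: the swap improves the clustering
                best, best_cost = candidate, c
    return best, best_cost
```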
40
K-Medoids example: break the points 1, 2, 6, 7, 8, 10, 15, 17, 20 into 3 clusters.
Initial medoids 6, 7, 8:
Cluster 6: 1, 2
Cluster 7: (only the medoid itself)
Cluster 8: 10, 15, 17, 20
Random non-medoid: 15 replaces 7 (total cost = -13)
Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 1-0 = 1)
Cluster 8: 10 (cost 0)
New cluster 15: 17 (cost 2-9 = -7), 20 (cost 5-12 = -7)
Since the cost change is negative, replace medoid 7 with the new medoid 15 and reassign:
Cluster 6: 1, 2, 7
Cluster 8: 10
Cluster 15: 17, 20
41
K-Medoids example (continued)
Random non-medoid: 1 replaces 6 (total cost = 2)
Cluster 8: 7 (cost 6-1 = 5), 10 (cost 0)
Cluster 15: 17 (cost 0), 20 (cost 0)
New cluster 1: 2 (cost 1-4 = -3)
Random non-medoid: 2 replaces 6 (total cost = 1)
Don't replace medoid 6; the clusters remain:
Cluster 6: 1, 2, 7
Cluster 8: 10
Cluster 15: 17, 20
Random non-medoid: 7 replaces 6 (total cost = 2)
Cluster 8: 10 (cost 0)
Cluster 15: 17 (cost 0), 20 (cost 0)
New cluster 7: 6 (cost 1-0 = 1), 2 (cost 5-4 = 1)
42
K-Medoids example (continued)
Don't replace medoid 6; the clusters remain:
Cluster 6: 1, 2, 7
Cluster 8: 10
Cluster 15: 17, 20
Random non-medoid: 10 replaces 8 (total cost = 2), so don't replace.
Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
Cluster 15: 17 (cost 0), 20 (cost 0)
New cluster 10: 8 (cost 2-0 = 2)
Random non-medoid: 17 replaces 15 (total cost = 0), so don't replace.
Cluster 8: 10 (cost 0)
New cluster 17: 15 (cost 2-0 = 2), 20 (cost 3-5 = -2)
43
K-Medoids example (continued)
Random non-medoid: 20 replaces 15 (total cost = 3), so don't replace.
Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
Cluster 8: 10 (cost 0)
New cluster 20: 15 (cost 5-0 = 2), 17 (cost 3-2 = 1)
All other possible changes (1 replaces 15, 2 replaces 15, 1 replaces 8, …) have high costs.
No changes remain, so the final clusters are:
Cluster 6: 1, 2, 7
Cluster 8: 10
Cluster 15: 17, 20
44
Semi-Supervised Clustering
45
Outline
Overview of clustering and classification
What is semi-supervised learning? (semi-supervised clustering, semi-supervised classification)
What is semi-supervised clustering?
Why semi-supervised clustering?
Semi-supervised clustering algorithms
Source: J. Ye 2006
46
Supervised classification versus unsupervised clustering
Unsupervised clustering: group similar objects together to find clusters; minimize intra-class distance and maximize inter-class distance.
Supervised classification: the class label for each training sample is given; build a model from the training data and predict class labels for unseen future data points.
Source: J. Ye 2006
47
What is clustering? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups. Inter-cluster distances are maximized; intra-cluster distances are minimized. Source: J. Ye 2006
48
What is Classification?
Source: J. Ye 2006
49
Clustering algorithms
K-Means Hierarchical clustering Graph based clustering (Spectral clustering) Bi-clustering Source: J. Ye 2006
50
Classification algorithms
K-Nearest-Neighbor classifiers Naïve Bayes classifier Linear Discriminant Analysis (LDA) Support Vector Machines (SVM) Logistic Regression Neural Networks Source: J. Ye 2006
51
Supervised Classification Example
52
Supervised Classification Example
53
Supervised Classification Example
54
Unsupervised Clustering Example
55
Unsupervised Clustering Example
56
Semi-Supervised Learning
Combines labeled and unlabeled data during training to improve performance.
Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
(Diagram: a spectrum running from unsupervised clustering through semi-supervised learning to supervised classification.)
57
Semi-Supervised Classification Example
58
Semi-Supervised Classification Example
59
Semi-Supervised Classification
Algorithms: semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]; co-training [Blum:COLT98]; transductive SVMs [Vapnik:98, Joachims:ICML99]; graph-based algorithms.
Assumptions: a known, fixed set of categories is given in the labeled data; the goal is to improve classification of examples into these known categories.
60
Semi-supervised clustering: problem definition
Input: a set of unlabeled objects, each described by a set of attributes (numeric and/or categorical), together with a small amount of domain knowledge.
Output: a partitioning of the objects into k clusters (possibly with some discarded as outliers).
Objective: maximum intra-cluster similarity, minimum inter-cluster similarity, and high consistency between the partitioning and the domain knowledge.
61
Why semi-supervised clustering?
Why not plain clustering? The clusters produced may not be the ones required, and sometimes there are multiple possible groupings.
Why not classification? Sometimes there are insufficient labeled data.
Potential applications: bioinformatics (gene and protein clustering), document hierarchy construction, news categorization, image categorization.
62
Semi-Supervised Clustering
Domain knowledge: partial label information is given, or some constraints (must-links and cannot-links) are applied.
Approaches:
Search-based semi-supervised clustering: alter the clustering algorithm using the constraints.
Similarity-based semi-supervised clustering: alter the similarity measure based on the constraints.
A combination of both.
63
Search-Based Semi-Supervised Clustering
Alter the clustering algorithm that searches for a good partitioning by:
Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz:ANNIE99].
Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].
Using the labeled data to initialize clusters in an iterative refinement algorithm such as k-means [Basu:ICML02].
Source: J. Ye 2006
66
K Means Example Assign Points to Clusters
67
K Means Example Re-estimate Means
68
K Means Example Re-assign Points to Clusters
69
K Means Example Re-estimate Means
70
K Means Example Re-assign Points to Clusters
71
K Means Example Re-estimate Means and Converge
72
Semi-Supervised K-Means
Partial label information is given: Seeded K-Means, Constrained K-Means.
Constraints (must-link, cannot-link): COP K-Means.
73
Semi-Supervised K-Means for partially labeled data
Seeded K-Means: labeled data provided by the user are used for initialization; the initial center for cluster i is the mean of the seed points having label i. Seed points are used only for initialization, not in subsequent steps.
Constrained K-Means: labeled data provided by the user are used to initialize the K-Means algorithm. Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.
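A minimal sketch of both variants, assuming NumPy arrays, integer seed labels 0..k-1, and an illustrative `constrained` flag of my own (not from the slides) to switch between Seeded and Constrained K-Means:

```python
import numpy as np

def seeded_init(X, seed_idx, seed_labels, k):
    """Initial center of cluster i = mean of the seed points having label i."""
    return np.array([X[seed_idx][seed_labels == i].mean(axis=0) for i in range(k)])

def semi_supervised_kmeans(X, seed_idx, seed_labels, k, constrained=False, n_iters=100):
    """Seeded K-Means (constrained=False) or Constrained K-Means (constrained=True)."""
    centers = seeded_init(X, seed_idx, seed_labels, k)
    for _ in range(n_iters):
        # Assignment step: nearest center by L2 distance
        labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        if constrained:
            labels[seed_idx] = seed_labels        # seed labels are never re-estimated
        # Re-estimation step
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```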
74
Seeded K-Means: use labeled data to find the initial centroids and then run K-Means. The labels of seeded points may change. Source: J. Ye 2006
75
Seeded K-Means Example
76
Seeded K-Means Example Initialize Means Using Labeled Data
77
Seeded K-Means Example Assign Points to Clusters
78
Seeded K-Means Example Re-estimate Means
79
Seeded K-Means Example Assign Points to Clusters and Converge (note that the label of a seeded point is changed)
80
Constrained K-Means: use labeled data to find the initial centroids and then run K-Means. The labels of seeded points will not change. Source: J. Ye 2006
81
Constrained K-Means Example
82
Constrained K-Means Example Initialize Means Using Labeled Data
83
Constrained K-Means Example Assign Points to Clusters
84
Constrained K-Means Example Re-estimate Means and Converge
85
COP K-Means COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points. Initialization: Cluster centers are chosen randomly, but as each one is chosen any must-link constraints that it participates in are enforced (so that they cannot later be chosen as the center of another cluster). Algorithm: During cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort. Source: J. Ye 2006
86
COP K-Means Algorithm
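A minimal sketch of the constrained assignment step; the helper names, the representation of constraints as index pairs, and the use of an exception to signal the abort are my own choices, not from the slides:

```python
import numpy as np

def violates(point, cluster_id, labels, must_link, cannot_link):
    """Check whether assigning `point` to `cluster_id` breaks any constraint."""
    for a, b in must_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and labels[other] is not None and labels[other] != cluster_id:
            return True
    for a, b in cannot_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and labels[other] == cluster_id:
            return True
    return False

def cop_assign(X, centers, must_link, cannot_link):
    """COP-K-Means assignment: nearest non-violating cluster, or abort if none exists."""
    labels = [None] * len(X)
    for i in range(len(X)):
        order = np.argsort(np.linalg.norm(centers - X[i], axis=1))   # clusters by distance
        for c in order:
            if not violates(i, c, labels, must_link, cannot_link):
                labels[i] = int(c)
                break
        else:
            raise RuntimeError("No legal assignment: COP-K-Means aborts")
    return labels
```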
87
Illustration: determining the label of a new point. A must-link constraint forces assignment to the red class.
88
Illustration: determining the label of a new point. A cannot-link constraint forces assignment to the red class.
89
Illustration: determining the label of a new point that participates in both must-link and cannot-link constraints. When the constraints cannot all be satisfied, the clustering algorithm fails.
90
Summary
Seeded and Constrained K-Means use partially labeled data; COP K-Means uses constraints (must-link and cannot-link).
Constrained K-Means and COP K-Means require all the constraints to be satisfied, so they may not be effective if the seeds contain noise.
Seeded K-Means uses the seeds only in the first step, to determine the initial centroids, and is therefore less sensitive to noise in the seeds.
Experiments show that semi-supervised K-Means outperforms traditional K-Means.
91
References
Ye, Jieping. Introduction to Data Mining. Department of Computer Science and Engineering, Arizona State University, 2006.
Clifton, Chris. Introduction to Data Mining. Purdue University, 2006.
Zhu, Xingquan, and Davidson, Ian. Knowledge Discovery and Data Mining, 2007.