
1 Clustering I Data Mining Soongsil University

2 What is clustering?

3 What is a natural grouping among these objects?

4 What is a natural grouping among these objects?
Clustering is subjective

5 What is Similarity? The quality or state of being similar; likeness; resemblance, as in a similarity of features. Similarity is hard to define, but we know it when we see it. The real meaning of similarity is a philosophical question; we will take a more pragmatic approach.

6 Defining Distance Measures
Definition: let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number, denoted by D(O1, O2).

7 Unsupervised learning: Clustering
(figure: unseen data flows into a clustering "black box")

8 2-dimensional clustering, showing three data clusters
(figure: scatter plot of Income vs. Age showing three clusters)

9 What is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups: inter-cluster distances are maximized while intra-cluster distances are minimized.

10 What Is A Good Clustering?
A good clustering has high intra-class similarity and low inter-class similarity (depending on the similarity measure used), and the ability to discover some or all of the hidden patterns.

11 Requirements of Clustering
- Scalability
- Ability to deal with various types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters

12 Requirements of Clustering
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

13 A technique demanded by many real-world tasks
- Biology: taxonomy of living things, such as kingdom, phylum, class, order, family, genus, and species
- Information retrieval: document and multimedia data clustering
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: helping marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should cluster along continental faults
- Climate: understanding Earth's climate by finding patterns in the atmosphere and ocean
- Social network mining: special-interest group discovery

14

15 Data Matrix
For memory-based clustering; also called object-by-variable structure. Represents n objects with p variables (attributes, measures): a relational table.

16 Dissimilarity Matrix
For memory-based clustering; also called object-by-object structure. Stores the proximities of pairs of objects, where d(i,j) is the dissimilarity between objects i and j: values are nonnegative, and values close to 0 indicate similar objects. A small sketch of building one follows.
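To make the two structures concrete, here is a minimal Python sketch that turns an object-by-variable data matrix into an object-by-object dissimilarity matrix. It assumes Euclidean distance; the function name and sample data are illustrative, not from the slides.

```python
import numpy as np

def dissimilarity_matrix(X):
    """Turn an n x p object-by-variable data matrix into an n x n
    object-by-object dissimilarity matrix (Euclidean distance here;
    any dissimilarity function could be substituted)."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])
            D[i, j] = D[j, i] = d   # symmetric, with a zero diagonal
    return D

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0]])
print(dissimilarity_matrix(X))
```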

17 How Good Is A Clustering?
- Dissimilarity/similarity depends on the distance function
- Different applications call for different functions
- Judgment of clustering quality is typically highly subjective

18 Types of Attributes
There are different types of attributes:
- Nominal. Examples: ID numbers, eye color, zip codes
- Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
- Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
- Ratio. Examples: length, time, counts

19 Types of Data in Clustering
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

20 Similarity and Dissimilarity Between Objects
Distances are the normally used measures. The Minkowski distance is a generalization: d(i,j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q)^(1/q), where i = (xi1, ..., xip) and j = (xj1, ..., xjp) are two p-dimensional objects and q is a positive integer. If q = 2, d is the Euclidean distance; if q = 1, d is the Manhattan distance. The weighted distance multiplies each attribute's term by a weight. A sketch follows.
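A small Python sketch of the Minkowski distance, including the weighted variant; the function name and test points are illustrative choices.

```python
import numpy as np

def minkowski(x, y, q=2, w=None):
    """Minkowski distance of order q between two p-dimensional points.
    q = 2 gives the Euclidean distance, q = 1 the Manhattan distance;
    an optional weight vector w gives the weighted variant."""
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float)) ** q
    if w is not None:
        diff = np.asarray(w, float) * diff
    return diff.sum() ** (1.0 / q)

print(minkowski([0, 0], [3, 4], q=2))  # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], q=1))  # 7.0 (Manhattan)
```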

21 Properties of Minkowski Distance
Nonnegativity: d(i,j) >= 0. The distance of an object to itself is 0: d(i,i) = 0. Symmetry: d(i,j) = d(j,i). Triangle inequality: d(i,j) <= d(i,k) + d(k,j). These can be spot-checked numerically, as in the sketch below.
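A quick numeric spot-check of the four properties for the q = 2 (Euclidean) case, on arbitrary sample points chosen for illustration; a small tolerance guards against floating-point error.

```python
import itertools
import numpy as np

pts = [np.array(p, float) for p in [(0, 0), (3, 4), (6, 0), (1, 1)]]
d = lambda a, b: np.linalg.norm(a - b)   # Euclidean = Minkowski with q = 2

for p in pts:
    assert d(p, p) == 0                                      # d(i,i) = 0
for i, j in itertools.permutations(range(len(pts)), 2):
    assert d(pts[i], pts[j]) >= 0                            # nonnegativity
    assert np.isclose(d(pts[i], pts[j]), d(pts[j], pts[i]))  # symmetry
for i, j, k in itertools.permutations(range(len(pts)), 3):
    # triangle inequality, with a small float tolerance
    assert d(pts[i], pts[j]) <= d(pts[i], pts[k]) + d(pts[k], pts[j]) + 1e-9
print("all metric properties hold on the sample")
```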

22 Categories of Clustering Approaches (1)
- Partitioning algorithms: partition the objects into k clusters, then iteratively reallocate objects to improve the clustering
- Hierarchical algorithms: agglomerative (each object starts as its own cluster; merge clusters to form larger ones) or divisive (all objects start in one cluster; split it into smaller ones)

23 Partitional Clustering
(figure: the original points and a partitional clustering of them)

24 Hierarchical Clustering
(figures: traditional and non-traditional hierarchical clusterings with their corresponding dendrograms)

25 Categories of Clustering Approaches (2)
- Density-based methods: based on connectivity and density functions; filter out noise and find clusters of arbitrary shape
- Grid-based methods: quantize the object space into a grid structure
- Model-based methods: use a model to find the best fit of the data

26 Partitioning Algorithms: Basic Concepts
Partition n objects into k clusters so as to optimize the chosen partitioning criterion.
- Global optimum: examine all partitions. The number of partitions of n objects into k non-empty clusters is the Stirling number of the second kind, which grows roughly like k^n/k!: far too expensive! (See the sketch below.)
- Heuristic methods: k-means and k-medoids. K-means: a cluster is represented by its center. K-medoids or PAM (partitioning around medoids): each cluster is represented by one of the objects in the cluster.
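To see why exhaustive search is hopeless, the exact count of partitions of n objects into k non-empty clusters is the Stirling number of the second kind, computable with the standard recurrence; this small sketch (function name illustrative) shows how fast it grows.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: the number of ways to
    partition n objects into k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # the n-th object either joins one of k existing clusters
    # or starts a new cluster of its own
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(10, 3))  # 9330 partitions for just 10 objects
print(stirling2(25, 4))  # roughly 5e13: exhaustive search is hopeless
```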

27 Overview of K-Means Clustering
K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.
Algorithm: initialize K cluster centers randomly, then repeat until convergence:
- Cluster assignment step: assign each data point x to the cluster Xl whose center is nearest to x in L2 distance
- Center re-estimation step: re-estimate each cluster center as the mean of the points in that cluster
A runnable sketch follows.
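A compact Python sketch of the two-step loop described above. The random initialization, function names, and demo data are illustrative choices, not code from the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means sketch: random initialization, then alternate the
    cluster assignment and center re-estimation steps until the
    assignments stop changing (or max_iter is reached)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Cluster assignment step: each point goes to its nearest center (L2)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Center re-estimation step: mean of the points in each cluster
        for j in range(k):
            if np.any(new_labels == j):
                centers[j] = X[new_labels == j].mean(axis=0)
        if np.array_equal(new_labels, labels):  # converged
            break
        labels = new_labels
    return labels, centers

# three well-separated blobs as a toy dataset
X = np.vstack([np.random.default_rng(1).normal(m, 0.5, size=(20, 2))
               for m in ((0, 0), (5, 5), (0, 5))])
labels, centers = kmeans(X, k=3)
print(centers)
```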

28 K-Means Objective Function
K-Means locally minimizes the sum of squared distances between the data points and their corresponding cluster centers, J = sum over clusters l of sum over x in Xl of ||x - c_l||^2, where c_l is the center of cluster Xl. Options for initializing the K cluster centers:
- Totally random
- Random perturbation from the global mean
- A heuristic that ensures well-separated centers
Source: J. Ye 2006
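The objective can be computed directly from an assignment; a sketch with hand-made values chosen so the result is easy to verify by eye.

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Sum of squared distances from each point to its assigned center:
    J = sum_l sum_{x in X_l} ||x - c_l||^2."""
    J = 0.0
    for j, c in enumerate(centers):
        J += np.sum((X[labels == j] - c) ** 2)
    return J

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
centers = np.array([[0.0, 1.0], [10.0, 0.0]])
print(kmeans_objective(X, labels, centers))  # 1 + 1 + 0 = 2.0
```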

29 K Means Example

30 K Means Example Randomly Initialize Means

31 Semi-Supervised Clustering Example

32 Semi-Supervised Clustering Example

33 Second Semi-Supervised Clustering Example

34 Second Semi-Supervised Clustering Example

35 Pros and Cons of K-means
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
- Often terminates at a local optimum
- Applicable only when the mean is defined; what about categorical data?
- The number of clusters must be specified in advance
- Unable to handle noisy data and outliers
- Unsuitable for discovering non-convex clusters

36 Variations of the K-means
Variations differ in:
- Selection of the initial k means
- Dissimilarity calculations
- Strategies for calculating cluster means
Handling categorical data: k-modes uses the mode instead of the mean, where the mode is the most frequent item(s). For a mixture of categorical and numerical data: the k-prototype method.

37 Categorical Values
Handling categorical data: k-modes (Huang’98):
- Replace the means of clusters with modes. The mode of an attribute is its most frequent value; the mode of a set of instances sets each attribute to its most frequent value.
- Update the modes of clusters with a frequency-based method.
- Otherwise the procedure is that of K-means.
For a mixture of categorical and numerical data: the k-prototype method. A sketch of the mode computation follows.
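A sketch of two k-modes ingredients: the mode of a set of categorical instances (the analogue of center re-estimation) and the simple matching dissimilarity used to compare categorical instances. Function names and sample values are illustrative.

```python
from collections import Counter

def mode_instance(cluster):
    """Mode of a set of categorical instances: for each attribute,
    take the most frequent value within the cluster."""
    n_attrs = len(cluster[0])
    return tuple(
        Counter(row[a] for row in cluster).most_common(1)[0][0]
        for a in range(n_attrs)
    )

def simple_matching(x, y):
    """Simple matching dissimilarity: the number of attributes on
    which two categorical instances disagree."""
    return sum(a != b for a, b in zip(x, y))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_instance(cluster))                                # ('red', 'small')
print(simple_matching(("red", "small"), ("blue", "small")))  # 1
```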

38 A Problem of K-means
K-means is sensitive to outliers: objects with extremely large values may substantially distort the distribution of the data. K-medoids addresses this by representing each cluster by its most centrally located object rather than by a mean.

39 PAM: A K-medoids Method
PAM (Partitioning Around Medoids):
1. Arbitrarily choose k objects as the initial medoids.
2. Until no change occurs:
   a. (Re)assign each object to the cluster of its nearest medoid.
   b. Randomly select a non-medoid object o' and compute the total cost S of swapping a medoid o with o'.
   c. If S < 0, swap o with o' to form the new set of k medoids.
A runnable sketch follows, and the next slides walk through a small numeric example.
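A Python sketch of the PAM idea, run on the nine points from the worked example on the next slides. Note that this version greedily tries every possible swap rather than randomly sampled ones, so it may settle on a different (and possibly lower-cost) medoid set than the randomized trace below; names are illustrative.

```python
import numpy as np
from itertools import product

def pam(X, k, seed=0):
    """PAM sketch: start from random medoids, then repeatedly try
    swapping a medoid with a non-medoid and keep any swap that lowers
    the total distance of objects to their nearest medoid."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))

    def total_cost(meds):
        # each object contributes its distance to the nearest medoid
        return sum(min(np.linalg.norm(X[i] - X[m]) for m in meds)
                   for i in range(len(X)))

    cost = total_cost(medoids)
    improved = True
    while improved:
        improved = False
        for m_idx, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            candidate = medoids[:m_idx] + [o] + medoids[m_idx + 1:]
            c = total_cost(candidate)
            if c < cost:  # swap cost S = c - cost < 0, so accept
                medoids, cost, improved = candidate, c, True
    return medoids

X = np.array([[1.], [2.], [6.], [7.], [8.], [10.], [15.], [17.], [20.]])
print(sorted(X[m][0] for m in pam(X, 3)))  # a low-cost medoid set
```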

40 K-Medoids example
Data: 1, 2, 6, 7, 8, 10, 15, 17, 20; break into 3 clusters with initial medoids 6, 7, and 8:
- Cluster 6: 1, 2
- Cluster 7: (no other members)
- Cluster 8: 10, 15, 17, 20
Try random non-medoid 15 replacing medoid 7 (total cost = -13):
- Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 1-0 = 1)
- Cluster 8: 10 (cost 0)
- New cluster 15: 17 (cost 2-9 = -7), 20 (cost 5-12 = -7)
Cost decreased, so replace medoid 7 with the new medoid 15 and reassign:
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20

41 K-Medoids example (continued)
Try random non-medoid 1 replacing medoid 6 (total cost = 2):
- Cluster 8: 7 (cost 6-1 = 5), 10 (cost 0)
- Cluster 15: 17 (cost 0), 20 (cost 0)
- New cluster 1: 2 (cost 1-4 = -3)
Try 2 replacing 6 (total cost = 1). Both costs are positive, so don't replace medoid 6:
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20
Try random non-medoid 7 replacing medoid 6 (total cost = 2):
- Cluster 8: 10 (cost 0)
- Cluster 15: 17 (cost 0), 20 (cost 0)
- New cluster 7: 6 (cost 1-0 = 1), 2 (cost 5-4 = 1)

42 K-Medoids example (continued)
Cost is positive, so don't replace medoid 6:
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20
Try random non-medoid 10 replacing medoid 8 (total cost = 2); don't replace:
- Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
- Cluster 15: 17 (cost 0), 20 (cost 0)
- New cluster 10: 8 (cost 2-0 = 2)
Try random non-medoid 17 replacing medoid 15 (total cost = 0); don't replace:
- Cluster 8: 10 (cost 0)
- New cluster 17: 15 (cost 2-0 = 2), 20 (cost 3-5 = -2)

43 K-Medoids example (continued)
Try random non-medoid 20 replacing medoid 15 (total cost = 3); don't replace:
- Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
- Cluster 8: 10 (cost 0)
- New cluster 20: 15 (cost 5-0 = 2), 17 (cost 3-2 = 1)
All other possible swaps (1 replaces 15, 2 replaces 15, 1 replaces 8, ...) also have high costs. No changes remain, so the final clusters are:
- Cluster 6: 1, 2, 7
- Cluster 8: 10
- Cluster 15: 17, 20

44 Semi-Supervised Clustering

45 Outline
- Overview of clustering and classification
- What is semi-supervised learning? Semi-supervised clustering and semi-supervised classification
- What is semi-supervised clustering? Why semi-supervised clustering?
- Semi-supervised clustering algorithms
Source: J. Ye 2006

46 Supervised classification versus unsupervised clustering
Unsupervised clustering: group similar objects together to find clusters, minimizing intra-class distance and maximizing inter-class distance. Supervised classification: a class label is given for each training sample; build a model from the training data and predict class labels for unseen future data points. Source: J. Ye 2006

47 What is clustering? Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups: inter-cluster distances are maximized while intra-cluster distances are minimized. Source: J. Ye 2006

48 What is Classification?
Source: J. Ye 2006

49 Clustering algorithms
- K-Means
- Hierarchical clustering
- Graph-based clustering (spectral clustering)
- Bi-clustering
Source: J. Ye 2006

50 Classification algorithms
- K-Nearest-Neighbor classifiers
- Naïve Bayes classifier
- Linear Discriminant Analysis (LDA)
- Support Vector Machines (SVM)
- Logistic Regression
- Neural Networks
Source: J. Ye 2006

51 Supervised Classification Example

52 Supervised Classification Example

53 Supervised Classification Example

54 Unsupervised Clustering Example

55 Unsupervised Clustering Example

56 Semi-Supervised Learning
Combines labeled and unlabeled data during training to improve performance:
- Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier
- Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data
On the spectrum of supervision, it sits between unsupervised clustering and supervised classification.

57 Semi-Supervised Classification Example

58 Semi-Supervised Classification Example

59 Semi-Supervised Classification
Algorithms:
- Semi-supervised EM [Ghahramani: NIPS94, Nigam: ML00]
- Co-training [Blum: COLT98]
- Transductive SVMs [Vapnik: 98, Joachims: ICML99]
- Graph-based algorithms
Assumptions: a known, fixed set of categories is given in the labeled data, and the goal is to improve classification of examples into these known categories.

60 Semi-supervised clustering: problem definition
Input:
- A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
- A small amount of domain knowledge
Output: a partitioning of the objects into k clusters (possibly with some discarded as outliers).
Objective:
- Maximum intra-cluster similarity
- Minimum inter-cluster similarity
- High consistency between the partitioning and the domain knowledge

61 Why semi-supervised clustering?
Why not plain clustering? The clusters produced may not be the ones required, and sometimes there are multiple possible groupings. Why not classification? Sometimes there are insufficient labeled data. Potential applications:
- Bioinformatics (gene and protein clustering)
- Document hierarchy construction
- News categorization
- Image categorization

62 Semi-Supervised Clustering
Domain knowledge comes as partial label information or as constraints (must-links and cannot-links). Approaches:
- Search-based semi-supervised clustering: alter the clustering algorithm using the constraints
- Similarity-based semi-supervised clustering: alter the similarity measure based on the constraints
- A combination of both

63 Search-Based Semi-Supervised Clustering
Alter the clustering algorithm that searches for a good partitioning by:
- Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz: ANNIE99]
- Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff: ICML00, Wagstaff: ICML01]
- Using the labeled data to initialize clusters in an iterative refinement algorithm such as K-Means [Basu: ICML02]
Source: J. Ye 2006

64

65

66 K Means Example Assign Points to Clusters

67 K Means Example Re-estimate Means

68 K Means Example Re-assign Points to Clusters

69 K Means Example Re-estimate Means

70 K Means Example Re-assign Points to Clusters

71 K Means Example Re-estimate Means and Converge

72 Semi-Supervised K-Means
Partial label information is given: Seeded K-Means, Constrained K-Means. Constraints (must-link, cannot-link): COP K-Means.

73 Semi-Supervised K-Means for partially labeled data
Seeded K-Means: labeled data provided by the user are used for initialization; the initial center for cluster i is the mean of the seed points having label i. Seed points are used only for initialization, not in subsequent steps. Constrained K-Means: labeled data provided by the user are used to initialize the K-Means algorithm, cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated. A code sketch of the difference follows.
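A sketch of how the two variants differ in code: both share the seeded initialization, and only Constrained K-Means clamps the seed labels during the assignment step. Function names and the toy data are illustrative.

```python
import numpy as np

def seeded_init(X, y_seed, k):
    """Seeded initialization: the center of cluster i is the mean of
    the seed points labeled i (y_seed holds -1 for unlabeled points)."""
    return np.array([X[y_seed == i].mean(axis=0) for i in range(k)])

def assign(X, centers, y_seed=None, constrained=False):
    """Assignment step. In Constrained K-Means the seed points keep
    their given labels; in Seeded K-Means every point may move."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    if constrained and y_seed is not None:
        seeded = y_seed >= 0
        labels[seeded] = y_seed[seeded]   # clamp the seed labels
    return labels

X = np.array([[0., 0.], [0., 1.], [5., 5.], [6., 5.], [3., 3.]])
y_seed = np.array([0, 0, 1, 1, -1])       # the last point is unlabeled
centers = seeded_init(X, y_seed, k=2)
print(assign(X, centers, y_seed, constrained=True))  # [0 0 1 1 1]
```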

74 Seeded K-Means
Use labeled data to find the initial centroids, then run K-Means. The labels of seeded points may change. Source: J. Ye 2006

75 Seeded K-Means Example

76 Seeded K-Means Example Initialize Means Using Labeled Data

77 Seeded K-Means Example Assign Points to Clusters

78 Seeded K-Means Example Re-estimate Means

79 Seeded K-Means Example: Assign Points to Clusters and Converge
(in the figure, the label of one seeded point is changed)

80 Constrained K-Means
Use labeled data to find the initial centroids, then run K-Means. The labels of seeded points will not change. Source: J. Ye 2006

81 Constrained K-Means Example

82 Constrained K-Means Example Initialize Means Using Labeled Data

83 Constrained K-Means Example Assign Points to Clusters

84 Constrained K-Means Example Re-estimate Means and Converge

85 COP K-Means
COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points. Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster). Algorithm: during the cluster assignment step, a point is assigned to its nearest cluster such that none of its constraints are violated; if no such assignment exists, abort. A sketch of this assignment step follows. Source: J. Ye 2006
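A sketch of the constrained assignment step (names and toy data illustrative): each point takes its nearest cluster that violates no constraint against already-assigned points, and the procedure aborts when a point has no legal cluster.

```python
def violates(point, cluster, labels, must_link, cannot_link):
    """Would assigning `point` to `cluster` break a constraint with an
    already-assigned point? (labels[x] == -1 means unassigned)"""
    for a, b in must_link:
        other = b if a == point else a if b == point else None
        if other is not None and labels[other] not in (-1, cluster):
            return True
    for a, b in cannot_link:
        other = b if a == point else a if b == point else None
        if other is not None and labels[other] == cluster:
            return True
    return False

def cop_assign(dists, must_link, cannot_link):
    """COP-K-Means assignment step: each point goes to its nearest
    cluster with no constraint violation; if none exists, abort."""
    n, k = len(dists), len(dists[0])
    labels = [-1] * n
    for i in range(n):
        for c in sorted(range(k), key=lambda c: dists[i][c]):  # nearest first
            if not violates(i, c, labels, must_link, cannot_link):
                labels[i] = c
                break
        else:
            raise RuntimeError("no legal assignment: clustering fails")
    return labels

# point-to-center distances for 3 points and 2 clusters
dists = [[1.0, 4.0], [1.2, 3.0], [4.0, 1.0]]
print(cop_assign(dists, must_link=[(0, 2)], cannot_link=[(0, 1)]))
# [0, 1, 0]: the must-link pulls point 2 into cluster 0 despite distance
```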

86 COP K-Means Algorithm

87 Illustration: determining the label of a new point under a must-link constraint; the point is assigned to the red class.

88 Illustration: determining the label of a new point under a cannot-link constraint; the point is assigned to the red class.

89 Illustration: determining the label of a new point under both must-link and cannot-link constraints; no assignment satisfies both, so the clustering algorithm fails.

90 Summary
- Seeded and Constrained K-Means use partially labeled data; COP K-Means uses constraints (must-link and cannot-link).
- Constrained K-Means and COP K-Means require all the constraints to be satisfied, so they may not be effective if the seeds contain noise.
- Seeded K-Means uses the seeds only in the first step, to determine the initial centroids, and is therefore less sensitive to noise in the seeds.
- Experiments show that semi-supervised K-Means outperforms traditional K-Means.

91 References
- Ye, Jieping. Introduction to Data Mining. Department of Computer Science and Engineering, Arizona State University, 2006.
- Clifton, Chris. Introduction to Data Mining. Purdue University, 2006.
- Zhu, Xingquan & Davidson, Ian. Knowledge Discovery and Data Mining, 2007.

