1 Clustering Instructor: Qiang Yang Hong Kong University of Science and Technology Thanks: J.W. Han, I. Witten, E. Frank

2 Essentials
Terminology: objects = rows = records; variables = attributes = features.
A good clustering method scores high on intra-class similarity and low on inter-class similarity.
What is similarity? It is based on computing a distance:
between two numerical attributes,
between two nominal attributes,
or between mixed attributes.

3 The database (figure: an object-attribute data matrix; each row, e.g., Object i, is an object and each column an attribute)

4 Numerical Attributes
Distances are normally used to measure the similarity or dissimilarity between two data objects.
Euclidean distance: d(i, j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2), where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional records.
Manhattan distance: d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|.
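
For illustration, a minimal Python sketch of the two measures above, assuming records are given as plain lists of numbers:

    import math

    def euclidean(i, j):
        # d(i,j) = sqrt(sum_k (x_ik - x_jk)^2)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

    def manhattan(i, j):
        # d(i,j) = sum_k |x_ik - x_jk|
        return sum(abs(a - b) for a, b in zip(i, j))

    print(euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0
    print(manhattan([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 7.0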

5 Binary Variables ({0, 1}, or {true, false})
A contingency table for binary data: for rows i and j, let a = # of variables where both are 1, b = # where i is 1 and j is 0, c = # where i is 0 and j is 1, and d = # where both are 0.
Simple matching coefficient: d(i, j) = (b + c) / (a + b + c + d).
This is invariant to the coding of the binary variable: if you assign 1 to "pass" and 0 to "fail", or the other way around, you get the same distance value.

6 Nominal Attributes
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
Method 1: simple matching, d(i, j) = (p - m) / p, where m = # of matches and p = total # of variables.
Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.
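
In the same vein, a small sketch of Method 1 (simple matching), assuming objects are tuples of category labels:

    def nominal_dissimilarity(i, j):
        # d(i,j) = (p - m) / p, where m = # of matching attributes, p = total # of variables
        p = len(i)
        m = sum(1 for a, b in zip(i, j) if a == b)
        return (p - m) / p

    print(nominal_dissimilarity(("red", "small"), ("red", "large")))  # 0.5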

7 Other measures of cluster distance
Minimum distance: d_min(C_i, C_j) = the minimum of d(p, q) over p in C_i, q in C_j.
Maximum distance: d_max(C_i, C_j) = the maximum of d(p, q) over p in C_i, q in C_j.
Mean distance: d_mean(C_i, C_j) = d(m_i, m_j), the distance between the cluster means.
Average distance: d_avg(C_i, C_j) = the average of d(p, q) over all pairs p in C_i, q in C_j.

8 Major clustering methods
Partition-based (k-means): produces sphere-like clusters; good when the number of clusters is known, and for small and medium-sized databases.
Hierarchical methods (agglomerative or divisive): produce trees of clusters; fast.
Density-based (DBSCAN): produces arbitrarily shaped clusters; good when dealing with spatial clusters (maps).
Grid-based: produces clusters based on grids; fast for large, multidimensional databases.
Model-based: based on statistical models; allows objects to belong to several clusters.

9 The K-Means Clustering Method: for numerical attributes
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k non-empty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no more reassignments occur.
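
A minimal Python sketch of these steps, assuming objects are tuples of numbers and using squared Euclidean distance (random initial centers are one common way to realize Step 1):

    import random

    def kmeans(points, k, max_iter=100):
        centers = random.sample(points, k)
        clusters = []
        for _ in range(max_iter):
            # Step 3: assign each object to the cluster with the nearest seed point
            clusters = [[] for _ in range(k)]
            for x in points:
                nearest = min(range(k),
                              key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
                clusters[nearest].append(x)
            # Step 2: recompute each centroid as the mean point of its cluster
            new_centers = [tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centers[c]
                           for c, cl in enumerate(clusters)]
            if new_centers == centers:  # Step 4: stop when nothing changes
                break
            centers = new_centers
        return centers, clusters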

10 The mean point (figure: a table of X and Y values with their mean) The mean point can be a virtual point, i.e., not an actual object in the data set.

11 The K-Means Clustering Method: Example (K = 2)
Arbitrarily choose K objects as the initial cluster centers.
Assign each object to the most similar center.
Update the cluster means, then reassign; repeat until stable.

12 Comments on the K-Means Method
Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n.
Comment: often terminates at a local optimum.
Weaknesses:
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance.
Does not handle noisy data and outliers well.
Not suitable for discovering clusters with non-convex shapes.

13 Robustness (figure: a table of X and Y values illustrating how a single outlier shifts the mean)

14 Variations of the K-Means Method
A few variants of k-means differ in:
the selection of the initial k means,
the dissimilarity calculations,
the strategies used to calculate cluster means.
Handling categorical data: k-modes (Huang '98):
replacing the means of clusters with modes,
using new dissimilarity measures to deal with categorical objects,
using a frequency-based method to update the modes of clusters.
For a mixture of categorical and numerical data: the k-prototypes method.
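
A tiny sketch of the two k-modes ingredients named above (matching dissimilarity and frequency-based mode updates), assuming objects are tuples of category labels:

    from collections import Counter

    def matching_dissimilarity(x, y):
        # number of attributes on which the two objects disagree
        return sum(1 for a, b in zip(x, y) if a != b)

    def mode_of(cluster):
        # frequency-based update: per attribute, keep the most frequent category
        return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

    print(mode_of([("red", "s"), ("red", "m"), ("blue", "m")]))  # ('red', 'm')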

15 K-Modes: see Z. Huang's paper online (Data Mining and Knowledge Discovery journal, Springer)

16 Formalization of K-Means: k-means minimizes the within-cluster sum of squared distances, E = sum over clusters k of sum over x in C_k of ||x - m_k||^2, where m_k is the mean of cluster C_k.

17 K-Means: Cont.

18 K-Modes: see Z. Huang's paper online (Data Mining and Knowledge Discovery journal, Springer)

19 K-Modes (Cont.)

20 K-Modes

21 K-Modes: Cost Function

22 Finding K-Modes

23 Mixed Types: K-Prototypes

24 K-Modes: Evaluation Data

25 K-Modes: Evaluation

26 Some Experiments

27 What is the problem with the k-means method?
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

28 The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters. Medoids are located in the center of the clusters.
Given the data points, how do we find the medoid?

29 K-Medoids: most centrally located objects
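
One direct answer, as a sketch: pick the object that minimizes the total distance to all other objects in the cluster (O(n^2) distance computations; the distance helper below is an assumption, any dissimilarity works):

    def medoid(cluster, dist):
        # most centrally located object: minimizes summed distance to the others
        return min(cluster, key=lambda x: sum(dist(x, y) for y in cluster))

    def sq_euclidean(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))

    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
    print(medoid(pts, sq_euclidean))  # (1.0, 0.0); the outlier (10, 10) does not distort it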

30 CLARA

31 CLASA: Simulated Annealing

32 Sampling-based method: MCMRS

33 K-Medoids: Evaluation

34 Density-Based Clustering Methods
Clustering based on density (a local cluster criterion), such as density-connected points.
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as a termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD'96)
OPTICS: Ankerst, et al. (SIGMOD'99)
DENCLUE: Hinneburg & Keim (KDD'98)
CLIQUE: Agrawal, et al. (SIGMOD'98)

35 Density-Based Clustering
Clustering based on density (a local cluster criterion), such as density-connected points.
Each cluster has a considerably higher density of points than the area outside of the cluster.

36 Density-Based Clustering: Background
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
N_Eps(p) = {q in D | dist(p, q) <= Eps}
Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
1) p belongs to N_Eps(q)
2) core point condition: |N_Eps(q)| >= MinPts
(figure: p in the neighbourhood of core point q; MinPts = 5, Eps = 1 cm)

37 Density-Based Clustering: Background (II)
Density-reachable: a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.
Density-connected: a point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
(figures: a chain of points from q to p; p and q both density-reachable from o)

38 DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
Discovers clusters of arbitrary shape in spatial databases with noise.
(figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5)

39 DBSCAN: The Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p wrt. Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
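
A compact Python sketch of this procedure (naive O(n^2) neighbourhood queries; in practice a spatial index would answer them):

    def dbscan(points, eps, min_pts, dist):
        labels = [None] * len(points)   # None = unvisited, -1 = noise, 0.. = cluster id

        def neighbors(i):
            # the Eps-neighbourhood N_Eps(p_i), including p_i itself
            return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

        cluster_id = -1
        for i in range(len(points)):
            if labels[i] is not None:
                continue
            seeds = neighbors(i)
            if len(seeds) < min_pts:          # not a core point: tentatively noise
                labels[i] = -1
                continue
            cluster_id += 1                   # core point: a new cluster is formed
            labels[i] = cluster_id
            queue = [j for j in seeds if j != i]
            while queue:                      # expand along density-reachable points
                j = queue.pop()
                if labels[j] == -1:           # border point previously marked as noise
                    labels[j] = cluster_id
                if labels[j] is not None:
                    continue
                labels[j] = cluster_id
                j_seeds = neighbors(j)
                if len(j_seeds) >= min_pts:   # j is also a core point: keep growing
                    queue.extend(j_seeds)
        return labels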

40 DBSCAN Properties
Generally takes O(n log n) time (with a spatial index; O(n^2) without one).
Still requires the user to supply MinPts and Eps.
Advantages:
Can find clusters of arbitrary shape.
Requires only a minimal number (2) of parameters.

41 Model-Based Clustering Methods
Attempt to optimize the fit between the data and some mathematical model; statistical and AI approaches.
Conceptual clustering:
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds a characteristic description for each concept (class)
COBWEB (Fisher '87):
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that concept

42 The COBWEB Conceptual Clustering Algorithm
The COBWEB algorithm was developed by D. Fisher in 1987 for clustering objects in an object-attribute data set: Fisher, Douglas H. (1987), Knowledge Acquisition Via Incremental Conceptual Clustering.
The COBWEB algorithm yields a classification tree that characterizes each cluster with a probabilistic description, e.g., a node described as (fish, prob = 0.92).
Properties: an incremental clustering algorithm, based on probabilistic categorization trees; the search for a good clustering is guided by a quality measure for partitions of the data.
COBWEB only supports nominal attributes; CLASSIT is the version that works with nominal and numerical attributes.

43 The Classification Tree Generated by the COBWEB Algorithm

44 Input: a set of data like before. COBWEB can automatically guess the class attribute; that is, after clustering, each cluster more or less corresponds to one of the Play = Yes/No categories. Example: applied to the vote data set, it can correctly guess the party of a senator based on the past 14 votes!

45 Clustering: COBWEB
In the beginning, the tree consists of an empty node.
Instances are added one by one, and the tree is updated appropriately at each stage.
Updating involves finding the right leaf for an instance (possibly restructuring the tree).
Updating decisions are based on partition utility and category utility measures.

46 Clustering: COBWEB
Predictability P(A_i = V_ij | C_k): the larger this probability, the greater the proportion of class members sharing the value V_ij, and the more predictable the value is of class members.

47 Clustering: COBWEB
Predictiveness P(C_k | A_i = V_ij): the larger this probability, the fewer the objects outside C_k that share this value V_ij, and the more predictive the value is of class C_k.

48 Clustering: COBWEB
The category utility formula is a trade-off between intra-class similarity and inter-class dissimilarity, summed across all classes (k), attributes (i), and values (j).

49 Clustering: COBWEB

50 Clustering: COBWEB
Category utility measures the increase in the expected number of attribute values that can be correctly guessed given the partition (posterior probabilities) over the expected number of correct guesses given no such knowledge (prior probabilities).

51 The Category Utility Function
The COBWEB algorithm operates on the so-called category utility function (CU), which measures clustering quality. If we partition a set of objects into m clusters, then the CU of this particular partition is
CU = (1/m) * sum over k of P(C_k) * [ sum over i, j of P(A_i = V_ij | C_k)^2 - sum over i, j of P(A_i = V_ij)^2 ].
Question: why divide by m? Hint: if m = # of objects (one object per cluster), the undivided sum is maximal!

52 Insights of the CU Function
For a given object in cluster C_k, if we guess its attribute values according to their probabilities of occurring, then the expected number of attribute values that we can correctly guess is
sum over i, j of P(A_i = V_ij | C_k)^2.

53 Given an object without knowing the cluster that the object is in, if we guess its attribute values according to their probabilities of occurring, then the expected number of attribute values that we can correctly guess is
sum over i, j of P(A_i = V_ij)^2.

54 P(C_k) is incorporated in the CU function to give proper weighting to each cluster. Finally, m is placed in the denominator to prevent over-fitting.
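
Putting slides 51-54 together, a small Python sketch of CU for nominal data, assuming each object is a tuple of category labels and each cluster is a non-empty list of objects:

    from collections import Counter

    def category_utility(clusters):
        n = sum(len(c) for c in clusters)
        n_attrs = len(clusters[0][0])

        def expected_correct(objects):
            # sum_i sum_j P(A_i = V_ij)^2 over the given set of objects
            total = 0.0
            for i in range(n_attrs):
                counts = Counter(obj[i] for obj in objects)
                total += sum((c / len(objects)) ** 2 for c in counts.values())
            return total

        baseline = expected_correct([obj for c in clusters for obj in c])  # prior guesses
        gain = sum((len(c) / n) * (expected_correct(c) - baseline) for c in clusters)
        return gain / len(clusters)   # divide by m to prevent over-fitting

    print(category_utility([[("y", "a"), ("y", "a")], [("n", "b"), ("n", "b")]]))  # 0.5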

55 Question about CU
Are there other ways to define category utility for a partition? For example, using information theory?
Recall that mutual information I(X, Y) measures the reduction of uncertainty in X when knowing Y: I(X, Y) = H(X) - H(X | Y), where H(X) = -sum over x of p(x) log p(x) and H(X | Y) = E_Y[ -sum over x of p(x | y) log p(x | y) ].
Now let X range over the attribute events A_i = V_ij and let Y be the cluster variable C. Then I(A_i, C) = H(A_i) - H(A_i | C), and a partition could be scored by averaging I(A_i, C) over the attributes A_i.

56 Finite mixtures
Probabilistic clustering algorithms model the data using a mixture of distributions.
Each cluster is represented by one distribution; the distribution governs the probabilities of attribute values in the corresponding cluster.
They are called finite mixtures because only a finite number of clusters is represented.
Usually the individual distributions are normal distributions, combined using cluster weights.

57 A two-class mixture model
Data (class label and attribute value per instance): A 51, A 43, B 62, B 64, A 45, A 42, A 46, A 45, A 45, B 62, A 47, A 52, B 64, A 51, B 65, A 48, A 49, A 46, B 64, A 51, A 52, B 62, A 49, A 48, B 62, A 43, A 40, A 48, B 64, A 51, B 63, A 43, B 65, B 66, B 65, A 46, A 39, B 62, B 64, A 52, B 63, B 64, A 48, B 64, A 48, A 51, A 48, B 64, A 42, A 48, A 41
Model: μ_A = 50, σ_A = 5, p_A = 0.6; μ_B = 65, σ_B = 2, p_B = 0.4

58 Using the mixture model
The probability of an instance x belonging to cluster A is
P(A | x) = P(x | A) P(A) / P(x) = f(x; μ_A, σ_A) p_A / P(x),
with the normal density f(x; μ, σ) = exp(-(x - μ)^2 / (2σ^2)) / (sqrt(2π) σ).
The likelihood of an instance given the clusters is
P(x | the distributions) = sum over k of P(x | cluster_k) P(cluster_k).
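
Numerically, with the parameters of the two-class model from the previous slide:

    import math

    def normal_pdf(x, mu, sigma):
        # f(x; mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    mu_a, sigma_a, p_a = 50.0, 5.0, 0.6
    mu_b, sigma_b, p_b = 65.0, 2.0, 0.4

    x = 60.0
    likelihood = p_a * normal_pdf(x, mu_a, sigma_a) + p_b * normal_pdf(x, mu_b, sigma_b)
    print(p_a * normal_pdf(x, mu_a, sigma_a) / likelihood)   # P(A | x), about 0.65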

59 Learning the clusters
Assume we know that there are k clusters.
To learn the clusters, we need to determine their parameters, i.e., their means and standard deviations.
We actually have a performance criterion: the likelihood of the training data given the clusters.
Fortunately, there exists an algorithm that finds a local maximum of the likelihood.

60 The EM algorithm
EM algorithm: the expectation-maximization algorithm.
A generalization of k-means to the probabilistic setting.
Similar iterative procedure:
1. Calculate the cluster probability for each instance (expectation step)
2. Estimate the distribution parameters based on the cluster probabilities (maximization step)
Cluster probabilities are stored as instance weights.

61 More on EM
Estimating parameters from weighted instances, where w_i is the probability that instance i belongs to the cluster:
μ_C = (w_1 x_1 + w_2 x_2 + ... + w_n x_n) / (w_1 + w_2 + ... + w_n)
σ_C^2 = (w_1 (x_1 - μ_C)^2 + ... + w_n (x_n - μ_C)^2) / (w_1 + ... + w_n)
The procedure stops when the log-likelihood saturates.
Log-likelihood (increases with each iteration; we wish it to be as large as possible):
sum over instances i of log( sum over clusters k of p_k P(x_i | cluster_k) )
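
Combining slides 60 and 61, a self-contained sketch of EM for a two-component, one-dimensional Gaussian mixture (the initialization strategy and fixed iteration count are assumptions; a real run would monitor the log-likelihood for saturation and restart from several initial guesses):

    import math

    def normal_pdf(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    def em_two_gaussians(data, iters=50):
        mu_a, mu_b = min(data), max(data)             # crude initial guesses
        sigma_a = sigma_b = (max(data) - min(data)) / 4.0
        p_a = 0.5
        for _ in range(iters):
            # E-step: cluster probability of each instance, stored as its weight
            w = [p_a * normal_pdf(x, mu_a, sigma_a) /
                 (p_a * normal_pdf(x, mu_a, sigma_a) +
                  (1 - p_a) * normal_pdf(x, mu_b, sigma_b)) for x in data]
            # M-step: estimate parameters from the weighted instances
            sa = sum(w)
            sb = len(data) - sa
            mu_a = sum(wi * x for wi, x in zip(w, data)) / sa
            mu_b = sum((1 - wi) * x for wi, x in zip(w, data)) / sb
            sigma_a = math.sqrt(sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, data)) / sa)
            sigma_b = math.sqrt(sum((1 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, data)) / sb)
            p_a = sa / len(data)
        # log-likelihood: increases with each iteration
        loglik = sum(math.log(p_a * normal_pdf(x, mu_a, sigma_a) +
                              (1 - p_a) * normal_pdf(x, mu_b, sigma_b)) for x in data)
        return (mu_a, sigma_a), (mu_b, sigma_b), p_a, loglik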