Presentation is loading. Please wait.

Presentation is loading. Please wait.

Author: Zhexue Huang Advisor: Dr. Hsu Graduate: Yu-Wei Su

Similar presentations


Presentation on theme: "Author: Zhexue Huang Advisor: Dr. Hsu Graduate: Yu-Wei Su"— Presentation transcript:

1 Author: Zhexue Huang Advisor: Dr. Hsu Graduate: Yu-Wei Su
Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values Author: Zhexue Huang Advisor: Dr. Hsu Graduate: Yu-Wei Su 2001/11/06 The Lab of Intelligent Database System, IDS

2 The Lab of Intelligent Database System, IDS
Outline Motivation Objective Research Review Notation K-means Algorithm K-mode Algorithm K-prototype Algorithm Experiment Conclusion Personal opinion 2001/11/06 The Lab of Intelligent Database System, IDS

3 The Lab of Intelligent Database System, IDS
Motivation K-means methods are efficient for processing large data sets K-means is limited to numeric data Numeric and categorical data are mixed with million objects in real world 2001/11/06 The Lab of Intelligent Database System, IDS

4 The Lab of Intelligent Database System, IDS
Objective Extending K-means to categorical domains and domains with mixed numeric and categorical values 2001/11/06 The Lab of Intelligent Database System, IDS

5 The Lab of Intelligent Database System, IDS
Research review Partition methods Partitioning algorithm organizes the objects into K partition(K<N) K-means[ MacQueen, 1967] K-medoids[ Kaufman and Rousseeuw, 1990] CLARANS[ Ng and Han, 1994] 2001/11/06 The Lab of Intelligent Database System, IDS

6 The Lab of Intelligent Database System, IDS
Notation [A1,A2,…..Am] means attribute numbers ,each Ai describes a domains of values, denoted by DOM(Ai) X={X1,X2,…..,Xn} be a set of n objects,object Xi is represented as [Xi,1,Xi,2,…..,Xi,m} Xi=Xk if Xi,j =Xk,j for 1<=j<=m [ ], the first p elements are numeric values, the rest are categorical values 2001/11/06 The Lab of Intelligent Database System, IDS

7 The Lab of Intelligent Database System, IDS
K-means Algorithm Problem P minimise ,1<=i<=n Subject to ,1<=i<=n, 1<=l<=k K is clustering numbers, n is objects number W is an nxk partition matrix, Q={Q1,Q2,…Qk} is a set of objects in the same object domain d(.,.) is the Euclidean distance between two objects 2001/11/06 The Lab of Intelligent Database System, IDS

8 K-means Algorithm (cont.)
Problem P can be solved by iteratively solving the following two problems: Problem P1: fix Q= , reduced problem P(W, ) wi,l=1 if d(Xi,Ql) <= d(Xi,Qt), for 1 <= t <= k wi,t=0 for t <> l Problem P2: fix W= , reduced problem P( ,Q) ,1 <= l <= k, and 1<= j <= m 2001/11/06 The Lab of Intelligent Database System, IDS

9 K-means Algorithm (cont.)
Choose an initial and solve P(W, ) to obtain Set t=0 Let = and solve P( ,Q) to obtain if P( , )=P( , ), output , and stop; otherwise, go to 3 Let = and solve P(W, ) to obtain if P( , )=P( , ), output , and stop; otherwise, let t=t+1 and go to 2 2001/11/06 The Lab of Intelligent Database System, IDS

10 The Lab of Intelligent Database System, IDS
K-mode Algorithm Using a simple matching dissimilarity measure for categorical objects Replacing means of clusters by modes Using a frequency-based method to find the modes 2001/11/06 The Lab of Intelligent Database System, IDS

11 K-mode Algorithm( cont.)
Dissimilarity measure where Mode of a set A mode of X ={X1,X2,…..,Xn} is a vector Q=[q1,q2,…,qm] minimise 2001/11/06 The Lab of Intelligent Database System, IDS

12 K-mode Algorithm( cont.)
Find a mode for a set let be the number of objects having the Kth category in attribute the relative frequency of category in X Theorem 1 D(X,Q) is minimised iff for qj <> for all j=1,…,m 2001/11/06 The Lab of Intelligent Database System, IDS

13 K-mode Algorithm( cont.)
Two initial mode selection methods Select the first K distinct records from the data sets as the K modes Select the K modes by frequency-based method 2001/11/06 The Lab of Intelligent Database System, IDS

14 K-mode Algorithm( cont.)
where and To calculate the total cost P against the whole data set each time when a new Q or W is obtained 2001/11/06 The Lab of Intelligent Database System, IDS

15 K-mode Algorithm( cont.)
Select K initial modes, one for each cluster Allocate an object to the cluster whose mode is the nearest to it . Update the mode of the cluster after each allocation according to theorem 1 2001/11/06 The Lab of Intelligent Database System, IDS

16 K-mode Algorithm( cont.)
After all objects have been allocated to clusters, retest the dissimilarity of objects against the current modes if an object is found its nearest mode belongs to another cluster, reallocate the object to that cluster and update the modes of both clusters Repeat 3 until no objects has changed clusters 2001/11/06 The Lab of Intelligent Database System, IDS

17 K-prototypes Algorithm
To integrate the k-means and k-modes algorithms and to cluster the mixed-type objects ,m is the attribute numbers the first p means numeric data, the rest means categorical data 2001/11/06 The Lab of Intelligent Database System, IDS

18 K-prototypes Algorithm( cont.)
The first term is the Euclidean distance measure on the numeric attributes and the second term is the simple matching dissimilarity measure on the categorical attributes The weight is used to avoid favouring either type of attribute 2001/11/06 The Lab of Intelligent Database System, IDS

19 K-prototypes Algorithm( cont.)
Cost function Minimise 2001/11/06 The Lab of Intelligent Database System, IDS

20 K-prototypes Algorithm( cont.)
Choose clusters Modify the mode 2001/11/06 The Lab of Intelligent Database System, IDS

21 K-prototypes Algorithm( cont.)
Modify the mode 2001/11/06 The Lab of Intelligent Database System, IDS

22 The Lab of Intelligent Database System, IDS
Experiment K-modes the data set was the soybean disease data set, with 4 diseases 47 instances: {D=10,C=10,R=10,p=17}, 21 attributes K-prototype the second data was the credit approval data set, with 2 class 666 instances { approval=299, reject=367}, 6 numeric and 9 categorical attributes 2001/11/06 The Lab of Intelligent Database System, IDS

23 The Lab of Intelligent Database System, IDS
Experiment( cont.) 2001/11/06 The Lab of Intelligent Database System, IDS

24 The Lab of Intelligent Database System, IDS
Experiment( cont.) 2001/11/06 The Lab of Intelligent Database System, IDS

25 The Lab of Intelligent Database System, IDS
Experiment( cont.) 2001/11/06 The Lab of Intelligent Database System, IDS

26 The Lab of Intelligent Database System, IDS
Conclusion The k-modes algorithm is faster than the k-means and k-prototypes algorithm because it needs less iterations to converge How many clusters are in the data? The weight adds an additional problem 2001/11/06 The Lab of Intelligent Database System, IDS

27 The Lab of Intelligent Database System, IDS
Personal opinion Conceptual inclusion relationships Outlier problem Massive data sets cause efficient problem 2001/11/06 The Lab of Intelligent Database System, IDS


Download ppt "Author: Zhexue Huang Advisor: Dr. Hsu Graduate: Yu-Wei Su"

Similar presentations


Ads by Google