
Slide 1: K-medoid-style Clustering Algorithms for Supervised Summary Generation. Nidal Zeidat & Christoph F. Eick, Dept. of Computer Science, University of Houston. MLMTA, Las Vegas, 2004.

Slide 2: Talk Outline
1. What is Supervised Clustering?
2. Representative-based Clustering Algorithms
3. Benefits of Supervised Clustering
4. Algorithms for Supervised Clustering
5. Empirical Results
6. Conclusion and Areas of Future Work

Slide 3: 1. (Traditional) Clustering
Partition a set of objects into groups of similar objects; each group is called a cluster.
Clustering is used to “detect classes” in a data set (“unsupervised learning”).
Clustering is based on a fitness function that relies on a distance measure and usually tries to minimize the distance between objects within a cluster.

Slide 4: (Traditional) Clustering… (continued). [Figure: an example data set over Attribute1/Attribute2 partitioned into clusters A, B, and C.]

Slide 5: Supervised Clustering
Supervised clustering assumes that clustering is applied to classified examples. The goal is to identify class-uniform clusters that have a high probability density; it prefers clusters whose members belong to a single class (low impurity). We would also like to keep the number of clusters low.

Slide 6: Supervised Clustering… (continued). [Figure: the same data set over Attribute1/Attribute2, partitioned by traditional clustering (left) and by supervised clustering (right).]

Slide 7: A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β*Penalty(k)
where:
k: number of clusters used
n: number of examples in the dataset
c: number of classes in the dataset
β: weight for Penalty(k), 0 < β ≤ 2.0
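The slide names the two components of q(X) but does not define them. A minimal sketch, under the assumptions that Impurity(X) is the fraction of examples not belonging to their cluster's majority class and that Penalty(k) grows with the number of clusters beyond the class count (both are plausible readings, not the paper's verbatim definitions):

```python
import math
from collections import Counter

def impurity(clusters):
    # Fraction of examples that do not belong to their cluster's majority
    # class (assumed definition; the slide only names the term).
    n = sum(len(cl) for cl in clusters)
    minority = sum(len(cl) - Counter(cl).most_common(1)[0][1] for cl in clusters)
    return minority / n

def penalty(k, c, n):
    # Penalizes using many more clusters k than there are classes c
    # (assumed form; the slide defines only the symbols k, c, n).
    return math.sqrt((k - c) / n) if k > c else 0.0

def q(clusters, c, beta=0.1):
    # q(X) := Impurity(X) + beta * Penalty(k), with 0 < beta <= 2.0
    n = sum(len(cl) for cl in clusters)
    return impurity(clusters) + beta * penalty(len(clusters), c, n)

# three clusters over two classes "a"/"b"
clusters = [["a", "a", "b"], ["b", "b"], ["a"]]
print(round(q(clusters, c=2, beta=0.1), 4))
```

A larger β trades purity for fewer clusters, which is exactly the knob the later experiments (β = 0.1 vs. β = 0.4) turn.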

Slide 8: 2. Representative-Based Supervised Clustering (RSC)
RSC aims at finding a set of objects (called representatives) among all objects in the data set that best represent the objects in the data set. Each representative corresponds to a cluster. The remaining objects are then clustered around these representatives by assigning each object to the cluster of the closest representative.
Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
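The assignment step described above can be sketched directly (the distance function `dist` is a placeholder for whatever dissimilarity measure the dataset uses):

```python
def cluster_around(points, rep_ids, dist):
    # Assign each point to the cluster of its closest representative:
    # the representative-based clustering step described on the slide.
    clusters = {r: [] for r in rep_ids}
    for i, p in enumerate(points):
        nearest = min(rep_ids, key=lambda r: dist(p, points[r]))
        clusters[nearest].append(i)
    return clusters

# usage with 1-D points and absolute difference as the distance
points = [0.0, 0.2, 5.0, 5.1, 9.0]
print(cluster_around(points, [0, 3], lambda a, b: abs(a - b)))
```

Because representatives are actual data objects, the same code works unchanged for nominal data given a suitable `dist`.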

Slide 9: Representative-Based Supervised Clustering… (continued). [Figure: an example data set over Attribute1/Attribute2.]

Slide 10: Representative-Based Supervised Clustering… (continued). [Figure: the same data set with four representatives (1-4) selected.]

Slide 11: Representative-Based Supervised Clustering… (continued)
Objective of RSC: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X). [Figure: the data set clustered around representatives 1-4.]

Slide 12: Why do we use Representative-Based Clustering Algorithms?
Representatives themselves are useful:
- They can be used for summarization.
- They can be used for dataset compression.
Smaller search space compared with algorithms such as k-means.
Less sensitive to outliers.
Can be applied to datasets that contain nominal attributes (for which it is not feasible to compute means).
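The last point is worth making concrete: a medoid, unlike a mean, is defined for nominal attributes because it only needs a dissimilarity function. A small sketch (the mismatch-count distance and the example records are illustrative, not from the paper):

```python
def medoid(items, dist):
    # The medoid is the item with minimal total dissimilarity to all
    # other items; no averaging of attribute values is required.
    return min(items, key=lambda x: sum(dist(x, y) for y in items))

# nominal records compared by simple attribute-mismatch count
records = [("red", "suv"), ("red", "van"), ("blue", "van")]
mismatch = lambda a, b: sum(x != y for x, y in zip(a, b))
print(medoid(records, mismatch))
```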

Slide 13: 3. Applications of Supervised Clustering
Enhance classification algorithms:
- Use supervised clustering for dataset editing to enhance NN-classifiers [ICDM04].
- Improve simple classifiers [ICDM03].
Learn sub-classes / summary generation.
Distance function learning.
Dataset compression/reduction.
Measuring the difficulty of a classification task.

Slide 14: Representative-Based Supervised Clustering → Dataset Editing. [Figure: (a) the dataset clustered using supervised clustering, with clusters A-F over Attribute1/Attribute2; (b) the dataset edited down to the cluster representatives.]

Slide 15: Representative-Based Supervised Clustering → Enhance Simple Classifiers. [Figure: an example over Attribute1/Attribute2.]

Slide 16: Representative-Based Supervised Clustering → Learning Sub-classes. [Figure: Ford/GMC examples over Attribute1/Attribute2, with the sub-classes Ford Trucks, Ford Vans, GMC Trucks, and GMC Vans emerging as separate clusters.]

Slide 17: 4. Clustering Algorithms Currently Investigated
1. Partitioning Around Medoids (PAM) (traditional).
2. Supervised Partitioning Around Medoids (SPAM).
3. Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR).
4. Top Down Splitting Algorithm (TDS).
5. Supervised Clustering using Evolutionary Computing (SCEC).

Slide 18: Algorithm SRIDHCR
REPEAT r TIMES:
  curr := a randomly created set of representatives (with size between c+1 and 2*c)
  WHILE NOT DONE DO:
    1. Create new solutions S by adding a single non-representative to curr and by removing a single representative from curr.
    2. Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one).
    3. IF q(s) < q(curr) THEN curr := s
       ELSE IF q(s) = q(curr) AND |s| > |curr| THEN curr := s
       ELSE terminate and return curr as the solution for this run.
Report the best of the r solutions found.
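The pseudocode above can be sketched in Python as follows. This is a sketch, not the authors' implementation: `q` is any fitness function over a set of representative indices (lower is better), and ties on q are broken in favor of larger representative sets, per step 3, rather than randomly:

```python
import random

def sridhcr(n, q, c, r=10, seed=0):
    # Steepest-descent hill climbing over representative sets for a
    # dataset of n objects, with r randomized restarts; c is the number
    # of classes.
    rng = random.Random(seed)
    best = None
    for _ in range(r):
        # random initial solution with between c+1 and 2*c representatives
        curr = set(rng.sample(range(n), rng.randint(c + 1, 2 * c)))
        while True:
            # neighbors: add one non-representative, or remove one representative
            neighbors = [curr | {o} for o in range(n) if o not in curr]
            neighbors += [curr - {m} for m in curr if len(curr) > 1]
            s = min(neighbors, key=lambda x: (q(x), -len(x)))
            if q(s) < q(curr) or (q(s) == q(curr) and len(s) > len(curr)):
                curr = s
            else:
                break  # local optimum for this run
        if best is None or q(curr) < q(best):
            best = curr
    return best
```

Note how each step evaluates both insertions and deletions, so the number of representatives k floats during the search; this is the key contrast with SPAM below.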

Slide 19: Example run of SRIDHCR (initial solution: medoids 8 42 62 148)

Trials in first part (add a non-medoid):
Set of medoids after adding one non-medoid | q(X)
8 42 62 148 (initial solution)             | 0.086
8 42 62 148 1                              | 0.091
8 42 62 148 2                              | 0.091
…                                          | …
8 42 62 148 52                             | 0.065
…                                          | …
8 42 62 148 150                            | 0.0715

Trials in second part (drop a medoid):
Set of medoids after removing a medoid | q(X)
42 62 148                              | 0.086
8 62 148                               | 0.073
8 42 148                               | 0.313
8 42 62                                | 0.333

Run | Set of medoids producing lowest q(X) in the run | q(X)  | Purity
0   | 8 42 62 148 (initial solution)                  | 0.086 | 0.947
1   | 8 42 62 148 52                                  | 0.065 | 0.947
2   | 8 42 62 148 52 122                              | 0.041 | 0.973
3   | 42 62 148 52 122 117                            | 0.030 | 0.987
4   | 8 62 148 52 122 117                             | 0.021 | 0.993
5   | 8 62 148 52 122 117 87                          | 0.016 | 1.000
6   | 8 62 52 122 117 87                              | 0.014 | 1.000
7   | 8 62 122 117 87                                 | 0.012 | 1.000

Slide 20: Algorithm SPAM
Build initial solution curr (given number of clusters k):
1. Determine the medoid of the most frequent class in the dataset. Insert that object m into curr.
2. For k-1 times, add to curr an object v from the dataset (not already in curr) that gives the lowest value of q(X) for curr ∪ {v}.
Improve initial solution curr:
DO FOREVER:
  FOR ALL representative objects r in curr DO
    FOR ALL non-representative objects o in the dataset DO
      1. Create a new solution v by clustering the dataset around the representative set (curr \ {r}) ∪ {o} and insert v into S.
      2. Calculate q(v) for this clustering.
  Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one).
  IF q(s) < q(curr) THEN curr := s
  ELSE TERMINATE, returning curr as the final solution.
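SPAM's improvement phase is a fixed-k swap search. A minimal sketch of just that phase (again `q` is any fitness over a representative set; ties are broken by list order here rather than randomly, and the toy fitness in the usage line is purely illustrative):

```python
def spam_improve(n, curr, q):
    # Repeatedly try replacing one representative r with one
    # non-representative o (k stays fixed); accept the best swap while
    # it lowers q, otherwise stop at the current local optimum.
    while True:
        candidates = [(curr - {r}) | {o}
                      for r in curr for o in range(n) if o not in curr]
        s = min(candidates, key=q)
        if q(s) < q(curr):
            curr = s
        else:
            return curr

# toy fitness: prefer representative sets whose indices sum to 7
print(spam_improve(5, {0, 1}, lambda reps: abs(sum(reps) - 7)))
```

Because only swaps are considered, the representative-set size never changes, which is exactly the limitation slide 26 blames for SPAM getting stuck.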

Slide 21: Differences between SPAM and SRIDHCR
1. SPAM tries to improve the current solution by replacing a representative with a non-representative, whereas SRIDHCR improves the current solution by removing a representative or by inserting a non-representative.
2. SPAM is run with the number of clusters k kept fixed, whereas SRIDHCR searches for a “good” value of k and therefore explores a larger solution space. However, in the case of SRIDHCR, which choices of k are good is somewhat restricted by the selection of the parameter β.
3. SRIDHCR is run r times starting from random initial solutions; SPAM is only run once.

Slide 22: 5. Performance Measures for the Experimental Evaluation
The investigated algorithms were evaluated based on the following performance measures:
- Cluster purity (majority %).
- Value of the fitness function q(X).
- Average dissimilarity between all objects and their representatives (cluster tightness).
- Wall-clock time (WCT): the actual time, in seconds, that the algorithm took to finish the clustering task.

Slide 23: Table 4: Traditional vs. Supervised Clustering (β=0.1)

Algorithm | Purity | q(X)   | Tightness(X)
Iris-Plants data set, # clusters = 3
PAM       | 0.907  | 0.0933 | 0.081
SRIDHCR   | 0.981  | 0.0200 | 0.093
SPAM      | 0.973  | 0.0267 | 0.133
Vehicle data set, # clusters = 65
PAM       | 0.701  | 0.326  | 0.044
SRIDHCR   | 0.835  | 0.192  | 0.072
SPAM      | 0.764  | 0.263  | 0.097
Image-Segment data set, # clusters = 53
PAM       | 0.880  | 0.135  | 0.027
SRIDHCR   | 0.980  | 0.035  | 0.050
SPAM      | 0.944  | 0.071  | 0.061
Pima-Indian Diabetes data set, # clusters = 45
PAM       | 0.763  | 0.237  | 0.056
SRIDHCR   | 0.859  | 0.164  | 0.093
SPAM      | 0.822  | 0.202  | 0.086

(The slide highlights purity improvements of 19% and 7% over PAM.)

Slide 24: Table 5: Comparative Performance of the Different Algorithms, β=0.1

Algorithm | q(X)   | Purity | Tightness(X) | WCT (sec)
IRIS-Flowers dataset, # clusters = 3
PAM       | 0.0933 | 0.907  | 0.081        | 0.06
SRIDHCR   | 0.0200 | 0.980  | 0.093        | 11.00
SPAM      | 0.0267 | 0.973  | 0.133        | 0.32
Vehicle dataset, # clusters = 65
PAM       | 0.326  | 0.701  | 0.044        | 372.00
SRIDHCR   | 0.192  | 0.835  | 0.072        | 1715.00
SPAM      | 0.263  | 0.764  | 0.097        | 1090.00
Segmentation dataset, # clusters = 53
PAM       | 0.135  | 0.880  | 0.027        | 4073.00
SRIDHCR   | 0.035  | 0.980  | 0.050        | 11250.00
SPAM      | 0.071  | 0.944  | 0.061        | 1422.00
Pima-Indians-Diabetes, # clusters = 45
PAM       | 0.237  | 0.763  | 0.056        | 186.00
SRIDHCR   | 0.164  | 0.859  | 0.093        | 660.00
SPAM      | 0.202  | 0.822  | 0.086        | 58.00

Slide 25: Table 6: Average Comparative Performance of the Different Algorithms, β=0.4

Algorithm | Avg. Purity | Tightness(X) | Avg. WCT (sec)
IRIS-Flowers dataset, # clusters = 3
PAM       | 0.907       | 0.081        | 0.06
SRIDHCR   | 0.959       | 0.104        | 0.18
SPAM      | 0.973       | 0.133        | 0.33
Vehicle dataset, # clusters = 56
PAM       | 0.681       | 0.046        | 505.00
SRIDHCR   | 0.762       | 0.081        | 22.58
SPAM      | 0.754       | 0.100        | 681.00
Segmentation dataset, # clusters = 32
PAM       | 0.875       | 0.032        | 1529.00
SRIDHCR   | 0.946       | 0.054        | 169.39
SPAM      | 0.940       | 0.065        | 1053.00
Pima-Indians-Diabetes, # clusters = 2
PAM       | 0.656       | 0.104        | 0.97
SRIDHCR   | 0.795       | 0.109        | 5.08
SPAM      | 0.772       | 0.125        | 2.70

Slide 26: Why is SRIDHCR performing so much better than SPAM?
- SPAM is relatively slow compared with a single run of SRIDHCR, allowing for 5-30 restarts of SRIDHCR using the same resources. This enables SRIDHCR to conduct a more balanced exploration of the search space.
- The fitness landscape induced by q(X) contains many plateau-like structures (q(X1) = q(X2)) and many local minima, and SPAM seems to get stuck more easily.
- The fact that SPAM uses a fixed value of k does not seem beneficial for finding good solutions. For example, SRIDHCR might explore {u1,u2,u3,u4} → … → {u1,u2,u3,u4,v1,v2} → … → {u3,u4,v1,v2}, whereas SPAM might terminate with the sub-optimal solution {u1,u2,u3,u4} if neither the replacement of u1 by v1 nor the replacement of u2 by v2 improves q(X).

Slide 27: Table 7: Ties distribution

Dataset      | k  | β       | Ties % using q(X) | Ties % using Tightness(X)
Iris-Plants  | 10 | 0.00001 | 5.8               | 0.0004
Iris-Plants  | 10 | 0.4     | 5.7               | 0.0004
Iris-Plants  | 50 | 0.00001 | 20.5              | 0.0019
Iris-Plants  | 50 | 0.4     | 20.9              | 0.0018
Vehicle      | 10 | 0.00001 | 1.04              | 0.000001
Vehicle      | 10 | 0.4     | 1.06              | 0.000001
Vehicle      | 50 | 0.00001 | 1.78              | 0.000001
Vehicle      | 50 | 0.4     | 1.84              | 0.000001
Segmentation | 10 | 0.00001 | 0.220             | 0.000000
Segmentation | 10 | 0.4     | 0.225             | 0.000001
Segmentation | 50 | 0.00001 | 0.626             | 0.000001
Segmentation | 50 | 0.4     | 0.638             | 0.000000
Diabetes     | 10 | 0.00001 | 2.06              | 0.0
Diabetes     | 10 | 0.4     | 2.05              | 0.0
Diabetes     | 50 | 0.00001 | 3.43              | 0.0002
Diabetes     | 50 | 0.4     | 3.45              | 0.0002

Slide 28: Figure 2: How Purity and k Change as β Increases.

Slide 29: 6. Conclusions
1. As expected, supervised clustering algorithms produced significantly better cluster purity than traditional clustering. Improvements range between 7% and 19% for the different data sets.
2. Algorithms that explore the search space too greedily, such as SPAM, do not seem to be very suitable for supervised clustering. In general, algorithms that explore the search space more randomly seem to be more suitable.
3. Supervised clustering can be used to enhance classifiers, summarize datasets, and learn better distance functions.

Slide 30: Future Work
1. Continue work on supervised clustering algorithms: find better solutions, make them faster, and explain some of the observations.
2. Use supervised clustering for summary generation / learning subclasses.
3. Use supervised clustering to find “compressed” nearest-neighbor classifiers.
4. Use supervised clustering to enhance simple classifiers.
5. Distance function learning.

Slide 31: K-Means Algorithm. [Figure: a data set over Attribute1/Attribute2 with four cluster centroids (1-4).]

Slide 32: K-Means Algorithm. [Figure: the same K-Means example over Attribute1/Attribute2.]

