
1 An Architecture and Algorithms for Multi-Run Clustering
Rachsuda Jiamthapthaksin, Christoph F. Eick and Vadeerat Rinsurongkawong
Computer Science Department, University of Houston, TX

2 Outline
1. Motivation
2. Goals
3. Overview
4. Related work
5. An architecture and algorithms for multi-run clustering
6. Experimental results
7. Conclusion and future work

3 1. Motivation
Region discovery framework: a family of clustering algorithms and a family of plug-in fitness functions.
Domain experts manually select the parameters of the clustering algorithms; multi-run clustering instead relies on active learning to select those parameters automatically.
Cougar^2: Open Source Data Mining and Machine Learning Framework, https://cougarsquared.dev.java.net

4 2. Goals
Given a spatial dataset O = {o_1, …, o_n}, a clustering algorithm seeks a clustering X that maximizes a fitness function q(X), where
X = {x_1, x_2, …, x_k}, x_i ∩ x_j = ∅ (i ≠ j), and x_1 ∪ … ∪ x_k ⊆ O.
The goal is to automatically find a set of distinct, high-quality clusters that originate from different runs.

5 3. Overview of multi-run clustering – 1
Key hypothesis: better clustering results can be obtained by combining clusters that originate from multiple runs of a clustering algorithm.

6 3. Overview of multi-run clustering – 2
Challenges:
1. Selecting appropriate parameters for an arbitrary clustering algorithm.
2. Determining which clusters to store as candidate clusters.
3. Generating a final clustering from the candidate clusters.
4. Finding alternative clusters, e.g. hotspots in spatial datasets at different granularities.

7 4. Related work
Meta clustering [Caruana et al. 2006]: first creates diverse clusterings, then clusters them into groups, and finally lets users choose the group of clusterings that best fits their needs.
Ensemble clustering [Gionis et al. 2005; Zeng et al. 2002]: aggregates different clusterings into one consolidated clustering.

8 Definition of a state
A state s in a state space S (S ⊆ R^2bm):
s = {s_1_min, s_1_max, …, s_m_min, s_m_max}, s_i ∈ R^2b
A state s for CLEVER:
s = {k'_min, k'_max, p_min, p_max, p'_min, p'_max}
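A state only brackets the parameter ranges; a concrete parameter setting still has to be drawn from it (slide 11 shows {k'=12, p=45, p'=40} drawn from such a state). A minimal sketch; uniform sampling within each range is an assumption, not stated on the slide:

```python
import random

def sample_parameters(state, rng=random.randint):
    """Draw concrete CLEVER parameters from a state of range bounds.

    state: (k_min, k_max, p_min, p_max, pp_min, pp_max).
    Uniform sampling within each range is an illustrative assumption."""
    k_min, k_max, p_min, p_max, pp_min, pp_max = state
    return {"k'": rng(k_min, k_max),     # number of representatives
            "p":  rng(p_min, p_max),     # CLEVER parameter p
            "p'": rng(pp_min, pp_max)}   # CLEVER parameter p'
```

Injecting `rng` keeps the sketch deterministic for testing; in use, the default `random.randint` draws within the inclusive bounds.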

9 5. An architecture of the multi-run clustering system
(Diagram: State Utility Learning, Clustering Algorithm, Storage Unit and Cluster Summarization Unit, linked by steps S1-S6 and exchanging the parameters, the clustering X, the cluster list M and the final clustering M'.)
Steps in multi-run clustering:
S1: Parameter selection.
S2: Run a clustering algorithm.
S3: Compute a state feedback.
S4: Update the state utility table.
S5: Update the cluster list M.
S6: Summarize the discovered clusters into M'.
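The six steps can be sketched as one loop, with each architectural unit supplied as a callable (all names here are illustrative, not the system's actual API):

```python
def multi_run_clustering(select_state, run_clustering, feedback,
                         update_utility, update_list, summarize, rounds):
    """One pass through steps S1-S6 per round; components are injected
    callables standing in for the architectural units of the slide."""
    M = []                                # stored cluster list
    for _ in range(rounds):
        state = select_state()            # S1: parameter selection
        X = run_clustering(state)         # S2: run the clustering algorithm
        fb = feedback(X, M)               # S3: compute a state feedback
        update_utility(state, fb)         # S4: update the state utility table
        M = update_list(M, X)             # S5: update the cluster list M
    return summarize(M)                   # S6: summarize clusters into M'
```

Passing the units in as parameters mirrors the diagram's separation of concerns: the loop itself never inspects clusters or states.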

10 Pre-processing step: compute the statistics needed to set up the multi-run clustering system.
S0: run m rounds of CLEVER, randomly selecting k', p and p'.

11 Step 1. Select the parameters of the clustering algorithm.
π1: Randomly select a state.
π2: Choose the state with the maximum state utility value.
π3: Choose a state in the neighborhood of the state having the maximum state utility value.
Fig. 2. Examples of the policies π: P(π1) = 0.2, P(π2) = 0.6, P(π3) = 0.2.
s_1 = {k'_min=1, k'_max=10, p_min=1, p_max=10, p'_min=11, p'_max=20}
s_2 = {k'_min=11, k'_max=20, p_min=41, p_max=50, p'_min=31, p'_max=40}
Selected state: {k'=12, p=45, p'=40}
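The three policies can be mixed with the slide's probabilities; a minimal sketch, assuming states are discretized into an indexed list and "neighborhood" means adjacent indices (that reading of neighborhood is an assumption):

```python
import random

def select_state(utilities, rng=random.random):
    """Return the index of the next state given a list of state utilities.

    Policy mix follows Fig. 2: P(pi1)=0.2, P(pi2)=0.6, P(pi3)=0.2."""
    best = max(range(len(utilities)), key=utilities.__getitem__)
    r = rng()
    if r < 0.2:                                      # pi1: random state
        return random.randrange(len(utilities))
    if r < 0.8:                                      # pi2: max-utility state
        return best
    nbrs = [i for i in (best - 1, best + 1) if 0 <= i < len(utilities)]
    return random.choice(nbrs)                       # pi3: neighbor of best
```

Injecting `rng` makes the policy choice reproducible in tests while the default remains uniform random.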

12 Step 2. Run CLEVER to generate a clustering with respect to the given parameters.
Parameters: k'=12, p=45, p'=40.
(The slide also shows the fitness function used by CLEVER.)

13 Step 3. Compute a state utility.
A relative clustering quality function (RCQ):
RCQ(X, M) = Novelty(X, M) × ||Speed(X)|| × ||q(X)||
Novelty(X, M) = (1 - similarity(X, M)) × Enhancement(X, M)
where X = {x_1, …, x_k} and y_i is the cluster in the stored cluster list M most similar to x_i ∈ X.
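The two formulas combine directly; in this sketch, similarity, enhancement and the normalized speed and quality terms are passed in as already-computed scalars (how each is computed is defined in the paper and not reproduced here):

```python
def novelty(similarity, enhancement):
    # Novelty(X, M) = (1 - similarity(X, M)) * Enhancement(X, M)
    return (1.0 - similarity) * enhancement

def rcq(similarity, enhancement, speed_norm, quality_norm):
    # RCQ(X, M) = Novelty(X, M) * ||Speed(X)|| * ||q(X)||
    # speed_norm and quality_norm stand for the normalized ||Speed(X)||
    # and ||q(X)|| terms, assumed to lie in [0, 1]
    return novelty(similarity, enhancement) * speed_norm * quality_norm
```

Note that a clustering identical to the stored list (similarity = 1) gets zero utility regardless of its quality, which is exactly the behaviour the novelty term is meant to enforce.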

14 Step 4. Update the state utility.
The utility update writes the feedback back to the state utility table, yielding the updated utilities U'.

15 Step 5. Update the cluster list to maintain a set of distinct, high-quality clusters.
Let M be the current set of multi-run clusters, X a new clustering to be processed for updating M, θ_sim a similarity threshold, and r_th a reward storage threshold. X is processed as follows:
FOR c ∈ X DO
  Let m be the most similar cluster in M to c;
  IF sim(m, c) > θ_sim AND Reward(m) < Reward(c) THEN replace(m, c, M)
  ELSE IF Reward(c) > r_th THEN insert(c, M)
  ELSE discard(c);
Fig. 3. Cluster List Management algorithm (CLM)
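Fig. 3 translates almost line for line into Python; here clusters are opaque objects, and sim and reward are supplied as callables (the numeric toy clusters used in the test are illustrative only):

```python
def clm(M, X, sim, reward, theta_sim, r_th):
    """Cluster List Management (CLM) sketch.

    M: stored cluster list (modified in place), X: new clustering,
    sim(a, b): similarity of two clusters, reward(c): cluster reward,
    theta_sim: similarity threshold, r_th: reward storage threshold."""
    for c in X:
        m = max(M, key=lambda s: sim(s, c), default=None)  # most similar in M
        if m is not None and sim(m, c) > theta_sim and reward(m) < reward(c):
            M[M.index(m)] = c          # replace(m, c, M)
        elif reward(c) > r_th:
            M.append(c)                # insert(c, M)
        # else: discard(c)
    return M
```

The `default=None` on `max` covers the first round, when M is still empty and every sufficiently rewarded cluster is simply inserted.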

16 Step 6. Generate a final clustering.
The Cluster Summarization Unit condenses the stored cluster list M into the final clustering M' using the Dominance-guided Cluster Reduction algorithm (DCR).
(Diagram: dominance graphs showing a dominant cluster A, dominated clusters D, E, F, and edge weights 0.8, 0.7, 0.3.)
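The slides show only DCR's dominance graphs, not its pseudocode. A plausible greedy sketch, assuming the dominant cluster is the highest-reward remaining cluster and that it dominates any cluster overlapping it beyond a threshold (both assumptions drawn from the graph picture, not from stated pseudocode):

```python
def dcr(clusters, reward, overlap, theta_overlap):
    """Greedy sketch of Dominance-guided Cluster Reduction (DCR):
    repeatedly keep the highest-reward remaining cluster and drop every
    cluster it dominates, i.e. whose overlap with it exceeds theta_overlap."""
    remaining = sorted(clusters, key=reward, reverse=True)
    final = []
    while remaining:
        top = remaining.pop(0)               # dominant cluster
        final.append(top)
        remaining = [c for c in remaining
                     if overlap(top, c) <= theta_overlap]  # drop dominated
    return final
```

With clusters as point-id sets, reward as cluster size and overlap as the shared fraction, the sketch keeps non-overlapping clusters and removes near-duplicates, which matches the slide's stated goal of restricting cluster overlap.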

17 6. Experimental evaluation – 1
Evaluation of multi-run clustering on an earthquake dataset* shows how multi-run clustering can discover interesting and alternative clusters in spatial data.
We are interested in areas where deep earthquakes lie in close proximity to shallow earthquakes, and use the High Variance function i(c) [Rinsurongkawong 2008] to find such regions.
*: the earthquake dataset is available on the website of the U.S. Geological Survey Earthquake Hazards Program, http://earthquake.usgs.gov/.

18 6. Experimental evaluation – 2
Fig. 6. Top 5 clusters of X_TheBestRun (ordered by reward).
Fig. 7. Multi-run clustering results: clusters in M'.

19 6. Experimental evaluation – 3
Our system finds new, high-quality clusters: 70% of them do not exist in the best single run.
With an overlap threshold of 0.2, 43% of the positive-reward clusters of the best run are not in M'.

20 6. Experimental evaluation – 4
Fig. 8. The multi-run clustering result (in color) overlaid with the top 5 reward clusters of the best run (in black).

21 7. Conclusion – 1
We propose an architecture and a concrete system for multi-run clustering that copes with the parameter selection of a clustering algorithm and obtains alternative clusters in a highly automated fashion. The system uses active learning to automate parameter selection, and various techniques to find clusters that are both distinct and good on the fly.
We also propose the Dominance-guided Cluster Reduction algorithm, which post-processes the clusters from the multiple runs to generate a final clustering by restricting cluster overlap.

22 7. Conclusion – 2
The experimental results on the earthquake dataset support our claim that multi-run clustering outperforms single-run clustering with respect to clustering quality: it discovers additional novel, alternative, high-quality clusters and enhances the quality of the clusters found by single-run clustering.

23 7. Future work
Systematically evaluate the use of utility learning for choosing the parameters of a clustering algorithm.
The ultimate goal is to combine multi-run and multi-objective clustering in one system.

24 Thank you

