CACTUS-Clustering Categorical Data Using Summaries


1 CACTUS-Clustering Categorical Data Using Summaries
Advisor: Dr. Hsu; Graduate: Min-Hung Lin; IDSL seminar, 2001/10/30

2 Outline: Motivation, Objective, Related Work, Definitions, CACTUS, Performance Evaluation, Conclusions, Comments

3 Motivation Clustering with categorical attributes has received attention. Previous algorithms do not give a formal description of the clusters they discover, and some of them need to post-process the output of the algorithm to identify the final clusters.

4 Objective Introduce a novel formalization of a cluster for categorical attributes. Describe a fast summarization-based algorithm CACTUS that discovers clusters. Evaluate the performance of CACTUS on synthetic and real datasets.

5 Related Work EM algorithm [Dempster et al., 1977]: iterative clustering technique.
STIRR algorithm [Gibson et al., 1998]: iterative algorithm based on non-linear dynamical systems.
ROCK algorithm [Guha et al., 1999]: hierarchical clustering algorithm.

6 DEF: Support
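(The slide body was a figure in the original deck; the following is a reconstruction of the definition from the CACTUS paper, with notation assumed. For values a_i ∈ D_i and a_j ∈ D_j of two different attributes A_i and A_j, the support of the pair is the number of tuples in the dataset D containing both values.)

```latex
\sigma_D(a_i, a_j) \;=\; \bigl|\{\, t \in D \;:\; t.A_i = a_i \ \wedge\ t.A_j = a_j \,\}\bigr|
```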

7 DEF: Strongly Connected

8 DEF: Strongly Connected (cont'd)
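(Reconstruction from the paper, with α assumed to be a user-specified threshold, α > 1: values a_i and a_j of different attributes are strongly connected if their support exceeds α times the support expected if A_i and A_j were independent and their values uniformly distributed. A value a_i is strongly connected to a set C_j ⊆ D_j if it is strongly connected with every value in C_j.)

```latex
a_i \text{ and } a_j \text{ are strongly connected} \iff \sigma_D(a_i, a_j) \;>\; \alpha \cdot \frac{|D|}{|D_i|\,|D_j|}
```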

9 Formal Definition of a Cluster

10 Formal Definition of a Cluster (cont’d)
C_i is the cluster-projection of C on A_i. C is called a sub-cluster if it satisfies conditions (1) and (3). A cluster C over a subset S of all the attributes is called a subspace cluster on S; if |S| = k, then C is called a k-cluster.
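(The conditions (1)-(3) referenced above appeared as a figure; the following is a reconstruction from the paper and should be checked against the original. C = ⟨C_1, ..., C_n⟩ with C_i ⊆ D_i is a cluster over {A_1, ..., A_n} if:)

```latex
\begin{array}{ll}
(1) & \text{for all } i \neq j,\ C_i \text{ and } C_j \text{ are strongly connected, i.e., every } a_i \in C_i \\
    & \text{is strongly connected with every } a_j \in C_j; \\
(2) & \text{for all } i \neq j,\ \text{there is no } C_i' \supset C_i \text{ such that } C_i' \text{ and } C_j \text{ are strongly connected (maximality)}; \\
(3) & \sigma_D(C) \;\geq\; \alpha \cdot |D| \cdot \prod_{i=1}^{n} \frac{|C_i|}{|D_i|} \quad \text{(support at least } \alpha \text{ times the expected support).}
\end{array}
```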

11 DEF: Similarity
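(Reconstruction from the paper: for two values a_1, a_2 of the same attribute A_i, their similarity with respect to another attribute A_j counts the values of A_j with which both are strongly connected. Here σ*(a, x) denotes σ_D(a, x) when the pair is strongly connected and 0 otherwise.)

```latex
\gamma^{j}(a_1, a_2) \;=\; \bigl|\{\, x \in D_j \;:\; \sigma^{*}(a_1, x) > 0 \ \wedge\ \sigma^{*}(a_2, x) > 0 \,\}\bigr|
```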

12 Inter-attribute Summaries

13 Intra-attribute Summaries
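(The two summary slides were figures in the original deck. Below is a minimal, illustrative Python sketch under the definitions above; the function names and data layout are hypothetical, not the paper's data structures. The inter-attribute summary keeps only the strongly connected value pairs together with their supports; the intra-attribute similarities are then computed from that summary alone, without rescanning the data.)

```python
from collections import Counter
from itertools import combinations

def inter_attribute_summary(tuples, i, j, domain_sizes, alpha=3.0):
    """Sketch of Sigma_ij: strongly connected value pairs of (A_i, A_j) and their supports.

    tuples       -- list of tuples of categorical values, one position per attribute
    domain_sizes -- dict mapping attribute index -> domain size |D_i|
    alpha        -- strong-connectivity threshold (alpha > 1)
    """
    support = Counter((t[i], t[j]) for t in tuples)
    expected = len(tuples) / (domain_sizes[i] * domain_sizes[j])
    # keep only pairs whose support exceeds alpha times the expected support
    return {pair: s for pair, s in support.items() if s > alpha * expected}

def intra_attribute_summary(sigma_ij):
    """Sketch of the intra-attribute summary with respect to A_j: the similarity of two
    A_i values is the number of A_j values with which both are strongly connected."""
    connected_to = {}          # A_i value -> set of A_j values it is strongly connected with
    for (ai, aj) in sigma_ij:
        connected_to.setdefault(ai, set()).add(aj)
    similarity = {}
    for a1, a2 in combinations(sorted(connected_to), 2):
        gamma = len(connected_to[a1] & connected_to[a2])
        if gamma > 0:
            similarity[(a1, a2)] = gamma
    return similarity
```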

14 Experiments

15 Results STIRR fails to discover (a) clusters consisting of overlapping cluster-projections on any attribute and (b) clusters where two or more clusters share the same cluster-projection. CACTUS correctly discovers all clusters.

16 CACTUS Three-phase clustering algorithm:
Summarization Phase: compute the summary information.
Clustering Phase: discover a set of candidate clusters.
Validation Phase: determine the actual set of clusters.

17 Summarization Phase Inter-attribute Summaries
Intra-attribute Summaries

18 Clustering Phase Computing cluster-projections on attributes
Level-wise synthesis of clusters

19 Computing Cluster-Projections on Attributes
Step 1: pairwise cluster-projections. Step 2: intersection.
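(The paper's actual procedure relies on additional assumptions to stay tractable; the following is only a rough, hypothetical illustration. Step 1 can be viewed as enumerating maximal sets of A_i values that are pairwise similar with respect to one other attribute A_j, which is a maximal-clique problem on the similarity graph, and Step 2 intersects the candidates obtained for different choices of A_j.)

```python
import networkx as nx

def pairwise_cluster_projections(intra_summary_ij):
    """Rough illustration of Step 1 (not the paper's optimized procedure):
    build a graph on the values of A_i whose edges are the similar pairs from the
    intra-attribute summary w.r.t. A_j, and enumerate its maximal cliques."""
    g = nx.Graph()
    g.add_edges_from(intra_summary_ij)     # keys are (a1, a2) pairs with positive similarity
    return [set(clique) for clique in nx.find_cliques(g) if len(clique) > 1]

def intersect_projections(candidates_per_attribute):
    """Rough illustration of Step 2: intersect the candidate cluster-projections
    obtained with respect to the different other attributes, dropping empty results."""
    result = candidates_per_attribute[0]
    for candidates in candidates_per_attribute[1:]:
        result = [p & q for p in result for q in candidates if p & q]
    return result
```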

20 Computing Cluster-Projections on Attributes (cont’d)

21 Level-wise synthesis of clusters

22 Level-wise synthesis of clusters (cont’d)
Generation procedure

23 Level-wise synthesis of clusters (cont’d)
Candidate cluster

24 Validation Some of the candidate clusters may not have enough support, because some of the 2-clusters may be due to different sets of tuples. Check whether the support of each candidate cluster is greater than the threshold: α times the expected support of the cluster. Only clusters whose support on D passes the threshold are retained.

25 Validation Procedure Set the supports of all candidate clusters to zero. For each tuple t, increment the support of the candidate cluster to which t belongs. At the end of the scan, delete all candidate clusters whose support is less than the threshold.
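(A minimal Python sketch of this scan, assuming a hypothetical representation of a candidate cluster as a mapping from attribute index to its set of values; the threshold is α times the expected support, as on the previous slide.)

```python
def validate(candidates, tuples, domain_sizes, alpha=3.0):
    """Validation scan sketch: one pass over the data, then drop weak candidates.

    candidates   -- list of dicts {attribute index: set of values} (hypothetical layout)
    tuples       -- the dataset D as a list of tuples
    domain_sizes -- dict mapping attribute index -> domain size |D_i|
    """
    support = [0] * len(candidates)
    for t in tuples:                                   # single scan of the dataset
        for idx, cluster in enumerate(candidates):
            if all(t[i] in values for i, values in cluster.items()):
                support[idx] += 1
    kept = []
    for idx, cluster in enumerate(candidates):
        expected = len(tuples)                         # expected support under attribute independence
        for i, values in cluster.items():
            expected *= len(values) / domain_sizes[i]
        if support[idx] >= alpha * expected:           # keep only candidates passing the threshold
            kept.append(cluster)
    return kept
```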

26 Extensions Large Attribute Value Domains Clusters in Subspaces

27 Performance Evaluation
Evaluation of CACTUS on synthetic and real datasets. Comparison of the performance of CACTUS with that of STIRR.

28 Synthetic Datasets The test datasets were generated using the data generator developed by Gibson et al. (1 million tuples, 10 attributes, 100 attribute values for each attribute).
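(The generator of Gibson et al. is not reproduced here. Purely as a toy stand-in with the stated dimensions, to make the setup concrete, one might plant a few clusters over small value ranges and fill the remaining tuples with uniform noise; every name and parameter below is hypothetical.)

```python
import random

def toy_categorical_data(n_tuples=1_000_000, n_attrs=10, domain_size=100,
                         cluster_fraction=0.5, seed=0):
    """Toy generator (NOT Gibson et al.'s): each tuple is drawn either from one of two
    planted clusters (disjoint small value ranges on every attribute) or uniformly."""
    rng = random.Random(seed)
    planted = [range(0, 5), range(5, 10)]   # value ranges of the two planted clusters
    data = []
    for _ in range(n_tuples):
        if rng.random() < cluster_fraction:
            values = planted[rng.randrange(len(planted))]
            data.append(tuple(rng.choice(values) for _ in range(n_attrs)))
        else:
            data.append(tuple(rng.randrange(domain_size) for _ in range(n_attrs)))
    return data
```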

29 Real Datasets Two sets of bibliographic entries: 7766 entries are database-related and 30919 entries are theory-related.
Four attributes: the first author, the second author, the conference, and the year.
Attribute domain sizes are {3418, 3529, 1631, 44} for the database-related set, {8043, 8190, 690, 42} for the theory-related set, and {10212, 10527, 2315, 52} for the mixture of the two.

30 Real Datasets (cont’d)
[Figures: results on the database-related, theory-related, and mixture datasets]

31 Results CACTUS is very fast and scalable (only two scans of the dataset). CACTUS outperforms STIRR by a factor between 3 and 10.

32 Conclusions Formalized the definition of a cluster for categorical attributes. Introduced a fast summarization-based algorithm, CACTUS, for discovering such clusters in categorical data. Evaluated the algorithm against both synthetic and real datasets.

33 Future Work Relax the cluster definition by allowing sets of attribute values that are "almost" strongly connected to each other. Inter-attribute summaries can be incrementally maintained, which suggests deriving an incremental clustering algorithm. Rank the clusters based on a measure of interestingness.

34 Comments Pairwise cluster-projection is an NP-complete problem.
The potentially large number of candidate clusters is still a problem.

