
1 ROCK: A Robust Clustering Algorithm for Categorical Attributes
Authors: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim
Proceedings of the 15th International Conference on Data Engineering, 1999
Presented by 0356160 何禮亦, 0010007 雷宗翰, 0010012 吳宜倫

2 Outline
Introduction: Traditional Clustering Algorithms; Contribution of the ROCK Clustering Algorithm
Proposed Method – ROCK: New Clustering Model – Neighbors and Links; Clustering Algorithm
Experiments: Experimental Setup; Experimental Results
Conclusions & Discussions

3 Introduction
Clustering is a useful technique for grouping data points (e.g., market basket data with Boolean attributes).
Categorical attributes can take values from any arbitrary finite set.

Customer | French | Wine | Cheese | Chocolate
A        | 1      | 0    | 1      | 0

4 Traditional Way – Partition Clustering
Divide the point space into k clusters that optimize the criterion function
E = sum_{i=1}^{k} sum_{x in C_i} d(x, m_i)^2
where m_i is the centroid of cluster C_i and d(x, m_i) is the Euclidean distance between x and m_i.
Use iterative hill-climbing to minimize E.
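As a concrete illustration (not from the paper), here is a minimal sketch of this criterion function: it sums the squared Euclidean distance from each point to its cluster's centroid.

```python
def partition_criterion(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centroid m_i: coordinate-wise mean of the cluster's points.
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        for p in points:
            total += sum((p[d] - centroid[d]) ** 2 for d in range(dim))
    return total

# Two tight clusters give a small criterion value.
print(partition_criterion([[(0, 0), (2, 0)], [(10, 10), (10, 12)]]))  # -> 4.0
```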

5 Traditional Way – Partition Clustering
Good for numeric attributes, but not appropriate for categorical attributes.
Ex: the number of attributes is large, transactions are small, and a single transaction may contain two subsets that belong to different clusters.

6 Traditional Way – Hierarchical Clustering
Each point starts as a separate cluster.
Repeatedly merge the two clusters whose centroids/means are closest, until the desired number of clusters remains.
Problem: very different transactions can end up in the same cluster.

7 Traditional Way – Hierarchical Clustering
Boolean | 1 2 3 4 5 6
A       | 1 1 1 0 1 0
B       | 0 1 1 1 1 0
C       | 1 0 0 1 0 0
D       | 0 0 0 0 0 1
Merging A and B yields the centroid AB = (0.5, 1, 1, 0.5, 1, 0).

8 Merged with Jaccard Coefficient

9 Contribution of ROCK
Use links between data points instead of distances.
A pair of points are neighbors if they are sufficiently similar.
The number of links between a pair of points is based on the number of common neighbors they have.
Example transactions: A: {1,2,3,5}, B: {2,3,4,5}, C: {1,4}, D: {6}.
{1,4} and {6} have no links => they will not be merged.
The link concept incorporates global information => more robust.

10 New Clustering Model – Neighbors and Links
Neighbors: define a similarity function sim(p_i, p_j) with values in [0, 1] and a threshold θ.
Points p_i and p_j are defined to be neighbors if sim(p_i, p_j) >= θ.
sim(p_i, p_j) = 1 if p_i and p_j are identical points; sim(p_i, p_j) = 0 if they are totally dissimilar.

11 New Clustering Model – Neighbors and Links
Define similarity – market basket data.
The similarity between two transactions T_1 and T_2 is
sim(T_1, T_2) = |T_1 ∩ T_2| / |T_1 ∪ T_2|
where |T_i| is the number of items in T_i.
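A small sketch of this Jaccard-style similarity and the neighbor test (function names are illustrative, not from the paper):

```python
def sim(t1, t2):
    """Jaccard similarity between two transactions (sets of items)."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

def are_neighbors(t1, t2, theta):
    """Two transactions are neighbors when their similarity reaches the threshold."""
    return sim(t1, t2) >= theta

a = {1, 2, 3, 5}
b = {2, 3, 4, 5}
print(sim(a, b))                 # 3 common items out of 5 total -> 0.6
print(are_neighbors(a, b, 0.5))  # -> True
```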

12 New Clustering Model – Neighbors and Links
Define similarity – categorical data.
Map every attribute A and each value v in its domain to an item A.v.
The transaction T_i for a record contains A.v if and only if the record's value for attribute A is v.
For example, with shoes = {Blue, Red, Yellow} and T-shirt = {Red, Green}:

Record | shoes  | T-shirt
T1     | Blue   | Green
T2     | Yellow | Red

Record | Shoes.Blue | Shoes.Red | Shoes.Yellow | T-shirt.Red | T-shirt.Green
T1     | 1          | 0          | 0            | 0           | 1
T2     | 0          | 0          | 1            | 1           | 0
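The A.v mapping can be sketched in a few lines (the helper name `to_transaction` is my own, not from the paper):

```python
def to_transaction(record):
    """Map a categorical record {attribute: value} to a set of A.v items."""
    return {f"{attr}.{val}" for attr, val in record.items()}

t1 = to_transaction({"shoes": "Blue", "T-shirt": "Green"})
t2 = to_transaction({"shoes": "Yellow", "T-shirt": "Red"})
print(sorted(t1))  # -> ['T-shirt.Green', 'shoes.Blue']
```

After this mapping, the market-basket similarity above applies to categorical records unchanged.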

13 New Clustering Model – Neighbors and Links
Links: define link(p_i, p_j) to be the number of common neighbors of p_i and p_j.
If link(p_i, p_j) is large, it is more probable that p_i and p_j belong to the same cluster.
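A brute-force sketch of the link computation (as in the paper, a point is assumed to be a neighbor of itself, since sim(p, p) = 1 >= θ):

```python
def links(points, sim, theta):
    """link[i][j] = number of common neighbors of points i and j."""
    n = len(points)
    # Neighbor sets; each point's set includes itself.
    neighbors = [{j for j in range(n) if sim(points[i], points[j]) >= theta}
                 for i in range(n)]
    return [[len(neighbors[i] & neighbors[j]) if i != j else 0
             for j in range(n)] for i in range(n)]

jaccard = lambda a, b: len(a & b) / len(a | b)
pts = [{1, 2, 3, 5}, {2, 3, 4, 5}, {1, 4}, {6}]
link = links(pts, jaccard, 0.5)
print(link[0][1], link[2][3])  # -> 2 0
```

With θ = 0.5, {1,2,3,5} and {2,3,4,5} share two common neighbors (each other and themselves), while {1,4} and {6} share none, matching the earlier slide's example.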

14 New Clustering Model – Neighbors and Links
This link-based approach solves the previous problem.
Let θ = 0.5.

15 New Clustering Model – Neighbors and Links
Criterion function:
E_l = sum_{i=1}^{k} n_i * sum_{p_q, p_r in C_i} link(p_q, p_r) / n_i^{1 + 2 f(θ)}
where C_i denotes cluster i of size n_i.

16 New Clustering Model – Neighbors and Links
One of the goals is to maximize the sum of link(p_q, p_r) over pairs of points in the same cluster.
However, that alone does not prevent a clustering in which all points are assigned to a single cluster.

17 New Clustering Model – Neighbors and Links
So, divide the total number of links involving pairs of points in cluster C_i by the expected total number of links in C_i.
Each point in C_i has approximately n_i^{f(θ)} neighbors, so each point contributes n_i^{2 f(θ)} links, and n_i^{1 + 2 f(θ)} is the expected number of links in C_i.
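A small numeric check of this estimate, using f(θ) = (1 - θ)/(1 + θ) as in the paper's experiments (function names are mine):

```python
def f(theta):
    # The paper's choice for market basket data: f(theta) = (1 - theta) / (1 + theta)
    return (1 - theta) / (1 + theta)

def expected_links(n_i, theta):
    """Expected number of links inside a cluster of size n_i: n_i ** (1 + 2*f(theta))."""
    return n_i ** (1 + 2 * f(theta))

# With theta = 0.5 the exponent is 1 + 2/3 = 5/3, so a cluster of
# 8 points is expected to contain 8 ** (5/3) = 32 links.
print(expected_links(8, 0.5))
```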

18 Clustering Algorithm
Steps of the ROCK clustering algorithm.

19 Clustering Algorithm
Goodness measure:
g(C_i, C_j) = link[C_i, C_j] / ((n_i + n_j)^{1 + 2 f(θ)} - n_i^{1 + 2 f(θ)} - n_j^{1 + 2 f(θ)})
where the denominator is the expected number of links between clusters C_i and C_j.
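The goodness measure translates directly into code (a sketch; names are illustrative):

```python
def goodness(cross_links, n_i, n_j, theta):
    """Goodness of merging clusters of sizes n_i and n_j that share
    cross_links links, normalized by the expected number of cross links."""
    e = 1 + 2 * (1 - theta) / (1 + theta)  # exponent 1 + 2*f(theta)
    expected = (n_i + n_j) ** e - n_i ** e - n_j ** e
    return cross_links / expected

# More shared links between same-sized clusters -> higher goodness.
print(goodness(10, 2, 2, 0.5) > goodness(5, 2, 2, 0.5))  # -> True
```

Normalizing by the expected count keeps large clusters from always winning merges simply because they contain more points.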

20 Clustering Algorithm

21 Clustering Algorithm
Each cluster i keeps a local heap q[i] of the other clusters ordered by goodness, from big to small (e.g., q[a] holds (b, g(a,b)) and (c, g(a,c))), and a global heap Q orders all clusters by the goodness of their best merge.

22 Clustering Algorithm
Merge a and b into a new cluster d: the entries for a and b are removed from the local heaps and from Q.

23 Clustering Algorithm
After the merge, the heaps contain only d and c: q[c] holds (d, g(c,d)), q[d] holds (c, g(c,d)), and Q is updated accordingly.
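Putting the pieces together, here is a simplified, heap-free sketch of the merge loop. The paper's algorithm maintains local heaps q[i] and a global heap Q for efficiency; this version just rescans all cluster pairs each round, which is slower but easier to follow.

```python
def rock_cluster(points, sim, theta, k):
    """Greedy agglomerative sketch of ROCK: repeatedly merge the pair of
    clusters with the highest goodness until k clusters remain, or until
    no pair of clusters shares any links."""
    n = len(points)
    neighbors = [{j for j in range(n) if sim(points[i], points[j]) >= theta}
                 for i in range(n)]
    link = [[len(neighbors[i] & neighbors[j]) if i != j else 0
             for j in range(n)] for i in range(n)]
    e = 1 + 2 * (1 - theta) / (1 + theta)  # exponent 1 + 2*f(theta)
    clusters = [[i] for i in range(n)]

    def cross_links(ci, cj):
        return sum(link[i][j] for i in ci for j in cj)

    def goodness(ci, cj):
        ni, nj = len(ci), len(cj)
        return cross_links(ci, cj) / ((ni + nj) ** e - ni ** e - nj ** e)

    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda ab: goodness(clusters[ab[0]], clusters[ab[1]]))
        if cross_links(clusters[a], clusters[b]) == 0:
            break  # remaining clusters share no links; stop merging
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

jaccard = lambda a, b: len(a & b) / len(a | b)
print(rock_cluster([{1, 2, 3, 5}, {2, 3, 4, 5}, {1, 4}, {6}], jaccard, 0.5, 1))
# -> [[0, 1], [2], [3]]
```

On the earlier example, only the first two transactions share links, so the loop stops after one merge even when asked for a single cluster; the unlinked points are never forced together.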

24 Clustering Algorithm
Computation of links:
1. Compute the adjacency matrix A, then compute A x A; entry (i, j) of the product is the number of common neighbors of points i and j.
2. More efficient way: for every point, consider each pair of its neighbors and increment that pair's link count.
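Both strategies can be sketched and cross-checked against each other (a sketch, not the paper's code):

```python
def links_via_matrix(neighbors):
    """link = A x A, where A is the boolean adjacency (neighbor) matrix."""
    n = len(neighbors)
    A = [[1 if j in neighbors[i] else 0 for j in range(n)] for i in range(n)]
    return [[sum(A[i][l] * A[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def links_via_lists(neighbors):
    """For every point l, each pair (i, j) of l's neighbors gains one link."""
    n = len(neighbors)
    link = [[0] * n for _ in range(n)]
    for l in range(n):
        nbrs = sorted(neighbors[l])
        for x in range(len(nbrs)):
            for y in range(x + 1, len(nbrs)):
                i, j = nbrs[x], nbrs[y]
                link[i][j] += 1
                link[j][i] += 1
    return link

nb = [{0, 1}, {0, 1}, {2}, {3}]  # neighbor sets (each point neighbors itself)
print(links_via_matrix(nb)[0][1], links_via_lists(nb)[0][1])  # -> 2 2
```

The list-based version touches only actual neighbor pairs, which is why it wins when the neighbor lists are short relative to n; the two agree on all off-diagonal entries.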

25 Clustering Algorithm
Time and space complexity
n: number of input data points
m_m: the maximum number of neighbors
m_a: the average number of neighbors
Time complexity: O(n^2 + n m_m m_a + n^2 log n)
Space complexity: O(min{n^2, n m_m m_a})

26 Experiments – Experimental Setup
Data sets (real-life):
Congressional Votes (UCI Machine Learning Repository): United States congressional voting records from 1984; 435 records, 16 issues; 168 Republicans and 267 Democrats.
Mushroom (UCI Machine Learning Repository): 8124 records, 22 physical characteristics; 4208 edible and 3916 poisonous.

27 Experiments – Experimental Setup
Compared with a traditional centroid-based hierarchical clustering algorithm:
Categorical attributes converted to Boolean attributes with 0/1 values; Euclidean distance.
Outlier handling: clusters containing only one point are eliminated when the number of clusters drops to 1/3 of the original number.
ROCK: f(θ) = (1 - θ)/(1 + θ).
Hardware: Sun Ultra-2/200 with 512 MB of RAM, running Solaris 2.5.

28 Experiments – Experimental Results
Congressional Votes: the data is well separated, with no significant difference in the sizes of the two clusters; θ = 0.73.

29 Experiments – Experimental Results
Mushroom: the data is not well separated, and there is wide variance in the sizes of the clusters; θ = 0.8.

30 Experiments – Experimental Results Mushroom

31 Conclusions
ROCK is a clustering algorithm designed to deal with categorical data.
It proposes a new concept, links, to measure the similarity between a pair of data points.
It develops a robust hierarchical clustering algorithm that merges clusters using links instead of distances.
ROCK not only generates better-quality clusters than traditional algorithms but also exhibits good scalability.

32 Discussions
Strongest parts of this paper:
The concept of links is more appropriate than distance for representing the relations between data points.
The ROCK algorithm is robust and scalable.
Weak points of this paper:
Memory usage may be large.
Worst-case time complexity is high.
Random-sampling issues.

33 Discussions
Possible improvements:
Use a linked list to store the categorical data.
Increase the number of random samples until a large enough number of clusters is found.
Possible extensions & applications:
Extension to both metric and non-metric data.
Social networks: modeling the connections between people.

34 The End
