
1 Clustering by Pattern Similarity in Large Data Sets Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu IBM T. J. Watson Research Center Presented by Edmond Wu

2 DB-Seminar Slide 2 Talk Outline
- Introduction
- Related Work
- pCluster Model
- Performance Analysis
- Conclusion

3 Motivation
Why is discovering clusters based on pattern similarity interesting and important?
- DNA micro-array analysis
- E-commerce: recommendation systems and target marketing

4 Background Knowledge
- Clustering: the process of grouping a set of objects into classes of similar objects.
- Subspace clustering: discovering clusters embedded in subspaces of a high-dimensional dataset.
- Pattern similarity: a coherent pattern on a subset of dimensions (objects are not required to have close values on any attribute).

5 Example of a similar pattern on a subset of dimensions

6 Challenges
- Identifying subspace clusters in high-dimensional data sets is difficult.
- Traditional distance functions cannot capture the pattern similarity among objects.

7 How to detect a shifting pattern?
Given N attributes a_1, …, a_N, define a derived attribute A_ij = a_i − a_j for every pair of attributes a_i, a_j. The problem then reduces to mining subspace clusters on the objects with the derived set of attributes.
Drawback: the converted dataset has N(N−1)/2 dimensions, which is intractable even for a small N.
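As a minimal sketch of this transformation (function and variable names are mine, not from the paper): two objects that differ by a constant shift have identical values on every derived attribute, but the derived space grows quadratically.

```python
from itertools import combinations

def derive_shift_attributes(row):
    """Map a row (a_1, ..., a_N) to the derived attributes A_ij = a_i - a_j."""
    return {(i, j): row[i] - row[j] for i, j in combinations(range(len(row)), 2)}

x = [1, 5, 3, 9]
y = [4, 8, 6, 12]  # y = x + 3 on every attribute: a perfect shifting pattern

dx, dy = derive_shift_attributes(x), derive_shift_attributes(y)
print(dx == dy)    # True: the shift disappears in the derived space
print(len(dx))     # 6 = N(N-1)/2 derived dimensions for N = 4
```

For N = 100 attributes this already yields 4,950 derived dimensions, which is why the paper rejects this approach.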

8 Related Work
Bicluster model (Cheng et al.): A_IJ is a submatrix of a DNA array (rows I, columns J) with the mean squared residue score
H(I, J) = (1 / (|I||J|)) Σ_{i∈I, j∈J} (d_ij − d_iJ − d_Ij + d_IJ)²
where d_iJ, d_Ij, and d_IJ are the row, column, and submatrix means.
δ-bicluster: A_IJ is called a δ-bicluster if H(I, J) ≤ δ.

9 Bicluster Model (Example)
(1) Shifting pattern, H(I,J) = 0:
     a1 a2 a3
O1    1  2  3
O2    5  6  7
(2) Scaling pattern, H(I,J) = 2/3:
     a1 a2 a3
O1    1  2  4
O2    2  4  8
(3) No similar pattern, H(I,J) = 8:
     a1 a2 a3
O1    2  4 12
O2    4  6  2
(4) Submatrix of (2), H(I,J) = 2.25 > 2/3:
     a1 a3
O1    1  4
O2    2  8
If we set δ = 2, then (3) and (4) are not δ-biclusters.
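The mean squared residue can be computed directly from its definition. The sketch below (naming is mine) reproduces the scores for the shifting example (1) and the dissimilar example (3):

```python
def mean_squared_residue(A):
    """H(I, J): mean of (d_ij - d_iJ - d_Ij + d_IJ)^2 over the submatrix A."""
    n_rows, n_cols = len(A), len(A[0])
    row_mean = [sum(r) / n_cols for r in A]
    col_mean = [sum(A[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    mean = sum(map(sum, A)) / (n_rows * n_cols)
    return sum((A[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
               for i in range(n_rows) for j in range(n_cols)) / (n_rows * n_cols)

print(mean_squared_residue([[1, 2, 3], [5, 6, 7]]))   # 0.0: perfect shifting pattern
print(mean_squared_residue([[2, 4, 12], [4, 6, 2]]))  # 8.0: no coherent pattern
```

A residue of exactly 0 means every entry is fully explained by its row and column means, which is what a pure shifting pattern gives.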

10 Drawbacks of the Bicluster Model
- A submatrix of a δ-bicluster is not necessarily a δ-bicluster.
- Not guaranteed to find all qualified clusters (the randomized greedy algorithm provides only an approximate answer).
- Cannot exclude outliers from a bicluster.
- Difficult to design an efficient algorithm.

11 Bicluster Model (Example)
The bicluster shown in Figure (a) contains an obvious outlier, but it still has a fairly small mean squared residue (4.238). If we try to get rid of such outliers by reducing the δ threshold, we will also exclude many biclusters that do exhibit similar patterns.

12 The pCluster Model
pScore of a 2×2 matrix X = [[d_xa, d_xb], [d_ya, d_yb]], given x, y ∈ O and a, b ∈ T:
pScore(X) = |(d_xa − d_xb) − (d_ya − d_yb)|
where:
- O: subset of objects in the database
- T: subset of attributes; (O, T): submatrix of the dataset
- δ: user-specified clustering threshold
- d_xa: value of object x on attribute a

13 The pCluster Model (Cont.)
pScore(X) ≤ δ means that the change of values on the two attributes between the two objects in X is confined by δ, a user-specified threshold.
A pair (O, T) forms a δ-pCluster if, for any 2×2 submatrix X of (O, T), we have pScore(X) ≤ δ for some δ ≥ 0.

14 The pCluster Model (Example)
In Figure (a), objects 2, 3 and attributes {b, c} form a 2×2 submatrix X:
d_2b = 12, d_2c = 15, d_3b = 40, d_3c = 43
pScore(X) = |(12 − 15) − (40 − 43)| = 0
Objects {1, 2, 3} and attributes {b, c, h, j, e} form a pCluster (δ = 0).
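A minimal sketch of the pScore computation and of the exhaustive 2×2 check implied by the pCluster definition (function names are mine; the paper's algorithm never tests all submatrices explicitly, this is only the definition made executable):

```python
from itertools import combinations

def pscore(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2x2 submatrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(M, delta):
    """(O, T) is a delta-pCluster iff every 2x2 submatrix has pScore <= delta."""
    rows, cols = range(len(M)), range(len(M[0]))
    return all(pscore(M[x][a], M[x][b], M[y][a], M[y][b]) <= delta
               for x, y in combinations(rows, 2)
               for a, b in combinations(cols, 2))

print(pscore(12, 15, 40, 43))                        # 0, as in the example above
print(is_delta_pcluster([[1, 2, 3], [5, 6, 7]], 0))  # True: pure shifting pattern
print(is_delta_pcluster([[1, 2, 4], [2, 4, 8]], 0))  # False: scaling, not shifting
```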

15 The pCluster Model (Cont.)
- Compact property of pCluster: let (O, T) be a δ-pCluster. Any of its submatrices (O′, T′), with O′ ⊆ O and T′ ⊆ T, is also a δ-pCluster (this follows directly from the definition).
- The volume of a pCluster is |O| × |T|.
- The definition of pCluster is symmetric in objects and attributes: |(d_xa − d_xb) − (d_ya − d_yb)| = |(d_xa − d_ya) − (d_xb − d_yb)|

16 Problem Statement
Task: find all pairs (O, T) such that (O, T) is a δ-pCluster, |O| ≥ nr, and |T| ≥ nc.
Parameters:
- D: dataset
- δ: cluster threshold
- nc: minimal number of columns
- nr: minimal number of rows

17 The Algorithm
Definition of MDS: assuming c = (O, T) is a δ-pCluster, the column set T is a Maximum Dimension Set (MDS) of c if there does not exist T′ ⊃ T such that (O, T′) is also a δ-pCluster.
Objects can form pClusters on multiple MDSs. The algorithm is depth-first: it generates only pClusters that cluster on MDSs.

18 Pair-wise Clustering
Pairwise clustering principle: given objects x and y and a dimension set T, let S(x, y, T) = {d_xa − d_ya | a ∈ T}. Then x and y form a δ-pCluster on T iff the difference between the largest and smallest value in S(x, y, T) is at most δ. In other words, ({x, y}, T) is a pCluster if:
max_{a∈T} (d_xa − d_ya) − min_{b∈T} (d_xb − d_yb) ≤ δ

19 Pair-wise Clustering (Example)
Sort S(x, y, T) into a sequence s_1, …, s_k, …, s_n. Objects x and y form a δ-pCluster on the dimensions corresponding to a subsequence s_i, …, s_j whenever s_j − s_i ≤ δ; the MDSs correspond to the maximal such subsequences.
Three MDSs were found: {e, g, c}, {a, d, b, h}, {h, f}.
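The sorted-sequence idea can be sketched as a sliding window over s_1, …, s_n. This is an illustrative implementation under my own naming and bookkeeping, not the paper's code; note that maximal windows may overlap, which is how one object pair yields several MDSs:

```python
def pairwise_mds(x, y, delta, nc):
    """Maximal dimension sets on which objects x and y form a delta-pCluster.

    Sort the differences d_xa - d_ya; any run of the sorted sequence whose
    spread (max - min) is at most delta is a candidate. Report only maximal
    runs with at least nc dimensions.
    """
    order = sorted(range(len(x)), key=lambda a: x[a] - y[a])
    s = [x[a] - y[a] for a in order]
    result, start = [], 0
    for end in range(len(s)):
        while s[end] - s[start] > delta:
            start += 1          # smallest valid start: cannot extend leftward
        # maximal if the window cannot be extended to the right either
        if (end == len(s) - 1 or s[end + 1] - s[start] > delta) and end - start + 1 >= nc:
            result.append(sorted(order[start:end + 1]))
    return result

# Differences 0, 5, 10, 15 with delta = 10 give two overlapping MDSs
print(pairwise_mds([0, 5, 10, 15], [0, 0, 0, 0], 10, 2))  # [[0, 1, 2], [1, 2, 3]]
```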

20 MDS Pruning
MDS pruning principle: let T_xy be an MDS for objects x, y, and let a ∈ T_xy. For any O and T, a necessary condition for ({x, y} ∪ O, {a} ∪ T) to be a δ-pCluster is that, for every b ∈ T, there is a column-pair MDS with object set O_ab ⊇ {x, y}.
The pruning criterion can be stated as follows: for any dimension a in an MDS T_xy, count the number of O_ab that contain {x, y}. If this number is less than nc − 1, remove a from T_xy. Furthermore, if the removal of a makes |T_xy| < nc, remove T_xy as well.
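A sketch of this pruning count, assuming column-pair MDSs are stored as a map from an (ordered) column pair to the list of its object sets; that representation and all names are my assumptions, not the paper's data structures:

```python
def prune_object_mds(T_xy, pair, col_mds, nc):
    """Prune dimensions from the object-pair MDS T_xy of the objects in `pair`.

    A dimension a survives only if at least nc - 1 other dimensions b in T_xy
    have some column-pair MDS whose object set O_ab contains both objects.
    Returns the pruned dimension set, or None if it falls below nc columns.
    """
    x, y = pair

    def supported(a, b):
        key = (min(a, b), max(a, b))
        return any({x, y} <= objs for objs in col_mds.get(key, []))

    kept = [a for a in T_xy
            if sum(supported(a, b) for b in T_xy if b != a) >= nc - 1]
    return kept if len(kept) >= nc else None

# Hypothetical column-pair MDSs: pair (1, 2) of columns has no surviving MDS
col_mds = {(0, 1): [{1, 2, 3}], (0, 2): [{1, 2}]}
print(prune_object_mds([0, 1, 2], (1, 2), col_mds, 3))  # None: too few columns remain
print(prune_object_mds([0, 1, 2], (1, 2), col_mds, 2))  # [0, 1, 2]
```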

21 DB-Seminar Slide 21 MDS Pruning (Example)

22 The Main Algorithm
First step: scan the dataset to find column-pair MDSs and object-pair MDSs.
Second step: prune object-pair MDSs and column-pair MDSs in turn until no further pruning can be made.
Third step: insert the remaining object-pair MDSs into a prefix tree (each node represents a cluster of objects; each edge represents the selected column).

23 Construct a Prefix Tree
- Fix an order on the columns, e.g., a, b, c, …
- Insert each 2-object pCluster (O, T) into the prefix tree.
- Perform a post-order traversal of the prefix tree.
- Prune nodes with |O| < nr, adding the objects in O to the nodes whose column set T′ ⊂ T satisfies |T′| = |T| − 1.
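A simplified sketch of the insertion and traversal steps (names are mine; it omits the final pruning step of pushing the objects of undersized nodes down to the |T| − 1 subsets described above):

```python
class Node:
    def __init__(self):
        self.children = {}    # column -> child Node, following the column order
        self.objects = set()  # objects whose MDS ends exactly at this node

def insert(root, columns, pair):
    """Insert a 2-object pCluster (pair, columns); columns must be pre-sorted."""
    node = root
    for c in columns:
        node = node.children.setdefault(c, Node())
    node.objects |= set(pair)

def collect(node, path, nr, out):
    """Post-order traversal: report column sets whose node holds >= nr objects."""
    for c, child in sorted(node.children.items()):
        collect(child, path + [c], nr, out)
    if len(node.objects) >= nr:
        out.append((tuple(path), frozenset(node.objects)))

root = Node()
insert(root, ['a', 'b', 'c'], (1, 2))
insert(root, ['a', 'b', 'c'], (1, 3))
insert(root, ['a', 'b'], (4, 5))
out = []
collect(root, [], 3, out)
print(out)  # [(('a', 'b', 'c'), frozenset({1, 2, 3}))]
```

Sharing prefixes this way is what lets the depth-first pass avoid reporting clusters that are mere fragments of larger ones.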

24 DB-Seminar Slide 24 Construct a prefix tree (Example)

25 Algorithm Complexity
MDS generation dominates the cost: computing the object-pair MDSs takes O(N²M log M) time and the column-pair MDSs O(M²N log N), where M is the number of columns and N is the number of objects (each pair requires sorting its difference sequence). In the worst case the number of clusters, and hence the prefix-tree work, can grow exponentially. However, the cost is greatly reduced in practice by the MDS pruning process.

26 Experiments
Datasets:
- Synthetic datasets (parameters: nr, nc, and the number of embedded perfect pClusters with δ = 0)
- Gene expression data (yeast microarray)
- MovieLens dataset (e-commerce)

27 Performance Analysis
Response time vs. data size

28 Performance Analysis (Cont.)
Sensitivity to the mining parameters δ, nc, and nr

29 Performance Analysis (Cont.)
Comparison of the pCluster algorithm with an alternative approach based on the subspace clustering algorithm CLIQUE.

30 Performance Analysis (Cont.)
The pruning process is essential to the pCluster algorithm. Without pruning, the algorithm cannot scale beyond 3,000 objects, because the number of MDSs becomes too large to fit into a prefix tree.

31 Conclusion
The pCluster model captures both the closeness of objects and the pattern similarity among them on subsets of dimensions.
Advantages:
- Discovers all qualified pClusters.
- The depth-first clustering algorithm avoids generating clusters that are part of other clusters.
- More efficient than existing algorithms.
- Resilient to outliers.

32 References
- Y. Cheng and G. Church. Biclustering of expression data. In Proc. of the 8th International Conference on Intelligent Systems for Molecular Biology, 2000.
- S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast microarray data set, 2000. http://arep.med.harvard.edu/biclustering/yeast.matrix
- R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth first generation of long patterns. In SIGKDD, 2000.
- J. Yang, W. Wang, H. Wang, and P. S. Yu. δ-clusters: Capturing subspace correlation in a large data set. In ICDE, pages 517-528, 2002.

33 Thanks!!

