Clustering by Pattern Similarity in Large Data Sets Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu IBM T. J. Watson Research Center Presented by Edmond.

Presentation transcript:

Clustering by Pattern Similarity in Large Data Sets Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu IBM T. J. Watson Research Center Presented by Edmond Wu

DB-Seminar Slide 2 Talk Outline: Introduction; Related Work; pCluster Model; Performance Analysis; Conclusion

DB-Seminar Slide 3 Motivation Why is the discovery of clusters based on pattern similarity interesting and important? DNA micro-array analysis. E-commerce: recommendation systems and target marketing.

DB-Seminar Slide 4 Background Knowledge Clustering: the process of grouping a set of objects into classes of similar objects. Subspace clustering: discovering clusters embedded in subspaces of a high-dimensional dataset. Pattern similarity: objects exhibit a coherent pattern on a subset of dimensions. (They are not required to have close values on any attribute.)

DB-Seminar Slide 5 Example of Similar pattern on a subset of dimensions

DB-Seminar Slide 6 Challenges Identifying subspace clusters in high-dimensional data sets is difficult. Traditional distance functions cannot capture the pattern similarity among the objects.

DB-Seminar Slide 7 How to detect shifting patterns? Given N attributes $a_1, \dots, a_N$, define a derived attribute $A_{ij} = a_i - a_j$ for every pair of attributes $(a_i, a_j)$. The problem then becomes mining subspace clusters on the objects with the derived set of attributes. Drawback: the converted dataset has N(N-1)/2 dimensions, which is intractable even for a small N.
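
The derived-attribute idea can be illustrated with a minimal Python sketch. It assumes each object is a plain list of numbers; the function name `derive_pairwise_differences` is hypothetical and not part of the talk. Two objects that differ only by a constant shift end up with identical derived attributes, while the number of derived dimensions grows as N(N-1)/2.

```python
from itertools import combinations


def derive_pairwise_differences(row):
    """Map one object's values to the derived attributes A_ij = a_i - a_j (i < j)."""
    return {(i, j): row[i] - row[j] for i, j in combinations(range(len(row)), 2)}


# Hypothetical objects: o2 is o1 shifted by a constant 4.
o1 = [1, 2, 3]
o2 = [5, 6, 7]

d1 = derive_pairwise_differences(o1)
d2 = derive_pairwise_differences(o2)

# A pure shift cancels out in every pairwise difference.
print(d1 == d2)

# The derived space has N(N-1)/2 dimensions, e.g. 4950 for N = 100.
print(len(derive_pairwise_differences(list(range(100)))))
```

The quadratic growth in the second print is the drawback named on the slide.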

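DB-Seminar Slide 8 Related Work Bicluster Model (Cheng et al.): $A_{IJ}$ is a submatrix of a DNA array with the mean squared residue score $H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$, where $a_{iJ}$, $a_{Ij}$, and $a_{IJ}$ are the row mean, the column mean, and the mean of the whole submatrix. δ-bicluster: $A_{IJ}$ is called a δ-bicluster if $H(I,J) \le \delta$.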

DB-Seminar Slide 9 Bicluster Model (Example)

(1) Shifting pattern, H(I,J) = 0
      a1  a2  a3
O1     1   2   3
O2     5   6   7

(2) Scaling pattern, H(I,J) = 2/3
      a1  a2  a3
O1     1   2   4
O2     2   4   8

(3) Not similar pattern, H(I,J) = 8
      a1  a2  a3
O1     2   4  12
O2     4   6   2

(4) Submatrix of (2), H(I,J) = 2.25 > 2/3
      a1  a3
O1     1   4
O2     2   8

If we set δ = 2, (3) and (4) are not δ-biclusters.
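
For readers who want to experiment with the residue score, here is a small Python sketch. It assumes the standard Cheng and Church definition (each entry minus its row mean and column mean, plus the submatrix mean, squared and averaged); the function name `mean_squared_residue` is hypothetical. It is run on the shifting pattern (1) and the non-similar pattern (3) from the slide.

```python
def mean_squared_residue(matrix):
    """Mean squared residue H(I, J) of a submatrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = sum(
        (matrix[i][j] - row_means[i] - col_means[j] + overall_mean) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


shifting = [[1, 2, 3], [5, 6, 7]]        # pattern (1)
not_similar = [[2, 4, 12], [4, 6, 2]]    # pattern (3)

print(mean_squared_residue(shifting))     # 0.0
print(mean_squared_residue(not_similar))  # 8.0
```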

DB-Seminar Slide 10 Drawbacks of Bicluster Model A submatrix of a δ-bicluster is not necessarily a δ-bicluster. It is not guaranteed to find all qualified clusters (the randomized greedy algorithm provides only an approximate answer). It cannot exclude outliers in a bicluster. Designing an efficient algorithm is difficult.

DB-Seminar Slide 11 Bicluster Model (Example) The bicluster shown in Figure (a) contains an obvious outlier, but it still has a fairly small mean squared residue (4.238). If we try to get rid of such outliers by reducing the δ threshold, we also exclude many biclusters that do exhibit similar patterns.

DB-Seminar Slide 12 The pCluster Model O: a subset of objects in the database; T: a subset of attributes; (O, T): a submatrix of the dataset; δ: a user-specified clustering threshold; $d_{xa}$: the value of object x on attribute a. Given x, y ∈ O and a, b ∈ T, the pScore of the 2×2 matrix X they induce is $pScore(X) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$.

DB-Seminar Slide 13 The pCluster Model (Cont.) pScore(X) ≤ δ means that the change of values on the two attributes between the two objects in X is bounded by δ, a user-specified threshold. A pair (O, T) forms a δ-pCluster if, for any 2×2 submatrix X in (O, T), we have pScore(X) ≤ δ for some δ ≥ 0.

DB-Seminar Slide 14 The pCluster Model (Example) In Figure (a): objects 2 and 3 and attributes {b, c} form a 2×2 submatrix X with $d_{2b} = 12$, $d_{2c} = 15$, $d_{3b} = 40$, $d_{3c} = 43$, so pScore(X) = |(12-15)-(40-43)| = 0. Objects 1, 2, 3 and attributes {b, c, h, j, e} form a pCluster (δ = 0).
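
The pScore and the δ-pCluster condition can be checked directly with a brute-force Python sketch. The function names `pscore` and `is_delta_pcluster` and the dictionary layout are assumptions made for illustration; objects 2 and 3 use the values quoted on the slide, while object 4 is a hypothetical row added to show a violation. This check enumerates every 2×2 submatrix and is not the efficient algorithm of the talk.

```python
from itertools import combinations


def pscore(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2x2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))


def is_delta_pcluster(data, objects, attributes, delta):
    """True if every 2x2 submatrix of (objects, attributes) has pScore <= delta."""
    for x, y in combinations(objects, 2):
        for a, b in combinations(attributes, 2):
            if pscore(data[x][a], data[x][b], data[y][a], data[y][b]) > delta:
                return False
    return True


data = {
    2: {"b": 12, "c": 15},  # values from the slide
    3: {"b": 40, "c": 43},  # values from the slide
    4: {"b": 1, "c": 9},    # hypothetical object that breaks the pattern
}

print(pscore(12, 15, 40, 43))                              # 0
print(is_delta_pcluster(data, [2, 3], ["b", "c"], 0))      # True
print(is_delta_pcluster(data, [2, 3, 4], ["b", "c"], 0))   # False
```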

DB-Seminar Slide 15 The pCluster Model (Cont.) Compact property of pCluster: let (O, T) be a δ-pCluster. Any submatrix (O', T') of it is also a δ-pCluster (this follows from the definition of pCluster). The volume of a pCluster is |O|×|T|. The definition of pCluster is symmetric: $|(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})| = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$.

DB-Seminar Slide 16 Problem Statement Task: find all pairs (O, T) such that (O, T) is a δ-pCluster according to its definition, with |O| ≥ nr and |T| ≥ nc. Parameters: D: the dataset; δ: the cluster threshold; nc: the minimal number of columns; nr: the minimal number of rows.

DB-Seminar Slide 17 The Algorithm Definition of MDS: assume c = (O, T) is a δ-pCluster. Column set T is a Maximum Dimension Set (MDS) of c if there does not exist T' ⊃ T such that (O, T') is also a δ-pCluster. Objects can form pClusters on multiple MDSs. The algorithm is depth-first: it generates only pClusters that cluster on MDSs.

DB-Seminar Slide 18 Pair-wise Clustering Pairwise Clustering Principle: given objects X and Y and a dimension set T, let $S(X, Y, T) = \{d_{Xa} - d_{Ya} \mid a \in T\}$. X and Y form a δ-pCluster on T iff the difference between the largest and smallest value in S(X, Y, T) is at most δ. In other words, ({X, Y}, T) is a pCluster if the following is true: $\max S(X, Y, T) - \min S(X, Y, T) \le \delta$.

DB-Seminar Slide 19 Pair-wise Clustering (Example) Sorted sequence of S(X, Y, T): $s_1, \dots, s_k, \dots, s_n$. Objects x and y form a δ-pCluster on T if the spread of the sorted sequence, $s_n - s_1$, is at most δ. In the example, three MDSs were found: {e,g,c}, {a,d,b,h}, {h,f}.
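
One way to read the pairwise principle is as a sliding window over the sorted per-attribute differences: every maximal window whose spread is at most δ is a candidate MDS for the pair. The Python sketch below follows that reading. The function name `pairwise_mds` and the object values are hypothetical (the figure's data is not in the transcript); the values were chosen so that, with δ = 2 and nc = 2, the output matches the three MDSs named on the slide.

```python
def pairwise_mds(x, y, delta, nc):
    """Maximal attribute sets on which objects x and y form a delta-pCluster.

    x and y map attribute names to values. Only sets with at least nc
    attributes are returned, in order of increasing difference.
    """
    diffs = sorted((x[a] - y[a], a) for a in x)
    n = len(diffs)
    result = []
    end = 0        # exclusive right edge of the current window
    prev_end = 0   # right edge of the last maximal window seen
    for start in range(n):
        end = max(end, start)
        # Grow the window while the spread stays within delta.
        while end < n and diffs[end][0] - diffs[start][0] <= delta:
            end += 1
        # If the right edge did not move, this window lies inside the previous one.
        if end > prev_end:
            if end - start >= nc:
                result.append({a for _, a in diffs[start:end]})
            prev_end = end
    return result


# Hypothetical objects, not the data behind the slide's figure.
x = {"a": 15, "b": 18, "c": 7, "d": 10, "e": 3, "f": 20, "g": 4, "h": 12}
y = {"a": 10, "b": 12, "c": 5, "d": 5, "e": 3, "f": 11, "g": 3, "h": 5}

for mds in pairwise_mds(x, y, delta=2, nc=2):
    print(sorted(mds))
```

Sorting dominates the cost, so each pair of objects is handled in O(n log n) time for n attributes. Windows may overlap, which is why attribute h appears in two MDSs.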

DB-Seminar Slide 20 MDS Pruning MDS Pruning Principle: let $T_{xy}$ be an MDS for objects x, y, and let a ∈ $T_{xy}$. For any O and T, a necessary condition for ({x, y} ∪ O, {a} ∪ T) being a δ-pCluster is: ∀ b ∈ T, ∃ $O_{ab}$ ⊇ {x, y}, where $O_{ab}$ is a column-pair MDS for columns a and b. The pruning criterion can be stated as follows: for any dimension a in an MDS $T_{xy}$, count the number of $O_{ab}$ that contain {x, y}. If the number of such $O_{ab}$ is less than nc-1, remove a from $T_{xy}$. Furthermore, if the removal of a makes $|T_{xy}|$ < nc, remove $T_{xy}$ as well.

DB-Seminar Slide 21 MDS Pruning (Example)

DB-Seminar Slide 22 The Main Algorithm First step: scan the dataset to find column-pair MDSs and object-pair MDSs. Second step: prune object-pair MDSs and column-pair MDSs in turn until no further pruning is possible. Third step: insert the remaining object-pair MDSs into a prefix tree. (Each node represents a cluster of objects; each edge represents the column selected.)

DB-Seminar Slide 23 Construct a prefix tree Fix an order on the columns, e.g., a, b, c, … Insert each 2-object pCluster (O, T) into the prefix tree. Perform a post-order traversal of the prefix tree. Prune nodes with |O| < nr. (Add the objects in O to the nodes whose column set T' satisfies T' ⊂ T and |T'| = |T|-1.)

DB-Seminar Slide 24 Construct a prefix tree (Example)

DB-Seminar Slide 25 Algorithm Complexity The main algorithm for mining pClusters has time complexity : where M is the number of columns and N is the number of objects. The worst case: However, the complexity can be greatly reduced by the MDS pruning process.

DB-Seminar Slide 26 Experiments Datasets: synthetic datasets (parameters: different nr, nc, and number of embedded perfect pClusters with δ = 0); gene expression data (yeast microarray); MovieLens dataset (e-commerce).

DB-Seminar Slide 27 Performance Analysis Response time vs. data size

DB-Seminar Slide 28 Performance Analysis (Cont.) Sensitivity to mining parameters: δ, nc, and nr

DB-Seminar Slide 29 Performance Analysis (Cont.) Compare the pCluster with an alternative approach based on the subspace clustering algorithm CLIQUE.

DB-Seminar Slide 30 Performance Analysis (Cont.) The pruning process is essential in the pCluster algorithm. Without pruning, the pCluster algorithm cannot go beyond 3,000 objects, because the number of MDSs becomes too large to fit into a prefix tree.

DB-Seminar Slide 31 Conclusion pCluster Model: captures both the closeness of objects and the pattern similarity among objects in subsets of dimensions. Advantages: - Discovers all the qualified pClusters. - The depth-first clustering algorithm avoids generating clusters that are part of other clusters. - More efficient than existing algorithms. - Resilient to outliers.

DB-Seminar Slide 32 References
Y. Cheng and G. Church. Biclustering of expression data. In Proc. of 8th International Conference on Intelligent Systems for Molecular Biology,
S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast micro data set, In
R. C. Agarwal, C. C. Aggarwal, and V. Parsad. Depth first generation of long patterns. In SIGKDD,
J. Yang, W. Wang, H. Wang, and P. S. Yu. δ-clusters: Capturing subspace correlation in a large data set. In ICDE, pages 517–528, 2002.

Thanks!!