Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications
Authors: Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan

Schedule
(1) Introduction
(2) Motivation
(3) Contributions of the Paper
(4) Subspace Clustering
(5) CLIQUE (CLustering In QUEst)
(6) Performance Experiments
(7) Conclusions

Introduction
(1) Clustering is a descriptive task that seeks to identify homogeneous groups of objects based on their attributes (dimensions).
(2) Clustering techniques have been studied in statistics (multivariate analysis: cluster analysis), machine learning, and databases (CLARANS, BIRCH, DBSCAN).
(3) Clustering techniques: (a) partitional clustering: the K-means method, CLARANS, BIRCH, DBSCAN; (b) hierarchical clustering.

Motivation: the need for new algorithms
(1) Effective treatment of high dimensionality: to effectively extract information from a huge amount of data in databases, the running time of an algorithm must be predictable and usable on large databases.
(2) Interpretability of results: users expect clustering results on high dimensional data to be interpretable and comprehensible.
(3) Scalability and usability: many clustering algorithms do not perform well on a large database that may contain millions of objects, and clustering a sample of a given data set may lead to biased results. In other words, the clustering technique should be fast, should scale with the number of dimensions and the size of the input, and should be insensitive to the order of the input data.

Contributions of the paper
(1) CLIQUE satisfies the above desiderata (effectiveness, interpretability, scalability, and usability).
(2) CLIQUE automatically finds subspaces with high-density clusters.
(3) CLIQUE generates a minimal description for each cluster as a DNF expression.
(4) Empirical evaluation shows that CLIQUE scales linearly with the number of input records and scales well with the number of dimensions in the data and the dimensionality of the hidden clusters.

Subspace Clustering
Definitions: (a) unit, (b) dense unit, (c) cluster, (d) minimal description of a cluster.
In Figure 1, the two-dimensional space (age, salary) has been partitioned by a 10x10 grid, i.e. ξ = 10.
The unit u = (30 ≤ age < 35) ∧ (1 ≤ salary < 2).
A and B are both regions:
A = (30 ≤ age < 35) ∧ (4 ≤ salary < 8)
B = (40 ≤ age < 60) ∧ (2 ≤ salary < 6)
Assuming the dense units have been shaded, A ∪ B is a cluster (A and B are connected regions), while A ∩ B is not a maximal region.
The minimal description for the cluster A ∪ B is the DNF expression:
((30 ≤ age < 35) ∧ (4 ≤ salary < 8)) ∨ ((40 ≤ age < 60) ∧ (2 ≤ salary < 6))
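As a small illustration (not from the paper), the DNF description above can be read directly as a membership test; the function name `in_cluster` is ours:

```python
# Hypothetical membership test for the cluster A ∪ B of Figure 1,
# transcribed from the slide's DNF description.
def in_cluster(age, salary):
    return ((30 <= age < 35 and 4 <= salary < 8) or
            (40 <= age < 60 and 2 <= salary < 6))

assert in_cluster(32, 5)       # falls in region A
assert in_cluster(45, 3)       # falls in region B
assert not in_cluster(32, 2)   # in neither region
```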

In Figure 2, assume τ = 20% (the density threshold). If selectivity(u) > τ, then u is a dense unit, where selectivity is the fraction of the total data points contained in the unit. No 2-dimensional unit is dense, and there are no clusters in the original data space. But if the points are projected onto the salary dimension, there are three 1-dimensional dense units, and there are two clusters in the 1-dimensional salary subspace: C′ = (5 ≤ salary < 7) and D′ = (2 ≤ salary < 3). There is no dense unit and no cluster in the 1-dimensional age subspace.
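A minimal sketch of how 1-dimensional dense units could be identified; the names `find_dense_units_1d`, `lo`, and `hi` are ours, and we assume points are indexable sequences. The paper fixes only the parameters ξ (grid resolution) and τ (density threshold):

```python
from collections import Counter

def find_dense_units_1d(points, dim, lo, hi, xi, tau):
    """Partition dimension `dim` of the range [lo, hi) into xi equal
    intervals and return the interval indices whose selectivity
    (fraction of all points falling in the unit) exceeds tau."""
    width = (hi - lo) / xi
    counts = Counter(
        min(int((p[dim] - lo) / width), xi - 1)  # clamp points at hi
        for p in points
    )
    n = len(points)
    return {u for u, c in counts.items() if c / n > tau}
```

For a data set like Figure 2, calling e.g. `find_dense_units_1d(points, dim=1, lo=0, hi=10, xi=10, tau=0.20)` on the salary dimension would return the indices of the three dense salary intervals.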

The CLIQUE Algorithm
CLIQUE consists of the following three steps:
(1) Identification of subspaces that contain clusters.
(2) Identification of clusters.
(3) Generation of minimal descriptions for the clusters.

(1) Identification of subspaces that contain clusters
The difficulty in identifying subspaces that contain clusters lies in finding dense units in different subspaces.
(a) We use a bottom-up algorithm that finds dense units by exploiting the monotonicity of the clustering criterion with respect to dimensionality to prune the search space.
Lemma 1 (monotonicity): if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space.
The bottom-up algorithm proceeds as follows (see the sketch below):
- Determine the 1-dimensional dense units and self-join them to get the candidate 2-dimensional units. In general, once the (k-1)-dimensional dense units D_{k-1} are found, we self-join D_{k-1} to get the candidate k-dimensional units C_k.
- Discard those candidates in C_k that have a (k-1)-dimensional projection not included in D_{k-1}.
(b) The bottom-up algorithm is made faster with MDL-based pruning.
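A sketch of the candidate-generation step, assuming each dense unit is represented as a frozenset of (dimension, interval_index) pairs. The paper's actual join orders the dimensions to avoid generating duplicates; this simplified version relies on set semantics instead:

```python
from itertools import combinations

def candidate_k_units(dense_prev):
    """Self-join the (k-1)-dimensional dense units in `dense_prev` to
    form candidate k-dimensional units, then prune candidates that have
    a non-dense (k-1)-dimensional projection (Lemma 1)."""
    candidates = set()
    for u, v in combinations(dense_prev, 2):
        merged = u | v
        # a valid join adds exactly one new (dimension, interval) pair...
        if len(merged) == len(u) + 1:
            dims = [d for d, _ in merged]
            # ...and that pair must lie in a dimension not already used
            if len(dims) == len(set(dims)):
                candidates.add(merged)
    # prune: every (k-1)-dim projection of a candidate must be dense
    return {c for c in candidates
            if all(c - {pair} in dense_prev for pair in c)}
```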

(2) Finding clusters
The input to this step of CLIQUE is a set of dense units D, all in the same k-dimensional space. The output is a partition of D into D1, D2, ..., Dq such that all units within each Di are connected, and no two units Ui ∈ Di, Uj ∈ Dj with i ≠ j are connected.
Depth-first search algorithm: we use a depth-first search to find the connected components of the graph. Starting with some unit U in D, we assign it the first cluster number and find all the units it is connected to. Then, if there are still units in D that have not yet been visited, we pick one and repeat the procedure.
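A sketch of this depth-first search, assuming each unit is a dict mapping dimension to interval index, with two units connected when they share a common face (all indices equal except one, which differs by exactly 1):

```python
def neighbors(u, v):
    """Units share a common face: equal in all dimensions but one,
    where the interval indices differ by exactly 1."""
    diffs = [d for d in u if u[d] != v[d]]
    return len(diffs) == 1 and abs(u[diffs[0]] - v[diffs[0]]) == 1

def find_clusters(dense_units):
    """Partition a list of unit dicts (all in the same k-dim subspace)
    into connected components via depth-first search."""
    unvisited = list(dense_units)
    clusters = []
    while unvisited:
        stack, component = [unvisited.pop()], []
        while stack:
            u = stack.pop()
            component.append(u)
            still_unvisited = []
            for v in unvisited:
                (stack if neighbors(u, v) else still_unvisited).append(v)
            unvisited = still_unvisited
        clusters.append(component)
    return clusters
```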

(3) Generating minimal cluster descriptions
The input to this step consists of disjoint clusters in a k-dimensional subspace. The goal is to generate a minimal description of each cluster in two steps: (a) covering with maximal regions, and (b) computing a minimal cover.
(a) Covering with maximal regions: we begin with an arbitrary dense unit U1 ∈ C and greedily grow a maximal region R1 that covers U1; we add R1 to R. Then we find another unit U2 ∈ C that is not yet covered by any maximal region in R and greedily grow a maximal region R2 that covers U2. We repeat this procedure until all units in C are covered by some maximal region in R.

How to obtain a maximal region covering a dense unit U
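One plausible sketch of this greedy growth, reusing the unit representation from the sketches above; the paper grows dimensions in a fixed order, whereas ties here are broken by Python's dict iteration order:

```python
from itertools import product

def covered(region, cluster):
    """True when every unit inside `region` is a dense unit of
    `cluster` (a list of unit dicts, e.g. one component returned by
    find_clusters above)."""
    dims = list(region)
    ranges = [range(region[d][0], region[d][1] + 1) for d in dims]
    return all(dict(zip(dims, idx)) in cluster for idx in product(*ranges))

def grow_maximal_region(unit, cluster):
    """Greedily extend `unit` one dimension at a time, as far as the
    dense units of `cluster` allow, returning the region as
    {dimension: (low_index, high_index)} with inclusive bounds."""
    region = {d: (i, i) for d, i in unit.items()}
    for d in region:
        lo, hi = region[d]
        while covered({**region, d: (lo - 1, hi)}, cluster):
            lo -= 1
        while covered({**region, d: (lo, hi + 1)}, cluster):
            hi += 1
        region[d] = (lo, hi)
    return region
```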

(b) Minimal cover
The last step of CLIQUE takes as input a cover for each cluster and finds a minimal cover. Minimality is defined in terms of the number of maximal regions required to cover the cluster.
Removal heuristic:
(1) Remove from the cover the smallest (in number of units) maximal region that is redundant (i.e., every unit it covers is also contained in some other maximal region).
(2) Break ties arbitrarily.
(3) Repeat the procedure until no maximal region can be removed.
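A sketch of the removal heuristic, with `units_of(r)` an assumed helper that enumerates the (hashable) units covered by region r:

```python
def minimal_cover(cover, units_of):
    """Greedy removal heuristic. `cover` is a list of maximal regions;
    `units_of(r)` returns the set of units that region r covers."""
    cover = list(cover)
    while True:
        redundant = [
            r for r in cover
            if units_of(r) <= set().union(
                *(units_of(s) for s in cover if s is not r))
        ]
        if not redundant:
            return cover
        # drop the smallest redundant region; ties broken arbitrarily
        cover.remove(min(redundant, key=lambda r: len(units_of(r))))
```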

Performance Experiments
We empirically evaluate CLIQUE using synthetic data (generated with the generator from M. Zait and H. Messatfa, "A comparative study of clustering methods").
The goals of the experiments are to assess:
Efficiency: determine how the running time scales with
- the dimensionality of the data space,
- the dimensionality of the clusters,
- the size of the data.
Accuracy: test whether CLIQUE recovers known clusters in some subspaces of a high dimensional data space.

Synthetic data results

Accuracy: comparisons with BIRCH, DBSCAN, and SVD
We used clusters embedded in 5-dimensional subspaces while varying the dimensionality of the space from 5 to 50. CLIQUE was able to recover all clusters in every case.

Real data results

Conclusions
(1) The problem of high dimensionality is often tackled by requiring the user to specify the subspace for cluster analysis, but user identification of subspaces is quite error-prone. CLIQUE can find clusters embedded in subspaces of high dimensional data without requiring the user to guess which subspaces might contain interesting clusters.
(2) CLIQUE generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension.
(3) CLIQUE is insensitive to the order of the input records. Some clustering algorithms are sensitive to input order; for example, the same data set, input in different orderings, may generate dramatically different clusters.
(4) CLIQUE does not presume a canonical data distribution, which is very useful in applications.

(5) Empirical evaluation shows that CLIQUE scales linearly with the size of the input and scales well with the number of dimensions in the data.
(6) CLIQUE can accurately discover clusters embedded in lower dimensional subspaces.