CLUSTERING HIGH-DIMENSIONAL DATA Elsayed Hemayed Data Mining Course.

Outline
- Introduction
- Solution Techniques
  - Feature/Attribute Transformation
  - Feature/Attribute Selection
  - Subspace Clustering
- CLIQUE: A Dimension-Growth Subspace Clustering Method
  - Major steps
  - Example
  - Strength and Weakness

Introduction
- Most clustering methods are designed for low-dimensional data and run into difficulties when the dimensionality of the data becomes very high (say, over 10 dimensions, or even thousands of dimensions for some tasks).
- Issues:
  - Noise
  - The distance measure becomes meaningless

What happens when dimensionality increases?
- Only a small number of dimensions are relevant to certain clusters; data in the irrelevant dimensions produce noise and mask the real clusters.
- Data become increasingly sparse because the data points are likely located in different dimensional subspaces.
- As a result, data points can be considered almost equally distant from one another, and the distance measure, which is essential for cluster analysis, becomes meaningless.
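The loss of contrast between distances can be observed with a small experiment. The following is a minimal sketch, not part of the original slides: it assumes points drawn uniformly at random from the unit hypercube and measures the relative contrast (farthest minus nearest distance, divided by the nearest distance) from a random query point. The function name `relative_contrast`, the sample size, and the seed are illustrative choices.

```python
import math
import random


def relative_contrast(num_points, dims, rng):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    query = [rng.random() for _ in range(dims)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(dims)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


rng = random.Random(0)  # fixed seed so the run is repeatable
contrast_by_dims = {d: relative_contrast(200, d, rng) for d in (2, 10, 100, 1000)}
for dims, contrast in contrast_by_dims.items():
    print(f"{dims:>4} dimensions: relative contrast = {contrast:.3f}")
```

Under these assumptions the contrast shrinks as the number of dimensions grows, which is the sense in which all points start to look equally distant.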

Solution Techniques
- Feature/Attribute Transformation
- Feature/Attribute Selection
- Subspace Clustering

Feature Transformation
- Examples:
  - Principal component analysis
  - Singular value decomposition
- These methods transform the data onto a smaller space while generally preserving the original relative distances between objects.
- They summarize the data by creating linear combinations of the attributes.
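As an illustration of "linear combinations of the attributes", here is a minimal sketch of the first principal component, computed with power iteration on the sample covariance matrix. It is a teaching sketch under stated assumptions, not a production PCA: the toy data are made up, the helper names `first_principal_component` and `project` are hypothetical, and the starting vector is assumed not to be orthogonal to the dominant direction.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (attribute means, unit direction of maximum variance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d  # starting guess (assumption: not orthogonal to the answer)
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec


def project(rows, means, direction):
    """Replace each row by its single coordinate along the given direction."""
    return [sum((row[j] - means[j]) * direction[j] for j in range(len(means)))
            for row in rows]


# Toy data (made up): the second attribute is roughly twice the first.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
means, pc1 = first_principal_component(data)
scores = project(data, means, pc1)
print("direction:", [round(v, 3) for v in pc1])
print("1-D scores:", [round(s, 3) for s in scores])
```

Each score is a weighted sum of both original attributes, which is why transformed features keep every original attribute in play and can be hard to interpret.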

Feature Transformation Issues
- They do not remove any of the original attributes from analysis.
- The irrelevant information may mask the real clusters, even after transformation.
- The transformed features (attributes) are often difficult to interpret, making the clustering results less useful.
- Thus, feature transformation is only suited to data sets where most of the dimensions are relevant to the clustering task.
- Unfortunately, real-world data sets tend to have many highly correlated, or redundant, dimensions.

Feature Selection
- It is commonly used for data reduction by removing irrelevant or redundant dimensions (or attributes).
- Given a set of attributes, attribute subset selection finds the subset of attributes that is most relevant to the data mining task.
- Attribute subset selection involves searching through various attribute subsets and evaluating these subsets using certain criteria.
- Supervised learning: the most relevant set of attributes is found with respect to the given class labels.
- Unsupervised process: for example, entropy analysis, which is based on the property that entropy tends to be low for data that contain tight clusters.
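The entropy property can be illustrated with a small sketch. It assumes one simple way of estimating entropy, namely the Shannon entropy of an equal-width histogram of a single attribute; the slides do not specify which estimator the course uses. The function name `grid_entropy` and the synthetic attributes are hypothetical.

```python
import math
import random
from collections import Counter


def grid_entropy(values, bins=10, low=0.0, high=1.0):
    """Shannon entropy (in bits) of one attribute after equal-width binning."""
    width = (high - low) / bins
    counts = Counter(min(int((v - low) / width), bins - 1) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


rng = random.Random(1)
# Hypothetical attributes: one holds two tight clusters, one is spread uniformly.
clustered = ([rng.uniform(0.22, 0.28) for _ in range(100)]
             + [rng.uniform(0.72, 0.78) for _ in range(100)])
uniform = [rng.random() for _ in range(200)]

entropy_clustered = grid_entropy(clustered)
entropy_uniform = grid_entropy(uniform)
print(f"clustered attribute: {entropy_clustered:.2f} bits")
print(f"uniform attribute:   {entropy_uniform:.2f} bits")
```

The clustered attribute occupies only two bins and so has low entropy, while the uniform attribute spreads over all bins and has high entropy; an unsupervised selector could rank attributes by such a score.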

Subspace Clustering
- It is an extension to attribute subset selection that has shown its strength at high-dimensional clustering.
- It is based on the observation that different subspaces may contain different, meaningful clusters.
- Subspace clustering searches for groups of clusters within different subspaces of the same data set.
- The problem becomes how to find such subspace clusters effectively and efficiently.

High-Dimensional Data Clustering Approaches
- Dimension-Growth Subspace Clustering
  - CLIQUE (CLustering In QUEst)
- Dimension-Reduction Projected Clustering
  - PROCLUS (PROjected CLUStering)
- Frequent Pattern Based Clustering
  - pCluster

CLIQUE: A Dimension-Growth Subspace Clustering Method

CLIQUE Overview
- CLIQUE is used for clustering high-dimensional data stored in large tables. By high-dimensional data we mean records that have many attributes.
- CLIQUE identifies the dense units in the subspaces of the high-dimensional data space and uses these subspaces to provide more efficient clustering.

Terminology
- Unit: after a grid structure is formed on the space, each rectangular cell is called a unit.
- Dense: a unit is dense if the fraction of the total data points contained in the unit exceeds an input model parameter (the density threshold).
- Cluster: a cluster is defined as a maximal set of connected dense units.
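The first two definitions can be made concrete with a short sketch. It assumes every attribute lies in the range [low, high] and is cut into the same number of equal-length intervals; the names `unit_of`, `dense_units`, `intervals`, and `threshold` are illustrative, and the five points are made up.

```python
from collections import Counter


def unit_of(point, intervals, low, high):
    """Map a point to the grid unit (tuple of interval indices) that contains it."""
    width = (high - low) / intervals
    return tuple(min(int((x - low) / width), intervals - 1) for x in point)


def dense_units(points, intervals, threshold, low=0.0, high=1.0):
    """Return the units whose fraction of all points exceeds the density threshold."""
    counts = Counter(unit_of(p, intervals, low, high) for p in points)
    return {unit for unit, c in counts.items() if c / len(points) > threshold}


# Made-up 2-D points: three fall into the same unit, two are isolated.
points = [(0.12, 0.15), (0.18, 0.11), (0.14, 0.19), (0.85, 0.82), (0.45, 0.95)]
print(dense_units(points, intervals=5, threshold=0.4))  # prints {(0, 0)}
```

Unit (0, 0) holds 3 of the 5 points (a fraction of 0.6), which exceeds the threshold of 0.4; every other occupied unit holds a fraction of 0.2 and is not dense.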

How Does CLIQUE Work?
- The task is to cluster a set of records described by n attributes (an n-dimensional space).
- MAJOR STEPS:
  - CLIQUE partitions each dimension (each one-dimensional subspace) into the same number of equal-length intervals.
  - Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units.
  - CLIQUE finds dense units of higher dimensionality by first finding the dense units in the subspaces.

CLIQUE: Major Steps (cont.)
- For example, in 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (the 2-dimensional subspaces).
- It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality may exist.

CLIQUE: Major Steps (cont.)
- Each maximal set of connected dense units is considered a cluster.
- Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces.
- The information from the subspaces is then used to find clusters in the n-dimensional space.
- Note that all cluster boundaries are either horizontal or vertical (axis-parallel). This is due to the rectangular shape of the grid cells.
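Grouping dense units into clusters amounts to finding connected components on the grid. The sketch below assumes that two units are connected when they share a face (their indices differ by one in exactly one dimension) or are linked through a chain of such units; the function name `connected_clusters` and the sample units are illustrative.

```python
from collections import deque


def connected_clusters(dense):
    """Group dense units into maximal sets of face-adjacent units (breadth-first search)."""
    remaining = set(dense)
    clusters = []
    while remaining:
        start = remaining.pop()
        cluster, queue = {start}, deque([start])
        while queue:
            unit = queue.popleft()
            # Visit the two neighbors along every axis.
            for axis in range(len(unit)):
                for step in (-1, 1):
                    neighbor = unit[:axis] + (unit[axis] + step,) + unit[axis + 1:]
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        cluster.add(neighbor)
                        queue.append(neighbor)
        clusters.append(cluster)
    return clusters


# Made-up dense units on a 2-D grid: two separate groups.
dense = {(0, 0), (0, 1), (1, 1), (4, 4), (4, 3)}
clusters = connected_clusters(dense)
for cluster in clusters:
    print(sorted(cluster))
```

Because clusters are unions of grid cells, their boundaries follow the grid lines, which is why they are always axis-parallel.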

Example
- Suppose we want to cluster a set of records that have three attributes: salary, vacation, and age.
- The data space for this data is 3-dimensional.
[Figure: 3-D data space with axes age, salary, and vacation]

Example (Cont.)
- After plotting the data objects, each dimension (i.e., salary, vacation, and age) is split into intervals of equal length.
- Then we form a 3-dimensional grid on the space, each unit of which is a 3-D rectangle.
- Our goal is now to find the dense 3-D rectangular units.

Example (Cont.)
- To do this, we find the dense units of the subspaces of this 3-D space.
- First, we find the dense units with respect to salary and age. This means that we look at the salary-age plane and find all the 2-D rectangular units that are dense.
- We also find the dense 2-D rectangular units for the vacation-age plane.
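Finding the dense units of a plane means ignoring the third attribute and counting points per 2-D cell. The sketch below assumes that every attribute has been rescaled to the range [0, 1); the records, the function name `dense_units_in_plane`, and the parameter values are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def dense_units_in_plane(records, attrs, intervals, threshold):
    """Return the dense 2-D units for one pair of attributes.

    records -- list of dicts with one value per attribute, each scaled to [0, 1)
    attrs   -- the two attribute names that span the plane
    """
    counts = Counter(
        tuple(min(int(rec[a] * intervals), intervals - 1) for a in attrs)
        for rec in records
    )
    return {unit for unit, c in counts.items() if c / len(records) > threshold}


# Hypothetical records, already scaled to [0, 1).
records = [
    {"age": 0.31, "salary": 0.42, "vacation": 0.15},
    {"age": 0.35, "salary": 0.47, "vacation": 0.12},
    {"age": 0.38, "salary": 0.44, "vacation": 0.18},
    {"age": 0.82, "salary": 0.10, "vacation": 0.90},
]
planes = {
    pair: dense_units_in_plane(records, pair, intervals=5, threshold=0.5)
    for pair in combinations(("age", "salary", "vacation"), 2)
}
for pair, units in planes.items():
    print(pair, units)
```

With these made-up records, each of the three planes contains exactly one dense unit, formed by the first three records.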

Example (Cont.)
- Now let us try to visualize the dense units of the two planes in a 3-D figure.
[Figure: dense units of the salary-age and vacation-age planes shown in the 3-D space]

Example (Cont.)
- We can extend the dense areas in the vacation-age plane inwards.
- We can extend the dense areas in the salary-age plane upwards.
- The intersection of these two spaces gives us a candidate search space in which 3-dimensional dense units may exist.
- We then find the dense units in the salary-vacation plane and form an extension of the subspace that represents these dense units.

Example (Cont.)
- Now we intersect the candidate search space with the extension of the dense units of the salary-vacation plane in order to get all the 3-D dense units.
- So, what was the main idea?
- We used the dense units in the subspaces in order to find the dense units in the 3-dimensional space.
- Once the dense units are found, clusters are easy to find.

Reflecting upon CLIQUE
- Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces?
- Because it relies on the Apriori property, which employs prior knowledge of the items in the search space so that portions of the space can be pruned.
- The property for CLIQUE says that if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space. Conversely, if any (k-1)-dimensional projection is not dense, the k-dimensional unit cannot be dense and need not be examined.
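The pruning step can be sketched as Apriori-style candidate generation. This is a minimal illustration under stated assumptions, not the published CLIQUE implementation: a unit is represented as a frozenset of (dimension, interval index) pairs, and the function name `candidate_units` and the sample dense units are hypothetical.

```python
from itertools import combinations


def candidate_units(dense_prev):
    """Build k-dimensional candidates from the dense (k-1)-dimensional units.

    dense_prev -- set of units, each a frozenset of (dimension, interval) pairs
    """
    candidates = set()
    for a, b in combinations(dense_prev, 2):
        joined = a | b
        dims = {dim for dim, _ in joined}
        # Keep joins that add exactly one new dimension with no conflicting interval.
        if len(joined) == len(a) + 1 and len(dims) == len(joined):
            # Apriori pruning: every (k-1)-dimensional projection must be dense.
            if all(joined - {pair} in dense_prev for pair in joined):
                candidates.add(joined)
    return candidates


# Made-up dense 2-D units over the attributes age, salary, and vacation.
dense_2d = {
    frozenset({("age", 3), ("salary", 2)}),
    frozenset({("age", 3), ("vacation", 1)}),
    frozenset({("salary", 2), ("vacation", 1)}),
    frozenset({("age", 5), ("salary", 4)}),
    frozenset({("age", 5), ("vacation", 0)}),  # its salary-vacation projection is not dense
}
candidates = candidate_units(dense_2d)
for unit in candidates:
    print(sorted(unit))
```

Only the candidate whose three 2-D projections are all dense survives. The surviving candidates still have to be checked against the data, because the Apriori property is a necessary condition for density, not a sufficient one.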

Strength and Weakness of CLIQUE
- Strength
  - It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
  - It is quite efficient.
  - It is insensitive to the order of records in the input and does not presume any canonical data distribution.
  - It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
- Weakness
  - Obtaining meaningful clustering results depends on proper tuning of the grid size (which is a stable structure here) and the density threshold.
  - The accuracy of the clustering result may be degraded in exchange for the simplicity of the method.

Summary
- Introduction
- Solution Techniques
  - Feature/Attribute Transformation
  - Feature/Attribute Selection
  - Subspace Clustering
- CLIQUE: A Dimension-Growth Subspace Clustering Method
  - Major steps
  - Example
  - Strength and Weakness
