CLUSTERING HIGH-DIMENSIONAL DATA Elsayed Hemayed Data Mining Course

Outline
- Introduction
- Solution Techniques
  - Feature/Attribute Transformation
  - Feature/Attribute Selection
  - Subspace Clustering
- CLIQUE: A Dimension-Growth Subspace Clustering Method
  - Major steps
  - Example
  - Strengths and Weaknesses

Introduction
- Most clustering methods are designed for low-dimensional data and run into difficulty when the dimensionality of the data grows very high (say, over 10 dimensions, or even thousands of dimensions for some tasks).
- Issues:
  - Irrelevant dimensions produce noise
  - Distance measures become meaningless

What happens when dimensionality increases?
- Only a small number of dimensions may be relevant to a given cluster; the remaining dimensions produce noise and mask the real clusters.
- Data become increasingly sparse because the data points likely lie in different dimensional subspaces; the points end up looking almost equally distant from one another, so the distance measure, which is essential for cluster analysis, becomes meaningless. The sketch below illustrates this effect.
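A minimal sketch, assuming NumPy, of this distance-concentration effect: as the dimensionality grows, the gap between a point's nearest and farthest neighbors shrinks relative to the distances themselves. The sample sizes and random seed are illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                 # 500 uniform random points in d dimensions
    q = rng.random(d)                        # a random query point
    dists = np.linalg.norm(X - q, axis=1)    # Euclidean distance to every point
    # Relative contrast between farthest and nearest neighbor; shrinks toward 0 as d grows
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast = {contrast:.3f}")
```

When the contrast is near zero, "nearest" and "farthest" are barely distinguishable, which is why distance-based clustering breaks down in the full high-dimensional space.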

Solution Techniques
- Feature/Attribute Transformation
- Feature/Attribute Selection
- Subspace Clustering

Feature Transformation
- Transforms the data onto a smaller space while preserving the original relative distances between objects.
- Summarizes the data by creating linear combinations of the attributes.
- Examples (see the PCA sketch below):
  - Principal component analysis (PCA)
  - Singular value decomposition (SVD)
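A minimal sketch, assuming scikit-learn and NumPy are available, of feature transformation with PCA: the data are projected onto a few linear combinations of the original attributes. The choice of 2 components and the random data are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))     # 200 objects described by 50 attributes

pca = PCA(n_components=2)          # keep the 2 directions of largest variance
Z = pca.fit_transform(X)           # each new feature is a linear combination
                                   # of the original 50 attributes
print(Z.shape)                     # (200, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```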

Feature Transformation Issues
- These methods do not remove any of the original attributes from the analysis.
- Irrelevant information may mask the real clusters, even after transformation.
- The transformed features (attributes) are often difficult to interpret, making the clustering results less useful.
- Thus, feature transformation is only suited to data sets where most of the dimensions are relevant to the clustering task.
- Unfortunately, real-world data sets tend to have many highly correlated, or redundant, dimensions.

Feature Selection
- It is commonly used for data reduction: removing irrelevant or redundant dimensions (attributes).
- Given a set of attributes, attribute subset selection finds the subset of attributes that are most relevant to the data mining task.
- Attribute subset selection involves searching through various attribute subsets and evaluating them using certain criteria.
- Supervised learning: the most relevant set of attributes is found with respect to the given class labels.
- Unsupervised process: for example, entropy analysis, which is based on the property that entropy tends to be low for data that contain tight clusters (see the sketch below).
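A simplified sketch, assuming NumPy, of entropy-based unsupervised attribute ranking: discretize each attribute into bins and score it by the entropy of the bin occupancy, so that low entropy flags attributes whose values concentrate in a few tight regions. This per-attribute histogram score is an illustrative simplification, not the exact entropy measure referenced in the literature.

```python
import numpy as np

def attribute_entropy(col, bins=10):
    counts, _ = np.histogram(col, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                          # drop empty bins (0 * log 0 := 0)
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)
# One attribute with two tight clusters, one irrelevant uniform attribute
clustered = np.concatenate([rng.normal(0, 0.1, 100), rng.normal(5, 0.1, 100)])
uniform = rng.uniform(0, 5, 200)
X = np.column_stack([clustered, uniform])

scores = [attribute_entropy(X[:, j]) for j in range(X.shape[1])]
print(scores)    # the clustered attribute gets the markedly lower entropy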

Subspace Clustering
- An extension of attribute subset selection that has shown its strength for high-dimensional clustering.
- It is based on the observation that different subspaces may contain different, meaningful clusters.
- Subspace clustering searches for groups of clusters within different subspaces of the same data set.
- The problem becomes how to find such subspace clusters effectively and efficiently.

High-Dimensional Data Clustering Approaches
- Dimension-growth subspace clustering
  - CLIQUE (CLustering In QUEst)
- Dimension-reduction projected clustering
  - PROCLUS (PROjected CLUStering)
- Frequent pattern-based clustering
  - pCluster

CLIQUE: A Dimension-Growth Subspace Clustering Method

CLIQUE Overview
- CLIQUE is used for clustering high-dimensional data stored in large tables. By high-dimensional data we mean records that have many attributes.
- CLIQUE identifies the dense units in the subspaces of the high-dimensional data space, and uses these subspaces to provide more efficient clustering.

Terminology
- Unit: after forming a grid structure on the space, each rectangular cell is called a unit.
- Dense: a unit is dense if the fraction of the total data points contained in the unit exceeds an input density threshold (a model parameter).
- Cluster: a cluster is defined as a maximal set of connected dense units (a sketch of the first two definitions follows).
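A minimal sketch, assuming NumPy, of the grid and density definitions: every point is mapped to its grid cell, and a cell counts as dense when it holds more than a threshold fraction of the points. The parameter names xi and tau are hypothetical stand-ins for CLIQUE's two inputs (intervals per dimension, density threshold).

```python
import numpy as np
from collections import Counter

def dense_units(X, xi=10, tau=0.05):
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Interval index of each coordinate, per dimension (the grid structure)
    cells = np.floor((X - lo) / (hi - lo + 1e-12) * xi).astype(int)
    cells = np.clip(cells, 0, xi - 1)
    counts = Counter(map(tuple, cells))           # points per unit
    n = len(X)
    return {unit for unit, c in counts.items() if c / n > tau}

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.2, (100, 2)),   # two tight 2-D groups
               rng.normal([7, 7], 0.2, (100, 2))])
print(dense_units(X))    # only the units covering the two groups survive
```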

How Does CLIQUE Work?
- Goal: to cluster a set of records described by n attributes (an n-dimensional space).
- MAJOR STEPS:
  - CLIQUE partitions each 1-dimensional subspace into the same number of equal-length intervals.
  - Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units.
  - CLIQUE finds dense units of higher dimensionality by first finding the dense units in the lower-dimensional subspaces.

CLIQUE: Major Steps (cont.)
- For example, in a 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (2-dimensional subspaces).
- It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality may exist.

CLIQUE: Major Steps (cont.)
- Each maximal set of connected dense units is considered a cluster.
- Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces.
- The subspace information is then used to find clusters in the n-dimensional space.
- Note that all cluster boundaries are either horizontal or vertical; this is due to the nature of the rectangular grid cells.
- A sketch of the connected-units cluster definition follows.
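A minimal sketch, assuming only the Python standard library, of the cluster definition: dense units are nodes, units that share a face (grid indices differing by one in exactly one dimension) are connected, and each connected component is reported as one cluster.

```python
from collections import deque

def clusters_from_dense_units(dense):
    dense = set(dense)
    seen, clusters = set(), []
    for start in dense:
        if start in seen:
            continue
        comp, queue = [], deque([start])   # BFS over one connected component
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.append(u)
            # Axis-aligned neighbors: one grid coordinate shifted by +/- 1
            for i in range(len(u)):
                for step in (-1, 1):
                    v = u[:i] + (u[i] + step,) + u[i + 1:]
                    if v in dense and v not in seen:
                        seen.add(v)
                        queue.append(v)
        clusters.append(comp)
    return clusters

print(clusters_from_dense_units({(0, 0), (0, 1), (5, 5)}))
# -> [[(0, 0), (0, 1)], [(5, 5)]]  (grouping is deterministic; order may vary)
```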

Example
- Suppose we want to cluster a set of records that have three attributes: salary, vacation, and age.
- The data space for this data would be 3-dimensional.
[Figure: the 3-D data space, with axes salary, vacation, and age]

Example (Cont.)
- After plotting the data objects, each dimension (i.e., salary, vacation, and age) is split into intervals of equal length.
- We then form a 3-dimensional grid on the space, each unit of which is a 3-D rectangle.
- Now our goal is to find the dense 3-D rectangular units.

Example (Cont.)
- To do this, we find the dense units of the subspaces of this 3-D space.
- First we find the dense units with respect to age and salary: we look at the salary-age plane and find all the 2-D rectangular units that are dense.
- We also find the dense 2-D rectangular units for the vacation-age plane.

Example (Cont.)
[Figure: the dense 2-D units found in the salary-age and vacation-age planes]

Example (Cont.)
- Now let us visualize the dense units of the two planes on a 3-D figure.
[Figure: the dense units of both planes shown together in the 3-D data space]

Example (Cont.)
- We can extend the dense areas in the vacation-age plane inwards.
- We can extend the dense areas in the salary-age plane upwards.
- The intersection of these two extensions gives us a candidate search space in which 3-dimensional dense units may exist.
- We then find the dense units in the salary-vacation plane and form an extension of the subspace that represents these dense units.

Example (Cont.)
- Now we intersect the candidate search space with the extension of the dense units of the salary-vacation plane to get all the 3-D dense units.
- So, what was the main idea? We used the dense units in the subspaces to find the dense units in the 3-dimensional space.
- After finding the dense units, it is very easy to find the clusters.

Reflecting upon CLIQUE
- Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces?
- Because the Apriori property employs prior knowledge of the items in the search space, so that portions of the space can be pruned.
- The property, as used by CLIQUE, says that if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space. Equivalently, a k-dimensional unit cannot be dense unless all of its (k-1)-dimensional projections are dense, which is what justifies the pruning (see the sketch below).
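A minimal sketch, assuming only the Python standard library, of the Apriori-style candidate generation this pruning enables: a k-dimensional unit is proposed only when every one of its (k-1)-dimensional projections is already dense. The unit representation (sorted tuples of (dimension, interval) pairs) is an illustrative choice, and real CLIQUE would still scan the data to confirm which candidates are actually dense.

```python
from itertools import combinations

def candidates(dense_prev, k):
    """Propose candidate k-dimensional units from dense (k-1)-dimensional units.
    A unit is a sorted tuple of (dimension, interval) pairs."""
    prev = set(dense_prev)
    cands = set()
    for a in prev:
        for b in prev:
            merged = tuple(sorted(set(a) | set(b)))
            if len(merged) != k:
                continue
            if len({d for d, _ in merged}) != k:  # one interval per dimension
                continue
            # Apriori pruning: every (k-1)-projection must itself be dense
            if all(sub in prev for sub in combinations(merged, k - 1)):
                cands.add(merged)
    return cands

# Dense 1-D units: interval 3 of dim 0, interval 7 of dim 1, interval 2 of dim 2
one_d = {((0, 3),), ((1, 7),), ((2, 2),)}
print(candidates(one_d, 2))
# -> {((0, 3), (1, 7)), ((0, 3), (2, 2)), ((1, 7), (2, 2))}
```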

Strengths and Weaknesses of CLIQUE
- Strengths:
  - It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
  - It is quite efficient.
  - It is insensitive to the order of the input records and does not presume any canonical data distribution.
  - It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
- Weaknesses:
  - Obtaining meaningful clustering results depends on proper tuning of the grid size (which is fixed in advance) and the density threshold.
  - The accuracy of the clustering result may be degraded as the price of the method's simplicity.

Summary
- Introduction
- Solution Techniques
  - Feature/Attribute Transformation
  - Feature/Attribute Selection
  - Subspace Clustering
- CLIQUE: A Dimension-Growth Subspace Clustering Method
  - Major steps
  - Example
  - Strengths and Weaknesses