Clustering High-Dimensional Data. Clustering high-dimensional data – Many applications: text documents, DNA micro-array data – Major challenges: Many.

Slides:



Advertisements
Similar presentations
Cluster Analysis. Midterm: Monday Oct 29, 4PM  Lecture Notes from Sept 5, 2007 until Oct 15, Chapters from Textbook and papers discussed in class.
Advertisements

Dimensionality reduction. Outline From distances to points : – MultiDimensional Scaling (MDS) Dimensionality Reductions or data projections Random projections.
Dimensionality Reduction PCA -- SVD
Spatial and Temporal Data Mining
Cluster Analysis Part III. Learning Objectives Density-Based Methods Grid-Based Methods Model-Based Clustering Methods Outlier Analysis Summary.
Clustering Prof. Navneet Goyal BITS, Pilani
Data Mining Techniques: Clustering
Dimensionality reduction. Outline From distances to points : – MultiDimensional Scaling (MDS) – FastMap Dimensionality Reductions or data projections.
Clustering IV 1.
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Dimensionality reduction. Outline From distances to points : – MultiDimensional Scaling (MDS) – FastMap Dimensionality Reductions or data projections.
Cluster Analysis.
K nearest neighbor and Rocchio algorithm
Principal Component Analysis
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Clustering II.
L16: Micro-array analysis Dimension reduction Unsupervised clustering.
What is Cluster Analysis
The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.
University at BuffaloThe State University of New York WaveCluster A multi-resolution clustering approach qApply wavelet transformation to the feature space.
Microarray analysis Algorithms in Computational Biology Spring 2006 Written by Itai Sharon.
Dimensionality Reduction
Clustering IV. Outline Impossibility theorem for clustering Density-based clustering and subspace clustering Bi-clustering or co-clustering.
Dimensionality Reduction. Multimedia DBs Many multimedia applications require efficient indexing in high-dimensions (time-series, images and videos, etc)
The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.
Bi-Clustering Jinze Liu. Outline The Curse of Dimensionality Co-Clustering  Partition-based hard clustering Subspace-Clustering  Pattern-based 2.
Clustering Part2 BIRCH Density-based Clustering --- DBSCAN and DENCLUE
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Clustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu.
Bi-Clustering. 2 Data Mining: Clustering Where K-means clustering minimizes.
October 20, 2015Data Mining: Concepts and Techniques1 Jianlin Cheng Department of Computer Science University of Missouri, Columbia Customized and Revised.
Kernel Methods A B M Shawkat Ali 1 2 Data Mining ¤ DM or KDD (Knowledge Discovery in Databases) Extracting previously unknown, valid, and actionable.
Author:Rakesh Agrawal
Han/Eick: Clustering II 1 Clustering Part2 continued 1. BIRCH skipped 2. Density-based Clustering --- DBSCAN and DENCLUE 3. GRID-based Approaches --- STING.
11/1/2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 11 — Jiawei Han, Micheline Kamber, and Jian Pei.
Han/Eick: Clustering II 1 Clustering Part2 continued 1. BIRCH skipped 2. Density-based Clustering --- DBSCAN and DENCLUE 3. GRID-based Approaches --- STING.
Clustering by Pattern Similarity in Large Data Sets Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu IBM T. J. Watson Research Center Presented by Edmond.
Randomization in Privacy Preserving Data Mining Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD’00 the following slides include.
11/21/ Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 11 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by.
EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Chung-hung.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP Research Seminar BCB 713 Module Spring 2011 Wei Wang.
Presented by Ho Wai Shing
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2011.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2008.
December 23, 2015Data Mining: Concepts and Techniques1 Jianlin Cheng Department of Computer Science University of Missouri, Columbia Customized and Revised.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 8. Text Clustering.
CLUSTERING HIGH-DIMENSIONAL DATA Elsayed Hemayed Data Mining Course.
February 29, 2016Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — UNIT III — (CLUSTERING) Nitin Sharma Asstt. Profffesor in.
Clustering Wei Wang. Outline What is clustering Partitioning methods Hierarchical methods Density-based methods Grid-based methods Model-based clustering.
3/13/2016Data Mining 1 Lecture 1-2 Data and Data Preparation Phayung Meesad, Ph.D. King Mongkut’s University of Technology North Bangkok (KMUTNB) Bangkok.
CLUSTERING GRID-BASED METHODS Elsayed Hemayed Data Mining Course.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP Research Seminar GNET 713 BCB Module Spring 2007 Wei Wang.
University at BuffaloThe State University of New York Pattern-based Clustering How to cluster the five objects? qHard to define a global similarity measure.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
DB Seminar Series: The Subspace Clustering Problem By: Kevin Yip (17 May 2002)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2008.
What Is Cluster Analysis?
Data Mining Soongsil University
©Jiawei Han and Micheline Kamber Department of Computer Science
Subspace Clustering/Biclustering
CSE572, CBS598: Data Mining by H. Liu
CSE572, CBS572: Data Mining by H. Liu
CSE572, CBS572: Data Mining by H. Liu
What Is Good Clustering?
CS 485G: Special Topics in Data Mining
Group 9 – Data Mining: Data
An Efficient Method for Projected Clustering
CSE572: Data Mining by H. Liu
Presentation transcript:

Clustering High-Dimensional Data

Clustering high-dimensional data – Many applications: text documents, DNA micro-array data – Major challenges: Many irrelevant dimensions may mask clusters Distance measure becomes meaningless—due to equi-distance Clusters may exist only in some subspaces Methods – Feature transformation: only effective if most dimensions are relevant PCA & SVD useful only when features are highly correlated/redundant – Feature selection: wrapper or filter approaches useful to find a subspace where the data have nice clusters – Subspace-clustering: find clusters in all the possible subspaces CLIQUE, ProClus, and frequent pattern-based clustering

The Curse of Dimensionality (graphs adapted from Parsons et al. KDD Explorations 2004) Data in only one dimension is relatively packed Adding a dimension “stretch” the points across that dimension, making them further apart Adding more dimensions will make the points further apart—high dimensional data is extremely sparse Distance measure becomes meaningless— due to equi-distance

Why Subspace Clustering? (adapted from Parsons et al. SIGKDD Explorations 2004) Clusters may exist only in some subspaces Subspace-clustering: find clusters in all the subspaces

CLIQUE (Clustering In QUEst) Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98) Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space CLIQUE can be considered as both density-based and grid-based – It partitions each dimension into the same number of equal length interval – It partitions an m-dimensional data space into non-overlapping rectangular units – A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter – A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps Partition the data space and find the number of points that lie inside each cell of the partition. Identify the subspaces that contain clusters using the Apriori principle Identify clusters – Determine dense units in all subspaces of interests – Determine connected dense units in all subspaces of interests. Generate minimal description for the clusters – Determine maximal regions that cover a cluster of connected dense units for each cluster – Determination of minimal cover for each cluster

Salary (10,000) age age Vacation (week) age Vacation Salary 3050  = 3

Strength and Weakness of CLIQUE Strength – automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces – insensitive to the order of records in input and does not presume some canonical data distribution – scales linearly with the size of input and has good scalability as the number of dimensions in the data increases Weakness – The accuracy of the clustering result may be degraded at the expense of simplicity of the method

Frequent Pattern-Based Approach Clustering high-dimensional space (e.g., clustering text documents, microarray data) – Projected subspace-clustering: which dimensions to be projected on? CLIQUE, ProClus – Feature extraction: costly and may not be effective? – Using frequent patterns as “features” “Frequent” are inherent features Mining freq. patterns may not be so expensive Typical methods – Frequent-term-based document clustering – Clustering by pattern similarity in micro-array data (pClustering)

Clustering by Pattern Similarity (p-Clustering) Right: The micro-array “raw” data shows 3 genes and their values in a multi- dimensional space – Difficult to find their patterns Bottom: Some subsets of dimensions form nice shift and scaling patterns

Why p-Clustering? Microarray data analysis may need to – Clustering on thousands of dimensions (attributes) – Discovery of both shift and scaling patterns Clustering with Euclidean distance measure? — cannot find shift patterns Clustering on derived attribute A ij = a i – a j ? — introduces N(N-1) dimensions Bi-cluster using transformed mean-squared residue score matrix (I, J) – Where – A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0 Problems with bi-cluster – No downward closure property, – Due to averaging, it may contain outliers but still within δ-threshold

p-Clustering: Clustering by Pattern Similarity Given object x, y in O and features a, b in T, pCluster is a 2 by 2 matrix A pair (O, T) is in δ-pCluster if for any 2 by 2 matrix X in (O, T), pScore(X) ≤ δ for some δ > 0 Properties of δ-pCluster – Downward closure – Clusters are more homogeneous than bi-cluster (thus the name: pair- wise Cluster) Pattern-growth algorithm has been developed for efficient mining For scaling patterns, one can observe, taking logarithmic on will lead to the pScore form