Clustering Categorical Data


Clustering Categorical Data Pasi Fränti 18.2.2016

K-means clustering

Definitions and data
Set of N data points: X = {x1, x2, …, xN}
Partition of the data: P = {p1, p2, …, pM}
Set of M cluster prototypes (centroids): C = {c1, c2, …, cM}

Distance and cost function Euclidean distance of data vectors: Mean square error:
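
The two quantities named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the slide's own formula: it assumes that P stores one cluster index per data point and that the mean square error is the squared Euclidean distance to the assigned prototype averaged over the N points (other normalizations exist). The function names and the toy data are hypothetical.

```python
import math

def euclidean(x, c):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

def mean_square_error(X, P, C):
    """Average squared distance from each point to the prototype of its cluster.

    X: list of data vectors
    P: cluster index of each point (assumption: one label per point)
    C: list of cluster prototypes
    """
    return sum(euclidean(x, C[p]) ** 2 for x, p in zip(X, P)) / len(X)

# Hypothetical toy data: two clusters on a line.
X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
P = [0, 0, 1, 1]
C = [(1.0, 0.0), (11.0, 0.0)]
mse = mean_square_error(X, P, C)
print(mse)
```

Every point in the toy data lies at distance 1 from its prototype, so the error is easy to check by hand.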

Clustering result as partition
[Figure: partition of data and cluster prototypes, illustrated by a Voronoi diagram and by convex hulls]

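Duality of partition and centroids
The partition of the data and the cluster prototypes determine each other: given the prototypes, the partition is obtained by nearest prototype mapping; given the partition, the prototypes are the centroids of the clusters.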
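
The duality can be sketched as two small functions, one for each direction. This is an illustrative sketch under assumptions: squared Euclidean distance is used for the nearest prototype mapping, an empty cluster yields `None`, and all names and data are hypothetical.

```python
def nearest_prototype_partition(X, C):
    """Prototypes -> partition: index of the closest prototype for each point."""
    def sq_dist(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return [min(range(len(C)), key=lambda j: sq_dist(x, C[j])) for x in X]

def centroids_from_partition(X, P, M):
    """Partition -> prototypes: mean vector of each of the M clusters."""
    dims = len(X[0])
    sums = [[0.0] * dims for _ in range(M)]
    counts = [0] * M
    for x, p in zip(X, P):
        counts[p] += 1
        for d in range(dims):
            sums[p][d] += x[d]
    # Assumption: an empty cluster has no centroid and is reported as None.
    return [tuple(s / counts[j] for s in sums[j]) if counts[j] else None
            for j in range(M)]

# Hypothetical toy data and initial prototypes.
X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
C = [(1.0, 1.0), (9.0, 1.0)]
P = nearest_prototype_partition(X, C)
C_new = centroids_from_partition(X, P, M=len(C))
print(P, C_new)
```

Alternating these two steps until the partition stops changing gives the usual k-means iteration.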

Categorical data

Categorical clustering
Three attributes: director, actor, genre

                     director    actor     genre
t1 (Godfather II)    Coppola     De Niro   Crime
t2 (Good Fellas)     Scorsese
t3 (Vertigo)         Hitchcock   Stewart   Thriller
t4 (N by NW)                     Grant
t5 (Bishop's Wife)   Koster                Comedy
t6 (Harvey)

Categorical clustering
Sample 2-d data: color and shape
[Figure panels: Model A, Model B, Model C]

Hamming Distance (Binary and categorical data)
Number of attribute positions in which two vectors have different values.
Distance of (1011101) and (1001001) is 2.
Distance of (2143896) and (2233796) is 3.
Distance between (toned) and (roses) is 3.
3-bit binary cube: 100->011 has distance 3 (red path); 010->111 has distance 2 (blue path).
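
The examples on this slide can be checked with a short Python function. The function name `hamming` is an illustrative choice; the inputs are the strings quoted on the slide.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

d_binary = hamming("1011101", "1001001")
d_digits = hamming("2143896", "2233796")
d_words = hamming("toned", "roses")
d_red_path = hamming("100", "011")
d_blue_path = hamming("010", "111")
print(d_binary, d_digits, d_words, d_red_path, d_blue_path)
```

The same function works on tuples of categorical values, since it only compares positions for equality.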

K-means variants
Methods, including histogram-based methods:
k-modes
k-medoids
k-distributions
k-histograms
k-populations
k-representatives

Entropy-based cost functions Category utility: Entropy of data set: Entropies of the clusters relative to the data:
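
As a rough illustration of the entropy terms named here, the sketch below computes an entropy for a categorical data set and the size-weighted entropy of its clusters. It rests on assumptions that the transcript does not confirm: attributes are treated independently, so the entropy of a data set is taken as the sum of per-attribute entropies, logarithms are base 2, and each cluster entropy is weighted by the cluster's share of the data. Category utility is not covered. All names and the color/shape toy data are hypothetical.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def dataset_entropy(rows):
    """Sum of per-attribute entropies (assumes independent attributes)."""
    return sum(attribute_entropy(col) for col in zip(*rows))

def expected_cluster_entropy(rows, labels):
    """Cluster entropies weighted by cluster size relative to the whole data."""
    n = len(rows)
    clusters = {}
    for row, lab in zip(rows, labels):
        clusters.setdefault(lab, []).append(row)
    return sum(len(members) / n * dataset_entropy(members)
               for members in clusters.values())

# Hypothetical 2-d categorical data: color and shape.
rows = [("red", "circle"), ("red", "circle"),
        ("blue", "square"), ("blue", "square")]
labels = [0, 0, 1, 1]
h_data = dataset_entropy(rows)
h_clusters = expected_cluster_entropy(rows, labels)
print(h_data, h_clusters)
```

A partition that separates the two kinds of objects perfectly drives the weighted cluster entropy to zero, which is the direction an entropy-based cost function rewards.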

Iterative algorithms

K-modes clustering Distance function

K-modes clustering Prototype of cluster
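
The two ingredients of k-modes, the distance function and the cluster prototype, can be sketched as follows. The sketch assumes the simple matching (Hamming) distance and takes the prototype as the most frequent value of each attribute; ties are broken by first occurrence, and an empty cluster keeps its previous mode. These choices, the function names, and the toy data are assumptions made for illustration.

```python
from collections import Counter

def matching_distance(x, c):
    """Number of attributes on which object x and prototype c disagree."""
    return sum(a != b for a, b in zip(x, c))

def mode_prototype(cluster):
    """Most frequent value of each attribute (ties: first value encountered)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(X, modes, max_iter=100):
    """Alternate nearest-mode assignment and mode update until modes are stable."""
    for _ in range(max_iter):
        labels = [min(range(len(modes)),
                      key=lambda j: matching_distance(x, modes[j]))
                  for x in X]
        new_modes = []
        for j, old in enumerate(modes):
            members = [x for x, lab in zip(X, labels) if lab == j]
            # Assumption: an empty cluster keeps its previous mode.
            new_modes.append(mode_prototype(members) if members else old)
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes

# Hypothetical data with three categorical attributes.
X = [("red", "circle", "small"), ("red", "circle", "large"),
     ("red", "square", "small"), ("blue", "star", "large"),
     ("blue", "star", "small"), ("green", "star", "large")]
labels, modes = k_modes(X, modes=[X[0], X[3]])
print(labels, modes)
```

Unlike a centroid, a mode does not have to coincide with any object in the cluster, although here both modes happen to.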

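K-medoids clustering
Prototype of cluster: the vector with minimal total distance to every other vector in the cluster.
Vectors: (A, C, E), (B, C, F), (B, D, G)
Pairwise distances: d(ACE, BCF) = 2, d(BCF, BDG) = 2, d(ACE, BDG) = 3
Total distances: ACE: 2+3=5, BCF: 2+2=4, BDG: 2+3=5
Medoid: (B, C, F)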
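
The medoid calculation on this slide can be reproduced with a short sketch. It assumes the Hamming distance between vectors, which matches the distances shown on the slide; the function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def medoid(cluster):
    """Return the vector with the smallest total distance to the others,
    together with the total distance of every vector."""
    totals = [sum(hamming(v, w) for w in cluster) for v in cluster]
    best = min(range(len(cluster)), key=totals.__getitem__)
    return cluster[best], totals

# The three vectors from the slide.
cluster = [("A", "C", "E"), ("B", "C", "F"), ("B", "D", "G")]
proto, totals = medoid(cluster)
print(proto, totals)
```

Because the medoid is always one of the input vectors, this prototype works with any distance function and needs no notion of averaging.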

K-medoids Example

K-medoids Calculation

K-histograms D 2/3 F 1/3
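
A histogram prototype can be sketched as one relative-frequency table per attribute. The example below is an assumption-laden illustration, not the definition from the slide: the distance from an object to the histograms is taken as the sum over attributes of one minus the relative frequency of the object's value, and the small cluster is hypothetical data chosen so that its second attribute gives the frequencies D 2/3 and F 1/3.

```python
from collections import Counter
from fractions import Fraction

def cluster_histograms(cluster):
    """One histogram per attribute: value -> relative frequency in the cluster."""
    n = len(cluster)
    return [{v: Fraction(c, n) for v, c in Counter(col).items()}
            for col in zip(*cluster)]

def distance_to_histograms(x, hists):
    """Assumed distance: sum over attributes of (1 - frequency of x's value)."""
    return sum(1 - h.get(v, 0) for v, h in zip(x, hists))

# Hypothetical cluster with two categorical attributes.
cluster = [("A", "D"), ("A", "D"), ("B", "F")]
hists = cluster_histograms(cluster)
d_common = distance_to_histograms(("A", "D"), hists)
d_rare = distance_to_histograms(("B", "F"), hists)
print(hists[1], d_common, d_rare)
```

Objects whose values are frequent in the cluster end up closer to it than objects with rare values, which a single mode vector cannot express.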

K-distributions Cost function with ε addition

Example of cluster allocation Change of entropy

Problem of non-convergence

Results with Census dataset

Literature
Modified k-modes + k-histograms: M. Ng, M.J. Li, J.Z. Huang and Z. He, "On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), 503-507, March 2007.
ACE: K. Chen and L. Liu, "The 'Best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM'2005), pp. 253-262, Berkeley, USA, 2005.
ROCK: S. Guha, R. Rastogi and K. Shim, "Rock: A robust clustering algorithm for categorical attributes", Information Systems, Vol. 25, No. 5, pp. 345-366, 200x.
K-medoids: L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, 1990.
K-modes: Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp. 283-304, 1998.
K-distributions: Z. Cai, D. Wang and L. Jiang, "K-Distributions: A New Algorithm for Clustering Categorical Data", Int. Conf. on Intelligent Computing (ICIC 2007), pp. 436-443, Qingdao, China, 2007.
K-histograms: Z. He, X. Xu, S. Deng and B. Dong, "K-Histograms: An Efficient Clustering Algorithm for Categorical Dataset", CoRR, abs/cs/0509033, http://arxiv.org/abs/cs/0509033, 2005.