Author: Zhexue Huang Advisor: Dr. Hsu Graduate: Yu-Wei Su


Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values
Author: Zhexue Huang  Advisor: Dr. Hsu  Graduate: Yu-Wei Su
2001/11/06 The Lab of Intelligent Database System, IDS

Outline
Motivation
Objective
Research Review
Notation
K-means Algorithm
K-modes Algorithm
K-prototypes Algorithm
Experiment
Conclusion
Personal Opinion

Motivation
K-means methods are efficient for processing large data sets
K-means is limited to numeric data
Real-world data sets with millions of objects mix numeric and categorical values

Objective
Extend K-means to categorical domains and to domains with mixed numeric and categorical values

Research Review
Partition methods: a partitioning algorithm organises the objects into k partitions (k < n)
K-means [MacQueen, 1967]
K-medoids [Kaufman and Rousseeuw, 1990]
CLARANS [Ng and Han, 1994]

Notation
A1, A2, ..., Am are the m attributes; each Aj describes a domain of values, denoted DOM(Aj)
X = {X1, X2, ..., Xn} is a set of n objects; object Xi is represented as [x_i1, x_i2, ..., x_im]
Xi = Xk if x_ij = x_kj for 1 <= j <= m
For mixed-type objects [x_i1, ..., x_ip, x_i(p+1), ..., x_im], the first p values are numeric and the rest are categorical

K-means Algorithm
Problem P: minimise P(W, Q) = sum_{l=1..k} sum_{i=1..n} w_il d(Xi, Ql)
subject to sum_{l=1..k} w_il = 1 and w_il in {0, 1}, for 1 <= i <= n, 1 <= l <= k
k is the number of clusters, n is the number of objects
W is an n x k partition matrix; Q = {Q1, Q2, ..., Qk} is a set of objects in the same object domain
d(., .) is the Euclidean distance between two objects

K-means Algorithm (cont.)
Problem P can be solved by iteratively solving the following two problems:
Problem P1: fix Q = Q^, solve the reduced problem P(W, Q^):
w_il = 1 if d(Xi, Ql) <= d(Xi, Qt) for 1 <= t <= k; w_it = 0 for t != l
Problem P2: fix W = W^, solve the reduced problem P(W^, Q):
q_lj = sum_{i=1..n} w_il x_ij / sum_{i=1..n} w_il, for 1 <= l <= k and 1 <= j <= m

K-means Algorithm (cont.)
1. Choose an initial Q^0 and solve P(W, Q^0) to obtain W^0. Set t = 0
2. Let W^ = W^t and solve P(W^, Q) to obtain Q^(t+1). If P(W^, Q^t) = P(W^, Q^(t+1)), output (W^, Q^t) and stop; otherwise, go to 3
3. Let Q^ = Q^(t+1) and solve P(W, Q^) to obtain W^(t+1). If P(W^t, Q^) = P(W^(t+1), Q^), output (W^t, Q^) and stop; otherwise, let t = t+1 and go to 2
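The alternating P1/P2 scheme above can be sketched in Python. This is a minimal illustration, not the paper's implementation; the function name `kmeans` and the random initialisation are ours:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Alternately solve P1 (fix centres, choose assignments) and
    P2 (fix assignments, recompute centres as cluster means)."""
    rng = np.random.default_rng(seed)
    Q = X[rng.choice(len(X), size=k, replace=False)]  # initial Q^0: k distinct objects
    for _ in range(max_iter):
        # Problem P1: assign each object to the nearest centre
        d = ((X[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Problem P2: each centre becomes the mean of its cluster
        new_Q = np.array([X[labels == l].mean(axis=0) if (labels == l).any() else Q[l]
                          for l in range(k)])
        if np.allclose(new_Q, Q):  # centres unchanged, so the cost P has converged
            break
        Q = new_Q
    return labels, Q
```

Stopping when the centres stop moving is equivalent to the cost-equality test in steps 2 and 3, since each half-step can only decrease P.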

K-modes Algorithm
Uses a simple matching dissimilarity measure for categorical objects
Replaces the means of clusters by modes
Uses a frequency-based method to find the modes

K-modes Algorithm (cont.)
Dissimilarity measure: d(X, Y) = sum_{j=1..m} delta(x_j, y_j), where delta(x_j, y_j) = 0 if x_j = y_j and 1 otherwise
Mode of a set: a mode of X = {X1, X2, ..., Xn} is a vector Q = [q1, q2, ..., qm] that minimises D(X, Q) = sum_{i=1..n} d(Xi, Q)
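The simple matching measure counts the attributes on which two objects differ; a one-line Python version (function name ours):

```python
def matching_dissimilarity(x, y):
    """Simple matching: number of attribute positions where x and y differ."""
    return sum(1 for xj, yj in zip(x, y) if xj != yj)
```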

K-modes Algorithm (cont.)
Find a mode for a set
Let n_{ckj} be the number of objects having the kth category c_kj in attribute Aj, and f_r(Aj = c_kj | X) = n_{ckj} / n the relative frequency of category c_kj in X
Theorem 1: D(X, Q) is minimised iff f_r(Aj = qj | X) >= f_r(Aj = c_kj | X) for qj != c_kj, for all j = 1, ..., m
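By Theorem 1, a mode is obtained by taking the most frequent category of each attribute independently. A short sketch (function name ours):

```python
from collections import Counter

def mode_of(X):
    """Mode of a set of categorical objects: per attribute,
    pick the category with the highest relative frequency."""
    m = len(X[0])
    return [Counter(obj[j] for obj in X).most_common(1)[0][0] for j in range(m)]
```

Ties between equally frequent categories may be broken arbitrarily; any such choice still minimises D(X, Q).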

K-modes Algorithm (cont.)
Two initial mode selection methods
1. Select the first k distinct records from the data set as the k modes
2. Select the k modes by a frequency-based method

K-modes Algorithm (cont.)
Cost function: P(W, Q) = sum_{l=1..k} sum_{i=1..n} w_il d(Xi, Ql), where d is the simple matching dissimilarity and Ql is the mode of cluster l
The total cost P is calculated against the whole data set each time a new Q or W is obtained

K-modes Algorithm (cont.)
1. Select k initial modes, one for each cluster
2. Allocate each object to the cluster whose mode is nearest to it; update the mode of the cluster after each allocation according to Theorem 1

K-modes Algorithm (cont.)
3. After all objects have been allocated to clusters, retest the dissimilarity of each object against the current modes; if an object's nearest mode belongs to another cluster, reallocate the object to that cluster and update the modes of both clusters
4. Repeat 3 until no object has changed clusters
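Steps 1-4 can be put together in a compact Python sketch. It is an illustration under simplifying assumptions (function names ours; the first k records are assumed distinct; for brevity the initial modes are updated after the first full pass rather than after every single allocation as the slides describe):

```python
from collections import Counter

def matching(x, y):
    # simple matching dissimilarity
    return sum(xj != yj for xj, yj in zip(x, y))

def mode_of(cluster):
    # most frequent category per attribute (Theorem 1)
    m = len(cluster[0])
    return [Counter(o[j] for o in cluster).most_common(1)[0][0] for j in range(m)]

def k_modes(X, k, max_passes=100):
    modes = [list(x) for x in X[:k]]  # step 1: first k (assumed distinct) records
    # step 2: initial allocation to the nearest mode
    labels = [min(range(k), key=lambda l: matching(x, modes[l])) for x in X]
    for l in range(k):
        members = [X[i] for i in range(len(X)) if labels[i] == l]
        if members:
            modes[l] = mode_of(members)
    for _ in range(max_passes):  # steps 3-4: retest until no object moves
        moved = False
        for i, x in enumerate(X):
            best = min(range(k), key=lambda l: matching(x, modes[l]))
            if best != labels[i]:
                old, labels[i], moved = labels[i], best, True
                for l in (old, best):  # update the modes of both clusters
                    members = [X[j] for j in range(len(X)) if labels[j] == l]
                    if members:
                        modes[l] = mode_of(members)
        if not moved:
            break
    return labels, modes
```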

K-prototypes Algorithm
Integrates the k-means and k-modes algorithms to cluster mixed-type objects
d(Xi, Ql) = sum_{j=1..p} (x_ij - q_lj)^2 + gamma_l sum_{j=p+1..m} delta(x_ij, q_lj)
m is the number of attributes; the first p are numeric, the rest are categorical

K-prototypes Algorithm (cont.)
The first term is the squared Euclidean distance measure on the numeric attributes and the second term is the simple matching dissimilarity measure on the categorical attributes
The weight gamma_l is used to avoid favouring either type of attribute
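The two-term mixed dissimilarity is direct to write down in Python (function name ours):

```python
def mixed_distance(x, y, p, gamma):
    """Squared Euclidean distance on the first p (numeric) attributes
    plus gamma-weighted simple matching on the remaining (categorical) ones."""
    numeric = sum((x[j] - y[j]) ** 2 for j in range(p))
    categorical = sum(x[j] != y[j] for j in range(p, len(x)))
    return numeric + gamma * categorical
```

Small gamma makes clustering depend mostly on the numeric attributes; large gamma makes the categorical attributes dominate.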

K-prototypes Algorithm (cont.)
Cost function: minimise P(W, Q) = sum_{l=1..k} sum_{i=1..n} w_il d(Xi, Ql), where d combines the numeric and categorical terms

K-prototypes Algorithm (cont.)
Allocate each object to the cluster whose prototype is nearest under the mixed dissimilarity
Modify the prototypes of the affected clusters: recompute the means of the numeric attributes and the modes of the categorical attributes

Experiment
K-modes: the data set was the soybean disease data set, with 4 diseases, 47 instances {D=10, C=10, R=10, P=17}, and 21 attributes
K-prototypes: the second data set was the credit approval data set, with 2 classes, 666 instances {approved=299, rejected=367}, 6 numeric and 9 categorical attributes

Experiment (cont.)
[results tables and charts]

Conclusion
The k-modes algorithm is faster than the k-means and k-prototypes algorithms because it needs fewer iterations to converge
How many clusters are in the data?
The weight gamma adds an additional problem

Personal Opinion
Conceptual inclusion relationships
Outlier problem
Massive data sets cause efficiency problems