4. Clustering Methods
Concepts
Partitional (k-Means, k-Medoids)
Hierarchical (Agglomerative & Divisive, COBWEB)
Density-based (DBSCAN, CLIQUE)
Large size data (STING, BIRCH, CURE)
Spring 2003, Data Mining by H. Liu, ASU

Concepts of Clustering
Clusters. Different ways of representing clusters:
- division with boundaries;
- Venn diagram or spheres;
- probabilistic (e.g., a table assigning each instance I1, I2, …, In membership probabilities in clusters 1, 2, 3, such as 0.5, 0.2, 0.3);
- dendrograms / trees;
- rules.

Clustering vs. Classification
About clusters:
- inter-cluster distance → maximization;
- intra-cluster distance → minimization.
Clustering vs. classification: which one is more difficult? Why?
Among the many possible ways of clustering a data set, which one is the best?

Major Categories
- Partitioning: divide the data into k partitions (k fixed); iteratively repartition to get a better clustering.
- Hierarchical: divide the data into different numbers of partitions in layers, by merging (bottom-up) or dividing (top-down).
- Density-based: continue to grow a cluster as long as the density of the cluster exceeds a threshold.
- Grid-based: first divide the space into grid cells, then perform clustering on the cells.

k-Means Algorithm
Given k:
1. Randomly pick k instances as the initial cluster centers.
2. Assign each remaining instance to the closest of the k clusters.
3. Recalculate the mean of each cluster.
4. Repeat steps 2 and 3 until the means no longer change.
How good are the clusters?
- Compare the initial and final clusters.
- Within-cluster variation: Σ_clusters Σ_x dist(x, mean)².
- Why don't we consider inter-cluster distance? We hope in k-means that by minimizing within-cluster variation we also maximize inter-cluster distance. Just hope.
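A minimal sketch of this loop in Python (the function name, the seed, and the optional init parameter are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100, seed=0):
    """A minimal k-means sketch. X: (n, d) array-like; init: optional (k, d)
    initial centers (otherwise k random instances are picked)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = (np.asarray(init, dtype=float) if init is not None
               else X[rng.choice(len(X), k, replace=False)])
    for _ in range(max_iter):
        # Step 2: assign each instance to the closest center.
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2),
                           axis=1)
        # Step 3: recalculate each cluster mean (keep the old center if empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # step 4: stop when means don't change
            break
        centers = new
    # Within-cluster variation: sum of squared distances to cluster means.
    sse = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, centers, sse
```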

Example
For simplicity, one-dimensional objects {1, 2, 5, 6, 7} and k = 2.
k-means:
- Randomly select 5 and 6 as the initial centroids;
- => two clusters {1, 2, 5} and {6, 7}; mean_C1 = 8/3, mean_C2 = 6.5;
- => {1, 2} and {5, 6, 7}; mean_C1 = 1.5, mean_C2 = 6 => no change.
Aggregate dissimilarity = 0.5² + 0.5² + 1² + 0² + 1² = 2.5
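Running the sketch above on this data, with the same initial centroids, reproduces the trace and the aggregate dissimilarity of 2.5:

```python
labels, centers, sse = kmeans([[1], [2], [5], [6], [7]], k=2, init=[[5], [6]])
print(labels, centers.ravel(), sse)   # [0 0 1 1 1] [1.5 6.] 2.5
```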

Discussions
Limitations of k-means:
- means cannot be defined for categorical attributes;
- the choice of k;
- sensitivity to outliers;
- crisp (hard) clustering.
Variants of k-means exist, e.g., using modes instead of means to handle categorical attributes, with a suitable distance measure such as Hamming distance.
Is k-means similar to or different from k-NN? (k-NN is supervised and does no learning; k-means is unsupervised and learns the cluster centers.)

k-Medoids
The k-means algorithm is sensitive to outliers. Is this true? How can we show it?
Medoid: the most centrally located point in a cluster, used as a representative point of the cluster. In contrast, a centroid is not necessarily an actual data point, nor necessarily inside the cluster.
[Figure: an example showing initial medoids; the annotations read "for the first cluster, y = 1, 2, 6, x = 3".]

Partition Around Medoids (PAM)
Given k:
1. Randomly pick k instances as the initial medoids.
2. Assign each instance to its nearest medoid x.
3. Calculate the objective function: the sum of dissimilarities of all instances to their nearest medoids.
4. Randomly select a non-medoid instance y.
5. Swap x with y if the swap reduces the objective function.
6. Repeat steps 2-5 until there is no change.
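A minimal PAM sketch in Python (the names are illustrative; for compactness it tries every medoid/non-medoid swap per pass, the classic PAM formulation, rather than a single random swap as on the slide):

```python
import numpy as np

def pam(X, k, max_iter=100, seed=0):
    """PAM sketch: greedy medoid swapping over a precomputed distance matrix."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)   # pairwise distances
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, k, replace=False))

    def cost(meds):
        # Sum of dissimilarities of all instances to their nearest medoid.
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                      # try replacing each medoid x
            for y in range(n):                  # with each non-medoid y
                if y in medoids:
                    continue
                trial = medoids[:i] + [y] + medoids[i + 1:]
                c = cost(trial)
                if c < best:                    # keep the swap if it helps
                    best, medoids, improved = c, trial, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, best
```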

k-Means and k-Medoids
The key difference lies in how they update the means or medoids.
Both require distance calculations and reassignment of instances.
Time complexity: which one is more costly? (Each PAM swap re-evaluates the objective over all instances.)
Dealing with outliers: a medoid stays put when a distant outlier appears, whereas a mean is dragged toward it.
[Figure: the black dot is the medoid, 1 unit away from its closest neighbors; an outlier lies 100 units away.]

EM (Expectation-Maximization)
Moves away from the crisp clusters of k-means by allowing an instance to belong to several clusters (with probabilities).
Finite mixtures: a statistical clustering model.
- A mixture is a set of k probability distributions, representing k clusters.
- The simplest finite mixture: one feature, with each cluster a Gaussian.
- When k = 2, we need to estimate 5 parameters: two pairs of (μ, σ), plus p_A (with p_B = 1 − p_A).
EM:
- E-step: estimate cluster membership probabilities for the instances using the current parameters;
- M-step: re-estimate the parameters to maximize the overall likelihood of the data under the model.
Some details can be found in Witten and Frank's book.
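A hedged sketch of this simplest case, EM for a two-component one-dimensional Gaussian mixture, estimating the five parameters (μ_A, σ_A, μ_B, σ_B, p_A); the initialization scheme here is an arbitrary choice:

```python
import numpy as np
from scipy.stats import norm

def em_gauss2(x, iters=100):
    """EM for a two-component 1-D Gaussian mixture (k = 2, 5 parameters)."""
    x = np.asarray(x, dtype=float)
    # Crude initialization: extreme points as means, overall std for both.
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()]) + 1e-6
    p = np.array([0.5, 0.5])                        # mixing weights pA, pB
    for _ in range(iters):
        # E-step: posterior probability of each cluster for each instance.
        dens = p * norm.pdf(x[:, None], mu, sigma)  # shape (n, 2)
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the weighted instances.
        nk = w.sum(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        p = nk / len(x)
    return mu, sigma, p
```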

Agglomerative
Each object starts as its own cluster (bottom-up). Repeat until the number of clusters is small enough:
- choose the closest pair of clusters;
- merge the two into one.
Defining "closest": centroid (cluster mean) distance, (average) sum of pairwise distances, … (refer to the Evaluation part).
A dendrogram is a tree that shows the clustering process.

Dendrogram
Cluster the points 1, 2, 4, 5, 6, 7 into two clusters (centroid distance).
[Figure: dendrogram over the leaves 1, 2, 4, 5, 6, 7.]
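Assuming SciPy is available, the same clustering can be reproduced; Z encodes the dendrogram, and cutting it at two clusters separates {1, 2} from {4, 5, 6, 7}:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0], [2.0], [4.0], [5.0], [6.0], [7.0]])
# Agglomerative clustering with centroid distance; Z encodes the dendrogram.
Z = linkage(points, method='centroid')
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree at 2 clusters
print(labels)    # e.g., [1 1 2 2 2 2]: {1, 2} vs. {4, 5, 6, 7}
```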

An Example Showing Different Links
- Single link: merge the nearest clusters, measured by the shortest edge between the two → (((A B) (C D)) E).
- Complete link: merge the nearest clusters, measured by the longest edge between the two → (((A B) E) (C D)).
- Average link: merge the nearest clusters, measured by the average edge length between the two.
[Figure: points A, B, C, D, E; this example is from M. Dunham's book (see the bibliography).]
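The sketch below shows how the three link definitions are selected in SciPy; the coordinates for A-E are hypothetical (the slide's actual distances are not reproduced here), so the merge orders are illustrative only:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical coordinates for A-E; different link definitions can produce
# different merge orders on the same data.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0], [2.5, 2.0]])
for method in ('single', 'complete', 'average'):
    Z = linkage(X, method=method)
    print(method, Z[:, :2].astype(int))   # which clusters merge at each step
```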

Divisive
All instances start in one cluster (top-down).
Finding an optimal division at each layer (especially the top one) is computationally prohibitive. One heuristic is based on the Minimum Spanning Tree (MST):
- connect all instances with an MST (O(N²));
- repeatedly cut the longest remaining edge at each iteration, until some stopping criterion is met or each instance is its own cluster.
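A sketch of the MST heuristic, assuming SciPy (the function name and the choice to cut exactly k − 1 edges as the stopping criterion are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_divisive(X, k):
    """Divisive clustering via the MST heuristic: build the MST, cut the
    k-1 longest edges; the remaining connected components are the clusters."""
    D = squareform(pdist(np.asarray(X, dtype=float)))   # O(N^2) distances
    mst = minimum_spanning_tree(D).toarray()
    edges = np.argwhere(mst > 0)
    weights = mst[edges[:, 0], edges[:, 1]]
    for e in edges[np.argsort(weights)][::-1][: k - 1]:  # cut longest edges
        mst[e[0], e[1]] = 0.0
    _, labels = connected_components(mst, directed=False)
    return labels
```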

COBWEB
Builds a conceptual hierarchy incrementally.
Category utility:
CU = Σ_k Σ_i Σ_j P(f_i = v_ij) · P(f_i = v_ij | c_k) · P(c_k | f_i = v_ij)
summed over all categories c_k, all features f_i, and all feature values v_ij.
It attempts to maximize both the probability that two objects in the same category have feature values in common and the probability that objects in different categories have different feature values.
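A direct, toy implementation of the sum above for categorical features (it computes the raw triple sum from the slide, without any normalization; the data layout, one dict per instance, is an assumption):

```python
from collections import Counter

def category_utility(categories):
    """categories: list of clusters; each cluster is a list of dicts
    mapping feature -> value. Returns the triple sum from the slide."""
    instances = [x for c in categories for x in c]
    n = len(instances)
    features = instances[0].keys()
    overall = {f: Counter(x[f] for x in instances) for f in features}
    cu = 0.0
    for c in categories:
        for f in features:
            within = Counter(x[f] for x in c)
            for v, cnt in within.items():
                p_v = overall[f][v] / n          # P(f_i = v_ij)
                p_v_c = cnt / len(c)             # P(f_i = v_ij | c_k)
                p_c_v = cnt / overall[f][v]      # P(c_k | f_i = v_ij)
                cu += p_v * p_v_c * p_c_v
    return cu
```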

COBWEB processes one instance at a time, by evaluating:
- placing the instance in the best existing category;
- adding a new category containing only the instance;
- merging two existing categories into a new one and adding the instance to it;
- splitting an existing category into two and placing the instance in the best resulting category.
[Figure: merging two children of a grandparent under a new parent, and splitting a parent so that its children (child 1, child 2) attach directly to the grandparent.]

Density-based: DBSCAN
DBSCAN: Density-Based Spatial Clustering of Applications with Noise.
It grows regions of sufficiently high density into clusters, and can thereby discover clusters of arbitrary shape in spatial databases with noise (many existing clustering algorithms find only spherical clusters).
DBSCAN defines a cluster as a maximal set of density-connected points.
[Fig. 8.9 in J. Han and M. Kamber.]

Defining Density and Connection
- ε-neighborhood of an object x: the objects within distance ε of x.
- Core object: has at least MinPts objects (say, 3) within its ε-neighborhood (M, P, O).
- Directly density-reachable: y is directly density-reachable from a core object x if y lies in x's ε-neighborhood (Q from M, M from P).
- Density-reachable: via a chain of directly density-reachable objects (Q from P, but P not from Q) [asymmetric].
- Density-connected: both objects density-reachable from a common object (O, R, S) [symmetric]; this covers border points.
What is the relationship between density-reachability (DR) and density-connectivity (DC)?
[Figure from Han & Kamber 2001, showing points Q, M, S, R, O.]

Clustering with DBSCAN
- Search for clusters by checking the ε-neighborhood of each instance x.
- If the ε-neighborhood of x contains at least MinPts objects, create a new cluster with x as a core object.
- Iteratively collect directly density-reachable objects from these core objects, merging density-reachable clusters.
- Terminate when no new point can be added to any cluster.
DBSCAN is sensitive to the density thresholds (ε, MinPts), but it is many times faster than CLARANS.
Time complexity: O(N log N) if a spatial index is used, O(N²) otherwise.
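A minimal DBSCAN sketch without a spatial index (hence the O(N²) distance matrix); -1 marks noise:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns cluster labels; -1 marks noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)   # no spatial index
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                 # already assigned, or not a core object
        # Grow a new cluster from core object i via direct density-reachability.
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border or core point joins the cluster
                if len(neighbors[j]) >= min_pts:
                    frontier.extend(neighbors[j])   # expand only from cores
        cluster += 1
    return labels
```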

Dealing with Large Data
Key ideas:
- reduce the number of instances while maintaining the distribution;
- identify the relevant subspaces where clusters may exist;
- use summarized information to avoid repeated data access.
Sampling:
- CLARA (Clustering LARge Applications): runs PAM on samples instead of the whole data set;
- CLARANS (Clustering Large Applications based on RANdomized Search).
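A CLARA-style sketch, reusing the pam function from the earlier sketch; the sample count and size here are illustrative (the original CLARA draws samples of size 40 + 2k):

```python
import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    """Run PAM on several random samples; keep the medoids with the
    lowest cost measured on the full data set."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    best_meds, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), min(sample_size, len(X)), replace=False)
        meds, _, _ = pam(X[idx], k)        # medoid indices into the sample
        med_pts = X[idx][meds]
        # Evaluate these medoids against all instances, not just the sample.
        cost = np.linalg.norm(X[:, None] - med_pts[None], axis=2).min(axis=1).sum()
        if cost < best_cost:
            best_meds, best_cost = med_pts, cost
    return best_meds, best_cost
```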

Grid: STING (STatistical INformation Grid)
Statistical parameters of higher-level cells can easily be computed from those of lower-level cells:
- attribute-independent: count;
- attribute-dependent: mean, standard deviation, min, max;
- type of distribution: normal, uniform, exponential, or unknown.
Irrelevant cells can be removed from consideration.
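For instance, a parent cell's parameters can be computed purely from its children's summaries; a sketch (the tuple layout is an assumption):

```python
import math

def merge_cells(cells):
    """cells: list of (count, mean, std, min, max) tuples for child cells.
    Returns the same summary for the parent cell, without touching the data."""
    n = sum(c[0] for c in cells)
    mean = sum(c[0] * c[1] for c in cells) / n
    # E[x^2] for each child is std^2 + mean^2; combine, then recover parent std.
    ex2 = sum(c[0] * (c[2] ** 2 + c[1] ** 2) for c in cells) / n
    std = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return (n, mean, std, min(c[3] for c in cells), max(c[4] for c in cells))
```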

Representatives: BIRCH
BIRCH uses Clustering Features (CF) and a CF tree.
A clustering feature is a triplet summarizing a sub-cluster of instances: (N, LS, SS), where N is the number of instances, LS the linear sum, and SS the square sum.
Two parameters: the branching factor (maximum number of children per non-leaf node) and a threshold on the size of the sub-clusters stored in leaf nodes.
Two phases:
1. Build an initial in-memory CF tree;
2. Apply a clustering algorithm to cluster the leaf nodes of the CF tree.
CURE (Clustering Using REpresentatives) is another example.
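A sketch of the CF summary itself, showing why it supports incremental clustering: CFs are additive, and the centroid and radius of a sub-cluster can be recovered from (N, LS, SS) without revisiting the data:

```python
import numpy as np

class CF:
    """Clustering Feature: the (N, LS, SS) summary of a sub-cluster (BIRCH)."""
    def __init__(self, x):
        x = np.asarray(x, dtype=float)
        self.n = 1                          # N: number of instances
        self.ls = x.copy()                  # LS: linear sum of the instances
        self.ss = float(np.dot(x, x))       # SS: sum of squared norms

    def merge(self, other):
        # CFs are additive: merging two sub-clusters just adds the components.
        self.n += other.n
        self.ls = self.ls + other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance to the centroid, from the summary alone.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - np.dot(c, c), 0.0)))
```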

Taking Advantage of a Property of Density
If a region is dense in a higher-dimensional subspace, it must also be dense in all of its lower-dimensional projections (the converse does not hold).
CLIQUE (CLustering In QUEst):
- with high-dimensional data, there are many void subspaces;
- using this property, we can start from dense lower-dimensional units and extend upward.
CLIQUE is a density-based method that can automatically find the subspaces of highest dimensionality in which high-density clusters exist.
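A toy bottom-up search for dense units in the spirit of CLIQUE (the grid resolution xi, density threshold tau, and equal-width discretization are all illustrative choices):

```python
import numpy as np
from itertools import combinations

def clique_dense_units(X, xi=10, tau=0.05, max_dim=2):
    """Find dense grid units bottom-up; a unit is a frozenset of
    (dimension, interval) pairs, and a point lies in a unit if it falls
    in every one of its intervals."""
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    bins = np.clip(((X - lo) / (hi - lo + 1e-12) * xi).astype(int), 0, xi - 1)

    def is_dense(unit):
        mask = np.ones(n, dtype=bool)
        for dim, b in unit:
            mask &= bins[:, dim] == b
        return mask.sum() / n >= tau

    # Level 1: dense intervals in single dimensions.
    dense = {frozenset([(dim, b)]) for dim in range(d) for b in range(xi)
             if is_dense([(dim, b)])}
    result, level = set(dense), 1
    while dense and level < max_dim:
        # Monotonicity: a (k+1)-dim unit can only be dense if its k-dim
        # projections are, so candidates come from joining dense k-dim units.
        cands = {a | b for a, b in combinations(dense, 2)
                 if len(a | b) == level + 1
                 and len({dim for dim, _ in a | b}) == level + 1}
        dense = {u for u in cands if is_dense(u)}
        result |= dense
        level += 1
    return result
```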

Chameleon
A hierarchical clustering algorithm using dynamic modeling.
Motivated by observed weaknesses of CURE and ROCK (Han & Kamber 2001).

Summary
- There are many clustering algorithms.
- Good clustering algorithms maximize inter-cluster dissimilarity and intra-cluster similarity.
- Without prior knowledge, it is difficult to choose the best clustering algorithm.
- Clustering is an important tool for outlier analysis.

Bibliography
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
- M. Kantardzic. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE, 2003.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
- M. H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall.