CSE572, CBS572: Data Mining by H. Liu


Clustering
- Basic concepts with simple examples
- Categories of clustering methods
- Challenges

What is clustering?
- The process of grouping a set of physical or abstract objects into classes of similar objects; also called unsupervised learning.
- It is a common and important task that finds many applications.
- Examples where we need clustering? See http://en.wikipedia.org/wiki/Data_clustering

Clusters and representations
- Examples of clusters
- Different ways of representing clusters:
  - Division with boundaries (Venn diagrams or spheres)
  - Probabilistic: a table assigning each instance I1, I2, ..., In a membership probability for each cluster 1, 2, 3 (e.g., 0.5, 0.2, 0.3)
  - Dendrograms (trees)
  - Rules

Differences from Classification
- How different? Which one is more difficult as a learning problem?
- Do we perform clustering in daily activities? How do we cluster?
- How to measure the results of clustering, with/without class labels?
- Between classification and clustering: semi-supervised clustering

Major clustering methods
- Partitioning methods: k-Means (and EM), k-Medoids
- Hierarchical methods: agglomerative, divisive, BIRCH
- Similarity and dissimilarity of points in the same cluster and from different clusters
- Distance measures between clusters: minimum, maximum, means of clusters, average between clusters

How to evaluate
- Without labeled data, how can one know a clustering result is good?
- The basic, intuitive idea of clustering for clustered data points:
  - Within a cluster: points should be similar to each other (high cohesion)
  - Between clusters: points should be dissimilar (high separation)
  - The relationship between the two?
- Evaluation methods:
  - Labeled data, under another assumption: instances in the same cluster are of the same class. Is it reasonable to use class labels in evaluation?
  - Unlabeled data: we will see below (a silhouette sketch follows)
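One standard way to turn the cohesion/separation intuition into a single number for unlabeled data is the silhouette coefficient. The slides do not prescribe a particular measure, so this is a minimal sketch, assuming scikit-learn is available, run on the five 1-D objects used in Example 1 below.

```python
# Hedged sketch: silhouette evaluation of a k-means result (assumed library:
# scikit-learn). Values near 1 mean tight, well-separated clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # cohesion vs. separation in one score
```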

Clustering -- Example 1
- For simplicity, 1-dimensional objects and k = 2.
- Objects: 1, 2, 5, 6, 7
- k-Means (traced in the sketch below):
  - Randomly select 5 and 6 as initial centroids;
  - => two clusters {1, 2, 5} and {6, 7}; meanC1 = 8/3, meanC2 = 6.5
  - => {1, 2}, {5, 6, 7}; meanC1 = 1.5, meanC2 = 6 => no change.
  - Aggregate dissimilarity (sum of squared distances to the centroids) = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
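The trace above can be reproduced in a few lines of Python. This is a minimal sketch for 1-D data, not the course's reference implementation; it assumes Euclidean distance and does not guard against empty clusters.

```python
# Minimal 1-D k-means: alternate assignment and centroid-update steps until
# the centroids stop changing, then report the aggregate dissimilarity (SSE).
def kmeans_1d(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:  # converged: no change
            break
        centroids = new_centroids
    sse = sum((p - centroids[i]) ** 2 for i, c in enumerate(clusters) for p in c)
    return clusters, centroids, sse

print(kmeans_1d([1, 2, 5, 6, 7], [5, 6]))
# ([[1, 2], [5, 6, 7]], [1.5, 6.0], 2.5) -- exactly the trace on the slide
```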

Issues with k-means
- A heuristic method
- Sensitive to outliers. How to prove it?
- Determining k:
  - Trial and error (an elbow-style sketch follows)
  - X-means, PCA-based
- Crisp clustering; soft alternatives: EM, fuzzy c-means
- Should not be confused with k-NN
- X-means: Extending K-means with Efficient Estimation of the Number of Clusters (2000), Dan Pelleg and Andrew Moore
- Fuzzy c-means: http://www.mathworks.com/access/helpdesk/help/toolbox/fuzzy/fp43419.html
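"Trial and error" for determining k is often done with the elbow heuristic: run k-means for several values of k and watch where the total within-cluster SSE stops dropping sharply. The slides name no specific procedure, so this sketch, which assumes scikit-learn, is one common choice rather than the course's method.

```python
# Elbow heuristic sketch: SSE (inertia) versus k on the Example 1 objects.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))  # the drop flattens once k fits the data
```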

k-Medoids
- Medoid: the most centrally located point in a cluster, used as a representative point of the cluster. In contrast, a centroid is not necessarily inside a cluster.
- An example (slide figure): initial medoids are picked; for the first cluster, the medoid is x = 3 and the assigned points are y = 1, 2, 6. (A medoid-finding sketch follows.)
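A small sketch, not from the slides, of what "most centrally located" means operationally: the medoid is the member point whose summed distance to all other members is smallest, so it is always an actual data point, unlike the centroid.

```python
# Medoid of a 1-D cluster: the member minimizing total distance to the rest.
def medoid(points):
    return min(points, key=lambda x: sum(abs(x - y) for y in points))

cluster = [1, 2, 5, 6, 7]
print(medoid(cluster))              # 5: its summed distance (10) is smallest
print(sum(cluster) / len(cluster))  # 4.2: the centroid, not a member point
```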

Partition Around Medoids (PAM)
Given k:
1. Randomly pick k instances as initial medoids
2. Assign each instance to the nearest medoid x
3. Calculate the objective function: the sum of dissimilarities of all instances to their nearest medoids
4. Randomly select an instance y
5. Swap x with y if the swap reduces the objective function; consider this for all x
6. Repeat steps 2-5 until no change
A sketch follows.
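A hedged PAM sketch over the same 1-D objects. To keep it short and deterministic, the random initial pick and the random choice of y in the steps above are replaced by simple scans over all medoid/non-medoid pairs; a faithful PAM would sample candidates and cache distance sums.

```python
# PAM sketch: greedily swap a medoid x for a non-medoid y whenever the swap
# lowers the objective (total distance of points to their nearest medoid).
import itertools

def cost(points, medoids):
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])  # stand-in for a random initial pick
    improved = True
    while improved:
        improved = False
        for x, y in itertools.product(list(medoids), points):
            if y in medoids or x not in medoids:  # skip stale/degenerate swaps
                continue
            candidate = [y if m == x else m for m in medoids]
            if cost(points, candidate) < cost(points, medoids):
                medoids, improved = candidate, True
    return medoids

print(pam([1, 2, 5, 6, 7], 2))  # [6, 2]: total distance 3, the optimum here
```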

k-Means and k-Medoids
- The key difference lies in how they update the means or medoids: k-medoids compares each of the k medoids with the (N - k) remaining instances pairwise
- Both require distance calculation and reassignment of instances
- Time complexity: which one is more costly?
- Dealing with outliers (slide figure: an outlier 100 units away; the black dot is the medoid, 1 unit away from its closest neighbors). A numeric illustration follows.
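A quick numeric illustration of the figure's point, with assumed data: one far-away point drags the mean a long way but barely affects the medoid.

```python
# Mean vs. medoid under an outlier roughly "100 units away".
cluster = [1, 2, 3, 102]  # 102 plays the outlier
mean = sum(cluster) / len(cluster)
med = min(cluster, key=lambda x: sum(abs(x - y) for y in cluster))
print(mean)  # 27.0 -- pulled far toward the outlier
print(med)   # 2    -- stays with the bulk of the cluster
```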

Agglomerative
- Each object is initially viewed as a cluster (bottom up).
- Repeat until the number of clusters is small enough:
  - Choose the closest pair of clusters
  - Merge the two into one
- Defining "closest": centroid (mean of cluster) distance, (average) sum of pairwise distances, ... (refer to the Evaluation part)
- A dendrogram is a tree that shows the clustering process.

Clustering -- Example 2
- For simplicity, we still use 1-dimensional objects: 1, 2, 5, 6, 7
- Agglomerative clustering: a very frequently used algorithm
- How to cluster: repeatedly find the two closest clusters (here represented by their centroids) and merge them:
  - => {1, 2}, so we now have {1.5, 5, 6, 7}
  - => {1, 2}, {5, 6}, so {1.5, 5.5, 7}
  - => {1, 2}, {{5, 6}, 7}
- The same merges are reproduced in the sketch below.
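The hand trace can be checked with SciPy's hierarchical clustering, an assumed dependency rather than anything the slides require; method="centroid" matches the example's use of cluster means.

```python
# Agglomerative clustering of the Example 2 objects by centroid distance.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
Z = linkage(X, method="centroid")
print(Z)  # rows: indices of the two clusters merged, their distance, new size
# Merge order: {1,2}, then {5,6}, then {{5,6},7}, exactly as traced above.
```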

Issues with dendrograms
- How to find proper clusters?
- An alternative: divisive algorithms (top down)
  - Compared with bottom-up, which is more efficient? What is the time complexity?
  - How to efficiently divide the data? A heuristic: the Minimum Spanning Tree (http://en.wikipedia.org/wiki/Minimum_spanning_tree); the fastest MST algorithms run in roughly O(e) time, where e is the number of edges. A sketch of the heuristic follows.
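A hedged sketch of the MST heuristic for divisive clustering: build the minimum spanning tree of the data, then cut its longest edge; the connected components that remain are the clusters. SciPy is an assumed dependency here; the slides name the heuristic but not an implementation.

```python
# Divide the Example 2 objects by cutting the longest MST edge.
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
dist = squareform(pdist(X))                  # full pairwise distance matrix
mst = minimum_spanning_tree(dist).toarray()  # MST as a weighted adjacency matrix
mst[mst == mst.max()] = 0                    # cut the longest MST edge(s)
n, labels = connected_components(mst, directed=False)
print(n, labels)  # 2 [0 0 1 1 1] -- the cut separates {1,2} from {5,6,7}
```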

Distance measures
- Single link: measured by the shortest edge between the two clusters
- Complete link: measured by the longest edge
- Average link: measured by the average edge length
- An example is shown next.

An example to show different links
- Single link: merge the nearest clusters as measured by the shortest edge between the two, giving (((A B) (C D)) E)
- Complete link: merge the nearest clusters as measured by the longest edge between the two, giving (((A B) E) (C D))
- Average link: merge the nearest clusters as measured by the average edge length between the two
- (Slide figure: a distance graph over points A, B, C, D, E; this example is from M. Dunham's book, see the bib. The sketch below reproduces the two merge orders.)
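The slide's A-E distances did not survive extraction, so the sketch below uses an assumed dissimilarity matrix engineered to reproduce the two trees shown above; the merge orders, not the particular numbers, are the point.

```python
# Compare linkage criteria on an assumed 5-point dissimilarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed distances over A,B,C,D,E in the order
# (AB, AC, AD, AE, BC, BD, BE, CD, CE, DE).
d = np.array([1.0, 8.0, 8.0, 3.0, 2.5, 8.0, 3.0, 2.0, 9.0, 9.0])
for method in ("single", "complete", "average"):
    print(method)
    print(linkage(d, method=method))  # rows list the merges in order
# single merges (AB) with (CD) first: (((A B) (C D)) E)
# complete merges (AB) with E first:  (((A B) E) (C D))
```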

Other methods
- Density-based methods
  - DBSCAN: a cluster is a maximal set of density-connected points
  - Core points are defined using an epsilon-neighborhood and MinPts
  - Directly density-reachable points (e.g., P and Q, Q and M), density-reachable points (P and M, assuming so are P and N), and density-connected points (any density-reachable points, e.g., P, Q, M, N) form clusters
- Grid-based methods
  - STING: the lowest level is the original data; statistical parameters of higher-level cells are computed from the parameters of the lower-level cells (count, mean, standard deviation, min, max, distribution)
- Model-based methods
  - Conceptual clustering: COBWEB, guided by category utility (intraclass similarity and interclass dissimilarity)

Density-based: DBSCAN
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- It grows regions with sufficiently high density into clusters and can discover clusters of arbitrary shape in spatial databases with noise; many existing clustering algorithms find only spherical clusters.
- DBSCAN defines a cluster as a maximal set of density-connected points; density is defined by an area and a number of points.
- (Fig 8.9, J. Han and M. Kamber)

Defining density and connection
- ε-neighborhood of an object x; x is a core object (e.g., M, P, O) if at least MinPts objects (say, 3) lie within its ε-neighborhood
- Directly density-reachable (Q from M, M from P); only core objects are mutually density-reachable
- Density-reachable (Q from P, but P not from Q) [asymmetric]
- Density-connected (O, R, S) [symmetric]; this is what covers border points
- What is the relationship between density-reachability (DR) and density-connectivity (DC)?
- (Slide figure with points Q, M, P, S, R, O; Han & Kamber 2001)

Clustering with DBSCAN
- Search for clusters by checking the ε-neighborhood of each instance x
- If the ε-neighborhood of x contains at least MinPts instances, create a new cluster with x as a core object
- Iteratively collect directly density-reachable objects from these core objects and merge density-reachable clusters
- Terminate when no new point can be added to any cluster
- DBSCAN is sensitive to the density thresholds, but it is fast: time complexity O(N log N) if a spatial index is used, O(N^2) otherwise
- An illustration follows.
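A hedged illustration with scikit-learn's DBSCAN (our choice of library; the slides describe the algorithm, not this API). Here eps is the ε radius, min_samples is MinPts, and the label -1 marks noise points that join no cluster.

```python
# Two dense groups plus one isolated point that DBSCAN labels as noise.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0], [1.5], [2.0], [8.0], [8.5], [9.0], [25.0]])
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
print(labels)  # [0 0 0 1 1 1 -1]: 25.0 has too few neighbors to join a cluster
```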

Grid: STING (STatistical INformation Grid)
- Statistical parameters of higher-level cells can easily be computed from those of lower-level cells:
  - Attribute-independent: count
  - Attribute-dependent: mean, standard deviation, min, max
  - Type of distribution: normal, uniform, exponential, or unknown
- Irrelevant cells can be removed
- A sketch of the parameter roll-up follows.
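A hedged sketch, not from the slides, of how a parent cell's statistics follow from its children's: counts and extremes combine directly, while the mean and standard deviation are derived from the children's counts, means, and variances.

```python
# Roll up (count, mean, std, min, max) from child cells to their parent cell.
import math

def merge_cells(cells):
    # Each child cell: (count, mean, std, min, max) over one attribute.
    n = sum(c[0] for c in cells)
    mean = sum(c[0] * c[1] for c in cells) / n
    # Per child, E[x^2] = std^2 + mean^2; average these, subtract parent mean^2.
    ex2 = sum(c[0] * (c[2] ** 2 + c[1] ** 2) for c in cells) / n
    std = math.sqrt(ex2 - mean ** 2)
    return n, mean, std, min(c[3] for c in cells), max(c[4] for c in cells)

print(merge_cells([(4, 10.0, 1.0, 8.0, 12.0), (6, 20.0, 2.0, 16.0, 24.0)]))
# (10, 16.0, 5.177..., 8.0, 24.0)
```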

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
- Uses Clustering Features (CF) and a CF tree
- A clustering feature is a triplet summarizing a sub-cluster of instances: (N, LS, SS), where N is the number of instances, LS the linear sum, and SS the square sum
- Two parameters: the branching factor (the maximum number of children per non-leaf node) and a threshold on the sub-clusters stored at the leaves
- Two phases:
  - Build an initial in-memory CF tree
  - Apply a clustering algorithm to cluster the leaf nodes of the CF tree
- CURE (Clustering Using REpresentatives) is another example, allowing multiple representative points in a cluster
- A CF sketch follows.
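A minimal sketch of the CF triplet for 1-D data: CFs are additive, so merging two sub-clusters is componentwise addition, and the centroid and variance fall out of (N, LS, SS) without revisiting the raw points.

```python
# Clustering Feature (N, LS, SS) and what it lets BIRCH compute incrementally.
def cf(points):
    return (len(points), sum(points), sum(p * p for p in points))

def merge(cf1, cf2):
    return tuple(a + b for a, b in zip(cf1, cf2))  # CFs add componentwise

n, ls, ss = merge(cf([1, 2]), cf([5, 6, 7]))
print((n, ls, ss))             # (5, 21, 115)
print(ls / n)                  # centroid: 4.2
print(ss / n - (ls / n) ** 2)  # variance: 5.36
```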

Taking advantage of the property of density
- If a region is dense in a higher-dimensional subspace, its projections onto lower-dimensional subspaces are also dense. How can we use this property?
- CLIQUE (CLustering In QUEst):
  - With high-dimensional data, there are many void subspaces
  - Using the property above, we can start from the dense lower-dimensional units (a sketch follows)
  - CLIQUE is a density-based method that can automatically find subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
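A hedged, Apriori-style sketch of CLIQUE's bottom-up step on 2-D data, with assumed points and an assumed density threshold: only grid cells whose 1-D projections are both dense can be dense in 2-D, so candidates are generated from the dense intervals instead of scanning every cell.

```python
# Find dense 2-D unit cells by first finding dense 1-D intervals.
from collections import Counter

points = [(1.1, 1.2), (1.3, 1.4), (1.2, 1.3), (5.1, 5.2), (5.3, 5.1), (5.2, 5.3)]
tau = 2  # assumed density threshold per unit interval/cell

dense_x = {u for u, c in Counter(int(x) for x, _ in points).items() if c >= tau}
dense_y = {u for u, c in Counter(int(y) for _, y in points).items() if c >= tau}
cells = Counter((int(x), int(y)) for x, y in points
                if int(x) in dense_x and int(y) in dense_y)
print([cell for cell, c in cells.items() if c >= tau])  # [(1, 1), (5, 5)]
```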

Chameleon
- A hierarchical clustering algorithm using dynamic modeling
- Motivated by observations on the weaknesses of CURE and ROCK:
  - CURE: clustering using representatives
  - ROCK: clustering categorical attributes
- Based on k-NN graphs and dynamic modeling (Han & Kamber 2001)

Graph-based clustering
- Sparsification techniques keep the connections to a point's most similar (nearest) neighbors while breaking the connections to less similar points.
- The nearest neighbors of a point tend to belong to the same class as the point itself.
- This reduces the impact of noise and outliers and sharpens the distinction between clusters. (Sketched below.)
- From Tan et al. 2006, Chapter 9
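A hedged sketch of k-nearest-neighbor sparsification, the simplest instance of the idea above: keep each point's k nearest neighbors and drop every other edge. scikit-learn's kneighbors_graph is an assumed convenience, and k = 2 is an arbitrary choice.

```python
# Sparsify a similarity graph down to each point's 2 nearest neighbors.
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
A = kneighbors_graph(X, n_neighbors=2, mode="connectivity")
print(A.toarray().astype(int))  # no edges survive between the two groups
```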

Neural networks
- Self-organizing feature maps (SOMs)
Subspace clustering
- CLIQUE: if a k-dimensional unit space is dense, then so are its (k-1)-dimensional subspaces
- More will be discussed later
Semi-supervised clustering
- http://www.cs.utexas.edu/~ml/publication/unsupervised.html
- http://www.cs.utexas.edu/users/ml/risc/

Challenges
- Scalability
- Dealing with different types of attributes
- Clusters with arbitrary shapes
- Automatically determining input parameters
- Dealing with noise (outliers)
- Insensitivity to the order in which instances are presented
- High dimensionality
- Interpretability and usability