CSE572, CBS598: Data Mining by H. Liu

Presentation transcript:

Clustering
- Basic concepts with simple examples
- Categories of clustering methods
- Challenges
11/27/2018 CSE572, CBS598: Data Mining by H. Liu

What is clustering?
- The process of grouping a set of physical or abstract objects into classes of similar objects.
- It is also called unsupervised learning.
- It is a common and important task that finds many applications.
- Examples of clusters
- Examples where we need clustering

Differences from Classification
- How different?
- Which one is more difficult as a learning problem?
- How do we cluster?
- How to measure the results of clustering?
- With/without class labels

Major clustering methods
- Partitioning methods: k-Means (and EM), k-Medoids
- Hierarchical methods: agglomerative, divisive, BIRCH
- Similarity and dissimilarity of points in the same cluster and from different clusters
- Distance measures between clusters: minimum, maximum, means of clusters, average between clusters

Clustering -- Example 1
- For simplicity, 1-dimensional objects and k=2.
- Objects: 1, 2, 5, 6, 7
- K-means:
  - Randomly select 5 and 6 as centroids;
  - => Two clusters {1,2,5} and {6,7}; meanC1=8/3, meanC2=6.5
  - => {1,2}, {5,6,7}; meanC1=1.5, meanC2=6
  - => no change.
  - Aggregate dissimilarity (sum of squared distances of each object to its cluster mean) = 0.5^2 + 0.5^2 + 1^2 + 0^2 + 1^2 = 2.5
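The iterations in Example 1 can be reproduced with a short sketch. This is a minimal 1-dimensional version of the usual assign-then-update loop, not code from the course; the function names `kmeans_1d` and `sse` are made up for illustration, and the initial centroids 5 and 6 are passed in explicitly instead of being drawn at random.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Assign-then-update k-means loop on 1-D data; returns (clusters, centroids)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # (ties go to the centroid listed first).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster;
        # an empty cluster keeps its old centroid.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # no change, so the procedure has converged
        centroids = new_centroids
    return clusters, centroids


def sse(clusters, centroids):
    """Aggregate dissimilarity: sum of squared distances to the cluster means."""
    return sum((p - m) ** 2 for c, m in zip(clusters, centroids) for p in c)


clusters, centroids = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
print(clusters)                  # [[1, 2], [5, 6, 7]]
print(centroids)                 # [1.5, 6.0]
print(sse(clusters, centroids))  # 2.5
```

The first pass yields {1,2,5} and {6,7}, the second moves 5 to the other cluster, and the third changes nothing, matching the slide.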

Issues with k-means
- A heuristic method
- Sensitive to outliers (how to prove it?)
- Determining k
- Crisp clustering (EM as a soft alternative)
- Not to be confused with k-NN
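One informal way to see the sensitivity to outliers is to add a single extreme value to a cluster and compare how far the mean moves against how far the medoid (the representative used by k-medoids) moves. The sketch below is only an illustration, not a proof; the outlier value 100 and the helper names are assumptions chosen for the demonstration.

```python
def mean(values):
    return sum(values) / len(values)


def medoid(values):
    # The medoid is the member with the smallest total distance to all members
    # (ties go to the member listed first).
    return min(values, key=lambda v: sum(abs(v - w) for w in values))


cluster = [5, 6, 7]
with_outlier = cluster + [100]  # hypothetical outlier

print(mean(cluster), medoid(cluster))            # 6.0 6
print(mean(with_outlier), medoid(with_outlier))  # 29.5 6
```

The mean is dragged from 6 to 29.5 by one point, while the medoid stays at 6.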

Clustering -- Example 2
- For simplicity, we still use 1-dimensional objects.
- Objects: 1, 2, 5, 6, 7
- Agglomerative clustering: a very frequently used algorithm
- How to cluster: find the two closest objects and merge (each merged cluster is then represented by its mean);
  - => {1,2}, so we now have {1.5, 5, 6, 7};
  - => {1,2}, {5,6}, so {1.5, 5.5, 7};
  - => {1,2}, {{5,6},7}.
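The merge sequence in Example 2 can be traced with a small sketch. It assumes, as the slide does, that a merged cluster is represented by its mean and that ties are broken in favour of the leftmost pair; the function name `agglomerate_1d` is hypothetical.

```python
def agglomerate_1d(points, target_clusters=1):
    """Repeatedly merge the two clusters whose means are closest; returns the merges in order."""
    clusters = [[p] for p in sorted(points)]
    merges = []
    while len(clusters) > target_clusters:
        means = [sum(c) / len(c) for c in clusters]
        # Clusters are contiguous runs of sorted 1-D data, so their means are
        # sorted too and the closest pair of means is always an adjacent pair.
        i = min(range(len(clusters) - 1), key=lambda j: means[j + 1] - means[j])
        merged = clusters[i] + clusters[i + 1]
        merges.append(merged)
        clusters[i:i + 2] = [merged]
    return merges


for step in agglomerate_1d([1, 2, 5, 6, 7]):
    print(step)
# [1, 2]
# [5, 6]
# [5, 6, 7]
# [1, 2, 5, 6, 7]
```

The recorded merges form the dendrogram from the bottom up; stopping early (for example with `target_clusters=2`) corresponds to cutting the dendrogram at that level.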

Issues with dendrograms
- How to find proper clusters
- An alternative: divisive algorithms
  - Top down
  - Compared with bottom-up, which is more efficient?
  - What’s the complexity?
- How to divide the data
  - A heuristic: Minimum Spanning Tree
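One common reading of the minimum-spanning-tree heuristic is: build an MST over the objects, then cut its longest edge to split the data in two. The sketch below follows that reading on 1-dimensional data, which is an assumption about what the slide intends; it uses Prim's algorithm, needs at least two points, and the function names are made up for illustration.

```python
def mst_edges(points):
    """Prim's algorithm on the complete graph over 1-D points; returns (weight, i, j) edges."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        # Pick the cheapest edge that connects the tree to a new point.
        w, i, j = min((abs(points[i] - points[j]), i, j)
                      for i in in_tree
                      for j in range(len(points)) if j not in in_tree)
        edges.append((w, i, j))
        in_tree.add(j)
    return edges


def split_on_longest_edge(points):
    """One divisive step: remove the heaviest MST edge and return the two sides."""
    edges = mst_edges(points)
    edges.remove(max(edges))
    neighbours = {i: set() for i in range(len(points))}
    for _, i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    # Flood-fill from index 0 to collect one side of the cut.
    side, stack = set(), [0]
    while stack:
        i = stack.pop()
        if i not in side:
            side.add(i)
            stack.extend(neighbours[i] - side)
    left = sorted(points[i] for i in side)
    right = sorted(points[i] for i in range(len(points)) if i not in side)
    return left, right


print(split_on_longest_edge([1, 2, 5, 6, 7]))  # ([1, 2], [5, 6, 7])
```

Applying the same step recursively to each side would produce a top-down hierarchy.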

Distance measures
- Single link: measured by the shortest edge between the two clusters
- Complete link: measured by the longest edge
- Average link: measured by the average edge length
- An example
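As a small illustration of the three measures, the sketch below computes them for the clusters {1,2} and {5,6,7} from the earlier examples. An "edge" is taken to be the absolute difference between one object from each cluster; the function names are chosen for this sketch.

```python
from itertools import product


def edge_lengths(a, b):
    """All distances between one object from cluster a and one from cluster b."""
    return [abs(x - y) for x, y in product(a, b)]


def single_link(a, b):
    return min(edge_lengths(a, b))      # shortest edge


def complete_link(a, b):
    return max(edge_lengths(a, b))      # longest edge


def average_link(a, b):
    d = edge_lengths(a, b)
    return sum(d) / len(d)              # average edge length


a, b = [1, 2], [5, 6, 7]
print(single_link(a, b), complete_link(a, b), average_link(a, b))  # 3 6 4.5
```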

Other Methods
- Density-based methods
  - DBSCAN: a cluster is a maximal set of density-connected points
  - Core points are defined using the epsilon-neighborhood and minPts
  - Apply the notions of directly density-reachable (e.g., P and Q, Q and M), density-reachable (P and M, assuming so are P and N), and density-connected (any density-reachable points, P, Q, M, N) to form clusters
- Grid-based methods
  - STING: the lowest level is the original data; statistical parameters of higher-level cells are computed from the parameters of the lower-level cells (count, mean, standard deviation, min, max, distribution)
- Model-based methods
  - Conceptual clustering: COBWEB
  - Category utility: intraclass similarity, interclass dissimilarity
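The DBSCAN definitions can be made concrete with a minimal 1-dimensional sketch. It assumes the usual convention that a point counts as a member of its own epsilon-neighborhood; the function name `dbscan_1d`, the sample data, and the parameter values are illustrative choices, not taken from the slides.

```python
def dbscan_1d(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)
    # Epsilon-neighborhood of each point (the point itself is included).
    neighbours = [[j for j in range(n) if abs(points[i] - points[j]) <= eps]
                  for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbours]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from an unlabelled core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbours[p]:      # q is directly density-reachable from p
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:          # only core points extend the cluster
                        frontier.append(q)
        cluster += 1
    return labels


print(dbscan_1d([1, 2, 5, 6, 7, 20], eps=1, min_pts=2))  # [0, 0, 1, 1, 1, -1]
```

With `min_pts=3` only 6 is a core point, so 5 and 7 become border points of a single cluster and 1, 2, and 20 are labelled noise, which shows how strongly the result depends on the input parameters.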

- Neural networks
  - Self-organizing feature maps (SOMs)
- Subspace clustering
  - CLIQUE: if a k-dimensional unit space is dense, then so are its (k-1)-d subspaces

Challenges
- Scalability
- Dealing with different types of attributes
- Clusters with arbitrary shapes
- Automatically determining input parameters
- Dealing with noise (outliers)
- Insensitivity to the order in which instances are presented to learning
- High dimensionality
- Interpretability and usability