Clustering.


Clustering

Introduction

Clustering. Summarization of large data: understand the large customer data. Data organization: manage the large customer data. Outlier detection: find unusual customer data.

Clustering as a preprocessing step before classification/association: find useful groupings for classes; find association rules within a particular cluster.

Problem Description. Given: a data set of N data items, each with a d-dimensional feature vector. Task: determine a natural, useful partitioning of the data set into a number of clusters (k) and noise.

Measure of closeness: similarity. Dice's coefficient, simple matching, cosine coefficient, Jaccard's coefficient.
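The four coefficients named above can be sketched on term vectors as follows. The slide does not give the formulas, so this assumes the vector forms commonly used in information retrieval, where simple matching is the raw inner product (in statistics the same name often refers to a different, match-counting coefficient). The function names and the sample vectors are illustrative.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def simple_matching(x, y):
    # Assumed IR-style definition: the raw overlap between the vectors.
    return dot(x, y)

def dice(x, y):
    return 2 * dot(x, y) / (dot(x, x) + dot(y, y))

def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

def jaccard(x, y):
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

# Hypothetical binary term vectors for two documents.
doc_a = [1, 1, 0, 1]
doc_b = [1, 0, 1, 1]
for measure in (simple_matching, dice, cosine, jaccard):
    print(measure.__name__, measure(doc_a, doc_b))
```

All four grow with the overlap between the vectors; Dice, cosine and Jaccard additionally normalize by vector size, and are undefined here for all-zero vectors.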

Measure of closeness: dissimilarity. Distance measure (distance = dissimilarity): Manhattan distance, Euclidean distance, Minkowski metric, Mahalanobis distance.
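As a rough sketch, the distance measures listed above can be written with the standard textbook definitions; the slide itself gives no formulas. Manhattan and Euclidean distance are the Minkowski metric with p = 1 and p = 2. The `inv_cov` argument (the inverse covariance matrix, supplied by the caller) is a choice made for this example.

```python
import math

def minkowski(x, y, p):
    """Minkowski metric of order p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def manhattan(x, y):
    return minkowski(x, y, 1)

def euclidean(x, y):
    return minkowski(x, y, 2)

def mahalanobis(x, y, inv_cov):
    """sqrt(d^T S^-1 d), where inv_cov is the inverse covariance matrix S^-1."""
    diff = [a - b for a, b in zip(x, y)]
    weighted = [sum(row[k] * diff[k] for k in range(len(diff))) for row in inv_cov]
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))

p, q = (0.0, 0.0), (3.0, 4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
print(manhattan(p, q), euclidean(p, q), mahalanobis(p, q, identity))
```

With the identity matrix as `inv_cov`, the Mahalanobis distance reduces to the Euclidean distance.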

Similarity Matrix. Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower triangular part of the matrix.

Similarity Matrix - Example: Term-Term Similarity Matrix

Similarity Thresholds. A similarity threshold is used to mark pairs that are “sufficiently” similar. The threshold value is application and collection dependent. The previous example uses a threshold value of 10.
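A minimal sketch of how a term-term similarity matrix and its thresholded version might be computed. The slide's own matrix is not preserved in the transcript, so the document-term counts below are made up for illustration. The sketch also assumes that similarity is the inner product of term columns and that "sufficiently similar" means greater than or equal to the threshold.

```python
# Hypothetical document-term matrix: rows are documents, columns are terms T1..T3.
doc_term = [
    [2, 0, 3],
    [1, 4, 0],
    [3, 1, 2],
]

def term_term_similarity(matrix):
    """Return only the lower triangle, since the matrix is symmetric."""
    n_terms = len(matrix[0])
    lower = {}
    for i in range(n_terms):
        for j in range(i):
            lower[(i, j)] = sum(row[i] * row[j] for row in matrix)
    return lower

def threshold_pairs(lower, threshold):
    """Term pairs whose similarity reaches the threshold (assumed inclusive)."""
    return sorted(pair for pair, value in lower.items() if value >= threshold)

lower = term_term_similarity(doc_term)
print(lower)
print(threshold_pairs(lower, 10))
```

The pairs that survive the threshold are the edges of the graph used by graph-based clustering methods.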

Clustering methods: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, graph-based methods.

Partitioning methods: K-means. 1. Choose k objects as the initial cluster centers; set i=0. 2. Loop: assign each object to its nearest centroid, then recompute each cluster's centre as the mean of the cluster.
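The loop above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: it assumes Euclidean distance, takes the first k objects as initial centers, and stops when assignments no longer change (the slide does not state a stopping rule). The `max_iter` cap and the sample points are illustrative.

```python
import math

def kmeans(points, k, max_iter=100):
    # Assumption: the first k objects serve as the initial cluster centers.
    centers = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign every object to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no change in clusters
        assignment = new_assignment
        # Recompute each center as the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

points = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1), (8, 9)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

The result depends on the initial centers; different starting objects can lead to a different partitioning.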

Partitioning methods [Figure: panels I and II, with the centroid labeled]

Partitioning Method (Iterative method)
The basic algorithm:
1. Select M cluster representatives (centroids).
2. For i = 1 to N, assign Di to the most similar centroid.
3. For j = 1 to M, recalculate the cluster centroid Cj.
4. Repeat steps 2 and 3 until there is (little or) no change in clusters.
Example: initial (arbitrary) assignment: C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6}. [Table: cluster centroids]

Partitioning Method (Iterative method), example (continued). Now, using a simple similarity measure, compute the new cluster-term similarity matrix. Then compute new cluster centroids using the original document-term matrix. The process is repeated until no further changes are made to the clusters.


Hierarchical methods
Group objects into a tree of clusters.
Types:
Agglomerative (bottom-up approach): single-linkage, complete-linkage, group-linkage, centroid-linkage, Ward's method.
Divisive (top-down approach): use of K-means clustering.
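A minimal sketch of the agglomerative (bottom-up) approach with three of the linkage rules listed above. It assumes Euclidean distance and reads "group-linkage" as the average pairwise distance between clusters. The function names, the `num_clusters` stopping rule and the sample points are illustrative.

```python
import math
from itertools import combinations

def linkage_distance(a, b, method):
    """Distance between clusters a and b under the chosen linkage rule."""
    dists = [math.dist(p, q) for p in a for q in b]
    if method == "single":      # closest pair
        return min(dists)
    if method == "complete":    # farthest pair
        return max(dists)
    if method == "group":       # assumed to mean the average over all pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {method}")

def agglomerative(points, num_clusters, method="single"):
    clusters = [[p] for p in points]  # start with one cluster per object
    while len(clusters) > num_clusters:
        # Merge the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters

points = [(0.0,), (1.0,), (2.5,), (6.0,), (7.2,)]
print(agglomerative(points, num_clusters=2, method="single"))
```

Recording each merge instead of stopping at `num_clusters` would give the full tree of clusters.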

Hierarchical methods [Figure: step-by-step hierarchical clustering of objects a, b, c, d, e]

Hierarchical methods: Ward’s method. At each step, join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares.
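The merge criterion can be sketched as below. The increase in the within-group error sum of squares (SSE) caused by merging clusters A and B equals |A||B| / (|A| + |B|) times the squared distance between their centroids, so it can be computed without rebuilding the merged cluster. The helper names and sample clusters are illustrative.

```python
from itertools import combinations

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def sse(cluster):
    """Within-group error sum of squares of one cluster."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def ward_increase(a, b):
    """Increase in total SSE if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap

def best_ward_merge(clusters):
    """Indices of the cluster pair whose merger increases SSE the least."""
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]),
    )

a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 0.0)]
c = [(2.0, 1.0)]
print(ward_increase(a, b), sse(a + b) - sse(a) - sse(b))
print(best_ward_merge([a, b, c]))
```

The two numbers on the first printed line agree up to rounding, which is a quick check that the shortcut formula matches the definition.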


Graph Representation. The similarity matrix can be visualized as an undirected graph: each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix). [Figure: graph with nodes T1 to T8]

Clustering Algorithms (Graph-based)
Basic clustering techniques try to determine which objects belong to the same class.
Clique Method (complete link): all items within a cluster must be within the similarity threshold of all other items in that cluster; clusters may overlap; generally produces small but very tight clusters.
Single Link Method: any item in a cluster must be within the similarity threshold of at least one other item in that cluster; produces larger but weaker clusters.
Other methods:
Star method: start with an item and place all related items in that cluster.
String method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on.

Clustering Algorithms (Graph-based): Clique Method. A clique is a completely connected subgraph of a graph. In the clique method, each maximal clique in the graph becomes a cluster. Maximal cliques (and therefore the clusters) in the previous example are: {T1, T3, T4, T6}, {T2, T4, T6}, {T2, T6, T8}, {T1, T5}, {T7}. Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
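One way to enumerate maximal cliques is the basic Bron-Kerbosch recursion, sketched below; the slide does not say which procedure it uses. The graph figure is not preserved in the transcript, so the edge list is reconstructed from the maximal cliques listed above (every edge of a graph lies inside some maximal clique), which assumes that listing is complete.

```python
NODES = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"]
# Edges reconstructed from the listed cliques (an assumption, see lead-in).
EDGES = [
    ("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
    ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
    ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8"),
]

def build_graph(nodes, edges):
    graph = {n: set() for n in nodes}
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def maximal_cliques(graph):
    found = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            found.append(current)  # nothing can extend it: maximal
            return
        for v in list(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return sorted(found, key=sorted)

cliques = maximal_cliques(build_graph(NODES, EDGES))
for clique in cliques:
    print(sorted(clique))
```

The isolated node T7 comes out as a clique of size one, and T4 and T6 each appear in more than one cluster, which illustrates that clique clusters may overlap.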

Clustering Algorithms (Graph-based): Single Link Method
1. Select an item not in a cluster and place it in a new cluster.
2. Place all other related items in that cluster.
3. Repeat step 2 for each item in the cluster until nothing more can be added.
4. Repeat steps 1-3 for each item that remains unclustered.
In this case the single link method produces only two clusters: {T1, T3, T4, T5, T6, T2, T8} and {T7}. Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
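The four steps amount to finding the connected components of the thresholded graph, as this sketch shows. The graph figure is not preserved in the transcript, so the edge list is reconstructed from the maximal cliques given for the same example and should be treated as an assumption. Visiting items and neighbours in label order is also a choice made here.

```python
NODES = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"]
# Reconstructed edge list (assumption, see lead-in).
EDGES = [
    ("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
    ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
    ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8"),
]

def single_link_clusters(nodes, edges):
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    assigned, clusters = set(), []
    for start in nodes:                      # steps 1 and 4
        if start in assigned:
            continue
        cluster = [start]
        assigned.add(start)
        i = 0
        while i < len(cluster):              # step 3: revisit every member
            for nb in sorted(neighbors[cluster[i]]):  # step 2
                if nb not in assigned:
                    assigned.add(nb)
                    cluster.append(nb)
            i += 1
        clusters.append(cluster)
    return clusters

single_link = single_link_clusters(NODES, EDGES)
print(single_link)
```

Each item is assigned exactly once, so the clusters cannot overlap.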

Clustering Algorithms (Graph-based). Star method: {T1, T3, T4, T5, T6}, {T2, T8}, {T7}. String method: {T1, T3, T4, T2, T6, T8}, {T5}.
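A sketch of how the star and string results above might be produced. Three things are assumed: the edge list is reconstructed from the maximal cliques listed for this example (the figure is not preserved), items are visited in label order, and an item already placed in a cluster is not reused. Other orderings can give different clusters.

```python
NODES = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"]
# Reconstructed edge list (assumption, see lead-in).
EDGES = [
    ("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
    ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
    ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8"),
]

def build_neighbors(nodes, edges):
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors

def star_clusters(nodes, neighbors):
    """Each seed takes all of its still-unclustered neighbours."""
    assigned, clusters = set(), []
    for seed in nodes:
        if seed in assigned:
            continue
        cluster = [seed] + [n for n in sorted(neighbors[seed]) if n not in assigned]
        assigned.update(cluster)
        clusters.append(cluster)
    return clusters

def string_clusters(nodes, neighbors):
    """Each seed starts a chain, always extended from the last item added."""
    assigned, clusters = set(), []
    for seed in nodes:
        if seed in assigned:
            continue
        chain, current = [seed], seed
        assigned.add(seed)
        while True:
            free = sorted(n for n in neighbors[current] if n not in assigned)
            if not free:
                break
            current = free[0]
            chain.append(current)
            assigned.add(current)
        clusters.append(chain)
    return clusters

neighbors = build_neighbors(NODES, EDGES)
star = star_clusters(NODES, neighbors)
string = string_clusters(NODES, neighbors)
print(star)
print(string)
```

Under these assumptions the string method also leaves the isolated item T7 in a cluster of its own, which the slide's listing does not show.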


Density-based methods. Clusters: density-connected sets. DBSCAN algorithm.
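The slide only names DBSCAN, so the following is a compact sketch of the standard algorithm rather than the presenter's own version. `eps` (neighbourhood radius) and `min_pts` (minimum neighbourhood size for a core point) are the algorithm's usual parameters. The brute-force neighbour search and the sample points are for illustration only.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)

    def region(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        labels[i] = cluster_id         # i is a core point: start a cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reach = region(j)
            if len(reach) >= min_pts:   # j is a core point too: keep expanding
                queue.extend(reach)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Points that are density-connected end up with the same label, and the isolated point is reported as noise.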

Density-based methods. Based on a set of density distribution functions. Density function


Grid-based methods. Organize the data space as a grid file. Determine clusters as density-connected components of the grid. This approximates the clusters found by DBSCAN.
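A rough two-dimensional sketch of the idea: bin the points into grid cells, keep the cells that are dense enough, and join adjacent dense cells into clusters. The parameters `cell_size` and `min_count`, the 8-neighbour adjacency rule and the sample points are assumptions made for this example. The slide does not specify them.

```python
from collections import defaultdict

def grid_clusters(points, cell_size, min_count):
    # Organize the data space as a grid: map each cell to its point indices.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)
    dense = {c for c, members in cells.items() if len(members) >= min_count}

    labels = [-1] * len(points)   # -1 marks points in sparse cells
    seen = set()
    cluster_id = 0
    for cell in sorted(dense):
        if cell in seen:
            continue
        stack = [cell]
        seen.add(cell)
        while stack:              # flood-fill over adjacent dense cells
            cx, cy = stack.pop()
            for idx in cells[(cx, cy)]:
                labels[idx] = cluster_id
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        cluster_id += 1
    return labels

points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.4, 0.6),
          (8.2, 8.1), (8.5, 8.5), (4.5, 4.5)]
labels = grid_clusters(points, cell_size=1.0, min_count=2)
print(labels)
```

Because the work is done per cell rather than per pair of points, the result is only an approximation of a point-level density clustering and depends on the cell size.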


Model-based methods. Optimize the fit between the given data and some mathematical model. N-dim. centroid vector …

Self-Organizing Map (SOM). A sample data vector X is randomly chosen. BMU (best matching unit): the map unit with centroid closest to X. Update the centroid vector using a neighborhood kernel function and a learning rate.
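One training step can be sketched as follows, using the usual update m_i ← m_i + α · h(bmu, i) · (X − m_i). The slide does not give the kernel, so a Gaussian neighbourhood kernel over a one-dimensional map is assumed here. The names `codebook`, `learning_rate` and `sigma` and the sample values are illustrative.

```python
import math

def best_matching_unit(codebook, x):
    """Index of the map unit whose centroid vector is closest to x."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], x))

def som_update(codebook, x, learning_rate, sigma):
    """Return the BMU index and the codebook after one update step."""
    bmu = best_matching_unit(codebook, x)
    updated = []
    for i, m in enumerate(codebook):
        # Assumed Gaussian neighbourhood kernel on the 1-D map position.
        h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
        updated.append([mk + learning_rate * h * (xk - mk) for mk, xk in zip(m, x)])
    return bmu, updated

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # three map units
sample = [1.2, 0.8]                                # randomly chosen X in practice
bmu, new_codebook = som_update(codebook, sample, learning_rate=0.5, sigma=1.0)
print(bmu, new_codebook)
```

The BMU moves the most, and its map neighbours move by a smaller amount set by the kernel. In practice the learning rate and kernel width are usually decreased over time.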

Self-Organizing Map [Figure: output layer and input sample, with the map shown before and after updating]

Self-Organizing Map SOM