CLUSTERING

Overview
Definition of clustering
Existing clustering methods
Clustering examples

Definition
Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. Clustering is “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

Why clustering? A few good reasons...
Simplification
Pattern detection
Useful in data concept construction
An unsupervised learning process

Where to use clustering?
Data mining
Information retrieval
Text mining
Web analysis
Medical diagnostics

Major existing clustering methods
Distance-based
Hierarchical
Partitioning
Probabilistic

Measuring Similarity
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j).
There is a separate “quality” function that measures the “goodness” of a cluster.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
Weights should be associated with the different variables based on the application and the semantics of the data.
It is hard to define “similar enough” or “good enough” – the answer is typically highly subjective.
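To make d(i, j) concrete, here is a minimal sketch in plain Python of two widely used distance functions (the John/Henry records are hypothetical student measurements, reused in the K-means example later in these slides):

```python
import math

def euclidean(p, q):
    # Straight-line distance: square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

john = (20, 170, 80)    # hypothetical {Age, Height, Weight} record
henry = (30, 160, 120)
print(euclidean(john, henry))  # ~42.4
print(manhattan(john, henry))  # 60
```

Note how the unweighted formulas let the large Weight difference dominate – this is exactly why the slide recommends weighting variables according to their semantics.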

Hierarchical clustering
Agglomerative (bottom-up):
1. Start with each point as its own cluster (a singleton).
2. Recursively merge two or more of the most appropriate (closest) clusters.
3. Stop when k clusters have been formed.
Divisive (top-down):
1. Start with one big cluster.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters have been formed.

General steps of hierarchical clustering
Given a set of N items to be clustered and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into K clusters.
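This procedure is implemented, for example, by SciPy's linkage routine. A minimal sketch, assuming SciPy is installed and using five made-up 2-D points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Five made-up 2-D points to cluster.
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]])

# pdist computes the pairwise distance matrix (condensed form) of step 1;
# linkage then repeats the merge-and-update steps until one cluster remains.
Z = linkage(pdist(X), method='single')
print(Z)  # one row per merge: [cluster i, cluster j, merge distance, new size]
```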

Exclusive vs. non-exclusive clustering
In the first case, data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster it cannot be included in another cluster. A simple example is a set of points separated into two groups by a straight line on a two-dimensional plane. The second type, overlapping clustering, instead uses fuzzy sets to cluster the data, so that each point may belong to two or more clusters with different degrees of membership.
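To see what a degree of membership looks like, here is a small sketch of the standard fuzzy c-means membership formula (the point and centers are made up; m is the usual fuzzifier parameter, and the formula is undefined when the point coincides with a center):

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    # Fuzzy c-means membership: the point belongs to every cluster with a
    # degree in [0, 1], and the degrees sum to 1 across the clusters.
    d = [math.dist(point, c) for c in centers]
    return [1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1.0)) for k in range(len(d)))
            for i in range(len(d))]

# A point midway between two hypothetical centers belongs 50/50 to both.
print(fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (2.0, 0.0)]))  # [0.5, 0.5]
```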

Partitioning clustering
1. Divide the data into proper subsets.
2. Recursively visit each subset and relocate points between clusters (as opposed to the visit-once approach of hierarchical clustering).
This recursive relocation yields higher-quality clusters.

Probabilistic clustering
1. Data are assumed to be drawn from a mixture of probability distributions.
2. The mean and variance of each distribution serve as the parameters of a cluster.
3. Each point receives a single cluster membership.
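This is the idea behind, for instance, Gaussian mixture models. A minimal sketch using scikit-learn's GaussianMixture (assuming scikit-learn is installed; the data are synthetic):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data drawn from a mixture of two normal distributions.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100),
                    rng.normal(10, 1, 100)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.means_, gm.covariances_)  # estimated mean/variance per component
print(gm.predict(X[:5]))           # single (hard) cluster membership per point
```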

Single-Linkage Clustering (hierarchical)
The N*N proximity matrix is D = [d(i,j)].
The clusterings are assigned sequence numbers 0, 1, ..., (n-1).
L(k) is the level of the k-th clustering.
A cluster with sequence number m is denoted (m).
The proximity between clusters (r) and (s) is denoted d[(r),(s)].

The algorithm is composed of the following steps:
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is taken over all pairs of clusters in the current clustering.

The algorithm is composed of the following steps (cont.):
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }.
5. If all objects are in one cluster, stop. Otherwise, go to step 2.
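The steps translate almost line by line into code. An unoptimized from-scratch sketch (NumPy assumed; labels is simply a list of names for the N items):

```python
import numpy as np

def single_linkage(D, labels):
    # Naive single-linkage clustering on a symmetric N*N distance matrix D,
    # following steps 1-5 above; returns the level L(m) of every merge.
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)          # a cluster never merges with itself
    clusters = [{i} for i in range(len(labels))]
    active = list(range(len(labels)))    # clusters present in the current matrix
    merges = []
    while len(active) > 1:
        # Step 2: find the least dissimilar pair of clusters (r), (s).
        _, r, s = min((D[a, b], a, b) for i, a in enumerate(active)
                      for b in active[i + 1:])
        level = D[r, s]                  # Step 3: L(m) = d[(r),(s)]
        clusters[r] |= clusters[s]       # merge (s) into (r)
        active.remove(s)
        merges.append((level, sorted(labels[i] for i in clusters[r])))
        # Step 4: d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
        for k in active:
            if k != r:
                D[r, k] = D[k, r] = min(D[r, k], D[s, k])
    return merges

# Toy run on four hypothetical 1-D points at positions 0, 1, 5 and 7.
pts = np.array([0.0, 1.0, 5.0, 7.0])
D = np.abs(pts[:, None] - pts[None, :])
for level, members in single_linkage(D, ['a', 'b', 'c', 'd']):
    print(level, members)   # merges at levels 1.0, 2.0 and 4.0
```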

Hierarchical clustering example
Let’s now see a simple example: a hierarchical clustering of distances in kilometers between some Italian cities (BA = Bari, FI = Florence, MI = Milan, NA = Naples, RM = Rome, TO = Turin). The method used is single-linkage. Input distance matrix (L = 0 for all the clusters):

      BA   FI   MI   NA   RM   TO
BA     0  662  877  255  412  996
FI   662    0  295  468  268  400
MI   877  295    0  754  564  138
NA   255  468  754    0  219  869
RM   412  268  564  219    0  669
TO   996  400  138  869  669    0

The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1. Then we compute the distance from this new compound object to all other objects. In single link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and so on.

After merging MI with TO we obtain the following matrix:

        BA   FI  MI/TO   NA   RM
BA       0  662    877  255  412
FI     662    0    295  468  268
MI/TO  877  295      0  754  564
NA     255  468    754    0  219
RM     412  268    564  219    0

min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219, m = 2

min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255, m = 3

min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268, m = 4

Finally, we merge the last two clusters, BA/FI/NA/RM and MI/TO, at level 295. The process is summarized by the resulting hierarchical tree (dendrogram, image omitted here): MI and TO join at level 138, NA and RM at 219, BA joins NA/RM at 255, FI joins BA/NA/RM at 268, and the final merge happens at 295.
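As a cross-check, running SciPy's single-linkage routine on the input distance matrix above reproduces exactly these merge levels:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Distance matrix between BA, FI, MI, NA, RM, TO from the slides above.
D = np.array([[  0, 662, 877, 255, 412, 996],
              [662,   0, 295, 468, 268, 400],
              [877, 295,   0, 754, 564, 138],
              [255, 468, 754,   0, 219, 869],
              [412, 268, 564, 219,   0, 669],
              [996, 400, 138, 869, 669,   0]])

# squareform converts the square matrix to the condensed form linkage expects.
Z = linkage(squareform(D), method='single')
print(Z[:, 2])  # merge levels: [138. 219. 255. 268. 295.]
```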

K-means algorithm
1. It accepts as input the number of clusters K to group the data into and the dataset to cluster.
2. It then creates the first K initial clusters from the dataset by choosing K rows of data at random. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, the first K = 3 initial clusters are created by selecting 3 records at random from the dataset. Each of the 3 initial clusters formed contains just one row of data.

3. The K-means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters there is only one record, so the arithmetic mean of such a cluster is simply the set of values that make up that record. For example, if the dataset is a set of height, weight and age measurements for students at a university, where a record P in the dataset S is represented as P = {Age, Height, Weight}, then a record for a student John would be John = {20, 170, 80}, where John's age = 20 years, height = 170 centimeters and weight = 80 kilograms. Since there is only one record in each initial cluster, the arithmetic mean of the cluster containing only John is {20, 170, 80}.

4. Next, K-means assigns each record in the dataset to exactly one of the initial clusters. Each record is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean distance or the Manhattan (city-block) distance.
5. K-means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains the two records John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean P_mean is represented as P_mean = {Age_mean, Height_mean, Weight_mean}, with Age_mean = (20 + 30)/2 = 25, Height_mean = (170 + 160)/2 = 165 and Weight_mean = (80 + 120)/2 = 100. The arithmetic mean of this cluster is therefore {25, 165, 100}. This new arithmetic mean becomes the center of the cluster. Following the same procedure, new cluster centers are computed for all the existing clusters.

6. K-means re-assigns each record in the dataset to exactly one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are formed and the K-means clustering procedure is complete. Stable clusters are formed when a new iteration of the algorithm creates no new clusters, i.e., the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for determining when stable clusters have formed, i.e., when the K-means procedure is complete.
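Putting steps 1-7 together, an unoptimized from-scratch sketch in NumPy (the records are hypothetical {Age, Height, Weight} measurements in the style of the John/Henry example; in practice a library implementation such as scikit-learn's KMeans would normally be used):

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: choose K rows at random as the initial one-record clusters.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Steps 4/6: assign each record to the nearest center (Euclidean).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 3/5: recompute each center as the arithmetic mean of its
        # cluster (empty clusters are not handled in this sketch).
        new_centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Step 7: stop once the centers no longer move (stable clusters).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

students = np.array([[20, 170, 80], [30, 160, 120], [22, 175, 85], [31, 158, 118]])
labels, centers = k_means(students, k=2)
print(labels)   # cluster assignment of each record
print(centers)  # final arithmetic means (cluster centers)
```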