Clustering Algorithms k-means Hierarchic Agglomerative Clustering (HAC) …. BIRCH Association Rule Hypergraph Partitioning (ARHP) Categorical clustering.

Presentation transcript:

Clustering Algorithms
  k-means
  Hierarchic Agglomerative Clustering (HAC)
  ….
  BIRCH
  Association Rule Hypergraph Partitioning (ARHP)
  Categorical clustering (CACTUS, STIRR)
  ……
  STC
  QDC

Hierarchical clustering
Given a set of N items to be clustered and an N×N distance (or similarity) matrix:
1. Start by assigning each item to its own cluster.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that there is now one fewer cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
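The four steps above can be sketched in a few lines of Python. This is only an illustration, not the slides' own implementation: the function name `agglomerative` is hypothetical, the input is assumed to be a symmetric list-of-lists distance matrix, and the cluster distance is assumed to be the closest pair of members (single-link), since the slide does not fix that choice here.

```python
def agglomerative(dist):
    """Naive agglomerative clustering over a symmetric N x N distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a sorted tuple of item indices.
    """
    # Step 1: every item starts in its own cluster.
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def cluster_distance(a, b):
        # Assumption: single-link, i.e. distance of the closest pair of members.
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        a, b = clusters[x], clusters[y]
        merges.append((a, b, d))
        # Step 3 is implicit here: distances to the merged cluster are
        # recomputed from the item-level matrix on the next pass.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(tuple(sorted(a + b)))
        # Step 4: loop until a single cluster of size N remains.
    return merges


# Four points on a line; the matrix holds their absolute differences.
points = [0, 1, 5, 7]
matrix = [[abs(p - q) for q in points] for p in points]
for a, b, d in agglomerative(matrix):
    print(a, "+", b, "at distance", d)
```

Recomputing every cluster distance from scratch keeps the sketch short but makes it slow; a practical version would update a cluster-level matrix after each merge instead.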

Iwona Białynicka-Birula - Clustering Web Search Results Agglomerative hierarchical clustering

Iwona Białynicka-Birula - Clustering Web Search Results Clustering result: dendrogram

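Iwona Białynicka-Birula - Clustering Web Search Results
AHC variants
Various ways of calculating cluster similarity:
  single-link (minimum distance between members of the two clusters)
  complete-link (maximum distance between members of the two clusters)
  group-average (average distance between members of the two clusters)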
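The three variants differ only in how the distance between two clusters is derived from item-level distances. A minimal sketch, assuming clusters are tuples of item indices and `dist` is a symmetric distance matrix (the function names are hypothetical). For group-average it uses one common definition, the mean over all cross-cluster pairs; other texts average over all pairs in the merged cluster.

```python
def single_link(a, b, dist):
    # Distance of the closest pair: one member from each cluster.
    return min(dist[i][j] for i in a for j in b)


def complete_link(a, b, dist):
    # Distance of the farthest pair: one member from each cluster.
    return max(dist[i][j] for i in a for j in b)


def group_average(a, b, dist):
    # Mean distance over all cross-cluster pairs (assumed definition).
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


# Four points on a line, split into two clusters of two items each.
points = [0, 1, 5, 7]
matrix = [[abs(p - q) for q in points] for p in points]
left, right = (0, 1), (2, 3)
print(single_link(left, right, matrix))    # 4
print(complete_link(left, right, matrix))  # 7
print(group_average(left, right, matrix))  # 5.5
```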

Strengths and weaknesses
  Can find clusters of arbitrary shapes
  Single link has a chaining problem
  Complete link is sensitive to outliers
  High computational complexity and space requirements (the N×N matrix alone takes O(N²) space)

Data Clustering: K-means
  Partitional clustering
  Initial number of clusters k

K-means
1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
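A minimal sketch of these steps in plain Python follows. Several details are assumptions not stated on the slide: the function names are hypothetical, the initial centroids are K randomly sampled input points, closeness is squared Euclidean distance, and a cluster that ends up empty keeps its previous centroid.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment) for a list of equal-length points."""
    rng = random.Random(seed)
    # Step 1: pick K initial centroids (assumption: sampled from the data).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 3: recalculate the positions of the K centroids.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[c])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2)
print(labels)
```

The `max_iter` cap is a safety net only; on data like this the loop stops after a few iterations because the centroids stop moving.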

Example by Andrew W. Moore


K-means

Iwona Białynicka-Birula - Clustering Web Search Results K-means clustering (k=3)

Strengths and Weaknesses
  Only applicable to data sets where the notion of the mean is defined
  Need to know the number of clusters K in advance
  Sensitive to outliers
  Sensitive to initial seeds
  Not suitable for discovering clusters that are not hyper-ellipsoids (e.g. an L shape)


Iwona Białynicka-Birula - Clustering Web Search Results Single-pass threshold

Document Clustering: k-means
k-means: distance-based flat clustering
Advantages:
  linear time complexity
  works relatively well in low-dimensional space
Drawbacks:
  distance computation in high-dimensional space
  the centroid vector may not summarize the cluster's documents well
  the initial k clusters affect the quality of the clusters
Algorithm:
0. Input: D ::= {d_1, d_2, …, d_n}; k ::= the number of clusters
1. Select k document vectors as the initial centroids of k clusters
2. Repeat
3.   Select one vector d from the remaining documents
4.   Compute similarities between d and the k centroids
5.   Put d in the closest cluster and recompute the centroid
6. Until the centroids don't change
7. Output: k clusters of documents
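The variant above differs from batch k-means in that the centroid is recomputed right after each document is placed. A rough sketch of that loop, under assumptions the slide leaves open: documents are dense term-weight vectors, similarity is cosine, the first k documents serve as initial centroids, and the loop stops when a full pass moves no document. The names `cosine`, `mean_vector` and `document_kmeans` are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def document_kmeans(docs, k, max_passes=50):
    """Return (labels, centroids) for a list of document vectors."""
    # Step 1 (assumption): the first k documents seed the centroids.
    centroids = [list(d) for d in docs[:k]]
    labels = [None] * len(docs)
    for _ in range(max_passes):
        changed = False
        for i, d in enumerate(docs):
            # Step 4: similarity between d and each of the k centroids.
            best = max(range(k), key=lambda c: cosine(d, centroids[c]))
            if best == labels[i]:
                continue
            old, labels[i] = labels[i], best
            changed = True
            # Step 5: recompute the centroids of the clusters that changed.
            for c in (old, best):
                if c is None:
                    continue
                members = [docs[j] for j, lab in enumerate(labels) if lab == c]
                if members:
                    centroids[c] = mean_vector(members)
        # Step 6: a pass without moves leaves the centroids unchanged.
        if not changed:
            break
    return labels, centroids


docs = [[1, 0, 0], [0, 1, 0], [0.9, 0.1, 0], [0.1, 0.9, 0], [1, 0.2, 0]]
labels, _ = document_kmeans(docs, k=2)
print(labels)  # [0, 1, 0, 1, 0]
```

Because the seeds are simply the first k documents, the result depends on input order, which is one face of the "initial k clusters affect the quality" drawback listed above.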

Document Clustering: HAC
Hierarchic agglomerative clustering (HAC): distance-based hierarchic clustering
Advantages:
  produces better-quality clusters
  works relatively well in low-dimensional space
Drawbacks:
  distance computation in high-dimensional space
  quadratic time complexity
Algorithm:
0. Input: D ::= {d_1, d_2, …, d_n}
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3.   Merge the two most similar clusters, K and L, to form a new cluster KL
4.   Compute similarities between KL and each of the remaining clusters and update SIM[i,j]
5. Until there is a single cluster (or a specified number of clusters)
6. Output: dendrogram of clusters
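Step 4 does not require going back to the documents: for single-link and complete-link, the similarity between KL and another cluster M follows directly from the old entries for (K, M) and (L, M). The sketch below illustrates that update under stated assumptions: `sim` is a precomputed symmetric similarity matrix, the dendrogram is represented as nested tuples of document indices, and the function name and its parameters are hypothetical.

```python
def hac_from_similarity(sim, target_clusters=1, linkage="single"):
    """Merge clusters until target_clusters remain; return their dendrograms."""
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    # With similarities, single-link keeps the larger value and
    # complete-link keeps the smaller one.
    combine = max if linkage == "single" else min
    n = len(sim)
    # Each active cluster id maps to its dendrogram (a leaf is a doc index).
    trees = {i: i for i in range(n)}
    # SIM holds one entry per unordered pair of active clusters.
    table = {(i, j): sim[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(trees) > max(target_clusters, 1):
        # Step 3: pick the two most similar clusters K and L.
        k, l = max(table, key=table.get)
        del table[(k, l)]
        new = next_id
        next_id += 1
        # Step 4: derive SIM[KL, M] from SIM[K, M] and SIM[L, M].
        for m in trees:
            if m in (k, l):
                continue
            s_km = table.pop((min(k, m), max(k, m)))
            s_lm = table.pop((min(l, m), max(l, m)))
            table[(m, new)] = combine(s_km, s_lm)
        trees[new] = (trees.pop(k), trees.pop(l))
    return list(trees.values())


sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(hac_from_similarity(sim))                     # [((0, 1), (2, 3))]
print(hac_from_similarity(sim, target_clusters=2))  # [(0, 1), (2, 3)]
```

Scanning the whole table for the maximum on every merge is the simplest choice but not the fastest; the point of the sketch is only the in-place update of the similarity entries.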