
Clustering 10/9/2002

Idea and Applications
Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
– It is also called unsupervised learning.
– It is a common and important task that finds many applications.
Applications in search engines:
– Structuring search results
– Suggesting related pages
– Automatic directory construction/update
– Finding near-identical/duplicate pages

When & From What
Clustering can be done at:
– Indexing time
– Query time
  – Applied to documents
  – Applied to snippets
Clustering can be based on:
– URL source
  – Put pages from the same server together
– Text content
  – Polysemy ("bat", "banks")
  – Multiple aspects of a single topic
– Links
  – Look at the connected components in the link graph (A/H analysis can do it)

Concepts in Clustering
– Defining distance between points
  – Cosine distance (which you already know)
  – Overlap distance
– A good clustering is one where
  – (Intra-cluster distance) the sum of distances between objects in the same cluster is minimized, and
  – (Inter-cluster distance) the distances between different clusters are maximized
  – Objective to minimize: F(Intra, Inter)
– Clusters can be evaluated with "internal" as well as "external" measures
  – Internal measures are related to the inter/intra-cluster distances
  – External measures are related to how representative the current clusters are of the "true" classes
  – See entropy and F-measure in [Steinbach et al.]
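
As a quick illustration of the two distance functions named above, here is a minimal sketch in plain Python (the function names are mine, and "overlap distance" is taken here to mean one minus the standard overlap coefficient):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def overlap_distance(a, b):
    """1 - |A ∩ B| / min(|A|, |B|), treating the vectors as term sets."""
    terms_a = {i for i, x in enumerate(a) if x > 0}
    terms_b = {i for i, x in enumerate(b) if x > 0}
    if not terms_a or not terms_b:
        return 1.0
    return 1.0 - len(terms_a & terms_b) / min(len(terms_a), len(terms_b))

print(cosine_distance([1, 2, 0], [2, 4, 0]))   # 0.0 (same direction)
print(overlap_distance([1, 2, 0], [0, 3, 1]))  # 0.5
```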

Inter/Intra Cluster Distances
Intra-cluster distance: (Sum/Min/Max/Avg of) the (absolute/squared) distance between
– all pairs of points in the cluster, OR
– the centroid and all points in the cluster, OR
– the "medoid" and all points in the cluster
Inter-cluster distance: sum the (squared) distance between all pairs of clusters, where the distance between two clusters is defined as
– the distance between their centroids/medoids (spherical clusters)
– the distance between the closest pair of points belonging to the clusters (chain-shaped clusters)
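
To make these definitions concrete, a minimal sketch in plain Python (names are mine) of a centroid-based intra-cluster distance and two of the inter-cluster distances listed above:

```python
def centroid(cluster):
    """Component-wise mean of a list of points (each point a tuple of numbers)."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def intra_cluster(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sq_dist(p, c) for p in cluster)

def inter_centroid(c1, c2):
    """Distance between the two clusters' centroids (spherical clusters)."""
    return sq_dist(centroid(c1), centroid(c2)) ** 0.5

def inter_single_link(c1, c2):
    """Distance between the closest pair of points (chain-shaped clusters)."""
    return min(sq_dist(p, q) for p in c1 for q in c2) ** 0.5

a = [(1.0,), (2.0,)]
b = [(5.0,), (6.0,), (7.0,)]
print(intra_cluster(a), intra_cluster(b))             # 0.5 2.0
print(inter_centroid(a, b), inter_single_link(a, b))  # 4.5 3.0
```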

Lecture of 10/14

How hard is clustering?
One idea is to consider all possible clusterings and pick the one that has the best inter- and intra-cluster distance properties.
Suppose we are given n points and would like to cluster them into k clusters.
– How many possible clusterings? The Stirling number of the second kind S(n, k), roughly k^n/k!; far too many to search by brute force or optimally.
Solution: iterative optimization algorithms
– Start with a clustering and iteratively improve it (e.g., K-means)
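
To see how quickly this count blows up, a small sketch that computes it exactly via the standard Stirling-number recurrence S(n, k) = k·S(n-1, k) + S(n-1, k-1):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n labeled points into k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(5, 2))    # 15
print(stirling2(20, 4))   # 45232115901 -- already infeasible to enumerate
```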

Classical clustering methods
– Partitioning methods: k-Means (and EM), k-Medoids
– Hierarchical methods: agglomerative, divisive, BIRCH
– Model-based clustering methods

K-means
Works when we know k, the number of clusters we want to find.
Idea:
– Randomly pick k points as the "centroids" of the k clusters
– Loop:
  – For each point, put the point in the cluster to whose centroid it is closest
  – Recompute the cluster centroids
  – Repeat the loop (until there is no change in clusters between two consecutive iterations)
This is iterative improvement of the objective function: the sum of the squared distance from each point to the centroid of its cluster.
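
A minimal sketch of this loop in plain Python (function and variable names are mine); points are tuples of numbers, and the loop stops when the centroids no longer change:

```python
import random

def kmeans(points, k, seeds=None, max_iters=100):
    """Plain K-means: returns a list of k clusters (lists of points)."""
    centroids = list(seeds) if seeds else random.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: put each point in the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(p[d] for p in c) / len(c) for d in range(len(points[0])))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:   # no change => converged
            break
        centroids = new_centroids
    return clusters

# The 1-D example from the next slide: objects 1, 2, 5, 6, 7 with seeds 5 and 6.
print(kmeans([(1,), (2,), (5,), (6,), (7,)], k=2, seeds=[(5,), (6,)]))
# [[(1,), (2,)], [(5,), (6,), (7,)]]
```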

K-means Example
For simplicity, 1-dimensional objects and k = 2; numerical difference is used as the distance.
Objects: 1, 2, 5, 6, 7
K-means:
– Randomly select 5 and 6 as centroids
– => two clusters {1,2,5} and {6,7}; meanC1 = 8/3, meanC2 = 6.5
– => {1,2}, {5,6,7}; meanC1 = 1.5, meanC2 = 6
– => no change
Aggregate dissimilarity (the intra-cluster distance: sum of squared distances of each point from its cluster center)
  = |1-1.5|² + |2-1.5|² + |5-6|² + |6-6|² + |7-6|² = 0.25 + 0.25 + 1 + 0 + 1 = 2.5
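
A quick check of that aggregate dissimilarity, plugging in the final clusters from the example:

```python
clusters = {1.5: [1, 2], 6.0: [5, 6, 7]}   # centroid -> members
sse = sum((x - c) ** 2 for c, members in clusters.items() for x in members)
print(sse)   # 2.5
```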

K-means Example (k=2), figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged. [From Mooney]

Example of K-means in operation (figure). [From Hand et al.]

Time Complexity
Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
– Reassigning clusters: O(kn) distance computations, i.e. O(knm).
– Computing centroids: each instance vector gets added once to some centroid: O(nm).
– Assume these two steps are each done once for I iterations: O(Iknm).
Linear in all relevant factors, assuming a fixed number of iterations; more efficient than the O(n²) of HAC (to come next).

Problems with K-means
– Need to know k in advance
  – Could try out several k? Unfortunately, cluster tightness increases with increasing k; the best intra-cluster tightness occurs when k = n (every point in its own cluster).
– Tends to go to local minima that are sensitive to the starting centroids
  – Try out multiple starting points
– Disjoint and exhaustive
  – Doesn't have a notion of "outliers"; the outlier problem can be handled by K-medoid or neighborhood-based algorithms
– Assumes clusters are spherical in vector space
  – Sensitive to coordinate changes, weighting, etc.
Example showing sensitivity to seeds (figure): if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.

Variations on K-means
– Recompute the centroid after every change (or every few changes), rather than after all the points are reassigned
  – Improves convergence speed
– Starting centroids (seeds) change which local minimum we converge to, as well as the rate of convergence
  – Use heuristics to pick good seeds (can use another cheap clustering over a random sample)
  – Run K-means M times and pick the best clustering that results, i.e. the one with the lowest aggregate dissimilarity (intra-cluster distance); a sketch of this restart idea follows below
  – Bisecting K-means takes this idea further…
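
A minimal sketch of the restart heuristic, assuming scikit-learn is available; each run uses a single random initialization, and we keep the run with the lowest aggregate dissimilarity (KMeans exposes it as inertia_):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])   # the 1-D example again

best = None
for seed in range(10):                    # M = 10 random restarts
    km = KMeans(n_clusters=2, init="random", n_init=1, random_state=seed).fit(X)
    # km.inertia_ is the sum of squared distances to the closest centroid.
    if best is None or km.inertia_ < best.inertia_:
        best = km

print(best.inertia_)                      # 2.5 for this data
print(best.labels_)                       # e.g. [0 0 1 1 1]
```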

Bisecting K-means (hybrid method 1: a divisive hierarchical clustering method that uses K-means)
For i = 1 to k-1 do {
  Pick a leaf cluster C to split (can pick the largest cluster, or the cluster with the lowest average similarity)
  For j = 1 to ITER do {
    Use K-means to split C into two sub-clusters, C1 and C2
  }
  Choose the best of the above splits and make it permanent
}
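
A minimal sketch of bisecting K-means, assuming scikit-learn (function and variable names are mine). It repeatedly splits the leaf cluster with the largest SSE using 2-means, keeping the best of several trial splits:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, iters=5):
    """Split the worst leaf cluster with 2-means until we have k clusters."""
    clusters = [X]                                    # start with one big cluster
    while len(clusters) < k:
        # Pick the leaf cluster with the largest SSE around its own centroid.
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        C = clusters.pop(int(np.argmax(sses)))
        # Try several 2-means splits and keep the one with the lowest SSE.
        best_labels, best_sse = None, np.inf
        for trial in range(iters):
            km = KMeans(n_clusters=2, n_init=1, random_state=trial).fit(C)
            if km.inertia_ < best_sse:
                best_sse, best_labels = km.inertia_, km.labels_
        clusters.append(C[best_labels == 0])
        clusters.append(C[best_labels == 1])
    return clusters

X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0], [20.0], [21.0]])
for c in bisecting_kmeans(X, k=3):
    print(c.ravel())     # typically [20. 21.], [1. 2.], [5. 6. 7.] (in some order)
```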

Class of 16th October
Midterm on October 23rd, in class.

Hierarchical Clustering Techniques
Generate a nested (multi-resolution) sequence of clusters, summarized by a "dendrogram".
Two types of algorithms:
– Divisive: start with one cluster and recursively subdivide (Bisecting K-means is an example!)
– Agglomerative (HAC): start with data points as single-point clusters, and recursively merge the closest clusters

Hierarchical Agglomerative Clustering Example
Algorithm:
  Put every point in a cluster by itself.
  For i = 1 to N-1 do {
    let C1 and C2 be the most mergeable pair of clusters
    create C1,2 as parent of C1 and C2
  }
Example: for simplicity, we still use 1-dimensional objects; numerical difference is used as the distance.
Objects: 1, 2, 5, 6, 7
Agglomerative clustering:
– find the two closest objects and merge them;
– => {1,2}, so we now have {1.5, 5, 6, 7};
– => {1,2}, {5,6}, so {1.5, 5.5, 7};
– => {1,2}, {{5,6},7}
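
A naive sketch of this loop in plain Python (names are mine), using centroid distance as the mergeability criterion; it reproduces the merge order of the 1-D example above:

```python
def hac_example(points):
    """Naive agglomerative clustering on 1-D points using centroid distance.

    Each cluster is a (centroid, members) pair; returns the list of merges in order.
    """
    clusters = [(float(p), [p]) for p in points]
    merges = []
    while len(clusters) > 1:
        # Most mergeable pair = the two clusters whose centroids are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(clusters[i][0] - clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        members = sorted(clusters[i][1] + clusters[j][1])
        merged = (sum(members) / len(members), members)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        merges.append(members)
    return merges

print(hac_example([1, 2, 5, 6, 7]))
# [[1, 2], [5, 6], [5, 6, 7], [1, 2, 5, 6, 7]]
```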

Single Link Example

Properties of HAC
– Creates a complete binary tree ("dendrogram") of clusters
– Various ways to determine mergeability:
  – "Single-link": distance between closest neighbors
  – "Complete-link": distance between farthest neighbors
  – "Group-average": average distance between all pairs of neighbors
  – "Centroid distance": distance between centroids (the most common measure)
– Deterministic (modulo tie-breaking)
– Runs in O(N²) time
– People used to say this is better than K-means, but the Steinbach paper says K-means and bisecting K-means are actually better
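
The four mergeability criteria can be written down directly; a minimal sketch in plain Python (function names are mine), treating clusters as lists of 1-D points:

```python
def dist(p, q):
    return abs(p - q)

def single_link(c1, c2):
    """Distance between the closest pair of points across the two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """Distance between the farthest pair of points across the two clusters."""
    return max(dist(p, q) for p in c1 for q in c2)

def group_average(c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_distance(c1, c2):
    """Distance between the two cluster centroids."""
    return dist(sum(c1) / len(c1), sum(c2) / len(c2))

a, b = [1, 2], [5, 6, 7]
print(single_link(a, b), complete_link(a, b),
      group_average(a, b), centroid_distance(a, b))   # 3 6 4.5 4.5
```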

Impact of cluster distance measures (figure comparing the two):
– "Single-Link": inter-cluster distance = distance between the closest pair of points
– "Complete-Link": inter-cluster distance = distance between the farthest pair of points
[From Mooney]

Complete Link Example

Bisecting K-means (hybrid method 1: a divisive hierarchical clustering method that uses K-means)
For i = 1 to k-1 do {
  Pick a leaf cluster C to split (can pick the largest cluster, or the cluster with the lowest average similarity)
  For j = 1 to ITER do {
    Use K-means to split C into two sub-clusters, C1 and C2
  }
  Choose the best of the above splits and make it permanent
}

Buckshot Algorithm (hybrid method 2: uses HAC to bootstrap K-means)
– Combines HAC and K-means clustering.
– First randomly take a sample of instances of size √n.
– Run group-average HAC on this sample (which takes only O(n) time), and cut the dendrogram where you have k clusters.
– Use the results of HAC as initial seeds for K-means.
– The overall algorithm is O(n) and avoids the problems of bad seed selection.
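
A minimal Buckshot-style sketch, assuming scipy and scikit-learn are available (the function and the toy data are mine): group-average HAC on a √n sample, cut at k clusters, and use the sample-cluster centroids as K-means seeds.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def buckshot(X, k, random_state=0):
    """Group-average HAC on a sqrt(n) sample seeds K-means over all of X."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    sample = X[rng.choice(n, size=max(k, int(np.sqrt(n))), replace=False)]
    # Group-average HAC on the sample, cut where there are k clusters.
    Z = linkage(sample, method="average")
    labels = fcluster(Z, t=k, criterion="maxclust")
    # Use the k sample-cluster centroids as seeds for K-means.
    seeds = np.array([sample[labels == c].mean(axis=0) for c in range(1, k + 1)])
    return KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)

X = np.random.default_rng(1).normal(size=(400, 2)) + np.repeat([[0, 0], [8, 8]], 200, axis=0)
km = buckshot(X, k=2)
print(km.cluster_centers_)   # roughly [[0, 0], [8, 8]] (in some order)
```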

Text Clustering
– HAC and K-means have been applied to text in a straightforward way.
– Typically use normalized, TF/IDF-weighted vectors and cosine similarity.
– Optimize computations for sparse vectors.
Applications:
– During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall.
– Clustering the results of retrieval to present more organized results to the user (à la Northernlight folders).
– Automated production of hierarchical taxonomies of documents for browsing purposes (à la Yahoo & DMOZ).
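
A minimal text-clustering sketch along those lines, assuming scikit-learn (the toy documents are mine). TfidfVectorizer produces L2-normalized sparse TF/IDF vectors, so Euclidean K-means on them is closely related to clustering by cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the keyboard is an input device for the computer",
    "wireless keyboard and mouse for the computer",
    "the piano keyboard is a musical instrument",
    "a musical keyboard like a piano has 88 keys",
]

# L2-normalized TF/IDF vectors, kept sparse.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # e.g. [0 0 1 1]: computer-keyboard docs vs music-keyboard docs
```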

Which of these is best for text?
Bisecting K-means and K-means seem to do better than agglomerative clustering techniques on text document data [Steinbach et al.]
– "Better" is defined in terms of cluster quality
Quality measures:
– Internal: overall similarity
– External: check how good the clusters are w.r.t. user-defined notions of clusters
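
The "external" evaluation can be made concrete with a small sketch in plain Python (names are mine) of the entropy measure mentioned earlier: the class-distribution entropy inside each cluster, weighted by cluster size (lower is better):

```python
import math
from collections import Counter

def cluster_entropy(labels, classes):
    """Weighted average entropy of the true-class distribution inside each cluster."""
    n = len(labels)
    total = 0.0
    for cluster in set(labels):
        members = [c for l, c in zip(labels, classes) if l == cluster]
        counts = Counter(members)
        h = -sum((v / len(members)) * math.log2(v / len(members)) for v in counts.values())
        total += len(members) / n * h
    return total

# Two clusters vs. "true" classes: cluster 0 is pure, cluster 1 is mixed.
labels  = [0, 0, 0, 1, 1, 1, 1]
classes = ["a", "a", "a", "b", "b", "b", "a"]
print(round(cluster_entropy(labels, classes), 3))   # 0.464
```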

Challenges/Other Ideas
High dimensionality
– Most vectors in high-D spaces will be nearly orthogonal
– Do LSI analysis first, project the data onto the most important m dimensions, and then do clustering (e.g. Manjara); see the sketch after this list
Phrase analysis
– Sharing of phrases may be more indicative of similarity than sharing of words (for the full Web, phrasal analysis was too costly, so we went with vector similarity; but for the top 100 results of a query, it is possible to do phrasal analysis)
– Suffix-tree analysis
– Shingle analysis
Using link structure in clustering
– A/H-analysis-based idea of connected components
– Co-citation analysis (sort of the idea used in Amazon's collaborative filtering)
Scalability
– More important for "global" clustering
– Can't do more than one pass; limited memory
– See the paper "Scalable techniques for clustering the web"
– Locality-sensitive hashing is used to make similar documents collide into the same buckets
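
A minimal sketch of the LSI-then-cluster idea above, assuming scikit-learn (TruncatedSVD is its LSA/LSI implementation; the tiny corpus and the choice of m are mine):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

docs = [
    "search engine results are clustered into folders",
    "clustering of web search results for the user",
    "the midterm exam covers k-means and hierarchical clustering",
    "lecture notes on hierarchical clustering for the exam",
]

# TF/IDF -> project onto the top m latent dimensions (LSI) -> renormalize.
m = 2
lsa = make_pipeline(TfidfVectorizer(stop_words="english"),
                    TruncatedSVD(n_components=m, random_state=0),
                    Normalizer(copy=False))
X_lsa = lsa.fit_transform(docs)          # shape: (4 docs, m dimensions)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_lsa)
print(km.labels_)                        # e.g. [0 0 1 1]
```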

Phrase-analysis based similarity (using suffix trees)

Other (general clustering) challenges
Dealing with noise (outliers)
– "Neighborhood" methods: "an outlier is one that has fewer than a threshold number of points within a threshold distance" (both thresholds pre-specified)
– Need efficient data structures (e.g. R-trees) for keeping track of neighborhoods
Dealing with different types of attributes
– Hard to define distance over categorical attributes
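
A minimal sketch of that neighborhood-based outlier test in plain Python (the threshold names eps and min_pts are mine; a real implementation would use a spatial index such as an R-tree instead of this brute-force scan):

```python
def neighborhood_outliers(points, eps, min_pts):
    """Flag points that have fewer than min_pts other points within eps distance."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points) if j != i and dist(p, q) <= eps)
        if neighbors < min_pts:
            outliers.append(p)
    return outliers

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
print(neighborhood_outliers(pts, eps=2.0, min_pts=2))   # [(10, 10)]
```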