Basic Machine Learning: Clustering
CS 315 – Web Search and Data Mining

Supervised vs. Unsupervised Learning
Two fundamental methods in machine learning:
Supervised Learning ("learn from my example")
  Goal: a program that performs a task as well as humans.
  TASK – well defined (the target function)
  EXPERIENCE – training data provided by a human
  PERFORMANCE – error/accuracy on the task
Unsupervised Learning ("see what you can find")
  Goal: to find some kind of structure in the data.
  TASK – vaguely defined
  No EXPERIENCE
  No PERFORMANCE (but there are some evaluation metrics)

What is Clustering?
The most common form of unsupervised learning.
Clustering is the process of grouping a set of physical or abstract objects into classes ("clusters") of similar objects.
It can be used in IR:
  To improve recall in search
  For better navigation of search results

Ex1: Cluster to Improve Recall
Cluster hypothesis: documents with similar text are related.
Thus, when a query matches a document D, also return the other documents in the cluster containing D (see the sketch below).
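A minimal sketch of this expansion step, assuming a `doc_to_cluster` mapping and a `clusters` table produced by some earlier clustering run (both names are hypothetical):

```python
def expand_with_cluster(matched_docs, doc_to_cluster, clusters):
    """Return the matched documents plus all other members of their clusters."""
    expanded = set(matched_docs)
    for d in matched_docs:
        expanded.update(clusters[doc_to_cluster[d]])  # add cluster-mates of d
    return expanded

# Usage: a query matched only doc 3; its cluster also contains docs 7 and 9.
clusters = {0: {1, 2}, 1: {3, 7, 9}}
doc_to_cluster = {1: 0, 2: 0, 3: 1, 7: 1, 9: 1}
print(expand_with_cluster({3}, doc_to_cluster, clusters))  # {3, 7, 9}
```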

Ex2: Cluster for Better Navigation
[figure: search results grouped into labeled clusters for navigation]

Clustering Characteristics
Flat clustering vs. hierarchical clustering:
  Flat: just divide the objects into groups (clusters)
  Hierarchical: organize the clusters in a hierarchy
Evaluating clustering (see the sketch below):
  Internal criteria
    The intra-cluster similarity is high (tightness)
    The inter-cluster similarity is low (separateness)
  External criteria
    Did we discover the hidden classes? (we need gold-standard data for this evaluation)
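As one concrete external criterion, a sketch of purity (a standard choice, though the slide does not name a specific metric), which scores a clustering against gold-standard class labels:

```python
from collections import Counter

def purity(clusters, gold_labels):
    """Fraction of items whose cluster's majority gold label matches their own.
    clusters: list of lists of item ids; gold_labels: id -> class name."""
    correct = 0
    total = 0
    for cluster in clusters:
        counts = Counter(gold_labels[x] for x in cluster)
        correct += counts.most_common(1)[0][1]  # size of the majority class
        total += len(cluster)
    return correct / total

# Usage: two clusters scored against gold classes "a" and "b".
print(purity([[1, 2, 3], [4, 5]], {1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}))  # 0.8
```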

Clustering for Web IR
Representation for clustering:
  Document representation
  Need a notion of similarity/distance
How many clusters?
  Fixed a priori?
  Completely data-driven?
Avoid "trivial" clusters - too large or too small

Recall: Documents as vectors
Each doc j is a vector of tf.idf values, one component for each term.
Can normalize vectors to unit length.
The vector space: terms are axes (aka features); N docs live in this space.
Even with stemming, the space may have 20,000+ dimensions.
What makes documents related? (A sketch of the representation follows.)
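A minimal sketch of building such vectors, assuming the common tf × log(N/df) weighting (the slide does not fix a particular tf.idf variant):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per doc."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                # term frequency within this doc
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Usage: three tiny "documents" as token lists.
docs = [["piano", "keyboard"], ["computer", "keyboard"], ["piano", "music"]]
print(tfidf_vectors(docs)[0])
```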

Intuition for relatedness
[figure: documents D1-D4 plotted in a two-term vector space with axes t1 and t2]
Documents that are "close together" in vector space talk about the same things.

What makes documents related?
Ideal: semantic similarity. Practical: statistical similarity.
We will use cosine similarity, and describe the algorithms in terms of it.
Cosine similarity is known as the "normalized inner product".
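The normalized inner product is cos(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|). A small sketch over the sparse term-weight dicts from the previous slide:

```python
import math

def cosine(v1, v2):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine({"piano": 1.0, "keyboard": 0.4}, {"computer": 1.0, "keyboard": 0.4}))
```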

Clustering Algorithms
Hierarchical algorithms:
  Bottom-up, agglomerative clustering
Partitioning ("flat") algorithms:
  Usually start with a random (partial) partitioning
  Refine it iteratively
The famous k-means partitioning algorithm:
  Given: a set of n documents and the number k
  Compute: a partition into k clusters that optimizes the chosen partitioning criterion

K-means
Assumes documents are real-valued vectors.
Clusters are based on the centroid of the points in a cluster c (= the center of gravity or mean):
  \mu(c) = \frac{1}{|c|} \sum_{x \in c} x
Reassignment of instances to clusters is based on distance to the current cluster centroids.

K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s_1, s_2, ..., s_k} as seeds.
Until clustering converges (or another stopping criterion is met):
  For each instance x_i:
    Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.
  Update the seeds to the centroid of each cluster:
    For each cluster c_j: s_j = \mu(c_j)
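A minimal runnable sketch of this loop, assuming dense vectors as Python lists and squared Euclidean distance as d (a cosine-based distance would slot in the same way):

```python
import random

def kmeans(docs, k, max_iters=100):
    """Minimal k-means sketch. docs: list of equal-length float lists."""
    def d(x, s):  # squared Euclidean distance between an instance and a seed
        return sum((a - b) ** 2 for a, b in zip(x, s))

    seeds = [list(x) for x in random.sample(docs, k)]  # k random instances
    for _ in range(max_iters):
        # Assignment step: each instance joins the cluster of its nearest seed.
        clusters = [[] for _ in range(k)]
        for x in docs:
            j = min(range(k), key=lambda j: d(x, seeds[j]))
            clusters[j].append(x)
        # Update step: move each seed to the centroid of its cluster.
        new_seeds = [
            [sum(col) / len(c) for col in zip(*c)] if c else seeds[j]
            for j, c in enumerate(clusters)
        ]
        if new_seeds == seeds:  # converged: no centroid moved
            break
        seeds = new_seeds
    return clusters, seeds
```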

K-means: Different Issues
When to stop?
  When a fixed number of iterations is reached
  When centroid positions no longer change
Seed choice:
  Results can vary based on random seed selection.
  Try out multiple starting points (see the sketch below).
[figure: six points A-F; starting with centroids B and E converges to one clustering, while starting with D and F converges to a different one]
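A hedged sketch of the multiple-restart heuristic, reusing the kmeans function above and scoring each run by within-cluster sum of squared distances (one reasonable criterion; the slide does not prescribe one):

```python
def best_of_restarts(docs, k, restarts=10):
    """Run kmeans several times and keep the run with the tightest clusters."""
    def cost(clusters, seeds):  # within-cluster sum of squared distances
        return sum(
            sum((a - b) ** 2 for a, b in zip(x, s))
            for c, s in zip(clusters, seeds) for x in c
        )
    runs = [kmeans(docs, k) for _ in range(restarts)]
    return min(runs, key=lambda run: cost(*run))
```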

Hierarchical clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
[figure: example dendrogram - animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)]

Hierarchical Agglomerative Clustering (HAC)
We assume there is a similarity function that determines the similarity of two instances.
Algorithm:
  Start with all instances in their own cluster.
  Until there is only one cluster:
    Among the current clusters, determine the two clusters, c_i and c_j, that are most similar.
    Replace c_i and c_j with a single cluster c_i ∪ c_j.
A runnable sketch follows.
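A minimal sketch of this loop, assuming `sim` is any function scoring two clusters (the next slide lists the usual choices); the naive pairwise search is kept for clarity rather than speed:

```python
def hac(items, sim):
    """Agglomerative clustering: merge the most similar pair until one cluster
    remains. Returns the merge history as a list of (cluster, cluster) pairs."""
    clusters = [[x] for x in items]       # start with singleton clusters
    merges = []
    while len(clusters) > 1:
        # Find the most similar pair among the current clusters.
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: sim(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]  # replace c_i and c_j with their union
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges
```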

What is the most similar cluster?
Single-link: similarity of the closest pair (the most cosine-similar members)
Complete-link: similarity of the "furthest" pair (the least cosine-similar members)
Group-average agglomerative clustering: average cosine between pairs of elements
Centroid clustering: similarity of the clusters' centroids

Single link clustering
1) Use the maximum similarity of pairs:
   sim(c_i, c_j) = \max_{x \in c_i, y \in c_j} sim(x, y)
2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:
   sim(c_i \cup c_j, c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))

Complete link clustering
1) Use the minimum similarity of pairs:
   sim(c_i, c_j) = \min_{x \in c_i, y \in c_j} sim(x, y)
2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:
   sim(c_i \cup c_j, c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))
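Both criteria written as cluster-similarity functions that plug into the hac sketch above, assuming cosine from the earlier slide as the instance similarity:

```python
def single_link(c1, c2):
    """Similarity of the two most similar members (maximum over pairs)."""
    return max(cosine(x, y) for x in c1 for y in c2)

def complete_link(c1, c2):
    """Similarity of the two least similar members (minimum over pairs)."""
    return min(cosine(x, y) for x in c1 for y in c2)

# Usage: merges = hac(list_of_tfidf_dicts, sim=single_link)
```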

Major issue - labeling
After the clustering algorithm finds clusters, how can they be useful to the end user?
We need a concise label for each cluster:
  In search results, say "Animal" or "Car" in the jaguar example.
  In topic trees (Yahoo), we need navigational cues.
Labeling is often done by hand, a posteriori.

How to Label Clusters
Show titles of typical documents:
  Titles are easy to scan
  Authors create them for quick scanning!
  But you can only show a few titles, which may not fully represent the cluster
Show words/phrases prominent in the cluster (see the sketch below):
  More likely to fully represent the cluster
  Use distinguishing words/phrases
  But harder to scan
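A hedged sketch of picking distinguishing label terms by comparing a term's in-cluster frequency to its frequency in the whole collection (one simple scoring choice; the slide does not fix one):

```python
from collections import Counter

def cluster_labels(cluster_docs, all_docs, top_n=5):
    """cluster_docs/all_docs: lists of token lists. Returns label candidates."""
    in_cluster = Counter(t for doc in cluster_docs for t in doc)
    overall = Counter(t for doc in all_docs for t in doc)
    # Weight a term by its cluster frequency times how concentrated it is here.
    score = {t: in_cluster[t] * (in_cluster[t] / overall[t]) for t in in_cluster}
    return [t for t, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_n]]
```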

Further issues
Complexity: clustering is computationally expensive; implementations need careful balancing of needs.
How to decide how many clusters are best?
Evaluating the "goodness" of clustering: there are many techniques, some focusing on implementation issues (complexity/time), some on the quality of the resulting clusters.