Unsupervised learning


Unsupervised learning David Kauchak CS 451 – Fall 2013

Administrative Final project No office hours today

Supervised learning Supervised learning: given labeled examples (example/label pairs), learn a model/predictor that predicts the label for a new example.

Unsupervised learning Unsupervised learning: given data, i.e. examples, but no labels.

Unsupervised learning Given some examples without labels, do something!

Unsupervised learning applications Learn clusters/groups without any labels: customer segmentation (i.e. grouping), image compression, bioinformatics (learn motifs), find important features, …

Unsupervised learning: clustering Raw data → extract features (f1, f2, f3, …, fn) → group into classes/clusters. No "supervision": we're only given data and want to find natural groupings.

Unsupervised learning: modeling Most frequently, when people think of unsupervised learning, they think of clustering. Another category: learning probabilities/parameters for models without supervision, e.g. learn a translation dictionary, learn a grammar for a language, learn the social graph.

Clustering Clustering: the process of grouping a set of objects into classes of similar objects. Applications?

Gene expression data Data from Garber et al. PNAS (98), 2001.

Face clustering

Search result clustering

Google News

Clustering in search advertising Find clusters of advertisers and keywords: keyword suggestion, performance estimation. [Figure: bipartite graph of advertisers and bidded keywords, edges = bids; ~10M nodes]

Clustering applications Find clusters of users (e.g. a who-messages-who IM/text/twitter graph, ~100M nodes): targeted advertising, exploratory analysis. Clusters of the web graph: distributed PageRank computation.

Data visualization Wise et al., "Visualizing the non-visual", PNNL ThemeScapes, Cartia. [Mountain height = cluster size]

A data set with clear cluster structure What are some of the issues for clustering? What clustering algorithms have you seen/used?

Issues for clustering Representation for clustering: how do we represent an example (features, etc.)? Similarity/distance between examples. Flat clustering or hierarchical? Number of clusters: fixed a priori, or data driven?

Clustering Algorithms Flat algorithms usually start with a random (partial) partitioning and refine it iteratively: k-means clustering, model-based clustering, spectral clustering. Hierarchical algorithms: bottom-up (agglomerative) or top-down (divisive).

Hard vs. soft clustering Hard clustering: each example belongs to exactly one cluster. Soft clustering: an example can belong to more than one cluster (probabilistic). Soft clustering makes more sense for applications like creating browsable hierarchies: you may want to put a pair of sneakers in two clusters, (i) sports apparel and (ii) shoes.

K-means The most well-known and popular clustering algorithm: start with some initial cluster centers, then iterate: assign/cluster each example to the closest center; recalculate centers as the mean of the points in a cluster.
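Putting the whole loop into code, a minimal NumPy sketch (all names are mine, not from the lecture; it assumes Euclidean distance and that no cluster ever goes empty):

```python
import numpy as np

def kmeans(points, k, num_iters=100, seed=0):
    """Basic k-means. points: (n, d) array; k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Start with some initial cluster centers (here: k distinct random examples).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(num_iters):
        # Assign each example to its closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Recalculate each center as the mean of the points assigned to it.
        new_centers = np.array([points[assignments == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, assignments
```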

K-means: an example

K-means: Initialize centers randomly

K-means: assign points to nearest center

K-means: readjust centers

K-means: assign points to nearest center

K-means: readjust centers

K-means: assign points to nearest center

K-means: readjust centers

K-means: assign points to nearest center No changes: Done

K-means Iterate: assign/cluster each example to closest center; recalculate centers as the mean of the points in a cluster. How do we do this?

K-means Iterate: assign/cluster each example to closest center; recalculate centers as the mean of the points in a cluster. For the assignment step:
iterate over each point:
  - get distance to each cluster center
  - assign to closest center (hard cluster)
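That assignment step written out in plain Python (a sketch; `distance` stands for whichever measure we choose below):

```python
def assign_to_clusters(points, centers, distance):
    """For each point: compute the distance to every cluster center and
    assign the point to the closest one (a hard clustering)."""
    assignments = []
    for x in points:
        dists = [distance(x, c) for c in centers]
        assignments.append(dists.index(min(dists)))
    return assignments
```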

K-means Iterate: assign/cluster each example to closest center; recalculate centers as the mean of the points in a cluster. What distance measure should we use?

Distance measures Euclidean: good for spatial data
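For reference, the Euclidean distance between two n-dimensional examples x and y (the formula appeared as an image on the original slide):

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$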

Clustering documents (e.g. wine data) One feature for each word. The value is the number of times that word occurs. Documents are points or vectors in this space

When Euclidean distance doesn't work Which document is closest to q using Euclidean distance? Which do you think should be closer?

Issues with Euclidean distance The Euclidean distance between q and d2 is large, but the distribution of terms in the query q and the distribution of terms in the document d2 are very similar. This is not what we want!

cosine similarity correlated with the angle between two vectors
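The formula (also an image in the original slide): cosine similarity is the normalized dot product,

$$\operatorname{sim}(x, y) = \cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$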

cosine distance Cosine similarity is a similarity between 0 and 1, with similar things near 1 and dissimilar things near 0. We want a distance measure: cosine distance = 1 − cosine similarity. It is good for text data and many other "real world" data sets, and is computationally friendly since we only need to consider features that have non-zero values in both examples.
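A sketch of cosine distance over sparse word-count dictionaries (the representation is my choice, not the lecture's); it shows why only features that are non-zero in both examples matter:

```python
import math

def cosine_distance(a, b):
    """a, b: sparse documents as dicts mapping word -> count.
    Returns 1 - cosine similarity (a value in [0, 1] for count vectors)."""
    # Only words with non-zero counts in BOTH documents contribute to the dot product.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / (norm_a * norm_b)
```

For example, `cosine_distance({"the": 2, "wine": 1}, {"the": 1, "beer": 3})` touches only the shared word "the" when computing the dot product.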

K-means Iterate: assign/cluster each example to closest center; recalculate centers as the mean of the points in a cluster. Where are the cluster centers?

K-means Iterate: assign/cluster each example to closest center; recalculate centers as the mean of the points in a cluster. How do we calculate these?

K-means Iterate: assign/cluster each example to closest center; recalculate centers as the mean of the points in a cluster. Mean of the points in the cluster: $\mu(C) = \frac{1}{|C|} \sum_{x \in C} x$, where $|C|$ is the number of points in cluster $C$.
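The center update in code (a NumPy sketch; assumes every cluster keeps at least one point):

```python
import numpy as np

def recalculate_centers(points, assignments, k):
    """Center j = mean of the points currently assigned to cluster j."""
    assignments = np.asarray(assignments)
    return np.array([points[assignments == j].mean(axis=0) for j in range(k)])
```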

K-means loss function K-means tries to minimize what is called the "k-means" loss function: $\text{loss} = \sum_{i=1}^{n} d(x_i, \mu_{k(i)})^2$, that is, the sum of the squared distances from each point $x_i$ to its associated cluster center $\mu_{k(i)}$.
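And the loss computed directly, using squared Euclidean distance (a sketch; any distance d could be plugged in):

```python
import numpy as np

def kmeans_loss(points, centers, assignments):
    """Sum of squared distances from each point to its assigned center."""
    diffs = points - centers[np.asarray(assignments)]
    return float((diffs ** 2).sum())
```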

Minimizing k-means loss Iterate: 1. Assign/cluster each example to closest center 2. Recalculate centers as the mean of the points in a cluster Does each step of k-means move towards reducing this loss function (or at least not increasing it)?

Minimizing k-means loss Iterate: 1. Assign/cluster each example to closest center 2. Recalculate centers as the mean of the points in a cluster This isn't quite a complete proof/argument, but: (1) any other assignment would end up with a larger loss, so the assignment step can only decrease (or keep) the loss; (2) the mean of a set of values minimizes the squared error, so the recalculation step can only decrease it too.

Minimizing k-means loss Iterate: 1. Assign/cluster each example to closest center 2. Recalculate centers as the mean of the points in a cluster Does this mean that k-means will always find the minimum loss/clustering?

Minimizing k-means loss Iterate: 1. Assign/cluster each example to closest center 2. Recalculate centers as the mean of the points in a cluster NO! It will find a minimum. Unfortunately, the k-means loss function is generally not convex, and for most problems has many, many local minima. We're only guaranteed to find one of them.

K-means variations/parameters Start with some initial cluster centers Iterate: Assign/cluster each example to closest center Recalculate centers as the mean of the points in a cluster What are some other variations/parameters we haven’t specified?

K-means variations/parameters Initial (seed) cluster centers. Convergence criterion: a fixed number of iterations, partitions unchanged, or cluster centers don't change. K!

K-means: Initialize centers randomly What would happen here? Seed selection ideas?

Seed choice Results can vary drastically based on random seed selection. Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings. Common heuristics: random centers in the space; randomly pick examples; points least similar to any existing center (furthest centers heuristic); try out multiple starting points; initialize with the results of another clustering method.

Furthest centers heuristic
μ1 = pick random point
for i = 2 to K:
    μi = the point that is furthest from any previous center, i.e. the point x with the largest value of (the smallest distance from x to any previous center)
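A sketch of this heuristic in NumPy (function name and array representation are my own):

```python
import numpy as np

def furthest_centers(points, k, seed=0):
    """Pick a random first center, then repeatedly add the point whose
    smallest distance to any already-chosen center is largest."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]  # mu_1 = random point
    for _ in range(k - 1):
        # For every point, its smallest distance to any previous center...
        min_dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers],
                           axis=0)
        # ...and the next center is the point where that distance is largest.
        centers.append(points[min_dists.argmax()])
    return np.array(centers)
```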

K-means: Initialize furthest from centers Pick a random point for the first center

K-means: Initialize furthest from centers What point will be chosen next?

K-means: Initialize furthest from centers Furthest point from center What point will be chosen next?

K-means: Initialize furthest from centers Furthest point from center What point will be chosen next?

K-means: Initialize furthest from centers Furthest point from center Any issues/concerns with this approach?

Furthest points concerns If k = 4, which points will get chosen?

Furthest points concerns If we do a number of trials, will we get different centers? Yes and no: we will get different centers, but the three outliers will always get chosen!

Furthest points concerns Doesn’t deal well with outliers

K-means++
μ1 = pick random point
for k = 2 to K:
    for i = 1 to N:
        si = min d(xi, μ1…μk-1) // smallest distance to any existing center
    μk = randomly pick a point proportionate to s
How does this help?

K-means++ (same algorithm as above) Makes it possible to select points other than the single furthest one. If #points >> #outliers, we will pick good points. Makes it non-deterministic, which helps when trying multiple random runs. Nice theoretical guarantees!
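A NumPy sketch of this seeding procedure (function names are mine; note that the slide samples proportionate to s itself, while the original k-means++ paper of Arthur & Vassilvitskii samples proportionate to s²):

```python
import numpy as np

def kmeans_plus_plus(points, k, seed=0):
    """K-means++-style seeding as on the slide."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]  # mu_1 = random point
    for _ in range(k - 1):
        # s_i = smallest distance from point i to any existing center
        s = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        probs = s / s.sum()  # randomly pick proportionate to s
        centers.append(points[rng.choice(len(points), p=probs)])
    return np.array(centers)
```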