Switch to Top-down. Top-down or move-to-nearest: partition documents into k clusters. Two variants: "hard" (0/1) assignment of documents to clusters, and "soft" assignment, where documents belong to clusters with fractional scores. Termination: when the assignment of documents to clusters ceases to change much, OR when cluster centroids move negligibly over successive iterations.

How to Find a Good Clustering? Minimize the sum of distances within clusters, i.e., the total distance from each document to the centroid of the cluster it is assigned to. (Figure: example documents partitioned into clusters C1 through C6.)
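
Written out explicitly (a standard formulation; the slide itself only shows the figure), the objective is the within-cluster sum of squared distances:

```latex
% k clusters C_1..C_k with centroids \mu_1..\mu_k
\min_{C_1,\dots,C_k}\ \sum_{j=1}^{k} \sum_{d \in C_j} \lVert d - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{|C_j|} \sum_{d \in C_j} d
```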

How to Efficiently Cluster Data?

K-means for Clustering. K-means: start with a random guess of the cluster centers, determine the membership of each data point, and adjust the cluster centers.

K-means: 1. Ask the user how many clusters they'd like (e.g., k = 5). 2. Randomly guess k cluster center locations. 3. Each datapoint finds out which center it is closest to (thus each center "owns" a set of datapoints). 4. Each center finds the centroid of the points it owns and moves there. Repeat steps 3 and 4 until the assignments stop changing.
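
A minimal sketch of these steps in Python/NumPy (illustrative only; function and variable names such as `kmeans` and `X` are my own, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: X is an (n_docs, n_features) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster center locations (here: k random data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: each datapoint finds out which center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: each center moves to the centroid of the points it "owns".
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Terminate when the centroids move negligibly (cf. the termination slide above).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```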

Problem with K-means (Sensitive to the Initial Cluster Centroids)

So far in k-means we have used distance to measure similarity. Other similarity measures are possible, e.g., kernel functions.

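One concrete alternative for documents (not named on the slide, so treat this as my own illustration) is cosine similarity, itself a simple kernel on length-normalized vectors. A sketch of assignment by similarity rather than distance:

```python
import numpy as np

def assign_by_cosine(X, centers, eps=1e-12):
    """Assign each document (row of X) to the center with the highest cosine similarity."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    Cn = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + eps)
    return (Xn @ Cn.T).argmax(axis=1)   # most similar center, not nearest in Euclidean distance
```
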
Problem with K-means Binary cluster membership

Improvement: Soft Membership. Give each document a fractional (soft) membership in every cluster; a per-feature variance σ_l² indicates the importance of each feature l.
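
A sketch of soft (fractional) membership, assuming a Gaussian-style weighting with per-feature variances σ_l² as mentioned on this slide; the exact formula on the original slide is not shown, so treat this as one reasonable instantiation:

```python
import numpy as np

def soft_memberships(X, centers, feature_var):
    """Fractional membership of each document in each cluster.

    feature_var[l] plays the role of sigma_l^2: a large variance down-weights
    feature l, so it encodes how important each feature is.
    """
    # Feature-weighted squared distance from every document to every center.
    diffs = X[:, None, :] - centers[None, :, :]
    sq_dist = (diffs ** 2 / feature_var[None, None, :]).sum(axis=2)
    # Softmax over clusters turns distances into fractional memberships in [0, 1].
    logits = -0.5 * sq_dist
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```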

Self-Organizing Map (SOM). Like soft k-means: determine an association between clusters and documents; associate a representative vector with each cluster and iteratively refine it. Unlike k-means: embed the clusters in a low-dimensional space right from the beginning; a large number of clusters can be initialized even if many of them eventually remain devoid of documents.

Self-Organizing Map (SOM). Each cluster can be a slot in a square/hexagonal grid. The grid structure defines the neighborhood N(c) for each cluster c. It also involves a proximity function h(γ, c) between clusters γ and c.

SOM: Update Rule. Like a neural network: a data item d activates a neuron c_d (the closest cluster) as well as its neighborhood neurons, e.g., through a Gaussian neighborhood function h(c, c_d) = exp(−‖r_c − r_{c_d}‖² / (2σ²)), where r_c is the grid position of node c. The update rule for node μ_c under the influence of d is μ_c ← μ_c + η h(c, c_d)(d − μ_c), where σ is the neighborhood width and η is the learning rate parameter.
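
A sketch of one SOM update step matching the rule above (the grid layout, neighborhood width, and learning rate values are hypothetical parameters for illustration):

```python
import numpy as np

def som_update(weights, grid_pos, d, eta=0.1, sigma=1.0):
    """One SOM update for a single document vector d.

    weights:  (n_nodes, n_features) representative vectors, one per grid node
    grid_pos: (n_nodes, 2) coordinates of each node in the 2-D map
    """
    # The winning node c_d is the closest cluster to d.
    winner = np.linalg.norm(weights - d, axis=1).argmin()
    # Gaussian neighborhood function h(c, c_d) over distance in the grid.
    grid_dist_sq = ((grid_pos - grid_pos[winner]) ** 2).sum(axis=1)
    h = np.exp(-grid_dist_sq / (2.0 * sigma ** 2))
    # mu_c <- mu_c + eta * h(c, c_d) * (d - mu_c)
    return weights + eta * h[:, None] * (d - weights)
```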

SOM : Example I SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.

SOM: Example II Another example of SOM at work: the sites listed in the Open Directory have been organized within a map of Antarctica at

Multidimensional Scaling (MDS). Goal: represent documents as points in a low-dimensional space such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input. Given an a priori (user-defined) measure of distance or dissimilarity d_{i,j} between documents i and j, let d̂_{i,j} be the Euclidean distance between documents i and j picked by our MDS algorithm.

Minimize the Stress. The stress of the embedding is given by stress = Σ_{i,j} (d̂_{i,j} − d_{i,j})² / Σ_{i,j} d_{i,j}². Iterative stress relaxation is the most commonly used strategy to minimize the stress.

Important Issues. The stress is not easy to optimize, so iterative hill climbing is used: 1. Points (documents) are assigned random coordinates by an external heuristic. 2. Points are moved by a small distance in the direction of locally decreasing stress. For n documents, each point takes O(n) time to be moved, so one full relaxation sweep takes O(n²) time.
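
A sketch of this iterative relaxation: start from random coordinates and nudge each point in a direction that locally decreases the stress (the step size and iteration count below are my own choices, not from the slides):

```python
import numpy as np

def mds_relax(D, dim=2, n_iters=200, step=0.05, seed=0):
    """Crude iterative stress relaxation.

    D is the (n, n) matrix of target dissimilarities d_ij; returns (n, dim) coordinates.
    Moving one point costs O(n), so a full sweep over all points costs O(n^2),
    as noted on the slide.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(len(D), dim))            # step 1: random coordinates
    for _ in range(n_iters):
        for i in range(len(D)):
            diff = X[i] - X                        # vectors from every point to point i
            dhat = np.linalg.norm(diff, axis=1)
            dhat[i] = 1.0                          # avoid division by zero for j == i
            # Gradient of sum_j (dhat_ij - d_ij)^2 with respect to X[i].
            grad = 2.0 * ((dhat - D[i]) / dhat)[:, None] * diff
            grad[i] = 0.0
            X[i] -= step * grad.sum(axis=0)        # step 2: move to locally decrease stress
    return X
```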

A Probabilistic Framework for Information Retrieval. Three fundamental questions: What statistics Θ should be chosen to describe the characteristics of documents? How do we estimate these statistics? How do we compute the likelihood of generating queries given the statistics Θ?

Multivariate Binary Model. A document event is just a bit-vector over the vocabulary W. The bit corresponding to a term t is flipped on with probability φ_t. Assume that term occurrences are independent events and that term counts are unimportant. The probability of generating d is then Pr(d|φ) = Π_{t∈d} φ_t × Π_{t∈W, t∉d} (1 − φ_t).
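
A small sketch of this likelihood in code, where `phi[t]` stands for the per-term "flip on" probability φ_t (names are illustrative, and phi is assumed to lie strictly between 0 and 1):

```python
import numpy as np

def binary_log_likelihood(bits, phi):
    """log Pr(d | phi) for a bit-vector document.

    bits[t] is 1 if term t occurs in d, else 0; phi[t] is the probability that
    the bit for term t is flipped on. Term occurrences are treated as independent
    and term counts are ignored, exactly as in the model above.
    """
    bits = np.asarray(bits, dtype=float)
    return float(np.sum(bits * np.log(phi) + (1.0 - bits) * np.log(1.0 - phi)))
```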

Multinomial Model. Takes term counts into account, but does NOT fix the term-independence assumption. The length of the document is determined by a random variable L drawn from a suitable distribution. The parameter set Θ contains all parameters needed to capture the length distribution plus all term probabilities {θ_t : t ∈ W}; Pr(d|Θ) is then Pr(L = L_d) times a multinomial over the term counts n(d,t) with probabilities θ_t.
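
A matching sketch for the multinomial term of the likelihood, using term counts (the document-length factor is left abstract, and theta is assumed strictly positive):

```python
import numpy as np
from scipy.special import gammaln

def multinomial_log_likelihood(counts, theta):
    """log of the multinomial part of Pr(d | Theta):
    log [ (L_d choose counts) * prod_t theta_t^counts_t ].

    counts[t] is the term frequency of term t in d; theta[t] its occurrence probability.
    The document-length factor Pr(L = L_d) is omitted here.
    """
    counts = np.asarray(counts, dtype=float)
    log_coef = gammaln(counts.sum() + 1.0) - gammaln(counts + 1.0).sum()
    return float(log_coef + np.sum(counts * np.log(theta)))
```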

Mixture Models. Suppose there are m topics (clusters) in the corpus with probability distribution {π_1, …, π_m}. Given a topic j, documents are generated by a binary/multinomial distribution with parameter set Θ_j, so Pr(d) = Σ_j π_j Pr(d|Θ_j). For a document belonging to topic j, we would expect Pr(d|Θ_j) to dominate the other mixture components.
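
To make the mixture concrete, here is a sketch of the posterior topic responsibilities Pr(topic j | d) ∝ π_j Pr(d|Θ_j) for a mixture of multinomial topic models (my own illustration, not code from the slides):

```python
import numpy as np

def topic_responsibilities(counts, pis, thetas):
    """Pr(topic j | d) for a mixture of multinomial topic models.

    pis:    (m,) mixing weights pi_1..pi_m
    thetas: (m, |W|) per-topic term distributions (assumed strictly positive)
    """
    counts = np.asarray(counts, dtype=float)
    # log [ pi_j * prod_t theta_{j,t}^{counts_t} ]; the multinomial coefficient
    # is the same for every topic, so it cancels in the normalization.
    log_post = np.log(pis) + (counts[None, :] * np.log(thetas)).sum(axis=1)
    log_post -= log_post.max()            # numerical stability
    post = np.exp(log_post)
    return post / post.sum()
```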

Unigram Language Model. Observation: d = {tf_1, tf_2, …, tf_n}. Unigram language model: θ = {p(w_1), p(w_2), …, p(w_n)}. Maximum likelihood estimation: choose θ to maximize Pr(d|θ).

Unigram Language Model. Probabilities for a single word: p(w). θ = {p(w) for any word w in vocabulary V}. Estimating a unigram language model by simple counting: given a document d, count the term frequency c(w,d) for each word w; then p(w) = c(w,d)/|d|.
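
The "simple counting" estimate in code (a minimal sketch; tokenization here is naive whitespace splitting):

```python
from collections import Counter

def unigram_mle(document: str) -> dict:
    """Maximum likelihood unigram model: p(w) = c(w, d) / |d|."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

# Example: unigram_mle("to be or not to be")
# -> {'to': 1/3, 'be': 1/3, 'or': 1/6, 'not': 1/6}
```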

Statistical Inference. C1: h, h, h, t, h, h → bias b1 = 5/6. C2: t, t, h, t, h, h → bias b2 = 1/2. C3: t, h, t, t, t, h → bias b3 = 1/3. Why does counting provide a good estimate of the coin bias?

Maximum Likelihood Estimation (MLE). Observation: o = {o_1, o_2, …, o_n}. Maximum likelihood estimation: choose the bias b that maximizes Pr(o|b). E.g., for o = {h, h, h, t, h, h}: Pr(o|b) = b^5 (1 − b), which is maximized at b = 5/6.
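
Spelling out why this gives b = 5/6 (the standard calculus step behind the counting estimate):

```latex
% Maximize Pr(o \mid b) = b^5 (1-b) by setting the derivative to zero:
\frac{d}{db}\, b^5 (1-b) = 5b^4(1-b) - b^5 = b^4\,(5 - 6b) = 0
\;\Longrightarrow\; b^\ast = \tfrac{5}{6}
```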

Unigram Language Model. Observation: d = {tf_1, tf_2, …, tf_n}. Unigram language model: θ = {p(w_1), p(w_2), …, p(w_n)}. Maximum likelihood estimation: p(w_i) = tf_i / Σ_j tf_j, i.e., the same counting estimate as before.

Maximum A Posteriori Estimation. Consider a special case: we only toss each coin twice. C1: h, t → b1 = 1/2. C2: h, h → b2 = 1. C3: t, t → b3 = 0? MLE estimation is poor when the number of observations is small. This is called the "sparse data" problem!
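
For reference, the standard MAP remedy (assuming a Beta(α, β) prior on the coin bias, which this slide itself does not yet introduce) replaces the raw counts with prior-smoothed counts:

```latex
% With a Beta(\alpha, \beta) prior on the bias b, after n_h heads in n tosses the MAP estimate is
\hat b_{\text{MAP}} = \frac{n_h + \alpha - 1}{\,n + \alpha + \beta - 2\,}
% e.g. \alpha = \beta = 2 gives, for C2 (h, h): \hat b = \frac{2 + 1}{2 + 2} = \tfrac{3}{4} instead of 1.
```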