
1 Switch to Top-down
Top-down or move-to-nearest: partition the documents into k clusters.
Two variants:
- "Hard": 0/1 assignment of documents to clusters.
- "Soft": documents belong to clusters with fractional scores.
Termination: when the assignment of documents to clusters ceases to change much, or when the cluster centroids move negligibly over successive iterations.

2 How to Find a Good Clustering?
Minimize the sum of distances within clusters.
(Figure: example clusters C1, C2, C3, C4, C6.)
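In symbols, this is the usual within-cluster sum of squared distances (a standard formulation; the cluster sets C_j and centroids μ_j are the usual notation, not transcribed from the slide):

```latex
\min_{C_1,\dots,C_k}\;\sum_{j=1}^{k}\;\sum_{x_i \in C_j}\;\lVert x_i - \mu_j \rVert^2,
\qquad \mu_j \;=\; \frac{1}{|C_j|}\sum_{x_i \in C_j} x_i
```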

3 How to Efficiently Cluster Data?

4 K-means for Clustering
K-means:
- Start with a random guess of the cluster centers.
- Determine the membership of each data point.
- Adjust the cluster centers.

5 K-means for Clustering
(Same text as slide 4.)

6 K-means for Clustering
(Same text as slide 4.)

7 K-means
1. Ask the user how many clusters they'd like (e.g., k = 5).

8 K-means
1. Ask the user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.

9 K-means
1. Ask the user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to. (Thus each center "owns" a set of datapoints.)

10 K-means
1. Ask the user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to.
4. Each center finds the centroid of the points it owns.

11 K-means
1. Ask the user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to.
4. Each center finds the centroid of the points it owns.
(Steps 3 and 4 are repeated until the assignments stop changing; see the code sketch below.)
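Putting slides 7-11 together, here is a minimal NumPy sketch of the loop; the toy data, tolerance, and seed are illustrative assumptions, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain (hard-assignment) k-means: assign each point to its nearest
    center, then move each center to the centroid of the points it owns."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster center locations (here: random points from X).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: each datapoint finds the center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: each center moves to the centroid of the points it owns.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Terminate when the centers move negligibly (slide 1's criterion).
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels

# Toy usage with made-up 2-D data drawn around three group centers.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
centers, labels = kmeans(X, k=3)
```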

12 Problem with K-means (Sensitive to the Initial Cluster Centroids)
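One way to see this sensitivity (an illustration, not from the slides) is to run k-means from several different random initializations and compare the resulting objective values; scikit-learn's KMeans with n_init=1 uses exactly one random start per run:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D data with three groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [8, 0], [4, 7])])

# Different random starts can converge to different local optima, which shows
# up as different within-cluster sums of squares (inertia); depending on the
# draw, some seeds may land in a noticeably poorer solution.
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  inertia={km.inertia_:.2f}")
```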

13 (figure-only slide)

14 (figure-only slide)

15
- So far we have used distance to measure similarity in k-means.
- Other similarity measures are possible, e.g., kernel functions.
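As a small illustration (the Gaussian/RBF kernel below is just one common choice, not something the slide specifies), a kernel function can replace raw Euclidean distance as the notion of similarity:

```python
import numpy as np

def euclidean_distance(x, y):
    """Smaller means more similar."""
    return np.linalg.norm(x - y)

def rbf_kernel_similarity(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: close points get similarity near 1,
    distant points get similarity near 0."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([0.0, 0.0]), np.array([1.0, 2.0])
print(euclidean_distance(x, y), rbf_kernel_similarity(x, y))
```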

16 Problem with K-means
Binary (0/1) cluster membership.

17 Improve: Soft Membership
The per-feature weight indicates the importance of each feature.
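A minimal sketch of soft membership; the softmax-over-squared-distances form below is an assumption (the slide's exact formula is not in the transcript), but it captures the idea that each document gets a fractional score for every cluster instead of a 0/1 assignment:

```python
import numpy as np

def soft_memberships(X, centers, beta=1.0):
    """Fractional cluster memberships: exp(-beta * squared distance),
    normalized so that each document's scores sum to 1."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-beta * sq_dists)
    return weights / weights.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
print(soft_memberships(X, centers))   # rows are documents, columns are clusters
```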

18 Self-Organizing Map (SOM)
Like soft k-means:
- Determines an association between clusters and documents.
- Associates a representative vector with each cluster and iteratively refines it.
Unlike k-means:
- The clusters are embedded in a low-dimensional space right from the beginning.
- A large number of clusters can be initialized even if many of them eventually remain devoid of documents.

19 Self-Organizing Map (SOM)
- Each cluster can be a slot in a square/hexagonal grid.
- The grid structure defines the neighborhood N(c) of each cluster c.
- SOM also involves a proximity function defined between clusters (grid slots).

20 SOM: Update Rule
Like a neural network, a data item d activates a neuron (the closest cluster) as well as its neighborhood neurons, e.g., via a Gaussian neighborhood function
$h(c, c_d) = \exp\!\big(-\lVert r_c - r_{c_d}\rVert^2 / (2\sigma^2)\big)$,
where $c_d$ is the winning (closest) cluster for d and $r_c$ is cluster c's position on the grid. The update rule for node $\mu_c$ under the influence of d is
$\mu_c \leftarrow \mu_c + \eta\, h(c, c_d)\,(d - \mu_c)$,
where $\sigma$ is the neighborhood width and $\eta$ is the learning rate parameter.
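A minimal sketch of one SOM update step under the rule above; the grid shape, σ, and η values are illustrative assumptions:

```python
import numpy as np

def som_update(weights, grid_coords, d, eta=0.1, sigma=1.0):
    """One SOM step: find the winning node for document vector d, then pull
    every node toward d, weighted by a Gaussian neighborhood on the grid."""
    winner = np.argmin(np.linalg.norm(weights - d, axis=1))        # closest cluster c_d
    grid_dist_sq = ((grid_coords - grid_coords[winner]) ** 2).sum(axis=1)
    h = np.exp(-grid_dist_sq / (2 * sigma ** 2))                   # neighborhood influence
    return weights + eta * h[:, None] * (d - weights)              # mu_c += eta * h * (d - mu_c)

# A 3x3 grid of nodes whose representative vectors live in a 2-D document space.
grid_coords = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
weights = np.random.default_rng(0).normal(size=(9, 2))
weights = som_update(weights, grid_coords, d=np.array([1.0, 0.5]))
```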

21 SOM: Example I
An SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.

22 SOM: Example II
Another example of SOM at work: the sites listed in the Open Directory have been organized within a map of Antarctica at http://antarcti.ca/.

23 Multidimensional Scaling (MDS)
Goal: represent documents as points in a low-dimensional space such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input.
- Let $d_{ij}$ be the a priori (user-defined) measure of distance or dissimilarity between documents i and j.
- Let $\hat{d}_{ij}$ be the Euclidean distance between documents i and j picked by our MDS algorithm.

24 Minimize the Stress
- The stress of the embedding is given by
$$\mathrm{stress} \;=\; \frac{\sum_{i,j} \big(\hat{d}_{ij} - d_{ij}\big)^2}{\sum_{i,j} d_{ij}^2}.$$
- Iterative stress relaxation, i.e., repeatedly moving points to locally reduce the stress, is the most commonly used strategy to minimize it.
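A small sketch of iterative stress reduction by gradient descent on the formula above; the step size, iteration count, and toy distances are illustrative assumptions:

```python
import numpy as np

def stress(X, D):
    """Normalized stress between embedded distances and target distances D."""
    Dhat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return ((Dhat - D) ** 2).sum() / (D ** 2).sum()

def relax(X, D, lr=0.05, n_iters=500):
    """Move each point a small distance in the direction that locally decreases stress."""
    den = (D ** 2).sum()
    for _ in range(n_iters):
        diff = X[:, None, :] - X[None, :, :]                 # pairwise x_i - x_j
        Dhat = np.linalg.norm(diff, axis=2)
        with np.errstate(divide="ignore", invalid="ignore"):
            factor = np.where(Dhat > 0, (Dhat - D) / Dhat, 0.0)
        grad = 4.0 / den * (factor[:, :, None] * diff).sum(axis=1)
        X = X - lr * grad
    return X

# Toy target distances between 4 documents, and a random 2-D starting layout.
D = np.array([[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 2))
print(stress(X, D), stress(relax(X, D), D))   # stress after relaxation should be lower
```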

25 Important Issues
- The stress is not easy to optimize; iterative hill climbing is used:
  1. Points (documents) are assigned random coordinates by an external heuristic.
  2. Points are moved by a small distance in the direction of locally decreasing stress.
- For n documents, each point takes O(n) time to be moved, giving O(n^2) time per relaxation step.

26 A Probabilistic Framework for Information Retrieval
Three fundamental questions:
- What statistics Θ should be chosen to describe the characteristics of documents?
- How do we estimate these statistics?
- How do we compute the likelihood of generating queries given the statistics Θ?

27 Multivariate Binary Model
- A document event is just a bit vector over the vocabulary W.
- The bit corresponding to a term t is flipped on with probability φ_t.
- Assume that term occurrences are independent events and that term counts are unimportant.
- The probability of generating d is then the product of the per-term probabilities, as written below.
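Presumably the slide's missing formula is the standard multivariate Bernoulli likelihood (the notation φ_t for the per-term "on" probability is my labeling):

```latex
\Pr(d \mid \phi) \;=\; \prod_{t \in d} \phi_t \;\prod_{t \in W \setminus d} \big(1 - \phi_t\big)
```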

28 Multinomial Model
- Takes term counts into account, but does NOT fix the term-independence assumption.
- The length of the document is determined by a random variable drawn from a suitable distribution.
- The parameter set Θ contains all parameters needed to capture the length distribution, together with the term probabilities θ_t for all terms t.
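For reference, the usual multinomial likelihood (an assumed reconstruction of the slide's formula), with ℓ_d the document length and tf_t the count of term t in d:

```latex
\Pr(d \mid \Theta) \;=\; \Pr(L = \ell_d)\;\frac{\ell_d!}{\prod_t tf_t!}\;\prod_t \theta_t^{\,tf_t}
```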

29 Mixture Models
- Suppose there are m topics (clusters) in the corpus, with prior probabilities π_1, …, π_m.
- Given a topic j, documents are generated by a binary/multinomial distribution with parameter set θ_j.
- For a document belonging to topic j, we would expect Pr(d | θ_j) to be large relative to the other topics.
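The document likelihood under such a mixture is then the standard weighted sum (implied by the setup above, not transcribed from the slide):

```latex
\Pr(d \mid \Theta) \;=\; \sum_{j=1}^{m} \pi_j \,\Pr(d \mid \theta_j)
```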

30 Unigram Language Model
- Observation: d = {tf_1, tf_2, …, tf_n}.
- Unigram language model: θ = {p(w_1), p(w_2), …, p(w_n)}.
- Maximum likelihood estimation.

31 Unigram Language Model
- Probabilities for single words p(w): θ = {p(w) for every word w in the vocabulary V}.
- Estimating a unigram language model by simple counting: given a document d, count the term frequency c(w, d) for each word w; then p(w) = c(w, d) / |d|.
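A minimal sketch of this counting estimate (the toy document is made up):

```python
from collections import Counter

def unigram_mle(doc_tokens):
    """Maximum-likelihood unigram model: p(w) = c(w, d) / |d|."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

d = "to be or not to be".split()
print(unigram_mle(d))   # 'to' and 'be' get 2/6 each; 'or' and 'not' get 1/6 each
```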

32 Statistical Inference
- C1: h, h, h, t, h, h → bias b1 = 5/6
- C2: t, t, h, t, h, h → bias b2 = 1/2
- C3: t, h, t, t, t, h → bias b3 = 1/3
Why does counting provide a good estimate of a coin's bias?

33 Maximum Likelihood Estimation (MLE)
- Observation: o = {o_1, o_2, …, o_n}.
- Maximum likelihood estimation: pick the bias b that maximizes Pr(o | b).
- E.g., for o = {h, h, h, t, h, h}: Pr(o | b) = b^5 (1 − b).
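Setting the derivative of the log-likelihood to zero recovers the counting estimate from the previous slide (a one-line check, not shown on the slide itself):

```latex
\frac{d}{db}\Big[\,5\log b + \log(1-b)\,\Big] \;=\; \frac{5}{b} - \frac{1}{1-b} \;=\; 0
\;\;\Longrightarrow\;\; b_{\mathrm{MLE}} \;=\; \frac{5}{6}
```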

34 Unigram Language Model
- Observation: d = {tf_1, tf_2, …, tf_n}.
- Unigram language model: θ = {p(w_1), p(w_2), …, p(w_n)}.
- Maximum likelihood estimation.

35 Maximum A Posteriori Estimation
Consider a special case: we toss each coin only twice.
- C1: h, t → b1 = 1/2
- C2: h, h → b2 = 1
- C3: t, t → b3 = 0 ?
MLE is poor when the number of observations is small. This is called the "sparse data" problem!
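One standard way MAP estimation addresses this (the Beta prior below is an assumption; the deck's specific prior is not in the transcript): place a Beta(α, β) prior on the bias b, so that with h heads out of n tosses the estimate is pulled toward the prior instead of collapsing to 0 or 1:

```latex
b_{\mathrm{MAP}} \;=\; \arg\max_b \;\Pr(o \mid b)\,\Pr(b)
\;=\; \frac{h + \alpha - 1}{n + \alpha + \beta - 2},
\qquad \text{e.g. } \alpha=\beta=2:\;\; b_{\mathrm{MAP}}^{(C2)} \;=\; \frac{2+1}{2+2} \;=\; \frac{3}{4}.
```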

