Presentation is loading. Please wait.

Presentation is loading. Please wait.

©2012 Paula Matuszek CSC 9010: Text Mining Applications: Document Clustering l Dr. Paula Matuszek l

Similar presentations


Presentation on theme: "©2012 Paula Matuszek CSC 9010: Text Mining Applications: Document Clustering l Dr. Paula Matuszek l"— Presentation transcript:

1 ©2012 Paula Matuszek CSC 9010: Text Mining Applications: Document Clustering l Dr. Paula Matuszek l l l (610)

2 ©2012 Paula Matuszek Document Clustering l Clustering documents: group documents into “similar” categories l Categories are not predefined l Examples –Process news articles to see what today’s stories are –Review descriptions of calls to a help line to see what kinds of problems people are having –Get an overview of search results –Infer a set of labels for future training of a classifier l Unsupervised learning: no training examples and no test set to evaluate

3 Some Examples l search.carrot2.org/stable/search search.carrot2.org/stable/search l l news.google.com news.google.com

4 Clustering Basics l Collect documents l Compute similarity among documents l Group documents together such that –documents within a cluster are similar –documents in different clusters are different l Summarize each cluster l Sometimes -- assign new documents to the most similar cluster

5 3. Clustering Example Based on:

6 Measures of Similarity l Semantic similarity -- but this is hard. l Statistical similarity: BOW –Number of words in common –count –Weighted words in common –words get additional weight based on inverse document frequency: more similar if two documents share an infrequent word –With document as a vector: –Euclidean Distance –Cosine Similarity

7 Euclidean Distance l Euclidean distance: distance between two measures summed across each feature –Squared differences to give more weight to larger difference dist(x i, x j ) = sqrt((x i1 -x j1 ) 2 +(x i2 -x j2 ) (x in -x jn ) 2 )

8 Based on home.iitk.ac.in/~mfelixor/Files/non-numeric-Clustering-seminar.pp Cosine similarity measurement l Cosine similarity is a measure of similarity between two vectors by measuring the cosine of the angle between them. l The result of the Cosine function is equal to 1 when the angle is 0, and it is less than 1 when the angle is of any other value. l As the angle between the vectors shortens, the cosine angle approaches 1, meaning that the two vectors are getting closer, meaning that the similarity of whatever is represented by the vectors increases.

9 Clustering Algorithms l Hierarchical –Bottom up (agglomerative) –Top-down (not common in text) l Flat –K means l Probabilistic –Expectation Maximumization (E-M)

10 7 Hierarchical Agglomerative Clustering (HAC) l Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. l The history of merging forms a binary tree or hierarchy. Based on:

11 8 HAC Algorithm Based on: l Start with all instances in their own cluster. l Until there is only one cluster: – Among the current clusters, determine the two clusters, c i and c j, that are most similar. –Replace c i and c j with a single cluster c i ∪ c j

12 9 Cluster Similarity l How to compute similarity of two clusters each possibly containing multiple instances? –Single Link: Similarity of two most similar members. –Can result in “straggly” (long and thin) clusters due to chaining effect. –Complete Link: Similarity of two least similar members. –Makes more “tight,” spherical clusters that are typically preferable –More sensitive to outliers –Group Average: Average similarity between members. Based on:

13 11 Single Link Example Based on:

14 13 Complete Link Example Based on:

15 16 Group Average Agglomerative Clustering l Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters. l Compromise between single and complete link. l Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters (to encourage tighter final clusters). Based on:

16 Dendogram: Hierarchical Clustering l Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

17 Partitioning (Flat) Algorithms l Partitioning method: Construct a partition of n documents into a set of K clusters l Given: a set of documents and the number K l Find: a partition of K clusters that optimizes the chosen partitioning criterion –Globally optimal: exhaustively enumerate all partitions –Effective heuristic methods: K-means algorithm.

18 18 Non-Hierarchical Clustering l Typically must provide the number of desired clusters, k. l Randomly choose k instances as seeds, one per cluster. l Form initial clusters based on these seeds. l Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering. l Stop when clustering converges or after a fixed number of iterations. Based on:

19 19 K-Means l Assumes instances are real-valued vectors. l Clusters based on centroids, center of gravity, or mean of points in a cluster, c: l Reassignment of instances to clusters is based on distance to the current cluster centroids. Based on:

20 21 K-Means Algorithm Based on: l Let d be the distance measure between instances. l Select k random instances {s1, s2,… sk} as seeds. l Until clustering converges or other stopping criterion: –For each instance xi: –Assign xi to the cluster cj such that d(xi, sj) is minimal. –(Update the seeds to the centroid of each cluster) –For each cluster cj, sj = μ(cj)

21 22 K Means Example (K=2) Pick seeds Reassign clusters Compute centroids x x Reasssign clusters x x x x Compute centroids Reassign clusters Converged! Based on:

22 Termination conditions l Several possibilities, e.g., –A fixed number of iterations. –Doc partition unchanged. –Centroid positions don’t change.

23 Seed Choice l Results can vary based on random seed selection. l Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings. –Select good seeds using a heuristic (e.g., doc least similar to any existing mean) –Try out multiple starting points –Initialize with the results of another method. In the above, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F} If you start with D and F you converge to {A,B,D,E} {C,F} Example showing sensitivity to seeds A B DE C F

24 How Many Clusters? l Number of clusters K is given –Partition n docs into predetermined # of clusters l Finding the “right” number of clusters is part of the problem l K not specified in advance –Solve an optimization problem: penalize having lots of clusters –application dependent, e.g., compressed summary of search results list. –Tradeoff between having more clusters (better focus within each cluster) and having too many cluster

25 Weaknesses of k-means l The algorithm is only applicable to numeric data l The user needs to specify k. l The algorithm is sensitive to outliers –Outliers are data points that are very far away from other data points. –Outliers could be errors in the data recording or some special data points with very different values.

26 Strengths of k-means l Strengths: –Simple: easy to understand and to implement –Efficient: Time complexity: O(tkn), –where n is the number of data points, –k is the number of clusters, and –t is the number of iterations. –Since both k and t are small. k-means is considered a linear algorithm. l K-means is the most popular clustering algorithm. l In practice, performs well on text.

27 26 Buckshot Algorithm l Combines HAC and K-Means clustering. l First randomly take a sample of instances of size √n l Run group-average HAC on this sample, which takes only O(n) time. l Use the results of HAC as initial seeds for K-means. l Overall algorithm is O(n) and avoids problems of bad seed selection. Based on:

28 27 Text Clustering l HAC and K-Means have been applied to text in a straightforward way. l Typically use normalized, TF/IDF-weighted vectors and cosine similarity. l Optimize computations for sparse vectors. l Applications: –During retrieval, add other documents in the same cluster as the initial retrieved documents to improve recall. –Clustering of results of retrieval to present more organized results to the user. –Automated production of hierarchical taxonomies of documents for browsing purposes. Based on:

29 28 Soft Clustering l Clustering typically assumes that each instance is given a “hard” assignment to exactly one cluster. l Does not allow uncertainty in class membership or for an instance to belong to more than one cluster. l Soft clustering gives probabilities that an instance belongs to each of a set of clusters. l Each instance is assigned a probability distribution across a set of discovered categories (probabilities of all categories must sum to 1). Based on:

30 29 Expectation Maximumization (EM) l Probabilistic method for soft clustering. l Direct method that assumes k clusters:{c1, c2,… ck} l Soft version of k-means. l Assumes a probabilistic model of categories that allows computing P(ci | E) for each category, ci, for a given example, E. l For text, typically assume a naïve-Bayes category model. Based on:

31 30 EM Algorithm l Iterative method for learning probabilistic categorization model from unsupervised data. l Initially assume random assignment of examples to categories. l Learn an initial probabilistic model by estimating model parameters θ from this randomly labeled data. l Iterate following two steps until convergence: –Expectation (E-step): Compute P(ci | E) for each example given the current model, and probabilistically re-label the examples based on these posterior probability estimates. –Maximization (M-step): Re-estimate the model parameters, θ, from the probabilistically re-labeled data. Based on:

32 31 EM Unlabeled Examples + − + − + − + − − + Assign random probabilistic labels to unlabeled data Initialize: Based on:

33 32 EM Prob. Learner + − + − + − + − − + Give soft-labeled training data to a probabilistic learner Initialize: Based on:

34 33 EM Prob. Learner Prob. Classifier + − + − + − + − − + Produce a probabilistic classifier Initialize: Based on:

35 34 EM Prob. Learner Prob. Classifier Relabel unlabled data using the trained classifier + − + − + − + − − + E Step: Based on:

36 35 EM Prob. Learner + − + − + − + − − + Prob. Classifier Continue EM iterations until probabilistic labels on unlabeled data converge. Retrain classifier on relabeled data M step: Based on:

37 46 Issues in Clustering l How to evaluate clustering? –Internal: –Tightness and separation of clusters (e.g. k-means objective) –Fit of probabilistic model to data –External –Compare to known class labels on benchmark data (purity) l Improving search to converge faster and avoid local minima. l Overlapping clustering. Based on:

38 How to Label Text Clusters Show titles of typical documents –Titles are easy to scan –Authors create them for quick scanning! –But you can only show a few titles which may not fully represent cluster l Show words/phrases prominent in cluster –More likely to fully represent cluster –Use distinguishing words/phrases –But harder to scan 22 cs.wellesley.edu/~cs315/315_PPTs/L17-Clustering/lecture_17.ppt

39 Summary: Text Clustering l Clustering can be useful in seeing “shape” of a document corpus l Unsupervised learning; training examples not needed –easier and faster –more appropriate when you don’t know the categories l Goal is to create clusters with similar documents –typically using BOW with TF*IDF as representation –typically using cosine-similarity as metric l Difficult to evaluate results –useful? –expert opinion? –In practice, only moderately successful


Download ppt "©2012 Paula Matuszek CSC 9010: Text Mining Applications: Document Clustering l Dr. Paula Matuszek l"

Similar presentations


Ads by Google