1 CSC 594 Topics in AI – Text Mining and Analytics, Fall 2015/16 8. Text Clustering

2 Text Clustering
Text clustering most often separates the corpus of documents into mutually exclusive clusters: each document belongs to one and only one cluster (i.e., hard clustering). Topic extraction, by contrast, assigns a document to multiple topics (i.e., soft clustering).
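The hard vs. soft distinction above can be sketched in a few lines. This is a toy illustration, not from the slides; the documents, cluster count, and weights are made up:

```python
import numpy as np

# Hypothetical example: 4 documents, 3 clusters/topics.
# Hard clustering: each document gets exactly one cluster label.
hard_labels = np.array([0, 2, 1, 0])

# Soft clustering (topic extraction): each document gets a
# probability distribution over topics that sums to 1.
soft_weights = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
])

# A hard assignment can be recovered from soft weights by taking
# the most probable topic for each document.
recovered = soft_weights.argmax(axis=1)
print(recovered)  # [0 2 1 0]
```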

3 Similarity-based Clustering
A common approach to text clustering is to group documents that are similar. In the vector space model of textual data, there are several popular similarity metrics:
–Correlation-based metrics: often used in document search and retrieval, e.g. the cosine of the angle between document vectors
–Distance-based metrics: provide a 'magnitude' of dissimilarity, e.g. Euclidean distance: distance(x,y) = {∑_i (x_i - y_i)^2}^(1/2)
–Association-based measures (not always true metrics): often used for nominal attributes, e.g. the Jaccard coefficient

Example doc*term counts:

              Doc 1  Doc 2  Doc 3  …  Doc N
apple             1      0      0  …      2
cat               3      1      1  …      4
dog               2      2      1  …      3
farm              1      0      0  …      1
…                 …      …      …  …      …
White House       0      3      4  …      0
Senate            0      2      4  …      0
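The three measures can be computed directly from term-count vectors. A minimal sketch, using two hypothetical documents over the six terms in the table (counts chosen for illustration):

```python
import math

# Term-count vectors over: apple, cat, dog, farm, White House, Senate
doc1 = [1, 3, 2, 1, 0, 0]
doc2 = [0, 1, 2, 0, 3, 2]

# Cosine similarity: based on the angle between the vectors,
# so it ignores overall document length.
dot = sum(a * b for a, b in zip(doc1, doc2))
norm1 = math.sqrt(sum(a * a for a in doc1))
norm2 = math.sqrt(sum(b * b for b in doc2))
cosine = dot / (norm1 * norm2)

# Euclidean distance: a 'magnitude' of dissimilarity,
# distance(x,y) = {sum_i (x_i - y_i)^2}^(1/2).
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(doc1, doc2)))  # sqrt(19)

# Jaccard coefficient on term presence/absence (the nominal view):
# |terms in both docs| / |terms in either doc|.
set1 = {i for i, c in enumerate(doc1) if c > 0}
set2 = {i for i, c in enumerate(doc2) if c > 0}
jaccard = len(set1 & set2) / len(set1 | set2)  # 2/6 = 1/3
```

Note that cosine is a similarity (higher means more alike) while Euclidean distance is a dissimilarity (lower means more alike), so they are not interchangeable in a clustering algorithm without conversion.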

4 Other Clustering Approaches
Distribution-based clustering
–Assumes a distribution model for the values and tries to fit it to the observations.
–e.g. Gaussian Mixture Models (fit with the Expectation-Maximization algorithm)
Density-based clustering
–Clusters are defined as areas of higher density.
–Observations in sparse areas are treated as noise or border points that separate clusters.
Wikipedia, Cluster Analysis, https://en.wikipedia.org/wiki/Cluster_analysis
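Both approaches on this slide are available off the shelf. A minimal sketch, assuming scikit-learn is installed (the data is synthetic; two well-separated blobs stand in for document vectors):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs of 2-D points.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
])

# Distribution-based: fit a 2-component Gaussian Mixture Model with
# the EM algorithm, then assign each point to its most likely component.
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Density-based: DBSCAN grows clusters from dense regions; points in
# sparse areas that cannot reach a dense core get the noise label -1.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```

Unlike K-means, DBSCAN does not need the number of clusters in advance; it discovers one cluster per dense region and leaves sparse observations unassigned.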

5 Unigrams vs. Reduced Dimensions for Text Clustering
Just as with text topics, you can apply clustering directly to a doc*term (or term*doc) matrix, or to a matrix obtained after reducing dimensions (e.g. by SVD). When SVD is applied to a term*doc matrix A = U·S·V^T:
–Documents are represented by the rows of the matrix V (i.e., the columns of V^T).
–Terms are represented by the row vectors of the product of the U and S matrices.
[Figure: the factorization A = U·S·V^T, with the retained SVD dimensions highlighted]
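This mapping from the SVD factors to document and term vectors can be made concrete. A sketch with NumPy on a small hypothetical term*doc count matrix (terms in rows, documents in columns):

```python
import numpy as np

# Toy term*doc matrix: 4 terms (rows) x 4 documents (columns).
A = np.array([
    [1, 0, 0, 2],
    [3, 1, 1, 4],
    [2, 2, 1, 3],
    [1, 0, 0, 1],
], dtype=float)

# numpy returns A = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of reduced dimensions to keep
# Each document is a column of V^T (a row of V), truncated to the
# top-k singular dimensions.
doc_vectors = Vt[:k, :].T          # shape (n_docs, k)
# Each term is a row of U scaled by the singular values S.
term_vectors = U[:, :k] * S[:k]    # shape (n_terms, k)

# The rank-k product reconstructs an approximation of A; clustering
# is then run on doc_vectors instead of the raw counts.
A_approx = term_vectors @ doc_vectors.T
```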

6 Clustering Algorithms
Each document is assigned to the one cluster to which its membership/similarity is strongest. Broadly speaking, clustering algorithms can be divided into four groups:
1.Hierarchical – top-down (divisive) or bottom-up (agglomerative)
2.Non-hierarchical – partitioning algorithms such as K-means
3.Probabilistic – identifies dense regions of the data space
4.Neural network – typically the Kohonen Self-Organizing Map (SOM)
Commonly used algorithms:
–K-means clustering
–Hierarchical clustering
–Expectation-Maximization (EM) clustering – a model-based, generative technique
Reference on various clustering algorithms: http://condor.depaul.edu/ntomuro/courses/578/notes/notes-Clustering.html (my old CSC 578 lecture note)
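As an illustration of the non-hierarchical (partitioning) group, here is a minimal hand-rolled K-means sketch: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The data and parameters are invented for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal K-means: alternate hard assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (hard assignment).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty centroid to the mean of its points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two obvious groups of 2-D points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
```

EM clustering (slide 4) has the same alternating structure, but with soft, probability-weighted assignments in place of the hard argmin.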

7 Cluster Assignment

8 [Slide figure from Coursera, Text Mining and Analytics, ChengXiang Zhai]

9 Interpretation of Clusters: Descriptive Terms or Centroids
Descriptive terms in SAS Enterprise Miner: the Text Cluster node uses a descriptive-terms algorithm to describe the contents of both EM clusters and hierarchical clusters. If you specify m descriptive terms per cluster, the top 2*m most frequently occurring terms in each cluster are used to compute them. For each of the 2*m terms, a binomial probability is computed for each cluster. The probability of assigning a term to cluster j is prob = F(k|N, p), where F is the binomial cumulative distribution function, k is the number of times the term appears in cluster j, N is the number of documents in cluster j, and p = (sum-k)/(total-N), with sum the total number of times the term appears across all clusters and total the total number of documents. The m descriptive terms are those with the highest binomial probabilities. Descriptive terms must have a keep status of Y and must occur at least twice (by default) in a cluster.
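The scoring formula above can be implemented in a few lines of standard-library Python. A sketch with hypothetical counts (the function names and the numbers are invented for illustration; only the formula follows the slide):

```python
from math import comb

def binom_cdf(k, n, p):
    """Binomial cumulative distribution function F(k | n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def descriptive_score(k, N, term_sum, total):
    """Score of a term for one cluster: prob = F(k | N, p),
    with p = (sum - k) / (total - N) as defined on the slide."""
    p = (term_sum - k) / (total - N)
    return binom_cdf(k, N, p)

# Hypothetical counts: the term appears k=8 times in cluster j,
# which holds N=10 documents; it appears 12 times across all
# clusters, out of 100 documents total.
score = descriptive_score(k=8, N=10, term_sum=12, total=100)
```

Intuitively, p estimates how often the term occurs outside cluster j; a term that appears in cluster j far more often than that baseline predicts gets a CDF value near 1 and is chosen as descriptive.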

10 [Slide figure from Coursera, Text Mining and Analytics, ChengXiang Zhai]

11 [Slide figure from Coursera, Text Mining and Analytics, ChengXiang Zhai]
