CSC 594 Topics in AI – Text Mining and Analytics, Fall 2015/16 – 8. Text Clustering

Text Clustering

Text clustering most often separates the entire corpus of documents into mutually exclusive clusters: each document belongs to one and only one cluster (i.e., hard clustering). Topic extraction, by contrast, assigns a document to multiple topics (i.e., soft clustering).
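To make the hard vs. soft distinction concrete, here is a minimal sketch assuming scikit-learn (the library and the toy documents are my own choices, not part of the course materials): k-means produces exactly one cluster label per document, while a topic model such as LDA produces a distribution over topics for each document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

docs = ["apple farm dog", "senate white house vote", "cat dog farm", "white house senate"]

# Hard clustering: each document gets exactly one cluster label.
X_tfidf = TfidfVectorizer().fit_transform(docs)
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_tfidf)
print(hard_labels)         # e.g. [0 1 0 1] -- one label per document

# Soft topic extraction: each document gets a distribution over topics.
X_counts = CountVectorizer().fit_transform(docs)
topic_mix = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X_counts)
print(topic_mix.round(2))  # one row per document; each row sums to 1
```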

Similarity-based Clustering

A common approach to text clustering is to group documents which are similar. In the vector space model of textual data, there are several popular similarity metrics:
–Correlation-based metrics: Often used in document search and retrieval, e.g. the cosine of the angle between document vectors
–Distance-based metrics: Provide a 'magnitude' of similarity, e.g. Euclidean distance: distance(x, y) = ( Σ_i (x_i − y_i)² )^(1/2)
–Association-based measures (not always metrics): Often used for nominal attributes, e.g. the Jaccard coefficient

Example term*doc count matrix:

              Doc 1   Doc 2   Doc 3   …   Doc N
apple           1       0       0     …     2
cat             3       1       1     …     4
dog             2       2       1     …     3
farm            1       0       0     …     1
…               …       …       …     …     …
White House     0       3       4     …     0
Senate          0       2       4     …     0
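For concreteness, here is a minimal sketch computing the three kinds of measures on the Doc 1 and Doc 2 columns of the matrix above, assuming NumPy and SciPy (neither appears in the original slides):

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean, jaccard

# Term-count vectors for Doc 1 and Doc 2 from the table above
# (order: apple, cat, dog, farm, White House, Senate).
doc1 = np.array([1, 3, 2, 1, 0, 0])
doc2 = np.array([0, 1, 2, 0, 3, 2])

# Correlation-based: cosine similarity = 1 - cosine distance.
cos_sim = 1 - cosine(doc1, doc2)

# Distance-based: Euclidean distance = (sum_i (x_i - y_i)^2)^(1/2).
euc_dist = euclidean(doc1, doc2)

# Association-based: Jaccard coefficient on presence/absence of each term.
jac_sim = 1 - jaccard(doc1 > 0, doc2 > 0)

print(round(cos_sim, 3), round(euc_dist, 3), round(jac_sim, 3))
```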

Other Clustering Approaches

Distribution-based clustering
–Assumes a distribution model of the values and tries to fit it to the observations.
–e.g. Gaussian Mixture Model (estimated with the Expectation-Maximization algorithm)

Density-based clustering
–Clusters are defined as areas of higher density.
–Observations in sparse areas are treated as noise rather than as separate clusters.

(Wikipedia, Cluster Analysis)
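A small hedged sketch contrasting the two approaches, assuming scikit-learn and synthetic 2-D points (neither is part of the original slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (50, 2)),    # dense blob 1
                    rng.normal(3, 0.3, (50, 2)),    # dense blob 2
                    rng.uniform(-2, 5, (10, 2))])   # sparse background points

# Distribution-based: assume a mixture of Gaussians, fit its parameters
# with EM, then read off the (hard) component assignments.
gmm = GaussianMixture(n_components=2, random_state=0).fit(points)
gmm_labels = gmm.predict(points)

# Density-based: clusters are dense regions; low-density points get label -1 (noise).
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)

print(set(gmm_labels), set(db_labels))
```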

Unigrams vs. Reduced Dimensions for Text Clustering

Just like for text topics, you can apply clustering directly to a doc*term (or term*doc) matrix, or to a matrix obtained after reducing dimensions (e.g. by SVD). When SVD is applied to a term*doc matrix:
–Documents are represented by the column vectors of the matrix V.
–Terms are represented by the row vectors of the product of the U and S matrices.

[Figure: the SVD decomposition of the term*doc matrix A into U, S, and V, with the retained SVD dimensions indicated]
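An illustrative sketch of both options, assuming scikit-learn (the toy documents and parameter choices are mine; TruncatedSVD stands in for the SVD step):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

docs = ["apple farm dog cat", "senate white house", "dog cat farm", "white house senate vote"]
A = TfidfVectorizer().fit_transform(docs)   # doc*term matrix

# Option 1: cluster directly on the unigram (doc*term) matrix.
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(A)

# Option 2: reduce dimensions with truncated SVD first, then cluster.
lsa = make_pipeline(TruncatedSVD(n_components=2, random_state=0), Normalizer(copy=False))
A_reduced = lsa.fit_transform(A)            # each document is now a short dense vector
labels_svd = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(A_reduced)

print(labels_raw, labels_svd)
```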

Clustering Algorithms

Each document is assigned to the cluster to which its membership/similarity is strongest. Broadly speaking, clustering algorithms can be divided into four groups:
1. Hierarchical – top-down (divisive) or bottom-up (agglomerative)
2. Non-hierarchical – partitioning algorithms such as k-means
3. Probabilistic – identifies dense regions of the data space
4. Neural network – typically the Kohonen Self-Organizing Map (SOM)

Reference on various clustering algorithms (my old CSC 578 lecture note):
–K-means clustering
–Hierarchical clustering
–Expectation-Maximization (EM) clustering – model-based, generative technique
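Since k-means and EM/Gaussian-mixture sketches already appear above, here is a minimal bottom-up (agglomerative) hierarchical example, assuming SciPy and toy document vectors (not taken from the lecture):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(1).normal(size=(10, 5))   # toy document vectors

# Agglomerative clustering: repeatedly merge the two closest clusters
# (average-link, cosine distance), building a tree of merges.
Z = linkage(X, method="average", metric="cosine")

# Cut the tree to obtain 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```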

Cluster Assignment

[Figure slide: Coursera, Text Mining and Analytics, ChengXiang Zhai]

Interpretation of Clusters: Descriptive Terms or Centroids

Descriptive Terms in SAS Enterprise Miner: The Text Cluster node uses a descriptive-terms algorithm to describe the contents of both EM clusters and hierarchical clusters. If you specify that m descriptive terms be displayed for each cluster, then the top 2*m most frequently occurring terms in each cluster are used to compute the descriptive terms. For each of the 2*m terms, a binomial probability for each cluster is computed. The probability of assigning a term to cluster j is prob = F(k | N, p), where F is the binomial cumulative distribution function, k is the number of times the term appears in cluster j, N is the number of documents in cluster j, and p = (sum - k) / (total - N), with sum the total number of times the term appears in all clusters and total the total number of documents. The m descriptive terms are those with the highest binomial probabilities. Descriptive terms must have a keep status of Y and must occur at least twice (by default) in a cluster.
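As a hedged sketch of the scoring rule just described (the function and variable names are mine, not part of SAS Enterprise Miner, and the numbers in the usage example are invented), the binomial CDF can be computed with SciPy:

```python
from scipy.stats import binom

def descriptive_term_score(k, N, term_total, doc_total):
    """Binomial probability F(k | N, p) that a term describes a cluster.

    k          -- times the term appears in the cluster j
    N          -- number of documents in cluster j
    term_total -- times the term appears in all clusters ("sum" in the text)
    doc_total  -- total number of documents ("total" in the text)
    """
    p = (term_total - k) / (doc_total - N)   # p = (sum - k) / (total - N)
    return binom.cdf(k, N, p)

# Example: a term occurring 12 times in a 40-document cluster, but only 15 times
# overall in a 200-document collection, scores close to 1 -> a good descriptor.
print(round(descriptive_term_score(k=12, N=40, term_total=15, doc_total=200), 4))
```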

[Two further figure slides: Coursera, Text Mining and Analytics, ChengXiang Zhai]