CSC 594 Topics in AI – Text Mining and Analytics, Fall 2015/16
7. Topic Extraction

Text Topics
Word association, represented by concept links, is useful for understanding the relationships between terms (as concepts). The same idea can be applied to understand the association between documents and a topic.

Problems with “Term as Topic”
Using a single term to define a topic is problematic:
– Lack of expressive power: a single term can only represent simple topics, not complicated ones.
– Incomplete vocabulary coverage: a single term cannot capture variations in vocabulary (e.g., related terms).
– Word ambiguity: many words have more than one meaning/sense.

Multiple Terms as Topic
A solution is to use multiple terms to define a topic (a small sketch follows the list):
– Topic = {word1, word2, ...}
– A weight assigned to each term represents the importance/relevance of that term in the topic.
– Every document in the corpus can be given a score that represents the strength of its association with a topic.
– A document can contain zero, one, or many topics.
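
As a concrete illustration of this representation (the topic, its weights, and the scoring function below are invented for illustration, not taken from the slides), a topic can be stored as a mapping from words to weights, and a document scored by summing the weights of the topic terms it contains:

```python
# Hypothetical example: a topic as a set of weighted terms, plus a
# simple score measuring a document's association with the topic.
sports_topic = {"soccer": 0.9, "play": 0.6, "game": 0.5, "team": 0.4}

def topic_score(doc_tokens, topic):
    """Sum the topic weights of the terms that appear in the document."""
    return sum(topic.get(tok, 0.0) for tok in doc_tokens)

doc = "kids love to play soccer".split()
print(topic_score(doc, sports_topic))  # 0.6 + 0.9 = 1.5
```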

Approach (1): Probabilistic Topic Mining
[Slide figure: Coursera, Text Mining and Analytics, ChengXiang Zhai]

Topic as Word Distribution
[Slide figure: Coursera, Text Mining and Analytics, ChengXiang Zhai]
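
The figure itself is not reproduced above, but the idea it illustrates is standard: a topic θ is a probability distribution over words, p(w | θ), whose probabilities sum to 1. A minimal sketch (the distribution is invented):

```python
# A topic as a unigram word distribution p(w | theta).
# The probabilities are invented and must sum to 1.
theta_sports = {"soccer": 0.30, "play": 0.20, "game": 0.15,
                "team": 0.15, "kids": 0.10, "love": 0.10}
assert abs(sum(theta_sports.values()) - 1.0) < 1e-9
print(theta_sports["soccer"])  # p("soccer" | theta_sports) = 0.3
```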

Probabilistic Topic Mining
[Slide figure: Coursera, Text Mining and Analytics, ChengXiang Zhai]

Techniques for Probabilistic Topic Mining
Several techniques have been used in probabilistic topic mining to extract topics:
– Maximum likelihood estimation (sketched below)
– Bayesian estimation
– Mixture models (whose parameters are typically estimated with the Expectation-Maximization (EM) algorithm)
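
For the maximum-likelihood case, the estimate of a unigram word distribution from observed text is simply the normalized word counts. A minimal sketch, with a toy document:

```python
from collections import Counter

def mle_unigram(tokens):
    """Maximum-likelihood estimate of p(w | theta): normalized word counts."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(mle_unigram("text mining text analytics".split()))
# -> {'text': 0.5, 'mining': 0.25, 'analytics': 0.25}
```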

Mixture Model for Topic Extraction (1)
[Slide figure: Coursera, Text Mining and Analytics, ChengXiang Zhai]

Mixture Model for Topic Extraction (2)
[Slide figure: Coursera, Text Mining and Analytics, ChengXiang Zhai]

Mixture Model as a Generative Model
[Slide figure: Coursera, Text Mining and Analytics, ChengXiang Zhai]

Mixture of Two Unigram Language Models
[Slide figure: Coursera, Text Mining and Analytics, ChengXiang Zhai]
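
The slide's figure is omitted, but the model it presents is standard: each word is drawn either from a background language model θ_B (with probability p(θ_B)) or from a topic/document model θ_d, so p(w) = p(θ_B) p(w | θ_B) + (1 - p(θ_B)) p(w | θ_d). A minimal sketch of the document log-likelihood under this mixture, with invented distributions:

```python
import math

def mixture_log_likelihood(tokens, theta_d, theta_B, p_B=0.5):
    """log p(doc) under the two-component mixture:
    p(w) = p_B * p(w|theta_B) + (1 - p_B) * p(w|theta_d)."""
    ll = 0.0
    for w in tokens:
        p_w = p_B * theta_B.get(w, 1e-12) + (1 - p_B) * theta_d.get(w, 1e-12)
        ll += math.log(p_w)
    return ll

# Invented toy distributions:
theta_B = {"the": 0.5, "a": 0.3, "text": 0.1, "mining": 0.1}
theta_d = {"text": 0.5, "mining": 0.5}
print(mixture_log_likelihood("the text mining".split(), theta_d, theta_B))
```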

[Three figure-only slides: Coursera, Text Mining and Analytics, ChengXiang Zhai]

Expectation-Maximization (EM) Algorithm
[Slide figure: Coursera, Text Mining and Analytics, ChengXiang Zhai]
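
The algorithm's steps are on the omitted figures; the sketch below shows the standard EM updates for the two-component mixture above, estimating p(w | θ_d) while holding the background model and the mixing weight fixed (a common setup for this model). All data are invented:

```python
from collections import Counter

def em_estimate_topic(tokens, theta_B, p_B=0.5, iters=50):
    """EM for p(w|theta_d) in the mixture
    p(w) = p_B * p(w|theta_B) + (1 - p_B) * p(w|theta_d),
    with the background model theta_B and mixing weight p_B held fixed."""
    counts = Counter(tokens)
    vocab = list(counts)
    theta_d = {w: 1.0 / len(vocab) for w in vocab}  # uniform initialization
    for _ in range(iters):
        # E-step: probability that each word occurrence came from theta_d
        z_d = {}
        for w in vocab:
            p_from_d = (1 - p_B) * theta_d[w]
            z_d[w] = p_from_d / (p_from_d + p_B * theta_B.get(w, 1e-12))
        # M-step: re-estimate theta_d from the expected counts
        expected = {w: counts[w] * z_d[w] for w in vocab}
        total = sum(expected.values())
        theta_d = {w: e / total for w, e in expected.items()}
    return theta_d

theta_B = {"the": 0.5, "a": 0.3, "text": 0.1, "mining": 0.1}
print(em_estimate_topic("the text mining the text a".split(), theta_B))
```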

[Figure-only slide: Coursera, Text Mining and Analytics, ChengXiang Zhai]

Approach (2): Dimensionality Reduction for Topic Extraction
Reduced dimensions can also be considered topics: Singular Value Decomposition derives eigenvectors (SVD dimensions / principal components), which can be interpreted as topics. Example documents (used in the sketch below):
D1: “I love iPad.”
D2: “iPad is great for kids.”
D3: “Kids love to play soccer.”
D4: “I play soccer at OSU.”
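
A minimal sketch of this on the four example documents, using NumPy's SVD on a raw term-document count matrix (the lowercased tokenization and the inspection of top-loading terms are illustrative choices, not from the slides):

```python
import numpy as np

docs = ["i love ipad", "ipad is great for kids",
        "kids love to play soccer", "i play soccer at osu"]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document matrix of raw counts (rows = terms, columns = documents)
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Columns of U are the SVD dimensions ("topics") in term space
U, s, Vt = np.linalg.svd(A, full_matrices=False)
for k in range(2):  # the two strongest dimensions
    top_terms = np.argsort(-np.abs(U[:, k]))[:3]
    print(f"dimension {k}: " + ", ".join(vocab[i] for i in top_terms))
```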

Example: Topics extracted by SAS Enterprise Miner for the Yelp data
[Slide figure not reproduced]

Term topic weight – the relevance of a term in a topic
Each term is assigned a weight corresponding to each topic. Since each topic is an SVD dimension, the term topic weights of a term are the coordinates of that term in the SVD space. The Term cutoff is used to determine whether a term belongs to a topic.

Document topic weight – the relevance of a document to a topic
Every document in the corpus is assigned a weight corresponding to each topic. The document topic weight of a document for a topic is the normalized sum of the TF-IDF weights of each term in the document multiplied by their term topic weights. The Document cutoff is used to determine whether a document belongs to a topic (see the sketch below).
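
A rough sketch of the document topic weight computation as described above (the TF-IDF values, term topic weights, and cutoff are invented, and the exact normalization used by SAS Enterprise Miner may differ):

```python
import math

def document_topic_weight(doc_terms, tfidf, term_topic_weight):
    """Sum of each term's TF-IDF weight times its term topic weight,
    normalized here by the document's TF-IDF vector length."""
    raw = sum(tfidf.get(t, 0.0) * term_topic_weight.get(t, 0.0)
              for t in doc_terms)
    norm = math.sqrt(sum(tfidf.get(t, 0.0) ** 2 for t in doc_terms))
    return raw / norm if norm else 0.0

# Invented example values:
tfidf = {"soccer": 2.1, "play": 1.3, "ipad": 1.8}
term_topic_weight = {"soccer": 0.8, "play": 0.6, "ipad": 0.05}
DOCUMENT_CUTOFF = 0.5  # hypothetical cutoff

score = document_topic_weight(["soccer", "play"], tfidf, term_topic_weight)
print(score, score >= DOCUMENT_CUTOFF)
```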

Interpretability of Extracted Topics
A topic expressed as a collection of weighted terms provides precise information about the topic. However, some analysts find binary topics (where a term or document simply belongs to a topic or not, as determined by the cutoffs) easier to understand.