Topic Models in Text Processing


Topic Models in Text Processing
IR Group Meeting
Presented by Qiaozhu Mei

Topic Models in Text Processing
- Introduction
- Basic Topic Models
- Probabilistic Latent Semantic Indexing
- Latent Dirichlet Allocation
- Correlated Topic Models
- Variations and Applications

Overview
Motivation:
- Model the topics/subtopics in text collections
Basic Assumptions:
- There are k topics in the whole collection
- Each topic is represented by a multinomial distribution over the vocabulary (a language model)
- Each document can cover multiple topics
Applications:
- Summarizing topics
- Predicting topic coverage for documents
- Modeling the topic correlations

Basic Topic Models
- Unigram model
- Mixture of unigrams
- Probabilistic LSI
- LDA
- Correlated Topic Models

Unigram Model
- There is only one topic in the whole collection: every document is generated by repeatedly sampling words from a single multinomial, p(w) = ∏_{n=1}^{N} p(w_n).
- Notation: w: a document in the collection; w_n: a word in w; N: number of words in w; M: number of documents.
- Estimation: maximum likelihood estimation.
- Application: language modeling for IR (LMIR).
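
A minimal sketch of the maximum likelihood estimate for the unigram model (the toy corpus and whitespace tokenization are illustrative assumptions, not from the slides):

```python
from collections import Counter

# Toy corpus: under the unigram model, all documents share a single
# topic, i.e. one multinomial p(w) over the whole vocabulary.
docs = [
    "topic models discover topics in text",
    "text collections contain many topics",
]

counts = Counter(w for d in docs for w in d.split())
total = sum(counts.values())

# Maximum likelihood estimate: p(w) = count(w) / total number of words.
p_w = {w: c / total for w, c in counts.items()}
print(p_w["topics"])
```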

Mixture of Unigrams
- There are k topics in the collection, but each document covers only one topic.
- Estimation: MLE via the EM algorithm.
- Application: simple document clustering.
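
A compact numpy sketch of EM for the mixture of unigrams, assuming documents are given as a term-count matrix (the toy dimensions, random initialization, and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# counts[d, w]: word counts of document d (toy data: M docs, V words, k topics).
M, V, k = 6, 8, 2
counts = rng.integers(0, 5, size=(M, V)).astype(float)

# Random initialization of p(z) and p(w|z).
p_z = np.full(k, 1.0 / k)
p_w_z = rng.dirichlet(np.ones(V), size=k)            # shape (k, V)

for _ in range(50):
    # E-step: responsibility of topic z for document d, computed in log space.
    log_post = np.log(p_z) + counts @ np.log(p_w_z).T   # shape (M, k)
    log_post -= log_post.max(axis=1, keepdims=True)
    resp = np.exp(log_post)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate p(z) and p(w|z) from expected counts.
    p_z = resp.mean(axis=0)
    p_w_z = resp.T @ counts + 1e-10                      # small constant avoids log(0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

print(p_z)
```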

Probabilistic Latent Semantic Indexing (Thomas Hofmann '99)
- Each document is generated from more than one topic, with a set of document-specific mixture weights {p(z|d)} over the k topics.
- These mixture weights are treated as fixed parameters to be estimated.
- Also known as the aspect model.
- No prior knowledge about topics is required; context and term co-occurrences are exploited.

PLSI (cont.)
- Model: p(d, w) = p(d) Σ_z p(z|d) p(w|z), where d is a document in the collection; p(d) is assumed uniform.
- Parameters to estimate: π = {p(z|d)}, the document-specific mixing weights, and the topic word distributions {p(w|z)}.
- Estimation: maximize the log-likelihood using the EM algorithm.
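
For concreteness, a small numpy sketch of these EM updates on a term-document count matrix (toy sizes, random initialization, and the smoothing constant are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# n[d, w]: term-document counts; p(d) is taken as uniform as on the slide.
M, V, k = 6, 8, 2
n = rng.integers(0, 5, size=(M, V)).astype(float)

p_z_d = rng.dirichlet(np.ones(k), size=M)      # pi: p(z|d), shape (M, k)
p_w_z = rng.dirichlet(np.ones(V), size=k)      # p(w|z),     shape (k, V)

for _ in range(100):
    # E-step: posterior p(z|d,w) proportional to p(z|d) * p(w|z).
    post = p_z_d[:, :, None] * p_w_z[None, :, :]         # shape (M, k, V)
    post /= post.sum(axis=1, keepdims=True) + 1e-12

    # M-step: expected counts re-estimate both parameter sets.
    weighted = n[:, None, :] * post                       # shape (M, k, V)
    p_w_z = weighted.sum(axis=0) + 1e-12
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = weighted.sum(axis=2) + 1e-12
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

print(p_z_d.round(2))
```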

Problems of PLSI
- The mixture weights are document-specific, so there is no natural way to assign a probability to a previously unseen document.
- The number of parameters to be estimated grows with the size of the training set, so the model overfits the data and suffers from multiple local maxima.
- It is not a fully generative model of documents.

Latent Dirichlet Allocation (Blei et al '03)
- Treats the topic mixture weights as a k-parameter hidden random variable (a multinomial) and places a Dirichlet prior on these mixing weights; they are sampled once per document.
- The word multinomial distributions are still treated as fixed parameters to be estimated.
- For a fuller Bayesian approach, a Dirichlet prior can also be placed on these word multinomial distributions to smooth the probabilities (similar to Dirichlet smoothing).

Dirichlet Distribution
- Bayesian view: parameters are not fixed quantities; they are themselves random variables.
- We need a prior for the multinomial distribution of the topic mixture weights. The Dirichlet distribution is conjugate to the multinomial distribution, which makes it the natural choice.
- A k-dimensional Dirichlet random variable θ takes values in the (k-1)-simplex and has probability density
  p(θ | α) = Γ(Σ_i α_i) / (∏_i Γ(α_i)) · θ_1^(α_1 − 1) · … · θ_k^(α_k − 1)
  on this simplex.
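
A short numpy illustration of how the Dirichlet parameter α shapes the sampled topic mixtures (the particular α values are arbitrary, chosen only to contrast sparse and dense mixtures):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5

# Each draw is a point on the (k-1)-simplex: non-negative and summing to 1,
# so it is a valid topic mixture.
sparse_theta = rng.dirichlet(np.full(k, 0.1))   # alpha < 1: mass concentrates on few topics
dense_theta = rng.dirichlet(np.full(k, 10.0))   # alpha > 1: mixtures close to uniform

print(sparse_theta.round(3), sparse_theta.sum())
print(dense_theta.round(3), dense_theta.sum())
```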

The Graphical Model of LDA

Generative Process of LDA
- Choose θ ~ Dirichlet(α).
- For each of the N words w_n:
  - Choose a topic z_n ~ Multinomial(θ).
  - Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
- β is a k × V matrix parameterized with the word probabilities (one multinomial over the vocabulary per topic, V being the vocabulary size).
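
A minimal sketch of this generative process (toy sizes, a randomly drawn β held fixed, and the variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

k, V, N = 3, 10, 20                      # topics, vocabulary size, words per document
alpha = np.full(k, 0.5)                  # Dirichlet prior on topic mixtures
beta = rng.dirichlet(np.ones(V), size=k) # k x V topic-word probabilities (fixed here)

def generate_document():
    theta = rng.dirichlet(alpha)         # 1. choose theta ~ Dirichlet(alpha), once per document
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)       # 2a. choose a topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])     # 2b. choose a word w_n ~ p(w | z_n, beta)
        words.append(w)
    return words

print(generate_document())
```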

Inference and Parameter Estimation
- Parameters:
  - Corpus level: α, β — sampled once in the process of generating the whole corpus.
  - Document level: θ, z — sampled anew for each generated document.
- Inference: estimating the document-level variables given a document. The posterior is intractable to compute exactly, so approximate inference is needed.
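
As a practical illustration (not part of the slides), scikit-learn's LatentDirichletAllocation performs approximate inference via online variational Bayes; a minimal usage sketch on a made-up toy corpus:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "topic models discover latent topics in text collections",
    "dirichlet priors smooth multinomial word distributions",
    "documents cover multiple topics with different weights",
]

# Bag-of-words counts, the standard input representation for LDA.
X = CountVectorizer().fit_transform(docs)

# Variational Bayes approximation to the intractable LDA posterior.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)        # per-document topic proportions (theta)
print(doc_topics.round(2))
```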