Topic Models in Text Processing

Slides:

Advertisements

Similar presentations

Topic models Source: Topic models, David Blei, MLSS 09.

Advertisements

Weakly supervised learning of MRF models for image region labeling Jakob Verbeek LEAR team, INRIA Rhône-Alpes.

Basics of Statistical Estimation

Language Models Naama Kraus (Modified by Amit Gross) Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze.

Information retrieval – LSI, pLSI and LDA

An Introduction to LDA Tools Kuan-Yu Chen Institute of Information Science, Academia Sinica.

Probabilistic Clustering-Projection Model for Discrete Data

Statistical Topic Modeling part 1

Unsupervised and Weakly-Supervised Probabilistic Modeling of Text Ivan Titov April TexPoint fonts used in EMF. Read the TexPoint manual before.

Visual Recognition Tutorial

Latent Dirichlet Allocation a generative model for text

CSE 221: Probabilistic Analysis of Computer Systems Topics covered: Statistical inference (Sec. )

Formal Multinomial and Multiple- Bernoulli Language Models Don Metzler.

Generative learning methods for bags of features

British Museum Library, London Picture Courtesy: flickr.

Multiscale Topic Tomography Ramesh Nallapati, William Cohen, Susan Ditmore, John Lafferty & Kin Ung (Johnson and Johnson Group)

Visual Recognition Tutorial

Topic models for corpora and for graphs. Motivation Social graphs seem to have –some aspects of randomness small diameter, giant connected components,..

Language Modeling Approaches for Information Retrieval Rong Jin.

Introduction to Machine Learning for Information Retrieval Xiaolong Wang.

Correlated Topic Models By Blei and Lafferty (NIPS 2005) Presented by Chunping Wang ECE, Duke University August 4 th, 2006.

Example 16,000 documents 100 topic Picked those with large p(w|z)

Topic Models in Text Processing IR Group Meeting Presented by Qiaozhu Mei.

Bayesian Extension to the Language Model for Ad Hoc Information Retrieval Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping Presented by Chen Yi-Ting.

Topic Modelling: Beyond Bag of Words By Hanna M. Wallach ICML 2006 Presented by Eric Wang, April 25 th 2008.

Probabilistic Topic Models

High-Dimensional Unsupervised Selection and Estimation of a Finite Generalized Dirichlet Mixture model Based on Minimum Message Length by Nizar Bouguila.

Eric Xing © Eric CMU, Machine Learning Latent Aspect Models Eric Xing Lecture 14, August 15, 2010 Reading: see class homepage.

Integrating Topics and Syntax -Thomas L

An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing Narges Sharif-Razavian and Andreas Zollmann.

A Model for Learning the Semantics of Pictures V. Lavrenko, R. Manmatha, J. Jeon Center for Intelligent Information Retrieval Computer Science Department,

Latent Dirichlet Allocation D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3: , January Jonathan Huang

Probabilistic Models for Discovering E-Communities Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, Hongyuan Zha The Pennsylvania State University WWW.

Introduction to LDA Jinyang Gao. Outline Bayesian Analysis Dirichlet Distribution Evolution of Topic Model Gibbs Sampling Intuition Analysis of Parameter.

Topic Models Presented by Iulian Pruteanu Friday, July 28 th, 2006.

An Introduction to Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation

Discovering Objects and their Location in Images Josef Sivic 1, Bryan C. Russell 2, Alexei A. Efros 3, Andrew Zisserman 1 and William T. Freeman 2 Goal:

CS246 Latent Dirichlet Analysis. LSI  LSI uses SVD to find the best rank-K approximation  The result is difficult to interpret especially with negative.

Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining Qiaozhu Mei and ChengXiang Zhai Department of Computer Science.

Automatic Labeling of Multinomial Topic Models

Web-Mining Agents Topic Analysis: pLSI and LDA

Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai DAIS The Database and Information Systems Laboratory.

Text-classification using Latent Dirichlet Allocation - intro graphical model Lei Li

Probabilistic Topic Models Hongning Wang Outline 1.General idea of topic models 2.Basic topic models -Probabilistic Latent Semantic Analysis (pLSA)

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation Yee W. Teh, David Newman and Max Welling Published on NIPS 2006 Discussion.

Unsupervised Learning Part 2. Topics How to determine the K in K-means? Hierarchical clustering Soft clustering with Gaussian mixture models Expectation-Maximization.

Hierarchical Clustering & Topic Models

Topic Modeling for Short Texts with Auxiliary Word Embeddings

B. Freeman, Tomasz Malisiewicz, Tom Landauer and Peter Foltz,

The topic discovery models

Parameter Estimation 主講人：虞台文.

The topic discovery models

Language Models for Information Retrieval

Probabilistic Topic Models.

Latent Dirichlet Analysis

Probabilistic Models with Latent Variables

The topic discovery models

Topic Modeling Nick Jordan.

Bayesian Inference for Mixture Language Models

Stochastic Optimization Maximization for Latent Variable Models

Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models

Topic models for corpora and for graphs

Michal Rosen-Zvi University of California, Irvine

Latent Dirichlet Allocation

CS246: Latent Dirichlet Analysis

Junghoo “John” Cho UCLA

Topic models for corpora and for graphs

CS590I: Information Retrieval

Language Models for TR Rong Jin

Presentation transcript:

Topic Models in Text Processing IR Group Meeting Presented by Qiaozhu Mei

Topic Models in Text Processing Introduction Basic Topic Models Probabilistic Latent Semantic Indexing Latent Dirichlet Allocation Correlated Topic Models Variations and Applications

Overview Motivation: Basic Assumptions: Applications Model the topic/subtopics in text collections Basic Assumptions: There are k topics in the whole collection Each topic is represented by a multinomial distribution over the vocabulary (language model) Each document can cover multiple topics Applications Summarizing topics Predict topic coverage for documents Model the topic correlations

Basic Topic Models Unigram model Mixture of unigrams Probabilistic LSI LDA Correlated Topic Models

w: a document in the collection Unigram Model There is only one topic in the collection Estimation: Maximum likelihood estimation Application: LMIR, N: Number of words in w w: a document in the collection wn: a word in w M: Number of Documents

Mixture of unigrams Estimation: MLE and EM Algorithm There is k topics in the collection, but each document only cover one topic Estimation: MLE and EM Algorithm Application: simple document clustering

Probabilistic Latent Semantic Indexing (Thomas Hofmann ’99): each document is generated from more than one topics, with a set of document specific mixture weights {p(z|d)} over k topics. These mixture weights are considered as fixed parameters to be estimated. Also known as aspect model. No prior knowledge about topics required, context and term co-occurrences are exploited

PLSI (cont.) Assume uniform p(d) Parameter Estimation: d: document Assume uniform p(d) Parameter Estimation: π: {p(z|d)}; : {p(w|z)} Maximizing log-likelihood using EM algorithm

Problem of PLSI Mixture weights are considered as document specific, thus no natural way to assign probability to a previously unseen document. Number of parameters to be estimated grows with size of training set, thus overfits data, and suffers from multiple local maxima. Not a fully generative model of documents.

Latent Dirichlet Allocation (Blei et al ‘03) Treats the topic mixture weights as a k-parameter hidden random variable (a multinomial) and places a Dirichlet prior on the multinomial mixing weights. This is sampled once per document. The weights for word multinomial distributions are still considered as fixed parameters to be estimated. For a fuller Bayesian approach, can place a Dirichlet prior to these word multinomial distributions to smooth the probabilities. (like Dirichlet smoothing)

Dirichlet Distribution Bayesian Theory: Everything is not fixed, everything is random. Choose prior for the multinomial distribution of the topic mixture weights: Dirichlet distribution is conjugate to the multinomial distribution, which is natural to choose. A k-dimensional Dirichlet random variable  can take values in the (k-1)-simplex, and has the following probability density on this simplex:

The Graph Model of LDA

Generating Process of LDA Choose For each of the N words : Choose a topic Choose a word from , a multinomial probability conditioned on the topic . β is a k by n matrix parameterized with the word probabilities.

Inference and Parameter Estimation Parameters: Corpus level: α, β: sampled once in the corpus creating generative model. Document level: , z: sampled once to generate a document Inference: estimation of document-level parameters. However, it is intractable to compute, which needs approximate inference.