Statistical Topic Modeling part 1


Statistical Topic Modeling part 1
Andrea Tagarelli, Univ. of Calabria, Italy

Statistical topic modeling (1/3)
Key assumption: text data is represented as a mixture of topics, i.e., probability distributions over terms.
Generative model for documents: document features are treated as being generated by latent variables.
Topic modeling vs. vector-space text modeling: topic models capture the (latent) semantic aspects underlying correlations between words, as well as the topical structure of documents.

Statistical topic modeling (2/3)
Training on a (large) corpus to learn:
Per-topic word distributions
Per-document topic distributions
(Both sets of distributions can be viewed as row-stochastic matrices; see the sketch below.)
[Blei, CACM, 2012]
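As a minimal sketch (hypothetical sizes and Dirichlet draws, not learned values), the two sets of distributions can be stored as matrices whose rows sum to one:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D = 3, 1000, 5   # number of topics, vocabulary size, number of documents

# Per-topic word distributions: K x V matrix, one distribution over terms per topic
topic_word = rng.dirichlet(np.full(V, 0.01), size=K)

# Per-document topic distributions: D x K matrix, one distribution over topics per document
doc_topic = rng.dirichlet(np.full(K, 0.1), size=D)

assert np.allclose(topic_word.sum(axis=1), 1.0)
assert np.allclose(doc_topic.sum(axis=1), 1.0)
```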

Statistical topic modeling (3/3)
Graphical "plate" notation: the standard representation for probabilistic generative models [Hofmann, SIGIR, 1999].
Rectangles (plates) represent repeated areas of the model; the number in the lower right corner of a plate denotes the number of times the enclosed variables are repeated.
Shaded and unshaded nodes indicate observed and unobserved (latent) variables, respectively.
A directed edge expresses the conditional dependency of the head node on the tail node.
(In the accompanying figure, both diagrams represent the same model in which M words are sampled from a distribution β; the plate-notation version on the right is more compact.)
More formally, a basic model works as follows (see the code sketch below):
Select a document d_j with probability P(d_j)
Pick a latent class (topic) z_k with probability P(z_k|d_j)
Generate a word w_i with probability P(w_i|z_k)
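A minimal sketch of this basic generative process (the probability tables and vocabulary below are hypothetical, hand-picked values rather than learned parameters):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters for 2 documents, 2 topics, and a 3-word vocabulary
P_d = np.array([0.5, 0.5])                    # P(d_j): document selection probabilities
P_z_given_d = np.array([[0.9, 0.1],           # P(z_k|d_j): one row per document
                        [0.2, 0.8]])
P_w_given_z = np.array([[0.7, 0.2, 0.1],      # P(w_i|z_k): one row per topic
                        [0.1, 0.3, 0.6]])
vocabulary = ["gene", "model", "data"]

def generate_token():
    """Generate one (document, word) pair following the basic generative process."""
    d = rng.choice(len(P_d), p=P_d)                          # select a document d_j
    z = rng.choice(P_z_given_d.shape[1], p=P_z_given_d[d])   # pick a latent topic z_k
    w = rng.choice(len(vocabulary), p=P_w_given_z[z])        # generate a word w_i
    return d, vocabulary[w]

print([generate_token() for _ in range(5)])
```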

Observed and latent variables
Observed variable: its current value is known.
Latent variable: a variable whose state cannot be observed.
Estimation problem: estimate the values of a set of distribution parameters that best explain a set of observations.
The most likely parameter values maximize the likelihood of the model; the full likelihood is impossible to compute exactly, so it is approximated by:
Expectation-Maximization (EM): an iterative method that estimates the unobserved, latent variables and re-estimates the parameters, repeated until a local optimum is obtained
Gibbs sampling: update parameters sample-wise
Variational inference: approximate the model by a simpler, tractable one

Probabilistic LSA
PLSA [Hofmann, 2001]: a probabilistic version of LSA, conceived to better handle problems of term polysemy.
(Plate diagram: document d, latent topic z, observed word w, with plates of size M word positions per document and N documents.)

PLSA training (1/2)
Joint probability model: P(d, w) = P(d) * sum_k P(z_k|d) P(w|z_k)
Likelihood: L = sum_d sum_w n(d, w) log P(d, w), with n(d, w) the frequency of w in d
Training: maximizing the likelihood function (see the sketch below)
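A minimal numpy sketch of this likelihood, assuming a document-word count matrix n_dw and the parameter matrices as arguments (the names are illustrative):

```python
import numpy as np

def plsa_log_likelihood(n_dw, P_d, P_z_given_d, P_w_given_z):
    """L = sum_d sum_w n(d, w) * log P(d, w),
    with P(d, w) = P(d) * sum_k P(z_k|d) * P(w|z_k)."""
    P_dw = P_d[:, None] * (P_z_given_d @ P_w_given_z)   # D x V matrix of P(d, w)
    mask = n_dw > 0                                      # only nonzero counts contribute
    return np.sum(n_dw[mask] * np.log(P_dw[mask]))
```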

PLSA training (2/2)
Training with EM:
Initialization: initial estimates of the per-topic word distributions P(w_i|z_k) and per-document topic distributions P(z_k|d_j)
E-step: posterior probabilities are computed for the latent variables z_k, based on the current estimates of the parameters
M-step: parameters are re-estimated in order to maximize the likelihood function
Iterating the E-step and M-step defines a converging procedure that approaches a local maximum of the likelihood (a code sketch of both steps follows).
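A minimal numpy sketch of the EM updates, assuming a document-word count matrix n_dw (the function name and random initialization are illustrative, not the slide's own code):

```python
import numpy as np

def plsa_em(n_dw, K, n_iter=50, seed=0):
    """EM for PLSA on a D x V document-word count matrix n_dw."""
    rng = np.random.default_rng(seed)
    D, V = n_dw.shape
    P_z_given_d = rng.dirichlet(np.ones(K), size=D)   # P(z_k|d_j), D x K
    P_w_given_z = rng.dirichlet(np.ones(V), size=K)   # P(w_i|z_k), K x V

    for _ in range(n_iter):
        # E-step: P(z_k|d_j, w_i) proportional to P(z_k|d_j) * P(w_i|z_k)
        post = P_z_given_d[:, :, None] * P_w_given_z[None, :, :]   # D x K x V
        post /= post.sum(axis=1, keepdims=True) + 1e-12

        # M-step: re-estimate parameters from the expected counts n(d, w) * P(z|d, w)
        expected = n_dw[:, None, :] * post                         # D x K x V
        P_w_given_z = expected.sum(axis=0)
        P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True)
        P_z_given_d = expected.sum(axis=2)
        P_z_given_d /= P_z_given_d.sum(axis=1, keepdims=True)

    return P_z_given_d, P_w_given_z
```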

Latent Dirichlet Allocation (1/2)
LDA [Blei et al., 2003] adds a Dirichlet prior on the per-document topic distribution.
Motivation: PLSA is not a proper generative model for new documents; moreover, its number of latent variables to learn grows linearly with the number of documents.
3-level scheme: corpus, documents, and terms. Terms are the only observed variables.
(Plate diagram: for each doc d_j in a collection of N docs, a per-document topic distribution; for each word position i in a doc of length M, a topic assignment and the word token at position i in doc d_j; the per-topic word distributions are shared across the corpus.)
A code sketch of the generative process follows.
[Moens and Vulic, Tutorial @WSDM 2014]
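A minimal sketch of LDA's generative process, with hypothetical sizes and hyperparameter values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sizes and hyperparameter values
K, V, N, M = 3, 50, 4, 20      # topics, vocabulary size, number of docs, words per doc
alpha = np.full(K, 0.1)        # Dirichlet prior on per-document topic distributions
eta = np.full(V, 0.01)         # Dirichlet prior on per-topic word distributions

beta = rng.dirichlet(eta, size=K)        # per-topic word distributions, K x V

docs = []
for _ in range(N):                       # for each doc in a collection of N docs
    theta = rng.dirichlet(alpha)         # per-document topic distribution, theta ~ Dir(alpha)
    words = []
    for _ in range(M):                   # for each word position in a doc of length M
        z = rng.choice(K, p=theta)       # topic assignment for this position
        w = rng.choice(V, p=beta[z])     # word token drawn from topic z
        words.append(w)
    docs.append(words)
```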

Latent Dirichlet Allocation (2/2)
Meaning of the Dirichlet priors: θ ~ Dir(α_1, …, α_K), where each α_k is a prior observation count for the number of times a topic z_k is sampled in a document prior to any word observations. Analogously for the η_i, with β ~ Dir(η_1, …, η_V).
Inference for a new document: given α, β, η, infer θ.
The exact inference problem is intractable; training is performed through approximate inference (a library usage sketch follows):
Gibbs sampling
Variational inference
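In practice, an off-the-shelf implementation is typically used. A minimal usage sketch with gensim's LdaModel, which trains via online variational Bayes (the toy corpus below is invented purely for illustration):

```python
from gensim import corpora, models

# Toy tokenized corpus (hypothetical documents)
texts = [
    ["gene", "dna", "genetic", "dna"],
    ["brain", "neuron", "nerve", "brain"],
    ["data", "model", "topic", "model"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# LDA with variational inference; alpha and eta are the Dirichlet priors
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      alpha="auto", eta="auto", passes=20, random_state=0)

print(lda.print_topics())                  # per-topic word distributions
print(lda.get_document_topics(corpus[0]))  # per-document topic distribution (theta)
```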