 # Statistical Topic Modeling part 1

Andrea Tagarelli, Univ. of Calabria, Italy

Statistical topic modeling (1/3)
- Key assumption: text data are represented as a mixture of topics, i.e., probability distributions over terms
- Generative model for documents: document features are modeled as being generated by latent variables
- Topic modeling vs. vector-space text modeling:
  - (latent) semantic aspects underlying correlations between words
  - document topical structure

Statistical topic modeling (2/3)
Training on a (large) corpus to learn:
- per-topic word distributions
- per-document topic distributions
[Blei, CACM, 2012]

Statistical topic modeling (3/3)
Plate notation is a standard graphical representation for probabilistic generative models [Hofmann, SIGIR, 1999]:
- Rectangles (plates) represent repeated parts of the model; the number in the lower-right corner of a plate denotes how many times the enclosed variables are repeated
- Shaded and unshaded nodes indicate observed and unobserved (latent) variables, respectively
- A directed edge expresses the conditional dependency of the head node on the tail node
A model drawn with and without plates represents the same thing (e.g., M words sampled from a distribution β); the plate version is more compact.
More formally, a basic generative model works as follows:
1. Select a document d_j with probability P(d_j)
2. Pick a latent class (topic) z_k with probability P(z_k | d_j)
3. Generate a word w_i with probability P(w_i | z_k)
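The three-step generative process on this slide can be sketched directly as a sampling procedure. All distributions below are made-up toy values for illustration, not parameters learned from data:

```python
import random

random.seed(0)

# Toy distributions (illustrative values, not learned from a corpus):
# P(d): prior over two documents
p_d = {"d1": 0.5, "d2": 0.5}
# P(z|d): per-document topic distributions
p_z_given_d = {"d1": {"sports": 0.8, "politics": 0.2},
               "d2": {"sports": 0.1, "politics": 0.9}}
# P(w|z): per-topic word distributions
p_w_given_z = {"sports":   {"game": 0.6, "vote": 0.1, "team": 0.3},
               "politics": {"game": 0.1, "vote": 0.7, "team": 0.2}}

def sample(dist):
    """Draw one outcome from a {value: probability} dict."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point rounding

def generate_word():
    d = sample(p_d)             # 1. select a document d_j with P(d_j)
    z = sample(p_z_given_d[d])  # 2. pick a latent topic z_k with P(z_k|d_j)
    w = sample(p_w_given_z[z])  # 3. generate a word w_i with P(w_i|z_k)
    return d, z, w

print(generate_word())
```

Repeating `generate_word()` many times produces a corpus whose word co-occurrences reflect the hidden topic structure.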

Observed and latent variables
- Observed variable: its current value is known
- Latent variable: a variable whose state cannot be observed directly
- Estimation problem: estimate the values of a set of distribution parameters that best explain a set of observations
  - The most likely parameter values yield the maximum likelihood of the model
- The full likelihood is usually impossible to calculate exactly; it is approximated by:
  - Expectation-Maximization (EM): an iterative method to estimate the probabilities of the unobserved, latent variables, repeated until a local optimum is reached
  - Gibbs sampling: update parameters sample-wise
  - Variational inference: approximate the model by a simpler, tractable one
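As a concrete illustration of EM on a latent-variable problem, here is the classic two-coin mixture (the data and initial guesses below are made up): each session of 10 flips comes from one of two coins of unknown bias, and which coin produced each session is the latent variable.

```python
# Observed data: number of heads in sessions of 10 coin flips each.
# Which of two coins (A or B) produced each session is latent.
n_flips = 10
heads = [5, 9, 8, 4, 7]

def em(p_a, p_b, iters=50):
    """EM for a two-coin mixture: estimate the coin biases p_a, p_b."""
    for _ in range(iters):
        # E-step: posterior probability that each session came from coin A
        ha = hb = ta = tb = 0.0
        for h in heads:
            like_a = p_a ** h * (1 - p_a) ** (n_flips - h)
            like_b = p_b ** h * (1 - p_b) ** (n_flips - h)
            w_a = like_a / (like_a + like_b)   # responsibility of coin A
            w_b = 1.0 - w_a
            ha += w_a * h; ta += w_a * (n_flips - h)
            hb += w_b * h; tb += w_b * (n_flips - h)
        # M-step: re-estimate the biases from the expected counts
        p_a = ha / (ha + ta)
        p_b = hb / (hb + tb)
    return p_a, p_b

p_a, p_b = em(0.6, 0.5)
```

Each iteration can only increase the likelihood, so the procedure converges to a local optimum; the two estimated biases separate the heads-heavy sessions from the rest.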

Probabilistic LSA PLSA [Hofmann, 2001]
Probabilistic version of LSA, conceived to better handle problems of term polysemy.
[Plate diagram: document d, latent topic z, word w; plates of size M (words) and N (documents)]

PLSA training (1/2)
Joint probability model; likelihood, with n(d, w) being the frequency of w in d
Training: maximizing the likelihood function
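The formulas on this slide were lost in transcription; in Hofmann's standard formulation, the joint model and the log-likelihood are:

```latex
% Joint probability model (PLSA)
P(d_j, w_i) \;=\; P(d_j) \sum_{k=1}^{K} P(z_k \mid d_j)\, P(w_i \mid z_k)

% Log-likelihood over the corpus, with n(d_j, w_i) the frequency of w_i in d_j
\mathcal{L} \;=\; \sum_{j=1}^{N} \sum_{i=1}^{M} n(d_j, w_i) \,\log P(d_j, w_i)
```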

PLSA training (2/2)
Training with EM:
- Initialization: initial estimates of the parameters P(w_i | z_k) and P(z_k | d_j)
- E-step: posterior probabilities are computed for the latent variables z_k, based on the current estimates of the parameters
- M-step: the parameters are re-estimated in order to maximize the likelihood function
- Iterating the E-step and M-step defines a converging procedure that approaches a local maximum of the likelihood
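A minimal PLSA EM loop on a made-up term-document count matrix (the toy counts and the number of topics are assumptions for illustration):

```python
import random

random.seed(1)

# Toy term-document counts n(d, w): rows = documents, columns = words
n_dw = [[3, 2, 0, 0],
        [2, 3, 1, 0],
        [0, 0, 3, 2],
        [0, 1, 2, 3]]
D, W, K = len(n_dw), len(n_dw[0]), 2  # documents, vocabulary size, topics

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Initialization: random estimates of P(z|d) and P(w|z)
p_z_d = [normalize([random.random() for _ in range(K)]) for _ in range(D)]
p_w_z = [normalize([random.random() for _ in range(W)]) for _ in range(K)]

for _ in range(100):
    # E-step: posterior P(z|d,w) proportional to P(z|d) * P(w|z)
    post = [[normalize([p_z_d[d][k] * p_w_z[k][w] for k in range(K)])
             for w in range(W)] for d in range(D)]
    # M-step: re-estimate the parameters from the expected counts
    for k in range(K):
        p_w_z[k] = normalize([sum(n_dw[d][w] * post[d][w][k] for d in range(D))
                              for w in range(W)])
    for d in range(D):
        p_z_d[d] = normalize([sum(n_dw[d][w] * post[d][w][k] for w in range(W))
                              for k in range(K)])
```

After convergence, `p_w_z` holds the per-topic word distributions and `p_z_d` the per-document topic distributions; on this toy matrix the two topics tend to separate the first two documents from the last two.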

Latent Dirichlet Allocation (1/2)
LDA [Blei et al., 2003]
- Adds a Dirichlet prior on the per-document topic distribution
- Motivation: PLSA is not a proper generative model for new documents; moreover, the number of latent variables to learn grows linearly with the number of documents
- 3-level scheme: corpus, documents, and terms; terms are the only observed variables
[Plate diagram: per-document topic distribution; topic assignment to the word at position i in doc d_j; word token at position i in doc d_j; per-topic word distribution; inner plate repeated for each word position in a doc of length M, outer plate for each doc in a collection of N docs]
[Moens and Vulic, 2014]
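The LDA generative process can be sketched by sampling the Dirichlet-distributed parameters and then the words; the sizes and hyperparameter values below are illustrative assumptions:

```python
import random

random.seed(2)

def dirichlet(alphas):
    """Sample from Dir(alphas) via normalized Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def categorical(probs):
    """Draw an index according to a probability vector."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Illustrative sizes: K topics, V vocabulary words, N docs of M words each
K, V, N, M = 2, 5, 3, 8
alpha, eta = [0.5] * K, [0.5] * V

# Per-topic word distributions: beta_k ~ Dir(eta)
beta = [dirichlet(eta) for _ in range(K)]

corpus = []
for _ in range(N):                # for each document in the collection
    theta = dirichlet(alpha)      # per-document topic distribution ~ Dir(alpha)
    doc = []
    for _ in range(M):            # for each word position in the document
        z = categorical(theta)    # topic assignment for this position
        w = categorical(beta[z])  # word token drawn from P(w | z)
        doc.append(w)
    corpus.append(doc)
```

Only the word tokens in `corpus` are observed; `theta`, `z`, and `beta` are the latent quantities that inference must recover.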

Latent Dirichlet Allocation (2/2)
Meaning of the Dirichlet priors:
- θ ~ Dir(α_1, …, α_K): each α_k is a prior observation count for the number of times a topic z_k is sampled in a document, prior to any word observations
- Analogously for the η_i, with β ~ Dir(η_1, …, η_V)
Inference for a new document: given α, β, η, infer θ
The exact inference problem is intractable, so training is done through:
- Gibbs sampling
- Variational inference
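One common Gibbs-sampling scheme for LDA is collapsed Gibbs sampling, which integrates out θ and β and resamples each word's topic from its full conditional. A minimal sketch on a made-up toy corpus (the corpus, K, and hyperparameter values are illustrative assumptions):

```python
import random

random.seed(3)

# Toy corpus: documents as lists of word ids; V = vocabulary size
docs = [[0, 0, 1, 2], [0, 1, 1, 2], [3, 4, 4, 3], [3, 3, 4, 2]]
V, K = 5, 2
alpha, eta = 0.5, 0.01  # symmetric Dirichlet hyperparameters

# Count tables and random initial topic assignments
n_dk = [[0] * K for _ in docs]      # topic counts per document
n_kw = [[0] * V for _ in range(K)]  # word counts per topic
n_k = [0] * K                       # total words assigned to each topic
z = []                              # z[d][i]: topic of word i in doc d
for d, doc in enumerate(docs):
    zd = []
    for w in doc:
        k = random.randrange(K)
        zd.append(k)
        n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    z.append(zd)

for _ in range(200):                # Gibbs sweeps over all word positions
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]             # remove the current assignment
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            # Full conditional P(z_i = t | all other assignments)
            weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + eta) / (n_k[t] + V * eta)
                       for t in range(K)]
            r = random.random() * sum(weights)
            new_k, acc = K - 1, 0.0
            for t, wt in enumerate(weights):
                acc += wt
                if r < acc:
                    new_k = t
                    break
            z[d][i] = new_k         # add the new assignment back
            n_dk[d][new_k] += 1; n_kw[new_k][w] += 1; n_k[new_k] += 1
```

After burn-in, `n_dk` and `n_kw` (smoothed by α and η) give point estimates of the per-document topic distributions and per-topic word distributions.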