CS246: LDA Inference
Junghoo “John” Cho, UCLA

LDA Document Generation Model
- For each topic z:
  - Pick the word-probability vector P(w|z) by taking a random sample from Dir(β, …, β)
- For every document d:
  - The user decides its topic vector P(z|d) by taking a random sample from Dir(α, …, α)
  - For each word in d:
    - The user selects a topic z with probability P(z|d)
    - The user selects a word w with probability P(w|z)
- At the end, we have:
  - P(w|z): topic-word vector for each topic
  - P(z|d): document-topic vector for each document
  - A topic assignment for every word in each document
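A minimal sketch of this generative process in Python (the vocabulary size, topic count, document count, document length, and hyperparameter values below are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

T, W, N_DOCS, DOC_LEN = 2, 5, 16, 16   # topics, vocabulary size, documents, words per document
alpha, beta = 0.5, 0.5                 # Dirichlet hyperparameters (assumed values)

# For each topic z, sample the word distribution P(w|z) ~ Dir(beta, ..., beta).
phi = rng.dirichlet([beta] * W, size=T)              # shape (T, W)

docs, assignments = [], []
for _ in range(N_DOCS):
    # For each document d, sample the topic distribution P(z|d) ~ Dir(alpha, ..., alpha).
    theta = rng.dirichlet([alpha] * T)               # shape (T,)
    z = rng.choice(T, size=DOC_LEN, p=theta)         # a topic for each word position
    w = [rng.choice(W, p=phi[t]) for t in z]         # a word drawn from that topic
    docs.append(w)
    assignments.append(z)
```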

LDA as Topic Inference
- Given a corpus
  d_1: w_11, w_12, …, w_1m
  …
  d_N: w_N1, w_N2, …, w_Nm
- Find P(z|d), P(w|z), and z_ij that are most “consistent” with the given corpus
- Q: What does “consistent” mean?
- A: MLE. Find the values that maximize the corpus probability
  $P(C) = \prod_{i=1}^{N} \prod_{j=1}^{m} P(w_{ij} \mid z_{ij}) \, P(z_{ij} \mid d_i)$
- Q: How can we compute such P(z|d), P(w|z), and z_ij?
- A: Solve the optimization problem using the Monte Carlo method together with Gibbs sampling

Monte Carlo Method (1)
- A class of methods that compute a number through repeated random sampling of certain events
- Q: How can we compute π?
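For example, a minimal Monte Carlo sketch for estimating π: sample points uniformly in the unit square and count how many fall inside the quarter circle (the sample count below is an arbitrary choice):

```python
import random

def estimate_pi(n_samples: int = 1_000_000) -> float:
    """Estimate pi by sampling points uniformly in the unit square."""
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:      # the point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / n_samples   # the area ratio (pi/4) scaled back to pi

print(estimate_pi())                  # roughly 3.14; the noise shrinks as n_samples grows
```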

Monte Carlo Method (2)
- Define the domain of possible events
- Generate the events randomly from the domain using a certain probability distribution
- Perform a deterministic computation using the events
- Aggregate the results of the individual computations into the final result
- Q: How can we take random samples from a particular distribution?

Gibbs Sampling
- Q: How can we take a random sample x from the distribution f(x)?
- Q: How can we take a random sample (x, y) from the distribution f(x, y)?
- Gibbs sampling: given the current sample (x_1, …, x_n), pick a dimension x_i and draw a new random value for x_i conditioned on the current values of all the other dimensions
- In practice we sequentially iterate over each dimension
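As an illustration (not from the slides), a minimal Gibbs sampler for a standard bivariate Gaussian with correlation rho, where each conditional f(x|y) and f(y|x) is itself a one-dimensional Gaussian:

```python
import random

def gibbs_bivariate_gaussian(rho: float = 0.8, n_samples: int = 10_000):
    """Sample from a standard bivariate Gaussian with correlation rho via Gibbs sampling."""
    x, y = 0.0, 0.0
    cond_std = (1.0 - rho * rho) ** 0.5
    samples = []
    for _ in range(n_samples):
        # Resample each coordinate from its conditional given the other coordinate.
        x = random.gauss(rho * y, cond_std)   # f(x | y) = N(rho * y, 1 - rho^2)
        y = random.gauss(rho * x, cond_std)   # f(y | x) = N(rho * x, 1 - rho^2)
        samples.append((x, y))
    return samples
```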

Markov Chain Monte Carlo (MCMC)
- Gibbs sampling belongs to the class of Markov chain sampling methods: the next sample depends only on the current sample
- Markov Chain Monte Carlo method: generate random events using Markov chain sampling and apply the Monte Carlo method to compute the result

Applying MCMC to LDA
- Let us apply the Monte Carlo method to estimate the LDA parameters
- Q: How can we map the LDA inference problem to random events?
- A: Focus on assigning a topic z_ij to each word w_ij
  - Event: assignment of the topics {z_ij} to all the w_ij’s
  - The assignment should be done according to the probability P({z_ij}|C) under the LDA model
- Q: How can we sample according to the probability distribution P({z_ij}|C) of the LDA model?

Gibbs Sampling for LDA
- Start with an initial random assignment of the z_ij’s
- For each z_ij: sample a new z_ij value randomly according to P(z_ij | {z_-ij}, C)
- Repeat many times
- Q: What is P(z_ij | {z_-ij}, C)?

What is P(z_ij = z | {z_-ij}, C)?

$P(z_{ij} = z \mid \{z_{-ij}\}, C) = \frac{n_{w_{ij} z} + \beta}{\sum_{w=1}^{W} (n_{wz} + \beta)} \cdot \frac{n_{d_i z} + \alpha}{\sum_{z'=1}^{T} (n_{d_i z'} + \alpha)}$

- n_wz: how many times the word w has been assigned to the topic z
- n_dz: how many words in the document d have been assigned to the topic z
- Q: What is the meaning of each factor?
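The same conditional can be written directly from the count matrices; a small sketch (assuming hypothetical numpy arrays n_wz of shape (W, T) and n_dz of shape (D, T) that already exclude the word being resampled):

```python
import numpy as np

def topic_conditional(w, d, n_wz, n_dz, alpha, beta):
    """P(z_ij = z | {z_-ij}, C) for every topic z, given word id w and document id d."""
    W, T = n_wz.shape
    word_factor = (n_wz[w] + beta) / (n_wz.sum(axis=0) + W * beta)   # the P(w|z)-like factor
    doc_factor = (n_dz[d] + alpha) / (n_dz[d].sum() + T * alpha)     # the P(z|d)-like factor
    p = word_factor * doc_factor
    return p / p.sum()                                               # normalize over topics
```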

LDA with Gibbs Sampling
- For each word w_ij:
  - Assign it to a topic z with probability
    $\frac{n_{w_{ij} z} + \beta}{\sum_{w=1}^{W} (n_{wz} + \beta)} \cdot \frac{n_{d_i z} + \alpha}{\sum_{z'=1}^{T} (n_{d_i z'} + \alpha)}$
  - For the prior topic z_p of w_ij, decrease n_{w_ij z_p} and n_{d_i z_p} by 1
  - For the new topic z_n of w_ij, increase n_{w_ij z_n} and n_{d_i z_n} by 1
- Repeat the process many times (at least hundreds of iterations)
- Once the process is over, we have:
  - z_ij for every w_ij
  - n_wz and n_dz, from which we estimate
    $P(w \mid z) = \frac{n_{wz} + \beta}{\sum_{w'=1}^{W} (n_{w'z} + \beta)}$ and $P(z \mid d) = \frac{n_{dz} + \alpha}{\sum_{z'=1}^{T} (n_{dz'} + \alpha)}$
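A compact sketch of this sampler in Python (assumed data layout: each document is a list of word ids in [0, W); following the usual collapsed-Gibbs convention, the counts exclude the word currently being resampled):

```python
import numpy as np

def lda_gibbs(docs, W, T, alpha=0.5, beta=0.5, n_iters=500, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of lists of word ids in [0, W)."""
    rng = np.random.default_rng(seed)
    n_wz = np.zeros((W, T))                                  # word-topic counts
    n_dz = np.zeros((len(docs), T))                          # document-topic counts
    z = [rng.integers(T, size=len(d)) for d in docs]         # random initial assignment

    for d, words in enumerate(docs):                         # initialize the counts
        for j, w in enumerate(words):
            n_wz[w, z[d][j]] += 1
            n_dz[d, z[d][j]] += 1

    for _ in range(n_iters):
        for d, words in enumerate(docs):
            for j, w in enumerate(words):
                old = z[d][j]
                n_wz[w, old] -= 1                            # remove the current assignment
                n_dz[d, old] -= 1
                # P(z_ij = z | rest) proportional to word factor * document factor
                p = (n_wz[w] + beta) / (n_wz.sum(axis=0) + W * beta) * (n_dz[d] + alpha)
                p /= p.sum()
                new = rng.choice(T, p=p)
                z[d][j] = new
                n_wz[w, new] += 1
                n_dz[d, new] += 1

    phi = (n_wz + beta) / (n_wz.sum(axis=0) + W * beta)      # column z is P(w|z)
    theta = (n_dz + alpha) / (n_dz.sum(axis=1, keepdims=True) + T * alpha)  # row d is P(z|d)
    return z, phi, theta
```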

Example Result from LDA
- TASA corpus: 37,000 text passages from educational materials, collected by Touchstone Applied Science Associates
- Set T = 300 (300 topics)

Inferred Topics

Word Topic Assignments

LDA Algorithm Simulation
- Two topics: River, Money
- Five words: “river”, “stream”, “bank”, “money”, “loan”
- Generate 16 documents by randomly mixing the two topics and using the LDA model
- True topic-term matrix:

           river  stream  bank  money  loan
  River     1/3    1/3    1/3     0     0
  Money      0      0     1/3    1/3   1/3
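Under the topic-term matrix above, the 16 simulated documents could be generated with a sketch like this (the document length, the Dirichlet parameter for the mixed documents, and which documents are mixtures are assumptions matching the description on the next slide):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["river", "stream", "bank", "money", "loan"]
phi = np.array([[1/3, 1/3, 1/3, 0, 0],       # "River" topic
                [0, 0, 1/3, 1/3, 1/3]])      # "Money" topic

# First 6 documents: pure River; last 3: pure Money; the middle 7: random mixtures.
thetas = [[1.0, 0.0]] * 6 + [rng.dirichlet([1.0, 1.0]) for _ in range(7)] + [[0.0, 1.0]] * 3

docs = []
for theta in thetas:
    z = rng.choice(2, size=16, p=theta)                        # a topic for each of 16 words
    docs.append([vocab[rng.choice(5, p=phi[t])] for t in z])
```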

Generated Documents and Initial Topic Assignment before Inference
- The first 6 and the last 3 documents are purely from one topic; the others are mixtures
- White dot: “River” topic. Black dot: “Money” topic

Topic Assignment after LDA Inference
- After 64 iterations
- The first 6 and the last 3 documents are purely from one topic; the others are mixtures

Inferred Topic-Term Matrix
- Model parameter:

           river  stream  bank  money  loan
  River     0.33   0.33   0.33    0     0
  Money      0      0     0.33   0.33  0.33

- Estimated parameter:

           river  stream  bank  money  loan
  River     0.25   0.40   0.35    0     0
  Money      0      0     0.32   0.29  0.39

- Not perfect, but very close, especially given the small data size

LSI vs LDA
- Both perform the following decomposition: X (doc × term) ≈ (doc × topic) × (topic × term)
- SVD views this as matrix approximation
- LDA views this as probabilistic inference based on a generative model
  - Each entry corresponds to a “probability”: better interpretability
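To see the contrast concretely, a rank-2 SVD of a toy document-term matrix typically produces factors with negative entries, which have no probabilistic reading (the matrix below is made up for illustration):

```python
import numpy as np

X = np.array([[3, 2, 2, 0, 0],
              [2, 3, 1, 0, 0],
              [0, 0, 2, 3, 2],
              [0, 0, 1, 2, 3]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # best rank-2 approximation of X
print(np.round(Vt[:k], 2))                 # "topic"-term factors; note the negative entries
```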

LDA as Soft Classification
- Soft vs hard clustering/classification
- After LDA, every document is assigned to a small number of topics with some weights
- Documents are not assigned exclusively to one topic: soft clustering

LDA: Application to IR [Wei & Croft 2006]
- Smooth the document unigram language model P(w|d) with:
  - Corpus language model: $P(w \mid C) = \frac{DF_w}{N}$
  - LDA-based model: $P_{LDA}(w \mid d) = \sum_{z=1}^{T} P(w \mid z) P(z \mid d)$
- Combined model:
  $P(w \mid d) = (1 - \lambda - \mu) \frac{TF_{w,d}}{|d|} + \lambda \frac{DF_w}{N} + \mu \sum_{z=1}^{T} P(w \mid z) P(z \mid d)$
- “Expands” the set of relevant terms through related topics
- Compared to corpus smoothing only, a 10-20% improvement is reported
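A small sketch of this scoring combination (the helper names, data layout, and the λ/μ values are illustrative assumptions, not the paper's tuned settings):

```python
def smoothed_prob(w, d, tf, doc_len, df, n_docs, phi, theta, lam=0.2, mu=0.3):
    """Smoothed P(w|d): document ML estimate + corpus model + LDA-based model.

    w: word id; d: document id; tf[d][w]: term frequency of w in d;
    doc_len[d]: length of d; df[w]: document frequency of w; n_docs: N;
    phi[z][w] = P(w|z); theta[d][z] = P(z|d).
    """
    p_ml = tf[d].get(w, 0) / doc_len[d]                                 # TF_{w,d} / |d|
    p_corpus = df.get(w, 0) / n_docs                                    # DF_w / N
    p_lda = sum(phi[z][w] * theta[d][z] for z in range(len(phi)))       # sum_z P(w|z) P(z|d)
    return (1 - lam - mu) * p_ml + lam * p_corpus + mu * p_lda
```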

pLSI and NMF
- In general, pLSI can be viewed as matrix factorization with the constraint that the factored matrices may contain only values in [0, 1]
- X (doc × term) ≈ (doc × topic) × (topic × term)
- Nonnegative matrix factorization (NMF): many algorithms exist
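A minimal NMF example using scikit-learn, which keeps both factors nonnegative (the toy count matrix is made up for illustration):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term count matrix X (4 documents x 5 terms).
X = np.array([[3, 2, 2, 0, 0],
              [2, 3, 1, 0, 0],
              [0, 0, 2, 3, 2],
              [0, 0, 1, 2, 3]], dtype=float)

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)     # document-topic weights (nonnegative)
H = nmf.components_          # topic-term weights (nonnegative)
print(np.round(W @ H, 2))    # approximately reconstructs X
```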

Summary
- Probabilistic topic model: a generative model of documents
- Latent Dirichlet Allocation (LDA)
- Nonnegative matrix factorization
- Statistical parameter estimation for LDA
  - Multinomial distribution and Dirichlet distribution
  - Monte Carlo method
  - Gibbs sampling (a Markov chain class of sampling)
- Language model “smoothing” through the LDA model

References
[Wei & Croft 2006] Xing Wei and W. Bruce Croft. “LDA-Based Document Models for Ad-hoc Retrieval.” In SIGIR 2006.