An Introduction to LDA Tools Kuan-Yu Chen Institute of Information Science, Academia Sinica

Reference
–D. M. Blei et al., "Latent Dirichlet allocation," Journal of Machine Learning Research, 3, pp. 993–1022, January 2003.
–D. Blei and J. Lafferty, "Topic models," in A. Srivastava and M. Sahami (eds.), Text Mining: Theory and Applications, Taylor and Francis, 2009.
–T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, 42, pp. 177–196, 2001.
–T. Griffiths and M. Steyvers, "Finding scientific topics," in Proc. of the National Academy of Sciences, 2004.
–X. Wei and W. B. Croft, "LDA-based document models for ad-hoc retrieval," in Proc. of ACM SIGIR, 2006.

Outline
A Brief Review of Mixture Models
–Unigram Model
–Mixture of Unigrams
–Probabilistic Latent Semantic Analysis
–Latent Dirichlet Allocation
LDA Tools
–GibbsLDA++
–VB-EM source code from Blei
Examples

Unigram Model & Mixture of Unigrams
Unigram model
–Under the unigram model, the words of every document are drawn independently from a single multinomial distribution:
p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)
Mixture of unigrams
–Under this mixture model, each document is generated by first choosing a topic z and then generating the N words independently from the conditional multinomial p(w \mid z):
p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)
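
To make the difference concrete, here is a toy NumPy sketch (the distributions and the document are invented for illustration) that evaluates a document's log likelihood under both models:

import numpy as np

rng = np.random.default_rng(1)

V, K = 50, 3                                 # toy vocabulary size and number of topics
p_w = rng.dirichlet(np.ones(V))              # the single multinomial of the unigram model
p_z = rng.dirichlet(np.ones(K))              # topic prior of the mixture of unigrams
p_w_given_z = rng.dirichlet(np.ones(V), K)   # per-topic word distributions (K x V)

doc = rng.integers(0, V, size=20)            # a document as a sequence of word ids

# Unigram model: every word is drawn independently from one multinomial.
unigram_ll = np.log(p_w[doc]).sum()

# Mixture of unigrams: one topic is chosen per document, then all words are drawn from it.
mixture_ll = np.log((p_z * np.prod(p_w_given_z[:, doc], axis=1)).sum())

print(unigram_ll, mixture_ll)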

Probabilistic Latent Semantic Analysis
Probabilistic latent semantic analysis (PLSA/PLSI)
–The PLSA model attempts to relax the simplifying assumption made in the mixture of unigrams model that each document is generated from only one topic:
p(d, w_n) = p(d) \sum_{z} p(w_n \mid z) \, p(z \mid d)
–p(z \mid d) serves as the mixture weights of the topics for a particular document d

Latent Dirichlet Allocation
The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words
LDA assumes the following generative process for each document \mathbf{w} in a corpus D:
1. Choose N \sim \mathrm{Poisson}(\xi)
2. Choose \theta \sim \mathrm{Dir}(\alpha)
3. For each of the N words w_n:
a) Choose a topic z_n \sim \mathrm{Multinomial}(\theta)
b) Choose a word w_n from p(w_n \mid z_n, \beta), a multinomial probability conditioned on the topic z_n
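
To make the generative story concrete, here is a minimal NumPy simulation of it; the vocabulary size, topic count, and hyper-parameter values are invented for illustration:

import numpy as np

rng = np.random.default_rng(0)

V, K = 1000, 8                                   # vocabulary size and number of topics
alpha = np.full(K, 6.25)                         # Dirichlet hyper-parameter over topic mixtures
beta = rng.dirichlet(np.full(V, 0.1), size=K)    # K x V topic-word distributions

N = rng.poisson(100)                 # 1. choose the document length N ~ Poisson(xi)
theta = rng.dirichlet(alpha)         # 2. choose the topic mixture theta ~ Dir(alpha)
doc = []
for _ in range(N):                   # 3. for each of the N words:
    z = rng.choice(K, p=theta)       #    a) choose a topic z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])     #    b) choose a word w_n from p(w_n | z_n, beta)
    doc.append(w)

print(len(doc), doc[:10])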

Latent Dirichlet Allocation
Several simplifying assumptions are made:
–The dimensionality k of the Dirichlet distribution (and thus the dimensionality of the topic variable z) is assumed known and fixed
–The word probabilities are parameterized by a k \times V matrix \beta, which we treat as a fixed quantity that is to be estimated
–The Poisson assumption is not critical to anything
Note that the document length N is independent of all the other data generating variables (\theta and \mathbf{z})

Latent Dirichlet Allocation
Given the parameters \alpha and \beta, the joint distribution of a topic mixture \theta, a set of N topics \mathbf{z}, and a set of N words \mathbf{w} is given by:
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)
Integrating over \theta and summing over z, we obtain the marginal distribution of a document:
p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta
Taking the product over all documents, we obtain the probability of a corpus:
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)

Latent Dirichlet Allocation
The key inferential problem is that of computing the posterior distribution of the hidden variables \theta and \mathbf{z} given a document \mathbf{w}:
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}
–Unfortunately, this distribution is intractable to compute in general
–Although the posterior distribution is intractable for exact inference, a wide variety of approximate inference algorithms can be considered for LDA

Latent Dirichlet Allocation - VBEM
The basic idea of convexity-based variational inference is to make use of Jensen's inequality to obtain an adjustable lower bound on the log likelihood
A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed

Latent Dirichlet Allocation - VBEM
This family is characterized by the following variational distribution:
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)
The desideratum of finding a tight lower bound on the log likelihood translates directly into the following optimization problem:
(\gamma^{*}, \phi^{*}) = \arg\min_{(\gamma, \phi)} \mathrm{KL}\big( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \big)
–i.e., by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior
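
For reference, the coordinate-ascent updates that solve this optimization in Blei et al. (2003) take the form below, where \Psi denotes the digamma function, i indexes topics, and n indexes the words of a document:

\phi_{ni} \propto \beta_{i w_n} \exp\left( \Psi(\gamma_i) - \Psi\Big( \sum_{j=1}^{k} \gamma_j \Big) \right)
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}

These two updates are iterated to convergence for each document in the variational E-step, while the M-step re-estimates \beta (and optionally \alpha) from the collected \phi statistics.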

GibbsLDA++
GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) that uses the Gibbs sampling technique for parameter estimation and inference
The main page of GibbsLDA++ is:
We can download this tool from:
It needs to be compiled in a Linux/Cygwin environment

GibbsLDA++
Extract "GibbsLDA++-0.2.tar.gz"
Run Cygwin
Switch the current directory to "/GibbsLDA++-0.2"
Execute the commands:
make clean
make all
Then, we have an executable file "lda.exe" in the "/GibbsLDA++-0.2/src" directory

An Example of GibbsLDA++
Format of the training corpus:
–The first line contains the total document number (the number of documents in the corpus)
–Each of the following lines is one document (Doc1, Doc2, …), written as its words (word1 word2 …) separated by spaces
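
As a sketch, a toy training file with three documents (all words invented for illustration) would look like:

3
stock market trade price market
election vote party election campaign
weather rain storm forecast rain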

An Example of GibbsLDA++
LDA Parameter Estimation
–Command:
lda.exe -est -dfile Gibbs_TDT2_Text.txt -alpha 6.25 -beta 0.1 -ntopics 8 -niters <number of iterations>
–Parameter settings:
-dfile: the input training data
-alpha: a hyper-parameter of LDA
-beta: a hyper-parameter of LDA
-ntopics: the number of latent topics
-niters: the number of iterations

An Example of GibbsLDA++
Outputs of Gibbs sampling estimation of GibbsLDA++ include the following files:
–model.others: this file contains some parameters of the LDA model
–model.phi: this file contains the word-topic distributions (topic-by-word matrix)
–model.theta: this file contains the topic-document distributions (document-by-topic matrix)
–model.tassign: this file contains the topic assignments for the words in the training data
–wordmap.txt: this file contains the mapping between words and their integer IDs
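
A minimal post-processing sketch, assuming model.theta is a plain whitespace-separated matrix with one row per training document and one column per topic:

import numpy as np

# Load the document-by-topic matrix written by GibbsLDA++ (assumed whitespace-separated).
theta = np.loadtxt("model.theta")                # shape: (number of documents, ntopics)

# Report the dominant topic of every training document.
for doc_id, row in enumerate(theta):
    print(f"doc {doc_id}: topic {row.argmax()} (p = {row.max():.3f})")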

VB-EM source code from Blei
Blei implements Latent Dirichlet Allocation (LDA) using VB-EM for parameter estimation and inference
The main page of the source code is:
We can download this tool from:
It needs to be compiled in a Linux/Cygwin environment

VB-EM source code from Blei
Extract "lda-c-dist.tgz"
Run Cygwin
Switch the current directory to "/lda-c-dist"
Execute the command:
make
Then, we have an executable file "lda.exe" in the "/lda-c-dist" directory

An Example of LDA
Format of the training corpus:
–Each line represents one document and has the form
[number of unique words] [word-id]:[appeared times] [word-id]:[appeared times] …
–For example, a document line containing "596:3 612:2 709:1 713:1" (Doc1) states that word 596 appears three times, word 612 twice, and words 709 and 713 once each in that document; the next line describes Doc2, and so on
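
A small sketch of how such a file could be produced from tokenized documents; the toy corpus, output file name, and vocabulary indexing below are invented for illustration, and only the output format follows the description above:

# Convert tokenized documents into the lda-c sparse format:
# each line is "<number of unique words> <word-id>:<count> <word-id>:<count> ..."
from collections import Counter

docs = [["stock", "market", "market", "price"],
        ["election", "vote", "election"]]            # toy corpus

vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}

with open("corpus.txt", "w") as f:
    for doc in docs:
        counts = Counter(vocab[w] for w in doc)
        entries = " ".join(f"{wid}:{cnt}" for wid, cnt in sorted(counts.items()))
        f.write(f"{len(counts)} {entries}\n")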

An Example of LDA
LDA Parameter Estimation
–The input format can be expressed as:
lda.exe est [alpha] [k] [settings] [data] [initialization] [directory]
[alpha]: the hyper-parameter of LDA
[k]: the number of latent topics
[settings]: the settings file
[data]: the input training data
[initialization]: specifies how the topics will be initialized
[directory]: the output directory
–Command:
lda.exe est /settings.txt Blei_TDT2_Text.txt random ./

An Example of LDA
The settings file contains several experiment settings:
–var max iter: the maximum number of iterations for a single document
–var convergence: the convergence criterion for document-level inference
–em max iter: the maximum number of VB-EM iterations
–em convergence: the convergence criterion for VB-EM
–alpha: set to "fixed" or "estimate"
An example settings file:
var max iter 20
var convergence 1e-6
em max iter 100
em convergence 1e-4
alpha estimate

An Example of LDA
The saved models are in three files:
–.other: this file contains alpha and some other statistical information about the LDA model
–.beta: this file contains the log of the topic distributions over words (topic-by-word matrix)
–.gamma: this file contains the variational posterior Dirichlet parameters of each document (document-by-topic matrix)
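
Because .gamma stores variational Dirichlet parameters rather than probabilities, a common post-processing step is to normalize each row to obtain a document's expected topic proportions. A minimal sketch, assuming the final model was saved with the prefix "final" and that the file is a whitespace-separated matrix with one row per document:

import numpy as np

# Load the variational Dirichlet parameters (one row per document, one column per topic).
gamma = np.loadtxt("final.gamma")

# Normalize each row to get the expected topic proportions under q(theta | gamma).
proportions = gamma / gamma.sum(axis=1, keepdims=True)
print(proportions[0])    # topic proportions of the first document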