
1 An Introduction to LDA Tools Kuan-Yu Chen Institute of Information Science, Academia Sinica

2 References
D. M. Blei et al., “Latent Dirichlet allocation,” Journal of Machine Learning Research, 3, pp. 993–1022, January 2003.
D. Blei and J. Lafferty, “Topic models,” in A. Srivastava and M. Sahami (eds.), Text Mining: Theory and Applications. Taylor and Francis, 2009.
T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, 42, pp. 177–196, 2001.
T. Griffiths and M. Steyvers, “Finding scientific topics,” in Proc. of the National Academy of Sciences, 2004.
X. Wei and W. B. Croft, “LDA-based document models for ad-hoc retrieval,” in Proc. of ACM SIGIR, 2006.

3 Outline
A Brief Review of Mixture Models
– Unigram Model
– Mixture of Unigrams
– Probabilistic Latent Semantic Analysis
– Latent Dirichlet Allocation
LDA Tools
– GibbsLDA++
– VB-EM source code from Blei
Examples

4 Unigram Model & Mixture of Unigrams
Unigram model
– Under the unigram model, the words of every document are drawn independently from a single multinomial distribution: $p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$
Mixture of unigrams
– Under this mixture model, each document is generated by first choosing a topic $z$ and then generating $N$ words independently from the conditional multinomial $p(w \mid z)$: $p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$
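
As a rough illustration of the two models on this slide, the sketch below computes the log-likelihood of one document under a unigram model and under a mixture of unigrams. The toy vocabulary, the probabilities, and the function names are invented for the example and do not come from the slides.

```python
import math

def unigram_log_likelihood(doc, word_probs):
    """log p(w) = sum_n log p(w_n) under a single multinomial."""
    return sum(math.log(word_probs[w]) for w in doc)

def mixture_log_likelihood(doc, topic_priors, topic_word_probs):
    """log p(w) = log sum_z p(z) prod_n p(w_n | z)."""
    per_topic = [
        math.log(p_z) + sum(math.log(topic_word_probs[z][w]) for w in doc)
        for z, p_z in enumerate(topic_priors)
    ]
    # Log-sum-exp keeps the sum over topics numerically stable.
    m = max(per_topic)
    return m + math.log(sum(math.exp(v - m) for v in per_topic))

# Hypothetical toy setup: a 3-word vocabulary and 2 topics.
word_probs = {"ball": 0.5, "vote": 0.3, "game": 0.2}
topic_priors = [0.5, 0.5]
topic_word_probs = [
    {"ball": 0.6, "vote": 0.1, "game": 0.3},
    {"ball": 0.1, "vote": 0.8, "game": 0.1},
]
doc = ["ball", "game", "ball"]
print(unigram_log_likelihood(doc, word_probs))
print(mixture_log_likelihood(doc, topic_priors, topic_word_probs))
```

With a single topic whose word distribution equals the unigram distribution, the two functions return the same value, which is a quick sanity check that the mixture generalizes the unigram model.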

5 Probabilistic Latent Semantic Analysis
Probabilistic latent semantic analysis (PLSA/PLSI)
– The PLSA model attempts to relax the simplifying assumption made in the mixture of unigrams model that each document is generated from only one topic: $p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$
– $p(z \mid d)$ serves as the mixture weights of the topics for a particular document $d$

6 Latent Dirichlet Allocation
The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words
LDA assumes the following generative process for each document $\mathbf{w}$ in a corpus $D$:
1. Choose $N \sim \mathrm{Poisson}(\xi)$
2. Choose $\theta \sim \mathrm{Dir}(\alpha)$
3. For each of the $N$ words $w_n$:
a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$
b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
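
The generative process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document length is passed in directly rather than drawn from a Poisson distribution, the topic proportions (called `theta` here) are drawn from a Dirichlet by normalizing Gamma draws, and the function names and toy parameters are invented for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, num_words, rng):
    """Generate one document: topic proportions, then a topic and a word per position.

    alpha: list of k Dirichlet parameters.
    beta: k rows of V word probabilities (topic-by-word matrix).
    num_words: document length, given directly (assumption of this sketch).
    """
    theta = sample_dirichlet(alpha, rng)
    topics = list(range(len(alpha)))
    vocab = list(range(len(beta[0])))
    assignments, words = [], []
    for _ in range(num_words):
        z = rng.choices(topics, weights=theta)[0]   # choose a topic
        w = rng.choices(vocab, weights=beta[z])[0]  # choose a word given the topic
        assignments.append(z)
        words.append(w)
    return theta, assignments, words

# Hypothetical parameters: 2 topics over a 3-word vocabulary.
rng = random.Random(0)
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1],
        [0.1, 0.2, 0.7]]
theta, assignments, words = generate_document(alpha, beta, num_words=10, rng=rng)
print(theta, assignments, words)
```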

7 Latent Dirichlet Allocation
Several simplifying assumptions are made:
– The dimensionality $k$ of the Dirichlet distribution (the number of topics) is assumed known and fixed
– The word probabilities are parameterized by a $k \times V$ matrix $\beta$, which we treat as a fixed quantity that is to be estimated
– The Poisson assumption is not critical to anything
Note that the document length $N$ is independent of all the other data generating variables ($\theta$ and $\mathbf{z}$)

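8 Latent Dirichlet Allocation
Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topics $\mathbf{z}$, and a set of $N$ words $\mathbf{w}$ is given by: $p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$
Integrating over $\theta$ and summing over $\mathbf{z}$, we obtain the marginal distribution of a document: $p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$
Taking the product of the marginal probabilities of the single documents, we obtain the probability of a corpus: $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$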

9 Latent Dirichlet Allocation
The key inferential problem is that of computing the posterior distribution of the hidden variables given a document: $p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$
– Unfortunately, this distribution is intractable to compute in general
– Although exact inference is intractable, a wide variety of approximate inference algorithms can be considered for LDA

10 Latent Dirichlet Allocation - VBEM
The basic idea of convexity-based variational inference is to make use of Jensen’s inequality to obtain an adjustable lower bound on the log likelihood
A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed

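11 Latent Dirichlet Allocation - VBEM
This family is characterized by the following variational distribution: $q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$
The desideratum of finding a tight lower bound on the log likelihood translates directly into the following optimization problem: $(\gamma^{*}, \phi^{*}) = \arg\min_{(\gamma, \phi)} D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)$
– That is, the bound is tightened by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior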
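
The link between the lower bound and the KL divergence can be made explicit with a short derivation. The notation follows Blei et al. (2003) and is an assumption relative to this transcript: $q(\theta, \mathbf{z} \mid \gamma, \phi)$ is the variational distribution with free parameters $\gamma$ and $\phi$, and $L$ denotes the lower bound.

```latex
% Jensen's inequality gives a lower bound on the log likelihood:
\log p(\mathbf{w} \mid \alpha, \beta)
  \;\ge\; \mathrm{E}_q\!\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right]
        - \mathrm{E}_q\!\left[\log q(\theta, \mathbf{z} \mid \gamma, \phi)\right]
  \;=\; L(\gamma, \phi; \alpha, \beta)

% The gap between the two sides is exactly the KL divergence:
\log p(\mathbf{w} \mid \alpha, \beta)
  \;=\; L(\gamma, \phi; \alpha, \beta)
      + D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)
```

Since the left-hand side does not depend on $\gamma$ or $\phi$, maximizing the bound $L$ over these parameters is the same as minimizing the KL divergence.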

12 GibbsLDA++
GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) that uses Gibbs sampling for parameter estimation and inference
The main page of GibbsLDA++ is: http://gibbslda.sourceforge.net/
We can download this tool from: http://sourceforge.net/projects/gibbslda/
It needs to be compiled in a Linux or Cygwin environment
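
To give a feel for what Gibbs sampling for LDA involves, here is a minimal sketch of a collapsed Gibbs sampler in the style of Griffiths and Steyvers (2004). It is not GibbsLDA++'s code; the function name, the symmetric scalar priors `alpha` and `beta`, and the toy corpus are assumptions made for the example.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha, beta, num_iters, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = random.Random(seed)
    topics = list(range(num_topics))
    doc_topic = [[0] * num_topics for _ in docs]                # counts n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts n(k, w)
    topic_total = [0] * num_topics                              # counts n(k)
    assign = []

    # Random initial topic assignment for every word token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assign[d][n]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized full conditional for each candidate topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in topics
                ]
                z = rng.choices(topics, weights=weights)[0]
                assign[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Point estimates of the document-topic and topic-word distributions.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in topics
    ]
    return theta, phi, assign

# Toy corpus: word ids 0-1 and 2-3 appear in separate documents.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1], [3, 2, 2]]
theta, phi, assign = gibbs_lda(docs, num_topics=2, vocab_size=4,
                               alpha=0.5, beta=0.1, num_iters=200)
print(theta)
```

The returned `theta` and `phi` play the same roles as the document-topic and topic-word distributions that a Gibbs-sampling LDA tool estimates.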

13 GibbsLDA++
Extract “GibbsLDA++-0.2.tar.gz”
Run cygwin
Switch current directory to “/GibbsLDA++-0.2”
Execute the commands:
    make clean
    make all
Then, we have an executable file “lda.exe” in the “/GibbsLDA++-0.2/src” directory

14 An Example of GibbsLDA++
Format of the training corpus
– The first line gives the total number of documents; each following line is one document (Doc1, Doc2, ...), written as a space-separated sequence of words (word1 word2 ...)
    2265
    40889 44022 10092 2471 9800….
    31677 653 657 17998 1788…...
    1521 15820 3015 48825 2690…..
    42763 7680 38280 2913 42763…..
    42763 2997 732 42472 3844…..
    2572 1583 2584 44400 3015…...
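
A corpus in this layout can be produced with a few lines of Python. The sketch below assumes the documents are already tokenized into lists of words (or word IDs); the function names and the output file name are hypothetical.

```python
def format_gibbslda_corpus(docs):
    """Return the corpus text: document count first, then one document per line."""
    lines = [str(len(docs))]
    for tokens in docs:
        lines.append(" ".join(str(t) for t in tokens))
    return "\n".join(lines) + "\n"

def write_gibbslda_corpus(docs, path):
    """Write the formatted corpus to a file (the path is chosen by the caller)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(format_gibbslda_corpus(docs))

# Hypothetical two-document corpus of word IDs.
example_docs = [[40889, 44022, 10092], [31677, 653]]
print(format_gibbslda_corpus(example_docs), end="")
```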

15 An Example of GibbsLDA++
LDA Parameter Estimation
– Command
    lda.exe -est -dfile Gibbs_TDT2_Text.txt -alpha 6.25 -beta 0.1 -ntopics 8 -niters 2000
– Parameter Settings
    -dfile: the input training data
    -alpha: the LDA hyper-parameter for the Dirichlet prior on the per-document topic distributions
    -beta: the LDA hyper-parameter for the Dirichlet prior on the per-topic word distributions
    -ntopics: the number of latent topics
    -niters: the number of Gibbs sampling iterations

16 An Example of GibbsLDA++
Outputs of Gibbs sampling estimation of GibbsLDA++ include the following files:
– model.others: This file contains some parameters of the LDA model
– model.phi: This file contains the word-topic distributions (topic-by-word matrix)
– model.theta: This file contains the topic-document distributions (document-by-topic matrix)
– model.tassign: This file contains the topic assignments for the words in the training data
– Wordmap.txt: This file contains the mapping between words and their integer IDs
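
The matrices in these output files can be inspected with a short script. The sketch below assumes that the phi and theta files hold one row per line as whitespace-separated numbers, and that the word map holds a count on its first line followed by "word id" pairs; these layouts are assumptions to check against your own output files, and the function names are invented for the example.

```python
def parse_matrix(lines):
    """Parse whitespace-separated rows of floats (assumed layout of phi/theta files)."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]

def parse_wordmap(lines):
    """Parse 'word id' pairs into {id: word}; lines without two fields are skipped."""
    id_to_word = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            id_to_word[int(parts[1])] = parts[0]
    return id_to_word

def top_words(phi, id_to_word, n=10):
    """Return the n most probable words for each topic (each row of phi)."""
    result = []
    for row in phi:
        best = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:n]
        result.append([id_to_word.get(j, str(j)) for j in best])
    return result

# Hypothetical in-memory stand-ins for the file contents.
phi = parse_matrix(["0.1 0.7 0.2", "0.5 0.3 0.2"])
wordmap = parse_wordmap(["3", "apple 0", "bank 1", "court 2"])
print(top_words(phi, wordmap, n=2))
```

To use it on real files, pass an open file object in place of the list of strings, for example `parse_matrix(open(path))`.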

17 VB-EM source code from Blei
Blei implemented Latent Dirichlet Allocation (LDA) using VB-EM for parameter estimation and inference
The main page of the source code is: http://www.cs.princeton.edu/~blei/lda-c/index.html
We can download this tool from: http://www.cs.princeton.edu/~blei/lda-c/lda-c-dist.tgz
It needs to be compiled in a Linux or Cygwin environment

18 VB-EM source code from Blei
Extract “lda-c-dist.tgz”
Run cygwin
Switch current directory to “/lda-c-dist”
Execute the command:
    make
Then, we have an executable file “lda.exe” in the “/lda-c-dist” directory

19 An Example of LDA
Format of the training corpus
– Each line is one document (Doc1, Doc2, ...): the number of unique words in the document, followed by word-id:count pairs, where the count is the number of times the word appears in the document
    77 508:1 596:3 612:2 709:1 713:1 …..
    72 508:2 596:5 597:1 653:1 657:3 …..
    88 457:1 508:1 572:2 596:6 795:1 …..
    62 457:1 508:1 596:2 657:1 732:1 …..
    53 336:4 341:1 457:1 596:1 657:1 …...
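
One way to produce a line in this format from a document given as a list of integer word IDs is sketched below. The function name is hypothetical, and sorting the pairs by word ID is a choice made here to match the appearance of the example lines.

```python
from collections import Counter

def format_ldac_line(word_ids):
    """Return '<number of unique words> id:count id:count ...' for one document."""
    counts = Counter(word_ids)
    fields = [str(len(counts))]
    fields += [f"{w}:{c}" for w, c in sorted(counts.items())]
    return " ".join(fields)

# Hypothetical document: word 596 occurs three times, word 612 twice.
doc_ids = [508, 596, 596, 596, 612, 612, 709, 713]
print(format_ldac_line(doc_ids))
```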

20 An Example of LDA
LDA Parameter Estimation
– The input format can be expressed as:
    lda.exe est [alpha] [k] [settings] [data] [initialization] [directory]
    [alpha]: The hyper-parameter of LDA (initial value of the Dirichlet parameter)
    [k]: The number of latent topics
    [settings]: The settings file
    [data]: The input training data
    [initialization]: Specify how the topics will be initialized
    [directory]: The output directory
– Command
    lda.exe est 6.25 8 ./settings.txt Blei_TDT2_Text.txt random ./

21 An Example of LDA
The settings file contains several tunable values:
– var max iter: The maximum number of variational inference iterations for a single document
– var convergence: The convergence criterion for variational inference
– em max iter: The maximum number of iterations of VB-EM
– em convergence: The convergence criterion for VB-EM
– alpha: Set to “fixed” or “estimate” to keep alpha fixed or to estimate it during training
An example settings file:
    var max iter 20
    var convergence 1e-6
    em max iter 100
    em convergence 1e-4
    alpha estimate

22 An Example of LDA
The saved models are in three files:
– .other: This file contains alpha and some other statistical information of the LDA model
– .beta: This file contains the log of the topic distribution over words (topic-by-word matrix)
– .gamma: This file contains the variational posterior Dirichlets of each document (document-by-topic matrix)
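
Because the .beta file stores log probabilities and the .gamma file stores Dirichlet parameters, a little post-processing is needed before the values can be read as probabilities. The sketch below assumes each file holds one row per line as whitespace-separated numbers; the function names and sample values are invented for the example. Normalizing a gamma row gives the mean of the variational Dirichlet, which is one common way to summarize a document's topic proportions.

```python
import math

def parse_rows(lines):
    """Parse whitespace-separated rows of floats (assumed layout of the saved files)."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]

def beta_row_to_probs(log_beta_row):
    """Exponentiate one row of log word probabilities."""
    return [math.exp(v) for v in log_beta_row]

def gamma_row_to_proportions(gamma_row):
    """Normalize variational Dirichlet parameters into expected topic proportions."""
    total = sum(gamma_row)
    return [g / total for g in gamma_row]

# Hypothetical rows standing in for one line of each file.
log_beta = parse_rows(["-1.6094379 -0.2231436"])
gamma = parse_rows(["1.0 3.0"])
print(beta_row_to_probs(log_beta[0]))
print(gamma_row_to_proportions(gamma[0]))
```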

