An Overview of Topic Modeling

Weifeng Li (1, 2) and Hsinchun Chen (1)
(1) Artificial Intelligence Laboratory, The University of Arizona
(2) University of Georgia

Acknowledgements
Many of the pictures, results, and other materials are taken from:
- David Blei, Princeton University
- The Stanford Natural Language Processing Group

Outline
- Introduction and Motivation
- Latent Dirichlet Allocation
  - Probabilistic Modeling Overview
  - LDA Assumptions
  - Inference
  - Evaluation
- Research Example: LDA Application in Profiling Underground Economy Sellers
- LDA Variants
  - Relaxing the Assumptions of LDA
  - Incorporating Metadata
  - Coupling with Deep Learning
  - Generalizing to Other Kinds of Data
- Future Directions
- Tools & Implementation Details

Introduction and Motivation
As more information becomes easily available, it is increasingly difficult to find and discover what we need. Topic models are a suite of algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents. Among these algorithms, Latent Dirichlet Allocation (LDA), a technique based on Bayesian modeling, is the most commonly used today. Topic models can be applied to massive collections of documents to automatically organize, understand, search, and summarize large electronic archives, which makes them especially relevant in today’s “Big Data” environment.

Each topic is a distribution over words; each document is a mixture of corpus-wide topics; and each word is drawn from one of those topics.

In reality, we only observe the documents. The other structures are hidden variables. Our goal is to infer the hidden variables.

Introduction and Motivation: 100-topic LDA, 17,000 Science articles
The output of an LDA model is a set of topics, each described by its keywords, which are then manually labeled. On the left are the inferred topic proportions for the example article from the previous figure.

Use Cases of Topic Modeling
Topic models have been used to:
- Annotate documents and images
- Organize and browse large corpora
- Model topic evolution
- Categorize source code archives
- Discover influential articles

Probabilistic Modeling Overview
- Modeling: treat the data as arising from a generative process that includes hidden variables. This defines a joint distribution over both the observed and the hidden variables.
- Inference: infer the conditional distribution (posterior) of the hidden variables given the observed variables.
- Analysis: check the fit of the model; make predictions based on new data; explore the properties of the hidden variables.
[Figure: the Modeling, Inference, Analysis cycle (Blei)]

Latent Dirichlet Allocation: Assumptions
LDA is a generative Bayesian model for topic modeling, built on the following assumptions.

Assumptions on all variables:
- Word: the basic unit of discrete data
- Document: a collection of words (exchangeability assumption)
- Corpus: a collection of documents
- Topic (hidden): a distribution over words; the number of topics $K$ is known.

Assumptions on how texts are generated (Dir denotes the Dirichlet distribution; see the next slide):
- For each topic $k$, draw a multinomial over words $\beta_k \sim \mathrm{Dir}(\eta)$.
- For each document $d$:
  - Draw a document topic proportion $\theta_d \sim \mathrm{Dir}(\alpha)$.
  - For each word $w_{d,n}$:
    - Draw a topic $z_{d,n} \sim \mathrm{Multi}(\theta_d)$.
    - Draw a word $w_{d,n} \sim \mathrm{Multi}(\beta_{z_{d,n}})$.
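The generative steps above can be simulated directly. The following is a minimal sketch using only the Python standard library; the function names, the toy vocabulary, and the symmetric hyperparameter values are illustrative assumptions, not part of the original slides.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_docs, doc_len, vocab, num_topics, alpha, eta, seed=0):
    """Generate documents by following the LDA generative process."""
    rng = random.Random(seed)
    # For each topic k: beta_k ~ Dir(eta), a distribution over the vocabulary.
    beta = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    docs, thetas = [], []
    for _ in range(num_docs):
        # For each document d: theta_d ~ Dir(alpha), its topic proportions.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)    # z_{d,n} ~ Multi(theta_d)
            w = sample_categorical(beta[z], rng)  # w_{d,n} ~ Multi(beta_{z_{d,n}})
            words.append(vocab[w])
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, beta

# Toy vocabulary and hyperparameters chosen only for illustration.
vocab = ["gene", "dna", "brain", "neuron", "data", "model"]
docs, thetas, beta = generate_corpus(
    num_docs=3, doc_len=8, vocab=vocab, num_topics=2, alpha=0.5, eta=0.5
)
```

Inference runs this process in reverse: only `docs` is observed, and `thetas`, `beta`, and the per-word topics must be recovered.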

Dirichlet Distribution: Dir(𝜶)
- Named after Peter G. L. Dirichlet and often denoted as $\mathrm{Dir}(\boldsymbol{\alpha})$.
- A family of continuous multivariate probability distributions parameterized by a vector $\boldsymbol{\alpha}$ of positive reals:

$$p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$$

- $\mathrm{Dir}(\boldsymbol{\alpha})$ is the multivariate generalization of the beta distribution.
- Dirichlet distributions are often used as prior distributions in Bayesian statistics.
- The Dirichlet distribution is the conjugate prior of the categorical distribution and the multinomial distribution. (Conjugate distributions: the posterior distribution is in the same family as the prior distribution.)
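As a small worked check of the density and of conjugacy, the sketch below evaluates the Dirichlet log-density with `math.lgamma` and applies the standard conjugate update, in which observed category counts are added to the prior parameters. The function names and example values are illustrative assumptions.

```python
import math

def dirichlet_log_pdf(theta, alpha):
    """Log-density of Dir(alpha) at a point theta on the probability simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))

def posterior_alpha(alpha, counts):
    """Conjugacy: a Dir(alpha) prior plus multinomial counts gives Dir(alpha + counts)."""
    return [a + c for a, c in zip(alpha, counts)]

# With alpha = (1, 1, 1) the density is flat on the simplex and equals Gamma(3) = 2.
flat_density = math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))

# Observing counts (4, 0, 2) updates the prior parameters to (5, 1, 3).
updated = posterior_alpha([1.0, 1.0, 1.0], [4, 0, 2])
```

The same update is what makes the Dirichlet convenient in LDA: counts of topic assignments and word occurrences simply shift the prior parameters.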

LDA: Probabilistic Graphical Model
The per-document topic proportions $\theta_d$ are the parameters of a multinomial distribution over topics and are drawn from a Dirichlet distribution parameterized by $\alpha$. Similarly, each topic $\beta_k$ is a multinomial distribution over words, drawn from a Dirichlet distribution parameterized by $\eta$. For each word $n$, its topic $Z_{d,n}$ is drawn from the document topic proportions $\theta_d$. Then, we draw the word $W_{d,n}$ from the topic $\beta_k$, where $k = Z_{d,n}$.

The Graphical Model for LDA: Joint Distribution
This distribution specifies a number of dependencies that define LDA (as shown in the plate diagram).
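For reference, the joint distribution implied by the generative process described earlier can be written as follows. This is the standard factorization for LDA, with $\beta_{1:K}$ denoting all topics; the equation on the original slide is not preserved in this transcript, so its exact notation is an assumption.

```latex
p(\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{z}, \boldsymbol{w} \mid \alpha, \eta)
  = \prod_{k=1}^{K} p(\beta_k \mid \eta)
    \prod_{d=1}^{D} \left[ p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\,
                      p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right]
```

Each factor corresponds to one arrow in the plate diagram: topics depend on $\eta$, topic proportions on $\alpha$, topic assignments on the proportions, and words on both the assignment and the topics.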

Inference
Objective: compute the conditional distribution (posterior) of the topic structure given the observed documents:

$$p(\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{z} \mid \boldsymbol{w}, \alpha) = \frac{p(\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{z}, \boldsymbol{w} \mid \alpha)}{p(\boldsymbol{w} \mid \alpha)}$$

- $p(\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{z}, \boldsymbol{w} \mid \alpha)$: the joint distribution of all the random variables, which is easy to compute.
- $p(\boldsymbol{w} \mid \alpha)$: the marginal probability of the observations (the probability of seeing the observed corpus under any topic model), which is intractable. In theory, $p(\boldsymbol{w} \mid \alpha)$ is computed by summing the joint distribution over every possible combination of $\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{z}$, and the number of such combinations is exponentially large.

Approximation methods search over the topic structure:
- Sampling-based algorithms attempt to collect samples from the posterior to approximate it with an empirical distribution.
- Variational algorithms posit a parameterized family of distributions over the hidden structure and then find the member of that family that is closest to the posterior.

More on Approximation Methods
Among sampling-based algorithms, Gibbs sampling is the most commonly used:
- It approximates the posterior with samples.
- Construct a Markov chain (a sequence of random variables, each dependent on the previous one) whose limiting distribution is the posterior.
- The Markov chain is defined on the hidden topic variables for a particular corpus. The algorithm runs the chain for a long time, collects samples from the limiting distribution, and then approximates the posterior with the collected samples (see Steyvers & Griffiths, 2006).

Variational algorithms are a deterministic alternative to sampling-based algorithms:
- Posit a parameterized family of distributions over the hidden structure and then find the member of that family that is closest to the posterior.
- The inference problem is thereby transformed into an optimization problem.
- For LDA, see the coordinate ascent variational inference algorithm of Blei, Ng, and Jordan (2003).
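To make the sampling-based approach concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA, in which $\theta$ and $\beta$ are integrated out and only the topic assignments are resampled. It assumes documents are given as lists of integer word ids; the function name, hyperparameter defaults, and toy data are illustrative assumptions rather than the implementation used in the cited work.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, eta=0.01, num_iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k

    def update_counts(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Randomly initialize every topic assignment z_{d,n}.
    assignments = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, k in zip(doc, zs):
            update_counts(d, w, k, +1)
        assignments.append(zs)

    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment, then resample it from its
                # full conditional given all other assignments.
                update_counts(d, w, assignments[d][n], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta)
                    / (topic_total[k] + vocab_size * eta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assignments[d][n] = k_new
                update_counts(d, w, k_new, +1)

    # Point estimates of the hidden structure from the final state of the chain.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    beta = [
        [(topic_word[k][w] + eta) / (topic_total[k] + vocab_size * eta) for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    return theta, beta, assignments

# Toy corpus of word ids: two documents with disjoint vocabularies and one mixed document.
toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 3, 4, 2]]
theta, beta, assignments = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
```

A practical sampler would discard early iterations as burn-in and average over several samples; this sketch keeps only the final state for brevity.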

Model Evaluation: Perplexity
Perplexity is the most typical evaluation of LDA models (Bao & Datta, 2014; Blei et al., 2003).
- Perplexity measures modeling power as a decreasing function of the log-likelihood of unobserved (held-out) documents: the higher the likelihood, the better the model.
- Better models have lower perplexity, suggesting less uncertainty about the unobserved documents.
- It is computed from the average log-likelihood of all unobserved documents, which aggregates the log-likelihood of each unobserved document; $w_d$ denotes the words in document $d$ and $N_d$ the length of document $d$.

The figure compares LDA with other topic modeling approaches. The LDA model is consistently better than all other benchmark approaches. Moreover, as the number of topics goes up, the LDA model becomes better (i.e., the perplexity decreases).
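A common form of this measure, following Blei et al. (2003), is $\mathrm{perplexity} = \exp\left(-\sum_d \log p(w_d) / \sum_d N_d\right)$. The sketch below computes it from per-document log-likelihoods; how those log-likelihoods are obtained depends on the inference algorithm and is assumed to be given. The function name and the sanity check are illustrative.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """Perplexity of held-out documents from their log-likelihoods and lengths."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_words = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_words)

# Sanity check: a model that assigns every word probability 1/V has perplexity V.
vocab_size = 50
lengths = [10, 20]
uniform_log_likelihoods = [n * math.log(1.0 / vocab_size) for n in lengths]
uniform_perplexity = perplexity(uniform_log_likelihoods, lengths)
```

The check illustrates the usual reading of perplexity as an effective vocabulary size: a model with perplexity 50 is as uncertain about each word as a uniform choice among 50 words.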

Model Evaluation: Topic Coherence
Topic coherence evaluates the semantic nature of the learned topics. Specifically, it measures the semantic similarity among the top keywords of a topic. Topic coherence has been shown to be correlated with human evaluations of topic quality.

The coherence of a topic $\boldsymbol{\beta}_k$ is calculated by:

$$\mathrm{coherence}(\boldsymbol{\beta}_k) = \sum_{(w_i, w_j) \in V_n} \mathrm{score}(w_i, w_j)$$

where $V_n$ is the set of the top $n$ keywords of the topic $\boldsymbol{\beta}_k$.

There are two commonly used score metrics:
- The extrinsic UCI metric (Newman et al. 2010):

$$\mathrm{score}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}$$

  where $p(w_i, w_j)$ is the co-occurrence probability of the word pair $w_i, w_j$ estimated from an external corpus (e.g., Wikipedia) and $p(w_i)$ is the probability of word $w_i$ in the external corpus.
- The intrinsic UMass metric (Mimno et al. 2011):

$$\mathrm{score}(w_i, w_j) = \log \frac{D(w_i, w_j) + 1}{D(w_j)}$$

  where $D(w_i, w_j)$ counts the number of documents in which the word pair $w_i, w_j$ co-occurs and $D(w_j)$ counts the number of documents containing $w_j$.

References:
Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010, June). Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.
Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011, July). Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
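The intrinsic UMass score needs only document co-occurrence counts from the training corpus, so it is easy to sketch. The version below sums the score over ordered pairs of top keywords, with the higher-ranked word in the denominator; the pair ordering, the function name, and the toy documents are assumptions made for illustration.

```python
import math

def umass_coherence(top_words, documents):
    """UMass coherence of a topic's top keywords over tokenized documents."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for words_in_doc in doc_sets if all(w in words_in_doc for w in words))

    total = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            # score(w_i, w_j) = log((D(w_i, w_j) + 1) / D(w_j))
            total += math.log((doc_count(w_i, w_j) + 1) / doc_count(w_j))
    return total

# Toy corpus: "card" and "dump" co-occur, while "card" and "icq" never do.
toy_documents = [
    ["card", "dump", "shop"],
    ["card", "dump"],
    ["card", "price"],
    ["icq", "shop"],
]
coherent_score = umass_coherence(["card", "dump"], toy_documents)
incoherent_score = umass_coherence(["card", "icq"], toy_documents)
```

This sketch assumes every top keyword occurs in at least one document, which holds when the keywords come from a model trained on the same corpus.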

Model Selection: How Many Topics to Choose
The author of LDA suggests selecting the number of topics from the range 50 to 150 (Blei 2012); however, the optimal number usually depends on the size of the dataset. Cross-validation on perplexity is often used for selecting the number of topics. Specifically, we first propose candidate numbers of topics, evaluate the average perplexity of each candidate using cross-validation, and pick the number of topics with the lowest perplexity. The following plot illustrates the selection of the optimal number of topics for 4 datasets.
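The selection rule itself is a simple argmin over averaged held-out perplexities. In the sketch below, the per-fold perplexities are assumed to have been computed already by training and evaluating a model for each candidate; the function name and the numbers are made up purely to illustrate the rule and are not results from any dataset.

```python
def select_num_topics(fold_perplexities):
    """Pick the candidate number of topics with the lowest average perplexity.

    fold_perplexities maps each candidate K to the held-out perplexities
    obtained on the cross-validation folds.
    """
    averages = {k: sum(values) / len(values) for k, values in fold_perplexities.items()}
    best_k = min(averages, key=averages.get)
    return best_k, averages

# Made-up perplexities for three candidate values of K (illustration only).
illustrative_scores = {
    50: [1210.0, 1190.0, 1200.0],
    100: [1105.0, 1095.0, 1100.0],
    150: [1130.0, 1150.0, 1140.0],
}
best_k, averages = select_num_topics(illustrative_scores)
```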

LDA Research Example – Profiling Underground Economy Sellers
- The underground economy is the online black market for exchanging products and services that relate to cybercrime.
- Cybercrime activities have been largely commoditized in the underground economy.
- Sellers pose a growing threat to cyber security.
- Sellers advertise their products and services by giving details about their resources, payments, contacts, etc.
- Objective: to profile underground economy sellers in a way that reflects their specialties (characteristics).
Li, W., Chen, H., & Nunamaker Jr, J. F. (2016). Identifying and profiling key sellers in cyber carding community: AZSecure text mining system. Journal of Management Information Systems, 33(4),

Input: original threads from hacker forums
Preprocessing:
- Thread retrieval: identifying threads related to the underground economy by conducting a snowball sampling-based keyword search
- Thread classification: identifying advertisement threads using a MaxEnt classifier
  - Focusing on malware advertisements and stolen card advertisements
  - Can be generalized to other advertisements

To profile a seller, we seek to identify the major topics in the seller's advertisements.
Example input: an advertisement by Rescator, a seller of stolen data, containing:
- Description of the stolen data/service
- Prices of the stolen data
- Contact: a dedicated shop and ICQ
- Payment options

For LDA model selection, we use perplexity to choose the optimal number of topics for the advertisement corpus.
Output: LDA gives the probability of each topic associated with the seller.
- We pick the top-$K$ topics to profile the seller ($K=5$ in our example).
- For each topic, we pick the top-$J$ keywords to interpret the topic ($J=10$ in our example).
The following table helps us profile Rescator based on its characteristics in terms of the product, the payment, and the contact.

Top Seller Characteristics of Rescator
# | Top Keywords
5 | shop, wmz, icq, webmoney, price, dump
6 | валид (valid), чекер (checker), карты (cards), баланс (balance), карт (cards)
8 | shop, good, CCs, bases, update, cards, bitcoin, webmoney, validity, lesspay
11 | dollars, dumps, deposit, payment, sell, online, verified
16 | shop, register, icq, account, jabber
Interpretation (across the five topics): Product: CCs, dumps (valid, verified); Payment: wmz, webmoney, bitcoin, lesspay; Contact: shop, register, deposit, icq, jabber
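The top-$K$ topics and top-$J$ keywords selection described above can be sketched as follows. The inputs are assumed to be a seller's topic probabilities and the topic-word distributions from a fitted LDA model; the function names and the toy numbers are hypothetical and chosen only for illustration.

```python
def top_indices(values, n):
    """Indices of the n largest values, in decreasing order of value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:n]

def profile_seller(topic_probs, topic_word_probs, vocab, top_k=5, top_j=10):
    """Return the seller's top-K topics, each with its top-J keywords."""
    profile = []
    for k in top_indices(topic_probs, top_k):
        keywords = [vocab[w] for w in top_indices(topic_word_probs[k], top_j)]
        profile.append((k, keywords))
    return profile

# Toy model with 3 topics over a 4-word vocabulary (illustrative values).
toy_vocab = ["shop", "icq", "dump", "price"]
toy_topic_probs = [0.1, 0.6, 0.3]
toy_topic_word_probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.25, 0.5, 0.15, 0.1],
]
profile = profile_seller(toy_topic_probs, toy_topic_word_probs, toy_vocab, top_k=2, top_j=2)
# profile == [(1, ["dump", "icq"]), (2, ["icq", "shop"])]
```

The keywords still have to be interpreted manually, as in the table above, to arrive at labels such as product, payment, and contact.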

LDA Variants: Relaxing the Assumptions of LDA
- Consider the order of the words: words in a document cannot be exchanged
  - Conditioning on the previous word (Wallach 2006)
  - Hidden Markov Model (Griffiths et al. 2005)
- Consider the order of the documents
  - Dynamic LDA (Blei & Lafferty 2006)
- Consider previously unseen topics: the number of topics is not fixed
  - Bayesian nonparametrics (Blei et al. 2010)

Dynamic LDA
Motivation:
- LDA assumes that the order of documents does not matter, which is not appropriate for sequential corpora.
- We want to capture language change over time.
In Dynamic LDA, topics evolve over time. Dynamic LDA uses a logistic normal distribution to model how topics drift through time.
Example: topics drifting through time (Blei & Lafferty 2006).
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), pp. 113–120.
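The logistic normal construction can be illustrated with a short simulation: a topic's unconstrained parameters follow a Gaussian random walk across time slices, and a softmax maps them to a word distribution at each slice. This is a minimal sketch of the topic-drift idea only, not the inference procedure of Dynamic LDA; the function names, step size, and vocabulary size are illustrative assumptions.

```python
import math
import random

def softmax(params):
    """Map unconstrained parameters to a probability distribution over words."""
    largest = max(params)
    exps = [math.exp(p - largest) for p in params]
    total = sum(exps)
    return [e / total for e in exps]

def drift_topic(initial_params, num_steps, sigma, seed=0):
    """Simulate one topic drifting over time slices via a Gaussian random walk."""
    rng = random.Random(seed)
    params = list(initial_params)
    trajectory = [softmax(params)]
    for _ in range(num_steps):
        # Each parameter at time t is centered on its value at time t - 1.
        params = [p + rng.gauss(0.0, sigma) for p in params]
        trajectory.append(softmax(params))
    return trajectory

# A topic over a 4-word vocabulary that starts uniform and drifts for 5 time slices.
trajectory = drift_topic([0.0, 0.0, 0.0, 0.0], num_steps=5, sigma=0.3)
```

A small `sigma` keeps consecutive time slices similar, which is what lets the model track gradual changes in vocabulary.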

Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Process
- Bayesian nonparametrics: the parameter space has infinite dimension.
- Nonparametric topic model: a topic model whose topic space has infinite dimension (i.e., an infinite number of topics).
- Less vulnerable to the overfitting or underfitting caused by misspecifying the number of topics.
- A prominent nonparametric topic model is the hierarchical Dirichlet process (HDP) (Teh et al. 2006).
- HDP is increasingly chosen over LDA for modeling topics because of its reliability and flexibility.
Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association,

Dirichlet Process: the Major Building Block of HDP
Dirichlet process (DP): a probability distribution whose draws are, with probability one, discrete distributions over the topic space.

$$G_0 \sim \mathrm{DP}(\omega, H)$$

- $\omega$ (concentration parameter): how concentrated the discrete distributions over topics drawn from the DP are.
- $H$ (base distribution): determines the topic space and the expectation of the discrete distributions drawn from the DP.

Two nice properties of DP samples that allow for modeling topics:
- Clustering property: topics previously drawn from the distribution are likely to be drawn again, allowing the words within a document to be clustered under certain topics.
- Infinity property: one can draw as many topics as needed because DP samples are distributions over topics; therefore, the number of topics is unbounded.

Example: $G_0 \sim \mathrm{DP}(\omega, \mathrm{Gaussian}(0,1))$
[Figure: three samples of $G_0$ (#1, #2, #3) for each of $\omega = 1$, $\omega = 10$, $\omega = 100$, and $\omega = 1000$]
- $G_0$ is a discrete distribution with probability one.
- It can take an infinite number of values, each sampled from $\mathrm{Gaussian}(0,1)$.
- Values sampled from $\mathrm{Gaussian}(0,1)$ are often repeated, thus forming clusters (each spike represents one cluster).
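One standard way to draw such a sample is the stick-breaking construction: break off a Beta(1, $\omega$) fraction of the remaining probability mass for each new atom, and draw the atom's location from the base distribution. The sketch below truncates the construction at a fixed number of atoms, which is an approximation; the function name and the truncation level are assumptions, and the slides do not state how their samples were generated.

```python
import random

def sample_dp_stick_breaking(omega, num_atoms, seed=0):
    """Truncated stick-breaking draw of G_0 ~ DP(omega, Gaussian(0, 1))."""
    rng = random.Random(seed)
    atoms, weights = [], []
    remaining = 1.0  # probability mass not yet assigned to any atom
    for _ in range(num_atoms):
        fraction = rng.betavariate(1.0, omega)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
        atoms.append(rng.gauss(0.0, 1.0))       # atom location drawn from the base H
    return atoms, weights, remaining

# Draw one truncated sample for several concentration parameters.
samples = {omega: sample_dp_stick_breaking(omega, num_atoms=200) for omega in (1, 10, 100)}
```

Each returned atom is one spike of the discrete distribution, and its weight is the height of the spike; `remaining` is the mass lost to truncation. Smaller values of $\omega$ tend to place most of the mass on a few atoms, while larger values spread it over many.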

Hierarchical Dirichlet Process (HDP)
For the corpus: $G_0 \sim \mathrm{DP}(\omega, H)$
For each document: $G_d \sim \mathrm{DP}(\alpha, G_0)$
[Figure: graphical model with nodes labeled Corpus Topic Distribution, Document Topic Distribution, Per-word Topic, and Observed Word]
- The Dirichlet process (DP) allows for modeling an infinite number of topics.
- $H$: base topic distribution (e.g., Dirichlet distribution $\mathrm{Dir}(\boldsymbol{\alpha})$)
- $\omega$: corpus topic concentration parameter
- $\alpha$: document topic concentration parameter

HDP Model Specifications:
Variable | Distribution | Description
$\boldsymbol{\Phi}$ | Laplace distribution (Taddy 2013): $\phi_w \sim \mathrm{Laplace}(\lambda)$ | Coefficient of the response $y_d$ on word $w$
$G_0$ | Dirichlet process (Teh et al. 2006): $G_0 \sim \mathrm{DP}(\omega, H)$, where $H$ is a distribution over topics | Corpus topic distribution: collection of all possible topics that will be used in the corpus
$G_d$ | $G_d \sim \mathrm{DP}(\alpha, G_0)$ | Document topic distribution: collection of all possible topics for the document
$\boldsymbol{\beta}_{d,n}$ | Multinomial distribution (Blei et al. 2003): $\boldsymbol{\beta}_{d,n} \sim G_d$ | Topic of the $n$-th word in the $d$-th document
$w_{d,n}$ | Multinomial distribution (Rabinovich & Blei 2014): $w_{d,n} \sim \mathrm{Multinomial}(\boldsymbol{\beta}_{d,n})$ | Observed $n$-th word in the $d$-th document

Latent Dirichlet Allocation vs. Hierarchical Dirichlet Process
[Figure: side-by-side comparison. LDA: a fixed set of corpus-level topics (topics 1 to 5 in the illustration), with each document 1, ..., D represented by its proportions over those topics. HDP: an unbounded set of corpus-level topics (1, 2, 3, 4, 5, ..., ∞), with each document 1, ..., D represented by its own topic distribution.]

Hierarchical Dirichlet Process Performance: Document Modeling on Academic Papers
[Figure: (a) perplexity of LDA versus HDP; (b) posterior number of topics in HDP]
(a): HDP performed as well as the best LDA model, without any form of model selection procedure.
(b): The posterior over the number of topics obtained under the HDP model is consistent with the range of the best-fitting LDA models.
Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association,

LDA Variants: Incorporating Metadata
Account for the metadata of the documents (e.g., author, title, geographic location, links, etc.)
- Author-topic model (Rosen-Zvi et al. 2004)
  - Assumption: the topic proportions are attached to authors.
  - Allows for inferences about authors, for example, author similarity.
- Relational topic model (Chang & Blei 2010)
  - Documents are linked (e.g., citation, hyperlink).
  - Assumption: links between documents depend on the distance between their topic proportions.
  - Takes into account node attributes (the words of the document) in modeling the network links.
- Supervised topic model (Blei & McAuliffe 2007)
  - A general-purpose method for incorporating metadata into topic models

From Topic Model to Supervised Topic Model (STM)
[Figure: documents $\boldsymbol{w}$ (e.g., reviews) are summarized by $K$ topics $\beta_1, \dots, \beta_5$ (descriptive, unsupervised), which are then used to predict responses $\boldsymbol{y}$ (e.g., service quality) (predictive, supervised). Example topic keywords: picture, video, tutorial; price, delivery, service; model, device, machine; cvv, vbv, ssn; pos, skimmer, encrypt.]
STM can simultaneously:
- Explore the underlying topics
- Make predictions using the extracted topics
Prediction is more accurate because the model captures the underlying topics shared across documents (Blei & McAuliffe 2008).
Existing STMs: Supervised LDA, Dirichlet Multinomial Regression, Discriminative LDA, Dependency-LDA, Inverse Regression Topic Model (IRTM)…

Supervised LDA
Supervised LDA (sLDA) is a topic model of documents and response variables. It is fit to find topics that are predictive of the response variable.
[Figure: 10-topic sLDA model on movie reviews (Pang and Lee, 2005), identifying the topics that correspond to ratings]
Blei, D. M., and McAuliffe, J. D. “Supervised Topic Models,” in Advances in Neural Information Processing Systems, pp. 121–128 (doi: /asmb.540).
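In sLDA with a continuous response, the response of a document is modeled as a linear function (plus Gaussian noise) of the empirical frequencies of the topics assigned to its words. The sketch below shows only this prediction step, assuming topic assignments and regression coefficients are already available; the function names and the numbers are hypothetical.

```python
def mean_topic_frequencies(assignments, num_topics):
    """Empirical frequency of each topic among a document's word assignments."""
    counts = [0] * num_topics
    for k in assignments:
        counts[k] += 1
    return [c / len(assignments) for c in counts]

def expected_response(coefficients, topic_frequencies):
    """Expected response: the dot product of coefficients and topic frequencies."""
    return sum(c * f for c, f in zip(coefficients, topic_frequencies))

# A 4-word document whose words were assigned to topics 0, 0, 1, and 2.
frequencies = mean_topic_frequencies([0, 0, 1, 2], num_topics=3)
# Hypothetical coefficients: topic 0 is associated with high ratings, topic 2 with low.
predicted_rating = expected_response([4.0, 2.0, 0.0], frequencies)
```

Because the coefficients are attached to topics, a topic with a large coefficient can be read as one that pushes the response (e.g., a rating) upward.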

Parametric vs. Nonparametric
Parametric approach:
[Figure: documents $\boldsymbol{w}$ (e.g., reviews) are mapped to proportions over a predefined number of topics ($K$ topics $\beta_1, \dots, \beta_5$), which predict the responses $\boldsymbol{y}$ (e.g., service quality)]
Nonparametric approach:
[Figure: documents $\boldsymbol{w}$ (e.g., reviews) are mapped to a distribution over an unlimited number of topics ($\infty$ topics), which predict the responses $\boldsymbol{y}$ (e.g., service quality)]
- Nonparametric flexibility to determine the number of topics
- The potential number of topics is unlimited

Supervised HDP: Nonparametric Supervised Topic Model
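[Figure: graphical model of supervised HDP, with nodes labeled Corpus Topic Distribution, Document Topic Distribution, Per-word Topic, Observed Word, and Response Coefficient, and a legend distinguishing latent variables from observed data]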

[Figure: comparison of the models on Predictive R-squared, Mean Absolute Error (MAE), and Root Mean Square Error (RMSE); some benchmark models failed to converge]
At least 17% improvement over the closest technique.
Li, W., Yin, J., & Chen, H. (2018). Supervised Topic Modeling Using Hierarchical Dirichlet Process-Based Inverse Regression: Experiments on E-Commerce Applications. IEEE Transactions on Knowledge and Data Engineering, 30(6),

LDA Variants: Coupling with Deep Learning
- Neural topic model (Miao et al. 2017)
  - Variational autoencoder (VAE): a deep learning technique for approximating the posterior distribution using the autoencoder framework.
  - Building upon deep learning’s improved capability of approximating non-linear relationships, VAE generally provides better model inference than traditional variational inference algorithms.
  - Neural topic model: leveraging VAE for model inference
- Word embedding-based topic modeling (Das et al. 2015)
  - Word embedding: representing word semantics in a continuous vector space. Semantically related words tend to be closer to each other in the vector space.
  - Word embedding-based topic: a distribution over the vector space (instead of a multinomial distribution over words)
  - This encourages the model to group words that are a priori known to be semantically related into topics.
Das, R., Zaheer, M., & Dyer, C. (2015). Gaussian LDA for Topic Models with Word Embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Vol. 1, pp. 795–804).
Miao, Y., Grefenstette, E., & Blunsom, P. (2017). Discovering Discrete Latent Topics with Neural Variational Inference. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 2410–2419).

LDA Variants: Generalizing to Other Kinds of Data
LDA is a mixed-membership model of grouped data. Rather than associating each group of data with one topic, each group exhibits multiple topics in different proportions. Hence, LDA can be adapted to other kinds of observations with only small changes to the corresponding inference algorithms.
- Population genetics
  - Application: finding ancestral populations
  - Intuition: each individual’s genotype descends from one or more of the ancestral populations (topics)
- Computer vision
  - Application: classifying images, connecting images and captions, building image hierarchies, etc.
  - Intuition: each image exhibits a combination of visual patterns, and the same visual patterns recur throughout a collection of images

Future Directions
- Evaluation and model checking
  - Held-out accuracy may not correspond to better organization or easier interpretation (Amazon Mechanical Turk experiment; see Chang et al., 2009: perplexity is not strongly correlated with human judgement and is sometimes even slightly anti-correlated).
  - Which topic model should I use? How can I decide which of the many modeling assumptions are important for my goals?
- Visualization and user interfaces
  - How to display the topics?
  - How to best display a document with a topic model?
  - How can we best display document connections?

Future Directions
- Topic models for data discovery
  - How can topic models help us form hypotheses about the data?
  - What can we learn about the language based on the topic model posterior?
- Topic interpretation
  - Topics are distributions over words; therefore, interpreting topic semantics from these distributions becomes important.
  - How to properly interpret a topic?
- Multilingual topic modeling
  - Can the same topic appear in different languages?
  - How to find common topics across different languages?

Topic Modeling - Tools
Name | Model/Algorithm | Language | Author | Notes
lda-c | Latent Dirichlet allocation | C | D. Blei | Implements variational inference for LDA.
class-slda | Supervised topic models for classification | C++ | C. Wang | Implements supervised topic models with a categorical response.
lda | R package for Gibbs sampling in many models | R | J. Chang | Implements many models and is fast. Supports LDA, RTMs (for networked documents), MMSB (for network data), and sLDA (with a continuous response).
gensim | Software Framework for Topic Modelling with Large Corpora | Python | R. Rehurek, P. Sojka | Provides distributed implementations of many models, including LDA, the author-topic model, HDP, and the dynamic topic model.
tmve | Topic Model Visualization Engine | | A. Chaney | A package for creating corpus browsers.
dtm | Dynamic topic models and the influence model | | S. Gerrish | Implements topics that change over time and a model of how individual documents predict that change.
ctm-c | Correlated topic models | | | Implements variational inference for the CTM.
Mallet | LDA, Hierarchical LDA | Java | A. McCallum | Implements LDA and Hierarchical LDA.
Stanford topic modeling toolbox | LDA, Labeled LDA, Partially Labeled LDA | | Stanford NLP Group | Implements LDA, Labeled LDA, and PLDA.

gensim (Python) Example: Topic Modeling News Articles
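We use the news article dataset from the Lee corpus.

Data import & preprocessing:

    import os
    import re

    import gensim
    from gensim.corpora import Dictionary
    from gensim.utils import lemmatize  # assumption: a gensim release that still ships lemmatize
    from nltk.corpus import stopwords   # assumption: NLTK stop words stand in for `stops`

    stops = set(stopwords.words('english'))

    # Locate the Lee background corpus bundled with gensim's test data.
    test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
    lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')

    def build_texts(fname):
        """Yield one tokenized document per line of the corpus file."""
        with open(fname) as f:
            for line in f:
                yield gensim.utils.simple_preprocess(line, deacc=True, min_len=3)

    train_texts = list(build_texts(lee_train_file))

Bigram collocation detection:

    bigram = gensim.models.Phrases(train_texts)

    def process_texts(texts):
        """Remove stop words, merge detected bigrams, and keep lemmatized nouns."""
        texts = [[word for word in line if word not in stops] for line in texts]
        texts = [bigram[line] for line in texts]
        texts = [[word.split('/')[0]
                  for word in lemmatize(' '.join(line),
                                        allowed_tags=re.compile('(NN)'),
                                        min_length=3)]
                 for line in texts]
        return texts

    train_texts = process_texts(train_texts)

Creating a bag-of-words model:

    dictionary = Dictionary(train_texts)
    corpus = [dictionary.doc2bow(text) for text in train_texts]

Tutorial retrieved from: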

Specifying the number of topics and executing LDA:

    ldamodel = gensim.models.LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)
    ldatopics = ldamodel.show_topics(formatted=False)

Visualizing the model:

    import pyLDAvis.gensim  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
    pyLDAvis.enable_notebook()
    pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)

Example output: [Figure: pyLDAvis visualization of the LDA model]
Tutorial retrieved from:

Executing HDP & showing learned topics:

    hdpmodel = gensim.models.HdpModel(corpus=corpus, id2word=dictionary)
    hdptopics = hdpmodel.show_topics(formatted=False)
    hdpmodel.show_topics()

Example output:

    [u'topic 0: 0.004*collapse *afghanistan *troop *force *government *benefit *operation *taliban *time *today *ypre *tourism *person *help *wayne *fire *peru *day *united_state *hih',
     u'topic 1: 0.003*group *government *target *palestinian *end *terrorism *cease *memorandum *radio *call *official *path *security *wayne *attack *human_right *four *gunman *sharon *subsidiary',
     u'topic 2: 0.003*rafter *double *team *reality *manager *cup *australia *abc *nomination *user *freeman *herberton *lung *believe *injury *steve_waugh *fact *statement *mouth *alejandro',
     …

Each topic is returned with its most probable keywords and their associated weights.
Tutorial retrieved from:

We now evaluate the HDP model in comparison with the LDA model, as measured by topic coherence.

Calculating topic coherence:

    # Keep only the keywords of each topic (drop the weights).
    hdptopics = [[word for word, prob in topic] for topicid, topic in hdptopics]
    ldatopics = [[word for word, prob in topic] for topicid, topic in ldatopics]

    # Compare the first 10 HDP topics with the 10 LDA topics.
    hdp_coherence = gensim.models.CoherenceModel(
        topics=hdptopics[:10], texts=train_texts,
        dictionary=dictionary, window_size=10).get_coherence()
    lda_coherence = gensim.models.CoherenceModel(
        topics=ldatopics, texts=train_texts,
        dictionary=dictionary, window_size=10).get_coherence()

Visualizing the topic coherence comparison:

    import numpy as np
    import matplotlib.pyplot as plt

    def evaluate_bar_graph(coherences, indices):
        """Plot one bar per model, labeled by model name."""
        assert len(coherences) == len(indices)
        n = len(coherences)
        x = np.arange(n)
        plt.bar(x, coherences, width=0.2, tick_label=indices, align='center')
        plt.xlabel('Models')
        plt.ylabel('Coherence Value')

    evaluate_bar_graph([hdp_coherence, lda_coherence], ['HDP', 'LDA'])

The output bar graph is on the next slide.
Tutorial retrieved from:

The figure on the right shows the comparison of the performance between HDP and LDA, as measured by topic coherence. As suggested in the figure, HDP generates more coherent topics than LDA. Tutorial retrieved from:
