Topic Modeling and Latent Dirichlet Allocation: An Overview


Topic Modeling and Latent Dirichlet Allocation: An Overview
Weifeng Li, Sagar Samtani, and Hsinchun Chen
Acknowledgements: David Blei (Princeton University) and the Stanford Natural Language Processing Group

Outline
- Introduction and Motivation
- Latent Dirichlet Allocation
  - Probabilistic Modeling Overview
  - LDA Assumptions
  - Inference
  - Evaluation
- Two Examples on Applying LDA to Cyber Security Research
  - Profiling Underground Economy Sellers
  - Understanding Hacker Source Code
- LDA Variants
  - Relaxing the Assumptions of LDA
  - Incorporating Metadata
  - Generalizing to Other Kinds of Data
- Future Directions
- LDA Tools

Introduction and Motivation
As more information becomes easily available, it is difficult to find and discover what we need. Topic models are a suite of algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents. Among these algorithms, Latent Dirichlet Allocation (LDA), a technique based on Bayesian modeling, is the most commonly used today. Topic models can be applied to massive collections of documents to automatically organize, understand, search, and summarize large electronic archives. This is especially relevant in today's "Big Data" environment.

Introduction and Motivation
Each topic is a distribution over words; each document is a mixture of corpus-wide topics; and each word is drawn from one of those topics.

Introduction and Motivation
In reality, we only observe the documents; the other structures are hidden variables. Our goal is to infer the hidden variables.

Introduction and Motivation
The resulting output from an LDA model is a set of topics containing keywords, which are then manually labeled. On the left are the inferred topic proportions for the example articles from the previous figure.

Use Cases of Topic Modeling
Topic models have been used to:
- Annotate documents and images
- Organize and browse large corpora
- Model topic evolution
- Categorize source code archives
- Discover influential articles

Probabilistic Modeling Overview
- Modeling: treat the data as arising from a generative process that includes hidden variables. This defines a joint distribution over both the observed and the hidden variables.
- Inference: infer the conditional distribution (posterior) of the hidden variables given the observed variables.
- Analysis: check the fit of the model; make predictions based on new data; explore the properties of the hidden variables.
(Figure: the Modeling → Inference → Analysis cycle; source: Blei.)

Latent Dirichlet Allocation: Assumptions
LDA is a generative Bayesian model for topic modeling, built on the following assumptions.
Assumptions on all variables:
- Word: the basic unit of discrete data
- Document: a collection of words (exchangeability assumption)
- Corpus: a collection of documents
- Topic (hidden): a distribution over words; the number of topics K is known
Assumptions on how texts are generated (using the Dirichlet distribution, next slide; a code sketch follows this list):
- For each topic k, draw a distribution over words: beta_k ~ Dir(eta)
- For each document d, draw document topic proportions: theta_d ~ Dir(alpha)
- For each word w_{d,n}:
  - Draw a topic z_{d,n} ~ Mult(theta_d)
  - Draw a word w_{d,n} ~ Mult(beta_{z_{d,n}})
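Below is a minimal sketch of this generative process in Python, assuming numpy; the sizes (K, D, V), document lengths, and hyperparameters are illustrative choices, not values from the slides.

```python
# Minimal sketch of the LDA generative process (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
K, D, V = 3, 5, 20            # topics, documents, vocabulary size
eta, alpha = 0.1, 0.5         # Dirichlet hyperparameters

# For each topic k, draw a distribution over words: beta_k ~ Dir(eta)
beta = rng.dirichlet(np.full(V, eta), size=K)      # shape (K, V)

docs = []
for d in range(D):
    # For each document d, draw topic proportions: theta_d ~ Dir(alpha)
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for n in range(rng.integers(10, 20)):          # document length
        z = rng.choice(K, p=theta)                 # z_{d,n} ~ Mult(theta_d)
        w = rng.choice(V, p=beta[z])               # w_{d,n} ~ Mult(beta_z)
        words.append(w)
    docs.append(words)
```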

Dirichlet Distribution
A family of continuous multivariate probability distributions parameterized by a vector alpha of positive reals:

$p(\theta \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$

A K-dimensional Dirichlet random variable theta has the following properties:
- theta is a K-dimensional vector: $\theta = (\theta_1, \theta_2, \ldots, \theta_K)$, with $\theta_k \in (0, 1)$
- The dimensions sum to 1: $\sum_{k=1}^{K} \theta_k = 1$
- The parameter vector alpha controls the mean shape and sparsity of theta
In LDA, the topic proportions are a K-dimensional Dirichlet, and the topics are a V-dimensional Dirichlet.
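A quick way to build intuition for how alpha controls sparsity is to draw samples at a few concentration values (a sketch assuming numpy; the values are illustrative):

```python
# Small alpha -> sparse draws near the corners of the simplex;
# large alpha -> draws near the uniform mean.
import numpy as np

rng = np.random.default_rng(1)
for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(np.full(5, alpha))
    print(alpha, np.round(theta, 3), "sum =", theta.sum())  # always sums to 1
```

With alpha = 0.1 most of the mass piles onto one or two dimensions, while with alpha = 10 the draws hover near the uniform vector.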

LDA: Probabilistic Graphical Model
The per-document topic proportions theta_d form a multinomial distribution, generated from a Dirichlet distribution parameterized by alpha. Similarly, each topic beta_k is also a multinomial distribution, generated from a Dirichlet distribution parameterized by eta. For each word n in document d, its topic Z_{d,n} is drawn from the document topic proportions theta_d. Then, we draw the word W_{d,n} from the topic beta_k, where k = Z_{d,n}.

LDA: Joint Distribution

$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})$

This distribution specifies a number of dependencies that define LDA.

Inference
Objective: compute the conditional distribution (posterior) of the topic structure given the observed documents:

$p(\beta, \theta, z \mid w, \alpha, \eta) = \frac{p(\beta, \theta, z, w \mid \alpha, \eta)}{p(w \mid \alpha, \eta)}$

- The numerator $p(\beta, \theta, z, w \mid \alpha, \eta)$ is the joint distribution of all the random variables, which is easy to compute.
- The denominator $p(w \mid \alpha, \eta)$ is the marginal probability of the observations, which is intractable: in theory, it is computed by summing the joint distribution over every possible combination of $\beta, \theta, z$, which is exponentially large.
Approximation methods:
- Sampling-based algorithms attempt to collect samples from the posterior to approximate it with an empirical distribution.
- Variational algorithms posit a parameterized family of distributions over the hidden structure and then find the member of that family that is closest to the posterior.

More on Approximation Methods
Among sampling-based algorithms, Gibbs sampling is the most commonly used (a compact sketch follows this list):
- Construct a Markov chain (a sequence of random variables, each dependent on the previous) whose limiting distribution is the posterior.
- The Markov chain is defined on the hidden topic variables for a particular corpus; the algorithm runs the chain for a long time, collects samples from the limiting distribution, and approximates the posterior with the collected samples.
Variational algorithms are a deterministic alternative to sampling-based algorithms:
- The inference problem is transformed into an optimization problem.
- Variational methods open the door for innovations in optimization to have practical impact in probabilistic modeling.
- They can easily handle millions of documents and can accommodate streaming collections of text.
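As a concrete illustration of the sampling-based route, here is a compact collapsed Gibbs sampler for LDA. This is a hedged sketch using the standard collapsed conditional p(z = k | rest) proportional to (n_dk + alpha)(n_kw + eta)/(n_k + V*eta), not the exact algorithm behind any particular tool; `docs` is assumed to be a list of word-id lists.

```python
import numpy as np

def gibbs_lda(docs, V, K=10, alpha=0.5, eta=0.1, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))            # topic counts per document
    nkw = np.zeros((K, V))            # word counts per topic
    nk = np.zeros(K)                  # total words per topic
    z = []                            # topic assignment of every token
    for d, doc in enumerate(docs):    # random initialization
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]           # remove this token's current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional p(z = k | everything else)
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                t = rng.choice(K, p=p / p.sum())
                z[d][n] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw                   # counts for estimating theta and beta
```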

Model Evaluation: Perplexity
Perplexity is the most typical evaluation of LDA models (Bao & Datta, 2014; Blei et al., 2003). Perplexity measures modeling power by calculating the inverse probability of unobserved documents; better models have lower perplexity, suggesting less uncertainty about the unobserved documents:

$\text{perplexity}(D_{\text{test}}) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d}\right)$

where $w_d$ denotes the words in document d, $N_d$ is the length of document d, $\log p(w_d)$ is the log-likelihood of each unobserved document, and the exponent is the average per-word log-likelihood over all unobserved documents. The figure compares LDA with other topic modeling approaches: the LDA model is consistently better than all the benchmark approaches, and as the number of topics goes up, the LDA model improves (i.e., the perplexity decreases).

Model Selection: How Many Topics to Choose
The author of LDA suggests selecting the number of topics from 50 to 150 (Blei 2012); however, the optimal number usually depends on the size of the dataset. Cross validation on perplexity is often used for selecting the number of topics: we propose candidate numbers of topics first, evaluate the average perplexity using cross validation, and pick the number of topics with the lowest perplexity (a code sketch follows). The following plot illustrates the selection of the optimal number of topics for 4 datasets.
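A minimal sketch of this selection procedure, assuming gensim is available and `texts` is a list of tokenized documents; gensim's log_perplexity returns a per-word likelihood bound, with perplexity = 2^(-bound). A proper setup would use k-fold cross validation rather than the single split shown here.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
split = int(0.8 * len(corpus))
train, heldout = corpus[:split], corpus[split:]

for k in (25, 50, 75, 100, 150):          # candidate topic numbers
    lda = LdaModel(train, id2word=dictionary, num_topics=k, passes=5)
    # perplexity = 2^(-per-word bound); pick the k with the lowest value
    print(k, np.exp2(-lda.log_perplexity(heldout)))
```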

Cybersecurity Research Example – Profiling Underground Economy Sellers
The underground economy is the online black market for exchanging products/services related to cybercrime. Cybercrime activities have been largely commoditized in the underground economy, so sellers pose a growing threat to cyber security. Sellers advertise their products/services by giving details about their resources, payments, contacts, etc. Objective: to profile underground economy sellers according to their specialties (characteristics).

Cybersecurity Research Example – Profiling Underground Economy Sellers
Input: original threads from hacker forums.
Preprocessing:
- Thread retrieval: identify threads related to the underground economy by conducting a snowball sampling-based keyword search.
- Thread classification: identify advertisement threads using a MaxEnt classifier (a rough stand-in is sketched after this list), focusing on malware advertisements and stolen card advertisements; the approach can be generalized to other advertisements.
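The slides name a MaxEnt classifier for the thread classification step; a maximum-entropy text classifier is equivalent to (multinomial) logistic regression, so a rough stand-in with scikit-learn might look like the following. `threads` and `labels` are hypothetical placeholders for the labeled training data, not artifacts from the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF features feeding a logistic regression (MaxEnt) classifier.
clf = make_pipeline(TfidfVectorizer(min_df=2),
                    LogisticRegression(max_iter=1000))
clf.fit(threads, labels)                 # labels: advertisement vs. not
is_ad = clf.predict(["selling fresh dumps, contact icq"])
```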

Cybersecurity Research Example – Profiling Underground Economy Sellers
To profile a seller, we seek to identify the major topics in its advertisements. Example input: a seller of stolen data, Rescator, whose advertisement contains:
- A description of the stolen data/service
- Prices of the stolen data
- Contact: a dedicated shop and ICQ
- Payment options

Cybersecurity Research Example – Profiling Underground Economy Sellers
For LDA model selection, we use perplexity to choose the optimal number of topics for the advertisement corpus. Output: LDA gives the probability of each topic associated with the seller. We pick the top-K topics to profile the seller (K = 5 in our example), and for each topic we pick the top-J keywords to interpret it (J = 10 in our example), as sketched after the table. The following table helps us profile Rescator based on its characteristics in terms of the product, the payment, and the contact.

Top Seller Characteristics of Rescator
#  | Top Keywords
5  | shop, wmz, icq, webmoney, price, dump
6  | валид (valid), чекер (checker), карты (cards), баланс (balance), карт (cards)
8  | shop, good, CCs, bases, update, cards, bitcoin, webmoney, validity, lesspay
11 | dollars, dumps, deposit, payment, sell, online, verified
16 | email, shop, register, icq, account, jabber
Interpretation (across the five topics): Product: CCs, dumps (valid, verified); Payment: wmz, webmoney, bitcoin, lesspay; Contact: shop, register, deposit, email, icq, jabber.
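A sketch of this profiling step with a fitted gensim LdaModel, reusing the hypothetical `lda` and `dictionary` from the earlier sketch; `seller_ad_tokens` is a placeholder for the seller's tokenized advertisement text.

```python
K, J = 5, 10                                         # top topics, top keywords
bow = dictionary.doc2bow(seller_ad_tokens)
topic_probs = lda.get_document_topics(bow, minimum_probability=0.0)
top_topics = sorted(topic_probs, key=lambda tp: -tp[1])[:K]
for topic_id, prob in top_topics:
    # top-J keywords that interpret this topic
    keywords = [w for w, _ in lda.show_topic(topic_id, topn=J)]
    print(topic_id, round(prob, 3), ", ".join(keywords))
```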

Cybersecurity Research Example – Understanding Hacker Source Code
Underground hacker forums provide a unique platform for understanding assets which may be used for a cyber-attack. Hacker source code is one of the more abundant cyber-assets in such communities: there are often tens of thousands of code snippets embedded in forum postings. However, little research has attempted to automatically understand such assets. Such research can lead to better cyber-defenses and also potential reuse of these assets for educational purposes.

Cybersecurity Research Example – Understanding Hacker Source Code
LDA can help us better understand the types of malicious source code assets available in such forums. To perform the analysis, all source code posts within a given forum form the corpus on which the LDA model is run. Associated metadata (e.g., post date, author name) are preserved for post-LDA analysis. The Stanford Topic Modeling Toolbox (TMT) is used to run LDA on the source code forum postings.

Cybersecurity Research Example – Understanding Hacker Source Code
To identify the optimal number of topics for each forum, perplexity charts are calculated. The topic number with the lowest perplexity score generally signifies the optimal number of topics; however, further manual evaluation is often needed.

Data        | Optimal Topic Number | Perplexity
DamageLab   | 60                   | 440.772
Exploit     | 65                   | 1,424.834
OpenSC      | 95                   | 4,866.838
Prologic    |                      | 970.41
Reverse4You | 80                   | 1,576.980
Xakepok     | 90                   | 390.453
Xeksec      |                      | 1,198.133

Cybersecurity Research Example – Understanding Hacker Source Code
After running the model with the optimal number of topics, we can evaluate the output keywords and label each topic. We can further analyze the results by using the associated metadata to create temporal trends (a sketch follows), allowing us to discover interesting insights; e.g., SQL injections were popular in hacker forums in 2009, a time when many companies were worried about website backdooring. Overall, LDA allows researchers to automatically understand large amounts of hacker code and how specific types of code trend over time. It reduces the need for manual exploration and also shows the emerging exploits available in forums.
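One way this metadata-driven trend analysis could be implemented is sketched below, assuming pandas plus the hypothetical fitted `lda` and `dictionary` from the earlier sketches; `posts` is a placeholder DataFrame with 'date' and 'tokens' columns, not data from the study.

```python
import pandas as pd

def dominant_topic(tokens):
    # assign each post to the topic with the highest inferred proportion
    bow = dictionary.doc2bow(tokens)
    return max(lda.get_document_topics(bow), key=lambda tp: tp[1])[0]

posts["topic"] = posts["tokens"].map(dominant_topic)
posts["year"] = pd.to_datetime(posts["date"]).dt.year
# rows: years, columns: topics (e.g., an SQL-injection topic peaking in 2009)
trend = posts.groupby(["year", "topic"]).size().unstack(fill_value=0)
print(trend)
```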

LDA Variants: Relaxing the Assumptions of LDA
- Consider the order of the words (words in a document cannot be exchanged):
  - Conditioning on the previous word (Wallach 2006)
  - Hidden Markov Model (Griffiths et al. 2005)
- Consider the order of the documents:
  - Dynamic LDA (Blei & Lafferty 2006)
- Consider previously unseen topics (the number of topics is not fixed):
  - Bayesian Nonparametrics (Blei et al. 2010)

Dynamic LDA
Motivation: LDA assumes the order of documents does not matter, which is not appropriate for sequential corpora; we want to capture language change over time. In Dynamic LDA, topics evolve over time: a logistic normal distribution models how topics drift through time. Example: Blei, D. M., and Lafferty, J. D. 2006. "Dynamic topic models," in Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), pp. 113–120 (doi: 10.1145/1143844.1143859).
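For readers who want to experiment, gensim ships an implementation of Blei & Lafferty's dynamic topic model as LdaSeqModel. A hedged sketch, reusing the hypothetical `corpus` and `dictionary` from earlier and assuming the documents are sorted by time; the epoch sizes are illustrative and must sum to the corpus length.

```python
from gensim.models import LdaSeqModel

# time_slice gives the number of documents in each consecutive epoch
dtm = LdaSeqModel(corpus=corpus, id2word=dictionary,
                  time_slice=[200, 250, 300],   # illustrative; sums to len(corpus)
                  num_topics=10)
# inspect how topic 0's top words drift across the epochs
print(dtm.print_topic_times(topic=0))
```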

LDA Variants: Incorporating Metadata
Account for metadata of the documents (e.g., author, title, geographic location, links, etc.):
- Author-topic model (Rosen-Zvi et al. 2004). Assumption: the topic proportions are attached to authors; allows inferences about authors, for example, author similarity.
- Relational topic model (Chang & Blei 2010). Documents are linked (e.g., citation, hyperlink). Assumption: links between documents depend on the distance between their topic proportions; takes into account node attributes (the words of the document) in modeling the network links.
- Supervised topic model (Blei & McAuliffe 2007). A general-purpose method for incorporating metadata into topic models.

Supervised LDA
Supervised LDA (sLDA) is a topic model of documents and response variables; it is fit to find topics predictive of the response variable. Example: a 10-topic sLDA model on movie reviews (Pang and Lee, 2005), identifying the topics that correspond to ratings. Blei, D. M., and McAuliffe, J. D. 2008. "Supervised Topic Models," in Advances in Neural Information Processing Systems, pp. 121–128 (doi: 10.1002/asmb.540).

LDA Variants: Generalizing to Other Kinds of Data
LDA is a mixed-membership model of grouped data: rather than associating each group of data with one topic, each group exhibits multiple topics in different proportions. Hence, LDA can be adapted to other kinds of observations with only small changes to the corresponding inference algorithms.
- Population genetics. Application: finding ancestral populations. Intuition: each individual's genotype descends from one or more of the ancestral populations (topics).
- Computer vision. Application: classifying images, connecting images and captions, building image hierarchies, etc. Intuition: each image exhibits a combination of visual patterns, and the same visual patterns recur throughout a collection of images.

Future Directions
- Evaluation and model checking: Which topic model should I use? How can I decide which of the many modeling assumptions are important for my goals? How should I move between the many kinds of topic models that have been developed?
- Visualization and user interfaces: How should we display the topics? How can we best display a document with a topic model? How can we best display document connections? What is an effective interface to the whole corpus and its inferred topic structure?
- Topic models for data discovery: How can topic models help us form hypotheses about the data? What can we learn about the language based on the topic model posterior?

Topic Modeling - Tools

Name | Model/Algorithm | Language | Author | Notes
lda-c | Latent Dirichlet allocation | C | D. Blei | Implements variational inference for LDA.
class-slda | Supervised topic models for classification | C++ | C. Wang | Implements supervised topic models with a categorical response.
lda | R package for Gibbs sampling in many models | R | J. Chang | Implements many models and is fast. Supports LDA, RTMs (for networked documents), MMSB (for network data), and sLDA (with a continuous response).
tmve | Topic Model Visualization Engine | Python | A. Chaney | A package for creating corpus browsers.
dtm | Dynamic topic models and the influence model | | S. Gerrish | Implements topics that change over time and a model of how individual documents predict that change.
ctm-c | Correlated topic models | | | Implements variational inference for the CTM.
Mallet | LDA, Hierarchical LDA | Java | A. McCallum | Implements LDA and Hierarchical LDA.
Stanford Topic Modeling Toolbox | LDA, Labeled LDA, Partially Labeled LDA | | Stanford NLP Group | Implements LDA, Labeled LDA, and PLDA.