Topic Models
Nam Khanh Tran, L3S Research Center
27 May 2014

Acknowledgements
The slides are in part based on the following slide sets:
- "Probabilistic Topic Models", David M. Blei, 2012
- "Topic Models", Claudia Wagner
and on the papers:
- David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003
- Steyvers and Griffiths: Probabilistic Topic Models, 2006
- David M. Blei, John D. Lafferty: Dynamic Topic Models. Proceedings of the 23rd International Conference on Machine Learning

Outline
- Introduction
- Latent Dirichlet Allocation
  - Overview
  - The posterior distribution for LDA
  - Gibbs sampling
- Beyond Latent Dirichlet Allocation
- Demo

The problem with information
- As more information becomes available, it becomes more difficult to find and discover what we need
- We need new tools to help us organize, search, and understand these vast amounts of information

Topic modeling
Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives:
1) Discover the hidden themes that pervade the collection
2) Annotate the documents according to those themes
3) Use the annotations to organize, summarize, search, and form predictions

Discover topics from a corpus

Model the evolution of topics over time

Model connections between topics

Image annotation

Latent Dirichlet Allocation

Latent Dirichlet Allocation
- Introduction to LDA
- The posterior distribution for LDA
- Gibbs sampling

Probabilistic modeling
- Treat data as observations that arise from a generative probabilistic process that includes hidden variables
  - For documents, the hidden variables reflect the thematic structure of the collection
- Infer the hidden structure using posterior inference
  - What are the topics that describe this collection?
- Situate new data into the estimated model
  - How does a query or a new document fit into the estimated topic structure?

Intuition behind LDA

Generative model

The posterior distribution

Topic models (Steyvers, 2006)
Three latent variables:
- Word distribution per topic (word-topic matrix)
- Topic distribution per document (topic-document matrix)
- Topic assignment per word

Topic models  Observed variables :  Word-distribution per document  3 latent variables  Topic distribution per document : P(z) = θ (d)  Word distribution per topic: P(w, z) = φ (z)  Word-Topic assignment: P(z|w)  Training: Learn latent variables on trainings-collection of documents  Test: Predict topic distribution θ (d) of an unseen document d

Latent Dirichlet Allocation (LDA)
- Advantage: we learn the topic distributions of a corpus, so we can predict the topic distribution of an unseen document of this corpus by observing its words
- Hyperparameters α and β are corpus-level parameters and are only sampled once
- (Plate diagram: each topic's word distribution φ^(z) is drawn from P(φ^(z) | β); each word is drawn from P(w | z, φ^(z)); the plates are replicated over the number of documents and the number of words per document)
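To make the generative story concrete, here is a minimal Python sketch of the LDA generative process described above. The corpus size, vocabulary size, document length, and hyperparameter values are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

T, W, D = 2, 5, 3         # topics, vocabulary size, documents (assumed for the example)
alpha, beta = 2.0, 0.5    # symmetric Dirichlet hyperparameters (illustrative values)
doc_length = 10           # tokens per document (assumed)

# Corpus-level: one word distribution phi^(z) per topic, sampled only once from Dirichlet(beta)
phi = rng.dirichlet(np.full(W, beta), size=T)      # shape (T, W)

documents = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))       # topic distribution theta^(d) of document d
    words = []
    for _ in range(doc_length):
        z = rng.choice(T, p=theta)                 # choose a topic for this token
        w = rng.choice(W, p=phi[z])                # choose a word from that topic's distribution
        words.append(w)
    documents.append(words)

print(documents)   # each document is a list of word ids drawn from its own topic mixture
```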

Matrix representation of LDA
(Figure: the word-document matrix is observed, while the topic-document matrix θ^(d) and the word-topic matrix φ^(z) are latent)

Statistical inference and parameter estimation
- Key problem: compute the posterior distribution of the hidden variables given a document
- The posterior distribution is intractable for exact inference (Blei, 2003)
- (Equation: the posterior over the latent variables given the observed variables and priors)

Statistical inference and parameter estimation
- How can we estimate the posterior distribution of the hidden variables given a corpus of training documents?
  - Directly (e.g., via expectation maximization, variational inference, or expectation propagation algorithms)
  - Indirectly, i.e., estimate the posterior distribution over z (i.e., P(z))
    - Gibbs sampling, a form of Markov chain Monte Carlo, is often used to estimate the posterior probability over a high-dimensional random variable z

Markov chain example
- The random variable X refers to the weather; X_t is the value of X at time point t
- State space of X = {sunny, rain}
- Transition probability matrix:
  - P(sunny|sunny) = 0.9, P(rain|sunny) = 0.1
  - P(sunny|rain) = 0.5, P(rain|rain) = 0.5
- Today is sunny. What will the weather be tomorrow? The day after tomorrow?

Markov chain example
- With an increasing number of days n, the predictions for the weather tend towards a "steady-state vector" q
- q is independent of the initial conditions: it must be unchanged when transformed by P
- This makes q an eigenvector of P (with eigenvalue 1), and means it can be derived from P
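As an illustration (not part of the slides), the steady-state vector for this weather example can be computed numerically; the 30-day horizon is an arbitrary choice.

```python
import numpy as np

# Transition matrix P: rows = current state, columns = next state
# (state 0 = sunny, state 1 = rain), using the probabilities from the example above
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Start from "today is sunny" and repeatedly apply P
q = np.array([1.0, 0.0])
for _ in range(30):
    q = q @ P
print(q)                    # converges to roughly [0.833, 0.167]

# Equivalently, q is the left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
q_stat = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
q_stat = q_stat / q_stat.sum()
print(q_stat)               # roughly [0.833, 0.167]
```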

Gibbs Sampling  Generates a sequence of samples from the joint probability distribution of two or more random variables.  Aim: compute posterior distribution over latent variable z  Pre-request: we must know the conditional probability of z P( z i = j | z -i, w i, d i,. )

Gibbs sampling for LDA
- Random start
- Iterate: for each word we compute
  - How dominant is topic z in document d? (i.e., how often was topic z already used in document d?)
  - How likely is the word for topic z? (i.e., how often was word w already assigned to topic z?)

Run Gibbs sampling: Example (1)
1) Random topic assignments
2) Two count matrices:

   C^WT (words per topic):
            topic 1   topic 2
   money       3         2
   bank        3         6
   loan        2         1
   river       2         2
   stream     ...       ...

   C^DT (topics per document):
            doc 1   doc 2   doc 3
   topic 1    4       4       4
   topic 2    4       4       4

Gibbs sampling for LDA
The probability that topic j is chosen for word w_i, conditioned on all other topic assignments and all observed variables, is

  P(z_i = j | z_-i, w) ∝ (C^WT_{w_i,j} + β) / (Σ_w C^WT_{w,j} + W·β) · (C^DT_{d_i,j} + α) / (Σ_t C^DT_{d_i,t} + T·α)

- C^WT_{w_i,j} counts how often word token w_i was assigned to topic j across all documents
- C^DT_{d_i,j} counts how often topic j was already assigned to some word token in document d_i
- The expression is unnormalized: divide the probability of assigning topic j to word w_i by the sum over all T topics

Run Gibbs sampling
- Start: assign each word token to a random topic
  - C^WT = count of how often word token w_i was assigned to topic j
  - C^DT = count of how often topic j was assigned to some word token in document d_i
- First iteration:
  - For each word token, the count matrices C^WT and C^DT are first decremented by one for the entries that correspond to the current topic assignment
  - Then a new topic is sampled from the current topic distribution of the document, and C^WT and C^DT are incremented with the new topic assignment
- Each Gibbs sample consists of the set of topic assignments to all N word tokens in the corpus, achieved by a single pass through all documents
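A minimal Python sketch of this collapsed Gibbs sampling procedure, using the conditional distribution shown earlier; the function name, corpus representation (lists of word ids), and iteration count are assumptions made for the example.

```python
import numpy as np

def gibbs_lda(docs, W, T, alpha, beta, n_iter=200, seed=0):
    """Collapsed Gibbs sampler sketch for LDA (hypothetical helper, not from the slides).
    docs: list of documents, each a list of word ids in 0..W-1."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    CWT = np.zeros((W, T))              # C^WT: word-topic counts
    CDT = np.zeros((D, T))              # C^DT: document-topic counts
    z = []                              # current topic assignment of every token

    # Start: assign each word token to a random topic
    for d, doc in enumerate(docs):
        z_d = rng.integers(T, size=len(doc))
        z.append(z_d)
        for w, t in zip(doc, z_d):
            CWT[w, t] += 1
            CDT[d, t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = z[d][i]
                # decrement the counts for the current topic assignment
                CWT[w, t_old] -= 1
                CDT[d, t_old] -= 1
                # full conditional P(z_i = j | z_-i, w), up to normalization
                p = ((CWT[w] + beta) / (CWT.sum(axis=0) + W * beta)
                     * (CDT[d] + alpha) / (CDT[d].sum() + T * alpha))
                p = p / p.sum()
                t_new = rng.choice(T, p=p)
                # increment the counts with the new topic assignment
                CWT[w, t_new] += 1
                CDT[d, t_new] += 1
                z[d][i] = t_new
    return CWT, CDT, z
```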

Run Gibbs sampling: Example (2)
First iteration:
- Decrement C^DT and C^WT for the current topic j
- Sample a new topic from the current topic distribution of the document

Counts before the update:
- C^WT: money (3, 2), bank (3, 6), loan (2, 1), river (2, 2), stream (..., ...)
- C^DT: topic 1 = (4, 4, 4), topic 2 = (4, 4, 4) over (doc 1, doc 2, doc 3)

Run Gibbs sampling: Example (2), continued
- Decrement C^DT and C^WT for the current topic j
- Sample a new topic from the current topic distribution of the document

Updated counts after one token has been reassigned:
- C^WT: money (2, 3), bank (3, 6), loan (2, 1), river (2, 2), stream (..., ...)
- C^DT: topic 1 = (3, 4, 4), topic 2 = (5, 4, 4) over (doc 1, doc 2, doc 3)

Run Gibbs sampling: Example (3)
- Hyperparameters: α = 50/T = 25 and β = 0.01
- In this step, "bank" is assigned to Topic 2
- (The slide annotates the sampling formula with the counts: how often topic j was used in document d_i, and how often all other topics were used in document d_i)

Gibbs sampling: parameter estimation
Gibbs sampling estimates the posterior distribution over z, but we also need the word distribution φ of each topic and the topic distribution θ of each document. They are estimated from the count matrices:

  φ^(j)_i = (C^WT_{ij} + β) / (Σ_k C^WT_{kj} + W·β)   (the predictive distribution of sampling a new token of word i from topic j)

  θ^(d)_j = (C^DT_{dj} + α) / (Σ_k C^DT_{dk} + T·α)   (the predictive distribution of sampling a new token in document d from topic j)

Here C^WT_{ij} is the number of times word w_i was assigned to topic j (the denominator also sums the counts of all other words for topic j), and C^DT_{dj} is the number of times topic j was assigned to a token in document d (the denominator sums over all topics in d).
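A small sketch of these point estimates, assuming count matrices like those returned by the gibbs_lda sketch above; the helper name is hypothetical.

```python
import numpy as np

def estimate_phi_theta(CWT, CDT, alpha, beta):
    """Point estimates of phi (word distribution per topic) and theta (topic
    distribution per document) from the count matrices; hypothetical helper."""
    W, T = CWT.shape
    phi = (CWT + beta) / (CWT.sum(axis=0, keepdims=True) + W * beta)       # (W, T), each column sums to 1
    theta = (CDT + alpha) / (CDT.sum(axis=1, keepdims=True) + T * alpha)   # (D, T), each row sums to 1
    return phi, theta
```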

Example inference

Topics vs. words

Visualizing a document
- Use the posterior topic probabilities of each document and the posterior topic assignments to each word

Document similarity
- Two documents are similar if they assign similar probabilities to topics
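The slides do not prescribe a particular similarity measure; one common choice is the Hellinger distance between the per-document topic distributions θ^(d). A minimal sketch with made-up topic vectors:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two topic distributions (0 = identical, 1 = disjoint support)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Two hypothetical theta^(d) vectors over three topics
theta_doc1 = np.array([0.7, 0.2, 0.1])
theta_doc2 = np.array([0.6, 0.3, 0.1])
print(hellinger(theta_doc1, theta_doc2))   # small distance: the documents are topically similar
```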

Beyond Latent Dirichlet Allocation

Extending LDA
- LDA is a simple topic model
- It can be used to find topics that describe a corpus
- Each document exhibits multiple topics
- How can we build on this simple model of text?

Extending LDA
- LDA can be embedded in more complicated models, embodying further intuitions about the structure of the texts (e.g., accounting for syntax, authorship, dynamics, correlation, and other structure)
- The data-generating distribution can be changed: we can apply mixed-membership assumptions to many kinds of data (e.g., models of images, social networks, music, computer code, and other types)
- The posterior can be used in many ways (e.g., using inferences in IR, recommendation, similarity, visualization, and other applications)

Dynamic topic models
(Four figure slides illustrating dynamic topic models)

Long tail of data

Topic cropping (topic terms are from the original German corpus)
1) Corpus collection via search
2) Term selection: finding characteristic terms
3) Topic modeling with LDA, e.g.:
   - Topic 1: team, kollegen, ...
   - Topic 2: prozess, planung, ...
   - Topic 3: schicht, nacharbeit, ...
   - Topic 4: qualifizierung, lernen, ...
4) Topic inference based on the learned model (example output: Topic 2 and Topic 4)

Implementations of LDA
There are many available implementations of topic modeling:
- LDA-C: a C implementation of LDA
- Online LDA: a Python package for LDA on massive data
- LDA in R: an R package for many topic models
- Mallet: a Java toolkit for statistical NLP
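As a usage illustration, here is a hedged sketch with scikit-learn's LatentDirichletAllocation, which is not one of the toolkits listed above but a widely used Python alternative; the toy corpus is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (made up for illustration)
docs = ["the bank approved the loan",
        "money market and bank loan",
        "the river bank and the stream",
        "water flows from the stream to the river"]

X = CountVectorizer(stop_words="english").fit_transform(docs)   # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic proportions (theta)
topic_word = lda.components_        # unnormalized topic-word weights (related to phi)

print(doc_topics.round(2))
```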

Demo

Discussion