Unsupervised and Weakly-Supervised Probabilistic Modeling of Text Ivan Titov April 23, 2010

Outline  Topic Models: An Example of Latent Dirichlet Allocation  Learning: Expectation Maximization Algorithms  Learning: Gibbs Sampling

Problem setup  Given: a collection of documents  Want:  Detect the key ‘topics’ discussed in the collection  For each document, detect which ‘topics’ it discusses  Requirements:  No supervision (documents are not labeled)  Probabilistic methods

Motivation  Visualization of collections:  what are the topics discussed?  which documents discuss a topic?  Opinion mining:  what is the sentiment towards each aspect of a product?  what are the important aspects of a product?  Dimensionality reduction:  for information retrieval  for document classification  Summarization:  ensuring topic coverage in a summary  ...

Latent Semantic Analysis [Deerwester et al., 1990]  Decomposition (SVD) of the term-document co-occurrence matrix  Approximate the co-occurrence matrix by a low-rank factorization - Hope: terms with a common meaning are mapped to the same direction - Hope: documents discussing similar topics have similar representations - Non-zero inner products between documents with non-overlapping terms

Latent Semantic Analysis [Deerwester et al., 1990]  Optimal rank-k approximation (Frobenius norm): with the SVD X = U Σ V^T, the truncation X_k = U_k Σ_k V_k^T (keeping the top k singular values and vectors) minimizes ||X − X'||_F over all rank-k matrices X'
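To make the rank-k approximation concrete, here is a minimal numpy sketch (not from the lecture; the term-document matrix X and k = 2 are toy values chosen for illustration):

```python
import numpy as np

# Hypothetical 5-term x 4-document co-occurrence matrix (toy counts).
X = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.],
              [1., 1., 1., 1.]])

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # best rank-k approximation in Frobenius norm

# Low-dimensional document representations (one row per document).
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T
print(np.round(X_k, 2))
print(np.round(doc_vectors, 2))
```

Documents that share no terms can still end up with non-zero inner products between their doc_vectors, which is the behaviour the “hopes” above refer to.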

Latent Semantic Analysis [Deerwester et al., 1990]  Not motivated probabilistically (no clean underlying probability model)  No obvious interpretation of directions

Probabilistic LSA [Hofmann, 99]  Parameters:  Distribution of topics in each document, P(z | d), for every d  Distribution of words for every topic, P(w | z), z ∈ {1, …, K}  Generative story:  For each document d  For each word occurrence i in document d  Select topic z_i for the word from P(z_i | d)  Generate word w_i from P(w_i | z_i)  Note:  Words in the same document can be generated from different topics
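The generative story can be written down directly as sampling code. This is only an illustrative sketch with made-up parameters (the topic names, word probabilities and document mixtures below are toy assumptions, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# P(w | z) for three hypothetical topics.
topics = {
    "eruption": {"delays": 0.3, "volcanic": 0.3, "ash": 0.2, "cloud": 0.2},
    "sport":    {"teams": 0.4, "preparations": 0.3, "formula1": 0.3},
    "politics": {"obama": 0.4, "ceremony": 0.3, "attend": 0.3},
}
# P(z | d) for two hypothetical documents.
p_z_given_d = {"doc1": {"eruption": 0.6, "sport": 0.4, "politics": 0.0},
               "doc2": {"eruption": 0.3, "sport": 0.0, "politics": 0.7}}

def generate(doc, n_words):
    words = []
    for _ in range(n_words):
        z = rng.choice(list(p_z_given_d[doc]), p=list(p_z_given_d[doc].values()))  # topic z_i
        w = rng.choice(list(topics[z]), p=list(topics[z].values()))                # word w_i
        words.append(str(w))
    return words

print(generate("doc1", 8))
print(generate("doc2", 8))
```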

Probabilistic LSA: Example Document 1: “… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …” Document 2: “… Obama will not attend the ceremony due to delays caused by eruption …” Topics: Eruption, P(w | z = 1): delays, volcanic, volcano, ash, cloud, … Sport, P(w | z = 2): football, teams, ball, preparations, Formula1, … Politics, P(w | z = 3): Obama, Merkel, ceremony, attend, party, … Generative story: given the parameters, generate the text.

PLSA: Example  Note:  Words in the same document can be generated from different topics  The model does not take word order into account; the following two texts are guaranteed to have the same probability under the model: “delays due to the volcanic ash cloud will affect Formula1 teams’ preparations” and “to ash due Formula1 affect volcanic the will teams’ cloud preparations delays”
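A quick numeric check of this order-invariance (the per-word probabilities P(w | d), already marginalized over topics, are toy assumptions):

```python
import math

# Hypothetical P(w | d) values for one document.
p_w = {"delays": 0.05, "volcanic": 0.04, "ash": 0.04, "cloud": 0.03,
       "formula1": 0.02, "teams": 0.03, "preparations": 0.02}

def log_prob(words):
    # Bag-of-words log-probability: a product over positions, so order is irrelevant.
    return sum(math.log(p_w[w]) for w in words)

text     = ["delays", "volcanic", "ash", "cloud", "formula1", "teams", "preparations"]
shuffled = ["teams", "cloud", "delays", "formula1", "ash", "preparations", "volcanic"]
print(math.isclose(log_prob(text), log_prob(shuffled)))  # True: same probability
```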

We considered the forward problem: given the topic word distributions P(w | z = 1), P(w | z = 2), P(w | z = 3) (Eruption, Sport, Politics) and the document topic distributions, generate Document 1 and Document 2.

In fact we are solving the reverse problem: given the documents (Document 1 “… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …”, Document 2 “… Obama will not attend the ceremony due to delays caused by eruption …”), recover the unknown topic word distributions P(w | z) and the unknown document topic distributions P(z | d). We are going to talk about learning in a moment.

Directed Graphical Models (diagram: one node per word position in Doc 1 … Doc M, with N_1 … N_M word positions) - Roughly, arrows denote conditional dependencies - A generative story corresponds to a topological order on the graph

Plate Notation (diagram: the explicit graph for Doc 1 … Doc M with N_1 … N_M word positions is collapsed into two nested plates, an inner plate of size N over word positions and an outer plate of size M over documents)

Plate Notation (plates of size N and M): the topic is generated from P(z | d), and the word is then generated from P(w | z)

Distributions are also variables… (in the plate diagram, the K topic word distributions and the per-document topic distributions appear as nodes themselves)

If they are variables, maybe we can generate them from some other variables? We add a prior distribution for the topic distributions of documents and a prior distribution for the word distributions of topics. This is essentially the Latent Dirichlet Allocation (LDA) model (Blei et al., 2001): a hierarchical Bayesian generalization of PLSA

Latent Dirichlet Allocation  Parameters: two scalars, α and β (we will talk about the semantics of these parameters later)  Generative story:  For each topic z, draw a word distribution φ_z ~ Dirichlet(β)  For each document d  Draw a topic distribution θ_d ~ Dirichlet(α)  For each word occurrence i in document d  Select topic z_i for the word from θ_d  Generate word w_i from φ_{z_i}  Dirichlet distribution: think of it as a “distribution over distributions”
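The story above maps directly to sampling code. A minimal sketch under toy assumptions (K, V, the document sizes and the values of α and β are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 8, 2, 10        # topics, vocabulary size, documents, words per document
alpha, beta = 0.5, 0.1          # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)     # phi_z ~ Dirichlet(beta): word distribution per topic
docs = []
for d in range(M):
    theta = rng.dirichlet(np.full(K, alpha))      # theta_d ~ Dirichlet(alpha): topic distribution
    words = []
    for i in range(N):
        z = rng.choice(K, p=theta)                # select topic z_i from theta_d
        w = rng.choice(V, p=phi[z])               # generate word w_i from phi_{z_i}
        words.append(int(w))
    docs.append(words)
print(docs)
```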

Dirichlet distribution  Defines a distribution over (p_1, …, p_K) with p_k ≥ 0 and Σ_k p_k = 1, i.e., over the (K−1)-dimensional simplex  Since (p_1, …, p_K) can itself be regarded as a distribution, the Dirichlet is a “distribution over distributions”  Form of the Dirichlet density (with symmetric prior α): P(p_1, …, p_K | α) = [Γ(Kα) / Γ(α)^K] ∏_k p_k^{α−1}, where the first factor is the normalization constant (a Beta function)

Meaning of the hyperparameter  Consider the case K = 2 (i.e., the Beta distribution), pdf: α = 0.5: prefers “biased” distributions α = 2: prefers smooth distributions Often we want sparse distributions: though the collection may represent 1,000 topics, each document discusses only 2-3 topics. In this case we use a small α α can be regarded as “pseudo counts”: α = 2 means that you observed 1 example of each class
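A quick way to see the effect of α is to draw a few samples (toy values, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(np.full(K, alpha))
    print(f"alpha={alpha:>4}: {np.round(theta, 2)}")
# With alpha = 0.1 most of the mass typically falls on one or two components (sparse);
# with alpha = 10 the sampled distribution is close to uniform (smooth).
```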

Summary so far  We defined a generative model of document collections  Now, first we will consider how to do MAP estimation, i.e., how to find the parameters maximizing the posterior P(parameters | documents): the expectation maximization algorithm  Then we will consider Bayesian methods

Expectation Maximization  EM is a class of algorithms used to estimate parameters in the presence of missing (latent) variables  Non-convex optimization, therefore:  it converges to a local maximum of the likelihood function (or of the posterior distribution if we incorporate the prior)  it can be very sensitive to the starting point  In LDA, the missing variables are the topic assignments: we do not observe from which topic each word is generated

Three coin example  We observe a series of coin tosses generated in the following way:  A person has three coins.  Coin 0: probability of Head is α  Coin 1: probability of Head is p  Coin 2: probability of Head is q  Consider the following coin-tossing scenario

Three coin example  Scenario:  Toss coin 0 (do not show it to anyone!)  If Head, toss coin 1 M times; else, toss coin 2 M times  Only the series of tosses is observed:  HHHT, HTHT, HHHT, HTTH  What are the parameters of the coins (α, p, q)?  There is no closed-form solution to this problem

Key Intuition  If we knew which of the data points (HHHT), (HTHT), (HTTH) came from Coin 1 and which from Coin 2, estimating the parameters would be trivial.  Instead, use an iterative approach for estimating the parameters:  Guess the probability that each data point came from Coin 1 / Coin 2  Generate fictional labels, weighted according to these probabilities  Re-estimate the parameters: set them to maximize the likelihood of the augmented data  This process can be iterated and can be shown to converge to a local maximum of the likelihood function

EM-algorithm: coins (E-step)  We will assume (for a moment) that we know the parameters and use them to estimate which coin generated each data point  Then we will use these estimates to re-estimate the most likely parameters, and so on…  What is the probability that the i-th data point came from Coin 1? With h_i heads out of M tosses: P(Coin 1 | x_i) = α p^{h_i} (1 − p)^{M − h_i} / [ α p^{h_i} (1 − p)^{M − h_i} + (1 − α) q^{h_i} (1 − q)^{M − h_i} ]

EM-algorithm: coins (M-step)  At this point we would like to compute the likelihood of the data and find the parameters which maximize it  We will maximize the likelihood of the data (n data points)  But one of the variables, the coin name, is hidden  Instead, on each step we maximize the expectation of the log-likelihood over the coin name: E[log L] = Σ_i P_1^i log P(x_i, Coin 1) + (1 − P_1^i) log P(x_i, Coin 2), where P_1^i is the E-step posterior that data point i came from Coin 1

EM-algorithm: coins (M-step) Continue: expanding the joint probabilities gives E[log L] = Σ_i P_1^i [ log α + h_i log p + (M − h_i) log(1 − p) ] + (1 − P_1^i) [ log(1 − α) + h_i log q + (M − h_i) log(1 − q) ]

EM-algorithm: coins  Now find the most likely parameters by setting the derivatives to zero: α = (1/n) Σ_i P_1^i, p = Σ_i P_1^i h_i / (M Σ_i P_1^i), q = Σ_i (1 − P_1^i) h_i / (M Σ_i (1 − P_1^i))  Note that the posteriors P_1^i are fixed (computed from the previous estimate of the parameters)
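Putting the E-step and M-step together, here is a runnable sketch of EM for the three-coin model (the initial parameter values are arbitrary assumptions; the data are the toy sequences from the slides):

```python
import numpy as np

data = ["HHHT", "HTHT", "HHHT", "HTTH"]
heads = np.array([s.count("H") for s in data], dtype=float)   # h_i
M = len(data[0])                                              # tosses per sequence

alpha, p, q = 0.6, 0.7, 0.4      # initial guesses; EM is sensitive to the starting point
for _ in range(50):
    # E-step: posterior probability P_1^i that each sequence came from Coin 1.
    like1 = alpha * p**heads * (1 - p)**(M - heads)
    like2 = (1 - alpha) * q**heads * (1 - q)**(M - heads)
    post1 = like1 / (like1 + like2)
    # M-step: re-estimate the parameters from the expected (fractional) counts.
    alpha = post1.mean()
    p = (post1 * heads).sum() / (M * post1.sum())
    q = ((1 - post1) * heads).sum() / (M * (1 - post1).sum())

print(round(alpha, 3), round(p, 3), round(q, 3))
```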

Summary: EM  E-step: using the current parameters, compute the posterior distribution over the hidden variables  M-step: re-estimate the parameters so as to maximize the expected (complete-data) log-likelihood under these posteriors  Iterate until convergence

EM for PLSA / LDA (plate diagram: K topic word distributions, M documents, N word positions per document)

EM for PLSA / LDA  E-step: P(z | d, w) ∝ P(z | d) P(w | z)  M-step: P(w | z) ∝ Σ_d n(d, w) P(z | d, w) and P(z | d) ∝ Σ_w n(d, w) P(z | d, w)  For MAP estimation with Dirichlet priors, the M-step adds (α − 1) and (β − 1) to the expected counts; these pseudo counts can be negative when the hyperparameters are below 1… Ignore it for now. They can be regarded as smoothing
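For reference, a minimal EM loop for plain PLSA (maximum likelihood, without the Dirichlet pseudo counts); the count matrix, K and the number of iterations are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(6, 12)).astype(float)   # n(d, w): 6 documents x 12 word types
D, W = n_dw.shape
K = 3

p_z_d = rng.dirichlet(np.ones(K), size=D)     # P(z | d), shape (D, K)
p_w_z = rng.dirichlet(np.ones(W), size=K)     # P(w | z), shape (K, W)

for _ in range(100):
    # E-step: P(z | d, w) proportional to P(z | d) * P(w | z).
    joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]      # shape (D, W, K)
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate both distributions from expected counts n(d, w) * P(z | d, w).
    expected = n_dw[:, :, None] * post
    p_w_z = expected.sum(axis=0).T + 1e-12
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = expected.sum(axis=1) + 1e-12
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

print(np.round(p_z_d, 2))    # inferred topic mixture of each document
```

Adding the MAP pseudo counts mentioned above amounts to adding (α − 1) and (β − 1) to the expected counts before normalizing.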

Summary so far  We defined a generative model of document collections  We considered how to do MAP estimation of its parameters with the EM algorithm  Now we can visualize collections  Estimate topic distributions in each document

Example (Science collection)  Top 10 words for 10 topics out of 128 (ordered by P(w | z))

Example: topics of a document … Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …

Outline  Topic Models: An Example of Latent Dirichlet Allocation  Learning: Expectation Maximization Algorithms  Learning: Gibbs Sampling

Gibbs Sampling for PLSA / LDA  Resample the topic of each word position in turn, conditioned on all other assignments: P(z_i = j | z_−i, w) ∝ [ (n^{w_i}_{−i,j} + β) / (n_{−i,j} + Wβ) ] × [ (n^{d_i}_{−i,j} + α) / (n^{d_i}_{−i} + Kα) ]  The first factor corresponds to choosing word w_i for topic j at position i; the second to choosing topic j for document d_i

Gibbs Sampling for PLSA / LDA  We do not directly obtain the distributions P(z | d) and P(w | z)  Instead, we observe which words are assigned to which topics: “… Delays due to the volcanic ash cloud will affect Formula1 teams’ preparations …”  We can estimate the parameters from the sample (taking the pseudo counts of the priors into account)  For details on Gibbs sampling refer to Finding Scientific Topics, Griffiths and Steyvers, PNAS 2004
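A compact collapsed Gibbs sampler in the spirit of Griffiths and Steyvers (2004); the corpus, K, α and β below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1, 0], [3, 4, 3, 5, 4], [0, 2, 5, 3, 1]]   # word ids per document (toy corpus)
K, V = 2, 6
alpha, beta = 0.5, 0.1

n_dk = np.zeros((len(docs), K))      # topic counts per document
n_kw = np.zeros((K, V))              # word counts per topic
n_k = np.zeros(K)                    # total words assigned to each topic
z = [[0] * len(d) for d in docs]

for d, doc in enumerate(docs):       # random initialization of topic assignments
    for i, w in enumerate(doc):
        t = int(rng.integers(K))
        z[d][i] = t
        n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1

for _ in range(200):                 # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]              # remove the current assignment from the counts
            n_dk[d, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
            # P(z_i = j | rest) proportional to (n_kw + beta)/(n_k + V*beta) * (n_dk + alpha)
            p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
            t = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = t              # add the new assignment back
            n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1

theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)   # estimate of P(z | d)
phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)       # estimate of P(w | z)
print(np.round(theta, 2)); print(np.round(phi, 2))
```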

Summary  We considered the most standard “topic model”: PLSA / LDA  Reviewed two basic inference techniques:  EM  Collapsed Gibbs sampling  These methods are applicable to the majority of models we will consider in class

Formalities  Doodle poll: we will stay with Friday, 2pm  Paper selection:  Deadline extended  Not sure what we will do with the next class, April 30 (watch for announcements)  If we do not get all the papers selected, review selections may be affected

References  PLSA: Hofmann, Probabilistic Latent Semantic Indexing, SIGIR 1999  LDA (using variational inference): Blei, Ng, and Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, 2003  Collapsed Gibbs sampling for LDA: Griffiths and Steyvers, Finding Scientific Topics, PNAS 2004  EM (original paper): Dempster, Laird, and Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society B, 1977  EM, a note by Michael Collins: people.csail.mit.edu/mcollins/papers/wpeII.4.ps  Video tutorial on EM by Chris Bishop (see part 4)