
1 CS246 Latent Dirichlet Allocation (LDA)

2 LSI
- LSI uses SVD to find the best rank-K approximation.
- The result is difficult to interpret, especially with negative numbers in the factor matrices.
- Q: Can we develop a more interpretable method?
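
To see why interpretability suffers, here is a minimal sketch (the toy matrix and the rank are made up for illustration): the rank-K factors that SVD produces routinely contain negative entries, which have no natural reading as term weights.

```python
import numpy as np

# Toy document-term matrix: 4 documents x 5 terms (raw counts).
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

# Best rank-2 approximation via SVD (LSI).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
K = 2
A_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# The "topic" directions Vt[:K] mix positive and negative weights,
# so they cannot be read directly as term distributions.
print(Vt[:K].round(2))
```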

3 Theory of LDA (Model-based Approach)
- Develop a simplified model of how users write a document based on topics.
- Fit the model to the existing corpus and "reverse engineer" the topics used in a document.
- Q: How do we write a document?
- A: (1) Pick the topic(s). (2) Start writing on the topic(s) with related terms.

4 Two Probability Vectors
- For every document d, we assume that the user first picks the topics to write about.
  - P(z|d): the probability of picking topic z when the user writes each word in document d.
  - This is the document-topic vector of d.
- We also assume that every topic is associated with each term with a certain probability.
  - P(w|z): the probability of picking the term w when the user writes on topic z.
  - This is the topic-term vector of z.

5 Probabilistic Topic Model
- There exist T topics.
- The topic-term vector for each topic is set before any document is written.
  - P(w_j|z_i) is set for every z_i and w_j.
- Then, for every document d:
  - The user decides the topics to write on, i.e., P(z_i|d).
  - For each word in d:
    - The user selects a topic z_i with probability P(z_i|d).
    - The user selects a word w_j with probability P(w_j|z_i).
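
A minimal sketch of this two-step sampling process; the vocabulary, topics, and probabilities below are illustration values, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["river", "stream", "bank", "money", "loan"]   # assumed toy vocabulary
P_w_given_z = np.array([[1/3, 1/3, 1/3, 0.0, 0.0],     # topic 0: a "river" topic
                        [0.0, 0.0, 1/3, 1/3, 1/3]])    # topic 1: a "money" topic
P_z_given_d = np.array([0.7, 0.3])                     # document-topic vector for one document d

words = []
for _ in range(10):                                    # 10 words in document d
    z = rng.choice(2, p=P_z_given_d)                   # step 1: pick a topic
    w = rng.choice(len(vocab), p=P_w_given_z[z])       # step 2: pick a word from that topic
    words.append(vocab[w])
print(words)
```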

6 Probabilistic Document Model
[Figure: two topic-term vectors P(w|z), Topic 1 ("money", "bank", "loan") and Topic 2 ("river", "stream", "bank"), generate three documents with document-topic weights P(z|d) (values 1.0 and 0.5 shown). Each word in DOC 1-3 is labeled with the topic (1 or 2) it was drawn from; one document uses only Topic 1, one only Topic 2, and one mixes both.]

7 Example: Calculating Probability
- z_1 = {w_1: 0.8, w_2: 0.1, w_3: 0.1}, z_2 = {w_1: 0.1, w_2: 0.2, w_3: 0.7}
- d's topics are {z_1: 0.9, z_2: 0.1}; d contains the terms {w_3: 2, w_1: 1, w_2: 1}, i.e., w_3 twice, w_1 once, w_2 once.
- Q: What is the probability that a user will write such a document?
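
One way to carry out the calculation (the answer is not in the transcript): under the model each word of d is generated independently, with P(w|d) = sum over z of P(z|d) P(w|z), so, up to a multinomial coefficient for the possible word orderings:

```latex
\begin{aligned}
P(w_1 \mid d) &= 0.9 \cdot 0.8 + 0.1 \cdot 0.1 = 0.73\\
P(w_2 \mid d) &= 0.9 \cdot 0.1 + 0.1 \cdot 0.2 = 0.11\\
P(w_3 \mid d) &= 0.9 \cdot 0.1 + 0.1 \cdot 0.7 = 0.16\\
P(d) &\propto P(w_1 \mid d)\, P(w_2 \mid d)\, P(w_3 \mid d)^2
      = 0.73 \cdot 0.11 \cdot 0.16^2 \approx 0.0021
\end{aligned}
```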

8 Corpus Generation Probability
- T: number of topics
- D: number of documents
- M: number of words per document
- Probability of generating the corpus C: see the reconstruction below.
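
The formula itself did not survive the transcript; a plausible reconstruction, summing out each word's topic assignment, is:

```latex
P(C) = \prod_{i=1}^{D} \prod_{j=1}^{M} P(w_{ij} \mid d_i)
     = \prod_{i=1}^{D} \prod_{j=1}^{M} \sum_{k=1}^{T} P(z_k \mid d_i)\, P(w_{ij} \mid z_k)
```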

9 Generative Model vs Inference (1)
[Figure: the generative direction. The topic-term vectors P(w|z) and the document-topic weights P(z|d) are known, and the three documents, with each word's topic label, are generated from them, as on slide 6.]

10 Generative Model vs Inference (2)
[Figure: the inference direction. Only the documents are observed; the topic-term vectors, the document-topic weights, and each word's topic label are all unknown (shown as "?") and must be inferred.]

11 Probabilistic Latent Semantic Indexing (pLSI)
- Basic idea: pick the P(z_j|d_i), P(w_k|z_j), and z_ij values that maximize the corpus generation probability.
- This is maximum-likelihood estimation (MLE).
- More discussion later on how to compute the P(z_j|d_i), P(w_k|z_j), and z_ij values that maximize this probability.

12 Problem of pLSI
- Q: 1M documents, 1000 topics, a vocabulary of 1M words, 1000 words/doc. How much input data do we have? How many parameters do we have to estimate?
- Q: There is too much freedom. How can we avoid overfitting?
- A: Add constraints to reduce the degrees of freedom.
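
Working the numbers out (the slide leaves them as a question): the corpus is about 10^9 word tokens, and pLSI has roughly the same number of free parameters, so there is little to constrain the fit.

```latex
\begin{aligned}
\text{input data} &\approx 10^{6}\ \text{docs} \times 10^{3}\ \text{words/doc} = 10^{9}\ \text{word tokens}\\
P(z \mid d)       &:\ 10^{6}\ \text{docs} \times 10^{3}\ \text{topics} = 10^{9}\ \text{parameters}\\
P(w \mid z)       &:\ 10^{3}\ \text{topics} \times 10^{6}\ \text{terms} = 10^{9}\ \text{parameters}
\end{aligned}
```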

13 Latent Dirichlet Allocation (LDA)
- When term probabilities are selected for each topic:
  - The topic-term probability vector (P(w_1|z_j), ..., P(w_W|z_j)) is sampled randomly from a Dirichlet distribution.
- When users select topics for a document:
  - The document-topic probability vector (P(z_1|d), ..., P(z_T|d)) is sampled randomly from a Dirichlet distribution.

14 What is the Dirichlet Distribution?
- Multinomial distribution
  - Given the probability p_i of each event e_i, what is the probability that each event e_i occurs α_i times after n trials?
  - We assume the p_i's; the distribution assigns probability to the α_i's.
- Dirichlet distribution
  - The "inverse" of the multinomial distribution: we assume the α_i's; the distribution assigns probability to the p_i's.
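
For reference (the formula is not in the transcript), the multinomial probability of seeing each event e_i exactly α_i times in n = α_1 + ... + α_k trials is the standard:

```latex
P(\alpha_1,\dots,\alpha_k \mid p_1,\dots,p_k)
  = \frac{n!}{\alpha_1!\,\alpha_2!\cdots\alpha_k!}\;
    p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_k^{\alpha_k}
```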

15 Dirichlet Distribution
- Q: Given α_1, α_2, ..., α_k, what are the most likely p_1, p_2, ..., p_k values?
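
The answer is not spelled out in the transcript. If the slides' "non-standard" density proportional to p_1^{α_1} ... p_k^{α_k} is meant (see slide 21), the maximizer is the normalized parameter vector shown below; under the standard density (exponents α_i - 1), the mode is instead (α_i - 1)/(Σ_j α_j - k), provided every α_i > 1.

```latex
p_i^{*} = \frac{\alpha_i}{\sum_{j=1}^{k} \alpha_j}
```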

16 Normalized Probability Vector and Simplex
- Remember that p_i >= 0 and p_1 + ... + p_n = 1.
- When (p_1, ..., p_n) satisfies p_1 + ... + p_n = 1, the points lie on a "simplex plane".
- [Figure: (p_1, p_2, p_3) and their 2-simplex plane, the triangle where the three non-negative coordinates sum to 1.]

17 Effect of α values
[Figure: two Dirichlet densities over the (p_1, p_2, p_3) simplex for different settings of the α values.]

18 Effect of α values
[Figure: the same comparison for other settings of the α values.]

19 Effect of α values
[Figure: the same comparison for other settings of the α values.]

20 Effect of α values
[Figure: the same comparison for other settings of the α values.]

21 Minor Correction
- The formula used on the previous slides is not the "standard" Dirichlet distribution; the standard formula is given below.
- The non-standard form was used to make the connection to the multinomial distribution clear.
- From now on, we use the standard formula.
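
The slide's formula did not survive the transcript; the standard Dirichlet density over a probability vector (p_1, ..., p_k) with parameters (α_1, ..., α_k) is:

```latex
f(p_1,\dots,p_k;\ \alpha_1,\dots,\alpha_k)
  = \frac{\Gamma\!\bigl(\sum_{i=1}^{k}\alpha_i\bigr)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}
    \prod_{i=1}^{k} p_i^{\alpha_i - 1},
  \qquad p_i \ge 0,\quad \sum_{i=1}^{k} p_i = 1
```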

22 Back to the LDA Document Generation Model
- For each topic z:
  - Pick the word probability vector, the P(w_j|z)'s, by taking a random sample from Dir(β_1, ..., β_W).
- For every document d:
  - The user decides its topic vector, the P(z_i|d)'s, by taking a random sample from Dir(α_1, ..., α_T).
  - For each word in d:
    - The user selects a topic z with probability P(z|d).
    - The user selects a word w with probability P(w|z).
- Once all is said and done, we have:
  - P(w_j|z): the topic-term vector for each topic
  - P(z_i|d): the document-topic vector for each document
  - A topic assignment for every word in each document
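
A minimal sketch of this generative process; the corpus size, vocabulary size, T, α, and β below are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(1)

T, W, D, M = 2, 5, 16, 16          # topics, vocabulary size, documents, words per document
alpha, beta = 0.5, 0.5             # symmetric Dirichlet parameters (assumed values)

# Topic-term vectors: one draw from Dir(beta, ..., beta) per topic.
phi = rng.dirichlet(np.full(W, beta), size=T)      # shape (T, W); rows are P(w|z)

corpus = []
for _ in range(D):
    # Document-topic vector: one draw from Dir(alpha, ..., alpha) per document.
    theta = rng.dirichlet(np.full(T, alpha))       # P(z|d)
    doc = []
    for _ in range(M):
        z = rng.choice(T, p=theta)                 # pick a topic for this word
        w = rng.choice(W, p=phi[z])                # pick a word from that topic
        doc.append((w, z))                         # keep the word and its topic assignment
    corpus.append(doc)

print(corpus[0])
```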

23 Symmetric Dirichlet Distribution
- In principle, we need two vectors, (α_1, ..., α_T) and (β_1, ..., β_W), as input parameters.
- In practice, we often assume all α_i's are equal to α and all β_i's are equal to β.
  - That is, we use two scalar values α and β, not two vectors.
  - This is the symmetric Dirichlet distribution.
- Q: What is the implication of this assumption?

24 Effect of the α value on the Symmetric Dirichlet
- Q: What does it mean? How will the sampled document-topic vectors change as α grows?
- Common choice: α = 50/T, β = 200/W
[Figure: symmetric Dirichlet densities over the (p_1, p_2, p_3) simplex for two different α values.]

25 Plate Notation
[Figure: the LDA model in plate notation, with hyperparameters α and β, nodes for P(z|d) and P(w|z), the latent topic z and observed word w, and plates of sizes T, M, and N.]

26 LDA as Topic Inference
- Given a corpus:
  d_1: w_11, w_12, ..., w_1m
  ...
  d_N: w_N1, w_N2, ..., w_Nm
- Find the P(z|d), P(w|z), and z_ij that are most "consistent" with the given corpus.
- Q: How can we compute such P(z|d), P(w|z), and z_ij?
- The best method so far is the Monte Carlo method together with Gibbs sampling.

27 Monte Carlo Method (1)
- A class of methods that compute a number through repeated random sampling of certain event(s).
- Q: How can we compute π?
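
A minimal sketch of the classic answer, estimating π from the fraction of random points that land inside the unit circle; the sample count and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Sample points uniformly in the unit square [0, 1) x [0, 1).
x, y = rng.random(n), rng.random(n)

# The quarter circle x^2 + y^2 <= 1 covers pi/4 of the square,
# so 4 * (fraction of points inside) estimates pi.
pi_estimate = 4 * np.mean(x**2 + y**2 <= 1.0)
print(pi_estimate)
```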

28 Monte Carlo Method (2)
1. Define the domain of possible events.
2. Generate events randomly from the domain using a certain probability distribution.
3. Perform a deterministic computation using the events.
4. Aggregate the results of the individual computations into the final result.
- Q: How can we take random samples from a particular distribution?

29 Gibbs Sampling
- Q: How can we take a random sample x from the distribution f(x)?
- Q: How can we take a random sample (x, y) from the joint distribution f(x, y)?
- Gibbs sampling: given the current sample (x_1, ..., x_n), pick an axis x_i and take a random sample of the x_i value conditioned on all the other (x_1, ..., x_n) values.
- In practice, we iterate over the x_i's sequentially.
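
A minimal Gibbs sampling sketch for a toy target chosen here for illustration: a bivariate Gaussian with correlation ρ, whose one-dimensional conditionals f(x|y) and f(y|x) are easy to sample.

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8                      # correlation of the toy bivariate Gaussian target
n_samples = 5000

samples = np.empty((n_samples, 2))
x, y = 0.0, 0.0                # arbitrary starting point
for i in range(n_samples):
    # Sample each coordinate from its conditional given the other:
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples[i] = (x, y)

# The empirical correlation should come out close to rho.
print(np.corrcoef(samples.T)[0, 1])
```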

30 Markov-Chain Monte-Carlo Method (MCMC)
- Gibbs sampling belongs to the class of Markov-chain sampling methods.
  - The next sample depends only on the current sample.
- Markov-Chain Monte-Carlo method: generate random events using Markov-chain sampling and apply the Monte Carlo method to compute the result.

31 Applying MCMC to LDA
- Let us apply the Monte Carlo method to estimate the LDA parameters.
- Q: How can we map the LDA inference problem to random events?
- We first focus on identifying the topics {z_ij} for each word {w_ij}.
- Event: assignment of the topics z_ij to the words w_ij. The assignment should be done according to P({z_ij}|C).
- Q: How do we sample according to P({z_ij}|C)?
- Q: Can we use Gibbs sampling? How will it work?
- Q: What is P(z_ij|{z_-ij}, C)?

32
- n_wt: how many times the word w has been assigned to the topic t
- n_dt: how many words in the document d have been assigned to the topic t
- Q: What is the meaning of each term in the formula? (The formula itself is reconstructed below.)
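
The slide's conditional is missing from the transcript. The standard collapsed Gibbs sampling conditional for LDA (a reconstruction, assuming symmetric hyperparameters α and β, with the counts n_wt and n_dt taken after removing w_ij's current assignment) is:

```latex
P(z_{ij} = t \mid \{z_{-ij}\}, C)
  \;\propto\;
  \frac{n_{wt} + \beta}{\sum_{w'} n_{w't} + W\beta}
  \cdot
  \frac{n_{dt} + \alpha}{\sum_{t'} n_{dt'} + T\alpha}
```

The first factor measures how strongly topic t is already associated with the word w; the second measures how prevalent topic t already is in document d.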

33 LDA with Gibbs Sampling
- For each word w_ij:
  - Decrease n_wt and n_dt by 1 for the word's current topic t.
  - Assign w_ij to a topic t with probability proportional to the conditional above.
  - Increase n_wt and n_dt by 1 for the newly assigned topic t.
- Repeat the process many times (at least hundreds of iterations).
- Once the process is over, we have:
  - z_ij for every w_ij
  - n_wt and n_dt
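
A minimal collapsed Gibbs sampler along the lines of this slide (a sketch under the symmetric α, β assumption, not the course's reference implementation); docs is a list of documents, each a list of word ids in [0, W).

```python
import numpy as np

def lda_gibbs(docs, W, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on a toy corpus."""
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((W, T))                 # word-topic counts
    n_dt = np.zeros((len(docs), T))         # document-topic counts
    n_t = np.zeros(T)                       # total words assigned to each topic
    z = [[0] * len(doc) for doc in docs]

    # Random initial topic assignment for every word.
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            t = rng.integers(T)
            z[d][j] = t
            n_wt[w, t] += 1
            n_dt[d, t] += 1
            n_t[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                t = z[d][j]
                # Remove the word's current assignment from the counts.
                n_wt[w, t] -= 1
                n_dt[d, t] -= 1
                n_t[t] -= 1
                # Sample a new topic from the collapsed conditional
                # (the per-document denominator is constant in t and is dropped).
                p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
                p /= p.sum()
                t = rng.choice(T, p=p)
                # Add the new assignment back into the counts.
                z[d][j] = t
                n_wt[w, t] += 1
                n_dt[d, t] += 1
                n_t[t] += 1

    phi = (n_wt + beta) / (n_wt.sum(axis=0) + W * beta)   # estimated P(w|z), one column per topic
    return z, n_wt, n_dt, phi
```

For the toy corpus of the simulation slides below, a call like lda_gibbs(docs, W=5, T=2) would return the per-word topic assignments, the count matrices, and an estimate of the topic-term vectors.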

34 Result of LDA (Latent Dirichlet Allocation)
- TASA corpus: 37,000 text passages from educational materials, collected by Touchstone Applied Science Associates
- Set T = 300 (300 topics)

35 Inferred Topics

36 Word Topic Assignments

37 LDA Algorithm: Simulation
- Two topics: River, Money. Five words: "river", "stream", "bank", "money", "loan".
- Generate 16 documents by randomly mixing the two topics and using the LDA model.
- Topic-term matrix used for generation:

          river  stream  bank  money  loan
  River    1/3    1/3    1/3
  Money                  1/3    1/3   1/3

38 Generated Documents and Initial Topic Assignment before Inference
- The first 6 and the last 3 documents are purely from one topic; the others are mixtures.
- [Figure: the 16 documents with each word marked by its initial topic assignment. White dot: "River". Black dot: "Money".]

39 Topic Assignment After LDA Inference
- The first 6 and the last 3 documents are purely from one topic; the others are mixtures.
- [Figure: the same 16 documents after 64 iterations, with each word marked by its inferred topic.]

40 Inferred Topic-Term Matrix
- Model parameter (used to generate the data):

          river  stream  bank  money  loan
  River    0.33   0.33   0.33
  Money                  0.33   0.33   0.33

- Estimated parameter:

          river  stream  bank  money  loan
  River    0.25   0.40   0.35
  Money                  0.32   0.29   0.39

- Not perfect, but very close, especially given the small data size.

41 SVD vs LDA
- Both perform the following decomposition:
  [doc × term] = [doc × topic] × [topic × term]
- SVD views this as matrix approximation.
- LDA views this as probabilistic inference based on a generative model.
  - Each entry corresponds to a "probability": better interpretability.

42 LDA as Soft Classification
- Soft vs. hard clustering/classification
- After LDA, every document is assigned to a small number of topics with some weights.
  - Documents are not assigned exclusively to a single topic.
  - This is soft clustering.

43 Summary
- Probabilistic topic model
- Generative model of documents
- pLSI and overfitting
- LDA, MCMC, and probabilistic interpretation
- Statistical parameter estimation
- Multinomial distribution and Dirichlet distribution
- Monte Carlo method
- Gibbs sampling
- Markov-chain class of sampling

