Topic Models

1 Topic Models

2 Outline
Review of directed models
  Independencies, d-separation, and “explaining away”
  Learning for Bayes nets
Directed models for text
  Naïve Bayes models
  Latent Dirichlet allocation (LDA)

3 Review of Directed Models (aka Bayes Nets)

4 Directed Model = Graph + Conditional Probability Distributions

5 The Graph ⇒ (Some) Pairwise Conditional Independencies

6 Plate notation lets us denote complex graphs

7 Directed Models > HMMs
[Figure: a four-state HMM (S1-S4) with tables for the initial distribution P(S), the transition probabilities P(S'|S), and the emission probabilities P(X|S) over symbols a and c.]
In previous models, Pr(a_i) depended only on the symbols appearing within some distance before position i, not on the position i itself. To model drifting/evolving sequences we need something more powerful, and hidden Markov models provide one such option. Here states do not correspond to substrings, hence the name "hidden". There are two kinds of probabilities: transition probabilities, as before, but now also emission probabilities. Calculating Pr(seq) is not easy, since every symbol can potentially be generated from every state: there is no single path that generates the sequence, but many paths, each with some probability. However, it is easy to calculate the joint probability of a path and the emitted symbols, so we could enumerate all possible paths and sum their probabilities. We can do much better by exploiting the Markov property.

8 Directed Models > HMMs
[Figure: the same HMM as on the previous slide.]
Important point: we can compute Pr(S2 = t | aaca), so inference does not always "follow the arrows".
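The notes above end with the key idea: instead of enumerating paths, sum over them incrementally by exploiting the Markov property. That is the forward algorithm. A minimal sketch in Python; the transition and emission numbers below are illustrative placeholders, since the slide's tables are not fully recoverable from this transcript:

```python
# Forward algorithm: Pr(sequence) = sum over all state paths, computed
# incrementally using the Markov property.
# NOTE: these tables are made-up placeholders, not the slide's exact values.
init  = {"s": 1.0, "t": 0.0}                       # P(S1)
trans = {"s": {"s": 0.1, "t": 0.9},                # P(S'|S)
         "t": {"s": 0.9, "t": 0.1}}
emit  = {"s": {"a": 0.9, "c": 0.1},                # P(X|S)
         "t": {"a": 0.6, "c": 0.4}}

def forward(seq):
    # alpha[state] = P(x_1..x_i, S_i = state)
    alpha = {q: init[q] * emit[q][seq[0]] for q in init}
    for x in seq[1:]:
        alpha = {q: sum(alpha[p] * trans[p][q] for p in alpha) * emit[q][x]
                 for q in init}
    return sum(alpha.values())

print(forward("aaca"))   # Pr(aaca) under this toy HMM
```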

9 Some More DETAILS ON Directed Models
The example police say we’re in violation:
Insufficient use of the “Monty Hall” problem
Discussing Bayes nets without discussing burglar alarms

10 The (highly practical) Monty Hall problem
You’re in a game show. Behind one door is a car; behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing… a goat! You now can either stick with your guess or change doors. Stick, or swap?
Variables: A = first guess, B = the money, C = the revealed goat, D = stick or swap, E = second guess.
CPTs: P(A) and P(B) put 0.33 on each of doors 1-3; P(D) puts 0.5 on stick and 0.5 on swap; P(C|A,B) puts 0.5 on each remaining door when A = B, and 1.0 on the single remaining door otherwise.

11 A few minutes later, the goat from behind door C drives away in the car.

12 The (highly practical) Monty Hall problem
The network has five variables: A (first guess), B (the money), C (the revealed goat), D (stick or swap), and E (second guess), with CPTs P(A), P(B), P(C|A,B), P(D), and P(E|A,C,D).

13 The (highly practical) Monty Hall problem
We could construct the joint and compute P(E=B | D=swap) … again by the chain rule:
P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)
(Variables and CPTs as on the previous slides.)
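To make the factorization concrete, here is a small sketch (written for this transcript, not taken from the slides) that multiplies out the CPTs described above and answers P(E=B | D=swap) by brute-force enumeration of the joint:

```python
from itertools import product

doors = [1, 2, 3]

def p_c(c, a, b):
    # Monty opens a door that is neither the guess a nor the prize b.
    allowed = [d for d in doors if d != a and d != b]
    return 1.0 / len(allowed) if c in allowed else 0.0

def p_e(e, a, c, d):
    # Stick keeps the first guess; swap takes the remaining closed door.
    target = a if d == "stick" else next(x for x in doors if x not in (a, c))
    return 1.0 if e == target else 0.0

num = den = 0.0
for a, b, c, d, e in product(doors, doors, doors, ["stick", "swap"], doors):
    # Chain rule: P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D) P(E|A,C,D)
    p = (1/3) * (1/3) * p_c(c, a, b) * 0.5 * p_e(e, a, c, d)
    if d == "swap":
        den += p
        if e == b:
            num += p

print(num / den)   # P(E=B | D=swap) = 2/3
```

The result, 2/3, is the familiar answer: swapping wins twice as often as sticking.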

14 The (highly practical) Monty Hall problem
The joint table has…? 3*3*3*2*3 = 162 rows.
The conditional probability tables (CPTs) shown have…? …*3*3 + 2*3*3 = 51 rows < 162 rows.
Big questions:
Why are the CPTs smaller?
How much smaller are the CPTs than the joint?
Can we compute the answers to queries like P(E=B|d) without building the joint probability tables, just using the CPTs?

15 The (highly practical) Monty Hall problem
Why is the CPT representation smaller? Follow the money! (B)
E is conditionally independent of B given A, C, D.

16 Conditional Independence (again)
Definition: R and L are conditionally independent given M if, for all x, y, z in {T, F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally: let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
P(S1’s assignments | S2’s assignments & S3’s assignments) = P(S1’s assignments | S3’s assignments)

17 The (highly practical) Monty Hall problem
What are the conditional independencies?
I<A, {B}, C> ?
I<A, {C}, B> ?
I<E, {A,C}, B> ?
I<D, {E}, B> ?
(A = first guess, B = the money, C = the goat, D = stick or swap, E = second guess.)

18 What Independencies does a Bayes Net Model?
In order for a Bayesian network to model a probability distribution, the following must be true by definition: each variable is conditionally independent of all its non-descendants in the graph, given the value of all its parents. This implies that the joint factorizes into the CPTs, as in the chain-rule expression used earlier. But what else does it imply?

19 What Independencies does a Bayes Net Model?
Example: Given Y, does learning the value of Z tell us nothing at all new about X? I.e., is P(X|Y, Z) equal to P(X|Y)? Yes. Since we know the value of all of X’s parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z. Also, since independence is symmetric, P(Z|Y, X) = P(Z|Y).
[Graph: Z → Y → X.]

20 What Independencies does a Bayes Net Model?
Let I<X,Y,Z> represent “X and Z are conditionally independent given Y.” I<X,Y,Z>? Yes, just as in the previous example: all of X’s parents are given, and Z is not a descendant of X.
[Graph: X ← Y → Z.]

21 Things get a little more confusing
X has no parents, so we know all its parents’ values trivially. Z is not a descendant of X. So I<X,{},Z>, even though there’s an undirected path from X to Z through an unknown variable Y. What if we do know the value of Y, though? Or one of its descendants?
[Graph: X → Y ← Z.]

22 The “Burglar Alarm” example
Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn’t care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home’s burglar alarm is ringing. Uh oh!
[Graph: Burglar → Alarm ← Earthquake; Alarm → Phone Call.]

23 Things get a lot more confusing
But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all. Earthquake “explains away” the hypothetical burglar. But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!
[Graph: Burglar → Alarm ← Earthquake; Alarm → Phone Call.]

24 “Explaining away”
[Figure: X → E ← Y.]
This is “explaining away”: E is a common symptom of two causes, X and Y. After observing E=1, both X and Y become more probable. After observing E=1 and X=1, Y becomes less probable, since X alone is enough to “explain” E.
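A tiny numeric sketch of this effect on the burglar-alarm network (the priors and the alarm CPT below are invented values, chosen only to make the effect visible):

```python
from itertools import product

p_b, p_e = 0.01, 0.02                      # made-up priors for Burglar, Earthquake
alarm_on = {(0, 0): 0.001, (0, 1): 0.3,    # made-up P(Alarm=1 | B, E)
            (1, 0): 0.9,   (1, 1): 0.95}

def p_burglar_given_alarm(e_obs=None):
    """P(B=1 | Alarm=1), or P(B=1 | Alarm=1, E=e_obs) if e_obs is given."""
    num = den = 0.0
    for b, e in product([0, 1], [0, 1]):
        if e_obs is not None and e != e_obs:
            continue
        p = (p_b if b else 1 - p_b) * (p_e if e else 1 - p_e) * alarm_on[(b, e)]
        den += p
        if b == 1:
            num += p
    return num / den

print(p_burglar_given_alarm())        # alarm alone: belief in burglar jumps far above 0.01
print(p_burglar_given_alarm(e_obs=1)) # alarm + earthquake: the burglar is "explained away"
```

Observing the alarm alone raises the burglar probability far above its prior; additionally observing the earthquake pushes it most of the way back down, which is exactly the non-monotonic behavior described on the next slide.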

25 “Explaining away” and common-sense
Historical note: Classical logic is monotonic: the more you know, the more you can deduce. “Common-sense” reasoning is not monotonic:
birds fly
but not after being cooked for 20 min/lb at 350°F
This led to numerous “non-monotonic logics” for AI. This example shows that Bayes nets are also not monotonic: if P(Y|E) is “your belief” in Y after observing E, and P(Y|X,E) is “your belief” in Y after observing E and X, then your belief in Y can decrease after you discover X.

26 How can I make this less confusing?
But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all. Earthquake “explains away” the hypothetical burglar. But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!
[Graph: Burglar → Alarm ← Earthquake; Alarm → Phone Call.]

27 d-separation to the rescue
Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation.
Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is “blocked”, where a path is “blocked” iff one or more of the following conditions is true: …
I.e., X and Z are dependent iff there exists an unblocked path.

28 A path is “blocked” when...
There exists a variable V on the path such that
V is in the evidence set E, and
the arcs putting V in the path are “tail-to-tail”
(otherwise, unknown “common causes” of X and Z would impose a dependency)
Or, there exists a variable V on the path such that
V is in the evidence set E, and
the arcs putting V in the path are “tail-to-head”
(otherwise, unknown “causal chains” connecting X and Z would impose a dependency)
Or, ...

29 A path is “blocked” when… (the funky case)
… Or, there exists a variable V on the path such that
V is NOT in the evidence set E,
neither are any of V’s descendants, and
the arcs putting V on the path are “head-to-head”
(Known “common symptoms” of X and Z impose dependencies: X may “explain away” Z.)

30 Summary: d-separation
[Figure: three path types from X to Y through an intermediate node Z, one per blocking condition above.]
There are three ways paths from X to Y given evidence E can be blocked. X is d-separated from Y given E iff all paths from X to Y given E are blocked. If X is d-separated from Y given E, then I<X,E,Y>.
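The blocking rules can be turned directly into a test. Below is a minimal path-enumeration sketch (adequate for small graphs; production implementations typically use the Bayes-ball / reachability algorithm instead). The graph encoding and function names are this transcript's own, not from the slides:

```python
def descendants(graph, v):
    """All nodes reachable from v by following directed edges (graph: node -> children)."""
    out, stack = set(), [v]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def d_separated(graph, x, z, evidence):
    """True iff every undirected path from x to z is blocked given `evidence` (a set)."""
    parents = {n: set() for n in graph}
    for p, kids in graph.items():
        for k in kids:
            parents[k].add(p)
    neighbors = {n: set(graph[n]) | parents[n] for n in graph}

    def blocked(path):
        for i in range(1, len(path) - 1):
            prev, v, nxt = path[i - 1], path[i], path[i + 1]
            if prev in parents[v] and nxt in parents[v]:
                # head-to-head (collider): blocks unless v or a descendant is observed
                if v not in evidence and not (descendants(graph, v) & evidence):
                    return True
            elif v in evidence:
                # chain or fork: blocks when v itself is observed
                return True
        return False

    def paths(cur, visited):
        if cur == z:
            yield visited
            return
        for n in neighbors[cur]:
            if n not in visited:
                yield from paths(n, visited + [n])

    return all(blocked(p) for p in paths(x, [x]))

# Burglar-alarm network: Burglar -> Alarm <- Earthquake, Alarm -> Call
g = {"B": ["A"], "E": ["A"], "A": ["C"], "C": []}
print(d_separated(g, "B", "E", set()))   # True: I<Burglar, {}, Earthquake>
print(d_separated(g, "B", "E", {"C"}))   # False: observing the call unblocks the collider
```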

31 Learning for Bayes Nets

32 (Review) Breaking it down: Learning parameters for the “naïve” HMM
Training data defines a unique path through the HMM!
Transition probabilities: probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions from state i)
Emission probabilities: probability of emitting symbol k from state i = (number of times k is generated from i) / (number of transitions from i)
… with smoothing, of course.
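A minimal counting sketch of these estimates with add-one (Laplace) smoothing; the function and variable names are illustrative, not from the slides:

```python
from collections import Counter

def train_hmm(tagged_seqs, states, vocab, k=1.0):
    """tagged_seqs: list of [(word, state), ...]. Returns smoothed start/transition/emission tables."""
    trans, emit, start = Counter(), Counter(), Counter()
    for seq in tagged_seqs:
        prev = None
        for word, state in seq:
            emit[(state, word)] += 1
            if prev is None:
                start[state] += 1
            else:
                trans[(prev, state)] += 1
            prev = state

    # Smoothed conditional probability: (count + k) / (total + k * number of outcomes)
    def cond(counter, given, outcomes):
        total = sum(counter[(given, o)] for o in outcomes)
        return {o: (counter[(given, o)] + k) / (total + k * len(outcomes))
                for o in outcomes}

    P_start = {s: (start[s] + k) / (len(tagged_seqs) + k * len(states)) for s in states}
    P_trans = {s: cond(trans, s, states) for s in states}
    P_emit  = {s: cond(emit, s, vocab) for s in states}
    return P_start, P_trans, P_emit
```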

33 (Review) Breaking it down: NER using the “naïve” HMM
Define the HMM structure: one state per entity type.
Training data defines a unique path through the HMM for each labeled example; use this to estimate transition and emission probabilities.
At test time, for a sequence x:
use Viterbi to find the sequence of states s that maximizes Pr(s|x)
use s to derive labels for the sequence x
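A compact Viterbi sketch to go with this recipe, assuming the (P_start, P_trans, P_emit) tables produced by the counting sketch above (log-space to avoid underflow; smoothing keeps all probabilities nonzero):

```python
import math

def viterbi(x, states, P_start, P_trans, P_emit):
    """Most probable state sequence for observation sequence x."""
    V = [{s: math.log(P_start[s]) + math.log(P_emit[s][x[0]]) for s in states}]
    back = []
    for obs in x[1:]:
        scores, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: V[-1][p] + math.log(P_trans[p][s]))
            scores[s] = (V[-1][best] + math.log(P_trans[best][s])
                         + math.log(P_emit[s][obs]))
            ptr[s] = best
        V.append(scores)
        back.append(ptr)
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```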

34 Learning for Bayes nets ~ Learning for HMMS if everything is observed
Input:
a sample from the joint distribution
the graph structure of the variables: for i=1,…,N, you know X_i and parents(X_i)
Output: estimated CPTs
Learning method (discrete variables): estimate each CPT independently, using an MLE or MAP estimate.
[Figure: the example network A, B, C, D, E with CPTs such as P(B) and P(C|A,B).]
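A sketch of the fully observed case: each CPT is just a table of smoothed conditional frequencies of the child value given each parent configuration (the names below are illustrative):

```python
from collections import Counter

def estimate_cpt(samples, child, parents, child_values, alpha=1.0):
    """samples: list of dicts mapping variable name -> value.
    Returns P(child | parents) as {parent_config: {child_value: prob}}."""
    counts = Counter((tuple(s[p] for p in parents), s[child]) for s in samples)
    # Only parent configurations observed in the data are included in this sketch.
    configs = {tuple(s[p] for p in parents) for s in samples}
    cpt = {}
    for cfg in configs:
        total = sum(counts[(cfg, v)] for v in child_values)
        # MAP estimate with a symmetric Dirichlet prior (alpha pseudocounts per value)
        cpt[cfg] = {v: (counts[(cfg, v)] + alpha) / (total + alpha * len(child_values))
                    for v in child_values}
    return cpt

# e.g. estimate_cpt(samples, child="C", parents=["A", "B"], child_values=[1, 2, 3])
```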

35 Learning for Bayes nets ~ Learning for HMMS if some things are not observed
Input:
a sample from the joint, with some values unobserved
the graph structure of the variables: for i=1,…,N, you know X_i and parents(X_i)
Output: estimated CPTs
Learning method (discrete variables): use inference* to estimate the distribution of the unobserved values, then use EM.
* The HMM methods generalize to trees. I’ll talk about Gibbs sampling soon.
[Figure: the example network A, B, C, D, E with its CPTs.]

36 LDA and Other Directed Models FOR MODELING TEXT

37 Supervised Multinomial Naïve Bayes
Naïve Bayes model and its compact (plate) representation.
[Figure: class C with children W1, W2, W3, …, WN, repeated for M documents; equivalently, C and W drawn once inside plates of size N and M, with class-prior and word-distribution parameters (π, β).]

38 Supervised Multinomial Naïve Bayes
Naïve Bayes model and its compact (plate) representation, with K classes.
[Figure: same as the previous slide, with the word-distribution parameter β now inside a plate of size K.]

39 Review – supervised Naïve Bayes
Multinomial Naïve Bayes:
For each class i = 1..K: construct a multinomial β_i over words
For each document d = 1,…,M: generate C_d ~ Mult(·|π)
For each position n = 1,…,N_d: generate w_n ~ Mult(·|β, C_d) … or, if you prefer, w_n ~ Pr(w|C_d)
[Plate diagram: C and W1…WN per document, M documents, K word distributions β.]
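A minimal training sketch for this generative story: the class prior π is estimated from class frequencies and each β_k from smoothed within-class word frequencies (the names below are illustrative):

```python
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, vocab, beta_smooth=1.0):
    """docs: list of token lists; labels: class of each doc. Returns (pi, beta)."""
    M = len(docs)
    class_counts = Counter(labels)
    pi = {k: class_counts[k] / M for k in class_counts}          # class prior
    word_counts = defaultdict(Counter)
    for tokens, k in zip(docs, labels):
        word_counts[k].update(tokens)
    beta = {}
    for k in class_counts:
        total = sum(word_counts[k][w] for w in vocab)
        # Smoothed per-class word distribution
        beta[k] = {w: (word_counts[k][w] + beta_smooth) /
                      (total + beta_smooth * len(vocab)) for w in vocab}
    return pi, beta
```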

40 Review – unsupervised Naïve Bayes
Mixture model: EM solution.
E-step: [formula shown as a figure on the slide]
M-step: [formula shown as a figure on the slide]
Key capability: estimate the distribution of latent variables given the observed variables.
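The slide's E-step and M-step formulas were figures and are not reproduced here; the sketch below implements the standard EM updates for a mixture of multinomials (my own formulation, assumed rather than copied from the slide): responsibilities in the E-step, re-estimated mixing weights and word distributions in the M-step.

```python
import numpy as np

def em_mixture_multinomial(X, K, iters=50, smooth=1e-2, seed=0):
    """X: (M docs x V vocab) count matrix. Returns (pi, beta, responsibilities)."""
    rng = np.random.default_rng(seed)
    M, V = X.shape
    pi = np.full(K, 1.0 / K)                         # mixing weights
    beta = rng.dirichlet(np.ones(V), size=K)         # K x V word distributions
    for _ in range(iters):
        # E-step: responsibilities r[d, k] = P(Z=k | doc d), computed in log space
        log_r = np.log(pi) + X @ np.log(beta).T      # M x K unnormalized log posteriors
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and (smoothed) word distributions
        pi = r.sum(axis=0) / M
        beta = r.T @ X + smooth                      # K x V expected counts
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, r
```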

41 Review – unsupervised Naïve Bayes
Mixture model: the unsupervised naïve Bayes model.
Joint probability of words and classes: [formula shown as a figure]
But classes are not visible, so the class becomes a latent variable Z.
[Plate diagram: latent class Z, observed words W; N words per document, M documents, word distributions β.]

42 Beyond Naïve Bayes - Probabilistic Latent Semantic Indexing (PLSI)
Every document is a mixture of topics.
For i = 1…K: let β_i be a multinomial over words
For each document d: let θ_d be a distribution over {1,..,K}
For each word position in d: pick a topic z from θ_d, then pick a word w from β_z
Turns out to be hard to fit: lots of parameters! Also: only applies to the training data.
[Plate diagram: topic Z and word W per position; N words, M documents, K word distributions β.]

43 The LDA Topic Model

44 LDA Motivation
Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words).
For each document d = 1,…,M: generate θ_d ~ D1(…)
For each word n = 1,…,N_d: generate w_n ~ D2(·|θ_d)
Now pick your favorite distributions for D1, D2.
[Plate diagram: θ → w; N words, M documents.]

45 Latent Dirichlet Allocation
“Mixed membership” model: Latent Dirichlet Allocation
For each document d = 1,…,M: generate θ_d ~ Dir(·|α)
For each position n = 1,…,N_d: generate z_n ~ Mult(·|θ_d), then generate w_n ~ Mult(·|φ_{z_n})
[Plate diagram: α → θ → z → w ← φ, with K topic-word distributions φ_k; N words, M documents, and a Dirichlet prior β on each φ_k.]
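A sketch of this generative story as sampling code (numpy); the corpus size, vocabulary size, and hyperparameter values are all hypothetical inputs:

```python
import numpy as np

def generate_lda_corpus(M, N, V, K, alpha=0.1, beta=0.01, seed=0):
    """Sample a corpus from the LDA generative story.
    N may be a single document length or a list of per-document lengths."""
    rng = np.random.default_rng(seed)
    lengths = [N] * M if isinstance(N, int) else N
    phi = rng.dirichlet(np.full(V, beta), size=K)        # K topic-word distributions
    docs, thetas = [], []
    for d in range(M):
        theta_d = rng.dirichlet(np.full(K, alpha))       # topic mixture for doc d
        z = rng.choice(K, size=lengths[d], p=theta_d)    # topic for each position
        words = [rng.choice(V, p=phi[k]) for k in z]     # word for each position
        docs.append(words)
        thetas.append(theta_d)
    return docs, thetas, phi
```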

46 LDA’s view of a document

47 LDA topics

48 Review - LDA Latent Dirichlet Allocation Parameter learning:
Variational EM
  Numerical approximation using lower bounds
  Results in biased solutions
  Convergence has numerical guarantees
Gibbs sampling
  Stochastic simulation
  Unbiased solutions
  Stochastic convergence

49 Review - LDA Gibbs sampling – works for any directed model!
Applicable when the joint distribution is hard to evaluate but the conditional distributions are known.
The sequence of samples comprises a Markov chain, and the stationary distribution of the chain is the joint distribution.
Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.

50 Why does Gibbs sampling work?
What’s the fixed point? The stationary distribution of the chain is the joint distribution.
When will it converge (in the limit)? If the graph defined by the chain is connected.
How long will it take to converge? Depends on the second eigenvalue of that graph’s transition matrix.

51

52 Called “collapsed Gibbs sampling” since you’ve marginalized away some variables
From: “Parameter estimation for text analysis”, Gregor Heinrich

53 LDA Latent Dirichlet Allocation, “Mixed membership”
Randomly initialize each z_{m,n}
Repeat for t = 1,…:
For each doc m, word n:
find Pr(z_{m,n} = k | the other z’s)
sample z_{m,n} according to that distribution
[Plate diagram: α, z, w, φ (K topics), β; N words, M documents.]

54 EVEN More Detail On LDA…

55 Way way more detail

56 More detail

57

58

59 What gets learned…..

60 In A Math-ier Notation
Count tables used below: N[d,k] (words in document d assigned to topic k), N[*,k] (total words assigned to topic k), N[*,*] (all words), and W[w,k] (also written M[w,k]: times word w is assigned to topic k); V is the vocabulary size.

61 for each document d and word position j in d
z[d,j] = k, a random topic
N[d,k]++
W[w,k]++ where w = id of the j-th word in d

62 for each pass t = 1, 2, …:
for each document d and word position j in d:
z[d,j] = k, a new topic sampled from Pr(z[d,j] = k | the other z’s)
update N, W to reflect the new assignment of z:
N[d,k]++; N[d,k']-- where k' is the old z[d,j]
W[w,k]++; W[w,k']-- where w is w[d,j]
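Putting the initialization and the sweep together, here is a sketch of a complete collapsed Gibbs sampler in Python. The conditional P(z=k | everything else) ∝ (N[d,k]+α)·(W[w,k]+β)/(Σ_w W[w,k]+Vβ) is the standard smoothed form derived in Heinrich's note cited above; the variable names follow the count tables on the previous slides.

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, passes=200, seed=0):
    """docs: list of lists of word ids in [0, V). Returns count tables and assignments."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    N = np.zeros((M, K))            # N[d,k]: words in doc d assigned to topic k
    W = np.zeros((V, K))            # W[w,k]: times word w assigned to topic k
    Wk = np.zeros(K)                # total words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initialization
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            N[d, z[d][j]] += 1; W[w, z[d][j]] += 1; Wk[z[d][j]] += 1
    for _ in range(passes):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k_old = z[d][j]     # remove this word's current assignment
                N[d, k_old] -= 1; W[w, k_old] -= 1; Wk[k_old] -= 1
                # P(z = k | all other z, words), up to a constant
                p = (N[d] + alpha) * (W[w] + beta) / (Wk + V * beta)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][j] = k_new     # add the new assignment back into the counts
                N[d, k_new] += 1; W[w, k_new] += 1; Wk[k_new] += 1
    return N, W, z
```

After the passes, the learned distributions can be read off from the counts: φ[w,k] ≈ (W[w,k]+β)/(Σ_w W[w,k]+Vβ) and θ[d,k] ≈ (N[d,k]+α)/(N_d+Kα).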

63

64 Some comments on LDA Very widely used model
Also a component of many other models

