Text-classification using Latent Dirichlet Allocation - intro graphical model Lei Li


1 Text-classification using Latent Dirichlet Allocation - intro graphical model Lei Li leili@cs

2 Outline
Introduction
Unigram model and mixture of unigrams
Text classification using LDA
Experiments
Conclusion

3 Text Classification
What class can you tell, given a doc?
Doc A: "… the New York Stock Exchange … America's Nasdaq … buy … bank debt loan interest billion buy …" → finance
Doc B: "… Iraq war weapon army AK-47 bomb …" → military

4 Why do DB people care?
The model could be adapted to other discrete random variables:
–Disk failures
–User access patterns
–Social networks, tags
–Blogs

5 Document
"Bag of words": word order is ignored.
d = (w_1, w_2, …, w_N)
Each w_i takes one value in 1…V (1-of-V scheme)
V: vocabulary size
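The bag-of-words encoding above can be sketched in a few lines. A minimal illustration, assuming a toy six-word vocabulary (the names are stand-ins, not the slides' actual vocabulary):

```python
# Bag-of-words: map each word to an index in 1..V (here 0..V-1)
# and keep only counts -- word order is discarded.
from collections import Counter

vocab = ["bank", "debt", "interest", "war", "army", "weapon"]  # toy V = 6
word_index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(doc_tokens):
    """Return a length-V count vector; token order is ignored."""
    counts = Counter(t for t in doc_tokens if t in word_index)
    return [counts[w] for w in vocab]

d = bag_of_words(["bank", "debt", "bank", "interest"])
# d == [2, 1, 1, 0, 0, 0]
```

Note that any token outside the vocabulary is simply dropped, which matches the fixed-V assumption on the slide.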

6 Modeling Documents
Unigram: a single multinomial distribution
Mixture of unigrams
LDA
Others: pLSA, bigram models

7 Unigram Model for Classification
Y is the class label; d = {w_1, w_2, …, w_N}.
Use Bayes' rule: P(Y|d) ∝ P(Y) P(d|Y) = P(Y) ∏_n P(w_n|Y).
The document given the class is modeled as a multinomial distribution, estimated from word frequencies.
(Plate diagram: Y → w, repeated N times.)

8 Unigram: example
P(w|Y)   | bank   | debt   | interest | war    | army   | weapon
finance  | 0.2    | 0.15   | 0.1      | 0.0001 | 0.0001 | 0.0001
military | 0.0001 | 0.0001 | 0.0001   | 0.1    | 0.15   | 0.2

P(Y): finance 0.6, military 0.4
d = bank × 100, debt × 110, interest × 130, war × 1, army × 0, weapon × 0
P(finance|d) = ?  P(military|d) = ?
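The example can be computed directly with Bayes' rule in log space. A minimal sketch using the slide's P(Y) and P(w|Y) values (the exact cell placement is reconstructed from the transcript's collapsed table):

```python
# Unigram classifier: argmax_Y  log P(Y) + sum_w n_w * log P(w|Y)
import math

p_y = {"finance": 0.6, "military": 0.4}
# P(w|Y), columns: bank, debt, interest, war, army, weapon
p_w_given_y = {
    "finance":  {"bank": 0.2, "debt": 0.15, "interest": 0.1,
                 "war": 0.0001, "army": 0.0001, "weapon": 0.0001},
    "military": {"bank": 0.0001, "debt": 0.0001, "interest": 0.0001,
                 "war": 0.1, "army": 0.15, "weapon": 0.2},
}
d = {"bank": 100, "debt": 110, "interest": 130, "war": 1, "army": 0, "weapon": 0}

def log_posterior(y):
    """Unnormalized log P(Y|d); logs avoid underflow from 340+ word factors."""
    return math.log(p_y[y]) + sum(n * math.log(p_w_given_y[y][w])
                                  for w, n in d.items())

best = max(p_y, key=log_posterior)
# best == "finance": the 340 finance-typical words dominate the single "war"
```

The single occurrence of "war" barely moves the score, which is the point of the example: the multinomial likelihood is driven by the bulk of the counts.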

9 Mixture of Unigrams for Classification
For each class, assume k topics.
Each topic is a multinomial distribution over words; under each topic, every word is drawn from that multinomial.
(Plate diagram: Y → z → w, repeated N times.)

10 Mixture of unigrams: example
d = bank × 100, debt × 110, interest × 130, war × 1, army × 0, weapon × 0
P(Y): finance 0.6, military 0.4
P(z|Y): finance: z_1 0.3, z_2 0.7; military: z_1 0.5, z_2 0.5
P(w|z,Y) (columns: bank, debt, interest, war, army, weapon):
finance, z_1:  0.01,   0.15,   0.1,    0.0001, 0.0001, 0.0001
finance, z_2:  0.2,    0.01,   0.01,   0.0001, 0.0001, 0.0001
military, z_1: 0.0001, 0.0001, 0.0001, 0.1,    0.15,   0.01
military, z_2: 0.0001, 0.0001, 0.0001, 0.01,   0.01,   0.2
P(finance|d) = ?  P(military|d) = ?
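The mixture likelihood marginalizes over topics: P(d|Y) = Σ_z P(z|Y) ∏_w P(w|z,Y)^{n_w}. A minimal sketch of that computation; the per-topic word probabilities below are illustrative stand-ins, since the transcript's table is partly garbled, but the mechanics are the point:

```python
# Mixture-of-unigrams classifier: sum over topics inside each class,
# using log-sum-exp for numerical stability.
import math

p_y = {"finance": 0.6, "military": 0.4}
# Two topics per class; military split assumed uniform
p_z_given_y = {"finance": [0.3, 0.7], "military": [0.5, 0.5]}
# P(w|z,Y): one multinomial per (class, topic); illustrative values
p_w = {
    ("finance", 0): {"bank": 0.01, "debt": 0.15, "interest": 0.1,
                     "war": 0.0001, "army": 0.0001, "weapon": 0.0001},
    ("finance", 1): {"bank": 0.2, "debt": 0.01, "interest": 0.01,
                     "war": 0.0001, "army": 0.0001, "weapon": 0.0001},
    ("military", 0): {"bank": 0.0001, "debt": 0.0001, "interest": 0.0001,
                      "war": 0.1, "army": 0.15, "weapon": 0.01},
    ("military", 1): {"bank": 0.0001, "debt": 0.0001, "interest": 0.0001,
                      "war": 0.01, "army": 0.01, "weapon": 0.2},
}
d = {"bank": 100, "debt": 110, "interest": 130, "war": 1, "army": 0, "weapon": 0}

def log_lik(y):
    """log P(d|Y) = log sum_z P(z|Y) * prod_w P(w|z,Y)^n_w."""
    terms = [math.log(pz) + sum(n * math.log(p_w[(y, z)][w])
                                for w, n in d.items())
             for z, pz in enumerate(p_z_given_y[y])]
    m = max(terms)  # log-sum-exp trick
    return m + math.log(sum(math.exp(t - m) for t in terms))

best = max(p_y, key=lambda y: math.log(p_y[y]) + log_lik(y))
```

Unlike the plain unigram, each class can now explain a document through whichever of its topics fits best, so within-class heterogeneity is captured.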

11 Bayesian Networks
A Bayesian network is a DAG:
–Nodes are random variables or parameters.
–Arrows represent conditional-probability dependencies.
Given probabilities on some of the nodes, there are algorithms to infer the values of the others.

12 Latent Dirichlet Allocation
Model θ as a Dirichlet distribution with parameter α.
For the n-th term w_n:
–Model the n-th latent variable z_n as a multinomial distribution according to θ.
–Model w_n as a multinomial distribution according to z_n and β.

13 Variational Inference for LDA
Exact inference in LDA is HARD (intractable).
Approximate with a variational distribution: use a factorized distribution with variational parameters γ and Φ to approximate the posterior distribution of the latent variables θ and z.
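The standard mean-field updates for the factorized distribution alternate between Φ and γ (Φ_{n,k} ∝ β_{k,w_n} exp(Ψ(γ_k)) and γ_k = α_k + Σ_n Φ_{n,k}). A minimal per-document sketch, assuming fixed α and β and toy sizes:

```python
# Mean-field variational updates for one document's gamma and phi,
# with alpha and beta held fixed (toy values, for illustration).
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
K, V = 3, 6
alpha = np.full(K, 0.1)
beta = rng.dirichlet(np.ones(V), size=K)   # K x V topic-word probabilities
doc = [0, 2, 2, 5, 1]                      # word indices of one document

gamma = alpha + len(doc) / K               # common initialization
for _ in range(50):
    # phi_{n,k} proportional to beta[k, w_n] * exp(digamma(gamma_k))
    phi = beta[:, doc].T * np.exp(digamma(gamma))
    phi /= phi.sum(axis=1, keepdims=True)  # normalize each word's topic dist
    gamma = alpha + phi.sum(axis=0)        # gamma_k = alpha_k + sum_n phi_{n,k}
```

After convergence, γ summarizes the document's topic mixture; these per-document γ (or their normalized form) are the natural "LDA features" used on the next slide.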

14 Experiments
Data set: Reuters-21578; 8,681 training documents, 2,966 test documents.
Classification task: "EARN" vs. "non-EARN".
For each document, learn LDA features and classify with them (discriminatively).
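The shape of this pipeline — topic features first, then a discriminative classifier — can be sketched with scikit-learn. This is not the original experiment's code; the corpus, labels, and model choices below are toy stand-ins:

```python
# LDA features + discriminative classifier (sketch of the pipeline shape).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = ["bank debt interest loan billion", "war army weapon bomb",
        "stock shares company offer", "bank loans interest debt",
        "army war bomb weapon gulf", "shares stock company dlrs"]
labels = [1, 0, 1, 1, 0, 1]   # stand-in for EARN (1) vs. non-EARN (0)

counts = CountVectorizer().fit_transform(docs)          # bag-of-words counts
theta = LatentDirichletAllocation(n_components=2,       # per-doc topic mixtures
                                  random_state=0).fit_transform(counts)
clf = LogisticRegression().fit(theta, labels)           # discriminative step
pred = clf.predict(theta)
```

The division of labor matters: LDA reduces each document to a low-dimensional topic mixture, and the classifier only sees those mixtures, not the raw counts.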

15 Result: most frequent words in each topic
Topic 1      | Topic 2     | Topic 3   | Topic 4
'bank'       | 'trade'     | 'shares'  | 'tonnes'
'banks'      | 'japan'     | 'company' | 'mln'
'debt'       | 'japanese'  | 'stock'   | 'reuter'
'billion'    | 'states'    | 'dlrs'    | 'sugar'
'foreign'    | 'united'    | 'share'   | 'production'
'dlrs'       | 'officials' | 'reuter'  | 'gold'
'government' | 'reuter'    | 'offer'   | 'wheat'
'interest'   | 'told'      | 'common'  | 'nil'
'loans'      | 'government'| 'pct'     | 'gulf'

16 Classification Accuracy

17 Comparison of Accuracy

18 Take-Away Messages
LDA can produce relatively good results even with few topics and little training data.
Bayesian networks are useful for modeling multiple random variables, and good inference algorithms exist for them.
Potential uses of LDA:
–disk failures
–database access patterns
–user preferences (collaborative filtering)
–social networks (tags)

19 Reference
Blei, D., Ng, A., Jordan, M.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003) 993–1022.

20 Classification time

