Automatic Labeling of Multinomial Topic Models


1 Automatic Labeling of Multinomial Topic Models
Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai University of Illinois at Urbana-Champaign

2 Outline
Background: statistical topic models
Labeling a topic model: criteria and challenges
Our approach: a probabilistic framework
Experiments
Summary

3 Statistical Topic Models for Text Mining
Probabilistic topic modeling turns text collections into topic models (multinomial word distributions), which support subtopic discovery, opinion comparison, summarization, and topical pattern analysis.
Example models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], CPLSA [Mei & Zhai 06], Pachinko allocation [Li & McCallum 06], Topics over Time [Wang et al. 06], ...
[Figure: pipeline from text collections through probabilistic topic modeling to topic models; one example topic lists words such as "term, relevance, weight, feedback, independence, model"]

4 Topic Models: Hard to Interpret
Use top words automatic, but hard to make sense Human generated labels Make sense, but cannot scale up term relevance weight feedback independence 0.03 model frequent probabilistic 0.02 document insulin foraging foragers collected grains loads collection nectar Term, relevance, weight, feedback ? Retrieval Models Question: Can we automatically generate understandable labels for topics?

5 What is a Good Label?
A good label is semantically close to the topic (relevant), understandable (phrases work well), has high coverage inside the topic, and is discriminative across topics.
Example [Mei & Zhai 06]: for a SIGIR topic with top words {term, relevance, weight, feedback, independence, model, frequent, probabilistic, document}, "Information Retrieval" is a good label; "iPod Nano" (irrelevant), "じょうほうけんさく" (Japanese for "information retrieval": relevant but not understandable to English readers), and "Pseudo-feedback" (too narrow, low coverage) are poor choices.

6 Our Method
Step 1: Generate a candidate label pool from the collection (e.g., SIGIR abstracts) with an NLP chunker and n-gram statistics: "information retrieval", "retrieval model", "index structure", "relevance feedback", ...
Step 2: Score each candidate's relevance to the topic (e.g., the topic {term, relevance, weight, feedback, independence, model}).
Step 3: Adjust scores for discrimination against other topics (e.g., a "filtering, collaborative, ..." topic or a "trec, evaluation, ..." topic).
Step 4: Select a final label set with high coverage of the topic: "information retrieval", "retrieval models", "IR models", "pseudo feedback", ...
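Step 1 above (significant bigrams from n-gram statistics) can be sketched as follows; the frequency-weighted PMI statistic, the thresholds, and the `significant_bigrams` name are illustrative assumptions, not the paper's exact significance test:

```python
import math
from collections import Counter

def significant_bigrams(docs, min_count=2, top_k=5):
    """Rank adjacent word pairs by frequency-weighted pointwise mutual
    information; a stand-in for the n-gram significance test."""
    unigrams, bigrams, total = Counter(), Counter(), 0
    for doc in docs:
        words = doc.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        total += len(words)
    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # too rare to be a trustworthy phrase
        pmi = math.log((c / total) /
                       ((unigrams[w1] / total) * (unigrams[w2] / total)))
        scored.append((c * pmi, f"{w1} {w2}"))
    return [phrase for _, phrase in sorted(scored, reverse=True)[:top_k]]
```

On a toy IR-flavored collection, a recurring phrase like "information retrieval" surfaces at the top of the candidate pool.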

7 Relevance (Task 2): the Zero-Order Score
Intuition: prefer phrases that cover the topic's top words well.
Example: for a latent topic θ with p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3, p("shape"|θ) = 0.01, and p("body"|θ) = 0.001, the label "clustering algorithm" (l1) is good, while "body shape" (l2) is bad.
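The zero-order intuition can be sketched as a sum of log topic probabilities over the label's words (an assumed simplification; the paper additionally normalizes against a background model, omitted here, and p("algorithm"|θ) below is an assumed value the slide does not give):

```python
import math

def zero_order_score(label, topic, floor=1e-9):
    """Zero-order relevance sketch: log-probability of the label's words
    under the topic multinomial p(w|theta); unseen words get a tiny floor."""
    return sum(math.log(topic.get(w, floor)) for w in label.split())

# The slide's example topic (p("algorithm") = 0.1 is an assumption):
theta = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.1,
         "shape": 0.01, "body": 0.001}
```

With these numbers, "clustering algorithm" scores far above "body shape", matching the slide's good/bad contrast.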

8 Relevance (Task 2): the First-Order Score
Intuition: prefer phrases whose context has a word distribution similar to the topic's.
Score(l, θ) = −D(θ ‖ l), the negative KL divergence between the topic distribution p(w|θ) and the word distribution p(w|l) estimated from the label's contexts in the collection.
Example: for a topic with top words {clustering, dimension, partition, algorithm, hash}, the context distribution of the good label "clustering algorithm" (l1) matches the topic closely, while the context of "hash join" (l2), estimated from snippets like "...hash join code...", "...hash table search...", "...hash join map key...", is dominated by {hash, join, key, table} and matches poorly.
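A minimal sketch of Score(l, θ) = −D(θ ‖ l), where both arguments are word→probability dicts; smoothing missing context words with a small floor is an assumption for illustration:

```python
import math

def first_order_score(topic, label_context, floor=1e-9):
    """Negative KL divergence -D(theta || p(.|l)) between the topic and the
    word distribution estimated from the label's contexts; higher is better."""
    return -sum(p * math.log(p / label_context.get(w, floor))
                for w, p in topic.items() if p > 0)
```

A label whose context distribution resembles the topic (e.g., one rich in "clustering" and "dimension") scores above one whose context is dominated by unrelated words.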

9 Discrimination and Coverage (Tasks 3 & 4)
Discriminative across topics: high relevance to the target topic, low relevance to the other topics.
High coverage inside the topic: select labels with a Maximal Marginal Relevance (MMR) strategy.
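Both steps can be sketched together; the parameterization (`mu` weighting the discrimination penalty, `lam` trading off score against redundancy, and word-overlap similarity between labels) is an assumption for illustration:

```python
def word_overlap(a, b):
    """Jaccard similarity between the word sets of two labels."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def select_labels(rel, other_rels, k=2, mu=0.5, lam=0.5):
    """Discrimination: subtract a weighted average of each label's relevance
    to the other topics. Coverage: MMR-style greedy selection penalizes
    candidates similar to labels already chosen."""
    n = max(len(other_rels), 1)
    disc = {l: s - mu * sum(o.get(l, 0.0) for o in other_rels) / n
            for l, s in rel.items()}
    chosen = []
    while disc and len(chosen) < k:
        best = max(disc, key=lambda l: lam * disc[l] - (1 - lam) *
                   max((word_overlap(l, c) for c in chosen), default=0.0))
        chosen.append(best)
        del disc[best]
    return chosen
```

A label that is also highly relevant to another topic (e.g., "hash join" for a separate topic) is pushed down, while redundant near-duplicates of already-chosen labels are penalized.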

10 Variations and Applications
Labeling document clusters: a document cluster induces a unigram language model, so the method applies to any task that produces unigram language models.
Context-sensitive labels: the label of a topic depends on the context, giving an alternative way to approach contextual text mining.
Example: the topic {tree, prune, root, branch} may mean "tree algorithms" in CS, but something else entirely in horticulture or in marketing.
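Turning a document cluster into a unigram language model, so the same labeling machinery applies, takes only a few lines (a minimal sketch using maximum-likelihood estimation with no smoothing):

```python
from collections import Counter

def cluster_language_model(docs):
    """Estimate p(w | cluster) by maximum likelihood over the cluster's text."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

The resulting word→probability dict plays the same role as p(w|θ) in the relevance scores.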

11 Experiments
Datasets: SIGMOD abstracts; SIGIR abstracts; AP news data
Candidate labels: significant bigrams; NLP chunks
Topic models: PLSA, LDA
Evaluation: human annotators compare labels generated by anonymized systems; the order of systems is randomly perturbed and scores are averaged over all sample topics

12 Result Summary
Automatic phrase labels >> top words
First-order relevance >> zero-order relevance
Bigrams > NLP chunks: bigrams work better on scientific literature, NLP chunks better on news
System labels << human labels; scientific literature is the easier task

13 Results: Sample Topic Labels
[Table: sample topics with generated labels. An AP news topic {north, case, trial, iran, documents, walsh, reagan, charges} is labeled "iran contra"; a SIGMOD topic {clustering, time, clusters, databases, large, performance, quality} is labeled "clustering algorithm" / "clustering structure", beating weaker candidates such as "large data", "data quality", "data application"; a topic {tree, trees, spatial, b, r, disk, array, cache} is labeled "r tree" / "b tree" / "indexing methods". A stop-word distribution {the, of, a, and, to, data} illustrates an uninformative topic.]

14 Results: Context-Sensitive Labeling
sampling estimation approximation histogram selectivity histograms Context: Database (SIGMOD Proceedings) Context: IR (SIGIR Proceedings) selectivity estimation; random sampling; approximate answers; distributed retrieval; parameter estimation; mixture models; Explore the different meaning of a topic with different contexts (content switch) An alternative approach to contextual text mining

15 Summary
Labeling is a post-processing step for all multinomial topic models.
A probabilistic approach generates good labels: understandable, relevant, high-coverage, and discriminative.
Broadly applicable to mining tasks involving multinomial word distributions; supports context-sensitive labeling.
Future work: labeling hierarchical topic models; incorporating priors.

16 Thanks!

