Automatic Labeling of Multinomial Topic Models


1 Automatic Labeling of Multinomial Topic Models
Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai University of Illinois at Urbana-Champaign

2 Outline
Background: statistical topic models
Labeling a topic model: criteria and challenges
Our approach: a probabilistic framework
Experiments
Summary

3 Statistical Topic Models for Text Mining
Probabilistic topic modeling turns text collections into topic models (multinomial word distributions), which support subtopic discovery, opinion comparison, summarization, and topical pattern analysis.
Example models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], CPLSA [Mei & Zhai 06], Pachinko allocation [Li & McCallum 06], Topics over Time [Wang et al. 06], ...
[Figure: pipeline from text collections through probabilistic topic modeling to topic models; one example topic lists words such as "term, relevance, weight, feedback, independence, model"]

4 Topic Models: Hard to Interpret
Use top words automatic, but hard to make sense Human generated labels Make sense, but cannot scale up term relevance weight feedback independence 0.03 model frequent probabilistic 0.02 document insulin foraging foragers collected grains loads collection nectar Term, relevance, weight, feedback ? Retrieval Models Question: Can we automatically generate understandable labels for topics?

5 What is a Good Label?
A good label is semantically close to the topic (relevant), understandable (phrases work well), has high coverage inside the topic, and is discriminative across topics.
Example [Mei & Zhai 06]: for a SIGIR topic with top words {term, relevance, weight, feedback, independence, model, frequent, probabilistic, document}, "Information Retrieval" is a good label; "iPod Nano" (irrelevant), "じょうほうけんさく" (Japanese for "information retrieval": relevant but not understandable to English readers), and "Pseudo-feedback" (too narrow, low coverage) are poor choices.

6 Our Method
Step 1: Generate a candidate label pool from the collection (e.g., SIGIR abstracts) with an NLP chunker and n-gram statistics: "information retrieval", "retrieval model", "index structure", "relevance feedback", ...
Step 2: Score each candidate's relevance to the topic (e.g., the topic {term, relevance, weight, feedback, independence, model}).
Step 3: Adjust scores for discrimination against other topics (e.g., a "filtering, collaborative, ..." topic or a "trec, evaluation, ..." topic).
Step 4: Select a final label set with high coverage of the topic: "information retrieval", "retrieval models", "IR models", "pseudo feedback", ...
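Step 1 above (significant bigrams from n-gram statistics) can be sketched as follows; the frequency-weighted PMI statistic, the thresholds, and the `significant_bigrams` name are illustrative assumptions, not the paper's exact significance test:

```python
import math
from collections import Counter

def significant_bigrams(docs, min_count=2, top_k=5):
    """Rank adjacent word pairs by frequency-weighted pointwise mutual
    information; a stand-in for the n-gram significance test."""
    unigrams, bigrams, total = Counter(), Counter(), 0
    for doc in docs:
        words = doc.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        total += len(words)
    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # too rare to be a trustworthy phrase
        pmi = math.log((c / total) /
                       ((unigrams[w1] / total) * (unigrams[w2] / total)))
        scored.append((c * pmi, f"{w1} {w2}"))
    return [phrase for _, phrase in sorted(scored, reverse=True)[:top_k]]
```

On a toy IR-flavored collection, a recurring phrase like "information retrieval" surfaces at the top of the candidate pool.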

7 Relevance (Task 2): the Zero-Order Score
Intuition: prefer phrases that cover the topic's top words well.
Example: for a latent topic θ with p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3, p("shape"|θ) = 0.01, and p("body"|θ) = 0.001, the label "clustering algorithm" (l1) is good, while "body shape" (l2) is bad.
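The zero-order intuition can be sketched as a sum of log topic probabilities over the label's words (an assumed simplification; the paper additionally normalizes against a background model, omitted here, and p("algorithm"|θ) below is an assumed value the slide does not give):

```python
import math

def zero_order_score(label, topic, floor=1e-9):
    """Zero-order relevance sketch: log-probability of the label's words
    under the topic multinomial p(w|theta); unseen words get a tiny floor."""
    return sum(math.log(topic.get(w, floor)) for w in label.split())

# The slide's example topic (p("algorithm") = 0.1 is an assumption):
theta = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.1,
         "shape": 0.01, "body": 0.001}
```

With these numbers, "clustering algorithm" scores far above "body shape", matching the slide's good/bad contrast.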

8 Relevance (Task 2): the First-Order Score
Intuition: prefer phrases whose context has a word distribution similar to the topic's.
Score(l, θ) = −D(θ ‖ l), the negative KL divergence between the topic distribution p(w|θ) and the word distribution p(w|l) estimated from the label's contexts in the collection.
Example: for a topic with top words {clustering, dimension, partition, algorithm, hash}, the context distribution of the good label "clustering algorithm" (l1) matches the topic closely, while the context of "hash join" (l2), estimated from snippets like "...hash join code...", "...hash table search...", "...hash join map key...", is dominated by {hash, join, key, table} and matches poorly.
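A minimal sketch of Score(l, θ) = −D(θ ‖ l), where both arguments are word→probability dicts; smoothing missing context words with a small floor is an assumption for illustration:

```python
import math

def first_order_score(topic, label_context, floor=1e-9):
    """Negative KL divergence -D(theta || p(.|l)) between the topic and the
    word distribution estimated from the label's contexts; higher is better."""
    return -sum(p * math.log(p / label_context.get(w, floor))
                for w, p in topic.items() if p > 0)
```

A label whose context distribution resembles the topic (e.g., one rich in "clustering" and "dimension") scores above one whose context is dominated by unrelated words.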

9 Discrimination and Coverage (Tasks 3 & 4)
Discriminative across topics: high relevance to the target topic, low relevance to the other topics.
High coverage inside the topic: select labels with a Maximal Marginal Relevance (MMR) strategy.
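Both steps can be sketched together; the parameterization (`mu` weighting the discrimination penalty, `lam` trading off score against redundancy, and word-overlap similarity between labels) is an assumption for illustration:

```python
def word_overlap(a, b):
    """Jaccard similarity between the word sets of two labels."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def select_labels(rel, other_rels, k=2, mu=0.5, lam=0.5):
    """Discrimination: subtract a weighted average of each label's relevance
    to the other topics. Coverage: MMR-style greedy selection penalizes
    candidates similar to labels already chosen."""
    n = max(len(other_rels), 1)
    disc = {l: s - mu * sum(o.get(l, 0.0) for o in other_rels) / n
            for l, s in rel.items()}
    chosen = []
    while disc and len(chosen) < k:
        best = max(disc, key=lambda l: lam * disc[l] - (1 - lam) *
                   max((word_overlap(l, c) for c in chosen), default=0.0))
        chosen.append(best)
        del disc[best]
    return chosen
```

A label that is also highly relevant to another topic (e.g., "hash join" for a separate topic) is pushed down, while redundant near-duplicates of already-chosen labels are penalized.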

10 Variations and Applications
Labeling document clusters: a document cluster induces a unigram language model, so the method applies to any task that produces unigram language models.
Context-sensitive labels: the label of a topic depends on the context, giving an alternative way to approach contextual text mining.
Example: the topic {tree, prune, root, branch} may mean "tree algorithms" in CS, but something else entirely in horticulture or in marketing.
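Turning a document cluster into a unigram language model, so the same labeling machinery applies, takes only a few lines (a minimal sketch using maximum-likelihood estimation with no smoothing):

```python
from collections import Counter

def cluster_language_model(docs):
    """Estimate p(w | cluster) by maximum likelihood over the cluster's text."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

The resulting word→probability dict plays the same role as p(w|θ) in the relevance scores.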

11 Experiments
Datasets: SIGMOD abstracts; SIGIR abstracts; AP news data
Candidate labels: significant bigrams; NLP chunks
Topic models: PLSA, LDA
Evaluation: human annotators compare labels generated by anonymized systems; the order of systems is randomly perturbed and scores are averaged over all sample topics

12 Result Summary
Automatic phrase labels >> top words
First-order relevance >> zero-order relevance
Bigrams > NLP chunks: bigrams work better on scientific literature, NLP chunks better on news
System labels << human labels; scientific literature is the easier task

13 Results: Sample Topic Labels
[Table: sample topics with generated labels. An AP news topic {north, case, trial, iran, documents, walsh, reagan, charges} is labeled "iran contra"; a SIGMOD topic {clustering, time, clusters, databases, large, performance, quality} is labeled "clustering algorithm" / "clustering structure", beating weaker candidates such as "large data", "data quality", "data application"; a topic {tree, trees, spatial, b, r, disk, array, cache} is labeled "r tree" / "b tree" / "indexing methods". A stop-word distribution {the, of, a, and, to, data} illustrates an uninformative topic.]

14 Results: Context-Sensitive Labeling
sampling estimation approximation histogram selectivity histograms Context: Database (SIGMOD Proceedings) Context: IR (SIGIR Proceedings) selectivity estimation; random sampling; approximate answers; distributed retrieval; parameter estimation; mixture models; Explore the different meaning of a topic with different contexts (content switch) An alternative approach to contextual text mining

15 Summary
Labeling is a post-processing step for all multinomial topic models.
A probabilistic approach generates good labels: understandable, relevant, high-coverage, and discriminative.
Broadly applicable to mining tasks involving multinomial word distributions; supports context-sensitive labeling.
Future work: labeling hierarchical topic models; incorporating priors.

16 Thanks!

