Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai University of Illinois at Urbana-Champaign.


1 Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai University of Illinois at Urbana-Champaign

2 Outline
– Background: statistical topic models
– Labeling a topic model: criteria and challenges
– Our approach: a probabilistic framework
– Experiments
– Summary

3 Statistical Topic Models for Text Mining
Probabilistic topic modeling extracts topics from text collections; each topic is a multinomial distribution over words, e.g.:
– web 0.21, search 0.10, link 0.08, graph 0.05, …
– term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …
Applications: subtopic discovery, opinion comparison, summarization, topical pattern analysis, …
Models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], CPLSA [Mei & Zhai 06], Pachinko allocation [Li & McCallum 06], Topic over time [Wang et al. 06], …

4 Topic Models: Hard to Interpret
– Top words: automatic, but hard to make sense of
– Human-generated labels (e.g., "Retrieval Models"): make sense, but cannot scale up
Example topic: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …
A harder example: insulin, foraging, foragers, collected, grains, loads, collection, nectar, … ?
Question: can we automatically generate understandable labels for topics?

5 What is a Good Label?
– Semantically close to the topic (relevance)
– Understandable (phrases?)
– High coverage inside the topic
– Discriminative across topics
Example topic (a topic in SIGIR, Mei & Zhai 06): term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
Candidate labels: "iPod Nano" (not relevant), "Pseudo-feedback" (low coverage), "Information Retrieval" / "Retrieval models" (good), "じょうほうけんさく" (Japanese for "information retrieval": relevant, but not understandable to English readers)

6 Our Method
Input: topic models estimated from a collection (e.g., SIGIR), such as: term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …; filtering 0.21, collaborative 0.15, …; trec 0.18, evaluation 0.10, …
– Step 1, candidate label pool (NLP chunker / Ngram statistics): information retrieval, retrieval model, index structure, relevance feedback, …
– Step 2, relevance score: information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …
– Step 3, discrimination: penalize labels that are also relevant to other topics (e.g., information retrieval 0.26 → 0.01)
– Step 4, coverage: re-rank so the selected labels cover the topic (retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, …; information retrieval 0.01)

7 Relevance (Task 2): the Zero-Order Score
Intuition: prefer phrases that cover the top words of the topic well.
Latent topic θ, p(w|θ) over: clustering, dimensional, algorithm, birch, shape, body, …
– p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3
– p("body"|θ) = 0.001, p("shape"|θ) = 0.01
Good label (l1): "clustering algorithm"; bad label (l2): "body shape".
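The zero-order intuition can be sketched as a log-likelihood of the label's words under the topic. The probabilities below echo the slide's example; the floor probability for unseen words is an illustrative assumption, not a value from the paper:

```python
import math

# Hypothetical topic distribution p(w|theta), echoing the slide's example.
topic = {
    "clustering": 0.4, "dimensional": 0.3, "algorithm": 0.1,
    "birch": 0.05, "shape": 0.01, "body": 0.001,
}

def zero_order_score(label, topic, floor=1e-6):
    """Zero-order relevance: sum of log p(w|theta) over the label's words.
    Words unseen in the topic get a small floor probability (an assumption)."""
    return sum(math.log(topic.get(w, floor)) for w in label.lower().split())

good = zero_order_score("clustering algorithm", topic)  # log 0.4 + log 0.1
bad = zero_order_score("body shape", topic)             # log 0.001 + log 0.01
```

Because the score only multiplies single-word probabilities, it cannot tell whether the words co-occur in the label's actual contexts; that weakness motivates the first-order score on the next slide.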

8 Relevance (Task 2): the First-Order Score
Intuition: prefer phrases whose context distribution is similar to the topic distribution.
Topic θ, P(w|θ) over: clustering, dimension, partition, algorithm, hash, …
Estimate p(w | l) from the contexts in which the label l occurs in the collection (e.g., "…hash join … code …hash table …search …hash join… map key…hash …algorithm…key …hash…key table…join…").
– Good label (l1): "clustering algorithm" — p(w | "clustering algorithm") puts its mass on clustering, hash, dimension, algorithm, partition, …
– Bad label (l2): "hash join" — p(w | "hash join") concentrates on hash, key, join, table, …
Score(l, θ) = −D(θ || p(w | l)): the smaller the divergence, the better the label.
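A minimal sketch of the first-order score as a negative KL divergence. The two label-context distributions below are made up to mirror the slide's "clustering algorithm" vs. "hash join" contrast; the smoothing floor is likewise an assumption of this sketch:

```python
import math

def kl_divergence(p, q, floor=1e-9):
    """D(p || q) summed over p's support; q is floored to avoid log(0)
    (a smoothing choice made for this sketch)."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), floor))
               for w, pw in p.items() if pw > 0)

def first_order_score(topic, label_context):
    """First-order relevance: negative KL divergence from the topic to the
    label's context distribution; higher means a better-matching label."""
    return -kl_divergence(topic, label_context)

# Hypothetical distributions in the spirit of the slide's example.
topic = {"clustering": 0.3, "dimension": 0.25, "partition": 0.2,
         "algorithm": 0.15, "hash": 0.1}
ctx_clustering_algorithm = {"clustering": 0.28, "dimension": 0.22,
                            "algorithm": 0.2, "partition": 0.18, "hash": 0.12}
ctx_hash_join = {"hash": 0.5, "join": 0.3, "key": 0.15, "table": 0.05}
```

Here `ctx_clustering_algorithm` scores higher than `ctx_hash_join` because topic words like "clustering" and "dimension" have essentially no mass under the "hash join" context, which makes the divergence blow up.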

9 Discrimination and Coverage (Tasks 3 & 4)
– Discriminative across topics: high relevance to the target topic, low relevance to other topics
– High coverage inside the topic: use the MMR (Maximal Marginal Relevance) strategy
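The two re-ranking steps can be sketched as follows. The penalty weight `mu`, the MMR trade-off `lam`, and the word-overlap similarity are illustrative choices of this sketch, not parameters taken from the paper:

```python
def discriminative_score(scores_by_topic, target, mu=0.7):
    """Relevance to the target topic, minus mu times the label's average
    relevance to the other topics (mu is an illustrative weight)."""
    others = [s for t, s in scores_by_topic.items() if t != target]
    penalty = sum(others) / len(others) if others else 0.0
    return scores_by_topic[target] - mu * penalty

def jaccard(a, b):
    """Word-overlap similarity between two phrase labels; a simple stand-in
    for the redundancy measure used when re-ranking for coverage."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def mmr_rerank(candidates, score, k=3, lam=0.7):
    """Maximal Marginal Relevance: repeatedly pick the label that balances
    its own score against similarity to the labels already selected."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda l: lam * score[l] - (1 - lam) *
                   max((jaccard(l, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

scores = {"information retrieval": 0.26, "retrieval models": 0.20,
          "retrieval model": 0.19, "pseudo feedback": 0.09}
labels = mmr_rerank(list(scores), scores)
```

With these toy scores, MMR promotes "pseudo feedback" over the near-duplicate "retrieval model", so the selected labels cover more of the topic.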

10 Variations and Applications
– Labeling document clusters: a document cluster is a unigram language model, so the method applies to any task with a unigram language model
– Context-sensitive labels: the label of a topic is sensitive to the context, giving an alternative way to approach contextual text mining
Example: tree, prune, root, branch → "tree algorithms" in CS; ? in horticulture; ? in marketing

11 Experiments
– Datasets: SIGMOD abstracts; SIGIR abstracts; AP news data
– Candidate labels: significant bigrams; NLP chunks
– Topic models: PLSA, LDA
– Evaluation: human annotators compare labels generated by anonymized systems; the order of systems is randomly perturbed; scores are averaged over all sample topics

12 Result Summary
– Automatic phrase labels >> top words
– First-order relevance >> zero-order relevance
– Bigrams > NLP chunks (bigrams work better on literature; NLP chunks work better on news)
– System labels << human labels (scientific literature is the easier task)

13 Results: Sample Topic Labels
– Topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01 → labels: r tree, b tree, …, indexing methods
– Topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007 → label: iran contra
– Topic: the, of, a, and, to, data, … (each > 0.02) → labels: large data, data quality, high data, data application, …
– Topic: clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005 → labels: clustering algorithm, clustering structure, …

14 Results: Context-Sensitive Labeling
Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …
– Context: Database (SIGMOD proceedings) → selectivity estimation; random sampling; approximate answers
– Context: IR (SIGIR proceedings) → distributed retrieval; parameter estimation; mixture models
The labels expose the different meanings of a topic in different contexts (content switch): an alternative approach to contextual text mining.

15 Summary
– Labeling: a post-processing step for all multinomial topic models
– A probabilistic approach to generating good labels: understandable, relevant, high-coverage, discriminative
– Broadly applicable to mining tasks involving multinomial word distributions; context-sensitive
– Future work: labeling hierarchical topic models; incorporating priors

16 Thanks! Please come to our poster tonight (#40).

17 Multinomial Topic Models
Example topics (Blei et al.): http://www.cs.cmu.edu/~lemur/science/topics.html
A topic is a unigram language model, i.e., a multinomial distribution over terms (multinomial mixture, PLSA, LDA, and lots of extensions), e.g.:
– term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
– data 0.0358569, university 0.0132301, new 0.0119887, results 0.0119384, end 0.0116994, high 0.00987482, research 0.00962146, figure 0.00897542, analysis 0.00769567, number 0.00739933, institute 0.00728071, …
Applications: topic extraction, IR, contextual text mining, opinion analysis, …

18 Multinomial Topic Models
– Statistical topic models: multinomial mixture, PLSA, LDA, and many extensions
– Applications: topic extraction; information retrieval; contextual text mining; opinion extraction
– A common problem: the topics are hard to interpret (to label)
Example topics needing labels: pollen 0.46, foraging 0.04, foragers 0.04, collected 0.03, grains 0.03, loads 0.03, collection 0.02, nectar 0.02, … ?; and: glucose, mice, diabetes, hormone, body, weight, fat, … ?

19 Overview
A good label is semantically close to the topic (relevance), understandable (a phrase), has high coverage inside the topic, and is discriminative across topics.
Example topic (a topic in SIGIR, Mei & Zhai 06): term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
Candidate labels: iPod Nano; Pseudo-feedback; Information Retrieval; Retrieval models; じょうほうけんさく (Japanese for "information retrieval")

20 Our Method
Input: multinomial topic models estimated from a collection (the context), e.g., the topic: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
– Step 1, candidate label pool (NLP chunker / Ngram statistics over the collection): database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …
– Step 2, relevance scoring, then re-ranking for coverage and discrimination → ranked list of labels: clustering algorithm; distance measure; …

21 Relevance: the First-Order Score
Intuition: prefer phrases whose context distribution is similar to the topic.
Topic θ, P(w|θ) over: clustering, dimension, partition, algorithm, hash, … (estimated from the SIGMOD proceedings)
– Good label (l1): "clustering algorithm" — p(w | "clustering algorithm") over clustering, hash, dimension, algorithm, partition, … is close to the topic
– Bad label (l2): "hash join" — p(w | "hash join") over clustering, hash, dimension, key, algorithm, … is not
Score(l, θ): D(θ || "clustering algorithm") < D(θ || "hash join")

22 Our Method
– Guarantee understandability with a pre-processing step: use phrases as candidate topic labels (NLP chunks / statistically significant Ngrams)
– A ranking problem: satisfy relevance, coverage, and discrimination within a probabilistic framework
– Good labels = understandable + relevant + high coverage + discriminative
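Candidate generation from significant Ngrams can be sketched as below. The paper's actual significance test is not specified on the slide, so this sketch ranks adjacent word pairs by pointwise mutual information as an illustrative stand-in; the example token stream is made up:

```python
import math
from collections import Counter

def significant_bigrams(tokens, min_count=2):
    """Collect adjacent word pairs that occur at least min_count times and
    rank them by pointwise mutual information (PMI), a common significance
    heuristic for phrase extraction. An NLP chunker is the other candidate
    source mentioned on the slide."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    def pmi(pair):
        w1, w2 = pair
        return math.log(bigrams[pair] * n / (unigrams[w1] * unigrams[w2]))
    return sorted((b for b, c in bigrams.items() if c >= min_count),
                  key=pmi, reverse=True)

# Made-up toy corpus: "information retrieval" recurs as a unit.
tokens = ("information retrieval model and information retrieval "
          "system and the model and the system").split()
candidates = significant_bigrams(tokens)
```

PMI rewards pairs that co-occur more often than their individual frequencies predict, so collocations like "information retrieval" outrank incidental pairs built from frequent words like "and" or "the".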

23 Results: Context-Sensitive Labeling
Topic 1: sampling, estimation, approximation, histogram, selectivity, histograms, …
– Context: Database (SIGMOD proceedings) → selectivity estimation; random sampling; approximate answers
– Context: IR (SIGIR proceedings) → distributed retrieval; parameter estimation; mixture models
Topic 2: dependencies, functional, cube, multivalued, iceberg, buc, …
– Context: Database (SIGMOD proceedings) → multivalued dependency; functional dependency; iceberg cube
– Context: IR (SIGIR proceedings) → term dependency; independence assumption

24 Results: Sample Topic Labels
– Topic: sampling 0.06, estimation 0.04, approximate 0.04, histograms 0.03, selectivity 0.03, histogram 0.02, answers 0.02, accurate 0.02 → label: selectivity estimation
– Topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01 → labels: r tree, b tree, …, indexing methods
– Topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007 → label: iran contra
– Topic: the, of, a, and, to, data, … (each > 0.02) → labels: large data, data quality, high data, data application, …
– Topic: clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005 → labels: clustering algorithm, clustering structure, …

25 Preliminary Results (SIGMOD)
– Sensor networks: constraint 0.057, sensor 0.032, assert 0.031, index 0.020, integrity 0.020, network 0.016, procedure 0.014
– View maintenance: view 0.189, update 0.043, warehouse 0.018, copy 0.017, array 0.016, directory 0.015, increment 0.015
– Query languages: language 0.067, relate 0.035, relational 0.034, model 0.032, extension 0.021, semantic 0.018, definition 0.018
– Recursive queries: recursion 0.071, algebra 0.057, b-tree 0.035, rule 0.022, general 0.019, relate 0.018, nest 0.016
– Concurrency control: transact 0.123, concurrent 0.067, control 0.059, protocol 0.050, lock 0.044, replicate 0.028, distribute 0.027
– Clustering algorithms: cluster 0.114, spatial 0.094, join 0.080, algorithm 0.040, dimension 0.020, dataset 0.017, mine 0.015
– Query optimizers: optimize 0.130, query 0.085, plan 0.075, execution 0.040, join 0.032, statistic 0.026, estimate 0.022
– Graphic interface: graph 0.080, visual 0.057, multimedia 0.046, browse 0.024, graphic 0.013, transitive 0.013, interface 0.013

26 Preliminary Results (SIGMOD II)
– Client-server: file 0.123, serve 0.089, client 0.046, grid 0.021, message 0.017, policy 0.014, storage 0.014
– Knowledge base: dependency 0.06, schema 0.040, knowledge 0.026, function 0.026, rule 0.021, form 0.018, extract 0.016
– Data cube: cube 0.058, rank 0.019, db 0.013, aggregate 0.013, dimension 0.010, search 0.010, framework 0.010
– XML data: xml 0.170, document 0.07, query 0.038, xquery 0.031, temporal 0.029, twig 0.014, element 0.013
– Stream management: stream 0.111, parallel 0.073, process 0.033, continuous 0.029, partition 0.026, resource 0.019, physical 0.017
– Information sources: web 0.054, integrate 0.047, service 0.042, source 0.040, enterprise 0.025, business 0.014, wrap 0.014
– Declarative languages: workflow 0.030, system 0.026, language 0.022, path 0.018, database 0.015, constraint 0.015, integrity 0.013
– Index structures: tree 0.145, index 0.045, node 0.043, r-tree 0.030, b 0.024, structure 0.019, main 0.015

