Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models Michael Paul and Roxana Girju.


1 Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models Michael Paul and Roxana Girju

2 Outline: Overview of topic models; Cross-Collection LDA; Cross-cultural analysis with ccLDA; Other applications of ccLDA; Model evaluation; An alternative cross-collection model

3 Outline: Overview of topic models (PLSI and LDA; some slides borrowed from CS410, ChengXiang Zhai); Cross-Collection LDA; Cross-cultural analysis with ccLDA; Other applications of ccLDA; Model evaluation; An alternative cross-collection model

4 Probabilistic Topic Models
Idea: each document is some mix of topics. Each word in the document belongs to a topic.

5 Document as a Sample of Mixed Topics
Example topic word distributions:
- Topic θ_1: government 0.3, response 0.2, ...
- Topic θ_2: city 0.2, new 0.1, orleans 0.05, ...
- Topic θ_k: donate 0.1, relief 0.05, help 0.02, ...
- Background θ_B: is 0.05, the 0.04, a 0.03, ...
Example document (brackets mark topical segments): [Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …
Applications of topic models:
- Summarize themes/aspects
- Facilitate navigation/browsing
- Retrieve documents
- Segment documents
- Many others
How can we discover these topic word distributions?

6 Probabilistic Latent Semantic Indexing [Hofmann, 1999]
Each token in a document is associated with 2 variables:
- a word w (observable)
- a topic z (hidden)
P(w,z|d) = P(z|d) P(w|z)
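As a small illustration of the factorization on this slide, the sketch below computes the joint P(w,z|d), sums it over topics to get P(w|d), and applies Bayes' rule to get P(z|w,d). The two topic names and every probability are invented toy values, not figures from the presentation.

```python
# Toy PLSI example: all names and probabilities are made up for illustration.
p_z_given_d = {"politics": 0.7, "relief": 0.3}  # P(z|d) for a single document d
p_w_given_z = {                                  # P(w|z), one distribution per topic
    "politics": {"government": 0.6, "donate": 0.1, "city": 0.3},
    "relief": {"government": 0.1, "donate": 0.7, "city": 0.2},
}


def joint(w, z):
    """P(w,z|d) = P(z|d) * P(w|z)."""
    return p_z_given_d[z] * p_w_given_z[z][w]


def word_prob(w):
    """P(w|d), obtained by summing the joint over the hidden topic z."""
    return sum(joint(w, z) for z in p_z_given_d)


def topic_posterior(w):
    """P(z|w,d) by Bayes' rule: the joint divided by P(w|d)."""
    total = word_prob(w)
    return {z: joint(w, z) / total for z in p_z_given_d}


print(word_prob("government"))
print(topic_posterior("donate"))
```

The posterior computed by `topic_posterior` is the quantity that an EM procedure for PLSI would use to assign each word occurrence fractionally to topics.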

7 PLSA as a Mixture Model
"Generating" word w in doc d in the collection:
- Background θ_B (chosen with probability λ_B): is 0.05, the 0.04, a 0.03, ...
- Topic θ_1 (weight π_d,1): warning 0.3, system 0.2, ...
- Topic θ_2 (weight π_d,2): statistics 0.2, loss 0.1, dead 0.05, ...
- Topic θ_k (weight π_d,k): aid 0.1, donation 0.05, support 0.02, ...
- The topics together are chosen with probability 1 - λ_B
Parameters: λ_B = noise level (manually set); the θ's and π's are estimated with maximum likelihood

8 How to Estimate Multiple Topics? (Expectation Maximization)
- Known background p(w|B): the 0.2, a 0.1, we 0.01, to 0.02, ...
- Unknown topic model p(w|θ_1) = ? ("text mining"): text = ?, mining = ?, association = ?, word = ?, ...
- Unknown topic model p(w|θ_2) = ? ("information retrieval"): information = ?, retrieval = ?, query = ?, document = ?, ...
- Observed doc(s)
- E-step: predict topic labels using Bayes' rule
- M-step: maximum likelihood estimator based on "fractional counts"
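The E-step and M-step named on this slide can be sketched roughly as follows, assuming the usual mixture with a fixed, known background distribution. The function name `plsa_em`, the toy documents, the background probabilities, and the background weight of 0.5 are all illustrative assumptions rather than anything stated in the presentation.

```python
import random


def plsa_em(docs, vocab, background, k=2, lam_b=0.5, iters=50, seed=0):
    """docs: list of {word: count}; background: fixed, known p(w|B)."""
    rng = random.Random(seed)
    # Random initial topic word distributions p(w|theta_j).
    topics = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        norm = sum(raw.values())
        topics.append({w: v / norm for w, v in raw.items()})
    # Uniform initial document-topic weights pi[d][j].
    pi = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        topic_counts = [{w: 0.0 for w in vocab} for _ in range(k)]
        new_pi = [[0.0] * k for _ in docs]
        for d, doc in enumerate(docs):
            for w, count in doc.items():
                # E-step: Bayes' rule for the hidden label of word w in doc d.
                mix = [pi[d][j] * topics[j][w] for j in range(k)]
                mix_sum = sum(mix)
                bg = lam_b * background[w]
                p_bg = bg / (bg + (1 - lam_b) * mix_sum)
                for j in range(k):
                    # Fractional count of w credited to topic j.
                    frac = count * (1 - p_bg) * mix[j] / mix_sum
                    topic_counts[j][w] += frac
                    new_pi[d][j] += frac
        # M-step: renormalise the fractional counts (maximum likelihood).
        for j in range(k):
            total = sum(topic_counts[j].values())
            topics[j] = {w: v / total for w, v in topic_counts[j].items()}
        for d in range(len(docs)):
            total = sum(new_pi[d])
            pi[d] = [v / total for v in new_pi[d]]
    return topics, pi


# Toy input: two tiny documents over a five-word vocabulary.
vocab = ["text", "mining", "retrieval", "query", "the"]
docs = [{"text": 3, "mining": 2, "the": 4}, {"retrieval": 3, "query": 2, "the": 4}]
background = {"text": 0.05, "mining": 0.05, "retrieval": 0.05, "query": 0.05, "the": 0.8}
topics, pi = plsa_em(docs, vocab, background)
print(pi)
```

Because the background absorbs much of the mass of frequent words such as "the", the learned topics are pushed toward the more content-bearing words; the exact values depend on the random initialisation.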

9 PLSI - Problems
- Each document is represented as a dummy variable d
- Number of parameters grows linearly with corpus size
- Overfitting
- Not fully generative
- Not clear how to model previously unseen documents

10 Latent Dirichlet Allocation [Blei et al, 2003]
- Per-document topic mixtures and word multinomials come from Dirichlet priors
- Exact solution is intractable; inference is more complicated (variational methods, Monte Carlo)

11 Dirichlet Distribution
Conjugate prior of the multinomial distribution
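Conjugacy means that a Dirichlet prior combined with multinomial counts gives a Dirichlet posterior whose parameters are simply the prior parameters plus the counts. A minimal sketch, with an invented prior and invented counts:

```python
def dirichlet_posterior(alpha, counts):
    """Dirichlet(alpha) prior + multinomial counts -> Dirichlet(alpha + counts)."""
    return [a + n for a, n in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Expected multinomial parameters under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


prior = [1.0, 1.0, 1.0]   # symmetric prior over three outcomes (toy values)
counts = [5, 0, 1]        # observed counts (toy values)
posterior = dirichlet_posterior(prior, counts)
print(posterior)
print(dirichlet_mean(posterior))
```

The posterior mean keeps a nonzero probability for the outcome that was never observed, which is the smoothing effect that the prior contributes.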

12 Latent Dirichlet Allocation
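The figure for this slide did not survive in the transcript. For reference, the standard LDA generative process of Blei et al. (2003), in its smoothed form with a Dirichlet prior on the topic word distributions, can be written as below. The symbols are the conventional ones and are assumed, not confirmed, to match the slide.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
\end{align*}
```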

13 Outline: Overview of topic models; Cross-Collection LDA; Cross-cultural analysis with ccLDA; Other applications of ccLDA; Model evaluation; An alternative cross-collection model

14 Cross-Collection LDA (ccLDA)
- LDA extension for modeling multiple text collections
- Each topic has a word distribution that is shared among all collections, as well as word distributions that are unique to each collection
- Automatically discovers differences between collections and organizes them by topic

15 Example
Topic of weather and the outdoors in travel forums
Topic: weather time day going rain summer month high days thanks
UK: wind waterproof ending rolling walkers rochdale layers snow footwear ankle
India: leh monsoon road manali ladakh trekking trek season rains monsoons
Singapore: hot humid humidity heat degree equator sweat bring rain umbrella

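16 ccLDA
- Inference can be done with Gibbs sampling
- Graphical representation: [figure: plate diagram with nodes α, θ, z, w, x, c, φ, β, σ, δ, ψ, γ0, γ1 and plates D, N, T, C, TC]
- The generative process: [figure not reproduced in the transcript]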
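Since the slide's generative process is not reproduced in the transcript, here is a hedged sketch of one plausible reading of it, pieced together from the plate-diagram labels and the description of ccLDA elsewhere in the deck. The mapping of symbols (φ shared word distributions, σ collection-specific word distributions, ψ a Beta-distributed switch probability, x the switch), the convention that x = 0 selects the shared distribution, and all hyperparameter values are assumptions.

```python
import random


def dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def categorical(probs, rng):
    """Draw an index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, n_topics, n_collections, docs_per_collection, doc_len,
                    alpha=None, beta=0.5, delta=0.5, gamma0=1.0, gamma1=1.0, seed=0):
    rng = random.Random(seed)
    n_words = len(vocab)
    if alpha is None:
        # One Dirichlet prior per collection (collection-dependent priors).
        alpha = [[0.5] * n_topics for _ in range(n_collections)]
    # phi[z]: word distribution of topic z shared by all collections.
    phi = [dirichlet([beta] * n_words, rng) for _ in range(n_topics)]
    # sigma[z][c]: word distribution of topic z specific to collection c.
    sigma = [[dirichlet([delta] * n_words, rng) for _ in range(n_collections)]
             for _ in range(n_topics)]
    # psi[z][c]: assumed probability that a token uses the shared distribution.
    psi = [[rng.betavariate(gamma0, gamma1) for _ in range(n_collections)]
           for _ in range(n_topics)]
    corpus = []
    for c in range(n_collections):
        for _ in range(docs_per_collection):
            theta = dirichlet(alpha[c], rng)           # document-topic mixture
            tokens = []
            for _ in range(doc_len):
                z = categorical(theta, rng)            # topic of this token
                x = 0 if rng.random() < psi[z][c] else 1
                dist = phi[z] if x == 0 else sigma[z][c]
                tokens.append(vocab[categorical(dist, rng)])
            corpus.append((c, tokens))
    return corpus


corpus = generate_corpus(["rain", "hotel", "trek", "monsoon"], n_topics=2,
                         n_collections=3, docs_per_collection=2, doc_len=5)
print(corpus[0])
```

Sampling forward like this is only a way to read the model; fitting it to real text would use the Gibbs sampler mentioned on the slide, which is not sketched here.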

17 Previous Work
Comparative mixture model (ccMix): ChengXiang Zhai, Atulya Velivelli, Bei Yu. A cross-collection mixture model for comparative text mining. Proceedings of ACM KDD 2004.
Improvements in ccLDA:
- Does not rely on user-defined parameters
- Distributions have Dirichlet/Beta priors
- Document-topic distributions have collection-dependent priors
- P(x) depends on the topic and collection
ccMix (2004)
Common: cd drive rw combo dvd
Dell: apoint blah hook tug 2499
Apple: airport burn 4x read schools
IBM: t20 ultrabay tells device number
ccLDA (2009)
Common: drive cd dvd hard rw
Dell: battery laptop bay inspiron media
Apple: itunes burn imovie burning minutes
IBM: 2000 ultrabay hot device swappable

18 Outline: Overview of topic models; Cross-Collection LDA; Cross-cultural analysis with ccLDA; Other applications of ccLDA; Model evaluation; An alternative cross-collection model

19 Cross-Cultural Analysis
Documents from or about 3 countries: United Kingdom, India, Singapore
- 3,266 forum discussions, collected from lonelyplanet.com; represents the perspective of tourists
- 7,388 English-language blogs, collected through blogcatalog.com; represents the perspective of locals

20 Cross-Cultural Analysis
Topic of religion from the blogs
Topic: god jesus lord life faith holy man christ church love
UK: church god john todd bentley christ luke bible christian sermon
India: krishna religion religious spiritual guru lord sri shri baba hindu
Singapore: god sin john spirit things lamb exodus suffering cross lives

21 Cross-Cultural Analysis
Topic of entertainment from the blogs; compare against ccMix
ccLDA
Topic: music song new songs like album dance comments rock guitar
UK: music band album dance festival sound bands remix tracks amp
India: movie film movies songs films director best bollywood indian awards
Singapore: band music american japanese mark world video sound idol week
ccMix
Topic: comment posted like music just blog time labels post love
UK: music album band song songs new review track bands pop
India: kerala india tiger rajasthan birds water park city temple sanctuary
Singapore: kids baby cool desktop miss fun wallpaper love dont little

22 Cross-Cultural Analysis
Topic of travel from the blogs; compare against LDA (on each collection individually)
ccLDA
Topic: travel hotel hotels city best place visit holiday trip world
UK: holiday holidays hotels spain london great surf breaks train ski
India: india delhi indian mumbai bangalore tour air dubai city mahindra
Singapore: singapore hong kong spa hotel beach chinese pictures restaurant bangkok
LDA
Topic: travel city hotel park holiday hotels place beach road visit
UK: travel holiday hotel city london park hotel place holidays hall
India: travel city beach place hotel temple road park hotels tourism
Singapore: travel hotel city park place beach trip hotels spa visit

23 Cross-Cultural Analysis
Topic of food from both datasets; compare the view of tourists and locals
Perspective of Locals
Topic: food add chicken recipe cooking taste rice recipes sugar soup
UK: food wine restaurant coffee cheese soup eat chef english drink
India: recipe recipes powder indian salt tsp rice masala oil coriander
Singapore: coffee cup oil comments fried add restaurant rice tea seafood
Perspective of Tourists
Topic: food eat restaurant restaurants tea cheap meal eating cafe drink
UK: fish haggis chips respectability decent veggie pudding photoblog sausages sandwiches
India: cooking spices sick flour tomato batter ate cook olive recipe
Singapore: hawker satay stalls noodles roti stall seafood malay rochester noodle

24 Outline: Overview of topic models; Cross-Collection LDA; Cross-cultural analysis with ccLDA; Other applications of ccLDA (scientific research/literature analysis, media analysis and bias detection); Model evaluation; An alternative cross-collection model

25 Research Analysis
- 16,186 abstracts from computational linguistics and linguistics journals
- Interdisciplinary research topic discovery
- Topic evolution over time

26 Research Analysis
Topic of communication
Topic: speech spoken interaction human discourse paper understanding task context communication goal users
Comp Ling: dialogue user systems information utterances dialogues utterance agent plan recognition agents research multi
Linguistics: social communication verbal women speakers speaker relationship interaction ways means behavior face men

27 Research Analysis
Topic of parsing/grammars across two time intervals
Topic: parser grammar tree parsers grammars free context syntactic parse structure
Old (< 2000): number result corresponding networks known binding lr introduce consider recognition transformational ambiguous networks
New (>= 2000): dependency probabilistic stochastic treebank pcfg constraint lexicalized ccg projective robustness hpsg modeling treebanks

28 Media Analysis
- 623 news articles from msnbc.com and foxnews.com from August 2008
- Discover editorial differences within topics
Topic: percent economy prices market
MSNBC: stocks account trades tools spending consumers sales investors trading company
FOX News: oil drilling poverty offshore coverage insurance growing uninsured census congress
Topic: car vehicle cars fuel drive
MSNBC: diesel says autos camaro tax credit smaller mileage hybrid chevrolet
FOX News: mazda gallardo chrysler minivan horsepower lamborghini mph sports lp traffic

29 Outline: Overview of topic models; Cross-Collection LDA; Cross-cultural analysis with ccLDA; Other applications of ccLDA; Model evaluation; An alternative cross-collection model

30 Model Evaluation
- Greater likelihood of held-out data than alternative models
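For readers unfamiliar with this kind of evaluation, the sketch below shows how a held-out log-likelihood is typically computed for a topic model once a document's topic mixture is available. How that mixture is estimated for unseen documents is model-specific and is not covered by the slide; the function name and all numbers are illustrative assumptions.

```python
import math


def heldout_log_likelihood(tokens, theta, topic_word):
    """Sum over tokens of log P(w|d), with P(w|d) = sum_z theta[z] * P(w|z).

    Assumes every token has nonzero probability under at least one topic.
    """
    total = 0.0
    for w in tokens:
        p_w = sum(theta[z] * topic_word[z][w] for z in range(len(theta)))
        total += math.log(p_w)
    return total


# Toy model with two topics over two words (invented values).
theta = [0.5, 0.5]
topic_word = [{"rain": 0.8, "hotel": 0.2}, {"rain": 0.2, "hotel": 0.8}]
ll = heldout_log_likelihood(["rain", "hotel"], theta, topic_word)
print(ll)
```

A model that assigns a higher value to the same held-out tokens is the one the slide would count as better.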

31 Model Evaluation
- Document classification: new vs. old
- Compare to NB and SVM (linear kernel)

32 Outline: Overview of topic models; Cross-Collection LDA; Cross-cultural analysis with ccLDA; Other applications of ccLDA; Model evaluation; An alternative cross-collection model

33 Alternative Model
- Similar to hierarchical Pachinko Allocation [Mimno et al, 2007]
- Model as 2-level hierarchy

34 Alternative Model
- Single, global set of “super-topics”
- One set of “sub-topics” for each collection
- Choose super-topic T from P(T|d)
- Choose sub-topic t from P(t|T,c)
- Choose hierarchy level l from P(l|t,T)
- If l = 0, choose word from P(w|T); else if l = 1, choose word from P(w|t)
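The sampling steps listed on this slide can be sketched for a single token as follows. The dictionary layout, the function names, and every probability are hypothetical; only the order of the draws follows the slide.

```python
import random


def categorical(probs, rng):
    """Draw a key from a {outcome: probability} dictionary."""
    keys = list(probs)
    return rng.choices(keys, weights=[probs[k] for k in keys], k=1)[0]


def generate_token(c, p_super_given_doc, p_sub, p_level, p_word_super, p_word_sub, rng):
    """Generate one token of a document that belongs to collection c."""
    T = categorical(p_super_given_doc, rng)       # super-topic T ~ P(T|d)
    t = categorical(p_sub[(T, c)], rng)           # sub-topic t ~ P(t|T,c)
    level = categorical(p_level[(t, T)], rng)     # hierarchy level l ~ P(l|t,T)
    if level == 0:
        w = categorical(p_word_super[T], rng)     # word from the super-topic
    else:
        w = categorical(p_word_sub[t], rng)       # word from the sub-topic
    return T, t, level, w


# Toy parameters (invented): one super-topic and one UK sub-topic.
p_super_given_doc = {"religion": 1.0}
p_sub = {("religion", "UK"): {"UK 1": 1.0}}
p_level = {("UK 1", "religion"): {0: 0.5, 1: 0.5}}
p_word_super = {"religion": {"god": 0.6, "lord": 0.4}}
p_word_sub = {"UK 1": {"church": 0.7, "john": 0.3}}

print(generate_token("UK", p_super_given_doc, p_sub, p_level,
                     p_word_super, p_word_sub, random.Random(0)))
```

In this layout, restricting each `p_sub[(T, c)]` to a single sub-topic with probability 1 leaves one collection-specific word distribution per super-topic and collection.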

35 Alternative Model
This is just a generalization of ccLDA!
ccLDA = special case, constrained such that for each super-topic T=j there is exactly one sub-topic such that P(t=j|T=j)=1 and P(t=i|T=j)=0 for all i ≠ j

36 Alternative Model
Topic of religion in the blogs
Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574
UK 1: church 0.030402, john 0.017007, todd 0.016154, jesus 0.015552, bentley 0.014348, luke 0.012693, religion 0.012592, christ 0.012091, cross 0.011388, neville 0.009482
Value shown on the link from Super-Topic to UK 1: 0.970483

37 Alternative Model
Topic of religion in the blogs
Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574
India 1: religion 0.021439, krishna 0.019062, spiritual 0.014765, hindu 0.012343, lord 0.01216, religious 0.012114, guru 0.011108, mother 0.01088, shri 0.010194, sri 0.009646
Value shown on the link from Super-Topic to India 1: 0.984414

38 Alternative Model
Topic of religion in the blogs
Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574
SG 1: god 0.032249, christ 0.018867, cross 0.015467, sin 0.012505, grace 0.012395, jesus 0.011957, john 0.011628, lamb 0.009982, mahendra 0.009489, good 0.009434
SG 2: daily 0.020028, free 0.016023, fast 0.014822, silent 0.014221, wait 0.012418, going 0.011818, sign 0.009414, friday 0.009214, health 0.008413, star 0.008413
Values shown on the links from Super-Topic to the sub-topics (in slide order): 0.851749, 0.102534

39 ccLDA
Topic of religion from the blogs
Topic: god jesus lord life faith holy man christ church love
UK: church god john todd bentley christ luke bible christian sermon
India: krishna religion religious spiritual guru lord sri shri baba hindu
Singapore: god sin john spirit things lamb exodus suffering cross lives

40 Alternative Model
Topic of politics in the blogs
Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413
UK 1: labour 0.049547, british 0.041125, workers 0.029925, european 0.026252, bbc 0.024908, david 0.017203, crisis 0.016934, immigration 0.014694, left 0.014336, trade 0.011648
UK 2: war 0.023458, world 0.01909, wales 0.019002, welsh 0.017823, brown 0.014503, britain 0.013498, gordon 0.012188, london 0.011445, politics 0.010004, anti 0.009916
Values shown on the links from Super-Topic to the sub-topics (in slide order): 0.29108, 0.699227

41 Alternative Model
Topic of politics in the blogs
Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413
India 1: pakistan 0.052105, india 0.038041, kashmir 0.037222, state 0.023186, muslims 0.017312, muslim 0.016634, political 0.010647, taliban 0.010647, jammu 0.009461, kashmiri 0.00932
Value shown on the link from Super-Topic to India 1: 0.987059

42 Alternative Model
Topic of politics in the blogs
Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413
SG 1: singapore 0.04263, world 0.027554, singaporeans 0.014817, people 0.013387, earth 0.012478, malaysia 0.011698, global 0.010398, say 0.010398, myanmar 0.009488, workers 0.008838
Value shown on the link from Super-Topic to SG 1: 0.970675

43 ccLDA
Topic of politics from the blogs
Topic: people government war world state political human rights said country
UK: news politics london media post obama war labour world bbc
India: pakistan india kashmir indian pakistani muslims state muslim brigade taliban
Singapore: singapore comments singaporeans labels chinese ago news world joo posted

44 Outline: Overview of topic models; Cross-Collection LDA; Cross-cultural analysis with ccLDA; Other applications of ccLDA; Model evaluation; An alternative cross-collection model

45 Questions?

