
1 A Two-Dimensional Topic-Aspect Model for Discovering Multi-Faceted Topics
Michael Paul and Roxana Girju, University of Illinois at Urbana-Champaign

2 Probabilistic Topic Models
- Each word token is associated with a hidden "topic" variable
- A probabilistic approach to dimensionality reduction
- Useful for uncovering latent structure in text
- Basic formulation: P(w|d) = Σ_z P(w|z) P(z|d) (illustrated below)
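To make the formulation concrete, here is a minimal numeric sketch of marginalizing out the hidden topic; the probabilities are toy values, not numbers from the paper:

    import numpy as np

    # Toy topic model: 2 topics over a 3-word vocabulary.
    phi = np.array([                # phi[z][w] = P(w|z)
        [0.7, 0.2, 0.1],            # topic 0
        [0.1, 0.3, 0.6],            # topic 1
    ])
    theta_d = np.array([0.4, 0.6])  # theta_d[z] = P(z|d) for one document d

    # P(w|d) = sum_z P(w|z) P(z|d): average the topics' word
    # distributions, weighted by the document's topic mixture.
    p_w_given_d = theta_d @ phi
    print(p_w_given_d)              # [0.34 0.26 0.4 ]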

3 Probabilistic Topic Models
- "Topics" are latent distributions over words
- A topic can be interpreted as a cluster of words
- Topic models often cluster words by what people would consider topicality
- There are often other dimensions along which words could be clustered
  - Sentiment, perspective, or theme
- What if we want to model both?

4 Previous Work
- Topic-Sentiment Mixture Model (Mei et al., 2007)
  - Words come from either a topic distribution or a sentiment distribution
- Topic+Perspective Model (Lin et al., 2008)
  - Words are weighted as topical vs. ideological

5 Previous Work
- Cross-Collection LDA (Paul and Girju, 2009)
  - Each document belongs to a collection
  - Each topic has a word distribution shared among collections, plus distributions unique to each collection
- What if the "collection" were a hidden variable?
  - → the Topic-Aspect Model (TAM)

6 Topic-Aspect Model
- Each document has:
  - a multinomial topic mixture
  - a multinomial aspect mixture
- Words may depend on both!

7 Topic-Aspect Model
- Topic and aspect mixtures are drawn independently of one another
  - This differs from hierarchical topic models, where one depends on the other
  - The mixtures can be thought of as two separate clustering dimensions

8 Topic-Aspect Model
- Each word token also has two binary variables:
  - the "level" (background or topical) denotes whether the word depends on the topic
  - the "route" (neutral or aspectual) denotes whether the word depends on the aspect
- A word may depend on a topic, an aspect, both, or neither

9 Topic-Aspect Model
- A word may depend on a topic, an aspect, both, or neither
- Example: "Computational" aspect

  Route \ Level | Background          | Topical
  Neutral       | paper, new, present | speech, recognition
  Aspectual     | algorithm, model    | markov, hmm, error

10 Topic-Aspect Model
- A word may depend on a topic, an aspect, both, or neither
- Example: "Linguistic" aspect

  Route \ Level | Background          | Topical
  Neutral       | paper, new, present | speech, recognition
  Aspectual     | language, linguistic | prosody, intonation, tone

11 Topic-Aspect Model
- A word may depend on a topic, an aspect, both, or neither
- Example: "Linguistic" aspect

  Route \ Level | Background          | Topical
  Neutral       | paper, new, present | communication, interaction
  Aspectual     | language, linguistic | conversation, social

12 Topic-Aspect Model
- A word may depend on a topic, an aspect, both, or neither
- Example: "Computational" aspect

  Route \ Level | Background          | Topical
  Neutral       | paper, new, present | communication, interaction
  Aspectual     | algorithm, model    | dialogue, system, user

13 Topic-Aspect Model
- Generative process for each word token in a document d (see the sketch below):
  - Sample a topic z from P(z|d)
  - Sample an aspect y from P(y|d)
  - Sample a level l from P(l|d)
  - Sample a route x from P(x|l,z)
  - Sample the word w from one of:
    - P(w|l=0,x=0) (background, neutral)
    - P(w|z,l=1,x=0) (topical, neutral)
    - P(w|y,l=0,x=1) (background, aspectual)
    - P(w|z,y,l=1,x=1) (topical, aspectual)
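As a rough illustration, here is a minimal Python sketch of this generative story. All parameter values are hypothetical stand-ins; in the real model they are drawn from Dirichlet/Beta priors (next slide):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy parameters for one document d (hypothetical values).
    K, A, V = 2, 2, 5                             # topics, aspects, vocabulary size
    p_z = np.array([0.5, 0.5])                    # P(z|d): topic mixture
    p_y = np.array([0.7, 0.3])                    # P(y|d): aspect mixture
    p_l = np.array([0.4, 0.6])                    # P(l|d): background (0) vs. topical (1)
    p_x = rng.dirichlet([1.0, 1.0], size=(2, K))  # P(x|l,z): neutral (0) vs. aspectual (1)

    # One word distribution for each of the four (level, route) cases.
    phi_bg     = rng.dirichlet(np.ones(V))               # P(w|l=0,x=0)
    phi_topic  = rng.dirichlet(np.ones(V), size=K)       # P(w|z,l=1,x=0)
    phi_aspect = rng.dirichlet(np.ones(V), size=A)       # P(w|y,l=0,x=1)
    phi_both   = rng.dirichlet(np.ones(V), size=(K, A))  # P(w|z,y,l=1,x=1)

    def generate_token():
        z = rng.choice(K, p=p_z)        # topic
        y = rng.choice(A, p=p_y)        # aspect
        l = rng.choice(2, p=p_l)        # level
        x = rng.choice(2, p=p_x[l, z])  # route
        if l == 0 and x == 0:
            p_w = phi_bg                # depends on neither
        elif l == 1 and x == 0:
            p_w = phi_topic[z]          # depends on the topic only
        elif l == 0 and x == 1:
            p_w = phi_aspect[y]         # depends on the aspect only
        else:
            p_w = phi_both[z, y]        # depends on both
        return rng.choice(V, p=p_w)

    doc = [generate_token() for _ in range(10)]
    print(doc)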

14 Topic-Aspect Model
- The distributions have Dirichlet/Beta priors
  - Follows the Latent Dirichlet Allocation framework
- The numbers of aspects and topics are user-supplied parameters
- Inference is straightforward with Gibbs sampling

15 Topic-Aspect Model
- Semi-supervised TAM can be used when the aspect label is known
- Two options (sketched below):
  - Fix P(y|d)=1 for the correct aspect label and 0 otherwise
    - TAM then behaves like ccLDA (Paul and Girju, 2009)
  - Define a prior for P(y|d) that biases it toward the true label
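A minimal sketch of how the two options differ for one labeled document; the bias strength is an arbitrary illustrative choice, not a value from the paper:

    import numpy as np

    A = 2        # number of aspects
    label = 1    # known aspect label for this document

    # Option 1: clamp P(y|d) so the document can only use the labeled
    # aspect; TAM then behaves like ccLDA, with the label as the "collection".
    p_y = np.zeros(A)
    p_y[label] = 1.0

    # Option 2: keep P(y|d) latent, but add pseudo-counts to its Dirichlet
    # prior on the true label, softly biasing the posterior toward it.
    gamma = np.ones(A)
    gamma[label] += 10.0   # hypothetical bias strength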

16 Experiments
- Three datasets:
  - 4,247 abstracts from the ACL Anthology
  - 2,173 abstracts from linguistics journals
  - 594 articles from the Bitterlemons corpus (Lin et al., 2006)
    - a collection of editorials on the Israeli/Palestinian conflict

17 Experiments
- Example: Computational Linguistics

18 Experiments
- Example: Israeli/Palestinian conflict
  - Output shown both unsupervised and with a prior on P(aspect|d) toward the true label

19 Evaluation
- Cluster coherence
  - "Word intrusion" method (Chang et al., 2009; see the sketch below)
  - 5 human annotators
  - Compared against ccLDA and LDA
  - TAM clusters are as coherent as those of other established models
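For concreteness, a small sketch of how one word-intrusion item can be assembled in the spirit of Chang et al. (2009); the word lists are hypothetical:

    import random

    def intrusion_item(topic_words, other_topic_words, k=5, seed=0):
        """Build one word-intrusion question: the k top words of a topic
        plus one "intruder" taken from a different topic's top words."""
        rng = random.Random(seed)
        words = list(topic_words[:k])
        intruder = rng.choice([w for w in other_topic_words if w not in words])
        item = words + [intruder]
        rng.shuffle(item)
        return item, intruder

    # Annotators see `item` and try to spot the intruder; the easier it
    # is to spot, the more coherent the word cluster.
    item, answer = intrusion_item(
        ["speech", "recognition", "acoustic", "phoneme", "markov"],
        ["parse", "grammar", "syntax", "tree", "dependency"],
    )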

20 Evaluation
- Document classification
  - Classify Bitterlemons perspectives (Israeli vs. Palestinian)
  - Use the TAM output (2 aspects + 12 topics) as input to an SVM
  - Use the aspect mixtures and topic mixtures as features (see the sketch below)
  - Compare against LDA
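A sketch of this classification setup using scikit-learn; the mixtures and labels below are random stand-ins for the values TAM would actually infer:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Each Bitterlemons document is represented by its TAM mixtures: a 2-dim
    # aspect mixture P(y|d) concatenated with a 12-dim topic mixture P(z|d).
    n_docs = 594
    aspect_mix = np.random.dirichlet(np.ones(2), size=n_docs)   # stand-in for TAM output
    topic_mix = np.random.dirichlet(np.ones(12), size=n_docs)   # stand-in for TAM output
    X = np.hstack([aspect_mix, topic_mix])                      # 14 features per document
    labels = np.random.randint(0, 2, size=n_docs)               # Israeli vs. Palestinian

    clf = SVC(kernel="linear")
    print(cross_val_score(clf, X, labels, cv=5).mean())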

21 Evaluation
- Document classification
  - The 2 aspects from TAM are much more strongly associated with the true perspectives than the 2 topics from LDA
  - This suggests that TAM clusters along a different dimension than LDA, by separating out another "topical" dimension (with 12 components)

22 Summary: Topic-Aspect Model
- Can cluster along two independent dimensions
- Words may be generated by both dimensions, so the clusters can be inter-related
- Cluster definitions are arbitrary; their structure depends on the data and on the model parameterization (especially the number of aspects/topics)
  - Modeling with 2 aspects and many topics is shown to produce aspect clusters corresponding to document perspectives on certain corpora

23 References
- Chang, J.; Boyd-Graber, J.; Gerrish, S.; Wang, C.; and Blei, D. 2009. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems (NIPS).
- Lin, W.; Wilson, T.; Wiebe, J.; and Hauptmann, A. 2006. Which side are you on? Identifying perspectives at the document and sentence levels. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL).
- Lin, W.; Xing, E.; and Hauptmann, A. 2008. A joint topic and perspective model for ideological discourse. In ECML PKDD '08: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Part II, 17–32. Berlin, Heidelberg: Springer-Verlag.
- Mei, Q.; Ling, X.; Wondra, M.; Su, H.; and Zhai, C. 2007. Topic sentiment mixture: Modeling facets and opinions in weblogs. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, 171–180.
- Paul, M., and Girju, R. 2009. Cross-cultural analysis of blogs and forums with mixed-collection topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 1408–1417.

