1 Modeling Political Blog Posts with Response Tae Yano Carnegie Mellon University email@example.com IBM SMiLe Open House Yorktown Heights, NY October 8, 2009
2 This talk is about how we are designing topic models for online political discussion
3 Political blogs Why (should we) study political blogs? An influential social phenomenon. An important venue for civil discourse. Blog text is relatively understudied. Interest in text analysis from social/political science researchers (Monroe et al., 2009; Hopkins and King, 2009; many others)
4 Political blogs Why (should we) study political blogs? A different, interesting type of text we don’t usually deal with in NLP. Spontaneous text: often ungrammatical, with copious misspellings and colloquialisms. Elusive information needs (“popularity”, “influence”, “trustworthiness”) that are difficult and costly to meet with classical supervised approaches. The text is composed of a mixture of diverse linguistic styles.
6 Posts are often coupled with comment sections. Comment style is casual, creative, and less carefully edited.
7 Political blogs - Illustration Comments often meander across several themes: on topic, tangent, ranting? Example snippets: “The rock that keeps things off the table”; “Taxes and Fees”; “If the President gets health care”
8 Political blogs - Illustration Posts tend to discuss multiple themes: House Republicans? Government neglect? Oil companies? Energy policy?
9 Political blogs - Illustration Comments can be constructive and formal (“I am in total agreement … In contrast … My understanding is …”) …or subjective and conversational (“Iowa-Shiowa”)
10 Political blogs - Illustration Comments can be very long …or quite terse “Absurd”
11 Political blogs - Illustration How should we approach this sort of data? Our approach is to treat it as an instance of topic modeling: Latent Dirichlet Allocation, or LDA (Blei, Ng, and Jordan, 2003)
12 Topic modeling What does this approach buy us? It naturally expresses the idea that a text is composed of several distinctive components: a post and its reactions (comments); a mixture of different themes within one post; diverse personal styles and pet peeves. It is a convenient choice for corpora with uncertainty: we can encode hypotheses and have the model learn from data, and modularity makes it easy to change the model.
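To make the "mixture of components" idea concrete, here is a minimal sketch of the LDA generative story that the talk builds on. The sizes, hyperparameter values, and function name are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): K topics, V vocabulary words, 20 words per post.
K, V, n_words = 3, 10, 20
alpha, beta = 0.1, 0.01

# Topic-word distributions phi_k ~ Dirichlet(beta), one per topic.
phi = rng.dirichlet([beta] * V, size=K)

def generate_post(n_words):
    """Generate one document under the LDA generative story."""
    theta = rng.dirichlet([alpha] * K)        # per-document topic mixture
    z = rng.choice(K, size=n_words, p=theta)  # topic assignment per word
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # word drawn from its topic
    return theta, z, w

theta, z, w = generate_post(n_words)
```

Each document gets its own mixture θ, which is exactly the hook the models below exploit: the post and its comment section can share one θ while drawing words from different distributions.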
Modeling political blogs Our proposed political blog model: CommentLDA. D = # of documents; N = # of words in the post; M = # of words in the comments; z, z′ = topics; w = word (in post); w′ = word (in comments); u = user
Modeling political blogs Our proposed political blog model: CommentLDA. The left-hand side is vanilla LDA: α → θ_d → z_i → w_i, with word distributions β; plates over N_d words and D documents. D = # of documents; N = # of words in post; M = # of words in comments
Modeling political blogs Our proposed political blog model: CommentLDA. The right-hand side captures the generation of the reactions separately from the post body, with two separate sets of word distributions. The two chambers share the same topic mixture. D = # of documents; N = # of words in post; M = # of words in comments
Modeling political blogs Our proposed political blog model: CommentLDA. The user IDs of the commenters are treated as part of the comment text: the topics generate both the user IDs and the words in the comment section. D = # of documents; N = # of words in post; M = # of words in comments
Modeling political blogs CommentLDA: three variations on user ID generation. “Verbosity” (original model): M = # of words in all comments, L = 1. “Comment frequency”: M = # of comments on the post, L = # of words in the comment. “Response”: M = # of participants in the post, L = # of words by one participant.
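The three counting schemes can be illustrated on a toy comment section. The comment data here is hypothetical; the point is only how M (number of user-ID draws) and L (weight per draw) differ across the variants.

```python
from collections import Counter

# Hypothetical comment section for one post: (user, comment_text) pairs.
comments = [
    ("alice", "I am in total agreement"),
    ("bob", "absurd"),
    ("alice", "my understanding is different"),
]

words_per_user = Counter()
for user, text in comments:
    words_per_user[user] += len(text.split())

# "Verbosity": one user-ID draw per comment word (L = 1);
# verbose users weigh more.
verbosity = [(u, 1) for u, text in comments for _ in text.split()]
M_verbosity = len(verbosity)          # = total words in all comments

# "Comment frequency": one draw per comment, L = words in that comment;
# frequent commenters weigh more.
comment_freq = [(u, len(t.split())) for u, t in comments]
M_comment_freq = len(comment_freq)    # = number of comments

# "Response": one draw per distinct participant, L = words by that user;
# every participant weighs equally.
response = list(words_per_user.items())
M_response = len(response)            # = number of participants

print(M_verbosity, M_comment_freq, M_response)
```

On this toy data the three schemes give M = 10, 3, and 2 respectively, showing how each encodes a different hypothesis about which users ought to weigh more.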
Think of this as encoding a hypothesis about which type of user ought to weigh more: Verbosity, Comment frequency, or Response!
Modeling political blogs Another model we tried: LinkLDA. This model is agnostic to the words in the comment section: we took the words out of the comment section! D = # of documents; N = # of words in post; M = # of words in comments
Modeling political blogs Another model we tried: LinkLDA. The model is structurally (but not semantically) equivalent to Link LDA (Erosheva et al., 2004; Nallapati and Cohen, 2008). D = # of documents; N = # of words in post; M = # of words in comments
21 Topic discovery What topics did the models discover? What differences are there between the post and the comments? Data sets: 5 major US blogs collected over a year; this data is available on our website (http://www.ark.cs.cmu.edu/blog-data). Each site has 1000 to 2000 training posts; details about the data sets are in Yano, Cohen, and Smith, 2009. Inference is implemented with Gibbs sampling. Following are some topics from the Matthew Yglesias site.
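The slide notes that inference is implemented with Gibbs sampling (via HBC, per the references). As a rough illustration of the idea, here is a minimal collapsed Gibbs sampler for vanilla LDA; the function name, toy documents, and hyperparameter values are assumptions for the sketch, not the authors' implementation.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for vanilla LDA (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # doc-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # total words per topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):   # initialize counts from random assignments
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]          # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z_i = k | everything else)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k          # resample and restore counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

# Toy corpus: documents as lists of word ids over a vocabulary of size 6.
docs = [[0, 1, 0, 2], [3, 4, 3, 5], [0, 2, 1]]
ndk, nkw = gibbs_lda(docs, K=2, V=6)
```

The same count-and-resample pattern extends to CommentLDA, with extra count tables for the comment-side word distributions and the user IDs.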
25 Comment prediction A guessing game: can we predict which users will react to an unseen post? Infer the topic mixture for each test post using the fitted model, then rank users according to p(user | post, model). We envision this being useful for personalized blog filtering or recommendation systems.
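The ranking step amounts to marginalizing the inferred topic mixture over the per-topic user distributions. A minimal sketch, with hypothetical fitted quantities (the numbers, user names, and variable names are made up for illustration):

```python
import numpy as np

# Hypothetical fitted quantities: theta is the inferred topic mixture of a
# test post; user_given_topic[k, u] = p(user u | topic k) from training.
theta = np.array([0.7, 0.2, 0.1])       # K = 3 topics
user_given_topic = np.array([
    [0.6, 0.3, 0.1],                    # topic 0 mostly attracts user 0
    [0.1, 0.8, 0.1],                    # topic 1 mostly attracts user 1
    [0.2, 0.2, 0.6],                    # topic 2 mostly attracts user 2
])
users = ["alice", "bob", "carol"]

# p(user | post, model) = sum_k theta_k * p(user | topic k)
p_user = theta @ user_given_topic
ranking = [users[i] for i in np.argsort(-p_user)]
print(ranking)
```

Since the test post leans heavily on topic 0, "alice" is ranked first; the top-n prefix of this ranking is what the precision numbers on the next slide evaluate.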
26 Comment prediction Our models perform at least as well as a word-based Naïve Bayes baseline. CommentLDA performs consistently better on the MY site; LinkLDA is a much better option for RS. Does our model lack the expressive power to reflect site differences? Precision at top 5, 10, 20, 30 predicted users: CommentLDA (R,C) on MY: 27.54, 20.54, 14.83, 12.56; LinkLDA (R) on RS: 25.19, 16.92, 12.14, 9.82. (Charts, from left to right: LinkLDA (-v, -r, -c), CommentLDA (-v, -r, -c).)
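The evaluation metric on these slides, precision at top n, can be computed as the fraction of the n top-ranked users who actually commented on the post. A small sketch with made-up users:

```python
def precision_at_k(ranked_users, actual_commenters, k):
    """Fraction of the top-k predicted users who actually commented."""
    return len(set(ranked_users[:k]) & set(actual_commenters)) / k

# Hypothetical ranking and ground-truth commenter set for one test post.
ranked = ["alice", "bob", "carol", "dave", "erin"]
actual = {"bob", "erin", "frank"}
print(precision_at_k(ranked, actual, 5))  # 2 of the top 5 commented -> 0.4
```

Averaging this quantity over test posts, at cutoffs n = 5, 10, 20, 30, yields the tables reported here.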
27 Comment prediction Verbosity vs. Response: variation in user counting does make a difference; giving more weight to verbose users does not help for this task. (Charts: CommentLDA on MY, LinkLDA on RS; from left to right, cutoff n = 5, 10, 20, and 30 top-ranked users.)
28 Future work What forecasting tasks can our model do? Using CommentLDA to predict the topics of a post given its comments: useful for automatic text categorization or text search when the post has no searchable text.
Future work Can we automatically adjust how much the words influence the topics given the site? Better comment prediction? Inferential questions involving multiple sites?
Future work Can we guess which posts will collect more responses (number of comments, volume of comments)? A variant of SLDA (Blei and McAuliffe, 2007) with comments; a LinkLDA-type model is also possible.
31 Summary Political blogs are an exciting new domain for language and learning research. Topic modeling is a viable framework for analyzing the text of online political discussions. It is convenient and competitive in tasks that have potential uses in real applications.
33 References Our published version of this work includes a detailed profile of our data set, as well as more experiments. http://www.aclweb.org/anthology/N/N09/N09-1054.pdf Please refer back to the original LDA paper for the complete picture. http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf The Gibbs sampling for LDA is detailed in Griffiths & Steyvers, 2004. http://www.pnas.org/cgi/reprint/0307752101v1.pdf Hierarchical Bayesian Compiler (HBC) used for Gibbs sampling: http://www.cs.utah.edu/~hal/HBC
34 Comment prediction Modest performance (16% to 32% precision), but it compares favorably to the Naïve Bayes baseline. Precision at top 10 user prediction, best model per site: CB: LinkLDA (C) 32.06%; MY: CommentLDA (R) 20.54%; RS: LinkLDA (R) 16.92%. (Charts, from left to right: LinkLDA (-v, -r, -c), CommentLDA (-v, -r, -c), baselines (Freq, NB).)