Social media == a new source of information and a ground for social interaction Twitter: noisy and content-sparse data Question: can we carve out fine-grained topics within the micro-documents, e.g. food, movies, music, etc.?
Ritter et al. (2010) uncover dialogue acts such as Status, Question to Followers, Comment, and Reaction within Twitter posts via unsupervised modeling. Ramage et al. (2010) cluster a set of Twitter conversations using supervised LDA. Main finding: Twitter is 11% substance, 5% status, 16% style, 10% social, and 56% other.
1119 posts were selected from a Twitter corpus, with the following topic distribution:
Food 276    Movies 141
Sickness 39    Sports 34
Music 153    Relationships 66
Computers 57    Travel 32
Other 321
The topic labels were not included in the data.
Latent Dirichlet Allocation (Blei et al., 2003) was used to model topics Each document == a mixture of topics A topic == a probability distribution over words Intuition behind LDA == words that co-occur across documents have similar meaning and indicate the topic of the document. Example: if café and food co-occur in multiple documents, they are related to the same topic.
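The document-as-topic-mixture idea can be sketched in a few lines. This is a minimal illustration with scikit-learn's LDA implementation on a toy corpus, not the setup the talk actually used; the corpus and topic count are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus with two latent themes (food, movies) -- illustrative only.
docs = [
    "cafe food eat pizza",
    "movie film watch actor",
    "food dinner eat cafe",
    "film actor movie scene",
]

# LDA operates on bag-of-words counts, not raw text.
counts = CountVectorizer().fit_transform(docs)

# Each document becomes a mixture over n_components topics;
# each topic is a distribution over the vocabulary.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics.shape)  # one topic-mixture row per document: (4, 2)
```

Because "cafe", "food", and "eat" co-occur, they end up concentrated in the same topic, which is exactly the intuition described above.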
Individual posts were broken into words, and a random topic between 0 and 8 was assigned to each word. Only nouns and verbs were kept in the data. A tweet such as (1) is represented as (2): (1) hey, whats going on lets get some food in this café (2) [(food, t1), (cafe, t2)]
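The preprocessing step above can be sketched as follows. A real pipeline would use a POS tagger to keep only nouns and verbs; here a tiny hand-made content-word set stands in for the tagger, and the function name is invented for the example.

```python
import random

# Illustrative stand-in for the noun/verb filter (an assumption):
# in practice a POS tagger would decide which words to keep.
CONTENT_WORDS = {"food", "cafe"}

def preprocess(tweet, n_topics=9, seed=0):
    """Keep content words and pair each with a random topic id in 0..8."""
    rng = random.Random(seed)
    words = [w.strip(",.!?").lower() for w in tweet.split()]
    return [(w, rng.randrange(n_topics)) for w in words if w in CONTENT_WORDS]

print(preprocess("hey, whats going on lets get some food in this cafe"))
# -> word/topic pairs as in example (2), e.g. [('food', t), ('cafe', t)]
```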
t0 music 67%; movies 33%
t1 other 50%; travel 12.5%; sports 12.5%; food 25%
t2 music 50%; relationships 25%; sickness 12.5%; computers 12.5%
t3 food 80%; movies 20%
t4 food 74%; music 11%; sickness 10%; other 5%
t5 music 67%; other 14%; movies 9%; food 5%; sports 5%
t6 food 50%; other 14%; sickness 14%; computers 14%; movies 8%
t7 food 70%; music 20%; relationships 10%
t8 food 45%; travel 28%; relationships 9%; other 9%; computers 9%
Results: tweets on the same topic were scattered across clusters, and clusters contained various unrelated tweets.
Problem: tweets on the same topic often had no words in common.
Solution: a thesaurus of words related to the topics in the posts was created, and the most generic words from the thesaurus were selected as pads.
Sample thesaurus entry: (3) food_topic [pizza, café, food, eat, ...]; food topic padding set: [food, café]
Posts shorter than 3 words were padded with 2-3 words from the relevant padding set. Padding ensured that related posts share some words.
Unpadded posts: (4) café eat (5) food
Padded posts: (4) café food eat food (5) café food food
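The padding step can be sketched as a small function. The names `PADDING_SETS` and `pad_post` are invented for this illustration, and the fixed 2-word pad is an assumption that mirrors entry (3) above; the talk's actual procedure may have chosen pads differently.

```python
# Illustrative padding sets, following sample entry (3).
PADDING_SETS = {"food": ["food", "cafe"]}

def pad_post(words, topic, min_len=3):
    """Pad posts shorter than min_len with generic words from the
    topic's padding set, so related short posts share vocabulary."""
    if len(words) >= min_len:
        return list(words)
    return list(words) + PADDING_SETS[topic][:2]

print(pad_post(["cafe", "eat"], "food"))  # ['cafe', 'eat', 'food', 'cafe']
print(pad_post(["food"], "food"))         # ['food', 'food', 'cafe']
```

After padding, the two short posts share the words "food" and "cafe", giving LDA the co-occurrence signal it needs to group them.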
t0 music 77%; food 9%; computers 4.54%; sports 4.54%; travel 4.54%
t1 movies 80%; sickness 10%; music 5%; relationships 3%; food 2%
t2 food 69%; movies 16%; music 5%; sickness 5%; sports 5%
t3 food 93%; sickness 4%; other 3%; movies 1%
t4 other 65%; travel 29%; movies 6%
t5 sickness 100%
t6 music 46%; other 24%; sports 20%; computers 10%
t7 food 49%; computers 22%; other 20%; music 8%; movies 1%
t8 relationships 55%; food 23%; music 10%; sickness 6%; movies 6%
5 out of 9 clusters contained a strongly dominant semantic topic (over 60%).
6 topics emerge from the padded-data clustering: food, movies, music, other, sickness, and relationships.
Food is distributed across 3 clusters and is the dominant topic in all of them.
The most cohesive clusters are t5, t3, t2, t1, and t0; each has a dominant semantic topic, i.e. one that occurs in over 60% of the posts within the cluster.
Differences across the two data sets:
Non-padded data has no pure clusters and no clusters where one topic occurs more than 80% of the time.
3 clusters found in the non-padded data: music, food, and other.
6 distinct clusters found in the padded data: music, movies, food, relationships, other, and sickness.
Similarities between the padded and non-padded data sets:
Movie posts were given the same topic ID as food posts.
Food, music, and movies were distributed across several topic assignments.
Information sparsity has a strong negative effect on topic clustering. Topic clustering improves when: a) only nouns and verbs are used in posts; b) posts are padded with similar words. Future investigation: can padding help discover topics in large sets of tweets? Thank you!
Barzilay, R. and Lee, L. 2004. Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization. In Proceedings of HLT-NAACL, 113-120. Stroudsburg, PA.
Blei, D., Ng, A. and Jordan, M. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022.
Griffiths, T. and Steyvers, M. 2003. Prediction and Semantic Association. In Neural Information Processing Systems 15: 11-18.
Griffiths, T. and Steyvers, M. 2004. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (Supplement 1): 5228-5235.
Griffiths, T. and Steyvers, M. 2007. Probabilistic Topic Models. In Landauer, T., McNamara, D., Dennis, S. and Kintsch, W. (eds.) Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum.
Hofmann, T. 1999. Probabilistic Latent Semantic Analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 289-296. Stockholm, Sweden: Morgan Kaufmann.
Hofmann, T. 2001. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning Journal, 42(1): 177-196.
Ramage, D., Dumais, S., and Liebling, D. 2010. Characterizing Microblogs with Topic Models. Association for the Advancement of Artificial Intelligence.
Ritter, A., Cherry, C., and Dolan, B. 2010. Unsupervised Modeling of Twitter Conversations. In Proceedings of HLT-NAACL 2010, 113-120. Stroudsburg, PA.
Ritter, A. et al. 2011. Status Messages: A Unique Textual Source of Realtime and Social Information. Presented at Natural Language Seminar, Information Sciences Institute, USC.
Yao, L., Mimno, D., and McCallum, A. 2009. Efficient Methods for Topic Model Inference on Streaming Document Collections. In Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 937-946.