Topic Modeling using Latent Dirichlet Allocation

Social media: a new source of information and a ground for social interaction. Twitter: noisy and content-sparse data. Question: can we carve out fine-grained topics within these micro-documents, e.g. topics such as food, movies, music?

Ritter et al. (2010) uncover dialogue acts such as Status, Question to Followers, Comment, and Reaction within Twitter posts via unsupervised modeling. Ramage et al. (2010) cluster a set of Twitter conversations using supervised LDA. Main finding: Twitter is 11% substance, 5% status, 16% style, 10% social, and 56% other.

1119 posts were selected from a Twitter corpus, with the following topic distribution:
Food 276        Movies 141
Sickness 39     Sports 34
Music 153       Relationships 66
Computers 57    Travel 32
Other 321
The topic labels were not included in the data.

Latent Dirichlet Allocation (Blei et al., 2003) was used to model topics. Each document == a mixture of topics. Each topic == a probability distribution over words. Intuition behind LDA: words that co-occur across documents have similar meaning and indicate the topic of the document. Example: if café and food co-occur in multiple documents, they are related to the same topic.
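The slides do not name an LDA implementation; the following is a minimal sketch using gensim, with toy tweets and parameters chosen only for illustration, to show the document-as-mixture-of-topics view:

```python
# Minimal LDA sketch with gensim (implementation choice and toy data are assumptions).
from gensim import corpora, models

# Toy "tweets", already reduced to content words.
docs = [
    ["food", "cafe", "eat"],
    ["movie", "watch", "popcorn"],
    ["cafe", "food", "pizza"],
]

dictionary = corpora.Dictionary(docs)           # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

# 9 topics, mirroring the t0..t8 clusters in the slides.
lda = models.LdaModel(corpus, num_topics=9, id2word=dictionary,
                      passes=10, random_state=0)

# Each document is a mixture of topics; each topic a distribution over words.
for i, bow in enumerate(corpus):
    print(i, lda.get_document_topics(bow))
print(lda.show_topic(0, topn=5))
```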

Individual posts were broken into words, and a random topic between 0 and 8 was assigned to each word. Only nouns and verbs were kept in the data. A representation of a tweet such as (1) is shown in (2): (1) hey, whats going on lets get some food in this café (2) [(food, t1), (cafe, t2)]
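A minimal sketch of this preprocessing step; NLTK is an assumption here, since the slides do not name a part-of-speech tagger:

```python
# Keep only nouns and verbs, then seed each remaining word with a random topic id 0..8.
# Requires the NLTK 'punkt' and 'averaged_perceptron_tagger' data packages.
import random
import nltk

def preprocess(tweet, num_topics=9):
    tokens = nltk.word_tokenize(tweet.lower())
    tagged = nltk.pos_tag(tokens)
    # Penn Treebank tags: NN* = nouns, VB* = verbs.
    content = [w for w, tag in tagged if tag.startswith(("NN", "VB"))]
    return [(w, f"t{random.randrange(num_topics)}") for w in content]

print(preprocess("hey, whats going on lets get some food in this cafe"))
# Output varies per run: a list of (word, 'tN') pairs for the nouns and verbs kept.
```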

Topic assignments learned from the non-padded data (distribution of hand labels within each cluster):
t0 music 67%; movies 33%
t1 other 50%; food 25%; travel 12.5%; sports 12.5%
t2 music 50%; relationships 25%; sickness 12.5%; computers 12.5%
t3 food 80%; movies 20%
t4 food 74%; music 11%; sickness 10%; other 5%
t5 music 67%; other 14%; movies 9%; food 5%; sports 5%
t6 food 50%; other 14%; sickness 14%; computers 14%; movies 8%
t7 food 70%; music 20%; relationships 10%
t8 food 45%; travel 28%; relationships 9%; other 9%; computers 9%

Results: tweets on the same topic were scattered across clusters, and clusters contained various unrelated tweets. Problem: tweets on the same topic had no words in common. Solution: a thesaurus of words related to the topics in the posts was created, and the most generic words from each thesaurus entry were selected as pads. Sample thesaurus entry: (3) food_topic [pizza, café, food, eat, ...]; food topic padding set: [food, café]. Posts shorter than 3 words were padded with 2-3 words from the relevant padding set. Padding ensured that some shared words appear across posts. Unpadded posts: (4) café eat (5) food. Padded posts: (4) café food eat food (5) café food food.
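A small sketch of this padding step; only the food entry (3) appears in the slides, so the second thesaurus entry and the exact lookup logic are assumptions:

```python
# Thesaurus of topic-related words; the padding set holds the most generic words per topic.
THESAURUS = {
    "food": ["pizza", "cafe", "food", "eat"],
    "music": ["music", "song", "band", "concert"],  # hypothetical entry
}
PADDING = {"food": ["food", "cafe"], "music": ["music", "song"]}

def pad(tokens, min_len=3, n_pad=2):
    """Pad short posts with generic words from the matching thesaurus topic."""
    if len(tokens) >= min_len:
        return tokens
    for topic, words in THESAURUS.items():
        if any(t in words for t in tokens):
            return tokens + PADDING[topic][:n_pad]
    return tokens  # no matching topic found; leave the post as is

print(pad(["cafe", "eat"]))  # ['cafe', 'eat', 'food', 'cafe']
print(pad(["food"]))         # ['food', 'food', 'cafe']
```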

Topic assignments learned from the padded data:
t0 music 77%; food 9%; computers 4.54%; sports 4.54%; travel 4.54%
t1 movies 80%; sickness 10%; music 5%; relationships 3%; food 2%
t2 food 69%; movies 16%; music 5%; sickness 5%; sports 5%
t3 food 93%; sickness 4%; other 3%; movies 1%
t4 other 65%; travel 29%; movies 6%
t5 sickness 100%
t6 music 46%; other 24%; sports 20%; computers 10%
t7 food 49%; computers 22%; other 20%; music 8%; movies 1%
t8 relationships 55%; food 23%; music 10%; sickness 6%; movies 6%

5 out of 9 clusters contained a strongly dominant semantic topic (over 60% of posts). 6 topics emerge from the padded-data clustering: food, movies, music, other, sickness, and relationships. Food is distributed across 3 clusters and is the dominant topic in all of them. The most cohesive clusters are t5, t3, t2, t1, and t0; each has a dominant semantic topic, one that occurs in over 60% of the posts within the cluster.
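The cohesion criterion above can be stated as a tiny helper; the post count used for t5 below is hypothetical, chosen only to illustrate the check:

```python
# A cluster counts as cohesive when one hand label covers more than 60% of its posts.
from collections import Counter

def dominant_topic(labels, threshold=0.60):
    """labels: hand-assigned topic of every post in one LDA cluster."""
    top, count = Counter(labels).most_common(1)[0]
    share = count / len(labels)
    return top, share, share > threshold

# t5 from the padded run: every post was labelled 'sickness' (count assumed).
print(dominant_topic(["sickness"] * 12))  # ('sickness', 1.0, True)
```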

Differences across the two data sets: the non-padded data has no pure clusters and no clusters where a single topic accounts for more than 80% of the posts. Only 3 distinct topics emerge from the non-padded clusters: music, food, and other. 6 distinct topics emerge from the padded clusters: music, movies, food, relationships, other, and sickness. Similarities between the padded and the non-padded data sets: movie posts were given the same topic ID as food posts, and food, music, and movies were distributed across several topic assignments.

Information sparsity has a strong negative effect on topic clustering. Topic clustering is improved when: a) only nouns and verbs are used in posts, and b) posts are padded with similar words. Future investigation: can padding help discover topics in large sets of tweets? Thank you!

Barzilay, R. and Lee, L. 2004. Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization. In Proceedings of HLT-NAACL, Stroudsburg, PA.
Blei, D., Ng, A. and Jordan, M. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022.
Griffiths, T. and Steyvers, M. 2003. Prediction and Semantic Association. In Neural Information Processing Systems 15.
Griffiths, T. and Steyvers, M. 2004. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 Supplement 1: 5228-5235.
Griffiths, T. and Steyvers, M. 2007. Probabilistic Topic Models. In Landauer, T., McNamara, D., Dennis, S. and Kintsch, W. (eds.) Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum.
Hofmann, T. 1999. Probabilistic Latent Semantic Analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden: Morgan Kaufmann.
Hofmann, T. 2001. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning Journal, 42(1).
Ramage, D., Dumais, S., and Liebling, D. 2010. Characterizing Microblogs with Topic Models. Association for the Advancement of Artificial Intelligence.
Ritter, A., Cherry, C., and Dolan, B. 2010. Unsupervised Modeling of Twitter Conversations. In Proceedings of HLT-NAACL, Stroudsburg, PA.
Ritter, A. et al. Status Messages: A Unique Textual Source of Realtime and Social Information. Presented at the Natural Language Seminar, Information Sciences Institute, USC.
Yao, L., Mimno, D., and McCallum, A. 2009. Efficient Methods for Topic Model Inference on Streaming Document Collections. In Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.