BCS SGAI Workshop on Social Media Analysis, 10th December 2013 Mining Newsworthy Topics from Social Media Carlos Martin, David Corney and Ayse Goker (Robert.

1 BCS SGAI Workshop on Social Media Analysis, 10th December 2013. Mining Newsworthy Topics from Social Media. Carlos Martin, David Corney and Ayse Goker (Robert Gordon University), Andrew MacFarlane (City University London)

2 Outline: Introduction & Motivation; BNgram approach; Further modifications; Experiments; Demo; Conclusions and Future work; References

4 Introduction & Motivation Newsworthy stories are increasingly being shared through social networking platforms such as Twitter and Reddit. Journalists use social media to rapidly discover stories and eye-witness accounts.

5 Introduction & Motivation Other tools to detect newsworthy stories: Twitter trends, Trendsmap, NewsWhip.

6 Introduction & Motivation Gap in the market: – Story descriptions are incomplete or unclear (based on the use of hashtags and entities). – Use of mainstream media. Proposal: an approach to detect newsworthy stories in real time from Twitter, where the story description is complete and posts from social network users are associated with each story. – Journalists and news readers don’t get overwhelmed.

8 BNgram approach Detection of the most representative topics from a timeslot, with special emphasis on the temporal dimension of the data. 1. Detection of emerging phrases (word n-grams) based on the df-idf_t score, a variant of tf-idf. N-grams are ranked per timeslot by df-idf_t, avoiding overlaps. Boost factor: named entity recognition (Stanford) with a 3-class classifier (person, location and organization).
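The slide names the df-idf_t score but does not show its formula. The sketch below assumes one plausible form: the n-gram's document frequency in the current slot, divided by the log of its average document frequency over the previous t slots, with add-one smoothing on both sides. That form, the function names, the boost value of 1.5 and the example counts are all assumptions made for illustration. Overlap removal between ranked n-grams is left out.

```python
import math

def df_idf_t(df_current, df_previous, boost=1.0):
    """Score one n-gram for the current slot (assumed formula, see lead-in).

    df_current  -- number of tweets containing the n-gram in the current slot
    df_previous -- its document frequencies in the t previous slots
    boost       -- multiplier applied when the n-gram contains a named entity
    """
    t = len(df_previous)
    avg_past = sum(df_previous) / t if t else 0.0
    # Add-one smoothing keeps the score finite for previously unseen n-grams.
    return boost * (df_current + 1) / (math.log(avg_past + 1) + 1)

def rank_ngrams(current, history, entity_ngrams=frozenset(), boost=1.5):
    """Rank the n-grams of the current slot by descending df-idf_t.

    current       -- dict: n-gram -> document frequency in the current slot
    history       -- dict: n-gram -> list of frequencies in previous slots
    entity_ngrams -- n-grams flagged by a named-entity recogniser (hypothetical input)
    """
    scored = []
    for ngram, df in current.items():
        b = boost if ngram in entity_ngrams else 1.0
        scored.append((ngram, df_idf_t(df, history.get(ngram, []), b)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Two n-grams with the same current frequency: one is new, one is always present.
current = {"obama wins": 40, "good morning": 40}
history = {"obama wins": [0, 2], "good morning": [38, 42]}
ranking = rank_ngrams(current, history, entity_ngrams={"obama wins"})
```

In this toy input the emerging phrase ranks above the steadily frequent one even though both appear in the same number of tweets in the current slot, which is the temporal effect the slide describes.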

9 BNgram approach 2. Hierarchical clustering of the top k n-grams with the highest df-idf_t scores. The topic score is computed as the maximum df-idf_t of its n-grams.

10 BNgram approach Evaluation benchmark: comparison with four other TDT approaches (document-pivot and feature-pivot) and a baseline (LDA), as reported in the TMM paper. User-centred evaluation: – Collections: FA Cup, Super Tuesday and US Elections (tracking keywords). – Ground truth: a set of representative topics (manually selected) for different timeslots, taken from mainstream media (MSM). Timeslot size: FA Cup, 1 min; Super Tuesday and US Elections, 10 min. Topics: 13 FA Cup, 22 Super Tuesday and 64 US Elections.

11 BNgram approach Collections:

12 BNgram approach Results (TMM paper). Datasets: FA Cup, Super Tuesday, US Elections. Methods compared: Latent Dirichlet Allocation (baseline), document-pivot topic detection, graph-based feature-pivot topic detection, frequent pattern mining, soft frequent pattern mining, BNgram.

13 BNgram approach Examples of topics
FA Cup. Detected topic: over line saved super cech claiming carroll header liverpool #cfcwembley #facupfinal sl. Corresponding story: Liverpool nearly score. Andy Carroll takes a shot. Petr Cech makes a fantastic save. Sample tweet: Liverpool nearly score Andy Carroll takes a shot. Petr Cech makes a fantastic save.
Super Tuesday. Detected topic: romney wins virginia republican presidential primary breaking. Corresponding story: Fox/NBC is projecting Mitt Romney has won the Virginia. Sample tweet: BREAKING: Mitt Romney wins the Virginia Republican presidential primary. -RAS
US Elections. Detected topic: four more years. Corresponding story: Obama tweeted “Four more years”. Sample tweet: Several television networks report Obama has four more years

15 Further modifications BNgram approach modifications: – Study of different types of n-grams. – Timeslots vs. number-of-tweets slots. – Clustering techniques tested for the BNgram approach: Apriori and GMM algorithms. – A new topic ranking technique has been considered.

16 N-grams Word order is often essential to indicate meaning. For example, 'dog bites man' is not news, but 'man bites dog' is news. A bag-of-words approach cannot distinguish these cases. N-grams are popular in NLP. In this work, by n-gram we refer to sequences of up to n consecutive terms. Copies of posts and RTs are very frequent in the Twitter space. Posts are focused, within 140 characters.
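The 'man bites dog' point can be checked directly. This is a minimal sketch, assuming whitespace tokenisation and lower-casing (the slides do not say how tweets are tokenised); `word_ngrams` and `bag_of_words` are illustrative names.

```python
def word_ngrams(text, n=3):
    """Return all sequences of 1..n consecutive terms in a post."""
    terms = text.lower().split()
    return [
        " ".join(terms[i:i + size])
        for size in range(1, n + 1)
        for i in range(len(terms) - size + 1)
    ]

def bag_of_words(text):
    """Order-free representation: the sorted list of terms."""
    return sorted(text.lower().split())

news = "man bites dog"
not_news = "dog bites man"

same_bag = bag_of_words(news) == bag_of_words(not_news)            # order is lost
different_ngrams = set(word_ngrams(news)) != set(word_ngrams(not_news))
```

Both posts have the same bag of words, but their 2-gram and 3-gram sets differ, so an n-gram representation keeps the distinction that the bag-of-words one loses.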

17 Timeslots vs. number-of-tweets slots What is the best timeslot size? Small slot size → missed stories. Large slot size → delay in some stories (refresh rate). Alternative: slots with a fixed number of tweets instead of a fixed time span, which needs only minimal changes to the approach.
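The two slotting schemes can be compared on a bursty stream. The sketch below assumes tweets arrive as time-ordered (timestamp in seconds, text) pairs; both function names and the sample data are illustrative.

```python
from itertools import islice

def time_slots(tweets, seconds):
    """Group (timestamp, text) pairs into fixed-duration slots.

    Slots in which no tweet arrived are simply absent from the result.
    """
    slots = {}
    for ts, text in tweets:
        slots.setdefault(int(ts // seconds), []).append(text)
    return [slots[key] for key in sorted(slots)]

def count_slots(tweets, size):
    """Group time-ordered tweets into consecutive slots of `size` tweets.

    The final slot may be shorter if the stream does not divide evenly.
    """
    stream = iter(tweets)
    slots = []
    while True:
        chunk = [text for _, text in islice(stream, size)]
        if not chunk:
            return slots
        slots.append(chunk)

# A burst of six tweets followed by two isolated ones.
tweets = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"), (5, "f"),
          (100, "g"), (200, "h")]

by_time = [len(s) for s in time_slots(tweets, seconds=60)]
by_count = [len(s) for s in count_slots(tweets, size=4)]
```

With fixed time spans the slot sizes follow the tweet rate (a large slot during the burst, nearly empty ones afterwards), while fixed-count slots always hold the same amount of evidence and simply close faster when the stream is busy.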

18 Clustering approaches Weakness detected in our clustering technique: – Example: US elections n-gram ranking (sorted by df-idf_t):
Position   N-gram              Docs
#1         Barack Obama wins   1, 2, 4, 6, 7, 8, 9, 10
#2         wins Wisconsin      1, 2, 4, 6
#3         wins California     7, 8, 9, 10
Basic hierarchical clustering: incomplete stories. – From our example, the candidate clusters could be: Cluster 1: Barack Obama wins + wins Wisconsin (complete). Cluster 2: wins California (incomplete: who?). New grouping techniques are needed in which one n-gram can be assigned to different clusters.
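The weakness can be reproduced with a small agglomerative clustering over the three n-grams of the example. This is a sketch under stated assumptions: the distance between two n-grams is taken as one minus the fraction of the smaller tweet set that they share, clusters are merged with average linkage, and merging stops at a distance of 0.5. None of these choices is given on the slide.

```python
from itertools import combinations

def distance(a, b):
    """1 minus the fraction of the smaller tweet set shared by both n-grams."""
    return 1.0 - len(a & b) / min(len(a), len(b))

def average_linkage(c1, c2, docs):
    """Mean pairwise distance between the n-grams of two clusters."""
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(distance(docs[x], docs[y]) for x, y in pairs) / len(pairs)

def cluster(docs, threshold=0.5):
    """Hard agglomerative clustering: each n-gram ends up in exactly one cluster."""
    clusters = [[name] for name in docs]
    while len(clusters) > 1:
        # Find the closest pair of clusters (ties go to the first pair found).
        (i, j), d = min(
            (((i, j), average_linkage(clusters[i], clusters[j], docs))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d >= threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

docs = {
    "Barack Obama wins": {1, 2, 4, 6, 7, 8, 9, 10},
    "wins Wisconsin": {1, 2, 4, 6},
    "wins California": {7, 8, 9, 10},
}
result = cluster(docs)
```

Under these assumptions "Barack Obama wins" is absorbed into the Wisconsin cluster and "wins California" is left on its own without a subject. Which of the two state n-grams ends up incomplete depends only on tie-breaking, since both are equally close to "Barack Obama wins".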

19 Clustering approaches – Gaussian Mixture Models (GMM) Unsupervised method. Assigns probabilities (or strengths) of membership of each n-gram to each cluster (partial membership). Iterative approach: tries to find the parameters of the probability distribution that maximise the likelihood of the observed attributes. Input: number of clusters, selected with the Bayesian Information Criterion (BIC).

20 Clustering approaches – Gaussian Mixture Models (GMM) Expectation-Maximisation, two steps: – E-step: estimates the probability that each point belongs to each cluster. – M-step: re-estimates the parameter vector of the probability distribution of each class. The algorithm finishes when the distribution parameters converge or the maximum number of iterations is reached.
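The two steps can be sketched for a one-dimensional mixture. This is an illustration of Expectation-Maximisation in general, not the authors' implementation: the slides do not say how n-grams are turned into feature vectors, so the points below are plain numbers, and the initial means, tolerance and variance floor are assumptions. Choosing the number of clusters with BIC is not shown.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm(points, initial_means, max_iter=100, tol=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns parameters and memberships."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(max_iter):
        # E-step: probability that each point belongs to each cluster.
        resp = []
        for x in points:
            p = [w * gaussian(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([value / total for value in p])
        # M-step: re-estimate weight, mean and variance of each cluster.
        new_means = []
        for c in range(k):
            n_c = sum(r[c] for r in resp)  # assumes no cluster becomes empty
            mean = sum(r[c] * x for r, x in zip(resp, points)) / n_c
            var = sum(r[c] * (x - mean) ** 2 for r, x in zip(resp, points)) / n_c
            weights[c] = n_c / len(points)
            variances[c] = max(var, 1e-6)  # floor avoids a degenerate Gaussian
            new_means.append(mean)
        # Stop when the means have converged.
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:
            break
    return means, variances, weights, resp

points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
means, variances, weights, resp = fit_gmm(points, initial_means=[0.0, 6.0])
```

Each row of `resp` holds the soft memberships of one point and sums to one, which is the partial-membership property that allows an item to count towards more than one cluster.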

21 Clustering approaches – Apriori algorithm Explores associations between n-grams based on the number of shared tweets. Number of n-grams per association: each association contains from 1 n-gram up to the number of n-grams considered from the ranking. An association is kept if the number of tweets shared by its n-grams is greater than a threshold (support value). In a later step, the maximal associations are obtained to avoid overlaps.

22 Clustering approaches – Apriori algorithm From the previous example (with a threshold of 3): – Candidate associations: #1, #2, #3, #1#2, #1#3 – Maximal associations: #1#2, #1#3
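The example can be reproduced with a level-wise, Apriori-style search. This is a minimal sketch: it grows associations one n-gram at a time and checks support directly, without the subset-pruning optimisation of the full algorithm. The function names are illustrative, and the tweet sets are those of the US elections ranking example (#1 Barack Obama wins, #2 wins Wisconsin, #3 wins California).

```python
from itertools import combinations

def frequent_associations(docs, threshold):
    """Return every set of n-grams that shares more than `threshold` tweets."""
    level = [frozenset([g]) for g in docs if len(docs[g]) > threshold]
    frequent = list(level)
    while level:
        # Join associations of the current size into candidates one item larger.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Keep a candidate only if its n-grams share enough tweets (support).
        level = [c for c in candidates
                 if len(set.intersection(*(docs[g] for g in c))) > threshold]
        frequent.extend(level)
    return frequent

def maximal(associations):
    """Drop every association that is a proper subset of another one."""
    return [a for a in associations if not any(a < b for b in associations)]

docs = {
    "#1": {1, 2, 4, 6, 7, 8, 9, 10},
    "#2": {1, 2, 4, 6},
    "#3": {7, 8, 9, 10},
}
candidates = frequent_associations(docs, threshold=3)
topics = maximal(candidates)
```

Here `candidates` contains #1, #2, #3, #1#2 and #1#3, and `topics` keeps only #1#2 and #1#3, so n-gram #1 belongs to both resulting topics.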

23 Topic ranking The maximum df-idf_t n-gram approach is not the best alternative for these new clustering techniques: it is unsuitable for slots with active and diverse topics. (Diagram: the n-gram ranking, n-gram1, n-gram2, ..., mapped to a topic ranking of topic1, topic4, topic3, topic2, topic5.)

24 Topic ranking Weighted topic-length approach, where s_t is the score of topic t, L_t is the length of the topic, L_max is the maximum number of terms in any topic from the current slot, N_t is the number of tweets in topic t and N_s is the number of tweets in the slot. Finally, α is a weighting term.
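The formula itself did not survive in the transcript. One form consistent with the definitions given is s_t = α · (L_t / L_max) + (1 − α) · (N_t / N_s), a weighted sum of relative topic length and relative tweet volume. The sketch below assumes that form; the value α = 0.7 and the example topics are hypothetical.

```python
def topic_score(length, n_tweets, max_length, slot_tweets, alpha):
    """Assumed form: alpha * L_t / L_max + (1 - alpha) * N_t / N_s."""
    return alpha * length / max_length + (1 - alpha) * n_tweets / slot_tweets

def rank_topics(topics, slot_tweets, alpha=0.7):
    """Rank topics of one slot by descending score.

    topics -- dict: topic name -> (number of terms, number of tweets)
    """
    max_length = max(length for length, _ in topics.values())
    scored = [
        (name, topic_score(length, n_tweets, max_length, slot_tweets, alpha))
        for name, (length, n_tweets) in topics.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# A long, specific topic with few tweets and a short topic with many tweets.
topics = {"long topic": (6, 50), "short topic": (2, 200)}
length_heavy = rank_topics(topics, slot_tweets=1000, alpha=0.7)
volume_only = rank_topics(topics, slot_tweets=1000, alpha=0.0)
```

With α close to 1 the ranking favours longer, more descriptive topics; with α = 0 it reduces to ranking by tweet volume alone.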

25 Evaluation We have estimated the starting and ending times of each event in the ground truth. The top m topics of each slot between the starting and ending times of the event (slots i-3, i-2, i-1 and i in the diagram) are merged to evaluate the event.

27 Experiments – n-grams Topic recall for different types of n-grams on the three datasets, using hierarchical clustering and maximum n-gram topic ranking, with the slot size fixed to 1000 tweets (similar patterns were observed with other configurations).

28 Experiments – n-grams Normalised area under the curve for the three datasets and its weighted average.

29 Experiments – slot size Topic recall for different slot sizes using hierarchical clustering and weighted topic-length topic ranking (3-grams). Possible correlation between slot size and tweet rate (Super Tuesday: 832 tpm, FA Cup: 1293 tpm, US elections: 2209 tpm). Consider the refresh rate of the UI.

30 Experiments – clustering and topic ranking techniques Topic recall for different clustering techniques on the three datasets, using both topic ranking techniques (3-grams and slot size = 1500 tweets).

31 Experiments – clustering and topic ranking techniques Normalised area under the curve.

33 Demo Social Sensor project

35 Conclusions and Future work New TDT approach based on the temporal dimension of data and n-grams in the Twitter space. Improve tracking issues (ongoing). Trust and verification based on following newshounds (ongoing). Improve topic titles (ongoing). Better association of tweets to topics (ongoing). Improve evaluation methods/metrics. Smoothing techniques for df-idf_t computation. Entity recognition: other approaches (Illinois NLP tools, …). Participation in TDT challenges (SNOW14).

37 References
Aiello, L., Petkos, G., Martin, C., Corney, D., Papadopoulos, S., Skraba, R., Goker, A., Kompatsiaris, I., Jaimes, A.: Sensing trending topics in Twitter. IEEE Transactions on Multimedia 15(6) (2013) 1268–1282
Martin, C., Corney, D., Goker, A.: Finding newsworthy topics on Twitter. IEEE Computer Society Special Technical Community on Social Networking E-Letter 1(3) (September 2013)
Schifferes, S., Newman, N., Thurman, N., Corney, D., Göker, A., Martin, C.: Identifying and verifying news through social media: Developing a user-centred tool for professional journalists. In: The Future of Journalism Conference 2013, Cardiff, UK (2013)
Spot the ball: Detecting sports events on Twitter. In: Proceedings of ECIR 2014, Amsterdam, Netherlands (to appear)

38 Thank you!

