1 SNOW 2014 Data Challenge
WWW 2014, Seoul, April 8th
Symeon Papadopoulos (CERTH), David Corney (RGU), Luca Aiello (Yahoo! Labs)

2 Overview of the Challenge
Goal: detection of newsworthy topics in a large and noisy set of tweets
Topic: a news story represented by a headline + tags + representative tweets + representative images (optional)
Newsworthy: a topic that ends up being covered by at least some major online news sources
Topics are detected per timeslot (small, equally-sized time intervals)
A maximum number of topics is requested per timeslot (an illustrative data structure is sketched below)
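To make the expected output more concrete, here is a minimal sketch of a possible data structure for a detected topic and a timeslot; the class and field names are assumptions chosen for illustration, not the official challenge submission format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Topic:
    """One detected topic; field names are illustrative, not the official schema."""
    headline: str                 # short, news-style headline
    tags: List[str]               # keywords describing the story
    tweets: List[str]             # IDs or texts of representative tweets
    images: List[str] = field(default_factory=list)   # optional representative image URLs

@dataclass
class TimeslotResult:
    """All topics detected in one equally-sized time interval."""
    timeslot_start: str                                # e.g. "2014-02-25T18:00"
    topics: List[Topic] = field(default_factory=list)  # capped at the per-timeslot maximum
```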

3 Challenge Activity Log
Challenge definition (Dec 2013)
Challenge toolkit and registration (Jan 20, 2014)
Development dataset collection (Feb 3, 2014)
Rehearsal dataset collection (Feb 17, 2014)
Test dataset collection (Feb 25, 2014)
Results submission (Mar 4, 2014)
Paper submission (Mar 9, 2014)
Results evaluation (Mar 5-18, 2014)
Workshop (Apr 7, 2014)

4 Some statistics
Registered participants: 25
– India: 4, Belgium: 3, Germany: 3, UK: 3, Greece: 3, Ireland: 2, USA: 2, France: 2, Italy: 1, Spain: 1, Russia: 1
Participants that signed the Challenge agreement: 19
Participants that submitted results: 11
Participants that submitted papers: 9

5 Evaluation Protocol
Defined several evaluation criteria:
– Newsworthiness → precision/recall, F-score
– Readability → scale [1-5]
– Coherence → scale [1-5]
– Diversity → scale [1-5]
List of reference topics
Set up precise evaluation guidelines
Blind evaluation (i.e. evaluator not aware of which method a topic comes from) based on a Web UI
Participants submitted topics for 96 timeslots, but manual evaluation happened for 5 sample timeslots
Result validation and analysis

6 Results – Reference topic recall

Team  Recall  Rank
A     0.44    5
B     0.58    4
C     0.32    7
D     0.63    2
E     0.66    1
F     0.39    6
G     0.24    8
H     0.60    3
I     0.17    10
J     0.24    8
K     0.14    11

Recall was computed with respect to 59 reference topics. These were partitioned into three groups (20, 20, 19) and each of the three evaluators manually matched the participants' topics against the reference topics assigned to them.

Evaluator pair       Correlation
Eval. 1 – Eval. 2    0.894913
Eval. 1 – Eval. 3    0.930247
Eval. 2 – Eval. 3    0.811976
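A minimal sketch of how the per-team reference recall and the pairwise evaluator correlations could be computed, assuming the correlations are Pearson correlations over per-team scores (the slide does not state the exact measure):

```python
from statistics import mean

def reference_recall(num_matched: int, num_reference: int = 59) -> float:
    """Fraction of the reference topics matched by a team's submitted topics."""
    return num_matched / num_reference

def pearson(x: list, y: list) -> float:
    """Pearson correlation between two evaluators' per-team scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```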

7 Results – Pooled topic recall (1/2)
Each evaluator independently evaluated the topics of each participant as newsworthy or not
Selected all topics that were marked as newsworthy by at least two evaluators
Manually extracted the unique topics (70 in total, partially overlapping with the reference topic list)
Manually matched the correct topics of each participant against the list of newsworthy topics
Computed precision, recall and F-score (see the sketch below)
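Judging from the table on the next slide, precision appears to be matched topics over total submitted topics, and recall unique newsworthy topics over the 70 pooled topics; a minimal sketch under that assumption:

```python
def pooled_scores(matched: int, unique: int, total_submitted: int, pool_size: int = 70):
    """Precision, recall and F-score for the pooled-topic evaluation.

    matched:         submitted topics that matched some pooled newsworthy topic
    unique:          distinct pooled topics covered by the submission
    total_submitted: all topics the team submitted in the evaluated timeslots
    """
    precision = matched / total_submitted
    recall = unique / pool_size
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score

# e.g. team E's row (28 matched, 25 unique, 50 submitted) gives roughly 0.56 / 0.357 / 0.436
print(pooled_scores(28, 25, 50))
```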

8 Results – Pooled topic recall (2/2)

Team  Matched  Unique  Total  Prec   Rec    F-score  Rank
A     13       13      27     0.481  0.186  0.268    6
B     12       12      23     0.522  0.171  0.258    7
C     22       15      50     0.44   0.214  0.288    4
D     18       14      39     0.462  0.2    0.279    5
E     28       25      50     0.56   0.357  0.436    1
F     4        2       15     0.267  0.029  0.052    10
G     4        4       10     0.4    0.057  0.099    9
H     19       17      49     0.388  0.243  0.299    3
I     36       15      45     0.8    0.214  0.338    2
J     1        1       8      0.125  0.014  0.027    11
K     8        7       10     0.8    0.1    0.178    8

9 Results – Readability

Team  Readability  Rank
A     4.29         9
B     4.92         2
C     4.49         7
D     4.59         6
E     4.74         4
F     4.18         10
G     4.93         1
H     4.71         5
I     4.8          3
J     3.38         11
K     4.32         8

Evaluator pair       Correlation
Eval. 1 – Eval. 2    0.902124
Eval. 1 – Eval. 3    0.357733
Eval. 2 – Eval. 3    0.278632

10 Results – Coherence

Team  Coherence  Rank
A     4.4        6
B     4.08       9
C     4.68       5
D     4.91       2
E     4.97       1
F     4.78       4
G     4.83       3
H     4.22       8
I     3.95       10
J     3.75       11
K     4.36       7

Evaluator pair       Correlation
Eval. 1 – Eval. 2    0.549512
Eval. 1 – Eval. 3    0.730684
Eval. 2 – Eval. 3    0.684426

11 Results – Diversity

Team  Diversity  Rank
A     2.12       7
B     2.36       4
C     2.31       6
D     2.11       8
E     –          8
F     2          10
G     1.92       11
H     3.27       2
I     2.36       4
J     2.5        3
K     3.47       1

Evaluator pair       Correlation
Eval. 1 – Eval. 2    0.873365
Eval. 1 – Eval. 3    0.890415
Eval. 2 – Eval. 3    0.905915

12 Results – Image Relevance

Team  Precision (%)  Rank
A     54.19          3
B     31.75          5
C     58.09          2
D     52.04          4
E     27.39          6
F     0              8
G     0              8
H     58.82          1
I     0              8
J     0              8
K     18.45          7

Evaluator pair       Correlation
Eval. 1 – Eval. 2    0.944946
Eval. 1 – Eval. 3    0.919469
Eval. 2 – Eval. 3    0.79596

13 Results – Aggregate (1/2)
For each criterion C_i, we computed the score of each team relative to the best team for that criterion:

  C_i*(team) = C_i(team) / max_j C_i(team_j)

We then aggregated over the different normalized scores:

  C_tot = 0.25 * C_ref * C_pool + 0.25 * C_read + 0.25 * C_coh + 0.25 * C_div

where C_ref is computed from the recall of reference topics, C_pool from the F-score of the pooled topics, and C_read, C_coh and C_div from readability, coherence and diversity respectively.
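A minimal sketch of the normalisation and aggregation above; it assumes each criterion is given as a dict mapping team to its raw score:

```python
def normalise(scores: dict) -> dict:
    """Divide each team's score by the best team's score for this criterion."""
    best = max(scores.values())
    return {team: s / best for team, s in scores.items()}

def aggregate(c_ref: dict, c_pool: dict, c_read: dict, c_coh: dict, c_div: dict) -> dict:
    """C_tot per team, using the 0.25 weights from the slide."""
    c_ref, c_pool, c_read, c_coh, c_div = map(normalise, (c_ref, c_pool, c_read, c_coh, c_div))
    return {team: 0.25 * c_ref[team] * c_pool[team]
                  + 0.25 * c_read[team]
                  + 0.25 * c_coh[team]
                  + 0.25 * c_div[team]
            for team in c_ref}
```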

14 Results – Aggregate (2/2)

Team  Aggregate score  Rank
A     0.694            7
B     0.755            4
C     0.7099           5
D     0.785            3
E     0.892            1
F     0.614            10
G     0.652            9
H     0.842            2
I     0.662            8
J     0.546            11
K     0.70987          6

We tried several alternative aggregation scores; the top three teams were the same!

15 Program
15:20-15:30: Carlos Martin-Dancausa and Ayse Goker: Real-time topic detection with bursty n-grams.
16:00-16:20: Gopi Chand Nutakki, Olfa Nasraoui, Behnoush Abdollahi, Mahsa Badami, Wenlong Sun: Distributed LDA-based topic modelling and topic agglomeration in a latent space.
16:20-16:40: Steven van Canneyt, Matthias Feys, Steven Schockaert, Thomas Demeester, Chris Develder, Bart Dhoedt: Detecting newsworthy topics in Twitter.
16:40-17:00: Georgiana Ifrim, Bichen Shi, Igor Brigadir: Event detection in Twitter using aggressive filtering and hierarchical tweet clustering.
17:00-17:20: Gerard Burnside, Dimitrios Milioris, Philippe Jacquet: One day in Twitter: Topic detection via joint complexity.
17:20-17:30: Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris: Two-level message clustering for topic detection in Twitter.
17:30-17:40: Winners' announcement!

16 Limitations – Lessons Learned
Did not take into account time
– However, methods that produce a newsworthy topic earlier should be rewarded
Did not take into account image relevance
– since we considered it an optional field
Coherence and diversity had extreme values in numerous cases
– e.g. when a single relevant tweet was provided as representative
Evaluation turned out to be a very complex task!
Assessing only five timeslots (out of the 96) is definitely a compromise: (a) consider use of more evaluators/AMT, (b) consider simpler evaluation tasks

17 Plan
Release evaluation resources
– list of reference topics
– list of pooled newsworthy topics
– evaluation scores
Papers
– SNOW Data Challenge paper
– Resubmission of participants' papers with CEUR style
– Submission to CEUR-ws.org
Open-source implementations?
Further plans?

18 Thank you!

