CLEar (Clairaudient Ear) A Realtime Online Observatory for Bursty and Viral Events A demonstration of CLEar System
The Architecture of CLEar
To sum up, a meaningful event observatory should provide the following functions: detection of a bursty topic as soon as it emerges; early prediction of whether the bursty topic is likely to go viral; summarization of related bursty topics into semantically coherent events that can be monitored; and contextualization of each event with its temporal evolution and corresponding coverage across other news media.
Recommended Materials
A tutorial at WWW 2014: Towards a Social Media Analytics Platform: Event Detection and User Profiling for Twitter
A tutorial at KDD 2009: Tutorial on Event Detection, Hila Becker, http://www.cs.columbia.edu/~hila/
Why Bursty and Viral?
Compared with traditional news media, Twitter has been recognized as a much more responsive and reliable source for picking up bursty events.
Bursty events trigger a surge of public interest within a short period of time.
Capable of handling both planned and unplanned events.
Topic Detection in Social Media
Document-pivot: each new tweet is assigned to a similar existing event, or treated as a new event if no similar event exists (such a tweet is called the first story of the event). Sasa Petrovic et al., Streaming first story detection with application to Twitter, HLT '10.
Feature-pivot: when an event is happening, some bursty features of the hidden event show a sharp increase over their expected level. Chen Lin et al., Generating event storylines from microblogs, CIKM '12. Chenliang Li et al., Twevent: segment-based event detection from tweets, CIKM '12.
[Figure: feature-pivot pipeline with three stages: Bursty Term Detection, Bursty Term Grouping, Candidate Event Filtering. Example terms before grouping: #MH370 lives Southern #eat Tmr korean Sleep indian save. After grouping: MH370 Southern indian Korean save lives #eattmr sleep.]
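The first stage of the feature-pivot pipeline, bursty term detection, can be illustrated with a minimal sketch. It assumes tweets are already tokenized and grouped into fixed time windows, and it flags a term as bursty when its count in the newest window exceeds its historical mean by a z-score threshold. The function name `bursty_terms`, the z-score test, and the default thresholds are assumptions made for illustration; they are not taken from the cited papers or from CLEar.

```python
from collections import Counter
from math import sqrt


def bursty_terms(history, current, z_threshold=3.0, min_count=3):
    """Return {term: z-score} for terms that spike in the current window.

    history: list of past windows, each a list of tokenized tweets
    current: the newest window, a list of tokenized tweets
    """
    if not history:
        raise ValueError("at least one past window is required")

    # Count each term once per tweet (document frequency per window).
    past = [Counter(t for tweet in window for t in set(tweet))
            for window in history]
    now = Counter(t for tweet in current for t in set(tweet))

    n = len(past)
    result = {}
    for term, count in now.items():
        if count < min_count:
            continue  # too rare to be a reliable signal
        samples = [c[term] for c in past]  # Counter returns 0 if unseen
        mean = sum(samples) / n
        var = sum((s - mean) ** 2 for s in samples) / n
        std = max(sqrt(var), 1.0)  # floor keeps the ratio finite
        z = (count - mean) / std
        if z >= z_threshold:
            result[term] = z
    return result
```

The detected terms would then be passed on to the grouping and filtering stages; the floor on the standard deviation is a simple way to handle terms that never appeared before.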
The Shortcomings of Existing Work
Existing work mostly focuses on event detection and extraction without any post-processing. The lack of well-established analysis of an event limits its utility.
Many challenging research problems:
1. Popularity prediction
2. Topic clustering
3. Event summarization
4. Event contextualization
…
Popularity Prediction
User behaviors such as replying and retweeting provide a new mechanism for information diffusion. Topic popularity can be measured by the number of users involved. Predicting topic popularity not only reveals event trends, but also helps remove noisy and spam bursty topics at an early stage. The challenges come from the uncertainty of the information diffusion path and from the insufficient information available at the early stage of a burst, which offers little clue as to whether a detected bursty topic will sustain its virality or simply die down quickly.
Topic Clustering
Because many detected topics are duplicates or semantically close, it is desirable to remove duplicate topics and group related topics into a coherent event. This is a single-pass incremental clustering problem. Clustering based simply on the co-occurrence of bursty keywords is unreliable: co-occurring keywords are likely to be absent, because topics are much shorter than formal documents and largely depend on the detection algorithm. The essential problem of clustering is to define a metric that measures the similarity between a topic and an existing event (cluster).
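As a rough illustration of single-pass incremental clustering, the sketch below assigns each incoming topic to the most similar existing event, or opens a new event when no similarity reaches a threshold. The Jaccard keyword overlap used here is only a placeholder metric, exactly the kind of keyword co-occurrence measure the slide warns is insufficient on its own; the function names and the threshold value are assumptions.

```python
def jaccard(a, b):
    """Keyword-set overlap in [0, 1]; a placeholder similarity metric."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_pass_cluster(topics, similarity=jaccard, threshold=0.3):
    """Cluster topics in arrival order, visiting each topic exactly once.

    topics: iterable of keyword lists
    returns: list of events, each event being a list of topics
    """
    events = []
    for topic in topics:
        best_idx, best_sim = None, threshold
        for i, event in enumerate(events):
            # Topic-to-event similarity: best match against any member.
            sim = max(similarity(topic, member) for member in event)
            if sim >= best_sim:
                best_idx, best_sim = i, sim
        if best_idx is None:
            events.append([topic])          # no close event: start a new one
        else:
            events[best_idx].append(topic)  # join the closest event
    return events
```

Because `similarity` is a parameter, a richer metric can be swapped in without changing the clustering loop.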
Topic Clustering (cont.)
Measure the similarity between a topic and an event from the following perspectives: Content Similarity, User Similarity, Entity Similarity, Volume Similarity, Time Similarity.
How to combine those individual similarities? An intuitive approach is to combine them using different weights. However, the number of possible weight combinations is huge, and we have no prior knowledge about the weights. Instead, the weighting scheme is learned through a classification model to form a unified similarity metric.
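One way to read "learning the weighting scheme through a classification model" is sketched below: each topic-event pair becomes a vector of the five similarities, labeled 1 if the topic belongs to the event, and a logistic regression trained on such pairs yields both the weights and a unified score. The choice of logistic regression, the plain stochastic gradient descent, and all names and hyperparameters are assumptions; the slides do not say which classifier CLEar uses.

```python
from math import exp

# Order of the similarity features in each input vector.
FEATURES = ("content", "user", "entity", "volume", "time")


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def train_weights(pairs, labels, epochs=500, lr=0.5):
    """Fit logistic-regression weights with stochastic gradient descent.

    pairs:  list of similarity vectors, one value per entry in FEATURES
    labels: 1 if the topic belongs to the event, else 0
    """
    w = [0.0] * len(FEATURES)
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(pairs, labels):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def unified_similarity(x, w, b):
    """Probability-like score that a topic belongs to an event."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
```

The learned score can serve directly as the topic-to-event metric in an incremental clustering loop, with 0.5 as a natural starting threshold.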
Event Summarization
Traditional summarization methods mainly focus on content summarization, extracting representative tweets from the set of tweets relevant to an event. In addition, we propose to summarize the event from the structure and user perspectives. A fundamental problem is Sub-event Detection.
Sub-event Detection
An event usually contains several finer-grained stages, and detection algorithms generally cannot detect all stages of an event. Detecting all possible sub-events provides a basis for studying deeper properties of the event. Both the volume [2,3] and the content [1] of the event provide a signal of sub-event occurrence. Compared with the volume curve, we consider the content more trustworthy, because the volume curve depends largely on the retrieval results and on users' publishing patterns.
To solve this problem, we should overcome the following two difficulties:
Retrieval: how to retrieve high-quality tweets about the event?
Sub-event: how to detect all sub-events in an online manner?
[1] Akshaya Iyengar et al., Content-based prediction of temporal boundaries for events in Twitter, SocialCom 2011.
[2] Jeffrey Nichols et al., Summarizing sporting events using Twitter, IUI 2012.
[3] Arkaitz Zubiaga et al., Towards real-time summarization of scheduled events from Twitter streams.
1. How to retrieve high-quality tweets about this event?
Common practice: use the event keywords as a query to search the tweet collection. The following three factors remain a large obstacle to employing standard retrieval methods:
- A. Seemingly relevant tweets with good textual quality might not be truly relevant to the event;
- B. Tweets highly relevant to the event might not contain any of the query keywords;
- C. The query keywords might not represent the event comprehensively and might even be a noisy indicator.
To solve A, besides the relevance score returned by Elasticsearch, we can integrate other features, such as tweet-specific features and publisher features, to reorder the search results. To solve B and C, we can use event keyword expansion, taking the burstiness of a term (Metzler et al.) into consideration in addition to the traditional TF-IDF value when selecting expansion terms.
Metzler D, Cai C, Hovy E. Structured event retrieval over microblog archives. ACL 2012: 646-655.
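The keyword expansion idea for B and C can be sketched as follows. Candidate terms are drawn from tweets already retrieved for the event and ranked by a score that multiplies a TF-IDF value by a burstiness ratio (how much more often the term appears in event tweets than in a background sample). The multiplicative combination, the smoothing, and the name `expand_query` are assumptions for illustration; the slides only state that burstiness is considered alongside TF-IDF.

```python
from collections import Counter
from math import log


def expand_query(query, event_tweets, background_tweets, k=3):
    """Append the k best expansion terms to the original query keywords.

    query:             list of event keywords
    event_tweets:      tokenized tweets retrieved for the event
    background_tweets: tokenized tweets from a general sample
    """
    n_bg = len(background_tweets)
    n_ev = len(event_tweets)
    bg_df = Counter(t for tweet in background_tweets for t in set(tweet))
    ev_df = Counter(t for tweet in event_tweets for t in set(tweet))
    ev_tf = Counter(t for tweet in event_tweets for t in tweet)

    scores = {}
    for term, tf in ev_tf.items():
        if term in query:
            continue  # already part of the query
        # Smoothed inverse document frequency over the background sample.
        idf = log((n_bg + 1) / (bg_df[term] + 1)) + 1.0
        # Burstiness: event rate divided by smoothed background rate.
        burst = (ev_df[term] / n_ev) / ((bg_df[term] + 1) / (n_bg + 1))
        scores[term] = tf * idf * burst

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return list(query) + ranked[:k]
```

In this sketch a common word such as "the" scores low on both factors, while a term concentrated in the event tweets rises to the top even though it was not in the original query.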
2. How to detect all sub-events in an online manner?
Topic model: high complexity, and its output is usually a general topic.
Event boundary prediction: can only divide the event into before, during, and after.
We propose to first divide the event duration into equal-sized, non-overlapping timespans, then merge adjacent timespans into sub-events in chronological order. Finally, we should verify each sub-event's popularity and reliability to filter out spurious sub-events. Reliability can be measured by the total number of followers of all publishers, while popularity can be reflected by the number of retweets.
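The proposed procedure can be sketched in three steps: bucket tweets into equal timespans, merge neighboring timespans chronologically, then filter by popularity and reliability. The slides do not state the merge criterion, so this sketch assumes adjacent timespans are merged when their term sets overlap strongly (Jaccard similarity); the `Tweet` fields, function names, and thresholds are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    time: float        # seconds since the event started
    user: str          # publisher identifier
    terms: frozenset   # tokenized content
    retweets: int
    followers: int     # follower count of the publisher


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_sub_events(tweets, span, merge_threshold=0.3,
                      min_retweets=10, min_followers=1000):
    # Step 1: divide the duration into equal, non-overlapping timespans.
    buckets = {}
    for tw in sorted(tweets, key=lambda t: t.time):
        buckets.setdefault(int(tw.time // span), []).append(tw)

    # Step 2: merge adjacent timespans in chronological order
    # (assumed criterion: similar content in neighboring spans).
    sub_events = []
    for idx in sorted(buckets):
        group = buckets[idx]
        terms = set().union(*(tw.terms for tw in group))
        last = sub_events[-1] if sub_events else None
        if (last and last["end"] == idx - 1
                and jaccard(last["terms"], terms) >= merge_threshold):
            last["tweets"].extend(group)
            last["terms"] |= terms
            last["end"] = idx
        else:
            sub_events.append({"tweets": list(group), "terms": terms,
                               "end": idx})

    # Step 3: keep sub-events that are popular (retweets) and
    # reliable (total followers of distinct publishers).
    kept = []
    for se in sub_events:
        popularity = sum(tw.retweets for tw in se["tweets"])
        publishers = {tw.user: tw.followers for tw in se["tweets"]}
        if (popularity >= min_retweets
                and sum(publishers.values()) >= min_followers):
            kept.append(se)
    return kept
```

Because each timespan is processed once in order, the same loop could run incrementally as new timespans close, which fits the online setting named in the slide title.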
Event Contextualization
Find a representative picture of this event.
Find some related news about this event.