SNA: Research. Dr. Nawaporn Wisitpongphan
TwitterMonitor: Trend Detection over the Twitter Stream. Michael Mathioudakis, Nick Koudas. TwitterMonitor: trend detection over the Twitter stream. In: SIGMOD Conference, pp. 1155-1158, 2010
INTRODUCTION TwitterMonitor is a system that performs trend detection over the Twitter stream. It identifies emerging topics on Twitter in real time and provides analytics that synthesize an accurate description of each topic.
TREND DETECTION AND ANALYSIS Trends are detected in two steps and analyzed in a third. Step 1: Identify ‘bursty’ keywords, i.e. keywords that suddenly appear in tweets at an unusually high rate. “keyword ‘NBA’ may usually appear in 5 tweets per minute, yet suddenly exhibit a rate of 100 tweets/min. Such ‘bursts’ in keyword frequency are typically associated with sudden popular interest in a particular topic” Step 2: Group bursty keywords into trends, where a trend is a set of bursty keywords that frequently occur together in tweets. Step 3: Analyze each trend by extracting additional information to discover interesting aspects of it.
QueueBurst Algorithm: Detecting Bursty Keywords
1) One-pass. Stream data need only be read once to declare when a keyword is bursty.
2) Real-time. A bursty keyword is identified as it arrives.
3) Adjustable against ‘spurious’ bursts. In some cases, a keyword may appear in many tweets over a short period of time simply by coincidence. QueueBurst avoids reporting such instances as real bursts.
4) Adjustable against spam. Spam user groups repetitively generate large numbers of similar tweets; QueueBurst ignores them.
5) Theoretically sound. QueueBurst is based on queuing theory results.
GroupBurst: From Bursty Keywords to Trends
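The slides give no details of GroupBurst beyond the definition of a trend as a set of bursty keywords that frequently occur together in tweets. The following is only a minimal sketch of that idea, not the GroupBurst algorithm itself: it counts how often pairs of bursty keywords co-occur and merges pairs that co-occur often enough. The function name `group_bursty_keywords` and the parameter `min_cooccur` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def group_bursty_keywords(tweets, bursty, min_cooccur=2):
    """Group bursty keywords that co-occur in at least `min_cooccur` tweets.

    Illustrative only: `min_cooccur` and the merge rule (connected
    components) are assumptions, not taken from the TwitterMonitor paper.
    """
    # Count co-occurrences of bursty keyword pairs within each tweet.
    pair_counts = Counter()
    for words in tweets:
        present = sorted(set(words) & bursty)
        pair_counts.update(combinations(present, 2))

    # Union-find over bursty keywords.
    parent = {k: k for k in bursty}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    # Merge keywords whose co-occurrence count reaches the threshold.
    for (a, b), n in pair_counts.items():
        if n >= min_cooccur:
            parent[find(a)] = find(b)

    groups = {}
    for k in bursty:
        groups.setdefault(find(k), set()).add(k)
    return sorted(sorted(g) for g in groups.values())


tweets = [
    ["nba", "finals", "lakers"],
    ["nba", "finals"],
    ["earthquake", "chile"],
    ["earthquake", "chile", "nba"],
]
trends = group_bursty_keywords(tweets, {"nba", "finals", "earthquake", "chile"})
print(trends)
```

In this toy input, "nba" appears once with "earthquake" and "chile", which is below the threshold, so the two topics stay separate.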
Trend Analysis Composes a description of each trend. Identifies more keywords associated with it: non-bursty keywords that occur in the same tweets as the bursty ones. Uses context extraction algorithms (PCA, SVD, etc.) to search the recent history and report the most correlated words. Uses Grapevine’s entity extractor to identify entities frequently mentioned in trends. Frequently cited sources are added to the trend description. Identifies frequent geographical origins of the tweets that belong to the trend. A chart is produced for each trend, showing its popularity over time.
Architecture: Back-End The TwitterListener module receives, via the Twitter API, a sample consisting of 1.2M out of 6M tweets per day. It then separates tweet information into fields and exports two feeds: one reporting tweets with all their fields to an Index module, and one reporting only the text and timestamp of tweets to the Bursty Keywords Detection module. After bursty keywords are identified and grouped into trends, the Trend Analysis module contacts the Index to retrieve information on the tweets that belong to each trend.
Architecture (cont.): Front-End A webpage reports recent trends in real time. An interface allows users to rank trends by recency or current activity rate and to submit their own short descriptions for trends. An additional tab displays daily trends.
Demonstration Every trend will be represented by its entities and its related bursty keywords. The audience will have the option to use the interface to acquire more information: they will be shown additional keywords and can skim through representative tweets, and they will be able to track a trend’s popularity over time and spot its origin. They will interact with the system by ranking the displayed trends according to different criteria and by submitting descriptions.
EvenTweet: Online Localized Event Detection from Twitter. Hamed Abdelhaq, Christian Sengstock, Michael Gertz: EvenTweet: Online Localized Event Detection from Twitter. PVLDB 6(12): pp. 1326-1329 (2013)
INTRODUCTION EvenTweet is a system that detects localized events from a stream of tweets in real time. Only about 1% of tweets are georeferenced. It continuously analyzes the most recent tweets within a time-based sliding window. Each detected event is described by 1) related keywords and 2) an estimate of its start time and geographic location.
INTRODUCTION (cont.) Tracks the evolution of events over time at a fine-grained temporal resolution, using a scoring scheme that gives a score to each event over time. Identifies localized events using a possibly small number of geo-tagged tweets: both geo- and non-geo-tagged tweets are used to identify the words that best describe events, while only geo-tagged tweets are used to estimate the spatial distribution of such words.
Localized Event Detection: Basic Definitions Event: a phenomenon that stimulates people to post messages for a certain period of time. Localized event: an event that happens within a small region, i.e. has a small spatial extent (e.g., concerts, soccer matches, road works).
Localized Event A localized event is described as a tuple le = (el, et, K), where el is the event location, represented as a small set of connected rectangular cells; et is the start time; and K is a set of words frequently published during the event time and at that location.
Online Detection: Basic Notation Each tweet is a tuple tw = (W, uid, l, t), where W is a set of words, uid is a user id, l = (lon, lat) is a geographic location, and t is a timestamp. The timeline is divided into a sequence of equal-length time frames (…, f_{c-1}, f_c), where f_c denotes the current time frame. Each time frame represents a short time interval during which tweets are posted.
Basic Notation (cont.) A time-based sliding window win^k_{f_c} is composed of k time frames, with f_c as its end point. The detection procedure of EvenTweet is triggered every time a new time frame elapses.
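The notation above can be sketched as a small data structure. This is a minimal illustration, not EvenTweet's implementation: the class name `SlidingWindow`, the parameter `frame_seconds`, and the assumption that tweets arrive in timestamp order are all hypothetical. A tweet is the tuple (W, uid, l, t), and `add` returns True whenever a frame has elapsed, which is the point at which detection would be triggered.

```python
from collections import deque


class SlidingWindow:
    """Time-based window of the k most recent equal-length time frames."""

    def __init__(self, k, frame_seconds):
        self.frame_seconds = frame_seconds
        # Each entry is (frame_index, list_of_tweets); the rightmost is f_c.
        self.frames = deque(maxlen=k)

    def add(self, tweet):
        """Add a tweet (W, uid, l, t); return True if a new frame started.

        Assumes tweets arrive in non-decreasing timestamp order.
        """
        idx = int(tweet[3] // self.frame_seconds)
        elapsed = False
        if not self.frames:
            self.frames.append((idx, []))
        # Open new (possibly empty) frames until the tweet's frame is current.
        while self.frames[-1][0] < idx:
            self.frames.append((self.frames[-1][0] + 1, []))
            elapsed = True
        self.frames[-1][1].append(tweet)
        return elapsed


window = SlidingWindow(k=2, frame_seconds=60)
timestamps = [0, 30, 70, 200]
triggered = [
    window.add(({"hello"}, "user1", (8.69, 49.41), t)) for t in timestamps
]
print(triggered, [idx for idx, _ in window.frames])
```

With k = 2, only the two most recent frames are retained; older frames are evicted automatically by the bounded deque.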
Temporal Keyword Extraction Extracts words showing a bursty frequency in the current time frame (these words are called keywords, Y_c). Given the set of words W_c from the tweets published during the most recent time frame f_c, extract a subset Y_c ⊆ W_c of words that are likely to describe localized events.
Temporal Keyword Extraction (cont.) The discrepancy paradigm is used to extract keywords based on their burstiness. For each time frame f_c, u(w, c) is the number of users publishing tweets containing word w, normalized by the number of users. hist_w = (u(w, 1), u(w, 2), …, u(w, m)) is a fixed historical sequence of usage values for w collected before the current time frame f_c, such that m < c. The history is used to distinguish normal behavior from bursty behavior. The discrepancy paradigm measures the deviation between the word usage value u(w, c) in the current time frame and an expected word usage baseline b(w), which is estimated from hist_w. The values in hist_w are assumed to be drawn from a Gaussian distribution with mean b(w).μ and standard deviation b(w).σ. The higher the deviation, the higher the burstiness degree.
Temporal Keyword Extraction (cont.) The burstiness degree of a word w is the z-score, defined as b_degree(w, c) := (u(w, c) − b(w).μ) / b(w).σ. Words whose burstiness degree is larger than 2, i.e. whose usage is more than two standard deviations above the mean, are chosen as keywords. Keywords observed for the first time will have μ = 0 and σ = 0.
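As a minimal sketch of the z-score above: the baseline b(w) is estimated here as the mean and population standard deviation of hist_w. The slides do not say how σ = 0 is handled (for example, for a word seen for the first time), so flooring σ at a small `eps` is an assumption of this sketch, as are the function names and the choice of population rather than sample standard deviation.

```python
from statistics import mean, pstdev


def b_degree(u_wc, hist_w, eps=1e-9):
    """z-score of the current usage u(w, c) against the baseline from hist_w.

    Assumption: sigma is floored at `eps` to avoid division by zero, so a
    word with no history (mu = 0, sigma = 0) gets a very large score.
    """
    mu = mean(hist_w) if hist_w else 0.0
    sigma = pstdev(hist_w) if hist_w else 0.0
    return (u_wc - mu) / max(sigma, eps)


def extract_keywords(usage, history, threshold=2.0):
    """Return the words in W_c whose burstiness degree exceeds the threshold."""
    return {
        w for w, u in usage.items()
        if b_degree(u, history.get(w, [])) > threshold
    }


history = {"nba": [0.01, 0.02, 0.03], "the": [0.01, 0.02, 0.03]}
usage = {"nba": 0.05, "the": 0.025, "newword": 0.01}
keywords = extract_keywords(usage, history)
print(sorted(keywords))
```

Here "nba" is more than two standard deviations above its historical mean, "the" is not, and "newword" passes only because of the `eps` floor chosen for this sketch.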
Spatial Keyword Identification Find keywords that are highly localized. Only georeferenced tweets are used. [Figure: grid G with cells g]
Spatial Keyword Identification (cont.) Only georeferenced tweets are used. For each keyword, calculate the entropy H(S_i) and discard all keywords with entropy larger than a threshold ρ. Why? A keyword with large entropy is spread out in space, whereas a keyword with small entropy occurs at only a few locations. The result is Y_c, the set of filtered keywords.
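The entropy filter can be sketched as follows. The slides do not define S_i precisely, so this sketch assumes it is the distribution of a keyword's georeferenced occurrences over grid cells and uses Shannon entropy in bits. The helper names `cell_of`, `spatial_entropy` and `filter_localized`, the cell size, and the choice to drop keywords with no georeferenced occurrence are all assumptions.

```python
import math
from collections import Counter


def cell_of(lon, lat, cell_size):
    """Map a location to the index of its grid cell (hypothetical helper)."""
    return (math.floor(lon / cell_size), math.floor(lat / cell_size))


def spatial_entropy(cells):
    """Shannon entropy (bits) of a keyword's occurrences over grid cells."""
    counts = Counter(cells)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def filter_localized(keyword_cells, rho):
    """Keep keywords whose spatial entropy does not exceed rho.

    Assumption: keywords without any georeferenced occurrence are dropped,
    because their spatial distribution cannot be estimated.
    """
    return {
        kw for kw, cells in keyword_cells.items()
        if cells and spatial_entropy(cells) <= rho
    }


keyword_cells = {
    "concert": [(1, 1), (1, 1), (1, 2)],         # concentrated in two cells
    "monday": [(0, 0), (1, 1), (2, 2), (3, 3)],  # spread over four cells
    "nogeo": [],                                 # no georeferenced tweets
}
localized = filter_localized(keyword_cells, rho=1.0)
print(localized)
```

A keyword spread evenly over four cells has an entropy of 2 bits and is discarded at ρ = 1, while one concentrated in two cells stays below the threshold.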
Keyword Clustering Each S_i is a vector. Event keywords are clustered using their S_i, with cosine similarity as the similarity measure.
Keyword Clustering (cont.) A distance threshold T is used: if a new keyword falls outside the threshold, it forms a new cluster by itself.
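A minimal sketch of threshold-based clustering with cosine similarity is shown below. The slides state only that cosine similarity is used and that a keyword outside the threshold T starts its own cluster; representing each cluster by the sum of its members' vectors, taking distance as 1 minus cosine similarity, and processing keywords one at a time are assumptions of this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_keywords(signatures, T=0.3):
    """Assign each keyword to its nearest cluster, or start a new one.

    Assumptions: distance = 1 - cosine similarity, and each cluster is
    represented by the sum of its members' vectors.
    """
    clusters = []  # each cluster: {"keywords": [...], "centroid": [...]}
    for kw, s in signatures.items():
        best, best_dist = None, None
        for c in clusters:
            d = 1.0 - cosine_similarity(s, c["centroid"])
            if best_dist is None or d < best_dist:
                best, best_dist = c, d
        if best is not None and best_dist <= T:
            best["keywords"].append(kw)
            best["centroid"] = [x + y for x, y in zip(best["centroid"], s)]
        else:
            # Outside the threshold of every cluster: form a new cluster.
            clusters.append({"keywords": [kw], "centroid": list(s)})
    return [c["keywords"] for c in clusters]


signatures = {
    "concert": [5, 1, 0, 0],
    "band": [4, 2, 0, 0],
    "match": [0, 0, 6, 1],
    "goal": [0, 0, 5, 2],
}
clusters = cluster_keywords(signatures, T=0.3)
print(clusters)
```

Because assignment is incremental, the result can depend on the order in which keywords are processed.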
Cluster Scoring Due to the noisy nature and the increasing vocabulary size of tweets, the extracted keyword set is enormous and contains many spurious keywords, which results in clusters that are related to no events. Clusters are therefore scored to determine which ones are more likely to refer to localized events and to filter out spurious clusters. To score the clusters: 1. Calculate the score of each keyword. 2. Calculate the score of each cluster by summing up the scores of all its keywords. 3. The clusters with high scores are considered event clusters.
Next Week Read one interesting research paper from the VLDB conference and present it in class =)
Other Interesting Papers from VLDB
Real-Time Twitter Recommendation: Online Motif Detection in Large Dynamic Graphs
TeRec: A Temporal Recommender System Over Tweet Stream
Unicorn: A System for Searching the Social Graph
Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach
Piggybacking on Social Networks
Mobility and Social Networking: A Data Management Perspective
Recommendation by Examples
http://db.disi.unitn.eu/pages/VLDBProgram/lib/FullProgram.html#D1F1400T1530R2