TwitterStand: News in Tweets. Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler, Michael D. Lieberman, Jon Sperling. Department of Computer Science University.

1 TwitterStand: News in Tweets. Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler, Michael D. Lieberman, Jon Sperling. Department of Computer Science, University of Maryland. Group members: Enkh-Amgalan Baatarjav, Jedsada Chartree, Thiraphat Meesumrarn.

2 Outline: Introduction to Twitter; Problem statement; Contributions; Key concepts; Methodology; Assumptions; Questions; References.

3 Introduction: Twitter. Three actors: user, followers, friends. Relationships: unidirectional or bidirectional. Multi-interface: website, SMS, applications, IM, etc.

4 Introduction to Twitter


6 Search Engines

7 Twitter Services: API. The Twitter API offers functions to obtain user-specific information, plus samples of the Twitter dataset through public feeds. Public timeline. Spritzer. GardenHose: sparse sampling of all feeds. BirdDog: tweets written by up to 200,000 users.

8 Introduction: Statistics U.S. Unique Visitor (000) Trend (Source: comScore Media Metrix)

9 Introduction: Statistics. 21% of Twitter accounts are empty placeholders. 94% of Twitter accounts have fewer than 100 followers. 10% of Twitter users create 86% of all activity. 49.6% of Twitter users are inactive (1 tweet in the last 7 days). 55% of Twitter users use third-party applications.

10 Introduction: Statistics

11 Problem Statement. Conventional systems: news aggregators (Google News, Bing News, and Yahoo! News) and content providers (newspapers, television stations, news blogs). A vast amount of information is being generated by Twitter users, e.g. the 2008 Southern California earthquake and the Iranian election. The problem: separating news from junk.

12 Contributions. Mobilizing millions of Twitter users to be the eyes and ears of the world. Geographic proximity plays an important role. TwitterStand: identifying current news; clustering similar tweets into news stories; ranking news by importance; geotagging news topics.

13 Key Concepts. Separating news from noise. Clustering tweets. Mapping the clusters to geographic locations.

14 Example: Twitter Vs Aggregator

15 Benefits of Twitter. Social networking website: community and structure. Meta-data information: description, source location, friends, etc. Very open community: a diverse community with varied interests, broadcasting less popular viewpoints. Capturing breaking news: very little lag time between event and tweet.

16 Challenges of Twitter. Determining whether a tweet is news or not: most tweets are not news. Very high throughput: processing needs to be fast and resilient to noise. Brevity of the tweets: lacking conveyed information, time critical. Credibility issues.

17 Key Strategies. 1. Using an online algorithm: the stream of tweets arrives at a furious rate. 2. Extracting useful information from noise: spelling and grammatical errors, abbreviations, etc. 3. Keeping up with Twitter's evolution. 4. Finding the core group of users who tweet about news: manually identifying the core group works better than mining the social network structure; then finding the most common set of followers among them. 5. Obtaining user-generated news content: videos, photographs, unconventional news; biased toward entertainment, politics, and tech.

18 Architecture of TwitterStand

19 Architecture: Input. Seeders: 2,000 handpicked users known to publish news (newspapers, television stations, reporters, bloggers, etc.). GardenHose: a sampling of all tweets; a very noisy feed covering diverse topics. BirdDog: feeds from up to 200,000 users, identified by the friend finder. Artifacts: links to external resources, retained only from the seeders' feed. Track: an automatically generated pool of search keys used to scour the stream of tweets for potential news tweets of interest.

20 Separating the Chaff. Classify incoming tweets as either junk or news, except for tweets from seeders. Goal: not to get rid of noise completely, but to discard as many tweets as possible without losing many news tweets. Train a naïve Bayes classifier on a corpus of tweets marked as either junk or news.

21 Cont. The probability that a tweet t is junk or news is expressed using Bayes' theorem, under the assumption of independence among the words in t.
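The equations on this slide did not survive in the transcript. A standard statement of Bayes' theorem under the word-independence assumption would read as follows; the notation is an assumption (c for the class, w_1 ... w_n for the words of tweet t), not a copy of the slide:

```latex
P(c \mid t) = \frac{P(c)\, P(t \mid c)}{P(t)},
\qquad
P(t \mid c) = \prod_{i=1}^{n} P(w_i \mid c),
\qquad
c \in \{\text{junk}, \text{news}\}
```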

22 Cont. If D < 0, the tweet is classified as news, else it is junk
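The definition of D is not preserved in the transcript. One sign convention consistent with the rule above is the log-odds of junk over news, D = log P(junk | t) - log P(news | t); the sketch below assumes that definition. The add-one smoothing, the function names, and the toy corpus are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def train(labeled_tweets):
    """Count words per class; labeled_tweets is a list of (text, label) pairs."""
    word_counts = {"junk": Counter(), "news": Counter()}
    doc_counts = Counter()
    for text, label in labeled_tweets:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts

def log_odds_junk(text, word_counts, doc_counts):
    """Assumed D = log P(junk | t) - log P(news | t), with add-one smoothing."""
    vocab = set(word_counts["junk"]) | set(word_counts["news"])
    totals = {c: sum(word_counts[c].values()) for c in word_counts}
    d = math.log(doc_counts["junk"] / doc_counts["news"])  # prior term
    for w in text.lower().split():
        p_junk = (word_counts["junk"][w] + 1) / (totals["junk"] + len(vocab))
        p_news = (word_counts["news"][w] + 1) / (totals["news"] + len(vocab))
        d += math.log(p_junk / p_news)  # independence: per-word terms add up
    return d

def classify(text, word_counts, doc_counts):
    """Apply the slide's rule: D < 0 means news, otherwise junk."""
    return "news" if log_odds_junk(text, word_counts, doc_counts) < 0 else "junk"

# Made-up toy corpus, for illustration only.
corpus = [
    ("earthquake hits southern california", "news"),
    ("election results announced in iran", "news"),
    ("eating a sandwich for lunch", "junk"),
    ("so bored at home lol", "junk"),
]
wc, dc = train(corpus)
```

With this toy corpus, classify("earthquake in california", wc, dc) lands on the news side and classify("bored at lunch", wc, dc) on the junk side.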

23 Cont. How to ensure that news-related tweets are not classified as junk? The corpus is made up of two components. Static: a large collection of tweets marked as news and a large collection of tweets marked as junk. Dynamic: periodically obtained from the clustering module; names of people, hashtags. For news tweets, the static corpus helps identify news tweets on topics not encountered previously, and the dynamic corpus helps identify news tweets about current events.

24 Online Clustering. Goal: automatically group news tweets into sets of tweets (clusters). Topic detection: each cluster contains tweets pertaining to a specific topic. Challenges: topics are not predefined; there is no training set; clustering must be done online.

25 Cont. Leader-follower clustering. Feature: able to cluster on both content and time. Algorithm details: an active cluster list; feature vectors of tweet terms (TF-IDF); a time centroid; a cluster becomes inactive when its time centroid is more than 3 days old.

26 Cont. Cosine similarity measure between the feature vectors TFV_t and TFV_c. Pre-specified constant ε: if the distance exceeds ε, start a new cluster. To account for the temporal dimension, apply a Gaussian attenuator.
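The formulas for the distance and the attenuator are missing from the transcript, so the following is only a sketch of one leader-follower step under stated assumptions: distance is taken as 1 minus the cosine similarity, the similarity is multiplied by a Gaussian in the time gap between the tweet and the cluster's time centroid, and centroids are updated as running means. EPSILON, SIGMA_HOURS, and all function names are hypothetical.

```python
import math

EPSILON = 0.7          # hypothetical distance threshold
SIGMA_HOURS = 24.0     # hypothetical width of the Gaussian attenuator

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def distance(tweet_vec, tweet_time, cluster):
    """Assumed form: 1 - cosine similarity damped by a Gaussian in time."""
    dt = tweet_time - cluster["time"]
    attenuation = math.exp(-(dt * dt) / (2 * SIGMA_HOURS ** 2))
    return 1.0 - cosine(tweet_vec, cluster["centroid"]) * attenuation

def assign(tweet_vec, tweet_time, clusters):
    """Add the tweet to its nearest cluster, or start a new one if too far."""
    best = min(clusters, key=lambda c: distance(tweet_vec, tweet_time, c),
               default=None)
    if best is None or distance(tweet_vec, tweet_time, best) > EPSILON:
        best = {"centroid": dict(tweet_vec), "time": tweet_time, "size": 1}
        clusters.append(best)
        return best
    n = best["size"]
    # Update the term centroid and the time centroid as running means.
    for term in set(best["centroid"]) | set(tweet_vec):
        old = best["centroid"].get(term, 0.0)
        best["centroid"][term] = (old * n + tweet_vec.get(term, 0.0)) / (n + 1)
    best["time"] = (best["time"] * n + tweet_time) / (n + 1)
    best["size"] = n + 1
    return best
```

Feeding two near-identical earthquake tweets and then an unrelated election tweet through assign yields two clusters, the first holding both earthquake tweets.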

27 Cont.: Optimization. Inverted index of cluster centroids: reduces the number of distance computations. For each feature f, the index stores pointers to all clusters containing f. A distance is computed iff at least one feature is common between a tweet and a cluster. Maintaining a list of active clusters: centroids are less than three days old.
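A minimal sketch of such an inverted index, assuming clusters are identified by integer ids and features are plain strings; the class and method names are hypothetical. Only the clusters returned by candidates need a distance computation.

```python
from collections import defaultdict

class CentroidIndex:
    """Maps each feature to the ids of clusters whose centroid contains it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, cluster_id, features):
        for f in features:
            self.postings[f].add(cluster_id)

    def remove(self, cluster_id, features):
        # Called when a cluster drops off the active list.
        for f in features:
            self.postings[f].discard(cluster_id)

    def candidates(self, tweet_features):
        """Clusters sharing at least one feature with the tweet."""
        found = set()
        for f in tweet_features:
            found |= self.postings.get(f, set())
        return found

index = CentroidIndex()
index.add(1, ["earthquake", "california"])
index.add(2, ["election", "iran"])
```

A tweet with features ["california", "damage"] would be compared against cluster 1 only, and a tweet with no indexed feature against no cluster at all.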

28 Additional Tweaks: Dealing with Noise. Very noisy medium. Seeding good-quality clusters: only seeders are allowed to start a new cluster; unreliable feeds are only allowed to add to existing clusters. Drawback: seeders mostly consist of conventional news sources. Solution: relax the rule so that any tweet can form a cluster, which is inactive if, after k tweets have been added to it, none of the k tweets is from a seeder. The cluster status changes to active when a seeder tweet is added to the cluster.
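The activation rule can be sketched as a small state change on the cluster record. This is an illustration under assumptions: the dict layout and function names are hypothetical, and the exact role of k on the slide is not modeled here.

```python
def new_cluster(from_seeder):
    """Under the relaxed rule any tweet may start a cluster,
    but only a seeder tweet makes it active immediately."""
    return {"size": 1, "active": from_seeder}

def add_tweet(cluster, from_seeder):
    """Record a tweet in the cluster; a seeder tweet activates the cluster."""
    cluster["size"] += 1
    if from_seeder:
        cluster["active"] = True
    return cluster
```

A cluster started by a non-seeder tweet therefore stays inactive, however many non-seeder tweets join it, until the first seeder tweet arrives.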

29 Tweak: Fragmentation. Several different clusters form on a single topic. This occurs frequently with online clustering algorithms: tweets are distributed over tens or hundreds of duplicate clusters. Solution: periodically check for duplicate clusters among the active clusters. Master cluster: the one with the older time centroid. Slave cluster: the one with the younger time centroid. Any new tweet that belongs to the slave cluster is added to the master cluster.
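One way to sketch the master/slave bookkeeping is a redirect table from slave ids to master ids. How duplicates are detected is not described in the transcript, so is_duplicate below is a caller-supplied, hypothetical test, and the cluster layout is assumed.

```python
def merge_duplicates(active, is_duplicate):
    """Map each slave cluster id to its master (the older time centroid).

    `active` is a list of dicts with "id" and "time" (time centroid);
    `is_duplicate(a, b)` is a caller-supplied, hypothetical similarity test.
    """
    redirect = {}
    ordered = sorted(active, key=lambda c: c["time"])  # oldest first
    for i, master in enumerate(ordered):
        if master["id"] in redirect:
            continue  # already a slave of an older cluster
        for slave in ordered[i + 1:]:
            if slave["id"] not in redirect and is_duplicate(master, slave):
                redirect[slave["id"]] = master["id"]
    return redirect

def target_cluster(cluster_id, redirect):
    """New tweets that match a slave cluster are routed to its master."""
    return redirect.get(cluster_id, cluster_id)
```

Running merge_duplicates periodically over the active list and consulting target_cluster on every assignment keeps new tweets flowing into the older cluster.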

30 Tweak: Weight Upper Bounds. Dynamic corpus: newly added features have high TF-IDF values even when they are relatively unimportant, misspelled words, etc. Problem: spurious clusters, i.e. clustering based on an unimportant feature. Solution: for a tweet to be added to a cluster, the tweet and the cluster should share k common features (k > 1).
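The shared-feature requirement amounts to a set intersection check before a tweet may join a cluster. In this sketch the value 2 for k and the function name are assumptions; the slide only states k > 1.

```python
K_MIN_SHARED = 2  # hypothetical value; the slide only requires k > 1

def shares_enough_features(tweet_features, cluster_features, k=K_MIN_SHARED):
    """True if the tweet and the cluster have at least k features in common."""
    return len(set(tweet_features) & set(cluster_features)) >= k
```

A tweet that matches a cluster only on a single misspelled word with an inflated TF-IDF weight would be rejected by this check.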

31 Tweak: Phrases. Phrase: a feature containing two or more terms. Problem: treating a phrase as separate features loses meaning (San Francisco); treating a phrase as a single feature results in a large TF-IDF score. Solution: distinguish two kinds of relationships between the words in a phrase by determining the volume of occurrences of t1 close to t2. Finding a dominant word: Barack Obama => Obama. Merging words into a single feature: San Francisco => San Francisco.

32 Topic Geographic Focus. Associate each cluster of tweets with a set of geographic locations. Tweet content (geotagging): 1. Toponym recognition: finding all textual references to geographic locations. 2. Toponym resolution: determining the correct location for each recognized toponym out of all possible interpretations. Source location of the user: meta-data contains the user's location; containment or prominence heuristic. Computing topic focus: ranking geographic locations by frequency.
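The last step, ranking locations by frequency, can be sketched with a counter. Toponym recognition and resolution are assumed to have already produced one list of location names per tweet; the function name and the sample data are made up for illustration.

```python
from collections import Counter

def topic_focus(tweet_locations, top_n=3):
    """Rank the locations attached to a cluster's tweets by frequency.

    `tweet_locations` holds one list of resolved location names per tweet,
    drawn from tweet content (geotagging) and the user's profile location.
    """
    counts = Counter(loc for locs in tweet_locations for loc in locs)
    return counts.most_common(top_n)

# Made-up sample: locations resolved for four tweets in one cluster.
cluster_locations = [
    ["Los Angeles", "California"],
    ["California"],
    ["Los Angeles"],
    ["Los Angeles", "Chino Hills"],
]
```

Here topic_focus(cluster_locations) ranks Los Angeles first with three mentions, followed by California with two.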

33 User Interface Issues NewsStand

34 Topic Hashtags - Reducing ε value - Proactively searching for more tweets belonging to a particular topic

35 Conclusion. A general technique to extract concepts from noise. Adaptable to different environments. Generating a dynamic corpus; online algorithm. Pinpointing news clusters to geographic locations. User interface for displaying news. Harbinger of a futuristic technology that can capture and transmit the sum total of all human experiences of the moment.

36 Assumptions. Noise: tweets that do not belong to the news domain. Tweets from seeders are considered to be reliable news. To apply the naïve Bayes classifier, the words in a tweet are assumed to be independent.

37 Questions & Answers. References: Sankaranarayanan, J., et al., TwitterStand: News in Tweets, Proc. ACM GIS '09, Seattle, WA, USA. Rohit Bhargava, Influential Marketing Blog, stunning-and-useful-stats-about-twitter.html. In-depth study of Twitter: How much we tweet, and when, study-of-twitter-how-much-we-tweet-and-when/.
