BEHAVIORAL PREDICTION OF TWITTER USERS BASED ON TEXTUAL INFORMATION Shiyao Wang.

Viral Event: Ice Bucket Challenge
Purpose: to promote awareness of amyotrophic lateral sclerosis (ALS)
Major activity: dump a bucket of ice water on someone’s head and encourage donations towards ALS research
Rule: an individual can challenge others (usually 3) to take the challenge. An individual who receives the challenge can either take it within 24 hrs or make a donation to the ALS research foundation.
In most cases people take the challenge before nominating others.
Went viral on social networking sites (SNS)

Our Goal
Analyze the spread pattern of this event, primarily on Twitter
Classify user behavior based on tweets
Look for potential correlations between the information cascade and offline behaviors within the Twitter network
Further analysis on this rich set of data

Data
13.95 million tweets purchased from Gnip, a third-party Twitter data provider
All tweets contain keywords or hashtags related to the Ice Bucket Challenge
Among all tweets, 5.44 million were original
A total of 5.56 million users were included, and 2.51 million of them published original tweets

Text-based Classifier for User Behavior
Goal: predict whether the user has taken the Ice Bucket Challenge (IBC)
Data: tweets related to the IBC (text containing keywords or hashtags)

Initial Approach
Manual Labeling:
  Identify whether there are strong signs of a user’s taking the challenge
  Based on both the tweet text and the attached multimedia information (primarily URLs linking to other SNS)
  Method debatable
Feature Selection:
  Keyword (first person, third person, take, nominate, etc.)
  N-gram
  URL type (type of webpage being linked to)
  User statistics (number of followers/followees, etc.)
  Other features

Current Approach
Feature Selection:
  Keyword replacement in tweet text:
    URLs were checked and converted into keywords such as URL_S (URL linking to an SNS)
    Hashtags were converted to HASHTAG (or HASHTAG_CH if containing IBC-related keywords)
    Mentions were converted to MENTION
  N-grams based on the modified tweet text
  POS tags based on the modified tweet text
  Roughly 1000+ features in total
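
The keyword replacement and n-gram steps above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the domain list `SNS_DOMAINS`, the keyword list `IBC_KEYWORDS`, and the fallback token `URL_OTHER` are assumptions made for the example (the slides only name `URL_S`, `HASHTAG`, `HASHTAG_CH`, and `MENTION`), and URLs are assumed to be already expanded rather than shortened.

```python
import re
from urllib.parse import urlparse

# Hypothetical lists: the slides do not say which domains or keywords were used.
SNS_DOMAINS = {"facebook.com", "instagram.com", "youtube.com", "youtu.be", "vine.co"}
IBC_KEYWORDS = ("icebucket", "alsicebucket", "strikeoutals")

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@\w+")


def url_token(match):
    """Map a URL to URL_S if it points at an SNS domain, else URL_OTHER (assumed name)."""
    host = urlparse(match.group(0)).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return "URL_S" if host in SNS_DOMAINS else "URL_OTHER"


def hashtag_token(match):
    """Map a hashtag to HASHTAG_CH if it contains an IBC keyword, else HASHTAG."""
    tag = match.group(1).lower()
    return "HASHTAG_CH" if any(k in tag for k in IBC_KEYWORDS) else "HASHTAG"


def normalize(text):
    """Replace URLs first (they may contain '#' or '@'), then hashtags, then mentions."""
    text = URL_RE.sub(url_token, text)
    text = HASHTAG_RE.sub(hashtag_token, text)
    text = MENTION_RE.sub("MENTION", text)
    return text


def ngrams(text, n):
    """Return whitespace-token n-grams of the (already normalized) text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For instance, `ngrams(normalize(tweet), 2)` would give bigram features over the modified text. Replacing concrete URLs, hashtags, and user names with a handful of placeholder tokens keeps the vocabulary small, which matters with only a few hundred labeled tweets.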

Previous Toy Classifier
Data downloaded using the Twitter API
580 tweets included, 155 of which were labeled as positive (26.7%)
Best result given by NaiveBayes with 10-fold CV
Positive class F-Measure: 0.902
ROC: 0.924

Real Data Classifier
Randomly selected 500 original tweets from the database
Manual labeling performed, including opening links in tweets to find signs of taking the challenge (different from the toy classifier)
58 of the 500 tweets labeled as positive (11.6%)

Classifier Building
Various classifiers were tested, including NaiveBayes, Random Forest/Tree, J48, Logistic, SMO, SVM, etc.
Oversampling of positive instances in the training set was implemented; different ratios between positive and negative instances were tested
Manual cross-validation was implemented
NaiveBayes and SVM with a linear kernel work best at this point
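
A rough sketch of how oversampling can be combined with a hand-written cross-validation loop is shown below. It is an illustration under assumptions, not the project's code: the function names, the default of 10 folds, the random duplication of positives, and the `pos_per_neg` ratio parameter are all hypothetical. The one design point it is meant to show is that positives are duplicated only inside each training split, so no copy of a test tweet leaks into training.

```python
import random


def oversample_positives(train_idx, labels, pos_per_neg, rng):
    """Duplicate positive training indices until there are about
    pos_per_neg positives per negative (never removes any instance)."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    if not pos:
        return list(train_idx)
    target = int(round(pos_per_neg * len(neg)))
    extra = [rng.choice(pos) for _ in range(max(0, target - len(pos)))]
    return neg + pos + extra


def manual_cv_splits(labels, k=10, pos_per_neg=1.0, seed=0):
    """Yield (train_indices, test_indices) for k folds.
    Only the training indices are oversampled; test folds stay untouched."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield oversample_positives(train_idx, labels, pos_per_neg, rng), test_idx
```

Each yielded pair can be handed to any classifier: fit on the training indices, score the test indices, and pool or average the per-fold metrics. Trying several values of `pos_per_neg` corresponds to testing different positive-to-negative ratios.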

Results

Problems
1. Labeling method is debatable
2. Highly unbalanced dataset and small number of instances
3. Weka’s ROC and the manually calculated ROC (based on Python’s sklearn) were slightly different
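
For problem 3, one tool-independent way to cross-check an ROC area is the pairwise (Mann-Whitney) formulation: the fraction of positive-negative pairs in which the positive instance receives the higher score, with ties counted as one half. The sketch below is only a reference calculation; the function name is hypothetical. In general, small differences between tools can come from details such as how tied scores are treated or whether folds are pooled before the curve is computed, but the slides do not identify the cause of the discrepancy observed here.

```python
def auc_by_pairs(labels, scores):
    """ROC area as the probability that a random positive outscores a
    random negative, counting ties as 0.5. O(P * N), fine for small sets."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Running this on the same per-instance scores that each tool used makes it easier to see whether the two reported values differ because of the scores themselves or because of how the area is computed.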
