Detecting Social Spam Campaigns on Twitter
Zi Chu & Haining Wang, The College of William & Mary
Indra Widjaja, Bell Laboratories, Alcatel-Lucent, USA
Presented by Yingjiu Li, Singapore Management University, Singapore

Slide 2: Outline
1. Background
2. Measurement
3. Classification
4. Evaluation
5. Conclusion

Slide 4: Background
- Twitter: micro-blogging & social networking
- [Diagram: friend/follower relationship between Bob and Alice, with a tweet]

Slide 5: Background
- Popularity brings spam
  - Spam definition: malicious, phishing, or scam content or URL
  - Social spamming is more successful because it exploits social relationships
- [Figure: spam tweet]

Slide 6: Background
- Spam campaign
  - A spammer runs multiple accounts to spread spam tweets for a specific purpose (e.g., propagating a spam site or selling goods)
- [Figure: real case of an adult pill campaign with multiple accounts]

Slide 7: Background
- Detecting spam is the first step in fighting spam
  - Tweet level: spam words, spam URLs
  - Account level: spam tweets, aggressive automation
  - Campaign level: cluster related tweets/accounts into campaigns, observe collective features (similar content, posting behavior, …)
- Efficiency
  - Capture multiple spam accounts at one time
- Robustness
  - Some spamming methods cannot be detected at the individual level

Slide 8: Related Work
- Existing work relies solely on the URL feature
  - Tweets are grouped into a campaign based on the shared URL. If the URL is blacklisted, the campaign is classified as spam.
- Disadvantages
  - Blacklists lag behind (90% of clicks happen before the URL is blacklisted)
  - Blacklists cover only part of all spam URLs
  - False positives: the whole domain bit.ly is blacklisted, including the benign webpage http://bit.ly/fg7Uy
  - False negatives: the URL/website is benign, but the campaign's collective behavior is spamming

Slide 9: Background
- A real spam campaign example of aggressive duplication
- Twitter Spamming Rule: "posts duplicate content over multiple accounts"
  - Account EldoYPISILONE: Nutz this music video, SO COOL ;) http://on.fb.me/ht2wXJ?=mti0
  - Account MatthewVankomen: Amazing this music footage, you'll like ^^ http://on.fb.me/ht2wXJ?=nzky
  - Account KristaBauske2r: Amazing this music vid, Maybe u'll like it :^ http://on.fb.me/ht2wXJ?=mtcz

Slide 10: Contribution
- Improve on existing work that relies solely on URL-based detection
- Introduce new features
- Design an automatic detection system using machine learning

Slide 12: Data Collection
- Twitter Streaming API
  - Spritzer
  - Uniform sampling, 1% of real-time global tweets
- Dataset: 50 million tweets
  - Feb. to Apr. 2011
  - Only tweets with URLs are checked, 8 million (tweets without URLs are assumed to be non-spam)

Slide 13: Clustering Algorithm
- URL redirection, tweet = [definition missing in transcript]
  - original URL => final landing URL
  - http://ow.ly/5UbUS ==> ... ==> http://www.people.com/people/.../020515101,00.html
- Cluster tweets with the same final URL into a campaign
  - Campaign = [definition missing in transcript]
  - Campaign_1, Campaign_2, Campaign_3
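The clustering step on this slide can be illustrated with a minimal sketch. It assumes the redirect chains have already been collected into a lookup table, so no network access is needed; the function names, the dictionary layout of a tweet, and the example.com URLs are all hypothetical and not taken from the slides.

```python
from collections import defaultdict

def resolve_final_url(url, redirects, max_hops=10):
    """Follow a redirect lookup table until the final landing URL is reached."""
    seen = set()
    # Stop on a missing entry, a redirect loop, or too many hops.
    while url in redirects and url not in seen and len(seen) < max_hops:
        seen.add(url)
        url = redirects[url]
    return url

def cluster_by_final_url(tweets, redirects):
    """Group tweets that share the same final landing URL into one campaign."""
    campaigns = defaultdict(list)
    for tweet in tweets:
        final_url = resolve_final_url(tweet["url"], redirects)
        campaigns[final_url].append(tweet)
    return dict(campaigns)

# Hypothetical redirect table and tweets, for illustration only.
redirects = {
    "http://short.example/a": "http://hop.example/x",
    "http://hop.example/x": "http://landing.example/page",
    "http://short.example/b": "http://landing.example/page",
}
tweets = [
    {"account": "acc_1", "url": "http://short.example/a"},
    {"account": "acc_2", "url": "http://short.example/b"},
    {"account": "acc_3", "url": "http://other.example/news"},
]
campaigns = cluster_by_final_url(tweets, redirects)
```

In this toy input the first two tweets end up in the same campaign because both chains terminate at the same landing page, while the third forms a campaign of its own.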

Slide 14: Ground Truth
- Creation
  - Blacklists: Google's SafeBrowsing, PhishTank, URIBL, SURBL, Spamhaus (if the URL is blacklisted, the campaign is labeled as spam)
  - Manual check: content of tweets, accounts (i.e., tweets posted by them outside the campaign), … Do they violate the official Twitter Rules of Spam and Abuse?
- Ground truth set
  - 580 legitimate campaigns
  - 744 spam campaigns

Slide 15: Twitter Rules

Slide 16: Data Analysis
- Master URL: http://biy.ly/5As4k3
- Affiliate URL per spam account
  - Account_1: http://biy.ly/5As4k3?=xd56
  - Account_2: http://biy.ly/5As4k3?=f2kk
- Master URL Diversity Ratio = unique_Master_URL_# / tweet_no
  - High ratio ==> account independence
  - Low ratio ==> account dependence
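A small sketch of how the Master URL Diversity Ratio could be computed. It assumes that the affiliate identifier is carried in the query string, as in the two example URLs on the slide, so the master URL is recovered by dropping the query; the function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def master_url(url):
    """Drop the query string and fragment, assuming they only carry the affiliate ID."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def master_url_diversity_ratio(tweet_urls):
    """Number of unique master URLs divided by the number of tweets."""
    if not tweet_urls:
        raise ValueError("a campaign needs at least one tweet")
    unique_masters = {master_url(u) for u in tweet_urls}
    return len(unique_masters) / len(tweet_urls)

# The two affiliate URLs from the slide share one master URL.
slide_urls = [
    "http://biy.ly/5As4k3?=xd56",
    "http://biy.ly/5As4k3?=f2kk",
]
ratio = master_url_diversity_ratio(slide_urls)
```

For the two slide URLs the ratio is 1 / 2, which falls on the low side and, per the slide, points toward account dependence.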

Slide 17: Data Analysis

Slide 18: Data Analysis
- Burstiness: overall workload distribution of a campaign

Slide 20: Classification
- Binary-class classification
- Automatic classification framework

Slide 21: Feature Extractor
- Tweet-level features
  - Tweet = [definition missing in transcript]
  - Text contains spam words?
  - URL is redirected?
  - URL is blacklisted?

Slide 22: Feature Extractor
- Account-level features
  - Account = [definition missing in transcript]
  - Lifetime tweet count
  - Account registration date
  - Account protected? Verified?
  - Friend_count, follower_count, ratio
  - Account reputation = follower_count / (follower_count + friend_count)
  - Account taste = avg(account reputation of each of the account's friends)
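The two formulas on this slide translate directly into code. The sketch below is illustrative only: the function names and the tuple layout for friends are hypothetical, and returning 0.0 for an account with no followers and no friends (or for an account with no friends at all) is an assumption, since the slide does not say how that case is handled.

```python
def account_reputation(follower_count, friend_count):
    """follower_count / (follower_count + friend_count), as defined on the slide."""
    total = follower_count + friend_count
    if total == 0:
        return 0.0  # assumption: the slide does not cover the empty case
    return follower_count / total

def account_taste(friends):
    """Average reputation over an account's friends.

    `friends` is a list of (follower_count, friend_count) pairs, one per friend.
    """
    if not friends:
        return 0.0  # assumption: no friends means no taste signal
    reputations = [account_reputation(f, g) for f, g in friends]
    return sum(reputations) / len(reputations)

# Hypothetical numbers: one well-followed friend and one that mostly follows others.
popular = account_reputation(900, 100)
taste = account_taste([(900, 100), (10, 90)])
```

An account followed by many users but following few gets a reputation close to 1, while an account that follows many users and attracts few followers gets a value close to 0.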

Slide 23: Feature Extractor
- Campaign-level features
  - Campaign = ({tweets}, {accounts}, shared_URL)
  - Account Diversity Ratio = account_no / tweet_no
  - Entropy of inter-arrival timing
    - Lower: regular behavior ==> coordination
    - Higher: irregular behavior ==> independent participation
    - Corrected Conditional Entropy (CCE)
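To make the timing feature concrete, here is a simplified sketch. It does not implement Corrected Conditional Entropy, which the slide names as the actual measure; instead it swaps in plain Shannon entropy over binned inter-arrival times, which is enough to show why regular posting yields a lower value than irregular posting. The bin width of 60 seconds, the function names, and the sample timestamps are assumptions.

```python
import math
from collections import Counter

def account_diversity_ratio(account_no, tweet_no):
    """account_no / tweet_no, as defined on the slide."""
    return account_no / tweet_no

def inter_arrival_times(timestamps):
    """Gaps in seconds between consecutive tweets of a campaign."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def binned_entropy(intervals, bin_width=60):
    """Shannon entropy (bits) of intervals bucketed into fixed-width bins.

    This is a simplified stand-in, not Corrected Conditional Entropy.
    """
    if not intervals:
        return 0.0
    bins = Counter(int(gap // bin_width) for gap in intervals)
    n = len(intervals)
    return sum((c / n) * math.log2(n / c) for c in bins.values())

# Hypothetical timestamps in seconds.
regular = binned_entropy(inter_arrival_times([0, 300, 600, 900, 1200]))
irregular = binned_entropy(inter_arrival_times([0, 30, 400, 1000, 5000]))
```

The evenly spaced timestamps all fall into one bin and give zero entropy, while the scattered ones spread over four bins and give a higher value, matching the lower/higher reading on the slide.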

Slide 24: Feature Extractor
- Content self-similarity
  - {Tweets} => sense clusters
  - Cluster_1
    - this music video so cool
    - amazing this music footage you'll like
    - this music video hope u like
  - Cluster_2
    - How to Consolidate Credit Card Debt
    - Consolidate Credit Cards Now to Become Debt Free Later
    - Three Effective Ways to Consolidate Credit Card Debt Without Using Intermediaries

Slide 25: Feature Extractor
- SenseClusters: cluster messages based on contextual similarity
- Vector space model: text ==> vector
  - Msg_1: He visited Russia in 1996.
  - Msg_2: In 1996 he went to Russia.
  - …
  - Vocabulary = {in, he, Russia, to, visited, went, 1996, …}
- Occurrence matrix
  - Weight: TF-IDF (Term Frequency, Inverse Document Frequency)

         word_1  word_2  word_3  …  word_N
  Msg_1  weight  0               0  0
  Msg_2  0       weight  0       0  0
  …      0       0       0       0
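The vector space step can be sketched in plain Python on the two example messages from the slide. This is not the SenseClusters implementation; it uses one common TF-IDF variant (term frequency normalized by message length, IDF = log(N / document frequency)), and the tokenizer and function names are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_matrix(messages):
    """Return (vocabulary, matrix) with one TF-IDF row per message."""
    docs = [Counter(tokenize(m)) for m in messages]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)
    doc_freq = {w: sum(1 for d in docs if w in d) for w in vocabulary}
    matrix = []
    for d in docs:
        total = sum(d.values())
        if total == 0:
            matrix.append([0.0] * len(vocabulary))
            continue
        # Weight = normalized term frequency * inverse document frequency.
        matrix.append(
            [(d[w] / total) * math.log(n_docs / doc_freq[w]) for w in vocabulary]
        )
    return vocabulary, matrix

messages = ["He visited Russia in 1996.", "In 1996 he went to Russia."]
vocabulary, matrix = tfidf_matrix(messages)
```

With only these two messages, words that occur in both (such as "Russia") receive weight 0 under this variant, and only the distinguishing words ("visited", "went", "to") get a non-zero weight. That is why the slide goes on to second-order similarity: the raw vectors alone would treat the two sentences as unrelated.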

Slide 26: Feature Extractor
- Latent Semantic Analysis, rank lowering
- 2nd-order similarity (1st-order similarity)
  - "Score" => a number that expresses the accomplishment of a team in a game
  - "Goal" => a successful attempt at scoring
- Cosine similarity measure
  - cos(0°) = 1: same
  - cos(90°) = 0: orthogonal
  - cos_sim > threshold: the same sense cluster
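The cosine measure and the threshold test from this slide can be written as a short sketch. The threshold value of 0.5 is a placeholder, since the slides do not state the one actually used, and treating a zero vector as having similarity 0 is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

def same_sense_cluster(u, v, threshold=0.5):
    """Messages fall into the same sense cluster when similarity exceeds the threshold."""
    return cosine_similarity(u, v) > threshold

identical = cosine_similarity([1, 0], [1, 0])
orthogonal = cosine_similarity([1, 0], [0, 1])
close_enough = same_sense_cluster([1, 1], [1, 0])
```

The two boundary cases reproduce the slide: identical directions give 1 and orthogonal directions give 0.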

Slide 27: Feature Extractor
- {Tweets} => K sense clusters (on the fly)

  Cluster  Size %  Similarity
  1        10%     1
  2        30%     0.9
  3        60%     0.1

Slide 28: Decision Maker
- Random Forest
  - Ensemble classifier that consists of many decision trees
  - Construction of each tree: calculate the best split based on m (<< M) features in the training set
  - Prediction: a new sample is pushed down each tree and assigned the label of the leaf node it ends up in
  - Final decision: majority voting of all trees
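The final majority-voting step can be shown with a minimal sketch. It covers only the voting, not tree construction: the three "trees" below are hand-written one-rule stand-ins for trained decision trees, and their feature names, thresholds, and decision directions are hypothetical rather than taken from the slides.

```python
from collections import Counter

def majority_vote(trees, sample):
    """Ask every tree for a label and return the most common one."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained trees; thresholds are illustrative only.
def tree_timing(sample):
    return "spam" if sample["timing_entropy"] < 1.0 else "legitimate"

def tree_diversity(sample):
    return "spam" if sample["account_diversity_ratio"] < 0.3 else "legitimate"

def tree_blacklist(sample):
    return "spam" if sample["url_blacklisted"] else "legitimate"

forest = [tree_timing, tree_diversity, tree_blacklist]

campaign_a = {"timing_entropy": 0.2, "account_diversity_ratio": 0.1, "url_blacklisted": False}
campaign_b = {"timing_entropy": 3.0, "account_diversity_ratio": 0.9, "url_blacklisted": False}

label_a = majority_vote(forest, campaign_a)
label_b = majority_vote(forest, campaign_b)
```

For campaign_a two of the three stand-in trees vote "spam", so the ensemble outputs "spam" even though the URL is not blacklisted, which illustrates how a forest can outvote a single weak signal.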

Slide 30: Evaluation

  Classifier      Accuracy %  FPR %  FNR %
  Random Forest   94.5        4.1    6.6
  DecisionTable   92.1        6.7    8.8
  RandomTree      91.4        9.1    8.2
  KStar           90.2        7.9    11.3
  Bayes Net       88.8        9.6    12.4
  SMO             85.2        11.2   17.6
  SimpleLogistic  84.0        10.4   20.4
  J48             82.8        15.2   18.8

- Weka
- Try each classifier with the ground truth set, 10-fold cross-validation
- High accuracy, low FPR (legitimate => spam), medium FNR (spam => legitimate)
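The three reported metrics follow from a confusion matrix in the usual way, with FPR counting legitimate campaigns flagged as spam and FNR counting spam campaigns passed as legitimate. The sketch below uses illustrative error counts (24 and 49) that are not from the slides; they are chosen only so that, on a set of 580 legitimate and 744 spam campaigns, the formulas give rates close to the Random Forest row.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Return (accuracy %, FPR %, FNR %), treating spam as the positive class."""
    total = tp + fp + tn + fn
    accuracy = 100.0 * (tp + tn) / total
    fpr = 100.0 * fp / (fp + tn)  # legitimate campaigns classified as spam
    fnr = 100.0 * fn / (fn + tp)  # spam campaigns classified as legitimate
    return accuracy, fpr, fnr

# Illustrative counts, NOT reported in the slides:
# 580 legitimate campaigns with 24 misclassified, 744 spam campaigns with 49 missed.
accuracy, fpr, fnr = evaluation_metrics(tp=744 - 49, fp=24, tn=580 - 24, fn=49)
rounded = (round(accuracy, 1), round(fpr, 1), round(fnr, 1))
```

Because the spam class is larger than the legitimate class in the ground truth set, the FNR weighs more heavily on the overall accuracy than the FPR does.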

Slide 31: Evaluation
- Evaluate the importance of every feature with Decision Tree
- Only one feature is used for classification each time

  Feature                     Accuracy %   FPR %      FNR %
  Account Diversity Ratio     85.6         16.2       13.0
  Timing Entropy              83.0         9.5        22.8
  URL Blacklist (Our Result)  82.3 (94.5)  3.2 (4.1)  29.0 (6.6)
  Avg Account Reputation      78.5         25.6       18.3
  Active Time                 77.0         16.2       28.3
  Affiliate URL No            76.7         9.6        34.0
  Manual Device %             74.8         10.3       36.8
  Tweet Total No              74.32        32.4       20.4
  Content Self Similarity     72.3         33.7       23.0
  Spam Word Ratio             70.5         25.8       32.4

Slide 33: Conclusion
- Large-scale measurement on Twitter
- Formulation of new features
- Automatic classification system
- Overall accuracy of 94.5%

Slide 34: Questions
