1 Detecting Social Spam Campaigns on Twitter
Zi Chu & Haining Wang, The College of William & Mary
Indra Widjaja, Bell Laboratories, Alcatel-Lucent, USA
Presented by Yingjiu Li, Singapore Management University, Singapore

2 Outline
1. Background
2. Measurement
3. Classification
4. Evaluation
5. Conclusion

4 Background
Twitter: micro-blogging & social networking
[Figure labels: Friend, Follower, Tweet, Bob, Alice]

5 Background
- Popularity brings spam
  - Spam definition: malicious, phishing, or scam content or URLs
  - Social spamming is more successful because it exploits social relationships
[Figure: spam tweet]

6 Background
- Spam campaign
  - A spammer runs multiple accounts to spread spam tweets for a specific purpose (e.g. propagating a spam site, selling goods)
[Figure: real case of an adult pill campaign with multiple accounts]

7 Background
- Detecting spam is the first step in fighting spam
  - Tweet level: spam words, spam URLs
  - Account level: spam tweets, aggressive automation
  - Campaign level: cluster related tweets/accounts into campaigns, observe collective features (similar content, posting behavior, …)
- Efficiency
  - Capture multiple spam accounts at one time
- Robustness
  - Some spamming methods can't be detected at the individual level

8 Related Work
- Existing work relies solely on the URL feature
  - Tweets are grouped into a campaign based on a shared URL. If the URL is blacklisted, the campaign is classified as spam.
- Disadvantages
  - Blacklists lag behind (90% of clicks occur before the URL is blacklisted)
  - Blacklists cover only part of all spam URLs
  - False positives: when the whole domain bit.ly is blacklisted, a benign webpage such as http://bit.ly/fg7Uy is flagged too
  - False negatives: the URL/website is benign, but the campaign's collective behavior is spamming

9 Background
- A real spam campaign example of aggressive duplication
- Twitter spamming rule: “posts duplicate content over multiple accounts”
  Account EldoYPISILONE: Nutz this music video, SO COOL ;) http://on.fb.me/ht2wXJ?=mti0
  Account MatthewVankomen: Amazing this music footage, you'll like ^^ http://on.fb.me/ht2wXJ?=nzky
  Account KristaBauske2r: Amazing this music vid, Maybe u'll like it :^ http://on.fb.me/ht2wXJ?=mtcz

10 Contribution
- Improve on existing work that relies solely on URL-based detection
- Introduce new features
- Design an automatic detection system using machine learning

12 Data Collection
- Twitter Streaming API
  - Spritzer
  - Uniform sampling, 1% of real-time global tweets
- Dataset: 50 million tweets
  - Feb. – Apr. 2011
  - Only tweets with URLs are checked, 8 million (tweets without URLs are assumed to be non-spam)

13 Clustering Algorithm
- URL redirection, tweet =
  - original URL => final landing URL
    http://ow.ly/5UbUS ==> ... ==> http://www.people.com/people/.../020515101,00.html
- Cluster tweets with the same final URL into a campaign
  Campaign =
  - Campaign_1
  - Campaign_2
  - Campaign_3
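
The clustering step could be sketched roughly as follows. This is a minimal Python sketch, not the authors' implementation: the `redirects` lookup table is a hypothetical stand-in for actually following HTTP redirects, and the function names, the `example` URLs, and the tweet dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict


def resolve_final_url(url, redirects, max_hops=10):
    """Follow a redirect chain through a lookup table (hypothetical stand-in
    for real HTTP redirect resolution) and return the final landing URL."""
    hops = 0
    seen = {url}
    while url in redirects and hops < max_hops:
        url = redirects[url]
        if url in seen:  # redirect loop: stop where we are
            break
        seen.add(url)
        hops += 1
    return url


def cluster_by_final_url(tweets, redirects):
    """Group tweets into campaigns keyed by their final landing URL."""
    campaigns = defaultdict(list)
    for tweet in tweets:
        campaigns[resolve_final_url(tweet["url"], redirects)].append(tweet)
    return dict(campaigns)


# Toy data: two shortened URLs lead to the same landing page.
redirects = {
    "http://short.example/a": "http://hop.example/x",
    "http://hop.example/x": "http://landing.example/page",
    "http://short.example/b": "http://landing.example/page",
}
tweets = [
    {"account": "acct_1", "url": "http://short.example/a"},
    {"account": "acct_2", "url": "http://short.example/b"},
    {"account": "acct_3", "url": "http://other.example/"},
]
campaigns = cluster_by_final_url(tweets, redirects)
```

In this toy run the first two tweets fall into one campaign because their URLs share a landing page, while the third forms a campaign of its own.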

14 Ground Truth
- Creation
  - Blacklists: Google’s SafeBrowsing, PhishTank, URIBL, SURBL, Spamhaus (if the URL is blacklisted, the campaign is labeled as spam)
  - Manual check: content of tweets, accounts (i.e. tweets posted by them outside the campaign), … Do they violate the official Twitter Rules of Spam and Abuse?
- Ground truth set
  - 580 legitimate campaigns
  - 744 spam campaigns

15 Twitter Rules

16 Data Analysis
- Master URL: http://biy.ly/5As4k3
- Affiliate URL (one per spam account)
  - Account_1: http://biy.ly/5As4k3?=xd56
  - Account_2: http://biy.ly/5As4k3?=f2kk
- Master URL Diversity Ratio = unique_Master_URL_# / tweet_no
  - High ratio ==> account independence
  - Low ratio ==> account dependence
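
As a rough illustration of the ratio, the sketch below assumes that the affiliate identifier lives in the query string, as in the two example URLs on the slide, so stripping the query recovers the master URL. The function names are hypothetical and the slides do not say how the authors extract master URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def master_url(affiliate_url):
    """Strip the query string and fragment, assuming the affiliate ID lives there."""
    parts = urlsplit(affiliate_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def master_url_diversity_ratio(tweet_urls):
    """Unique master URLs divided by tweet count (0.0 for an empty campaign)."""
    if not tweet_urls:
        return 0.0
    return len({master_url(u) for u in tweet_urls}) / len(tweet_urls)


# The two affiliate URLs from the slide share one master URL.
urls = ["http://biy.ly/5As4k3?=xd56", "http://biy.ly/5As4k3?=f2kk"]
ratio = master_url_diversity_ratio(urls)  # one master URL over two tweets
```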

17 Data Analysis

18 Data Analysis
- Burstiness: overall workload distribution of a campaign

20 Classification
- Binary-class classification
- Automatic classification framework

21 Feature Extractor
- Tweet-level Features
  - Tweet =
  - Text contains spam words?
  - URL is redirected?
  - URL is blacklisted?

22 Feature Extractor
- Account-level Features
  Account =
  - Lifetime tweet count
  - Account registration date
  - Account protected? Verified?
  - Friend_count, follower_count, ratio
  - Account reputation = follower_count / (follower_count + friend_count)
  - Account taste = avg(account reputation of each of the account's friends)
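
The reputation and taste formulas can be written out directly. The sketch below is illustrative only: the function names, the `(follower_count, friend_count)` tuple layout, and the choice to return 0.0 for accounts with no followers or friends are assumptions, not details from the slides.

```python
def account_reputation(follower_count, friend_count):
    """follower_count / (follower_count + friend_count); 0.0 if both are zero."""
    total = follower_count + friend_count
    return follower_count / total if total else 0.0


def account_taste(friends):
    """Average reputation over an account's friends.

    `friends` is a list of (follower_count, friend_count) pairs, one per friend.
    """
    if not friends:
        return 0.0
    return sum(account_reputation(f, g) for f, g in friends) / len(friends)


# An account followed by many and following few has a reputation near 1.
celebrity = account_reputation(follower_count=900, friend_count=100)
# Taste averages the reputations of the accounts someone chooses to follow.
taste = account_taste([(900, 100), (100, 900)])
```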

23 Feature Extractor
- Campaign-level Features
  - Campaign = ({tweets}, {accounts}, shared_URL)
  - Account Diversity Ratio = account_no / tweet_no
  - Entropy of inter-arrival timing
    Lower: regular behavior ==> coordination
    Higher: irregular behavior ==> independent participation
    Corrected Conditional Entropy (CCE)
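
To make the two campaign-level measures concrete, here is a small sketch. Note the simplification: the slides name Corrected Conditional Entropy (CCE), whereas this code computes plain Shannon entropy over binned inter-arrival gaps, which only conveys the same intuition (regular posting gives low entropy). The function names, the tweet dictionary layout, and the 60-second bin width are assumptions.

```python
import math
from collections import Counter


def account_diversity_ratio(tweets):
    """Number of distinct accounts divided by number of tweets."""
    if not tweets:
        return 0.0
    return len({t["account"] for t in tweets}) / len(tweets)


def interarrival_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (bits) of binned gaps between consecutive tweets.

    A simplified stand-in for CCE; the bin width is an arbitrary choice.
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return 0.0
    counts = Counter(int(gap // bin_seconds) for gap in gaps)
    n = len(gaps)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


regular = [0, 600, 1200, 1800, 2400]   # one tweet every 10 minutes
irregular = [0, 30, 400, 5000, 5100]   # no fixed rhythm
low = interarrival_entropy(regular)
high = interarrival_entropy(irregular)
```

With these toy timestamps the regular schedule has a single gap value and therefore zero entropy, while the irregular one spreads its gaps over several bins.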

24 Feature Extractor
- Content self-similarity
  {Tweets} => sense clusters
  Cluster_1)
    this music video so cool,
    amazing this music footage you'll like,
    this music video hope u like
  Cluster_2)
    How to Consolidate Credit Card Debt
    Consolidate Credit Cards Now to Become Debt Free Later
    Three Effective Ways to Consolidate Credit Card Debt Without Using Intermediaries

25 Feature Extractor
- SenseClusters: cluster messages based on contextual similarity
- Vector space model: text ==> vector
  Msg_1: He visited Russia in 1996.
  Msg_2: In 1996 he went to Russia.
  …
  Vocabulary = {in, he, Russia, to, visited, went, 1996, …}
- Occurrence matrix; weight: TF-IDF (Term Frequency – Inverse Document Frequency)

          word_1   word_2   word_3   …   word_N
  Msg_1   weight   0                 0   0
  Msg_2   0        weight   0        0   0
  …       0        0        0        0
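
The vector space step can be illustrated with the two example messages. This is a minimal sketch under stated assumptions: it uses one common TF-IDF variant (term frequency normalized by message length, times the natural log of N over document frequency) and a simple regex tokenizer. The slides do not specify which weighting variant SenseClusters applies, and `tfidf_matrix` is a hypothetical name.

```python
import math
import re
from collections import Counter


def tfidf_matrix(messages):
    """Return (vocabulary, rows), one TF-IDF weighted row per message."""
    docs = [re.findall(r"\w+", m.lower()) for m in messages]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many messages each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = [
            (counts[w] / len(doc)) * math.log(n_docs / doc_freq[w]) if doc else 0.0
            for w in vocab
        ]
        matrix.append(row)
    return vocab, matrix


vocab, matrix = tfidf_matrix(["He visited Russia in 1996.", "In 1996 he went to Russia."])
```

Words shared by every message (such as "he" and "russia" here) receive zero weight under this variant, so only the distinguishing words ("visited", "went", "to") carry non-zero entries.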

26 Feature Extractor
- Latent Semantic Analysis, rank lowering
- 2nd-order similarity (1st-order similarity)
  “Score” => a number that expresses the accomplishment of a team in a game
  “Goal” => a successful attempt at scoring
- Cosine similarity measure
  - cos 0° = 1, same
  - cos 90° = 0, orthogonal
  - cos_sim > threshold, the same sense cluster
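
The cosine test for cluster membership might look like the sketch below. The threshold of 0.5 is an arbitrary placeholder, since the slides do not give the value used, and both function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def same_sense_cluster(u, v, threshold=0.5):
    """True when two message vectors are similar enough to share a cluster."""
    return cosine_similarity(u, v) > threshold


parallel = cosine_similarity([1, 2], [2, 4])    # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])  # nothing in common
```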

27 Feature Extractor
- {Tweets} => K sense clusters (on the fly)

  Cluster   Size %   Similarity
  1         10%      1
  2         30%      0.9
  3         60%      0.1

28 Decision Maker
- Random Forest
  - Ensemble classifier that consists of many decision trees
  - Construction of each tree: calculate the best split based on m (<< M) features in the training set
  - Prediction: a new sample is pushed down each tree and assigned the label of the leaf node it ends up in
  - Final decision: majority voting of all trees
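
Only the final voting step is sketched here, not tree construction. The three "trees" are hand-written one-rule stubs with arbitrary thresholds, invented purely to show how votes are combined; they are not learned from data and do not reflect the authors' trained model.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return the label chosen by the most trees (ties go to the label seen first)."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained decision trees; thresholds are arbitrary.
trees = [
    lambda s: "spam" if s["account_diversity_ratio"] < 0.5 else "legitimate",
    lambda s: "spam" if s["timing_entropy"] < 1.0 else "legitimate",
    lambda s: "spam" if s["url_blacklisted"] else "legitimate",
]
sample = {"account_diversity_ratio": 0.2, "timing_entropy": 0.4, "url_blacklisted": False}
label = forest_predict(trees, sample)  # two of the three stubs vote "spam"
```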

30 Evaluation

  Classifier       Accuracy %   FPR %   FNR %
  Random Forest    94.5         4.1     6.6
  DecisionTable    92.1         6.7     8.8
  RandomTree       91.4         9.1     8.2
  KStar            90.2         7.9     11.3
  Bayes Net        88.8         9.6     12.4
  SMO              85.2         11.2    17.6
  SimpleLogistic   84.0         10.4    20.4
  J48              82.8         15.2    18.8

- Weka
- Each classifier is tried on the ground truth set with 10-fold cross-validation
- High accuracy, low FPR (legitimate => spam), medium FNR (spam => legitimate)
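
For reference, the three reported metrics follow from a confusion matrix as sketched below, treating spam as the positive class so that FPR counts legitimate campaigns flagged as spam and FNR counts spam campaigns passed as legitimate. The counts in the usage lines are made-up toy numbers, not results from the study, and `evaluation_rates` is a hypothetical name.

```python
def evaluation_rates(tp, fp, tn, fn):
    """Return (accuracy, FPR, FNR) with spam as the positive class.

    tp: spam classified as spam          fn: spam classified as legitimate
    tn: legitimate classified correctly  fp: legitimate classified as spam
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, fpr, fnr


# Toy counts for illustration only (100 spam and 100 legitimate campaigns).
accuracy, fpr, fnr = evaluation_rates(tp=90, fp=5, tn=95, fn=10)
```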

31 Evaluation
- Evaluate the importance of every feature with a Decision Tree
  Only one feature is used for classification each time

  Feature                      Accuracy %     FPR %       FNR %
  Account Diversity Ratio      85.6           16.2        13.0
  Timing Entropy               83.0           9.5         22.8
  URL Blacklist (Our Result)   82.3 (94.5)    3.2 (4.1)   29.0 (6.6)
  Avg Account Reputation       78.5           25.6        18.3
  Active Time                  77.0           16.2        28.3
  Affiliate URL No             76.7           9.6         34.0
  Manual Device %              74.8           10.3        36.8
  Tweet Total No               74.32          32.4        20.4
  Content Self Similarity      72.3           33.7        23.0
  Spam Word Ratio              70.5           25.8        32.4

33 Conclusion
- Large-scale measurement on Twitter
- Formulation of new features
- Automatic classification system
- Overall accuracy 94.5%

34 Questions
