Detecting Social Spam Campaigns on Twitter Zi Chu & Haining Wang The College of William & Mary Indra Widjaja Bell Laboratories, Alcatel-Lucent, USA Presented.

1 Detecting Social Spam Campaigns on Twitter
Zi Chu & Haining Wang, The College of William & Mary
Indra Widjaja, Bell Laboratories, Alcatel-Lucent, USA
Presented by Yingjiu Li, Singapore Management University, Singapore

2 Outline
1 Background
2 Measurement
3 Classification
4 Evaluation
5 Conclusion

4 Background
- Twitter: micro-blogging & social networking
[Figure labels: Friend, Follower, Tweet, Bob, Alice]

5 Background
- Popularity brings spam
  - Spam definition: malicious, phishing, or scam content or URL
  - Social spamming is more successful because it exploits social relationships
[Figure: spam tweet]

6 Background
- Spam campaign
  - A spammer runs multiple accounts to spread spam tweets for a specific purpose (e.g., promoting a spam site or selling goods)
[Figure: real case of an adult pill campaign with multiple accounts]

7 Background
- Detecting spam is the first step in fighting spam
  - Tweet level: spam words, spam URLs
  - Account level: spam tweets, aggressive automation
  - Campaign level: cluster related tweets/accounts into campaigns and observe collective features (similar content, posting behavior, …)
- Efficiency
  - Captures multiple spam accounts at one time
- Robustness
  - Some spamming methods cannot be detected at the individual level

8 Related Work
- Existing work relies solely on the URL feature
  - Tweets are grouped into a campaign based on the shared URL. If the URL is blacklisted, the campaign is classified as spam.
- Disadvantages
  - Blacklists lag behind (90% of clicks occur before the URL is blacklisted)
  - Blacklists cover only part of spam URLs
  - False positives: when a whole domain such as bit.ly is blacklisted, benign webpages under it are flagged too
  - False negatives: the URL/website is benign, but the campaign’s collective behavior is spamming

9 Background
- A real spam campaign example of aggressive duplication
- Twitter Spamming Rule: “posts duplicate content over multiple accounts”
  - Account EldoYPISILONE: Nutz this music video, SO COOL ;)
  - Account MatthewVankomen: Amazing this music footage, you'll like ^^
  - Account KristaBauske2r: Amazing this music vid, Maybe u'll like it :^

10 Contribution
- Improve on existing work, which detects spam from the URL alone
- Introduce new features
- Design an automatic detection system using machine learning

12 Data Collection
- Twitter Streaming API
  - Spritzer
  - Uniform sampling, 1% of real-time global tweets
- Dataset: 50 million tweets
  - Feb. – Apr
  - Only tweets with URLs are checked, 8 million (tweets without URLs are assumed to be non-spam)

13 Clustering Algorithm
- URL redirection, tweet =
  - original URL => final landing URL
  - ==> ... ==>
- Cluster tweets with the same final URL into a campaign
  - Campaign =
  - Campaign_1, Campaign_2, Campaign_3
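The clustering step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `redirects` map stands in for real HTTP redirect resolution, and the tuple layout `(account, text, url)`, the function names, and all example URLs are assumptions made for the example.

```python
from collections import defaultdict


def resolve_final_url(url, redirects, max_hops=10):
    """Follow a redirect map until a URL with no further hop is reached.

    `redirects` is a hypothetical dict {url: next_url}; a real system would
    issue HTTP requests instead. `max_hops` and the `seen` set guard against loops.
    """
    seen = {url}
    for _ in range(max_hops):
        nxt = redirects.get(url)
        if nxt is None or nxt in seen:
            break
        seen.add(nxt)
        url = nxt
    return url


def cluster_by_final_url(tweets, redirects):
    """Group (account, text, url) tweets into campaigns keyed by final landing URL."""
    campaigns = defaultdict(list)
    for account, text, url in tweets:
        campaigns[resolve_final_url(url, redirects)].append((account, text))
    return dict(campaigns)


# Hypothetical data: two shortened links lead to the same landing page.
redirects = {
    "http://short.example/a": "http://hop.example/1",
    "http://hop.example/1": "http://landing.example/pills",
    "http://short.example/b": "http://landing.example/pills",
}
tweets = [
    ("acct_1", "cheap pills", "http://short.example/a"),
    ("acct_2", "best pills", "http://short.example/b"),
    ("acct_3", "my blog post", "http://blog.example/post"),
]
campaigns = cluster_by_final_url(tweets, redirects)
```

Here `acct_1` and `acct_2` land in one campaign because their different original URLs resolve to the same final URL, while `acct_3` forms a campaign of its own.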

14 Ground Truth
- Creation
  - Blacklists: Google’s SafeBrowsing, PhishTank, URIBL, SURBL, Spamhaus (if the URL is blacklisted, the campaign is labeled as spam)
  - Manual check: content of tweets and accounts (e.g., tweets posted by the accounts outside the campaign). Do they violate the official Twitter Rules on spam and abuse?
- Ground truth set
  - 580 legitimate campaigns
  - 744 spam campaigns

15 Twitter Rules

16 Data Analysis
[Figure labels: Master URL, Affiliate URL, spam account, Account_1, Account_2]
- Master URL Diversity Ratio = unique_Master_URL_# / tweet_no
  - High ratio ==> account independence
  - Low ratio ==> account dependence
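As a small worked example of the Master URL Diversity Ratio, assuming each tweet in a campaign has already been mapped to its master URL (the function name and the placeholder URLs `m1` … `m4` are hypothetical):

```python
def master_url_diversity_ratio(master_urls):
    """Ratio of unique master URLs to tweets; `master_urls` has one entry per tweet."""
    if not master_urls:
        raise ValueError("campaign has no tweets")
    return len(set(master_urls)) / len(master_urls)


# Four tweets that all lead to one master URL: low ratio, accounts look dependent.
shared = master_url_diversity_ratio(["m1", "m1", "m1", "m1"])

# Four tweets with four distinct master URLs: high ratio, accounts look independent.
independent = master_url_diversity_ratio(["m1", "m2", "m3", "m4"])
```

With these inputs `shared` is 0.25 and `independent` is 1.0.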

17 Data Analysis

18 Data Analysis
- Burstiness: overall workload distribution of a campaign

20 Classification
- Binary-class classification
- Automatic classification framework

21 Feature Extractor
- Tweet-level Features
  - Tweet =
  - Text contains spam words?
  - URL is redirected?
  - URL is blacklisted?
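The three tweet-level questions can be turned into boolean features along these lines. This is only a sketch: the function signature, the spam word list, and the blacklist contents are assumptions for illustration, and the URL is assumed to have been resolved to its final landing URL beforehand.

```python
def tweet_features(text, url, final_url, spam_words, blacklist):
    """Return the three tweet-level features as booleans.

    `spam_words` and `blacklist` are hypothetical sets supplied by the caller.
    """
    words = set(text.lower().split())
    return {
        "has_spam_word": bool(words & spam_words),
        "is_redirected": final_url != url,
        "is_blacklisted": final_url in blacklist,
    }


features = tweet_features(
    text="Cheap pills here",
    url="http://short.example/a",
    final_url="http://landing.example/pills",
    spam_words={"pills", "viagra"},
    blacklist={"http://landing.example/pills"},
)
```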

22 Feature Extractor
- Account-level Features
  - Account =
  - Lifetime tweet count
  - Account registration date
  - Account protected? Verified?
  - Friend_count, follower_count, ratio
  - Account reputation = follower_count / (follower_count + friend_count)
  - Account taste = avg(account reputation of each of the account's friends)
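The reputation and taste formulas translate directly into code. In this sketch the function names, the choice to return 0.0 for an account with no followers and no friends, and the example counts are assumptions, not values from the slides.

```python
def reputation(follower_count, friend_count):
    """follower_count / (follower_count + friend_count)."""
    total = follower_count + friend_count
    # Assumption: an account with no links at all gets reputation 0.0.
    return follower_count / total if total else 0.0


def taste(friends):
    """Average reputation over an account's friends.

    `friends` is a list of (follower_count, friend_count) pairs, one per friend.
    """
    if not friends:
        return 0.0  # assumption: no friends means no taste signal
    return sum(reputation(f, g) for f, g in friends) / len(friends)


popular = reputation(900, 100)           # many followers, few friends
follow_spammer = reputation(10, 990)     # follows many, followed by few
account_taste = taste([(900, 100), (10, 990)])
```

A reputation near 1 means the account is followed far more than it follows; a value near 0 is typical of an account that follows aggressively without being followed back.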

23 Feature Extractor
- Campaign-level Features
  - Campaign = ({tweets}, {accounts}, shared_URL)
  - Account Diversity Ratio = account_no / tweet_no
  - Entropy of inter-arrival timing
    - Lower: regular behavior ==> coordination
    - Higher: irregular behavior ==> independent participation
    - Corrected Conditional Entropy (CCE)
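To make the two campaign-level features concrete, here is a rough sketch. Note the substitution: it computes plain Shannon entropy over binned inter-arrival gaps, which is simpler than the Corrected Conditional Entropy named on the slide and is used only to show the "lower means more regular" intuition. The bin width, function names, and timestamps are assumptions.

```python
import math
from collections import Counter


def account_diversity_ratio(accounts_per_tweet):
    """account_no / tweet_no, given the posting account of each tweet."""
    return len(set(accounts_per_tweet)) / len(accounts_per_tweet)


def interarrival_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (bits) of binned gaps between consecutive tweets.

    A simplified stand-in for CCE; `bin_seconds` is an arbitrary choice.
    """
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if not gaps:
        return 0.0
    counts = Counter(int(g // bin_seconds) for g in gaps)
    n = len(gaps)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


ratio = account_diversity_ratio(["a", "a", "b", "c"])

# Tweets exactly every 10 minutes: every gap falls in the same bin.
regular = interarrival_entropy([0, 600, 1200, 1800, 2400])

# Gaps of 30 s, 670 s, 100 s and 4200 s: four different bins.
irregular = interarrival_entropy([0, 30, 700, 800, 5000])
```

The perfectly periodic schedule yields entropy 0, while the four distinct gaps yield 2 bits, matching the slide's reading that lower entropy suggests coordination.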

24 Feature Extractor
- Content self-similarity
  - {Tweets} => sense clusters
  - Cluster_1: "this music video so cool", "amazing this music footage you'll like", "this music video hope u like"
  - Cluster_2: "How to Consolidate Credit Card Debt", "Consolidate Credit Cards Now to Become Debt Free Later", "Three Effective Ways to Consolidate Credit Card Debt Without Using Intermediaries"

25 Feature Extractor
- SenseClusters: cluster messages based on contextual similarity
- Vector space model: text ==> vector
  - Msg_1, He visited Russia in
  - Msg_2, In 1996 he went to Russia.
  - …
  - Vocabulary = {in, he, Russia, to, visited, went, 1996, …}
- Occurrence Matrix
  - Weight: TF-IDF (Term Frequency – Inverse Document Frequency)
  - Rows are messages (Msg_1, Msg_2, …), columns are vocabulary words (word_1, word_2, word_3, …, word_N); each cell holds the word's weight in that message, or 0 if the word does not occur
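A TF-IDF occurrence matrix of this shape can be built with a few lines of standard-library Python. This sketch uses one common TF-IDF variant (relative term frequency times log(N / document frequency)); SenseClusters may weight differently. The two messages are simplified, punctuation-free versions of the slide's examples, and the function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(messages):
    """Return (vocabulary, rows) where rows[i][j] is the TF-IDF weight of word j in message i."""
    docs = [m.lower().split() for m in messages]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # Document frequency: number of messages containing each word.
    df = Counter(w for d in docs for w in set(d))
    rows = []
    for d in docs:
        tf = Counter(d)
        length = len(d) or 1  # avoid division by zero for an empty message
        rows.append([tf[w] / length * math.log(n / df[w]) for w in vocab])
    return vocab, rows


vocab, rows = tfidf_matrix(["he visited russia", "in 1996 he went to russia"])
```

Words that appear in every message ("he", "russia") get weight 0 under this variant, so only the distinguishing words ("visited", "went", "1996", …) carry weight in each row.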

26 Feature Extractor
- Latent Semantic Analysis, rank lowering
- 2nd-order similarity (1st-order similarity)
  - “Score” => a number that expresses the accomplishment of a team in a game
  - “Goal” => a successful attempt at scoring
- Cosine similarity measure
  - cos 0° = 1, same
  - cos 90° = 0, orthogonal
  - cos_sim > threshold, the same sense cluster
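The cosine similarity test can be written as below. The threshold value of 0.5 and the function names are placeholders chosen for the example; the slides do not state the threshold used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def same_sense_cluster(u, v, threshold=0.5):
    """True when the two message vectors are similar enough to share a cluster."""
    return cosine_similarity(u, v) > threshold


parallel = cosine_similarity([1, 2], [2, 4])     # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])   # no shared terms
```

Vectors pointing in the same direction score 1 regardless of length, and vectors with no terms in common score 0.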

27 Feature Extractor
- {Tweets} => K sense clusters (on the fly)
[Table: Cluster | Size % | Similarity. Surviving entries: cluster 1, 10%, similarity 1; cluster 2, 30%; a later row with similarity 0.1]

28 Decision Maker
- Random Forest
  - Ensemble classifier that consists of many decision trees
  - Construction of each tree: calculate the best split at each node from m (<< M) randomly chosen features of the training set
  - Prediction: a new sample is pushed down each tree and assigned the label of the leaf node it ends up in
  - Final decision: majority vote of all trees
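The final voting step can be illustrated without any training code. In this sketch each "tree" is a hand-written one-rule function standing in for a trained decision tree; the feature names, thresholds, and sample values are all hypothetical.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return the label that most trees vote for."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees; real trees would be learned from data.
trees = [
    lambda s: "spam" if s["account_diversity_ratio"] < 0.2 else "legitimate",
    lambda s: "spam" if s["timing_entropy"] < 1.0 else "legitimate",
    lambda s: "spam" if s["url_blacklisted"] else "legitimate",
]

sample = {
    "account_diversity_ratio": 0.1,
    "timing_entropy": 0.4,
    "url_blacklisted": False,
}
label = forest_predict(trees, sample)
```

Two of the three trees vote "spam" here, so the campaign is labeled spam even though its URL is not blacklisted. Using an odd number of trees avoids ties in the binary case.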

30 Evaluation
[Table: Classifier | Accuracy % | FPR % | FNR %. Rows: Random Forest, DecisionTable, RandomTree, KStar, Bayes Net, SMO, SimpleLogistic, J]
- Weka
- Try each classifier with the ground truth set, 10-fold cross-validation
- High accuracy, low FPR (legitimate => spam), medium FNR (spam => legitimate)
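For reference, the three reported metrics can be computed from true and predicted labels as follows, treating "spam" as the positive class. The function name and the toy labels are assumptions; they are not results from the study.

```python
def evaluate(y_true, y_pred, positive="spam"):
    """Return accuracy, false positive rate, and false negative rate."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1  # spam classified as legitimate
        elif pred == positive:
            fp += 1  # legitimate classified as spam
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / len(y_true),
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "fnr": fn / (fn + tp) if fn + tp else 0.0,
    }


# Toy labels: 4 spam and 4 legitimate campaigns, one mistake of each kind.
y_true = ["spam"] * 4 + ["legitimate"] * 4
y_pred = ["spam", "spam", "spam", "legitimate",
          "legitimate", "legitimate", "legitimate", "spam"]
metrics = evaluate(y_true, y_pred)
```

On the toy data this gives accuracy 0.75, FPR 0.25, and FNR 0.25. In a 10-fold cross-validation the same computation would be applied to the held-out fold of each round and the results averaged.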

31 Evaluation
- Evaluate the importance of every feature with a Decision Tree
  - Only one feature is used for classification each time
Feature                       Accuracy %    FPR %      FNR %
Account Diversity Ratio
Timing Entropy
URL Blacklist (Our Result)    82.3 (94.5)   3.2 (4.1)  29.0 (6.6)
Avg Account Reputation
Active Time
Affiliate URL No
Manual Device %
Tweet Total No
Content Self Similarity
Spam Word Ratio

33 Conclusion
- Large-scale measurement on Twitter
- Formulation of new features
- Automatic classification system
- Overall accuracy 94.5%

34 Questions
