
1 Don’t Follow Me: Spam Detection in Twitter
January 12, 2011
In-seok An, SNU Internet Database Lab.
Alex Hai Wang, The Pennsylvania State University
International Conference on Security and Cryptography, 2010

2 Outline
• Introduction
• Social Graph model
• Features
• Data Set
• Spam Detection
• Experiments
• Evaluation
• Conclusion

3 Introduction
• Social Network Service (SNS)
  – An online service, platform, or site that focuses on building and reflecting social networks or social relations among people
  – Among the most popular applications of Web 2.0
• Twitter
  – Founded in 2006
  – One of the fastest-growing SNSs
    • Grew by more than 2,800% in 2009
  – A social networking site and microblogging service

4 Introduction
• Twitter
[Figure: Twitter home page with two callouts: “You can post your latest updates” and “Messages (tweets) from the Twitter users you are following (subscribing to)”]

5 Introduction
• Spammers on Twitter
  – The goal of Twitter
    • Allow friends to communicate and stay connected through the exchange of short messages
  – Spammers also use Twitter as a tool to post malicious links
  – More than 3% of messages on Twitter are spam (Analytics, 2009)
  – The offensive trending-topic attack on February 20 (CNET, 2009)

6 Introduction
• Methods to report spam
  – Click the “report as spam” link
  – Post a tweet in the format “@spam @username”
• This reporting service is itself abused by both hoaxes and spam
• Legitimate users can be mistakenly suspended by Twitter’s anti-spam action

8 Social Graph model
• Twitter can be modeled as a directed graph
  – G = (V, A)
  – V: a set of nodes (vertices)
  – A: a set of arcs (directed edges)
• Four types of relationships on Twitter can be defined
  – Follower
    • Node j is a follower of node i if the arc a = (j, i) is contained in A
  – Friend
    • Node j is a friend of node i if the arc a = (i, j) is contained in A
  – Mutual friend
    • Nodes i and j are mutual friends if both arcs a = (i, j) and a = (j, i) are contained in A
  – Stranger
    • Nodes i and j are strangers if neither arc a = (i, j) nor a = (j, i) is contained in A

9 Social Graph model
• A simple Twitter graph
  – A follows B
    • A is a follower of B
    • B is a friend of A
  – B follows C, and C follows B
    • B and C are mutual friends
  – A doesn’t follow C, and C doesn’t follow A
    • A and C are strangers
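The four relationships can be checked mechanically once the arcs are stored as a set. The following is a minimal sketch, assuming each arc is a (tail, head) pair in which the tail follows the head; the function name `relationship` and the set-of-tuples representation are illustrative choices, not taken from the slides.

```python
def relationship(arcs, i, j):
    """Describe how node j relates to node i, given a set of (tail, head) arcs."""
    i_follows_j = (i, j) in arcs
    j_follows_i = (j, i) in arcs
    if i_follows_j and j_follows_i:
        return "mutual friend"
    if j_follows_i:
        return "follower"   # arc (j, i) is in A
    if i_follows_j:
        return "friend"     # arc (i, j) is in A
    return "stranger"


# The simple graph from this slide: A follows B, B and C follow each other.
arcs = {("A", "B"), ("B", "C"), ("C", "B")}

for i, j in [("B", "A"), ("A", "B"), ("B", "C"), ("A", "C")]:
    print(f"{j} is a {relationship(arcs, i, j)} of {i}")
```

Storing the arcs in a set keeps each membership test constant-time, which matters little here but would for a graph with tens of millions of relationships.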

10 Social Graph model
• Twitter social graph
[Figure: Twitter social graph]

11 Outline
• Introduction
• Social Graph model
• Features
  – Graph-based features
  – Content-based features
• Data Set
• Spam Detection
• Experiments
• Evaluation
• Conclusion

12 Features: Graph-based features
• Twitter’s spam and abuse policy
  – “if you have a small number of followers compared to the amount of people you are following, it may be considered as a spam account”
• Three features
  – The number of friends
    • The outdegree of a node (the arcs to the users it follows)
  – The number of followers
    • The indegree of a node (the arcs from the users who follow it)
  – The reputation of a user
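A minimal sketch of how the three graph-based features could be computed from the arc set. It takes an arc (j, i) to mean that j follows i, as in the graph model, so followers are incoming arcs and friends are outgoing arcs. The transcript does not preserve the reputation formula, so the ratio followers / (followers + friends) used below is an assumption, as is the function name `graph_features`.

```python
from collections import Counter


def graph_features(arcs):
    """Return {node: (friends, followers, reputation)} for (tail, head) arcs."""
    followers = Counter(head for _, head in arcs)  # incoming arcs
    friends = Counter(tail for tail, _ in arcs)    # outgoing arcs
    features = {}
    for node in set(followers) | set(friends):
        n_followers = followers[node]
        n_friends = friends[node]
        total = n_followers + n_friends
        # ASSUMPTION: reputation = followers / (followers + friends).
        reputation = n_followers / total if total else 0.0
        features[node] = (n_friends, n_followers, reputation)
    return features


arcs = {("A", "B"), ("B", "C"), ("C", "B")}
for node, values in sorted(graph_features(arcs).items()):
    print(node, values)
```

Under this assumed definition, an account that follows many users but has few followers gets a reputation near 0, and an account with followers but no friends gets 100%.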

13 Features: Content-based features
• Duplicate tweets
  – An account may be considered spam if it posts duplicate content
  – Detected by measuring the Levenshtein distance (edit distance)
    • The minimum cost of transforming one string into another through a sequence of edit operations (deletion, insertion, and substitution of individual symbols)
    • The data are cleaned by removing the words containing “@”, “#”, “http://”, and “www.”
  – The number of duplicate tweets is used as a measurement
    • Counted over the user’s 20 most recent tweets
    • Two tweets are considered duplicates only when they are exactly the same
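For reference, the Levenshtein distance can be computed with the standard dynamic-programming recurrence. This is a generic textbook sketch with unit costs for all three edit operations; the slides do not say which implementation or cost scheme was used.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

A distance of 0 means the two strings are identical, which matches the slide's rule that only exactly equal tweets count as duplicates.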

14 Features: Content-based features
• Need for cleaning
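A minimal sketch of the cleaning step and the pairwise duplicate count. It assumes whitespace tokenization and that the tweet list is ordered most recent first; the names `clean` and `count_duplicate_pairs` are illustrative, not from the slides.

```python
from itertools import combinations

# Words containing any of these markers are removed before comparison.
MARKERS = ("@", "#", "http://", "www.")


def clean(tweet):
    """Drop every whitespace-separated word that contains a marker."""
    kept = [word for word in tweet.split()
            if not any(marker in word for marker in MARKERS)]
    return " ".join(kept)


def count_duplicate_pairs(tweets, window=20):
    """Count pairs of identical cleaned tweets among the first `window` tweets."""
    # Note: tweets made up only of removed words all clean to "" and compare equal.
    cleaned = [clean(tweet) for tweet in tweets[:window]]
    return sum(1 for a, b in combinations(cleaned, 2) if a == b)


tweets = [
    "Win a free phone http://bit.ly/ab3cd @alice",
    "Win a free phone http://bit.ly/xy9zz @bob",
    "Good morning everyone",
]
print(count_duplicate_pairs(tweets))
```

Without cleaning, the first two tweets differ in their link and recipient and would not be counted as duplicates, which is why the cleaning step is needed.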

15 Features: Content-based features
• HTTP links
  – An account is considered spam if its updates consist mainly of links rather than personal updates
  – Twitter filters out URLs that link to known malicious sites
    • URL-shortening services such as bit.ly give attackers an opportunity to spam
  – The number of tweets containing HTTP links is used as a measurement
[Figure: a tweet linking directly to a malicious site (http://porno.com), and a tweet whose shortened link (http://bit.ly/ab3cd) leads through the URL-shortening service to the same site]
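The link feature can be sketched as a simple count over the most recent tweets. The marker list below (including "https://") and the function name `count_link_tweets` are assumptions for illustration; the slides only speak of "HTTP links".

```python
# ASSUMPTION: a tweet "contains a link" if any of these substrings occurs in it.
LINK_MARKERS = ("http://", "https://", "www.")


def count_link_tweets(tweets, window=20):
    """Count how many of the first `window` tweets contain at least one link."""
    return sum(
        1 for tweet in tweets[:window]
        if any(marker in tweet for marker in LINK_MARKERS)
    )


tweets = [
    "see http://bit.ly/ab3cd",
    "hello there",
    "visit www.example.com now",
]
print(count_link_tweets(tweets))
```

Counting tweets rather than resolving URLs sidesteps the shortener problem: the feature does not need to know where a bit.ly link leads.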

16 Features: Content-based features
• Replies and mentions
  – You can send a reply message to another user
    • @username + message
  – You can also mention another @username anywhere in a tweet
    • message + @username + message
  – Twitter automatically collects all tweets containing your username
  – You can reply to anyone, whether or not they are your friends or followers
  – Spammers abuse this feature
  – The number of tweets containing a mention or reply is used as a measurement

17 Features: Content-based features
• Spam tweets using mention or reply
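A minimal sketch of the reply/mention feature. The regular expression is an assumption about what counts as an @username (an "@" not preceded by a word character, followed by word characters), chosen so that e-mail addresses are not counted; the function names are illustrative.

```python
import re

# ASSUMPTION: "@" must not follow a word character, so "bob@example.com" is ignored.
MENTION = re.compile(r"(?<!\w)@\w+")


def is_reply(tweet):
    """A reply has the form '@username + message'."""
    return MENTION.match(tweet.lstrip()) is not None


def count_mention_tweets(tweets, window=20):
    """Count how many of the first `window` tweets contain a reply or mention."""
    return sum(1 for tweet in tweets[:window] if MENTION.search(tweet))


tweets = [
    "@alice check this out",
    "great talk by @bob today",
    "mail me at carol@example.com",
    "no names here",
]
print(count_mention_tweets(tweets))
print([is_reply(tweet) for tweet in tweets])
```

With a window of 20 tweets the count ranges from 0 to 20, so an account whose every recent tweet targets some @username reaches the maximum.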

18 Features: Content-based features
• Trending topics
  – The most-mentioned terms on Twitter at that moment, week, or month
  – Users can add a hashtag to a tweet
    • #tagname
  – If many tweets contain the same term
    • It may become a trending topic
  – Twitter considers an account spam
    • If it posts multiple unrelated updates to a topic using the # symbol
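The hashtag feature can be sketched the same way. This version counts every #tagname occurrence in the most recent tweets; whether the original study counted hashtags or tweets containing hashtags is not clear from the slides, so that choice, the pattern, and the function name are assumptions.

```python
import re

# ASSUMPTION: a hashtag is "#" not preceded by a word character, then word characters.
HASHTAG = re.compile(r"(?<!\w)#\w+")


def count_hashtags(tweets, window=20):
    """Count hashtag occurrences across the first `window` tweets."""
    return sum(len(HASHTAG.findall(tweet)) for tweet in tweets[:window])


tweets = [
    "#news big day for #news fans",
    "nothing tagged here",
    "C# is a programming language",
]
print(count_hashtags(tweets))
```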

20 Data Set
• Data set
  – Collected over 3 weeks, from January 3 to January 24, 2010
  – 25,847 users
  – 500k tweets
  – 49M follower/friend relationships

22 Spam Detection
• Several classification algorithms
  – Decision tree
  – Neural network
  – Support vector machines
  – K-nearest neighbors
  – Naïve Bayesian
• The naïve Bayesian classifier outperforms all the other methods
  – The Bayesian classifier is robust to noise
    • It uses posterior probability
  – A spam probability is calculated for each individual user based on its behavior, instead of applying a general rule

23 Spam Detection
• Naïve Bayesian classifier
  – X: each Twitter account is represented as a vector X of feature values
  – Y: one of two classes, spam or non-spam
  – The features are assumed to be conditionally independent given the class
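Under the independence assumption, the posterior for a class Y is proportional to P(Y) multiplied by the product of the per-feature likelihoods P(x_k | Y). The sketch below illustrates this with Gaussian likelihoods for each feature. The slides do not say how the likelihoods were modeled, so the Gaussian choice, the toy feature vectors (reputation, number of mention tweets), and all function names are assumptions.

```python
import math
from collections import defaultdict


def fit(X, y):
    """Estimate the prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_joint(model, label, x):
    """log P(Y) + sum over features of log P(x_k | Y), Gaussian likelihoods."""
    prior, means, variances = model[label]
    total = math.log(prior)
    for value, mean, var in zip(x, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
    return total


def predict(model, x):
    """Pick the class with the largest (unnormalized) log posterior."""
    return max(model, key=lambda label: log_joint(model, label, x))


# Toy, made-up training data: (reputation, mention tweets out of 20).
X = [[0.9, 0], [0.8, 1], [0.1, 18], [0.0, 20]]
y = ["non-spam", "non-spam", "spam", "spam"]
model = fit(X, y)
print(predict(model, [0.05, 19]), predict(model, [0.85, 1]))
```

Working in log space avoids numerical underflow when many feature likelihoods are multiplied together.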

25 Experiments
• To evaluate the detection method
  – 500 Twitter user accounts are manually labeled into two classes (spam or not spam)
    • By reading the 20 most recent tweets
    • By checking the friends and followers of the user
  – The results show that around 1% of the accounts in the data set are spam
• Additional spam data are added to the data set
  – To simulate reality and avoid bias in the crawling and labeling methods
  – The study in Analytics, 2009, shows that there is 3% spam on Twitter
    • Search for “@spam” on Twitter to collect additional spam data
  – Only a small number of the results report real spam
  – The data set is mixed to contain around 3% spam data

26 Experiments
• Graph-based features
  – The number of friends for each Twitter account
  – Only 30% of spam accounts follow a large number of users
    • A spammer doesn’t need to follow other users

27 Experiments
• Graph-based features
  – The number of followers for each Twitter account
  – Spam accounts usually do not have a large number of followers
    • Some spam accounts do have a relatively large number of followers

28 Experiments
• Graph-based features
  – The reputation for each Twitter account
  – The reputation of most legitimate users is between 30% and 90%
    • Some spam accounts have a 100% reputation

29 Experiments
• Content-based features
  – The number of pairwise duplications
  – Not all spam accounts post multiple duplicate tweets
    • We cannot depend on this feature alone

30 Experiments
• Content-based features
  – The number of mentions and replies
  – Most spam accounts have the maximum of 20 “@” symbols
    • This lures legitimate users into reading their spam messages or clicking their links

31 Experiments
• Content-based features
  – The number of links
  – Some legitimate users also include links in all of their tweets; for example, some companies join Twitter to promote their own web sites

32 Experiments
• Content-based features
  – The number of hashtag signs

34 Evaluation
• The evaluation of the overall process
  – Confusion matrix
    • As the formulas imply: a = spam classified as spam, b = spam classified as non-spam, c = non-spam classified as spam
  – Precision: P = a / (a + c)
  – Recall: R = a / (a + b)
  – F-measure: F = 2PR / (P + R)
• Each classifier is trained 10 times
  – Each time using 9 of the 10 partitions as training data
  – Computing the confusion matrix using the remaining partition as test data
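The three measures and the 10-fold procedure can be sketched as follows. The code reads a as spam correctly classified as spam, b as spam missed, and c as non-spam flagged as spam, which is what the precision and recall formulas imply. The round-robin assignment of items to partitions and both function names are assumptions; the slides do not say how the partitions were formed.

```python
def precision_recall_f(a, b, c):
    """a: spam caught, b: spam missed, c: false alarms (all counts)."""
    # Note: raises ZeroDivisionError when a is 0 and the other relevant count is 0.
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices), holding out each partition once."""
    # ASSUMPTION: items are dealt to partitions round-robin by index.
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        train = [index for number, fold in enumerate(folds)
                 if number != held_out for index in fold]
        yield train, folds[held_out]


print(precision_recall_f(a=8, b=2, c=2))
for train, test in k_fold_splits(500):
    print(len(train), len(test))
    break
```

Each of the 500 labeled accounts lands in the test partition exactly once, so every label contributes to the evaluation without ever being used to train the model that is tested on it.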

35 Evaluation
• The evaluation results
  – The naïve Bayesian classifier has the best overall performance
• Finally, the Bayesian classifier learned from the labeled data is applied to the entire data set
  – Information about a total of 25,817 users
  – Precision of the spam detection system
    • 392 users are classified as spam
    • 348 users are real spam accounts and 44 users are false alarms
    • 89% precision
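The reported precision follows directly from the counts on this slide, taking the 348 real spam accounts as a and the 44 false alarms as c in P = a / (a + c):

```latex
P = \frac{a}{a + c} = \frac{348}{348 + 44} = \frac{348}{392} \approx 0.888 \approx 89\%
```

Recall cannot be computed the same way from these figures, because the number of spam accounts the classifier missed in the full data set is not given.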

36 Conclusion
• Spam behavior is studied in a popular online SNS, Twitter
  – To formalize the problem, a social graph model is proposed
• Novel content-based and graph-based features are proposed
  – Graph-based features
    • The number of friends
    • The number of followers
    • The reputation of the user
  – Content-based features
    • The number of pairwise duplications
    • The number of mentions and replies
    • The number of links
    • The number of hashtags
• The data set is analyzed and the performance of the detection system is evaluated

37 Conclusion
• Among the graph-based features
  – The proposed reputation feature has the best performance
  – Not many spam accounts follow a large number of users
  – Some spammers have many followers
• For the content-based features
  – Most spam accounts have multiple duplicate tweets
  – But not all spam accounts post multiple duplicate tweets
    • We cannot rely on this feature alone
• Several popular classification algorithms are studied and evaluated
• The naïve Bayesian classifier achieves 89% precision

