Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida. Universidade Federal de Minas Gerais, Belo Horizonte, Brazil. ACSAC 2010.


1 Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida. Universidade Federal de Minas Gerais, Belo Horizonte, Brazil. ACSAC 2010. A Presentation at Advanced Defense Lab

2 Outline: Introduction; Background; Dataset and Labeled Collection; Identifying User Attributes; Detecting Spammers; Related Work; Conclusion

3 Introduction
- Twitter has recently emerged as a popular social system, with a simple interface where only 140-character messages can be posted.
- Such services open opportunities for new forms of spam.

4 Introduction
4-step approach:
1. Crawled a near-complete dataset from Twitter.
2. Created a labeled collection with users manually classified as spammers and non-spammers.
3. Conducted a study of the characteristics of tweet content and user behavior.
4. Used a supervised machine learning method to identify spammers.

5 Outline: Introduction; Background; Dataset and Labeled Collection; Identifying User Attributes; Detecting Spammers; Related Work; Conclusion

6 Background
- Relationship links are directional.
- A re-tweeted message usually starts with “RT@username”.
- Twitter users usually use hashtags (#) to identify certain topics.
- Trending Topics, e.g. #musicmonday
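The conventions on this slide (re-tweet prefix, hashtags, mentions) can be picked out of raw tweet text with a few regular expressions. The following is only a minimal sketch: the function name `parse_tweet` and the patterns are assumptions made for illustration, not the parsing used in the original study.

```python
import re

# Hypothetical patterns; real tweets have more edge cases than these cover.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
# Accepts both "RT@username" and "RT @username" at the start of the text.
RETWEET_RE = re.compile(r"^RT\s*@(\w+)")


def parse_tweet(text):
    """Return the hashtags, mentioned users, and re-tweeted user (if any)."""
    retweet = RETWEET_RE.match(text)
    return {
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        "mentions": MENTION_RE.findall(text),
        "retweet_of": retweet.group(1) if retweet else None,
    }


example = parse_tweet("RT @alice: great playlist for #MusicMonday")
```

Hashtags are lower-cased here so that "#MusicMonday" and "#musicmonday" count as the same topic; that normalization is a design choice of this sketch.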

7 Background
Examples of spam:
- A URL to a website containing advertisements completely unrelated to the hashtag on the tweet.
- Re-tweets in which legitimate links are changed to illegitimate ones.

8 Outline: Introduction; Background; Dataset and Labeled Collection; Identifying User Attributes; Detecting Spammers; Related Work; Conclusion

9 Dataset and Labeled Collection
- We asked Twitter to allow us to collect such data, and they white-listed 58 servers located at the MPI-SWS.
- Twitter assigns each user a numeric ID which uniquely identifies the user’s profile.
- We launched our crawler in August 2009 to collect all user IDs ranging from 0 to 80 million.
- In total: 54,981,152 used accounts; 1,963,263,821 social links; 1,755,925,520 tweets.

10 Building a labeled collection
Three desired properties need to be considered when creating a collection of users labeled as spammers and non-spammers:
- The collection needs to have a significant number of spammers and non-spammers.
- The labeled collection needs to include spammers who are aggressive in their strategies and affect the system the most.
- The users are chosen randomly and not based on their characteristics.

11 Building a labeled collection
Three trending topics:
- Michael Jackson’s death
- Susan Boyle’s emergence
- The hashtag “#musicmonday”

12 Building a labeled collection
- We developed a website to help volunteers manually label users as spammers or non-spammers based on their tweets containing #keywords related to the trending topics.
- In total, 8,207 users were labeled, including 355 spammers and 7,852 non-spammers.
- We selected only 710 of the legitimate users to include in our collection.
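The numbers on this slide amount to keeping all 355 spammers and 710 of the 7,852 non-spammers, i.e. two non-spammers per spammer. A minimal sketch of such a down-sampling step follows; it assumes the 710 users are drawn uniformly at random, which the slide does not state, and the user IDs are made-up placeholders.

```python
import random


def sample_non_spammers(non_spammer_ids, k, seed=0):
    """Draw k distinct non-spammers uniformly at random (assumed strategy)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(non_spammer_ids, k)


# Placeholder IDs standing in for the labeled users.
spammers = list(range(355))
non_spammers = list(range(1000, 1000 + 7852))

selected = sample_non_spammers(non_spammers, k=2 * len(spammers))
collection_size = len(spammers) + len(selected)  # 355 + 710
```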

13 Outline: Introduction; Background; Dataset and Labeled Collection; Identifying User Attributes; Detecting Spammers; Related Work; Conclusion

14 Identifying User Attributes
Content Attributes (39 in total)
The maximum, minimum, average, and median of the following metrics:
- number of hashtags per number of words on each tweet
- number of URLs per words
- number of words of each tweet
- number of characters of each tweet
- number of URLs on each tweet
- number of hashtags on each tweet
- number of numeric characters that appear on the text
- number of users mentioned on each tweet
- number of times the tweet has been re-tweeted
Plus three fractions:
- the fraction of tweets with at least one word from a popular list of spam words
- the fraction of tweets that are reply messages
- the fraction of tweets of the user containing URLs
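As an illustration of how per-tweet metrics turn into per-user content attributes, the sketch below computes the four summary statistics for two of the metrics and two of the fractions. The tokenization (whitespace split, URLs counted as words), the URL test, and the tiny spam-word list are all assumptions of this sketch, and it expects at least one tweet per user.

```python
from statistics import mean, median


def summarize(values):
    """Maximum, minimum, average, and median of a per-tweet metric."""
    return {
        "max": max(values),
        "min": min(values),
        "avg": mean(values),
        "median": median(values),
    }


def content_attributes(tweets, spam_words):
    """Aggregate a few content attributes over one user's tweets."""
    words_per_tweet = []
    urls_per_tweet = []
    tweets_with_spam_word = 0
    for text in tweets:
        tokens = text.split()
        words_per_tweet.append(len(tokens))
        urls_per_tweet.append(
            sum(t.startswith(("http://", "https://")) for t in tokens)
        )
        if any(t.lower().strip(".,!?") in spam_words for t in tokens):
            tweets_with_spam_word += 1
    return {
        "words": summarize(words_per_tweet),
        "urls": summarize(urls_per_tweet),
        "frac_spam_words": tweets_with_spam_word / len(tweets),
        "frac_with_url": sum(u > 0 for u in urls_per_tweet) / len(tweets),
    }


attrs = content_attributes(
    ["win a free phone http://example.com", "good morning everyone", "free tickets here"],
    spam_words={"free", "win"},  # hypothetical spam-word list
)
```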

15 Identifying User Attributes
- 1,065 users in total (355 spammers and 710 non-spammers).
- 39% of the spammers posted all their tweets containing spam words, whereas non-spammers typically do not post more than 4% of their tweets containing spam words.

16 Identifying User Attributes
User Behavior Attributes (23 in total)
The maximum, minimum, average, and median of the following metrics:
- the time between tweets
- number of tweets posted per day
- number of tweets posted per week
Plus the following per-user values:
- number of followers
- number of followees
- fraction of followers per followees
- number of tweets
- age of the user account
- number of times the user was mentioned
- number of times the user was replied to
- number of times the user replied to someone
- number of followees of the user’s followers
- number of tweets received from followees
- existence of spam words in the user’s screen name
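Two of the behavior attributes listed here, the time between tweets and the fraction of followers per followees, can be sketched as follows. Timestamps are assumed to be seconds since some reference point, at least two tweets are required, and returning 0.0 for a user with no followees is a choice made for this sketch rather than something the slides specify.

```python
from statistics import mean, median


def inter_tweet_gaps(timestamps):
    """Time differences between consecutive tweets, in timestamp units."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def behavior_attributes(timestamps, followers, followees):
    """A few user behavior attributes for one account."""
    gaps = inter_tweet_gaps(timestamps)
    return {
        "gap_max": max(gaps),
        "gap_min": min(gaps),
        "gap_avg": mean(gaps),
        "gap_median": median(gaps),
        # Assumption: treat the ratio as 0.0 when the user follows nobody.
        "followers_per_followee": followers / followees if followees else 0.0,
    }


user = behavior_attributes([0, 60, 180, 600], followers=50, followees=200)
```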

17 Identifying User Attributes
(a) Spammers have a high ratio of followers per followees.
(b) Spammers usually have new accounts, probably because they are constantly being blocked by other users and reported to Twitter.
(c) Non-spammers receive a much larger number of tweets from their followees in comparison with spammers.

18 Outline: Introduction; Background; Dataset and Labeled Collection; Identifying User Attributes; Detecting Spammers; Related Work; Conclusion

19 Detecting Spammers
- SVM-light
- 5-fold cross-validation: in each test, the original sample is partitioned into 5 sub-samples, of which four are used as training data and the remaining one is used for testing.
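The 5-fold partitioning described on this slide can be sketched in plain Python. This shows only the splitting step, not the SVM-light training itself; the shuffling, the seed, and the helper name `k_fold_splits` are assumptions of the sketch.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs; each fold serves as the test set once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# 1,065 placeholder user indices, matching the size of the labeled collection.
splits = list(k_fold_splits(range(1065), k=5))
```

With 1,065 users, each of the five test folds holds 213 users and each training set holds the remaining 852.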

20 Detecting Spammers

21 Detecting Spammers

22 Detecting Spammers
χ² (chi-squared)
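If "X2" on this slide denotes the chi-squared statistic used to score how strongly an attribute is associated with the spammer label (an assumption, since the slide gives no further detail), the statistic for a binary attribute can be sketched from a 2x2 contingency table. The counts in the example are invented for illustration and are not results from the study.

```python
def chi_squared_2x2(a, b, c, d):
    """Chi-squared statistic for a 2x2 contingency table.

    a: spammers with the attribute      b: spammers without it
    c: non-spammers with the attribute  d: non-spammers without it
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: an empty row or column
    return n * (a * d - b * c) ** 2 / denominator


# Invented counts: the attribute is present for 30 of 40 spammers
# and for 10 of 40 non-spammers.
score = chi_squared_2x2(30, 10, 10, 30)
```

A larger score means the attribute and the class label are further from independent, so attributes can be ranked by this value.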

23 Detecting Spammers

24 Detecting Spammers

25 Detecting Spam
Consider the following attributes for each tweet:
- number of words from a list of spam words
- number of hashtags per words
- number of URLs per words
- number of words
- number of numeric characters on the text
- number of characters that are numbers
- number of URLs
- number of hashtags
- number of mentions
- number of times the tweet has been replied to

26 Detecting Spam

27 Detecting Spammers

28 Outline: Introduction; Background; Dataset and Labeled Collection; Identifying User Attributes; Detecting Spammers; Related Work; Conclusion

29 Related Work
- Spam has been observed in various applications, including e-mail, web search engines, blogs, videos, and opinions.
- RE: each user specifies a list of users from whom they are willing to receive content.

30 Outline: Introduction; Background; Dataset and Labeled Collection; Identifying User Attributes; Detecting Spammers; Related Work; Conclusion

31 Conclusions
- Crawled the Twitter site to obtain more than 54 million user profiles.
- Investigated different tradeoffs for our classification approach and the impact of different attribute sets.

