Published by Meryl Morton. Modified over 9 years ago.
1
Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
ACSAC 2010
A Presentation at Advanced Defense Lab
2
Outline
- Introduction
- Background
- Dataset and Labeled Collection
- Identifying User Attributes
- Detecting Spammers
- Related Work
- Conclusion
3
Introduction
- Twitter has recently emerged as a popular social system with a simple interface in which only messages of up to 140 characters can be posted.
- Such services open opportunities for new forms of spam.
4
Introduction: 4-step approach
1. Crawled a near-complete dataset from Twitter.
2. Created a labeled collection with users “manually” classified as spammers and non-spammers.
3. Conducted a study of the characteristics of tweet content and user behavior.
4. Used a supervised machine learning method to identify spammers.
6
Background
- Relationship links are directional.
- A re-tweeted message usually starts with “RT @username”.
- Twitter users usually use hashtags (#) to identify certain topics.
- Trending topics, e.g. #musicmonday
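The conventions on this slide can be illustrated with a small parser. This is a minimal sketch, not the authors' tooling: the regular expressions and the function name `parse_tweet` are assumptions made for illustration, and real tweets would need more careful URL and Unicode handling.

```python
import re

# Illustrative patterns (assumptions, not taken from the presentation).
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://\S+")
RETWEET_RE = re.compile(r"^RT\s*@(\w+)")  # accepts "RT@user" and "RT @user"


def parse_tweet(text):
    """Extract hashtags, mentions, URLs, and the re-tweeted user (if any)."""
    retweet = RETWEET_RE.match(text)
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
        "urls": URL_RE.findall(text),
        "retweet_of": retweet.group(1) if retweet else None,
    }


example = parse_tweet("RT @alice: new playlist #musicmonday http://example.com/x")
```

Here `example["retweet_of"]` holds the user being re-tweeted, while a tweet that does not begin with the "RT" prefix yields `None`.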
7
Background
Examples of spam:
- A URL to a website containing advertisements completely unrelated to the hashtag on the tweet.
- Re-tweets in which legitimate links are changed to illegitimate ones.
9
Dataset and Labeled Collection
- We asked Twitter to allow us to collect such data, and they white-listed 58 servers located at the MPI-SWS.
- Twitter assigns each user a numeric ID that uniquely identifies the user’s profile.
- We launched our crawler in August 2009 to collect all user IDs ranging from 0 to 80 million.
- In total:
  - 54,981,152 used accounts
  - 1,963,263,821 social links
  - 1,755,925,520 tweets
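One way to spread an ID-range crawl over a fixed number of machines is to give each a contiguous chunk. The slide does not say how the work was actually divided, so the even split, the half-open range, and the function name `split_id_range` below are all assumptions for illustration.

```python
def split_id_range(start, stop, workers):
    """Split the half-open range [start, stop) into contiguous, near-equal chunks."""
    total = stop - start
    base, extra = divmod(total, workers)
    chunks = []
    lo = start
    for i in range(workers):
        # The first `extra` workers take one additional ID each.
        size = base + (1 if i < extra else 0)
        chunks.append((lo, lo + size))
        lo += size
    return chunks


# 80 million candidate IDs over the 58 white-listed servers mentioned above.
chunks = split_id_range(0, 80_000_000, 58)
```

Each server would then probe roughly 1.38 million candidate IDs, of which only the used accounts return a profile.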
10
Building a labeled collection
Three desired properties need to be considered when creating a collection of users labeled as spammers and non-spammers:
- The collection needs to have a significant number of spammers and non-spammers.
- The labeled collection needs to include spammers who are aggressive in their strategies and who most affect the system.
- The users are chosen randomly and not based on their characteristics.
11
Building a labeled collection
Three trending topics:
- Michael Jackson’s death
- Susan Boyle’s emergence
- The hashtag “#musicmonday”
12
Building a labeled collection
- We developed a website to help volunteers manually label users as spammers or non-spammers based on their tweets containing keywords related to the trending topics.
- In total, 8,207 users were labeled, including 355 spammers and 7,852 non-spammers.
- We selected only 710 of the legitimate users to include in our collection.
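The counts above work out to 710 = 2 × 355, so the final collection holds 1,065 users. A minimal sketch of that down-sampling step follows; drawing the legitimate users uniformly at random, the fixed seed, and the name `build_collection` are assumptions, since the slide does not state how the 710 were picked.

```python
import random


def build_collection(spammers, non_spammers, ratio=2, seed=0):
    """Keep all spammers and sample `ratio` times as many non-spammers."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(non_spammers), ratio * len(spammers))
    sampled = rng.sample(non_spammers, k)
    # Label 1 = spammer, 0 = non-spammer.
    return [(user, 1) for user in spammers] + [(user, 0) for user in sampled]


# Placeholder user IDs standing in for the labeled users.
spammers = [f"s{i}" for i in range(355)]
non_spammers = [f"n{i}" for i in range(7852)]
collection = build_collection(spammers, non_spammers)
```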
14
Identifying User Attributes
Content Attributes
The maximum, minimum, average, and median of the following metrics:
- number of hashtags per number of words on each tweet
- number of URLs per words
- number of words of each tweet
- number of characters of each tweet
- number of URLs on each tweet
- number of hashtags on each tweet
- number of numeric characters that appear on the text
- number of users mentioned on each tweet
- number of times the tweet has been re-tweeted
In addition:
- the fraction of tweets with at least one word from a popular list of spam words
- the fraction of tweets that are reply messages
- the fraction of tweets of the user containing URLs
Total: 39 content attributes (9 metrics × 4 statistics + 3 fractions).
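The aggregation described on this slide (per-tweet metrics summarized by maximum, minimum, average, and median for each user) can be sketched as follows. This covers only a subset of the listed metrics; the whitespace tokenization, the tiny `SPAM_WORDS` set, and all function names are assumptions for illustration, not the authors' implementation.

```python
from statistics import mean, median

SPAM_WORDS = {"free", "win", "cash"}  # hypothetical stand-in for a spam-word list


def per_tweet_metrics(text):
    """Compute a few of the per-tweet metrics using naive whitespace tokens."""
    words = text.split()
    n_words = len(words)
    n_hashtags = sum(w.startswith("#") for w in words)
    n_urls = sum(w.startswith(("http://", "https://")) for w in words)
    return {
        "words": n_words,
        "chars": len(text),
        "hashtags": n_hashtags,
        "urls": n_urls,
        "hashtags_per_word": n_hashtags / n_words if n_words else 0.0,
        "urls_per_word": n_urls / n_words if n_words else 0.0,
        "digits": sum(c.isdigit() for c in text),
        "mentions": sum(w.startswith("@") for w in words),
    }


def has_spam_word(text):
    return any(w.strip(".,!?").lower() in SPAM_WORDS for w in text.split())


def content_attributes(tweets):
    """Aggregate per-tweet metrics over one user's (non-empty) list of tweets."""
    rows = [per_tweet_metrics(t) for t in tweets]
    attrs = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        attrs[f"{name}_max"] = max(values)
        attrs[f"{name}_min"] = min(values)
        attrs[f"{name}_mean"] = mean(values)
        attrs[f"{name}_median"] = median(values)
    attrs["frac_spam_words"] = sum(has_spam_word(t) for t in tweets) / len(tweets)
    attrs["frac_with_url"] = sum(row["urls"] > 0 for row in rows) / len(rows)
    return attrs


attrs = content_attributes([
    "Win free cash now http://spam.example #musicmonday",
    "good morning everyone",
])
```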
15
Identifying User Attributes
- 1,065 users in total.
- 39% of the spammers posted spam words in all of their tweets, whereas non-spammers typically post spam words in no more than 4% of their tweets.
16
Identifying User Attributes
User Behavior Attributes
The maximum, minimum, average, and median of the following metrics:
- the time between tweets
- number of tweets posted per day
- number of tweets posted per week
In addition:
- number of followers
- number of followees
- fraction of followers per followees
- number of tweets
- age of the user account
- number of times the user was mentioned
- number of times the user was replied to
- number of times the user replied to someone
- number of followees of the user’s followers
- number of tweets received from followees
- existence of spam words in the user’s screen name
Total: 23 user behavior attributes.
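A few of the behavior attributes can be derived from tweet timestamps and follower counts alone. The sketch below assumes timestamps are Unix seconds and that follower and followee counts are already known; the function and key names are hypothetical.

```python
from statistics import mean, median

SECONDS_PER_DAY = 86400


def inter_tweet_gaps(timestamps):
    """Seconds elapsed between consecutive tweets, in chronological order."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def behavior_attributes(timestamps, followers, followees, account_created, now):
    attrs = {
        "followers": followers,
        "followees": followees,
        # Guard against division by zero for accounts that follow nobody.
        "followers_per_followee": followers / followees if followees else 0.0,
        "n_tweets": len(timestamps),
        "account_age_days": (now - account_created) / SECONDS_PER_DAY,
    }
    gaps = inter_tweet_gaps(timestamps)
    if gaps:  # gap statistics need at least two tweets
        attrs.update(
            gap_max=max(gaps),
            gap_min=min(gaps),
            gap_mean=mean(gaps),
            gap_median=median(gaps),
        )
    return attrs


user = behavior_attributes(
    timestamps=[0, 60, 180],
    followers=10,
    followees=100,
    account_created=0,
    now=10 * SECONDS_PER_DAY,
)
```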
17
Identifying User Attributes
(a) Spammers have a high ratio of followers per followees.
(b) Spammers usually have new accounts, probably because they are constantly being blocked by other users and reported to Twitter.
(c) Non-spammers receive a much larger number of tweets from their followees than spammers do.
19
Detecting Spammers
- SVM-light
- 5-fold cross-validation: in each test, the original sample is partitioned into 5 sub-samples, of which four are used as training data and the remaining one is used for testing.
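The partitioning scheme described above can be sketched in plain Python. This does not invoke SVM-light: `fit` and `predict` are hypothetical callbacks standing in for whatever classifier is used, and the shuffling seed is an assumption.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(samples, labels, fit, predict, k=5):
    """Return the accuracy on each held-out fold."""
    accuracies = []
    for train, test in k_fold_splits(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return accuracies


# With the 1,065 labeled users, each fold holds 213 test users
# and the remaining 852 are used for training.
splits = list(k_fold_splits(1065))
```

Every user appears in exactly one test fold, so each labeled user is classified once by a model that never saw it during training.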
22
Detecting Spammers
χ²
25
Detecting Spam
Consider the following attributes for each tweet:
- number of words from a list of spam words
- number of hashtags per words
- number of URLs per words
- number of words
- number of numeric characters on the text
- number of characters that are numbers
- number of URLs
- number of hashtags
- number of mentions
- number of times the tweet has been replied to
29
Related Work
- Spam has been observed in various applications, including e-mail, web search engines, blogs, videos, and opinions.
- RE: each user specifies a list of users from whom they are willing to receive content.
31
Conclusions
- Crawled the Twitter site to obtain more than 54 million user profiles.
- Investigated different tradeoffs for our classification approach and the impact of different attribute sets.