Don’t Follow Me: Spam Detection in Twitter January 12, 2011 In-seok An SNU Internet Database Lab. Alex Hai Wang The Pennsylvania State University International.

Don’t Follow Me: Spam Detection in Twitter
January 12, 2011
In-seok An, SNU Internet Database Lab.
Alex Hai Wang, The Pennsylvania State University
International Conference on Security and Cryptography, 2010

Outline
- Introduction
- Social Graph model
- Features
- Data Set
- Spam Detection
- Experiments
- Evaluation
- Conclusion

Introduction
- Social Network Service (SNS)
  - An online service, platform, or site that focuses on building and reflecting social networks or social relations among people
  - Among the most popular applications of Web 2.0
- Twitter
  - Founded in 2006
  - One of the fastest-growing SNSs
    - Grew by more than 2,800% in 2009
  - A social networking site and microblogging service

Introduction
- Twitter
  - You can post your latest updates
  - You see messages (tweets) from the Twitter users you are following (subscribed to)

Introduction
- Spammers on Twitter
  - The goal of Twitter
    - Allow friends to communicate and stay connected through the exchange of short messages
  - Spammers also use Twitter as a tool to post malicious links
  - More than 3% of messages on Twitter are spam (Analytics, 2009)
  - The offensive trending-topic attack on February 20 (CNET, 2009)

Introduction
- Methods to report spam
  - By clicking the “report as spam” link
  - By posting a tweet in the format “@spam @username”
- This reporting service is also abused for both hoaxes and spam
- Legitimate users can be mistakenly suspended by Twitter’s anti-spam actions

Social Graph model
- Twitter can be modeled as a directed graph
  - G = (V, A)
  - V: a set of nodes (vertices)
  - A: a set of arcs (edges)
- Four types of relationships on Twitter can be defined
  - Follower
    - Node j is a follower of node i if the arc a = (j, i) is contained in A
  - Friend
    - Node j is a friend of node i if the arc a = (i, j) is contained in A
  - Mutual friend
    - Nodes i and j are mutual friends if both arcs a = (i, j) and a = (j, i) are contained in A
  - Stranger
    - Nodes i and j are strangers if neither arc a = (i, j) nor a = (j, i) is contained in A

Social Graph model
- A simple Twitter graph
  - A follows B: A is a follower of B, and B is a friend of A
  - B follows C and C follows B: B and C are mutual friends
  - A does not follow C and C does not follow A: A and C are strangers
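
The four relationships can be sketched in a few lines of Python. This is only an illustration of the definitions above, assuming an arc (i, j) is stored as a tuple meaning "i follows j"; the function name `relationship` and the set `arcs` are hypothetical and do not come from the slides.

```python
# Arcs of the simple example graph; (i, j) is assumed to mean "i follows j".
arcs = {("A", "B"), ("B", "C"), ("C", "B")}


def relationship(u, v, arcs):
    """Describe how node u relates to node v in the directed follow graph."""
    u_follows_v = (u, v) in arcs
    v_follows_u = (v, u) in arcs
    if u_follows_v and v_follows_u:
        return "mutual friends"
    if u_follows_v:
        return "follower"   # u is a follower of v
    if v_follows_u:
        return "friend"     # u is a friend of v, because v follows u
    return "strangers"


for pair in [("A", "B"), ("B", "A"), ("B", "C"), ("A", "C")]:
    print(pair, relationship(*pair, arcs))
```

Storing the arcs in a set keeps each membership check constant-time, which matters little here but would on a graph with millions of relationships.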

Social Graph model
- Twitter social graph

Outline
- Introduction
- Social Graph model
- Features
  - Graph-based features
  - Content-based features
- Data Set
- Spam Detection
- Experiments
- Evaluation
- Conclusion

Features: Graph-based features
- Twitter’s spam and abuse policy
  - “If you have a small number of followers compared to the amount of people you are following, it may be considered as a spam account”
- Three features
  - The number of friends
    - The outdegree of a node
  - The number of followers
    - The indegree of a node
  - The reputation of a user
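
A minimal sketch of how the three graph-based features might be computed. It follows the arc convention of the graph model, where an arc (i, j) means that i follows j, so followers are incoming arcs and friends are outgoing arcs. The slide does not spell out the reputation formula; the ratio followers / (followers + friends) used here is an assumption, and `graph_features` is a hypothetical name.

```python
def graph_features(user, arcs):
    """Return friend count, follower count and an assumed reputation score."""
    followers = sum(1 for (src, dst) in arcs if dst == user)  # incoming arcs
    friends = sum(1 for (src, dst) in arcs if src == user)    # outgoing arcs
    total = followers + friends
    # Assumed definition: the share of a user's links that are incoming.
    reputation = followers / total if total else 0.0
    return {"friends": friends, "followers": followers, "reputation": reputation}


# Example graph: A follows B, B follows C, C follows B.
arcs = {("A", "B"), ("B", "C"), ("C", "B")}
print(graph_features("B", arcs))
print(graph_features("A", arcs))
```

Under this assumed definition, an account that follows many users but is followed by few gets a reputation close to 0, which matches the policy quoted above.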

Features: Content-based features
- Duplicate tweets
  - An account may be considered spam if it posts duplicate content
  - Detected by measuring the Levenshtein distance (edit distance)
    - The minimum cost of transforming one string into another through a sequence of edit operations (deletion, insertion, and substitution of individual symbols)
    - The data are cleaned by removing the words containing “#”, “@”, and “http://”
  - The number of duplicate tweets can be used as a feature
    - Counted over the user’s 20 most recent tweets
    - Two tweets are considered duplicates only when they are exactly the same
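
The duplicate-tweet feature could be sketched as below. This is an illustration under stated assumptions, not the author's implementation: the marker set in `clean` ("#", "@", "http://") is assumed, tweets are assumed to be ordered oldest to newest, and all function names are hypothetical. Two cleaned tweets count as duplicates when their edit distance is zero.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def clean(tweet):
    """Drop words containing '#', '@' or 'http://' (assumed marker set)."""
    markers = ("#", "@", "http://")
    kept = [w for w in tweet.split() if not any(m in w for m in markers)]
    return " ".join(kept)


def duplicate_pairs(tweets, window=20):
    """Count pairs among the most recent tweets that are identical after cleaning."""
    recent = [clean(t) for t in tweets[-window:]]  # newest tweets are last
    return sum(1 for x, y in combinations(recent, 2) if levenshtein(x, y) == 0)


tweets = [
    "@bob buy now http://a.example/1",
    "@amy buy now http://a.example/2",
    "hello world",
]
print(duplicate_pairs(tweets))
```

Cleaning first means that two spam tweets differing only in the targeted username or the link still count as one duplicated message.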

Features: Content-based features
- Need for cleaning

Features: Content-based features
- HTTP links
  - An account may be considered spam if its updates consist mainly of links rather than personal updates
  - Twitter filters out URLs that link to known malicious sites
    - URL shortening services such as bit.ly give attackers an opportunity to spam
  - The number of tweets containing HTTP links can be used as a feature
- Diagram: a tweet with an HTTP link pointing directly to a malicious site, and a tweet whose HTTP link passes through a URL shortening service
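
Counting the tweets that contain a link might look like the sketch below. The regular expression is a deliberately simple assumption (anything starting with http:// or https://), and `count_link_tweets` is a hypothetical name. Shortened URLs are counted like any other link, since the destination cannot be judged from the text alone.

```python
import re

# Simplified URL matcher: a scheme followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def count_link_tweets(tweets):
    """Return how many tweets contain at least one HTTP(S) link."""
    return sum(1 for t in tweets if URL_PATTERN.search(t))


sample = ["see http://bit.ly/x", "no link here", "HTTPS://example.com is up"]
print(count_link_tweets(sample))
```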

Features: Content-based features
- Replies and mentions
  - You can send a reply message to another user
    - @username + message
  - You can also mention @username anywhere in the tweet
    - Message + @username + message
  - Twitter automatically collects all tweets containing your username
  - You can reply to anyone, whether or not they are your friends or followers
  - Spammers abuse this feature
  - The number of tweets containing a mention or reply can be used as a feature
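
A rough sketch of the reply and mention count, assuming a reply is a tweet that starts with @username and a mention is an @username anywhere in the text. The pattern and the function name are illustrative assumptions; the lookbehind is there so that e-mail addresses are not counted as mentions.

```python
import re

# "@name" that is not preceded by a word character (so "a@b.com" is skipped).
MENTION = re.compile(r"(?<!\w)@\w+")


def count_mention_or_reply_tweets(tweets):
    """Count replies and, separately, all tweets with a reply or mention."""
    replies = sum(1 for t in tweets if MENTION.match(t.lstrip()))
    with_at = sum(1 for t in tweets if MENTION.search(t))
    return {"replies": replies, "mentions_or_replies": with_at}


sample = [
    "@alice thanks!",
    "lunch with @bob today",
    "mail me at a@b.com",
    "plain tweet",
]
print(count_mention_or_reply_tweets(sample))
```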

Features: Content-based features
- Spam tweets using mentions or replies

Features: Content-based features
- Trending topics
  - The most-mentioned terms on Twitter at that moment, in that week, or in that month
  - Users can add a hashtag to a tweet
    - #tagname
  - If many tweets contain the same term
    - It may become a trending topic
  - Twitter considers an account spam
    - If it posts multiple unrelated updates to a topic using the # symbol

Data Set
- Collected over 3 weeks, from January 3 to January 24, 2010
- 25,847 users
- 500k tweets
- 49M follower/friend relationships

Spam Detection
- Several classification algorithms
  - Decision tree
  - Neural network
  - Support vector machine
  - k-nearest neighbors
  - Naïve Bayesian
- The naïve Bayesian classifier outperforms all the other methods
  - The Bayesian classifier is robust to noise
    - It uses posterior probability
  - A spam probability is calculated for each individual user based on its behavior, instead of applying a general rule

Spam Detection
- Naïve Bayesian classifier
  - X: each Twitter account is represented as a vector X of feature values
  - Y: one of two classes, spam and non-spam
  - The features are assumed to be conditionally independent given the class
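
Under the conditional independence assumption, the posterior factorizes as P(Y | X) ∝ P(Y) · ∏ P(x_i | Y). The sketch below illustrates that idea in plain Python. It assumes each feature follows a per-class Gaussian distribution, which the slides do not state, and the toy feature vectors and function names are hypothetical.

```python
import math
from collections import defaultdict


def fit(X, y):
    """Estimate the class prior and per-feature Gaussian parameters per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_posterior(model, x):
    """Unnormalized log P(Y | X) = log P(Y) + sum of log P(x_i | Y)."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[label] = score
    return scores


def predict(model, x):
    """Return the class with the highest posterior score."""
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)


# Toy accounts: [tweets with mentions out of 20, reputation]; values are made up.
X = [[20, 0.1], [18, 0.2], [2, 0.6], [1, 0.7], [3, 0.5]]
y = ["spam", "spam", "non-spam", "non-spam", "non-spam"]
model = fit(X, y)
print(predict(model, [19, 0.15]), predict(model, [2, 0.6]))
```

Working with log probabilities avoids numerical underflow when many small likelihoods are multiplied together.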

Experiments
- To evaluate the detection method
  - 500 Twitter user accounts are manually labeled into two classes (spam or not)
    - By reading the 20 most recent tweets
    - By checking the friends and followers of the user
  - The results show that around 1% of the accounts in the data set are spam
- Additional spam data are added to the data set
  - To simulate reality and avoid bias in the crawling and labeling methods
  - The study in Analytics, 2009, shows there is 3% spam on Twitter
    - Search “@spam” on Twitter and collect additional spam data
  - Only a small number of the results report real spam
  - The data set is mixed to contain around 3% spam data

Experiments
- Graph-based features
  - The number of friends for each Twitter account
  - Only 30% of spam accounts follow a large number of users
    - Spammers do not need to follow other users

Experiments
- Graph-based features
  - The number of followers for each Twitter account
  - Spam accounts usually do not have a large number of followers
    - Some spam accounts have a relatively large number of followers

Experiments
- Graph-based features
  - The reputation of each Twitter account
  - The reputation of most legitimate users is between 30% and 90%
    - Some spam accounts have a 100% reputation

Experiments
- Content-based features
  - The number of pairwise duplicates
  - Not all spam accounts post multiple duplicate tweets
    - We cannot depend on this feature alone

Experiments
- Content-based features
  - The number of mentions and replies
  - Most spam accounts have the maximum of 20 “@” symbols
    - This lures legitimate users into reading their spam messages or clicking their links

Experiments
- Content-based features
  - The number of links
  - Some legitimate users also include links in all their tweets; for example, some companies join Twitter to promote their own websites

Experiments
- Content-based features
  - The number of hashtag signs

Evaluation
- The evaluation of the overall process
  - Confusion matrix
  - Precision: P = a / (a + c)
  - Recall: R = a / (a + b)
  - F-measure: F = 2PR / (P + R)
- Each classifier is trained 10 times
  - Each time, 9 of the 10 partitions are used as training data
  - The confusion matrix is computed using the tenth partition as test data
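
The metrics and the 10-fold procedure can be sketched as follows. The confusion matrix itself is not reproduced in the transcript, so the meaning of a, b and c is inferred from the formulas: a is spam predicted as spam, b is spam predicted as non-spam, and c is non-spam predicted as spam. The function names and the way items are dealt into folds are illustrative assumptions.

```python
def scores(a, b, c):
    """Precision, recall and F-measure from confusion-matrix counts.

    a: spam predicted as spam, b: spam predicted as non-spam,
    c: non-spam predicted as spam (meanings inferred from the formulas).
    """
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def ten_fold_splits(items, k=10):
    """Yield (train, test) pairs; each partition serves as test data once."""
    folds = [items[i::k] for i in range(k)]  # deal items into k partitions
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


print(scores(8, 2, 2))
for train, test in ten_fold_splits(list(range(500))):
    print(len(train), len(test))
    break
```

With 500 labeled accounts, each round trains on 450 accounts and tests on the remaining 50. The sketch does not guard against empty denominators, which would need handling when a classifier predicts no spam at all.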

Evaluation
- The evaluation results
  - The naïve Bayesian classifier has the best overall performance
- Finally, the Bayesian classifier learned from the labeled data is applied to the entire data set
  - Information about 25,817 users in total
  - Precision of the spam detection system
    - 392 users are classified as spam
    - 348 users are real spam accounts and 44 users are false alarms
    - 89% precision (348 / 392 ≈ 0.89)

Conclusion
- Spam behavior is studied in a popular online SNS, Twitter
  - To formalize the problem, a social graph model is proposed
- Novel content-based and graph-based features are proposed
  - Graph-based features
    - The number of friends
    - The number of followers
    - The reputation of the user
  - Content-based features
    - The number of pairwise duplicates
    - The number of mentions and replies
    - The number of links
    - The number of hashtags
- The data set is analyzed and the performance of the detection system is evaluated

Conclusion
- Among the graph-based features
  - The proposed reputation feature has the best performance
  - Not many spam accounts follow a large number of users
  - Some spammers have many followers
- For the content-based features
  - Most spam accounts have multiple duplicate tweets
  - But not all spam accounts post multiple duplicate tweets
    - We cannot rely on this feature alone
- Several popular classification algorithms are studied and evaluated
- The naïve Bayesian classifier achieves 89% precision