CrowdTarget: Target-Based Detection of Crowdturfing in Online Social Networks
Authors: Jonghyuk Song, Sangho Lee, Jong Kim (Dept. of CSE, POSTECH)

Presentation transcript:

CrowdTarget: Target-Based Detection of Crowdturfing in Online Social Networks
Authors: Jonghyuk Song, Sangho Lee, Jong Kim
Dept. of CSE, POSTECH, Pohang, Republic of Korea
Dr. Murtuza Jadliwala, CS898AB
Sreymoch Don
April 25, 2016

Agenda
1. Introduction
2. Some existing methods
3. High-level overview
4. Contribution
5. Malicious services background
6. Data collections for analysis
7. Analyze crowdturfing workers
8. Analyze crowdturfing targets
9. CrowdTarget
10. Analysis Results
11. Feature Robustness
12. Conclusion
13. Q & A

Introduction
- Crowdsourcing is the process of outsourcing tasks to human workers; crowdturfing is its use for malicious tasks.
- Detection challenges:
  - Bots vs. humans
  - Humans vs. humans
- Proposed detection method, "CrowdTarget":
  - Detects the target objects of crowdturfing (posts, pages, search keywords, URLs, ...).
  - Applies even when the tasks are performed by new accounts or casual workers.

Some Existing Methods
- Detection using account-based features
  - Inspects characteristics of user accounts (number of friends, number of posts, account age, etc.).
  - Vulnerable to simple evasion, such as performing malicious tasks while otherwise behaving normally.
- Detection using synchronized group activities
  - Bot accounts perform synchronized activities.
  - Human workers perform tasks without a schedule or on a flexible schedule, so their activity is not synchronized.

High-Level Overview

Contributions
- New detection approach: introduces "CrowdTarget", which analyzes the characteristics of crowdturfing targets.
- In-depth analysis: analyzes retweets generated by three groups (normal, crowdturfing, and black-market).
- High accuracy: evaluation results show a true-positive rate of 0.98 at a false-positive rate of 0.01.

Malicious Sites Background
- Black-market sites
  - Boost popularity (followers, likes, comments).
  - Operate a large number of bots to complete many tasks by their deadlines.
  - Charge according to various plans, each with a deadline.
  - Exhibit synchronized group activities.
- Crowdturfing sites
  - Boost popularity (followers, likes, comments).
  - Human workers perform the boosting tasks and get paid for them.
  - No synchronized group activities.
  - Task performance is validated automatically.

Data Collection
- Three groups of tweets: Twitter (normal), crowdturfing, and black-market.
- Only tweets that received >= 50 retweets.
- Collection period: November 2014 – February 2015.

Data Collection
Tweets and retweets collected for analysis:
- Normal: 1,044 Twitter accounts with >= 100,000 followers
  - Monitored selected verified Twitter accounts to collect tweets and retweets.
- Crowdturfing: 9 different crowdturfing sites
  - Registered at crowdturfing sites and retrieved tasks requesting retweets.
- Black-market: 5 different black-market sites
  - Wrote 282 tweets.
  - Registered at black-market sites.
  - Purchased retweets for their tweets.

Data Collection Method
The Twitter REST API returns at most the 100 latest retweets of a tweet. Two approaches were used to collect as many retweets as possible:
- Target tweets recently posted: use the Streaming API to monitor the retweets the tweet receives over the next 3 days.
- Target tweets posted in the past: use the Twitter search function to find as many retweets of the target as possible.

Legal & Ethical Issues
- Followed Thomas et al.'s approach to studying underground services ethically.
- Designed the data collection and experiments to follow guidelines from a formal review by the Institutional Review Board (IRB).
- User account anonymity: deleted detailed personal information.
- Minimized effects on underground services:
  - Retrieved only public tasks posted on crowdturfing sites.
  - Bought only a small number of retweets from black-market sites.
  - Avoided checking and contacting site operators.
- Avoided negative effects on Twitter and its users: deleted their fake accounts right after data collection was done.

Analyze Crowdturfing Workers
Are they humans or bots?
- Account popularity
  - Follower-to-following ratio
  - Number of received retweets per tweet
  - Klout score
- Synchronized group activity
  - Following similarity
  - Retweet similarity

Account Popularity: Follower-to-Following Relationship
Percentage of accounts that have more followers than followings:
- Crowdturfing: 70%
- Normal: 37%
- Black-market: 20%

Account Popularity: Number of Received Retweets per Tweet
Percentage of the tweets posted by each account that are retweeted more than once:
- Crowdturfing: 43%
- Normal: 5%
- Black-market: 4%

Account Popularity: Klout Scores
Median Klout scores:
- Crowdturfing: 41
- Normal: 33
- Black-market: 20

Account Popularity: Result
Crowdturfing accounts have the highest values for all three features:
- Follower-to-following relationship
- Number of retweets per tweet
- Klout scores
Based on these findings, crowdturfing accounts:
- Successfully boost their popularity by gaining followers and retweets from crowdturfing services.
- Differ from black-market accounts.
- Resemble influential users in OSNs.

Synchronized Group Activity
Figure panels: following similarity; retweet similarity.

Synchronized Group Activity: Result
- The crowdturfing and normal account groups show the same pattern: low following similarity and low retweet similarity.
- The black-market group has the highest following and retweet similarities.
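The slides do not say how following similarity and retweet similarity are measured. As a rough illustration only, the sketch below assumes Jaccard similarity between the sets of accounts followed (or tweets retweeted), averaged over every pair of accounts in a group. The function names and the sample data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_similarity(items_per_account):
    """Average Jaccard similarity over every pair of accounts in a group.

    items_per_account maps an account id to the set of accounts it follows
    (following similarity) or the set of tweets it retweeted (retweet similarity).
    """
    pairs = list(combinations(items_per_account.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up data: bot-like accounts share almost everything, workers share little.
bot_like = {"b1": {"x", "y", "z"}, "b2": {"x", "y", "z"}, "b3": {"x", "y"}}
worker_like = {"w1": {"a", "b"}, "w2": {"c", "d"}, "w3": {"a", "e"}}

print(round(mean_pairwise_similarity(bot_like), 3))
print(round(mean_pairwise_similarity(worker_like), 3))
```

Under this assumed measure, a synchronized group scores close to 1 and an independent group close to 0, which matches the qualitative pattern reported on the slide.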

Analyze Crowdturfing Targets
Crowdturfing targets: tweets receiving artificial retweets generated by crowdturfing workers.
Characteristics:
- Retweet time distribution (mean, standard deviation, skewness, and kurtosis)
- Twitter application
- Unreachable retweeters
- Click information

Retweet Time Distribution
Count the number of retweets generated in each hour after a tweet is created.

Retweet Time Distribution
- Mean: the average time difference between posting and retweeting.
- Standard deviation: how closely the retweets cluster around the mean time.
- Skewness: when a tweet is mostly retweeted (early or late relative to the mean).
- Kurtosis: how peaked the distribution is.
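A minimal sketch of how these four statistics could be computed from the hourly retweet counts described above. It assumes population (not sample) moments, excess kurtosis, and that a retweet in hour bin h counts as a delay of h hours; the slides do not specify any of these choices, and the function name is hypothetical.

```python
import math


def retweet_time_moments(hourly_counts):
    """Return (mean, std, skewness, excess kurtosis) of retweet delay in hours.

    hourly_counts[h] is the number of retweets made in hour h after posting.
    """
    total = sum(hourly_counts)
    if total == 0:
        raise ValueError("tweet has no retweets")
    mean = sum(h * c for h, c in enumerate(hourly_counts)) / total

    def central_moment(order):
        # Count-weighted central moment of the delay distribution.
        return sum(c * (h - mean) ** order for h, c in enumerate(hourly_counts)) / total

    variance = central_moment(2)
    if variance == 0:
        # Every retweet fell in the same hour; shape statistics are undefined.
        return mean, 0.0, 0.0, 0.0
    std = math.sqrt(variance)
    skewness = central_moment(3) / std ** 3
    kurtosis = central_moment(4) / variance ** 2 - 3.0
    return mean, std, skewness, kurtosis


# Made-up example: most retweets arrive in the first hour, then activity dies out.
print(retweet_time_moments([8, 1, 1, 0, 0, 0]))
```

A burst right after posting followed by a long quiet tail yields a small mean and positive skewness, while a symmetric pattern yields skewness near zero.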

Twitter Application & Unreachable Retweeter
- Application ratio: for each tweet receiving retweets, compute the ratio of the number of retweets generated by the dominant application to the total number of retweets.
- Percentage of retweets generated by the dominant application:
  - Crowdturfing: 90%
  - Black-market: 99%
  - Normal: 40%
- Unreachable retweeters:
  - Crowdturfing: 80% of tweets have 80% unreachable retweeters.
  - Normal: < 10% of tweets have 80% unreachable retweeters.
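The two per-tweet ratios above can be sketched as follows. The slides do not define how the set of reachable accounts is built, so it is passed in as an input here; treating it as, for instance, the followers of the posting user is an assumption. Function names and sample data are hypothetical.

```python
from collections import Counter


def dominant_application_ratio(retweet_sources):
    """Share of a tweet's retweets that came from its most-used application.

    retweet_sources holds one application name per retweet.
    """
    if not retweet_sources:
        return 0.0
    _, top_count = Counter(retweet_sources).most_common(1)[0]
    return top_count / len(retweet_sources)


def unreachable_retweeter_ratio(retweeters, reachable_accounts):
    """Share of retweeters that are not in the reachable set.

    reachable_accounts is assumed to contain accounts that could have seen
    the tweet through the follow graph.
    """
    if not retweeters:
        return 0.0
    reachable = set(reachable_accounts)
    unreachable = sum(1 for user in retweeters if user not in reachable)
    return unreachable / len(retweeters)


# Made-up retweets of one tweet: nine via one site-specific app, one via a phone client.
sources = ["boost_site_app"] * 9 + ["phone_client"]
print(dominant_application_ratio(sources))

# Made-up retweeters: only "a" could have reached the tweet through the follow graph.
print(unreachable_retweeter_ratio(["a", "b", "c", "d", "e"], {"a"}))
```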

Click Information
- Extract tweets containing bit.ly and goo.gl links from the database:
  - Normal: 6,024
  - Crowdturfing: 3,093
  - Black-market: 282
- Crawl the click analytics and extract the number of clicks:
  - Normal: over 80% of links receive more clicks than retweets.
  - Crowdturfing: about 90% of links receive fewer clicks than retweets.
  - Black-market: most tweets are never clicked.
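As a small worked illustration of the comparison above, the sketch below computes the click-to-retweet ratio for a link and the share of links that receive more clicks than retweets. The numbers are invented for the example and are not the study's data; the function names are hypothetical.

```python
def click_to_retweet_ratio(clicks, retweets):
    """Ratio of link clicks to retweets for one tweet containing a short URL."""
    if retweets <= 0:
        raise ValueError("tweet must have at least one retweet")
    return clicks / retweets


def share_with_more_clicks_than_retweets(links):
    """Fraction of (clicks, retweets) pairs where clicks exceed retweets."""
    links = list(links)
    if not links:
        return 0.0
    return sum(1 for clicks, retweets in links if clicks > retweets) / len(links)


# Made-up (clicks, retweets) pairs.
normal_links = [(500, 60), (120, 80), (40, 70), (900, 100), (300, 55)]
crowdturfing_links = [(3, 90), (0, 120), (10, 75), (95, 60)]

print(share_with_more_clicks_than_retweets(normal_links))
print(share_with_more_clicks_than_retweets(crowdturfing_links))
print(click_to_retweet_ratio(3, 90))
```

A ratio well below 1 means the tweet was retweeted far more often than its link was opened, which is the pattern the slide associates with crowdturfing and black-market retweets.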

CrowdTarget
- Prepare training and testing data:
  - Set the ratio of malicious tweets to 1% of total tweets.
  - Randomly duplicate normal tweets until they number 99 times the malicious tweets.
- Features used by the CrowdTarget classifiers:
  - Retweet time distribution: mean, standard deviation, skewness, kurtosis
  - Ratio of the dominant application used
  - Ratio of unreachable retweeters
  - Ratio of #clicks to #retweets, for tweets containing URLs
- Normalize all feature values.
- Test several classifiers from the scikit-learn library:
  - AdaBoost
  - Gaussian naïve Bayes
  - K-nearest neighbors
- Validate the classification results with 10-fold cross-validation.
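The pipeline on this slide can be sketched end to end with the standard library alone. This is not the authors' implementation: the slide reports scikit-learn classifiers, whereas the sketch swaps in a small hand-written k-nearest-neighbors vote so it runs without dependencies. Min-max scaling is an assumption (the normalization formula is not in the transcript), the data is synthetic, and every function name is hypothetical.

```python
import random


def oversample_normal(normal, malicious, rng, normal_per_malicious=99):
    """Randomly duplicate normal rows until there are 99 per malicious row."""
    target = normal_per_malicious * len(malicious)
    extra = [rng.choice(normal) for _ in range(max(0, target - len(normal)))]
    return normal + extra


def min_max_normalize(rows):
    """Scale every feature column to [0, 1] (assumed normalization)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(tx, x)), ty)
        for tx, ty in zip(train_x, train_y)
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def cross_validate(xs, ys, folds=10, k=3, seed=0):
    """Return (TPR, FPR) pooled over `folds` cross-validation folds."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    tp = fp = positives = negatives = 0
    for fold in range(folds):
        test_idx = set(order[fold::folds])
        train_idx = [i for i in order if i not in test_idx]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            predicted = knn_predict(train_x, train_y, xs[i], k)
            if ys[i] == 1:
                positives += 1
                tp += predicted
            else:
                negatives += 1
                fp += predicted
    return tp / positives, fp / negatives


rng = random.Random(42)
# Synthetic rows: (dominant-application ratio, unreachable-retweeter ratio).
malicious = [(rng.uniform(0.8, 1.0), rng.uniform(0.7, 1.0)) for _ in range(10)]
normal_seed = [(rng.uniform(0.1, 0.5), rng.uniform(0.0, 0.2)) for _ in range(200)]
normal = oversample_normal(normal_seed, malicious, rng)  # malicious share becomes 1%

xs = min_max_normalize(malicious + normal)
ys = [1] * len(malicious) + [0] * len(normal)
tpr, fpr = cross_validate(xs, ys)
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

Two simplifications are worth noting: normalizing before splitting lets test rows influence the scaling, and duplicated normal rows can land in both training and test folds. A more careful setup would fit the scaler and do the duplication inside each training fold.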

Basic Classification: Normal vs. Malicious Tweets Without Click Information
True-positive rate:
- Gaussian naïve Bayes: 0.87
- AdaBoost: 0.95
- K-nearest neighbors: 0.96
Area under the receiver operating characteristic (ROC) curve:
- Gaussian naïve Bayes: 0.99
- AdaBoost:
- K-nearest neighbors:

Classification With Click Information
- Extract tweets containing bit.ly and goo.gl links from the dataset.
- Classify them with an additional link-based feature (#clicks / #retweets).
- Tested with the k-nearest neighbors algorithm only.
- Result: TPR 0.98, AUC 0.993

Error Analysis
- False negatives
  - Crowdturfing tweets that received a small number of retweets were misjudged.
  - 50% of the undetected crowdturfing tweets were mostly retweeted by reachable accounts.
  - A few links in undetected crowdturfing tweets received more clicks than retweets.
- False positives
  - A few verified accounts were retweeted by automated applications.
  - Tweets embedded in websites were classified as malicious.

Feature Robustness: Retweet Time Distribution
- Arranging a retweet schedule that resembles a normal retweet time distribution is impossible because crowdturfing workers act independently.
- Controlling every boosting task by installing a program on workers' devices is not desirable.
- Services that handle every boosting task at their own server can be detected by OSNs through the use of the same IP addresses.
- Using bot accounts to perform tasks secretly can be costly.

Twitter Application
- By assigning different applications to different groups of workers, services could eliminate the dominant application.
- However, they cannot create an arbitrarily large number of Twitter applications, because Twitter restricts the number of applications created per day and per account.
- It is also difficult to control the ratio of the most dominant application exactly, because workers can retweet any tweet at any time.

Unreachable Retweeters
It is not feasible for crowdturfing workers to follow the posting user of a tweet:
- Workers would keep receiving the posting user's future tweets.
- A growing following count can decrease a worker's popularity on Twitter.
- Workers cannot follow the posting user when they have few followers or have recently followed many accounts.

Click Information
It is not feasible to ask crowdturfing workers to click on a link in a tweet while retweeting it:
- Workers do not want to click on a link that might be malicious.
- Artificial clicks are expected to differ from normal clicks in time, geographical location, user agent, etc.

Conclusion
- The manipulation patterns of the target objects are maintained regardless of which evasion techniques the crowdturfing accounts use.
- Through these observations, tweets that received retweets from crowdturfing sites can be distinguished from tweets that received retweets from normal Twitter users.
- The evaluation results show that CrowdTarget could detect crowdturfing retweets on Twitter with a TPR of 0.98 at an FPR of 0.01.

Questions? Comments? Concerns? Suggestions?