Man vs. Machine: Adversarial Detection of Malicious Crowdsourcing Workers. Gang Wang, Tianyi Wang, Haitao Zheng, Ben Y. Zhao, UC Santa Barbara, USENIX Security 2014.


Man vs. Machine: Adversarial Detection of Malicious Crowdsourcing Workers
Gang Wang, Tianyi Wang, Haitao Zheng, Ben Y. Zhao, UC Santa Barbara, USENIX Security 2014
Fall 2014. Presenter: Kun Sun, Ph.D. Most slides are borrowed from

Outline
 Motivation
 Detection of Crowdturfing
 Adversarial Machine Learning Attacks
 Conclusion

Machine Learning for Security
 Machine learning (ML) to solve security problems
  Spam detection
  Intrusion/malware detection
  Authentication
  Identifying fraudulent accounts (Sybils) and content
 Example: ML for Sybil detection in social networks
(Diagram: known labeled samples → training → classifier → label unknown accounts)
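The pipeline on this slide (known labeled samples train a classifier, which then labels unknown accounts) can be illustrated with a minimal nearest-centroid classifier. This is a toy sketch, not the paper's method; the two features and all numbers are hypothetical.

```python
def train_centroids(samples):
    """Training step: average the feature vectors of known samples per class."""
    sums, counts = {}, {}
    for features, label in samples:
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
    return {lbl: [v / counts[lbl] for v in vec] for lbl, vec in sums.items()}

def classify(centroids, features):
    """Detection step: assign an unknown account to the nearest class centroid."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda lbl: sqdist(centroids[lbl]))

# Known samples: (features, label); features = [tweets per day, follower/followee ratio]
known = [([2.0, 1.0], "benign"), ([3.0, 1.2], "benign"),
         ([40.0, 0.1], "sybil"), ([50.0, 0.2], "sybil")]
model = train_centroids(known)
print(classify(model, [45.0, 0.15]))  # → sybil
```

Real deployments replace the centroid rule with the classifiers discussed later in the deck, but the train-then-label flow is the same.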

Adversarial Machine Learning
 Key vulnerabilities of machine learning systems
  ML models derived from fixed datasets
  Assuming similar distribution of training and real-world data
 Strong adversaries in ML systems
  Aware of usage, reverse-engineering ML systems
  Adaptive evasion, tampering with the trained model
 Practical adversarial attacks
  What are the practical constraints for adversaries?
  With constraints, how effective are adversarial attacks?

Context: Malicious Crowdsourcing
 Astroturfing: using fake names to post positive/negative reviews of your own products online (unlawful in the US; any circumvention?)
 New threat: malicious crowdsourcing = crowdturfing
  Hiring a large army of real users for malicious attacks
  Fake customer reviews, rumors, targeted spam
 Most existing defenses fail against real users (e.g. CAPTCHA)

Online Crowdturfing Systems
 Online crowdturfing systems (services)
  Connect customers with online users willing to spam for money
  Sites located across the globe, e.g. China, US, India
 Crowdturfing in China
  Largest crowdturfing sites: ZhuBaJie (ZBJ) and SanDaHa (SDH)
  Million-dollar industry, tens of millions of tasks finished
(Diagram: customer → crowdturfing site → crowd workers → target network)

Machine Learning vs. Crowdturfing
 Machine learning to detect crowdturfing workers
  Simple methods usually fail (e.g. CAPTCHA, rate limits)
  Machine learning: more sophisticated modeling of user behaviors
  "You are how you click" [USENIX '13]
 Perfect context to study adversarial machine learning
  1. Highly adaptive workers seeking evasion
  2. Crowdturfing site admins can tamper with training data by changing all worker behaviors

Goals and Questions
 Our goals
  Develop a defense against crowdturfing on Weibo (Chinese Twitter)
  Understand the impact of adversarial countermeasures and the robustness of machine learning classifiers
 Key questions
  Which ML algorithms can accurately detect crowdturfing workers?
  What are possible ways for adversaries to evade classifiers?
  Can adversaries attack ML models by tampering with training data?

Outline
 Motivation
 Detection of Crowdturfing
 Adversarial Machine Learning Attacks
 Conclusion

Methodology
 Detect crowdturf workers on Weibo
 Adversarial machine learning attacks
  Evasion attack: workers evade classifiers
  Poisoning attack: crowdturfing admins tamper with training data
(Diagram: training data → training (e.g. SVM) → classifier; the poisoning attack targets the training data, the evasion attack targets the classifier)

Ground-truth Dataset
 Crowdturfing campaigns targeting Weibo
  Two largest crowdturfing sites: ZBJ and SDH
  Complete historical transaction records for 3 years ( )
  20,416 Weibo campaigns: > 1M tasks, 28,947 Weibo accounts
 Collect Weibo profiles and their latest tweets
  Workers: 28K Weibo accounts used by ZBJ and SDH workers
  Baseline users: 371K users collected via snowball sampling

Features to Detect Crowd-workers
 Search for behavioral features to detect workers
 Observations
  Aged, well-established accounts
  Balanced follower-followee ratio
  Use of cover traffic
  Task-driven nature: active at posting, but less bidirectional interaction
 Final set of useful features: 35
  Baseline profile fields (9)
  User interaction (comment, retweet) (8)
  Tweeting device and client (5)
  Burstiness of tweeting (12)
  Periodical patterns (1)
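A few of these behavioral features can be sketched as simple computations over an account's profile and timeline. The field names and the choice of variance over inter-tweet gaps as a burstiness proxy are illustrative assumptions, not the paper's exact definitions.

```python
def extract_features(account):
    """Sketch of three behavioral features; field names are hypothetical."""
    # Follower-followee balance (workers tend to have skewed ratios)
    ff_ratio = account["followers"] / max(account["followees"], 1)
    # Bidirectional interaction rate (workers post a lot but interact little)
    interactions = account["comments"] + account["retweets"]
    interaction_rate = interactions / max(account["tweets"], 1)
    # Burstiness proxy: variance of inter-tweet gaps (task-driven posting is bursty)
    times = account["tweet_times"]
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    if not gaps:
        burstiness = 0.0
    else:
        mean_gap = sum(gaps) / len(gaps)
        burstiness = sum((g - mean_gap) ** 2 for g in gaps) / len(gaps)
    return {"ff_ratio": ff_ratio, "interaction_rate": interaction_rate,
            "burstiness": burstiness}

print(extract_features({"followers": 100, "followees": 200, "comments": 5,
                        "retweets": 5, "tweets": 100,
                        "tweet_times": [0, 10, 20, 30]}))
```

Each account then becomes a fixed-length feature vector that any of the classifiers on the next slide can consume.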

Performance of Classifiers
 Building classifiers on ground-truth data
  Random Forests (RF)
  Decision Tree (J48)
  SVM, radial basis kernel (SVMr)
  SVM, polynomial kernel (SVMp)
  Naïve Bayes (NB)
  Bayes Network (BN)
  Best result: Random Forests, 95% accuracy
 Classifiers dedicated to detecting "professional" workers
  Workers who performed > 100 tasks
  Responsible for 90% of total spam
  More accurate to detect the professionals: 99% accuracy
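This comparison can be sketched with scikit-learn counterparts of the slide's classifiers (CART's `DecisionTreeClassifier` stands in for J48, and the Bayes network is omitted because scikit-learn has no such classifier). The data here is a synthetic stand-in with 35 features per account, not the paper's Weibo dataset, so the scores are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ground-truth data: 35 behavioral features per account
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 35)),   # baseline users
               rng.normal(2.0, 1.0, (200, 35))])  # crowdturfing workers
y = np.array([0] * 200 + [1] * 200)

classifiers = {
    "RF":   RandomForestClassifier(n_estimators=100, random_state=0),
    "J48":  DecisionTreeClassifier(random_state=0),  # CART stand-in for J48
    "SVMr": SVC(kernel="rbf"),
    "SVMp": SVC(kernel="poly"),
    "NB":   GaussianNB(),
}
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in classifiers.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Cross-validated accuracy is the comparison metric; on real, less separable data the ranking across classifiers matters more than the absolute numbers.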

Outline
 Motivation
 Detection of Crowdturfing
 Adversarial Machine Learning Attacks
  Evasion attack
  Poisoning attack
 Conclusion

Evasion Attack
(Diagram: training data → model training (e.g. SVM) → classifier → detection; the evasion attack targets the detection stage)

Attack #1: Adversarial Evasion
 Individual workers as adversaries
  Workers seek to evade a classifier by mimicking normal users
  Identify the key set of features to modify for evasion
 Attack strategy depends on the worker's knowledge of the classifier
  Learning algorithm, feature space, training data
 What knowledge is practically available? How do different knowledge levels impact workers' evasion?

A Set of Evasion Models
 Optimal evasion scenarios
  Per-worker optimal: each worker has perfect knowledge of the classifier
  Global optimal: knows the direction of the boundary
  Feature-aware evasion: knows the feature ranking
 Practical evasion scenario
  Only knows normal users' statistics
  Estimate which of their features are most "abnormal"
(Diagram: practical vs. optimal evasion paths across the classification boundary)
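The practical scenario can be sketched as ranking a worker's features by how abnormal they look against normal-user statistics and moving the most abnormal ones toward the normal mean. The z-score ranking, feature names, and numbers are illustrative assumptions, not the paper's exact procedure.

```python
def rank_abnormal_features(worker, normal_mean, normal_std):
    """Rank this worker's features by |z-score| against normal-user statistics."""
    z = {f: abs(worker[f] - normal_mean[f]) / normal_std[f] for f in worker}
    return sorted(z, key=z.get, reverse=True)

def evade(worker, normal_mean, normal_std, n_changes):
    """Move the n most abnormal features to the normal-user mean."""
    evaded = dict(worker)
    for f in rank_abnormal_features(worker, normal_mean, normal_std)[:n_changes]:
        evaded[f] = normal_mean[f]
    return evaded

worker = {"tweets_per_day": 50.0, "ff_ratio": 0.1}
normal_mean = {"tweets_per_day": 5.0, "ff_ratio": 1.0}
normal_std = {"tweets_per_day": 5.0, "ff_ratio": 0.5}
# tweets_per_day is 9 std devs from normal, ff_ratio only 1.8: fix tweets first
print(evade(worker, normal_mean, normal_std, 1))
```

Unlike the optimal scenarios, this requires no knowledge of the classifier at all, only aggregate statistics of normal users, which is what makes it "practical".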

Evasion Attack Results
 Evasion is highly effective with perfect knowledge, but less effective in practice
  Optimal attack: 99% of workers succeed with 5 feature changes
  Practical attack: need to alter 20 features
 Most classifiers are vulnerable to evasion
  Random Forests are slightly more robust (J48 Tree is the worst)
 Takeaway: no single classifier is robust against evasion; the key is to limit adversaries' knowledge

Poisoning Attack
(Diagram: training data → model training (e.g. SVM) → classifier → detection; the poisoning attack targets the training data)

Attack #2: Poisoning Attack
 Crowdturfing site admins as adversaries
  Highly motivated to protect their workers; centrally control workers
  Tamper with the training data to manipulate model training
 Two practical poisoning methods
  Injection attack: inject mislabeled samples (normal accounts labeled as workers) into the training data → wrong model, false positives
  Altering attack: alter worker behaviors uniformly by enforcing system policies → harder to train accurate classifiers

Injecting Poison Samples
 Inject benign accounts labeled as "workers" into the training data
  Aim: trigger false positives during detection
 Results
  Poisoning attack is highly effective: 10% poison samples → boost false positives by 5%
  J48 Tree is more vulnerable than others
  A more accurate classifier can be more vulnerable
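The mechanics of the injection attack can be illustrated with a deliberately tiny 1-D "classifier" that thresholds at the midpoint of the two class means: injecting benign-looking samples under the "worker" label drags the decision threshold into the benign region, creating false positives. All numbers are made up for illustration; this is not the paper's experimental setup.

```python
def train_threshold(benign, workers):
    """Toy 1-D classifier: predict 'worker' above the midpoint of class means."""
    mean_b = sum(benign) / len(benign)
    mean_w = sum(workers) / len(workers)
    return (mean_b + mean_w) / 2

benign_train = [1.0, 2.0, 3.0]    # e.g. tweets per day of normal users
worker_train = [8.0, 9.0, 10.0]
t_clean = train_threshold(benign_train, worker_train)          # midpoint: 5.5

# Poisoning: benign-looking samples injected under the 'worker' label
poison = [2.0, 2.0, 3.0]
t_poisoned = train_threshold(benign_train, worker_train + poison)

# False-positive rate on held-out benign accounts
benign_test = [1.5, 3.0, 4.0, 4.5]
fp_clean = sum(x > t_clean for x in benign_test) / len(benign_test)
fp_poisoned = sum(x > t_poisoned for x in benign_test) / len(benign_test)
print(f"threshold {t_clean} -> {t_poisoned:.2f}, FP rate {fp_clean} -> {fp_poisoned}")
```

The same pull-the-boundary effect is what inflates false positives in the real classifiers, just in a 35-dimensional feature space instead of one dimension.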

Discussion
 Key observations
  Accurate machine learning classifiers can be highly vulnerable
  No single classifier excels in all attack scenarios; Random Forests and SVM are more robust than Decision Tree
  Adversarial attack impact depends heavily on adversaries' knowledge
 Moving forward: improve robustness of ML classifiers
  Multiple classifiers in one detector (ensemble learning)
  Adversarial analysis in unsupervised learning

Questions?