BotGraph: Large Scale Spamming Botnet Detection Yao Zhao Yinglian Xie , Fang Yu , Qifa Ke , Yuan Yu , Yan Chen and Eliot Gillum ‡ EECS Department,

Slides:

Advertisements

Similar presentations

Wenke Lee and Nick Feamster Georgia Tech Botnet and Spam Detection in High-Speed Networks.

Advertisements

Wenke Lee and Nick Feamster Georgia Tech Botnet and Spam Detection in High-Speed Networks.

Zhiyun Qian, Z. Morley Mao (University of Michigan)

Detecting Spam Zombies by Monitoring Outgoing Messages Zhenhai Duan Department of Computer Science Florida State University.

Alex Cheung and Hans-Arno Jacobsen August, 14 th 2009 MIDDLEWARE SYSTEMS RESEARCH GROUP.

DECISION TREES. Decision trees  One possible representation for hypotheses.

A Survey of Botnet Size Measurement PRESENTED: KAI-HSIANG YANG ( 楊凱翔 ) DATE: 2013/11/04 1/24.

Detecting Malicious Flux Service Networks through Passive Analysis of Recursive DNS Traces Roberto Perdisci, Igino Corona, David Dagon, Wenke Lee ACSAC.

Analysis and Modeling of Social Networks Foudalis Ilias.

Indian Statistical Institute Kolkata

Error Tolerant Address Configuration for Data Center Networks with Malfunctioning Devices Xingyu Ma, Chengchen Hu, Kai Chen, Che Zhang, Hongtao Zhang,

Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.

Mutual Information Mathematical Biology Seminar

Wide-scale Botnet Detection and Characterization Anestis Karasaridis, Brian Rexroad, David Hoeflin.

Predicting the Semantic Orientation of Adjective Vasileios Hatzivassiloglou and Kathleen R. McKeown Presented By Yash Satsangi.

1 BotGraph: Large Scale Spamming Botnet Detection Yao Zhao EECS Department Northwestern University.

Leveraging Personal Knowledge for Robust Authentication Systems Mentor: Danfeng Yao Anitra Babic Chestnut Hill College Computer Science Department.

Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.

CBLOCK: An Automatic Blocking Mechanism for Large-Scale Deduplication Tasks Ashwin Machanavajjhala Duke University with Anish Das Sarma, Ankur Jain, Philip.

Towards Online Spam Filtering in Social Networks Hongyu Gao, Yan Chen, Kathy Lee, Diana Palsetia and Alok Choudhary Lab for Internet and Security Technology.

An Effective Defense Against Spam Laundering Paper by: Mengjun Xie, Heng Yin, Haining Wang Presented at:CCS'06 Presentation by: Devendra Salvi.

SocialFilter: Introducing Social Trust to Collaborative Spam Mitigation Michael Sirivianos Telefonica Research Telefonica Research Joint work with Kyungbaek.

Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine Shuang Hao, Nadeem Ahmed Syed, Nick Feamster, Alexander G. Gray,

Personalized Spam Filtering for Gray Mail Ming-wei Chang University of Illinois at Urbana-Champaign Wen-tau Yih and Robert McCann Microsoft Corporation.

Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee College of Computing, Georgia Institute of Technology USENIX Security '08 Presented by Lei Wu.

S PAMMING B OTNETS : S IGNATURES AND C HARACTERISTICS Introduction of AutoRE Framework.

CHAPTER 12 ADVANCED INTELLIGENT SYSTEMS © 2005 Prentice Hall, Decision Support Systems and Intelligent Systems, 7th Edition, Turban, Aronson, and Liang.

JOHN P. JOHN FANG YU YINGLIAN XIE MARTÍN ABADI ARVIND KRISHNAMURTHY PRESENTATION BY SAM KLOCK Searching the Searchers with SearchAudit.

BotNet Detection Techniques By Shreyas Sali

John P., Fang Yu, Yinglian Xie, Martin Abadi, Arvind Krishnamurthy University of California, Santa Cruz USENIX SECURITY SYMPOSIUM, August, 2010 John P.,

BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection Guofei Gu, Roberto Perdisci, Junjie Zhang, and.

Architectural Considerations for GEOPRIV/ECRIT Presentation given by Hannes Tschofenig.

1 Detecting Malicious Flux Service Networks through Passive Analysis of Recursive DNS Traces Speaker: Jun-Yi Zheng 2010/03/29.

© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. LogKV: Exploiting Key-Value.

ACT: Attachment Chain Tracing Scheme for Virus Detection and Control Jintao Xiong Proceedings of the 2004 ACM workshop on Rapid malcode Presented.

Jhih-sin Jheng 2009/09/01 Machine Learning and Bioinformatics Laboratory.

Studying Spamming Botnets Using Botlab 台灣科技大學資工所楊馨豪 2009/10/201 Machine Learning And Bioinformatics Laboratory.

Leveraging Asset Reputation Systems to Detect and Prevent Fraud and Abuse at LinkedIn Jenelle Bray Staff Data Scientist Strata + Hadoop World New York,

BotGraph: Large Scale Spamming Botnet Detection Yao Zhao, Yinglian Xie, Fang Yu, Qifa Ke, Yuan Yu, Yan Chen, and Eliot Gillum Speaker: 林佳宜.

Speaker: Hom-Jay Hom Date:2009/11/17 Botnet, and the CyberCriminal Underground IEEE 2008 Hsin chun Chen Clinton J. Mielke II.

By Gianluca Stringhini, Christopher Kruegel and Giovanni Vigna Presented By Awrad Mohammed Ali 1.

Spamming Botnets: Signatures and Characteristics Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Geoff Hulten, and Ivan Osipkov. SIGCOMM, Presented.

Spamming Botnets: Signatures and Characteristics Authors:Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Geoff Hulten+, Ivan Osipkov+ Presenter: Chia-Li.

A Content-Based Approach to Collaborative Filtering Brandon Douthit-Wood CS 470 – Final Presentation.

Search Worms, ACM Workshop on Recurring Malcode (WORM) 2006 N Provos, J McClain, K Wang Dhruv Sharma

SybilGuard: Defending Against Sybil Attacks via Social Networks.

Socialbots and its implication On ONLINE SOCIAL Networks Md Abdul Alim, Xiang Li and Tianyi Pan Group 18.

Tracking Malicious Regions of the IP Address Space Dynamically.

CISC 849 : Applications in Fintech Namami Shukla Dept of Computer & Information Sciences University of Delaware iCARE : A Framework for Big Data Based.

Interactive Control of Avatars Animated with Human Motion Data By: Jehee Lee, Jinxiang Chai, Paul S. A. Reitsma, Jessica K. Hodgins, Nancy S. Pollard Presented.

KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.

Speaker : Yu-Hui Chen Authors : Dinuka A. Soysa, Denis Guangyin Chen, Oscar C. Au, and Amine Bermak From : 2013 IEEE Symposium on Computational Intelligence.

A False Positive Safe Neural Network for Spam Detection Alexandru Catalin Cosoi

Spamming Botnets: Signatures and Characteristics Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Microsoft Research, Silicon Valley Geoff Hulten,

2009/6/221 BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure- Independent Botnet Detection Reporter : Fong-Ruei, Li Machine.

CERN - IT Department CH-1211 Genève 23 Switzerland t OIS Update on the anti spam system at CERN Pawel Grzywaczewski, CERN IT/OIS HEPIX fall.

How dynamic are IP addresses? Yinglian Xie, Fang Yu, Kannan Achan, Eliot Gillum, Moises Goldszmidt, Ted Wobber SIGCOMM ‘07 Chulhyun Park

An Effective Defense Against Spam Laundering Author: Mengjun Xie, Heng Yin, Haining Wang Presented At: CCS’ 06 Prepared By: Amit Shrivastava.

Published: USENIX HotBots, 2007 Presented: Wei-Cheng Xiao 2016/10/11.

Dec 14, 2014, Harvard University

Written by Qiang Cao, Xiaowei Yang, Jieqi Yu and Christopher Palow

Distributed Network Traffic Feature Extraction for a Real-time IDS

Written by Qiang Cao, Xiaowei Yang, Jieqi Yu and Christopher Palow

Empirical analysis of Chinese airport network as a complex weighted network Methodology Section Presented by Di Li.

De-anonymizing the Internet Using Unreliable IDs

Dieudo Mulamba November 2017

De-anonymizing the Internet Using Unreliable IDs By Yinglian Xie, Fang Yu, and Martín Abadi Presented by Peng Cheng 03/22/2017.

Generic and Automatic Address Configuration for Data Center Networks

GANG: Detecting Fraudulent Users in OSNs

Clustering The process of grouping samples so that the samples are similar within each group.

Presentation transcript:

BotGraph: Large Scale Spamming Botnet Detection Yao Zhao Yinglian Xie *, Fang Yu *, Qifa Ke *, Yuan Yu *, Yan Chen and Eliot Gillum ‡ EECS Department, Northwestern University Microsoft Research Silicon Valley * Microsoft Cooperation ‡ 1

2 Web-Account Abuse Attack Zombie (Compromised host) Spammer’s Server Captcha solver RDSXXTD3 User/Pwd

Problems and Challenges Detect Web-account Abuse with Hotmail Logs – Input: user activity traces (signup, login, - sending records) – Goal: stop aggressive account signup, limit outgoing spam Challenges – Attack is stealthy: individual account detection difficult – Attack is large scale: finding correlated activities >500 million accounts 300GB-400GB data per month – Low false positive and false negative rate 3

4 The BotGraph System A graph-based approach to attack detection – A large user-user graph to capture bot-account correlations – Identify 26M bot-accounts with a low false positive rate in two months Efficient implementation using Dryad/DryadLINQ – Graph construction/analysis is not easily parallelizable – Hundreds of millions of nodes, hundreds of billions of edges – Process 200GB-300GB data in 1.5 hours with a 240-machine cluster The first to provide a systematic solution to the new botnet-based web-account abuse attack

System Architecture 5 Login data Login graph Graph generation Random graph based clustering Verification & prune Sendmail data Spamming botnets Suspicious clusters Signup data EWMA based change detection Aggressive signups Verification & prune Signup botnets 3. Parallel algorithm on DryadLINQ clusters (ID, IP, time) (ID, time, # of recipients) (ID, IP, time) 1. History based algorithm to detect aggressive signups 2. Graph-based algorithm to find correlations

6 Detect Aggressive Signups Large prediction error Back to normal Date Number of Signup Accounts Jul2-Jul 3-Jul4-Jul 5-Jul6-Jul 7-Jul8-Jul 9-Jul Signup Count EWMA Prediction Simple and efficient Detect 20 million malicious accounts in 2 months Simple and efficient Detect 20 million malicious accounts in 2 months

System Architecture 7 Login data Login graph Graph generation Random graph based clustering Verification & prune Sendmail data Spamming botnets Suspicious clusters Signup data EWMA based change detection Aggressive signups Verification & prune Signup botnets 3. Parallelel Algorithm on DryadLinq clusters (ID, IP, time) (ID, time, # of recipients) (ID, IP, time) 1. History based algorithm on Signup detection 2. Graph-based algorithm on login detection

8 Observation: bot-accounts work collaboratively Normal Users – Share IP addresses in one AS with DHCP assignment Bot-users Detect Stealthy Accounts by Graphs A user-user graph to model behavior similarities

9 Observation: bot-accounts work collaboratively Normal Users – Share IP addresses in one AS with DHCP assignment Bot-users – Likely to share different IPs across ASes Detect Stealthy Accounts by Graphs A user-user graph to model behavior similarities

User-user Graph Node: Hotmail account Edge weight: # of ASes of the shared IP addresses – Consider edges with weight>1 Key Observations – Bot-users form a giant connected-component while normal users do not – Interpreted by the random graph theory 10 2 ASes 3 ASes 5 ASes 1 AS 4 ASes User1 User2 User3 User4 User5 User6

Random Graph Theory Random Graph G(n,p) – n nodes and each pair of nodes has an edge with probability p and average degree d = (n-1) · p Theorem – If d < 1, then with high probability the largest component in the graph has size less than O(log n) No large connected subgraph – If d > 1, with high probability the graph will contain a giant component with size at the order of O(n) Most nodes are in one connected subgraph 11

Graph-based Bot-user Detection Step 1: detect giant connected-components from the user-user graph Step 2: hierarchical algorithm to identify the correct groupings – Different bot-user groups may be mixed – Easier validation with correct group statistics – Difficult to choose a fixed edge-threshold Step 3: prune normal-user groups – Due to national proxies, cell phone users, facebook applications, etc. 12

Graph-based Bot-user Detection Step 1: detect giant connected-components from the user-user graph Step 2: hierarchical algorithm to identify the correct groupings – Different bot-user groups may be mixed – Easier validation with correct group statistics – Difficult to choose a fixed edge-threshold Step 3: prune normal-user groups – Due to national proxies, cell phone users, facebook applications, etc. 13

Hierarchical Bot-Group Extraction A B T=2 CD T=3 T=4 14 A

System Architecture 15 Login data Login graph Graph generation Random graph based clustering Verification & prune Sendmail data Spamming botnets Suspicious clusters Signup data EWMA based change detection Aggressive signups Verification & prune Signup botnets 3. Parallelel Algorithm on DryadLINQ clusters (ID, IP, time) (ID, time, # of recipients) (ID, IP, time) 1. History based algorithm on Signup detection 2. Graph-based algorithm on login detection

Parallel Implementation on DryadLINQ EWMA-based Signup Abuse Detection – Partition data by IP – Can achieve real-time detection User-User Graph Construction – Two algorithms and optimizations – Process 200GB-300GB data in 1.5 hours with 240 machines Connected Component Extraction – Divide and conquer – Process a graph of 8.6 billion edges in 7 minutes

Parallel Implementation on DryadLINQ EWMA-based Signup Abuse Detection – Partition data by IP – Can achieve real-time detection User-User Graph Construction – Two algorithms and optimizations – Process 200GB-300GB data in 1.5 hours with 240 machines Connected Component Extraction – Divide and conquer – Process a graph of 8.6 billion edges in 7 minutes

Graph Construction 1: Simple Data Parallelism Potential Edges – Select ID group by IP (Map) – Generate potential edges ( ID i, ID j, IP k ) (Reduce) Edge Weights – Select IP group by ID pair (Map) – Calculate edge weight (Reduce) Problem – Weight 1 edge is two orders of magnitude more than others – Their computation/communication is unnecessary

Graph Construction 2: Selective Filtering 19

Comparison of Two Algorithms Method 1 – Simple and scalable Method 2 – Optimized to filter out weight 1 edges – Utilize Join functionality, data compression and broadcast optimization 20

Detection Results Data description – Two datasets Jun 2007 and Jan 2008 – Three types of data Signup log (IP, ID, Time) Login log (IP, ID, Time) – 500M users and 200~300GB data per month Sendmail log (ID, time, # of recipients) – About 100GB per month 21

Detection of Signup Abuse 22

Detection by User-user Graph 23

Validations Manual Check – Sampled groups verified by the Hotmail team – Almost no false positives Comparison with Known Spamming Users – Detect 86% of complained accounts – Up to 54% of detected accounts are our new findings Sending Sizes per Group – Most groups have a sharp peak – The remaining contain several peaks False Positive Estimation – Naming pattern (0.44%) – Signup time (0.13%) 24

Possible to Evade BotGraph? Evade signup detection: Be stealthy Evade graph-based detection – Fixed IP/AS binding Low utilization rate Bot-accounts bound to one host are easy to be grouped – Be stealthy (sending as few s as normal user) 25 Severely limit attackers’ spam throughput

Conclusions A graph-based approach to attack detection – Identify 26M bot-accounts with a low false positive rate in two months Efficient implementation using Dryad/DryadLINQ – Process 200GB-300GB data in 1.5 hours with a 240- machine cluster 26 Large-scale data-mining for network security is effective and practical

Q & A? Thanks! 27