Design and Evaluation of a Real- Time URL Spam Filtering Service Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, Dawn Song University of California,

Slides:



Advertisements
Similar presentations
Wenke Lee and Nick Feamster Georgia Tech Botnet and Spam Detection in High-Speed Networks.
Advertisements

1 Network-Level Spam Detection Nick Feamster Georgia Tech.
Click Trajectories: End-to-End Analysis of the Spam Value Chain Author : Kirill Levchenko, Andreas Pitsillidis, Neha Chachra, Brandon Enright, M’ark F’elegyh’azi,
Reporter: Jing Chiu Advisor: Yuh-Jye Lee /7/181Data Mining & Machine Learning Lab.
Detecting Malicious Flux Service Networks through Passive Analysis of Recursive DNS Traces Roberto Perdisci, Igino Corona, David Dagon, Wenke Lee ACSAC.
@ SPAM : T HE U NDERGROUND ON 140 C HARACTERS OR L ESS Chris Grier, Vern Paxson, Michael Zhang University of California, Berkeley Kurt Thomas University.
Hulk: Eliciting Malicious Behavior in Browser Extensions
RB-Seeker: Auto-detection of Redirection Botnet Presenter: Yi-Ren Yeh Authors: Xin Hu, Matthew Knysz, Kang G. Shin NDSS 2009 The slides is modified from.
1 CANTINA : A Content-Based Approach to Detecting Phishing Web Sites WWW Yue Zhang, Jason Hong, and Lorrie Cranor.
Report : 鄭志欣 Advisor: Hsing-Kuo Pao 1 Learning to Detect Phishing s I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing s. In Proceedings.
Design and Evaluation of a Real-Time URL Spam Filtering Service
ABUSING BROWSER ADDRESS BAR FOR FUN AND PROFIT - AN EMPIRICAL INVESTIGATION OF ADD-ON CROSS SITE SCRIPTING ATTACKS Presenter: Jialong Zhang.
CANTINA: A Content-Based Approach to Detecting Phishing Web Sites Yue Zhang University of Pittsburgh Jason I. Hong, Lorrie F. Cranor Carnegie Mellon University.
Page-level Template Detection via Isotonic Smoothing Deepayan ChakrabartiYahoo! Research Ravi KumarYahoo! Research Kunal PuneraUniv. of Texas at Austin.
Prophiler: A fast filter for the large-scale detection of malicious web pages Reporter : 鄭志欣 Advisor: Hsing-Kuo Pao Date : 2011/03/31 1.
Towards Online Spam Filtering in Social Networks Hongyu Gao, Yan Chen, Kathy Lee, Diana Palsetia and Alok Choudhary Lab for Internet and Security Technology.
1. Introduction The underground Internet economy Web-based malware The system analyzing the post-infection network behavior of web-based malware How do.
Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine Shuang Hao, Nadeem Ahmed Syed, Nick Feamster, Alexander G. Gray,
Norman SecureTide Powerful cloud solution to stop spam and threats before it reaches your network.
Jarhead Analysis and Detection of Malicious Java Applets Johannes Schlumberger, Christopher Kruegel, Giovanni Vigna University of California Annual Computer.
Presentation by Kathleen Stoeckle All Your iFRAMEs Point to Us 17th USENIX Security Symposium (Security'08), San Jose, CA, 2008 Google Technical Report.
Deduplication CSCI 572: Information Retrieval and Search Engines Summer 2010.
- -Heather Rodriguez, - Shilpa Reddy. Goal - One stop shop for Undergrad Competitions.
May l Washington, DC l Omni Shoreham The ROI of Messaging Security JF Sullivan VP Marketing, Cloudmark, Inc.
WARNINGBIRD: A Near Real-time Detection System for Suspicious URLs in Twitter Stream.
PhishScore: Hacking Phishers’ Minds
Authors: Gianluca Stringhini Christopher Kruegel Giovanni Vigna University of California, Santa Barbara Presenter: Justin Rhodes.
WEB SPOOFING by Miguel and Ngan. Content Web Spoofing Demo What is Web Spoofing How the attack works Different types of web spoofing How to spot a spoofed.
Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science.
Using Social Networks to Harvest Addresses Reporter: Chia-Yi Lin Advisor: Chun-Ying Huang Mail: 9/14/
SURF:SURF: Detecting and Measuring Search Poisoning Long Lu, Roberto Perdisci, and Wenke Lee Georgia Tech and University of Georgia.
Suspended Accounts in Retrospect: An Analysis of Twitter Spam Kurt Thomas, Chris Grier, Vern Paxson, Dawn Song University of California, Berkeley International.
Reporter: Li, Fong Ruei National Taiwan University of Science and Technology 9/19/2015Slide 1 (of 32)
1 Detecting Malicious Flux Service Networks through Passive Analysis of Recursive DNS Traces Speaker: Jun-Yi Zheng 2010/03/29.
11 CANTINA: A Content- Based Approach to Detecting Phishing Web Sites Reporter: Gia-Nan Gao Advisor: Chin-Laung Lei 2010/6/7.
FluXOR: Detecting and Monitoring Fast-Flux Service Networks Emanuele Passerini, Roberto Paleari, Lorenzo Martignoni, and Danilo Bruschi 5th international.
Department of Information Engineering The Chinese University of Hong Kong A Framework for Monitoring and Measuring a Large-Scale Distributed System in.
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
Cloak and Dagger: Dynamics of Web Search Cloaking David Y. Wang, Stefan Savage, and Geoffrey M. Voelker University of California, San Diego 左昌國 Seminar.
WebVizOr: A Fault Detection Visualization Tool for Web Applications Goal: Illustrate and evaluate the uses of WebVizOr, a new tool to aid web application.
Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.
Spamscatter: Characterizing Internet Scam Hosting Infrastructure By D. Anderson, C. Fleizach, S. Savage, and G. Voelker Presented by Mishari Almishari.
Studying Spamming Botnets Using Botlab 台灣科技大學資工所 楊馨豪 2009/10/201 Machine Learning And Bioinformatics Laboratory.
By Gianluca Stringhini, Christopher Kruegel and Giovanni Vigna Presented By Awrad Mohammed Ali 1.
Understanding the Network-Level Behavior of Spammers Author: Anirudh Ramachandran, Nick Feamster SIGCOMM ’ 06, September 11-16, 2006, Pisa, Italy Presenter:
Lexical Feature Based Phishing URL Detection Using Online Learning Reporter: Jing Chiu Advisor: Yuh-Jye Lee /3/17Data.
Delivery for Spam Mitigation Usenix Security 2012 Gianluca Stringhini, Manuel Egele, Apostolis Zarras, Thorsten Holz, Christopher.
Detecting Phishing in s Srikanth Palla Ram Dantu University of North Texas, Denton.
Reporter: Jing Chiu Advisor: Yuh-Jye Lee /3/17 1 Data Mining and Machine Learning Lab.
Studying Spamming Botnets Using Botlab
Trends in Circumventing Web-Malware Detection UTSA Moheeb Abu Rajab, Lucas Ballard, Nav Jagpal, Panayiotis Mavrommatis, Daisuke Nojiri, Niels Provos, Ludwig.
Security Analytics Thrust Anthony D. Joseph (UCB) Rachel Greenstadt (Drexel), Ling Huang (Intel), Dawn Song (UCB), Doug Tygar (UCB)
The Koobface Botnet and the Rise of Social Malware Kurt Thomas David M. Nicol
Trends and Lessons from Three Years Fighting Malicious Extensions Nav Jagpal, Eric Dingle, Jean-Philippe, Gravel Panayiotis, Mavrommatis Niels, Provos.
A Framework for Detection and Measurement of Phishing Attacks Reporter: Li, Fong Ruei National Taiwan University of Science and Technology 2/25/2016 Slide.
Don’t Follow me : Spam Detection in Twitter January 12, 2011 In-seok An SNU Internet Database Lab. Alex Hai Wang The Pensylvania State University International.
Spamming Botnets: Signatures and Characteristics Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Microsoft Research, Silicon Valley Geoff Hulten,
Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida Universidade Federal de Minas Gerais Belo Horizonte, Brazil ACSAC 2010 Fabricio.
1 Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine Speaker: Jun-Yi Zheng 2010/01/18.
CERN - IT Department CH-1211 Genève 23 Switzerland t OIS Update on the anti spam system at CERN Pawel Grzywaczewski, CERN IT/OIS HEPIX fall.
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
Heat-seeking Honeypots: Design and Experience John P. John, Fang Yu, Yinglian Xie, Arvind Krishnamurthy and Martin Abadi WWW 2011 Presented by Elias P.
CPSC FALL 2015TEAM P6 Real-time Detection System for Suspicious URLs Submitted by T.ANUPCHANDRA V.KRANTHI SUDHA CH.KRISHNAPRASAD Under Guidance.
Identifying Suspicious URLs: An Application of Large-Scale Online Learning Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering.
Experience Report: System Log Analysis for Anomaly Detection
Learning to Detect and Classify Malicious Executables in the Wild by J
A Virtual Tour of SophosLabs Building next-generation protection
Introduction Web Environments
BotCatch: A Behavior and Signature Correlated Bot Detection Approach
Binghui Wang, Le Zhang, Neil Zhenqiang Gong
Presentation transcript:

Design and Evaluation of a Real- Time URL Spam Filtering Service Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, Dawn Song University of California, Berkeley International Computer Science Institute

Motivation Social Networks (Facebook, Twitter) Web Mail (Gmail, Live Mail) Blogs, Services (Blogger, Yelp) Spam

Motivation Existing solutions: – Blacklists – Service-specific, account heuristics Develop new spam filter service: – Filter spam: scams, phishing, malware – Real-time, fine-grained, generalizable

Overview Our system – Monarch: – Accepts millions of URLs from web service – Crawls, labels each URL in real-time Spam Classification – Decision based on URL content, page behavior, hosting – Large-scale; distributed collection, classification Implemented as a cloud service

Monarch in Action Social Network 1. Spam Message Spam Account

Monarch in Action Monarch Social Network 1. Spam Message 2. Message URL Spam Account

Monarch in Action Monarch Social Network 1. Spam Message 2. Message URL 3. Fetch Content Spam URL Content Spam Account

Monarch in Action Monarch Social Network 1. Spam Message 2. Message URL 4. Decision 3. Fetch Content Spam URL Content Spam Account

Monarch in Action Monarch Social Network Message Recipients 1. Spam Message 2. Message URL 4. Decision 3. Fetch Content Spam URL Content Spam Account

Challenges Accuracy Real-Time Scalability Tolerant to Feature Evolution

Outline Architecture Results & Performance Limitations Conclusion

System Architecture

URL Aggregation SourceSample Size Spam URLs1.25 million Blacklisted Twitter URLs567,000 Non-spam Twitter URLs9 million Collection period: 9/8/2010 – 10/29/2010

Feature Collection High Fidelity Browser Navigation – Lexical features of URLs (length, subdomains) – Obfuscation (directory operations, nested encoding) Hosting – IP/ASN – A, NS, MX records – Country, city if available

Feature Collection Content – Common HTML templates, keywords – Search engine optimization – Content of request, response headers Behavior – Prevent navigating away – Pop-up windows – Plugin, JavaScript redirects

Classification Distributed Logistic Regression – Data overload for single machine

Classification Distributed Logistic Regression – Data overload for single machine L1-regularization – Reduces feature space, over-fitting – 50 million features -> 100,000 features

Implementation System implemented as a cloud service on Amazon EC2 – Aggregation: 1 machine – Feature Collection: 20 machines Firefox, extension + modified source – Classification & Feature Extraction: 50 machines Hadoop - Spark, Mesos Straightforward to scale the architecture

Result Overview High-level summary: – Performance – Overall accuracy – Highlight important features – Feature evolution – Spam independence between services

Performance Rate: 638,000 URLs/day – Cost: $1,600/mo Process time: 5.54 sec – Network delay: 5.46 sec Can scale to 15 million URLs/day – Estimated $22,000/mo

Measuring Accuracy Dataset: 12 million URLs (<2 million spam) – Sample 500K spam (half tweets, half ) – Sample 500K non-spam Training, Testing – 5-fold validation – Vary training folds non-spam:spam ratio – Test fold equal parts spam, non-spam

Overall Accuracy Training Ratio AccuracyFalse Positive Rate False Negative Rate 1:194%4.23%7.5% 4:191%0.87%17.6% 10:187%0.29%26.5% Non-spam labeled as spam Spam labeled as non-spam Correctly labeled samples

Overall Accuracy Non-spam labeled as spam Spam labeled as non-spam Correctly labeled samples Training Ratio AccuracyFalse Positive Rate False Negative Rate 1:194%4.23%7.5% 4:191%0.87%17.6% 10:187%0.29%26.5%

Error by Feature Error (%) Error = 1 - Accuracy

Error by Feature Error (%) Error = 1 - Accuracy

Error by Feature Error (%) Error = 1 - Accuracy

Feature Evolution – Retraining Required Accuracy (%)

Spam Independence Unexpected result: Twitter, spam qualitatively different Training SetTesting SetAccuracyFalse Negatives Twitter 94%22% Twitter 81%88% Twitter80%99% 99%4%

Spam Independence Unexpected result: Twitter, spam qualitatively different Training SetTesting SetAccuracyFalse Negatives Twitter 94%22% Twitter 81%88% Twitter80%99% 99%4%

Distinct , Twitter Features

Features Shorter Lived

Limitations Adversarial Machine Learning – We provide oracle to spammers – Can adversaries tweak content until passing? Time-based Evasion – Change content after URL submitted for verification Crawler Fingerprinting – Identify IP space of Monarch, fingerprint Monarch browser client – Dual-personality DNS, page behavior

Related Work C. Whittaker, B. Ryner, and M. Nazif, “Large-Scale Automatic Classification of Phishing Pages” J. Ma, L. Saul, S. Savage, and G. Voelker, “Identifying suspicious URLs: an application of large-scale online learning” Y. Zhang, J. Hong, and L. Cranor, “Cantina: a content-based approach to detecting phishing web sites” M. Cova, C. Kruegel, and G. Vigna, “Detection and analysis of drive- by-download attacks and malicious JavaScript code”

Conclusion Monarch provides: – Real-time scam, phishing, malware detection – Experiments show 91% accuracy, 0.87% false positives – Readily scalable cloud service – Applicable to all URL-based spam Spam not guaranteed to overlap between web services – Twitter, qualitatively different Despite overlap, can still provide generalizable filtering – Require training data from each service