Design and Evaluation of a Real-Time URL Spam Filtering Service
Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, Dawn Song
University of California, Berkeley; International Computer Science Institute

1 Design and Evaluation of a Real-Time URL Spam Filtering Service
Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, Dawn Song
University of California, Berkeley
International Computer Science Institute

2 Motivation
– Social Networks (Facebook, Twitter)
– Web Mail (Gmail, Live Mail)
– Blogs, Services (Blogger, Yelp)
Spam

3 Motivation
Existing solutions:
– Blacklists
– Service-specific, account heuristics
Develop new spam filter service:
– Filter spam: scams, phishing, malware
– Real-time, fine-grained, generalizable

4 Overview
Our system, Monarch:
– Accepts millions of URLs from web service
– Crawls, labels each URL in real-time
Spam Classification
– Decision based on URL content, page behavior, hosting
– Large-scale; distributed collection, classification
Implemented as a cloud service

5-9 Monarch in Action
(diagram built up across five slides; labels: Spam Account, Social Network, Monarch, Spam URL Content, Message Recipients)
1. Spam Message
2. Message URL
3. Fetch Content
4. Decision
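
The four numbered steps in the diagram can be read as a simple request flow. The following is a minimal sketch, not Monarch's implementation: the names `extract_urls`, `handle_message`, `fake_fetch` and `fake_classify` are hypothetical, the fetcher and classifier are stubs, and it assumes the service withholds a message from recipients when any of its URLs is labeled spam.

```python
import re
from typing import Callable, List

# Hypothetical helper: a simple pattern for http(s) URLs inside a message.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(message: str) -> List[str]:
    """Step 2: pull the URLs out of a message so they can be submitted."""
    return URL_PATTERN.findall(message)


def handle_message(
    message: str,
    fetch_content: Callable[[str], str],
    classify: Callable[[str, str], bool],
) -> bool:
    """Return True if the message may be delivered to its recipients.

    Assumption: delivery is withheld when any URL is classified as spam.
    """
    for url in extract_urls(message):
        content = fetch_content(url)   # step 3: fetch content
        if classify(url, content):     # step 4: decision (True means spam)
            return False
    return True


# Stubs standing in for the real crawler and classifier (illustrative only).
def fake_fetch(url: str) -> str:
    return "cheap pills" if "pills" in url else "hello"


def fake_classify(url: str, content: str) -> bool:
    return "pills" in content
```

With these stubs, `handle_message("buy http://pills.example/x now", fake_fetch, fake_classify)` returns `False`, while a message with no URLs is always delivered.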

10 Challenges
– Accuracy
– Real-Time
– Scalability
– Tolerant to Feature Evolution

11 Outline
– Architecture
– Results & Performance
– Limitations
– Conclusion

12 System Architecture

16 URL Aggregation

| Source | Sample Size |
|---|---|
| Spam email URLs | 1.25 million |
| Blacklisted Twitter URLs | 567,000 |
| Non-spam Twitter URLs | 9 million |

Collection period: 9/8/2010 – 10/29/2010

17 Feature Collection
High Fidelity Browser
Navigation
– Lexical features of URLs (length, subdomains)
– Obfuscation (directory operations, nested encoding)
Hosting
– IP/ASN
– A, NS, MX records
– Country, city if available
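
The navigation features named on this slide (length, subdomains, directory operations, nested encoding) can be illustrated with a few lines of standard-library Python. This is only a sketch: the function name `lexical_features`, the feature names, the cap of five decoding rounds, and the crude assumption that the registered domain has exactly two labels are all illustrative choices, not details from the talk.

```python
from urllib.parse import unquote, urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few lexical and obfuscation features for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = [label for label in host.split(".") if label]
    path_segments = parts.path.split("/")

    # Nested encoding: count how many rounds of percent-decoding still
    # change the string (capped at 5 rounds, an arbitrary limit).
    depth = 0
    current = url
    while depth < 5:
        decoded = unquote(current)
        if decoded == current:
            break
        current = decoded
        depth += 1

    return {
        "url_length": len(url),
        # Crude assumption: the registered domain is the last two labels.
        "num_subdomains": max(len(labels) - 2, 0),
        # Directory operations: "." and ".." path segments.
        "num_directory_ops": sum(1 for s in path_segments if s in (".", "..")),
        "encoding_depth": depth,
    }
```

For example, `lexical_features("http://a.b.example.com/x/../y/%252e")` reports two subdomains, one directory operation, and an encoding depth of two, because `%252e` decodes first to `%2e` and then to `.`.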

18 Feature Collection
Content
– Common HTML templates, keywords
– Search engine optimization
– Content of request, response headers
Behavior
– Prevent navigating away
– Pop-up windows
– Plugin, JavaScript redirects

19-20 Classification
Distributed Logistic Regression
– Data overload for single machine
L1-regularization
– Reduces feature space, over-fitting
– 50 million features -> 100,000 features
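
To illustrate why L1-regularization shrinks the feature space, here is a small single-machine sketch of L1-regularized logistic regression trained by proximal gradient descent (gradient step followed by soft-thresholding). It is not the distributed trainer described in the talk, and the talk does not state which optimizer was used; the function names, the toy data (one informative feature plus noise features), and the hyperparameters are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w: float, t: float) -> float:
    """Proximal operator of t * |w|: shrink toward zero, clip small values to 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def train_l1_logreg(X, y, lam, lr=0.5, epochs=200):
    """Batch proximal gradient descent for logistic loss + lam * ||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        # Gradient step, then soft-threshold each weight.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


def make_toy_data(n=200, d=6, seed=0):
    """Toy data: only feature 0 determines the label; the rest are noise."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    y = [1.0 if row[0] > 0 else 0.0 for row in X]
    return X, y
```

Calling `train_l1_logreg(*make_toy_data(), lam=0.1)` keeps a positive weight on the informative feature while the penalty pushes the weights of the noise features toward zero, which is the same mechanism that lets a model discard most of a very large feature set.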

21 Implementation
System implemented as a cloud service on Amazon EC2
– Aggregation: 1 machine
– Feature Collection: 20 machines
  Firefox, extension + modified source
– Classification & Feature Extraction: 50 machines
  Hadoop, Spark, Mesos
Straightforward to scale the architecture

22 Result Overview
High-level summary:
– Performance
– Overall accuracy
– Highlight important features
– Feature evolution
– Spam independence between services

23 Performance
Rate: 638,000 URLs/day
– Cost: $1,600/mo
Process time: 5.54 sec
– Network delay: 5.46 sec
Can scale to 15 million URLs/day
– Estimated $22,000/mo
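
The figures on this slide can be turned into a per-URL cost and a latency breakdown with simple arithmetic. The sketch below assumes a 30-day month and assumes that the network delay is a component of the total process time (as the sub-bullet suggests); the function names are illustrative.

```python
DAYS_PER_MONTH = 30  # assumption: the slide does not define a month


def cost_per_million_urls(urls_per_day: float, dollars_per_month: float) -> float:
    """Monthly cost divided by monthly volume, scaled to one million URLs."""
    urls_per_month = urls_per_day * DAYS_PER_MONTH
    return dollars_per_month / urls_per_month * 1_000_000


def network_share(process_seconds: float, network_seconds: float) -> float:
    """Fraction of the per-URL process time spent on network delay."""
    return network_seconds / process_seconds


current = cost_per_million_urls(638_000, 1_600)        # measured deployment
projected = cost_per_million_urls(15_000_000, 22_000)  # estimated scale-up
share = network_share(5.54, 5.46)
```

Under these assumptions the measured deployment costs roughly $84 per million URLs, the projected deployment roughly $49 per million URLs, and about 98.6% of the per-URL process time is network delay.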

24 Measuring Accuracy
Dataset: 12 million URLs (<2 million spam)
– Sample 500K spam (half tweets, half email)
– Sample 500K non-spam
Training, Testing
– 5-fold validation
– Vary training folds non-spam:spam ratio
– Test fold equal parts spam, non-spam
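
The evaluation setup described here (5-fold validation, a varied non-spam:spam ratio in the training folds, and a balanced test fold) might be sketched as follows. How the ratio was actually achieved is not stated on the slide; this sketch assumes the spam portion of the training folds is downsampled, and the names `k_folds` and `ratio_splits` are hypothetical.

```python
import random


def k_folds(items, k, rng):
    """Shuffle the items and deal them into k roughly equal folds."""
    items = list(items)
    rng.shuffle(items)
    return [items[i::k] for i in range(k)]


def ratio_splits(spam, nonspam, ratio, k=5, seed=0):
    """Yield (train_spam, train_nonspam, test_spam, test_nonspam) per fold.

    ratio is the non-spam:spam ratio in training, e.g. 4 for 4:1.
    Assumption: the ratio is reached by downsampling training spam.
    """
    rng = random.Random(seed)
    spam_folds = k_folds(spam, k, rng)
    nonspam_folds = k_folds(nonspam, k, rng)
    for i in range(k):
        train_spam = [u for j in range(k) if j != i for u in spam_folds[j]]
        train_nonspam = [u for j in range(k) if j != i for u in nonspam_folds[j]]
        keep = len(train_nonspam) // ratio
        train_spam = rng.sample(train_spam, min(keep, len(train_spam)))
        # The held-out fold stays balanced: equal parts spam and non-spam.
        n_test = min(len(spam_folds[i]), len(nonspam_folds[i]))
        yield (train_spam, train_nonspam,
               spam_folds[i][:n_test], nonspam_folds[i][:n_test])
```

With 100 spam and 100 non-spam URLs and `ratio=4`, each of the five splits trains on 80 non-spam and 20 spam URLs and tests on 20 of each.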

25-26 Overall Accuracy

| Training Ratio (non-spam:spam) | Accuracy | False Positive Rate | False Negative Rate |
|---|---|---|---|
| 1:1 | 94% | 4.23% | 7.5% |
| 4:1 | 91% | 0.87% | 17.6% |
| 10:1 | 87% | 0.29% | 26.5% |

– Accuracy: correctly labeled samples
– False positive: non-spam labeled as spam
– False negative: spam labeled as non-spam
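
Because the test fold holds equal parts spam and non-spam, the reported accuracy should equal one minus the average of the two error rates. The short check below, a sketch with illustrative names, confirms that the three rows of the table are consistent with that relationship to within rounding.

```python
def balanced_accuracy(false_positive_rate: float, false_negative_rate: float) -> float:
    """Accuracy on a test set with equal numbers of spam and non-spam samples."""
    return 1.0 - (false_positive_rate + false_negative_rate) / 2.0


# (false positive rate, false negative rate, reported accuracy) per training ratio
REPORTED = {
    "1:1": (0.0423, 0.075, 0.94),
    "4:1": (0.0087, 0.176, 0.91),
    "10:1": (0.0029, 0.265, 0.87),
}

computed = {ratio: balanced_accuracy(fpr, fnr) for ratio, (fpr, fnr, _) in REPORTED.items()}
```

Each computed value rounds to the accuracy reported on the slide. The table also shows the trade-off directly: more non-spam in training lowers false positives but raises false negatives.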

27-29 Error by Feature
(chart shown across three slides; y-axis: Error (%), where Error = 1 - Accuracy)

30 Feature Evolution – Retraining Required
(chart; y-axis: Accuracy (%))

31-32 Spam Independence
Unexpected result: Twitter, email spam qualitatively different

| Training Set | Testing Set | Accuracy | False Negatives |
|---|---|---|---|
| Twitter | Twitter | 94% | 22% |
| Twitter | Email | 81% | 88% |
| Email | Twitter | 80% | 99% |
| Email | Email | 99% | 4% |

33 Distinct Email, Twitter Features

34 Email Features Shorter Lived

35 Limitations
Adversarial Machine Learning
– We provide oracle to spammers
– Can adversaries tweak content until passing?
Time-based Evasion
– Change content after URL submitted for verification
Crawler Fingerprinting
– Identify IP space of Monarch, fingerprint Monarch browser client
– Dual-personality DNS, page behavior

36 Related Work
– C. Whittaker, B. Ryner, and M. Nazif, “Large-Scale Automatic Classification of Phishing Pages”
– J. Ma, L. Saul, S. Savage, and G. Voelker, “Identifying suspicious URLs: an application of large-scale online learning”
– Y. Zhang, J. Hong, and L. Cranor, “Cantina: a content-based approach to detecting phishing web sites”
– M. Cova, C. Kruegel, and G. Vigna, “Detection and analysis of drive-by-download attacks and malicious JavaScript code”

37 Conclusion
Monarch provides:
– Real-time scam, phishing, malware detection
– Experiments show 91% accuracy, 0.87% false positives
– Readily scalable cloud service
– Applicable to all URL-based spam
Spam not guaranteed to overlap between web services
– Twitter, email qualitatively different
Despite overlap, can still provide generalizable filtering
– Require training data from each service

