Design and Evaluation of a Real-Time URL Spam Filtering Service

1 Design and Evaluation of a Real-Time URL Spam Filtering Service
Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song
IEEE Symposium on Security and Privacy 2011

2 Outline
- Introduction - Monarch
- Related Work
- System Design
- Implementation
- Evaluation
- Discussion and Conclusion

3 Spam URL
- Advertisement
- Harmful content
  - Phishing, malware, and scams
- Use of compromised and fraudulent accounts, web services

4 Monarch
- Spam URL Filtering as a Service
- Tens of millions of features

5 Related Work
- “Detecting spammers on Twitter” (2010)
  - Post frequency, URLs, friends…
- “Behind phishing: an examination of phisher modi operandi” (2008)
  - Lexical characteristics of phishing URLs
- “Cantina: a content-based approach to detecting phishing web sites” (2007)
  - Parse HTML content

6 System Design
Monarch’s cloud infrastructure
- URL Aggregation
  - Email providers and Twitter’s streaming API
- Feature Collection
  - Visits a URL with web browsers to collect page content

7 System Design (cont.)
Monarch’s cloud infrastructure
- Feature Extraction
  - Transform the raw data into a sparse feature vector
- Classification
  - Training and testing by distributed logistic regression

8 Collect Raw Features – Web Browser
- “A taxonomy of JavaScript redirection spam” (2007)
- Lightweight browser not enough
  - Poor HTML parsing, lack of JavaScript and plugins
- Instrumented version of Firefox
  - JavaScript enabled
  - Flash and Java installed
  - Visits a URL and monitors a number of details

9 Raw Features
- Web Browser
  - Initial URL and Landing URL, Redirects, Sources and Frames
  - HTML Content, Page Links
  - JavaScript Events, Pop-up Windows, Plugins
  - HTTP Headers
- DNS Resolver
  - Initial, final, and redirect URLs
- IP Address Analysis
  - City, country, ASN
- Proxy and Whitelist (200 domains)

10 Feature Vector
Raw features => sparse feature vector
- Canonicalize URLs
  - Remove obfuscation
- Tokenize the text corpus
  - Splitting on non-alphanumeric characters
  - => domain feature [adl,tw], path feature [dada,dada2,php], query parameters feature [a,1,b,3]
  - => (1,3,a,adl,b,dada,dada2,php,tw)
  - => (…,adl:true,adm:false,…,dada:true,…,tw:true,…)
- Total: 49,960,691 features (dimensions)
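The tokenization step above can be sketched in a few lines. This is only an illustration: the example URL is hypothetical (chosen so that it yields the token lists shown on the slide), and the function names `tokenize`, `url_tokens`, and `sparse_features` are not from the original system.

```python
import re
from urllib.parse import urlsplit

def tokenize(text):
    """Split on runs of non-alphanumeric characters, dropping empty tokens."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def url_tokens(url):
    """Tokenize the domain, path, and query string of a URL separately."""
    parts = urlsplit(url)
    return {
        "domain": tokenize(parts.netloc),
        "path": tokenize(parts.path),
        "query": tokenize(parts.query),
    }

def sparse_features(url):
    """Sorted list of the distinct tokens; only these dimensions are 'true'."""
    tokens = url_tokens(url)
    return sorted(set(tokens["domain"] + tokens["path"] + tokens["query"]))

# Hypothetical URL, assumed for illustration only.
example = "http://adl.tw/dada/dada2.php?a=1&b=3"
print(url_tokens(example))
print(sparse_features(example))
```

Storing only the tokens that are present keeps each vector small even though the full feature space has tens of millions of dimensions.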

11 Distributed Classifier Design
- Linear classification: feature vector
  - Determine a weight vector
- A parallel online learner
  - With regularization to yield a sparse weight vector
- Labeled data, Testing
  - -1 => non-spam site
  - 1 => spam site
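As a rough sketch of linear classification over a sparse binary feature vector, the snippet below scores a set of active features against a weight dictionary. The weights, the zero decision threshold, and the names `score` and `classify` are assumptions made for illustration, not values from the slides.

```python
def score(features, weights):
    """Dot product of a binary sparse feature set with a sparse weight dict."""
    return sum(weights.get(f, 0.0) for f in features)

def classify(features, weights, bias=0.0):
    """Return 1 (spam) when the score is positive, otherwise -1 (non-spam)."""
    return 1 if score(features, weights) + bias > 0 else -1

# Made-up weights; features missing from the dict have weight 0.
weights = {"adl": 1.2, "php": 0.4, "tw": -0.3}
print(classify({"adl", "tw", "dada", "php"}, weights))
print(classify({"tw"}, weights))
```

A sparse weight vector means most features contribute nothing, so scoring touches only the handful of features present in a URL.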

12 Training the Weight Vector
- Logistic Regression
  - With subgradient L1-Regularization
- yi(xi·w) larger => f(w) smaller (classification margin, hyperplane)
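A minimal single-machine sketch of logistic regression trained with an L1 subgradient, assuming the usual objective: mean logistic loss log(1 + exp(-y (x·w))) plus lam times the L1 norm of w. The learning rate, the penalty `lam`, the toy data, and all function names are illustrative assumptions.

```python
import math

def dot(x, w):
    return sum(a * b for a, b in zip(x, w))

def sign(v):
    return (v > 0) - (v < 0)

def objective(w, data, lam):
    """Mean logistic loss plus L1 penalty; labels y are +1 (spam) or -1."""
    loss = sum(math.log1p(math.exp(-y * dot(x, w))) for x, y in data)
    return loss / len(data) + lam * sum(abs(v) for v in w)

def subgradient_step(w, data, lam, lr):
    """One full-batch step along a subgradient of the objective."""
    grad = [0.0] * len(w)
    for x, y in data:
        # d/dw log(1 + exp(-y x.w)) = -y x / (1 + exp(y x.w))
        coeff = -y / (1.0 + math.exp(y * dot(x, w)))
        for j, xj in enumerate(x):
            grad[j] += coeff * xj / len(data)
    # sign(0) = 0 is a valid subgradient of |w_j| at zero.
    return [wj - lr * (gj + lam * sign(wj)) for wj, gj in zip(w, grad)]

# Toy data: feature 0 marks spam, feature 1 marks non-spam.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
w = [0.0, 0.0]
for _ in range(200):
    w = subgradient_step(w, data, lam=0.01, lr=0.5)
print(w, objective(w, data, 0.01))
```

A larger margin y (x·w) shrinks each example's loss term, which is the relationship the slide summarizes.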

13 Distributed Classifier Algorithm
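One common way to parallelize an online learner is sketched here under stated assumptions: each data shard runs a pass of stochastic gradient descent from the shared weights, the per-shard results are averaged, and the average is soft-thresholded to keep the weight vector sparse. This is an illustration of that general scheme, not necessarily the exact algorithm shown on the slide; the hyperparameters and function names are hypothetical.

```python
import math

def sgd_pass(w, shard, lr):
    """One online pass of logistic-regression SGD over a single shard."""
    w = list(w)
    for x, y in shard:
        margin = y * sum(a * b for a, b in zip(x, w))
        coeff = y / (1.0 + math.exp(margin))
        w = [wj + lr * coeff * xj for wj, xj in zip(w, x)]
    return w

def shrink(w, lam):
    """Soft-threshold: move each weight toward zero by lam, clipping at 0."""
    return [math.copysign(max(0.0, abs(v) - lam), v) for v in w]

def train(shards, dim, iterations=20, lr=0.5, lam=0.01):
    w = [0.0] * dim
    for _ in range(iterations):
        # Sequential here; each shard would run on its own worker in practice.
        local = [sgd_pass(w, shard, lr) for shard in shards]
        w = [sum(col) / len(local) for col in zip(*local)]
        w = shrink(w, lam)
    return w

# Toy shards: feature 2 appears with both labels and carries no signal.
shards = [[([1.0, 0.0, 1.0], 1)], [([0.0, 1.0, 1.0], -1)]]
print(train(shards, dim=3))
```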

14 Data Set and Assumption
- 1.25 million spam email URLs
- 567,784 spam Twitter URLs
- 9 million non-spam Twitter URLs
- Checking all Twitter URLs against: Google Safe Browsing, SURBL, URIBL, APWG, PhishTank
  - A URL is labeled spam if any of its source URLs becomes blacklisted

15 Data Set and Assumption (cont.)
On Twitter: 36% scams, 60% phishing, 4% malware

16 After regularization

17 Implementation
Amazon Web Services (AWS) infrastructure
- URL Aggregation
  - A queue, keeps 300,000 URLs
- Feature Collection
  - 20x6 Firefox (4.0b4) on Ubuntu 10.04
  - With a custom extension
  - Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library and Route Views
- Classifier
  - Hadoop Distributed File System
  - On the 50-node cluster

18 Evaluation – Overall Accuracy
- 5-fold cross-validation
  - 500,000 spam and non-spam each
- Training set size: 400,000 examples
  - 1:1, 4:1, 10:1
- Testing set size: 200,000 examples
  - 1:1
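The mechanics of a 5-fold split can be sketched as follows. This shows only how examples are partitioned so that each is tested exactly once; it does not reproduce the subsampling to the training and testing sizes listed above, and the function name and seed are assumptions.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; each example is tested exactly once."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(10):
    print(len(train), len(test))
```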

19 Evaluation – Single Feature

20 Evaluation – Accuracy Over Time
Training once only <-> Retraining every four days

21 Evaluation – Comparing Email and Tweet Spam
Log odds ratio:
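The formula is not preserved in this transcript. A minimal sketch, assuming the standard definition of a log odds ratio between the fraction of email spam and the fraction of tweet spam that exhibit a given feature; the parameter names and sample values are hypothetical.

```python
import math

def log_odds_ratio(p_email, p_tweet):
    """log of (odds of the feature in email spam) / (odds in tweet spam).

    Both arguments are fractions strictly between 0 and 1.
    """
    odds_email = p_email / (1.0 - p_email)
    odds_tweet = p_tweet / (1.0 - p_tweet)
    return math.log(odds_email / odds_tweet)

# Made-up fractions: positive means more typical of email spam.
print(log_odds_ratio(0.8, 0.2))
print(log_odds_ratio(0.5, 0.5))
```

Under this definition a value near zero means the feature is about equally common in both kinds of spam, and the sign shows which source it favors.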

22 Evaluation – The Cost
- For Twitter, $22,751 per month

23 Discussion and Conclusion
- Evasion
  - Feature Evasion
  - Time-based Evasion
  - Crawler Evasion
- Monarch
  - Real-time system
  - Spam URL Filtering as a Service
  - $22,751 a month

