Presentation is loading. Please wait.

Presentation is loading. Please wait.

Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

Similar presentations


Presentation on theme: "Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google."— Presentation transcript:

1 Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google May 5, 2010

2 Malicious Web Sites Spam-advertised goods Trojan downloads Phishing: which one is real?

3 3 Visiting Malicious Web Sites Predict what is safe without committing to risky actions Safe URL? Malicious download? Spam-advertised site? Phishing site? URL = Uniform Resource Locator http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll http://fblight.com http://mail.ru http://www.cs.ucsd.edu/~jtma/

4 4 Problem in a Nutshell URL features to identify malicious Web sites No context, no content Different classes of URLs Benign, spam, phishing, exploits, scams... For now, distinguish benign vs. malicious facebook.comfblight.com

5 5 What we want...

6 6 How to build this service? http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll Blacklist Hand-picked features (properties of URL) Machine learning-based classifier

7 7 State of the Practice Current approaches Blacklists [SORBS, URIBL, SURBL, Spamhaus, SiteAdvisor, WOT, IronPort, WebSense] Learning on hand-tuned features [Kan & Thi '05, Garera et al. '07, CANTINA, Guan et al. '09] Limitations Cannot learn from newest examples quickly Cannot quickly adapt to newest features Arms race: fast feedback cycle is critical More automated approach?

8 8 Contributions URL classification system Used by large Web mail providers Practical large-scale machine learning in computer security Large feature sets + online learning Related work: Whittaker, Ryner, Nazif, “Large-Scale Automatic Classification of Phishing Pages” (NDSS'10)

9 9 Today's Talk Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

10 10 Today's Talk Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

11 11 Moving Beyond Blacklists Preliminary study: Do our features work? Batch algorithms for now 10 4 examples and features Outline System overview Features ← focus of this segment Experimental results [Ma, Saul, Savage, Voelker (KDD 2009)]

12 12 URL Classification System LabelExampleHypothesis

13 13 Data Sets Malicious URLs 5,000 from PhishTank (phishing) 15,000 from Spamscatter (spam, phishing, etc) Benign URLs 15,000 from Yahoo Web directory 15,000 from DMOZ directory Malicious x Benign → 4 Data Sets 30,000 – 55,000 features per data set

14 14 Algorithms Logistic regression w/ L1-norm regularization Other models Naive Bayes Support vector machines (linear, RBF kernels) Implicit feature selection Easier to interpret

15 15 Features Example

16 16 Feature Vector Construction http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll WHOIS registration: 3/25/2009 Hosted from 208.78.240.0/22 IP hosted in San Mateo Connection speed: T1 Has DNS PTR record? Yes Registrant “Chad”... [ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …] Real-valued Host-basedLexical

17 17 Features to consider? 1)Blacklists 2)Simple heuristics 3)Domain name registration 4)Host properties 5)Lexical

18 18 (1) Blacklist Queries List of known malicious sites Providers: SORBS, URIBL, SURBL, Spamhaus Not comprehensive http://www.bfuduuioo1fp.mobi In blacklist? Yes http://fblight.com No In blacklist? http://www.bfuduuioo1fp.mobi Blacklist queries as features........................................

19 19 stopgap.cn registered 2 May 2010 (2) Manually-Selected Features Considered by previous studies IP address in hostname? Number of dots in URL WHOIS (domain name) registration date [Fette et al., 2007][Zhang et al., 2007][Bergholz et al., 2008] http://72.23.5.122/www.bankofamerica.com/ http://www.bankofamerica.com.qytrpbcw.stopgap.cn/

20 20 (3) WHOIS Features Domain name registration Date of registration, update, expiration Registrant: Who registered domain? Registrar: Who manages registration? http://sleazysalmon.com http://angryalbacore.com http://mangymackerel.com http://yammeringyellowtail.com Registered on 4 May 2010 By SpamMedia

21 21 (4) Host-Based Features Blacklisted? (SORBS, URIBL, SURBL, Spamhaus) WHOIS: registrar, registrant, dates IP address: Which ASes/IP prefixes? DNS: TTL? PTR record exists/resolves? Geography-related: Locale? Connection speed? 75.102.60.0/2269.63.176.0/20 facebook.comfblight.com Bad Parts of the Internet

22 22 (5) Lexical Features Tokens in URL hostname + path Length of URL Number of dots http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll [Kolari et al., 2006]

23 23 Which feature sets? Blacklist Manual WHOIS Host-based Lexical 4,000 # Features 13,000 4 7 17,000 More features → Better accuracy Error rate (%)

24 24 Which feature sets? Blacklist Manual WHOIS Host-based Lexical Full 96—99% accuracy 4,000 # Features 13,000 4 7 17,000 30,000 Error rate (%)

25 25 Which feature sets? Blacklist Manual WHOIS Host-based Lexical Full w/o WHOIS/Blacklist 4,000 # Features 13,000 4 7 17,000 30,000 26,000 Error rate (%) Blacklists and WHOIS are not comprehensive

26 26 Beyond Blacklists Blacklist Full features Yahoo-PhishTank Higher detection rate for given false positive rate

27 27 Summary Detect malicious URLs with high accuracy Only using URL Diverse feature set helps: 99% w/ 30,000+ features Model analysis (more in KDD'09 paper) What about scalable and adaptive malicious URL detection system?

28 28 Today's Talk Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

29 29 URL Classification System LabelExampleHypothesis

30 30 Live URL Classification System LabelExampleHypothesis

31 31 Large-Scale Online Learning How do we scale to live, large-scale data? Outline Live training feed Challenges: scale and non-stationarity Need for large, fresh training sets Online learning [Ma, Saul, Savage, Voelker (ICML 2009)]

32 32 Live Training Feed Malicious URLs (spamming and phishing) 6,000—7,500 per day from Web mail provider Benign URLs From Yahoo Web directory Total of 20,000 URLs per day Collected Jan – May, 2009 120 days More than two million examples

33 33 Feature Vector Construction http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll WHOIS registration: 3/25/2009 Hosted from 208.78.240.0/22 IP hosted in San Mateo Connection speed: T1 Has DNS PTR record? Yes Registrant “Chad”... [ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …] Real-valued Host-basedLexical 60+ features1.1 million 1.8 million GROWING Day 100

34 34 New Features Encountered All the Time Many binary features Enumerating tokens, ISPs, registrants, etc... 2.9 million features after 100 days

35 35 Live URL Classficiation System Online learning

36 36 Practical Challenges of ML in Systems Industrial concerns Scale: millions of examples, features Non-stationarity: examples change over time (arms race w/ criminals) Pivotal decision: batch or online?

37 37 Batch vs. Online Learning Batch/offline learning SVM, logistic regression, decision trees, etc Multiple passes over data No incremental updates Potentially high memory and processing overhead Online learning Perceptron-style algorithms Single pass over data Incremental updates Low memory and processing overhead Online learning addresses scale and non-stationarity

38 38 Evaluations Online learning for URL reputation Need for large, fresh training sets Comparing online algorithms Continuous retraining Growing feature vector

39 39 Need lots of fresh training data? SVM trained once SVM retrained daily Fresh data helps

40 40 Need lots of fresh training data? Fresh data helps More data helps SVM trained once on 2 weeks SVM w/ 2-week sliding window

41 41 Which online algorithm? Perceptron Stochastic Gradient Descent for Logistic Regression Confidence-Weighted Learning

42 42 Perceptron Convergence result: [Rosenblatt, 1958] + − ++ + + + − − − − − Update on each mistake: radius margin Number of mistakes ≤

43 43 Logistic Regression with SGD Log likelihood: where [Bottou, 1998] For every example: Proportional

44 44 Confidence-Weighted Learning Maintain Gaussian distribution over weight vector: [Dredze et al., 2008] [Crammer et al., 2009] Constrained problem: Closed-form update: Update features at different rates Diagonal covariance matrix for evals

45 45 Which online algorithms? Perceptron

46 46 Which online algorithms? LR w/ SGD Proportional update helps Perceptron

47 47 Which online algorithms? Proportional update helps Per-feature confidence really helps Confidence-Weighted LR w/ SGD Perceptron

48 48 Batch... Fresh data helps More data helps BatchBatch

49 49 Batch vs. Online Confidence-Weighted BatchBatch Fresh data helps More data helps Online matches batch

50 50 Why online does well? SVM w/ 2-week sliding window Confidence-Weighted

51 51 Why online does well? Confidence-Weighted once-a-day More data eventually helps Continuous retraining helps SVM w/ 2-week sliding window Confidence-Weighted

52 52 Growing feature vector? Confidence-Weighted fixed features

53 53 Growing feature vector? Confidence-Weighted fixed features Confidence-Weighted growing features Growing feature vector helps

54 54 Fixed vs. Variable Features Perceptron fixed features Growing feature vector helps

55 55 Fixed vs. Variable Features Perceptron fixed features Growing feature vector helps Growing + CW really helps Perceptron growing features

56 56 Proof-of-Concept Plugin

57 57 Examine URL details...

58 58 It sure looks like phishing... The Real Site

59 59 Blacklisted later on

60 60 Summary Detecting malicious URLs Relevant real-world problem Successful application of online learning What helps? More, fresher data Continuous retraining Growing feature vector Confidence-Weighted vs. Batch As accurate More adaptive Fewer resources

61 61 Today's Talk Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

62 62 Impact Public data set (UCI ML repo) Industrial impact Mail providers have adopted our approach for classifying URLs in email messages Project Info + Data Set http://sysnet.ucsd.edu/projects/url/

63 63 Final Thoughts: Systems + ML Systems: high-impact, large-scale applications ML: Methodical approaches Systems: Embrace real-world constraints ML: More than “plug-and-play” solutions Systems ↔ Machine Learning


Download ppt "Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google."

Similar presentations


Ads by Google