Presentation is loading. Please wait.

Presentation is loading. Please wait.

Identifying Suspicious URLs: An Application of Large-Scale Online Learning Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering.

Similar presentations


Presentation on theme: "Identifying Suspicious URLs: An Application of Large-Scale Online Learning Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering."— Presentation transcript:

1 Identifying Suspicious URLs: An Application of Large-Scale Online Learning Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for ICML 2009 June 15, 2009

2 2 Detecting Malicious Web Sites Predict what is safe without committing to risky actions Safe URL? Web exploit? Spam-advertised site? Phishing site? URL = Uniform Resource Locator http://www.cs.mcgill.ca/~icml2009/abstracts.html http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll http://fblight.com http://mail.ru

3 3 Problem in a Nutshell ● URL features to identify malicious Web sites ● Different classes of URLs ● Benign, spam, phishing, exploits, scams... ● For now, distinguish benign vs. malicious facebook.comfblight.co m

4 4 Today's Talk ● Problem ● Approach ● Learning to detect malicious URLs ● Challenges: scale and non-stationarity ● Evaluations ● Need for large, fresh training sets ● Online learning ● Conclusion

5 5 State of the Practice ● Current approaches ● Blacklists ● Learning on hand-tuned features ● Limitations ● Cannot learn from newest examples quickly ● Cannot quickly adapt to newest features ● Arms race: fast feedback cycle is critical More automated approach?

6 6 Live URL Classification System LabelExampleHypothesis

7 7 Live Training Feed ● Malicious URLs (spamming and phishing) ● 6,000—7,500 per day from Web mail provider ● Benign URLs ● From Yahoo Web directory ● Total of 20,000 URLs per day ● Live collection since Jan. 5, 2009 ● Months of data ● Two million examples after 100 days

8 8 Feature vector construction http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll WHOIS registration: 3/25/2009 Hosted from 208.78.240.0/22 IP hosted in San Mateo Connection speed: T1 Has DNS PTR record? Yes Registrant “Chad”... [ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …] Real-valued Host-basedLexical 60+ features1.8 million 1.1 million GROWING Day 100

9 9 Live URL Classficiation System Online learning

10 10 Practical Challenges of ML in Systems ● Industrial concerns ● Scale: millions of examples, features ● Non-stationarity: examples change over time (arms race w/ criminals) ● Pivotal decision: batch or online?

11 11 Batch vs. Online Learning ● Batch/offline learning ● SVM, logistic regression, decision trees, etc ● Multiple passes over data ● No incremental updates ● Potentially high memory and processing overhead ● Online learning ● Perceptron-style algorithms ● Single pass over data ● Incremental updates ● Low memory and processing overheard Online learning addresses scale and non-stationarity

12 12 Evaluations ● Online learning for URL reputation ● Need for large, fresh training sets ● Comparing online algorithms

13 13 Need lots of fresh training data? SVM trained once SVM retrained daily ● Fresh data helps

14 14 Need lots of fresh training data? ● Fresh data helps ● More data helps SVM trained once on 2 weeks SVM w/ 2-week sliding window

15 15 Which online algorithm? ● Perceptron ● Stochastic Gradient Descent for Logistic Regression ● Confidence-Weighted Learning

16 16 Perceptron ● Convergence result: [Rosenblatt, 1958] + − ++ + + + − − − − − ● Update on each mistake: radius margin Number of mistakes ≤

17 17 Logistic Regression with SGD ● Log likelihood: where [Bottou, 1998] ● For every example: Proportional

18 18 Confidence-Weighted Learning ● Maintain Gaussian distribution over weight vector: [Dredze et al., 2008] [Crammer et al., 2009] ● Constrained problem: ● Closed-form update: Treat features differently

19 19 Which online algorithms? Perceptron

20 20 Which online algorithms? LR w/ SGD ● Proportional update helps Perceptron

21 21 Which online algorithms? ● Proportional update helps ● Per-feature confidence really helps Confidence-Weighted LR w/ SGD Perceptron

22 22 Batch... ● Fresh data helps ● More data helps Batch

23 23 Batch vs. Online Confidence-Weighted Batch ● Fresh data helps ● More data helps ● Online matches batch

24 24 Conclusion ● Detecting malicious URLs ● Relevant real-world problem ● Successful application of online learning ● Confidence-Weighted vs. Batch ● As accurate ● More adaptive ● Less resources ● Future work ● Scaling up for deployment


Download ppt "Identifying Suspicious URLs: An Application of Large-Scale Online Learning Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering."

Similar presentations


Ads by Google