Published by Hester Lambert. Modified over 9 years ago.
Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
Email: D9815013@mail.ntust.edu.tw
Data Mining and Machine Learning Lab., 2011/3/17
Authors: Anh Le, Athina Markopoulou (University of California, Irvine), Michalis Faloutsos (University of California, Riverside)
Source: to appear in IEEE INFOCOM 2011 Mini Conference, Shanghai, China, April 10-15, 2011 (poster, tech report)
Outline
  Introduction
  Dataset and Feature Extraction
  Classification Algorithms
  Evaluation Results
  System Deployment
  Conclusion
“How well can one detect phishing URLs using only lexical features compared to using full features?”
PhishDef properties:
  High accuracy: 96%-97%
  Lightweight: low latency, imposes a modest overhead
  Proactive approach, as opposed to reactively relying on blacklists
  Resilience to noise: accuracy falls from 95% to 86% as noise rises from 5% to 45%
Dataset
  Malicious URLs: PhishTank, MalwarePatrol
  Legitimate URLs: Yahoo Directory, Open Directory (DMOZ)
  External feature collection: WHOIS, Team Cymru
Feature Extraction
  Automatically selected features
    Delimiters: '/', '?', '.', '=', '_', '&' and '-'
    Four parts: domain name, directory, file name, argument
  Obfuscation-resistant lexical features
    Four different URL obfuscation techniques
    Five categories of hand-selected lexical features
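The splitting step above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names (`split_url`, `tokenize`, `bag_of_words`) are hypothetical, and the bag-of-words encoding per URL part is an assumption, since the slide lists only the delimiters and the four parts.

```python
import re
from urllib.parse import urlsplit

# Delimiters listed on the slide.
DELIMITERS = "/?.=_&-"


def split_url(url):
    """Split a URL into the four parts named on the slide."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    # Everything before the last '/' is the directory, the rest is the file name.
    directory, _, file_name = parts.path.rpartition("/")
    return {
        "domain": parts.netloc,
        "directory": directory,
        "file": file_name,
        "argument": parts.query,
    }


def tokenize(text):
    """Break one URL part into tokens at the slide's delimiters."""
    pattern = "[" + re.escape(DELIMITERS) + "]"
    return [token for token in re.split(pattern, text) if token]


def bag_of_words(url):
    """Assumed encoding: one binary feature per (part, token) pair."""
    features = {}
    for part, text in split_url(url).items():
        for token in tokenize(text):
            features[part + ":" + token.lower()] = 1
    return features
```

Prefixing each token with its part keeps, for example, "login" in the domain name distinct from "login" in the file name.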
(I) Obfuscating the host with an IP address
(II) Obfuscating the host with another domain
(III) Obfuscating with large host names
(IV) Domain unknown or misspelled
Features related to the full URL
  Length of the URL (Type II)
  Number of dots in the URL (Type II)
  Blacklisted words (Type IV)
    confirm, account, banking, secure, ebayisapi, webscr, login and signin
    Paypal, free, lucky and bonus
Features related to the domain name
  Length of the domain name (Type III)
  IP or port number is used in the domain name (Type I)
  Number of tokens of the domain name (Type III)
  Number of hyphens used in the domain name (Type III)
  Length of the longest token (Type III)
Features related to the directory
  Length of the directory (Type II)
  Number of sub-directory tokens (Type II)
  Length of the longest sub-directory token (Type II)
  Maximum number of dots and other delimiters used in a sub-directory token (Type II)
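The full-URL, domain-name and directory features listed above might be computed as follows. This is a rough sketch under stated assumptions: the feature names, the helper `_is_ip`, the choice to split the domain name on dots, and the substring match for blacklisted words are all illustrative choices that the slide does not specify.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Blacklisted words from the slide, lowercased for matching.
BLACKLISTED_WORDS = (
    "confirm", "account", "banking", "secure", "ebayisapi", "webscr",
    "login", "signin", "paypal", "free", "lucky", "bonus",
)


def _is_ip(host):
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_domain_directory_features(url):
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    directory = parts.path.rpartition("/")[0]
    domain_tokens = [t for t in host.split(".") if t]
    dir_tokens = [t for t in directory.split("/") if t]
    features = {
        # Full URL
        "url_length": len(url),
        "url_dots": url.count("."),
        # Domain name
        "domain_length": len(host),
        "domain_has_ip_or_port": int(_is_ip(host) or parts.port is not None),
        "domain_tokens": len(domain_tokens),
        "domain_hyphens": host.count("-"),
        "domain_longest_token": max(map(len, domain_tokens), default=0),
        # Directory
        "dir_length": len(directory),
        "dir_tokens": len(dir_tokens),
        "dir_longest_token": max(map(len, dir_tokens), default=0),
        "dir_max_delims": max(
            (len(re.findall(r"[.=_&?-]", t)) for t in dir_tokens), default=0
        ),
    }
    # Assumed matching rule: a word counts if it occurs anywhere in the URL.
    lowered = url.lower()
    for word in BLACKLISTED_WORDS:
        features["has_" + word] = int(word in lowered)
    return features
```

For a URL such as `http://192.168.0.1:8080/secure/pay-pal.com/login.php`, the sketch flags the IP-based host (Type I) and the words "secure" and "login", but not "paypal", because of the hyphen.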
Features related to the file name
  Length of the file name (Type II)
  Number of dots and other delimiters used in the file name (Type II)
Features related to the argument part
  Length of the argument part
  Number of variables
  Length of the longest variable value
  Maximum number of delimiters used in a value
Summary of dataset
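The file-name and argument features can be sketched in the same way. As before, this is only an illustration: the function name, the feature names, and the assumption that variables are separated by '&' and written as `name=value` are not taken from the slides.

```python
import re
from urllib.parse import urlsplit


def file_and_argument_features(url):
    parts = urlsplit(url if "://" in url else "http://" + url)
    file_name = parts.path.rpartition("/")[2]
    # Assumed argument syntax: name=value pairs joined by '&'.
    pairs = [p.partition("=") for p in parts.query.split("&") if p]
    values = [value for _, _, value in pairs]
    return {
        "file_length": len(file_name),
        "file_delims": len(re.findall(r"[.=_&?-]", file_name)),
        "arg_length": len(parts.query),
        "arg_variables": len(pairs),
        "arg_longest_value": max(map(len, values), default=0),
        "arg_max_delims": max(
            (len(re.findall(r"[.=_&?/-]", v)) for v in values), default=0
        ),
    }
```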
Batch learning
  Support Vector Machine (SVM)
Online learning
  Online Perceptron (OP)
  Confidence Weighted (CW)
  Adaptive Regularization of Weights (AROW)
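To make the online setting concrete, here is a small sketch of an AROW-style learner. It is a diagonal-covariance variant written for sparse feature dictionaries, not the configuration used in the paper: the class name, the regularization parameter `r = 1.0`, the initial variance of 1, and the label convention (+1 phishing, -1 legitimate) are all assumptions.

```python
from collections import defaultdict


class DiagonalAROW:
    """AROW with a diagonal covariance matrix over sparse features."""

    def __init__(self, r=1.0):
        self.r = r                            # regularization parameter (assumed value)
        self.mean = defaultdict(float)        # weight means, start at 0
        self.var = defaultdict(lambda: 1.0)   # per-weight variances, start at 1

    def margin(self, x):
        return sum(self.mean[k] * v for k, v in x.items())

    def predict(self, x):
        return 1 if self.margin(x) >= 0 else -1

    def update(self, x, y):
        """Process one example: x is a feature dict, y is +1 or -1."""
        m = self.margin(x)
        if m * y >= 1.0:
            return  # margin already large enough, no update
        confidence = sum(self.var[k] * v * v for k, v in x.items())
        beta = 1.0 / (confidence + self.r)
        alpha = (1.0 - y * m) * beta
        for k, v in x.items():
            sigma_x = self.var[k] * v
            self.mean[k] += alpha * y * sigma_x      # move the mean toward y
            self.var[k] -= beta * sigma_x * sigma_x  # shrink variance of seen features
```

Each URL is seen once and then discarded, which is what lets an online learner keep latency and memory use low compared with retraining a batch SVM.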
Batch-based vs. online algorithms
  SVM vs. AROW on the Yahoo-Phish dataset
Lexical features vs. full features
  OP, CW and AROW on the Yahoo-Phish dataset
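One common way to score online learners is the cumulative error rate: predict each incoming URL, count the mistake if any, then update. The slides do not state the exact protocol, so the sketch below is an assumption, shown with a plain online perceptron; the function name and the label convention (+1 phishing, -1 legitimate) are hypothetical.

```python
def perceptron_cumulative_error(stream):
    """Run an online perceptron over (features, label) pairs.

    Returns the cumulative error rate after each example.
    """
    weights = {}
    mistakes = 0
    error_rates = []
    for seen, (x, y) in enumerate(stream, start=1):
        score = sum(weights.get(k, 0.0) * v for k, v in x.items())
        prediction = 1 if score >= 0 else -1
        if prediction != y:
            mistakes += 1
            # Perceptron rule: update only on a mistake.
            for k, v in x.items():
                weights[k] = weights.get(k, 0.0) + y * v
        error_rates.append(mistakes / seen)
    return error_rates
```

The last value in the returned list is the error rate after the final URL in the stream.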
Obfuscation-resistant lexical features
  Performance of AROW with/without obfuscation-resistant (OR) features after the last URL
Resilience of AROW to noisy data
  AROW and CW on the Yahoo-Phish dataset
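A noise experiment of this kind can be set up by corrupting a fraction of the training labels before training. The slides do not say how the noise was injected; the sketch below assumes uniformly random label flips, and the function name and `seed` parameter are hypothetical.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of labels (+1/-1) with a fraction noise_rate flipped."""
    rng = random.Random(seed)
    n_flip = round(len(labels) * noise_rate)
    flipped = list(labels)
    # Pick distinct positions so exactly n_flip labels change.
    for i in rng.sample(range(len(labels)), n_flip):
        flipped[i] = -flipped[i]
    return flipped
```

Training on `flip_labels(labels, 0.05)` through `flip_labels(labels, 0.45)` and testing on clean labels would cover the 5% to 45% noise range mentioned earlier.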
Minimum/maximum URL similarity
  Distance distribution
Proposed PhishDef, a proactive defense scheme against phishing attacks
  Detects phishing URLs on the fly
  Uses only lexical features
  High accuracy (97%)
  Low overhead
  Resilient to noisy training data
  Implemented as Firefox and Chrome add-ons
Q&A