Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science.

Presentation transcript:

Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for KDD 2009 June 30, 2009

2 Detecting Malicious Web Sites: Predict what is safe without committing to risky actions. Safe URL? Web exploit? Spam-advertised site? Phishing site? (URL = Uniform Resource Locator)

3 Problem in a Nutshell: Use URL features to identify malicious Web sites, with no context and no content. Different classes of URLs: benign, spam, phishing, exploits, scams... For now, distinguish benign vs. malicious (facebook.com vs. fblight.com).

4 State of the Practice. Current approaches: blacklists [SORBS, URIBL, SURBL, Spamhaus]; learning on hand-tuned features [Garera et al, 2007]. Limitations: cannot predict unlisted sites; cannot account for new features; arms race. More automated approach?

5 Today's Talk: Motivation; System overview; Training data; Algorithms; Features (← focus of today's talk); Experimental results; Conclusion.

6 URL Classification System [Diagram; surviving labels: Label, Example, Hypothesis]

7 Data Sets. Malicious URLs: 5,000 from PhishTank (phishing); 15,000 from Spamscatter (spam, phishing, etc). Benign URLs: 15,000 from Yahoo Web directory; 15,000 from DMOZ directory. Malicious × Benign → 4 data sets, with 30,000 – 55,000 features per data set.

8 Algorithms. Logistic regression w/ L1-norm regularization: implicit feature selection; easier to interpret. Other models: Naive Bayes; support vector machines (linear, RBF kernels).
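The slide names L1-regularized logistic regression but not a solver. As a rough illustration of why the L1 penalty performs implicit feature selection, here is a minimal sketch that uses proximal gradient descent with soft-thresholding. The solver choice, the function names, the hyperparameters, and the toy data are all assumptions for illustration, not the authors' implementation.

```python
import math

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the proximal step for an L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_l1_logreg(rows, labels, l1=0.05, step=0.5, epochs=500):
    """Minimize mean log-loss + l1 * ||w||_1 by proximal gradient descent.

    rows: list of equal-length feature lists; labels: 0 (benign) or 1 (malicious).
    The bias term is left unpenalized. All defaults are illustrative assumptions.
    """
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_b += err / n
            for j, xj in enumerate(x):
                grad_w[j] += err * xj / n
        b -= step * grad_b
        # Gradient step on the smooth loss, then shrink each weight toward zero.
        w = [soft_threshold(wj - step * gj, step * l1) for wj, gj in zip(w, grad_w)]
    return w, b

# Toy data (hypothetical): column 0 separates the classes, column 1 carries no signal.
rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = [1, 1, 0, 0]
weights, bias = train_l1_logreg(rows, labels)
```

On this toy data the weight of the uninformative column is thresholded to exactly zero while the informative column keeps a positive weight, which is the sense in which the model selects features and stays easy to inspect.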

9 Today's Focus Example

10 Feature vector construction. Example host facts: WHOIS registration: 3/25/2009; hosted from /22; IP hosted in San Mateo; connection speed: T1; has DNS PTR record? Yes; registrant “Chad”... These map to a feature vector [ _ _ … … …] whose parts are labeled Real-valued, Host-based, Lexical.
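One plausible reading of this slide is that each categorical fact becomes a binary indicator and each numeric fact keeps its value, all placed in one long vector. The sketch below illustrates that encoding; the feature-name scheme, the helper names, the URL length, and the token are hypothetical, and only the host facts themselves come from the slide.

```python
def build_vocabulary(examples):
    """Assign a column index to every feature name seen in the training examples."""
    vocab = {}
    for features in examples:
        for name in features:
            vocab.setdefault(name, len(vocab))
    return vocab

def to_vector(features, vocab):
    """Turn a {feature name: value} dict into a list; names not in vocab are dropped."""
    vector = [0.0] * len(vocab)
    for name, value in features.items():
        if name in vocab:
            vector[vocab[name]] = float(value)
    return vector

# Host facts from the slide, encoded as indicators (naming scheme is an assumption).
example = {
    "whois_registrant=Chad": 1,
    "whois_registered=3/25/2009": 1,
    "ip_city=San Mateo": 1,
    "connection_speed=T1": 1,
    "dns_has_ptr": 1,
    "url_length": 27,        # hypothetical real-valued feature
    "token=example": 1,      # hypothetical lexical token
}

vocab = build_vocabulary([example])
vector = to_vector(example, vocab)
```

Because every distinct registrant, city, or token gets its own column, the vector grows to tens of thousands of dimensions on real data, so a sparse representation would be the practical choice; the dense list here only keeps the sketch short.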

11 Features to consider? 1) Blacklists 2) Simple heuristics 3) Domain name registration 4) Host properties 5) Lexical

12 (1) Blacklist Queries. List of known malicious sites. Providers: SORBS, URIBL, SURBL, Spamhaus. Blacklist queries as features: In blacklist? Yes / No.

13 (2) Manually-Selected Features. Considered by previous studies [Fette et al., 2007][Zhang et al., 2007][Bergholz et al., 2008]: IP address in hostname? Number of dots in URL. WHOIS (domain name) registration date (example: stopgap.cn, registered 28 June 2009).
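Two of these hand-picked features can be computed from the URL string alone. The following is a small sketch using only the Python standard library; the function names and example URLs are illustrative assumptions, and the WHOIS registration date is left out because it requires a network lookup.

```python
import ipaddress
from urllib.parse import urlsplit

def hostname_is_ip(url):
    """Return True if the URL's hostname is a literal IPv4 or IPv6 address."""
    host = urlsplit(url).hostname  # None when the URL has no scheme/netloc
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def manual_features(url):
    """Compute the two URL-only heuristics listed on the slide."""
    return {
        "ip_in_hostname": int(hostname_is_ip(url)),
        "num_dots": url.count("."),
    }

# Hypothetical URLs; 192.0.2.7 is a documentation-only address.
by_ip = manual_features("http://192.0.2.7/login.php")
by_name = manual_features("http://www.example.com/index.html")
```

Note that `urlsplit` only recognizes the hostname when the URL includes a scheme such as `http://`, so bare strings like `example.com/x` would need normalizing first.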

14 (3) WHOIS Features. Domain name registration: date of registration, update, expiration; registrant (who registered the domain?); registrar (who manages the registration?). Example: registered on 29 June 2009 by SpamMedia.

15 (4) Host-Based Features. Blacklisted? (SORBS, URIBL, SURBL, Spamhaus) WHOIS: registrar, registrant, dates. IP address: Which ASes/IP prefixes? DNS: TTL? PTR record exists/resolves? Geography-related: Locale? Connection speed? Example: facebook.com vs. fblight.com (IP prefixes as extracted: / /20).

16 (5) Lexical Features: tokens in URL hostname + path; length of URL; number of dots.
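The three lexical features can be sketched as a bag of tokens plus two counts. In the example below, splitting on every non-alphanumeric character and keeping hostname tokens separate from path tokens are assumptions made for illustration; the slide does not specify the delimiters.

```python
import re
from urllib.parse import urlsplit

TOKEN_DELIMITERS = re.compile(r"[^A-Za-z0-9]+")  # assumed delimiter set

def lexical_features(url):
    """Return URL length, dot count, and one indicator per hostname/path token."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    features = {"url_length": len(url), "num_dots": url.count(".")}
    for token in TOKEN_DELIMITERS.split(host):
        if token:
            features["host_token=" + token.lower()] = 1
    for token in TOKEN_DELIMITERS.split(parts.path):
        if token:
            features["path_token=" + token.lower()] = 1
    return features

url = "http://www.example.com/account/login.php"  # hypothetical URL
features = lexical_features(url)
```

Prefixing tokens with their position lets a classifier weigh, say, `com` in the hostname differently from `com` appearing in the path.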

17 Which feature sets? Blacklist, Manual, WHOIS, Host-based, Lexical. [Chart; "# Features" labels as extracted: 4,000; 13, ,000] More features → better accuracy.

18 Which feature sets? Blacklist, Manual, WHOIS, Host-based, Lexical, Full. 96–99% accuracy. [Chart; "# Features" labels as extracted: 4,000; 13, ,000; 30,000]

19 Which feature sets? Blacklist, Manual, WHOIS, Host-based, Lexical, Full, w/o WHOIS/Blacklist. [Chart; "# Features" labels as extracted: 4,000; 13, ,000; 30,000; 26,000]

20 Beyond Blacklists. Blacklist vs. full features on the Yahoo-PhishTank data set: higher detection rate for a given false positive rate.
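Comparing detection rate at a fixed false positive rate amounts to reading one point off each classifier's ROC curve. Here is a minimal sketch of that calculation; the function name, scores, and labels are made up for illustration and are not results from the talk.

```python
def detection_rate_at_fpr(scores, labels, max_fpr):
    """Best true-positive rate reachable by a score threshold with FPR <= max_fpr.

    scores: higher means more likely malicious; labels: 1 malicious, 0 benign.
    Assumes both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_tpr = 0.0
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is flagged as malicious.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives <= max_fpr:
            best_tpr = max(best_tpr, tp / positives)
    return best_tpr

# Made-up classifier scores for six URLs (three malicious, three benign).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
strict = detection_rate_at_fpr(scores, labels, max_fpr=0.0)
relaxed = detection_rate_at_fpr(scores, labels, max_fpr=0.34)
```

A blacklist lookup yields only a yes/no answer, so it contributes a single operating point, whereas a scored classifier can be tuned along the whole curve; that is what makes a comparison at a matched false positive rate meaningful.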

21 Limitations. False positives: sites hosted in disreputable ISP; guilt by association. False negatives: compromised sites; free hosting sites; redirection (but we consider TinyURL malicious :); hosted in reputable ISP. Future work: Web page content.

22 Conclusion. Detect malicious URLs with high accuracy: only using URL; diverse feature set helps (99% w/ 30,000+ features); model analysis (more in paper). Our related efforts: online learning for URL reputation [ICML 2009]. Future work: scaling up for deployment.