A Neural Network Classifier for Junk E-Mail Ian Stuart, Sung-Hyuk Cha, and Charles Tappert CSIS Student/Faculty Research Day May 7, 2004.


A Neural Network Classifier for Junk E-Mail
Ian Stuart, Sung-Hyuk Cha, and Charles Tappert
CSIS Student/Faculty Research Day, May 7, 2004

Spam, spam, spam, …

Fighting spam
Several commercial applications exist
–Server-side: expensive
–Client-side: time-consuming
No approach is 100% effective
–Spammers are aggressive and adaptable
–Best solutions are typically hybrids of different approaches and criteria

Common approaches
Simple filters
–Common words or phrases
–Unusual punctuation or capitalization
Blacklisting: “just say NO” (if you can)
–Reject e-mail from known spammers
Whitelisting: “friends only, please”
–Accept e-mail only from known correspondents
Classifiers: examine each e-mail and decide
–Only a few publications on spam classifiers

Naïve Bayesian classifiers
Used in commercial classifiers
Assumes recognition features are independent
–Max likelihood = product of likelihoods of features
E-mail classifier – examines each word
–Training assigns a probability to each word
–Look up each word/probability in a dictionary
–If the product of the probabilities exceeds a given threshold, the e-mail is spam
Challenge – creating the “dictionary”
We compare our Neural Network against two published Naïve Bayesian classifiers
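The word-probability scheme above can be sketched as follows. The tiny “dictionary”, the default probability for unknown words, and the threshold are all hypothetical illustration values, not those of the published classifiers:

```python
import math

# Hypothetical per-word spam probabilities (the "dictionary").
SPAM_PROB = {"viagra": 0.95, "free": 0.80, "meeting": 0.10, "schedule": 0.05}
DEFAULT_PROB = 0.40  # assumed probability for words not in the dictionary
THRESHOLD = 0.5

def spam_score(message: str) -> float:
    """Combine per-word spam probabilities in the usual naive-Bayes way."""
    # Work in log space to avoid underflow on long messages.
    log_spam = log_ham = 0.0
    for word in message.lower().split():
        p = SPAM_PROB.get(word, DEFAULT_PROB)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Posterior probability of spam, assuming equal class priors.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

def is_spam(message: str) -> bool:
    return spam_score(message) > THRESHOLD

print(is_spam("free viagra"))        # True
print(is_spam("meeting schedule"))   # False
```

Working in log space and converting back to a posterior is equivalent to comparing the product of probabilities against a threshold, but avoids numerical underflow.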

Naïve Bayesian classifier issues
How many features (words), which ones?
How is degradation avoided as spammers’ vocabulary changes?
What values are assigned to new words?
What are the thresholds?
How to avoid “sabotage” of the classifier?

Which one isn’t spam? (subject headers)
5 Be a mighty warrior in bed! vcrhwt ygjztyjjh
Money Back Guarantee_HGH
kindle life pddez liw mzac
v a l i u m - D i a z e p a m used to relieve anxiety
Fairfield tennis schedule
:Dramatic E,nhancement fo=r.Men = f"fumqid
,Refina'nce now. Don't wait


Spammers make patterns
The more they try to hide, the easier it is to see them
Therefore, we use common spammer patterns (instead of vocabulary) as features for classification
Learn these patterns with a Neural Network

Neural Network features
Total of 17 features
–6 from the subject header
–2 from the priority and content-type headers
–9 from the body

Features from subject header
1. Number of words with no vowels
2. Number of words with at least two of the letters J, K, Q, X, Z
3. Number of words with at least 15 characters
4. Number of words with non-English characters, special characters such as punctuation, or digits at the beginning or middle of a word
5. Number of words with all letters in uppercase
6. Binary feature indicating 3 or more repeated characters
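A minimal sketch of extracting the six subject-header features, assuming simple whitespace tokenization; the paper’s exact parsing rules may differ:

```python
import re

VOWELS = set("aeiouAEIOU")
RARE = set("jkqxzJKQXZ")

def subject_features(subject: str) -> list:
    words = subject.split()
    alpha = [w for w in words if w.isalpha()]
    # 1. Alphabetic words with no vowels
    f1 = sum(1 for w in alpha if not set(w) & VOWELS)
    # 2. Words with at least two of J, K, Q, X, Z
    f2 = sum(1 for w in words if sum(c in RARE for c in w) >= 2)
    # 3. Words with at least 15 characters
    f3 = sum(1 for w in words if len(w) >= 15)
    # 4. Non-letter or non-ASCII characters at the beginning or middle of a word
    f4 = sum(1 for w in words
             if any(not c.isascii() or not c.isalpha() for c in w[:-1]))
    # 5. All-uppercase words
    f5 = sum(1 for w in alpha if w.isupper())
    # 6. Binary flag: 3 or more repeated characters anywhere in the subject
    f6 = 1 if re.search(r"(.)\1\1", subject) else 0
    return [f1, f2, f3, f4, f5, f6]

print(subject_features("BUY NOW vcrhwt"))  # [1, 0, 0, 0, 2, 0]
```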

Features from priority and content-type headers
1. Binary feature indicating whether the priority had been set to any level besides normal or medium
2. Binary feature indicating whether a content-type header appeared within the message headers or whether the content type had been set to “text/html”

Features from message body
1. Proportion of alphabetic words with no vowels and at least 7 characters
2. Proportion of alphabetic words with at least two of the letters J, K, Q, X, Z
3. Proportion of alphabetic words at least 15 characters long
4. Binary feature indicating whether the strings “From:” and “To:” were both present
5. Number of HTML opening comment tags
6. Number of hyperlinks (“href=”)
7. Number of clickable images represented in HTML
8. Binary feature indicating whether a text color was set to white
9. Number of URLs in hyperlinks containing digits or “&” or “%”
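A few of the HTML-based body features (5, 6, and 8) can be sketched with simple pattern matching; the regexes here are illustrative assumptions, not the paper’s exact parsing rules:

```python
import re

def body_features(body: str) -> dict:
    return {
        # Feature 5: HTML opening comment tags
        "html_comments": body.count("<!--"),
        # Feature 6: hyperlinks ("href=")
        "hyperlinks": len(re.findall(r"href\s*=", body, re.I)),
        # Feature 8: text color set to white
        "white_text": 1 if re.search(r'color\s*[:=]\s*"?(#ffffff|white)',
                                     body, re.I) else 0,
    }

sample = '<a href="http://example.com">click</a><!-- hidden -->'
print(body_features(sample))  # {'html_comments': 1, 'hyperlinks': 1, 'white_text': 0}
```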

Neural Network spam classifier
3-layer, feed-forward network (multi-layer perceptron)
–17 input units, variable number of hidden-layer units, 1 output unit
Data – 1,654 e-mails: 854 spam, 800 legitimate
Use half of each (spam/non-spam) for training, the other half for testing
Test with variations of hidden nodes (4 to 14) and epochs (100 to 500)
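The architecture above (17 inputs, one hidden layer, one sigmoid output trained by backpropagation) can be sketched in NumPy. The random data here stands in for the 17 extracted features, and the hyperparameters are illustrative, not the paper’s:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden=12, epochs=500, lr=1.0):
    """Batch gradient descent on squared error through two sigmoid layers."""
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden))
    W2 = rng.normal(0, 0.5, (hidden, 1))
    for _ in range(epochs):
        # Forward pass
        h = sigmoid(X @ W1)
        out = sigmoid(h @ W2)
        # Backward pass: error signal through each sigmoid derivative
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out / len(X)
        W1 -= lr * X.T @ d_h / len(X)
    return W1, W2

def predict(X, W1, W2):
    return (sigmoid(sigmoid(X @ W1) @ W2) > 0.5).astype(int)

# Toy data: 100 "e-mails" with 17 features each; the label depends only
# on the first feature, so the network has a learnable pattern.
X = rng.random((100, 17))
y = (X[:, 0] > 0.5).astype(float).reshape(-1, 1)
W1, W2 = train(X, y)
acc = (predict(X, W1, W2) == y).mean()
```

The 12-hidden-node, 500-epoch configuration mirrors the setting that gave the best results in the experiments reported below.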

Definitions used for classifier success measures
nSS = number of spam classified as spam
nSL = number of spam classified as legitimate
nLL = number of legitimate classified as legitimate
nLS = number of legitimate classified as spam

Measure of success: precision
Precision: the percentage of labeled spam/legitimate correctly classified
–Spam precision = nSS / (nSS + nLS)
–Legitimate precision = nLL / (nLL + nSL)


Measure of success: accuracy
Accuracy: the percentage of actual spam/legitimate correctly classified
–Spam accuracy = nSS / (nSS + nSL)
–Legitimate accuracy = nLL / (nLL + nLS)

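The four measures can be checked directly against the reported results. The raw counts are derived from the test split (427 spam, 400 legitimate, i.e., half of each class) and the stated misclassification counts (35 and 32):

```python
# Counts derived from the reported test split and misclassifications.
n_SS, n_SL = 427 - 35, 35   # spam classified as spam / as legitimate
n_LL, n_LS = 400 - 32, 32   # legitimate classified as legitimate / as spam

spam_precision = n_SS / (n_SS + n_LS)
legit_precision = n_LL / (n_LL + n_SL)
spam_accuracy = n_SS / (n_SS + n_SL)
legit_accuracy = n_LL / (n_LL + n_LS)

print(f"Spam precision:       {spam_precision:.2%}")   # 92.45%
print(f"Legitimate precision: {legit_precision:.2%}")  # 91.32%
print(f"Spam accuracy:        {spam_accuracy:.2%}")    # 91.80%
print(f"Legitimate accuracy:  {legit_accuracy:.2%}")   # 92.00%
```

These reproduce the figures on the results slide, which confirms that “accuracy” here is per-class recall rather than overall accuracy.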

Neural Network results
Best overall results with 12 hidden nodes at 500 epochs
–Spam precision: 92.45%
–Legitimate precision: 91.32%
–Spam accuracy: 91.80%
–Legitimate accuracy: 92.00%
35 spams misclassified: 8.20%
32 legitimates misclassified: 8.00%

Misclassified e-mails
Most spam misclassified as legitimate was short in length, with few hyperlinks
Most legitimate e-mails misclassified as spam had unusual features for personal e-mail (that is, they were “spam-like” in appearance)

Comparing Neural Network and Naïve Bayesian Classifiers
Accuracy of the NN classifier is comparable to that reported for Naïve Bayesian classifiers
NN classifier required fewer features (17 versus 100 in one study and 500 in another)
NN classifier uses descriptive qualities of words and messages similar to those used by human readers

Blacklisting Experiment
Manually entered IP addresses of e-mails incorrectly tagged by the NN classifier
–Entered the first (original) IP address and, when present, the second IP address (e.g., mail server or ISP)
Into a website that sends IP addresses to 173 working spam blacklists and returns the number of hits
Counted only hit counts greater than one as spam, since single-list hits were considered anomalies

Blacklisting Experimental Results
Of the 32 legitimate e-mails misclassified by the NN, 53% were identified as spam
Of the 35 spam e-mails misclassified by the NN, 97% were identified as spam
These poor results indicate that the blacklisting strategy, at least for these databases, is inadequate

Conclusions
NN competitive with Naïve Bayesian studies despite using a much smaller feature set
Room for refinement of parsing for features
Use of descriptive, more human-like features makes the NN less subject to degradation than Naïve Bayesian

Conclusions (cont.)
Neural Network approach is useful and accurate, but too many legitimate e-mails are classified as spam
Should be powerful when used in conjunction with a whitelist to reduce legitimate-as-spam errors (nLS), increasing spam precision and legitimate accuracy
Blacklisting strategy is not very helpful