A Neural Network Classifier for Junk E-Mail Ian Stuart, Sung-Hyuk Cha, and Charles Tappert CSIS Student/Faculty Research Day May 7, 2004.


1 A Neural Network Classifier for Junk E-Mail Ian Stuart, Sung-Hyuk Cha, and Charles Tappert CSIS Student/Faculty Research Day May 7, 2004

2 Spam, spam, spam, …

3 Fighting spam
Several commercial applications exist
– Server-side: expensive
– Client-side: time-consuming
No approach is 100% effective
– Spammers are aggressive and adaptable
– Best solutions are typically hybrids of different approaches and criteria

4 Common approaches
Simple filters
– Common words or phrases
– Unusual punctuation or capitalization
Blacklisting: “just say NO” (if you can)
– Reject e-mail from known spammers
Whitelisting: “friends only, please”
– Accept e-mail only from known correspondents
Classifiers: examine each e-mail and decide
– Only a few publications on spam classifiers

5 Naïve Bayesian classifiers
Used in commercial classifiers
Assumes recognition features are independent
– Max likelihood = product of likelihoods of features
E-mail classifier: examines each word
– Training assigns a probability to each word
– Look up each word/probability in a dictionary
– If the product of the probabilities exceeds a given threshold, it is spam
Challenge: creating the “dictionary”
We compare our Neural Network against two published Naïve Bayesian classifiers
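
The word-by-word decision rule on this slide can be sketched roughly as follows. This is a minimal illustration and not the classifiers the talk compares against: the per-word likelihoods, the `UNSEEN` fallback value, and the function names are all hypothetical, and the product of likelihoods is evaluated as a sum of logarithms to avoid numeric underflow.

```python
import math

# Hypothetical per-word likelihoods P(word | class); real values come from training.
P_WORD_GIVEN_SPAM = {"guarantee": 0.030, "refinance": 0.020, "tennis": 0.001, "schedule": 0.002}
P_WORD_GIVEN_LEGIT = {"guarantee": 0.002, "refinance": 0.001, "tennis": 0.010, "schedule": 0.020}
UNSEEN = 1e-4  # assumed likelihood for words missing from the dictionary


def spam_log_odds(text, prior_spam=0.5):
    """Sum per-word log-likelihood ratios (independence turns the product into a sum)."""
    score = math.log(prior_spam / (1.0 - prior_spam))
    for word in text.lower().split():
        p_spam = P_WORD_GIVEN_SPAM.get(word, UNSEEN)
        p_legit = P_WORD_GIVEN_LEGIT.get(word, UNSEEN)
        score += math.log(p_spam / p_legit)
    return score


def is_spam(text, threshold=0.0):
    """Label the text as spam when the log-odds exceed the chosen threshold."""
    return spam_log_odds(text) > threshold
```

The sketch makes the "dictionary" challenge concrete: every word outside the two tables contributes nothing to the score, so the result depends entirely on which words were chosen and what values they were given.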

6 Naïve Bayesian classifier issues
How many features (words), which ones?
How is degradation avoided as spammers’ vocabulary changes?
What values are assigned to new words?
What are the thresholds?
How to avoid “sabotage” of classifier?

7 Which one isn’t spam? (subject headers)
5 Be a mighty warrior in bed! vcrhwt ygjztyjjh
Money Back Guarantee_HGH
kindle life pddez liw mzac
v a l i u m - D i a z e p a m used to relieve anxiety
Fairfield tennis schedule
:Dramatic E,nhancement fo=r.Men = f"fumqid
,Refina'nce now. Don't wait


9 Spammers make patterns
The more they try to hide, the easier it is to see them
Therefore, we use common spammer patterns (instead of vocabulary) as features for classification
Learn these patterns with a Neural Network

10 Neural Network features
Total of 17 features
– 6 from the subject header
– 2 from priority and content-type headers
– 9 from the e-mail body

11 Features from subject header
1. Number of words with no vowels
2. Number of words with at least two of the letters J, K, Q, X, Z
3. Number of words with at least 15 characters
4. Number of words with non-English characters, special characters such as punctuation, or digits at the beginning or middle of the word
5. Number of words with all letters in uppercase
6. Binary feature indicating 3 or more repeated characters
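
One possible reading of these six subject-header features is sketched below. The slide leaves several details open, so the following are assumptions rather than the authors' parsing rules: words are whitespace-separated tokens, only a, e, i, o, u count as vowels, "at least two of the letters" counts total occurrences, feature 4 flags any non-letter before the last character of a word, and feature 6 means the same character three times in a row. The function name is hypothetical.

```python
import re

VOWELS = set("aeiou")   # assumption: "y" is not treated as a vowel
RARE = set("jkqxz")


def subject_features(subject):
    """Return the six subject-header features as a list, in slide order."""
    words = subject.split()
    # 1. words that contain letters but no vowels
    no_vowels = sum(
        1 for w in words
        if any(c.isalpha() for c in w) and not VOWELS & set(w.lower())
    )
    # 2. words with at least two occurrences of J, K, Q, X, Z
    rare = sum(1 for w in words if sum(c in RARE for c in w.lower()) >= 2)
    # 3. words of 15 characters or more
    long_words = sum(1 for w in words if len(w) >= 15)
    # 4. words with a non-letter at the beginning or middle (last character ignored)
    odd = sum(1 for w in words if re.search(r"[^A-Za-z]", w[:-1]))
    # 5. words whose letters are all uppercase
    upper = sum(1 for w in words if w.isupper())
    # 6. binary flag: the same character three or more times in a row
    repeated = 1 if re.search(r"(.)\1\1", subject) else 0
    return [no_vowels, rare, long_words, odd, upper, repeated]
```

Under these assumptions, the subject "Fairfield tennis schedule" from slide 7 yields all zeros, while ":Dramatic E,nhancement fo=r.Men = f"fumqid" scores 4 on the fourth feature.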

12 Features from priority and content-type headers
1. Binary feature indicating whether the priority had been set to any level besides normal or medium
2. Binary feature indicating whether a content-type header appeared within the message headers or whether the content type had been set to “text/html”
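
A rough sketch of these two header features, using Python's standard `email` package. The slide does not name the headers that were inspected, so the header names `X-Priority` and `Priority`, the set of values treated as "normal or medium", and the function name are assumptions made for illustration.

```python
from email import message_from_string

# Assumed spellings of a "normal or medium" priority level.
NORMAL_LEVELS = {"3", "normal", "medium"}


def header_features(raw_message):
    """Return [unusual_priority, content_type_flag] for a raw RFC 822 message."""
    msg = message_from_string(raw_message)

    # Feature 1: a priority header is present and is not normal/medium.
    priority = msg.get("X-Priority") or msg.get("Priority") or ""
    tokens = priority.strip().lower().split()
    level = tokens[0] if tokens else ""
    unusual_priority = 1 if level and level not in NORMAL_LEVELS else 0

    # Feature 2: a Content-Type header appears, or the type is text/html.
    has_header = "Content-Type" in msg
    is_html = msg.get_content_type() == "text/html"
    content_type_flag = 1 if (has_header or is_html) else 0

    return [unusual_priority, content_type_flag]
```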

13 Features from message body
1. Proportion of alphabetic words with no vowels and at least 7 characters
2. Proportion of alphabetic words with at least two of the letters J, K, Q, X, Z
3. Proportion of alphabetic words at least 15 characters long
4. Binary feature indicating whether the strings “From:” and “To:” were both present
5. Number of HTML opening comment tags
6. Number of hyperlinks (“href=”)
7. Number of clickable images represented in HTML
8. Binary feature indicating whether a text color was set to white
9. Number of URLs in hyperlinks with digits or “&”, “%”, or “@”
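
The markup-based body features (4 through 9) could be approximated with a few regular expressions, as in the sketch below; the three word-proportion features are omitted. The patterns, the dictionary keys, and the function name are illustrative assumptions, not the parsing used in the study, and the white-text pattern is deliberately crude (it would also match a white background color).

```python
import re

# Assumed patterns; real-world HTML would need a proper parser.
HREF = re.compile(r'href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)
CLICKABLE_IMG = re.compile(r"<a\b[^>]*>\s*<img\b", re.IGNORECASE)
WHITE_TEXT = re.compile(r'color\s*[=:]\s*["\']?\s*(?:white|#fff(?:fff)?)\b', re.IGNORECASE)


def body_markup_features(body):
    """Return body features 4-9 from the slide, keyed by a descriptive name."""
    urls = HREF.findall(body)
    return {
        "from_and_to": int("From:" in body and "To:" in body),      # feature 4
        "comment_tags": body.count("<!--"),                         # feature 5
        "hyperlinks": len(urls),                                    # feature 6
        "clickable_images": len(CLICKABLE_IMG.findall(body)),       # feature 7
        "white_text": int(bool(WHITE_TEXT.search(body))),           # feature 8
        "suspicious_urls": sum(                                     # feature 9
            1 for u in urls if re.search(r"[0-9&%@]", u)
        ),
    }
```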

14 Neural Network spam classifier
3-layer, feed-forward network (Perceptron)
– 17 input units, variable number of hidden-layer units, 1 output unit
Data: 1,654 e-mails (854 spam, 800 legitimate)
Use half of each (spam/non-spam) for training, the other half for testing
Test with variations of hidden nodes (4 to 14) and epochs (100 to 500)
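
The shape of such a network (17 inputs, one hidden layer, 1 output) can be sketched in a few lines of plain Python. The slides do not state the activation function, weight initialisation, or training algorithm, so the sigmoid units, the uniform random weights, and the names `init_network` and `forward` are assumptions, and training is left out entirely.

```python
import math
import random


def init_network(n_inputs=17, n_hidden=12, seed=0):
    """Create random weights; the last weight in each row is the bias."""
    rng = random.Random(seed)
    w_hidden = [
        [rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
        for _ in range(n_hidden)
    ]
    w_output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w_hidden, w_output


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(network, features):
    """Propagate one 17-element feature vector to the single output unit."""
    w_hidden, w_output = network
    hidden = [
        sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, features)))
        for w in w_hidden
    ]
    return sigmoid(w_output[-1] + sum(wi * hi for wi, hi in zip(w_output, hidden)))


# Half of each class goes to training, the other half to testing.
train_spam, test_spam = 854 // 2, 854 - 854 // 2      # 427 and 427
train_legit, test_legit = 800 // 2, 800 - 800 // 2    # 400 and 400
```

An output near 1 would be read as spam and near 0 as legitimate once the weights have been trained; the cut-off between the two is another choice the slides leave unspecified.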

15 Definitions used for classifier success measures
n_SS = number of spam classified as spam
n_SL = number of spam classified as legitimate
n_LL = number of legitimate classified as legitimate
n_LS = number of legitimate classified as spam

16 Measure of success: precision
Precision: the percentage of e-mail labeled as spam (or as legitimate) by the classifier that is correctly classified
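
The formulas shown on the original slides are not part of this transcript. Using the four counts defined on slide 15, the stated definition is consistent with the following ratios, which also reproduce the precision figures reported on slide 20:

$$
\text{Spam precision} = \frac{n_{SS}}{n_{SS} + n_{LS}},
\qquad
\text{Legitimate precision} = \frac{n_{LL}}{n_{LL} + n_{SL}}
$$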


18 Measure of success: accuracy
Accuracy: the percentage of actual spam (or actual legitimate) e-mail that is correctly classified
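
The slide formulas are likewise missing here. With the counts from slide 15, the definition corresponds to the ratios below, which match the accuracy figures on slide 20. Note that this per-class "accuracy" is the quantity usually called recall.

$$
\text{Spam accuracy} = \frac{n_{SS}}{n_{SS} + n_{SL}},
\qquad
\text{Legitimate accuracy} = \frac{n_{LL}}{n_{LL} + n_{LS}}
$$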


20 Neural Network results
Best overall results with 12 hidden nodes at 500 epochs
– Spam Precision: 92.45%
– Legitimate Precision: 91.32%
– Spam Accuracy: 91.80%
– Legitimate Accuracy: 92.00%
35 spams misclassified: 8.20%
32 legitimates misclassified: 8.00%
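
These percentages can be cross-checked from the counts given in the slides. The sketch below assumes the test half contains 427 spam and 400 legitimate e-mails (half of 854 and 800), and that precision and accuracy are the ratios written in the code, since the slide formulas are not in the transcript. The function name `rates` is hypothetical.

```python
def rates(n_ss, n_sl, n_ll, n_ls):
    """Return the four success measures as percentages."""
    return {
        # of everything labeled spam, how much really was spam
        "spam_precision": 100.0 * n_ss / (n_ss + n_ls),
        # of everything labeled legitimate, how much really was legitimate
        "legit_precision": 100.0 * n_ll / (n_ll + n_sl),
        # of all actual spam, how much was caught
        "spam_accuracy": 100.0 * n_ss / (n_ss + n_sl),
        # of all actual legitimate mail, how much was passed
        "legit_accuracy": 100.0 * n_ll / (n_ll + n_ls),
    }


# Misclassification counts reported on the slide.
n_sl, n_ls = 35, 32
results = rates(n_ss=427 - n_sl, n_sl=n_sl, n_ll=400 - n_ls, n_ls=n_ls)
for name, value in results.items():
    print(f"{name}: {value:.2f}%")
# spam_precision: 92.45%
# legit_precision: 91.32%
# spam_accuracy: 91.80%
# legit_accuracy: 92.00%
```

The four values agree with the reported results, and 35/427 and 32/400 give the 8.20% and 8.00% misclassification rates.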

21 Misclassified e-mails
Most spam misclassified as legitimate were short in length, with few hyperlinks
Most legitimate e-mails misclassified as spam had unusual features for personal e-mail (that is, they were “spam-like” in appearance)

22 Comparing Neural Network and Naïve Bayesian Classifiers
Accuracy of the NN classifier is comparable to that reported for Naïve Bayesian classifiers
NN classifier required fewer features (17 versus 100 in one study and 500 in another)
NN classifier uses descriptive qualities of words and messages similar to those used by human readers

23 Blacklisting Experiment
Manually entered IP addresses of e-mails incorrectly tagged by the NN classifier into a website that sends IP addresses to 173 working spam blacklists and returns the number of hits, http://www.declude.com/junkmail/support/ip4r.htm
– Entered first (original) IP address and, when present, second IP address (e.g., mail server or ISP)
Counted only hit counts greater than one as spam, since single-list hits were considered to be anomalies
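
The counting rule on this slide amounts to a simple threshold on blacklist hits. In the sketch below the function name is hypothetical, and treating an e-mail as spam when either of its IP addresses exceeds one hit is an assumption, because the slide does not say how the first and second addresses were combined.

```python
def blacklist_says_spam(hit_counts, min_hits=2):
    """hit_counts holds one hit total per IP address checked for an e-mail.

    A single-list hit is treated as an anomaly, so at least `min_hits`
    blacklists must report an address before the e-mail counts as spam.
    """
    return any(hits >= min_hits for hits in hit_counts)
```

For example, `blacklist_says_spam([1])` is `False` (one list only), while `blacklist_says_spam([0, 3])` is `True` because the second address appears on three lists.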

24 Blacklisting Experimental Results
Of the 32 legitimate e-mails misclassified by the NN, 53% were identified as spam
Of the 35 spam e-mails misclassified by the NN, 97% were identified as spam
These poor results indicate that the blacklisting strategy, at least for these databases, is inadequate

25 Conclusions
NN is competitive with the Naïve Bayesian studies despite using a much smaller feature set
Room for refinement of parsing for features
Use of descriptive, more human-like features makes NN less subject to degradation than Naïve Bayesian

26 Conclusions (cont.)
Neural Network approach is useful and accurate, but too many legitimate e-mails are classified as spam (legitimate -> spam)
Should be powerful when used in conjunction with a whitelist to reduce legitimate -> spam (n_LS), increasing spam precision and legitimate accuracy
Blacklisting strategy is not very helpful

