SCAVENGER: A JUNK MAIL CLASSIFICATION PROGRAM Rohan Malkhare Committee : Dr. Eugene Fink Dr. Dewey Rundus Dr. Alan Hevner.


1 SCAVENGER: A JUNK MAIL CLASSIFICATION PROGRAM Rohan Malkhare Committee : Dr. Eugene Fink Dr. Dewey Rundus Dr. Alan Hevner

2 Outline Introduction Related Work Algorithm Measurements Implementation Future Work

3 Introduction Anti-spam efforts Legislation Technology –White listing of Email addresses –Black Listing of Email addresses/domains –Challenge Response mechanisms –Content Filtering Learning Techniques

4 Introduction Learning techniques for Spam classification Feature Extraction Assignment of weights to individual features representing the predictive strength of a feature Combining weights of extracted features during classification to numerically determine whether mail is spam/legitimate

5 Introduction Current algorithms Words or phrases as features Probabilities of occurrence in spam/legitimate collections as weights Bayes rule or one of its variants for combining weights
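As a rough illustration of "Bayes rule for combining weights", the sketch below combines per-feature spam probabilities under the naive independence assumption with equal priors for spam and legitimate mail. The function name and the example probabilities are assumptions made for illustration; the slides do not give a formula.

```python
from math import prod

def combine_naive_bayes(spam_probs):
    """Combine per-feature spam probabilities, assuming independent features
    and equal prior probability of spam and legitimate mail."""
    p_spam = prod(spam_probs)
    p_legit = prod(1.0 - p for p in spam_probs)
    if p_spam + p_legit == 0.0:
        # One feature says "certainly spam" and another "certainly legitimate":
        # the evidence cancels out, so fall back to the neutral value.
        return 0.5
    return p_spam / (p_spam + p_legit)

# Two features that each lean strongly towards spam reinforce each other.
print(combine_naive_bayes([0.9, 0.9]))
```

A single neutral feature (probability 0.5) leaves the combined score at 0.5, while several features pointing the same way push the score towards 0 or 1.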

6 Outline Introduction Related Work Algorithm Measurements Implementation Future Work

7 Related Work Cohen (1996): –RIPPER, Rule Learning System –Rules in a human-comprehensible format Pantel & Lin (1998): – Naïve-Bayes with words as features Microsoft Research (1998): –Naïve-Bayes with the mutual information measure to select features with strongest resolving power –Words and domain-specific attributes of spam used as features

8 Related Work Paul Graham (2002): A Plan for Spam –Very popular algorithm credited with starting the craze for Bayesian filters –Uses naïve-Bayes with words as features Bill Yerazunis (2002): CRM114 sparse binary polynomial hashing algorithm –Most accurate algorithm to date (over 99.7% accuracy) –Distinctive because of its powerful feature extraction technique –Uses Bayesian chain rule for combining weights

9 Related Work CRM114 algorithm Feature Extraction –Slide a window of 5 words over the incoming text –Generate order-preserving sub-phrases containing all combinations of windowed words –For one window, 2^4 = 16 features are generated –Very high computational complexity –E.g. “Click here to buy Viagra” Features generated would be “Click”, “Click here”, “Click to”, “Click buy”, “Click Viagra”, “Click here to”, “Click here buy” etc.
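The windowing step can be sketched as follows. This is only an illustration that follows the slide's example, where every sub-phrase keeps the first word of the window and any subset of the remaining four; it is not CRM114's actual code, and the function names are made up for the sketch.

```python
from itertools import combinations

def window_features(window):
    """Order-preserving sub-phrases that all start with the first windowed word."""
    first, rest = window[0], tuple(window[1:])
    features = []
    for size in range(len(rest) + 1):
        # combinations() keeps the original word order within each subset.
        for picked in combinations(rest, size):
            features.append(" ".join((first,) + picked))
    return features

def sliding_window_features(text, width=5):
    """Slide a window of `width` words over the text and collect all features."""
    words = text.split()
    features = []
    for start in range(len(words)):
        features.extend(window_features(words[start:start + width]))
    return features

features = window_features("Click here to buy Viagra".split())
print(len(features))  # 2^4 = 16 features for one full window
```

Because each full window contributes 16 features, the feature count grows quickly with message length, which is the computational cost the slide refers to.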

10 Outline Introduction Related Work Algorithm Measurements Implementation Future Work

11 Algorithm Feature Extraction –Sentences in a message are identified by using the delimiting characters ‘.’, ‘?’, ‘!’, ‘;’, ‘ ’ –All possible word-pairings are formed from the sentences –Commonly occurring words are skipped –These word-pairings serve as features to be used for classification

12 Algorithm Feature Extraction (continued…) –If the number of words in a sentence exceeds a constant K, then each series of K words is treated as a sentence –The value of K is set to 20 –E.g. “There is a problem in the tables that have been copied to the database” “problem tables”, “tables problem”, “problem copied”, “copied problem”, “problem database”, “database problem” etc. are the features that would be formed out of the sentence

13 Algorithm Feature Extraction (continued…) –Entire subject line is treated as one sentence –For HTML, all content within ‘ ’ is treated as one sentence –For a sentence of n words, ‘scavenger’ creates (n-1)*(n-2) features as compared to 2^(n-1) created by CRM114
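A minimal sketch of the word-pairing extraction described on the last three slides. Several details are assumptions: the stop-word list is hypothetical (the slides do not say which common words are skipped), only the four delimiters that survive in the transcript are used, the K-word cut is applied before common words are dropped, and subject lines and HTML are not handled.

```python
import re
from itertools import permutations

K = 20  # maximum sentence length in words, as stated on the slide

# Hypothetical stop list, chosen only to cover the slide's example sentence.
COMMON_WORDS = {"there", "is", "a", "in", "the", "that", "have", "been", "to"}

def split_sentences(text, k=K):
    """Split on '.', '?', '!' and ';', then cut long sentences into runs of k words."""
    sentences = []
    for chunk in re.split(r"[.?!;]", text):
        words = re.findall(r"[a-z0-9']+", chunk.lower())
        for i in range(0, len(words), k):
            sentences.append(words[i:i + k])
    return sentences

def word_pair_features(text, k=K, common=COMMON_WORDS):
    """Return the set of ordered word pairs formed within each sentence."""
    features = set()
    for sentence in split_sentences(text, k):
        kept = [w for w in sentence if w not in common]
        for first, second in permutations(kept, 2):
            if first != second:
                features.add(f"{first} {second}")
    return features

example = "There is a problem in the tables that have been copied to the database"
print(sorted(word_pair_features(example)))
```

With this stop list the example sentence keeps four words (problem, tables, copied, database), and the output contains pairs in both orders, such as "problem tables" and "tables problem".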

14 Algorithm Weight Assignment –Weights represent predictive strength of features –Discretized values are assigned as weights to features depending on whether the feature is a ‘strong’ evidence or a ‘weak’ evidence –‘Strong’ pieces of evidence should have high impact on the classification decision and ‘weak’ pieces of evidence should have low impact on the classification decision

15 Algorithm Weight Assignment (Continued…) –Categorization of features into ‘strong’ and ‘weak’ pieces of evidence is done on the basis of frequency of occurrence of the feature in spam/legitimate collections, exclusivity of occurrence and on heuristic rules like distance between words in the word-pairing, whether the feature is from the subject or the body. –Only exclusively occurring features are assigned weights –Features occurring in both spam and legitimate collections are ignored.

16 Algorithm Weight Assignment (Continued…) –What weights to select for the ‘strong’ evidences and the ‘weak’ evidences? –During classification, the class having more pieces of ‘strong’ evidence should ‘win’ regardless of the number of ‘weak’ evidences on either side. –In the absence of ‘strong’ evidences on either side, the class having more pieces of ‘weak’ evidence should ‘win’.

17 Algorithm Weight Assignment (Continued…) –Intuitively, we would like to have as much ‘distance’ as possible between the values we choose for the ‘strong’ and ‘weak’ evidences. –We select 0.9 as the weight for ‘strong’ evidences and 0.1 as the weight for ‘weak’ evidences.
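The weight assignment might look roughly like the sketch below. The 0.9 and 0.1 values and the exclusivity rule come from the slides; the frequency cut-off `STRONG_MIN_COUNT` is a hypothetical stand-in for the frequency and heuristic rules (word distance, subject versus body), which the slides do not spell out.

```python
from collections import Counter

STRONG, WEAK = 0.9, 0.1   # weights given on the slide
STRONG_MIN_COUNT = 3      # hypothetical cut-off, not taken from the slides

def assign_weights(spam_features, legit_features, strong_min_count=STRONG_MIN_COUNT):
    """Each argument lists feature strings, one entry per occurrence in that collection.

    Returns (spam_weights, legit_weights); features seen in both collections
    are ignored, as the slides require.
    """
    spam_counts = Counter(spam_features)
    legit_counts = Counter(legit_features)

    def weigh(own, other):
        return {
            feature: (STRONG if count >= strong_min_count else WEAK)
            for feature, count in own.items()
            if feature not in other  # keep only exclusively occurring features
        }

    return weigh(spam_counts, legit_counts), weigh(legit_counts, spam_counts)

spam = ["buy viagra"] * 3 + ["click here", "meeting notes"]
legit = ["meeting notes"] * 2 + ["project update"] * 4
print(assign_weights(spam, legit))
```

Here "meeting notes" appears in both collections and receives no weight on either side.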

18 Algorithm Combining of weights –Total spam evidence = sum of spam weights of matching features –Total legitimate evidence = sum of legitimate weights of matching features –If Total spam evidence >= M * Total legitimate evidence, then message is spam –M is the threshold parameter which can be used as a ‘tuning knob’
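The decision rule on this slide translates almost directly into code. In the sketch below the function name and the toy weight tables are assumptions for illustration; the comparison itself is the one stated on the slide.

```python
def classify_as_spam(features, spam_weights, legit_weights, m=1.0):
    """Return True if total spam evidence >= m * total legitimate evidence."""
    features = list(features)  # allow any iterable, including generators
    spam_evidence = sum(spam_weights.get(f, 0.0) for f in features)
    legit_evidence = sum(legit_weights.get(f, 0.0) for f in features)
    return spam_evidence >= m * legit_evidence

# Toy weight tables, invented for the example.
spam_weights = {"buy viagra": 0.9, "click here": 0.1}
legit_weights = {"project update": 0.9}

message = ["buy viagra", "click here", "project update"]
print(classify_as_spam(message, spam_weights, legit_weights, m=1.0))  # 1.0 >= 0.9
print(classify_as_spam(message, spam_weights, legit_weights, m=2.0))  # 1.0 >= 1.8
```

Raising M demands more spam evidence before a message is flagged, trading missed spam for fewer false positives. Note that under the rule as written, a message with no matching features gives 0 >= 0 and is labelled spam, so a practical filter would probably treat that case separately.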

19 Outline Introduction Related Work Algorithm Measurements Implementation Future Work

20 Measurements Precision and Recall used as parameters of measurement –Spam Precision=Messages correctly classified as spam / Total Messages classified as spam –Spam Recall = Messages correctly classified as spam / Total Spam Messages in Testing set –Precision gives accuracy with respect to false positives –Recall gives capacity of filter to catch spam
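The two measures can be computed from labelled test results as in this small sketch. The function name and the boolean encoding (True meaning spam) are assumptions made for the example.

```python
def spam_precision_recall(actual, predicted):
    """actual, predicted: equal-length sequences of booleans, True meaning spam."""
    correctly_flagged = sum(1 for a, p in zip(actual, predicted) if a and p)
    total_flagged = sum(1 for p in predicted if p)
    total_spam = sum(1 for a in actual if a)
    precision = correctly_flagged / total_flagged if total_flagged else 0.0
    recall = correctly_flagged / total_spam if total_spam else 0.0
    return precision, recall

# Four spam messages and one legitimate one; the filter misses one spam
# and raises no false alarm.
actual = [True, True, True, True, False]
predicted = [True, True, True, False, False]
print(spam_precision_recall(actual, predicted))  # (1.0, 0.75)
```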

21 Measurements Testing data –Downloaded around 5600 spam messages from http://www.spamarchive.org –Used around 960 legitimate mails from Dr. Fink’s mailbox Cross-Validation –K-fold cross-validation for two values of K, K=2 and K=5 –K=2: Dividing data into 2 equal-sized sets –K=5: Dividing data into 5 equal-sized sets
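K-fold cross-validation can be sketched as below: the data is shuffled once, cut into K folds, and each fold serves as the test set exactly once while the rest is used for training. The shuffling, the fixed seed, and the function name are assumptions; the slides only state that the data was divided into K equal-sized sets.

```python
import random

def k_fold_splits(items, k, seed=0):
    """Yield (train, test) lists for each of the k folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Striding gives folds whose sizes differ by at most one item.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

for train, test in k_fold_splits(range(10), k=5):
    print(len(train), len(test))
```

The reported precision and recall would then be the averages over the K runs.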

22 Measurements Comparison with Paul Graham’s naïve-bayes algorithm Implemented Graham’s algorithm for two methods of feature extraction –Words+phrases as features –Feature extraction similar to ‘scavenger’

23 Measurements
Average spam precision and spam recall for K=5 and K=2:

Algorithm                                                K=5 precision  K=5 recall  K=2 precision  K=2 recall
Scavenger (M=1)                                          100%           99.85%      99.92%         99.72%
Naïve-Bayes (words+phrases)                              100%           98.87%      99.80%         97.03%
Naïve-Bayes (with scavenger feature extraction method)   100%           99.15%      99.65%         98.68%

24 Measurements

M      Missed Spam (%)   False Positives (%)   Spam Recall (%)
0.25   0                 13.23                 100
0.5    0.07              6.17                  99.93
0.75   0.11              2.05                  99.89
1      0.28              0.58                  99.72
1.25   0.3               0.58                  99.70
1.5    0.35              0.58                  99.66
1.75   0.41              0.29                  99.59
2      0.49              0                     99.51
2.25   0.6               0                     99.4
2.5    0.74              0                     99.36

25 Measurements

26 Why does ‘scavenger’ perform better than naïve-Bayes? –Powerful feature extraction (as powerful as CRM114) –Calculates predictive strength on the basis of frequency of occurrence as well as heuristic rules

27 Outline Introduction Related Work Algorithm Measurements Implementation Future Work

28 Implementation Windows-PC based filter Runs for Individual email accounts in IMAP mail servers Three Modules –Configuration –Training –Classification

29 Implementation Classifier runs as a Windows Service Connects to mail server every ten minutes Downloads new messages, classifies them Moves messages classified as spam to a pre-configured folder on the server
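The polling cycle described above could be approximated with Python's standard imaplib, as in the sketch below. This is not the original Windows Service: the folder name "Junk", the use of the UNSEEN flag to find new messages, and the `is_spam` callback are all assumptions made for illustration.

```python
import imaplib
import time

def move_spam(conn, is_spam, spam_folder="Junk"):
    """Classify unseen INBOX messages and move spam to spam_folder.

    conn is a logged-in imaplib connection; is_spam takes the raw message
    bytes and returns True or False. Returns the number of messages moved.
    """
    conn.select("INBOX")
    _, data = conn.search(None, "UNSEEN")
    moved = 0
    for num in data[0].split():
        # BODY.PEEK[] fetches the message without marking it as read.
        _, msg_data = conn.fetch(num, "(BODY.PEEK[])")
        raw_message = msg_data[0][1]
        if is_spam(raw_message):
            # IMAP has no basic "move": copy, flag the original, then expunge.
            conn.copy(num, spam_folder)
            conn.store(num, "+FLAGS", "\\Deleted")
            moved += 1
    conn.expunge()
    return moved

def run_forever(host, user, password, is_spam, interval_seconds=600):
    """Connect every ten minutes (by default) and filter new mail."""
    while True:
        conn = imaplib.IMAP4_SSL(host)
        try:
            conn.login(user, password)
            move_spam(conn, is_spam)
        finally:
            conn.logout()
        time.sleep(interval_seconds)
```

One simplification to be aware of: because messages are fetched without being marked as read, unread legitimate mail is reclassified on every cycle. A real filter would more likely remember which message UIDs it has already processed.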

30 Outline Introduction Related Work Algorithm Measurements Implementation Future Work

31 Future Work Incorporating message headers during feature extraction step Incorporating domain-specific attributes of spam during weight combination step

32 Publications Dr. William Yerazunis (inventor of CRM114) mentioned the ‘scavenger’ algorithm at the MIT spam conference on Jan 16, 2004 To be published in the ‘First Conference on Email and Anti-Spam’ in Palo Alto, California in July 2004

33 Acknowledgements Dr. Eugene Fink Dr. Dewey Rundus, Dr. Alan Hevner Dr. Paul Graham, MIT, Boston Dr. William Yerazunis, MERL, Boston

