
1 Filtron: A Learning-Based Anti-Spam Filter
Eirinaios Michelakis (ernani@iit.demokritos.gr), Ion Androutsopoulos (ion@aueb.gr), George Paliouras (paliourg@iit.demokritos.gr), George Sakkis (gsakkis@rutgers.edu), Panagiotis Stamatopoulos (takis@di.uoa.gr)
First Conference on Email and Anti-Spam (CEAS), Mountain View, CA, July 30th and 31st, 2004

2 Outline
- Spam Filtering: past, present and future
- Anti-spam filtering with Filtron
- In Vitro Evaluation
- In Vivo Evaluation
- Conclusions

3 Spam Filtering: past, present and future
Past:
- Black-lists and white-lists of e-mail addresses
- Handcrafted rules looking for suspicious keywords and patterns in headers
Present:
- Machine-learning-based filters, mostly using the Naïve Bayes classifier (examples: Mozilla's spam filter, POPFile, K9)
- Signature-based filtering (Vipul's Razor)
Future:
- Combination of several techniques (SpamAssassin)

4 Filtron: An overview
A multi-platform learning-based anti-spam filter.
Features for the simple user:
- Personalized: based on the user's own legitimate messages
- Automatically updated black/white lists
- Efficient: server-side filtering and interception rules
Features for the advanced user and the researcher:
- Customizable learning component, through the Weka open-source machine learning platform
- Support for creating publicly available message collections, with privacy-preserving encoding of messages and user profiles
Portable: implemented in Java and Tcl/Tk.
Currently supported on POSIX-compatible mail servers (MS Exchange Server port efforts under way).

5 Filtron's Architecture
(Architecture diagram showing how Filtron builds the user model.) Components: legitimate folders and spam folders, Preprocessor, Attribute Selector (producing the attribute set), Vectorizer (producing the training vectors), Learner, and the resulting user model (induced classifier, black list, white list).

6 Preprocessing
1. Break down mailbox(es) into distinct messages
2. Remove from every message: mail headers, HTML tags, attached files
3. Remove messages with no textual content
4. Store at most 5 messages per sender (avoids bias towards regular correspondents)
5. Remove duplicates
6. Encode messages (optional)
A sketch of these steps is shown below.
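A minimal, illustrative Java sketch of this preprocessing pipeline (not Filtron's actual code: the naive header/HTML stripping, the grouping by sender, and the omission of attachment handling and optional encoding are simplifying assumptions):

```java
import java.util.*;

// Illustrative sketch of Filtron-style mailbox preprocessing (steps 1-6 above).
// Assumes messages have already been split out of the mailbox; attachment
// stripping and the optional encoding step are omitted.
public class Preprocessor {

    /** Strip mail headers: keep only the text after the first blank line. */
    static String stripHeaders(String rawMessage) {
        int split = rawMessage.indexOf("\n\n");
        return split >= 0 ? rawMessage.substring(split + 2) : rawMessage;
    }

    /** Remove HTML tags with a simple regex (real HTML needs a proper parser). */
    static String stripHtml(String body) {
        return body.replaceAll("<[^>]+>", " ");
    }

    /**
     * @param rawBySender raw messages grouped by sender address
     * @return cleaned, de-duplicated message bodies, at most 5 per sender
     */
    static List<String> preprocess(Map<String, List<String>> rawBySender) {
        List<String> kept = new ArrayList<>();
        Set<String> seen = new HashSet<>();          // for duplicate removal
        for (List<String> messages : rawBySender.values()) {
            int perSender = 0;
            for (String raw : messages) {
                if (perSender >= 5) break;           // cap per sender to avoid bias
                String body = stripHtml(stripHeaders(raw)).trim();
                if (body.isEmpty()) continue;        // no textual content
                if (!seen.add(body)) continue;       // duplicate body already kept
                kept.add(body);
                perSender++;
            }
        }
        return kept;
    }
}
```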

7 Message Classification
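The slide content itself (a diagram) did not survive the transcript. The sketch below is a hedged reconstruction of the classification-time flow inferred from slides 4-5 (user model = induced classifier plus black/white lists, applied server-side); the order of the list checks and the interfaces are assumptions, not Filtron's documented behaviour.

```java
import java.util.Set;

// Hypothetical classification-time flow; names and check order are assumptions.
enum Verdict { LEGITIMATE, SPAM }

interface InducedClassifier {
    Verdict classify(String messageBody);    // e.g. a learned Weka model behind the scenes
}

class MessageFilter {
    private final Set<String> whiteList;      // senders always accepted
    private final Set<String> blackList;      // senders always rejected
    private final InducedClassifier model;    // learned user model

    MessageFilter(Set<String> whiteList, Set<String> blackList, InducedClassifier model) {
        this.whiteList = whiteList;
        this.blackList = blackList;
        this.model = model;
    }

    Verdict filter(String sender, String body) {
        if (whiteList.contains(sender)) return Verdict.LEGITIMATE;
        if (blackList.contains(sender)) return Verdict.SPAM;
        return model.classify(body);          // fall back to the induced classifier
    }
}
```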

8 In Vitro Evaluation
We investigated the effect of:
- Single-token versus multi-token attributes (n-grams for n = 1, 2, 3)
- Number of attributes (40-3000)
- Learning algorithm (Naïve Bayes, Flexible Bayes, SVMs, LogitBoost)
- Training corpus size (~10%-100% of the full training corpus)
Cost-sensitive learning formulation:
- Misclassifying a legitimate message as spam (L → S) is λ times more serious an error than misclassifying a spam message as legitimate (S → L)
- Two usage scenarios (λ = 1, 9)
The weighted accuracy measure below captures this asymmetry.
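A hedged reconstruction of the weighted accuracy (WAcc) reported in the tables, in which each legitimate message counts λ times; this matches the definition used in the authors' related publications, but consult the TR for the exact formulation:

```latex
% n_{L \to L}, n_{S \to S}: correctly classified legitimate / spam messages
% N_L, N_S: total legitimate / spam messages
WAcc = \frac{\lambda \, n_{L \to L} + n_{S \to S}}{\lambda \, N_L + N_S}
```

With λ = 1 this reduces to plain accuracy; with λ = 9 every false positive (L → S) is penalized nine times as heavily as a false negative.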

9 In Vitro Evaluation (cont.)
Evaluation:
- Four message collections (PU1, PU2, PU3, PUA)
- Stratified 10-fold cross-validation (a Weka-based sketch follows)
Results:
- No clear winner among the learning algorithms with respect to accuracy; efficiency (or other criteria) is more important for real usage
- Nevertheless, SVMs were consistently among the two best
- No substantial improvement with n-grams (for n > 1)
Refer to the TR for more details:
- Learning to filter unsolicited commercial e-mail, TRN 2004/2, NCSR “Demokritos” (http://www.iit.demokritos.gr/skel/i-config/)
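Since Filtron's learning component is built on Weka, a stratified 10-fold cross-validation of one of the tested learners (an SVM via Weka's SMO) could look roughly like the sketch below; the ARFF file name and the class label name are placeholders, not the actual experimental setup:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Hedged sketch: stratified 10-fold cross-validation of an SVM on a PU-style corpus.
// "pu3.arff" is a hypothetical, already-vectorized version of the PU3 collection.
public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("pu3.arff");
        data.setClassIndex(data.numAttributes() - 1);            // class = legitimate/spam

        SMO svm = new SMO();                                     // Weka's SVM learner
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));   // stratified 10-fold CV

        int spamIndex = data.classAttribute().indexOfValue("spam"); // assumed label name
        System.out.printf("Spam precision: %.4f%n", eval.precision(spamIndex));
        System.out.printf("Spam recall:    %.4f%n", eval.recall(spamIndex));
    }
}
```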

10 Summary of in Vitro Evaluation
                            λ = 1                     λ = 9
                      Pr      Re      WAcc      Pr      Re      WAcc
1-grams
  Naive Bayes       90.56   94.73   94.65     91.57   92.17   94.87
  Flexible Bayes    95.55   89.89   95.15     98.88   74.63   97.76
  LogitBoost        92.43   90.08   93.64     97.71   74.89   97.24
  SVM               94.95   91.43   95.42     98.12   78.33   97.60
1/2/3-grams
  Flexible Bayes    92.98   91.89   93.89     97.43   81.36   96.91
  SVM               94.73   91.70   95.05     98.70   76.40   97.67
(Pr = spam precision, Re = spam recall, WAcc = weighted accuracy; all values in %.)

11 In Vivo Evaluation
Seven-month live evaluation by the third author.
- Training collection: PU3 (2313 legitimate / 1826 spam)
- Learning algorithm: SVM
- Cost scenario: λ = 1
- Retained attributes: 520 1-grams, with numeric values (term frequency)
- No black-list was used
A sketch of this training configuration follows.
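In Weka terms, this configuration (term-frequency vectors, 520 retained single-token attributes, an SVM) could be assembled roughly as below. The corpus file name, the attribute names, the use of StringToWordVector, and information-gain ranking are illustrative assumptions, not necessarily what Filtron does internally:

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Hedged sketch of the in vivo setup: 1-gram term-frequency attributes,
// 520 retained attributes, SVM classifier, lambda = 1 (no cost-sensitive wrapper).
public class TrainUserModel {
    public static void main(String[] args) throws Exception {
        // "pu3_raw.arff" is a hypothetical dataset with one string attribute
        // (the message text) plus a nominal class attribute named "class".
        Instances raw = DataSource.read("pu3_raw.arff");
        raw.setClassIndex(raw.numAttributes() - 1);

        // Turn message text into 1-gram attributes with term-frequency values.
        StringToWordVector vectorizer = new StringToWordVector();
        vectorizer.setOutputWordCounts(true);                 // counts, not just presence
        vectorizer.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, vectorizer);
        // Re-locate the class attribute by name (assumed to be called "class").
        vectors.setClassIndex(vectors.attribute("class").index());

        // Keep the 520 highest-scoring attributes (information-gain ranking assumed).
        AttributeSelection selector = new AttributeSelection();
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(520);
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(ranker);
        selector.setInputFormat(vectors);
        Instances reduced = Filter.useFilter(vectors, selector);

        SMO svm = new SMO();                                   // the induced classifier
        svm.buildClassifier(reduced);
    }
}
```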

12 Summary of in Vivo Evaluation
Days used: 212
Messages received: 6732 (avg. 31.75 per day)
Spam messages received: 1623 (avg. 7.66 per day)
Legitimate messages received: 5109 (avg. 24.10 per day)
Legitimate-to-spam ratio: 3.15
Correctly classified legitimate messages (L → L): 5057
Incorrectly classified legitimate messages (L → S): 52 (avg. 1.72 per week)
Correctly classified spam messages (S → S): 1450
Incorrectly classified spam messages (S → L): 173 (avg. 5.71 per week)
Precision: 96.54% (PU3: 96.43%)
Recall: 89.34% (PU3: 95.05%)
WAcc: 96.66% (PU3: 96.22%)
The worked calculation below shows how these percentages follow from the counts.
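As a sanity check, the reported figures follow directly from the confusion counts above (with λ = 1, weighted accuracy reduces to plain accuracy):

```latex
Pr   = \frac{n_{S \to S}}{n_{S \to S} + n_{L \to S}} = \frac{1450}{1450 + 52}  \approx 96.54\%
Re   = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to L}} = \frac{1450}{1450 + 173} \approx 89.34\%
WAcc = \frac{n_{L \to L} + n_{S \to S}}{N_L + N_S}   = \frac{5057 + 1450}{6732} \approx 96.66\%
```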

13 Post-Mortem Analysis: False Positives
52 false positives (out of 6732 messages):
- 52%: automatically generated messages (subscription verifications, virus warnings, etc.)
- 22%: very short messages (3-5 words in the message body, along with attachments and hyperlinks)
- 26%: short messages (1-2 lines, written in a casual style often exploited by spammers, with no attachments or hyperlinks)

14 Post-Mortem Analysis: False Negatives
173 false negatives (out of 6732 messages):
- 30%: “hard spam” (little textual information, avoiding common suspicious word patterns; many images and hyperlinks; tricks to confuse tokenizers)
- 8%: advertisements of pornographic sites with very casual and well-chosen vocabulary
- 23%: non-English messages (under-represented in the training corpus)
- 30%: encoded messages (BASE64 format; Filtron could not process it at that time)
- 6%: hoax letters (long formal letters, e.g. “tremendous business opportunity!”, with many occurrences of the receiver’s full name)
- 3%: short messages with unusual content

15 Conclusions
Signs of an arms race between spammers and content-based filters.
Filtron's performance was deemed satisfactory, though it can be improved with:
- More elaborate preprocessing to tackle the usual countermeasures of spammers (misspellings, uncommon words, text on images)
- Regular retraining
Currently the most promising approach: combining different filtering techniques along with machine learning, e.g.:
- Collaborative filtering
- Filtering at the transport layer

