www.sophos.com Adaptive Filtering: One Year On. John Graham-Cumming, Research Director, Sophos’s Anti-Spam Task Force; Author, POPFile.


1 www.sophos.com Adaptive Filtering: One Year On. John Graham-Cumming, Research Director, Sophos’s Anti-Spam Task Force; Author, POPFile.

2 Adaptive Filtering. Definition: an email filter that can be taught to recognize different types of mail without writing rules. Most use some machine learning technique: Naïve Bayesian classification [1], k-nearest neighbors (kNN) [2], Support Vector Machines [3]. All provide some measure of “spamminess”.
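To make the idea of a learned "spamminess" score concrete, here is a minimal sketch of a two-class naive Bayes scorer. It is an illustration only, not the implementation used by POPFile or any other filter named in this talk: the class name `NaiveBayes`, the whitespace tokenizer, and the add-one smoothing are all assumptions chosen for brevity.

```python
import math
from collections import Counter


class NaiveBayes:
    """Toy two-class naive Bayes scorer (hypothetical, for illustration only)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.msgs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Teach the filter by example: no hand-written rules involved.
        self.words[label].update(text.lower().split())
        self.msgs[label] += 1

    def _log_likelihood(self, tokens, label):
        counts = self.words[label]
        total = sum(counts.values())
        # Vocabulary size plus one slot for never-seen words (add-one smoothing).
        vocab = len(set(self.words["spam"]) | set(self.words["ham"])) + 1
        return sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)

    def spamminess(self, text):
        """Return an estimate of P(spam | text) between 0 and 1."""
        tokens = text.lower().split()
        n = self.msgs["spam"] + self.msgs["ham"]
        log_spam = math.log((self.msgs["spam"] + 1) / (n + 2))
        log_spam += self._log_likelihood(tokens, "spam")
        log_ham = math.log((self.msgs["ham"] + 1) / (n + 2))
        log_ham += self._log_likelihood(tokens, "ham")
        # Turn the two log scores into a single probability.
        return 1 / (1 + math.exp(log_ham - log_spam))


nb = NaiveBayes()
nb.train("buy viagra now", "spam")
nb.train("cheap viagra pills", "spam")
nb.train("dinner with mom tonight", "ham")
nb.train("see you at dinner", "ham")
print(nb.spamminess("viagra now"), nb.spamminess("dinner tonight"))
```

The score is a probability-like number, so a filter built on it still needs a cut-off to decide what counts as spam.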

3 Machine Learning & Anti-spam: a little more than one year. Papers: Mar 1998: SpamCop: A Spam Classification & Organization Program [1]; Jul 1998: A Bayesian Approach to Filtering Junk E-mail [2]; 2000: An evaluation of Naive Bayesian anti-spam filtering [3]; Aug 2002: A Plan for Spam [4]. Patents: Jun 1998: 6,161,130: Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set; Jun 1999: 6,592,627: System and method for organizing repositories of semi-structured documents such as email.

4 Why now? The “Grandma Problem”. Confluence of events: spam getting close to 50% of all mail [1]; email reaching 1/3 of adults in US [2]; fast processors can handle the processing load. No other good alternatives: Laws? Migrate from SMTP? [3]

5 Two Routes. Open Source: lots of open source anti-spam solutions; many are “wannabe” solutions that simply implemented Paul Graham’s ideas; some are interesting tools (bogofilter, POPFile, SpamBayes). Commercial: vendors now incorporating Adaptive Filtering into their anti-spam products. Classic tradeoff: free, open source, community supported versus fee, “productized”, vendor supported.

6 Practical Open Source Filters. General mail filters [1]: Aug 1996: ifile; Aug 2002: POPFile; Oct 2002: dbacl. Spam filters [2]: Bogofilter, SpamBayes, Bayesian Spam Filter, SpamProbe, SpamWizard, BSpam, The Spam Secretary, Expaminator, SqueakyMail, Bayespam, spaminator, Quick Spam Filter, Annoyance Filter, DSPAM, PASP, Spam Blocker, CRM114; SpamAssassin (added Bayesian in 2.5).

7 Mainstream Adaptive Filtering. General: SwiftFile (for Lotus Notes) [1]; Ella Pro (for Microsoft Outlook) [2]. Anti-spam desktop: Mozilla 1.3, Eudora 6.0, Microsoft MSN 8, Microsoft Outlook 2003, AOL 9.0, Apple Mail.app (Jaguar). Anti-spam gateway: Sophos PureMessage 4.x. Prediction: by end of 2004 every major email client includes adaptive filtering.

8 The Problems. Man-in-the-street: Usability; False Positives; Over training; One man’s spam is another man’s ham; Internationalization.

9 Usability. Proxy, plug-in and external filters are too complex. General user needs: to not have to understand the underlying mechanism; complete integration with mail client; obvious operation (e.g. spam is moved into a folder called Spam); automatic whitelisting (if I send to Mom, Mom is ok).
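The automatic whitelisting idea ("if I send to Mom, Mom is ok") can be sketched in a few lines. This is a hypothetical illustration, not code from any mail client: the class `AutoWhitelist` and its method names are invented here, and address comparison is simplified to a case-insensitive string match.

```python
class AutoWhitelist:
    """Trust anyone the user has written to (hypothetical sketch)."""

    def __init__(self):
        self.trusted = set()

    def record_outgoing(self, recipients):
        # Called whenever the user sends a message.
        self.trusted.update(addr.strip().lower() for addr in recipients)

    def is_trusted(self, sender):
        # Called on incoming mail before the adaptive filter runs.
        return sender.strip().lower() in self.trusted


wl = AutoWhitelist()
wl.record_outgoing(["Mom@example.com"])
print(wl.is_trusted("mom@example.com"))       # sender the user has written to
print(wl.is_trusted("stranger@example.net"))  # unknown sender
```

The point of the design is that the user never performs a whitelisting step: sending mail is the only action required.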

10 False Positives False Positive == Good mail identified as bad False Negative == Spam identified as good People tolerate false negatives, but hate false positives Spam filters must guard against false positives: Bias towards False Negatives (“A Plan for Spam”) Cross check results (SpamBayes) High spam threshold
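Two of the guards listed above, biasing toward false negatives and using a high spam threshold, can be sketched as follows. The sketch is loosely modeled on the approach described in "A Plan for Spam" (doubling the weight of good-mail occurrences and treating mail as spam only above 0.9), but the function names, parameter names and defaults are assumptions for illustration, not a faithful port.

```python
def word_spam_probability(bad, good, nbad, ngood, ham_bias=2, unknown=0.4):
    """Per-word spam probability, biased in favor of ham.

    bad/good: times the word appeared in spam/ham.
    nbad/ngood: number of spam/ham messages seen.
    ham_bias: multiplier on ham occurrences (assumed default of 2).
    """
    if bad == 0 and good == 0:
        return unknown  # never-seen word: lean slightly innocent
    spam_freq = min(1.0, bad / nbad)
    ham_freq = min(1.0, ham_bias * good / ngood)
    p = spam_freq / (spam_freq + ham_freq)
    return max(0.01, min(0.99, p))  # never fully certain either way


def classify(spam_probability, threshold=0.9):
    """Only call a message spam when the filter is very sure."""
    return "spam" if spam_probability >= threshold else "ham"


# A word seen equally often in spam and ham is pulled toward ham by the bias.
print(word_spam_probability(bad=10, good=10, nbad=100, ngood=100))
print(classify(0.85), classify(0.95))
```

Both choices trade a few extra false negatives for fewer false positives, which matches the tolerance described on this slide.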

11 Over Training. Occurs when the user loads up an adaptive filter with lots more spam than ham, e.g. feeds an entire spam archive into the filter. Some adaptive filters then think everything is spam. For Naïve Bayes classifiers the “train on errors” methodology works well in practice: the user teaches the filter only on mails it incorrectly classified, using a “No, that’s spam” or “No, that’s ham” button.
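A "train on errors" loop can be sketched as below. To keep the example short, the classifier is a deliberately crude word-vote counter standing in for a real Naïve Bayes filter; `TinyFilter` and `train_on_errors` are hypothetical names, and the only point being illustrated is that training happens solely when the filter gets a message wrong.

```python
from collections import Counter


class TinyFilter:
    """Crude word-vote classifier, a stand-in for a real adaptive filter."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        self.counts[label].update(text.lower().split())

    def classify(self, text):
        tokens = text.lower().split()
        spam_votes = sum(self.counts["spam"][t] for t in tokens)
        ham_votes = sum(self.counts["ham"][t] for t in tokens)
        # Ties go to ham, so an untrained filter lets everything through.
        return "spam" if spam_votes > ham_votes else "ham"


def train_on_errors(filt, stream):
    """Feed (text, true_label) pairs; train only on misclassified mail."""
    corrections = 0
    for text, true_label in stream:
        if filt.classify(text) != true_label:
            filt.train(text, true_label)  # the "No, that's spam/ham" button
            corrections += 1
    return corrections


mail = [
    ("buy viagra now", "spam"),
    ("dinner with mom", "ham"),
    ("cheap viagra", "spam"),
    ("dinner tonight", "ham"),
]
f = TinyFilter()
print(train_on_errors(f, mail))
```

Because correctly classified mail is never fed back in, the amount of training material stays tied to the filter's mistakes rather than to the size of the user's spam archive.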

12 One man’s spam… Can be hard to unsubscribe from legitimate bulk mail Users tell spam filter that legitimate mail is spam Creates false positives for other users in shared systems e.g. I say CNET News email is spam, you want it Ideal system has two parts Gateway spam filter run by IT group Individual preferences on each client

13 Internationalization Tokenization non-trivial for some languages In English words are “space separated” Thisisnotthecaseinsomeotherlanguages: Japanese (POPFile の特別な使い方 ) Different punctuation ¿Español? «Français» UTF-8, Unicode أخبار و تقارير looks like ÃÎÈÇÑ æ ÊÞÇÑíÑ
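Both problems on this slide are easy to reproduce. The sketch below assumes one plausible cause of the garbled Arabic: text encoded as Windows-1256 but decoded as Latin-1. That assumption reproduces the garbled string shown on the slide, though other mismatches could produce similar debris. It also shows that whitespace splitting yields a single token for the Japanese phrase.

```python
# "News and reports" in Arabic, written with escapes to avoid editor issues.
arabic = "\u0623\u062e\u0628\u0627\u0631 \u0648 \u062a\u0642\u0627\u0631\u064a\u0631"

# Assumed mismatch: bytes are Windows-1256, reader believes they are Latin-1.
garbled = arabic.encode("cp1256").decode("latin-1")
print(garbled)

# Knowing the real charset makes the damage reversible.
recovered = garbled.encode("latin-1").decode("cp1256")
print(recovered == arabic)

# Space-separated tokenization works for the English word only:
japanese = "POPFile の特別な使い方"
print(japanese.split())
```

A filter that tokenizes the garbled form learns tokens that never match correctly decoded mail, and a filter that splits only on spaces treats the whole Japanese phrase as one word.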

14 Spammer’s Response. Overwhelm filter with “good words”. Hide those good words from people. Use HTML as trickery toolbox. Three techniques: And the Kitchen Sink; Invisible Ink; Camouflage. More in Sophos’s Field Guide to Spam [1].

15 And the Kitchen Sink Throw in innocent words before or after the HTML Viagra Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom

16 And the Kitchen Sink Spammer hopes reader concentrates on the spam message part Ineffective because user gets to see the innocent words Spammers need ways to hide the innocent words So they’ve taken inspiration from search engine trickery…

17 Invisible Ink Use HTML font colors to write white on white Viagra Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom

18 Invisible Ink Easily spotted if filter groks HTML Can confuse filters that just drop HTML tags Spammers have noticed that Invisible Ink is being targeted They’ve adapted…
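A filter that "groks HTML" can spot Invisible Ink by tracking colors while it parses. The sketch below uses Python's standard `html.parser` and handles only the simplest case: `<font color>` text whose color string exactly equals the `<body bgcolor>`. The class name, the white default background, and the sample markup are assumptions; CSS styles, named colors and tables are ignored.

```python
from html.parser import HTMLParser


class InvisibleInkDetector(HTMLParser):
    """Collect text whose font color equals the page background (sketch)."""

    def __init__(self):
        super().__init__()
        self.background = "#ffffff"  # assumed default when no bgcolor is set
        self.color_stack = []        # colors of currently open <font> tags
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body" and attrs.get("bgcolor"):
            self.background = attrs["bgcolor"].lower()
        elif tag == "font":
            # A <font> without a color inherits the enclosing one.
            inherited = self.color_stack[-1] if self.color_stack else ""
            self.color_stack.append((attrs.get("color") or inherited).lower())

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.color_stack and self.color_stack[-1] == self.background:
            self.hidden_text.append(text)


sample = (
    '<body bgcolor="#FFFFFF">'
    '<font color="#000000">Viagra</font>'
    '<font color="#FFFFFF">Hi, Johnny! It was really nice</font>'
    "</body>"
)
detector = InvisibleInkDetector()
detector.feed(sample)
print(detector.hidden_text)
```

A filter that merely strips tags would instead see the visible and hidden words as one innocent-looking stream, which is the confusion this slide describes.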

19 Camouflage Use very similar HTML colors Viagra some innocent words

20 Camouflage Hard to see, but “some innocent words” do appear

21 Pythagoras Spots Spam. Foreground and background colors are coordinates in 3D: imagine a Red axis, a Green axis and a Blue axis. Similar colors are close; dissimilar colors are far apart. Pythagoras’ Theorem (3D) [1] gives the color distance. [Figure: the points (00,00,00), (11,33,33), (12,39,39) and (FF,FF,00) plotted on Red, Green and Blue axes, with the sample text “Sweet, I rule in 2003”.]
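The color-distance test can be written directly from the 3D form of Pythagoras' theorem. In this sketch the function names and the cut-off of 30 are assumptions chosen for illustration; the talk does not state what threshold, if any, a real filter uses.

```python
import math


def parse_hex(color):
    """Turn 'RRGGBB' or '#RRGGBB' into an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(foreground, background):
    """Euclidean distance between two colors in RGB space."""
    fg, bg = parse_hex(foreground), parse_hex(background)
    return math.sqrt(sum((f - b) ** 2 for f, b in zip(fg, bg)))


def is_camouflaged(foreground, background, threshold=30.0):
    # threshold is a hypothetical cut-off, not a value from the talk
    return color_distance(foreground, background) < threshold


print(color_distance("113333", "123939"))  # two similar colors from the slide
print(color_distance("000000", "FFFF00"))  # two dissimilar colors
print(is_camouflaged("#113333", "#123939"))
```

Exact-match Invisible Ink is just the special case where the distance is zero, so one check covers both tricks.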

22 Spammers love HTML

23 Trick Trends - Two Increasing

24 Tricks Make Spam Spotting Easier. Bad news for spammers: the harder you try to obscure your messages the easier they are to filter; spam trickery becomes the spam fingerprint. Bad news for end users: spammers will react by making spam more innocent, e.g. “Hi, I saw your profile and wanted to get in touch, please check out my site at www.some-viagra-site.com”.

25 The Filter Paradox Do filters make spam more effective? One spammer claimed on /. “Your filters help cut down on the complaints to ISPs […] you no longer complain to uce@ftc.gov, my access providers, or anyone else who might cause me problems” Time will tell

26 The End Following slides are for reference purposes

27 References. Slide 2: 1. http://www.wikipedia.org/wiki/Naive_Bayesian_classification 2. http://www.usenix.org/events/sec02/full_papers/liao/liao_html/node4.html 3. http://citeseer.nj.nec.com/tong00support.html

28 References. Slide 3: 1. http://citeseer.nj.nec.com/pantel98spamcop.html 2. http://citeseer.nj.nec.com/sahami98bayesian.html 3. http://citeseer.nj.nec.com/androutsopoulos00evaluation.html 4. http://www.paulgraham.com/spam.html Slide 4: 1. Wired, p50, September 2003 predicts 50% of all mail will be spam by September 2004 2. US Census Bureau, 2000 3. One proposal is AMTP: http://www.ietf.org/internet-drafts/draft-weinman-amtp-00.txt

29 References. Slide 5: 1. POPFile: http://popfile.sourceforge.net ifile: http://www.nongnu.org/ifile/ 2. Search SourceForge and Freshmeat Slide 6: 1. http://www.research.ibm.com/swiftfile/ 2. http://www.openfieldsoftware.com/Ella.asp

30 References. Slide 17: 1. http://www.activestate.com/Products/PureMessage/Field_Guide_to_Spam/

31 Pythagoras in 3D. Distance $\delta$ between two points $(a, b, c)$ and $(x, y, z)$ in space.
Pythagoras: $\delta^2 = \alpha^2 + \beta^2$
Pythagoras: $\alpha^2 = (x-a)^2 + (z-c)^2$, with $\beta^2 = (y-b)^2$
Therefore: $\delta = \sqrt{(x-a)^2 + (y-b)^2 + (z-c)^2}$

