Filtering Spam With Justin Mason, SpamAssassin Project & Deersoft
What Is Spam? •Best description: "Unsolicited Bulk E-mail" •In human terms: bulk mail you didn't want, and didn't ask for •Mailing lists, newsletters, "latest offers": not spam, if you asked for them in the first place •Name courtesy of Monty Python: “spam, spam, spam and spam”
Why Bother Filtering Spam? •Seems to be about 30% to 60% of mail traffic, and increasing •Users are forced to waste time wading through their inbox –costs their employers money •Impossible to unsubscribe –“unsubscribe” addresses work only 37% of the time, according to the FTC •Legal retaliation not possible, yet •Just plain irritating!
Spam Volume Is Increasing (data from Brightmail.com)
Filtering: Homebrew Blacklists •First round of "spam filters": internal blacklists, maintained by in-house admin staff •Match addresses, and delete those from known spammers •Later, match "bad words" (Viagra, porn) •Quite hard to configure; centralised; lots of work to keep up to date
Filtering: DNS Blacklists •Identify spam source computers by IP address •Allow mail system to look up a public database on the internet as mail arrives •Block the message, if its sender's address is blacklisted •Now at least 20 DNS blacklists, with varying reliability •Many false positives –eircom.net's main mail server!
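A minimal sketch of the lookup described on this slide, assuming the usual DNS blacklist convention of reversing the IPv4 octets and querying them under the blacklist's zone. The zone name `dnsbl.example.org` and both function names are hypothetical; a real deployment would use the zone published by the blacklist operator.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query: reversed IPv4 octets, then the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the blacklist zone answers for this IP address."""
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No record, or the lookup failed: treat the host as not listed.
        return False


# Hypothetical zone; 192.0.2.1 is a documentation-only address.
print(dnsbl_query_name("192.0.2.1", "dnsbl.example.org"))
```

The `resolve` parameter exists only so the check can be exercised without network access. Treating a failed lookup as "not listed" is a deliberate choice here, since a blacklist outage should not block legitimate mail.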
SpamAssassin Concepts •Zero-configuration where possible •Lots of rules to determine if a mail is spam or not –"Fuzzy logic": rules are assigned scores, based on our confidence in their accuracy –These are combined to produce an overall score for each message –If over a user-defined threshold, the mail is judged as spam •No one rule, alone, can mark a mail as spam
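The scoring idea can be sketched in a few lines of Python. This is an illustration only, not SpamAssassin's own code: the rule names, tests and scores below are invented for the example, and the threshold is whatever the user chooses.

```python
# Hypothetical rules: (name, test on the raw message text, score).
# A negative score marks a sign of legitimate mail.
RULES = [
    ("MENTIONS_VIAGRA", lambda m: "viagra" in m.lower(), 2.5),
    ("CLICK_HERE", lambda m: "click here" in m.lower(), 1.5),
    ("MANY_EXCLAMATIONS", lambda m: m.count("!") >= 3, 1.0),
    ("REPLY_TO_THREAD", lambda m: "in-reply-to:" in m.lower(), -1.0),
]


def score_message(message, rules=RULES):
    """Return (total score, names of the rules that fired)."""
    hits = [(name, score) for name, test, score in rules if test(message)]
    return sum(score for _, score in hits), [name for name, _ in hits]


def is_spam(message, threshold):
    """A mail is judged spam only when its combined score is over the threshold."""
    total, _ = score_message(message)
    return total > threshold


msg = "Buy VIAGRA now!!! Click here"
print(score_message(msg))          # total and list of rules hit
print(is_spam(msg, threshold=4.0))
```

Because every individual score is well below the threshold, no single rule can mark a mail as spam on its own; several have to fire together.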
SpamAssassin Concepts, pt.2 •Combines many systems for a "broad-spectrum" approach: –Detect forged headers –Spam-tool signatures in headers –Text keyword scanner in the message body –DNS blacklists –Razor, DCC (Distributed Checksum Clearinghouse), Pyzor •Spammers cannot aim to defeat just one system; the others will catch them out
Integration Into Mail Systems •SpamAssassin was written with flexibility of integration in mind •Many integrations have been written: –Integration into Mail Transfer Agents (sendmail, qmail, Exim, Postfix, Microsoft Exchange) –Integration into virus-scanner MTA plug-ins (MIMEDefang, amavisd-new) –IMAP/POP proxies and clients –Commercial plug-ins for Windows clients (Eudora, MS Outlook) •And many more I don't know about!
Accuracy and False Positives •The big issue with filtering to date: –not just “how much spam does it catch?” –but “how many legitimate mails get caught, too?” •Many systems do not pay attention to this problem –Some blacklists even use "false positives" as a weapon against service providers selling to spammers •FPs are much worse than spam getting through –much more inconvenient to user
Evolving a Better Filter •SpamAssassin assigns scores using a genetic algorithm –Given a big collection of human-classified mail, determine what tests each mail triggers –Use this to "evolve" an efficient score set –Exactly the kind of problem a genetic algorithm is good at –Allows "shotgun" rules to be scored low, where they cannot do damage
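A toy sketch of how a score set might be "evolved" from hand-classified mail. This is not SpamAssassin's actual optimiser: the corpus, the rule count, the threshold and the cost weighting (one false positive counted as ten missed spams) are all assumptions made for the illustration.

```python
import random

# Hypothetical hand-classified corpus: (indices of the rules hit, is_spam).
CORPUS = [
    ((0, 1), True), ((0, 2), True), ((1, 2), True), ((0, 1, 2), True),
    ((2,), False), ((), False), ((3,), False), ((2, 3), False),
]
N_RULES = 4
THRESHOLD = 5.0
FP_WEIGHT = 10.0  # assumption: a false positive costs as much as 10 misses


def cost(scores):
    """Weighted count of misclassified mails for a candidate score set."""
    total = 0.0
    for hits, is_spam in CORPUS:
        flagged = sum(scores[i] for i in hits) > THRESHOLD
        if flagged and not is_spam:
            total += FP_WEIGHT
        elif is_spam and not flagged:
            total += 1.0
    return total


def evolve(generations=200, pop_size=30, seed=0):
    """Evolve a score set; lower cost is fitter."""
    rng = random.Random(seed)
    # Start from an all-zero score set plus random candidates.
    population = [[0.0] * N_RULES] + [
        [rng.uniform(-2, 4) for _ in range(N_RULES)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child[rng.randrange(N_RULES)] += rng.gauss(0, 0.5)  # mutate one score
            children.append(child)
        population = survivors + children
    return min(population, key=cost)


best = evolve()
print([round(s, 2) for s in best], cost(best))
```

Weighting false positives heavily is what pushes unreliable "shotgun" rules towards low scores: a rule that also fires on legitimate mail cannot carry much weight without raising the cost.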
False Positive Rate •SpamAssassin is 98.5% accurate on our test corpora, with default settings –0.6% false positives –91% of all spam caught correctly –with network tests on, spam hit-rate probably increases to about 93-95% •Highest accuracy among currently available tools •Tunable by the user: raising the threshold reduces FPs but lets more spam through; lowering it catches more spam but increases FPs
Effect of the Threshold Setting
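To make the trade-off concrete, here is a small sketch that computes the false positive rate, spam hit-rate and overall accuracy at two thresholds. The scores are made-up sample data, not figures from the SpamAssassin test corpora, and the function name is hypothetical.

```python
def rates(scored, threshold):
    """scored: list of (score, is_spam). Returns (fp_rate, hit_rate, accuracy)."""
    ham = [score for score, is_spam in scored if not is_spam]
    spam = [score for score, is_spam in scored if is_spam]
    false_positives = sum(score > threshold for score in ham)
    caught = sum(score > threshold for score in spam)
    fp_rate = false_positives / len(ham)    # legitimate mail wrongly flagged
    hit_rate = caught / len(spam)           # spam correctly flagged
    accuracy = (caught + len(ham) - false_positives) / len(scored)
    return fp_rate, hit_rate, accuracy


# Made-up scores for five legitimate mails and five spams.
SAMPLE = [(0.1, False), (1.2, False), (2.0, False), (4.5, False), (6.0, False),
          (3.0, True), (5.5, True), (7.0, True), (9.0, True), (12.0, True)]

for threshold in (5.0, 8.0):
    print(threshold, rates(SAMPLE, threshold))
```

On this sample, raising the threshold from 5 to 8 removes the one false positive but halves the number of spams caught.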
What To Do When You've Caught It •Since classifiers are imperfect, blind deletion is bad •Better to mark the mails, and allow user to check over them infrequently •Also good to mark for legal reasons –In the UK, it may be illegal to hold mail (even spam) for more than 3 days
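Marking rather than deleting might look like the sketch below, which uses Python's standard `email` package. The header names and the "[SPAM]" subject prefix are illustrative choices for this example; the slide does not specify which markers are applied.

```python
from email import message_from_string


def mark_as_spam(raw, score):
    """Return the message with spam markers added; nothing is deleted."""
    msg = message_from_string(raw)
    msg["X-Spam-Flag"] = "YES"              # illustrative header name
    msg["X-Spam-Score"] = f"{score:.1f}"    # illustrative header name
    if "Subject" in msg:
        msg.replace_header("Subject", "[SPAM] " + msg["Subject"])
    else:
        msg["Subject"] = "[SPAM]"
    return msg.as_string()


print(mark_as_spam("Subject: Hello\n\nBody\n", 7.2))
```

The user's mail client can then file marked messages into a separate folder for an occasional check.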
Features For Large-Scale Use: "spamd" •Client-server interface to SpamAssassin •Pre-loads, so much faster for high volumes •Can load user preferences from an SQL database •Can load-balance -- uses TCP/IP •Deployed at several large organisations and ISPs: The Well, Salon.com, Panix, Transmeta, SourceForge, Stanford
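A rough sketch of what a client exchange with spamd could look like at the byte level. The request and response shapes below follow my understanding of the spamd text protocol (a `CHECK` command with a `Content-length` header, answered by a `Spam:` line); treat the exact wording as an assumption and check it against the protocol notes shipped with SpamAssassin. The function names are hypothetical, and no network connection is made here.

```python
import re
from typing import Optional, Tuple


def build_check_request(message: bytes, user: Optional[str] = None) -> bytes:
    """Build a CHECK request: command line, headers, blank line, message."""
    lines = ["CHECK SPAMC/1.2", f"Content-length: {len(message)}"]
    if user:
        lines.append(f"User: {user}")  # lets the server load per-user preferences
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + message


def parse_check_response(response: bytes) -> Tuple[bool, float, float]:
    """Return (is_spam, score, threshold) from the server's reply."""
    text = response.decode("ascii", "replace")
    status, _, rest = text.partition("\r\n")
    if not status.startswith("SPAMD/") or "EX_OK" not in status:
        raise ValueError(f"unexpected status line: {status!r}")
    match = re.search(r"^Spam:\s*(\w+)\s*;\s*(-?[\d.]+)\s*/\s*(-?[\d.]+)", rest, re.M)
    if match is None:
        raise ValueError("no Spam: header in response")
    flag, score, threshold = match.groups()
    return flag.lower() in ("true", "yes"), float(score), float(threshold)


print(build_check_request(b"hello"))
print(parse_check_response(b"SPAMD/1.1 0 EX_OK\r\nSpam: True ; 15.0 / 5.0\r\n\r\n"))
```

Because the exchange is plain text over TCP/IP, a thin client can hand messages to a pool of pre-loaded servers, which is what makes load-balancing straightforward.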
Large-Scale Filtering For Your Network •Different from filtering for yourself •Many users get little spam •Should use conservative settings •Better to leave filtering off by default (opt-in) –notify users that spam filtering is available, and ask them if they want it
How Can Network Administrators Fight Spam? •Scan for Open Relays & Proxies on your network •Block proxy ports at the firewall •Audit web servers for “FormMail” or other insecure web-to-mail scripts •Spam traps reporting to network blacklists: Razor, DCC, Pyzor •Run SpamAssassin, or SpamAssassin Pro!
How Do The Spammers Feel? •Already hurting, according to CBS: –“[I’ve gone through] unbelievable hardships [to keep spamming]... My operating costs have gone up 1,000% this year, just so I can figure out how to get around all these filters” •Spam relies on low overheads and extremely cheap delivery •Disrupt the equation and they will give up!
Future Directions •Learning filters (Bayesian probability etc.) –Learn automatically, to detect what "good" mail to your network looks like •"Hash-cash" –Sending mail is currently more-or-less free –With hash-cash, the sender must spend CPU time for each recipient –SpamAssassin can provide "bonus points" for hash-cash users
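The hash-cash idea can be sketched as a small proof-of-work: the sender searches for a counter that makes a hash of the recipient's address start with a number of zero bits, and the receiver checks it with a single hash. This is a simplified illustration, not the real hash-cash stamp format; the `recipient:counter` layout and the 12-bit difficulty are assumptions.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(recipient: str, bits: int = 12) -> str:
    """Sender's side: search for a stamp, costing about 2**bits hashes on average."""
    for counter in count():
        stamp = f"{recipient}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp: str, recipient: str, bits: int = 12) -> bool:
    """Receiver's side: one hash is enough to check the stamp."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return stamp.startswith(recipient + ":") and leading_zero_bits(digest) >= bits


stamp = mint("user@example.com")
print(stamp, verify(stamp, "user@example.com"))
```

Since a stamp is tied to one recipient, a bulk mailer has to pay the CPU cost once per address, while a filter can verify it cheaply and lower the message's score.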
Fin •http://spamassassin.org/ –SpamAssassin for UNIX –(free software) •http://www.deersoft.com/ –SpamAssassin Pro: MS Outlook, Exchange –(commercial version) –(my employers!)