2 What Is Spam?
- Best description: "Unsolicited Bulk E-mail"
- In human terms: bulk mail you didn't want, and didn't ask for
- Mailing lists, newsletters, "latest offers": not spam, if you asked for them in the first place
- Name courtesy of Monty Python: “spam, spam, spam and spam”
3 Why Bother Filtering Spam?
- Seems to be about 30% to 60% of mail traffic, and increasing
- Users are forced to waste time wading through their inbox
  - costs their employers money
- Impossible to unsubscribe
  - “unsubscribe” addresses work only 37% of the time, according to the FTC
- Legal retaliation not possible, yet
- Just plain irritating!
4 Spam Volume Is Increasing (data from Brightmail.com)
5 Filtering: Homebrew Blacklists
- First round of "spam filters": internal blacklists, maintained by in-house admin staff
- Match addresses, and delete those from known spammers
- Later, match "bad words" (Viagra, porn)
- Quite hard to configure; centralised; lots of work to keep up to date
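A minimal sketch of the homebrew approach described on this slide, in Python. The sender list, the word list and the function name `is_blacklisted` are all hypothetical and stand in for whatever an in-house admin would maintain by hand.

```python
import re

# Hypothetical hand-maintained lists (assumptions, not from any real deployment).
BLOCKED_SENDERS = {"bulk@spammer.example", "offers@junk.example"}
BAD_WORDS = {"viagra", "porn"}


def is_blacklisted(sender: str, body: str) -> bool:
    """Return True if the sender is listed or the body contains a "bad word"."""
    if sender.strip().lower() in BLOCKED_SENDERS:
        return True
    # Split the body into lowercase words and look for any overlap.
    words = set(re.findall(r"[a-z]+", body.lower()))
    return not words.isdisjoint(BAD_WORDS)
```

Note that a legitimate mail discussing, say, a porn-filtering policy is matched too, which is one reason this style of filter needs constant manual upkeep.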
6 Filtering: DNS Blacklists
- Identify spam source computers by IP address
- Allow mail system to look up a public database on the internet as mail arrives
- Block the message, if its sender's address is blacklisted
- Now at least 20 DNS blacklists, with varying reliability
- Many false positives
  - eircom.net's main mail server!
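The lookup itself can be sketched as follows, assuming the common DNS blacklist convention: reverse the octets of the sending IP address, append the blacklist's zone, and treat any successful DNS answer as "listed". The zone `dnsbl.example` and both function names are placeholders, not a real service or API.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query: reversed IPv4 octets plus the list's zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # raises ValueError if invalid
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the blacklist zone has an entry for this IP address."""
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no such name: the address is not on the list
    return True
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example")` gives `99.2.0.192.dnsbl.example`. The `resolve` parameter exists only so the lookup can be swapped out when testing without network access.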
7 SpamAssassin Concepts
- Zero-configuration where possible
- Lots of rules to determine if a mail is spam or not
- "Fuzzy logic": rules are assigned scores, based on our confidence in their accuracy
- These are combined to produce an overall score for each message
- If over a user-defined threshold, the mail is judged as spam
- No one rule, alone, can mark a mail as spam
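A rough sketch of the scoring idea on this slide, in Python. The rule names, tests and scores below are invented for illustration and are not SpamAssassin's real rule set; the threshold of 5.0 and the "at or above the threshold" comparison are also assumptions of this sketch.

```python
# Hypothetical rules: (name, test on a message dict, score).
# Every score is below the threshold, so no single rule can mark a mail as spam.
RULES = [
    ("SUBJECT_ALL_CAPS", lambda m: m["subject"].isupper(), 1.5),
    ("MENTIONS_VIAGRA", lambda m: "viagra" in m["body"].lower(), 2.5),
    ("NO_MESSAGE_ID", lambda m: not m.get("message_id"), 1.0),
    ("CLICK_HERE", lambda m: "click here" in m["body"].lower(), 1.0),
]


def score_message(msg: dict, threshold: float = 5.0):
    """Return (total score, is_spam, names of the rules that fired)."""
    hits = [(name, score) for name, test, score in RULES if test(msg)]
    total = sum(score for _, score in hits)
    return total, total >= threshold, [name for name, _ in hits]
```

A message that trips only the two strongest rules scores 4.0 and passes; it takes several independent signs of spam to push a message over the threshold.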
8 SpamAssassin Concepts, pt.2
- Combines many systems for a "broad-spectrum" approach:
  - Detect forged headers
  - Spam-tool signatures in headers
  - Text keyword scanner in the message body
  - DNS blacklists
  - Razor, DCC (Distributed Checksum Clearinghouse), Pyzor
- Spammers cannot aim to defeat 1 system; the others will catch them out
9 Integration Into Mail Systems
- Wrote SpamAssassin with flexibility of integration in mind
- Many integrations have been written:
  - Integration into Mail Transfer Agents (sendmail, qmail, Exim, Postfix, Microsoft Exchange)
  - Integration into virus-scanner MTA plug-ins (MIMEDefang, amavisd-new)
  - IMAP/POP proxies and clients
  - Commercial plug-ins for Windows clients (Eudora, MS Outlook)
  - And many more I don't know about!
10 Accuracy and False Positives
- The big issue with filtering to date:
  - not just “how much spam does it catch?”
  - but “how many legitimate mails get caught, too?”
- Many systems do not pay attention to this problem
- Some blacklists even use "false positives" as a weapon against service providers selling to spammers
- FPs are much worse than spam getting through
  - much more inconvenient to user
11 Evolving a Better Filter
- SpamAssassin assigns scores using a genetic algorithm
- Given a big collection of human-classified mail, determine what tests each mail triggers
- Use this to "evolve" an efficient score set
- Exactly the kind of problem a genetic algorithm is good at
- Allows "shotgun" rules to be scored low, where they cannot do damage
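The idea can be illustrated with a toy genetic algorithm in Python. This is not SpamAssassin's actual scoring tool: the corpus, the cost weights, the 4.5 cap on any one score and all parameter values are assumptions chosen to keep the sketch small.

```python
import random

THRESHOLD = 5.0
MAX_SCORE = 4.5    # assumption: keep every rule below the threshold on its own
FP_WEIGHT = 10.0   # assumption: a false positive costs ten times a missed spam
N_RULES = 4

# Hypothetical hand-classified corpus: (indices of rules triggered, is_spam).
CORPUS = [
    ({0, 1}, True), ({0, 2}, True), ({1, 2, 3}, True), ({0, 3}, True),
    ({3}, False), ({2}, False), ({2, 3}, False), (set(), False),
]


def cost(scores):
    """Weighted count of misclassified mails for a candidate score set."""
    total = 0.0
    for hits, is_spam in CORPUS:
        flagged = sum(scores[i] for i in hits) >= THRESHOLD
        if flagged and not is_spam:
            total += FP_WEIGHT
        elif is_spam and not flagged:
            total += 1.0
    return total


def evolve(generations=40, pop_size=30, seed=1):
    """Return (best score set, best cost per generation)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0, MAX_SCORE) for _ in range(N_RULES)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=cost)
        history.append(cost(pop[0]))
        survivors = pop[: pop_size // 2]  # the better half survives unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(N_RULES)                        # mutate one score
            child[i] = min(MAX_SCORE, max(0.0, child[i] + rng.gauss(0, 0.5)))
            children.append(child)
        pop = survivors + children
    best = min(pop, key=cost)
    history.append(cost(best))
    return best, history
```

Because the better half of each generation survives unchanged, the best cost never gets worse from one generation to the next. Rules that fire on legitimate mail, like rule 3 in this toy corpus, are under pressure to end up with low scores.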
12 False Positive Rate
- SpamAssassin is 98.5% accurate on our test corpora, with default settings
  - 0.6% false positives
  - 91% of all spam caught correctly
  - with network tests on, spam hit-rate probably increases to about 93-95%
- Highest rate available among present tools
- Tunable by the user -- reduce FPs by increasing the threshold, ditto vice-versa
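The threshold trade-off can be shown with a small Python sketch. The scores below are made up, and the sketch assumes the false-positive rate is measured as a fraction of legitimate mail and the hit-rate as a fraction of spam; the slide does not say which denominators were used.

```python
# Hypothetical (score, is_spam) pairs for eight already-classified mails.
SCORED = [
    (0.3, False), (1.2, False), (4.1, False), (5.6, False),
    (3.8, True), (6.2, True), (9.5, True), (14.0, True),
]


def rates(threshold: float):
    """Return (false-positive rate over ham, hit-rate over spam)."""
    ham = [score for score, is_spam in SCORED if not is_spam]
    spam = [score for score, is_spam in SCORED if is_spam]
    fp_rate = sum(score >= threshold for score in ham) / len(ham)
    hit_rate = sum(score >= threshold for score in spam) / len(spam)
    return fp_rate, hit_rate
```

On this toy data, `rates(5.0)` gives `(0.25, 0.75)`. Raising the threshold to 6.0 removes the false positive at no cost in hit-rate, while lowering it to 3.5 catches all the spam but doubles the false positives.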
14 What To Do When You've Caught It
- Since classifiers are imperfect, blind deletion is bad
- Better to mark the mails, and allow user to check over them infrequently
- Also good to mark for legal reasons
  - In the UK, it may be illegal to hold mail (even spam) for more than 3 days
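Marking rather than deleting can be sketched with Python's standard `email` package. The header names `X-Spam-Flag` and `X-Spam-Status` follow those SpamAssassin adds, but the status format here is simplified and the function `mark_message` is hypothetical.

```python
from email import message_from_string


def mark_message(raw: str, score: float, threshold: float = 5.0) -> str:
    """Add spam headers to a message and return it; never drop the mail."""
    msg = message_from_string(raw)
    # Remove any headers of the same name that the sender may have forged.
    del msg["X-Spam-Flag"]
    del msg["X-Spam-Status"]
    is_spam = score >= threshold
    if is_spam:
        msg["X-Spam-Flag"] = "YES"
    # Simplified status line (assumption, not the exact SpamAssassin format).
    verdict = "Yes" if is_spam else "No"
    msg["X-Spam-Status"] = f"{verdict}, score={score:.1f} required={threshold:.1f}"
    return msg.as_string()
```

The user's mail client can then file anything carrying `X-Spam-Flag: YES` into a separate folder to be checked over occasionally.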
15 Features For Large-Scale Use: "spamd"
- Client-server interface to SpamAssassin
- Pre-loads, so much faster for high volumes
- Can load user preferences from an SQL database
- Can load-balance -- uses TCP/IP
- Deployed at several large organisations and ISPs: The Well, Salon.com, Panix, Transmeta, SourceForge, Stanford
16 Large-Scale Filtering For Your Network
- Different from filtering for yourself
- Many users get little spam
- Should use conservative settings
- Better to use “opt-out by default”
  - notify that spam filtering is available, and ask them if they want it
17 How Can Network Administrators Fight Spam?
- Scan for Open Relays & Proxies on your network
- Block proxy ports at the firewall
- Audit web servers for “FormMail” or other insecure web-to-mail scripts
- Spam traps reporting to network blacklists: Razor, DCC, Pyzor
- Run SpamAssassin, or SpamAssassin Pro!
18 How Do The Spammers Feel?
- Already hurting, according to CBS:
  - “[I’ve gone through] unbelievable hardships [to keep spamming] ... My operating costs have gone up 1,000% this year, just so I can figure out how to get around all these filters”
- Spam relies on low overheads and extremely cheap delivery
- Disrupt the equation and they will give up!
19 Future Directions
- Learning filters (Bayesian probability etc.)
  - Learn automatically, to detect what "good" mail to your network looks like
- "Hash-cash"
  - Sending mail is currently more-or-less free
  - With hash-cash, each recipient costs the sender CPU time
  - SpamAssassin can provide "bonus points" for hash-cash users
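The hash-cash idea can be sketched as a small proof-of-work in Python. The stamp format `recipient:counter` is a simplification invented for this sketch, not the real hashcash stamp format, and 12 bits is a deliberately low difficulty so the example runs quickly.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of a hash digest."""
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def mint(recipient: str, bits: int = 12) -> str:
    """Sender's side: search for a stamp whose SHA-1 starts with `bits` zero bits."""
    for counter in count():
        stamp = f"{recipient}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp: str, recipient: str, bits: int = 12) -> bool:
    """Recipient's side: a single hash checks the work and who it was done for."""
    if not stamp.startswith(recipient + ":"):
        return False
    return leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits
```

Minting takes about 2^bits hash attempts on average and must be repeated for every recipient, while checking a stamp takes one hash. That asymmetry is what makes bulk sending expensive and receiving cheap.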
20 Fin
- http://spamassassin.org/
  - SpamAssassin for UNIX (free software)
- http://www.deersoft.com/
  - SpamAssassin Pro: MS Outlook, Exchange (commercial version) (my employers!)