Spam Filters
What is Spam?
- Unsolicited (legally, “no existing relationship”)
- Automated
- Bulk
- Not necessarily commercial: “flaming”, political
Spam arriving in Michael’s mail box in August
- You have won a lottery
- Your bank needs your account details
- Money transfer from Nigeria
- On-line pharmaceuticals
- Software for sale
- Alarm systems
- Looking for a safe, ethical secondary income?
- Music and film downloads
Why send spam?
- Email is fast, cheap, easy
- Availability of enormous address lists (e.g. from harvesting), or guess likely addresses from dictionaries
- 7% of users have bought something
- 100 responses to 10 million emails will produce a profit
- Illegal in the EU, but not in all US states
What’s wrong with spam?
- Wastes time deleting unwanted messages
- User sees offensive material
- Fills up file server storage space
- Some people are vulnerable to confidence tricks
- BrightMail estimate 8% of email was spam in 2001, 40% in [year missing]
- May stall the internet altogether
Combating spam
- Blacklisting: maintain a list of addresses of known spammers
- Greylisting: challenge suspected spam emails, e.g. by requiring the sender to answer a question which is simple for a human but difficult for a computer (“how many animals in this picture?”)
- Munging: to defeat harvesters, e.g. post your email address on the web as cormac at dublin dot com
- Litigation: e.g. anti-spam company Habeas and its haiku, “winter into spring, brightly anticipated, like Habeas SWE”
- EU says all bulk email should be opt-in unless there is an “existing relationship”
Spam filters
- Spam filters are an example of text classification (e.g. by topic, language, author)
- Which is worse: saying a legitimate email is spam, or letting through a spam message?
Rule-based filters
- Some systems allow users to handcraft rules
- Rather than yes/no, it is best for each rule to have an associated probability, e.g. Barclays 90%, Ivory Coast 70%
- But this is time-consuming and tedious
- Users must be “savvy” enough to create the rules
- Rules must be constantly refined as the nature of spam changes
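A minimal sketch of a handcrafted rule set with probabilities attached. The two rules reuse the example values above; the function name `rule_score` and the choice to take the highest matching probability are assumptions for illustration, since the slide does not say how rules are combined.

```python
# Hypothetical handcrafted rules: phrase -> estimated probability that a
# message containing the phrase is spam (values from the examples above).
RULES = {"barclays": 0.90, "ivory coast": 0.70}


def rule_score(message: str) -> float:
    """Return the highest probability among matching rules, or 0.0 if none match.

    Taking the maximum is an assumed combination strategy, not the only one.
    """
    text = message.lower()
    return max((p for phrase, p in RULES.items() if phrase in text), default=0.0)
```

Every new kind of spam needs a new entry in `RULES`, which is the maintenance burden the slide describes.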
Adaptive filters
- Learn directly from the data in the user’s mailbox
- Which words are truly characteristic of spam?
- Compare with automatic indexing (stemming, mid-frequency words)
Training vs. test sets
1. Learn the rules on the training data
2. See if the rules work on the test data
- E.g. use the LingSpam corpus (400 spams, 200 legitimate messages sent to the Linguist List)
- Better to build your own corpus: spammers can overcome filters built on just one corpus
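The two steps above depend on keeping the test messages out of training. A minimal sketch of such a split follows; the function name, the 20% test fraction and the fixed seed are assumptions for illustration.

```python
import random


def train_test_split(messages, test_fraction=0.2, seed=0):
    """Shuffle a copy of the messages and split it into (train, test) lists."""
    shuffled = list(messages)              # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical corpus of ten labelled messages.
corpus = [(f"message {i}", "spam" if i % 2 else "legitimate") for i in range(10)]
train, test = train_test_split(corpus)
```

Rules are learned from `train` only; `test` is held back to check whether they generalise.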
Chi-Squared Test
- Find the most characteristic words in spam / non-spam by the chi-squared test
- (The same test also finds differences between men’s and women’s speech)
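The slide does not give a formula, so the sketch below assumes the standard chi-squared statistic for a 2x2 contingency table (word present or absent, spam or legitimate). The function name and the counts in the example are invented for illustration.

```python
def chi_squared(spam_with, legit_with, spam_without, legit_without):
    """Chi-squared statistic for a 2x2 table of message counts.

    Uses the shortcut N * (ad - bc)^2 / (product of the four marginal totals).
    """
    a, b, c, d = spam_with, legit_with, spam_without, legit_without
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so the word tells us nothing
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts: the word appears in 30 of 100 spams and 10 of 100 legitimate messages.
score = chi_squared(30, 10, 70, 90)
```

A larger score means the word's distribution differs more between the two classes; ranking the vocabulary by this score gives the most characteristic words.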
Mutual Information (1)
- [word, category], e.g. how often is the word “download” found in spam?
- [word], e.g. how many messages altogether contain “download”?
- [category], e.g. how many messages altogether are spam?
- N = total number of messages
Mutual Information (2)
- MI = log2( ([download, spam] * N) / ([download] * [spam]) )
- The higher the MI, the more “download” is typical of spam
- Now that we have found which words are most typical of spam and of legitimate messages, we must use this information to classify the unseen messages in the test set
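The MI formula can be computed directly from the four counts. The sketch below is illustrative: the function and parameter names and the example counts are assumptions, and this per-word form is often called pointwise mutual information.

```python
import math


def mutual_information(n_word_cat, n_word, n_cat, n_total):
    """MI = log2( [word, category] * N / ([word] * [category]) )."""
    if n_word_cat == 0:
        return float("-inf")  # the word never occurs in this category
    return math.log2(n_word_cat * n_total / (n_word * n_cat))


# Hypothetical counts: 1000 messages, 200 of them spam, 50 containing
# "download", of which 40 are spam.
mi_download_spam = mutual_information(40, 50, 200, 1000)
```

An MI of 0 means the word occurs in spam exactly as often as chance predicts; positive values mean it is over-represented in spam.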
Bayesian Modelling
- Used in expert systems
- We want to work out the probability of the hypothesis given the evidence, P(H | E)
- E.g. P(spam | contains “NOW!”) and P(not spam | contains “NOW!”): which is greater?
- Bayes’ rule: P(H | E) = P(E | H) * P(H) / P(E)
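A small worked example of Bayes' rule for the “NOW!” case. All three input probabilities are invented for illustration, and P(E) is expanded over the two hypotheses (spam, not spam).

```python
def posterior(p_e_given_h, p_h, p_e_given_not_h):
    """P(H | E) by Bayes' rule, with P(E) summed over H and not-H."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / p_e


# Assumed values: 40% of messages are spam; "NOW!" appears in 20% of spam
# and in 1% of legitimate messages.
p_spam_given_now = posterior(0.20, 0.40, 0.01)
p_legit_given_now = posterior(0.01, 0.60, 0.20)
```

With these assumed numbers P(spam | contains “NOW!”) is about 0.93, so the spam hypothesis is the greater of the two.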
Combining Evidence (1)
- A Naïve Bayesian model assumes that multiple pieces of evidence are conditionally independent
- Compare three independent events:
  Toffee Vodka wins the 2:00 at Newmarket
  All for Laura wins the 2:35 at Newmarket
  Nebraska Tornado wins the 3:15 at Newmarket
- with three dependent events:
  Newcastle beat Birmingham
  Newcastle lead Birmingham at half-time
  Shearer scores a hat-trick
Combining Evidence (2)
- In a Naïve Bayesian model, P(cheap, v1agra, NOW! | spam) = P(cheap | spam) * P(v1agra | spam) * P(NOW! | spam)
- Now we can find: P(spam | cheap, v1agra, NOW!) = a and P(not spam | cheap, v1agra, NOW!) = b
- Odds on spam given that the message contains these three words = a / b
- In real text, words are conditionally dependent, e.g. “click here”
- Only classify as spam if the odds are at least 100 to 1 on
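The odds a / b can be sketched as follows. P(E) cancels in the ratio, so only the prior and the per-word likelihoods are needed. The probability tables, the prior and the function names are hypothetical; logarithms are summed rather than multiplying raw probabilities, to avoid underflow on long messages.

```python
import math

# Hypothetical per-word probabilities, as if estimated from training data.
P_WORD_GIVEN_SPAM = {"cheap": 0.2, "v1agra": 0.1, "NOW!": 0.3}
P_WORD_GIVEN_LEGIT = {"cheap": 0.02, "v1agra": 0.001, "NOW!": 0.03}
P_SPAM = 0.5  # assumed prior probability that a message is spam


def spam_odds(words):
    """Odds a / b that a message containing these words is spam."""
    log_odds = math.log(P_SPAM / (1 - P_SPAM))
    for word in words:
        # Words missing from either table are skipped (an assumption).
        if word in P_WORD_GIVEN_SPAM and word in P_WORD_GIVEN_LEGIT:
            log_odds += math.log(P_WORD_GIVEN_SPAM[word] / P_WORD_GIVEN_LEGIT[word])
    return math.exp(log_odds)


def is_spam(words, threshold=100.0):
    """Classify as spam only if the odds are at least `threshold` to 1 on."""
    return spam_odds(words) >= threshold


odds = spam_odds(["cheap", "v1agra", "NOW!"])
```

The high threshold reflects the earlier point that misclassifying a legitimate message is the more costly error.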
Non-word indicators of spam
- Phrases, e.g. “free money”, “only $”, “over 21”
- Punctuation!!!
- Domain name of sender: .edu is less likely to be spam than .com
- Spam is more likely than legitimate email to be sent at night
- If less than 9% of the characters are non-alphanumeric, the message is more likely to be legitimate
- Look for images, colours, HTML tags
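One of these indicators, the share of non-alphanumeric characters, is easy to compute. In this sketch the function names are hypothetical and whitespace is excluded from the count, which is an assumption the slide does not settle; the 9% threshold is the one quoted above.

```python
def non_alnum_fraction(text: str) -> float:
    """Fraction of non-whitespace characters that are not letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(not c.isalnum() for c in chars) / len(chars)


def looks_legitimate(text: str, threshold: float = 0.09) -> bool:
    """True if the message falls below the 9% non-alphanumeric threshold."""
    return non_alnum_fraction(text) < threshold
```

A filter would use this as one piece of evidence alongside the word-based features, not as a classifier on its own.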
Evaluation of spam filters
- Junk precision: percentage of messages in the test data classified as junk which truly are junk
- Junk recall: percentage of junk messages in the test data classified as junk
- Legitimate precision: percentage of messages in the test data classified as legitimate which truly are legitimate
- Legitimate recall: percentage of legitimate messages in the test data which are classified as legitimate
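All four measures follow from the same calculation applied to each class in turn. A minimal sketch, with a hypothetical function name and invented labels for a five-message test set; results are fractions, so multiply by 100 for percentages.

```python
def precision_recall(actual, predicted, target):
    """Return (precision, recall) for one class, as fractions between 0 and 1."""
    true_positives = sum(a == target and p == target for a, p in zip(actual, predicted))
    n_predicted = sum(p == target for p in predicted)  # classified as target
    n_actual = sum(a == target for a in actual)        # truly target
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_actual if n_actual else 0.0
    return precision, recall


# Hypothetical test data: true labels and the filter's decisions.
actual = ["junk", "junk", "junk", "legit", "legit"]
predicted = ["junk", "junk", "legit", "junk", "legit"]

junk_precision, junk_recall = precision_recall(actual, predicted, "junk")
legit_precision, legit_recall = precision_recall(actual, predicted, "legit")
```

Legitimate recall is the figure to watch most closely, since it measures how many real messages survive the filter.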
Summary
- The need to create spam filters automatically
- Find words which are typical of spam, and words which are typical of legitimate emails, using training data
- Use this knowledge to automatically classify new emails