Spam Filters


1 Spam Filters

2 What is Spam? Unsolicited (legally, “no existing relationship”), automated, bulk email. Not necessarily commercial: it can also be “flaming” or political.

3 Spam arriving in Michael’s mail box in August: You have won a lottery; Your bank needs your account details; Money transfer from Nigeria; On-line pharmaceuticals; Software for sale; Alarm systems; Looking for a safe, ethical secondary income?; Music and film downloads.

4 Why send spam? Email is fast, cheap, and easy. Enormous address lists are available (or spammers guess likely addresses from dictionaries, e.g. ireland3@, or harvest them). 7% of email users have bought something. 100 responses to 10 million emails will produce a profit. Illegal in the EU, but not in all US states.

5 What’s wrong with spam? Wastes time deleting unwanted messages. User sees offensive material. Fills up file server storage space. Some people are vulnerable to confidence tricks. BrightMail estimated that 8% of email was spam in 2001 and 40% in 2002. May stall the internet altogether.

6 Combating spam. Blacklisting: maintain a list of email addresses of known spammers. Greylisting: challenge suspected spam emails, e.g. by requiring the sender to answer a question that is simple for a human but difficult for a computer (how many animals are in this picture?). Munging: to defeat harvesters, e.g. post your email on the web as cormac at dublin dot com. Litigation: e.g. the anti-spam company Habeas and its haiku, “winter into spring, brightly anticipated, like Habeas SWE”. The EU says all bulk email should be opt-in unless there is an “existing relationship”.
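A minimal sketch of munging in Python, assuming the simple "at" / "dot" substitution shown on the slide. The function name `munge` is hypothetical; real pages use many other obfuscation schemes.

```python
def munge(address):
    """Rewrite an email address so that naive harvesters do not recognise it."""
    # Replace the two characters that harvesters typically search for.
    return address.replace("@", " at ").replace(".", " dot ")


print(munge("cormac@dublin.com"))
```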

7 Spam filters. Spam filters are an example of text classification (other examples: classification by topic, language, or author). Which is worse: labelling a legitimate email as spam, or letting a spam message through?

8 Rule-based filters. Some systems allow users to handcraft rules. Rather than a yes/no decision, it is best for each rule to have an associated probability, e.g. Barclays → 90%, Ivory Coast → 70%. But this is time-consuming and tedious; users must be “savvy” enough to create the rules; and the rules must be constantly refined as the nature of spam changes.
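A rough sketch of a handcrafted rule table in Python, using the two example rules from this slide. The slide does not say how several matching rules should be combined; taking the highest matching probability is an assumption made here for illustration, and the names `RULES` and `rule_score` are hypothetical.

```python
# Handcrafted rules: phrase -> probability that a message containing it is spam.
RULES = {"barclays": 0.90, "ivory coast": 0.70}


def rule_score(text, rules=RULES):
    """Return the highest spam probability among the rules that match the text."""
    lowered = text.lower()
    matched = [prob for phrase, prob in rules.items() if phrase in lowered]
    # Assumption: with no matching rule, the score is 0.0.
    return max(matched, default=0.0)


print(rule_score("Urgent notice from Barclays"))
```

Every new spam trick needs a new entry in `RULES`, which is the maintenance burden the slide describes.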

9 Adaptive filters. Learn directly from the data in the user’s mailbox. Which words are truly characteristic of spam? Compare with automatic indexing (stemming, mid-frequency words).

10 Training vs. test sets. 1. Learn the rules on the training data. 2. See if the rules work on the test data. E.g. use the LingSpam corpus (400 spams, 200 legitimate messages sent to the Linguist List). Better to build your own corpus: spammers can overcome filters built on just one corpus.
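The two steps can be set up with a simple random split. This is only a sketch: the 20% test fraction, the fixed seed, and the function name `split_corpus` are assumptions, not part of the slides.

```python
import random


def split_corpus(messages, test_fraction=0.2, seed=0):
    """Shuffle the messages and split them into (training, test) lists."""
    shuffled = list(messages)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_corpus(["msg%d" % i for i in range(10)])
print(len(train), len(test))
```

The test messages must never be used when learning the rules, otherwise the measured accuracy is over-optimistic.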

11 Chi-Squared Test. Find the most characteristic words in spam / non-spam with the chi-squared test (the same test also finds differences between men’s and women’s speech).
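As an illustration, the chi-squared statistic for one word can be computed from a 2x2 table (word present / absent against spam / legitimate). The sketch below uses the standard Pearson formula for a 2x2 table; the function name and the counts in the example are made up.

```python
def chi_squared(word_spam, word_legit, spam_total, legit_total):
    """Pearson chi-squared statistic for a 2x2 word-by-category table."""
    a = word_spam                 # spam messages containing the word
    b = word_legit                # legitimate messages containing the word
    c = spam_total - word_spam    # spam messages without the word
    d = legit_total - word_legit  # legitimate messages without the word
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the word carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts: the word occurs in 40 of 100 spams and 10 of 100 legitimate messages.
print(chi_squared(40, 10, 100, 100))
```

The larger the statistic, the more unevenly the word is distributed between the two categories; the statistic alone does not say which category the word favours.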

12 Mutual Information (1). [word, category]: e.g. how often is the word “download” found in spam? [word]: e.g. how many messages altogether contain “download”? [category]: e.g. how many messages altogether are spam? N = total number of messages.

13 Mutual Information (2). MI = log2( [download, spam] * N / ( [download] * [spam] ) ). The higher the MI, the more “download” is typical of spam. Now that we have found which words are most typical of spam and of legitimate messages, we must use this information to classify the unseen messages in the test set.
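A small sketch of the formula in Python, reading it as log2 of [word, category] * N divided by the product [word] * [category] (a quantity often called pointwise mutual information). The function name and the counts are hypothetical.

```python
import math


def mutual_information(word_and_cat, word, cat, n):
    """MI = log2( [word, category] * N / ( [word] * [category] ) )."""
    if word_and_cat == 0:
        # The word never occurs in this category: the logarithm diverges.
        return float("-inf")
    return math.log2(word_and_cat * n / (word * cat))


# Hypothetical counts: 200 messages, 100 of them spam, "download" in 40 messages,
# all 40 of which are spam.
print(mutual_information(40, 40, 100, 200))
```

A value of 0 means the word is no more frequent in the category than in the corpus as a whole; positive values mean it is over-represented there.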

14 Bayesian Modelling. Used in expert systems. We want to work out the probability of the hypothesis given the evidence, P ( H | E ). E.g. P ( spam | contains “NOW!” ) versus P ( not spam | contains “NOW!” ): which is greater? Bayes’ rule: P ( H | E ) = P ( E | H ) * P ( H ) / P ( E ).
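A worked example of Bayes’ rule in Python. P ( E ) is expanded over the two hypotheses (spam, not spam) using the law of total probability; the probabilities passed in are invented for illustration and the function name is hypothetical.

```python
def posterior(p_e_given_h, p_h, p_e_given_not_h):
    """P(H | E) by Bayes' rule, with P(E) summed over H and not-H."""
    p_not_h = 1.0 - p_h
    p_e = p_e_given_h * p_h + p_e_given_not_h * p_not_h
    return p_e_given_h * p_h / p_e


# Invented figures: 40% of mail is spam; "NOW!" appears in 20% of spam
# and in 1% of legitimate mail.
p_spam = posterior(0.2, 0.4, 0.01)
print(p_spam, 1.0 - p_spam)
```

With these made-up numbers, P ( spam | contains “NOW!” ) is much greater than P ( not spam | contains “NOW!” ).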

15 Combining Evidence (1). A Naïve Bayesian model assumes that multiple pieces of evidence are conditionally independent. Compare independent events: Toffee Vodka wins the 2:00 at Newmarket; All for Laura wins the 2:35 at Newmarket; Nebraska Tornado wins the 3:15 at Newmarket. With dependent events: Newcastle beat Birmingham; Newcastle lead Birmingham at half-time; Shearer scores a hat-trick.

16 Combining Evidence (2). In a Naïve Bayesian model, P ( cheap, v1agra, NOW! | spam ) = P ( cheap | spam ) * P ( v1agra | spam ) * P ( NOW! | spam ). Now we can find: P ( spam | cheap, v1agra, NOW! ) = a; P ( not spam | cheap, v1agra, NOW! ) = b. Odds on spam given that the message contains these three words = a / b. In real text, words are conditionally dependent, e.g. “click here”. Only classify as spam if the odds are 100 to 1 on.
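The odds a / b can be sketched in Python as follows. P ( E ) cancels in the ratio, so only the prior and the per-word likelihoods are needed. All probabilities below are invented, the function names are hypothetical, and the sum of logarithms is used only to avoid underflow when many words are multiplied.

```python
import math


def spam_odds(words, p_word_spam, p_word_legit, p_spam):
    """Odds P(spam | words) / P(not spam | words) under the naive assumption."""
    log_odds = math.log(p_spam) - math.log(1.0 - p_spam)
    for w in words:
        # Words missing from either table are ignored (an assumption of this sketch).
        if w in p_word_spam and w in p_word_legit:
            log_odds += math.log(p_word_spam[w]) - math.log(p_word_legit[w])
    return math.exp(log_odds)


def is_spam(words, p_word_spam, p_word_legit, p_spam, threshold=100.0):
    """Classify as spam only if the odds are at least `threshold` to 1 on."""
    return spam_odds(words, p_word_spam, p_word_legit, p_spam) >= threshold


# Invented likelihoods P(word | spam) and P(word | not spam).
P_SPAM_WORD = {"cheap": 0.2, "v1agra": 0.1, "NOW!": 0.3}
P_LEGIT_WORD = {"cheap": 0.02, "v1agra": 0.001, "NOW!": 0.03}

print(is_spam(["cheap", "v1agra", "NOW!"], P_SPAM_WORD, P_LEGIT_WORD, 0.5))
```

The high threshold reflects the view that labelling a legitimate message as spam is far more costly than letting a spam message through.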

17 Non-word indicators of spam. Phrases, e.g. “free money”, “only $”, “over 21”. Punctuation!!! Domain name of sender: .edu is less likely to be spam than .com. Spam is more likely to be sent at night than legitimate email. If less than 9% of the characters are non-alphanumeric, the message is more likely to be legitimate. Look for images, colours, HTML tags.
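A few of these indicators can be extracted with plain string operations. The sketch below is illustrative only: the feature names are hypothetical, whitespace is excluded from the character count (the slide does not say whether it should be), and the HTML check is a crude substring test.

```python
def non_alnum_fraction(text):
    """Fraction of non-whitespace characters that are not letters or digits."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if not ch.isalnum()) / len(chars)


def extract_features(text, sender):
    """Collect some non-word indicators for one message."""
    lowered = text.lower()
    return {
        "non_alnum_fraction": non_alnum_fraction(text),
        "exclamation_marks": text.count("!"),
        "sender_is_edu": sender.lower().endswith(".edu"),
        "has_html": "<html" in lowered or "<img" in lowered,
    }


print(extract_features("FREE money!!!", "offers@shop.com"))
```

Such features can be fed to the classifier alongside the word features, for example by treating "fraction of 9% or more" as one more piece of evidence.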

18 Evaluation of spam filters. Junk precision: percentage of messages in the test data classified as junk which truly are junk. Junk recall: percentage of junk messages in the test data classified as junk. Legitimate precision: percentage of messages in the test data classified as legitimate which truly are legitimate. Legitimate recall: percentage of legitimate messages in the test data which are classified as legitimate.
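All four measures follow from comparing the true labels with the filter's output on the test data. A minimal sketch, assuming the labels are the strings "junk" and "legit" (hypothetical names) and returning fractions rather than percentages:

```python
def precision_recall(actual, predicted, label):
    """Return (precision, recall) for one label."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
    classified_as = sum(1 for p in predicted if p == label)
    truly = sum(1 for a in actual if a == label)
    precision = true_pos / classified_as if classified_as else 0.0
    recall = true_pos / truly if truly else 0.0
    return precision, recall


# Made-up test data: five messages with their true and predicted labels.
actual = ["junk", "junk", "junk", "legit", "legit"]
predicted = ["junk", "junk", "legit", "junk", "legit"]
print(precision_recall(actual, predicted, "junk"))
print(precision_recall(actual, predicted, "legit"))
```

Legitimate recall is the figure to watch most closely, since it measures how many genuine messages survive the filter.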

19 Summary. The need to create spam filters automatically. Find words which are typical of spam, and words which are typical of legitimate emails, using training data. Use this knowledge to automatically classify new emails.

