Bayesian Spam Filters
Key Concepts: Conditional Probability, Independence, Bayes' Theorem


1 Bayesian Spam Filters
Key Concepts
– Conditional Probability
– Independence
– Bayes' Theorem

2 Spam or Ham?
FROM: Terry Delaney [removed]
TO: (removed)
Subject: FDA approved on-line pharmacies!
click here (removed) here (removed)
Chose your product and site below:
Canadian pharmacy (removed) - Cialis Soft Tabs - $5.78, Viagra Professional - $4.07, Soma - $1.38, Human Growth Hormone - $43.37, Meridia - $3.32, Tramadol - $2.17, Levitra - $11.97.

3 Quick Reminders
Conditional Probability: for events E, F with P(F) > 0, P(E | F) = P(E ∩ F) / P(F)
Independence: E and F are independent if and only if P(E ∩ F) = P(E) P(F)

4 Bayes' Theorem: A Quick Proof

5 Proof cont.
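
A standard derivation, which may differ in detail from the one on the original slides, needs only the definition of conditional probability. It assumes P(E) > 0 and 0 < P(F) < 1, and splits E into the part inside F and the part inside its complement F^C:

```latex
\begin{align*}
P(F \mid E) &= \frac{P(E \cap F)}{P(E)}
  && \text{definition of conditional probability} \\
P(E \cap F) &= P(E \mid F)\,P(F)
  && \text{same definition, rearranged} \\
P(E) &= P(E \cap F) + P(E \cap F^{C})
  && \text{the two pieces are disjoint} \\
     &= P(E \mid F)\,P(F) + P(E \mid F^{C})\,P(F^{C}) \\
P(F \mid E) &= \frac{P(E \mid F)\,P(F)}
                    {P(E \mid F)\,P(F) + P(E \mid F^{C})\,P(F^{C})}
\end{align*}
```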

6 Applying Bayes' Theorem
Let our sample space be the set of emails.
Let S be the event a message is spam; hence S^C is the event a message is not spam.
Let E be the event a message contains a word w.

7 Estimations

8 Estimation Continued
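
The formulas for these two slides can be reconstructed from the worked examples later in the deck. As a sketch: apply Bayes' theorem with F = S, assume (as the deck does) that P(S) = P(S^C) = 1/2, and estimate the two conditional probabilities by relative frequencies in the training messages. The names p(w), q(w) and r(w) follow the examples; the counting notation N(...) is introduced here for illustration only.

```latex
\begin{align*}
P(S \mid E) &= \frac{P(E \mid S)\,P(S)}
                    {P(E \mid S)\,P(S) + P(E \mid S^{C})\,P(S^{C})}
             = \frac{P(E \mid S)}{P(E \mid S) + P(E \mid S^{C})}
  && \text{since } P(S) = P(S^{C}) \\[4pt]
p(w) &= \frac{N(\text{spam messages containing } w)}{N(\text{spam messages})}
  \approx P(E \mid S) \\[4pt]
q(w) &= \frac{N(\text{non-spam messages containing } w)}{N(\text{non-spam messages})}
  \approx P(E \mid S^{C}) \\[4pt]
r(w) &= \frac{p(w)}{p(w) + q(w)} \approx P(S \mid E)
\end{align*}
```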

9 Spam based on single words?
Probabilities based on single words: Bad Idea
– False positives AND false negatives aplenty
Calculate based on n words, assuming the events E_i | S (and E_i | S^C) are independent, and P(S) = P(S^C).

10 Final Approximation
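
The approximation itself can be inferred from the two-word example later in the deck. A sketch, assuming E_i is the event that the message contains word w_i, the E_i are independent given S and given S^C, and P(S) = P(S^C):

```latex
\begin{align*}
P(S \mid E_1 \cap \dots \cap E_n)
  &= \frac{\prod_{i=1}^{n} P(E_i \mid S)}
          {\prod_{i=1}^{n} P(E_i \mid S) + \prod_{i=1}^{n} P(E_i \mid S^{C})} \\[4pt]
r(w_1, \dots, w_n)
  &= \frac{\prod_{i=1}^{n} p(w_i)}
          {\prod_{i=1}^{n} p(w_i) + \prod_{i=1}^{n} q(w_i)}
\end{align*}
```

With n = 1 this reduces to the single-word estimate r(w) = p(w) / (p(w) + q(w)).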

11 How do we use this?
The user must train the filter on messages in his/her inbox to estimate the probabilities.
The program or user must define a threshold probability r.
If the estimated spam probability of a message exceeds r, the message is considered spam.
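
A minimal Python sketch of that workflow, under stated assumptions: training means counting how many spam and non-spam messages contain each word, and classification compares the estimated probability with the threshold. The function names `train` and `is_spam`, the whitespace tokenizer, and the toy messages are all hypothetical and not taken from the slides.

```python
from collections import Counter


def train(messages):
    """Count, for each word, how many messages contain it at least once."""
    counts = Counter()
    for text in messages:
        # set() so a word repeated inside one message is counted once
        counts.update(set(text.lower().split()))
    return counts


def is_spam(words, spam_counts, n_spam, ham_counts, n_ham, threshold=0.9):
    """Return True if the estimated spam probability exceeds the threshold.

    Assumes P(S) = P(S^C) and that word occurrences are independent.
    """
    p_product = 1.0  # product of p(w): fraction of spam containing w
    q_product = 1.0  # product of q(w): fraction of non-spam containing w
    for w in words:
        p_product *= spam_counts[w] / n_spam
        q_product *= ham_counts[w] / n_ham
    total = p_product + q_product
    if total == 0:
        return False  # no evidence either way; let the message through
    return p_product / total > threshold


# Toy training data (made up for illustration)
spam = ["cheap viagra online", "viagra cialis discount", "win money now"]
ham = ["meeting at noon", "lunch now or later", "project status report"]
spam_counts = train(spam)
ham_counts = train(ham)

print(is_spam(["viagra"], spam_counts, len(spam), ham_counts, len(ham)))  # True
print(is_spam(["now"], spam_counts, len(spam), ham_counts, len(ham)))     # False
```

Note that a word never seen in non-spam gets q(w) = 0 and therefore a score of 1, which is one reason such a filter needs a reasonably large training set.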

12 Example
Suppose the filter has the following data:
Threshold probability: 0.9
“Viagra” occurs in 250 of 2000 spam messages
“Viagra” occurs in only 5 of 1000 non-spam messages
Let’s try to estimate the probability, using the process we just defined.

13 Example Cont.
Step 1: Estimate the probability that a spam message contains the word “Viagra”.
– p(Viagra) = 250 / 2000 = 0.125
Step 2: Estimate the probability that a non-spam message contains the word “Viagra”.
– q(Viagra) = 5 / 1000 = 0.005

14 Example Cont.
Since we are assuming that it is equally likely that an incoming message is or is not spam, we can estimate the probability with this equation:
– r(Viagra) = p(Viagra) / (p(Viagra) + q(Viagra))

15 Example Cont.
r(Viagra) = 0.125 / (0.125 + 0.005) = 0.125 / 0.130 = 0.962
Since r(Viagra) is greater than the threshold of 0.9, we can reject this message as spam.
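
The same arithmetic can be checked in a few lines of Python. This is only a sketch: the function name `single_word_spam_score` is hypothetical, and it builds in the deck's assumption that spam and non-spam are equally likely.

```python
def single_word_spam_score(spam_with_word, spam_total, ham_with_word, ham_total):
    """Estimate r(w) = p(w) / (p(w) + q(w)), assuming P(S) = P(S^C)."""
    p = spam_with_word / spam_total  # p(w): fraction of spam containing w
    q = ham_with_word / ham_total    # q(w): fraction of non-spam containing w
    if p + q == 0:
        raise ValueError("word does not occur in the training data")
    return p / (p + q)


THRESHOLD = 0.9

# Counts from the example: 250 of 2000 spam, 5 of 1000 non-spam
r_viagra = single_word_spam_score(250, 2000, 5, 1000)
print(f"{r_viagra:.3f}")     # 0.962
print(r_viagra > THRESHOLD)  # True
```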

16 Harder Stuff
Single-word detection can lead to a lot of false positives and false negatives.
To counter this, most spam filters look for the presence of multiple words.

17 Another Example
2000 spam messages; 1000 real messages
“Viagra” appears in 400 spam messages
“Viagra” appears in 60 real messages
“Cialis” appears in 200 spam and 25 real messages
Threshold probability: 0.9
Let’s calculate the probability that a message containing both words is spam.

18 Example Cont.
Step 1: Estimate the probability that a spam message contains the word “Viagra”.
– p(Viagra) = 400 / 2000 = 0.2
Step 2: Estimate the probability that a non-spam message contains the word “Viagra”.
– q(Viagra) = 60 / 1000 = 0.06

19 Example Cont.
Step 3: Estimate the probability that a spam message contains the word “Cialis”.
– p(Cialis) = 200 / 2000 = 0.1
Step 4: Estimate the probability that a non-spam message contains the word “Cialis”.
– q(Cialis) = 25 / 1000 = 0.025

20 Example Cont.
Using our approximation, we have:
– r(Viagra, Cialis) = p(Viagra) * p(Cialis) / (p(Viagra) * p(Cialis) + q(Viagra) * q(Cialis))

21 Example Cont.
r(Viagra, Cialis) = (0.2)(0.1) / ((0.2)(0.1) + (0.06)(0.025)) = 0.02 / 0.0215 = 0.930
Since 0.930 is greater than the threshold probability of 0.9, this message will be rejected as spam.
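
A short Python check of the two-word calculation, written for any number of words. The function name `combined_spam_score` is hypothetical; the sketch relies on the same independence and equal-prior assumptions as the approximation.

```python
from math import prod


def combined_spam_score(p_values, q_values):
    """Estimate r(w_1..w_n) = prod(p) / (prod(p) + prod(q)).

    p_values[i] is the fraction of spam messages containing word i,
    q_values[i] the fraction of non-spam messages containing it.
    """
    numerator = prod(p_values)
    denominator = numerator + prod(q_values)
    if denominator == 0:
        raise ValueError("no training evidence for these words")
    return numerator / denominator


# Values from the example: "Viagra" first, then "Cialis"
p = [400 / 2000, 200 / 2000]  # 0.2, 0.1
q = [60 / 1000, 25 / 1000]    # 0.06, 0.025

r_both = combined_spam_score(p, q)
print(f"{r_both:.3f}")  # 0.930
print(r_both > 0.9)     # True
```

Using 0.6 instead of 0.06 for q(Viagra) would give about 0.571, below the threshold, so the value of q(Viagra) matters for the verdict.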

22 Questions?

