Spam: An Analysis of Spam Filters
Joe Chiarella, Jason O’Brien
Advisors: Professor Wills and Professor Claypool
Project Goals
To analyze the effectiveness of different kinds of spam filters.
Focused on SpamAssassin and Bogofilter.
SpamAssassin
Rule-based filter with over 400 rules.
Each rule has an associated weight.
The score of an email is the sum of the weights of all matching rules.
An email is flagged as spam when its score reaches a user-adjustable threshold.
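The weighted-rule scoring described on this slide can be sketched roughly as follows. This is a minimal illustration, not SpamAssassin's implementation: the three rules, their names, their weights, and the default threshold of 5.0 are all hypothetical values chosen for the example.

```python
import re

# Hypothetical rules as (name, pattern, weight). They are illustrative only
# and are not taken from SpamAssassin's actual rule set.
RULES = [
    ("MENTIONS_FREE_MONEY", re.compile(r"free money", re.IGNORECASE), 2.5),
    ("SUBJECT_ALL_CAPS", re.compile(r"^Subject: [A-Z !]+$", re.MULTILINE), 1.5),
    ("HAS_UNSUBSCRIBE", re.compile(r"unsubscribe", re.IGNORECASE), 0.5),
]


def score_email(text, rules=RULES):
    """Return the sum of the weights of every rule whose pattern matches."""
    return sum(weight for _, pattern, weight in rules if pattern.search(text))


def is_spam(text, threshold=5.0, rules=RULES):
    """Flag the email as spam when its score reaches the threshold.

    The default threshold is an assumed value; in this sketch the user
    adjusts it by passing a different number.
    """
    return score_email(text, rules) >= threshold
```

Lowering the threshold catches more spam at the cost of flagging more ham, which is why the slide describes it as user adjustable.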
Bogofilter
Bayesian filter.
Calculates the probability that an email is spam using past email.
Looks at the frequency of words, not their order.
Accuracy should improve over time as the filter sees more email.
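To make the idea of a word-frequency Bayesian filter concrete, here is a small sketch using a plain naive Bayes classifier with add-one smoothing. It is a simplification for illustration and should not be read as Bogofilter's actual computation; the class name `WordFrequencyFilter` and its methods are hypothetical.

```python
import math
from collections import Counter


class WordFrequencyFilter:
    """Toy naive Bayes spam filter (illustrative, not Bogofilter's algorithm)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one past email labeled 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | words), ignoring word order."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed prior, so an untrained filter returns 0.5.
            prior = (self.message_counts[label] + 1) / (total_messages + 2)
            total_words = sum(self.word_counts[label].values())
            log_score = math.log(prior)
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen words from zeroing the product.
                log_score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = log_score
        # Convert the log-odds to a probability; clamp to avoid overflow.
        diff = max(min(log_scores["ham"] - log_scores["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

Each call to `train` updates the word counts, which is the sense in which accuracy can improve over time: the estimates are only as good as the past email the filter has seen.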
Data Collection
Email collected from students, professors, small business employees, and free email accounts.
Ham emails and 5010 spam emails, separated into ham and spam mailboxes for each user.
Methodology
Compared the accuracy of SpamAssassin and Bogofilter on each user’s email.
Tested the same number of ham emails and spam emails from each user.
Ignored results from the first 50 emails to allow Bogofilter to learn.
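One way to compute per-class accuracy while discarding a warm-up period is sketched below. The function name, the `(actual, predicted)` input format, and the choice to report ham and spam accuracy separately are assumptions made for the example; the slides do not describe the project's actual scripts.

```python
def accuracy_after_warmup(results, warmup=50):
    """Compute ham and spam accuracy, ignoring the first `warmup` results.

    `results` is a list of (actual_label, predicted_label) pairs in the
    order the emails were tested; labels are "ham" or "spam".
    Returns None for a class with no scored emails.
    """
    correct = {"ham": 0, "spam": 0}
    total = {"ham": 0, "spam": 0}
    # Skip the warm-up emails so a learning filter is not penalized early on.
    for actual, predicted in results[warmup:]:
        total[actual] += 1
        if predicted == actual:
            correct[actual] += 1
    return {
        label: (correct[label] / total[label] if total[label] else None)
        for label in total
    }
```

Applying the same warm-up cutoff to both filters keeps the comparison even, although a rule-based filter does not itself need the learning period.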
Comparison of Bogofilter and SpamAssassin on Ham
Legend: CP = Company Person, PR = Professor, ST = Student, FE = Free Email
Comparison of Bogofilter and SpamAssassin on Spam
Legend: CP = Company Person, PR = Professor, ST = Student, FE = Free Email
SpamAssassin Score Analysis
Conclusion
The effectiveness of Bogofilter and SpamAssassin depends greatly on the user.
Neither filter outperformed the other in all cases.
Filtering spam is hard.
Questions?