Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST

What is a statistical email filter? Filters email for spam. Supervised learning Not heuristics Bayesian filtering and Bayes

Why is a statistical email filter better? Not based on pre-set values by a human Can use concepts not easily understood by people Learns over time, and therefore adapts Real world tests put accuracy better than 99.9%

How does a statistical email filter work? Three parts  Tokenization / feature extraction  Training  Analysis Also, it must store a persistent state

Tokenization / Feature extraction Emails are made of words Tokens are words, phrases, HTML, timestamps, senders, etc. The goal is to get as many 'features' of the email as is possible, the good ones rise to the top “the orange ball” Becomes: “the”, “orange”, “ball”, “the orange”, “orange ball”, “the orange ball”, “*Font: Albany”, etc.

Training All filters begin blank Trained with a corpus of spam / nonspam Methods for training as email is seen  TEFT  TUNE  TOE

Analysis Email's tokens are compared to training data Some aggregated percentage is created for email Categorized based on that. Bayesian filtering gets its name from Bayes theorem here.

So how does this one work? Designed to be highly modular. Currently has modules for:  TEFT  Chi squared  Robinson's  Graham's Corpus is non changing, just classifications change.

Object Diagram External User Analysis Package Training Package Message Database Token Database Marked Messages Un-marked messages Token Counts Suggestions Un-marked messages External Database (Optional) All database information Email Corpus

Does this one work? To an extent  Test data had very limited feature set  Test data was based on personal writing style  Little time to test/tune 56%-57% accuracy at best  Measured by interesting predicted/interesting actual  Also mistakes/interesting marked More testing will be done  Other projects are more critical

Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Similar presentations

Presentation on theme: "Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Similar presentations

Presentation on theme: "Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST."— Presentation transcript:

Similar presentations

About project

Feedback