Presentation is loading. Please wait.

Presentation is loading. Please wait.

Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST.

Similar presentations


Presentation on theme: "Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST."— Presentation transcript:

1 Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST

2 What is a statistical email filter? Filters email for spam. Supervised learning Not heuristics Bayesian filtering and Bayes

3 Why is a statistical email filter better? Not based on pre-set values by a human Can use concepts not easily understood by people Learns over time, and therefore adapts Real world tests put accuracy better than 99.9%

4 How does a statistical email filter work? Three parts  Tokenization / feature extraction  Training  Analysis Also, it must store a persistent state

5 Tokenization / Feature extraction Emails are made of words Tokens are words, phrases, HTML, timestamps, senders, etc. The goal is to get as many 'features' of the email as is possible, the good ones rise to the top “the orange ball” Becomes: “the”, “orange”, “ball”, “the orange”, “orange ball”, “the orange ball”, “*Font: Albany”, etc.

6 Training All filters begin blank Trained with a corpus of spam / nonspam Methods for training as email is seen  TEFT  TUNE  TOE

7 Analysis Email's tokens are compared to training data Some aggregated percentage is created for email Categorized based on that. Bayesian filtering gets its name from Bayes theorem here.

8 So how does this one work? Designed to be highly modular. Currently has modules for:  TEFT  Chi squared  Robinson's  Graham's Corpus is non changing, just classifications change.

9 Object Diagram External User Analysis Package Training Package Message Database Token Database Marked Messages Un-marked messages Token Counts Suggestions Un-marked messages External Database (Optional) All database information Email Corpus

10 Does this one work? To an extent  Test data had very limited feature set  Test data was based on personal writing style  Little time to test/tune 56%-57% accuracy at best  Measured by interesting predicted/interesting actual  Also mistakes/interesting marked More testing will be done  Other projects are more critical


Download ppt "Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST."

Similar presentations


Ads by Google