2 Outline 1. Introduction 2. Email sample and data preprocessing –2.1 Email representation –2.2 Feature extraction 3. Anti-spam email LVQ model –3.1 Spam email category –3.2 Learning vector quantization neural network model –3.3 Anti-spam email LVQ algorithm –3.4 Parameter setting 4. Experiments and result 5. Conclusion
3 1. Introduction(1/2) Spam e-mail wastes users' time, money, and network bandwidth, clutters users' mailboxes, and can even be harmful, e.g. when it carries pornographic content. In America, spam e-mail costs enterprises up to 9 billion per year. Without appropriate countermeasures, the situation will continue to worsen and spam e-mail will eventually undermine the usability of email.
4 1. Introduction(2/2) Duhong Chen et al. compared four algorithms (Bayes, decision tree, neural network, and Boosting) and concluded that the neural network algorithm achieves higher performance than the others. Experiments show that the LVQ-based anti-spam email filter performs better than the Bayes-based and BP neural network-based approaches.
5 2. Email sample and data preprocessing(1/2) 2.1 Email representation TFIDFi = TFi × log(N/DFi) (1) –TFi: the frequency with which word ti appears in document d –N: the total number of training documents –DFi: the number of documents that contain word ti 2.2 Feature extraction
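Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the function name `tfidf_vectors`, the toy documents, and the choice of the natural logarithm are assumptions (the slide does not state the log base), and TF is taken as the raw word count.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {word: weight} dict per tokenized document, using TF x log(N/DF)."""
    n_docs = len(documents)  # N: total number of training documents
    # DF: number of documents that contain each word at least once
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)  # TF: raw count of each word in this document
        vectors.append({
            word: tf * math.log(n_docs / doc_freq[word])  # natural log is an assumption
            for word, tf in term_freq.items()
        })
    return vectors

# Hypothetical toy corpus, already tokenized
docs = [["cheap", "pills", "cheap"], ["meeting", "agenda"], ["cheap", "meeting"]]
vectors = tfidf_vectors(docs)
```

A word that occurs in every training document gets weight 0, since log(N/DF) = log(1) = 0, which is why very common words contribute nothing to the representation.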
6 2. Email sample and data preprocessing(2/2) 2.2 Feature extraction –A: the number of emails that contain word t and belong to class s –B: the number of emails that contain word t but do not belong to class s –C: the number of emails that belong to class s but do not contain word t –N: the total number of emails in the training corpus
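The slide text lists the counts A, B, C and N without the scoring formula that uses them. As an illustration only, the sketch below assumes the score is the mutual-information estimate log(A × N / ((A + C) × (A + B))) that is common in text categorization and needs exactly these four counts. Whether this is the statistic used here is an assumption, and the function name and the example counts are hypothetical.

```python
import math

def mutual_information(a, b, c, n):
    """Estimate log(A*N / ((A+C)*(A+B))) for word t and class s (assumed score)."""
    if a == 0:
        return float("-inf")  # word t never occurs in an email of class s
    return math.log(a * n / ((a + c) * (a + b)))

# Hypothetical counts: 40 spam emails contain t, 10 non-spam emails contain t,
# 60 spam emails lack t, and the training corpus holds 200 emails.
score = mutual_information(a=40, b=10, c=60, n=200)
```

Under this assumed score, words would be ranked per class and the highest-scoring ones kept as features.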
8 3. Anti-spam email LVQ model(2/5) 3.2 Learning vector quantization neural network model –The model is divided into two layers. The first is a competitive layer, in which each neuron represents a subclass. –The second is an output layer, in which each neuron represents a class.
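The two-layer structure can be sketched with the generic LVQ1 update rule. This is a minimal sketch and not necessarily the algorithm of section 3.3: the function names, the linearly decaying learning rate, the two-dimensional toy vectors, and the initial weights are all assumptions made for illustration.

```python
def nearest(weights, x):
    """Index of the competitive-layer neuron closest to x (squared Euclidean distance)."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))

def train_lvq1(samples, labels, weights, neuron_class, lr=0.1, epochs=20):
    """LVQ1: move the winning neuron toward same-class samples, away from others."""
    weights = [list(w) for w in weights]  # copy so the caller's list is untouched
    for epoch in range(epochs):
        rate = lr * (1.0 - epoch / epochs)  # assumed linear decay schedule
        for x, label in zip(samples, labels):
            k = nearest(weights, x)  # competitive layer picks a subclass neuron
            sign = 1.0 if neuron_class[k] == label else -1.0
            weights[k] = [wi + sign * rate * (xi - wi) for wi, xi in zip(weights[k], x)]
    return weights

def classify(weights, neuron_class, x):
    """Output layer: map the winning subclass neuron to its class."""
    return neuron_class[nearest(weights, x)]

# Hypothetical 2-D feature vectors: two spam subclasses and one legitimate class
samples = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9], [0.0, 5.0], [0.1, 5.1]]
labels = ["spam", "spam", "spam", "spam", "legit", "legit"]
initial = [[0.5, 0.5], [4.5, 4.5], [0.5, 4.5]]  # one weight vector per subclass neuron
neuron_class = ["spam", "spam", "legit"]  # subclass neuron -> output class

trained = train_lvq1(samples, labels, initial, neuron_class)
predictions = [classify(trained, neuron_class, x) for x in samples]
```

Here two competitive neurons map to the same output class, which mirrors the idea that one class (spam) is covered by several subclasses.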
12 4. Experiments and result(1/4) This project uses the email corpus from http://www.spamassassin.org/publiccorpus, which is publicly available. 1000 e-mails were selected at random from the corpus: 580 spam e-mails and 420 legitimate e-mails.
13 4. Experiments and result(2/4) Anti-spam email filter performance is often measured in terms of spam precision (SP) and spam recall (SR).
14 4. Experiments and result(3/4) The F1 criterion incorporates both spam precision and spam recall.
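The three measures can be computed as follows, assuming the usual definitions: SP is the share of emails flagged as spam that really are spam, SR is the share of spam emails that get flagged, and F1 is their harmonic mean, 2 × SP × SR / (SP + SR). The function name, parameter names, and the counts in the example are hypothetical and are not experimental results.

```python
def spam_metrics(spam_as_spam, legit_as_spam, spam_as_legit):
    """Return (SP, SR, F1) from filter decision counts; assumes non-zero denominators."""
    sp = spam_as_spam / (spam_as_spam + legit_as_spam)  # spam precision
    sr = spam_as_spam / (spam_as_spam + spam_as_legit)  # spam recall
    f1 = 2 * sp * sr / (sp + sr)                        # harmonic mean of SP and SR
    return sp, sr, f1

# Made-up counts: 90 spam caught, 10 legitimate emails wrongly flagged, 30 spam missed
sp, sr, f1 = spam_metrics(spam_as_spam=90, legit_as_spam=10, spam_as_legit=30)
```

Because F1 is a harmonic mean, it stays low unless both SP and SR are high, so a filter cannot score well by maximizing only one of them.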
16 5. Conclusion Both neural network-based algorithms are usually better than the Bayes-based one. The LVQ-based method divides spam emails into several subclasses by content, so that the feature words within each subclass are more closely related and the characteristics of each subclass are easier to identify.