2 Outline 1. Introduction 2. Sample and data preprocessing –2.1 E-mail representation –2.2 Feature extraction 3. Anti-spam LVQ model –3.1 Spam category –3.2 Learning vector quantization neural network model –3.3 Anti-spam LVQ algorithm –3.4 Parameter setting 4. Experiments and result 5. Conclusion
3 1. Introduction(1/2) Spam wastes users' time, money, and network bandwidth, clutters their mailboxes, and can even be harmful, e.g. when it carries pornographic content. In America, spam e-mails cost enterprises up to 9 billion per year. Without appropriate counter-measures, the situation will continue to worsen and spam will eventually undermine the usability of e-mail.
4 1. Introduction(2/2) Duhong Chen et al. compared four algorithms (Bayes, decision tree, neural networks, and Boosting) and concluded that the neural network algorithm achieves higher performance. Experiments show that the LVQ-based anti-spam filter performs better than the Bayes-based and BP neural network-based approaches.
5 2. sample and data preprocessing(1/2) 2.1 E-mail representation $\mathrm{TFIDF}_i = \mathrm{TF}_i \times \log(N / \mathrm{DF}_i)$ (1) –$\mathrm{TF}_i$: the frequency with which word $t_i$ appears in document $d$ –$N$: the total number of training documents –$\mathrm{DF}_i$: the number of documents which contain word $t_i$ 2.2 Feature extraction
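Equation (1) of section 2.1 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `tfidf_vectors`, the toy documents, the use of raw counts for TF, and the natural logarithm are all assumptions, since the slide does not specify tokenization or the log base.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Weight each word by TF_i * log(N / DF_i), as in Equation (1).

    documents: list of token lists (hypothetical, already tokenized).
    Returns one {word: weight} dict per document.
    """
    n_docs = len(documents)  # N: total number of training documents

    # DF_i: number of documents that contain word t_i at least once
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        term_freq = Counter(tokens)  # TF_i: raw count of t_i in this document
        vectors.append({
            word: tf * math.log(n_docs / doc_freq[word])  # natural log (assumed)
            for word, tf in term_freq.items()
        })
    return vectors


# Toy corpus, invented for illustration only
docs = [
    ["cheap", "pills", "cheap"],
    ["meeting", "agenda"],
    ["cheap", "meeting"],
]
weights = tfidf_vectors(docs)
```

A word that occurs in every training document gets weight 0, because log(N/DF) is then log(1); rarer words are weighted up.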
6 2. sample and data preprocessing(2/2) 2.2 Feature extraction –A: the number of e-mails which contain word t and belong to class s –B: the number of e-mails which contain word t but do not belong to class s –C: the number of e-mails which belong to class s but do not contain word t –N: the total number of e-mails in the training corpus
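The scoring formula that uses these counts did not survive on this slide. The counts A, B, C and N are the ones that appear in the widely used mutual-information estimate log(A·N / ((A+C)·(A+B))), so the sketch below assumes that criterion; the author's actual formula may differ. The function name and the example counts are hypothetical.

```python
import math


def mutual_information(a, b, c, n):
    """Assumed feature score: log(A * N / ((A + C) * (A + B))).

    a: e-mails that contain word t and belong to class s
    b: e-mails that contain word t but do not belong to class s
    c: e-mails that belong to class s but do not contain word t
    n: total number of e-mails in the training corpus
    """
    if a == 0:
        # The word never co-occurs with the class: lowest possible score
        return float("-inf")
    return math.log((a * n) / ((a + c) * (a + b)))


# Invented counts: the word appears in 40 of 100 spam e-mails and
# in 10 of 100 other e-mails
score_related = mutual_information(a=40, b=10, c=60, n=200)

# Invented counts where word and class are statistically independent
score_independent = mutual_information(a=25, b=25, c=25, n=100)
```

Under this assumed criterion, words would be ranked by score and the top-ranked ones kept as features; a score near 0 means the word says little about the class.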
8 3. Anti-spam LVQ model(2/5) 3.2 Learning vector quantization neural network model –The model is divided into two layers. The first layer is the competitive layer, in which each neuron represents a subclass. –The second is the output layer, in which each neuron represents a class.
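The two-layer structure can be illustrated with the textbook LVQ1 update rule. This is a sketch under stated assumptions, not the anti-spam LVQ algorithm of section 3.3, whose details are not in this text: the function names, learning rate, epoch count, prototype initialization and 2-D toy data are all invented for illustration.

```python
import random


def nearest(prototypes, x):
    """Competitive layer: index of the neuron whose weights are closest to x."""
    return min(
        range(len(prototypes)),
        key=lambda j: sum((w - v) ** 2 for w, v in zip(prototypes[j], x)),
    )


def train_lvq1(samples, labels, prototypes, proto_class, lr=0.1, epochs=20, seed=0):
    """Standard LVQ1: pull the winning neuron toward a correctly labelled
    sample, push it away from an incorrectly labelled one."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = samples[i]
            j = nearest(prototypes, x)
            sign = 1.0 if proto_class[j] == labels[i] else -1.0
            prototypes[j] = [w + sign * lr * (v - w) for w, v in zip(prototypes[j], x)]
    return prototypes


def classify(prototypes, proto_class, x):
    """Output layer: map the winning subclass neuron to its class."""
    return proto_class[nearest(prototypes, x)]


# Toy data: two spam subclasses and one legitimate cluster (invented)
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 0.0], [0.1, 0.1]]
labels = ["spam", "spam", "spam", "spam", "legitimate", "legitimate"]

# One competitive neuron per subclass; proto_class plays the role of the
# fixed subclass-to-class connections of the output layer
prototypes = [[0.8, 0.2], [0.2, 0.8], [0.2, 0.2]]
proto_class = ["spam", "spam", "legitimate"]

train_lvq1(samples, labels, prototypes, proto_class)
prediction = classify(prototypes, proto_class, [0.95, 0.05])
```

Here two competitive neurons map to the same output class, which is how several spam subclasses can share the single label "spam".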
12 4. Experiments and result(1/4) This project uses an openly available e-mail corpus. 1000 e-mails are selected randomly from the corpus: 580 spam e-mails and 420 legitimate e-mails.
13 4. Experiments and result(2/4) Anti-spam filter performance is often measured in terms of spam precision (SP) and spam recall (SR).
14 4. Experiments and result(3/4) The F1 criterion incorporates both spam precision and spam recall.
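The formulas for these measures are not in the slide text. The sketch below uses their usual definitions: SP is the fraction of e-mails flagged as spam that really are spam, SR is the fraction of spam e-mails that get flagged, and F1 is their harmonic mean. The function name and the counts are hypothetical and are not results from this work.

```python
def spam_metrics(spam_as_spam, legit_as_spam, spam_as_legit):
    """Return (SP, SR, F1) from classification counts.

    spam_as_spam:  spam e-mails correctly flagged as spam
    legit_as_spam: legitimate e-mails wrongly flagged as spam
    spam_as_legit: spam e-mails wrongly let through as legitimate
    """
    flagged = spam_as_spam + legit_as_spam
    actual_spam = spam_as_spam + spam_as_legit

    sp = spam_as_spam / flagged if flagged else 0.0          # spam precision
    sr = spam_as_spam / actual_spam if actual_spam else 0.0  # spam recall
    f1 = 2 * sp * sr / (sp + sr) if (sp + sr) else 0.0       # harmonic mean
    return sp, sr, f1


# Invented counts, for illustration only
sp, sr, f1 = spam_metrics(spam_as_spam=90, legit_as_spam=10, spam_as_legit=30)
```

Because F1 is a harmonic mean, a filter cannot score well by maximizing only one of the two measures.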
15 4. Experiments and result(4/4)
16 5. Conclusion Both neural network-based algorithms usually perform better than the Bayes-based one. The LVQ-based method classifies spam e-mails into several subclasses by content, so that the feature words of each spam subclass are more closely related and the characteristics of each subclass are easier to identify.