Neural Text Categorizer for Exclusive Text Categorization
Journal of Information Processing Systems, Vol. 4, No. 2, June 2008
Taeho Jo*
Presenter: 林昱志
Outline
Introduction
Related Work
Method
Experiment
Conclusion
Introduction
Two types of approaches to text categorization
Rule based - rules are defined manually in the form of if-then-else statements
Advantage
1) High precision
Disadvantages
1) Poor recall
2) Poor flexibility
Introduction
Machine learning - learns from sample labeled documents
Advantage
1) Much higher recall
Disadvantages
1) Slightly lower precision than the rule-based approach
2) Poor flexibility
Introduction
This work focuses on machine learning based approaches, setting rule-based ones aside
All raw data must be encoded into numerical vectors
Encoding documents this way leads to two main problems, illustrated in the sketch below
1) Huge dimensionality
2) Sparse distribution
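A minimal sketch of why both problems arise, assuming a plain bag-of-words encoding (this toy corpus and helper are illustrative, not from the paper):

```python
from collections import Counter

# Hypothetical miniature corpus; real corpora contain tens of thousands of distinct words.
corpus = [
    "the cat sat on the mat",
    "stocks fell as markets reacted to the report",
    "the team won the final match",
]

# Vocabulary = every distinct word in the corpus; its size is the vector dimensionality.
vocabulary = sorted({w for doc in corpus for w in doc.split()})

def encode(doc: str) -> list[int]:
    """Encode a document as a word-frequency vector over the whole vocabulary."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocabulary]

vec = encode("the cat sat")
print(len(vec))                       # dimensionality grows with vocabulary size
print(sum(1 for v in vec if v == 0))  # most entries are zero: sparse distribution
```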
Introduction
Proposes two remedies
1) String vector - provides more transparency in classification
2) NTC (Neural Text Categorizer) - classifies documents with sufficient robustness
Together they solve the huge-dimensionality problem
Related Work
Machine learning algorithms applied to text categorization
1) KNN (K Nearest Neighbor)
2) NB (Naïve Bayes)
3) SVM (Support Vector Machine)
4) BP (Back Propagation)
Related Work
KNN was evaluated by Sebastiani in 2002 as a simple algorithm competitive with SVM
Disadvantage
1) Classifying each object is very time-consuming, since the input must be compared against all training examples
Related Work
Mladenic and Grobelnik (1999) evaluated feature selection methods within an application of NB
Androutsopoulos (2000) used NB to implement a spam mail filtering system, a real system based on text categorization
NB still requires encoding documents into numerical vectors
Related Work
SVM has become more popular than the KNN and NB machine learning algorithms
Defines a hyperplane as the boundary between classes
Applicable only to linearly separable distributions of training examples
Optimizes the weights of the inner products between training examples and the input vector, called Lagrange multipliers
Related Work
Defines two hyperplanes as the boundary between two classes with a maximal margin (Figure 1)
Figure 1. Two hyperplanes separating two classes with a maximal margin
Related Work
Advantage
1) Tolerant of the huge dimensionality of numerical vectors
Disadvantages
1) Applicable only to binary classification
2) Fragile representation of documents as numerical vectors
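For contrast with the paper's approach, a hedged sketch of an SVM text classifier; the use of scikit-learn and this toy data are assumptions of the example, not the paper's setup:

```python
# A toy linear-SVM text classifier over TF-IDF numerical vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["buy cheap stocks now", "stocks rally on earnings",
        "the team won the match", "a great match for the home team"]
labels = ["finance", "finance", "sports", "sports"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # sparse, high-dimensional numerical vectors

clf = LinearSVC()             # maximal-margin hyperplane between the two classes
clf.fit(X, labels)
print(clf.predict(vec.transform(["stocks rally after the match"])))
```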
Related Work
Ruiz and Srinivasan (2002) used a hierarchical combination of BPs, called HME (Hierarchical Mixture of Experts), instead of a single BP
They observed that HME is the better combination of BPs
Disadvantages
1) Very time-consuming and slow
2) Not practical
Study Aim
Two problems
1) Huge dimensionality
2) Sparse distribution
Two proposed methods
1) String vectors
2) A new neural network (NTC)
Method
Numerical Vectors
Figure 2. [Encoding documents into numerical vectors]
Method
Each word is weighted by TF-IDF:
weight(w_k) = tf(w_k) * log( N / df(w_k) )
tf(w_k): frequency of the word w_k in the document
N: total number of documents in the corpus
df(w_k): number of documents in the corpus that include the word w_k
Figure 3. [TF-IDF weight formula]
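A small sketch of this weighting; the slide does not fix the log base, so base 10 here is an assumption:

```python
import math

def tfidf(tf: int, n_docs: int, df: int) -> float:
    """TF-IDF weight: term frequency scaled by inverse document frequency."""
    return tf * math.log10(n_docs / df)

# A word appearing 3 times in a document, found in 100 of 20,000 corpus documents:
print(tfidf(tf=3, n_docs=20000, df=100))  # high for words frequent in the document but rare in the corpus
```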
Method
Encoding a document into its string vector
Figure 4. [Encoding a document into a string vector]
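A sketch of one plausible encoding, assuming the string vector holds a document's d highest-weighted words in descending order (this selection rule is an assumption; the paper's Figure 4 defines the exact scheme):

```python
import math
from collections import Counter

def string_vector(doc: str, n_docs: int, df: dict[str, int], d: int = 5) -> list[str]:
    """Encode a document as a vector of words (strings), not numbers:
    the d words with the highest TF-IDF weights, in descending order."""
    tf = Counter(doc.split())
    weights = {w: c * math.log10(n_docs / df.get(w, 1)) for w, c in tf.items()}
    return sorted(weights, key=weights.get, reverse=True)[:d]

# Hypothetical document-frequency table for a 20,000-document corpus.
df = {"the": 19000, "match": 800, "team": 600, "won": 400}
print(string_vector("the team won the match", 20000, df, d=3))
```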
Method
Text Categorization Systems
Proposed neural network (NTC)
Consists of three layers
1) Input layer
2) Output layer
3) Learning layer
Method
Input layer - each node corresponds to a word in the string vector
Learning layer - nodes correspond to the predefined categories
Output layer - nodes generate categorical scores and also correspond to the predefined categories
Figure 5. [NTC architecture: input, learning, and output layers]
Method
A string vector is denoted by x = [t_1, t_2, ..., t_d], where t_i is a word, 1 ≤ i ≤ d
The predefined categories are denoted by C = {c_1, c_2, ..., c_|C|}, with c_j, 1 ≤ j ≤ |C|
w_ji denotes the weight between input node i and the learning-layer node of category c_j (defined in Figure 6)
Figure 6. [Definition of the weight w_ji]
Method
O_j: output of the node corresponding to category c_j
Its value expresses the membership of the given input vector x in category c_j
Figure 7. [Formula for the categorical score O_j]
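A minimal sketch of one plausible scoring rule, assuming O_j sums the learned weights of the input words (the exact formula is in Figure 7; the summation form, categories, and numbers here are assumptions):

```python
# weights[category][word] plays the role of w_ji; missing words contribute 0.
weights = {
    "sports":  {"team": 0.9, "match": 0.8, "won": 0.7},
    "finance": {"stocks": 0.9, "market": 0.8},
}

def categorical_score(x: list[str], category: str) -> float:
    """O_j: membership of string vector x in category c_j."""
    w = weights[category]
    return sum(w.get(word, 0.0) for word in x)

x = ["team", "won", "match"]   # a string vector
scores = {c: categorical_score(x, c) for c in weights}
print(max(scores, key=scores.get))  # classify into the highest-scoring category
```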
Method
Each string vector in the training set has its own target label, c_j
If the classified category c_k is identical to the target category c_j, no weight update is needed
Figure 8. [NTC weight update rule]
Method
Otherwise, the weights of the misclassified category are inhibited
This minimizes the classification error
Figure 9. [Weight inhibition for the misclassified category]
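A toy sketch of this error-driven rule, assuming a fixed learning rate and the summation score above (both assumptions; the paper's exact updates are in Figures 8 and 9):

```python
LEARNING_RATE = 0.1  # assumed constant; the paper's schedule may differ

def train_step(x: list[str], target: str,
               weights: dict[str, dict[str, float]]) -> None:
    """Error-driven update: on a misclassification, reinforce the target
    category's weights for the input words and inhibit the weights of the
    wrongly predicted category."""
    scores = {c: sum(w.get(t, 0.0) for t in x) for c, w in weights.items()}
    predicted = max(scores, key=scores.get)
    if predicted == target:
        return  # correct classification: no update
    for word in x:
        weights[target][word] = weights[target].get(word, 0.0) + LEARNING_RATE
        weights[predicted][word] = weights[predicted].get(word, 0.0) - LEARNING_RATE

weights = {"sports": {}, "finance": {"won": 0.5}}
train_step(["team", "won", "match"], "sports", weights)
print(weights["sports"])  # reinforced, because "finance" was wrongly predicted
```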
Experiment
The five approaches are evaluated on a test bed called '20NewsGroups'
The test bed consists of 20 categories and 20,000 documents
Each category contains an identical number of test documents
Performance is reported with both micro-averaging and macro-averaging
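A brief sketch of the two averaging schemes, using per-category precision on hypothetical counts (the paper may average a different measure):

```python
# Per-category (true positives, false positives) - hypothetical numbers.
per_category = {"cat1": (80, 20), "cat2": (10, 40), "cat3": (60, 10)}

# Macro-average: compute precision per category, then take the plain mean,
# so every category counts equally regardless of size.
macro = sum(tp / (tp + fp) for tp, fp in per_category.values()) / len(per_category)

# Micro-average: pool all decisions first, then compute one precision,
# so large categories dominate.
tp_sum = sum(tp for tp, _ in per_category.values())
fp_sum = sum(fp for _, fp in per_category.values())
micro = tp_sum / (tp_sum + fp_sum)

print(f"macro={macro:.3f} micro={micro:.3f}")
```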
Experiment
With decomposition of the task, back propagation is the best approach and NB is the worst
Figure 10. Evaluation of the five text classifiers on 20NewsGroups with decomposition
Experiment
Without decomposition, the classifier answers each test document with one of the 20 categories
Two groups emerge
1) Better group - BP and NTC
2) Worse group - NB and KNN
Figure 11. Evaluation of the five text classifiers on 20NewsGroups without decomposition
Conclusion
A full inverted index was used as the basis for operations on string vectors, instead of a restricted-size similarity matrix
Note the trade-off between these two bases for operations on string vectors
NB and BP could be modified into versions adaptable to string vectors, but this may not suffice for other algorithms
Future research: modifying other machine learning algorithms to operate on string vectors