
1 Neural Text Categorizer for Exclusive Text Categorization, Journal of Information Processing Systems, Vol. 4, No. 2, June 2008, Taeho Jo*. Presenter: 林昱志

2 Outline
• Introduction
• Related Work
• Method
• Experiment
• Conclusion

3 Introduction
• Two types of approaches to text categorization
1) Rule-based: classification rules are defined manually in if-then-else form (a sketch follows)
• Advantage: high precision
• Disadvantages: poor recall; poor flexibility
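A minimal sketch of the rule-based idea; the keywords and categories are illustrative, not from the paper:

```python
def rule_based_classify(document: str) -> str:
    """Hand-written if-then-else rules; every rule must be authored and maintained manually."""
    text = document.lower()
    if "goalkeeper" in text or "penalty" in text:
        return "sports"      # a matching rule fires: high precision
    elif "parliament" in text or "election" in text:
        return "politics"
    return "unknown"         # no rule fires: poor recall
```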

4 Introduction
2) Machine learning: classifiers are trained from sample labeled documents
• Advantage: much higher recall
• Disadvantages: slightly lower precision than rule-based; poor flexibility

5 Introduction
• This paper focuses on the machine-learning-based approach, setting the rule-based approach aside
• All raw documents must first be encoded into numerical vectors
• Encoding documents this way leads to two main problems: 1) huge dimensionality; 2) sparse distribution (both sketched below)
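To make the two problems concrete, here is a hypothetical bag-of-words encoding: the vector has one dimension per vocabulary word, so it grows with the corpus, and each document fills only a few positions:

```python
from collections import Counter

corpus = ["neural text categorizer", "support vector machine", "naive bayes text filter"]
vocabulary = sorted({w for doc in corpus for w in doc.split()})  # one dimension per word

def encode(doc: str) -> list[int]:
    counts = Counter(doc.split())
    return [counts[w] for w in vocabulary]  # mostly zeros: sparse distribution

print(len(vocabulary), encode("neural text categorizer"))
```

On a real corpus the vocabulary runs to tens of thousands of words, so each vector is both huge and almost entirely zero.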

6 Introduction
• Proposes two remedies:
1) String vectors, which provide more transparency in classification
2) NTC (Neural Text Categorizer), which classifies documents with sufficient robustness and solves the huge-dimensionality problem

7 Related Work
• Machine learning algorithms applied to text categorization:
1) KNN (K Nearest Neighbor)
2) NB (Naïve Bayes)
3) SVM (Support Vector Machine)
4) BP (Back Propagation)

8 Related Work
• Sebastiani (2002) evaluated KNN as a simple algorithm competitive with the Support Vector Machine
• Disadvantage: classifying each object is very time-consuming, as sketched below
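A sketch of why classification is slow: every test document must be compared against the entire training set before the k nearest neighbors can vote (cosine similarity is an assumption here; the slide does not fix the metric):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(x, training, k=3):
    # O(|training|) similarity computations for every single test document
    neighbors = sorted(training, key=lambda ex: cosine(x, ex[0]), reverse=True)[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)  # majority vote among the k neighbors
```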

9 Related Work
• Mladenic and Grobelnik (1999) evaluated feature selection methods within an application of NB
• Androutsopoulos (2000) used NB to implement a spam mail filtering system, a real system based on text categorization
• Both require encoding documents into numerical vectors

10 Related Work
• SVM has become more popular than the KNN and NB machine learning algorithms
• Defines a hyperplane as the boundary between classes
• Directly applicable only to linearly separable distributions of training examples
• Optimizes the weights of the inner products between training examples and the input vector, called Lagrange multipliers
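The Lagrange multipliers the slide mentions appear in the standard SVM decision function; a sketch, with the multipliers assumed to come from a training step that is omitted here:

```python
def svm_decide(x, support_vectors, alphas, labels, bias, kernel) -> int:
    """f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b); inherently binary."""
    score = sum(a * y * kernel(sv, x)
                for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if score + bias >= 0 else -1

def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))  # plain inner product
```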

11 Related Work
• Defines two hyperplanes as the boundary between two classes with a maximal margin (Figure 1.)

12 Related Work
• Advantage: tolerant of the huge dimensionality of numerical vectors
• Disadvantages: 1) applicable only to binary classification; 2) fragile in representing documents as numerical vectors

13 Related Work
• Ruiz and Srinivasan (2002) used a hierarchical combination of BPs, called HME (Hierarchical Mixture of Experts), instead of a single BP
• Observed that HME is the better combination of BPs
• Disadvantages: 1) costly and slow to train; 2) not practical

14 Study Aim
• Two problems: 1) huge dimensionality; 2) sparse distribution
• Two proposed methods: 1) string vectors; 2) a new neural network (NTC)

15 Method
• Numerical vectors (Figure 2.)

16 Method
• Each word w_k is weighted as weight(w_k) = tf_k × log(N / df_k), where
  tf_k: frequency of the word w_k
  N: total number of documents in the corpus
  df_k: number of documents in the corpus that include the word
(Figure 3.)
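A direct computation of that weight, assuming the standard TF-IDF form implied by the definitions above:

```python
import math

def tfidf_weight(tf_k: int, N: int, df_k: int) -> float:
    """weight(w_k) = tf_k * log(N / df_k)."""
    return tf_k * math.log(N / df_k)

# e.g. a word occurring 3 times, in a 20,000-document corpus, found in 500 documents
print(tfidf_weight(3, 20_000, 500))  # ~11.07
```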

17 Method
• Encoding a document into its string vector (Figure 4.)
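A string vector represents a document as an ordered list of words rather than numbers; a sketch assuming the words are ranked by the weight defined on the previous slide (the exact construction is in the paper's Figure 4):

```python
def to_string_vector(doc: str, weights: dict[str, float], d: int = 10) -> list[str]:
    """Represent a document as its d highest-weighted words: x = [t1, t2, ..., td]."""
    words = set(doc.split())
    ranked = sorted(words, key=lambda w: weights.get(w, 0.0), reverse=True)
    return ranked[:d]
```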

18 Method
• Text categorization system: the proposed neural network (NTC)
• Consists of three layers: 1) input layer; 2) output layer; 3) learning layer

19 Method
• Input layer: each node corresponds to a word in the string vector
• Learning layer: nodes correspond to the predefined categories
• Output layer: generates the categorical scores; its nodes also correspond to the predefined categories
(Figure 5.)

20 Method
• A string vector is denoted by x = [t_1, t_2, ..., t_d], with words t_i, 1 ≤ i ≤ d
• The predefined categories are denoted by C = [c_1, c_2, ..., c_|C|], 1 ≤ j ≤ |C|
• w_ji denotes the weight connecting the learning-layer node of category c_j to the input node of word t_i (Figure 6.)

21 Method
• O_j: output node corresponding to the category c_j
• Its value expresses the membership of the given input vector x in the category c_j (Figure 7.)
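A sketch of this forward pass, assuming (as the architecture suggests) that each categorical score sums the weights linking the input words to that category; the paper's exact formula is in Figure 7:

```python
def ntc_scores(x: list[str], weights: dict, categories: list[str]) -> dict:
    """O_j = sum over the words t_i in x of w_ji."""
    return {c: sum(weights.get((c, t), 0.0) for t in x) for c in categories}

def ntc_classify(x, weights, categories):
    scores = ntc_scores(x, weights, categories)
    return max(scores, key=scores.get)  # the category with the highest score
```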

22 Method
• Each string vector in the training set has its own target label, c_j
• If its classified category, c_k, is identical to the target category, c_j, the weights are left unchanged (Figure 8.)

23 Method
• Otherwise, the weights toward the misclassified category are inhibited
• This minimizes the classification error (Figure 9.)
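A sketch of one training step under these two rules, reusing ntc_classify from the previous sketch; the learning rate eta and the exact update magnitudes are assumptions, and the paper's precise rule is in Figures 8 and 9:

```python
def ntc_train_step(x, target, weights, categories, eta=0.1):
    predicted = ntc_classify(x, weights, categories)
    if predicted == target:
        return  # correctly classified: weights are left unchanged
    for t in x:
        weights[(target, t)] = weights.get((target, t), 0.0) + eta        # reinforce target
        weights[(predicted, t)] = weights.get((predicted, t), 0.0) - eta  # inhibit misclassified
```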

24 Experiment
• The five approaches are evaluated on a test bed called 20NewsGroups
• The test bed consists of 20 categories and 20,000 documents; each category contains an identical number of test documents
• Results are aggregated with both micro-averaging and macro-averaging (the difference is sketched below)
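The difference between the two aggregation methods, sketched for precision (the per-category counts would come from the classifier's confusion matrix):

```python
def micro_macro_precision(per_category):
    """per_category: list of (true_positives, false_positives), one pair per category."""
    tp = sum(t for t, _ in per_category)
    fp = sum(f for _, f in per_category)
    micro = tp / (tp + fp)  # pool all decisions, then compute precision once
    macro = sum(t / (t + f) for t, f in per_category) / len(per_category)  # average per-category values
    return micro, macro
```

Micro-averaging is dominated by large categories, while macro-averaging weights every category equally; since the 20NewsGroups categories are the same size, the two tend to agree here.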

25 Experiment
• With decomposition of the task, back propagation is the best approach and NB is the worst
(Figure 10. Evaluation of the five text classifiers on 20NewsGroups with decomposition)

26 Experiment
• Without decomposition, the classifier answers each test document with one of the 20 categories
• Two groups emerge: 1) better group: BP and NTC; 2) worse group: NB and KNN
(Figure 11. Evaluation of the five text classifiers on 20NewsGroups without decomposition)

27 Conclusion
• Used a full inverted index, instead of a restricted-size similarity matrix, as the basis for the operations on string vectors
• Note the trade-off between the two bases for operations on string vectors
• NB and BP are considered for modification into versions adaptable to string vectors, but this may be insufficient for modifying the others
• Future research: modifying other machine learning algorithms to operate on string vectors

