Intelligent Database Systems Lab. Presenter: YU-TING LU. Author: Harun Uğuz (KBS, 2011). Title: A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm
Outline: Motivation, Objectives, Methodology, Experiments, Conclusions, Comments
Motivation: A major problem in text categorization is the very large number of features. Most of these features are irrelevant noise that can mislead the classifier.
Objectives: Two-stage feature selection and feature extraction are used to improve the performance of text categorization.
Methodology
Methodology – pre-processing
– Removal of stop-words: a, an, and, because, can, do, every, the…
– Stemming: computer, computing, computation, computes → comput
– Term weighting: weights relate the terms of the document collection to the documents
– Pruning: remove the words that appear fewer than two times in the documents
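The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the stop-word list contains only the examples shown on the slide, `toy_stem` is a hypothetical suffix stripper standing in for a real stemmer, pruning is assumed to count occurrences over the whole collection, and term weighting is left out.

```python
from collections import Counter

# Illustrative stop-word list (assumption: only the examples shown on the slide).
STOP_WORDS = {"a", "an", "and", "because", "can", "do", "every", "the"}

# Toy suffix list standing in for a real stemmer (assumption, not from the slides).
SUFFIXES = ("ation", "ing", "es", "er", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(documents, min_count=2):
    """Tokenize, drop stop-words, stem, then prune terms seen fewer than min_count times."""
    tokenized = [
        [toy_stem(w) for w in doc.lower().split() if w not in STOP_WORDS]
        for doc in documents
    ]
    # Count every term over the whole collection (assumed reading of the pruning rule).
    counts = Counter(term for doc in tokenized for term in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]


# Hypothetical three-document collection.
docs = ["the computer computes", "every computation and computing", "a grain trade"]
cleaned = preprocess(docs)
```

In this toy collection all four "comput…" variants collapse to one stem, while "grain" and "trade" occur only once each and are pruned.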
Methodology – feature ranking with information gain: Each term in the text is scored with the information gain (IG) method, and the terms are ranked in decreasing order of their importance for classification.
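As a rough sketch of IG ranking, the snippet below uses the standard definition of information gain for a term: the entropy of the class labels minus the expected entropy after splitting the documents on whether the term is present. The toy corpus, the function names, and the set-of-terms document representation are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Class entropy minus expected entropy after splitting on term presence."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) + (
        len(without_term) / n
    ) * entropy(without_term)
    return entropy(labels) - conditional


def rank_terms(docs, labels):
    """Return (term, IG) pairs sorted by decreasing IG."""
    vocabulary = sorted({t for d in docs for t in d})
    scores = [(t, information_gain(docs, labels, t)) for t in vocabulary]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus: each document is a set of terms.
docs = [{"oil", "price"}, {"oil", "barrel"}, {"wheat", "price"}, {"wheat", "crop"}]
labels = ["crude", "crude", "grain", "grain"]
ranking = rank_terms(docs, labels)
```

Here "oil" and "wheat" separate the two classes perfectly and rank first, while "price" appears equally in both classes and ranks last with zero gain.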
Methodology – dimension reduction methods
– Principal component analysis: reduces m features to p components, p ≤ m
– Genetic algorithm for feature selection: individual's encoding, fitness function, selection, crossover, mutation
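The genetic algorithm components listed above can be sketched as below. This is an illustration under stated assumptions rather than the paper's configuration: individuals are encoded as bit strings (bit i set means feature i is kept), selection is a binary tournament, crossover is single-point, mutation flips bits, and `toy_fitness` is a hypothetical placeholder. A real fitness function would score the feature subset, for example by classifier performance.

```python
import random


def run_ga(n_features, fitness, pop_size=20, generations=30,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Evolve bit-string masks; bit i == 1 means feature i is kept."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament():
        # Binary tournament selection (assumption; the slide only says "Selection").
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                # Single-point crossover: swap the tails of the two parents.
                cut = rng.randint(1, n_features - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            for child in (c1, c2):
                # Bit-flip mutation: each bit flips with probability mutation_rate.
                mutated = [1 - bit if rng.random() < mutation_rate else bit
                           for bit in child]
                offspring.append(mutated)
        population = offspring[:pop_size]
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    return best


def toy_fitness(mask):
    """Hypothetical fitness: features 0-4 count as useful, the rest are penalised."""
    return sum(mask[:5]) - 0.5 * sum(mask[5:])


best_mask = run_ga(n_features=12, fitness=toy_fitness)
```

The fixed seed makes the run reproducible; the population size, rates, and number of generations are arbitrary illustrative values.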
Methodology – text categorization methods: KNN classifier, C4.5 decision tree classifier
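For orientation, a k-nearest-neighbour classifier can be sketched in a few lines. The Euclidean distance, the value of k, and the two-dimensional toy vectors are assumptions for illustration; the slides do not state which distance or k the paper uses. The C4.5 decision tree is not sketched here.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Classify query by majority vote among the k nearest training vectors."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical reduced feature vectors with their categories.
train_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
train_labels = ["earn", "earn", "trade", "trade"]

prediction_a = knn_predict(train_vectors, train_labels, (0.8, 0.2))
prediction_b = knn_predict(train_vectors, train_labels, (0.2, 0.8))
```

With k = 3, the first query has two "earn" neighbours and one "trade" neighbour, so it is labelled "earn"; the second query is the mirror case.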
Methodology – evaluation of the performance: precision, recall, F-measure
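The three measures can be computed per category from the counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions and assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the function name and the toy labels are illustrative.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall and F1 for one category treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical predictions for five documents.
true_labels = ["earn", "earn", "earn", "trade", "trade"]
predicted = ["earn", "earn", "trade", "earn", "trade"]
precision, recall, f1 = precision_recall_f1(true_labels, predicted, "earn")
```

For the "earn" category this toy example has two true positives, one false positive, and one false negative, so all three measures equal 2/3.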
Experiments – datasets
Reuters dataset
  Category name    Number of documents
  Earn             3743
  Acquisition      2179
  Money-fx         633
  Crude            561
  Grain            542
  Trade            500
Classic3 dataset
  Category name    Number of documents
  CRANFIELD        1398
  MEDLINE          1033
  CISI             1460
Experiments – Reuters: After pre-processing, the document-term matrix has dimension 8158 × 7542 (documents × terms).
Experiments – Reuters-21578
Experiments – Classic3: After pre-processing, the document-term matrix has dimension 3891 × 6679 (documents × terms).
Experiments – Classic3
Conclusions: With both the C4.5 decision tree and KNN classifiers, text categorization using the smaller feature sets selected via IG-PCA and IG-GA is more successful than categorization using features selected via IG alone. Two-stage feature selection methods can improve the performance of text categorization.
Comments. Advantages: helps in understanding the basic methods. Applications: text categorization.