Intelligent Database Systems Lab Presenter : YU-TING LU Authors : Harun Uğuz 2011.KBS A two-stage feature selection method for text categorization by.


1 Intelligent Database Systems Lab
Presenter: YU-TING LU
Authors: Harun Uğuz
2011, KBS
A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm

2 Outline
– Motivation
– Objectives
– Methodology
– Experiments
– Conclusions
– Comments

3 Motivation
A major problem in text categorization is the very large number of features. Most of these features are irrelevant noise that can mislead the classifier.

4 Objectives
Two-stage feature selection and feature extraction are used to improve the performance of text categorization.

5 Methodology

6 Methodology – pre-processing
– Removal of stop-words: a, an, and, because, can, do, every, the…
– Stemming: computer, computing, computation, computes → comput
– Term weighting (figure labels: terms of the document collection, documents)
– Pruning of words: prune the words that appear fewer than two times in the documents.
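The four pre-processing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is the short sample from the slide, `crude_stem` is a toy suffix stripper (not the stemmer used in the paper, which the slides do not name), term weighting is plain term frequency, and pruning counts occurrences over the whole collection. All function names are hypothetical.

```python
import re
from collections import Counter

# Sample stop-words taken from the slide; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "because", "can", "do", "every", "the"}


def crude_stem(word):
    """Toy suffix stripper (an assumption, NOT a real stemming algorithm)."""
    for suffix in ("ation", "ing", "es", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(documents, min_count=2):
    """Return (vocabulary, document-term matrix) for a list of raw texts."""
    tokenized = []
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        # Stop-word removal followed by stemming.
        tokenized.append([crude_stem(t) for t in tokens if t not in STOP_WORDS])

    # Pruning: drop terms that occur fewer than min_count times overall.
    totals = Counter(t for doc in tokenized for t in doc)
    vocabulary = sorted(t for t, n in totals.items() if n >= min_count)

    # Term weighting: raw term frequency, one row per document.
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[t] for t in vocabulary])
    return vocabulary, matrix
```

Each row of the returned matrix corresponds to one document and each column to one surviving term, which is the layout of the document-term matrix used in the experiments.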

7 Methodology – feature ranking with information gain
Each term in the text is ranked in decreasing order of its importance for classification, using the information gain (IG) method.
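A minimal sketch of IG ranking, assuming the common definition of information gain for a term as the class entropy minus the entropy conditioned on the term's presence or absence. The slides do not give the formula, so this definition and the names `entropy`, `information_gain` and `rank_terms` are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(presence, labels):
    """IG(t) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)]."""
    with_term = [c for p, c in zip(presence, labels) if p]
    without_term = [c for p, c in zip(presence, labels) if not p]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def rank_terms(matrix, vocabulary, labels):
    """Return (term, IG) pairs sorted by decreasing IG."""
    scores = []
    for j, term in enumerate(vocabulary):
        # A term counts as present when its weight in the document is positive.
        presence = [row[j] > 0 for row in matrix]
        scores.append((term, information_gain(presence, labels)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A term that appears in every document of one category and in no document of the others receives the highest score; a term spread evenly over all categories scores zero.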

8 Methodology – dimension reduction methods
– Principal component analysis: p ≦ m
– Genetic algorithm for feature selection: individual’s encoding, fitness function, selection, crossover, mutation (example bit strings: 11011, 00110, 01110, 11110)
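The genetic-algorithm components named on this slide can be sketched as follows. Each individual is a bit string in which bit i = 1 keeps feature i. The specific operators (binary tournament selection, one-point crossover, bit-flip mutation), the parameter values, and the toy fitness function are all assumptions chosen for illustration; the slides do not state which operators or fitness function the paper uses.

```python
import random


def run_ga(fitness, n_features, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Minimal generational GA over bit strings; returns the best one seen."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def select():
        # Binary tournament selection (assumed operator).
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_population = []
        while len(next_population) < pop_size:
            parent1, parent2 = select(), select()
            child1, child2 = parent1[:], parent2[:]
            # One-point crossover (assumed operator).
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)
                child1 = parent1[:cut] + parent2[cut:]
                child2 = parent2[:cut] + parent1[cut:]
            # Bit-flip mutation (assumed operator).
            for child in (child1, child2):
                for i in range(n_features):
                    if rng.random() < mutation_rate:
                        child[i] = 1 - child[i]
                next_population.append(child)
        population = next_population[:pop_size]
        best = max(population + [best], key=fitness)
    return best


def toy_fitness(bits):
    """Hypothetical fitness: reward features 0 and 2, penalize subset size."""
    return bits[0] + bits[2] - 0.1 * sum(bits)


best_subset = run_ga(toy_fitness, n_features=5)
```

In a real feature-selection setting the fitness function would typically score a candidate subset by the quality of a classifier trained on it, which is far more expensive than the toy function used here.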

9 Methodology – text categorization methods
– KNN classifier
– C4.5 decision tree classifier
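As an illustration of the first classifier, here is a minimal KNN sketch. It assumes cosine similarity between document vectors and majority voting among the k nearest training documents; the slides do not state the similarity measure or the value of k, so both are assumptions, as are the function names.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query by majority vote among its k most similar training vectors."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: cosine_similarity(pair[0], query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because every prediction compares the query against all training documents, the cost of KNN grows with the number of features, which is one reason dimension reduction matters for this classifier.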

10 Methodology – evaluation of the performance
– Precision
– Recall
– F-measure
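The three measures can be computed per category from true and predicted labels. This sketch assumes the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and the balanced F-measure 2PR / (P + R). The slide's formulas did not survive extraction, so the balanced form is an assumption, and the function name is hypothetical.

```python
def precision_recall_f(true_labels, predicted_labels, category):
    """Return (precision, recall, F-measure) for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```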

11 Experiments – datasets
– Reuters-21578 dataset
    Category name    Number of documents
    Earn             3743
    Acquisition      2179
    Money-fx         633
    Crude            561
    Grain            542
    Trade            500
– Classic3 dataset
    Category name    Number of documents
    CRANFIELD        1398
    MEDLINE          1033
    CISI             1460

12 Experiments – Reuters-21578
At the end of pre-processing, a document-term matrix of dimension 8158 × 7542 (documents × terms) is obtained.

13 Experiments – Reuters-21578

14 Experiments – Classic3
At the end of pre-processing, a document-term matrix of dimension 3891 × 6679 (documents × terms) is obtained.

15 Experiments – Classic3

16 Conclusions
– With both the C4.5 decision tree and KNN classifiers, text categorization using the smaller feature sets selected via IG-PCA and IG-GA is more successful than text categorization using features selected via IG alone.
– Two-stage feature selection methods can improve the performance of text categorization.

17 Comments
– Advantages: understand the basic methods
– Applications: text categorization

