Published by Amarion Hearl. Modified over 9 years ago.
1
Intelligent Database Systems Lab
A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm
Presenter: YU-TING LU
Authors: Harun Uğuz
2011, KBS
2
Outline
– Motivation
– Objectives
– Methodology
– Experiments
– Conclusions
– Comments
3
Motivation
A major problem in text categorization is the large number of features. Most of these features are irrelevant noise that can mislead the classifier.
4
Objectives
Two-stage feature selection and feature extraction are used to improve the performance of text categorization.
5
Methodology
6
Methodology – pre-processing
– Removal of stop-words: a, an, and, because, can, do, every, the…
– Stemming: computer, computing, computation, computes → comput
– Term weighting (figure: document-term matrix; axis labels: terms of the document collection, documents)
– Pruning of the words: prune the words that appear fewer than two times in the documents.
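The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation. The stop-word set contains only the slide's examples, stemming is omitted because a real stemmer would be needed, "fewer than two times" is assumed to mean total occurrences across the collection, and raw term counts stand in for whatever term weighting the paper uses. The function name `preprocess` and its parameters are hypothetical.

```python
from collections import Counter

# Assumption: only the example stop-words shown on the slide.
STOP_WORDS = {"a", "an", "and", "because", "can", "do", "every", "the"}


def preprocess(raw_docs, min_count=2):
    """Return (vocabulary, document-term matrix) for a list of strings.

    Steps: lowercase and split on whitespace, drop stop-words,
    prune terms whose total count is below min_count, then build
    a matrix with one row per document and one column per term.
    Stemming is intentionally left out of this sketch.
    """
    tokenized = [
        [w for w in doc.lower().split() if w not in STOP_WORDS]
        for doc in raw_docs
    ]
    # Total occurrences of each term over the whole collection.
    totals = Counter(w for doc in tokenized for w in doc)
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # Raw counts as weights (an assumption, see lead-in).
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix
```

For example, `preprocess(["the oil price", "oil and gas", "a price cut"])` keeps only the terms `oil` and `price`, since `gas` and `cut` each occur once.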
7
Methodology – feature ranking with information gain
Each term in the text is ranked in decreasing order of its importance for classification, using the information gain (IG) method.
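The ranking step can be illustrated with the standard textbook definition of information gain: the class entropy minus the expected class entropy after splitting the documents on presence or absence of a term. The sketch below assumes each document is represented as a set of terms; the function names are hypothetical and the slide does not give the exact formula used in the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of a term: H(C) minus the weighted entropy of the two
    groups of documents that do / do not contain the term."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    return (entropy(labels)
            - (len(with_t) / n) * entropy(with_t)
            - (len(without_t) / n) * entropy(without_t))


def rank_terms(docs, labels):
    """All terms sorted by decreasing IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
```

A term that appears in every document of one category and in no document of the others gets the maximum gain, while a term spread evenly over the categories gets a gain of zero.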
8
Methodology – dimension reduction methods
– Principal component analysis: m dimensions are reduced to p dimensions, p ≦ m
– Genetic algorithm for feature selection
  – Individual's encoding
  – Fitness function
  – Selection
  – Crossover
  – Mutation
  (figure: example bit strings 11011, 00110, 01110, 11110)
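The genetic algorithm components listed above can be sketched as follows. The slide names the operators but not their variants, so this sketch makes several assumptions: individuals are bit lists (1 means the feature is kept), selection is a size-2 tournament, crossover is one-point, and mutation flips each bit independently. The `fitness` argument is a placeholder for whatever score is being optimized, and all names and default parameters are hypothetical.

```python
import random


def crossover(a, b, rng):
    """One-point crossover of two equal-length bit lists."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def select(population, fitness, rng, k=2):
    """Tournament selection: best of k randomly drawn individuals."""
    return max(rng.sample(population, k), key=fitness)


def genetic_feature_selection(n_features, fitness, pop_size=20,
                              generations=30, mutation_rate=0.05, seed=0):
    """Return the best bit mask found; requires n_features >= 2."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1 = select(population, fitness, rng)
            p2 = select(population, fitness, rng)
            c1, c2 = crossover(p1, p2, rng)
            children.append(mutate(c1, mutation_rate, rng))
            children.append(mutate(c2, mutation_rate, rng))
        population = children[:pop_size]
        # Keep track of the best individual seen so far.
        best = max(population + [best], key=fitness)
    return best
```

A call such as `genetic_feature_selection(8, fitness=sum)` uses a toy fitness that simply rewards keeping more features; a realistic fitness would score the selected feature subset instead.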
9
Methodology – text categorization methods
– KNN classifier
– C4.5 decision tree classifier
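As a rough illustration of the first classifier, a k-nearest-neighbour prediction can be sketched in a few lines. The slide gives no details, so Euclidean distance, majority voting and k = 3 are assumptions here, and `knn_predict` is a hypothetical name. C4.5 is not sketched because it does not fit in a short example.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among the k training
    vectors closest to x (Euclidean distance, an assumption)."""
    neighbours = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

Each training vector would be one row of the reduced document-term matrix, and each label the category of that document.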
10
Methodology – evaluation of the performance
– Precision
– Recall
– F-measure
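The formulas on this slide did not survive in the transcript. The sketch below uses the usual per-category definitions: precision is the share of documents assigned to a category that truly belong to it, recall is the share of the category's documents that were found, and the F-measure is assumed to be their harmonic mean. The function name is hypothetical.

```python
def precision_recall_f(true_labels, predicted_labels, category):
    """Precision, recall and F-measure for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall (assumed F-measure form).
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure
```

With two true positives, one false positive and one false negative, all three values come out as 2/3.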
11
Experiments – datasets
– Reuters-21578 dataset
  Category name    Number of documents
  Earn             3743
  Acquisition      2179
  Money-fx         633
  Crude            561
  Grain            542
  Trade            500
– Classic3 dataset
  Category name    Number of documents
  CRANFIELD        1398
  MEDLINE          1033
  CISI             1460
12
Experiments – Reuters-21578
At the end of pre-processing, a document-term matrix of dimension 8158 × 7542 is obtained.
13
Experiments – Reuters-21578
14
Experiments – Classic3
At the end of pre-processing, a document-term matrix of dimension 3891 × 6679 is obtained.
15
Experiments – Classic3
16
Conclusions
With both the C4.5 decision tree and KNN algorithms, text categorization using the fewer features selected via IG-PCA and IG-GA is more successful than categorization using the features selected via IG alone.
Two-stage feature selection methods can improve the performance of text categorization.
17
Comments
– Advantages: understand the basic methods
– Applications: text categorization