META-LEARNING FOR AUTOMATIC SELECTION OF ALGORITHMS FOR TEXT CLASSIFICATION
Karol Furdík, Ján Paralič, Gabriel Tutoky
Karol.Furdik@intersoft.sk, {Jan.Paralic, Gabriel.Tutoky}@tuke.sk
Technical University of Košice, Slovakia
September 24-26, 2008, University of Zagreb, Varaždin, Croatia

2/22 Introduction
• Text classification
  - A method for extracting knowledge from textual documents
  - Originally designed as a semi-automatic procedure in which users were responsible for selecting the proper classification settings
  - Most applications, e.g. the KP-Lab project (http://www.kp-lab.org), require fully automated text classification
• Meta-learning
  - Automates the text classification process by automatically selecting the proper algorithms
K. Furdik, J. Paralič, G. Tutoky: Meta-learning for automatic selection of algorithms for text classification CECIIS 2008, University of Zagreb, Varaždin, Croatia, September 24-26, 2008

Theoretical analyses

4/22 Text classification – a two-step process
1. Creation of the classifier: Training set of documents → Preprocessing of documents → Learning of classifier → Classifier
2. Usage of the classifier: Document of unknown category → Preprocessing of current document → Classifier application → Categorized document
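The two steps can be illustrated with a minimal Python sketch. It is not the classifier used by the authors: the whitespace tokenizer, the term-frequency "profile" classifier, and the function names (preprocess, learn_classifier, apply_classifier) are all assumptions chosen only to show where preprocessing, learning, and application fit.

```python
from collections import Counter

def preprocess(text):
    """Lowercase and keep alphabetic tokens (a stand-in for real preprocessing)."""
    return [t for t in text.lower().split() if t.isalpha()]

def learn_classifier(training_set):
    """Step 1: build one term-frequency profile per category (hypothetical learner)."""
    profiles = {}
    for text, category in training_set:
        profiles.setdefault(category, Counter()).update(preprocess(text))
    return profiles

def apply_classifier(profiles, text):
    """Step 2: preprocess the new document, then pick the best-matching category."""
    tokens = preprocess(text)

    def score(category):
        profile = profiles[category]
        total = sum(profile.values())
        # Share of the category's term mass covered by the document's tokens.
        return sum(profile[t] for t in tokens) / total if total else 0.0

    return max(profiles, key=score)

training_set = [
    ("wheat harvest grain prices", "grain"),
    ("grain exports rise", "grain"),
    ("oil barrel crude prices", "crude"),
    ("crude oil output falls", "crude"),
]
classifier = learn_classifier(training_set)
label = apply_classifier(classifier, "Crude oil shipments")
```

The same preprocess function is used in both steps, which mirrors the diagram: the current document must be preprocessed the same way as the training documents before the classifier is applied.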

5/22 Meta-learning, MUDOF algorithm
• MUDOF – Meta-learning Using Document Feature Characteristics
  - Introduced in 2002 by Wai Lam and Kwok-Yin Lai
• Meta-learning targets:
  - Selection of algorithms for classifiers
  - Selection is made at the category level (a different algorithm can be selected for each category)
  - Automation and optimization of the classifier creation process

6/22 Meta-learning – scheme (1/4)
[Diagram labels: Construction of the meta-model; Training set for creation of the meta-model (TM); Values of effectiveness; Testing set of documents (TE); Usage of the meta-model; Meta-model; Classifier]

7/22 Values of effectiveness
• The algorithms A1, ..., An are applied one by one to the categories C1, ..., Cm of the training set
• This yields n × m binary classifiers
• The binary classifiers are evaluated on the testing data collection
• The effectiveness of each algorithm on each category is obtained
• This is the most computationally demanding step of meta-learning
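The n × m evaluation loop can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the two toy "algorithms" (train_keyword, train_always_positive), the data layout, and the use of plain F1 as the effectiveness value are all hypothetical choices.

```python
def f1_score(truth, predicted):
    """F1 of binary predictions; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def train_keyword(train_docs, category):
    """Toy algorithm: predict positive if a word seen only in positive docs occurs."""
    pos, neg = set(), set()
    for text, cats in train_docs:
        (pos if category in cats else neg).update(text.split())
    keywords = pos - neg
    return lambda text: any(w in keywords for w in text.split())

def train_always_positive(train_docs, category):
    """Toy baseline: every document is assigned to the category."""
    return lambda text: True

def effectiveness_matrix(algorithms, categories, train_docs, test_docs):
    """Train one binary classifier per (algorithm, category) pair and score it."""
    matrix = {}
    for name, train in algorithms.items():
        for category in categories:
            predict = train(train_docs, category)
            truth = [category in cats for _, cats in test_docs]
            predicted = [predict(text) for text, _ in test_docs]
            matrix[(name, category)] = f1_score(truth, predicted)
    return matrix

train_docs = [("wheat grain", {"grain"}), ("crude oil", {"crude"})]
test_docs = [("grain prices", {"grain"}), ("oil prices", {"crude"})]
algorithms = {"keyword": train_keyword, "always_positive": train_always_positive}
matrix = effectiveness_matrix(algorithms, ["grain", "crude"], train_docs, test_docs)
```

The nested loop trains and evaluates every algorithm on every category, which is why this step dominates the cost of meta-learning.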

8/22 Meta-learning – scheme (2/4)
[Diagram labels: Construction of the meta-model; Training set for creation of the meta-model (TM); Values of effectiveness; Testing set of documents (TE); Feature characteristics of particular categories; Usage of the meta-model]

9/22 Feature characteristics
• Categories are characterized statistically
• Examples of characteristics:
  - PosTr – ratio of positive to negative instances
  - AvgDocLen – average document length
  - AvgTermVal – average term weight
  - AvgTopInfoGain – average information gain of the best m terms
  - NumInfoGainThres – number of terms whose information gain exceeds a threshold value
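A rough Python sketch of how such characteristics might be computed for one category is given below. Several details are assumptions that the slides do not specify: documents are token lists, AvgDocLen is measured in tokens over the positive documents, information gain is computed on binary term presence, and the parameters top_m and ig_threshold are hypothetical. AvgTermVal is omitted because the term weighting scheme is not stated.

```python
import math

def entropy(pos, neg):
    """Binary entropy in bits of a group with pos positive and neg negative items."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def info_gain(term, docs, labels):
    """Information gain of the presence or absence of a term w.r.t. the labels."""
    def group_entropy(group):
        pos = sum(group)
        return entropy(pos, len(group) - pos)

    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(docs)
    return (group_entropy(labels)
            - len(with_term) / n * group_entropy(with_term)
            - len(without_term) / n * group_entropy(without_term))

def category_features(docs, labels, top_m=2, ig_threshold=0.5):
    """Compute a few feature characteristics of one category (assumed definitions)."""
    pos = sum(labels)
    neg = len(labels) - pos
    pos_docs = [doc for doc, lab in zip(docs, labels) if lab]
    vocabulary = sorted({term for doc in docs for term in doc})
    gains = sorted((info_gain(t, docs, labels) for t in vocabulary), reverse=True)
    top = gains[:top_m]
    return {
        "PosTr": pos / neg if neg else float("inf"),
        "AvgDocLen": sum(len(d) for d in pos_docs) / len(pos_docs) if pos_docs else 0.0,
        "AvgTopInfoGain": sum(top) / len(top) if top else 0.0,
        "NumInfoGainThres": sum(1 for g in gains if g > ig_threshold),
    }

docs = [["wheat", "grain"], ["grain", "exports"], ["crude", "oil"], ["oil", "prices", "fall"]]
labels = [True, True, False, False]  # membership in the hypothetical category "grain"
features = category_features(docs, labels)
```

In this toy data only "grain" and "oil" separate the classes perfectly, so they are the two terms that dominate the information-gain features.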

10/22 Meta-learning – scheme (3/4)
[Diagram labels: Construction of the meta-model; Training set for creation of the meta-model (TM); Values of effectiveness; Testing set of documents (TE); Feature characteristics of particular categories; Usage of the meta-model; Meta-model]

11/22 Meta-model
• Models the relations between the feature characteristics of categories and the effectiveness of algorithms
• The meta-model can be:
  - Prediction (MUDOF_R) – linear regression
  - Classification (MUDOF_K) – k-NN
• Meta-model advantages:
  - Serves as an "engine" for selecting the proper algorithms
  - Can be used for more than one collection of documents
  - In the ideal case, it is sufficient to learn the meta-model only once; it can then be reused for the selection of algorithms
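The classification variant can be pictured with a small k-NN sketch. The exact MUDOF_K formulation is not given on the slide, so the details here are assumptions: each meta-example pairs a category's feature vector with the algorithm that performed best on it, distance is plain Euclidean without feature scaling, and the names A1 and A2, the numbers, and knn_select are purely illustrative.

```python
import math
from collections import Counter

def knn_select(meta_examples, features, k=3):
    """Pick the algorithm that wins the majority vote among the k nearest meta-examples."""
    nearest = sorted(meta_examples, key=lambda ex: math.dist(ex[0], features))[:k]
    votes = Counter(algorithm for _, algorithm in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical meta-examples: (feature vector of a category, best algorithm on it).
meta_examples = [
    ((0.1, 20.0), "A1"),
    ((0.2, 25.0), "A1"),
    ((1.0, 200.0), "A2"),
    ((0.9, 180.0), "A2"),
    ((1.1, 210.0), "A2"),
]
chosen = knn_select(meta_examples, (0.15, 22.0), k=3)
```

In practice the features would likely need normalization, since a characteristic with a large numeric range (such as document length) would otherwise dominate the distance.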

12/22 Meta-learning – scheme (4/4)
[Diagram labels: Construction of the meta-model; Training set for creation of the meta-model (TM); Values of effectiveness; Testing set of documents (TE); Feature characteristics of particular categories; Usage of the meta-model; Training set for creation of the classifier (TC); Feature characteristics of particular categories; Selection of algorithms for particular categories; Meta-model; Learning of classifiers; Classifier]
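The usage phase of the scheme (feature characteristics of each TC category go into the meta-model, which selects an algorithm per category) can be sketched for the regression variant. This is a simplified illustration, not the MUDOF_R implementation: the meta-model is assumed to hold one linear model per algorithm that predicts its effectiveness from the features, and all weights, names, and feature values below are made up.

```python
def predict_effectiveness(weights, bias, features):
    """Linear prediction of an algorithm's effectiveness on a category."""
    return bias + sum(w * x for w, x in zip(weights, features))

def select_algorithms(meta_model, category_features):
    """For each category, choose the algorithm with the highest predicted effectiveness."""
    selection = {}
    for category, features in category_features.items():
        selection[category] = max(
            meta_model,
            key=lambda algo: predict_effectiveness(*meta_model[algo], features),
        )
    return selection

# Hypothetical meta-model: algorithm name -> (weights, bias) of its linear model.
meta_model = {
    "A1": ([0.5, 0.0], 0.2),
    "A2": ([-0.25, 0.0], 0.75),
}
# Hypothetical feature characteristics of two categories from TC.
category_features = {"C1": [1.0, 3.0], "C2": [0.0, 5.0]}
selection = select_algorithms(meta_model, category_features)
```

After this step, the classifier for each category would be learned on TC with the selected algorithm only, so the costly n × m evaluation is not repeated.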

Experiments

14/22 Data description
• Reuters-21578
  - 10 788 documents; 90 categories
  - TM (3815); TC (3961); TE (3019)
  - Unbalanced data
• 20 Newsgroups
  - 19 997 documents; 20 categories
  - TC (10 025); TE (9972)
  - Well-balanced data

Reuters-21578:
No. of categories | Approx. no. of positive instances
2                 | up to 1500
14                | 100 – 550
44                | 10 – 100
30                | <10

20 Newsgroups:
No. of categories | Approx. no. of positive instances
20                | 1000

15/22 Experiment 1 (1/3)
• Tests the meta-learning approach on a single data set (the Reuters text collection)
• Assumption: the training set is divided into:
  - Training set for creation of the meta-model (TM)
  - Training set for creation of the classifier (TC)
• Target:
  - Increase the effectiveness of the final classifier in comparison with the base classifiers

16/22 Experiment 1 (2/3)
• Classifier effectiveness – with F1 optimized measure

17/22 Experiment 1 (3/3)
• Selection of algorithms – over AVERAGE

18/22 Experiment 2 (1/3)
• Tests the usability of the meta-learning approach across two different sets of documents (Reuters & 20 Newsgroups)
• Assumptions:
  - The training set of one data collection is used for creation of the meta-model
  - The training set of the other data collection is used for creation of the classifier
• Targets:
  - Fully automatic selection of algorithms without re-learning the meta-model (a meta-model learned on the other data collection is used)
  - Better effectiveness of the classifier

19/22 Experiment 2 (2/3)
• Classifier effectiveness – with F1 optimized measure

20/22 Experiment 2 (3/3)
• Selection of algorithms – over AVERAGE

21/22 Conclusion
• Advantages of meta-learning
  - Fully automated text categorization: the selection of algorithms is automatic
  - Increased effectiveness of the final classifier (on one data collection)
  - One meta-model is usable for various data collections
• Disadvantages of meta-learning
  - Requires large computing and time capacity

Thank you for your attention
