1 UNIVERSITA DELLA CALABRIA, Dipartimento di MATEMATICA
A Genetic Algorithm for Text Classification Rule Induction
A. Pietramala (1), V. Policicchio (1), P. Rullo (1,2), I. Sidhu (3)
1. Università della Calabria (Rende, Italy)
2. Exeura Srl (Rende, Italy)
3. Kenetica Ltd (Chicago, IL, USA)
ECML PKDD, September 2008, Antwerp, Belgium

2 Outline
– Motivations
– The Olex Hypothesis Language
– The Genetic Algorithm Approach (Olex-GA)
– Experimental Results and Comparative Evaluation
– Discussions
– Conclusions and Future Work

3 Motivations
Rule learning algorithms have become a successful strategy for classifier induction.
Rule-based classifiers have the desirable property of being readable and thus easy to understand (and, possibly, to modify).
Genetic Algorithms (GAs) are stochastic search methods inspired by biological evolution.
GAs are able to provide good solutions to classical optimization tasks (e.g. TSP and Knapsack).

4 Rule Induction and GAs
Rule induction is one of the application fields of GAs. The basic idea is that:
– Each individual in the population represents a candidate solution (a classification rule or a classifier).
– The fitness of an individual is evaluated in terms of predictive accuracy.
We propose a GA approach, called Olex-GA, for the induction of rule-based text classifiers.

5 Olex-GA - The hypothesis language
A classifier $H_c(Pos, Neg)$ is of the form:
$c \leftarrow (t_1 \in d \lor \dots \lor t_n \in d) \land \lnot (t_{n+1} \in d \lor \dots \lor t_{n+m} \in d)$
where $c$ is a category, $d$ a document, each $t_i$ a term (n-gram), $Pos = \{t_1, \dots, t_n\}$ and $Neg = \{t_{n+1}, \dots, t_{n+m}\}$.
Meaning: if any of the terms $t_1, \dots, t_n$ occurs in $d$ and none of the terms $t_{n+1}, \dots, t_{n+m}$ occurs in $d$, then classify $d$ under category $c$.
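The rule semantics stated on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `classify` and the toy terms are assumptions, and documents are simplified to bags of already-extracted terms.

```python
from typing import Iterable, Set


def classify(doc_terms: Iterable[str], pos: Set[str], neg: Set[str]) -> bool:
    """Return True if the document is classified under the category.

    The document is accepted when at least one term of `pos` occurs in it
    and no term of `neg` occurs in it.
    """
    terms = set(doc_terms)
    has_positive = not terms.isdisjoint(pos)  # some positive term occurs in d
    has_negative = not terms.isdisjoint(neg)  # some negative term occurs in d
    return has_positive and not has_negative


# Toy example with made-up terms (not taken from the slides).
pos = {"wheat", "grain"}
neg = {"stock"}
print(classify(["wheat", "export"], pos, neg))  # positive term, no negative term
print(classify(["wheat", "stock"], pos, neg))   # blocked by a negative term
print(classify(["oil"], pos, neg))              # no positive term
```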

6 Olex-GA - The hypothesis language
The terms in Pos and Neg are chosen from the local vocabulary $V_c(k, f)$.
Intuitively, $V_c(k, f)$ is the set of the best $k$ terms for category $c$ according to a given scoring function $f$.
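A minimal sketch of the "best k terms" selection follows. The slide does not say which scoring function f is used, so the scores below are invented purely for illustration, and `local_vocabulary` is a hypothetical helper name. The scoring function is assumed to be already specialized to the category c.

```python
from typing import Callable, List


def local_vocabulary(candidates: List[str], k: int,
                     f: Callable[[str], float]) -> List[str]:
    """Return the k candidate terms with the highest score f(term), best first."""
    return sorted(candidates, key=f, reverse=True)[:k]


# Hypothetical scores of each term for one category (illustrative values only).
scores = {"wheat": 0.9, "grain": 0.7, "the": 0.1, "export": 0.4}
vocab = local_vocabulary(list(scores), k=2, f=scores.__getitem__)
print(vocab)
```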

7 Olex-GA - Problem statement
The Olex-GA learning problem is stated as an optimization problem:
PROBLEM MAX-F. Let a category $c \in C$ and a vocabulary $V(k, f)$ over the training set TS be given. Then, find two subsets of $V(k, f)$, $Pos = \{t_1, \dots, t_n\}$ and $Neg = \{t_{n+1}, \dots, t_{n+m}\}$, with $Pos \neq \emptyset$, such that $H_c(Pos, Neg)$ applied to TS yields a maximum value of $F_{c,\alpha}$ (over TS), for a given $\alpha \in [0,1]$.
Problem MAX-F is NP-hard.
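The slide does not spell out the F-measure itself. Assuming the usual weighted form, with $Pr_c$ and $Re_c$ the precision and recall of the classifier for category $c$ over TS and $\alpha$ the weight in $[0,1]$ from the problem statement, it would read:

```latex
F_{c,\alpha} = \frac{Pr_c \cdot Re_c}{(1-\alpha)\, Pr_c + \alpha\, Re_c},
\qquad
F_{c,0.5} = \frac{2\, Pr_c\, Re_c}{Pr_c + Re_c}
```

Under this assumption, $\alpha = 1$ reduces the measure to precision and $\alpha = 0$ to recall, while $\alpha = 0.5$ gives the familiar harmonic mean.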

8 Olex-GA - A Genetic Algorithm to Solve MAX-F
Problem MAX-F is a combinatorial optimization problem aimed at finding a best combination of terms taken from a given vocabulary.
MAX-F is a typical problem for which GAs are known to be a good candidate solution method.

9 Olex-GA - Our implementation of the GA
In the following, we describe our choices concerning:
– Population Encoding
– Fitness Function
– Evolutionary Operators

10 Olex-GA - Population Encoding
Each individual represents an entire classifier.
An individual is simply a binary representation of the sets Pos and Neg of a classifier $H_c(Pos, Neg)$.
[Figure: EXAMPLE. Given a vocabulary of terms $t_1, \dots, t_5$, a chromosome K encoding a classifier $H_c$.]
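The binary encoding can be sketched as follows. The layout is an assumption: the chromosome is taken to consist of two bit vectors, one over the candidate positive terms and one over the candidate negative terms (the experimental-settings slide refers to them as K+ and K-). The helper name `decode` and the five-term vocabulary are illustrative only.

```python
from typing import Sequence, Set, Tuple


def decode(k_plus: Sequence[int], k_minus: Sequence[int],
           pos_candidates: Sequence[str],
           neg_candidates: Sequence[str]) -> Tuple[Set[str], Set[str]]:
    """Map the two bit vectors of a chromosome back to the sets (Pos, Neg).

    Bit i of k_plus set to 1 means pos_candidates[i] belongs to Pos;
    bit j of k_minus set to 1 means neg_candidates[j] belongs to Neg.
    """
    pos = {t for t, bit in zip(pos_candidates, k_plus) if bit}
    neg = {t for t, bit in zip(neg_candidates, k_minus) if bit}
    return pos, neg


# Assumed toy vocabulary t1..t5, used for both halves of the chromosome.
vocabulary = ["t1", "t2", "t3", "t4", "t5"]
pos, neg = decode([1, 0, 1, 0, 0], [0, 0, 0, 0, 1], vocabulary, vocabulary)
print(sorted(pos), sorted(neg))
```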

12 Olex-GA - Population Encoding
We restrict the search for positive and negative terms, respectively, to:
– Pos*, the set of terms belonging to $V_c(k, f)$ (candidate positive terms);
– Neg*, the set of terms occurring in at least one document that contains some candidate positive term and does not belong to the training set $TS_c$ of $c$ (candidate negative terms).
The reduction of the search space allows:
– an improvement in the efficiency of the algorithm;
– quick convergence toward good solutions.
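A literal reading of the Neg* definition can be sketched as below. The data layout (a dict from document id to its set of terms) and the function name are assumptions. Note that this literal reading also keeps the candidate positive terms themselves, since they occur in the selected documents; the slides do not say whether they are filtered out.

```python
from typing import Dict, Set


def candidate_negative_terms(docs: Dict[str, Set[str]],
                             ts_c: Set[str],
                             pos_star: Set[str]) -> Set[str]:
    """Collect Neg*: terms of documents outside TS_c that contain a term of Pos*.

    docs     maps a document id to the set of terms occurring in it
    ts_c     ids of the training documents labelled with category c
    pos_star candidate positive terms
    """
    neg_star: Set[str] = set()
    for doc_id, terms in docs.items():
        outside_category = doc_id not in ts_c
        contains_candidate = not terms.isdisjoint(pos_star)
        if outside_category and contains_candidate:
            neg_star |= terms
    return neg_star


# Toy corpus with made-up documents.
docs = {
    "d1": {"wheat", "farm"},   # belongs to category c
    "d2": {"wheat", "stock"},  # outside c, but contains a candidate positive term
    "d3": {"oil", "stock"},    # outside c, no candidate positive term
}
print(sorted(candidate_negative_terms(docs, ts_c={"d1"}, pos_star={"wheat"})))
```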

13 Olex-GA - Fitness Function
The fitness of a chromosome K, representing $H_c(Pos, Neg)$, is the value of the F-measure resulting from applying $H_c(Pos, Neg)$ to the training set TS.
This choice naturally follows from the formulation of problem MAX-F.
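A minimal sketch of such a fitness function is given below. It assumes the common weighted F-measure P*R / ((1 - alpha)*P + alpha*R), which the slides do not state explicitly, and it represents the training set as (terms, label) pairs; both choices, and the name `fitness`, are illustrative.

```python
from typing import Iterable, Set, Tuple


def fitness(pos: Set[str], neg: Set[str],
            training_set: Iterable[Tuple[Set[str], bool]],
            alpha: float = 0.5) -> float:
    """F-measure of the classifier (pos, neg) over labelled training documents.

    Each training item is (terms of the document, True if it belongs to c).
    """
    tp = fp = fn = 0
    for terms, in_category in training_set:
        # Rule: some positive term occurs and no negative term occurs.
        predicted = (not terms.isdisjoint(pos)) and terms.isdisjoint(neg)
        if predicted and in_category:
            tp += 1
        elif predicted:
            fp += 1
        elif in_category:
            fn += 1
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision * recall / ((1 - alpha) * precision + alpha * recall)


# Toy training set with made-up documents.
ts = [({"wheat", "farm"}, True), ({"grain"}, True),
      ({"wheat", "stock"}, False), ({"oil"}, False)]
print(fitness({"wheat", "grain"}, set(), ts))      # one false positive
print(fitness({"wheat", "grain"}, {"stock"}, ts))  # negative term removes it
```

In this toy run, adding the negative term raises the fitness because it removes the only false positive without losing any true positive.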

14 Olex-GA - Evolutionary Operators
We perform:
– selection, via the roulette-wheel method;
– crossover, by the uniform crossover scheme;
– mutation, which consists in flipping each single bit with a given (low) probability;
– elitism, in order to ensure that the best individuals of the current generation are passed to the next one without being altered.
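The four operators listed above can be sketched generically as follows. This is a textbook-style illustration, not the authors' implementation: chromosomes are flat bit lists, the elite is taken to be a fixed fraction of the population (an assumption), and all function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Chromosome = List[int]


def roulette_select(population: Sequence[Chromosome],
                    fitnesses: Sequence[float],
                    rng: random.Random) -> Chromosome:
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)  # degenerate case: pick uniformly
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point rounding


def uniform_crossover(a: Chromosome, b: Chromosome,
                      rng: random.Random) -> Tuple[Chromosome, Chromosome]:
    """Swap each gene between the two parents with probability 0.5."""
    child1, child2 = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:
            x, y = y, x
        child1.append(x)
        child2.append(y)
    return child1, child2


def mutate(chromosome: Chromosome, rate: float,
           rng: random.Random) -> Chromosome:
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


def next_generation(population: List[Chromosome],
                    fitness_fn: Callable[[Chromosome], float],
                    rng: random.Random,
                    mutation_rate: float,
                    elite_fraction: float) -> List[Chromosome]:
    """Build a new population of the same size as the current one."""
    fitnesses = [fitness_fn(ind) for ind in population]
    ranked = sorted(zip(fitnesses, population), key=lambda p: p[0], reverse=True)
    n_elite = int(elite_fraction * len(population))
    # Elitism: copy the best individuals unchanged.
    new_population = [list(ind) for _, ind in ranked[:n_elite]]
    while len(new_population) < len(population):
        parent1 = roulette_select(population, fitnesses, rng)
        parent2 = roulette_select(population, fitnesses, rng)
        for child in uniform_crossover(parent1, parent2, rng):
            if len(new_population) < len(population):
                new_population.append(mutate(child, mutation_rate, rng))
    return new_population


rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
# Stand-in fitness (number of 1 bits), used only to exercise the operators.
population = next_generation(population, sum, rng,
                             mutation_rate=0.01, elite_fraction=0.2)
print(len(population))
```

Because the elite is copied before any crossover or mutation, the best fitness in the population can never decrease from one generation to the next.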

15 Olex-GA - Experimentation
We have experimentally evaluated our algorithm on two standard benchmark corpora:
REUTERS (R10)
– It consists of 12,902 documents.
– They are manually classified with respect to 135 categories. We have considered the subset of the 10 most populated categories.
OHSUMED
– We used the collection consisting of the first 20,000 documents from the 50,216 medical abstracts of the year
– The classification scheme consisted of the 23 MeSH disease categories.

16 Experimental settings
We applied the stratified holdout method:
REUTERS:
– ModApté split: 9,603 documents are used to form the training corpus (seen data) and 3,299 to form the test set (unseen data).
OHSUMED:
– The first 10,000 documents were used as seen data and the second 10,000 as unseen data.
In both cases, we randomly split the set of seen data into:
– a training set (70%), on which to run the GA;
– a validation set (30%), on which to tune the model parameters.
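The 70/30 split of the seen data can be sketched as a plain random shuffle. This sketch does not stratify by category, and the seed, rounding rule, and function name are assumptions rather than details given in the slides.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_seen_data(seen: Sequence[T], train_fraction: float = 0.7,
                    seed: int = 0) -> Tuple[List[T], List[T]]:
    """Randomly split the seen data into (training set, validation set)."""
    shuffled = list(seen)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Document ids standing in for the 9,603 seen documents of the Reuters split.
train, validation = split_seen_data(range(9603))
print(len(train), len(validation))
```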

17 Experimental settings
GA Parameters:
Parameter             Value
Iterations            3
Population Size       500
Num of Generations    200
Cross-over Rate       1.0
Mutation Rate         0.001
Elitism Probability   0.2
For each chromosome K in the population, we initialized $K^+$ at random, while we set $K^-[t] = 0$ for each $t \in Neg^*$ (thus, K initially encodes a classifier $H_c(Pos, Neg)$ with no negative terms).
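The initialization described on this slide can be sketched as follows, assuming each chromosome is stored as a pair of bit lists (positive part, negative part). The function name and the vocabulary sizes in the usage lines are illustrative. The sketch does not enforce that the positive part is non-empty, which a full implementation would have to handle.

```python
import random
from typing import List, Tuple

Chromosome = Tuple[List[int], List[int]]  # (positive part, negative part)


def init_population(n_pos: int, n_neg: int, size: int,
                    rng: random.Random) -> List[Chromosome]:
    """Create `size` chromosomes: random positive part, all-zero negative part."""
    population = []
    for _ in range(size):
        k_plus = [rng.randint(0, 1) for _ in range(n_pos)]  # random positive bits
        k_minus = [0] * n_neg                               # no negative terms yet
        population.append((k_plus, k_minus))
    return population


# Population size taken from the parameter table; vocabulary sizes are made up.
population = init_population(n_pos=10, n_neg=25, size=500, rng=random.Random(1))
print(len(population), sum(population[0][1]))
```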

18 Comparative Evaluation
On both corpora, we carried out a direct comparison with the following systems:
– SVM (both polynomial and radial basis function kernels)
– Ripper (with two optimization steps)
– C4.5
– Naive Bayes
– Olex-Greedy
Performance was evaluated using the Weka library of ML algorithms (apart from Olex-Greedy).

19 Performance Comparison on Reuters
Efficacy
– SVMpoli > SVMrbf > Ripper Olex-GA > C45 > Olex-Greedy > NB
Efficiency
– NB > Olex-Greedy > SVMpoli > Olex-GA > C45 > SVMrbf > Ripper

20 Performance Comparison on OHSUMED
Efficacy
– Olex-GA > Ripper > SVMpoli > Olex-Greedy > SVMrbf NB > C45
Efficiency
– NB > Olex-Greedy > SVMpoli > Olex-GA > C45 > SVMrbf > Ripper

21 Discussions - Relation to other inductive rule learners
Conventional Rule Learners (Ripper, C4.5):
– Usually rely on a two-stage process: rule induction and rule pruning.
– Each of these stages in turn consists of several steps.
Olex-GA relies on a single-step process which does not need any post-induction optimization.
With respect to Olex-Greedy, Olex-GA provides better predictive accuracy, but is less efficient.

22 Conclusions
Olex-GA encodes a classifier, in a very natural and compact way, as an individual.
The fitness of an individual is evaluated as the F-measure of the encoded classifier.
Experimental results point out that:
– Olex-GA quickly converges to very accurate classifiers;
– Olex-GA performs at a competitive level with standard algorithms;
– Time efficiency is lower than that of Olex-Greedy but higher than that of the other rule learning methods, such as Ripper and C45.

23 Future work
Extension of the proposed technique to deal with classifiers of the form where each $T_i$ is a conjunction of simple terms:

