Using IR techniques to improve Automated Text Classification

1 Using IR techniques to improve Automated Text Classification
NLDB04 – 9th International Conference on Applications of Natural Language to Information Systems
Teresa Gonçalves, Paulo Quaresma
Departamento de Informática, Universidade de Évora, Portugal

Hello! My name is Teresa Gonçalves. I come from the University of Évora, Portugal, and I will present a paper named "Using Information Retrieval techniques to improve Automated Text Classification".

2 Overview
Application area: Text Categorisation
Research area: Machine Learning
Paradigm: Support Vector Machines
Written language: European Portuguese and English
Study: evaluation of preprocessing techniques

This paper presents preliminary work in the Automated Text Classification area using the Support Vector Machines paradigm. We evaluate some preprocessing techniques on two sets of documents: one written in the Portuguese language, the other, a very well known one, written in English.

3 Datasets
Multilabel classification datasets
  Each document can be classified into multiple concepts
Written language
  European Portuguese: PAGOD dataset
  English: Reuters dataset

In both datasets, each document can be classified into multiple concepts, which turns the problem at hand into a multilabel classification one. The English dataset is the Reuters one and the Portuguese dataset is PAGOD.

4 PAGOD dataset
Represents the decisions of the Portuguese Attorney General's Office since 1940
Characteristics
  8151 documents, 96 Mbytes of characters
  68886 distinct words
  Average document: 1339 words, 306 distinct
  Taxonomy of 6000 concepts, of which around 3000 are used
  5 most used concepts (number of documents): 909, 680, 497, 410, 409

The PAGOD dataset represents the decisions of the Portuguese Attorney General's Office since 1940. It is a set of around 8100 documents containing almost 70 thousand distinct words. On average, each document has 1339 words, of which 306 are distinct. Each document can belong to a set of concepts organised in a taxonomy of around 6000, of which only about half are used. 909 documents belong to the most used concept, pension for exceptional services, while only 409 documents belong to the fifth most used one, military.

5 Reuters dataset
Originally collected by the Carnegie Group from the Reuters newswire in 1987
Characteristics
  9603 train documents, 3299 test documents (ModApté split)
  31715 distinct words
  Average document: 126 words, 70 distinct
  Taxonomy of 135 concepts, of which 90 appear in the train/test sets
  5 most used concepts (number of documents)
    Train set: 2861, 1648, 534, 428, 385
    Test set: 1080, 718, 179, 148, 186

For the Reuters dataset we used the ModApté split, which gives 9603 documents for the train set and 3299 documents for the test set. Over the whole dataset there are almost 32 thousand distinct words. On average, each document has 126 words, of which 70 are distinct. Each document can belong to a set of 135 concepts, of which 90 appear in both the train and test sets. The number of documents belonging to the 5 most used concepts ranges, in the train set, from 2861 down to 385.

6 Experiments
Document representation
  Bag-of-words
  Retain word frequencies
  Discard words that contain digits
Algorithm
  Linear SVM (WEKA software package)
Classes of preprocessing experiments
  Feature reduction/construction
  Feature subset selection
  Term weighting

For the experiments, we represented each document as a bag-of-words, retaining word frequencies and discarding all words that contain digits. We used a linear SVM from the WEKA software package and ran three classes of preprocessing experiments: feature reduction/construction, feature subset selection and term weighting. A sketch of this setup follows.
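To make the setup concrete, here is a minimal sketch of the representation and learner, with scikit-learn's LinearSVC standing in for the WEKA linear SVM actually used in the paper; the tiny corpus, labels and token pattern are illustrative assumptions, not the paper's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical mini-corpus; in the multilabel setting each concept
# gets its own binary classifier like this one.
docs = ["pension granted for exceptional services",
        "request for military pension denied 1987",
        "grain shipment prices rose sharply"]
labels = [1, 1, 0]  # 1 = document belongs to the concept

# Bag-of-words: keep word frequencies, discard tokens containing digits
# (the pattern keeps letter-only tokens, so "1987" above is dropped).
vectorizer = CountVectorizer(token_pattern=r"\b[^\W\d_]+\b")
X = vectorizer.fit_transform(docs)

clf = LinearSVC()  # linear SVM (done via WEKA in the paper)
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["military pension request"])))
```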

7 Feature reduction/construction
Uses linguistic information
  red1: no use of linguistic information
  red2: remove a list of non-relevant words (articles, pronouns, adverbs and prepositions)
  red3: remove the red2 word list and transform each word into its lemma (its stem for the English dataset)
Resources
  Portuguese: POLARIS, a Portuguese lexical database
  English: FreeWAIS stop-list and the Porter algorithm

For the feature reduction/construction experiments we used linguistic information and ran three experiments: in the first we kept the original words; in the second we removed a list of stop words; and in the third we additionally applied the Porter stemming algorithm for English, and the POLARIS lexical database to transform each Portuguese word into its lemma. The sketch below illustrates the three levels.
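As an illustration (not the paper's exact tooling), this sketch reproduces the three reduction levels for English, with NLTK's PorterStemmer standing in for the Porter algorithm and a tiny hypothetical stop-list standing in for the FreeWAIS one; the POLARIS lemmatisation used for Portuguese is not reproduced here.

```python
from nltk.stem import PorterStemmer

STOP = {"the", "a", "of", "and", "in", "for"}  # hypothetical stop-list excerpt
stemmer = PorterStemmer()

def red1(tokens):
    # red1: original words, no linguistic information used
    return list(tokens)

def red2(tokens):
    # red2: drop the non-relevant (stop) words
    return [t for t in tokens if t not in STOP]

def red3(tokens):
    # red3: red2 plus reduction to a stem (a lemma for Portuguese)
    return [stemmer.stem(t) for t in red2(tokens)]

print(red3("the decisions of the attorney general".split()))
```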

8 Feature subset selection
Uses a filtering approach
  Keeps the features (words) that receive the highest scores
Scoring functions
  scr1: term frequency
  scr2: mutual information
  scr3: gain ratio
Threshold value
  scr1: the number of times each word appears in all documents
  scr2, scr3: the same number of features as scr1
Experiments
  sel1, sel50, sel100, sel200, sel400, sel800, sel1200, sel1600

For feature subset selection we used a filtering approach, retaining the words with the highest scores. We experimented with three different scoring functions (term frequency, mutual information and gain ratio, the last two from the information theory field) and tried eight different threshold values, ranging from all words (sel1) to retaining only words that, in the term frequency experiment, appear at least 1600 times. A sketch of the filter follows.
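A minimal sketch of the filtering approach, shown only for the term-frequency score (scr1); mutual information and gain ratio would plug into the same keep-if-above-threshold filter, with their thresholds chosen to keep the same number of features. The corpus is hypothetical.

```python
from collections import Counter

def select_by_term_frequency(tokenised_docs, threshold):
    # Score each word by its total count over all documents (scr1) and
    # keep the words at or above the threshold (sel1, sel50, ..., sel1600).
    counts = Counter(t for doc in tokenised_docs for t in doc)
    return {w for w, n in counts.items() if n >= threshold}

docs = [["pension", "services", "pension"], ["pension", "military"]]
print(select_by_term_frequency(docs, 1))  # sel1: every word is kept
print(select_by_term_frequency(docs, 2))  # only words appearing twice or more
```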

9 Number of attributes (with respect to threshold values)
The number of words obtained for each threshold value and for each dataset is shown in these graphs. As you can see, most words appear only a few times in the whole set of documents.

10 Term weighting
Uses the document, collection and normalisation components
  wgt1: binary representation, no collection component, normalised to unit length
  wgt2: raw term frequency, no collection or normalisation component
  wgt3: term frequency, no collection component, normalised to unit length
  wgt4: term frequency divided by the collection component and normalised to unit length

For term weighting we tried four experiments: wgt1 is the binary representation; wgt2 is the raw term frequency; wgt3 is the term frequency normalised to unit length; and wgt4 is the popular tf-idf measure. The sketch below spells out the four schemes.
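The following sketch spells out the four schemes in the usual document/collection/normalisation decomposition. It is an assumed reading of the slide, with wgt4 taken as tf x idf, not code from the paper.

```python
import math

def component(tf, df, n_docs, scheme):
    # Document (and, for wgt4, collection) component of the weight.
    if scheme == "wgt1":
        return 1.0 if tf > 0 else 0.0      # binary representation
    if scheme in ("wgt2", "wgt3"):
        return float(tf)                   # raw term frequency
    if scheme == "wgt4":
        return tf * math.log(n_docs / df)  # tf divided by the collection
                                           # component, i.e. tf x idf

def unit_length(vec):
    # Normalisation component: scale the document vector to unit length.
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# wgt1, wgt3 and wgt4 are normalised to unit length; wgt2 is not.
tf_df = [(3, 10), (1, 50)]  # hypothetical (term freq, doc freq) pairs
for scheme in ("wgt1", "wgt2", "wgt3", "wgt4"):
    vec = [component(tf, df, 100, scheme) for tf, df in tf_df]
    print(scheme, vec if scheme == "wgt2" else unit_length(vec))
```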

11 Experimental results
Method
  PAGOD: 10-fold cross-validation
  Reuters: train and test sets (ModApté split)
Measures
  Precision, recall and F1
  Micro- and macro-averaging for the top 5 concepts
  Significance tests with 95% confidence

For the experimental results we used 10-fold cross-validation for the Portuguese dataset and the train/test split for the English one. We measured precision, recall and F1 for each classifier and calculated the micro- and macro-averages over the top 5 concepts. All significance tests were made with 95% confidence. The averaging step is sketched below.
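To make the averaging concrete, here is a minimal sketch of micro- versus macro-averaged F1 over the top 5 concepts: micro-averaging pools the per-concept counts before computing F1, while macro-averaging averages the per-concept F1 values. The counts are hypothetical.

```python
def f1(tp, fp, fn):
    # Precision, recall and their harmonic mean (F1).
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical (tp, fp, fn) contingency counts for the top 5 concepts.
counts = [(800, 100, 109), (600, 90, 80), (400, 60, 97),
          (350, 50, 60), (300, 70, 109)]

micro = f1(*(sum(c[i] for c in counts) for i in range(3)))  # pool, then F1
macro = sum(f1(*c) for c in counts) / len(counts)           # F1, then mean
print(f"micro-F1 = {micro:.3f}  macro-F1 = {macro:.3f}")
```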

12 PAGOD dataset
These graphs show, for the PAGOD dataset, the micro- and macro-averaged F1 measure, averaged over all threshold values of each experiment:
  feature reduction/construction: red1, red2, red3
  feature subset selection scoring function: scr1, scr2, scr3
  term weighting: wgt1, wgt2, wgt3, wgt4
As you can see, the worst values were obtained for the raw term frequency (visible in both graphs), gain ratio (more easily seen in the macro-average graph) and lemmatisation experiments. The best values were obtained for the experiment that removes the words belonging to the stop list, scores them by term frequency, retains all words (no threshold) and weights them by the term frequency normalised to unit length.

13 Reuters dataset
These graphs show the values obtained in the same way for the Reuters dataset. Here, the worst values were also obtained for the raw term frequency and gain ratio experiments (visible in both graphs), while the feature reduction/construction experiments show similar averaged results. The best values were obtained for the mutual information experiment with a threshold value of 400, with no significant difference between the stop-word and stemming experiments, nor between the tf-idf and the unit-length-normalised term frequency weightings.

14 Results
PAGOD
  Best combination: scr1 – red2 – wgt3 – sel1
  Worst values: the scr3 and wgt2 experiments
Reuters
  Best combination: scr2 – (red1, red3) – (wgt3, wgt4) – sel400
  Worst values: the scr3 and wgt2 experiments

15 Discussion
SVM
  Deals well with non-informative and non-independent features in different languages
Results
  Worse values for PAGOD: written language? more difficult concepts to learn? more imbalanced dataset?
Best experiments
  Different for the two datasets: written language? area of the written documents?

Concluding, we can say that Support Vector Machines deal well with non-informative and non-independent features in different languages. The best experiments differ between the two datasets; here we can pose two hypotheses: the written language, or the area of the documents (legal vs. news). The results are worse for the Portuguese dataset, and again different reasons come to mind: more difficult concepts to learn, a more imbalanced dataset, or the written language.

16 Future work
Explore the impact of
  the imbalanced nature of the datasets
  the use of morpho-syntactical information
  other datasets
Try more powerful document representations

And now some future work: study the impact of the imbalanced nature of these datasets; use morpho-syntactical information in the feature construction/reduction experiments (for instance, use just the verbs or the nouns as features); explore other datasets to find the reason for the different best experiments; and, finally, try more powerful document representations than the bag-of-words, using, for example, word order and syntactical and/or semantic information from the documents.

17 Scoring functions
scr1: Term frequency
  The score is the number of times the feature appears in the dataset
scr2: Mutual information
  Evaluates the worth of a feature A by measuring its mutual information, I(C;A), with respect to the class C
scr3: Gain ratio
  The worth is the attribute's gain ratio with respect to the class
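A minimal sketch of the two information-theoretic scores for a binary attribute A (word absent/present) against a binary class C, computed from a hypothetical contingency table; gain ratio divides the information gain I(C;A) by the attribute entropy H(A).

```python
import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Hypothetical joint counts: rows A in {absent, present}, cols C in {neg, pos}.
joint = [[40, 10], [5, 45]]
n = sum(map(sum, joint))
p_a = [sum(row) / n for row in joint]                             # P(A)
p_c = [sum(joint[a][c] for a in range(2)) / n for c in range(2)]  # P(C)

# I(C;A) = H(C) - H(C|A): how much knowing A tells us about C.
h_c_given_a = sum(p_a[a] * entropy([joint[a][c] / sum(joint[a])
                                    for c in range(2)])
                  for a in range(2))
mutual_information = entropy(p_c) - h_c_given_a   # scr2
gain_ratio = mutual_information / entropy(p_a)    # scr3
print(f"I(C;A) = {mutual_information:.3f}  gain ratio = {gain_ratio:.3f}")
```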

