
1 Complex Linguistic Features for Text Classification: A Comprehensive Study
Alessandro Moschitti and Roberto Basili
University of Texas at Dallas, University of Rome Tor Vergata
ECIR 2004

2 Abstract
Previous research on advanced representations for document retrieval has shown that state-of-the-art statistical models are not improved by a variety of different linguistic representations: phrases, word senses and syntactic relations derived by NLP techniques were observed to be ineffective at increasing retrieval accuracy. For Text Categorization (TC), fewer and less definitive studies on the use of advanced document representations are available.

3 Abstract
In this paper, extensive experiments on representative classifiers (Rocchio and SVM) have been carried out to study how some NLP techniques impact TC. Cross validation over 4 corpora in two languages allowed us to gather overwhelming evidence that complex nominals, proper nouns and word senses are not adequate to improve TC accuracy.

4 Introduction
Several attempts to design complex and effective features for document retrieval and filtering have been carried out:
- Document lemmas: the base forms of morphological categories.
- Phrases: simple n-grams (e.g. officials said), noun phrases such as Named Entities, tuples.
- Word senses: defined by means of an explanation, as in a dictionary entry, or by using other words that share the same sense, as in a thesaurus (e.g. WordNet).

5 Phrases and Document Retrieval
The goal of using both phrases and word senses is to increase the precision of concept matching. In the TREC conferences, phrases were experimented with and some conclusions were drawn [Strzalkowski, 1998]:
1. The higher computational cost of the employed NLP algorithms prevents their application in operative IR scenarios.
2. The experimented NLP representations can improve basic retrieval models (e.g. SMART), but give no improvement for advanced statistical retrieval models.

6 Word Senses and Document Retrieval
In [Smeaton, 1998], NLP resources such as WordNet were experimented with, instead of NLP techniques. Positive results were obtained only after the senses were manually validated: WSD performance in the 60-70% range was not adequate to improve document retrieval. For text indexing and query expansion, semantic information taken directly from WordNet without performing WSD does not help IR at all. The high computational cost of the adopted NLP algorithms, the small improvements produced and the lack of accurate WSD tools are the reasons for the failure of NLP in IR.

7 Text Categorization and IR
Since TC is a subtask of IR, why should we try to use the same NLP techniques for TC?
- In TC, both sets of positive and negative documents describing the categories are available.
- Categories differ from queries in that they are static, i.e., a predefined set of training documents stably defines the target category.
- Effective WSD algorithms can be applied to documents, whereas this is not the case for queries (especially short queries). Recent evaluations in SENSEVAL have shown accuracies of 70% for verbs, 75% for adjectives and 80% for nouns.
- As TC is a relatively new research area, there are fewer studies that employ NLP techniques for it, and several of them report noticeable improvements over the bag-of-words.

8 The Goal
In this paper, the impact of richer document representations on TC is investigated in depth on four corpora in two languages by using cross validation analysis. Phrase and sense representations have been experimented with on three classification systems:
- Rocchio [J. Rocchio, 1971], an efficient classifier;
- the Parameterized Rocchio Classifier (PRC) [A. Moschitti, 2003];
- SVM-light [T. Joachims, 1999], a state-of-the-art TC model.
Richer representations can be really useful only if:
a. accuracy increases with respect to the bag-of-words baseline for the different systems, or
b. they improve computationally efficient classifiers so that they approach the accuracy of state-of-the-art models.
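
The sketch below illustrates, under simplifying assumptions, the classic Rocchio profile weighting and the kind of per-category parameter search behind PRC. The beta/gamma values, the toy weighted vectors and the held-out document are invented for illustration and do not reproduce the paper's actual setting.

```python
import numpy as np

def rocchio_profile(pos, neg, beta=16.0, gamma=4.0):
    """Build a category profile from positive and negative training vectors."""
    profile = beta * pos.mean(axis=0) - gamma * neg.mean(axis=0)
    return np.maximum(profile, 0.0)          # negative weights are clipped to zero

def score(doc, profile):
    """Cosine similarity between a document vector and the category profile."""
    denom = np.linalg.norm(doc) * np.linalg.norm(profile) or 1.0
    return float(doc @ profile) / denom

# Toy weighted document vectors (e.g. tf-idf over a tiny vocabulary).
pos = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])   # in-category documents
neg = np.array([[0.1, 0.9, 0.3], [0.0, 0.8, 0.5]])   # out-of-category documents

# PRC-style idea: search the negative-weight parameter per category, keeping the
# value that maximizes a validation score (here a single held-out toy document).
held_out = np.array([0.7, 0.2, 0.0])
best = max((score(held_out, rocchio_profile(pos, neg, 16.0, g)), g)
           for g in (0.0, 4.0, 8.0, 16.0))
print("best gamma for this toy category:", best[1])
```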

9 Natural Language Feature Engineering
The linguistic features used to train the classifiers:
- POS-tag information: the Brill tagger is used (~95% accuracy).
- Phrases: proper nouns (persons, locations, artifacts) and complex nominals expressing domain concepts, e.g. bond issues or beach wagon.
- Word senses.

10 Automatic Phrase Extraction
For proper nouns, detection is achieved by applying a grammar that takes into account the capital letters of nouns, e.g. International Bureau of Law.
For complex nominals, extraction is based on an integration of symbolic and statistical models presented in [R. Basili, 1997], in three steps:
1. the detection of atomic terms ht, e.g. issue;
2. the identification of admissible candidates, i.e. linguistic structures headed by ht (satisfying linguistically principled grammars);
3. the selection of the final complex nominals via a statistical filter such as mutual information.
The phrases were extracted per category.
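
A rough sketch of the candidate-plus-statistical-filter idea behind this extraction: the toy POS-tagged corpus, the admissibility pattern (an adjective or noun modifier followed by a noun head) and the PMI threshold are illustrative assumptions, not the grammar or the exact filter of [R. Basili, 1997].

```python
import math
from collections import Counter

# Toy POS-tagged corpus for one category: (token, tag) pairs.
tagged = [("bond", "NN"), ("issues", "NNS"), ("fell", "VBD"), ("while", "IN"),
          ("bond", "NN"), ("issues", "NNS"), ("and", "CC"), ("beach", "NN"),
          ("wagon", "NN"), ("sales", "NNS"), ("rose", "VBD")]

unigrams = Counter(tok for tok, _ in tagged)
bigrams = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    # Step 2: admissible candidates = adjective/noun modifier followed by a noun head.
    if t1.startswith(("JJ", "NN")) and t2.startswith("NN"):
        bigrams[(w1, w2)] += 1

n = len(tagged)
def pmi(pair):
    # Step 3: statistical filter (pointwise mutual information over the toy corpus).
    w1, w2 = pair
    return math.log((bigrams[pair] / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)), 2)

complex_nominals = [p for p in bigrams if pmi(p) > 1.0]
print(complex_nominals)   # e.g. [('bond', 'issues'), ('beach', 'wagon'), ...]
```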

11 WSD Algorithms
WordNet is used to assign noun senses (nouns give the highest accuracy). Three WSD algorithms are compared:
- Baseline: assign each noun its most frequent sense.
- An algorithm based on gloss information: it exploits the definition of each synset, e.g. { hit, noun } #1 = (a successful stroke in an athletic contest (especially in baseball); "he came all the way around on Williams' hit"), and selects the sense whose local context (the synset definition) best matches the global context (the context of the target noun), by counting the number of nouns shared by both contexts.
- The WSD system developed by the LCC (Language Computer Corporation), the one that won SENSEVAL [A. Kilgarriff, 2000].
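
A minimal sketch of the gloss-overlap algorithm (essentially a simplified Lesk-style matcher): the tiny inlined sense inventory stands in for real WordNet synsets, and the overlap measure is simply the number of shared nouns.

```python
def disambiguate(context_nouns, senses):
    """senses: list of (sense_id, gloss_nouns); returns the best-matching sense id."""
    def overlap(gloss_nouns):
        # Local context (gloss) vs. global context (nouns around the target word).
        return len(set(context_nouns) & set(gloss_nouns))
    return max(senses, key=lambda s: overlap(s[1]))[0]

# Two senses of "hit", with the nouns occurring in their glosses/examples.
hit_senses = [
    ("hit#1", ["stroke", "contest", "baseball", "way"]),    # the sports sense
    ("hit#2", ["song", "record", "success", "public"]),     # "a popular success"
]

context = ["inning", "baseball", "pitcher", "contest"]      # global context nouns
print(disambiguate(context, hit_senses))                    # -> hit#1
```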

12 Experiments on Linguistic Features
- The evaluation of phrases and POS information: using Rocchio, PRC and SVM over the Reuters3, Ohsumed and ANSA collections.
- The evaluation of semantic information (senses): using SVM on the Reuters-21578 and 20NewsGroups corpora.

13 Experimental Set-Up
Document collections:
- The Reuters-21578 corpus, Apté split: 12,902 documents in 90 classes, with a fixed split between testing and training (3,299 vs. 9,603).
- The Reuters3 corpus: 11,099 documents in 93 classes, with a split of 3,309 vs. 7,789 between testing and training.
- The ANSA collection: 16,000 news items in Italian from the ANSA news agency, 8 target categories.
- The Ohsumed collection: 50,216 medical abstracts; only the first 20,000 documents in the 23 MeSH disease categories are used.
- The 20NewsGroups corpus (20NG): 19,997 articles in 20 categories taken from the Usenet newsgroups collection; it differs from Reuters and Ohsumed in its larger vocabulary.

14 Experimental Set-Up
Two sets of tokens are considered as baselines:
- the Tokens set, which contains a larger number of features and should provide the most general bag-of-words results;
- the Linguistic-Tokens set (only nouns, verbs and adjectives), selected using the POS information.
+CN indicates that proper nouns and other complex nominals are used as features for the classifiers. +POS indicates that features are tokens augmented with their POS tags in context. The NLP-derived features are added to the standard token sets, instead of replacing some of them.
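
A small sketch of how such augmented feature sets could be assembled, with the NLP-derived features appended to (not replacing) the plain tokens; the tokenization, the tags and the complex-nominal list are toy stand-ins for the tools actually used in the paper.

```python
def features(tokens, pos_tags=None, complex_nominals=()):
    feats = list(tokens)                                   # baseline: bag-of-words tokens
    if pos_tags:                                           # +POS: token augmented with its tag
        feats += [f"{tok}_{tag}" for tok, tag in zip(tokens, pos_tags)]
    text = " ".join(tokens)
    feats += [cn for cn in complex_nominals               # +CN: phrases found in the text
              if cn.replace("_", " ") in text]
    return feats

tokens = ["bond", "issues", "fell", "sharply"]
tags = ["NN", "NNS", "VBD", "RB"]
print(features(tokens, pos_tags=tags, complex_nominals=["bond_issues"]))
```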

15 Experimental Set-Up
Evaluation (microaverage for global performance):
- Breakeven Point (BEP): the point where precision = recall.
- F1 measure: 2PR / (P + R)
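
A minimal sketch of this micro-averaged evaluation: per-category contingency counts are pooled before computing precision and recall, F1 is their harmonic mean, and the BEP is approximated here as the average of precision and recall (a common interpolation when the exact breakeven is not reached); the counts are invented for illustration.

```python
def micro_scores(per_category_counts):
    """per_category_counts: list of (tp, fp, fn) tuples, one per category."""
    tp = sum(c[0] for c in per_category_counts)   # pooled true positives
    fp = sum(c[1] for c in per_category_counts)   # pooled false positives
    fn = sum(c[2] for c in per_category_counts)   # pooled false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    bep_approx = (precision + recall) / 2         # interpolated breakeven point
    return precision, recall, f1, bep_approx

print(micro_scores([(80, 10, 20), (45, 15, 5)]))  # two toy categories
```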

16 Cross-Corpora/Classifier Validations of Phrases and POS-information
Table 2. Breakeven points of PRC over the Reuters3 corpus. The linguistic features are added to the Linguistic-Tokens set.
Do the linguistic features improve the result? An alternative feature set could perform better than the bag-of-words in a single experiment, because the classifier parameters could happen to be better suited to that particular training/test-set split. Therefore 20 randomly generated splits (70%-30%) are used for cross validation.
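
A short sketch of this validation protocol: 20 random 70%-30% train/test splits with the measure averaged over splits. The classifier and the measure are left abstract behind a train_and_evaluate callback, which is a placeholder you would supply; the dummy usage at the end only exercises the splitting logic.

```python
import random

def cross_validate(documents, labels, train_and_evaluate, n_splits=20, train_frac=0.7):
    scores = []
    indices = list(range(len(documents)))
    for split in range(n_splits):
        random.seed(split)                       # reproducible random splits
        random.shuffle(indices)
        cut = int(train_frac * len(indices))
        train_idx, test_idx = indices[:cut], indices[cut:]
        scores.append(train_and_evaluate(
            [documents[i] for i in train_idx], [labels[i] for i in train_idx],
            [documents[i] for i in test_idx], [labels[i] for i in test_idx]))
    return sum(scores) / len(scores)             # mean measure (e.g. micro-F1) over splits

# Dummy usage: a fake evaluator that just returns the test-set fraction.
docs, labs = [f"doc{i}" for i in range(10)], [i % 2 for i in range(10)]
print(cross_validate(docs, labs, lambda tr, trl, te, tel: len(te) / len(docs)))
```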

17 Cross-Corpora/Classifier Validations of Phrases and POS-information Table 3. Rocchio, PRC and SVM performances on different feature sets of the Reuters3 corpus

18 Cross-Corpora/Classifier Validations of Phrases and POS-information
Table 4. Rocchio, PRC and SVM performances on different feature sets of the Ohsumed corpus.
Neonatal is improved by the extended features, but this is an isolated case rather than the norm.

19 Cross Validation on Word Senses
Compare the performance of SVM over Tokens and over Semantic feature sets (= Tokens + disambiguated noun senses).
An indicative evaluation of the WSD algorithms (250 manually disambiguated nouns from Reuters-21578 documents):
- Baseline: 78.43%
- Algorithm 1 (gloss-based): 77.12%
- Algorithm 2 (LCC): 80.55%

20 Cross Validation on Word Senses
Table 6. Performance of SVM on the Reuters-21578 corpus. Does semantic information (WSD) enhance the classifier?

21 Cross Validation on Word Senses
Table 7. SVM μF1 performances on 20NewsGroups.
When the words are richer in terms of possible senses, the baseline performs worse than Alg2. When all the nouns are replaced with their disambiguated senses, performance is 1 to 3% lower than the bag-of-words.

22 Why Do Phrases Not Help?
Two possible properties of phrases offer an explanation:
- Loss of coverage: word information cannot be easily subsumed by phrase information, e.g. George_Bush → Bush.
- Poor effectiveness: the information added by a word sequence is poorer than that of its word set.
Two conditions are necessary for a phrase to be better than its word set:
- the words in the sequence should appear non-sequentially in some incorrect documents, e.g. George and Bush appearing non-sequentially in a sport document;
- all the correct documents that contain one of the compounding words (e.g. George or Bush) should at the same time contain the whole sequence.
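
A toy check of these two conditions, under the simplifying assumptions that documents are plain token lists and that a phrase is an adjacent bigram; the George/Bush documents are invented to mirror the example above.

```python
def contains_phrase(doc, phrase):
    return any(doc[i:i + len(phrase)] == list(phrase) for i in range(len(doc)))

def phrase_beats_words(phrase, correct_docs, incorrect_docs):
    words = set(phrase)
    # Condition 1: some incorrect document contains the words but not the sequence.
    cond1 = any(words <= set(d) and not contains_phrase(d, phrase) for d in incorrect_docs)
    # Condition 2: every correct document containing any of the words contains the phrase.
    cond2 = all(contains_phrase(d, phrase) for d in correct_docs if words & set(d))
    return cond1 and cond2

politics = [["george", "bush", "spoke", "today"]]                 # correct category
sports = [["bush", "watched", "as", "george", "scored"]]          # incorrect category
print(phrase_beats_words(("george", "bush"), politics, sports))   # -> True
```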

23 Why Do Senses Not Help?
The senses of a noun in the documents of a category tend to always be the same. Moreover, different categories are characterized by different words rather than different senses.
A general view: textual representations are always very good at capturing the overall semantics of documents, at least as good as linguistically justified representations. IR methods oriented to textual representations of document semantics should be investigated first, and they should stress the role of words as vehicles of natural language semantics (as opposed to logical systems of semantic types, like ontologies).

24 Conclusions
This paper reports a study of advanced document representations for TC. Several combinations of different feature sets have been extensively experimented with, using three classifiers (Rocchio, PRC and SVM) over 4 corpora in two languages. The results show that neither semantic information (word senses) nor syntactic information (phrases and POS tags) achieves the goal of improvement. The outcome of this careful analysis is not a negative statement on the role of complex linguistic features in TC, but a suggestion that the elementary textual representation based on words is very effective.

