Complex Linguistic Features for Text Classification: A Comprehensive Study. Alessandro Moschitti and Roberto Basili, University of Texas at Dallas and University of Rome Tor Vergata.


Complex Linguistic Features for Text Classification: A Comprehensive Study Alessandro Moschitti and Roberto Basili University of Texas at Dallas, University of Rome Tor Vergata ECIR 2004

Abstract Previous research on advanced representations for document retrieval has shown that statistical state-of-the-art models are not improved by a variety of different linguistic representations. Phrases, word senses and syntactic relations derived by NLP techniques were observed to be ineffective for increasing retrieval accuracy. For Text Categorization (TC), fewer and less definitive studies on the use of advanced document representations are available.

Abstract In this paper, extensive experiments on representative classifiers (Rocchio and SVM) have been carried out to study how some NLP techniques impact TC. Cross validation over 4 different corpora in two languages allowed us to gather overwhelming evidence that complex nominals, proper nouns and word senses are not adequate to improve TC accuracy.

Introduction Several attempts to design complex and effective features for document retrieval and filtering have been carried out:
- Document lemmas: base forms of morphological categories
- Phrases:
  - simple n-grams, e.g. officials said
  - noun phrases, such as Named Entities
  - tuples
- Word senses:
  - defined by means of an explanation, as in a dictionary entry
  - defined by other words that share the same sense, as in a thesaurus, e.g. WordNet
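The simplest of these representations is easy to reproduce. The sketch below (an illustration, not the paper's actual indexing code) builds a bag-of-words feature vector and optionally adds adjacent-word bigrams such as officials_said as simple n-gram phrases:

```python
from collections import Counter

def extract_features(text, use_bigrams=True):
    """Bag-of-words counts, optionally extended with word bigrams
    (the simple n-gram 'phrases' mentioned above)."""
    tokens = text.lower().split()
    feats = Counter(tokens)
    if use_bigrams:
        # join each pair of adjacent tokens into a single phrase feature
        feats.update("_".join(pair) for pair in zip(tokens, tokens[1:]))
    return feats

feats = extract_features("officials said the bond issues rose")
# both the unigram 'officials' and the phrase 'officials_said' are features
```

Note that the phrase features are added alongside the unigrams, mirroring the paper's choice of extending, rather than replacing, the token set.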

Phrases and Document Retrieval The goal of using both phrases and word senses is to increase the precision of concept matching. In the TREC conferences, phrases were experimented with and some conclusions were drawn:
1. The higher computational cost of the employed NLP algorithms prevents their application in operative IR scenarios.
2. The experimented NLP representations can improve basic retrieval models (e.g. SMART), but yield no improvement for advanced statistical retrieval models. [Strzalkowski, 1998]

Word Senses and Document Retrieval In [Smeaton, 1998], NLP resources like WordNet were experimented with, instead of NLP techniques:
- Positive results were obtained only after the senses were manually validated.
- WSD performance in the 60-70% range was not adequate to improve document retrieval.
- For text indexing and query expansion, semantic information taken directly from WordNet without performing WSD does not help IR at all.
The high computational cost of the adopted NLP algorithms, the small improvements produced and the lack of accurate WSD tools are the reasons for the failure of NLP in IR.

Text Categorization and IR Since TC is a subtask of IR, why should we try to use the same NLP techniques for TC?
- In TC, both sets of positive and negative documents describing the categories are available.
- Categories differ from queries in that they are static, i.e., a predefined set of training documents stably defines the target category.
- Effective WSD algorithms can be applied to documents, whereas this was not the case for queries (especially short queries).
- Recent evaluation in SENSEVAL has shown accuracies of 70% for verbs, 75% for adjectives and 80% for nouns.
- As TC is a relatively new research area, there are fewer studies that employ NLP techniques for it, and several report noticeable improvements over the bag-of-words.

The Goal In this paper, the impact of richer document representations on TC has been deeply investigated on four corpora in two languages by using cross validation analysis. Phrase and sense representations have been experimented with on three classification systems:
- Rocchio [J. Rocchio, 1971], an efficient classifier
- The Parameterized Rocchio Classifier (PRC) [A. Moschitti, 2003]
- SVM-light [T. Joachims, 1999], a state-of-the-art TC model
Richer representations can be really useful only if:
a. accuracy increases with respect to the bag-of-words baseline for the different systems, or
b. they improve computationally efficient classifiers so that they approach the accuracy of state-of-the-art models.
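As a reference point, Rocchio builds one profile vector per category from the centroids of the positive and negative training vectors; PRC additionally tunes the weighting per category. A minimal sketch of the classic formulation (the beta and gamma values below are illustrative defaults, not the paper's settings):

```python
import numpy as np

def rocchio_profile(pos, neg, beta=16.0, gamma=4.0):
    """Rocchio category profile: weighted difference of the positive
    and negative centroids; negative components are clipped to zero,
    as is customary in Rocchio text classification."""
    profile = beta * pos.mean(axis=0) - gamma * neg.mean(axis=0)
    return np.maximum(profile, 0.0)

# toy 4-dimensional tf-idf-like document vectors
pos = np.array([[1.0, 1.0, 0.0, 0.0],
                [1.0, 0.5, 0.0, 0.0]])
neg = np.array([[0.0, 0.5, 1.0, 1.0]])

w = rocchio_profile(pos, neg)
score = np.array([1.0, 0.0, 0.0, 0.0]) @ w  # dot-product score for a test doc
```

A test document is then assigned to the category when its score exceeds a threshold; PRC's contribution is estimating the category-specific parameters from the training data rather than fixing them globally.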

Natural Language Feature Engineering The linguistic features used to train the classifiers:
- POS-tag information: the Brill tagger is used (~95% accuracy).
- Phrases:
  - proper nouns: persons, locations, artifacts
  - complex nominals expressing domain concepts, e.g., bond issues or beach wagon
- Word senses.

Automatic Phrase Extraction For proper nouns, detection is achieved by applying a grammar that takes into account the capital letters of nouns, e.g., International Bureau of Law. For complex nominals, extraction is based on an integration of the symbolic and statistical models presented in [R. Basili, 1997], in three steps:
1. The detection of atomic terms ht, e.g. issue.
2. The identification of admissible candidates, i.e. linguistic structures headed by ht (satisfying linguistically principled grammars).
3. The selection of the final complex nominals via a statistical filter such as mutual information.
The phrases were extracted per category.
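The statistical filter of step 3 can be approximated with pointwise mutual information over adjacent word pairs. The following is a rough sketch of that idea, not the actual filter of [R. Basili, 1997]: bigrams whose PMI exceeds a threshold are kept as candidate complex nominals.

```python
import math
from collections import Counter

def pmi(bigram_count, w1_count, w2_count, n_tokens, n_bigrams):
    """Pointwise mutual information of an adjacent word pair; a high
    value suggests a true collocation rather than chance co-occurrence."""
    return math.log2((bigram_count / n_bigrams) /
                     ((w1_count / n_tokens) * (w2_count / n_tokens)))

def candidate_nominals(tokens, min_pmi=2.0):
    """Keep adjacent pairs whose PMI clears the (illustrative) threshold."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n, m = len(tokens), len(tokens) - 1
    return {pair: pmi(c, uni[pair[0]], uni[pair[1]], n, m)
            for pair, c in bi.items()
            if pmi(c, uni[pair[0]], uni[pair[1]], n, m) >= min_pmi}

toks = "bond issues rose bond issues fell rates rose".split()
scores = candidate_nominals(toks)
# 'bond issues' recurs as a unit, so it scores above the chance pair 'issues rose'
```

In the paper's setting the candidates are additionally constrained by the linguistic grammars of step 2; the PMI filter alone is only the statistical half of the method.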

WSD Algorithms WordNet is used to assign noun senses (the most accurate case). Three WSD algorithms are compared:
- Baseline: assign each noun its most frequent sense.
- An algorithm based on gloss information:
  - exploits the definition of each synset, e.g. { hit, noun } #1 = (a successful stroke in an athletic contest (especially in baseball); "he came all the way around on Williams' hit")
  - selects the sense whose local context (the synset definition) best matches the global context (the context of the target noun), by counting the number of nouns that appear in both contexts.
- The WSD system developed by the LCC (Language Computer Corporation), the one that won SENSEVAL. [A. Kilgarriff, 2000]
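The gloss-based algorithm is essentially a Lesk-style overlap count. A toy sketch with hypothetical glosses for the noun hit (the real algorithm matches against WordNet synset definitions and counts only nouns):

```python
def gloss_overlap_sense(context_words, sense_glosses):
    """Pick the sense whose gloss (local context) shares the most
    words with the target noun's context (global context)."""
    ctx = set(context_words)
    return max(sense_glosses,
               key=lambda s: len(ctx & set(sense_glosses[s])))

# hypothetical glosses for the noun 'hit'
glosses = {
    "hit#1": "a successful stroke in an athletic contest especially in baseball".split(),
    "hit#2": "a physical blow or strike".split(),
}
sense = gloss_overlap_sense(
    "he scored a hit in the baseball contest".split(), glosses)
# → 'hit#1': its gloss shares 'baseball' and 'contest' with the context
```

With ties or near-zero overlaps this heuristic degrades quickly, which is consistent with the paper's observation that it scored below the most-frequent-sense baseline.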

Experiments on Linguistic Features The evaluation of phrases and POS information uses Rocchio, PRC and SVM over the Reuters3, Ohsumed and ANSA collections. The evaluation of semantic information (senses) uses SVM on the Reuters and 20NewsGroups corpora.

Experimental Set-Up Document collections:
- The Reuters corpus, Apté split: 12,902 documents for 90 classes with a fixed split between testing and training (3,299 vs. 9,603).
- The Reuters3 corpus: 11,099 documents for 93 classes, with a split of 3,309 vs. 7,789 between testing and training.
- The ANSA collection: 16,000 news items in Italian from the ANSA news agency, with 8 target categories.
- The Ohsumed collection: 50,216 medical abstracts; only the first 20,000 documents in 23 MeSH disease categories are used.
- The 20NewsGroups corpus (20NG): 19,997 articles for 20 categories taken from the Usenet newsgroups collection; it differs from Reuters and Ohsumed in its larger vocabulary.

Experimental Set-Up Two sets of tokens are considered as baselines:
- the Tokens set, which contains a larger number of features and should provide the most general bag-of-words results;
- the Linguistic-Tokens set (only the nouns, verbs and adjectives), selected using the POS information.
+CN indicates that proper nouns and other complex nominals are used as features for the classifiers. +POS indicates that features are tokens augmented with their POS tags in context. The NLP-derived features are added to the standard token sets, instead of replacing some of them.
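The +POS representation simply keeps each token's tag attached, so that, for example, a noun and a verb reading of the same word remain distinct features. A sketch (the tagged input here is hand-made; the paper obtains the tags from the Brill tagger):

```python
def pos_features(tagged_tokens):
    """Build '+POS' features: token and POS tag fused into one feature,
    so 'book/VB' and 'book/NN' do not collapse into a single entry."""
    return ["%s/%s" % (tok.lower(), tag) for tok, tag in tagged_tokens]

feats = pos_features([("Book", "VB"), ("a", "DT"), ("flight", "NN")])
# → ['book/VB', 'a/DT', 'flight/NN']
```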

Experimental Set-Up Evaluation (microaveraged for global performance):
- Breakeven Point (BEP): the point where precision = recall
- F1 measure: 2PR / (P + R)
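Micro-averaging pools the contingency counts of all categories before computing precision and recall. A minimal sketch of the measure (the counts below are made-up numbers, not results from the paper):

```python
def micro_f1(per_category_counts):
    """Micro-averaged F1: sum TP/FP/FN over all categories,
    then compute P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    tp = sum(c[0] for c in per_category_counts)
    fp = sum(c[1] for c in per_category_counts)
    fn = sum(c[2] for c in per_category_counts)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# two categories, each as (TP, FP, FN)
f = micro_f1([(8, 2, 2), (2, 0, 4)])
```

Because the counts are pooled, frequent categories dominate the micro-average, which is why it is the standard choice for skewed collections like Reuters.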

Cross-Corpora/Classifier Validations of Phrases and POS-information Table 2. Breakeven points of PRC over the Reuters3 corpus. The linguistic features are added to the Linguistic-Tokens set. Do the linguistic features improve the result? An alternative feature set could outperform the bag-of-words in a single experiment, since the classifier parameters could be better suited to a particular training/test-set split. Hence, 20 randomly generated splits (70%-30%) are used for cross validation.

Cross-Corpora/Classifier Validations of Phrases and POS-information Table 3. Rocchio, PRC and SVM performances on different feature sets of the Reuters3 corpus

Cross-Corpora/Classifier Validations of Phrases and POS-information Table 4. Rocchio, PRC and SVM performances on different feature sets of the Ohsumed corpus. Neonatal is improved by the extended features, but this is an isolated case rather than the normal pattern.

Cross Validation on Word Senses The performance of SVM over Tokens is compared with its performance over the Semantic feature sets (= Tokens + disambiguated noun senses). An indicative evaluation of the WSD algorithms (on 250 manually disambiguated nouns from Reuters documents):
- Baseline: 78.43%
- Algorithm 1 (gloss-based): 77.12%
- Algorithm 2 (LCC): 80.55%

Cross Validation on Word Senses Table 6. Performance of SVM on the Reuters corpus. Does semantic information (WSD) enhance the classifier?

Cross Validation on Word Senses Table 7. SVM μF1 performances on 20NewsGroups. When the words are richer in terms of possible senses, the baseline performs worse than Alg2. When all the nouns are replaced with their disambiguated senses, performances from 1 to 3% lower than the bag-of-words are obtained.

Why Do Phrases Not Help? Two properties of phrases are possible explanations:
- Loss of coverage: the word information cannot be easily subsumed by the phrase information, e.g. George_Bush → Bush.
- Poor effectiveness: the information added by a word sequence is poorer than that of its word set.
Two conditions are necessary for a phrase to be better than its word set:
- The words in the sequence should appear non-sequentially in some incorrect documents, e.g. George and Bush appearing non-sequentially in a sports document.
- All the correct documents that contain one of the compounding words (e.g. George or Bush) should at the same time contain the whole sequence.
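These two conditions can be stated operationally. The toy helper below (an illustration built for this argument, not code from the paper) checks them over a small labelled collection: the phrase feature can beat its word set only when some irrelevant document contains the words non-adjacently and every relevant document containing either word also contains the full phrase.

```python
def phrase_adds_information(docs, w1, w2):
    """docs: list of (token_list, is_relevant) pairs.
    Returns True iff both necessary conditions for the phrase
    w1_w2 to outperform the word set {w1, w2} hold."""
    def has_phrase(toks):
        return any(a == w1 and b == w2 for a, b in zip(toks, toks[1:]))
    # condition 1: an irrelevant doc has both words, but not adjacently
    cond1 = any(w1 in t and w2 in t and not has_phrase(t)
                for t, rel in docs if not rel)
    # condition 2: every relevant doc with either word has the phrase
    cond2 = all(has_phrase(t)
                for t, rel in docs if rel and (w1 in t or w2 in t))
    return cond1 and cond2

docs = [
    ("george bush visited europe".split(), True),
    ("george watched bush league baseball".split(), False),
]
ok = phrase_adds_information(docs, "george", "bush")          # both hold
docs_bad = docs + [("bush signed the bill".split(), True)]
bad = phrase_adds_information(docs_bad, "george", "bush")     # condition 2 fails
```

The rarity of collections satisfying both conditions simultaneously is, in essence, the paper's explanation of why phrase features fail to help.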

Why Do Senses Not Help? The senses of a noun in the documents of a category tend to be always the same. Moreover, different categories are characterized by different words rather than by different senses. A general view: textual representations are always very good at capturing the overall semantics of documents, at least as good as linguistically justified representations. IR methods oriented to textual representations of document semantics should be investigated first, and they should stress the role of words as vehicles of natural language semantics (as opposed to logical systems of semantic types, like ontologies).

Conclusions This paper reports a study of advanced document representations for TC. Several combinations of different feature sets have been extensively experimented with, using three classifiers (Rocchio, PRC and SVM) over 4 corpora in two languages. The results have shown that neither semantic information (word senses) nor syntactic information (phrases and POS-tags) achieves the goal of improvement. The outcome of this careful analysis is not a negative statement on the role of complex linguistic features in TC; rather, it suggests that the elementary textual representation based on words is very effective.