1 Complementarity of Lexical and Simple Syntactic Features: The SyntaLex Approach to SENSEVAL-3. Saif Mohammad, Ted Pedersen. University of Toronto, Toronto; University of Minnesota, Duluth.

2 Supervised Learning: Decision Trees. Decision trees used as classifiers; one classifier per task (target word). A sense is assigned by asking a series of questions; each question corresponds to a feature of the instance. Weka's J48 implementation of the C4.5 algorithm employed.

3 Decision Tree. (Diagram: internal nodes ask questions about Feature 1 through Feature 4; leaf nodes assign Sense 1 through Sense 4.)
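Read top-down, the tree is a cascade of feature questions ending in a sense. A minimal pure-Python rendering of such a classifier; the exact branching of the slide's diagram is not fully recoverable, so the feature names and default leaf below are illustrative placeholders:

```python
def classify(instance):
    """Assign a sense by asking a series of feature questions.

    `instance` maps placeholder feature names to booleans; the
    branching is illustrative, not the slide's exact tree.
    """
    if instance["feature1"]:
        if instance["feature2"]:
            return "SENSE 1"
        return "SENSE 2"
    if instance["feature3"]:
        return "SENSE 3"
    if instance["feature4"]:
        return "SENSE 4"
    return "SENSE 1"  # default leaf (placeholder)

classify({"feature1": False, "feature2": False,
          "feature3": True, "feature4": False})
# → 'SENSE 3'
```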

4 Why Decision Trees? Context is a rich source of discrete features. The learned model is likely to be meaningful and may provide insight into the interaction of features. Drawback: training-data fragmentation. Pedersen [2001]*: choosing the right features is of greater significance than the learning algorithm itself. * "A Decision Tree of Bigrams is an Accurate Predictor of Word Sense", Pedersen, T., Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01), June 2-7, 2001, Pittsburgh, PA.

5 Features. Represent context with lexical and simple syntactic features. Lexical features: bigrams of words. Syntactic features: parts of speech.

6 Lexical Features: Bigrams. Bigrams: two-word sequences in text. Example: "The interest rate is low" yields the bigrams the interest, interest rate, rate is, is low. The Ngram Statistics Package* identifies statistically significant bigrams. * "The Design, Implementation and Use of the Ngram Statistics Package", Banerjee, S. and Pedersen, T., Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, February 2003, Mexico City.
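Extracting the bigrams themselves is a one-liner; the Ngram Statistics Package then filters them by a test of statistical significance, a step not shown in this sketch:

```python
def bigrams(text):
    """Two-word sequences in text, lowercased."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

bigrams("The interest rate is low")
# → ['the interest', 'interest rate', 'rate is', 'is low']
```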

7 Part-of-Speech Features. A word used in different senses is likely to have different sets of PoS tags around it. Why did Jack turn/VB against/IN his/PRP$ team/NN. Why did Jack turn/VB left/NN at/IN the/DT crossing. Individual word PoS features: P-2, P-1, P0, P1, P2. P1 = JJ implies that the word to the right of the target word is an adjective.
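The window features P-2 through P2 can be read off a tagged sentence directly. A sketch; the tags for "Why did Jack" (WRB/VBD/NNP) are assumed here, since the slide only tags the words from the target onward:

```python
def pos_window(tags, i, width=2):
    """PoS features P-2..P2 around target position i (None off the ends)."""
    return {f"P{j}": tags[i + j] if 0 <= i + j < len(tags) else None
            for j in range(-width, width + 1)}

# "Why did Jack turn left at the crossing"; WRB/VBD/NNP are assumed tags.
tags = ["WRB", "VBD", "NNP", "VB", "NN", "IN", "DT", "NN"]
pos_window(tags, 3)  # target word: "turn"
# → {'P-2': 'VBD', 'P-1': 'NNP', 'P0': 'VB', 'P1': 'NN', 'P2': 'IN'}
```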

8 Part-of-Speech Tagging. Brill Tagger: a widely used tool. Certain target words are mis-tagged to completely different parts of speech. Pre-tagging is the act of manually assigning tags to selected words in a text prior to tagging. The Brill tagger does not guarantee that pre-tagged words keep their tags; BrillPatch* does. * "Guaranteed Pre-Tagging for the Brill Tagger", Mohammad, S. and Pedersen, T., Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, February 2003, Mexico City.

9 SyntaLex Entries. SyntaLex-1: parts of speech of a narrow window of words around the target word: P-1 = <tag> or P0 = <tag> or P1 = <tag>. SyntaLex-2: parts of speech of a broader window: P-2 = <tag> or P-1 = <tag> or P0 = <tag> or P1 = <tag> or P2 = <tag>. Here <tag> represents a part of speech.

10 SyntaLex Entries (continued). SyntaLex-3: individual classifiers are learned from bigrams and from parts of speech (narrow context). Given a test instance, both classifiers assign probabilities to every sense; the sense with the highest sum is chosen. SyntaLex-4: a single unified decision tree in which either bigram or part-of-speech features (narrow context) may appear at a particular node.
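SyntaLex-3's combination rule is a plain sum of per-classifier sense probabilities. A sketch; the probability values below are invented for illustration:

```python
def ensemble_sense(prob_dists):
    """Sum each sense's probability across classifiers; pick the max."""
    totals = {}
    for probs in prob_dists:
        for sense, p in probs.items():
            totals[sense] = totals.get(sense, 0.0) + p
    return max(totals, key=totals.get)

bigram_probs = {"sense1": 0.6, "sense2": 0.4}  # bigram-based tree (made up)
pos_probs = {"sense1": 0.3, "sense2": 0.7}     # PoS-based tree (made up)
ensemble_sense([bigram_probs, pos_probs])
# → 'sense2' (sums: sense1 = 0.9, sense2 = 1.1)
```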

11 Fine-Grained Test Results

                                  Overall   Nouns   Verbs   Adjectives
Majority                           56.5%    55.0%   58.0%     54.1%
SyntaLex-1 (PoS, narrow)           62.4%    58.7%   67.0%     48.0%
SyntaLex-2 (PoS, broad)            61.8%    57.7%   66.5%     55.0%
SyntaLex-3 (ensemble PoS+bigram)   64.6%    62.5%   67.6%     51.6%
SyntaLex-4 (combined PoS+bigram)   63.3%    62.2%   65.3%     49.1%


13 Coarse-Grained Test Results

                                  Overall   Nouns   Verbs   Adjectives
SyntaLex-1 (PoS, narrow)           69.1%    65.1%   73.3%     61.7%
SyntaLex-2 (PoS, broad)            68.4%    64.1%   73.1%     60.1%
SyntaLex-3 (ensemble PoS+bigram)   72.0%    69.6%   74.9%     64.2%
SyntaLex-4 (combined PoS+bigram)   71.1%    69.5%   73.4%     62.0%

14 Observations. SyntaLex-1 (narrow context) is just as good as SyntaLex-2 (broader context): the amount of training data per task is low, so the weak indicators (P-2 and P2) are overwhelmed. Results* on SENSEVAL-1, SENSEVAL-2, and the line, hard, serve and interest data support this hypothesis. * "Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation", Mohammad, S. and Pedersen, T., Proceedings of the Eighth Conference on Natural Language Learning at HLT-NAACL, May 2004, Boston.

15 Observations (continued). Pedersen's entry (Duluth-ELSS*) uses only lexical features; its fine-grained accuracy of 61.7% is comparable to SyntaLex-2 (61.8%). Using both kinds of features (SyntaLex-3 and 4) has provided some improvement. * "The Duluth Lexical Sample Systems in Senseval-3", Pedersen, T., Proceedings of the Eighth Conference on Natural Language Learning at HLT-NAACL, May 2004, Boston.

16 Thoughts… Lexical and syntactic features perform comparably, but do they get the same instances right? That is, how redundant are the individual feature sets? Are there instances correctly disambiguated by one feature set and not by the other? That is, how complementary are the individual feature sets? Can we do better than the results obtained by a simple ensemble? Is the effort to combine lexical and syntactic features justified?

17 Measures. Baseline Ensemble: accuracy of a hypothetical ensemble that predicts the sense correctly only if both individual feature sets do so; quantifies redundancy between the feature sets. SENSEVAL-3, with SyntaLex: 52.9% (fine-grained). Optimal Ensemble: accuracy of a hypothetical ensemble that predicts the sense correctly if either of the individual feature sets does so; its difference from the individual accuracies quantifies complementarity. SENSEVAL-3, with SyntaLex: 72.1% (fine-grained).
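Given per-instance correctness flags for the two feature sets, both measures are a single pass over the test data. A sketch; the flag vectors below are invented for illustration (the slide's 52.9% and 72.1% come from the actual SyntaLex runs):

```python
def baseline_and_optimal(correct_a, correct_b):
    """Baseline ensemble: both feature sets right (quantifies redundancy).
    Optimal ensemble: either feature set right (complementarity bound)."""
    n = len(correct_a)
    baseline = sum(a and b for a, b in zip(correct_a, correct_b)) / n
    optimal = sum(a or b for a, b in zip(correct_a, correct_b)) / n
    return baseline, optimal

lexical_right = [True, True, False, True, False]    # made-up flags
syntactic_right = [True, False, True, True, False]  # made-up flags
baseline_and_optimal(lexical_right, syntactic_right)
# → (0.4, 0.8)
```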

18 Conclusions. There is a significant amount of complementarity between the studied lexical and syntactic features, so combining the two is justified. We show that bigram and part-of-speech features can achieve state-of-the-art results. How best to capitalize on the complementarity is still an open issue. The parts of speech of words immediately adjacent to the target word are more useful than those of a larger context.

19 Software. SyntaLex: a system for WSD using lexical and syntactic features; utilizes Weka's decision tree learner. posSenseval: assigns parts of speech to any data in SENSEVAL-2 data format and outputs tagged data in the same format; employs the Brill Tagger. BrillPatch: a patch to the Brill Tagger that guarantees pre-tagging. Thank You.