1 Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation
Saif Mohammad (University of Toronto) and Ted Pedersen (University of Minnesota)

2 Word Sense Disambiguation
Harry cast a bewitching spell
Humans immediately understand spell to mean a charm or incantation, not reading out letter by letter or a period of time. Words with multiple senses: polysemy, ambiguity! We utilize background knowledge and context; machines lack background knowledge.
Automatically identifying the intended sense of a word in written text, based on its context, remains a hard problem. The best accuracies in recent international evaluations are around 65%.

3 Why do we need WSD?
Information Retrieval. Query: cricket bat. Documents pertaining to the insect and the mammal are irrelevant.
Machine Translation. Consider English to Hindi translation: should head go to sar (upper part of the body) or adhyaksh (leader)?
Machine-human interaction. Instructions to machines. Interactive home system: turn on the lights. Domestic android: get the door.
Applications are widespread and will affect our way of life.

4 Terminology
Harry cast a bewitching spell
Target word: the word whose intended sense is to be identified (spell).
Context: the sentence housing the target word and possibly one or two sentences around it (Harry cast a bewitching spell).
Instance: the target word along with its context.
WSD is a classification problem wherein the occurrence of the target word is assigned to one of its many possible senses.

5 Corpus-Based Supervised Machine Learning
A computer program is said to learn from experience … if its performance at tasks … improves with experience. (Mitchell)
Task: word sense disambiguation of given test instances.
Performance: ratio of instances correctly disambiguated to the total test instances, i.e. accuracy.
Experience: manually created instances in which target words are marked with the intended sense, i.e. training instances. Harry cast a bewitching spell / incantation
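
A minimal sketch of the accuracy measure just defined; the predicted and gold sense labels are invented for illustration.

```python
def accuracy(predicted, gold):
    """Ratio of instances disambiguated correctly to total test instances."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Hypothetical sense labels: 3 of 4 test instances are correct -> 0.75
print(accuracy(["incantation", "period", "incantation", "letters"],
               ["incantation", "period", "charm", "letters"]))
```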

6 Decision Trees
A kind of classifier. Assigns a class by asking a series of questions. Questions correspond to features of the instance; the question asked depends on the answer to the previous question.
Inverted tree structure of interconnected nodes. The topmost node is called the root. Each node corresponds to a question / feature, and each possible value of the feature has a corresponding branch. Leaves terminate every path from the root, and each leaf is associated with a class.

7 WSD Tree
[Figure: an example decision tree for WSD. The root asks Feature 1?; internal nodes ask Feature 2?, Feature 3? and Feature 4?; each leaf assigns one of Sense 1 through Sense 4.]
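
A minimal sketch of how such a tree assigns a sense: start at the root, answer one feature question per node, follow the matching branch, and stop at a leaf. The feature names, branch values and senses below are invented for illustration.

```python
# Each internal node is (question, {answer: subtree}); each leaf is a sense.
tree = ("Feature 1?", {
    "yes": ("Feature 2?", {"yes": "SENSE 1", "no": "SENSE 2"}),
    "no":  ("Feature 3?", {"yes": "SENSE 3", "no": "SENSE 4"}),
})

def classify(node, instance):
    """Follow one root-to-leaf path, asking one question per node."""
    while not isinstance(node, str):      # leaves are plain sense labels
        question, branches = node
        answer = "yes" if instance.get(question) else "no"
        node = branches[answer]
    return node

print(classify(tree, {"Feature 1?": True, "Feature 2?": False}))  # SENSE 2
```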

8 Choice of Learning Algorithm
Why use decision trees for WSD? They have drawbacks, such as training data fragmentation. What about other learning algorithms such as neural networks?
Context is a rich source of discrete features. The learned model is likely meaningful, and may provide insight into the interaction of features.
Pedersen [2001]*: choosing the right features is of greater significance than the learning algorithm itself.
* T. Pedersen, A Decision Tree of Bigrams is an Accurate Predictor of Word Sense, in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01), June 2-7, 2001, Pittsburgh, PA.

9 Lexical Features
Surface form: a word as we observe it in text. case (n): 1. object of investigation; 2. frame or covering; 3. a weird person. Surface forms: case, cases, casing. An occurrence of casing suggests sense 2.
Unigrams and bigrams: one-word and two-word sequences in text.
The interest rate is low
Unigrams: the, interest, rate, is, low
Bigrams: the interest, interest rate, rate is, is low
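
A minimal sketch of extracting these two feature types, reproducing the example sentence above:

```python
tokens = "the interest rate is low".split()

unigrams = tokens                                           # one-word sequences
bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]  # two-word sequences

print(unigrams)  # ['the', 'interest', 'rate', 'is', 'low']
print(bigrams)   # ['the interest', 'interest rate', 'rate is', 'is low']
```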

10 Part of Speech Tagging
Brill Tagger: the most widely used tool. Accuracy around 95%. Source code available. Easily understood rules.
Pre-tagging is the act of manually assigning tags to selected words in a text prior to tagging. The Brill tagger does not guarantee pre-tagging; a patch to the tagger is provided: BrillPatch*.
* S. Mohammad and T. Pedersen, Guaranteed Pre-Tagging for the Brill Tagger, in Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2003), February 2003, Mexico City, Mexico.

11 Part of Speech Features
A word used in different senses is likely to have different sets of POS tags around it.
Why did Jack turn/VB against/IN his/PRP$ team/NN
Why did Jack turn/VB left/NN at/IN the/DT crossing
Features used: the POS of individual words, P-2, P-1, P0, P1, P2, and combinations of the above. P1 = JJ implies that the word to the right of the target word is an adjective.
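
A minimal sketch of the P-2 through P2 features over the second example sentence above; the helper name pos_window is hypothetical, not from the thesis.

```python
tagged = [("why", "WRB"), ("did", "VBD"), ("jack", "NNP"), ("turn", "VB"),
          ("left", "NN"), ("at", "IN"), ("the", "DT"), ("crossing", "NN")]

def pos_window(tagged, target_index, width=2):
    """Return {offset: tag} for P-width .. P+width around the target word;
    positions outside the sentence get a None tag."""
    features = {}
    for offset in range(-width, width + 1):
        i = target_index + offset
        tag = tagged[i][1] if 0 <= i < len(tagged) else None
        features[f"P{offset}"] = tag
    return features

# Target word is "turn" (index 3): P-2=VBD, P-1=NNP, P0=VB, P1=NN, P2=IN
print(pos_window(tagged, 3))
```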

12 Parse Features
Collins Parser used to parse the data. Source code available. Uses part-of-speech tagged data as input.
Head word of a phrase: the hard work, the hard surface
Phrase itself: noun phrase, verb phrase and so on.
Parent: head word of the parent phrase. fasten the line, cross the line
Parent phrase.

13 Sample Parse Tree
(SENTENCE
  (NOUN PHRASE Harry/NNP)
  (VERB PHRASE cast/VBD
    (NOUN PHRASE a/DT bewitching/JJ spell/NN)))
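
A minimal sketch of reading the phrase, head word, and parent features off such a parse, here via NLTK's bracketed-tree reader. The head rule below is a crude simplification for illustration only; the thesis relies on the Collins parser's head-finding rules.

```python
from nltk import Tree  # pip install nltk

parse = Tree.fromstring(
    "(S (NP (NNP Harry))"
    "   (VP (VBD cast) (NP (DT a) (JJ bewitching) (NN spell))))")

def head_word(phrase):
    """Crude head rule for illustration: a VP is headed by its first verb;
    anything else by its last noun, falling back to its last word."""
    if phrase.label() == "VP":
        for word, tag in phrase.pos():
            if tag.startswith("VB"):
                return word
    nouns = [word for word, tag in phrase.pos() if tag.startswith("NN")]
    return nouns[-1] if nouns else phrase.leaves()[-1]

target = "spell"
# Phrases containing the target, from the root down (height > 2 skips
# single tagged words); the last one is the target's own phrase.
chain = [t for t in parse.subtrees()
         if target in t.leaves() and t.height() > 2]
phrase, parent = chain[-1], chain[-2]

print("phrase:", phrase.label(), "| head word:", head_word(phrase))  # NP, spell
print("parent:", parent.label(), "| head word:", head_word(parent))  # VP, cast
```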

14 Sense-Tagged Data
Senseval-2 data: 4,328 test instances and 8,611 training instances, ranging over 73 different nouns, verbs and adjectives.
Senseval-1 data: 8,512 test instances and 13,276 training instances, ranging over 35 nouns, verbs and adjectives.
line, hard, serve and interest data: 4,149, 4,337, 4,378 and 2,476 sense-tagged instances, respectively, with line, hard, serve and interest as the head words.
Around 50,000 sense-tagged instances in all!

15 Experiments

16 Lexical: Senseval-1 & Senseval-2

              Sval-2   Sval-1   line    hard    serve   interest
Majority      47.7%    56.3%    54.3%   81.5%   42.2%   54.9%
Surface Form  49.3%    62.9%    54.3%   81.5%   44.2%   64.0%
Unigram       55.3%    66.9%    74.5%   83.4%   73.3%   75.7%
Bigram        55.1%    66.9%    72.9%   89.5%   72.1%   79.9%

17 Individual Word POS (Senseval-1)

          All      Nouns    Verbs    Adj.
Majority  56.3%    57.2%    56.9%    64.3%
P-2       —        58.2%    58.6%    64.0%
P-1       —        62.2%    58.2%    64.3%
P0        60.3%    62.5%    58.2%    64.3%
P1        63.9%    65.4%    64.4%    66.2%
P2        —        60.0%    60.8%    65.2%

18 Individual Word POS (Senseval-2)

          All      Nouns    Verbs    Adj.
Majority  47.7%    51.0%    39.7%    59.0%
P-2       —        51.9%    38.0%    57.9%
P-1       —        55.2%    40.2%    59.0%
P0        49.9%    55.7%    40.6%    58.2%
P1        53.1%    53.8%    49.1%    61.0%
P2        —        50.2%    43.2%    59.4%

19 Combining POS Features

                      Sval-2   Sval-1   line    hard    serve   interest
Majority              47.7%    56.3%    54.3%   81.5%   42.2%   54.9%
P0, P1                —        66.7%    54.1%   81.9%   60.2%   70.5%
P-1, P0, P1           —        68.0%    60.4%   84.8%   73.0%   78.8%
P-2, P-1, P0, P1, P2  —        67.8%    62.3%   86.2%   75.7%   80.6%

20 Parse Features (Senseval-1)

           All      Nouns    Verbs    Adj.
Majority   56.3%    57.2%    56.9%    64.3%
Head       64.3%    70.9%    59.8%    66.9%
Parent     60.6%    62.6%    60.3%    65.8%
Phrase     58.5%    57.5%    57.2%    66.2%
Par. Phr.  57.9%    58.1%    58.3%    66.2%

21 Parse Features (Senseval-2)

           All      Nouns    Verbs    Adj.
Majority   47.7%    51.0%    39.7%    59.0%
Head       51.7%    58.5%    39.8%    64.0%
Parent     50.0%    56.1%    40.1%    59.3%
Phrase     48.3%    51.7%    40.3%    59.5%
Par. Phr.  48.5%    53.0%    39.1%    60.3%

22 Thoughts…
Both lexical and syntactic features perform comparably. But do they get the same instances right? That is, how redundant are the individual feature sets?
Are there instances correctly disambiguated by one feature set and not by the other? That is, how complementary are the individual feature sets?
Is the effort to combine lexical and syntactic features justified?

23 Measures
Baseline Ensemble: accuracy of a hypothetical ensemble which predicts the sense correctly only if both individual feature sets do so. Quantifies redundancy amongst feature sets.
Optimal Ensemble: accuracy of a hypothetical ensemble which predicts the sense correctly if either of the individual feature sets does so. Its difference from the individual accuracies quantifies complementarity.
We used a simple ensemble which sums up the probabilities assigned to each sense by the individual feature sets to decide the intended sense.
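
A minimal sketch of the three quantities on this slide; the probability distributions, sense labels and gold tags are invented for illustration.

```python
def baseline_ensemble(preds_a, preds_b, gold):
    """Correct only if BOTH feature sets are correct: quantifies redundancy."""
    hits = sum(a == g and b == g for a, b, g in zip(preds_a, preds_b, gold))
    return hits / len(gold)

def optimal_ensemble(preds_a, preds_b, gold):
    """Correct if EITHER feature set is correct: an upper bound whose gap
    from the individual accuracies quantifies complementarity."""
    hits = sum(a == g or b == g for a, b, g in zip(preds_a, preds_b, gold))
    return hits / len(gold)

def sum_ensemble(probs_a, probs_b):
    """The simple ensemble: per instance, add the two sense probability
    distributions and pick the highest-scoring sense."""
    return [max(da, key=lambda s: da[s] + db[s])
            for da, db in zip(probs_a, probs_b)]

# Invented example: two feature sets, two instances, senses s1/s2.
gold = ["s1", "s2"]
lexical   = [{"s1": 0.6, "s2": 0.4}, {"s1": 0.7, "s2": 0.3}]
syntactic = [{"s1": 0.2, "s2": 0.8}, {"s1": 0.4, "s2": 0.6}]
preds_lex = [max(d, key=d.get) for d in lexical]    # ['s1', 's1']
preds_syn = [max(d, key=d.get) for d in syntactic]  # ['s2', 's2']

print(baseline_ensemble(preds_lex, preds_syn, gold))  # 0.0: never both right
print(optimal_ensemble(preds_lex, preds_syn, gold))   # 1.0: one always right
print(sum_ensemble(lexical, syntactic))               # ['s2', 's1']
```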

24 Best Combinations

Data (majority)    Set 1            Set 2                Base    Ens.    Opt.    Best
Sval-2 (47.7%)     Unigrams 55.3%   P-1,P0,P1 —          43.6%   57.0%   67.9%   66.7%
Sval-1 (56.3%)     Unigrams 66.9%   P-1,P0,P1 68.0%      57.6%   71.1%   78.0%   81.1%
line (54.3%)       Unigrams 74.5%   P-1,P0,P1 60.4%      55.1%   74.2%   82.0%   88.0%
hard (81.5%)       Bigrams 89.5%    Head, Parent 87.7%   86.1%   88.9%   91.3%   83.0%
serve (42.2%)      Unigrams 73.3%   P-1,P0,P1 73.0%      58.4%   81.6%   89.9%   83.0%
interest (54.9%)   Bigrams 79.9%    P-1,P0,P1 78.8%      67.6%   83.2%   90.1%   89.0%

25 Conclusions
There is a significant amount of complementarity across lexical and syntactic features; combining the two is justified.
We show that simple lexical and part-of-speech features can achieve state-of-the-art results.
How best to capitalize on the complementarity is still an open issue.

26 Conclusions (continued)
The part of speech of the word immediately to the right of the target word (P1) was found most useful. It is best for verbs and adjectives; nouns are helped by tags on either side.
(P0, P1) was found to be most potent when there is little training data per word (Sval data). A larger POS context (P-2, P-1, P0, P1, P2) is beneficial when training data per word is large (line, hard, serve and interest data).
The head word of the phrase is particularly useful for adjectives; nouns are helped by both head and parent.

27 Code, Data & Resources
SyntaLex: a system for WSD using lexical and syntactic features. Weka's decision tree learning algorithm is utilized.
posSenseval: part-of-speech tags any data in Senseval-2 data format. Brill Tagger used.
parseSenseval: parses data in the format output by the Brill Tagger. Output is in Senseval-2 data format with part-of-speech and parse information as XML tags. Uses Collins Parser.
Packages to convert line, hard, serve and interest data to Senseval-1 and Senseval-2 data formats.
BrillPatch: a patch to the Brill Tagger to employ Guaranteed Pre-Tagging.

28 Senseval-3 (March 1 to April 15, 2004)
Around 8,000 training and 4,000 test instances. Results expected shortly.
Thank You