Presentation transcript:

1 Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation
Master's Thesis: Saif Mohammad
Advisor: Dr. Ted Pedersen
University of Minnesota, Duluth
Date: August 1, 2003

2 Path Map
Introduction → Background → Data → Experiments → Conclusions

3 Word Sense Disambiguation
- Harry cast a bewitching spell
- Humans immediately understand spell to mean a charm or incantation, not reading out letter by letter, or a period of time
- Words with multiple senses: polysemy, ambiguity
- Humans utilize background knowledge and context; machines lack background knowledge
- Automatically identifying the intended sense of a word in written text, based on its context, remains a hard problem
- Features are identified from the context
- Best accuracies in the latest international evaluation: around 65%

4 Why do we need WSD?
- Information Retrieval
  - Query: cricket bat; documents pertaining to the insect and the mammal are irrelevant
- Machine Translation
  - Consider English-to-Hindi translation: head to sar (upper part of the body) or adhyaksh (leader)
- Machine-human interaction
  - Instructions to machines; interactive home system: turn on the lights; domestic android: get the door
- Applications are widespread and will affect our way of life

5 Terminology
- Harry cast a bewitching spell
- Target word: the word whose intended sense is to be identified (spell)
- Context: the sentence housing the target word and possibly 1 or 2 sentences around it (Harry cast a bewitching spell)
- Instance: the target word along with its context
- WSD is a classification problem wherein the occurrence of the target word is assigned to one of its many possible senses

6 Corpus-Based Supervised Machine Learning
- "A computer program is said to learn from experience … if its performance at tasks … improves with experience" (Mitchell)
- Task: word sense disambiguation of given test instances
- Performance: ratio of instances correctly disambiguated to the total test instances; accuracy
- Experience: manually created instances such that target words are marked with the intended sense; training instances
  - Harry cast a bewitching spell/incantation

7 Path Map
Introduction → Background → Data → Experiments → Conclusions

8 Decision Trees
- A kind of classifier: assigns a class by asking a series of questions
- Questions correspond to features of the instance; the question asked depends on the answer to the previous question
- Inverted tree structure of interconnected nodes
  - The topmost node is called the root
  - Each node corresponds to a question / feature
  - Each possible value of a feature has a corresponding branch
  - Leaves terminate every path from the root; each leaf is associated with a class

9 Automating Toy Selection for Max
[Tree diagram, flattened in transcription: the root asks "Moving Parts?"; internal nodes ask about Color, Size, and Car; branches are labeled Yes/No, Blue/Red/Other, Big/Small; leaves carry the classes LOVE, SO LOVE, HATE, SO HATE. Callouts label the ROOT, NODES, and LEAVES.]

10 WSD Tree
[Tree diagram, flattened in transcription: internal nodes ask about Feature 1 through Feature 4; leaves assign SENSE 1 through SENSE 4.]

11 Issues…
- Why use decision trees for WSD?
- How are decision trees learnt? The ID3 and C4.5 algorithms
- What is bagging, and what are its advantages?
- Drawbacks of decision trees and of bagging
- Pedersen [2002]: choosing the right features is of greater significance than the learning algorithm itself
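As a concrete illustration of the classifier described on this slide, here is a minimal scikit-learn sketch (the thesis itself pre-dates this library): a decision tree wrapped in bagging. The feature names and the two toy training instances are invented for illustration.

```python
# A minimal sketch, not the thesis implementation: a bagged decision-tree
# WSD classifier over binary features. All feature names and data below
# are hypothetical.
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

train_X = [
    {"unigram=charm": 1, "P1=NN": 1},    # toy instance: spell = incantation
    {"unigram=letter": 1, "P1=IN": 1},   # toy instance: spell = spell out
]
train_y = ["incantation", "spell_out"]

model = make_pipeline(
    DictVectorizer(),
    # Bagging trains many trees on bootstrap resamples of the training data
    # and votes, reducing the variance of a single decision tree.
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0),
)
model.fit(train_X, train_y)
print(model.predict([{"unigram=charm": 1, "P1=NN": 1}]))  # expected: ['incantation']
```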

12 Lexical Features
- Surface form: a word as we observe it in text
  - case (n): 1. object of investigation; 2. frame or covering; 3. a weird person
  - Surface forms: case, cases, casing
  - An occurrence of casing suggests sense 2
- Unigrams and bigrams: one-word and two-word sequences in text
  - The interest rate is low
  - Unigrams: the, interest, rate, is, low
  - Bigrams: the interest, interest rate, rate is, is low
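A small sketch of extracting the unigram and bigram features just described; the function name is illustrative.

```python
# Minimal n-gram feature extraction over a token list.
def ngram_features(tokens, n):
    """Return the n-word sequences occurring in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the interest rate is low".split()
print(ngram_features(tokens, 1))  # ['the', 'interest', 'rate', 'is', 'low']
print(ngram_features(tokens, 2))  # ['the interest', 'interest rate', 'rate is', 'is low']
```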

13 Part of Speech Tagging
- Prerequisite for many natural language tasks: parsing, WSD, anaphora resolution
- Brill Tagger: the most widely used tool
  - Accuracy around 95%
  - Source code available
  - Easily understood rules
- Harry/NNP cast/VBD a/DT bewitching/JJ spell/NN
  - NNP = proper noun, VBD = past-tense verb, DT = determiner, JJ = adjective, NN = noun
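The thesis uses the Brill tagger; as an illustrative stand-in with the same input/output contract, NLTK's off-the-shelf tagger can reproduce the kind of tagging shown above (assuming its tokenizer and tagger data packages are installed).

```python
# Illustrative stand-in for the Brill tagger, not the thesis tool chain.
# Requires NLTK's tokenizer and tagger data packages to be downloaded.
import nltk

tokens = nltk.word_tokenize("Harry cast a bewitching spell")
print(nltk.pos_tag(tokens))
# expected along the lines of:
# [('Harry', 'NNP'), ('cast', 'VBD'), ('a', 'DT'), ('bewitching', 'JJ'), ('spell', 'NN')]
```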

14 Pre-Tagging
- Pre-tagging is the act of manually assigning tags to selected words in a text prior to tagging
  - Mona will sit in the pretty chair//NN this time
  - chair is the pre-tagged word, NN is its pre-tag
- Pre-tagged words act as reliable anchors or seeds around which tagging is done
- The Brill Tagger facilitates pre-tagging, but the pre-tag is not always respected!
  - Mona/NNP will/MD sit/VB in/IN the/DT pretty/RB chair//VB this/DT time/NN

15 Contextual Rules
- Initial state tagger: assigns the most frequent tag for a type, based on entries in a lexicon (pre-tag respected)
- Final state tagger: may modify the tag of a word based on context (pre-tag not given special treatment)

Relevant lexicon entries:
  Type     Most frequent tag   Other possible tags
  chair    NN (noun)           VB (verb)
  pretty   RB (adverb)         JJ (adjective)

Relevant contextual rules:
  Current tag   New tag   When
  NN            VB        NEXTTAG DT
  RB            JJ        NEXTTAG NN

16 Guaranteed Pre-Tagging
- A patch to the tagger is provided: BrillPatch
  - Application of contextual rules to the pre-tagged words is bypassed
  - Application of contextual rules to non-pre-tagged words is unchanged
- Mona/NNP will/MD sit/VB in/IN the/DT pretty/JJ chair//NN this/DT time/NN
  - Tag of chair retained as NN: the contextual rule changing chair from NN to VB is not applied
  - Tag of pretty transformed: the contextual rule changing pretty from RB to JJ is applied
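A toy sketch of the idea, not the BrillPatch code: simplified NEXTTAG rules are applied everywhere except at pre-tagged positions. The rule encoding below is invented for illustration and is not the Brill tagger's rule-file syntax.

```python
# Toy sketch of Guaranteed Pre-Tagging: contextual rules are skipped for
# pre-tagged tokens but still applied to their neighbours.
RULES = [("NN", "VB", "DT"),   # NN -> VB when the next tag is DT
         ("RB", "JJ", "NN")]   # RB -> JJ when the next tag is NN

def apply_rules(tags, pretagged):
    tags = list(tags)
    for i in range(len(tags) - 1):
        if i in pretagged:          # guaranteed pre-tagging: leave pre-tags alone
            continue
        for old, new, nexttag in RULES:
            if tags[i] == old and tags[i + 1] == nexttag:
                tags[i] = new
    return tags

# "Mona will sit in the pretty chair this time"; chair (index 6) pre-tagged NN
tags = ["NNP", "MD", "VB", "IN", "DT", "RB", "NN", "DT", "NN"]
print(apply_rules(tags, pretagged={6}))
# pretty: RB -> JJ (rule fires); chair keeps its pre-tag NN (NN -> VB suppressed)
```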

17 Part of Speech Features
- A word in different parts of speech has different senses
- A word used in different senses is likely to have different sets of POS around it
  - Why did Jack turn/VB against/IN his/PRP$ team/NN
  - Why did Jack turn/VB left/VBN at/IN the/DT crossing
- Features used
  - Individual word POS: P-2, P-1, P0, P1, P2 (e.g., P2 = JJ means the word two positions to the right is an adjective)
  - Sequential POS: P-1 P0, P-1 P0 P1, and so on (e.g., P-1 P0 = NN, VB means the previous word is a noun and the target word is a verb)
  - A combination of the above
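A minimal sketch of building these POS-window features for one instance; the dictionary keys are illustrative.

```python
# Individual tags P-2 .. P2 around the target word, plus sequential
# combinations such as "P-1 P0".
def pos_window_features(tags, target_index):
    feats = {}
    for offset in range(-2, 3):                       # P-2 .. P2
        i = target_index + offset
        feats[f"P{offset}"] = tags[i] if 0 <= i < len(tags) else "NONE"
    feats["P-1 P0"] = feats["P-1"] + " " + feats["P0"]
    feats["P-1 P0 P1"] = feats["P-1 P0"] + " " + feats["P1"]
    return feats

# "Why did Jack turn against his team"; target word "turn" at index 3
tags = ["WRB", "VBD", "NNP", "VB", "IN", "PRP$", "NN"]
print(pos_window_features(tags, 3))
# {'P-2': 'VBD', 'P-1': 'NNP', 'P0': 'VB', 'P1': 'IN', 'P2': 'PRP$', ...}
```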

18 Parse Features
- Collins Parser used to parse the data
  - Source code available
  - Uses part-of-speech tagged data as input
- Head word of a phrase: the hard work, the hard surface (head words: work, surface)
- The phrase itself: noun phrase, verb phrase, and so on
- Parent: head word of the parent phrase: fasten the line, cross the line (parent head words: fasten, cross)
- Parent phrase

19 Sample Parse Tree
[Parse tree for "Harry cast a bewitching spell", flattened in transcription: SENTENCE splits into a NOUN PHRASE (Harry/NNP) and a VERB PHRASE (cast/VBD) that contains a NOUN PHRASE (a/DT bewitching/JJ spell/NN).]
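To make the parse features concrete, here is a small sketch using NLTK to walk the parse tree above (written with standard Penn Treebank labels). The "last word of the phrase" head rule is a naive stand-in for the Collins Parser's actual head-finding rules.

```python
# Illustrative sketch only: reading phrase type and an approximate head
# word from a bracketed parse; not the thesis's Collins Parser pipeline.
from nltk.tree import Tree

parse = Tree.fromstring(
    "(S (NP (NNP Harry)) (VP (VBD cast) (NP (DT a) (JJ bewitching) (NN spell))))")

for subtree in parse.subtrees(lambda t: t.label() == "NP"):
    words = subtree.leaves()
    print(subtree.label(), "head ~", words[-1])   # naive head: last word of the NP
# NP head ~ Harry
# NP head ~ spell
```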

20 Path Map
Introduction → Background → Data → Experiments → Conclusions

21 Sense-Tagged Data
- Senseval-2 data: 4,328 test instances and 8,611 training instances, ranging over 73 different nouns, verbs and adjectives
- Senseval-1 data: 8,512 test instances and 13,276 training instances, ranging over 35 nouns, verbs and adjectives
- line, hard, interest, serve data: 4,149, 4,337, 4,378 and 2,476 sense-tagged instances, with line, hard, serve and interest as the head words
- Around 50,000 sense-tagged instances in all!

22 Data Processing
- Packages to convert line, hard, serve and interest data to Senseval-1 and Senseval-2 data formats
- refine: preprocesses data in Senseval-2 data format to make it suitable for tagging
  - Restores one sentence per line and one line per sentence, pre-tags the target words, splits long sentences
- posSenseval: part-of-speech tags any data in Senseval-2 data format
  - Brill tagger along with Guaranteed Pre-tagging utilized
- parseSenseval: parses data in the format output by the Brill Tagger; restores XML tags, creating a parsed file in Senseval-2 data format
  - Uses the Collins Parser

23 Sample line data instance
- Original instance: art} aphb : " There's none there. " He hurried outside to see if there were any dry ones on the line.
- Senseval-2 data format (the XML markup did not survive transcription; in this format the target word is wrapped in a head tag inside a context element): " There's none there. " He hurried outside to see if there were any dry ones on the <head>line</head>.

24 Sample Output from parseSenseval
[The slide shows "Harry cast a bewitching spell" twice: once as plain text and once annotated with part-of-speech and parse information as XML tags; the markup did not survive transcription.]

25 Issues…
- How is the target word identified in the line, hard and serve data?
- How is the data tokenized for better-quality POS tagging and parsing?
- How is the data pre-tagged?
- How is the parse output of the Collins Parser interpreted?
- How is the parsed output XML'ized and brought back to Senseval-2 data format?
- Idiosyncrasies of the line, hard, serve, interest, Senseval-1 and Senseval-2 data, and how they are handled

26 Path Map
Introduction → Background → Data → Experiments → Conclusions

27 Surface Forms: Senseval-1 & Senseval-2

                 Senseval-2   Senseval-1
  Majority       47.7%        56.3%
  Surface Form   49.3%        62.9%
  Unigrams       55.3%        66.9%
  Bigrams        55.1%        66.9%

28 Individual Word POS (Senseval-1)

             All     Nouns   Verbs   Adj.
  Majority   56.3%   57.2%   56.9%   64.3%
  P-2        ?       58.2%   58.6%   64.0%
  P-1        ?       62.2%   58.2%   64.3%
  P0         60.3%   62.5%   58.2%   64.3%
  P1         63.9%   65.4%   64.4%   66.2%
  P2         ?       60.0%   60.8%   65.2%
  (? = value lost in transcription)

29 Individual Word POS (Senseval-2)

             All     Nouns   Verbs   Adj.
  Majority   47.7%   51.0%   39.7%   59.0%
  P-2        ?       51.9%   38.0%   57.9%
  P-1        ?       55.2%   40.2%   59.0%
  P0         49.9%   55.7%   40.6%   58.2%
  P1         53.1%   53.8%   49.1%   61.0%
  P2         ?       50.2%   43.2%   59.4%
  (? = value lost in transcription)

30 Combining POS Features

                         Senseval-2   Senseval-1   line
  Majority               47.7%        56.3%        54.3%
  P0, P1                 ?            66.7%        54.1%
  P-1, P0, P1            ?            68.0%        60.4%
  P-2, P-1, P0, P1, P2   ?            67.8%        62.3%
  (? = value lost in transcription)

31 Effect of Guaranteed Pre-tagging on WSD

                         Senseval-1            Senseval-2
                         Guar. P.   Reg. P.    Guar. P.   Reg. P.
  P-1, P1                ?          62.1%      50.8%      50.9%
  P0, P1                 ?          ?          54.3%      53.8%
  P-1, P0, P1            ?          67.6%      54.6%      54.7%
  P-1 P0, P0 P1          ?          66.3%      54.0%      53.7%
  P-2, P-1, P0, P1, P2   ?          66.1%      54.6%      54.1%
  (? = value lost in transcription)

32 Parse Features (Senseval-1)

              All     Nouns   Verbs   Adj.
  Majority    56.3%   57.2%   56.9%   64.3%
  Head        64.3%   70.9%   59.8%   66.9%
  Parent      60.6%   62.6%   60.3%   65.8%
  Phrase      58.5%   57.5%   57.2%   66.2%
  Par. Phr.   57.9%   58.1%   58.3%   66.2%

33 Parse Features (Senseval-2)

              All     Nouns   Verbs   Adj.
  Majority    47.7%   51.0%   39.7%   59.0%
  Head        51.7%   58.5%   39.8%   64.0%
  Parent      50.0%   56.1%   40.1%   59.3%
  Phrase      48.3%   51.7%   40.3%   59.5%
  Par. Phr.   48.5%   53.0%   39.1%   60.3%

34 Thoughts…
- Both lexical and syntactic features perform comparably, but do they get the same instances right?
- How redundant are the individual feature sets?
- Are there instances correctly disambiguated by one feature set and not by the other? How complementary are the individual feature sets?
- Is the effort to combine lexical and syntactic features justified?

35 Measures
- Baseline Ensemble: accuracy of a hypothetical ensemble which predicts the sense correctly only if both individual feature sets do so
  - Quantifies redundancy amongst the feature sets
- Optimal Ensemble: accuracy of a hypothetical ensemble which predicts the sense correctly if either of the individual feature sets does so
  - The difference from the individual accuracies quantifies complementarity
- We used a simple ensemble which sums up the probabilities assigned to each sense by the individual feature sets to decide the intended sense
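A small sketch of these three quantities, given per-instance correctness flags and per-sense probabilities from two feature sets. All data below is invented for illustration.

```python
# Baseline ensemble, optimal ensemble, and the simple probability-sum
# ensemble described on the slide above.
def baseline_ensemble(correct1, correct2):
    """Fraction of instances that BOTH feature sets get right (redundancy)."""
    return sum(a and b for a, b in zip(correct1, correct2)) / len(correct1)

def optimal_ensemble(correct1, correct2):
    """Fraction of instances that EITHER feature set gets right (upper bound)."""
    return sum(a or b for a, b in zip(correct1, correct2)) / len(correct1)

def sum_ensemble(probs1, probs2):
    """Add the two sense distributions and pick the highest-scoring sense."""
    senses = set(probs1) | set(probs2)
    summed = {s: probs1.get(s, 0) + probs2.get(s, 0) for s in senses}
    return max(summed, key=summed.get)

correct_lex = [True, True, False, False]   # lexical features right on instances 1-2
correct_syn = [True, False, True, False]   # syntactic features right on 1 and 3
print(baseline_ensemble(correct_lex, correct_syn))  # 0.25
print(optimal_ensemble(correct_lex, correct_syn))   # 0.75
print(sum_ensemble({"sense1": 0.6, "sense2": 0.4},
                   {"sense1": 0.2, "sense2": 0.8}))  # 'sense2'
```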

36 Best Combinations

  Data       Set 1            Set 2             Base    Maj.    Ens.    Opt.
  Sval2      Unigrams 55.3%   P-1, P0, P1 ?     43.6%   47.7%   57.0%   67.9%
  Sval1      Unigrams 66.9%   P-1, P0, P1 ?     57.6%   56.3%   71.1%   78.0%
  line       Unigrams 74.5%   P-1, P0, P1 ?     55.1%   54.3%   74.2%   82.0%
  hard       Bigrams 89.5%    Head, Par 87.7%   86.1%   81.5%   88.9%   91.3%
  serve      Unigrams 73.3%   P-1, P0, P1 ?     58.4%   42.2%   81.6%   89.9%
  interest   Bigrams 79.9%    P-1, P0, P1 ?     67.6%   54.9%   83.2%   90.1%
  (? = accuracy lost in transcription)

37 Path Map
Introduction → Background → Data → Experiments → Conclusions

38 Conclusions
- Significant amount of complementarity across lexical and syntactic features; the combination of the two is justified
- The part of speech of the word immediately to the right of the target word was found most useful
  - POS of words immediately to the right of the target word works best for verbs and adjectives; nouns are helped by tags on either side
- The head word of the phrase is particularly useful for adjectives
  - Nouns are helped by both head and parent

39 Other Contributions
- Converted line, hard, serve and interest data into Senseval-2 data format
- Part-of-speech tagged and parsed the Senseval-2, Senseval-1, line, hard, serve and interest data
- Developed the Guaranteed Pre-tagging mechanism to improve the quality of POS tagging
- Showed that guaranteed pre-tagging improves WSD

40 Code, Data, Resources and Publication
- posSenseval: part-of-speech tags any data in Senseval-2 data format
- parseSenseval: parses data in the format output by the Brill Tagger; output is in Senseval-2 data format with part-of-speech and parse information as XML tags
- Packages to convert line, hard, serve and interest data to Senseval-1 and Senseval-2 data formats
- BrillPatch: patch to the Brill Tagger to employ Guaranteed Pre-Tagging
- Brill Tagger
- Collins Parser
- "Guaranteed Pre-Tagging for the Brill Tagger", Mohammad and Pedersen, Fourth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), February 2003, Mexico

41 Thank You