CS 4705 Lecture 19 Word Sense Disambiguation

Overview
– Selectional restriction based approaches
– Robust techniques
  – Machine learning: supervised and unsupervised
  – Dictionary-based techniques

Disambiguation via Selectional Restrictions
A step toward semantic parsing:
– Different verbs select for different thematic roles
  – wash the dishes (takes a washable-thing as patient)
  – serve delicious dishes (takes a food-type as patient)
– Method: rule-to-rule syntactico-semantic analysis
  – Semantic attachment rules are applied as sentences are syntactically parsed, e.g.
    VP --> V NP
    V --> serve {theme: food-type}
  – Selectional restriction violation: no parse
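The restriction check can be approximated with WordNet's hypernym hierarchy. Below is a minimal sketch, assuming NLTK's WordNet interface; the RESTRICTIONS table and the choice of food.n.01 / artifact.n.01 as required classes are illustrative assumptions, not part of the lecture.

```python
# Hedged sketch: approximate a selectional-restriction check with WordNet.
# The RESTRICTIONS table and its synset choices are illustrative assumptions.
from nltk.corpus import wordnet as wn

RESTRICTIONS = {
    ("serve", "theme"): wn.synset("food.n.01"),       # serve <food-type>
    ("wash", "patient"): wn.synset("artifact.n.01"),  # wash <washable-thing>
}

def compatible_senses(verb, role, noun):
    """Return the noun senses whose hypernym chain reaches the required class."""
    required = RESTRICTIONS[(verb, role)]
    keep = []
    for sense in wn.synsets(noun, pos=wn.NOUN):
        ancestors = set(sense.closure(lambda s: s.hypernyms()))
        if required == sense or required in ancestors:
            keep.append(sense)
    return keep  # an empty list signals a selectional-restriction violation

print(compatible_senses("serve", "theme", "dish"))   # senses of 'dish' under food
print(compatible_senses("wash", "patient", "dish"))  # senses of 'dish' under artifact
```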

Requires:
– Writing selectional restrictions for each sense of each predicate
  – serve alone has 15 verb senses
– Hierarchical type information about each argument (à la WordNet)
  – How many hypernyms does dish have? How many lexemes are hyponyms of dish?
But also:
– Sometimes selectional restrictions don't restrict enough (Which dishes do you like?)
– Sometimes speakers violate them on purpose (Eat dirt, worm! I'll eat my hat!)

Can we take a more probabilistic approach? How likely is dish/crockery to be the object of serve? dish/food?
– A simple approach: predict the most likely sense
  – Why might this work? When will it fail?
– A better approach: learn from a tagged corpus
  – What needs to be tagged?
– An even better approach: Resnik's selectional association (1997, 1998)
  – Estimate conditional probabilities of word senses from a corpus tagged only with verbs and their arguments (e.g. dish as an object of serve): Jane served/V ragout/Obj

How do we get the word sense probabilities?
– For each verb's object:
  – Look up its hypernym classes in WordNet
  – Distribute "credit" for this object occurring with this verb among all the classes to which the object belongs
    – Brian served/V the dish/Obj; Jane served/V food/Obj
    – If dish has N hypernym classes in WordNet, add 1/N to each class's count as an object of serve
    – If food has M hypernym classes in WordNet, add 1/M to each class's count as an object of serve
– Pr(C|v) = count(C, v) / count(v)
– How can this work? Ambiguous words have many superordinate classes
  – John served food/the dish/tuna/curry
  – There is a common sense among these which gets "credit" in each instance, eventually dominating the likelihood score
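A rough sketch of this credit-splitting step, assuming NLTK's WordNet and a tiny invented list of (verb, object) pairs; real counts would come from a parsed corpus.

```python
# Hedged sketch of distributing "credit" over WordNet classes for verb objects.
# The four (verb, object) pairs are an invented toy corpus.
from collections import defaultdict
from nltk.corpus import wordnet as wn

pairs = [("serve", "dish"), ("serve", "food"), ("serve", "tuna"), ("serve", "curry")]

class_count = defaultdict(float)   # count(C, v)
verb_count = defaultdict(float)    # count(v)

for verb, obj in pairs:
    # All classes the object could belong to: its senses plus their hypernyms.
    classes = set()
    for sense in wn.synsets(obj, pos=wn.NOUN):
        classes.add(sense)
        classes.update(sense.closure(lambda s: s.hypernyms()))
    for c in classes:                        # add 1/N credit to each class
        class_count[(c, verb)] += 1.0 / len(classes)
    verb_count[verb] += 1.0

def p_class_given_verb(c, verb):
    """Pr(C | v) = count(C, v) / count(v)."""
    return class_count[(c, verb)] / verb_count[verb]

print(p_class_given_verb(wn.synset("food.n.01"), "serve"))
```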

To determine the most likely sense of tuna in Bill served tuna:
– Find the hypernym classes of tuna
– Choose the class C with the highest probability, given that the verb is serve
Results:
– Baselines:
  – Random choice of word sense: 26.8%
  – Most frequent sense (requires a sense-labeled training corpus): 58.2%
– Resnik's method: 44% correct, with only pred/arg relations labeled
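Continuing the toy counts from the previous sketch, sense selection can be sketched as picking the sense whose best class score under serve is highest; this is illustrative only, since Resnik's actual method uses a selectional-association score rather than this raw conditional probability.

```python
# Hedged sketch: choose the sense of 'tuna' whose classes score best with 'serve'.
# Reuses p_class_given_verb() and wn from the previous sketch.
def most_likely_sense(noun, verb):
    best_score, best = -1.0, None
    for sense in wn.synsets(noun, pos=wn.NOUN):
        classes = {sense} | set(sense.closure(lambda s: s.hypernyms()))
        score = max(p_class_given_verb(c, verb) for c in classes)
        if score > best_score:
            best_score, best = score, sense
    return best

print(most_likely_sense("tuna", "serve"))
```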

Machine Learning Approaches
Learn a classifier to assign one of the possible word senses to each word
– Acquire knowledge from a labeled or unlabeled corpus
– Human intervention only in labeling the corpus and selecting the set of features to use in training
Input: feature vectors
– Target (the dependent variable)
– Context (a set of independent variables)
Output: classification rules for unseen text

Supervised Learning
Training and test sets with words labeled as to their correct sense (It was the biggest [fish: bass] I've seen.)
– Obtain the independent variables automatically (POS, co-occurrence information, etc.)
– Train the classifier on the training data
– Test on the test data
– Result: a classifier for use on unlabeled data

Input Features for WSD
– POS tags of the target and its neighbors
– Surrounding context words (stemmed or not)
– Partial parsing, to identify thematic/grammatical roles and relations
– Collocational information: how likely are the target and its left/right neighbor to co-occur?
– Co-occurrence of neighboring words
  – Intuition: how often do sea or related words occur near bass?

– How do we operationalize this? Look at the M most frequent content words occurring within a window of M words in the training data: which accurately predict the correct tag?
– Which other features might be useful in general for WSD?
Input to the learner, e.g. for Is the bass fresh today?:
  [w-2, w-2/pos, w-1, w-1/pos, w+1, w+1/pos, w+2, w+2/pos, …] =
  [is, V, the, DET, fresh, ADJ, today, N, …]
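A small sketch of building such a feature vector, assuming NLTK's default POS tagger (which needs the averaged_perceptron_tagger data); the window size and feature names are ad hoc choices.

```python
# Hedged sketch: collocational features (words and POS tags in a +/-2 window).
import nltk  # assumes the 'averaged_perceptron_tagger' data has been downloaded

def wsd_features(tokens, target_index, window=2):
    tagged = nltk.pos_tag(tokens)
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        if 0 <= i < len(tagged):
            word, pos = tagged[i]
            feats["w%+d" % offset] = word.lower()
            feats["pos%+d" % offset] = pos
    return feats

tokens = ["Is", "the", "bass", "fresh", "today", "?"]
print(wsd_features(tokens, tokens.index("bass")))
# e.g. {'w-2': 'is', 'pos-2': 'VBZ', 'w-1': 'the', 'pos-1': 'DT', ...}
```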

Types of Classifiers
Naïve Bayes
– ŝ = argmax_s P(s|V) = argmax_s P(V|s) P(s) / P(V)
  – where s is one of the possible senses and V is the input vector of features
– Assume the features are independent, so the probability of V is the product of the probabilities of each feature given s: P(V|s) = ∏_j P(v_j | s)
– and P(V) is the same for any s
– Then: ŝ = argmax_s P(s) ∏_j P(v_j | s)
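A minimal sketch of training such a classifier with NLTK's Naive Bayes implementation, reusing the wsd_features() helper from the previous sketch; the four labeled sentences and the sense labels bass_fish / bass_music are invented for illustration.

```python
# Hedged sketch: Naive Bayes WSD with NLTK, on an invented toy training set.
import nltk

train_sents = [
    (["he", "caught", "a", "bass", "while", "fishing"], "bass_fish"),
    (["grilled", "sea", "bass", "with", "lemon"], "bass_fish"),
    (["she", "plays", "bass", "guitar", "on", "stage"], "bass_music"),
    (["the", "bass", "player", "joined", "the", "band"], "bass_music"),
]

train_set = [(wsd_features(toks, toks.index("bass")), sense)
             for toks, sense in train_sents]
classifier = nltk.NaiveBayesClassifier.train(train_set)

test = ["fresh", "bass", "from", "the", "sea"]
print(classifier.classify(wsd_features(test, test.index("bass"))))
classifier.show_most_informative_features(5)
```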

Rule Induction Learners (e.g. Ripper)
– Given: feature vectors of values for the independent variables, each associated with a sense label in the training set (e.g. [fishing, NP, 3, …] + bass2)
– Produce: a set of rules that perform best on the training data, e.g.
  – bass2 if w-1 == 'fishing' & pos == NP
  – …

Decision Lists
– Like case statements: apply tests to the input in turn
  – fish within window --> bass1
  – striped bass --> bass1
  – guitar within window --> bass2
  – bass player --> bass2
  – …
– Yarowsky ('96) orders the tests by their individual accuracy on the entire training set, based on the log-likelihood ratio
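A self-contained sketch of the decision-list idea: score each (feature, value) test by a smoothed log-likelihood ratio and let the highest-scoring matching test decide. The tiny training set, the smoothing constant, and the binary-sense assumption are my own simplifications, not Yarowsky's exact formulation.

```python
# Hedged sketch of a two-sense decision list ordered by smoothed log-likelihood ratio.
import math
from collections import defaultdict

train_set = [  # invented feature dicts for the two senses of 'bass'
    ({"w-1": "sea", "w+1": "fishing"}, "bass1"),
    ({"w-1": "striped", "w+1": "season"}, "bass1"),
    ({"w-1": "electric", "w+1": "guitar"}, "bass2"),
    ({"w-1": "the", "w+1": "player"}, "bass2"),
]

def build_decision_list(train_set, alpha=0.1):
    senses = sorted({sense for _, sense in train_set})
    assert len(senses) == 2, "this sketch assumes a binary sense distinction"
    counts = defaultdict(lambda: defaultdict(float))
    for feats, sense in train_set:
        for test in feats.items():                 # test = (feature, value)
            counts[test][sense] += 1.0
    rules = []
    for test, by_sense in counts.items():
        p1, p2 = by_sense[senses[0]] + alpha, by_sense[senses[1]] + alpha
        score = abs(math.log(p1 / p2))             # how decisive the test is
        rules.append((score, test, senses[0] if p1 > p2 else senses[1]))
    return sorted(rules, reverse=True)             # most decisive test first

def classify_dl(rules, feats, default="bass1"):
    for score, (name, value), sense in rules:
        if feats.get(name) == value:               # first matching test decides
            return sense
    return default

rules = build_decision_list(train_set)
print(classify_dl(rules, {"w-1": "bass", "w+1": "guitar"}))  # -> bass2
```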

Bootstrapping I
– Start with a few labeled instances of the target item as seeds to train an initial classifier, C
– Use high-confidence classifications of C on unlabeled data as new training data
– Iterate
Bootstrapping II
– Start with sentences containing words strongly associated with each sense (e.g. sea and music for bass), chosen either intuitively, from a corpus, or from dictionary entries
– One Sense per Discourse hypothesis
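A rough sketch of the Bootstrapping I loop, assuming featuresets like the ones above and NLTK's Naive Bayes classifier; the confidence threshold and round limit are arbitrary choices.

```python
# Hedged sketch of self-training: grow the labeled set from high-confidence
# predictions on unlabeled data, then retrain.  Thresholds are arbitrary.
import nltk

def bootstrap(seed_set, unlabeled, rounds=5, threshold=0.9):
    labeled = list(seed_set)                      # [(featureset, sense), ...]
    classifier = nltk.NaiveBayesClassifier.train(labeled)
    for _ in range(rounds):
        newly_labeled, still_unlabeled = [], []
        for feats in unlabeled:
            dist = classifier.prob_classify(feats)
            sense = dist.max()
            if dist.prob(sense) >= threshold:     # keep only confident guesses
                newly_labeled.append((feats, sense))
            else:
                still_unlabeled.append(feats)
        if not newly_labeled:                     # nothing new to learn from
            break
        labeled.extend(newly_labeled)
        unlabeled = still_unlabeled
        classifier = nltk.NaiveBayesClassifier.train(labeled)
    return classifier
```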

Unsupervised Learning
Cluster feature vectors to 'discover' word senses, using some similarity metric (e.g. cosine distance)
– Represent each cluster as the average of the feature vectors it contains
– Label the clusters by hand with known senses
– Classify unseen instances by proximity to these known, labeled clusters
Evaluation problems
– What are the 'right' senses?
– Cluster impurity
– How do you know how many clusters to create?
– Some clusters may not map to 'known' senses
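A minimal sketch of the clustering step described above, using scikit-learn (my choice of toolkit, not the lecture's): tf-idf context vectors are L2-normalized, so Euclidean k-means behaves much like cosine-based clustering; the contexts and the choice of two clusters are invented.

```python
# Hedged sketch: 'discover' senses by clustering context vectors, then assign
# an unseen instance to the nearest cluster centroid.  Toy data, toolkit assumed.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

contexts = [
    "caught a huge bass while fishing on the lake",
    "fresh sea bass served at the restaurant",
    "she plays bass guitar in a rock band",
    "the bass player tuned his instrument before the show",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(contexts)           # rows are L2-normalized

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                            # cluster id per context

unseen = vectorizer.transform(["a striped bass from the sea"])
print(kmeans.predict(unseen))                    # nearest centroid = hypothesized sense
```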

Dictionary Approaches
Problem of scale for all the ML approaches
– Must build a classifier for each sense ambiguity
Machine-readable dictionaries (Lesk '86):
– Retrieve all definitions of the content words in the context of the target (e.g. the happy seafarer ate the bass)
– Compare them for overlap with the sense definitions of the target (bass2: a type of fish that lives in the sea)
– Choose the sense with the most overlap
Limits: entries are short --> expand entries to 'related' words
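A short sketch of the gloss-overlap idea using NLTK's built-in simplified Lesk (nltk.wsd.lesk), run on the lecture's example sentence; word_tokenize needs the punkt tokenizer data.

```python
# Hedged sketch: simplified Lesk with NLTK's built-in implementation, which
# scores senses by the overlap between their glosses and the context words.
from nltk.tokenize import word_tokenize  # requires the 'punkt' tokenizer data
from nltk.wsd import lesk

sentence = "the happy seafarer ate the bass"
context = word_tokenize(sentence)
sense = lesk(context, "bass", pos="n")   # the noun sense whose gloss overlaps most
print(sense, "-", sense.definition())
```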

Summary
Many useful approaches have been developed for WSD
– Supervised and unsupervised ML techniques
– Novel uses of existing resources (WordNet, dictionaries)
Future
– More tagged training corpora becoming available
– New learning techniques being tested, e.g. co-training
Next class:
– Homework 2 due
– Read Ch. 15:5-6; Ch. 17:3-5