
1 CS 4705 Lecture 19 Word Sense Disambiguation

2 Overview
– Selectional restriction based approaches
– Robust techniques
  – Machine learning
    – Supervised
    – Unsupervised
  – Dictionary-based techniques

3 Disambiguation via Selectional Restrictions
– A step toward semantic parsing
– Different verbs select for different thematic roles:
  – wash the dishes (takes washable-thing as patient)
  – serve delicious dishes (takes food-type as patient)
– Method: rule-to-rule syntactico-semantic analysis
  – Semantic attachment rules are applied as sentences are syntactically parsed, e.g.
    VP --> V NP
    V --> serve {theme: food-type}
  – A selectional restriction violation means no parse
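The rule-to-rule check can be sketched in a few lines; the toy ontology, the sense names, and the `attach` helper below are all invented for illustration, standing in for a full lexicon and parser:

```python
# Toy hypernym hierarchy (child -> parent); invented for illustration.
ONTOLOGY = {
    "ragout": "food-type",
    "plate": "washable-thing",
    "food-type": "physical-object",
    "washable-thing": "physical-object",
}

# Each verb sense's selectional restriction on its theme argument.
VERB_SENSES = {
    "serve": {"serve/food": "food-type"},
    "wash": {"wash/clean": "washable-thing"},
}

def is_a(word, required_type):
    """Walk up the hierarchy to test whether word satisfies the type."""
    node = word
    while node is not None:
        if node == required_type:
            return True
        node = ONTOLOGY.get(node)
    return False

def attach(verb, obj):
    """Return the verb senses whose restriction the object satisfies;
    an empty list models 'selectional restriction violation: no parse'."""
    return [sense for sense, req in VERB_SENSES.get(verb, {}).items()
            if is_a(obj, req)]

print(attach("serve", "ragout"))  # ['serve/food']
print(attach("wash", "ragout"))   # [] -- ragout is food, not washable
```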

4 Requires:
– Writing selectional restrictions for each sense of each predicate
  – Serve alone has 15 verb senses
– Hierarchical type information about each argument (a la WordNet)
  – How many hypernyms does dish have? How many lexemes are hyponyms of dish?
But also:
– Sometimes selectional restrictions don’t restrict enough (Which dishes do you like?)
– Sometimes speakers violate them on purpose (Eat dirt, worm! I’ll eat my hat!)

5 Can we take a more probabilistic approach?
– How likely is dish/crockery to be the object of serve? dish/food?
– A simple approach: always predict the most likely sense
  – Why might this work? When will it fail?
– A better approach: learn from a sense-tagged corpus
  – What needs to be tagged?
– An even better approach: Resnik’s selectional association (1997, 1998)
  – Estimate conditional probabilities of word senses from a corpus tagged only with verbs and their arguments (e.g. dish as an object of serve): Jane served/V ragout/Obj

6 How do we get the word sense probabilities?
– For each verb’s object:
  – Look up its hypernym classes in WordNet
  – Distribute “credit” for this object occurring with this verb among all the classes to which the object belongs
    – Brian served/V the dish/Obj; Jane served/V food/Obj
    – If dish has N hypernym classes in WordNet, add 1/N to each class’s count as object of serve
    – If food has M hypernym classes, add 1/M to each class’s count as object of serve
– Then P(c|v) = count(c, v) / count(v)
– How can this work? Ambiguous words have many superordinate classes
  – John served food/the dish/tuna/curry
  – These objects share a common class, which gets “credit” in each instance and eventually dominates the likelihood score
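The credit-distribution step can be sketched as follows; the tiny hypernym table is invented, and a real system would query WordNet instead:

```python
from collections import defaultdict

# Invented hypernym table; a real system would look these up in WordNet.
HYPERNYMS = {
    "dish": {"crockery", "food"},   # ambiguous: N = 2 classes
    "food": {"food"},
    "curry": {"food"},
}

def class_probs(verb_object_pairs):
    """Estimate P(c|v) = count(c, v) / count(v), where each object's
    count is split evenly (1/N) among its N hypernym classes."""
    counts = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for verb, obj in verb_object_pairs:
        classes = HYPERNYMS[obj]
        for c in classes:
            counts[verb][c] += 1.0 / len(classes)
        totals[verb] += 1.0
    return {v: {c: n / totals[v] for c, n in cs.items()}
            for v, cs in counts.items()}

pairs = [("serve", "dish"), ("serve", "food"), ("serve", "curry")]
probs = class_probs(pairs)
print(probs["serve"]["food"])      # 2.5/3: "food" gets credit every time
print(probs["serve"]["crockery"])  # 0.5/3: only from the ambiguous "dish"
```

Note how the shared class "food" accumulates credit from every instance, exactly the effect the slide describes.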

7 To determine the most likely sense of tuna in Bill served tuna:
– Find the hypernym classes of tuna
– Choose the class c with the highest probability, given that the verb is serve
Results:
– Baselines:
  – Random choice of word sense: 26.8%
  – Choosing the most frequent sense (requires a sense-labeled training corpus): 58.2%
– Resnik’s method: 44% correct, with only predicate/argument relations labeled
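The final step reduces to an argmax over the candidate classes; the probability table and the class names for tuna below are invented for illustration:

```python
# Invented P(c|v) table; in practice estimated from a parsed corpus.
P_CLASS_GIVEN_VERB = {
    "serve": {"food": 0.83, "crockery": 0.17, "animal": 0.02},
}

# Hypothetical hypernym classes for the ambiguous object tuna.
TUNA_CLASSES = ["food", "animal"]

def best_class(verb, candidate_classes, table):
    """Pick the candidate hypernym class c maximizing P(c | verb)."""
    probs = table.get(verb, {})
    return max(candidate_classes, key=lambda c: probs.get(c, 0.0))

print(best_class("serve", TUNA_CLASSES, P_CLASS_GIVEN_VERB))  # food
```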

8 Machine Learning Approaches
– Learn a classifier that assigns one of the possible senses to each word
  – Acquire knowledge from a labeled or unlabeled corpus
  – Human intervention only in labeling the corpus and selecting the set of features to use in training
– Input: feature vectors
  – Target (dependent variable)
  – Context (set of independent variables)
– Output: classification rules for unseen text

9 Supervised Learning
– Training and test sets with words labeled as to correct sense (It was the biggest [fish: bass] I’ve seen.)
  – Obtain the independent variables automatically (POS, co-occurrence information, etc.)
  – Train the classifier on the training data
  – Test on the test data
  – Result: a classifier for use on unlabeled data

10 Input Features for WSD
– POS tags of target and neighbors
– Surrounding context words (stemmed or not)
– Partial parsing to identify thematic/grammatical roles and relations
– Collocational information: how likely are the target and its left/right neighbor to co-occur?
– Co-occurrence of neighboring words
  – Intuition: how often do sea or related words co-occur with bass?

11
– How to operationalize this? Look at the M most frequent content words occurring within a window of M in the training data
  – Which accurately predict the correct tag?
– Which other features might be useful in general for WSD?
– Input to the learner, e.g. for Is the bass fresh today?:
  [w-2, w-2/pos, w-1, w-1/pos, w+1, w+1/pos, w+2, w+2/pos, …] =
  [is, V, the, DET, fresh, JJ, today, N, …]
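Extracting that windowed feature vector is mechanical; a minimal sketch, assuming the sentence arrives as (word, POS) pairs and padding sentence edges with None:

```python
def window_features(tagged, i, size=2):
    """Return [w-2, w-2/pos, w-1, w-1/pos, w+1, w+1/pos, ...] around
    position i, padding with None at the sentence boundaries."""
    feats = []
    for offset in list(range(-size, 0)) + list(range(1, size + 1)):
        j = i + offset
        if 0 <= j < len(tagged):
            feats.extend(tagged[j])   # append both word and POS tag
        else:
            feats.extend((None, None))
    return feats

# "Is the bass fresh today?" with bass as the target (position 2)
sent = [("is", "V"), ("the", "DET"), ("bass", "N"),
        ("fresh", "JJ"), ("today", "N")]
print(window_features(sent, 2))
# ['is', 'V', 'the', 'DET', 'fresh', 'JJ', 'today', 'N']
```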

12 Types of Classifiers
– Naïve Bayes
  – Choose ŝ = argmax_s p(s|V), where s is one of the possible senses and V is the input vector of features
  – By Bayes’ rule, p(s|V) = p(V|s) p(s) / p(V)
  – Assume the features are independent, so p(V|s) is the product of the probabilities of each feature given s: p(V|s) = ∏_j p(v_j|s); and p(V) is the same for any s
  – Then ŝ = argmax_s p(s) ∏_j p(v_j|s)
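A minimal sketch of that naive Bayes sense classifier in log space; the add-one smoothing and the toy bass examples are our additions, not from the slide:

```python
import math
from collections import defaultdict

def train(examples):
    """examples: list of (feature_list, sense) pairs."""
    sense_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for feats, sense in examples:
        sense_counts[sense] += 1
        for f in feats:
            feat_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feat_counts, vocab

def classify(feats, model):
    """s-hat = argmax_s p(s) * prod_j p(v_j|s), with add-one smoothing."""
    sense_counts, feat_counts, vocab = model
    total = sum(sense_counts.values())
    best, best_lp = None, float("-inf")
    for s, n in sense_counts.items():
        lp = math.log(n / total)                       # log p(s)
        denom = sum(feat_counts[s].values()) + len(vocab)
        for f in feats:
            lp += math.log((feat_counts[s][f] + 1) / denom)
        if lp > best_lp:
            best, best_lp = s, lp
    return best

data = [(["sea", "fishing"], "bass/fish"),
        (["guitar", "player"], "bass/music"),
        (["boat", "sea"], "bass/fish")]
model = train(data)
print(classify(["sea", "boat"], model))  # bass/fish
```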

13 Rule Induction Learners (e.g. Ripper)
– Given feature vectors of values for the independent variables associated with sense labels in the training set (e.g. [fishing, NP, 3, …] --> bass2)
– Produce the set of rules that performs best on the training data, e.g.
  – bass2 if w-1 == ‘fishing’ and pos == NP
  – …

14 Decision Lists
– Like case statements, applying tests to the input in turn:
  – fish within window --> bass1
  – striped bass --> bass1
  – guitar within window --> bass2
  – bass player --> bass2
  – …
– Yarowsky (’96)’s approach orders tests by individual accuracy on the entire training set, based on a log-likelihood ratio
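A sketch of a decision list in this style: each (feature, sense) test is scored by a smoothed log-likelihood ratio, the tests are sorted by score, and classification fires the first test that matches. The smoothing constant and toy data are our assumptions:

```python
import math
from collections import defaultdict

def build_decision_list(examples, alpha=0.1):
    """Score each (feature, sense) test by log((count_s + a)/(count_other + a))
    and return tests sorted best-first."""
    counts = defaultdict(lambda: defaultdict(float))
    senses = set()
    for feats, sense in examples:
        senses.add(sense)
        for f in feats:
            counts[f][sense] += 1
    rules = []
    for f, by_sense in counts.items():
        for s in senses:
            p = by_sense[s] + alpha
            q = sum(by_sense[t] for t in senses if t != s) + alpha
            rules.append((math.log(p / q), f, s))
    rules.sort(reverse=True)
    return rules

def classify(feats, rules, default):
    """Apply the ordered tests in turn; first match wins."""
    for _, f, s in rules:
        if f in feats:
            return s
    return default

data = [(["fish", "sea"], "bass1"), (["striped"], "bass1"),
        (["guitar"], "bass2"), (["player", "guitar"], "bass2")]
rules = build_decision_list(data)
print(classify(["guitar", "amp"], rules, "bass1"))  # bass2
```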

15 Bootstrapping
– Bootstrapping I
  – Start with a few labeled instances of the target item as seeds to train an initial classifier C
  – Use high-confidence classifications of C on unlabeled data as new training data
  – Iterate
– Bootstrapping II
  – Start with sentences containing words strongly associated with each sense (e.g. sea and music for bass), drawn from intuition, a corpus, or dictionary entries
  – One Sense per Discourse hypothesis
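A heavily simplified sketch of the bootstrapping loop, combining both variants: seed cue words label their sentences, and every word of a confidently labeled sentence becomes a new cue for the next round. The seeds, data, and one-cue-sense confidence test are all invented simplifications of the real method:

```python
# Seed cue words per sense (Bootstrapping II style); invented.
SEEDS = {"sea": "bass/fish", "music": "bass/music"}

def bootstrap(sentences, seeds, rounds=3):
    """Iteratively label sentences whose cue words point to exactly
    one sense, then promote their words to new cues."""
    labels = {}
    cues = dict(seeds)          # word -> sense
    for _ in range(rounds):
        for i, words in enumerate(sentences):
            if i in labels:
                continue
            hits = {cues[w] for w in words if w in cues}
            if len(hits) == 1:  # "high confidence": a single sense is cued
                sense = hits.pop()
                labels[i] = sense
                for w in words:
                    cues.setdefault(w, sense)
    return labels

data = [["sea", "caught"], ["caught", "big"], ["music", "loud"]]
print(bootstrap(data, SEEDS))
# {0: 'bass/fish', 1: 'bass/fish', 2: 'bass/music'}
```

Note how sentence 1 contains no seed word at all; it is reached only through the cue "caught" learned from sentence 0 in an earlier pass.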

16 Unsupervised Learning
– Cluster feature vectors to ‘discover’ word senses, using some similarity metric (e.g. cosine distance)
  – Represent each cluster as the average of the feature vectors it contains
  – Label the clusters by hand with known senses
  – Classify unseen instances by proximity to these known, labeled clusters
– Evaluation problem
  – What are the ‘right’ senses?

17
– Cluster impurity
– How do you know how many clusters to create?
– Some clusters may not map to ‘known’ senses
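The cluster-and-average scheme can be sketched as a small cosine k-means; the two-dimensional vectors (counts of sea-like vs. guitar-like context words) and the seed centroids are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Average the vectors in a cluster to get its representative."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def kmeans(vectors, seeds, iters=10):
    """Assign each vector to its most similar centroid, then recompute
    centroids as cluster averages; repeat."""
    centroids = [list(s) for s in seeds]
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for v in vectors:
            best = max(range(len(centroids)),
                       key=lambda j: cosine(v, centroids[j]))
            clusters[best].append(v)
        centroids = [mean(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

# Toy contexts of "bass": [sea-word count, guitar-word count]
vecs = [[2, 0], [3, 1], [0, 2], [1, 3]]
cents, clusters = kmeans(vecs, seeds=[[1, 0], [0, 1]])
print(clusters)  # cluster 0 ~ fish contexts, cluster 1 ~ music contexts
```

Classifying an unseen instance by proximity is then just `max` over `cosine` against the labeled centroids.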

18 Dictionary Approaches
– Problem of scale for all ML approaches: a classifier must be built for each sense ambiguity
– Machine-readable dictionaries (Lesk ’86)
  – Retrieve all definitions of the content words in the context of the target (e.g. the happy seafarer ate the bass)
  – Compare them for overlap with the sense definitions of the target (bass2: a type of fish that lives in the sea)
  – Choose the sense with the most overlap
– Limits: entries are short --> expand entries to ‘related’ words
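A small sketch of the Lesk overlap idea; the mini-dictionary of context-word definitions and the two bass glosses are invented, where a real implementation would pull entries from a machine-readable dictionary:

```python
# Invented sense glosses for the target word.
GLOSSES = {
    "bass/fish": "a type of fish that lives in the sea",
    "bass/music": "the lowest part in polyphonic music",
}

# Invented mini-dictionary for the context words.
CONTEXT_GLOSSES = {
    "seafarer": "a person who travels by sea",
    "ate": "consumed food",
    "happy": "feeling pleasure",
}

STOP = {"a", "of", "that", "in", "the"}

def lesk(context_words, sense_glosses, dictionary, stop=STOP):
    """Pool the definitions of the context words, then pick the target
    sense whose gloss overlaps that pool the most."""
    pooled = set()
    for w in context_words:
        pooled |= set(dictionary.get(w, "").split())
    pooled -= stop
    best, best_ov = None, -1
    for sense, gloss in sense_glosses.items():
        ov = len(pooled & (set(gloss.split()) - stop))
        if ov > best_ov:
            best, best_ov = sense, ov
    return best

print(lesk(["happy", "seafarer", "ate"], GLOSSES, CONTEXT_GLOSSES))
# bass/fish -- via the overlap word "sea" from seafarer's definition
```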

19 Summary
– Many useful approaches have been developed for WSD
  – Supervised and unsupervised ML techniques
  – Novel uses of existing resources (WordNet, dictionaries)
– Future
  – More tagged training corpora becoming available
  – New learning techniques being tested, e.g. co-training
– Next class
  – Homework 2 due
  – Read Ch. 15:5-6; Ch. 17:3-5

