Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS 4705 Word Sense Disambiguation. Overview Selectional restriction based approaches Robust techniques –Machine Learning Supervised Unsupervised –Dictionary-based.

Similar presentations


Presentation on theme: "CS 4705 Word Sense Disambiguation. Overview Selectional restriction based approaches Robust techniques –Machine Learning Supervised Unsupervised –Dictionary-based."— Presentation transcript:

1 CS 4705 Word Sense Disambiguation

2 Overview Selectional restriction based approaches Robust techniques –Machine Learning Supervised Unsupervised –Dictionary-based techniques

3 Disambiguation via Selectional Restrictions A step toward semantic parsing –Different verbs select for different thematic roles wash the dishes (takes washable-thing as patient) serve delicious dishes (takes food-type as patient) Method: rule-to-rule syntactico-semantic analysis –Semantic attachment rules are applied as sentences are syntactically parsed VP --> V NP V  serve {theme:food-type} –Selectional restriction violation: no parse

4 Requires: –Write selectional restrictions for each sense of each predicate – or use FrameNetFrameNet Serve alone has 15 verb senses –Hierarchical type information about each argument (a la WordNet) WordNet How many hypernyms does dish have? How many lexemes are hyponyms of dish?hyponyms But also: –Sometimes selectional restrictions don’t restrict enough (Which dishes do you like?) –Sometimes they restrict too much (Eat dirt, worm! I’ll eat my hat!)

5 Can we take a more statistical approach? How likely is dish/crockery to be the object of serve? dish/food? A simple approach (baseline): predict the most likely sense –Why might this work? –When will it fail? A better approach: learn from a tagged corpus –What needs to be tagged? An even better approach: Resnik’s selectional association (1997, 1998) –Estimate conditional probabilities of word senses from a corpus tagged only with verbs and their arguments (e.g. ragout is an object of served -- Jane served/V ragout/Obj

6 How do we get the word sense probabilities? –For each verb object (e.g. ragout) Look up hypernym classes in WordNet Distribute “credit” for this object sense occurring with this verb among all the classes to which the object belongs Brian served/V the dish/Obj Jane served/V food/Obj If ragout has N hypernym classes in WordNet, add 1/N to each class count (including food) as object of serve If tureen has M hypernym classes in WordNet, add 1/M to each class count (including dish) as object of serve –Pr(Class|v) is the count(c,v)/count(v) –How can this work? Ambiguous words have many superordinate classes John served food/the dish/tuna/curry There is a common sense among these which gets “credit” in each instance, eventually dominating the likelihood score

7 To determine most likely sense of ‘bass’ in Bill served bass –Having previously assigned ‘credit’ for the occurrence of all hypernyms of things like fish and things like musical instruments to all their hypernym classes (e.g. ‘fish’ and ‘musical instruments’) –Find the hypernym classes of bass (including fish and musical instruments) –Choose the class C with the highest probability, given that the verb is serve Results: –Baselines: random choice of word sense is 26.8% choose most frequent sense (NB: requires sense-labeled training corpus) is 58.2% –Resnik’s: 44% correct with only pred/arg relations labeled

8 Machine Learning Approaches Learn a classifier to assign one of possible word senses for each word –Acquire knowledge from labeled or unlabeled corpus –Human intervention only in labeling corpus and selecting set of features to use in training Input: feature vectors –Target (dependent variable) –Context (set of independent variables) Output: classification rules for unseen text

9 Supervised Learning Training and test sets with words labeled as to correct sense (It was the biggest [fish: bass] I’ve seen.) –Obtain values of independent variables automatically (POS, co-occurrence information, …) –Run classifier on training data –Test on test data –Result: Classifier for use on unlabeled data

10 Input Features for WSD POS tags of target and neighbors Surrounding context words (stemmed or not) Punctuation, capitalization and formatting Partial parsing to identify thematic/grammatical roles and relations Collocational information: –How likely are target and left/right neighbor to co- occur Co-occurrence of neighboring words –Intuition: How often does sea or words with bass

11 –How do we proceed? Look at a window around the word to be disambiguated, in training data Which features accurately predict the correct tag? Can you think of other features might be useful in general for WSD? Input to learner, e.g. Is the bass fresh today? [w-2, w-2/pos, w-1,w-/pos,w+1,w+1/pos,w+2,w+2/pos… [is,V,the,DET,fresh,RB,today,N...

12 Types of Classifiers Naïve Bayes –ŝ = p(s|V), or –Where s is one of the senses possible and V the input vector of features –Assume features independent, so probability of V is the product of probabilities of each feature, given s, so – and p(V) same for any ŝ –Then

13 Rule Induction Learners (e.g. Ripper) Given a feature vector of values for independent variables associated with observations of values for the training set (e.g. [fishing,NP,3,…] + bass 2 ) Produce a set of rules that perform best on the training data, e.g. –bass 2 if w-1==‘fishing’ & pos==NP –…

14 –like case statements applying tests to input in turn fish within window--> bass 1 striped bass--> bass 1 guitar within window--> bass 2 bass player--> bass 1 … –Yarowsky ‘96’s approach orders tests by individual accuracy on entire training set based on log-likelihood ratio Decision Lists

15 Bootstrapping I –Start with a few labeled instances of target item as seeds to train initial classifier, C –Use high confidence classifications of C on unlabeled data as training data –Iterate Bootstrapping II –Start with sentences containing words strongly associated with each sense (e.g. sea and music for bass), either intuitively or from corpus or from dictionary entries –One Sense per Discourse hypothesis

16 Unsupervised Learning Cluster feature vectors to ‘discover’ word senses using some similarity metric (e.g. cosine distance) –Represent each cluster as average of feature vectors it contains –Label clusters by hand with known senses –Classify unseen instances by proximity to these known and labeled clusters Evaluation problem –What are the ‘right’ senses?

17 –Cluster impurity –How do you know how many clusters to create? –Some clusters may not map to ‘known’ senses

18 Dictionary Approaches Problem of scale for all ML approaches –Build a classifier for each sense ambiguity Machine readable dictionaries (Lesk ‘86) –Retrieve all definitions of content words occurring in context of target (e.g. the happy seafarer ate the bass) –Compare for overlap with sense definitions of target entry (bass 2 : a type of fish that lives in the sea) –Choose sense with most overlap Limits: Entries are short --> expand entries to ‘related’ words

19 Summary Many useful approaches developed to do WSD –Supervised and unsupervised ML techniques –Novel uses of existing resources (WN, dictionaries) Future –More tagged training corpora becoming available –New learning techniques being tested, e.g. co-training Next class: –Ch 17:3-5


Download ppt "CS 4705 Word Sense Disambiguation. Overview Selectional restriction based approaches Robust techniques –Machine Learning Supervised Unsupervised –Dictionary-based."

Similar presentations


Ads by Google