Semi-Supervised Natural Language Learning Reading Group I set up a site at: ervised/ Cover other applications of semi-supervised learning? Volunteers? Every week or bi-weekly? Time change? 1pm? Noon?
Unsupervised Word Sense Disambiguation Rivaling Supervised Methods Author: David Yarowsky (1995) Presented by: Andy Carlson
Word Sense Disambiguation Determining which sense of a word is meant in a given sentence “Toyota is considering opening a plant in Detroit.” “The banana plant is grown all over the tropics for its fruit.” Different from sense induction: we assume we already know the distinct senses
Using unlabeled data Two properties of language let us use unlabeled data: One sense per collocation: nearby words provide strong and consistent clues. One sense per discourse: within a document, the sense of a word is highly consistent. We can base an iterative bootstrapping algorithm on these two properties
One sense per discourse How accurate? How frequently does it apply?
Decision Lists List of rules of the form “collocation => sense” Example: life (within 2-10 words) => biological sense of plant Rules are ordered by log-likelihood ratio
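A minimal sketch of how such a list might be built from labeled examples, assuming two senses and additive smoothing. The function name `build_decision_list`, the `alpha` smoothing constant, and the example data are illustrative assumptions, not details from the slides. Each rule is scored by the absolute value of log((count with sense A + alpha) / (count with sense B + alpha)), and rules are sorted strongest first.

```python
import math
from collections import Counter


def build_decision_list(examples, senses, alpha=0.1):
    """Rank "collocation => sense" rules by smoothed log-likelihood ratio.

    examples: iterable of (set_of_collocations, sense) pairs.
    senses:   pair of sense names (sense_a, sense_b).
    alpha:    additive smoothing constant (an assumption of this sketch).
    Returns a list of (strength, collocation, sense), strongest first.
    """
    sense_a, sense_b = senses
    counts = {sense_a: Counter(), sense_b: Counter()}
    for collocations, sense in examples:
        counts[sense].update(set(collocations))

    rules = []
    for colloc in set(counts[sense_a]) | set(counts[sense_b]):
        llr = math.log((counts[sense_a][colloc] + alpha) /
                       (counts[sense_b][colloc] + alpha))
        if llr != 0:  # a collocation with no preference carries no evidence
            rules.append((abs(llr), colloc, sense_a if llr > 0 else sense_b))
    # Strongest evidence first; ties broken alphabetically for stable output.
    return sorted(rules, key=lambda rule: (-rule[0], rule[1]))


if __name__ == "__main__":
    toy = [
        ({"life", "grow"}, "living"),
        ({"life"}, "living"),
        ({"manufacturing"}, "factory"),
        ({"grow", "manufacturing"}, "factory"),
    ]
    for strength, colloc, sense in build_decision_list(toy, ("living", "factory")):
        print(f"{strength:.2f}  {colloc} => {sense}")
```

In this toy data, "grow" appears once with each sense, so its ratio is exactly 1 and it produces no rule.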
The algorithm – step 1 Find all occurrences of the given polysemous word We follow examples for the word plant
Step 2 – Initial Labeling For each sense of the word, identify a small number of training examples Strategies: words from dictionary definitions, human labeling of the most frequent collocates, or human-chosen collocates Example: the words life and manufacturing are used as seed collocations
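The seeding step can be sketched as follows, assuming each occurrence of the target word is represented by the list of tokens around it. The function name `seed_label` and the rule that a context matching seeds for more than one sense stays unlabeled are assumptions of this sketch.

```python
def seed_label(contexts, seeds):
    """Assign an initial sense to contexts that contain a seed collocation.

    contexts: list of token lists, one per occurrence of the target word.
    seeds:    dict mapping sense name -> seed word, e.g. {"living": "life"}.
    Returns one label per context; None means "still unlabeled".
    """
    labels = []
    for words in contexts:
        matched = {sense for sense, seed in seeds.items() if seed in words}
        # Label only when exactly one sense's seed is present.
        labels.append(matched.pop() if len(matched) == 1 else None)
    return labels


if __name__ == "__main__":
    contexts = [
        ["plant", "life", "grows"],
        ["manufacturing", "plant", "opened"],
        ["the", "plant", "closed"],
    ]
    print(seed_label(contexts, {"living": "life", "factory": "manufacturing"}))
```

Most contexts contain neither seed, so they start out unlabeled, which matches the sample initial state where the unlabeled examples dominate.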
Labeled as ‘living’ plant
Unlabeled examples
Labeled as ‘factory’ plant
Sample initial state
Step 3a Train the decision list based on the current labeling of the state space
Step 3b Apply learned classifier to all examples
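Applying a decision list can be sketched as below: the first (strongest) rule whose collocation appears in the context decides the sense, and contexts with no sufficiently strong matching rule stay unlabeled. The `classify` name, the `threshold` default, and the rule strengths in the example are hypothetical values chosen for illustration.

```python
def classify(words, decision_list, threshold=1.0):
    """Return the sense chosen by the strongest matching rule, or None.

    words:         set of tokens near the target word.
    decision_list: list of (strength, collocation, sense), strongest first.
    threshold:     minimum rule strength needed to assign a label
                   (an assumption of this sketch).
    """
    for strength, collocation, sense in decision_list:
        if strength < threshold:
            break  # all remaining rules are weaker still
        if collocation in words:
            return sense
    return None


if __name__ == "__main__":
    # Hypothetical rule strengths, not values from the paper.
    rules = [
        (3.0, "life", "living"),
        (2.5, "manufacturing", "factory"),
        (0.4, "open", "factory"),
    ]
    print(classify({"banana", "life"}, rules))
    print(classify({"open", "door"}, rules))
```

Only the single best rule is used; evidence from weaker matching rules is not combined.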
Step 3c Optionally, apply the one-sense-per-discourse constraint
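One possible way to implement the constraint is to extend a document's majority sense to every occurrence in that document. The function name and the `min_votes` and `min_share` cutoffs are assumptions of this sketch; the slides do not specify how strong the majority must be.

```python
from collections import Counter, defaultdict


def apply_one_sense_per_discourse(labels, doc_ids, min_votes=2, min_share=0.75):
    """Propagate the dominant sense within each document.

    labels:    current label per example (None = unlabeled).
    doc_ids:   document identifier per example, same length as labels.
    min_votes: minimum number of labeled examples backing the majority sense.
    min_share: minimum fraction of the labeled examples in the document
               that must agree (both cutoffs are assumptions).
    """
    by_doc = defaultdict(list)
    for index, doc in enumerate(doc_ids):
        by_doc[doc].append(index)

    result = list(labels)
    for indices in by_doc.values():
        votes = Counter(labels[i] for i in indices if labels[i] is not None)
        if not votes:
            continue
        sense, count = votes.most_common(1)[0]
        if count >= min_votes and count / sum(votes.values()) >= min_share:
            for i in indices:
                result[i] = sense  # fills gaps and overrides minority labels
    return result
```

With these defaults, a document with two "living" labels and one unlabeled occurrence becomes all "living", while a document with a single labeled occurrence is left unchanged.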
Step 3c
After steps 3b and 3c
Step 3d Repeat step 3 iteratively Details: grow the window size for collocations, and randomly perturb the class-inclusion threshold
Step 4 Stop. The algorithm converges to a stable residual set of examples that remain unlabeled.
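Steps 2 through 4 can be put together in a simplified sketch: train a decision list on the current labels, relabel every example, and stop when the labeling no longer changes. This version omits the one-sense-per-discourse step, the growing collocation window, and the random threshold perturbation. All names, the `alpha` smoothing, the `threshold` value, and the toy data are assumptions, and contexts are assumed to exclude the target word itself.

```python
import math
from collections import Counter


def train(contexts, labels, senses, alpha=0.1):
    """Build a decision list from the currently labeled contexts."""
    sense_a, sense_b = senses
    counts = {sense_a: Counter(), sense_b: Counter()}
    for words, label in zip(contexts, labels):
        if label is not None:
            counts[label].update(set(words))
    rules = []
    for word in set(counts[sense_a]) | set(counts[sense_b]):
        llr = math.log((counts[sense_a][word] + alpha) /
                       (counts[sense_b][word] + alpha))
        if llr != 0:
            rules.append((abs(llr), word, sense_a if llr > 0 else sense_b))
    return sorted(rules, key=lambda rule: (-rule[0], rule[1]))


def classify(words, rules, threshold):
    """Label with the strongest matching rule above the threshold, else None."""
    for strength, word, sense in rules:
        if strength < threshold:
            break
        if word in words:
            return sense
    return None


def bootstrap(contexts, seed_labels, senses, threshold=1.5, max_iters=20):
    """Iterate train/relabel until the labeling is stable."""
    labels = list(seed_labels)
    rules = []
    for _ in range(max_iters):
        rules = train(contexts, labels, senses)
        new_labels = [classify(set(words), rules, threshold) for words in contexts]
        if new_labels == labels:
            break  # converged: whatever is still None is the residual set
        labels = new_labels
    return labels, rules


if __name__ == "__main__":
    contexts = [
        ["life", "cell"], ["manufacturing", "car"],
        ["cell", "species"], ["car", "worker"],
        ["species", "leaf"], ["worker", "union"], ["moon"],
    ]
    seeds = ["living", "factory", None, None, None, None, None]
    final_labels, final_rules = bootstrap(contexts, seeds, ("living", "factory"))
    print(final_labels)
```

In the toy data, labels spread outward from the seeds through shared collocations ("cell", then "species"), while the context containing only "moon" never matches a rule and stays in the residual set.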
Sample final state
Final decision list
Results