1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
2 Outline
Introduction
Motivations of Algorithm
Feature Selection
Crucial Problem and Detailed Algorithm
Experiment Results
Conclusions & Discussions
3 Introduction
What is a homograph?
-- One of two or more words spelled alike but different in meaning
What is noun homograph disambiguation?
-- Determining which of a set of pre-determined senses should be assigned to a given noun
Why is noun homograph disambiguation useful?
7 How to do it? -- Motivations
Intuition 1: Humans can identify word sense from local context
Intuition 2: This ability comes from familiarity with frequent contexts
Intuition 3: Different senses can be distinguished by:
-- different high-frequency contexts
-- different syntactic, orthographic, or lexical features
Combining Intuitions 1, 2, and 3: similar-sense terms will tend to have similar contexts!
8 Feature Selection
Principles: Selective & General
Example: “bank”
-- "Numerous residences, banks, and libraries": parallel buildings
-- "They use holes in trees, banks, or rocks for nests": parallel nature objects
-- "are found on the west bank of the Nile": [“direction”] bank of the “proper name”
-- "Headed the Chase Manhattan Bank in New York": Name + Capitalization
Neighboring words are not enough; syntactic information is needed!
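The feature types listed for “bank” can be sketched as a small extractor. This is only an illustration: the function name `extract_features`, the feature labels, and the `DIRECTIONS` word list are hypothetical, and simple token patterns stand in for the phrase segmentation and POS tagging that the presented method relies on.

```python
import re

# Hypothetical direction words; the slides only give "west" as an example.
DIRECTIONS = {"north", "south", "east", "west", "left", "right"}

def extract_features(sentence, target="bank"):
    """Return a set of coarse context features for each occurrence of target."""
    tokens = re.findall(r"[A-Za-z]+|[,.]", sentence)
    feats = set()

    def tok(k):
        # Safe token lookup: empty string when k is out of range.
        return tokens[k] if 0 <= k < len(tokens) else ""

    for i, word in enumerate(tokens):
        if word.lower() not in (target, target + "s"):
            continue
        # Orthographic feature: the target itself is capitalized mid-sentence.
        if i > 0 and word[0].isupper():
            feats.add("capitalized")
        # Lexical feature: a capitalized word (not sentence-initial) precedes it.
        if i - 1 > 0 and tok(i - 1)[:1].isupper():
            feats.add("proper_name_before")
        # Lexical feature: a direction word precedes the target.
        if tok(i - 1).lower() in DIRECTIONS:
            feats.add("direction_before")
        # Pattern: "<target> of the <Capitalized>".
        if tok(i + 1) == "of" and tok(i + 2) == "the" and tok(i + 3)[:1].isupper():
            feats.add("of_proper_name")
        # Pattern: the target sits inside a comma-separated list.
        if tok(i - 1) == "," and tok(i + 1) == ",":
            feats.add("in_list")
            if tok(i - 2).isalpha():
                feats.add("list_neighbor:" + tok(i - 2).lower())
    return feats
```

Note that the list feature only records which word is parallel to the target; deciding that "trees" is a nature object rather than a building would need additional lexical knowledge.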
10 Crucial Problem: is a large annotated corpus needed?
Problem:
-- The cost of manual tagging is high
-- The corpus is usually large
-- Statistics vary a great deal across different domains
-- Automating the tagging of the training corpus results in a “circularity problem” (Dagan and Itai, 1994)
Solution: construct the training corpus incrementally
-- An initial model M1 is trained on a small corpus C1
-- M1 is used to disambiguate the remaining ambiguous words
-- All words that can be disambiguated with strong confidence are combined with C1 to form C2
-- M2 is trained on C2, and the process repeats
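The incremental construction of the training corpus can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the presented system: the names `train`, `predict`, and `bootstrap`, the count-based model, and the default `margin` value are all hypothetical, and each instance is represented simply as a list of feature strings.

```python
from collections import Counter

def train(labeled):
    """Build a model {sense: Counter(feature -> count)} from (features, sense) pairs."""
    model = {}
    for features, sense in labeled:
        model.setdefault(sense, Counter()).update(features)
    return model

def predict(model, features, margin):
    """Return the best sense, or None when the evidence is not decisive."""
    scores = {s: sum(counts[f] for f in features) for s, counts in model.items()}
    total = sum(scores.values())
    if total == 0:
        return None  # no known feature was seen
    ranked = sorted(scores.values(), reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0
    # Assumed confidence test: normalized gap between the top two senses.
    if (ranked[0] - runner_up) / total < margin:
        return None
    return max(scores, key=scores.get)

def bootstrap(seed, unlabeled, margin=0.5, max_rounds=10):
    """Grow the labeled set (C1, C2, ...) by adding confident predictions each round."""
    labeled = list(seed)
    pending = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)          # M_k trained on C_k
        newly_labeled, still_pending = [], []
        for features in pending:
            sense = predict(model, features, margin)
            if sense is None:
                still_pending.append(features)
            else:
                newly_labeled.append((features, sense))
        if not newly_labeled:           # nothing confident left to add
            break
        labeled.extend(newly_labeled)   # C_{k+1} = C_k + confident instances
        pending = still_pending
    return train(labeled), labeled, pending
```

For example, with one seed instance per sense, an instance that shares no features with the seeds may still be labeled in a later round, once features learned from earlier confident instances have been added to the model. The same mechanism can also propagate errors, which is why the confidence threshold matters.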
11 Test Algorithm
Training:
-- Manually label a small set of samples
-- Record context features
Test:
-- Input: sentences segmented into phrases & POS-tagged
-- Check context features of the target noun
-- Compare evidence
-- Choose the sense with the most evidence
-- Output: the chosen sense
Feedback: samples with high comparative evidence are added to the training set
12 Comparative Evidence
Definition: Max (CE) [equation not preserved in this transcript], where:
-- CE: comparative evidence
-- n: number of senses
-- m: number of evidence features found in the test sentence
-- f_ij: frequency with which feature j is recorded in a sentence containing sense i
Procedure:
-- Choose the sense with maximum comparative evidence
-- If the largest CE does not exceed the second-largest CE by a threshold (the margin), the sentence cannot be classified!
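The margin-based decision procedure can be sketched as follows. The transcript does not preserve the CE formula, so the normalization used here (a sense's summed feature frequencies divided by the total over all senses) is an assumption made for illustration only; the function names and the default `margin` are likewise hypothetical.

```python
def comparative_evidence(freq, features, senses):
    """Return {sense: CE} for the features found in one test sentence.

    freq[sense][feature] plays the role of f_ij. The normalization
    below is an ASSUMPTION, not the formula from the original slides.
    """
    raw = {s: sum(freq.get(s, {}).get(f, 0) for f in features) for s in senses}
    total = sum(raw.values())
    if total == 0:
        return {s: 0.0 for s in senses}
    return {s: raw[s] / total for s in senses}

def classify(freq, features, senses, margin=0.2):
    """Pick the sense with maximum CE, or None if the margin test fails."""
    ce = comparative_evidence(freq, features, senses)
    ranked = sorted(senses, key=lambda s: ce[s], reverse=True)
    best = ranked[0]
    if ce[best] == 0:
        return None  # no evidence at all
    # The largest CE must exceed the second largest by at least the margin.
    if len(ranked) > 1 and ce[best] - ce[ranked[1]] < margin:
        return None
    return best
```

Sentences for which `classify` returns `None` are left unclassified rather than forced into the nearest sense, which matches the rule that a sentence cannot be classified when the top two CE values are too close.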
18 Conclusions and future work
Main advantage: bootstrapping alleviates the tagging bottleneck; no sizable sense-tagged corpus is needed
Results show the method is successful
Unsupervised learning:
-- helps improve results on general words
-- has limitations on difficult words like “country”
-- also helps reduce the amount of manual work
Use of partial syntactic information: richer than common statistical techniques
Proposed improvements:
-- Bootstrapping from bilingual corpora
-- Improve the evidence metric (adjust weights automatically; weight on the entire corpus and on each sense; add more feature types)
-- Integrate WordNet
19 Discussion 1: Initial Training
A good training base must already be available; that is, initial hand tagging is required.
Once training is complete, however, noun homograph disambiguation is fast.
The initial set is still large (20-30 occurrences for each sense), so the cost of tagging is still high!
20 Discussion 2: Resources
Advantages of an unrestricted corpus:
-- Compared to dictionaries, it includes sufficient contextual variety
-- It can automatically integrate unfamiliar words
Assumption:
-- The context around an instance of a sense of the homograph is meaningfully related to that sense
Is a semantic lexicon needed?
-- "Numerous residences, banks, and libraries": parallel buildings
-- "They use holes in trees, banks, or rocks for nests": parallel nature objects
21 References
Marti A. Hearst (1991). Noun Homograph Disambiguation Using Local Context in Large Text Corpora
Yarowsky (1992). Word-Sense Disambiguation Using Statistical Models of Roget’s..
Chin (1999). Word Sense Disambiguation Using Statistical Techniques
Peh, Ng (1997). Domain-Specific Semantic Class Disambiguation Using WordNet
Dagan, I. and Itai (1994). Word Sense Disambiguation using a second language monolingual corpus