Noun Homograph Disambiguation Using Local Context in Large Text Corpora. Marti A. Hearst. Presented by: Heng Ji, Mar. 29, 2004.

2 Outline
- Introduction
- Motivations of the Algorithm
- Feature Selection
- Crucial Problem and Detailed Algorithm
- Experiment Results
- Conclusions & Discussions

3 Introduction
What is a homograph?
- Two or more words spelled alike but different in meaning
What is noun homograph disambiguation?
- Determining which of a set of pre-determined senses should be assigned to that noun
Why is noun homograph disambiguation useful?

4 Noun Compound Interpretation

5 Noun Compound Interpretation: Improve Information Retrieval Results

6 Extend keywords?

7 How to Do It? Motivations
Intuition 1: humans can identify word senses from local context
Intuition 2: this identification ability comes from familiarity with frequent contexts
Intuition 3: different senses can be distinguished by
- different high-frequency contexts
- different syntactic, orthographic, or lexical features
Combining intuitions 1-3: similar-sense terms will tend to have similar contexts!

8 Feature Selection
Principles: selective & general
Example: "bank"
- "Numerous residences, banks, and libraries" → parallel buildings
- "They use holes in trees, banks, or rocks for nests" → parallel natural objects
- "are found on the west bank of the Nile" → [direction] bank of the [proper name]
- "Headed the Chase Manhattan Bank in New York" → name + capitalization
Neighboring words alone are not enough → syntactic information is needed!
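The local-context cues on this slide can be sketched as a small feature extractor. This is an illustrative toy, not the paper's actual feature set: the feature names and the "followed by of" pattern are invented for the example.

```python
def extract_features(tokens, target_index):
    """Collect simple local-context features for the noun at target_index.

    Illustrative cues only: capitalization, immediate neighbors,
    and a "bank of the X" style pattern.
    """
    features = []
    word = tokens[target_index]
    # Orthographic cue: capitalization suggests a proper-name reading.
    if word[0].isupper():
        features.append("capitalized")
    # Immediate neighbors on each side.
    if target_index > 0:
        features.append("prev=" + tokens[target_index - 1].lower())
    if target_index + 1 < len(tokens):
        features.append("next=" + tokens[target_index + 1].lower())
    # "bank of the Nile" pattern: a following "of" hints at the direction reading.
    if target_index + 1 < len(tokens) and tokens[target_index + 1].lower() == "of":
        features.append("followed-by-of")
    return features

tokens = "are found on the west bank of the Nile".split()
print(extract_features(tokens, tokens.index("bank")))
```

Run on the Nile example from the slide, this yields the neighbor and pattern features but not the capitalization feature.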

9 Feature Set

10 Crucial Problem: Is Large Annotated Data Needed?
Problem: the cost of manual tagging is high
- the corpus is usually large
- statistics vary a great deal across domains
- automating the tagging of the training corpus leads to a "circularity problem" (Dagan and Itai, 1994)
Solution: construct the training corpus incrementally
- an initial model M1 is trained on a small corpus C1
- M1 is used to disambiguate the remaining ambiguous words
- all words that can be disambiguated with strong confidence are combined with C1 to form C2
- M2 is trained on C2; repeat
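The incremental scheme above can be sketched as a small bootstrapping loop. This is a toy illustration, not Hearst's implementation: the model is a bare feature-to-sense co-occurrence table, and the confidence score (a normalized margin between the top two senses) and its threshold are invented for the sketch.

```python
from collections import Counter, defaultdict

def train(labeled):
    """Count how often each context feature co-occurs with each sense."""
    counts = defaultdict(Counter)          # feature -> Counter over senses
    for features, sense in labeled:
        for f in features:
            counts[f][sense] += 1
    return counts

def classify(model, features):
    """Return (best_sense, confidence); confidence is the score margin."""
    score = Counter()
    for f in features:
        score.update(model.get(f, {}))
    if not score:
        return None, 0.0
    ranked = score.most_common()
    top = ranked[0][1]
    second = ranked[1][1] if len(ranked) > 1 else 0
    return ranked[0][0], (top - second) / sum(score.values())

def bootstrap(seed, unlabeled, threshold=0.3, rounds=3):
    """M1 is trained on C1; sentences labeled with strong confidence
    join the corpus to form C2; M2 is trained on C2; and so on."""
    labeled, remaining = list(seed), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        deferred = []
        for features in remaining:
            sense, conf = classify(model, features)
            if sense is not None and conf >= threshold:
                labeled.append((features, sense))   # strong evidence: add to corpus
            else:
                deferred.append(features)           # leave for a later round
        remaining = deferred
    return train(labeled), remaining

# Toy "bank" data: two senses, a couple of seed contexts each.
seed = [(["money", "loan"], "finance"), (["river", "water"], "shore")]
unlabeled = [["loan", "deposit"], ["water", "mud"], ["xyz"]]
model, undecided = bootstrap(seed, unlabeled)
```

Sentences whose features never overlap the growing corpus (here `["xyz"]`) stay undecided, which is exactly the behavior that avoids the circularity problem: low-confidence guesses are never folded back into training.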

11 Test Algorithm
Training: manually label a small set of samples; segment into phrases & POS-tag; record context features
Testing: for an input sentence, check the context features of the target noun, compare the evidence, and output the sense with the most evidence
Feedback: samples with high comparative evidence are added back into the training set

12 Comparative Evidence
Definition
- choose argmax_i CE_i, where CE_i = Σ_{j=1..m} f_ij for each sense i = 1..n
- CE: comparative evidence; n: number of senses; m: number of evidence features found in the test sentence; f_ij: frequency with which feature j is recorded in a sentence containing sense i
Procedure
- Choose the sense with maximum comparative evidence
- If the largest CE is not larger than the second largest CE by a threshold, the sentence cannot be classified (margin)
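A direct reading of the definition above can be coded as follows. This is a reconstruction from the slide's variable definitions, so the exact formula in the paper may differ; the frequency table and margin value here are made up for the example.

```python
def comparative_evidence(freq, features):
    """CE_i = sum over the found features j of f_ij, for each sense i.

    freq[sense][feature] plays the role of f_ij.
    """
    return {sense: sum(table.get(f, 0) for f in features)
            for sense, table in freq.items()}

def choose_sense(freq, features, margin=2):
    """Pick the max-CE sense; abstain when the top two are too close."""
    ranked = sorted(comparative_evidence(freq, features).items(),
                    key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None                      # sentence cannot be classified
    return ranked[0][0]

# Illustrative f_ij table for two senses of "bank".
freq = {"finance": {"money": 5, "loan": 3},
        "river":   {"water": 4, "shore": 2}}
```

With this table, a sentence containing "money" and "loan" is classified as the finance sense, while one containing "money" and "water" falls inside the margin and is left unclassified.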

13 Experiment Result – “tank”

14 Experiment Result – “bank”

15 Experiment Result – “bass”

16 Experiment Result – “country”

17 Experiment Result – "record"
Record 1: "archived event" vs. "pinnacle achievement"
Record 2: "archived event" vs. "musical disk"

18 Conclusions and Future Work
Main advantage: bootstrapping alleviates the tagging bottleneck; no sizable sense-tagged corpus is needed
Results show the method is successful
Unsupervised learning
- helps improve performance on general words
- has limitations on difficult words like "country"
- also helps reduce the amount of manual work
Use of partial syntactic information: richer than common statistical techniques
Proposed improvements
- bootstrapping from bilingual corpora
- improving the evidence metric (adjust weights automatically; weight over the entire corpus and each sense; add more feature types)
- integrating WordNet

19 Discussion 1: Initial Training
A good training base must already be available; that is, initial hand tagging is required. Once training is complete, noun homograph disambiguation is fast, but the initial set is still large (20-30 occurrences per sense), so the cost of tagging remains high!

20 Discussion 2: Resources
Advantages of an unrestricted corpus
- compared to dictionaries, it includes sufficient contextual variety
- unfamiliar words can be integrated automatically
Assumption
- the context around an instance of a sense of the homograph is meaningfully related to that sense
Is a semantic lexicon needed?
- "Numerous residences, banks, and libraries" → parallel buildings
- "They use holes in trees, banks, or rocks for nests" → parallel natural objects

21 References
Hearst, M. A. (1991). Noun Homograph Disambiguation Using Local Context in Large Text Corpora.
Yarowsky, D. (1992). Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora.
Chin (1999). Word Sense Disambiguation Using Statistical Techniques.
Peh, L. S. and Ng, H. T. (1997). Domain-Specific Semantic Class Disambiguation Using WordNet.
Dagan, I. and Itai, A. (1994). Word Sense Disambiguation Using a Second Language Monolingual Corpus.