
1 Word Sense Disambiguation and Information Retrieval. By Guitao Gao and Qing Ma. Prof: Jian-Yun Nie

2 Outline  Introduction  WSD Approaches  Conclusion

3 Introduction  Task of Information Retrieval  Content Representation  Indexing  Bag-of-words indexing  Problems: –Synonymy: addressed by query expansion –Polysemy: addressed by Word Sense Disambiguation
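
As a concrete (and purely hypothetical) illustration of why bag-of-words indexing runs into these two problems, here is a minimal inverted-index sketch; the documents and queries are invented.

```python
from collections import defaultdict

# A tiny bag-of-words inverted index (documents and queries are invented).
documents = {
    1: "the bank approved the loan",          # financial sense of "bank"
    2: "we walked along the river bank",      # river sense of "bank"
    3: "the lender granted the mortgage",     # relevant to loans, but shares no query terms
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return documents containing any query term (no sense information)."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(term, set())
    return hits

# Polysemy: a query about financial banks also retrieves the river document.
print(search("bank"))   # {1, 2}
# Synonymy: a query for "loan" misses document 3, which says "mortgage".
print(search("loan"))   # {1}
```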

4 WSD Approaches  Disambiguation based on manually created rules  Disambiguation using machine readable dictionaries  Disambiguation using thesauri  Disambiguation based on unsupervised machine learning with corpora

5 Disambiguation based on manually created rules  Weiss’ approach [Weiss 1973]: –a set of rules to disambiguate five words –context rule: co-occurrence within 5 words –template rule: word in a specific location –accuracy: 90% –IR improvement: 1%  Small & Rieger’s approach [Small 1982]: –expert system of “word experts”

6 Disambiguation using machine readable dictionaries  Lesk’s approach [Lesk 1988]: –Senses are represented by their different dictionary definitions –Look up the definitions of the context words –Count words co-occurring between definitions –Select the most similar sense –Accuracy: 50%–70% –Problem: not enough overlapping words between definitions
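
A minimal sketch of the gloss-overlap idea behind Lesk's approach, using a tiny invented dictionary; the real method works over full machine-readable dictionary definitions.

```python
# Simplified Lesk-style gloss overlap (illustrative sketch with a toy dictionary).
TOY_DICTIONARY = {
    "cone": {
        "seed": "fruit of certain evergreen trees, as the pine",
        "shape": "solid body which narrows to a point from a circular base",
    },
    "pine": {
        "tree": "kind of evergreen tree with needle shaped leaves",
        "grieve": "waste away through sorrow or illness",
    },
}

def lesk_sense(word, context_words, dictionary):
    """Pick the sense whose definition shares the most words with the
    definitions of the surrounding context words (Lesk's basic idea)."""
    context_defs = set()
    for cw in context_words:
        for definition in dictionary.get(cw, {}).values():
            context_defs.update(definition.split())
    best_sense, best_overlap = None, -1
    for sense, definition in dictionary[word].items():
        overlap = len(set(definition.split()) & context_defs)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# "pine cone": the "seed" sense wins because its gloss overlaps the pine glosses
# (e.g. "evergreen", "of"), while the "shape" gloss shares nothing.
print(lesk_sense("cone", ["pine"], TOY_DICTIONARY))  # -> "seed"
```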

7 Disambiguation using machine readable dictionaries  Wilks’ approach [Wilks 1990]: –Attempts to solve Lesk’s problem by expanding dictionary definitions –Uses the Longman Dictionary of Contemporary English (LDOCE) –More word co-occurrence evidence is collected –Accuracy: between 53% and 85%

8 Wilks’ approach [Wilks 1990]: table of commonly co-occurring words in LDOCE [Wilks 1990]

9 Disambiguation using machine readable dictionaries  Luk’s approach [Luk 1995]: –Statistical sense disambiguation –Uses definitions from LDOCE –Co-occurrence data collected from the Brown corpus –Defining concepts: the 1792 words used to write the definitions in LDOCE –LDOCE pre-processed into a conceptual expansion

10 Luk’s approach [Luk 1995]: noun “sentence” and its conceptual expansion [Luk 1995]
Entry in LDOCE | Conceptual expansion
1. (an order given by a judge which fixes) a punishment for a criminal found guilty in court | {order, judge, punish, crime, criminal, find, guilt, court}
2. a group of words that forms a statement, command, exclamation, or question, usu. contains a subject and a verb, and (in writing) begins with a capital letter and ends with one of the marks . ! ? | {group, word, form, statement, command, question, contain, subject, verb, write, begin, capital, letter, end, mark}

11 Luk’s approach [Luk 1995] cont.  Collect co-occurrence data of defining concepts by constructing a two-dimensional Concept Co-occurrence Data Table (CCDT): –the Brown corpus is divided into sentences –conceptual co-occurrence data are collected for the defining concepts that occur in each sentence –the collected counts are inserted into the CCDT (see the sketch below)
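
A minimal sketch of this construction step, assuming sentence-level co-occurrence counting over a toy corpus and a toy set of defining concepts (both invented here).

```python
from collections import defaultdict
from itertools import combinations

# Sketch of building a Concept Co-occurrence Data Table (CCDT): count, for each
# pair of defining concepts, how many corpus sentences contain both.
DEFINING_CONCEPTS = {"judge", "court", "crime", "word", "verb", "letter"}

corpus_sentences = [
    "the judge sent the criminal to court for the crime",
    "every sentence needs a verb and ends with a letter or mark",
]

ccdt = defaultdict(int)
for sentence in corpus_sentences:
    concepts = {w for w in sentence.lower().split() if w in DEFINING_CONCEPTS}
    for c1, c2 in combinations(sorted(concepts), 2):
        ccdt[(c1, c2)] += 1

print(ccdt[("court", "judge")])   # 1: co-occur in the first sentence
print(ccdt[("letter", "verb")])   # 1: co-occur in the second sentence
```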

12 Luk’s approach [Luk 1995] cont. –Score each sense S with respect to context C [Luk 1995]

13 Luk’s approach [Luk 1995] cont. –Select sense with the highest score –Accuracy: 77% –Human accuracy: 71%
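
The scoring formula referred to on slide 12 is not reproduced in this transcript. As a rough, hypothetical stand-in (not Luk's actual probabilistic score), the sketch below simply sums CCDT co-occurrence counts between each sense's defining concepts and the concepts found in the context, then picks the top-scoring sense.

```python
def score_sense(sense_concepts, context_concepts, ccdt):
    """Simplified stand-in for Luk's score: total co-occurrence evidence
    between a sense's defining concepts and the context's concepts.
    (Luk 1995 uses a probabilistic formula; this only shows the shape.)"""
    total = 0
    for s in sense_concepts:
        for c in context_concepts:
            if s != c:
                total += ccdt.get(tuple(sorted((s, c))), 0)
    return total

def disambiguate(senses, context_concepts, ccdt):
    """Return the sense whose defining concepts best match the context."""
    return max(senses, key=lambda name: score_sense(senses[name], context_concepts, ccdt))

# Toy example in the spirit of the CCDT sketch above: two senses of "sentence".
senses_of_sentence = {
    "punishment": {"judge", "court", "crime"},
    "grammar": {"word", "verb", "letter"},
}
context = {"judge", "court"}          # defining concepts found in the context
ccdt = {("court", "judge"): 3, ("court", "crime"): 1, ("letter", "verb"): 2}
print(disambiguate(senses_of_sentence, context, ccdt))  # -> "punishment"
```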

14 Approaches using Roget's Thesaurus [Yarowsky 1992]  Resources used: –Roget's Thesaurus –Grolier Multimedia Encyclopedia  Senses of a word: categories in Roget's Thesaurus  1042 broad categories covering areas such as tools/machinery or animals/insects

15 Approaches using Roget's Thesaurus [Yarowsky 1992] cont. tool, implement, appliance, contraption, apparatus, utensil, device, gadget, craft, machine, engine, motor, dynamo, generator, mill, lathe, equipment, gear, tackle, tackling, rigging, harness, trappings, fittings, accoutrements, paraphernalia, equipage, outfit, appointments, furniture, material, plant, appurtenances, a wheel, jack, clockwork, wheel-work, spring, screw, Some words placed into the tools/machinery category [Yarowsky 1992]

16 Approaches using Roget's Thesaurus [Yarowsky 1992] cont.  Collect contexts for each category: –from the Grolier Encyclopedia –for each occurrence of each member of the category –extract the 100 surrounding words. Sample occurrences of words in the tools/machinery category [Yarowsky 1992]

17 Approaches using Roget's Thesaurus [Yarowsky 1992] cont.  Identify and weight salient words: sample salient words for Roget categories 348 and 414 [Yarowsky 1992]  To disambiguate a word: sum up the weights of all salient words appearing in its context and pick the best-scoring category  Accuracy: 92% on 12 test words
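
A rough sketch of the salience-weighting idea, assuming a simple smoothed log-ratio weight (Yarowsky's published weighting differs in detail); all counts below are invented.

```python
import math

# Weight each context word by how much more likely it is near members of a
# Roget category than in text overall, then pick the category whose salient
# words accumulate the largest total weight over the test context.

def salience(word, category_counts, global_counts, total_category, total_global):
    """Log-ratio of the word's probability in the category's training
    contexts to its overall probability (with add-one smoothing)."""
    p_cat = (category_counts.get(word, 0) + 1) / (total_category + 1)
    p_all = (global_counts.get(word, 0) + 1) / (total_global + 1)
    return math.log(p_cat / p_all)

def best_category(context, training, global_counts, total_global):
    scores = {}
    for category, counts in training.items():
        total_cat = sum(counts.values())
        scores[category] = sum(
            salience(w, counts, global_counts, total_cat, total_global)
            for w in context
        )
    return max(scores, key=scores.get)

training = {                                   # invented training counts
    "TOOLS/MACHINERY": {"water": 20, "shaft": 15, "engine": 30},
    "ANIMALS/INSECTS": {"water": 25, "species": 40, "wing": 10},
}
global_counts = {"water": 500, "shaft": 30, "engine": 60, "species": 80, "wing": 40}
total_global = 10_000

print(best_category(["engine", "shaft", "water"], training, global_counts, total_global))
# -> "TOOLS/MACHINERY"
```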

18 Introduction to WordNet (1)  Online thesaurus system  Synsets: sets of synonymous words  Hierarchical relationships

19 Introduction to WordNet (2): figure from [Sanderson 2000]

20 Voorhees’ Disambiguation Experiment  Calculate semantic distance between a synset and the context words  Word’s sense: the synset closest to the context words  Retrieval result: worse than without disambiguation
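
For illustration only, a sketch of the general idea of picking the synset closest to the context, using NLTK's WordNet interface and path similarity; this is not Voorhees' exact "hood"-based construction.

```python
# Requires: pip install nltk; then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def closest_synset(target, context_words):
    """Choose the noun synset of `target` most similar to the context words."""
    best_synset, best_score = None, -1.0
    for candidate in wn.synsets(target, pos=wn.NOUN):
        score = 0.0
        for cw in context_words:
            sims = [
                candidate.path_similarity(ctx_syn) or 0.0
                for ctx_syn in wn.synsets(cw, pos=wn.NOUN)
            ]
            score += max(sims, default=0.0)   # best match for each context word
        if score > best_score:
            best_synset, best_score = candidate, score
    return best_synset

# "bank" near "money" and "loan" should prefer the financial-institution synset.
print(closest_synset("bank", ["money", "loan"]))
```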

21 Gonzalo’s IR experiment (1): two questions  Can WordNet really offer any potential for text retrieval?  How is text retrieval performance affected by disambiguation errors?

22 Gonzalo’s IR experiment (2)  Text collection: summary and document experiments  1. Standard SMART run  2. Indexing in terms of word senses  3. Indexing in terms of synsets  4. Introduction of disambiguation errors

23 Gonzalo’s IR experiment (3)
Experiment | % correct documents retrieved
Indexing by synsets | 62.0
Indexing by word senses | 53.2
Indexing by words | 48.0
Indexing by synsets (5% errors) | 62.0
Id. with 10% errors | 60.8
Id. with 20% errors | 56.1
Id. with 30% errors | 54.4
Id. with all possible senses | 52.6
Id. with 60% errors | 49.1

24 Gonzalo’s IR experiment (4)  Disambiguation with WordNet can improve text retrieval  The solution lies in reliable automatic WSD techniques

25 Disambiguation with Unsupervised Learning: Yarowsky’s Unsupervised Method  One sense per collocation, e.g. plant (manufacturing/life)  One sense per discourse, e.g. defense (war/sports)

26 Yarowsky’s Unsupervised Method cont. Algorithm details  Step 1: Store the word and its contexts, one per line, e.g. “… zonal distribution of plant life …”  Step 2: Identify a few seed words that represent each sense, e.g. plant (manufacturing/life)  Step 3a: Learn rules from the training set, e.g. plant + X => A (weight), plant + Y => B (weight)  Step 3b: Use the rules created in 3a to classify all occurrences of plant in the sample set

27 Yarowsky’s Unsupervised Method cont.  Step 3c: Use the one-sense-per-discourse rule to filter or augment the newly labelled examples  Step 3d: Repeat steps 3a–3c iteratively  Step 4: Training converges on a stable residual set  Step 5: The result is a set of rules used to disambiguate the word “plant”, e.g. plant + growth => life, plant + car => manufacturing (a bootstrapping sketch follows)
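
A compressed, hypothetical sketch of this bootstrapping loop; the real algorithm ranks collocation rules in a log-likelihood decision list and applies the one-sense-per-discourse filter, both omitted here.

```python
from collections import Counter, defaultdict

def bootstrap(contexts, seed_rules, iterations=5, min_count=2):
    rules = dict(seed_rules)                      # collocate word -> sense label
    for _ in range(iterations):
        # Step 3b: classify every occurrence that any current rule covers.
        labels = {}
        for i, words in enumerate(contexts):
            senses = [rules[w] for w in words if w in rules]
            if senses:
                labels[i] = Counter(senses).most_common(1)[0][0]
        # Step 3a (next round): collect collocation evidence from the labelled set.
        counts = defaultdict(Counter)             # word -> Counter of sense labels
        for i, sense in labels.items():
            for w in contexts[i]:
                counts[w][sense] += 1
        # Promote collocates that are unambiguous and frequent enough to new rules.
        for w, c in counts.items():
            if len(c) == 1 and c.most_common(1)[0][1] >= min_count:
                rules.setdefault(w, c.most_common(1)[0][0])
    return rules

contexts = [                                      # occurrences of the word "plant"
    ["zonal", "distribution", "of", "plant", "life", "and", "growth"],
    ["plant", "life", "supports", "growth"],
    ["rapid", "plant", "growth", "in", "spring"],
    ["the", "car", "plant", "closed"],
    ["automated", "car", "plant", "equipment"],
]
seeds = {"life": "LIFE", "car": "MANUFACTURING"}
print(bootstrap(contexts, seeds))
# {'life': 'LIFE', 'car': 'MANUFACTURING', 'growth': 'LIFE'}
```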

28 Yarowsky’s Unsupervised Method cont. Advantages of this method:  Better accuracy than other unsupervised methods  No need for the costly hand-tagged training sets required by supervised methods

29 Schütze and Pedersen’s approach [Schütze 1995]  Source of word sense definitions –no dictionary or thesaurus is used –only the corpus to be disambiguated (the Category B TREC-1 collection)  Thesaurus construction –collect a (symmetric) term-term matrix C –entry c_ij: number of times that words i and j co-occur in a symmetric window of total size k –use SVD to reduce the dimensionality

30 Schütze and Pedersen’s approach [Schütze 1995] cont. –Thesaurus vectors: the (reduced) columns of the matrix –Semantic similarity: cosine between columns –Thesaurus: associate each word with its nearest neighbors –Context vector: the sum of the thesaurus vectors of the context words (see the sketch below)
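
A small sketch of this thesaurus-construction step, assuming an invented co-occurrence matrix and a truncated SVD via NumPy; because C is symmetric, rows and columns are interchangeable.

```python
import numpy as np

# Invented symmetric term-term co-occurrence matrix C for six words.
words = ["judge", "court", "crime", "verb", "noun", "grammar"]
C = np.array([
    [0, 9, 7, 0, 0, 1],
    [9, 0, 8, 0, 1, 0],
    [7, 8, 0, 0, 0, 0],
    [0, 0, 0, 0, 6, 5],
    [0, 1, 0, 6, 0, 7],
    [1, 0, 0, 5, 7, 0],
], dtype=float)

U, s, _ = np.linalg.svd(C)
k = 2                                    # keep only the top-k dimensions
vectors = U[:, :k] * s[:k]               # reduced thesaurus vector per word

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

idx = {w: i for i, w in enumerate(words)}
print(cosine(vectors[idx["judge"]], vectors[idx["court"]]))   # high: same topic
print(cosine(vectors[idx["judge"]], vectors[idx["verb"]]))    # low: different topics
```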

31 Schütze and Pedersen’s approach [Schütze 1995] cont.  Disambiguation algorithm –identify the context vectors of all occurrences of a particular word –partition them into regions of high density –assign a sense to each such region –to disambiguate an occurrence: compute its context vector, find the closest region centroid, and assign the sense of that centroid
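
A small sketch of this clustering step, assuming invented 2-D thesaurus vectors and scikit-learn's KMeans as a stand-in for the density-based partitioning.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D stand-ins for the SVD-reduced thesaurus vectors above.
thesaurus = {
    "money": np.array([1.0, 0.1]), "loan": np.array([0.9, 0.0]),
    "deposit": np.array([0.8, 0.2]), "river": np.array([0.1, 1.0]),
    "water": np.array([0.0, 0.9]), "shore": np.array([0.2, 0.8]),
}

def context_vector(context_words):
    """Context vector = sum of the thesaurus vectors of the context words."""
    return sum(thesaurus[w] for w in context_words if w in thesaurus)

occurrences = [                      # contexts of the ambiguous word "bank"
    ["money", "loan"], ["deposit", "money"], ["river", "water"], ["shore", "river"],
]
X = np.array([context_vector(c) for c in occurrences])

# Each cluster centroid is treated as one sense of "bank".
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_occurrence = context_vector(["loan", "deposit"])              # unseen context
sense_id = kmeans.predict(new_occurrence.reshape(1, -1))[0]
print(sense_id, kmeans.labels_)      # the new occurrence joins the "financial" cluster
```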

32 Schütze and Pedersen’s approach [Schütze 1995] cont.  Accuracy: 90%  Application to IR –replace words by word senses –sense-based retrieval’s average precision over 11 points of recall increased by 4% with respect to word-based retrieval –combining the word-based and sense-based rankings for each document: average precision increased by 11% –assigning each occurrence n (2, 3, 4, 5) senses: average precision increased by 14% for n = 3

34 Conclusion  How much can WSD improve IR effectiveness? Still an open question –Weiss: 1% improvement; Voorhees’ method: negative –Krovetz and Croft, Sanderson: only useful for short queries –Schütze and Pedersen’s approach and Gonzalo’s experiment: positive results  WSD must be accurate to be useful for IR  Schütze and Pedersen’s and Yarowsky’s algorithms: promising for IR  Luk’s approach: robust to data sparseness, suitable for small corpora

35 References
[Krovetz 1992] R. Krovetz & W.B. Croft. “Lexical Ambiguity and Information Retrieval”, ACM Transactions on Information Systems, 10(1), 1992.
[Gonzalo 1998] J. Gonzalo, F. Verdejo, I. Chugur & J. Cigarran. “Indexing with WordNet synsets can improve Text Retrieval”, Proceedings of the COLING/ACL ’98 Workshop on Usage of WordNet for NLP, Montreal, 1998.
[Lesk 1988] M. Lesk. “They said true things, but called them by wrong names” – vocabulary problems in retrieval systems, Proc. 4th Annual Conference of the University of Waterloo Centre for the New OED, 1988.
[Luk 1995] A.K. Luk. “Statistical sense disambiguation with relatively small corpora using dictionary definitions”, Proceedings of the 33rd Annual Meeting of the ACL, Columbus, Ohio, June 1995.
[Salton 1983] G. Salton & M.J. McGill. Introduction to Modern Information Retrieval: The SMART and SIRE experimental retrieval systems, New York: McGraw-Hill, 1983.
[Sanderson 1997] M. Sanderson. Word Sense Disambiguation and Information Retrieval, PhD Thesis, Technical Report TR-1997-7, Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK, 1997.
[Sanderson 2000] M. Sanderson. “Retrieving with Good Sense”, http://citeseer.nj.nec.com/sanderson00retrieving.html, 2000.

36 References cont.
[Schütze 1995] H. Schütze & J.O. Pedersen. “Information retrieval based on word senses”, Proceedings of the Symposium on Document Analysis and Information Retrieval, 4: 161-175, 1995.
[Small 1982] S. Small & C. Rieger. “Parsing and comprehending with word experts (a theory and its realisation)”, in Strategies for Natural Language Processing, W.G. Lehnert & M.H. Ringle (Eds.), LEA: 89-148, 1982.
[Voorhees 1993] E.M. Voorhees. “Using WordNet to disambiguate word sense for text retrieval”, Proceedings of the ACM SIGIR Conference, (16): 171-180, 1993.
[Weiss 1973] S.F. Weiss. “Learning to disambiguate”, Information Storage and Retrieval, 9: 33-41, 1973.
[Wilks 1990] Y. Wilks, D. Fass, C. Guo, J.E. McDonald, T. Plate & B.M. Slator. “Providing Machine Tractable Dictionary Tools”, Machine Translation, 5: 99-154, 1990.
[Yarowsky 1992] D. Yarowsky. “Word sense disambiguation using statistical models of Roget’s categories trained on large corpora”, Proceedings of the COLING Conference: 454-460, 1992.
[Yarowsky 1994] D. Yarowsky. “Decision lists for lexical ambiguity resolution: Application to Accent Restoration in Spanish and French”, Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, 1994.
[Yarowsky 1995] D. Yarowsky. “Unsupervised word sense disambiguation rivaling supervised methods”, Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, Cambridge, MA, 1995.

