Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.

1 Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval
Paul Clough (1) and Mark Stevenson (2)
(1) Department of Information Studies, (2) Department of Computer Science
University of Sheffield, UK

2 Outline
GWC2004, 20th January 2004
– Introduction
– Word sense disambiguation
– Experimental setup
– CLIR evaluation
– WSD evaluation
– Discussion and conclusion

3 Introduction
– CLIR: search for documents written in one language (target) with queries written in another (source)
– Approaches: translate query, documents or both
– Translation methods: e.g. MT, MRDs, parallel corpora, controlled vocabulary
– Problems: e.g. lexical coverage, ambiguity, small context, proper names, compound words
– WSD: identify the correct sense of a word during translation
– Experiments: with EuroWordNet and “standard” IR test collection resources

4 Example translation
Source query:
  Number: CL1
  Caso Waldenheim
EuroWordNet:
  caso#1 --> [case#9:grammatical case#1:](4167794) "nouns or pronouns or adjectives (often marked by inflection) related in some way to other words in a sentence"
  caso#2 --> [case#12:instance#2:](4704301) "an occurrence of something; "it was a case of bad judgment""
  caso#3 --> [case#16:event#2:](8533655) "a special set of circumstances; "in that event, the first possibility is excluded""
Disambiguation needed?
Target query:
  Case (event) Waldenheim

5 Word sense disambiguation
– Each Spanish noun can be associated with multiple synsets; in addition, each of these can be mapped to multiple synsets in the ILI (English WN)
– Attempt to automatically identify the EuroWordNet synset appropriate to the query using WSD
– Adapt Resnik’s algorithm for disambiguating groups of nouns:
  – Treats EuroWordNet as a hierarchy and identifies the most likely synsets based on distance in WordNet and corpus information
  – Query is treated as a “bag of words”
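
The idea of disambiguating a group of nouns over a hierarchy can be illustrated with a minimal sketch in the spirit of Resnik's noun-group method. It is not the authors' adaptation: the hierarchy, the information content values and the word senses below are all hypothetical toy data, and keeping every sense when no evidence is found is an assumption.

```python
from itertools import combinations

# Hypothetical toy hierarchy (child -> parent); NOT real EuroWordNet data.
PARENT = {
    "entity": None,
    "event": "entity",
    "legal_proceeding": "event",
    "legal_case": "legal_proceeding",
    "trial": "legal_proceeding",
    "object": "entity",
    "container": "object",
    "box_case": "container",
    "person": "entity",
    "judge_person": "person",
}

# Hypothetical information content per concept (in practice: corpus counts).
IC = {
    "entity": 0.0, "event": 1.0, "legal_proceeding": 3.0, "legal_case": 5.0,
    "trial": 5.0, "object": 1.0, "container": 2.5, "box_case": 5.0,
    "person": 1.5, "judge_person": 5.0,
}


def ancestors(concept):
    """Return the concept itself plus all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def most_informative_subsumer(senses_a, senses_b):
    """Find the shared ancestor with the highest IC over all sense pairs."""
    best, best_ic = None, -1.0
    for a in senses_a:
        for b in senses_b:
            for c in set(ancestors(a)) & set(ancestors(b)):
                if IC[c] > best_ic:
                    best, best_ic = c, IC[c]
    return best, best_ic


def disambiguate(word_senses):
    """Map each word to its best-supported sense(s), treating the query as a bag of words."""
    support = {w: {s: 0.0 for s in senses} for w, senses in word_senses.items()}
    norm = {w: 0.0 for w in word_senses}
    for w1, w2 in combinations(word_senses, 2):
        subsumer, value = most_informative_subsumer(word_senses[w1], word_senses[w2])
        if subsumer is None:
            continue
        for w in (w1, w2):
            for s in word_senses[w]:
                if subsumer in ancestors(s):
                    support[w][s] += value  # credit senses under the subsumer
            norm[w] += value
    result = {}
    for w, senses in word_senses.items():
        if norm[w] == 0.0:
            result[w] = list(senses)  # assumption: no evidence, keep all senses
        else:
            best = max(support[w].values())
            result[w] = [s for s in senses if support[w][s] == best]
    return result


query = {
    "case": ["legal_case", "box_case"],
    "trial": ["trial"],
    "judge": ["judge_person"],
}
selected = disambiguate(query)
print(selected["case"])
```

In this toy query, "case" and "trial" share the informative ancestor "legal_proceeding", so only the legal sense of "case" receives support. A word with no informative neighbour keeps all of its senses, which is one way an algorithm of this kind can return several senses for a single word.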

6 Experimental setup
– TREC 6 collection (242,918 documents and 25 queries)
– Spanish used for CL retrieval and English as monolingual baseline
– Query translation process: term identification --> term translation (EWN) --> retrieval
– EWN transformed into a kind of MRD for translation
– Focused on translation of nouns and adjectives
– Synset selection: manually, first, all or WSD algorithm
– Synset member selection: head (first) or all
– Experimented with short (title) and longer queries (title + description)

7 Example translation
Source query:
  Number: CL1
  Caso Waldenheim
EuroWordNet:
  caso#1 --> [case#9:grammatical case#1:](4167794) "nouns or pronouns or adjectives (often marked by inflection) related in some way to other words in a sentence"
  caso#2 --> [case#12:instance#2:](4704301) "an occurrence of something; "it was a case of bad judgment""
  caso#3 --> [case#16:event#2:](8533655) "a special set of circumstances; "in that event, the first possibility is excluded""
Disambiguation needed?
Translated queries:
  1st sense, head:        case Waldenheim
  1st sense, all words:   case grammatical case Waldenheim
  all senses, head:       case Waldenheim
  all senses, all words:  case grammatical case Instance event Waldenheim
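
The four translation strategies can be sketched as follows. This is a minimal illustration, not the authors' code: the `EWN` dictionary holds only the three "caso" synsets shown on the slide, the function names are hypothetical, and both the removal of repeated words and the pass-through of untranslatable terms (such as the proper name) are assumptions.

```python
# Synsets for Spanish "caso" as listed on the slide, in sense order.
# Each synset is a list of English members; the first member is the head.
EWN = {
    "caso": [
        ["case", "grammatical case"],
        ["case", "instance"],
        ["case", "event"],
    ],
}


def translate_term(term, synsets="first", members="head"):
    """Translate one query term; unknown terms are passed through unchanged."""
    senses = EWN.get(term.lower())
    if senses is None:
        return [term]  # assumption: keep untranslatable terms as they are
    chosen = senses[:1] if synsets == "first" else senses
    words = []
    for synset in chosen:
        candidates = synset[:1] if members == "head" else synset
        for word in candidates:
            if word not in words:  # assumption: drop repeated translations
                words.append(word)
    return words


def translate_query(terms, synsets="first", members="head"):
    """Translate every term and join the results into a target-language query."""
    translated = []
    for term in terms:
        translated.extend(translate_term(term, synsets, members))
    return " ".join(translated)


source = ["Caso", "Waldenheim"]
for synsets in ("first", "all"):
    for members in ("head", "all"):
        print(synsets, members, "->", translate_query(source, synsets, members))
```

With gold-standard or WSD-based synset selection, `chosen` would instead be the manually or automatically selected synsets rather than the first one or all of them.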

8 CLIR evaluation (title & description)
– Measured MAP and relevant retrieved using trec_eval
– Baseline: MAP = 0.3512, relevant retrieved = 979

Synset selection  Synset members  Relevant retrieved  MAP     Note
GOLD              All             890                 0.2823  80% monolingual
GOLD              1st             676                 0.2459
All               All             760                 0.2203
All               1st             698                 0.2215
1st               All             707                 0.2158
1st               1st             550                 0.1994
WSD               All             765                 0.2534  Highest (72% monolingual)
WSD               1st             579                 0.2073
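
For readers unfamiliar with the measure, mean average precision (MAP) can be sketched as below. This is a simplified illustration of the standard definition, not a replacement for trec_eval; the document identifiers and relevance judgements are made up for the example.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents for the query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def relevant_retrieved(runs):
    """Total number of relevant documents found across all query rankings."""
    return sum(len(set(r) & set(rel)) for r, rel in runs)


# Hypothetical rankings and judgements for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6", "d9"}),
]
print(mean_average_precision(runs), relevant_retrieved(runs))
```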

9 CLIR evaluation (title only)
– Baseline: MAP = 0.3355, relevant retrieved = 977

Synset selection  Synset members  Relevant retrieved  MAP     Note
GOLD              All             890                 0.2823  84% monolingual
GOLD              1st             676                 0.2459
All               All             760                 0.2203
All               1st             698                 0.2215
1st               All             707                 0.2158
1st               1st             550                 0.1994
WSD               All             765                 0.2534  Highest (76% monolingual)
WSD               1st             579                 0.2073
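
The "% monolingual" annotations on the two CLIR evaluation slides are consistent with dividing a run's MAP by the monolingual baseline MAP. The short check below uses only figures quoted on the slides; the dictionary keys and the function name are illustrative.

```python
# MAP figures quoted on the two CLIR evaluation slides.
BASELINE_MAP = {"title+description": 0.3512, "title": 0.3355}
RUN_MAP = {"GOLD/All": 0.2823, "WSD/All": 0.2534}


def percent_of_monolingual(run_map, baseline_map):
    """Express a cross-language MAP as a rounded percentage of the monolingual MAP."""
    return round(100 * run_map / baseline_map)


for query_type, baseline in BASELINE_MAP.items():
    for run, score in RUN_MAP.items():
        print(query_type, run, percent_of_monolingual(score, baseline))
```

The four values come out as 80 and 72 for title + description queries and 84 and 76 for title-only queries, matching the slide annotations.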

10 WSD evaluation
– Manual annotation identifies a single correct sense for each noun; the WSD algorithm can return multiple senses
– Calculated two evaluation metrics:
  – Relaxed: score 1 if the correct sense is identified; corresponds to the proportion of words where the correct sense is included
  – Strict: score 1/m if the correct sense is included in the m senses returned; gives an indication of the number of incorrect senses returned
– “Choose first synset” used as naïve baseline
[Diagram: m returned senses, one of them marked as the correct sense]
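
The two metrics can be computed as in the sketch below, which follows the definitions on the slide. The sense labels in the example are hypothetical, and the code assumes each list of returned senses contains no duplicates.

```python
def score_word(gold_sense, returned_senses):
    """Return (strict, relaxed) scores for one word.

    relaxed: 1 if the gold sense is among the returned senses, else 0.
    strict:  1/m if the gold sense is among the m returned senses, else 0.
    """
    if gold_sense in returned_senses:
        return 1.0 / len(returned_senses), 1.0
    return 0.0, 0.0


def evaluate(items):
    """Average strict and relaxed scores over (gold sense, returned senses) pairs."""
    strict_total = 0.0
    relaxed_total = 0.0
    for gold_sense, returned_senses in items:
        strict, relaxed = score_word(gold_sense, returned_senses)
        strict_total += strict
        relaxed_total += relaxed
    return strict_total / len(items), relaxed_total / len(items)


# Hypothetical annotations: one exact hit, one hit among two senses, one miss.
items = [
    ("case#16", ["case#16"]),
    ("bank#1", ["bank#1", "bank#2"]),
    ("plant#2", ["plant#1"]),
]
strict, relaxed = evaluate(items)
print(strict, relaxed)
```

A method that always returns exactly one sense, such as the first-synset baseline, gets identical strict and relaxed scores under these definitions.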

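11 WSD evaluation

Language  Method  Strict  Relaxed
English   WSD     0.41    0.55
English   first   0.47    0.47
Spanish   WSD     0.44    0.55
Spanish   first   0.48    0.48

(The slide gives a single score for each “first” baseline, which returns one sense per word, so it is shown under both metrics.)
– WSD results are disappointing compared to state-of-the-art
– Limited context of queries seems to make disambiguation difficult
– BUT this does not seem to affect CLIR results!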

12 Discussion and conclusions
– Disagreement over the usefulness of WSD for monolingual retrieval:
  – WSD algorithms have to be accurate to be useful for retrieval
  – the IR algorithm performs a kind of disambiguation anyway
– Our results suggest some WSD is better than none for CLIR using EWN as the translation resource, even with poor WSD performance
– WSD algorithm is well suited to CLIR, where it selects senses only when there is sufficient context
– Experiments highlight limitations in EWN for CLIR: many types of useful semantic information are missing, and lexical coverage is limited

13 Future work
– Experiment with different languages supported by EWN to see if results generalise
– Experiment with different datasets (e.g. CLEF) and further bilingual pairs, e.g. English --> Spanish
– Use advanced query construction techniques, e.g. the “synonym” operator to combine synset members
– Combine various WSD algorithms to improve on their individual effectiveness
– Improve the translation process based on EWN, e.g. identify phrases

