
1 Multilingual experiments of UTA @ CLEF 2003 Eija Airio, Heikki Keskustalo, Turid Hedlund, Ari Pirkola University of Tampere, Finland Department of Information Studies

2 Multilingual indexing
  two possibilities
    to create a common index for all the languages
    to create a separate index for each language
  UTA followed the approach of separate indexes

3 Our result merging strategies in CLEF 2003
  the raw score approach as a baseline
  the dataset size based method
    185 German, 81 French, 99 Italian, 106 English, 285 Spanish, 120 Dutch, 35 Finnish and 89 Swedish documents (sum = 1000 docs)
  the score difference based method
    every score is compared with the best score of the topic
    only documents whose score differs from the best score by less than a predefined value are taken to the final list
    e.g. if the best score of the topic is 0.480001 and the difference value is 0.08, we take a document with score 0.400002, but not a document with score 0.400001
  the final ordering (1000 docs / topic) is done by the raw score merging strategy
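The three merging strategies on slide 3 can be sketched roughly as follows. This is a minimal illustration, not the UTA implementation: the function names, the (doc_id, score) data layout and the sample document ids are assumptions, and the sketch assumes that "the best score of the topic" means the best score over all language-specific result lists. Scores are given as Decimal values so that the boundary case from the slide is not blurred by binary floating point.

```python
from decimal import Decimal

# Per-language quotas quoted on the slide (they sum to 1000 documents).
QUOTAS = {
    "German": 185, "French": 81, "Italian": 99, "English": 106,
    "Spanish": 285, "Dutch": 120, "Finnish": 35, "Swedish": 89,
}


def raw_score_merge(result_lists, limit=1000):
    """Pool all hits and order them by their raw retrieval score."""
    merged = [hit for hits in result_lists.values() for hit in hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:limit]


def dataset_size_merge(result_lists, quotas, limit=1000):
    """Keep a fixed number of top hits per language, then order by raw score."""
    selected = {}
    for lang, hits in result_lists.items():
        ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
        selected[lang] = ranked[:quotas.get(lang, 0)]
    return raw_score_merge(selected, limit)


def score_difference_merge(result_lists, max_diff, limit=1000):
    """Keep hits whose distance to the best score is under max_diff."""
    all_hits = [hit for hits in result_lists.values() for hit in hits]
    if not all_hits:
        return []
    # Assumption: the best score is taken over all lists for the topic.
    best = max(score for _, score in all_hits)
    kept = {"all": [hit for hit in all_hits if best - hit[1] < max_diff]}
    return raw_score_merge(kept, limit)


# Hypothetical result lists reproducing the numeric example on the slide.
example = {
    "Finnish": [("fi-1", Decimal("0.480001")), ("fi-2", Decimal("0.400001"))],
    "Swedish": [("sv-1", Decimal("0.400002")), ("sv-2", Decimal("0.250000"))],
}
merged = score_difference_merge(example, Decimal("0.08"))
merged_ids = [doc_id for doc_id, _ in merged]
```

With the slide's threshold of 0.08, the document scoring 0.400002 is kept and the one scoring 0.400001 is dropped, because its difference to the best score is exactly 0.08 and therefore not under the threshold.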

4 Indexing methods
  inflected index
    dataset words are stored as such
    employed by WWW search engines
  normalized index
    stemming
    morphological analysis
  we applied normalized indexing in our CLEF 2003 runs

5 Word normalization methods
  stemming
    suitable for languages with weak morphology
    several stemming techniques
    in CLEF 2003 we mostly applied stemmers based on the Porter stemmer
  morphological analysis
    full description of inflectional morphology
    large lexicon of basic vocabulary
    suitable for languages with strong morphology
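The difference between an inflected and a normalized index (slides 4 and 5) can be illustrated with a small sketch. Everything here is an assumption made for illustration: toy_stem is a deliberately crude suffix stripper, not the Porter stemmer or any stemmer used in the runs, and the two documents are invented.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, normalize=lambda word: word):
    """Map each (optionally normalized) token to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return dict(index)


# Invented sample documents.
docs = {"d1": "stemming indexes", "d2": "index languages"}

inflected = build_index(docs)            # words stored as such
normalized = build_index(docs, toy_stem)  # variants conflated to one key
```

In the inflected index, "index" and "indexes" are separate keys, so a query for "index" finds only d2. In the normalized index both forms share the key "index", so the same query finds d1 and d2.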

6 UTA indexes
  UTA applied both stemmers and morphological analyzers in the multilingual runs of CLEF 2003
  we built both stemmed and morphologically analyzed indexes for English, Finnish and Swedish
  for Dutch, French, German, Italian and Spanish we built stemmed indexes

7 The UTACLIR process
  each source word is normalized utilizing a morphological analyzer
  source stop words are removed
  each normalized source word is translated
  translated words are normalized (by a morphological analyzer or a stemmer, depending on the target language code)
  target stop words are removed
  if the source word is untranslatable, the two highest ranked words obtained in n-gram matching are selected as query words from the target index
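The steps on slide 7 can be sketched as a small pipeline. This is not the UTACLIR code: the function names, the toy dictionary, the stop lists and the index terms are all hypothetical, and the n-gram matching is approximated here by a Dice coefficient over padded character bigrams, which may differ from the matching technique actually used.

```python
def char_ngrams(word, n=2):
    """Set of character n-grams of a word, padded at both ends."""
    padded = f"_{word.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b):
    """Dice similarity between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def best_ngram_matches(word, index_terms, k=2):
    """The k index terms most similar to the word (ties broken alphabetically)."""
    return sorted(index_terms, key=lambda term: (-dice(word, term), term))[:k]


def build_target_query(source_words, normalize_source, source_stop,
                       dictionary, normalize_target, target_stop, index_terms):
    query = []
    for word in source_words:
        base = normalize_source(word)              # 1. normalize source word
        if base in source_stop:                    # 2. drop source stop words
            continue
        translations = dictionary.get(base)        # 3. translate
        if translations is None:                   # untranslatable word:
            query.extend(best_ngram_matches(base, index_terms))
            continue
        for translation in translations:
            target = normalize_target(translation)  # 4. normalize translation
            if target not in target_stop:           # 5. drop target stop words
                query.append(target)
    return query


# Hypothetical English-to-Finnish toy data.
query = build_target_query(
    source_words=["the", "cars", "Chernobyl"],
    normalize_source=lambda w: w.lower().rstrip("s"),  # toy normalizer
    source_stop={"the"},
    dictionary={"car": ["auto"]},
    normalize_target=str.lower,
    target_stop={"ja"},
    index_terms=["auto", "talo", "tshernobyl", "tshernobylin"],
)
```

In this toy run the proper name has no dictionary entry, so the two index terms closest to it in bigram overlap are added to the query in place of a translation.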

8 Our results
  index type     merging strategy   average precision %   difference %
  morph./stem.   raw score          18.6
  morph./stem.   dataset size       18.3                  -1.6
  morph./stem.   score diff/top     18.2                  -2.1
  morph./stem.   round robin        18.4                  -1.1
  stemmed        dataset size       18.6                   0.0
  stemmed        score diff/top     18.5                  -0.5
  stemmed        raw score          18.3                  -1.6
  stemmed        round robin        18.4                  -1.1

9 The results of our additional monolingual English, bilingual English-Finnish and bilingual English-Swedish runs
  language   index type        average precision %   difference %
  English    morphol. anal.    45.6
  English    stemmed           46.3                  +1.5
  Finnish    morphol. anal.    34.0
  Finnish    stemmed           19.0                  -44.1
  Swedish    morphol. anal.    27.1
  Swedish    stemmed           19.0                  -29.9
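The "difference %" column on slide 9 appears to be the relative change of the stemmed run against the morphologically analyzed run for the same language. The sketch below recomputes it from the rounded average precision values on the slide; the function and variable names are illustrative assumptions.

```python
def relative_difference(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# (morphological analysis, stemmed) average precision % from slide 9.
runs = {
    "English": (45.6, 46.3),
    "Finnish": (34.0, 19.0),
    "Swedish": (27.1, 19.0),
}

differences = {
    language: round(relative_difference(morph, stemmed), 1)
    for language, (morph, stemmed) in runs.items()
}
```

Because the inputs are already rounded to one decimal, recomputed values for other tables may differ from the reported ones in the last digit.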

10 Conclusions
  all the result merging strategies we applied produced almost equal results
  the performance did not vary depending on the index type in the multilingual task

11 Conclusions II
  the impact of different word normalization methods on IR performance has not been investigated properly
  our monolingual and bilingual tests show that stemming is an adequate normalization method for English, but not for Finnish and Swedish
  so far, morphological analysis seems to offer a hard baseline for competing methods (e.g., stemming) in Finnish and Swedish
  the reasons why stemming is not adequate for Finnish and Swedish may be different and should be investigated

