Multilingual experiments of CLEF 2003
  Eija Airio, Heikki Keskustalo, Turid Hedlund, Ari Pirkola
  University of Tampere, Finland
  Department of Information Studies
Multilingual indexing
  two possibilities
    to create a common index for all the languages
    to create a separate index for each language
  UTA followed the approach of separate indexes
Our result merging strategies in CLEF 2003
  the raw score approach as a baseline
  the dataset size based method
    185 German, 81 French, 99 Italian, 106 English, 285 Spanish, 120 Dutch, 35 Finnish and 89 Swedish documents (sum = 1000 docs)
  the score difference based method
    every score is compared with the best score of the topic
    only documents whose score differs from the best score by less than a predefined value are taken to the final list
    e.g. if the best score of the topic is [...], and the difference value is 0.08, we take a document with score [...], but not a document with score [...]
    the final ordering (1000 docs / topic) is done by the raw score merging strategy
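The three strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the UTA implementation: the function names, the `(doc_id, score)` hit format and the language codes are hypothetical, the per-language quotas are copied from the slide, and the score difference method is assumed to compare each score with the best score of the same language's result list for the topic, which the slide does not state explicitly.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, retrieval score); hypothetical format

# Per-language quotas from the slide; the language codes are an assumption.
CLEF_QUOTAS = {"de": 185, "fr": 81, "it": 99, "en": 106,
               "es": 285, "nl": 120, "fi": 35, "sv": 89}


def raw_score_merge(runs: Dict[str, List[Hit]], limit: int = 1000) -> List[Hit]:
    """Pool the hits of all languages and order them by raw score."""
    pooled = [hit for hits in runs.values() for hit in hits]
    return sorted(pooled, key=lambda h: h[1], reverse=True)[:limit]


def dataset_size_merge(runs: Dict[str, List[Hit]],
                       quotas: Dict[str, int],
                       limit: int = 1000) -> List[Hit]:
    """Keep at most quotas[lang] top hits per language, then order by raw score."""
    kept = {}
    for lang, hits in runs.items():
        ranked = sorted(hits, key=lambda h: h[1], reverse=True)
        kept[lang] = ranked[:quotas.get(lang, 0)]
    return raw_score_merge(kept, limit)


def score_difference_merge(runs: Dict[str, List[Hit]],
                           max_diff: float,
                           limit: int = 1000) -> List[Hit]:
    """Keep hits scoring less than max_diff below the best hit of their own
    result list (assumption), then order the survivors by raw score."""
    kept = {}
    for lang, hits in runs.items():
        if not hits:
            kept[lang] = []
            continue
        best = max(score for _, score in hits)
        kept[lang] = [h for h in hits if best - h[1] < max_diff]
    return raw_score_merge(kept, limit)
```

With a difference value of 0.08, a result list whose best score is 0.40 would contribute a document scoring 0.35 but not one scoring 0.10 (illustrative numbers, not taken from the runs).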
Indexing methods
  inflected index
    dataset words are stored as such
    employed by WWW search engines
  normalized index
    stemming
    morphological analysis
  we applied normalized indexing in our CLEF 2003 runs
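The difference between an inflected and a normalized index can be shown with a toy inverted index. This is a minimal sketch: `build_index`, `identity`, `toy_normalize` and the small `TOY_LEMMAS` table are hypothetical stand-ins for a real stemmer or morphological analyzer, not the tools used in the runs.

```python
from collections import defaultdict
from typing import Callable, Dict, Set

# Hypothetical lemma table standing in for a real stemmer or analyzer.
# Finnish "talossa" (in the house) and "taloissa" (in the houses) share
# the base form "talo" (house).
TOY_LEMMAS = {"talossa": "talo", "taloissa": "talo", "houses": "house"}


def identity(word: str) -> str:
    """Inflected index: the word is stored as such."""
    return word


def toy_normalize(word: str) -> str:
    """Normalized index: the word is mapped to its base form if known."""
    return TOY_LEMMAS.get(word, word)


def build_index(docs: Dict[str, str],
                normalize: Callable[[str], str] = identity) -> Dict[str, Set[str]]:
    """Build an inverted index mapping each index key to the set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return dict(index)


docs = {"d1": "talossa", "d2": "taloissa"}
inflected = build_index(docs)                  # two separate keys
normalized = build_index(docs, toy_normalize)  # one shared key "talo"
```

In the inflected index a query for "talo" matches neither document; in the normalized index both documents are found under the single key.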
Word normalization methods
  stemming
    suitable for languages with weak morphology
    several stemming techniques
    in CLEF 2003 we mostly applied stemmers based on the Porter stemmer
  morphological analysis
    full description of inflectional morphology
    large lexicon of basic vocabulary
    suitable for languages with strong morphology
UTA indexes
  UTA applied both stemmers and morphological analyzers in the multilingual runs of CLEF 2003
  we built both stemmed and morphologically analyzed indexes for English, Finnish and Swedish
  for Dutch, French, German, Italian and Spanish we built stemmed indexes
The UTACLIR process
  each source word is normalized utilizing a morphological analyzer
  source stop words are removed
  each normalized source word is translated
  translated words are normalized (by a morphological analyzer or a stemmer, depending on the target language code)
  target stop words are removed
  if the source word is untranslatable, the two highest ranked words obtained in n-gram matching are selected as query words from the target index
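The steps above can be sketched as a small pipeline. This is an illustration only, not the UTACLIR code: all function and parameter names are hypothetical, the normalizers and the dictionary are passed in as plain callables and mappings, and the n-gram matching is assumed to be a Dice similarity over character bigrams, which the slide does not specify.

```python
from typing import Callable, Dict, Iterable, List, Set


def ngrams(word: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-grams (an assumed matching measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def closest_index_words(word: str, index_words: Iterable[str], k: int = 2) -> List[str]:
    """Pick the k index words most similar to an untranslatable source word."""
    ranked = sorted(index_words, key=lambda w: (-ngram_similarity(word, w), w))
    return ranked[:k]


def build_target_query(source_words: Iterable[str],
                       normalize_source: Callable[[str], str],
                       source_stops: Set[str],
                       dictionary: Dict[str, List[str]],
                       normalize_target: Callable[[str], str],
                       target_stops: Set[str],
                       index_words: Iterable[str]) -> List[str]:
    """Turn source query words into target-language query words."""
    index_words = list(index_words)
    query: List[str] = []
    for word in source_words:
        base = normalize_source(word)        # normalize the source word
        if base in source_stops:             # remove source stop words
            continue
        translations = dictionary.get(base)  # translate the base form
        if not translations:                 # untranslatable: n-gram matching
            query.extend(closest_index_words(base, index_words))
            continue
        for translation in translations:
            target = normalize_target(translation)  # normalize the translation
            if target not in target_stops:          # remove target stop words
                query.append(target)
    return query
```

A proper name missing from the dictionary, for example, falls through to the n-gram branch and is replaced by its two closest matches among the target index words.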
Our results

  index type      merging strategy    average precision %    difference %
  morph./stem.    raw score           18.6
  morph./stem.    dataset size
  morph./stem.    score diff/top
  morph./stem.    round robin
  stemmed         dataset size
  stemmed         score diff/top
  stemmed         raw score
  stemmed         round robin
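The results also list a round robin strategy. A common reading of round robin merging, assumed here rather than taken from the slides, is to interleave the per-language result lists by rank: the first document of every list, then the second of every list, and so on. The function name and the hit format below are hypothetical.

```python
from itertools import zip_longest
from typing import List, Tuple

Hit = Tuple[str, float]  # (document id, retrieval score); hypothetical format


def round_robin_merge(runs: List[List[Hit]], limit: int = 1000) -> List[Hit]:
    """Interleave ranked result lists by taking one hit from each list in turn."""
    merged: List[Hit] = []
    for tier in zip_longest(*runs):  # pads exhausted lists with None
        merged.extend(hit for hit in tier if hit is not None)
    return merged[:limit]
```

Unlike raw score merging, this ignores how scores compare across indexes, so a short or weak result list still contributes a document at every round until it runs out.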
The results of our additional monolingual English, bilingual English-Finnish and bilingual English-Swedish runs

  language    index type       average precision %    difference %
  English     morphol. anal.
  English     stemmed
  Finnish     morphol. anal.
  Finnish     stemmed
  Swedish     morphol. anal.
  Swedish     stemmed
Conclusions
  all the result merging strategies we applied produced almost equal results
  the performance did not vary depending on the index type in the multilingual task
Conclusions II
  the impact of different word normalization methods on IR performance has not been investigated properly
  our monolingual and bilingual tests show that stemming is an adequate normalization method for English, but not for Finnish and Swedish
  so far, morphological analysis seems to offer a hard baseline for competing methods (e.g., stemming) in Finnish and Swedish
  the reasons why stemming is not adequate for Finnish and Swedish may be different and should be investigated