Combining Query Translation and Document Translation in Cross-Language Retrieval
Aitao Chen & Fredric C. Gey*
School of Information Management and Systems
*UC Data Archive & Technical Assistance
University of California at Berkeley
CLEF 2003 Workshop: 21-22 August 2003, Trondheim, Norway
Talk Outline: Development of new resources; Fast approximate document translation; Combining query translation and document translation; Conclusions.
New Resources: Finnish and Swedish stoplists; base Finnish and Swedish lexicons for decompounding; statistical translation lexicons derived from parallel texts; Finnish and Swedish statistical stemmers automatically generated from parallel texts; English spelling normalizer.
Development of Swedish Stoplist (by someone who doesn’t know Swedish)
In Swedish textbooks written in English (e.g., grammars), look for Swedish words whose English translations are English stopwords.
Examples: en park (a park); ett piano (a piano); Jag vet inte mycket om honom (I don’t know much about him); efter skolan (after school); Hans och Greta (Hans and Greta).
(Source: Swedish: A Comprehensive Grammar by P. Holmes & I. Hinchliffe)
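The lookup described on this slide was done by hand; as a rough illustration, it could be sketched in Python as follows. The word-level glosses, the tiny English stoplist, and the function name `swedish_stopword_candidates` are all assumptions made for this example, not material from the talk.

```python
# Hypothetical, deliberately tiny English stoplist (assumption for illustration).
ENGLISH_STOPWORDS = {"a", "an", "the", "and", "after", "about", "i", "not", "him"}

def swedish_stopword_candidates(glossed_pairs, english_stopwords=ENGLISH_STOPWORDS):
    """Return Swedish words whose English gloss is an English stopword."""
    return sorted(
        {sv.lower() for sv, en in glossed_pairs if en.lower() in english_stopwords}
    )

# Word-level glosses read off the textbook-style examples (hand-made here).
pairs = [
    ("en", "a"), ("park", "park"),
    ("ett", "a"), ("piano", "piano"),
    ("efter", "after"), ("skolan", "school"),
    ("Hans", "Hans"), ("och", "and"), ("Greta", "Greta"),
]
candidates = swedish_stopword_candidates(pairs)
```

The candidates would still need a manual check, since a gloss match alone does not guarantee that the Swedish word behaves like a stopword.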
Development of Swedish Base Lexicon
A base lexicon should contain all and only the non-compound words and their variants.
1. Compile a list of Swedish words (e.g., from the Swedish document collection).
2. Remove the words that are 4 or fewer characters long.
3. Remove the long words that can be decomposed into shorter words in the initial wordlist, i.e., the compounds.
Example wordlist: animation, animationen, dator, datoranimation, datorgrafik, datorteknologi, datorvirus, grafik, teknologi, virus.
Decomposed compounds: datoranimation = dator + animation; datorgrafik = dator + grafik; datorteknologi = dator + teknologi; datorvirus = dator + virus.
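A minimal Python sketch of the filtering procedure on this slide, assuming compounds are plain concatenations of shorter wordlist entries (linking elements such as the Swedish -s- are ignored). The function name `build_base_lexicon` and the `min_len` parameter are illustrative and not from the talk.

```python
def build_base_lexicon(words, min_len=5):
    """Keep words of at least min_len characters that are not compounds."""
    # Drop words that are 4 or fewer characters long.
    candidates = {w.lower() for w in words if len(w) >= min_len}

    def is_compound(word):
        # reachable[i] is True when word[:i] splits into candidate words.
        n = len(word)
        reachable = [True] + [False] * n
        for end in range(1, n + 1):
            for start in range(end):
                whole_word = (start, end) == (0, n)  # a word is not its own part
                if reachable[start] and not whole_word and word[start:end] in candidates:
                    reachable[end] = True
                    break
        return reachable[n]

    return sorted(w for w in candidates if not is_compound(w))

wordlist = [
    "animation", "animationen", "dator", "datoranimation", "datorgrafik",
    "datorteknologi", "datorvirus", "grafik", "teknologi", "virus",
]
base_lexicon = build_base_lexicon(wordlist)
```

On the slide's wordlist this keeps animation, animationen, dator, grafik, teknologi, and virus. The variant animationen survives because "en" is too short to count as a part.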
Development of Statistical Translation Lexicons from Parallel Texts
Pipeline: parallel texts (EU Official Journal); PDF-to-text conversion; paragraph & sentence alignment; statistical MT toolkit / statistical association; statistical translation lexicons.
Language pairs, first group: 1. English-Dutch; 2. English-Finnish; 3. English-Swedish; 4. Dutch-English; 5. Finnish-English; 6. Swedish-English.
Language pairs, second group: 1. Italian-Spanish; 2. German-Italian; 3. Finnish-German.
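The slide names "statistical association" without saying which measure was used. As one common possibility, the sketch below scores word pairs with the Dice coefficient over sentence-aligned text. The measure, the function name `dice_lexicon`, the threshold, and the toy sentence pairs are all assumptions for illustration.

```python
from collections import Counter
from itertools import product

def dice_lexicon(sentence_pairs, min_score=0.5):
    """Map each source word to target words ranked by Dice co-occurrence score."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in sentence_pairs:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_count.update(src_words)   # sentences containing each source word
        tgt_count.update(tgt_words)   # sentences containing each target word
        pair_count.update(product(src_words, tgt_words))  # aligned co-occurrences

    lexicon = {}
    for (s, t), both in pair_count.items():
        score = 2.0 * both / (src_count[s] + tgt_count[t])
        if score >= min_score:
            lexicon.setdefault(s, []).append((t, score))
    for s in lexicon:
        lexicon[s].sort(key=lambda item: (-item[1], item[0]))
    return lexicon

# Invented toy data standing in for aligned Swedish-English sentences.
toy_pairs = [
    ("en dator", "a computer"),
    ("en diamant", "a diamond"),
    ("dator virus", "computer virus"),
]
lexicon = dice_lexicon(toy_pairs)
```

A statistical MT toolkit would instead estimate translation probabilities with an alignment model; the association score above is only the simpler of the two routes the slide mentions.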
Development of Statistical Stemmers
Swedish words: dator, datorn, datorer, datorersom, datornät, datornernä, informatik, diamant, diamanten, diamanter, diamanterna.
Statistical English translations: diamond, diamonds, computer, computers; conflated to diamond and computer.
“computer” cluster: dator, datorn, datorer, datorersom, datornät, datornernä, informatik. Stem: dator.
“diamond” cluster: diamanten, diamanterna, diamanter, diamant. Stem: diamant.
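The clustering idea on this slide can be sketched as follows: group Swedish words by the stem of their most likely English translation, then map every word in a group to one representative. The word-to-translation mapping below is hand-made, the English stemmer is a crude plural stripper, and choosing the shortest member as the stem is an assumption that happens to match the slide's example (dator, diamant).

```python
from collections import defaultdict

def english_stem(word):
    """Very crude English stemmer (assumption): strip a trailing plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def build_statistical_stemmer(top_translation):
    """Map each source word to the shortest word sharing its English stem."""
    clusters = defaultdict(list)
    for source_word, english_word in top_translation.items():
        clusters[english_stem(english_word)].append(source_word)

    stem_of = {}
    for members in clusters.values():
        representative = min(members, key=lambda w: (len(w), w))
        for word in members:
            stem_of[word] = representative
    return stem_of

# Hypothetical most-likely translations, e.g. taken from a translation lexicon.
top_translation = {
    "dator": "computer", "datorn": "computer", "datorer": "computers",
    "informatik": "computer",
    "diamant": "diamond", "diamanten": "diamond",
    "diamanter": "diamonds", "diamanterna": "diamonds",
}
stem_of = build_statistical_stemmer(top_translation)
```

Note that informatik lands in the "computer" cluster even though it shares no prefix with dator, which is the kind of conflation a suffix-stripping stemmer could not produce.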
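Fast Approximate Document Translation
1. Extract the list of Spanish words from the Spanish documents.
2. Translate the list with Spanish-English MT to obtain a list of English words.
3. Pair the two lists into a bilingual Spanish-English wordlist.
4. Translate the documents word-by-word with the wordlist to produce the English translations.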
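A minimal sketch of this word-by-word scheme, assuming whitespace tokenization and a stand-in `translate_word` callable in place of the real Spanish-English MT system. Both function names are illustrative; the two dictionary entries reuse word pairs shown elsewhere in the talk.

```python
def build_bilingual_wordlist(documents, translate_word):
    """Translate each distinct word once and keep the pairs in a dictionary."""
    vocabulary = sorted({w for doc in documents for w in doc.lower().split()})
    return {w: translate_word(w) for w in vocabulary}

def translate_document(document, wordlist):
    """Replace each word by its wordlist translation; keep unknown words as-is."""
    return " ".join(wordlist.get(w, w) for w in document.lower().split())

# Stand-in for the MT system (assumption): a tiny lookup table.
toy_mt = {"dietas": "diets", "celíacos": "celiacs"}

docs = ["dietas celíacos", "celíacos dietas dietas"]
wordlist = build_bilingual_wordlist(docs, lambda w: toy_mt.get(w, w))
translated = [translate_document(d, wordlist) for d in docs]
```

The speed-up comes from calling the MT system once per distinct word rather than once per document; the cost is that context-dependent translations are lost.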
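Query Translation-based Multilingual Retrieval
The English query is translated (L&H) into French, German, and Spanish. The IR system runs each query version against the collection in its own language (English docs, French docs, German docs, Spanish docs). A merger combines the per-language results into a combined ranked list of documents.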
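The slide does not say how the merger compares scores across languages. The sketch below simply pools the per-language results and sorts by raw score, which assumes the scores are directly comparable. The function name `merge_ranked_lists`, the document IDs, and the scores are hypothetical.

```python
def merge_ranked_lists(runs, top_k=1000):
    """Pool per-language results and rank them by raw retrieval score."""
    pooled = [
        (doc_id, score, language)
        for language, results in runs.items()
        for doc_id, score in results
    ]
    # Highest score first; ties broken by document ID for a stable order.
    pooled.sort(key=lambda item: (-item[1], item[0]))
    return pooled[:top_k]

# Hypothetical per-language result lists: (document ID, score).
runs = {
    "English": [("en-001", 0.9), ("en-002", 0.2)],
    "French": [("fr-001", 0.5)],
    "German": [("de-001", 0.7)],
}
merged = merge_ranked_lists(runs, top_k=3)
```

If scores from different collections are not comparable, some normalization per run would be needed before pooling.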
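Document Translation-based Multilingual Retrieval
The French, German, and Spanish documents are translated into English and searched together with the English documents. The IR system runs the English query against this English-language collection and returns a unified ranked list of documents.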
Query Translation vs. Document Translation
Topic 186. English words in topic: Diets for Celiacs. Query translation: Las Dietas para Celiacs (Spanish); Nahrungen für Celiacs (German). Spanish doc words: celíacos dietas; document translation (word-by-word): celiacs diets. German doc words: diät zöliakie; document translation (word-by-word): diets coeliac diseases. Average precision: 0.0003 (mul4en1), 0.6750 (mul4en2).
Topic 161. English words in topic: Dutch Netherlands. Query translation (French): Hollandais Hollande. French document words: Néerlandais Pays-Bas; document translation (word-by-word, English): Dutch Netherlands. Average precision: 0.2213 (mul4en1), 0.6167 (mul4en2).
Evaluation of Decompounding, Stemming and Query Expansion in Monolingual Retrieval
Topics (TD). Configurations: baseline, decomp, stem, expan, decomp+stem, decomp+expan, stem+expan, decomp+stem+expan.
Dutch            German           Finnish          Swedish
.4342            .3727            .3801            .3630
.4744            .4294            .4204            .4331
.4480            .4220            .4974            .4121
.4673            .4867            .4071            .4224
.5304 (22.16%)   .5678 (52.35%)   .5633 (48.20%)   .5465 (50.55%)
.4955            .5111            .4972            .4727
.5126            .5473            .4469            .4880
.4962            .4804            .5541            .4838
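The percentages in parentheses on this slide are consistent with relative gains over the first four values listed (.4342, .3727, .3801, .3630). A short check, assuming those four values and the four parenthesized values are in the order Dutch, German, Finnish, Swedish; the helper name `relative_gain` is illustrative.

```python
def relative_gain(baseline, improved):
    """Relative improvement over the baseline, in percent, to two decimals."""
    return round(100.0 * (improved - baseline) / baseline, 2)

first_values = {"Dutch": 0.4342, "German": 0.3727, "Finnish": 0.3801, "Swedish": 0.3630}
best_values = {"Dutch": 0.5304, "German": 0.5678, "Finnish": 0.5633, "Swedish": 0.5465}

gains = {
    language: relative_gain(first_values[language], best_values[language])
    for language in first_values
}
```

For example, Dutch goes from .4342 to .5304, a gain of (.5304 - .4342) / .4342, or about 22.16%.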
Conclusions: Fast approximate document translation worked well. Combining document translation with query translation was even better. Decompounding with stemming and query expansion worked well for languages with rich compounds. Statistical stemmers derived from parallel texts were not as effective as manually built stemmers for Finnish and Swedish, but there is still room for improving statistical stemmers.
Software: The Berkeley Text Retrieval System is available for research purposes. Send requests to email@example.com.