Combining Query Translation and Document Translation in Cross-Language Retrieval. Aitao Chen & Fredric C. Gey*, School of Information Management and Systems.

1 Combining Query Translation and Document Translation in Cross-Language Retrieval
Aitao Chen & Fredric C. Gey*
School of Information Management and Systems
*UC Data Archive & Technical Assistance
University of California at Berkeley
CLEF 2003 Workshop: 21-22 August, 2003, Trondheim, Norway

2 Talk Outline
Development of new resources
Fast approximate document translation
Combining query translation and document translation
Conclusions

3 New Resources
Finnish and Swedish stoplists
Base Finnish and Swedish lexicons for decompounding
Statistical translation lexicons derived from parallel texts
Finnish and Swedish statistical stemmers automatically generated from parallel texts
English spelling normalizer

4 Development of Swedish Stoplist (by someone who doesn’t know Swedish)
Look for Swedish words whose English translations are English stopwords in Swedish textbooks (e.g., grammars) written in English.
en park (a park)
ett piano (a piano)
Jag vet inte mycket om honom (I don’t know much about him)
efter skolan (after school)
Hans och Greta (Hans and Greta)
(Source: Swedish: A Comprehensive Grammar by P. Holmes & I. Hinchliffe)

5 Development of Swedish Base Lexicon
A base lexicon should contain all and only the words and their variants that are not compounds.
Compile a list of Swedish words (e.g., from the Swedish document collection).
Remove the words that are 4 or fewer characters long.
Remove the long words that can be decomposed into short words in the initial wordlist.
Example wordlist: animation, animationen, dator, datoranimation, datorgrafik, datorteknologi, datorvirus, grafik, teknologi, virus
Decompositions: dator + animation, dator + grafik, dator + teknologi, dator + virus
Remove the compounds that are decomposed.
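
The lexicon-building steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `can_decompose` and `build_base_lexicon` are hypothetical, and the sketch assumes compounds are plain concatenations of other wordlist entries (no linking elements such as the Swedish "s").

```python
def can_decompose(word, lexicon):
    """Return True if `word` is a concatenation of two or more lexicon entries."""
    n = len(word)
    # reachable[i] is True when word[:i] can be written as a sequence of entries
    reachable = [False] * (n + 1)
    reachable[0] = True
    for end in range(1, n + 1):
        for start in range(end):
            part = word[start:end]
            # part != word forbids the trivial one-piece "decomposition"
            if reachable[start] and part != word and part in lexicon:
                reachable[end] = True
                break
    return reachable[n]


def build_base_lexicon(wordlist, min_len=5):
    """Drop words of 4 or fewer characters, then drop decomposable compounds."""
    candidates = {w for w in wordlist if len(w) >= min_len}
    return {w for w in candidates if not can_decompose(w, candidates)}


words = ["animation", "animationen", "dator", "datoranimation", "datorgrafik",
         "datorteknologi", "datorvirus", "grafik", "teknologi", "virus", "en", "ett"]
base = build_base_lexicon(words)
```

Dropping the very short words first keeps an inflected form such as animationen from being split into animation + en, so it stays in the base lexicon as a variant.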

6 Development of Statistical Translation Lexicons from Parallel Texts
parallel texts (EU Official Journal)
PDF → text conversion
paragraph & sentence alignment
statistical MT toolkit: 1. English → Dutch 2. English → Finnish 3. English → Swedish 4. Dutch → English 5. Finnish → English 6. Swedish → English
statistical association: 1. Italian → Spanish 2. German → Italian 3. Finnish → German
statistical translation lexicons

7 Development of Statistical Stemmers
Swedish words: dator, datorn, datorer, datorersom, datornät, datornernä, diamanten, diamanterna, diamanter, diamant, informatik
Statistical English translations: diamond, diamonds, computer, computers (normalized to diamond, computer)
“computer” cluster: dator, datorn, datorer, datorersom, datornät, datornernä, informatik
“diamond” cluster: diamanten, diamanterna, diamanter, diamant
Stems: dator, diamant
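
One way to read this slide is that source-language words sharing the same (normalized) English translation are placed in one cluster, and every word in the cluster maps to a single stem. The sketch below illustrates that idea under loudly stated assumptions: `normalize_english` is a crude stand-in for a real English stemmer, choosing the shortest cluster member as the stem is an assumption, and the word-to-translation pairs are toy data rather than output of the authors' translation lexicon.

```python
from collections import defaultdict


def normalize_english(word):
    # Crude stand-in for an English stemmer: strip a trailing plural "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_stemmer(best_translation):
    """Map each source word to a stem shared by all words with the same English translation."""
    clusters = defaultdict(list)
    for source_word, english in best_translation.items():
        clusters[normalize_english(english)].append(source_word)
    stem_of = {}
    for members in clusters.values():
        # Assumption: use the shortest member (ties broken alphabetically) as the stem.
        stem = min(members, key=lambda w: (len(w), w))
        for w in members:
            stem_of[w] = stem
    return stem_of


# Toy translation pairs, for illustration only.
toy_translations = {
    "dator": "computer", "datorn": "computer", "datorer": "computers",
    "diamant": "diamond", "diamanten": "diamond",
    "diamanter": "diamonds", "diamanterna": "diamonds",
}
stem_of = build_stemmer(toy_translations)
```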

8 Fast Approximate Document Translation
Diagram (four numbered steps): Spanish documents; list of Spanish words; Spanish-English MT; list of English words; bilingual Spanish-English wordlist; word-by-word English translations.
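
The idea in this diagram, translating each distinct word once and then substituting word by word, can be sketched as below. This is a hedged illustration: `translate_word` is a hypothetical callable standing in for the MT system, tokenization is plain whitespace splitting, and leaving unknown words unchanged is an assumption.

```python
def collect_vocabulary(documents):
    """Collect the distinct (lowercased) words of the collection."""
    return sorted({tok for doc in documents for tok in doc.lower().split()})


def build_wordlist(vocabulary, translate_word):
    """Translate each distinct word once; `translate_word` stands in for the MT system."""
    return {word: translate_word(word) for word in vocabulary}


def translate_document(doc, wordlist):
    """Word-by-word substitution; words missing from the wordlist are kept as-is."""
    return " ".join(wordlist.get(tok, tok) for tok in doc.lower().split())


# Toy stand-in for the MT step, for illustration only.
toy_mt = {"dietas": "diets", "para": "for", "celíacos": "celiacs"}
documents = ["Dietas para celíacos"]
wordlist = build_wordlist(collect_vocabulary(documents), lambda w: toy_mt.get(w, w))
translated = [translate_document(d, wordlist) for d in documents]
```

The MT system is called once per distinct word rather than once per document, which is what makes the approach fast; the cost is that word order and context are ignored.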

9 Query Translation-based Multilingual Retrieval
Diagram: English query; L&H; English, French, German and Spanish queries; IR; English docs, French docs, German docs, Spanish docs; merger; combined ranked list of documents.
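
The merger step in this diagram combines the per-language result lists into one ranking. A minimal sketch of merging by raw retrieval score follows; the `(doc_id, score)` pair representation, the function name `merge_by_raw_score`, and the `top_k` cutoff are assumptions made for illustration.

```python
def merge_by_raw_score(ranked_lists, top_k=1000):
    """Pool (doc_id, score) pairs from all lists and sort by score, highest first."""
    pooled = [pair for ranked in ranked_lists for pair in ranked]
    pooled.sort(key=lambda pair: pair[1], reverse=True)
    return pooled[:top_k]


# Toy per-language result lists, for illustration only.
english = [("en1", 0.9), ("en2", 0.4)]
french = [("fr1", 0.7)]
german = [("de1", 0.5), ("de2", 0.1)]
merged = merge_by_raw_score([english, french, german])
```

Raw-score merging treats scores from different collections as directly comparable, which only holds approximately when each collection is searched with a different translated query.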

10 Document Translation-based Multilingual Retrieval
Diagram: English query; English, French, German and Spanish documents; IR; unified ranked list of documents.

11 Evaluation of Multilingual Retrieval

Multilingual-4: English, TD
Run ID      Trans. method       Merging method   Average precision
bkmul4en1   query-trans         raw score        0.3783
bkmul4en2   doc-trans           none             0.4082
bkmul4en3   query & doc-trans   raw score        0.4260

Multilingual-8: English, TD
Run ID      Trans. method       Merging method   Average precision
bkmul8en1   query-trans         raw score        0.3317
bkmul8en2   doc-trans           none             0.3401
bkmul8en3   query & doc-trans   raw score        0.3733

12 Query Translation vs. Document Translation

English words in topic 186: Diets for Celiacs
Query translation: Las Dietas para Celiacs (Spanish); Nahrungen für Celiacs (German)
Spanish doc words: celíacos dietas; document translation (word-by-word): celiacs diets (English)
German doc words: diät zöliakie; document translation (word-by-word): diets coeliac diseases (English)
Average precision: 0.0003 (mul4en1); 0.6750 (mul4en2)

English words in topic 161: Dutch Netherlands
Query translation: Hollandais Hollande (French)
French document words: Néerlandais Pays-Bas; document translation (word-by-word): Dutch Netherlands (English)
Average precision: 0.2213 (mul4en1); 0.6167 (mul4en2)

13 Manual vs. Automatic Stemming

CLEF 2003 (topic fields: TD. No decompounding or query expansion)
Language   No stemming   Manual (Snowball)   Automatic (parallel texts)
Finnish    0.3801        0.4972              0.4304
Swedish    0.3630        0.4121              0.3844

CLEF 2001-2002 (topic fields: TD. No query expansion)
Language   No stemming   Manual (Muscat)   Automatic (L&H MT)
French     0.3905        0.4528            0.4521
Italian    0.3801        0.4324            0.4322
Spanish    0.4687        0.5166            0.5285

14 Evaluation of Decompounding, Stemming and Query Expansion in Monolingual Retrieval
Topics (TD)
Configurations: baseline, decomp, stem, expan, decomp+stem, decomp+expan, stem+expan, decomp+stem+expan
Languages: Dutch, German, Finnish, Swedish
Values: .4342 .3727 .3801 .3630 .4744 .4294 .4204 .4331 .4480 .4220 .4974 .4121 .4673 .4867 .4071 .4224 .5304 (22.16%) .5678 (52.35%) .5633 (48.20%) .5465 (50.55%) .4955 .5111 .4972 .4727 .5126 .5473 .4469 .4880 .4962 .4804 .5541 .4838
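
The four parenthesized percentages on this slide are consistent with relative improvement over the first four values, if those are read as the Dutch, German, Finnish and Swedish baselines. That reading is an inference from the arithmetic, not something the slide states. The short check below recomputes the percentages under that assumption.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Assumed pairing of the first four values (baselines) with the four
# values that carry percentages on the slide.
baseline = {"Dutch": 0.4342, "German": 0.3727, "Finnish": 0.3801, "Swedish": 0.3630}
best = {"Dutch": 0.5304, "German": 0.5678, "Finnish": 0.5633, "Swedish": 0.5465}

gains = {lang: round(relative_gain(baseline[lang], best[lang]), 2) for lang in baseline}
```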

15 Conclusions
Fast approximate document translation worked well.
Combining document translation with query translation was even better.
Decompounding with stemming and query expansion worked well for languages with rich compounds.
Statistical stemmers derived from parallel texts were not as effective as manually built stemmers for Finnish and Swedish, but there is still room for improving statistical stemmers.

16 Software
The Berkeley Text Retrieval System is available for research purposes. Send requests to aitao@sims.berkeley.edu

17 THANK YOU

