
Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval Doctorate Course Web Information Retrieval Speaker Gaia Trecarichi.


1 Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval
Doctorate Course: Web Information Retrieval
Speaker: Gaia Trecarichi

2 Outline
- Goal
- What is Multilingual Information Retrieval (MLIR)
- Basic Approaches to MLIR
- Resource Requirements for MLIR
- Xerox Experimental Approach
- Experimental Results
- Detailed Query Analysis
- Sample Query Profile
- Conclusions and Future Extensions

3 Goal
- The goal IS NOT to build a fully functional MLIR system (too much time and too many resources needed)
- The goal IS to explore the most important factors in making MLIR effective

4 Five Definitions for MLIR
1. IR in any language other than English
2. IR on a parallel document collection or on a multilingual document collection where the search space is restricted to the query language
3. IR on a monolingual document collection which can be queried in multiple languages
4. IR on a multilingual document collection, where queries can retrieve documents in multiple languages
5. IR on multilingual documents, i.e. more than one language can be present in the individual documents

5 Basic Approaches to MLIR
- IR systems rank documents according to statistical similarity measures based on the co-occurrence of terms in queries and documents
- Mechanism for query or document translation
  - Query translation is easier but doesn't provide much context
  - Document translation could be better but is costly (time, storage resources)
- Techniques for the problem of interlingual term correspondence
  - Text Translation
  - Term Vector Translation
  - Latent Semantic Coindexing
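
The first bullet on this slide can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and cosine similarity, which is one common choice and not necessarily the measure used in the experiments. The function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two raw term-frequency vectors."""
    q = Counter(query_terms)
    d = Counter(doc_terms)
    # Only terms that occur in both the query and the document contribute.
    dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank(query_terms, docs):
    """Return document ids sorted by decreasing similarity to the query."""
    scores = {doc_id: cosine_similarity(query_terms, terms)
              for doc_id, terms in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy collection: two English documents and one French one.
docs = {
    "en1": "amendment to the constitution".split(),
    "fr1": "amendement de la constitution".split(),
    "en2": "wine growers cooperative".split(),
}
ranking = rank("constitution amendment".split(), docs)
```

With an untranslated English query, the French document is matched only through the cognate "constitution". Closing that gap is the job of the translation mechanisms listed on the slide.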

6 Text Translation
- High-end approach to MLIR (NLP and text generation techniques)
- Direct mapping of the query from the source language into one or more target languages by using an MT system
- Direct resolution of ambiguity by using structural information from the source-language text
- PRO
  - Extensive body of research on MT
  - Commercial products available
- CONS
  - Low performance of current MT systems [Radwan, 1994]

7 Term Vector Translation
- Direct mapping of each word in the query written in the source language into all of its possible definitions in the target languages
- Uses transfer dictionaries or a parallel aligned corpus for the direct mapping
- Vector Space Models can be used as retrieval strategies
- Issues related to term weighting strategies
  - Should each term be weighted according to the number of translations?
  - Should more common translations be weighted proportionally higher?
  - What resources do we use to obtain this information?
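
The first weighting question above can be made concrete with a small sketch. The dictionary entries are taken from the sample query profile later in this deck. The function `translate_vector` and its `split_weight` flag are hypothetical names, and dropping terms with no dictionary entry is an assumption of this sketch.

```python
from collections import defaultdict

# Toy transfer dictionary (French root -> English translations).
TRANSFER_DICT = {
    "intention": ["intention", "benefit"],
    "constitution": ["formation", "settlement", "constitution"],
}


def translate_vector(source_vector, dictionary, split_weight=True):
    """Map a weighted source-language term vector into the target language.

    With split_weight=True each source term spreads its weight evenly over
    its translations, so a term with many translations does not dominate
    the translated query. With split_weight=False every translation simply
    inherits the full source weight.
    """
    target = defaultdict(float)
    for term, weight in source_vector.items():
        translations = dictionary.get(term, [])
        if not translations:
            continue  # assumption: terms with no entry are dropped
        share = weight / len(translations) if split_weight else weight
        for translation in translations:
            target[translation] += share
    return dict(target)


query = {"intention": 1.0, "constitution": 1.0}
uniform = translate_vector(query, TRANSFER_DICT, split_weight=False)
split = translate_vector(query, TRANSFER_DICT)
```

Weighting more common translations higher would need translation frequencies, which a plain transfer dictionary does not supply. That is the point of the slide's last question.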

8 Latent Semantic Coindexing
- Indirect derivation of the query translation by using a training corpus
- Uses Singular Value Decomposition of a parallel document collection to obtain term vector representations
- Creates a reduced-dimension semantic space in which related terms are near each other
- Term vector representations are comparable across all the languages of the collection (documents are represented as language-independent numerical vectors)
- A query can retrieve a relevant document even if the two have no words in common
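
The decomposition behind this slide can be written out as a short sketch in standard LSI notation. The symbols are assumptions that do not appear in the slides: $X$ is the term-by-document matrix of the parallel training collection, $k$ is the number of retained dimensions, and $q$ is a query expressed as a term vector over the same vocabulary.

```latex
% Rank-k truncated SVD of the term-by-document matrix
X \approx X_k = U_k \Sigma_k V_k^{\top}

% Folding a query (or a new document) into the k-dimensional space
\hat{q} = \Sigma_k^{-1} U_k^{\top} q

% Ranking by cosine similarity in the reduced space
\operatorname{sim}(\hat{q}, \hat{d}) = \frac{\hat{q}^{\top} \hat{d}}{\lVert \hat{q} \rVert \, \lVert \hat{d} \rVert}
```

Because terms from every language of the training collection are rows of the same matrix, a query in one language and a document in another are both mapped into the same $k$-dimensional space, so they can be compared even when they share no words.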

9 LSI vs Standard Vector Model
- Standard Vector Model
  - Treats words as if they are independent
  - Represents documents as linear combinations of orthogonal terms
- LSI
  - Term-term inter-relationships are automatically modeled and used to improve retrieval by numerically analysing existing texts (no need for external dictionaries, thesauri or knowledge bases)
  - Represents terms as continuous values on each of the k orthogonal indexing dimensions

10 Resource Requirements
- Support for the character set of each language is needed
- Facilities for automatic language recognition
- Morphological analyzer (PoS recognition, stemming algorithms, inflectional analyzers)
  - Crucial to find term entries in bilingual dictionaries
  - Ex: the German word Weingärtnergenossenschaften is analyzed as the feminine plural noun Wein#Gärtner#Genosse(n)#schaft
- Resources for query translation
  - Machine Translation system
  - Transfer dictionaries
  - Parallel texts and/or monolingual domain-specific corpora

11 Resources for Query Translation
- MT System
  - For direct query translation
- Transfer dictionaries (Bilingual Thesauri)
  - For direct term vector translation
  - Extracted from bilingual general dictionaries which include lots of "noise" vocabulary
- Parallel Texts
  - To extract relationships between terms for term vector translation or to get indirect query translation (ex. LSI)
- Domain-specific monolingual corpora
  - Source of terminology to be used when parallel texts are not available

12 Transfer Dictionaries vs Parallel Texts
- Transfer Dictionaries
  - Conversion from bilingual dictionaries is a non-trivial effort
  - Provide broad but shallow coverage of the language
  - Translation probabilities are not available
  - Most technical terminology is missing
- Parallel Corpora
  - Needed in large quantity to train statistical models of great sophistication
  - Generate term translation vectors with probabilities [Brown, 1993]
  - Provide narrow but deep coverage (probabilities are domain specific)

13 Xerox Experimental Approach 1
- Evaluation in Multilingual IR
  - Uses queries with known relevance judgments
  - Starts with queries, documents, and relevance judgments in a single language
  - The queries are translated into another language by human translators
  - The translated queries are retranslated by the MLIR system
  - Results of the retranslated queries are compared to those of the original queries to get a good sense of the relative performance of the MLIR system

14 Xerox Experimental Approach 2
- Experimental Setting
  - Translated French queries and English documents
  - Conversion of an online bilingual French => English dictionary to a WORD-BASED transfer dictionary suitable for text retrieval
  - TIPSTER text collection and queries 51-100 from the TREC experiments [Harman, 1995]
  - Term vector translation model
  - Bilingual transfer dictionary used to generate the model
  - Short version of the queries (average length of 7 words)

15 Xerox Experimental Approach 3
- MLIR Process
  1. The query is morphologically analyzed and each term is replaced by its inflectional root
  2. Each root is looked up in the bilingual transfer dictionary, and a translated query is built by concatenating all term translations
  3. The translated query is sent to a traditional monolingual IR system
- Notes
  - The Vector Space Model is used to measure similarity between the query and each document
  - Specialized term weighting and resolution of translation ambiguity are ignored
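
Steps 1 and 2 of the process above can be sketched in a few lines. The `LEMMAS` table is a hypothetical stand-in for a real morphological analyzer, the dictionary entries are taken from the sample query profile later in this deck, and silently skipping words that have no dictionary entry is an assumption of this sketch.

```python
# Hypothetical stand-in for a morphological analyzer: inflected form -> root.
LEMMAS = {"première": "premier"}

# Toy word-based transfer dictionary (French root -> English translations).
TRANSFER_DICT = {
    "intention": ["intention", "benefit"],
    "premier": ["first", "initial", "bottom", "early", "front", "top",
                "leading", "basic", "primary", "original"],
    "constitution": ["formation", "settlement", "constitution"],
}


def to_root(token):
    """Step 1: replace a token by its inflectional root (identity if unknown)."""
    token = token.lower()
    return LEMMAS.get(token, token)


def translate_query(query):
    """Step 2: look up each root and concatenate all of its translations.

    No term weighting and no disambiguation are applied, matching the
    notes on the slide.
    """
    translated = []
    for token in query.split():
        # Assumption: words without a dictionary entry are skipped.
        translated.extend(TRANSFER_DICT.get(to_root(token), []))
    return " ".join(translated)


english_query = translate_query("intention première constitution")
```

Step 3 would pass `english_query` unchanged to any monolingual English retrieval engine. Three French words already expand to fifteen English terms here, which shows how quickly unresolved ambiguity dilutes the query.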

16 Experimental Results
- Comparing the original English queries to three retranslations generated by different versions of the transfer dictionary
- Three transfer dictionary versions: automatic word-based, manual word-based and manual multi-word transfer dictionary
- Average precision at 5, 10, 15 and 20 documents retrieved for the original English queries and the translations given by the different transfer dictionaries (TD):

  Query version                             Average precision
  Original English                          0.393
  Automatic word-based transfer dictionary  0.235
  Manual word-based transfer dictionary     0.269
  Manual multi-word transfer dictionary     0.357
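
The measure reported above can be sketched as follows, assuming it is the mean of the precision values at the four cutoffs for each query (then averaged over queries), and assuming precision at k always divides by k. The ranking and relevance set below are invented purely for illustration.

```python
CUTOFFS = (5, 10, 15, 20)


def precision_at(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def average_precision_at_cutoffs(ranked_ids, relevant_ids, cutoffs=CUTOFFS):
    """Mean of precision@k over the given cutoffs, for a single query."""
    values = [precision_at(ranked_ids, relevant_ids, k) for k in cutoffs]
    return sum(values) / len(values)


# Invented example: 20 retrieved documents, 6 of them judged relevant.
ranked = [f"d{i}" for i in range(1, 21)]
relevant = {"d1", "d3", "d5", "d8", "d13", "d20"}
score = average_precision_at_cutoffs(ranked, relevant)
```

For this invented ranking the four precision values are 3/5, 4/10, 5/15 and 6/20, so the query's score is their mean.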

17 Detailed Query Analysis 1
- Comparison of the performance of the translated (Tr) and original (Orig) English queries. Values given are the number of queries in each category; the last two rows break down the Tr < Orig row.

  Performance          Automatic word-based  Manual word-based  Manual multi-word
  Tr > Orig             1                     3                  4
  Tr ~ Orig            19                    22                 26
  Tr < Orig            22                    17                 12
    0.0 < Tr < Orig    10                     9                  9
    Tr = 0.0           12                     8                  3

- Improvement in performance as more manual effort is applied to the dictionary construction process
- Some queries perform much better in their translated versions

18 Detailed Query Analysis 2
- Detailed Failure Analysis
  - Carried out on the worst 17 queries when using the word-based dictionary
  - 9 queries lost information as a result of the failure to translate multi-word expressions correctly, 8 had problems due to ambiguity in translation (i.e. extraneous definitions added to the query), and 4 suffered from a loss in retranslation (meaning decays with repeated translations)
  - Recognizing and translating multi-word expressions is crucial to success in MLIR (in contrast to monolingual IR)
  - Individual components of phrases often have very different meanings in translation, so the entire sense of the phrase is often lost

19 Sample Query Profile 1
- English: original intent or interpretation of amendments to the U.S. Constitution
- French: l'intention première ou une interprétation d'un amendement de la constitution des USA
- Term vector retranslation
  intention - intention benefit
  premier - first initial bottom early front top leading basic primary original
  interprétation - interpretation
  amendement - amendment enrichment enriching agent
  constitution - formation settlement constitution
  USA - USA

20 Sample Query Profile 2

  Version    Precision  Reasons for decay
  Orig Eng   0.54
  LR         0.34       intent => intention, U.S. => USA
  TA1        0.19       constitution, amendement
  TA2        0.10       original, intention
  Trans Eng  0.05

- The decay in performance of query 76 from the original English (Orig Eng) to the translated English (Trans Eng) is due to translation ambiguity (TA) and loss in retranslation (LR)

21 Conclusions and Future Extensions
- Conclusions
  - Two primary sources of error in the current MLIR system: missing translations of multi-word expressions (MWE) and unresolved ambiguity in word-based translation
  - Additional loss-in-retranslation errors are due to the experimental design and cannot be avoided (i.e. the ambiguity introduced by the human translator)
- Future Extensions
  - Improving automatically generated transfer dictionaries
  - Extracting MWE (gathering terminology lists from various specialized domains, performing terminology extraction from corpora)
  - Resolving ambiguity (using target language texts, term weighting strategies, user interactive tools)
  - Using models other than the vector space model (e.g. weighted boolean model)

22 THANK YOU!

