A Maximum Coherence Model for Dictionary-based Cross-language Information Retrieval Yi Liu, Rong Jin, Joyce Y. Chai Dept. of Computer Science and Engineering.


1 A Maximum Coherence Model for Dictionary-based Cross-language Information Retrieval Yi Liu, Rong Jin, Joyce Y. Chai Dept. of Computer Science and Engineering, Michigan State University, East Lansing SIGIR 2005

2 INTRODUCTION When only bilingual dictionaries are available, how can the translation ambiguity of queries be resolved efficiently? The maximum coherence model –maintains the uncertainty in translating query words –estimates the translations of multiple query words simultaneously

3 INTRODUCTION Translation uncertainty problem –Given the short length of queries and the large variance in mapping information across languages, binary decisions on whether to keep or discard a candidate translation are usually difficult to make. Translation independence assumption –The translation of one query word is usually determined independently of the translations of the others.

4 RELATED WORK Statistical translation models Relevance language models –Learn an association between the words in the language of the queries and the language of the documents from a bilingual corpus. –Acquiring large parallel bilingual corpora is usually both time consuming and expensive.

5 RELATED WORK Dictionary-based CLIR –Resolves the translation ambiguity by measuring the coherence of a translation word to the entire query. The selection of translation words is then determined by their coherence scores, in one of two ways: –Only the translation with the highest coherence score is selected. –A translation word is selected when its coherence score exceeds a certain threshold.
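The two selection rules above can be sketched in a few lines of Python. This is only an illustration: the function names and the example scores are hypothetical, and how the coherence score itself is computed is left open, since the transcript does not show it.

```python
def select_best_one(scores):
    """Keep only the translation with the highest coherence score."""
    best = max(scores, key=scores.get)
    return [best]


def select_by_threshold(scores, threshold):
    """Keep every translation whose coherence score exceeds the threshold."""
    return [word for word, score in scores.items() if score > threshold]


# Hypothetical coherence scores for the candidate translations of one query word.
example_scores = {"bank": 0.7, "shore": 0.2, "slope": 0.5}

best_one = select_best_one(example_scores)
above_threshold = select_by_threshold(example_scores, 0.4)
```

Both rules make a hard keep-or-discard decision for each candidate, so any uncertainty between close scores is lost once the selection is made.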

6 RELATED WORK Typically, a translation selection strategy can be formulated as the following algorithm:

7

8 Modelling the Uncertainty in Query Translation probability of translating a word matrix for translation probabilities:
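One way to picture the translation probability matrix is as a sparse table with one row per query word and one entry per dictionary translation. The sketch below assumes that each row is a probability distribution restricted to the translations the dictionary lists for that word; the transcript does not show the exact constraints, and all names and data here are hypothetical.

```python
def is_valid_translation_matrix(probs, dictionary, tol=1e-9):
    """Check that every query word has a probability distribution over
    only the translations that its dictionary entry allows."""
    for source_word, row in probs.items():
        allowed = set(dictionary.get(source_word, []))
        # Probability mass may only sit on dictionary translations.
        if not row or any(target not in allowed for target in row):
            return False
        # Probabilities must be non-negative and sum to one.
        if any(p < 0 for p in row.values()):
            return False
        if abs(sum(row.values()) - 1.0) > tol:
            return False
    return True


# Hypothetical dictionary: placeholder source words mapped to English candidates.
toy_dictionary = {"src_1": ["bank"], "src_2": ["bank", "shore", "coast"]}
toy_probs = {
    "src_1": {"bank": 1.0},
    "src_2": {"bank": 0.1, "shore": 0.6, "coast": 0.3},
}
matrix_ok = is_valid_translation_matrix(toy_probs, toy_dictionary)
```

Unlike a hard selection, this representation keeps every candidate translation with a weight, which is what lets the model preserve translation uncertainty.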

9 statistical retrieval model for CLIR

10 Maximum Coherence Model overall coherence for a query qs,

11 Maximum Coherence Model Using the matrix notation
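The coherence formula itself is not reproduced in this transcript. As a rough sketch, assume the overall coherence sums, over every pair of distinct query words, the similarity between their candidate translations weighted by the two translation probabilities. Under that assumption the objective is a quadratic form in the entries of the translation matrix. The `sim` function and the toy data are hypothetical.

```python
from itertools import combinations


def coherence(probs, sim):
    """Sum of probability-weighted similarities between the translations
    of every pair of distinct query words (assumed form)."""
    total = 0.0
    for word_a, word_b in combinations(probs, 2):
        for target_a, p_a in probs[word_a].items():
            for target_b, p_b in probs[word_b].items():
                total += p_a * p_b * sim(target_a, target_b)
    return total


# Hypothetical symmetric similarity between target-language words.
toy_sim = {
    frozenset(("bank", "money")): 0.9,
    frozenset(("shore", "money")): 0.1,
}


def sim(a, b):
    return toy_sim.get(frozenset((a, b)), 0.0)


uncertain = {"src_1": {"bank": 0.5, "shore": 0.5}, "src_2": {"money": 1.0}}
confident = {"src_1": {"bank": 1.0}, "src_2": {"money": 1.0}}

coherence_uncertain = coherence(uncertain, sim)
coherence_confident = coherence(confident, sim)
```

Because every term couples the probabilities of two different query words, the translations cannot be optimized one word at a time, which is the sense in which the query words are estimated simultaneously.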

12 Regularizer The goal of this regularizer is to reflect our prior knowledge of translation probabilities. –Without context, we assume that all translations provided by a bilingual dictionary are equally likely to be selected.

13 Regularizer Cp is a constant that balances the contribution between the coherence measurement and the regularizer.
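The regularizer's formula is not shown in this transcript. A minimal sketch, assuming a squared-distance penalty between the translation probabilities and the uniform prior (a choice that keeps the objective quadratic), could look as follows. The function names, the penalty form, and the way `c_p` enters the objective are assumptions for illustration.

```python
def uniform_prior(dictionary):
    """Without context, every dictionary translation is equally likely."""
    return {
        source: {target: 1.0 / len(targets) for target in targets}
        for source, targets in dictionary.items()
    }


def regularized_objective(coherence_value, probs, prior, c_p):
    """Coherence minus c_p times the squared distance to the prior
    (assumed form of the regularizer)."""
    penalty = sum(
        (probs[source].get(target, 0.0) - prior_p) ** 2
        for source, row in prior.items()
        for target, prior_p in row.items()
    )
    return coherence_value - c_p * penalty


toy_dictionary = {"src_1": ["bank", "shore"]}
prior = uniform_prior(toy_dictionary)
one_hot = {"src_1": {"bank": 1.0, "shore": 0.0}}

# Hypothetical coherence value of 1.0, with c_p = 2.0.
objective_at_prior = regularized_objective(1.0, prior, prior, 2.0)
objective_one_hot = regularized_objective(1.0, one_hot, prior, 2.0)
```

With a larger `c_p` the solution stays closer to the uniform prior; with a smaller one the coherence term dominates and the probabilities become more peaked.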

14 Solving the Optimization Problem To write it in an explicit QP form, we define

15 Solving the Optimization Problem The QP (quadratic programming) package in MATLAB is used to solve the optimization problem.
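The slides rely on MATLAB's QP package. As a dependency-free illustration only, the sketch below swaps in projected gradient ascent, which is not the authors' solver. It assumes the objective is pairwise coherence minus `c_p` times the squared distance to a uniform prior, with each query word's probabilities constrained to the probability simplex. All names, defaults, and data are hypothetical.

```python
def project_to_simplex(values):
    """Euclidean projection of a list of numbers onto the probability simplex."""
    cumulative, theta = 0.0, 0.0
    for i, v in enumerate(sorted(values, reverse=True), start=1):
        cumulative += v
        candidate = (cumulative - 1.0) / i
        if v - candidate > 0:
            theta = candidate
    return [max(v - theta, 0.0) for v in values]


def maximize_coherence(dictionary, sim, c_p=1.0, step=0.05, iterations=500):
    """Projected gradient ascent on the assumed regularized objective."""
    prior = {s: {t: 1.0 / len(ts) for t in ts} for s, ts in dictionary.items()}
    probs = {s: dict(row) for s, row in prior.items()}
    for _ in range(iterations):
        updated = {}
        for s, row in probs.items():
            stepped = []
            for t, p in row.items():
                # Gradient of the coherence term: similarity to the
                # translations of all other query words, weighted by probability.
                grad = sum(
                    p2 * sim(t, t2)
                    for s2, row2 in probs.items() if s2 != s
                    for t2, p2 in row2.items()
                )
                # Gradient of the squared-distance penalty toward the prior.
                grad -= 2.0 * c_p * (p - prior[s][t])
                stepped.append(p + step * grad)
            updated[s] = dict(zip(row, project_to_simplex(stepped)))
        probs = updated
    return probs


toy_dictionary = {"src_1": ["bank", "shore"], "src_2": ["money"]}
toy_sim = {frozenset(("bank", "money")): 0.9, frozenset(("shore", "money")): 0.1}


def sim(a, b):
    return toy_sim.get(frozenset((a, b)), 0.0)


solution = maximize_coherence(toy_dictionary, sim)
```

For this toy input the probability of "bank" settles near 0.7 rather than 1.0, so the less coherent translation "shore" keeps some weight. In general the coherence term need not be concave, so a gradient method like this one may only reach a local optimum; a dedicated QP solver is the more appropriate tool for real queries.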

16 Experiment Setup TREC ad hoc test collections, including: –AP88-89: 164,835 documents from Associated Press (1988, 1989) –WSJ87-88: 83,480 documents from Wall Street Journal (1987, 1988) –DOE1-2: 226,087 documents from Department of Energy abstracts Two heterogeneous collections: –AP88-89 + WSJ87-88 –AP89 + WSJ87-88 + DOE1-2

17 Experiment Setup The SMART system is used to process the document collections. –Documents are parsed into tokens with stop words removed, and the tokens are then stemmed using the Porter algorithm. Queries come from a manual Chinese translation of the TREC-3 ad hoc topics (topics 151-200).

18 Comparison to Selection-based Approaches

19 In general, the retrieval accuracy for heterogeneous collections appears to be worse than that for homogeneous collections. Better retrieval accuracy is achieved for short queries than for long queries. For the long queries, the “BESTONE” method does not consistently outperform the “ALLTRANS” method.

20 Comparison to Selection-based Approaches

21

22 The Necessity of Including Translation Uncertainty The “BESTONE” method works well for the third example but fails on the first one. The “ALLTRANS” method would be perfect for the first example but not for the third one.

23 The Impact of Translation Independence Assumption on Query Disambiguation

24 CONCLUSIONS We propose a novel statistical model for cross-language information retrieval, named the “maximum coherence model”. –It preserves the translation uncertainty through the estimation of translation probabilities. –It estimates the translations for all query words simultaneously. In the future, we plan to improve the robustness of the maximum coherence model with regard to noise in queries.

