Mining Translations of OOV Terms from the Web through Crosslingual Query Expansion. Ying Zhang, Fei Huang, Stephan Vogel. SIGIR 2005.


1 Mining Translations of OOV Terms from the Web through Crosslingual Query Expansion. Ying Zhang, Fei Huang, Stephan Vogel. SIGIR 2005

2 Background Many approaches have been developed to mine translations of out-of-vocabulary (OOV) terms from the web. However, they all suffer from the limited availability of such bilingual resources on the web. A great amount of bilingual information does exist on the web, though, in the form of tentative translations or references, such as “片名:麥迪遜之橋 (The Bridges of Madison County) 導演:克林伊斯威特 (Eastwood, Clint).” When English terms occur in Chinese web pages, and especially when they occur within brackets, they are very likely to be translations of the immediately preceding Chinese term.
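The bracket observation above can be sketched as a small pattern match. This is a minimal illustration, not the authors' extractor: the regular expression, the function name `bracket_pairs`, and the restriction to the basic CJK Unified Ideographs range are all assumptions made for the example.

```python
import re

# A run of CJK characters immediately followed by a bracketed Latin-script span.
# Both ASCII and full-width brackets are accepted (an assumption for this sketch).
PAIR = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z ,.'\-]*?)\s*[)）]"
)

def bracket_pairs(text):
    """Return (chinese_term, english_candidate) pairs found in text."""
    return [(m.group(1), m.group(2)) for m in PAIR.finditer(text)]

sample = "片名:麥迪遜之橋 (The Bridges of Madison County) 導演:克林伊斯威特 (Eastwood, Clint)."
pairs = bracket_pairs(sample)
print(pairs)
```

On the slide's own example this pairs each bracketed English span with the Chinese run directly before it; labels such as “片名” are skipped because a colon, not a bracket, follows them.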

3 Background Two approaches have used bilingual resources on the Web: – Zhang and Vines searched for all pages containing the Chinese query term and used brackets to identify the corresponding translations. – Cheng et al. observed that if a Chinese term occurs in an English web page, its translation usually appears in the same page too.

4 Background Zhang and Vines’ method does not restrict the search space, so many web pages have to be crawled to find one containing the English translation. Cheng et al.’s method restricts the search too strongly, leaving a search space that is too small. – According to the analysis, only 1/45 of the pages containing both the OOV term and its English translation are identified by Google as English pages.

5 Overview This paper proposes a new approach that retrieves mixed-language web pages which might contain translations of the OOV term by expanding the Chinese query with an English hint word. Here Chinese is the source language and English is the target language, but the proposed method is language independent.

6 Overview Given a Chinese OOV term f, we want to find its translation e. – Assume a Chinese term f 0 is relevant to f and can be translated to e 0 using the existing bilingual lexicon. – When f and e exist in a web page, f 0 and e 0 are also very likely to exist in the same page. – Thus we search for pages containing f and e 0, where e 0 is a hint word generated by cross-lingual query expansion. For example: – To find web pages that might contain translations for “托爾斯泰” (Tolstoy), – the query is expanded to “托爾斯泰 +war+peace”, since “戰爭與和平” (War and Peace) is very relevant to “托爾斯泰” and we know its translation.
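As a rough sketch of the expansion step, the snippet below builds the expanded query string from an OOV term f and the English translation e 0 of a related term. The function name `expand_query`, the small stopword set, and the exact `+word` query syntax (copied from the slide's example) are assumptions for illustration, not a documented search-engine API.

```python
# Hypothetical stopword list; the slide's example drops "and" from "War and Peace".
STOPWORDS = {"a", "an", "and", "of", "the"}

def expand_query(oov_term, hint_translation):
    """Append each content word of the English hint to the source-language query."""
    words = [w for w in hint_translation.lower().split() if w not in STOPWORDS]
    return oov_term + " " + "".join("+" + w for w in words)

query = expand_query("托爾斯泰", "War and Peace")
print(query)
```

The expanded string keeps the Chinese term intact and adds only English words, which is what steers retrieval toward mixed-language pages.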

7 Query Expansion To propose a “good” English hint e 0 for f, we first need to find a Chinese term f 0 that is relevant to f. Because f is an OOV term, we are unlikely to obtain much information about it from the existing Chinese monolingual corpora. Instead, Google is queried for web pages containing f. From the returned snippets, f 0 is selected based on the following criteria: 1. f 0 should be reliably translatable into an English noun or noun phrase given the available bilingual resources. 2. f 0 should be one of the words most relevant to f, where relevance is estimated by its frequency among the snippets. The corresponding translations e 0 of each f 0 are then used as the hint words for f.
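The two selection criteria above might be approximated as follows. This is a simplified sketch under stated assumptions: the bilingual lexicon is a plain dictionary assumed to contain only noun entries, relevance is counted as the number of snippets mentioning a term, and substring matching stands in for proper Chinese word segmentation. The snippets and lexicon are toy data invented for the example.

```python
from collections import Counter

def select_hints(oov_term, snippets, lexicon, top_k=5):
    """Rank lexicon terms by snippet frequency and return (f0, e0) pairs."""
    counts = Counter()
    for snippet in snippets:
        for term in lexicon:  # criterion 1: term has a known translation
            if term != oov_term and term in snippet:
                counts[term] += 1  # criterion 2: frequency among snippets
    # Sort by descending count; break ties by the term itself for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, lexicon[term]) for term, _ in ranked[:top_k]]

# Toy inputs (hypothetical, for illustration only).
toy_lexicon = {"歌德": "goethe", "文學": "literature", "悲劇": "tragic"}
toy_snippets = ["浮士德 是 歌德 的 悲劇", "歌德 文學 浮士德", "浮士德 歌德"]
hints = select_hints("浮士德", toy_snippets, toy_lexicon, top_k=2)
print(hints)
```

A real system would segment the snippets into words and filter lexicon entries by part of speech; the dictionary lookup here only stands in for that step.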

8 Query Expansion For example, for f = “浮士德” (Faust): – The top candidates for f 0 are “歌德”, “簡介”, “文學”, and “悲劇”. – The original query “浮士德” is expanded to “浮士德 +goethe”, “浮士德 +introduction”, “浮士德 +literature”, and “浮士德 +tragic”, and these are sent to Google again.

9 Extracting Translations Snippets containing the query and possibly its English translation are returned by Google. Preprocessing: – HTML tags, punctuation marks and non-query source words are filtered out. The English translation is extracted from the processed top-N snippets. Confidence scores are provided for each translation candidate: 1. Transliteration cost 2. Translation cost 3. Frequency-distance weights. According to the confidence scores of the different models, we output the top-5 translation hypotheses for evaluation.
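The slides do not define the frequency-distance weight, so the sketch below uses an assumed stand-in: each occurrence of an English candidate contributes 1 / (1 + gap), where gap is the character distance to the query term in the same snippet, and contributions are summed over snippets. The function name, the formula, and the toy snippets are illustrative assumptions; only HTML-tag stripping from the preprocessing step is shown.

```python
import re
from collections import defaultdict

TAG = re.compile(r"<[^>]+>")                      # crude HTML tag matcher
ENGLISH = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)*")  # space-separated Latin words

def frequency_distance_scores(query, snippets):
    """Score English candidates by frequency, discounted by distance to the query."""
    scores = defaultdict(float)
    for raw in snippets:
        text = TAG.sub(" ", raw)
        pos = text.find(query)
        if pos < 0:
            continue  # snippet does not contain the query term
        q_start, q_end = pos, pos + len(query)
        for m in ENGLISH.finditer(text):
            # Character gap between the query and the candidate, on either side.
            gap = m.start() - q_end if m.start() >= q_end else q_start - m.end()
            scores[m.group().lower()] += 1.0 / (1.0 + max(gap, 0))
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy snippets (hypothetical, for illustration only).
toy = ["<b>浮士德</b> (Faust) 歌德", "浮士德 Faust", "Goethe 的 浮士德"]
ranked = frequency_distance_scores("浮士德", toy)
print(ranked[:5])  # keep the top-5 hypotheses
```

Candidates that appear often and close to the query rise to the top; in the described system this weight would be combined with the transliteration and translation costs, which are not modelled here.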

10 Experimental Results 310 Chinese OOV terms were collected from 12 categories: movie titles, book titles, organization names, product brands, science and technology terms, species names, person names, locations, military terms, medical terms, musical terms and sports terms. On average, 13.2 snippets were used to identify the relevant Chinese terms f 0 for each OOV term f. The top-5 f 0 terms were used to generate hint words e 0. Snippets containing both f and e 0 were then used to extract translations for f.

11 Experimental Results

12 Conclusion Cross-lingual query expansion fetches snippets with a very high inclusion rate. Various similarity and relevance features ensure high-accuracy translation extraction. Together, these yield high-quality translations for OOV terms. The approach is fast and language independent.

