Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-

Presentation transcript:

Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval
Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu+, and Lee-Feng Chien
Institute of Information Science, Academia Sinica, Taiwan
+ CSIE, National Cheng Kung University
SIGIR 2004

Abstract
Exploit the Web as the corpus source to translate unknown queries for CLIR
– Translations of unknown query terms are obtained by mining the bilingual search-result pages returned by Web search engines

Introduction
Conventional CLIR approaches have focused mainly on incorporating dictionaries and domain-specific bilingual corpora for query translation
– The incorrect translation of a few terms in a query is tolerable and can be remedied via query expansion during document retrieval
– For longer queries, it is still possible to retrieve relevant documents in the target languages even if a few query terms are unknown

Introduction
Real queries are often short
– The average query length for a Web search was about 2.3 words in English and 3.18 characters in Chinese
– Conventional CLIR approaches based on domain-specific corpora might not be applicable to translating short queries with unknown terms
    – Sufficiently large bilingual corpora are not always available
    – Using small corpora may provide a low coverage rate for translation

Introduction
Search engine log analysis
– 3-month log from Dreamer
– 228,566 unique queries
– Nearly 82.9% of the top 19,124 high-frequency query terms (with an 80% coverage rate) were not included in the LDC English-to-Chinese lexicon
– 14.9% of the unknown query terms were in English (with 1.19 words on average)

Introduction
For some language pairs, the Web contains rich texts in a mixture of multiple languages
– Such texts contain bilingual translations of proper nouns
– Two questions follow: whether this characteristic makes it possible to automatically extract the bilingual translations of a large number of unknown query terms, and whether the extracted translations (if any) can effectively improve CLIR performance

Introduction
Search-result-based approach
– Mine query translations from the dynamically retrieved bilingual search-result pages
    – A search-result page is an ordered list of snippets of summaries returned by a search engine
– Two major difficulties
    – Term extraction: how to extract terms with correct lexical boundaries from the noisy bilingual search-result pages as translation candidates
    – Translation selection: how to estimate term similarity for determining correct or relevant translations from the extracted candidates

Review on Web-based approaches
The parallel-corpus-based approaches
– Collect parallel texts of different language versions from the Web
– Nie et al.: a Web page’s parents might contain links to different versions of it, and Web pages with the same content might have similar structures and lengths
– Resnik: language identification for finding Web pages in the languages of interest
– Yang et al.: presented an alignment method based on dynamic programming to identify one-to-one Chinese and English title pairs

Parallel-corpus-based approaches
– Mining parallel texts is feasible, but some of the proposed methods might not generalize to common applications in which queries are short and diverse
– These methods often require powerful crawlers to gather sufficient Web data, as well as more network bandwidth and storage

Comparable-corpus-based approaches
– Fung et al. used a vector-space model and took a bilingual lexicon (called seed words) as the feature set to estimate the similarity between a word and its translation candidates
– How to automatically gather appropriate comparable corpora from the Web is still a challenging task

Anchor-text-based approach
– Lu et al.
– Anchor texts are utilized as an aligned bilingual comparable corpus for query translation
    – An anchor text is the descriptive part of an out-link of a Web page, used to provide a brief description of the linked Web page
    – For an unknown term appearing in an anchor text of a Web page, its target translations are likely to appear in other anchor texts linking to the same page
    – Such a bundle of anchor texts pointing to the same page is called an anchor-text set

Anchor-text-based approach
– Probabilistic model
    – A translation candidate has a higher chance of being an effective translation only if it is written in the target language and frequently co-occurs with the query term in the same anchor-text sets
    – The model further assumes that translation candidates in the anchor texts of pages with higher authority may be more reliable

Anchor-text-based approach
– The similarity between a source query s and a translation candidate t

Anchor-text-based approach
– U = {u_1, u_2, ..., u_n}, in which u_i is a page of concern
– P(u_i) is the probability value used to measure the authority of page u_i. It is estimated as the probability of u_i being linked, $P(u_i) = L(u_i) / \sum_{j=1}^{n} L(u_j)$, where L(u_j) indicates the number of in-links of page u_j
– s and t are assumed to be independent given u_i, so the joint probability P(s∩t|u_i) equals the product of P(s|u_i) and P(t|u_i). The values of P(s|u_i) and P(t|u_i) are estimated as the fractions of u_i’s in-links containing s and t, respectively, over L(u_i).
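The estimates on this slide can be sketched in a few lines of Python. The transcript does not preserve the similarity equation itself, so the ratio used below, P(s∩t) / P(s∪t) summed over pages and weighted by P(u_i), is an assumption, as is the inclusion-exclusion step for P(s∪t|u_i). The function name and the `pages` data layout are hypothetical.

```python
from typing import Dict, List


def anchor_text_similarity(pages: Dict[str, List[str]], s: str, t: str) -> float:
    """Sketch of an anchor-text similarity between s and t.

    `pages` maps a page URL u_i to the anchor texts of its in-links,
    so len(pages[u_i]) plays the role of L(u_i).
    Assumed form: sum_i P(s and t|u_i) P(u_i) / sum_i P(s or t|u_i) P(u_i).
    """
    total_links = sum(len(anchors) for anchors in pages.values())
    numerator = 0.0
    denominator = 0.0
    for anchors in pages.values():
        n_links = len(anchors)
        if n_links == 0:
            continue
        p_u = n_links / total_links                    # authority P(u_i)
        p_s = sum(s in a for a in anchors) / n_links   # P(s|u_i)
        p_t = sum(t in a for a in anchors) / n_links   # P(t|u_i)
        p_and = p_s * p_t                              # independence given u_i
        p_or = p_s + p_t - p_and                       # assumption: inclusion-exclusion
        numerator += p_and * p_u
        denominator += p_or * p_u
    return numerator / denominator if denominator > 0 else 0.0


pages = {
    "u1": ["yahoo", "雅虎", "yahoo 雅虎", "portal"],
    "u2": ["news", "新聞"],
}
print(anchor_text_similarity(pages, "yahoo", "雅虎"))
print(anchor_text_similarity(pages, "yahoo", "新聞"))
```

A candidate that never shares a linked page with the query term gets a score of zero, which matches the slide's requirement that effective translations co-occur in the same anchor-text sets.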

Search-result-based approach
Observation
– Translated or semantically close terms frequently occur together with a source query term in mixed-language texts
– Experiment
    – 430 popular English query terms (PE-430) were taken from a real search engine log and translated into a Chinese query set (PC-430)
    – 100 English query terms (RE-100) were randomly selected from the top 19,124 query terms in the log and translated into a Chinese query set (RC-100)

Search-result-based approach
The coverage rates of the test queries’ correct translations in different numbers of retrieved snippets
– More than 95% of the popular queries’ translations appeared in the top 30 to 40 snippets of the summaries from Google, and about 70% of the random queries’ translations were covered as well

Term Extraction
SCPCD
– Combines the symmetric conditional probability (SCP) with the concept of context dependency (CD)

Term Extraction
– SCP is the association estimation of an n-gram’s composed sub-n-grams, where w_1…w_n is the n-gram to be estimated, p(w_1…w_n) is the probability of occurrence of the n-gram w_1…w_n, and freq(w_1…w_n) is the frequency of the n-gram

Term Extraction
– CD is a refined measure varying from 0 to 1, where LC(w_1…w_n) (or RC(w_1…w_n)) is the number of unique left (or right) adjacent words/characters for the n-gram in the corpus, or equal to the frequency of the n-gram if there is no left (or right) adjacent word/character
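The SCP and CD equations are not preserved in this transcript, so the sketch below assumes forms consistent with the definitions above: SCP as the squared n-gram frequency over the average product of its prefix/suffix frequencies, and CD as LC × RC over the squared frequency (which stays between 0 and 1). Raw frequencies stand in for probabilities, which is a further simplification. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def ngram_stats(tokens, max_n):
    """Count n-gram frequencies and collect distinct left/right neighbours."""
    freq = Counter()
    left = defaultdict(set)
    right = defaultdict(set)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            freq[gram] += 1
            if i > 0:
                left[gram].add(tokens[i - 1])
            if i + n < len(tokens):
                right[gram].add(tokens[i + n])
    return freq, left, right


def scp(gram, freq):
    """Assumed SCP: freq(gram)^2 / average of freq(prefix) * freq(suffix)."""
    n = len(gram)
    if n < 2:
        raise ValueError("SCP needs an n-gram with n >= 2")
    avg = sum(freq[gram[:i]] * freq[gram[i:]] for i in range(1, n)) / (n - 1)
    return freq[gram] ** 2 / avg if avg else 0.0


def cd(gram, freq, left, right):
    """Assumed CD: LC * RC / freq(gram)^2, falling back to freq when a side is empty."""
    f = freq[gram]
    if f == 0:
        return 0.0
    lc = len(left.get(gram, ())) or f
    rc = len(right.get(gram, ())) or f
    return lc * rc / f ** 2


def scpcd(gram, freq, left, right):
    """Combined score: strong internal association and varied context."""
    return scp(gram, freq) * cd(gram, freq, left, right)


tokens = "a b c a b d a b c".split()
freq, left, right = ngram_stats(tokens, 2)
print(scpcd(("a", "b"), freq, left, right), scpcd(("b", "c"), freq, left, right))
```

In the toy sequence, "a b" always occurs as a unit but in varied contexts, so it scores higher than "b c", whose left neighbour is always "a".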

Translation Extraction
The Chi-square Method
– Given a source query s and a translation candidate t, suppose the total number of Web pages is N, and let
    – a = n(s,t), the number of Web pages containing both s and t
    – b = n(s,¬t), the number of Web pages containing s but not t
    – c = n(¬s,t), the number of Web pages containing t but not s
    – d = n(¬s,¬t), the number of Web pages containing neither s nor t (d = N-a-b-c)

Translation Extraction
– Assume s and t are independent. Then the expected frequencies are
    – E(s,t) = (a+c)(a+b)/N
    – E(s,¬t) = (b+d)(a+b)/N
    – E(¬s,t) = (a+c)(c+d)/N
    – E(¬s,¬t) = (b+d)(c+d)/N
– Chi-square test: summing (observed - expected)² / expected over the four cells gives $S_{\chi^2}(s,t) = \frac{N (ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$
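As a quick check on the expected frequencies above, the following sketch computes the chi-square statistic for the 2×2 table in two ways: cell by cell from the expected frequencies, and with the standard closed form for a 2×2 table. The function names and the sample counts are illustrative only.

```python
def chi_square_cells(a, b, c, d):
    """Sum of (observed - expected)^2 / expected over the four cells."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + c) * (a + b) / n,  # E(s, t)
        (b + d) * (a + b) / n,  # E(s, not t)
        (a + c) * (c + d) / n,  # E(not s, t)
        (b + d) * (c + d) / n,  # E(not s, not t)
    ]
    if any(e == 0 for e in expected):
        return 0.0
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_closed_form(a, b, c, d):
    """Standard 2x2 shortcut: N (ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Made-up page counts: s and t co-occur far more often than chance predicts.
print(chi_square_cells(20, 10, 5, 65), chi_square_closed_form(20, 10, 5, 65))
```

With a search engine, a, b and c would come from page counts returned for the corresponding queries; that wiring is left out here.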

Translation Extraction
The Context-Vector Method
– For both the query term and its candidates, take the contextual terms constituting their search-result pages as features
– tf-idf weighting scheme, where f(t_i,p) is the frequency of term t_i in search-result page p, N is the total number of Web pages, and n is the number of pages containing t_i
– Similarity: cosine measure
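A minimal sketch of the context-vector comparison follows. The exact weighting equation is not preserved in the transcript; the code assumes a common tf-idf form, term frequency normalized by the maximum frequency in the page times log(N/n). The function names and the page counts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(context_terms, page_counts, total_pages):
    """Build a sparse tf-idf vector from the terms of a search-result page.

    Assumed weight: f(t_i, p) / max_j f(t_j, p) * log(N / n).
    """
    tf = Counter(context_terms)
    if not tf:
        return {}
    max_f = max(tf.values())
    vector = {}
    for term, f in tf.items():
        n = page_counts.get(term, 0)
        if n > 0:
            vector[term] = (f / max_f) * math.log(total_pages / n)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


page_counts = {"search": 100, "portal": 10, "engine": 50, "fruit": 20}
query_vec = tfidf_vector(["search", "portal", "search"], page_counts, 1000)
cand_vec = tfidf_vector(["search", "portal"], page_counts, 1000)
other_vec = tfidf_vector(["fruit"], page_counts, 1000)
print(cosine(query_vec, cand_vec), cosine(query_vec, other_vec))
```

A candidate whose snippets share context terms with the query's snippets scores close to 1, while a candidate with unrelated context scores 0.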

Translation Extraction
Analysis
– The chi-square method is more applicable to high-frequency query terms than to low-frequency ones, since high-frequency query terms are more likely to appear with their candidate terms
– Candidates that frequently co-occur with a query term are not necessarily appropriate translations
– The context-vector method provides an effective way to overcome this problem, but its performance strongly depends on the quality of the retrieved search-result pages, such as the sizes and numbers of snippets

Translation Extraction
– Neither method needs to collect large corpora in advance
– Their execution time is determined by the processes of Web search and term/feature extraction
    – Suppose n_t translation candidates are extracted for each query term. The chi-square method requires 1+3n_t Web searches and the context-vector method requires 1+n_t. However, the context-vector method also needs 1+n_t extra feature extraction tasks. In general, feature extraction takes much more time than Web search.

The Combined Approaches
The proposed search-result-based approach is actually a combination of the chi-square and context-vector methods (χ²+CV)
Effectively exploit the two kinds of Web resources: anchor texts and search-result pages
– Combine the probabilistic inference model with the context-vector and chi-square methods

The Combined Approaches
Linear combination weighting scheme
– m_i ∈ {χ², CV, AT}, and α_{m_i} is an assigned weight for each similarity measure S_{m_i}
– R_{m_i}(s,t), which represents the similarity ranking of each translation candidate t with respect to s, is assigned a value from 1 to k (the number of candidates) in decreasing order of the similarity measure S_{m_i}(s,t)
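The combination equation itself is missing from the transcript. The sketch below assumes a reciprocal-rank form, in which each method contributes its weight divided by the candidate's rank under that method, so that rank 1 counts most. The function name, the equal weights, and the sample scores are all hypothetical.

```python
def combine_by_rank(scores_by_method, weights):
    """Combine per-method similarity scores through their rankings.

    Assumed form: S_all(s, t) = sum over methods m of weight_m / R_m(s, t),
    where R_m is 1 for the most similar candidate under method m.
    """
    combined = {}
    for method, scores in scores_by_method.items():
        # Rank candidates in decreasing order of similarity (ties keep input order).
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, candidate in enumerate(ranked, start=1):
            combined[candidate] = combined.get(candidate, 0.0) + weights[method] / rank
    return combined


scores = {
    "chi2": {"雅虎": 30.0, "奇摩": 12.0, "搜尋": 3.0},
    "cv": {"雅虎": 0.8, "奇摩": 0.9, "搜尋": 0.1},
    "at": {"雅虎": 0.5, "奇摩": 0.2, "搜尋": 0.0},
}
weights = {"chi2": 1 / 3, "cv": 1 / 3, "at": 1 / 3}
combined = combine_by_rank(scores, weights)
print(max(combined, key=combined.get))
```

Working on ranks rather than raw scores sidesteps the fact that chi-square values, cosine similarities and anchor-text probabilities live on very different scales.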

Performance Evaluation
Parallel-corpus-based approaches
– Hong Kong Law parallel text collection
    – 238,236 English-Chinese text paragraphs
    – Adopted φ², a χ²-like statistic, to measure the association between terms, and extracted word/phrase translation pairs
Anchor-text-based approach
– Collected 1,980,816 traditional Chinese Web pages in Taiwan, and then extracted 109,416 pages (URLs), whose anchor-text sets contained both traditional Chinese and English terms, as the anchor-text-set corpus

Performance Evaluation
Search-result pages
– Obtained by submitting queries to real-world search engines, including Google and Openfind
– Only the first 100 retrieved snippets were used to extract terms and features
Evaluation metric
– The average top-n inclusion rate
– Defined as the percentage of the queries whose translations could be found in the first n extracted translations
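The top-n inclusion rate defined above is straightforward to compute. The sketch below is one reading of it, with hypothetical names and made-up data: a query counts as a hit if any of its first n extracted candidates is among its accepted translations.

```python
def top_n_inclusion_rate(extracted, gold, n):
    """Fraction of queries with a correct translation among the first n candidates.

    `extracted` maps each query to its ranked candidate list;
    `gold` maps each query to the set of accepted translations.
    """
    if not extracted:
        return 0.0
    hits = sum(
        1
        for query, candidates in extracted.items()
        if any(c in gold.get(query, set()) for c in candidates[:n])
    )
    return hits / len(extracted)


extracted = {
    "yahoo": ["奇摩", "雅虎", "入口"],
    "sars": ["肺炎", "病毒"],
    "nba": ["籃球"],
}
gold = {
    "yahoo": {"雅虎"},
    "sars": {"非典型肺炎"},
    "nba": {"美國職籃", "籃球"},
}
print(top_n_inclusion_rate(extracted, gold, 1), top_n_inclusion_rate(extracted, gold, 2))
```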

Experiments on NTCIR-2
Query Translation
– There were a total of 178 unique query terms in the 50 test English title queries, and 22 of them were not included in the LDC English-Chinese lexicon
– The average length of the title queries was 3.8 English words (after removing stop words)
– The anchor-text-based and search-result-based approaches are quite complementary
    – The anchor-text-based approach can achieve higher precision (higher top-1 inclusion rates) for the test queries, while the proposed search-result-based approach can achieve higher coverage of various translation pairs (higher inclusion rates in the top-5 lists)

Query Translation Performance

CLIR Performance
– Another important merit of the proposed approach is its effectiveness in extracting semantically close translations
– The experiments investigated whether these automatically extracted translations could benefit CLIR

CLIR Performance
The probabilistic retrieval model was adopted, where Q is a query, D is a document, e is an English query term in Q, c is a target translation of e in traditional Chinese, and λ represents a smoothing parameter. P(e) is the prior probability of e, which can be estimated based on e’s page frequency on the Web. P(c|D) is the probability of c appearing in document D.
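The retrieval equation is not preserved in the transcript. The sketch below assumes a smoothed translation-based query likelihood built from the symbols defined above: for each query term e, λ·P(e) plus (1-λ) times the sum over translations c of P(e|c)·P(c|D), multiplied over all terms in Q. The function name, the dictionary layout, and the probabilities are hypothetical.

```python
import math


def log_query_likelihood(query_terms, p_e, p_e_given_c, p_c_given_d, lam=0.5):
    """Log of an assumed P(Q|D) = prod_e [lam*P(e) + (1-lam)*sum_c P(e|c)*P(c|D)].

    p_e: prior probability of each English term e.
    p_e_given_c: for each e, a dict of Chinese translations c -> P(e|c).
    p_c_given_d: probability of each Chinese term c in document D.
    """
    total = 0.0
    for e in query_terms:
        translated = sum(
            p_ec * p_c_given_d.get(c, 0.0)
            for c, p_ec in p_e_given_c.get(e, {}).items()
        )
        term_prob = lam * p_e.get(e, 0.0) + (1.0 - lam) * translated
        if term_prob <= 0.0:
            return float("-inf")  # the term cannot be generated at all
        total += math.log(term_prob)
    return total


p_e = {"yahoo": 0.01}
p_e_given_c = {"yahoo": {"雅虎": 0.7, "奇摩": 0.3}}
doc_with_translation = {"雅虎": 0.05}
doc_without = {}
print(log_query_likelihood(["yahoo"], p_e, p_e_given_c, doc_with_translation))
print(log_query_likelihood(["yahoo"], p_e, p_e_given_c, doc_without))
```

The smoothing term keeps a document from being ruled out entirely when it contains none of a term's translations, while documents containing likely translations still rank higher.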

CLIR Performance
P(e|c) is the translation probability of e given c
– Dictionary-based approach (using the LDC English-Chinese lexicon): P(e|c) ≈ 1/n_e, where n_e is the number of possible translations of c, and P(e|c) = 0 if n_e is zero
– Search-result-based approach: P(e|c) ≈ S_{χ²,CV}(e,c)
– Approach combining the search-result corpus and the anchor-text corpus: P(e|c) ≈ S_{χ²,CV,AT}(e,c)
– Hybrid approach combining all resources (dictionary + anchor-text corpus + search-result corpus): P(e|c) ≈ [S_{χ²,CV,AT}(e,c) + 1/n_e]/2
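The dictionary-based and hybrid estimates above reduce to a few lines of arithmetic. In this sketch, the function names are hypothetical, and treating the dictionary term as 0 in the hybrid case when n_e is zero is an assumption carried over from the dictionary-only rule; the slide does not say how the hybrid handles that case.

```python
def dictionary_prob(n_e):
    """Dictionary-based estimate: 1/n_e, or 0 when the lexicon has no entry."""
    return 1.0 / n_e if n_e > 0 else 0.0


def hybrid_prob(web_similarity, n_e):
    """Hybrid estimate: average of the Web-mined similarity and the dictionary estimate.

    `web_similarity` stands for the combined similarity score of (e, c),
    assumed here to be already scaled to a comparable range.
    """
    return (web_similarity + dictionary_prob(n_e)) / 2.0


print(hybrid_prob(0.6, 2))  # term covered by the dictionary
print(hybrid_prob(0.6, 0))  # out-of-vocabulary term: only the Web evidence remains
```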

CLIR Performance

Translation of Web Query Terms
– Collected Web queries from two real-world Chinese search engine logs in Taiwan, i.e. Dreamer and GAIS
– The Dreamer log contained 228,566 unique query terms from a period of over 3 months in 1998, while the GAIS log contained 114,182 unique query terms from a period of two weeks in 1999

Translation of Web Query Terms
Two different test query sets were prepared
The first set, called the popular-query set
– 430 frequent English query terms
– Obtained from the 1,230 English terms among the 9,709 most popular query terms (with frequencies above 10 in both logs)
– Two types: type Dic (terms existing in the dictionary), consisting of about 36% (156/430) of the test queries; and type OOV (out of vocabulary; terms not in the dictionary), consisting of about 64% (274/430) of the test queries

Translation of Web Query Terms
The second set, called the random-query set
– 100 English query terms
– Randomly selected from the top 19,124 queries in the Dreamer log
– About 60% of the randomly selected English query terms were not included in the LDC English-Chinese lexicon

Translation of Web Query Terms

Discussion
Flexibility for query specification
– In many CLIR applications, it is difficult to specify ‘correct’ queries in source languages for searching relevant documents in target languages, especially for particular domains such as disease names
– The search-result-based approach provides more flexibility and convenience for query specification
– Not only the query but also its relevant terms may frequently co-occur with its correct translations in the search-result pages
– Search-result pages are dynamic and allow new words to be effectively translated

Discussion
Translation effectiveness
– The search-result-based approach is feasible for translating unknown query terms
– It is applicable to some other language pairs
    – 50 scientists’ names and 50 disease names in English were randomly selected from 256 scientists (Science/People) and 664 diseases (Health/Diseases and Conditions) in the Yahoo! Directory
    – English-to-Japanese translation: the top-1, top-3, and top-5 inclusion rates were 35%, 52%, and 63%, respectively
    – English-to-Korean translation: the top-1, top-3, and top-5 inclusion rates were 32%, 54%, and 63%, respectively

Discussion
– The proposed approach is also capable of translating a query term with multiple meanings if the occurrence frequency of each of its translations is high enough on the Web
– The proposed approach might not perform well on terms that do not frequently co-occur with their translations in the search-result pages, such as some common terms, and it depends on the performance of the employed search engines
– Its translation extraction process might not be effective for language pairs that do not exhibit the mixed-language characteristic on the Web

Discussion
Application
– LiveTrans
    – Provides an online English translation service of query terms for several Asian languages
    – Provides cross-language search for retrieval of both Web pages and images