1 Lecture 11: Statistical/Probabilistic Models for CLIR & Word Alignment Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering, National Cheng Kung University 2011/05/30

2 Cross-Language Information Retrieval Issue a query in the source language and retrieve relevant documents in the target language. [Slide diagram: Source Query → Query Translation → Information Retrieval → Target Documents] Example transliterations of "Hussein": 海珊 / 侯賽因 / 哈珊 / 胡笙 (Traditional Chinese); 侯赛因 / 海珊 / 哈珊 (Simplified Chinese)

3 References
–The Web as a Parallel Corpus. Philip Resnik and Noah A. Smith. Computational Linguistics, Special Issue on the Web as Corpus, 2003.
–Automatic Construction of English/Chinese Parallel Corpora. Christopher C. Yang and Kar Wing Li. Journal of the American Society for Information Science and Technology, 2003.
–Statistical Cross-Language Information Retrieval using N-Best Query Translations. Marcello Federico and Nicola Bertoldi, ITC-irst Centro per la Ricerca Scientifica e Tecnologica. SIGIR 2002.
–Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval. Wessel Kraaij, Jian-Yun Nie and Michel Simard. Computational Linguistics, Special Issue on the Web as Corpus, 2003.
–A Probability Model to Improve Word Alignment. Colin Cherry and Dekang Lin, University of Alberta. ACL 2003.

4 The Web as Corpus

5 Outline
–The Web as a Parallel Corpus. Philip Resnik and Noah A. Smith. Computational Linguistics, Special Issue on the Web as Corpus, 2003.
–Automatic Construction of English/Chinese Parallel Corpora. Christopher C. Yang and Kar Wing Li. Journal of the American Society for Information Science and Technology, 2003.
–Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval. Wessel Kraaij, Jian-Yun Nie and Michel Simard. Computational Linguistics, Special Issue on the Web as Corpus, 2003.

6 The Web as a Parallel Corpus Philip Resnik and Noah A. Smith Computational Linguistics, Special Issue on the Web as Corpus, 2003

7 Parallel Corpora Bitexts, bodies of text in parallel translation, play an important role in machine translation and multilingual natural language processing. They are not readily available in the necessary quantities:
–Canadian parliamentary proceedings (Hansards) in English/French
–United Nations proceedings (Linguistic Data Consortium, http://www.ldc.upenn.edu/)
–Religious texts (Resnik, Olsen, and Diab)
–Localized versions of software manuals (Resnik and Melamed 1997; Menezes and Richardson)

8 STRAND An architecture for Structural Translation Recognition, Acquiring Natural Data (Resnik 1998, 1999). It identifies pairs of Web pages that are mutual translations. Web page authors disseminate information in multiple languages; when presenting the same content in two different languages, they exhibit a very strong tendency to use the same document structure.

9 Finding Parallel Web Pages Finding parallel text on the Web consists of three main steps:
–Locating pages that might have parallel translations
–Generating candidate pairs of pages that might be translations
–Structurally filtering out non-translation candidate pairs
Locating pages:
–Two types of site structure: parents and siblings
–Ask AltaVista: (anchor: "english" OR anchor: "anglais") AND (anchor: "french" OR anchor: "francais")

10 Two types of Website Structure

11 STRAND Generating candidate pairs:
–Automatic language identification (Dunning 1994)
–URL matching: a manually created list of substitution rules, e.g., http://mysite.com/english/home_en.html => http://mysite.com/big5/home_ch.html
–Document length: length(E) ≈ C · length(F)
Structural filtering:
–The heart of STRAND
–Markup analyzer: determines a set of pair-specific structural values for candidate translation pairs
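The candidate-generation step above can be sketched in a few lines. This is a minimal illustration, not STRAND's actual implementation: the substitution rules and the length tolerance are hypothetical, since real rule lists are built manually per site.

```python
import re

# Hypothetical substitution rules mapping an English URL pattern to its
# Chinese counterpart (STRAND's real rules are handcrafted per site).
URL_RULES = [
    (re.compile(r"/english/"), "/big5/"),
    (re.compile(r"_en\.html$"), "_ch.html"),
]

def candidate_url(url):
    """Apply each substitution rule to guess the translated page's URL."""
    for pattern, repl in URL_RULES:
        if pattern.search(url):
            url = pattern.sub(repl, url)
    return url

def length_filter(len_e, len_f, c=1.0, tolerance=0.3):
    """Keep a pair only if length(E) is roughly C * length(F).
    The constant c and tolerance are illustrative values."""
    if len_f == 0:
        return False
    ratio = len_e / (c * len_f)
    return (1 - tolerance) <= ratio <= (1 + tolerance)
```

For example, `candidate_url("http://mysite.com/english/home_en.html")` yields the sibling URL `http://mysite.com/big5/home_ch.html`, which is then fetched and passed to the length and structural filters.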

12

13 Automatic Construction of English/Chinese Parallel Corpora Christopher C. Yang and Kar Wing Li Journal of the American Society for Information Science and Technology, 2003

14 Web Parallel Corpora Some Web sites with bilingual text contain a completely separate monolingual sub-tree for each language. The construction method uses title alignment and dynamic programming matching.

15 References

16 Statistical Cross-Language Information Retrieval using N-Best Query Translations Marcello Federico & Nicola Bertoldi, ITC-irst Centro per la Ricerca Scientifica e Tecnologica

17 Outline Statistical CLIR Approach Query Document Model Query Translation Model

18 Statistical CLIR Approach CLIR problem:
–Given a query i in the source language (Italian), one would like to find relevant documents d in the target language (English) within a collection D; documents are ranked by P(d | i) ∝ P(i, d).
–To bridge the language difference between query and documents, a hidden variable e is introduced, representing an English translation of i.

19 Statistical CLIR Approach –P(e,d) is computed by the query-document model –P(i,e) is computed by the query-translation model
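The decomposition implied by the two bullets above can be written out as follows. This is a reconstruction from the slide's definitions, assuming i and d are conditionally independent given the hidden translation e:

```latex
P(d \mid i) \;\propto\; P(i, d) \;=\; \sum_{e} P(i, e, d)
\;\approx\; \sum_{e} \underbrace{P(i, e)}_{\text{query-translation model}}
\; \underbrace{P(d \mid e)}_{\text{query-document model}}
```

Summing over all translations e is intractable, which is why the N-best approximation on the following slides restricts the sum to the N most probable translations.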

20 Statistical CLIR Approach

21

22 Query-Document Model Statistical LM & smoothing:
–Term frequencies of a document are smoothed linearly, and the amount of probability assigned to never-observed terms is proportional to the size of the document vocabulary. The smoothed estimate interpolates a local (document) model with a global (collection) model.
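The linear (local/global) smoothing described above can be sketched as follows. This is a generic interpolation sketch, not the paper's exact estimator; the weight `lam` is an illustrative value.

```python
from collections import Counter

def smoothed_term_prob(term, doc_tokens, collection_prob, lam=0.7):
    """Linearly interpolate the local document MLE with a global
    collection model, so terms never observed in the document still
    receive nonzero probability. lam is an illustrative weight."""
    tf = Counter(doc_tokens)
    p_doc = tf[term] / len(doc_tokens) if doc_tokens else 0.0
    return lam * p_doc + (1 - lam) * collection_prob(term)
```

A term absent from `doc_tokens` falls back entirely on the global `collection_prob`, which is what lets the query-document model score documents that miss some query translations.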

23 Query-Translation Model According to the HMM, determine the N-best translations:
–The most probable translation e* can be computed with the Viterbi search algorithm.
–Intermediate results of the Viterbi algorithm can be used by the A* search algorithm to efficiently compute the N most probable translations of i.
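The Viterbi step above can be sketched as a standard trellis search. This is a minimal illustration, not the paper's code: `candidates`, `p_trans`, and `p_lm` are assumed inputs (dictionary translation options, P(i|e), and the target-language model), and the A* N-best extension is omitted.

```python
def viterbi(query_terms, candidates, p_trans, p_lm):
    """Find the single best translation sequence e* for the query.
    candidates[k] lists English options for source term k; p_trans(f, e)
    is the dictionary-based translation probability and p_lm(e, e_prev)
    the target-language probability (e_prev=None at sentence start)."""
    # trellis[k] maps each candidate at position k to (score, backpointer)
    trellis = [{e: (p_trans(query_terms[0], e) * p_lm(e, None), None)
                for e in candidates[0]}]
    for k in range(1, len(query_terms)):
        col = {}
        for e in candidates[k]:
            best_prev, best_score = None, 0.0
            for e_prev, (s, _) in trellis[k - 1].items():
                score = s * p_lm(e, e_prev)
                if score > best_score:
                    best_prev, best_score = e_prev, score
            col[e] = (best_score * p_trans(query_terms[k], e), best_prev)
        trellis.append(col)
    # backtrack from the best final state
    e, (score, _) = max(trellis[-1].items(), key=lambda kv: kv[1][0])
    path = [e]
    for k in range(len(query_terms) - 1, 0, -1):
        e = trellis[k][e][1]
        path.append(e)
    return list(reversed(path)), score
```

The partial scores stored in `trellis` are exactly the "intermediate results" that an A* search can reuse as admissible heuristics when enumerating the N-best list.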

24 Query-Translation Model The probabilities P(i | e) are estimated from a bilingual dictionary. The probabilities P(e | e′) are estimated on the target document collection (order-free bigram LM), with smoothing.

25 CLIR Algorithm Use two approximations to limit the set of possible translations and documents.

26 Complexity of CLIR Algorithm

27 Text Preprocessing

28 Blind Relevance Feedback The R most relevant terms are selected from the top B ranked documents according to:

29

30

31 Comparison with other CLIR Models Hiemstra (1999) Xu (2001)

32 Term Translation Model using Search Result Pages Apply page authority to search-result-based translation extraction

33 Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval Wessel Kraaij, Jian-Yun Nie and Michel Simard Computational Linguistics, Special Issue on the Web as Corpus, 2003

34 Web Mining for CLIR The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. Conventional approach: IR + MT (machine translation)

35 Problems in Query Translation Finding translations –Lexical coverage: Proper names and abbreviations. –Transliteration: Phonemic representation of a named entity. Jeltsin, Eltsine, Yeltsin, and Jelzin (in Latin script) Pruning translation alternatives Weighting translation alternatives

36 Exploitation of Parallel Texts Using a pseudofeedback approach (Yang et al. 1998) Capturing global cross-language term associations (Yang et al. 1998; Lavrenko 2002) Transposing to a language-independent semantic space (Dumais et al. 1997; Yang et al. 1998) Training a statistical translation model (Nie et al. 1999; Franz et al. 2001; Hiemstra 2001; Xu et al. 2001)

37 Mining Process in PTMiner

38 Embedding Translation into IR Model Basic language model Normalized log-likelihood ratio (NLLR)

39 Embedding Translation into IR Model [Slide formulas: the query model, the document model, and the basic language model scored by the log-likelihood ratio and its normalized form (NLLR)]
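The NLLR ranking formula referred to above is commonly written as follows. This is a reconstruction consistent with the definitions on the previous slide, not a verbatim copy of the paper's equation; λ is the smoothing weight and C the collection model:

```latex
\mathrm{NLLR}(Q, D) \;=\; \sum_{t \in Q} P(t \mid Q)\,
\log \frac{\lambda\, P(t \mid D) + (1 - \lambda)\, P(t \mid C)}
          {P(t \mid C)}
```

Dividing by P(t | C) normalizes out the collection likelihood, so documents are rewarded only for terms they explain better than the background model; translation probabilities can then be folded into the query model P(t | Q).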

40 A Probability Model to Improve Word Alignment Colin Cherry & Dekang Lin, University of Alberta

41 Outline Introduction Probabilistic Word-Alignment Model Word-Alignment Algorithm –Constraints –Features

42 Introduction Word-aligned corpora are an excellent source of translation-related knowledge in statistical machine translation, e.g., translation lexicons and transfer rules.
Word-alignment problem:
–Conventional approaches usually use co-occurrence models, e.g., φ² (Gale & Church 1991) and log-likelihood ratio (Dunning 1993).
–Indirect association problem: Melamed (2000) proposed competitive linking, along with an explicit noise model, to solve it.
This paper proposes a probabilistic word-alignment model that allows easy integration of context-specific features. Example: CISCO System Inc. ↔ 思科 系統
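Melamed's competitive linking, mentioned above, can be sketched as a greedy one-to-one matcher over precomputed association scores. This is a minimal illustration (without the explicit noise model); the scores are assumed inputs such as φ² or log-likelihood-ratio values.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking (after Melamed 2000): repeatedly take
    the highest-scoring (e, f) pair whose words are both still unlinked.
    scores is a dict {(e_word, f_word): association_score}."""
    links, used_e, used_f = [], set(), set()
    for (e, f), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if e not in used_e and f not in used_f:
            links.append((e, f))
            used_e.add(e)
            used_f.add(f)
    return links
```

Because each word is consumed by its strongest link first, a weaker indirect association such as (CISCO, 系統) is blocked once (CISCO, 思科) has been linked, which is how competitive linking mitigates the indirect association problem.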

43 noise

44 Probabilistic Word-Alignment Model Given E = e_1, e_2, …, e_m and F = f_1, f_2, …, f_n:
–If e_i and f_j are a translation pair, then the link l(e_i, f_j) exists
–If e_i has no corresponding translation, then the null link l(e_i, f_0) exists
–If f_j has no corresponding translation, then the null link l(e_0, f_j) exists
–An alignment A is a set of links such that every word in E and F participates in at least one link
The alignment problem is to find the alignment A that maximizes P(A | E, F); IBM's translation models instead maximize P(A, F | E).

45 Probabilistic Word-Alignment Model (Cont.) Given A = {l_1, l_2, …, l_t}, where l_k = l(e_{i_k}, f_{j_k}), define consecutive subsets of A as l_i^j = {l_i, l_{i+1}, …, l_j}. Let C_k = {E, F, l_1^{k−1}} represent the context of l_k.

46 Probabilistic Word-Alignment Model (Cont.) C_k = {E, F, l_1^{k−1}} is too complex to estimate. FT_k is a set of context-related features such that P(l_k | C_k) can be approximated by P(l_k | e_{i_k}, f_{j_k}, FT_k). Let C_k′ = {e_{i_k}, f_{j_k}} ∪ FT_k.

47 An Illustrative Example

48 Word-Alignment Algorithm Input: E, F, T_E
–T_E is E's dependency tree, which enables features and constraints based on linguistic intuitions
Constraints:
–One-to-one constraint: every word participates in exactly one link
–Cohesion constraint: use T_E to induce T_F with no crossing dependencies

49 Word-Alignment Algorithm (Cont.) Features:
–Adjacency features ft_a: for any word pair (e_i, f_j), if a link l(e_i′, f_j′) exists where −2 ≤ i′ − i ≤ 2 and −2 ≤ j′ − j ≤ 2, then ft_a(i − i′, j − j′, e_i′) is active for this context.
–Dependency features ft_d: for any word pair (e_i, f_j), let e_i′ be the governor of e_i, and let rel be the grammatical relationship between them. If a link l(e_i′, f_j′) exists, then ft_d(j − j′, rel) is active for this context.
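The adjacency-feature extraction above can be sketched directly from its definition. This is a minimal illustration with hypothetical inputs: `links` holds already-made links as (i′, j′) index pairs and `words_e` is the English sentence.

```python
def adjacency_features(i, j, links, words_e):
    """Enumerate active adjacency features ft_a for candidate pair (i, j):
    any existing link l(e_i', f_j') within a 2-word window on both sides
    contributes a feature (i - i', j - j', e_i')."""
    active = []
    for (i2, j2) in links:
        if -2 <= i2 - i <= 2 and -2 <= j2 - j <= 2 and (i2, j2) != (i, j):
            active.append((i - i2, j - j2, words_e[i2]))
    return active
```

Each returned triple identifies the relative offset and the English word of a nearby link, so the model can learn, for example, that links tend to extend diagonally from their neighbors.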

50 Experimental Results Test bed: Hansard corpus –Training: 50K aligned pairs of sentences (Och & Ney 2000) –Testing: 500 pairs

51 Future Work The alignment algorithm presented here is incapable of creating alignments that are not one-to-one; many-to-one alignment will be pursued. The proposed model is capable of creating many-to-one alignments, using the null probabilities of the words added on the "many" side.

