1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.


1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.

2 Abstract Often, the training procedure for statistical machine translation models is based on maximum likelihood or related criteria. A problem of this approach is that there is only a loose relation to the final translation quality on unseen text. In this paper, we analyze various training criteria which directly optimize translation quality. We describe a new algorithm for efficiently training on an unsmoothed error count. We show that significantly better results can often be obtained if the final evaluation criterion is taken directly into account as part of the training procedure.

3 Statistical Machine Translation with Log-linear Models To translate a source (‘French’) sentence f into a target (‘English’) sentence e, we choose the sentence with the highest probability: ê = argmax_e Pr(e|f). We directly model the posterior probability Pr(e|f) using a log-linear model. The direct translation probability is given by: Pr(e|f) = p_{λ_1^M}(e|f) = exp( Σ_{m=1}^M λ_m h_m(e,f) ) / Σ_{e'} exp( Σ_{m=1}^M λ_m h_m(e',f) ). Here h_m(e,f) are the feature functions and λ_m the model parameters.
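A minimal Python sketch of this decision rule (the data layout, feature values, and weights below are illustrative, not the paper's actual system): each candidate translation carries a feature vector h_1(e,f), …, h_M(e,f), and the model selects the candidate with the highest weighted feature sum.

import math

def loglinear_score(weights, features):
    """Unnormalized log-linear score: sum_m lambda_m * h_m(e, f)."""
    return sum(w * h for w, h in zip(weights, features))

def posterior(weights, cand_features, all_features):
    """Posterior Pr(e|f) over a fixed candidate set (softmax of the scores)."""
    scores = [loglinear_score(weights, h) for h in all_features]
    z = sum(math.exp(s) for s in scores)
    return math.exp(loglinear_score(weights, cand_features)) / z

def best_translation(weights, candidates):
    """Decision rule: pick the candidate maximizing the log-linear score.
    candidates: list of (translation, feature_vector) pairs."""
    return max(candidates, key=lambda c: loglinear_score(weights, c[1]))[0]

# illustrative usage with made-up feature values
weights = [0.5, 1.0, -0.3]
candidates = [("the house is small", [-4.1, -6.2, 3.0]),
              ("the home is little", [-4.5, -5.9, 2.0])]
print(best_translation(weights, candidates))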

4 Statistical Machine Translation with Log-linear Models In this framework, –the modeling problem amounts to developing suitable feature functions that capture the relevant properties of the translation task. –the training problem amounts to obtaining suitable parameter values λ_1^M.

5 Automatic Assessment of Translation Quality The goal is to automatically evaluate machine translation quality by comparing hypothesis translations with reference translations. Examples of such methods: –word error rate, –position-independent word error rate (Tillmann et al., 1997), –generation string accuracy (Bangalore et al., 2000), –multi-reference word error rate (Nießen et al., 2000), –BLEU score (Papineni et al., 2001), –NIST score (Doddington, 2002). All these criteria try to approximate human assessment and often achieve an astonishing degree of correlation with human subjective evaluation of fluency and adequacy (Papineni et al., 2001; Doddington, 2002).

6 Automatic Assessment of Translation Quality In this paper, we use the following methods: multi-reference word error rate (mWER): –the hypothesis translation is compared to various reference translations by computing the edit distance between the hypothesis and the closest of the given reference translations. multi-reference position-independent error rate (mPER): –This criterion ignores the word order by treating a sentence as a bag-of-words and computing the minimum number of substitutions, insertions, deletions needed to transform the hypothesis into the closest of the given reference translations.
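A small Python sketch of these two error counts on tokenized sentences (a simplified illustration; mPER is implemented here exactly as the bag-of-words distance described above):

from collections import Counter

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    d = list(range(len(ref) + 1))          # distances against the empty hypothesis
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,           # delete hypothesis word
                       d[j - 1] + 1,       # insert reference word
                       prev + (h != r))    # substitute or match
            prev = cur
    return d[-1]

def mwer(hyp, refs):
    """Multi-reference WER numerator: edit distance to the closest reference."""
    return min(edit_distance(hyp, r) for r in refs)

def mper(hyp, refs):
    """Position-independent counterpart: sentences as bags of words; the minimum
    number of substitutions/insertions/deletions between two bags equals
    max(|hyp|, |ref|) minus the size of their multiset intersection."""
    def bag_errors(h, r):
        common = sum((Counter(h) & Counter(r)).values())
        return max(len(h), len(r)) - common
    return min(bag_errors(hyp, r) for r in refs)

hyp = "the cat sat on mat".split()
refs = ["the cat sat on the mat".split(), "a cat is sitting on the mat".split()]
print(mwer(hyp, refs), mper(hyp, refs))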

7 Automatic Assessment of Translation Quality BLEU score: This criterion computes the geometric mean of the precision of n-grams of various lengths between a hypothesis and a set of reference translations, multiplied by a factor BP(∙) that penalizes short sentences: BLEU = BP(∙) ∙ exp( (1/N) Σ_{n=1}^N log p_n ). –Here p_n denotes the precision of n-grams in the hypothesis translation. We use N=4. NIST score: This criterion computes a weighted precision of n-grams between a hypothesis and a set of reference translations, multiplied by a factor BP(∙) that penalizes short sentences: NIST = BP(∙) ∙ Σ_{n=1}^N W_n. –Here W_n denotes the weighted precision of n-grams in the translation. We use N=4.
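A sketch of sentence-level BLEU as described above (no smoothing; in practice the n-gram statistics are aggregated over the whole corpus before the geometric mean is taken; names and example data are illustrative):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, refs, N=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty.
    Without smoothing, any zero precision yields BLEU = 0."""
    log_precisions = []
    for n in range(1, N + 1):
        hyp_counts = ngrams(hyp, n)
        max_ref = Counter()
        for ref in refs:
            max_ref = max_ref | ngrams(ref, n)   # per-n-gram max over references
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # brevity penalty against the reference closest in length
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1.0 - ref_len / len(hyp))
    return bp * math.exp(sum(log_precisions) / N)

hyp = "the cat sat on the mat".split()
refs = ["the cat sat on the mat".split(), "there is a cat on the mat".split()]
print(round(bleu(hyp, refs), 3))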

8 Training Criteria for Minimum Error Rate Training We assume that –we can measure the number of errors in sentence e by comparing it with a reference sentence r using a function E(r,e). –the number of errors for a set of sentences e_1^S is obtained by summing the errors for the individual sentences: E(r_1^S, e_1^S) = Σ_{s=1}^S E(r_s, e_s). Our goal is to obtain a minimal error count on a representative corpus f_1^S with given reference translations r_1^S and a set of K different candidate translations C_s = {e_{s,1}, …, e_{s,K}} for each input sentence f_s.

9 Training Criteria for Minimum Error Rate Training The optimization criterion is not easy to handle: λ̂_1^M = argmin_{λ_1^M} Σ_{s=1}^S E(r_s, ê(f_s; λ_1^M)), with ê(f_s; λ_1^M) = argmax_{e ∈ C_s} Σ_{m=1}^M λ_m h_m(e, f_s).

10 Training Criteria for Minimum Error Rate Training It includes an argmax operation (Eq. 6). Therefore, it is not possible to compute a gradient and we cannot use gradient descent methods to perform optimization. The objective function has many different local optima. The optimization algorithm must handle this.

11 Training Criteria for Minimum Error Rate Training To be able to compute a gradient and to make the objective function smoother, we can use the following error criterion, with a parameter α to adjust the smoothness: Σ_{s=1}^S Σ_{k=1}^K E(r_s, e_{s,k}) ∙ p_{λ_1^M}(e_{s,k}|f_s)^α / Σ_{k'} p_{λ_1^M}(e_{s,k'}|f_s)^α. The unsmoothed error count has many different local optima and is very unstable. The smoothed error count is much more stable and has fewer local optima.
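A sketch of this smoothed error count, assuming the smoothing weights each candidate's error by its α-scaled posterior probability as described above; the feature vectors and error counts below are made up for illustration:

import math

def smoothed_error_count(weights, nbest, alpha=3.0):
    """Smoothed corpus error count: instead of charging only the argmax
    candidate, every candidate contributes its error weighted by its
    alpha-scaled posterior. As alpha grows, this approaches the unsmoothed count.
    nbest: one list per sentence of (feature_vector, error_count) pairs."""
    total = 0.0
    for candidates in nbest:
        scores = [alpha * sum(w * h for w, h in zip(weights, feats))
                  for feats, _ in candidates]
        m = max(scores)                            # stabilize the softmax
        expw = [math.exp(s - m) for s in scores]
        z = sum(expw)
        total += sum(w / z * err for w, (_, err) in zip(expw, candidates))
    return total

# illustrative 2-sentence corpus with made-up features and error counts
nbest = [[([1.0, 0.2], 2), ([0.4, 0.9], 0)],
         [([0.3, 0.3], 1), ([0.8, 0.1], 3)]]
print(smoothed_error_count([0.5, 0.5], nbest, alpha=5.0))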

12 Training Criteria for Minimum Error Rate Training

13 Optimization Algorithm for Unsmoothed Error Count A standard algorithm for the optimization of the unsmoothed error count (Eq. 5) is Powell's algorithm combined with a grid-based line optimization method (Press et al., 2002). A major problem with the standard approach is the fact that grid-based line optimization is hard to adjust such that both good performance and efficient search are guaranteed. –fine-grained grid => the algorithm is slow. –large grid => the optimal solution might be missed.

14 Optimization Algorithm for Unsmoothed Error Count Computing the most probable translation along a line λ_1^M + γ∙d_1^M with parameter γ results in an optimization problem of the following functional form: f(γ; f) = max_{e ∈ C} ( t(e, f) + γ∙m(e, f) ). Hence, every candidate translation in C corresponds to a line with intercept t(e,f) and slope m(e,f). The function f(γ; f) is piecewise linear. This allows us to compute an efficient exhaustive representation of that function.

15 Optimization Algorithm for Unsmoothed Error Count - new algorithm To optimize γ̂ = argmin_γ Σ_s E(r_s, ê(f_s; γ)) along the line, we compute the ordered sequence of linear intervals constituting f(γ; f) for every sentence f, together with the incremental change in error count from one interval to the next. Hence, we obtain for every sentence f a sequence γ_{f,1} < γ_{f,2} < … < γ_{f,N_f} denoting the interval boundaries and a corresponding sequence ΔE_{f,1}, ΔE_{f,2}, …, ΔE_{f,N_f} giving the change in error count at the corresponding interval boundary.

16 Optimization Algorithm for Unsmoothed Error Count - new algorithm By merging the sequences γ_f and ΔE_f for all sentences of our corpus, the complete set of interval boundaries and error count changes on the whole corpus is obtained. The optimal γ can then be computed easily by traversing the sequence of interval boundaries while updating the running error count.
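The following Python sketch implements this interval-merging line search under the assumption that each candidate carries a feature vector and a single additive error count (as with mWER or mPER; BLEU requires per-sentence sufficient statistics rather than a scalar, which this sketch glosses over). The data layout is illustrative, and the function would be called once per search direction inside a Powell-style outer optimization:

from itertools import groupby

def upper_envelope(lines):
    """lines: (slope, intercept, err) per candidate. Returns the segments of the
    upper envelope, left to right, as (start_gamma, err) pairs."""
    best = {}
    for slope, intercept, err in sorted(lines, key=lambda l: (l[0], l[1])):
        best[slope] = (slope, intercept, err)     # for equal slopes keep the highest line
    hull = []                                     # (start_gamma, slope, intercept, err)
    for slope, intercept, err in sorted(best.values()):
        start = float("-inf")
        while hull:
            p_start, p_slope, p_intercept, _ = hull[-1]
            x = (p_intercept - intercept) / (slope - p_slope)   # intersection point
            if x <= p_start:
                hull.pop()                        # previous line is nowhere maximal
            else:
                start = x
                break
        hull.append((start, slope, intercept, err))
    return [(start, err) for start, _, _, err in hull]

def mert_line_search(nbest, lam, direction):
    """Exact minimization of the corpus error count along lam + gamma*direction.
    nbest: one list per sentence of (feature_vector, error_count) candidates."""
    boundaries, error = [], 0.0                   # corpus error as gamma -> -infinity
    for candidates in nbest:
        lines = []
        for feats, err in candidates:
            intercept = sum(l * h for l, h in zip(lam, feats))
            slope = sum(d * h for d, h in zip(direction, feats))
            lines.append((slope, intercept, err))
        segments = upper_envelope(lines)
        error += segments[0][1]
        for (gamma, err), (_, prev_err) in zip(segments[1:], segments):
            boundaries.append((gamma, err - prev_err))
    boundaries.sort()
    best_error = error
    best_gamma = boundaries[0][0] - 1.0 if boundaries else 0.0
    # merge boundaries that coincide, then sweep left to right
    merged = [(g, sum(d for _, d in grp))
              for g, grp in groupby(boundaries, key=lambda b: b[0])]
    for i, (gamma, delta) in enumerate(merged):
        error += delta
        if error < best_error:
            right = merged[i + 1][0] if i + 1 < len(merged) else gamma + 2.0
            best_error, best_gamma = error, (gamma + right) / 2.0   # inside the interval
    return best_gamma, best_error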

17 Baseline Translation Approach The basic feature functions of our model are identical to the alignment template approach (Och and Ney, 2002). In this translation model, a sentence is translated by segmenting the input sentence into phrases, translating these phrases and reordering the translations in the target language. The log-linear model includes M=8 different features.

18 Baseline Translation Approach Algorithm: 1. Perform search (using a manually defined set of parameter values), compute an n-best list, and use this n-best list to train the model parameters. 2. Use the new model parameters in a new search and compute a new n-best list, which is combined with the existing n-best list. 3. Use this extended n-best list to compute new model parameters. This is iterated until the resulting n-best list does not change. In this algorithm convergence is guaranteed because, in the limit, the n-best list will contain all possible translations.
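A sketch of this outer loop; decode_nbest and optimize_weights are placeholders for the system's n-best decoder and the error-rate optimizer from the previous slides, and are passed in rather than implemented here:

def mert_outer_loop(dev_corpus, lam, decode_nbest, optimize_weights, max_iter=20):
    """Iterative n-best training: decode with the current weights, grow the
    candidate pool, re-optimize the weights on the pooled lists, repeat until
    the n-best lists stop changing."""
    pool = [set() for _ in dev_corpus]     # accumulated candidates per sentence
    for _ in range(max_iter):
        grew = False
        for s, f in enumerate(dev_corpus):
            for e in decode_nbest(f, lam):
                if e not in pool[s]:
                    pool[s].add(e)
                    grew = True
        if not grew:                       # n-best lists unchanged: converged
            return lam
        lam = optimize_weights(pool, lam)  # minimize the error count on the pooled lists
    return lam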

19 Results We present results on the 2002 TIDES Chinese–English small data track task. The goal is the translation of news text from Chinese to English.

20 Results The basic feature functions were trained using the training corpus. The development corpus was used to optimize the parameters of the log-linear model. Translation results are reported on the test corpus.

21 Results Table 2: Effect of different error criteria in training on the development corpus. Note that better results correspond to larger BLEU and NIST scores and to smaller error rates. Italic numbers refer to results for which the difference to the best result (indicated in bold) is not statistically significant.

22 Results Table 3: Effect of different error criteria used in training on the test corpus. Note that better results correspond to larger BLEU and NIST scores and to smaller error rates. Italic numbers refer to results for which the difference to the best result (indicated in bold) is not statistically significant.

23 Results We observe that if we choose a certain error criterion in training, we obtain in most cases the best results using the same criterion as the evaluation metric on the test data. We observe that the smoothed error count gives almost identical results to the unsmoothed error count.

24 Conclusions We presented alternative training criteria for log-linear statistical machine translation models which are directly related to translation quality: an unsmoothed error count and a smoothed error count on a development corpus. For the unsmoothed error count, we presented a new line optimization algorithm which can efficiently find the optimal solution along a line. Optimizing the error rate as part of the training criterion helps to obtain better error rates on unseen test data. Note that this approach can be applied to any evaluation criterion.