Minimum Error Rate Training in Statistical Machine Translation
By: Franz Josef Och, 2003
Presented by: Anna Tinnemore, 2006

GOAL
- To directly optimize translation quality
- Why? The usual training criterion has no direct correlation with the popular evaluation criteria:
  - F-measure (parsing)
  - Mean average precision (ranked retrieval)
  - BLEU and multi-reference word error rate (statistical machine translation)

Problem: the notion of error assumed by the statistical training criterion differs from the notion of error used by the automatic evaluation methods.
Solution (maybe): optimize the model parameters directly for each individual evaluation method.

Background
- The standard decision rule is optimal under a "zero-one loss function"
- A different evaluation metric implies a different optimal decision rule
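
For concreteness, here is the usual formalization in LaTeX (reconstructed, not verbatim from the slides: the zero-one-loss rule is the one in Och's paper, and the metric-specific rule shown is the standard minimum Bayes risk formulation the slide alludes to):

$$\hat{e} = \arg\max_{e} \Pr(e \mid f) \qquad \text{(optimal under zero-one loss)}$$

$$\hat{e} = \arg\min_{e} \sum_{e'} E(e', e)\,\Pr(e' \mid f) \qquad \text{(optimal under error function } E\text{)}$$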

Background, continued
- Problems: finding suitable feature functions h_1..h_M and parameter values λ_1..λ_M
- MMI (maximum mutual information) training:
  - Has one unique global optimum
  - Algorithms are guaranteed to find it
  - But does that give optimal translation quality? Not necessarily.
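
The log-linear model and the MMI criterion being discussed, in LaTeX (reconstructed from the paper's standard formulation; r_s is the reference translation for source sentence f_s):

$$\Pr(e \mid f) \approx p_{\lambda_1^M}(e \mid f) = \frac{\exp\left[\sum_{m=1}^{M} \lambda_m h_m(e, f)\right]}{\sum_{e'} \exp\left[\sum_{m=1}^{M} \lambda_m h_m(e', f)\right]}$$

$$\hat{\lambda}_1^M = \arg\max_{\lambda_1^M} \sum_{s=1}^{S} \log p_{\lambda_1^M}(r_s \mid f_s) \qquad \text{(MMI)}$$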

So what? (Outline)
- Review automatic evaluation criteria
- Two training criteria that might help
- A new training algorithm for optimizing an unsmoothed error count
- Och's approach
- Evaluation of the training criteria

Translation quality metrics
- mWER (multi-reference word error rate): edit distance to the closest reference translation
- mPER (multi-reference position-independent error rate): edit distance over bags of words, ignoring word order
- BLEU: geometric mean of n-gram precisions (with a brevity penalty)
- NIST: weighted n-gram precision
(A minimal mWER sketch follows.)
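
As a concrete illustration of mWER, a minimal Python sketch (my own, not from the paper): word-level edit distance to the closest reference, normalized per sentence here for readability, whereas the paper accumulates error counts over the whole corpus.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance, computed with one rolling row."""
    d = list(range(len(ref) + 1))  # distance from the empty hypothesis
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # delete h
                                   d[j - 1] + 1,     # insert r
                                   prev + (h != r))  # substitute / match
    return d[-1]

def mwer(hypothesis, references):
    """Edit distance to the *closest* reference, length-normalized."""
    return min(edit_distance(hypothesis.split(), r.split()) / len(r.split())
               for r in references)

print(mwer("the cat sat", ["the cat sat down", "a cat was sitting there"]))
```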

Training criterion
- Minimize the error count directly
- Problems:
  - The argmax operation in Eq. (6) makes the objective piecewise constant: no gradient, no guaranteed global optimum
  - Many local optima
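
The criterion in question, in LaTeX (reconstructed to match the paper's formulation; the inner argmax over the candidate set C_s is what makes the objective hard):

$$\hat{\lambda}_1^M = \arg\min_{\lambda_1^M} \sum_{s=1}^{S} E\!\left(r_s, \hat{e}(f_s; \lambda_1^M)\right), \qquad \hat{e}(f_s; \lambda_1^M) = \arg\max_{e \in C_s} \sum_{m=1}^{M} \lambda_m h_m(e, f_s)$$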

Smoothed error count
- Easier to deal with than the unsmoothed objective, but still non-convex and tricky
- In practice, performance doesn't change much with smoothing
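
One common form of a smoothed error count (reconstructed, and hedged: I believe the paper's smoothing has this shape, with α controlling how sharply the weights approximate the argmax):

$$E_{\text{smooth}}(\lambda_1^M) = \sum_{s=1}^{S} \sum_{k} E(r_s, e_{s,k}) \cdot \frac{p_{\lambda_1^M}(e_{s,k} \mid f_s)^{\alpha}}{\sum_{k'} p_{\lambda_1^M}(e_{s,k'} \mid f_s)^{\alpha}}$$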

Unsmoothed error count
- Standard approach: Powell's algorithm with grid-based line optimization
  - Fine-grained grid: slow
  - Coarse grid: may miss the optimal solution
- New approach: an exact line-optimization algorithm for log-linear models
  - Guaranteed to find the optimum along each search direction
  - Much faster and more stable

New algorithm: key observation
- Along a search direction, each candidate translation in C corresponds to a line in γ (t and m are constants per candidate)
- The best-scoring candidate as a function of γ is the upper envelope of these lines: piecewise linear, so the induced error count is piecewise constant in γ
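
The decomposition behind that observation, in LaTeX (reconstructed: moving from the current λ along a direction d with step size γ, each candidate's score is linear in γ):

$$f(\gamma; e, f_s) = \sum_{m=1}^{M} (\lambda_m + \gamma\, d_m)\, h_m(e, f_s) = \underbrace{\sum_m \lambda_m h_m(e, f_s)}_{t(e)} \;+\; \gamma \underbrace{\sum_m d_m h_m(e, f_s)}_{m(e)}$$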

Algorithm: the nitty-gritty
- For every source sentence f:
  - Compute the ordered sequence of linear intervals that make up the upper envelope f(γ; f)
  - Compute the change in error count at each interval boundary
- Merge all the boundary sequences γ_f and ΔE_f
- Traverse the merged sequence of boundaries, keeping a running error count, to find the optimal γ
(A Python sketch of this sweep follows.)
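
A minimal Python sketch of that sweep (my own illustration, not the paper's code; per-sentence candidates are assumed to be given as (slope, intercept, error-count) triples, and a simple O(K^2) envelope computation stands in for the paper's more careful bookkeeping):

```python
import math

def sweep_line(cands):
    """Find where the 1-best candidate changes as gamma runs from -inf to +inf.

    cands: list of (slope, intercept, error) triples for one sentence.
    Returns (error of the winner at gamma=-inf, [(gamma, new_error), ...]).
    """
    # Winner at gamma -> -inf: smallest slope, ties broken by larger intercept.
    cur = min(cands, key=lambda c: (c[0], -c[1]))
    err0, changes, gamma = cur[2], [], -math.inf
    while True:
        # Earliest crossing at which a steeper line overtakes the current winner.
        best_x, best_c = math.inf, None
        for c in cands:
            if c[0] <= cur[0]:
                continue  # not steeper: can never overtake cur
            x = (cur[1] - c[1]) / (c[0] - cur[0])  # intersection point
            if gamma <= x < best_x:
                best_x, best_c = x, c
        if best_c is None:
            return err0, changes  # cur wins all the way to +inf
        gamma, cur = best_x, best_c
        changes.append((gamma, cur[2]))

def best_gamma(corpus):
    """Merge per-sentence boundaries and sweep for the gamma minimizing total error.

    corpus: list of per-sentence candidate lists (see sweep_line).
    """
    total, events = 0, []
    for cands in corpus:
        err, changes = sweep_line(cands)
        total += err
        for g, e in changes:
            events.append((g, e - err))  # error delta at boundary g
            err = e
    events.sort(key=lambda ge: ge[0])
    best_err, best_g = total, 0.0
    for i, (g, delta) in enumerate(events):
        total += delta
        if total < best_err:
            # Pick a point inside the winning interval, not on its boundary.
            nxt = events[i + 1][0] if i + 1 < len(events) else g + 1.0
            best_err, best_g = total, (g + nxt) / 2
    return best_g, best_err
```

In the full method, this exact line search replaces the grid-based line optimization inside the outer loop over search directions.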

Baseline
- Same setup as the alignment template approach
- A log-linear model with M = 8 features
- Extract n-best candidate translations rather than considering all possible translations
- Wait a minute...

n-best??? What about overfitting? What about unseen data?
- First: compute an n-best list using "made-up" initial parameter values; use this list to train the model for new parameters
- Second: with the new parameters, run a new search, make a new n-best list, and append it to the old n-best list
- Third: use the new list to train the model for even better parameters

- Keep going until the n-best list doesn't change: at that point all (reachable) translations are in the list
- Each iteration generates approximately 200 additional translations
- The algorithm takes only 5-7 iterations to converge
(A sketch of this outer loop follows.)
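
A minimal Python sketch of that outer loop (my own illustration; `decode` and `optimize_error_rate` are hypothetical stand-ins for the real decoder and for the line-search training step sketched earlier):

```python
def mert_outer_loop(dev_corpus, lambdas, decode, optimize_error_rate, n=100):
    """Iterate: decode n-best lists, grow the candidate pool, retrain lambdas."""
    nbest = {s: set() for s in dev_corpus}
    while True:
        grew = False
        for s in dev_corpus:
            before = len(nbest[s])
            nbest[s] |= set(decode(s, lambdas, n))  # append new candidates
            grew = grew or len(nbest[s]) > before
        if not grew:
            return lambdas  # n-best lists stopped changing: converged
        lambdas = optimize_error_rate(nbest, lambdas)
```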

Additional sneaky stuff
- Problem with MMI (maximum mutual information) training: the reference sentences would have to be part of the n-best list, and generally they aren't
- Solution: fake reference sentences, of course
  - From the n-best list, select the sentences with the fewest word errors with respect to the REAL references
  - Call these "pseudo-references"
(Sketch below.)
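
A minimal sketch of pseudo-reference selection (my own illustration, reusing `edit_distance` from the mWER sketch above; `k` is a hypothetical cutoff):

```python
def pseudo_references(nbest, real_refs, k=1):
    """Pick the k candidates with the fewest word errors vs. the real references."""
    def word_errors(cand):  # uses edit_distance from the mWER sketch
        return min(edit_distance(cand.split(), r.split()) for r in real_refs)
    return sorted(nbest, key=word_errors)[:k]
```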

Experiment
- 2002 TIDES Chinese-English small data track task
- News text translated from Chinese to English
- Note: no rule-based components were used to translate numbers, dates, or names

Development Corpus Results

Test Corpus Results

Conclusions
- Presented alternative training criteria that relate directly to translation quality: unsmoothed and smoothed error count on a development corpus
- Optimizing error rate in training yields better results on unseen test data
- Maybe "true" translation quality is also increased; we don't know, because the evaluation metrics themselves need help

Future questions
- How many parameters can be reliably estimated using differing criteria on development corpora of various sizes?
- Does the choice of criterion make a difference?
- Which error-rate criterion (smoothed or unsmoothed) should be optimized in training?

Boasting
- This approach applies to any evaluation technique
- If the evaluation methods ever get better, this algorithm will yield correspondingly better results

Side-stepping
- It's possible that this algorithm could be used to "overfit" the evaluation method, giving falsely inflated scores
- It's not our problem: the developers of the evaluation methods should design them so this can't happen

...And around the world
- This algorithm has a place wherever automatic evaluation methods are used
- It could yield improvements in these other areas as well

Questions, observations, accolades...

My observations
- The improvements do not seem significant
- This exposes a problem in the evaluation metrics, but does nothing to solve it
- Seems like a good idea, but has many unanswered questions regarding optimal implementation

THANK YOU and Good Night!