
Slide 1: Minimum Error Rate Training
Stephan Vogel, Machine Translation, Spring Semester 2011

Slide 2: Overview
- Optimization approaches
  - Simplex
  - MER
- Avoiding local minima
- Additional considerations
  - Tuning towards different metrics
  - Tuning on different development sets

Slide 3: Tuning the SMT System
- We use different models in the SMT system
- Models have simplifications and are trained on different amounts of data
- => Models have different levels of reliability, and their scores have different ranges
- => Give different weights to the different models:
  Q = c_1*Q_1 + c_2*Q_2 + ... + c_n*Q_n
- Find optimal scaling factors (feature weights) c_1 ... c_n
- Optimal means: highest score for the chosen evaluation metric M,
  i.e. find (c_1, ..., c_n) such that M(argmin_e Q(e, f)) is high
- Metric M is our objective function
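The weighted combination Q = c_1*Q_1 + ... + c_n*Q_n and the resulting 1-best selection can be sketched in a few lines. All feature names (LM, TM, word count) and values below are illustrative, not taken from the slides; scores are treated as costs, so the model-best hypothesis is the argmin:

```python
# Sketch: combine model scores with feature weights and pick the 1-best
# hypothesis (lowest combined cost Q). Data is purely illustrative.

def combined_score(weights, features):
    """Q = c_1*Q_1 + ... + c_n*Q_n for one hypothesis."""
    return sum(c * q for c, q in zip(weights, features))

def one_best(weights, nbest):
    """Return the hypothesis with the lowest combined cost Q."""
    return min(nbest, key=lambda hyp: combined_score(weights, hyp["features"]))

nbest = [
    {"text": "the house is small", "features": [2.1, 4.0, 4.0]},   # Q_LM, Q_TM, Q_WC
    {"text": "the house is little", "features": [3.0, 2.5, 4.0]},
]
weights = [1.0, 0.5, 0.1]
best = one_best(weights, nbest)
```

Changing `weights` re-ranks the list without touching the models themselves, which is exactly the degree of freedom the tuning procedure exploits.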

Slide 4: Problems
- The surface of the objective function is not nice
  - Not convex -> local minima (actually, many local minima)
  - Not differentiable -> gradient descent methods not readily applicable
- There may be dangerous areas (boundary cliffs): a small change can have a big effect
- Example:
  - Tune on a dev set with short reference translations
  - Optimization leads towards short translations
  - The new test set has long reference translations
  - Translations are now too short -> length penalty

Slide 5: Brute Force Approach – Manual Tuning
- Decode with different scaling factors
  - Get a feeling for the range of good values
  - Get a feeling for the importance of the models
    - LM is typically most important
    - Sentence length (word count feature) balances the shortening effect of the LM
    - Word reordering is more or less effective depending on the language
- Narrow down the range in which scaling factors are tested
- Essentially multi-linear optimization
- Works well for a small number of models
- Time consuming (CPU-wise) if decoding takes a long time

Slide 6: Automatic Tuning
- Many algorithms for finding (near) optimal solutions are available:
  - Simplex
  - Powell (line search)
  - MIRA (Margin Infused Relaxed Algorithm)
  - Specially designed minimum error rate training (Och 2003)
  - Genetic algorithms
- Note: the models themselves are not improved, only their combination
- Note: some parameters change the performance of the decoder but are not part of Q:
  - Number of alternative translations
  - Beam size
  - Word reordering restrictions

Slide 7: Automatic Tuning on N-best Lists
- Optimization algorithms need many iterations – too expensive to run full translations
- => Use n-best lists
  - e.g. for each of 500 source sentences, 1000 translations
- Changing the scaling factors results in re-ranking the n-best lists
- Evaluate the new 1-best translations
- Apply any of the standard optimization techniques
- Advantage: much faster
- The counts (e.g. n-gram matches) can be pre-calculated for each translation to speed up evaluation
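The pre-calculation idea can be sketched as follows: if each hypothesis carries its precomputed error counts (here, edit-distance errors and reference length for a WER-style metric), then evaluating a new weight vector is just re-ranking plus summing counts, with no re-decoding and no re-scoring against the references. All data is illustrative:

```python
# Sketch of n-best evaluation with precomputed counts: each hypothesis
# stores (errors, ref_len) against its reference, so a new weight vector
# only requires re-ranking and summing. Data is illustrative.

def corpus_wer(weights, nbest_lists):
    """Re-rank every n-best list under `weights`, then compute corpus WER
    from the precomputed counts of each 1-best hypothesis."""
    errors = ref_len = 0
    for nbest in nbest_lists:
        best = min(nbest, key=lambda h: sum(c * q for c, q in zip(weights, h["features"])))
        errors += best["errors"]
        ref_len += best["ref_len"]
    return errors / ref_len

nbest_lists = [
    [{"features": [1.0, 2.0], "errors": 1, "ref_len": 5},
     {"features": [1.5, 1.0], "errors": 0, "ref_len": 5}],
    [{"features": [0.5, 3.0], "errors": 2, "ref_len": 4},
     {"features": [1.5, 1.0], "errors": 1, "ref_len": 4}],
]
wer = corpus_wer([1.0, 1.0], nbest_lists)
```

For BLEU the stored counts would be per-sentence n-gram matches and lengths instead, summed before the final score computation.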

Slide 8: Simplex (Nelder-Mead)
- Start with n+1 random configurations
- Get the 1-best translation for each configuration -> objective function
- Sort the points x_k according to the objective function: f(x_1) < f(x_2) < ... < f(x_{n+1})
- Calculate x_0 as the center of gravity of x_1 ... x_n
- Replace the worst point with a point reflected through the centroid:
  x_r = x_0 + r * (x_0 - x_{n+1})

Slide 9: Demo
- Obviously, we need to change the size of the simplex to enforce convergence
- Also, we want to adjust the step size:
  - If the new point is the best point – increase the step size
  - If the new point is worse than x_1 ... x_n – decrease the step size

Slide 10: Expansion and Contraction
- Reflection: calculate x_r = x_0 + r * (x_0 - x_{n+1})
  If f(x_1) <= f(x_r) < f(x_n): replace x_{n+1} with x_r; next iteration
- Expansion: if the reflected point is better than the best, i.e. f(x_r) < f(x_1):
  Calculate x_e = x_0 + e * (x_0 - x_{n+1})
  If f(x_e) < f(x_r): replace x_{n+1} with x_e, else replace x_{n+1} with x_r
  Next iteration; otherwise contract
- Contraction: the reflected point has f(x_r) >= f(x_n)
  Calculate x_c = x_{n+1} + c * (x_0 - x_{n+1})
  If f(x_c) <= f(x_{n+1}): replace x_{n+1} with x_c, else shrink
- Shrinking: for all x_k, k = 2 ... n+1: x_k = x_1 + s * (x_k - x_1)
  Next iteration
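The four operations above fit into a compact Nelder-Mead loop. This is a minimal sketch with the usual textbook coefficients (r=1, e=2, c=0.5, s=0.5); `f` is any objective to minimize, here a smooth stand-in rather than a real metric over re-ranked n-best lists:

```python
# Minimal Nelder-Mead following the slide's four operations:
# reflection, expansion, contraction, shrinking.

def nelder_mead(f, points, iters=200, r=1.0, e=2.0, c=0.5, s=0.5):
    pts = [list(p) for p in points]          # n+1 starting configurations
    dim = len(pts[0])
    for _ in range(iters):
        pts.sort(key=f)                      # f(x_1) <= ... <= f(x_{n+1})
        x0 = [sum(p[i] for p in pts[:-1]) / dim for i in range(dim)]  # centroid of best n
        worst = pts[-1]
        xr = [x0[i] + r * (x0[i] - worst[i]) for i in range(dim)]     # reflection
        if f(pts[0]) <= f(xr) < f(pts[-2]):
            pts[-1] = xr
        elif f(xr) < f(pts[0]):              # expansion
            xe = [x0[i] + e * (x0[i] - worst[i]) for i in range(dim)]
            pts[-1] = xe if f(xe) < f(xr) else xr
        else:                                # contraction
            xc = [worst[i] + c * (x0[i] - worst[i]) for i in range(dim)]
            if f(xc) <= f(worst):
                pts[-1] = xc
            else:                            # shrink toward the best point
                pts = [pts[0]] + [[pts[0][i] + s * (p[i] - pts[0][i]) for i in range(dim)]
                                  for p in pts[1:]]
    return min(pts, key=f)

# Stand-in objective with minimum at (1, -2); a real tuner would instead
# evaluate the metric score of the re-ranked n-best lists at each point.
best = nelder_mead(lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2,
                   [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
```

Note that in the tuning setting `f` is piecewise constant (the 1-best only changes at re-ranking boundaries), which is exactly why convergence behavior is much less benign than on this smooth example.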

Slide 11: Changing the Simplex
[Figure: the four simplex operations – reflection, expansion, contraction, and shrinking – each shown relative to the centroid x_0 and the worst point x_{n+1}]

Slide 12: Powell Line Search
- Select directions in the search space, then:
  Loop until convergence:
    Loop over directions d:
      Perform line search for direction d until convergence
- Many variants:
  - Selection of directions
    - Easiest is to use the model scores
    - Or combine multiple scores
  - Step size in the line search
- MER (Och 2003) is a line search along the models with a smart selection of steps
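The outer Powell-style loop can be sketched as cycling over coordinate directions with a crude grid line search along each; a real implementation would use a proper 1-D minimizer, or MER's exact step selection described on the following slides. The objective here is a smooth stand-in:

```python
# Sketch of coordinate-wise line search (Powell-style outer loop with
# one direction per feature). Grid-based 1-D search is a simplification.

def line_search(f, x, d, steps):
    """Try x + t*d for each step size t, keep the best point found."""
    best_x, best_f = x, f(x)
    for t in steps:
        cand = [xi + t * di for xi, di in zip(x, d)]
        if f(cand) < best_f:
            best_x, best_f = cand, f(cand)
    return best_x

def coordinate_search(f, x, rounds=20):
    dim = len(x)
    steps = [t / 10.0 for t in range(-20, 21)]    # grid in [-2, 2]
    for _ in range(rounds):
        for k in range(dim):                       # one direction per feature
            d = [1.0 if i == k else 0.0 for i in range(dim)]
            x = line_search(f, x, d, steps)
    return x

w = coordinate_search(lambda x: (x[0] - 0.5) ** 2 + (x[1] - 1.2) ** 2, [0.0, 0.0])
```

Using the axis directions means each pass re-tunes one feature weight while holding the others fixed, which is the structure MER exploits.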

Slide 13: Minimum Error Training
- For each hypothesis we have Q = sum_k c_k * Q_k
- Select one k and separate its term:
  Q = c_k * Q_k + sum_{n != k} c_n * Q_n = c_k * Q_k + Q_Rest
[Figure: metric score (e.g. WER = 8) and total model score plotted as a function of c_k; the individual model score Q_k gives the slope of the line, Q_Rest the offset]

Slide 14: Minimum Error Training
- Source sentence 1
- Depending on the scaling factor c_k, different hypotheses are in the 1-best position
- Set c_k so that the metric-best hypothesis is also model-best
[Figure: model score lines over c_k for hypotheses h_11 (WER = 8), h_12 (WER = 5), h_13 (WER = 4), showing in which range of c_k each is the best hypothesis]

Slide 15: Minimum Error Training
- Select the minimum number of evaluation points:
  - Calculate the intersection points
  - Keep only those where the intersecting hypotheses are minimal at that point
  - Choose evaluation points between the intersection points
[Figure: the same model score lines for h_11, h_12, h_13, with the intersection points of the lower envelope marked]
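For one sentence this exact line search can be sketched directly: each hypothesis h scores Q_h(c) = Q_Rest_h + c * Q_k_h, a line in c, so the model-best hypothesis only changes at intersection points and it suffices to probe one c inside each interval. This simple O(n^2) version computes all pairwise intersections rather than the lower envelope; the data is illustrative:

```python
# Och-style exact line search along one feature weight c_k, sketched for a
# single sentence. Each hypothesis is (rest_score, feature_score, errors);
# scores are costs, so the model-best hypothesis is the argmin.

def best_weight(hyps):
    """Return (c, errors) with the lowest error among interval probes."""
    # Candidate c values: one probe inside each interval between
    # consecutive intersection points, plus one beyond each end.
    cuts = sorted({(r2 - r1) / (q1 - q2)
                   for i, (r1, q1, _) in enumerate(hyps)
                   for (r2, q2, _) in hyps[i + 1:] if q1 != q2})
    probes = ([cuts[0] - 1.0]
              + [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
              + [cuts[-1] + 1.0])
    best = None
    for c in probes:
        _, _, err = min(hyps, key=lambda h: h[0] + c * h[1])  # model-best at c
        if best is None or err < best[1]:
            best = (c, err)
    return best

# (Q_Rest, Q_k, WER errors): the hypothesis with 4 errors becomes 1-best
# only for an intermediate range of c.
hyps = [(1.0, 3.0, 8), (4.0, 1.0, 5), (2.0, 2.0, 4)]
c, err = best_weight(hyps)
```

In the full algorithm the per-interval errors are accumulated over all sentences before picking c, and hypotheses with identical feature scores (empty `cuts`) need a trivial special case.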

Slide 16: Minimum Error Training
- Source sentence 1, now with different error scores
- Optimization would find a different c_k
- => Different metrics lead to different scaling factors
[Figure: model score lines over c_k for h_11 (WER = 8), h_12 (WER = 2), h_13 (WER = 4)]

Slide 17: Minimum Error Training
- Sentence 2
- The best c_k is in a different range
- No matter which c_k, h_22 would never be 1-best
[Figure: model score lines over c_k for h_21 (WER = 2), h_22 (WER = 0), h_23 (WER = 5); only h_21 and h_23 ever reach the 1-best position]

Slide 18: Minimum Error Training
- Multiple sentences: the error counts are accumulated over all sentences, and the c_k with the lowest total error is chosen
[Figure: model score lines for sentences 1 and 2 combined, with the per-hypothesis WER scores h_11 = 8, h_12 = 5, h_13 = 4 and h_21 = 2, h_22 = 0, h_23 = 5]

Slide 19: Iterate Decoding – Optimization
- The n-best list is a (very restricted) substitute for the search space
- With updated feature weights we may have generated other (better) translations
- Some of the hypotheses in the n-best list would have been pruned
- Iterate:
  - Re-translate with the new feature weights
  - Merge the new translations with the old translations (increases stability)
  - Run the optimizer over the larger n-best lists
  - Repeat until no new translations, or improvement < epsilon, or just k times (typically 5-10 iterations)
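The merging step can be sketched as pooling the per-sentence n-best lists across iterations and de-duplicating by surface string, so the optimizer sees a growing approximation of the search space. The dictionary layout is an assumption for illustration:

```python
# Sketch of merging n-best lists across tuning iterations, de-duplicated
# by surface string. Hypothesis structure and data are illustrative.

def merge_nbest(old, new):
    """Merge two n-best lists for one sentence, keeping one entry per string."""
    seen = {}
    for hyp in old + new:
        seen.setdefault(hyp["text"], hyp)
    return list(seen.values())

old = [{"text": "a small house", "features": [1.0]},
       {"text": "a little house", "features": [1.5]}]
new = [{"text": "a small house", "features": [1.0]},
       {"text": "one small house", "features": [2.0]}]
merged = merge_nbest(old, new)
```

When no iteration contributes a new string for any sentence, the merged lists stop growing, which is one of the stopping criteria listed above.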

Slide 20: Avoiding Local Minima
- Optimization can get stuck in a local minimum
- Remedies:
  - Fiddle around with the parameters of your optimization algorithm
  - Larger n-best list -> more evaluation points
  - Combine with a simulated-annealing-type approach (Smith & Eisner, 2007)
  - Restart multiple times
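The restart strategy itself is simple to sketch: run the optimizer from several random starting weights and keep the best result. The inner "optimizer" below is a deliberately crude greedy stub standing in for Simplex, Powell, or MER; the objective is a smooth stand-in:

```python
# Sketch of random restarts: several random starting points, keep the best
# final result. The inner optimizer is a greedy stub, not a real tuner.
import random

def tune_with_restarts(objective, dim, restarts=10, seed=0):
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(restarts):
        x = [rng.uniform(-2.0, 2.0) for _ in range(dim)]   # random start
        for _ in range(100):                               # stub optimizer:
            k = rng.randrange(dim)                         # greedy coordinate steps
            for step in (-0.1, 0.1):
                cand = list(x)
                cand[k] += step
                if objective(cand) < objective(x):
                    x = cand
        if objective(x) < best_f:
            best_x, best_f = x, objective(x)
    return best_x, best_f

best_x, best_f = tune_with_restarts(lambda x: (x[0] - 1.0) ** 2 + x[1] ** 2, dim=2)
```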

Slide 21: Random Restarts
- Comparison of Simplex/Powell (Alok, unpublished)
- Comparison of Simplex/extended Simplex/MER (Bing Zhao, unpublished)
- Observations:
  - Alok: Simplex is jumpier than Powell
  - Bing: Simplex is better than MER
  - Both: you need many restarts

Slide 22: Optimizing NOT Towards References
- Ideally, we want system output identical to the reference translations
- But there is no guarantee that the system can generate the reference translations (under realistic conditions)
  - e.g. we restrict the reordering window
  - We have unknown words
  - Reference translations may contain words unknown to the system
- Instead of forcing the decoder towards the reference translations, optimize towards the best translations the system can generate:
  - Find the hypotheses with the best metric score
  - Use those as pseudo references
  - Optimize towards the pseudo references
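The pseudo-reference selection step can be sketched directly: for each sentence, take the reachable hypothesis with the best metric score (here, the fewest edit errors against the human reference) as the tuning target. Data is illustrative:

```python
# Sketch of pseudo-reference selection: the metric-best hypothesis from
# each n-best list becomes the tuning target. Data is illustrative.

def pseudo_references(nbest_lists):
    """Pick the metric-best hypothesis from each n-best list."""
    return [min(nbest, key=lambda h: h["errors"])["text"] for nbest in nbest_lists]

nbest_lists = [
    [{"text": "the house small", "errors": 1},
     {"text": "the house is small", "errors": 0}],
    [{"text": "he goes home", "errors": 2},
     {"text": "he go home", "errors": 3}],
]
refs = pseudo_references(nbest_lists)
```

Note the second list: even the best reachable hypothesis still has errors, which is exactly the situation where pseudo references differ from the human references.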

Slide 23: Optimizing Towards Different Metrics
- Automatic metrics have different characteristics
- Optimizing towards one does not mean that the other metric scores will also go up
- In particular, different metrics prefer shorter or longer translations
  Typically: TER < BLEU < METEOR (< means shorter translation)
- Mauser et al. (2007) on the Ch-En NIST 2005 test set:
  - Reasonably well behaved
  - But the resulting translation length differs by more than 15%

Slide 24: Generalization to Other Test Sets
- Optimize on one set, test on multiple other sets
- Again Mauser et al., Ch-En
- Shown is the behavior over the Simplex optimization iterations
- Nice, nearly parallel development of the metric scores
- However, we have also observed brittle behavior
  - Especially when the ratio src_length / ref_length is very different between the dev and eval test sets

Slide 25: Large Weight = Important Feature?
- Assume we have c_LM = 1.0, c_TM = 0.55, c_WC = 3.2
- Which feature is most important? Cannot say!
- We want to re-rank the n-best lists
- Feature weights scale the feature values so that they can compete
- Example:
  - The variation in the LM and TM scores is larger than for WC
  - A large weight is needed for WC to make its small differences effective
- To know whether a feature is important, remove it and look at the drop in metric score
[Table: feature scores Q_LM, Q_TM, Q_WC and total Q for hypotheses H_1, H_2, H_3; values lost in extraction]
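The ablation test described above can be sketched as zeroing one weight at a time, re-ranking, and measuring the increase in corpus error; a large drop, not a large weight, marks an important feature. Data and weights are illustrative:

```python
# Sketch of feature-importance ablation: zero one weight at a time,
# re-rank, and measure the increase in errors. Data is illustrative.

def corpus_errors(weights, nbest_lists):
    total = 0
    for nbest in nbest_lists:
        best = min(nbest, key=lambda h: sum(c * q for c, q in zip(weights, h["features"])))
        total += best["errors"]
    return total

def ablation(weights, nbest_lists):
    base = corpus_errors(weights, nbest_lists)
    drops = []
    for k in range(len(weights)):
        w = list(weights)
        w[k] = 0.0                              # remove feature k
        drops.append(corpus_errors(w, nbest_lists) - base)
    return drops

nbest_lists = [
    [{"features": [1.0, 1.0], "errors": 0},
     {"features": [0.5, 3.0], "errors": 2}],
]
weights = [1.0, 1.0]
drops = ablation(weights, nbest_lists)
```

Here removing the second feature lets a worse hypothesis win, so it is the important one, regardless of how the two weights compare in magnitude.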

Slide 26: Open Issues
- Shouldn't all optimizers get the same results, if done right?
  - The models are the same; it's just finding the right mix
  - If local minima can be avoided, then similarly good optima should be found
- How to stay safe?
  - Avoid good optima close to cliffs
  - Different configurations give very similar metric scores; pick one which is more stable
- One size fits all?
  - Why one set of feature weights?
  - How about different sets for:
    - Good/bad translations (tuning on the tail: mixed results so far)
    - Short/long sentences
    - Beginning/middle/end of sentence
    - ...

Slide 27: Summary
- Optimize the system by modifying the scaling factors (feature weights)
- Different optimization approaches can be used
  - Simplex and Powell are most common
  - MERT (Och) is similar to Powell, with pre-calculation of the grid points
- Many local optima; avoid getting stuck early
  - Most effective: many restarts
- Generalization
  - To unseen test data: mostly OK, but sometimes the selection of the dev set has a big impact (length penalty!)
  - To different metrics: reasonably stable (metrics are reasonably correlated in most cases)
- Still open questions => more research needed
