1 Machine Translation: Decoder for Phrase-Based SMT
Stephan Vogel, Spring Semester 2011

2 Decoder
- Decoding issues (previous session)
  - Two-step decoding
    - Generation of the translation lattice
    - Best-path search
  - With limited word reordering
- Specific issues
  - Recombination of hypotheses
  - Pruning
  - N-best list generation
  - Future cost estimation

3 Recombination of Hypotheses
- Recombination: of two hypotheses, keep only the better one if no future information can switch their current ranking
- Note that this depends on the models:
  - The model score may depend on the current partial translation and the extension, e.g. the LM
  - The model score may depend on global features known only at the sentence end, e.g. a sentence length model
- The models define equivalence classes for the hypotheses
- Expand only the best hypothesis in each equivalence class (see the sketch below)
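A minimal sketch of this idea, not the lecture's actual decoder code: recombination keyed on an assumed hypothesis state consisting of the coverage set and the LM history. Among hypotheses with the same key, only the cheapest is expanded; the others are stored as recombined alternatives for later n-best recovery. The Hypothesis structure and field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:                      # illustrative, not the lecture's data structure
    coverage: frozenset                # covered source positions
    target: tuple                      # target words generated so far
    cost: float                        # accumulated negative log score
    recombined: list = field(default_factory=list)

def recombine(hypotheses, lm_order=3):
    """Keep only the cheapest hypothesis per equivalence class (coverage, LM history)."""
    best = {}                          # state key -> best hypothesis so far
    for hyp in hypotheses:
        key = (hyp.coverage, hyp.target[-(lm_order - 1):])
        incumbent = best.get(key)
        if incumbent is None or hyp.cost < incumbent.cost:
            if incumbent is not None:
                hyp.recombined.append(incumbent)   # loser kept for n-best recovery
            best[key] = hyp
        else:
            incumbent.recombined.append(hyp)
    return list(best.values())         # only these get expanded
```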

4 Recombination of Hypotheses: Example
- n-gram LM
- Hypotheses:
  H1: I would like to go
  H2: I would not like to go
  Possible expansions: to the movies | to the cinema | and watch a film
- The LM score of each expansion is identical for H1 and for H2 under bi-, tri-, and four-gram LMs
- E.g. the 3-gram LM score of expansion 1 is: -log p( to | to go ) - log p( the | go to ) - log p( movies | to the )
- Therefore: Cost(H1) < Cost(H2) implies Cost(H1+E) < Cost(H2+E) for all possible expansions E

5 Recombination of Hypotheses: Example 2
- Sentence length model p( I | J )
- Hypotheses:
  H1: I would like to go
  H2: I would not like to go
  Possible expansions: to the movies | to the cinema | and watch a film
- Length( H1 ) = 5, Length( H2 ) = 6
- For identical expansions the lengths remain different
- Situation at the sentence end:
  - It is possible that -log p( len( H1 + E ) | J ) > -log p( len( H2 + E ) | J )
  - Then it is possible that TotalCost( H1 + E ) > TotalCost( H2 + E )
  - I.e. the hypotheses can be reranked
- Therefore: H2 cannot be recombined into H1

6 Recombination: Keep Them Around
- Expand only the best hypothesis
- Store pointers to the recombined hypotheses for n-best list generation
[Figure: best hypotheses h_b with recombined hypotheses h_r attached, ordered by increasing coverage; lower cost is better]


8 Recombination of Hypotheses
- Typical features for recombination of partial hypotheses:
  - LM history
  - Positions of covered source words (some translations are more expensive)
  - Number of generated words on the target side (for the sentence length model)
- Often only the number of covered source words is considered, rather than the actual positions
  - This fits the typical organization of the decoder: hypotheses are stored according to the number of covered source words
  - Hypotheses are then recombined which are not strictly comparable
  - Use a future cost estimate to lessen the impact
- Overall: trade-off between speed and 'correctness' of the search
  - Ideally: only compare (and recombine) hypotheses if all models used in the search see them as equivalent
  - Realistically: use fewer, coarser equivalence classes by 'forgetting' some of the models (they still add to the scores)

9 Effect of Reordering

       Chinese-English      Arabic-English
  R    NIST     BLEU        NIST     BLEU
  1    7.97     0.205       8.59     0.385
  2    8.00     0.206       8.87     0.424
  3    8.04     0.209       8.94     0.432
  4    8.07     0.213       9.02     0.441

- R: reordering window; R = 1: monotone decoding
- Reordering mainly improves fluency, i.e. it has a stronger effect on BLEU
- Improvement for Arabic: 4.8% NIST and 12.7% BLEU
- Less improvement for Chinese: ~5% in BLEU
- Arabic devtest set (203 sentences)
- Chinese test set 2002 (878 sentences)

10 Search Space
- Example: sentence with 48 words
- Full search using coverage and language model state
- 'Av. Expanded' is the average over the entire test set (4,991 words)

  R    Expanded       Collisions     Av. Expanded
  0    183,806        0              6,467
  1    1,834,212      588,293        72,343
  2    8,589,221      3,479,193      326,470
  3    33,853,161     12,127,175     1,230,020

- More reordering -> more collisions
- The growth of the search space is counteracted by recombination of hypotheses and by pruning

11 Pruning
- Even after recombination there are too many hypotheses
- Remove bad hypotheses and keep only the best ones
- In recombination we compared hypotheses which are equivalent under the models
- Now we need to compare hypotheses which are not strictly equivalent under the models
- We risk removing hypotheses which would have won the race in the long run
- I.e. we introduce errors into the search
- Search errors vs. model errors:
  - Model errors: our models give higher probability to a worse translation
  - Search errors: our decoder loses translations with higher probability

12 Pruning: Which Hyps to Compare?
- Which hypotheses are we comparing?
- How many should we keep?
[Figure: hypothesis stacks illustrating recombination vs. pruning]

13 Pruning: Which Hyps to Compare?
- A coarser equivalence relation => we need to drop at least one of the models, or replace it by a simpler model, e.g.:
  - Recombination according to translated positions and LM state; pruning according to the number of translated positions and LM state
  - Recombination according to the number of translated positions and LM state; pruning according to the number of translated positions OR the LM state
  - Recombination with a 5-gram LM; pruning with a 3-gram LM
- Question: which is the more important feature?
  - Which leads to more search errors?
  - How much loss in translation quality?
  - Quality is more important than speed in most applications!
- There is not one correct answer; it depends on the other components of the system
- Ideally, the decoder allows different recombination and pruning settings (see the sketch below)
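A small sketch of what coarser and finer equivalence classes could look like in code: different key functions over the same hypothesis, from fine (exact coverage vector plus LM history) to coarse (only the number of covered words). The Hypothesis fields reuse the illustrative structure from the recombination sketch above.

```python
# Sketch: recombination/pruning keys of different granularity (illustrative).

def key_fine(hyp, lm_order=3):
    # exact covered positions + LM history: safest, most classes, slowest
    return (hyp.coverage, hyp.target[-(lm_order - 1):])

def key_medium(hyp, lm_order=3):
    # number of covered words + LM history: ignores which positions are covered
    return (len(hyp.coverage), hyp.target[-(lm_order - 1):])

def key_coarse(hyp):
    # number of covered words only: cheapest, but compares hypotheses the models
    # do not see as equivalent -> rely on future-cost estimates when pruning
    return len(hyp.coverage)
```

A decoder that keeps recombination and pruning configurable can, for example, recombine with key_fine but prune within the larger groups given by key_medium or key_coarse.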

14 How Many Hyps to Keep?
- Beam search: keep hypothesis h if Cost(h) < Cost(h_best) + const (see the sketch below)
- If the models separate the alternatives a lot -> keep few hypotheses
- If the models do not separate the alternatives -> keep many hypotheses
[Figure: cost over number of translated words; bad hypotheses outside the beam are pruned]
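A minimal sketch of the additive (threshold) beam pruning described on this slide; the per-stack organization and the cost field are assumptions carried over from the earlier sketches.

```python
# Sketch: additive beam pruning within one stack of hypotheses
# (all hypotheses covering the same number of source words).

def beam_prune(stack, beam_width):
    """Keep hyp h iff Cost(h) < Cost(h_best) + beam_width (costs are -log scores)."""
    if not stack:
        return stack
    best_cost = min(h.cost for h in stack)
    return [h for h in stack if h.cost < best_cost + beam_width]
```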

15 Additive Beam
- Is an additive constant (in the log domain) the right thing to do?
- The hypotheses may spread more and more
[Figure: cost over number of translated words; fewer and fewer hypotheses remain inside the beam]

16 Multiplicative Beam
- Beam search: keep hypothesis h if Cost(h) < Cost(h_best) * const
[Figure: cost over number of translated words; the beam opens up and covers more hypotheses]

17 Pruning and Optimization
- Each feature has a feature weight
- Optimization works by adjusting the feature weights
- This can compress or spread the scores
- This actually happened in our first MERT implementation: higher and higher feature weights => hypotheses spreading further and further apart => fewer hypotheses inside the beam => lower and lower BLEU score :-(
- Two-pronged repair:
  - Normalize the feature weights
  - Not proper beam pruning, but restricting the number of hypotheses

18 How Many Hyps to Keep?
- Keep the n best hypotheses (see the sketch below)
- This does not use the information from the models to decide how many hypotheses to keep
[Figure: cost over number of translated words; a constant number of hypotheses is kept, bad hypotheses are pruned]
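A complementary sketch that keeps a fixed maximum number of hypotheses per stack, as on this slide; combining it with the additive beam above is one way to realize the "restricting the number of hypotheses" repair mentioned on slide 17. The max_hyps value is illustrative.

```python
import heapq

# Sketch: keep at most max_hyps hypotheses per stack, independent of how the
# models spread the scores (cost = negative log score, lower is better).

def keep_n_best(stack, max_hyps=1000):
    return heapq.nsmallest(max_hyps, stack, key=lambda h: h.cost)
```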

19 Efficiency
- Two problems:
  - Sorting
  - Generating lots of hypotheses which are then pruned (a waste of time)
- Can we avoid generating hypotheses which would most likely be pruned?

20 Efficiency
- Assumptions:
  - We want to generate hypotheses which cover n positions
  - All hypothesis sets H_k, k < n, are sorted according to total score
  - All phrase pairs (edges in the translation lattice) which can be used to expand a hypothesis h in H_k to cover n positions are sorted according to their score (weighted sum of the individual scores)
[Figure: sorted hypotheses h1…h5, sorted phrases p1…p4, and the resulting new hypotheses (h1p1, h1p2, h2p1, h1p3, h2p3, h3p2, h4p2, …) emerging in roughly sorted order, with the tail pruned]

21 Naïve Way
- Naïve way:
    foreach hyp h
      foreach phrase pair p
        newhyp = h ∘ p
        Cost(newhyp) = Cost(h) + Cost(p) + CostLM + CostDM + …
- This generates many hypotheses which will be pruned (a runnable sketch follows below)
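A runnable toy version of the naive loop, assuming each hypothesis and phrase pair carries a precomputed cost and that the LM, DM and other context-dependent terms are folded into a combine_cost function supplied by the caller; all names are illustrative.

```python
# Sketch: naive expansion - score every (hypothesis, phrase pair) combination.

def expand_naive(hyps, phrase_pairs, combine_cost):
    """combine_cost(h, p) returns the extra LM/DM/... cost of appending p to h."""
    new_hyps = []
    for h in hyps:
        for p in phrase_pairs:
            cost = h.cost + p.cost + combine_cost(h, p)
            new_hyps.append((cost, h, p))   # many of these will be pruned again
    return new_hyps
```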

22 Early Termination
- If Cost(newhyp) = Cost(h) + Cost(p), early termination would be easy:
    besthyp = h1 ∘ p1
    loop
      h = next hyp
      loop
        p = next phrase pair
        newhyp = h ∘ p
        Cost(newhyp) = Cost(h) + Cost(p)
      until Cost(newhyp) > Cost(besthyp) + const
    until Cost(newhyp) > Cost(besthyp) + const
- That works for proper beam pruning, but would still generate too many hypotheses for the max-number-of-hyps strategy
- In addition, we also have the LM and DM costs, etc.

23 'Cube' Pruning
- Always expand the best hypothesis until:
  - no hypotheses are within the beam anymore, or
  - the maximum number of hypotheses is reached
[Figure: grid of hypotheses h1…h4 by phrases p1…p3 with combined costs; cells are explored in order of increasing cost]
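A compact sketch of the cube-pruning idea using a priority queue over the grid of (hypothesis index, phrase index) cells: start at the best/best corner, pop the cheapest cell, push its right and down neighbours, and stop after k expansions. The score used here is the sum of the sorted hypothesis and phrase costs, i.e. a lower bound; any extra LM/DM cost would be added when a cell is popped. Names are illustrative.

```python
import heapq

# Sketch: cube pruning over one grid of sorted hypotheses x sorted phrase pairs.
# hyp_costs and phrase_costs are sorted ascending; k limits the number of expansions.

def cube_prune(hyp_costs, phrase_costs, k):
    heap = [(hyp_costs[0] + phrase_costs[0], 0, 0)]   # start at the best/best corner
    seen = {(0, 0)}
    popped = []
    while heap and len(popped) < k:
        cost, i, j = heapq.heappop(heap)
        popped.append((cost, i, j))                   # would become a new hypothesis
        for ni, nj in ((i + 1, j), (i, j + 1)):       # push the two neighbours
            if ni < len(hyp_costs) and nj < len(phrase_costs) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (hyp_costs[ni] + phrase_costs[nj], ni, nj))
    return popped

# Example: cube_prune([1.0, 2.5, 4.0], [0.5, 1.5, 3.0], k=4)
```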

24 Effect of Recombination and Pruning
- Average number of expanded hypotheses and NIST scores for different recombination (R) and pruning (P) combinations and different beam sizes (= number of hyps)
- Test set: Arabic DevTest (203 sentences)

  Av. hyps expanded          Beam width
  R : P          1        2        5        10       20
  C : c          825      899      1,132    1,492    1,801
  C L : c        1,174    1,857    6,213    30,293   214,402
  C L : C        2,792    4,248    12,921   53,228   287,278

  NIST
  C : c          8.18     8.81     8.21     8.22     8.27
  C L : c        8.41     8.62     8.88     8.95     8.96
  C L : C        8.47     8.68     8.85     8.98     -

- c = number of translated words, C = coverage vector (i.e. positions), L = LM history
- NIST scores: higher is better

25 Number of Hypotheses versus NIST
- The language model state is required as a recombination feature
- More hypotheses – better quality
- There are different ways to achieve similar translation quality
- C L : C generates more 'useless' hypotheses (the number of bad hypotheses grows faster than the number of good hypotheses)

26 N-Best List Generation
- Benefits:
  - Required for optimizing the model scaling factors
  - Rescoring with richer models
  - Down-stream processing
    - Translation with a pivot language: L1 -> L2 -> L3
    - Information extraction
    - …
- We have n-best translations at the sentence end
- But: hypotheses are recombined -> many good translations do not reach the sentence end
- Recover those translations

27 Storing Multiple Backpointers
- When recombining hypotheses, store them with the best (i.e. surviving) hypothesis, but don't expand them
[Figure: best hypotheses h_b with the recombined hypotheses h_r attached via backpointers]

28 Calculating True Score
- Propagate the final score backwards:
  - For the best hypothesis we have the correct final score Q_f(h_b)
  - For a recombined hypothesis we know its current score Q_c(h_r) and the difference to the current score Q_c(h_b) of the best hypothesis
  - The final score of the recombined hypothesis is then: Q(h_r) = Q_f(h_b) + ( Q_c(h_r) - Q_c(h_b) )
- Use B = (Q, h, B') to store sequences of hypotheses which make up a translation
- Start with the n best final hypotheses
- For each of the top n Bs, go to the predecessor hypothesis and to the recombined hypotheses of the predecessor hypothesis
- Store the Bs according to coverage
(A small sketch of the score adjustment follows below.)
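The adjustment from this slide in code form: the final score of a recombined hypothesis is the final score of the surviving hypothesis plus the cost difference the two had when they were recombined. Field names are illustrative and match the earlier sketches.

```python
# Sketch: propagate final scores back to recombined hypotheses.

def true_score(q_final_best, q_current_best, q_current_recombined):
    """Q(h_r) = Q_f(h_b) + (Q_c(h_r) - Q_c(h_b)); all Q are costs (-log scores)."""
    return q_final_best + (q_current_recombined - q_current_best)

def alternatives_with_true_scores(h_best, q_final_best):
    """Yield (true final score, recombined hypothesis) for each stored alternative."""
    for h_r in h_best.recombined:
        yield true_score(q_final_best, h_best.cost, h_r.cost), h_r
```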

29 Problem with N-Best Generation
- Duplicates when using phrases:
  US # companies # and # other # institutions
  US companies # and # other # institutions
  US # companies and # other # institutions
  US # companies # and other # institutions
  ...
- Example run: 1000-best -> ~400 different strings on average; extreme case: only 10 different strings
- Possible solution: check uniqueness during backtracking, i.e. create and hash the partial translations (see the sketch below)
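A minimal sketch of the deduplication idea: hash the surface string (ignoring the phrase segmentation marks) while collecting n-best candidates, and keep only the first, i.e. best-scoring, occurrence of each string. The candidate format with '#' as a segmentation mark is an assumption based on the slide's example.

```python
# Sketch: remove spurious duplicates that differ only in phrase segmentation.
# Candidates are assumed to be (cost, segmented_string) pairs sorted by cost.

def unique_nbest(candidates, n):
    seen = set()
    unique = []
    for cost, segmented in candidates:
        surface = " ".join(w for w in segmented.split() if w != "#")
        if surface not in seen:
            seen.add(surface)
            unique.append((cost, surface))
            if len(unique) == n:
                break
    return unique
```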

30 Rest-Cost Estimation
- In pruning we compare hypotheses which are not strictly equivalent under the models
- Risk: we prefer hypotheses which have covered the easy parts
- Remedy: estimate the remaining cost for each hypothesis and compare hypotheses based on ActualCost + FutureCost
- We want to know the minimum expected cost (similar to A* search)
  - This gives a bound for pruning
  - However, it is not possible with acceptable effort for all models
- We want to include as many models as possible:
  - Translation model costs, word count, phrase count
  - Language model costs
  - Distortion model costs
- Calculate the expected cost R(l, r) for each span (l, r)

31 Rest Cost for Translation Models
- Translation model, word count and phrase count features are 'local' costs
  - They depend only on the current phrase pair
  - They are strictly additive: R(l, m) + R(m, r) = R(l, r)
- Minimize over the alternative translations:
  - For each source phrase span (l, r): initialize with the cost of the best translation
  - Combine adjacent spans, take the best combination (see the sketch below)
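A short sketch of that dynamic program, assuming phrase_costs maps a source span (l, r), half-open over source positions, to the cost of its cheapest phrase translation; spans with no phrase translation are simply absent and default to infinity.

```python
# Sketch: rest-cost table for local (additive) features.
# phrase_costs[(l, r)] = cost of the best translation of source span [l, r).

def rest_cost_table(phrase_costs, sent_len):
    INF = float("inf")
    R = {(l, r): phrase_costs.get((l, r), INF)
         for l in range(sent_len) for r in range(l + 1, sent_len + 1)}
    for length in range(2, sent_len + 1):          # combine adjacent spans
        for l in range(sent_len - length + 1):
            r = l + length
            for m in range(l + 1, r):
                R[(l, r)] = min(R[(l, r)], R[(l, m)] + R[(m, r)])
    return R
```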

32 Rest Cost for Language Models
- We do not have the history -> only an approximation
  - For each span (l, r) calculate the LM score without history
  - Combine the LM scores for adjacent spans
  - Note: p(e_1 … e_m) * p(e_{m+1} … e_n) != p(e_1 … e_n) beyond a 1-gram LM
- Alternative: fast monotone decoding with the TM-best translations
  - The history is then available
  - Then R(l, r) = R(1, r) - R(1, l)

33 Rest Cost for Distance-Based DM
- For a distance-based DM the rest cost depends on the coverage pattern
- Too many different coverage patterns; we cannot pre-calculate it
- Estimate by jumping to the first gap, then filling the gaps in sequence
- Moore & Quirk 2007: DM cost plus rest cost, with S = current phrase, S' = previous phrase, S'' = gap-free initial segment, L(.) = length of a phrase, D(.,.) = distance between phrases:
  - S adjacent to S'': d = 0
  - S left of S': d = 2 L(S)
  - S' a subsequence of S'': d = 2 ( D(S, S'') + L(S) )
  - Otherwise: d = 2 ( D(S, S') + L(S) )
(A literal transcription of these cases in code follows below.)
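A direct transcription of the four cases as a function, assuming phrases are represented by their source spans (start, end) with the end exclusive, that "adjacent" means S starts where the gap-free initial segment S'' ends, and that D is the gap between two spans. This only illustrates the case analysis from the slide, not Moore & Quirk's full derivation.

```python
# Sketch: distance-based distortion cost plus rest cost (Moore & Quirk 2007 style).
# s, s_prev, s_init are (start, end) source spans; end is exclusive.

def length(span):
    return span[1] - span[0]

def distance(a, b):
    # gap between two spans (0 if adjacent or overlapping)
    return max(a[0] - b[1], b[0] - a[1], 0)

def dm_plus_rest_cost(s, s_prev, s_init):
    if s[0] == s_init[1]:                  # S adjacent to the gap-free initial segment S''
        return 0
    if s[1] <= s_prev[0]:                  # S left of the previous phrase S'
        return 2 * length(s)
    if s_prev[1] <= s_init[1]:             # S' lies inside the gap-free initial segment S''
        return 2 * (distance(s, s_init) + length(s))
    return 2 * (distance(s, s_prev) + length(s))
```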

34 Rest Cost for Lexicalized DM
- Lexicalized DM per phrase pair (f, e) = (f, t(f))
- DM(f, e) scores: in-mon, in-swap, in-dist, out-mon, out-swap, out-dist
- Treat it as a local cost for each span (l, r)
- Minimize over the alternative translations and the different orientations in-* and out-*

35 Effect of Rest-Cost Estimation
- From Richard Zens 2008
- We did not describe the 'per position' variant
- The LM rest cost is important, and the DM rest cost is important

36 Summary
- Different translation strategies – related to word reordering
- Two-level decoding strategy (one possible way to do it):
  - Generating the translation lattice: contains all word and phrase translations
  - Finding the best path
- Word reordering as an extension of the best-path search:
  - Jump ahead in the lattice, fill in the gap later
  - Short reordering window: decoding time is exponential in the size of the window
- Recombination of hypotheses:
  - If the models cannot re-rank hypotheses, keep only the best
  - Depends on the models used

37 Summary
- Pruning of hypotheses:
  - Beam pruning
  - Problem with too few hypotheses in the beam (e.g. when running MERT)
  - Keeping a maximum number of hypotheses
- Efficiency of the implementation:
  - Try to avoid generating hypotheses which are pruned
  - Cube pruning
- N-best list generation:
  - Needed for MERT
  - Spurious ambiguity

