# Thomas Schoenemann, University of Düsseldorf, Germany. ACL 2013, Sofia, Bulgaria. Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment


Outline

1. Word Alignment
2. Fertility-based models, specifically IBM-3 and IBM-4
3. Removing Deficiency
4. Maximum Likelihood Training via expectation maximization (EM)

Word Alignment

- Given: a bilingual sentence pair
- Task: find out which words correspond to each other

Considered Approach

Overall strategy:

1. Design a probabilistic model (or take an existing one) for translation and word alignment, with a (manageable) set of base probabilities
2. Learn the base probabilities from a set of training data (sentence pairs without alignments)
3. To annotate a given sentence pair: compute the most likely alignment

Approach for this talk:

- probabilistic approach
- data driven
- unsupervised: no alignments given
- based on (Brown et al. 1993)

Considered Models

Alignment and translation model: for a target sentence $f_1^J$ and a source sentence $e_1^I$, a conditional model in which the alignments $a_1^J$ are hidden variables:

$$p(f_1^J \mid e_1^I) = \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I)$$

Considered alignments: each target word corresponds to at most one source word (mainly for computational reasons).

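Generative Process for Fertility Based Models

Given a source sentence $e_1^I$:

1. For $i = 1, \dots, I$: decide on the number of target words aligned to $e_i$ (the fertility of $e_i$). Then decide on the number of unaligned target words.
2. For each source word $e_i$ and each $k$ up to its fertility: decide on the $k$th target word aligned to $e_i$.
3. For $i = 1, \dots, I$ and each $k$ up to the fertility of $e_i$: decide on the position of the target word. This is the distortion model: it is where IBM-3/4/5 differ, and it is the source of deficiency.

The remaining positions are filled with the unaligned target words.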
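The quantities chosen in this generative process can be read off a given alignment. The sketch below assumes the alignment is stored as a vector with one entry per target position, where entry `i` means "aligned to source word `i`" and `0` means "unaligned"; the function names and the example alignment are hypothetical.

```python
def fertilities(alignment, num_source_words):
    """Count how many target words are aligned to each source position.

    alignment[j - 1] = i means target position j is aligned to source
    position i; i = 0 marks an unaligned target word.
    """
    phi = [0] * (num_source_words + 1)
    for i in alignment:
        phi[i] += 1
    return phi


def positions_per_source_word(alignment, num_source_words):
    """List the (1-based) target positions covered by each source position."""
    positions = [[] for _ in range(num_source_words + 1)]
    for j, i in enumerate(alignment, start=1):
        positions[i].append(j)
    return positions


# Hypothetical example: J = 5 target words, I = 3 source words.
alignment = [1, 0, 3, 3, 2]
phi = fertilities(alignment, 3)
pos = positions_per_source_word(alignment, 3)
print(phi)  # fertility of the empty word first, then of e_1..e_3
print(pos)
```

Because every target word is aligned to at most one source word, the fertilities always sum to the target sentence length.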

IBM-3, Distortion and Deficiency

- IBM-3: deficient, since we could choose j = 1 even though that position is already taken
- This work: nondeficient variant of IBM-3

Reflection

- The nondeficient variant of IBM-3 from this work uses the same base probabilities as the original IBM-3
- Nondeficiency is achieved by renormalization
- Relation to parametric models
- Need to keep track of all taken positions (just like the IBM-5)
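A minimal sketch of the renormalization idea, under the assumption that the base distortion parameters form a table `base[i][j]` (probability of placing a word of source position `i` at target position `j`). The slide's own formulas are not preserved in this transcript, so the table layout, names, and numbers below are hypothetical.

```python
def deficient_distortion(base, i, j):
    """Original IBM-3 style: ignores which positions are already taken."""
    return base[i][j]


def nondeficient_distortion(base, i, j, open_positions):
    """Renormalize the same base probabilities over the open positions only."""
    if j not in open_positions:
        return 0.0
    norm = sum(base[i][jp] for jp in open_positions)
    return base[i][j] / norm


# Hypothetical base table for I = 2 source words and J = 3 target positions.
base = {
    1: {1: 0.5, 2: 0.3, 3: 0.2},
    2: {1: 0.2, 2: 0.5, 3: 0.3},
}
open_positions = {2, 3}  # position 1 is already taken

deficient_mass = sum(deficient_distortion(base, 2, j) for j in open_positions)
nondeficient_mass = sum(
    nondeficient_distortion(base, 2, j, open_positions) for j in open_positions
)
print(deficient_mass, nondeficient_mass)
```

In the deficient version, part of the probability mass goes to the taken position 1 and is lost; after renormalization the open positions carry all of it. The price is that the set of taken positions has to be tracked.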

IBM-4

1. Word classes
2. Center position of *die* (the closest previous aligned word)

- IBM-4 (deficient)
- This work: nondeficient variant of IBM-4, obtained by renormalization

Nondeficient IBM-4

- Deficient IBM-4
- Nondeficient IBM-4: leave out position 13 because we have to place *neigeuse* afterwards.

Training: Maximum Likelihood

Maximize the likelihood of the training corpus, subject to simplex (probability) constraints on the base probabilities.

- nonconcave maximization problem
- many local maxima
- no global algorithms known
- method of choice: expectation maximization (Dempster et al. 1977, Neal & Hinton 1998)
- for convenience: take the negative logarithm of the objective
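Written out, the training problem takes roughly the following form. The notation is an assumption, since the slide's formula is not preserved in the transcript: $s$ indexes the sentence pairs $(f_s, e_s)$ of the training corpus, $a$ ranges over the alignments of a pair, and $\theta$ collects all base probabilities.

```latex
\max_{\theta} \; \prod_{s} \sum_{a} p_{\theta}(f_s, a \mid e_s)
\quad \Longleftrightarrow \quad
\min_{\theta} \; -\sum_{s} \log \sum_{a} p_{\theta}(f_s, a \mid e_s)
\qquad \text{s.t. every distribution in } \theta \text{ is nonnegative and sums to } 1
```

The inner sum over alignments is what makes the problem nonconcave and, for fertility-based models, expensive to evaluate.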

EM: A Majorize-Minimize Method

(Plot: the negative log-likelihood function, assuming there is just one variable to minimize over.)

- in practice there are several thousand variables, with simplex (a.k.a. probability) constraints
- and for us, the bounding functions will not be convex
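The majorize-minimize behaviour can be seen on a one-variable toy problem: estimating the weight `w` of a two-component mixture whose component likelihoods are fixed. This is only an illustration with made-up numbers, not the alignment models themselves; unlike the case in the talk, the bounding function here is convex and can be minimized in closed form.

```python
import math


def neg_log_likelihood(w, lik_a, lik_b):
    """Negative log-likelihood of mixture weight w for fixed component likelihoods."""
    return -sum(math.log(w * a + (1 - w) * b) for a, b in zip(lik_a, lik_b))


def em_step(w, lik_a, lik_b):
    """One EM iteration for the single variable w."""
    # E-step: posterior probability that each observation came from component A.
    # These posteriors are the weights of the bounding function.
    resp = [w * a / (w * a + (1 - w) * b) for a, b in zip(lik_a, lik_b)]
    # M-step: the bounding function is minimized by the mean posterior.
    return sum(resp) / len(resp)


# Hypothetical per-observation likelihoods under the two components.
lik_a = [0.9, 0.8, 0.1, 0.2]
lik_b = [0.1, 0.3, 0.6, 0.7]

w = 0.5
history = [neg_log_likelihood(w, lik_a, lik_b)]
for _ in range(20):
    w = em_step(w, lik_a, lik_b)
    history.append(neg_log_likelihood(w, lik_a, lik_b))

print(w, history[0], history[-1])
```

Each iteration minimizes an upper bound that touches the objective at the current point, so the recorded negative log-likelihood never increases.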

EM: Problems for IBM-3 and IBM-4

cf. (Udupa & Maji 2006)

For IBM-3 and IBM-4:

- evaluating the negative log-likelihood at a given point is intractable (deficient + nondeficient)
- the bounding function is known up to weights (= expectations)
  - computing the weights is intractable (deficient + nondeficient)
  - there are exponentially many weights (nondeficient)
  - ⇒ approximations
- approximated bounding functions: non-convex (nondeficient) ⇒ local minimization with projected gradient descent (250 iterations) (e.g. Bertsekas 1999)
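Projected gradient descent under simplex constraints can be sketched as follows. The projection is the common sort-based Euclidean projection onto the probability simplex. The objective is a stand-in chosen for illustration (a convex multinomial M-step with hypothetical expected counts), not the non-convex bounding function of the nondeficient models. The step size is an assumption as well; only the 250 iterations come from the slide.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k that passes this test gives the threshold
    return [max(x - theta, 0.0) for x in v]


def projected_gradient_descent(counts, iterations=250, step=0.01):
    """Minimize -sum_k counts[k] * log(x[k]) over the probability simplex."""
    n = len(counts)
    x = [1.0 / n] * n
    for _ in range(iterations):
        # Gradient of the objective; the floor guards against division by zero.
        grad = [-c / max(x_k, 1e-12) for c, x_k in zip(counts, x)]
        x = project_to_simplex([x_k - step * g for x_k, g in zip(x, grad)])
    return x


solution = projected_gradient_descent([3.0, 1.0, 1.0])
print(solution)
```

For this stand-in objective the minimizer is known in closed form (the normalized counts), which makes it easy to check that the iteration converges.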

E-Step: Computing the Weights of the Bounding Function

Minimizing the bounding function decomposes into smaller problems (one per probability distribution). Here: consider only distortion for the IBM-3.

For each sentence:

- for the deficient variant: need the expectation of i aligning to j
- for the nondeficient variant: need the expectation of i aligning to j when choosing from a set of open positions (exponentially many)

In both cases: hillclimbing procedure as in (Brown et al. 1993)

- gives a likely alignment and its neighbors → approximate expectations
- incremental for the deficient variant (fast)
- not or only partially incremental for the nondeficient variant (slow)

This method is also used to compute alignments.
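A rough sketch of the hillclimbing idea: climb to a locally best alignment using move and swap neighbors, then approximate the expectations by normalizing the model scores over that alignment and its neighbors. The scoring function below is a hypothetical toy (a product of table entries) standing in for the IBM-3/IBM-4 probability. For this toy score exact expectations would be tractable, so the code only illustrates the procedure, and it recomputes every score from scratch rather than incrementally.

```python
def neighbors(alignment, num_source_words):
    """Alignments that differ by one move or one swap."""
    result = []
    J = len(alignment)
    for j in range(J):  # moves: re-align one target word
        for i in range(num_source_words + 1):
            if i != alignment[j]:
                moved = list(alignment)
                moved[j] = i
                result.append(tuple(moved))
    for j1 in range(J):  # swaps: exchange the alignments of two target words
        for j2 in range(j1 + 1, J):
            if alignment[j1] != alignment[j2]:
                swapped = list(alignment)
                swapped[j1], swapped[j2] = swapped[j2], swapped[j1]
                result.append(tuple(swapped))
    return result


def hillclimb(alignment, score, num_source_words):
    """Greedily move to the best neighbor until no neighbor scores higher."""
    current = tuple(alignment)
    while True:
        best = max(neighbors(current, num_source_words), key=score)
        if score(best) <= score(current):
            return current
        current = best


def approximate_expectations(center, score, num_source_words):
    """Expectation of source i aligning to target j, over center and neighbors only."""
    candidates = [tuple(center)] + neighbors(center, num_source_words)
    total = sum(score(a) for a in candidates)
    expectations = {}
    for a in candidates:
        weight = score(a) / total
        for j, i in enumerate(a, start=1):
            expectations[(i, j)] = expectations.get((i, j), 0.0) + weight
    return expectations


# Hypothetical toy score: table[j][i] for J = 3 target and I = 2 source words.
table = [[0.1, 0.7, 0.2], [0.2, 0.1, 0.7], [0.6, 0.2, 0.2]]


def toy_score(a):
    p = 1.0
    for j, i in enumerate(a):
        p *= table[j][i]
    return p


best = hillclimb((0, 0, 0), toy_score, 2)
expectations = approximate_expectations(best, toy_score, 2)
print(best, expectations[(1, 1)])
```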

Briefly Mentioned

More contributions in the paper:

- For the IBM-3: pooled distortion model
- Reduced deficiency for the IBM-4: words can no longer be placed outside of the sentence (but still on top of one another)
- In both cases: parametric models handled via EM and projected gradient descent based on

Experimental Setup

- Europarl German ↔ English and Spanish ↔ English
  - gold alignments: my own and from Lambert et al.
- all sentences lower cased
- in deficient mode: deficient empty word model (Och & Ney 2003)
- 100,000 sentence pairs (leads to 1 day running time, 8 GB memory)
- Evaluation metric: weighted F-measure (Fraser & Marcu 2007)
  - accuracy measure (higher values = better alignments)
  - α = 0.1 (recall more important than precision)
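The evaluation metric can be sketched as below. The slide only names the metric; the formula used here, 1 / (α / precision + (1 − α) / recall), is the weighted F-measure as commonly stated for Fraser & Marcu (2007). Treating gold links as "sure" and "possible" sets, and the example links themselves, are assumptions made for illustration.

```python
def weighted_f_measure(predicted, sure, possible=(), alpha=0.1):
    """Weighted F-measure; a small alpha puts more weight on recall."""
    predicted = set(predicted)
    sure = set(sure)
    possible = set(possible) | sure  # every sure link also counts as possible
    precision = len(predicted & possible) / len(predicted)
    recall = len(predicted & sure) / len(sure)
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Hypothetical alignment links as (source position, target position) pairs.
predicted = {(1, 1), (2, 2), (3, 4)}
sure = {(1, 1), (2, 2), (3, 3)}
possible = {(3, 4)}

f_recall_heavy = weighted_f_measure(predicted, sure, possible, alpha=0.1)
f_balanced = weighted_f_measure(predicted, sure, possible, alpha=0.5)
print(f_recall_heavy, f_balanced)
```

In this example precision is perfect but one sure link is missed, so the recall-heavy setting α = 0.1 scores the alignment lower than the balanced setting α = 0.5.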

Results – Short Story

Alignment accuracy:

- IBM-3: the nondeficient variant is clearly better than the deficient one
- IBM-4: all level, no winners
- IBM-5 beats everything

Also tried: phrase based translation (Moses Experiment Management System)

- training run in both directions, grow-diag-final-and
- tuning (MERT) on 750 sentence pairs
- run for all models and variants
- the various BLEU scores offer no conclusions

Results – Long Story (1)

Results – Long Story (2)

Results – Long Story (3)

Results – Long Story (4)

Related work on Word Alignment

- Models: Brown et al. 1993; Vogel, Ney, Tillmann 1996; Wang & Waibel 1998; Melamed 2000; Marcu & Wong 2002; Deng & Byrne 2005; Fraser & Marcu 2007; Mauser et al. 2009
- Algorithms: Al-Onaizan et al. 1999; Matusov, Zens, Ney 2004; Taskar et al. 2005; Udupa & Maji 2005; Lacoste-Julien et al. 2006; Cromières, Kurohashi 2009
- Regularity terms: Liang, Taskar, Klein 2006; Graca, Ganchev, Taskar 2010; Bansal, Quirk, Moore 2011; Vaswani et al. 2012

Conclusion

Contributed:

- Nondeficient variants of IBM-3 and IBM-4
- Maximum Likelihood Training based on EM
  - E-steps solved by hillclimbing
  - M-steps solved by projected gradient ascent

Findings:

- nondeficiency is an important goal of probabilistic modeling (theoretical value)
- improvements for IBM-3 (F-measure on gold alignments)
- otherwise no improvements (IBM-4, BLEU scores)
- IBM-5 beats everything (F-measures on gold alignments)

Thank you! Questions? Source code and gold alignments online: https://github.com/Thomas1205/RegAligner http://user.phil-fak.uni-duesseldorf.de/~tosch/

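The Bounding Function

Need to iteratively solve problems of a fixed form (ignoring a constant), with the following ingredients:

- a sum over exponentially many terms
- weights computed with the previous parameters
- each term is the logarithm of a product of factors, which becomes a sum of logarithms
- each factor is evaluated with the parameters to be currently optimized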
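For reference, the standard EM bounding function has the shape sketched below, written for a single sentence pair $(f, e)$. This is the textbook form, not a transcription of the slide, whose formula is not preserved. The symbols are assumptions: $\theta^{\text{old}}$ are the previous parameters, and $K(a)$ is the hypothetical collection of base probabilities used by alignment $a$.

```latex
\min_{\theta} \; -\sum_{a} p_{\theta^{\text{old}}}(a \mid f, e) \, \log p_{\theta}(f, a \mid e),
\qquad
\log p_{\theta}(f, a \mid e) = \log \prod_{k \in K(a)} \theta_k = \sum_{k \in K(a)} \log \theta_k
```

The sum over $a$ has exponentially many terms, which is why the weights $p_{\theta^{\text{old}}}(a \mid f, e)$ are only approximated in practice.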
