
1 Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment. Thomas Schoenemann, University of Düsseldorf, Germany. ACL 2013, Sofia, Bulgaria.

2 Outline. 1. Word Alignment. 2. Fertility-based models (IBM-3 and IBM-4 specifically). 3. Removing Deficiency. 4. Maximum Likelihood Training with expectation maximization (EM).

3 Word Alignment. Given: a bilingual sentence pair (example shown as a figure). Task: find out which words correspond to each other.

4 Considered Approach. Overall strategy: 1. Design a probabilistic model (or take an existing one) for translation and word alignment, with a (manageable) set of base probabilities. 2. Learn the base probabilities from a set of training data (sentence pairs without alignments). 3. To annotate a given sentence pair: compute the most likely alignment. Approach for this talk: probabilistic, data-driven, unsupervised (no alignments given), based on (Brown et al. 1993).

5 Considered Models. Alignment and translation model: for a target sentence $f_1^J$ and a source sentence $e_1^I$, a conditional model with the alignments $a_1^J$ as hidden variables: $p(f_1^J \mid e_1^I) = \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I)$. Considered alignments: each target word corresponds to at most one source word, i.e. $a_j \in \{0, 1, \dots, I\}$ (mainly for computational reasons).
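
As a concrete instance (not on the slide; this is the standard schematic factorization from Brown et al. 1993, with the empty-word and combinatorial factors omitted), the IBM-3 decomposes the joint probability into fertility, translation, and distortion terms:

```latex
p\bigl(f_1^J, a_1^J \mid e_1^I\bigr) \;\propto\;
  \prod_{i=1}^{I} n(\phi_i \mid e_i)                 % fertility of source word e_i
  \;\prod_{j=1}^{J} t\bigl(f_j \mid e_{a_j}\bigr)    % word-to-word translation
  \;\prod_{j:\, a_j \neq 0} d\bigl(j \mid a_j, I, J\bigr)  % distortion (positions)
```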

6 Generative Process for Fertility-Based Models. Given a source sentence $e_1^I$: 1. For $i = 1, \dots, I$: decide on the number $\phi_i$ of target words aligned to $e_i$. 2. For $i = 1, \dots, I$, for $k = 1, \dots, \phi_i$: decide on the k-th target word aligned to $e_i$. 3. For $i = 1, \dots, I$, for $k = 1, \dots, \phi_i$: decide on the position of that target word (the distortion model; this is where IBM-3/4/5 differ, and the source of deficiency). Then decide on the number of unaligned target words; the remaining positions are filled with these words.
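
A minimal Python sketch of this generative story, with hypothetical callbacks `fertility`, `translation`, and `position` standing in for the base probabilities (step 3 is where IBM-3/4/5 differ and where deficiency can arise):

```python
def generate(source, fertility, translation, position):
    """Sketch of the fertility-based generative story. `fertility`,
    `translation` and `position` are hypothetical samplers standing in
    for the base probabilities; the position step is where a deficient
    model can pick an already-taken slot."""
    placed = {}  # target position -> generated target word
    for i, e in enumerate(source, start=1):
        phi = fertility(e)               # 1. number of target words aligned to e
        for k in range(phi):
            f = translation(e)           # 2. the k-th target word aligned to e
            j = position(i, k, placed)   # 3. its target position
            placed[j] = f
    # Finally, decide on the number of unaligned target words; they are
    # generated by the empty word and fill the remaining positions.
    return placed
```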

7 IBM-3, Distortion and Deficiency. (Alignment example shown as a figure.) The IBM-3 is deficient: the distortion model could choose a position j = 1 that is already taken. This work: a nondeficient variant of the IBM-3.

8 Reflection. The IBM-3 vs. this work's nondeficient variant: same base probabilities as the original IBM-3; nondeficiency achieved by renormalization; relation to parametric models; need to keep track of all taken positions (just like the IBM-5).
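
A minimal sketch of the renormalization idea, assuming a hypothetical distortion table `d[j][i]` holding the original IBM-3 base probabilities:

```python
def nondeficient_distortion(d, i, open_positions):
    """Same base probabilities d(j|i) as the deficient IBM-3, but
    renormalized over the currently open target positions, so no
    probability mass is wasted on already-taken positions.
    Requires tracking all taken positions, just like the IBM-5."""
    Z = sum(d[j][i] for j in open_positions)
    return {j: d[j][i] / Z for j in open_positions}
```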

9 IBM-4. (Alignment example shown as a figure.) The IBM-4 distortion model uses: 1. word classes; 2. the center position of the closest previous aligned word (here: die). IBM-4 (deficient) vs. this work: a nondeficient variant of the IBM-4, again via renormalization.

10 Nondeficient IBM-4. Deficient IBM-4 vs. nondeficient IBM-4 (equations shown as figures): the nondeficient variant leaves out position 13 because neigeuse still has to be placed afterwards; see the reconstruction below.
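
In formulas (a hedged reconstruction; the slide's equations were images), with $c$ the center of the previous fertility group and word classes omitted:

```latex
% Deficient IBM-4 head-word placement:
p(j) \;=\; d_1(j - c)
% Nondeficient variant: renormalize over the set A of open positions that
% still allow the remaining words to be placed (here A excludes position 13,
% since "neigeuse" must still be placed after it):
p_{\text{nd}}(j) \;=\; \frac{d_1(j - c)}{\sum_{j' \in A} d_1(j' - c)}
```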

11 Training: Maximum Likelihood. Maximize the likelihood of the training corpus, subject to simplex (probability) constraints on the base probabilities. This is a nonconcave maximization problem with many local maxima, and no global algorithms are known. Method of choice: expectation maximization (Dempster et al. '77, Neal & Hinton '98). For convenience: take the negative logarithm of the objective.
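
The objective on this slide was an image; the standard maximum-likelihood formulation over a training corpus of sentence pairs $(f^{(s)}, e^{(s)})$ with base probabilities $\theta$ reads:

```latex
\max_{\theta}\; \sum_{s} \log p_{\theta}\bigl(f^{(s)} \mid e^{(s)}\bigr)
\qquad \text{s.t. every distribution in } \theta \text{ lies on a probability simplex}
```

Taking the negative logarithm turns this into minimizing $-\sum_s \log p_\theta(f^{(s)} \mid e^{(s)})$, the form used on the next slides.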

12 EM: A Majorize-Minimize Method. (Figure: the negative log-likelihood function and a bounding function, drawn as if there were just one variable to minimize.) In practice there are several thousand variables, with simplex (i.e. probability) constraints, and for us the bounding functions will not be convex; the scheme is sketched below.
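
The plot is gone, but the majorize-minimize scheme it illustrated is standard: at the current iterate $\theta_t$, build a bounding function that touches the negative log-likelihood $f$ there and lies above it everywhere, then minimize the bound:

```latex
g(\theta; \theta_t) \,\ge\, f(\theta) \;\;\forall\, \theta, \qquad
g(\theta_t; \theta_t) \,=\, f(\theta_t), \qquad
\theta_{t+1} \,=\, \arg\min_{\theta}\, g(\theta; \theta_t)
```

This guarantees $f(\theta_{t+1}) \le f(\theta_t)$; for EM, $g$ arises from the expected complete-data negative log-likelihood.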

13 EM: Problems for IBM-3 and IBM-4, cf. (Udupa & Maji 2006). For IBM-3 and IBM-4: evaluating the negative log-likelihood at a given point is intractable (deficient and nondeficient). The bounding function is known up to weights (= expectations): computing the weights is intractable (deficient and nondeficient), and there are exponentially many weights (nondeficient) ⇒ approximations. The approximated bounding functions are non-convex (nondeficient) ⇒ local minimization with projected gradient descent (250 iterations), e.g. (Bertsekas 1999); a sketch follows below.
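
A minimal sketch of projected gradient descent over a probability simplex, assuming a hypothetical callback `grad` that returns the gradient of the approximated bounding function (the sort-based projection follows Duchi et al. 2008; the actual step-size rule used in the paper is not stated on the slide):

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (sort-based algorithm, e.g. Duchi et al. 2008)."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)      # shift so the result sums to 1
    return np.maximum(v - theta, 0.0)

def projected_gradient_descent(grad, x0, step=0.1, iters=250):
    """Locally minimize a (possibly non-convex) function over the
    simplex, as done for the approximated bounding functions."""
    x = x0.copy()
    for _ in range(iters):
        x = project_to_simplex(x - step * grad(x))
    return x
```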

14 E-Step: Computing the Weights of the Bounding Function. Minimizing the bounding function decomposes into smaller problems (one per probability distribution). Here: consider only distortion for the IBM-3. For each sentence: for the deficient variant, we need the expectation of i aligning to j; for the nondeficient variant, the expectation of i aligning to j when choosing from a set of open positions (exponentially many such sets). In both cases: a hillclimbing procedure as in (Brown et al. 1993) gives a likely alignment and its neighbors → approximate expectations; incremental for the deficient variant (fast), not or only partially incremental for the nondeficient variant (slow). This method is also used to compute alignments; see the sketch below.
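
A minimal sketch of the hillclimbing idea, with a hypothetical `score` callback returning $p(f, a \mid e)$ for an alignment (`alignment[j] = i`, where `i = 0` is the empty word):

```python
from itertools import combinations

def hillclimb(alignment, I, score):
    """Greedy local search over alignments with the move/swap
    neighborhood of Brown et al. (1993)."""
    current = list(alignment)
    best_score = score(current)
    improved = True
    while improved:
        improved = False
        # move operations: re-align one target position j
        for j in range(len(current)):
            old = current[j]
            for i in range(I + 1):
                if i == old:
                    continue
                current[j] = i
                s = score(current)
                if s > best_score:
                    best_score, old, improved = s, i, True
            current[j] = old            # keep the best choice found
        # swap operations: exchange the links of two target positions
        for j1, j2 in combinations(range(len(current)), 2):
            if current[j1] == current[j2]:
                continue
            current[j1], current[j2] = current[j2], current[j1]
            s = score(current)
            if s > best_score:
                best_score, improved = s, True
            else:                       # revert a non-improving swap
                current[j1], current[j2] = current[j2], current[j1]
    return current, best_score
```

The resulting alignment and its neighborhood are what the expectations are approximated from.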

15 Briefly Mentioned. More contributions in the paper: for the IBM-3, a pooled distortion model; reduced deficiency for the IBM-4 (words can no longer be placed outside of the sentence, but still on top of one another); in both cases parametric models, handled via EM and projected gradient descent.

16 Experimental Setup. Europarl German ↔ English and Spanish ↔ English; gold alignments: my own and those of Lambert et al. All sentences lowercased. In deficient mode: deficient empty-word model (Och & Ney 2003). 100,000 sentence pairs (leads to 1 day running time, 8 GB memory). Evaluation metric: weighted F-measure (Fraser & Marcu 2007), an accuracy measure (higher values = better alignments), with α = 0.1 (recall more important than precision); a sketch follows below.
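
A minimal sketch of the metric, following the Fraser & Marcu (2007) definitions with sure gold links S, possible gold links P (S a subset of P), and predicted links A, all as sets of (source, target) index pairs (non-empty A and S assumed):

```python
def weighted_f_measure(A, S, P, alpha=0.1):
    """Weighted F-measure of Fraser & Marcu (2007). With alpha = 0.1,
    most weight falls on the recall term, so recall matters more."""
    precision = len(A & P) / len(A)   # predicted links that are possible
    recall = len(A & S) / len(S)      # sure links that were predicted
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)
```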

17 Results – Short Story. Alignment accuracy: the nondeficient IBM-3 is clearly better than the deficient one; for the IBM-4, all variants are on the same level, with no winners; the IBM-5 beats everything. Also tried: phrase-based translation (Moses Experiment Management System), training run in both directions with grow-diag-final-and symmetrization, tuning (MERT) on 750 sentence pairs, run for all models and variants; the various BLEU scores offer no conclusions.

18 Results – Long Story (1)

19 Results – Long Story (2)

20 Results – Long Story (3)

21 Results – Long Story (4)

22 Related Work on Word Alignment. Models: Brown et al. 1993; Vogel, Ney & Tillmann 1996; Wang & Waibel 1998; Melamed 2000; Marcu & Wong 2002; Deng & Byrne 2005; Fraser & Marcu 2007; Mauser et al. 2009. Algorithms: Al-Onaizan et al. 1999; Matusov, Zens & Ney 2004; Taskar et al. 2005; Udupa & Maji 2005; Lacoste-Julien et al. 2006; Cromières & Kurohashi 2009. Regularity terms: Liang, Taskar & Klein 2006; Graça, Ganchev & Taskar 2010; Bansal, Quirk & Moore 2011; Vaswani et al. 2012.

23 Conclusion. Contributed: nondeficient variants of IBM-3 and IBM-4; maximum likelihood training based on EM, with E-steps solved by hillclimbing and M-steps solved by projected gradient descent. Findings: nondeficiency meets an important goal of probabilistic modeling (theoretical value); improvements for the IBM-3 (F-measure on gold alignments); otherwise no improvements (IBM-4, BLEU scores); the IBM-5 beats everything (F-measures on gold alignments).

24 Thank you! Questions? Source code and gold alignments online: https://github.com/Thomas1205/RegAligner http://user.phil-fak.uni-duesseldorf.de/~tosch/

25 The Bounding Function. Need to iteratively solve problems of the form (ignoring a constant) $\min_\theta -\sum_{a} w_a \log p_\theta(f, a \mid e)$: the sum has exponentially many terms; the weights $w_a$ are computed from the previous parameters; $p_\theta(f, a \mid e)$ is a product of factors, so its logarithm is a sum of logarithms, with each factor evaluated at the parameters currently being optimized.

