Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment
Thomas Schoenemann, University of Düsseldorf, Germany
ACL 2013, Sofia, Bulgaria

Outline
1. Word Alignment
2. Fertility-based models - IBM-3 and IBM-4 specifically
3. Removing Deficiency
4. Maximum Likelihood Training - expectation maximization (EM)

Word Alignment
Given: a bilingual sentence pair (example shown on the slide).
Task: find out which words correspond to each other.
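A hypothetical illustration (the sentence pair and links below are invented, not the slide's example) of how such an alignment can be represented:

```python
# Hypothetical example (invented, not the slide's sentence pair):
source = ["das", "Haus", "ist", "klein"]   # German source sentence
target = ["the", "house", "is", "small"]   # English target sentence

# One possible alignment: target position j -> source position a_j
# (a value of 0 would denote the "empty word", i.e. an unaligned target word).
alignment = {1: 1, 2: 2, 3: 3, 4: 4}

for j, i in alignment.items():
    print(f"{target[j - 1]} <-> {source[i - 1]}")
```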

Considered Approach
Overall strategy:
1. Design a probabilistic model (or take an existing one) for translation and word alignment, with a (manageable) set of base probabilities
2. Learn the base probabilities from a set of training data (sentence pairs without alignments)
3. To annotate a given sentence pair: compute the most likely alignment
Approach for this talk:
- probabilistic approach
- data driven
- unsupervised: no alignments given
- based on (Brown et al. 1993)

Considered Models
Alignment and Translation Model: a conditional model for a target sentence given a source sentence, where the alignments are hidden variables (the slide's formulas are not in the transcript; see the sketch below).
Considered Alignments:
- each target word corresponds to at most one source word (mainly for computational reasons)
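A sketch in the standard notation of Brown et al. (1993), which the talk builds on; the symbols f, e, a are assumed here, not taken from the slide:

```latex
% Assumed notation (after Brown et al. 1993): source sentence e_1^I,
% target sentence f_1^J, hidden alignment a_1^J with a_j in {0, 1, ..., I},
% where a_j = 0 marks an unaligned target word (the "empty word"):
\[
  p(f_1^J \mid e_1^I) \;=\; \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I)
\]
```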

Generative Process for Fertility Based Models
Given a source sentence:
1. For each source word: decide on the number of target words aligned to it (its fertility)
2. For each source word: for each k up to its fertility, decide on the kth target word aligned to it
3. For each source word: for each of its target words, decide on the position of the target word
   - distortion model
   - IBM-3/4/5 differ here
   - source of deficiency
Then decide on the number of unaligned target words; the remaining positions are filled with these words.
A sketch of the kind of factorization this yields is given below.
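The slide's probability expressions are not in the transcript; a rough sketch of an IBM-3-style factorization, with assumed symbols n, t, d for the base probabilities:

```latex
% Sketch of an IBM-3-style factorization (assumed, after Brown et al. 1993;
% n = fertility, t = translation, d = distortion probabilities;
% combinatorial factors such as \phi_i! are omitted for brevity):
\[
  p(f_1^J, a_1^J \mid e_1^I) \;\approx\;
  p\Bigl(\phi_0 \,\Bigm|\, \sum_{i=1}^{I} \phi_i\Bigr)
  \prod_{i=1}^{I} n(\phi_i \mid e_i)
  \prod_{j=1}^{J} t(f_j \mid e_{a_j})
  \prod_{j:\, a_j \neq 0} d(j \mid a_j, I, J)
\]
```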

IBM-3, Distortion and Deficiency
(Example alignment shown on the slide.)
IBM-3: deficient - we could choose a position j=1 that is already taken.
This work: a nondeficient variant of the IBM-3 (see the sketch below).
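A sketch of the renormalization idea described on the next slide; the notation d and O is assumed, not taken from the slide:

```latex
% Deficient IBM-3 distortion: position j for a target word of source word e_i
% is drawn from d(j | i, I, J), even if j is already taken:
\[
  p_{\mathrm{def}}(j \mid i, I, J) \;=\; d(j \mid i, I, J)
\]
% Nondeficient variant (this work): same base probabilities d, renormalized
% over the set O of positions that are still available:
\[
  p_{\mathrm{nondef}}(j \mid i, O, I, J) \;=\;
  \frac{d(j \mid i, I, J)}{\sum_{j' \in O} d(j' \mid i, I, J)},
  \qquad j \in O
\]
```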

Reflection
This work's nondeficient variant of the IBM-3, compared to the original IBM-3:
- same base probabilities as the original IBM-3
- nondeficiency achieved by renormalization
- relation to parametric models
- need to keep track of all taken positions (just like the IBM-5)

IBM-4
(Example alignment shown on the slide.)
1. Word classes
2. Center position of "die" (the closest previous aligned word)
IBM-4 (deficient) vs. this work: a nondeficient variant of the IBM-4, again via renormalization (see the sketch below).
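A very rough sketch of the ingredients, assuming the usual IBM-4 definitions from Brown et al. (1993); the exact form of the renormalization set is on the slide, not in the transcript:

```latex
% Rough sketch of the (deficient) IBM-4 first-word distortion, assuming the
% usual definitions: the position j of the first word of a cept is modeled
% relative to the center c of the closest previous aligned cept, conditioned
% on word classes A(.) of the source word and B(.) of the target word:
\[
  d_1\bigl(j - c \,\bigm|\, \mathcal{A}(e), \mathcal{B}(f_j)\bigr)
\]
% Nondeficient variant (this work): renormalize over the admissible positions
% O (open positions that also leave room for the words still to be placed,
% cf. the example on the next slide):
\[
  \frac{d_1(j - c \mid \mathcal{A}(e), \mathcal{B}(f_j))}
       {\sum_{j' \in O} d_1(j' - c \mid \mathcal{A}(e), \mathcal{B}(f_j))}
\]
```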

Nondeficient IBM-4
Deficient IBM-4 vs. nondeficient IBM-4 (formulas on the slide).
Example: leave out position 13 because we have to place "neigeuse" afterwards.

Training: Maximum Likelihood
Maximize the likelihood of the training corpus, subject to simplex/probability constraints on the base probabilities:
- nonconcave maximization problem, many local maxima, no global algorithms known
- method of choice: expectation maximization (Dempster et al. '77, Neal & Hinton '98)
- for convenience: take the negative logarithm of the objective
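In symbols (a sketch with assumed notation; the slide's own formula is not in the transcript):

```latex
% Maximum likelihood over the training corpus of sentence pairs (f^{(s)}, e^{(s)}),
% with the base probabilities \theta constrained to probability simplices:
\[
  \max_{\theta}\; \sum_{s} \log \sum_{a} p\bigl(f^{(s)}, a \mid e^{(s)};\, \theta\bigr)
  \qquad \text{s.t. } \theta \ge 0 \text{ and each distribution sums to } 1
\]
% Equivalently (as on the slide), minimize the negative logarithm of this objective.
```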

EM: A Majorize-Minimize Method
(Plot on the slide: the negative log-likelihood function together with a bounding function.)
The illustration assumes there is just one variable to minimize:
- in practice there are several thousand variables, with simplex (a.k.a. probability) constraints
- and for us, the bounding functions will not be convex
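In general terms, a majorize-minimize method works as sketched below (standard formulation, not the slide's plot):

```latex
% Majorize-minimize view: F is the negative log-likelihood, \theta^{(t)} the
% current iterate, and g a bounding function that lies above F and touches it:
\[
  g(\theta \mid \theta^{(t)}) \;\ge\; F(\theta) \quad \forall\, \theta,
  \qquad
  g(\theta^{(t)} \mid \theta^{(t)}) \;=\; F(\theta^{(t)})
\]
% Minimizing the bound gives the next iterate and never increases F:
\[
  \theta^{(t+1)} \;=\; \arg\min_{\theta}\, g(\theta \mid \theta^{(t)})
  \quad\Longrightarrow\quad
  F(\theta^{(t+1)}) \;\le\; F(\theta^{(t)})
\]
```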

EM: Problems for IBM-3 and IBM-4 (cf. Udupa & Maji 2006)
For IBM-3 and IBM-4:
- evaluating the negative log likelihood at a given point is intractable (deficient + nondeficient)
- bounding function known only up to weights (= expectations)
  - computing the weights is intractable (deficient + nondeficient)
  - there are exponentially many weights (nondeficient)
  ⇒ approximations
- approximated bounding functions are non-convex (nondeficient)
  ⇒ local minimization with projected gradient descent, 250 iterations (e.g. Bertsekas 1999)
A minimal sketch of such a projected gradient step is given below.
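As an illustration of this kind of inner solver (a minimal sketch, not the paper's implementation; the step size and the toy objective are arbitrary placeholders), projected gradient descent over one probability distribution:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}
    (standard sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def projected_gradient_descent(grad, x0, step=0.05, iters=250):
    """Locally minimize a (possibly non-convex) function over the probability
    simplex, given its gradient; the slides mention 250 iterations."""
    x = project_to_simplex(np.asarray(x0, dtype=float))
    for _ in range(iters):
        x = project_to_simplex(x - step * grad(x))
    return x

# Toy usage: minimize a quadratic over the simplex (placeholder objective).
target_point = np.array([0.7, 0.2, 0.1])
grad = lambda x: 2.0 * (x - target_point)
print(projected_gradient_descent(grad, np.ones(3) / 3.0))
```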

E-Step: Computing the Weights of the Bounding Function
Minimizing the bounding function decomposes into smaller problems (one per probability distribution). Here: consider only distortion for the IBM-3.
For each sentence:
- for the deficient variant: need the expectation of i aligning to j
- for the nondeficient variant: need the expectation of i aligning to j when choosing from a set of open positions (exponentially many such sets)
In both cases: a hillclimbing procedure as in (Brown et al. 1993)
- gives a likely alignment and its neighbors ⇒ approximate expectations
- incremental for the deficient variant (fast)
- not / only partially incremental for the nondeficient variant (slow)
This method is also used to compute alignments; a minimal sketch follows below.
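A minimal sketch of such a hillclimbing procedure with move and swap operations in the spirit of Brown et al. (1993); the scoring function and the starting alignment are placeholders the caller must supply:

```python
def hillclimb(score, alignment, I):
    """Greedy hillclimbing over alignments via move and swap operations,
    in the spirit of Brown et al. (1993).

    alignment[j] is the source position (0..I, 0 = empty word) aligned to
    target position j; score(alignment) returns the model probability of the
    alignment (higher is better). Both are placeholders the caller supplies.
    """
    best, best_score = list(alignment), score(alignment)
    J = len(best)
    improved = True
    while improved:
        improved = False
        # Move operations: re-align a single target position.
        for j in range(J):
            for i in range(I + 1):
                if i == best[j]:
                    continue
                cand = list(best)
                cand[j] = i
                s = score(cand)
                if s > best_score:
                    best, best_score, improved = cand, s, True
        # Swap operations: exchange the alignments of two target positions.
        for j1 in range(J):
            for j2 in range(j1 + 1, J):
                if best[j1] == best[j2]:
                    continue
                cand = list(best)
                cand[j1], cand[j2] = cand[j2], cand[j1]
                s = score(cand)
                if s > best_score:
                    best, best_score, improved = cand, s, True
    return best, best_score
```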

Briefly Mentioned
More contributions in the paper:
- for the IBM-3: a pooled distortion model
- reduced deficiency for the IBM-4: words can no longer be placed outside of the sentence (but still on top of one another)
- in both cases: parametric models (based on a form shown on the slide), handled via EM and projected gradient descent

Experimental Setup
- Europarl German ↔ English and Spanish ↔ English
  - gold alignments: my own and from Lambert et al.
- all sentences lowercased
- in deficient mode: deficient empty word model (Och & Ney 2003)
- sentence pairs (leads to 1 day running time, 8GB memory)
Evaluation metric: weighted F-measure (Fraser & Marcu 2007)
- accuracy measure (higher values = better alignments)
- α = 0.1 (recall more important than precision); the usual form of this metric is shown below
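For reference, the weighted F-measure of Fraser & Marcu (2007) combines precision P and recall R as follows (a sketch of that metric, not taken from the slide):

```latex
% Alpha-weighted F-measure (Fraser & Marcu 2007) with precision P and recall R;
% alpha = 0.1 weights recall more heavily than precision:
\[
  F(\alpha) \;=\; \frac{1}{\dfrac{\alpha}{P} + \dfrac{1 - \alpha}{R}}
\]
```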

Results – Short Story
Alignment accuracy:
- the nondeficient IBM-3 is clearly better than the deficient one
- IBM-4: all variants on the same level, no winners
- IBM-5 beats everything
Also tried: phrase-based translation (Moses Experiment Management System)
- training run in both directions, grow-diag-final-and symmetrization
- tuning (MERT) on 750 sentence pairs
- run for all models and variants
- the various BLEU scores offer no conclusions

Results – Long Story (1)

Results – Long Story (2)

Results – Long Story (3)

Results – Long Story (4)

Related work on Word Alignment
Models: Brown et al. 1993; Vogel, Ney, Tillmann 1996; Wang & Waibel 1998; Melamed 2000; Marcu & Wong 2002; Deng & Byrne 2005; Fraser & Marcu 2007; Mauser et al.
Algorithms: Al-Onaizan et al.; Matusov, Zens, Ney 2004; Taskar et al.; Udupa & Maji 2005; Lacoste-Julien et al.; Cromières, Kurohashi 2009
Regularity terms: Liang, Taskar, Klein 2006; Graca, Ganchev, Taskar 2010; Bansal, Quirk, Moore 2011; Vaswani et al. 2012

Conclusion
Contributed:
- nondeficient variants of IBM-3 and IBM-4
- Maximum Likelihood training based on EM
  - E-steps solved by hillclimbing
  - M-steps solved by projected gradient descent
Findings:
- nondeficiency is an important goal of probabilistic modeling (theoretical value)
- improvements for IBM-3 (F-measure on gold alignments)
- otherwise no improvements (IBM-4, BLEU scores)
- IBM-5 beats everything (F-measures on gold alignments)

Thank you! Questions? Source code and gold alignments online:

The Bounding Function
Need to iteratively solve problems of the form (ignoring a constant): a sum with exponentially many terms, whose weights come from the previous parameters, of the logarithm of a product of factors (i.e. a sum of logarithms), where each factor is evaluated with the parameters currently being optimized. A sketch of this bounding function is reconstructed below.
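A sketch of what the annotated formula presumably looks like: the standard EM bounding function for one sentence pair, written with assumed notation:

```latex
% Standard EM bounding function for one sentence pair (up to a constant):
% a sum over exponentially many alignments a, weighted by posteriors computed
% with the previous parameters; the log of the product of factors becomes a
% sum of logarithms, each factor evaluated at the parameters \theta being optimized.
\[
  g(\theta \mid \theta^{\mathrm{old}})
  \;=\;
  -\sum_{a}
  \underbrace{p\bigl(a \mid f, e;\, \theta^{\mathrm{old}}\bigr)}_{\text{weights (expectations)}}
  \,\log
  \underbrace{p\bigl(f, a \mid e;\, \theta\bigr)}_{\text{product of factors}}
\]
```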