LING 575: Seminar on Statistical Machine Translation. Spring 2010, Lecture 2. Kristina Toutanova (MSR & UW). With slides borrowed from Philipp Koehn, Kevin Knight, and Chris Quirk.

Overview
- Centauri/Arcturan puzzle
- Word-level translation models
  - IBM Model 1
  - IBM Model 2
  - HMM Model
  - IBM Model 3
  - IBM Models 4 & 5 (brief overview)
- Word alignment evaluation
  - Definition
  - Measures
  - Symmetrization
- Translation using the noisy channel

Centauri/Arcturan [Knight, 1997] Think how to translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Centauri/Arcturan [Knight, 1997]. Think how to translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp. The available parallel data:
1a. ok-voon ororok sprok. / 1b. at-voon bichat dat.
2a. ok-drubel ok-voon anok plok sprok. / 2b. at-drubel at-voon pippat rrat dat.
3a. erok sprok izok hihok ghirok. / 3b. totat dat arrat vat hilat.
4a. ok-voon anok drok brok jok. / 4b. at-voon krat pippat sat lat.
5a. wiwok farok izok stok. / 5b. totat jjat quat cat.
6a. lalok sprok izok jok stok. / 6b. wat dat krat quat cat.
7a. lalok farok ororok lalok sprok izok enemok. / 7b. wat jjat bichat wat dat vat eneat.
8a. lalok brok anok plok nok. / 8b. iat lat pippat rrat nnat.
9a. wiwok nok izok kantok ok-yurp. / 9b. totat nnat quat oloat at-yurp.
10a. lalok mok nok yorok ghirok clok. / 10b. wat nnat gat mat bat hilat.
11a. lalok nok crrrok hihok yorok zanzanok. / 11b. wat nnat arrat mat zanzanat.
12a. lalok rarok nok izok hihok mok. / 12b. wat nnat forat arrat vat gat.

Centauri/Arcturan [Knight, 1997]. Working through the parallel data yields the translation: farok crrrok hihok yorok clok kantok ok-yurp becomes jjat arrat mat bat oloat at-yurp.

It was really Spanish/English:
1a. Garcia and associates. / 1b. Garcia y asociados.
2a. Carlos Garcia has three associates. / 2b. Carlos Garcia tiene tres asociados.
3a. his associates are not strong. / 3b. sus asociados no son fuertes.
4a. Garcia has a company also. / 4b. Garcia tambien tiene una empresa.
5a. its clients are angry. / 5b. sus clientes estan enfadados.
6a. the associates are also angry. / 6b. los asociados tambien estan enfadados.
7a. the clients and the associates are enemies. / 7b. los clientes y los asociados son enemigos.
8a. the company has three groups. / 8b. la empresa tiene tres grupos.
9a. its groups are in Europe. / 9b. sus grupos estan en Europa.
10a. the modern groups sell strong pharmaceuticals. / 10b. los grupos modernos venden medicinas fuertes.
11a. the groups do not sell zenzanine. / 11b. los grupos no venden zanzanina.
12a. the small groups are not modern. / 12b. los grupos pequenos no son modernos.
Translate: Clients do not sell pharmaceuticals in Europe.

Principles applied
- Derive word-level correspondences between sentences
- Prefer one-to-one translation
- Prefer consistent translation (a small number of senses)
- Prefer monotone translation
- Words can be dropped
- Look at target sentences to estimate fluency

Word-based translation models

Word-level translation models
- The IBM word translation models assign a probability to a target sentence e given a source sentence f, using word-level correspondences
- We will discuss the following models:
  - IBM Model 1
  - IBM Model 2
  - HMM Model (not an IBM model, but related)
  - IBM Model 3
  - IBM Models 4 & 5 (only briefly)

Alignment in the IBM models: one source word for each target word. For every target word token e at position j there exists a unique source word token f at position i such that f is the translation of e.

Alignment function
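As a reference, here is the alignment-function definition this slide presents, written in the standard notation used in Koehn's treatment of the IBM models (l_e and l_f denote the target and source sentence lengths):

\[ a : \{1, \dots, l_e\} \rightarrow \{0, 1, \dots, l_f\}, \qquad a(j) = i \]

meaning that the target word e_j at position j is aligned to the source word f_i at position i, with i = 0 reserved for the special NULL token introduced below.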

Words may be reordered: the alignment does not need to be monotone; there can be crossing correspondences.

One-to-many translation: a source word may correspond to more than one target word.

Deleting words: not all words from the source need to have a corresponding target word (some source words are deleted). For example, the German article das may be dropped.

Inserting words: some target words may not correspond to any word in the source. A special NULL token is introduced at source position 0; it is aligned to all inserted target words.

Disadvantage of the alignment function
- The IBM models and the HMM use this definition of translation correspondence
- Problem: it cannot represent one target word token corresponding to multiple source word tokens
- E.g., with a German target and an English source, the single target word klitzeklein corresponds to the two source words "very small" (very small house, klitzeklein Haus)
- A more general alignment: each target word token corresponds to a set of source word tokens

IBM Model 1

Generative process for IBM-1 (target length l_e = 4; the figure's example target is "the house is small"): select a(1), a(2), a(3), a(4), each with probability a(i | l_f = 4) = 0.2, i.e. uniformly over the four source words plus NULL; each target word e_j is then generated from t(e_j | f_a(j)).

IBM Model 1: target words depend only on their corresponding source words, not on any other source or target words. The lexical translation probabilities t(e|f) are the only parameters of the model.

Example

IBM Model 1 translation probability: obtained by summing over all alignments, using the law of total probability.
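The slide's equation, reconstructed in the standard IBM Model 1 form (with epsilon a normalization constant for the target length):

\[ p(e \mid f) = \sum_a p(e, a \mid f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j \mid f_i) \]

Each target word e_j is generated from its aligned source word with lexical probability t(e_j | f_a(j)), and the alignment of each target position is uniform over the l_f source words plus NULL; summing over alignments gives the right-hand side.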

How to estimate parameters: if we observed parallel sentences with word alignments, we could estimate the lexical probabilities by relative frequency. This is maximum likelihood estimation for multinomials (remember the homework assignments from 570).
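In symbols, the relative-frequency estimate from aligned data would be

\[ t(e \mid f) = \frac{\mathrm{count}(e, f)}{\sum_{e'} \mathrm{count}(e', f)} \]

where count(e, f) is the number of times source word f is aligned to target word e in the corpus.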

Estimating parameters with incomplete data
- We don't have parallel sentences with word alignments
- The alignments are hidden (the data is incomplete)
- We can still estimate the model parameters by maximum likelihood
- Not as straightforward as counting and normalizing, but not too bad
- EM algorithm: a simple, intuitive method to maximize likelihood
- Other general non-linear optimization algorithms also apply (projected gradient, L-BFGS, etc.)

EM algorithm
- Incomplete data:
  - If we had complete data, we could estimate the model parameters
  - If we had the model parameters, we could compute the probabilities of the missing data (hidden variables)
- Expectation Maximization (EM) in a nutshell:
  - Initialize the model parameters (e.g. uniformly, or so as to break symmetries)
  - Assign probabilities to the missing data
  - Estimate new parameters given the completed data
  - Iterate until convergence

EM Example

Convergence after several iterations

EM for IBM Model 1

EM for IBM-1 example. For simplicity we ignore the NULL word in the source, and we also ignore a constant factor (independent of a) for each alignment.

EM for IBM Model 1

Doesn’t look easy to sum up: exponentially many things to add!

EM for IBM Model 1

Re-arranging the sum: due to the strong independence assumptions, we can sum over the alignments efficiently.
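The rearrangement in question, written out: because each target word depends only on its own alignment, the sum over all alignments factorizes into a product of small sums,

\[ \sum_a \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)}) = \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j \mid f_i) \]

which reduces the cost from (l_f + 1)^{l_e} terms to l_e sums of l_f + 1 terms each.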

Collecting counts for the M-step: here is our final expression for the probability of an alignment given a sentence pair.
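That expression, reconstructed from the standard Model 1 derivation:

\[ p(a \mid e, f) = \frac{p(e, a \mid f)}{p(e \mid f)} = \prod_{j=1}^{l_e} \frac{t(e_j \mid f_{a(j)})}{\sum_{i=0}^{l_f} t(e_j \mid f_i)} \]

so the posterior over alignments also factorizes, with one factor per target position.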

Collecting counts for the M-step: the expected count for word f translating to word e, given sentence pair (e, f), can be computed efficiently using a similar rearrangement.
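The expected-count expression from the slide (the standard IBM Model 1 E-step; delta is the Kronecker delta):

\[ c(e \mid f; \mathbf{e}, \mathbf{f}) = \frac{t(e \mid f)}{\sum_{i=0}^{l_f} t(e \mid f_i)} \; \sum_{j=1}^{l_e} \delta(e, e_j) \; \sum_{i=0}^{l_f} \delta(f, f_i) \]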

M-step for IBM Model 1: after collecting counts from all sentence pairs, we add them up and re-normalize to get new lexical translation probabilities:
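The re-normalization the colon points to, in symbols (summing the expected counts over all sentence pairs in the corpus):

\[ t(e \mid f) = \frac{\sum_{(\mathbf{e}, \mathbf{f})} c(e \mid f; \mathbf{e}, \mathbf{f})}{\sum_{e'} \sum_{(\mathbf{e}, \mathbf{f})} c(e' \mid f; \mathbf{e}, \mathbf{f})} \]

Putting the E-step and M-step together, here is a minimal Python sketch of EM training for IBM Model 1. It assumes a small tokenized corpus and ignores the NULL word, as in the earlier worked example; function and variable names are illustrative, not from the lecture.

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """EM training of IBM Model 1 lexical probabilities t(e|f).

    corpus: list of (source_words, target_words) pairs; the NULL source
    word is omitted for simplicity, as in the lecture's worked example."""
    # Initialize t(e|f) uniformly over the target vocabulary
    target_vocab = {e for _, es in corpus for e in es}
    t = defaultdict(lambda: 1.0 / len(target_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(e|f)
        total = defaultdict(float)   # expected counts summed over e, per f
        # E-step: the posterior factorizes, so counts are cheap to collect
        for fs, es in corpus:
            for e in es:
                denom = sum(t[(e, f)] for f in fs)   # sum_i t(e | f_i)
                for f in fs:
                    c = t[(e, f)] / denom
                    count[(e, f)] += c
                    total[f] += c
        # M-step: re-normalize the expected counts into new probabilities
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

# Toy usage: "das" should end up translating to "the"
corpus = [(["das", "Haus"], ["the", "house"]),
          (["das", "Buch"], ["the", "book"])]
t = train_ibm1(corpus)
print(t[("the", "das")])   # rises toward 1.0 as iterations increase
```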

IBM Model 2

Generative process for IBM-2 (target length l_e = 4, same example as before): select a(1) with probability a(i | 1, 4, 4), a(2) with a(i | 2, 4, 4), a(3) with a(i | 3, 4, 4), a(4) with a(i | 4, 4, 4); the alignment distribution now depends on the target position j and on both sentence lengths. Each target word is then generated from t(e_j | f_a(j)).
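In standard notation, IBM Model 2 keeps the lexical translation probabilities but replaces the uniform alignment choice with a learned distribution over source positions, conditioned on the target position and the two sentence lengths:

\[ p(e, a \mid f) = \epsilon \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)}) \; a(a(j) \mid j, l_e, l_f) \]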

Parameter estimation for IBM Model 2
- Very similar to IBM Model 1: the model factors in the same way
- The only difference is that instead of uniform alignment probabilities, we use learned position-dependent probabilities (sometimes called distortion probabilities)
- Collect expected counts for the lexical translation and distortion probabilities for the M-step

HMM Model

Generative process for the HMM (target length l_e = 4, same example): select a(1) with d(i | -1, 4), a(2) with d(i | 1, 4), a(3) with d(i | 2, 4), a(4) with d(i | 3, 4), i.e. using d(i | a(j-1), l_f), so each alignment choice is conditioned on the previous alignment position. Each target word is then generated from t(e_j | f_a(j)).

HMM alignment model
- A Hidden Markov Model like the ones used for POS tagging, with some differences
- The state space is the set of integers from 0 to the source sentence length
- It is globally conditioned on the source sentence
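The corresponding factorization for the HMM alignment model (Vogel et al. 1996), in which the alignment of each target word depends on the alignment of the previous target word:

\[ p(e, a \mid f) = \prod_{j=1}^{l_e} p(a(j) \mid a(j-1), l_f) \; t(e_j \mid f_{a(j)}) \]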

Parameter Estimation for HMM model

Local optima in IBM-2 and the HMM
- These models have multiple local optima in the general case
- Good starting points are important for local search algorithms
- Initialize the parameters of IBM-2 using a trained IBM-1 model
- Initialize the HMM from IBM-1 or IBM-2
- Such initialization schemes can have a large impact on performance (some results later)
- See Och & Ney 03 [from the optional word translation models readings] for more details

IBM Model 3
- Motivation:
  - In IBM Models 1 and 2, the alignments of all target words are independent
  - In the HMM, the alignment of a target word depends only on the alignment of the previous target word
  - This can lead to situations where one source word is aligned to a large number of target words, because the model does not remember how many target words have already been aligned to a source word
  - These models cannot encode a preference for one-to-one alignment
- IBM Model 3 adds the capability to keep track of the fertility of source words
  - Fertility counts how many target words a source word generates

IBM Model 3 generative process, illustrated with the example "Mary did not slap the green witch" translated as "Maria no daba una bofetada a la bruja verde" (with a NULL source token): for each target word placeholder, generate a target word given the aligned source word using t(e|f).

IBM-3 probability: there are multiple ways to generate a sentence e and alignment a given the source sentence f, due to words with fertility greater than 1 and the unobserved source of inserted words. For example, "slap" with fertility 3, together with the NULL word, can yield "daba una bofetada a" through several different generation orders.

IBM Model 3 probability  Sum up all ways to generate a target and alignment

Dependencies among hidden variables

IBM Model 4 & 5  Distortion model in IBM 3 is absolute  Target position j depends only on corresponding source position i  IBM 4 adds a relative distortion model, capturing the intuition that words move in groups (the placement of target words aligned to i depends on the placement of target words aligned to i-1).  IBM 3 and IBM 4 are deficient  Words in the target could get placed on top of each other with non- zero probability so some mass is lost  IBM model 5 addresses the deficiency

IBM Model 4 Example

Word alignment evaluation

Evaluating IBM models
- We can use them for translation
- But we can also evaluate their performance on the derived word-to-word correspondences
- We will use this evaluation method to compare models
- It requires manually defined (gold-standard) word alignments
- It also requires a measure to compare the model's output to the gold standard

Evaluation of word alignment
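The measures used for this evaluation, as defined by Och & Ney 03: with S the sure gold links, P the possible gold links (S is a subset of P), and A the predicted alignment,

\[ \mathrm{precision} = \frac{|A \cap P|}{|A|}, \qquad \mathrm{recall} = \frac{|A \cap S|}{|S|}, \qquad \mathrm{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|} \]

Lower AER is better.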

Word alignment standards: gold alignments can be many-to-many; one source word may link to several target words, and one target word to several source words.

Symmetrizing word alignments: because of the asymmetric nature of these models, performance can be improved by running them in both directions and combining the alignments.

Symmetrizing word alignments: we can also use the union, or a selective union built with a growing heuristic.
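A minimal sketch of the basic symmetrization operations, assuming each directional alignment is given as a set of (source_index, target_index) pairs; the grow-diag-final style heuristics used in practice add neighboring points from the union and are omitted here. The function name is illustrative, not from the lecture.

```python
def symmetrize(src2tgt, tgt2src):
    """Combine two directional word alignments.

    src2tgt: links from the source-to-target run, as (i, j) pairs
    tgt2src: links from the target-to-source run, also as (i, j) pairs
    Returns (intersection, union): high-precision and high-recall sets."""
    a1, a2 = set(src2tgt), set(tgt2src)
    return a1 & a2, a1 | a2

# Toy usage: two runs that disagree on one link
inter, union = symmetrize({(0, 0), (1, 2)}, {(0, 0), (1, 1)})
print(inter)   # {(0, 0)}                 kept by both directions
print(union)   # {(0, 0), (1, 1), (1, 2)} everything either run proposed
```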

Comparison of models on alignment: summary of model characteristics (from Och & Ney 03).

Comparison of models on alignment: AER of the models as a function of training-set size (0.5K, 2K, 8K, and 34K sentence pairs), from Och & Ney 03.

Effect of symmetrization: performance of the models (from Och & Ney 03). Other improvements by Och & Ney: smoothing is very important, and adding a dictionary can help (see the paper for more details).

Translation with word-based models

Using word-based models for translation
- We can use the word-based model directly
- It is more accurate to use a noisy-channel model (written out below)
- The noisy channel lets us incorporate a target language model to improve fluency
- The target language model can be trained on monolingual data, of which we usually have much more
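The noisy-channel decomposition referred to above: by Bayes' rule, the best translation e of a source sentence f combines a translation model p(f | e) with a target language model p(e),

\[ \hat{e} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e) \, p(e) \]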

Using word-based models for translation
- We have introduced a set of models that can be used to score candidate translations for a given source sentence
- We have not yet talked about how to find the best possible translation
- We will discuss that when we talk about decoding in phrase-based models
- In brief, decoding is very hard even for IBM-1

Summary
- Introduced word-based translation models
  - The concept of alignment
  - IBM Model 1 (uniform alignment)
  - IBM Model 2 (absolute distortion model)
  - HMM Model (relative distortion model)
  - IBM Model 3 (fertility and absolute distortion)
  - IBM Model 4 (fertility and relative distortion)
  - IBM Model 5 (like IBM-4, but fixes its deficiency)
- Parameter estimation for word-based translation models
  - Exact if we have strong independence assumptions for the hidden variables
  - Approximate for models with fertility
  - Use simpler models to initialize more complex ones and to find good alignments
- Translation using a word-based model
  - The noisy-channel model allows the incorporation of a language model

Assignments and next time
- HW1 will be posted online tomorrow, April 7
- It will be due at midnight PST on April 21
- Next time:
  - A brief overview of other word-alignment models (for paper presentation ideas)
  - Phrase translation models
- Read Chapter 5
- Finish reading Chapter 4