1 Spring 2010 Lecture 2 Kristina Toutanova MSR & UW With slides borrowed from Philipp Koehn, Kevin Knight, Chris Quirk LING 575: Seminar on statistical machine translation

2 Overview  Centauri/Arcturan puzzle  Word level translation models  IBM Model 1  IBM Model 2  HMM Model  IBM Model 3  IBM Model 4 & 5 (brief overview)  Word alignment evaluation  Definition  Measures  Symmetrization  Translation using noisy channel

3 Centauri/Arcturan [Knight, 1997] Think how to translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

4 1a. ok-voon ororok sprok. 1b. at-voon bichat dat. 7a. lalok farok ororok lalok sprok izok enemok. 7b. wat jjat bichat wat dat vat eneat. 2a. ok-drubel ok-voon anok plok sprok. 2b. at-drubel at-voon pippat rrat dat. 8a. lalok brok anok plok nok. 8b. iat lat pippat rrat nnat. 3a. erok sprok izok hihok ghirok. 3b. totat dat arrat vat hilat. 9a. wiwok nok izok kantok ok-yurp. 9b. totat nnat quat oloat at-yurp. 4a. ok-voon anok drok brok jok. 4b. at-voon krat pippat sat lat. 10a. lalok mok nok yorok ghirok clok. 10b. wat nnat gat mat bat hilat. 5a. wiwok farok izok stok. 5b. totat jjat quat cat. 11a. lalok nok crrrok hihok yorok zanzanok. 11b. wat nnat arrat mat zanzanat. 6a. lalok sprok izok jok stok. 6b. wat dat krat quat cat. 12a. lalok rarok nok izok hihok mok. 12b. wat nnat forat arrat vat gat. Centauri/Arcturan [Knight, 1997] Think how to translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

11 1a. ok-voon ororok sprok. 1b. at-voon bichat dat. 7a. lalok farok ororok lalok sprok izok enemok. 7b. wat jjat bichat wat dat vat eneat. 2a. ok-drubel ok-voon anok plok sprok. 2b. at-drubel at-voon pippat rrat dat. 8a. lalok brok anok plok nok. 8b. iat lat pippat rrat nnat. 3a. erok sprok izok hihok ghirok. 3b. totat dat arrat vat hilat. 9a. wiwok nok izok kantok ok-yurp. 9b. totat nnat quat oloat at-yurp. 4a. ok-voon anok drok brok jok. 4b. at-voon krat pippat sat lat. 10a. lalok mok nok yorok ghirok clok. 10b. wat nnat gat mat bat hilat. 5a. wiwok farok izok stok. 5b. totat jjat quat cat. 11a. lalok nok crrrok hihok yorok zanzanok. 11b. wat nnat arrat mat zanzanat. 6a. lalok sprok izok jok stok. 6b. wat dat krat quat cat. 12a. lalok rarok nok izok hihok mok. 12b. wat nnat forat arrat vat gat. Centauri/Arcturan [Knight, 1997] Think how to translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp jjat arrat mat bat oloat at-yurp

12 It was really Spanish/English 1a. Garcia and associates. 1b. Garcia y asociados. 7a. the clients and the associates are enemies. 7b. los clientes y los asociados son enemigos. 2a. Carlos Garcia has three associates. 2b. Carlos Garcia tiene tres asociados. 8a. the company has three groups. 8b. la empresa tiene tres grupos. 3a. his associates are not strong. 3b. sus asociados no son fuertes. 9a. its groups are in Europe. 9b. sus grupos estan en Europa. 4a. Garcia has a company also. 4b. Garcia tambien tiene una empresa. 10a. the modern groups sell strong pharmaceuticals. 10b. los grupos modernos venden medicinas fuertes. 5a. its clients are angry. 5b. sus clientes estan enfadados. 11a. the groups do not sell zanzanine. 11b. los grupos no venden zanzanina. 6a. the associates are also angry. 6b. los asociados tambien estan enfadados. 12a. the small groups are not modern. 12b. los grupos pequenos no son modernos. Translate: Clients do not sell pharmaceuticals in Europe.

13 Principles applied  Derive word-level correspondences between sentences  Prefer one-to-one translation  Prefer consistent translation (small number of senses)  Prefer monotone translation  Words can be dropped  Look at target sentences to estimate fluency

14 Word-based translation models

15 Word-level translation models  The IBM word translation models assign a probability to a target sentence e given a source sentence f, using word-level correspondences  We will discuss the following models  IBM Model 1  IBM Model 2  HMM Model (not an IBM model but related)  IBM Model 3  IBM Models 4 & 5 (only briefly)

16 Alignment in IBM Models  1 source word for each target  For every target word token e at position j there exists a unique source word token f at position i such that f is a translation of e

17 Alignment function
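
[The slide body did not survive the transcript; in the standard formulation these models use, an alignment is a function a : {1, ..., l_e} -> {0, ..., l_f} from target positions to source positions, so a(j) = i means target word e_j is generated by source word f_i, with position 0 reserved for the NULL word introduced on a later slide.]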

18 Words may be reordered  The alignment does not need to be monotone: can have crossing correspondences

19 One-to-many translation  A source word may correspond to more than one target word

20 Deleting words  Not all words from the source need to have a corresponding target word (some source words are deleted)  The German article das may be dropped

21 Inserting words  Some target words may not correspond to any word in the source  A special NULL token is introduced at position 0; it is aligned to all target words that are inserted

22 Disadvantage of Alignment Function  The IBM models and HMM use this definition of translation correspondence  Problem:  Cannot represent one target word token corresponding to multiple source word tokens  E.g. German target, English source: "very small house" / "klitzeklein Haus", where the single target word klitzeklein corresponds to both "very" and "small"  More general alignment: each target word token corresponds to a set of source word tokens

23 IBM Model 1 (the slide's example alignment diagrams survive only as stray position indices; the figures are not recoverable from the transcript)

24

25 Generative process for IBM-1: given length le=4, select each of a(1), a(2), a(3), a(4) with uniform probability a(i|4)=0.2. (The accompanying example alignment diagram is not recoverable from the transcript.)

26 IBM Model 1 Target words are dependent only on their corresponding source words, not on any other source or target words Only parameters of model
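
[The formula itself is missing from the transcript; in standard notation the IBM Model 1 probability of a target sentence and alignment is

P(e, a | f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j | f_{a(j)})

where the lexical translation probabilities t(e | f) are the only parameters of the model.]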

27 Example

28 IBM Model 1 translation probability Using law of total probability.
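
[In standard notation, marginalizing out the hidden alignment by the law of total probability gives

P(e | f) = \sum_{a} P(e, a | f) = \sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j | f_{a(j)})

a sum over (l_f + 1)^{l_e} alignments, which a later slide shows how to compute efficiently.]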

29 How to estimate parameters  If we observe parallel sentences with alignments, can estimate lexical probability through relative frequency  This is maximum likelihood estimation for multinomials (remember homework assignments from 570)
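
[For completeness, the relative-frequency estimate the slide refers to has the form

t(e | f) = \frac{\mathrm{count}(f, e)}{\sum_{e'} \mathrm{count}(f, e')}

where count(f, e) is the number of times source word f is aligned to target word e in the hypothetically observed aligned data.]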

30 Estimating parameters with incomplete data  We don’t have parallel sentences with word alignments  Alignments are hidden (the data is incomplete)  We can still estimate the model parameters by maximum likelihood  Not as straightforward as counting and normalizing, but not too bad  EM algorithm: a simple, intuitive method to maximize likelihood  Other general non-linear optimization algorithms also apply (projected gradient, L-BFGS, etc.)

31 EM algorithm  Incomplete data  If we had complete data, we could estimate model parameters  If we had model parameters, we could compute probabilities of missing data (hidden variables)  Expectation Maximization (EM) in a nutshell  Initialize model parameters (e.g. uniform or break symmetries)  Assign probabilities to missing data  Estimate new parameters given completed data  Iterate until convergence

32 EM Example

33

34

35 Convergence after several iterations

36 EM for IBM Model 1

37 EM for IBM 1 example Ignoring the NULL word in the source for simplicity. Also ignoring a constant factor (independent of a) for each alignment.

38 EM for IBM Model 1

39 Doesn’t look easy to sum up: exponentially many things to add!

40 EM for IBM Model 1

41 Re-arranging the sum Due to strong independence assumptions we can sum over the alignments efficiently.
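
[Concretely, the rearrangement exploits the fact that each factor depends on only one alignment variable, so the exponential sum factorizes:

\sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} \prod_{j=1}^{l_e} t(e_j | f_{a(j)}) = \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j | f_i)

reducing the cost from O((l_f + 1)^{l_e}) to O(l_e \cdot l_f) per sentence pair.]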

42 Collecting counts for M-step  Here is our final expression for the probability of an alignment given a sentence pair.
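
[The expression did not survive the transcript; for Model 1 the length and \epsilon terms cancel, leaving

P(a | e, f) = \frac{P(e, a | f)}{P(e | f)} = \prod_{j=1}^{l_e} \frac{t(e_j | f_{a(j)})}{\sum_{i=0}^{l_f} t(e_j | f_i)}]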

43 Collecting counts for M-step  The expected count for word f translating to word e given sentence pair e,f:  Can be efficiently computed as follows, using similar rearranging:
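
[In the usual notation, the expected count of the link (f, e) contributed by one sentence pair (e, f) is

c(e | f; e, f) = \frac{t(e | f)}{\sum_{i=0}^{l_f} t(e | f_i)} \sum_{j=1}^{l_e} \delta(e, e_j) \sum_{i=0}^{l_f} \delta(f, f_i)

where \delta is the Kronecker delta: the two \delta sums count how many positions carry the words e and f.]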

44 M-step for IBM Model 1  After collecting counts from all sentence pairs, we add them up and re-normalize to get new lexical translation probabilities:
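
[Putting the E-step and M-step together, here is a minimal Python sketch of EM training for IBM Model 1. It assumes a corpus of (source_tokens, target_tokens) pairs and, like the example on slide 37, ignores the NULL word; the function and variable names are illustrative, not from the lecture.

from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    # corpus: list of (source_tokens, target_tokens) pairs
    e_vocab = {e for _, es in corpus for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))  # t[(e, f)], uniform init
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(e, f)
        total = defaultdict(float)  # per-source-word normalizers
        # E-step: collect expected counts, using the rearranged sum so
        # each sentence pair costs O(l_e * l_f), not exponential time.
        for fs, es in corpus:
            for e in es:
                z = sum(t[(e, f)] for f in fs)  # sum_i t(e | f_i)
                for f in fs:
                    p = t[(e, f)] / z  # posterior probability of the link
                    count[(e, f)] += p
                    total[f] += p
        # M-step: renormalize expected counts into new probabilities.
        for (e, f) in list(count):
            t[(e, f)] = count[(e, f)] / total[f]
    return t

Unlike IBM-2 and the HMM (see the local-optima slide below), Model 1 training is largely insensitive to initialization, which is why it is used to initialize the richer models.]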

45 IBM Model 2

46

47 Generative process for IBM-2: given length le=4, select a(1) with a(i|1,4,4), a(2) with a(i|2,4,4), a(3) with a(i|3,4,4), a(4) with a(i|4,4,4). (The accompanying example alignment diagram is not recoverable from the transcript.)

48 Parameter estimation for IBM Model 2  Very similar to IBM Model 1: the model factors in the same way  The only difference is that instead of uniform alignment probabilities, we use learned position-dependent probabilities (sometimes called distortion probabilities)  Collect expected counts for lexical translation and distortion probabilities for the M-step
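
[In standard notation, IBM Model 2 replaces the uniform alignment term with learned alignment probabilities:

P(e, a | f) = \epsilon \prod_{j=1}^{l_e} t(e_j | f_{a(j)}) \, a(a(j) | j, l_e, l_f)

matching the a(i|j,4,4) terms in the generative process above.]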

49 HMM Model

50 Generative process for HMM: given length le=4, select a(1) with d(i|-1,4), a(2) with d(i|1,4), a(3) with d(i|2,4), a(4) with d(i|3,4), i.e. each alignment position is chosen conditioned on the previous alignment position. (The alignment diagram for the example "the house is small" and the formula that followed "Using" are not recoverable from the transcript.)

51 HMM alignment model  A Hidden Markov Model, like the ones used for POS tagging, with some differences  The state space is the set of integers from 0 to the source length  It is globally conditioned on the source sentence
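
[In standard notation, the HMM alignment model chains the alignment decisions:

P(e, a | f) = \prod_{j=1}^{l_e} d(a(j) | a(j-1), l_f) \, t(e_j | f_{a(j)})

so only the jump between consecutive alignment positions matters, and the forward-backward algorithm replaces the simple count rearrangement used for Models 1 and 2.]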

52 Parameter Estimation for HMM model

53 Local Optima in IBM-2 and HMM  These models have multiple different local optima in the general case  Good starting points are important for local search algorithms  Initialize parameters of IBM-2 using a trained IBM-1 model  Initialize HMM from IBM-1 or IBM-2  Such initialization schemes can have large impact on performance (some results later)  See Och & Ney 03 [from optional word translation models readings] for more details

54 IBM Model 3  Motivation  For IBM models 1 and 2 the alignments of all target words are independent  For the HMM the alignment of a target word depends only on the alignment of the previous target word  This may lead to situations where one source word is aligned to a large number of target words, because the model does not remember how many target words have already been aligned to a source word  These models cannot encode a preference for one-to-one alignment  IBM Model 3 adds the capability to keep track of the fertility of source words  Fertility counts how many target words a source word generates
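
[Schematically, and omitting the NULL-insertion and combinatorial terms, the Model 3 probability of a target sentence and alignment combines three kinds of parameters:

P(e, a | f) \propto \prod_{i=1}^{l_f} n(\phi_i | f_i) \; \prod_{j=1}^{l_e} t(e_j | f_{a(j)}) \; \prod_{j : a(j) \neq 0} d(j | a(j), l_e, l_f)

where \phi_i is the fertility of source word f_i, i.e. the number of target words aligned to it.]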

55 IBM Model 3 generative process (diagram): source sentence "Mary did not slap the green witch", Spanish target built from the words Maria, no, una, la, verde, bruja, daba, bofetada, a, with a fertility row 1 1 2 1 1 1 1 3 and a NULL word. For each target word placeholder, generate a target word given the aligned source word using t(e|f). (The column layout of the diagram is garbled in the transcript, so the stage-by-stage derivation is not recoverable.)

56 IBM 3 Probability  Multiple ways to generate sentence e and alignment a given source sentence f  Due to words with fertility >1 and unobserved source of inserted words  (The slide's diagram contrasts two generation orders for slap, with fertility 3, producing daba, una, bofetada, plus a NULL-generated a; the word-by-word layout is not recoverable from the transcript.)

57 IBM Model 3 probability  Sum up all ways to generate a target and alignment

58 Dependencies among hidden variables

59 IBM Model 4 & 5  Distortion model in IBM 3 is absolute  Target position j depends only on corresponding source position i  IBM 4 adds a relative distortion model, capturing the intuition that words move in groups (the placement of target words aligned to i depends on the placement of target words aligned to i-1).  IBM 3 and IBM 4 are deficient  Words in the target could get placed on top of each other with non-zero probability, so some probability mass is lost  IBM Model 5 addresses the deficiency

60 IBM Model 4 Example

61 Word alignment evaluation

62 Evaluating IBM Models  Can use them for translation  But can also evaluate their performance on the derived word-to-word correspondences  We will use this evaluation method to compare models  Need manually defined (gold-standard) word alignments  Need a measure to compare the model’s output to the gold standard

63 Evaluation of word alignment
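
[The definitions themselves are missing from the transcript; the standard setup from Och & Ney 03 marks gold links as sure (S) or possible (P, with S a subset of P) and scores a predicted alignment A by the alignment error rate

AER(A; S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}

so lower is better, and AER = 0 when all sure links are found and no predicted link falls outside P.]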

64 Word Alignment Standards Can have many-to-many alignments; one source word to several target words and one target word to several source words.

65 Symmetrizing Word Alignments Because of the asymmetric nature of these models, performance can be improved by running them in both directions and combining the alignments.

66 Symmetrizing Word Alignments  Can also use union or selective union using a growing heuristic
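
[As a concrete illustration, here is a minimal Python sketch of the intersection and union combinations, assuming each directional run produces a set of (source_pos, target_pos) links; the growing refinement heuristics mentioned above are omitted, and all names are illustrative.

def symmetrize(src2tgt, tgt2src, method="intersection"):
    # tgt2src links arrive as (target_pos, source_pos) pairs; flip them
    # into the same orientation before combining.
    flipped = {(i, j) for (j, i) in tgt2src}
    if method == "intersection":  # fewer links, higher precision
        return src2tgt & flipped
    if method == "union":         # more links, higher recall
        return src2tgt | flipped
    raise ValueError("unknown method: " + method)]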

67 Comparison of models on alignment Summary of model characteristics (from Och & Ney 03)

68 Comparison of models on alignment AER of models (from Och & Ney 03) for training corpora of 0.5K, 2K, 8K, and 34K sentence pairs (the table values are not recoverable from the transcript)

69 Effect of Symmetrization Performance of models (from Och & Ney 03) Other improvements by Och & Ney: smoothing very important; adding a dictionary can help (see paper for more details)

70 Translation with word-based models

71 Using word-based models for translation  Can use the word-based model directly  More accurate if we use a noisy-channel model  Can incorporate a target language model to improve fluency  The target language model can be trained on monolingual data which we usually have much more of
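
[In symbols, the noisy-channel decomposition is

\hat{e} = \arg\max_{e} P(e | f) = \arg\max_{e} P(f | e) \, P(e)

where P(f | e) is a word-based translation model trained in the reverse direction and P(e) is the target language model trained on monolingual data.]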

72 Using word-based models for translation  We have introduced a set of models that can be used to score candidate translations for a given source sentence  Haven’t talked about how to find the best possible translation  Will discuss it when we talk about decoding in phrase-based models  In brief, decoding is very hard even for IBM-1

73 Summary  Introduced word-based translation models  The concept of alignment  IBM-Model 1 (uniform alignment)  IBM-Model 2 (absolute distortion model)  HMM Model (relative distortion model)  IBM-Model 3 (fertility and absolute distortion)  IBM-Model 4 (fertility and relative distortion)  IBM-Model 5 (like IBM-4 but fixes deficiency)  Parameter estimation for word-based translation models  Exact if we have strong independence assumptions for the hidden variables  Approximate for models with fertility  Use simpler models to initialize more complex ones and find good alignments  Translation using a word-based model  Noisy channel model allows the incorporation of a language model

74 Assignments and next time  HW1 will be posted online tomorrow April 7  Will be due midnight PST on April 21  Next time  Will give a brief overview of other word-alignment models (for paper presentation ideas)  Will talk about phrase translation models  Read Chapter 5  Finish reading Chapter 4

