Published by Melanie Love. Modified over 2 years ago.

1
Statistical Machine Translation: IBM Models and the Alignment Template System

2
Statistical Machine Translation. Goal: given a foreign sentence f (Maria no dio una bofetada a la bruja verde), find the most likely English translation e (Maria did not slap the green witch).

3
Statistical Machine Translation. The most likely English translation is given by $\hat{e} = \arg\max_{e} P(e \mid f)$, where P(e|f) is the conditional probability of any e given f.

4
Statistical Machine Translation. How to estimate P(e|f)? Noisy channel: decompose P(e|f) into P(f|e) * P(e) / P(f), then estimate P(f|e) and P(e) separately using a parallel corpus. Direct: estimate P(e|f) directly using a parallel corpus (more on this later).
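The noisy-channel decomposition follows from Bayes' rule. As a short worked derivation (the symbol $\hat{e}$ for the selected translation is notation introduced here for illustration), the denominator P(f) can be dropped from the maximization because it does not depend on e:

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} P(e \mid f) \\
        &= \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)} \\
        &= \arg\max_{e} P(f \mid e)\, P(e)
\end{align*}
```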

5
Noisy Channel Model. Translation model P(f|e): how likely is f to be a translation of e? Estimate its parameters from a bilingual corpus. Language model P(e): how likely is e to be an English sentence? Estimate its parameters from a monolingual corpus. Decoder: given f, what is the best translation e?

6
Noisy Channel Model. Generative story: generate e with probability p(e); pass e through the noisy channel; out comes f with probability p(f|e). Translation task: given f, deduce the most likely e that produced f, or: $\hat{e} = \arg\max_{e} p(e)\, p(f \mid e)$.

7
Translation Model. How to model P(f|e)? Learn the parameters of P(f|e) from a bilingual corpus S of n sentence pairs $(e^{(1)}, f^{(1)}), \dots, (e^{(n)}, f^{(n)})$.

8
Translation Model Insufficient data in parallel corpus to estimate P(f|e) at the sentence level (Why?) Decompose process of translating e -> f into small steps whose probabilities can be estimated

9
Translation Model. English sentence e = e_1 … e_l. Foreign sentence f = f_1 … f_m. Alignment A = {a_1 … a_m}, where a_j ∈ {0 … l}. A indicates which English word generates each foreign word (a_j = 0 means f_j is left unaligned).

10
Alignments e: the blue witch f: la bruja azul A = {1,3,2} (intuitively good alignment)

11
Alignments e: the blue witch f: la bruja azul A = {1,1,1} (intuitively bad alignment)

12
Alignments e: the blue witch f: la bruja azul (illegal alignment!)

13
Alignments Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m?

14
Alignments. Question: how many possible alignments are there for a given e and f, where |e| = l and |f| = m? Answer: each foreign word can align with any one of the |e| = l English words, or it can remain unaligned. So each foreign word has (l + 1) choices for an alignment, and there are |f| = m foreign words, giving (l+1)^m alignments for a given e and f.
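As a quick sanity check on this count, the sketch below enumerates alignments by brute force for small l and m. The function name `enumerate_alignments` is illustrative and not from the slides; position 0 is assumed to stand for the unaligned (NULL) case.

```python
from itertools import product

def enumerate_alignments(l, m):
    """Return every alignment (a_1, ..., a_m) with each a_j in {0, ..., l}.

    Index 0 stands for "unaligned"; indices 1..l refer to English words.
    """
    return list(product(range(l + 1), repeat=m))

# e = "the blue witch" (l = 3), f = "la bruja azul" (m = 3)
alignments = enumerate_alignments(3, 3)
print(len(alignments))          # 64, i.e. (3 + 1) ** 3
print((1, 3, 2) in alignments)  # True: the intuitively good alignment is one of them
```

Brute-force enumeration is only practical for very short sentences, since the count grows exponentially in m.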

15
Alignments Question: If all alignments are equally likely, what is the probability of any one alignment, given e?

16
Alignments. Question: if all alignments are equally likely, what is the probability of any one alignment, given e? Answer: P(A|e) = p(|f| = m) * 1/(l+1)^m. If we assume that p(|f| = m) is uniform over all possible values of |f|, then we can let p(|f| = m) = C, so that P(A|e) = C / (l+1)^m.

17
Generative Story e: blue witch f: bruja azul ? How do we get from e to f?

18
IBM Model 1. Model parameters: T(f_j | e_{a_j}) = translation probability of foreign word f_j given the English word e_{a_j} that generated it.

19
IBM Model 1. Generative story, given e: (1) Pick m = |f|, where all lengths m are equally probable. (2) Pick A with probability P(A|e) = 1/(l+1)^m, since all alignments are equally likely given l and m. (3) Pick f_1 … f_m with probability $P(f \mid A, e) = \prod_{j=1}^{m} T(f_j \mid e_{a_j})$, where T(f_j | e_{a_j}) is the translation probability of f_j given the English word it is aligned to.
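The generative story can be turned into a small function that scores one (f, A) pair given e. This is a minimal sketch, assuming a translation table stored as a dictionary keyed by (foreign word, English word); the table values, the function name `model1_joint`, and the `length_prob` argument (the constant C for the length choice) are hypothetical.

```python
def model1_joint(f_words, e_words, alignment, t, length_prob=1.0):
    """P(f, A | e) = C / (l+1)^m * prod_j T(f_j | e_{a_j}) under Model 1.

    alignment[j] is in {0, ..., l}; 0 selects the NULL word.
    t maps (foreign_word, english_word) to a translation probability.
    """
    e_with_null = ["NULL"] + list(e_words)
    l, m = len(e_words), len(f_words)
    prob = length_prob / (l + 1) ** m          # uniform alignment probability
    for f_word, a_j in zip(f_words, alignment):
        prob *= t.get((f_word, e_with_null[a_j]), 0.0)
    return prob

# Hypothetical translation probabilities, for illustration only
t = {("bruja", "witch"): 0.8, ("azul", "blue"): 0.7,
     ("bruja", "blue"): 0.1, ("azul", "witch"): 0.1}

# e = "blue witch", f = "bruja azul", A = {2, 1} as in the example slides
p = model1_joint(["bruja", "azul"], ["blue", "witch"], [2, 1], t)
print(p)  # (1/9) * 0.8 * 0.7
```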

20
IBM Model 1 Example e: blue witch

21
IBM Model 1 Example e: blue witch f: f1 f2 Pick m = |f| = 2

22
IBM Model 1 Example e: blue witch f: f1 f2 Pick A = {2,1} with probability 1/(l+1)^m

23
IBM Model 1 Example e: blue witch f: bruja f2 Pick f1 = bruja with probability t(bruja|witch)

24
IBM Model 1 Example e: blue witch f: bruja azul Pick f2 = azul with probability t(azul|blue)

25
IBM Model 1: Parameter Estimation. How does this generative story help us to estimate P(f|e) from the data? Since the model for P(f|e) contains the parameter T(f_j | e_{a_j}), we first need to estimate T(f_j | e_{a_j}).

26
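IBM Model 1: Parameter Estimation. How to estimate T(f_j | e_{a_j}) from the data? If we had the data and the alignments A, along with P(A|f,e), then we could estimate T using expected counts as follows: $T(f \mid e) = \frac{\sum_{(\mathbf{e},\mathbf{f})} \sum_{A} P(A \mid \mathbf{f}, \mathbf{e})\, \mathrm{count}(f, e; A)}{\sum_{f'} \sum_{(\mathbf{e},\mathbf{f})} \sum_{A} P(A \mid \mathbf{f}, \mathbf{e})\, \mathrm{count}(f', e; A)}$, where the outer sums run over the sentence pairs in the corpus and count(f, e; A) is the number of times A links foreign word f to English word e in that pair.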

27
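IBM Model 1: Parameter Estimation. How to estimate P(A|f,e)? P(A|f,e) = P(A,f|e) / P(f|e). But $P(f \mid e) = \sum_{A} P(A, f \mid e)$, so we need to compute P(A,f|e). This is given by the Model 1 generative story: $P(A, f \mid e) = \frac{C}{(l+1)^m} \prod_{j=1}^{m} T(f_j \mid e_{a_j})$.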

28
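IBM Model 1 Example. e: the blue witch; f: la bruja azul. $P(A \mid f, e) = \frac{P(f, A \mid e)}{P(f \mid e)} = \frac{\prod_{j=1}^{m} T(f_j \mid e_{a_j})}{\sum_{A'} \prod_{j=1}^{m} T(f_j \mid e_{a'_j})}$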
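The denominator P(f|e) is a sum of P(A,f|e) over all (l+1)^m alignments, which looks expensive. A standard rearrangement for Model 1 (stated here as a worked step, not taken from the slides) exploits the fact that each a_j is chosen independently, so the sum of products becomes a product of sums, and the alignment posterior factorizes over foreign positions:

```latex
\begin{align*}
\sum_{A} \prod_{j=1}^{m} T(f_j \mid e_{a_j})
  &= \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} T(f_j \mid e_{a_j})
   = \prod_{j=1}^{m} \sum_{i=0}^{l} T(f_j \mid e_i) \\
P(A \mid f, e)
  &= \prod_{j=1}^{m} \frac{T(f_j \mid e_{a_j})}{\sum_{i=0}^{l} T(f_j \mid e_i)}
\end{align*}
```

Here e_0 denotes the NULL word. The cost drops from (l+1)^m terms to roughly m(l+1).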

29
IBM Model 1: Parameter Estimation. So, in order to estimate P(f|e), we first need to estimate the model parameter T(f_j | e_{a_j}). In order to compute T(f_j | e_{a_j}), we need to estimate P(A|f,e). And in order to compute P(A|f,e), we need to estimate T(f_j | e_{a_j})…

30
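IBM Model 1: Parameter Estimation. Training data is a set of sentence pairs $(e^{(k)}, f^{(k)})$. The log likelihood of the training data given the model parameters is $\sum_{k} \log P(f^{(k)} \mid e^{(k)}) = \sum_{k} \log \sum_{A} P(A, f^{(k)} \mid e^{(k)})$. To maximize the log likelihood of the training data given the model parameters, use EM: hidden variable = alignments A; model parameters = translation probabilities T.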

31
EM: (1) Initialize model parameters T(f|e). (2) Calculate alignment probabilities P(A|f,e) under the current values of T(f|e). (3) Calculate expected counts from the alignment probabilities. (4) Re-estimate T(f|e) from these expected counts. (5) Repeat steps 2 to 4 until the log likelihood of the training data converges to a maximum.
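The EM loop can be sketched for Model 1 on a toy corpus. This is a minimal illustration under stated assumptions: it uses the per-word form of the alignment posterior (valid for Model 1 because each foreign word is aligned independently), runs a fixed number of iterations instead of testing log-likelihood convergence, and the corpus and the name `train_model1` are made up for the example.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of (english_words, foreign_words). Returns t[(f, e)] = T(f|e)."""
    f_vocab = {f for _, fs in corpus for f in fs}
    # Step 1: initialise T(f|e) uniformly over the foreign vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (f, e) links
        total = defaultdict(float)  # expected number of links from e
        for es, fs in corpus:
            es = ["NULL"] + list(es)
            for f in fs:
                # Steps 2-3: posterior that f is aligned to each e, added as
                # a fractional (expected) count.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    posterior = t[(f, e)] / norm
                    count[(f, e)] += posterior
                    total[e] += posterior
        # Step 4: re-estimate T(f|e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Toy corpus, for illustration only
corpus = [(["blue", "witch"], ["bruja", "azul"]),
          (["witch"], ["bruja"]),
          (["blue"], ["azul"])]
t = train_model1(corpus)
print(round(t[("bruja", "witch")], 3), round(t[("azul", "witch")], 3))
```

On this corpus the single-word sentence pairs disambiguate the two-word pair, so T(bruja|witch) ends up larger than T(azul|witch).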

32
IBM Model 2. Model parameters: T(f_j | e_{a_j}) = translation probability of foreign word f_j given the English word e_{a_j} that generated it; d(i|j,l,m) = distortion probability, i.e. the probability that f_j is aligned to e_i, given l and m.

33
IBM Model 3. Model parameters: T(f_j | e_{a_j}) = translation probability of foreign word f_j given the English word e_{a_j} that generated it; r(j|i,l,m) = reverse distortion probability, i.e. the probability of position j for f_j, given its alignment to e_i, l, and m; n(e_i) = fertility of word e_i, i.e. the number of foreign words aligned to e_i; p_1 = probability of generating a foreign word by alignment with the NULL English word.

34
IBM Model 3. Generative story: (1) Choose fertilities for each English word. (2) Insert spurious words according to the probability of being aligned to the NULL English word. (3) Translate English words -> foreign words. (4) Reorder words according to the reverse distortion probabilities.

35
IBM Model 3 Example Consider the following example from [Knight 1999]: Maria did not slap the green witch

36
IBM Model 3 Example. Maria did not slap the green witch -> Maria not slap slap slap the green witch. Choose fertilities: phi(Maria) = 1

37
IBM Model 3 Example. Maria did not slap the green witch -> Maria not slap slap slap the green witch -> Maria not slap slap slap NULL the green witch. Insert spurious words: p(NULL)

38
IBM Model 3 Example. Maria did not slap the green witch -> Maria not slap slap slap the green witch -> Maria not slap slap slap NULL the green witch -> Maria no dio una bofetada a la verde bruja. Translate words: t(verde|green)

39
IBM Model 3 Example. Maria no dio una bofetada a la verde bruja -> Maria no dio una bofetada a la bruja verde. Reorder words.

40
IBM Model 3. For Models 1 and 2, we can compute exact EM updates. For Models 3 and 4, exact EM updates cannot be efficiently computed: use the best alignments from previous iterations to initialize each successive model, and explore only the subspace of potential alignments that lies within the same neighborhood as the initial alignments.

41
IBM Model 4 Model parameters: Same as model 3, except uses more complicated model of reordering (for details, see Brown et al. 1993)

42
Language Model. Given an English sentence e_1, e_2 … e_l: P(e_1, e_2 … e_l) = P(e_1) * P(e_2 | e_1) * … * P(e_l | e_1, e_2 … e_{l-1}). N-gram model: assume e_i depends only on the N-1 previous words, so that P(e_i | e_1, e_2, … e_{i-1}) = P(e_i | e_{i-N+1}, … e_{i-1}).

43
N=2: Bigram Language Model P(Maria did not slap the green witch) = P(Maria|START) * P(did|Maria) * P(not|did) * … P(END|witch)
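The bigram factorization can be illustrated with maximum-likelihood estimates from a tiny corpus. This is a sketch only: the two training sentences are invented, the function names are illustrative, and no smoothing is applied, so any unseen bigram gives the whole sentence probability 0.

```python
from collections import Counter

def train_bigram(sentences):
    """MLE estimates P(word | prev) from tokenised sentences, with START/END markers."""
    bigrams, contexts = Counter(), Counter()
    for words in sentences:
        padded = ["START"] + list(words) + ["END"]
        for prev, word in zip(padded, padded[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    return {pair: n / contexts[pair[0]] for pair, n in bigrams.items()}

def sentence_prob(words, model):
    """P(START w_1 ... w_l END) as a product of bigram probabilities."""
    padded = ["START"] + list(words) + ["END"]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        prob *= model.get((prev, word), 0.0)  # unseen bigram -> 0 (no smoothing)
    return prob

# Invented training corpus, for illustration only
corpus = [["Maria", "did", "not", "slap", "the", "green", "witch"],
          ["Maria", "did", "slap", "the", "witch"]]
model = train_bigram(corpus)
print(sentence_prob(corpus[0], model))  # 0.25 = P(not|did) * P(green|the) = 0.5 * 0.5
```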

44
Word-Based MT. Word = fundamental unit of translation. Weaknesses: no explicit modeling of word context; word-by-word translation may not accurately convey the meaning of a phrase (il ne va pas -> he does not go); IBM models prevent alignment of a foreign word with more than one English word (aller -> to go).

45
Phrase-Based MT. Phrase = basic unit of translation. Strengths: explicit modeling of word context; captures local reorderings and local dependencies.

46
Example Rules. English: he does not go; Foreign: il ne va pas. Rule: ne va pas -> does not go.

47
Alignment Template System [Och and Ney, 2004]. Alignment template: a pair of source and target language phrases, together with the word alignment among words within those phrases. Formally, an alignment template is a triple (F,E,A): F = words on the foreign side; E = words on the English side; A = alignments among words on the foreign and English sides.

48
Estimating P(e|f). Noisy channel: decompose P(e|f) into P(f|e) and P(e), and estimate P(f|e) and P(e) separately. Direct: estimate P(e|f) directly from the training corpus, using a log-linear model.

49
[Koehn 2003] Log-linear Models for MT. Compute the best translation as follows: $\hat{e} = \arg\max_{e} \sum_{i} \lambda_i h_i(e, f)$, where h_i are the feature functions and λ_i are the model parameters. Typical feature functions include: phrase translation probabilities, lexical translation probabilities, language model probability, reordering model, word penalty.

50
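[Och and Ney 2003] Log-linear Models for MT. The noisy channel model is a special case of the log-linear model where h_1 = log(P(f|e)), λ_1 = 1 and h_2 = log(P(e)), λ_2 = 1. Then: $\hat{e} = \arg\max_{e} \left[ \log P(f \mid e) + \log P(e) \right] = \arg\max_{e} P(f \mid e)\, P(e)$.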
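The special case can be checked numerically: with two log-probability features and unit weights, the log-linear argmax picks the same candidate as the noisy-channel product. This is a minimal sketch; the candidate sentences, their probabilities, and the names `loglinear_score` and `best_translation` are hypothetical.

```python
import math

def loglinear_score(features, weights):
    """Weighted feature sum: sum_i lambda_i * h_i(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

def best_translation(candidates, weights):
    """candidates maps each English string to its feature dict; returns the argmax."""
    return max(candidates, key=lambda e: loglinear_score(candidates[e], weights))

# Hypothetical model probabilities for two candidate translations of one f
probs = {
    "Maria did not slap the green witch": {"tm": 0.02, "lm": 0.10},
    "Maria no slap the witch green": {"tm": 0.05, "lm": 0.01},
}
# h1 = log P(f|e), h2 = log P(e), both with weight 1
candidates = {e: {"h1": math.log(p["tm"]), "h2": math.log(p["lm"])}
              for e, p in probs.items()}
weights = {"h1": 1.0, "h2": 1.0}

best = best_translation(candidates, weights)
noisy_channel_best = max(probs, key=lambda e: probs[e]["tm"] * probs[e]["lm"])
print(best == noisy_channel_best)  # True
```

Changing the weights away from 1, or adding further features, is what distinguishes the general log-linear model from the noisy channel.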

51
Alignment Template System: (1) Word-align training corpus. (2) Extract phrase pairs. (3) Assign probabilities to phrase pairs. (4) Train language model. (5) Decode.

52
Word-Align Training Corpus: Run GIZA++ word alignment in the normal direction, from e -> f. [Figure: alignment matrix for "il ne va pas" / "he does not go"]

53
Word-Align Training Corpus: Run GIZA++ word alignment in the inverse direction, from f -> e. [Figure: alignment matrix for "il ne va pas" / "he does not go"]

54
Alignment Symmetrization: Merge the bi-directional alignments using some heuristic between intersection and union. Question: what is the tradeoff in precision/recall using intersection/union? Here, we use union. [Figure: alignment matrix for "il ne va pas" / "he does not go"]
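The two extremes of symmetrization are plain set operations once each directional alignment is stored as a set of links. The sketch below assumes links are (English index, foreign index) pairs; the two directional alignments are hypothetical, since the slide's matrices did not survive extraction. Intersection keeps only links both directions agree on (fewer, more reliable links), while union keeps every link proposed by either direction (more links, more noise).

```python
def symmetrize(e2f, f2e, method="union"):
    """Merge two directional alignments, each a set of (e_index, f_index) links."""
    if method == "intersection":
        return e2f & f2e
    if method == "union":
        return e2f | f2e
    raise ValueError(f"unknown method: {method}")

# Hypothetical alignments for e = "he does not go", f = "il ne va pas" (0-based)
e2f = {(0, 0), (2, 1), (3, 2), (2, 3)}   # he-il, not-ne, go-va, not-pas
f2e = {(0, 0), (2, 1), (3, 2), (1, 3)}   # he-il, not-ne, go-va, does-pas

print(sorted(symmetrize(e2f, f2e, "intersection")))  # [(0, 0), (2, 1), (3, 2)]
print(len(symmetrize(e2f, f2e, "union")))            # 5
```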

55
Alignment Template System: (1) Word-align training corpus. (2) Extract phrase pairs. (3) Assign probabilities to phrase pairs. (4) Train language model. (5) Decode.

56
Extract phrase pairs: Extract all phrase pairs (E,F) consistent with the word alignments, where consistency is defined as follows: (1) each word in the English phrase is aligned only with words in the foreign phrase; (2) each word in the foreign phrase is aligned only with words in the English phrase. Phrase pairs must consist of contiguous words in each language. [Figure: alignment matrix for "il ne va pas" / "he does not go"]
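The consistency test can be written as a brute-force search over all contiguous span pairs. This is a sketch under stated assumptions: the alignment links are hypothetical (chosen to agree with the ne/not and does/pas links mentioned in the slides), the requirement that a pair contain at least one link is an added assumption, and `extract_phrase_pairs` is an illustrative name.

```python
def extract_phrase_pairs(e_words, f_words, links, max_len=4):
    """Return (English phrase, foreign phrase) pairs consistent with links.

    links: set of (e_index, f_index) pairs, 0-based.
    """
    pairs = []
    for e_start in range(len(e_words)):
        for e_end in range(e_start, min(e_start + max_len, len(e_words))):
            for f_start in range(len(f_words)):
                for f_end in range(f_start, min(f_start + max_len, len(f_words))):
                    in_e = lambda i: e_start <= i <= e_end
                    in_f = lambda j: f_start <= j <= f_end
                    # Assumption: require at least one link inside the pair.
                    if not any(in_e(i) and in_f(j) for i, j in links):
                        continue
                    # Consistency: no link may have exactly one end inside.
                    if any(in_e(i) != in_f(j) for i, j in links):
                        continue
                    pairs.append((" ".join(e_words[e_start:e_end + 1]),
                                  " ".join(f_words[f_start:f_end + 1])))
    return pairs

e_words = "he does not go".split()
f_words = "il ne va pas".split()
# Hypothetical union alignment: he-il, does-pas, not-ne, not-pas, go-va
links = {(0, 0), (1, 3), (2, 1), (2, 3), (3, 2)}
for pair in extract_phrase_pairs(e_words, f_words, links):
    print(pair)
```

With these links, ("not", "ne") is rejected because not is also linked to pas, while ("does not go", "ne va pas") is accepted.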

57
Extract phrase pairs: Question: why is the illustrated phrase pair inconsistent with the alignment matrix? [Figure: alignment matrix for "il ne va pas" / "he does not go" with a highlighted phrase pair]

58
Extract phrase pairs: Question: why is the illustrated phrase pair inconsistent with the alignment matrix? Answer: ne is aligned with not, which is outside the phrase pair; also, does is aligned with pas, which is outside the phrase pair. [Figure: alignment matrix for "il ne va pas" / "he does not go" with a highlighted phrase pair]

59
Extract phrase pairs: [Figure: alignment matrix for "il ne va pas" / "he does not go"]

60
Extract phrase pairs: [Figure: alignment matrix for "il ne va pas" / "he does not go"]

61
Extract phrase pairs:

62
Extract phrase pairs:

63
Alignment Template System: (1) Word-align training corpus. (2) Extract phrase pairs. (3) Assign probabilities to phrase pairs. (4) Train language model. (5) Decode.

64
Probability Assignment. Use relative frequency estimation: P(F,E,A|F) = Count(F,E,A) / Count(F)
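Relative frequency estimation amounts to two counters: one over whole templates and one over their foreign sides. A minimal sketch follows, assuming each extracted template is a hashable (F, E, A) triple and that the estimate divides the template count by the count of its foreign side F; the extracted list is invented for illustration.

```python
from collections import Counter

def template_probabilities(templates):
    """Estimate P((F, E, A) | F) = Count(F, E, A) / Count(F).

    templates: list of (F, E, A) triples, where F and E are phrase strings
    and A is a tuple of (e_index, f_index) links within the phrases.
    """
    template_counts = Counter(templates)
    foreign_counts = Counter(F for F, _, _ in templates)
    return {z: n / foreign_counts[z[0]] for z, n in template_counts.items()}

# Invented extraction results: the same foreign phrase seen four times
z1 = ("ne va pas", "does not go", ((0, 2), (1, 0), (1, 2), (2, 1)))
z2 = ("ne va pas", "is not going", ((1, 0), (1, 2), (2, 1)))
extracted = [z1, z1, z1, z2]

probs = template_probabilities(extracted)
print(probs[z1], probs[z2])  # 0.75 0.25
```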

65
Alignment Template System: (1) Word-align training corpus. (2) Extract phrase pairs. (3) Assign probabilities to phrase pairs. (4) Train language model. (5) Decode.

66
Language Model Use N-gram language model P(e), just as for word-based MT

67
Alignment Template System: (1) Word-align training corpus. (2) Extract phrase pairs. (3) Assign probabilities to phrase pairs. (4) Train language model. (5) Decode.

68
Beam search. State space: set of possible partial translation hypotheses. Start state: initial empty translation of the foreign input. Expansion operation: extend an existing English hypothesis one phrase at a time, by translating a phrase in the foreign sentence into English.

69
Decoder Example. Start: f: Maria no dio una bofetada a la bruja verde; e: (empty). Expand the English translation: translate Maria -> Mary or bruja -> witch; mark foreign words as covered; update probabilities.

70
Decoder Example Example from [Koehn 2003]
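The search described in the decoder slides can be sketched as a stack-based beam search. This is a heavily simplified illustration, not the decoder from the cited work: the phrase table and its probabilities are invented, hypotheses are scored only by phrase log-probabilities plus an assumed distance-based distortion penalty, and there is no language model or hypothesis recombination.

```python
import math

def beam_decode(f_words, phrase_table, beam_size=5, distortion_weight=1.0):
    """phrase_table maps a tuple of foreign words to [(english_phrase, prob), ...]."""
    n = len(f_words)
    # stacks[k] holds hypotheses that cover exactly k foreign words.
    # A hypothesis is (log_score, covered_positions, english_words, prev_end).
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, frozenset(), (), 0))
    for k in range(n):
        # Beam pruning: expand only the best-scoring hypotheses in this stack.
        beam = sorted(stacks[k], key=lambda h: h[0], reverse=True)[:beam_size]
        for score, covered, english, prev_end in beam:
            for start in range(n):
                for end in range(start + 1, n + 1):
                    span = frozenset(range(start, end))
                    if span & covered:
                        continue  # part of this span is already translated
                    options = phrase_table.get(tuple(f_words[start:end]), [])
                    for e_phrase, prob in options:
                        # Assumed penalty for jumping away from the last span
                        new_score = (score + math.log(prob)
                                     - distortion_weight * abs(start - prev_end))
                        stacks[k + len(span)].append(
                            (new_score, covered | span,
                             english + tuple(e_phrase.split()), end))
    return max(stacks[n], key=lambda h: h[0]) if stacks[n] else None

# Invented phrase table, for illustration only
phrase_table = {
    ("Maria",): [("Mary", 0.9)],
    ("no",): [("did not", 0.8)],
    ("dio", "una", "bofetada"): [("slap", 0.7)],
    ("a", "la"): [("the", 0.6)],
    ("bruja",): [("witch", 0.5)],
    ("verde",): [("green", 0.5)],
    ("bruja", "verde"): [("green witch", 0.8)],
}
f_words = "Maria no dio una bofetada a la bruja verde".split()
best = beam_decode(f_words, phrase_table)
print(" ".join(best[2]), best[0])
```

The coverage set plays the role of "mark foreign words as covered", and grouping hypotheses by number of covered words keeps the pruning comparison roughly fair.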

71
BLEU MT Evaluation Metric. BLEU measures n-gram precision against a set of k reference English translations: what percentage of n-grams (where n ranges from 1 through 4, typically) in the MT English output are also found in a reference translation? Brevity penalty: penalize English translations with fewer words than the reference translations. Why is this metric so widely used? It correlates surprisingly well with human judgment of machine-generated translations.
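The two ingredients, clipped n-gram precision and the brevity penalty, can be sketched for a single sentence. This is an illustrative simplification: BLEU is normally computed over a whole test corpus, the maximum n-gram order `max_n` is a parameter here, and the function returns 0 whenever any precision is 0 (no smoothing).

```python
import math
from collections import Counter

def ngrams(words, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights over n = 1..max_n."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        matched = sum(min(c, max_ref[gram]) for gram, c in cand.items())
        total = sum(cand.values())
        if matched == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty against the reference length closest to the candidate's.
    c_len = len(candidate)
    r_len = min((len(r) for r in references),
                key=lambda length: (abs(length - c_len), length))
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "Maria did not slap the green witch".split()
print(bleu(reference, [reference]))        # 1.0 for an exact match
print(bleu(reference[:5], [reference]))    # perfect precision, but penalized for brevity
```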

72
References
Brown et al. A Statistical Approach to Machine Translation.
Brown et al. The Mathematics of Statistical Machine Translation.
Collins. Lecture Notes from Fall 2003: Machine Learning Approaches for Natural Language Processing.
Knight. A Statistical MT Workbook.
Knight and Koehn. A Statistical Machine Translation Tutorial.
Koehn, Och and Marcu. A Phrase-Based Statistical Machine Translation System.
Koehn. Pharaoh: A Phrase-Based Decoder.
Och and Ney. The Alignment Template System.
Och and Ney. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation.
