
1 (Statistical) Approaches to Word Alignment
Advanced Machine Translation Seminar
Sanjika Hewavitharana
Language Technologies Institute, Carnegie Mellon University
02/02/2006

2 Word Alignment Models
We want to learn how to translate words and phrases
We can learn this from parallel corpora; typically we work with sentence-aligned corpora (available from LDC, etc.); for specific applications new data collection is required
Model the associations between the two languages:
  Word-to-word mapping -> lexicon
  Differences in word order -> distortion model
  'Wordiness', i.e. how many words are needed to express a concept -> fertility
Statistical translation is based on word alignment models

3 Alignment Example
Observations:
  Often 1-to-1
  Often monotone (word order largely preserved)
  Some 1-to-many
  Some 1-to-nothing
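As a small illustration (the sentence pair and names below are invented, not from the slides), an alignment of this kind can be stored as a mapping from source positions to target positions:

# Hypothetical illustration: an alignment stored as a function from source
# (French) positions j to target (English) positions i; position 0 is the null word.
f_sentence = ["la", "maison", "bleue"]           # source (French)
e_sentence = ["NULL", "the", "blue", "house"]    # target (English), e_0 = NULL

# a[j] = i : source word f_j is aligned to target word e_i
alignment = {1: 1, 2: 3, 3: 2}                   # mostly 1-to-1, not fully monotone

for j, i in alignment.items():
    print(f_sentence[j - 1], "->", e_sentence[i])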

4 Word Alignment Models
IBM1 – lexical probabilities only
IBM2 – lexicon plus absolute positions
IBM3 – plus fertilities
IBM4 – inverted relative position alignment
IBM5 – non-deficient version of Model 4
HMM – lexicon plus relative positions
BiBr – Bilingual Bracketing: lexical probabilities plus reordering via parallel segmentation
Syntactic alignment models
[Brown et al. 1993, Vogel et al. 1996, Och et al. 1999, Wu 1997, Yamada et al. 2003]

5 Notation
Source language:
  f : source (French) word
  J : length of the source sentence
  j : position in the source sentence; j = 1, 2, ..., J
  f_1^J = f_1 ... f_J : source sentence
Target language:
  e : target (English) word
  I : length of the target sentence
  i : position in the target sentence; i = 1, 2, ..., I
  e_1^I = e_1 ... e_I : target sentence

6 SMT – Principle
Translate a 'French' string f_1^J into an 'English' string e_1^I
Bayes' decision rule for translation:
  ê_1^I = argmax_{e_1^I} Pr(e_1^I | f_1^J) = argmax_{e_1^I} Pr(e_1^I) · Pr(f_1^J | e_1^I)
Based on the noisy channel model
We will call f the source and e the target

7 Alignment as Hidden Variable
'Hidden' alignments capture the word-to-word correspondences
Number of connections: J · I (each source word with each target word)
Number of possible alignments: 2^(J·I)
Restricted alignment:
  Each source word has exactly one connection, so the alignment is a function
  a_j = i : source position j is connected to target position i
  Number of alignments is now: I^J
  a_1^J = a_1 ... a_J : whole alignment
Relationship between the translation model and the alignment model:
  Pr(f_1^J | e_1^I) = Σ_{a_1^J} Pr(f_1^J, a_1^J | e_1^I)
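A quick sanity check of the I^J count, as a minimal sketch with invented toy lengths:

# Enumerate all restricted alignments a_1 ... a_J, where each source position j
# maps to exactly one target position i (toy sizes chosen for illustration only).
from itertools import product

I, J = 3, 2                                            # toy target / source lengths
alignments = list(product(range(1, I + 1), repeat=J))
print(len(alignments), "alignments; expected I^J =", I ** J)   # 9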

8 Empty Position (Null Word)
Sometimes a source word has no correspondence in the target sentence
The alignment function aligns each source word to one target word, i.e. it cannot skip a source word
Solution: introduce an empty position 0 holding the null word e_0
'Skip' a source word f_j by aligning it to e_0
The target sentence is extended to: e_0^I = e_0 e_1 ... e_I
The alignment is extended to: a_j ∈ {0, 1, ..., I}

9 Translation Model
Sum over all possible alignments:
  Pr(f_1^J | e_1^I) = Σ_{a_1^J} Pr(f_1^J, a_1^J | e_1^I)
Decomposed into 3 probability distributions:
  Length: Pr(J | e_1^I)
  Alignment: Pr(a_j | f_1^{j-1}, a_1^{j-1}, J, e_1^I)
  Lexicon: Pr(f_j | f_1^{j-1}, a_1^j, J, e_1^I)

10 Model Assumptions
Decompose the interaction into pairwise dependencies:
  Length: the source length depends only on the target length (a very weak model): p(J | I)
  Alignment:
    Zero-order model: the target position depends only on the source position: p(a_j | j, I, J)
    First-order model: the target position depends only on the previous target position: p(a_j | a_{j-1}, I)
  Lexicon: the source word depends only on the aligned target word: p(f_j | e_{a_j})

11 IBM Model 1
Length: the source length depends only on the target length: p(J | I)
Alignment: uniform probability for the position alignment: p(a_j | j, I, J) = 1 / (I + 1)
Lexicon: the source word depends only on the aligned target word: p(f_j | e_{a_j})
Alignment probability:
  Pr(f_1^J, a_1^J | e_1^I) = p(J | I) · ∏_{j=1}^J 1/(I+1) · p(f_j | e_{a_j})

12 IBM Model 1 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
  Step 1: Pick the length J of f_1^J (all lengths are equally probable; p(J | I) = ε is a constant)
  Step 2: Pick an alignment a_1^J with probability 1 / (I + 1)^J
  Step 3: Pick the French words with probability ∏_{j=1}^J p(f_j | e_{a_j})
Final result: Pr(f_1^J, a_1^J | e_1^I) = ε / (I + 1)^J · ∏_{j=1}^J p(f_j | e_{a_j})
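A minimal runnable sketch of this generative story (the lexicon table, word lists, maximum length, and function name are all invented for illustration):

# Sample a French string from an English string following the Model 1 story above.
import random

t = {  # hypothetical lexicon probabilities p(f | e)
    "the":   {"la": 0.7, "le": 0.3},
    "house": {"maison": 0.9, "casa": 0.1},
    "NULL":  {"de": 1.0},
}

def generate_model1(e_words, max_len=4):
    e_ext = ["NULL"] + e_words                               # add the empty word e_0
    J = random.randint(1, max_len)                           # Step 1: pick a length (uniform here)
    a = [random.randrange(len(e_ext)) for _ in range(J)]     # Step 2: uniform alignment
    f = []
    for i in a:                                              # Step 3: pick each French word
        dist = t[e_ext[i]]
        f.append(random.choices(list(dist), weights=dist.values())[0])
    return f, a

print(generate_model1(["the", "house"]))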

13 IBM Model 1 – Training
Parameters of the model: the lexicon probabilities p(f | e)
Training data: parallel sentence pairs
We adjust the parameters so that they maximize the likelihood of the training data
Normalized for each e: Σ_f p(f | e) = 1
The EM algorithm is used for the estimation:
  Initialize the parameters uniformly
  Collect counts for each word pair in the corpus
  Re-estimate the parameters from the counts
  Repeat for several iterations
The model is simple enough to sum over all alignments
The estimated parameters do not depend on the initial values

14 IBM Model 1 Training – Pseudo Code
# Accumulation (over corpus)
For each sentence pair (f, e)
  For each source position j
    Sum = 0.0
    For each target position i
      Sum += p(fj | ei)
    For each target position i
      Count(fj, ei) += p(fj | ei) / Sum
# Re-estimate probabilities (over count table)
For each target word e
  Sum = 0.0
  For each source word f
    Sum += Count(f, e)
  For each source word f
    p(f | e) = Count(f, e) / Sum
# Repeat accumulation and re-estimation for several iterations
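The same loop as a small runnable Python sketch (the two-sentence toy corpus and the function name are invented; this is an illustration, not the original slide code):

# IBM Model 1 EM training over a toy corpus.
from collections import defaultdict

corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"],  ["the", "flower"]),
]

def train_model1(corpus, iterations=10):
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))       # uniform initialization of p(f | e)
    for _ in range(iterations):
        count = defaultdict(float)                    # fractional counts c(f, e)
        total = defaultdict(float)                    # normalizer per target word e
        for fs, es in corpus:
            es = ["NULL"] + es                        # empty word e_0
            for f in fs:                              # accumulation
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        for (f, e) in count:                          # re-estimation
            t[(f, e)] = count[(f, e)] / total[e]
    return t

t = train_model1(corpus)
print(round(t[("la", "the")], 3))                     # p(la | the) after training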

15 IBM Model 2
The only difference from Model 1 is in the alignment probability
Length: the source length depends only on the target length: p(J | I)
Alignment: the target position depends on the source position (in addition to the source and target lengths): a(i | j, I, J)
Model 1 is a special case of Model 2, where a(i | j, I, J) = 1 / (I + 1)
Lexicon: the source word depends only on the aligned target word: p(f_j | e_{a_j})

16 IBM Model 2 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
  Step 1: Pick the length J of f_1^J (all lengths are equally probable; p(J | I) = ε is a constant)
  Step 2: Pick an alignment a_1^J with probability ∏_{j=1}^J a(a_j | j, I, J)
  Step 3: Pick the French words with probability ∏_{j=1}^J p(f_j | e_{a_j})
Final result: Pr(f_1^J, a_1^J | e_1^I) = ε · ∏_{j=1}^J a(a_j | j, I, J) · p(f_j | e_{a_j})

17 IBM Model 2 – Training
Parameters of the model: the lexicon probabilities p(f | e) and the alignment probabilities a(i | j, I, J)
Training data: parallel sentence pairs
We maximize the likelihood w.r.t. the translation and alignment parameters
The EM algorithm is used for the estimation:
  Initialize the alignment parameters uniformly and the translation probabilities from Model 1
  Accumulate counts, re-estimate the parameters
The model is simple enough to sum over all alignments
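In the Model 2 E-step the posterior responsibility of target position i for source position j factors into the alignment and lexicon terms, p(a_j = i | f, e) ∝ a(i | j, I, J) · p(f_j | e_i). A minimal sketch (the tables, key layout, and function name are invented):

# Posterior over target positions for one source position j under Model 2.
def posterior(j, f_j, e_words, t, a, J):
    I = len(e_words) - 1                           # e_words[0] is the null word
    scores = [a.get((i, j, I, J), 1.0 / (I + 1)) * t.get((f_j, e_words[i]), 1e-9)
              for i in range(I + 1)]
    norm = sum(scores)
    return [s / norm for s in scores]              # normalized responsibilities

e_words = ["NULL", "the", "house"]
t = {("maison", "house"): 0.8, ("maison", "the"): 0.1, ("maison", "NULL"): 0.1}
print(posterior(2, "maison", e_words, t, a={}, J=2))   # uniform a -> posterior follows t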

18 Fertility-based Alignment Models
Models 3-5 are based on fertility
Fertility: the number of source words connected to a target word
  φ_i : fertility value of e_i
  n(φ | e_i) : probability that e_i is connected to φ source words
Alignment: defined in the reverse direction (target to source)
  d(j | i, I, J) : probability of French position j given that the English position is i
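As a small illustration (the example words and alignment are invented): given an alignment a_1^J, the fertility of each target word is simply the number of source positions mapped to it.

# Fertilities derived from an alignment a_1 ... a_J.
from collections import Counter

e_words = ["NULL", "Mary", "slap", "the", "witch"]
a = [1, 2, 2, 2, 3, 4]            # a[j-1] = i : source position j aligns to target position i

fertility = Counter(a)            # phi_i = |{ j : a_j = i }|
for i, e in enumerate(e_words):
    print(e, fertility.get(i, 0))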

19 IBM Model 3 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
  Step 1: Choose (I+1) fertilities φ_0, φ_1, ..., φ_I; each real word e_i receives fertility φ_i with probability n(φ_i | e_i), and the fertility φ_0 of the null word is governed by the parameters p_0, p_1

20 IBM Model 3 – Generative Process
Step 2: For each e_i and for k = 1 ... φ_i, choose a French position j in 1 ... J and a French word f_j with probability d(j | i, I, J) · p(f_j | e_i)
For a given alignment there are ∏_i φ_i! orderings

21 IBM Model 3 – Example [Knight 99]
  e0 Mary did not slap the green witch           [English sentence with null word e0]
  Mary not slap slap slap the green witch        [choose fertilities]
  Mary not slap slap slap NULL the green witch   [choose fertility for e0, insert NULL]
  Mary no daba una bofetada a la verde bruja     [choose translations]
  Mary no daba una bofetada a la bruja verde     [choose target positions j]

22 IBM Model 3 – Training
Parameters of the model: lexicon p(f | e), fertility n(φ | e), distortion d(j | i, I, J), and the null-word parameters p_0, p_1
The EM algorithm is used for the estimation:
  It is not possible to compute exact EM updates
  Initialize n, d, p uniformly and the translation probabilities from Model 2
  Accumulate counts, re-estimate the parameters
We cannot efficiently sum over all alignments; only the best (Viterbi) alignment is used
Model 3 is deficient: probability mass is wasted on impossible translations

23 IBM Model 4
Tries to model the re-ordering of phrases
The distortion d(j | i, I, J) is replaced with two sets of parameters:
  One for placing the first word (head) of a group of words
  One for placing the rest of the words relative to the head
Deficient: the alignment can generate source positions outside of the sentence length J
Model 5 removes this deficiency

24 HMM Alignment Model [Vogel 96]
Idea: a relative position model
(Figure: alignment path plotted over target vs. source positions)

25 HMM Alignment
First-order model: the target position depends on the previous target position (captures the movement of entire phrases)
Alignment probability:
  Pr(f_1^J, a_1^J | e_1^I) = ∏_{j=1}^J p(a_j | a_{j-1}, I) · p(f_j | e_{a_j})
The alignment depends only on the relative position (jump width a_j - a_{j-1})
Maximum approximation:
  Pr(f_1^J | e_1^I) ≈ max_{a_1^J} ∏_{j=1}^J p(a_j | a_{j-1}, I) · p(f_j | e_{a_j})
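A minimal sketch of the maximum approximation as a Viterbi search (the lexicon table, jump table, uniform start distribution, and function name are all invented for illustration):

# Best HMM alignment under the score p(a_j | a_{j-1}, I) * p(f_j | e_{a_j}).
import math

def viterbi_alignment(f_words, e_words, t, jump):
    """t[(f, e)]: lexicon probability; jump[d]: relative-jump probability for d = i - i_prev."""
    I, J = len(e_words), len(f_words)
    delta = [[float("-inf")] * I for _ in range(J)]   # delta[j][i]: best log-prob with a_j = i
    back = [[0] * I for _ in range(J)]
    for i in range(I):                                # uniform start distribution (assumption)
        delta[0][i] = math.log(1.0 / I) + math.log(t.get((f_words[0], e_words[i]), 1e-9))
    for j in range(1, J):
        for i in range(I):
            lex = math.log(t.get((f_words[j], e_words[i]), 1e-9))
            prev = max(range(I), key=lambda k: delta[j - 1][k] + math.log(jump.get(i - k, 1e-9)))
            delta[j][i] = delta[j - 1][prev] + math.log(jump.get(i - prev, 1e-9)) + lex
            back[j][i] = prev
    i = max(range(I), key=lambda k: delta[J - 1][k])
    a = [i]
    for j in range(J - 1, 0, -1):                     # backtrace
        i = back[j][i]
        a.append(i)
    return list(reversed(a))

t = {("la", "the"): 0.9, ("maison", "house"): 0.9, ("maison", "the"): 0.05, ("la", "house"): 0.05}
jump = {-1: 0.1, 0: 0.3, 1: 0.5, 2: 0.1}
print(viterbi_alignment(["la", "maison"], ["the", "house"], t, jump))   # -> [0, 1]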

26 IBM2 vs HMM [Vogel 96]

27 Enhancements to HMM & IBM Models
HMM model with empty word: add I empty words to the target side
Model 6:
  IBM 4 predicts the distance between subsequent target positions
  The HMM predicts the distance between subsequent source positions
  Model 6 is a log-linear combination of the IBM 4 and HMM models
Smoothing:
  Alignment probabilities – interpolate with a uniform distribution
  Fertility probabilities – depend on the number of letters in a word
Symmetrization: heuristic postprocessing to combine the alignments trained in both directions
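A minimal sketch of one common symmetrization heuristic, intersection optionally grown toward the union (the two directional alignment sets below are invented toy data):

# Combine directional alignments: intersect, then grow toward the union.
src2tgt = {(1, 1), (2, 3), (3, 2), (4, 2)}    # alignment points from f->e training
tgt2src = {(1, 1), (2, 3), (3, 2), (4, 4)}    # alignment points from e->f training (as (j, i) pairs)

intersection = src2tgt & tgt2src              # high precision
union = src2tgt | tgt2src                     # high recall

sym = set(intersection)
for (j, i) in sorted(union - intersection):   # add union points adjacent to an accepted point
    if any(abs(j - j2) + abs(i - i2) == 1 for (j2, i2) in sym):
        sym.add((j, i))

print(sorted(sym))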

28 Experimental Results [Och & Ney 03]
Refined models perform better:
  Models 4, 5 and 6 are better than Model 1 or the Dice coefficient model
  The HMM is better than IBM 2
Alignment quality depends on the training method and the bootstrapping scheme used:
  IBM 1 -> HMM -> IBM 3 is better than IBM 1 -> IBM 2 -> IBM 3
  Smoothing and symmetrization have a significant effect on alignment quality
  Using more alignments in training yields better results
Using word classes: improvement for large corpora but not for small corpora

29 References
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer (1993). The Mathematics of Statistical Machine Translation. Computational Linguistics, vol. 19, no. 2.
Stephan Vogel, Hermann Ney, Christoph Tillmann (1996). HMM-based Word Alignment in Statistical Translation. COLING 1996, The 16th Int. Conf. on Computational Linguistics, Copenhagen, Denmark, August.
Franz Josef Och, Hermann Ney (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, vol. 29, no. 1.
Kevin Knight (1999). A Statistical MT Tutorial Workbook.

