1
A Phrase-Based, Joint Probability Model for Statistical Machine Translation
Daniel Marcu and William Wong (2002)
Presented by Ping Yu, 01/17/2006
2
Statistical Machine Translation: A Refresher
3
The Noisy Channel
Translate from f to e: the English sentence e passes through a noisy channel (encoder) and comes out as the foreign sentence f; the decoder recovers e' from f.
e' = argmax_e P(e|f) = argmax_e P(e) · P(f|e)
P(e): source model (language model); P(f|e): channel model (translation model)
4
Language Model
Bag translation: treat a sentence as a bag of words
N-gram language model
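For reference (not shown on the slide), the standard n-gram factorization being referred to is

P(e) = P(w_1 \cdots w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})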
5
Translation Model
Alignment: P(f, a | e)
Fertility: depends solely on the English word
  Mary did not slap the green witch
  Mary no daba una bofetada a la verde bruja (Spanish)
  Fertilities: Mary: 1; did: 0; slap: 3; the: 2; green: 1; witch: 1
(Example from Kevin Knight's tutorial)
6
Conditional Probability & Word-Based Statistical MT
Fertility: one-to-one and one-to-many mappings from e to f
Conditional probability: given e, what is the probability of an alignment with f? i.e., p(f, a | e)
Word-based MT: IBM Models 1-5
7
How About Many-to-Many Mappings?
e.g., a three-word phrase "a b c" aligned as a whole to a two-word phrase "x y"
8
Out of sight, out of mind: "Invisible Idiot"
Round-trip output from Systran:
  French: Hors de la vue, hors de l'esprit. → Back to English: Out of the sight, of the spirit.
  German: Aus dem Anblick des Geistes heraus. → Back to English: From the sight of the spirit out.
  Italian: Dalla vista dello spirito fuori. → Back to English: From the sight of the spirit outside.
  Portuguese: Da vista do espírito fora. → Back to English: Of the sight of the spirit it are.
  Spanish: De la vista del alcohol está. → Back to English: Of the Vista of the alcohol it is.
From http://www.discourse.net/archives/2005/06/of_the_vista_of_the_alcohol_it_is.html
9
Lost in Translation
10
Solution: Many-to-Many Mapping
How? Word-based vs. phrase-based
11
Alignment Between Multiple Phrases
"Phrases" here are not really linguistic phrases
Phrases are defined differently in different models
Most extracted phrases are based on word-based alignments
  Och and Ney (1999): alignment template model
  Melamed (2001): non-compositional compounds model
12
Marcu and Wong (2002)
13
Promising Features
Finds phrases and alignments simultaneously for both source and target sentences
Directly models phrase-based probabilities
Does not depend on word-based probabilities
14
Phrase & Concept
Phrase: a sequence of consecutive words
Concept: a pair of aligned phrases
A set of concepts C can be linearized into a sentence pair (E, F) if E and F can be obtained by permuting the phrases e_i and f_i that characterize all concepts c_i ∈ C. This property is denoted by the predicate L(E, F, C). (A minimal sketch of such a check follows.)
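As an illustration (my own sketch, not from the slides), the linearization predicate L(E, F, C) can be checked by asking whether some ordering of the concepts' source phrases concatenates exactly to E and, independently, some ordering of their target phrases concatenates to F. All names below are hypothetical.

def covers(sentence, phrases):
    """True if some permutation of `phrases` concatenates exactly to `sentence`."""
    def backtrack(pos, remaining):
        if pos == len(sentence):
            return not remaining
        for i, ph in enumerate(remaining):
            if sentence[pos:pos + len(ph)] == list(ph):
                if backtrack(pos + len(ph), remaining[:i] + remaining[i + 1:]):
                    return True
        return False
    return backtrack(0, list(phrases))

def linearizable(E, F, concepts):
    """Predicate L(E, F, C): E is covered by the e-phrases and F by the f-phrases."""
    e_phrases = [e for e, _ in concepts]
    f_phrases = [f for _, f in concepts]
    return covers(E, e_phrases) and covers(F, f_phrases)

# Toy example with two concepts
E = "the green witch".split()
F = "la bruja verde".split()
C = [(("the",), ("la",)), (("green", "witch"), ("bruja", "verde"))]
print(linearizable(E, F, C))  # True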
15
Two Models
Model 1:
  – Joint probability distribution
  – Phrases are equivalent translations
16
Model 2
A joint probability model with position-based distortion
Models the probability of the alignment (positions) between two phrases
17
Probability to Generate a Sentence Pair
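The slide's equations were shown as an image; the following is a hedged reconstruction of the two joint models from the paper (notation approximate). Model 1 sums over all concept sets C that linearize into the sentence pair:

p(E, F) = \sum_{C \,:\, L(E, F, C)} \; \prod_{c_i \in C} t(\bar{e}_i, \bar{f}_i)

Model 2 multiplies each concept's contribution by a position-based distortion factor for every word position j of the phrase \bar{f}_i, taken relative to the center of mass of the aligned phrase \bar{e}_i:

p(E, F) = \sum_{C \,:\, L(E, F, C)} \; \prod_{c_i \in C} \Big[ t(\bar{e}_i, \bar{f}_i) \prod_{j} d\big(\mathrm{pos}(f_i^{j}), \mathrm{poscm}(\bar{e}_i)\big) \Big]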
18
How?
Sentences → phrases → concepts
19
Four Steps
1. Determine phrases & concepts
2. Initialize the joint probability of concepts, i.e., the t-distribution table
3. EM training on Viterbi alignments
   – Calculate the t-distribution table
   – A full iteration, then an approximation of EM
   – Viterbi alignment
   – Smoothing
4. Generate conditional probabilities from the joint probabilities, as needed by the decoder
20
Step 1: Phrase Determination
Keep all unigrams
Keep longer n-grams with frequency >= 5
(a sketch follows)
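A minimal sketch of this inventory step (my own illustration; the count threshold of 5 is from the slide, the 6-word cap anticipates the limitation mentioned on a later slide, and all names are hypothetical):

from collections import Counter

def phrase_inventory(corpus, max_len=6, min_count=5):
    """Keep all unigrams plus longer n-grams that occur at least `min_count` times."""
    counts = Counter()
    for sentence in corpus:                     # each sentence is a list of tokens
        for n in range(1, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[tuple(sentence[i:i + n])] += 1
    return {ph for ph, c in counts.items() if len(ph) == 1 or c >= min_count}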
21
Step 2: Initialize the t-Distribution Table
Given a sentence E of l words, there are S(l, k) ways in which the l words can be partitioned into k non-empty concepts (S is the Stirling number of the second kind)
22
Similarly, there are S(m, k) ways in which the m words of F can be partitioned into k non-empty concepts
The number of concepts k ranges from 1 to min(l, m)
Total number of concept alignments between the two sentences: (formula sketched below)
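The formula the slide points to can be reconstructed from the quantities just defined (my reconstruction, consistent with Marcu and Wong's derivation): S(l, k) is the Stirling number of the second kind,

S(l, k) = \frac{1}{k!} \sum_{i=0}^{k-1} (-1)^i \binom{k}{i} (k - i)^{l}

and the total number of concept alignments between E (l words) and F (m words) is

\sum_{k=1}^{\min(l, m)} k! \; S(l, k) \; S(m, k)

since each of the k! one-to-one pairings of the k source concepts with the k target concepts yields a distinct alignment.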
23
Probability of Two Concepts
24
What About Word Order?
The equation does not take word order into consideration
Yet phrases must consist of consecutive words
The formula overestimates the numerator and denominator equally, so the approximation works well in practice
25
Step 3: EM Training on Viterbi Alignments
After the initial t-table is built, EM can be used to improve the parameters
However, it is infeasible to compute expectations over all possible alignments
So for the initial alignment, only concepts with high t-probabilities are aligned
26
Implementation
Greedy alignment: greedily produce an initial alignment
Hill climbing: examine the probability of neighboring alignments to reach a local maximum by performing the following operations:
27
Swap concepts
Merge concepts
Break a concept apart
Move words across concepts
(a hill-climbing sketch follows)
From www.iccs.informatics.ed.ac.uk/~osborne/msc-projects/oconnor.pdf
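To make the hill-climbing loop concrete, here is a minimal sketch (my own illustration, not the paper's code). Concepts are pairs of phrases, an alignment is a list of concepts, and the score is the product of t-probabilities; only the swap and merge moves are implemented, the break and word-move operations are omitted for brevity. All names are hypothetical.

import math

def score(alignment, t_table, floor=1e-7):
    """Log-probability of an alignment under the t-distribution."""
    return sum(math.log(t_table.get((e, f), floor)) for e, f in alignment)

def neighbors(alignment):
    """Yield alignments reachable by one swap or one merge of concepts."""
    n = len(alignment)
    for i in range(n):
        for j in range(i + 1, n):
            (e1, f1), (e2, f2) = alignment[i], alignment[j]
            # Swap the target phrases of two concepts.
            swapped = list(alignment)
            swapped[i], swapped[j] = (e1, f2), (e2, f1)
            yield swapped
            # Merge two concepts into one (concatenating their phrases).
            merged = [c for k, c in enumerate(alignment) if k not in (i, j)]
            merged.append((e1 + e2, f1 + f2))
            yield merged

def hillclimb(alignment, t_table):
    """Greedily move to the best neighbor until no neighbor improves the score."""
    current, best = alignment, score(alignment, t_table)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(current):
            s = score(cand, t_table)
            if s > best:
                current, best, improved = cand, s, True
    return current

# Toy usage: the swap move repairs a crossed initial alignment.
t = {(("green", "witch"), ("bruja", "verde")): 0.4, (("the",), ("la",)): 0.5}
init = [(("the",), ("bruja", "verde")), (("green", "witch"), ("la",))]
print(hillclimb(init, t))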
28
Viterbi Search
Smoothing
29
Training Iterations
The first iteration uses Model 1
The remaining iterations use Model 2
30
Step 4: Derive the Conditional Probability Model
P(f|e) = p(e, f) / p(e)
Used in the decoder
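A minimal sketch of this step (my own illustration; `t_joint` is assumed to map phrase pairs (e, f) to joint probabilities):

from collections import defaultdict

def conditionalize(t_joint):
    """Turn a joint phrase table p(e, f) into conditionals p(f | e) = p(e, f) / p(e)."""
    marginal = defaultdict(float)
    for (e, f), p in t_joint.items():
        marginal[e] += p                        # p(e) = sum over f of p(e, f)
    return {(e, f): p / marginal[e] for (e, f), p in t_joint.items()}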
31
Decoder
Given a foreign sentence F, maximize the joint probability p(E, F)
Hill-climb by modifying E and the alignment between E and F to maximize P(E) · P(F|E)
P(E) is a trigram language model at the word level instead of the phrase level
32
Evaluation
Data: French-English Hansard corpus
Compared with GIZA (IBM Model 4)
Training: 100,000 sentence pairs
Testing: 500 unseen sentences, uniformly distributed across lengths 6, 8, 10, 15, and 20
33
Results
34
Comparison of the Models: from Koehn et al. (2003)
35
Limitations of the Model: Complexity Problems
Phrases limited to at most 6 words
Size of the t-table
Large number of possible alignments
Memory management
Expensive operations such as swap, break, and merge during Viterbi training
36
Limitations of the Model: Non-Consecutive Phrases
English "not" corresponds to French "ne … pas"
  – is not => "ne est pas"
  – is not here => "ne est pas ici"
Longer alignments? Sparseness problem
37
Complexity vs. Performance
Marcu and Wong: phrases of at most 6 words
Koehn et al. (2003):
  – allowing phrases longer than 3 words
  – greatly increases complexity but yields no significant improvement
38
Questions?