
1 A Phrase-Based, Joint Probability Model for Statistical Machine Translation Daniel Marcu, William Wong (2002) Presented by Ping Yu, 01/17/2006

2 Statistical Machine Translation: A Refresher

3 The Noisy Channel Translate from f to e. In the noisy-channel view, a source sentence e passes through an E → F channel (encoder) to produce f; the decoder recovers the most likely source sentence: e′ = argmax_e P(e|f) = argmax_e P(e) · P(f|e), where P(e) is the source model (language model) and P(f|e) is the channel model (translation model).
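A minimal sketch of this decision rule (the lookup tables and probabilities below are toy values invented for illustration, not trained models):

```python
import math

# Toy models for illustration only; a real system uses a trained n-gram
# language model and a trained translation model.
LM = {"the cat": 0.4, "cat the": 0.1}                 # toy P(e)
TM = {("le chat", "the cat"): 0.5,                    # toy P(f|e)
      ("le chat", "cat the"): 0.5}

def decode(f, candidates):
    """Noisy-channel decoding: e' = argmax_e P(e) * P(f|e)."""
    def score(e):
        return math.log(LM.get(e, 1e-9)) + math.log(TM.get((f, e), 1e-9))
    return max(candidates, key=score)

print(decode("le chat", ["the cat", "cat the"]))      # -> "the cat"
```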

4 Language Model Bag translation: turn a sentence into a bag of words, then use an n-gram language model to recover the most likely word order.
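A brute-force sketch of bag translation with a bigram language model (the bigram table below is a toy assumption; real models are estimated from corpora):

```python
import itertools, math

# Toy bigram probabilities P(w2 | w1); "<s>" marks the sentence start.
BIGRAM = {("<s>", "mary"): 0.5, ("mary", "did"): 0.4,
          ("did", "not"): 0.6, ("not", "slap"): 0.3}

def bigram_logprob(words):
    prev, total = "<s>", 0.0
    for w in words:
        total += math.log(BIGRAM.get((prev, w), 1e-9))  # unseen pairs: tiny prob
        prev = w
    return total

def bag_translate(bag):
    """Pick the permutation of the bag that the language model likes best."""
    return max(itertools.permutations(bag), key=bigram_logprob)

print(bag_translate({"mary", "did", "not", "slap"}))  # ('mary', 'did', 'not', 'slap')
```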

5 Translation Model Alignment: P(f, a|e) Fertility: depends solely on the English word. Mary did not slap the green witch → Mary no daba una bofetada a la bruja verde (Spanish) Fertilities: Mary: 1; did: 0; not: 1; slap: 3; the: 2; green: 1; witch: 1 (example from Kevin Knight’s tutorial)

6 Conditional Probability & Word-Based Statistical MT Fertility allows one-to-one and one-to-many mappings from e to f. Conditional probability: given e, what is the probability of f with alignment a, i.e., p(f, a|e)? Word-based MT: IBM Models 1-5

7 How About Many-to-Many Mapping? (diagram: words a, b, c on one side aligned many-to-many with x, y on the other)

8 Out of sight, out of mind: Invisible Idiot Output from Systran: French: Hors de la vue, hors de l’esprit. Back to English: Out of the sight, of the spirit. German: Aus dem Anblick des Geistes heraus. Back to English: From the sight of the spirit out. Italian: Dalla vista dello spirito fuori. Back to English: From the sight of the spirit outside. Portuguese: Da vista do espírito fora. Back to English: Of the sight of the spirit it are. Spanish: De la vista del alcohol está. Back to English: Of the Vista of the alcohol it is. From http://www.discourse.net/archives/2005/06/of_the_vista_of_the_alcohol_it_is.html

9 Lost in Translation

10 Solution: many-to-many mapping. How? Move from word-based to phrase-based models.

11 Alignment between Multiple Phrases Phrases here are not necessarily linguistic phrases, and they are defined differently in different models. Most extracted phrases are based on word-based alignments: Och and Ney (1999), the alignment template model; Melamed (2001), the non-compositional compounds model.

12 Marcu and Wong (2002)

13 Promising Features Searches for phrases and alignments simultaneously in both source and target sentences. Directly models phrase-based probabilities, without depending on word-based probabilities.

14 Phrase & Concept Phrase: a sequence of consecutive words. Concept: a pair of aligned phrases. A set of concepts C can be linearized into a sentence pair (E, F) if E and F can be obtained by permuting the phrases e_i and f_i that characterize all concepts c_i ∈ C. This property is denoted by the predicate L(E, F, C).
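A brute-force sketch of the predicate L(E, F, C) (exponential in the number of concepts, so only for tiny illustrative inputs; the string-based phrase representation is my own assumption):

```python
import itertools

def linearizable(E, F, concepts):
    """Check L(E, F, C): E and F must each be obtainable by permuting the
    e-phrases / f-phrases of the concepts (permuted independently)."""
    e_phrases = [e for e, f in concepts]
    f_phrases = [f for e, f in concepts]
    def covers(sentence, phrases):
        return any(" ".join(p) == sentence
                   for p in itertools.permutations(phrases))
    return covers(E, e_phrases) and covers(F, f_phrases)

C = [("the cat", "le chat"), ("sleeps", "dort")]
print(linearizable("the cat sleeps", "le chat dort", C))   # True
print(linearizable("sleeps the cat", "dort le chat", C))   # True (permuted)
```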

15 Two Models Model 1: –A joint probability distribution over concepts –The phrases within a concept are equivalent translations

16 Model 2 A joint probability model with position-based distortion, modeling the probability of the alignment (relative positions) between two phrases.
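As a hedged sketch of a position-based distortion term (the geometric-decay parameterization below is my own illustrative assumption, not the paper's exact formula):

```python
def distortion(pos_e, pos_f, alpha=0.5):
    """Toy position-based distortion: the score decays geometrically with
    the distance between the positions of the aligned phrases. The paper's
    distortion is also position-based, but its exact parameterization
    differs; this is illustration only."""
    return alpha ** abs(pos_e - pos_f)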

17 Probability to Generate a Sentence Pair Under Model 1: p(E, F) = Σ_{C : L(E, F, C)} Π_{c_i ∈ C} t(e_i, f_i), i.e., sum over all concept sets that linearize into (E, F) of the product of the phrase-translation probabilities t of their concepts.

18 How? Sentences → Phrases → Concepts

19 Four Steps 1. Determine phrases and concepts 2. Initialize the joint probabilities of concepts, i.e., the t-distribution table 3. EM training on Viterbi alignments: –Calculate the t-distribution table –One full EM iteration, then an approximation of EM –Viterbi alignment –Smoothing 4. Generate the conditional probability from the joint probability, as needed by the decoder

20 Step 1: Phrase Determination Keep all unigrams; keep longer n-grams only if their frequency is >= 5.
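A sketch of this filtering (max_n = 6 matches the phrase-length limit mentioned on slide 35; the function name and data layout are assumptions):

```python
from collections import Counter

def candidate_phrases(corpus, max_n=6, min_count=5):
    """Keep every unigram; keep longer n-grams only if they occur at least
    min_count times, per the thresholds on the slide."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {p for p, c in counts.items() if len(p) == 1 or c >= min_count}
```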

21 Step 2: Initialize the t-distribution Table Given a sentence E of l words, there are S(l, k) ways in which the l words can be partitioned into k non-empty concepts, where S is the Stirling number of the second kind.

22 Likewise, there are S(m, k) ways in which the m words of sentence F can be partitioned into k non-empty concepts. The number of concepts k ranges between 1 and min(l, m). Total number of concept alignments between the two sentences: Σ_{k=1}^{min(l, m)} S(l, k) · S(m, k) · k!
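A small sketch computing these quantities (the closed form for the total is my reconstruction from the definitions above: choose a k-partition of each sentence, then match the k concepts in any of k! ways):

```python
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind S(n, k): ways to partition n items
    into k non-empty sets. Recurrence: S(n,k) = k*S(n-1,k) + S(n-1,k-1)."""
    if n == k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def num_alignments(l, m):
    """Total concept alignments between sentences of l and m words."""
    return sum(stirling2(l, k) * stirling2(m, k) * factorial(k)
               for k in range(1, min(l, m) + 1))

print(num_alignments(3, 2))  # 1*1*1 + 3*1*2 = 7
```

Note that S counts arbitrary set partitions rather than partitions into consecutive word spans; slide 24 addresses exactly this overestimation.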

23 Probability of Two Concepts

24 How about the Word Order? The equation doesn’t take word order into consideration, yet phrases must consist of consecutive words. The formula overestimates the numerator and the denominator equally, so the approximation works well in practice.

25 Step 3: EM Training on Viterbi Alignments After the initial t-table is built, EM can be used to improve the parameters. However, it is infeasible to calculate expectations over all possible alignments, so for the initial alignment, only the concepts with high t-probabilities are aligned.
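A sketch of that initialization as a greedy pairing by t-probability (the function name and data layout are my assumptions):

```python
def greedy_initial_alignment(e_phrases, f_phrases, t_table):
    """Repeatedly link the remaining phrase pair with the highest
    t-probability, so only high-probability concepts enter the initial
    alignment; unpaired phrases are left out of this sketch."""
    pairs = sorted(((t_table.get((e, f), 0.0), e, f)
                    for e in e_phrases for f in f_phrases), reverse=True)
    alignment, used_e, used_f = [], set(), set()
    for t, e, f in pairs:
        if t > 0 and e not in used_e and f not in used_f:
            alignment.append((e, f))
            used_e.add(e)
            used_f.add(f)
    return alignment
```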

26 Implementation Greedy alignment: greedily produce an initial alignment. Hillclimbing: examine the probabilities of neighboring alignments to reach a local maximum, by performing the following operations:

27 Swap concepts: exchange the aligned phrases of two concepts. Merge concepts: combine two concepts into one. Break a concept: split one concept into two. Move words across concepts: move a word from one concept’s phrase into another’s. From www.iccs.informatics.ed.ac.uk/~osborne/msc-projects/oconnor.pdf
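A sketch of two of these neighbor operations plus the hill-climbing loop (swap and merge shown; break and move-word are analogous; the representations and names are my assumptions, not the authors' code):

```python
def neighbors(alignment):
    """Generate neighbor alignments. An alignment is a list of (e, f)
    concept pairs, each phrase a string of space-separated words."""
    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            (e1, f1), (e2, f2) = alignment[i], alignment[j]
            swap = list(alignment)
            swap[i], swap[j] = (e1, f2), (e2, f1)          # swap concepts
            yield swap
            merged = [c for k, c in enumerate(alignment) if k not in (i, j)]
            merged.append((e1 + " " + e2, f1 + " " + f2))  # merge concepts
            yield merged

def hillclimb(alignment, score):
    """Move to the best-scoring neighbor until no neighbor improves."""
    while True:
        best = max(neighbors(alignment), key=score, default=None)
        if best is None or score(best) <= score(alignment):
            return alignment
        alignment = best
```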

28 Viterbi Search; Smoothing

29 Training Iterations The first iteration uses Model 1; the remaining iterations use Model 2.

30 Step 4: Derivation of the Conditional Probability Model P(f|e) = p(e, f) / p(e), which is the form used in the decoder.
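A small sketch of this conditionalization, with the marginal p(e) = Σ_f p(e, f) (the joint values below are toy numbers):

```python
def conditionalize(joint):
    """Derive P(f|e) = p(e, f) / p(e) from a joint table, where the marginal
    p(e) is obtained by summing the joint over all f."""
    marginal = {}
    for (e, f), p in joint.items():
        marginal[e] = marginal.get(e, 0.0) + p
    return {(e, f): p / marginal[e] for (e, f), p in joint.items()}

joint = {("the cat", "le chat"): 0.03, ("the cat", "le chien"): 0.01}
print(conditionalize(joint))  # P(le chat | the cat) = 0.75
```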

31 Decoder Given a foreign sentence F, maximize the probability p(E, F). Hillclimb by modifying E and the alignment between E and F to maximize P(E) · P(F|E). P(E) is a trigram language model at the word level rather than the phrase level.
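A generic hill-climbing skeleton for this search (the `neighbors` and `score` callables are assumed to be supplied, e.g. a move set like the one sketched after slide 27, with score computing log P(E) + log P(F|E) under the current alignment):

```python
def decode(initial_E, initial_alignment, neighbors, score):
    """Greedy decoding sketch: starting from a seed translation, repeatedly
    move to the best-scoring neighboring (E, alignment) state until no
    neighbor improves on the current one."""
    state = (initial_E, initial_alignment)
    while True:
        best = max(neighbors(state), key=score, default=None)
        if best is None or score(best) <= score(state):
            return state
        state = best
```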

32 Evaluation Data: French-English Hansard data. Compared with GIZA (IBM Model 4). Training: 100,000 sentence pairs. Testing: 500 unseen sentences, uniformly distributed across lengths 6, 8, 10, 15, and 20.

33 Results

34 Comparison of the Models: from Koehn et al. (2003)

35 Limitations of the Model: Complexity Problems Phrases limited to at most 6 words. Large t-table. Large number of possible alignments. Memory management. Expensive operations (swap, break, merge) during Viterbi training.

36 Limitations of the Model: Non-Consecutive Phrases English “not” corresponds to the discontinuous French “ne … pas”: –is not => “ne est pas” –is not here => “ne est pas ici” Longer alignments? Data-sparseness problem.

37 Complexity vs. Performance Marcu and Wong: n-grams of length <= 6. Koehn et al. (2003): –Allowed phrase lengths > 3 –Complexity increases greatly, but with no significant improvement

38 Questions?

