
1 A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages Minh-Thang Luong, Preslav Nakov & Min-Yen Kan EMNLP 2010, Massachusetts, USA

2  Highly inflected languages (Arabic, Basque, Turkish, Russian, Hungarian, etc.)  Extensive use of affixes  Huge number of distinct word forms  Example – a real Finnish word used in the Finnish Air Force:
lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas
= lento+ kone+ suihku+ turbiini+ moottori+ apu+ mekaanikko+ ali+ upseeri+ oppilas
(flight + machine + shower + turbine + engine + assistance + mechanic + under + officer + student)
≈ "technical warrant officer trainee specialized in aircraft jet engines"
Highly inflected languages are hard for MT  analysis at the morpheme level is sensible.

3 Morphological Analysis Helps
 Morpheme representation alleviates data sparseness:
 Arabic  English (Lee, 2004)
 Czech  English (Goldwater & McClosky, 2005)
 Finnish  English (Yang & Kirchhoff, 2006)
 Word representation captures context better for large corpora:
 Arabic  English (Sadat & Habash, 2006)
 Finnish  English (de Gispert et al., 2009)
Our approach: the basic unit of translation is the morpheme, but word boundaries are respected at all stages.
auto+ si = your car
auto+ i+ si = your car+ s
auto+ i+ ssa+ si = in your car+ s
auto+ i+ ssa+ si+ ko = in your car+ s ?

4 Translation into Morphologically Rich Languages
 A challenging translation direction
 Recent interest in morphologically rich languages:
 Arabic (Badr et al., 2008) and Turkish (Oflazer and El-Kahlout, 2007)  enhance performance only for small bi-texts
 Greek (Avramidis and Koehn, 2008) and Russian (Toutanova et al., 2008)  rely heavily on language-specific tools
We are interested in an unsupervised approach, and we experiment with translation on a large dataset.

5 Methodology
 Morphological analysis – unsupervised
 Morphological enhancements – respect word boundaries
 Translation model enrichment – merge phrase tables (PT)
Pipeline: words  1) morphological analysis  morphemes  alignment training  alignment  2) phrase extraction  phrase pairs  PT scoring (5) translation model enrichment)  phrase table  3) MERT tuning  4) decoding

6 Morphological Analysis
 Use Morfessor (Creutz and Lagus, 2007), an unsupervised morphological analyzer
 Segments words into morphemes (PRE, STM, SUF): un/PRE+ care/STM+ ful/SUF+ ly/SUF
 The "+" sign is used to enforce word boundary constraints later
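The "+" boundary convention can be sketched in a few lines. This is an illustrative sketch, not Morfessor's actual API; the helper names are ours, and the sketch assumes the segmentations themselves are already available:

```python
# Sketch (not Morfessor's API): mark word-internal boundaries with "+"
# so that the original words can be recovered from a morpheme stream.

def mark_boundaries(morphemes):
    """['un', 'care', 'ful', 'ly'] -> ['un+', 'care+', 'ful+', 'ly']"""
    return [m + "+" for m in morphemes[:-1]] + morphemes[-1:]

def rejoin_words(tokens):
    """Rejoin a morpheme token stream into words using the '+' markers."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith("+"):
            current += tok[:-1]            # word continues
        else:
            words.append(current + tok)    # this token closes the word
            current = ""
    return words
```

Because the markers are kept on the tokens, every later stage (phrase extraction, tuning, decoding) can recover word boundaries without re-running the analyzer.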

7 Word Boundary-aware Phrase Extraction
 Typical SMT: maximum phrase length n = 7 words
 Problem: morpheme phrases of length n can span fewer than n words, and may span words only partially – severe for morphologically rich languages
 Solution: morpheme phrases must span n words or less and must cover a sequence of whole words  avoids proposing non-words
SRC = the STM new STM , un PRE+ democratic STM immigration STM policy STM
TGT = uusi STM , epä PRE+ demokraat STM+ t SUF+ i SUF+ s SUF+ en SUF maahanmuutto PRE+ politiikan STM
(uusi = new, epädemokraattisen = undemocratic, maahanmuuttopolitiikan = immigration policy)
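The boundary constraint amounts to a simple filter over candidate morpheme phrases. A minimal sketch with illustrative names (not the authors' code): a phrase is kept only if it starts and ends at word boundaries and covers at most n whole words.

```python
def is_valid_phrase(tokens, prev_token=None, max_words=7):
    """Boundary-aware filter: `tokens` is the candidate morpheme phrase,
    `prev_token` the token immediately preceding it in the sentence."""
    if not tokens:
        return False
    if prev_token is not None and prev_token.endswith("+"):
        return False  # would start in the middle of a word
    if tokens[-1].endswith("+"):
        return False  # would end in the middle of a word
    # every token without a trailing '+' closes exactly one word
    n_words = sum(1 for t in tokens if not t.endswith("+"))
    return n_words <= max_words
```

For example, "epä+ demokraat+" is rejected because it ends mid-word, while the full "epä+ demokraat+ t+ i+ s+ en" covers exactly one whole word and is kept.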

8 Morpheme MERT Optimizing Word BLEU
 Why?
 BLEU's brevity penalty is influenced by sentence length, and the same number of words spans different numbers of morphemes
 e.g., a 6-morpheme Finnish word: epä PRE+ demokraat STM+ t SUF+ i SUF+ s SUF+ en SUF
 this yields a suboptimal weight for the word penalty feature function
 Solution: for each iteration of MERT,
 decode at the morpheme level
 convert the morpheme translation into a word sequence
 compute word BLEU
 convert back to morphemes
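The conversion step in the tuning loop amounts to collapsing "+"-joined morphemes back into words before BLEU is computed. A sketch with our own helper name:

```python
def morphemes_to_words(line):
    """'epä+ demokraat+ t+ i+ s+ en maahanmuutto+ politiikan'
    -> 'epädemokraattisen maahanmuuttopolitiikan'"""
    words, current = [], ""
    for tok in line.split():
        if tok.endswith("+"):
            current += tok[:-1]        # word-internal morpheme
        else:
            words.append(current + tok)
            current = ""
    if current:                        # tolerate a truncated trailing word
        words.append(current)
    return " ".join(words)
```

Scoring this word-level string with standard BLEU keeps the brevity penalty and the word penalty weight calibrated to words rather than morphemes.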

9 Decoding with Twin Language Models
 Morpheme language model (LM)
 Pros: alleviates data sparseness
 Cons: phrases span fewer words
 Introduce a second LM at the word level:
 log-linear model: add a separate feature
 Moses decoder: add a word-level "view" on the morpheme-level hypotheses
Example hypothesis: uusi STM , epä PRE+ demokraat STM+ t SUF+ i SUF+ s SUF+ en SUF maahanmuutto PRE+ politiikan STM

10–13 Decoding with Twin Language Models (continued)
As the decoder extends the hypothesis uusi STM , epä PRE+ demokraat STM+ t SUF+ i SUF+ s SUF+ en SUF maahanmuutto PRE+ politiikan STM :
 the morpheme LM scores the new morpheme n-grams, e.g., "s SUF+ en SUF maahanmuutto PRE+ " and "en SUF maahanmuutto PRE+ politiikan STM "
 the word LM scores the completed words: uusi , epädemokraattisen maahanmuuttopolitiikan
This is: (1) different from scoring with two word-level LMs, and (2) superior to n-best rescoring.
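The word-level "view" can be pictured as a small buffer attached to each decoder hypothesis: morphemes accumulate until a token without a trailing "+" closes a word, and only then is the word released to the word LM. This is a sketch of the idea with our own names, not Moses internals:

```python
class WordView:
    """Word-level view over a morpheme-level hypothesis (illustrative)."""

    def __init__(self):
        self.buffer = ""   # word under construction
        self.words = []    # completed words, visible to the word LM

    def extend(self, token):
        """Consume one morpheme token; return the completed word, if any."""
        if token.endswith("+"):
            self.buffer += token[:-1]
            return None                 # word still incomplete
        word = self.buffer + token
        self.buffer = ""
        self.words.append(word)
        return word                     # hand this word to the word LM
```

Because partially built words stay in the buffer, the word LM is never asked to score a non-word, which is what distinguishes this from simply running two word-level LMs.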

14 Building Twin Translation Models
From the same source (the parallel corpus), we generate two translation models:
 word route: word alignment with GIZA++, then morphological segmentation of the aligned phrases  PT w→m
e.g., problems ||| vaikeuksista  problem STM+ s SUF ||| vaikeu STM+ ksi SUF+ sta SUF
 morpheme route: morpheme alignment with GIZA++, then morpheme phrase extraction  PT m
e.g., problem STM+ s SUF ||| ongelma STM+ t SUF
The two phrase tables are then merged (PT merging) before decoding.

15 Phrase Table (PT) Merging
 Add-feature methods (e.g., Chen et al., 2009): heuristic-driven
 Interpolation-based methods (e.g., Wu & Wang, 2007): linear interpolation of phrase and lexicalized translation probabilities; designed for two PTs originating from different sources
Each PT entry carries phrase translation probabilities, lexicalized translation probabilities, and a phrase penalty, e.g.:
problem STM+ s SUF ||| vaikeu STM+ ksi SUF+ sta SUF ||| 0.07 0.11 0.01 0.01 2.7
problem STM+ s SUF ||| ongelma STM+ t SUF ||| 0.37 0.60 0.11 0.14 2.7
(add-feature merging appends extra indicator features, e.g., 2.7 1 vs. 1 2.7)
We take into account the fact that our twin translation models are of equal quality.

16 Our Method – Phrase Translation Probabilities
 Preserve the normalized ML estimations (Koehn et al., 2003)
 Use the raw counts of both models – the number of times each phrase pair was extracted from the training dataset – to compute the merged estimates over PT m and PT w→m
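Pooling raw extraction counts keeps the result a properly normalized ML estimate, unlike interpolating two already-normalized distributions. A hedged sketch, with plain count dictionaries standing in for the real phrase tables:

```python
from collections import Counter

def merge_phrase_counts(counts_m, counts_wm):
    """counts_*: {(src, tgt): times the pair was extracted}, one dict for
    the morpheme table PT_m and one for the word-derived table PT_w->m."""
    merged = Counter(counts_m)
    merged.update(counts_wm)   # counts add up, pair by pair
    return merged

def phrase_prob(merged, src, tgt):
    """ML estimate p(tgt | src) over the pooled counts."""
    total = sum(c for (s, _t), c in merged.items() if s == src)
    return merged[(src, tgt)] / total if total else 0.0
```

Summing over a source phrase guarantees the merged probabilities still sum to one, which is exactly the property the add-feature and interpolation methods do not preserve.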

17 Lexicalized Translation Probabilities
 Use linear interpolation
 What happens if a phrase pair belongs to only one PT?
 previous methods interpolate with 0, which might cause some good phrases to be penalized
 our method: induce all scores before interpolation – use the lexical model of one PT to score the phrase pairs of the other
e.g., the pair problem STM+ s SUF ||| ongelma STM+ t SUF from PT m is also scored by the word lexical model of PT w via (problems | ongelmat), and pairs from PT w→m are scored by the morpheme lexical model of PT m.
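The induce-then-interpolate idea can be sketched directly; the table and lexical-model arguments below are illustrative stand-ins, and the equal weights reflect the claim that the twin models are of equal quality:

```python
def interpolated_lex(pair, table_m, table_wm, lex_m, lex_wm, w=0.5):
    """If `pair` is missing from one table, induce its score from that
    side's lexical model instead of interpolating with 0."""
    score_m = table_m[pair] if pair in table_m else lex_m(pair)
    score_wm = table_wm[pair] if pair in table_wm else lex_wm(pair)
    return w * score_m + (1.0 - w) * score_wm
```

A pair present in only one table thus gets a plausible score from the other side's lexical model rather than being dragged toward zero.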

18 Dataset & Settings
 Dataset:
 past shared task WPT05 (en/fi)
 714K sentence pairs
 split into T1, T2, T3, and T4 of sizes 40K, 80K, 160K, and 320K
 Standard phrase-based SMT settings:
 Moses
 IBM Model 4
 case-insensitive BLEU

19 MT Baseline Systems
 w-system – word level; m-system – morpheme level; m-BLEU – morpheme version of BLEU

        w-system          m-system
        BLEU   m-BLEU     BLEU   m-BLEU
T1      11.56  45.57      11.07  49.15
T2      12.95  48.63      12.68  53.78
T3      13.64  50.30      13.32  54.40
T4      14.20  50.85      13.57  54.70
Full    14.58  53.05      14.08  55.26

Either the m-system does not perform as well as the w-system, or BLEU is not capable of measuring morpheme-level improvements.

20 Morphological Enhancements – Individual
 phr: boundary-aware phrase extraction; tune: MERT tuning on word BLEU; lm: decoding with twin LMs

System     T1 (40K)        Full (714K)
w-system   11.56           14.58
m-system   11.07           14.08
m+phr      11.44 (+0.37)   14.43 (+0.35)
m+tune     11.73 (+0.66)   14.55 (+0.47)
m+lm       11.58 (+0.51)   14.53 (+0.45)

Each individual enhancement yields improvements over the m-system for both small and large corpora.

21 Morphological Enhancements – Combined
 phr: boundary-aware phrase extraction; tune: MERT tuning on word BLEU; lm: decoding with twin LMs

System           T1 (40K)        Full (714K)
w-system         11.56           14.58
m-system         11.07           14.08
m+phr+lm         11.77 (+0.70)   14.58 (+0.50)
m+phr+lm+tune    11.90 (+0.83)   14.39 (+0.31)

The morphological enhancements are on par with the w-system and yield sizable improvements over the m-system.

22 Translation Model Enrichment
 add-1: one extra feature; add-2: two extra features; interpolation: linear interpolation; ourMethod: as described above

Merging method   Full (714K)
m-system         14.08
w-system         14.58
add-1            14.25 (+0.17)
add-2            13.89 (−0.19)
interpolation    14.63 (+0.55)
ourMethod        14.82 (+0.74)

Our method outperforms the w-system baseline.

23 Result Significance
 Absolute improvement of 0.74 BLEU over the m-system – a non-trivial relative improvement of 5.6%
 Outperforms the w-system by 0.24 points (1.56% relative)
 Statistically significant with p < 0.01 (Collins' sign test)

           BLEU
m-system   14.08
w-system   14.58
ourSystem  14.82 (+0.74)

24 Translation Proximity Match
Hypothesis: our approach yields translations close to the reference wordforms that are unjustly penalized by BLEU.
 Automatically extract phrase triples (src, out, ref), using src as the pivot between the translation output (out) and the reference (ref)
 Measure the longest common subsequence ratio: high character-level similarity between out and ref, e.g., (economic and social, taloudellisia ja sosiaalisia, taloudellisten ja sosiaalisten)
 Of 16,262 triples, 6,758 match the references exactly; the remaining triples are mostly close wordforms.
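The character-level closeness measure is a standard longest-common-subsequence ratio; a minimal sketch (the paper's exact threshold is not reproduced here):

```python
def lcs_len(a, b):
    """Longest common subsequence length, classic O(|a|*|b|) DP."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_ratio(out, ref):
    """Character-level closeness of an output wordform to the reference."""
    return lcs_len(out, ref) / max(len(out), len(ref), 1)
```

On the example pair above, "taloudellisia" vs. "taloudellisten" share the 11-character subsequence "taloudellis", giving a high ratio even though exact-match BLEU counts them as a miss.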

25 Human Evaluation
 4 native Finnish speakers, 50 random test sentences
 Followed the WMT'09 evaluation: judges were shown the source sentence, its reference translation, and the outputs of the m-system, w-system, and ourSystem in random order, and asked for three pairwise judgments

          our vs. m    our vs. w    w vs. m
Judge 1   25 / 18      19 / 12      21 / 19
Judge 2   24 / 16      19 / 15      25 / 14
Judge 3   27† / 12     17 / 11      27† / 15
Judge 4   25 / 20      26† / 12     22 / 22
Total     101‡ / 66    81‡ / 50     95† / 70

The judges consistently preferred: (1) ourSystem to the m-system, (2) ourSystem to the w-system, (3) the w-system to the m-system.

26 Sample Translations
src: we were very constructive and we negotiated until the last minute of these talks in the hague.
ref: olimme erittain rakentavia ja neuvottelimme haagissa viime hetkeen saakka.
our: olemme olleet hyvin rakentavia ja olemme neuvotelleet viime hetkeen saakka naiden neuvottelujen haagissa.
w: olemme olleet hyvin rakentavia ja olemme neuvotelleet viime tippaan niin naiden neuvottelujen haagissa.
m: olimme erittain rakentavan ja neuvottelimme viime hetkeen saakka naiden neuvotteluiden haagissa.
Rank: our > m ≥ w

src: it would be a very dangerous situation if the europeans were to become logistically reliant on russia.
ref: olisi erittäin vaarallinen tilanne, jos eurooppalaiset tulisivat logistisesti riippuvaisiksi venäjästä.
our: olisi erittäin vaarallinen tilanne, jos eurooppalaiset tulee logistisesti riippuvaisia venäjän.
w: se olisi erittäin vaarallinen tilanne, jos eurooppalaisten tulisi logistically riippuvaisia venäjän. (OOV: logistically)
m: se olisi hyvin vaarallinen tilanne, jos eurooppalaiset haluavat tulla logistisesti riippuvaisia venäjän.
Rank: our > w ≥ m

(Slide callouts on the highlighted spans: wrong case, match reference, confusing meaning, match reference, wrong case, OOV, change the meaning.)
ourSystem consistently outperforms the m-system and the w-system, and seems to blend translations from both baselines well!

27 Conclusion
 Two key challenges:
 bring the performance of morpheme systems to a level rivaling the standard word-level ones
 incorporate morphological analysis directly into the translation process
 We have built a preliminary framework addressing the first challenge
 Future work:
 extend the morpheme-level framework
 tackle the second challenge
Thank you!

28 Analysis – Translation Model Comparison
 PT m+phr vs. PT m :
 PT m+phr is about half the size of PT m
 95.07% of the phrase pairs in PT m+phr are also in PT m
 boundary-aware phrase extraction selects good phrase pairs from PT m to retain in PT m+phr
 PT m+phr vs. PT w→m :
 comparable in size: 22.5M and 28.9M pairs
 the overlap is only 47.67% of PT m+phr
 enriching the translation model with PT w→m helps improve coverage

Phrase table sizes (Full, 714K):
PT m                   43.5M
PT w→m                 28.9M
PT m+phr               22.5M
PT m+phr ∩ PT m        21.4M
PT m+phr ∩ PT w→m      10.7M

