Published by Stewart Shields. Modified over 9 years ago.
1
Source Language Adaptation for Resource-Poor Machine Translation
Pidong Wang, National University of Singapore
Preslav Nakov, QCRI, Qatar Foundation
Hwee Tou Ng, National University of Singapore
2
Introduction
3
EMNLP-CoNLL 2012, July 12, 2012, Jeju, Korea
Source Language Adaptation for Resource-Poor Machine Translation (Wang, Nakov, & Ng)
Overview
Statistical Machine Translation (SMT) systems need large sentence-aligned bilingual corpora (bi-texts).
Problem: Such training bi-texts do not exist for most languages.
Idea: Adapt a bi-text for a related resource-rich language.
4
Idea & Motivation
Idea: reuse bi-texts from related resource-rich languages to improve resource-poor SMT.
Related languages have:
- overlapping vocabulary (cognates), e.g., casa (‘house’) in Spanish and Portuguese
- similar word order
- similar syntax
5
Resource-rich vs. Resource-poor Languages
Related EU – non-EU languages
- Swedish – Norwegian
- Bulgarian – Macedonian
Related EU languages
- Spanish – Catalan
- Czech – Slovak
- Irish – Scottish Gaelic
- Standard German – Swiss German
Related languages outside Europe
- MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi)
- Hindi – Urdu
- Turkish – Azerbaijani
- Russian – Ukrainian
- Malay – Indonesian
We will explore two of these pairs: Malay – Indonesian (the main focus) and Bulgarian – Macedonian.
6
Our Main Focus: Improving Indonesian-English SMT Using Malay-English
7
Malay vs. Indonesian
Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
Indonesian: Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama. Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu sama lain dalam semangat persaudaraan.
~50% exact word overlap (text from Article 1 of the Universal Declaration of Human Rights)
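The slide does not say how "exact word overlap" is computed. The following is a minimal sketch under one plausible assumption: the share of tokens in the first sentence that also occur in the second, counted as a multiset. The function names `tokenize` and `exact_word_overlap` are illustrative, not from the talk.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep hyphenated forms such as "hak-hak" as one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def exact_word_overlap(source, target):
    """Fraction of source tokens that also appear in target (multiset match).

    Assumed definition; the slides do not specify the exact metric.
    """
    source_counts = Counter(tokenize(source))
    target_counts = Counter(tokenize(target))
    shared = sum((source_counts & target_counts).values())
    total = sum(source_counts.values())
    return shared / total if total else 0.0


if __name__ == "__main__":
    malay = "Semua manusia dilahirkan bebas dan samarata."
    indonesian = "Semua orang dilahirkan merdeka dan mempunyai martabat."
    print(exact_word_overlap(malay, indonesian))
```

Because the measure is normalized by the first argument, swapping the two sentences can give a slightly different value.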
8
Malay Can Look “More Indonesian”…
Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
Malay post-edited to look “Indonesian” (by an Indonesian speaker): Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama. Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan.
~75% exact word overlap (text from Article 1 of the Universal Declaration of Human Rights)
We attempt to do this automatically: adapt Malay to look Indonesian. Then, use it to improve SMT…
9
Method at a Glance
Starting point: Indonesian-English bi-text (poor) and Malay-English bi-text (rich).
Step 1: Adaptation. Adapt Malay-English to “Indonesian”-English.
Step 2: Combination. Combine Indonesian-English with “Indonesian”-English.
Note that we have no Malay-Indonesian bi-text!
10
Step 1: Adapting Malay-English to “Indonesian”-English
11
Word-Level Bi-text Adaptation: Overview
Given a Malay-English sentence pair:
1. Adapt the Malay sentence to “Indonesian”, using:
- Word-level paraphrases
- Phrase-level paraphrases
- Cross-lingual morphology
2. Pair the adapted “Indonesian” with the English side of the Malay-English sentence pair.
Thus, we generate a new “Indonesian”-English sentence pair.
12
Word-Level Bi-text Adaptation: Overview
Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
Decode using a large Indonesian LM.
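To make the decoding step concrete, here is a rough sketch that assumes the Malay sentence has been expanded into a confusion network (one slot per word, each slot holding weighted Indonesian candidates) and that candidates are ranked with a bigram language model. The paraphrase weights, the bigram scores, and the function name `nbest_from_confusion_network` are all invented for illustration. Exhaustive enumeration is used only because the toy network is tiny; a real decoder would use beam search.

```python
import itertools
import math


def nbest_from_confusion_network(network, bigram_logprob, n=3, unk=-5.0):
    """Return the n best paths through a confusion network.

    network: list of slots; each slot is a list of (word, weight) pairs.
    bigram_logprob: dict mapping (previous_word, word) to a log-probability.
    unk: log-probability assumed for unseen bigrams (an arbitrary toy value).
    """
    scored = []
    for path in itertools.product(*network):
        words = [word for word, _ in path]
        # Paraphrase score: sum of log weights along the path.
        score = sum(math.log(weight) for _, weight in path)
        # Language model score: sum of bigram log-probabilities.
        for prev, cur in zip(["<s>"] + words, words):
            score += bigram_logprob.get((prev, cur), unk)
        scored.append((score, " ".join(words)))
    scored.sort(key=lambda item: -item[0])
    return [sentence for _, sentence in scored[:n]]


if __name__ == "__main__":
    # Toy network for the first three Malay words; weights are made up.
    network = [
        [("kdnk", 0.4), ("pdb", 0.6)],
        [("malaysia", 1.0)],
        [("dijangka", 0.5), ("diperkirakan", 0.5)],
    ]
    toy_lm = {
        ("<s>", "pdb"): -1.0,
        ("pdb", "malaysia"): -1.0,
        ("malaysia", "diperkirakan"): -1.0,
    }
    for sentence in nbest_from_confusion_network(network, toy_lm, n=2):
        print(sentence)
```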
13
Word-Level Bi-text Adaptation: Overview
English: Malaysia’s GDP is expected to reach 8 per cent in 2010.
Pair each adapted sentence with the English counterpart.
Thus, we generate a new “Indonesian”-English bi-text.
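The pairing step can be sketched as follows. This assumes an `adapt` callable that returns a ranked list of “Indonesian” candidates for a Malay sentence. Both `adapt` and `build_adapted_bitext` are hypothetical names, and the cut-off `n` is a free parameter, not a value taken from the slides.

```python
def build_adapted_bitext(malay_english_pairs, adapt, n=10):
    """Pair each of the top-n adapted sentences with the original English side.

    malay_english_pairs: iterable of (malay_sentence, english_sentence).
    adapt: hypothetical function mapping a Malay sentence to a ranked list
           of "Indonesian" candidate sentences.
    """
    bitext = []
    for malay, english in malay_english_pairs:
        for candidate in adapt(malay)[:n]:
            bitext.append((candidate, english))
    return bitext


if __name__ == "__main__":
    def toy_adapt(sentence):
        # Stand-in for a real adaptation system: two fixed toy variants.
        return [sentence.replace("peratus", "persen"), sentence]

    pairs = [("8 peratus", "8 per cent")]
    print(build_adapted_bitext(pairs, toy_adapt, n=2))
```

Every adapted candidate reuses the same English sentence, so the adapted bi-text can be up to n times larger than the Malay-English bi-text it came from.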
14
Word-Level Adaptation: Extracting Paraphrases
Indonesian translations for Malay: pivoting over English.
ML-EN bi-text: Malay sentence (ML1 ML2 ML3 ML4 ML5) aligned to English sentence (EN1 EN2 EN3 EN4).
IN-EN bi-text: English sentence (EN11 EN3 EN12) aligned to Indonesian sentence (IN1 IN2 IN3 IN4).
Weights: [formula on slide]
Note: we have no Malay-Indonesian bi-text, so we pivot.
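The weight formula is not preserved in this transcript. A commonly used way to pivot is to marginalize over the shared English words, p(in | ml) = Σ_en p(en | ml) · p(in | en), and the sketch below assumes that form. The probability tables and the function name `pivot_paraphrases` are illustrative only.

```python
from collections import defaultdict


def pivot_paraphrases(p_en_given_ml, p_in_given_en):
    """Induce Malay-to-Indonesian paraphrase weights by pivoting over English.

    p_en_given_ml: {malay_word: {english_word: probability}}
    p_in_given_en: {english_word: {indonesian_word: probability}}
    Assumes p(in | ml) = sum over en of p(en | ml) * p(in | en).
    """
    paraphrases = {}
    for malay, english_probs in p_en_given_ml.items():
        scores = defaultdict(float)
        for english, p_en in english_probs.items():
            for indonesian, p_in in p_in_given_en.get(english, {}).items():
                scores[indonesian] += p_en * p_in
        paraphrases[malay] = dict(scores)
    return paraphrases


if __name__ == "__main__":
    # Toy lexical translation tables with made-up probabilities.
    ml_to_en = {"peratus": {"percent": 0.5, "per": 0.5}}
    en_to_in = {
        "percent": {"persen": 1.0},
        "per": {"per": 0.5, "persen": 0.5},
    }
    print(pivot_paraphrases(ml_to_en, en_to_in))
```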
15
Word-Level Adaptation: Issue 1
IN-EN bi-text is small, thus:
- Unreliable IN-EN word alignments, hence bad ML-IN paraphrases
Solution: improve IN-EN alignments using the ML-EN bi-text
- concatenate: IN-EN*k + ML-EN, with k ≈ |ML-EN| / |IN-EN|
- run word alignment
- get the alignments for one copy of IN-EN only
Works because of cognates between Malay and Indonesian.
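A minimal sketch of the balancing arithmetic, assuming corpus size is measured in sentence pairs (the slide does not say whether |·| counts sentences or tokens) and that the first copy of IN-EN is the one whose alignments are kept. The function names are hypothetical, and the word aligner itself is outside the scope of this sketch.

```python
def balanced_concatenation(in_en, ml_en):
    """Repeat the small IN-EN bi-text k times and append ML-EN.

    k is chosen as roughly |ML-EN| / |IN-EN|, measured here in sentence
    pairs (an assumption), and is never smaller than 1.
    """
    k = max(1, round(len(ml_en) / len(in_en)))
    return in_en * k + ml_en, k


def alignments_for_one_copy(all_alignments, n_in_en):
    """Keep only the alignments of the first IN-EN copy.

    all_alignments: one alignment per sentence pair, in the same order as
    the concatenated bi-text passed to the word aligner.
    """
    return all_alignments[:n_in_en]


if __name__ == "__main__":
    in_en = [("in1", "en1"), ("in2", "en2")]
    ml_en = [("ml%d" % i, "en%d" % i) for i in range(6)]
    combined, k = balanced_concatenation(in_en, ml_en)
    print(k, len(combined))
```

Repeating IN-EN keeps the small corpus from being swamped by ML-EN statistics during alignment, while the Malay data still contributes evidence for shared words.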
16
Word-Level Adaptation: Issue 2
IN-EN bi-text is small, thus:
- Small IN vocabulary for the ML-IN paraphrases
Solution: add cross-lingual morphological variants:
- Given ML word: seperminuman
- Find ML lemma: minum
- Propose all known IN words sharing the same lemma: diminum, diminumkan, diminumnya, makan-minum, makananminuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum
Note: The IN variants are from a larger monolingual IN text.
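The lookup described here can be sketched as an index from lemma to known Indonesian word forms. The lemmatizer below is a toy dictionary standing in for a real Malay/Indonesian morphological analyzer, and all names (`build_lemma_index`, `morphological_variants`, `toy_lemmatize`) are hypothetical.

```python
from collections import defaultdict


def build_lemma_index(indonesian_vocab, lemmatize):
    """Group known Indonesian word forms by their lemma."""
    index = defaultdict(set)
    for word in indonesian_vocab:
        index[lemmatize(word)].add(word)
    return index


def morphological_variants(malay_word, lemmatize, index):
    """Propose all known Indonesian words that share the Malay word's lemma."""
    return sorted(index.get(lemmatize(malay_word), ()))


if __name__ == "__main__":
    # Toy lemma table; a real system would use a morphological analyzer.
    toy_lemmas = {
        "seperminuman": "minum",
        "meminum": "minum",
        "minuman": "minum",
        "peminum": "minum",
        "makan": "makan",
    }

    def toy_lemmatize(word):
        return toy_lemmas.get(word, word)

    vocab = ["meminum", "minuman", "peminum", "makan"]
    index = build_lemma_index(vocab, toy_lemmatize)
    print(morphological_variants("seperminuman", toy_lemmatize, index))
```

Building the index from a large monolingual Indonesian text, as the note on the slide suggests, is what lets the variants go beyond the small IN-EN vocabulary.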
17
Word-Level Adaptation: Issue 3
Word-level pivoting:
- Ignores context, and relies on the LM
- Cannot drop/insert/merge/split/reorder words
Solution: phrase-level pivoting
- Build ML-EN and EN-IN phrase tables
- Induce an ML-IN phrase table (pivoting over EN)
- Adapt the ML side of ML-EN to get the “IN”-EN bi-text, using the Indonesian LM and n-best “IN” as before
- Also, use cross-lingual morphological variants
Advantages:
- Models context better: not only Indonesian LM, but also phrases.
- Allows many word operations, e.g., insertion, deletion.
18
Step 2: Combining IN-EN + “IN”-EN
19
Combining IN-EN and “IN”-EN bi-texts
Simple concatenation: IN-EN + “IN”-EN
Balanced concatenation: IN-EN * k + “IN”-EN
Sophisticated phrase table combination (Nakov and Ng, EMNLP 2009), (Nakov and Ng, JAIR 2012):
- Improved word alignments for IN-EN
- Phrase table combination with extra features
Reference: Preslav Nakov and Hwee Tou Ng. Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. EMNLP 2009.
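As a rough illustration of what "phrase table combination with extra features" can look like, the sketch below merges two phrase tables, prefers the scores from the original IN-EN table when both contain a phrase pair, and appends two indicator features recording where each pair came from. This is a generic scheme written for illustration and not necessarily the exact feature set used in the cited papers; the table layout and the name `combine_phrase_tables` are assumptions.

```python
def combine_phrase_tables(primary, secondary):
    """Merge two phrase tables, marking the origin of each phrase pair.

    primary, secondary: {(source_phrase, target_phrase): [feature scores]}
    Pairs found in primary keep primary's scores. Two indicator features
    are appended: [from_primary, from_secondary].
    """
    combined = {}
    for pair, features in primary.items():
        in_secondary = 1.0 if pair in secondary else 0.0
        combined[pair] = list(features) + [1.0, in_secondary]
    for pair, features in secondary.items():
        if pair not in combined:
            combined[pair] = list(features) + [0.0, 1.0]
    return combined


if __name__ == "__main__":
    # Toy tables with a single made-up translation score per phrase pair.
    in_en = {("persen", "percent"): [0.9]}
    adapted_in_en = {
        ("persen", "percent"): [0.7],
        ("pdb", "gdp"): [0.8],
    }
    for pair, features in combine_phrase_tables(in_en, adapted_in_en).items():
        print(pair, features)
```

The indicator features let the tuning step learn how much to trust phrase pairs that come only from the adapted bi-text.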
20
Experiments & Evaluation
21
Data
Translation data (for IN-EN):
- IN2EN-train: 0.9M
- IN2EN-dev: 37K
- IN2EN-test: 37K
- EN-monoling.: 5M
Adaptation data (for ML-EN to “IN”-EN):
- ML2EN: 8.6M
- IN-monoling.: 20M (tokens)
22
Isolated Experiments: Training on “IN”-EN only
[Chart: BLEU scores]
System combination using MEMT (Heafield and Lavie, 2010)
23
Combined Experiments: Training on IN-EN + “IN”-EN
[Chart: BLEU scores]
24
Experiments: Improvements
[Chart: BLEU scores]
25
Application to Other Languages & Domains
Improve Macedonian-English SMT by adapting Bulgarian-English bi-text
- Adapt BG-EN (11.5M words) to “MK”-EN (1.2M words)
- OPUS movie subtitles
[Chart: BLEU scores]
26
Conclusion
27
Conclusion & Future Work
Adapt bi-texts for related resource-rich languages, using:
- confusion networks
- word-level & phrase-level paraphrasing
- cross-lingual morphological analysis
Achieved:
- +6.7 BLEU over ML2EN
- +2.6 BLEU over IN2EN
- +1.5-3.0 BLEU over comb(IN2EN,ML2EN)
Future work:
- add split/merge as word operations
- better integrate word-level and phrase-level methods
- apply our methods to other languages & NLP problems
Thank you!
Supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.
28
Further Analysis
29
Paraphrasing Non-Indonesian Malay Words Only
So, we do need to paraphrase all words.
30
Human Judgments
Is the adapted sentence better Indonesian than the original Malay sentence?
Judged on 100 random sentences.
Morphology yields worse top-3 adaptations but better phrase tables, due to coverage.
31
Reverse Adaptation
Idea: Adapt dev/test Indonesian input to “Malay”, then translate with a Malay-English system.
Input to SMT:
- “Malay” lattice
- 1-best “Malay” sentence from the lattice
Adapting dev/test is worse than adapting the training bi-text. So, we need both n-best and LM.
32
Related Work
33
Related Work (1)
Machine translation between related languages, e.g.:
- Cantonese–Mandarin (Zhang, 1998)
- Czech–Slovak (Hajic & al., 2000)
- Turkish–Crimean Tatar (Altintas & Cicekli, 2002)
- Irish–Scottish Gaelic (Scannell, 2006)
- Bulgarian–Macedonian (Nakov & Tiedemann, 2012)
We do not translate (no training data), we “adapt”.
34
Related Work (2)
Adapting dialects to standard language (e.g., Arabic) (Bakr & al., 2008; Sawaf, 2010; Salloum & Habash, 2011)
- manual rules
Normalizing Tweets and SMS (Aw & al., 2006; Han & Baldwin, 2011)
- informal text: spelling, abbreviations, slang
- same language
35
Related Work (3)
Adapt Brazilian to European Portuguese (Marujo & al., 2011)
- rule-based, language-dependent
- tiny improvements for SMT
Reuse bi-texts between related languages (Nakov & Ng, 2009)
- no language adaptation (just transliteration)