The Impact of Arabic Morphological Segmentation on Broad-Scale Phrase-based SMT
Alon Lavie and Hassan Al-Haj
Language Technologies Institute, Carnegie Mellon University
With contributions from Kenneth Heafield, Silja Hildebrand and Michael Denkowski
Haifa MT Workshop, January 26, 2011

Prelude
Among the things I work on these days:
–METEOR
–MT System Combination (MEMT)
–Start-Up: Safaba Translation Solutions
Important component in all three:
–The METEOR Monolingual Knowledge-Rich Aligner

The METEOR Monolingual Aligner
Developed as a component in our METEOR automated MT evaluation system
Originally word-based, extended to phrasal matches
Finds a maximal one-to-one alignment with minimal “crossing branches” (reordering)
Allows alignment of:
–Identical words
–Morphological variants of words (using stemming)
–Synonymous words (based on WordNet synsets)
–Single and multi-word paraphrases (based on statistically-learned and filtered paraphrase tables)
Implementation: efficient search algorithm for the best-scoring weighted string match
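To make the matching idea concrete, here is a toy Python sketch of a one-to-one aligner that uses only exact and stem matches. The crude stemmer and the greedy matching pass are illustrative assumptions; the real METEOR aligner also uses WordNet synonyms and paraphrase tables and performs a proper search for the alignment with the fewest crossing links.

```python
# Toy sketch of a METEOR-style monolingual aligner (exact + stem matches only).
# The real METEOR aligner also uses WordNet synonyms and paraphrase tables and
# searches for the maximal one-to-one alignment with the fewest crossing links,
# rather than the greedy pass shown here.

def stem(word):
    # Crude stand-in for a real stemmer (e.g. Porter); illustration only.
    return word.lower().rstrip("s")

def align(hyp, ref):
    """Greedy one-to-one alignment: exact matches first, then stem matches."""
    links = {}           # hyp index -> ref index
    used_ref = set()
    for matcher in (lambda a, b: a.lower() == b.lower(),
                    lambda a, b: stem(a) == stem(b)):
        for i, h in enumerate(hyp):
            if i in links:
                continue
            for j, r in enumerate(ref):
                if j not in used_ref and matcher(h, r):
                    links[i] = j
                    used_ref.add(j)
                    break
    return sorted(links.items())

def crossings(links):
    """Number of crossing link pairs -- the quantity METEOR tries to minimize."""
    return sum(1 for (i1, j1) in links for (i2, j2) in links
               if i1 < i2 and j1 > j2)

if __name__ == "__main__":
    hyp = "the cats sat on the mat".split()
    ref = "the cat was sitting on the mat".split()
    a = align(hyp, ref)
    print(a, "crossings:", crossings(a))
```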

The Monolingual Aligner – Examples
(Alignment examples shown as figures on the original slide.)

Multi-lingual METEOR
Latest version: METEOR 1.2
Support for:
–English: exact/stem/synonyms/paraphrases
–Spanish, French, German: exact/stem/paraphrases
–Czech: exact/paraphrases
METEOR-tuning:
–Version of METEOR for MT system parameter optimization
–Preliminary promising results
–Stay tuned…
METEOR is free and open-source.

METEOR Analysis Tools
METEOR v1.2 comes with a suite of new analysis and visualization tools called METEOR-XRAY

And now to our Feature Presentation…

Motivation
Morphological segmentation and tokenization decisions are important in phrase-based SMT
–Especially for morphologically-rich languages
These decisions impact the entire pipeline of training and decoding components
Their impact is often difficult to predict in advance
Goal: a detailed investigation of this issue in the context of phrase-based SMT between English and Arabic
–Focus on segmentation/tokenization of the Arabic (not the English)
–Focus on translation from English into Arabic

Research Questions
Do Arabic segmentation/tokenization decisions make a significant difference even in large training data scenarios?
English-to-Arabic vs. Arabic-to-English: what works best and why?
Additional considerations or impacts when translating into Arabic (due to detokenization)
Output variation and potential for system combination?

Methodology
Common large-scale training data scenario (NIST MT 2009 English-Arabic)
Build a rich spectrum of Arabic segmentation schemes (nine different schemes)
–Based on a common detailed morphological analysis using MADA (Habash et al.)
Train nine complete end-to-end English-to-Arabic (and Arabic-to-English) phrase-based SMT systems using Moses (Koehn et al.)
Compare and analyze performance differences

Arabic Morphology
Rich inflectional morphology, with several classes of clitics and affixes that attach to the word:
conj + part + art + base + pron

Arabic Orthography
Deficient (and sometimes inconsistent) orthography:
–Deletion of short vowels and most diacritics
–Inconsistent use of أ, آ, إ, ا
–Inconsistent use of ى, ي
Common treatment (Arabic → English):
–Normalize the inconsistent forms by collapsing them
Clearly undesirable for MT into Arabic:
–Enrich: use MADA to disambiguate and produce the full form
–Correct full forms enforced in training, decoding and evaluation
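For illustration, a minimal sketch of the kind of normalization commonly applied in the Arabic → English direction (the character mapping and function are illustrative; the enrichment approach used here for English → Arabic keeps the full forms instead):

```python
# Minimal sketch of the orthographic normalization typically applied for
# Arabic -> English MT: collapse hamza/alef variants and final alef-maqsura.
# For English -> Arabic, the full (enriched) forms are kept instead.

NORMALIZATION = str.maketrans({"أ": "ا", "آ": "ا", "إ": "ا", "ى": "ي"})

def normalize(text: str) -> str:
    return text.translate(NORMALIZATION)

print(normalize("إلى الأمام"))   # -> "الي الامام" (all variants collapsed)
```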

Arabic Segmentation and Tokenization Schemes
Based on a common morphological analysis by MADA and tokenization by TOKAN (Habash et al.)
Explored nine schemes (coarse to fine):
–UT: unsegmented (full enriched form)
–S0: w + REST
–S1: w|f + REST
–S2: w|f + part|art + REST
–S3: w|f + part/s|art + base + pron-MF (original PATB)
–S4: w|f + part|art + base + pron-MF (ATBv3)
–S4SF: w|f + part|art + base + pron-SF
–S5: w|f + part + art + base + pron-MF
–S5SF: w|f + part + art + base + pron-SF
(The -MF schemes use the morphological form of the pronominal suffix; the -SF schemes use its surface form.)

Arabic Segmentation Schemes
(Table on the original slide: token counts for the MT02 test set – 728 sentences of unsegmented words – under each segmentation scheme.)

Previous Work
Most previous work has looked at these choices in the context of Arabic → English MT
–Most common approach is to use PATB or ATBv3
(Badr et al. 2006) investigated segmentation impact in the context of English → Arabic
–Much smaller-scale training data
–Only a small subset of our schemes

Arabic Detokenization
An English-to-Arabic MT system trained on segmented Arabic forms will decode into segmented Arabic
–Need to put the segments back together into full-form words
–Non-trivial, because the mapping isn't simple concatenation and isn't always one-to-one
–Detokenization can introduce errors
–The more segmented the scheme, the more potential errors in detokenization

Arabic Detokenization
We experimented with several detokenization methods (a toy sketch of the T+R backoff follows below):
–C: simple concatenation
–R: list of detokenization rules (Badr et al. 2006)
–T: mapping table constructed from the training data (with likelihoods)
–T+C: table method with backoff to C
–T+R: table method with backoff to R
–T+R+LM: the T+R method augmented with a 5-gram LM of full forms and a Viterbi search for the maximum-likelihood sequence
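A toy sketch of the T+R idea: look each segmented word group up in a table learned from the training data, and back off to simple rules/concatenation. The '+' marking convention, the table entry and the fix-up rule are illustrative assumptions, not the actual MADA/TOKAN markup or the exact rule list of Badr et al. (2006).

```python
# Sketch of the T+R detokenization method: look each segmented word group up in
# a table learned from training data, and back off to simple rules /
# concatenation.  Assumes prefixes are marked with a trailing '+' and suffixes
# with a leading '+' -- an illustrative convention only.

TABLE = {
    # segmented group -> most frequent full form seen in the training data
    ("l+", "Al+", "byt"): "llbyt",       # illustrative entry (l + Al -> ll)
}

def rules(group):
    """Backoff: concatenate, with an orthographic fix-up in the spirit of the
    Badr et al. (2006) rule list (illustrative only)."""
    word = "".join(tok.strip("+") for tok in group)
    return word.replace("lAl", "ll")     # l + Al... -> ll...

def group_tokens(tokens):
    """Group segmented tokens back into word groups."""
    groups, cur = [], []
    for tok in tokens:
        if not tok.startswith("+") and cur and not cur[-1].endswith("+"):
            groups.append(cur)
            cur = []
        cur.append(tok)
    if cur:
        groups.append(cur)
    return groups

def detokenize(tokens):
    return " ".join(TABLE.get(tuple(g), rules(g)) for g in group_tokens(tokens))

print(detokenize(["w+", "b+", "ktAb", "+hm", "jdyd"]))   # -> wbktAbhm jdyd
```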

Arabic Detokenization
Evaluation set: 50K sentences (~1.3 million words) from the NIST MT 2009 training data
The rest of the NIST MT 2009 training data was used to construct the mapping table T and train the LM
Evaluated using sentence error rate (SER)
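SER here simply counts the sentences whose detokenized output fails to reproduce the original full-form sentence exactly; a minimal sketch:

```python
# Sentence error rate for detokenization: fraction of sentences whose
# detokenized form does not exactly match the original full-form sentence.
def sentence_error_rate(detokenized, references):
    assert len(detokenized) == len(references)
    errors = sum(d != r for d, r in zip(detokenized, references))
    return errors / len(references)

# e.g. sentence_error_rate(["wbktAbhm jdyd"], ["wbktAbhm jdyd"]) -> 0.0
```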

Experimental Setup
NIST MT 2009 constrained training parallel data for Arabic-English:
–~5 million sentence pairs
–~150 million unsegmented Arabic words
–~172 million unsegmented English words
Preprocessing:
–English tokenized using the Stanford tokenizer and lower-cased
–Arabic analyzed by MADA, then tokenized using scripts and TOKAN according to the nine schemes
Data filtering: sentence pairs with more than 99 tokens on either side, or a length ratio of more than 4-to-1, were filtered out
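A minimal sketch of the length/ratio filter described in the last bullet (thresholds from the slide; the file handling and function names are illustrative):

```python
# Sketch of the parallel-corpus filtering step: drop sentence pairs where
# either side has more than 99 tokens or the length ratio exceeds 4:1.

def keep(src_line, tgt_line, max_len=99, max_ratio=4.0):
    s, t = src_line.split(), tgt_line.split()
    if not s or not t or len(s) > max_len or len(t) > max_len:
        return False
    return max(len(s), len(t)) / min(len(s), len(t)) <= max_ratio

def filter_corpus(src_path, tgt_path, out_src, out_tgt):
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft, \
         open(out_src, "w", encoding="utf-8") as fo_s, \
         open(out_tgt, "w", encoding="utf-8") as fo_t:
        for s, t in zip(fs, ft):
            if keep(s, t):
                fo_s.write(s)
                fo_t.write(t)
```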

Tuning and Testing Data
Use the existing NIST MT02, MT03, MT04, MT05 test sets developed for Arabic → English
–Four English translation references for each Arabic sentence
–Create English → Arabic sets by selecting the first English reference
–Use MT02 for tuning
–Use MT03, MT04 and MT05 for testing

Training and Testing Setup
Standard training pipeline using Moses:
–Word alignment of tokenized data using MGIZA++
–Symmetrized using grow-diag-final-and
–Phrase extraction with a maximum phrase length of 7
–Lexicalized distortion model, conditioned on both sides
Language model: 5-gram SRI-LM trained on the tokenized Arabic side of the parallel data (152 million words)
–Also trained a 7-gram LM for S4 and S5
Tuning: MERT to BLEU-4 on MT02
Decode with Moses on MT03, MT04 and MT05
Detokenize with the T+R method
Score with BLEU, TER and METEOR on the detokenized output

English-to-Arabic Results
(Results tables for MT03, MT04 and MT05 shown on the original slides; numeric scores not preserved in this transcript.)

Analysis
Complex picture:
–Some decompositions help; others don't help, or even hurt performance
Segmentation decisions really matter, even with large amounts of training data:
–Difference between best (S0) and worst (S5SF) on MT03: +2.6 BLEU and +2.7 METEOR points (TER also improves)
Map key reminder:
–S0: w+REST; S2: conj+part|art+REST; S4 (ATBv3): split everything except the art; S5: split everything (pron in morphological form)
S0 and S4 consistently perform the best, and are about equal
S2 and S5 consistently perform the worst
S4SF and S5SF are usually worse than S4 and S5

Analysis
The simple decomposition S0 (just the “w” conjunction) works as well as any deeper decomposition
S4 (ATBv3) also works well for MT into Arabic
Decomposing the Arabic definite article consistently hurts performance
Decomposing the prefix particles sometimes hurts
Decomposing the pronominal suffixes (MF or SF) consistently helps performance
The 7-gram LM does not appear to help compensate for the more fragmented S4 and S5

Analysis: Phrase Tables
Phrase table filtered to the MT03 test set (source-side matches)
–PTE = phrase table entropy
–ANTPn = average number of translations for source phrases of length n
(Statistics table shown on the original slide.)
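Assuming the usual definitions of these statistics (PTE as the entropy of p(target | source) averaged over source phrases, ANTPn as the mean number of translation options for source phrases of length n), here is a sketch of how they can be computed from a Moses-style phrase table. The score-column index is an assumption about the table layout.

```python
# Sketch of the two phrase-table statistics, assuming their usual definitions:
#   PTE    -- entropy of p(target | source), averaged over source phrases
#   ANTPn  -- average number of translation options for source phrases of length n
# Assumes a Moses-style table: "source ||| target ||| score1 score2 score3 ...".

import math
from collections import defaultdict

def load_phrase_table(path):
    table = defaultdict(list)            # source phrase -> list of p(t|s)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(" ||| ")
            src, scores = fields[0], fields[2].split()
            # The direct probability p(target|source) is conventionally the
            # third score in Moses tables; adjust the index if yours differs.
            table[src].append(float(scores[2]))
    return table

def phrase_table_entropy(table):
    entropies = []
    for probs in table.values():
        z = sum(probs)                   # renormalize (the table is filtered)
        entropies.append(-sum(p / z * math.log2(p / z) for p in probs if p > 0))
    return sum(entropies) / len(entropies)

def antp(table, n):
    counts = [len(v) for s, v in table.items() if len(s.split()) == n]
    return sum(counts) / len(counts) if counts else 0.0
```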

Analysis
Clear evidence that splitting off the Arabic definite article is bad for English → Arabic:
–S4 → S5 results in a 22% increase in phrase-table size
–Significant increase in translation ambiguity for short phrases
–Inhibits extraction of some longer phrases
–Allows ungrammatical phrases to be generated:
Middle East → Al$rq Al>wsT
Middle East → $rq >qsT
Middle East → $rq Al>wsT

Output Variation
How different are the translation outputs from these MT system variants?
–Upper bound: oracle combination of the single-best hypotheses from the different systems
Select the best-scoring output from the nine variants (based on posterior scoring against the reference; a toy sketch of this selection follows below)
–Work in progress – actual system combination:
Hypothesis selection
CMU Multi-Engine MT (MEMT) approach
MBR
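A sketch of the oracle selection step: for each test sentence, keep the hypothesis among the nine variants that scores best against the reference. A smoothed sentence-level BLEU stands in here for the posterior score; the metric and smoothing used in the actual experiments may differ.

```python
# Sketch of the oracle combination: for every test sentence, pick the
# highest-scoring hypothesis among the system variants, scoring each
# hypothesis against the reference with a smoothed sentence-level BLEU.

import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    hyp, ref = hyp.split(), ref.split()
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum((h & r).values())
        total = max(sum(h.values()), 1)
        log_prec += math.log((match + 1) / (total + 1))   # add-one smoothing
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))      # brevity penalty
    return bp * math.exp(log_prec / max_n)

def oracle_select(system_outputs, references):
    """system_outputs: list of per-system hypothesis lists, aligned by sentence."""
    oracle = []
    for i, ref in enumerate(references):
        candidates = [outputs[i] for outputs in system_outputs]
        oracle.append(max(candidates, key=lambda c: sentence_bleu(c, ref)))
    return oracle
```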

Oracle Combination
(Tables on the original slide: BLEU / TER / METEOR for the best individual system – S0 on MT03 and MT05, S4 on MT04 – versus the oracle combination, on MT03, MT04 and MT05; numeric scores not preserved in this transcript.)

Output Variation
Oracle gains of 5-7 BLEU points from selecting among the nine variant hypotheses
–Very significant variation in the output!
–Better than what we typically see from oracle selections over large n-best lists (for n = 1000)

Arabic-to-English
Running a similar set of experiments in the Arabic → English direction
–Use all four English references for tuning and testing
–A single, identical English LM for all systems
Intuitive prediction on the magnitude of differences between systems?
–Smaller, the same, or larger?

Arabic-to-English Results
(Table on the original slide: BLEU / TER / METEOR on MT03 for UT, S0–S5, S4SF and S5SF; numeric scores not preserved in this transcript.)

Analysis
Results are preliminary
There are still some significant differences between the system variants
–Less pronounced than for English → Arabic
The segmentation schemes that work best are different than in the English → Arabic direction
S4 (ATBv3) works well, but isn't the best
More fragmented segmentations appear to work better
Segmenting the Arabic definite article is no longer a problem
–S5 works well now
We can leverage the output variation
–Preliminary hypothesis-selection experiments show nice gains

Conclusions
Arabic segmentation schemes have a significant impact on system performance, even in very large training data settings
–Differences of up to 2.6 BLEU points between system variants (best vs. worst on MT03)
Complex picture of which morphological segmentations help and which hurt performance
–The picture is different in the two translation directions
–Simple schemes work well for English → Arabic, less so for Arabic → English
–Splitting off the Arabic definite article hurts for English → Arabic
The significant variation in the output of the system variants can be leveraged for system combination

Current and Future Work
System combination experiments
–Hypothesis selection, MEMT and MBR
–Contrast with lattice decoding (Dyer, 2008) and with combining phrase tables
Arabic-to-English experiments
Is there a better way to do this for other languages?

References
Al-Haj, H. and A. Lavie. "The Impact of Arabic Morphological Segmentation on Broad-coverage English-to-Arabic Statistical Machine Translation". In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas (AMTA-2010), Denver, Colorado, November 2010.
Al-Haj, H. and A. Lavie. "The Impact of Arabic Morphological Segmentation on Broad-coverage English-to-Arabic Statistical Machine Translation". MT Journal special issue on Arabic MT. Under review.