Morphological Preprocessing for Statistical Machine Translation Nizar Habash Columbia University NLP Meeting 10/19/2006.



Road Map
Hybrid MT at Columbia
Morphological Preprocessing for SMT (Habash & Sadat, NAACL 2006)
Combination of Preprocessing Schemes (Sadat & Habash, ACL 2006)

Why Hybrid MT?
StatMT and RuleMT have complementary advantages
– RuleMT: Handling of possible but unseen word forms
– StatMT: Robust translation of seen words
– RuleMT: Better global target syntactic structure
– StatMT: Robust local phrase-based translation
– RuleMT: Cross-genre generalizations/robustness
– StatMT: Robust within-genre translation
StatMT and RuleMT use complementary resources
– Parallel corpora vs. dictionaries, parsers, analyzers, linguists
Hybrids can potentially improve over either approach

Hybrid MT Challenges
– Linguistic phrase versus StatMT phrase (e.g. “. on the other hand, the”)
– Meaningful probabilities for linguistic resources
– Increased system complexity
– The potential to produce the combined worst rather than the combined best
– Low Arabic parsing performance (~70% Parseval F-score)
– Statistical hallucinations

Hybrid MT Continuum
“Hybrid” is a moving target
– StatMT systems use some rule-based components: orthographic normalization, number/date translation, etc.
– RuleMT systems nowadays use statistical n-gram language modeling
Hybrid MT systems
– Different mixes of statistical/rule-based components, depending on resource availability
– General approach directions: adding rules/linguistics to StatMT systems; adding statistics/statistical resources to RuleMT systems
– Depth of hybridization: morphology, syntax, semantics

Columbia MT Projects
Arabic-English MT focus, with different hybrid approaches:
– Virgo: Syntax-aware SMT; with Halim Abbas (Columbia student)
– Qumqum: Statistically Enriched Generation-Heavy MT; with Bonnie Dorr & Necip Ayan (University of Maryland) and Christof Monz (University of London)
– SMT + MADA: Morphology Enriched SMT; with Fatiha Sadat, Roland Kuhn & George Foster (NRC Canada)

Columbia MT Projects
Arabic-English MT focus, with different hybrid approaches — MTEval submissions:
– Virgo (Syntax-aware SMT): first-time primary submission
– Qumqum (Statistically Enriched Generation-Heavy MT): second-time contrast submission (2005, 2006); the 2005 submission was part of UMD’s
– SMT + MADA (Morphology Enriched SMT): MADA used in MTEval 2006 submissions by NRC and RWTH

System Overview (diagram)
Three systems placed along the Koehn hybrid scale between RuleMT and StatMT: GHMT+SMT, Syntax+SMT, and SMT+Morph (the Columbia contrast and primary submissions). Resources: a parallel corpus (5M words), a language model (1G words), morphology, syntax, and rule-based reordering.

Research Directions
– Syntactic SMT preprocessing
– Syntax-aware phrase extraction
– Statistical linearization using richer CFGs
– Creation and integration of rule-generated phrase tables
– Lowering dependence on source language resources
– Extension to other languages and dialects

Road Map
Hybrid MT at Columbia
Morphological Preprocessing for SMT
– Linguistic Issues
– Previous Work
– Schemes and Techniques
– Evaluation
Combination of Preprocessing Schemes

Arabic Linguistic Issues
Rich morphology
– Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]]
  w+ l+ Al+ mktb → and+ for+ the+ office
– Morphotactics: w+l+Al+mktb → wllmktb (و+ل+ال+مكتب → وللمكتب)
Ambiguity
– وجد wjd ‘he found’
– و+جد w+jd ‘and+grandfather’
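The clitic splitting and the morphotactic rule above can be sketched as a toy rule over Buckwalter transliteration. This is only an illustrative stand-in, not the MADA/TOKAN pipeline; it also shows why disambiguation is needed, since it blindly splits any initial w.

```python
# Toy decliticization sketch (illustrative only, not MADA/TOKAN):
# split conjunction w+, preposition l+, and determiner Al+ off a
# Buckwalter-transliterated word, undoing the morphotactic rule
# that deletes the A of Al+ after the preposition l+.

def decliticize(word):
    """Return a list of clitic-separated tokens for a toy grammar."""
    tokens = []
    if word.startswith('w'):          # conjunction w+ 'and'
        tokens.append('w+')
        word = word[1:]
    if word.startswith('ll'):         # l+ + Al+ surfaces as ll (A elided)
        tokens.append('l+')
        word = 'Al' + word[2:]        # restore the elided A of the determiner
    elif word.startswith('l'):
        tokens.append('l+')
        word = word[1:]
    if word.startswith('Al'):         # determiner Al+ 'the'
        tokens.append('Al+')
        word = word[2:]
    tokens.append(word)               # remaining base
    return tokens

print(decliticize('wllmktb'))  # ['w+', 'l+', 'Al+', 'mktb']
print(decliticize('wjd'))      # ['w+', 'jd'] — wrong for wjd 'he found':
                               # this is exactly the ambiguity above
```

Without morphological disambiguation, wjd ‘he found’ is wrongly split as w+jd ‘and+grandfather’, which is why the schemes below rest on a disambiguated analysis.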

Previous Work
Morphological & syntactic preprocessing for SMT
– French-English (Berger et al., 1994)
– German-English (Nießen and Ney, 2000; 2004)
– Spanish, Catalan and Serbian to English (Popović and Ney, 2004)
– Czech-English (Goldwater and McClosky, 2005)
– Arabic-English (Lee, 2004)
We focus on morphological preprocessing
– Larger set of conditions: schemes, techniques, learning curve, genre variation
– No additional kinds of preprocessing (e.g. dates, numbers)

Road Map
Hybrid MT at Columbia
Morphological Preprocessing for SMT
– Linguistic Issues
– Previous Work
– Schemes and Techniques
– Evaluation
Combination of Preprocessing Schemes

Preprocessing Schemes
Input: wsyktbhA? ‘and he will write it?’
ST: wsyktbhA ?
D1: w+ syktbhA ?
D2: w+ s+ yktbhA ?
D3: w+ s+ yktb +hA ?
BW: w+ s+ y+ ktb +hA ?
EN: w+ s+ ktb/VBZ S:3MS +hA ?

Preprocessing Schemes
ST: Simple Tokenization
D1: Decliticize CONJ+
D2: Decliticize CONJ+, PART+
D3: Decliticize all clitics
BW: Morphological stem and affixes
EN: D3 + lemmatize + English-like POS tags + Subj
ON: Orthographic Normalization
WA: wa+ decliticization
TB: Arabic Treebank
L1: Lemmatize, Arabic POS tags
L2: Lemmatize, English-like POS tags
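The ST/D1/D2/D3 splits can be sketched as choices over a disambiguated analysis. The segmentation of wsyktbhA into conjunction, particle, base, and pronoun is taken from the example slide; the function and slot names are illustrative, not the paper’s implementation.

```python
# Sketch: given a disambiguated analysis of wsyktbhA as conjunction
# w+, particle s+, base yktb, pronoun +hA, apply the ST/D1/D2/D3
# schemes by choosing which clitics to split off.

ANALYSIS = {'conj': 'w+', 'part': 's+', 'base': 'yktb', 'pron': '+hA'}

def apply_scheme(a, scheme):
    conj, part, base, pron = a['conj'], a['part'], a['base'], a['pron']
    if scheme == 'ST':                      # no splitting at all
        return [conj.strip('+') + part.strip('+') + base + pron.strip('+')]
    if scheme == 'D1':                      # split conjunction only
        return [conj, part.strip('+') + base + pron.strip('+')]
    if scheme == 'D2':                      # split conjunction and particle
        return [conj, part, base + pron.strip('+')]
    if scheme == 'D3':                      # split all clitics
        return [conj, part, base, pron]
    raise ValueError(scheme)

for s in ('ST', 'D1', 'D2', 'D3'):
    print(s, apply_scheme(ANALYSIS, s))
# ST ['wsyktbhA']
# D1 ['w+', 'syktbhA']
# D2 ['w+', 's+', 'yktbhA']
# D3 ['w+', 's+', 'yktb', '+hA']
```

The outputs match the scheme table for the example input (punctuation handling omitted).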

Preprocessing Schemes (chart): word counts per scheme on MT04 (1,353 sentences).

Preprocessing Schemes (chart): scheme accuracy, measured against the Penn Arabic Treebank.

Preprocessing Techniques
REGEX: regular expressions
BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter 2002; 2004)
– Pick the first analysis
– Use TOKAN (Habash 2006): a generalized tokenizer that assumes a disambiguated morphological analysis and takes a declarative specification of any preprocessing scheme
MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow 2005)
– Multiple SVM classifiers + combiner
– Selects a BAMA analysis
– Use TOKAN

TOKAN
– A generalized tokenizer
– Assumes a disambiguated morphological analysis
– Declarative specification of any tokenization scheme:
  D1: w+ f+ REST
  D2: w+ f+ b+ k+ l+ s+ REST
  D3: w+ f+ b+ k+ l+ s+ Al+ REST +P: +O:
  TB: w+ f+ b+ k+ l+ REST +P: +O:
  BW: MORPH
  L1: LEXEME + POS
  EN: w+ f+ b+ k+ l+ s+ Al+ LEXEME + BIESPOS +S:
– Uses a generator (Habash 2006)
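A declarative, TOKAN-style driver can be sketched as follows. This is a hypothetical re-implementation under assumed slot names, not TOKAN’s actual format: a scheme spec lists which clitic slots to split off, and everything else merges into the REST token.

```python
# Hypothetical TOKAN-style driver (a sketch, not TOKAN's real format):
# a scheme spec names the clitic slots to split off; any segment whose
# slot is not named in the spec is merged into the REST token.

def tokenize(segments, spec):
    """segments: ordered (slot, form) pairs from a disambiguated
    analysis; spec: whitespace-separated slot names plus REST."""
    split_slots = set(spec.split()) - {'REST'}
    pre, rest, post = [], '', []
    for slot, form in segments:
        if slot in split_slots:
            (pre if not rest else post).append(form)  # keep clitic order
        else:
            rest += form.strip('+')                   # merge into REST
    return pre + ([rest] if rest else []) + post

# wsyktbhA analyzed as w+ (conj), s+ (part), yktb (base), +hA
# (pronoun; the slot name '+P:' is an assumption)
analysis = [('w+', 'w+'), ('s+', 's+'), ('REST', 'yktb'), ('+P:', '+hA')]
print(tokenize(analysis, 'w+ f+ b+ k+ l+ s+ REST'))              # D2
print(tokenize(analysis, 'w+ f+ b+ k+ l+ s+ Al+ REST +P: +O:'))  # D3
# ['w+', 's+', 'yktbhA']
# ['w+', 's+', 'yktb', '+hA']
```

Changing only the spec string switches schemes, which is the point of the declarative design.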

Road Map
Hybrid MT at Columbia
Morphological Preprocessing for SMT
– Linguistic Issues
– Previous Work
– Schemes and Techniques
– Evaluation
Combination of Preprocessing Schemes

Experiments
Portage phrase-based MT (Sadat et al., 2005)
Training data: 5 million words of parallel text only
– All in the news genre
– Learning curve: 1%, 10% and 100%
Language modeling: 250 million words
Development (tuning) data: MT03 eval set
Test data:
– MT04 (mixed genre: news, speeches, editorials)
– MT05 (all news)

Experiments (cont’d)
Metric: BLEU (Papineni et al., 2001)
– 4 references, case-insensitive
Each experiment:
– Select a preprocessing scheme
– Select a preprocessing technique
Some combinations do not exist (e.g. REGEX with EN)
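For reference, the metric can be sketched in a few lines: unsmoothed BLEU as the geometric mean of clipped n-gram precisions times a brevity penalty, case-insensitive, over multiple references. This is a minimal illustration, not the official evaluation scorer.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty; case-insensitive, multiple references."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        max_ref = Counter()                    # clip against max ref count
        for ref in refs:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0          # no smoothing: a zero precision zeroes BLEU
        log_prec += math.log(clipped / total) / max_n
    # brevity penalty against the closest reference length
    c = len(cand)
    r = min((abs(len(f) - c), len(f)) for f in refs)[1]
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(log_prec)

print(bleu("the cat sat down", ["The cat sat down"]))  # 1.0
```

Short sentences with no matching 4-gram score 0 under this unsmoothed form, which is why BLEU is normally reported at the document level.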

MT04 Results (chart): BLEU for the MADA, BAMA and REGEX techniques at 1%, 10% and 100% of the training data.

MT05 Results (chart): BLEU for the MADA, BAMA and REGEX techniques at 1%, 10% and 100% of the training data.

MT04 Genre Variation (chart): best schemes + technique at 1% and 100% of the training data; relative BLEU gains of +71%, +105%, +2% and +12% across the conditions.

Other Results
– Orthographic normalization (ON) generally did better than the ST baseline; statistically significant at 1% training data only
– wa+ decliticization (WA) was generally similar to D1
– The Arabic Treebank scheme (TB) was similar to D2
– Full lemmatization schemes behaved like EN, but always worse
– 50% training data: 50% of the data performed at least as well as 100%
– A larger phrase size (14) did not differ significantly from the size 8 we used

Latest Results (July 2006) (table): BLEU on MT04 and MT05 at training sizes of 5M and 111M words.

Road Map
Hybrid MT at Columbia
Morphological Preprocessing for SMT
Combination of Preprocessing Schemes

Oracle Combination
Preliminary study: oracle combination
– MT04, 100% data, MADA technique, 11 schemes, sentence-level selection
– Achieved 46.0 BLEU (a 24% improvement over the best single system at 37.1)
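Sentence-level oracle selection can be sketched as follows: for each input sentence, keep the candidate from whichever scheme scores best against the references. The function and `sentence_score` stand-in are illustrative; the study’s actual scoring function is not specified on the slide.

```python
# Sketch of sentence-level oracle combination: per sentence, pick the
# scheme whose output scores best against the references, then the
# oracle document is the concatenation of those picks.

def oracle_select(outputs_by_scheme, references, sentence_score):
    """outputs_by_scheme: {scheme: [hyp for each sentence]};
    references: [refs for each sentence];
    sentence_score(hyp, refs) -> float (higher is better)."""
    schemes = sorted(outputs_by_scheme)          # deterministic tie-breaking
    oracle = []
    for i, refs in enumerate(references):
        best = max(schemes,
                   key=lambda s: sentence_score(outputs_by_scheme[s][i], refs))
        oracle.append(outputs_by_scheme[best][i])
    return oracle

# toy usage with an exact-match "score" over two sentences
match = lambda hyp, refs: 1.0 if hyp in refs else 0.0
out = {'D2': ['a b', 'x'], 'ST': ['z', 'c d']}
print(oracle_select(out, [['a b'], ['c d']], match))  # ['a b', 'c d']
```

Because each sentence is free to come from a different scheme, the oracle upper-bounds what any single scheme, or any sentence-level combination method, could achieve.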

System Combination
– Exploit scheme complementarity to improve MT quality
– Explore two methods of system combination: Rescoring-Only Combination (ROC) and Decoding-plus-Rescoring Combination (DRC)
– We use all 11 schemes with the MADA technique

Rescoring-Only Combination (ROC)
– Rescore all the one-best outputs generated by the separate scheme-specific systems and return the top choice
– Each scheme-specific system uses its own scheme-specific preprocessing, phrase tables and decoding weights

Rescoring-Only Combination (ROC)
Standard combination features
– Trigram language model, phrase translation model, distortion model, and sentence length
– IBM model 1 and 2 probabilities in both directions
Additional features
– Perplexity of the source sentence (PPL) against a source LM (in the same scheme)
– Number of out-of-vocabulary words in the source sentence (OOV)
– Source sentence length (SL)
– An encoding of the specific scheme (SC)
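The rescoring step amounts to a weighted linear combination of per-hypothesis features. The sketch below uses the feature names from this slide (PPL, OOV, SL), but the values, weights, and hypothesis labels are illustrative assumptions, not the system’s tuned parameters.

```python
# Sketch of ROC rescoring: each scheme system contributes its one-best
# hypothesis with a feature vector; a log-linear model with tuned
# weights picks the overall best. Values and weights are illustrative.

def rescore(candidates, weights):
    """candidates: list of (hypothesis, {feature: value}) pairs;
    weights: {feature: weight}. Returns the best hypothesis under
    the weighted linear (log-domain) score."""
    def score(feats):
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(candidates, key=lambda c: score(c[1]))[0]

candidates = [
    ('hyp from D2', {'tm': -2.1, 'lm': -3.0, 'PPL': -180.0, 'OOV': 0, 'SL': 12}),
    ('hyp from D1', {'tm': -2.4, 'lm': -2.8, 'PPL': -150.0, 'OOV': 1, 'SL': 12}),
]
weights = {'tm': 1.0, 'lm': 1.0, 'PPL': 0.01, 'OOV': -0.5, 'SL': 0.0}
print(rescore(candidates, weights))  # 'hyp from D2'
```

In practice the weights would be tuned on the development set (MT03 here) rather than set by hand.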

Decoding-plus-Rescoring Combination (DRC)
Step 1: Decode
– For each preprocessing scheme: use the union of the phrase tables from all schemes, then optimize and decode (with the same scheme)
Step 2: Rescore
– Rescore the one-best outputs of each preprocessing scheme
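Step 1’s phrase-table union can be sketched as a merge of scheme-specific tables. The table format and the merge policy (keeping the best probability for duplicate phrase pairs) are assumptions for illustration; the slide does not specify how the actual system resolves duplicates.

```python
# Sketch of DRC step 1: union of scheme-specific phrase tables.
# Duplicate (src, tgt) pairs keep the best probability (an
# illustrative policy, not necessarily the system's).

def union_phrase_tables(tables):
    """tables: iterable of {(src_phrase, tgt_phrase): prob} dicts,
    one per preprocessing scheme."""
    merged = {}
    for table in tables:
        for pair, p in table.items():
            merged[pair] = max(merged.get(pair, 0.0), p)
    return merged

d2 = {('yktb', 'he writes'): 0.4}
d3 = {('yktb', 'he writes'): 0.6, ('+hA', 'it'): 0.5}
print(union_phrase_tables([d2, d3]))
# {('yktb', 'he writes'): 0.6, ('+hA', 'it'): 0.5}
```

The union lets every decoder see phrase pairs that were only learnable under some other scheme’s segmentation.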

Results (table)
MT04 set; the best single scheme, D2, scores 37.1 BLEU. The table compares combination methods — ROC with feature sets Standard, PPL+SC, PPL+SC+OOV, PPL+SC+OOV+SL and PPL+SC+SL, and DRC with +PPL+SC — over all schemes and over the 4 best schemes.

Results (chart): statistical significance tested with bootstrap resampling (Koehn, 2004), comparing DRC, ROC, D2, TB, D1, WA and ON.

Conclusions
– For large amounts of training data, splitting off conjunctions and particles performs best
– For small amounts of training data, an English-like tokenization performs best
– A suitable choice of preprocessing scheme and technique yields an important increase in BLEU when there is little training data or a change in genre between training and test
– System combination is potentially highly rewarding, especially when combining the phrase tables of different preprocessing schemes

Future Work
– Study additional variant schemes that the current results support
– Factored translation modeling
– Decoder extension to use multiple schemes in parallel
– Syntactic preprocessing
– Investigate combination techniques at the sentence and sub-sentence levels

Thank you! Questions? Nizar Habash