
1 Morphological Preprocessing for Statistical Machine Translation Nizar Habash Columbia University habash@cs.columbia.edu NLP Meeting 10/19/2006

2 Road Map
– Hybrid MT Research @ Columbia
– Morphological Preprocessing for SMT (Habash & Sadat, NAACL 2006)
– Combination of Preprocessing Schemes (Sadat & Habash, ACL 2006)

3 Why Hybrid MT?
StatMT and RuleMT have complementary advantages:
– RuleMT: handling of possible but unseen word forms; StatMT: robust translation of seen words
– RuleMT: better global target syntactic structure; StatMT: robust local phrase-based translation
– RuleMT: cross-genre generalizations/robustness; StatMT: robust within-genre translation
StatMT and RuleMT use complementary resources: parallel corpora vs. dictionaries, parsers, analyzers, linguists.
Hybrids can potentially improve over either approach.

4 Hybrid MT Challenges
– Linguistic phrase versus StatMT phrase (e.g., “. on the other hand, the”)
– Meaningful probabilities for linguistic resources
– Increased system complexity
– The potential to produce the combined worst rather than the combined best
– Low Arabic parsing performance (~70% Parseval F-score)
– Statistical hallucinations

5 Hybrid MT Continuum
“Hybrid” is a moving target:
– StatMT systems use some rule-based components (orthographic normalization, number/date translation, etc.)
– RuleMT systems nowadays use statistical n-gram language modeling
Hybrid MT systems differ in:
– Mix of statistical/rule-based components, driven by resource availability
– General approach direction: adding rules/linguistics to StatMT systems, or adding statistics/statistical resources to RuleMT systems
– Depth of hybridization: morphology, syntax, semantics

6 Columbia MT Projects
Arabic-English MT focus; different hybrid approaches.
System     | Approach                          | Collaborations
Virgo      | Syntax-aware SMT                  | Halim Abbas (Columbia student)
Qumqum     | Stat-Enriched Generation-Heavy MT | Bonnie Dorr & Necip Ayan (University of Maryland); Christof Monz (University of London)
SMT + MADA | Morphology-Enriched SMT           | Fatiha Sadat, Roland Kuhn & George Foster (NRC Canada)

7 Columbia MT Projects
Arabic-English MT focus; different hybrid approaches.
System     | Approach                          | MTEval Submissions
Virgo      | Syntax-aware SMT                  | First time 2006 – primary
Qumqum     | Stat-Enriched Generation-Heavy MT | Second time (2005, 2006) – contrast; the 2005 submission was part of UMD’s
SMT + MADA | Morphology-Enriched SMT           | MADA used in MTEval 2006 submissions by NRC and RWTH

8 System Overview
[Diagram: the systems placed on the Koehn hybrid scale from RuleMT (0) to StatMT (6) — GHMT+SMT (Columbia contrast), Syntax+SMT (Columbia primary), SMT+Morph — with the resources each draws on: parallel corpus (5M words), language model (1G words), morphology, syntax, rule-based reordering.]

9 Research Directions
– Syntactic SMT preprocessing
– Syntax-aware phrase extraction
– Statistical linearization using richer CFGs
– Creation and integration of rule-generated phrase tables
– Lowering dependence on source language resources
– Extension to other languages and dialects

10 Road Map
– Hybrid MT Research @ Columbia
– Morphological Preprocessing for SMT
  – Linguistic Issues
  – Previous Work
  – Schemes and Techniques
  – Evaluation
– Combination of Preprocessing Schemes

11 Arabic Linguistic Issues
Rich morphology:
– Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]], e.g. w+ l+ Al+ mktb ‘and + for + the + office’
– Morphotactics: the segments w+l+Al+mktb surface as the single word wllmktb (و+ل+ال+مكتب → وللمكتب)
Ambiguity:
– وجد wjd ‘he found’ vs. و+جد w+jd ‘and + grandfather’
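To make the clitic pattern concrete, here is a toy segmenter over Buckwalter-transliterated words. It is a deliberately naive sketch using greedy affix matching on a handful of clitics from this slide; as the wjd/w+jd ambiguity and the wllmktb morphotactics show, real segmentation needs a morphological analyzer, which is exactly what the techniques later in the talk provide.

```python
# Toy greedy clitic segmenter for Buckwalter-transliterated Arabic,
# following the template [CONJ+ [PART+ [DET+ BASE +PRON]]].
CONJ = ("w", "f")                    # and, so
PART = ("l", "b", "k", "s")          # for, by/with, as, will
DET  = ("Al",)                       # the
PRON = ("hA", "hm", "h", "k", "y")   # a few object/possessive pronouns

def segment(word):
    tokens = []
    for group in (CONJ, PART, DET):
        for clitic in group:
            # only strip a prefix if a plausible base (3+ chars) remains
            if word.startswith(clitic) and len(word) >= len(clitic) + 3:
                tokens.append(clitic + "+")
                word = word[len(clitic):]
                break
    for clitic in PRON:
        if word.endswith(clitic) and len(word) >= len(clitic) + 3:
            return tokens + [word[:-len(clitic)], "+" + clitic]
    return tokens + [word]

print(segment("wsyktbhA"))  # ['w+', 's+', 'yktb', '+hA']
# Greedy matching fails on morphotactics: in wllmktb the l+Al+ sequence
# has fused, so the Al+ determiner is never found.
print(segment("wllmktb"))   # ['w+', 'l+', 'lmktb'] -- wrong, needs analysis
```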

12 Previous Work
Morphological and syntactic preprocessing for SMT:
– French-English (Berger et al., 1994)
– German-English (Nießen and Ney, 2000; 2004)
– Spanish, Catalan and Serbian to English (Popović and Ney, 2004)
– Czech-English (Goldwater and McClosky, 2005)
– Arabic-English (Lee, 2004)
We focus on morphological preprocessing:
– Larger set of conditions: schemes, techniques, learning curve, genre variation
– No additional kinds of preprocessing (e.g. dates, numbers)

13 Road Map
– Hybrid MT Research @ Columbia
– Morphological Preprocessing for SMT
  – Linguistic Issues
  – Previous Work
  – Schemes and Techniques
  – Evaluation
– Combination of Preprocessing Schemes

14 Preprocessing Schemes
Input: wsyktbhA? ‘and he will write it?’
ST  wsyktbhA ?
D1  w+ syktbhA ?
D2  w+ s+ yktbhA ?
D3  w+ s+ yktb +hA ?
BW  w+ s+ y+ ktb +hA ?
EN  w+ s+ ktb/VBZ S:3MS +hA ?

15 Preprocessing Schemes
ST  Simple Tokenization
D1  Decliticize CONJ+
D2  Decliticize CONJ+, PART+
D3  Decliticize all clitics
BW  Morphological stem and affixes
EN  D3, Lemmatize, English-like POS tags, Subj
ON  Orthographic Normalization
WA  wa+ decliticization
TB  Arabic Treebank
L1  Lemmatize, Arabic POS tags
L2  Lemmatize, English-like POS tags
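Seen this way, most of these schemes are just different regroupings of the same disambiguated morpheme sequence. A minimal sketch over the running example (the morpheme type labels here are simplifications, not the full analyzer output):

```python
# Each scheme splits off a different subset of morpheme types and glues
# the rest back together. Running example: wsyktbhA 'and he will write it'.
ANALYSIS = [("w", "CONJ"), ("s", "PART"), ("y", "SUBJ"),
            ("ktb", "BASE"), ("hA", "PRON")]

def apply_scheme(morphemes, split_types):
    out, base = [], ""
    for form, mtype in morphemes:
        if mtype in split_types and not base:
            out.append(form + "+")                # proclitic
        elif mtype in split_types and base:
            out += [base, "+" + form]; base = ""  # enclitic
        else:
            base += form                          # fold into the base
    return " ".join(out + ([base] if base else []))

print(apply_scheme(ANALYSIS, set()))                             # ST: wsyktbhA
print(apply_scheme(ANALYSIS, {"CONJ"}))                          # D1: w+ syktbhA
print(apply_scheme(ANALYSIS, {"CONJ", "PART"}))                  # D2: w+ s+ yktbhA
print(apply_scheme(ANALYSIS, {"CONJ", "PART", "PRON"}))          # D3: w+ s+ yktb +hA
print(apply_scheme(ANALYSIS, {"CONJ", "PART", "SUBJ", "PRON"}))  # BW: w+ s+ y+ ktb +hA
```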


22 Preprocessing Schemes
[Chart: scheme statistics on MT04 — 1,353 sentences, 36,000 words.]

23 Preprocessing Schemes
[Chart: scheme accuracy, measured against the Penn Arabic Treebank.]

24 Preprocessing Techniques
– REGEX: regular expressions
– BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter, 2002; 2004); pick the first analysis and hand it to TOKAN (Habash, 2006), a generalized tokenizer that assumes a disambiguated morphological analysis and takes a declarative specification of any preprocessing scheme
– MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow, 2005); multiple SVM classifiers plus a combiner select a BAMA analysis, which is then handed to TOKAN

25 TOKAN
A generalized tokenizer:
– Assumes a disambiguated morphological analysis
– Declarative specification of any tokenization scheme:
    D1  w+ f+ REST
    D2  w+ f+ b+ k+ l+ s+ REST
    D3  w+ f+ b+ k+ l+ s+ Al+ REST +P: +O:
    TB  w+ f+ b+ k+ l+ REST +P: +O:
    BW  MORPH
    L1  LEXEME + POS
    EN  w+ f+ b+ k+ l+ s+ Al+ LEXEME + BIESPOS +S:
– Uses a generator (Habash, 2006)
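To illustrate the declarative idea, here is a toy interpreter for specs of the shape above. It is a sketch only: the real TOKAN spec language is richer, and the slot names in the toy analysis dictionary are illustrative assumptions.

```python
# A scheme spec lists proclitic slots (X+), the base (REST), and
# enclitic slots (+X:). Tokenizing = interpreting the spec against a
# disambiguated analysis; enclitics not named in the spec stay fused
# to the base (a full version would treat proclitics the same way).
SPECS = {
    "D2": "w+ f+ b+ k+ l+ s+ REST",
    "D3": "w+ f+ b+ k+ l+ s+ Al+ REST +P: +O:",
}
ENCLITIC_SLOTS = ("P", "O")  # the enclitic slots named in the specs above

def tokan(analysis, spec):
    fields = spec.split()
    split = {f.strip("+:") for f in fields if f.startswith("+")}
    out = []
    for field in fields:
        if field == "REST":
            out.append(analysis["base"] + "".join(
                analysis.get(s, "") for s in ENCLITIC_SLOTS if s not in split))
        elif field.endswith("+") and analysis.get(field[:-1]):
            out.append(field)                          # proclitic is present
        elif field.startswith("+") and analysis.get(field.strip("+:")):
            out.append("+" + analysis[field.strip("+:")])  # enclitic is present
    return " ".join(out)

# wsyktbhA: conjunction w, future particle s, base yktb, object pronoun hA
analysis = {"w": "w", "s": "s", "base": "yktb", "O": "hA"}
print(tokan(analysis, SPECS["D2"]))   # w+ s+ yktbhA
print(tokan(analysis, SPECS["D3"]))   # w+ s+ yktb +hA
```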

26 Road Map
– Hybrid MT Research @ Columbia
– Morphological Preprocessing for SMT
  – Linguistic Issues
  – Previous Work
  – Schemes and Techniques
  – Evaluation
– Combination of Preprocessing Schemes

27 Experiments
– Portage phrase-based MT (Sadat et al., 2005)
– Training data: 5 million words of parallel text only, all in the news genre; learning curve at 1%, 10% and 100%
– Language modeling: 250 million words
– Development (tuning) data: MT03 eval set
– Test data: MT04 (mixed genre: news, speeches, editorials) and MT05 (all news)

28 Experiments (cont’d)
– Metric: BLEU (Papineni et al., 2001), 4 references, case-insensitive
– Each experiment selects a preprocessing scheme and a preprocessing technique
– Some combinations do not exist (e.g., REGEX with EN)
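As a reference point for the metric, a minimal sketch of case-insensitive BLEU against four references, using NLTK's corpus_bleu as a stand-in for the official NIST mteval script (the sentences are invented):

```python
from nltk.translate.bleu_score import corpus_bleu

# One hypothesis per source sentence, four references each (toy data).
hypotheses = ["and he will write it"]
references = [["and he will write it",
               "he will write it",
               "and he is going to write it",
               "he is going to write it"]]

# Case-insensitive: lowercase before tokenizing, as in the experiments.
hyp_tok = [h.lower().split() for h in hypotheses]
ref_tok = [[r.lower().split() for r in refs] for refs in references]
print(corpus_bleu(ref_tok, hyp_tok))  # 1.0 here: exact match with a reference
```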

29 MT04 Results
[Chart: BLEU on MT04 by preprocessing technique (MADA, BAMA, REGEX) at 1%, 10% and 100% training data; MADA > BAMA > REGEX.]

30 MT05 Results
[Chart: BLEU on MT05 by preprocessing technique (MADA, BAMA, REGEX) at 1%, 10% and 100% training data; MADA > BAMA > REGEX.]

31 MT04 Genre Variation
[Chart: BLEU on MT04 by genre for the best scheme + technique combinations (EN+MADA at 1% training, D2+MADA at 100%); relative BLEU gains of +71%, +105%, +2% and +12%.]

32 Other Results
– Orthographic normalization (ON) generally did better than the ST baseline, statistically significant at 1% training data only
– wa+ decliticization (WA) was generally similar to D1
– The Arabic Treebank scheme (TB) was similar to D2
– Full lemmatization schemes behaved like EN but were always worse
– With 50% of the training data, D2 matched or exceeded ST at 100% of the data
– A larger maximum phrase size (14) did not differ significantly from the size 8 we used
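Since the ON scheme keeps coming up, here is a minimal sketch of what Arabic orthographic normalization typically involves; the exact rule set used for ON here is an assumption (the usual Alef and Ya conflations):

```python
import re

def normalize(text):
    """Conflate commonly confused Arabic letter variants."""
    text = re.sub("[\u0623\u0625\u0622]", "\u0627", text)  # أ/إ/آ -> ا (Alef)
    text = text.replace("\u0649", "\u064A")                # ى -> ي (Ya)
    return text

print(normalize("\u0625\u0644\u0649"))  # إلى -> الي
```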

33 Latest Results (July 2006)
Training size   5M      111M
MT04            37.1    43.7
MT05            38.56   48.87

34 Road Map
– Hybrid MT Research @ Columbia
– Morphological Preprocessing for SMT
– Combination of Preprocessing Schemes

35 Oracle Combination
Preliminary study: oracle combination.
– On MT04, 100% training data, MADA technique, 11 schemes, sentence-level selection
– Achieved 46.0 BLEU, a 24% improvement over the best single system (37.1)
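A sketch of what sentence-level oracle selection amounts to: for each test sentence, keep the scheme output that scores best against the references. Smoothed sentence-level BLEU is used here as the selection criterion, an assumption about how the oracle was computed:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def oracle_select(outputs_by_scheme, references):
    """outputs_by_scheme: {scheme: [sentence, ...]};
    references: one list of reference strings per test sentence."""
    chosen = []
    for i, refs in enumerate(references):
        ref_tok = [r.split() for r in refs]
        best_scheme = max(
            outputs_by_scheme,
            key=lambda s: sentence_bleu(ref_tok,
                                        outputs_by_scheme[s][i].split(),
                                        smoothing_function=smooth))
        chosen.append(outputs_by_scheme[best_scheme][i])
    return chosen
```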

36 System Combination
– Exploit scheme complementarity to improve MT quality
– Explore two methods of system combination: Rescoring-Only Combination (ROC) and Decoding-plus-Rescoring Combination (DRC)
– We use all 11 schemes with the MADA technique

37 Rescoring-Only Combination (ROC)
– Rescore all the one-best outputs generated by the separate scheme-specific systems and return the top choice
– Each scheme-specific system uses its own scheme-specific preprocessing, phrase tables and decoding weights

38 Rescoring-Only Combination (ROC)
Standard combination features:
– Trigram language model, phrase translation model, distortion model, and sentence length
– IBM model 1 and 2 probabilities in both directions
Additional features:
– Perplexity of the source sentence against a source LM in the same scheme (PPL)
– Number of out-of-vocabulary words in the source sentence (OOV)
– Source sentence length (SL)
– An encoding of the specific scheme (SC)
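ROC thus reduces to a weighted feature combination over the eleven one-best candidates. A schematic sketch, with toy feature functions standing in for the real models (the weights would be tuned on the dev set):

```python
def roc_select(candidates, features, weights):
    """candidates: list of (scheme, hypothesis) pairs, one per system;
    features: functions f(scheme, hypothesis) -> float;
    weights: tuned log-linear coefficients."""
    def score(item):
        scheme, hyp = item
        return sum(w * f(scheme, hyp) for w, f in zip(weights, features))
    return max(candidates, key=score)

# Toy stand-ins for two of the features above: sentence length (SL)
# and a scheme-identity encoding (SC).
features = [lambda s, h: -float(len(h.split())),
            lambda s, h: 1.0 if s == "D2" else 0.0]
candidates = [("D2", "he will write it"), ("ST", "he will write")]
print(roc_select(candidates, features, [0.1, 1.0]))  # ('D2', ...)
```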

39 Decoding-plus-Rescoring Combination (DRC)
Step 1: Decode
– For each preprocessing scheme: use the union of the phrase tables from all schemes, then optimize and decode (with the same scheme)
Step 2: Rescore
– Rescore the one-best outputs of each preprocessing scheme
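The distinctive ingredient is the phrase-table union in step 1. A sketch of one plausible way to merge the tables; how the actual system reconciles the scores of duplicate entries is not stated on the slide, so keeping the maximum is an assumption:

```python
def union_phrase_tables(tables):
    """tables: {scheme: {(src_phrase, tgt_phrase): prob}}.
    Returns one merged table, keeping the best score per phrase pair."""
    merged = {}
    for scheme, table in tables.items():
        for pair, prob in table.items():
            if pair not in merged or prob > merged[pair][0]:
                merged[pair] = (prob, scheme)  # remember the origin scheme
    return merged

tables = {"D2": {("yktb", "he writes"): 0.6},
          "D3": {("yktb", "he writes"): 0.4, ("+hA", "it"): 0.7}}
print(union_phrase_tables(tables))
```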

40 Results
MT04 set; the best single scheme (D2) scores 37.1.
Combination             All schemes   4 best
ROC   Standard          34.87         37.12
      +PPL+SC           37.58         37.45
      +PPL+SC+OOV       37.4          –
      +PPL+SC+OOV+SL    37.39         –
      +PPL+SC+SL        37.15         –
DRC   +PPL+SC           38.67         37.73

41 Results
Statistical significance using bootstrap resampling (Koehn, 2004). Each cell gives the percentage of resampled test sets on which that system scores best; each row compares only the systems with entries in it:

DRC    ROC    D2     TB     D1     WA     ON
100    0      0      0      0      0      0
       97.7   2.2    0.1    0      0      0
              92.1   7.9    0      0      0
                     98.8   0.7    0.3    0.2
                            53.8   24.1   22.1
                                   59.3   40.7
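A sketch of the bootstrap procedure behind these percentages: resample the test set with replacement many times and count how often each system comes out on top (the metric function is passed in; in the experiments it would be BLEU):

```python
import random

def bootstrap_wins(outputs, references, metric, n_samples=1000):
    """outputs: {system: [sentence, ...]}; references: aligned list;
    metric(hyps, refs) -> float (higher is better).
    Returns the percentage of resampled sets each system wins."""
    n = len(references)
    wins = {system: 0 for system in outputs}
    for _ in range(n_samples):
        idx = [random.randrange(n) for _ in range(n)]  # sample w/ replacement
        scores = {system: metric([sents[i] for i in idx],
                                 [references[i] for i in idx])
                  for system, sents in outputs.items()}
        wins[max(scores, key=scores.get)] += 1
    return {system: 100.0 * w / n_samples for system, w in wins.items()}
```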

42 Conclusions
– For large amounts of training data, splitting off conjunctions and particles (D2) performs best
– For small amounts of training data, an English-like tokenization (EN) performs best
– A suitable choice of preprocessing scheme and technique yields an important increase in BLEU score when there is little training data or a change in genre between training and test
– System combination is potentially highly rewarding, especially when combining the phrase tables of different preprocessing schemes

43 Future Work
– Study additional variant schemes that the current results support
– Factored translation modeling
– Decoder extension to use multiple schemes in parallel
– Syntactic preprocessing
– Investigate combination techniques at the sentence and sub-sentence levels

44 Thank you! Questions? Nizar Habash habash@cs.columbia.edu

