
1 Morphological Processing for Statistical Machine Translation Presenter: Nizar Habash COMS E6998: Topics in Computer Science: Machine Translation February 7, 2013 Reading Set #1

2 Papers Discussed Nizar Habash and Fatiha Sadat. 2006. Arabic Preprocessing Schemes for Statistical Machine Translation. Nimesh Singh and Nizar Habash. 2012. Hebrew Morphological Preprocessing for Statistical Machine Translation.

3 Outline: Introduction, Arabic and Hebrew Morphology, Approach, Experimental Settings, Results, Conclusions

4 The Basic Idea
Reducing word sparsity improves translation quality. This reduction can be achieved by
– increasing training data, or by
– morphologically driven preprocessing

5 Introduction
Morphologically rich languages are especially challenging for SMT: model sparsity and a high OOV rate, especially under low-resource conditions. A common solution is to tokenize the source words in a preprocessing step.
– Lower OOV rate → better SMT (in terms of BLEU)
– Increased token symmetry → better SMT models
  conj+article+noun :: conj article noun
  wa+Al+kitAb :: and the book
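The sparsity argument above can be sketched with a toy example: after clitic splitting, an unseen full form is covered by morphemes already seen in training. A minimal illustration (the tiny corpora and the `tokenize` helper are invented for this sketch, not from the papers):

```python
# Sketch: how splitting proclitics shrinks the vocabulary and the OOV rate.
# Word forms are Buckwalter-style transliterations of Arabic.

def tokenize(word, clitics=("w+", "Al+")):
    """Greedily strip known proclitics from the front of a word."""
    out = []
    changed = True
    while changed:
        changed = False
        for c in clitics:
            stem = c.rstrip("+")
            if word.startswith(stem) and len(word) > len(stem):
                out.append(c)
                word = word[len(stem):]
                changed = True
                break
    return out + [word]

train = ["wAlktAb", "ktAb", "AlktAb"]   # 'and the book', 'book', 'the book'
test = ["wktAb"]                        # 'and a book' -- unseen as a full form

vocab_raw = set(train)
vocab_tok = {t for w in train for t in tokenize(w)}

oov_raw = [w for w in test if w not in vocab_raw]
oov_tok = [w for w in test if any(t not in vocab_tok for t in tokenize(w))]

print(len(oov_raw), len(oov_tok))   # the raw form is OOV; the tokenized form is not
```

The untokenized form wktAb never occurs in training, but its pieces w+ and ktAb both do, so tokenization removes the OOV.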

6 Introduction
Different tokenizations can be used; there is no one "correct" tokenization. Tokenizations vary in terms of
– Scheme (what to tokenize) and Technique (how to tokenize)
– Accuracy
– Consistency
– Sparsity reduction
The two papers consider different preprocessing options and other settings to study SMT from Arabic/Hebrew into English.

7 Outline: Introduction, Arabic and Hebrew Morphology, Approach, Experimental Settings, Results, Conclusions

8 Linguistic Issues
Arabic & Hebrew are Semitic languages
– Root-and-pattern morphology
– Extensive use of affixes and clitics
Rich morphology
– Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]]
  w+ l+ Al+ mktb  'and+ for+ the+ office'
– Morphotactics: w+l+Al+mktb → wllmktb (و+ل+ال+مكتب → وللمكتب)

9 Linguistic Issues
Orthographic & morphological ambiguity
– Arabic وجدنا wjdnA:
  wjd+nA → wajad+nA 'we found'
  w+jd+nA → wa+jad~u+nA 'and our grandfather'
– Hebrew בשורה bšwrh:
  bšwrh 'gospel'
  ב + שורה b+šwrh 'in (a/the) line'
  ב + שור + ה b+šwr+h 'in her bull' [lit. in+bull+her]

10 Arabic Orthographic Ambiguity
wdrst AltAlbAt AlErbyAt ktAbA bAlSynyp
w+drs+t Al+Talb+At Al+Erb+y+At ktAb+A b+Al+Syn+y+p
and+study+they the+student+f.pl. the+Arab+f.pl. book+a in+the+Chinese
'The Arab students studied a book in Chinese'
An English analogue of the same compression:
the+arab students studied a+book in+chinese
th+rb stdnts stdd +bk n+chns
thrb stdnts stdd bk nchns
to+herb so+too+dents studded bake in chains?
(Slide annotations mark the extra w+ and the repeated Al+.)

11 Arabic Morphemes
Word template: [proclitics] [prefix + STEM + suffix] [enclitic]
– Proclitics: CONJ w+ f+; PART b+ l+ k+; DET/FUT Al+ s+
– Stem: ROOT + PATTERN
– Verbal affixes: prefixes A+ t+ n+ y+; suffixes +t +nA +tm +tn +wA +n +wn +yn +An +A +ϵ
– Nominal suffixes: +p +y +yn +wn +An +At +w +A +ϵ
– Enclitic pronouns (PRON): +y +nA +k +km +kn +h +hA +hm +hn +ny
Clitics are optional; affixes are obligatory!

12 Outline: Introduction, Arabic and Hebrew Morphology, Approach, Experimental Settings, Results, Conclusions

13 Approach (Habash & Sadat 2006 / Singh & Habash 2012)
Preprocessing scheme: what to tokenize
Preprocessing technique: how to tokenize
– Regular expressions
– Morphological analysis
– Morphological tagging / disambiguation
– Unsupervised morphological segmentation
Scheme and technique are not always independent.

14 Arabic Preprocessing Schemes
ST: Simple tokenization
D1: Decliticize conjunctions w+/f+
D2: D1 + decliticize particles b+/l+/k+/s+
D3: D2 + decliticize article Al+ and pronominal clitics
BW: Morphological stem and affixes
EN: D3 + lemmatize, English-like POS tags, subject marking
ON: Orthographic normalization
WA: w+ decliticization only
TB: Arabic Treebank tokenization
L1: Lemmatize, Arabic POS tags
L2: Lemmatize, English-like POS tags

Example input: wsyktbhA? 'and he will write it?'
ST: wsyktbhA ?
D1: w+ syktbhA ?
D2: w+ s+ yktbhA ?
D3: w+ s+ yktb +hA ?
BW: w+ s+ y+ ktb +hA ?
EN: w+ s+ ktb/VBZ S:3MS +hA ?
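The D1→D2→D3 progression on wsyktbhA can be sketched with a few regular expressions over Buckwalter transliteration. This is an illustrative approximation of the REGEX technique, not the paper's implementation: the clitic inventories are simplified, D3's Al+ splitting is omitted, and blind regexes like these will happily split letters that are not clitics:

```python
import re

CONJ = r"[wf]"          # w+ 'and', f+ 'so'              (D1)
PART = r"[blks]"        # b+ l+ k+ s+ particles          (D2 adds these)
PRON = r"(?:hA|hm|hn|km|kn|nA|ny|[hky])"  # enclitic pronouns (D3, simplified)

def d1(word):
    """Split a leading conjunction, keeping a stem of at least 2 letters."""
    m = re.match(rf"^({CONJ})(\w\w+)$", word)
    return f"{m.group(1)}+ {m.group(2)}" if m else word

def d2(word):
    """D1, then split a leading particle off the remainder."""
    out = d1(word).split()
    m = re.match(rf"^({PART})(\w\w+)$", out[-1])
    if m:
        out[-1:] = [m.group(1) + "+", m.group(2)]
    return " ".join(out)

def d3(word):
    """D2, then split a trailing pronominal clitic (Al+ splitting omitted)."""
    out = d2(word).split()
    m = re.match(rf"^(\w\w+?)({PRON})$", out[-1])
    if m:
        out[-1:] = [m.group(1), "+" + m.group(2)]
    return " ".join(out)

print(d1("wsyktbhA"))   # w+ syktbhA
print(d2("wsyktbhA"))   # w+ s+ yktbhA
print(d3("wsyktbhA"))   # w+ s+ yktb +hA
```

This is exactly why the REGEX technique is the weakest baseline: without morphological disambiguation, the same rules would also split a stem-initial w or s that is not a clitic.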

15 Arabic Preprocessing Techniques
REGEX: Regular expressions
BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter 2002; 2004)
– Pick the first analysis
– Use TOKAN (Habash 2006), a generalized tokenizer that assumes a disambiguated morphological analysis and takes a declarative specification of any preprocessing scheme
MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow 2005)
– Multiple SVM classifiers + combiner
– Selects a BAMA analysis
– Use TOKAN

16 Hebrew Preprocessing Techniques/Schemes Regular Expressions o RegEx-S1 = Conjunctions: ו ‘and’ and ש ‘that/who’ o RegEx-S2 = RegEx-S1 and Prepositions: ב ‘in’, כ ‘like/as’, ל ‘to/for’, and מ ‘from’ o RegEx-S3 = RegEx-S2 and the article ה ‘the’ o RegEx-S4 = RegEx-S3 and pronominal enclitics Morfessor (Creutz and Lagus, 2007) o Morf - Unsupervised splitting into morphemes Hebrew Morphological Tagger (Adler, 2009) o Htag - Hebrew morphological analysis and disambiguation
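The cumulative RegEx-S1/S2/S3 schemes above amount to peeling known prefix letters off the front of each word. A minimal sketch of that behavior (an assumption about how the schemes work, not the actual code; the two-letter minimum stem is an invented guard):

```python
# Cumulative prefix inventories for the RegEx schemes (S4's pronominal
# enclitics are omitted from this sketch).
S1 = "וש"          # conjunctions: ו 'and', ש 'that/who'
S2 = S1 + "בכלמ"   # + prepositions: ב 'in', כ 'like/as', ל 'to/for', מ 'from'
S3 = S2 + "ה"      # + article: ה 'the'

def split_prefixes(word, letters):
    """Peel prefix letters from the front, keeping a stem of >= 2 letters."""
    toks = []
    while len(word) > 2 and word[0] in letters:
        toks.append(word[0] + "+")
        word = word[1:]
    return toks + [word]

# והספר 'and the book':
print(split_prefixes("והספר", S1))   # S1 splits only the conjunction
print(split_prefixes("והספר", S3))   # S3 also splits the article
```

As with the Arabic REGEX technique, these splits are blind: a word-initial ב may simply be part of the stem, which is why the tagger-based Htag technique is more accurate.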

17 Tokenization System Statistics

System   | Token Increase | Similarity to Baseline | OOV Reduction (DEV) | Accuracy vs Gold-S4 | Accuracy vs Gold (Scheme)
RegEx-S1 | 113% | 87.4% | 26% | 70.1% | 99.7% (S1)
RegEx-S2 | 141% | 62.2% | 50% | 65.3% | 79.1% (S2)
RegEx-S3 | 163% | 46.3% | 60% | 68.2% | 70.6% (S3)
RegEx-S4 | 190% | 33.8% | 66% | 54.5% | –

More aggressive tokenization schemes have:
– More tokens
– More change from the (untokenized) baseline
– Fewer OOVs (the baseline OOV rate is 7%)

18 Tokenization System Statistics

System   | Token Increase | Similarity to Baseline | OOV Reduction (DEV) | Accuracy vs Gold-S4 | Accuracy vs Gold (Scheme)
RegEx-S1 | 113% | 87.4% | 26% | 70.1% | 99.7% (S1)
RegEx-S2 | 141% | 62.2% | 50% | 65.3% | 79.1% (S2)
RegEx-S3 | 163% | 46.3% | 60% | 68.2% | 70.6% (S3)
RegEx-S4 | 190% | 33.8% | 66% | 54.5% | –
Morf     | 124% | 81.6% | 96% | 72.9% | –
Htag     | 130% | 71.8% | 56% | 94.0% | –
Gold-S4  | 136% | 68.4% | –   | –     | –
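The statistics in the table above can be computed directly from tokenized and untokenized corpora. A sketch over toy stand-in data (the function and the corpora are invented for illustration, not the paper's data):

```python
# Token increase relative to the untokenized baseline, and OOV reduction
# on a dev set. Corpora are lists of sentences, each a list of tokens.

def tokenization_stats(train_raw, train_tok, dev_raw, dev_tok):
    token_increase = (sum(len(s) for s in train_tok)
                      / sum(len(s) for s in train_raw))
    vocab_raw = {w for s in train_raw for w in s}
    vocab_tok = {w for s in train_tok for w in s}
    oov_raw = sum(w not in vocab_raw for s in dev_raw for w in s)
    oov_tok = sum(w not in vocab_tok for s in dev_tok for w in s)
    oov_reduction = 1 - oov_tok / oov_raw if oov_raw else 0.0
    return token_increase, oov_reduction

train_raw = [["wAlktAb", "ktAb"]]
train_tok = [["w+", "Al+", "ktAb", "ktAb"]]
dev_raw = [["wktAb", "ktAb"]]
dev_tok = [["w+", "ktAb", "ktAb"]]

inc, red = tokenization_stats(train_raw, train_tok, dev_raw, dev_tok)
print(f"token increase: {inc:.0%}, OOV reduction: {red:.0%}")
```

On real data the two measures trade off just as the table shows: more aggressive splitting raises the token count while shrinking the effective vocabulary the dev set must be covered by.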

19 Outline: Introduction, Arabic and Hebrew Morphology, Approach, Experimental Settings, Results, Conclusions

20 Arabic-English Experiments
Portage phrase-based MT (Sadat et al., 2005)
Training data: 5 million words of parallel text
– All news genre
– Learning curve: 1%, 10% and 100%
Language modeling: 250 million words
Development (tuning) data: MT03 eval set
Test data: MT04
– Mixed genre: news, speeches, editorials
Metric: BLEU (Papineni et al., 2001)

21 Arabic-English Experiments
Each experiment:
– Select a preprocessing scheme
– Select a preprocessing technique
Some combinations do not exist, e.g. REGEX with EN.

22 Arabic-English Results
[Chart: BLEU per scheme under the MADA, BAMA, and REGEX techniques at 1%, 10%, and 100% of the training data; MADA > BAMA > REGEX.]

23 Hebrew-English Experiments
Phrase-based statistical MT: Moses (Koehn et al., 2007)
MERT (Och, 2003) tuned for BLEU (Papineni et al., 2002)
Language models: English Gigaword (5-gram) plus training data (3-gram)
True casing for English output
Training data: ≈850,000 words

24 Hebrew-English Experiments
Compare seven systems, varying only the preprocessing: Baseline, RegEx-S{1-4}, Morf, and Htag
Metrics: BLEU, NIST (Doddington, 2002), METEOR (Banerjee & Lavie, 2005)

25 Results (blind test)

Method   | BLEU  | NIST   | METEOR | OOV
Baseline | 19.31 | 5.4951 | 44.36  | 1311
RegEx-S1 | 20.39 | 5.6468 | 45.46  | 985
RegEx-S2 | 21.69 | 5.8082 | 46.50  | 671
RegEx-S3 | 21.61 | 5.8761 | 46.60  | 567
RegEx-S4 | 21.07 | 5.8067 | 46.03  | 461
Morf     | 22.25 | 5.9751 | 46.53  | 48
Htag     | 22.79 | 6.1033 | 48.20  | 556
Combo1   | 22.72 | 6.0381 | 47.20  | 74
Combo2   | 22.69 | 6.0275 | 47.17  | 250

Htag is consistently best, and Morf consistently second best, in terms of BLEU and NIST.

26 Results
(Same table as slide 25.) Morf has very low OOV but still does worse than Htag (and even more poorly according to METEOR), indicating that it sometimes over-tokenizes.

27 Results
(Same table as slide 25.) Within RegEx, BLEU peaks at S2/S3, similar to Arabic D2 (Habash & Sadat, 2006).

28 Translation Example
Hebrew:    יש לנו קומקום ופלאטה בחדר.
Reference: We have an electric kettle and a hotplate in our room.
Baseline:  We have brought ופלאטה in the room.
RegEx-S1:  We have קומקום and פלאטה in the room.
RegEx-S2:  We have קומקום and פלאטה in the room.
RegEx-S3:  We've got קומקום and פלאטה in the room.
RegEx-S4:  We have kettle and ופלאט room.
Morf:      We've got a complete wonder anywhere.
Htag:      We've got kettle and פלאטה in the room.

29 Outline: Introduction, Arabic and Hebrew Morphology, Approach, Experimental Settings, Results, Conclusions

30 Conclusions
Preprocessing is useful for improving Arabic-English & Hebrew-English SMT
– But as more training data is added, the value diminishes
Tokenization with a morphological tagger does best, but requires a lot of linguistic knowledge
Morfessor does quite well with no linguistic information, and significantly reduces OOV (though perhaps erroneously)
The optimal scheme/technique choice varies by training data size
– In Arabic, with large amounts of training data, splitting off only conjunctions and particles performs best
– But with small amounts of training data, an English-like tokenization performs best

31 Thank you! Questions? Nizar Habash habash@cs.columbia.edu

