Morphological Analysis for Phrase-Based Statistical Machine Translation LUONG Minh Thang Supervisor: Dr. KAN Min Yen National University of Singapore Web IR / NLP Group (WING)
Machine translation: understand word structure
Modern Machine Translation (MT)
State-of-the-art systems
– perform phrase-to-phrase translation
– with data-intensive techniques
but still, they
– treat words as distinct, atomic entities
– do not model the internal structure of words
We investigate the incorporation of word-structure knowledge (morphology) and adopt a language-independent approach
Issues we address
Morphologically-aware system
– Out-of-vocabulary problem: we have seen "car" before, but not "cars"; yet "cars" has two morphemes, "car" + "s"
– Derive word structure from raw data only, a language-general approach
Translation into highly inflected languages
– English-Finnish case study
– Understand the characteristics
Suggestion of a self-correcting model
Example: auto "car"; auto/si "your car"; auto/i/si "your cars"; auto/i/ssa/si "in your cars"; auto/i/ssa/si/ko "in your cars?"
What have others done?
A majority of works address the translation direction from highly to less inflected languages
– Arabic-English, German-English, Finnish-English
Only a few works address the reverse direction, which is considered more challenging
– English-Turkish: (El-Kahlout & Oflazer, 2007)
– English-Russian, English-Arabic: (Toutanova et al., 2008)
These employ feature-rich approaches using abundant annotated data & language-specific tools
We also look at the reverse direction, English-Finnish, but stick to our language-general approach!
Agenda
– Baseline statistical MT: terminology
– Our morphologically-aware SMT system: baseline + morphological layers
– Finnish study: morphological aspects
– Suggestion of a self-correcting model
– Experiments & results
Baseline statistical MT (SMT) - overview
We construct our baseline using Moses (Koehn et al., 2007), a state-of-the-art open-source SMT toolkit
Training: monolingual/parallel train data → translation model, language model, reordering model
Decoding: test data (source language) → output translation (target language)
Evaluating: BLEU score
Baseline statistical MT - terminology
Source: Maria no daba una bofetada a la bruja verde (positions 1-9)
Target: NULL Mary did not slap the green witch (positions 0-7)
Parallel data: pairs of sentences in both languages (implies alignment correspondence)
Monolingual data: from one language only
Distortion limit parameter: controls reordering - how far translated words may move relative to their source position (reordering effect)
We test the effect of this parameter later
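The distortion-limit idea above can be sketched as a simple check. This is a simplified stand-in for the actual reordering constraint in Moses, and the function name is our own:

```python
def within_distortion_limit(prev_src_end, next_src_start, limit):
    """Simplified distortion check: the jump from the last source word
    covered by the previous phrase to the first source word of the next
    phrase must stay within the distortion limit."""
    return abs(next_src_start - prev_src_end - 1) <= limit
```

A monotone continuation (next phrase starts right after the previous one) gives a jump of zero and always passes; long reorderings are pruned once the jump exceeds the limit.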
Automatic evaluation in SMT
Human judgment is expensive & labor-intensive
Automatically evaluate against reference translation(s)
Input: Mary did not slap the green witch
Output: Maria daba una bofetada a verde bruja
Ref: Maria no daba una bofetada a la bruja verde
Automatic evaluation in SMT - BLEU score
BLEU score = brevity_penalty * exp((log p1 + ... + log p4) / 4): the geometric mean of the n-gram precisions p1..p4, scaled by a length-based brevity penalty
Ref: Maria no daba una bofetada a la bruja verde
Output: Maria daba una bofetada a verde bruja
Matched n-grams in the output: 7 unigrams, 3 bigrams, 2 trigrams, 1 4-gram
Each precision p_n divides the matched n-gram count by the total number of n-grams in the output
Match unigrams, bigrams, trigrams, and up to N-grams
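As a rough illustration (not the evaluation script used in the thesis), a toy sentence-level BLEU with clipped n-gram precisions, a brevity penalty, and no smoothing can be written as:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(output, reference, max_n=4):
    """Toy sentence-level BLEU: brevity penalty times the geometric
    mean of clipped n-gram precisions. Real BLEU smooths zero counts
    and supports multiple references; this sketch does neither."""
    out, ref = output.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        out_ngrams = ngram_counts(out, n)
        ref_ngrams = ngram_counts(ref, n)
        # clip each n-gram's count by its count in the reference
        matched = sum(min(c, ref_ngrams[g]) for g, c in out_ngrams.items())
        total = max(sum(out_ngrams.values()), 1)
        if matched == 0:
            return 0.0  # log undefined; real BLEU would smooth here
        log_precisions.append(math.log(matched / total))
    bp = min(1.0, math.exp(1 - len(ref) / len(out)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)
```

On the slide's example pair the score is between 0 and 1; an output identical to the reference scores exactly 1.0.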
Baseline SMT - shortcomings?
Only deals with language pairs of similar morphological complexity
Suffers from the data-sparseness problem in highly inflected languages
Type count / token count (statistics from the 714K-sentence corpus):
– English: 47,492 types / 16,018,093 tokens
– Finnish: 333,029 types / 17,971,620 tokens
Type: number of different words (vocabulary size); token: the total number of words
Why are highly inflected languages hard?
Huge vocabulary size
– Finnish vocabulary ~ 6 times the English vocabulary
Prefixes/suffixes can be freely concatenated to form new words
Finnish: oppositio/kansa/n/edusta/ja (opposition/people/of/represent/-ative) = opposition member of parliament
Turkish: uygarlaştıramadıklarımızdanmışsınızcasına (uygar/laş/tır/ama/dık/lar/ımız/dan/mış/sınız/casına) = (behaving) as if you are among those whom we could not cause to become civilized - this is a single word!
We make our system morphologically aware to address these issues
Agenda
– Baseline statistical MT: terminology
– Our morphologically-aware SMT system: baseline + morphological layers
– Finnish study: morphological aspects
– Suggestion of a self-correcting model
– Experiments & results
Morpheme pre- & post-processing modules
Morpheme pre-processing: cars → car + s
Morpheme post-processing: auto + t → autot
Pipeline: parallel train data → translation & reordering model training; monolingual train data → language model training; test data → decoding → final translation
Incorporating morphological layers
Our morphologically-aware SMT inserts morpheme pre-processing before training (on the parallel and monolingual train data) and before decoding (on the test data), and morpheme post-processing after decoding to produce the final translation E
Preprocessing - morpheme segmentation (MS)
We perform MS to address the data-sparseness problem
– "cars" might not appear in the training data, but "car" & "s" do
(Oflazer, 2007) & (Toutanova, 2008) also perform MS but use morphological analyzers that
– are customized for a specific language
– utilize richly annotated data
We use an unsupervised morpheme segmentation tool, Morfessor, that requires only unannotated monolingual data
Morpheme segmentation - Morfessor
Morfessor segments words in an unsupervised manner
straight/STM + forward/STM + ness/SUF
3 tags: PRE (prefix), STM (stem), & SUF (suffix)
Type / PRE / STM / SUF counts (statistics from the 714K-sentence corpus):
– English: 47,492 types; 62 PRE, 6,810 STM, 14 SUF
– Finnish: 333,029 types; 171 PRE, 39,835 STM, 159 SUF
Reduces the data-sparseness problem
Post-processing - morpheme concatenation
The output after decoding is a sequence of morphemes: Pitäkää mme se omassa täsmällis essä tehtävä ssä ä n
How do we put them back into words?
During translation, keep the tag info & the "+" sign (marking word-internal morpheme boundaries)
Use word structure: WORD = (PRE* STM SUF*)+
Tagged output: Pitäkää/STM+ mme/SUF se/STM omassa/STM täsmällis/STM+ essä/STM tehtävä/STM+ ssä/SUF+ ä/SUF+ n/SUF
Concatenated: Pitäkäämme se omassa täsmällisessä tehtävässään
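A minimal sketch of the concatenation step, assuming (as on this slide) that the decoder emits tokens of the form surface/TAG with a trailing "+" whenever the next morpheme belongs to the same word:

```python
def concatenate(morpheme_tokens):
    """Rejoin decoder output into words. Each token looks like
    'surface/TAG', with a trailing '+' marking that the following
    morpheme is part of the same word."""
    words, current = [], []
    for token in morpheme_tokens:
        joined = token.endswith("+")
        surface = token.rstrip("+").split("/")[0]  # drop tag and '+' marker
        current.append(surface)
        if not joined:                  # word boundary reached
            words.append("".join(current))
            current = []
    if current:                         # flush a trailing fragment, if any
        words.append("".join(current))
    return " ".join(words)
```

Applied to the tagged output above, this yields "Pitäkäämme se omassa täsmällisessä tehtävässään".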
Agenda
– Baseline statistical MT: terminology
– Our morphologically-aware SMT system: baseline + morphological layers
– Finnish study: morphological aspects
– Suggestion of a self-correcting model
– Experiments & results
Finnish study - two distinct characteristics
More case endings than typical Indo-European languages
– These normally correspond to prepositions or postpositions
– E.g.: auto/sta "out of the car", auto/on "into the car"
Uses endings where Indo-European languages have function words
– Finnish possessive suffixes = English possessive pronouns
– E.g.: auto/si "your car", auto/mme "our car"
Structure of nominals - a word followed by many suffixes
Structure: nominal + number + case + possessive + particle
Category / suffix / function / sample word form / translation:
– Number (2): -i-/-t, plural; auto/t "cars"
– Case (15): genitive -n, possession; auto/n "of the car"; inessive -ssa, inside; auto/ssa "in the car"
– Possessive (6): -ni, 1st person; auto/ni "my car"; -si, 2nd person; auto/si "your car"
– Particle (6): -kin, "too, also"; auto/i/ssa/si/kin "in your cars too"; -ko, interrogative; auto/ssa/si/ko "in your car?"
Structure of finite verb forms - Finnish suffixes ~ English function words
Structure: verb stem + passive + tense/mood + personal ending + particle
– Passive (2): -ta/-tt/-tta, unspecified person; sano/ta/an "one says"; -an/-en/-in, personal ending
– Tense (2) / Mood (4): -i-, past; sano/i/n "I said"; -isi-, conditional; sano/isi/n "I would say"
– Personal ending (6): -n, 1st person; sano/n "I say"; -t, 2nd person; sano/t "you say"
– Particle (6): -kin, "too, also"; sano/i/n/kin "I said also"; -ko, interrogative; sano/i/n/ko "did I say?"
Potential challenges of highly inflected languages to the system
A word might be followed by several suffixes, so the system might get the stem right but miss a suffix
Source: ... my/STM car/STM+ s/SUF ...
Output: ... auto/STM+ i/SUF ...
Correct translation: "my cars" = auto/i/ni (i: plural, ni: my)
Intuition: use "my" and "s" to help - how can the system self-correct the suffix i to i/ni?
Preliminary self-correcting model
Suffixes in a highly inflected language ~ function words in a less inflected language
Besides prefixes & suffixes, make use of source function words
Model suffix prediction as a sequence-labeling task, where the labels are suffixes
Example: my/STM car/STM+ s/SUF → auto/STM+ i/SUF
At position t (stem_t = "auto"), the features include the neighboring stems (stem_{t-1}, stem_{t+1}), the surrounding suffixes (suffix_{t-1}, suffix_t, suffix_{t+1}), and the source function word ("my") and source suffix ("s"); predict the correct suffix i/ni
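The feature extraction for this sequence labeler might look like the sketch below. The feature names and the exact feature set are illustrative assumptions, not taken from the thesis:

```python
def suffix_features(stems, suffixes, source_words, t):
    """Sketch of features for re-predicting the suffix at target
    position t. `stems`/`suffixes` come from the decoder's morpheme
    output; `source_words` are aligned source-side function words and
    suffixes (e.g. "my", "s")."""
    feats = {"stem_t=" + stems[t],
             "mt_suffix=" + suffixes[t]}   # the suffix the MT system produced
    if t > 0:                              # left context
        feats.add("stem_prev=" + stems[t - 1])
        feats.add("suffix_prev=" + suffixes[t - 1])
    if t + 1 < len(stems):                 # right context
        feats.add("stem_next=" + stems[t + 1])
    for w in source_words:                 # source-side cues
        feats.add("src=" + w)
    return feats
```

A CRF or similar sequence labeler trained over such features could then relabel the suffix "i" as "i/ni" when the source side contains "my" and "s".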
Agenda
– Baseline statistical MT: terminology
– Our morphologically-aware SMT system: baseline + morphological layers
– Finnish study: morphological aspects
– Suggestion of a self-correcting model
– Experiments & results
Datasets from the European Parliament corpus
Four data sets of various sizes (train size / dev & test size):
– Dataset 1: 5K / 130
– Dataset 2: 10K / 270
– Dataset 3: 36K / 949
– Dataset 4: 61K / 1615
Selection: first pick a keyword for each dataset, then extract all sentences containing the keyword and its morphological variants
Modest in size compared to the full 714K-sentence corpus. We chose these sizes to
– reduce running time
– simulate the real situation of scarce resources
Experiments - out-of-vocabulary (OOV) rates
OOV rate = number of untranslated words / total words
Reduction rate = (baseline OOV rate - our OOV rate) / baseline OOV rate
– Dataset 1 (5K): baseline SMT 18.25, our SMT 11.91, reduction 34.74%
– Dataset 2 (10K): baseline SMT 13.99, our SMT 9.00, reduction 35.69%
– Dataset 3 (36K): baseline SMT 8.68, our SMT 6.96, reduction 19.80%
– Dataset 4 (61K): baseline SMT 8.08, our SMT 7.25, reduction 10.33%
Reduction rates range from 10.33% to 34.74%; the effect is highest when data is limited
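The two metrics on this slide are straightforward to compute; a minimal sketch, with the vocabulary approximated as the set of translatable tokens:

```python
def oov_rate(tokens, vocabulary):
    """Percentage of tokens the system cannot translate, approximated
    here as tokens outside the known vocabulary."""
    oov = sum(1 for t in tokens if t not in vocabulary)
    return 100.0 * oov / len(tokens)

def reduction_rate(baseline_oov, system_oov):
    """Relative OOV reduction, as defined on the slide (in percent)."""
    return 100.0 * (baseline_oov - system_oov) / baseline_oov
```

For Dataset 1, reduction_rate(18.25, 11.91) reproduces the 34.74% figure in the table.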
Overall results with BLEU score
We use the BLEU metric, judged at
– word level: the unit in the N-gram is the word
– morpheme level: the unit in the N-gram is the morpheme
Datasets 1-4:
– Word BLEU: baseline SMT 8.55 / 10.06 / 17.89 / 16.90; our SMT 8.31 / 10.92 / 15.54 / 16.11
– Morpheme BLEU: baseline SMT 13.83 / 16.40 / 25.24 / 25.46; our SMT 16.46 / 20.37 / 26.17 / 26.35
Word BLEU: our SMT is as competitive as the baseline SMT
Morpheme BLEU: our SMT shows better morpheme coverage
Overall results - distortion limit tuning
The distortion limit controls reordering and has a strong effect on performance (Virpioja et al., 2007)
Distortion limit 6 / 9 / unlimited:
– Word BLEU: baseline SMT 10.06 / 9.61 / 8.98; our SMT 10.92 / 11.20 / 9.50
– Morpheme BLEU: baseline SMT 16.40 / 15.96 / 14.32; our SMT 20.37 / 20.45 / 19.13
The baseline SMT is best at 6; our SMT is best at 9
Our SMT is better in both word and morpheme BLEU
Error analysis
How often does the system get the stem right but not the suffix?
Counts: # stems with correct suffix / # stems with incorrect suffix / incorrect-suffix ratio
– Dataset 1: 147 / 111 / 43.02%
– Dataset 2: 317 / 193 / 37.84%
– Dataset 3: 1625 / 690 / 29.81%
– Dataset 4: 3448 / 1615 / 31.90%
This shows a real need for the self-correcting model
Even further analysis - new results after the thesis!
Our datasets are specialized on their keywords, so results are more conclusive if we look at translations of phrases containing the dataset keywords
Counts: # source phrases with the keyword / # correct stems / # correct suffixes
– Dataset 1 (success): baseline SMT 129 / 22 / 13; our SMT 131 / 33 / 17
– Dataset 2 (environment): baseline SMT 276 / 58 / 30; our SMT 283 / 72 / 37
– Dataset 3 (report): baseline SMT 953 / 344 / 262; our SMT 952 / 398 / 295
– Dataset 4 (european): baseline SMT 1707 / 964 / 879; our SMT 1716 / 1088 / 970
Conclusion: our SMT performs better on both tasks, getting the stems and the suffixes right
References
Koehn, P., et al., 2007. Moses: open source toolkit for statistical machine translation
Oflazer, K. & Durgar El-Kahlout, I., 2007. Exploring different representational units in English-to-Turkish statistical machine translation
Virpioja, S., et al., 2007. Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner
Toutanova, K., et al., 2008. Applying morphology generation models to machine translation
Q & A? Thank you
Baseline statistical MT
Translation model training: EM algorithm with symmetrized word alignment (GIZA++ tool) on the parallel train data → phrase tables
Language model training: N-gram extraction (SRILM) on the target train data → language model
Tuning: on development data, learn the weights λ_i in P(E|F) ~ Σ_i λ_i f_i(E|F) via minimum error rate training → λ*_i
Decoding: E = argmax_E Σ_i λ*_i f_i(E|F) via beam search (Moses toolkit) on test data F → final translation E
Standard SMT system - translation model
Translation model training on the parallel train data learns how to translate a source phrase into a target phrase
Output: phrase table
car industry in europe ||| euroopan autoteollisuus
car industry in the ||| autoteollisuuden
car industry in ||| autoteollisuuden
Standard SMT system - language model
Language model training on the target train data captures constraints on which word sequences can go together
Output: N-gram table
-2.882216 commission 's argument 0
-3.182358 commission 's arguments 0
-3.620942 commission 's assertion 0
-3.11402 commission 's assessment 0
Standard SMT system - tuning
Tuning on parallel development data determines the weights for combining the different models, e.g. the translation and language models
P(E|F) ~ Σ_i λ_i f_i(E|F); learn the λ_i
Standard SMT system - decoding
Decoding uses the phrase table from the translation model, the N-gram table from the language model, and the combination weights learned in tuning
For each input sentence F, generate a set of candidate translations and pick the highest-scoring one as the final translation E
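The log-linear scoring and argmax selection described on the tuning and decoding slides can be sketched as follows. This is a toy stand-in for Moses's beam search: we assume the candidate set is already enumerated, which the real decoder never does explicitly:

```python
def score(features, weights):
    """Log-linear score Σ_i λ_i f_i(E|F): a weighted sum of model
    feature values (translation model, language model, reordering...)."""
    return sum(weights[name] * value for name, value in features.items())

def decode(candidates, weights):
    """Pick the highest-scoring candidate translation from a
    pre-enumerated candidate set (translation -> feature dict)."""
    return max(candidates, key=lambda c: score(candidates[c], weights))
```

With weights λ = {tm: 1.0, lm: 0.5}, a candidate with feature values {tm: -0.5, lm: -1.0} scores -1.0 and beats one scoring -2.0.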