
1 National Centre for Language Technology Comparing Example-Based & Statistical Machine Translation Andy Way*†, Nano Gough*, Declan Groves† National Centre for Language Technology, School of Computing, Dublin City University {away,ngough,dgroves}@computing.dcu.ie [* To appear in the Journal of Natural Language Engineering, June 2005] [† To appear in the Workshop on Building and Using Parallel Texts: Data-Driven MT and Beyond, ACL-05, June 2005]

2 National Centre for Language Technology Plan of the Talk
1. Basic Situation in MT today: Statistical MT (SMT), Example-Based MT (EBMT)
2. Differences between Phrase-based SMT & EBMT.
3. Our ‘Marker-based’ EBMT system.
4. Testing EBMT vs. word- & phrase-based SMT.
5. Results & Observations.
6. Concluding Remarks.
7. Future Research Avenues.

3 National Centre for Language Technology What is the Situation today in MT? Most MT research undertaken today is corpus-based (compared with rule-based methods). Two main data-driven approaches:
1. Example-Based MT (EBMT)
2. Statistical MT (SMT)
SMT is by far the more dominant paradigm.

4 National Centre for Language Technology How does EBMT work? [Diagram: a new input EX is searched against the aligned English—French corpus (E1–E4 ↔ F1–F4); the fragments aligned to the matches (F2, F4) are recombined to give the output translation FX.]

5 National Centre for Language Technology A (much simplified) Example
Given in corpus:
  John went to school → Jean est allé à l’école.
  The butcher’s is next to the baker’s → La boucherie est à côté de la boulangerie.
Isolate useful fragments:
  John went to → Jean est allé à
  the baker’s → la boulangerie
We can now translate “John went to the baker’s”: Jean est allé à la boulangerie.
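To make the look-up-and-recombine idea concrete, here is a minimal Python sketch over a toy fragment memory; the memory contents and the greedy matching strategy are illustrative only, not the DCU system itself.

```python
# A minimal sketch of the fragment look-up-and-recombine idea, assuming a toy
# fragment memory (this is not the DCU system, just the principle above).
fragment_memory = {
    "John went to": "Jean est allé à",
    "the baker's": "la boulangerie",
}

def translate(source: str) -> str:
    """Greedily cover the input with the longest known source fragments."""
    words, output, i = source.split(), [], 0
    while i < len(words):
        for j in range(len(words), i, -1):          # try the longest span first
            span = " ".join(words[i:j])
            if span in fragment_memory:
                output.append(fragment_memory[span])
                i = j
                break
        else:                                       # no fragment covers this word:
            output.append(words[i])                 # pass it through untranslated
            i += 1
    return " ".join(output)

print(translate("John went to the baker's"))
# -> "Jean est allé à la boulangerie"
```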

6 National Centre for Language Technology How does SMT work? SMT deduces language & translation models from huge quantities of monolingual and bilingual data, using a range of theoretical approaches to probability distribution and estimation.
The translation model establishes the set of target-language words (and, more recently, phrases) which are most likely to be useful in translating the source string:
– it takes into account source and target word (and phrase) co-occurrence frequencies, sentence lengths and the relative sentence positions of source and target words.
The language model tries to assemble these words (and phrases) in the best possible order:
– it is trained by determining all bigram and/or trigram frequency distributions occurring in the training data.
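To illustrate the division of labour between the two models, here is a toy Python sketch of noisy-channel scoring; all probabilities are invented, and the “decoder” simply scores two hand-written candidates rather than searching the space as a real SMT decoder does.

```python
import math

# Toy illustration of the noisy-channel factorisation argmax_e P(e) * P(f|e):
# a bigram language model P(e) and a word-level translation model P(f|e),
# with invented probabilities, used to rank two hand-written candidates.
lm = {("<s>", "john"): 0.4, ("john", "went"): 0.5, ("went", "home"): 0.3,
      ("<s>", "home"): 0.05, ("home", "went"): 0.01}
tm = {("jean", "john"): 0.8, ("est", "went"): 0.3, ("rentré", "went"): 0.4,
      ("rentré", "home"): 0.3, ("jean", "home"): 0.01}

def log_score(source, candidate):
    score, prev = 0.0, "<s>"
    for e in candidate:                                    # language model term
        score += math.log(lm.get((prev, e), 1e-6))
        prev = e
    for f in source:                                       # crude translation model term:
        score += math.log(max(tm.get((f, e), 1e-6) for e in candidate))
    return score

source = ["jean", "est", "rentré"]
candidates = [["john", "went", "home"], ["home", "went", "john"]]
print(max(candidates, key=lambda c: log_score(source, c)))   # -> ['john', 'went', 'home']
```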

7 National Centre for Language Technology The Paradigms are Converging It is harder than it has ever been to describe the differences between the two methods. This used to be easy:
– from the beginning, EBMT has sought to translate new texts by means of a range of sub-sentential data—both lexical and phrasal—stored in the system's memory;
– until quite recently, SMT models of translation were based on the simple IBM word alignment models of [Brown et al., 1990].

8 National Centre for Language Technology From word- to phrase-based SMT SMT systems now learn phrasal as well as lexical alignments [e.g. Koehn, Och & Marcu, 2003; Och, 2003]. Unsurprisingly, the quality of today's phrase-based SMT systems is considerably better than that of the earlier word-based models. Despite the fact that EBMT models have been modelling lexical and phrasal correspondences for 20 years, no papers on SMT acknowledge this debt to EBMT, nor describe their approach as ‘example-based’ …

9 National Centre for Language Technology Differences between EBMT and Phrase-Based SMT?
1. EBMT alignments remain available for reuse in the system, whereas (similar) SMT alignments ‘disappear’ into the probability models.
2. SMT systems never ‘learn’ from previously encountered data, i.e. when SMT sees a string it has seen before, it processes it in the same way as ‘unseen’ data—EBMT will simply ‘look up’ such strings in its databases and output the translation quite straightforwardly.
3. Depending on the model, EBMT builds in (some) syntax at its core—most SMT systems only use models of syntax in a post hoc reranking process, and even here, [Koehn et al., JHU Workshop 2003] demonstrated that ‘bolting on’ syntax in this manner did not help improve translation quality.
4. Given (3), phrase-based SMT systems are likely to ‘learn’ (some) chunks that EBMT systems would not.

10 National Centre for Language Technology SMT chunks are different from EBMT chunks
En: Mary did not slap the green witch → Sp: Maria no dió una bofetada a la bruja verde. (Lit: ‘Mary not gave a slap to the witch green’)
From this aligned example, an SMT system would potentially learn the following ‘phrases’ (along with many others):
  slap the → dió una bofetada a
  slap the → dió una bofetada a la
  the green witch → a la bruja verde
NB, SMT essentially learns n-gram sequences, rather than phrases per se. [Koehn & Knight, AMTA-04 SMT Tutorial Notes]
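For readers unfamiliar with how such ‘phrases’ arise, here is a hedged Python sketch of the standard consistency criterion for extracting phrase pairs from a word alignment, in the spirit of Koehn et al. (2003); the alignment points for the Mary/Maria example are hypothetical, and this simplified version does not expand phrases over unaligned boundary words (which is how the shorter “slap the → dió una bofetada a” variant would also be obtained).

```python
# Sketch of phrase-pair extraction under the alignment-consistency criterion.
src = "mary did not slap the green witch".split()
tgt = "maria no dió una bofetada a la bruja verde".split()
# hypothetical (src_index, tgt_index) alignment points
align = {(0, 0), (1, 1), (2, 1), (3, 2), (3, 3), (3, 4), (4, 6), (5, 8), (6, 7)}

def extract_phrases(src, tgt, align, max_len=4):
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            tgt_pts = [j for (i, j) in align if i1 <= i <= i2]
            if not tgt_pts:
                continue
            j1, j2 = min(tgt_pts), max(tgt_pts)
            # consistent iff nothing inside the target span aligns outside [i1, i2]
            if all(i1 <= i <= i2 for (i, j) in align if j1 <= j <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

for s, t in sorted(extract_phrases(src, tgt, align)):
    print(f"{s}  ->  {t}")
```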

11 National Centre for Language Technology Our Marker-Based EBMT System
“The Marker Hypothesis states that all natural languages have a closed set of specific words or morphemes which appear in a limited set of grammatical contexts and which signal that context.” [Green, 1979]
Markers for English (and French): Determiners, Quantifiers, Prepositions, Conjunctions, Wh-Adverbs, Possessive Pronouns, Personal Pronouns.

12 National Centre for Language Technology An Example
En: you click apply to view the effect of the selection → Fr: vous cliquez sur appliquer pour visualiser l'effet de la sélection
Source—target aligned sentences are traversed word by word and automatically tagged with their marker categories:
<PRON> you click apply <PREP> to view <DET> the effect <PREP> of <DET> the selection → <PRON> vous cliquez sur appliquer <PREP> pour visualiser <DET> l'effet <PREP> de <DET> la sélection
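As a concrete illustration of the tagging step, here is a minimal Python sketch; the marker lexicon is a hand-written subset for this one example, not the system's full closed-class word lists.

```python
# Minimal sketch of marker tagging with a toy marker lexicon.
MARKERS_EN = {"you": "<PRON>", "the": "<DET>", "of": "<PREP>", "to": "<PREP>"}

def tag_markers(sentence: str, markers: dict) -> str:
    """Insert the category tag in front of every marker word."""
    out = []
    for word in sentence.split():
        if word in markers:
            out.append(markers[word])
        out.append(word)
    return " ".join(out)

print(tag_markers("you click apply to view the effect of the selection", MARKERS_EN))
# -> "<PRON> you click apply <PREP> to view <DET> the effect <PREP> of <DET> the selection"
```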

13 National Centre for Language Technology Deriving Sub-Sentential Source—Target Chunks From these tagged strings, we generate the following aligned marker chunks:
  you click apply : vous cliquez sur appliquer
  to view : pour visualiser
  the effect : l'effet
  of the selection : de la sélection
New source and target (not necessarily source—target!) fragments begin where marker words are met and end at the next marker word [+ cognates, MI etc. → source—target sub-sentential alignments]. One further constraint: each chunk must contain at least one non-marker word (cf. 4th marker chunk).
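The segmentation rule itself is simple enough to sketch in a few lines of Python; the marker lexicon below is again an illustrative subset, and the merge step implements the constraint that every chunk must keep at least one non-marker word.

```python
# A sketch of the marker-based segmentation rule: a new chunk starts at every
# marker word, and any chunk consisting only of marker words is folded into
# the following chunk so that each chunk keeps at least one non-marker word.
MARKER_WORDS = {"you", "the", "of", "to"}   # illustrative subset only

def marker_chunks(sentence: str) -> list:
    chunks, current = [], []
    for word in sentence.split():
        if word in MARKER_WORDS and current:
            chunks.append(current)
            current = []
        current.append(word)
    chunks.append(current)
    merged = []
    for chunk in chunks:
        if merged and all(w in MARKER_WORDS for w in merged[-1]):
            merged[-1].extend(chunk)      # enforce the non-marker-word constraint
        else:
            merged.append(chunk)
    return [" ".join(c) for c in merged]

print(marker_chunks("you click apply to view the effect of the selection"))
# -> ['you click apply', 'to view', 'the effect', 'of the selection']
```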

14 National Centre for Language Technology Deriving Lexical Mappings Where chunks contain just one non-marker word in both source and target, we assume they are translations. Thus we can extract the following ‘word-level’ translations:
  to : pour
  view : visualiser
  effect : effet
  you : vous
  the : l’
  of : de
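A hedged Python sketch of this extraction rule follows; the marker sets and the crude handling of French elision (so that “l'effet” splits into “l'” plus “effet”) are illustrative simplifications, not the system's actual tokenisation.

```python
# Sketch of the word-level extraction rule: where an aligned chunk pair has
# exactly one non-marker word on each side, those two words are taken to be
# translations; a lone marker word on each side is paired up as well.
EN_MARKERS = {"you", "the", "of", "to"}
FR_MARKERS = {"vous", "la", "l'", "de", "pour"}

def tokenize_fr(chunk: str) -> list:
    # crude handling of French elision so that "l'effet" yields "l'" + "effet"
    return chunk.replace("l'", "l' ").split()

def word_mappings(chunk_pairs):
    lexicon = {}
    for src, tgt in chunk_pairs:
        s, t = src.split(), tokenize_fr(tgt)
        s_content = [w for w in s if w not in EN_MARKERS]
        t_content = [w for w in t if w not in FR_MARKERS]
        if len(s_content) == 1 and len(t_content) == 1:
            lexicon[s_content[0]] = t_content[0]
            s_marker = [w for w in s if w in EN_MARKERS]
            t_marker = [w for w in t if w in FR_MARKERS]
            if len(s_marker) == 1 and len(t_marker) == 1:
                lexicon[s_marker[0]] = t_marker[0]
    return lexicon

pairs = [("to view", "pour visualiser"), ("the effect", "l'effet"),
         ("of the selection", "de la sélection")]
print(word_mappings(pairs))
# -> {'view': 'visualiser', 'to': 'pour', 'effect': 'effet', 'the': "l'", 'selection': 'sélection'}
```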

15 National Centre for Language Technology Deriving Generalised Templates In a final pre-processing stage, we produce a set of generalised marker templates by replacing marker words with their tags:
  <PRON> click apply : <PRON> cliquez sur appliquer
  <PREP> view : <PREP> visualiser
  <DET> effect : <DET> effet
  <PREP> the selection : <PREP> la sélection
Any marker tag pair can now be inserted at the appropriate tag location. More general examples add flexibility to the matching process and improve coverage (and quality).
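A minimal Python sketch of the generalisation step is given below; it assumes that the chunk-initial marker word is the one replaced by its category tag, and the tag names and lexicons are illustrative.

```python
# Sketch of template generalisation over marker chunks (toy lexicons).
EN_TAGS = {"you": "<PRON>", "the": "<DET>", "of": "<PREP>", "to": "<PREP>"}
FR_TAGS = {"vous": "<PRON>", "la": "<DET>", "de": "<PREP>", "pour": "<PREP>"}

def generalise(chunk: str, tags: dict) -> str:
    words = chunk.split()
    if words[0] in tags:              # replace only the marker that opened the chunk
        words[0] = tags[words[0]]
    return " ".join(words)

print(generalise("of the selection", EN_TAGS), ":",
      generalise("de la sélection", FR_TAGS))
# -> "<PREP> the selection : <PREP> la sélection"
```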

16 National Centre for Language Technology Summary of Knowledge Sources
1. the original sententially-aligned source—target pairs;
2. the marker-aligned chunks;
3. the generalised marker chunks;
4. the word-level lexicon.
New strings are segmented into all possible n-grams that might be retrieved from the system's memories. Resources are searched in the order provided here, from maximal context (specific source—target sentence pairs) to minimal context (word-for-word translation).
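The cascaded search order can be sketched as a simple back-off over the four resources; the toy data below is illustrative, and the generalised templates are shown as plain strings here, whereas in practice they are matched only after marker words in the input have been replaced by their tags.

```python
# Sketch of the cascaded matching order summarised above, with toy resources.
sentence_pairs = {"you click apply to view the effect of the selection":
                  "vous cliquez sur appliquer pour visualiser l'effet de la sélection"}
chunk_pairs = {"to view": "pour visualiser", "the effect": "l'effet"}
templates = {"<DET> effect": "<DET> effet"}
lexicon = {"view": "visualiser", "effect": "effet", "to": "pour", "of": "de"}

def lookup(segment: str):
    """Return a translation from the most specific resource covering the segment."""
    for resource in (sentence_pairs, chunk_pairs, templates, lexicon):
        if segment in resource:
            return resource[segment]
    return None   # left to the recombination / word-for-word back-off stage

print(lookup("to view"))   # matched against the marker chunks
print(lookup("effect"))    # falls all the way through to the word-level lexicon
```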

17 National Centre for Language Technology Application Areas for our EBMT System
1. Seeding System Memories with Penn-II Treebank phrases and translations [AMTA-02].
2. Controlled Language & EBMT [MT Summit-03, EAMT-04, MT Journal-05].
3. Integration with web-based MT Systems [CL Journal-03].
4. Using the Web for Translation Validation (and Correction, if required).
5. Scalable EBMT [TMI-04, NLE Journal-05, ACL-05]. Largest English→French EBMT System. Robust, Wide-Coverage, Good Quality. Outperforms good on-line MT Systems.

18 National Centre for Language Technology What are we interested in finding out?
1. Whether our marker-based EBMT system can outperform (1) word-based and (2) phrase-based SMT systems compiled from generally available tools;
2. Whether such SMT systems outperform our EBMT system when given ‘enough’ training text;
3. Whether seeding SMT (and EBMT) systems with SMT and/or EBMT data improves translation quality.
NB: astonishingly, there is no previously published research comparing EBMT and SMT …

19 National Centre for Language Technology What have we done vs. what are we doing?
Completed:
1. WBSMT vs. EBMT
2. PBSMT seeded with:
   – SMT chunks;
   – EBMT chunks;
   – both knowledge sources (‘Hybrid Example-Based SMT’).
3. PBSMT vs. EBMT
Ongoing work:
1. EBMT seeded with:
   – SMT chunks;
   – EBMT chunks;
   – merged knowledge sources (‘Hybrid Statistical EBMT’).

20 National Centre for Language Technology Word-Based SMT vs. EBMT
1. Marker-Based EBMT system [Gough & Way, TMI-04]
2. To develop language and translation models for the WBSMT system, we used:
   – Giza++ (for word alignment);
   – the CMU-Cambridge statistical toolkit (for computing the language models);
   – the ISI ReWrite Decoder (for deriving translations).

21 National Centre for Language Technology Experiment 1 Set-Up
207K English—French Sun Translation Memory (TM). Randomly extracted a 4K-sentence test set. Split the remaining sentences into three training sets of roughly 50K (1.1M words), 100K, and 203K (4.8M words) sentence pairs to test the impact of training set size. Translation performed at each stage from English—French and French—English. Resulting translations evaluated using a range of automatic metrics.
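As a reference point for the result tables that follow, here is a minimal Python sketch of two of the metrics as they are commonly defined: word error rate (WER) as word-level Levenshtein distance normalised by reference length, and sentence error rate (SER) as the fraction of outputs that are not an exact match of the reference (BLEU, precision and recall are omitted for brevity).

```python
# Minimal sketch of WER and SER; single-reference, whitespace tokenisation.
def wer(hyp: str, ref: str) -> float:
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(h)][len(r)] / len(r)

def ser(hyps, refs) -> float:
    return sum(h != r for h, r in zip(hyps, refs)) / len(refs)

print(wer("vous cliquez sur appliquer", "vous cliquez appliquer"))   # -> 0.333...
```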

22 National Centre for Language Technology WBSMT vs. EBMT: English—French

         System   BLEU   Prec.   Rec.    WER    SER
  TS1    WBSMT    .297   .674    .591    .549   .908
         EBMT     .332   .653    .618    .543   .892
  TS2    WBSMT    .338   .682    .596    .511   .899
         EBMT     .453   .736    .698    .448   .775
  TS3    WBSMT    .322   .651    .570    .535   .891
         EBMT     .441   .673    .688    .524   .656

All metrics bar one suggest that EBMT can outperform WBSMT from English—French; the only exception is for TS1, where WBSMT outperforms EBMT in terms of precision (.674 compared to .653).

23 National Centre for Language Technology WBSMT vs. EBMT: English—French In general, scores improve incrementally as the training data increases. But apart from SER, the metrics suggest that training on just over 100K sentence pairs yields better results than training on just over 200K. Why? Maybe due to overfitting or odd data … Surprising: it is generally assumed that increasing the training data in Machine Learning approaches will improve the quality of the output translations (variance analysis: bootstrap resampling on the test set [Koehn, EMNLP-04]; different test sets). Note especially the similarity of the WER scores, and the difference in SER values: a much more significant improvement for EBMT (20.6%) than for WBSMT (0.1%).
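For readers unfamiliar with the variance analysis cited above, here is a hedged Python sketch of bootstrap resampling over the test set, in the spirit of [Koehn, EMNLP-04]: resample per-sentence scores with replacement and read off a confidence interval for the mean. Koehn's method actually recomputes corpus-level BLEU on each resample; the per-sentence scores below are placeholders.

```python
import random

# Simplified bootstrap confidence interval over placeholder sentence scores.
def bootstrap_interval(sentence_scores, samples=1000, alpha=0.05):
    n = len(sentence_scores)
    means = sorted(sum(random.choices(sentence_scores, k=n)) / n
                   for _ in range(samples))
    return means[int(samples * alpha / 2)], means[int(samples * (1 - alpha / 2)) - 1]

print(bootstrap_interval([0.41, 0.52, 0.33, 0.47, 0.58, 0.29]))  # roughly (0.37, 0.50)
```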

24 National Centre for Language Technology WBSMT vs. EBMT: French—English

         System   BLEU   Prec.   Rec.    WER    SER
  TS1    WBSMT    .379   .709    .736    .525   .865
         EBMT     .257   .542    .631    .697   .892
  TS2    WBSMT    .392   .721    .743    .462   .813
         EBMT     .426   .673    .796    .552   .662
  TS3    WBSMT    .446   .704    .724    .468   .808
         EBMT     .461   .678    .744    .508   .512

All WBSMT scores are higher than for English—French. For EBMT, translations are better from French—English for BLEU, Recall and SER; worse for WER (FR-EN: .508, EN-FR: .448) and Precision (FR-EN: .678, EN-FR: .736).

25 National Centre for Language Technology WBSMT vs. EBMT: French—English For TS1, EBMT does not outperform WBSMT from French—English on any of the five metrics. For TS2, EBMT beats WBSMT in terms of BLEU, Recall and SER (66.5% compared to 81.3% for WBSMT), while WBSMT gets higher scores for Precision and WER (46.2% compared to 55.2%). For TS3, WBSMT again beats EBMT in terms of Precision (2.5%) and WER (4% - both less significant differences than for TS1 and TS2), but EBMT wins out according to the other three metrics—notably, by a huge 29.6% for SER. BLEU: WBSMT obtains significantly higher scores for French—English compared to English—French: 8% higher for TS1, 6% higher for TS2, and 12% higher for TS3. Apart from TS1, the EBMT scores for the two language directions are much more in line, indicating perhaps that EBMT may be more consistent across directions for the same language pair.

26 National Centre for Language Technology Summary of Results
1. Both EBMT & WBSMT achieve better translation quality from French—English compared to English—French. Of the five automatic evaluation metrics for each of the three training sets, in 9/15 cases WBSMT wins out over our EBMT system.
2. For English—French, in 14/15 cases EBMT beats WBSMT.
3. Summing these results together, EBMT outperforms WBSMT in 20 tests, while WBSMT does better in 10 experiments.
4. Assuming all of these tests to be of equal importance, EBMT appears to outperform WBSMT by a factor of two to one.
5. While the results are a little mixed, it is clear that EBMT tends to outperform WBSMT on this sublanguage and on these training sets.

27 National Centre for Language Technology Experiment 2: Phrase-Based SMT vs. EBMT
Same EBMT system as for the WBSMT experiment. To develop language and translation models for the PBSMT system, we:
– used Giza++ to extract word alignments;
– refined these to extract Giza++ phrase alignments;
– constructed probability tables;
– passed these to the CMU-SRI statistical toolkit & Pharaoh Decoder to derive translations.
Same translation pairs, training sets and test sets. Resulting translations evaluated using a range of automatic metrics.
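As an illustration of what the probability tables mentioned above contain, here is a hedged Python sketch that estimates phrase translation probabilities by relative frequency over extracted phrase pairs; the pairs themselves are placeholders, and a real Pharaoh-style phrase table also stores reverse probabilities and lexical weights.

```python
from collections import Counter, defaultdict

# Relative-frequency estimate p(target | source) over toy extracted phrase pairs.
phrase_pairs = [("to view", "pour visualiser"),
                ("to view", "pour voir"),
                ("to view", "pour visualiser"),
                ("the effect", "l'effet")]

counts = Counter(phrase_pairs)
source_totals = defaultdict(int)
for (src, _), c in counts.items():
    source_totals[src] += c

phrase_table = {pair: c / source_totals[pair[0]] for pair, c in counts.items()}
print(phrase_table[("to view", "pour visualiser")])   # -> 0.666...
```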

28 National Centre for Language Technology PBSMT vs. EBMT: English—French

  TS3    System                 BLEU   Prec.   Rec.    WER    SER
         PBSMT (Giza++ data)    .375   .659    .587    .585   .868
         PBSMT (EBMT data)      .364   .666    .576    .613   .879
         WBSMT                  .322   .651    .570    .535   .891
         EBMT                   .441   .673    .688    .524   .656

PBSMT with Giza++ sub-sentential alignments wins out over PBSMT with EBMT data, but cf. the size of the data sets (EBMT: 403,317 vs. PBSMT: 1.73M items). PBSMT beats WBSMT, notably for BLEU, but is 5% worse for WER. SER is still (disappointingly) high. EBMT beats PBSMT, esp. for BLEU, Recall, WER & SER.

29 National Centre for Language Technology PBSMT vs. EBMT: French—English

  TS3    System                 BLEU   Prec.   Rec.    WER    SER
         PBSMT (Giza++ data)    .420   .653    .710    .629   .828
         PBSMT (EBMT data)      .395   .615    .664    .748   .862
         WBSMT                  .446   .704    .724    .468   .808
         EBMT                   .461   .678    .744    .508   .512

PBSMT with Giza++ sub-sentential alignments wins out over PBSMT with EBMT data (with the same caveat about data-set size). PBSMT with either knowledge source performs better for French—English than for English—French. PBSMT doesn't beat WBSMT - ?? EBMT beats PBSMT.

30 National Centre for Language Technology Experiment 3a: Seeding Pharaoh with Giza++ Words and EBMT Phrases: English—French

  TS3    System                                BLEU   Prec.   Rec.    WER    SER
         Hybrid (Giza++ words + EBMT phrases)  .396   .677    .591    .593   .854
         PBSMT (Giza++ data only)              .375   .659    .587    .585   .868

The hybrid PBSMT system beats the ‘baseline’ PBSMT for BLEU, P&R and SER; slightly worse WER. Data size: 430K (cf. PBSMT 1.73M, EBMT 403K). Still worse than the EBMT scores.

31 National Centre for Language Technology Experiment 3b: Seeding Pharaoh with Giza++ Words and EBMT Phrases: French—English

  TS3    System                                BLEU   Prec.   Rec.    WER    SER
         Hybrid (Giza++ words + EBMT phrases)  .427   .642    .692    .681   .834
         PBSMT (Giza++ data only)              .420   .653    .710    .629   .828

The hybrid PBSMT system beats the ‘baseline’ PBSMT for BLEU; slightly worse for P&R and SER; quite a bit worse for WER. Still shy of the results for EBMT.

32 National Centre for Language Technology Experiment 4a: Seeding Pharaoh with All Data, English—French

  TS3    System        BLEU   Prec.   Rec.    WER    SER
         Fully Hybrid  .426   .703    .610    .543   .836
         Semi-Hybrid   .396   .677    .591    .593   .854
         EBMT          .441   .673    .688    .524   .656

The Hybrid System beats the ‘semi-hybrid’ system on all metrics; it loses out to the EBMT system on everything except Precision. The data set is now >2M items.

33 National Centre for Language Technology Experiment 4b: Seeding Pharaoh with All Data, French—English

  TS3    System        BLEU   Prec.   Rec.    WER    SER
         Fully Hybrid  .489   .693    .717    .564   .784
         Semi-Hybrid   .427   .642    .692    .681   .834
         EBMT          .461   .678    .744    .508   .512

The Hybrid System beats the ‘semi-hybrid’ system on all metrics; it also beats EBMT on BLEU & Precision; EBMT is ahead for Recall & WER, and still well ahead for SER.

34 National Centre for Language Technology Summary of Results: WBSMT vs. EBMT None of these are ‘bad’ systems: for TS3, the worst BLEU score is for WBSMT, English→French, at .322. WBSMT loses out to EBMT 2:1 (but does better overall for French→English). For TS3, the WBSMT BLEU score of .446 and the EBMT score of .461 are high scores. For the WBSMT vs. EBMT experiments, one odd finding: higher scores for the 100K training set, to be investigated in future work.

35 National Centre for Language Technology Summary of Results: PBSMT vs. EBMT PBSMT scores better than WBSMT, but an odd result for French→English …?! Best PBSMT BLEU scores (with Giza++ data only): .375 (E→F), .420 (F→E). Seeding PBSMT with EBMT data gets good scores: for BLEU, .364 (E→F), .395 (F→E); note the differences in data size (1.73M vs. 403K). PBSMT loses out to EBMT. PBSMT SER remains very high (83—87%).

36 National Centre for Language Technology Summary of Results: Semi-Hybrid Systems Seeding Pharaoh with SMT words and EBMT phrases improves over the baseline Giza++-seeded system. Data size diminishes considerably (430K vs. 1.73M). Still a worse result for the ‘semi-hybrid’ system for French→English than for WBSMT … ?! Still worse results than for EBMT.

37 National Centre for Language Technology Summary of Results: Fully Hybrid Systems Better results than for the ‘semi-hybrid’ systems: E→F .426 (vs. .396), F→E .489 (vs. .427). Data size increases. For F→E, the Hybrid system beats EBMT on BLEU (.461) & Precision; EBMT is ahead for Recall & WER, and still well ahead (27%) for SER.

38 National Centre for Language Technology Concluding Remarks Despite the convergence between EBMT and SMT, there are further gains to be made. Merging Giza++ and EBMT-induced data leads to an improved Hybrid Example-Based SMT system ⇒ Lesson for the SMT community: don't disregard the large body of work on EBMT! We expect in further work that adding SMT sub-sentential data to our EBMT system will also lead to improvements ⇒ Lesson for EBMT-ers: SMT data can help you too!

39 National Centre for Language Technology Future Work
Carry out significance tests on these results. Investigate what's going on in the 2nd (100K) training set. Develop the ‘Statistical EBMT System’ as described. Other issues in hybridity:
– use a target LM in EBMT;
– replace the EBMT recombination process with an SMT decoder;
– try different decoders, LMs and TMs;
– factor Marker Tags into the SMT Probability Tables.
Experiment with other training data in other sublanguage domains, especially those where larger corpora are available (e.g. Canadian Hansards, European Parliament …). Try other language pairs.

