
1 Statistical vs. Neural Machine Translation: a Comparison of MTH and DeepL at Swiss Post's Language Service
Lise Volkart – Pierrette Bouillon – Sabrina Girletti
University of Geneva – Translation Technology Department (TIM)
lise.volkart ¦ pierrette.bouillon ¦
Asling, Translation and the Computer 40, London, November 2018

2 Introduction
Context:
- Microsoft Translator Hub (MTH), trained with 288,211 segments and 76 terms from Swiss Post data
- DeepL, a generic neural machine translation system
- Language direction: German-to-French
- Test set: Swiss Post's annual report
Research questions:
- Can a generic neural system compete with a customised statistical MT system?
- Is BLEU a suitable metric for the evaluation of NMT?

3 Comparison of MTH and DeepL
Three types of evaluation:
- Automatic evaluation (BLEU)
- Human evaluation I: post-editing productivity test
- Human evaluation II: comparative evaluation of the post-edited output

4 Automatic evaluation
Results (corpus of 1,718 segments): very similar scores; BLEU is slightly higher for DeepL.

System   BLEU
DeepL    25.23
MTH      23.46
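BLEU scores like those above compare system output against a reference translation using modified n-gram precision and a brevity penalty. As a rough illustration of what is being measured, here is a minimal corpus-level BLEU sketch with the standard uniform 1-to-4-gram weights and a single reference per segment. The function names are illustrative, and a real evaluation would use a standard implementation (e.g. sacreBLEU) with proper tokenisation.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100): geometric mean of clipped 1..max_n-gram
    precisions, scaled by a brevity penalty. One reference per hypothesis."""
    clipped = [0] * max_n   # matched n-grams, clipped by reference counts
    totals = [0] * max_n    # total n-grams in the hypotheses
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams = ngram_counts(h, n)
            r_ngrams = ngram_counts(r, n)
            totals[n - 1] += sum(h_ngrams.values())
            clipped[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
    if min(clipped) == 0:
        return 0.0  # some n-gram order had no match at all
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```

A perfect match scores 100; scores in the mid-20s, as in the table above, indicate substantial divergence from the single reference, which is why the human evaluations that follow matter.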

5 Human evaluation I
Post-editing productivity test:
- 2 participants (one in-house translator, one freelancer)
- 250 segments
- Full post-editing
- Measures: post-editing time and HTER (Human-targeted Translation Edit Rate)
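HTER quantifies post-editing effort as the minimum number of edits needed to turn the raw MT output into its post-edited version, normalised by the length of the post-edited text. A simplified sketch: true TER also counts block shifts as single edits, whereas this version approximates it with plain word-level Levenshtein distance (insertions, deletions, substitutions only), and the function names are my own.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens: insertions, deletions,
    substitutions, computed with a rolling one-row dynamic program."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                      # delete a[i-1]
                         cur[j - 1] + 1,                   # insert b[j-1]
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitute
        prev = cur
    return prev[n]

def hter(mt_output, post_edited):
    """Approximate HTER: edits to turn the MT output into its post-edited
    version, divided by the post-edited segment length."""
    mt, pe = mt_output.split(), post_edited.split()
    return word_edit_distance(mt, pe) / max(len(pe), 1)
```

An HTER of 0 means the translator accepted the MT output unchanged; higher values mean heavier editing, so the "75.1% lower HTER for DeepL" result below means its output needed far fewer corrections.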

6 Human evaluation I (continued)

7 Human evaluation I (continued)
Results (continued):
- Post-editing time: 53.6% faster for DeepL
- HTER: 75.1% lower for DeepL

8 Human evaluation II
Comparative evaluation of the post-edited output:
- Goal: ensure that lower PE time and lower HTER ≠ lower final quality
- 3 evaluators (MA translation students)
- Post-edited output from MTH vs. DeepL

9 Human evaluation II (continued)
Results

10 BLEU score’s reliability for NMT evaluation
Motivations:
- Low correlation between automatic and human evaluations
- Previous studies suggest that BLEU tends to underestimate the quality of NMT
Methodology:
- Calculate the underestimation rate (Shterionov et al., 2017): the number of segments judged better by humans but scoring lower in BLEU, divided by the total number of segments judged better by humans
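The underestimation rate can be computed directly from paired per-segment judgments. A minimal sketch under the assumption that each segment carries a human preference flag for the NMT output plus sentence-level BLEU scores for both systems; the data layout and names are illustrative, not taken from Shterionov et al.

```python
def underestimation_rate(segments):
    """segments: list of (human_prefers_nmt, bleu_nmt, bleu_smt) tuples.
    Returns the share of human-preferred NMT segments whose BLEU is
    nevertheless lower than the competing system's."""
    preferred = [(b_nmt, b_smt) for pref, b_nmt, b_smt in segments if pref]
    if not preferred:
        return 0.0  # humans never preferred the NMT output
    underrated = sum(1 for b_nmt, b_smt in preferred if b_nmt < b_smt)
    return underrated / len(preferred)
```

A high rate means BLEU frequently disagrees with human judgment in the direction of penalising the neural system, which is exactly the concern the next slide's results address.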

11 BLEU score’s reliability for NMT evaluation
Results

12 Summary of the results
- DeepL obtains a slightly higher BLEU score than MTH
- DeepL's output requires less post-editing effort
- Final quality seems to be better with DeepL
- BLEU seems to underestimate the quality of DeepL's output

