
1 Statistical vs. Neural Machine Translation: a Comparison of MTH and DeepL at Swiss Post's Language Service
Lise Volkart – Pierrette Bouillon – Sabrina Girletti
University of Geneva – Translation Technology Department (TIM)
lise.volkart ¦ pierrette.bouillon ¦
Asling, Translation and the Computer 40, London, November 2018

2 Introduction
Context:
- Microsoft Translator Hub (MTH), trained with 288,211 segments and 76 terms from Swiss Post data
- DeepL, a generic neural machine translation system
- Language direction: German-to-French
- Test set: Swiss Post's annual report
Research questions:
- Can a generic neural system compete with a customised statistical MT system?
- Is BLEU a suitable metric for the evaluation of NMT?

3 Comparison of MTH and DeepL
Three types of evaluation:
- Automatic evaluation (BLEU)
- Human evaluation I: post-editing productivity test
- Human evaluation II: comparative evaluation of the post-edited output

4 Automatic evaluation
Results (corpus of 1,718 segments): very similar scores; BLEU is slightly higher for DeepL.

System   BLEU
DeepL    25.23
MTH      23.46
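BLEU scores like those above compare system output against a reference translation using modified n-gram precision and a brevity penalty. As a rough illustration of what is being measured, here is a minimal corpus-level BLEU sketch with the standard uniform 1-to-4-gram weights and a single reference per segment. The function names are illustrative, and a real evaluation would use a standard implementation (e.g. sacreBLEU) with proper tokenisation.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100): geometric mean of clipped 1..max_n-gram
    precisions, scaled by a brevity penalty. One reference per hypothesis."""
    clipped = [0] * max_n   # matched n-grams, clipped by reference counts
    totals = [0] * max_n    # total n-grams in the hypotheses
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams = ngram_counts(h, n)
            r_ngrams = ngram_counts(r, n)
            totals[n - 1] += sum(h_ngrams.values())
            clipped[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
    if min(clipped) == 0:
        return 0.0  # some n-gram order had no match at all
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```

A perfect match scores 100; scores in the mid-20s, as in the table above, indicate substantial divergence from the single reference, which is why the human evaluations that follow matter.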

5 Human evaluation I
Post-editing productivity test:
- 2 participants (one in-house translator, one freelancer)
- 250 segments
- Full post-editing
- Measures: post-editing time and HTER (Human-targeted Translation Edit Rate)
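HTER quantifies post-editing effort as the minimum number of edits needed to turn the raw MT output into its post-edited version, normalised by the length of the post-edited text. A simplified sketch: true TER also counts block shifts as single edits, whereas this version approximates it with plain word-level Levenshtein distance (insertions, deletions, substitutions only), and the function names are my own.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens: insertions, deletions,
    substitutions, computed with a rolling one-row dynamic program."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                      # delete a[i-1]
                         cur[j - 1] + 1,                   # insert b[j-1]
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitute
        prev = cur
    return prev[n]

def hter(mt_output, post_edited):
    """Approximate HTER: edits to turn the MT output into its post-edited
    version, divided by the post-edited segment length."""
    mt, pe = mt_output.split(), post_edited.split()
    return word_edit_distance(mt, pe) / max(len(pe), 1)
```

An HTER of 0 means the translator accepted the MT output unchanged; higher values mean heavier editing, so the "75.1% lower HTER for DeepL" result below means its output needed far fewer corrections.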

6 Human evaluation I (continued)

7 Human evaluation I (continued)
Results (continued):
- Post-editing time: 53.6% faster for DeepL
- HTER: 75.1% lower for DeepL

8 Human evaluation II
Comparative evaluation of the post-edited output:
- Goal: ensure that lower PE time and lower HTER ≠ lower final quality
- 3 evaluators (MA translation students)
- Post-edited output from MTH vs. DeepL

9 Human evaluation II (continued)
Results

10 BLEU score’s reliability for NMT evaluation
Motivations:
- Low correlation between automatic and human evaluations
- Previous studies suggest that BLEU tends to underestimate the quality of NMT
Methodology:
- Calculate the underestimation rate (Shterionov et al., 2017): the number of segments judged better by humans but scoring lower in BLEU, divided by the total number of segments judged better by humans
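The underestimation rate can be computed directly from paired per-segment judgments. A minimal sketch under the assumption that each segment carries a human preference flag for the NMT output plus sentence-level BLEU scores for both systems; the data layout and names are illustrative, not taken from Shterionov et al.

```python
def underestimation_rate(segments):
    """segments: list of (human_prefers_nmt, bleu_nmt, bleu_smt) tuples.
    Returns the share of human-preferred NMT segments whose BLEU is
    nevertheless lower than the competing system's."""
    preferred = [(b_nmt, b_smt) for pref, b_nmt, b_smt in segments if pref]
    if not preferred:
        return 0.0  # humans never preferred the NMT output
    underrated = sum(1 for b_nmt, b_smt in preferred if b_nmt < b_smt)
    return underrated / len(preferred)
```

A high rate means BLEU frequently disagrees with human judgment in the direction of penalising the neural system, which is exactly the concern the next slide's results address.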

11 BLEU score’s reliability for NMT evaluation
Results

12 Summary of the results
- DeepL obtains a slightly higher BLEU score than MTH
- DeepL's output requires less post-editing effort
- Final quality seems to be better with DeepL
- BLEU seems to underestimate the quality of DeepL's output

