Machine Translation Course 10 Diana Trandabăț 2015-2016.


MT Evaluation is difficult:
– There is no single correct translation (language variability)
– Human evaluation is subjective
– How good is "good enough"? Depends on the application
– Is system A better than system B? Depends on the specific criteria…
MT evaluation is a research topic in itself! How do we assess whether an evaluation method or metric is good?

Machine Translation Evaluation
Quality assessment at sentence (segment) level vs. task-based evaluation
Adequacy (is the meaning translated correctly?) vs. Fluency (is the output grammatical and fluent?) vs. Ranking (is translation-1 better than translation-2?)
Manual evaluation:
– Subjective Sentence Error Rates
– Correct vs. Incorrect translations
– Error categorization
Automatic evaluation:
– Exact Match, BLEU, METEOR, NIST, etc.

Human MT evaluation
Given machine translation output and the source and/or a reference translation, assess the quality of the machine translation output.
Adequacy: Does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted?
Fluency: Is the output fluent? This involves both grammatical correctness and idiomatic word choices.
Source sentence: Le chat entre dans la chambre.
Adequate, fluent translation: The cat enters the room.
Adequate, disfluent translation: The cat enters in the bedroom.
Fluent, inadequate translation: My Granny plays the piano.
Disfluent, inadequate translation: piano Granny the plays My.

Human MT evaluation - scales

Main Types of Human Assessments
Adequacy and Fluency scores
Human preference ranking of translations at the sentence level
Post-editing measures: post-editor editing time/effort
Human editability measures: can humans edit the MT output into a correct translation?
Task-based evaluations: was the performance of the MT system sufficient to perform a particular task?

MT Evaluation by Humans – Problems:
– Intra-coder agreement: consistency of the same human judge
– Inter-coder agreement: agreement on quality judgments across multiple judges
– Very expensive (w.r.t. time and money)
– Not always available
– Can't help day-to-day system development

Goals for Evaluation Metrics
Low cost: reduce the time and money spent on carrying out evaluation
Tunable: automatically optimize system performance towards the metric
Meaningful: the score should give an intuitive interpretation of translation quality
Consistent: repeated use of the metric should give the same results
Correct: the metric must rank better systems higher

Other Evaluation Criteria
When deploying systems, considerations go beyond the quality of translations:
Speed: we prefer faster machine translation systems
Size: fits into the memory of available machines (e.g., handheld devices)
Integration: can be integrated into an existing workflow
Customization: can be adapted to the user's needs

Automatic Machine Translation Evaluation
Objective, cheap, fast
Measures the "closeness" between the MT hypothesis and human reference translations:
– Precision: n-gram precision
– Recall: against the best-matched reference; in BLEU, approximated by the brevity penalty
Highly correlated with human evaluations
MT research has greatly benefited from automatic evaluations
Typical metrics: BLEU, NIST, F-score, METEOR, TER

Automatic Machine Translation Evaluation
Automatic MT metrics are not sufficient:
– What does a score of 30.0 or 50.0 mean?
– Existing automatic metrics are crude and at times biased
– Automatic metrics don't provide sufficient insight for error analysis
– Different types of errors have different implications depending on the underlying task in which MT is used
Reliable human measures are needed in order to develop and assess automatic metrics for MT evaluation.

History of Automatic Evaluation of MT
2002: NIST starts the MT Eval series under the DARPA TIDES program, using BLEU as the official metric
2003: Och proposes MERT (minimum error rate training) for MT based on BLEU
2004: METEOR first comes out
2006: TER is released; the DARPA GALE program adopts HTER as its official metric
2006: NIST MT Eval starts reporting METEOR, TER and NIST scores in addition to BLEU; the official metric is still BLEU
2007: Research on metrics takes off… several new metrics come out
2007: MT research papers increasingly report METEOR and TER scores in addition to BLEU
…

Precision and Recall
SYSTEM A: Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
Precision: correct / output-length = 3/6 = 50%
Recall: correct / reference-length = 3/7 = 43%
F-measure: 2 × Precision × Recall / (Precision + Recall) = 46%

Precision and Recall
SYSTEM A: Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
SYSTEM B: airport security Israeli officials are responsible
Precision: correct / output-length = 6/6 = 100%
Recall: correct / reference-length = 6/7 = 86%
F-measure: 2 × Precision × Recall / (Precision + Recall) = 92%
Is system B better than system A?
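The unigram precision, recall and F-measure figures above are easy to reproduce. The sketch below is illustrative only: it assumes simple whitespace tokenization and lower-casing, and the helper name unigram_prf is ours, not from the slides.

```python
# Minimal sketch: bag-of-words unigram precision, recall and F-measure
# for the two example systems from the slides above.
from collections import Counter

def unigram_prf(candidate: str, reference: str):
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Bag-of-words overlap: each candidate word counts at most as often as in the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    f_measure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f_measure

reference = "Israeli officials are responsible for airport security"
system_a  = "Israeli officials responsibility of airport safety"
system_b  = "airport security Israeli officials are responsible"

print(unigram_prf(system_a, reference))  # ~ (0.50, 0.43, 0.46)
print(unigram_prf(system_b, reference))  # ~ (1.00, 0.86, 0.92)
```

System B scores higher even though its word order is scrambled: unigram precision and recall alone cannot penalise wrong ordering, which is exactly what order-sensitive measures such as WER address next.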

Word Error Rate
Minimum number of editing steps needed to transform the output into the reference (Levenshtein distance), normalized by the reference length:
– match: words match, no cost
– substitution: replace one word with another
– insertion: add a word
– deletion: drop a word
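A minimal sketch of word-level Levenshtein distance and the resulting WER, assuming whitespace tokenization, unit edit costs, and normalization by the reference length; the two calls at the bottom reproduce the 57% and 71% figures in the table that follows.

```python
# Word-level Levenshtein distance and WER (illustrative sketch).
def word_error_rate(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # empty hypothesis: i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # empty reference: j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # match or substitution
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)              # normalise by reference length

reference = "Israeli officials are responsible for airport security"
print(word_error_rate("Israeli officials responsibility of airport safety", reference))  # ~0.57 (System A)
print(word_error_rate("airport security Israeli officials are responsible", reference))  # ~0.71 (System B)
```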

Word Error Rate
Metric          | System A | System B
Word error rate | 57%      | 71%

BLEU – Its Motivation
Central idea: "The closer a machine translation is to a professional human translation, the better it is."
Implication: an evaluation metric can itself be evaluated – if it correlates with human evaluation, it is a useful metric.
BLEU was proposed as an aid, and as a quick substitute for human evaluation when needed.

What is BLEU? A Big Picture
Proposed by IBM [Papineni et al., 2002]
Main ideas:
– Exact matches of words
– Match against a set of reference translations for a greater variety of expressions
– Account for adequacy by looking at word precision
– Account for fluency by calculating n-gram precisions for n = 1, 2, 3, 4
– Introduces a "brevity penalty" based on the ratio of system output length to reference length, to penalize overly short output

BLEU score
– The final score is the weighted geometric average of the n-gram precision scores, multiplied by the brevity penalty
– The aggregate score is calculated over a large test set
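For reference, the standard formulation from Papineni et al. (2002) can be written as below, using the usual notation: p_n is the modified n-gram precision, w_n = 1/N (typically N = 4), c is the total length of the system output and r the effective reference length.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big( \sum_{n=1}^{N} w_n \log p_n \Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r,\\
e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```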

BLEU Example

Multiple Reference Translations
To account for variability, use multiple reference translations:
– n-grams may match in any of the references
– the closest reference length is used for the brevity penalty

N-gram Precision: an Example
Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Clearly, Candidate 1 is better.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.

N-gram Precision
To rank Candidate 1 higher than Candidate 2:
– Just count the number of n-gram matches
– The matching is position-independent
– A reference word can be matched multiple times
– No linguistic motivation is needed

BLEU – Example: Unigram Precision
Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.
Unigram matches: 17 (of 18 candidate words)

Example: Unigram Precision (cont.)
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.
Unigram matches: 8 (of 14 candidate words)

Issue with N-gram Precision
What if some words are over-generated, e.g. "the"?
An extreme example:
Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Unigram precision: 7/7 – something is clearly wrong.
Intuitively, a reference word should be exhausted (considered used up) once it has been matched.

Modified N-gram Precision: Procedure
– Count the maximum number of times a word occurs in any single reference
– Clip the total count of each candidate word by that maximum
– Modified n-gram precision = clipped count / total number of candidate words
Example:
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
"the" has a maximum reference count of 2; the candidate unigram count is 7
Clipped count = 2, total candidate count = 7
Modified unigram precision = 2/7
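The clipping procedure above amounts to only a few lines of code. The sketch below is a minimal illustration (assuming a simple regex tokenizer and unigrams only, not the full BLEU computation); it reproduces the 2/7 result from the example.

```python
# Modified (clipped) unigram precision, illustrative sketch.
import re
from collections import Counter

def modified_unigram_precision(candidate: str, references: list[str]) -> float:
    tokenize = lambda text: re.findall(r"\w+", text.lower())
    cand_counts = Counter(tokenize(candidate))
    clipped = 0
    for word, count in cand_counts.items():
        # Clip by the maximum count of this word in any single reference.
        max_ref_count = max(Counter(tokenize(ref))[word] for ref in references)
        clipped += min(count, max_ref_count)
    return clipped / sum(cand_counts.values())

refs = ["The cat is on the mat.", "There is a cat on the mat."]
print(modified_unigram_precision("the the the the the the the.", refs))  # 2/7 ≈ 0.29
```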

Different N in Modified N-gram Precision
Precision for n > 1 (groups of words) is computed in a similar way:
– When unigram precision is high, the translation tends to be adequate
– When longer n-gram precision is high, the translation tends to be fluent

The METEOR Metric
Metric developed by Lavie et al. at CMU/LTI
METEOR = Metric for Evaluation of Translation with Explicit ORdering
Main new ideas:
– Include both recall and precision as score components
– Look only at unigram precision and recall
– Align the MT output with each reference individually and take the score of the best pairing
– Matching takes word variability into account (via stemming) as well as synonyms
– Scoring component weights are tuned to optimize correlation with human judgments

The METEOR Metric
Partial credit for matching stems – SYSTEM: Jim went home / REFERENCE: Jim goes home
Partial credit for matching synonyms – SYSTEM: Jim walks home / REFERENCE: Jim goes home
Use of paraphrases
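The snippet below is a deliberately simplified, unofficial illustration of the METEOR idea: unigrams are aligned in stages (exact match, then stem match via NLTK's PorterStemmer, then synonym match via WordNet), and precision and recall are combined with the recall-weighted harmonic mean from the original METEOR paper. The fragmentation (word-ordering) penalty and paraphrase matching are omitted, the example sentences are our own, and the WordNet corpus must be downloaded first (nltk.download('wordnet')).

```python
# Simplified METEOR-style matching sketch (NOT the official implementation).
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet

def meteor_like_score(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    stem = PorterStemmer().stem
    unmatched_ref = list(ref)
    matches = 0
    for word in hyp:
        # Single-word WordNet synonyms of this hypothesis word.
        synonyms = {l.name().lower() for s in wordnet.synsets(word) for l in s.lemmas()}
        for cand in unmatched_ref:
            if word == cand or stem(word) == stem(cand) or cand in synonyms:
                matches += 1
                unmatched_ref.remove(cand)   # each reference word is used at most once
                break
    if matches == 0:
        return 0.0
    precision, recall = matches / len(hyp), matches / len(ref)
    # Harmonic mean weighted 9:1 towards recall, as in Banerjee & Lavie (2005).
    return 10 * precision * recall / (recall + 9 * precision)

print(meteor_like_score("the cats sat on the mats", "the cat sat on the mat"))  # 1.0 via stem matches
print(meteor_like_score("Jim sat on the sofa", "Jim sat on the couch"))         # 1.0 via a synonym match
```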

METEOR vs. BLEU
Highlights of the main differences:
– METEOR's word matches between translation and references include semantic equivalents (inflections and synonyms)
– METEOR combines precision and recall (weighted towards recall) instead of BLEU's brevity penalty
– METEOR uses a direct word-ordering penalty to capture fluency instead of relying on higher-order n-gram matches
– METEOR can tune its parameters to optimize correlation with human judgments
Outcome: METEOR has significantly better correlation with human judgments, especially at the segment level

Correlation with Human Judgments
Human judgment scores for adequacy and fluency, each on a [1-5] scale (or sum them together)
Correlation of metric scores with human scores at the system level:
– Can rank systems
– Even coarse metrics can have high correlations
Correlation of metric scores with human scores at the sentence level:
– Evaluates score correlations at a fine-grained level
– Very large number of data points, multiple systems
– Pearson correlation
– Look at metric score variability for MT sentences scored as equally good by humans
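As a small illustration of system-level correlation, the sketch below correlates hypothetical (made-up) metric scores with hypothetical averaged human adequacy scores for five systems, using SciPy's Pearson correlation.

```python
# System-level metric-human correlation on invented example numbers.
from scipy.stats import pearsonr

metric_scores = [0.21, 0.28, 0.31, 0.35, 0.42]   # e.g. BLEU per system (hypothetical)
human_scores  = [2.9, 3.1, 3.4, 3.8, 4.2]        # e.g. mean adequacy on [1-5] (hypothetical)

r, p_value = pearsonr(metric_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```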

Types of Machine Translation
Unassisted Machine Translation: unassisted MT takes pieces of text and translates them into output for immediate use, with no human involvement. The result is unpolished text that gives only a gist of the source text.
Assisted Machine Translation: assisted MT uses a human translator to clean up after, and sometimes before, translation in order to get better-quality results. Usually the process is improved by limiting the vocabulary through the use of a dictionary and by restricting the types of sentences/grammar allowed.

Assisted Machine Translation
Assisted MT can be divided into Human-Aided Machine Translation (HAMT), where the machine translates with human help, and Machine-Aided Human Translation (MAHT), where the human translates with machine help. Computer-Aided Translation (CAT) is a more recent form of MAHT.

Types of Translation (diagram): Computer-Aided MT, Computer-Assisted Translation, Human-Assisted MT.

Computer-Assisted Translation
Computer-assisted translation (CAT), also called computer-aided translation or machine-aided human translation (MAHT), is a form of translation wherein a human translator creates a target text with the assistance of a computer program. The machine supports the human translator.
Computer-assisted translation can include standard dictionary and grammar software, but it normally refers to a range of specialized programs available to the translator, including translation-memory, terminology-management, concordance, and alignment programs.

Translation technology tools mean…
Advanced word processing functions:
– complex file management
– understanding and manipulating complex file types
– multilingual word processing
– configuring templates and using tools such as AutoText
Information mining on the WWW
Translation Memory systems
Machine Translation

Translation technology tools involve…
Terminology extraction and management
Software and web localisation
Image editing (Photoshop)
Corpora and text alignment
XML and the localisation process…
and much, much more.

Translation Memory
Translation memory software stores source-language segments together with the target-language segments a human translator produced for them, in a database for future reuse.
Newly encountered source-language segments are compared to the database content, and the resulting output (exact match, fuzzy match, or no match) is reviewed by the translator.
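A toy sketch of the exact/fuzzy lookup described above, using Python's difflib for string similarity; the translation-memory contents and the 0.75 fuzzy-match threshold are invented for illustration, not taken from any real TM tool.

```python
# Toy translation-memory lookup: exact match, fuzzy match, or no match.
import difflib

# Hypothetical source -> target segment pairs (illustrative only).
translation_memory = {
    "Click the Save button.": "Cliquez sur le bouton Enregistrer.",
    "The file could not be found.": "Le fichier est introuvable.",
}

def tm_lookup(segment: str, threshold: float = 0.75):
    if segment in translation_memory:                      # exact match: reuse directly
        return "exact", translation_memory[segment]
    close = difflib.get_close_matches(segment, list(translation_memory), n=1, cutoff=threshold)
    if close:                                              # fuzzy match: offered for review/adaptation
        return "fuzzy", translation_memory[close[0]]
    return "no match", None                                # translate from scratch

print(tm_lookup("Click the Save button."))      # exact reuse
print(tm_lookup("Click the Cancel button."))    # fuzzy match against the Save segment
print(tm_lookup("Restart the application."))    # no usable match
```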

Translation Memory (TM) workflow (diagram): a human translator works on Text A, the translated segments are stored in a database, and the database is reused when the translator later works on Text B.

Translation Memory (TM) tools: DejaVu, TRADOS, SDLX, Star Transit.

Professional translation can be done…
In principle, in any authoring editor (desktop/browser)
– However, with limited productivity (in the range words per day) and high effort to maintain consistency and accuracy
Using Microsoft Word + plug-ins
– Plug-in to a translation productivity tool
– Hard to deal with structured content
Using a dedicated translation editor (CAT)
– Depending on various factors: productivity boost in the range of 2000 to 5000 words per day
– Well-established market for professionals

Key Productivity Accelerators

Productivity Accelerators
– "Don't translate if it hasn't changed" (but show it, to provide context for the text that has actually changed or been added)
– "Don't re-translate if you can reuse an (approved) existing translation" (but adapt it as you need)
– "Adapt an automated translation proposal" (instead of translating from scratch)
– "Auto-propagate translations to identical source segments" (and ripple through any changes when you change your translation)
– "While I type, provide a list of relevant candidates so that I can quickly auto-complete this part of my translation"
– "While I type, make it easy for me to place tags, recognised terms and other placeables so I can focus on the translatable text"
– "Make it easy for me to search through Translation Memories, in both source and target text, from wherever I am in the document I'm translating"
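As a toy illustration of the auto-propagation idea from the list above (not a real CAT-tool API), the sketch below copies a confirmed translation to every untranslated segment with an identical source text.

```python
# Toy auto-propagation of a confirmed translation to identical source segments.
def auto_propagate(segments, source, target):
    """Fill in the translation for every untranslated segment whose source
    text is identical to the segment the translator just confirmed."""
    return [
        (src, target) if src == source and tgt is None else (src, tgt)
        for src, tgt in segments
    ]

document = [
    ("Press OK to continue.", None),
    ("Enter your password.", None),
    ("Press OK to continue.", None),   # repeated source segment
]
# The translator confirms one segment; the identical one is filled automatically.
print(auto_propagate(document, "Press OK to continue.", "Appuyez sur OK pour continuer."))
```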

Great! See you next time!