Vorlesung Maschinelle Übersetzung, SS 2010


MT Evaluation

“You Improve What you Evaluate”

Need for MT Evaluation
MT evaluation is important:
- MT systems are becoming widespread, embedded in more complex systems
  - How well do they work in practice? Are they reliable enough?
  - Absolute quality vs. relative ranking
- MT is a technology still in the research stage
  - How can we tell if we are making progress?
  - Metrics that can drive experimental development

Need for MT Evaluation
MT evaluation is difficult:
- Unlike in speech recognition (ASR), multiple equally good translations are usually possible
- Human evaluation is subjective
- How good is good enough? Depends on the application
- Is system A better than system B? Depends on the specific criteria...
MT evaluation is a research topic in itself! How do we assess whether an evaluation method is good?

Dimensions of MT Evaluation
- Human evaluation vs. automatic metrics
- Quality assessment at the sentence (segment) level vs. task-based evaluation
- Adequacy (is the meaning translated correctly?) vs. fluency (is the output grammatical and fluent?)

Human vs. Automatic
Subjective MT evaluation:
- Fluency and adequacy scored by human judges
- Expensive and time-consuming
Objective (automatic) MT evaluation:
- Inspired by the word error rate (WER) metric used in ASR research
- Cheap and fast
- Highly correlated with subjective evaluation

Example Approaches
Automatic evaluation:
- BLEU, modified BLEU
- NIST
- METEOR
- WER, TER
Human evaluation:
- DARPA TIDES MT evaluation
- Speech-to-speech translation systems (sentence-level, quality-based, end-to-end + some components)
- TC-STAR SLT evaluation
- HTER
- Concept Transfer

Automatic Metrics for MT Evaluation
Idea: compare the output of an MT system to a "reference" good translation (usually human) and ask: how close is the MT output to the reference translation?
Advantages:
- Fast and cheap, minimal human labor, no need for bilingual speakers
- Can be used on an ongoing basis during system development to test changes
Disadvantages:
- Current metrics are crude and do not distinguish well between subtle differences in systems
- Individual sentence scores are not very reliable; aggregate scores over a large test set are required
- Metrics may not reflect the perceived quality/usefulness of the system

Similarity-based MT Evaluation Metrics
- Assess the "quality" of an MT system by comparing its output with human-produced "reference" translations
- Premise: the more similar (in meaning) the translation is to the reference, the better
- Goal: an algorithm that is capable of approximating this similarity
- Large number of metrics, mostly based on exact word-level correspondences:
  - Edit-distance metrics: Levenshtein, WER, pWER, ...
  - N-gram-based metrics: BLEU, NIST, GTM, ORANGE, METEOR, ...
- Important issue: exact word matching is a very crude estimate of sentence-level similarity in meaning

Automatic Metrics for MT Evaluation
Example:
- Reference: "the Iraqi weapons are to be handed over to the army within two weeks"
- MT output: "in two weeks Iraq's weapons will give army"
Possible metric components:
- Precision: correct words / total words in the MT output
- Recall: correct words / total words in the reference
- Combination of P and R (e.g. F1 = 2PR/(P+R))
- Levenshtein edit distance: number of insertions, deletions and substitutions required to transform the MT output into the reference
Important issues:
- Features: matched words, n-grams, subsequences
- Metric: a scoring framework that uses the features
- Perfect word matches are weak features; synonyms and inflections are missed: "Iraq's" vs. "Iraqi", "give" vs. "hand over"
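To make the arithmetic concrete, here is a minimal Python sketch (my own illustration, not code from the lecture) that computes exact-match unigram precision, recall, F1 and word-level Levenshtein distance for the example above.

```python
# Unigram precision, recall, F1 and Levenshtein distance for the slide's example.
from collections import Counter

def levenshtein(a, b):
    """Word-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (wa != wb)))    # substitution
        prev = cur
    return prev[-1]

ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
hyp = "in two weeks Iraq's weapons will give army".split()

# Count each hypothesis word as correct at most as often as it occurs in the reference.
overlap = sum((Counter(hyp) & Counter(ref)).values())   # = 4 exact matches

precision = overlap / len(hyp)                          # 4/8  = 0.50
recall = overlap / len(ref)                             # 4/14 ~ 0.29
f1 = 2 * precision * recall / (precision + recall)      # ~ 0.36
print(precision, recall, f1, levenshtein(hyp, ref))
```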

Desirable Automatic Metric
- High levels of correlation with quantified human notions of translation quality
- Sensitive to small differences in MT quality between systems and versions of systems
- Consistent: the same MT system on similar texts should produce similar scores
- Reliable: MT systems that score similarly will perform similarly
- General: applicable to a wide range of domains and scenarios
- Fast and lightweight: easy to run

The BLEU Metric
Proposed by IBM [Papineni et al., 2002]
Main ideas:
- Exact matches of words
- Match against a set of reference translations for greater variety of expressions
- Account for adequacy by looking at word precision
- Account for fluency by calculating n-gram precisions for n = 1, 2, 3, 4
- No recall (difficult with multiple references); to compensate, introduce a "brevity penalty"
- Final score is the weighted geometric average of the n-gram scores
- Calculate the aggregate score over a large test set

The BLEU Metric
Example:
- Reference: "the Iraqi weapons are to be handed over to the army within two weeks"
- MT output: "in two weeks Iraq's weapons will give army"
BLEU precision scores (range 0-1):
- 1-gram precision: P1 = 4/8
- 2-gram precision: P2 = 1/7
- 3-gram precision: P3 = 0/6
- 4-gram precision: P4 = 0/5
- BLEU precision score = 0 (weighted geometric average)
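A small sketch of BLEU-style modified n-gram precision, assuming simple whitespace tokenization (illustration only; for real evaluations use a standard implementation such as sacrebleu):

```python
# Toy modified n-gram precision, reproducing the 4/8, 1/7, 0/6, 0/5 values above.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, refs, n):
    hyp_counts = Counter(ngrams(hyp, n))
    # Clip each hypothesis n-gram count at its maximum count in any reference.
    max_ref = Counter()
    for ref in refs:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return clipped, sum(hyp_counts.values())

ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
hyp = "in two weeks Iraq's weapons will give army".split()
for n in range(1, 5):
    num, den = modified_precision(hyp, [ref], n)
    print(f"P{n} = {num}/{den}")
```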

The BLEU Metric
Clipping precision counts:
- Reference 1: "the Iraqi weapons are to be handed over to the army within two weeks"
- Reference 2: "the Iraqi weapons will be surrendered to the army in two weeks"
- MT output: "the the the the"
The precision count for "the" is clipped at two: the maximum count of the word in any single reference. The modified unigram precision is therefore 2/4 (not 4/4).

The BLEU Metric
Brevity Penalty (BP):
- Reference 1: "the Iraqi weapons are to be handed over to the army within two weeks"
- Reference 2: "the Iraqi weapons will be surrendered to the army in two weeks"
- MT output: "the Iraqi weapons will"
- Precision scores: 1-gram 4/4, 2-gram 3/3, 3-gram 2/2, 4-gram 1/1, so the BLEU precision score = 1.0
- The MT output is much too short, which boosts precision, and BLEU has no recall...
- An exponential brevity penalty reduces the score; it is calculated from the aggregate lengths (not individual sentences), usually at the test-set level
Final score: BLEU = BP * exp(sum over n of w_n * log p_n), i.e. the brevity penalty times the weighted geometric average of the n-gram precisions

Brevity Penalty
For candidate length c and reference length r:
BP = 1 if c > r, otherwise BP = exp(1 - r/c)
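A sketch of how the pieces combine (again illustrative only, not the official BLEU code): brevity penalty plus the weighted geometric average of the n-gram precisions.

```python
import math

def brevity_penalty(c, r):
    # c = total candidate length, r = total reference length
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu_from_precisions(precisions, c, r, weights=None):
    weights = weights or [1 / len(precisions)] * len(precisions)
    if any(p == 0 for p in precisions):      # geometric mean is 0 if any p_n is 0
        return 0.0
    log_avg = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(c, r) * math.exp(log_avg)

# "the Iraqi weapons will" vs. the 14-word reference: all precisions are 1.0,
# but the brevity penalty pulls the score down sharply.
print(bleu_from_precisions([1.0, 1.0, 1.0, 1.0], c=4, r=14))   # ~ 0.082
```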

The NIST MT Metric

Weaknesses in BLEU
- BLEU matches word n-grams of the MT translation with multiple reference translations simultaneously: a precision-based metric. Is this better than matching with each reference translation separately and selecting the best match?
- BLEU compensates for recall by factoring in a "brevity penalty" (BP). Is the BP adequate in compensating for the lack of recall?
- BLEU's n-gram matching requires exact word matches. Can stemming and synonyms improve the similarity measure and improve correlation with human scores?
- All matched words weigh equally in BLEU. Can a scheme for weighting word contributions improve correlation with human scores?
- BLEU's higher-order n-grams account for fluency and grammaticality, and n-grams are geometrically averaged. Geometric n-gram averaging is volatile to "zero" scores. Can we account for fluency/grammaticality via other means?

Further Issues
- Words are not created equal: some are more important for effective translation
- More effective matching with synonyms and inflected forms:
  - Stemming
  - Use a synonym knowledge base (WordNet)
  - How to incorporate such information within the metric?
- Train weights for word matches:
  - Different weights for "content" words and "function" words
  - Use a "stop list"
- Text normalization: spelling, punctuation, lowercase vs. mixed case
- Different word segmentation (some languages, e.g. Chinese): calculate the score at the character level
- Speech translation: automatic sentence segmentation from ASR may mismatch the reference translations
  - Calculate the score at the paragraph/document level
  - MWER segmentation: align output to reference segments

The METEOR Metric
Alternative metric from CMU/LTI: METEOR = Metric for Evaluation of Translation with Explicit ORdering
Main new ideas:
- Reintroduce recall and combine it with precision as score components
- Look only at unigram precision and recall
- Align the MT output with each reference individually and take the score of the best pairing
- Matching takes into account word inflection variations (via stemming)
- Address fluency via a direct penalty: how fragmented is the matching of the MT output with the reference?

METEOR vs. BLEU
Highlights of the main differences:
- METEOR's word matches between translation and references include semantic equivalents (inflections and synonyms)
- METEOR combines precision and recall (weighted towards recall) instead of BLEU's "brevity penalty"
- METEOR uses a direct word-ordering penalty to capture fluency instead of relying on higher-order n-gram matches
- METEOR can tune its parameters to optimize correlation with human judgments
Outcome: METEOR has significantly better correlation with human judgments, especially at the segment level

METEOR Components
- Unigram precision: fraction of words in the MT output that appear in the reference
- Unigram recall: fraction of words in the reference translation that appear in the MT output
- F1 = P*R / (0.5*(P+R))
- Fmean = P*R / (α*P + (1-α)*R), a harmonic mean weighted towards recall
- Generalized unigram matches: exact word matches, stems, synonyms
- Match with each reference separately and select the best match for each sentence

The Alignment Matcher
- Find the best word-to-word alignment match between two strings of words
  - Each word in a string can match at most one word in the other string
  - Matches can be based on generalized criteria: word identity, stem identity, synonymy, ...
  - Find the alignment of highest cardinality with the minimal number of crossing branches
- Optimal search is NP-complete; clever search with pruning is very fast and gives near-optimal results
- Greedy three-stage matching: exact, stem, synonyms
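A rough sketch of the greedy three-stage idea, not the actual METEOR matcher: the "stemmer" and the synonym table below are toy stand-ins of my own, and the sketch ignores the crossing-minimization step entirely.

```python
# Greedy three-stage unigram matcher: exact, then stem, then synonym matches.

def toy_stem(word):
    # Extremely naive suffix stripping, purely for illustration.
    for suffix in ("'s", "s", "i", "ed", "ing"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

TOY_SYNONYMS = {"give": {"hand", "handed"}}   # hypothetical entries

def match(hyp, ref):
    used_ref = set()
    alignment = {}   # hyp index -> ref index
    stages = [
        lambda h, r: h == r,
        lambda h, r: toy_stem(h) == toy_stem(r),
        lambda h, r: r in TOY_SYNONYMS.get(h, set()) or h in TOY_SYNONYMS.get(r, set()),
    ]
    for stage in stages:
        for i, hw in enumerate(hyp):
            if i in alignment:
                continue
            for j, rw in enumerate(ref):
                if j not in used_ref and stage(hw.lower(), rw.lower()):
                    alignment[i] = j
                    used_ref.add(j)
                    break
    return alignment

hyp = "in two weeks Iraq's weapons will give army".split()
ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
# Aligns two, weeks, weapons, army exactly; Iraq's~Iraqi via stems; give~handed via synonyms.
print(match(hyp, ref))
```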

Matcher Example
- "the sri lanka prime minister criticizes the leader of the country"
- "President of Sri Lanka criticized by the country's Prime Minister"

The Full METEOR Metric
- The matcher explicitly aligns matched words between the MT output and the reference
- The matcher returns the fragment count (frag), used to calculate the average fragmentation: (frag - 1) / (length - 1)
- The METEOR score is calculated as a discounted Fmean score
  - Discounting factor: DF = β * (frag ** γ)
  - Final score: Fmean * (1 - DF)
- Scores can be calculated at the sentence level
- The aggregate score is calculated over the entire test set (similar to BLEU)

METEOR Metric
Effect of the Discounting Factor (chart)

METEOR Example
- Reference: "the Iraqi weapons are to be handed over to the army within two weeks"
- MT output: "in two weeks Iraq's weapons will give army"
Matching:
- Ref: Iraqi weapons army two weeks
- MT: two weeks Iraq's weapons army
Scores:
- P = 5/8 = 0.625
- R = 5/14 = 0.357
- Fmean = 10*P*R / (9*P + R) = 0.3731
- Fragmentation: 3 fragments over 5 matched words = (3-1)/(5-1) = 0.50
- Discounting factor: DF = 0.5 * (frag ** 3) = 0.0625
- Final score: Fmean * (1 - DF) = 0.3731 * 0.9375 = 0.3498
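The arithmetic on this slide can be checked with a few lines (illustration only; the constants 0.5 and 3 stand in for the β and γ of the previous slide, and the matched/chunk counts are taken from the slide):

```python
matched = 5          # matched unigrams: Iraqi~Iraq's, weapons, army, two, weeks
hyp_len, ref_len = 8, 14
chunks = 3           # contiguous matched fragments: "two weeks", "Iraq's weapons", "army"

P = matched / hyp_len                    # 0.625
R = matched / ref_len                    # 0.357
fmean = 10 * P * R / (9 * P + R)         # 0.3731 (recall-weighted harmonic mean)
frag = (chunks - 1) / (matched - 1)      # 0.50
penalty = 0.5 * frag ** 3                # 0.0625
score = fmean * (1 - penalty)            # 0.3498
print(round(fmean, 4), round(penalty, 4), round(score, 4))
```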

BLEU vs. METEOR
How do we know if a metric is better?
- Better correlation with human judgments of MT output
- Reduced score variability on MT outputs that are ranked equivalent by humans
- Higher and less variable scores when scoring human translations against the reference translations

BLEU vs. NIST Comparison of the correlation of BLEU and NIST scores with human judgments for four corpora

BLEU vs. Human Scores

Human vs. Automatic Evaluation
TC-STAR SLT Eval '07, English-Spanish. Three data points: ASR, Verbatim, FTE task

Correlation with Human Judgments
- Human judgment scores for adequacy and fluency, each on a [1-5] scale (or sum them together)
- Pearson or Spearman (rank) correlations
- Correlation of metric scores with human scores at the system level
  - Can rank systems
  - Even coarse metrics can have high correlations
- Correlation of metric scores with human scores at the sentence level
  - Evaluates score correlations at a fine-grained level
  - Very large number of data points, multiple systems
  - Pearson correlation
  - Look at metric score variability for MT sentences scored as equally good by humans
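A minimal sketch of how such correlations are computed; the scores below are invented for illustration, not real evaluation data.

```python
# System-level correlation between a metric and human judgments (toy data).
from scipy.stats import pearsonr, spearmanr

human  = [3.1, 3.8, 2.5, 4.0, 3.3]        # hypothetical mean human scores per system
metric = [0.21, 0.29, 0.17, 0.33, 0.24]   # hypothetical metric scores per system

print("Pearson:", pearsonr(human, metric)[0])
print("Spearman (rank):", spearmanr(human, metric)[0])
```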

Evaluation Setup
Data: LDC-released common data set (DARPA/TIDES 2003 Chinese-to-English and Arabic-to-English MT evaluation data)
- Chinese data: 920 sentences, 4 reference translations, 7 systems
- Arabic data: 664 sentences, 4 reference translations, 6 systems
Metrics compared: BLEU, P, R, F1, Fmean, METEOR (with several features)

METEOR vs. BLEU: 2003 Data, System Scores

METEOR vs. BLEU: 2003 Data, Pairwise System Scores

Evaluation Results: System-level Correlations
Columns: Chinese data, Arabic data, Average
BLEU: 0.828, 0.930, 0.879
Mod-BLEU: 0.821, 0.926, 0.874
Precision: 0.788, 0.906, 0.847
Recall: 0.878, 0.954, 0.916
F1: 0.881, 0.971
Fmean: 0.964, 0.922
METEOR: 0.896, 0.934

METEOR vs. BLEU: Sentence-level Scores (CMU SMT System, TIDES 2003 Data)
Scatter plots: BLEU R = 0.2466, METEOR R = 0.4129

Evaluation Results: Sentence-level Correlations
Columns: Chinese data, Arabic data, Average
BLEU: 0.194, 0.228, 0.211
Mod-BLEU: 0.285, 0.307, 0.296
Precision: 0.286, 0.288, 0.287
Recall: 0.320, 0.335, 0.328
Fmean: 0.327, 0.340, 0.334
METEOR: 0.331, 0.347, 0.339

Adequacy, Fluency and Combined: Sentence-level Correlations (Arabic Data)
Columns: Adequacy, Fluency, Combined
BLEU: 0.239, 0.171, 0.228
Mod-BLEU: 0.315, 0.238, 0.307
Precision: 0.306, 0.210, 0.288
Recall: 0.362, 0.236, 0.335
Fmean: 0.367, 0.240, 0.340
METEOR: 0.370, 0.252, 0.347

METEOR Mapping Modules: Sentence-level Correlations
Columns: Chinese data, Arabic data, Average
Exact: 0.293, 0.312, 0.303
Exact+Pstem: 0.318, 0.329, 0.324
Exact+WNste: 0.330, 0.321
Exact+Pstem+WNsyn: 0.331, 0.347, 0.339

Human Evaluation

Human Evaluation of MT Output
Why perform human evaluation? Automatic MT metrics are not sufficient:
- What does a BLEU score of 30.0 or 50.0 mean?
- Existing automatic metrics are crude and at times biased
- Automatic metrics don't provide sufficient insight for error analysis
- Different types of errors have different implications depending on the underlying task in which MT is used
- Reliable human measures are needed in order to develop and assess automatic metrics for MT evaluation
Questions: How good is a system? Which system is better? How effective is a system for a task?

Human Evaluation: Main Challenges
- Reliability and consistency: difficulty in obtaining high levels of intra- and inter-coder agreement
  - Intra-coder agreement: consistency of the same human judge
  - Inter-coder agreement: agreement on quality judgments across multiple judges
- Measuring reliability and consistency
- Developing meaningful metrics based on human judgments

Main Types of Human Assessments
- Adequacy and fluency scores
- Human ranking of translations at the sentence level
- Post-editing measures: post-editor editing time/effort
- Human editability measures: can humans edit the MT output into a correct translation?
- HTER: Human Translation Edit Rate, used in the DARPA GALE project
- Task-based evaluations: was the performance of the MT system sufficient to perform a particular task?

Adequacy and Fluency
- Adequacy: is the meaning translated correctly? Judged by comparing the MT translation to a reference translation (or to the source)
- Fluency: is the output grammatical and fluent? Judged by comparing the MT translation to a reference translation, to the source, or in isolation
- Scales: [1-5], [1-10], [1-7]
- Initiated during the DARPA MT evaluations in the mid-1990s
- Most commonly used until recently
- Main issues: definitions of the scales, agreement, normalization across judges

Accuracy-based Segment-level End-to-end Evaluation
Example:
- Source: Si # un attimo che vedo # le voglio mandare .... # attimo
- Yes I'm indicating... moment
- Yes # # I'm indicating... # moment
- G1: P VB K K
- G2: P VB P K
- G3: P VB K P
Four-point scale: "Perfect" / "OK" / "Bad" / "Very bad"
- OK: all meaning is translated, output somewhat disfluent
- Perfect: OK + output is fluent
- Bad: meaning is lost in translation
- Very bad: completely not understandable
- Acceptable: Perfect + OK

Human Assessment in WMT-09
- WMT-09: shared task on developing MT systems between several European languages (to English and from English)
- Also included a system combination track and an automatic MT metric evaluation track
- Official metric: human rankings
- Detailed evaluation and analysis of results
- 2-day workshop at EACL-09, including a detailed analysis paper by the organizers

Human Rankings at WMT-09
Instructions: rank translations from best to worst relative to the other choices (ties are allowed)
- Annotators were shown at most five translations at a time
- For most language pairs there were more than 5 system submissions
- No attempt to get a complete ordering over all the systems at once
- Relied on random selection and a reasonably large sample size to make the comparisons fair
Metric used to compare MT systems: individual systems and system combinations are ranked based on how frequently they were judged to be better than or equal to any other system (see the sketch below)
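A hedged reconstruction of that "better than or equal to" statistic; the exact WMT-09 scoring details are in the organizers' overview paper, and the judgments below are invented for illustration.

```python
# For each system: fraction of pairwise comparisons in which it was judged
# better than or equal to the other system.
from collections import defaultdict
from itertools import combinations

# Each judgment: {system_name: rank}, lower rank = better, ties allowed.
judgments = [
    {"sysA": 1, "sysB": 2, "sysC": 3},
    {"sysA": 2, "sysB": 1, "sysC": 2},
    {"sysA": 1, "sysB": 1, "sysC": 2},
]

wins_or_ties = defaultdict(int)
comparisons = defaultdict(int)
for ranking in judgments:
    for s1, s2 in combinations(ranking, 2):
        for a, b in ((s1, s2), (s2, s1)):
            comparisons[a] += 1
            if ranking[a] <= ranking[b]:
                wins_or_ties[a] += 1

for system in sorted(comparisons):
    print(system, wins_or_ties[system] / comparisons[system])
```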

Human Editing at WMT-09
Two stages:
- Humans edit the MT output to make it as fluent as possible
- Judges evaluate the edited output for adequacy (meaning) with a binary Y/N judgment
Instructions:
- Step 1: Correct the translation displayed, making it as fluent as possible. If no corrections are needed, select "No corrections needed." If you cannot understand the sentence well enough to correct it, select "Unable to correct."
- Step 2: Indicate whether the edited translations represent fully fluent and meaning-equivalent alternatives to the reference sentence. The reference is shown with context; the actual sentence is bold.

Editing Interface

Evaluating Edited Output

Nespole! User interface

Translation of Speeches

Pocket Dialog Translators
Mobile speech translators: humanitarian missions, government, tourism

Evaluation of Speech Translation Systems
Nespole!, C-STAR, ...
- Human-to-human spoken language translation for e-commerce applications (e.g. travel & tourism)
- English, German, Italian, and French
- Translation via an interlingua
Component-based evaluation:
- Speech recognition (speech -> text)
- Analysis (text -> interlingua -> text)
- SMT components (text -> text)
- Synthesis (text -> speech)
End-to-end evaluation:
- Text-to-text, speech-to-text, speech-to-speech

Evaluation Types and Methods
- Individual evaluation of components: speech recognizers, analysis and generation engines, synthesizers, IF
- End-to-end translation quality at the sentence/segment level
  - From speech input to speech/text output
  - From transcribed input to text output
- Architecture effectiveness: network effects
- Task-based evaluation: user studies - what works and how well?

Progress in TC-STAR
Speech recognition [WER], machine translation [BLEU]

Speech Recognition for Different Genres

Human vs. Automatic Evaluation
TC-STAR SLT Eval '07, English-Spanish. Three data points: ASR, Verbatim, FTE task

Human vs. Machine Performance

Summary
- MT evaluation is important for driving system development and the technology as a whole
- Different aspects need to be evaluated, not just the translation quality of individual sentences
- Human evaluations are costly, but are the most meaningful
- New automatic metrics are becoming popular; even though they are still rather crude, they can drive system progress and rank systems
- Automatic metrics for MT evaluation are a very active field of current research; new metrics with better correlation with human judgments are being developed

The NIST MT Metric
NIST (National Institute of Standards and Technology), 2002
Motivation:
- "Weight more heavily those n-grams that are more informative" -> use n-gram counts over the set of reference translations
- The geometric mean of co-occurrences over N in IBM-BLEU produces counterproductive variance due to low co-occurrence counts for the larger values of N -> use the arithmetic mean of the n-gram scores
Pros: more sensitive than BLEU
Cons:
- The information gain for 2-grams and up is not meaningful
- 80% of the score comes from unigram matches
- Most matched 5-grams have information gain 0
- The score increases when the test set size increases
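For reference, a sketch of the n-gram information weight as I read the NIST (2002) definition (counts taken over the set of reference translations; verify against the original report before relying on it):

```python
import math

def nist_info(count_prefix, count_ngram):
    """Info(w1..wn) = log2( count(w1..w_{n-1}) / count(w1..wn) )."""
    return math.log2(count_prefix / count_ngram)

# An n-gram whose prefix is exactly as frequent carries no information,
# which is why most matched 5-grams contribute an info gain of 0.
print(nist_info(5, 5))    # 0.0
print(nist_info(50, 2))   # rarer continuations are weighted more heavily
```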

BLEU vs. NIST
How do we know if a metric is better?
- Better correlation with human judgments of MT output
- Reduced score variability on MT outputs that are ranked equivalent by humans
- Higher and less variable scores when scoring human translations against the reference translations

BLEU vs. NIST
- NIST: strong emphasis on unigram scores, emphasizes adequacy
- BLEU-4: higher emphasis on fluency; very bad translations have very low 4-gram precision, so use BLEU-3, BLEU-2 or BLEU-1 instead
- The NIST score is more meaningful for "bad" translations; switch to BLEU once the system gets better

Issues in BLEU
- BLEU matches word n-grams of the MT translation with multiple reference translations simultaneously: a precision-based metric. Is this better than matching with each reference translation separately and selecting the best match?
- BLEU compensates for recall by factoring in a "brevity penalty" (BP). Is the BP adequate in compensating for the lack of recall?
- BLEU's n-gram matching requires exact word matches. Can stemming and synonyms improve the similarity measure and improve correlation with human scores?
- All matched words weigh equally in BLEU. Can a scheme for weighting word contributions improve correlation with human scores?
- BLEU's higher-order n-grams account for fluency and grammaticality, and n-grams are geometrically averaged. Geometric n-gram averaging is volatile to "zero" scores. Can we account for fluency/grammaticality via other means?

Assessing Coding Agreement
- Intra-annotator agreement: 10% of the items were repeated and evaluated twice by each judge
- Inter-annotator agreement: 40% of the items were randomly drawn from a common pool shared across all annotators, creating a set of items that were judged by multiple annotators
Agreement measure: the Kappa coefficient
K = (P(A) - P(E)) / (1 - P(E))
- P(A) is the proportion of times that the annotators agree
- P(E) is the proportion of times that they would agree by chance
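A toy Kappa computation for two annotators labeling the same items (the label sequences are made up; P(E) is estimated from the empirical label frequencies of each annotator):

```python
from collections import Counter

a = ["ok", "bad", "ok", "perfect", "ok", "bad"]
b = ["ok", "ok",  "ok", "perfect", "bad", "bad"]

p_agree = sum(x == y for x, y in zip(a, b)) / len(a)                           # P(A)
freq_a, freq_b = Counter(a), Counter(b)
p_chance = sum(freq_a[l] * freq_b[l] for l in set(a) | set(b)) / len(a) ** 2   # P(E)

kappa = (p_agree - p_chance) / (1 - p_chance)
print(round(p_agree, 3), round(p_chance, 3), round(kappa, 3))
```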

Assessing Coding Agreement
Common interpretation of Kappa values:
- 0.0-0.2: slight agreement
- 0.2-0.4: fair agreement
- 0.4-0.6: moderate agreement
- 0.6-0.8: substantial agreement
- 0.8-1.0: near-perfect agreement