METEOR: Metric for Evaluation of Translation with Explicit Ordering. An Automatic Metric for MT Evaluation with Improved Correlations with Human Judgments.


METEOR: Metric for Evaluation of Translation with Explicit Ordering
An Automatic Metric for MT Evaluation with Improved Correlations with Human Judgments
Alon Lavie, Language Technologies Institute, Carnegie Mellon University
Joint work with: Satanjeev Banerjee, Kenji Sagae, Shyamsundar Jayaraman
MT-Eval-06, Sep 7, 2006

Similarity-based MT Evaluation Metrics
- Assess the "quality" of an MT system by comparing its output with human-produced "reference" translations
- Premise: the more similar (in meaning) the translation is to the reference, the better
- Goal: an algorithm capable of accurately approximating this similarity
- Wide range of past metrics, mostly focusing on word-level correspondences:
  - Edit-distance metrics: Levenshtein, WER, PIWER, ...
  - N-gram-based metrics: Precision, Recall, F1-measure, BLEU, NIST, GTM, ...
- Main issue: perfect word matching is a very crude estimate of sentence-level similarity in meaning

METEOR vs. BLEU: Highlights of Main Differences
- METEOR's word matches between the translation and the references include semantic equivalents (inflections and synonyms)
- METEOR combines Precision and Recall (weighted towards Recall) instead of BLEU's "brevity penalty"
- METEOR uses a direct word-ordering penalty to capture fluency, instead of relying on higher-order n-gram matches
- Outcome: METEOR has significantly better correlation with human judgments, especially at the segment level

The METEOR Metric: Main New Ideas
- Reintroduce Recall and combine it with Precision as score components
- Look only at unigram Precision and Recall
- Align the MT output with each reference individually and take the score of the best pairing (sketched below)
- Matching takes into account word inflection variations (via stemming) and synonyms (via WordNet synsets)
- Address fluency via a direct penalty: how fragmented is the matching of the MT output with the reference?
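One of the ideas above, scoring the MT output against each reference individually and keeping the best pairing, can be sketched as follows. This is a minimal illustration in Python; the function name and the per-pair scorer are my own placeholders, not the released METEOR code:

    def best_reference_score(hyp, references, pair_score):
        # Score the hypothesis against each reference separately and keep the best score.
        return max(pair_score(hyp, ref) for ref in references)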

The Alignment Matcher
- Finds the best word-to-word alignment match between two strings of words
- Each word in a string can match at most one word in the other string
- Matches can be based on generalized criteria: word identity, stem identity, synonymy, ...
- Finds the alignment of highest cardinality with the minimal number of crossing branches
- Optimal search is NP-complete; clever search with pruning is very fast and has near-optimal results
- Greedy three-stage matching: exact, stem, synonyms (sketched below)
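A rough sketch of the greedy three-stage matching described above (exact, then stem, then synonym), assuming externally supplied stem() and are_synonyms() helpers standing in for a Porter stemmer and a WordNet synset lookup. The real matcher additionally minimizes crossing branches, which this sketch ignores:

    def greedy_align(hyp, ref, stem, are_synonyms):
        """Align each hypothesis word to at most one reference word in three passes:
        exact match, stem match, synonym match. Returns (hyp_index, ref_index) pairs."""
        stages = [
            lambda a, b: a.lower() == b.lower(),   # stage 1: exact (case-insensitive)
            lambda a, b: stem(a) == stem(b),       # stage 2: stem identity
            lambda a, b: are_synonyms(a, b),       # stage 3: WordNet synonymy
        ]
        alignment, used_hyp, used_ref = [], set(), set()
        for same in stages:
            for i, hw in enumerate(hyp):
                if i in used_hyp:
                    continue
                for j, rw in enumerate(ref):
                    if j not in used_ref and same(hw, rw):
                        alignment.append((i, j))
                        used_hyp.add(i)
                        used_ref.add(j)
                        break
        return alignment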

Matcher Example
- Sentence 1: the sri lanka prime minister criticizes the leader of the country
- Sentence 2: President of Sri Lanka criticized by the country’s Prime Minister

The Full METEOR Metric
- The Matcher explicitly aligns matched words between the MT output and the reference
- The Matcher returns a fragment count (frag), used to calculate the average fragmentation: (frag - 1) / (length - 1), where length is the number of matched words
- The METEOR score is calculated as a discounted Fmean score, with Fmean = 10*P*R / (9*P + R)
- Discounting factor: DF = 0.5 * (fragmentation ** 3)
- Final score: Fmean * (1 - DF) (see the sketch below)
- Scores can be calculated at the sentence level
- An aggregate score is calculated over the entire test set (similar to BLEU)
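Putting the formulas above together, a minimal sketch of the per-sentence score computation (my own function and argument names; a simplified illustration rather than the released METEOR implementation):

    def meteor_score(matches, hyp_len, ref_len, frag_count):
        """Compute a METEOR-style sentence score from unigram match statistics.
        matches:    number of matched unigrams
        hyp_len:    number of words in the MT output
        ref_len:    number of words in the reference
        frag_count: number of contiguous matched fragments"""
        if matches == 0:
            return 0.0
        precision = matches / hyp_len
        recall = matches / ref_len
        fmean = 10 * precision * recall / (9 * precision + recall)
        fragmentation = (frag_count - 1) / (matches - 1) if matches > 1 else 0.0
        df = 0.5 * fragmentation ** 3   # discounting factor
        return fmean * (1 - df)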

METEOR Metric: Effect of the Discounting Factor
(Figure on the original slide; not included in the transcript.)

The METEOR Metric: Example
- Reference: "the Iraqi weapons are to be handed over to the army within two weeks"
- MT output: "in two weeks Iraq’s weapons will give army"
- Matching: Ref: Iraqi, weapons, army, two, weeks <-> MT: two, weeks, Iraq’s, weapons, army
- P = 5/8 = 0.625, R = 5/14 = 0.357
- Fmean = 10*P*R / (9*P + R) = 0.3731
- Fragmentation: 3 fragments of 5 matched words = (3 - 1) / (5 - 1) = 0.50
- Discounting factor: DF = 0.5 * (0.50 ** 3) = 0.0625
- Final score: Fmean * (1 - DF) = 0.3731 * 0.9375 = 0.3498
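Plugging the numbers from this example into the meteor_score sketch above reproduces the slide's result:

    # 5 matched unigrams, MT output of 8 words, reference of 14 words, 3 fragments.
    score = meteor_score(matches=5, hyp_len=8, ref_len=14, frag_count=3)
    print(round(score, 4))   # prints 0.3498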

Evaluating METEOR
How do we know if a metric is better?
- Better correlation with human judgments of MT output
- Reduced score variability on MT sentence outputs that are ranked equivalent by humans
- Higher and less variable scores when scoring human translations (references) against other human (reference) translations

Correlation with Human Judgments
- Human judgment scores for adequacy and fluency, each on a [1-5] scale (or summed together)
- Pearson or Spearman (rank) correlations (see the sketch below)
- Correlation of metric scores with human scores at the system level:
  - Can rank systems
  - Even coarse metrics can have high correlations
- Correlation of metric scores with human scores at the sentence level:
  - Evaluates score correlations at a fine-grained level
  - Very large number of data points, multiple systems
  - Pearson correlation
- Look at metric score variability for MT sentences scored as equally good by humans
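As an illustration of the correlation computations described above, a small sketch using SciPy. The score lists here are made-up placeholders (one entry per system for system-level correlation, or per sentence for sentence-level correlation), not actual evaluation data:

    from scipy.stats import pearsonr, spearmanr

    metric_scores = [0.42, 0.35, 0.51, 0.29]   # placeholder automatic-metric scores
    human_scores  = [3.8, 3.1, 4.2, 2.7]       # placeholder human adequacy+fluency scores

    r, _ = pearsonr(metric_scores, human_scores)      # Pearson correlation
    rho, _ = spearmanr(metric_scores, human_scores)   # Spearman rank correlation
    print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")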

Evaluation Setup
- Data: LDC-released common data set (DARPA/TIDES 2003 Chinese-to-English and Arabic-to-English MT evaluation data)
  - Chinese data: 920 sentences, 4 reference translations, 7 systems
  - Arabic data: 664 sentences, 4 reference translations, 6 systems
- Metrics compared: BLEU, P, R, F1, Fmean, METEOR (with several features)

Evaluation Results: System-level Correlations

Metric      Chinese data   Arabic data   Average
BLEU        0.828          0.930         0.879
Mod-BLEU    0.821          0.926         0.874
Precision   0.788          0.906         0.847
Recall      0.878          0.954         0.916
F1          0.881          0.971
Fmean                      0.964         0.922
METEOR      0.896                        0.934

(Values missing from the transcript are left blank.)

Evaluation Results: Sentence-level Correlations

Metric      Chinese data   Arabic data   Average
BLEU        0.194          0.228         0.211
Mod-BLEU    0.285          0.307         0.296
Precision   0.286          0.288         0.287
Recall      0.320          0.335         0.328
Fmean       0.327          0.340         0.334
METEOR      0.331          0.347         0.339

Adequacy, Fluency and Combined: Sentence-level Correlations (Arabic data)

Metric      Adequacy   Fluency   Combined
BLEU        0.239      0.171     0.228
Mod-BLEU    0.315      0.238     0.307
Precision   0.306      0.210     0.288
Recall      0.362      0.236     0.335
Fmean       0.367      0.240     0.340
METEOR      0.370      0.252     0.347

METEOR Mapping Modules: Sentence-level Correlations

Module               Chinese data   Arabic data   Average
Exact                0.293          0.312         0.303
Exact+Pstem          0.318          0.329         0.324
Exact+WNstem                        0.330         0.321
Exact+Pstem+WNsyn    0.331          0.347         0.339

(The value missing from the transcript is left blank.)

Normalizing Human Scores
- Human scores are noisy: medium levels of inter-coder agreement, judge biases
- The MITRE group performed score normalization: normalize judge median scores and distributions (a rough sketch of the general idea follows below)
- Significant effect on sentence-level correlation between metrics and human scores

                           Chinese data   Arabic data   Average
Raw Human Scores           0.331          0.347         0.339
Normalized Human Scores    0.365          0.403         0.384
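The slide does not spell out the MITRE normalization procedure; purely as an illustration of the general idea (re-centering each judge's scores on a common median and spread), here is a hypothetical per-judge normalization sketch, not the actual method used:

    import statistics

    def normalize_judge_scores(scores_by_judge):
        # Hypothetical illustration only: shift each judge's scores to zero median
        # and unit spread so that judge-specific biases become comparable.
        normalized = {}
        for judge, scores in scores_by_judge.items():
            med = statistics.median(scores)
            spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
            spread = spread or 1.0   # guard against zero/undefined spread
            normalized[judge] = [(s - med) / spread for s in scores]
        return normalized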

METEOR vs. BLEU: Sentence-level Scores (CMU SMT System, TIDES 2003 Data)
(Figure on the original slide; not included in the transcript.)

METEOR vs. BLEU: Histograms of Scores of Reference Translations (2003 Data)
- BLEU: Mean = 0.3727, STD = 0.2138
- METEOR: Mean = 0.6504, STD = 0.1310
(Histograms from the original slide are not included in the transcript.)

Using METEOR
- The METEOR software package is freely available and downloadable on the web: http://www.cs.cmu.edu/~alavie/METEOR/
- Required files and formats are identical to BLEU: if you know how to run BLEU, you know how to run METEOR!
- We welcome comments and bug reports.

Conclusions
- Recall is more important than Precision
- Importance of focusing on sentence-level correlations
- Sentence-level correlations are still rather low (and noisy), but these are significant steps in the right direction
- Generalizing matches with stemming and synonyms gives a consistent improvement in correlations with human judgments
- Human judgment normalization is important and has a significant effect

References
- Banerjee, S. and A. Lavie (2005). "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments". In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), Ann Arbor, Michigan, June 2005, pages 65-72.
- Lavie, A., K. Sagae and S. Jayaraman (2004). "The Significance of Recall in Automatic Metrics for MT Evaluation". In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-2004), Washington, DC, September 2004.