Re-evaluating Bleu
Alison Alvarez
Machine Translation Seminar, February 16, 2006

Overview
The Weaknesses of Bleu
- Introduction
- Precision and Recall
- Fluency and Adequacy
- Variations Allowed by Bleu
- Bleu and Tides 2005
An Improved Model
- Overview of the Model
- Experiment
- Results
Conclusions

Introduction
Bleu has been shown to have high correlations with human judgments.
Bleu has been used by MT researchers for five years, sometimes in place of manual human evaluations.
But does minimizing the error rate it measures accurately reflect improvements in translation quality?

Precision and Bleu
Of my answers, how many are right or wrong?
Precision = |B ∩ C| / |C| (i.e., A / C)
[Venn diagram: B = reference translation, C = hypothesis translation, A = their overlap B ∩ C]

Precision and Bleu
Bleu is a precision-based metric.
The modified precision score, p_n:
p_n = \frac{\sum_{S \in C} \sum_{ngram \in S} \text{Count}_{matched}(ngram)}{\sum_{S \in C} \sum_{ngram \in S} \text{Count}(ngram)}
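The slides give no implementation, but a minimal Python sketch of this modified n-gram precision (clipping each hypothesis n-gram count at its maximum count in any reference; all names here are my own) could look like this:

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypotheses, references, n):
    """p_n over a corpus: clipped n-gram matches / total hypothesis n-grams.

    hypotheses : list of tokenized hypothesis sentences
    references : list of lists of tokenized reference sentences (one list per hypothesis)
    """
    matched = total = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_counts = ngrams(hyp, n)
        # Clip each hypothesis n-gram count at its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total += sum(hyp_counts.values())
    return matched / total if total else 0.0


if __name__ == "__main__":
    hyps = ["the cat sat on the mat".split()]
    refs = [["the cat is on the mat".split(), "there is a cat on the mat".split()]]
    print(modified_precision(hyps, refs, 2))  # bigram precision: 0.6
```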

Recall and Bleu
Of the potential answers, how many did I retrieve or miss?
Recall = |B ∩ C| / |B| (i.e., A / B)
[Venn diagram: B = reference translation, C = hypothesis translation, A = their overlap B ∩ C]
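To make the two overlap diagrams concrete, here is a small sketch (mine, not from the slides) that treats the reference (B) and the hypothesis (C) as bags of words and computes both scores from their overlap A:

```python
from collections import Counter


def overlap_precision_recall(reference, hypothesis):
    """Bag-of-words precision and recall against a single reference.

    A = overlap of reference (B) and hypothesis (C) word counts;
    precision = A / C, recall = A / B.
    """
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())
    overlap = sum((ref_counts & hyp_counts).values())  # size of A (multiset intersection)
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return precision, recall


if __name__ == "__main__":
    p, r = overlap_precision_recall("the cat is on the mat",
                                    "the cat sat on the mat")
    print(f"precision={p:.2f} recall={r:.2f}")  # both 0.83 here
```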

Recall and Bleu
Because Bleu uses multiple reference translations at once, recall cannot be calculated.

Fluency and Adequacy to Evaluators
Fluency
- “How do you judge the fluency of this translation?”
- Judged with no reference translation and against the standard of written English
Adequacy
- “How much of the meaning expressed in the reference is also expressed in the hypothesis translation?”

Variations
Bleu allows variations in word and phrase order that reduce fluency.
No constraints are placed on the order of matching n-grams.

Variations

Variations
The two translations shown above have the same bigram score.

Bleu and Tides 2005
Bleu scores diverged significantly from human judgments in the 2005 Tides evaluation.
The system judged best by humans was ranked sixth by Bleu.

Bleu and Tides 2005
Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs
System A: Iran has already stated that Kharazi’s statements to the conference because of the Jordanian King Abdullah II in which he stood accused Iran of interfering in Iraqi affairs.
N-gram matches: 1-gram: 27; 2-gram: 20; 3-gram: 15; 4-gram: 10
Human scores: Adequacy: 3, 2; Fluency: 3, 2
(From Callison-Burch 2005)

Bleu and Tides 2005
Reference: Iran had already announced Kharazi would boycott the conference after Jordan’s King Abdullah II accused Iran of meddling in Iraq’s affairs
System B: Iran already announced that Kharazi will not attend the conference because of statements made by Jordanian Monarch Abdullah II who has accused Iran of interfering in Iraqi affairs.
N-gram matches: 1-gram: 24; 2-gram: 19; 3-gram: 15; 4-gram: 12
Human scores: Adequacy: 5, 4; Fluency: 5, 4
(From Callison-Burch 2005)

An Experiment with Bleu

Bleu and Tides 2005
“This opens the possibility that in order for Bleu to be valid only sufficiently similar systems should be compared with one another.”

Additional Flaws
- Multiple human reference translations are expensive
- N-grams showing up in multiple reference translations are weighted the same as n-grams appearing in only one
- Content words are weighted the same as common words: ‘The’ counts the same as ‘Parliament’
- Bleu accounts for the diversity of human translations, but not for synonyms

An Extension of Bleu
Described in Babych & Hartley, 2004
Adds weights to matched items using:
- tf/idf
- S-score

Addressing Flaws
- Can work with only one human reference translation
  - Recall can actually be calculated
  - The paper is not very clear about how this sentence is selected
- Content words are weighted differently from common words: ‘The’ does not count the same as ‘Parliament’

Calculating the tf/idf Score
tf.idf(i,j) = (1 + \log(tf_{i,j})) \cdot \log(N / df_i), if tf_{i,j} \ge 1
where:
- tf_{i,j} is the number of occurrences of the word w_i in the document d_j
- df_i is the number of documents in the corpus in which the word w_i occurs
- N is the total number of documents in the corpus
(From Babych & Hartley 2004)
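A minimal sketch of this tf.idf weighting in Python (the corpus representation and function name are my own, not from the paper):

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """tf.idf(i, j) = (1 + log(tf_ij)) * log(N / df_i) for every word i
    in every tokenized document j, following the definition above."""
    n_docs = len(documents)
    doc_freq = Counter()                      # df_i: documents containing word w_i
    for doc in documents:
        doc_freq.update(set(doc))

    weights = []
    for doc in documents:
        term_freq = Counter(doc)              # tf_ij: occurrences of w_i in d_j
        weights.append({
            word: (1 + math.log(tf)) * math.log(n_docs / doc_freq[word])
            for word, tf in term_freq.items()
        })
    return weights


if __name__ == "__main__":
    corpus = [
        "the parliament adopted the resolution".split(),
        "the weather in strasbourg was cold".split(),
    ]
    w = tfidf_weights(corpus)[0]
    print(w["parliament"])  # positive: an informative content word
    print(w["the"])         # 0.0: occurs in every document
```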

Calculating the S-Score
The S-score (Babych & Hartley 2004) is calculated from the following quantities:
- P_doc(i,j): the relative frequency of the word in the text
- P_corp-doc(i): the relative frequency of the same word in the rest of the corpus, without this text
- (N - df_i) / N: the proportion of texts in the corpus in which this word does not occur
- P_corp(i): the relative frequency of the word in the whole corpus, including this particular text

Integrating the S-Score
- If, for a lexical item in a text, the S-score > 1, all counts for the n-grams containing this item are increased by the S-score (not just by 1, as in the baseline BLEU approach).
- If the S-score ≤ 1, the usual n-gram count is applied: the count is increased by 1.
(From Babych & Hartley 2004)
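A sketch of that counting rule, assuming the per-word S-scores have already been computed; treating an n-gram that contains several salient words as contributing its maximum S-score is my assumption, not something stated on the slide:

```python
def ngram_count_increment(ngram, s_scores):
    """Weight contributed by one matched n-gram.

    If the n-gram contains a lexical item with S-score > 1, the count is
    increased by that S-score instead of by 1; otherwise by 1, as in BLEU.
    (Taking the maximum S-score over the n-gram's words is an assumption.)
    """
    best = max(s_scores.get(word, 0.0) for word in ngram)
    return best if best > 1.0 else 1.0


if __name__ == "__main__":
    s_scores = {"parliament": 3.2, "the": 0.1}
    print(ngram_count_increment(("the", "parliament"), s_scores))  # 3.2
    print(ngram_count_increment(("of", "the"), s_scores))          # 1.0
```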

The Experiment
- Used 100 French-English texts from the DARPA-94 evaluation corpus
- Included two reference translations
- Results from four different MT systems

The Experiment
Stage 1: tf/idf and S-scores are calculated on the two reference translations
Stage 2: N-gram-based evaluation using precision and recall of n-grams in the MT output; n-gram matches are adjusted by the tf/idf weights or S-scores
Stage 3: Comparison with human scores
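For stage 2, a sketch of how accumulated weighted match totals could be turned into weighted precision, recall and an F-score (an illustration of the general idea under my own naming, not the authors' code):

```python
def weighted_f_score(matched_weight, hyp_weight, ref_weight):
    """Weighted precision, recall and F1.

    matched_weight : total weight of n-grams matched between MT output and reference
    hyp_weight     : total weight of n-grams in the MT output
    ref_weight     : total weight of n-grams in the reference
    """
    precision = matched_weight / hyp_weight
    recall = matched_weight / ref_weight
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(weighted_f_score(42.0, 60.0, 55.0))
```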

Results for tf/idf
Table columns: System, human [ade]/[flu], BLEU [1&2], weighted Precision (1/2-gram), weighted Recall (1/2-gram), weighted F-score (1/2-gram)
Systems compared: CANDIDE, GLOBALINK, MS, REVERSO (human scores: NA), SYSTRAN
Bottom rows: correlation r(2) with [ade] across MT systems; correlation r(2) with [flu] across MT systems
[numeric scores not preserved in the transcript]

Results for S-Score
Table columns: System, human [ade]/[flu], BLEU [1&2], weighted Precision (1/2-gram), weighted Recall (1/2-gram), weighted F-score (1/2-gram)
Systems compared: CANDIDE, GLOBALINK, MS, REVERSO (human scores: NA), SYSTRAN
Bottom rows: correlation r(2) with [ade] across MT systems; correlation r(2) with [flu] across MT systems
[numeric scores not preserved in the transcript]

Results
- The weighted n-gram model beats BLEU on adequacy
- The F-score metric is more strongly correlated with fluency
- Single-reference translations are stable (add stability chart?)

Conclusions
- The Bleu model can be too coarse to differentiate between very different MT systems
- Adequacy is harder to predict than fluency
- Adding weights and using recall and F-scores can bring higher correlations with adequacy and fluency scores

References
Chris Callison-Burch, Miles Osborne and Philipp Koehn. 2006. Re-evaluating the Role of Bleu in Machine Translation Research. To appear in EACL-06.
Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, PA, July.
Bogdan Babych and Anthony Hartley. 2004. Extending the BLEU MT Evaluation Method with Frequency Weightings. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, July.
Dan Melamed, Ryan Green, and Joseph P. Turian. 2003. Precision and Recall of Machine Translation. In Proceedings of the Human Language Technology Conference (HLT-NAACL), Edmonton, Alberta, May.
Deborah Coughlin. 2003. Correlating Automated and Human Assessments of Machine Translation Quality. In Proceedings of MT Summit IX.
LDC. Linguistic Data Annotation Specification: Assessment of Fluency and Adequacy in Translations. Revision 1.5.

Precision and Bleu
The Brevity Penalty is designed to compensate for overly terse translations:
BP = 1 if c > r; BP = e^{1 - r/c} if c ≤ r
where c = length of the corpus of hypothesis translations, r = effective reference corpus length
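A direct transcription of the brevity penalty as a Python function (names are mine):

```python
import math


def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r/c), where c is the total length of the
    hypothesis corpus and r is the effective reference corpus length."""
    return 1.0 if c > r else math.exp(1 - r / c)


if __name__ == "__main__":
    print(brevity_penalty(90, 100))   # hypotheses shorter than references: penalty < 1
    print(brevity_penalty(110, 100))  # longer: no penalty, 1.0
```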

Precision and Bleu
Thus, the total Bleu score is:
BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
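Putting the pieces together, a sketch of the final score using the usual uniform weights w_n = 1/N (the function is mine; the p_n values and BP would come from the sketches above):

```python
import math


def bleu(precisions, bp):
    """BLEU = BP * exp(sum_n w_n * log p_n) with uniform weights w_n = 1/N."""
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; the geometric mean collapses to zero
    n = len(precisions)
    return bp * math.exp(sum(math.log(p) for p in precisions) / n)


if __name__ == "__main__":
    # e.g. p_1..p_4 from modified n-gram precision, BP from the brevity penalty
    print(bleu([0.6, 0.4, 0.3, 0.2], 0.95))
```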

Flaws in the Use of Bleu
Experiments with Bleu, but no manual evaluation (Callison-Burch 2005)