
1 METEOR: Metric for Evaluation of Translation with Explicit Ordering
An Improved Automatic Metric for MT Evaluation
Alon Lavie
Joint work with: Satanjeev Banerjee, Kenji Sagae, Shyamsundar Jayaraman
Language Technologies Institute, Carnegie Mellon University
MT Group Lunch, January 18, 2005

2 Outline
- Similarity-based metrics for MT evaluation
- Weaknesses in Precision-based MT metrics (BLEU, NIST)
- Simple unigram-based MT evaluation metrics
- METEOR
- Evaluation Methodology
- Experimental Evaluation
- Recent Work
- Future Directions

3 Automatic Metrics for MT Evaluation
Idea: compare the output of an MT system to a “reference” good (usually human) translation: how close is the MT output to the reference translation?
Advantages:
- Fast and cheap, minimal human labor, no need for bilingual speakers
- Can be used on an ongoing basis during system development to test changes
Disadvantages:
- Current metrics are very crude and do not distinguish well between subtle differences in systems
- Individual sentence scores are not very reliable; aggregate scores on a large test set are required
Automatic metrics for MT evaluation are a very active area of current research.

4 Automatic Metrics for MT Evaluation
Example:
- Reference: “the Iraqi weapons are to be handed over to the army within two weeks”
- MT output: “in two weeks Iraq’s weapons will give army”
Possible metric components:
- Precision: correct words / total words in MT output
- Recall: correct words / total words in reference
- Combination of P and R (e.g. F1 = 2PR/(P+R))
- Levenshtein edit distance: number of insertions, deletions and substitutions required to transform the MT output into the reference
Important issue: perfect word matches are too harsh; synonyms and inflections should count, e.g. “Iraq’s” vs. “Iraqi”, “give” vs. “handed over”
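A minimal sketch (not part of the original slides) of the unigram precision/recall components listed above, applied to the example pair; whitespace tokenization and exact word matching are simplifying assumptions:

```python
from collections import Counter

def unigram_prf(candidate, reference):
    """Unigram precision, recall and F1 with clipped counts (exact matches only)."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # A candidate word counts as correct at most as often as it appears in the reference.
    correct = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    p = correct / len(cand)
    r = correct / len(ref)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

reference = "the Iraqi weapons are to be handed over to the army within two weeks"
mt_output = "in two weeks Iraq's weapons will give army"
print(unigram_prf(mt_output, reference))   # P = 4/8, R = 4/14, F1 ≈ 0.36
```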

5 Similarity-based MT Evaluation Metrics
- Assess the “quality” of an MT system by comparing its output with human-produced “reference” translations
- Premise: the more similar (in meaning) the translation is to the reference, the better
- Goal: an algorithm that is capable of accurately approximating this similarity
- Wide range of metrics, mostly focusing on word-level correspondences:
  - Edit-distance metrics: Levenshtein, WER, PIWER, …
  - Ngram-based metrics: Precision, Recall, F1-measure, BLEU, NIST, GTM, …
- Main issue: perfect word matching is a very crude estimate of sentence-level similarity in meaning
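For the edit-distance family above, a plain word-level Levenshtein distance is the common building block; the sketch below is a generic implementation under that assumption, not the specific WER/PIWER variants cited:

```python
def levenshtein(hyp_tokens, ref_tokens):
    """Word-level edit distance: insertions, deletions and substitutions."""
    m, n = len(hyp_tokens), len(ref_tokens)
    # dp[i][j] = edits to turn the first i hypothesis tokens into the first j reference tokens
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

hyp = "in two weeks Iraq's weapons will give army".split()
ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
print(levenshtein(hyp, ref))  # number of word-level edits for the running example
```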

6 Desirable Automatic Metric
- High levels of correlation with quantified human notions of translation quality
- Sensitive to small differences in MT quality between systems and versions of systems
- Consistent: the same MT system on similar texts should produce similar scores
- Reliable: MT systems that score similarly will perform similarly
- General: applicable to a wide range of domains and scenarios
- Fast and lightweight: easy to run

7 The BLEU Metric
Proposed by IBM [Papineni et al., 2002]. Main ideas:
- Exact matches of words
- Match against a set of reference translations for greater variety of expressions
- Account for Adequacy by looking at word precision
- Account for Fluency by calculating n-gram precisions for n = 1, 2, 3, 4
- No recall (it is difficult to define with multiple references)
- To compensate for recall: introduce a “Brevity Penalty”
- Final score is a weighted geometric average of the n-gram scores
- Calculate the aggregate score over a large test set

8 The BLEU Metric
Example:
- Reference: “the Iraqi weapons are to be handed over to the army within two weeks”
- MT output: “in two weeks Iraq’s weapons will give army”
BLEU metric:
- 1-gram precision: 4/8
- 2-gram precision: 1/7
- 3-gram precision: 0/6
- 4-gram precision: 0/5
- BLEU score = 0 (weighted geometric average)
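A sketch (whitespace tokenization assumed) that reproduces the n-gram precision counts above against the single reference; counts are clipped as described on the next slide:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision against a single reference (counts clipped)."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return matched, sum(cand_ngrams.values())

reference = "the Iraqi weapons are to be handed over to the army within two weeks"
mt_output = "in two weeks Iraq's weapons will give army"
for n in range(1, 5):
    matched, total = ngram_precision(mt_output, reference, n)
    print(f"{n}-gram precision: {matched}/{total}")
# 1-gram 4/8, 2-gram 1/7, 3-gram 0/6, 4-gram 0/5 -> the geometric average (and BLEU) is 0
```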

9 The BLEU Metric
Clipping precision counts:
- Reference 1: “the Iraqi weapons are to be handed over to the army within two weeks”
- Reference 2: “the Iraqi weapons will be surrendered to the army in two weeks”
- MT output: “the the the the”
- The precision count for “the” should be clipped at two: the maximum count of the word in any single reference
- The modified unigram precision is then 2/4 (not 4/4)
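A sketch of the clipping rule with multiple references, reproducing the 2/4 example; the function name and whitespace tokenization are illustrative assumptions:

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Clip each word's count at its maximum count in any single reference."""
    cand_counts = Counter(candidate.split())
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(c, max_ref_counts[w]) for w, c in cand_counts.items())
    return clipped, sum(cand_counts.values())

refs = ["the Iraqi weapons are to be handed over to the army within two weeks",
        "the Iraqi weapons will be surrendered to the army in two weeks"]
print(clipped_unigram_precision("the the the the", refs))  # (2, 4) -> 2/4, not 4/4
```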

10 The BLEU Metric
Brevity Penalty:
- Reference 1: “the Iraqi weapons are to be handed over to the army within two weeks”
- Reference 2: “the Iraqi weapons will be surrendered to the army in two weeks”
- MT output: “the the”
- Precision scores: unigram 2/2, bigram 1/1, so BLEU = 1.0
- The MT output is much too short, which boosts precision, and BLEU has no recall…
- An exponential Brevity Penalty reduces the score; it is calculated from the aggregate length of the test set (not individual sentences)
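The exponential penalty referred to above is, in Papineni et al. (2002), BP = exp(1 - r/c) whenever the aggregate candidate length c falls below the aggregate reference length r; a small sketch of that standard formulation:

```python
import math

def brevity_penalty(candidate_length, reference_length):
    """BLEU brevity penalty over aggregate corpus lengths (Papineni et al., 2002)."""
    if candidate_length > reference_length:
        return 1.0
    if candidate_length == 0:
        return 0.0
    return math.exp(1.0 - reference_length / candidate_length)

# "the the" (2 tokens) against a 14-token reference: precision is perfect but BP is tiny.
print(brevity_penalty(2, 14))   # ≈ 0.0025
```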

11 Weaknesses in BLEU (and NIST)
- BLEU matches word ngrams of the MT translation with multiple reference translations simultaneously; it is a precision-based metric
  - Is this better than matching with each reference translation separately and selecting the best match?
- BLEU compensates for Recall by factoring in a “Brevity Penalty” (BP)
  - Is the BP adequate in compensating for the lack of Recall?
- BLEU’s ngram matching requires exact word matches
  - Can stemming and synonyms improve the similarity measure and improve correlation with human scores?
- All matched words weigh equally in BLEU
  - Can a scheme for weighing word contributions improve correlation with human scores?
- BLEU’s higher-order ngrams account for fluency and grammaticality, and the ngrams are geometrically averaged
  - Geometric ngram averaging is volatile to “zero” scores. Can we account for fluency/grammaticality via other means?

12 Roadmap to a Desirable Metric
Establishing a metric with much improved correlation with human judgment scores at the sentence level will go a long way towards our overall goals.
Our approach:
- Explicitly align the words in the MT translation with their corresponding matches in the reference translation, allowing for exact matches, stemmed word matches, and synonym and semantically-related word matches
- Combine unigram Precision and Recall to account for similarity in “content” (translation adequacy)
- Weigh the contribution of matched words based on a measure related to their importance
- Estimate translation fluency/grammaticality based on an explicit measure related to word order, fragmentation and/or the average length of matched ngrams

13 Unigram-based Metrics
- Unigram Precision: fraction of words in the MT output that appear in the reference
- Unigram Recall: fraction of words in the reference translation that appear in the MT output
- F1 = P*R/(0.5*(P+R)) = 2PR/(P+R)
- Fmean = P*R/(0.9*P + 0.1*R), which weights Recall more heavily than Precision
- With and without word stemming
- Match with each reference separately and select the best match for each sentence
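A quick sketch of the two combinations above, evaluated on the exact-match unigram counts from the earlier example (P = 4/8, R = 4/14); the numbers here are illustrative only:

```python
def f1(p, r):
    # Balanced harmonic mean: F1 = P*R / (0.5*(P+R)) = 2PR/(P+R)
    return p * r / (0.5 * (p + r)) if p + r else 0.0

def fmean(p, r):
    # Recall-weighted harmonic mean used here: Fmean = P*R / (0.9*P + 0.1*R)
    return p * r / (0.9 * p + 0.1 * r) if p + r else 0.0

p, r = 4 / 8, 4 / 14          # exact-match unigram counts from the earlier example
print(f1(p, r), fmean(p, r))  # ≈ 0.364 vs. ≈ 0.299: Fmean is pulled toward the lower Recall
```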

14 The METEOR Metric
A new metric under development at CMU. Main new ideas:
- Reintroduce Recall and combine it with Precision as score components
- Look only at unigram Precision and Recall
- Align the MT output with each reference individually and take the score of the best pairing
- Matching takes into account word inflection variations (via stemming)
- Address fluency via a direct penalty: how fragmented is the matching of the MT output with the reference?

15 The METEOR Metric
- The matcher explicitly aligns matched words between the MT output and the reference
- Multiple stages: exact matches, stemmed matches, (synonym matches)
- The matcher returns a fragment count, used to calculate the average fragmentation (frag)
- The METEOR score is calculated as a discounted Fmean score
  - Discounting factor: DF = 0.5 * (frag**3)
  - Final score: Fmean * (1 - DF)
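A minimal sketch of this scoring step, assuming the matcher has already produced the number of matched unigrams and the number of contiguous matched fragments; the real matcher's stemming and synonym stages are omitted:

```python
def meteor_score(num_matches, num_fragments, hyp_length, ref_length):
    """Fmean discounted by the fragmentation penalty described on the slide."""
    if num_matches == 0:
        return 0.0
    p = num_matches / hyp_length
    r = num_matches / ref_length
    fmean = 10 * p * r / (9 * p + r)
    # frag: 0 when all matches form one fragment, 1 when every match is its own fragment
    frag = (num_fragments - 1) / (num_matches - 1) if num_matches > 1 else 0.0
    discount = 0.5 * frag ** 3
    return fmean * (1.0 - discount)
```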

16 METEOR Metric: Effect of Discounting Factor [chart]

17 The METEOR Metric
Example:
- Reference: “the Iraqi weapons are to be handed over to the army within two weeks”
- MT output: “in two weeks Iraq’s weapons will give army”
Matching: Ref: Iraqi weapons army two weeks; MT: two weeks Iraq’s weapons army
- P = 5/8 = 0.625; R = 5/14 = 0.357
- Fmean = 10*P*R/(9P+R) = 0.373
- Fragmentation: 3 frags of 5 words = (3-1)/(5-1) = 0.50
- Discounting factor: DF = 0.5 * (frag**3) = 0.0625
- Final score: Fmean * (1 - DF) = 0.373 * 0.9375 = 0.350
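The same numbers computed directly from the formulas above (a sketch, equivalent to calling the meteor_score sketch after slide 15 with 5 matches in 3 fragments):

```python
p, r = 5 / 8, 5 / 14                   # unigram precision and recall from the matching above
fmean = 10 * p * r / (9 * p + r)       # ≈ 0.373
frag = (3 - 1) / (5 - 1)               # 3 fragments over 5 matched words = 0.50
score = fmean * (1 - 0.5 * frag ** 3)  # discount 0.0625 -> final score ≈ 0.350
print(round(fmean, 3), round(score, 3))
```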

18 BLEU vs. METEOR
How do we know if a metric is better?
- Better correlation with human judgments of MT output
- Reduced score variability on MT outputs that are ranked equivalent by humans
- Higher and less variable scores when scoring human translations against the reference translations

19 Evaluation Methodology
- Correlation of metric scores with human scores at the system level
  - Human scores are adequacy + fluency [2-10]
  - Pearson correlation coefficients
  - Confidence ranges for the correlation coefficients
- Correlation of score differentials between all pairs of systems [Coughlin 2003]
  - Assumes a linear relationship between the score differentials
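A sketch of the system-level Pearson computation; the per-system score lists below are made-up placeholders for illustration, not the TIDES numbers:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between metric scores and human scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-system scores: one metric score and one human adequacy+fluency score per system.
metric_scores = [0.21, 0.25, 0.27, 0.30, 0.33, 0.35, 0.38]
human_scores = [4.1, 4.6, 5.0, 5.4, 5.9, 6.1, 6.8]
print(pearson(metric_scores, human_scores))
```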

20 Evaluation Setup
Data: DARPA/TIDES 2002 and 2003 Chinese-to-English MT evaluation data
- 2002 data: ~900 sentences, 4 reference translations, 7 systems
- 2003 data: 6 systems
Metrics compared: BLEU, NIST, P, R, F1, Fmean, GTM, B&H, METEOR

21 Evaluation Results: 2002 System-level Correlations

Metric              Pearson Coefficient   Confidence Interval
BLEU                0.461                 ±0.058
BLEU-stemmed        0.528                 ±0.061
NIST                0.603                 ±0.049
NIST-stemmed        0.740                 ±0.043
Precision           0.175                 ±0.052
Precision-stemmed   0.257                 ±0.065
Recall              0.615                 ±0.042
Recall-stemmed      0.757
F1                  0.425                 ±0.047
F1-stemmed          0.564
Fmean               0.585
Fmean-stemmed       0.733                 ±0.044
GTM                 0.77
GTM-stemmed         0.68
METEOR              0.7191

22 Evaluation Results: 2003 System-level Correlations

Metric              Pearson Coefficient   Confidence Interval
BLEU                0.817                 ±0.021
BLEU-stemmed        0.843                 ±0.018
NIST                0.892                 ±0.013
NIST-stemmed        0.915                 ±0.010
Precision           0.683                 ±0.041
Precision-stemmed   0.752
Recall              0.961                 ±0.011
Recall-stemmed      0.940                 ±0.014
F1                  0.909                 ±0.025
F1-stemmed          0.948
Fmean               0.959                 ±0.012
Fmean-stemmed       0.952
GTM                 0.79
GTM-stemmed         0.89
METEOR              0.964

23 Evaluation Results: 2002 Pairwise Correlations

Metric              Pearson Coefficient   Confidence Interval
BLEU                0.498                 ±0.054
BLEU-stemmed        0.559                 ±0.058
NIST                0.679                 ±0.042
NIST-stemmed        0.774                 ±0.041
Precision           0.298                 ±0.051
Precision-stemmed   0.325                 ±0.064
Recall              0.743                 ±0.032
Recall-stemmed      0.845                 ±0.029
F1                  0.549
F1-stemmed          0.643                 ±0.046
Fmean               0.711                 ±0.033
Fmean-stemmed       0.818
GTM                 0.71
GTM-stemmed         0.69
METEOR              0.8115

24 Evaluation Results: 2003 Pairwise Correlations

Metric              Pearson Coefficient   Confidence Interval
BLEU                0.758                 ±0.027
BLEU-stemmed        0.793                 ±0.025
NIST                0.886                 ±0.017
NIST-stemmed        0.924                 ±0.013
Precision           0.573                 ±0.053
Precision-stemmed   0.666                 ±0.058
Recall              0.954                 ±0.014
Recall-stemmed      0.923                 ±0.018
F1                  0.881                 ±0.024
F1-stemmed          0.950
Fmean                                     ±0.015
Fmean-stemmed       0.940
GTM                 0.78
GTM-stemmed         0.86
METEOR              0.965

25 METEOR vs. BLEU: Sentence-level Scores (CMU SMT System, TIDES 2003 Data) [scatter plot]

26 METEOR vs. BLEU: Histogram of Scores of Reference Translations, 2003 Data
[histograms] BLEU: STD = 0.2138; METEOR: STD = 0.1310

27 Further Issues
- Words are not created equal: some are more important for effective translation
- More effective matching with synonyms and inflected forms:
  - Stemming
  - Use a synonym knowledge base (WordNet)
- How to incorporate such information within the metric?
  - Train weights for word matches
  - Different weights for “content” and “function” words
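One way the stemming and WordNet ideas above could be folded into the matcher, sketched with NLTK (assumes nltk is installed and the WordNet corpus has been downloaded); the staging and function name are illustrative, not the actual METEOR matcher:

```python
from nltk.corpus import wordnet
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def words_match(hyp_word, ref_word):
    """Multi-stage match: exact, then stemmed, then WordNet synonym overlap."""
    if hyp_word == ref_word:
        return "exact"
    if stemmer.stem(hyp_word) == stemmer.stem(ref_word):
        return "stem"
    hyp_lemmas = {l for s in wordnet.synsets(hyp_word) for l in s.lemma_names()}
    ref_lemmas = {l for s in wordnet.synsets(ref_word) for l in s.lemma_names()}
    if hyp_lemmas & ref_lemmas:
        return "synonym"
    return None

print(words_match("weapons", "weapon"))   # stem
print(words_match("buy", "purchase"))     # synonym (they share a WordNet synset)
```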

28 Some Recent Work [Bano]
Further experiments with advanced features:
- With and without stemming
- With and without WordNet synonyms
- With and without a small “stop-list”

29 New Evaluation Results: 2002 System-level Correlations

Metric                   Pearson Coefficient
METEOR                   0.6016
METEOR+Pstem             0.7177
METEOR+WNstem            0.7275
METEOR+Pst+syn           0.8070
METEOR+WNst+syn          0.7966
METEOR+stop              0.7597
METEOR+Pst+stop          0.8638
METEOR+WNst+stop         0.8783
METEOR+Pst+syn+stop      0.9138
METEOR+WNst+syn+stop     0.9126
F1+Pstem+syn+stop        0.8872
Fmean+WNst+syn+stop      0.8843
Fmean+Pst+syn+stop       0.8799
F1+WNstem+syn+stop       0.8787

30 New Evaluation Results: 2003 System-level Correlations

Metric                   Pearson Coefficient
METEOR                   0.9175
METEOR+Pstem             0.9621
METEOR+WNstem            0.9705
METEOR+Pst+syn           0.9852
METEOR+WNst+syn          0.9821
METEOR+stop              0.9286
METEOR+Pst+stop          0.9682
METEOR+WNst+stop         0.9764
METEOR+Pst+syn+stop
METEOR+WNst+syn+stop     0.9802
Fmean+WNstem             0.9924
Fmean+WNstem+stop        0.9914
Fmean+Pstem              0.9905
Fmean+WNstem+syn

31 Current and Future Directions
- Word weighing schemes
- Word similarity beyond synonyms
- Optimizing the fragmentation-based discount factor
- Alternative metrics for capturing fluency and grammaticality

