Arthur Chan Prepared for Advanced MT Seminar


Overview of BLEU
Arthur Chan
Prepared for Advanced MT Seminar

This Talk
Original BLEU scores (Papineni et al. 2002)
Procedures and Motivations (21 pages)
N-gram precision (15 mins)
Modified N-gram precision (15 mins)
Experimental Studies
Brevity Penalty (10 mins)
Experimental Evidence (10 pages): only if we have time
A summary of the point of view of BLEU's authors
Slides can be found at http://www.cs.cmu.edu/~archan/coursework/Original_BLEU_V4.ppt

Bilingual Evaluation Understudy (BLEU)

BLEU – Its Motivation
Central idea: “The closer a machine translation is to a professional human translation, the better it is.”
Implication: an evaluation metric can itself be evaluated. If it correlates with human evaluation, it is a useful metric.
BLEU was proposed as an aid: a quick substitute for human judges when needed.

What is BLEU? A Big Picture
Requires multiple good reference translations.
Depends on modified n-gram precision (or co-occurrence).
Co-occurrence: an n-gram of the translated sentence matches an n-gram in any of the reference sentences.
Computes per-corpus n-gram co-occurrence.
n can take several values, and a weighted sum over the different n is computed.
Penalizes very brief translations.

N-gram Precision: an Example
Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Clearly Candidate 1 is better.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.

N-gram Precision
To rank Candidate 1 higher than Candidate 2, just count the number of n-gram matches.
The match can be position-independent.
A reference word can be matched multiple times.
The measure does not need to be linguistically motivated.

BLEU – Example: Unigram Precision
Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.
N-gram Precision: 17 (17 of the 18 candidate words appear in some reference)

Example: Unigram Precision (cont.)
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.
N-gram Precision: 8 (8 of the 14 candidate words appear in some reference)
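
The two counts above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `tokenize` and `unigram_matches` are made up for illustration, and the tokenization (lowercasing, splitting on word characters) is an assumption, not something the slides specify.

```python
import re

def tokenize(sentence):
    # Assumed tokenization: lowercase and keep runs of word characters.
    return re.findall(r"\w+", sentence.lower())

def unigram_matches(candidate, references):
    """Count candidate words found in any reference (plain, unclipped matching)."""
    reference_vocab = set()
    for ref in references:
        reference_vocab.update(tokenize(ref))
    cand_tokens = tokenize(candidate)
    matched = sum(1 for tok in cand_tokens if tok in reference_vocab)
    return matched, len(cand_tokens)

references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed directions of the party.",
]
candidate_1 = "It is a guide to action which ensures that the military always obey the commands of the party."
candidate_2 = "It is to insure the troops forever hearing the activity guidebook that party direct."

print(unigram_matches(candidate_1, references))  # (17, 18)
print(unigram_matches(candidate_2, references))  # (8, 14)
```

Candidate 1 matches 17 of its 18 words and Candidate 2 only 8 of 14, which agrees with the intuition that Candidate 1 is the better translation.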

Issue of N-gram Precision
What if some words are over-generated, e.g. “the”?
An extreme example:
Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
N-gram Precision: 7 of 7 (something is wrong)
Intuitively, a reference word should be exhausted after it is matched.

Modified N-gram Precision: Procedure
Count the maximum number of times a word occurs in any single reference.
Clip the total count of each candidate word by that maximum.
Modified n-gram precision = clipped count / total number of candidate words.
Example, for the candidate “the the the the the the the”:
Ref 1: The cat is on the mat.
Ref 2: There is a cat on the mat.
“the” has max count 2 (in Ref 1).
Unigram count = 7; clipped unigram count = 2; total number of candidate words = 7.
Modified unigram precision = 2/7.
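
The clipping procedure can be sketched as follows. The helper names `ngrams` and `clipped_counts` are illustrative only, and the inputs are assumed to be already tokenized and lowercased.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_counts(candidate, references, n=1):
    """Return (clipped match count, total candidate n-grams) for one sentence."""
    cand_counts = Counter(ngrams(candidate, n))
    # For every n-gram, the maximum count observed in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Each candidate count is clipped at the reference maximum.
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

candidate = "the the the the the the the".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(clipped_counts(candidate, refs))  # (2, 7)
```

The seven occurrences of "the" are clipped to 2, giving the modified unigram precision 2/7.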

Different N in Modified N-gram Precision
N > 1 is computed in a similar way.
When 1-gram precision is high, the candidate translation tends to satisfy adequacy.
When longer n-gram precision is high, the candidate translation tends to be fluent.

Modified N-gram Precision on Blocks of Text
A source sentence could be translated as multiple target sentences.
Procedure in the case of corpus evaluation:
Compute the n-gram matches sentence by sentence.
Add the clipped counts for all candidate sentences.
Divide the sum by the total number of candidate n-grams in the test corpus.
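
A minimal sketch of the corpus-level procedure, assuming pre-tokenized sentences; `ngram_counts` and `corpus_modified_precision` are hypothetical names chosen for this example. Note that clipped counts and totals are pooled over the corpus before dividing, rather than averaging per-sentence precisions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Counter of all contiguous n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_modified_precision(candidates, references_per_candidate, n):
    """Pool clipped counts over all candidate sentences, then divide once."""
    clipped_total = 0
    ngram_total = 0
    for cand, refs in zip(candidates, references_per_candidate):
        cand_counts = ngram_counts(cand, n)
        max_ref = Counter()
        for ref in refs:
            max_ref |= ngram_counts(ref, n)  # Counter union keeps the max count per n-gram
        clipped_total += sum((cand_counts & max_ref).values())  # intersection takes the min
        ngram_total += sum(cand_counts.values())
    return clipped_total / ngram_total if ngram_total else 0.0

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
candidates = ["the the the the the the the".split(), "the cat is on the mat".split()]
p1 = corpus_modified_precision(candidates, [refs, refs], n=1)
p2 = corpus_modified_precision(candidates, [refs, refs], n=2)
print(p1, p2)
```

For this toy corpus the clipped unigram counts are 2 and 6 out of 7 and 6 candidate words, so the pooled unigram precision is 8/13.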

Formula of Corpus-based N-gram Precision Note: Candidate means translated sentences
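
The formula itself did not survive in this transcript. Based on the corpus procedure described on the previous slide and on the definition in Papineni et al. (2002), it is presumably the following; the notation here is a reconstruction, not a copy of the slide:

```latex
p_n =
\frac{\displaystyle\sum_{C \in \{\mathrm{Candidates}\}} \;
      \sum_{n\text{-gram} \in C} \mathrm{Count}_{\mathrm{clip}}(n\text{-gram})}
     {\displaystyle\sum_{C' \in \{\mathrm{Candidates}\}} \;
      \sum_{n\text{-gram}' \in C'} \mathrm{Count}(n\text{-gram}')}
```

The numerator sums the clipped counts over all candidate sentences, and the denominator is the total number of candidate n-grams in the test corpus.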

Experiment 1 of N-gram Precision: Can it differentiate good and bad translations?
Source: Chinese; target: English.
Human (blue) vs. machine (light blue).
Observation: the human translation scores much better than the machine translation.
Conclusion: BLEU is useful for translations that differ greatly in quality.

Experiment 2 of N-gram Precision: Can it differentiate translations of very close quality?
From BLEU: H2 > H1 > S3 > S2 > S1.
Same as human judgment (not shown in the paper).
Conclusion: it is still quite useful when quality is similar.

Combining modified n-gram precision
The measure becomes more robust.
Precision decays roughly exponentially with n => the geometric mean is used => it is sensitive to higher-order n-grams.
4-gram was shown to be the best among (3,4,5)-grams.
The arithmetic mean was also tried.
Underweighting unigrams was found to be a good match with human judgment.
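
To see why the geometric mean is the more sensitive choice, here is a small sketch. The precision values are invented purely for illustration and `geometric_mean_precision` is a hypothetical helper; uniform weights are assumed unless others are given.

```python
import math

def geometric_mean_precision(precisions, weights=None):
    """Weighted geometric mean, i.e. exp of the weighted average of log precisions."""
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)  # uniform weights
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; a vanishing precision zeroes the mean
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Hypothetical 1- to 4-gram precisions that decay quickly with n.
precisions = [0.8, 0.4, 0.2, 0.1]
geometric = geometric_mean_precision(precisions)
arithmetic = sum(precisions) / len(precisions)
print(geometric, arithmetic)
```

With these numbers the arithmetic mean is dominated by the large unigram precision, while the geometric mean is pulled down by the small 4-gram precision.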

Issues of Modified N-gram Precision: Sentence Length
Candidate 3: of the
Modified Unigram Precision: 2/2
Modified Bigram Precision: 1/1
(Both precisions are perfect although the candidate is far too short.)
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.

Issues of Modified N-gram Precision: Trouble with Recall
A good candidate should use (recall) only one of the possible word choices.
Example:
Candidate 1: I always invariably perpetually do. (bad translation)
Candidate 2: I always do. (a complete match)
Reference 1: I always do.
Reference 2: I invariably do.
Reference 3: I perpetually do.

Authors on Recall
“Admittedly, one could align the reference translations to discover synonymous words and compute recall on concepts rather than words.”
“Given that reference translations vary in length and differ in word order and syntax, such a computation is complicated.”

Solution: Brevity Penalty
When the length of a translation matches a reference length, BP = 1.
When a translation is shorter than the reference, BP < 1.

Brevity Penalty Computation
IBM’s BP is corpus-based and uses best match lengths.
Best match length: the reference sentence length closest to the candidate length.
E.g. if the references have 12, 15 and 17 words and the candidate has 12, the best match length is 12.
Exponential decay in r/c if c < r.
r is the sum of the best match lengths for each candidate sentence in the test corpus.
c is the total length of the candidate translation corpus (? or is c the length of a single candidate sentence?).
BP shouldn’t be computed by averaging sentence penalties on a sentence-by-sentence basis => that would punish length deviations of short sentences very harshly.
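
A sketch of the corpus-based computation under the reading that c is the total candidate corpus length. The function names are hypothetical, and breaking ties between equally close reference lengths toward the shorter one is an assumption of this sketch. The penalty follows the form in Papineni et al. (2002): 1 if c > r, otherwise exp(1 - r/c).

```python
import math

def best_match_length(candidate_len, reference_lens):
    # Closest reference length; ties go to the shorter reference (an assumption).
    return min(reference_lens, key=lambda r: (abs(r - candidate_len), r))

def brevity_penalty(candidate_lens, reference_lens_per_sentence):
    """Corpus-level BP: lengths are summed over the corpus before comparing."""
    c = sum(candidate_lens)
    r = sum(best_match_length(cl, rl)
            for cl, rl in zip(candidate_lens, reference_lens_per_sentence))
    if c == 0:
        return 0.0
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

# The slide's example: references of 12, 15 and 17 words, candidate of 12 words.
print(brevity_penalty([12], [[12, 15, 17]]))  # 1.0
# A shorter, 10-word candidate is penalized.
print(brevity_penalty([10], [[12, 15, 17]]))
```

Because r and c are summed first, a short sentence that deviates in length contributes only a few words to the totals instead of a harsh per-sentence penalty.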

Original Paper on the value c
Pretty confusing:
“c is the total length of the candidate translation corpus.” (Section 2.2.2)
“let c be the length of the candidate translation ……” (Section 2.3)

Formulae of BLEU Computation
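
The formulae on this slide were lost in the transcript. As given in Papineni et al. (2002), with c and r as defined above, p_n the modified n-gram precision and w_n the weights, they are presumably:

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)

\log \mathrm{BLEU} = \min\!\left(1 - \frac{r}{c},\, 0\right) + \sum_{n=1}^{N} w_n \log p_n
```

The paper's baseline uses N = 4 and uniform weights w_n = 1/N.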

NIST version
r: the average number of words in a reference translation, averaged over all reference translations.
c: the number of words in the translation being scored.
(Skipped here) The NIST version also has a different definition of BP.

Experimental Evidence
Detail: please read the reserved slides.
Summary of experimental evidence from the original paper:
The ranking provided by BLEU is the same as the ranking provided by human judges.
The result is statistically significant under paired t-statistics.
Using BLEU, a single reference can suffice (given a big corpus and translations from different translators).
BLEU shows that there is still a big gap between machine and human translation.
BLEU has been used in multiple languages and shown to be useful.

Human vs. BLEU - Conclusion
Human and machine translations differ greatly in BLEU.
In a footnote: “significant challenge for the current state-of-the-art systems”.
The bilingual group was very forgiving of fluency problems in the translations.

Conclusion
Presented the scheme and motivation of the original IBM BLEU.
The scheme is well motivated.
It is shown to be correlated with human judgment.
It is also shown to be useful for {Arabic, Chinese, French, Spanish} to English.
The authors believe that averaging sentence judgments over a corpus is better than trying to approximate the human judgment for every sentence: “quantity leads to quality”.
The ideas could be used in summarization and NLG tasks.

References
Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu, BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL-02, 2002.
George Doddington, Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics.
Etienne Denoual, Yves Lepage, BLEU in Characters: Towards Automatic MT Evaluation in Languages without Word Delimiters.
Alon Lavie, Kenji Sagae, Shyamsundar Jayaraman, The Significance of Recall in Automatic Metrics for MT Evaluation.
Christopher Culy, Susanne Z. Riehemann, The Limits of N-Gram Translation Evaluation Metrics.
Satanjeev Banerjee, Alon Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.
About the t-test: http://mathworld.wolfram.com/Pairedt-Test.html
About the t-distribution: http://mathworld.wolfram.com/Studentst-Distribution.html

Reserved: Experimental Evidence of BLEU Arthur Chan

Experimental Evidence of BLEU
500 sentences (40 general news stories).
4 references for each sentence.

Means/Variance/t-statistics of BLEU
Sentences are divided into 20 blocks, each with 25 sentences.

Experimental Evidence of BLEU (cont.)
The differences in BLEU score are significant, as shown by the paired t-statistics.
A paired t-statistic (? pairwise t-test) > 1.7 is significant.
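
As a rough illustration of the test, the paired t-statistic compares two systems block by block. The block scores below are invented for this sketch (four blocks instead of the twenty used in the experiment), and `paired_t_statistic` is a hypothetical helper, not code from the paper.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t = mean of per-block differences / standard error of those differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(len(diffs)))

# Hypothetical per-block BLEU scores for two systems on the same blocks.
system_a = [0.30, 0.32, 0.31, 0.33]
system_b = [0.25, 0.28, 0.27, 0.26]
t = paired_t_statistic(system_a, system_b)
print(t)
```

The statistic is then compared against a threshold such as the 1.7 quoted on the slide; the appropriate threshold depends on the number of blocks.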

No. of references required
The systems maintain the same rank order when 1 of the 4 reference translations is chosen at random for each sentence.
=> Using BLEU, a single reference can be used, as long as the corpus is big and the translations come from different translators.

Human Evaluation
Two groups of judges:
“Monolingual group”: native speakers of English.
“Bilingual group”: native speakers of Chinese who had lived in the U.S. for several years.
Each judge rated each sentence with an opinion score from 1 (very bad) to 5 (very good).

Monolingual Group

Bilingual Group

Some observations in Human Evaluation
Human evaluation shows the same ranking as BLEU does.
The bilingual group seems to focus on adequacy more than fluency.

Human vs. BLEU
BLEU shows high correlation with both the monolingual group (0.99) and the bilingual group (0.96).

Human vs. BLEU (cont.)