Machine Translation 5, Autumn 2008: Lecture 20, 11 Sep 2008

Decoding
- Given a trained model and a foreign sentence f, produce argmax_e P(e|f)
- Can't use Viterbi: it's too restrictive
- Need a reasonably efficient search technique that explores the sequence space based on how good the options look: A*

A*
Recall that for A* we need:
- a goal state
- operators
- a heuristic

A*
Recall that for A* we need:
- Goal state: good coverage of the source
- Operators: translation of phrases/words, distortions, deletions/insertions
- Heuristic: probabilities (tweaked)

A* Decoding
Why not just use the probability as we go along?
- That turns the search into uniform-cost search, not A*
- Uniform cost favors shorter sequences over longer ones
- We need to counter-balance the probability of the translation so far with its “progress towards the goal”

A*/Beam
Sorry… even that doesn't work, because the space is too large. So as we go we'll prune the space, dropping paths that fall below some threshold.
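A toy sketch of this pruned search in Python. The three-word phrase table and its log-probabilities are invented, the search is monotone (no distortions), and language-model scores are ignored, so this illustrates only the pruning idea, not the lecture's actual decoder:

import heapq

# Invented toy phrase table: French word -> [(English word, log-prob), ...]
PHRASES = {
    "la":     [("the", -0.1), ("it", -1.2)],
    "maison": [("house", -0.2), ("home", -0.9)],
    "bleue":  [("blue", -0.1)],
}

def beam_decode(src, beam_width=2):
    # Each hypothesis is (log-prob so far, English words so far).
    beam = [(0.0, [])]
    for word in src:
        # Extend every surviving hypothesis by one source word.
        successors = [(lp + p, words + [eng])
                      for lp, words in beam
                      for eng, p in PHRASES[word]]
        # Prune: keep only the beam_width best-scoring paths.
        beam = heapq.nlargest(beam_width, successors)
    return max(beam)

print(beam_decode("la maison bleue".split()))
# -> (about -0.4, ['the', 'house', 'blue'])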

A* Decoding
(worked search-space example on the original slide)

How do we evaluate MT? Human tests for fluency
- Rating tests: give the raters a scale (1 to 5) and ask them to rate
  - or distinct scales for clarity, naturalness, style
  - or check for specific problems:
    - cohesion (lexical chains, anaphora, ellipsis): hand-checking for cohesion
    - well-formedness: 5-point scale of syntactic correctness
- Comprehensibility tests
  - noise test
  - multiple-choice questionnaire
- Readability tests
  - cloze test

How do we evaluate MT? Human tests for fidelity
- Adequacy: does it convey the information in the original?
  - Ask raters to rate on a scale
  - Bilingual raters: give them the source and target sentence, ask how much information is preserved
  - Monolingual raters: give them the target plus a good human translation
- Informativeness: task-based; is there enough information to do some task?
  - Give raters multiple-choice questions about content

Evaluating MT: Problems
- Asking humans to judge sentences on a 5-point scale for 10 factors takes time and money (weeks or months!)
- We can't build language engineering systems if we can only evaluate them once every quarter
- We need a metric that we can run every time we change our algorithm
- It wouldn't have to be perfect, as long as it tended to correlate with the expensive human metrics, which we could still run quarterly
Slide from Bonnie Dorr

Automatic evaluation
- The idea goes back to Miller and Beebe-Center (1958)
- Assume we have one or more human translations of the source passage
- Compare the automatic translation to these human translations, e.g. with:
  - BLEU
  - NIST
  - METEOR
  - Precision/Recall

Reference proximity methods
- Assumption of Reference Proximity (ARP): “…the closer the machine translation is to a professional human translation, the better it is” (Papineni et al., 2002: 311)
- Finding a distance between two texts:
  - minimal edit distance
  - N-gram distance
  - …

Minimal edit distance
- Minimal number of editing operations to transform text1 into text2:
  - deletions (sequence xy changed to x)
  - insertions (x changed to xy)
  - substitutions (x replaced by y)
  - transpositions (sequence xy changed to yx)
- Algorithm by Wagner and Fischer (1974); see the sketch below
- Edit distance implementation: the RED method (Akiba, Imamura and Sumita, 2001)
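A minimal sketch of the Wagner-Fischer dynamic programme, covering deletions, insertions and substitutions (transpositions, Damerau's extension, are omitted), applied at the word level as MT evaluation would use it:

def edit_distance(s, t):
    # d[i][j] = distance between the first i words of s and first j of t.
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                    # delete everything
    for j in range(n + 1):
        d[0][j] = j                    # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# Word-level distance between the Systran output and one reference:
mt  = "on its side the american state department declared".split()
ref = "for its part the american state department said".split()
print(edit_distance(mt, ref))  # 3 substitutions -> 3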

Problem with edit distance: legitimate translation variation
ORI: De son côté, le département d'Etat américain, dans un communiqué, a déclaré: ‘Nous ne comprenons pas la décision’ de Paris.
HT-Expert: For its part, the American Department of State said in a communique that ‘We do not understand the decision’ made by Paris.
HT-Reference: For its part, the American State Department stated in a press release: ‘We do not understand the decision’ of Paris.
MT-Systran: On its side, the American State Department, in an official statement, declared: ‘We do not include/understand the decision’ of Paris.

Legitimate translation variation
- Against which human translation should we compute the edit distance?
- Is it possible to integrate both human translations into a reference set?

N-gram distance
- the number of common words (evaluating lexical choices)
- the number of common sequences of 2, 3, 4 … N words (evaluating word order):
  - 2-word sequences (bigrams)
  - 3-word sequences (trigrams)
  - 4-word sequences (four-grams)
  - …
  - N-word sequences (N-grams)
- N-grams allow us to compute several parameters…

Matches of N-grams
(Venn diagram on the original slide: the overlap between the HT and MT n-gram sets gives the true positives; n-grams only in MT are false positives; n-grams only in HT are false negatives)

Matches of N-grams (contd.)

               | MT +            | MT –
Human text +   | true positives  | false negatives   → recall (avoiding false negatives)
Human text –   | false positives |
                 ↓ precision (avoiding false positives)

Precision and Recall
- Precision: how accurate is the answer? (“Don't guess, wrong answers are deducted!”)
- Recall: how complete is the answer? (“Guess if not sure, don't miss anything!”)

Translation variation and N-grams
- N-gram distance to multiple human reference translations (see the sketch below):
- Precision on the union of the N-gram sets of HT1, HT2, HT3…
  - N-grams in all independent human translations taken together, with repetitions removed
- Recall on the intersection of the N-gram sets
  - N-grams common to all sets, i.e. only the repeated N-grams (the most stable across different human translations)
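Roughly, in code, the two measures might look as follows; the function names and exact formulation here are illustrative, not necessarily the published method's:

def ngrams(words, n):
    # Set of n-grams in a token list (repetitions removed).
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def union_precision(mt, refs, n):
    # Precision of MT n-grams against the union of all references.
    ref_union = set().union(*(ngrams(r, n) for r in refs))
    mt_set = ngrams(mt, n)
    return len(mt_set & ref_union) / len(mt_set) if mt_set else 0.0

def intersection_recall(mt, refs, n):
    # Recall of the n-grams common to ALL references (the most stable ones).
    common = set.intersection(*(ngrams(r, n) for r in refs))
    return len(ngrams(mt, n) & common) / len(common) if common else 0.0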

Human and automated scores
- Empirical observations:
  - Precision on the union gives an indication of fluency
  - Recall on the intersection gives an indication of adequacy
- Automated adequacy evaluation is less accurate: it is the harder problem
- The most successful N-gram proximity metric at present is BLEU (Papineni et al., 2002): the BiLingual Evaluation Understudy

BiLingual Evaluation Understudy (BLEU; Papineni, 2001)
- An automatic technique, but…
- requires the pre-existence of human (reference) translations
- Approach:
  - produce a corpus of high-quality human translations
  - judge “closeness” numerically (word-error rate)
  - compare n-gram matches between the candidate translation and one or more reference translations
Slide from Bonnie Dorr

BLEU evaluation measure
- computes precision on the union of N-grams
- accurately predicts fluency
- produces scores in the range [0, 1]

BLEU Evaluation Metric (Papineni et al., ACL-2002)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
- N-gram precision (score is between 0 and 1): what percentage of machine n-grams can be found in the reference translation?
  - An n-gram is a sequence of n words
  - Not allowed to use the same portion of the reference translation twice (can't cheat by typing out “the the the the the”)
- Brevity penalty: can't just type out the single word “the” (precision 1.0!)
- Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn't)
Slide from Bonnie Dorr

BLEU Evaluation Metric (Papineni et al., ACL-2002)
(Same reference and machine translation as above.)
BLEU4 formula (counts n-grams up to length 4):
  BLEU4 = exp( 0.25*log p1 + 0.25*log p2 + 0.25*log p3 + 0.25*log p4
               - max(words-in-reference / words-in-machine - 1, 0) )
where
  p1 = 1-gram precision
  p2 = 2-gram precision
  p3 = 3-gram precision
  p4 = 4-gram precision
Slide from Bonnie Dorr

Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places.
Reference translation 3: The US International Airport of Guam and its office has received an e-mail from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert.
Reference translation 4: US Guam International Airport and its office received an e-mail from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
Slide from Bonnie Dorr

BLEU in Action
枪手被警方击毙。 (Foreign original)
the gunman was shot to death by the police. (Reference translation)
#1 the gunman was police kill.
#2 wounded police jaya of
#3 the gunman was shot dead by the police.
#4 the gunman arrested by police kill.
#5 the gunmen were killed.
#6 the gunman was shot to death by the police.
#7 gunmen were killed by police
#8 al by the police.
#9 the ringer is killed by the police.
#10 police killed the gunman.
Slide from Bonnie Dorr

BLEU in Action (contd.)
Same ten machine outputs as above, color-coded on the original slide:
green = 4-gram match (good!)
red = word not matched (bad!)
Slide from Bonnie Dorr

BLEU Comparison
Chinese-English translation example:
Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Slide from Bonnie Dorr

How Do We Compute BLEU Scores?
- Intuition: “What percentage of words in the candidate occurred in some human translation?”
- Proposal: count up the number of candidate translation words (unigrams) that appear in any reference translation, and divide by the total number of words in the candidate translation
- But we can't just count the total number of overlapping N-grams!
  - Candidate: the the the the the the
  - Reference 1: The cat is on the mat
- Solution: a reference word should be considered exhausted after a matching candidate word is identified
Slide from Bonnie Dorr

“Modified n-gram precision”
- For each word compute:
  (1) the maximum number of times it occurs in any single reference translation
  (2) the number of times it occurs in the candidate translation
- Instead of using count #2 directly, use the minimum of #2 and #1, i.e. clip the counts at the maximum for the reference translation
- Now use that modified count
- And divide by the number of candidate words (see the sketch below)
Slide from Bonnie Dorr
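A sketch of the clipping in Python (not Papineni et al.'s reference implementation), checked against the “the the the…” cheater example from a later slide:

from collections import Counter

def modified_precision(candidate, references, n=1):
    def counts(words):
        return Counter(tuple(words[i:i + n])
                       for i in range(len(words) - n + 1))
    cand = counts(candidate)
    clipped = 0
    for gram, c in cand.items():
        # Clip at this n-gram's maximum count in any single reference.
        max_ref = max(counts(ref)[gram] for ref in references)
        clipped += min(c, max_ref)
    return clipped / sum(cand.values())

cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(modified_precision(cand, refs))  # min(7, 2) / 7 = 2/7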

Modified Unigram Precision: Candidate #1
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
It(1) is(1) a(1) guide(1) to(1) action(1) which(1) ensures(1) that(2) the(4) military(1) always(1) obeys(0) the commands(1) of(1) the party(1)
What's the answer? 17/18
Slide from Bonnie Dorr

Modified Unigram Precision: Candidate #2
(References 1-3 as above.)
It(1) is(1) to(1) insure(0) the(4) troops(0) forever(1) hearing(0) the activity(0) guidebook(0) that(2) party(1) direct(0)
What's the answer? 8/14
Slide from Bonnie Dorr

Modified Bigram Precision: Candidate #1
(References 1-3 as above.)
It is(1) is a(1) a guide(1) guide to(1) to action(1) action which(0) which ensures(0) ensures that(1) that the(1) the military(1) military always(0) always obeys(0) obeys the(0) the commands(0) commands of(0) of the(1) the party(1)
What's the answer? 10/17
Slide from Bonnie Dorr

Modified Bigram Precision: Candidate #2
(References 1-3 as above.)
It is(1) is to(0) to insure(0) insure the(0) the troops(0) troops forever(0) forever hearing(0) hearing the(0) the activity(0) activity guidebook(0) guidebook that(0) that party(0) party direct(0)
What's the answer? 1/13
Slide from Bonnie Dorr

Catching Cheaters
Reference 1: The cat is on the mat
Reference 2: There is a cat on the mat
Candidate: the the the the the the the
Clipped counts: the(2), and every further the(0)
What's the unigram answer? 2/7
What's the bigram answer? 0/7
Slide from Bonnie Dorr

BLEU distinguishes human from machine translations
(chart on the original slide)
Slide from Bonnie Dorr

BLEU Tends to Predict Human Judgments
(chart of a BLEU variant against human judgments on the original slide)
Slide from G. Doddington (NIST)

BLEU problems with sentence length
Candidate: of the
(References 1-3 as above.)
Problem: modified unigram precision is 2/2, bigram precision 1/1!
Solution: a brevity penalty, which prefers candidate translations that are the same length as one of the references (see the sketch below)
Slide from Bonnie Dorr
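A sketch of the brevity penalty as commonly implemented, using the closest reference length r and the candidate length c; with a single reference this matches the max-based exponent in the BLEU4 formula above:

import math

def brevity_penalty(cand_len, ref_lens):
    # Closest reference length (ties go to the shorter reference).
    r = min(ref_lens, key=lambda length: (abs(length - cand_len), length))
    # 1.0 if the candidate is long enough, exp(1 - r/c) otherwise.
    return 1.0 if cand_len >= r else math.exp(1.0 - r / cand_len)

# "of the" (2 words) against the 16-18 word references above:
print(brevity_penalty(2, [16, 18, 16]))  # exp(1 - 16/2) = e**-7, ~0.0009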

NIST
- Based on the BLEU metric
- Where BLEU simply calculates n-gram precision, giving equal weight to each n-gram, NIST also calculates how informative a particular n-gram is: when a correct n-gram is found, the rarer that n-gram is, the more weight it is given
- For example, if the bigram “on the” is correctly matched, it will receive lower weight than a correct match of the bigram “interesting calculations”, as the latter is less likely to occur
- Uses a different way of calculating the brevity penalty (see the sketch below)
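A sketch of Doddington's information weight, with invented counts that mirror the “on the” vs. “interesting calculations” example; the real metric derives these counts from the full set of reference translations:

import math

def nist_info(ngram, counts):
    # info(w1..wn) = log2( count(w1..w_{n-1}) / count(w1..wn) ):
    # the rarer the n-gram, the larger the weight.
    # counts maps n-gram tuples to corpus counts; the empty tuple ()
    # holds the total number of words (the prefix of a unigram).
    return math.log2(counts[ngram[:-1]] / counts[ngram])

counts = {(): 1000,
          ("on",): 40, ("on", "the"): 30,
          ("interesting",): 2, ("interesting", "calculations"): 1}
print(nist_info(("on", "the"), counts))                    # ~0.42 bits
print(nist_info(("interesting", "calculations"), counts))  # 1.0 bit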

METEOR (Metric for Evaluation of Translation with Explicit ORdering)
- Based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision
- Uses stemming and synonymy matching, along with standard exact word matching (see the sketch below)
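A sketch of that weighted harmonic mean as given by Banerjee and Lavie (2005), where recall counts nine times as much as precision; the full metric also multiplies in a fragmentation penalty for word-order differences, which is not shown here:

def meteor_fmean(precision, recall):
    # F_mean = 10PR / (R + 9P): a heavily recall-weighted harmonic mean.
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 10 * precision * recall / (recall + 9 * precision)

print(meteor_fmean(0.9, 0.5))  # ~0.52: high precision helps little...
print(meteor_fmean(0.5, 0.9))  # ~0.83: ...high recall helps a lot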

Recent developments: N-gram distance
- paraphrasing instead of multiple reference translations
- more weight to more “important” words (those relatively more frequent in a given text)
- relations between different human scores
- accounting for dynamic quality criteria