Study of Translation Edit Rate with Targeted Human Annotation
Matthew Snover (UMD), Bonnie Dorr (UMD), Richard Schwartz (BBN), Linnea Micciulla (BBN), John Makhoul (BBN)

2 Outline
• Motivations
• Definition of Translation Edit Rate (TER)
• Human-Targeted TER (HTER)
• Comparisons with BLEU and METEOR
• Correlations with Human Judgments

3 Motivations
• Subjective human judgments of performance have been the gold standard of MT evaluation
• However…
  – Human judgments are coarse-grained
  – Meaning and fluency judgments tend to be conflated
  – Interannotator agreement at the segment level is poor
• We want a more objective and repeatable human measure of fluency and meaning
  – A measure of the amount of work needed to fix a translation so that it is both fluent and correct
  – Count the number of edits a human must make to fix the translation

4 What is (H)TER?
• Translation Edit Rate (TER): the number of edits needed to change a system output so that it exactly matches a given reference
  – MT research has become increasingly phrase-based, and we want a notion of edits that captures that
  – Movement of phrases is allowed via shifts
• Human-targeted TER (HTER): the minimal number of edits needed to change a system output so that it is fluent and has the correct meaning
  – An infinite number of references could, in principle, be used to find the one best reference that minimizes the number of edits
  – In practice we normally have at most 4 references
  – Instead, generate a new targeted reference that is very close to the system output
  – Measure TER between the targeted reference and the system output

5 Formula of Translation Edit Rate (TER)
• TER = (number of edits) / (average number of reference words)
• With more than one reference:
  – Edits are counted against the best (closest) reference
  – The denominator is the average reference length
• Edits include insertions, deletions, substitutions, and shifts
  – All edits count as 1 edit
  – A shift moves a sequence of words within the hypothesis
  – A shift of any sequence of words (any distance) is only 1 edit
• Capitalization and punctuation errors are included
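As a minimal illustration of the formula on this slide (not the authors' released implementation; the function name and interface are assumptions of this sketch), the score can be computed from an edit count and the reference lengths as follows:

def ter(num_edits, reference_lengths):
    """Translation Edit Rate: edits divided by the average reference length.

    num_edits: total insertions + deletions + substitutions + shifts,
               counted against the closest reference.
    reference_lengths: word counts of all available references.
    """
    avg_ref_len = sum(reference_lengths) / len(reference_lengths)
    return num_edits / avg_ref_len

# Example from the "saudi arabia" slides: 4 edits against one 13-word reference.
print(ter(4, [13]))  # ~0.31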

6 Why Use Shifts?
REF: saudi arabia denied this week information published in the american new york times
HYP: this week the saudis denied information published in the new york times
• WER is too harsh when the output is distorted from the reference
• With WER, no credit is given to the system when it generates the right string in the wrong place
• TER shifts reflect the editing action of moving the string from one location to another

8 Why Use Shifts?
REF: **** **** SAUDI ARABIA denied THIS WEEK information published in the AMERICAN new york times
HYP: THIS WEEK THE SAUDIS denied **** **** information published in the ******** new york times
• WER is too harsh when the output is distorted from the reference
• With WER, no credit is given to the system when it generates the right string in the wrong place
• TER shifts reflect the editing action of moving the string from one location to another

10 Example
REF: saudi arabia denied this week information published in the american new york times
HYP: this week the saudis denied information published in the new york times

11 Example
REF: saudi arabia denied this week information published in the american new york times
HYP: the saudis denied [this week] information published in the new york times
Edits:
• Shift "this week" to after "denied"

12 Example
REF: SAUDI ARABIA denied this week information published in the american new york times
HYP: THE SAUDIS denied [this week] information published in the new york times
Edits:
• Shift "this week" to after "denied"
• Substitute "Saudi Arabia" for "the Saudis"

14 Example
REF: SAUDI ARABIA denied this week information published in the AMERICAN new york times
HYP: THE SAUDIS denied [this week] information published in the ******** new york times
Edits:
• Shift "this week" to after "denied"
• Substitute "Saudi Arabia" for "the Saudis"
• Insert "American"
• 1 shift, 2 substitutions, 1 insertion
  – 4 edits (TER = 4/13 = 31%)
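A quick check of the arithmetic on this slide (plain Python, nothing assumed beyond the counts shown above):

edits = 1 + 2 + 1          # 1 shift, 2 substitutions, 1 insertion
ref_len = 13               # words in the reference sentence
print(f"TER = {edits}/{ref_len} = {edits / ref_len:.0%}")  # TER = 4/13 = 31%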

15 Calculation of Number of Edits
• The optimal sequence of edits (with shifts) is very expensive to find
• Use a greedy search to select the set of shifts
  – At each step, calculate the min-edit (Levenshtein) distance (number of insertions, deletions, substitutions) using dynamic programming
  – Choose the shift that most reduces the min-edit distance
  – Repeat until no shift remains that reduces the min-edit distance
• After all shifting is complete, the number of edits is the number of shifts plus the remaining edit distance
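The following is a simplified sketch of this greedy procedure, not the authors' released TER code; in particular, candidate shifts are generated naively here and the shift constraints from the next slide are not enforced.

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(hyp)][len(ref)]

def greedy_ter_edits(hyp, ref, max_span=5):
    """Greedy shift search: apply the shift that most reduces the edit distance,
    repeat until no shift helps; total edits = shifts used + remaining distance."""
    hyp = list(hyp)
    shifts = 0
    best = edit_distance(hyp, ref)
    improved = True
    while improved:
        improved = False
        best_hyp = None
        # Naively try moving every span of up to max_span words to every position.
        for i in range(len(hyp)):
            for j in range(i + 1, min(i + max_span, len(hyp)) + 1):
                span, rest = hyp[i:j], hyp[:i] + hyp[j:]
                for k in range(len(rest) + 1):
                    candidate = rest[:k] + span + rest[k:]
                    dist = edit_distance(candidate, ref)
                    if dist < best:
                        best, best_hyp, improved = dist, candidate, True
        if improved:
            hyp, shifts = best_hyp, shifts + 1
    return shifts + best

ref = "saudi arabia denied this week information published in the american new york times".split()
hyp = "this week the saudis denied information published in the new york times".split()
print(greedy_ter_edits(hyp, ref))  # 4 edits, as in the worked example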

16 Shift Constraints
REF: DOWNER SAID " IN THE END, ANY bad AGREEMENT will NOT be an agreement we CAN SIGN. "
HYP: HE OUT " EVENTUALLY, ANY WAS *** bad, will *** be an agreement we WILL SIGNED. "
• Shifted words must match the reference words in the destination position exactly
• The word sequence of the hypothesis in the original position and the corresponding reference words must not match
• The word sequence of the reference that corresponds to the destination position must be misaligned before the shift
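Below is a rough sketch of how these three constraints might be checked for one candidate shift, given a word alignment between hypothesis and reference; the alignment representation and the function name are assumptions of this sketch, not the authors' implementation.

def legal_shift(hyp, ref, align, i, j, dest):
    """Check the shift constraints for moving hyp[i:j] so that it lines up
    with ref[dest:dest + (j - i)].

    align[k] is the reference index the k-th hypothesis word is currently
    aligned to, or None if it is unaligned (an assumed representation).
    """
    span = hyp[i:j]
    target = ref[dest:dest + len(span)]
    # 1. Shifted words must match the reference words at the destination exactly.
    if span != target:
        return False
    # 2. The hypothesis words in their original position must not already
    #    match the reference words they are aligned to.
    if all(align[k] is not None and hyp[k] == ref[align[k]] for k in range(i, j)):
        return False
    # 3. The reference words at the destination must be misaligned
    #    (not already matched by some hypothesis word) before the shift.
    matched_ref = {align[k] for k in range(len(hyp))
                   if align[k] is not None and hyp[k] == ref[align[k]]}
    if all(r in matched_ref for r in range(dest, dest + len(span))):
        return False
    return True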

21 HTER: Human-targeted TER
• Procedure to create targeted references:
  – Start with an automatic system output (hypothesis) and one or more human references
  – A fluent speaker of English creates a new reference translation targeted for this system output by editing the hypothesis until it is fluent and has the same meaning as the reference(s)
  – Targeted references are not required to be elegant English
• Compute the minimum TER including the new reference
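Given a TER edit counter (for example, the greedy_ter_edits sketch above), HTER is then just TER computed with the targeted reference added to the reference pool. This is a sketch of that bookkeeping only; the exact normalization used by the authors may differ.

def ter_rate(hyp, refs, ter_edits):
    """TER: edits against the closest reference / average reference length (slide 5)."""
    avg_len = sum(len(r) for r in refs) / len(refs)
    return min(ter_edits(hyp, r) for r in refs) / avg_len

def hter_rate(hyp, refs, targeted_ref, ter_edits):
    """HTER sketch: recompute TER with the targeted reference included;
    in practice the targeted reference is almost always the closest one."""
    return ter_rate(hyp, refs + [targeted_ref], ter_edits)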

22 Post-Editing Tool
• The post-editing tool displays all references and the hypothesis
• The tool shows where the hypothesis differs from the best reference
• The tool shows the current TER for the 'reference in progress'
• Requires an average of 3-7 minutes per sentence to annotate
  – Time was relatively consistent over 4 annotators
  – Time could be reduced by a better post-editing tool
• Example:
  Ref1: The expert, who asked not to be identified, added, "This depends on the conditions of the bodies."
  Ref2: The experts who asked to remain unnamed said, "the matter is related to the state of the bodies."
  Hyp: The expert who requested anonymity said that "the situation of the matter is linked to the dead bodies".
  Targ: The expert who requested anonymity said that "the matter is linked to the condition of the dead bodies".

27 Post-Editing Instructions
• Three requirements for creating targeted references:
  1. The meaning of the references must be preserved
  2. The targeted reference must be easily understood by a native speaker of English
  3. The targeted reference must be as close to the system output as possible without violating 1 and 2
• Grammaticality must be preserved
  – Acceptable: The two are leaving this evening
  – Not acceptable: The two is leaving this evening
• Alternate spellings (British or US) and contractions are allowed
• The meaning of the targeted reference must be equivalent to at least one of the references

28 Targeted Reference Examples
• Four Palestinians were killed yesterday by Israeli army bullets during a military operation carried out by the Israeli army in the old town of Nablus.
• I tell you truthfully that reality is difficult the load is heavy and the territory is vibrant and gyrating.
• Iranian radio points to lifting 11 people alive from the debris in Bam

29 Experimental Design
• Two systems from MTEval 2004 Arabic
  – 100 randomly chosen sentences
  – Each system output was previously judged for fluency and adequacy by two human judges at NIST
  – S1 is one of the worst systems; S2 is one of the best
• Four annotators corrected system output
  – Two annotators for each sentence from each system
  – Annotators were undergraduates employed by BBN for annotation
• We ensured that the new targeted references were sufficiently accurate and fluent
  – Other annotators checked (and corrected) all targeted references for fluency and meaning
  – The second pass changed 0.63 words per sentence

30 Results (Average of S1 and S2)
• Insertion of hypothesis words (missing in the reference)
• Deletion of reference words (missing in the hypothesis)
• TER is reduced by 33% using targeted references
  – Substitutions are reduced by the largest factor
  – 33% of the edits counted with untargeted references are due to the small sample of references
• The majority of edits are substitutions and deletions
[Table: Ins, Del, Sub, Shift, and total TER for TER (4 untargeted references) vs. HTER (1 targeted reference)]

35 BLEU and METEOR
• BLEU (Papineni et al. 2002)
  – Counts the n-grams (sizes 1-4) of the system output that match the reference set
  – Contributed to recent improvements in MT
• METEOR (Banerjee and Lavie 2005)
  – Counts exact word matches between the system output and the reference
  – Unmatched words are stemmed and then matched
  – Additional penalties for reordering words
• To compare with error measures, 1.0 - BLEU and 1.0 - METEOR are used in this talk
• HBLEU and HMETEOR
  – BLEU and METEOR computed with human-targeted references
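For illustration only (the talk used its own scoring setup), a segment-level "1.0 - BLEU" error-style score might be computed with NLTK's sentence-level BLEU as below; the smoothing choice and whitespace tokenization are assumptions of this sketch.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

refs = ["saudi arabia denied this week information published in the american new york times".split()]
hyp = "this week the saudis denied information published in the new york times".split()

# Sentence-level BLEU needs smoothing to avoid zero scores on short segments.
bleu = sentence_bleu(refs, hyp, smoothing_function=SmoothingFunction().method1)
print(1.0 - bleu)  # error-style score: lower is better, as on the following slides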

36 System Scores
• BLEU and METEOR shown as error measures (1.0 - score)
• Low scores are better

37 Correlation with Human Judgments
• Segment-level correlations (200 data points)
• Targeted correlations are the average of 2 correlations (2 targeted references)
• HTER correlates best with human judgments
• Targeted references increase correlation for the evaluation metrics
• METEOR correlates better than TER
• HTER correlates better than HMETEOR
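Segment-level correlation of a metric with human judgments can be computed as in this sketch (illustrative data shapes only; scipy's pearsonr is assumed to be available, and a rank correlation would work the same way):

from scipy.stats import pearsonr

def segment_correlation(metric_scores, human_scores):
    """Pearson correlation between per-segment metric scores and per-segment
    human judgments (e.g., averaged fluency/adequacy).
    Error measures such as (H)TER correlate negatively with quality judgments;
    it is the strength of that relationship that matters here."""
    r, p_value = pearsonr(metric_scores, human_scores)
    return r

# Hypothetical usage: one score per segment for the 200 segments.
# hter_scores = [...]; human_judgments = [...]
# print(segment_correlation(hter_scores, human_judgments))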

38 Correlation Between (H)TER / BLEU / METEOR

42 Correlations between Human Judges
• Each human judgment is the average of fluency and adequacy judgments
• Subjective human judgments are noisy
  – They exhibit lower correlation than might be expected
• HTER correlates a little better with a single human judgment than another human judgment does
  – Rather than having judges give subjective scores, they should create targeted references
• TER correlates with a single human judgment about as well as another human judgment does
[Table: correlations of HJ-1 and HJ-2 with TER, HTER, BLEU, HBLEU, METEOR, HMETEOR, and with each other]

43 Correlation Between HTER Post-Editors
• r = 0.85

44 Examining MT Errors with HTER
• Subjective human judgments aren't useful for diagnosing MT errors
• HTER indicates the portion of the output that is incorrect
Hypothesis: he also saw the riyadh attack similar in november 8 which killed 17 people.
REF: riyadh also saw a similar attack on november 8 which killed 17 people.
HYP: he riyadh also saw the similar attack in november 8 which killed 17 people.

45 Examining MT Errors with HTER
• Subjective human judgments aren't useful for diagnosing MT errors
• HTER indicates the portion of the output that is incorrect
Hypothesis: he also saw the riyadh attack similar in november 8 which killed 17 people.
REF: ** riyadh also saw A similar attack ON november 8 which killed 17 people.
HYP: HE [ riyadh ] also saw [ similar ] IN november 8 which killed 17 people.

46 Conclusions
• Targeted references decrease TER by 33%
  – In all subsequent studies the TER reduction is ~50%
• HTER has a high correlation with human judgments
  – But it is very expensive to produce
  – Targeted references are not readily reusable
• HTER makes fine distinctions among correct, nearly correct, and bad translations
  – Correct translations have HTER = 0
  – Bad translations have high HTER
  – It may be a substitute for subjective human judgments
• HTER is easy to explain to people outside the MT community:
  – It is the amount of work needed to correct the translations

47 Future Work and Impact
• Compute HTER and human judgment correlations at the system level, rather than the segment level
  – Caveat: HTER is expensive to generate for many systems
• A better post-editing tool
  – One that suggests edits to the annotator
• Investigate non-uniform edit weights for (H)TER
• HTER is currently used in the GALE evaluation
• TER computation code available at
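Non-uniform edit weights (the third bullet above) would replace the "all edits count as 1" rule from slide 5. A minimal sketch of what that might look like, with entirely hypothetical weights chosen for illustration:

# Hypothetical per-edit-type weights; the talk leaves these as future work.
EDIT_WEIGHTS = {"insertion": 1.0, "deletion": 1.0, "substitution": 1.2, "shift": 0.8}

def weighted_ter(edit_counts, avg_ref_len, weights=EDIT_WEIGHTS):
    """TER with per-type edit weights instead of a uniform cost of 1."""
    total = sum(weights[kind] * count for kind, count in edit_counts.items())
    return total / avg_ref_len

# The worked example from slide 14, reweighted:
print(weighted_ter({"shift": 1, "substitution": 2, "insertion": 1}, 13))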

48 Questions