Recent Trends in MT Evaluation: Linguistic Information and Machine Learning. Jason Adams, 11-734, 2008-03-05. Instructors: Alon Lavie, Stephan Vogel.

Recent Trends in MT Evaluation: Linguistic Information and Machine Learning
Jason Adams
Instructors: Alon Lavie, Stephan Vogel

Outline: Background, Machine Learning, Linguistic Information, Combined Approaches, Conclusions

Background
- Fully automatic MT evaluation is as hard as MT itself: if we could judge with certainty whether a translation is correct, we could reverse the process and generate a correct translation.
- Reference translations help to close this gap.

Background: Adequacy and Fluency
- Adequacy: how much of the meaning of the source sentence is preserved in the hypothesis. Reference translations are assumed to achieve this sufficiently.
- Fluency: how closely the hypothesis sentence conforms to the norms of the target language. Reference translations are themselves a sample of the target language.

Background: Human Judgments
- Judges rate each hypothesis on a scale for adequacy and fluency.
- Agreement between judges is low.
- Judgment scores are often normalized (Blatz et al., 2003).
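A minimal sketch of what score normalization can look like, assuming a simple per-judge z-score scheme (Blatz et al. describe their own procedure; the data layout below is illustrative):

import numpy as np

def normalize_per_judge(scores_by_judge):
    """Rescale each judge's ratings to zero mean and unit variance so that
    strict and lenient judges become comparable (one common scheme, assumed here)."""
    normalized = {}
    for judge, scores in scores_by_judge.items():
        s = np.asarray(scores, dtype=float)
        normalized[judge] = (s - s.mean()) / (s.std() or 1.0)
    return normalized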

Background: Evaluating Metrics
- Metrics are judged by their correlation with human assessments (judgments): Pearson correlation or Spearman rank correlation.
- Adding more references helps BLEU but hurts NIST (Finch et al., 2004).
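A minimal sketch of this meta-evaluation step, assuming SciPy is available and that metric scores and averaged human judgments are parallel lists (the numbers are made up):

from scipy.stats import pearsonr, spearmanr

metric_scores = [0.31, 0.28, 0.35, 0.22]   # hypothetical automatic metric scores
human_scores  = [3.4, 3.1, 3.9, 2.6]       # hypothetical averaged human judgments

r, _ = pearsonr(metric_scores, human_scores)      # linear correlation
rho, _ = spearmanr(metric_scores, human_scores)   # rank correlation
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")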

Background: BLEU
- Papineni et al. (2001): the first automatic MT metric to be widely adopted.
- Geometric mean of modified n-gram precisions.
- Criticisms: poor sentence-level correlation; favors statistical systems; ignores recall; rewards local word choice more than global accuracy.
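To make "geometric mean of modified n-gram precision" concrete, here is a simplified sentence-level sketch; the real metric is corpus-level, and the brevity penalty below follows the usual definition, so treat this as illustrative rather than a reference implementation:

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, refs, n):
    hyp_counts = Counter(ngrams(hyp, n))
    # Clip each hypothesis n-gram by its maximum count in any single reference.
    max_ref = Counter()
    for ref in refs:
        for ng, c in Counter(ngrams(ref, n)).items():
            max_ref[ng] = max(max_ref[ng], c)
    clipped = sum(min(c, max_ref[ng]) for ng, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

def bleu(hyp, refs, max_n=4):
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # any zero precision zeroes the geometric mean
    log_avg = sum(math.log(p) for p in precisions) / max_n  # geometric mean in log space
    closest_ref = min(refs, key=lambda r: abs(len(r) - len(hyp)))
    bp = 1.0 if len(hyp) > len(closest_ref) else math.exp(1 - len(closest_ref) / len(hyp))
    return bp * math.exp(log_avg)

print(bleu("the cat sat on the mat".split(),
           ["the cat sat on a mat".split()]))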

Background: METEOR
- Banerjee and Lavie (2005); addresses some of the shortcomings of BLEU.
- Uses recall against the best reference.
- Explicitly aligns the hypothesis with the reference.
- Optionally uses WordNet synonymy and Porter stemming.
- Better correlation with human judgments.
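The scoring side of METEOR is easy to state; the sketch below applies the original formulas (recall-weighted F-mean plus a fragmentation penalty) to precomputed match statistics, and leaves out the alignment stage, stemming, and WordNet matching:

def meteor_score(matches, hyp_len, ref_len, chunks):
    # matches: matched unigrams; chunks: number of contiguous matched runs.
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    f_mean = 10 * precision * recall / (recall + 9 * precision)  # recall-weighted F
    penalty = 0.5 * (chunks / matches) ** 3                      # fragmentation penalty
    return f_mean * (1 - penalty)

# e.g. 6 matched unigrams forming 2 contiguous chunks
print(meteor_score(matches=6, hyp_len=7, ref_len=8, chunks=2))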

Outline: Background, Machine Learning, Linguistic Information, Combinations, Conclusions

Machine Learning: Kulesza & Shieber (2004)
Frame MT evaluation as a classification task: can we predict whether a sentence was produced by a human or a machine by comparing it against reference translations?

Machine Learning: Kulesza & Shieber (2004)
Derived a set of features (partially based on BLEU):
- Unmodified n-gram precisions (1 to 5).
- Minimum and maximum ratio of hypothesis to reference length.
- Word error rate (WER): minimum edit distance between the hypothesis and any reference.
- Position-independent word error rate (PER): the words of the shorter string are removed from the longer one, and the size of the remaining set is returned.
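A sketch of two of these surface features under the usual definitions of WER and PER (the normalization by reference length is my assumption; the paper may normalize differently):

from collections import Counter

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance over word sequences.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(a)][len(b)]

def wer(hyp, refs):
    # Minimum edit distance to any reference, normalized by reference length (assumed).
    return min(edit_distance(hyp, ref) / len(ref) for ref in refs)

def per(hyp, ref):
    # Remove the shorter bag of words from the longer one and return the size of
    # what remains, again normalized by reference length (assumed).
    shorter, longer = sorted([Counter(hyp), Counter(ref)], key=lambda c: sum(c.values()))
    return sum((longer - shorter).values()) / max(len(ref), 1)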

Machine Learning: Kulesza & Shieber (2004)
- Trained an SVM classifier: positive = human translation, negative = machine translation.
- The score is the SVM output; distance to the hyperplane is treated as a measure of confidence.
- Classification accuracy: ~59% on human examples (positive), ~70% on machine examples (negative).
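A minimal sketch of this setup, assuming scikit-learn and an already-built feature matrix; the random data below is only a stand-in for real per-sentence feature vectors:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 9))      # hypothetical feature vectors, 9 features each
y_train = rng.integers(0, 2, size=200)   # 1 = human translation, 0 = machine translation

clf = SVC(kernel="rbf").fit(X_train, y_train)

X_test = rng.normal(size=(5, 9))
scores = clf.decision_function(X_test)   # signed distance to the hyperplane ~ metric score
print(scores)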

Machine Learning: Kulesza & Shieber (2004)
Compared to BLEU, WER, PER, and F-measure at the sentence level.

Outline: Background, Machine Learning, Linguistic Information, Combinations, Conclusions

Linguistic Information: Liu & Gildea (2005)
- Introduce syntactic information into MT evaluation.
- Run the Collins parser on hypothesis and reference translations.
- Examine three different metrics for comparing trees.

Linguistic Information: Liu & Gildea (2005)
Subtree Metric (STM):
- D is the maximum depth of subtrees considered.
- The count of a subtree is the number of times it appears in any reference.
- The clipped count limits this to the maximum number of times the subtree appears in any single reference.
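From the definitions on this slide, STM has the same clipped-precision shape as BLEU, averaged over subtree depths 1..D; the following rendering is my reconstruction and may differ in notation from the paper:

\mathrm{STM} \;=\; \frac{1}{D}\sum_{n=1}^{D}
\frac{\sum_{t \in \mathrm{subtrees}_n(\mathrm{hyp})} \mathrm{count}_{\mathrm{clip}}(t)}
     {\sum_{t \in \mathrm{subtrees}_n(\mathrm{hyp})} \mathrm{count}(t)}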

Linguistic Information: Liu & Gildea (2005)
Kernel-based Subtree Metric (TKM):
- H(t) is a vector of counts of all subtrees of t, so H(t1) · H(t2) counts the subtrees the two trees have in common.
- Convolution kernels (Collins & Duffy, 2001) compute this in polynomial time; enumerating all subtrees explicitly would be exponential in the size of the trees.
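One plausible way to write the resulting score (the cosine normalization and the maximum over references are my assumptions about the exact formulation):

\mathrm{TKM} \;=\; \max_{r \in \mathrm{refs}}
\frac{H(t_{\mathrm{hyp}}) \cdot H(t_{r})}
     {\sqrt{H(t_{\mathrm{hyp}}) \cdot H(t_{\mathrm{hyp}})}\;\sqrt{H(t_{r}) \cdot H(t_{r})}}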

Linguistic Information: Liu & Gildea (2005)
Headword Chain Metric (HWCM):
- Convert the phrase-structure parse into a dependency parse.
- Each mother-daughter relationship in the dependency parse is a headword chain of length 2; siblings are never joined into a chain.
- The score is computed in the same fashion as STM.
- The other two metrics also have dependency-tree versions.
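A sketch of the length-2 case, assuming the dependency parse is given as (head index, dependent index) arcs over a token list; longer chains and the full STM-style averaging over chain lengths are omitted:

from collections import Counter

def headword_chains(tokens, arcs):
    # Each mother-daughter arc (head, dependent) yields one chain of length 2.
    return Counter((tokens[h], tokens[d]) for h, d in arcs)

def chain_precision(hyp_chains, ref_chain_counts):
    # Clipped precision of hypothesis chains against the references (STM-style).
    clipped = sum(min(c, max(rc[chain] for rc in ref_chain_counts))
                  for chain, c in hyp_chains.items())
    total = sum(hyp_chains.values())
    return clipped / total if total else 0.0

hyp_chains = headword_chains("the cat sat".split(), [(2, 1), (1, 0)])   # sat->cat, cat->the
ref_chains = [headword_chains("a cat sat".split(), [(2, 1), (1, 0)])]   # sat->cat, cat->a
print(chain_precision(hyp_chains, ref_chains))   # 1 of 2 chains matches -> 0.5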

Linguistic Information: Liu & Gildea (2005)
- Data from MT03 and the JHU Summer Workshop (2003).
- Reported correlation with fluency judgments and with overall judgments for one MT system (E15).

Linguistic Information: Liu & Gildea (2005)
Corpus-level judgments for MT03.

Linguistic Information: Pozar & Charniak (2006)
- Propose the Bllip metric.
- Intuition: meaning-preserving transformations of a sentence should not heavily affect its dependency structure. Perhaps intuitive, but unsubstantiated.

Linguistic Information: Pozar & Charniak (2006)
- Parse hypothesis and reference translations with the Charniak parser.
- Construct dependency parses from the output parse trees.
- A lexical head pair (w1, w2) is a dependency if w1 != w2 and w1 is the lexical head of a constituent immediately dominating the constituent of which w2 is the head.

Linguistic Information: Pozar & Charniak (2006)
- Construct all dependency pairs for the hypothesis and the reference translation; with multiple references, compare against them one at a time.
- Score with precision and recall; the exact formula is not stated explicitly, but it is probably F1.
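A sketch of the scoring step under that assumption (F1 over dependency-pair sets, best reference kept); the pair sets below are purely illustrative:

def f1_over_pairs(hyp_pairs, ref_pair_sets):
    # Best F1 of the hypothesis dependency pairs against any single reference.
    if not hyp_pairs:
        return 0.0
    best = 0.0
    for ref_pairs in ref_pair_sets:
        overlap = len(hyp_pairs & ref_pairs)
        if overlap == 0:
            continue
        p = overlap / len(hyp_pairs)
        r = overlap / len(ref_pairs)
        best = max(best, 2 * p * r / (p + r))
    return best

hyp = {("sat", "cat"), ("sat", "mat"), ("cat", "the")}
refs = [{("sat", "cat"), ("sat", "on"), ("cat", "the")}]
print(f1_over_pairs(hyp, refs))   # 2 shared pairs out of 3 and 3 -> F1 = 0.667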

Linguistic Information: Pozar & Charniak (2006)
- Evaluation compared the largest discrepancies between Bllip and BLEU and judged which metric was more accurate in each case.
- The results suggest Bllip makes better choices than BLEU, but numbers are not given directly.

Linguistic Information: Pozar & Charniak (2006)
- A fairly weak paper: the evaluation is essentially eyeballed.
- Still, simple headword bi-chains seem to perform about as well as BLEU.
- Unfortunately, the metrics cannot be reliably compared.

Linguistic Information: Owczarzak et al. (2007)
- Extends the work of Liu & Gildea (2005), who used unlabeled dependency parses.
- Insight: more information about grammatical relations might help, e.g. "X is the subject of Y" or "X is a determiner of Y".

Linguistic Information: Owczarzak et al. (2007)
Used an LFG parser to generate f-structures, which contain information about grammatical relations.

Linguistic Information: Owczarzak et al. (2007)
Types of dependencies:
- Predicate-only dependencies: predicate-value pairs, i.e. grammatical relations.
- Non-predicate dependencies: tense, passive, adjectival degree (comparative, superlative), verb particle, etc.
Extended the HWCM of Liu & Gildea (2005) to use these labeled dependencies.

Linguistic Information: Owczarzak et al. (2007)
- How do you account for parser noise? In an LFG parse, the positions of adjuncts should not affect the f-structure.
- Constructed re-orderings of 100 English sentences: the re-ordered sentence is treated as the translation hypothesis and the original sentence as the reference translation.

Linguistic Information: Owczarzak et al. (2007)
- Solution: use n-best parses, trading off against computation time.
- Used 10-best parses.

Linguistic Information: Owczarzak et al. (2007)
- Obtained precision and recall for each (hypothesis, reference) pair; four reference examples for each machine hypothesis.
- Extended matching with WordNet synonyms.
- Extended with partial matches, where one part of a grammatical relation matches and the other may or may not.
- Computed F1; tried different weights for the weighted harmonic mean but saw no significant improvement (* personal communication with Karolina Owczarzak).
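For reference, F1 and the more general alpha-weighted harmonic mean the slide alludes to (the weighted form is the standard textbook definition, not a formula taken from the paper):

F_1 = \frac{2PR}{P + R}, \qquad F_\alpha = \frac{PR}{\alpha P + (1 - \alpha) R}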

Linguistic Information: Owczarzak et al. (2007)
- Evaluated using Pearson correlation with unnormalized human judgment scores (values from 1 to 5).
- Their metric with 50-best parses and WordNet performed best on fluency.
- METEOR with WordNet performed best on adequacy and overall; 50-best with partial matching scored slightly lower than METEOR overall.
- Their metric significantly outperformed BLEU (* personal communication with Karolina Owczarzak).

Outline: Background, Machine Learning, Linguistic Information, Combinations, Conclusions

Combinations: Albrecht & Hwa (2007)
- Extends the work of Kulesza & Shieber (2004).
- Incorporates Liu & Gildea's headword-chain features.
- Compares classification to regression using SVMs.

Combinations: Albrecht & Hwa (2007)
- Classification attempts to learn decision boundaries; regression attempts to learn a continuous function.
- MT evaluation metrics are continuous, and there is no clear boundary between "good" and "bad" translations.
- Instead of classifying hypotheses as human or machine (Human-Likeness Classifier), try to learn the function underlying human judgments: score each hypothesis on a rating scale.
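A minimal sketch of the regression formulation, assuming scikit-learn; the random features and ratings below simply stand in for real feature vectors and human scores:

import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))      # hypothetical metric/feature values per hypothesis
y = rng.uniform(1, 5, size=300)     # hypothetical human ratings on a 1-5 scale

model = SVR(kernel="rbf").fit(X[:250], y[:250])   # learn a continuous scoring function
pred = model.predict(X[250:])
rho, _ = spearmanr(pred, y[250:])                 # evaluate by rank correlation
print(f"Spearman correlation with held-out human scores: {rho:.3f}")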

Combinations: Albrecht & Hwa (2007)
Features:
- Syntax-based comparisons against the references (HWCM, STM).
- String-based metrics computed over a large English corpus.
- Syntax-based metrics computed over a dependency treebank.

Combinations: Albrecht & Hwa (2007)
- Data: LDC Multiple Translation Chinese Part 4.
- Used Spearman correlation instead of Pearson.
- Classification accuracy is positively related to correlation, but it is possible to improve classification accuracy without improving correlation; human-likeness classification seems inconsistent.

Combinations: Albrecht & Hwa (2007)
- Regression can be trained with reasonably sized sets of training instances.
- Regression generalizes across data sets.
- Results showed the highest overall correlation among the metrics compared.

Combinations: Albrecht & Hwa (2007)

Outline: Background, Machine Learning, Linguistic Information, Combinations, Conclusions

Conclusions
- Evaluating the performance of MT evaluation metrics still leaves plenty of room for improvement.
- Given that humans do not agree well on MT quality, correlation with human judgments is inherently limited.

Conclusions
Machine learning:
- We are only scratching the surface of the possibilities.
- Finding the right way to frame the problem is not straightforward.
- Learning the function by which humans assess translations performs better than attempting to classify a translation as human or machine.

Conclusions
Linguistic information:
- Intuitively, this should be helpful.
- METEOR performs very well with limited linguistic information (synonymy).
- Automatic parsers and other NLP tools are noisy, so they may compound the problem.

Conclusions
Linguistic information and machine learning:
- Combining the two leads to good results (Albrecht & Hwa, 2007).

Conclusions
New directions:
- Machine learning with richer linguistic information (labeled dependencies, paraphrases).
- Are other machine learning algorithms better suited than SVMs?
- Are there better ways of framing the evaluation question?
- How well can these approaches be extended to task-specific evaluation?

Questions?

References
Joshua S. Albrecht and Rebecca Hwa. 2007. A re-examination of machine learning approaches for sentence-level MT evaluation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-2007).
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, June.
John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2003. Confidence estimation for machine translation. Technical report, Natural Language Engineering Workshop Final Report, Johns Hopkins University.
Alex Kulesza and Stuart M. Shieber. 2004. A learning approach to improving sentence-level MT evaluation. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), Baltimore, MD, October.

References
Ding Liu and Daniel Gildea. 2005. Syntactic features for evaluation of machine translation. In ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, June.
Karolina Owczarzak, Josef van Genabith, and Andy Way. 2007. Labelled dependencies in machine translation evaluation. In Proceedings of the ACL 2007 Workshop on Statistical Machine Translation, Prague, Czech Republic.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA.
Michael Pozar and Eugene Charniak. 2006. Bllip: An improved evaluation metric for machine translation. Master's thesis, Brown University.