Dependency-Based Automatic Evaluation for Machine Translation Karolina Owczarzak, Josef van Genabith, Andy Way National Centre for Language Technology School of Computing Dublin City University

Overview
– Automatic evaluation for Machine Translation (MT): BLEU, NIST, GTM, METEOR, TER
– Lexical-Functional Grammar (LFG) in language processing: parsing to simple logical forms
– LFG in MT evaluation:
   – assessing the level of parser noise: the adjunct attachment experiment
   – checking for bias: the Europarl experiment
   – correlation with human judgement: the MultiTrans experiment
– Future work

Automatic MT evaluation
Automatic MT metrics: a fast and cheap way to evaluate your MT system
Basic and most popular: BLEU, NIST

John resigned yesterday   vs.   Yesterday, John quit
– 1-grams: 2/3 (john, yesterday)
– 2-grams: 0/2
– 3-grams: 0/1
– Total = 2/6 matching n-grams = 0.33

String comparison is not sensitive to legitimate syntactic and lexical variation
Needs large test sets and/or multiple references
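A minimal sketch of the overlap computed above: pooled n-gram precision over orders 1 to 3, without BLEU's geometric averaging or brevity penalty. The function names are illustrative, and punctuation is assumed to be stripped beforehand.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(translation, reference, max_n=3):
    """Clipped n-gram matches pooled over orders 1..max_n, as in the 2/6 example."""
    trans, ref = translation.lower().split(), reference.lower().split()
    matches = total = 0
    for n in range(1, max_n + 1):
        t_counts, r_counts = Counter(ngrams(trans, n)), Counter(ngrams(ref, n))
        matches += sum(min(c, r_counts[g]) for g, c in t_counts.items())
        total += max(len(trans) - n + 1, 0)
    return matches / total if total else 0.0

print(ngram_overlap("Yesterday John quit", "John resigned yesterday"))  # 2/6 = 0.33
```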

Automatic MT evaluation
Other attempts to include more variation in evaluation:
– General Text Matcher (GTM): precision and recall on translation-reference pairs; weights contiguous matches more heavily
– Translation Error Rate (TER): edit distance for the translation-reference pair, counting insertions, deletions, substitutions and shifts (a simplified sketch follows below)
– METEOR: sum of n-gram matches over exact string forms, stemmed words, and WordNet synonyms
– Kauchak and Barzilay (2006): using WordNet synonyms with BLEU
– Owczarzak et al. (2006): using paraphrases derived from the test set through word/phrase alignment with BLEU and NIST
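To make the TER idea concrete, here is a word-level edit distance normalized by reference length. This is a simplification: TER's block-shift operation is deliberately left out.

```python
def ter_no_shifts(translation, reference):
    """Word-level edit distance (ins/del/sub) divided by reference length.
    A simplification of TER: the shift operation is omitted."""
    t, r = translation.lower().split(), reference.lower().split()
    # dp[i][j] = edits to turn the first i translation words into the first j reference words
    dp = [[0] * (len(r) + 1) for _ in range(len(t) + 1)]
    for i in range(len(t) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(t) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if t[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(t)][len(r)] / max(len(r), 1)

# 3 edits / 3 words = 1.0; real TER would move "yesterday" with one cheap shift instead.
print(ter_no_shifts("Yesterday John quit", "John resigned yesterday"))
```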

Lexical-Functional Grammar
Sentence structure representation:
– c-structure (constituent): CFG trees; reflects surface word order and structural hierarchy
– f-structure (functional): abstract grammatical (syntactic) relations

John resigned yesterday   vs.   Yesterday, John resigned
[c-structure trees and f-structure attribute-value matrices for both word orders; the two f-structures are identical]

triples:
SUBJ(resign, john)
PERS(john, 3)
NUM(john, sg)
TENSE(resign, past)
ADJ(resign, yesterday)
PERS(yesterday, 3)
NUM(yesterday, sg)

triples – preds only:
SUBJ(resign, john)
ADJ(resign, yesterday)
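As data, a triple is just a (relation, head, dependent) tuple. A sketch of the preds-only filter follows; the set of predicate-level relation labels is an illustrative assumption, not the slide's full inventory.

```python
# Dependency triples as (relation, head, dependent) tuples.
FULL = {("SUBJ", "resign", "john"), ("PERS", "john", "3"),
        ("NUM", "john", "sg"), ("TENSE", "resign", "past"),
        ("ADJ", "resign", "yesterday"), ("PERS", "yesterday", "3"),
        ("NUM", "yesterday", "sg")}

# Relations linking two predicates, as opposed to atomic features (illustrative subset).
PRED_RELATIONS = {"SUBJ", "OBJ", "ADJ", "COMP", "XCOMP"}

def preds_only(triples):
    """Keep only predicate-predicate dependencies (the 'preds only' set above)."""
    return {t for t in triples if t[0] in PRED_RELATIONS}

print(preds_only(FULL))  # -> SUBJ(resign, john) and ADJ(resign, yesterday)
```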

LFG Parser
Cahill et al. (2004): LFG parser based on the Penn-II Treebank (online demo available)
Automatic annotation of Charniak's/Bikel's output parses with attribute-value equations, resolving to f-structures
Evaluation of parser quality: the dependencies produced by the parser are compared against the set of dependencies in a human annotation of the same text, measuring precision and recall
Our LFG parser reaches high precision and recall scores

LFG in MT evaluation
– Parse the translation and the reference into LFG f-structures, rendered as dependency triples
– Compare translation and reference text on the structural (dependency) level
– Calculate precision and recall on the translation and reference dependency sets (see the sketch below)
– Since both outputs are produced automatically: how much noise does the parser introduce?

John resigned yesterday:
SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
vs.
Yesterday, John resigned:
SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
= 100%
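A minimal sketch of the scoring step, assuming plain set matching over triples (the actual metric's handling of duplicate triples may differ):

```python
def dep_score(trans_triples, ref_triples):
    """Precision, recall and F1 over dependency-triple sets."""
    matches = len(trans_triples & ref_triples)
    precision = matches / len(trans_triples) if trans_triples else 0.0
    recall = matches / len(ref_triples) if ref_triples else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Both word orders parse to the same f-structure, so the score is 100%.
triples = {("SUBJ", "resign", "john"), ("ADJ", "resign", "yesterday"),
           ("TENSE", "resign", "past"), ("NUM", "john", "sg")}
print(dep_score(triples, triples))  # (1.0, 1.0, 1.0)
```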

The adjunct attachment experiment
– 100 English Europarl sentences containing adjuncts or coordinated structures
– Hand-modified to change the placement of the adjunct or the order of the coordinated elements, with no change in meaning or grammaticality:
   Schengen, on the other hand, is not organic.   <- original "reference"
   On the other hand, Schengen is not organic.   <- modified "translation"
– The change is limited to c-structure; the f-structure does not change
– A perfect parser should therefore produce identical dependency sets for both versions

The adjunct attachment experiment - results baselinemodified TER METEOR BLEU NIST (96.94%) GTM dep dep_preds Parser: baselinemodified TER METEOR BLEU NIST (96.18%) GTM dep100 dep_preds100 Ideal parser:

The Europarl experiment
N-gram-based metrics (BLEU, NIST) favour n-gram-based translation (statistical MT)
Owczarzak et al. (2006):
– BLEU: Pharaoh > Logomedia (0.0349)
– NIST: Pharaoh > Logomedia (0.6219)
– Human: Pharaoh < Logomedia (0.19)

4000 sentences from Spanish-English Europarl
Two translations:
– Logomedia (rule-based)
– Pharaoh (phrase-based statistical MT)
Evaluated with BLEU, NIST, GTM, TER, METEOR (± WordNet), and the dependency-based method (basic, predicate-only, ± WordNet, ± bitext-generated paraphrases)
WordNet & paraphrases are used to create a new best-matching reference for the translation, which is then evaluated with the dependency-based method (sketched below)
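A toy sketch of the best-matching-reference idea: swap a reference word for a WordNet synonym whenever that synonym actually occurs in the translation. It assumes NLTK with the WordNet data installed; the real system operates on dependency triples rather than raw token strings.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def synonyms(word):
    """All WordNet lemma names for a word, lowercased."""
    return {lemma.name().lower() for synset in wn.synsets(word)
            for lemma in synset.lemmas()}

def best_matching_reference(translation, reference):
    """Replace reference tokens with synonyms found in the translation."""
    trans_tokens = set(translation.lower().split())
    out = []
    for token in reference.lower().split():
        if token in trans_tokens:
            out.append(token)
        else:
            # pick a synonym that the translation actually contains, if any
            hits = synonyms(token) & trans_tokens
            out.append(hits.pop() if hits else token)
    return " ".join(out)

print(best_matching_reference("yesterday john quit", "john resigned yesterday"))
# -> "john quit yesterday"  (resign/quit share a WordNet synset)
```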

The Europarl experiment - results metric PH score – LM score TER BLEU 7.16% NIST 6.58% dep 4.93% dep+paraphr 4.80% GTM 3.89% METEOR 3.80% dep_preds 3.79% dep+paraphr_preds 3.70% dep+WordNet 3.55% dep+WordNet_preds 2.60% METEOR+WordNet 1.56% Europarl 4000 Logomedia vs Pharaoh: Sets of increasing similarity:

The MultiTrans experiment
Correlation of the dependency-based method with human evaluation, compared with the correlations of BLEU, NIST, GTM, METEOR, TER
Linguistic Data Consortium Multiple Translation Chinese, Parts 2 and 4:
– multiple translations of Chinese newswire text
– four human-produced references
– segment-level human scores for a subset of the translations
– total: 16,800 translation-reference-human-score segments

Pearson's correlation coefficient (sketched below):
-1 = negative correlation
 0 = no correlation
 1 = positive correlation
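For reference, Pearson's r over paired metric and human segment scores; a from-scratch sketch with toy data (scipy.stats.pearsonr computes the same quantity):

```python
import math

def pearson(xs, ys):
    """Pearson's r between metric scores xs and human scores ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: three segments, metric scores vs. human scores.
print(pearson([0.2, 0.5, 0.9], [1.0, 3.0, 4.5]))  # close to 1: positive correlation
```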

The MultiTrans experiment – results

[chart: correlation with human judgement of translation quality, per metric]

– The dependency-based method is sensitive to the grammatical structure of the sentence: a more grammatical translation is a more fluent translation
– A different position for a word means a different local (and global) structure: the word then appears in dependency triples that do not match the reference

Future work
– Use n-best parses to reduce parser noise and increase the number of matches
– Generate a paraphrase set through word alignment from a large bitext (Europarl) and use it instead of WordNet
– Create weights for the individual dependency scores that contribute to the segment-level score, trained to maximize correlation with human judgement (see the sketch below)
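One shape the per-dependency weighting might take: a weight per relation label folded into the match mass. The weights below are made-up placeholders; the slide proposes learning them against human judgements.

```python
# Hypothetical relation weights; the proposal is to train these, not hand-set them.
WEIGHTS = {"SUBJ": 1.0, "OBJ": 1.0, "ADJ": 0.6, "TENSE": 0.3, "NUM": 0.2}

def weighted_dep_f1(trans_triples, ref_triples, weights=WEIGHTS):
    """F1 over triples where each triple counts by the weight of its relation."""
    def mass(triples):
        return sum(weights.get(rel, 0.5) for rel, _, _ in triples)
    matched = mass(trans_triples & ref_triples)
    p = matched / mass(trans_triples) if trans_triples else 0.0
    r = matched / mass(ref_triples) if ref_triples else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

t = {("SUBJ", "resign", "john"), ("ADJ", "resign", "today")}
r = {("SUBJ", "resign", "john"), ("ADJ", "resign", "yesterday")}
print(weighted_dep_f1(t, r))  # 0.625: the SUBJ match outweighs the ADJ mismatch
```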

Conclusions
– A new automatic method for evaluating MT output
– LFG dependency triples: a simple logical form
– Evaluation on the structural level, not on the surface string
– Allows legitimate syntactic variation
– Allows legitimate lexical variation when used with WordNet or paraphrases
– Correlates more highly with human evaluation of fluency than other metrics

References
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.
Aoife Cahill, Michael Burke, Ruth O'Donovan, Josef van Genabith, and Andy Way. 2004. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. Proceedings of ACL 2004.
George Doddington. 2002. Automatic Evaluation of MT Quality using N-gram Co-occurrence Statistics. Proceedings of HLT 2002.
David Kauchak and Regina Barzilay. 2006. Paraphrasing for Automatic Evaluation. Proceedings of HLT-NAACL 2006.
Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. Proceedings of HLT-NAACL 2003.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of MT Summit 2005.
Philipp Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. Proceedings of the AMTA 2004 Workshop on Machine Translation: From real users to research.
Karolina Owczarzak, Declan Groves, Josef van Genabith, and Andy Way. 2006. Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation. Proceedings of the HLT-NAACL 2006 Workshop on Statistical Machine Translation.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. Proceedings of ACL 2002.
Matthew Snover, Bonnie Dorr, Richard Schwartz, John Makhoul, and Linnea Micciulla. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of AMTA 2006.
Joseph P. Turian, Luke Shen, and I. Dan Melamed. 2003. Evaluation of Machine Translation and Its Evaluation. Proceedings of MT Summit 2003.