1 Dependency-Based Automatic Evaluation for Machine Translation
Karolina Owczarzak, Josef van Genabith, Andy Way
National Centre for Language Technology, School of Computing, Dublin City University

2 Overview
Automatic evaluation for Machine Translation (MT): BLEU, NIST, GTM, METEOR, TER
Lexical-Functional Grammar (LFG) in language processing: parsing to simple logical forms
LFG in MT evaluation:
– assessing the level of parser noise: the adjunct attachment experiment
– checking for bias: the Europarl experiment
– correlation with human judgement: the MultiTrans experiment
Future work

3 Automatic MT evaluation
Automatic MT metrics: a fast and cheap way to evaluate your MT system
Basic and most popular: BLEU, NIST
Example: John resigned yesterday vs. Yesterday, John quit
– 1-grams: 2/3 (john, yesterday)
– 2-grams: 0/2
– 3-grams: 0/1
– Total = 2/6 n-grams = 0.33
String comparison is not sensitive to legitimate syntactic and lexical variation
Needs large test sets and/or multiple references
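The sketch below (an illustration in Python, not the official BLEU/NIST implementation) reproduces the clipped n-gram counting behind the 2/6 figure above; the tokenization and the 1-to-3-gram range are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(translation, reference, max_n=3):
    # assumed tokenization: lowercase, whitespace split, commas ignored
    trans = translation.lower().replace(",", "").split()
    ref = reference.lower().replace(",", "").split()
    matched, total = 0, 0
    for n in range(1, max_n + 1):
        t_counts = Counter(ngrams(trans, n))
        r_counts = Counter(ngrams(ref, n))
        # clipped matches: a translation n-gram counts only as often
        # as it also appears in the reference
        matched += sum(min(c, r_counts[g]) for g, c in t_counts.items())
        total += max(len(trans) - n + 1, 0)
    return matched / total if total else 0.0

# reproduces the slide: 1-grams 2/3, 2-grams 0/2, 3-grams 0/1 -> 2/6 = 0.33
print(ngram_overlap("Yesterday, John quit", "John resigned yesterday"))
```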

4 Automatic MT evaluation
Other attempts to include more variation in evaluation:
– General Text Matcher (GTM): precision and recall on translation-reference pairs, weights contiguous matches more heavily
– Translation Error Rate (TER): edit distance for the translation-reference pair; number of insertions, deletions, substitutions and shifts
– METEOR: sum of n-gram matches over exact string forms, stemmed words, and WordNet synonyms
– Kauchak and Barzilay (2006): using WordNet synonyms with BLEU
– Owczarzak et al. (2006): using paraphrases derived from the test set through word/phrase alignment with BLEU and NIST

5 Lexical-Functional Grammar
Sentence structure representation:
– c-structure (constituent): CFG trees, reflects surface word order and structural hierarchy
– f-structure (functional): abstract grammatical (syntactic) relations
Example: John resigned yesterday vs. Yesterday, John resigned
[c-structure trees and f-structure attribute-value matrices shown on the slide]
Triples: SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
Triples, preds only: SUBJ(resign, john), ADJ(resign, yesterday)
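To make the triple representation concrete, here is a minimal Python sketch that encodes the f-structure above as a set of (relation, head, dependent) tuples and filters it down to the preds-only subset; treating PERS, NUM and TENSE as the atomic feature labels is an assumption made for this example, not taken from the parser's documentation.

```python
# The f-structure from this slide rendered as dependency triples
# (relation, head, dependent); in the evaluation method these come
# from the LFG parser rather than being written by hand.
full_triples = {
    ("SUBJ", "resign", "john"),
    ("PERS", "john", "3"),
    ("NUM", "john", "sg"),
    ("TENSE", "resign", "past"),
    ("ADJ", "resign", "yesterday"),
    ("PERS", "yesterday", "3"),
    ("NUM", "yesterday", "sg"),
}

# "Preds only" keeps grammatical relations between predicates and drops
# atomic features; which labels count as atomic is assumed here.
ATOMIC_FEATURES = {"PERS", "NUM", "TENSE"}
preds_only = {t for t in full_triples if t[0] not in ATOMIC_FEATURES}

print(sorted(preds_only))
# [('ADJ', 'resign', 'yesterday'), ('SUBJ', 'resign', 'john')]
```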

6 LFG Parser
Cahill et al. (2004): LFG parser based on the Penn-II Treebank (demo at http://lfg-demo.computing.dcu.ie/lfgparser.html)
Automatic annotation of Charniak's/Bikel's output parse with attribute-value equations, resolving to f-structures
Evaluation of parser quality: comparison of the dependencies produced by the parser with the set of dependencies in a human annotation of the same text, using precision and recall
Our LFG parser reaches high precision and recall scores

7 LFG in MT evaluation
Parse translation and reference into LFG f-structures rendered as dependency triples
Comparison of translation and reference text on the structural (dependency) level
Calculate precision and recall on the translation and reference dependency sets
Comparison of two automatically produced outputs: how much noise does the parser introduce?
Example: John resigned yesterday vs. Yesterday, John resigned
Both produce the same triples: SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
Identical dependency sets, so the match = 100%
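A minimal sketch of the set-based precision/recall comparison described on this slide; the actual method may weight or match triples differently, so treat this purely as an illustration.

```python
# Set-based precision/recall over dependency triples (an illustration,
# not the authors' exact scoring code).
def dependency_scores(translation_triples, reference_triples):
    matches = len(translation_triples & reference_triples)
    precision = matches / len(translation_triples) if translation_triples else 0.0
    recall = matches / len(reference_triples) if reference_triples else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# "John resigned yesterday" and "Yesterday, John resigned" parse to the
# same triples, so the score is 100% despite the word-order change.
shared = {("SUBJ", "resign", "john"), ("TENSE", "resign", "past"),
          ("ADJ", "resign", "yesterday")}
print(dependency_scores(shared, set(shared)))  # (1.0, 1.0, 1.0)
```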

8 The adjunct attachment experiment
100 English Europarl sentences containing adjuncts or coordinated structures
Hand-modified to change the placement of the adjunct or the order of coordinated elements, with no change in meaning or grammaticality
– Schengen, on the other hand, is not organic. <- original "reference"
– On the other hand, Schengen is not organic. <- modified "translation"
Change limited to c-structure; no change in f-structure
A perfect parser should produce identical sets of dependencies for both

9 The adjunct attachment experiment - results

Parser:
metric      baseline   modified
TER         0.0        6.417
METEOR      1.0        0.9970
BLEU        1.0        0.8725
NIST        11.5232    11.1704 (96.94%)
GTM         100        99.18
dep         100        96.56
dep_preds   100        94.13

Ideal parser:
metric      baseline   modified
TER         0.0        7.841
METEOR      1.0        0.9956
BLEU        1.0        0.8485
NIST        11.1690    10.7422 (96.18%)
GTM         100        99.35
dep         100        100
dep_preds   100        100

10 The Europarl experiment
N-gram-based metrics (BLEU, NIST) favour n-gram-based translation (statistical MT)
Owczarzak et al. (2006):
– BLEU: Pharaoh > Logomedia (0.0349)
– NIST: Pharaoh > Logomedia (0.6219)
– Human: Pharaoh < Logomedia (0.19)
4000 sentences from Spanish-English Europarl
Two translations: Logomedia and Pharaoh
Evaluated with BLEU, NIST, GTM, TER, METEOR (with and without WordNet), and the dependency-based method (basic, predicate-only, with and without WordNet, with and without bitext-generated paraphrases)
WordNet & paraphrases: used to create a new best-matching reference for the translation, which is then evaluated with the dependency-based method (a sketch of this idea follows below)
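The following rough sketch illustrates one way the "best-matching reference" idea could work: a reference word is replaced by a WordNet synonym only when that synonym actually appears in the translation. The NLTK-based lookup and the greedy substitution policy are assumptions for illustration, not the procedure used in the paper.

```python
# Rough sketch of building a "best-matching reference". Requires NLTK
# with the WordNet data installed (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def synonyms(word):
    # all lemma names from all synsets of the word, lowercased
    return {lemma.name().lower() for syn in wn.synsets(word)
            for lemma in syn.lemmas()}

def best_matching_reference(translation, reference):
    trans_tokens = set(translation.lower().split())
    new_ref = []
    for word in reference.lower().split():
        if word in trans_tokens:
            new_ref.append(word)
            continue
        # substitute a synonym only if it occurs in the translation
        overlap = synonyms(word) & trans_tokens
        new_ref.append(next(iter(overlap)) if overlap else word)
    return " ".join(new_ref)

print(best_matching_reference("yesterday john quit", "john resigned yesterday"))
# -> "john quit yesterday" (if WordNet lists "quit" as a synonym of "resigned")
```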

11 The Europarl experiment - results

Europarl 4000, Logomedia vs. Pharaoh (sets of increasing similarity):

metric                  PH score – LM score
TER                     1.997
BLEU                    7.16%
NIST                    6.58%
dep                     4.93%
dep+paraphr             4.80%
GTM                     3.89%
METEOR                  3.80%
dep_preds               3.79%
dep+paraphr_preds       3.70%
dep+WordNet             3.55%
dep+WordNet_preds       2.60%
METEOR+WordNet          1.56%

12 The MultiTrans experiment
Correlation of the dependency-based method with human evaluation
Comparison with the correlation of BLEU, NIST, GTM, METEOR, TER
Linguistic Data Consortium Multiple Translation Chinese, Parts 2 and 4:
– multiple translations of Chinese newswire text
– four human-produced references
– segment-level human scores for a subset of the translations
– total: 16,800 translation-reference-human score segments
Pearson's correlation coefficient (computed as in the sketch below): -1 = negative correlation, 0 = no correlation, 1 = positive correlation
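For concreteness, here is a small Python sketch of the segment-level Pearson correlation used in this experiment; the scores below are invented placeholders, not figures from the MultiTrans data.

```python
import math

def pearson(xs, ys):
    # plain Pearson correlation coefficient between two equal-length lists
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# invented placeholder scores for five segments, not values from the paper
metric_scores = [0.31, 0.52, 0.44, 0.70, 0.28]
human_scores = [2.0, 3.5, 3.0, 4.0, 2.5]
print(pearson(metric_scores, human_scores))  # close to +1: strong agreement
```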

13 The MultiTrans experiment - results
Correlation with human judgement of translation quality
The dependency-based method is sensitive to the grammatical structure of the sentence: a more grammatical translation = a more fluent translation
A different position of a word = a different local (and global) structure: the word appears in dependency triples that do not match the reference

14 Future work
Use n-best parses to reduce parser noise and increase the number of matches
Generate a paraphrase set through word alignment from a large bitext (Europarl), to use instead of WordNet
Create weights for the individual dependency scores that contribute to the segment-level score, trained to maximize correlation with human judgement

15 Conclusions
A new automatic method for evaluation of MT output
LFG dependency triples: a simple logical form
Evaluation at the structural level, not on surface string form
Allows legitimate syntactic variation
Allows legitimate lexical variation when used with WordNet or paraphrases
Correlates more highly than other metrics with human evaluation of fluency

16 References
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization: 65-73.
Aoife Cahill, Michael Burke, Ruth O'Donovan, Josef van Genabith, and Andy Way. 2004. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. Proceedings of ACL 2004: 320-327.
George Doddington. 2002. Automatic Evaluation of MT Quality using N-gram Co-occurrence Statistics. Proceedings of HLT 2002: 138-145.
David Kauchak and Regina Barzilay. 2006. Paraphrasing for Automatic Evaluation. Proceedings of HLT-NAACL 2006: 455-462.
Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. Proceedings of HLT-NAACL 2003: 48-54.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of MT Summit 2005: 79-86.
Philipp Koehn. 2004. Pharaoh: A Beam Search Decoder for Phrase-Based Statistical Machine Translation Models. Proceedings of the AMTA 2004 Workshop on Machine Translation: From Real Users to Research: 115-124.
Karolina Owczarzak, Declan Groves, Josef van Genabith, and Andy Way. 2006. Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation. Proceedings of the HLT-NAACL 2006 Workshop on Statistical Machine Translation: 86-93.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of ACL 2002: 311-318.
Matthew Snover, Bonnie Dorr, Richard Schwartz, John Makhoul, and Linnea Micciulla. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of AMTA 2006: 223-231.
Joseph P. Turian, Luke Shen, and I. Dan Melamed. 2003. Evaluation of Machine Translation and Its Evaluation. Proceedings of MT Summit 2003: 386-393.

