
Contextual Bitext-derived Paraphrases in Automatic MT Evaluation HLT-NAACL, 09 June 2006 Karolina Owczarzak, Declan Groves, Josef Van Genabith, Andy Way.


1 Contextual Bitext-derived Paraphrases in Automatic MT Evaluation HLT-NAACL, 09 June 2006 Karolina Owczarzak, Declan Groves, Josef Van Genabith, Andy Way National Centre for Language Technology, Dublin City University

2 Overview
- Automatic MT evaluation and its limitations
- Generation of paraphrases from word alignments
- Using paraphrases in evaluation
- Correlation with human judgments
- Paraphrase quality

3 Automatic evaluation of MT quality
Most popular metrics: BLEU and NIST.
[Figure: a candidate translation T is compared against reference translations R1, R2, R3 to compute BLEU.]

4 Automatic evaluation of MT quality
Most popular metrics: BLEU and NIST.
[Figure: candidate T scored against multiple references:
T: but we have no answer to it
R1: but we don't have an answer
R2: however we cannot react
R3: however we have no reply to that]

5 Automatic evaluation of MT quality
Most popular metrics: BLEU and NIST.
Candidate: but we have no answer to it
References: but we don't have an answer / however we cannot react / however we have no reply to that
Matching n-grams: 7-grams: 0/1, 6-grams: 0/2, 5-grams: 0/3, 4-grams: 0/4, 3-grams: 1/5, 2-grams: 3/6, 1-grams: 6/7 → BLEU = 0, unless smoothed
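The slide's n-gram counts can be reproduced with a small sketch (function name and whitespace tokenization are my own; full BLEU additionally takes a geometric mean of the precisions and applies a brevity penalty):

```python
from collections import Counter

def ngram_precisions(candidate, references, max_n=7):
    """Modified n-gram precision: each candidate n-gram count is clipped
    by the maximum count of that n-gram in any single reference."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = {}
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        if not cand_ngrams:
            break
        matched = 0
        for ng, count in cand_ngrams.items():
            max_ref = max(
                sum(1 for i in range(len(r) - n + 1) if tuple(r[i:i + n]) == ng)
                for r in refs)
            matched += min(count, max_ref)
        precisions[n] = (matched, sum(cand_ngrams.values()))
    return precisions

cand = "but we have no answer to it"
refs = ["but we don't have an answer",
        "however we cannot react",
        "however we have no reply to that"]
print(ngram_precisions(cand, refs))
# 1-grams: (6, 7), 2-grams: (3, 6), 3-grams: (1, 5), 4-grams and up: 0 matches
```

Because the 4- through 7-gram precisions are zero, the geometric mean (and hence plain BLEU) collapses to 0 for this sentence, which is why smoothing is needed at the segment level.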

6 Automatic evaluation of MT quality
Insensitive to admissible lexical differences: answer ≠ reply
Insensitive to admissible syntactic differences: yesterday it was raining ≠ it was raining yesterday; we don't have ≠ we have no

7 Automatic evaluation of MT quality
Attempts to come up with better metrics:
- word order: Translation Error Rate (Snover et al. 2005), Maximum Matching String (Turian et al. 2003)
- lexical and word-order issues: CDER (Leusch et al. 2006), METEOR (Banerjee and Lavie 2005), linear regression model (Russo-Lassner et al. 2005)
These need POS taggers, stemmers, thesauri, WordNet.

8 Word and phrase alignment in Statistical Machine Translation
[Figure: source-language text aligned with target-language text: le – the; ce jour – that day; je suis – I am; pays – country; agréable – nice / pleasant / good.]
agréable → {nice, pleasant, good}
nous n'avons pas → {we don't have, we have no}

9 Generating paraphrases
For each word/phrase e_i, find all words/phrases f_i1, ..., f_in that e_i aligns with; then for each f_i, find all words/phrases e_k (k ≠ i) that f_i aligns with (Bannard and Callison-Burch 2005).
Example: nice aligns with agréable (0.5), bon (0.25) and bonne (0.25); in turn, agréable aligns with pleasant (0.75) and agreeable (0.25), bon with good (0.8) and great (0.2), and bonne with good (0.99). Multiplying along each pivot and summing over pivots:
nice = {good (0.4475), pleasant (0.375), agreeable (0.125), great (0.05)}
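The pivot computation can be sketched as follows, using the toy probability tables from the slide's example (function and variable names are illustrative, not from the paper):

```python
def pivot_paraphrases(e, p_f_given_e, p_e_given_f):
    """Paraphrase probability by marginalising over pivot phrases f:
    p(e2 | e1) = sum_f p(f | e1) * p(e2 | f)
    (the pivot method of Bannard and Callison-Burch 2005)."""
    scores = {}
    for f, p_f in p_f_given_e[e].items():
        for e2, p_e2 in p_e_given_f[f].items():
            if e2 != e:  # a phrase is not its own paraphrase
                scores[e2] = scores.get(e2, 0.0) + p_f * p_e2
    # highest-probability paraphrase first
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

# Toy alignment probabilities from the slide: "nice" pivots through
# three French phrases.
p_f_given_e = {"nice": {"agréable": 0.5, "bon": 0.25, "bonne": 0.25}}
p_e_given_f = {"agréable": {"pleasant": 0.75, "agreeable": 0.25},
               "bon": {"good": 0.8, "great": 0.2},
               "bonne": {"good": 0.99}}
print(pivot_paraphrases("nice", p_f_given_e, p_e_given_f))
# good ≈ 0.4475, pleasant 0.375, agreeable 0.125, great 0.05, as on the slide
```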

10 Paraphrases in automatic MT evaluation
[Figure: each word/phrase e_a, ..., e_z of the reference (en) is expanded into its paraphrase set e_a = {p_ea1, ..., p_ean}, ..., e_z = {p_ez1, ..., p_ezn}, derived from the source (fr); these sets are combined with the reference when scoring the translation (en).]

11 Paraphrases in automatic MT evaluation
[Figure: combining the reference (en) with the paraphrase sets e_a = {p_ea1, ..., p_ean}, ..., e_z = {p_ez1, ..., p_ezn} yields alternative references Ref 1 (en), ..., Ref 4 (en); for each segment, the best-matching reference is assembled word by word (w_0, w_1, ..., w_9).]
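A minimal sketch of the best-match idea, under simplifying assumptions (single-word paraphrases, greedy left-to-right substitution; the actual system works over segments and multi-word phrases, and all names here are hypothetical):

```python
def best_match_reference(reference, candidate, paraphrases):
    """Replace a reference word with one of its paraphrases when the
    paraphrase (but not the original word) occurs in the candidate,
    yielding a reference closer in surface form to the translation."""
    cand = set(candidate.split())
    out = []
    for w in reference.split():
        if w not in cand:
            alts = [p for p in paraphrases.get(w, []) if p in cand]
            if alts:
                w = alts[0]
        out.append(w)
    return " ".join(out)

# Toy paraphrase list; the pair issue - question mirrors slide 14's example.
paraphrases = {"issue": ["question", "matter"]}
ref = "the climate issue is a good example of this"
cand = "the question of climates with is a good example"
print(best_match_reference(ref, cand, paraphrases))
# the climate question is a good example of this
```

The substituted reference shares more n-grams with the candidate, so BLEU/NIST no longer penalize the admissible lexical choice "question" for "issue".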

12 Experiment 1
- Test set: 2000 sentences, French-English Europarl
- Two translations: Pharaoh (phrase-based SMT) and Logomedia (rule-based MT)
- Scored with BLEU and NIST against (a) the original reference and (b) the best-matching reference built with paraphrases derived from the test set
- Paraphrase lists generated using GIZA++ and the refined word alignment strategy (Och and Ney, 2003; Koehn et al., 2003; Tiedemann, 2004)
- Subset of 100 sentences from each translation scored by two human judges (accuracy, fluency)

13 Examples of paraphrases
area – field, this area, sector, aspect, this sector
above all – specifically, especially
agreement – accordance
believe that – believe, think that, feel that, think
extensive – widespread, broad, wide
make progress on – can move forward
risk management – management of risks
Not: {country}, {sphere, domain, orbit, field, arena}, {region}, {expanse, surface area}
(Some pairs, e.g. risk management – management of risks, capture syntactic variation.)

14 Examples of reference segments
Example 1
Candidate translation: the question of climates with is a good example
Original reference: the climate issue is a good example of this
Best-match reference: the climate question is a good example of this
Example 2
Candidate translation: thank you very much mr commissioner
Original reference: thank you commissioner
Best-match reference: thank you very much commissioner

15 Results
[Table: BLEU and NIST scores against the original reference vs. the best-match reference, for the translation by Pharaoh on EP (2000 sentences); the numeric scores did not survive in this transcript.]

16 Pearson's correlation with human judgment
[Table: correlation of human judgment (H) with BLEU, smoothed BLEU, and NIST, using the original vs. the best-matching reference, on a subset of 100 sentences from the Pharaoh translation (EP, 2000 sentences); the numeric values did not survive in this transcript.]
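Pearson's r itself is straightforward to compute per metric; a generic sketch with made-up per-sentence scores (the paper's actual values are not preserved in this transcript):

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for five sentences: human judgments vs. an automatic metric.
human = [3.0, 4.0, 2.0, 5.0, 3.5]
metric = [0.31, 0.42, 0.18, 0.55, 0.36]
print(round(pearson_r(human, metric), 3))
```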

17 Paraphrase quality
700,000 sentence pairs, French-English Europarl
Paraphrase lists generated using GIZA++ and the refined word alignment strategy (Och and Ney, 2003; Koehn et al., 2003; Tiedemann, 2004)
Quality of paraphrases evaluated with respect to syntactic and semantic accuracy (Bannard and Callison-Burch, 2005):
Best paraphrase: 48.9% syntactic accuracy, 64.5% semantic accuracy

18 Results
Syntactic accuracy:
Paraphrases derived from — Best / All
2000-sent. corpus: 70% / 52%
700,000-sent. corpus: 74% / 41%
Semantic accuracy:
Paraphrases derived from — Best / All
2000-sent. corpus: 79% / 73%
700,000-sent. corpus: 89% / 43%

19 Filtering paraphrases
Some inaccuracy is still useful: be supported – support, supporting
Filters:
- exclude closed-class items (prepositions, personal pronouns, possessive pronouns, auxiliary verbs have and be): e.g. Fr. à aligns with Eng. to, in, at, but to ≠ in ≠ at
- prevent paraphrases of the form e_i – (w) e_i (w), where w ∈ {prepositions, pronouns, auxiliary verbs, modal verbs, negation, conjunctions}: *aspect – aspect is, *hours – hours for, *available – not available
- alternative: POS taggers, parsers
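These two surface filters can be sketched as follows (the closed-class list here is a small illustrative subset, not the authors' actual list, and the predicate name is my own):

```python
# Illustrative subset of closed-class items: prepositions, pronouns,
# auxiliaries, negation, conjunctions.
CLOSED_CLASS = {"to", "in", "at", "of", "for", "on", "is", "are", "be",
                "have", "has", "not", "and", "or", "i", "you", "it",
                "my", "your", "can", "will"}

def keep_paraphrase(phrase, paraphrase):
    """Return False for paraphrase pairs the slide's filters would reject:
    (1) paraphrases consisting only of closed-class words (e.g. à -> to/in/at);
    (2) pairs of the form e_i - (w) e_i (w), i.e. the original phrase merely
        wrapped in function words (e.g. aspect -> aspect is)."""
    p, q = phrase.split(), paraphrase.split()
    if all(w in CLOSED_CLASS for w in q):          # filter (1)
        return False
    core = [w for w in q if w not in CLOSED_CLASS]
    if core == p and len(core) != len(q):          # filter (2)
        return False
    return True

print(keep_paraphrase("aspect", "aspect is"))      # False
print(keep_paraphrase("answer", "reply"))          # True
```

Deliberately loose pairs such as be supported – support still pass, matching the observation that some inaccuracy remains useful.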

20 References
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.
Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with Bilingual Parallel Corpora. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005).
Philipp Koehn, Franz Och and Daniel Marcu. 2003. Statistical Phrase-Based Translation. Proceedings of the Human Language Technology Conference (HLT-NAACL 2003).
Gregor Leusch, Nicola Ueffing and Hermann Ney. 2006. CDER: Efficient MT Evaluation Using Block Movements. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006).
Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19-51.
Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin, and Dragomir Radev. Syntax for Statistical Machine Translation. Technical report, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD.
Grazia Russo-Lassner, Jimmy Lin, and Philip Resnik. 2005. A Paraphrase-based Approach to Machine Translation Evaluation. Technical Report LAMP-TR-125/CS-TR-4754/UMIACS-TR, University of Maryland, College Park, MD.
Matthew Snover, Bonnie Dorr, Richard Schwartz, John Makhoul, Linnea Micciulla and Ralph Weischedel. 2005. A Study of Translation Error Rate with Targeted Human Annotation. Technical Report LAMP-TR-126/CS-TR-4755/UMIACS-TR, University of Maryland, College Park, MD.
Jörg Tiedemann. 2004. Word to Word Alignment Strategies. Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004).

