Rationale for a multilingual corpus for machine translation evaluation Debbie Elliott Anthony Hartley Eric Atwell Corpus Linguistics 2003, Lancaster, England.

1 Rationale for a multilingual corpus for machine translation evaluation Debbie Elliott Anthony Hartley Eric Atwell Corpus Linguistics 2003, Lancaster, England

2 Outline
- Brief introduction to machine translation evaluation methods
- Corpus content for MT evaluation by end-users
- Why compile a new corpus?
- How large should our new corpus be?
- Which language pairs should be included?
- Which text types should be included?
- Conclusions

3 Machine translation evaluation methods (1)
Evaluation by developers
- Test suites are used to evaluate the translation of specific linguistic phenomena (e.g. before and after system modifications)
- Test suites contain short annotated test items with correct target translations
- They are used to test the handling of grammatical phenomena
- Vocabulary is limited
- Items are not rated in terms of frequency or relevance to a particular application
- Scoring is objective

4 Machine translation evaluation methods (2)
Evaluation by end-users
- Texts are translated by different MT systems (and often humans) for comparison
- A number of methods can be used to evaluate MT output …
- Fidelity (the preservation of original content) can be evaluated by comparing segments of MT output with segments from the source text (bilingual evaluators) or from expert human translations (monolingual evaluators). Each segment is given a score
- Fluency (the extent to which the translation reads like an original text) can be evaluated by scoring each target text sentence
- Texts can be selected to reflect user needs

5 Machine translation evaluation methods (3)
Evaluation by end-users
Scoring by human evaluators is subjective, so:
- Several evaluators are used and a mean score is calculated for each text
- Evaluators rate a number of texts translated by each system
Human evaluation is expensive, so:
- Recent research has involved the investigation of automated evaluation methods
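The averaging step described on this slide can be sketched as follows. This is a minimal illustration only: the data layout, the function names and all score values are assumptions made for the example, not material from the presentation.

```python
from statistics import mean

# Hypothetical ratings: scores[system][text_id] holds one score per
# evaluator. All values are invented for illustration.
scores = {
    "MT1": {"text1": [3, 4, 3], "text2": [2, 3, 3]},
    "HT": {"text1": [5, 5, 4], "text2": [5, 4, 5]},
}

def mean_text_scores(ratings):
    """Average the evaluators' scores for each text."""
    return {text: mean(vals) for text, vals in ratings.items()}

def system_score(ratings):
    """Average the per-text means to get a single score for a system."""
    return mean(list(mean_text_scores(ratings).values()))

per_system = {name: system_score(r) for name, r in scores.items()}
```

Averaging over evaluators first, and then over texts, keeps one unusually strict or lenient evaluator from dominating the score for any single text.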

6 Corpus content for MT evaluation by end-users
Essential:
- Source texts in one or more languages
- Machine translations of source texts by systems for evaluation
Not always essential:
- One or more expert human translations in selected target language(s) to be used as reference translations or for inclusion in evaluation with MT output
- Available evaluation scores if corpus is to be used to validate new automated evaluation methods

7 Why compile a new corpus for MT evaluation? (1)
Existing corpora have limitations: many projects have involved the use of small numbers of texts in only one language pair
- Carroll (Pierce 1966): 144 Russian sentences (scientific); 3 English human translations
- Nagao et al. (1985): 1,682 Japanese sentences (scientific); 0 human translations
- Shiwen (1993): 3,200 English sentences (random); 1 Chinese human translation
- IBM BLEU (Papineni et al. 2001): approx. 500 Chinese sentences (news stories); up to 4 English human translations

8 Why compile a new corpus for MT evaluation? (2)
Much research has made use of the DARPA 1994 corpus:
- Source texts: 100 French, 100 Spanish, 100 Japanese
- All newspaper articles of approx. 350 words/800 Japanese characters
- 2 English human translations of each source text
- 5 machine translations of each source text
- Scores for adequacy, fluency and informativeness for all 100 translations in each language pair by 5 MT systems and 1 human

9 Why compile a new corpus for MT evaluation? (3)
We need:
- a corpus that reflects user needs (not just newspaper articles)
- a larger number of language pairs with English as a source and target language
- sub-corpora (for each language pair) large enough to provide reliable evaluation results
- at least one human translation and several machine translations of each source text
- human evaluation results for selected attributes (e.g. fidelity and fluency) for the validation of new automated evaluation methods
- a corpus available to all for MT evaluation research

10 How large should our new corpus be? (1)
The corpus cannot be unnecessarily large:
- human MT evaluations are time-consuming and expensive
- expert human translations of each source text, if not already available, will be costly to produce
However:
- we need enough words to obtain reliable MT evaluation results

11 How large should our new corpus be? (2)
We carried out a statistical analysis of the DARPA 1994 scores for all 3 language pairs:
- We calculated the mean score for each attribute and the overall score for each system with varying numbers of texts (1 to 100)
Table layout (source language French, attribute: adequacy):
- Columns: MT1, MT2, MT3, MT4, MT5, HT
- Row "Text 1": score for text 1
- Row "Text 2": mean score for texts 1 & 2
- etc.
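The running-mean calculation described on this slide (row n holds the mean score over texts 1 to n) might be sketched like this. The function name and the example scores are hypothetical and are not DARPA data.

```python
from itertools import accumulate

def running_means(text_scores):
    """Return the mean of the first n scores for n = 1..len(text_scores)."""
    totals = accumulate(text_scores)  # cumulative sums: s1, s1+s2, ...
    return [total / n for n, total in enumerate(totals, start=1)]

# Invented adequacy scores for one system on four texts.
example = [0.8, 0.6, 0.7, 0.9]
means = running_means(example)
```

Computing one such series per system and per attribute gives the curves that can then be plotted against the number of texts.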

12 DARPA 1994 (French-English) Mean adequacy scores for varying numbers of texts

13 DARPA 1994 (French-English) Mean overall scores for varying numbers of texts

14 How large should our new corpus be? (3)
Results from statistical analysis:
- 10 texts (3,500 words), and often fewer, allow us to identify the highest (human) and lowest ranking system for individual attributes and overall scores
- 10 texts allow us to identify the highest-ranking MT system as well (but up to 30 texts required for informativeness)
- After approx. 30 texts (10,500 words) scores begin to remain consistent within a relatively small variance fluctuation
- After approx. 40 texts (14,000 words) we have a clearer picture of how all five MT systems compare and further sampling confirms this
Further research:
- the same statistical analysis will be performed using texts from our new corpus and our chosen metrics
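One simple way to ask "how many texts until the picture stops changing" is to find the smallest sample size from which the system ranking matches the ranking on the full sample. The sketch below uses that criterion as an assumption; it is not necessarily the analysis the authors performed, and the function names and toy scores are invented. The figure of 350 words per text is taken from the slide's own numbers (10 texts = 3,500 words).

```python
def ranking_after(scores_by_system, n):
    """Rank systems (best first) by mean score over the first n texts."""
    means = {s: sum(v[:n]) / n for s, v in scores_by_system.items()}
    return sorted(means, key=means.get, reverse=True)

def texts_until_stable(scores_by_system):
    """Smallest n from which the ranking always matches the final ranking."""
    total = min(len(v) for v in scores_by_system.values())
    final = ranking_after(scores_by_system, total)
    stable_from = total
    # Walk backwards from the full sample until the ranking first differs.
    for n in range(total, 0, -1):
        if ranking_after(scores_by_system, n) != final:
            break
        stable_from = n
    return stable_from

WORDS_PER_TEXT = 350  # implied by the slide: 10 texts = 3,500 words

# Invented per-text scores for three systems (not DARPA data).
toy = {
    "HT": [0.9, 0.9, 0.8, 0.9],
    "MT1": [0.5, 0.7, 0.6, 0.6],
    "MT2": [0.6, 0.4, 0.5, 0.5],
}
n = texts_until_stable(toy)
words_needed = n * WORDS_PER_TEXT
```

In the toy data MT2 is ahead of MT1 after one text, but from two texts onward the ranking matches the full-sample ranking. Ranking stability is a coarser test than the variance of the scores themselves, so it should be read as a lower bound on the sample size.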

15 Which language pairs should be included? (1)
- A variety of language pairs allows for the testing of portability of new evaluation methods
- The availability of MT systems for evaluation will influence our choice
- Our survey of MT users (ongoing since January 2003) is also providing guidelines …

16 Language pairs translated by MT users (English as source language)

17 Language pairs translated by MT users (English as target language)

18 Which language pairs should be included? (2)
- Phase One: French, German, Spanish and Italian plus texts in typologically different languages (Chinese, Japanese) translated into English
- Phase Two: Consider additional source languages (e.g. Portuguese and Russian into English)
- Phase Three: English translated into other languages

19 Which text types should be included?
- MT systems are used in translation companies and international organisations to translate a number of different text types and topics
- These text types and a variety of topics must be represented in our corpus
- Our survey of MT users is providing guidelines on the kinds of texts and topics most frequently translated using MT systems …

20 Text types machine translated by companies

21 Text types machine translated by single users

22 Conclusions
- We aim to provide a minimum of 14,000 words per language pair (further research to be conducted)
- Text types will be based on responses to our survey, reflecting real MT use
- The text types for each language pair will be the same, to give balance
- Our corpus will be dynamic: updated to reflect changing trends in the MT user market
- The key feature will be detailed scores from human evaluations, available for research (particularly in automated MT evaluation)
- We plan to make our corpus and human evaluation results available online in 2004

23 Thank you
We welcome your questions

