Human Evaluation of Machine Translation Systems
  MODL5003 Principles and applications of machine translation
  Lecture 13/03/2006
  Bogdan Babych
  (Slides: Debbie Elliott, Tony Hartley)
13 March 2006, MODL5003 Principles and applications of MT
Outline
  MT Evaluation – general perspective
  Purposes of MT evaluation
  Why evaluating translation quality is difficult
  A brief history of MT evaluation
  Examples of MT evaluation methods for users
  Where next?
MT Evaluation – a big space
  Requirements
    – Task: assimilation, dissemination, ...
    – Text: type, provenance, ...
    – User: translators, consumers, ...
  Quality attributes
    – Internal: architecture, resources, ...
    – External: readability, fidelity, well-formedness, ...
What is evaluated
  "It looks good to me" evaluation
  Test suites
    – Syntactic coverage, degradation
  Corpus-based evaluation
    – Real texts: don't have to read them all!
    – Coverage of typical problems
    – Interaction between different levels
    – General performance (bird's-eye view)
  Aspects more/less important for overall quality
Purposes of MT evaluation
  International Standards for Language Engineering (ISLE): http://www.mpi.nl/ISLE/
  Framework for the Evaluation of Machine Translation in ISLE (FEMTI): http://www.issco.unige.ch/projects/isle/femti/framed-glossary.html
  Defines 7 types of MT evaluation:
    1. Feasibility testing
    2. Requirements elicitation
    3. Internal evaluation
    4. Diagnostic evaluation
    5. Declarative evaluation
    6. Operational evaluation
    7. Usability evaluation
Purposes of MT evaluation: 1. Feasibility testing
  An evaluation of the possibility that a particular approach has any potential for success after further research and implementation (White 2000), e.g. sub-problems connected to a particular language pair
  Purpose: To decide whether to invest in further research into a particular approach
  For: Researchers, sponsors of research
Purposes of MT evaluation: 2. Requirements elicitation
  Researchers and developers create prototypes designed to demonstrate particular functional capabilities that might be implemented
  Purpose: To elicit reactions from potential investors before implementing new approaches
  For: Researchers and developers, project managers, end-users
Purposes of MT evaluation: 3. Internal evaluation
  Researchers and developers test components of a prototype or pre-release system. This can involve the use of test suites to evaluate output quality during the course of system modifications.
  Purpose:
    – To measure how well each component performs its function
    – To test linguistic coverage (that a new grammar rule works in all circumstances)
    – Iterative testing: to check that particular modifications do not have adverse effects elsewhere
  For: Researchers, developers, investors
Purposes of MT evaluation: 4. Diagnostic evaluation
  Researchers and developers of prototype systems evaluate functionality characteristics and analyse intermediate results produced by the system
  Purpose: To discover why a system did not give the expected results
  For: Researchers and developers
Purposes of MT evaluation: 5. Declarative evaluation
  Evaluators rate the quality of MT output
  Purpose:
    – To measure how well a system translates
    – To measure fidelity (how much of the source text content is correctly conveyed in the target text)
    – To measure the fluency of the target text
    – To measure the usability of MT output for a particular purpose
    – To evaluate a system's improvability (to what extent can a dictionary update improve output quality?)
    – To help decide which system to buy
    – To indicate whether buying a system will be cost-effective (will post-editing MT output be cheaper than translating from scratch?)
  For: End-users, researchers, developers, managers, investors, vendors
Purposes of MT evaluation: 6. Operational evaluation
  Managers calculate purchase and running costs and compare them with benefits
  Purpose: To determine the cost-benefit of an MT system in a particular operational environment, and whether a system will serve its required purpose
  For: Managers, investors, vendors
Purposes of MT evaluation: 7. Usability evaluation
  Evaluators test how easy the application is to use. Systems are evaluated using questionnaires on usability. Evaluators may record how long it takes to complete particular tasks.
  Purpose:
    – To measure how useful the product will be for the end-user in a specific context
    – To evaluate user-friendliness
  For: End-users, researchers, developers, managers, investors, vendors
Why evaluating translation quality is difficult
  No perfect standard exists for comparison
  Scoring is subjective (so several evaluators and texts are needed)
  The "training effect" can influence results
  Bilingual evaluators or human reference translations are usually required
  Different evaluations are needed depending on use of MT output (e.g. filtering, gisting, information gathering, post-editing for internal or external use)
A brief history of MT evaluation: the 1950s and 1960s
  1954: First public demonstration of MT (Georgetown University/IBM)
  Research in USA, Western Europe, Soviet Union and Japan
  1966: ALPAC Report (funded by US Government sponsors of MT to advise on further R&D)
    – advised against further investment in MT
    – concluded that MT was slower, less accurate and more expensive than human translation
    – recommended research into:
      - practical methods for evaluation of translations
      - evaluation of quality and cost of various sources of translations
      - evaluation of the relative speed and cost of various sorts of machine-aided translation
A brief history of MT evaluation: the 1970s and 1980s
  1976: EC bought a version of Systran and began to develop its own system, Eurotra, in 1978
  EC needed recommendations for evaluation: Van Slype report (Critical Methods for Evaluating the Quality of Machine Translation) published in 1979
  Aims of the report:
    - to establish the state of MT evaluation
    - to advise the EC on evaluation methodology and research
    - to provide examples of evaluation methods and their applications
  Available online: http://issco-www.unige.ch/projects/isle/van-slype.pdf
  1980s: Greater need for MT evaluation: MT attracting commercial interest, tailor-made systems designed for large corporations
  1987: First MT Summit (opportunity to publish research on evaluation)
A brief history of MT evaluation: the 1990s
  1992: AMTA workshop: MT Evaluation: Basis for Future Directions
    – JEIDA Report presented (Japan Electronic Industry Development Association): Methodology and Criteria on Machine Translation Evaluation. This stressed the importance of judging systems according to context of use and user requirements
  1993: Machine Translation journal devoted to MT evaluation
  1992-1994: DARPA (Defense Advanced Research Projects Agency) MT evaluations
  1993-1999: EAGLES (Expert Advisory Group on Language Engineering) set up by European Commission
    – One of its aims: To propose standards, guidelines and recommendations for good practice in the evaluation of language engineering products
    – The EAGLES 7-step recipe for evaluation: http://www.issco.unige.ch/projects/eagles/ewg99/7steps.html
The ISLE Project and FEMTI
  International Standards for Language Engineering
  Framework for the Evaluation of Machine Translation in ISLE
  ISLE Evaluation Working Group set up in response to EAGLES
  Funded by EC, National Science Foundation of the USA and Swiss Government
  Established a classification scheme of quality characteristics of MT systems and a set of measures to use when evaluating these characteristics
  Scheme designed to help developers, users and evaluators to select evaluation criteria according to their needs
  Workshops organised to involve hands-on evaluation exercises to test reliability of metrics
  Latest research involves the investigation of automated evaluation methods: quicker and cheaper
Evaluation methods: Carroll 1966
  Source: Carroll, J. B. (1966). An experiment in evaluating the quality of translations. In Pierce, J. (Chair). (1966). Language and Machines: Computers in Translation and Linguistics. Report by the Automatic Language Processing Advisory Committee (ALPAC). Publication 1416. National Academy of Sciences, National Research Council, pp. 67-75. http://www.nap.edu/books/ARC000005/html/
  Evaluation of scientific Russian texts translated into English
  3 human translations and 3 machine translations of 4 texts evaluated
  Evaluators: 18 monolingual English speakers and 18 native English speakers with good understanding of scientific Russian
  Intelligibility: Each sentence scored on a 9-point scale with no reference to source text
  Informativeness (fidelity): Original Russian sentences rated for informativeness compared with the translation
    – Monolinguals used human reference translations instead of source texts for comparison
Evaluation methods: Carroll 1966
  Extracts from 9-point intelligibility scale
    9. Perfectly clear and intelligible. Reads like ordinary text: has no stylistic infelicities
    5. The general idea is intelligible only after considerable study, but after this study one is fairly confident that he understands. Poor word choice, grotesque syntactic arrangement, untranslated words, and similar phenomena are present, but constitute mainly "noise" through which the main idea is still perceptible
    1. Hopelessly unintelligible. It appears that no amount of study and reflection would reveal the thought of the sentence.
Evaluation methods: Carroll 1966
  Rating original sentences: extracts from 10-point informativeness scale
    9. Extremely informative. Makes "all the difference in the world" in comprehending the meaning intended. (A rating of 9 should always be assigned when the original completely changes or reverses the meaning conveyed by the translation)
    4. In contrast to 3, adds a certain amount of information about the sentence structure and syntactical relationships; it may also correct minor misapprehensions about the general meaning of the sentence or the meaning of individual words
    0. The original contains, if anything, less information than the translation. The translator has added certain meanings, apparently to make the passage more understandable.
Evaluation methods: Crook & Bishop 1979
  Source: Crook & Bishop (reported by T C Halliday). Measurement of readability by the cloze test. In Van Slype, G. (1979). Critical Methods for Evaluating the Quality of Machine Translation. Prepared for the European Commission Directorate General Scientific and Technical Information and Information Management. Report BR 19142. Bureau Marcel van Dijk, p. 65.
  Evaluators rate the readability of translations using a cloze test
  Human and machine translations produced
  Every eighth word of machine translation omitted
  Evaluators fill in the gaps
  The more intelligible the MT output, the easier the test is to complete
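The cloze procedure on this slide can be sketched in a few lines of Python. This is only an illustration, not the procedure used by Crook & Bishop: the function names, the blank marker, whitespace tokenisation and exact-match scoring are all assumptions, since the slide does not say how gaps were marked or how answers were scored.

```python
def make_cloze(text, gap_every=8, blank="____"):
    """Blank out every `gap_every`-th word.

    Returns the gapped text and the list of removed words.
    Words are split on whitespace (an assumption), so punctuation
    stays attached to the word it follows.
    """
    gapped, answers = [], []
    for position, word in enumerate(text.split(), start=1):
        if position % gap_every == 0:
            answers.append(word)
            gapped.append(blank)
        else:
            gapped.append(word)
    return " ".join(gapped), answers


def score_cloze(answers, responses):
    """Fraction of gaps restored exactly, ignoring case (assumed scoring rule)."""
    if not answers:
        return 0.0
    correct = sum(a.lower() == r.lower() for a, r in zip(answers, responses))
    return correct / len(answers)


if __name__ == "__main__":
    mt_output = "the system translate the document in very short time and result was send to user"
    gapped_text, removed = make_cloze(mt_output)
    print(gapped_text)
    print(removed)
```

A higher score for one system's output than another's would then be read as evidence of higher readability, provided both are tested on the same texts with comparable evaluators.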
Evaluation methods: Sinaiko 1979
  Source: Sinaiko, H. W. Measurement of usefulness by performance test. In Van Slype, G. (1979). Critical Methods for Evaluating the Quality of Machine Translation. Prepared for the European Commission Directorate General Scientific and Technical Information and Information Management. Report BR 19142. Bureau Marcel van Dijk, p. 91.
  Aim: to evaluate the English-Vietnamese LOGOS system
  All source texts contained instructions
  Evaluators (native speakers of TL) use machine translated instructions to perform tasks
  Errors in performance were measured (weighting system used)
Evaluation methods: Nagao 1985
  Source: Nagao, M., Tsujii, J. & Nakamura, J. (1985). The Japanese government project for machine translation. In Computational Linguistics 11, 91-109.
  Aim: to test the feasibility of using MT to translate abstracts of scientific papers
  1,682 sentences from a Japanese scientific journal were machine translated into English
  Intelligibility: 2 native speakers of English (with no knowledge of Japanese) scored each sentence using a 5-point scale
  Accuracy: 4 Japanese-English translators evaluated how much of the meaning of the original text was conveyed in the MT output
Evaluation methods: Nagao 1985
  Extracts from 5-point intelligibility scale
    1. The meaning of the sentence is clear, and there are no questions. Grammar, word usage, and style are all appropriate, and no rewriting is needed.
    3. The basic thrust of the sentence is clear, but the evaluator is not sure of some detailed parts because of grammar and word usage problems. The problems cannot be resolved by any set procedure; the evaluator needs the assistance of a Japanese evaluator to clarify the meaning of those parts in the Japanese original.
    5. The sentence cannot be understood at all. No amount of effort will produce any meaning.
Evaluation methods: Nagao 1985
  Extracts from 7-point accuracy scale
    0. The content of the input sentence is faithfully conveyed to the output sentence. The translated sentence is clear to a native speaker and no rewriting is needed.
    3. While the content of the input sentence is generally conveyed faithfully to the output sentence, there are some problems with things like relationships between phrases and expressions, and with tense, voice, plurals, and the positions of adverbs. There is some duplication of nouns in the sentence.
    6. The content of the input sentence is not conveyed at all. The output is not a proper sentence; subjects and predicates are missing. In noun phrases, the main noun (the noun positioned last in the Japanese) is missing, or a clause or phrase acting as a verb and modifying a noun is missing.
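The scales shown so far run in different directions: on Carroll's intelligibility scale 9 is best and 1 is worst, while on Nagao's intelligibility scale 1 is best and 5 is worst, and on Nagao's accuracy scale 0 is best and 6 is worst. When scores from such scales are put side by side, one simple option is to rescale each to the range 0 to 1 with 1 always meaning "best". The sketch below assumes a linear rescaling, which treats the scale points as equally spaced; neither study is reported here as doing this, and the function name is hypothetical.

```python
def normalise(score, worst, best):
    """Linearly map `score` onto [0, 1], where 1.0 is the best scale point.

    `worst` and `best` are the end points of the rating scale; `best` may be
    the smaller number (as on Nagao's scales) or the larger (as on Carroll's).
    """
    if worst == best:
        raise ValueError("scale end points must differ")
    low, high = min(worst, best), max(worst, best)
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the scale {low}..{high}")
    return (score - worst) / (best - worst)


if __name__ == "__main__":
    print(normalise(5, worst=1, best=9))  # mid-point of Carroll's intelligibility scale
    print(normalise(3, worst=5, best=1))  # mid-point of Nagao's intelligibility scale
    print(normalise(0, worst=6, best=0))  # best point of Nagao's accuracy scale
```

Rescaling only aligns direction and range. It does not make the scale descriptors equivalent, so comparisons across studies should still be made with caution.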
Evaluation methods: DARPA 1992-4
  Adequacy, fluency and informativeness
  Sources:
    – White, J., O'Connell, T., O'Mara, F.: The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In: Proceedings of the 1994 Conference, Association for Machine Translation in the Americas, Columbia, Maryland (1994)
    – White, J. (Forthcoming). How to evaluate Machine Translation. In H. Somers (ed.) Machine translation: a handbook for translators. Benjamins, Amsterdam.
  Aim: to compare prototype systems funded by DARPA
  Evaluators: 100 monolingual native English speakers
  Largest evaluation resulted in:
    – corpus of 100 news articles (of c. 400 words) in each SL: French, Spanish and Japanese
    – 2 English human translations of each
    – English machine translations of each text by several systems
    – Detailed evaluation results
Evaluation methods: DARPA 1992-4
  DARPA: Adequacy
  Segments of MT output were compared with equivalent human reference translations and scored on a 5-point scale according to how much of the original content was preserved (regardless of imperfect English)
    5 – All meaning expressed in the source fragment appears in the translation fragment
    4 – Most of the source fragment meaning is expressed in the translation fragment
    3 – Much of the source fragment meaning is expressed in the translation fragment
    2 – Little of the source fragment meaning is expressed in the translation fragment
    1 – None of the meaning expressed in the source fragment is expressed in the translation fragment
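Segment-level adequacy ratings are usually summarised per system before systems are compared. The sketch below averages 1 to 5 ratings per system. The slide does not state how DARPA aggregated its scores, so the plain arithmetic mean, the function name and the tuple layout are assumptions, and the ratings in the usage example are invented purely to show the call.

```python
from statistics import mean


def mean_adequacy(judgements):
    """Average adequacy per system.

    `judgements` is an iterable of (system, segment_id, score) tuples,
    where `score` is an integer from 1 (none of the meaning) to 5 (all of it).
    """
    scores_by_system = {}
    for system, _segment_id, score in judgements:
        if not 1 <= score <= 5:
            raise ValueError(f"adequacy score must be 1..5, got {score}")
        scores_by_system.setdefault(system, []).append(score)
    return {system: mean(scores) for system, scores in scores_by_system.items()}


if __name__ == "__main__":
    # Invented ratings for two hypothetical systems.
    ratings = [
        ("system_a", 1, 5), ("system_a", 2, 3),
        ("system_b", 1, 2), ("system_b", 2, 2),
    ]
    print(mean_adequacy(ratings))
```

Because the scoring is subjective, a mean over several evaluators and many segments is more informative than any single rating.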
Evaluation methods: DARPA 1992-4
  DARPA: Fluency
    – Each sentence scored for intelligibility without reference to the source text or human reference translation
    – Simple 5-point scale used
  DARPA: Informativeness
    – Designed to test whether enough information was conveyed in MT output to enable evaluators to answer questions on its content
    – Each translation accompanied by 6 multiple-choice questions on content
    – 6 choices for each question
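With 6 questions of 6 choices each, an evaluator who guesses at random is expected to get 1 question in 6 right, so a raw proportion of correct answers overstates how much information the translation conveyed. The sketch below computes the raw proportion and a chance-corrected version. The correction for guessing is a standard testing technique added here for illustration, not something the slide attributes to DARPA, and the function name is hypothetical.

```python
def informativeness_score(responses, answer_key, n_choices=6):
    """Return (raw proportion correct, chance-corrected proportion).

    The corrected value is 0.0 at the level expected from random guessing
    and 1.0 when every question is answered correctly. It can be negative
    if an evaluator does worse than chance.
    """
    if not answer_key or len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must be non-empty and the same length")
    correct = sum(r == k for r, k in zip(responses, answer_key))
    proportion = correct / len(answer_key)
    chance = 1 / n_choices
    corrected = (proportion - chance) / (1 - chance)
    return proportion, corrected


if __name__ == "__main__":
    key = ["A", "C", "F", "B", "D", "E"]        # hypothetical answer key
    answers = ["A", "C", "F", "E", "A", "B"]    # hypothetical evaluator responses
    print(informativeness_score(answers, key))
```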
DARPA-inspired evaluation methods
  Many subsequent evaluations have followed in the footsteps of DARPA
  Fluency and Adequacy using 5-point scales
    – Source: Elliott, D., Atwell, E., Hartley, A.: Compiling and Using a Shareable Parallel Corpus for Machine Translation Evaluation. In: Proceedings of the Workshop on The Amazing Utility of Parallel and Comparable Corpora, Fourth International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal (2004)
  Usability 5-point scale
Where next?
  Beyond similarity metrics
    – FEMTI offers a rich palette of techniques
  Beyond adequacy and fluency
    – Too generic / abstract for specific tasks?
    – Consider MT output in its own right
  Beyond conventional uses of MT as surrogate human translation (emulation)
    – MT as a component in a workflow
Restore a sense of purpose
  "Texts are meant to be used. There are no absolute standards of translation quality but only more or less appropriate translations for the purpose for which they are intended." (Sager 1989: 91)
Revisit MT proficiency (White 2000)
  View MT output as a genre
    – Characterise inadequacy, disfluency, ill-formedness
  Embed MT and adapt (to) the environment
    – in IE, CLIR, CLQA, Speech2Speech
    – in pre- and post-editing
Human Evaluation of Machine Translation Systems
  Any questions?