Automatic methods of MT evaluation Practical 18/04/2005 MODL5003 Principles and applications of machine translation slides available at:

Overview 1. Aspects of MT evaluation 2. Text Quality evaluation 3. Advantages / disadvantages of automatic techniques 4. Methods of automatic evaluation 5. Validation of automatic scores 6. Challenges 7. Recent developments

1. Aspects of MT evaluation 1/3 (Hutchins & Somers, 1992: ) Text quality (important for developers, users and managers); Extendibility (developers); Operational capabilities of the system (users); Efficiency of use (companies, managers, freelance translators)

Aspects of MT evaluation 2/3 Text quality: can be evaluated manually or automatically; the central issue in MT evaluation… Extendibility = architectural considerations: adding new language pairs; extending lexical / grammatical coverage; developing new subject domains; improvability and portability of the system

Aspects of MT evaluation 3/3 Operational capabilities of the system: user interface; dictionary update; cost / performance, etc. Efficiency of use: is there an increase in productivity? the cost of buying / tuning / integrating into the workflow / maintaining / training personnel; how much money can be saved for the company / department?

2. Text quality evaluation (TQE) – issues 1/2 Quality evaluation vs. error identification / analysis; black-box vs. glass-box evaluation; error correction on the user side: dictionary updating, do-not-translate lists, etc.

2. Text quality evaluation (TQE) – issues 2/2 Multiple quality parameters & their relations: fidelity (adequacy); fluency (intelligibility, clarity); style; informativeness… Are these parameters completely independent, or is intelligibility a pre-condition for adequacy or style? Granularity of evaluation: different for different purposes – individual sentences; texts; corpora of similar documents; the average performance of an MT system

3. Advantages of automatic evaluation Low cost; objective character of evaluated parameters: reproducibility, comparability across texts (relative difficulty for MT) and across evaluations

Disadvantages of automatic evaluation Need for calibration with human scores; interpretation in terms of human quality parameters is not clear; do not account for all quality dimensions (hard to find good measures for certain quality parameters); reliable only for homogeneous systems: results for non-native human translation, knowledge-based MT output and statistical MT output may be non-comparable

4. Methods of automatic evaluation Automatic evaluation is more recent: the first methods appeared in the late 1990s. Performance methods: measuring the performance of some system which uses degraded MT output. Reference proximity methods: measuring the distance between the MT output and a gold-standard translation.

4.1 Performance methods A pragmatic approach to MT, similar to performance-based human evaluation: "…can someone using the translation carry out the instructions as well as someone using the original?" (Hutchins & Somers, 1992: 163). Different from human performance evaluation: 1. tasks are carried out by an automated system; 2. parameter(s) of the output are automatically computed

…automated systems used & parameters computed: parser (automatic syntactic analyser) – computing the average depth of syntactic trees (Rajman and Hartley, 2000); Named Entity Recognition system (a system which finds proper names, e.g., names of organisations…) – number of extracted organisation names; Information Extraction (filling a database: events, participants of events) – computing the ratio of correctly filled database fields
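
A minimal sketch of the parse-tree-depth idea, using the spaCy dependency parser as a stand-in for the parser used by Rajman and Hartley (2000); the model name and the depth definition below are assumptions for illustration, not the original setup.

# Sketch: average dependency-tree depth per sentence as a rough structural indicator.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def avg_tree_depth(text):
    """Average, over sentences, of the deepest token in each dependency tree."""
    doc = nlp(text)
    depths = []
    for sent in doc.sents:
        # depth of a token = number of ancestors up to the sentence root
        depths.append(max(len(list(tok.ancestors)) for tok in sent))
    return sum(depths) / len(depths) if depths else 0.0

print(avg_tree_depth("The Ministry of Foreign Affairs echoed this view."))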

Performance-based methods: an example 1/2
Open-source NER system for English (ANNIE); the number of extracted Organisation Names gives an indication of Adequacy.
ORI: … le chef de la diplomatie égyptienne
HT: the Chief of the Egyptian Diplomatic Corps
MT-Systran: the chief of the Egyptian diplomacy

Performance-based methods: an example 2/2 Count extracted organisation names: the number will be bigger for better systems, and biggest for human translations. Other types of proper names do not correspond to such differences in quality: person names; location names; dates, numbers, currencies…
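
A hedged sketch of the organisation-name count described above, using spaCy's NER in place of ANNIE; the entity label and the single-sentence comparison are illustrative assumptions.

# Sketch: count ORG entities in a human translation and in MT output;
# fewer ORG entities in the MT output suggests adequacy loss on organisation names.
import spacy

nlp = spacy.load("en_core_web_sm")

def org_count(text):
    return sum(1 for ent in nlp(text).ents if ent.label_ == "ORG")

ht = "the Chief of the Egyptian Diplomatic Corps"
mt = "the chief of the Egyptian diplomacy"
print(org_count(ht), org_count(mt))  # the MT output is expected to yield fewer ORG entities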

Performance-based methods: theory Built on prior assumptions about natural language properties: sentence structure is always connected; MT errors more frequently destroy relevant contexts than create spurious contexts; difficulties for automatic tools are proportional to relative quality (the amount of MT degradation). Be careful with prior assumptions: what is worse for the human user may be better for an automatic system.

Example 1
ORI: Il a été fait chevalier dans l'ordre national du Mérite en mai 1991
HT: He was made a Chevalier in the National Order of Merit in May, 1991
MT-Systran: It was made knight in the national order of the Merit in May
MT-Candide: He was knighted in the national command at Merite in May, 1991.

Example 2 Parser-based score: X-score
The Xerox shallow parser XELDA produces annotated dependency trees; it identifies 22 types of dependencies.
The Ministry of Foreign Affairs echoed this view
SUBJ(Ministry, echoed); DOBJ(echoed, view); NN(Foreign, Affairs); NNPREP(Ministry, of, Affairs)

Example 2 (contd.)
a hearing that lasted more than 2 hours – RELSUBJ(hearing, lasted)
a public program that has already been agreed on – RELSUBJPASS(program, agreed)
to examine the effects as possible – PADJ(effects, possible)
brightly coloured doors – ADVADJ(brightly, coloured)
X-score = (#RELSUBJ + #RELSUBJPASS – #PADJ – #ADVADJ)
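
Given the parser output, the X-score itself is a simple count over dependency labels; a small sketch, assuming the XELDA analyses have already been reduced to a list of label strings.

# Sketch: X-score computed from a list of dependency labels.
from collections import Counter

def x_score(dependency_labels):
    c = Counter(dependency_labels)
    return c["RELSUBJ"] + c["RELSUBJPASS"] - c["PADJ"] - c["ADVADJ"]

print(x_score(["SUBJ", "DOBJ", "RELSUBJ", "PADJ"]))  # 1 + 0 - 1 - 0 = 0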

4.2 Reference proximity methods Assumption of Reference Proximity (ARP): "…the closer the machine translation is to a professional human translation, the better it is" (Papineni et al., 2002: 311). Finding a distance between two texts: minimal edit distance; N-gram distance; …

Minimal edit distance Minimal number of editing operations to transform text1 into text2: deletions (sequence xy changed to x); insertions (x changed to xy); substitutions (x changed to y); transpositions (sequence xy changed to yx). Algorithm by Wagner and Fischer (1974). Edit distance implementation: the RED method (Akiba, Imamura and Sumita, 2001)
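
A minimal word-level edit-distance sketch in the Wagner–Fischer dynamic-programming style, counting deletions, insertions and substitutions only (transpositions are left out for brevity).

# Sketch: word-level minimal edit distance between an MT output and a reference.
def edit_distance(mt, ref):
    a, b = mt.split(), ref.split()
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(b) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution / match
    return d[len(a)][len(b)]

print(edit_distance("It was made knight", "He was made a Chevalier"))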

Problem with edit distance: legitimate translation variation
ORI: De son côté, le département d'Etat américain, dans un communiqué, a déclaré: Nous ne comprenons pas la décision de Paris.
HT-Expert: For its part, the American Department of State said in a communique that We do not understand the decision made by Paris.
HT-Reference: For its part, the American State Department stated in a press release: We do not understand the decision of Paris.
MT-Systran: On its side, the American State Department, in an official statement, declared: We do not include/understand the decision of Paris.

Legitimate translation variation (LTV) …contd. To which human translation should we compute the edit distance? Is it possible to integrate both human translations into a reference set?

N-gram distance The number of common words (evaluating lexical choices); the number of common sequences of 2, 3, 4 … N words (evaluating word order): 2-word sequences (bi-grams), 3-word sequences (tri-grams), 4-word sequences (four-grams), … N-word sequences (N-grams). N-grams allow us to compute several parameters…
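
A small sketch of N-gram extraction over tokenised text; storing the N-grams as a multiset (Counter) keeps track of repetitions, which the union/intersection scores below treat explicitly.

# Sketch: extract the multiset of N-grams from a tokenised sentence.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "we do not understand the decision of Paris".split()
print(ngrams(tokens, 2))  # bi-grams with their counts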

Matches of N-grams [diagram: overlap between the HT and MT N-gram sets – true positives (N-grams in both), false positives (in MT only), false negatives (in HT only)]

Matches of N-grams (contd.)
                  MT +               MT –
Human text +      true positives     false negatives
Human text –      false positives
Recall = avoiding false negatives (the Human text + row); Precision = avoiding false positives (the MT + column)

Precision and Recall Precision = how accurate is the answer? (don't guess: wrong answers are deducted). Recall = how complete is the answer? (guess if not sure: don't miss anything)
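
In terms of the table above, precision = TP / (TP + FP) over the MT N-grams and recall = TP / (TP + FN) over the reference N-grams; a sketch with clipped multiset counts, assuming a single reference translation.

# Sketch: N-gram precision and recall of MT output against one reference translation.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def precision_recall(mt_tokens, ref_tokens, n=1):
    mt, ref = ngrams(mt_tokens, n), ngrams(ref_tokens, n)
    true_pos = sum((mt & ref).values())            # clipped matches (true positives)
    precision = true_pos / max(sum(mt.values()), 1)
    recall = true_pos / max(sum(ref.values()), 1)
    return precision, recall

print(precision_recall("the chief of the Egyptian diplomacy".split(),
                       "the Chief of the Egyptian Diplomatic Corps".split()))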

Translation variation and N-grams N-gram distance to multiple human reference translations: Precision on the union of the N-gram sets in HT1, HT2, HT3… (N-grams in all independent human translations taken together, with repetitions removed); Recall on the intersection of the N-gram sets (N-grams common to all sets – only repeated N-grams, the most stable across different human translations)
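
A sketch of the two reference sets just described, using plain Python sets: the union of N-gram sets over all human translations for precision, and their intersection for recall; the toy sentences are simplified from the LTV example above.

# Sketch: precision on the union and recall on the intersection of
# N-gram sets from several human reference translations.
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def union_precision_intersection_recall(mt_tokens, references, n=1):
    mt = ngram_set(mt_tokens, n)
    ref_sets = [ngram_set(r, n) for r in references]
    union = set().union(*ref_sets)        # N-grams found in any human translation
    inter = set.intersection(*ref_sets)   # N-grams shared by all human translations
    precision = len(mt & union) / max(len(mt), 1)
    recall = len(mt & inter) / max(len(inter), 1)
    return precision, recall

refs = ["we do not understand the decision made by Paris".split(),
        "we do not understand the decision of Paris".split()]
mt = "we do not include the decision of Paris".split()
print(union_precision_intersection_recall(mt, refs))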

Union and Intersection [diagram: union vs. intersection of the N-gram sets from multiple human reference translations]

Human and automated scores Empirical observations: Precision on the union gives an indication of Fluency; Recall on the intersection gives an indication of Adequacy; automated Adequacy evaluation is less accurate – harder. The most successful N-gram proximity metric is now the BLEU evaluation measure (Papineni et al., 2002): BiLingual Evaluation Understudy

BLEU evaluation measure Computes Precision on the union of N-grams; accurately predicts Fluency; produces scores in the range [0, 1]. Usage: download and extract the Perl script bleu.pl; prepare the MT output and reference translations in separate *.txt files; type at the command prompt: perl bleu-1.03.pl -t mt.txt -r ht.txt
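
The Perl script above is the route the slides describe; as an alternative sketch, NLTK's BLEU implementation computes a comparable corpus-level score (the tokenisation and smoothing choices here are assumptions, and the sentences are from the LTV example).

# Sketch: corpus-level BLEU with NLTK instead of the bleu.pl script.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [["we do not understand the decision of Paris".split(),
               "we do not understand the decision made by Paris".split()]]
hypotheses = ["we do not include the decision of Paris".split()]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(score)  # a value in the range [0, 1]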

BLEU evaluation measure Input files may mark up the texts with tags, e.g. to separate different reference translations; paragraphs may also be surrounded by tags.

5. Validation of automatic scores Automatic scores have to be validated: are they meaningful, i.e. do they predict any human evaluation measures, e.g., Fluency, Adequacy, Informativeness? Agreement between human and automated scores is measured by Pearson's correlation coefficient r, a number in the range [–1, 1]: –1 < r < –0.5 = strong negative correlation; 0.5 < r < +1 = strong positive correlation; –0.5 < r < 0.5 = no correlation or weak correlation

Pearson's correlation coefficient r in Excel
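
In Excel this is the CORREL() (or PEARSON()) worksheet function over the two score columns; the same check in Python is a one-liner with SciPy. The score vectors below are invented for illustration.

# Sketch: validating an automatic metric against human judgements with Pearson's r.
from scipy.stats import pearsonr

bleu_scores   = [0.21, 0.35, 0.48, 0.52, 0.60]  # automatic scores per system (illustrative)
human_fluency = [2.1, 2.9, 3.4, 3.6, 4.2]       # human fluency judgements (illustrative)

r, p_value = pearsonr(bleu_scores, human_fluency)
print(r)  # r > 0.5 would indicate a strong positive correlation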

6. Challenges Multi-dimensionality: no single measure of MT quality; some quality parameters are harder to measure automatically. Evaluating usefulness of imperfect MT: automatic systems and human users have different needs; human users have publication (dissemination) in mind, while MT is primarily used for understanding (assimilation)

7. Recent developments: N-gram distance Paraphrasing instead of multiple reference translations; more weight to more important words (relatively more frequent in a given text); relations between different human scores; accounting for dynamic quality criteria