Rating Evaluation Methods through Correlation, presented by Lena Marg, Language Tools. MTE 2014, Workshop on Automatic and Manual Metrics for Operational Translation Evaluation.


Rating Evaluation Methods through Correlation. Presented by Lena Marg, Language Tools. MTE 2014, Workshop on Automatic and Manual Metrics for Operational Translation Evaluation, at the 9th edition of the Language Resources and Evaluation Conference (LREC), Reykjavik.

Background on MT

MT programs vary with regard to:
- Scope
- Locales
- Maturity
- System Setup & Ownership
- MT Solution used
- Key Objective of using MT
- Final Quality Requirements
- Source Content

MT Quality

1. Automatic Scores
- Provided by the MT system (typically BLEU)
- Provided by our internal scoring tool (range of metrics)
2. Human Evaluation
- Adequacy, scores 1-5
- Fluency, scores 1-5
3. Productivity Tests
- Post-Editing versus Human Translation in iOmegaT

The Database

Objective: Establish correlations between these 3 evaluation approaches to
- draw conclusions on predicting productivity gains
- see how & when to use the different metrics best

Contents:
- Data from metrics (BLEU & PE Distance, Adequacy & Fluency, Productivity deltas)
- Various locales, MT systems, content types
- MT error analysis
- Post-editing quality scores

Method: Pearson's r

+.70 or higher   Very strong positive relationship
+.40 to +.69     Strong positive relationship
+.30 to +.39     Moderate positive relationship
+.20 to +.29     Weak positive relationship
+.01 to +.19     No or negligible relationship
-.01 to -.19     No or negligible relationship
-.20 to -.29     Weak negative relationship
-.30 to -.39     Moderate negative relationship
-.40 to -.69     Strong negative relationship
-.70 or lower    Very strong negative relationship
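The interpretation scale above can be turned into a small helper. This is a minimal stdlib-only sketch (not the authors' tooling): `pearson_r` computes the product-moment correlation directly from its definition, and `strength` maps a coefficient onto the bands listed on this slide.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def strength(r):
    """Label r using the interpretation bands from the slide above."""
    a = abs(r)
    sign = "positive" if r > 0 else "negative"
    if a >= 0.70:
        return f"Very strong {sign} relationship"
    if a >= 0.40:
        return f"Strong {sign} relationship"
    if a >= 0.30:
        return f"Moderate {sign} relationship"
    if a >= 0.20:
        return f"Weak {sign} relationship"
    return "No or negligible relationship"
```

In practice a statistics package (e.g. `scipy.stats.pearsonr`) would also return the p values reported in the summary slide; the manual version is shown only to make the definition concrete.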

The Database: Data Used

27 locales in total, with varying amounts of available data, plus 5 different MT systems (SMT & hybrid).

Correlation Results: Adequacy vs Fluency

A Pearson's r of 0.82 across 182 test sets and 22 locales is a very strong positive relationship.

Comment:
- Most locales show a strong correlation between their Fluency and Adequacy scores.
- A high correlation is expected (with MT systems customized on in-domain data): if a segment is really not understandable, it is neither accurate nor fluent; if a segment is almost perfect, both score very high.
- Some evaluators might not differentiate enough between Adequacy & Fluency, falsely creating a higher correlation.

Correlation Results: Adequacy and Fluency versus BLEU

Fluency and BLEU across locales have a Pearson's r of 0.41, a strong positive relationship. Adequacy and BLEU across locales have a Pearson's r of 0.26, a weak positive relationship.

(Chart: Adequacy, Fluency and BLEU correlation for locales with 4 or more test sets.)

Correlation Results: Adequacy and Fluency versus PE Distance

Fluency and PE distance across all locales have a cumulative Pearson's r of -0.70, a very strong negative relationship. Adequacy and PE distance across all locales show a strong negative relationship. A negative correlation is desired: as Adequacy and Fluency scores increase, PE distance should decrease proportionally.
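The deck does not define how its internal tooling computes PE distance. A minimal stand-in, assuming PE distance is a normalized edit distance between the raw MT output and its post-edited version, can be sketched with the stdlib:

```python
import difflib

def pe_distance(mt_output: str, post_edited: str) -> float:
    """Hypothetical PE-distance stand-in: 1 minus the similarity ratio
    of the two strings, so 0.0 means the post-editor changed nothing
    and values near 1.0 mean the segment was almost fully rewritten."""
    ratio = difflib.SequenceMatcher(None, mt_output, post_edited).ratio()
    return 1.0 - ratio
```

Under this definition, higher Adequacy and Fluency should coincide with lower `pe_distance`, which is exactly the negative correlation the slide describes.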

Correlation Results: Adequacy and Fluency versus Productivity Delta

Productivity and Adequacy across all locales have a cumulative Pearson's r of 0.77, a very strong correlation. Productivity and Fluency across all locales have a cumulative Pearson's r of 0.71, a very strong correlation.
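The deck does not spell out the productivity delta formula. A plausible minimal definition, assumed here for illustration, is the relative throughput gain of post-editing over translating from scratch:

```python
def productivity_delta(pe_words_per_hour: float, ht_words_per_hour: float) -> float:
    """Hypothetical definition: relative gain of post-editing (PE)
    over human translation (HT) from scratch, measured in words/hour.
    A value of 0.25 means post-editing is 25% faster."""
    return (pe_words_per_hour - ht_words_per_hour) / ht_words_per_hour
```

For example, a post-editor processing 750 words/hour against a from-scratch baseline of 600 words/hour yields a delta of 0.25.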

Correlation Results: Automatic Metrics versus Productivity Delta

PE distance and productivity delta show a strong negative relationship: as PE distance increases, indicating greater effort from the post-editor, productivity declines. Productivity delta and BLEU have a cumulative Pearson's r of 0.24, a weak positive relationship.

Correlation Results: Summary

Pearson's r  Variables                             Strength of Correlation
 0.82        Adequacy & Fluency                    Very strong positive relationship
 0.77        Adequacy & P Delta                    Very strong positive relationship
 0.71        Fluency & P Delta                     Very strong positive relationship
             Cognitive Effort Rank & PE Distance   Strong positive relationship
 0.41        Fluency & BLEU                        Strong positive relationship
 0.26        Adequacy & BLEU                       Weak positive relationship
 0.24        BLEU & P Delta                        Weak positive relationship
             Numbers of Errors & PE Distance       No or negligible relationship (N=16, 10 locales, ns)
-0.30        Predominant Error & BLEU              Moderate negative relationship
             Cognitive Effort Rank & PE Delta      Moderate negative relationship (N=20, 10 locales, ns)
-0.41        Numbers of Errors & BLEU              Strong negative relationship
             Adequacy & PE Distance                Strong negative relationship
             PE Distance & P Delta                 Strong negative relationship
-0.70        Fluency & PE Distance                 Very strong negative relationship
             BLEU & PE Distance                    Very strong negative relationship

Takeaways: Correlations

The strongest correlations were found between:
- Adequacy & Fluency
- BLEU & PE Distance
- Adequacy & Productivity Delta
- Fluency & Productivity Delta
- Fluency & PE Distance

The human evaluations come out as stronger indicators of potential post-editing productivity gains than the automatic metrics.

Error Analysis

Data size: 117 evaluations x 25 segments (3125 segments), covering 22 locales and different MT systems (hybrid & SMT). Taking this "broad sweep" view, the most frequent errors logged by evaluators across all categories are:
- Sentence structure (word order)
- MT output too literal
- Wrong terminology
- Word form disagreements
- Source term left untranslated

Error Analysis (continued)

A similar picture emerges when we focus on the 8 dominant language pairs that constituted the bulk of the evaluations in the dataset.

Takeaways: Most Frequent Errors Logged

Across different MT systems, content types and locales, 5 error categories stand out in particular. Questions:
- How (and whether) do these correlate with post-editing effort and with predicting productivity gains?
- How (and whether) can the findings on errors be used to improve the underlying systems?
- Are the current error categories what we need? Can the categories be improved for evaluators?
- Will these categories work for other post-editing scenarios (e.g. light PE)?

Takeaways: Human Evaluation Form

Remodelling of the Human Evaluation form to:
- increase user-friendliness
- distinguish better between Adequacy & Fluency errors
- align with cognitive effort categories proposed in the literature
- improve relevance for system updates

E.g. "Literal Translation" seemed too broad and was probably over-used.

Next Steps

- Focus on language groups and individual languages: do we see the same correlations?
- Focus on different MT systems.
- Add categories to the database (e.g. string length, post-editor experience).
- Add new data to the database and repeat the correlations.
- Continuously tweak the Human Evaluation template and process, as it proves to provide valuable insights for predictions, as well as for post-editor on-boarding/education and MT system improvement.
- Investigate correlation with other AutoScores (…)

THANK YOU! with Laura Casanellas Luri, Elaine O’Curran, Andy Mallett