Third Joint EM+ / CNGL Workshop, Luxembourg 14 October 2011 User-focused task-oriented MT evaluation for wikis: a case study Federico Gaspari, Antonio.

Presentation transcript:

User-focused task-oriented MT evaluation for wikis: a case study
Federico Gaspari, Antonio Toral, and Sudip Kumar Naskar
School of Computing, Dublin City University, Dublin 9, Ireland
{fgaspari, atoral,

Outline
Introduction: the CoSyne project
Related work
Evaluation
  o framework, scenario, questionnaire
Results and discussion
Conclusions
Future work

Introduction: CoSyne
Aim: synchronisation of multilingual wikis
Consortium
  o 7 partners from Germany, Italy, the Netherlands and Ireland
3 academic partners
  o University of Amsterdam (UvA)
  o Fondazione Bruno Kessler (FBK)
  o Dublin City University (DCU)
1 research organization
  o Heidelberg Institute for Theoretical Studies (HITS)
3 end-users
  o Deutsche Welle (DW)
  o Netherlands Institute for Sound and Vision (NISV)
  o Vereniging Wikimedia Nederland (VWN)

Introduction: CoSyne
Techniques used by the CoSyne system:
  o MT
  o Textual entailment
  o Document structure modelling
  o Overlap synchronisation
  o Insertion point detection
CoSyne MT system developed by UvA (Martzoukos and Monz, 2010)
Language pairs covered in year 1: DE / IT / NL ↔ EN
Focus of this user evaluation
  o CoSyne MT software to translate wiki entries DE → EN and NL → EN

Related work
MT quality evaluation
  o fluency
  o adequacy
Automatic MT evaluation metrics, esp. for SMT (Toral et al., 2011)
  o BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), etc.
  o no insight into the nature and severity of errors (e.g. for post-editing)
  o weak correlation with human judgement (Lin & Och, 2004)
Usefulness of MT output and users' level of satisfaction
Post-editing
  o effort (e.g. Allen, 2003; O'Brien, 2007; Specia & Farzindar, 2010)
  o gains vs. translating from scratch (e.g. O'Brien, 2005; Specia, 2011)
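To make concrete why n-gram metrics such as BLEU give no insight into the nature of errors, the following is a minimal sketch of the clipped ("modified") n-gram precision that BLEU is built on. It is not the evaluation code used in the study; the function names `ngrams` and `modified_precision` are hypothetical, and full BLEU additionally combines several n-gram orders and applies a brevity penalty, which this sketch omits.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference where it is most frequent.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For every n-gram, the highest count observed in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Classic degenerate example: repeating a frequent word is not rewarded.
candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
unigram_p = modified_precision(candidate, [reference], 1)  # 2 of 7 tokens credited
bigram_p = modified_precision(candidate, [reference], 2)   # no bigram matches
```

The score is a single number per n-gram order: it says how much of the output overlaps with a reference, but not which words were wrong or how serious the errors are for a post-editor.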

Evaluation framework
User-focused task-oriented evaluation of MT in/for wikis
  o in close collaboration with end-users (DW, NISV)
Accompanied by diagnostic evaluation
  o providing useful feedback to MT developers (UvA)
Pilot study conducted just before month 18 of 36-month project
  o full-scale final evaluation planned at the very end of the project

Evaluation scenario
Protocol for evaluation agreed between DCU and end-users
DW and NISV staff involved: editors, translators, project managers
  o German-English and Dutch-English as their working languages
  o final users of the CoSyne system for wiki content synchronization
Evaluation conducted on typical wiki entries for end-users
Users asked to focus only on linguistic quality and level of usefulness of MT (disregarding other components of the CoSyne system)

Evaluation scenario
Deutsche Welle (DW): KalenderBlatt / Today in History

Evaluation scenario
Netherlands Institute for Sound and Vision (NISV): wiki

Evaluation scenario
Time-tracking system was implemented
Post-editing changes performed by the participants were logged
Before the evaluation
  o participants given presentation and demo of the CoSyne system
  o preliminary experimentation with the CoSyne system for 1-3 hours

Evaluation questionnaire
Written questionnaire administered on paper
  o available at
Questions grouped into 6 parts focusing on different aspects
Approximately 50 items using different formats
  o Likert scale, multiple choice and open questions
Part A: basic demographic information about the respondents
Part B: previous use of MT
Part C: users' evaluation of the CoSyne MT system
Part D: post-editing work
Part E: general comments and feedback
(Part F: usability and interaction design of the overall CoSyne system)

Results: demographics
10 users: 6 from DW, 4 from NISV
6 men and 4 women across DW and NISV
Variety of roles: editors, authors, translators and project managers
Average age: 34 (youngest 20, oldest 46)
Average work experience: just over 3 years (min. 3 months, max. 10 years)

Results: background
All (4) NISV staff were native speakers of Dutch
5 DW users were native speakers of German, plus 1 native speaker of Romanian fluent in German
80% of the participants self-rated their knowledge of English as upper-intermediate, 20% defined it as intermediate or excellent
  o None of the respondents considered themselves bilingual

Results: previous use of MT
80% (8 of the 10 users) had used MT before our experiment
  o 7 for personal reasons, 6 for work (commonly for both purposes)
  o all but one had used Google Translate, 1 had tried Babel Fish, 2 both
Language combinations used
  o 4 from EN into other languages
  o 6 into EN from a range of source languages
  o 5 language combinations not involving English
75% used MT for assimilation purposes vs. 25% for dissemination
62.5% had post-edited raw MT to obtain high-quality translations

Results: previous use of MT
Materials translated with MT by the 8 respondents
  o for study purposes (academic papers and uni-related texts): 3
  o business correspondence, personal or professional e-mails: 2
  o contracts and technical documents: 2
  o online articles: 2
  o websites: 2 ("the translations of Dutch sites to English were hilarious!", but not using the CoSyne MT system!)
  o Wikipedia content: 1

Results: previous use of MT
Overall the 8 respondents had a predominantly negative-to-neutral impression of MT quality before taking part in the evaluation of the CoSyne MT system, based on a 5-point Likert scale (average 2.8 / 5)
[Chart: quality of previously used MT systems on a 5-point scale (1 = very poor to 5 = very good)]

Results: CoSyne MT system
[Chart: quality and usefulness of the CoSyne MT system on a 5-point scale (1 = very poor to 5 = very good)]
Average quality is medium (3 / 5), better than previous experience (2.8)
Usefulness slightly higher than medium (3.3 / 5) (cf. 2.8)

Results: CoSyne MT system
Is CoSyne MT faster than translating wiki entries into English from scratch?
  o rated on a 7-point scale (1 = strongly disagree to 7 = strongly agree)
Average value higher than mid-point of the scale (4.6 / 7)
In line with e.g. Plitt & Masselot (2010) and Flournoy & Rueppel (2010)
From DE almost twice as good as from NL (due to style of wiki texts?)

Results: CoSyne MT system
MT quality broken down into: accuracy, correctness, comprehensibility, readability, style
  o each rated on a 7-point scale (1 = poor to 7 = excellent)
We did not explain to users the subtle differences involved
Only accuracy is approx. average (3.6 / 7), other criteria lower
None of the average values particularly poor (DE always better than NL)

Results: post-editing CoSyne
Amount of work, in terms of time and effort, to post-edit the MT output
  o time and effort on a 7-point scale (1 = short/small to 7 = long/large)
Need to refer to source language while post-editing
  o frequency on a 7-point scale (1 = never to 7 = always)

Results: post-editing CoSyne
Post-editing operations considered: insertion, deletion, substitution, reordering
Severity of errors over post-editing operations
  o on a 7-point scale (1 = irrelevant to 7 = very serious)
Frequency of errors over post-editing operations
  o on a 7-point scale (1 = absent to 7 = frequent)
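The post-editing operations rated above can also be counted automatically by aligning raw MT output with its post-edited version. The following is a minimal sketch using word-level Levenshtein alignment; it is an assumption-laden illustration, not the logging tool used in the study. The function name `edit_operations` and the example sentences are hypothetical, and reordering is not modelled here (a metric such as TER handles it with an additional shift operation).

```python
def edit_operations(mt_tokens, pe_tokens):
    """Count word-level insertions, deletions and substitutions needed
    to turn the MT output into the post-edited text."""
    m, n = len(mt_tokens), len(pe_tokens)
    # dist[i][j] = minimum number of edits turning mt_tokens[:i] into pe_tokens[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if mt_tokens[i - 1] == pe_tokens[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete an MT word
                             dist[i][j - 1] + 1,         # insert a new word
                             dist[i - 1][j - 1] + cost)  # keep or substitute

    # Walk back through the table to recover one optimal sequence of edits.
    ops = {"ins": 0, "del": 0, "sub": 0}
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and mt_tokens[i - 1] == pe_tokens[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1                          # words match, no edit
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops["sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops["del"] += 1
            i -= 1
        else:
            ops["ins"] += 1
            j -= 1
    return ops


# Invented example: one word substituted by the post-editor.
ops_sub = edit_operations("he go home".split(), "he went home".split())
# Invented example: one word inserted by the post-editor.
ops_ins = edit_operations("the house is big".split(),
                          "the house is very big".split())
```

When several optimal alignments exist, the back-trace prefers substitutions over deletions and insertions, so the per-operation counts are one valid breakdown rather than the only one; the total number of edits is the same in every case.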

Results: final comments
Positive aspects:
  o good to have draft translation to work upon
  o integration in the wiki environment
  o potential to speed up the translation task
Weaknesses:
  o translation quality needs improving, due to
      o wrong translation of pronouns
      o verbs frequently dropped
      o incorrect word order
      o mistranslated compounds
      o limited lexical coverage (OOV items are an issue)
Good potential of the CoSyne system based on first prototype

Conclusions
User-focused task-oriented questionnaire-based evaluation for MT used in wikis, supported by post-editing
Evaluation of the first (Y1) prototype of the CoSyne MT system for DE → EN and NL → EN
Quality of the CoSyne MT system perceived by the users as higher than that of previously used MT systems
Post-editing effort is considered high, but users found it less time-consuming than translating from scratch
Translations from German rated better than those from Dutch
  o contrasts with earlier findings (Toral et al., 2011)
  o further investigation into this discrepancy (meta-evaluation)

Future work
Extend analysis looking into the post-editing logs, considering actual post-editing time (to estimate costs)
Involve more users after pilot stage
Include a control group (translating manually or using other MT software)
Investigate correlation between the post-editing carried out by the users and the results provided by TER and TERp (ins, del…)
Use our linguistically-aware diagnostic evaluation tool (DELiC4MT) to monitor performance of the MT system on specific issues flagged up by the users

Thank you for your attention!
Questions?