Third Joint EM+ / CNGL Workshop, Luxembourg 14 October 2011 User-focused task-oriented MT evaluation for wikis: a case study Federico Gaspari, Antonio Toral, and Sudip Kumar Naskar School of Computing Dublin City University Dublin 9, Ireland {fgaspari, atoral, snaskar}@computing.dcu.ie

Outline
Introduction: the CoSyne project
Related work
Evaluation
  o framework, scenario, questionnaire
Results and discussion
Conclusions
Future work

Introduction: CoSyne
Aim: synchronisation of multilingual wikis
Consortium
  o 7 partners from Germany, Italy, the Netherlands and Ireland
3 academic partners
  o University of Amsterdam (UvA)
  o Fondazione Bruno Kessler (FBK)
  o Dublin City University (DCU)
1 research organization
  o Heidelberg Institute for Theoretical Studies (HITS)
3 end-users
  o Deutsche Welle (DW)
  o Netherlands Institute for Sound and Vision (NISV)
  o Vereniging Wikimedia Nederland (VWN)

Introduction: CoSyne
Techniques used by the CoSyne system:
  o MT
  o Textual entailment
  o Document structure modelling
  o Overlap synchronisation
  o Insertion point detection
CoSyne MT system developed by UvA (Martzoukos and Monz, 2010)
Language pairs covered in year 1: DE / IT / NL ↔ EN
Focus of this user evaluation
  o CoSyne MT software to translate wiki entries DE → EN and NL → EN

Related work
MT quality evaluation
  o fluency
  o adequacy
Automatic MT evaluation metrics, esp. for SMT (Toral et al., 2011)
  o BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), etc.
  o no insight into the nature and severity of errors (e.g. for post-editing)
  o weak correlation with human judgement (Lin & Och, 2004)
Usefulness of MT output and users’ level of satisfaction
Post-editing
  o effort (e.g. Allen, 2003; O’Brien, 2007; Specia & Farzindar, 2010)
  o gains vs. translating from scratch (e.g. O’Brien, 2005; Specia, 2011)

Evaluation framework
User-focused task-oriented evaluation of MT in/for wikis
  o in close collaboration with end-users (DW, NISV)
Accompanied by diagnostic evaluation
  o providing useful feedback to MT developers (UvA)
Pilot study conducted just before month 18 of 36-month project
  o full-scale final evaluation planned at the very end of the project

Evaluation scenario
Protocol for evaluation agreed between DCU and end-users
DW and NISV staff involved: editors, translators, project managers
  o German-English and Dutch-English as their working languages
  o final users of the CoSyne system for wiki content synchronization
Evaluation conducted on typical wiki entries for end-users
Users asked to focus only on linguistic quality and level of usefulness of MT (disregarding other components of the CoSyne system)

Evaluation scenario
Deutsche Welle (DW): KalenderBlatt / Today in History

Evaluation scenario
Netherlands Institute for Sound and Vision (NISV): wiki

Evaluation scenario
Netherlands Institute for Sound and Vision (NISV): wiki

Evaluation scenario
Time-tracking system was implemented
Post-editing changes performed by the participants were logged
Before the evaluation
  o participants given presentation and demo of the CoSyne system
  o preliminary experimentation with the CoSyne system for 1-3 hours

Evaluation questionnaire
Written questionnaire administered on paper
  o available at http://www.computing.dcu.ie/~atoral/cosyne/quest.pdf
Questions grouped into 6 parts focusing on different aspects
Approximately 50 items using different formats
  o Likert scale, multiple choice and open questions
Part A: basic demographic information about the respondents
Part B: previous use of MT
Part C: users' evaluation of the CoSyne MT system
Part D: post-editing work
Part E: general comments and feedback
(Part F: usability and interaction design of the overall CoSyne system)

Results: demographics
10 users: 6 from DW, 4 from NISV
6 men and 4 women across DW and NISV
Variety of roles: editors, authors, translators and project managers
Average age: 34 (youngest 20, oldest 46)
Average work experience: just over 3 years (min. 3 months, max. 10 years)

Results: background
All (4) NISV staff were native speakers of Dutch
5 DW users were native speakers of German + 1 native speaker of Romanian fluent in German
80% of the participants self-rated their knowledge of English as upper-intermediate, 20% defined it as intermediate or excellent
  o None of the respondents considered themselves bilingual

Results: previous use of MT
80% had used MT before our experiment
  o 7 for personal reasons, 6 for work (commonly for both purposes)
  o all but one had used Google Translate, 1 had tried Babel Fish, 2 both
Language combinations used
  o 4 from EN into other languages
  o 6 into EN from a range of source languages
  o 5 language combinations not involving English
75% used MT for assimilation purposes vs. 25% for dissemination
62.5% had post-edited raw MT to obtain high-quality translations
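
The percentages on this slide appear to use two different bases: 80% refers to all 10 participants, while 75%, 25% and 62.5% only yield whole numbers of people if they refer to the 8 respondents who had used MT before. The following is a minimal sketch of that arithmetic, not part of the original study; the helper name `respondents` is hypothetical.

```python
from fractions import Fraction


def respondents(percent, base):
    """Return how many people `percent` of `base` is, or None if not a whole number."""
    count = Fraction(str(percent)) * base / 100  # exact arithmetic, no float rounding
    return int(count) if count.denominator == 1 else None


TOTAL_PARTICIPANTS = 10  # from the demographics slide

# 80% of all participants had used MT before the experiment.
prior_mt_users = respondents(80, TOTAL_PARTICIPANTS)

# The remaining percentages are only whole numbers relative to the prior MT users.
for label, percent in [("assimilation", 75), ("dissemination", 25), ("post-edited raw MT", 62.5)]:
    print(label,
          "| of all participants:", respondents(percent, TOTAL_PARTICIPANTS),
          "| of prior MT users:", respondents(percent, prior_mt_users))
```

A result of `None` in the "of all participants" column marks a percentage that cannot describe a whole number of people out of 10, which is why the smaller base is the more plausible reading.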

Results: previous use of MT
Materials translated with MT by the 8 respondents
  o for study purposes (academic papers and uni-related texts): 3
  o business correspondence, personal or professional emails: 2
  o contracts and technical documents: 2
  o online articles: 2
  o websites: 2 (“the translations of Dutch sites to English were hilarious!”, but not using CoSyne MT system!!)
  o Wikipedia content: 1

Results: previous use of MT
Overall the 8 respondents had a predominantly negative-to-neutral impression of MT quality before taking part in the evaluation of the CoSyne MT system, based on a 5-point Likert scale (average 2.8 / 5)
[Chart: quality of previously used MT systems on a 5-point scale (1 = very poor to 5 = very good)]

Results: CoSyne MT system
[Chart: quality and usefulness of the CoSyne MT system on a 5-point scale (1 = very poor to 5 = very good)]
Average quality is medium (3 / 5), better than previous experience (2.8)
Usefulness slightly higher than medium (3.3 / 5) (cf. 2.8)

Results: CoSyne MT system
Is CoSyne MT faster than translating wiki entries into English from scratch?
[Chart: agreement on a 7-point scale (1 = strongly disagree to 7 = strongly agree)]
Average value higher than mid-point of the scale (4.6 / 7)
In line with e.g. Plitt & Masselot (2010) and Flournoy & Rueppel (2010)
From DE almost twice as good as from NL (due to style of wiki texts?)

Results: CoSyne MT system
MT quality broken down into: accuracy, correctness, comprehensibility, readability, style
[Chart: accuracy, correctness, comprehensibility, readability and style on a 7-point scale (1 = poor to 7 = excellent)]
We did not explain to users the subtle differences involved
Only accuracy is approx. average (3.6 / 7), other criteria lower
None of the average values particularly poor (DE always better than NL)
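
The averages reported so far come from both 5-point and 7-point scales, whose mid-points are 3 and 4 respectively. One way to compare them is to map each onto a common 0 to 1 range, where 0.5 is the mid-point. The following is a minimal sketch of that rescaling using only the averages quoted on the slides; the function name `normalise` and the choice of a linear rescaling are assumptions, not something the authors describe.

```python
def normalise(score, points):
    """Map a score on a 1..points Likert scale onto 0..1 (0.5 = scale mid-point)."""
    if not 1 <= score <= points:
        raise ValueError(f"score {score} is outside a 1..{points} scale")
    return (score - 1) / (points - 1)


# (average, number of scale points) as reported on the slides
reported = {
    "previous MT quality": (2.8, 5),
    "CoSyne MT quality": (3.0, 5),
    "CoSyne MT usefulness": (3.3, 5),
    "faster than from scratch": (4.6, 7),
    "accuracy": (3.6, 7),
}

for label, (score, points) in reported.items():
    value = normalise(score, points)
    side = "above" if value > 0.5 else "at" if value == 0.5 else "below"
    print(f"{label}: {value:.3f} ({side} mid-point)")
```

On this rescaling the accuracy average of 3.6 / 7 falls slightly below the mid-point of 4, which matches the slide's "approx. average" wording.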

Results: post-editing CoSyne
Amount of work, in terms of time and effort to post-edit the MT output
[Chart: time and effort on a 7-point scale (1 = short/small to 7 = long/large)]
Need to refer to source language while post-editing
[Chart: frequency on a 7-point scale (1 = never to 7 = always)]

Results: post-editing CoSyne
Severity of errors over post-editing operations
[Chart: insertion, deletion, substitution, reordering on a 7-point scale (1 = irrelevant to 7 = very serious)]
Frequency of errors over post-editing operations
[Chart: insertion, deletion, substitution, reordering on a 7-point scale (1 = absent to 7 = frequent)]

Results: final comments
Positive aspects:
  o good to have draft translation to work upon
  o integration in the wiki environment
  o potential to speed up the translation task
Weaknesses:
  o translation quality needs improving, due to
      - wrong translation of pronouns
      - verbs frequently dropped
      - incorrect word order
      - mistranslated compounds
      - limited lexical coverage (OOV items are an issue)
Good potential of the CoSyne system based on first prototype

Conclusions
User-focused task-oriented questionnaire-based evaluation for MT used in wikis, supported by post-editing
Evaluation of the first Y1 prototype of the CoSyne MT system for DE → EN and NL → EN
Quality of the CoSyne MT system perceived by the users higher than that of previously used MT systems
Post-editing effort is considered high, but users found it less time-consuming than translating from scratch
Translations from German rated better than those from Dutch
  o contrasts with earlier findings (Toral et al., 2011)
  o further investigation into this discrepancy (meta-evaluation)

Future work
Extend analysis looking into the post-editing logs, considering actual post-editing time (to estimate costs)
Involve more users after pilot stage
Include a control group (translating manually or other MT s/w)
Investigate correlation between the post-editing carried out by the users and the results provided by TER and TERp (ins, del…)
Use our linguistically-aware diagnostic evaluation tool (DELiC4MT) to monitor performance of the MT system on specific issues flagged up by the users
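
To illustrate the kind of edit counts (ins, del, sub) that could be compared with the users' post-editing, the following is a minimal sketch of a word-level edit distance between raw MT output and its post-edited version. It is not TER or TERp: it has no shift (reordering) operation and no paraphrase or stem matching, and it simply treats the post-edited text as the reference. The function name `edit_operations` and the example sentences are hypothetical and not taken from the study's logs.

```python
def edit_operations(mt_output, post_edited):
    """Count word-level insertions, deletions and substitutions needed to turn
    the MT output into the post-edited text, plus edits per post-edited word."""
    hyp, ref = mt_output.split(), post_edited.split()
    rows, cols = len(hyp) + 1, len(ref) + 1

    # cost[i][j] = minimum number of edits turning hyp[:i] into ref[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i  # delete every MT word
    for j in range(1, cols):
        cost[0][j] = j  # insert every post-edited word
    for i in range(1, rows):
        for j in range(1, cols):
            same = hyp[i - 1] == ref[j - 1]
            cost[i][j] = min(cost[i - 1][j - 1] + (0 if same else 1),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)

    # Walk back through the table to label each edit on one optimal path.
    ops = {"ins": 0, "del": 0, "sub": 0}
    i, j = len(hyp), len(ref)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and hyp[i - 1] == ref[j - 1] and cost[i][j] == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1  # words match, no edit
        elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
            ops["sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops["del"] += 1
            i -= 1
        else:
            ops["ins"] += 1
            j -= 1

    rate = sum(ops.values()) / max(len(ref), 1)
    return ops, rate


# Hypothetical example sentences, for illustration only.
ops, rate = edit_operations("he go to school yesterday",
                            "he went to the school yesterday")
print(ops, rate)
```

Because there is no shift operation, a reordered phrase is counted here as a combination of deletions and insertions, so the counts would overstate edits for the word-order errors the users reported.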

Thank you for your attention!
Questions?
