Published by Jamya Carr. Modified over 9 years ago.
1
Third Joint EM+ / CNGL Workshop, Luxembourg 14 October 2011
User-focused task-oriented MT evaluation for wikis: a case study
Federico Gaspari, Antonio Toral, and Sudip Kumar Naskar
School of Computing, Dublin City University, Dublin 9, Ireland
{fgaspari, atoral, snaskar}@computing.dcu.ie
2
Outline
Introduction: the CoSyne project
Related work
Evaluation
  o framework, scenario, questionnaire
Results and discussion
Conclusions
Future work
3
Introduction: CoSyne
Aim: Synchronisation of multilingual wikis
Consortium
  o 7 partners from Germany, Italy, the Netherlands and Ireland
3 academic partners
  o University of Amsterdam (UvA)
  o Fondazione Bruno Kessler (FBK)
  o Dublin City University (DCU)
1 research organization
  o Heidelberg Institute for Theoretical Studies (HITS)
3 end-users
  o Deutsche Welle (DW)
  o Netherlands Institute for Sound and Vision (NISV)
  o Vereniging Wikimedia Nederland (VWN)
4
Introduction: CoSyne
Techniques used by the CoSyne system:
  o MT
  o Textual entailment
  o Document structure modelling
  o Overlap synchronisation
  o Insertion point detection
CoSyne MT system developed by UvA (Martzoukos and Monz, 2010)
Language pairs covered in year 1: DE / IT / NL ↔ EN
Focus of this user evaluation
  o CoSyne MT software to translate wiki entries DE → EN and NL → EN
5
Related work
MT quality evaluation
  o fluency
  o adequacy
Automatic MT evaluation metrics, esp. for SMT (Toral et al., 2011)
  o BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), etc.
  o no insight into the nature and severity of errors (e.g. for post-editing)
  o weak correlation with human judgement (Lin & Och, 2004)
Usefulness of MT output and users’ level of satisfaction
Post-editing
  o effort (e.g. Allen, 2003; O’Brien, 2007; Specia & Farzindar, 2010)
  o gains vs. translating from scratch (e.g. O’Brien, 2005; Specia, 2011)
6
Evaluation framework
User-focused task-oriented evaluation of MT in/for wikis
  o in close collaboration with end-users (DW, NISV)
Accompanied by diagnostic evaluation
  o providing useful feedback to MT developers (UvA)
Pilot study conducted just before month 18 of the 36-month project
  o full-scale final evaluation planned at the very end of the project
7
Evaluation scenario
Protocol for evaluation agreed between DCU and end-users
DW and NISV staff involved: editors, translators, project managers
  o German-English and Dutch-English as their working languages
  o final users of the CoSyne system for wiki content synchronization
Evaluation conducted on typical wiki entries for end-users
Users asked to focus only on linguistic quality and level of usefulness of MT (disregarding other components of the CoSyne system)
8
Evaluation scenario
Deutsche Welle (DW): KalenderBlatt / Today in History
9
Evaluation scenario
Netherlands Institute for Sound and Vision (NISV): wiki
10
Evaluation scenario
Netherlands Institute for Sound and Vision (NISV): wiki
11
Evaluation scenario
Time-tracking system was implemented
Post-editing changes performed by the participants were logged
Before the evaluation
  o participants given presentation and demo of the CoSyne system
  o preliminary experimentation with the CoSyne system for 1-3 hours
12
Evaluation questionnaire
Written questionnaire administered on paper
  o available at http://www.computing.dcu.ie/~atoral/cosyne/quest.pdf
Questions grouped into 6 parts focusing on different aspects
Approximately 50 items using different formats
  o Likert scale, multiple choice and open questions
Part A: basic demographic information about the respondents
Part B: previous use of MT
Part C: users' evaluation of the CoSyne MT system
Part D: post-editing work
Part E: general comments and feedback
(Part F: usability and interaction design of the overall CoSyne system)
13
Results: demographics
10 users: 6 from DW, 4 from NISV
6 men and 4 women across DW and NISV
Variety of roles: editors, authors, translators and project managers
Average age: 34 (youngest 20, oldest 46)
Average work experience: just over 3 years (min. 3 months, max. 10 years)
14
Results: background
All (4) NISV staff were native speakers of Dutch
5 DW users were native speakers of German + 1 native speaker of Romanian fluent in German
80% of the participants self-rated their knowledge of English as upper-intermediate, 20% defined it as intermediate or excellent
  o None of the respondents considered themselves bilingual
15
Results: previous use of MT
80% had used MT before our experiment
  o 7 for personal reasons, 6 for work (commonly for both purposes)
  o all but one had used Google Translate, 1 had tried Babel Fish, 2 both
Language combinations used
  o 4 from EN into other languages
  o 6 into EN from a range of source languages
  o 5 language combinations not involving English
75% used MT for assimilation purposes vs. 25% for dissemination
62.5% had post-edited raw MT to obtain high-quality translations
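The percentages on this slide can be read back as respondent counts. The sketch below is only an illustration: the helper name `respondents` is hypothetical, and it assumes that 80% refers to all 10 participants while 75%, 25% and 62.5% refer to the 8 who had used MT before (the slide does not state the base explicitly).

```python
from fractions import Fraction


def respondents(percentage: str, base: int) -> int:
    """Convert a reported percentage into a whole number of respondents.

    Raises ValueError if the percentage does not correspond to a whole
    number of people for the given base, which would suggest that the
    assumed base is wrong.
    """
    count = Fraction(percentage) / 100 * base
    if count.denominator != 1:
        raise ValueError(f"{percentage}% of {base} is not a whole number")
    return int(count)


# Assumed bases: 10 participants overall, 8 with prior MT experience.
prior_users = respondents("80", 10)
assimilation = respondents("75", prior_users)
dissemination = respondents("25", prior_users)
post_editors = respondents("62.5", prior_users)

print(prior_users, assimilation, dissemination, post_editors)
```

Under these assumptions every percentage maps to a whole number of people, which is consistent with the "8 respondents" mentioned on the following slides.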
16
Results: previous use of MT
Materials translated with MT by the 8 respondents
  o for study purposes (academic papers and uni-related texts): 3
  o business correspondence, personal or professional emails: 2
  o contracts and technical documents: 2
  o online articles: 2
  o websites: 2 (“the translations of Dutch sites to English were hilarious!”, but not using the CoSyne MT system!)
  o Wikipedia content: 1
17
Results: previous use of MT
Overall, the 8 respondents had a predominantly negative-to-neutral impression of MT quality before taking part in the evaluation of the CoSyne MT system, based on a 5-point Likert scale (average 2.8 / 5)
[Figure: Quality of previously used MT systems on a 5-point scale (1 = very poor to 5 = very good)]
18
Results: CoSyne MT system
[Figure: Quality and usefulness of the CoSyne MT system on a 5-point scale (1 = very poor to 5 = very good)]
Average quality is medium (3 / 5), better than previous experience (2.8)
Usefulness slightly higher than medium (3.3 / 5) (cf. 2.8)
19
Results: CoSyne MT system
Is CoSyne MT faster than translating wiki entries into English from scratch?
[Figure: responses on a 7-point scale (1 = strongly disagree to 7 = strongly agree)]
Average value higher than mid-point of the scale (4.6 / 7)
In line with e.g. Plitt & Masselot (2010) and Flournoy & Rueppel (2010)
From DE almost twice as good as from NL (due to style of wiki texts?)
20
Results: CoSyne MT system
MT quality broken down into: accuracy, correctness, comprehensibility, readability, style
[Figure: ratings for accuracy, correctness, comprehensibility, readability and style on a 7-point scale (1 = poor to 7 = excellent)]
We did not explain to users the subtle differences involved
Only accuracy is approx. average (3.6 / 7), other criteria lower
None of the average values particularly poor (DE always better than NL)
21
Results: post-editing CoSyne
Amount of work, in terms of time and effort to post-edit the MT output
[Figure: time and effort on a 7-point scale (1 = short/small to 7 = long/large)]
Need to refer to source language while post-editing
[Figure: frequency on a 7-point scale (1 = never to 7 = always)]
22
Results: post-editing CoSyne
Post-editing operations: insertion, deletion, substitution, reordering
Severity of errors over post-editing operations
[Figure: severity per operation on a 7-point scale (1 = irrelevant to 7 = very serious)]
Frequency of errors over post-editing operations
[Figure: frequency per operation on a 7-point scale (1 = absent to 7 = frequent)]
23
Results: final comments
Positive aspects:
  o good to have draft translation to work upon
  o integration in the wiki environment
  o potential to speed up the translation task
Weaknesses:
  o translation quality needs improving, due to
      wrong translation of pronouns
      verbs frequently dropped
      incorrect word order
      mistranslated compounds
      limited lexical coverage (OOV items are an issue)
Good potential of the CoSyne system based on first prototype
24
Conclusions
User-focused task-oriented questionnaire-based evaluation for MT used in wikis, supported by post-editing
Evaluation of the first Y1 prototype of the CoSyne MT system for DE → EN and NL → EN
Quality of the CoSyne MT system perceived by the users higher than that of previously used MT systems
Post-editing effort is considered high, but users found it less time-consuming than translating from scratch
Translations from German rated better than those from Dutch
  o contrasts with earlier findings (Toral et al., 2011)
  o further investigation into this discrepancy (meta-evaluation)
25
Future work
Extend analysis looking into the post-editing logs, considering actual post-editing time (to estimate costs)
Involve more users after pilot stage
Include a control group (translating manually or other MT s/w)
Investigate correlation between the post-editing carried out by the users and the results provided by TER and TERp (ins, del…)
Use our linguistically-aware diagnostic evaluation tool (DELiC4MT) to monitor performance of the MT system on specific issues flagged up by the users
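To make the planned comparison between logged post-edits and TER-style operation counts more concrete, here is a minimal sketch that counts word-level insertions, deletions and substitutions between raw MT output and its post-edited version. It is an illustration under stated assumptions, not the CoSyne tooling and not TER itself: it uses plain Levenshtein alignment with no shift (reordering) operation, and the function names and the example sentence pair are hypothetical.

```python
from typing import Dict, List


def edit_operations(mt: List[str], post_edited: List[str]) -> Dict[str, int]:
    """Count the word-level edits that turn `mt` into `post_edited`.

    Simplification: block shifts are not modelled, so a reordering shows
    up as a combination of insertions, deletions or substitutions.
    """
    n, m = len(mt), len(post_edited)
    # dist[i][j] = minimum number of edits turning mt[:i] into post_edited[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if mt[i - 1] == post_edited[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion of an MT word
                dist[i][j - 1] + 1,         # insertion of a new word
            )

    # Walk back through the table along one optimal path, counting edits.
    counts = {"ins": 0, "del": 0, "sub": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if mt[i - 1] == post_edited[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["sub"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["del"] += 1
            i -= 1
        else:
            counts["ins"] += 1
            j -= 1
    return counts


def edit_rate(mt: List[str], post_edited: List[str]) -> float:
    """Total edits divided by the length of the post-edited text."""
    total = sum(edit_operations(mt, post_edited).values())
    return total / len(post_edited)


# Hypothetical example: a verb left in final position by the MT system.
raw = "he has the book read".split()
fixed = "he has read the book".split()
print(edit_operations(raw, fixed), edit_rate(raw, fixed))
```

In this example the misplaced verb is counted as one insertion plus one deletion. TER would treat it as a single shift, which is why the sketch can only approximate the reordering category used in the questionnaire.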
26
Thank you for your attention! Questions?
User-focused task-oriented MT evaluation for wikis: a case study
Federico Gaspari, Antonio Toral, and Sudip Kumar Naskar
School of Computing, Dublin City University, Dublin 9, Ireland
{fgaspari, atoral, snaskar}@computing.dcu.ie