Evaluation: State of the art and future actions
Bente Maegaard, CST, University of Copenhagen

Bente Maegaard, LREC
Evaluation at LREC
More than 150 papers were submitted to the Evaluation track, both Written and Spoken.
This is a significant rise compared to previous years.
Evaluation as a field is attracting increasing interest.
Many papers discuss evaluation methodology: the field is still under development, and the answers to some of the methodological questions are still not known.
An example: MT
  Automatic evaluation
  Evaluation in context (task-based, function-based)

Evaluation: Written
Topic                                     Papers
Parsing evaluation                             6
Semantics, sense                               6
Evaluation methodologies                       7
Time annotation                                9
MT                                            13
Annotation, alignment, morph.                 15
Lexica, tools                                 21
QA, IR, IE, summarisation, authoring          25
Total                                        102
Note: These figures may include papers that were originally in other tracks.

Discussion: MT evaluation
MT evaluation since 1965
Van Slype: adequacy, fluency, fidelity
Human evaluation: expensive and time-consuming, with problems in the counting of errors. Is it objective?
Formalising human evaluation, adding e.g. grammaticality
Another measure: cost of post-editing, which is objective
Automatic evaluation: Papineni et al. 2001: BLEU, with various modifications
  The reference translations are expensive to establish; after that, evaluation is cheap and fast.
  However, research shows that this automatic method does not correlate well with human evaluation, nor with the cost of post-editing etc.
Automatic statistical evaluation can probably be used for evaluation of MT for gisting, but it cannot be used for MT for publishing.

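To make the BLEU idea concrete, here is a minimal single-sentence sketch in Python. It is a simplification for illustration, not the reference implementation: it assumes whitespace-tokenised input, applies no smoothing, and the function names (`ngrams`, `modified_precision`, `bleu`) are hypothetical. It shows the clipped n-gram precision and the brevity penalty, and why a perfectly acceptable translation can still score zero against a single reference.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as often as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU: geometric mean of the modified
    precisions for n = 1..max_n, multiplied by a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Effective reference length: the reference closest in length to the candidate.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_avg)


reference = "the cat is on the mat".split()
print(bleu(reference, [reference]))                       # identical sentence
print(bleu("a cat sits on the mat".split(), [reference]))  # valid paraphrase
```

The paraphrase shares no 4-gram with the reference, so its unsmoothed score is zero even though a human would accept it. This is one simple way to see why such scores need not track human judgement.
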
Digression: Metrics that do not work
Why is it so difficult to evaluate MT?
  Because there is more than one correct answer.
  And because answers may be more or less correct.
Measures like WER are not relevant for MT.
Methods relying on a specific number of words in the translation are not OK (if the translation does not have the same number of words as the reference).

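The point about WER can be illustrated with a small Python sketch. It assumes whitespace tokenisation and the usual definition of word error rate as word-level edit distance divided by reference length; the function name `wer` and the example sentences are illustrative only.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance (substitutions,
    insertions, deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# Two equally acceptable word orders for the same translation.
print(wer("he left the house quickly", "he quickly left the house"))
```

Here the hypothesis is a correct translation with a different word order, yet it is charged two edits out of five reference words. Because a translation has more than one correct answer, the metric penalises legitimate variation.
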
Generic Contextual Quality Model (GCQM)
Popescu-Belis et al., LREC 2006
Building on the same thinking as the FEMTI taxonomy
One can only evaluate a system in the context in which it will be used.
Quality workshop 27/5: task-based, function-based evaluation (Huang, Take)
Karen Spärck Jones: ‘the set-up’
So the understanding that a system can only be reasonably evaluated with respect to a specific task is accepted.
Domain-specific vs. general-purpose MT

What do we need? When?
What?
  In the field of MT evaluation we need more experiments in order to establish a methodology. The French CESTA (Hamon et al., LREC 2006) is a good example.
  So we need international cooperation for the infrastructure, but in the first instance this cooperation should lead to reliable metrics for MT evaluation. Later on it may be used for actually measuring MT systems’ performance. (Of course, not only MT!)
When?
  As soon as possible.
  Start with methodology, for each application.
  Move on to doing evaluation.
  Goal: in 2011 we can reliably evaluate MT, and other applications!