
Evaluation in NLP Zdeněk Žabokrtský

Intro The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system. Defining proper evaluation criteria is one way to specify an NLP problem precisely. Evaluation requires evaluation metrics and evaluation data (language data resources).

Automatic vs. manual evaluation automatic –comparing the system's output with the gold-standard output –producing the gold-standard data has a cost... –... but the evaluation is then easily repeatable without additional cost manual –for many NLP problems, the definition of a gold standard can prove impossible (e.g., when inter-annotator agreement is insufficient) –manual evaluation is performed by human judges, who are instructed to estimate the quality of a system based on a number of criteria

Intrinsic vs. extrinsic evaluation Intrinsic evaluation –considers an isolated NLP system and characterizes its performance mainly with respect to a gold-standard result Extrinsic evaluation –considers the NLP system as a component in a more complex setting

Lower and upper bounds Naturally, the performance of our system is expected to fall inside the interval given by –lower bound - the result of a baseline solution (a less complex or even trivial system whose performance should be easy to surpass) –upper bound - inter-annotator agreement

Evaluation metrics the simplest case: –if the number of task instances is known –if the system gives exactly one answer for each instance –if there is exactly one clearly correct answer for each instance –if all errors are equally wrong –then simple accuracy (the proportion of correctly answered instances) is sufficient But what if not?
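
A minimal sketch (not part of the original slides) of this simple case: accuracy as the proportion of correctly answered instances, in Python.

# Accuracy: every instance gets exactly one answer and has exactly one correct answer.
def accuracy(gold, predicted):
    assert len(gold) == len(predicted)
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

print(accuracy(["N", "V", "N"], ["N", "V", "V"]))  # 0.666...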

Precision and recall But –we are in 2D now (two numbers instead of one) –new issue: the precision-recall tradeoff
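
Standard definitions of precision (P) and recall (R) in terms of true positives (TP), false positives (FP) and false negatives (FN); the formulas are not spelled out in the transcript, but these are the usual ones:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}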

F-measure one way to get back to 1D –the weighted harmonic mean of P and R, usually evenly weighted
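
The usual weighted form and the evenly weighted F1 (standard formulas, added here for completeness):

F_\beta = \frac{(1 + \beta^2)\,P\,R}{\beta^2 P + R}, \qquad F_1 = \frac{2PR}{P + R}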

Evaluation in classification tasks confusion matrix –a table whose rows correspond to gold-standard classes and columns to predicted classes; per-class precision and recall can be read directly from it
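
A minimal sketch (not from the slides) of building a confusion matrix and reading per-class precision and recall from it; the label names are just illustrative:

from collections import Counter

def confusion_matrix(gold, predicted):
    # (gold_label, predicted_label) -> count
    return Counter(zip(gold, predicted))

def precision(cm, label):
    predicted_as = sum(c for (g, p), c in cm.items() if p == label)
    return cm[(label, label)] / predicted_as if predicted_as else 0.0

def recall(cm, label):
    gold_as = sum(c for (g, p), c in cm.items() if g == label)
    return cm[(label, label)] / gold_as if gold_as else 0.0

gold = ["A", "A", "B", "B", "B"]
pred = ["A", "B", "B", "B", "A"]
cm = confusion_matrix(gold, pred)
print(precision(cm, "B"), recall(cm, "B"))   # 0.666..., 0.666...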

Evaluation in phrase-structure parsing nontrivial, because –the number of added nonterminal nodes is not known in advance –it is not clear what should be treated as the (atomic) task instance GEIG metric –Grammar Evaluation Interest Group; used in Parseval –counts the proportion of bracketings that group the same sequences of words in both trees LA metric –leaf-ancestor metric - similarity of the sequences of node labels along the paths from terminal nodes to the tree root
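
A rough sketch of Parseval-style bracketing precision/recall/F, assuming trees are represented simply as sets of (start, end) spans; the real GEIG/Parseval metric involves further details (e.g. label handling, crossing brackets) not modelled here:

def bracket_prf(gold_spans, predicted_spans):
    # spans shared by both trees count as correct bracketings
    matched = len(gold_spans & predicted_spans)
    p = matched / len(predicted_spans) if predicted_spans else 0.0
    r = matched / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
pred = {(0, 5), (0, 3), (3, 5)}
print(bracket_prf(gold, pred))   # (0.666..., 0.5, 0.571...)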

Evaluation in dependency parsing straightforward delimitation of the problem instance –no nonterminal nodes are added unlabeled accuracy –the proportion of correctly attached nodes (nodes with a correctly predicted parent) labeled accuracy –the proportion of nodes which are correctly attached and whose labels (dependency relations) are correct too
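
A minimal sketch (not from the slides) of unlabeled and labeled attachment scores, assuming each token is represented as a (head, dependency-relation) pair:

def attachment_scores(gold, predicted):
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n   # correct heads
    las = sum(g == p for g, p in zip(gold, predicted)) / n         # correct heads and labels
    return uas, las

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
print(attachment_scores(gold, pred))   # (1.0, 0.666...)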

Evaluation in automatic speech recognition not only might the recognized words be incorrect, but the number of recognized words may differ from the reference too WER - Word Error Rate –the number of substitutions, deletions and insertions (obtained from an edit-distance alignment against the reference) divided by the number of reference words
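
A minimal sketch (not from the slides): WER computed as the word-level Levenshtein distance divided by the reference length:

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = minimal number of edits turning r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i-1][j-1] + (r[i-1] != h[j-1])
            dp[i][j] = min(sub, dp[i-1][j] + 1, dp[i][j-1] + 1)
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat the hat"))  # 2/6 = 0.333...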

Evaluation in machine translation obviously highly non-trivial –there are always several possible translations –there are always several criteria to be judged (translation fidelity, grammatical/pragmatic/stylistic correctness...?) –the essence of translation -- transferring the same meaning from one natural language to another -- cannot be evaluated by contemporary machines at all !!! current approaches –either use human judges –or use reference (human-made) translations and string-wise comparison metrics which are hoped to correlate with human judgement: BLEU, NIST, METEOR...

BLEU abbreviation of BiLingual Evaluation Understudy BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) N - maximal considered n-gram length (usually 4) p_n - n-gram precision against (a set of) reference translation(s) w_n - positive weights (typically 1/N) BP - brevity penalty (to compensate for the easier n-gram precision on shorter candidate sentences): BP = 1 if c > r, otherwise exp(1 - r/c) r - length of the reference translation c - length of the candidate translation
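
A simplified illustration (not the official BLEU implementation): sentence-level BLEU with a single reference, uniform weights and no smoothing; real BLEU is computed over a whole corpus and is usually smoothed:

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

def bleu(reference, candidate, N=4):
    r, c = reference.split(), candidate.split()
    log_p_sum = 0.0
    for n in range(1, N + 1):
        cand_ngrams, ref_ngrams = ngrams(c, n), ngrams(r, n)
        # clipped n-gram matches against the reference
        matched = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        if matched == 0 or total == 0:
            return 0.0          # no smoothing: any zero precision zeroes the score
        log_p_sum += (1.0 / N) * math.log(matched / total)
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / len(c))   # brevity penalty
    return bp * math.exp(log_p_sum)

print(bleu("the cat sat on a mat", "the cat sat on the mat"))   # ≈ 0.54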

Interannotator agreement IAA - a measure of how well human experts agree when given a specific task (used to measure the reliability of manual annotations) e.g. F-measure on data from two annotators (one of them virtually treated as the gold standard; symmetric if F1 is used) But: a nonzero value is obtained even if the annotators' decisions are uncorrelated example –two annotators making classifications into two classes –1st annotator: 80% A, 20% B –2nd annotator: 85% A, 15% B –probability of agreement by chance: 0.8*0.85 + 0.2*0.15 = 0.71 = 71% desired measure: 1 if they agree in all decisions, 0 if their agreement equals the agreement by chance

Cohen's Kappa takes into account the agreement occurring by chance: \kappa = \frac{P_a - P_e}{1 - P_e} P_a - relative observed agreement between annotators P_e - probability of agreement by chance but kappa -- as a means for quantifying the actual level of agreement -- is still a source of much controversy
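
A minimal sketch (not from the slides) of Cohen's kappa for two annotators labelling the same items:

from collections import Counter

def cohens_kappa(ann1, ann2):
    n = len(ann1)
    p_a = sum(a == b for a, b in zip(ann1, ann2)) / n            # observed agreement
    c1, c2 = Counter(ann1), Counter(ann2)
    labels = set(c1) | set(c2)
    p_e = sum((c1[l] / n) * (c2[l] / n) for l in labels)         # chance agreement
    return (p_a - p_e) / (1 - p_e)

ann1 = ["A", "A", "A", "A", "B"]
ann2 = ["A", "A", "A", "B", "B"]
print(cohens_kappa(ann1, ann2))   # ≈ 0.545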

Evaluation rounding The number of significant digits is linked to the experiment setting and reflects the uncertainty of its result. Writing more digits in an answer than are justified by the data is bad. Do not say that the error rate of your system is 42.857142 % if it has made 3 errors in 7 task instances (superfluous precision). Basic rules for rounding: –Multiplication/division - the number of significant digits in the answer should equal the least number of significant digits in any one of the numbers being multiplied/divided. –Addition/subtraction - the number of decimal places (not significant digits) in the answer should be the same as the least number of decimal places in any of the numbers being added or subtracted. But: the number of significant digits in a value provides only a very rough indication of its precision -- better to use a confidence interval (e.g. 3.28 ± 0.05) at a certain probability level (typically 95%).
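
A minimal sketch of one way to report such a confidence interval: a normal-approximation 95% interval for an error rate (the approximation is crude for very small samples like n=7, but it illustrates the point):

import math

def error_rate_ci(errors, n, z=1.96):          # z = 1.96 for ~95% confidence
    p = errors / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, half_width

p, hw = error_rate_ci(errors=3, n=7)
print(f"{p:.2f} ± {hw:.2f}")   # 0.43 ± 0.37 -- extra digits would be meaningless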

Towards more robust evaluation K-fold cross validation (usually K=10): –1) partition the data into K roughly equally sized subsamples –2) perform K iterations, cyclically: use K-1 subsamples for training use 1 subsample for testing –3) average the results over the iterations this gives more reliable results, especially if you have only small or in some sense non-uniform data
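
A minimal sketch (not from the slides) of K-fold cross-validation; evaluate(train, test) is a hypothetical function returning a single score:

def k_fold_cross_validation(data, evaluate, k=10):
    folds = [data[i::k] for i in range(k)]             # K roughly equal subsamples
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(evaluate(train, test))
    return sum(scores) / k                              # average over the iterations

# usage with a dummy evaluate function:
data = list(range(100))
print(k_fold_cross_validation(data, lambda tr, te: len(te) / len(data), k=10))  # 0.1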

"Shared Tasks" in NLP contests in implementing systems for a specified task some of them quite popular in the NLP community (e.g. CoNLL) conditioned by existence of training and evaluation data and of evaluation metrics Examples: –Message Understanding Conferences (MUCs) 's, information retrieval, named entity recognition –Parseval 1991, phrase-structure parsing –Senseval, Semeval word sense disambiguation –WMT Shared Task ACL Workshop in machine translation, –CoNLLL Shared Task Conference on Computational Natural Language Learning named entity resolution (2003) semantic role labeling (2004,2005) multilingual dependency parsing (2006, 2007) Joint Parsing of Syntactic and Semantic Dependencies