EXPERIMENTAL TECHNIQUES & EVALUATION IN NLP. Università di Venezia, 1 October 2003.



The rise of empiricism
Up until the 1980s, CL (computational linguistics) was primarily a theoretical discipline. The experimental methodology now receives much more attention.

Empirical methodology & evaluation
Starting with the big US ASR competitions of the 1980s, evaluation has progressively become a central component of work in NLP:
– DARPA Speech initiative
– MUC
– TREC
GOOD:
– Much easier for the community (and for researchers themselves) to understand which proposals are real improvements
BAD:
– Too much focus on small improvements
– Cannot afford to try an entirely new technique (it may not lead to improvements for a couple of years!)

Typical developmental methodology in CL

Training set and test set
Models are estimated / systems are developed using a TRAINING SET. The training set should be:
– representative of the task
– as large as possible
– well-known and understood

The test set
Estimated models are evaluated using a TEST SET. The test set should be:
– disjoint from the training set
– large enough for results to be reliable
– unseen
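A minimal sketch (in Python) of carving a disjoint, unseen test set out of an annotated corpus; the toy corpus and the 90/10 split ratio are illustrative assumptions, not something prescribed by the slides:

```python
import random

# Hypothetical annotated corpus: (sentence, tag sequence) pairs.
corpus = [(f"sentence {i}", ["TAG"] * 5) for i in range(1000)]

random.seed(0)          # fixed seed so the split is reproducible
random.shuffle(corpus)  # avoid ordering effects (e.g. by genre or date)

split = int(0.9 * len(corpus))   # 90/10 split: an illustrative choice
train_set = corpus[:split]       # used to estimate models / develop the system
test_set = corpus[split:]        # kept unseen until the final evaluation

# The two sets are disjoint by construction.
assert not {s for s, _ in train_set} & {s for s, _ in test_set}
```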

Possible problems with the training set
Too small → performance drops
OVERFITTING can be reduced using:
– cross-validation (a large variance across folds may mean the training set is too small)
– large priors
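A sketch of k-fold cross-validation under stated assumptions: the learner is a trivial stand-in (a majority-label predictor) and the data are toy (item, label) pairs. The spread of the per-fold scores is the variance the slide refers to:

```python
import statistics
from collections import Counter

def train_and_evaluate(train_fold, held_out_fold):
    """Stand-in learner: predict the most frequent training label.
    Replace with any real training + evaluation procedure."""
    majority = Counter(label for _, label in train_fold).most_common(1)[0][0]
    correct = sum(1 for _, label in held_out_fold if label == majority)
    return correct / len(held_out_fold)

def cross_validate(data, k=5):
    # Each of the k folds serves once as held-out data while the
    # remaining k - 1 folds are used for training.
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_evaluate(train, held_out))
    # A large standard deviation across folds can be a symptom of a
    # training set that is too small.
    return statistics.mean(scores), statistics.stdev(scores)

# Toy data: (item, label) pairs, purely illustrative.
data = [(i, "NN" if i % 3 else "VB") for i in range(90)]
print(cross_validate(data, k=5))
```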

Possible problems with the test set
Are results obtained on the test set believable?
– Results might be distorted if the test set is too easy or too hard
– The training set and the test set may be too different (language is non-stationary)

Evaluation
Two types:
– BLACK BOX (the system evaluated as a whole)
– WHITE BOX (components evaluated independently)
Typically QUANTITATIVE (but QUALITATIVE evaluation is needed as well)

Simplest quantitative evaluation metrics
ACCURACY: percentage correct (against some gold standard)
– e.g., a tagger gets 96.7% of tags correct when evaluated on the Penn Treebank
ERROR: percentage wrong
– ERROR REDUCTION is the most typical metric in ASR
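The metrics above in a few lines of Python; the tag sequences and the 4% → 3% error figures are made up for illustration:

```python
def accuracy(gold, predicted):
    # Proportion of items on which the system agrees with the gold standard.
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def error_rate(gold, predicted):
    return 1.0 - accuracy(gold, predicted)

def relative_error_reduction(old_error, new_error):
    # How much of the previous system's error the new system removes:
    # going from 4% to 3% error is a 25% relative error reduction.
    return (old_error - new_error) / old_error

gold      = ["DT", "NN", "VBZ", "JJ", "NN"]   # illustrative tag sequences
predicted = ["DT", "NN", "VBZ", "NN", "NN"]
print(accuracy(gold, predicted))               # 0.8
print(error_rate(gold, predicted))             # ~0.2
print(relative_error_reduction(0.04, 0.03))    # ~0.25
```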

A more general form of evaluation: precision & recall

Positives and negatives
                    selected                not selected
correct items       TRUE POSITIVES (TP)     FALSE NEGATIVES (FN)
incorrect items     FALSE POSITIVES (FP)    TRUE NEGATIVES (TN)

Precision and recall
PRECISION: the proportion of correct items AMONG THE SELECTED ITEMS
RECALL: the proportion of the correct items that were selected
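A small sketch that computes precision and recall directly from the counts in the contingency table above; the counts themselves are invented for illustration:

```python
def precision(tp, fp):
    # Of the items the system selected, how many were actually correct?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of the correct items, how many did the system manage to select?
    return tp / (tp + fn)

tp, fp, fn = 40, 10, 20      # illustrative counts, not data from the slides
print(precision(tp, fp))     # 0.8
print(recall(tp, fn))        # ~0.67
```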

The tradeoff between precision and recall
Easy to get high precision: select almost nothing (only the items you are most certain about)
Easy to get high recall: return everything
You really need to report BOTH, or the F-measure
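A sketch of the F-measure (the weighted harmonic mean of precision and recall) showing how "return everything" buys recall at the cost of precision; the numbers are invented:

```python
def f_measure(p, r, beta=1.0):
    # Weighted harmonic mean of precision and recall; beta = 1 weights
    # them equally (the usual balanced F1).
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# "Return everything" pushes recall to 1.0 but drags precision down,
# and the F-measure exposes the poor tradeoff.
print(f_measure(p=0.05, r=1.0))    # ~0.10
print(f_measure(p=0.80, r=0.75))   # ~0.77
```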

Simple vs. multiple runs
A single run may be lucky:
– Do multiple runs
– Report averaged results
– Report the degree of variation
– Do SIGNIFICANCE TESTING (cf. the t-test, etc.)
Many people are lazy and just report single runs.
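One way to follow this advice, assuming SciPy is available; the per-run accuracies of the two hypothetical systems are made up for illustration:

```python
from statistics import mean, stdev
from scipy.stats import ttest_rel   # paired t-test

# Hypothetical accuracies of two systems over the same five runs/folds.
system_a = [0.912, 0.905, 0.918, 0.909, 0.915]
system_b = [0.921, 0.917, 0.925, 0.919, 0.924]

# Report the average and the degree of variation, not just one run.
print(f"A: {mean(system_a):.3f} +/- {stdev(system_a):.3f}")
print(f"B: {mean(system_b):.3f} +/- {stdev(system_b):.3f}")

# Paired t-test: are the per-run differences unlikely under the null
# hypothesis that the two systems perform equally well?
t_stat, p_value = ttest_rel(system_a, system_b)
print(t_stat, p_value)
```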

Interpreting results
A 97% accuracy may look impressive... but not so much if 98% of the items have the same tag: you need a BASELINE.
An F-measure of 0.7 may not look very high, unless you are told that humans only achieve 0.71 on this task: you need an UPPER BOUND.
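A minimal majority-class baseline matching the slide's 97%-vs-98% point; the tag names are arbitrary:

```python
from collections import Counter

def majority_baseline_accuracy(gold_tags):
    # Accuracy of always predicting the most frequent tag: the minimum
    # any real system should beat before claiming an improvement.
    most_common_count = Counter(gold_tags).most_common(1)[0][1]
    return most_common_count / len(gold_tags)

# If 98% of the items carry the same tag, the baseline is 0.98,
# so a 97%-accurate system is actually below it.
gold_tags = ["O"] * 98 + ["NE"] * 2
print(majority_baseline_accuracy(gold_tags))   # 0.98
```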

Confusion matrices
Once you've evaluated your model, you may want to do some ERROR ANALYSIS. This is usually done with a CONFUSION MATRIX, e.g. (rows: gold tag, columns: assigned tag):
        JJ    NN    VB
JJ      ..    25    ..
NN      37    ..    ..
VB      1     5     4
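A tiny sketch of building a confusion matrix for error analysis; the tag sequences are invented and do not reproduce the slide's numbers:

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    # Count how often each gold tag was assigned each predicted tag;
    # the off-diagonal cells are the confusions worth inspecting.
    return Counter(zip(gold, predicted))

gold      = ["JJ", "NN", "NN", "VB", "JJ", "NN"]   # illustrative only
predicted = ["JJ", "NN", "JJ", "VB", "NN", "NN"]

matrix = confusion_matrix(gold, predicted)
for (g, p), count in sorted(matrix.items()):
    print(f"gold={g}  predicted={p}  count={count}")
```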

Readings
Manning and Schütze, chapter 8.1