
1 September 2003
EXPERIMENTAL TECHNIQUES & EVALUATION IN NLP
Università di Venezia, 1 October 2003

2 The rise of empiricism
Up until the 1980s, CL was primarily a theoretical discipline. Experimental methodology now receives much more attention.

3 Empirical methodology & evaluation
Starting with the big US ASR competitions of the 1980s, evaluation has progressively become a central component of work in NLP:
- DARPA Speech initiative
- MUC
- TREC
GOOD: much easier for the community (and for the researchers themselves) to tell which proposals are real improvements.
BAD: too much focus on small improvements; nobody can afford to try an entirely new technique (it may not lead to improvements for a couple of years!).

4 Typical developmental methodology in CL

5 Training set and test set
Models are estimated / systems developed using a TRAINING SET. The training set should be:
- representative of the task
- as large as possible
- well-known and understood

6 The test set
Estimated models are evaluated using a TEST SET. The test set should be:
- disjoint from the training set
- large enough for results to be reliable
- unseen

7 Possible problems with the training set
Too small → performance drops. OVERFITTING can be reduced using:
- cross-validation (large variance across folds may mean the training set is too small)
- large priors
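The cross-validation check mentioned above can be sketched in Python. This is a minimal, illustrative version: `k_fold_splits` and the fold scores are invented for the example, not taken from the slides.

```python
import statistics

def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    fold_size = n_items // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_items
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

# Hypothetical accuracies from evaluating a model on each held-out fold:
fold_scores = [0.91, 0.94, 0.88, 0.93, 0.90]
mean = statistics.mean(fold_scores)    # average performance
spread = statistics.stdev(fold_scores) # large spread -> training set may be too small
```

Each item is held out exactly once, so the fold scores are estimates on genuinely unseen data.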

8 Possible problems with the test set
Are results on the test set believable?
- results may be distorted if the test set is too easy or too hard
- training set and test set may be too different (language is non-stationary)

9 Evaluation
Two types:
- BLACK BOX (the system as a whole)
- WHITE BOX (components evaluated independently)
Typically QUANTITATIVE (but QUALITATIVE evaluation is needed as well).

10 Simplest quantitative evaluation metrics
ACCURACY: percentage correct (against some gold standard), e.g., a tagger gets 96.7% of tags correct when evaluated on the Penn Treebank.
ERROR: percentage wrong. ERROR REDUCTION is the most typical metric in ASR.
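A minimal sketch of these two metrics in Python; the tag sequences below are a toy example (the 96.7% Penn Treebank figure above is from the slide, not reproduced here):

```python
def accuracy(gold, predicted):
    """Fraction of predictions that match the gold standard."""
    assert len(gold) == len(predicted)
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def error_reduction(old_error, new_error):
    """Relative error reduction, the usual way ASR improvements are reported."""
    return (old_error - new_error) / old_error

gold      = ["NN", "VB", "JJ", "NN"]
predicted = ["NN", "VB", "NN", "NN"]
acc = accuracy(gold, predicted)   # 3 of 4 tags correct -> 0.75
err = 1 - acc                     # error = 0.25
```

Note that going from 4% to 3% error is a 25% error reduction, which sounds more impressive than "accuracy rose from 96% to 97%"; this is why ASR reports error reduction.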

11 A more general form of evaluation: precision & recall
[slide body garbled in the transcript; only the title is recoverable]

12 Positives and negatives
The four outcomes: TRUE POSITIVES (TP), FALSE POSITIVES (FP), TRUE NEGATIVES (TN), FALSE NEGATIVES (FN).

13 Precision and recall
PRECISION: proportion correct AMONG THE SELECTED ITEMS = TP / (TP + FP)
RECALL: proportion of the correct items that were selected = TP / (TP + FN)
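These two definitions can be sketched directly from the TP/FP/FN counts of the previous slide; the retrieval run below is a hypothetical example:

```python
def precision_recall(gold, predicted):
    """Precision and recall of a set of `predicted` items against a `gold` set."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # true positives: selected and correct
    fp = len(predicted - gold)   # false positives: selected but wrong
    fn = len(gold - predicted)   # false negatives: correct but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    return precision, recall

# Hypothetical run: 4 items returned, 5 are actually relevant, 3 overlap.
p, r = precision_recall(gold={1, 2, 3, 4, 5}, predicted={3, 4, 5, 6})
# p = 3/4 = 0.75, r = 3/5 = 0.6
```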

14 The tradeoff between precision and recall
Easy to get high precision: never classify anything. Easy to get high recall: return everything. You really need to report BOTH, or the F-measure.
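The F-measure is the weighted harmonic mean of precision and recall, which is what makes it resistant to the "return everything" trick; a minimal sketch (the example values are invented):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta = 1)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# "Return everything": recall is 1.0 but precision is poor -> F1 stays low.
f_everything = f_measure(0.1, 1.0)   # ~0.18, the harmonic mean punishes imbalance
f_balanced   = f_measure(0.7, 0.7)   # 0.7
```

Because the harmonic mean is dominated by the smaller of the two values, gaming either precision or recall alone cannot produce a high F-measure.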

15 Single vs. multiple runs
A single run may just be lucky:
- do multiple runs
- report averaged results
- report the degree of variation
- do SIGNIFICANCE TESTING (cf. the t-test, etc.)
Many people are lazy and report only single runs.
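One way to do the significance testing mentioned above is a paired t-test over per-run scores. A stdlib-only sketch of the test statistic (the two systems' accuracies are hypothetical; in practice you would compare `t` against the t-distribution with n-1 degrees of freedom):

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for a paired t-test over matched per-run (or per-fold) scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)          # sample standard deviation of differences
    return mean_diff / (sd / math.sqrt(n))

# Hypothetical accuracies of two systems over the same 5 test splits:
system_a = [0.91, 0.94, 0.88, 0.93, 0.90]
system_b = [0.89, 0.92, 0.87, 0.90, 0.88]
t = paired_t_statistic(system_a, system_b)  # compare to t-distribution, df = 4
```

Pairing the runs (same splits for both systems) removes split-to-split variation, so small but consistent differences can still be significant.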

16 Interpreting results
A 97% accuracy may look impressive... but not so much if 98% of the items have the same tag: you need a BASELINE.
An F-measure of .7 may not look very high, unless you are told that humans only achieve .71 at this task: you need an UPPER BOUND.
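The usual baseline for tagging is the majority class: always predict the most frequent tag. A minimal sketch, using an invented tag distribution matching the 98% scenario above:

```python
from collections import Counter

def majority_baseline_accuracy(gold_tags):
    """Accuracy of the trivial system that always predicts the most frequent tag."""
    most_common_count = Counter(gold_tags).most_common(1)[0][1]
    return most_common_count / len(gold_tags)

# Hypothetical skewed tagset: 98% of tokens share one tag.
tags = ["NN"] * 98 + ["VB"] * 2
baseline = majority_baseline_accuracy(tags)   # 0.98 -- already beats 97% accuracy
```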

17 Confusion matrices
Once you've evaluated your model, you may want to do some ERROR ANALYSIS. This is usually done with a CONFUSION MATRIX: cell (i, j) counts how often gold tag i was labelled as tag j. [The slide's example matrix cross-tabulates JJ, NN and VB; its cell values are garbled in the transcript.]
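A confusion matrix is easy to build from paired gold and predicted tags; the tiny tag sequences below are illustrative:

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Map (gold_tag, predicted_tag) -> count; off-diagonal cells are the errors."""
    return Counter(zip(gold, predicted))

gold      = ["JJ", "NN", "NN", "VB", "JJ"]
predicted = ["JJ", "NN", "VB", "VB", "NN"]
matrix = confusion_matrix(gold, predicted)
# matrix[("NN", "VB")] == 1  -> one NN was mistagged as VB
```

Scanning the largest off-diagonal cells tells you which tag pairs the model confuses most, which is exactly the error analysis the slide describes.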

18 Readings
Manning and Schütze, chapter 8.1
