Evaluation of NLP Systems
Martin Hassel
KTH CSC, Royal Institute of Technology
100 44 Stockholm
+46-8-790 66 34

Why Evaluate?
Otherwise you won't know if what you're doing is any good! Human languages are very loosely defined, which makes it hard to prove that something works (the way you can in mathematics or logic).

Aspects of Evaluation
General aspects – to measure progress
Commercial aspects – to ensure consumer satisfaction; edge against competitors / PR
Scientific aspects – good science

Manual Evaluation
Human judges
+ Semantically based assessment
– Subjective
– Time consuming
– Expensive

Semi-Automatic Evaluation
Task-based evaluation
+ Measures the system's utility
– Subjective interpretation of questions and answers

Automatic Evaluation
Example from text summarization: sentence recall
+ Cheap and repeatable
– Does not distinguish between different summaries
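
A minimal sketch (the function and the sentence identifiers are invented for illustration, not taken from the slides) of how sentence recall could be computed for an extractive summary:

```python
def sentence_recall(system_sentences, gold_sentences):
    """Fraction of gold-standard sentences that also appear in the system summary."""
    gold = set(gold_sentences)
    if not gold:
        return 0.0
    return len(set(system_sentences) & gold) / len(gold)

# Hypothetical example: the system picked sentences s1 and s3,
# while the gold-standard summary contains s1, s2 and s4.
print(sentence_recall({"s1", "s3"}, {"s1", "s2", "s4"}))  # 0.33...
```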

Why Automatic Evaluation?
Manual labor is expensive and takes time.
It's practical to be able to evaluate often – does this parameter lead to improvements?
It's tedious to evaluate manually.
Human factor – people tend to tire and make mistakes.

Corpora
A body of data considered to represent ”reality” in a balanced way
Sampling
Raw format vs. annotated data

Corpora can be… a part-of-speech tagged data collection (here a tagged Swedish sentence):
Arrangör       nn.utr.sin.ind.nom
var            vb.prt.akt.kop
Järfälla       pm.gen
naturförening  nn.utr.sin.ind.nom
där            ha
Margareta      pm.nom
är             vb.prs.akt.kop
medlem         nn.utr.sin.ind.nom
.              mad

Corpora can be… a parse tree data collection
(S (NP-SBJ (NNP W.R.) (NNP Grace))
   (VP (VBZ holds)
       (NP (NP (CD three))
           (PP (IN of)
               (NP (NP (NNP Grace) (NNP Energy) (POS 's))
                   (CD seven) (NN board) (NNS seats)))))
   (. .))

Corpora can be… a collection of sound samples

Widely Accepted Corpora
Pros
Well-defined origin and context
Well-established evaluation schemes
Inter-system comparability
Cons
Optimizing for a specific data set
May establish a common “truth”

Gold Standard
Counting ”correct guesses” demands knowing what the result should be.
This ”optimal” result is often called a gold standard.
How the gold standard looks and how you count can differ a lot between tasks, but the basic idea is the same.

Example of a Gold Standard
Gold standard for tagging, shallow parsing and clause boundary detection:
Han     pn.utr.sin.def.sub                   NPB              CLB
är      vb.prs.akt.kop                       VCB              CLI
mest    ab.suv                               ADVPB|APMINB     CLI
road    jj.pos.utr.sin.ind.nom               APMINB|APMINI    CLI
av      pp                                   PPB              CLI
äldre   jj.kom.utr/neu.sin/plu.ind/def.nom   APMINB|NPB|PPI   CLI
sorter  nn.utr.plu.ind.nom                   NPI|PPI          CLI
.       mad                                  0                CLI

Some Common Measures
Precision = correct guesses / all guesses
Recall = correct guesses / all correct answers
Precision and recall are often mutually dependent:
higher recall → lower precision
higher precision → lower recall
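
As a rough illustration of the measures above (the counts are made up), precision, recall and the commonly reported F1 score could be computed like this:

```python
def precision(correct_guesses, all_guesses):
    return correct_guesses / all_guesses if all_guesses else 0.0

def recall(correct_guesses, all_correct_answers):
    return correct_guesses / all_correct_answers if all_correct_answers else 0.0

def f1(p, r):
    # Harmonic mean of precision and recall (not on the slide, but widely used).
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical run: 80 correct guesses out of 100 guesses,
# with 120 correct answers in the gold standard.
p, r = precision(80, 100), recall(80, 120)
print(p, r, f1(p, r))  # 0.8  0.67  0.73 (rounded)
```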

More Evaluation Terminology
True positive – alarm given at a correct point
False negative – no alarm when one should be given
False positive – alarm given when none should be given
(True negative) – the algorithm is quiet on uninteresting data
In e.g. spell checking the above could correspond to detected errors, missed errors, false alarms and correct words without warning.
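
A sketch of how these four counts could be tallied for a spell checker, assuming hypothetical aligned lists of gold labels and system alarms:

```python
def confusion_counts(gold_is_error, system_alarm):
    """Count true/false positives and negatives over aligned token decisions."""
    tp = fp = fn = tn = 0
    for is_error, alarm in zip(gold_is_error, system_alarm):
        if alarm and is_error:
            tp += 1   # detected error
        elif alarm and not is_error:
            fp += 1   # false alarm
        elif is_error:
            fn += 1   # missed error
        else:
            tn += 1   # correct word, no warning
    return tp, fp, fn, tn

# Hypothetical data: True = misspelled token (gold) / alarm raised (system).
gold  = [False, True, False, True, False]
alarm = [False, True, True, False, False]
print(confusion_counts(gold, alarm))  # (1, 1, 1, 2)
```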

How Good Is 95%?
It depends on what problem you are solving!
Try to determine expected upper and lower bounds for performance (of a specific task).
A baseline tells you the performance of a naïve approach (lower bound).

Lower Bound
Baselines serve as a lower limit of acceptability.
It is common to have several baselines.
Common baselines:
Random
Most common choice/answer (e.g. in tagging)
Linear selection (e.g. in summarization)
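
As a sketch of the "most common choice" baseline for tagging (the toy corpus and tag names are invented), tag each known word with its most frequent tag and unknown words with the overall most frequent tag:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Learn the most frequent tag per word plus an overall fallback tag."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for word, tag in tagged_corpus:
        per_word[word][tag] += 1
        overall[tag] += 1
    fallback = overall.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lexicon, fallback

def tag_baseline(words, lexicon, fallback):
    return [lexicon.get(w, fallback) for w in words]

# Hypothetical toy corpus of (word, tag) pairs.
corpus = [("the", "dt"), ("dog", "nn"), ("barks", "vb"), ("the", "dt"), ("bark", "nn")]
lexicon, fallback = train_baseline(corpus)
print(tag_baseline(["the", "dog", "runs"], lexicon, fallback))  # ['dt', 'nn', 'dt']
```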

Upper Bound
Sometimes there is an upper bound lower than 100%.
Example: in 10% of all cases the experts disagree on the correct answer.
Human ceiling (inter-assessor agreement).
Low inter-assessor agreement can sometimes be countered by comparing against several ”sources”.
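
A sketch of a simple inter-assessor agreement rate, computed over two hypothetical sets of judgments (chance-corrected measures such as Cohen's kappa are also common but not shown here):

```python
def agreement_rate(judgments_a, judgments_b):
    """Fraction of items on which two assessors give the same answer."""
    assert len(judgments_a) == len(judgments_b)
    same = sum(a == b for a, b in zip(judgments_a, judgments_b))
    return same / len(judgments_a)

# Hypothetical: two experts judge ten items and disagree on one of them.
a = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
b = ["yes", "no", "yes", "no",  "no", "yes", "no", "no", "yes", "yes"]
print(agreement_rate(a, b))  # 0.9, a plausible upper bound for system performance
```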

Limited Data
Limited data is often a problem, especially in machine learning.
We want lots of data for training – better results.
We want lots of data for evaluation – more reliable numbers.
If possible, create your own data! (Missplel)

Limited Data – N-fold Cross Validation
Idea:
1. Set 5% of the data aside for evaluation and train on the remaining 95%.
2. Set another 5% aside for evaluation and repeat training on 95%.
3. … and again (20 repetitions in total).
Take the mean of the evaluation results as the final result.
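
A minimal sketch of this scheme, assuming a hypothetical train_and_evaluate function and a list-like data set (20 folds gives the 5%/95% split described above):

```python
def cross_validate(data, n_folds, train_and_evaluate):
    """Hold out each of n_folds parts once for evaluation, train on the rest."""
    fold_size = len(data) // n_folds
    scores = []
    for i in range(n_folds):
        held_out = data[i * fold_size:(i + 1) * fold_size]
        training = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(train_and_evaluate(training, held_out))
    return sum(scores) / len(scores)   # mean over folds = final result

# Hypothetical usage:
# final_score = cross_validate(my_corpus, 20, my_train_and_evaluate)
```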

Concrete Examples
Tagging: force the tagger to assign exactly one tag to each token – precision?
Parsing: what happens when a parse is almost correct? Partial trees; how many sentences got full trees?
Spell checking: recall / precision for alarms; how far down the suggestion list is the correct suggestion?
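
For the spell-checking point above, one way to summarize "how far down the suggestion list" is the mean reciprocal rank of the correct suggestion; a sketch with invented data:

```python
def mean_reciprocal_rank(suggestion_lists, correct_words):
    """Average of 1/rank of the correct suggestion (0 when it is missing)."""
    total = 0.0
    for suggestions, correct in zip(suggestion_lists, correct_words):
        if correct in suggestions:
            total += 1.0 / (suggestions.index(correct) + 1)
    return total / len(correct_words)

# Hypothetical suggestion lists for two misspellings and their intended words.
lists = [["their", "there", "they're"], ["recieve", "receive"]]
gold = ["there", "receive"]
print(mean_reciprocal_rank(lists, gold))  # (1/2 + 1/2) / 2 = 0.5
```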

Concrete Examples
Grammar checking: how many alarms are false alarms (precision)? How many errors are detected (recall)? How many of these get the correct diagnosis?
Machine translation & text summarization: how many n-grams overlap with the gold standard(s)? BLEU and ROUGE scores.
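
A rough sketch of the n-gram overlap idea behind BLEU and ROUGE (heavily simplified: the real scores add clipping, brevity penalties, multiple references and several n-gram lengths):

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(candidate, reference, n=2):
    """Fraction of candidate n-grams that also occur in the reference."""
    cand = ngrams(candidate, n)
    ref = set(ngrams(reference, n))
    if not cand:
        return 0.0
    return sum(g in ref for g in cand) / len(cand)

# Hypothetical candidate output compared against a gold-standard reference.
cand = "the cat sat on the mat".split()
ref = "the cat was sitting on the mat".split()
print(ngram_overlap(cand, ref, 2))  # 0.6
```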

Concrete Examples
Information retrieval: what is the precision of the first X hits? At Y% recall? Mean precision.
Text categorization: how many documents were correctly classified?
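
A sketch of precision over the first X hits and of classification accuracy, both with invented data:

```python
def precision_at_k(ranked_hits, relevant, k):
    """Precision over the top k retrieved documents."""
    return sum(doc in relevant for doc in ranked_hits[:k]) / k

def accuracy(predicted, gold):
    """Share of documents assigned the correct category."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Hypothetical ranking and relevance judgments.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d4"}
print(precision_at_k(ranking, relevant, 3))  # 2/3

# Hypothetical category assignments.
predicted = ["sports", "news", "sports", "culture"]
gold_cats = ["sports", "news", "culture", "culture"]
print(accuracy(predicted, gold_cats))  # 0.75
```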