Methodologies for Evaluating Dialog Structure Annotation
Ananlada Chotimongkol
Presented at the Dialogs on Dialogs Reading Group, 27 January 2006

Dialog structure annotation evaluation
How good is the annotated dialog structure? Evaluation methodologies:
1. Qualitative evaluation (humans rate how good it is)
2. Compare against a gold standard (usually created by a human)
3. Evaluate the end product (task-based evaluation)
4. Evaluate the principles used
5. Inter-annotator agreement (comparing subjective judgments when there is no single correct answer)

Choosing evaluation methodologies
The choice depends on what kind of information is being annotated:
1. Categorical annotation, e.g. dialog acts
2. Boundary annotation, e.g. discourse segments
3. Structural annotation, e.g. rhetorical structure

Categorical annotation evaluation
- Cochran's Q test: tests whether the number of coders assigning the same label at each position is randomly distributed; does not directly indicate the degree of agreement
- Percentage of agreement: measures how often the coders agree; does not account for agreement by chance
- Kappa coefficient [Carletta, 1996]: measures pairwise agreement among coders, correcting for expected chance agreement

Kappa statistic
- The kappa coefficient (K) measures pairwise agreement among coders on categorical judgments, corrected for chance: K = (P(A) - P(E)) / (1 - P(E))
- P(A) is the proportion of times the coders agree
- P(E) is the proportion of times they are expected to agree by chance
- K > 0.8 indicates substantial agreement; 0.67 < K < 0.8 indicates moderate agreement
- Chance expected agreement is difficult to calculate in some cases
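As a minimal illustration of the computation for the two-coder case (the dialog-act labels below are invented for the example), kappa can be computed directly from the two label sequences:

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """Cohen's kappa for two coders assigning one categorical label per item."""
    n = len(labels_a)
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n      # observed agreement P(A)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)     # chance agreement P(E)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical example: two coders labelling five utterances with dialog acts
coder1 = ["question", "answer", "answer", "backchannel", "question"]
coder2 = ["question", "answer", "statement", "backchannel", "question"]
print(kappa(coder1, coder2))  # ~0.72, i.e. moderate agreement by the thresholds above
```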

Boundary annotation evaluation
- Use the kappa coefficient: don't compare the segments directly, but compare the decision on placing each boundary
- At each eligible point, the coder makes a binary decision whether to annotate it as "boundary" or "non-boundary" (see the sketch below)
- However, the kappa coefficient doesn't accommodate near-miss boundaries
- Option 1: redefine the matching criterion, e.g. also count a near-miss as a match
- Option 2: use other metrics, e.g. probabilistic error metrics
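A small sketch of that reduction (the boundary positions are invented; scikit-learn's cohen_kappa_score is used here as one readily available kappa implementation):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical segmentations: each coder lists the utterance indices
# after which they placed a discourse-segment boundary.
coder1_boundaries = {3, 7, 12}
coder2_boundaries = {3, 8, 12}
n_positions = 15  # number of eligible boundary points in the dialog

# Binary decision at every eligible point: boundary (1) or non-boundary (0)
decisions1 = [1 if i in coder1_boundaries else 0 for i in range(n_positions)]
decisions2 = [1 if i in coder2_boundaries else 0 for i in range(n_positions)]

print(cohen_kappa_score(decisions1, decisions2))
# The near-miss (7 vs. 8) counts as two full disagreements here; relaxed
# matching criteria or Pk/WindowDiff (next slide) treat it more gently.
```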

Probabilistic error metrics
- Pk [Beeferman et al., 1999]: measures how likely two time points are to be classified inconsistently (same segment in one annotation, different segments in the other); a small Pk means a high degree of agreement
- WindowDiff (WD) [Pevzner and Hearst, 2002]: counts the intervening topic breaks between two time points and penalizes any difference in the number of segment boundaries between the two annotations
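A minimal sketch of both metrics, assuming each segmentation is given as a 0/1 boundary indicator per unit; this follows the usual definitions, though details such as the default window size vary between implementations:

```python
def pk(ref, hyp, k=None):
    """Beeferman et al. (1999): probability that two points k units apart are
    classified inconsistently (same segment vs. different segments)."""
    if k is None:  # common convention: half the mean reference segment length
        k = max(2, round(0.5 * len(ref) / (sum(ref) + 1)))
    n = len(ref) - k
    errors = sum((sum(ref[i:i + k]) == 0) != (sum(hyp[i:i + k]) == 0) for i in range(n))
    return errors / n

def windowdiff(ref, hyp, k):
    """Pevzner and Hearst (2002): fraction of windows in which the number of
    boundaries differs between reference and hypothesis."""
    n = len(ref) - k
    return sum(sum(ref[i:i + k]) != sum(hyp[i:i + k]) for i in range(n)) / n

ref = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # boundaries after units 3 and 7
hyp = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]   # near-miss: boundary after unit 8 instead of 7
print(pk(ref, hyp), windowdiff(ref, hyp, k=3))
```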

Structural annotation evaluation
- Cascaded approach: evaluate one level at a time; evaluate the annotation of the higher level only if the annotation of the lower level is agreed on. Example: nested game annotation in Map Task [Carletta et al., 1997]
- Redefine matching criteria for structural annotation [Flammia and Zue, 1995]: segment A matches segment B if A contains B; segment A in annotation i matches annotation j if the segments in annotation j exclude segment A; the agreement criterion isn't symmetric
- Flatten the hierarchical structure: flatten the hierarchy into overlapping spans and compute agreement on the spans or the spans' labels (see the sketch below). Example: RST annotation [Marcu et al., 1999]
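A rough sketch of the span-flattening idea; the tree format and labels below are invented for illustration, not the representation used in the cited work:

```python
def flatten(node, spans=None):
    """Flatten a hierarchical annotation into (start, end, label) spans."""
    if spans is None:
        spans = []
    label, start, end, children = node
    spans.append((start, end, label))
    for child in children:
        flatten(child, spans)
    return spans

# Two hypothetical annotations of the same ten-word stretch of dialog
tree_a = ("task:query_flight", 0, 9,
          [("subtask:origin", 0, 4, []), ("subtask:destination", 5, 9, [])])
tree_b = ("task:query_flight", 0, 9,
          [("subtask:origin", 0, 3, []), ("subtask:destination", 5, 9, [])])

spans_a, spans_b = set(flatten(tree_a)), set(flatten(tree_b))
print(len(spans_a & spans_b) / len(spans_a | spans_b))  # simple span-overlap agreement
```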

Form-based dialog structure
- Describes a dialog structure using a task structure: a hierarchical organization of domain information
- Task: a subset of a dialog that has a specific goal
- Sub-task: a decomposition of a task; corresponds to one action (the process that uses related pieces of information together to create a new piece of information or a new dialog state)
- Concept: a word or a group of words that captures information necessary for performing an action
- The task structure is domain-dependent

An example of form-based structure annotation
(The original slide showed a word sequence word1 ... wordn wrapped in nested task, sub-task, and concept tags; the tag markup did not survive the transcript.)
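Since the original markup was lost, here is a purely illustrative sketch of how an air-travel utterance might be annotated with the form-based structure; every task, sub-task, and concept name below is invented:

```python
# Purely illustrative reconstruction; all labels are hypothetical.
annotation = {
    "task": "query_flight_info",
    "sub_tasks": [
        {
            "sub_task": "specify_itinerary",
            "concepts": [
                {"concept": "origin",      "words": "Pittsburgh"},
                {"concept": "destination", "words": "Boston"},
                {"concept": "depart_date", "words": "Monday"},
            ],
        },
    ],
}
```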

Annotation experiment
- Goal: to verify that the form-based dialog structure can be understood and applied by other annotators
- The subjects were asked to identify the task structure of dialogs in two domains: air travel planning and map reading
- Each domain needs a different set of labels, which is equivalent to designing domain-specific labels from the definition of the dialog structure components

Annotation procedure
- The subjects study an annotation guideline: the definition of the task structure and examples from other domains (bus schedule and UAV flight simulation)
- For each domain, the subjects study the transcriptions of 2-3 dialogs, then:
1. Create a set of labels for annotating the task structure
2. Annotate the given dialogs with the set of labels designed in step 1

Issues in task structure annotation evaluation
- There is more than one acceptable annotation; this is similar to MT evaluation, but it is difficult to obtain multiple references
- The tag sets used by two annotators may not be the same (the original slide showed the same phrase, "two thirty", tagged with a different concept label by each annotator; the tag markup was lost in the transcript)
- It is difficult to define matching criteria: mapping equivalent labels between two tag sets is subjective (and may not be possible)

Cross-annotator correction
- Ask a different annotator (the 2nd annotator) to judge the annotation and correct the parts that don't conform to the guideline
- If the 2nd annotator agrees with the 1st one, he or she makes no correction
- The 2nd annotator's own annotation may still be different, because there can be more than one annotation that conforms with the guideline

Cross-annotator correction (2)
- Pros: easier to evaluate the agreement, since the annotations are based on the same tag set; allows more than one acceptable annotation
- Cons: needs another annotator and takes time; introduces another subjective judgment; needs a way to measure the amount of change made by the 2nd annotator

Cross-annotators
Who should the 2nd annotators be?
- Another subject who also did the annotation: possibly biased toward his or her own annotation
- Another subject who studied the guideline but didn't do his or her own annotation: may not have thought about the structure thoroughly
- Experts: can also measure annotation accuracy using an expert annotation as a reference

How to quantify the amount of correction
- Edit distance from the original annotation: for structural annotation, the edit operations have to be redefined; a lower number means higher agreement, but it is unclear which range of values is acceptable (a sketch follows this list)
- Inter-annotator agreement: can apply structural annotation evaluation; the agreement number is meaningful and can be compared across different domains
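As a sketch of the edit-distance option, assuming the hierarchical annotation has first been flattened into a sequence of span labels (a real implementation would also need edit operations defined over the tree structure; the labels are invented):

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences (insert/delete/substitute)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

original  = ["task:query", "sub:origin", "sub:destination", "sub:date"]
corrected = ["task:query", "sub:origin", "sub:date"]
print(edit_distance(original, corrected))  # 1: the 2nd annotator removed one sub-task
```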

Cross-annotation agreement
- Use an approach similar to [Marcu et al., 1999]: flatten the hierarchy into overlapping spans and compute agreement on the labels of the spans (task, sub-task, and concept labels)
- Issues: a lot of possible spans have no label (especially for concept annotation); how to calculate P(E) when annotators add new concepts

Objective annotation evaluation
- Makes the results more comparable to other work; easier to evaluate, since no 2nd annotator is needed
- Label-insensitive evaluation with 3 labels, one per structure component (the specific label names were lost in the transcript); may also consider the level of sub-tasks
- Kappa can be artificially high (see the sketch below), so add a qualitative analysis of what the annotators don't agree on
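A small illustration of why label-insensitive kappa can be artificially high (all span labels below are invented; annotators often disagree on the specific sub-task or concept name while agreeing on the coarse component type):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical span labels from two annotators over the same five spans
fine1 = ["task:query", "sub:origin", "sub:date",   "concept:city", "concept:time"]
fine2 = ["task:query", "sub:route",  "sub:return", "concept:city", "concept:date"]

coarse = lambda label: label.split(":")[0]  # keep only task / sub / concept
print(cohen_kappa_score(fine1, fine2))                      # low: fine labels rarely match
print(cohen_kappa_score([coarse(l) for l in fine1],
                        [coarse(l) for l in fine2]))        # 1.0: artificially high
```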

References
J. Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics, vol. 22, no. 2, pp. 249-254, 1996.
D. Beeferman, A. Berger, and J. Lafferty, "Statistical models for text segmentation," Machine Learning, vol. 34, pp. 177-210, 1999.
L. Pevzner and M. A. Hearst, "A critique and improvement of an evaluation metric for text segmentation," Computational Linguistics, vol. 28, no. 1, pp. 19-36, 2002.
J. Carletta, S. Isard, G. Doherty-Sneddon, A. Isard, J. C. Kowtko, and A. H. Anderson, "The reliability of a dialogue structure coding scheme," Computational Linguistics, vol. 23, no. 1, pp. 13-31, 1997.
G. Flammia and V. Zue, "Empirical evaluation of human performance and agreement in parsing discourse constituents in spoken dialogue," in Proceedings of Eurospeech, Madrid, Spain, 1995.
D. Marcu, E. Amorrortu, and M. Romera, "Experiments in constructing a corpus of discourse trees," in Proceedings of the ACL Workshop on Standards and Tools for Discourse Tagging, College Park, MD, 1999.

Matching criteria
- Exact match (pairwise)
- Partial match (pairwise)
- Agree with majority (pool of coders)
- Agree with consensus (pool of coders)