Modeling the Cost of Misunderstandings in the CMU Communicator System
Dan Bohus, Alex Rudnicky
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA


Abstract

We present a data-driven approach that allows us to quantitatively assess the costs of the various types of errors a confidence annotator commits in the CMU Communicator spoken dialog system. Knowing these costs, we can determine the optimal trade-off point between the error types and fine-tune the confidence annotator accordingly. The cost models based on net concept transfer efficiency fit our data quite well, and the relative costs of false positives and false negatives are in accordance with our intuitions. Surprisingly, we also find that for a mixed-initiative system such as the CMU Communicator, these errors trade off equally over a wide operating range.

1. Motivation. Problem Formulation

Intro
In previous work [1], we cast the problem of utterance-level confidence annotation as a binary classification task and trained multiple classifiers for this purpose:
- Training corpus: 131 dialogs, 4550 utterances
- 12 features drawn from the recognition, parsing, and dialog levels
- 7 classifiers: Decision Tree, ANN, Bayesian Net, AdaBoost, Naïve Bayes, SVM, Logistic Regression

Results (mean classification error rates in 10-fold cross-validation):
- Random baseline: 32%
- Previous "Garble" baseline: 25%
- Classifiers*: 16%
* Most of the classifiers obtained statistically indistinguishable results (with the notable exception of Naïve Bayes). The logistic regression model obtained much better performance on a soft metric.

Question: Is Classification Error Rate the Right Way to Evaluate Performance?
CER as a performance measure implicitly assumes that false positives and false negatives cost the same. Intuitively, this assumption does not hold in most dialog systems:
- On a false positive (FP), the system incorporates and will act on invalid information;
- On a false negative (FN), the system rejects a valid user utterance.
Optimally, we therefore want an error function that takes these costs into account, and to optimize for that (see the sketch below).

Problem Formulation
1. Develop a cost model that allows us to quantitatively assess the costs of FP and FN errors.
2. Use these costs to pick an optimal point on the classifier's operating characteristic.
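To make the contrast concrete, here is a minimal sketch (not from the poster; the counts and cost values are purely illustrative) of how a cost-weighted error function can rank annotator settings differently from plain classification error rate once the two error types are priced differently:

```python
def classification_error_rate(fp, fn, total):
    """Plain CER: every error counts the same."""
    return (fp + fn) / total

def weighted_error(fp, fn, total, cost_fp, cost_fn):
    """Cost-sensitive error: false positives and false negatives are
    weighted by their (empirically estimated) costs."""
    return (cost_fp * fp + cost_fn * fn) / total

# Hypothetical counts and costs, for illustration only.
print(classification_error_rate(fp=30, fn=10, total=400))                  # -> 0.1
print(weighted_error(fp=30, fn=10, total=400, cost_fp=2.0, cost_fn=0.5))   # -> 0.1625
```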
2. Cost Models: The Approach

The Approach
To model the impact of FPs and FNs on system performance, we:
- Identify a suitable dialog performance metric (P) which we want to optimize for
- Build a statistical regression model over whole sessions, using P as the response variable and the counts of FPs and FNs as predictors (a fitting sketch appears further below):
  - P = f(FPs, FNs)
  - P = k + Cost_FP * FP + Cost_FN * FN (linear regression)

Performance metrics:
- User satisfaction (5-point scale): subjective, hard to obtain
- Completion (binary): too coarse
- Concept transmission efficiency:
  - CTC = correctly transferred concepts / turn
  - ITC = incorrectly transferred concepts / turn
  - REC = relevantly expressed concepts / turn

The Dataset
- 134 dialogs, collected using mostly 4 different scenarios
- User satisfaction scores obtained for only 35 dialogs
- Corpus manually labeled at the concept level:
  - 4 labels: OK / RBAD / PBAD / OOD
  - Aggregate utterance labels generated from the concept labels
- Confidence annotator decisions available in the logs
- We could therefore compute the counts of FPs, FNs, CTCs, and ITCs for each session

An Example (a sketch of this computation follows)
User:    I want to fly from Pittsburgh to Boston
Decoder: I want to fly from Pittsburgh to Austin
Parse:   [I_want/OK] [Depart_Loc/OK] [Arrive_Loc/OK]
- Only 2 relevantly expressed concepts
- If Accept: CTC = 1, ITC = 1, REC = 2
- If Reject: CTC = 0, ITC = 0, REC = 2
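The following is a minimal sketch of how the per-turn counts in the example could be computed. The Concept structure, the relevance flag, and the simplified treatment of the labels are assumptions made for illustration, not the system's actual annotation code:

```python
from dataclasses import dataclass

@dataclass
class Concept:
    name: str
    label: str        # "OK", "RBAD", "PBAD", or "OOD" (manual concept label)
    relevant: bool    # does the concept carry task-relevant information?

def turn_counts(concepts, accepted):
    """Per-turn concept transfer counts, following the poster's example.

    CTC: correctly transferred concepts (only if the turn is accepted)
    ITC: incorrectly transferred concepts (only if the turn is accepted)
    REC: relevantly expressed concepts (independent of the decision)
    """
    relevant = [c for c in concepts if c.relevant]
    rec = len(relevant)
    if not accepted:                      # a rejected turn transfers nothing
        return 0, 0, rec
    ctc = sum(1 for c in relevant if c.label == "OK")
    itc = rec - ctc
    return ctc, itc, rec

# "I want to fly from Pittsburgh to Boston", decoded as "... to Austin":
concepts = [
    Concept("I_want", "OK", relevant=False),       # not task-relevant
    Concept("Depart_Loc", "OK", relevant=True),    # Pittsburgh, correct
    Concept("Arrive_Loc", "RBAD", relevant=True),  # Austin instead of Boston
]
print(turn_counts(concepts, accepted=True))   # (1, 1, 2)
print(turn_counts(concepts, accepted=False))  # (0, 0, 2)
```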

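Before the results, a minimal sketch of how a linear cost model of the form P = k + Cost_FP * FP + Cost_FN * FN could be fit over whole sessions with ordinary least squares. The per-session counts and response values below are placeholders, not the Communicator data:

```python
import numpy as np

# Per-session predictors (counts of false positives and false negatives)
# and response (the chosen performance metric P, e.g. CTC - ITC per turn).
fp = np.array([3, 0, 5, 1, 2], dtype=float)
fn = np.array([1, 2, 0, 4, 1], dtype=float)
P  = np.array([0.8, 0.9, 0.4, 0.5, 0.7])

# Design matrix with an intercept column: P = k + C_FP*FP + C_FN*FN
X = np.column_stack([np.ones_like(fp), fp, fn])
coef, *_ = np.linalg.lstsq(X, P, rcond=None)
k, cost_fp, cost_fn = coef
print(f"k={k:.2f}  C_FP={cost_fp:.2f}  C_FN={cost_fn:.2f}")
```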
Cost Models: The Results

Cost Models Targeting Efficiency
Three successively refined cost models were developed, targeting efficiency as the response variable. The goodness of fit of these models (indicated by R²), both on the training data and in a 10-fold cross-validation, is illustrated in the table below.

Model 1: CTC = FP + FN + TN + k

Model 2: CTC - ITC = REC + FP + FN + TN + k
- Added the ITC term so that we also minimize the number of incorrectly transferred concepts.
- REC captures a prior on the verbosity of the user.
- Both changes further improve performance.

Model 3: CTC - ITC = REC + FPC + FPNC + FN + TN + k
- The FP term was split in two, since there are two different types of false positives in the system, which intuitively should have very different costs:
  - FPC = false positives with relevant concepts
  - FPNC = false positives without relevant concepts

Goodness of fit:

  Model                                     R² all   R² train   R² test
  CTC = FP + FN + TN
  CTC - ITC = FP + FN + TN
  CTC - ITC = REC + FP + FN + TN
  CTC - ITC = REC + FPC + FPNC + FN + TN

The resulting coefficients for Model 3, together with their 95% confidence intervals:

  k        0.41
  C_REC    0.62
  C_FPNC
  C_FPC
  C_FN
  C_TN

Other Models
Targeting completion (binary):
- Logistic regression model
- The estimated model does not indicate a good fit
Targeting user satisfaction (5-point scale):
- Based on only 35 dialogs
- R² = 0.61, similar to the literature (Walker et al.)
- Explanation: subjectivity of the metric + limited dataset

Fine-tuning the Annotator
We want to find the optimal trade-off point on the operating characteristic of the classifier. By default, we are implicitly minimizing the classification error rate (FP + FN). The problem therefore translates to locating a point on the operating characteristic (by moving the classification threshold) that minimizes the total cost (and thus implicitly maximizes the chosen performance metric), rather than the classification error rate. The cost, according to Model 3, is:

  Cost = 0.48 * FPNC + C_FPC * FPC + C_FN * FN + C_TN * TN

The fact that the cost function is almost constant across a wide range of thresholds indicates that the efficiency of the dialog stays about the same regardless of the ratio of FPs to FNs that the system makes (a sketch of this threshold sweep appears below, after the conclusions).

5. Further Analysis

Is CTC-ITC an Adequate Metric?
Mean = 0.71; standard deviation = 0.28. Mean for completed dialogs = 0.82; mean for uncompleted dialogs = 0.57. The differences are statistically significant at a very high level of confidence (p = ).

Can We Reliably Extrapolate the Model to Other Areas of the ROC?
The distribution of FPs and FNs across dialogs indicates that, although the data was obtained with the confidence annotator running at a threshold of 0.5, we have enough samples to reliably estimate the other areas of the ROC.

How About the Impact of the Baseline Error Rate?
Cost models constructed from sessions with a low baseline error rate indicate that the optimal point is at a threshold of 0 (no confidence annotator). Explanation:
- Incorrectly captured information can easily be overwritten in the CMU Communicator.
- Baseline error rates are relatively low.

6. Conclusions
- Proposed a data-driven approach to quantitatively assess the costs of the various types of errors committed by a confidence annotator.
- Models based on efficiency fit the data well; the obtained costs confirm our intuitions.
- For the CMU Communicator, the models predict that the total cost stays about the same across a large range of the confidence annotator's operating characteristic.
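As a rough illustration of the fine-tuning step referenced above, here is a sketch that sweeps the acceptance threshold and keeps the value minimizing the Model 3 cost. The confidence scores and labels are synthetic, and all coefficients except the quoted 0.48 for FPNC are hypothetical placeholders for the fitted values:

```python
import numpy as np

def total_cost(counts, c_fpnc=0.48, c_fpc=1.0, c_fn=1.0, c_tn=0.1):
    # Model 3 form: Cost = C_FPNC*FPNC + C_FPC*FPC + C_FN*FN + C_TN*TN.
    # Only the FPNC coefficient (0.48) is quoted on the poster; the other
    # defaults here are placeholders for the fitted coefficients.
    fpnc, fpc, fn, tn = counts
    return c_fpnc * fpnc + c_fpc * fpc + c_fn * fn + c_tn * tn

def counts_at(threshold, conf, correct, has_concepts):
    accepted = conf >= threshold
    fpnc = np.sum(accepted & ~correct & ~has_concepts)  # FP without relevant concepts
    fpc  = np.sum(accepted & ~correct & has_concepts)   # FP with relevant concepts
    fn   = np.sum(~accepted & correct)                   # rejected a correct turn
    tn   = np.sum(~accepted & ~correct)                  # rejected an incorrect turn
    return fpnc, fpc, fn, tn

# Synthetic per-turn data: annotator confidence, whether the turn was in fact
# correctly understood, and whether it carried relevant concepts.
rng = np.random.default_rng(0)
conf = rng.random(500)
correct = conf + 0.2 * rng.standard_normal(500) > 0.5
has_concepts = rng.random(500) > 0.3

thresholds = np.linspace(0.0, 1.0, 21)
best = min(thresholds, key=lambda t: total_cost(counts_at(t, conf, correct, has_concepts)))
print("threshold with minimum total cost:", best)
```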