Error Detection and Correction in SDS


Error Detection and Correction in SDS
Julia Hirschberg
CS 4706

Today
- Avoiding errors
- Detecting errors
  - From the user side: what cues does the user provide to indicate an error?
  - From the system side: how likely is it the system made an error?
- Dealing with errors: what can the system do when it thinks an error has occurred?
- Evaluating SDS: evaluating 'problem' dialogues

Avoiding Misunderstandings
- The problem
- By imitating human performance: timing and grounding (Clark '03)
- Confirmation strategies
- Clarification and repair subdialogues


Learning from Human Behavior: Features in Repetition Corrections (KTH)
[Bar chart: percentage of all repetitions showing each feature (more clearly articulated, increased loudness, shifting of focus), for adults vs. children]

Learning from Human Behavior (Krahmer et al '01)
'Go on' and 'go back' signals in grounding situations (implicit/explicit verification):
- Positive: short turns, unmarked word order, confirmation, answers, no corrections or repetitions, new info
- Negative: long turns, marked word order, disconfirmation, no answer, corrections, repetitions, no new info
Signalling whether information is grounded or not (Clark & Wilkes-Gibbs '86, Clark & Schaefer '89): presentation/acceptance

120 dialogues for Dutch train information; one version used explicit verification and one implicit. 20 users were given 3 tasks, and 443 verification question/answer pairs were analyzed. Following the principle of least effort, the prediction was that responses to correct verifications would be shorter, with unmarked word order, not repeating or correcting information but presenting new information (positive cues). Findings: where there were problems, subjects used more words (or said nothing), used marked word order (especially after implicit verifications), and produced more disconfirmations, with more repeated and corrected information. Machine learning experiments (memory-based learning) showed 97% correct prediction from these features (the simple rule '>8 words, or marked word order, or corrects info' alone reaches 92%; a toy rendering appears below). Krahmer et al '99b predicted additional prosodic cues for negative signals: high boundary tone, high pitch range, long duration of 'nee' and of the entire utterance, long pause after 'nee', and long delay before 'no'; this was tested on 109 negative answers to yes/no questions from 7 speakers.
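Below is a minimal, hypothetical Python sketch of the simple disjunctive rule reported in the notes above, not Krahmer et al.'s actual memory-based learner; the function and feature names are invented for illustration.

```python
# Toy rendering of the reported cue rule: a turn with >8 words, marked word
# order, or corrected info is flagged as signalling a grounding problem.
# Illustrative only; not the memory-based learner used in the study.

def is_problem_turn(n_words: int, marked_word_order: bool, corrects_info: bool) -> bool:
    """Flag a user turn as a negative ('go back') grounding signal."""
    return n_words > 8 or marked_word_order or corrects_info

# A long, marked-order correction is flagged; a short confirmation is not.
print(is_problem_turn(n_words=12, marked_word_order=True, corrects_info=True))   # True
print(is_problem_turn(n_words=2, marked_word_order=False, corrects_info=False))  # False
```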

Hypotheses supported, but…
- Can these cues be identified automatically?
- How might they affect the design of SDS?


Systems Have Trouble Knowing When They've Made a Mistake
- Hard for humans to correct system misconceptions (Krahmer et al '99)
  User: I want to go to Boston.
  System: What day do you want to go to Baltimore?
- Easier: answering explicit requests for confirmation or responding to ASR rejections
  System: Did you say you want to go to Baltimore?
  System: I'm sorry. I didn't understand you. Could you please repeat your utterance?

One major problem is that systems have a hard time telling when they themselves have made a mistake, with serious consequences for how useful systems are and how usable users find them. Dutch studies of people using a spoken dialogue system found that users had greater difficulty (measured in length of response and time to response) correcting system misconceptions than responding to explicit requests for confirmation. But systems that always ask for confirmation make the dialogue longer and more tedious, and result in lower user satisfaction scores. Furthermore, Levow found that the probability of a recognition failure after a failure was 2.75 times greater than after a successful recognition. Perhaps, like the 'helpful' response of native speakers to a foreign visitor with language difficulties (simply speaking louder), users of spoken dialogue systems respond to ASR failures in ways that simply increase the likelihood of further failure.

But constant confirmation or over-cautious rejection lengthens the dialogue and decreases user satisfaction.

…And Systems Have Trouble Recognizing User Corrections
- The probability of recognition failures increases after a misrecognition (Levow '98)
- Corrections of system errors are often hyperarticulated (louder, slower, more internal pauses, exaggerated pronunciation) → more ASR error (Wade et al '92, Oviatt et al '96, Swerts & Ostendorf '97, Levow '98, Bell & Gustafson '99)

Another problem, from the opposite side, is that when users correct system errors, they often do so in ways that make it even harder for the system to understand them.

Can Prosodic Information Help Systems Perform Better?
- If errors occur where speaker turns are prosodically 'marked'…
- Can we recognize turns that will be misrecognized by examining their prosody?
- Can we modify our dialogue and recognition strategies to handle corrections more appropriately?

Previous research suggests that particular prosodic phenomena associated with user corrections of ASR misrecognitions may actually contribute to subsequent recognition failures: hyperarticulation studies; casual speaking style (SwitchBoard and Call Home).

Approach
- Collect a corpus from an interactive voice response system
- Identify speaker 'turns':
  - incorrectly recognized (misrecognitions)
  - where speakers are first aware of an error (aware sites)
  - that correct misrecognitions (corrections)
- Identify prosodic features of turns in each category and compare them to other turns
- Use machine learning techniques to train a classifier to make these distinctions automatically

Our current study looks at a spoken dialogue corpus to see if we can automatically learn three categories of speaker turn (speech between the system ending a contribution and starting another): misrecognitions, speaker corrections, and turns where speakers are first made aware that an error has occurred (aware sites). Our original goal was to combine predictive information from these three turn types to predict misrecognitions and identify potential 'correcting' turns.

Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York. [aware site, correction]

Here are examples of the three turn types we focus on.

TOOT Dialogues
- Collected to study effects of differences in dialogue strategy on user performance and satisfaction (Litman & Pan '99)
  - type of initiative (system, user, mixed)
  - type of confirmation (explicit, implicit, none)
  - adaptability condition
- Subjects: 39 summer students; 16/23 (F/M); 20/19 (native speaker/non-native)

The corpus we used for the study is the TOOT train information corpus, collected by Litman and Pan in 1998.

Corpus for Current Study
- Platform: combined over-the-phone ASR and TTS (Kamm et al '97) with web access to train information
- Task: find train information for 4 scenarios
- Size: 2,328 speaker turns; 52 dialogues
- Misrecognitions:
  - Overall word accuracy: 61%
  - Overall concept accuracy (CA): 71%; e.g., "I want to go to Boston from Philadelphia" (2 domain concepts) recognized as "I want to go to Boston" (one concept) = 50% CA (see the sketch below)

Mean WER per turn for the 1,975 turns was 47%.
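As a worked illustration of the two accuracy measures above, here is a small Python sketch: a standard word-level edit-distance WER and the slide's concept-accuracy arithmetic. This is not the scoring code used in the study.

```python
# Word error rate via dynamic-programming edit distance over words,
# applied to the slide's Boston/Philadelphia example.

def word_error_rate(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first i ref words into the first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

ref = "I want to go to Boston from Philadelphia"
hyp = "I want to go to Boston"
print(word_error_rate(ref, hyp))  # 2 deletions / 8 words = 0.25
print(1 / 2)                      # concept accuracy: 1 of 2 domain concepts = 0.5
```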

A Successful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I would like a train to New York City from Philadelphia on Sunday at ten thirty P.M.
S: I heard you say "go to New York City from Philadelphia on Sunday approximately at ten thirty p.m." Do you want me to find the trains from Philadelphia to New York City on Sunday approximately at ten thirty p.m. now?
U: Yes.
S: I am going to get the train schedule for you....

Here's an example of the system performing well.

Are Misrecognitions, Aware Turns, and Corrections Measurably Different from Other Turns?
For each type of turn:
- For each speaker and each prosodic feature, calculate mean values, e.g. over all correctly recognized speaker turns and over all incorrectly recognized turns
- Perform paired t-tests on these speaker pairs of means (e.g., for each speaker, pairing mean values for correctly and incorrectly recognized turns); see the sketch below

For each turn type we performed descriptive analyses and also machine learning experiments.
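A minimal sketch of the per-speaker paired t-test procedure just described, assuming a hypothetical list of turn records; this is illustrative, not the study's analysis code.

```python
# For each speaker, compute mean feature values over misrecognized and
# correctly recognized turns, then run a paired t-test across speakers.
from collections import defaultdict
from statistics import mean
from scipy.stats import ttest_rel

def paired_test(turns):
    """turns: list of dicts with 'speaker', 'f0_max', and 'misrecognized' keys."""
    by_speaker = defaultdict(lambda: {True: [], False: []})
    for t in turns:
        by_speaker[t["speaker"]][t["misrecognized"]].append(t["f0_max"])
    # One pair of means per speaker: (misrecognized mean, correct mean).
    usable = [v for v in by_speaker.values() if v[True] and v[False]]
    mis = [mean(v[True]) for v in usable]
    cor = [mean(v[False]) for v in usable]
    return ttest_rel(mis, cor)

turns = [  # hypothetical data for two speakers
    {"speaker": "s1", "f0_max": 250.0, "misrecognized": True},
    {"speaker": "s1", "f0_max": 220.0, "misrecognized": False},
    {"speaker": "s2", "f0_max": 300.0, "misrecognized": True},
    {"speaker": "s2", "f0_max": 260.0, "misrecognized": False},
]
print(paired_test(turns))
```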

How: Prosodic Features Examined per Turn
Raw prosodic/acoustic features:
- F0 maximum and mean (pitch excursion/range)
- RMS maximum and mean (amplitude)
- total duration
- duration of preceding silence
- amount of silence within turn
- speaking rate (estimated from syllables of the recognized string per second)
Normalized versions of each feature (compared to the first turn in the task, to the previous turn in the task, Z scores); see the sketch below.

Initially we chose these features to capture elements of hyperarticulated speech as observed in the literature.
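A small illustrative sketch of two of the normalizations listed above (within-speaker z-scores, and values relative to the first turn of the task); the feature values are invented.

```python
# Two normalizations over one speaker's per-turn raw feature values.
from statistics import mean, stdev

def z_scores(values):
    """Z-score a speaker's raw feature values (e.g., per-turn F0 maximum)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def relative_to_first(values):
    """Each turn's value as a ratio to the first turn in the task."""
    return [v / values[0] for v in values]

f0_max = [220.0, 245.0, 310.0, 230.0]  # hypothetical per-turn F0 maxima (Hz)
print(z_scores(f0_max))
print(relative_to_first(f0_max))
```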

Distinguishing Correct Recognitions from Misrecognitions (NAACL '00)
Misrecognitions differ prosodically from correct recognitions:
- F0 maximum (higher)
- RMS maximum (louder)
- turn duration (longer)
- preceding pause (longer)
- speaking rate (slower)
The effect holds up across speakers, and even when hyperarticulated turns are excluded.

These results are reported in detail in our NAACL '00 paper. So at least we should be able to improve rejection decisions ("please repeat"), and perhaps tailor changes in dialogue strategy to fit the difficulty of recognizing particular turns.

WER-Based Results
Misrecognitions are higher in pitch, louder, and longer, with more preceding pause and less internal silence.

Results for both CA-defined and WER-defined misrecognitions were quite impressive and consistent with the hypothesis that misrecognitions are hyperarticulated: higher in pitch, louder, longer, with more preceding pause (although internal silence is less!). Results for CA differed from WER only in that, for the CA definition, F0 mean and internal silence were not significant but speaking rate was.

Predicting Turn Types Automatically
Ripper (Cohen '96) automatically induces rule sets for predicting turn types:
- greedy search guided by a measure of information gain
- input: vectors of feature values
- output: ordered rules for predicting the dependent variable, and (cross-validated) scores for each rule set
Independent variables:
- all prosodic features, raw and normalized
- experimental conditions (adaptability of system, initiative type, confirmation style, subject, task)
- gender, native/non-native status
- ASR recognized string, grammar, and acoustic confidence score

We then tested how the observed differences might be used to automatically predict turn types on-line.

ML Results: WER-Defined Misrecognition
[Results table shown on slide; not preserved in transcript]

The table shows the predictive power of several different rule sets, trained on different subsets of our features. The baseline is the prediction that all turns are misrecognized. Note that the best-performing rule set is trained on prosodic information plus information already available to the system during recognition. But also note that prosodic features alone currently out-perform the traditional ASR confidence score. And no features of the experimental conditions proved to be useful predictors, so our results appear to generalize to different initiative and confirmation strategies. Estimated error was derived by Ripper from a 25-fold cross-validation procedure. Confidence limits obtained by

Best Rule Set for Predicting WER
Using prosody, ASR confidence, ASR string, and ASR grammar:
if (conf <= -2.85) ^ (duration >= 1.27) then F
if (conf <= -4.34) then F
if (tempo <= 0.81) then F
if (conf <= -4.09) then F
if (conf <= -2.46) ^ (str contains "help") then F
if (conf <= -2.47) ^ (ppau >= 0.77) ^ (tempo <= 0.25) then F
if (str contains "nope") then F
if (dur >= 1.71) ^ (tempo <= 1.76) then F
else T

Here's part of what the best-performing rule set Ripper produces looks like. T means the transcription is all-correct. Note, for example, that strings containing "yes" and "no" are likely to be recognized correctly, modulo duration and ASR confidence score constraints. (A sketch of applying these ordered rules follows.)
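The ordered rule set above reads as a first-match-wins classifier. Here is a hedged Python rendering using the slide's thresholds, for illustration only; Ripper itself both induces and applies such rules, and the parameter `s` stands in for the recognized string (`str` on the slide).

```python
# First matching rule fires; "F" = predicted misrecognition, "T" = predicted
# correct recognition. Thresholds are copied from the slide; this is an
# illustrative rendering, not Ripper's own rule engine.

def predict_misrecognized(conf, duration, tempo, ppau, s):
    if conf <= -2.85 and duration >= 1.27: return "F"
    if conf <= -4.34: return "F"
    if tempo <= 0.81: return "F"
    if conf <= -4.09: return "F"
    if conf <= -2.46 and "help" in s: return "F"
    if conf <= -2.47 and ppau >= 0.77 and tempo <= 0.25: return "F"
    if "nope" in s: return "F"
    if duration >= 1.71 and tempo <= 1.76: return "F"
    return "T"

# A low-confidence, long turn is predicted to be misrecognized:
print(predict_misrecognized(conf=-3.0, duration=2.0, tempo=1.0, ppau=0.1,
                            s="to boston"))  # "F"
```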


Error Handling Strategies
If systems can recognize their lack of recognition, how should they inform the user that they don't understand (Goldberg et al '03)?
- System rephrasing vs. repetition vs. statement of non-understanding
- Apologies
What behaviors might these produce?
- Hyperarticulation
- User frustration
- User repetition vs. rephrasing
(A sketch of an escalating reprompt policy follows.)
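A hedged sketch of the "apologize and rephrase rather than repeat" idea discussed above; the prompt wording and escalation schedule are invented for illustration and are not taken from Goldberg et al. '03.

```python
# Escalating reprompt policy: on consecutive recognition failures, move to an
# apologetic, rephrased prompt instead of repeating the original verbatim.
# Prompt texts are hypothetical.

REPROMPTS = [
    "Where would you like to go?",
    "Sorry, I didn't catch that. Which city are you traveling to?",
    "I apologize, I'm still having trouble. Please say just the destination "
    "city, for example 'Boston'.",
]

def next_prompt(consecutive_failures: int) -> str:
    """Pick a prompt based on how many failures have occurred in a row."""
    return REPROMPTS[min(consecutive_failures, len(REPROMPTS) - 1)]

print(next_prompt(0))  # initial prompt
print(next_prompt(2))  # after two failures: apology plus rephrasing
```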

What Lessons Do We Learn?
- When users are frustrated, they are generally harder to recognize accurately
- When users are increasingly misrecognized, they tend to be misrecognized more often and to become increasingly frustrated
- Apologies combined with rephrasing of system prompts tend to decrease frustration and improve WER: don't just repeat!
- Users are better recognized when they rephrase their input


Recognizing 'Problematic' Dialogues
Hastie et al., "What's the Trouble?", ACL 2002
How to define a dialogue as problematic?
- User satisfaction is low
- The task is not completed
How to recognize one?
- Train on a corpus of recorded dialogues (1,242 DARPA Communicator dialogues)
- Predict: user satisfaction; task completion (0, 1, 2)
(An illustrative classifier sketch follows.)
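An illustrative sketch, not Hastie et al.'s actual model or feature set, of training a classifier to predict task completion from per-dialogue features; the feature names and data below are invented.

```python
# Train a small decision tree to predict task completion (0, 1, 2) from
# hypothetical per-dialogue features, then score a new dialogue.
from sklearn.tree import DecisionTreeClassifier

# Each row: [num_turns, mean_asr_confidence, num_reprompts] (invented values)
X = [[12, -1.2, 0], [40, -3.5, 6], [18, -2.0, 1], [55, -4.1, 9]]
y = [2, 0, 2, 0]  # 0 = task failed, 2 = task completed

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(clf.predict([[35, -3.0, 5]]))  # predicted completion for a new dialogue
```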

User Satisfaction Features
[Feature table shown on slide; not preserved in transcript]

Results
[Results shown on slide; not preserved in transcript]

Next Class
Speech data mining; HW3c due.