
1 Dissertation Defense
Saarbrücken – November ??th 2004
Automatic Classification of Speech Recognition Hypotheses Using Acoustic and Pragmatic Features
Malte Gabsdil, Universität des Saarlandes

2 The Problem (theoretical)
Grounding: establishing common ground between dialogue participants
– “Did H correctly understand what S said?”
Combination of bottom-up (“signal”) and top-down (“expectation”) information
Clark (1996): Action ladders
– upward completion
– downward evidence

3 The Problem (practical)
Assessment of recognition quality for spoken dialogue systems
Information sources
– speech/recognition output (“acoustic”)
– dialogue/task context (“pragmatic”)
Crucial for usability and user satisfaction
– avoid misunderstandings
– promote dialogue flow and efficiency

4 The General Picture
[Diagram: dialogue system architecture: ASR and Generation connect to the Dialogue Manager, which maintains the dialogue history, interpretation, the dialogue model, and response selection]

5 A Closer Look
How to assess recognition quality?
– decision problem
[Diagram: the standard approach passes the best hypothesis + confidence from the ASR to the Dialogue Manager via confidence rejection thresholds; the proposed approach passes n-best hypotheses + confidences to a machine learning classifier that combines acoustic and pragmatic features]

6 Overview
Machine learning classifiers
Acoustic and pragmatic features
Experiment 1: Chess
– exemplary domain
Experiment 2: WITAS
– complex spoken dialogue system
Conclusions & Topics for Future Work

7 Machine Learning Classifiers
Concept learners
– learn a decision function
– training: present feature vectors annotated with the correct class
– testing: classify unseen feature vectors
Combine acoustic and pragmatic features to classify recognition hypotheses as accept, (clarify), reject, or ignore
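The train/test scheme above can be sketched in a few lines. This is a minimal illustration, not the thesis's actual setup: the feature names and values are invented, and a simple 1-nearest-neighbour rule stands in for the real learners (which are memory-based in spirit, like TiMBL, used later in the talk).

```python
def distance(a, b):
    """Simple overlap distance between two feature vectors (dicts)."""
    return sum(1 for k in a if a[k] != b[k])

def classify(instance, training_data):
    """1-nearest-neighbour classification: return the class of the
    most similar stored training instance."""
    best = min(training_data, key=lambda ex: distance(instance, ex[0]))
    return best[1]

# Training: feature vectors annotated with the correct class
# (all feature names and values here are hypothetical).
training = [
    ({"confidence": "high", "dm_bigram": "likely",   "unres_nps": 0}, "accept"),
    ({"confidence": "low",  "dm_bigram": "unlikely", "unres_nps": 2}, "reject"),
    ({"confidence": "low",  "dm_bigram": "likely",   "unres_nps": 0}, "ignore"),
]

# Testing: classify an unseen feature vector.
unseen = {"confidence": "high", "dm_bigram": "likely", "unres_nps": 1}
print(classify(unseen, training))  # accept
```

The choice of distance metric and feature weighting is exactly what tools like TiMBL parameterise; the overlap distance here is only the simplest option.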

8 Acoustic Information
Derived from speech waveforms and recognition output
Low-level features
– amplitude, pitch (f0), duration, tempo (e.g. Levow 1998, Litman et al. 2000)
Recogniser confidences
– normalised probability that a sequence of recognised words is correct (e.g. Wessel et al. 2001)
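The idea of a normalised confidence can be illustrated as follows. This is a hedged sketch, not the recogniser's actual algorithm: it simply renormalises raw n-best log scores so the competing hypotheses' confidences sum to one.

```python
import math

def normalised_confidences(log_scores):
    """Map raw n-best log scores to values that behave like
    normalised probabilities over the competing hypotheses."""
    exps = [math.exp(s) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Example with three hypothetical n-best log scores:
confs = normalised_confidences([-1.0, -2.0, -4.0])
# the best-scoring hypothesis receives the highest confidence,
# and the confidences sum to 1
```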

9 Pragmatic Information
Derived from the dialogue context and task knowledge
Dialogue features
– adjacency pairs: current/previous dialogue move, DM bigram frequencies
– reference: unresolvable definite NPs/pronouns
Task features (scenario dependent)
– evaluation of move scores (Chess), conflicts in action preconditions and effects (WITAS)

10 Experiment 1: Chess
Recognise spoken chess move instructions
– speech interface to a computer chess program
Exemplary domain to test the methodology
– nice properties, easy to control
Pragmatic features: automatic move evaluation scores (Crafty)
Acoustic features: recogniser confidence scores (Nuance 8.0)

11 Data & Design
Subjects replay given chess games
– instruct each other to move pieces
– approx. move instructions in different data sets (devel, train, test)
5 × 2 × 6 design
– 5 systems for classifying recognition results (main effect)
– 2 game levels (strong vs. weak)
– 6 pairs of players

12 Players and Instructions

13 Systems
Task: accept or reject recognition hypotheses
Baseline
– confidence rejection threshold
– binary classification of the best hypothesis
ML System
– SVM learner (best on the development set)
– binary classification of 10-best results
– choose the first hypothesis classified as accept, else reject
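The ML System's decision rule over the n-best list can be sketched directly. The classifier below is a toy stand-in (a bare confidence threshold); the real system uses a trained SVM over the full acoustic + pragmatic feature vector.

```python
def choose_hypothesis(nbest, classifier):
    """Classify each n-best hypothesis as accept/reject; return the
    first one classified accept, otherwise reject the whole input."""
    for hyp in nbest:
        if classifier(hyp) == "accept":
            return "accept", hyp
    return "reject", None

# Toy stand-in classifier (hypothetical threshold, not the trained SVM).
def toy_classifier(hyp):
    return "accept" if hyp["confidence"] > 0.5 else "reject"

nbest = [
    {"text": "knight a3", "confidence": 0.40},
    {"text": "knight c3", "confidence": 0.70},
    {"text": "night sea", "confidence": 0.65},
]
decision, hyp = choose_hypothesis(nbest, toy_classifier)
# decision is "accept"; hyp is the first hypothesis the classifier accepts
```

Note the contrast with the baseline: the baseline only thresholds the single best hypothesis, while this rule can recover a correct recognition from lower in the n-best list.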

14 Results
Accuracy:
– Baseline: 64.3%, ML System: 97.2%

15 Evaluation
82.2% relative error rate reduction
χ² test on confusion matrices
– highly significant (p < .001)
Combination of acoustic and pragmatic information outperforms the standard approach
System reacts appropriately more often → increased usability

16 Experiment 2: WITAS
Operator interaction with a robot helicopter
Multi-modal references, collaborative activities, multi-tasking
Differences from the chess experiment
– complex dialogue scenario
– complex system (ISU-based, planning, …)
– much larger grammar and vocabulary (Chess: 37 grammar rules, 50-word vocabulary)
– open-mic recordings (ignore class)

17 WITAS Screenshot

18 Data Preparation
30 dialogues (6 users, 303 utterances)
Manual transcriptions
Offline recognition (10-best) and parsing → quasi-logical forms
Hypothesis labelling:
– accept: same QLF
– reject: out-of-grammar or different QLF
– ignore: “crosstalk” not directed at the system
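The labelling scheme above can be written as a small function. QLFs are represented as plain strings here for illustration; the real setup compared parser output of the recognition hypothesis against that of the manual transcription.

```python
def label_hypothesis(hyp_qlf, gold_qlf, crosstalk=False):
    """Assign a gold-standard class to a recognition hypothesis."""
    if crosstalk:            # utterance was not directed at the system
        return "ignore"
    if hyp_qlf is None:      # out-of-grammar: no parse obtained
        return "reject"
    if hyp_qlf == gold_qlf:  # same QLF as the transcription
        return "accept"
    return "reject"          # parsed, but to a different QLF

# Hypothetical QLF strings:
print(label_hypothesis("move(knight, f3)", "move(knight, f3)"))  # accept
print(label_hypothesis(None, "move(knight, f3)"))                # reject
```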

19 Example Features
Acoustic
– low level: amplitude (RMS)
– recogniser: hypothesis confidence score, rank in n-best list
Pragmatic
– dialogue: current/previous DM, DM bigram probability, #unresolvable definite NPs
– task: #conflicts in planning operators (e.g. already satisfied effects)

20 Results
Context-sensitive LMs (baseline)
– Accuracy: 65.68%
– Weighted f-score: 61.81%
TiMBL (optimised)
– Accuracy: 86.14%
– Weighted f-score: 86.39%

21 Evaluation
59.6% relative error rate reduction
χ² test on confusion matrices
– highly significant (p < .001)
Combination of acoustic and pragmatic information outperforms the grammar-switching approach
System reacts appropriately more often → increased usability

22 WITAS Features
Importance according to χ²
1. confidence (ac: recogniser)
2. DMBigramFrequency (pr: dialogue)
3. currentDM (pr: dialogue)
4. minAmp (ac: low level)
5. hypothesisLength (ac: recogniser)
6. RMSamp (ac: low level)
7. currentCommand (pr: dialogue)
8. minWordConf (ac: recogniser)
9. aqMatch (pr: dialogue)
10. nbestRank (ac: recogniser)

23 Summary/Achievements
Assessment of recognition quality for spoken dialogue systems (grounding)
Combination of acoustic and pragmatic information via machine learning
Highly significant improvements in classification accuracy over standard methods (incl. “grammar switching”)
Expect better system behaviour and user satisfaction

24 Topics for Future Work
Usability evaluation
– systems with and without the classification module
Generic and system-specific features
– which features are available across systems?
Tools for ISU-based systems
– module in the DIPPER software library
Clarification
– flexible generation (alternative questions, word-level clarification)

25 APPENDIX

26 Our Proposal
Combine acoustic and pragmatic information in a principled way
Machine learning to predict the grounding status of competing recognition hypotheses of user utterances
Evaluation against standard methods in spoken dialogue system engineering
– confidence rejection thresholds

27 Application
[Diagram: ASR output (n-best hypotheses) with acoustic information feeds the ML classifier, together with pragmatic information from the dialogue context and task knowledge; the classifier's decision is passed on to the Dialogue Manager]

28 Data Collection
Rows: game conditions; entries: move instructions per pair (Pairs 1–6)
Trial: (1)
LMw: (2) (3)
LMs: (4) (5)
DSw: 69 (5), 68 (4), 62 (4); total 199
DSs: 65 (3), 68 (3), 62 (4); total 200
TRw: 68 (4), 64 (2), 66 (4), 60 (3), 69 (5), 62 (3); total 389
TRs: 70 (5), 63 (3), 70 (5), 59 (2), 62 (4), 68 (2); total 392
TEw: 64 (6); total 384
TEs: 69 (7); total 414

29 Results
Baseline System (Acc: 64.3%)
          accept  reject
accept    441     48      (489)
reject    237     72      (309)
ML System (Acc: 97.2%)
          accept  reject
accept    695     2       (697)
reject    20      81      (101)

30 Chess Results
Accuracy:
– Base: 64.3%, LM: 93.5%, ML: 97.2%

31 Data/Baseline
Data from a user study with WITAS
– 6 subjects, 5 tasks each (“open mic”)
– 30 dialogues (303 user utterances)
– recorded utterances and logs of the WITAS Information State (dialogue history)
Originally collected to evaluate a “grammar switching” version of WITAS (= Baseline; Lemon 2004)

32 Data Preparation/Setup
Manual transcription of all utterances
Offline recognition (10-best) with the “full” grammar and processing with the NLU component (quasi-logical forms)
Hypothesis labelling:
– accept: same QLF
– reject: out-of-grammar or different QLF
– ignore: “crosstalk” not directed at the system

33 Acoustic Features
Low level:
– RMSamp, minAmp (abs), meanAmp (abs)
– motivation: detect crosstalk
Recogniser output/confidence scores:
– nbest rank, hypothesisLength (words)
– hypothesis confidence, confidence zScore, confidence SD, minWordConf
– motivation: quality estimation within and across hypotheses

34 Pragmatic Features
Dialogue:
– currentDM, DMTactiveNode, qaMatch, aqMatch, DMbigramFreq, currentCommand
– #unresNPs, #unresPROs, #uniqueIndefs
– motivation: adjacency pairs, unlikely references
Task:
– taskConflict (same command already active), taskConstraintConflict (fly vs. land)

35 Baseline Results
          accept  reject  ignore
accept    154     2       28
reject    45      43      4
ignore    12      9       2
Accuracy: 199/303 = 65.68%
Weighted f-score: 61.81%

36 Best ML Results
          accept  reject  ignore
accept    159     2       27
reject    6       86      0
ignore    2       5       16
Accuracy: 261/303 = 86.14%
Weighted f-score: 86.39% (TiMBL with parameter optimisation)

37 Importance of Features
What are the most predictive features?
– χ² statistics: correlate feature values with the different classes
– computed for each feature from its value/class contingency table:
  χ² = Σ_ij (O_ij − E_ij)² / E_ij,  where E_ij = (n_i. × n_.j) / n_..
– O_ij: observed frequencies; E_ij: expected frequencies
– n_.j: sum over column j; n_i.: sum over row i; n_..: #instances
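The χ² computation over a value/class contingency table can be sketched as follows: expected frequencies come from the row and column marginals, and the statistic sums the squared deviations of observed from expected counts.

```python
def chi_square(table):
    """Chi-square statistic for a contingency table given as a list
    of rows; rows = feature values, columns = classes."""
    row_sums = [sum(r) for r in table]          # n_i.
    col_sums = [sum(c) for c in zip(*table)]    # n_.j
    n = sum(row_sums)                           # n_..
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# A feature value perfectly aligned with the class gives a large score,
# while an uninformative feature gives zero:
print(chi_square([[50, 0], [0, 50]]))   # 100.0
print(chi_square([[25, 25], [25, 25]])) # 0.0
```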

38 Simple Example
Two features over four values (v1–v4) and two classes (c1, c2), n = 800
Feature A: χ² = 0 (feature values carry no information about the class)
Feature B: χ² = 2*(30²/50) + 2*0 + 2*(55²/75) + 2*(85²/200) ≈ 188.92
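The χ² sum in the example can be checked term by term:

```python
# Evaluating the slide's chi-square sum for Feature B:
# 2*(30²/50) + 2*0 + 2*(55²/75) + 2*(85²/200)
chi2 = 2 * (30**2 / 50) + 2 * 0 + 2 * (55**2 / 75) + 2 * (85**2 / 200)
print(round(chi2, 2))  # 188.92
```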

