
1 Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources
Kate Forbes-Riley and Diane Litman
Learning Research and Development Center and Computer Science Department, University of Pittsburgh

2 Overview
 Motivation: spoken dialogue tutoring systems
 Emotion Annotation: positive, negative, and neutral student states
 Machine Learning Experiments:
 - extract linguistic features from student speech
 - use different feature sets to predict emotions
 - best-performing feature set: speech & text, turn & context
   (84.75% accuracy, 44% error reduction)

3 Motivation
 Bridge the Learning Gap between Human Tutors and Computer Tutors
 (Aist et al., 2002): adding human-provided emotional scaffolding to a reading tutor increases student persistence
 Our Approach: add emotion prediction and adaptation to ITSPOKE, our Intelligent Tutoring SPOKEn dialogue system (demo paper)

4

5 Experimental Data
 Human Tutoring Spoken Dialogue Corpus
 - 128 dialogues (physics problems), 14 subjects
 - 45 student and tutor turns per dialogue on average
 - same physics problems, subject pool, web interface, and experimental procedure as ITSPOKE

6 Emotion Annotation Scheme (Sigdial'04)
 Perceived "Emotions"
 Task- and Context-Relative
 3 Main Emotion Classes: negative, neutral, positive
 3 Minor Emotion Classes: weak negative, weak positive, mixed

7 Example Annotated Excerpt (weak, mixed -> neutral)
Tutor: Uh let us talk of one car first.
Student: ok. (EMOTION = NEUTRAL)
Tutor: If there is a car, what is it that exerts force on the car such that it accelerates forward?
Student: The engine. (EMOTION = POSITIVE)
Tutor: Uh well engine is part of the car, so how can it exert force on itself?
Student: um… (EMOTION = NEGATIVE)

8 Emotion Annotated Data
 453 student turns, 10 dialogues, 9 subjects
 2 annotators, 3 main emotion classes
 385/453 turns agreed (84.99%, Kappa = 0.68)

Annotator 1 \ Annotator 2    Negative    Neutral    Positive
Negative                         90          6          4
Neutral                          23        280         30
Positive                          0          5         15
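The agreement figures above follow directly from the confusion matrix. A minimal sketch (not from the paper) that recomputes the raw agreement and Cohen's Kappa from the 3x3 counts:

```python
import numpy as np

# Rows: one annotator's labels (Negative, Neutral, Positive); columns: the other's.
confusion = np.array([
    [90,   6,  4],
    [23, 280, 30],
    [ 0,   5, 15],
])

total = confusion.sum()                        # 453 student turns
observed = np.trace(confusion) / total         # raw agreement: 385/453

# Chance agreement from the two annotators' marginal label distributions.
p_rows = confusion.sum(axis=1) / total
p_cols = confusion.sum(axis=0) / total
expected = np.sum(p_rows * p_cols)

kappa = (observed - expected) / (1 - expected)
print(f"agreement = {observed:.2%}, kappa = {kappa:.2f}")   # ~84.99%, ~0.68
```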

9 Feature Extraction per Student Turn
Five feature types:
 – acoustic-prosodic (1)
 – non acoustic-prosodic: lexical (2), other automatic (3), manual (4)
 – identifiers (5)
Research questions:
 – utility of different features
 – speaker and task dependence

10 Feature Types (1) Acoustic-Prosodic Features (normalized)
 4 pitch (f0): max, min, mean, standard dev.
 4 energy (RMS): max, min, mean, standard dev.
 4 temporal:
 - turn duration (seconds)
 - pause length preceding turn (seconds)
 - tempo (syllables/second)
 - internal silence in turn (zero f0 frames)
 available to ITSPOKE in real time
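As a concrete illustration of these twelve features, here is a minimal sketch (my own, not the authors' code), assuming per-frame f0 and RMS arrays for a student turn are already available from a pitch tracker, along with turn timing and a syllable count; the per-speaker normalization mentioned on the slide is left out.

```python
import numpy as np

def acoustic_prosodic_features(f0, rms, duration_s, prior_pause_s, n_syllables):
    """Raw (pre-normalization) acoustic-prosodic features for one student turn.

    f0, rms: per-frame pitch and energy arrays; zero f0 marks unvoiced/silent frames.
    """
    voiced = f0[f0 > 0]
    return {
        # 4 pitch (f0) features
        "f0_max": float(voiced.max()), "f0_min": float(voiced.min()),
        "f0_mean": float(voiced.mean()), "f0_std": float(voiced.std()),
        # 4 energy (RMS) features
        "rms_max": float(rms.max()), "rms_min": float(rms.min()),
        "rms_mean": float(rms.mean()), "rms_std": float(rms.std()),
        # 4 temporal features
        "duration": duration_s,                       # turn duration (seconds)
        "prior_pause": prior_pause_s,                 # pause preceding the turn
        "tempo": n_syllables / duration_s,            # syllables per second
        "internal_silence": float(np.mean(f0 == 0)),  # fraction of zero-f0 frames
    }

# Toy example with made-up frame values.
f0 = np.array([0., 180., 190., 185., 0., 175.])
rms = np.array([0.01, 0.20, 0.25, 0.22, 0.02, 0.18])
print(acoustic_prosodic_features(f0, rms, duration_s=1.7, prior_pause_s=0.4, n_syllables=5))
```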

11 Feature Types (2) Lexical Items
 word occurrence vector
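A minimal sketch of a word occurrence vector for a single turn, assuming a vocabulary built from the transcribed corpus; the vocabulary below is hypothetical, and this is my illustration rather than the authors' exact representation (which may use counts rather than binary occurrence).

```python
import re

vocabulary = ["the", "engine", "force", "car", "um", "ok"]   # hypothetical vocabulary

def word_occurrence_vector(turn_text, vocab=vocabulary):
    """Binary vector: 1 if the vocabulary word appears in the turn, else 0."""
    tokens = set(re.findall(r"[a-z']+", turn_text.lower()))
    return [1 if word in tokens else 0 for word in vocab]

print(word_occurrence_vector("The engine."))   # -> [1, 1, 0, 0, 0, 0]
```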

12 Feature Types (3) Other Automatic Features: available from ITSPOKE logs
 Turn Begin Time (seconds from dialog start)
 Turn End Time (seconds from dialog start)
 Is Temporal Barge-in (student turn begins before tutor turn ends)
 Is Temporal Overlap (student turn begins and ends in tutor turn)
 Number of Words in Turn
 Number of Syllables in Turn
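A minimal sketch of how these automatic features could be derived from turn timestamps and the recognized word string; the log fields and per-word syllable counts here are assumptions for illustration, not ITSPOKE's actual log schema.

```python
def automatic_features(student_start, student_end, tutor_start, tutor_end, words,
                       syllables_per_word):
    """Turn-level features from timing logs (times in seconds from dialogue start)."""
    return {
        "turn_begin_time": student_start,
        "turn_end_time": student_end,
        # Temporal barge-in: student starts talking before the tutor turn ends.
        "is_temporal_bargein": student_start < tutor_end,
        # Temporal overlap: student turn begins and ends inside the tutor turn.
        "is_temporal_overlap": tutor_start < student_start and student_end < tutor_end,
        "num_words": len(words),
        "num_syllables": sum(syllables_per_word),
    }

# Example: the student answers 1.2 seconds before the tutor finishes speaking.
print(automatic_features(10.3, 12.0, 5.0, 11.5, ["the", "engine"], [1, 2]))
```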

13 Feature Types (4) Manual Features: (currently) available only from human transcription
 Is Prior Tutor Question (tutor turn contains "?")
 Is Student Question (student turn contains "?")
 Is Semantic Barge-in (student turn begins at tutor word/pause boundary)
 Number of Hedging/Grounding Phrases (e.g. "mm-hm", "um")
 Is Grounding (canonical phrase turns not preceded by a tutor question)
 Number of False Starts in Turn (e.g. acc-acceleration)
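A rough sketch of some of these hand-labeled features, approximated with simple pattern matching over the human transcription; the hedging/grounding phrase list is an illustrative guess, and the semantic barge-in and grounding flags, which need tutor timing and a canonical phrase inventory, are omitted.

```python
import re

HEDGE_GROUND_PHRASES = {"mm-hm", "um", "uh", "ok"}   # hypothetical, partial inventory

def manual_features(student_text, prior_tutor_text):
    tokens = re.findall(r"[a-z'-]+", student_text.lower())
    return {
        "is_prior_tutor_question": "?" in prior_tutor_text,
        "is_student_question": "?" in student_text,
        "num_hedging_grounding": sum(t in HEDGE_GROUND_PHRASES for t in tokens),
        # False start like "acc-acceleration": a truncated prefix repeated in the next word.
        "num_false_starts": len(re.findall(r"\b(\w+)-(?=\1\w)", student_text.lower())),
    }

print(manual_features("um the acc-acceleration?", "What is the net force on the car?"))
```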

14 Feature Types (5) Identifier Features
 subject ID
 problem ID
 subject gender

15 Machine Learning (ML) Experiments
 Weka software: boosted decision trees give best results (Litman & Forbes, ASRU 2003)
 Baseline: predicts majority class (neutral); Accuracy = 72.74%
 Methodology: 10 runs of 10-fold cross validation
 Evaluation Metrics
 - Mean Accuracy: % correct
 - Relative Improvement Over Baseline: RI = (error(baseline) – error(x)) / error(baseline)
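The experiments use Weka; as a rough stand-in, the sketch below uses scikit-learn boosted trees and repeated 10-fold cross-validation, and implements the relative-improvement metric from the slide (the feature matrix X and labels y are assumed to be already extracted).

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def relative_improvement(baseline_acc, model_acc):
    """RI = (error(baseline) - error(model)) / error(baseline), accuracies in %."""
    return ((100 - baseline_acc) - (100 - model_acc)) / (100 - baseline_acc)

def mean_accuracy(X, y, seed=0):
    """10 runs of 10-fold cross-validation; returns mean accuracy in %."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=seed)
    scores = cross_val_score(AdaBoostClassifier(n_estimators=50), X, y, cv=cv)
    return 100 * scores.mean()

print(relative_improvement(72.74, 84.75))   # ~0.44, i.e. the 44% error reduction
```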

16 Acoustic-Prosodic vs. Other Features
 Baseline = 72.74%; RI range = 12.69% - 43.87%

Feature Set                              -ident
speech                                   76.20%
lexical                                  78.31%
lexical + automatic                      80.38%
lexical + automatic + manual             83.19%

 Acoustic-prosodic features ("speech") outperform the majority baseline, but other feature types yield even higher accuracy, and the more the better

17 Acoustic-Prosodic plus Other Features

Feature Set                              -ident
speech + lexical                         79.26%
speech + lexical + automatic             79.64%
speech + lexical + automatic + manual    83.69%

 Baseline = 72.74%; RI range = 23.29% - 42.26%
 Adding acoustic-prosodic features to the other feature sets doesn't significantly improve performance

18 Adding Contextual Features
(Litman et al. 2001; Batliner et al. 2003): adding contextual features improves prediction accuracy
Local features: the values of all features for the two student turns preceding the student turn to be predicted
Global features: running averages and totals for all features, over all student turns preceding the student turn to be predicted
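A minimal sketch, following the slide's description, of how local and global context features might be attached to the i-th turn, given a list of per-turn feature dicts with shared numeric keys; this is my own construction, not the authors' code.

```python
def add_context(turn_feats, i):
    """Augment turn i's features with local and global context features."""
    current = dict(turn_feats[i])
    # Local context: feature values of the two preceding student turns.
    for back in (1, 2):
        prev = turn_feats[i - back] if i - back >= 0 else {}
        for k in turn_feats[i]:
            current[f"prev{back}_{k}"] = prev.get(k, 0.0)
    # Global context: running averages and totals over all preceding turns.
    history = turn_feats[:i]
    for k in turn_feats[i]:
        values = [t[k] for t in history]
        current[f"avg_{k}"] = sum(values) / len(values) if values else 0.0
        current[f"total_{k}"] = sum(values)
    return current

turns = [{"duration": 2.0}, {"duration": 1.5}, {"duration": 3.0}]
print(add_context(turns, 2))   # adds prev1_/prev2_/avg_/total_ duration features
```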

19 Previous Feature Sets plus Context
 Same feature set with no context: 83.69%

Feature Set                              +context        -ident
speech + lexical + auto + manual         local           82.44
speech + lexical + auto + manual         global          84.75
speech + lexical + auto + manual         local+global    81.43

 Adding global contextual features marginally improves performance (e.g. 84.75% vs. 83.69% with no context)

20 Feature Usage

Feature Type           Turn + Global
Acoustic-Prosodic          16.26%
  Temporal                 13.80%
  Energy                    2.46%
  Pitch                     0.00%
Other                      83.74%
  Lexical                  41.87%
  Automatic                 9.36%
  Manual                   32.51%

21 Accuracies over ML Experiments

22 Related Research in Emotional Speech
 Actor/Native Read Speech Corpora (Polzin & Waibel 1998; Oudeyer 2002; Liscombe et al. 2003)
 - more emotions; multiple dimensions
 - acoustic-prosodic predictors
 Naturally-Occurring Speech Corpora (Ang et al. 2002; Lee et al. 2002; Batliner et al. 2003; Devillers et al. 2003; Shafran et al. 2003)
 - fewer emotions (e.g. E / -E); Kappas < 0.6
 - additional (non acoustic-prosodic) predictors
 Few address the tutoring domain

23 Summary
 Methodology: annotation of student emotions in spoken human tutoring dialogues, extraction of linguistic features, and use of different feature sets to predict emotions
 Our best-performing feature set contains acoustic-prosodic, lexical, automatic, and hand-labeled features from turn and context (Accuracy = 85%, RI = 44%)
 This research is a first step towards implementing emotion prediction and adaptation in ITSPOKE

24 Current Directions
 Address the same questions in the ITSPOKE computer tutoring corpus (ACL'04)
 Label human tutor reactions to student emotions to:
 - develop adaptive strategies for ITSPOKE
 - examine the utility of different annotation granularities
 - determine whether greater tutor response to student emotions correlates with student learning and other performance measures

25 Thank You! Questions?

26 Prior Research: Affective Computer Tutoring
(Kort, Reilly, and Picard, 2001): propose a cyclical model of emotion change during learning; developing a non-dialog computer tutor that will use eye-tracking/facial features to predict emotion and support movement into positive emotions.
(Aist, Kort, Reilly, Mostow, and Picard, 2002): adding human-provided emotional scaffolding to an automated reading tutor increases student persistence.
(Evens et al., 2002): for CIRCSIM, a computer dialog tutor for physiology problems, hypothesize adaptive strategies for recognized student emotional states; e.g., if frustration is detected, the system should respond to hedges and self-deprecation by supplying praise and restructuring the problem.
(de Vicente and Pain, 2002): use human observations about student motivational states in videotaped interaction with a non-dialog computer tutor to develop rules for detection.
(Ward and Tsukahara, 2003): a spoken dialog computer "tutor-support" uses prosodic and contextual features of the user turn (e.g. "on a roll", "lively", "in trouble") to infer an appropriate response as users remember train stations. Preferred over randomly chosen acknowledgments (e.g. "yes", "right", "that's it", …).
(Conati and Zhou, 2004): use Dynamic Bayesian Networks to reason under uncertainty about abstracted student knowledge and emotional states through time, based on student moves in a non-dialog computer game, and to guide selection of "tutor" responses.
 Most will be relevant to developing ITSPOKE adaptation techniques

27 ML Experiment 3: Other Evaluation Metrics
 alltext + speech + ident: leave-one-out cross-validation (accuracy = 82.08%)
 Best for neutral, better for negatives than positives
 Baseline (majority class): neutral = .73 precision, 1 recall, .84 F-measure; negatives and positives = 0, 0, 0

Class        Precision    Recall    F-Measure
Negative       0.71        0.60       0.65
Neutral        0.86        0.92       0.89
Positive       0.50        0.27       0.35
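A minimal sketch of how per-class precision, recall, and F-measure are computed from a confusion matrix of actual vs. predicted labels; the experiment's own confusion matrix is not on the slide, so the example matrix below is made up.

```python
import numpy as np

def per_class_prf(confusion):
    """confusion[i, j] = number of class-i turns predicted as class j."""
    tp = np.diag(confusion).astype(float)
    precision = tp / confusion.sum(axis=0)              # divide by predicted-class totals
    recall = tp / confusion.sum(axis=1)                 # divide by actual-class totals
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Illustrative only: a made-up 3x3 matrix (Negative, Neutral, Positive).
demo = np.array([[5, 3, 2], [1, 20, 4], [0, 2, 8]])
print(per_class_prf(demo))
```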

28 Machine Learning (ML) Experiments
 Weka machine-learning software: boosted decision trees give best results (Litman & Forbes, 2003)
 Baseline: predicts majority class (neutral); Accuracy = 72.74%
 Methodology: 10 x 10 cross validation
 Evaluation Metrics
 - Mean Accuracy: % correct
 - Standard Error: SE = std(x)/sqrt(n), n = 10 runs; +/- 2*SE gives a 95% confidence interval
 - Relative Improvement Over Baseline: RI = (error(baseline) – error(x)) / error(baseline), where error(x) = 100 – %Correct
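A minimal sketch of the standard-error and confidence-interval calculation on this slide, applied to the per-run accuracies from the 10 cross-validation runs; the accuracy values below are made up for illustration.

```python
import numpy as np

# Hypothetical per-run mean accuracies (%) from 10 runs of 10-fold cross-validation.
run_accuracies = np.array([84.1, 85.3, 84.8, 85.0, 84.6, 84.9, 84.2, 85.4, 84.7, 84.5])

mean_acc = run_accuracies.mean()
se = run_accuracies.std(ddof=1) / np.sqrt(len(run_accuracies))   # SE = std(x)/sqrt(n)
ci_low, ci_high = mean_acc - 2 * se, mean_acc + 2 * se           # ~95% confidence interval

print(f"mean accuracy = {mean_acc:.2f}%, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```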

29 Outline
 Introduction
 ITSPOKE Project
 Emotion Annotation
 Machine-Learning Experiments
 Conclusions and Current Directions

30 ITSPOKE: Intelligent Tutoring SPOKEn Dialogue System
 Back-end is the text-based Why2-Atlas tutorial dialogue system (VanLehn et al., 2002)
 Sphinx2 speech recognizer
 Cepstral text-to-speech synthesizer
 Try ITSPOKE during the demo session!

31 Experimental Procedure
 Students take a physics pretest
 Students read background material
 Students use the web and voice interface to work through up to 10 problems with either ITSPOKE or a human tutor
 Students take a post-test

32 ML Experiment 3: 8 Feature Sets + Context
 Global context marginally better than local or combined
 No significant difference between +/- ident sets
 e.g., speech without context: 76.20% (-ident), 77.41% (+ident)

Feature Set    +context        -ident    +ident
speech         local           76.90     76.95
speech         global          77.77     78.02
speech         local+global    77.00     76.88

 Adding context marginally improves some performances

33

34 8 Feature Sets
speech: normalized acoustic-prosodic features
lexical: lexical items in the turn
autotext: lexical + automatic features
alltext: lexical + automatic + manual features
+ident: each of the above + identifier features

