2 Using Prosody to Recognize Student Emotions and Attitudes in Spoken Tutoring Dialogues Diane Litman Department of Computer Science and Learning Research and Development Center University of Pittsburgh

3 Outline
 Introduction
 The ITSPOKE System and Corpora
 Emotion Prediction from Prosody & Other Features
– Method
– Human-human tutoring
– Computer-human tutoring
 Current Directions and Summary

4 Motivation
 Working hypothesis regarding learning gains
– Human Dialogue > Computer Dialogue > Text
 Most human tutoring involves face-to-face spoken interaction, while most computer dialogue tutors are text-based
– Evens et al., 2001; Zinn et al., 2002; VanLehn et al., 2002; Aleven et al., 2001
 Can the effectiveness of dialogue tutorial systems be further increased by using spoken interactions?

5 Potential Benefits of Speech
 Self-explanation correlates with learning and occurs more in speech
– Hausmann and Chi, 2002
 Speech contains prosodic information, providing new sources of information for dialogue adaptation
– Forbes-Riley and Litman, 2004
 Spoken computational environments may prime a more social interpretation that enhances learning
– Moreno et al., 2001; Graesser et al., 2003
 Potential for hands-free interaction
– Smith, 1992; Aist et al., 2003

6 Spoken Tutorial Dialogue Systems
 Recent tutoring systems have begun to add spoken language capabilities
– Rickel and Johnson, 2000; Graesser et al., 2001; Mostow and Aist, 2001; Aist et al., 2003; Fry et al., 2001; Schultz et al., 2003
 However, there has been little empirical analysis of the learning ramifications of using speech

7 Outline
 Introduction
 The ITSPOKE System and Corpora
 Emotion Prediction from Prosody & Other Features
– Method
– Human-human tutoring
– Computer-human tutoring
 Current Directions and Summary

8 ITSPOKE: Intelligent Tutoring SPOKEn Dialogue System
 Back-end is the text-based Why2-Atlas tutorial dialogue system (VanLehn et al., 2002)
 Student speech digitized from microphone input; Sphinx2 speech recognizer
 Tutor speech played via headphones/speakers; Cepstral text-to-speech synthesizer
 Other additions: access to Why2-Atlas “internals”, speech recognition repairs, etc.

12 Spoken Tutoring Corpora
 Human-Human Corpus
– 14 students, 128 physics problems (dialogues), 5948 student turns, 5505 tutor turns
– Student and tutor turns were manually transcribed for content and segmented for boundaries

13 Corpus Transcription & Annotation

14 Spoken Tutoring Corpora (cont.)
 Computer-Human Corpus
– 20 students, 100 physics problems (dialogues), 2445 student turns, 2967 tutor turns
– Noisy student turn boundaries and transcriptions extractable from ITSPOKE logs
– Content of student turns also manually transcribed

15 ITSPOKE Corpora Comparison

Human-Human (…1.3 minutes into session…)
Student: Can I ask you questions?
Tutor: Yes
Student: Is that ok?
Tutor: Mm-hm
Student: Um from what we learned from the last problem they're gonna have the same- the keys and you have the same acceleration right?
Tutor: Yes
Student: So that means they'll both hit the bottom at the same time. But I don't understand where- I don't understand if you're in the elevator-
Tutor: You see
Student: Where are you going to-?
Tutor: The uh let me uh the key uh- the person holds the key in front of-
Student: Their face yeah-
Tutor: Uh his or her face and then lets it uh uh it is let go so the the the question relates to the relative position of the key and the face uh-
Student: So the key and the face-
Tutor: Uh how will they- as they go- as they both fall what will be their relative position? That is the question.
Student: (sigh)

Human-Computer (…3.5 minutes into session…)
Tutor: I didn't understand. What force acts on all objects within close proximity to earth?
Student: uh the force of gravity
Tutor: Fine. Besides the Earth's gravitational force, are there any other forces acting on the person, keys, and elevator after the release?
Student: no
Tutor: Yep. Thus, keep in mind that after the release, the only force on the person, keys, and elevator is the force of gravity. If gravity is the only force acting on an object, will it be moving or staying still?
Student: it will be moving
Tutor: Fine. Not only are the person, keys, and elevator moving, they have only gravitational forces on them. When an object is falling and has only gravitational force on it, it is said to be in what?
Student: freefall
Tutor: Good. Thus, after the release the only force on the keys and man is the downward force of earth's gravity, so they are in freefall. Before the release, what is the relationship between the man's velocity and the keys' velocity?
Student: they're equal
Tutor: Correct. We know that after the release, the keys, the person, and the elevator are in freefall. Is freefall acceleration near earth a constant? Or does it vary depending upon which object is in freefall?
Student: it's a constant

16 Outline
 Introduction
 The ITSPOKE System and Corpora
 Emotion Prediction from Prosody & Other Features
– Method
– Human-human tutoring
– Computer-human tutoring
 Current Directions and Summary

17 Monitoring Student State (motivation)
Tutor26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?
Student27: dammit (ASR: it is)
Tutor28: Could you please repeat that?
Student29: same (ASR: i same)
Tutor30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?
Student31: zero (ASR: the zero)
Tutor32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario
Student33: oh gravity you already said this (ASR: beats gravity you're exert less)
Tutor34: Fine. Are there any other forces acting on the apple as it falls?
Student35: no why are you doing this again (ASR: no y and to it yes)
Tutor36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?
Student37: downward you computer (ASR: downward you computer)

18 Related Research in Emotional Speech
 Elicited Speech (Polzin & Waibel 1998; Oudeyer 2002; Liscombe et al. 2003)
 Naturally-Occurring Speech (Ang et al. 2002; Lee et al. 2002; Batliner et al. 2003; Devillers et al. 2003; Shafran et al. 2003)
 Our Work
– naturally-occurring tutoring data
– analysis of comparable human and computer corpora

19 Methodology
 Emotion Annotation
 Machine Learning Experiments
– Extract linguistic features from student turns
– Use different feature sets to predict emotions
– Significant reduction of baseline error

20 Emotion Annotation Scheme (SIGdial '04)
 ‘Emotion’: emotions/attitudes that may impact learning
 Annotation of student turns
 3 main emotion classes
– negative, e.g. uncertain, bored, irritated, confused, sad
– positive, e.g. confident, enthusiastic
– neutral: no expression of negative or positive emotion
 3 minor emotion classes
– weak negative, weak positive, mixed

21 Feature Extraction per Student Turn
 Five feature types
– Acoustic-prosodic (1)
– Non acoustic-prosodic: Lexical (2), Other Automatic (3), Manual (4)
– Identifiers (5)
 Research questions
– Relative predictive utility of feature types
– Impact of speech recognition
– Comparison across computer and human tutoring

22 Feature Types (1): Acoustic-Prosodic Features
 4 pitch (f0): max, min, mean, standard deviation
 4 energy (RMS): max, min, mean, standard deviation
 4 temporal: turn duration (seconds); pause length preceding turn (seconds); tempo (syllables/second); internal silence in turn (zero f0 frames)
 All available to ITSPOKE in real time
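
These twelve features can be approximated from the turn audio alone. A minimal sketch, assuming the librosa library for f0 and RMS extraction (the original system used its own real-time signal-processing pipeline); pause length and syllable count come from the dialogue logs and transcript, so they are passed in as arguments:

```python
import numpy as np
import librosa

def turn_features(wav_path, pause_before, n_syllables):
    """Approximate the 12 acoustic-prosodic features for one student turn.

    pause_before (seconds) and n_syllables come from the dialogue logs and
    transcript; they cannot be read from the turn audio alone.
    """
    y, sr = librosa.load(wav_path, sr=None)
    duration = len(y) / sr

    # Pitch (f0) track; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
    voiced = f0[~np.isnan(f0)]
    if voiced.size == 0:            # guard against fully unvoiced turns
        voiced = np.zeros(1)

    # Energy: root-mean-square amplitude per frame.
    rms = librosa.feature.rms(y=y)[0]

    return {
        "f0_max": voiced.max(), "f0_min": voiced.min(),
        "f0_mean": voiced.mean(), "f0_std": voiced.std(),
        "rms_max": rms.max(), "rms_min": rms.min(),
        "rms_mean": rms.mean(), "rms_std": rms.std(),
        "duration": duration,
        "pause_before": pause_before,
        "tempo": n_syllables / duration,        # syllables per second
        "pct_unvoiced": np.isnan(f0).mean(),    # internal-silence proxy (zero-f0 frames)
    }
```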

23 Feature Types (2): Lexical (Word Occurrence Vectors)
 Human-transcribed lexical items in the turn
 ITSPOKE-recognized lexical items
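
A word occurrence vector records, for each vocabulary item, whether it appears in the turn. A minimal sketch using scikit-learn's CountVectorizer as an assumed stand-in (the actual experiments were run in Weka); the example strings are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two views of the same student turns: human transcripts vs. noisy ASR output.
transcribed = ["the force of gravity", "they have the same acceleration right"]
recognized = ["the force of gravity", "they had the same the acceleration right"]

# binary=True records word *occurrence* (present/absent), not counts.
vectorizer = CountVectorizer(binary=True)
X_human = vectorizer.fit_transform(transcribed)
X_asr = vectorizer.transform(recognized)   # same vocabulary, for comparability
print(vectorizer.get_feature_names_out())
```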

24 Feature Types (3): Other Automatic Features (available from logs)
 Turn Begin Time (seconds from dialogue start)
 Turn End Time (seconds from dialogue start)
 Is Temporal Barge-in (student begins before tutor turn ends)
 Is Temporal Overlap (student begins and ends in tutor turn)
 Number of Words in Turn
 Number of Syllables in Turn
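
Given per-turn begin/end timestamps from the logs, these features reduce to simple comparisons. A sketch under the assumption that each log record carries times in seconds from dialogue start (the Turn class and field names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    begin: float                # seconds from dialogue start
    end: float                  # seconds from dialogue start
    words: list = field(default_factory=list)

def automatic_features(student: Turn, prior_tutor: Turn) -> dict:
    return {
        "begin_time": student.begin,
        "end_time": student.end,
        # Temporal barge-in: the student starts before the tutor turn ends.
        "is_barge_in": student.begin < prior_tutor.end,
        # Temporal overlap: the student turn starts and ends inside the tutor turn.
        "is_overlap": (prior_tutor.begin <= student.begin
                       and student.end <= prior_tutor.end),
        "n_words": len(student.words),
        # A syllable count would need a pronunciation lexicon; omitted here.
    }
```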

25 Feature Types (4): Manual Features (currently available only from human transcription)
 Is Prior Tutor Question (tutor turn contains “?”)
 Is Student Question (student turn contains “?”)
 Is Semantic Barge-in (student turn begins at tutor word/pause boundary)
 Number of Hedging/Grounding Phrases (e.g. “mm-hm”, “um”)
 Is Grounding (canonical phrase turns not preceded by a tutor question)
 Number of False Starts in Turn (e.g. acc-acceleration)
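
Several of these features are simple surface checks once a human transcript is available. A sketch (the phrase list and helper names are illustrative; the semantic barge-in and grounding features need timing and annotation information not shown here):

```python
HEDGING_GROUNDING = {"mm-hm", "uh-huh", "um", "uh", "ok", "okay", "right"}

def is_false_start(token):
    """True for fragments like 'acc-acceleration' (a restart of the same word)."""
    head, sep, rest = token.partition("-")
    return bool(sep) and bool(head) and rest.startswith(head)

def manual_features(student_text, prior_tutor_text):
    tokens = student_text.lower().split()
    return {
        "is_prior_tutor_question": "?" in prior_tutor_text,
        "is_student_question": "?" in student_text,
        "n_hedging_grounding": sum(t.strip(".,?") in HEDGING_GROUNDING
                                   for t in tokens),
        "n_false_starts": sum(is_false_start(t) for t in tokens),
    }
```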

26 Feature Types (5): Identifier Features
 student number
 student gender
 problem number

27 Empirical Results I
Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources
Kate Forbes-Riley and Diane Litman
Proceedings of the Human Language Technology Conference: 4th Meeting of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2004)

28 Annotated Human-Human Excerpt (weak, mixed -> neutral)
Tutor: Uh let us talk of one car first.
Student: ok. (EMOTION = NEUTRAL)
Tutor: If there is a car, what is it that exerts force on the car such that it accelerates forward?
Student: The engine. (EMOTION = POSITIVE)
Tutor: Uh well engine is part of the car, so how can it exert force on itself?
Student: um… (EMOTION = NEGATIVE)

29 Human Tutoring: Annotation Agreement Study
 453 student turns, 10 dialogues
 2 annotators (the authors)
 385/453 agreed (85%, Kappa = .7)

                Negative   Neutral   Positive
     Negative         90         6          4
     Neutral          23       280         30
     Positive          0         5         15
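
The reported agreement and Kappa follow directly from this confusion matrix; a short sketch of Cohen's kappa as a check:

```python
import numpy as np

# Rows: annotator 1; columns: annotator 2 (negative, neutral, positive).
m = np.array([[90.0, 6, 4],
              [23, 280, 30],
              [0, 5, 15]])

total = m.sum()
p_observed = np.trace(m) / total                              # 385/453 ≈ 0.85
p_chance = (m.sum(axis=1) * m.sum(axis=0)).sum() / total**2   # chance agreement
kappa = (p_observed - p_chance) / (1 - p_chance)              # ≈ 0.7
print(round(p_observed, 2), round(kappa, 2))
```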

30 Machine Learning Experiments
 Task: predict negative/positive/neutral using 5 feature types
 Data: “agreed” subset of annotated student turns
 Weka software: boosted decision trees
 Methodology: 10 runs of 10-fold cross validation
 Evaluation metrics
– Mean Accuracy: %Correct
– Relative Improvement over Baseline: RI = (error(baseline) − error(x)) / error(baseline)
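
A rough scikit-learn analogue of this setup (an assumed substitute for Weka's boosted decision trees), together with the RI metric:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def mean_accuracy(X, y, runs=10, folds=10):
    """10 runs of 10-fold cross validation, averaging accuracy."""
    scores = []
    for seed in range(runs):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
        clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3))
        scores.extend(cross_val_score(clf, X, y, cv=cv))
    return np.mean(scores)

def relative_improvement(acc_baseline, acc_x):
    """RI = (error(baseline) - error(x)) / error(baseline)."""
    return ((1 - acc_baseline) - (1 - acc_x)) / (1 - acc_baseline)

# e.g. the lexical + automatic + manual feature set from slide 31:
print(relative_improvement(0.7274, 0.8319))  # ≈ 0.38, within the stated RI range
```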

31 Acoustic-Prosodic vs. Other Features
Baseline = 72.74%; RI range = 12.69% – 43.87%

Feature Set                            -ident
speech                                 76.20%
lexical                                78.31%
lexical + automatic                    80.38%
lexical + automatic + manual           83.19%

 Acoustic-prosodic features (“speech”) outperform the majority baseline, but other feature types yield even higher accuracy, and the more the better

32 Acoustic-Prosodic plus Other Features
Baseline = 72.74%; RI range = 23.29% – 42.26%

Feature Set                            -ident
speech + lexical                       79.26%
speech + lexical + automatic           79.64%
speech + lexical + automatic + manual  83.69%

 Adding acoustic-prosodic features to the other feature sets doesn't significantly improve performance

33 Adding Contextual Features
 (Litman et al. 2001; Batliner et al. 2003): adding contextual features improves prediction accuracy
 Local features: the values of all features for the two student turns preceding the student turn to be predicted
 Global features: running averages and totals for all features, over all student turns preceding the student turn to be predicted
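
A sketch of how local and global context vectors might be assembled from per-turn feature dictionaries (the function and key names are illustrative, not the paper's implementation):

```python
def add_context(turn_feats):
    """turn_feats: list of per-turn feature dicts, in dialogue order."""
    out = []
    for i, feats in enumerate(turn_feats):
        row = dict(feats)
        # Local context: feature values of the two preceding student turns.
        for back in (1, 2):
            prev = turn_feats[i - back] if i >= back else {}
            for k in feats:
                row[f"prev{back}_{k}"] = prev.get(k, 0.0)
        # Global context: running average and total over all preceding turns.
        for k in feats:
            history = [t[k] for t in turn_feats[:i]]
            row[f"avg_{k}"] = sum(history) / len(history) if history else 0.0
            row[f"tot_{k}"] = sum(history)
        out.append(row)
    return out
```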

34 Previous Feature Sets plus Context
Same feature set with no context: 83.69%

Feature Set                            +context       -ident
speech + lexical + auto + manual       local          82.44%
speech + lexical + auto + manual       global         84.75%
speech + lexical + auto + manual       local+global   81.43%

 Adding global contextual features marginally improves performance

35 Feature Usage

Feature Type           Turn + Global
Acoustic-Prosodic             16.26%
  Temporal                    13.80%
  Energy                       2.46%
  Pitch                        0.00%
Other                         83.74%
  Lexical                     41.87%
  Automatic                    9.36%
  Manual                      32.51%

36 Accuracies over ML Experiments

37 Empirical Results II
Predicting Student Emotions in Computer-Human Tutoring Dialogues
Diane J. Litman and Kate Forbes-Riley
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004)

38 Computer Tutoring Study
 Additional dataset
– Consensus (all turns, after annotators resolved disagreements)
 Different treatment of minor classes
 Additional binary prediction tasks (in paper)
– Emotional/non-emotional and negative/non-negative
 Slightly different features
– strict turn-taking protocol (no barge-in)
– ASR output rather than actual student utterances

39 Annotated Computer-Human Excerpt (weak -> pos/neg, mixed -> neutral)
ITSPOKE: What happens to the velocity of a body when there is no force acting on it?
Student: dammit (NEGATIVE)
ASR: it is
ITSPOKE: Could you please repeat that?
Student: same (NEUTRAL)
ASR: i same

40 Computer Tutoring: Annotation Agreement Study
 333 student turns, 15 dialogues
 2 annotators (the authors)
 202/333 agreed (61%, Kappa = .4)

                Negative   Neutral   Positive
     Negative         89        30          6
     Neutral          32        94         38
     Positive          6        19         19

41 Acoustic-Prosodic vs. Lexical Features (Agreed Turns)
Baseline = 46.52%

Feature Set         -ident
speech              55.49%
lexical             52.66%
speech + lexical    62.08%

 Both acoustic-prosodic (“speech”) and lexical features significantly outperform the majority baseline
 Combining feature types yields an even higher accuracy

42 Adding Identifier Features (Agreed Turns)
Baseline = 46.52%

Feature Set         -ident    +ident
speech              55.49%    62.03%
lexical             52.66%    67.84%
speech + lexical    62.08%    63.52%

 Adding identifier features improves all results
 With identifier features, lexical information now yields the highest accuracy

43 Using Automatic Speech Recognition (Agreed Turns)
Baseline = 46.52%

Feature Set         -ident    +ident
lexical             52.66%    67.84%
ASR                 57.95%    65.70%
speech + lexical    62.08%    63.52%
speech + ASR        61.22%    62.23%

 Surprisingly, using ASR output rather than human transcriptions does not particularly degrade accuracy

44 Summary of Results (Agreed Turns)

45 Comparison with Human Tutoring
 In human tutoring dialogues, emotion prediction (and annotation) is more accurate and based on somewhat different features

46 Summary of Results (Consensus Turns)
 Using consensus rather than agreed data decreases predictive accuracy for all feature sets, but the other observations generally hold

47 Recap
 Recognition of annotated student emotions in spoken computer and human tutoring dialogues, using multiple knowledge sources
 Significant improvements in predictive accuracy compared to majority class baselines
 A first step towards implementing emotion prediction and adaptation in ITSPOKE

48 Outline
 Introduction
 The ITSPOKE System and Corpora
 Emotion Prediction from Prosody & Other Features
– Method
– Human-human tutoring
– Computer-human tutoring
 Current Directions and Summary

49 Word Level Emotion Models (joint research with Mihai Rotaru)
 Motivation
– Emotion might not be expressed over the entire turn
– Some pitch features make more sense at a smaller level
 Simple word-level emotion model (see the sketch below)
– Label each word with the turn's class
– Learn a word-level emotion model
– Predict the class of each word in a test turn
– Combine word classes using majority/weighted voting
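
A sketch of the voting step, assuming a word-level classifier that returns a label and a confidence per word (names are illustrative):

```python
from collections import Counter

def predict_turn(word_predictions, weighted=False):
    """word_predictions: one (label, confidence) pair per word in the turn.

    Majority voting gives each word one vote; weighted voting sums the
    classifier's confidence for each label instead.
    """
    votes = Counter()
    for label, confidence in word_predictions:
        votes[label] += confidence if weighted else 1
    return votes.most_common(1)[0][0]

# e.g. a three-word test turn:
print(predict_turn([("negative", 0.9), ("neutral", 0.4), ("negative", 0.6)]))
```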

50 Word Level Emotion Models - Results
 Feature sets (Turn and Word levels)
– Lexical
– Pitch
– PitchLex
 Results
– Word-level better than its Turn-level counterpart
– PitchLex at Word-level always among the best performers
– PitchLex at Word-level comparable with the state of the art on our corpora (HC, EnE, MBL)

51 Prosody-Learning Correlations (joint work with Kate Forbes-Riley)
 What aspects of spoken tutoring dialogues correlate with learning gains?
– Dialogue features (Litman et al. 2004)
– Student emotions (frequency or patterns)
– Acoustic-prosodic features
 Human Tutoring
– Faster tempos (syllables/second) and longer turns (seconds) negatively correlate with learning (p < .09)
 Computer Tutoring
– Higher pitch features (average, max, min) negatively correlate with learning (p < .07)
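
Each correlation pairs a per-student prosodic statistic with that student's learning gain. A sketch with scipy's pearsonr; the numbers below are invented purely for illustration, not the study's data:

```python
from scipy.stats import pearsonr

# One value per student: mean tempo over the session (syllables/second)
# paired with learning gain (posttest minus pretest).
tempo = [3.1, 2.4, 4.0, 2.8, 3.6, 2.2, 3.9, 2.5]
gain = [0.20, 0.35, 0.05, 0.28, 0.12, 0.40, 0.08, 0.30]

r, p = pearsonr(tempo, gain)
print(f"r = {r:.2f}, p = {p:.3f}")  # negative r: faster tempo, lower gain
```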

52 Other Directions
 Co-training to address the annotation bottleneck
– Maeireizo, Litman, and Hwa, ACL 2004
 Development of adaptive strategies for ITSPOKE
– Annotation of human tutor turns
 ITSPOKE version 2 and beyond
– Pre-recorded prompts and domain-specific TTS
– Barge-in
– Dynamic adaptation to predicted student emotions

53 Summary
 Recognition of annotated student emotions in spoken tutoring dialogues
 Significant improvements in predictive accuracy compared to majority class baselines
– role of feature types and speech recognition errors
– comparable analysis of human and computer tutoring
 This research is a first step towards implementing emotion prediction and adaptation in ITSPOKE

54 Acknowledgments
 Kurt VanLehn and the Why2 Team
 The ITSPOKE Group
– Kate Forbes-Riley, LRDC
– Beatriz Maeireizo, Computer Science
– Amruta Purandare, Intelligent Systems
– Mihai Rotaru, Computer Science
– Scott Silliman, LRDC
– Art Ward, Intelligent Systems
 NSF and ONR

55 Thank You! Questions?

56 Architecture (diagram)
Components shown: a www browser talking to a www server; the ITSpoke Java process coordinating a Text Manager and a Spoken Dialogue Manager; Speech Analysis (Sphinx) with dialogue repair goals; speech synthesis (Cepstral); Essay Analysis (Carmel, Tacitus-lite+); and a Content Dialogue Manager (Ape, Carmel) driven by Why2 tutorial goals. Data flows as essay text, student text (xml), tutor turns (xml), and html.

57 Speech Recognition: Sphinx2 (CMU)
 Probabilistic language models for different dialogue states
 Initial training data
– typed student utterances from Why2-Atlas corpora
 Later training data
– spoken utterances obtained during development and pilot testing of ITSPOKE
 Total vocabulary: 1240 unique words
 “Semantic Accuracy” Rate = 92.4%

58 Speech Synthesis: Cepstral
 Commercial outgrowth of the Festival text-to-speech synthesizer (Edinburgh, CMU)
 Required additional processing of Why2-Atlas prompts (e.g., f=m*a)

59 Common Experimental Aspects
 Students take a physics pretest
 Students read background material
 Students use a web interface to work through up to 10 problems with either a computer or a human tutor
 Students take a posttest
– 40 multiple choice questions, isomorphic to the pretest

60 Hypotheses
 Compared to typed dialogues, spoken interactions will yield better learning gains, and will be more efficient and natural
 Different student behaviors will correlate with learning in spoken versus typed dialogues, and will be elicited by different tutor actions
 Findings in human-human and human-computer dialogues will vary as a function of system performance

61 Recap
 Human Tutoring: spoken dialogue yielded significant performance improvements
– Greater learning gains
– Reduced dialogue time
– Many differences in superficial dialogue characteristics
 Computer Tutoring: spoken dialogue made little difference
– No change in learning
– Increased dialogue time
– Fewer dialogue differences

