
2 Emotions in IVR Systems Julia Hirschberg COMS 4995/6998 Thanks to Sue Yuen and Yves Scherer

3 Motivation
– "The context of emergency gives a larger palette of complex and mixed emotions."
– Emotions in emergency situations are more extreme, and are "really felt in a natural way."
– Debate on acted vs. real emotions
– Ethical concerns?

4 Real-life Emotions Detection with Lexical and Paralinguistic Cues on Human-Human Call Center Dialogs (Devillers & Vidrascu '06)
– Domain: medical emergencies
– Motive: study real-life speech in highly emotional situations
– Emotions studied: Anger, Fear, Relief, Sadness (though the annotation is finer-grained)
– Corpus: 680 dialogs, 2258 speaker turns
– Training-test split: 72% - 28%
– Machine learning: log-likelihood ratio (linguistic cues), SVM (paralinguistic cues)

5 CEMO Corpus
– 688 dialogs, avg 48 turns per dialog
– Annotation:
  – Decisions of the 2 annotators are combined in a soft vector: emotion mixtures
  – 8 coarse-level emotions, 21 fine-grained emotions
  – Inter-annotator agreement for client turns: 0.57 (moderate)
  – Consistency checks: self-reannotation procedure (85% similarity); perception test (no details given)
– Corpus restricted to caller utterances: 2258 utterances, 680 speakers
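
The slide does not spell out how the two annotators' decisions become a soft vector. A minimal Python sketch of one plausible scheme follows; the label inventory and the 2:1 major/minor weighting are assumptions for illustration, not values from the paper:

```python
from collections import Counter

# Hypothetical coarse label set; the slide names only 4 of the 8 labels.
COARSE = ["Anger", "Fear", "Relief", "Sadness", "Neutral",
          "Surprise", "Hurt", "Other"]

def soft_vector(annot_a, annot_b):
    """Combine two annotators' (major, minor) emotion labels into a
    normalized soft vector. The 2:1 major/minor weighting is assumed."""
    weights = Counter()
    for major, minor in (annot_a, annot_b):
        weights[major] += 2.0
        if minor is not None:
            weights[minor] += 1.0
    total = sum(weights.values())
    return {emo: weights[emo] / total for emo in COARSE}

# Annotator A hears Fear (major) + Sadness (minor); annotator B hears Fear only.
print(soft_vector(("Fear", "Sadness"), ("Fear", None)))
# -> Fear 0.8, Sadness 0.2, all other emotions 0.0
```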

6 Features
– Lexical features / linguistic cues: unigrams of user utterances, stemmed
– Prosodic features / paralinguistic cues:
  – Loudness (energy)
  – Pitch contour (F0)
  – Speaking rate
  – Voice quality (jitter, ...)
  – Disfluency (pauses)
  – Non-linguistic events (mouth noise, crying, ...)
  – Normalized by speaker

7 Annotation
– Utterances annotated with one of the following non-mixed emotions: Anger, Fear, Relief, Sadness
– Justification for this choice?

8 Lexical Cue Model
– Log-likelihood ratio: 4 unigram emotion models (1 for each emotion)
  – Interpolated with a general task-specific model to avoid data-sparsity problems
  – An interpolation coefficient of 0.75 gave the best results
– Stemming:
  – Cuts inflectional suffixes (more important for morphologically rich languages like French)
  – Improves overall recognition rates by 12-13 points
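
A minimal sketch of the interpolated unigram scoring described above. The 0.75 coefficient is the slide's value; the smoothing constant and the argmax decision rule (equivalent in spirit to the paper's log-likelihood-ratio formulation, whose exact form is not given here) are assumptions:

```python
import math

LAM = 0.75  # interpolation coefficient reported as best on the slide

def interp_logprob(stem, emo_counts, gen_counts, lam=LAM):
    """log( lam * P(stem | emotion) + (1 - lam) * P(stem | general) ).
    The general task-specific model backs off sparse emotion counts."""
    p_emo = emo_counts.get(stem, 0) / max(sum(emo_counts.values()), 1)
    p_gen = gen_counts.get(stem, 0) / max(sum(gen_counts.values()), 1)
    return math.log(lam * p_emo + (1 - lam) * p_gen + 1e-12)

def classify(stems, emotion_models, gen_counts):
    """Score an utterance (a list of stems) against each of the 4
    interpolated unigram models and return the best-scoring emotion."""
    return max(emotion_models,
               key=lambda emo: sum(
                   interp_logprob(s, emotion_models[emo], gen_counts)
                   for s in stems))
```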

9 Paralinguistic (Prosodic) Cue Model
– 100 features, fed into an SVM classifier:
  – F0 (pitch contour) and spectral features (formants)
  – Energy (loudness)
  – Voice quality (jitter, shimmer, ...)
    – Jitter: cycle-to-cycle variation in pitch
    – Shimmer: cycle-to-cycle variation in loudness
    – NHR: noise-to-harmonic ratio; HNR: harmonic-to-noise ratio
  – Speaking rate, silences, pauses, filled pauses
  – Mouth noise, laughter, crying, breathing
– Normalized by speaker (~24 user turns per dialog)
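
Speaker normalization is what makes absolute pitch and energy comparable across callers. A hedged sketch, assuming per-speaker z-scoring (the paper's exact normalization is not stated on the slide) and scikit-learn's SVM:

```python
import numpy as np
from sklearn.svm import SVC

def normalize_by_speaker(X, speakers):
    """Z-score each prosodic feature within each speaker, so the
    classifier sees deviations from a caller's own baseline rather
    than absolute pitch or energy."""
    Xn = np.empty_like(X, dtype=float)
    for spk in np.unique(speakers):
        rows = speakers == spk
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0) + 1e-8  # guard against zero variance
        Xn[rows] = (X[rows] - mu) / sd
    return Xn

# Hypothetical usage: X is (n_utterances, 100), speakers an id per row,
# y the emotion labels.
# clf = SVC(kernel="rbf").fit(normalize_by_speaker(X, speakers), y)
```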

10 Results

             Anger   Fear   Relief   Sadness   Total
# Utts          49    384      107       100     640
Lexical        59%    90%      86%       34%     78%
Prosodic       39%    64%      58%       57%   59.8%

Relief is associated with lexical markers like "thanks" or "I agree." "Sadness is more prosodic or syntactic than lexical."

11 Prosody-based Automatic Detection of Annoyance and Frustration in Human-Computer Dialog (Ang et al '02)
What's new:
– Naturally produced emotions
– Automatic methods (except for style features)
– Emotion vs. style
Corpus: DARPA Communicator, 21,899 utts
Labels:
– Neutral, Annoyed, Frustrated, Tired, Amused, Other, NA
– Hyperarticulation, pausing, "raised voice"
– Repeats and corrections
– Data quality: nonnative speakers, speaker switches, system developer

12 Annotation, Features, Method
– ASR output
– Prosodic features: duration, rate, pause, pitch, energy, tilt
– Utterance position, correction labels
– Language model
– Feature selection
– Downsampled data to correct for neutral skew
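
A short sketch of the downsampling step, assuming simple random removal of neutral utterances; the target class ratio Ang et al. actually used is not stated on the slide, so the "no larger than the rest combined" rule here is an assumption:

```python
import numpy as np

def downsample_neutral(X, y, majority="neutral", seed=0):
    """Randomly drop majority-class utterances until that class is no
    larger than all other classes combined (assumed target ratio).
    X is an array of feature rows, y a numpy array of string labels."""
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority)
    rest = np.flatnonzero(y != majority)
    keep = rng.choice(maj, size=min(len(maj), len(rest)), replace=False)
    idx = np.sort(np.concatenate([keep, rest]))
    return X[idx], y[idx]
```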

13 Results
– Useful features: duration, rate, pitch, repeat/correction, energy, position
– "Raised voice" was the only useful stylistic predictor, and it is acoustically defined
– Performance comparable to human agreement

14 Two Stream Emotion Recognition for Call Center Monitoring (Gupta & Rajput '07)
– Goal: help supervisors evaluate agents at call centers
– Method: a two-stream technique to detect strong emotion
  – Acoustic features
  – Lexical features
  – Weighted combination
– Corpus: 900 calls (60h)

15 Two-Stream Recognition
– Acoustic stream: extracted features based on pitch and energy
– Semantic stream: performed speech-to-text conversion; text classification algorithms (TF-IDF) identified phrases such as "pleasure," "thanks," "useless," and "disgusting."
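
A minimal sketch of the semantic stream. The paper names TF-IDF as the text-classification technique; the logistic-regression classifier head and the toy transcripts here are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy ASR transcripts; the cue words mirror those quoted on the slide.
texts = ["thanks that was a pleasure",
         "this is useless and disgusting",
         "please check my account balance"]
labels = ["happy", "hot-anger", "neutral"]

# TF-IDF weights each term by how distinctive it is across transcripts;
# a linear classifier then scores the emotion-bearing phrases.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["useless service"]))  # likely ['hot-anger']
```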

16 Implementation
Method:
– Two streams analyzed separately:
  – Speech utterance / acoustic features
  – Spoken text / semantics, via speech recognition of the conversation
– Confidence levels of the two streams combined
– Examined 3 emotions: neutral, hot-anger, happy
Tested on two data sets:
– LDC data
– 20 real-world call-center calls
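
A sketch of the confidence combination, assuming a simple linear weighting of per-emotion scores; the 0.5 weight is a placeholder, not the paper's value:

```python
def combine_streams(acoustic, semantic, w_acoustic=0.5):
    """Weighted combination of per-emotion confidences from the two
    streams. Both inputs map emotion -> confidence in [0, 1]; the
    weight w_acoustic is an assumed parameter."""
    combined = {emo: w_acoustic * acoustic[emo]
                     + (1.0 - w_acoustic) * semantic[emo]
                for emo in acoustic.keys() & semantic.keys()}
    return max(combined, key=combined.get)

# The acoustic stream is unsure, but lexical cues push the call to anger.
print(combine_streams({"neutral": 0.40, "hot-anger": 0.35, "happy": 0.25},
                      {"neutral": 0.20, "hot-anger": 0.70, "happy": 0.10}))
# -> hot-anger  (0.5*0.35 + 0.5*0.70 = 0.525 vs. neutral at 0.30)
```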

17 Two Stream - Conclusion
– Table 2 suggested that two-stream analysis is more accurate than acoustic or semantic analysis alone
– Recognition on LDC data was significantly higher than on real-world data
– Neutral utterances were recognized less accurately
– The two-stream combination improved identification of "happy" and "anger" by ~20%
– Low acoustic-stream accuracy may be due to sentence length in the real-world data: speakers rarely express emotion strongly throughout a long sentence

18 Questions
– Gupta & Rajput analyzed 3 emotions (happy, neutral, hot-anger). Why break the problem down into these categories? What are the implications? Can this technique be applied to a wider range of emotions, or to other applications?
– Speech-to-text may not transcribe the complete conversation. Would further examination greatly improve results? What are the pros and cons?
– The pitch range was 50-400 Hz, so the findings may not apply outside this range. Do you think it necessary to examine other frequencies?
– TF-IDF (term frequency - inverse document frequency) is used to classify utterances; accuracy for acoustics alone is about 55%. Previous research suggests that alternative techniques may be better. Would implementing them yield better results? What are the pros and cons of using TF-IDF?

19 Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
Previous work:
– Mozziconacci (1995) suggested that VQ combined with f0 could convey affect
– Gobl (2002) showed that synthesized stimuli varying VQ can add affective coloring; "VQ + f0" stimuli were more affective than "f0 only"
– Gobl (2003) tested VQ with a large f0 range, but did not examine the contribution of affect-related f0 contours
Objective: examine the effects of VQ and f0 on affect expression

20 Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
3 series of stimuli based on the Swedish utterance "ja adjö":
– Stimuli exemplifying VQ
– Stimuli with modal voice quality and different affect-related f0 contours
– Stimuli combining both
Parameters exemplifying 5 voice qualities (VQ) were tested:
– Modal voice
– Breathy voice
– Whispery voice
– Lax-creaky voice
– Tense voice
15 synthesized test stimuli in total (see Table 1)

21 What is Voice Quality?
Phonation gestures, derived from a variety of laryngeal and supralaryngeal features:
– Adductive tension: the interarytenoid muscles adduct the arytenoid cartilages
– Medial compression: adductive force on the vocal processes, adjusting the ligamental glottis
– Longitudinal pressure: tension of the vocal folds

22 Tense Voice
– Very strong tension of the vocal folds
– Very high tension in the vocal tract

23 Whispery Voice
– Very low adductive tension
– Moderately high medial compression
– Moderately high longitudinal tension
– Little or no vocal fold vibration
– Turbulence generated by friction of the air in and above the larynx

24 Creaky Voice
– Vocal fold vibration at a low, irregular frequency
– Low tension (only the ligamental part of the glottis vibrates)
– Vocal folds strongly adducted
– Weak longitudinal tension
– Moderately high medial compression

25 Breathy Voice
– Low tension: minimal adductive tension, weak medial compression
– Medium longitudinal vocal fold tension
– Vocal folds do not come together completely, leading to frication

26 Modal Voice
– "Neutral" mode: muscular adjustments moderate
– Vocal fold vibration periodic, with full closing of the glottis and no audible friction
– Frequency of vibration and loudness in the low to mid range for conversational speech

27 Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
Six sub-tests with 20 native speakers of Hiberno-English. Stimuli were rated on 12 affective attributes, arranged as opposed pairs:
– Sad - happy
– Intimate - formal
– Relaxed - stressed
– Bored - interested
– Apologetic - indignant
– Fearless - scared
Participants marked each response on a scale between the paired attributes (e.g., intimate to formal), with an option for "no affective load."

28 Voice Quality and f0 Test: Conclusion
– Results categorized into 4 groups
– No simple one-to-one mapping between voice quality and affect
– "Happy" was the most difficult affect to synthesize
– Suggested that, in addition to f0, VQ should be used to synthesize affectively colored speech; VQ appears to be crucial for expressive synthesis

29 Voice Quality and f0 Test: Discussion
– If the scale runs from 1 to 7, then 3.5 should be "neutral"; however, most ratings are below 2. Do the conclusions (see Fig 2) seem strong?
– In terms of VQ and f0, the groupings in Fig 2 suggest that certain affects are closely related. What are the implications? For example, are happy and indignant closer than relaxed and formal? Do you agree?
– Do you consider an intimate voice more "breathy" or "whispery"? Does your intuition agree with the paper?
– Yanushevskaya found that VQ accounts for the highest affect ratings overall. How can the range of voice qualities be compared with the f0 range? Do you think they are comparable? Is there a different way to describe these qualities?

30 Questions?

