Emotions in IVR Systems
Julia Hirschberg, COMS 4995/6998
Thanks to Sue Yuen and Yves Scherer.

Motivation
– “The context of emergency gives a larger palette of complex and mixed emotions.”
– Emotions in emergency situations are more extreme, and are “really felt in a natural way.”
– Debate on acted vs. real emotions
– Ethical concerns?

Real-life Emotions Detection with Lexical and Paralinguistic Cues on Human-Human Call Center Dialogs (Devillers & Vidrascu ’06)
– Domain: medical emergencies
– Motive: study real-life speech in highly emotional situations
– Emotions studied: Anger, Fear, Relief, Sadness (but finer-grained annotation)
– Corpus: 680 dialogs, 2258 speaker turns
– Training-test split: 72% / 28%
– Machine learning methods: log-likelihood ratio (linguistic), SVM (paralinguistic)

CEMO Corpus
– 688 dialogs, avg. 48 turns per dialog
– Annotation:
  – Decisions of 2 annotators are combined in a soft vector
  – Emotion mixtures
  – 8 coarse-level emotions, 21 fine-grained emotions
  – Inter-annotator agreement for client turns: 0.57 (moderate)
  – Consistency checks: self-reannotation procedure (85% similarity); perception test (no details given)
– Corpus restricted to caller utterances: 2258 utterances, 680 speakers
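The slide quotes 0.57 agreement without naming the statistic. Purely as an illustration (not necessarily the measure Devillers & Vidrascu used), a Cohen's kappa computation over hypothetical per-turn labels from two annotators might look like this:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-turn labels from two annotators. CEMO combines the two
# annotators' decisions into a soft emotion vector; a kappa-style statistic is
# one common way to arrive at an agreement figure like the 0.57 quoted above
# (the exact metric used is not specified on the slide).
annotator_a = ["fear", "fear", "anger", "relief", "sadness", "fear", "relief"]
annotator_b = ["fear", "anger", "anger", "relief", "fear", "fear", "relief"]

print(round(cohen_kappa_score(annotator_a, annotator_b), 2))
```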

Features
– Lexical features / linguistic cues: unigrams of user utterances, stemmed
– Prosodic features / paralinguistic cues:
  – Loudness (energy)
  – Pitch contour (F0)
  – Speaking rate
  – Voice quality (jitter, ...)
  – Disfluency (pauses)
  – Non-linguistic events (mouth noise, crying, ...)
  – Normalized by speaker
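The extraction pipeline itself is not shown in the paper; as a rough sketch only, a few of the cues listed above (F0 contour statistics, energy, and a crude pause proxy) could be computed from a waveform with the librosa library roughly as follows. Jitter, shimmer, speaking rate, and the non-linguistic events need dedicated tooling and are omitted; the function name and thresholds are illustrative assumptions.

```python
import numpy as np
import librosa

def prosodic_features(wav_path, sr=16000):
    """Very rough per-utterance prosodic cues: F0 statistics, energy
    statistics, and a crude pause proxy.  The real CEMO feature set (jitter,
    shimmer, speaking rate, non-linguistic events, ...) is far richer."""
    y, sr = librosa.load(wav_path, sr=sr)

    # F0 contour via probabilistic YIN; unvoiced frames are returned as NaN.
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    voiced = f0[~np.isnan(f0)]

    # Frame-level RMS energy as a stand-in for loudness.
    rms = librosa.feature.rms(y=y)[0]

    # Crude disfluency/pause proxy: fraction of very low-energy frames.
    pause_ratio = float(np.mean(rms < 0.1 * rms.max()))

    return {
        "f0_mean": float(voiced.mean()) if voiced.size else 0.0,
        "f0_range": float(np.ptp(voiced)) if voiced.size else 0.0,
        "energy_mean": float(rms.mean()),
        "energy_std": float(rms.std()),
        "pause_ratio": pause_ratio,
    }
```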

Annotation
– Utterances annotated with one of the following non-mixed emotions: Anger, Fear, Relief, Sadness
– Justification for this choice?

Lexical Cue Model
– Log-likelihood ratio: 4 unigram emotion models (1 for each emotion)
  – Plus a general task-specific model
  – Interpolation coefficient to avoid data sparsity problems
  – A coefficient of 0.75 gave the best results
– Stemming:
  – Cut inflectional suffixes (more important for morphologically rich languages like French)
  – Improves overall recognition rates by points
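The slide only names the ingredients; the following is a minimal sketch (not the authors' implementation) of an interpolated unigram scorer along those lines, with the 0.75 coefficient taken from the slide and a small probability floor added for tokens unseen in training:

```python
import math
from collections import Counter

EMOTIONS = ["Anger", "Fear", "Relief", "Sadness"]
LAMBDA = 0.75  # interpolation coefficient reported as best on the slide

def train_unigram_models(utts_by_emotion):
    """utts_by_emotion: {emotion: [list of stemmed tokens per utterance]}."""
    emo_counts = {e: Counter(tok for utt in utts for tok in utt)
                  for e, utts in utts_by_emotion.items()}
    general = Counter()              # general task-specific model
    for counts in emo_counts.values():
        general.update(counts)
    return emo_counts, general

def log_prob(token, counts, general):
    """Unigram probability interpolated with the general model (sparsity fix)."""
    p_emo = counts[token] / max(sum(counts.values()), 1)
    p_gen = general[token] / max(sum(general.values()), 1)
    p = LAMBDA * p_emo + (1 - LAMBDA) * p_gen
    return math.log(p) if p > 0 else math.log(1e-9)   # floor for unseen tokens

def classify(tokens, emo_counts, general):
    """Assign the emotion whose interpolated unigram model scores highest."""
    return max(EMOTIONS,
               key=lambda e: sum(log_prob(t, emo_counts[e], general)
                                 for t in tokens))
```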

Paralinguistic (Prosodic) Cue Model
– 100 features, fed into an SVM classifier:
  – F0 (pitch contour) and spectral features (formants)
  – Energy (loudness)
  – Voice quality (jitter, shimmer, ...)
    – Jitter: variation in pitch in the voice
    – Shimmer: variation in loudness in the voice
    – NHR: noise-to-harmonic ratio
    – HNR: harmonic-to-noise ratio
  – Speaking rate, silences, pauses, filled pauses
  – Mouth noise, laughter, crying, breathing
– Normalized by speaker (~24 user turns per dialog)
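The exact SVM configuration is not given on the slide; a minimal sketch of the per-speaker normalization step followed by an SVM classifier, assuming scikit-learn and hypothetical feature arrays, could look like this:

```python
import numpy as np
from sklearn.svm import SVC

def zscore_by_speaker(X, speakers):
    """Per-speaker z-normalization of the prosodic feature matrix, so a
    caller's habitual pitch and energy level is factored out (the slide's
    'normalized by speaker', with ~24 turns per dialog to estimate the stats)."""
    X = np.asarray(X, dtype=float)
    speakers = np.asarray(speakers)
    Xn = np.empty_like(X)
    for spk in np.unique(speakers):
        rows = speakers == spk
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0) + 1e-8     # avoid division by zero
        Xn[rows] = (X[rows] - mu) / sd
    return Xn

# Hypothetical usage (X_raw: one row of ~100 prosodic features per caller turn):
# X = zscore_by_speaker(X_raw, speaker_ids)
# clf = SVC(kernel="rbf").fit(X, emotion_labels)
```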

Results

           Anger   Fear   Relief   Sadness   Total
Lexical    59%     90%    86%      34%       78%
Prosodic   39%     64%    58%      57%       59.8%

– Relief is associated with lexical markers like “thanks” or “I agree.”
– “Sadness is more prosodic or syntactic than lexical.”
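The slide does not spell out how these percentages were computed; they read like per-emotion detection (recall) rates plus an overall figure, which could be reproduced from system predictions roughly as follows (illustrative data, assuming scikit-learn):

```python
from sklearn.metrics import recall_score, accuracy_score

LABELS = ["Anger", "Fear", "Relief", "Sadness"]

def per_emotion_rates(y_true, y_pred):
    """Per-class detection rate (recall) plus overall accuracy; illustrative,
    since the slide does not name the exact metric behind the table above."""
    per_class = recall_score(y_true, y_pred, labels=LABELS, average=None)
    return dict(zip(LABELS, per_class)), accuracy_score(y_true, y_pred)

# Hypothetical predictions for a handful of test utterances:
rates, overall = per_emotion_rates(
    ["Fear", "Fear", "Anger", "Relief", "Sadness", "Fear"],
    ["Fear", "Fear", "Fear", "Relief", "Fear", "Fear"])
print(rates, overall)
```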

Prosody-based Automatic Detection of Annoyance and Frustration in Human-Computer Dialog (Ang et al ’02)
– What’s new:
  – Naturally produced emotions
  – Automatic methods (except for style features)
  – Emotion vs. style
– Corpus: DARPA Communicator, 21,899 utts
– Labels:
  – Neutral, Annoyed, Frustrated, Tired, Amused, Other, NA
  – Hyperarticulation, pausing, ‘raised voice’
  – Repeats and corrections
  – Data quality: nonnative speakers, speaker switch, system developer

Annotation, Features, Method
– ASR output
– Prosodic features: duration, rate, pause, pitch, energy, tilt
– Utterance position, correction labels
– Language model
– Feature selection
– Downsampled data to correct for neutral skew
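Ang et al. downsample to correct for the predominance of neutral utterances; the slide gives no details of the procedure, so the following is only a generic sketch of that kind of class rebalancing:

```python
import random
from collections import defaultdict

def downsample_majority(items, labels, majority="neutral", ratio=1.0, seed=0):
    """Randomly drop majority-class utterances so that the number of 'neutral'
    items is at most `ratio` times the number of non-neutral items -- a simple
    way to correct the heavy neutral skew mentioned on the slide."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, lab in zip(items, labels):
        by_label[lab].append(item)

    n_other = sum(len(v) for lab, v in by_label.items() if lab != majority)
    keep = rng.sample(by_label[majority],
                      min(len(by_label[majority]), int(ratio * n_other)))

    out_items, out_labels = [], []
    for lab, group in by_label.items():
        chosen = keep if lab == majority else group
        out_items.extend(chosen)
        out_labels.extend([lab] * len(chosen))
    return out_items, out_labels
```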

Results
– Useful features: duration, rate, pitch, repeat/correction, energy, position
– ‘Raised voice’ is the only stylistic predictor, and it is acoustically defined
– Performance comparable to human agreement

Two Stream Emotion Recognition for Call Center Monitoring (Gupta & Rajput ’07)
– Goal: help supervisors evaluate agents at call centers
– Method: a two-stream technique to detect strong emotion
  – Acoustic features
  – Lexical features
  – Weighted combination
– Corpus: 900 calls (60 h)

Two-Stream Recognition
– Acoustic stream: extracted features based on pitch and energy
– Semantic stream: performed speech-to-text conversion; text classification algorithms (TF-IDF) identified phrases such as “pleasure,” “thanks,” “useless,” and “disgusting”
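As a hedged illustration of the semantic stream (the slide names TF-IDF but not the classifier; TfidfVectorizer and LogisticRegression from scikit-learn are stand-ins here, and the example phrases echo those on the slide):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; the real system is trained on transcribed calls.
texts = [
    "thank you that was a pleasure",       # happy
    "this is useless and disgusting",      # hot-anger
    "i would like to check my balance",    # neutral
]
labels = ["happy", "hot-anger", "neutral"]

# TF-IDF features over the ASR transcript, fed to a linear classifier so that
# predict_proba yields a per-emotion confidence for later fusion.
semantic_stream = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                                LogisticRegression(max_iter=1000))
semantic_stream.fit(texts, labels)

probs = semantic_stream.predict_proba(["thanks that was useless"])[0]
print(dict(zip(semantic_stream.classes_, probs.round(3))))
```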

Implementation
– Two streams analyzed separately:
  – Speech utterance / acoustic features
  – Spoken text / semantics / speech recognition of the conversation
– Confidence levels of the two streams combined
– Examined 3 emotions: Neutral, Hot-anger, Happy
– Tested two data sets:
  – LDC data
  – 20 real-world call-center calls
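The slide says only that the confidence levels of the two streams are combined; one plausible combination rule is a weighted average, sketched below with the weight as an assumed, tunable parameter:

```python
import numpy as np

EMOTIONS = ["Neutral", "Hot-anger", "Happy"]

def fuse(acoustic_conf, semantic_conf, w_acoustic=0.5):
    """Convex combination of the per-emotion confidence vectors produced by
    the two streams.  The actual weighting used by Gupta & Rajput is not given
    on the slide, so w_acoustic is an assumed, tunable parameter."""
    a = np.asarray(acoustic_conf, dtype=float)
    s = np.asarray(semantic_conf, dtype=float)
    fused = w_acoustic * a + (1.0 - w_acoustic) * s
    return fused / fused.sum()          # renormalize to a distribution

fused = fuse([0.2, 0.7, 0.1], [0.5, 0.2, 0.3], w_acoustic=0.6)
print(EMOTIONS[int(np.argmax(fused))])  # emotion with the highest fused confidence
```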

Two Stream: Conclusions
– Table 2 suggested that two-stream analysis is more accurate than the acoustic or semantic stream alone
– Recognition on LDC data was significantly higher than on real-world data
– Neutral utterances were recognized with lower accuracy
– Combining the two streams improved identification of “happy” and “anger” by roughly 20%
– Low acoustic-stream accuracy may be attributable to the length of sentences in the real-world data: speakers rarely express emotion strongly throughout long sentences

Questions
– Gupta & Rajput analyzed 3 emotions (happy, neutral, hot-anger). Why break emotion down into these categories, and what are the implications? Can the technique be applied to a wider range of emotions, or to other applications?
– Speech-to-text may not transcribe the complete conversation. Would further examination greatly improve results? What are the pros and cons?
– Pitch range was from Hz, so the research may not be applicable outside this range. Do you think it is necessary to examine other frequencies?
– TF-IDF (Term Frequency – Inverse Document Frequency) is used to classify utterances, and accuracy for acoustics only is about 55%. Previous research suggests that alternative techniques may be better. Would implementing them yield better results? What are the pros and cons of using TF-IDF?

Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
– Previous work:
  – Mozziconacci (1995) suggested that VQ combined with f0 could convey affect
  – Gobl (2002) suggested that synthesized stimuli with VQ can add affective coloring; the study suggested that “VQ + f0” stimuli are more affective than “f0 only” stimuli
  – Gobl (2003) tested VQ with a large f0 range, but did not examine the contribution of affect-related f0 contours
– Objective: examine the effects of VQ and f0 on affect expression

Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
– 3 series of stimuli based on the Swedish utterance “ja adjö”:
  – Stimuli exemplifying VQ
  – Stimuli with modal voice quality and different affect-related f0 contours
  – Stimuli combining both
– Parameters tested exemplify 5 voice qualities (VQ):
  – Modal voice
  – Breathy voice
  – Whispery voice
  – Lax-creaky voice
  – Tense voice
– 15 synthesized stimuli test samples (see Table 1)

What is Voice Quality?
– Phonation gestures, derived from a variety of laryngeal and supralaryngeal features:
  – Adductive tension: the interarytenoid muscles adduct the arytenoid cartilages
  – Medial compression: adductive force on the vocal processes; adjustment of the ligamental glottis
  – Longitudinal tension: tension of the vocal folds

Tense Voice
– Very strong tension of the vocal folds; very high tension in the vocal tract

Whispery Voice
– Very low adductive tension
– Moderately high medial compression
– Moderately high longitudinal tension
– Little or no vocal fold vibration
– Turbulence generated by friction of the air in and above the larynx

Creaky Voice
– Vocal fold vibration at low frequency, irregular
– Low tension (only the ligamental part of the glottis vibrates)
– Vocal folds strongly adducted
– Weak longitudinal tension
– Moderately high medial compression

Breathy Voice
– Low tension: minimal adductive tension, weak medial compression
– Medium longitudinal vocal fold tension
– Vocal folds do not come together completely, leading to frication

Modal Voice
– “Neutral” mode
– Moderate muscular adjustments
– Periodic vocal fold vibration, full closing of the glottis, no audible friction
– Frequency of vibration and loudness in the low to mid range for conversational speech

Voice Quality and f0 Cues for Affect Expression: Implications for Synthesis
– Six sub-tests with 20 native speakers of Hiberno-English
– Stimuli rated on 12 different affective attributes (6 bipolar pairs):
  – Sad – happy
  – Intimate – formal
  – Relaxed – stressed
  – Bored – interested
  – Apologetic – indignant
  – Fearless – scared
– Participants were asked to mark their response on a scale running between the two poles of each pair (e.g., intimate ... formal), with “no affective load” at the midpoint

Voice Quality and f0 Test: Conclusions
– Results were categorized into 4 groups
– No simple one-to-one mapping between voice quality and affect
– “Happy” was the most difficult affect to synthesize
– Suggested that, in addition to f0, VQ should be used to synthesize affectively colored speech; VQ appears to be crucial for expressive synthesis

Voice Quality and f0 Test: Discussion
– If the scale is 1-7, then 3.5 should be “neutral”; however, most ratings are less than 2. Do the conclusions (see Fig. 2) seem strong?
– In terms of VQ and f0, the groupings in Fig. 2 seem to suggest that certain affects are closely related. What are the implications? For example, are “happy” and “indignant” closer than “relaxed” and “formal”? Do you agree?
– Do you consider an intimate voice more “breathy” or “whispery”? Does your intuition agree with the paper?
– Yanushevskaya et al. found that VQ accounts for the highest affect ratings overall. How can a range of voice qualities be compared with a range of frequency (f0)? Do you think they are comparable? Is there a different way to describe these qualities?

Questions?