Outline Why study emotional speech?

Presentation transcript:

Emotional Speech CS 4706 Julia Hirschberg (thanks to Jackson Liscombe and Lauren Wilcox for some slides)

Outline
Why study emotional speech?
Why is modeling emotional speech so difficult?
Production and perception studies
Voice quality features: the holy grail

Why study emotional speech?
Recognition: customer-care centers, tutoring systems, automated agents (Wildfire)
Generation: characteristics of ‘emotional speech’ are little understood, so it is hard to produce: …a voice that sounds friendly, sympathetic, authoritative…
TTS systems
Games

Emotion in Spoken Dialogue Systems
Batliner, Huber, Fischer, Spilker, Nöth (2003): Verbmobil (Wizard of Oz scenarios)
Ang, Dhillon, Krupski, Shriberg, Stolcke (2002): DARPA Communicator
Liscombe, Guicciardi, Tur, Gokken-Tur (2005): “How May I Help You?” call center
Lee, Narayanan (2004): Speechworks call center
Liscombe, Hirschberg, Venditti (2005): ITSpoke Tutoring System (physics)

Why is emotional speech so hard to model?
Colloquial definitions of speakers and listeners ≠ technical definitions
Utterances may convey multiple emotions simultaneously
Result: human consensus is low
Hard to get reliable training data

Spontaneous Corpora
Unconstrained: [Campbell, 2003] [Roach, 2000] [Cowie et al., 2001]
Call centers: [Vidrascu & Devillers, 2005] [Ang et al., 2002] [Litman and Forbes-Riley, 2004] [Batliner et al., 2003] [Lee & Narayanan, 2005]
Meetings: [Wrede and Shriberg, 2003]

Acted Corpora
happy, sad, angry, confident, frustrated, friendly, interested, anxious, bored, encouraging

LDC Emotional Prosody and Transcripts corpus
Semantically neutral (dates and numbers)
8 actors
15 emotions

Are Emotions Mutually Exclusive?
User study to classify tokens from LDC Emotional Prosody corpus
10 emotions only:
Positive: confident, encouraging, friendly, happy, interested
Negative: angry, anxious, bored, frustrated, sad
Example

Emotion Intercorrelations
Columns: sad angry bored frust anxs friend conf happy inter encour
0.44 0.26 0.22 -0.27 -0.32 -0.42 -0.33 0.70 0.21 -0.41 -0.37 -0.09 0.14 -0.14 -0.28 -0.17 frustrated 0.32 -0.43 -0.47 -0.16 -0.39 anxious -0.25 friendly 0.77 0.59 0.75 confident 0.45 0.51 0.58 0.73 interested 0.62 encouraging
(p < 0.001)
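The slide does not say which correlation coefficient was used. As a minimal sketch, assuming Pearson correlation over per-token listener ratings, an intercorrelation between two emotions could be computed as follows. The `ratings` values and the 0 to 4 scale are hypothetical illustrations, not corpus data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant ratings")
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings for six tokens (assumed 0-4 scale); not taken from the study.
ratings = {
    "happy":    [4, 3, 0, 1, 4, 0],
    "friendly": [3, 3, 1, 0, 4, 1],
    "sad":      [0, 1, 4, 3, 0, 3],
}

for a, b in [("happy", "friendly"), ("happy", "sad")]:
    print(a, b, round(pearson(ratings[a], ratings[b]), 2))
```

With ratings like these, same-valence pairs come out positively correlated and opposite-valence pairs negatively correlated, which is the pattern the slide reports.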

Results
Emotions are heavily correlated: positive with positive, negative with negative
Emotions are non-exclusive
Can they be clustered empirically?
Activation
Valence

Global Pitch Statistics: Different Valence/Activation

Different Valence/Same Activation

Identifying Emotions
Automatic Acoustic-prosodic [Davitz, 1964] [Huttar, 1968]
Global characterization: pitch, loudness, speaking rate
Intonational Contours [Mozziconacci & Hermes, 1999]
Spectral Tilt [Banse & Scherer, 1996] [Ang et al., 2002]
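A global characterization of pitch can be sketched as a handful of utterance-level statistics over an F0 contour. This is only an illustration under assumptions: the function name, the choice of statistics, and the convention that unvoiced frames are coded as 0 Hz are not from the slides.

```python
from statistics import mean, pstdev

def global_pitch_stats(f0_hz):
    """Summarize an utterance-level F0 contour.

    Assumption: frames with F0 <= 0 mark unvoiced regions and are skipped.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames")
    return {
        "mean": mean(voiced),
        "min": min(voiced),
        "max": max(voiced),
        "range": max(voiced) - min(voiced),
        "stdev": pstdev(voiced),
    }

# Hypothetical contour in Hz, one value per analysis frame.
print(global_pitch_stats([0, 100, 200, 0, 300]))
```

Loudness and speaking rate could be summarized the same way, for example from per-frame energy and syllables per second.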

Machine Learning Experiment
RIPPER, 90/10 split
Binary classification for each emotion
Results: 62% average baseline, 75% average accuracy
Acoustic-prosodic features for activation
/H-L%/ for negative; /L-L%/ for positive
Spectral tilt for valence?
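The experimental setup can be sketched as turning multi-class labels into one binary task per emotion, holding out 10% of the data, and comparing against a baseline. The slide does not say how its baseline was computed; the majority-class baseline below is one common choice and is an assumption, as are all function names. The RIPPER rule learner itself is not reproduced here.

```python
import random

def one_vs_rest(labels, target):
    """Binary labels: True where the token carries the target emotion."""
    return [lab == target for lab in labels]

def majority_baseline(binary_labels):
    """Accuracy of always predicting the more frequent of the two classes."""
    positives = sum(binary_labels)
    return max(positives, len(binary_labels) - positives) / len(binary_labels)

def split_90_10(items, seed=0):
    """Shuffle and split into 90% training and 10% test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.9 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labels, not corpus data.
labels = ["angry"] * 3 + ["sad"] * 7
print(majority_baseline(one_vs_rest(labels, "angry")))
train, test = split_90_10(range(20))
print(len(train), len(test))
```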

Accuracy Distinguishing One Emotion from the Rest
Emotion: Baseline, Accuracy
angry: 69.32%, 77.27%
confident: 75.00%
happy: 57.39%, 80.11%
interested: 69.89%, 74.43%
encouraging: 52.27%, 72.73%
sad: 61.93%
anxious: 55.68%, 71.59%
bored: 66.48%, 78.98%
friendly: 59.09%, 73.86%
frustrated

A Call Center Application
AT&T’s “How May I Help You?” system
Customers often angry and frustrated

HMIHY Example: Very Frustrated, Somewhat Frustrated

Pitch, Energy and Rate

Features
Automatic Acoustic-prosodic
Contextual [Cauldwell, 2000]
Lexical [Schröder, 2003] [Brennan, 1995]
Pragmatic [Ang et al., 2002] [Lee & Narayanan, 2005]

Results
Feature Set: Accuracy, Rel. Improv. over Baseline
Majority Class: 73.1%, -----
pros+lex: 76.1%
pros+lex+da: 77.0%, 1.2%
all: 79.0%, 3.8%
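The relative-improvement column can be checked with a line of arithmetic. The slide does not state which row serves as the reference. The sketch below assumes it is the pros+lex accuracy (76.1%), because that choice reproduces the reported 1.2% and 3.8%; measured against the majority class (73.1%) the figures would be larger.

```python
def relative_improvement(accuracy, reference):
    """Relative improvement of accuracy over a reference accuracy, in percent."""
    return (accuracy - reference) / reference * 100.0

# Accuracies (in percent) as reported on the slide.
accuracies = {"majority": 73.1, "pros+lex": 76.1, "pros+lex+da": 77.0, "all": 79.0}

for name in ("pros+lex+da", "all"):
    value = relative_improvement(accuracies[name], accuracies["pros+lex"])
    print(name, round(value, 1))  # prints 1.2 and 3.8
```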

Tutoring Systems Should Respond to Uncertainty
SCoT [Pon-Barry et al. 2006]
Responding to uncertainty: active listening, hinting vs. paraphrasing
Features examined: latency, filled pauses, hedges
Performance metric: learning gain
But no improvement by responding to uncertainty

What does uncertainty sound like?

[pr01_sess00_prob58]

Uncertainty in ITSpoke
um <sigh> I don’t even think I have an idea here ...... now .. mass isn’t weight ...... mass is ................ the .......... space that an object takes up ........ is that mass?
One ‘.’ corresponds to 0.25 seconds.
[71-67-1:92-113]
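The dot notation in this transcript converts directly to pause lengths, since each ‘.’ stands for 0.25 seconds. A minimal sketch, assuming pauses are written as whitespace-separated tokens made up only of dots (the function name is hypothetical):

```python
SECONDS_PER_DOT = 0.25  # from the slide: one '.' corresponds to 0.25 seconds

def pause_durations(transcript):
    """Return the length in seconds of each dot-only token, in order."""
    return [
        len(token) * SECONDS_PER_DOT
        for token in transcript.split()
        if set(token) == {"."}
    ]

# Short made-up example in the same notation.
print(pause_durations("mass is .... the .. space"))  # [1.0, 0.5]
```

Durations like these could feed a latency or pause feature of the kind examined for uncertainty.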

ITSpoke Experiment
Human-Human Corpus
AdaBoost(C4.5), 90/10 split in WEKA
Classes: Uncertain vs Certain vs Neutral
Results (Features: Accuracy)
Baseline: 66%
Acoustic-prosodic: 75%
+ contextual: 76%
+ breath-groups: 77%

ITSpoke Results
Emotion: Precision, Recall, F-measure
certain: 0.611, 0.602, 0.606
uncertain: 0.515, 0.393, 0.446
neutral: 0.846, 0.891, 0.868
Confusion matrix (rows: emotion label; columns: classified as certain, uncertain, neutral)
certain: 80, 11, 42
uncertain: 26, 35, 28
neutral: 25, 22, 384
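The precision, recall, and F-measure figures can be recomputed from the confusion matrix. The sketch below assumes that rows are the annotated emotion label and columns the classifier output, both in the order certain, uncertain, neutral; under that reading the counts reproduce the reported scores.

```python
LABELS = ["certain", "uncertain", "neutral"]

# Assumed layout: rows = annotated label, columns = classified as.
CONFUSION = [
    [80, 11, 42],
    [26, 35, 28],
    [25, 22, 384],
]

def per_class_scores(matrix, labels):
    """Return {label: (precision, recall, f_measure)}, rounded to 3 places."""
    scores = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = true_positives / predicted
        recall = true_positives / actual
        f_measure = 2 * precision * recall / (precision + recall)
        scores[label] = (round(precision, 3), round(recall, 3), round(f_measure, 3))
    return scores

for label, values in per_class_scores(CONFUSION, LABELS).items():
    print(label, values)
```

The uncertain class has the lowest recall: 54 of its 89 tokens are classified as certain or neutral.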

Voice Quality and Emotion
Perceptual coloring
Derived from a variety of laryngeal and supralaryngeal features
modal, creaky, whispered, harsh, breathy, ...
Correlates with emotion
Laver ’80, Scherer ’86, Murray & Arnott ’93, Laukkanen ’96, Johnstone & Scherer ’99, Gobl & Chasaide ’03, Fernandez ’00

Phonation Gestures
Adductive tension: interarytenoid muscles adduct the arytenoid cartilages
Medial compression: adductive force on the vocal processes; adjustment of the ligamental glottis
Longitudinal tension: tension of the vocal folds

Modal Voice
“Neutral” mode
Muscular adjustments moderate
Vibration of vocal folds periodic, full closing of glottis, no audible friction
Frequency of vibration and loudness in low to mid range for conversational speech

Tense Voice
Very strong tension of vocal folds, very high tension in vocal tract

Whispery Voice
Very low adductive tension
Medial compression moderately high
Longitudinal tension moderately high
Little or no vocal fold vibration
Turbulence generated by friction of air in and above larynx

Creaky Voice
Vocal fold vibration at low frequency, irregular
Low tension (only ligamental part of glottis vibrates)
The vocal folds strongly adducted
Longitudinal tension weak
Moderately high medial compression

Breathy Voice
Tension low
Minimal adductive tension
Weak medial compression
Medium longitudinal vocal fold tension
Vocal folds do not come together completely, leading to frication

Estimating Voice Quality
Estimate with respect to a controlled neutral quality
But how do we know the control is truly “neutral”?
Must match the natural laryngeal behavior to laboratory “neutral”
Our knowledge of models of vocal fold movements may be inadequate for describing real phonation
Known relationships between acoustic signal and voice source are complex
Voicing behavior can only be observed indirectly, so estimates are prone to error
Direct source data are obtained by invasive techniques, which may interfere with the signal

Next Class: Deceptive Speech