Producing Emotional Speech
Thanks to Gabriel Schubiner

Papers
- Generation of Affect in Synthesized Speech
- Corpus-based approach to synthesis
- Expressive visual speech using a talking head

Demos
- Affect Editor quiz/demo
- Synface demo

Affect in Speech: Goals
- Add emotion to synthetic speech
- Acoustic model: a typology of the parameters of emotional speech, with quantification
- Addresses the problem of expressiveness: what benefit is gained from expressive speech?

Emotion Theory / Assumptions
- Emotion -> nervous system -> speech output
- Binary distinction: sympathetic vs. parasympathetic arousal, based on physical changes
- Universal emotions

Approaches to Affect
- Generative: emotion -> physical state -> acoustic parameters
- Descriptive: observed acoustic parameters are imposed directly

Descriptive Framework
- Four parameter groups: pitch, timing, voice quality, articulation
- Assumption of independence between parameters: how could this affect design and results?

Pitch
- Accent shape
- Average pitch
- Contour slope
- Final lowering
- Pitch range
- Reference line

Timing
- Exaggeration (not used)
- Fluent pauses
- Hesitation pauses
- Speech rate
- Stress frequency (stressed vs. stressable syllables)

Voice Quality
- Breathiness
- Brilliance
- Loudness
- Pause discontinuity
- Pitch discontinuity
- Tremor
- Laryngealization

Articulation
- Precision

Implementation
- Each parameter is set on a scale between a negative and a positive extreme (-10 to +10 in the Affect Editor, with 0 neutral)
- Each scale is independent of the other parameters

Implementation
- Settings are grouped into preset configurations for each emotion, based on prior studies (see the sketch below)
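
As a concrete sketch of this parameter model: the groups and scale below follow Cahn's typology, but the preset values are illustrative stand-ins, not the published settings.

```python
from dataclasses import dataclass, field

SCALE_MIN, SCALE_MAX = -10, 10  # Affect Editor scales run -10..+10, 0 = neutral

@dataclass
class AffectParams:
    """One value per acoustic parameter, each on an independent scale."""
    pitch: dict = field(default_factory=dict)
    timing: dict = field(default_factory=dict)
    voice_quality: dict = field(default_factory=dict)
    articulation: dict = field(default_factory=dict)

    def validate(self):
        for group in (self.pitch, self.timing, self.voice_quality, self.articulation):
            for name, value in group.items():
                if not SCALE_MIN <= value <= SCALE_MAX:
                    raise ValueError(f"{name}={value} is outside the scale")

# Illustrative presets only -- not the values from the paper.
PRESETS = {
    "sadness": AffectParams(
        pitch={"average_pitch": -3, "pitch_range": -5, "final_lowering": +4},
        timing={"speech_rate": -6, "hesitation_pauses": +5},
        voice_quality={"breathiness": +6, "brilliance": -4},
        articulation={"precision": -3},
    ),
    "anger": AffectParams(
        pitch={"average_pitch": +2, "pitch_range": +6},
        timing={"speech_rate": +4, "stress_frequency": +5},
        voice_quality={"loudness": +6, "laryngealization": +2},
        articulation={"precision": +6},
    ),
}
PRESETS["anger"].validate()
```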

Program Flow: Input
- Emotion -> parameter representation
- Utterance -> clauses, with roles Agent, Action, Object, Locative
- Clause and lexeme annotations
- Finds all possible locations for affect cues and chooses whether or not to use each

Program Flow
- Utterance -> tree structure -> linear phonology
- "Compiled" for a specific synthesizer, with software simulating effects not available in the hardware

Perception
- 30 utterances: 5 sentences × 6 affects
- Forced choice of one of the six affects, plus magnitude ratings and comments

Elicitation Sentences
- I'm almost finished
- I'm going to the city
- I saw your name in the paper (not used in the quiz)
- I thought you really meant it
- Look at that picture

Pop Quiz!!!

Pop Quiz Solutions
- I'm almost finished: Disgust, Surprise, Sadness, Gladness, Anger, Fear
- I'm going to the city: Surprise, Gladness, Anger, Disgust, Sadness, Fear
- I thought you really meant it: Anger, Disgust, Gladness, Sadness, Fear, Surprise
- Look at that picture: Anger, Fear, Disgust, Sadness, Gladness, Surprise

Results
- Approximately 50% overall recognition rate
- 91% for sadness

Conclusions Effective? Thoughts?

Corpus-based Approach to Expressive Speech Synthesis

Corpus
- Collect utterances in each emotion, with emotion-dependent semantics
- One speaker
- Styles: good news, bad news, question

Model: Feature Vector
Features:
- Lexical stress
- Phrase-level stress
- Distance from beginning of phrase
- Distance from end of phrase
- Part of speech
- Phrase type
- End-of-syllable pitch

Model: Classification
- A decision tree predicts F0 over a five-syllable window
- The feature vector predicts an observation vector
- Observation vector: log(p) and Δp, where p is the end-of-syllable pitch
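
The paper uses IBM's own decision-tree tooling; as a rough sketch of the idea only, here is a scikit-learn regression tree mapping the syllable features above to the observation vector [log p, Δp]. The integer encodings and values are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One row per syllable; columns follow the feature list above.
# Categorical features (POS, phrase type) are assumed integer-coded.
X = np.array([
    # lex_stress, phr_stress, dist_begin, dist_end, pos_id, phrase_type_id
    [1, 0, 0, 4, 7, 0],
    [0, 1, 1, 3, 3, 0],
    [1, 0, 2, 2, 1, 1],
])
# Targets: [log end-of-syllable pitch, delta pitch vs. previous syllable]
y = np.array([
    [np.log(180.0),   0.0],
    [np.log(170.0), -10.0],
    [np.log(210.0),  40.0],
])

tree = DecisionTreeRegressor(max_depth=8)
tree.fit(X, y)                        # each leaf stores the mean of its class
log_p, delta_p = tree.predict(X[:1])[0]
print(np.exp(log_p), delta_p)         # predicted end-of-syllable pitch in Hz
```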

Model: Target Duration
- Similar to predicting F0: build a tree whose leaves model durations as Gaussians
- The mean of each leaf's class is used as the target duration (a discretization)

Models
- Uses an acoustic analogue of n-grams, capturing a sense of local context rather than describing the full emotion as one sequence
- Compared to the Affect Editor: uses only F0 and duration
- Includes information about which utterance the features were derived from; an intentional bias, but is it justified?

Model: Synthesis
- Units are tagged with their original expression and emotion; an expression-cost matrix penalizes stylistic mismatches during selection
- Noted trade-off: emotional intensity vs. smoothness
- Paralinguistic events
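
A minimal sketch of how an expression-cost matrix might enter unit selection; the styles, weights, and matrix values here are assumptions, not the paper's.

```python
# Hypothetical expression-mismatch penalty added to a unit-selection cost.
EXPRESSIONS = ["neutral", "good_news", "bad_news", "question"]

# cost[target][source]: price of using a unit recorded in `source` style
# when the target style differs. Values are illustrative.
EXPRESSION_COST = {
    t: {s: (0.0 if s == t else 1.0) for s in EXPRESSIONS} for t in EXPRESSIONS
}
EXPRESSION_COST["good_news"]["neutral"] = 0.4   # neutral units as cheap fallbacks
EXPRESSION_COST["bad_news"]["neutral"] = 0.4

def unit_cost(target_style, unit_style, target_f0, unit_f0,
              w_expr=2.0, w_f0=0.01):
    """Target cost = acoustic distance + weighted expression mismatch."""
    return w_f0 * abs(target_f0 - unit_f0) \
        + w_expr * EXPRESSION_COST[target_style][unit_style]

# Raising w_expr buys emotional intensity at the price of fewer candidate
# units, hence more joins and less smoothness: the trade-off noted above.
print(unit_cost("good_news", "neutral", 200.0, 185.0))
```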

SSML
- Compare to Cahn's typology
- Abstraction layers: high-level expressive tags vs. low-level prosodic controls
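
For concreteness: standard SSML exposes low-level prosody controls, while IBM's expressive extension adds a higher-level style element. A small illustration follows; the expressive element and attribute names vary by engine, so treat <express-as> and its attribute as assumptions.

```python
# Standard SSML <prosody> vs. a higher-level style tag (IBM-style
# <express-as>; exact names differ across engines).
ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
  <prosody pitch="+15%" rate="slow" volume="loud">
    I'm almost finished!
  </prosody>
  <express-as type="good-news">
    I saw your name in the paper.
  </express-as>
</speak>"""
print(ssml)
```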

Perception Experiment
- Listeners distinguish the same utterance synthesized with neutral vs. affected prosody
- Is the semantic content problematic?

Results
- Binary decision task
- Is the gain over the baseline reasonable?

Conclusion Major contributions? Paths forward?

Synthesis of Expressive Visual Speech on a Talking Head

Synthesis Background
- Two approaches: manipulation of video images vs. a virtual model with deformation parameters
- Synchronized with a time-aligned transcription
- Articulatory control model: Cohen & Massaro (1993)

Data
- Single actor, given a specific emotion as instruction
- 6 emotions + neutral

Facial Animation Parameters
- Face-independent: deformed positions = FAP matrix × scaling factor + neutral position ("position 0")
- Deformations weighted by the distance between mesh vertices and the feature point
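
A toy numpy sketch of that composition; the array shapes and the distance-based weighting are assumptions for illustration.

```python
import numpy as np

V, K = 4, 2                           # V mesh vertices, K FAPs (toy sizes)
neutral = np.zeros((V, 3))            # "position 0": neutral vertex positions
fap_basis = np.random.rand(K, V, 3)   # per-FAP displacement direction per vertex
vertex_weights = np.random.rand(K, V) # falls off with distance to feature point
fap_amplitudes = np.array([0.8, -0.3])  # current FAP values
scale = 1.0                             # face-size scaling factor

# Sum each FAP's weighted deformation over the mesh, then add the neutral pose.
deformed = neutral + scale * np.einsum(
    "k,kv,kvd->vd", fap_amplitudes, vertex_weights, fap_basis
)
print(deformed.shape)  # (V, 3)
```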

Modeling
- Each phonetic segment is assigned a target parameter vector
- Temporal blending over dominance functions
- Principal components reduce the parameter space
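
A sketch of Cohen & Massaro-style dominance blending, using the common negative-exponential dominance shape; the constants and target values are illustrative.

```python
import numpy as np

def dominance(t, center, alpha=1.0, theta=4.0):
    """Negative-exponential dominance of a segment, peaking at its center."""
    return alpha * np.exp(-theta * np.abs(t - center))

# Two segments with target values for one articulatory parameter.
targets = np.array([0.2, 0.9])
centers = np.array([0.10, 0.30])   # segment midpoints in seconds

t = np.linspace(0.0, 0.4, 100)
D = np.stack([dominance(t, c) for c in centers])    # (segments, time)
blended = (D * targets[:, None]).sum(0) / D.sum(0)  # dominance-weighted average
print(blended[:5])  # a smooth trajectory between the two targets
```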

ML
- Separate models for each emotion
- 6:1 training-to-testing ratio
- Pipeline: models -> PC trajectories -> FAP trajectories × emotion parameter matrix
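
A minimal numpy sketch of that last mapping, with invented shapes: principal-component trajectories are projected back to FAP space, then modulated by a per-emotion parameter matrix.

```python
import numpy as np

T, P, F = 50, 8, 68                      # frames, PCs, FAPs (toy sizes)
pc_traj = np.random.randn(T, P)          # model output in PC space
pca_components = np.random.randn(P, F)   # PC -> FAP back-projection
pca_mean = np.zeros(F)

fap_traj = pc_traj @ pca_components + pca_mean  # (T, F) base FAP trajectory
emotion_matrix = np.diag(np.full(F, 1.3))       # e.g. exaggerate for "angry"
expressive_traj = fap_traj @ emotion_matrix
print(expressive_traj.shape)  # (T, F)
```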

Results
- More extreme emotions are easier to perceive
- Recognition ranged from 73% (sad) and 60% (angry) down to 40%

Synface Demo

Discussion
- Changes in approach from Cahn to Eide
- Production compared to detection