Producing Emotional Speech
Thanks to Gabriel Schubiner
Papers
- Generation of Affect in Synthesized Speech
- Corpus-based approach to expressive speech synthesis
- Expressive visual speech using a talking head
Demos
- Affect Editor quiz/demo
- Synface demo
Affect in Speech
Goals:
- Add emotion to synthetic speech
- Acoustic model: a typology of the parameters of emotional speech, with quantification
- Addresses the problem of expressiveness
Discussion: what benefit is gained from expressive speech?
Emotion Theory / Assumptions
- Emotion -> nervous system -> speech output
- Binary distinction: parasympathetic vs. sympathetic, based on physical changes
- Universal emotions
Approaches to Affect
- Generative: emotion -> physical state -> acoustic output
- Descriptive: observed acoustic parameters are imposed directly
Descriptive Framework
- 4 parameter groups: pitch, timing, voice quality, articulation
- Assumption of independence between parameters
Discussion: how could this assumption affect design and results?
Pitch and Timing Parameters
Pitch:
- Accent shape
- Average pitch
- Contour slope
- Final lowering
- Pitch range
- Reference line
Timing:
- Exaggeration (not used)
- Fluent pauses
- Hesitation pauses
- Speech rate
- Stress frequency (stressed vs. stressable syllables)
Voice Quality and Articulation Parameters
Voice quality:
- Breathiness
- Brilliance
- Loudness
- Pause discontinuity
- Pitch discontinuity
- Tremor
- Laryngealization
Articulation:
- Precision
Implementation
- Each parameter has a numeric scale running between a negative and a positive extreme
- Each scale is independent of the other parameters
Implementation
- Parameter settings are grouped into preset conditions, one per emotion, based on prior studies (see the sketch below)
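A minimal sketch of how such a parameter store and emotion presets might look in code; the -10..+10 scale matches descriptions of Cahn's Affect Editor, but the preset values below are illustrative, not the published settings.

```python
# Sketch of an Affect Editor-style parameter store (illustrative values,
# not Cahn's published presets). Scales run from -10 to +10, 0 = neutral.
PARAMETERS = [
    # pitch
    "accent_shape", "average_pitch", "contour_slope", "final_lowering",
    "pitch_range", "reference_line",
    # timing
    "fluent_pauses", "hesitation_pauses", "speech_rate", "stress_frequency",
    # voice quality
    "breathiness", "brilliance", "loudness", "pause_discontinuity",
    "pitch_discontinuity", "tremor", "laryngealization",
    # articulation
    "precision",
]

def make_preset(**overrides):
    """Return a full parameter vector: 0 (neutral) unless overridden."""
    unknown = set(overrides) - set(PARAMETERS)
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    return {p: overrides.get(p, 0) for p in PARAMETERS}

# Hypothetical presets: each emotion is one point in the parameter space.
PRESETS = {
    "sadness": make_preset(average_pitch=-3, pitch_range=-5, speech_rate=-4,
                           breathiness=4, final_lowering=5),
    "anger":   make_preset(average_pitch=2, pitch_range=6, speech_rate=3,
                           loudness=6, precision=5),
}
```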
Program Flow: Input
- Emotion -> parameter representation
- Utterance -> clauses (agent, action, object, locative)
- Clause and lexeme annotations
- Finds all possible locations of affect and chooses whether or not to use each (a sketch follows)
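A minimal sketch, assuming a simple dataclass model, of the clause representation this slide names; the four role names come from the slide, while the field types and the site-enumeration helper are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Clause:
    # Semantic roles named on the slide.
    agent: str | None = None
    action: str | None = None
    object: str | None = None
    locative: str | None = None
    # Clause- and lexeme-level annotations (hypothetical representation).
    annotations: dict = field(default_factory=dict)

def affect_sites(clauses):
    """Enumerate candidate locations for affect; a chooser decides which to use."""
    for i, cl in enumerate(clauses):
        for role in ("agent", "action", "object", "locative"):
            if getattr(cl, role) is not None:
                yield (i, role)
```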
Program Flow: Synthesis
- Utterance -> tree structure -> linear phonology
- "Compiled" for a specific synthesizer, with software simulating effects the hardware does not provide
Perception
- 30 utterances: 5 sentences x 6 affects
- Forced choice among the six affects, plus magnitude ratings and comments
Elicitation Sentences
- I'm almost finished
- I'm going to the city
- I saw your name in the paper (excluded from the quiz)
- I thought you really meant it
- Look at that picture
Pop Quiz!!!
Pop Quiz Solutions
- I'm almost finished: disgust, surprise, sadness, gladness, anger, fear
- I'm going to the city: surprise, gladness, anger, disgust, sadness, fear
- I thought you really meant it: anger, disgust, gladness, sadness, fear, surprise
- Look at that picture: anger, fear, disgust, sadness, gladness, surprise
Results
- Approximately 50% overall recognition rate (chance with six affects is about 17%)
- Sadness was recognized best, at 91%
Conclusions
- Effective? Thoughts?
Corpus-based Approach to Expressive Speech Synthesis
Corpus
- Utterances collected in each emotion; emotion-dependent semantics
- One speaker
- Good news, bad news, question
Model: Feature Vector
Per-syllable features (see the sketch below):
- Lexical stress
- Phrase-level stress
- Distance from beginning of phrase
- Distance from end of phrase
- Part of speech (POS)
- Phrase type
- End-of-syllable pitch
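A minimal sketch of assembling that feature vector; the field names mirror the slide, while the encodings (integers for stress and POS codes, Hz for pitch) are assumptions.

```python
from dataclasses import dataclass, astuple

@dataclass
class SyllableFeatures:
    lexical_stress: int           # 0/1
    phrase_stress: int            # 0/1
    dist_from_phrase_start: int   # in syllables
    dist_from_phrase_end: int     # in syllables
    pos: int                      # categorical part-of-speech code
    phrase_type: int              # e.g. statement / question / ...
    end_pitch_hz: float           # end-of-syllable pitch

def feature_vector(syl: SyllableFeatures) -> tuple:
    """Flatten the per-syllable features into a vector for the classifier."""
    return astuple(syl)
```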
Model: Classification
- Predicts F0 over a 5-syllable window
- A decision tree maps the feature vector to an observation vector
- Observation vector: log(p) and Δp, where p is the end-of-syllable pitch (a sketch follows)
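A sketch of the classification step, with scikit-learn's DecisionTreeRegressor standing in for the paper's own tree induction and random placeholder data standing in for the corpus.

```python
# Predict the observation vector (log p, delta p) per syllable with a tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))             # 7 features per syllable (prev slide)
p = np.exp(rng.normal(5.0, 0.2, 500))     # end-of-syllable pitch (Hz), fake data
Y = np.column_stack([np.log(p), np.diff(p, prepend=p[0])])  # (log p, Δp)

tree = DecisionTreeRegressor(max_depth=8).fit(X, Y)
log_p_hat, delta_p_hat = tree.predict(X[:1])[0]
```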
Model: Target Duration
- Similar to predicting F0
- A tree is built so that each leaf provides a Gaussian over durations
- The mean of the leaf's class is used as the target duration, a form of discretization (see the sketch below)
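A sketch of the duration model under the same stand-in assumptions: fit a tree, collect a Gaussian (mean, std) per leaf, and emit the leaf mean as the target duration.

```python
# Per-leaf Gaussians over duration, approximating the tree-building goal
# described on the slide. Data is random placeholder.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 7))
dur = np.abs(rng.normal(0.18, 0.05, 500))      # syllable durations (s), fake data

tree = DecisionTreeRegressor(max_depth=6).fit(X, dur)
leaves = tree.apply(X)
leaf_stats = {leaf: (dur[leaves == leaf].mean(), dur[leaves == leaf].std())
              for leaf in np.unique(leaves)}    # Gaussian (mean, std) per leaf

def target_duration(x):
    """Map a feature vector to a leaf and return that leaf's Gaussian mean."""
    leaf = tree.apply(x.reshape(1, -1))[0]
    return leaf_stats[leaf][0]
```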
Models
- Uses an acoustic analogue of n-grams: captures a sense of local context, as opposed to describing the full emotion as one sequence
- Compared to the Affect Editor, uses only F0 and duration
- Includes information about which utterance each feature was derived from
Discussion: is this intentional bias justified?
Model: Synthesis
- Corpus data is tagged with its original expression and emotion
- An expression-cost matrix penalizes units from a mismatched expression (see the sketch below)
- Noted trade-off: emotional intensity vs. smoothness
- Paralinguistic events
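A sketch of how an expression-cost matrix could enter unit selection; the cost values and the weighting scheme are illustrative, not the paper's.

```python
# Off-diagonal entries penalize selecting a unit recorded in a different
# expression than the target expression.
COST = {
    ("neutral",   "neutral"):   0.0, ("neutral",   "good_news"): 0.4,
    ("good_news", "good_news"): 0.0, ("good_news", "neutral"):   0.3,
    # ... remaining expression pairs filled in the same way
}

def unit_cost(target_expr, unit_expr, join_cost, w=1.0):
    # Raising w favors emotional intensity; lowering it favors smoothness.
    return w * COST.get((target_expr, unit_expr), 1.0) + join_cost
```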
SSML
- Markup for expressive synthesis; compare to Cahn's typology
- Abstraction layers (a hypothetical snippet follows)
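A hypothetical expressive-markup fragment for comparison; `<prosody>` is standard SSML, while style-selection elements like `<express-as>` are vendor extensions, so the exact names here are assumptions rather than the paper's markup.

```python
# Illustrative SSML-style markup, held as a string for inspection.
ssml = """
<speak>
  <express-as style="good-news">
    Congratulations, you won!
  </express-as>
  <prosody rate="slow" pitch="-10%">I'm sorry about the delay.</prosody>
</speak>
"""
```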
Perception Experiment
- Task: distinguish the same utterance spoken with neutral vs. affective prosody
Discussion: is the semantic content problematic?
Results
- Binary decision task
Discussion: a reasonable gain over the baseline?
Conclusion
- Major contributions? Paths forward?
Synthesis of Expressive Visual Speech on a Talking Head
Synthesis Background
- Manipulation of video images
- Virtual model with deformation parameters
- Synchronized with a time-aligned transcription
- Articulatory Control Model: Cohen & Massaro (1993)
Data
- Single actor, given a specific emotion as instruction
- 6 emotions + neutral
Facial Animation Parameters
- Face-independent FAPs
- Deformation = FAP matrix x scaling factor, relative to the neutral position (position 0)
- Weighted deformations of the distances between vertices and a feature point (see the sketch below)
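A minimal numpy sketch of the slide's "matrix x scaling factor + position 0" recipe; the array shapes and the per-vertex weighting are assumptions.

```python
# Apply face-independent FAP values to a vertex mesh.
import numpy as np

n_vertices, n_faps = 1000, 68
neutral = np.zeros((n_vertices, 3))        # neutral ("position 0") mesh
basis = np.random.default_rng(2).normal(size=(n_faps, n_vertices, 3))
weights = np.ones(n_vertices)              # per-vertex falloff from the feature point

def deform(fap_values, scale=1.0):
    """vertices = neutral + scale * sum_k fap_k * basis_k, weighted per vertex."""
    disp = np.tensordot(fap_values, basis, axes=1)   # (n_vertices, 3)
    return neutral + scale * weights[:, None] * disp
```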
Modeling
- Each phonetic segment is assigned a target parameter vector
- Temporal blending over dominance functions (a sketch follows)
- Principal components
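A minimal sketch of dominance-function blending in the spirit of Cohen & Massaro (1993): each segment's target is weighted by a dominance that decays away from the segment's center, and targets are averaged. The functional form shown and its constants are illustrative.

```python
import numpy as np

def dominance(t, center, alpha=1.0, theta=5.0, c=1.0):
    """Exponential dominance that decays with distance from the segment center."""
    return alpha * np.exp(-theta * np.abs(t - center) ** c)

def blend(t, centers, targets):
    """targets: (n_segments, n_params); return the dominance-weighted average."""
    d = np.array([dominance(t, ctr) for ctr in centers])   # (n_segments,)
    return (d[:, None] * targets).sum(axis=0) / d.sum()

# Example: three segments with scalar targets, sampled along a short interval.
centers = np.array([0.1, 0.3, 0.5])
targets = np.array([[0.0], [1.0], [0.2]])
trajectory = np.array([blend(t, centers, targets) for t in np.linspace(0, 0.6, 7)])
```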
Machine Learning
- Separate models for each emotion
- 6:1 training-to-testing ratio
- Pipeline: models -> principal-component trajectories -> FAP trajectories x emotion parameter matrix
Results
- More extreme emotions were easier to perceive: 73% for sad and 60% for angry, with subtler emotions recognized at lower rates (around 40%)
Synface Demo
Discussion
- Changes in approach from Cahn to Eide
- Production compared to detection