Producing Emotional Speech
Thanks to Gabriel Schubiner
Papers
- Generation of Affect in Synthesized Speech
- Corpus-based approach to expressive speech synthesis
- Expressive visual speech using a talking head
Demos
- Affect Editor quiz/demo
- Synface demo
Affect in Speech
Goals:
- Add emotion to synthetic speech
- Acoustic model: a typology of the parameters of emotional speech, with quantification
- Addresses the problem of expressiveness
Discussion: what benefit is gained from expressive speech?
Emotion Theory / Assumptions
- Emotion -> nervous system -> speech output
- Binary distinction: parasympathetic vs. sympathetic, based on physical changes
- Universal emotions
Approaches to Affect
- Generative: emotion -> physical state -> acoustic output
- Descriptive: observed acoustic parameters are imposed directly
Descriptive Framework
- 4 parameter groups: pitch, timing, voice quality, articulation
- Assumption of independence between parameters
Discussion: how could this assumption affect design and results?
Pitch and Timing Parameters
Pitch:
- Accent shape
- Average pitch
- Contour slope
- Final lowering
- Pitch range
- Reference line
Timing:
- Exaggeration (not used)
- Fluent pauses
- Hesitation pauses
- Speech rate
- Stress frequency (stressed vs. stressable syllables)
Voice Quality and Articulation Parameters
Voice quality:
- Breathiness
- Brilliance
- Loudness
- Pause discontinuity
- Pitch discontinuity
- Tremor
- Laryngealization
Articulation:
- Precision
Implementation
- Each parameter has a numeric scale running between a negative and a positive extreme
- Each scale is independent of the other parameters
Implementation
- Parameter settings are grouped into preset conditions, one per emotion, based on prior studies (see the sketch below)
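A minimal sketch of how such a parameter store and emotion presets might look in code; the -10..+10 scale matches descriptions of Cahn's Affect Editor, but the preset values below are illustrative, not the published settings.

```python
# Sketch of an Affect Editor-style parameter store (illustrative values,
# not Cahn's published presets). Scales run from -10 to +10, 0 = neutral.
PARAMETERS = [
    # pitch
    "accent_shape", "average_pitch", "contour_slope", "final_lowering",
    "pitch_range", "reference_line",
    # timing
    "fluent_pauses", "hesitation_pauses", "speech_rate", "stress_frequency",
    # voice quality
    "breathiness", "brilliance", "loudness", "pause_discontinuity",
    "pitch_discontinuity", "tremor", "laryngealization",
    # articulation
    "precision",
]

def make_preset(**overrides):
    """Return a full parameter vector: 0 (neutral) unless overridden."""
    unknown = set(overrides) - set(PARAMETERS)
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    return {p: overrides.get(p, 0) for p in PARAMETERS}

# Hypothetical presets: each emotion is one point in the parameter space.
PRESETS = {
    "sadness": make_preset(average_pitch=-3, pitch_range=-5, speech_rate=-4,
                           breathiness=4, final_lowering=5),
    "anger":   make_preset(average_pitch=2, pitch_range=6, speech_rate=3,
                           loudness=6, precision=5),
}
```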
Program Flow: Input
- Emotion -> parameter representation
- Utterance -> clauses (agent, action, object, locative)
- Clause and lexeme annotations
- Finds all possible locations of affect and chooses whether or not to use each (a sketch follows)
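A minimal sketch, assuming a simple dataclass model, of the clause representation this slide names; the four role names come from the slide, while the field types and the site-enumeration helper are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Clause:
    # Semantic roles named on the slide.
    agent: str | None = None
    action: str | None = None
    object: str | None = None
    locative: str | None = None
    # Clause- and lexeme-level annotations (hypothetical representation).
    annotations: dict = field(default_factory=dict)

def affect_sites(clauses):
    """Enumerate candidate locations for affect; a chooser decides which to use."""
    for i, cl in enumerate(clauses):
        for role in ("agent", "action", "object", "locative"):
            if getattr(cl, role) is not None:
                yield (i, role)
```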
Program Flow: Synthesis
- Utterance -> tree structure -> linear phonology
- "Compiled" for a specific synthesizer, with software simulating effects the hardware does not provide
Perception
- 30 utterances: 5 sentences x 6 affects
- Forced choice among the six affects, plus magnitude ratings and comments
Elicitation Sentences
- I'm almost finished
- I'm going to the city
- I saw your name in the paper (excluded from the quiz)
- I thought you really meant it
- Look at that picture
Pop Quiz!!!
Pop Quiz Solutions
- I'm almost finished: disgust, surprise, sadness, gladness, anger, fear
- I'm going to the city: surprise, gladness, anger, disgust, sadness, fear
- I thought you really meant it: anger, disgust, gladness, sadness, fear, surprise
- Look at that picture: anger, fear, disgust, sadness, gladness, surprise
Results
- Approximately 50% overall recognition rate (chance with six affects is about 17%)
- Sadness was recognized best, at 91%
Conclusions
- Effective? Thoughts?
Corpus-based Approach to Expressive Speech Synthesis
Corpus
- Utterances collected in each emotion; emotion-dependent semantics
- One speaker
- Good news, bad news, question
Model: Feature Vector
Per-syllable features (see the sketch below):
- Lexical stress
- Phrase-level stress
- Distance from beginning of phrase
- Distance from end of phrase
- Part of speech (POS)
- Phrase type
- End-of-syllable pitch
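A minimal sketch of assembling that feature vector; the field names mirror the slide, while the encodings (integers for stress and POS codes, Hz for pitch) are assumptions.

```python
from dataclasses import dataclass, astuple

@dataclass
class SyllableFeatures:
    lexical_stress: int           # 0/1
    phrase_stress: int            # 0/1
    dist_from_phrase_start: int   # in syllables
    dist_from_phrase_end: int     # in syllables
    pos: int                      # categorical part-of-speech code
    phrase_type: int              # e.g. statement / question / ...
    end_pitch_hz: float           # end-of-syllable pitch

def feature_vector(syl: SyllableFeatures) -> tuple:
    """Flatten the per-syllable features into a vector for the classifier."""
    return astuple(syl)
```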
Model: Classification
- Predicts F0 over a 5-syllable window
- A decision tree maps the feature vector to an observation vector
- Observation vector: log(p) and Δp, where p is the end-of-syllable pitch (a sketch follows)
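A sketch of the classification step, with scikit-learn's DecisionTreeRegressor standing in for the paper's own tree induction and random placeholder data standing in for the corpus.

```python
# Predict the observation vector (log p, delta p) per syllable with a tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))             # 7 features per syllable (prev slide)
p = np.exp(rng.normal(5.0, 0.2, 500))     # end-of-syllable pitch (Hz), fake data
Y = np.column_stack([np.log(p), np.diff(p, prepend=p[0])])  # (log p, Δp)

tree = DecisionTreeRegressor(max_depth=8).fit(X, Y)
log_p_hat, delta_p_hat = tree.predict(X[:1])[0]
```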
Model: Target Duration
- Similar to predicting F0
- A tree is built so that each leaf provides a Gaussian over durations
- The mean of the leaf's class is used as the target duration, a form of discretization (see the sketch below)
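A sketch of the duration model under the same stand-in assumptions: fit a tree, collect a Gaussian (mean, std) per leaf, and emit the leaf mean as the target duration.

```python
# Per-leaf Gaussians over duration, approximating the tree-building goal
# described on the slide. Data is random placeholder.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 7))
dur = np.abs(rng.normal(0.18, 0.05, 500))      # syllable durations (s), fake data

tree = DecisionTreeRegressor(max_depth=6).fit(X, dur)
leaves = tree.apply(X)
leaf_stats = {leaf: (dur[leaves == leaf].mean(), dur[leaves == leaf].std())
              for leaf in np.unique(leaves)}    # Gaussian (mean, std) per leaf

def target_duration(x):
    """Map a feature vector to a leaf and return that leaf's Gaussian mean."""
    leaf = tree.apply(x.reshape(1, -1))[0]
    return leaf_stats[leaf][0]
```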
Models
- Uses an acoustic analogue of n-grams: captures a sense of local context, as opposed to describing the full emotion as one sequence
- Compared to the Affect Editor, uses only F0 and duration
- Includes information about which utterance each feature was derived from
Discussion: is this intentional bias justified?
Model: Synthesis
- Corpus data is tagged with its original expression and emotion
- An expression-cost matrix penalizes units from a mismatched expression (see the sketch below)
- Noted trade-off: emotional intensity vs. smoothness
- Paralinguistic events
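A sketch of how an expression-cost matrix could enter unit selection; the cost values and the weighting scheme are illustrative, not the paper's.

```python
# Off-diagonal entries penalize selecting a unit recorded in a different
# expression than the target expression.
COST = {
    ("neutral",   "neutral"):   0.0, ("neutral",   "good_news"): 0.4,
    ("good_news", "good_news"): 0.0, ("good_news", "neutral"):   0.3,
    # ... remaining expression pairs filled in the same way
}

def unit_cost(target_expr, unit_expr, join_cost, w=1.0):
    # Raising w favors emotional intensity; lowering it favors smoothness.
    return w * COST.get((target_expr, unit_expr), 1.0) + join_cost
```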
SSML
- Markup for expressive synthesis; compare to Cahn's typology
- Abstraction layers (a hypothetical snippet follows)
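A hypothetical expressive-markup fragment for comparison; `<prosody>` is standard SSML, while style-selection elements like `<express-as>` are vendor extensions, so the exact names here are assumptions rather than the paper's markup.

```python
# Illustrative SSML-style markup, held as a string for inspection.
ssml = """
<speak>
  <express-as style="good-news">
    Congratulations, you won!
  </express-as>
  <prosody rate="slow" pitch="-10%">I'm sorry about the delay.</prosody>
</speak>
"""
```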
Perception Experiment
- Task: distinguish the same utterance spoken with neutral vs. affective prosody
Discussion: is the semantic content problematic?
Results
- Binary decision task
Discussion: a reasonable gain over the baseline?
Conclusion
- Major contributions? Paths forward?
Synthesis of Expressive Visual Speech on a Talking Head
Synthesis Background
- Manipulation of video images
- Virtual model with deformation parameters
- Synchronized with a time-aligned transcription
- Articulatory Control Model: Cohen & Massaro (1993)
Data
- Single actor, given a specific emotion as instruction
- 6 emotions + neutral
Facial Animation Parameters
- Face-independent FAPs
- Deformation = FAP matrix x scaling factor, relative to the neutral position (position 0)
- Weighted deformations of the distances between vertices and a feature point (see the sketch below)
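A minimal numpy sketch of the slide's "matrix x scaling factor + position 0" recipe; the array shapes and the per-vertex weighting are assumptions.

```python
# Apply face-independent FAP values to a vertex mesh.
import numpy as np

n_vertices, n_faps = 1000, 68
neutral = np.zeros((n_vertices, 3))        # neutral ("position 0") mesh
basis = np.random.default_rng(2).normal(size=(n_faps, n_vertices, 3))
weights = np.ones(n_vertices)              # per-vertex falloff from the feature point

def deform(fap_values, scale=1.0):
    """vertices = neutral + scale * sum_k fap_k * basis_k, weighted per vertex."""
    disp = np.tensordot(fap_values, basis, axes=1)   # (n_vertices, 3)
    return neutral + scale * weights[:, None] * disp
```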
Modeling
- Each phonetic segment is assigned a target parameter vector
- Temporal blending over dominance functions (a sketch follows)
- Principal components
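A minimal sketch of dominance-function blending in the spirit of Cohen & Massaro (1993): each segment's target is weighted by a dominance that decays away from the segment's center, and targets are averaged. The functional form shown and its constants are illustrative.

```python
import numpy as np

def dominance(t, center, alpha=1.0, theta=5.0, c=1.0):
    """Exponential dominance that decays with distance from the segment center."""
    return alpha * np.exp(-theta * np.abs(t - center) ** c)

def blend(t, centers, targets):
    """targets: (n_segments, n_params); return the dominance-weighted average."""
    d = np.array([dominance(t, ctr) for ctr in centers])   # (n_segments,)
    return (d[:, None] * targets).sum(axis=0) / d.sum()

# Example: three segments with scalar targets, sampled along a short interval.
centers = np.array([0.1, 0.3, 0.5])
targets = np.array([[0.0], [1.0], [0.2]])
trajectory = np.array([blend(t, centers, targets) for t in np.linspace(0, 0.6, 7)])
```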
Machine Learning
- Separate models for each emotion
- 6:1 training-to-testing ratio
- Pipeline: models -> principal-component trajectories -> FAP trajectories x emotion parameter matrix
Results
- More extreme emotions were easier to perceive: 73% for sad and 60% for angry, with subtler emotions recognized at lower rates (around 40%)
Synface Demo
Discussion
- Changes in approach from Cahn to Eide
- Production compared to detection