PF-STAR: Emotional Speech Synthesis
Istituto di Scienze e Tecnologie della Cognizione, Sezione di Padova “Fonetica e Dialettologia”, CNR
Analysis of emotive speech: audio
Recordings: /’aba/, /’ava/, /m’amma/.
Cue extraction and analysis:
  Intensity, duration, pitch, pitch range, formants. The mean F0 of the stressed vowel and the F0mid values are strongly correlated.
  Voice quality correlates: shimmer, jitter, HNR, Hammarberg’s index, spectral flatness, spectral energy distributions.
Figure: F0mean (global and for the stressed vowel), F0mid, and F0range for /’aba/, by emotion: anger (A), joy (J), fear (F), sadness (SA), disgust (D), surprise (SU), neutral (N).
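Two of the voice quality correlates listed above, jitter and shimmer, can be illustrated with a minimal sketch. It assumes the common "local" definitions (mean absolute difference between consecutive glottal cycles, divided by the mean value) and assumes that cycle periods and peak amplitudes have already been extracted; the slides do not say which jitter and shimmer variants or which tools were actually used, and the function names and sample values here are hypothetical.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def jitter_local(periods_s):
    """Cycle-to-cycle variation of the glottal period (assumed 'local' jitter)."""
    return local_perturbation(periods_s)


def shimmer_local(peak_amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (assumed 'local' shimmer)."""
    return local_perturbation(peak_amplitudes)


# Hypothetical measurements for a short vowel segment (not data from the study).
periods = [0.0050, 0.0051, 0.0049, 0.0050]   # seconds per glottal cycle
amplitudes = [0.80, 0.82, 0.78, 0.80]        # linear peak amplitude per cycle

print(f"jitter:  {100 * jitter_local(periods):.2f} %")
print(f"shimmer: {100 * shimmer_local(amplitudes):.2f} %")
```

A perfectly periodic signal gives zero for both measures; irregular phonation raises them.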
Analysis of emotive speech: voice quality
Discriminant analysis: classification scores of 60/70% for the stressed and unstressed vowels.
  Best scores: fear, anger
  Worst score: surprise
Voice quality characterization:
  Anger: harsh voice (/’a/)
  Disgust: creaky voice (/a/)
  Joy, fear, surprise: breathy voice
VOQUAL 2003 paper: “Emotions and Voice Quality: Experiments with Sinusoidal Modeling”
Processing of emotive speech
Neutral-to-emotive transformation based on sinusoidal modeling.
Audio examples: Neutral; Disgust (Ps+Ts); Sadness (Ps+Ts); Target Disgust; Target Sadness.
Results:
  Time stretch (Ts) and formant-preserving pitch shift (Ps) alone cannot account for the principal emotion-related cues.
  Spectral conversion can account for some of the emotion cues.
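The two basic operations named above can be sketched on a toy sinusoidal representation, where each analysis frame is a list of (frequency, amplitude) partials. This is a simplified illustration under stated assumptions, not the processing chain used in the project: the spectral envelope is approximated by linear interpolation between the original partials, time stretching is done by repeating or skipping whole frames (a real sinusoidal model would interpolate parameters and phases), and all function names are hypothetical.

```python
def envelope_at(freq, partials):
    """Linearly interpolate the amplitude envelope defined by (freq, amp) partials.

    Assumes the partial frequencies are distinct; outside their range the
    nearest edge amplitude is held.
    """
    pts = sorted(partials)
    if freq <= pts[0][0]:
        return pts[0][1]
    if freq >= pts[-1][0]:
        return pts[-1][1]
    for (lo_f, lo_a), (hi_f, hi_a) in zip(pts, pts[1:]):
        if lo_f <= freq <= hi_f:
            t = (freq - lo_f) / (hi_f - lo_f)
            return lo_a + t * (hi_a - lo_a)


def pitch_shift(partials, ratio):
    """Scale partial frequencies by `ratio`.

    Amplitudes are re-read from the ORIGINAL envelope at the new frequencies,
    so envelope peaks (formants) stay where they were.
    """
    return [(f * ratio, envelope_at(f * ratio, partials)) for f, _ in partials]


def time_stretch(frames, factor):
    """Crude time stretch: output frame i reuses input frame floor(i / factor)."""
    n_out = max(1, round(len(frames) * factor))
    last = len(frames) - 1
    return [frames[min(last, int(i / factor))] for i in range(n_out)]


# Hypothetical frame: three harmonics of a 100 Hz voice.
frame = [(100.0, 1.0), (200.0, 0.5), (300.0, 0.25)]
print(pitch_shift(frame, 1.5))
print(len(time_stretch([frame, frame], 2.0)))
```

Both operations leave the shape of the spectral envelope untouched, which fits the result reported above: emotion cues carried by voice quality need an additional spectral conversion step.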
Processing of emotive speech
Neutral-to-emotive transformation based on sinusoidal modeling.
Audio examples: Neutral source, then one row per emotion (anger, disgust, joy, fear, surprise, sadness) with columns Ps+Ts, Ps+Ts+Sc, and Target.
SI voice processing for TTS systems
Processing of emotive speech: results
Emotive synthesis based on FESTIVAL MBROLA (male voice).
Audio examples: Neutral, then one row per emotion (Anger, Disgust, Joy, Fear, Surprise, Sadness) with columns Ps+Ts, Ps+Ts+VQtr, and Target.
ETTS Audio Examples
“Neutral” prosody, then one row per emotion (Anger, Disgust, Fear, Joy, Surprise, Sadness) with columns E-Prosody and E-Prosody+VQ.
Mark-Up Languages for E-TTS
Hierarchic description of emotive voice:
  High level: emotive tags
  Medium level: phonetic voice description
  Low level: acoustic description
Definition of speaker-independent rules to control voice quality within a text-to-speech synthesizer.
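The three-level hierarchy can be pictured as two lookup steps, from emotive tag to phonetic voice description to acoustic description. The sketch below is illustrative only. The emotion-to-voice-quality mapping follows the voice quality characterization reported earlier in this deck, but the acoustic level is a hypothetical placeholder: the parameter names and the "+" / "-" directions are assumptions, not the project's rules, and the fallback to "modal" voice for emotions the deck does not characterize (sadness, neutral) is also an assumption.

```python
# High level -> medium level: from the voice quality characterization in this deck.
EMOTION_TO_VOICE_QUALITY = {
    "anger": "harsh",
    "disgust": "creaky",
    "joy": "breathy",
    "fear": "breathy",
    "surprise": "breathy",
}

# Medium level -> low level: HYPOTHETICAL placeholders.
# "+" means raise the parameter relative to neutral, "-" means lower it.
VOICE_QUALITY_TO_ACOUSTICS = {
    "harsh": {"jitter": "+", "shimmer": "+"},
    "creaky": {"jitter": "+", "f0": "-"},
    "breathy": {"hnr": "-", "spectral_tilt": "+"},
    "modal": {},  # assumed default: no voice quality change
}


def resolve(emotion):
    """Map an emotive tag to (voice quality, acoustic adjustments)."""
    quality = EMOTION_TO_VOICE_QUALITY.get(emotion.lower(), "modal")
    return quality, VOICE_QUALITY_TO_ACOUSTICS[quality]


for tag in ("anger", "joy", "sadness"):
    print(tag, "->", resolve(tag))
```

Keeping the levels in separate tables means the medium-to-low rules can be defined once, independently of the speaker and of the emotion that selected them.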