1
Emotional Speech Modelling and Synthesis
Zeynep Inanoglu
Machine Intelligence Laboratory, CU Engineering Department
Supervisor: Prof. Steve Young

Hello, my name is Zeynep Inanoglu and I am a first-year PhD student working with Prof. Steve Young on the general topic of emotions in speech, with a focus on expressive speech synthesis.
2
Agenda
- Project Motivation
- Review of Work on Intonation Modelling
  - Intonation Models and Training
  - Intonation Synthesis from HMMs
  - Intonation Adaptation and Perceptual Tests
- Alternative Intonation Labels
  - Prosodizer Labels
  - Lexical Stress Labels
- Controlling More Parameters
  - Pitch Synchronous Harmonic Model
  - Transplantation of pitch, duration and voice quality
- Emotional Speech Data Collection
- Summary and Future Direction

This is a brief outline of my slides. I will first remind you of where I had left off during the Toshiba visit in February. Then I will try to motivate the rest of the talk through a brief literature review on strategies for expressive speech synthesis. The rest of the talk will consist of details of the statistical framework we have implemented and a demonstration of experiments and sample results. I will conclude with a discussion of future direction.
3
Review: Project Motivation
To synthesize or resynthesize speech with desired emotional expressivity. Initial focus on pitch modelling.
Intonation (F0) contours have two distinct functions:
- Convey the prominence structure and sentence modality.
- Convey signals about emotional state.
The interaction between the two functions is largely unexplored (Banziger & Scherer, 2005).
Goal:
- Choose building blocks of intonation.
- Model them statistically.
- Adapt the models to different emotions.
- Generate intonation contours from the models.

The goal of this work is to synthesize and resynthesize speech with the desired emotional expressivity. Our initial focus was on pitch modelling. As we know, an intonation (F0) contour fulfils two seemingly parallel functions: first, to convey the prominence structure of the utterance, that is, to determine which words are accented and which are not; and second, to convey signals about speaker state, which includes emotional state. While there has been systematic analysis in the literature of emotion effects on aggregate F0 measures such as mean F0 and F0 range, the interaction between the linguistic and affective functions of F0 contours has been largely untested. This was also noted in a recent journal article by Banziger & Scherer. Our goal in this project was to construct a framework that can generate contours in the desired emotion given the prominence structure. We were able to achieve this by modelling contours as a sequence of linguistic intonation units and adapting these basic units to various emotions.
4
Review: Intonation Modelling
Basic Models
- Seven basic models: A (accent), U (unstressed), RB (rising boundary), FB (falling boundary), ARB, AFB, SIL
- 3-state, single-mixture, left-to-right HMMs
- Data: Boston Radio Corpus (48 minutes of speech, female speaker)
- Features: mean-normalized raw F0 and energy values as well as differentials.
Context-Sensitive Models
- Tri-unit models (U+A-RB)
- Full-context models (U+A-RB::vowel_pos=2::num_a=1::…..)
- Decision tree-based parameter tying was performed for context-sensitive models.

Our basic models were all syllable based and consisted of seven intonation units: an accent, an unaccented syllable, two types of boundary tones (a rising boundary and a falling boundary), and combination units where an accent and a boundary tone occur within the same syllable. Each of these units is modelled as a three-state, single-mixture, left-to-right HMM and trained on 48 minutes of labelled speech from the Boston Radio Corpus. The features extracted were F0 and energy as well as their first and second order differentials. F0 values were normalized by speaker mean in order to allow speaker-independent training and adaptation. We have experimented with two levels of context-sensitive models. The simplest context involves a tri-unit model, similar to a tri-phone in speech recognition, which incorporates information on neighbouring units. A more complex full-context model set was also built, based on the phonetic constituents of the syllables and more long-distance syllable context. Decision tree parameter tying was performed for all context-sensitive models because of the increasing number of parameters.
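As a rough illustration of the feature set and model topology described on this slide, the following Python sketch builds a feature matrix from F0 and energy and a three-state left-to-right transition matrix. The function names, the use of `numpy.gradient` as the differential, and the self-loop probability are assumptions made for illustration; the slides do not specify the exact delta windows or initial transition values.

```python
import numpy as np

def intonation_features(f0, energy):
    """Stack mean-normalized F0 and raw energy with first and second differentials.

    Assumption: differentials are approximated with numpy.gradient.
    """
    f0 = np.asarray(f0, dtype=float)
    energy = np.asarray(energy, dtype=float)
    # Only F0 is normalized by the (speaker) mean, as stated in the notes.
    static = np.column_stack([f0 - f0.mean(), energy])
    delta = np.gradient(static, axis=0)    # first-order differential
    delta2 = np.gradient(delta, axis=0)    # second-order differential
    return np.hstack([static, delta, delta2])

def left_to_right_transitions(n_states=3, stay=0.6):
    """Transition matrix that only allows self-loops and moves to the next state.

    Simplification: the last state is absorbing here; a full HMM toolkit would
    add an exit transition instead.
    """
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        if i + 1 < n_states:
            A[i, i] = stay
            A[i, i + 1] = 1.0 - stay
        else:
            A[i, i] = 1.0
    return A
```

Each row of the feature matrix holds six values per frame: normalized F0, energy, and their first and second differentials.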
5
Review: Generation from Models
- The goal is to generate an optimal sequence of F0 values directly from syllable HMMs given the intonation models, i.e. the sequence that maximizes P(O | α).
- This results in a sequence of state mean values.
- Cepstral parameter generation algorithm of the HTS system for interpolated F0 generation (Tokuda et al., 1995)
- Differential F0 features are used as constraints in contour generation.
- Results in smoother contours.

Given a model sequence, we want to come up with the optimal sequence of F0 values that maximizes P(O given alpha). As you may guess, this maximization problem results in a sequence of state mean values. The F0 contour becomes a stepwise curve, which we do not want. Instead we use the cepstral parameter generation algorithm of Tokuda et al. The idea behind this algorithm is to use the dynamic features, delta F0 and delta-delta F0, as explicit constraints in generating the static F0 features.
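The idea of using dynamic features as constraints can be sketched as a weighted least-squares problem: find the static contour c whose static, delta and delta-delta values best match the per-frame state means. The sketch below is a minimal single-stream version under stated assumptions: diagonal variances, a delta window of 0.5·(c[t+1] − c[t−1]), a delta-delta window of c[t−1] − 2c[t] + c[t+1], and edge frames that reuse their nearest neighbour. These windows and the function names are illustrative and not taken from the slides.

```python
import numpy as np

def build_window_matrix(T):
    """Matrix W mapping a static contour (T,) to stacked [static, delta, delta2] rows."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        prev, nxt = max(t - 1, 0), min(t + 1, T - 1)
        W[3 * t, t] = 1.0               # static feature
        W[3 * t + 1, prev] += -0.5      # delta: 0.5 * (c[t+1] - c[t-1])
        W[3 * t + 1, nxt] += 0.5
        W[3 * t + 2, prev] += 1.0       # delta-delta: c[t-1] - 2 c[t] + c[t+1]
        W[3 * t + 2, t] += -2.0
        W[3 * t + 2, nxt] += 1.0
    return W

def generate_contour(means, variances):
    """Solve (W' P W) c = W' P mu for the smooth static contour c.

    means, variances: arrays of shape (T, 3) holding the state mean and
    variance of [static, delta, delta2] assigned to each frame.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    T = means.shape[0]
    W = build_window_matrix(T)
    precision = 1.0 / variances.reshape(-1)   # diagonal of P
    WtP = W.T * precision                     # scales column j of W' by precision[j]
    return np.linalg.solve(WtP @ W, WtP @ means.reshape(-1))
```

With stepwise static means and zero-mean dynamic features, the solution trades closeness to the state means against the delta constraints, which is what smooths the step between states.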
6
Review: Generation From Models
Syllables:  I   saw  him  yes  ter  day
Contour 1:  a   a    u    u    u    u
Contour 2:  a   a    u    a    u    fb

I'd now like to illustrate some synthetic contours generated by our tri-unit models. The two contours seen here are both generated by our models for two sequences of intonation labels. The difference between the two sequences is the introduction of an accent on the fourth syllable and a falling boundary on the sixth. You can visually observe the hat-shaped accent on the fourth syllable and the sharp fall in the final syllable. I will now play two sound files where each contour has been transplanted onto the same six-syllable utterance, "I saw him yesterday".
7
Review: Model Adaptation with MLLR
- Adapt models with Maximum Likelihood Linear Regression (MLLR).
- Adaptation data from the Emotional Prosody Corpus, which consists of four-syllable phrases in a variety of emotions.
- Happy and sad speech were chosen for this experiment.
[Examples: Neutral, Sad, Happy]

As I noted in the overview, the final step of this experiment was to adapt the neutral models with a small amount of emotion data using maximum likelihood linear regression. MLLR computes a set of linear transformations for the mean and variance parameters of the HMM output distributions. The number of transforms applied can vary based on a regression tree, which clusters all state distributions according to a similarity measure. So if there is not enough data to compute an independent transform for each state, a group of similar states shares the same transform. The adaptation data we used came from the Emotional Prosody Corpus, which consists of four-syllable phrases in a variety of emotions. Around 40 sad and happy phrases were labelled with our intonation units and used to adapt the tri-unit HMMs.
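To make the transform sharing concrete, here is a small Python sketch of how an MLLR-style affine mean transform (adapted mean = A·mean + b) might be applied once the transforms are known, with several states sharing the transform of their regression class. Estimating A and b from adaptation data is not shown. The function names, the dictionary layout and all numbers in the usage example are hypothetical.

```python
import numpy as np

def apply_mllr_mean_transform(mean, A, b):
    """Return the adapted mean A @ mean + b for one Gaussian mean vector."""
    return np.asarray(A, dtype=float) @ np.asarray(mean, dtype=float) + np.asarray(b, dtype=float)

def adapt_by_class(state_means, state_class, transforms):
    """Adapt every state mean with the transform of its regression class.

    state_means: {state_name: mean vector}
    state_class: {state_name: regression class id}
    transforms:  {class id: (A, b)}; states in the same class share (A, b)
    """
    adapted = {}
    for state, mean in state_means.items():
        A, b = transforms[state_class[state]]
        adapted[state] = apply_mllr_mean_transform(mean, A, b)
    return adapted

# Hypothetical usage: two states tied to one class, e.g. compressing and lowering F0.
example = adapt_by_class(
    {"A_s2": [20.0, 4.0], "U_s2": [-10.0, 2.0]},
    {"A_s2": 0, "U_s2": 0},
    {0: (0.5 * np.eye(2), [-10.0, 0.0])},
)
```

When little adaptation data is available, fewer regression classes are used, so more states fall under the same (A, b) pair.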
8
Review: Perceptual Tests
- Utterances with sad contours were identified 80% of the time. This was significant (p<0.01).
- Listeners formed a bimodal distribution in their ability to detect happy utterances. Overall, only 46% of the happy intonation was identified as happier than neutral. (Smiling voice is infamous in the literature.)
- Happy models worked better with utterances with more accents and rising boundaries: the organization of labels matters!

We had some interesting results. Utterances with the sad contours were identified 80 percent of the time, which is significant. Listeners formed a bimodal distribution in their ability to detect happy utterances. Overall, only 46% of the happy intonation was identified as happier than neutral. However, some listeners agreed with the happy contours systematically while others preferred the neutral contours. An interesting observation: happy models were preferred more frequently in utterances with more accents and rising boundaries. The organization of labels matters.
9
Alternative Intonation Labels
- Manual intonation labels are subjective and their creation is time-consuming.
- Evaluate alternative labelling methods:
  - Automatic ToBI labels generated by Prosodizer. Prosodizer-generated labels are converted to the seven basic units.
  - Lexical stress labels: one (primary stress), two (secondary stress), zero (no stress), sil (silence)
- Evaluation on Boston Radio Corpus female speaker f2b

                  Manual   Prosodizer   Lexical
Mean Sq. Error    30.60    31.83        34.56

One of our reservations about the method I just described was the use of manual intonation labels. The generation of these labels is not only time-consuming but also quite subjective, and it does not have high inter-labeller consistency. We have explored two alternatives to replace the manual labels. The first was to use Toshiba's Prosodizer tool to generate automatic ToBI labels and to map from ToBI labels to our original label set. The second was simply to use lexical stress labels to model intonation. The stress labels can be obtained from a dictionary, and our hope was that modelling context-sensitive stress labels may indirectly provide us with emotion-specific accentuation patterns. We have evaluated the synthetic contours produced with manual, Prosodizer and lexical stress labels on the Boston Radio Corpus. The objective measure we used was the average mean squared error per sample. The manual and Prosodizer labels were very similar in accuracy, and the lexical stress labels had higher error because they have fewer base models. The use of Prosodizer was interesting for us, and we have yet to try it on emotional data.
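The objective measure mentioned above can be sketched as follows. This is one plausible reading of "average mean squared error per sample": compute the per-sample squared error between the original and synthetic F0 contour of each utterance, then average across utterances. The function names are illustrative, and the slides do not say whether unvoiced frames were excluded, so no voicing mask is applied here.

```python
import numpy as np

def mse_per_sample(reference_f0, synthetic_f0):
    """Mean squared error between two equal-length F0 contours."""
    ref = np.asarray(reference_f0, dtype=float)
    syn = np.asarray(synthetic_f0, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("contours must have the same number of samples")
    return float(np.mean((ref - syn) ** 2))

def average_mse(contour_pairs):
    """Average the per-utterance MSE over (reference, synthetic) contour pairs."""
    return float(np.mean([mse_per_sample(ref, syn) for ref, syn in contour_pairs]))
```

The same function would be run once per label set (manual, Prosodizer, lexical stress) on contours generated for the same test utterances.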
10
Perceptual Investigation of Other Emotions
Label sequence: sil-one+one ..one zero zero-one+sil
[Contour panels: Boredom, Contempt, Disgust, Interest, Cold Anger, Panic, Hot Anger]

One big advantage of lexical stress models is that we were able to easily label our training data for the various emotions provided in the Emotional Speech Corpus and generate contours. Here we are looking at a single lexical stress pattern, one-one-zero-one, and the contours generated by the different emotion models. These are context-sensitive models, so we can model the fact that the initial primary stress is at a phrase start and the last one is at a phrase end. As you can see, the contours vary not only in their global range and scale but also in the shapes of individual contexts. Interest was a unique one, with a rising primary stress at the end of the phrase. Contempt was also interesting for its high-pitched, rising sentence-start model.
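The context-sensitive labels for a stress pattern can be derived mechanically. The sketch below assumes a left-center+right naming scheme, inferred from the labels "sil-one+one" and "zero-one+sil" on this slide, with silence padding at both phrase edges; the function name is illustrative.

```python
def tri_unit_labels(units, boundary="sil"):
    """Expand a unit sequence into left-center+right context labels."""
    padded = [boundary] + list(units) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

labels = tri_unit_labels(["one", "one", "zero", "one"])
# ['sil-one+one', 'one-one+zero', 'one-zero+one', 'zero-one+sil']
```

The first and last labels carry the phrase-start and phrase-end context, which is how the two primary stresses in this pattern end up with different models.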
11
Controlling More Parameters
- Pitch Synchronous Harmonic Model (Hui Ye, 2004)
  - Size of analysis/synthesis window equal to one pitch period.
  - Represent each frame as a sum of harmonically related sinusoids (amplitudes and phases).
  - For voiced frames, acquire LSF representations of the vocal tract.
- Better framework to manipulate pitch, duration and voice quality.
- Implemented pitch, duration and voice quality transplantation.
- Set up framework for emotion conversion (prosody, duration and vocal tract).

I think the examples in the previous slide make it clear that intonation mapping, while important, is not sufficient on its own for emotional synthesis. We have therefore decided to explore manipulating more parameters, including parameters relating to the vocal tract, in addition to pitch and duration. To do this we have switched to a speech model where controlling all these parameters at the same time is possible. PSHM was developed by a student of Steve Young for a voice conversion task. The model is pitch synchronous in that each analysis frame is as long as a single pitch period. Each frame is represented by a sum of sinusoids whose amplitudes, frequencies and phases fully define that frame of speech. For voiced segments, LSF parameters are also obtained from the sinusoidal representation. Our recent goal was to set up a transplantation experiment in which we transplant the phone-based pitch, duration and voice quality features of a target emotion onto an utterance in a source emotion and investigate the results. This would give us an idea of whether a voice conversion approach to emotional resynthesis is a useful paradigm to pursue.
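The frame representation described here can be illustrated with a short Python sketch that synthesizes one pitch-period frame as a sum of harmonics of F0. This is a generic harmonic synthesis sketch, not the PSHM implementation itself; the function name, the 16 kHz sample rate and the cosine phase convention are assumptions.

```python
import numpy as np

def synthesize_harmonic_frame(f0_hz, amplitudes, phases, sample_rate=16000):
    """Synthesize one frame, one pitch period long, from harmonic amplitudes and phases."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    phases = np.asarray(phases, dtype=float)
    period = int(round(sample_rate / f0_hz))         # frame length = one pitch period
    n = np.arange(period)
    k = np.arange(1, len(amplitudes) + 1)[:, None]   # harmonic numbers 1..K
    omega0 = 2.0 * np.pi * f0_hz / sample_rate       # fundamental in radians per sample
    harmonics = amplitudes[:, None] * np.cos(k * omega0 * n + phases[:, None])
    return harmonics.sum(axis=0)
```

Because the frequencies are tied to multiples of F0, changing `f0_hz` while keeping the amplitude envelope is what makes pitch modification convenient in this kind of model.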
12
Transplantation with PSHM
Duration Transplantation
Pitch Transplantation per phone
For each phone:
- Compute pitch alignment
- Recompute spectral envelope
- Restore time
13
Transplantation with PSHM
Vocal Tract Transplantation
- Alignment of frames based on DTW of MFCC distance.
- Convert LSF parameters to LPC.
- Filtering of the source harmonics with the target LPC.
- Computation of new sinusoidal amplitudes.
[Figure: spectral envelopes for /eh/, Neutral vs. Happy]
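The first step, frame alignment by dynamic time warping, might look like the sketch below. It assumes each utterance is given as a frames-by-coefficients array of MFCCs and uses Euclidean distance between frames with the basic three-way step pattern; the slides do not specify the distance or step constraints, and the function name is illustrative.

```python
import numpy as np

def dtw_align(source, target):
    """Align two feature sequences; return the warping path and its total cost.

    source, target: arrays of shape (n_frames, n_coefficients).
    The path is a list of (source_index, target_index) pairs.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    n, m = len(source), len(target)
    # Pairwise Euclidean distances between all source and target frames.
    dist = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the end to recover the frame pairing.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path, float(cost[n, m])
```

Each pair in the returned path tells which target frame's vocal tract parameters would be applied to which source frame in the later filtering steps.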
14
Transplantation Results
[Sample table. Columns: Source, Target, Pitch, Pitch+Duration, LSF, Pitch+Duration+LSF. Row labels: Neutral, Happy, Sad, Angry.]
- Conversion of voice quality improves target emotion perception in all transplantations.
- LSF transplantation is the driving factor in anger, while both LSF and prosody transplantation play an important role in happy and sad.
15
Emotional Speech Data Collection
- 4 emotions: happy, sad, surprised, angry.
- Two speakers: 1 male and 1 female (Suzanne Park, Matthew Johnson).
- Toshiba TTS Training Corpus.
- Happy & Sad: 1250 sentences
  - 900 from the phonetically balanced short sentences.
  - 300 long sentences.
  - 25 questions & 25 exclamations.
- Surprise & Anger: 625 sentences
  - 300 phonetically balanced short sentences.
  - 300 long sentences.
  - 25 questions.
- Neutral data collection for the male speaker (1250).
16
Emotional Speech Data Collection
- Emotion elicitation by context prompting.
  “I like a party with an atmosphere”
  - Happy: You have just arrived at the best party in town.
  - Sad: You never get invitations to good parties any more.
- Expected recording time (6-hour days)
  - 12 days for the female speaker (two and a half weeks non-stop).
  - 15 days for the male speaker (twice a week for two months).
- Post-processing:
  - Phonetic alignment
  - Syllable boundaries
  - Pitch marks
  - Prosodizer labels
  - Text analysis
17
Future Direction
- Data collection and labelling.
- Experiments with emotion conversion
  - Prosody conversion based on HMM models.
  - Voice quality conversion.
- Joint modelling of prosody and voice quality in emotional speech.
- Investigation of voice source and its effects on emotion.
- Integration of speech modification techniques into a TTS framework.
- Comparison of speech modification techniques with unit-selection techniques.