Slide 2: Data-Driven Prosody and Voice Quality Generation for Emotional Speech
Zeynep Inanoglu & Steve Young, Machine Intelligence Lab, CUED
Toshiba Update, 04/09/2006

Slide 3: Project Goal Restated
- Given a neutral utterance and all the features that can be extracted from a TTS framework, convert the utterance into a specified emotion with zero degradation in quality.
  - Particularly useful for unit-selection synthesizers with good-quality neutral output.
- What can be modified?
  - Prosody (F0, duration, prominence structure)
  - Voice quality (spectral features relating to the voice source and filter)
- Method: data-driven learning and generation (decision trees, HMMs).
- This presentation addresses both prosody generation and voice quality modification.

Slide 4: Data Specifics – Female Speaker
- Parallel data: 595 utterances of each emotion; 545 used for training the different modules, 50 set aside as the test set.
- Features extracted:
  - Text-based features: phone identity, lexical stress, syllable position, word length, word position, part of speech.
  - Perceptual features (intonation units): ToBI-based syllable labels {alh, ah, al, c}, i.e. three accent types plus one symbol for unaccented syllables, extracted automatically by Prosodizer.

Slide 5: Step 1: Convert Intonation Unit Sequence
- Assumption: each emotion has an intrinsic pattern in its intonation-unit sequence (e.g. surprised utterances use alh much more frequently than neutral speech does).
- For each input neutral unit and its context, we want to find the corresponding unit in a target emotion.
- Training: decision trees trained on the parallel data. This amounts to sliding a decision tree along the utterance and generating an output unit at each syllable (sketched in code below).
- Example: the neutral sequence { alh c c c ah c al c } plus text-based features, passed through sequence conversion, yields the target-emotion sequence { alh c alh c ah c alh c }.
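To make the sliding-tree idea concrete, here is a minimal Python sketch using scikit-learn. The feature encoding (a one-unit context window over the neutral sequence) and the toy training pair are illustrative assumptions; the actual trees also condition on the text-based features of slide 4.

```python
# Minimal sketch: per-syllable intonation-unit conversion with a decision
# tree slid along the utterance. Feature encoding is a toy assumption.
from sklearn.tree import DecisionTreeClassifier

UNITS = ["alh", "ah", "al", "c"]  # three accent types + unaccented

def featurize(seq, i):
    """Encode the neutral unit at position i plus its +/-1 neighbours."""
    def code(j):
        return UNITS.index(seq[j]) if 0 <= j < len(seq) else -1
    return [code(i - 1), code(i), code(i + 1)]

# Parallel training data: aligned neutral / target-emotion sequences.
neutral_seqs = [["alh", "c", "c", "c", "ah", "c", "al", "c"]]
target_seqs  = [["alh", "c", "alh", "c", "ah", "c", "alh", "c"]]

X, y = [], []
for n, t in zip(neutral_seqs, target_seqs):
    for i in range(len(n)):
        X.append(featurize(n, i))
        y.append(t[i])

tree = DecisionTreeClassifier().fit(X, y)

def convert_sequence(neutral_seq):
    """Slide the tree along the utterance, one output unit per syllable."""
    return [tree.predict([featurize(neutral_seq, i)])[0]
            for i in range(len(neutral_seq))]

print(convert_sequence(["alh", "c", "c", "c", "ah", "c", "al", "c"]))
```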

Slide 6: Step 1: Sequence Conversion Results
- Sequence prediction accuracy is computed on the test data as the proportion of units that match between the converted and target sequences (one minus the substitution error rate).
- As a benchmark, the unchanged neutral sequence is also compared to the target sequence.
- Sequence conversion improves accuracy for happy, angry and surprised, and leaves sad essentially unchanged.

Sequence prediction accuracy (%):
  Emotion    | Neutral sequence | Converted sequence
  happy      | 61.90            | 66.52
  sad        | 60.81            | 60.04
  angry      | 60.26            | 72.45
  surprised  | 53.23            | 61.03
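Since the corpora are parallel (one intonation unit per syllable), the metric reduces to a per-position match rate. A minimal sketch, with toy sequences:

```python
# Minimal sketch: per-unit match accuracy between two aligned
# intonation-unit sequences (100% minus the substitution error rate).
def unit_accuracy(pred, target):
    """Percentage of positions where the predicted unit matches the target."""
    assert len(pred) == len(target), "parallel sequences must align 1:1"
    return 100.0 * sum(p == t for p, t in zip(pred, target)) / len(target)

# Score the unchanged neutral sequence (benchmark), then the converted one.
neutral   = ["alh", "c", "c", "c", "ah", "c", "al", "c"]
converted = ["alh", "c", "alh", "c", "ah", "c", "alh", "c"]
target    = ["alh", "c", "alh", "c", "ah", "c", "al", "c"]
print(unit_accuracy(neutral, target), unit_accuracy(converted, target))
```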

Slide 7: Step 2: Intonation Model Training
- Each syllable's intonation is modelled as a three-state left-to-right HMM.
- Each model is highly context-sensitive. An example model name: alh+lex@1:wpos@1:spos@3:pofs@3:word_len@3:syltype@2
- Syllable models are trained on interpolated F0 and energy values derived from the laryngograph signal, together with their first- and second-order differentials: (F0, E, ΔF0, Δ²F0, ΔE, Δ²E).
- Decision-tree-based parameter tying was performed.
- At run time, the intonation-unit sequence { alh c alh c … } and text-based features drive the intonation models to produce an F0 contour.
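For concreteness, a minimal sketch of assembling the six-dimensional observation vectors; the central-difference delta window is an assumption (HTK-style training uses configurable regression windows):

```python
# Minimal sketch: build (F0, E, dF0, d2F0, dE, d2E) observation vectors
# from interpolated F0 and energy tracks, one row per frame.
import numpy as np

def deltas(x):
    """First-order differentials via a +/-1-frame central difference."""
    padded = np.pad(np.asarray(x, float), 1, mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def observation_matrix(f0, energy):
    """Stack statics with first- and second-order differentials."""
    f0, energy = np.asarray(f0, float), np.asarray(energy, float)
    d_f0, d_e = deltas(f0), deltas(energy)
    return np.column_stack([f0, energy, d_f0, deltas(d_f0), d_e, deltas(d_e)])

print(observation_matrix([120, 125, 131, 128], [60, 62, 61, 59]).shape)  # (4, 6)
```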

Slide 8: Step 2: Model-Based Intonation Generation
- The goal is to generate an optimal sequence of F0 values directly from the context-sensitive syllable HMMs, given the intonation-unit sequence.
- Naive maximum-likelihood generation from the state sequence alone results in a sequence of piecewise-constant state mean values.
- Instead, the cepstral parameter generation algorithm of the HTS system (Tokuda et al., 1995) is applied to interpolated F0 generation: the differential F0 features are used as constraints in contour generation, which results in smoother contours (see the sketch below).
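A compact numpy sketch of the constrained-generation idea (not the HTS implementation itself): choose the static contour c maximising the Gaussian likelihood of the stacked static-plus-delta observations, which reduces to solving (W' U^-1 W) c = W' U^-1 mu. The frame-level means and variances and the +/-1-frame delta window are toy assumptions.

```python
# Minimal sketch of parameter generation with delta constraints, in the
# spirit of Tokuda et al. (1995). means/variances: (T, 2) arrays of state
# statistics for (F0, dF0), one row per frame.
import numpy as np

def generate_contour(means, variances):
    T = means.shape[0]
    # W maps T static F0 values to 2T stacked [static; delta] observations.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                       # static row
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        W[2 * t + 1, hi] += 0.5                 # central-difference delta row
        W[2 * t + 1, lo] -= 0.5
    mu = means.reshape(-1)                      # (F0, dF0) interleaved per frame
    u_inv = 1.0 / variances.reshape(-1)         # diagonal precisions
    A = W.T @ (u_inv[:, None] * W)
    b = W.T @ (u_inv * mu)
    return np.linalg.solve(A, b)                # smooth static F0 contour

means = np.array([[120, 0], [120, 5], [130, 5], [140, 0]], float)
print(generate_contour(means, np.ones_like(means)))
```

Because the delta rows penalise frame-to-frame jumps, the solved contour interpolates smoothly between state means instead of stepping.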

Slide 9: Step 2: Model-Based Intonation Generation – Evaluation
- It is difficult to obtain an objective measure of F0 similarity that is perceptually relevant.
- A simple approach is to align the target and model-generated contours so that both have N pitch points, then measure the RMS error per utterance.
- The average RMS error over all generated test contours is given below. The first row gives the error between the neutral and target contours as a benchmark.

RMSE in Hz:
  Contour pair           | Surprised | Angry | Happy | Sad
  RMSE (neutral, target) | 93.7      | 138.8 | 68.8  | 34.8
  RMSE (gen, target)     | 64.7      | 68.3  | 65.6  | 27.2
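A minimal sketch of the align-then-RMSE measure; linear resampling to N points is an assumption about the alignment step:

```python
# Minimal sketch: resample two F0 contours to the same number of pitch
# points, then compute the per-utterance RMS error in Hz.
import numpy as np

def rmse_aligned(f0_a, f0_b, n_points=100):
    def resample(f0):
        f0 = np.asarray(f0, float)
        return np.interp(np.linspace(0, 1, n_points),
                         np.linspace(0, 1, len(f0)), f0)
    a, b = resample(f0_a), resample(f0_b)
    return float(np.sqrt(np.mean((a - b) ** 2)))

print(rmse_aligned([120, 130, 150, 140], [118, 136, 160, 150, 141]))
```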

Slide 10: Step 3: Duration Tree Training
- A decision tree was built for each voiced broad class: vowels, nasals, glides and fricatives.
- All text-based features and intonation units were used to build the trees; the most significant features varied with emotion and phone class.
- For each test utterance, a duration tier was constructed by taking the ratio of predicted duration to neutral duration for each phone (see the sketch below).
- At run time, text-based features and the intonation-unit sequence { alh c alh c … } drive the duration trees to produce a duration tier.
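A minimal sketch of the duration-tier construction; `predict_duration` stands in for a trained per-broad-class tree, and the phone set and numbers are toy assumptions:

```python
# Minimal sketch: a duration tier as per-phone ratios of predicted
# (emotional) duration to neutral duration, later consumed by TD-PSOLA.
def duration_tier(phones, neutral_durs, predict_duration):
    """One time-scaling factor per phone."""
    return [predict_duration(p) / d for p, d in zip(phones, neutral_durs)]

# Toy predictions: stretch the vowels by 20%, leave the rest unchanged.
predicted = {"ae": 0.12, "n": 0.06, "g": 0.05, "r": 0.05, "iy": 0.144}
tier = duration_tier(["ae", "n", "g", "r", "iy"],
                     [0.10, 0.06, 0.05, 0.05, 0.12],
                     predicted.get)
print(tier)  # approximately [1.2, 1.0, 1.0, 1.0, 1.2]
```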

Slide 11: Step 3: Duration Trees – Evaluation
- Assume a Poisson distribution at each leaf of the decision tree: P(d | λ) = λ^d e^(-λ) / d!, where λ is the mean of the leaf node (the predicted duration).
- Measure the performance of the duration trees by using them as a classifier: how likely is it for, say, happy durations to have been generated by the neutral/happy/sad/… trees?
- We want the diagonal entry to be the maximum (most likely) of each row.

Mean log-likelihood of test data (rows: actual durations; columns: decision tree used):
             | Neutral | Happy  | Sad    | Surprised | Angry
  Neutral    | -9.28   | -10.18 | -11.14 | -11.01    | -11.49
  Happy      | -11.52  | -10.12 | -12.01 | -11.41    | -11.87
  Sad        | -14.35  | -13.04 | -12.05 | -14.90    | -15.29
  Surprised  | -12.13  | -10.86 | -13.13 | -10.23    | -11.39
  Angry      | -12.82  | -11.17 | -13.37 | -11.53    | -10.95
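A minimal sketch of the scoring, assuming durations are discretised into integer frame counts so that the Poisson pmf applies:

```python
# Minimal sketch: mean Poisson log-likelihood of observed durations under
# a tree's predicted leaf means, used to score each tree as a classifier.
from scipy.stats import poisson

def mean_log_likelihood(durations, leaf_means):
    """Average of log P(d | lambda) over (observed, predicted-mean) pairs."""
    lls = [poisson.logpmf(d, lam) for d, lam in zip(durations, leaf_means)]
    return sum(lls) / len(lls)

# Toy example: observed frame counts vs. one tree's predicted means.
print(mean_log_likelihood([10, 6, 12], [9.5, 6.2, 11.0]))
```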

Slide 12: Overview of Run-Time System
- Neutral sequence { alh c c c ah c al c } + text-based features → sequence conversion → target-emotion sequence { alh c alh c ah c alh c }.
- Target-emotion sequence + text-based features → intonation models → F0 contour.
- Target-emotion sequence + text-based features → duration trees → duration tier.
- F0 contour + duration tier → TD-PSOLA (Praat) → converted utterance.
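Glue code wiring the stages together might look like the following; every component name here is a hypothetical stand-in for the modules sketched on the previous slides, with `td_psola` standing in for resynthesis via Praat's TD-PSOLA:

```python
# Minimal sketch of the run-time chain; all callables are hypothetical
# stand-ins for the trained sequence tree, intonation HMMs, duration
# trees and the Praat TD-PSOLA resynthesis step.
from typing import Callable, List, Sequence

def emotionalise(wav: bytes,
                 neutral_units: List[str],
                 text_features: Sequence,
                 convert_sequence: Callable,
                 generate_f0: Callable,
                 build_duration_tier: Callable,
                 td_psola: Callable) -> bytes:
    """Neutral utterance in, prosody-converted emotional utterance out."""
    target_units = convert_sequence(neutral_units, text_features)
    f0_contour = generate_f0(target_units, text_features)
    tier = build_duration_tier(target_units, text_features)
    return td_psola(wav, f0_contour, tier)
```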

Slide 13: Prosodic Conversion – Samples
[Audio samples: Neutral, Happy, Sad, Surprised, Angry]

Slide 14: Experiments With Voice Quality
- Analysis of long-term average spectra for vowels (e.g. /ae/):
  - Pitch-synchronous analysis (single-pitch-period frames).
  - Total power in each frame normalised to a constant value.
- Findings:
  - Anger has significantly more energy in the 1550-2500 Hz band and less in 0-800 Hz.
  - Sadness has a sharper spectral tilt and more low-frequency energy.
  - Happy and surprised follow similar spectral patterns.
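A minimal sketch of the analysis, assuming single-period frames have already been excised at pitch marks (real pitch-synchronous analysis would window and align the periods more carefully):

```python
# Minimal sketch: long-term average spectrum over power-normalised
# single-pitch-period frames, plus a band-energy comparison helper.
import numpy as np

def ltas(frames, sample_rate, n_fft=1024):
    """Average power spectrum, each frame normalised to constant power."""
    spectra = []
    for frame in frames:
        frame = np.asarray(frame, float)
        frame = frame / np.sqrt(np.mean(frame ** 2) + 1e-12)
        spectra.append(np.abs(np.fft.rfft(frame, n_fft)) ** 2)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sample_rate)
    return freqs, np.mean(spectra, axis=0)

def band_energy(freqs, spectrum, lo_hz, hi_hz):
    """Summed power in [lo_hz, hi_hz), e.g. 1550-2500 Hz for anger."""
    mask = (freqs >= lo_hz) & (freqs < hi_hz)
    return float(np.sum(spectrum[mask]))
```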

Slide 15: Upcoming Work
- More experiments with voice quality modification: a decision-tree-based filter-bank generation approach.
- Combine voice quality processing with prosody generation.
- Apply the techniques to the MMJ (male) corpus; compare performance across gender.
- Perceptual study: acquire recognition scores across emotions, gender and feature set.
- Miscellaneous: application of the FSP and MMJ models to a new speaker/language.

