2  Data-driven Parameter Generation for Emotional Speech Synthesis
Zeynep Inanoglu & Steve Young, Machine Intelligence Lab, CUED
March 12th, 2008 (slide template: © 2006 IBM Corporation, 26/02/2007)

3  Project Goals
- Rapid generation of expressive speech using LIMITED training data.
- A modular, data-driven emotion conversion framework.
- Focus on advanced prosody conversion techniques:
  - Two alternative techniques for F0 conversion
  - A duration conversion module
  - Incorporation of basic linguistic context in prosody conversion
- Case study with three emotions: surprise, sadness and anger.
  - 272 utterances for training and 28 for testing.
- Evaluation of individual modules and the combined emotion conversion system.

Duration of training utterances (min): Neutral 13.9, Surprised 15.1, Sad 16.4, Angry 14.9

4  Experimental Setup
[Block diagram] The neutral waveform is processed along three paths:
- Spectra: pitch-synchronous LPC analysis, then GMM-based spectral conversion, then OLA synthesis, giving a converted waveform.
- Duration: neutral phone durations plus linguistic context (phone, syllable, word) feed duration conversion, producing a duration tier applied with TD-PSOLA; new syllable durations are extracted from it.
- F0: the neutral F0 contour plus linguistic context (syllable, word) feeds HMM-based F0 generation or F0 segment selection; the resulting F0 contour is applied with TD-PSOLA to give the final waveform.

5  GMM-based Spectral Conversion
- Complements the prosody conversion modules.
- A popular method for voice conversion (Stylianou et al., 1998).
- Pitch-synchronous LPC analysis (order 30) and OLA synthesis.
- A linear transformation F is learned from the parallel data and applied to each speech frame.
- [Figure: long-term average spectra of the vowel /ae/ in the test data.]
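The per-frame linear transform can be sketched as follows. This is a minimal illustration, not the Stylianou et al. method itself: it assumes a single global transform estimated by least squares from time-aligned parallel feature frames, whereas the GMM-based method softly combines several class-dependent transforms. All names and dimensions here are illustrative.

```python
import numpy as np

def learn_linear_transform(X_neutral, Y_emotional):
    """Learn one global linear map F (plus bias) from time-aligned
    parallel frames by least squares, so that Y ~ [X, 1] @ F."""
    # Append a bias column so the map can also shift the spectrum.
    Xb = np.hstack([X_neutral, np.ones((X_neutral.shape[0], 1))])
    F, *_ = np.linalg.lstsq(Xb, Y_emotional, rcond=None)
    return F

def convert_frames(X, F):
    """Apply the learned transform to each speech frame."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ F

# Toy example: 100 aligned frames of 30-dim LPC-derived features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
true_map = rng.normal(size=(30, 30))
Y = X @ true_map + 0.01 * rng.normal(size=(100, 30))
F = learn_linear_transform(X, Y)
Y_hat = convert_frames(X, F)
print(np.allclose(Y_hat, Y, atol=0.1))
```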

6  F0 Generation from Syllable HMMs
- Syllable modelling from scratch based on limited training data.
- Linguistic units for F0 modelling:
  - Model initialization with short labels: spos, wpos, lex.
  - Model training with full-context labels: spos, wpos, lex, pofs, ppofs, onset, coda.
  - Decision-tree based parameter tying was performed based on a log-likelihood threshold.

Feature                                  Label        Values
Lexical stress                           lex          stressed (1), unstressed (0)
Position in word                         wpos         single-syllable (0), initial (1), middle (2), final (3)
Position in sentence                     spos         first three words (1-3), middle (4), last three words (5-7)
Part of speech / previous part of speech pofs / ppofs determiner (1), verb (2-3), noun (4-5), adjective (6), adverb (7), preposition (8), pronoun (9), conjunctive (10), aux (11), wh (12), intensifier (13)
Onset                                    onset        no onset (0), unvoiced onset (1), sonorant (2)
Coda                                     coda         no coda (0), unvoiced coda (1), sonorant (2)

7  Syllable MSD-HMMs
Example full-context label: spos@7:wpos@3:lex@0+pofs@5:ppofs@8:onset@1:coda@1
- Features: F0, ΔF0, ΔΔF0.
- Previously used interpolated contours; updated to syllable MSD-HMMs.
- Two spaces: voiced / unvoiced.
- A uv model was introduced to represent all unvoiced regions within syllables.
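The full-context label shown on this slide can be assembled from the seven context features. A trivial helper, hypothetical and for illustration only, that reproduces the example string:

```python
def make_full_context_label(spos, wpos, lex, pofs, ppofs, onset, coda):
    """Format syllable context features into a full-context label.
    The short-label part (spos, wpos, lex) used for initialization is
    separated from the extended context by '+', as on the slide."""
    short = f"spos@{spos}:wpos@{wpos}:lex@{lex}"
    full = f"pofs@{pofs}:ppofs@{ppofs}:onset@{onset}:coda@{coda}"
    return short + "+" + full

print(make_full_context_label(7, 3, 0, 5, 8, 1, 1))
# spos@7:wpos@3:lex@0+pofs@5:ppofs@8:onset@1:coda@1
```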

8  F0 Generation from Syllable HMMs
- Given a sequence of syllable durations and syllable context for an utterance, generate an F0 contour using the parameter generation algorithm of the HTS framework (Tokuda et al., 2000).
- Incorporating global variance (GV) in parameter generation made only a slight perceptual difference, and only for surprise.
- [Figure: generated contours for angry, sad and surprised, with and without GV.]
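The HTS parameter generation step can be sketched in a much-simplified form: with diagonal covariances and only static plus delta features (first differences, no ΔΔ and no GV), the maximum-likelihood static trajectory c solves (W' U^-1 W) c = W' U^-1 μ. A hedged numpy sketch of that linear system, not the HTS implementation:

```python
import numpy as np

def mlpg(means, variances):
    """Simplified ML parameter generation (after Tokuda et al., 2000):
    1-D static + delta features, diagonal covariances, no GV.
    means, variances: (T, 2) arrays of [static, delta] statistics per frame.
    Solves (W' U^-1 W) c = W' U^-1 mu for the static contour c."""
    T = means.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0              # static row picks out c_t
        if t > 0:                      # delta row encodes c_t - c_{t-1}
            W[2 * t + 1, t] = 1.0
            W[2 * t + 1, t - 1] = -1.0
    mu = means.reshape(-1)             # interleaved [static, delta] per frame
    prec = 1.0 / variances.reshape(-1) # diagonal U^-1
    A = W.T @ (prec[:, None] * W)      # W' U^-1 W
    b = W.T @ (prec * mu)              # W' U^-1 mu
    return np.linalg.solve(A, b)

# A flat target: static mean 100 Hz, delta mean 0 -> a flat 100 Hz contour.
means = np.column_stack([np.full(5, 100.0), np.zeros(5)])
contour = mlpg(means, np.ones_like(means))
print(np.round(contour, 6))
```

The delta rows pull the solution toward smooth trajectories, which is exactly why HMM-generated F0 contours avoid the frame-to-frame jumps a purely static model would produce.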

9  F0 Segment Selection
- Unit selection applied to syllable F0 segments.
  - Parallel neutral and emotional F0 segments and their common context form the units (U).
  - An input specification sequence I consists of syllable context and the input contour.
- Given a sequence of input syllable specifications, the goal is to find the minimum-cost path through the trellis of possible F0 segment combinations using Viterbi search.

10  F0 Segment Selection Continued
- The target cost T is a Manhattan distance consisting of P subcosts.
- Two types of target subcosts:
  - A binary value (0 or 1) indicating context match for a given context feature.
  - The Euclidean distance between the input segment and the neutral segment of unit u_j.
- Concatenation cost J:
  - Zero if adjacent syllables are detached, i.e. separated by an unvoiced region.
  - Non-zero if syllables are attached, i.e. part of one continuous voiced segment: the distance between the last F0 value E of unit s-1 and the first F0 value B of unit s, so J_{s,s-1} = |B - E|.
- Pruning based on segment durations.
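The search described above can be sketched as a small Viterbi pass. This is a toy illustration with invented data structures (dict-based units, one weight per context feature) and it assumes all contours have been resampled to equal length; it is not the authors' implementation.

```python
import numpy as np

def select_segments(inputs, units, w_ctx, w_f0, w_concat):
    """Minimum-cost path through the trellis of candidate F0 segments.
    inputs: per-syllable dicts with 'context', neutral 'contour', and
    'attached' (True if joined to the previous syllable by voicing).
    units: dicts with 'context', 'neutral' and 'emotional' contours."""
    T, n = len(inputs), len(units)
    cost = np.full((T, n), np.inf)
    back = np.zeros((T, n), dtype=int)

    def target(inp, u):
        # Binary context-mismatch subcosts plus input-contour distance.
        mism = sum(w * (a != b)
                   for w, a, b in zip(w_ctx, inp['context'], u['context']))
        return mism + w_f0 * np.linalg.norm(inp['contour'] - u['neutral'])

    def concat(prev_u, u, attached):
        if not attached:
            return 0.0  # detached syllables: no join penalty
        # |B - E|: first F0 of current unit vs. last F0 of previous unit.
        return w_concat * abs(u['emotional'][0] - prev_u['emotional'][-1])

    for j, u in enumerate(units):
        cost[0, j] = target(inputs[0], u)
    for t in range(1, T):
        for j, u in enumerate(units):
            joins = [cost[t - 1, k] + concat(units[k], u, inputs[t]['attached'])
                     for k in range(n)]
            k_best = int(np.argmin(joins))
            cost[t, j] = joins[k_best] + target(inputs[t], u)
            back[t, j] = k_best
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 1, 1 - 1, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1][1:] if False else list(reversed(path))[0:T] and path[::-1][-T:], float(cost[-1].min())

# Toy database: two units distinguished by a single context feature.
units = [
    {'context': (0,), 'neutral': np.array([100.0, 100.0]),
     'emotional': np.array([120.0, 120.0])},
    {'context': (1,), 'neutral': np.array([200.0, 200.0]),
     'emotional': np.array([240.0, 240.0])},
]
inputs = [
    {'context': (0,), 'contour': np.array([100.0, 100.0]), 'attached': False},
    {'context': (1,), 'contour': np.array([200.0, 200.0]), 'attached': True},
]
path, total = select_segments(inputs, units, w_ctx=(10.0,), w_f0=0.1, w_concat=0.01)
print(path)
```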

11  Weights Used in Segment Selection
- P weights w_T are estimated automatically, corresponding to the P target subcosts for detached syllables (P = 8).
- A separate set of weights w_TJ is estimated for attached syllables, where the cost of attaching unit u_k to a preceding unit u_j also includes the concatenation cost.

w_T        Surprised  Sad    Angry
w_lex      13.67      12.30  18.74
w_wpos     24.52      11.29  18.47
w_spos     11.33      4.91   3.31
w_pofs     1.13       4.82   8.82
w_ppofs    24.27      6.49   10.54
w_onset    15.08      0.33   5.54
w_coda     8.23       6.09   6.36
w_F0       0.47       0.69   1.00

w_TJ       Surprised  Sad    Angry
w_lex      17.89      6.43   15.98
w_wpos     0.0        0.0    0.0
w_spos     0.0        0.0    0.0
w_pofs     0.0        0.0    0.0
w_ppofs    0.0        0.0    0.0
w_onset    3.23       0.0    0.0
w_coda     0.0        0.0    8.74
w_F0       0.27       0.37   0.68
w_concat   0.74       0.70   0.48

12  Conversion of F0 & Spectra
[Audio examples: syllable HMMs vs. segment selection, for surprised, angry and sad.]

13  Duration Conversion
- A regression tree was built for each broad class and emotion.
  - Broad classes: vowels, nasals, glides and fricatives.
  - Relative trees outperform absolute trees: scaling factors are predicted rather than absolute durations.
  - Cross-validation on the training data was used to find the best pruning level.
- Seven feature groups (FG) were investigated as predictors in the tree.
  - The FG giving the smallest RMS error on test data was used for each emotion and broad class.
  - Optimal RMS errors are 25-35 ms, which is not very small.
  - RMSE for surprise improves with higher-level linguistic features.

Best FG (RMSE in ms):
            Angry       Sad         Surprised
Vowels      FG1 (37)    FG2 (29.7)  FG5 (34.5)
Glides      FG2 (29.8)  FG2 (29.5)  FG2 (29)
Nasals      FG2 (27.5)  FG2 (21)    FG2 (28.6)
Fricatives  FG3 (32.8)  FG2 (28.7)  FG4 (34.5)

Feature group  Predictors
FG0            input duration
FG1            FG0 + phone ID
FG2            FG1 + left context, right context
FG3            FG2 + lexical stress
FG4            FG3 + position in sentence
FG5            FG4 + position in word
FG6            FG5 + part of speech
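The "relative" idea, predicting a scaling factor instead of an absolute duration, can be shown with a deliberately degenerate stand-in for the regression tree: a one-level model that learns one median ratio per phone ID. The phone IDs and durations below are invented, and a real tree would of course split on the FG predictors listed above.

```python
from collections import defaultdict
from statistics import median

def train_relative_duration_model(pairs):
    """One-level stand-in for the relative regression tree: for each
    phone ID, learn the median emotional/neutral duration ratio.
    pairs: list of (phone_id, neutral_ms, emotional_ms)."""
    ratios = defaultdict(list)
    for phone, neutral, emotional in pairs:
        ratios[phone].append(emotional / neutral)
    return {phone: median(r) for phone, r in ratios.items()}

def convert_duration(model, phone, neutral_ms, default=1.0):
    """Scale a neutral duration by the learned factor for its phone."""
    return neutral_ms * model.get(phone, default)

# Toy training data: this invented "sad" speech lengthens /aa/ by 1.3x.
train = [('aa', 100, 130), ('aa', 80, 104), ('n', 60, 60)]
model = train_relative_duration_model(train)
print(round(convert_duration(model, 'aa', 90), 1))  # 117.0
```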

14  Duration Conversion & F0 Segment Selection
[Figure: example duration tier.]

15  Perceptual Listening Test – Part 1
- Evaluation of spectral conversion through preference tests: "Which one sounds angrier/sadder/more surprised?"
  - Spectral conversion was applied to only one stimulus in each pair.
  - Segment selection was used to predict the F0 contours for both.
- Five pairs of stimuli for each emotion were presented to 20 subjects; carrier utterances were changed after 10 subjects for variety.
- Spectral conversion contributes to the perception of anger and sadness.
- Spectral conversion did not have an effect on the perception of surprise: due to its less muffled quality, unmodified speech was usually preferred for surprise.

           Prefer no conversion  Prefer conversion
Angry      9%                    91%
Surprised  68%                   32%
Sad        13%                   87%

16  Perceptual Listening Test – Part 2
- Two-way comparison of HMM-based F0 contours with a baseline.
- Spectral conversion was applied to both stimuli; no duration modification.
- Five pairs of stimuli for each emotion were presented to 30 subjects. Carrier sentences were changed for each group of 10 subjects, so 15 unique utterances were evaluated.

17  Perceptual Listening Test – Part 3
- Three-way comparison of HMM-based F0 contours, segment selection and a naive baseline: "Which one sounds angriest/saddest/most surprised?"
- Spectral conversion was applied to all stimuli; no duration modification.
- Ten comparisons for each emotion were presented to 30 subjects.
- In all parts of the test, subjects were asked to identify the emotion for which they had the most difficulty deciding between the options.

% of subjects who picked an emotion as "the hardest to choose":
           Two-way test  Three-way test
Angry      43.3%         13.3%
Surprised  16.7%         16.7%
Sad        40%           70%

18  Perceptual Listening Test – Part 3
[Figure: results chart.]

19  Perceptual Listening Test – Part 4
- Two-way comparison of segment selection with and without duration conversion.
  - Spectral conversion was applied to all stimuli.
  - Note that converted durations may affect the selected contours due to the pruning criteria.
- Ten comparisons for each emotion were presented to 30 subjects.

20  Perceptual Listening Test – Part 5
- Forced-choice emotion classification task on the original emotional utterances.
  - 15 utterances (5 per emotion) were randomly presented to 30 subjects.
  - A "Can't decide" option was available in order to avoid forcing subjects to choose an emotion.
- The speaker conveys anger and sadness very clearly.
- Surprise is frequently confused with anger: the speaker used a tense voice quality similar to anger, which may have misled listeners despite clear prosody.

           Angry   Surprised  Sad     Can't decide
Angry      99.3%   0.7%       0%      0%
Surprised  20.0%   66.0%      0%      14%
Sad        0.7%    0%         96.0%   3.3%

21  Perceptual Listening Test – Part 5
- Forced-choice emotion classification task on converted utterances.
  - Thirty utterances (10 per emotion) were randomly presented to 30 subjects.
  - Spectral conversion and duration conversion were applied to all utterances.
  - Two hidden groups within each emotion: HMM-based contours and segment selection.
  - Subjects also rated each utterance as "Sounds OK" or "Sounds strange".

HMM-based F0  Angry   Surprised  Sad     Can't decide
Angry         64.7%   8.0%       4.7%    22.6%
Surprised     10.0%   60.7%      0%      29.3%
Sad           0.7%    0.7%       96.0%   2.6%

- Anger "sometimes" doesn't sound "mean enough": it sounds "stern" rather than "angry", yet natural.
- The surprise element is "not always in the right place"; it sounds awkward more than half the time.

22  Perceptual Listening Test – Part 5

Segment selection  Angry   Surprised  Sad     Can't decide
Angry              86.7%   0.7%       0%      12.6%
Surprised          8.7%    76.7%      0%      14.7%
Sad                0.7%    0%         87.3%   12%

- Significant improvement in the recognition of anger with segment selection (64.7% with syllable HMMs), with no loss of intonation quality (75.3% "sounds OK" with syllable HMMs, 77.3% with segment selection).
  - Prosody, not just voice quality, plays an important role in communicating anger.
  - The remaining ~10% is probably lost to spectral smoothing.
- The recognition rate for surprise improves from 60.7% with syllable HMMs to 76.7%, better than the perception of original surprised speech.
  - Spectral conversion fails to capture the harsh voice quality of surprise, which had misled subjects in the original surprised speech (a nice side effect).
  - Intonation quality was much higher with segment selection (from 47.3% to 73.3%).
- Sadness was captured slightly less consistently with segment selection; intonation quality was consistent.

23  Conclusions & Future Research
- An emotion conversion system was implemented as a means of parameter generation for expressive speech.
  - Trained on roughly 15 minutes of parallel data for each of three emotions.
  - Basic linguistic context was incorporated in conversion.
- The subjective evaluation showed that each module helps communicate a particular emotion to varying degrees; the overall system was able to communicate all emotions with considerable success.
- Segment selection proved to be a highly successful method for F0 conversion even when very little data is available.
- Possible areas for future research:
  - Modelling of pause and phrasing
  - Conversion of perceptual intensity
  - Context-sensitive conversion of spectra
  - More advanced concatenation costs for segment selection

24  Thank you for your support.

25  Backup Slides: Model Training
- 3-state left-to-right MSD-HMM with 3 mixtures:
  - Two mixture components for the voiced space.
  - A single zero-dimensional mixture for the unvoiced space.
- Model training ensures the uv model has a very high weight for the zero-dimensional mixture component.
- Decision-tree based parameter tying using a log-likelihood criterion. Separate trees were built for each position in sentence and each state.

ANGER (spos=7)         State 2  State 3  State 4
Number of leaf nodes   41       56       28
Top 10 tree questions:
  State 2: wpos=0, wpos=3, lex=0, wpos=2, ppofs=1, onset=1, onset=2, pofs=7, pofs=3, onset=0
  State 3: wpos=1, wpos=2, lex=1, wpos=0, onset=0, ppofs=8, ppofs=13, coda=1, ppofs=3, ppofs=1
  State 4: wpos=3, wpos=0, lex=0, ppofs=6, wpos=1, coda=0, coda=2, pofs=3, pofs=7, coda=1

SURPRISE (spos=7)      State 2  State 3  State 4
Number of leaf nodes   55       74       27
Top 10 tree questions:
  State 2: wpos=0, lex=1, wpos=3, wpos=2, ppofs=8, ppofs=2, wpos=1, onset=1, ppofs=6, pofs=5
  State 3: lex=1, wpos=3, wpos=0, wpos=2, ppofs=3, pofs=7, coda=1, pofs=5, ppofs=4, ppofs=9
  State 4: wpos=3, wpos=1, lex=1, wpos=0, ppofs=7, coda=2, ppofs=8, ppofs=2, ppofs=9, coda=0

26  F0 Generation from Syllable HMMs
- Objective evaluation is difficult since perceptually correlated measures do not exist.
  - RMS distance to the true emotional contour is not a reliable source of information: speakers have a multitude of strategies for expressing a given emotion.
  - As a crude measure, it may still help compare general patterns across methods.

RMSE in Hz                Angry   Sad    Surprised
RMSE (neutral, target)    145.20  32.45  115.44
RMSE (generated, target)  75.62   23.87  103.11
RMSE (scaled, target)     76.61   18.82  115.72
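The measure in the table is the plain root-mean-square distance between two contours. A minimal sketch, assuming equal-length contours in Hz (the values below are invented, not the table's data):

```python
import numpy as np

def rmse(a, b):
    """RMS distance in Hz between two equal-length F0 contours."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

target = [200.0, 220.0, 210.0]     # true emotional contour
neutral = [120.0, 125.0, 122.0]    # unconverted neutral contour
generated = [190.0, 215.0, 205.0]  # generated contour
print(rmse(neutral, target) > rmse(generated, target))  # generated is closer
```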

27  Weight Estimation for Detached Syllables
- P weights w_T, corresponding to the P target subcosts, are estimated using held-out data.
- Least-squares framework with X equations and P unknowns, where X >> P:
  - We already have the target F0 segment we would like to predict.
  - Find the N-best and N-worst candidates in the unit database.
  - Each error E represents the RMS error between the target contour and the best and worst units.
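The least-squares setup can be sketched as follows, with synthetic data standing in for the real subcost matrix and RMS errors; the actual N-best/N-worst construction from the unit database is not reproduced here.

```python
import numpy as np

def estimate_weights(subcosts, errors):
    """Least-squares estimate of the P target-subcost weights.
    subcosts: (X, P) matrix, one row of P subcost values per best/worst
    candidate. errors: (X,) RMS errors between each candidate's contour
    and the true target contour. Solves subcosts @ w ~ errors."""
    w, *_ = np.linalg.lstsq(subcosts, errors, rcond=None)
    return w

# Synthetic overdetermined system: X = 200 equations, P = 8 unknowns.
rng = np.random.default_rng(1)
P, X = 8, 200
true_w = rng.uniform(0, 25, size=P)
S = rng.uniform(0, 1, size=(X, P))
E = S @ true_w + 0.01 * rng.normal(size=X)
w_hat = estimate_weights(S, E)
print(np.allclose(w_hat, true_w, atol=0.1))
```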

28  Weight Estimation for Attached Syllables
- A separate set of weights w_TJ is estimated for attached syllables. The cost of attaching unit u_k to a preceding unit u_j is defined as the sum of the target costs for both units plus the concatenation cost of joining them.
- This combined cost is set equal to the RMS error for the N-best and N-worst pairs, and the weights are again solved by least squares.

29  Weights Used in Segment Selection

w_T        Surprised  Sad    Angry
w_lex      13.67      12.30  18.74
w_wpos     24.52      11.29  18.47
w_spos     11.33      4.91   3.31
w_pofs     1.13       4.82   8.82
w_ppofs    24.27      6.49   10.54
w_onset    15.08      0.33   5.54
w_coda     8.23       6.09   6.36
w_F0       0.47       0.69   1.00

w_TJ       Surprised  Sad    Angry
w_lex      17.89      6.43   15.98
w_wpos     0.0        0.0    0.0
w_spos     0.0        0.0    0.0
w_pofs     0.0        0.0    0.0
w_ppofs    0.0        0.0    0.0
w_onset    3.23       0.0    0.0
w_coda     0.0        0.0    8.74
w_F0       0.27       0.37   0.68
w_concat   0.74       0.70   0.48

- Lexical stress and position in word are the most important linguistic factors across all emotions.
- Position in sentence is most important for surprise.
- High weights for previous part of speech for all emotions.
- Similarity of the input contour to neutral units is very important for anger, less so for surprise.
- For attached syllables, lexical stress, input cost and concatenation cost are the major contributors to segment selection.
  - The importance of the input cost is once again highest for anger.
  - The concatenation cost plays a very important role for sadness and surprise.

