HARMONIC MODEL FOR FEMALE VOICE EMOTIONAL SYNTHESIS

Anna PŘIBILOVÁ
Department of Radioelectronics, Slovak University of Technology
Ilkovičova 3, SK Bratislava, Slovakia

Jiří PŘIBIL
Institute of Photonics and Electronics, Academy of Sciences of the Czech Republic
Chaberská 57, CZ Praha 8, Czech Republic

– Introduction
– Harmonic speech model with AR parameterization
– Spectral modifications for emotional synthesis
– Prosodic modifications for emotional synthesis
– Listening tests results
– Conclusion
Harmonic speech model with AR parameterization: voicing transition frequency
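A minimal sketch of the harmonic part of such a model may help here. It assumes the common reading that the voicing transition frequency separates a lower band, synthesized as a sum of harmonics of F0, from an upper band treated as unvoiced. The function name, the argument names, and the idea of passing the harmonic amplitudes in directly (rather than sampling them from an AR spectral envelope) are illustrative assumptions, not the authors' implementation.

```python
import math


def synthesize_voiced_frame(f0, amplitudes, phases, fs, n_samples, f_vt):
    """Sum the harmonics k*f0 that lie below the voicing transition frequency.

    f0         -- fundamental frequency in Hz
    amplitudes -- amplitude of harmonic k = 1, 2, ... (assumed given)
    phases     -- phase of harmonic k in radians
    fs         -- sampling frequency in Hz
    f_vt       -- voicing transition frequency in Hz (hypothetical parameter name)
    """
    frame = []
    for n in range(n_samples):
        sample = 0.0
        for k, (a, phi) in enumerate(zip(amplitudes, phases), start=1):
            if k * f0 >= f_vt:
                break  # harmonics at or above f_vt are left to the unvoiced band
            sample += a * math.cos(2.0 * math.pi * k * f0 * n / fs + phi)
        frame.append(sample)
    return frame


# Example: with f0 = 200 Hz and f_vt = 500 Hz only the harmonics at
# 200 Hz and 400 Hz contribute; the 600 Hz harmonic is skipped.
frame = synthesize_voiced_frame(200.0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0],
                                16000, 4, 500.0)
```

The upper band (above the voicing transition frequency) is deliberately left out of this sketch.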
Voicing transition frequency
Determination of model parameters: spectral flatness measure
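For orientation, the spectral flatness measure is usually defined as the ratio of the geometric mean to the arithmetic mean of the power spectrum. The sketch below uses that standard definition; how the measure is mapped to the voicing transition frequency is not stated on the slide and is not reproduced here. The function name is hypothetical.

```python
import math


def spectral_flatness(power_spectrum):
    """Geometric mean divided by arithmetic mean of a power spectrum.

    All values must be strictly positive. The result lies in (0, 1]:
    close to 1 for a flat, noise-like spectrum, close to 0 for a
    strongly peaked, harmonic one.
    """
    n = len(power_spectrum)
    log_mean = sum(math.log(p) for p in power_spectrum) / n
    arithmetic_mean = sum(power_spectrum) / n
    return math.exp(log_mean) / arithmetic_mean


flat = spectral_flatness([2.0] * 8)                 # flat spectrum
peaky = spectral_flatness([100.0, 1.0, 1.0, 1.0])   # one dominant peak
```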
Emotional influence on speech formants

Male formant areas (Fant, G.: Speech Acoustics and Phonetics. Kluwer Academic Publishers, Dordrecht (2004)):
  F1   250 Hz to 700 Hz
  F2   700 Hz to 2000 Hz
  F3   2000 Hz to 3200 Hz
  F4   3200 Hz to 4000 Hz

Female formant areas (+20 %):
  F1   300 Hz to 840 Hz
  F2   840 Hz to 2400 Hz
  F3   2400 Hz to 3840 Hz
  F4   3840 Hz to 4800 Hz

F1/F2 boundary: 700 Hz (male), 840 Hz (female)

Emotional influence (Scherer, K. R.: Vocal Communication of Emotion: A Review of Research Paradigms. Speech Communication, Vol. 40 (2003)):
– pleasant emotions: faucal and pharyngeal expansion, relaxation of tract walls, mouth corners retracted upward (F1 falling, resonances raised)
– unpleasant emotions: faucal and pharyngeal constriction, tensing of vocal tract walls, mouth corners retracted downward (F1 rising, F2 and F3 falling)
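The female formant areas follow from the male ones by the +20 % scaling stated on the slide. A short sketch of that arithmetic, with hypothetical names:

```python
# Male formant areas in Hz, as listed on the slide (after Fant, 2004).
MALE_FORMANT_AREAS_HZ = {
    "F1": (250, 700),
    "F2": (700, 2000),
    "F3": (2000, 3200),
    "F4": (3200, 4000),
}


def scale_formant_areas(areas, factor):
    """Scale both edges of every formant area by the same factor."""
    return {
        name: (round(low * factor), round(high * factor))
        for name, (low, high) in areas.items()
    }


# +20 % gives the female formant areas used in the presentation.
FEMALE_FORMANT_AREAS_HZ = scale_formant_areas(MALE_FORMANT_AREAS_HZ, 1.2)
```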
Spectral modifications for emotional synthesis: frequency scale transformation
Frequency scale transformation
– F1 (below the boundary F1,2): increased (decreased)
– F2, F3, F4 (above the boundary F1,2): decreased (increased)
[Figure: frequency scale transformation curves; axis f [kHz]; marked points F1,2 and fs/4]
Formant ratio between emotional and neutral speech

Chosen formant ratio (for frequency after transformation):
                                          1 (214.3 Hz)    2 ( Hz)
joyous-to-neutral formant ratio (shift)   0.7 (-30 %)     1.05 (+5 %)
angry-to-neutral formant ratio (shift)    1.35 (+35 %)    0.85 (-15 %)
sad-to-neutral formant ratio (shift)      1.1 (+10 %)     0.9 (-10 %)

Mean formant ratio in formant areas:
                                          F1 300 to 840 Hz   F2 840 to 2400 Hz   F3 2400 to 3840 Hz   F4 3840 to 4800 Hz
joyous-to-neutral formant ratio (shift)   ( %)               ( %)                ( %)                 ( 0.36 %)
angry-to-neutral formant ratio (shift)    ( %)               ( %)                ( %)                 ( 9.88 %)
sad-to-neutral formant ratio (shift)      ( %)               ( 6.17 %)           ( %)                 ( 9.24 %)

[Figure: formant shifts for joyous, angry, and sad speech]
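The relation between a formant ratio and the shift quoted in parentheses is shift = (ratio - 1) · 100 %. The sketch below applies it to the chosen ratios from the slide; the dictionary layout and function name are illustrative assumptions.

```python
# Chosen emotional-to-neutral formant ratios at the two reference
# points listed on the slide (point 1, point 2).
CHOSEN_FORMANT_RATIOS = {
    "joyous": (0.70, 1.05),
    "angry": (1.35, 0.85),
    "sad": (1.10, 0.90),
}


def ratio_to_shift_percent(ratio):
    """Convert an emotional-to-neutral ratio to a signed shift in percent."""
    return round((ratio - 1.0) * 100.0, 1)


SHIFTS_PERCENT = {
    emotion: tuple(ratio_to_shift_percent(r) for r in ratios)
    for emotion, ratios in CHOSEN_FORMANT_RATIOS.items()
}
```

A ratio below 1 therefore corresponds to a negative shift (the formant moves down), and a ratio above 1 to a positive shift.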
Prosody of emotional speech

Scherer, K. R.: Vocal Communication of Emotion: A Review of Research Paradigms. Speech Communication, Vol. 40 (2003)

EMOTION    F0 mean / F0 range / energy   duration
JOY        higher                        shorter
ANGER      higher                        shorter
SADNESS    lower                         longer

Our choice of emotional-to-neutral ratios

EMOTION    F0 mean   F0 range   energy   duration
JOY
ANGER
SADNESS
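One way to apply emotional-to-neutral ratios for F0 mean and F0 range to a neutral contour is sketched below. The ratio values used in the example are placeholders, since the chosen values are not legible in the source, and treating F0 range as the spread around the voiced mean is an assumption.

```python
def modify_f0_contour(f0_hz, mean_ratio, range_ratio):
    """Scale the mean and the spread of an F0 contour.

    f0_hz       -- F0 values in Hz, with 0.0 marking unvoiced frames
    mean_ratio  -- emotional-to-neutral ratio of the F0 mean
    range_ratio -- emotional-to-neutral ratio of the F0 range,
                   applied to deviations from the voiced mean
    """
    voiced = [f for f in f0_hz if f > 0.0]
    if not voiced:
        return list(f0_hz)
    mean = sum(voiced) / len(voiced)
    new_mean = mean * mean_ratio
    return [
        new_mean + (f - mean) * range_ratio if f > 0.0 else 0.0
        for f in f0_hz
    ]


# Placeholder ratios for illustration only (not the authors' values).
neutral = [0.0, 180.0, 200.0, 220.0]
modified = modify_f0_contour(neutral, mean_ratio=1.1, range_ratio=2.0)
```

Energy and duration ratios could be applied in the same spirit as plain multipliers on frame gain and frame count.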
Linear trend of F0 at the end of sentences

EMOTION    linear trend type   linear trend start
JOY        rising              55 % from the end
ANGER      falling             35 % from the end

[Figure panels: JOY, ANGER]
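The table above can be turned into a small sketch that imposes a linear trend on the final part of an F0 contour. It assumes that "55 % from the end" means the trend starts at the frame located 55 % of the sentence length before its end. The trend strength (`end_ratio`) is not given on the slide and is a hypothetical parameter.

```python
def apply_final_f0_trend(f0_hz, start_from_end, end_ratio):
    """Multiply the tail of an F0 contour by a linear ramp.

    f0_hz          -- F0 values in Hz, one per frame
    start_from_end -- fraction of the sentence, measured from its end,
                      where the trend begins (0.55 for joy, 0.35 for anger)
    end_ratio      -- multiplier reached at the last frame
                      (> 1 rising, < 1 falling); assumed value
    """
    n = len(f0_hz)
    start = int(round(n * (1.0 - start_from_end)))
    span = n - 1 - start
    out = list(f0_hz)
    for i in range(start, n):
        progress = (i - start) / span if span > 0 else 1.0
        out[i] = f0_hz[i] * (1.0 + (end_ratio - 1.0) * progress)
    return out


flat = [200.0] * 20
joy = apply_final_f0_trend(flat, start_from_end=0.55, end_ratio=1.2)    # rising
anger = apply_final_f0_trend(flat, start_from_end=0.35, end_ratio=0.8)  # falling
```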
Listening tests

“Determination of emotion type”
– 10 evaluation sets selected randomly from the testing corpus
– 60 short sentences (1 s to 3.5 s)
– from the Czech stories
– female professional actors
– 4 possibilities: “joy”, “anger”, “sadness”, “other”

20 listeners (16 Czechs and 4 Slovaks, 6 women and 14 men)

MS ISAPI/NSAPI DLL script
- runs on server PC
- communicates with user via HTTP protocol
Listening tests results

Confusion matrix

EMOTION    JOY      ANGER    SADNESS   OTHER
JOY        59.0 %   0.5 %    16.0 %    24.5 %
ANGER      2.5 %    73.5 %   2.0 %     22.0 %
SADNESS    0.5 %    0.5 %    90.0 %    9.0 %

(The SADNESS row shows only one 0.5 % entry in the source; the second follows from the 100 % row sum.)

Successful determination of emotions (summed for all emotions)

                              correct   not classified   exchanged
best evaluated sentence *     88.1 %    11.9 %           0 %
worst evaluated sentence **   57.6 %    30.3 %           12.1 %

* “Vše co potřeboval.” (“All he needed.”)
** “Máš ho mít.” (“You ought to have it.”)
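The diagonal of the confusion matrix (59.0 %, 73.5 %, 90.0 %) gives the share of correctly identified sentences per emotion. The sketch below ranks the emotions and computes an unweighted mean, which assumes the same number of judgments for each emotion; the names are illustrative.

```python
# Correct-identification rates in percent (diagonal of the confusion matrix).
CORRECT_PERCENT = {"joy": 59.0, "anger": 73.5, "sadness": 90.0}


def summarize_recognition(correct_percent):
    """Return the best and worst recognized emotion and the unweighted mean rate."""
    best = max(correct_percent, key=correct_percent.get)
    worst = min(correct_percent, key=correct_percent.get)
    mean = sum(correct_percent.values()) / len(correct_percent)
    return best, worst, mean


best, worst, mean_correct = summarize_recognition(CORRECT_PERCENT)
```

Under these assumptions sadness comes out as the best recognized emotion and joy as the worst.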
Conclusion

Female voice emotional conversion:
– harmonic speech model with AR parameterization

Spectral modifications:
– spectral envelope: formant shift
– spectral flatness => voicing transition frequency

Prosodic modifications:
– energy, duration, F0 mean, range, linear trend at the end of sentences

Listening tests:
– best synthesized: sadness
– worst synthesized: joy

Next research:
– inclusion of microprosodic features in emotional voice conversion
– modifications of F0 linear trend at the beginning of sentences