Analysis and Synthesis of Shouted Speech
Tuomo Raitio, Jouni Pohjalainen, Manu Airaksinen, Paavo Alku, Antti Suni, Martti Vainio


2 Analysis and Synthesis of Shouted Speech
Tuomo Raitio, Jouni Pohjalainen, Manu Airaksinen, Paavo Alku, Antti Suni, Martti Vainio

3 Shout
Statistical Parametric Speech Synthesis Utilizing Glottal Inverse Filtering Based Vocoding Analysis and Synthesis of Shouted Speech Raitio, Suni, Pohjalainen, Airaksinen, Vainio, Alku
- Shout is the loudest mode of vocal communication.
- It is used to increase the signal-to-noise ratio (SNR) when communicating over interfering noise or over a distance.
- Shouting is also used to express emotions or intentions.

4 Properties of shout
Shout is produced by raising the subglottal pressure and increasing the vocal fold tension. As a result, shout is characterized by:
- Increased sound pressure level (SPL)
- Increased fundamental frequency (f0)
- Increased amplitudes in the mid-frequencies (1–4 kHz)
- Increased duration and energy of vowels
- Decreased duration and energy of consonants
- Less accurate articulation

5 Why perform shout synthesis?
Fortunately, shouting is rarely used, but it is an essential part of human vocal communication. Shout synthesis may be required, e.g., for creating speech with emotional content, and it can be used in human-computer interaction or in creating virtual worlds and characters.

6 In this study…
- Normal and shouted speech were recorded.
- Properties of normal and shouted speech were analyzed.
- Methods for producing natural-sounding HMM-based synthetic shout were investigated.

7 Recording of normal and shouted speech
Normal and shouted speech was recorded in an anechoic chamber:
- 22 Finnish speakers
- 24 sentences each of normal speech and shout from each speaker
- A total of 1056 sentences
- Subjects were asked to use a very loud voice when shouting
In addition, a larger shouting corpus of 100 sentences was recorded from one male and one female speaker for TTS purposes.
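As a quick consistency check on the corpus size, the total of 1056 follows if each of the 22 speakers produced the 24 sentences once in each of the two styles. The variable names below are only illustrative.

```python
# Corpus size check: 22 speakers, 24 sentences, 2 styles (normal and shout).
speakers = 22
sentences_per_style = 24
styles = 2  # normal speech and shout

total_sentences = speakers * sentences_per_style * styles
print(total_sentences)
```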


9 Acoustic analysis of shout
The following acoustic properties were analyzed from the recorded shouted and normal speech:
- Sound pressure level (SPL)
- Duration
- Fundamental frequency (f0)
- Spectrum
- Properties of the voice source: shape of the glottal pulse, H1-H2 parameter, NAQ parameter
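The slides name the two voice source parameters without defining them. The sketch below assumes their commonly used definitions: NAQ as the peak-to-peak flow amplitude divided by the product of the steepest closing slope and the period length, and H1-H2 as the level difference in dB between the first two harmonics of the glottal flow spectrum. The raised-cosine `toy_pulse` is a hypothetical stand-in for an inverse-filtered glottal flow, not data from the study.

```python
import numpy as np

def naq(flow_period, fs):
    """Normalized amplitude quotient of one glottal flow period."""
    f_ac = flow_period.max() - flow_period.min()   # peak-to-peak flow amplitude
    d_peak = abs(np.diff(flow_period).min()) * fs  # steepest closing slope, per second
    period = len(flow_period) / fs                 # period length in seconds
    return f_ac / (d_peak * period)

def h1_h2(flow, fs, f0):
    """Level difference in dB between the first and second harmonic."""
    spectrum = np.abs(np.fft.rfft(flow * np.hanning(len(flow))))
    freqs = np.fft.rfftfreq(len(flow), d=1.0 / fs)
    h1 = spectrum[np.argmin(np.abs(freqs - f0))]
    h2 = spectrum[np.argmin(np.abs(freqs - 2 * f0))]
    return 20 * np.log10(h1 / h2)

def toy_pulse(period_len, open_len):
    """Hypothetical raised-cosine flow pulse followed by a closed phase."""
    pulse = np.zeros(period_len)
    t = np.arange(open_len)
    pulse[:open_len] = 0.5 * (1 - np.cos(2 * np.pi * t / open_len))
    return pulse

fs, f0 = 16000, 100          # assumed sampling rate and pitch for the toy signal
period_len = fs // f0
long_open = toy_pulse(period_len, 96)
short_open = toy_pulse(period_len, 64)

for name, p in [("long open phase", long_open), ("short open phase", short_open)]:
    print(name, round(naq(p, fs), 3), round(h1_h2(np.tile(p, 20), fs, f0), 1))
```

For these toy pulses, shortening the open phase lowers both NAQ and H1-H2, which is the direction usually associated with a more pressed voice source.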

10 Acoustic analysis of shout – Results
On average (speech → shout):
- SPL increased by 21 dB for females and 22 dB for males
- Sentence duration increased by 20% for females and 24% for males
- f0 increased by 71% for females and 152% for males
- The spectrum was emphasized in the 1–4 kHz region
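To put these averages on more familiar scales, a dB difference in SPL can be converted to a sound pressure ratio and a relative f0 increase to semitones. The helper names below are illustrative. The reported values correspond to roughly 11 to 13 times the sound pressure, and to about 9 semitones (females) and 16 semitones (males).

```python
import math

def db_to_pressure_ratio(db):
    """Sound pressure ratio corresponding to an SPL difference in dB."""
    return 10 ** (db / 20)

def percent_increase_to_semitones(percent):
    """Musical interval in semitones for a relative f0 increase in percent."""
    return 12 * math.log2(1 + percent / 100)

for label, spl_db, f0_pct in [("female", 21, 71), ("male", 22, 152)]:
    print(label,
          round(db_to_pressure_ratio(spl_db), 1),
          round(percent_increase_to_semitones(f0_pct), 1))
```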


12 [Figure: panels Overall, Voiced, Unvoiced; columns Female, Male]


14 Problems…
Differences between normal speech and shout are large. This causes problems in many speech processing algorithms:
- Due to the high f0, accurate estimation of the speech spectrum is difficult.
- This is due to the biasing effect of the sparse harmonic structure of the shouted voice source.
- Linear prediction (LP) in particular is prone to this type of bias.

15 Spectrum estimation of shout
- The biasing effect of the harmonics must be reduced.
- For this purpose, weighted linear prediction (WLP), for example, can be used.
- In WLP, the effect of the excitation on the spectrum is reduced.
- This is done by weighting the squared residual with a specific function.

16 LP vs. weighted linear prediction (WLP)
Conventional LP: [equation not preserved]
Weighted LP: [equation not preserved]
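The two equations on this slide did not survive in the transcript. As a reference, the standard textbook forms of the two error criteria are given below. The symbols are assumptions rather than the authors' notation: $x_n$ is the speech sample, $a_k$ the predictor coefficients, $p$ the prediction order, $e_n$ the residual, and $w_n$ the weighting function applied to the squared residual.

```latex
\begin{align}
E_{\mathrm{LP}}  &= \sum_{n} e_n^2
                  = \sum_{n} \Bigl( x_n - \sum_{k=1}^{p} a_k x_{n-k} \Bigr)^2 \\
E_{\mathrm{WLP}} &= \sum_{n} w_n e_n^2
                  = \sum_{n} w_n \Bigl( x_n - \sum_{k=1}^{p} a_k x_{n-k} \Bigr)^2
\end{align}
```

Setting $w_n = 1$ for all $n$ reduces WLP to conventional LP.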


18 Spectrum estimation of shout
The following spectrum estimation methods were compared for normal speech and shout:
1. Conventional linear prediction (LP)
2. WLP with STE weight (STE-WLP)
3. WLP with AME weight (AME-WLP)
STE: short-time energy. AME: attenuation of the main excitation.
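The first two methods can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes autocorrelation-style zero padding of the frame, and an STE weight equal to the energy of the `window` samples preceding each sample (the slides do not state the window length). The AME weight is not sketched here. The test signal is a synthetic second-order autoregressive process with known coefficients.

```python
import numpy as np

def ste_weights(x, order, window):
    """STE weight: w[n] = sum of x[n-i]**2 for i = 1..window."""
    energy = np.asarray(x, dtype=float) ** 2
    # Full convolution gives the running sum ending at n; prepend 0 so it ends at n-1.
    running = np.concatenate([[0.0], np.convolve(energy, np.ones(window))])
    w = np.zeros(len(energy) + order)
    m = min(len(w), len(running))
    w[:m] = running[:m]
    return w + 1e-9  # small floor keeps the normal equations well conditioned

def weighted_lp(x, order, weights=None):
    """Predictor coefficients a_1..a_p minimizing sum_n w[n] * e[n]**2."""
    x = np.asarray(x, dtype=float)
    length = len(x) + order
    # Zero-pad so that every delayed sample x[n-k] is defined.
    padded = np.concatenate([np.zeros(order), x, np.zeros(order)])
    if weights is None:
        weights = np.ones(length)  # uniform weights give conventional LP
    target = padded[order:order + length]
    # Column k-1 holds the signal delayed by k samples.
    delayed = np.column_stack(
        [padded[order - k:order - k + length] for k in range(1, order + 1)]
    )
    weighted = delayed * weights[:, None]
    # Weighted normal equations: (D^T W D) a = D^T W x
    return np.linalg.solve(delayed.T @ weighted, weighted.T @ target)

# Synthetic test signal: x[n] = 1.3 x[n-1] - 0.6 x[n-2] + noise
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
x = np.zeros_like(e)
x[0], x[1] = e[0], e[1]
for n in range(2, len(x)):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + e[n]

a_lp = weighted_lp(x, order=2)
a_wlp = weighted_lp(x, order=2, weights=ste_weights(x, order=2, window=12))
print(a_lp, a_wlp)
```

On a noise-driven signal like this one, both estimators should land close to the true coefficients. The difference the slides describe arises with high-pitched voiced speech, where the STE weight downweights samples near the main excitation.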

19 LP vs. WLP in resynthesis
Subjective listening tests indicate that:
- AME-WLP performs best with normal speech
- STE-WLP performs best with shout
[Figure labels: LP, STE-WLP, AME-WLP]

20 LP vs. WLP in HMM-based speech synthesis
Subjective listening tests indicate that STE-WLP is preferred in the synthesis of shout (by adaptation).
[Figure: panels Female, Male]

21 Synthesis of shout (1)
HMM-based synthesis is a very flexible means of producing different speaking styles, such as shout.
[Diagram labels: Speech data, Training, Statistical model, Text, Synthesis, Synthetic speech]

22 Synthesis of shout (2)
It is difficult to obtain shout data in amounts large enough for constructing a TTS voice.
[Diagram label: Shout data]

23 Synthesis of shout (3)
Statistical adaptation of the normal speech model was used to generate synthetic shouted speech.
[Diagram labels: Speech data, Training, Statistical model, Shout data, Adaptation, Text, Synthesis, Synthetic shout]
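The slides do not say which adaptation algorithm was used. Purely as a generic illustration of how a small amount of shout data can shift a model trained on normal speech, the sketch below applies the standard MAP update to a single Gaussian mean. The function name, the prior weight `tau`, and the toy feature values are all hypothetical.

```python
import numpy as np

def map_adapt_mean(prior_mean, adaptation_frames, tau=10.0):
    """MAP update of a Gaussian mean: (tau * prior + sum of frames) / (tau + N)."""
    frames = np.atleast_2d(np.asarray(adaptation_frames, dtype=float))
    n_frames = frames.shape[0]
    return (tau * np.asarray(prior_mean, dtype=float) + frames.sum(axis=0)) / (tau + n_frames)

# Toy example: a 2-dimensional mean from the normal-speech model and 10 shout frames.
normal_mean = [0.0, 0.0]
shout_frames = [[2.0, 4.0]] * 10

adapted_mean = map_adapt_mean(normal_mean, shout_frames, tau=10.0)
print(adapted_mean)
```

With few adaptation frames the result stays near the normal-speech mean, and with many frames it approaches the mean of the shout data, which is why adaptation suits the limited shout corpus.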

24 Synthesis of shout (4)
Alternatively, the synthetic speech can be converted into shouted speech using a simple voice conversion technique.
[Diagram labels: Speech data, Training, Statistical model, Text, Synthesis, Shout data, Voice conversion, Synthetic shout]

25 Evaluation (1)
The following speech types were selected for the test:
1. Natural normal speech
2. Natural shout
3. Synthetic normal speech
4. Synthetic shout (adapted)
5. Synthetic shout (voice conversion)

26 Evaluation (2)
MOS-style listening test in which the following properties were rated:
1. How would you rate the quality of the speech sample?
2. How much does the sample resemble shouting?
3. How much effort did the speaker use to produce the speech?
- Scale from 1 to 5 with verbal anchors
- Loudness of the speech samples was normalized so that the ratings are based on aspects other than SPL
- 11 test subjects evaluated 50 samples each
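Ratings from a test of this kind are typically summarized as a mean opinion score with a confidence interval. The sketch below assumes a normal-approximation 95% interval. The five ratings are made up for illustration and are not results from the study.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and half-width of a normal-approximation 95% interval."""
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Size of the test described on the slide: 11 listeners, 50 samples each.
total_ratings_per_question = 11 * 50

# Hypothetical ratings on the 1-to-5 scale for one speech type.
example_ratings = [4, 5, 3, 4, 4]
mos, ci = mos_with_ci(example_ratings)
print(total_ratings_per_question, mos, round(ci, 2))
```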

27 Results – Naturalness
Shout synthesis is rated lower in quality than normal speech synthesis (as expected).
[Figure labels: Normal synthesis, Shout synthesis]

28 Results – Impression of shouting
The impression of shouting is, however, fairly well preserved.
[Figure labels: Natural shout, Synthetic shout]

29 Results – Vocal effort
Adaptation produces a better impression of the vocal effort used than the voice conversion method.
[Figure labels: Adapted shout, Voice conversion shout]

30 Summary (1)
Synthesis of shout is challenging for many reasons:
1. It is difficult to obtain large amounts of shout data with consistent quality.
2. Differences between normal speech and shout are large, which causes problems in many speech processing algorithms.
In this work, the biasing effect of high-pitched shout was reduced by using weighted linear prediction (WLP) methods. Subjective listening tests show that WLP models work better with shout than conventional LP.

31 Summary (2)
In this study, synthetic shout was produced with two different techniques:
1. Adaptation
2. Voice conversion of the synthetic normal speech
The methods were rated equal in quality. The impression of shouting and the use of vocal effort were better preserved in the adapted shout.

32 Thank you!
[Samples: Male, Female]
