SIPCom 8-4 Speech Processing, MM7 - Speech Synthesis - Speech Recognition (Part 1 of 3) Børge Lindberg lindberg@kom.aau.dk
Text-to-speech synthesis: Text analysis (lexicon & rules); Prosody generation (pitch & duration, stød); Sound generation (diphone database); Synthetic speech
Why is it so difficult?
Text normalisation: “kl 12-14”, “8-3=5”, “8-4-1997”, “mio”, “USA”
Morphological analysis: “periferien” vs. “skoleferien”, “hul”
Syntactic analysis: “en mand med hul røst dør bag en dør med hul i” (“a man with a hollow voice dies behind a door with a hole in it”)
Semantic analysis: “The man fed her dog biscuits”
Sound generation: transitions, time and pitch scaling
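The text normalisation examples show that the same character, "-", must be read differently depending on the token it appears in. A minimal sketch of that first decision is given here, assuming a handful of illustrative regular expressions. The class names and patterns are hypothetical and not taken from the lecture; a real system would rely on a lexicon and a much larger rule set.

```python
import re

# Hypothetical token classes; the patterns are illustrative only.
# Order matters: the most specific pattern is tried first.
PATTERNS = [
    ("date", re.compile(r"^\d{1,2}-\d{1,2}-\d{4}$")),    # e.g. 8-4-1997
    ("arithmetic", re.compile(r"^\d+[-+*/]\d+=\d+$")),   # e.g. 8-3=5
    ("range", re.compile(r"^\d+-\d+$")),                 # e.g. 12-14 (after "kl")
    ("acronym", re.compile(r"^[A-ZÆØÅ]{2,}$")),          # e.g. USA
]


def classify(token):
    """Return the first matching class label, or "word" if none matches."""
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "word"


for tok in ["12-14", "8-3=5", "8-4-1997", "mio", "USA"]:
    print(tok, "->", classify(tok))
```

Note that "mio" falls through to "word": abbreviations of this kind cannot be recognised by shape alone and need a lexicon entry.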
Concatenative synthesis: “test” = /tEsd/ = /#t/ + /tE/ + /Es/ + /sd/ + /d#/
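The decomposition of /tEsd/ into diphones can be sketched as follows. This is a minimal illustration assuming one character per phoneme and "#" as the silence/boundary symbol, as on the slide; the function name `to_diphones` is hypothetical.

```python
def to_diphones(phonemes, boundary="#"):
    """Split a phoneme sequence into overlapping pairs (diphones).

    The sequence is padded with a boundary symbol on both sides, so
    N phonemes yield N + 1 diphone units.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [a + b for a, b in zip(padded, padded[1:])]


units = to_diphones(["t", "E", "s", "d"])
print(" + ".join(f"/{u}/" for u in units))
# /#t/ + /tE/ + /Es/ + /sd/ + /d#/
```

Each unit spans the transition from one phoneme to the next, which is where coarticulation is strongest, so the joins between units fall in the comparatively stable part of each phoneme.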
Di-(tri)phone database: database of a male speaker; approx. 2600 subword units (di- & triphones); requires pitch, diphone and triphone segmentation
Input to the sound generator
Effect of scaling: no scaling; time scaled + pitch scaled + energy + stød
More examples: Normal (aalb.wav); High speaking rate, normal pitch (fast.wav); Low speaking rate, normal pitch (slow.wav); Normal speaking rate, high pitch (light.wav); Normal speaking rate, low pitch (dark.wav)
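The examples above vary speaking rate and pitch independently of each other. A minimal sketch of that idea at the level of prosodic targets follows. The `Target` structure, the unit durations, the F0 values and the scaling factors are all hypothetical; the slides do not state the actual values or the signal-level method used.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical prosodic target for one subword unit."""
    unit: str
    duration_ms: float
    f0_hz: float


def scale(targets, rate=1.0, pitch=1.0):
    """Return new targets; rate > 1 shortens durations, pitch > 1 raises F0."""
    return [Target(t.unit, t.duration_ms / rate, t.f0_hz * pitch) for t in targets]


# Made-up baseline values, for illustration only.
base = [Target("tE", 60.0, 110.0), Target("Es", 90.0, 120.0)]

fast = scale(base, rate=1.5)     # higher speaking rate, unchanged pitch
light = scale(base, pitch=1.25)  # unchanged speaking rate, higher pitch

print(fast)
print(light)
```

Keeping the two factors separate is the point of the sketch: simply playing a recording faster would raise the pitch as well, which is why the sound generator needs dedicated time and pitch scaling.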
Evaluation - intelligibility: 32 test persons; 156 stimuli in the carrier sentence “Det er <keyword>, de siger” (“It is <keyword>, they say”)
Evaluation - naturalness: 32 test persons; 155 stimuli