Ways to generate computer speech
Record a human speaking every sentence HAL will ever speak (not likely)
Make a mathematical model of the human vocal tract (synthesis)
Record a human speaking a lot of sentences, and come up with some way of making new sentences out of the recorded ones (concatenation)
What goes into synthesizing speech?
Have some idea of what human speech actually looks/sounds like
  –Modeling the shape of a speaker’s mouth
  –Fricative noises and noises from stops
  –Pitch changes
Produce sounds that resemble speech sounds
Synthesis: Putting it all together
[Audio examples from the original slides, not included here]
Shape of mouth: 1, 2, 3, all 3
Fricative and burst noises
Shape of mouth and fricative noises
Shape of mouth, fricative noises, & pitch
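The three ingredients listed on these slides (mouth shape, fricative noise, pitch) can be sketched in a few lines of code. This is only a minimal illustration, not the method used to make the slide examples: mouth shape is approximated by resonant filters, pitch by a train of pulses, and fricative noise by filtered random numbers. The sample rate, the formant frequencies and bandwidths, and all function names are assumptions chosen for the example.

```python
import math
import random

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)

def resonator(signal, freq_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole filter that boosts energy near freq_hz, standing in for one
    resonance of the mouth shape."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1 so stable
    theta = 2.0 * math.pi * freq_hz / sample_rate        # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # the two previous output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def pulse_train(pitch_hz, n_samples, sample_rate=SAMPLE_RATE):
    """One unit pulse per pitch period: a crude model of vocal fold vibration."""
    period = round(sample_rate / pitch_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def noise(n_samples, seed=0):
    """Uniform random samples: the raw material for fricatives and bursts."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def vowel(pitch_hz, formants, n_samples):
    """Pitch pulses shaped by a cascade of resonators (mouth shape + pitch)."""
    signal = pulse_train(pitch_hz, n_samples)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    return signal

def fricative(center_hz, bandwidth_hz, n_samples):
    """Noise shaped by a single resonator (fricative noise)."""
    return resonator(noise(n_samples), center_hz, bandwidth_hz)

# Rough, illustrative resonance values for an "ah"-like vowel (assumed).
AH_FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2500.0, 120.0)]

ah = vowel(pitch_hz=100.0, formants=AH_FORMANTS, n_samples=1600)
hiss = fricative(center_hz=5000.0, bandwidth_hz=1500.0, n_samples=1600)
```

Changing `pitch_hz` over time would give the pitch changes mentioned above. To listen to the result, the samples could be scaled to 16-bit integers and written out with Python's standard `wave` module.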
Speech synthesis (1980): [audio example]
The Speak & Spell toy used a synthesis process called Linear Predictive Coding (LPC).
Basically, LPC is a way for a computer to extract the different components of a speech signal and re-create them using a mathematical model of the vocal tract.
Here’s a better example of LPC (1982): [audio example]
LPC-based coding is still used today in GSM phone systems.
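To make the LPC idea concrete, here is a minimal sketch of the analysis and re-creation steps. It assumes the common autocorrelation formulation solved with the Levinson-Durbin recursion; the slides do not say which variant the Speak & Spell or GSM use, and the function names here are made up for the example. Each sample is predicted as a weighted sum of the previous few samples, and those weights play the role of the vocal tract model.

```python
def autocorrelation(signal, max_lag):
    """Return r[0..max_lag], where r[k] = sum over n of signal[n] * signal[n + k]."""
    n = len(signal)
    return [sum(signal[i] * signal[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def lpc_coefficients(signal, order):
    """Estimate predictor weights a[1..order] with the Levinson-Durbin recursion.

    Model: s[n] is approximately sum over k of a[k] * s[n - k].
    Returns (weights, residual_energy).
    """
    r = autocorrelation(signal, order)
    if r[0] == 0.0:
        raise ValueError("signal has no energy")
    a = [0.0] * (order + 1)  # a[0] is unused so indices match the math
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for model order i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error

def lpc_synthesize(weights, excitation):
    """Re-create a signal by driving the all-pole model with an excitation."""
    out = []
    for n, e in enumerate(excitation):
        predicted = sum(w * out[n - k]
                        for k, w in enumerate(weights, start=1) if n - k >= 0)
        out.append(e + predicted)
    return out

# Demo on a synthetic "vocal tract" with known weights (not real speech).
true_weights = [1.3, -0.8]
impulse = [1.0] + [0.0] * 399
original = lpc_synthesize(true_weights, impulse)

estimated, residual_energy = lpc_coefficients(original, order=2)
rebuilt = lpc_synthesize(estimated, impulse)
print(estimated)
```

For real speech the analysis would be repeated on short frames (a few tens of milliseconds each), and the excitation would be a pulse train for voiced sounds or noise for unvoiced ones, rather than a single impulse.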
Text-to-Speech (TTS) systems
Concatenative synthesis
  –Record natural speech
  –Chop speech up into units
  –Recombine units according to the phonetic transcription to be pronounced
Steps for a TTS system:
  –Start w/ written text
  –Convert text to phonetic characters
  –Find segments of speech in database
  –Calculate intonation of sentence
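The TTS steps listed above can be sketched as a toy pipeline. Everything here is hypothetical and heavily simplified: the two-word lexicon, the phoneme symbols, the "database" whose units are short lists of dummy numbers standing in for recorded audio, and the straight-line falling pitch used for intonation. Real systems typically use larger or variable-sized units and far more elaborate intonation models.

```python
# Hypothetical pronunciation lexicon: word -> phonetic symbols (illustrative only)
LEXICON = {
    "the": ["DH", "AH"],
    "sun": ["S", "AH", "N"],
}

# Hypothetical unit database: phonetic symbol -> recorded samples
# (dummy numbers stand in for real audio)
UNIT_DB = {
    "DH": [0.1, 0.2],
    "AH": [0.5, 0.6, 0.5],
    "S": [0.05, -0.05],
    "N": [0.3, 0.2],
}

def text_to_phonemes(text):
    """Steps 1-2: written text -> phonetic characters, via dictionary lookup."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(LEXICON[word])
    return phonemes

def select_units(phonemes):
    """Step 3: find a recorded segment in the database for each symbol."""
    return [UNIT_DB[p] for p in phonemes]

def pitch_contour(n_units, start_hz=120.0, end_hz=90.0):
    """Step 4: one pitch target per unit, falling linearly across the sentence."""
    if n_units == 0:
        return []
    if n_units == 1:
        return [start_hz]
    step = (end_hz - start_hz) / (n_units - 1)
    return [start_hz + i * step for i in range(n_units)]

def speak(text):
    """Run all the steps and concatenate the selected units into one signal."""
    phonemes = text_to_phonemes(text)
    units = select_units(phonemes)
    contour = pitch_contour(len(units))
    samples = [s for unit in units for s in unit]
    return phonemes, contour, samples

phonemes, contour, samples = speak("The sun.")
```

This sketch only computes the pitch targets; it does not apply them. A working system would also modify each unit's pitch and duration to match the contour and smooth the joins between units.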
Text-to-Speech (TTS) systems
Examples of text from The North Wind and the Sun (Aesop), circa 2005:
  –Mike (AT&T)
  –Crystal (AT&T)
  –British English (Rhetorical Systems)
  –Scottish English (Rhetorical Systems)