Speech Synthesis December 4, 2014

Gentle Reminders
Final exam: Friday, December 12th, 3:30 – 5:30 pm. In this room!
Final exam review: Wednesday, December 10th, 11 am. Place to be determined!
Final course project report is due: Thursday, December 18th at 5 pm!
I will be posting my notes on audition later for your education/edification. The palatography pix will be posted, too!
I’ll be around tomorrow (EDC 259), if you’d like to pick up your remaining homeworks.

Speech Synthesis: A Basic Overview
Speech synthesis is the generation of speech by machine. The reasons for studying synthetic speech have evolved over the years:
1. Novelty
2. To control acoustic cues in perceptual studies
3. To understand the human articulatory system (“analysis by synthesis”)
4. Practical applications: reading machines for the blind, navigation systems

Speech Synthesis: A Basic Overview
There are four basic types of synthetic speech:
1. Mechanical synthesis
2. Formant synthesis, based on source/filter theory
3. Concatenative synthesis: stringing bits and pieces of natural speech together
4. Articulatory synthesis: generating speech from a model of the vocal tract

1. Mechanical Synthesis
The very first attempts to produce synthetic speech were made without electricity: mechanical synthesis. In the late 1700s, models were produced which used:
reeds as a voicing source
differently shaped tubes for different vowels

Mechanical Synthesis, part II
Later, Wolfgang von Kempelen and Charles Wheatstone created a more sophisticated mechanical speech device… with independently manipulable source and filter mechanisms.

Mechanical Synthesis, part III
An interesting historical footnote: Alexander Graham Bell and his “questionable” experiments with his dog. Mechanical synthesis has largely fallen out of style since then… but check out Mike Brady’s talking robot.

The Voder
The next big step in speech synthesis was to generate speech electronically. This was most famously demonstrated at the New York World’s Fair in 1939 with the Voder. The Voder was a manually controlled speech synthesizer (operated by highly trained young women).

Voder Principles
The Voder basically operated like a vocoder. Voicing and fricative source sounds were filtered by 10 different resonators… each controlled by an individual finger! Only about 1 in 10 had the ability to learn how to play the Voder. Compare with Daft Punk:

The Pattern Playback
Shortly after the invention of the spectrograph, the Pattern Playback was developed: basically a reverse spectrograph. The idea at this point was still to use speech synthesis to determine the best cues for particular sounds.

2. Formant Synthesis
The next synthesizer was PAT (Parametric Artificial Talker), a parallel formant synthesizer. Idea: three formants are good enough for intelligible speech. Subtitles: What did you say before that? Tea or coffee? What have you done with it?

PAT Spectrogram

2. Formant Synthesis, part II
Another formant synthesizer was OVE, built by the Swedish phonetician Gunnar Fant. OVE was a cascade formant synthesizer. In the ’50s and ’60s, people debated whether parallel or cascade synthesis was better. Weeks and weeks of tuning each system could get much better results:

Synthesis by Rule
The ultimate goal was to get machines to generate speech automatically, without any manual intervention: synthesis by rule. A first attempt, on the Pattern Playback: “I painted this by rule without looking at a spectrogram. Can you understand it?” Later, from 1961, on a cascade synthesizer: note the first use of a computer to calculate rules for synthetic speech. Compare with the HAL 9000:

Parallel vs. Cascade
The rivalry between the parallel and cascade camps continued into the ’70s. Cascade synthesizers were good at producing vowels and required fewer control parameters… but were bad with nasals, stops, and fricatives. Parallel synthesizers were better with nasals and fricatives, but not as good with vowels. Dennis Klatt proposed a synthesis (sorry) and combined the two…
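To make the parallel/cascade distinction concrete, here is a minimal formant-synthesis sketch in Python (assuming numpy and scipy; the formant frequencies, bandwidths, and impulse-train voicing source are illustrative placeholders, not Klatt’s actual parameter values):

```python
import numpy as np
from scipy.signal import lfilter

fs = 10000  # sampling rate in Hz

def resonator(f, bw, fs):
    """Second-order digital resonator centered at f Hz with bandwidth bw Hz:
    the basic building block of a formant synthesizer."""
    r = np.exp(-np.pi * bw / fs)            # pole radius from the bandwidth
    theta = 2 * np.pi * f / fs              # pole angle from the center frequency
    a = np.array([1.0, -2 * r * np.cos(theta), r ** 2])
    b = np.array([a.sum()])                 # normalize for unity gain at DC
    return b, a

# A very crude voicing source: an impulse train at F0 = 100 Hz
source = np.zeros(fs)                       # one second of signal
source[::fs // 100] = 1.0

formants = [(500, 60), (1500, 90), (2500, 120)]   # rough F1–F3 for a schwa-like vowel

# Cascade: the source passes through the resonators in series
cascade = source.copy()
for f, bw in formants:
    b, a = resonator(f, bw, fs)
    cascade = lfilter(b, a, cascade)

# Parallel: each resonator filters the source separately; outputs are summed
amps = [1.0, 0.6, 0.3]                      # per-formant amplitude controls
parallel = sum(amp * lfilter(*resonator(f, bw, fs), source)
               for (f, bw), amp in zip(formants, amps))
```

Note how the trade-off from the slide shows up in the code: the cascade version needs no per-formant amplitudes (the series connection sets the relative levels automatically), which is why it uses fewer control parameters, while the parallel version exposes an amplitude control for each resonator.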

KlattTalk
KlattTalk has since become the standard for formant synthesis (DECTalk).

KlattVoice
Dennis Klatt also made significant improvements to the artificial voice source waveform. Perfect Paul: Beautiful Betty: Female voices have remained problematic. Also note the lack of jitter and shimmer.

LPC Synthesis
Another method of formant synthesis, developed in the ’70s, is known as Linear Predictive Coding (LPC). Here’s an example: To recapitulate (my) childhood: As a general rule, LPC synthesis is pretty lousy, but it’s cheap! LPC synthesis greatly reduces the amount of information in speech…

Filters + LPC
One way to understand LPC analysis is to think about a moving average filter. A moving average filter reduces noise in a signal by making each point equal to the average of the points surrounding it:

y_n = (x_{n-2} + x_{n-1} + x_n + x_{n+1} + x_{n+2}) / 5
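A minimal sketch of that smoothing operation, assuming numpy (the five-point window and the demo signal are just for illustration):

```python
import numpy as np

def moving_average(x, width=5):
    """Replace each point with the mean of the `width` points centered on it."""
    kernel = np.ones(width) / width            # equal weights summing to 1
    return np.convolve(x, kernel, mode='same')

# Demo: noise riding on a slow sine wave is visibly reduced after smoothing
t = np.linspace(0, 1, 500)
noisy = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(500)
smoothed = moving_average(noisy)
```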

Filters + LPC
Another way to write the smoothing equation is:

y_n = 0.2 x_{n-2} + 0.2 x_{n-1} + 0.2 x_n + 0.2 x_{n+1} + 0.2 x_{n+2}

Note that we could weight the different parts of the equation differently. For example:

y_n = 0.1 x_{n-2} + 0.2 x_{n-1} + 0.4 x_n + 0.2 x_{n+1} + 0.1 x_{n+2}

Another trick: try to predict future points in the waveform on the basis of only previous points. The objective is to find the combination of weights that predicts future points as accurately as possible.
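The “find the best weights” step is an ordinary least-squares problem. A sketch, assuming numpy (the predictor order and the demo signal are arbitrary choices):

```python
import numpy as np

def prediction_weights(x, p=4):
    """Least-squares weights that best predict each sample of x
    from the p samples immediately before it (linear prediction)."""
    # Row i of X holds the p past samples used to predict y[i] = x[p + i]
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    y = x[p:]
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

# Demo: a pure sinusoid is almost perfectly predictable from its recent past
x = np.sin(2 * np.pi * np.arange(200) / 20)
w = prediction_weights(x)
```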

Deriving the Filter
Let’s say that minimizing the prediction errors for a certain waveform yields the following equation:

y_n = 0.5 x_n - 0.3 x_{n-1} + 0.2 x_{n-2} - 0.1 x_{n-3}

The weights in this equation define a filter. Example: how would the values of y change if the input was a transient, with x = 1 at time n and x = 0 at all other times? Graph y at times n to n+3.
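Working the exercise through: with an impulse as input, each term of the equation “lights up” in turn, so y just reads out the weights — y = 0.5, -0.3, 0.2, -0.1 at times n to n+3, and 0 afterwards. A quick check in Python (assuming scipy; the filter taps are the slide’s example weights):

```python
import numpy as np
from scipy.signal import lfilter

b = [0.5, -0.3, 0.2, -0.1]   # the example filter’s weights
impulse = np.zeros(8)
impulse[0] = 1.0             # x = 1 at time n, 0 at all other times

y = lfilter(b, [1.0], impulse)
print(y[:4])                 # [ 0.5 -0.3  0.2 -0.1]: the weights themselves
```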

Decomposing the Filter
Putting a transient into the weighted filter equation yields a new waveform, which reflects the weights of the equation. We can apply Fourier analysis to this new waveform to determine its spectral characteristics.
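Since the impulse response of this kind of weighted filter is just its list of weights, the Fourier analysis only has to be applied to those few numbers. A sketch, assuming numpy (zero-padding the FFT is what makes the resulting curve look smooth):

```python
import numpy as np

b = np.array([0.5, -0.3, 0.2, -0.1])   # filter weights = impulse response
H = np.fft.rfft(b, n=512)              # zero-padded FFT of the weights
freqs = np.fft.rfftfreq(512)           # normalized frequency, 0 to 0.5
magnitude = np.abs(H)                  # the filter’s (smooth) spectrum
```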

LPC Spectrum
When we perform a Fourier analysis on this waveform, we get a very smooth-looking spectrum. This function is a good representation of what the vocal tract filter looks like. [Figure: LPC spectrum overlaid on the original spectrum]

LPC Applications
Remember: the LPC spectrum is derived from the weights of a linear predictive equation. One thing we can do with the LPC-derived spectrum is estimate the formant frequencies of a filter. (This is how Praat does it.) Note: the more weights in the original equation, the more formants are assumed to be in the signal. We can also use the LPC-derived filter, in conjunction with a voice source, to create synthetic speech. (As in the Speak & Spell.)
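A sketch of formant estimation by LPC, assuming numpy and scipy (this is the textbook autocorrelation method, not necessarily Praat’s exact implementation; a real formant tracker would also discard poles with very wide bandwidths):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame, fs, order=12):
    """Estimate formant frequencies (Hz) from one speech frame via LPC."""
    x = frame * np.hamming(len(frame))               # window the frame
    r = np.correlate(x, x, mode='full')[len(x) - 1:] # autocorrelation
    # Solve the Toeplitz normal equations R a = r for the predictor weights
    a = solve_toeplitz(r[:order], r[1:order + 1])
    # Formants are the angles of the roots of A(z) = 1 - a1 z^-1 - ... - ap z^-p
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]                # one root per conjugate pair
    return np.sort(np.angle(roots) * fs / (2 * np.pi))
```

The slide’s point shows up directly in the code: an order-12 predictor yields at most six complex-conjugate pole pairs, i.e., at most six candidate formants.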

3. Concatenative Synthesis
Formant synthesis dominated the synthetic speech world up until the ’90s… then concatenative synthesis started taking over. Basic idea: string together recorded samples of natural speech. The most common option is “diphone” synthesis: concatenated bits stretch from the middle of one phoneme to the middle of the next. Note: the inventory has to include all possible phoneme sequences, which is only feasible with lots of computer memory.
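A toy illustration of the “string samples together” idea, assuming numpy (real diphone systems cut units mid-phoneme and also smooth pitch and energy across the joins, e.g. with PSOLA; this sketch only crossfades amplitudes):

```python
import numpy as np

def concatenate_units(units, fs, overlap_ms=10):
    """Join recorded speech units with a short linear crossfade at each boundary."""
    n = int(fs * overlap_ms / 1000)
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = 1.0 - fade_out
    out = units[0]
    for u in units[1:]:
        joined = out[-n:] * fade_out + u[:n] * fade_in
        out = np.concatenate([out[:-n], joined, u[n:]])
    return out

# Toy demo with two synthetic “units” standing in for recorded diphones
fs = 16000
t = np.arange(fs // 10) / fs
out = concatenate_units([np.sin(2*np.pi*200*t), np.sin(2*np.pi*300*t)], fs)
```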

Concatenated Samples
Concatenative synthesis tends to sound more natural than formant synthesis (basically because of better voice quality). An early (1977) combination of LPC + diphone synthesis: LPC + demisyllable-sized chunks (1980): More recent efforts with the MBROLA synthesizer: Also check out the MacinTalk Pro synthesizer!

Recent Developments
Contemporary concatenative speech synthesizers use variable unit selection. Idea: record a huge database of speech… and play back the largest unit of speech you can, whenever you can. Interesting development #2: synthetic voices tailored to particular speakers. Check it out:

4. Articulatory Synthesis
Last but not least, there is articulatory synthesis: the generation of acoustic signals on the basis of models of the vocal tract. This is the most complicated of all synthesis paradigms. (We don’t understand articulation all that well.) Some early attempts: Paul Boersma built his own articulatory synthesizer… and incorporated it into Praat.

Synthetic Speech Perception
In the early days, speech scientists thought that synthetic speech would lead to a form of “super speech”: ideal speech, without any of the extraneous noise of natural productions. However, natural speech is always more intelligible than synthetic speech. And more natural sounding! But perceptual learning is possible. It requires lots and lots of practice, and lots of variability (words, phonemes, contexts). An extreme example: blind listeners.

More Perceptual Findings
1. Reducing the number of possible messages dramatically increases intelligibility.

More Perceptual Findings
2. Formant synthesis produces better vowels; concatenative synthesis produces better consonants (and transitions).
3. Synthetic speech perception uses up more mental resources (e.g., memory and recall of number lists).
4. Synthetic speech perception is a lot easier for native speakers of a language, and also for adults.
5. Older listeners prefer slower rates of speech.

Audio-Visual Speech Synthesis
The synthesis of audio-visual speech has primarily been spearheaded by Dominic Massaro at UC Santa Cruz (“Baldi”). Basic findings: synthetic visuals can induce the McGurk effect; synthetic visuals improve perception of speech in noise… but not as well as natural visuals. Check out some samples.

Further Reading
In case you’re curious: mst/contents.html