Published by Leslie Conley. Modified over 9 years ago.
1
CSC2535 2013, Lecture 7: Recognizing speech. Geoffrey Hinton
2
Why speech recognition works better with a good language model
We cannot identify phonemes perfectly in noisy speech.
– The acoustic input is often ambiguous: there are several different words that fit the acoustic signal equally well.
People use their understanding of the meaning of the utterance to hear the right word.
– We do this unconsciously and we are very good at it.
– It can lead to bizarre errors: given the right context, we hear “Wreck a nice beach” as “Recognize speech”.
Speech recognizers have to know which words are likely to come next and which are not.
– This can be done quite well without full understanding.
3
A “standard” speech recognition system
Step 1: Convert the soundwave into a sequence of frames of Mel Frequency Cepstral Coefficients (MFCCs).
– Use Fourier analysis on ~25 ms windows.
– Smooth the spectrum of frequencies to capture the resonances of the vocal tract.
– Use wider frequency bins for higher frequencies (but uniform width up to 1000 Hz).
– Advance the window by ~10 ms.
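The windowing in Step 1 can be sketched in a few lines of NumPy. This is only an illustration, not a full MFCC pipeline: it stops before the filterbank smoothing and the cepstral transform. The function names, the Hamming window, and the particular mel formula (2595 log10(1 + f/700), one common convention that is only approximately linear below 1000 Hz) are assumptions, not taken from the slides.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Cut a 1-D signal into overlapping frames (~25 ms wide, ~10 ms apart)."""
    signal = np.asarray(signal, dtype=float)
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < win:
        raise ValueError("signal is shorter than one window")
    n_frames = 1 + (len(signal) - win) // hop
    # Row t holds samples [t*hop, t*hop + win).
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def power_spectrum(frames):
    """Fourier analysis of each frame (Hamming window is an assumed choice)."""
    windowed = frames * np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(windowed, axis=1)) ** 2

def mel_bin_edges(n_bins, f_max_hz):
    """Bin edges equally spaced on an assumed mel scale, so bins widen with frequency."""
    mel_max = 2595.0 * np.log10(1.0 + f_max_hz / 700.0)
    mels = np.linspace(0.0, mel_max, n_bins + 1)
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
```

With 16 kHz audio the defaults give 400-sample windows advanced by 160 samples, so one second of speech yields 98 frames.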
4
A “standard” speech recognition system
Step 2: Model each frame of coefficients by using a mixture of Gaussians.
– Affine transform all frames to deal with obvious, shared covariance structure. Make the affine transform depend on the speaker to eliminate some of the inter-speaker variation.
– To cope with the fact that these Gaussians cannot model the strong temporal covariances, enhance the data with temporal differences and differences of differences. This allows local temporal covariances to be modelled as the variances of the differences.
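A minimal sketch of the "differences and differences of differences" idea, assuming simple central differences along the time axis. Real front ends typically fit a regression over several neighbouring frames instead; the function name `add_deltas` is hypothetical.

```python
import numpy as np

def add_deltas(features):
    """Append temporal differences and differences of differences.

    features: array of shape (n_frames, n_coeffs), with n_frames >= 2.
    Returns an array of shape (n_frames, 3 * n_coeffs).
    """
    features = np.asarray(features, dtype=float)
    # np.gradient uses central differences inside and one-sided ones at the ends.
    delta = np.gradient(features, axis=0)
    delta_delta = np.gradient(delta, axis=0)
    return np.concatenate([features, delta, delta_delta], axis=1)
```

Starting from 13 coefficients per frame, this yields 39 values per frame.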
5
A “standard” speech recognition system
Step 3: Cope with the alignment problem by using a Hidden Markov Model.
– Each hidden state of the HMM has its own mixture of Gaussians model.
– These state-specific MoG models may use state-specific mixing proportions for a set of Gaussians that are shared by all of the hidden states. This greatly reduces the number of parameters.
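The shared-Gaussian arrangement can be sketched as follows, assuming diagonal covariances (an assumption made here for brevity). All names are hypothetical. Every state reuses the same K Gaussians and only owns a row of K mixing proportions, so adding a state costs K numbers rather than a whole new set of means and variances.

```python
import numpy as np

def log_gaussians(x, means, variances):
    """Log density of x under each of K diagonal Gaussians.

    x: (D,), means: (K, D), variances: (K, D). Returns shape (K,).
    """
    diff = x - means
    return -0.5 * np.sum(np.log(2.0 * np.pi * variances) + diff ** 2 / variances, axis=1)

def tied_state_likelihoods(x, means, variances, mixing):
    """p(x | state) for every state, using one shared pool of Gaussians.

    mixing: (S, K), each row holds one state's mixing proportions (sums to 1).
    Returns shape (S,).
    """
    component_density = np.exp(log_gaussians(x, means, variances))  # (K,)
    return mixing @ component_density
```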
6
How HMMs solve the alignment problem
HMMs have a very weak generative model:
– Each frame is generated by a single hidden state.
– So the full posterior over states only has as many probabilities as the number of states.
– This makes it easy to search all possible alignments using the forward-backward algorithm (a form of dynamic programming).
Since exact inference is tractable in an HMM, we can use EM to learn the parameters of the transition matrix and of the Gaussians.
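A compact sketch of the forward-backward recursion, using per-frame rescaling to avoid underflow. The argument names and layout are assumptions: `pi` is the initial state distribution, `A[i, j]` is the probability of moving from state i to state j, and `B[t, s]` is the likelihood of frame t under state s (for example, from the mixture of Gaussians).

```python
import numpy as np

def forward_backward(pi, A, B):
    """Posterior over hidden states for every frame, summed over all alignments.

    pi: (S,), A: (S, S), B: (T, S).
    Returns (gamma, log_likelihood), where gamma[t, s] = p(state_t = s | all frames).
    """
    T, S = B.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    scale = np.zeros(T)

    # Forward pass: alpha[t] is p(state_t | frames up to t) after rescaling.
    alpha[0] = pi * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass, rescaled with the same per-frame constants.
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta
    return gamma, np.log(scale).sum()
```

Each row of `gamma` has only S entries, which is the point made above: the per-frame posterior is small, so the cost is O(T·S²) rather than exponential in T. These posteriors are the quantities the E-step of EM needs.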
7
How to make progress in speech recognition
Stop using MFCCs to model the soundwave.
– They throw away a lot of detailed information. This is good if you have a small, slow computer.
Stop using mixtures of Gaussians to model the acoustics.
– A mixture of Gaussians is an exponentially inefficient model.
Stop using HMMs to model sequential structure.
– HMMs are exponentially inefficient generative models. They require 2^N hidden states to carry N bits of constraint during generation.
8
An early use of neural nets
Use a feedforward neural net to convert MFCCs into a posterior probability distribution over hidden states of the HMM.
– To train this net we need to know the “correct” state of the HMM, so we need to bootstrap from an existing ASR system.
– After training the neural net, we need to convert p(state|data) into p(data|state) in order to train an HMM.
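The conversion in the last bullet follows from Bayes' rule: p(data|state) = p(state|data) p(data) / p(state). Since p(data) is the same for every state within a frame, dividing the posterior by the state prior gives a likelihood that is correct up to a per-frame constant. A minimal sketch, assuming the priors are estimated by counting states in the bootstrap alignment (function names are hypothetical):

```python
import numpy as np

def state_priors_from_alignment(state_labels, n_states):
    """Estimate p(state) as the relative frequency of each state in an alignment."""
    counts = np.bincount(state_labels, minlength=n_states)
    return counts / counts.sum()

def scaled_likelihoods(posteriors, priors, eps=1e-10):
    """Turn p(state | frame) into a quantity proportional to p(frame | state).

    posteriors: (T, S) network outputs, priors: (S,).
    The eps floor guards against states that never occur in the alignment.
    """
    return posteriors / np.maximum(priors, eps)
```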
9
A neat application of deep learning
A very deep belief net beats the record at phone recognition on the very well-studied TIMIT database.
The task:
– Each of the 61 phones is modeled by its own little 3-state mono-phone HMM.
– The neural net is trained to predict the probabilities of 183 context-dependent phone labels (61 phones × 3 states) for the central frame of a short window of speech.
The training procedure:
– Train lots of big layers, one at a time, without using the labels.
– Add a 183-way softmax of context-specific phone labels.
– Fine-tune with backprop on a big GPU board for several days.
10
One very deep belief net for phone recognition
Figure (network diagram, Mohamed, Dahl & Hinton (2011)): input of 11 frames of 39 MFCCs; layers of 2000 binary hidden units; a layer of 128 units; output of 183 labels. The diagram carries the annotation “not pre-trained”.
The Mel Cepstrum Coefficients are a standard representation for speech.
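The input and output ends of such a net can be sketched without committing to the hidden layers. The code below stacks 11 frames of 39 coefficients into one 429-dimensional input per central frame and applies a softmax over 183 labels. The symmetric 5 + 5 context and the edge padding at utterance boundaries are assumptions, and the function names are hypothetical.

```python
import numpy as np

def stack_context(features, left=5, right=5):
    """Build one input vector per frame from a window centred on that frame.

    features: (T, D). Returns (T, (left + right + 1) * D).
    The first and last frames are repeated to pad the edges (assumed choice).
    """
    T = features.shape[0]
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    # Slice i holds, for every frame t, the frame at offset (i - left) from t.
    windows = [padded[i:i + T] for i in range(left + right + 1)]
    return np.concatenate(windows, axis=1)

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)
```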
11
A neat application of deep learning
After the standard post-processing using a bi-phone “language” model, this net gets a 23.0% phone error rate.
– This was a record for speaker-independent phone recognition.
Throw out the MFCCs and use filterbank coefficients plus their temporal deltas and delta-deltas.
– With a deeper net (8 layers) this gets down to 20.7%.
For TIMIT, the classification task (i.e. classify each phone when you are given the phone boundaries) is a bit easier than the recognition task.
– On TIMIT, deep networks are the best at classification too (Honglak Lee).
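For reference, phone error rate is conventionally the edit distance (substitutions + deletions + insertions) between the recognized and reference phone sequences, divided by the reference length. A minimal sketch of that conventional definition follows. The exact TIMIT scoring protocol behind the numbers above is not described in these slides, so this is illustrative only, and the function name is hypothetical.

```python
def phone_error_rate(reference, hypothesis):
    """Levenshtein distance between two phone sequences, divided by len(reference)."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must not be empty")
    # prev[j] = edit distance between the first i-1 reference phones
    # and the first j hypothesis phones.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            substitute = prev[j - 1] + (reference[i - 1] != hypothesis[j - 1])
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[m] / n
```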
12
Getting rid of the HMM
This is going to be more difficult because HMMs are a convenient way to deal with alignment.
– Their generative weakness is a strength.
The big hope is a recurrent neural net.
– Initially this could be trained to predict the next frame. This provides lots of constraint and there is lots of data.
– Then it can be trained to predict both the next frame and the next phone label.
– But we don’t really know how to deal with alignment when using an RNN.
13
THE END