Speech Recognition with Hidden Markov Models Winter 2011

CS 552/652 Speech Recognition with Hidden Markov Models, Winter 2011
Oregon Health & Science University, Center for Spoken Language Understanding
John-Paul Hosom
Lecture 17½: Speaker Adaptation
Notes based on:
Huang, Acero, and Hon (2001), “Spoken Language Processing”, section 9.6
Lee and Gauvain (1993), “Speaker Adaptation Based on MAP Estimation of HMM Parameters”, ICASSP 93
Woodland (2001), “Speaker Adaptation for Continuous Density HMMs: A Review”
Gauvain and Lee (1994), “Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains”
Renals (2008), speaker adaptation lecture notes
Lee and Rose (1996), “Speaker Normalization Using Efficient Frequency Warping Procedures”
Panchapagesan and Alwan (2008), “Frequency Warping for VTLN and Speaker Adaptation by Linear Transformation of Standard MFCC”

Speaker Adaptation
Given an HMM that has been trained on a large number of people (a speaker-independent HMM), we can try to improve performance by adapting to the speaker currently being recognized in testing.
Two basic types of speaker adaptation:
1. Adaptation of the feature space (speaker normalization)
   Vocal-Tract Length Normalization (VTLN) = warp the feature space to better fit the model parameters
2. Adaptation of the model parameters
   Maximum A Posteriori (MAP) adaptation = retrain individual state parameters
   Maximum Likelihood Linear Regression (MLLR) = “warp” model parameters to better fit adaptation data

Speaker Normalization
A common technique is Vocal Tract Length Normalization (VTLN).
Assumption: the majority of speaker differences in the acoustic space are caused by different vocal tract lengths.
Different vocal tract lengths can be normalized using a non-linear frequency warping (like the Mel scale, but on a speaker-by-speaker basis).
VTLN typically gives a relative reduction in error of about 10% (e.g. from 22% WER to 20% WER, or 10% to 9%, or 5% to 4.5%).
Two questions need to be answered to implement VTLN:
1. What type of non-linear warping?
2. How to determine the optimal parameter value for the non-linear warping during both training and recognition?
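The "relative reduction in error" quoted above can be checked with a few lines of arithmetic. The function name below is chosen for illustration only; it is not from the lecture.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative reduction in word error rate (0.10 means 10%)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# (baseline WER, WER after VTLN) pairs from the slide, in percent.
for baseline, adapted in [(22.0, 20.0), (10.0, 9.0), (5.0, 4.5)]:
    reduction = relative_wer_reduction(baseline, adapted)
    print(f"{baseline}% -> {adapted}%: {100 * reduction:.1f}% relative reduction")
```

Note that the 22% to 20% case works out to roughly 9%, so the 10% figure is approximate.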

With different lengths of vocal tract, the resonant frequencies (formants) shift. A shorter vocal tract yields higher formants; a longer vocal tract yields lower formants. But the shift is not a linear function of frequency, so we need to choose a non-linear warping function.
1. What type of non-linear warping?
   piecewise linear
   adjustment of Mel scale
   power function
Also, what range for the parameters? If we consider vocal tract length to be correlated with a person’s height, then we can look at variation in height to determine the range of vocal tract lengths. In the U.S., the average male is 5’10” (1.778 m) and has a VTL of 17 cm. A tall man might be 6’6”, or 11% taller than average. An average woman is 5’4”, or about 90% of the average male height. A short woman might be 85% of the average male height.
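The height ratios above are simple arithmetic. A small sketch, assuming (as the slide does) that vocal tract length scales in proportion to height; the helper names are illustrative:

```python
INCHES_PER_FOOT = 12


def height_in_inches(feet: int, inches: int) -> int:
    """Convert a height given in feet and inches to inches."""
    return feet * INCHES_PER_FOOT + inches


AVERAGE_MALE_IN = height_in_inches(5, 10)  # 70 inches, the slide's reference


def height_ratio(feet: int, inches: int) -> float:
    """Height relative to the average male height used on the slide."""
    return height_in_inches(feet, inches) / AVERAGE_MALE_IN


print(round(height_ratio(6, 6), 2))  # tall man
print(round(height_ratio(5, 4), 2))  # average woman
```

Under that proportionality assumption, these ratios motivate a warping-parameter range of roughly 0.85 to 1.10.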

α-warping of the Mel scale (equation from Huang, Acero, Hon “Spoken Language Processing” 2001, p. 427):
[Figure: warped frequency scales for α = 0.85, α = 1.0, and α = 1.10; frequencies for warping with α from 0.85 to 1.10 in steps of 0.05, compared with frequencies for no warping (α = 1).]

piecewise linear warping: (figure from Renals’ ASR lecture, 2008)
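Since the figure is not reproduced here, the following is a minimal sketch of one common form of piecewise-linear warping, not necessarily the one in Renals' figure. It assumes frequencies are scaled by α up to a knee frequency, and that a second linear segment then maps the remaining band so that the Nyquist frequency is unchanged. The knee placement (`knee_fraction`) and the 8 kHz Nyquist default are assumptions.

```python
def piecewise_linear_warp(freq_hz: float, alpha: float,
                          nyquist_hz: float = 8000.0,
                          knee_fraction: float = 0.85) -> float:
    """Warp freq_hz by alpha below a knee, then interpolate up to Nyquist."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= freq_hz <= nyquist_hz:
        raise ValueError("frequency must lie in [0, Nyquist]")
    # Place the knee so that alpha * knee stays below the Nyquist frequency.
    knee = knee_fraction * nyquist_hz * min(1.0, 1.0 / alpha)
    if freq_hz <= knee:
        return alpha * freq_hz
    # Second segment joins (knee, alpha * knee) to (nyquist, nyquist).
    slope = (nyquist_hz - alpha * knee) / (nyquist_hz - knee)
    return alpha * knee + slope * (freq_hz - knee)


for alpha in (0.85, 1.0, 1.10):
    print(alpha, [round(piecewise_linear_warp(f, alpha)) for f in (1000, 4000, 8000)])
```

With α = 1 the function reduces to the identity, and for any α in the range discussed above the mapping stays monotonic and keeps the band edges fixed.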

warping by power function: (figure from Renals’ ASR lecture, 2008)

Actual estimated warping for different vocal tract lengths, based on a two-tube model of four vowels (/ax/, /iy/, /ae/, /aa/; tube parameter values taken from CS551 Lecture 9): 85% = 14.4 cm, 100% = 17 cm, 110% = 18.7 cm.
[Figure: formant frequencies for vocal tract lengths of 85%, 90%, 95%, 100%, 105%, and 110% of 17 cm, plotted against formant frequencies for a 17-cm vocal tract.]
So, the complexity of non-linear warping actually isn’t warranted; a linear model (or α-warping of the Mel scale) fits the theoretical data well.

2. How to determine the optimal parameter value during both training and recognition?
“Grid search”: try 13 regularly spaced values of α from 0.88 to 1.12, and find the value that maximizes the likelihood of the model (linear increase in processing time) (Lee and Rose, 1996).
Use gradient search instead of grid search.
Estimate and align (along the frequency scale) formant peaks in the speaker data. For example, use the ratio of the median position of the third formant (F3) for the current speaker to the median F3 averaged over all speakers (Eide and Gish, 1996):
$\alpha_s = \mathrm{median}(F_{3,s}) \,/\, \overline{\mathrm{median}(F_3)}$
where s is the current speaker and the bar denotes the average over all speakers.
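The grid search can be sketched on toy data. This is only an illustration of the search loop, under strong simplifying assumptions: the "model" is a single 1-D Gaussian over a formant-like frequency rather than an HMM, warping is applied as a plain scaling α·f, and the 0.88 lower bound with 13 steps follows the range reported by Lee and Rose. All names and numbers are hypothetical.

```python
import math


def log_likelihood(values, mean, var):
    """Log-likelihood of 1-D values under a Gaussian with given mean and variance."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)
               for v in values)


def grid_search_warp(values, mean, var, lo=0.88, hi=1.12, steps=13):
    """Return the warp factor on a regular grid that maximizes the likelihood."""
    candidates = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(candidates,
               key=lambda a: log_likelihood([a * v for v in values], mean, var))


# Toy speaker-independent model: F3 around 2500 Hz (standard deviation 100 Hz).
model_mean, model_var = 2500.0, 100.0 ** 2
# Toy speaker whose F3 values sit higher, as for a shorter vocal tract.
speaker_f3 = [2778.0, 2800.0, 2760.0, 2790.0, 2762.0]

best_alpha = grid_search_warp(speaker_f3, model_mean, model_var)
print(f"best warp factor: {best_alpha:.2f}")
```

The cost grows linearly with the number of candidates because each one requires a full likelihood evaluation, which is the "linear increase in processing time" noted above.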

Maximum a Posteriori (MAP) Adaptation of Model Parameters
If we have some (labeled) data for a speaker, we can adapt our model parameters to better fit that speaker using MAP adaptation. Sometimes just the means are updated; the covariance matrix is assumed to be the same, as are the transition probabilities and mixture weights. We also assume that each aspect (means, covariance matrix, etc.) can be treated independently.
Maximum Likelihood estimation:
$\hat{\lambda}_{ML} = \arg\max_{\lambda} f(O \mid \lambda)$
MAP estimation:
$\hat{\lambda}_{MAP} = \arg\max_{\lambda} f(O \mid \lambda)\, g(\lambda)$
where g(λ) is the prior probability distribution of the model over the space of model parameter values. (If we know nothing about g(λ), the prior probability of the model, then MAP reduces to ML estimation.)
[Figure: prior density g(λ) plotted over the parameter space λ.]
Original paper on MAP: Lee and Gauvain, ICASSP 1993.

What do we know about g(), the prior probability density function of the new model? Usually, we don’t know g(), so we use maximum-likelihood (EM) training. However, in this case, we have an existing, speaker-independent (S.I.) model (know, prior information) and we want to learn the model for a specific speaker. If we assume that each of the parts of the GMM model (, , weights) are independent, we can optimize each of these sub-problems independently. For the D-dimensional Gaussian distributions characterized by  and , the prior density g() can be represented with a normal-Wishart density, with the following parameters: >D-1, >0. The normal Wishart pdf also has a vector nw being the mean of the Gaussian of the speaker-independent model, and a matrix S being the covariance matrix from the speaker-independent model.

Using the Lagrange multiplier similar to the EM derivation (Lecture 12) applied to this normal-Wishart pdf, the update formula for the means of the model becomes:
$\hat{\mu}_{ik} = \dfrac{\tau_{ik}\,\mu_{nw,ik} + \sum_{t=1}^{T} \gamma_t(i,k)\, o_t}{\tau_{ik} + \sum_{t=1}^{T} \gamma_t(i,k)}$
o_t = observations for the new speaker; γ_t(i,k) = probabilities for the new speaker; μ_nw,ik = from the S.I. model.
μ_nw,ik is the mean of the S.I. model for state i, component k.
τ_ik, which balances the contribution of prior knowledge (the S.I. model) against the new observed data (the speaker-dependent data), is determined empirically. It controls the rate of change of $\hat{\mu}_{ik}$.
γ_t(i,k) is the probability of being in state i and component k at time t given the speaker-dependent data and model (Lecture 11).
This updating of the means is iterated, just like EM. Each iteration changes the γ_t(i,k) values, and therefore the $\hat{\mu}_{ik}$.

“When τ_ik is large, the prior density is sharply peaked around the values of the seed (S.I.) HMM parameters which will be only slightly modified by the adaptation process. Conversely, if τ_ik is small, the adaptation will be very fast” (Lee and Gauvain (1993), p. 560).
When this weight is small, the effect of the S.I. model is smaller, and the speaker-specific observations dominate the computation.
As the number of observations of the new speaker increases for state i and component k (or, as T approaches infinity), the MAP estimate approaches the ML estimate of the new data, as the new data dominate over the old mean.
The same approach can be used to adjust the covariance matrix.
τ_ik can be constrained to be the same for all components in all GMMs and states; a typical value is between 2 and 20.
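A minimal sketch of the MAP mean update for one state and component, assuming the standard form in which the adapted mean is a weighted average of the S.I. mean (weight τ) and the occupancy-weighted new-speaker observations. The function and variable names are illustrative, and the occupation probabilities are supplied directly rather than computed by forward-backward.

```python
def map_mean_update(prior_mean, tau, observations, gammas):
    """MAP update of a Gaussian mean.

    prior_mean   : S.I. mean, list of D floats
    tau          : prior weight (> 0); larger values keep the mean near prior_mean
    observations : list of T observation vectors (each a list of D floats)
    gammas       : list of T occupation probabilities for this state/component
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    occupancy = sum(gammas)
    dim = len(prior_mean)
    weighted_sum = [sum(g * o[d] for g, o in zip(gammas, observations))
                    for d in range(dim)]
    return [(tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
            for d in range(dim)]


si_mean = [0.0, 0.0]                    # toy S.I. mean
frames = [[2.0, 4.0]] * 10              # ten identical toy frames
occupation = [1.0] * 10                 # each frame fully assigned

for tau in (2.0, 10.0, 20.0):
    print(tau, map_mean_update(si_mean, tau, frames, occupation))
```

With τ = 10 and ten fully assigned frames, the result lies exactly halfway between the S.I. mean and the new-speaker mean; smaller τ moves it toward the data, larger τ toward the S.I. model.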

“MAP HMM can be regarded as an interpolated model between the speaker-independent and speaker-dependent HMM. Both are derived from the standard ML forward-backward algorithm.” (Huang, p. 447)
How much data is needed? Of course, more is better. Results have been reported for anywhere from a few utterances per new speaker up to 600 utterances per new speaker.
Problem 1: MAP needs (relatively) large amounts of training data for the speaker being adapted to.
Problem 2: each state and component is updated independently. If the adaptation data contain no speech associated with a particular state and component, then that state still uses the S.I. model. It would be nice to update all the parameters of the model from a small amount of data.

Maximum Likelihood Linear Regression (MLLR)
The idea behind MLLR is to use a set of linear regression transformation functions to map means (and maybe also covariances) in order to maximize the likelihood on the adaptation data. In other words, we want to find some linear transform (of the form ax+b) that warps the mean vector in such a way that the likelihood of the model given the new data, O_new, is maximized. (In the following, o_t is one frame of O_new.)
Updating only the means is effective; updating the covariance matrix gives less than an additional 2% error reduction (Huang, p. 450) and so is less commonly done.
The same transformation can be used for similar GMMs; this sharing allows the entire model to be updated faster and uniformly.

The mean vector for state i, component k can be transformed using the following equation:
$\tilde{\mu}_{ik} = A_c\,\mu_{ik} + b_c$
where A_c is a regression matrix and b_c is an additive bias vector; A_c and b_c are associated with a broad class of phonemes or set of tied states (not just an individual state), called c, to better share model parameters. We want to find A_c and b_c such that the mismatch with new (speaker-specific) data is smallest.
We can rewrite this as
$\tilde{\mu}_{ik} = W_c\,\xi_{ik}$
where μ_ik is rewritten as the extended mean vector $\xi_{ik} = [1, \mu_{ik}^T]^T$, and we need to solve for W_c, which contains both A_c and b_c, e.g. W_c = [b_c, A_c].
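A short sketch showing that the two forms of the mean transform agree: applying A_c and b_c directly, or applying W_c = [b_c, A_c] to the mean extended with a leading 1. The values and function names are arbitrary illustrations.

```python
def transform_mean(A, b, mean):
    """Compute A @ mean + b for a square matrix A given as a list of rows."""
    return [sum(a * m for a, m in zip(row, mean)) + bias
            for row, bias in zip(A, b)]


def transform_mean_extended(W, mean):
    """Compute W @ [1, mean], where each row of W is [b_r, A_r1, ..., A_rD]."""
    extended = [1.0] + list(mean)
    return [sum(w * x for w, x in zip(row, extended)) for row in W]


A = [[1.1, 0.0],
     [0.2, 0.9]]          # toy regression matrix
b = [0.5, -0.3]           # toy bias vector
W = [[bias] + row for row, bias in zip(A, b)]   # W = [b, A]

mu = [2.0, 3.0]
print(transform_mean(A, b, mu))
print(transform_mean_extended(W, mu))
```

Folding the bias into W_c means a single matrix has to be estimated per regression class.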

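Maximizing a Q function by setting the derivative to zero, in the same way that was done in Lecture 12, maximizes the likelihood of the adaptation data (Huang p ); this yields the function
$\sum_{t=1}^{T} \sum_{(i,k) \in c} \gamma_t(i,k)\, \Sigma_{ik}^{-1}\, o_t\, \xi_{ik}^T = \sum_{t=1}^{T} \sum_{(i,k) \in c} \gamma_t(i,k)\, \Sigma_{ik}^{-1}\, W_c\, \xi_{ik}\, \xi_{ik}^T$
which can be rewritten as
$Z = \sum_{(i,k) \in c} V_{ik}\, W_c\, D_{ik}$
where
$Z = \sum_{t=1}^{T} \sum_{(i,k) \in c} \gamma_t(i,k)\, \Sigma_{ik}^{-1}\, o_t\, \xi_{ik}^T, \quad V_{ik} = \sum_{t=1}^{T} \gamma_t(i,k)\, \Sigma_{ik}^{-1}, \quad D_{ik} = \xi_{ik}\, \xi_{ik}^T$
and $\xi_{ik} = [1, \mu_{ik}^T]^T$ is the extended mean vector.

If the covariance matrix Σ_ik is diagonal, there is a closed-form solution for W_c:
$W_q^T = G_q^{-1} Z_q^T$
where subscript q denotes the qth row of matrix W_c and Z, and
$G_q = \sum_{(i,k) \in c} v_{qq}\, D_{ik}$
where v_qq denotes the qth diagonal element of V_ik.
We need to make sure that G_q is invertible, by having enough training data. If there’s not enough data, we can tie more classes together.
This process can be iterated with new values for γ_t(i,k) and μ_ik in each iteration, but usually one iteration gives the most gain in performance.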
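The row-by-row closed-form solution can be sketched in plain Python. This assumes the standard diagonal-covariance MLLR form, in which row q of W_c solves G_q w = z_q, with G_q accumulated from occupancy, inverse variance, and the extended means. All Gaussians are treated as one regression class, the occupation probabilities are given rather than computed, and the names are illustrative.

```python
def solve_linear(G, z):
    """Solve G w = z by Gauss-Jordan elimination with partial pivoting."""
    n = len(z)
    aug = [list(G[r]) + [z[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("G_q is singular: not enough adaptation data")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]


def estimate_mllr_transform(means, variances, observations, gammas):
    """Estimate W = [b, A] for one regression class (diagonal covariances).

    means[g], variances[g] : mean and diagonal variances of Gaussian g (length D)
    observations[t]        : adaptation frame t (length D)
    gammas[t][g]           : occupation probability of Gaussian g at frame t
    """
    dim = len(means[0])
    xis = [[1.0] + list(m) for m in means]            # extended mean vectors
    W = []
    for q in range(dim):
        G = [[0.0] * (dim + 1) for _ in range(dim + 1)]
        z = [0.0] * (dim + 1)
        for g, xi in enumerate(xis):
            inv_var = 1.0 / variances[g][q]
            v_qq = inv_var * sum(gam[g] for gam in gammas)
            obs_q = inv_var * sum(gam[g] * o[q]
                                  for gam, o in zip(gammas, observations))
            for r in range(dim + 1):
                z[r] += obs_q * xi[r]
                for c in range(dim + 1):
                    G[r][c] += v_qq * xi[r] * xi[c]
        W.append(solve_linear(G, z))                  # row q of W
    return W


# Toy check: frames generated exactly by a known transform are recovered.
true_A, true_b = [[1.1, 0.0], [0.2, 0.9]], [0.5, -0.3]
toy_means = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_vars = [[1.0, 2.0]] * 3
toy_obs = [[sum(a * m for a, m in zip(row, mu)) + bias
            for row, bias in zip(true_A, true_b)] for mu in toy_means]
toy_gammas = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_hat = estimate_mllr_transform(toy_means, toy_vars, toy_obs, toy_gammas)
print(W_hat)
```

With D = 2, at least three Gaussians with linearly independent extended means must be observed for G_q to be invertible; with fewer, the solver raises an error, which corresponds to the case where more classes need to be tied together.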

Unsupervised adaptation can be done by (a) recognizing with a speaker-independent (S.I.) model, and then (b) assuming that these recognized results are correct, using these results as training data for adaptation. (In this case, confidence scores, which indicate which regions of speech are better recognized, may be helpful to constrain the training to adapt only to correctly recognized speech samples.)
MLLR and MAP can be combined for (slightly) better performance over either technique alone.
Also, MLLR and VTLN performance improvement is often approximately additive. For example, a 10% relative WER reduction from VTLN and a 15% relative WER reduction from MLLR in isolation yield a 25% relative WER reduction from using both VTLN and MLLR (Pye and Woodland, ICASSP 97).
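The confidence-based selection step in unsupervised adaptation might look like the sketch below. The segment structure (a dict with "words" and "confidence" keys) and the 0.8 threshold are hypothetical; a real recognizer would supply its own hypothesis format and a tuned threshold.

```python
def select_adaptation_segments(hypotheses, min_confidence=0.8):
    """Keep only recognized segments confident enough to use as adaptation labels."""
    return [h for h in hypotheses if h["confidence"] >= min_confidence]


# Toy first-pass output from a speaker-independent recognizer.
first_pass = [
    {"words": "call home", "confidence": 0.93},
    {"words": "wreck a nice beach", "confidence": 0.41},
    {"words": "play music", "confidence": 0.86},
]

for segment in select_adaptation_segments(first_pass):
    print(segment["words"])
```

Only the retained segments would then be passed, with their recognized transcriptions, to MAP or MLLR adaptation.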

One example of combining MAP and MLLR is from the Whisper system: a 15% WER reduction using MLLR, and a total 22% relative error reduction on 1000 utterances from combined MAP+MLLR. (The speaker-dependent system was trained on the 1000 utterances from that speaker.)
