Slide 1: Speech Recognition with Hidden Markov Models, Winter 2011
CS 552/652, Oregon Health & Science University, Center for Spoken Language Understanding
John-Paul Hosom
Lecture 17½: Speaker Adaptation
Notes based on:
- Huang, Acero, and Hon (2001), "Spoken Language Processing", section 9.6
- Lee and Gauvain (1993), "Speaker Adaptation Based on MAP Estimation of HMM Parameters", ICASSP 93
- Woodland (2001), "Speaker Adaptation for Continuous Density HMMs: A Review"
- Gauvain and Lee (1994), "Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains"
- Renals (2008), speaker adaptation lecture notes
- Lee and Rose (1996), "Speaker Normalization Using Efficient Frequency Warping Procedures"
- Panchapagesan and Alwan (2008), "Frequency Warping for VTLN and Speaker Adaptation by Linear Transformation of Standard MFCC"
Slide 2: Speaker Adaptation
Given an HMM that has been trained on a large number of people (a speaker-independent HMM), we can try to improve performance by adapting to the speaker currently being recognized in testing.
Two basic types of speaker adaptation:
1. Adaptation of the feature space (speaker normalization)
   Vocal-Tract Length Normalization (VTLN) = warp the feature space to better fit the model parameters
2. Adaptation of the model parameters
   Maximum A Posteriori (MAP) adaptation = retrain individual state parameters
   Maximum Likelihood Linear Regression (MLLR) = "warp" model parameters to better fit the adaptation data
Slide 3: Speaker Normalization
A common technique is Vocal Tract Length Normalization (VTLN).
Assumption: the majority of speaker differences in the acoustic space are caused by different vocal tract lengths. Different lengths of the vocal tract can be normalized using a non-linear frequency warping (like the Mel scale, but on a speaker-by-speaker basis).
Performance using VTLN typically improves by a relative error reduction of about 10% (e.g. from 22% WER to 20% WER, from 10% to 9%, or from 5% to 4.5%).
Two questions need to be answered to implement VTLN:
1. What type of non-linear warping should be used?
2. How do we determine the optimal parameter value for the non-linear warping during both training and recognition?
Slide 4: Speaker Normalization
With different lengths of vocal tract, the resonant frequencies (formants) shift. A shorter vocal tract yields higher formants; a longer vocal tract yields lower formants. But the shift is not a linear function of frequency, so we need to choose a non-linear warping function.
1. What type of non-linear warping?
   - piecewise linear
   - adjustment of the Mel scale
   - power function
Also, what range should the warping parameter cover? If we consider vocal tract length to be correlated with a person's height, then we can look at variation in height to determine the range of vocal tract lengths. In the U.S., the average male is 5'10" (1.78 m) and has a VTL of 17 cm. A tall man might be 6'6", or 11% taller than average. An average woman is 5'4", or about 90% of the average male height. A short woman might be 85% of the average male height.
Slide 5: Speaker Normalization
Warping of the Mel scale with warp factor α:
[figure: warped frequency scales for α from 0.85 to 1.10 in steps of 0.05 (α = 0.85, 1.0, 1.10 shown), compared against the frequencies for no warping (α = 1)]
(equation from Huang, Acero, Hon, "Spoken Language Processing", 2001, p. 427)
Slide 6: Speaker Normalization
Piecewise-linear warping:
(figure from Renals' ASR lecture, 2008)
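The piecewise-linear warp in the figure can be sketched in a few lines. The breakpoint fraction and the constraint that the Nyquist frequency maps to itself are common choices, but they are assumptions here, not values from the slide:

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f_nyquist=8000.0, f_cut=0.875):
    """Piecewise-linear VTLN warp: scale frequencies by alpha below a
    breakpoint, then interpolate linearly so Nyquist maps to itself.
    f_cut (breakpoint as a fraction of Nyquist) is an assumed value."""
    f = np.asarray(f, dtype=float)
    f0 = f_cut * f_nyquist  # breakpoint of the warp
    return np.where(
        f <= f0,
        alpha * f,  # linear region: f' = alpha * f
        alpha * f0 + (f_nyquist - alpha * f0) * (f - f0) / (f_nyquist - f0),
    )
```

With alpha = 1 this is the identity, and for any alpha the endpoint f = f_nyquist is unchanged, so the warped filterbank still covers the full analysis band.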
Slide 7: Speaker Normalization
Warping by power function:
(figure from Renals' ASR lecture, 2008)
Slide 8: Speaker Normalization
Actual estimated warping for different vocal tract lengths, based on a two-tube model of four vowels (/ax/, /iy/, /ae/, /aa/; tube parameter values taken from CS551 Lecture 9).
[figure: formant frequencies for vocal tract lengths of 85% (14.4 cm), 90%, 95%, 100% (17 cm), 105%, and 110% (18.7 cm) of the 17-cm average, plotted against the formant frequencies for the 17-cm vocal tract]
So the complexity of a non-linear warping actually isn't warranted; a linear model, or α-warping of the Mel scale, fits the theoretical data well.
Slide 9: Speaker Normalization
2. How do we determine the optimal parameter value during both training and recognition?
- "Grid search": try 13 regularly-spaced values of α from 0.88 to 1.12, and find the value that maximizes the likelihood of the model (a linear increase in processing time) (Lee and Rose, 1996).
- Use a gradient search instead of a grid search.
- Estimate and align (along the frequency scale) formant peaks in the speaker's data. For example, use the ratio of the median position of the third formant for the current speaker divided by the median F3 averaged over all speakers (Eide and Gish, 1996).
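The grid search can be sketched as follows. Here score_fn is a hypothetical hook that extracts α-warped features for the utterance and returns the HMM log-likelihood; only the search loop itself comes from the slide:

```python
import numpy as np

def grid_search_alpha(utterance, score_fn, lo=0.88, hi=1.12, n=13):
    """Try n regularly-spaced warp factors and keep the one whose
    warped features give the highest model likelihood.
    score_fn(utterance, alpha) is a hypothetical scoring hook."""
    best_alpha, best_score = None, -np.inf
    for alpha in np.linspace(lo, hi, n):
        s = score_fn(utterance, alpha)
        if s > best_score:
            best_alpha, best_score = alpha, s
    return best_alpha
```

The cost is linear in the number of grid points, which is why a gradient search or a formant-ratio estimate can be attractive alternatives.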
Slide 10: Maximum a Posteriori (MAP) Adaptation of Model Parameters
If we have some (labeled) data for a speaker, we can adapt our model parameters to better fit that speaker using MAP adaptation. Sometimes just the means are updated; the covariance matrix is assumed to be the same, as are the transition probabilities and mixture weights. We also assume that each aspect (means, covariance matrix, etc.) can be treated independently.

Maximum Likelihood estimation:  λ_ML = argmax_λ p(O | λ)
MAP estimation:  λ_MAP = argmax_λ p(O | λ) g(λ)

where g(λ) is the prior probability distribution of the model over the space of model parameter values. (If we know nothing about g(λ), the prior probability of the model, then MAP reduces to ML estimation.)
(Original paper on MAP: Lee and Gauvain, ICASSP 1993.)
Slide 11: Maximum a Posteriori (MAP) Adaptation of Model Parameters
What do we know about g(λ), the prior probability density function of the new model? Usually we don't know g(λ), so we use maximum-likelihood (EM) training. In this case, however, we have an existing, speaker-independent (S.I.) model (known prior information), and we want to learn the model for a specific speaker.
If we assume that each of the parts of the GMM model (μ, Σ, weights) is independent, we can optimize each of these sub-problems independently.
For the D-dimensional Gaussian distributions characterized by μ and Σ, the prior density g(λ) can be represented with a normal-Wishart density, with parameters α > D − 1 and τ > 0. The normal-Wishart pdf also has a vector μ_nw, which is the mean of the Gaussian of the speaker-independent model, and a matrix S, which is the covariance matrix from the speaker-independent model.
Slide 12: Maximum a Posteriori (MAP) Adaptation of Model Parameters
Using a Lagrange multiplier in a way similar to the EM derivation (Lecture 12), applied to this normal-Wishart pdf, the update formula for the means of the model becomes:

μ̂_ik = (τ_ik μ_ik + Σ_{t=1..T} γ_t(i,k) o_t) / (τ_ik + Σ_{t=1..T} γ_t(i,k))

where o_t are the observations for the new speaker, γ_t(i,k) are computed on the new speaker's data, and μ_ik is the mean of the S.I. model for state i, component k.
τ_ik, the weight controlling the contribution of prior knowledge (the S.I. model) versus the new observed data (the speaker-dependent data), is determined empirically; it controls the rate of change of the means. γ_t(i,k) is the probability of being in state i and component k at time t, given the speaker-dependent data and model (Lecture 11).
This updating of the means is iterated, just like EM: each iteration changes the γ_t(i,k) values, and therefore the updated means.
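The mean-update formula above can be sketched directly; array names and shapes are illustrative:

```python
import numpy as np

def map_update_mean(mu_si, gamma, obs, tau=10.0):
    """MAP re-estimate of one Gaussian mean (state i, component k).
    mu_si : speaker-independent mean, shape (D,)
    gamma : occupation probabilities gamma_t(i,k), shape (T,)
    obs   : speaker-specific observations o_t, shape (T, D)
    tau   : prior weight; large tau stays near the S.I. mean,
            small tau lets the new data dominate."""
    gamma = np.asarray(gamma, dtype=float)
    obs = np.asarray(obs, dtype=float)
    num = tau * np.asarray(mu_si, dtype=float) + gamma @ obs
    den = tau + gamma.sum()
    return num / den
```

Note that as gamma.sum() grows (more adaptation data), the result approaches the ML mean of the new data, matching the limiting behavior described on the next slide.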
Slide 13: Maximum a Posteriori (MAP) Adaptation of Model Parameters
"When τ_ik is large, the prior density is sharply peaked around the values of the seed (S.I.) HMM parameters which will be only slightly modified by the adaptation process. Conversely, if τ_ik is small, the adaptation will be very fast" (Lee and Gauvain (1993), p. 560). When this weight is small, the effect of the S.I. model is smaller, and the speaker-specific observations dominate the computation.
As the number of observations of the new speaker increases for state i and component k (or, as T approaches infinity), the MAP estimate approaches the ML estimate of the new data, as the new data dominate over the old mean.
The same approach can be used to adjust the covariance matrix.
τ_ik can be constrained to be the same for all components in all GMMs and states; a typical value is between 2 and 20.
Slide 14: Maximum a Posteriori (MAP) Adaptation of Model Parameters
"MAP HMM can be regarded as an interpolated model between the speaker-independent and speaker-dependent HMM. Both are derived from the standard ML forward-backward algorithm." (Huang, p. 447)
How much data is needed? Of course, more is better. Results have been reported for anywhere from only a few utterances per new speaker up to 600 utterances per new speaker.
Problem 1: MAP needs (relatively) a lot of adaptation data from the speaker being adapted to.
Problem 2: each state and component is updated independently. If a speaker doesn't produce data associated with a particular state and component, then that state still uses the S.I. model. It would be nice to update all the parameters of the model from a small amount of data.
Slide 15: Maximum Likelihood Linear Regression (MLLR)
The idea behind MLLR is to use a set of linear regression transformation functions to map the means (and maybe also the covariances) in order to maximize the likelihood of the adaptation data.
In other words, we want to find some linear transform (of the form ax + b) that warps the mean vectors in such a way that the likelihood of the model given the new data, p(O_new | λ̂), is maximized. (In the following, o_t is one frame of O_new.)
Updating only the means is effective; updating the covariance matrix gives less than an additional 2% error reduction (Huang, p. 450) and so is less commonly done.
The same transformation can be used for similar GMMs; this sharing allows updating of the entire model faster and more uniformly.
Slide 16: Maximum Likelihood Linear Regression (MLLR)
The mean vector for state i, component k can be transformed using the following equation:

μ̂_ik = A_c μ_ik + b_c

where A_c is a regression matrix and b_c is an additive bias vector; A_c and b_c are associated with a broad class c of phonemes or set of tied states (not just an individual state), to better share model parameters.
We want to find A_c and b_c such that the mismatch with the new (speaker-specific) data is smallest. We can re-write the transform as

μ̂_ik = W_c ξ_ik

where μ_ik is rewritten as the extended mean vector ξ_ik = [1, μ_ik^T]^T, and we need to solve for W_c, which contains both A_c and b_c: W_c = [b_c, A_c].
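The equivalence between the two forms of the transform can be checked in a few lines (names are illustrative):

```python
import numpy as np

def mllr_transform(mu, W):
    """Apply an MLLR mean transform: mu_hat = A mu + b, written as
    W @ xi with extended mean xi = [1, mu^T]^T and W = [b, A]."""
    xi = np.concatenate(([1.0], np.asarray(mu, dtype=float)))
    return W @ xi
```

Packing the bias into the first column of W is what lets the estimation step solve for A_c and b_c jointly as one matrix.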
Slide 17: Maximum Likelihood Linear Regression (MLLR)
Maximizing a Q function by setting the derivative to zero, in the same way as in Lecture 12, maximizes the likelihood of the adaptation data (Huang, sec. 9.6); this yields the equation

Σ_t Σ_{i,k} γ_t(i,k) Σ_ik^{-1} o_t ξ_ik^T = Σ_t Σ_{i,k} γ_t(i,k) Σ_ik^{-1} W_c ξ_ik ξ_ik^T

which can be re-written as

Z = Σ_{i,k} V_ik W_c D_ik

where Z = Σ_t Σ_{i,k} γ_t(i,k) Σ_ik^{-1} o_t ξ_ik^T,  V_ik = Σ_ik^{-1} Σ_t γ_t(i,k),  and D_ik = ξ_ik ξ_ik^T.
Slide 18: Maximum Likelihood Linear Regression (MLLR)
If the covariance matrix Σ_ik is diagonal, there is a closed-form solution for W_c:

w_q^T = G_q^{-1} z_q^T,   with   G_q = Σ_{i,k} v_qq^(ik) D_ik

where subscript q denotes the qth row of matrix W_c and of Z, and v_qq^(ik) denotes the qth diagonal element of V_ik.
We need to make sure that G_q is invertible, by having enough training data. If there's not enough data, we can tie more classes together.
This process can be iterated with new values of γ_t(i,k) and μ̂_ik in each iteration, but usually one iteration gives most of the gain in performance.
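A sketch of the row-by-row closed-form solution for W_c, assuming diagonal covariances and a single regression class; the array shapes and names are my own, not from the slides:

```python
import numpy as np

def estimate_mllr_W(means, vars_, gammas, obs):
    """Closed-form MLLR mean-transform estimate for one regression
    class with diagonal covariances, solved one row of W at a time.
    means  : (M, D) S.I. mixture means in the class
    vars_  : (M, D) diagonal covariances
    gammas : (M, T) occupation probabilities gamma_t(m)
    obs    : (T, D) adaptation observations
    Returns W of shape (D, D+1), with W = [b, A]."""
    M, D = means.shape
    xi = np.hstack([np.ones((M, 1)), means])   # extended means, (M, D+1)
    occ = gammas.sum(axis=1)                   # per-component occupancy
    # Z = sum_m sum_t gamma_t(m) Sigma_m^-1 o_t xi_m^T, shape (D, D+1)
    weighted_obs = gammas @ obs                # sum_t gamma_t(m) o_t, (M, D)
    Z = (weighted_obs / vars_).T @ xi
    W = np.zeros((D, D + 1))
    for q in range(D):
        # G_q = sum_m (occ_m / sigma^2_{m,q}) xi_m xi_m^T, (D+1, D+1)
        Gq = (xi * (occ / vars_[:, q])[:, None]).T @ xi
        W[q] = np.linalg.solve(Gq, Z[q])       # w_q = G_q^{-1} z_q
    return W
```

Note that each G_q is built from outer products of the extended means, so it is invertible only if the class contains enough distinct, occupied components; with too little data, np.linalg.solve will fail, which is exactly the case where more classes should be tied together.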
Slide 19: Maximum Likelihood Linear Regression (MLLR)
Unsupervised adaptation can be done by (a) recognizing with a speaker-independent (S.I.) model, and then (b) assuming that the recognized results are correct, using them as training data for adaptation. (In this case, confidence scores, indicating which regions of speech are better recognized, may be helpful for constraining the training to adapt only to correctly-recognized speech samples.)
MLLR and MAP can be combined for (slightly) better performance over either technique alone. Also, the performance improvements from MLLR and VTLN are often approximately additive: for example, a 10% relative WER reduction from VTLN and a 15% relative WER reduction from MLLR in isolation yield roughly a 25% relative WER reduction from using both VTLN and MLLR (Pye and Woodland, ICASSP 97).
Slide 20: Maximum Likelihood Linear Regression (MLLR)
One example of combining MAP and MLLR is from the Whisper system: a 15% relative WER reduction using MLLR alone, and a total 22% relative error reduction on 1000 utterances from combined MAP+MLLR. (The comparison speaker-dependent system was trained on the 1000 utterances from that speaker.)