
Published by Victoria Decourcy. Modified about 1 year ago.

1
CS 552/652 Speech Recognition with Hidden Markov Models
Winter 2011
Oregon Health & Science University, Center for Spoken Language Understanding
John-Paul Hosom

Lecture 17½: Speaker Adaptation

Notes based on:
Huang, Acero, and Hon (2001), “Spoken Language Processing”, section 9.6.
Lee and Gauvain (1993), “Speaker Adaptation Based on MAP Estimation of HMM Parameters”, ICASSP 93.
Woodland (2001), “Speaker Adaptation for Continuous Density HMMs: A Review”.
Gauvain and Lee (1994), “Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains”.
Renals (2008), speaker adaptation lecture notes.
Lee and Rose (1996), “Speaker Normalization Using Efficient Frequency Warping Procedures”.
Panchapagesan and Alwan (2008), “Frequency Warping for VTLN and Speaker Adaptation by Linear Transformation of Standard MFCC”.

2
Speaker Adaptation

Given an HMM that has been trained on data from a large number of speakers (a speaker-independent HMM), we can try to improve performance by adapting to the speaker currently being recognized in testing.

Two basic types of speaker adaptation:
1. Adaptation of the feature space (speaker normalization)
   Vocal-Tract Length Normalization (VTLN) = warp the feature space to better fit the model parameters
2. Adaptation of the model parameters
   Maximum A Posteriori (MAP) adaptation = retrain individual state parameters
   Maximum Likelihood Linear Regression (MLLR) = “warp” model parameters to better fit adaptation data

3
Speaker Normalization

A common technique is Vocal Tract Length Normalization (VTLN).

Assumption: the majority of speaker differences in the acoustic space are caused by different vocal tract lengths. Different lengths of the vocal tract can be normalized using a non-linear frequency warping (like the Mel scale, but on a speaker-by-speaker basis).

VTLN typically gives a relative reduction in error of about 10% (e.g. from 22% WER to 20% WER, or 10% to 9%, or 5% to 4.5%).

Two questions need to be answered to implement VTLN:
1. What type of non-linear warping should be used?
2. How should the optimal parameter value for the non-linear warping be determined during both training and recognition?

4
With different lengths of vocal tract, the resonant frequencies (formants) shift. A shorter vocal tract yields higher formants; a longer vocal tract yields lower formants. But the shift is not a linear function of frequency, so we need to choose a non-linear warping function.

1. What type of non-linear warping?
   piecewise linear
   adjustment of Mel scale
   power function

Also, what range should the parameters cover? If we consider vocal tract length to be correlated with a person’s height, then we can look at variation in height to determine the range of vocal tract lengths. In the U.S., the average male is 5’10” (1.778 m) and has a vocal tract length (VTL) of 17 cm. A tall man might be 6’6”, or 11% taller than average. An average woman is 5’4”, or roughly 90% of the average male height. A short woman might be 85% of the average male height.
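The height arithmetic above can be checked with a short sketch. It assumes, purely for illustration, that vocal tract length scales in direct proportion to height; the function names are hypothetical and only the heights and the 17 cm reference come from the slide.

```python
METERS_PER_INCH = 0.0254
AVG_MALE_VTL_CM = 17.0  # reference vocal tract length from the slide


def feet_inches_to_m(feet, inches):
    """Convert a height given in feet and inches to meters."""
    return (feet * 12 + inches) * METERS_PER_INCH


avg_male_m = feet_inches_to_m(5, 10)
tall_male_m = feet_inches_to_m(6, 6)
avg_female_m = feet_inches_to_m(5, 4)


def height_ratio(height_m):
    """Height relative to the average male height."""
    return height_m / avg_male_m


def scaled_vtl_cm(height_m):
    """Assumption: vocal tract length is proportional to height."""
    return AVG_MALE_VTL_CM * height_ratio(height_m)


for label, height in [("average male", avg_male_m),
                      ("tall male", tall_male_m),
                      ("average female", avg_female_m)]:
    print(f"{label}: {height:.3f} m, ratio {height_ratio(height):.2f}, "
          f"VTL about {scaled_vtl_cm(height):.1f} cm")
```

Under this proportionality assumption, the ratios of roughly 0.85 to 1.11 give a plausible search range for a warping parameter.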

5
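Speaker Normalization

Warping of the Mel scale by a factor $\alpha$ (equation from Huang, Acero, and Hon, “Spoken Language Processing”, 2001, p. 427).

[Figure: frequencies for no warping ($\alpha = 1$) and frequencies for warping with $\alpha$ from 0.85 to 1.10 in steps of 0.05; curves labeled $\alpha = 0.85$, $\alpha = 1.0$, and $\alpha = 1.10$.]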

6
Speaker Normalization

Piecewise linear warping (figure from Renals’ ASR lecture, 2008).
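Since the figure is not available here, the following is a minimal sketch of one common form of piecewise linear warping, not necessarily the one shown in the original figure. The knee position (`knee_fraction`), the 8000 Hz upper limit, and the function name are all assumptions made for illustration. Below the knee the frequency is scaled by the warping factor; above it, a second linear segment ensures the maximum frequency maps to itself.

```python
def piecewise_linear_warp(freq_hz, alpha, f_max=8000.0, knee_fraction=0.85):
    """Warp a frequency with a two-segment piecewise linear function.

    Assumed form: slope `alpha` up to the knee, then a straight line
    from (knee, alpha * knee) to (f_max, f_max).
    """
    knee = knee_fraction * f_max
    if freq_hz <= knee:
        return alpha * freq_hz
    # Second segment: chosen so that f_max is mapped to f_max.
    slope = (f_max - alpha * knee) / (f_max - knee)
    return alpha * knee + slope * (freq_hz - knee)


for alpha in (0.88, 1.00, 1.12):
    warped = [round(piecewise_linear_warp(f, alpha)) for f in (1000, 4000, 7500, 8000)]
    print(alpha, warped)
```

Keeping the end point fixed means the warped spectrum still covers the same bandwidth, so the filterbank does not have to read beyond the available frequency range.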

7
Speaker Normalization

Warping by power function (figure from Renals’ ASR lecture, 2008).

8
Speaker Normalization

Actual estimated warping for different vocal tract lengths, based on a two-tube model of four vowels (/ax/, /iy/, /ae/, /aa/; tube parameter values taken from CS551 Lecture 9).

[Figure: formant frequencies for a 17-cm vocal tract plotted against formant frequencies for vocal tract lengths of 85%, 90%, 95%, 100%, 105%, and 110% of 17 cm; 85% ≈ 14.4 cm, 100% = 17 cm, 110% = 18.7 cm.]

So the complexity of non-linear warping actually isn’t warranted; a linear model fits the theoretical data well, as does $\alpha$-warping of the Mel scale.
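To illustrate why a linear model can fit well, here is a sketch using a simpler model than the slide's two-tube model: a single uniform lossless tube, closed at one end, whose resonances are F_n = (2n - 1)c / (4L). The speed of sound (35000 cm/s) is an assumed round value, and the function name is hypothetical. In this simplified model every formant scales by exactly the same factor when the tube length changes, which is a purely linear frequency warping.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed round value


def uniform_tube_formants(length_cm, count=3):
    """Resonances of a uniform tube closed at one end (quarter-wave resonator)."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)
            for n in range(1, count + 1)]


reference = uniform_tube_formants(17.0)
for scale in (0.85, 0.90, 0.95, 1.00, 1.05, 1.10):
    formants = uniform_tube_formants(17.0 * scale)
    ratios = [f / r for f, r in zip(formants, reference)]
    print(scale, [round(f) for f in formants], [round(x, 3) for x in ratios])
```

Each printed ratio equals 1/scale for all formants, so in the uniform-tube case the warping is exactly linear; the two-tube model on the slide departs from this only mildly.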

9
Speaker Normalization

2. How to determine the optimal parameter value during both training and recognition?

“Grid search”: try 13 regularly spaced values of $\alpha$ from 0.88 to 1.12, and find the value that maximizes the likelihood of the model. Processing time increases linearly with the number of values tried (Lee and Rose, 1996).

Use gradient search instead of grid search.

Estimate and align (along the frequency scale) formant peaks in the speaker data. For example, use the ratio of the median position of the 3rd formant for the current speaker $s$ to the median F3 averaged over all speakers (Eide and Gish, 1996):

$$\alpha_s = \frac{\operatorname{median}(F_{3,s})}{\overline{\operatorname{median}(F_3)}}$$
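The grid search can be sketched as follows. The 13 values from 0.88 to 1.12 come from the slide; everything else is an assumption for illustration. In particular, `toy_log_likelihood` is a hypothetical stand-in for the real scoring step, which would warp the speaker's features with each candidate value and compute the likelihood under the HMM.

```python
def warp_grid(low=0.88, high=1.12, count=13):
    """Regularly spaced candidate warping factors, rounded to 2 decimals."""
    step = (high - low) / (count - 1)
    return [round(low + i * step, 2) for i in range(count)]


def grid_search(log_likelihood, candidates):
    """Return the candidate with the highest log-likelihood."""
    return max(candidates, key=log_likelihood)


def toy_log_likelihood(alpha, true_alpha=0.94):
    """Hypothetical stand-in for scoring warped features with the HMM."""
    return -(alpha - true_alpha) ** 2


grid = warp_grid()
best = grid_search(toy_log_likelihood, grid)
print(grid)
print("best warping factor:", best)
```

The cost is one likelihood evaluation per candidate, which is the linear increase in processing time noted above.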

10
Maximum a Posteriori (MAP) Adaptation of Model Parameters

If we have some (labeled) data for a speaker, we can adapt our model parameters to better fit that speaker using MAP adaptation. Sometimes just the means are updated; the covariance matrix is assumed to be the same, as are the transition probabilities and mixture weights. We also assume that each aspect (means, covariance matrix, etc.) can be treated independently.

Maximum Likelihood estimation:

$$\hat{\lambda}_{ML} = \arg\max_{\lambda} f(O \mid \lambda)$$

MAP estimation:

$$\hat{\lambda}_{MAP} = \arg\max_{\lambda} f(O \mid \lambda)\, g(\lambda)$$

where $g(\lambda)$ is the prior probability distribution of the model over the space of model parameter values. (If we know nothing about $g(\lambda)$, the prior probability of the model, then the prior is treated as constant and MAP reduces to ML estimation.)

[Figure: $g(\lambda)$ plotted over the parameter space.]

Original paper on MAP: Lee and Gauvain, ICASSP 1993.

11
Maximum a Posteriori (MAP) Adaptation of Model Parameters

What do we know about $g(\lambda)$, the prior probability density function of the new model? Usually, we don’t know $g(\lambda)$, so we use maximum-likelihood (EM) training. However, in this case, we have an existing speaker-independent (S.I.) model (known prior information) and we want to learn the model for a specific speaker.

If we assume that the parts of the GMM ($\mu$, $\Sigma$, weights) are independent, we can optimize each of these sub-problems independently.

For the D-dimensional Gaussian distributions characterized by $\mu$ and $\Sigma$, the prior density $g(\lambda)$ can be represented with a normal-Wishart density, with the following parameters: $\alpha > D-1$, $\tau > 0$ (this $\alpha$ is a parameter of the prior, not a frequency-warping factor). The normal-Wishart pdf also has a vector $\mu_{nw}$, the mean of the Gaussian of the speaker-independent model, and a matrix $S$, the covariance matrix from the speaker-independent model.

12
Maximum a Posteriori (MAP) Adaptation of Model Parameters

Using a Lagrange multiplier, as in the EM derivation (Lecture 12), applied to this normal-Wishart pdf, the update formula for the means of the model becomes:

$$\hat{\mu}_{ik} = \frac{\tau_{ik}\,\mu_{nw_{ik}} + \sum_{t=1}^{T} \gamma_t(i,k)\, o_t}{\tau_{ik} + \sum_{t=1}^{T} \gamma_t(i,k)}$$

$o_t$ = observations for the new speaker
$\gamma_t(i,k)$ = probabilities for the new speaker
$\mu_{nw_{ik}}$ = from the S.I. model

$\mu_{nw_{ik}}$ is the mean of the S.I. model for state i, component k.

$\tau_{ik}$, which balances the contribution of prior knowledge (the S.I. model) against the newly observed data (the speaker-dependent data), is determined empirically. It controls the rate of change of $\hat{\mu}_{ik}$.

$\gamma_t(i,k)$ is the probability of being in state i and component k at time t given the speaker-dependent data and model (Lecture 11).

This updating of the means is iterated, just like EM. Each iteration changes the $\gamma_t(i,k)$ values, and therefore the estimated means $\hat{\mu}_{ik}$.
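A minimal sketch of one pass of the MAP mean update for a single state and component, written with plain Python lists. It implements the standard form of the update, in which the new mean is (weight × S.I. mean + occupancy-weighted sum of observations) / (weight + total occupancy). The function and argument names are hypothetical, and the occupancy probabilities are assumed to have been computed already by a forward-backward pass.

```python
def map_mean_update(mu_si, tau, gammas, observations):
    """One MAP update of a Gaussian mean.

    mu_si        -- S.I. mean vector (list of D floats), the prior mean
    tau          -- prior weight (larger = stay closer to the S.I. mean)
    gammas       -- occupancy probability of this state/component per frame
    observations -- adaptation frames, each a list of D floats
    """
    dim = len(mu_si)
    occupancy = sum(gammas)
    weighted_sum = [sum(g * o[d] for g, o in zip(gammas, observations))
                    for d in range(dim)]
    return [(tau * mu_si[d] + weighted_sum[d]) / (tau + occupancy)
            for d in range(dim)]


# Toy data: ten frames fully assigned to this component.
frames = [[1.0, 2.0]] * 10
adapted = map_mean_update([0.0, 0.0], tau=10.0, gammas=[1.0] * 10,
                          observations=frames)
print(adapted)
```

With no adaptation frames the function returns the S.I. mean unchanged, which matches the behavior expected for states the new speaker never visits.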

13
Maximum a Posteriori (MAP) Adaptation of Model Parameters

“When $\tau_{ik}$ is large, the prior density is sharply peaked around the values of the seed (S.I.) HMM parameters which will be only slightly modified by the adaptation process. Conversely, if $\tau_{ik}$ is small, the adaptation will be very fast” (Lee and Gauvain (1993), p. 560).

When this weight is small, the effect of the S.I. model is smaller, and the speaker-specific observations dominate the computation.

As the number of observations of the new speaker increases for state i and component k (or, as T approaches infinity), the MAP estimate approaches the ML estimate of the new data, as the new data dominate over the old mean.

The same approach can be used to adjust the covariance matrix.

$\tau_{ik}$ can be constrained to be the same for all components in all GMMs and states; a typical value is between 2 and 20.
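A short worked example may make the role of the prior weight concrete. It assumes the standard MAP mean update and, for simplicity, that the occupancy probabilities sum to $N$ with occupancy-weighted data average $\bar{o}$; the numbers are illustrative only.

```latex
\hat{\mu}_{ik}
  = \frac{\tau_{ik}\,\mu_{nw_{ik}} + N\,\bar{o}}{\tau_{ik} + N}
  = \frac{\tau_{ik}}{\tau_{ik} + N}\,\mu_{nw_{ik}}
  + \frac{N}{\tau_{ik} + N}\,\bar{o}

% With \tau_{ik} = 10:
%   N = 10   gives a weight of 10/20  = 0.5 on the new data
%   N = 90   gives a weight of 90/100 = 0.9 on the new data
%   N \to \infty gives a weight approaching 1, i.e. the ML estimate \bar{o}
```

Written this way, the update is an interpolation between the S.I. mean and the speaker's own data, with the interpolation weight set by how much data has been seen relative to the prior weight.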

14
Maximum a Posteriori (MAP) Adaptation of Model Parameters

“MAP HMM can be regarded as an interpolated model between the speaker-independent and speaker-dependent HMM. Both are derived from the standard ML forward-backward algorithm.” (Huang, p. 447)

How much data is needed? Of course, more is better. Results have been reported for anywhere from a few utterances up to 600 utterances per new speaker.

Problem 1: MAP needs a (relatively) large amount of training data from the speaker being adapted to.

Problem 2: each state and component is updated independently. If the adaptation data contain no speech associated with a particular state and component, then that state still uses the S.I. model. It would be nice to update all the parameters of the model from a small amount of data.

15
Maximum Likelihood Linear Regression (MLLR)

The idea behind MLLR is to use a set of linear regression transformation functions to map the means (and maybe also the covariances) in order to maximize the likelihood of the adaptation data.

In other words, we want to find some linear transform (of the form ax+b) that warps the mean vector in such a way that the likelihood of the model given the new data, $O_{new}$, is maximized. (In the following, $o_t$ is one frame of $O_{new}$.)

Updating only the means is effective; updating the covariance matrix gives less than an additional 2% error reduction (Huang, p. 450) and so is less commonly done.

The same transformation can be shared among similar GMMs; this sharing allows the entire model to be updated faster and uniformly.

16
Maximum Likelihood Linear Regression (MLLR)

The mean vector for state i, component k can be transformed using the following equation:

$$\tilde{\mu}_{ik} = A_c\,\mu_{ik} + b_c$$

where $A_c$ is a regression matrix and $b_c$ is an additive bias vector; $A_c$ and $b_c$ are associated with a broad class of phonemes or set of tied states (not just an individual state), called c, to better share model parameters. We want to find $A_c$ and $b_c$ such that the mismatch with the new (speaker-specific) data is smallest.

We can re-write this as

$$\tilde{\mu}_{ik} = W_c\,\mu_{ik}$$

where $\mu_{ik}$ is rewritten as the extended vector $[1, \mu_{ik}^{T}]^{T}$, and we need to solve for $W_c$, which contains both $A_c$ and $b_c$, e.g. $W_c = [b_c, A_c]$.
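The equivalence between the affine form (regression matrix plus bias) and the single-matrix form acting on an extended mean vector can be checked numerically. This is a small sketch with made-up values; the helper names are hypothetical.

```python
def affine_transform(A, b, mu):
    """Compute A * mu + b for a matrix A (list of rows) and vectors b, mu."""
    return [sum(a * m for a, m in zip(row, mu)) + bias
            for row, bias in zip(A, b)]


def build_w(A, b):
    """Stack the bias as the first column: W = [b, A]."""
    return [[bias] + list(row) for row, bias in zip(A, b)]


def extend(mu):
    """Extended mean vector [1, mu_1, ..., mu_D]."""
    return [1.0] + list(mu)


def apply_w(W, extended_mu):
    """Compute W * extended_mu."""
    return [sum(w * x for w, x in zip(row, extended_mu)) for row in W]


A = [[1.1, 0.0], [0.2, 0.9]]
b = [0.5, -0.3]
mu = [2.0, 4.0]

print(affine_transform(A, b, mu))
print(apply_w(build_w(A, b), extend(mu)))
```

Folding the bias into the matrix means only one quantity has to be estimated per regression class.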

17
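Maximum Likelihood Linear Regression (MLLR)

Maximizing a Q function by setting its derivative to zero, in the same way as was done in Lecture 12, maximizes the likelihood of the adaptation data (Huang, pp. 448-449); this yields

$$\sum_{t=1}^{T}\sum_{(i,k)\in c}\gamma_t(i,k)\,\Sigma_{ik}^{-1}\,o_t\,\mu_{ik}^{T} = \sum_{t=1}^{T}\sum_{(i,k)\in c}\gamma_t(i,k)\,\Sigma_{ik}^{-1}\,W_c\,\mu_{ik}\,\mu_{ik}^{T}$$

which can be re-written as

$$Z = \sum_{(i,k)\in c} V_{ik}\,W_c\,D_{ik}$$

where

$$Z = \sum_{t=1}^{T}\sum_{(i,k)\in c}\gamma_t(i,k)\,\Sigma_{ik}^{-1}\,o_t\,\mu_{ik}^{T}, \qquad V_{ik} = \sum_{t=1}^{T}\gamma_t(i,k)\,\Sigma_{ik}^{-1}, \qquad D_{ik} = \mu_{ik}\,\mu_{ik}^{T}$$

and $\mu_{ik}$ denotes the extended mean vector $[1, \mu_{ik}^{T}]^{T}$.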

18
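Maximum Likelihood Linear Regression (MLLR)

If the covariance matrix $\Sigma_{ik}$ is diagonal, there is a closed-form solution for $W_c$:

$$W_q^{T} = G_q^{-1}\,Z_q^{T}$$

where the subscript q denotes the q-th row of the matrices $W_c$ and $Z$, and

$$G_q = \sum_{(i,k)\in c} v_{qq}\,D_{ik}$$

where $v_{qq}$ denotes the q-th diagonal element of $V_{ik}$ and $D_{ik} = \mu_{ik}\,\mu_{ik}^{T}$ is the outer product of the extended mean vector with itself.

We need to make sure that $G_q$ is invertible, by having enough training data. If there’s not enough data, we can tie more classes together.

This process can be iterated with new values for $\gamma_t(i,k)$ and the adapted means $\tilde{\mu}_{ik}$ in each iteration, but usually the first iteration gives most of the gain in performance.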
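The closed-form solution can be sketched for the simplest possible case: scalar features (D = 1), so that the transform has a single row W = [b, a] and G is a 2×2 matrix that can be inverted by hand. This is an illustrative toy under those assumptions, not a full MLLR implementation; the function name and the data layout (`gammas[t][k]` as the occupancy of Gaussian k at frame t) are hypothetical.

```python
def mllr_1d(means, variances, gammas, observations):
    """Estimate scale a and bias b so that a * mean + b fits the new data.

    Accumulates G = sum_k v_k * [1, mu_k][1, mu_k]^T and
    Z = sum_k sum_t gamma_t(k) / var_k * o_t * [1, mu_k],
    then solves the 2x2 system G [b, a]^T = Z^T.
    """
    g00 = g01 = g11 = 0.0
    z0 = z1 = 0.0
    for k, (mu, var) in enumerate(zip(means, variances)):
        occupancy = sum(g[k] for g in gammas)
        v = occupancy / var  # the single diagonal element of V for Gaussian k
        g00 += v
        g01 += v * mu
        g11 += v * mu * mu
        weighted_obs = sum(g[k] * o for g, o in zip(gammas, observations)) / var
        z0 += weighted_obs
        z1 += weighted_obs * mu
    det = g00 * g11 - g01 * g01
    if abs(det) < 1e-12:
        raise ValueError("G is singular: not enough adaptation data for this class")
    b = (z0 * g11 - g01 * z1) / det
    a = (g00 * z1 - g01 * z0) / det
    return a, b


# Toy data generated with a = 1.2, b = -0.5 and hard assignments.
means = [1.0, 3.0]
variances = [1.0, 2.0]
observations = [0.7, 0.7, 3.1]
gammas = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
a, b = mllr_1d(means, variances, gammas, observations)
print(round(a, 6), round(b, 6))
```

If only one Gaussian in the class receives data, the determinant is zero and the function raises an error, which corresponds to the invertibility condition on $G_q$ noted above; tying more Gaussians into the class supplies the extra constraints.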

19
Maximum Likelihood Linear Regression (MLLR)

Unsupervised adaptation can be done by (a) recognizing with a speaker-independent (S.I.) model, and then (b) assuming that these recognized results are correct, using these results as training data for adaptation. (In this case, confidence scores, which indicate which regions of speech are better recognized, may be helpful to constrain the training to adapt only to correctly recognized speech samples.)

MLLR and MAP can be combined for (slightly) better performance over either technique alone.

Also, the performance improvements from MLLR and VTLN are often approximately additive. For example, a 10% relative WER reduction from VTLN and a 15% relative WER reduction from MLLR in isolation yield roughly a 25% relative WER reduction from using both VTLN and MLLR (Pye and Woodland, ICASSP 97).
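The arithmetic behind the additive example can be sketched as below. The starting WER of 20% is an arbitrary illustrative value, and the compounded figure is included only as a point of comparison; it is not a result from the cited paper.

```python
def apply_relative_reduction(wer, reduction):
    """WER after a relative reduction (e.g. 0.10 for 10% relative)."""
    return wer * (1.0 - reduction)


def additive(reduction_a, reduction_b):
    """Combined relative reduction if the two gains simply add."""
    return reduction_a + reduction_b


def compounded(reduction_a, reduction_b):
    """Combined relative reduction if the second gain applies to the reduced WER."""
    return 1.0 - (1.0 - reduction_a) * (1.0 - reduction_b)


baseline_wer = 20.0  # percent; illustrative value only
vtln, mllr = 0.10, 0.15

print("additive:  ", apply_relative_reduction(baseline_wer, additive(vtln, mllr)))
print("compounded:", apply_relative_reduction(baseline_wer, compounded(vtln, mllr)))
```

The two ways of combining the gains differ by only 1.5 points of relative reduction here, which is why "approximately additive" is a reasonable summary at these magnitudes.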

20
Maximum Likelihood Linear Regression (MLLR)

One example of combining MAP and MLLR is from the Whisper system:

[Table of results not recovered.]

or 15% WER reduction using MLLR and a total 22% relative error reduction on 1000 utterances from combined MAP+MLLR. (The speaker-dependent system was trained on the 1000 utterances from that speaker.)


