
1 Codebook-based Feature Compensation for Robust Speech Recognition 2007/02/08 Shih-Hsiang Lin ( 林士翔 ) Graduate Student National Taiwan Normal University, Taipei, Taiwan

2 Outline
– Introduction
– Codebook-based Cepstral Compensation
   – With Both Clean and Noisy (Stereo) Data: Stereo-based Piecewise Linear Compensation (SPLICE), SNR Dependent Cepstral Normalization (SDCN), Codeword Dependent Cepstral Normalization (CDCN), Probabilistic Optimum Filtering (POF)
   – With Only Noisy Data: Maximum Likelihood based Stochastic Vector Mapping (ML-SVM), Maximum Mutual Information-SPLICE (MMI-SPLICE), Minimum Classification Error based SVM (MCE-SVM), Stochastic Matching
– Conclusions

3 Introduction
Speech recognition performance degrades seriously when there is a mismatch between the training and test acoustic conditions; however, training systems for all possible noise conditions is impractical.
A Simplified Distortion Framework
– Channel effects (convolutional noise) are usually assumed to be constant within an utterance
– Additive noise can be either stationary or non-stationary
(Diagram: clean speech passes through the channel effect and is then corrupted by additive background noise to yield the noisy speech.)

4 Introduction (cont.)
Non-linear Environmental Distortions
– Clean speech was corrupted by 10 dB subway noise
– Not only linear but also non-linear distortions were involved
(Figure: clean vs. noisy scatter plots of the Mel filter-bank output, a cepstral coefficient, and the log energy.)

5 Introduction (cont.)
Two main approaches to improving noise robustness
– Model Compensation: adapt the acoustic models to match the corrupted speech features
– Feature Compensation: restore the corrupted speech features to their corresponding clean ones
(Diagram: feature compensation maps noisy speech toward the clean feature space of the training conditions, while model compensation maps the clean acoustic models toward the noisy model space of the test conditions.)

6 Introduction (cont.)
Model Compensation
– Adapt the acoustic models to match the corrupted speech features
– Representative approaches: MAP, MLLR, MAPLR, etc.
Feature Compensation
– Restore the corrupted speech features to their corresponding clean ones
– Representative approaches: SS, CMN, CMVN, etc.
(Diagram: model compensation turns clean models into corrupted models; feature compensation applies a compensation model to the corrupted speech features to obtain clean speech features.)

7 Theme of Presented Compensation Approaches
Codebook-based Cepstral Compensation
– All the presented methods involve the utilization of vector quantization (VQ)
(Diagram: an incoming feature vector is matched against a universal codebook, the corresponding correction vector is found, and compensation is applied.)
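The shared pipeline above (quantize the feature vector against a universal codebook, then apply that codeword's correction vector) can be sketched as follows. This is a minimal NumPy illustration with a hard nearest-neighbour assignment and made-up values; the methods presented in the following slides replace the hard assignment with posterior-weighted soft assignments.

```python
import numpy as np

def nearest_codeword(y, codebook):
    """Index of the codeword closest to feature vector y (hard VQ)."""
    return int(np.argmin(np.linalg.norm(codebook - y, axis=1)))

def vq_compensate(Y, codebook, corrections):
    """Generic codebook-based compensation: for each noisy frame,
    look up its codeword and add that codeword's correction vector."""
    idx = [nearest_codeword(y, codebook) for y in Y]
    return Y + corrections[idx]
```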

8 Theme of Presented Compensation Approaches (cont.)
With Both Clean and Noisy (Stereo) Data
– SDCN (Data-Driven) and CDCN (Model-Based) (Acero and Stern, 1990)
– FCDCN (Data-Driven) (Acero and Stern, 1991)
– POF (Data-Driven) (Neumeyer and Weintraub, 1994)
– SPLICE (Data-Driven) (Deng, Acero et al., 2000)
With Only Noisy Data
– Stochastic Matching (Model-Based) (Sankar and Lee, 1996)
– ML-SVM (Model-Based) (Wu, Huo and Zhu, 2005)
– MMI-SPLICE (Model-Based) (Droppo and Acero, 2005)
– MCE-SVM (Model-Based) (Wu and Huo, 2006)
– Unsupervised ML-SVM (Model-Based) (Zhu and Huo, 2007)

9 Stereo-based Piecewise Linear Compensation (SPLICE) (Deng et al., 2000)
– A mixture of Gaussians is trained on the noisy feature space; each Gaussian component acts as a codeword
– For each Gaussian component (codeword), a linear relationship between the clustered noisy data and its corresponding clean counterpart is assumed
– A correction vector (bias) is then estimated for each codeword

10 SPLICE (cont.)
The success of SPLICE has its roots in two assumptions:
1. The noisy cepstral feature vectors y follow a mixture-of-Gaussians distribution, p(y) = Σ_k p(k) N(y; μ_k, Σ_k), which can be thought of as a "codebook" with a total of K codewords
2. The conditional probability density function for a clean vector x, given its corresponding noisy vector y and the cluster index k, is a Gaussian whose mean vector is a linear transformation of y: p(x | y, k) = N(x; y + r_k, Γ_k) (only a mean shift, without rotation, is assumed here)
The correction vector (or bias) r_k can then be estimated as the MMSE estimate given the clean training speech vectors and their noisy counterparts

11 SPLICE (cont.)
Therefore, the clean feature vector can be restored from its corresponding noisy feature vector as a linear weighted sum of all codeword-dependent compensated vectors:
x̂ = E[x | y] = Σ_k p(k | y) (y + r_k)
Or, since y is an (observed) known constant vector here, we can alternatively (simply) use the equivalent form
x̂ = y + Σ_k p(k | y) r_k

12 SPLICE (cont.)
The MMSE estimate of the correction vector (bias) can be expressed by the following formula:
r_k = Σ_n p(k | y_n) (x_n − y_n) / Σ_n p(k | y_n)
where p(k | y_n) is the posterior probability of codeword k given the noisy training vector y_n
– The use of stereo (both clean and noisy) training data provides an accurate estimate of the correction vectors
SPLICE has good potential to effectively handle a wide range of distortions
– Nonstationary distortion, jointly additive and/or convolutional distortion, and even nonlinear distortion of the original speech signal
However, in many ASR applications, stereo data are too expensive to collect
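As a concrete illustration of the two SPLICE formulas above (the MMSE bias estimate from stereo data and the posterior-weighted compensation), here is a minimal NumPy sketch. The GMM over the noisy features is assumed to be already trained, with diagonal covariances; all function names are hypothetical.

```python
import numpy as np

def gmm_posteriors(Y, weights, means, variances):
    """Posterior p(k|y) per frame under a diagonal-covariance GMM."""
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        ll = -0.5 * np.sum((Y - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
        log_probs.append(np.log(w) + ll)
    log_probs = np.stack(log_probs, axis=1)            # (frames, K)
    log_probs -= log_probs.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_probs)
    return post / post.sum(axis=1, keepdims=True)

def estimate_splice_biases(X_clean, Y_noisy, weights, means, variances):
    """MMSE bias per codeword: r_k = sum_n p(k|y_n)(x_n - y_n) / sum_n p(k|y_n)."""
    post = gmm_posteriors(Y_noisy, weights, means, variances)  # (N, K)
    num = post.T @ (X_clean - Y_noisy)                         # (K, D)
    return num / post.sum(axis=0)[:, None]

def splice_compensate(Y_noisy, biases, weights, means, variances):
    """Restoration: x_hat = y + sum_k p(k|y) r_k."""
    post = gmm_posteriors(Y_noisy, weights, means, variances)
    return Y_noisy + post @ biases
```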

13 SNR Dependent Cepstral Normalization (SDCN) (Acero et al., 1990)
Notice that in the simplified distortion framework, the noisy cepstral feature vector can be expressed by
y = x + q + C log(1 + exp(C^{-1}(n − x − q)))
where C is the discrete cosine transform matrix applied to the log filter-bank bins, x is the clean cepstral vector, q the channel effect, and n the additive noise
– The last term is a non-linear function of x, n and q, and it is very difficult to estimate
Therefore, SDCN attempts to restore the clean cepstral vector using a compensation vector that jointly approximates q and the non-linear noise term for different SNR levels

14 SDCN (cont.)
A schematic depiction of SDCN
– The (instantaneous) SNR is calculated in proportion to the difference between the C[0] of the input frame and the C[0] of the noise at a reference time
– The universal codebook quantizes the SNR axis: each codeword represents a specific SNR level, and for each SNR level (codeword) a compensation vector is estimated

15 SDCN (cont.)
Therefore, a compensation vector that depends entirely on the instantaneous SNR of the observed feature vector can be used to restore the clean one:
x̂ = y − w(SNR(y))
Two extreme cases:
– High SNR: the compensation vector mainly removes the channel effect
– Low SNR: the compensation vector mainly removes the additive noise effect

16 SDCN (cont.)
The compensation vectors were estimated in the MMSE sense using the following equation:
w_l = Σ_n δ(l, SNR_n) (y_n − x_n) / Σ_n δ(l, SNR_n)
where y_n are the cepstral vectors for the test (noisy) condition, x_n are the cepstral vectors for the standard acoustical (clean) condition, δ(·,·) is the Kronecker delta function, and SNR_n is the instantaneous SNR level of frame n
Disadvantage
– For a new test environment, the compensation vectors have to be re-estimated using a new set of sufficient amounts of stereo data
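The SDCN recipe above (bin frames by instantaneous SNR, average y − x per bin over stereo data, subtract at run time) might be sketched as follows. The SNR computation from C[0] and the bin layout are simplifications, and the function names are hypothetical.

```python
import numpy as np

def snr_bins(Y_noisy, noise_c0, n_bins=10, snr_range=(0.0, 30.0)):
    """Map each frame to an SNR codeword: the instantaneous SNR is taken
    proportional to C[0] of the frame minus C[0] of the noise reference."""
    snr = Y_noisy[:, 0] - noise_c0
    edges = np.linspace(snr_range[0], snr_range[1], n_bins - 1)
    return np.digitize(snr, edges)          # bin index per frame

def estimate_sdcn_vectors(X_clean, Y_noisy, bins, n_bins=10):
    """Per-SNR-level MMSE compensation vector: mean of (y - x) in each bin."""
    W = np.zeros((n_bins, Y_noisy.shape[1]))
    for l in range(n_bins):
        mask = bins == l
        if mask.any():
            W[l] = (Y_noisy[mask] - X_clean[mask]).mean(axis=0)
    return W

def sdcn_compensate(Y_noisy, W, bins):
    """Restoration: x_hat = y - w(SNR(y))."""
    return Y_noisy - W[bins]
```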

17 Codeword Dependent Cepstral Normalization (CDCN) (Acero et al., 1990)
A schematic depiction of CDCN
– The distribution of clean speech is modeled by a mixture of Gaussians, which can be regarded as a kind of phone-dependent distribution
– Given a noisy speech utterance, the noise and channel vectors corresponding to each Gaussian can be estimated

18 CDCN (cont.)
CDCN first models the distribution of the cepstral feature vectors of clean speech by a mixture of Gaussian distributions:
p(x) = Σ_k p(k) N(x; μ_k, Σ_k)
Secondly, CDCN assumes the conditional probability density function for a clean vector x, given its corresponding noisy vector y and the Gaussian index k, is a Gaussian distribution

19 CDCN (cont.)
The CDCN algorithm is then conducted in two steps:
1. The noise and channel vectors are estimated by MLE (the clean-speech Gaussian index is phone-dependent, while the noise and channel vectors are assumed phone-independent)
2. The expected restored cepstral vector is calculated as a linear weighted sum of all Gaussian-dependent expected clean cepstral vectors:
x̂ = E[x | y] = Σ_k p(k | y) E[x | y, k]

20 CDCN (cont.)
The ML estimates of the noise and channel vectors maximize the likelihood of the observed noisy utterance
– To obtain the optimum values, we take the derivatives of the likelihood w.r.t. the noise and channel vectors respectively and set them to zero

21 Probabilistic Optimum Filtering (POF) (Neumeyer et al., 1994)
POF is based on a probabilistic piecewise-linear transform of the acoustic space
– A VQ algorithm is used to partition the clean feature space into clusters
– Each VQ region is assigned a multidimensional transversal filter

22 POF (cont.)
The error between the clean vector and the estimated vector produced by the i-th filter is given by
e_{i,n} = x_n − W_i^T ŷ_n
where ŷ_n is the noisy input to the filter (the noisy feature vector, possibly augmented with a bias term)
The conditional error in each region is defined as
E_i = Σ_n p(i | z_n) ‖e_{i,n}‖²
– where p(i | z_n) is the probability that the clean vector belongs to cluster i given an arbitrary characteristic vector z_n
The characteristic vector can be any acoustic information cue generated from each frame of the speech utterance
– e.g. instantaneous SNR, energy, cepstral coefficients

23 POF (cont.)
To compute the optimum filters in the MMSE sense, we minimize the error in each cluster
– W_i can be obtained by taking the gradient of E_i with respect to it and then equating the gradient to zero; accordingly, W_i has the form of a weighted least-squares solution,
W_i = (Σ_n p(i | z_n) ŷ_n ŷ_n^T)^{-1} Σ_n p(i | z_n) ŷ_n x_n^T
The run-time estimate of the clean feature vector can be computed by integrating the outputs of all the filters as follows:
x̂_n = Σ_i p(i | z_n) W_i^T ŷ_n
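The per-cluster MMSE filter estimation above is a weighted least-squares problem. A minimal NumPy sketch, assuming the cluster probabilities p(i | z_n) are given and the noisy vector is augmented with a constant 1 to absorb the bias term; function names are hypothetical:

```python
import numpy as np

def fit_pof_filters(X_clean, Y_aug, posts):
    """Per-cluster MMSE filters via weighted normal equations.
    Y_aug: noisy vectors with a trailing 1 (bias term), shape (N, D+1).
    posts: p(i | z_n), shape (N, I)."""
    filters = []
    for i in range(posts.shape[1]):
        w = posts[:, i]
        A = (Y_aug * w[:, None]).T @ Y_aug    # weighted autocorrelation
        b = (Y_aug * w[:, None]).T @ X_clean  # weighted cross-correlation
        filters.append(np.linalg.solve(A, b)) # (D+1, D_out) filter, cluster i
    return np.stack(filters)

def pof_estimate(Y_aug, posts, filters):
    """Run-time estimate: x_hat_n = sum_i p(i|z_n) W_i^T y_n."""
    outs = np.einsum('nd,idk->nik', Y_aug, filters)  # per-cluster outputs
    return np.einsum('ni,nik->nk', posts, outs)
```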

24 Maximum Likelihood based Stochastic Vector Mapping (ML-SVM) (Wu et al., 2005)
A schematic depiction of ML-SVM
– The distribution of noisy speech is partitioned into E environmental clusters, each modeled by a GMM
– For each environmental cluster, a set of correction vectors is estimated
– The acoustic models are trained on the (compensated) noisy speech

25 ML-SVM (cont.)
Suppose the (noisy) training data can be partitioned into E environmental clusters, while each cluster e is modeled as a mixture of Gaussians
Given a set of acoustic models, the aim of SVM is to estimate the restored clean feature vector x̂ from the noisy one y by applying an environment-dependent transformation F_e(y)

26 ML-SVM (cont.)
The SVM function can take one of several forms; a commonly used bias-only form is
F_e(y) = y + Σ_k p(k | y, e) r_k^(e)
During recognition, given an unknown utterance
– The most proximal environmental cluster e is first identified
– Then, the corresponding GMM and the mapping function F_e are used to derive a compensated version x̂ of the noisy vector y
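The recognition-time procedure (pick the most proximal environmental cluster by GMM likelihood, then apply that cluster's bias-only mapping) could look like this. Diagonal-covariance GMMs and the bias-only mapping form are assumed, and the function names are hypothetical.

```python
import numpy as np

def log_gmm(Y, weights, means, variances):
    """Per-frame log p(y) and per-component log joint under a diagonal GMM."""
    comp = []
    for w, mu, var in zip(weights, means, variances):
        ll = -0.5 * np.sum((Y - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
        comp.append(np.log(w) + ll)
    comp = np.stack(comp, axis=1)                     # (frames, K)
    m = comp.max(axis=1, keepdims=True)
    total = (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).ravel()
    return total, comp

def select_environment(Y, env_gmms):
    """Pick the cluster whose GMM gives the utterance the highest
    total log-likelihood (the 'most proximal' environment)."""
    scores = [log_gmm(Y, *g)[0].sum() for g in env_gmms]
    return int(np.argmax(scores))

def svm_compensate(Y, gmm, biases):
    """Bias-only SVM mapping: x_hat = y + sum_k p(k|y) r_k."""
    _, comp = log_gmm(Y, *gmm)
    comp -= comp.max(axis=1, keepdims=True)
    post = np.exp(comp)
    post /= post.sum(axis=1, keepdims=True)
    return Y + post @ biases
```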

27 ML-SVM (cont.)
A flowchart of the joint maximum likelihood training of SVM

28 ML-SVM (cont.)
The detailed procedure depicted in the above flowchart is as follows
Step 1: Initialization — a set of HMM acoustic models with diagonal covariance matrices, trained on multi-condition training data, is used as the initial model set, and the initial bias vectors are set to zero vectors
Step 2: Estimating the SVM Function Parameters — given the HMM acoustic model parameters, for each environmental class, N_b iterations of EM training are performed to estimate the environment-dependent mapping-function parameters so as to increase the likelihood function

29 ML-SVM (cont.)
Let us consider a particular environmental class e. The auxiliary (Q) function for its mapping-function parameters is formed from the occupation probability γ_t(s, m) of Gaussian component m of state s at time t
By setting the derivatives of the Q function w.r.t. the bias vectors r_k to zero, the update equations are obtained

30 ML-SVM (cont.)
Since the above equation holds for all k, it is equivalent to solving for the bias vectors in a linear system A r = c, where A is a K × K matrix whose (k, k')-th element is accumulated from the occupation statistics and c is a K-dimensional vector
– The estimation needs an inverse operation on the K × K matrix

31 ML-SVM (cont.)
If the simpler bias-only form of the SVM function is used for feature compensation, the EM training formula for the bias vectors can be derived similarly, with a much simpler updating form
Step 3: Estimating the HMM Acoustic Model Parameters — each training utterance is transformed using its respective mapping function; given the environment-compensated utterances, N_h EM iterations are then performed to re-estimate the HMM acoustic model parameters so as to increase the likelihood function
Step 4: Repeat Step 2 and Step 3 N_e times

32 Maximum Mutual Information-SPLICE (MMI-SPLICE) (Droppo et al., 2005)
MMI-SPLICE is much like SPLICE, but without the need for target clean feature vectors (no stereo data required)
– MMI-SPLICE learns to increase recognition accuracy directly, using a maximum mutual information (MMI) objective function
– The MMI objective function is the log-posterior probability of the correct transcription given the compensated features
– The global objective function is a linear sum of the objective functions for the individual training utterances

33 MMI-SPLICE (cont.)
The transformation parameters can be estimated by gradient-based methods
– Any gradient ascent method can be used (e.g. conjugate gradient or BFGS)
– Since every compensated feature vector enters the objective through many conditional probabilities of HMM states, the gradient is accumulated over the states via the chain rule
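To make the idea concrete, here is a toy illustration of discriminative bias training without clean targets: a single bias vector is adjusted by (numerical) gradient ascent so as to raise the posterior probability of the correct class under two fixed Gaussian class models. This is only an analogy to the MMI objective, not the actual HMM-based gradient; everything here is made up for illustration.

```python
import numpy as np

def log_gauss(X, mu, var):
    """Per-row log N(x; mu, var*I)."""
    return -0.5 * ((X - mu) ** 2 / var + np.log(2 * np.pi * var)).sum(axis=1)

def mmi_objective(r, X, mu_correct, mu_wrong, var=1.0):
    """Sum of log-posteriors of the correct class for bias-compensated X."""
    Xc = X + r
    la = log_gauss(Xc, mu_correct, var)
    lb = log_gauss(Xc, mu_wrong, var)
    return (la - np.logaddexp(la, lb)).sum()

def train_bias(X, mu_correct, mu_wrong, lr=0.05, steps=200):
    """Toy gradient ascent on the bias, using a central-difference gradient."""
    r = np.zeros(X.shape[1])
    eps = 1e-5
    for _ in range(steps):
        grad = np.zeros_like(r)
        for d in range(r.size):
            e = np.zeros_like(r)
            e[d] = eps
            grad[d] = (mmi_objective(r + e, X, mu_correct, mu_wrong)
                       - mmi_objective(r - e, X, mu_correct, mu_wrong)) / (2 * eps)
        # average the gradient over frames so the step size stays moderate
        r += lr * grad / len(X)
    return r
```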

34 MMI-SPLICE (cont.)

35 Minimum Classification Error based SVM (MCE-SVM) (Wu et al., 2006)
Classification Error Function
– d(Z_c) = −g(Z_c) + G(Z_c), where g is the discriminant function for recognition decision making (the log-likelihood of the current enhanced feature vector sequence generated by the current HMMs of the word string Z_c) and G is the anti-discriminant function (the log-likelihood of the current enhanced feature vector sequence generated by the HMMs of the competitive word strings)
A continuous loss function is then defined by passing d through a sigmoid function
Objective function: the sum of the losses over all training utterances

36 MCE-SVM (cont.)
Let Λ denote generically the parameters to be estimated; Λ is updated by gradient descent on the loss (generalized probabilistic descent)
In order to find the gradient, the chain rule is used: the partial derivative of the loss is decomposed into the derivative of the sigmoid loss w.r.t. d times the derivative of d w.r.t. Λ
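The sigmoid loss and the descent update can be illustrated on a one-dimensional toy problem: a scalar bias added to the feature is trained to lower the misclassification measure between a correct and a competing unit-variance Gaussian model. This is a made-up toy, not the paper's full update rule.

```python
import numpy as np

def mce_loss(d, alpha=1.0):
    """Continuous sigmoid loss over the misclassification measure d = -g + G."""
    return 1.0 / (1.0 + np.exp(-alpha * d))

def train_bias_gpd(x, mu_correct, mu_wrong, eps=0.2, steps=100, alpha=1.0):
    """Gradient-descent training of a scalar bias b added to the feature x.
    g = log-likelihood under the correct unit-variance Gaussian,
    G = log-likelihood under the competitor, d = -g + G, loss = sigmoid(d)."""
    b = 0.0
    for _ in range(steps):
        z = x + b
        d = 0.5 * (z - mu_correct) ** 2 - 0.5 * (z - mu_wrong) ** 2
        l = mce_loss(d, alpha)
        # chain rule: dl/db = alpha * l * (1 - l) * dd/db
        dd_db = (z - mu_correct) - (z - mu_wrong)
        b -= eps * alpha * l * (1 - l) * dd_db
    return b
```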

37 MCE-SVM (cont.)
The remaining partial derivative ∂d/∂Λ is formulated differently depending on the parameters to be optimized
– Updating of the HMM acoustic model parameters is the same as in standard MCE training
– For the mapping-function parameters, the derivative is propagated through the SVM function

38 Stochastic Matching (Sankar et al., 1996)
The mismatch between the corrupted speech Y and the HMM acoustic models Λ can be reduced in two ways
– Feature-space transformation: find an inverse distortion function F_ν that maps Y into X̂, which matches better with the models Λ
– Model-space transformation: find a model transformation function G_η that maps Λ to the transformed models Λ̂, which match better with Y
The stochastic matching algorithm operates only on the given test utterance and the given set of HMM acoustic models
– No additional training data are required for the estimation of the mismatch prior to actual testing

39 Stochastic Matching (cont.)
In the feature space, we need to find the transformation parameters ν that maximize the likelihood of the transformed utterance:
ν̂ = argmax_ν p(Y | ν, Λ)
Let S be the set of all possible state sequences and C the set of all possible mixture-component sequences; the likelihood can then be written as
p(Y | ν, Λ) = Σ_S Σ_C p(Y, S, C | ν, Λ)
In general, it is not easy to estimate ν directly, but for some forms of F_ν we can use the EM algorithm to estimate it

40 Stochastic Matching (cont.)
Expectation (E-Step): compute the auxiliary function Q(ν | ν̄) = E[log p(Y, S, C | ν, Λ) | Y, ν̄, Λ]
Maximization (M-Step): choose ν̂ = argmax_ν Q(ν | ν̄)
For simplification, we assume
– …
– The covariance matrices are diagonal

41 Stochastic Matching (cont.)
The auxiliary function can now be written in terms of the Gaussian occupation probabilities and the transformed features
For the estimation of the transformation parameters, we take the derivatives of Q w.r.t. them respectively and set them to zero

42 Stochastic Matching (cont.)
We now consider a special case of the transformation: a simple cepstral bias, x_t = F_ν(y_t) = y_t − b
We may model the bias as either fixed for an utterance or varying with time (depending on the state)
– Fixed bias: a single bias vector b is estimated for the whole utterance
– State-dependent bias: a separate bias vector b_s is estimated for each state s
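For the fixed-bias special case with diagonal covariances, the M-step reduces to a posterior-weighted, variance-normalized average of y_t − μ. A minimal NumPy sketch of one such M-step, assuming the occupation probabilities have already been computed in the E-step and the distortion model y_t = x_t + b:

```python
import numpy as np

def fixed_bias_ml(Y, gamma, means, variances):
    """One M-step of the fixed-bias estimate under diagonal Gaussians:
    b_d = sum_{t,m} gamma[t,m] (Y[t,d] - mu[m,d]) / var[m,d]
          / sum_{t,m} gamma[t,m] / var[m,d]
    Y: (T, D) noisy frames; gamma: (T, M) occupation probabilities;
    means, variances: (M, D) Gaussian parameters of the clean models."""
    inv_var = 1.0 / variances                       # (M, D)
    G = gamma @ inv_var                             # (T, D): sum_m gamma/var
    num = (G * Y).sum(axis=0) \
          - (gamma.sum(axis=0)[:, None] * inv_var * means).sum(axis=0)
    den = G.sum(axis=0)
    return num / den
```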

43 Conclusions
In this presentation, we have presented several codebook-based cepstral normalization methods
– Using either stereo data or noisy data only
– With various optimization criteria: Maximum Likelihood (ML), Minimum Classification Error (MCE), Maximum Mutual Information (MMI)
Further studies
– Exploitation of other optimization criteria, e.g. the Minimum Phone/Word Error (MPE/MWE) criteria
– Do the methods retain similar effectiveness in LVCSR applications?
– Utilization of the distribution characteristics of speech feature vectors, e.g. in combination with Histogram EQualization (HEQ) approaches

44 Conclusions (cont.)
Comparison of various codebook-based feature compensation methods:

Method              | Stereo Data | Codebook              | Optimization Criterion | Bias Estimation | Training Complexity | Test Complexity
SPLICE              | Yes         | Noisy Data            | MMSE                   | Data-Driven     | –                   | Low
SDCN                | Yes         | SNR                   | MMSE                   | Data-Driven     | –                   | Low
CDCN                | Yes         | Clean Data            | ML & MMSE              | Model-Driven    | N/A                 | High
POF                 | Yes         | Characteristic Vector | MMSE                   | Data-Driven     | High                | Low
ML-SVM              | No          | Noisy Data            | ML                     | Model-Driven    | High                | Low
MMI-SPLICE          | No          | Noisy Data            | MMI                    | Model-Driven    | High                | Low
MCE-SVM             | No          | Noisy Data            | MCE                    | Model-Driven    | High                | Low
Stochastic Matching | No          | N/A                   | ML                     | Model-Driven    | High                | Low

45 References
SDCN, CDCN, FCDCN
– A. Acero and R. M. Stern, "Environmental Robustness in Automatic Speech Recognition," in Proc. ICASSP 1990
– A. Acero and R. M. Stern, "Robust Speech Recognition by Normalization of the Acoustic Space," in Proc. ICASSP 1991
POF
– L. Neumeyer and M. Weintraub, "Probabilistic Optimum Filtering for Robust Speech Recognition," in Proc. ICASSP 1994
SPLICE
– L. Deng, A. Acero, M. Plumpe and X. Huang, "Large-Vocabulary Speech Recognition under Adverse Acoustic Environments," in Proc. ICSLP 2000
– L. Deng, A. Acero, L. Jiang, J. Droppo and X.-D. Huang, "High-Performance Robust Speech Recognition Using Stereo Training Data," in Proc. ICASSP 2002

46 References (cont.)
– J. Droppo, A. Acero and L. Deng, "Evaluation of the SPLICE Algorithm on the Aurora2 Database," in Proc. EuroSpeech 2001
Stochastic Matching
– A. Sankar and C.-H. Lee, "A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition," IEEE Trans. Speech and Audio Processing, 1996
ML-SVM
– J. Wu, Q. Huo and D. Zhu, "An Environment Compensated Maximum Likelihood Training Approach Based on Stochastic Vector Mapping," in Proc. ICASSP 2005
– Q. Huo and D. Zhu, "A Maximum Likelihood Training Approach to Irrelevant Variability Compensation Based on Piecewise Linear Transformations," in Proc. ICSLP 2006
– D. Zhu and Q. Huo, "A Maximum Likelihood Approach to Unsupervised Online Adaptation of Stochastic Vector Mapping Function for Robust Speech Recognition," in Proc. ICASSP 2007

47 References (cont.)
MCE-SVM
– J. Wu and Q. Huo, "An Environment-Compensated Minimum Classification Error Training Approach Based on Stochastic Vector Mapping," IEEE Trans. Audio, Speech and Language Processing 14(6), 2006
MMI-SPLICE
– J. Droppo and A. Acero, "Maximum Mutual Information SPLICE Transform for Seen and Unseen Conditions," in Proc. EuroSpeech 2005
– J. Droppo, M. Mahajan, A. Gunawardana and A. Acero, "How to Train a Discriminative Front End with Stochastic Gradient Descent and Maximum Mutual Information," in Proc. ASRU 2005
Others
– H. Liao and M. J. F. Gales, "Joint Uncertainty Decoding for Noise Robust Speech Recognition," in Proc. EuroSpeech 2005

