Feature Transformation and Normalization Presented by Howard Reference: Springer Handbook of Speech Processing, 3.3 Environment Robustness (J. Droppo, A. Acero)


2 Feature Moment Normalization The goal of feature normalization is to apply a transformation to the incoming observation features. – This transformation should eliminate variabilities unrelated to the transcription. Even if you do not know how the ASR features have been corrupted, it is possible to normalize them to reduce the effects of the corruption. Techniques using this approach include cepstral mean normalization, cepstral mean and variance normalization, and cepstral histogram normalization.

3 Automatic Gain Normalization Another type of normalization affects only the energy-like features of each frame. Automatic gain normalization (AGN) is used to ensure that the speech occurs at the same absolute signal level, regardless of the incoming level of background noise or SNR. It is sometimes beneficial to use AGN on the energy-like features, and the more-general moment normalization on the rest.
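
As a concrete illustration, one simple per-utterance variant of AGN shifts the log-energy track so that its maximum is zero, putting the loudest frames at the same absolute level regardless of the utterance's overall gain. The exact gain rule is an assumption for this sketch, not a definition from the handbook:

```python
import numpy as np

def agn(log_energy):
    """Automatic gain normalization (one common per-utterance variant):
    shift the log-energy track so its maximum is zero, so that the
    loudest speech frames land at the same absolute level in every
    utterance, whatever its recording gain."""
    return log_energy - log_energy.max()
```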

4 Cepstral Mean Normalization Cepstral mean normalization consists of subtracting the mean feature vector μ from each vector x_t to obtain the normalized vector x̂_t = x_t − μ. As a result, the long-term average of any observation sequence (the first moment) is zero.
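
The subtraction above can be sketched in a few lines, assuming the features arrive as a frames-by-dimensions NumPy array:

```python
import numpy as np

def cmn(features):
    """Cepstral mean normalization: subtract the utterance-level mean
    vector from every frame. `features` is a (frames, dims) array of
    cepstral vectors; the returned sequence has zero mean per dimension."""
    return features - features.mean(axis=0)
```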

5 Cepstral Mean Normalization As long as these convolutional distortions have a time constant that is short with respect to the front end's analysis window length, and do not suppress large regions of the spectrum below the noise floor (e.g., a severe low-pass filter), CMN can virtually eliminate their effects. As the filter length h[m] grows, the approximation of the filter as a constant cepstral offset becomes less accurate, and CMN is less effective in removing the convolutional distortion.
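
A small sanity check of this property: a short fixed channel appears as a constant offset in the cepstral domain, and mean subtraction removes such an offset exactly (synthetic data, for illustration only):

```python
import numpy as np

# A short fixed channel adds a constant vector in the cepstral domain.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))      # synthetic "clean" cepstral frames
channel = rng.normal(size=13)           # hypothetical channel offset
filtered = clean + channel              # convolution -> constant cepstral shift

def cmn(x):
    """Subtract the per-dimension utterance mean."""
    return x - x.mean(axis=0)

# Mean subtraction cancels the constant shift: both normalized
# sequences are identical.
assert np.allclose(cmn(filtered), cmn(clean))
```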

6 CMN VS. AGN In most cases, using AGN is better than applying CMN on the energy term. The failure of CMN on the energy feature is most likely due to the randomness it induces on the energy of noisy speech frames. AGN tends to put noisy speech at the same level regardless of SNR, which helps the recognizer make sharp models. On the other hand, CMN will make the energy term smaller in low-SNR utterances and larger in high-SNR utterances, leading to less-effective speech models.

7 CMN VS. AGN in different stages One option is to use CMN on the static cepstra, before computing the dynamic cepstra. Because of the nature of CMN, this is equivalent to leaving the dynamic cepstra untouched. The other option is to use CMN on the full feature vector, after dynamic cepstra have been computed from the unnormalized static cepstra. The following table shows that it is slightly better to apply the normalization to the full feature vectors.

8 Cepstral Variance Normalization Cepstral variance normalization (CVN) is similar to CMN, and the two are often paired as cepstral mean and variance normalization (CMVN). CMVN uses both the sample mean μ and standard deviation σ to normalize the cepstral sequence: x̂_t = (x_t − μ)/σ. After normalization, the mean of the cepstral sequence is zero, and it has a variance of one.
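
The normalization formula maps directly to code (per-dimension statistics over a frames-by-dimensions array):

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization: per dimension, subtract
    the sample mean and divide by the sample standard deviation, so the
    output has zero mean and unit variance in every dimension."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / sigma
```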

9 Cepstral Variance Normalization Unlike CMN, CVN is not associated with addressing a particular type of distortion. It can, however, be shown empirically that it provides robustness against acoustic channels, speaker variability, and additive noise. As with CMN, CMVN is best applied to the full feature vector, after the dynamic cepstra have been computed. Unlike CMN, the tables show that applying CMVN to the energy term is often better than using whole-utterance AGN.

10 Cepstral Variance Normalization Because CMVN both shifts and scales the energy term, both the noisy speech and the noise are placed at consistent absolute levels.

11 Cepstral Histogram Normalization Cepstral histogram normalization (CHN) takes the core ideas behind CMN and CVN, and extends them to their logical conclusion. Instead of normalizing only the first or second central moments, CHN modifies the signal such that all of its moments are normalized. As with CMN and CVN, a one-to-one transformation is independently applied to each dimension of the feature vector.

12 Cepstral Histogram Normalization The first step in CHN is choosing a desired distribution for the data, p_x(x). It is common to choose a Gaussian distribution with zero mean and unit covariance. Let p_y(y) represent the actual distribution of the data to be transformed. It can be shown that the following function f(·) applied to y produces features with the probability density function (PDF) p_x(x): f(y) = F_x^{-1}(F_y(y)). Here, F_y(y) is the cumulative distribution function (CDF) of the test data, and F_x^{-1}(·) is the inverse CDF of the target distribution.

13 Cepstral Histogram Normalization Applying F_y(·) to y transforms the data distribution from p_y(y) to a uniform distribution. Subsequent application of F_x^{-1}(·) imposes a final distribution of p_x(x). When the target distribution is chosen to be Gaussian as described above, the final sequence has zero mean and unit covariance, just as if CMVN were used. First, the data is transformed so that it has a uniform distribution.

14 Cepstral Histogram Normalization The second and final step consists of transforming the uniformly distributed data so that it has a Gaussian distribution. This can be accomplished, as in (33.11), using the inverse Gaussian CDF F_x^{-1}: x̂ = F_x^{-1}(u), where u = F_y(y) is the uniformly distributed output of the first step.
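
The two steps can be sketched together for a single feature dimension. Estimating F_y with the empirical (rank-based) CDF is an implementation choice assumed here, not prescribed by the handbook; the inverse Gaussian CDF comes from the Python standard library:

```python
import numpy as np
from statistics import NormalDist

def chn(column):
    """Cepstral histogram normalization for one feature dimension.
    Step 1: map each value through the empirical CDF, which yields
    values uniformly distributed on (0, 1).
    Step 2: map those through the inverse standard-normal CDF,
    imposing a zero-mean, unit-variance Gaussian distribution."""
    n = len(column)
    ranks = np.argsort(np.argsort(column))   # rank of each value: 0 .. n-1
    u = (ranks + 0.5) / n                    # empirical CDF value in (0, 1)
    inv_cdf = NormalDist().inv_cdf           # inverse Gaussian CDF F_x^{-1}
    return np.array([inv_cdf(v) for v in u])
```

Because the mapping is monotonic, the ordering of the frames is preserved while the marginal distribution becomes Gaussian, whatever shape it had before.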

15 Analysis of Feature Normalization When implementing feature normalization, it is very important to use enough data to support the chosen technique. If test utterances are too short to support the chosen normalization technique, degradation will be most apparent in the clean-speech recognition results. In cases where there is not enough data to support CMN, Rahim has shown that using the recognizer’s acoustic model to estimate a maximum-likelihood mean normalization is superior to conventional CMN.

16 Analysis of Feature Normalization It has been found that CMN does not degrade the recognition rate on utterances from the same acoustical environment, as long as there are at least four seconds of speech frames available. CMVN and CHN require even longer segments of speech. When a system is trained on one microphone and tested on another, CMN can provide significant robustness. Interestingly, it has been found in practice that, even for utterances within the same environment, the error rate with CMN can actually be somewhat lower. This is surprising, given that there is no mismatch in channel conditions.

17 Analysis of Feature Normalization One explanation is that, even for the same microphone and room acoustics, the distance between the mouth and the microphone varies for different speakers, which causes slightly different transfer functions. The cepstral mean characterizes not only the channel transfer function, but also the average frequency response of different speakers. By removing the long-term speaker average, CMN can act as a sort of speaker normalization. One drawback of CMN, CMVN, and CHN is that they do not discriminate between nonspeech and speech frames when computing the utterance mean.

18 Analysis of Feature Normalization For instance, the mean cepstrum of an utterance that has 90% nonspeech frames will be significantly different from that of one that contains only 10% nonspeech frames. An extension to CMN that addresses this problem consists of computing different means for noise and speech. Speech/noise discrimination could be done by classifying frames into speech frames and noise frames, computing the average cepstrum for each class, and normalizing each class toward the corresponding average in the training data.
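
A minimal sketch of this two-class idea is below. The energy-threshold classifier standing in for a real speech/noise detector is an assumption, and the sketch simply zero-means each class rather than aligning it to training-data averages, which the extension above would additionally do:

```python
import numpy as np

def two_class_cmn(features, energy, threshold):
    """Speech/noise-aware mean normalization (sketch). Frames are
    classified as speech or noise by a simple energy threshold (a
    stand-in for any real VAD), and each frame then has the mean of
    its own class subtracted, so silence-heavy utterances no longer
    drag the speech frames' mean around."""
    speech = energy > threshold
    out = features.copy()
    if speech.any():
        out[speech] -= features[speech].mean(axis=0)
    if (~speech).any():
        out[~speech] -= features[~speech].mean(axis=0)
    return out
```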

19 My Experiment and observation They are both mean normalization methods, so why is AGN better than CMN? → Because the maximum c0 must contain noise? It removes not only the convolution but also most of the noise, and that is why it can be applied to just the log-energy term. Why is CMVN better than both CMN and AGN, even if we use CMVN only on the energy term while applying AGN and CMN to the full MFCC vector? → Because variance normalization on the energy term makes the largest contribution: the energy term reflects the overall energy and contains the largest variance.

20 My Experiment and observation Both CMVN and CHN assume a Gaussian distribution with zero mean and unit variance; they are the same in terms of target distribution. → What is different? CMVN uses a linear transformation to reach the Gaussian distribution, whereas CHN reaches it through a nonlinear transformation. → Is there no information loss in CMVN? The data sparseness is more severe in CMVN.

21 My Experiment and observation CMVN → Std dev > 1: the nearer a value is to the mean, the more of it is left; the farther it is from the mean, the more is subtracted. The distribution changes from short and fat to tall and thin. → Std dev < 1: the nearer a value is to the mean, the less it is enlarged; the farther it is from the mean, the more it is enlarged. The distribution changes from tall and thin to short and fat.

22 Question Is it good to have a smaller variance? Should the range of values fed to PCA be smaller? Is a sharp acoustic model good?

23 Idea Use multi-condition data to train a good variance. Map the multi-condition CDF to clean MFCC. Shift the mean of the test data before recognition.