
Cepstral Vector Normalization Based on Stereo Data for Robust Speech Recognition
Luis Buera, Eduardo Lleida, Antonio Miguel, Alfonso Ortega, and Óscar Saz
Communication Technologies Group (GTC), Aragon Institute of Engineering Research (I3A), University of Zaragoza, Spain
IEEE Transactions on Audio, Speech and Language Processing, Feb. 2007
Presenter: Shih-Hsiang Lin

2 References
–“Cepstral Vector Normalization Based on Stereo Data for Robust Speech Recognition,” IEEE Trans. on Audio, Speech and Language Processing, 2007
–“Multi-environment models based linear normalization for robust speech recognition in car conditions,” in Proc. ICASSP 2004
–“Multi-environment models based linear normalization for robust speech recognition,” in Proc. SPECOM 2004
–“Robust speech recognition in cars using phoneme dependent multi-environment linear normalization,” in Proc. Eurospeech 2005
–“Recent advances in PD-MEMLIN for speech recognition in car conditions,” in Proc. ASRU 2005

3 Outline
Introduction
Approaches
–Multi-Environment Model-based Linear Normalization (MEMLIN)
–Polynomial MEMLIN (P-MEMLIN)
–Multi-Environment Model-based Histogram Normalization (MEMHIN)
–Phoneme-Dependent MEMLIN (PD-MEMLIN)
–Blind PD-MEMLIN
Experimental Results
Conclusions

4 Introduction
Robustness techniques have been developed along two main lines of research
–Acoustic model adaptation methods
»Require more data and computing time
»Examples: MAP, MLLR, PMC
–Feature vector adaptation/normalization methods
»Map recognition-space feature vectors to the training space
»High-pass filtering (CMN, RASTA processing); the results produced by these methods are limited when used individually
»Model-based techniques (VTS, CDCN)
»Empirical compensation: entirely data-driven; needs a training phase in which the transformations are estimated from the frame-by-frame differences between stereo data (SPLICE, POF)

5 Introduction (cont.)
This paper focuses on empirical feature vector normalization based on stereo data and the MMSE estimator
–Based on joint modeling of the clean and noisy spaces: the noisy space is split into several basic environments, and each basic noisy space and the clean space are modeled with GMMs
–A transformation between clean and noisy feature vectors is learned for each pair of clean and noisy model Gaussians
[Figure: mapping between the noisy basic environment spaces and the clean space]

6 Noise Effects
–Convolutional noise: shifts the mean of the coefficients
–Additive noise: modifies the PDF, reducing the variances of the coefficients
–Real car environment: modifies the mean and variance jointly

7 MMSE-based Feature Vector Normalization Methods
Given the noisy feature vector, the estimated clean feature vector is obtained using the MMSE criterion
–Method 1: CMN
»No distributional assumptions are made in the estimation
»The clean feature vector is approximated as the noisy vector plus a bias vector transformation
»To estimate the bias vector transformation, a mean square error is defined and minimized with respect to it (see the sketch below)
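The equations on this slide did not survive the transcript; the following LaTeX sketch reconstructs the standard MMSE estimator and the CMN special case. The notation ($y_t$, $\hat{x}_t$, $r$) is assumed here, not taken from the slide.

```latex
% MMSE estimate of the clean vector x_t given the noisy vector y_t
\hat{x}_t = E[x_t \mid y_t]

% CMN: the clean vector is approximated by the noisy vector plus a single bias r
\hat{x}_t \approx y_t + r

% r is obtained by minimizing the mean square error over the training frames
E(r) = \sum_t \lVert x_t - y_t - r \rVert^2
\quad\Rightarrow\quad
r = \frac{1}{T}\sum_t \left( x_t - y_t \right)
```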

8 MMSE-based Feature Vector Normalization Methods (cont.)
In some cases, the mean of the clean feature vectors is removed before training the acoustic models, so the bias vector transformation can be computed from the noisy feature vectors alone (their mean)
–Method 2: Multivariate Gaussian-based Cepstral Normalization (RATZ)
»Models the clean space with a GMM
»Approximates the clean feature vector as the noisy vector plus a Gaussian-dependent bias, weighted by the a posteriori probabilities of the clean Gaussians (see the sketch below)
»Estimating these a posteriori probabilities from noisy vectors with the clean model can produce a mismatch
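The missing RATZ expression, in a commonly used simplified (bias-only) form; the symbols (clean Gaussians $s^x$ with bias vectors $r_{s^x}$) are assumed:

```latex
% RATZ: the clean space is a GMM over Gaussians s^x; each Gaussian has a bias r_{s^x}
\hat{x}_t \approx y_t + \sum_{s^x} p(s^x \mid y_t)\, r_{s^x}
```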

9 MMSE-based Feature Vector Normalization Methods (cont.)
–Method 3: SPLICE
»Models the noisy space, instead of the clean one, with a GMM
»Approximates the clean feature vector as the noisy vector plus a bias associated with each noisy Gaussian (see the sketch below)
–Furthermore, extensions to several acoustic conditions have been developed
»Interpolated RATZ (IRATZ)
»SPLICE with environmental model selection
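The corresponding SPLICE expression, in the same assumed notation (noisy Gaussians $s^y$ with bias vectors $r_{s^y}$):

```latex
% SPLICE: the noisy space is a GMM over Gaussians s^y; each Gaussian has a bias r_{s^y}
\hat{x}_t \approx y_t + \sum_{s^y} p(s^y \mid y_t)\, r_{s^y}
```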

10 Multi-Environment Model-based Linear Normalization (MEMLIN)
MEMLIN approximations
–The noisy space is divided into several basic environments, and the noisy feature vectors of each basic environment are modeled with a GMM
–The clean feature vectors are modeled with a single GMM
–The clean feature vector is approximated as a linear function of the noisy feature vector (a bias vector transformation) which depends on the basic environment and on the pair of clean and noisy model Gaussians (see the sketch below)
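A sketch of these three approximations; the GMM notation is assumed, with $e$ indexing basic environments, $s_e^y$ the noisy Gaussians of environment $e$, and $s^x$ the clean Gaussians:

```latex
% Noisy space: one GMM per basic environment e
p(y \mid e) = \sum_{s_e^y} p(y \mid s_e^y)\, p(s_e^y \mid e)

% Clean space: a single GMM
p(x) = \sum_{s^x} p(x \mid s^x)\, p(s^x)

% Linear approximation: a bias r for every (clean Gaussian, noisy Gaussian, environment) triple
x \approx y - r_{s^x, s_e^y}
```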

11 MEMLIN (Cont.)
MEMLIN enhancement
–Given the noisy feature vector, the estimated clean feature vector is obtained using the MMSE criterion (see the sketch below)
–To compute the weights of the estimator:
»The a priori probability of each environment is considered to be uniformly distributed over all the environments
»The a posteriori probability of the correct environment has to be close to 1
»The cross probability of a clean Gaussian given the noisy Gaussian and the environment is estimated in a training phase using stereo data
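The MMSE expression lost from this slide has roughly the following form; this is a reconstruction in the notation assumed in the previous sketch:

```latex
% MEMLIN enhancement: subtract the expected bias, weighted over environments and Gaussian pairs
\hat{x}_t = y_t - \sum_{e} \sum_{s_e^y} \sum_{s^x}
  r_{s^x, s_e^y}\;
  p(e \mid y_t)\;
  p(s_e^y \mid y_t, e)\;
  p(s^x \mid y_t, e, s_e^y)
```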

12 MEMLIN (Cont.)
MEMLIN training
–Given a stereo data corpus for each basic environment, the bias vector transformation is estimated by minimizing a weighted mean square error (see the sketch below)
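A sketch of the stereo-based training step: with stereo pairs $(x_t, y_t^e)$ for environment $e$, minimizing a weighted mean square error gives a closed-form bias. This is the standard stereo solution; the exact weighting used in the paper may differ slightly.

```latex
% Weighted mean square error over the stereo pairs of environment e
E(r_{s^x, s_e^y}) = \sum_t \lVert x_t - y_t^e + r_{s^x, s_e^y} \rVert^2 \,
                    p(s^x \mid x_t)\, p(s_e^y \mid y_t^e, e)

% Setting the derivative to zero yields a weighted average of the stereo differences
r_{s^x, s_e^y} = \frac{\sum_t (y_t^e - x_t)\, p(s^x \mid x_t)\, p(s_e^y \mid y_t^e, e)}
                      {\sum_t p(s^x \mid x_t)\, p(s_e^y \mid y_t^e, e)}
```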

13 MEMLIN (Cont.)
–Calculating the cross probability
»The cross probability is simplified by dropping the time dependence on the noisy feature vector
»The resulting term can be estimated using either a hard or a soft decision
–Hard decision (using relative frequency)
–Soft decision
(A sketch of both options is given below.)
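The hard- and soft-decision formulas are missing from the transcript; the following Python sketch shows one plausible way to estimate the time-independent cross probability from stereo data, assuming per-frame Gaussian posteriors have already been computed. Function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def cross_probability(post_clean, post_noisy, hard=True):
    """Estimate p(clean Gaussian | noisy Gaussian) for one basic environment.

    post_clean: (T, Nx) array of per-frame posteriors of the clean GMM Gaussians.
    post_noisy: (T, Ny) array of per-frame posteriors of the noisy GMM Gaussians.
    Returns an (Ny, Nx) matrix whose rows sum to 1.
    """
    _, nx = post_clean.shape
    _, ny = post_noisy.shape
    counts = np.zeros((ny, nx))
    if hard:
        # Hard decision: count how often each (noisy, clean) Gaussian pair is
        # jointly the most probable one, i.e. a relative frequency.
        best_x = post_clean.argmax(axis=1)
        best_y = post_noisy.argmax(axis=1)
        for sy, sx in zip(best_y, best_x):
            counts[sy, sx] += 1.0
    else:
        # Soft decision: accumulate the product of the posteriors instead of
        # hard assignments.
        counts = post_noisy.T @ post_clean
    counts = counts + 1e-12  # avoid division by zero for unused Gaussians
    # Normalize each noisy Gaussian's row into a conditional distribution.
    return counts / counts.sum(axis=1, keepdims=True)
```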

14 Experimental Results Using Basic MMSE-based Methods
A set of experiments was performed using the Spanish SpeechDat Car database
–Seven basic environments were defined
»E1: car stopped, motor running
»E2: town traffic, closed windows, and climatizer off (silent conditions)
»E3: town traffic and noisy conditions (windows open and/or climatizer on)
»E4: low speed, rough road, and silent conditions
»E5: low speed, rough road, and noisy conditions
»E6: high speed, good road, and silent conditions
»E7: high speed, good road, and noisy conditions
–Two channels were used (stereo data)
»Close-talk channel (CLK)
»Hands-free channel (HF)
–The recognition task is isolated and continuous digit recognition

15 Experimental Results (cont.)
–The SPLICE MS method always produces better results than IRATZ, because of the assumption made for the a posteriori probabilities in IRATZ
–MEMLIN performed better than both IRATZ and SPLICE MS

16 Improvements Over MEMLIN
There are two important approximations in the MEMLIN expressions that can affect the final performance of the method
–The linear model associated with a pair of Gaussians compensates for the mean shift, but not for the modification of the variance
–All sounds are treated in the same way: there is always a bias vector transformation mapping each noisy model Gaussian to every clean model Gaussian
»e.g., non-silence noisy feature vectors can be mapped towards clean silence

17 Polynomial MEMLIN (P-MEMLIN)
The transformation function for P-MEMLIN is a first-order polynomial (see the sketch below)
–Given the noisy feature vector, the estimated clean feature vector is obtained using the MMSE criterion, as in MEMLIN
–The polynomial coefficients are computed in the training phase using stereo data
–If the standard deviation terms are equal, the algorithm expressions reduce to those of MEMLIN
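The transformation function lost from this slide plausibly has, per Gaussian pair and per component, a form along these lines (a reconstruction consistent with the reduction to MEMLIN noted above; $\mu$ and $\sigma$ denote the Gaussian means and standard deviations):

```latex
% P-MEMLIN: a first-order polynomial per (clean, noisy) Gaussian pair and per component,
% matching both the means and the standard deviations
\hat{x}_t \approx \mu_{s^x} + \frac{\sigma_{s^x}}{\sigma_{s_e^y}}\,\bigl(y_t - \mu_{s_e^y}\bigr)

% If \sigma_{s^x} = \sigma_{s_e^y}, this reduces to the MEMLIN bias:
\hat{x}_t \approx y_t + \bigl(\mu_{s^x} - \mu_{s_e^y}\bigr) = y_t - r_{s^x, s_e^y}
```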

18 Multi-Environment Model-based Histogram Normalization (MEMHIN)
Sometimes noise produces a more complex modification of the clean and noisy feature PDFs associated with a pair of Gaussians
–In that case, the linear approximation of MEMLIN or P-MEMLIN is not the best option
–Therefore, a nonlinear model based on histogram equalization is used
The transformation function for MEMHIN maps each component through the histograms of the pair of Gaussians (see the sketch below)
–Band histograms associated with the clean and noisy Gaussians of each pair, for each component of the feature vectors, are obtained in the training phase
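The MEMHIN transformation expression is missing from the transcript. Histogram equalization in general maps a value through the noisy CDF and then the inverse clean CDF, so per component and per Gaussian pair the transformation presumably takes a form like the following (a reconstruction, not the paper's exact notation):

```latex
% Histogram equalization per component i and Gaussian pair (s^x, s_e^y):
% map through the noisy CDF, then through the inverse clean CDF
\hat{x}_t(i) \approx C^{-1}_{x,\,s^x}\!\Bigl( C_{y,\,s_e^y}\bigl( y_t(i) \bigr) \Bigr)
```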

19 Results from the Modifications
Additive car noise (5 dB) was added to the clean signals of the Spanish SpeechDat Car database
P-MEMLIN and MEMHIN provide a significant improvement over MEMLIN when few Gaussians are considered (4 Gaussians per environment)
–33.87% MIMP for MEMLIN
–39.12% MIMP for P-MEMLIN
–37.82% MIMP for MEMHIN
–However, if the algorithms are evaluated using more than eight Gaussians per environment, the mean results are very similar among the three models

20 Phoneme-Dependent MEMLIN (PD-MEMLIN)

21 PD-MEMLIN (Cont.)
PD-MEMLIN approximations
–The noisy space is split into several basic environments; the noisy feature vectors associated with the different phonemes of each basic environment are modeled with GMMs
–The clean feature vectors of each phoneme are modeled with a GMM
–The clean feature vector can be approximated by a linear function that depends on the environment and on the phoneme-dependent Gaussians of the clean and noisy models (see the sketch below)
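A sketch of the phoneme-dependent approximation, extending the earlier MEMLIN notation with a phoneme index $ph$ (an assumed reconstruction):

```latex
% PD-MEMLIN: one noisy GMM per (environment e, phoneme ph) and one clean GMM per phoneme ph
p(y \mid e, ph) = \sum_{s_{e,ph}^y} p(y \mid s_{e,ph}^y)\, p(s_{e,ph}^y \mid e, ph)
\qquad
p(x \mid ph) = \sum_{s_{ph}^x} p(x \mid s_{ph}^x)\, p(s_{ph}^x \mid ph)

% Linear approximation with phoneme-dependent bias vectors
x \approx y - r_{s_{ph}^x,\, s_{e,ph}^y}
```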

22 PD-MEMLIN (Cont.)
PD-MEMLIN enhancement
–Given the noisy feature vector, the estimated clean feature vector is obtained using the MMSE criterion, analogously to MEMLIN but with the sums also running over the phonemes
–The corresponding weights are calculated as in MEMLIN
PD-MEMLIN training

23 PD-MEMLIN (Cont.)
–Cross probability: hard solution or soft solution, as in MEMLIN
–Bias vector transformation
For a more detailed description, please refer to the paper

24 Results from PD-MEMLIN
–25 Spanish phonemes and silence were modeled
–To make a fair comparison between the two methods, the results are plotted as a function of the number of Transformations per basic Environment (TpE): the number of noisy Gaussians per phoneme times the number of clean Gaussians per phoneme times the number of phonemes (1 for MEMLIN), with 2, 4, 8, 16, or 32 Gaussians per phoneme
–The results show that PD-MEMLIN makes significant improvements relative to MEMLIN, especially when more than four Gaussians per phoneme are used

25 Results from PD-MEMLIN (cont.)
To estimate the upper limit of PD-MEMLIN
–Each frame was normalized using only the bias vector transformations of the “correct” phoneme (KPD-MEMLIN)
[Figure: KPD-MEMLIN vs. PD-MEMLIN; MCP denotes the mean correct phoneme]

26 Blind PD-MEMLIN
In many cases, stereo data are not available
–An iterative “blind” training procedure is needed
–Assume that the noisy training feature vectors and the phoneme-dependent clean and noisy GMMs are available
–The problem is to estimate the cross probability and the bias vector transformation
–It consists of an initialization and an iterative process

27 Blind PD-MEMLIN (cont.)
Initialization
–The cross probability is initialized using a modified Kullback-Leibler distance, which gives a similarity measure between the clean and noisy Gaussians without considering the effects of the noise
–It is assumed that the noise mainly modifies the mean vectors of the Gaussian models, so the similarity is computed in terms of the a priori probabilities and the diagonal covariance matrices of the corresponding Gaussians
–Since the KL distance is neither symmetric nor proportional to a likelihood, a pseudo-likelihood is defined (see the note below)
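The exact modified distance and pseudo-likelihood are not in the transcript. As context only (not the paper's definition), the symmetric KL distance between two d-dimensional diagonal-covariance Gaussians reduces, when the mean difference is ignored, to the standard expression:

```latex
% Symmetric KL distance between diagonal-covariance Gaussians, mean terms ignored
D_{sym}(s^x, s^y) = \frac{1}{2}\sum_{i=1}^{d}
  \left( \frac{\sigma^2_{s^x}(i)}{\sigma^2_{s^y}(i)}
       + \frac{\sigma^2_{s^y}(i)}{\sigma^2_{s^x}(i)} - 2 \right)
```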

28 Blind PD-MEMLIN (cont.)
–Finally, the initial cross probability is estimated from the pseudo-likelihood
–On the other hand, the initial bias vector transformation is defined from the corresponding clean and noisy Gaussian means
–With this initialization, the mean improvement in WER over the seven basic environments, with four Gaussians per phoneme-dependent GMM, was 20.2%

29 Blind PD-MEMLIN (cont.)
Iterative process
–An objective function is defined
–The value of the bias vector transformation is obtained by taking the partial derivatives of the objective function and setting them equal to zero

30 Blind PD-MEMLIN (cont.)
–The cross probability and the bias vector transformation are updated at each iteration
–The mean improvement in WER in this case was 41.03% with n = 1 and 46.90% with n = 10 iterations

31 Results from Blind PD-MEMLIN
–The results show that blind PD-MEMLIN produces improvements very similar to those of MEMLIN for all TpE values
–It can be observed that PD-MEMLIN obtains the best improvement with the smallest TpE