Presentation transcript:

Normalization of the Speech Modulation Spectra for Robust Speech Recognition Xiong Xiao, Eng Siong Chng, and Haizhou Li Wen-Yi Chu Department of Computer Science & Information Engineering National Taiwan Normal University

Outline
Introduction
Noise Effects on Speech Modulation Spectra
Temporal Structure Normalization
Segment-Based Implementation
Experiments
Conclusion

Introduction (1/2) The modulation spectra of a speech signal are the power spectral density (PSD) functions of the feature trajectories generated from the signal; hence, they describe the temporal structure of the features. When speech is distorted by noise, the temporal characteristics of the feature trajectories are also distorted and need to be enhanced. We propose the temporal structure normalization (TSN) filter to reduce the noise effects by normalizing the modulation spectra to reference spectra.

Introduction (2/2) In this paper, we examine the theoretical and practical issues of the TSN filter, including the following. 1) We investigate the effects of noises on the speech modulation spectra and show the general characteristics of noisy speech modulation spectra. 2) We evaluate the TSN filter on the Aurora-4 task and demonstrate its effectiveness on a large vocabulary task. 3) We propose a segment-based implementation of the TSN filter that significantly reduces the processing delay without affecting the performance.

Noise Effects on Speech Modulation Spectra Definition of Modulation Spectra For a time-domain speech signal $x(t)$, its short-time power spectral density is defined as $X(n,k) = |\mathrm{STFT}\{x(t)\}|^{2}$, where $t$, $n$, and $k$ are the sample index, frame index, and acoustic frequency index, respectively; $\mathrm{STFT}\{\cdot\}$ denotes the short-time Fourier transform and $|\cdot|$ denotes the magnitude.
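
As a rough illustration of this definition (not the paper's implementation), the sketch below computes a short-time PSD and the modulation spectrum of one feature trajectory with NumPy/SciPy; the frame length, hop, and the use of Welch's estimator are assumptions made only to keep the example short.

```python
import numpy as np
from scipy.signal import stft, welch

def short_time_psd(x, fs, frame_len=400, hop=160):
    """Short-time PSD |STFT{x(t)}|^2; rows are acoustic frequency bins k,
    columns are frame indices n (25 ms frames / 10 ms hop are assumptions)."""
    _, _, X = stft(x, fs=fs, nperseg=frame_len, noverlap=frame_len - hop)
    return np.abs(X) ** 2

def modulation_spectrum(trajectory, frame_rate=100.0):
    """Modulation spectrum = PSD of one feature trajectory c(n); the
    modulation-frequency axis extends to half the frame rate (50 Hz here).
    Welch's estimator is used only to keep the sketch short."""
    f_mod, psd = welch(trajectory, fs=frame_rate,
                       nperseg=min(256, len(trajectory)))
    return f_mod, psd

# usage: modulation spectrum of the log-power trajectory of one acoustic bin
fs = 16000
x = np.random.randn(2 * fs)           # stand-in for a speech waveform
S = short_time_psd(x, fs)             # shape: (n_bins, n_frames)
log_traj = np.log(S[10] + 1e-10)      # trajectory of acoustic bin k = 10
f_mod, psd = modulation_spectrum(log_traj)
```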

Noise Effects on Speech Modulation Spectra Modulation Spectra of Clean Speech and Noises As the modulation spectra are averaged over 200 utterances, the effects of the spoken content and speaker are reduced. From Fig. (a), it is clear that the modulation spectra of noises are flatter than those of speech.

Noise Effects on Speech Modulation Spectra Modulation Spectra of Noisy Speech The noisy modulation spectrum of any filterbank is represented as approximately the sum of the clean speech and noise modulation spectra (the speech and noise feature trajectories are assumed to be uncorrelated). The three noisy modulation spectra have higher power than the clean modulation spectrum in the high modulation frequency range, and the flatness of the noisy modulation spectra is inversely related to the SNR level.

Temporal Structure Normalization Overview of TSN Framework (1/2) The TSN filter aims to reduce the temporal differences between the clean and noisy features, such as differences in the smoothness of the feature trajectories. The magnitude response of the filter is designed from the PSD functions of the feature trajectories.

Temporal Structure Normalization Overview of TSN Framework (2/2) Specifically, the magnitude response is designed to modify the feature trajectory's PSD function towards a reference PSD function. In this way, the temporal characteristics of the features are normalized. In both the training and processing steps, the feature trajectories are preprocessed individually by either mean and variance normalization (MVN) or histogram equalization (HEQ) before the PSD estimation.
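
A minimal sketch of the MVN preprocessing applied to each feature trajectory before PSD estimation (HEQ is omitted); utterance-level statistics are assumed.

```python
import numpy as np

def mvn(features, eps=1e-10):
    """Mean and variance normalization: each column (one feature trajectory)
    is normalized to zero mean and unit variance over the utterance."""
    mu = features.mean(axis=0, keepdims=True)
    sigma = features.std(axis=0, keepdims=True)
    return (features - mu) / (sigma + eps)

# usage: normalize a (n_frames, n_dims) MFCC matrix before PSD estimation
normalized = mvn(np.random.randn(300, 13))
```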

Temporal Structure Normalization Filter Design (1/2) The filter design process is divided into two parts: 1) estimating the desired magnitude response of the filter, with optional modification (steps 1–3 in Table I); and 2) designing the time-domain finite-impulse response (FIR) filter that implements the desired magnitude response (steps 4–7 in Table I). The same filter design process is performed individually for each feature trajectory.

Temporal Structure Normalization Filter Design (2/2) To normalize the test PSD $\hat{P}(f)$ to the reference PSD $P_{\mathrm{ref}}(f)$, the filter's magnitude response should be $|H(f)| = \sqrt{P_{\mathrm{ref}}(f)/\hat{P}(f)}$. During the filter design process, there are two considerations. 1) Normalizing the trend of the PSD function only: the simplest approach is to use smooth estimates of $\hat{P}(f)$ and $P_{\mathrm{ref}}(f)$ in the filter design process (obtained with the autoregressive (AR) model-based Yule–Walker method instead of the Fourier transform). 2) Combination with other temporal filters: to take advantage of other temporal filters, we can simply multiply their magnitude responses with that of the TSN filter (step 3 of Table I).
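
A minimal sketch of this design process, assuming NumPy/SciPy: the test PSD is estimated with the Yule–Walker (AR) method, the magnitude response is the square root of the reference-to-test PSD ratio (optionally multiplied by another filter's response), and a linear-phase FIR is obtained by frequency sampling with `firwin2`. The AR order, frequency grid, and function names are illustrative and do not reproduce Table I.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import firwin2, freqz, lfilter

def yule_walker_psd(traj, order=12, n_freq=129):
    """AR (Yule-Walker) PSD estimate of one feature trajectory, evaluated on a
    uniform grid of n_freq normalized frequencies in [0, pi]."""
    x = traj - traj.mean()
    r = np.correlate(x, x, mode='full')[len(x) - 1:] / len(x)  # biased autocorrelation
    a = solve_toeplitz(r[:order], r[1:order + 1])              # AR coefficients
    sigma2 = r[0] - np.dot(a, r[1:order + 1])                  # prediction error power
    w = np.linspace(0.0, np.pi, n_freq)
    _, h = freqz(np.sqrt(sigma2), np.r_[1.0, -a], worN=w)      # sqrt(sigma2) / A(e^{jw})
    return w, np.abs(h) ** 2

def design_tsn_filter(traj, ref_psd, n_taps=33, other_mag=None):
    """|H(f)| = sqrt(P_ref(f) / P_test(f)), optionally combined with another
    temporal filter's magnitude response, realized as a linear-phase FIR."""
    w, test_psd = yule_walker_psd(traj, n_freq=len(ref_psd))
    mag = np.sqrt(ref_psd / (test_psd + 1e-10))
    if other_mag is not None:               # e.g. the ARMA filter's |H(f)|
        mag = mag * other_mag
    return firwin2(n_taps, w / np.pi, mag)  # frequency grid rescaled to [0, 1]

# usage with toy data: one MVN-normalized trajectory and a flat reference PSD
traj = np.random.randn(300)
ref_psd = np.ones(129)
fir = design_tsn_filter(traj, ref_psd)
filtered_traj = lfilter(fir, 1.0, traj)
```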

Temporal Structure Normalization Training of Reference PSD Functions In the TSN framework, the general temporal structure is captured by the reference PSD functions, which are estimated by averaging the PSD functions of feature trajectories over a large number of utterances. The averaging process reduces the influence of other factors that affect the PSD functions, such as the spoken content and the speaker's characteristics.
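
A minimal sketch of the averaging step, assuming the features have already been MVN-normalized; Welch's estimator stands in for the Yule–Walker estimate purely to keep the sketch self-contained, and the corpus layout is hypothetical.

```python
import numpy as np
from scipy.signal import welch

def train_reference_psds(utterances, frame_rate=100.0, nperseg=64):
    """Average the PSD of each feature trajectory over many training
    utterances; returns one reference PSD per feature dimension."""
    acc, count = None, 0
    for feats in utterances:          # feats: (n_frames, n_dims), MVN-applied
        psds = np.stack([welch(feats[:, d], fs=frame_rate, nperseg=nperseg)[1]
                         for d in range(feats.shape[1])])
        acc = psds if acc is None else acc + psds
        count += 1
    return acc / count                # shape: (n_dims, n_freq_bins)

# usage with toy data: 20 random "utterances" of MVN-normalized 13-dim features
utterances = [np.random.randn(np.random.randint(200, 400), 13) for _ in range(20)]
ref_psds = train_reference_psds(utterances)
```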

Segment-Based Implementation When a short processing delay is desired or the environmental conditions change rapidly within an utterance, it is preferable to process the features in a segment-by-segment fashion. The segment-based MVN and HEQ are implemented using the same segmenting scheme as the TSNseg. Considerations: two important parameters of the TSNseg are the segment length and the segment shift; a possible realization is sketched below.
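
One plausible realization of the segment-by-segment processing (overlap-and-average), assuming hypothetical segment length and shift values; `normalize_segment` is a placeholder for the per-segment MVNseg/TSNseg processing and this is not the paper's exact scheme.

```python
import numpy as np

def process_by_segments(features, normalize_segment, seg_len=200, seg_shift=100):
    """Apply a per-segment normalization (e.g. MVNseg followed by TSNseg
    filtering) to overlapping segments and average the overlapping outputs.
    seg_len and seg_shift are in frames and purely illustrative."""
    n_frames = features.shape[0]
    out = np.zeros_like(features, dtype=float)
    weight = np.zeros((n_frames, 1))
    start = 0
    while True:
        end = min(start + seg_len, n_frames)
        out[start:end] += normalize_segment(features[start:end])
        weight[start:end] += 1.0
        if end == n_frames:
            break
        start += seg_shift
    return out / weight

# usage with a placeholder per-segment normalizer (here simple segment-level MVN)
feats = np.random.randn(700, 13)
normalized = process_by_segments(
    feats, lambda seg: (seg - seg.mean(0)) / (seg.std(0) + 1e-10))
```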

Experimental Setup The TSN filter is evaluated on both the small-vocabulary Aurora-2 task and the large-vocabulary Aurora-4 task. Mel-frequency cepstral coefficients (MFCCs) are used as the features for speech recognition. For the TSN filter, the Yule–Walker method is used to estimate the PSD functions of the feature trajectories. A filter length of 33 taps is used for the evaluation unless otherwise stated.

Evaluation on the Aurora-2 Task (1/2) From the table, the performance of the TSN filter is similar to that of the ARMA filter and better than that of the RASTA filter. It is also found that the integrated case TSN+ARMA is consistently better than either the ARMA filter or the TSN filter alone. With the combined magnitude response, the TSN filter not only normalizes the feature's temporal structure, but also applies additional smoothing to the features.

Evaluation on the Aurora-2 Task (2/2) To ensure that the results of these tests are statistically significant, we carry out hypothesis tests as follows. Evaluation on Different Features The TSN filter alone may not produce the best performance. However, the integration of the TSN with the ARMA filter (TSN+ARMA) always improves the performance.

Evaluation on the Aurora-4 Task From the table, the results of TSN are always significantly better than those of ARMA and RASTA. While feature smoothing can reduce the feature mismatches, it also removes useful speech information. The degree of smoothing is a tradeoff between these two factors and usually depends on the signal conditions, such as the SNR level and the task complexity.

Segment-Based Implementation The gain of TSNseg over TSN is mainly due to its use of MVNseg preprocessing. The main advantage of the TSNseg is the reduction of the processing delay. With the segment-based implementation, the TSN filter can adapt much faster to environments that change at run time, and is thus suitable for low-delay speech recognition deployments.

Effect of Filter Length A long filter can model the desired magnitude response better, but also means higher computational cost and a longer transition period for filtering. Due to these considerations, a short filter is preferred if it does not degrade the performance too much compared to a long filter.

Contribution of Modulation Frequency Bands It is generally believed that the intelligibility of human speech is closely related to the 1–16 Hz modulation frequency range. This suggests that the performance improvement of the TSN filter should mainly come from its gain in the 1–16 Hz range.

Conclusion The major contributions of the paper include the following: It discusses the effects of additive noises on the speech modulation spectra. It conducts a comprehensive evaluation of the TSN framework with respect to different parameter settings. It proposes a novel segment-based implementation of the TSN filter. The major advantage of the TSN filter is its ability to adapt to changing environments. From a statistical point of view, the TSN normalizes the correlation of features over different frames, while other normalization techniques such as HEQ normalize the probability distributions of the features.