CHiME Challenge: Approaches to Robustness using Beamforming and Uncertainty-of-Observation Techniques

Dorothea Kolossa 1, Ramón Fernandez Astudillo 2, Alberto Abad 2, Steffen Zeiler 1, Rahim Saeidi 3, Pejman Mowlaee 1, João Paulo da Silva Neto 2, Rainer Martin 1

1 Institute of Communication Acoustics (IKA), Department of Electrical Engineering and Information Sciences, Ruhr-Universität Bochum
2 Spoken Language Laboratory, INESC-ID, Lisbon
3 School of Computing, University of Eastern Finland

Overview

- Uncertainty-Based Approach to Robust ASR
- Uncertainty Estimation by Beamforming & Propagation
- Recognition under Uncertain Observations
- Further Improvements
  - Training: Full-Covariance Mixture Splitting
  - Integration: ROVER
- Results and Conclusions

Introduction: Uncertainty-Based Approach to ASR Robustness

- Speech enhancement in the time-frequency domain is often very effective.
- However, speech enhancement alone can neither
  - remove all distortions and sources of mismatch completely,
  - nor avoid introducing artifacts of its own.

Simple example: time-frequency masking applied to a mixture.
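As a rough illustration of the kind of time-frequency masking meant here, the following numpy/scipy sketch applies a binary mask computed from a crude per-bin SNR estimate; the STFT settings, the threshold, and the noise-PSD input are illustrative assumptions, not the configuration used in this work.

```python
import numpy as np
from scipy.signal import stft, istft

def tf_mask_enhance(mixture, noise_psd, fs=16000, thresh_db=3.0):
    """Binary time-frequency masking: keep only TF bins whose local SNR
    estimate exceeds a threshold. All settings are illustrative."""
    f, t, Y = stft(mixture, fs=fs, nperseg=512, noverlap=384)
    # crude a-posteriori SNR estimate per TF bin
    snr_db = 10.0 * np.log10(np.abs(Y) ** 2 / (noise_psd[:, None] + 1e-12) + 1e-12)
    mask = (snr_db > thresh_db).astype(float)   # binary mask M_kl
    X_hat = mask * Y                            # masked spectrogram X_kl
    _, x_hat = istft(X_hat, fs=fs, nperseg=512, noverlap=384)
    return x_hat, mask
```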

Introduction: Uncertainty-Based Approach to ASR Robustness

One possible compromise: missing-feature recognition in the time-frequency domain.

  m(n) -> STFT -> Y_kl -> TF-Domain Speech Processing -> M_kl, X_kl -> Missing-Feature HMM Speech Recognition

Problem: Recognition performs significantly better in other domains, so the missing-feature approach may perform worse than feature reconstruction [1]. How can the decoder handle such artificially distorted signals?

[1] B. Raj and R. Stern, "Reconstruction of Missing Features for Robust Speech Recognition," Speech Communication 43, 2004.

Introduction: Uncertainty-Based Approach to ASR Robustness

Solution used here: Transform uncertain features into the desired domain of recognition.

  m(n) -> STFT -> Y_kl -> TF-Domain Speech Processing -> posterior p(X_kl | Y_kl)
       -> Uncertainty Propagation -> p(x_kl^c | Y_kl) in the recognition domain
       -> Uncertainty-Based HMM Speech Recognition

The time-frequency-domain posterior p(X_kl | Y_kl) is propagated into a posterior p(x_kl^c | Y_kl) over the recognition-domain features, which the uncertainty-based HMM decoder then uses in place of point estimates.

Uncertainty Estimation & Propagation

Posterior estimation is performed here using one of four beamformers:
- Delay and Sum (DS)
- Generalized Sidelobe Canceller (GSC) [2]
- Multichannel Wiener Post-Filter (WPF)
- Integrated Wiener Filtering with Adaptive Beamformer (IWAB) [3]

[2] O. Hoshuyama, A. Sugiyama, and A. Hirano, "A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters," IEEE Trans. Signal Processing, vol. 47, no. 10, pp. 2677–2684, 1999.
[3] A. Abad and J. Hernando, "Speech enhancement and recognition by integrating adaptive beamforming and Wiener filtering," in Proc. 8th International Conference on Spoken Language Processing (ICSLP), 2004, pp. 2657–2660.
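For the simplest of these, the delay-and-sum beamformer, a minimal frequency-domain numpy sketch is given below; the function interface and the way steering delays are supplied are assumptions for illustration only.

```python
import numpy as np

def delay_and_sum(mics, delays_samples, fs=16000):
    """Delay-and-sum beamforming in the frequency domain.

    mics           : array of shape (n_mics, n_samples), the microphone signals
    delays_samples : array of shape (n_mics,), per-channel steering delays
                     (may be fractional), in samples
    """
    n_mics, n_samples = mics.shape
    spec = np.fft.rfft(mics, axis=1)                     # (n_mics, n_bins)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)       # Hz
    # a linear phase shift per channel realizes a fractional time delay
    phase = np.exp(-2j * np.pi * freqs[None, :] * delays_samples[:, None] / fs)
    aligned = spec * phase
    # average the aligned channels: coherent speech adds up, diffuse noise averages out
    out_spec = aligned.mean(axis=0)
    return np.fft.irfft(out_spec, n=n_samples)
```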

Uncertainty Estimation & Propagation

- The posterior of the clean speech, p(X_kl | Y_kl), is then propagated into the domain of the ASR.
- Feature extraction:
  - STSA-based MFCCs
  - CMS per utterance
  - possibly LDA
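As a small concrete piece of this feature chain, per-utterance CMS on uncertain features is sketched below; treating the subtracted utterance mean as a deterministic shift (so that only the posterior means change, not the variances) is a common simplification and an assumption here, not a statement of the exact pipeline used in this work.

```python
import numpy as np

def cepstral_mean_subtraction(cep_mean, cep_var=None):
    """Per-utterance cepstral mean subtraction (CMS) on uncertain features.

    cep_mean : (n_frames, n_ceps) posterior means of the cepstra
    cep_var  : optional (n_frames, n_ceps) posterior variances
    """
    normalized = cep_mean - cep_mean.mean(axis=0, keepdims=True)
    # the utterance mean is treated as a constant shift, so variances are unchanged
    return (normalized, cep_var) if cep_var is not None else normalized
```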

Uncertainty Estimation & Propagation

- Uncertainty model: the clean-speech posterior p(X_kl | Y_kl) in each time-frequency bin is modeled as a complex Gaussian distribution, parameterized by a mean and a variance per bin.

Uncertainty Estimation & Propagation

Two uncertainty estimators:

a) Channel Asymmetry Uncertainty Estimation
   - The beamformer output is used as input to a Wiener filter.
   - The noise variance is estimated as the squared difference between the channels.
   - The posterior is then directly obtainable for the Wiener filter [4].

[4] R. F. Astudillo and R. Orglmeister, "An MMSE estimator in mel-cepstral domain for robust large vocabulary automatic speech recognition using uncertainty propagation," in Proc. Interspeech, 2010, pp. 713–716.

Uncertainty Estimation & Propagation

Two uncertainty estimators (continued):

b) Equivalent Wiener Variance
   - The beamformer output is passed directly to feature extraction.
   - The variance is estimated from the ratio of beamformer output to input, interpreted as a Wiener gain [4].
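One plausible reading of estimator (b), sketched in numpy: the ratio of beamformer output power to input power is interpreted as a Wiener gain G, and the complex-Gaussian posterior then has the beamformer output as its mean and variance G(1-G)|Y|^2. This is an assumption for illustration; the exact formula used by the authors may differ.

```python
import numpy as np

def equivalent_wiener_variance(Y, X_beamformed, eps=1e-12):
    """Interpret the per-bin beamformer attenuation as a Wiener gain and derive a
    complex-Gaussian posterior for the clean STFT coefficients (illustrative sketch).

    Y            : beamformer input (reference channel) STFT, complex array
    X_beamformed : beamformer output STFT, complex array of the same shape
    """
    gain = np.clip(np.abs(X_beamformed) ** 2 / (np.abs(Y) ** 2 + eps), 0.0, 1.0)
    post_mean = X_beamformed                       # beamformer output as the point estimate
    # Wiener posterior variance G * lambda_N, with lambda_N approximated by (1 - G)|Y|^2
    post_var = gain * (1.0 - gain) * np.abs(Y) ** 2
    return post_mean, post_var
```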

Uncertainty Propagation

- The uncertainty propagation scheme of [5] was used:
  - Propagation through the absolute value yields the MMSE-STSA estimate.
  - After the filterbank, independent log-normal distributions are assumed.
  - The posterior of the clean speech in the cepstral domain is assumed Gaussian.
  - CMS and LDA are linear, so propagating mean and covariance through them is simple.

[5] R. F. Astudillo, "Integration of short-time Fourier domain speech enhancement and observation uncertainty techniques for robust automatic speech recognition," Ph.D. thesis, Technical University Berlin, 2010.
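A simplified numpy sketch of the filterbank-to-cepstrum part of such a propagation, under the log-normal assumption named above: given the mean and variance of each Mel filterbank energy, moment matching gives the mean and variance of its logarithm, and the DCT then maps both linearly. The function interface and the independence assumption across channels are illustrative; this is not the exact pipeline of [5].

```python
import numpy as np
from scipy.fftpack import dct

def propagate_logmel_to_cepstrum(mel_mean, mel_var, n_ceps=13, eps=1e-12):
    """Propagate per-dimension mean/variance of Mel filterbank energies through
    log and DCT, assuming log-normal filterbank outputs and independent channels.

    mel_mean, mel_var : arrays of shape (n_filters,)
    Returns mean and variance of the first n_ceps cepstral coefficients.
    """
    # log-normal moment matching: if E = exp(L) with L Gaussian, then
    # var(L) = log(1 + var(E)/E[E]^2) and E[L] = log(E[E]) - var(L)/2
    rel_var = mel_var / (mel_mean ** 2 + eps)
    log_var = np.log1p(rel_var)
    log_mean = np.log(mel_mean + eps) - 0.5 * log_var

    # the DCT is linear: means map through the matrix, variances through its square
    n_filters = mel_mean.shape[0]
    C = dct(np.eye(n_filters), type=2, norm='ortho', axis=0)[:n_ceps, :]
    cep_mean = C @ log_mean
    cep_var = (C ** 2) @ log_var      # diagonal-covariance assumption
    return cep_mean, cep_var
```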

Recognition under Uncertain Observations

- Standard observation likelihood for state q, mixture m:
      b_qm(x) = N(x; μ_qm, Σ_qm)
- Uncertainty Decoding: the observation uncertainty is added to the model variance,
      b_qm(x̂) = N(x̂; μ_qm, Σ_qm + Σ_x)
  [L. Deng, J. Droppo, and A. Acero, "Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion," IEEE Trans. Speech and Audio Processing, vol. 13, no. 3, pp. 412–421, May 2005.]
- Modified Imputation: the likelihood is evaluated at the point that maximizes the product of the feature posterior and the mixture component density,
      x*_qm = Σ_qm (Σ_qm + Σ_x)^(-1) x̂ + Σ_x (Σ_qm + Σ_x)^(-1) μ_qm,    b_qm(x*_qm) = N(x*_qm; μ_qm, Σ_qm)
  [D. Kolossa, A. Klimas, and R. Orglmeister, "Separation and robust recognition of noisy, convolutive speech mixtures using time-frequency masking and missing data techniques," in Proc. Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2005, pp. 82–85.]
- Both uncertainty-of-observation techniques collapse to the standard observation likelihood for Σ_x = 0.
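The two decoding rules can be written down for a single diagonal-covariance Gaussian component as in the sketch below; variable names are mine, and the formulas follow the standard forms of uncertainty decoding and modified imputation rather than code from the systems described here.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def uncertainty_decoding_loglik(x_hat, var_x, mean_qm, var_qm):
    """Uncertainty decoding: the observation variance is added to the model variance."""
    return log_gauss_diag(x_hat, mean_qm, var_qm + var_x)

def modified_imputation_loglik(x_hat, var_x, mean_qm, var_qm):
    """Modified imputation: evaluate the Gaussian at the per-component MAP estimate
    that combines the uncertain observation with the component mean."""
    x_star = (var_qm * x_hat + var_x * mean_qm) / (var_qm + var_x)
    return log_gauss_diag(x_star, mean_qm, var_qm)

# with zero observation uncertainty both rules reduce to the standard likelihood
x_hat, var_x = np.array([0.3, -1.2]), np.zeros(2)
mean_qm, var_qm = np.array([0.0, -1.0]), np.array([1.0, 0.5])
assert np.isclose(uncertainty_decoding_loglik(x_hat, var_x, mean_qm, var_qm),
                  modified_imputation_loglik(x_hat, var_x, mean_qm, var_qm))
```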

Further Improvements

Training: Informed Mixture Splitting

- Baum-Welch training is only locally optimal, so a good initialization and good split directions matter.
- Considering the covariance structure in mixture splitting is therefore advantageous: instead of splitting each Gaussian along the coordinate axis of maximum variance, it is split along the first eigenvector of its covariance matrix.
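A compact numpy illustration of this split rule; the displacement of 0.2 standard deviations along the dominant eigenvector is a common heuristic and an assumption here, not necessarily the value used in JASPER.

```python
import numpy as np

def split_gaussian(mean, cov, offset=0.2):
    """Split one full-covariance Gaussian into two, displacing the means along the
    first eigenvector (largest-eigenvalue direction) of the covariance matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    principal = eigvecs[:, -1]                 # dominant direction
    shift = offset * np.sqrt(eigvals[-1]) * principal
    return (mean + shift, cov.copy()), (mean - shift, cov.copy())

# example: a correlated 2-D Gaussian splits along its long axis, not along x1 or x2
mean = np.zeros(2)
cov = np.array([[2.0, 1.5],
                [1.5, 2.0]])
(g1, _), (g2, _) = split_gaussian(mean, cov)
print(g1, g2)   # means displaced along the [1, 1]/sqrt(2) direction
```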

Further Improvements

Integration: Recognizer Output Voting Error Reduction (ROVER)

- Recognition outputs are combined at the word level by dynamic programming on the generated lattice, taking into account
  - the frequency of word labels and
  - the posterior word probabilities.
- We use ROVER on the three jointly best systems, selected on the development set.

J. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Dec. 1997.
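The voting step can be sketched as below, assuming the hypotheses have already been aligned into slots of a word transition network (the alignment itself, which ROVER builds by iterative dynamic programming, is omitted); the null symbol, the confidence handling and the trade-off weight are illustrative assumptions.

```python
from collections import Counter

def rover_vote(aligned_slots, conf=None, alpha=0.5, null="@"):
    """Frequency-plus-confidence voting over an already-aligned word transition network.

    aligned_slots : list of slots, each slot a list of word labels
                    (one per system, null symbol for deletions)
    conf          : optional dict word -> confidence (posterior word probability)
    alpha         : trade-off between label frequency and confidence
    """
    output = []
    for slot in aligned_slots:
        n = len(slot)
        counts = Counter(slot)
        def score(word):
            freq = counts[word] / n
            c = conf.get(word, 0.0) if conf else 0.0
            return alpha * freq + (1.0 - alpha) * c
        best = max(counts, key=score)
        if best != null:
            output.append(best)
    return output

# toy example with three systems disagreeing on one optional word
slots = [["seven", "seven", "eleven"], ["@", "oh", "@"], ["two", "two", "two"]]
print(rover_vote(slots))   # -> ['seven', 'two']
```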

Results and Conclusions

Evaluation:
- Two scenarios are considered: clean training and multicondition ("mixed") training.
- In mixed training, all training data was used at all SNR levels, artificially adding randomly selected noise from noise-only recordings.
- Results are determined on the development set first.
- After selecting the best-performing system on the development data, final results are obtained as keyword accuracies on the isolated sentences of the test set.

Results and Conclusions

Results after clean training (keyword accuracies at -6, -3, 0, 3, 6 and 9 dB SNR); systems compared:
- Official Baseline (clean training)
- JASPER Baseline (JASPER uses full-covariance training with MCE iteration control; its token passing is equivalent to HTK)
- JASPER + BF + UP (best strategy: delay-and-sum beamformer + noise estimation + modified imputation)
- HTK + BF + UP (best strategy: Wiener post-filter + uncertainty estimation)
- HTK + BF + UP + MLLR (best strategy: delay-and-sum beamformer + noise estimation)
- ROVER (JASPER + HTK): combination of (JASPER + DS + MI), (HTK + GSC + NE) and (JASPER + WPF + MI)

Results and Conclusions

Results after multicondition training (keyword accuracies at -6, -3, 0, 3, 6 and 9 dB SNR); systems compared:
- HTK Baseline (multicondition training)
- JASPER Baseline
- JASPER + BF + UP (best JASPER setup: delay-and-sum beamformer + noise estimation + modified imputation + LDA to 37 dimensions)
- as above, but with 39-dimensional features: +0.58%, -0.25%, -2.16%, -1.41%, -2.0%, -0.5% at -6 to 9 dB
- HTK + BF + UP (best HTK setup: delay-and-sum beamformer + noise estimation)
- HTK + BF + UP + MLLR
- ROVER (JASPER + HTK): combination of (JASPER + DS + MI + LDA), (JASPER + WPF, no observation uncertainties) and (HTK + DS + NE)

Results and Conclusions

Conclusions:
- Beamforming provides an opportunity to estimate not only the clean signal but also its standard error.
- This error, the observation uncertainty, can be propagated to the MFCC domain or another suitable domain to improve ASR by uncertainty-of-observation techniques.
- Best results were attained for uncertainty propagation with modified imputation.
- Training is critical, and despite strange philosophical implications, observation uncertainties also improve performance after multicondition training.
- The strategy is simple and generalizes easily to LVCSR.

Thank you!

Further Improvements

Training: MCE-Guided Training

- Iteration and splitting control are done by a minimum classification error (MCE) criterion on a held-out dataset.
- Algorithm for mixture splitting (a code sketch follows below):

    initialize split distance d
    while m < numMixtures:
        split all mixtures by distance d along the 1st eigenvector
        carry out re-estimations until the held-out accuracy no longer improves
        if acc_m >= acc_(m-1):
            m = m + 1
        else:
            go back to the previous model
            d = d / f
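A schematic Python rendering of this control loop; split_all_mixtures, reestimate and accuracy are placeholders for the trainer's actual operations, and the default split distance, the shrink factor f and the stopping guard are assumptions.

```python
import copy

def mce_guided_splitting(model, train_data, heldout_data, num_mixtures,
                         split_all_mixtures, reestimate, accuracy,
                         d0=0.2, f=2.0):
    """Grow the number of Gaussians per state under MCE control on held-out data.

    split_all_mixtures(model, d) : split every Gaussian by distance d along its 1st eigenvector
    reestimate(model, data)      : one Baum-Welch re-estimation pass
    accuracy(model, data)        : classification accuracy on the held-out set
    All three are placeholders for the real trainer's operations.
    """
    d, m = d0, 1
    prev_model, prev_acc = copy.deepcopy(model), accuracy(model, heldout_data)
    while m < num_mixtures and d > 1e-6:     # lower bound on d is a safety guard
        model = split_all_mixtures(copy.deepcopy(prev_model), d)
        # re-estimate until the held-out accuracy stops improving
        acc = accuracy(model, heldout_data)
        while True:
            model = reestimate(model, train_data)
            new_acc = accuracy(model, heldout_data)
            if new_acc <= acc:
                break
            acc = new_acc
        if acc >= prev_acc:                  # accept the split
            prev_model, prev_acc = copy.deepcopy(model), acc
            m += 1
        else:                                # reject: go back, retry with a smaller split distance
            d = d / f
    return prev_model
```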