Robust Feature Extraction for Automatic Speech Recognition based on Data-driven and Physiologically-motivated Approaches
Mark J. Harvilla (1), Chanwoo Kim (2), and Richard M. Stern (1,2)
(1) Electrical and Computer Engineering Department and (2) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA

Compensatory Spectral Averaging and Warping using Histograms (CSAWH)

CSAWH is based on the observation that noise significantly alters the characteristic distribution of subband speech power. It applies nonparametric transformations to match the distribution of the input speech to prototype distributions observed over clean reference data. Because these nonlinear transformations do not inherently discriminate between speech and noise, weighted spectral averaging is used to mitigate sporadic suppression of speech or amplification of noise.

Power-Normalized Cepstral Coefficients (PNCC)

PNCC combines multiple properties of the human auditory system (HAS), including:
- knowledge of the shape of the effective auditory filters, related to the cochlear response;
- the precedence effect (see SSF below);
- the rate-level nonlinearity.

PNCC applies power bias subtraction, motivated by the mismatch in the AM-GM ratio (the ratio of the arithmetic mean to the geometric mean) between clean and noisy speech. The AM-GM ratio is related to the shape parameter of the Gamma distribution, which characterizes well the distribution of linear speech power; it can also be used for blind SNR estimation (waveform-amplitude distribution analysis, WADA).

Introduction

It is well known that the accuracy of automatic speech recognition (ASR) systems is compromised in high-noise environments. In contrast, humans have a remarkable ability to recognize continuous speech accurately, largely independently of the environment. This observation suggests that the robustness of ASR systems can be increased by exploiting principles and adopting characteristic mechanisms of the human auditory system (HAS).
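To make the AM-GM statistic mentioned above concrete, here is a minimal sketch (not the authors' implementation; the Gamma shape and noise values are arbitrary illustrations):

```python
import numpy as np

def am_gm_ratio(power, eps=1e-12):
    """Ratio of the arithmetic mean to the geometric mean of power samples.

    Equals 1 for a constant signal and grows as the distribution becomes
    more peaked; for Gamma-distributed power it is a monotone function of
    the shape parameter, which is why it can serve as a blind SNR cue.
    """
    power = np.asarray(power, dtype=float) + eps
    am = power.mean()
    gm = np.exp(np.log(power).mean())
    return am / gm

rng = np.random.default_rng(0)
clean = rng.gamma(shape=0.4, scale=1.0, size=50_000)  # peaky, speech-like power
noisy = clean + 1.0                                   # additive noise floor
# The noise floor makes the distribution less peaked, so the AM-GM
# ratio of noisy power shrinks toward 1 relative to clean power.
assert am_gm_ratio(noisy) < am_gm_ratio(clean)
```

The key property is that the statistic depends only on the shape of the power distribution, not on its overall scale, so it can be measured without a reference signal.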
A contrasting, and quite possibly complementary, approach is to confront the problem from a statistical standpoint. By designing algorithms with statistical optimality in mind, robust systems can be built that are effective but do not necessarily adhere to any physiological mechanism.

PNCC processing chain: Input audio → Pre-emphasis → STFT → Magnitude squared → Gammatone filter bank → Peak power normalization → Medium-duration power bias subtraction → Power function nonlinearity → DCT & CMN → Feature

CSAWH processing chain: Input audio → Pre-emphasis → STFT → Magnitude squared → Gammatone filter bank → Peak power normalization → Histogram matching → Power function nonlinearity → Weighted spectral averaging → Audio resynthesis → Output audio

[Figure: clean and noisy subband speech power signals with and without PNCC processing (bottom and top, respectively).]
[Figure: the effect of noise on the distribution of subband speech power.]

Objectives

Our general objectives are to:
- Develop portable front-end features that generally improve the robustness of speech-based systems: most specifically ASR, but conceivably any fundamentally speech-based system, such as voice-activity detectors, speaker recognizers, and keyword spotters.
- Maintain the generality of the features so that they are independent of ASR systems, tasks, and other adaptation and normalization techniques.
- Design the features with the overall objective of reducing the mismatch between training and testing data.

Suppression of Slowly-varying components and the Falling edge (SSF)

SSF is based on the precedence effect, the tendency of the HAS to focus on the first-arriving wavefront of a given sound source. By emphasizing onsets, the spectral smearing effect of reverberation can be partially counteracted.

Selected Experimental Results

The first plots show results from CMU Sphinx-3 on RM1 in white noise.
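Before the remaining results, the histogram-matching step of the CSAWH chain described above can be sketched as a nonparametric quantile mapping (an illustrative stand-in, not the authors' exact transformation; the real system matches subband power distributions against prototypes learned from clean speech):

```python
import numpy as np

def histogram_match(x, reference, n_quantiles=256):
    """Map samples of x so their empirical distribution matches `reference`.

    Builds a piecewise-linear map from the quantiles of the (noisy) input
    to the quantiles of the clean prototype: each input value is sent to
    the reference value occupying the same quantile.
    """
    q = np.linspace(0.0, 1.0, n_quantiles)
    x_q = np.quantile(x, q)            # quantiles of the input
    ref_q = np.quantile(reference, q)  # quantiles of the clean prototype
    return np.interp(x, x_q, ref_q)

rng = np.random.default_rng(1)
clean = rng.gamma(shape=0.5, scale=2.0, size=20_000)  # stand-in for clean subband power
noisy = clean + 3.0                                   # noise floor shifts the distribution
matched = histogram_match(noisy, clean)
# After matching, the median is pulled back toward that of clean speech.
assert abs(np.median(matched) - np.median(clean)) < abs(np.median(noisy) - np.median(clean))
```

Because the map is monotone, it preserves the ordering of power values; as the poster notes, it cannot by itself distinguish speech from noise, which is why CSAWH follows it with weighted spectral averaging.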
Below, results are depicted for RATS-like noise on the SRI DECIPHER ASR. In both cases, the left column shows clean training and the right column shows multistyle training.

SSF processing chain: Input audio → Pre-emphasis → STFT → Magnitude squared → Gammatone filter bank → SSF processing → Spectral reshaping → Inverse STFT → Post de-emphasis → Output audio

SSF processing is defined by:

M[m,l] = λ M[m-1,l] + (1 - λ) P[m,l]
P1[m,l] = max(P[m,l] - M[m,l], c0 P[m,l])   (Type-I)
P2[m,l] = max(P[m,l] - M[m,l], c0 M[m,l])   (Type-II)

where P[m,l] is the subband power in frame m and channel l, M[m,l] is its slowly varying (lowpass-filtered) version, λ is a forgetting factor, and c0 is a small flooring coefficient.

The spectral reshaping, inverse STFT, and post de-emphasis blocks effectively resynthesize audio; this helps to smooth spectral discontinuities introduced by the nonlinear processing.

Below, results for SSF are compared to other standard feature extraction algorithms. The left plot shows results for clean speech in reverberation and the right plot shows results for speech in music noise. The differences between SSF Type-I and Type-II are pronounced in reverberation, but insignificant in the other case.

[Figure: the effect of SSF in emphasizing onsets in reverberation.]
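The SSF recursion above can be sketched as follows (the lam and c0 values here are assumptions for illustration, not taken from the poster):

```python
import numpy as np

def ssf(P, lam=0.4, c0=0.01, type2=False):
    """Illustrative SSF processing of a subband power array.

    P: array of shape (frames, channels) holding P[m, l].
    M[m,l] = lam*M[m-1,l] + (1-lam)*P[m,l] is a lowpass (slowly varying)
    version of the power; subtracting it emphasizes onsets, and the max()
    floors the result so the output power stays positive.
    """
    P = np.asarray(P, dtype=float)
    out = np.empty_like(P)
    M = P[0].copy()  # initialize the lowpass state with the first frame
    for m in range(P.shape[0]):
        M = lam * M + (1.0 - lam) * P[m]
        floor = c0 * (M if type2 else P[m])  # Type-II vs. Type-I flooring
        out[m] = np.maximum(P[m] - M, floor)
    return out

# A step onset in one channel: SSF responds strongly at the onset frame
# and decays afterward, mimicking the precedence effect.
P = np.concatenate([np.full((10, 1), 0.1), np.full((10, 1), 1.0)])
y = ssf(P)
assert y[10, 0] > y[15, 0]  # onset frame exceeds later steady-state frames
```

Sustained energy (including the smeared tail that reverberation adds) is tracked by M and largely cancelled, while abrupt onsets survive, which is the behavior the reverberation figure illustrates.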