Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition

S. Srinivasan, T. Ma, D. May, G. Lazarou and J. Picone
Department of Electrical and Computer Engineering, Mississippi State University
URL:

This material is based upon work supported by the National Science Foundation under Grant No. IIS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Abstract

Gaussian mixture models are a very successful method for modeling the output distribution of a state in a hidden Markov model (HMM). However, this approach is limited by the assumption that the dynamics of speech features are linear and can be modeled with static features and their derivatives. In this paper, a nonlinear mixture autoregressive model is used to model state output distributions (MAR-HMM). Estimation of the model parameters is extended to handle vector features. MAR-HMMs are shown to provide superior performance to comparable Gaussian mixture model-based HMMs (GMM-HMMs), with lower complexity, on two pilot classification tasks.

Motivation

Popular approaches to speech recognition use:
• Gaussian mixture models (GMMs) to model state output distributions in HMM-based acoustic models.
• Additional passes of discriminative training to optimize model performance.
• Discriminative classifiers such as support vector machines for rescoring or combining hypotheses.

Limitations of current approaches:
• GMMs model the marginal pdf and do not explicitly model dynamics.
• Adding differential features to GMMs still only allows modeling of a linear dynamical system and does not directly model nonlinear dynamics.
• Extracting features that explicitly model the nonlinear dynamics of speech (May, 2008) has not provided significant improvements that are robust to noise or mismatched training conditions.

Goal: directly capture dynamic information in the acoustic model.

Nonlinear Dynamic Invariants as Features (May, 2008)

Nonlinear features quantify the amount of nonlinearity in the signal (e.g., Lyapunov exponents, fractal dimension, and correlation entropy).

Pros: invariant, at least in theory, to signal transformations such as noise distortions and channel effects. This leads to the expectation that nonlinear invariants could be noise-robust.

Cons: contrary to expectations, experiments with continuous speech data show that performance with invariants is not noise-robust (May, 2008). While there is an 11.1% relative increase in recognition performance on a clean evaluation set, a 7.6% relative decrease is observed on a noisy set.

Hypotheses:
• (Theoretical) Even when estimated correctly, the amount of nonlinearity is only a very crude measure of dynamic information. Invariant values can be similar even when the signals are quite different. This limits the gain achievable by using nonlinear invariants as features.
• (Practical) Invariant feature estimation algorithms are very sensitive to changes in channel conditions; they are almost never truly "invariant."

A Mixture Autoregressive Model (MAR)

MAR (Wong and Li, 2000) of order p with m components:

$$x_t = a_{i,0} + \sum_{j=1}^{p} a_{i,j}\, x_{t-j} + \varepsilon_i \quad \text{w.p. } w_i, \qquad i = 1, \dots, m$$

where:
• ε_i: zero-mean Gaussian noise with variance σ_i²
• "w.p. w_i": with probability w_i
• a_{i,j} (j > 0): AR predictor coefficients for component i
• a_{i,0}: mean (offset) for component i

The weighted sum of Gaussian AR processes provides additional nonlinear modeling capability.
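To make the generative ("synthesis") view of this definition concrete, here is a minimal sketch of sampling from a scalar MAR process; the coefficient values below are hypothetical, chosen only for illustration.

```python
import numpy as np

def sample_mar(T, w, a, sigma, rng=None):
    """Sample T points from a scalar MAR process.

    w     : (m,) mixture weights, summing to 1
    a     : (m, p+1) per-component coefficients [a_{i,0}, a_{i,1}, ..., a_{i,p}]
    sigma : (m,) per-component noise standard deviations
    """
    rng = rng or np.random.default_rng(0)
    m, p = a.shape[0], a.shape[1] - 1
    x = np.zeros(T + p)  # zero initial history
    for t in range(p, T + p):
        i = rng.choice(m, p=w)                        # pick a component w.p. w_i
        pred = a[i, 0] + a[i, 1:] @ x[t - 1::-1][:p]  # a_{i,0} + sum_j a_{i,j} x_{t-j}
        x[t] = pred + sigma[i] * rng.normal()         # add the Gaussian innovation
    return x[p:]

# Hypothetical 2-component, order-1 MAR (illustrative values only)
x = sample_mar(1000, w=np.array([0.6, 0.4]),
               a=np.array([[0.0, 0.9], [1.0, -0.5]]),
               sigma=np.array([0.5, 1.0]))
```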

Interpretations

MAR as a weighted sum of Gaussian autoregressive (AR) processes:
• Analysis/modeling: the conditional distribution is modeled as a weighted combination of AR filters with Gaussian white-noise inputs (hence, data are modeled as colored, non-Gaussian signals).
• Synthesis: a random process in which the data sampled at any one point in time are generated from one component of an AR mixture process, chosen randomly according to its weight w_i.

Comparisons to GMMs:
• GMMs model the marginal distribution as a weighted sum of Gaussian random variables.
• A MAR with AR filter order 0 is equivalent to a GMM.
• MAR can capture information in the dynamics in addition to the static information captured by a GMM.

Overview of the MAR-HMM Approach

An example MAR-HMM system with three states, where each state's observations are modeled by a two-component MAR.

MAR Properties

Ability to model nonlinearities:
• Each AR component is linear.
• Probabilistic mixing of the AR processes introduces nonlinearity.

Dynamics of the probability distribution:
• In a GMM:
  - The distribution is invariant to past samples.
  - Only the marginal distribution is modeled.
• In a MAR:
  - The conditional distribution varies with past samples.
  - The joint distribution of the random process is modeled.
  - Nonlinear evolution of the time series is modeled through probabilistic mixing of linear AR processes.
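In symbols, this contrast follows directly from the MAR definition above (the GMM is written with the usual component means μ_i):

```latex
% GMM output density: a static marginal, independent of the signal history
p_{\mathrm{GMM}}(x_t) = \sum_{i=1}^{m} w_i \,
    \mathcal{N}\!\left(x_t;\ \mu_i,\ \sigma_i^2\right)

% MAR output density: a conditional density that moves with the past p samples
p_{\mathrm{MAR}}\!\left(x_t \mid x_{t-1}, \dots, x_{t-p}\right)
    = \sum_{i=1}^{m} w_i \,
    \mathcal{N}\!\left(x_t;\ a_{i,0} + \textstyle\sum_{j=1}^{p} a_{i,j}\, x_{t-j},\ \sigma_i^2\right)
```

Setting p = 0 recovers the GMM, which is the equivalence noted on the Interpretations slide.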

MAR Parameter Estimation Using EM

E-step: the conditional probability that sample x_t came from mixture component i is

$$\gamma_{t,i} = \frac{w_i \, \mathcal{N}\!\left(x_t;\ \hat{x}_{t,i},\ \sigma_i^2\right)}{\sum_{k=1}^{m} w_k \, \mathcal{N}\!\left(x_t;\ \hat{x}_{t,k},\ \sigma_k^2\right)}$$

where the component prediction is

$$\hat{x}_{t,i} = a_{i,0} + \sum_{j=1}^{p} a_{i,j}\, x_{t-j}.$$

In a GMM-HMM, a_{i,j} = 0 for j > 0, while a_{i,0} corresponds to the mean. This reduces to the familiar EM equations.

MAR Parameter Estimation Using EM

M-step: parameter updates. With the prediction error e_{t,i} = x_t - \hat{x}_{t,i}, the weights and variances are

$$w_i = \frac{1}{T}\sum_{t} \gamma_{t,i}, \qquad \sigma_i^2 = \frac{\sum_{t} \gamma_{t,i}\, e_{t,i}^2}{\sum_{t} \gamma_{t,i}},$$

and the predictor coefficients a_{i,0}, ..., a_{i,p} are obtained by solving the γ_{t,i}-weighted least-squares normal equations

$$\sum_{t} \gamma_{t,i} \left( x_t - a_{i,0} - \sum_{j=1}^{p} a_{i,j}\, x_{t-j} \right) z_{t,k} = 0, \qquad k = 0, \dots, p,$$

where z_{t,0} = 1 and z_{t,k} = x_{t-k} for k ≥ 1.

Again, for a GMM-HMM, a_{i,j} = 0 for j > 0 and a_{i,0} is the component mean, reducing to the familiar EM equations.
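A compact numerical sketch of one EM iteration for a scalar MAR follows; it ignores HMM state occupancies (which would simply scale the responsibilities), assumes the update equations above, and is illustrative rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def em_step_mar(x, w, a, sigma):
    """One EM iteration for a scalar MAR of order p (no HMM states).

    x : (T+p,) observed series;  w : (m,) weights
    a : (m, p+1) coefficients [a_{i,0}, ..., a_{i,p}];  sigma : (m,)
    """
    m, p = a.shape[0], a.shape[1] - 1
    T = len(x) - p
    # Regressor matrix Z: rows [1, x_{t-1}, ..., x_{t-p}]
    Z = np.column_stack([np.ones(T)] +
                        [x[p - j:len(x) - j] for j in range(1, p + 1)])
    y = x[p:]

    # E-step: responsibilities gamma[t, i]
    pred = Z @ a.T                                   # (T, m) per-component predictions
    like = norm.pdf(y[:, None], loc=pred, scale=sigma) * w
    gamma = like / like.sum(axis=1, keepdims=True)

    # M-step: weights, weighted least-squares coefficients, variances
    w_new = gamma.mean(axis=0)
    a_new = np.empty_like(a)
    sig_new = np.empty_like(sigma)
    for i in range(m):
        g = gamma[:, i]
        a_new[i] = np.linalg.solve(Z.T @ (g[:, None] * Z), Z.T @ (g * y))
        e = y - Z @ a_new[i]
        sig_new[i] = np.sqrt((g * e**2).sum() / g.sum())
    return w_new, a_new, sig_new
```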

Vector Extension of MAR

The basic formulation of MAR is only for scalar time series. How can we extend MAR to handle vector time series (e.g., MFCCs) without fundamental modifications to the EM equations?

Solution: tie the MAR mixture components across the scalar dimensions, using a single weight per mixture estimated by combining the likelihoods of all scalar dimensions for each mixture:

$$p_i(\mathbf{x}_t \mid \text{past}) = \prod_{d=1}^{D} p_i\!\left(x_{t,d} \mid x_{t-1,d}, \dots, x_{t-p,d}\right).$$

Pros: only a simple modification to the weight update during EM training.

Cons: assumes the scalar dimensions are decoupled (except for the shared mixture weight parameter). This is similar to the diagonal-covariance assumption in GMMs, a very common assumption for MFCC features.
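Under that decoupling assumption, the tied responsibilities can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the array layouts are assumptions.

```python
import numpy as np
from scipy.stats import norm

def tied_responsibilities(X, w, A, Sigma):
    """Per-frame mixture responsibilities for the vector MAR extension.

    X     : (T+p, D) feature vectors (e.g., MFCCs)
    w     : (m,) tied mixture weights, shared by all D dimensions
    A     : (D, m, p+1) per-dimension AR coefficients
    Sigma : (D, m) per-dimension noise standard deviations
    """
    D, m, p1 = A.shape
    p = p1 - 1
    T = X.shape[0] - p
    loglik = np.tile(np.log(w), (T, 1))              # start from log w_i
    for d in range(D):
        x = X[:, d]
        Z = np.column_stack([np.ones(T)] +
                            [x[p - j:len(x) - j] for j in range(1, p + 1)])
        pred = Z @ A[d].T                            # (T, m) predictions for dim d
        loglik += norm.logpdf(x[p:, None], loc=pred, scale=Sigma[d])
    # Tied responsibilities: normalize over mixtures, frame by frame
    gamma = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    return gamma / gamma.sum(axis=1, keepdims=True)
```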

Relationship to Other AR-HMMs

• (Poritz, 1980) Linear predictive HMM: equivalent to a MAR-HMM with a single mixture component.
• (Juang and Rabiner, 1985) Mixture autoregressive HMM: somewhat similar to the model investigated here (even identical in name!), but with two major differences:
  - Component means were assumed to be zero.
  - Variances were assumed to be equal across components.
• (Ephraim and Roberts, 2005) Switching AR-HMM: restricted to a single-component MAR-HMM.

All of the above were applied to scalar time series with speech samples as observations. Our vector extension of MAR has been applied to modeling of the MFCC vector feature stream.

(Synthetic) Signal Classification – Experimental Design

• Two-class problem.
• Same marginal distribution, different dynamics:
  - Signal 1: no dependence on past samples.
  - Signal 2: nonlinear dependence on past samples.
• Signal parameters for the two classes were fixed per class (an illustrative construction is sketched below).
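The exact signal parameters are not recoverable from this transcript, so the pair below is purely hypothetical: it only illustrates the design idea of one memoryless signal and one signal with a nonlinear dependence on its past.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 2000

# Class 1 (hypothetical): i.i.d. samples, no dependence on past values
signal1 = rng.normal(0.0, 1.0, size=T)

# Class 2 (hypothetical): nonlinear dependence on the previous sample;
# the actual study chose parameters so both classes share the same marginal
signal2 = np.zeros(T)
for t in range(1, T):
    signal2[t] = 0.9 * np.tanh(2.0 * signal2[t - 1]) + 0.5 * rng.normal()
```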

(Synthetic) Signal Classification – Results

Table 1: Classification accuracy (%) for the synthetic data, with the number of model parameters in parentheses.

| # mixtures | GMM, static only | MAR, static only | GMM, static+∆+∆∆ | MAR, static+∆+∆∆ |
|------------|------------------|------------------|-------------------|-------------------|
| 2          | 50.0 (6)         | 100.0 (8)        | 82.5 (14)         | 100.0 (20)        |
| 4          | 50.0 (12)        | 100.0 (16)       | 85.0 (28)         | 100.0 (40)        |

• Static-only features: the GMM can do only as well as a random guess (because both signals have the same marginal); MAR uses the distinct signal dynamics to yield 100% classification accuracy.
• Static+∆+∆∆ features: GMM performance improves significantly due to the added dynamic information, yet MAR still outperforms the GMM because it can capture the nonlinear dynamics in the signals.

Sustained Phone Classification – Experimental Design

• Collected artificially elongated pronunciations of three sounds, sampled at 16 kHz: the vowel /aa/, the nasal /m/, and the sibilant /sh/. For this preliminary study, we wanted to avoid artifacts introduced by coarticulation.
• Static feature database: 13 MFCCs (including energy) extracted at a frame rate of 10 ms with a window duration of 25 ms.
• Static+∆+∆∆ database: static + velocity + acceleration MFCC coefficients.
• Training: 70 recordings of each phone. Testing: 30 recordings of each phone.
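For reference, a front end matching this description could be approximated as follows; this is a sketch using librosa (not necessarily the authors' toolchain), and the file name is hypothetical.

```python
import librosa
import numpy as np

# Load one recording at the stated 16 kHz sampling rate (hypothetical file name)
y, sr = librosa.load("phone_aa_001.wav", sr=16000)

# 13 MFCCs (c0 serving as the energy term) with a 25 ms window and 10 ms hop
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),       # 400-sample window
                            hop_length=int(0.010 * sr))  # 160-sample hop

# Velocity and acceleration coefficients for the static+delta+delta-delta set
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2])  # 39 x num_frames
```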

Sustained Phone Classification (Static Features) – Results

Table 2: Sustained phone classification accuracy (%) with MAR and GMM using the 13 static MFCC features (number of model parameters in parentheses; the GMM accuracies were lost in this transcript).

| # mixtures | GMM      | MAR        |
|------------|----------|------------|
| 2          | – (54)   | 83.3 (80)  |
| 4          | – (108)  | 90.0 (160) |
| 8          | – (216)  | 94.4 (320) |
| 16         | – (432)  | 95.6 (640) |

• For an equal number of parameters, MAR outperforms the GMM.
• MAR exploits dynamic information to better model the evolution of the MFCCs for each phone, which the GMM is unable to model.

Sustained Phone Classification (Static+∆+∆∆) – Results

Table 3: Sustained phone classification accuracy (%) with MAR and GMM using the 39 static+∆+∆∆ MFCC features (number of model parameters in parentheses; the GMM accuracies were lost in this transcript).

| # mixtures | GMM       | MAR         |
|------------|-----------|-------------|
| 2          | – (158)   | 94.4 (236)  |
| 4          | – (316)   | 97.8 (472)  |
| 8          | – (632)   | 97.8 (944)  |
| 16         | – (1264)  | 98.9 (1888) |

• MAR performs better than the GMM when the number of parameters is equal.
• MAR performance saturates as the number of parameters increases. Why? The assumption during MAR training that the features are uncorrelated is invalid: the static features may be largely unrelated, but the ∆ features are by construction related to the static features. This causes problems for both diagonal-covariance GMMs and MAR, but it is more severe for MAR because of its explicit modeling of dynamics.

Summary and Future Work

Conclusions:
• Presented a novel nonlinear mixture autoregressive HMM (MAR-HMM) for speech recognition, with a vector extension.
• Demonstrated the efficacy of MAR-HMM over GMM-HMM on a two-class signal classification problem using synthetic signals.
• Showed the superiority of MAR-HMM over GMM-HMM for sustained phone classification using static MFCC features.
• Experiments with static+∆+∆∆ features were not as conclusive: MAR-HMM performance saturated more quickly than GMM-HMM.

Future Work:
• Perform large-scale phone recognition experiments to study MAR-HMM performance in more practical scenarios involving noise (e.g., Aurora-4).
• Investigate whether phone-dependent decorrelation (class-based PCA) of the features improves MAR-HMM performance.

References

1. D. May, Nonlinear Dynamic Invariants for Continuous Speech Recognition, M.S. Thesis, Department of Electrical and Computer Engineering, Mississippi State University, May 2008.
2. C. S. Wong and W. K. Li, "On a Mixture Autoregressive Model," Journal of the Royal Statistical Society: Series B, vol. 62, no. 1, pp. 95-115, Feb. 2000.
3. A. B. Poritz, "Linear Predictive Hidden Markov Models," Proceedings of the Symposium on the Application of Hidden Markov Models to Text and Speech, Princeton, New Jersey, USA, Oct. 1980.
4. B. H. Juang and L. R. Rabiner, "Mixture Autoregressive Hidden Markov Models for Speech Signals," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 33, no. 6, Dec. 1985.
5. Y. Ephraim and W. J. J. Roberts, "Revisiting Autoregressive Hidden Markov Modeling of Speech Signals," IEEE Signal Processing Letters, vol. 12, no. 2, Feb. 2005.