PhD Candidate: Tao Ma Advised by: Dr. Joseph Picone Institute for Signal and Information Processing (ISIP) Mississippi State University Linear Dynamic.

PhD Candidate: Tao Ma Advised by: Dr. Joseph Picone Institute for Signal and Information Processing (ISIP) Mississippi State University Linear Dynamic Model (LDM) for Automatic Speech Recognition

Institute for Signal and Information Processing (ISIP) Page 1 of 20 An Example of Kalman Filter (another name of LDM) Observation A Kalman Filter models the position evolution In control system engineering, Kalman Filter succeeds to model a system with noisy observations Filtering : Position at present time (remove noise effect) Predicting : Position at a future time Smoothing : Position at a time in the past

Institute for Signal and Information Processing (ISIP) Page 2 of 20 Outline Why Linear Dynamic Model (LDM)? Linear Dynamic Model Pilot experiment: LDM phone classification on Aurora 4 Hybrid HMM/LDM decoder architecture for LVCSR Future work

Institute for Signal and Information Processing (ISIP) Page 3 of 20 HMM & Speech Recognition System Hidden Markov Models

Institute for Signal and Information Processing (ISIP) Page 4 of 20 Is HMM a perfect model for ASR? Progress on improving the accuracy of HMM-based system has slowed in the past decade Theory drawbacks of HMM –False assumption that frames are independent and stationary –Spatial correlation is ignored (diagonal covariance matrix) –Limited discrete state space Accuracy Time Clean Noisy

Institute for Signal and Information Processing (ISIP) Page 5 of 20 Motivation of Linear Dynamic Model (LDM) Research Motivation –A model which reflects the characteristics of speech signals will ultimately lead to great ASR performance improvement –LDM incorporates frame correlation information of speech signals, which is potential to increase recognition accuracy –“Filter” characteristic of LDM has potential to improve noise robustness of speech recognition –Fast growing computation capacity (thanks to Intel) make it realistic to build a two-way HMM/LDM hybrid speech engine

Institute for Signal and Information Processing (ISIP) Page 6 of 20 State Space Model Linear Dynamic Model (LDM) is derived from State Space Model Equations of State Space Model: y: observation feature vector x: corresponding internal state vector h(): relationship function between y and x at current time f(): relationship function between current state and all previous states epsilon: noise component eta: noise component

Institute for Signal and Information Processing (ISIP) Page 7 of 20 Linear Dynamic Model Equations of Linear Dynamic Model (LDM) –Current state is only determined by previous state –H, F are linear transform matrices –Epsilon and Eta are driving components y: observation feature vector x: corresponding internal state vector H: linear transform matrix between y and x F: linear transform matrix between current state and previous state epsilon: driving component eta: driving component

Institute for Signal and Information Processing (ISIP) Page 8 of 20 Kalman filtering for state inference (E-Step of EM training) Human Being Sound System Kalman Filtering Estimation e For a speech sound,

Institute for Signal and Information Processing (ISIP) Page 9 of 20 RTS smoother for better inference Standard Kalman FilterKalman Filter with RTS smoother Rauch-Tung-Striebel (RTS) smoother –Additional backward pass to minimize inference error –During EM training, computes the expectations of state statistics

Institute for Signal and Information Processing (ISIP) Page 10 of 20 Maximum Likelihood Parameter Estimation (M-Step of EM training) Nothing but matrix multiplication! LDM Parameters aa ae ah ao aw ay b ch d dh eh er ………

Institute for Signal and Information Processing (ISIP) Page 11 of 20 LDM for Speech Classification MFCC Feature ……… aa ch eh x y HMM-Based Recognition LDM-Based Recognition MFCC Feature ……… aa ch eh x y Hypothesis x ^ x ^ x ^ x ^ x ^ x ^

Institute for Signal and Information Processing (ISIP) Page 12 of 20 Challenges of Applying LDM to ASR Segment-based model –frame-to-phoneme information is needed before classification EM training is sensitive to state initialization –Each phoneme is modeled by a LDM, EM training is to find a set of parameters for a specific LDM –No good mechanism for state initialization yet More parameters than HMM (2~3x) –Currently mono-phone model, to build a tri-phone model for LVCSR would need more training data

Institute for Signal and Information Processing (ISIP) Page 13 of 20 Pilot experiment: phone classification on Aurora 4 Aurora 4: Wall Street Journal + six kinds of noises –Airport, Babble, Car, Restaurant, Street, and Train Frame-to-phone alignment is generated by ISIP decoder (force align mode) – Adding language model will get 93% accuracy for clean data 40 phones, one vs. all classifier model clean dataset (Acc) noisy dataset (Acc) HMM46.9%36.8% LDM49.2%39.2%

Institute for Signal and Information Processing (ISIP) Page 14 of 20 Hybrid HMM/LDM decoder architecture for LVCSR Confidence Measurement Best Hypothesis

Institute for Signal and Information Processing (ISIP) Page 15 of 20 Status and future work The development of HMM/LDM hybrid decoder is still in progress –HMM/LDM hybrid decoder is Expected to be done in 2009 –ISIP HMM/SVM hybrid decoder acts as the reference for implementation Future work –Research has proved the nonlinear effects in speech signals –Investigate the probability of replacing Kalman filtering with nonlinear filtering (such as Unscented Kalman Filter, Extended Kalman Filter)

Institute for Signal and Information Processing (ISIP) Page 16 of 20 Thank you! Questions?

Institute for Signal and Information Processing (ISIP) Page 17 of 20 References Digalakis, V., “Segment-based Stochastic Models of Spectral Dynamics for Continuous Speech Recognition,” Ph.D. Dissertation, Boston University, Boston, Massachusetts, USA, 1992. Digalakis, V., Rohlicek, J. and Ostendorf, M., “ML Estimation of a Stochastic Linear System with the EM Algorithm and Its Application to Speech Recognition,” IEEE Transactions on Speech and Audio Processing, vol. 1, no. 4, pp. 431–442, October 1993. Frankel, J., “Linear Dynamic Models for Automatic Speech Recognition,” Ph.D. Dissertation, The Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK, 2003.

PhD Candidate: Tao Ma Advised by: Dr. Joseph Picone Institute for Signal and Information Processing (ISIP) Mississippi State University Linear Dynamic.

Similar presentations

Presentation on theme: "PhD Candidate: Tao Ma Advised by: Dr. Joseph Picone Institute for Signal and Information Processing (ISIP) Mississippi State University Linear Dynamic."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

PhD Candidate: Tao Ma Advised by: Dr. Joseph Picone Institute for Signal and Information Processing (ISIP) Mississippi State University Linear Dynamic.

Similar presentations

Presentation on theme: "PhD Candidate: Tao Ma Advised by: Dr. Joseph Picone Institute for Signal and Information Processing (ISIP) Mississippi State University Linear Dynamic."— Presentation transcript:

Similar presentations

About project

Feedback