
1 Hybrid Systems for Continuous Speech Recognition Issac Alphonso alphonso@isip.msstate.edu Institute for Signal and Information Processing Mississippi State University

2 Abstract Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. SVMs have been shown to provide significant improvements in performance on small pattern recognition tasks compared to a number of conventional approaches. In this presentation, I will describe some of the work that I have done in implementing a kernel-based speech recognition system (based on work by Aravind Ganapathiraju). I will then describe our work in using kernel-based machines as acoustic models in large-vocabulary speech recognition systems. Finally, I will show that SVMs perform better than Gaussian-mixture-based HMMs in open-loop recognition.

3 Bio Issac Alphonso is an M.S. graduate from the Department of Electrical and Computer Engineering at Mississippi State University (MSU) under the supervision of Dr. Joe Picone. He has been a member of the Institute for Signal and Information Processing (ISIP) at MSU since 1997. Mr. Alphonso's work as a graduate student has revolved around exploring new acoustic modeling techniques for continuous speech recognition systems. His most recent work has been the implementation of a hybrid hierarchical decoder that employs kernel-based techniques, such as support vector machines, to replace the underlying Gaussian distributions in hidden Markov models. His thesis work looks at a new network training framework that reduces the complexity of the training process while retaining the robustness of the expectation-maximization-based supervised training framework.

4 Outline
- What we do and how we fit in the big picture
- The acoustic modeling problem for speech
- Structural risk minimization
- Support vector classifiers
- Coupling vector machines to ASR systems
- Proof of concept and experiments

5 Technology
- Focus: speech recognition
- First public-domain LVCSR system
- Goal: accelerate research
- Extensible and modular (C++, Java)
- Easy to use (docs, tutorials, toolkits)
- Benefit: standard technology benchmarks

6 Approach
- Research tools: Matlab, Octave, Python
- ASR systems: HTK, SPHINX, CSLU
- ISIP: IFCs, Java apps, toolkits
- Goals: rapid prototyping, "fair" evaluations, ease of use, lightweight programming
- Efficiency: memory, hyper-real-time training, parallel processing, data-intensive workloads

7 ASR Problem
- The front end retains the information important for modeling in a reduced parameter set
- The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams)
- The search engine uses the knowledge sources and models to choose among competing hypotheses

8 Acoustic Confusability
- Requires reasoning under uncertainty!
- Regions of overlap represent classification error
- Reduce overlap by introducing acoustic and linguistic context
- [Figure: comparison of "aa" in "lOck" and "iy" in "bEAt" for SWB]

9 Probabilistic Formulation
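The slide's own equation was lost in transcription; a reconstruction of the standard formulation it refers to:

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\,P(W)
```

where A is the sequence of acoustic observations, P(A|W) is the acoustic model (here, HMMs), and P(W) is the language model; P(A) is constant over hypotheses and can be dropped.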

10 Acoustic Modeling - HMMs
- HMMs model temporal variation in the transition probabilities of the state machine
- GMM emission densities are used to account for variations in speaker, accent, and pronunciation
- Sharing model parameters is a common strategy to reduce complexity
- [Figure: five-state HMM (s0 through s4); word models for THREE, TWO, FIVE, EIGHT]
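As a concrete sketch of how an HMM scores an observation sequence, here is the forward algorithm in log space; the toy transition and emission numbers are illustrative, not from the presentation:

```python
import numpy as np

def forward_log_likelihood(log_A, log_B, log_pi):
    """Forward algorithm: log P(observation sequence | HMM).

    log_A:  (S, S) log transition probabilities
    log_B:  (T, S) log emission score of each of the T frames per state
    log_pi: (S,)   log initial-state probabilities
    """
    T, S = log_B.shape
    alpha = log_pi + log_B[0]                      # first frame
    for t in range(1, T):
        # log-sum-exp over predecessor states, then add the emission score
        alpha = log_B[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)              # sum over final states

# Toy 2-state HMM (illustrative numbers only)
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_pi = np.log([0.6, 0.4])
log_B = np.log([[0.9, 0.2], [0.8, 0.3], [0.1, 0.7]])  # 3 frames, 2 states
print(forward_log_likelihood(log_A, log_B, log_pi))
```

Working in log space avoids the numerical underflow that plagues long utterances when probabilities are multiplied directly.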

11 Hierarchical Search
- Each node in the hierarchy can dynamically expand to explore sub-networks at the next level
- HMMs are employed at the lowest level of the search hierarchy
- Word networks can generalize to pronunciation variants unseen in the data

12 Statistical Models
- Each state in the HMM is associated with a statistical model (except the non-emitting start and stop states)
- The statistical model can implement any pdf that follows a defined interface contract
- The statistical model can transparently take the form of a GMM or an SVM
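The "interface contract" idea can be sketched as follows; the class and method names are hypothetical, not ISIP's actual API:

```python
from abc import ABC, abstractmethod
import math

class StatisticalModel(ABC):
    """Interface contract: any pdf-like scorer can back an HMM state."""
    @abstractmethod
    def log_score(self, x: float) -> float: ...

class GaussianModel(StatisticalModel):
    """Single Gaussian density (a GMM component, for brevity)."""
    def __init__(self, mean, var):
        self.mean, self.var = mean, var
    def log_score(self, x):
        return -0.5 * (math.log(2 * math.pi * self.var)
                       + (x - self.mean) ** 2 / self.var)

class SigmoidSVMModel(StatisticalModel):
    """SVM distance mapped through a sigmoid to a posterior-like score."""
    def __init__(self, decision_fn, a=-1.0, b=0.0):
        self.f, self.a, self.b = decision_fn, a, b
    def log_score(self, x):
        d = self.f(x)
        return -math.log1p(math.exp(self.a * d + self.b))

# An HMM state only sees the interface, so GMM and SVM are interchangeable:
for model in (GaussianModel(0.0, 1.0), SigmoidSVMModel(lambda x: 2 * x)):
    print(model.log_score(0.5))
```

Because the decoder calls only `log_score`, swapping the emission model does not require touching the search code.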

13 Maximum Likelihood Training
- Data-driven modeling, supervised only by a word-level transcription
- Approach: maximum likelihood estimation
- The EM algorithm is used to improve our estimates: convergence to a local maximum is guaranteed, but there is no guard against overfitting!
- Computationally efficient training algorithms (Forward-Backward) have been crucial
- Decision trees are used to optimize parameter sharing, minimize system complexity, and integrate additional linguistic knowledge
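A minimal sketch of EM for a two-component 1-D GMM, illustrating the guaranteed (local) convergence and the absence of any guard against overfitting; all data and initialization choices below are synthetic:

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D GMM. Each iteration cannot decrease
    the data log-likelihood, but convergence is only to a local maximum
    and nothing prevents overfitting."""
    mu = np.array([x.min(), x.max()])       # crude deterministic init
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        pdf = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pdf / pdf.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        n = r.sum(axis=0)
        w = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n
    return w, mu, var

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
w, mu, var = em_gmm_1d(data)
print(w, mu, var)
```

The Forward-Backward algorithm used for HMM training is the same E-step/M-step pattern, with responsibilities computed over hidden state sequences rather than mixture components.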

14 Drawbacks of Current Approach
- ML convergence does not translate to optimal classification
- Error arises from incorrect modeling assumptions
- Finding the optimal decision boundary requires only one parameter!

15 Drawbacks of Current Approach
- Data not separable by a hyperplane: a nonlinear classifier is needed
- Gaussian MLE models tend toward the center of mass: overtraining leads to poor generalization

16 Structural Risk Minimization
- The VC dimension is a measure of the complexity of the learning machine
- A higher VC dimension gives a looser bound on the actual risk, thus penalizing a more complex model (Vapnik)
- Expected risk: not possible to estimate P(x,y) directly
- Empirical risk: related to the expected risk through the VC dimension, h
- Approach: choose the machine that gives the least upper bound on the actual risk
- [Figure: bound on the expected risk = empirical risk + VC confidence, plotted against VC dimension h; the expected-risk optimum lies at intermediate complexity]
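The slide's equations were lost in transcription; the standard statements (following Vapnik) are:

```latex
\text{Expected risk:}\quad
R(\alpha) = \int \tfrac{1}{2}\,\lvert y - f(\mathbf{x}, \alpha)\rvert \; dP(\mathbf{x}, y)

\text{Empirical risk:}\quad
R_{\mathrm{emp}}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} \lvert y_i - f(\mathbf{x}_i, \alpha)\rvert

\text{With probability } 1 - \eta:\quad
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) +
\sqrt{\frac{h\left(\ln(2l/h) + 1\right) - \ln(\eta/4)}{l}}
```

The square-root term is the VC confidence: for fixed training set size l it grows with the VC dimension h, which is what penalizes the more complex machine.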

17 Support Vector Machines
- Hyperplanes C0-C2 all achieve zero empirical risk; C0 generalizes optimally
- The data points that define the boundary are called support vectors
- Optimization (separable data), hyperplane: w · x + b = 0, constraints: y_i (w · x_i + b) ≥ 1
- Quadratic optimization of a Lagrange functional minimizes the risk criterion (maximizes the margin); only a small portion of the training points become support vectors
- Final classifier: f(x) = sign(Σ_i α_i y_i (x_i · x) + b)
- [Figure: two classes with hyperplanes H1, H2 and C0-C2; weight vector w from the origin; C0 is the optimal classifier]
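The quadratic optimization referred to above can be written out as the standard dual problem:

```latex
\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
    \alpha_i \alpha_j \, y_i y_j \, (\mathbf{x}_i \cdot \mathbf{x}_j)
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0
```

with w = Σ_i α_i y_i x_i; the training points with α_i > 0 are exactly the support vectors.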

18 SVMs as Nonlinear Classifiers
- Data for practical applications are typically not separable using a hyperplane in the original input feature space
- Transform the data to a higher-dimensional space where a hyperplane classifier is sufficient to model the decision surface
- Kernels are used for this transformation
- Final classifier: f(x) = sign(Σ_i α_i y_i K(x_i, x) + b)
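To make the kernel idea concrete: a dual-form (kernel) perceptron has the same final functional form, sign(Σ_i α_i y_i K(x_i, x)), and with an RBF kernel it separates XOR-style data that no hyperplane in the input space can. This is a sketch standing in for the SVM's quadratic optimization, not the actual system:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF kernel: implicit mapping to a very high-dimensional space."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_perceptron(X, y, kernel, epochs=20):
    """Dual-form perceptron: increments alpha_j whenever point j is
    misclassified by the current kernel expansion."""
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for j in range(len(X)):
            s = sum(alpha[i] * y[i] * kernel(X[i], X[j]) for i in range(len(X)))
            if np.sign(s) != y[j]:
                alpha[j] += 1
    return alpha

# XOR-style data: not separable by any hyperplane in the input space
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1, 1, -1, -1])
alpha = kernel_perceptron(X, y, rbf)
pred = [np.sign(sum(alpha[i] * y[i] * rbf(X[i], x) for i in range(4))) for x in X]
print(pred)
```

A linear kernel on the same data never converges, which is exactly the point of the transformation.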

19 Experimental Progression
- Proof of concept on speech classification using the Deterding vowel corpus
- Coupling the SVM classifier to an ASR system
- Results on the OGI Alphadigits corpus

20 Vowel Classification
Deterding Vowel Data: 11 vowels spoken in "h*d" context; 10 log-area parameters; 528 train, 462 test

Approach                    % Error   # Parameters
SVM: Polynomial Kernels     49%
K-Nearest Neighbor          44%
Gaussian Node Network       44%
SVM: RBF Kernels            35%       83 SVs
Separable Mixture Models    30%

21 Coupling to ASR
- Data size: 30 million frames of data in the training set; solution: segmental phone models
- Source for segmental data: use the HMM system in a bootstrap procedure (could also build a segment-based decoder)
- Probabilistic decoder coupling: SVMs use a sigmoid fit to estimate posteriors
- [Figure: the phone sequence "hh aa r y uw" over a k-frame segment; the segment is split into three regions of 0.3k, 0.4k, and 0.3k frames, each represented by its mean]
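The 3-4-3 segmental feature computation shown in the figure can be sketched as follows; the function name and the exact rounding at region boundaries are assumptions:

```python
import numpy as np

def segmental_features(frames):
    """Convert a variable-length segment of frame vectors into a
    fixed-length vector: per-dimension means of three regions covering
    roughly 30% / 40% / 30% of the k frames."""
    k = len(frames)
    b1, b2 = int(round(0.3 * k)), int(round(0.7 * k))
    regions = [frames[:b1], frames[b1:b2], frames[b2:]]
    return np.concatenate([np.mean(r, axis=0) for r in regions])

# Toy segment: 10 frames of 2-dimensional features
seg = np.arange(20, dtype=float).reshape(10, 2)
print(segmental_features(seg))
```

The fixed-length output is what lets a standard SVM classify segments whose underlying durations vary.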

22 Coupling to ASR System
Features (mel-cepstra) → HMM recognition → segment information and N-best list → segmental converter → segmental features → hybrid decoder → hypothesis

23 N-Best Rescoring
- A word-internal N-gram decoder is used to generate the N-best word-graphs
- The word-graphs carry the HMM and LM scores, which are used in the rescoring process
- The SVM score, computed during rescoring, is used as an additional knowledge source
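The score combination above can be sketched as a simple re-ranking; the presentation does not give the exact combination weights, so the weights and scores below are illustrative:

```python
def rescore_nbest(nbest, lm_weight=1.0, svm_weight=1.0):
    """Re-rank an N-best list. Each hypothesis carries the HMM and LM
    scores from the word-graph; the SVM score is added as an extra
    knowledge source (all scores log-domain, higher is better)."""
    def total(hyp):
        return hyp["hmm"] + lm_weight * hyp["lm"] + svm_weight * hyp["svm"]
    return sorted(nbest, key=total, reverse=True)

nbest = [
    {"text": "A 1 9 B 4 E", "hmm": -120.0, "lm": -10.0, "svm": -3.0},
    {"text": "A 1 9 D 4 E", "hmm": -119.0, "lm": -10.5, "svm": -8.0},
]
print(rescore_nbest(nbest)[0]["text"])
```

In this toy case the SVM knowledge source overturns the HMM's slightly preferred (but wrong) hypothesis.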

24 Alphadigit Recognition
- OGI Alphadigits: continuous, telephone-bandwidth letters and numbers ("A19B4E")
- 3329 utterances rescored using 10-best lists generated by the HMM decoder
- SVMs require a sigmoid posterior estimate to produce likelihoods; the sigmoid parameters are estimated from a large held-out set
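A sketch of fitting that sigmoid on held-out decision values; this uses plain gradient descent on the cross-entropy as a simplified stand-in for Platt's algorithm, and all numbers are illustrative:

```python
import math

def fit_sigmoid(decisions, labels, lr=0.5, epochs=2000):
    """Fit P(class | d) = 1 / (1 + exp(A*d + B)) to held-out SVM
    decision values d (labels are 1 or 0)."""
    A, B = -1.0, 0.0
    for _ in range(epochs):
        gA = gB = 0.0
        for d, y in zip(decisions, labels):
            p = 1.0 / (1.0 + math.exp(A * d + B))
            gA += (y - p) * d          # dL/dA for cross-entropy loss L
            gB += (y - p)              # dL/dB
        A -= lr * gA / len(decisions)
        B -= lr * gB / len(decisions)
    return A, B

# Illustrative held-out decision values: positives d > 0, negatives d < 0
d_held = [2.0, 1.5, 1.0, -1.0, -1.5, -2.0]
y_held = [1, 1, 1, 0, 0, 0]
A, B = fit_sigmoid(d_held, y_held)
posterior = lambda d: 1.0 / (1.0 + math.exp(A * d + B))
print(posterior(2.0), posterior(-2.0))
```

Fitting on a held-out set rather than the SVM's own training data matters: training-set distances are biased toward the margin and would give an over-confident sigmoid.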

25 SVM Alphadigit Recognition

Transcription   Segmentation   SVM     HMM
N-best          Hypothesis     11.0%   11.9%
N-best+Ref      Reference      3.3%    6.3%

- The HMM system uses cross-word state-tied triphones with 16-mixture Gaussian models
- The SVM system uses monophone models with segmental features
- A system combination experiment yields another 1% reduction in error

26 Summary
- We are the first speech group to apply kernel machines to the acoustic modeling problem
- Performance exceeds that of the HMM/GMM system, with a bit of HMM interaction
- Algorithms for increased data sizes are key

27 Acknowledgments
- Collaborators: Naveen Parihar and Joe Picone at Mississippi State
- Consultants: Aravind Ganapathiraju (Conversay) and Jonathan Hamaker (Microsoft)

28 References
- A. Ganapathiraju, "Support Vector Machines for Speech Recognition," Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, January 2002.
- J. Platt, "Fast Training of Support Vector Machines using Sequential Minimal Optimization," Advances in Kernel Methods, MIT Press, 1998.
- V. N. Vapnik, Statistical Learning Theory, John Wiley, New York, NY, USA, 1998.
- C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," AT&T Bell Laboratories, November 1999.

29 Accomplishments
- Developed a set of Java-based graphical tools used to demonstrate fundamental concepts in signal processing and speech recognition: http://www.isip.msstate.edu/projects/speech/software/demonstrations/applets/
- Developed a set of Tcl-Tk based graphical tools used to transcribe, segment, and analyze speech recognition databases: http://www.isip.msstate.edu/projects/speech/software/legacy/
- Developed a generalized network-based speech recognition trainer, which is part of my master's thesis work
- Developed a hybrid HMM/SVM system used to rescore N-best word-graphs, based on work by Aravind Ganapathiraju
- Worked as part of a team to design and implement a public-domain HMM-based speech recognition system: http://www.isip.msstate.edu/projects/speech/software/

