An Overview of the SPHINX Speech Recognition System
Jie Zhou, Zheng Gong, Lingli Wang, Tiantian Ding
Presentation for the speech recognition system, 27 February 2004


An Overview of the SPHINX Speech Recognition System
Jie Zhou, Zheng Gong, Lingli Wang, Tiantian Ding
M.Sc in CMHE, Spoken Language Processing Module

Abstract
- SPHINX is a system that demonstrates the feasibility of accurate, large-vocabulary, speaker-independent, continuous speech recognition.
- SPHINX is based on discrete hidden Markov models (HMMs) with LPC-derived parameters.
- Its goals: to provide speaker independence, to deal with coarticulation in continuous speech, and to adequately represent a large vocabulary.
- SPHINX attained word accuracies of 71, 94, and 96 percent on a 997-word task.

Introduction
SPHINX is a system that tries to overcome three constraints:
1) Speaker dependence
2) Isolated words
3) Small vocabulary

Introduction
- Speaker independence: training on data that is less well matched to any one speaker, but much more data can be acquired, which may compensate for the less appropriate training material.
- Difficulties of continuous speech recognition: word boundaries are difficult to locate; coarticulatory effects are much stronger than in isolated words; content words are often emphasized, while function words are poorly articulated.
- Large vocabulary: 1000 words or more.

Introduction
- To improve speaker independence: present additional knowledge through multiple vector-quantized codebooks, and enhance the recognizer with carefully designed models and word duration modeling.
- To deal with coarticulation in continuous speech: function-word-dependent phone models and generalized triphone models.
- SPHINX achieved speaker-independent word recognition accuracies of 71, 94, and 96 percent on the 997-word DARPA resource management task, with grammars of perplexity 997, 60, and 20.

The baseline SPHINX system
This system uses standard HMM techniques.
Speech processing:
- Sample rate: 16 kHz
- 20 ms frames, advanced every 10 ms (adjacent frames overlap by 10 ms)
- Each frame is multiplied by a Hamming window
- The LPC coefficients are computed for each frame
- 12 LPC-derived cepstral coefficients are obtained
- The 12 LPC cepstrum coefficients are vector quantized into one of 256 prototype vectors
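The front-end steps above can be sketched in a few lines of numpy. This is only an illustration of the framing, windowing, and vector quantization stages; the LPC cepstrum computation itself is omitted, and the random codebook and feature vectors stand in for trained prototypes and real cepstra.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Split a waveform into overlapping frames, as in the baseline SPHINX
    front end: 20 ms frames advanced every 10 ms at 16 kHz."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 320 samples
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)            # taper each frame

def vector_quantize(features, codebook):
    """Map each feature vector to the index of its nearest prototype
    (Euclidean distance), as in SPHINX's 256-entry codebooks."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

# One second of audio yields 99 frames of 320 samples each.
signal = np.random.randn(16000)
frames = frame_signal(signal)
codebook = np.random.randn(256, 12)            # stand-in for trained prototypes
feats = np.random.randn(frames.shape[0], 12)   # stand-in for LPC cepstra
codes = vector_quantize(feats, codebook)
```

After quantization, each frame is represented by a single codeword index, which is what the discrete HMMs consume.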

Task and Database
The Resource Management task:
- SPHINX was evaluated on the DARPA resource management task
- Three grammars of differing difficulty were used with SPHINX: a null grammar (perplexity 997), a word-pair grammar (perplexity 60), and a bigram grammar (perplexity 20)
The TIRM database:
- 80 "training" speakers
- 40 "development test" speakers
- 40 "evaluation" speakers

Phonetic Hidden Markov Models
- HMMs are parametric models particularly suitable for describing speech events.
- Each HMM represents a phone; a total of 46 phones is used for English.
- {s}: a set of states
- {a_ij}: a set of transitions, where a_ij is the probability of a transition from state i to state j
- {b_ij(k)}: the output probability matrix
- The phonetic HMM topology is shown in the figure.

Phonetic HMM topology (figure not preserved in this transcript)
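To make the {s}, {a_ij}, {b_ij(k)} notation concrete, here is a minimal discrete HMM with the forward algorithm for scoring a codeword sequence. One simplification relative to the slides: SPHINX attaches output probabilities to transitions, b_ij(k), whereas this sketch attaches them to states, b_j(k); the random Dirichlet initialization is only a placeholder for trained parameters.

```python
import numpy as np

class DiscretePhoneHMM:
    """A discrete-output HMM over VQ codewords: an initial distribution pi,
    transition probabilities a[i, j], and (state-attached, for simplicity)
    output probabilities b[j, k] = P(codeword k | state j)."""
    def __init__(self, n_states, n_codewords, rng):
        self.pi = rng.dirichlet(np.ones(n_states))
        self.a = rng.dirichlet(np.ones(n_states), size=n_states)     # rows sum to 1
        self.b = rng.dirichlet(np.ones(n_codewords), size=n_states)  # rows sum to 1

    def likelihood(self, codes):
        """Forward algorithm: P(observation sequence | model)."""
        alpha = self.pi * self.b[:, codes[0]]
        for k in codes[1:]:
            alpha = (alpha @ self.a) * self.b[:, k]
        return float(alpha.sum())

rng = np.random.default_rng(0)
hmm = DiscretePhoneHMM(n_states=7, n_codewords=256, rng=rng)
p = hmm.likelihood([3, 17, 200])
```

A quick sanity check on such a model: the likelihoods of all 256 possible one-codeword sequences must sum to exactly 1.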

Training
- A set of 46 phone models was used to initialize the parameters.
- The forward-backward algorithm was run on the resource management training sentences.
- A sentence model is created from word models, which are in turn concatenated from phone models.
- The trained transition probabilities are used directly in recognition.
- The output probabilities are smoothed with a uniform distribution.
- The SPHINX recognition search is a standard time-synchronous Viterbi beam search.
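The time-synchronous Viterbi beam search mentioned above can be sketched as follows. This toy version works in log probabilities over a flat state space (no word-level lexical tree) and uses an arbitrary beam width of 10 nats; both simplifications are assumptions of the sketch, not details from the slides.

```python
import numpy as np

def viterbi(log_pi, log_a, log_b, codes, beam=10.0):
    """Time-synchronous Viterbi search with beam pruning: at every frame,
    states whose best log score falls more than `beam` below the frame's
    best score are pruned (set to -inf)."""
    score = log_pi + log_b[:, codes[0]]
    back = []
    for k in codes[1:]:
        cand = score[:, None] + log_a          # cand[i, j]: come from i, go to j
        back.append(cand.argmax(axis=0))       # best predecessor of each state
        score = cand.max(axis=0) + log_b[:, k]
        score[score < score.max() - beam] = -np.inf   # beam pruning
    # trace back the best state sequence
    path = [int(score.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1], float(score.max())

# Two states with uniform transitions; state 0 prefers codeword 0,
# state 1 prefers codeword 1, so the best path simply tracks the input.
log_pi = np.log([0.5, 0.5])
log_a = np.log([[0.5, 0.5], [0.5, 0.5]])
log_b = np.log([[0.9, 0.1], [0.1, 0.9]])
path, best = viterbi(log_pi, log_a, log_b, [0, 0, 1, 1])
```

In the toy example the decoded path follows the emissions, visiting state 0 twice and then state 1 twice.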

Baseline results
- The results with the baseline SPHINX system, using 15 new speakers with 10 sentences each for evaluation, are shown in Table I.
- The baseline system is inadequate for any realistic large-vocabulary application without incorporating knowledge and contextual modeling.

Adding knowledge to SPHINX
- Fixed-width speech parameters
- Lexical/phonological improvements
- Word duration modeling
- Results

Fixed-Width Speech Parameters
- Bilinear transform on the cepstrum coefficients
- Differenced cepstrum coefficients
- Power and differenced power
- Integrating fixed-width parameters in multiple codebooks
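Two of these features are easy to illustrate: differenced cepstrum coefficients and frame power. The +/- 2-frame difference window and edge padding below are assumptions of this sketch; the slides do not specify the exact spans SPHINX used.

```python
import numpy as np

def deltas(cepstra, span=2):
    """First-order differenced coefficients: c[t + span] - c[t - span]
    for each cepstral coefficient, with edge replication at the sequence
    boundaries. Input shape (frames, coefficients)."""
    padded = np.pad(cepstra, ((span, span), (0, 0)), mode="edge")
    return padded[2 * span:] - padded[:-2 * span]

def power(frames):
    """Log energy of each frame, the 'power' feature; 'differenced power'
    is then deltas(power(frames)[:, None])."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

# On a linear ramp, interior deltas are the constant 2 * span * slope.
cep = np.tile(np.arange(10, dtype=float)[:, None], (1, 12))
d = deltas(cep)
```

Each derived feature stream is then quantized with its own codebook, which is what "multiple codebooks" refers to.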

Lexical/Phonological Improvements
- This set of improvements modified the phone set and the pronunciation dictionary. The changes led to more accurate assumptions about how words are articulated, without changing the assumption that each word has a single pronunciation.
- The first step was to replace each baseform pronunciation with the most likely pronunciation.
- To improve the appropriateness of the word pronunciation dictionary, a small set of rules was created to: modify closure-stop pairs into optional compound phones when appropriate; modify /t/'s and /d/'s into /dx/ when appropriate; reduce nasal /t/'s when appropriate; perform other mappings, such as /t s/ to /ts/.
- Finally, there is the issue of what HMM topology is optimal for phones in general, and what topology is optimal for each phone.

Word Duration Modeling
- HMMs model the duration of events with transition probabilities, which leads to a geometric distribution for the duration of state residence.
- Word duration was incorporated into SPHINX as part of the Viterbi search. The duration of a word is modeled by a univariate Gaussian distribution, with the mean and variance estimated from a supervised Viterbi segmentation of the training set.
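A minimal sketch of this duration model: fit a Gaussian per word from segmented durations, then add its log-density to a path's score whenever the search leaves that word. How the duration score is weighted against the acoustic score is not stated on the slides, so no scaling factor is applied here.

```python
import numpy as np

def fit_duration_model(durations):
    """Estimate the mean and variance of a word's duration (in frames)
    from a supervised Viterbi segmentation of the training set."""
    d = np.asarray(durations, dtype=float)
    return d.mean(), d.var() + 1e-6   # small floor keeps the variance positive

def duration_log_score(duration, mean, var):
    """Log-density of the univariate Gaussian duration model, added to a
    path's log score when a word hypothesis ends in the Viterbi search."""
    return -0.5 * (np.log(2 * np.pi * var) + (duration - mean) ** 2 / var)

# Hypothetical segmented durations for one word, in frames.
mean, var = fit_duration_model([28, 30, 31, 29, 32, 30])
```

Paths that end a word at an implausible duration are penalized relative to paths that end it near the word's typical length.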

Results
- Various strategies for adding knowledge to SPHINX have been presented.
- Consistent with earlier results, bilinear-transformed coefficients improved the recognition rates. An even greater improvement came from the use of differenced coefficients, power, and differenced power in three separate codebooks.
- Next, enhancing the dictionary and the phone set led to an appreciable improvement.
- Finally, the addition of durational information significantly improved SPHINX's accuracy when no grammar was used, but was not helpful with a grammar.

Context Modeling in SPHINX
- Previously proposed units of speech
- Function-word-dependent phones
- Generalized triphones
- Smoothing detailed models

Previously Proposed Units of Speech
- Because word models share no training across words, they are not practical for large-vocabulary speech recognition.
- To improve trainability, some subword unit has to be used.
- Word-dependent phones: a compromise between word modeling and phone modeling.
- Context-dependent phones: the triphone model; instead of modeling a phone-in-word, it models a phone-in-context.

Function-Word-Dependent Phones
- Function words are particularly problematic in continuous speech recognition because they are typically unstressed.
- The phones in function words are distorted.
- Function-word-dependent phones are the same as word-dependent phones, except that they are used only for function words.

Generalized Triphones
- Triphone models are sparsely trained and consume substantial memory.
- Combining similar triphones improves trainability and reduces memory storage.
- Generalized triphones are created by merging contexts with an agglomerative clustering procedure.
- The similarity between two models is determined with a distance metric (given as an equation on the original slide, which did not survive this transcript).

Generalized Triphones
- In measuring the distance between two models, only the output probabilities are considered; the transition probabilities are ignored, as they are of secondary importance.
- This context generalization algorithm provides an ideal means of finding the equilibrium between trainability and sensitivity.
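Since the distance equation was lost from the transcript, here is one common form of such a metric: the count-weighted increase in entropy caused by merging two output distributions. This is a plausible reconstruction consistent with "only the output probabilities are considered", not necessarily the exact formula from the slide.

```python
import numpy as np

def entropy(p):
    """Entropy (in nats) of a discrete distribution; zero entries are skipped."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def merge_cost(pa, na, pb, nb):
    """Count-weighted entropy increase from merging two discrete output
    distributions with training counts na and nb. Identical distributions
    merge for free; dissimilar ones pay a positive cost."""
    pm = (na * pa + nb * pb) / (na + nb)   # pooled distribution after the merge
    return (na + nb) * entropy(pm) - na * entropy(pa) - nb * entropy(pb)

# Hypothetical output distributions for two triphone contexts.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.2, 0.7])
```

Agglomerative clustering then repeatedly merges the pair of contexts with the smallest `merge_cost` until the desired number of generalized triphones remains.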

Smoothing Detailed Models
- Detailed models are accurate but less robust, since many output probabilities will be zero, which can be disastrous for recognition.
- The remedy is to combine these detailed models with other, more robust ones.
- An ideal solution for weighting different estimates of the same event is deleted interpolated estimation.
- Procedure: combine the detailed models with robust models, and use the uniform distribution to smooth the distribution.
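The core of deleted interpolated estimation can be sketched as an EM loop that learns mixture weights on held-out data: each weight grows in proportion to how often its distribution best explains the held-out codewords. The three-way detailed/robust/uniform mixture below mirrors the slide; the specific toy distributions and held-out codes are illustrative only.

```python
import numpy as np

def estimate_lambdas(held_out, dists, n_iter=50):
    """EM re-estimation of interpolation weights on held-out codewords.
    dists is a list of discrete distributions over the codebook."""
    lam = np.full(len(dists), 1.0 / len(dists))
    probs = np.array([[d[k] for k in held_out] for d in dists])  # (models, samples)
    for _ in range(n_iter):
        weighted = lam[:, None] * probs
        post = weighted / weighted.sum(axis=0, keepdims=True)  # responsibilities
        lam = post.sum(axis=1)
        lam /= lam.sum()
    return lam

def smoothed(dists, lam):
    """The smoothed output distribution: a lambda-weighted mixture."""
    return sum(l * d for l, d in zip(lam, dists))

detailed = np.array([0.0, 0.5, 0.5, 0.0])   # zeros make this model brittle
robust   = np.array([0.1, 0.4, 0.4, 0.1])
uniform  = np.full(4, 0.25)
lam = estimate_lambdas([1, 2, 1, 0, 2], [detailed, robust, uniform])
mix = smoothed([detailed, robust, uniform], lam)
```

Because the uniform component always keeps a nonzero weight, the smoothed distribution assigns positive probability everywhere, which is exactly what protects recognition from the detailed model's zeros.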

Entire training procedure
The entire training procedure is summarized in Figure 2.

Summary of Results
The six versions correspond to the following descriptions, with incremental improvements:
1) the baseline system, which uses only LPC cepstral parameters in one codebook;
2) the addition of differenced LPC cepstral coefficients, power, and differenced power in one codebook;
3) all four feature sets used in three separate codebooks;
4) tuning of the phone models and the pronunciation dictionary, and the use of word duration modeling;
5) function-word-dependent phone modeling;
6) generalized triphone modeling.

Results of the versions of SPHINX (table not preserved in this transcript)

Conclusion
- Given a fixed amount of training data, model specificity and model trainability pose two incompatible goals.
- More specificity usually reduces trainability, and increased trainability usually results in over-generality.
- This work centers on finding an equilibrium between specificity and trainability.

Conclusion
- To improve trainability, one of the largest speaker-independent speech databases was used.
- To facilitate sharing between models, deleted interpolation was used to combine robust models with detailed ones: trainability is improved through sharing, by combining poorly trained models with well-trained models.
- To improve specificity, multiple codebooks of various LPC-derived features were used, and external knowledge sources were integrated into the system.
- The phone set was improved to include multiple representations of some phones, and function-word-dependent phone modeling and generalized triphone modeling were introduced.

References
- Kai-Fu Lee, Hsiao-Wuen Hon, and Raj Reddy, "An Overview of the SPHINX Speech Recognition System," IEEE, 1989.
- Kai-Fu Lee, Hsiao-Wuen Hon, Mei-Yuh Hwang, Sanjoy Mahajan, and Raj Reddy, "The SPHINX Speech Recognition System," 1989.

Thank you very much!