
1 Network Training for Continuous Speech Recognition Author: Issac John Alphonso Inst. for Signal and Info. Processing Dept. Electrical and Computer Eng. Mississippi State University Contact Information: Box 0452 Mississippi State University Mississippi State, Mississippi 39762 Tel: 662-325-8335 Fax: 662-325-2298 URL: isip.msstate.edu/publications/books/msstate_theses/2003/network_training/ Email: alphonso@isip.msstate.edu

2 INTRODUCTION ORGANIZATION Motivation: Why do we need a new training paradigm? Network Training: How network training differs from traditional training. Experiments: Verification of the approach using industry-standard databases (e.g., TIDigits, Alphadigits and Resource Management). Outline: Motivation, Network Training, Experiments, Conclusions.

3 INTRODUCTION MOTIVATION A traditional trainer uses an EM-based framework to estimate the parameters of a speech recognition system. A traditional trainer re-estimates the acoustic models in several complicated stages which are prone to error. A network trainer reduces the complexity of the training process by using flexible transcriptions. A network trainer achieves comparable performance and retains the robustness of the existing EM-based framework.

4 NETWORK TRAINER TRAINING RECIPE The flat start stage seeds the mean and variance of the speech and non-speech models. The context-independent stage inserts an optional silence model between words. The state-tying stage clusters the model parameters via linguistic rules to compensate for sparse training data. The context-dependent stage is similar to the context-independent stage (words are modeled using context). Training flow: Flat Start → CI Training → State Tying → CD Training (context-independent stages followed by context-dependent stages).
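To make the recipe concrete, here is a rough sketch of the four stages chained as a pipeline. The function names, the toy model dictionary, and everything inside the stage bodies are hypothetical placeholders rather than the actual trainer's code; the point is only the order and purpose of the stages.

```python
# Hypothetical sketch of the four-stage training recipe; stage bodies are
# placeholders and do not reflect the real trainer's implementation.

def flat_start(models):
    """Seed the mean and variance of the speech and non-speech models
    from global statistics of the training data (placeholder values)."""
    models["speech"] = {"mean": 0.0, "var": 1.0}
    models["non_speech"] = {"mean": 0.0, "var": 1.0}
    return models

def ci_training(models):
    """Re-estimate context-independent models; an optional silence model
    is allowed between words in the training network."""
    models["optional_inter_word_silence"] = True
    return models

def state_tying(models):
    """Cluster model parameters via linguistic rules so that states with
    sparse training data share parameters."""
    models["tied_states"] = True
    return models

def cd_training(models):
    """Re-estimate context-dependent models (same procedure as CI
    training, but units are modeled in context)."""
    models["context_dependent"] = True
    return models

models = {}
for stage in (flat_start, ci_training, state_tying, cd_training):
    models = stage(models)
print(models)
```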

5 NETWORK TRAINER FLEXIBLE TRANSCRIPTIONS Example for the word HAVE: the traditional trainer's transcription is "sil hh ae v sil"; the network trainer's is "SILENCE HAVE SILENCE". The network trainer uses word-level transcriptions, which do not impose restrictions on the word pronunciation. The traditional trainer uses phone-level transcriptions, which use the canonical pronunciation of the word. Using orthographic transcriptions removes the need to deal directly with phonetic contexts during training.
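As a minimal sketch of the two transcription styles for the HAVE example, the snippet below expands a word-level transcription into phones through a toy lexicon; the lexicon entry, phone symbols, and the expand helper are illustrative assumptions, not the actual system's data or API.

```python
# Toy illustration of phone-level vs. word-level transcriptions; the
# lexicon and helper below are invented for this example.

lexicon = {"HAVE": ["hh", "ae", "v"]}        # canonical pronunciation

# Traditional trainer: the phone-level transcription is written out
# explicitly, silence included.
phone_level = ["sil", "hh", "ae", "v", "sil"]

# Network trainer: only the orthographic (word-level) transcription is
# supplied; phones are derived from the lexicon at training time.
word_level = ["SILENCE", "HAVE", "SILENCE"]

def expand(words, lex):
    """Expand a word-level transcription into a phone sequence."""
    phones = []
    for w in words:
        phones.extend(["sil"] if w == "SILENCE" else lex[w])
    return phones

print(expand(word_level, lexicon))                 # ['sil', 'hh', 'ae', 'v', 'sil']
print(phone_level == expand(word_level, lexicon))  # True for this example
```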

6 NETWORK TRAINER FLEXIBLE TRANSCRIPTIONS The network trainer uses a silence word, which precludes the need to insert silence into the phonetic pronunciation. The traditional trainer deals with silence between words by explicitly specifying it in the phonetic pronunciation.

7 NETWORK TRAINER DUAL SILENCE MODELLING Multi-Path: the multi-path silence model is used between words. Single-Path: the single-path silence model is used at utterance ends.

8 NETWORK TRAINER DUAL SILENCE MODELLING The network trainer uses a fixed silence at utterance bounds and an optional silence between words. We use a fixed silence at utterance bounds to avoid an underestimated silence model.
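The sketch below shows one way the dual silence placement could be realized when expanding a word transcription into a training network: a fixed (single-path) silence at each utterance bound and an optional (multi-path, skippable) silence between words. The list-of-tuples network representation and the function name are assumptions made for illustration only, not the trainer's actual data structures.

```python
# Illustrative sketch of dual silence placement in a training network;
# the representation here is not the trainer's actual data structure.

def build_network(words):
    """Place a fixed silence at the utterance bounds and an optional,
    skippable silence between consecutive words."""
    net = [("silence", "fixed")]                 # utterance-initial, cannot be skipped
    for i, word in enumerate(words):
        net.append((word, "word"))
        if i < len(words) - 1:
            net.append(("silence", "optional"))  # inter-word, may be skipped
    net.append(("silence", "fixed"))             # utterance-final, cannot be skipped
    return net

for node in build_network(["SHOW", "ALL", "SHIPS"]):
    print(node)
```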

9 EXPERIMENTS TIDIGITS: WER COMPARISON

Stage                  WER     Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer    7.7%    0.1%             2.5%            5.0%
Network Trainer        7.6%    0.1%             2.4%            5.0%

The network trainer achieves comparable performance to the traditional trainer. The network trainer converges in word error rate to the traditional trainer.
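For reference, the insertion, deletion, and substitution rates reported in these tables follow from a standard Levenshtein alignment of the hypothesis against the reference word string. The sketch below shows that computation in generic form; it is not the scoring tool used for these experiments, and the example strings are invented.

```python
# Generic WER breakdown via Levenshtein alignment (not the project's scorer).

def wer_breakdown(ref, hyp):
    """Return (wer, ins_rate, del_rate, sub_rate) as fractions of len(ref)."""
    n, m = len(ref), len(hyp)
    # d[i][j] holds (cost, insertions, deletions, substitutions) for
    # aligning ref[:i] against hyp[:j].
    d = [[None] * (m + 1) for _ in range(n + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        d[i][0] = (i, 0, i, 0)                   # only deletions
    for j in range(1, m + 1):
        d[0][j] = (j, j, 0, 0)                   # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c_sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            candidates = [
                (d[i - 1][j - 1][0] + c_sub, d[i - 1][j - 1], (0, 0, c_sub)),  # match/sub
                (d[i][j - 1][0] + 1,         d[i][j - 1],     (1, 0, 0)),      # insertion
                (d[i - 1][j][0] + 1,         d[i - 1][j],     (0, 1, 0)),      # deletion
            ]
            cost, prev, (di, dd, ds) = min(candidates, key=lambda c: c[0])
            d[i][j] = (cost, prev[1] + di, prev[2] + dd, prev[3] + ds)
    cost, ins, dels, subs = d[n][m]
    return tuple(x / n for x in (cost, ins, dels, subs))

# invented example: one insertion ("a") and one substitution (alerts -> alert)
ref = "show all alerts".split()
hyp = "show a all alert".split()
print(wer_breakdown(ref, hyp))   # roughly (0.67, 0.33, 0.0, 0.33)
```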

10 EXPERIMENTS AD: WER COMPARISON The network trainer achieves comparable performance to the traditional trainer. The network trainer converges in word error rate to the traditional trainer.

Stage                  WER     Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer    38.0%   0.8%             3.0%            34.2%
Network Trainer        35.3%   0.8%             2.2%            34.2%

11 EXPERIMENTS RM: WER COMPARISON The network trainer achieves comparable performance to the traditional trainer. It is important to note that the 1.8% degradation in performance is not statistically significant (MAPSSWE test).

Stage                  WER     Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer    25.7%   1.9%             6.7%            17.1%
Network Trainer        27.5%   2.6%             7.1%            17.9%
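The MAPSSWE test cited above compares the two systems' error counts on matched segments and checks whether the mean per-segment difference is distinguishable from zero. As a loose sketch of that idea (not the NIST implementation, and with invented error counts), a matched-pairs test could look like this:

```python
# Loose matched-pairs sketch in the spirit of the MAPSSWE test; this is
# NOT the NIST scoring implementation, and the per-utterance error counts
# below are invented purely for illustration.
import math

errors_a = [3, 1, 0, 2, 4, 1, 0, 2, 3, 1]    # hypothetical system A errors per segment
errors_b = [2, 1, 1, 2, 3, 1, 0, 3, 2, 1]    # hypothetical system B errors per segment

diffs = [a - b for a, b in zip(errors_a, errors_b)]
n = len(diffs)
mean = sum(diffs) / n
var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
z = mean / math.sqrt(var / n)                # normal approximation of the test statistic

# two-sided p-value under the standard normal distribution
p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
print(f"mean diff = {mean:.2f}, z = {z:.2f}, p = {p:.3f}")
# a p-value above the chosen threshold (e.g., 0.05) means the WER
# difference is not statistically significant
```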

12 CONCLUSIONS SUMMARY Explored the effectiveness of a novel training recipe in the re-estimation process for speech processing. Analyzed performance on three databases. For TIDigits, at 7.6% WER, the performance of the network trainer was better by about 0.1%. For OGI Alphadigits, at 35.3% WER, the performance of the network trainer was better by about 2.7%. For Resource Management, at 27.5% WER, the performance degraded by about 1.8% (not significant).

