Network Training for Continuous Speech Recognition

1 Network Training for Continuous Speech Recognition
• Author: Issac John Alphonso, Institute for Signal and Information Processing, Dept. of Electrical and Computer Engineering, Mississippi State University
• Contact Information: Box 0452, Mississippi State, Mississippi 39762, Tel: Fax:
• URL: isip.msstate.edu/publications/books/msstate_theses/2003/network_training/
Good morning. I would like to welcome everyone to my Master's defense presentation.

2 INTRODUCTION ORGANIZATION
Motivation: Why do we need a new training paradigm?
Theory: Review of the EM-based supervised training framework.
Network Training: The differences between network training and traditional training.
Experiments: Verification of the approach using industry-standard databases (e.g., TIDigits, Alphadigits and Resource Management).
This presentation is broken down into four major sections: motivation, theory, network training and experiments, followed by conclusions.

3 INTRODUCTION MOTIVATION
A traditional trainer uses an EM-based framework to estimate the parameters of a speech recognition system.
EM-based parameter estimation is performed in several complicated stages which are prone to human error.
A network trainer reduces the complexity of the training process by employing a soft decision criterion.
A network trainer achieves comparable performance and retains the robustness of the EM-based framework.
The traditional training framework has proven to be a very successful and robust means of re-estimating the parameters of a speech recognition system. The question then arises: why do we need a new training paradigm? The biggest problem with the traditional framework has always been the complexity of the training process and the degree of supervision needed to yield robust models.

4 NETWORK TRAINER TRAINING RECIPE
Training recipe: Flat Start → CI Training → State Tying → CD Training (context-independent stages, then context-dependent stages).
The flat-start stage segments the acoustic signal and seeds the speech and non-speech models.
The context-independent stage inserts an optional silence model between words.
The state-tying stage clusters the model parameters via linguistic rules to compensate for sparse training data.
The context-dependent stage is similar to the context-independent stage, except that words are modeled using context.
The flat-start stage segments the acoustic signal and learns the acoustic representation of the word and silence components.
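To make the data flow concrete, here is a minimal sketch of the recipe as a plain Python driver. The stage functions and their signatures are hypothetical placeholders that only document what each step consumes and produces; they are not the actual trainer's API.

```python
# Hypothetical sketch of the four-stage training recipe; the functions
# below are placeholders that document the data flow, not the real
# trainer's API.

def flat_start(features, word_transcripts):
    # Uniformly segment each utterance and seed the speech and
    # non-speech (silence) models from those segments.
    return {"models": "seeded context-independent phones + silence"}

def ci_train(models, features, word_transcripts):
    # Re-estimate the context-independent models; an optional silence
    # word is allowed between words in the training network.
    return {**models, "stage": "context-independent"}

def tie_states(models, linguistic_rules):
    # Cluster model parameters with linguistic rules to compensate
    # for sparse training data.
    return {**models, "tied": True}

def cd_train(models, features, word_transcripts):
    # Re-estimate the tied models with words modeled in context.
    return {**models, "stage": "context-dependent"}

def train(features, word_transcripts, linguistic_rules):
    models = flat_start(features, word_transcripts)
    models = ci_train(models, features, word_transcripts)
    models = tie_states(models, linguistic_rules)
    models = cd_train(models, features, word_transcripts)
    return models
```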

5 NETWORK TRAINER FLEXIBLE TRANSCRIPTIONS
Network Trainer: SILENCE HAVE SILENCE
Traditional Trainer: sil hh ae v sil
The network trainer uses word-level transcriptions, which do not impose restrictions on the word pronunciation.
The traditional trainer uses phone-level transcriptions, which use the canonical pronunciation of each word.
Using orthographic transcriptions removes the need to deal directly with phonetic contexts during training.
Training a speech recognizer is a supervised learning process, which means we require labels (transcriptions) and observations (features).
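To illustrate the difference, the sketch below expands an orthographic (word-level) transcription into phones with a toy lexicon. The lexicon fragment and helper name are made up for this example; they are not the pronunciation dictionary used in the thesis.

```python
# Illustrative only: a toy lexicon fragment, not the actual
# pronunciation dictionary used for these experiments.
LEXICON = {
    "SILENCE": ["sil"],
    "HAVE": ["hh", "ae", "v"],
}

def expand_to_phones(words):
    # With word-level transcriptions, the trainer (not the transcriber)
    # supplies the pronunciation, so no canonical phone string is
    # imposed on the transcription itself.
    phones = []
    for word in words:
        phones.extend(LEXICON[word])
    return phones

print(expand_to_phones(["SILENCE", "HAVE", "SILENCE"]))
# -> ['sil', 'hh', 'ae', 'v', 'sil']
```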

6 NETWORK TRAINER FLEXIBLE TRANSCRIPTIONS
The network trainer uses a silence word, which precludes the need to insert silence into the phonetic pronunciation.
The traditional trainer deals with silence between words by explicitly specifying it in the phonetic pronunciation.
Using a global model to learn the silence between words requires a multi-path model which accounts for both long and short silence durations between words.
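The sketch below shows one way such a silence word can be woven into a word-level training network: silence is kept at the utterance bounds and made optional between words, following the dual silence modelling described on the next slides. The function and its tuple encoding are illustrative assumptions, not the trainer's actual data structure.

```python
# Illustrative sketch: build a word-level training network in which
# silence is a word, required at the utterance bounds and optional
# (skippable) between words. Not the trainer's actual data structure.

def build_training_network(words, silence="SILENCE"):
    # Each entry is (token, optional); optional tokens may be skipped
    # during forced alignment.
    network = [(silence, False)]              # required at the start
    for i, word in enumerate(words):
        network.append((word, False))
        is_last = (i == len(words) - 1)
        # required silence at the utterance end, optional between words
        network.append((silence, not is_last))
    return network

print(build_training_network(["SHOW", "ALL", "FLIGHTS"]))
# -> [('SILENCE', False), ('SHOW', False), ('SILENCE', True),
#     ('ALL', False), ('SILENCE', True), ('FLIGHTS', False),
#     ('SILENCE', False)]
```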

7 NETWORK TRAINER DUAL SILENCE MODELLING
Multi-Path: the multi-path silence model is used between words.
Single-Path: the single-path silence model is used at the utterance ends.
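For concreteness, the two topologies can be pictured as HMM transition matrices. The toy matrices below use arbitrary state counts and probabilities (they are not the thesis models); the only structural difference is the skip transition that lets the multi-path model pass through with little or no silence.

```python
import numpy as np

# Toy illustration of the two silence topologies as HMM transition
# matrices. State counts and probabilities are arbitrary; these are
# not the models used in the thesis. Row 0 is the entry state and the
# last row is the exit state.

# Single-path silence (utterance bounds): strictly left-to-right, so
# every path must pass through all emitting states.
single_path = np.array([
    [0.0, 1.0, 0.0, 0.0],   # entry -> first emitting state
    [0.0, 0.6, 0.4, 0.0],   # self-loop or advance
    [0.0, 0.0, 0.6, 0.4],   # self-loop or advance to exit
    [0.0, 0.0, 0.0, 0.0],   # exit
])

# Multi-path silence (between words): an extra entry -> exit transition
# allows a very short silence, while the left-to-right path still
# models a long silence.
multi_path = np.array([
    [0.0, 0.7, 0.0, 0.3],   # entry -> first emitting state, or skip
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 0.0],
])
```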

8 NETWORK TRAINER DUAL SILENCE MODELLING
Using an optional silence model at the utterance boundaries caused poor segmentation of the acoustic signal, which resulted in poor performance.
We tried seeding the silence model using an example observation, but that resulted in poor recognition performance after the flat-start stage.
The network trainer uses a fixed silence at the utterance bounds and an optional silence between words.
We use a fixed silence at the utterance bounds to avoid an underestimated silence model.

9 NETWORK TRAINER DUAL SILENCE MODELLING
Using an optional silence model at the utterance boundaries worked on a small data set; however, the same results do not scale up to large data sets.
Network training uses a single-path silence at the utterance bounds and a multi-path silence between words.
We use a single-path silence at the utterance bounds to avoid uncertainty in modeling silence.

10 EXPERIMENTS TIDIGITS: WER COMPARISON
Stage                 WER     Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   7.7%    0.1%             2.5%            5.0%
Network Trainer       7.6%    -                2.4%            -
The network trainer achieves comparable performance to the traditional trainer and converges to it in word error rate.
The substitution rate also indicates comparable performance (model confusion).
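For reference, the WER in these tables is the sum of the insertion, deletion, and substitution rates, computed from a word-level alignment of the hypothesis against the reference. A minimal scorer is sketched below; it is illustrative only and not the tool used to produce these numbers.

```python
def wer(ref, hyp):
    # Word error rate via word-level edit distance:
    # WER = (substitutions + deletions + insertions) / reference words.
    # Illustrative only; not the scorer used for the tables above.
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                          # all deletions
    for j in range(1, m + 1):
        d[0][j] = j                          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # sub/match
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1,                               # insertion
            )
    return d[n][m] / n

print(wer("one two three".split(), "one too three four".split()))
# 1 substitution + 1 insertion over 3 reference words -> 0.667
```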

11 EXPERIMENTS ALPHADIGITS: WER COMPARISON
Stage                 WER     Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   38.0%   0.8%             3.0%            34.2%
Network Trainer       35.3%   -                2.2%            -
The network trainer achieves comparable performance to the traditional trainer and converges to it in word error rate.
The substitution rate also indicates comparable performance (model confusion).

12 EXPERIMENTS RM: WER COMPARISON
Stage                 WER     Insertion Rate   Deletion Rate   Substitution Rate
Traditional Trainer   25.7%   1.9%             6.7%            17.1%
Network Trainer       27.5%   2.6%             7.1%            17.9%
The network trainer achieves comparable performance to the traditional trainer.
It is important to note that the 1.8% degradation in performance is not significant (MAPSSWE test).
The substitution rate also indicates comparable performance (model confusion).

13 CONCLUSIONS SUMMARY
Explored the effectiveness of a novel training recipe in the re-estimation process for speech recognition.
Analyzed performance on three databases:
For TIDigits, at 7.6% WER, the performance of the network trainer was better by about 0.1%.
For OGI Alphadigits, at 35.3% WER, the performance of the network trainer was better by about 2.7%.
For Resource Management, at 27.5% WER, the performance degraded by about 1.8% (not significant).

