Discriminative Training in Speech Processing Filipp Korkmazsky LORIA
Content
Bayes Decision Theory and Discriminative Training
Minimum Classification Error (MCE) Training
Generalized Probabilistic Descent (GPD) Algorithm
MCE Training versus Maximum Mutual Information (MMI) Training
Discriminative Training for Speech Recognition
Discriminative Training for Speaker Verification
Discriminative Training for Feature Extraction
Discriminative Training of Language Models
Discriminative Training for Speech/Music Classification
Conclusions
Bayes Decision Theory and Discriminative Training
Main assumption of Bayes decision theory: the joint probability functions of the observation X and the class labels are known.
Decision cost function: (1)
Why is the MAP decision not optimal for real speech data?
The probability distribution of speech data is usually unknown, and a postulated HMM approximation of this distribution does not yield a MAP-optimal solution.
Even if the HMM were the correct distribution for speech, the lack of training data often does not allow the probability distributions of competing speech classes to be modeled accurately near their boundaries.
[Figure: real and postulated distributions for Class I and Class II]
Generalized Probabilistic Descent (GPD) Algorithm
Terms of the update rule: a positive definite matrix; the set of HMMs at step t of the GPD algorithm; a speech sample (sentence, word, phone, frame) at step t of the GPD algorithm.
Example: Gaussian mean correction by the GPD algorithm, applied to the mean for HMM i, state j, Gaussian mixture k, and one feature dimension at step t of the GPD algorithm.
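The slide's update equation did not survive in this transcript, so the following is only a minimal sketch of one GPD-style mean correction under simplifying assumptions: two classes modeled by 1-D unit-variance Gaussians instead of HMMs, the identity in place of the positive definite matrix, and a sigmoid loss on the misclassification measure. The function name `gpd_mean_step` and the parameters `step_size` and `gamma` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gpd_mean_step(means, x, true_class, step_size=0.1, gamma=1.0):
    """One gradient step on the means of two unit-variance 1-D Gaussians.

    Assumption: the positive definite matrix of GPD is the identity here.
    Returns the updated means and the loss before the update.
    """
    other = 1 - true_class
    # Discriminant g_i(x): Gaussian log-likelihood up to a constant.
    g = [-0.5 * (x - m) ** 2 for m in means]
    # Misclassification measure: negative when x is classified correctly.
    d = -g[true_class] + g[other]
    loss = sigmoid(gamma * d)
    # Derivative of the sigmoid loss with respect to d.
    slope = gamma * loss * (1.0 - loss)
    grad = [0.0, 0.0]
    grad[true_class] = -slope * (x - means[true_class])
    grad[other] = slope * (x - means[other])
    new_means = [m - step_size * gr for m, gr in zip(means, grad)]
    return new_means, loss
```

In this sketch the mean of the true class moves toward the sample and the mean of the competing class moves away from it, with a step scaled by the slope of the sigmoid.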
MCE Training versus Maximum Mutual Information Training
Maximization of mutual information corresponds to minimization of a special type of classification error.
Unlike the general MCE procedure, maximization of mutual information does not assign larger correction values to the parameters at the class boundaries.
Minimization of classification error provides better class separation at the class boundaries because of the form of the sigmoid function.
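The role of the sigmoid can be checked numerically. The sketch below assumes the usual MCE loss, a sigmoid of the misclassification measure d with slope parameter gamma (both names are assumptions, since the slide's formula is not in this transcript), and prints the derivative that scales each parameter correction.

```python
import math


def sigmoid_loss(d, gamma=1.0):
    """Smoothed 0/1 loss as a function of the misclassification measure d."""
    return 1.0 / (1.0 + math.exp(-gamma * d))


def sigmoid_slope(d, gamma=1.0):
    """Derivative of the sigmoid loss with respect to d."""
    loss = sigmoid_loss(d, gamma)
    return gamma * loss * (1.0 - loss)


# d = 0 is the class boundary; large |d| means far from the boundary.
for d in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"d = {d:+.1f}  loss = {sigmoid_loss(d):.3f}  slope = {sigmoid_slope(d):.3f}")
```

The slope is largest at d = 0 (0.25 for gamma = 1) and decays toward zero on both sides, so samples near the class boundary receive the largest corrections.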
Discriminative Training for Speech Recognition
1. Discriminative training is based on comparing the likelihood scores estimated for single speech units (phones, words).
Examples:
E-set vocabulary recognition (W. Chou, 1992), speaker-independent recognition (100 speakers):
ML training – 76% phone recognition accuracy
MCE/GPD training – 88% phone recognition accuracy
Broadcast news phone string recognition (Korkmazsky, 2003):
ML training – 61.93% phone recognition accuracy
MCE/GPD training – 65.11% phone recognition accuracy
2. Discriminative training is based on comparing the likelihood scores estimated for strings of speech units (sentences).
Notation: a true word string; one of the N alternative word strings.
Examples:
Connected digit string recognition with unknown string length (Wu Chou, 1993):
ML training – 1.4% string error rate
MCE/GPD training – 0.95% string error rate
Wireless noisy data digit string recognition (Korkmazsky, 1997):
ML training – 2.6% word error rate
MCE/GPD training – 1.4% word error rate
Generalized HMM MCE/GPD training – 1.0% word error rate
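The string-level comparison can be sketched as follows. The slide's own formula is missing from this transcript, so the code assumes the form of the misclassification measure commonly used in the MCE literature: the score of the true string against a soft maximum over the N alternative strings. The function name and the smoothing parameter `eta` are hypothetical.

```python
import math


def string_misclassification(true_score, competitor_scores, eta=1.0):
    """d = -g_true + (1/eta) * log(mean(exp(eta * g_n))) over N competitors.

    Scores are log-likelihoods; d > 0 suggests the true string loses.
    """
    n = len(competitor_scores)
    peak = max(competitor_scores)
    # Subtract the largest score before exponentiating for numerical stability.
    total = sum(math.exp(eta * (s - peak)) for s in competitor_scores)
    soft_max = peak + math.log(total / n) / eta
    return -true_score + soft_max
```

For large `eta` the soft maximum approaches the best competitor score, so the measure reduces to a comparison between the true string and its strongest rival.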
Discriminative Training for Speaker Verification
Notation: the true talker and impostor HMMs; a verification threshold.
Decision: depending on how the verification score compares with the threshold, X represents either a true talker or an impostor.
E[A]: the expectation of A
Example: speaker verification on a database consisting of 43 speakers, each with 5 training sentences (Korkmazsky, 1996):
ML training – 4.40% equal error rate
MCE/GPD training – 2.50% equal error rate
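As an illustration of the decision rule and of the equal error rate quoted above, here is a minimal sketch. It assumes the verification score is the log-likelihood ratio between the true talker and impostor models, which the transcript does not state explicitly; the function names are hypothetical.

```python
def verify(true_loglik, impostor_loglik, threshold):
    """Accept X as the true talker when the log-likelihood ratio reaches the threshold."""
    return (true_loglik - impostor_loglik) >= threshold


def equal_error_rate(target_scores, impostor_scores):
    """Sweep thresholds over the observed scores.

    Returns (error rate, threshold) at the point where the false rejection
    and false acceptance rates are closest to each other.
    """
    best = None
    for th in sorted(set(target_scores) | set(impostor_scores)):
        frr = sum(s < th for s in target_scores) / len(target_scores)
        far = sum(s >= th for s in impostor_scores) / len(impostor_scores)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0, th)
    return best[1], best[2]
```

The equal error rate is the operating point at which true talkers are rejected as often as impostors are accepted, which is why it is a convenient single number for comparing ML and MCE/GPD training.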
Discriminative Training for Feature Extraction
[Diagram blocks: Feature Extractor, Acoustic Model, Language Model, Discriminative Training]
Examples:
Discriminative filter bank design (Biem, Katagiri, 1996): the central filter bank frequencies were adjusted by MCE/GPD training. First, 128 FFT spectral coefficients were converted to 16 Mel spectrum coefficients using a conventional frequency scale. The models for 5 Japanese vowels were represented by frequency templates. Recognition accuracy in this experiment was 80.91%. After MCE/GPD adjustment of the central band frequencies, accuracy increased to 82.45%.
Discriminative training of the lifter coefficients (Biem, Juang, 1997): lifter coefficients weight the quefrency values after the cosine transform. The lifter weights were trained by adjusting neural network coefficients using the MCE criterion. The error rate for 5 Japanese vowels was reduced from 14.5% to 11.3%.
Discriminative Training of Language Models
(Zhen Chen, Kai-Fu Lee (1999); Jeff Kuo, Hui Jiang (2002))
Notation: the correct word sequence.
Discriminative correction of the bigram probabilities for all word pairs.
Notation: the number of times a word pair appears in the word sequence.
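The correction formula itself is not preserved in this transcript. The sketch below shows one plausible shape of such an update under stated assumptions: a single competing hypothesis, log-domain bigram scores, a precomputed sigmoid slope, no renormalization, and no handling of bigrams absent from the model. All names (`correct_bigrams`, `slope`, `step_size`) are hypothetical.

```python
from collections import Counter


def bigram_counts(words):
    """Number of times each adjacent word pair appears in the word sequence."""
    return Counter(zip(words, words[1:]))


def correct_bigrams(log_probs, correct, competitor, slope, step_size=0.1):
    """Raise scores of pairs seen more often in the correct sequence than in
    the competing one, and lower scores of pairs seen more often in the competitor.
    """
    n_correct = bigram_counts(correct)
    n_comp = bigram_counts(competitor)
    updated = dict(log_probs)
    for pair in set(n_correct) | set(n_comp):
        if pair not in updated:
            continue  # assumption: back-off for unseen pairs is out of scope
        updated[pair] += step_size * slope * (n_correct[pair] - n_comp[pair])
    return updated
```

Word pairs that occur equally often in both sequences are left unchanged, so only the pairs that distinguish the correct sequence from its competitor are adjusted.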
DARPA Communicator Project (air travel reservation system)
Baseline language model: 900 unigrams and 41K bigrams
Table labels: Word error rate, Sentence error rate; Training sentences, Test sentences; Baseline LM, After DT
Table values (in slide order): 19.7%, 30.9%, 17.5%, 19.0%, 26.4%, 29.0%
Baseline LM perplexity = 34; after DT perplexity = 35
Discriminative Training for Speech/Music Classification (Korkmazsky, 2003)
Speech class: speech, speech & music in the background, speech & song in the background
Nonspeech class: music, song, noise (aspiration, cough, laugh)
Notation: block classification error; frame classification error; the total number of frames in the block; a set of 6 GMMs.
Frame labeling accuracy for ML-trained GMMs – 90.5%
Frame labeling accuracy for MCE-trained GMMs – 92.7%
Conclusions
Maximum likelihood training often does not provide optimal speech classification because the real distribution of speech data is unknown.
Discriminative training usually improves speech classification over ML training.
Discriminative training may match the recognition performance of ML training while using a smaller number of model parameters.
Many new classification methods (such as SVM or boosting) are discriminative.