
1 Discriminative Learning for Hidden Markov Models. Li Deng, Microsoft Research. EE 516; UW, Spring 2009

2 Minimum Classification Error (MCE)
The objective function of MCE training is a smoothed recognition error rate. Traditionally, the MCE criterion is optimized through stochastic gradient descent (e.g., GPD). In this work we propose a Growth Transformation (GT) based method for MCE model estimation.

3 Automatic Speech Recognition (ASR)
Speech recognition: given the acoustic observation sequence X_r of an utterance, find the word string s_r with the best combined acoustic and language model score (the log-linear combination on the next slide).

4 Models (feature functions) in ASR
ASR in the log-linear framework uses the feature functions and weights:
h_1(s_r, X_r) = log p(X_r | s_r; Λ)   (acoustic model),   λ_1 = 1
h_2(s_r, X_r) = log p(s_r)            (language model),   λ_2 = s (LM scale)
h_3(s_r, X_r) = |s_r|                 (word count),       λ_3 = p (word insertion penalty)
Λ is the parameter set of the acoustic model (HMM), which is what MCE training estimates in this work.
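As a rough illustration of how these feature functions combine during decoding (not the lecture's own code; the scale, penalty, and all scores below are made-up values), a minimal Python sketch:

```python
def log_linear_score(log_acoustic, log_lm, num_words,
                     lm_scale=10.0, word_ins_penalty=-0.5):
    """Combine the three ASR feature functions h_i with weights lambda_i:
    h1 = log p(X|s; Lambda)  (acoustic model, weight fixed to 1)
    h2 = log p(s)            (language model, weight = LM scale s)
    h3 = |s|                 (word count, weight = insertion penalty p)
    """
    return 1.0 * log_acoustic + lm_scale * log_lm + word_ins_penalty * num_words

# Decoding picks the hypothesis with the highest combined score
# (all numbers below are hypothetical, for illustration only):
hypotheses = {
    "call home":      (-1200.0, -8.2, 2),
    "call your home": (-1195.0, -11.7, 3),
}
best = max(hypotheses, key=lambda s: log_linear_score(*hypotheses[s]))
print(best)
```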

5 MCE: Misclassification measure
Define the misclassification measure d_r(X_r, Λ). Here s_{r,1} denotes the top incorrect competing string (not equal to the reference S_r), for the case where only the correct string and the single best incorrect competitor are used.
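The formula itself did not survive the transcript; in this one-best-competitor case it is commonly written, using the log-linear discriminant from the previous slide, as
\[
d_r(X_r, \Lambda) = -g(X_r, S_r; \Lambda) + g(X_r, s_{r,1}; \Lambda),
\qquad
g(X_r, s; \Lambda) = \sum_i \lambda_i \, h_i(s, X_r),
\]
so d_r > 0 exactly when the top incorrect competitor scores higher than the reference string S_r.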

6 MCE: Loss function
The loss function l_r(d_r) is a smoothed error-count function:
d_r(X_r, Λ) > 0  →  one classification error (loss near 1)
d_r(X_r, Λ) < 0  →  no classification error (loss near 0)
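A minimal sketch of that smoothed error count, assuming the usual sigmoid form with slope alpha and offset beta (the slide's exact hyperparameter values are not given in the transcript):

```python
import math

def mce_loss(d, alpha=1.0, beta=0.0):
    """Sigmoid-smoothed 0/1 error count for one utterance.

    d > 0 (misrecognized)        -> loss close to 1
    d < 0 (correctly recognized) -> loss close to 0
    alpha controls the sharpness of the transition; beta shifts it.
    """
    return 1.0 / (1.0 + math.exp(-alpha * d + beta))
```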

7 MCE: Objective function
L_MCE(Λ) is the smoothed recognition error rate at the string (token) level. The acoustic model is trained to minimize L_MCE(Λ), i.e., Λ* = argmin_Λ { L_MCE(Λ) }.
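Written out (the slide's equation image is missing from the transcript, but this is the standard string-level form):
\[
L_{\mathrm{MCE}}(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} l_r\bigl(d_r(X_r, \Lambda)\bigr),
\qquad
\Lambda^{*} = \arg\min_{\Lambda} L_{\mathrm{MCE}}(\Lambda),
\]
where R is the number of training utterances (tokens).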

8 MCE: Optimization
Traditional stochastic gradient descent (e.g., GPD): gradient-descent-based online optimization; convergence is unstable; the training process is difficult to parallelize.
New growth transformation approach: extended Baum-Welch based batch-mode method; stable convergence; ready for parallel processing.

9 MCE: Optimization
Growth Transformation based MCE proceeds through a chain of surrogate objectives:
Minimizing L_MCE(Λ)  ⇔  maximizing P(Λ) = G(Λ)/H(Λ)
⇐  maximizing F(Λ; Λ') = G(Λ) - P(Λ')·H(Λ) + D
⇐  maximizing the EM-style auxiliary function U(Λ; Λ') = Σ f(Λ') log f(Λ) built from the f(·) terms of F
GT formula: solving ∂U(Λ; Λ')/∂Λ = 0 gives the update Λ = T(Λ').
If Λ = T(Λ') ensures P(Λ) > P(Λ'), i.e., P(Λ) grows, then T(·) is called a growth transformation of Λ for P(Λ).

10 MCE: Optimization
Rewrite the MCE loss function so that minimizing L_MCE(Λ) is equivalent to maximizing Q(Λ).
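The rewritten forms were equation images; a common version, assuming the unit-slope sigmoid and the single best incorrect competitor s_{r,1}, is
\[
l_r\bigl(d_r(X_r,\Lambda)\bigr)
  = \frac{p(X_r, s_{r,1}; \Lambda)}{p(X_r, S_r; \Lambda) + p(X_r, s_{r,1}; \Lambda)},
\qquad
Q(\Lambda) = \sum_{r=1}^{R} \frac{p(X_r, S_r; \Lambda)}{p(X_r, S_r; \Lambda) + p(X_r, s_{r,1}; \Lambda)},
\]
so that Q(Λ) = R(1 - L_MCE(Λ)) and minimizing L_MCE is the same as maximizing Q.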

11 MCE: Optimization
Q(Λ) is further reformulated as a single rational (fractional) function P(Λ) = G(Λ)/H(Λ).

12 MCE: Optimization
Increasing P(Λ) can be achieved by maximizing F(Λ; Λ') = G(Λ) - P(Λ')·H(Λ) + D, as long as D is a Λ-independent constant. Substitute G(·) and H(·) into F(·); here Λ' is the parameter set obtained from the last iteration.
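The reason this works, which the slide states without derivation: since G(Λ') = P(Λ')·H(Λ') and D cancels in the difference,
\[
F(\Lambda;\Lambda') - F(\Lambda';\Lambda')
  = G(\Lambda) - P(\Lambda')\,H(\Lambda)
  = H(\Lambda)\,\bigl(P(\Lambda) - P(\Lambda')\bigr),
\]
and because H(Λ) > 0 (it is built from likelihoods), any Λ that raises F(·; Λ') above F(Λ'; Λ') also raises P.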

13 MCE: Optimization
Reformulate F(Λ; Λ') into a form that is ready for EM-style optimization: a sum of terms f(χ, q, s; Λ, Λ') over the hidden state/mixture sequences, plus a constant Γ(Λ').
Note: Γ(Λ') is constant with respect to Λ, and log p(χ, q | s, Λ) is easy to decompose.

14 MCE: Optimization
Increasing F(Λ; Λ') can be achieved by maximizing the auxiliary function U(Λ; Λ'), which yields the growth transformation of Λ for the CDHMM.
Use extended Baum-Welch statistics for the E step; log f(χ, q, s, Λ; Λ') is decomposable w.r.t. Λ, so the M step is easy to compute.

15 MCE: Model estimation formulas
For a Gaussian-mixture CDHMM, the growth transformation yields closed-form updates for the mean and covariance of each Gaussian m.
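The update formulas on this slide were images and are missing from the transcript; the familiar extended-Baum-Welch-style form that GT-MCE shares (stated here as a sketch, not as the slide's exact equations) is
\[
\hat{\mu}_m = \frac{\sum_{r,t} \Delta\gamma_m(r,t)\, x_{r,t} + D_m\, \mu_m'}{\sum_{r,t} \Delta\gamma_m(r,t) + D_m},
\qquad
\hat{\Sigma}_m = \frac{\sum_{r,t} \Delta\gamma_m(r,t)\,(x_{r,t}-\hat{\mu}_m)(x_{r,t}-\hat{\mu}_m)^{\top}
 + D_m\bigl(\Sigma_m' + (\hat{\mu}_m - \mu_m')(\hat{\mu}_m - \mu_m')^{\top}\bigr)}{\sum_{r,t} \Delta\gamma_m(r,t) + D_m},
\]
where Δγ_m(r,t) is the occupancy of Gaussian m at frame t of utterance r computed on the correct string minus that computed on the competing string, and the primed quantities come from the previous iteration Λ'.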

16 MCE: Model estimation formulas
Setting of D_m: theoretically, set D_m large enough that f(χ, q, s, Λ; Λ') > 0; empirically, D_m is set per Gaussian from the accumulated statistics, as sketched below.
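The slide's empirical rule is not preserved in the transcript; a typical choice (an assumption here, in the spirit of extended Baum-Welch practice) ties D_m to the occupancy accumulated from the competing strings, e.g.
\[
D_m = \max\Bigl(E \sum_{r,t} \gamma_m^{\mathrm{den}}(r,t),\; D_{\min}\Bigr), \qquad E \approx 1\text{ to }2,
\]
with a global floor D_min that keeps the updated covariances positive definite.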

17 MCE: Workflow
Training utterances + last-iteration model Λ'  →  Recognition  →  competing strings
Competing strings + training transcripts  →  GT-MCE update  →  new model Λ  →  next iteration
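A rough Python-style sketch of this loop (the helpers recognize, accumulate_mce_stats, and gt_update are hypothetical placeholders, not the lecture's code):

```python
def gt_mce_training(train_utts, transcripts, model, n_iter=10):
    """Batch-mode GT-MCE training loop, following the workflow slide."""
    for _ in range(n_iter):
        # Decode the training set with the last-iteration model to get
        # the competing string for each utterance.
        competing = [recognize(model, x) for x in train_utts]

        # Accumulate numerator (correct transcript) and denominator
        # (competing string) statistics; this pass parallelizes trivially
        # over utterances.
        stats = accumulate_mce_stats(model, train_utts, transcripts, competing)

        # Apply the growth-transformation update to produce the new model.
        model = gt_update(model, stats)
    return model
```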

18 Experiment: TI-DIGITS
Vocabulary: 1 to 9, plus "oh" and "zero"
Training set: 8623 utterances / 28329 words
Test set: 8700 utterances / 28583 words
Features: 33-dimensional spectral features: energy + 10 MFCCs, plus Δ and ΔΔ features
Model: continuous-density HMMs
Total number of Gaussian components: 3284

19 Experiment: TI-DIGITS
GT-MCE vs. the ML (maximum likelihood) baseline: obtains the lowest error rate on this task, reduces word error rate (WER) by 23%, and converges fast and stably.

20 Experiment: Microsoft Tele. ASR
Microsoft Speech Server (ENUTEL): a telephony speech recognition system
Training set: 2000 hours of speech / 2.7 million utterances
Features: 33-dimensional spectral features: (energy + MFCCs) + Δ + ΔΔ
Acoustic model: Gaussian mixture HMM
Total number of Gaussian components: 100K
Vocabulary: 120K (delivered vendor lexicon)
CPU cluster: 100 CPUs @ 1.8 GHz to 3.4 GHz
Training cost: 4 to 5 hours per iteration

21 Experiment: Microsoft Tele. ASR
Name   Vocab. size   # words   Description
MSCT   70K           4356      enterprise call center system (the MS call center we use daily)
SA     20K           43966     major commercial applications (includes much cell-phone data)
QSR    55K           5718      name dialing system (many names are OOV; relies on LTS)
ACNT   20K           3219      foreign-accented speech recognition (designed to test system robustness)
Evaluated on four corpus-independent test sets, collected from sites other than the training data providers, covering the major commercial telephony ASR scenarios.

22 Experiment: Microsoft Tele. ASR
Test   ML WER    GT-MCE WER   WER reduction
MSCT   11.59%    9.73%        16.04%
SA     11.24%    10.07%       10.40%
QSR    9.55%     8.58%        10.07%
ACNT   32.68%    29.00%       11.25%
Significant performance improvements across the board. This is the first time MCE has been successfully applied to a 2000-hour speech database. Growth Transformation based MCE training is well suited for large-scale modeling tasks.

