1 Discriminative Learning for Hidden Markov Models. Li Deng, Microsoft Research. EE 516, UW, Spring 2009.

2 Minimum Classification Error (MCE). The objective function of MCE training is a smoothed recognition error rate. Traditionally, the MCE criterion is optimized through stochastic gradient descent (e.g., the generalized probabilistic descent, GPD, algorithm). In this work we propose a growth-transformation (GT) based method for MCE model estimation.

3 Automatic Speech Recognition (ASR). Speech recognition: given an acoustic observation sequence $X$, find the most likely word string, $\hat{s} = \arg\max_s p(s \mid X) = \arg\max_s p(X \mid s)\, p(s)$.

4 Models (feature functions) in ASR. ASR in the log-linear framework uses three feature functions:
$h_1(s_r, X_r) = \log p(X_r \mid s_r; \Lambda)$ (acoustic model), with weight $\lambda_1 = 1$;
$h_2(s_r, X_r) = \log p(s_r)$ (language model), with weight $\lambda_2 = s$ (LM scale);
$h_3(s_r, X_r) = |s_r|$ (word count), with weight $\lambda_3 = p$ (word insertion penalty).
$\Lambda$ is the parameter set of the acoustic model (HMM), which is the quantity of interest in the MCE training of this work.
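To make the log-linear combination concrete, here is a minimal Python sketch (variable names and toy scores are mine, not from the slides) of how the three feature functions are weighted and combined for decoding:

def loglinear_score(log_p_acoustic, log_p_lm, num_words,
                    lm_scale=12.0, word_ins_penalty=-0.5):
    # h1: acoustic log-likelihood, fixed weight lambda_1 = 1
    # h2: language-model log-probability, weight lambda_2 = LM scale
    # h3: word count, weight lambda_3 = word insertion penalty
    return (log_p_acoustic
            + lm_scale * log_p_lm
            + word_ins_penalty * num_words)

# Decoding picks the hypothesis string with the highest combined score.
hyps = {
    "one two three": (-120.3, -4.1, 3),  # (log p(X|s), log p(s), |s|)
    "one too three": (-119.8, -7.9, 3),
}
best = max(hyps, key=lambda s: loglinear_score(*hyps[s]))
print(best)  # "one two three": the better LM score outweighs the acoustics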

5 MCE: Misclassification measure. Define the misclassification measure for utterance $r$ (in the case of using the correct and the top-one incorrect competing tokens) as $d_r(X_r, \Lambda) = -\log p(X_r, S_r; \Lambda) + \log p(X_r, s_{r,1}; \Lambda)$, where $s_{r,1}$ is the top-one incorrect competing string (not equal to the reference $S_r$).
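A minimal sketch of this one-best misclassification measure (toy joint log-scores; names mine):

def misclassification_measure(log_score_correct, log_score_competitor):
    # d_r > 0: the top competitor outscores the reference (an error);
    # d_r < 0: the reference wins (correct classification).
    return -log_score_correct + log_score_competitor

d = misclassification_measure(-100.2, -101.7)
print(d)  # -1.5: correctly classified, with a margin of 1.5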

6 MCE: Loss function. The loss function is a smoothed error-count function, typically the sigmoid $l(d_r) = 1 / (1 + e^{-\alpha d_r})$. It approximates the 0/1 classification error: $d_r(X_r, \Lambda) > 0$ gives $l \approx 1$ (a classification error); $d_r(X_r, \Lambda) < 0$ gives $l \approx 0$ (no classification error).

7 MCE: Objective function. The MCE objective function is $L_{\mathrm{MCE}}(\Lambda) = \frac{1}{R}\sum_{r=1}^{R} l(d_r(X_r, \Lambda))$, the smoothed recognition error rate at the string (token) level. The model (the acoustic model) is trained to minimize $L_{\mathrm{MCE}}(\Lambda)$, i.e., $\Lambda^{*} = \arg\min_{\Lambda} L_{\mathrm{MCE}}(\Lambda)$.
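Combining the misclassification measure with the sigmoid loss gives the whole objective; a short sketch, where the slope alpha is an assumed smoothing parameter:

import math

def mce_loss(d, alpha=1.0):
    # Sigmoid-smoothed 0/1 error: near 1 when d > 0, near 0 when d < 0.
    return 1.0 / (1.0 + math.exp(-alpha * d))

def L_MCE(d_values, alpha=1.0):
    # Smoothed string-level error rate over the R training utterances.
    return sum(mce_loss(d, alpha) for d in d_values) / len(d_values)

print(L_MCE([-1.5, 0.3, -4.0]))  # ~0.26: roughly one error in three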

8 MCE: Optimization.
Traditional stochastic GD: gradient-descent-based online optimization; convergence is unstable; the training process is difficult to parallelize.
New growth transformation: extends the Baum-Welch-based batch-mode method; stable convergence; ready for parallelized processing.

9 MCE: Optimization. Growth-transformation-based MCE proceeds through a chain of equivalent problems:
minimizing $L_{\mathrm{MCE}}(\Lambda) = \sum_r l(d_r)$
→ maximizing $P(\Lambda) = G(\Lambda)/H(\Lambda)$
→ maximizing $F(\Lambda; \Lambda') = G(\Lambda) - P(\Lambda')H(\Lambda) + D$
→ maximizing $F(\Lambda; \Lambda') = \sum f(\cdot)$
→ maximizing $U(\Lambda; \Lambda') = \sum f(\cdot; \Lambda') \log f(\cdot; \Lambda)$
→ GT formula: solve $\partial U / \partial \Lambda = 0$ to obtain $\Lambda = T(\Lambda')$.
If $\Lambda = T(\Lambda')$ ensures $P(\Lambda) > P(\Lambda')$, i.e., $P(\Lambda)$ grows, then $T(\cdot)$ is called a growth transformation of $\Lambda$ for $P(\Lambda)$.

10 MCE: Optimization. With a unit-slope sigmoid ($\alpha = 1$), the MCE loss function can be rewritten as $l(d_r(X_r, \Lambda)) = \frac{p(X_r, s_{r,1}; \Lambda)}{p(X_r, S_r; \Lambda) + p(X_r, s_{r,1}; \Lambda)}$. Then minimizing $L_{\mathrm{MCE}}(\Lambda)$ is equivalent to maximizing $Q(\Lambda) = \sum_{r=1}^{R} \frac{p(X_r, S_r; \Lambda)}{p(X_r, S_r; \Lambda) + p(X_r, s_{r,1}; \Lambda)}$.
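A quick numeric check (toy log-likelihoods; unit-slope sigmoid) that the sigmoid form and the likelihood-ratio form of the loss agree:

import math

log_p_correct, log_p_competitor = -100.0, -101.2

d = log_p_competitor - log_p_correct           # misclassification measure
loss_sigmoid = 1.0 / (1.0 + math.exp(-d))

p_correct = math.exp(log_p_correct)
p_competitor = math.exp(log_p_competitor)
loss_ratio = p_competitor / (p_correct + p_competitor)

print(abs(loss_sigmoid - loss_ratio) < 1e-12)  # True: identical quantities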

11 MCE: Optimization. $Q(\Lambda)$, a sum of rational functions, is further reformulated into a single fractional function $P(\Lambda) = G(\Lambda)/H(\Lambda)$, where $G(\Lambda) = \sum_{s_1, \ldots, s_R} C(s_1, \ldots, s_R) \prod_r p(X_r, s_r; \Lambda)$ and $H(\Lambda) = \sum_{s_1, \ldots, s_R} \prod_r p(X_r, s_r; \Lambda)$; each $s_r$ ranges over the correct and competing strings of utterance $r$, and $C(\cdot)$ counts how many utterances are labeled correctly.

12 MCE: Optimization. Increasing $P(\Lambda)$ can be achieved by maximizing $F(\Lambda; \Lambda') = G(\Lambda) - P(\Lambda')H(\Lambda) + D$, as long as $D$ is a $\Lambda$-independent constant ($\Lambda'$ is the parameter set obtained from the last iteration). Substitute $G(\cdot)$ and $H(\cdot)$ into $F(\cdot)$.
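Why maximizing $F$ grows $P$: a one-line check using only the definitions above (since $G = P \cdot H$, and the constant $D$ cancels in the difference),

\begin{aligned}
F(\Lambda;\Lambda') - F(\Lambda';\Lambda')
  &= G(\Lambda) - P(\Lambda')H(\Lambda) - \bigl[G(\Lambda') - P(\Lambda')H(\Lambda')\bigr] \\
  &= P(\Lambda)H(\Lambda) - P(\Lambda')H(\Lambda) \\
  &= H(\Lambda)\,\bigl[P(\Lambda) - P(\Lambda')\bigr].
\end{aligned}

Because $H(\Lambda) > 0$, any $\Lambda$ with $F(\Lambda;\Lambda') > F(\Lambda';\Lambda')$ also satisfies $P(\Lambda) > P(\Lambda')$.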

13 MCE: Optimization. Reformulate $F(\Lambda; \Lambda')$ as a sum over strings $s$ and hidden state sequences $q$ of terms $f(\chi, q, s, \Lambda; \Lambda')$ plus a constant $\Gamma(\Lambda')$, so that $F(\Lambda; \Lambda')$ is ready for EM-style optimization. Note: $\Gamma(\Lambda')$ is a constant, and $\log p(\chi, q \mid s, \Lambda)$ is easy to decompose.

14 MCE: Optimization. Increasing $F(\Lambda; \Lambda')$ can be achieved by maximizing the EM-style auxiliary function $U(\Lambda; \Lambda') = \sum_q \int f(\chi, q, s, \Lambda'; \Lambda') \log f(\chi, q, s, \Lambda; \Lambda')\, d\chi$. Solving $\partial U / \partial \Lambda = 0$ gives the growth transformation of $\Lambda$ for the CDHMM. Use the extended Baum-Welch statistics for the E-step; $\log f(\chi, q, s, \Lambda; \Lambda')$ is decomposable w.r.t. $\Lambda$, so the M-step is easy to compute.

15 MCE: Model estimation formulas. For a Gaussian-mixture CDHMM, the GT updates of the mean and covariance of Gaussian $m$ take the extended Baum-Welch form
$\mu_m = \frac{\sum_{r,t} \Delta\gamma_{m,r}(t)\, x_{r,t} + D_m \mu'_m}{\sum_{r,t} \Delta\gamma_{m,r}(t) + D_m}$,
$\Sigma_m = \frac{\sum_{r,t} \Delta\gamma_{m,r}(t)\,(x_{r,t} - \mu_m)(x_{r,t} - \mu_m)^{\top} + D_m \Sigma'_m + D_m (\mu_m - \mu'_m)(\mu_m - \mu'_m)^{\top}}{\sum_{r,t} \Delta\gamma_{m,r}(t) + D_m}$,
where $\Delta\gamma_{m,r}(t)$ is the difference between the occupancy statistics of Gaussian $m$ computed on the correct string and on the competing string, and primes denote last-iteration parameters.

16 MCE: Model estimation formulas. Setting of $D_m$: theoretically, set $D_m$ large enough that $f(\chi, q, s, \Lambda; \Lambda') > 0$. Empirically, a common choice sets $D_m$ proportional to the competing-token occupancy of Gaussian $m$ (scaled by a constant $E$), floored so that the updated covariances remain positive definite.
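A minimal NumPy sketch of the mean/variance update for one diagonal-covariance Gaussian, assuming the per-frame occupancy differences Δγ (correct-string minus competing-string posteriors) have already been accumulated; the D_m rule below is the common scale-the-negative-mass heuristic, not necessarily the slides' exact formula:

import numpy as np

def gt_update_gaussian(x, dgamma, mu_old, var_old, E=2.0):
    # x: (T, D) feature frames softly assigned to this Gaussian.
    # dgamma: (T,) occupancy difference gamma_num(t) - gamma_den(t).
    occ = dgamma.sum()
    # Empirical smoothing constant: E times the competing-side mass.
    D_m = E * np.abs(dgamma[dgamma < 0]).sum()

    mu_new = (dgamma @ x + D_m * mu_old) / (occ + D_m)
    second = dgamma @ (x * x)  # weighted second-order statistics
    var_new = ((second + D_m * (var_old + mu_old ** 2)) / (occ + D_m)
               - mu_new ** 2)
    var_new = np.maximum(var_new, 1e-3)  # floor: keep variances positive
    return mu_new, var_new

# Toy usage: 5 frames of 2-dimensional features.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 2))
dg = np.array([0.3, -0.1, 0.2, -0.05, 0.4])
mu, var = gt_update_gaussian(x, dg, mu_old=np.zeros(2), var_old=np.ones(2))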

17 MCE: Workflow. Training utterances + the last-iteration model $\Lambda'$ → recognition → competing strings; competing strings + training transcripts → GT-MCE update → new model $\Lambda$ → next iteration.
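The same workflow as a sketch of the outer training loop (decode_top1, accumulate_mce_stats, and gt_update are hypothetical placeholders for the recognizer, the E-step statistics accumulation, and the update above):

def gt_mce_train(model, utterances, transcripts, n_iters=10):
    # Placeholders: decode_top1, accumulate_mce_stats, gt_update.
    for _ in range(n_iters):
        # Recognition: decode competing strings with the current model.
        competitors = [decode_top1(model, x) for x in utterances]
        # E-step: accumulate MCE statistics against the reference transcripts.
        stats = accumulate_mce_stats(model, utterances, transcripts, competitors)
        # M-step: batch growth-transformation update yields the new model.
        model = gt_update(model, stats)
    return model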

18 Experiment: TI-DIGITS. Vocabulary: 1 to 9, plus "oh" and "zero". Training set: 8623 utterances. Test set: 8700 utterances. 33-dimensional spectral features: energy + 10 MFCCs, plus Δ and ΔΔ features. Model: continuous-density HMMs. Total number of Gaussian components: 3284.

19 Experiment: TI-DIGITS. GT-MCE vs. the ML (maximum-likelihood) baseline: obtains the lowest error rate reported on this task; reduces the recognition word error rate (WER) by 23%; fast and stable convergence.

20 Experiment: Microsoft Tele. ASR. Microsoft Speech Server (ENUTEL), a telephony speech recognition system. Training set: 2000 hours of speech / 2.7 million utterances. 33-dimensional spectral features: (E + MFCCs) + Δ + ΔΔ. Acoustic model: Gaussian-mixture HMM. Total number of Gaussian components: 100K. Vocabulary: 120K (delivered vendor lexicon). CPU cluster: clock speeds up to 3.4 GHz. Training cost: 4-5 hours per iteration.

21 Experiment: Microsoft Tele. ASR. Evaluate on four corpus-independent test sets, collected from sites other than the training-data providers and covering the major commercial telephony ASR scenarios:

Name   Voc. size   # words   Description
MSCT   70K         4356      enterprise call-center system (the MS call center used daily)
SA     20K         43966     major commercial applications (includes much cell-phone data)
QSR    55K         5718      name-dialing system (many names are OOV; relies on LTS)
ACNT   20K         3219      foreign-accented speech recognition (designed to test system robustness)

22 Experiment: Microsoft Tele. ASR.

Test set   ML WER    GT-MCE WER   WER reduction
MSCT       11.59%    9.73%        16.04%
SA         11.24%    10.07%       10.40%
QSR        9.55%     8.58%        10.07%
ACNT       32.68%    29.00%       11.25%

Significant performance improvements across the board. This is the first time MCE has been successfully applied to a 2000-hour speech database. Growth-transformation-based MCE training is well suited to large-scale modeling tasks.