Presented by: Fang-Hui Chu Discriminative Models for Speech Recognition M.J.F. Gales Cambridge University Engineering Department 2007.

Outline
Introduction
Hidden Markov Models
Discriminative Training Criteria
–Maximum Mutual Information
–Minimum Classification Error
–Minimum Bayes' Risk
–Techniques to improve generalization
Large Margin HMMs
Maximum Entropy Markov Models
Conditional Random Field
Dynamic Kernels
Conditional Augmented Models
Conclusions

Automatic Speech Recognition
The task of speech recognition is to determine the identity of a given observation sequence by assigning the recognized word sequence to it
The decision is to find the identity with maximum a posteriori (MAP) probability
–The so-called Bayes decision (or minimum-error-rate) rule
A certain parametric representation of these distributions is needed; HMMs are widely adopted for acoustic modeling
–The acoustic model (Gaussian-based) has to be estimated, while the language model (multinomial) is assumed to be given
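The decision-rule equation itself did not survive the transcript. As a hedged reconstruction of the standard MAP/Bayes decision rule the slide refers to (W denotes a word sequence and O an observation sequence; this notation is assumed throughout the notes below):

\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} p(O \mid W)\, P(W)

Here p(O \mid W) is the acoustic model and P(W) is the language model, matching the annotations on the original slide.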

Acoustic Modeling (1/2)
In the development of an ASR system, acoustic modeling is always an indispensable and crucial ingredient
The purpose of acoustic modeling is to provide a method to calculate the likelihood of a speech utterance O given a word sequence W
In principle, the word sequence can be decomposed into a sequence of phone-like units (acoustic models)
–Each of which is normally represented by an HMM and can be estimated from a corpus of training utterances
–Traditionally, maximum likelihood (ML) training is employed for this estimation
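A minimal sketch of the ML criterion mentioned above, assuming R training utterances with observations O^{(r)} and reference transcriptions W_ref^{(r)} (the exact notation on the original slide is not recoverable):

\mathcal{F}_{\mathrm{ML}}(\lambda) = \sum_{r=1}^{R} \log p\big(O^{(r)} \mid W_{\mathrm{ref}}^{(r)}; \lambda\big)

ML maximizes the likelihood of generating the training data; the discriminative criteria introduced later replace this objective.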

Acoustic Modeling (2/2)
Besides ML training, the acoustic model can alternatively be trained with discriminative training criteria
–MCE training, MMI training, MPE training, etc.
–In MCE training, an approximation to the error rate on the training data is optimized
–The MMI and MPE algorithms were developed in an attempt to correctly discriminate the recognition hypotheses for the best recognition results
However…
–The underlying acoustic model is still generative, with the associated constraints on the state and transition probability distributions
–Classification is still based on Bayes' decision rule

Introduction
Initially these discriminative criteria were applied to small vocabulary speech recognition tasks
A number of techniques were then developed to enable their use for LVCSR tasks
–I-smoothing
–Language model weakening
–The use of lattices to compactly represent the denominator score
But the performance on LVCSR tasks is still not satisfactory for many speech-enabled applications
–This has led to interest in discriminative (or direct) models for speech recognition, where the posterior of the word sequence given the observation, P(W | O), is directly modeled

Hidden Markov Models
HMMs are the standard acoustic model used in speech recognition
The likelihood function is given in the sketch below
The standard training of HMMs is based on maximum likelihood training
–This optimization is normally performed using Expectation Maximization (EM)
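The likelihood equation itself is missing from the transcript; the standard form for an HMM with hidden state sequence q, transition probabilities a_{ij} and output distributions b_j(o_t) (typically Gaussian mixtures) is, as a reconstruction:

p(O \mid W; \lambda) = \sum_{q} \prod_{t=1}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)

where the sum runs over all state sequences allowed by the composite HMM for the word sequence W.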

Discriminative Training Criteria
The discriminative training criteria are more closely linked to minimizing the error rate, rather than maximizing the likelihood of generating the training data
Three main forms of discriminative training have been examined
–Maximum Mutual Information (MMI)
–Minimum Classification Error (MCE)
–Minimum Bayes' Risk (MBR), of which Minimum Phone Error (MPE) is an instance

Discriminative Training Criteria
Maximum Mutual Information:
–Maximizes the mutual information between the observed sequences and the models
Minimum Classification Error:
–Based on a smooth function of the difference between the log-likelihood of the correct word sequence and that of all other competing word sequences
The criterion equations are sketched below
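The equations did not survive the transcript; the commonly used forms are reconstructed here with the notation from above (the MCE smoothing parameter ϱ is an assumption, and the exact smoothing function varies between papers):

\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{r} \log \frac{p\big(O^{(r)} \mid W_{\mathrm{ref}}^{(r)}; \lambda\big)\, P\big(W_{\mathrm{ref}}^{(r)}\big)}{\sum_{W} p\big(O^{(r)} \mid W; \lambda\big)\, P(W)}

\mathcal{F}_{\mathrm{MCE}}(\lambda) = \sum_{r} \left[ 1 + \left( \frac{p\big(O^{(r)} \mid W_{\mathrm{ref}}^{(r)}; \lambda\big)\, P\big(W_{\mathrm{ref}}^{(r)}\big)}{\sum_{W \neq W_{\mathrm{ref}}^{(r)}} p\big(O^{(r)} \mid W; \lambda\big)\, P(W)} \right)^{\varrho} \right]^{-1}

MMI is maximized, while the MCE expression, a smoothed error count, is minimized.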

Discriminative Training Criteria
Minimum Bayes' Risk:
–Rather than trying to model the correct distribution, the expected loss during inference is minimized
–A number of loss functions may be used:
1/0 function: equivalent to a sentence-level loss function
Word: the loss function directly related to minimizing the expected Word Error Rate (WER)
Phone: gives the Minimum Phone Error (MPE) criterion
The criterion is sketched below
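The MBR criterion equation is also missing; its standard form, where L(W, W_ref) is the chosen loss function (sentence, word or phone level), is:

\mathcal{F}_{\mathrm{MBR}}(\lambda) = \sum_{r} \sum_{W} P\big(W \mid O^{(r)}; \lambda\big)\, \mathcal{L}\big(W, W_{\mathrm{ref}}^{(r)}\big)

Minimizing this expected loss with the phone-level loss gives MPE.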

Large Margin HMMs
The simplest form of large margin training criterion can be expressed as maximizing the minimum margin sketched below [Li et al. 2005]
–This aims to maximize the minimum distance between the log-posterior of the correct label and that of all the incorrect labels
This criterion shares properties with both the MMI and MCE criteria
–A log-posterior cost function is used, as in the MMI criterion
–The denominator term used with this approach does not include an element from the correct label, in a similar fashion to the MCE criterion
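The maximized quantity is missing from the transcript; a hedged reconstruction in the notation used above (regularization and slack terms omitted) is:

\max_{\lambda}\; \min_{r}\; \min_{W \neq W_{\mathrm{ref}}^{(r)}} \Big[ \log P\big(W_{\mathrm{ref}}^{(r)} \mid O^{(r)}; \lambda\big) - \log P\big(W \mid O^{(r)}; \lambda\big) \Big]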

Large Margin HMMs
A couple of variants of large margin training
–Soft margin estimation [Jinyu Li et al. 2006]
–Large margin GMMs [F. Sha and L.K. Saul 2007]
In these variants the size of the margin is specified in terms of a loss function between the two label sequences, as sketched below
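A hedged sketch of a loss-scaled margin constraint of the kind used in these variants (the exact formulation differs between the two papers):

\log P\big(W_{\mathrm{ref}}^{(r)} \mid O^{(r)}; \lambda\big) - \log P\big(W \mid O^{(r)}; \lambda\big) \;\geq\; \mathcal{L}\big(W, W_{\mathrm{ref}}^{(r)}\big) \qquad \forall\, W \neq W_{\mathrm{ref}}^{(r)}

That is, sequences that differ more from the reference must be separated by a larger margin.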

Direct Models
Direct modeling attempts to model the posterior probability directly
There are many potential advantages as well as challenges for direct modeling
–The direct model can potentially make decoding simpler
–The direct model allows for the potential combination of multiple sources of data in a unified fashion
Asynchronous and overlapping features can be incorporated formally
It will be possible to take advantage of supra-segmental features like prosodic features, acoustic-phonetic features, speaker style, rate of speech, and channel differences
–However, joint estimation would require a large amount of parallel speech and text data (a challenge for data collection)

Direct Models
The relationship between observations and states is reversed
–Separate transition and observation probabilities are replaced with one function
–Direct modeling makes direct computation of the state posterior possible
The model can also be conditioned flexibly on a variety of contextual features
–Any computable property of the observation sequence can be used as a feature
–The number of features at each time frame need not be the same
Assumption: the Markov factorization sketched below
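The assumption equation is missing from the transcript; in the MEMM literature it is the Markov factorization of the state posterior, reconstructed here with s_t denoting the state and o_t the observation at time t:

P(s_{1:T} \mid o_{1:T}) = \prod_{t=1}^{T} P(s_t \mid s_{t-1}, o_t)

Each factor is the single transition-observation function referred to above.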

Maximum Entropy Markov Models
Recently, McCallum et al. (ICML 2000) modeled sequential processes using a direct model, similar to the HMM in graphical structure, and used exponential models for the transition-observation probabilities
–Called the Maximum Entropy Markov Model (MEMM)
Maximum Entropy (ME) modeling is used to model the conditional distributions
–ME modeling is based on the principle of avoiding unnecessary assumptions
–The principle states that the modeled probability distribution should be consistent with the given collection of facts about itself and otherwise be as uniform as possible

Maximum Entropy Markov Models
The mathematical interpretation of this principle results in a constrained optimization problem
–Maximize the entropy of a conditional distribution, subject to given constraints
–Constraints represent the known facts about the model from statistics of the training data
Definition 1 and Definition 2 are sketched below
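The two definitions are blank in the transcript; in the maximum entropy framework they are normally the feature functions and their empirical expectations, so the following is an assumption about what the slide contained:

Definition 1 (feature function): f_i(o, s), any computable (typically binary) property of the observation and state.

Definition 2 (empirical expectation):

\tilde{E}[f_i] = \sum_{o, s} \tilde{p}(o, s)\, f_i(o, s)

where \tilde{p}(o, s) is the empirical distribution of the training data.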

Maximum Entropy Markov Models
These definitions allow us to introduce the constraints of the model
The expected value of each feature with respect to the model is matched to its empirical expectation, as sketched below
Using Lagrange multipliers for constrained optimization, the desired probability distribution is given by the maximum of the Lagrangian function shown below
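The expectation, the constraints and the Lagrangian are missing from the transcript; a hedged reconstruction of the standard conditional maximum entropy formulation is:

E[f_i] = \sum_{o, s} \tilde{p}(o)\, p(s \mid o)\, f_i(o, s), \qquad \text{constraints: } E[f_i] = \tilde{E}[f_i] \;\; \forall i

\Lambda(p, \lambda) = H(p) + \sum_{i} \lambda_i \big( E[f_i] - \tilde{E}[f_i] \big), \qquad H(p) = -\sum_{o, s} \tilde{p}(o)\, p(s \mid o) \log p(s \mid o)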

Maximum Entropy Markov Models
Finally, the solution of the objective function is given by the exponential model sketched below
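The exponential-model solution itself did not survive the transcript; its standard form for the MEMM transition distribution, conditioned on the previous state s' and the observation o, is:

p(s \mid s', o) = \frac{1}{Z(o, s')} \exp\!\Big( \sum_{i} \lambda_i f_i(o, s) \Big), \qquad Z(o, s') = \sum_{s} \exp\!\Big( \sum_{i} \lambda_i f_i(o, s) \Big)

The Lagrange multipliers λ_i become the model parameters, typically trained with iterative scaling or gradient-based methods.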

Reference
[SAP06] Jeff Kuo and Yuqing Gao, "Maximum Entropy Direct Models for Speech Recognition"