ECE 8443 – Pattern Recognition / ECE 8423 – Adaptive Signal Processing
LECTURE 25: PRACTICAL ISSUES IN DISCRIMINATIVE TRAINING
Objectives: Supervised Learning
Resources:
  AG: Conditional Maximum Likelihood
  DP: Discriminative MAP
  LW: Discriminative MPE
  LW: Discriminative Linear Transform
  KY: Unsupervised Training
URL: .../publications/courses/ece_8423/lectures/current/lecture_25.ppt
MP3: .../publications/courses/ece_8423/lectures/current/lecture_25.mp3

ECE 8423: Lecture 25, Slide 1: Review
Discriminative models directly estimate the class posterior probability; the goal is to directly minimize the error rate. There are three general classes of discriminative algorithms: maximum mutual information (MMI), minimum classification error (MCE), and minimum "phone" or "word" error (MPE/MWE). We will focus here on MMIE.
MMI classifier design is based on maximizing the mutual information between the observed data and their corresponding labels. The reestimation formulas have been derived for the mean, $\mu_{jm}$, and diagonal covariance, $\sigma^2_{jm}$, of each state $j$ and mixture component $m$ (see the equations below).
Discriminative training is a form of adaptation in that the new mean is a weighted combination of the old mean, $\mu_{jm}$, and the difference between the MLE-weighted sample mean and the mean of the incorrect choices. $D$ is an adaptation control that is set to ensure the variance remains positive and to control the proportion of the prior relative to the estimate from the adaptation data. It is possible to set $D$ specific to each mixture component.
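For reference, a standard form of the extended Baum-Welch MMI updates consistent with the description above (the notation here is conventional rather than copied from the slide) is:

$$\hat{\mu}_{jm} = \frac{\theta^{\text{num}}_{jm}(\mathbf{O}) - \theta^{\text{den}}_{jm}(\mathbf{O}) + D\,\mu_{jm}}{\gamma^{\text{num}}_{jm} - \gamma^{\text{den}}_{jm} + D}$$

$$\hat{\sigma}^{2}_{jm} = \frac{\theta^{\text{num}}_{jm}(\mathbf{O}^{2}) - \theta^{\text{den}}_{jm}(\mathbf{O}^{2}) + D\,\bigl(\sigma^{2}_{jm} + \mu^{2}_{jm}\bigr)}{\gamma^{\text{num}}_{jm} - \gamma^{\text{den}}_{jm} + D} - \hat{\mu}^{2}_{jm}$$

where $\gamma_{jm} = \sum_t \gamma_{jm}(t)$, $\theta_{jm}(\mathbf{O}) = \sum_t \gamma_{jm}(t)\,\mathbf{o}_t$, and $\theta_{jm}(\mathbf{O}^2) = \sum_t \gamma_{jm}(t)\,\mathbf{o}_t^2$, accumulated separately over the correct ("num") and competing ("den") hypotheses.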

ECE 8423: Lecture 25, Slide 2: Estimation of Occupancies for Competing Hypotheses
The quantity $\theta^{\text{num}}_{jm}(\mathbf{O})/\gamma^{\text{num}}_{jm}$ is the mean of the observation vectors weighted by the occupancies; $\theta^{\text{den}}_{jm}(\mathbf{O})/\gamma^{\text{den}}_{jm}$ is a similar quantity for the incorrect, or alternate, hypotheses. The essence of discriminative training is to increase the probability of the correct choice (MLE) and decrease the probability of the incorrect choices.
We note that calculation of the Gaussian occupancies (i.e., the probability of being in state $j$ and component $m$, summed over all time) is required. In MLE training, these quantities are calculated for each training observation using the forward-backward algorithm (and transcriptions of the data). For MMIE training, we need counts, or occupancies, for both the correct model (numerator) and all competing models (denominator). The latter is extremely expensive and has been the subject of much research.
For problems not involving hidden or Markov models, such as the direct estimation of the parameters of a Gaussian mixture distribution, the correct state can be taken as the individual state that is most likely (e.g., Viterbi decoding); the incorrect states are simply all other states. Further, for small-scale problems, the computation of the occupancies for the incorrect states can be done exhaustively. However, for large-scale problems, the space of incorrect choices is extremely large, almost unlimited, and calculating these occupancies in closed form is prohibitive. A per-component update built from these statistics is sketched below.
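As an illustration of how the numerator and denominator statistics combine, here is a minimal Python sketch of a per-component extended Baum-Welch update. The function name, the heuristic $D = E\,\gamma^{\text{den}}$, and the doubling loop are illustrative choices under the assumptions above, not taken from the lecture.

```python
import numpy as np

def ebw_update(gamma_num, gamma_den, x_num, x_den, x2_num, x2_den,
               mu_old, var_old, E=2.0, var_floor=1e-3):
    """One extended Baum-Welch (MMI) update for a diagonal-covariance Gaussian.

    gamma_num, gamma_den : occupancies accumulated over the correct (numerator)
        and competing (denominator) hypotheses.
    x_num, x2_num : sums of gamma(t)*o_t and gamma(t)*o_t**2 for the numerator
        (x_den, x2_den are the denominator counterparts).
    mu_old, var_old : current mean and diagonal variance vectors.
    E : constant in the common heuristic D = E * gamma_den; D is doubled until
        every updated variance stays positive (a per-component D, as the slide notes).
    """
    mu_old = np.asarray(mu_old, dtype=float)
    var_old = np.asarray(var_old, dtype=float)
    D = max(E * gamma_den, 1.0)
    for _ in range(50):                      # grow D until the variances are valid
        denom = gamma_num - gamma_den + D
        mu_new = (x_num - x_den + D * mu_old) / denom
        var_new = (x2_num - x2_den + D * (var_old + mu_old**2)) / denom - mu_new**2
        if np.all(var_new > var_floor):
            return mu_new, var_new, D
        D *= 2.0
    return mu_new, np.maximum(var_new, var_floor), D
```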

ECE 8423: Lecture 25, Slide 3: Using N-Best Lists or Lattices for Calculating Occupancies
Hence, we need a method of generating all possible state sequences, or explanations of the data. There are two related methods of producing such estimates:
 N-best lists: produced using a stack decoding-type approach in which the N most likely explanations of the observation sequence are produced. For MLE estimation, N can be small, on the order of 100. For discriminative training, N should be very large (e.g., 1,000 or more).
 Lattices: a representation of the hypothesis space in which the start and stop times of individual symbols are arranged in a connected graph. Candidates are output if they exceed some preset likelihood threshold, so that the size (referred to as the depth) of the lattice can be controlled.
 The numerator occupancies are computed using the normal forward-backward algorithm over the lattice; however, only the arcs corresponding to the correct symbols composing the hypothesis (e.g., phones and words) are considered. This requires knowledge of the correct transcription of the data, which makes this a supervised learning method.
 The denominator occupancies are computed by considering all paths in the lattice. A toy N-best approximation is sketched below.
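To make the N-best case concrete, here is a small Python sketch of how hypothesis posteriors could be computed from an N-best list before spreading them over state alignments. The acoustic scale `kappa`, the scores, and the function name are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def nbest_posteriors(total_log_scores, kappa=1.0 / 12.0):
    """Posterior probability of each hypothesis in an N-best list.

    total_log_scores : combined acoustic + language-model log score per
        hypothesis.  kappa is an acoustic scale; without it the posterior
        mass tends to collapse onto the single best hypothesis.
    """
    scaled = kappa * np.asarray(total_log_scores, dtype=float)
    scaled -= scaled.max()                   # subtract max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

# Toy example: four hypotheses, index 0 being the correct transcription.
posteriors = nbest_posteriors([-1002.0, -1004.5, -1006.0, -1020.0])

# Numerator occupancies come from aligning only the correct hypothesis;
# denominator occupancies spread each hypothesis's posterior (including the
# correct one) over its state alignment, approximating the sum over all paths.
print(posteriors)
```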

ECE 8423: Lecture 25, Slide 4: Combining MMI and MAP for Adaptation
Why introduce MAP into MMIE? The derivation of MMI-MAP begins by adding a prior to our log posterior objective, which leads to an auxiliary function that augments the MMI criterion with the log of the prior. The form of the prior used in this version of MAP is a product of Gaussians, one for each parameter.
A smoothing function is introduced to facilitate the optimization process, but it does not change the location of the optimum parameter setting. It can also be interpreted as a log prior (see the sketch of the objective below).
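Written in a common notation (the symbols here are assumptions rather than copies of the slide's equations), the top-level MMI-MAP objective adds a log prior to the MMI criterion, with the prior taken as a product of Gaussians over the model parameters:

$$\mathcal{F}_{\text{MMI-MAP}}(\lambda) \;=\; \mathcal{F}_{\text{MMI}}(\lambda) \;+\; \log p(\lambda),
\qquad
p(\lambda) \;=\; \prod_{i} \mathcal{N}\!\bigl(\lambda_i \mid \lambda_i^{(0)},\, \sigma_i^{2}\bigr)$$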

ECE 8423: Lecture 25, Slide 5: MMI-MAP Adaptation Equations
The process of inserting this smoothing function and modeling it as a log prior is referred to as I-smoothing. The smoothing term represents the likelihood of the adaptation data given the ML, or "num", model. Under these assumptions, the numerator components of our MMI reestimation equations are modified accordingly and then applied in the MMI update equations (see the sketch below).
Hence, our MMI-MAP approach integrates prior knowledge of the parameters, ML estimates from the training data, and MMI-MAP estimates from the adaptation data. Typically, … while … .
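A minimal Python sketch of I-smoothing as it is commonly described: $\tau$ "virtual" observations from the ML (or MAP-adapted) model are blended into the numerator statistics before the extended Baum-Welch update. The value of `tau`, the variable names, and the exact blending are assumptions for illustration, not the lecture's equations.

```python
import numpy as np

def i_smooth_numerator(gamma_num, x_num, x2_num,
                       gamma_ml, x_ml, x2_ml, tau=100.0):
    """Blend ML statistics into the numerator statistics (I-smoothing sketch).

    Adding tau virtual observations drawn from the ML (or MAP-adapted) model
    pulls the discriminative update toward the prior estimate when a component
    sees little adaptation data; with abundant data the MMI statistics dominate.
    """
    scale = tau / max(gamma_ml, 1e-8)
    return (gamma_num + tau,
            np.asarray(x_num, dtype=float) + scale * np.asarray(x_ml, dtype=float),
            np.asarray(x2_num, dtype=float) + scale * np.asarray(x2_ml, dtype=float))
```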

ECE 8423: Lecture 25, Slide 6: MMI-MAP Experiments

ECE 8423: Lecture 25, Slide 7: Summary
Reviewed the basic MMI approach.
Discussed the challenges in generating estimates of the occupancies for the competing-hypotheses model.
Introduced the concept of MMI-MAP, which combines MAP adaptation with the MMI framework.
Presented some experimental results comparing these methods.
Next: can we relax the supervised learning constraint?  One obvious approach would be to "recognize" the data first and then use the recognition output as the supervision transcription.