ECE 8443 – Pattern Recognition
LECTURE 25: DISCRIMINATIVE TRAINING
Objectives: Bayes Rule; Mutual Information; Conditional Maximum Likelihood Estimation (CMLE); Maximum Mutual Information Estimation (MMIE); Minimum Classification Error (MCE)
Resources: K.V.: Discriminative Training; X.H.: Spoken Language Processing; F.O.: Maximum Entropy Models; J.B.: Discriminative Training
Audio: URL:

ECE 8443: Lecture 25, Slide 1
Entropy and Mutual Information
The entropy of a discrete random variable is defined as: $H(X) = -\sum_{x} p(x)\log p(x)$
The joint entropy between two random variables, X and Y, is defined as: $H(X,Y) = -\sum_{x}\sum_{y} p(x,y)\log p(x,y)$
The conditional entropy is defined as: $H(Y|X) = -\sum_{x}\sum_{y} p(x,y)\log p(y|x)$
The mutual information between two random variables, X and Y, is defined as: $I(X;Y) = \sum_{x}\sum_{y} p(x,y)\log \frac{p(x,y)}{p(x)\,p(y)}$
There is an important direct relation between mutual information and entropy: $I(X;Y) = H(X) - H(X|Y)$
By symmetry, it also follows that: $I(X;Y) = H(Y) - H(Y|X)$
Two other important relations: $I(X;Y) = H(X) + H(Y) - H(X,Y)$ and $I(X;X) = H(X)$
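As a quick sanity check on these identities, the minimal Python sketch below (not part of the original slides; the joint table is an arbitrary illustrative choice) computes each quantity for a small discrete joint distribution and verifies that $I(X;Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$.

```python
import numpy as np

# Arbitrary 2x2 joint distribution p(x, y), chosen only for illustration.
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])

p_x = p_xy.sum(axis=1)          # marginal p(x)
p_y = p_xy.sum(axis=0)          # marginal p(y)

H_X  = -np.sum(p_x * np.log2(p_x))                          # entropy H(X)
H_Y  = -np.sum(p_y * np.log2(p_y))                          # entropy H(Y)
H_XY = -np.sum(p_xy * np.log2(p_xy))                        # joint entropy H(X,Y)
H_Y_given_X = -np.sum(p_xy * np.log2(p_xy / p_x[:, None]))  # conditional entropy H(Y|X)
I_XY = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))    # mutual information I(X;Y)

print(f"H(X)={H_X:.3f}  H(Y)={H_Y:.3f}  H(X,Y)={H_XY:.3f}")
print(f"I(X;Y)={I_XY:.3f}  H(Y)-H(Y|X)={H_Y - H_Y_given_X:.3f}")
print(f"H(X)+H(Y)-H(X,Y)={H_X + H_Y - H_XY:.3f}")
```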

ECE 8443: Lecture 25, Slide 2
Mutual Information in Pattern Recognition
Given a sequence of observations, X = {x_1, x_2, …, x_n}, which typically can be viewed as feature vectors, we would like to minimize the error in predicting the corresponding class assignments, Ω = {ω_1, ω_2, …, ω_m}.
A reasonable approach would be to minimize the amount of uncertainty about the correct answer. This can be stated in information-theoretic terms as minimization of the conditional entropy: $H(\Omega|X) = -\sum_{\omega}\sum_{x} p(\omega,x)\log p(\omega|x)$
Using our relations from the previous slide, we can write: $H(\Omega|X) = H(\Omega) - I(\Omega;X)$
If our goal is to minimize H(Ω|X), then we can try to minimize H(Ω) or maximize I(Ω;X).
The minimization of H(Ω) corresponds to finding prior probabilities that minimize entropy, which amounts to accurate prediction of class labels from prior information. (In speech recognition, this is referred to as finding a language model with minimum entropy.)
Alternately, we can estimate parameters of our model that maximize mutual information -- referred to as maximum mutual information estimation (MMIE).
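To illustrate this on a toy problem (a sketch that is not part of the original lecture; the "blurry" and "sharp" class-conditional tables are hypothetical), the snippet below shows that with fixed priors, the model with higher I(Ω;X) is exactly the one with lower conditional entropy H(Ω|X):

```python
import numpy as np

def cond_entropy_and_mi(p_x_given_w, priors):
    """Return H(Omega|X) and I(Omega;X) for a discrete class/feature model."""
    joint = p_x_given_w * priors[:, None]          # p(omega, x)
    p_x = joint.sum(axis=0)                        # evidence p(x)
    post = joint / p_x                             # p(omega | x)
    H_w_given_x = -np.sum(joint * np.log2(post))   # conditional entropy H(Omega|X)
    H_w = -np.sum(priors * np.log2(priors))        # prior entropy H(Omega)
    return H_w_given_x, H_w - H_w_given_x          # I = H(Omega) - H(Omega|X)

priors = np.array([0.5, 0.5])
# Two hypothetical models over a 3-valued feature; "sharp" separates the classes better.
blurry = np.array([[0.4, 0.4, 0.2],
                   [0.3, 0.4, 0.3]])
sharp  = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.1, 0.8]])

for name, model in [("blurry", blurry), ("sharp", sharp)]:
    H, I = cond_entropy_and_mi(model, priors)
    print(f"{name}: H(Omega|X)={H:.3f}  I(Omega;X)={I:.3f}")
```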

ECE 8443: Lecture 25, Slide 3
Posteriors and Bayes Rule
The decision rule for the minimum error rate classifier selects the class, ω_i, with the maximum posterior probability, P(ω_i|x).
Recalling Bayes rule: $P(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)}$
The evidence, p(x), can be expressed as: $p(x) = \sum_{k} p(x|\omega_k)\,P(\omega_k)$
As we have discussed, we can ignore p(x) since it is constant with respect to the maximization (choosing the most probable class). However, during training, the value of p(x) depends on the parameters of all models and varies as a function of x.
A conditional maximum likelihood estimator, θ*_CMLE, is defined as: $\theta^{*}_{CMLE} = \arg\max_{\theta} P_{\theta}(\omega_i|x)$, the parameters that maximize the posterior of the correct class.
From the previous slide, we can invoke mutual information: $I(X;\Omega) = E\left[\log\frac{p(X,\Omega)}{p(X)\,p(\Omega)}\right]$
However, we don't know the joint distribution, p(X,Ω)!
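A minimal numeric illustration of these relations (hypothetical 1-D Gaussian class models and priors, not part of the original slides): the evidence is the prior-weighted sum of the class-conditional likelihoods, and it is this denominator that ties all of the models together during discriminative training.

```python
import numpy as np

# Hypothetical 1-D class-conditional Gaussians and priors (illustrative values only).
means  = np.array([0.0, 2.0])
stds   = np.array([1.0, 1.0])
priors = np.array([0.7, 0.3])
x = 1.2                                   # a single observation

# p(x | omega_k) for each class, evaluated directly from the Gaussian density.
likelihoods = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
evidence    = np.sum(likelihoods * priors)        # p(x) = sum_k p(x|omega_k) P(omega_k)
posteriors  = likelihoods * priors / evidence     # Bayes rule: P(omega_k | x)

print("posteriors:", posteriors)          # what a CMLE/MMIE criterion would push up
print("decision  :", np.argmax(posteriors))
```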

ECE 8443: Lecture 25, Slide 4
Approximations to Mutual Information
We can define the instantaneous mutual information: $I(x,\omega_i) = \log\frac{p(x,\omega_i)}{p(x)\,P(\omega_i)}$
If we assume equal priors, then maximizing the conditional likelihood is equivalent to maximizing the instantaneous mutual information. This is referred to as MMIE.
We can expand I(x,ω_i) into two terms, one representing the correct class, and the other representing the incorrect classes (or competing models): $I(x,\omega_i) = \log p(x|\omega_i) - \log\left[p(x|\omega_i)P(\omega_i) + \sum_{k\ne i} p(x|\omega_k)P(\omega_k)\right]$
The posterior probability, P(ω_i|x), can be rewritten as: $P(\omega_i|x) = \frac{p(x|\omega_i)P(\omega_i)}{p(x|\omega_i)P(\omega_i) + \sum_{k\ne i} p(x|\omega_k)P(\omega_k)}$
Maximization of the posterior implies we need to reinforce the correct model and minimize the contribution from the competing models.
We can further rewrite the posterior: $P(\omega_i|x) = \frac{1}{1 + \left[\sum_{k\ne i} p(x|\omega_k)P(\omega_k)\right]\big/\left[p(x|\omega_i)P(\omega_i)\right]}$, which implies we must maximize: $\frac{p(x|\omega_i)P(\omega_i)}{\sum_{k\ne i} p(x|\omega_k)P(\omega_k)}$
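The sketch below (hypothetical class scores, not from the slides) evaluates the posterior in both the "correct over correct-plus-competing" form and the rewritten form, along with the correct-to-competing ratio that MMIE-style training tries to increase.

```python
import numpy as np

# Hypothetical per-class scores p(x|omega_k) * P(omega_k) for one observation.
weighted = np.array([0.012, 0.004, 0.001])   # index 0 is taken as the correct class
i = 0

correct   = weighted[i]
competing = np.sum(np.delete(weighted, i))

posterior     = correct / (correct + competing)       # P(omega_i | x)
posterior_alt = 1.0 / (1.0 + competing / correct)     # rewritten form from the slide
ratio         = correct / competing                   # quantity MMIE-style training drives up

print(posterior, posterior_alt, ratio)
```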

ECE 8443: Lecture 25, Slide 5
Additional Comments
This illustrates a fundamental difference: in MLE, only the correct model need be updated during training, while in MMIE, every model is updated.
The greater the prior probability of a competing class label, ω_k, the greater its contribution to the MMIE model, which makes sense since the chance of misrecognizing the input as this class is greater.
Though it might appear MMIE and MLE are related, they will produce different solutions for finite amounts of training data. They should converge in the limit of infinite training data.
The principal difference is that in MMIE, we decrement the likelihood of the competing models, while in MLE, we only reinforce the correct model.
MMIE, however, is computationally very expensive and lacks an efficient maximization algorithm. Instead, we must use gradient descent.

ECE 8443: Lecture 25, Slide 6
Gradient Descent
We will define our optimization problem as a minimization of an objective F(θ) (e.g., the negative of the MMIE criterion): $\theta^{*} = \arg\min_{\theta} F(\theta)$
No closed-form solution exists, so we must use an iterative approach: $\theta_{t+1} = \theta_t - \epsilon_t\,\nabla_{\theta} F(\theta)\big|_{\theta=\theta_t}$
We can apply this approach to HMMs. The objective functions used in discriminative training are rational functions. The original Baum-Welch algorithm requires that the objective function be a homogeneous polynomial. The Extended Baum-Welch (EBW) algorithm was developed for use on rational functions and has been applied to continuous-density HMM systems.
The reestimation formulas have been derived for the mean and diagonal covariance for a state j and mixture component m:
$\hat{\mu}_{jm} = \frac{\left\{\sum_t \gamma^{num}_{jm}(t)\,x_t - \sum_t \gamma^{den}_{jm}(t)\,x_t\right\} + D\,\mu_{jm}}{\left\{\sum_t \gamma^{num}_{jm}(t) - \sum_t \gamma^{den}_{jm}(t)\right\} + D}$
$\hat{\sigma}^{2}_{jm} = \frac{\left\{\sum_t \gamma^{num}_{jm}(t)\,x_t^{2} - \sum_t \gamma^{den}_{jm}(t)\,x_t^{2}\right\} + D\,(\sigma^{2}_{jm} + \mu^{2}_{jm})}{\left\{\sum_t \gamma^{num}_{jm}(t) - \sum_t \gamma^{den}_{jm}(t)\right\} + D} - \hat{\mu}^{2}_{jm}$
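The following is a minimal sketch of an EBW-style update for a single diagonal-Gaussian component, assuming the standard form in which numerator (correct-model) and denominator (competing-model) statistics are differenced and smoothed by the constant D; the accumulated statistics and the value of D are hypothetical.

```python
import numpy as np

def ebw_update(mu, var, num_stats, den_stats, D):
    """One EBW-style update of a diagonal-Gaussian mean/variance from
    numerator (correct-model) and denominator (competing-model) statistics.
    Each stats dict holds occupancy 'gamma', first-order 'sum_x', second-order 'sum_x2'."""
    gamma  = num_stats["gamma"]  - den_stats["gamma"]
    sum_x  = num_stats["sum_x"]  - den_stats["sum_x"]
    sum_x2 = num_stats["sum_x2"] - den_stats["sum_x2"]

    mu_new  = (sum_x + D * mu) / (gamma + D)
    var_new = (sum_x2 + D * (var + mu ** 2)) / (gamma + D) - mu_new ** 2
    return mu_new, np.maximum(var_new, 1e-4)   # floor the variance for stability

# Hypothetical accumulated statistics for one state/component (1-D for clarity).
num = {"gamma": 50.0, "sum_x": 105.0, "sum_x2": 240.5}
den = {"gamma": 20.0, "sum_x":  30.0, "sum_x2":  70.0}
print(ebw_update(mu=2.0, var=1.0, num_stats=num, den_stats=den, D=100.0))
```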

ECE 8443: Lecture 25, Slide 7
Estimating Probabilities
Mixture coefficients and transition probabilities are a little harder to estimate using discriminative training and have not shown significant improvements in performance. Similarly, it is possible to integrate mixture splitting into the discriminative training process, but that has provided only marginal gains as well.
There are various strategies for estimating the learning parameter, D, including making it specific to the mixture component and the state.
We note that calculation of the Gaussian occupancies (i.e., the probability of being in state j and component m, summed over all time) is required. In MLE training, these quantities are calculated for each training observation using the forward-backward algorithm (and transcriptions of the data).
For MMIE training, we need counts for both the correct model (numerator) and all competing models (denominator). The latter is extremely expensive and has been the subject of much research. The preferred approach is to use a lattice to approximate the calculation of the probability of the competing models.
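For reference, the occupancies themselves come from the forward-backward algorithm; the sketch below (a toy two-state model with hypothetical transition and observation likelihoods, not from the slides) computes the per-frame state posteriors γ_j(t) and their sums over time, which are the counts accumulated separately for the numerator and denominator models.

```python
import numpy as np

def occupancies(A, b, pi):
    """State occupancies gamma_j(t) for one observation sequence via
    the forward-backward algorithm. b[t, j] = p(x_t | state j)."""
    T, N = b.shape
    alpha = np.zeros((T, N))
    beta  = np.ones((T, N))
    alpha[0] = pi * b[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]          # forward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])        # backward pass
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize per frame

# Hypothetical 2-state left-to-right model and 4 frames of observation likelihoods.
A  = np.array([[0.9, 0.1], [0.0, 1.0]])
pi = np.array([1.0, 0.0])
b  = np.array([[0.8, 0.1], [0.7, 0.2], [0.2, 0.6], [0.1, 0.9]])
gamma = occupancies(A, b, pi)
print(gamma)                       # per-frame state posteriors
print(gamma.sum(axis=0))           # occupancy summed over time (used in EBW statistics)
```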

ECE 8443: Lecture 25, Slide 8
Minimum Error Rate Estimation
Parameter estimation techniques discussed thus far seek to maximize the likelihood function (MLE or MAP) or the posterior probability (MMIE).
Recall that early in the course we discussed the possibility of directly minimizing the probability of error, P(E), or Bayes risk, which is known as minimum error rate estimation. But that is often intractable.
Minimum error rate classification is also known as minimum classification error (MCE). Similar to MMIE, the algorithm generally tests the classifier using reestimated models, reinforces correct models, and suppresses incorrect models.
In MMIE, we used a posterior, P(ω_i|x), as the discriminant function. However, any valid discriminant function can be used. Let us denote a family of these functions as {d_i(x)}.
To find an alternate smooth error function for MCE, assume the discriminant function family contains s discriminant functions d_i(x,Λ), i = 1, 2, …, s. We can attempt to minimize the error rate directly using a misclassification measure: $e_i(x) = -d_i(x,\Lambda) + \left[\frac{1}{s-1}\sum_{j\ne i} d_j(x,\Lambda)^{\eta}\right]^{1/\eta}$
e_i(x) > 0 implies a recognition error; e_i(x) ≤ 0 implies a correct recognition.
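A small sketch of this misclassification measure (hypothetical discriminant scores; the η values are chosen only to show the limiting behavior discussed on the next slide):

```python
import numpy as np

def misclassification_measure(d, i, eta):
    """e_i(x): negative correct-class discriminant plus an eta-weighted
    average of the competing discriminants (scores assumed non-negative)."""
    competitors = np.delete(d, i)
    competing_score = np.mean(competitors ** eta) ** (1.0 / eta)
    return -d[i] + competing_score

d = np.array([4.0, 2.5, 1.0])        # hypothetical discriminant scores d_j(x)
for eta in (1.0, 10.0, 100.0):       # eta=1: average competitor; large eta: best competitor
    e = misclassification_measure(d, i=0, eta=eta)
    print(f"eta={eta:5.1f}  e_0(x)={e:.3f}  ({'error' if e > 0 else 'correct'})")
```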

ECE 8443: Lecture 25, Slide 9
Minimum Error Rate Estimation (cont.)
η is a positive learning constant that controls how we weight the competing classes. When η → ∞, the competing-class score becomes the score of the best competing class, $\max_{j\ne i} d_j(x,\Lambda)$; η = 1 implies the average score for all competing classes is used.
To transform this to a smooth, differentiable function, we use a sigmoid function to embed e_i(x) in a smooth zero-one function: $\ell_i(x,\Lambda) = \frac{1}{1 + \exp(-\gamma\,e_i(x))},\ \gamma > 0$
ℓ_i(x,Λ) essentially represents a soft recognition error count.
The recognizer's loss function can be defined as: $L(\Lambda) = E\left[\sum_{i=1}^{s}\ell_i(x,\Lambda)\,\delta(x\in\omega_i)\right]$, where δ(·) is a Kronecker delta function.
Since the expectation is taken over the training data, we can optimize ℓ_i(x,Λ) token by token instead of L(Λ) directly. We can rarely solve this analytically and instead must use gradient descent: $\Lambda_{t+1} = \Lambda_t - \epsilon_t\,\nabla\ell_i(x_t,\Lambda)\big|_{\Lambda=\Lambda_t}$
Gradient descent makes MCE, like MMIE, more computationally expensive.
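As a rough illustration of this style of update (a toy 1-D example with a hypothetical squared-distance discriminant and a numerical finite-difference gradient rather than the analytic gradient used in practice):

```python
import numpy as np

def mce_loss(e, gamma=1.0):
    """Smooth zero-one loss: sigmoid of the misclassification measure e_i(x)."""
    return 1.0 / (1.0 + np.exp(-gamma * e))

def mce_grad_step(params, e_fn, lr=0.1, gamma=1.0, eps=1e-5):
    """One gradient-descent step on the per-token MCE loss.
    e_fn(params) returns the misclassification measure for the current token."""
    grad = np.zeros_like(params)
    base = mce_loss(e_fn(params), gamma)
    for k in range(len(params)):                 # finite-difference gradient
        p = params.copy(); p[k] += eps
        grad[k] = (mce_loss(e_fn(p), gamma) - base) / eps
    return params - lr * grad                    # descend on the soft error count

# Hypothetical 1-D token: class means as parameters, d_j(x) = negative squared distance.
x = 0.8
e_fn = lambda m: (x - m[0]) ** 2 - (x - m[1]) ** 2   # e = -d_correct + d_competing
params = np.array([0.0, 1.0])                        # [correct mean, competing mean]
for _ in range(3):
    params = mce_grad_step(params, e_fn)
    print(params, mce_loss(e_fn(params)))            # correct mean moves toward x
```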

ECE 8443: Lecture 25, Slide 10
Minimum Error Rate Estimation (cont.)
The gradient descent approach is often referred to as generalized probabilistic descent (GPD).
A simpler, more pragmatic approach to discriminative training is corrective training. A labeled training set is used to train parameters using MLE. A list of confusable or near-miss classes is generated (using N-best lists or lattices). The parameters of the models corresponding to the correct classes are reinforced; the parameters of the confusion classes are decremented. This is easy to do in Baum-Welch training because the "counts" are accumulated.
Corrective training was introduced first, and later generalized into MMIE and MCE.
Neural networks offer an alternative to these functional approaches to discriminative training.
Discriminative training has been applied to every aspect of the pattern recognition problem, including feature extraction, frame-level statistical modeling, and adaptation. Minimization of error in a closed-loop training paradigm is very powerful.

ECE 8443: Lecture 25, Slide 11
Summary
Reviewed basic concepts of entropy and mutual information from information theory.
Discussed the use of these concepts in pattern recognition, particularly the goal of minimizing conditional entropy.
Demonstrated the relationship between minimizing conditional entropy and maximizing mutual information.
Derived a method of training that maximizes mutual information (MMIE). Discussed its implementation within an HMM framework.
Introduced an alternative to MMIE that directly minimizes errors (MCE). Discussed approaches to MCE using gradient descent.
In practice, MMIE and MCE produce similar results.
These methods carry the usual concerns about generalization: can models trained on one particular corpus or task transfer to another task? In practice, generalization has not been robust across widely varying applications or channel conditions. In controlled environments, performance improvements are statistically significant but not overwhelmingly impressive.