ECE 8443 – Pattern Recognition / ECE 8527 – Introduction to Machine Learning and Pattern Recognition
LECTURE 33: DISCRIMINATIVE TRAINING
Objectives: Bayes Rule; Mutual Information; Conditional Maximum Likelihood Estimation (CMLE); Maximum Mutual Information Estimation (MMIE); Minimum Classification Error (MCE)
Resources: K.V.: Discriminative Training; X.H.: Spoken Language Processing; F.O.: Maximum Entropy Models; J.B.: Discriminative Training

ECE 8527: Lecture 33, Slide 1 Mutual Information in Pattern Recognition Given a sequence of observations, X = {x_1, x_2, …, x_n}, which typically can be viewed as feature vectors, we would like to minimize the error in predicting the corresponding class assignments, Ω = {Ω_1, Ω_2, …, Ω_m}. A reasonable approach is to minimize the amount of uncertainty about the correct answer. This can be stated in information-theoretic terms as minimization of the conditional entropy: H(Ω|X) = −Σ_x Σ_i P(Ω_i, x) log P(Ω_i | x). Using our relations from the previous page, we can write: H(Ω|X) = H(Ω) − I(Ω; X). If our goal is to minimize H(Ω|X), then we can try to minimize H(Ω) or maximize I(Ω; X). Minimizing H(Ω) corresponds to finding prior probabilities that minimize entropy, which amounts to accurate prediction of class labels from prior information. (In speech recognition, this is referred to as finding a language model with minimum entropy.) Alternatively, we can estimate the parameters of our model to maximize mutual information, an approach referred to as maximum mutual information estimation (MMIE).
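As a quick numerical illustration of this identity (not part of the original slides), the sketch below computes H(Ω), H(Ω|X), and I(Ω; X) for a small, made-up discrete joint distribution and checks that H(Ω|X) = H(Ω) − I(Ω; X). The probability table and names are invented for illustration.

```python
# Minimal sketch: verify H(Omega|X) = H(Omega) - I(Omega; X) for a toy
# discrete joint distribution. The joint table below is a made-up example.
import numpy as np

# Joint probability table P(Omega_i, x_j): rows = classes, columns = observations.
P = np.array([[0.20, 0.10, 0.05],
              [0.05, 0.30, 0.30]])

P_omega = P.sum(axis=1)          # marginal P(Omega)
P_x     = P.sum(axis=0)          # marginal P(X)

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_omega  = entropy(P_omega)                      # H(Omega)
H_joint  = entropy(P.ravel())                    # H(Omega, X)
H_cond   = H_joint - entropy(P_x)                # H(Omega|X) = H(Omega,X) - H(X)
I_mutual = H_omega + entropy(P_x) - H_joint      # I(Omega; X)

# The identity used on this slide: H(Omega|X) = H(Omega) - I(Omega; X).
assert np.isclose(H_cond, H_omega - I_mutual)
print(f"H(Omega|X) = {H_cond:.4f} bits, H(Omega) = {H_omega:.4f}, I = {I_mutual:.4f}")
```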

ECE 8527: Lecture 33, Slide 2 Minimum Error Rate Estimation The parameter estimation techniques discussed thus far seek to maximize the likelihood function (MLE or MAP) or the posterior probability (MMIE). Recall that early in the course we discussed the possibility of directly minimizing the probability of error, P(E), or the Bayes risk, which is known as minimum error rate estimation; that is often intractable. Minimum error rate classification is also known as minimum classification error (MCE). Similar to MMIE, the algorithm generally tests the classifier using reestimated models, reinforces correct models, and suppresses incorrect models. In MMIE, we used a posterior, P(ω_i|x), as the discriminant function; however, any valid discriminant function can be used. Let us denote a family of these functions as {d_i(x)}. To find a smooth alternative error function for MCE, assume the discriminant function family contains s discriminant functions d_i(x, θ), i = 1, 2, …, s. We can then attempt to minimize the error rate directly by defining a per-class error measure e_i(x): e_i(x) ≠ 0 implies a recognition error, and e_i(x) = 0 implies a correct recognition.
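To see why a smooth error measure is needed (this example is not from the lecture), the sketch below builds a toy family of discriminant functions d_i(x, θ) from two made-up 1-D Gaussian class models and shows that the raw error count obtained through an argmax decision rule is piecewise constant in θ, and therefore cannot be minimized with gradient methods.

```python
# Minimal sketch (illustrative): the raw error count over a family of
# discriminant functions d_i(x, theta) is piecewise constant in theta,
# which motivates the smooth MCE surrogate on the next slide.
# The 1-D Gaussian class models and data below are made-up examples.
import numpy as np

rng = np.random.default_rng(0)
x_pos = rng.normal(+1.0, 1.0, size=200)   # samples from class 0
x_neg = rng.normal(-1.0, 1.0, size=200)   # samples from class 1
X = np.concatenate([x_pos, x_neg])
y = np.concatenate([np.zeros(200, int), np.ones(200, int)])

def discriminants(x, theta):
    """d_i(x, theta): log-likelihood of each class; theta = (mean_0, mean_1)."""
    means = np.asarray(theta)
    return -0.5 * (x[:, None] - means[None, :]) ** 2   # shape (n_samples, 2)

def error_count(theta):
    """Raw (non-smooth) error rate: 1 whenever argmax_i d_i != true class."""
    d = discriminants(X, theta)
    return np.mean(np.argmax(d, axis=1) != y)

# Small changes in theta often leave the error count unchanged (zero gradient).
for m0 in (0.8, 0.9, 1.0, 1.1):
    print(f"theta = ({m0:+.1f}, -1.0)  empirical error = {error_count((m0, -1.0)):.3f}")
```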

ECE 8527: Lecture 33, Slide 3 Minimum Error Rate Estimation (cont.) In the standard MCE formulation, the misclassification measure is e_i(x, θ) = −d_i(x, θ) + [ (1/(s−1)) Σ_{j≠i} d_j(x, θ)^η ]^{1/η}, where η is a positive constant that controls how we weight the competing classes. When η → ∞, the competing-class term approaches max_{j≠i} d_j(x, θ); η = 1 implies the average score over all competing classes is used. To transform this into a smooth, differentiable function, we use a sigmoid to embed e_i(x) in a smooth zero-one function: l_i(x, θ) = 1 / (1 + exp(−γ e_i(x, θ))), where γ > 0 controls the steepness; l_i(x, θ) essentially represents a soft recognition error count. The recognizer’s loss function can then be defined as L(x, θ) = Σ_{i=1}^{s} l_i(x, θ) δ(x ∈ ω_i), where δ() is a Kronecker delta function. Since only the term corresponding to the correct class of x is nonzero, we can optimize l_i(x, θ) instead of L(x, θ). We can rarely solve this analytically and instead must use gradient descent: θ_{t+1} = θ_t − ε_t ∇L(x, θ)|_{θ=θ_t}. Gradient descent makes MCE, like MMIE, more computationally expensive.
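A minimal numerical sketch of these formulas is given below, assuming the same toy 1-D Gaussian discriminants as above and made-up constants η, γ, and learning rate. Because log-likelihood scores are negative here, the competing-class term uses the common log-sum-exp variant (which has the same η → ∞ behavior) rather than the power mean, and the gradient is approximated by finite differences rather than an analytic GPD derivation.

```python
# Minimal sketch (standard MCE/GPD flavor, not the lecture's exact code):
# the misclassification measure e_i, the sigmoid loss l_i, and a gradient
# step computed by finite differences. Constants and models are made up.
import numpy as np

ETA, GAMMA, LR = 2.0, 1.0, 0.1   # eta: competitor weighting, gamma: sigmoid slope

def mce_loss(theta, X, y):
    """Average soft error count of l_i(x, theta) over a labeled sample set."""
    means = np.asarray(theta)
    d = -0.5 * (X[:, None] - means[None, :]) ** 2        # d_i(x, theta), shape (n, s)
    n, s = d.shape
    d_true = d[np.arange(n), y]                           # score of the correct class
    mask = np.ones_like(d, dtype=bool)
    mask[np.arange(n), y] = False
    competitors = d[mask].reshape(n, s - 1)
    # The power-mean form needs positive scores; log-likelihoods are negative,
    # so we use the log-sum-exp variant with the same eta -> max limit.
    comp = np.log(np.mean(np.exp(ETA * competitors), axis=1)) / ETA
    e = -d_true + comp                                    # misclassification measure
    l = 1.0 / (1.0 + np.exp(-GAMMA * e))                  # smooth zero-one embedding
    return l.mean()

def gpd_step(theta, X, y, eps=1e-5):
    """One descent step using a numerical (finite-difference) gradient."""
    theta = np.asarray(theta, float)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta); step[k] = eps
        grad[k] = (mce_loss(theta + step, X, y) - mce_loss(theta - step, X, y)) / (2 * eps)
    return theta - LR * grad

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(+1, 1, 200), rng.normal(-1, 1, 200)])
y = np.concatenate([np.zeros(200, int), np.ones(200, int)])
theta = np.array([0.2, -0.2])                             # deliberately poor class means
for it in range(5):
    theta = gpd_step(theta, X, y)
    print(f"iter {it}: theta = {np.round(theta, 3)}, loss = {mce_loss(theta, X, y):.4f}")
```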

ECE 8527: Lecture 33, Slide 4 Minimum Error Rate Estimation (cont.) The gradient descent approach is often referred to as generalized probabilistic descent (GPD). A simpler, more pragmatic approach to discriminative training is corrective training: a labeled training set is used to train the parameters using MLE; a list of confusable or near-miss classes is generated (using N-best lists or lattices); the parameters of the models corresponding to the correct classes are reinforced, while the parameters of the confusable classes are decremented. This is easy to do in Baum-Welch training because the “counts” are accumulated. Corrective training was introduced first and later generalized into MMIE and MCE. Neural networks offer an alternative to these functional approaches to discriminative training. Discriminative training has been applied to every aspect of the pattern recognition problem, including feature extraction, frame-level statistical modeling, and adaptation. Minimization of error in a closed-loop training paradigm is very powerful.
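The count-update flavor of corrective training can be sketched as follows. This is an illustrative toy, not the algorithm from the cited references; the class names, the learning constant kappa, and the floor value are all invented.

```python
# Minimal sketch (illustrative only): corrective-training style count updates.
# Counts accumulated for the correct class are reinforced; counts for a
# near-miss (confusable) class are decremented, with a floor so the
# reestimated probabilities remain valid. Names and constants are made up.
import numpy as np

def corrective_update(counts, correct, near_miss, kappa=0.5, floor=1e-3):
    """counts: dict class -> accumulated occupancy counts (e.g., from Baum-Welch)."""
    counts = {c: np.asarray(v, float).copy() for c, v in counts.items()}
    counts[correct]   += kappa * counts[correct]            # reinforce the correct model
    counts[near_miss] -= kappa * counts[near_miss]          # suppress the confusable model
    counts[near_miss]  = np.maximum(counts[near_miss], floor)
    return counts

# Hypothetical accumulated counts for three classes over four mixture components.
counts = {"cat": np.array([4.0, 2.0, 1.0, 0.5]),
          "bat": np.array([3.0, 2.5, 0.8, 0.2]),
          "hat": np.array([1.0, 1.0, 1.0, 1.0])}

# Suppose "cat" was the reference label and "bat" appeared in the N-best list.
updated = corrective_update(counts, correct="cat", near_miss="bat")
for c, v in updated.items():
    print(c, np.round(v / v.sum(), 3))    # renormalize counts into probabilities
```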

ECE 8527: Lecture 33, Slide 5 Summary Reviewed basic concepts of entropy and mutual information from information theory. Discussed the use of these concepts in pattern recognition, particularly the goal of minimizing conditional entropy. Demonstrated the relationship between minimization of conditional entropy and maximization of mutual information. Derived a method of training that maximizes mutual information (MMIE) and discussed its implementation within an HMM framework. Introduced an alternative to MMIE that directly minimizes errors (MCE) and discussed approaches to MCE using gradient descent. In practice, MMIE and MCE produce similar results. These methods carry the usual concerns about generalization: can models trained on one particular corpus or task transfer to another task? In practice, generalization has not been robust across widely varying applications or channel conditions. In controlled environments, performance improvements are statistically significant but not overwhelmingly impressive.