Presented by: Fang-Hui Chu
Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition
Fei Sha*, Lawrence K. Saul
University of Pennsylvania; *University of California

2 Outline
Introduction
Large margin mixture models
 – Large margin classification
 – Mixture models
Extensions
 – Handling of outliers
 – Segmental training
Experimental results
Conclusion

3 Introduction (1/3)
Much of the acoustic-phonetic modeling in ASR is handled by GMMs, but ML estimation of GMMs does not directly optimize recognition performance
 – It is therefore interesting to develop alternative learning paradigms that optimize discriminative measures of performance
Large margin GMMs have many parallels to support vector machines (SVMs), but use ellipsoids to model classes instead of half-spaces
(Figures from Wikipedia and MathWorld)

4 Introduction (2/3)
SVMs currently provide state-of-the-art performance in pattern recognition
 – The simplest setting for SVMs is binary classification, which computes the linear decision boundary that maximizes the margin of correct classification
For various reasons, it can be challenging to apply SVMs to large problems in multiway, as opposed to binary, classification
 – First, to apply the kernel trick, one must construct a large kernel matrix with as many rows and columns as training examples
 – Second, the training complexity increases with the number of classes

5 Introduction (3/3)
As in SVMs, the proposed approach is based on the idea of margin maximization
Model parameters are trained discriminatively to maximize the margin of correct classification, as measured in terms of Mahalanobis distance
The required optimization is convex over the model's parameter space of positive semidefinite matrices and can be performed efficiently

6 Large margin mixture models (1/6)
The simplest large margin GMM represents each class of labeled examples by a single ellipsoid
Each ellipsoid is parameterized by:
 – a vector "centroid" μ_c
 – a positive semidefinite "orientation" matrix Ψ_c
 – these are analogous to the means and inverse covariance matrices of multivariate Gaussians
 – in addition, a nonnegative scalar offset θ_c for each class is used in the scoring procedure
Thus, the triple (μ_c, Ψ_c, θ_c) represents the examples in class c

7 Large margin mixture models (2/6)
We label an example x by whichever ellipsoid has the smallest Mahalanobis distance (plus offset) to its centroid:
    y = argmin_c { (x − μ_c)^T Ψ_c (x − μ_c) + θ_c }    (1)
The goal of learning is to estimate the parameters (μ_c, Ψ_c, θ_c) for each class of labeled examples that optimize the performance of this decision rule
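A minimal Python sketch of this decision rule, assuming the per-class parameters (centroids mu, orientation matrices psi, offsets theta) have already been estimated; all names here are illustrative, not from the paper:

    import numpy as np

    def classify(x, mu, psi, theta):
        # Score each class by (x - mu_c)^T Psi_c (x - mu_c) + theta_c and pick the smallest (eq. 1).
        scores = []
        for mu_c, psi_c, theta_c in zip(mu, psi, theta):
            d = x - mu_c
            scores.append(d @ psi_c @ d + theta_c)
        return int(np.argmin(scores))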

8 Large margin mixture models (3/6)
It's useful to collect the ellipsoid parameters of each class in a single enlarged positive semidefinite matrix:
    Φ_c = [ Ψ_c           −Ψ_c μ_c
            −μ_c^T Ψ_c    μ_c^T Ψ_c μ_c + θ_c ]    (2)
We can then rewrite the decision rule in eq. (1) as simply:
    y = argmin_c { z^T Φ_c z },  where z = [x; 1]    (3)
The goal of learning is simply to estimate the single matrix Φ_c for each class of labeled examples
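A short sketch, under the same illustrative naming, of how the enlarged matrix Φ_c in eq. (2) can be assembled so that the augmented vector z = [x; 1] reproduces the score of eq. (1):

    import numpy as np

    def enlarged_matrix(mu_c, psi_c, theta_c):
        # Phi_c is built so that z^T Phi_c z = (x - mu_c)^T Psi_c (x - mu_c) + theta_c with z = [x; 1].
        top = np.hstack([psi_c, -psi_c @ mu_c[:, None]])
        bottom = np.hstack([-mu_c[None, :] @ psi_c,
                            [[mu_c @ psi_c @ mu_c + theta_c]]])
        return np.vstack([top, bottom])

    def score(x, phi_c):
        z = np.append(x, 1.0)      # augmented feature vector
        return z @ phi_c @ z       # scoring rule of eq. (3)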

9 Large margin mixture models (4/6)

10 Large margin classification (5/6)
In more detail, let {(x_n, y_n)}, n = 1, …, N, denote a set of N labeled examples drawn from C classes, where x_n ∈ R^d and y_n ∈ {1, 2, …, C}
 – In large margin GMMs, we seek matrices Φ_c such that all the examples in the training set are correctly classified by a large margin, i.e., situated far from the decision boundaries that define competing classes
For the n-th example with class label y_n, this condition can be written as:
    z_n^T Φ_c z_n ≥ 1 + z_n^T Φ_{y_n} z_n,  for all c ≠ y_n    (4)
 – This states that for each competing class c ≠ y_n, the Mahalanobis distance (plus offset) to the c-th centroid exceeds the Mahalanobis distance (plus offset) to the target centroid by a margin of at least one unit

11 Large margin classification (6/6)
We adopt a convex loss function for training large margin GMMs
 – Letting [f]_+ = max(0, f) denote the so-called "hinge" function:
    L = Σ_n Σ_{c ≠ y_n} [ 1 + z_n^T Φ_{y_n} z_n − z_n^T Φ_c z_n ]_+  +  γ Σ_c trace(Ψ_c)    (5)
 – The first term penalizes margin violations of eq. (4); the second term regularizes the matrices Φ_c
The loss function is a piecewise linear, convex function of the Φ_c, which are further constrained to be positive semidefinite
 – Its optimization can thus be formulated as a problem in semidefinite programming that can be generically solved by interior point algorithms with polynomial time guarantees
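A sketch of the loss in eq. (5) with the same illustrative names; gamma is an assumed regularization weight, and the trace regularizer here acts on the Ψ_c block of each Φ_c (the exact regularizer used in the paper may differ):

    import numpy as np

    def hinge(f):
        return max(0.0, f)                          # [f]_+ = max(0, f)

    def large_margin_loss(Z, y, phi, gamma=1.0):
        # Z: list of augmented vectors z_n; y: class labels; phi: list of Phi_c matrices.
        loss = 0.0
        for z_n, y_n in zip(Z, y):
            target = z_n @ phi[y_n] @ z_n
            for c in range(len(phi)):
                if c != y_n:
                    loss += hinge(1.0 + target - z_n @ phi[c] @ z_n)   # margin violations of eq. (4)
        loss += gamma * sum(np.trace(p[:-1, :-1]) for p in phi)        # regularization term
        return loss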

12 Mixture models (1/2)
Now, extend the previous model to represent each class by multiple ellipsoids
Let Φ_cm denote the matrix for the m-th ellipsoid in class c
Each example x_n has not only a class label y_n, but also a mixture component label m_n
 – These labels are obtained by fitting a GMM to the examples in each class by ML estimation, and then, for each example, computing the mixture component with the highest posterior probability under this GMM
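One plausible way to obtain the mixture component labels m_n, sketched with scikit-learn; the toolkit and parameter choices are assumptions, not taken from the paper:

    from sklearn.mixture import GaussianMixture

    def component_labels(X_by_class, n_components=2):
        # Fit a GMM per class by ML, then label each example with its most probable component.
        labels = []
        for X_c in X_by_class:                      # X_c: examples of one class, shape (n, d)
            gmm = GaussianMixture(n_components=n_components).fit(X_c)
            labels.append(gmm.predict(X_c))         # component with the highest posterior probability
        return labels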

13 Mixture models (2/2)
Given joint labels (y_n, m_n), we replace the large margin criterion in eq. (4) by:
    −log Σ_m exp(−z_n^T Φ_cm z_n) ≥ 1 + z_n^T Φ_{y_n m_n} z_n,  for all c ≠ y_n    (6)
 – To see this, note that min_m a_m ≥ −log Σ_m exp(−a_m), so eq. (6) is at least as strict as requiring the margin against every ellipsoid of a competing class (derivation sketched below)
We replace the loss function accordingly:
 – It is no longer piecewise linear, but remains convex
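The step behind eq. (6) is the standard softmin bound; a short derivation, writing a_m = z_n^T Φ_cm z_n for the competing class c:

    e^{-\min_m a_m} \;\le\; \sum_m e^{-a_m}
    \quad\Longrightarrow\quad
    \min_m a_m \;\ge\; -\log \sum_m e^{-a_m}

So if the softmin on the left-hand side of eq. (6) exceeds the target score by one unit, then the distance to every ellipsoid of class c does as well; the softmin also keeps the resulting loss convex in the Φ_cm.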

14 Extensions
Two further extensions of large margin GMMs are important for problems in ASR:
 – Handling of outliers
 – Segmental training
Handling of outliers
 – Many discriminative learning algorithms are sensitive to outliers
 – We adopt a simple strategy to detect outliers and reduce their harmful effect on learning

15 Handling of outliers (1/2)
Outliers are detected using ML estimates of the mean and covariance matrix of each class
 – These estimates are used to initialize matrices Φ_c^ML of the form in eq. (2)
For each example x_n, we compute the accumulated hinge loss incurred by violations of the large margin constraints:
    h_n = Σ_{c ≠ y_n} [ 1 + z_n^T Φ_{y_n}^ML z_n − z_n^T Φ_c^ML z_n ]_+    (7)
 – We associate outliers with large values of h_n

16 Handling of outliers (2/2)
In particular, correcting one badly misclassified outlier decreases the cost function proposed in eq. (5) more than correcting multiple examples that lie just barely on the wrong side of a decision boundary
We reweight the hinge loss terms in eq. (5) involving example x_n by a multiplicative factor of min(1, 1/h_n)
 – This reweighting equalizes the losses incurred by all initially misclassified examples
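A sketch of the reweighting factor; min(1, 1/h_n) is an assumption consistent with the "equalizes the losses" remark above, not a form quoted from the paper:

    def outlier_weight(h_n):
        # Examples with accumulated hinge loss h_n > 1 are down-weighted by 1/h_n,
        # so each initially misclassified example contributes roughly one unit of loss.
        return min(1.0, 1.0 / h_n) if h_n > 0 else 1.0

Each hinge term of eq. (5) involving x_n is then multiplied by this factor.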

17 Segmental training
We can relax margin constraints to apply, collectively, to multiple examples known to share the same class label
Let p index the l frames in the n-th phonetic segment
For segmental training, we rewrite the constraints as an average over the frames of the segment:
    (1/l) Σ_p z_np^T Φ_c z_np ≥ 1 + (1/l) Σ_p z_np^T Φ_{y_n} z_np,  for all c ≠ y_n    (8)

18 Experimental results (1/3)
We applied large margin GMMs to problems in phonetic classification and recognition on the TIMIT database
We mapped the 61 phonetic labels in TIMIT to 48 classes and trained ML and large margin GMMs for each class
Our front end computed MFCCs with 25 ms windows at a 10 ms frame rate
We retained the first 13 MFCC coefficients of each frame, along with their first and second time derivatives
GMMs modeled these 39-dimensional feature vectors after they were whitened by PCA
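A rough sketch of a comparable front end using librosa and scikit-learn; both libraries and the parameter choices are assumptions, since the slide specifies only the window, frame rate, and feature dimensions:

    import numpy as np
    import librosa
    from sklearn.decomposition import PCA

    def front_end(wav_path, sr=16000):
        # 13 MFCCs per 25 ms window at a 10 ms frame rate, plus first and second derivatives (39 dims).
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
        feats = np.vstack([mfcc,
                           librosa.feature.delta(mfcc),
                           librosa.feature.delta(mfcc, order=2)])
        return feats.T                              # shape: (n_frames, 39)

    # Whitening by PCA, fit on the pooled training frames:
    # pca = PCA(n_components=39, whiten=True).fit(train_frames)
    # whitened_frames = pca.transform(train_frames)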

19 Experimental results (2/3)
Phonetic classification
 – Here it is assumed that the speech has been correctly segmented into phonetic units, but the phonetic class label of each segment is unknown
 – The best large margin GMM yields a slightly lower classification error rate than the state-of-the-art result (21.7%) obtained by hidden conditional random fields
Phonetic recognition
 – The recognizers were first-order HMMs with one context-independent state per phonetic class
 – The large margin GMMs lead to consistently lower error rates

20 Experimental results (3/3)

21 Conclusion
We have shown how to learn GMMs for multiway classification based on principles similar to those of large margin classification in SVMs
Classes are represented by ellipsoids whose location, shape, and size are discriminatively trained to maximize the margin of correct classification, as measured in terms of Mahalanobis distance
In ongoing work:
 – Investigate the use of context-dependent phone models
 – Study schemes for integrating the large margin training of GMMs with sequence models such as HMMs and/or conditional random fields

22 Mahalanobis distance
In statistics, the Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936
 – It is based on correlations between variables, by which different patterns can be identified and analyzed
 – It is a useful way of determining the similarity of an unknown sample set to a known one
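A minimal numerical sketch of the Mahalanobis distance between a point and a distribution with known mean and covariance (made-up numbers, for illustration only):

    import numpy as np

    def mahalanobis(x, mean, cov):
        # sqrt( (x - mean)^T cov^{-1} (x - mean) )
        d = x - mean
        return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

    # Example: a point two units along an axis with variance 4 is only 1 "standard unit" away.
    # mahalanobis(np.array([2.0, 0.0]), np.zeros(2), np.diag([4.0, 1.0]))  ->  1.0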

23 Ellipsoid
Any quadratic form can be written in the form x^T A x
For example:
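An illustrative example of this identity (any symmetric A works; the numbers here are arbitrary):

    x_1^2 + 4x_1x_2 + 3x_2^2
    \;=\;
    \begin{bmatrix} x_1 & x_2 \end{bmatrix}
    \begin{bmatrix} 1 & 2 \\ 2 & 3 \end{bmatrix}
    \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}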