
1 Presented by: Fang-Hui Chu
Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition
Fei Sha*, Lawrence K. Saul (University of Pennsylvania; *University of California)

2 Outline
Introduction
Large margin mixture models
–Large margin classification
–Mixture models
Extensions
–Handling of outliers
–Segmental training
Experimental results
Conclusion

3 Introduction (1/3)
Much of the acoustic-phonetic modeling in ASR is handled by GMMs, but ML estimation of GMMs does not directly optimize recognition performance
–Therefore, it is interesting to develop alternative learning paradigms that optimize discriminative measures of performance
Large margin GMMs have many parallels to support vector machines (SVMs), but use ellipsoids to model classes instead of half-spaces
(Figures from Wikipedia and MathWorld)

4 Introduction (2/3)
SVMs currently provide state-of-the-art performance in pattern recognition
–The simplest setting for SVMs is binary classification, which computes the linear decision boundary that maximizes the margin of correct classification
For various reasons, it can be challenging to apply SVMs to large problems in multiway (as opposed to binary) classification
–First, to apply the kernel trick, one must construct a large kernel matrix with as many rows and columns as training examples
–Second, the training complexity increases with the number of classes

5 Introduction (3/3)
As in SVMs, the proposed approach is based on the idea of margin maximization
Model parameters are trained discriminatively to maximize the margin of correct classification, as measured in terms of Mahalanobis distance
The required optimization is convex over the model’s parameter space of positive semidefinite matrices and can be performed efficiently

6 Large margin mixture models (1/6)
The simplest large margin GMM represents each class of labeled examples by a single ellipsoid
Each ellipsoid is parameterized by
–a vector “centroid” μ_c
–a positive semidefinite “orientation” matrix Ψ_c
–these are analogous to the means and inverse covariance matrices of multivariate Gaussians
–in addition, a nonnegative scalar offset θ_c for each class is used in the scoring procedure
Thus, the triple (μ_c, Ψ_c, θ_c) represents the examples in class c

7 Large margin mixture models (2/6)
We label an example by whichever ellipsoid has the smallest Mahalanobis distance (plus offset) to its centroid, eq. (1) below
The goal of learning is to estimate the parameters of each class of labeled examples that optimize the performance of this decision rule
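The decision rule referenced above is not rendered in the transcript. A reconstruction of eq. (1), using the centroid μ_c, orientation matrix Ψ_c, and offset θ_c introduced on the previous slide (notation following the cited Sha and Saul paper), is:

\[
  y \;=\; \operatorname*{arg\,min}_{c} \Big\{ (x - \mu_c)^{\top} \Psi_c \,(x - \mu_c) + \theta_c \Big\}
  \tag{1}
\]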

8 Large margin mixture models (3/6)
It is useful to collect the ellipsoid parameters of each class in a single enlarged positive semidefinite matrix, eq. (2) below
We can then rewrite the decision rule in eq. (1) as the simpler eq. (3)
The goal of learning is then simply to estimate the single matrix Φ_c for each class of labeled examples
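Eqs. (2) and (3) are likewise missing from the transcript. A reconstruction based on the published paper, using the augmented vector z = [x; 1] so that z^T Φ_c z reproduces the Mahalanobis distance plus offset of eq. (1), is:

\[
  \Phi_c \;=\;
  \begin{bmatrix}
    \Psi_c & -\Psi_c \mu_c \\[2pt]
    -\mu_c^{\top} \Psi_c & \;\mu_c^{\top} \Psi_c \mu_c + \theta_c
  \end{bmatrix}
  \tag{2}
\]
\[
  z \;=\; \begin{bmatrix} x \\ 1 \end{bmatrix},
  \qquad
  y \;=\; \operatorname*{arg\,min}_{c} \; z^{\top} \Phi_c \, z
  \tag{3}
\]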

9 Large margin mixture models (4/6)

10 Large margin classification (5/6)
In more detail, let {(x_n, y_n)}, n = 1, …, N, denote a set of N labeled examples drawn from C classes, where x_n is a feature vector and y_n ∈ {1, …, C}
–In large margin GMMs, we seek matrices Φ_c such that all the examples in the training set are correctly classified by a large margin, i.e., situated far from the decision boundaries that define competing classes
For the n-th example with class label y_n, this condition can be written as eq. (4) below
–It states that for each competing class c ≠ y_n, the Mahalanobis distance (plus offset) to the c-th centroid exceeds the Mahalanobis distance (plus offset) to the target centroid by a margin of at least one unit
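Eq. (4) is not shown in the transcript. In the notation of eq. (3), the large margin condition it describes reads:

\[
  z_n^{\top} \Phi_c \, z_n \;\geq\; 1 + z_n^{\top} \Phi_{y_n} z_n
  \qquad \text{for all } c \neq y_n
  \tag{4}
\]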

11 Large margin classification (6/6)
We adopt a convex loss function for training large margin GMMs, eq. (5) below
–Letting [f]_+ = max(0, f) denote the so-called “hinge” function
–One term penalizes margin violations of eq. (4); the other regularizes the matrices Φ_c
The loss function is a piecewise linear, convex function of the Φ_c, which are further constrained to be positive semidefinite
–Its optimization can thus be formulated as a problem in semidefinite programming, which can be generically solved by interior point algorithms with polynomial time guarantees
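Eq. (5) itself does not survive in the transcript. A reconstruction consistent with the slide's annotations (a hinge penalty on violations of eq. (4) plus a regularizer on the matrices, here taken as the trace of the orientation blocks as in the published paper) is:

\[
  \mathcal{L} \;=\;
  \sum_{n} \sum_{c \neq y_n}
    \Big[\, 1 + z_n^{\top} \Phi_{y_n} z_n - z_n^{\top} \Phi_c z_n \,\Big]_{+}
  \;+\; \gamma \sum_{c} \operatorname{trace}(\Psi_c)
  \tag{5}
\]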

12 Mixture models (1/2)
Now, extend the previous model to represent each class by multiple ellipsoids
Let Φ_cm denote the matrix for the m-th ellipsoid in class c
Each example x_n has not only a class label y_n, but also a mixture component label m_n
–The component labels are obtained by fitting a GMM to the examples in each class by ML estimation, then, for each example, taking the mixture component with the highest posterior probability under this GMM (see the sketch below)
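As an illustration only (not the authors' code), the component-labeling step described above could be sketched with scikit-learn as follows; the number of mixture components and the covariance type are assumed parameters:

import numpy as np
from sklearn.mixture import GaussianMixture

def component_labels(X, y, n_components=2):
    """For each class, fit a GMM by ML (EM), then label every example in that
    class with its most probable mixture component under the fitted GMM."""
    m = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        gmm = GaussianMixture(n_components=n_components, covariance_type="full")
        gmm.fit(X[idx])                  # ML estimation of the per-class GMM
        m[idx] = gmm.predict(X[idx])     # argmax of the component posteriors
    return m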

13 Mixture models (2/2)
Given joint labels (y_n, m_n), we replace the large margin criterion in eq. (4) by eq. (6) below
–To see why eq. (6) suffices, note the softmin inequality spelled out after it
The loss function in eq. (5) is replaced accordingly
–It is no longer piecewise linear, but remains convex
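Eq. (6) and the supporting inequality are not rendered in the transcript. A reconstruction following the published paper is:

\[
  -\log \sum_{m} e^{-\,z_n^{\top} \Phi_{cm} z_n}
  \;\geq\; 1 + z_n^{\top} \Phi_{y_n m_n} z_n
  \qquad \text{for all } c \neq y_n
  \tag{6}
\]

Because the softmin on the left-hand side lower-bounds the true minimum,

\[
  \min_{m}\; z_n^{\top} \Phi_{cm} z_n \;\geq\; -\log \sum_{m} e^{-\,z_n^{\top} \Phi_{cm} z_n},
\]

eq. (6) implies that every ellipsoid of every competing class lies at least one unit further from z_n than the target ellipsoid Φ_{y_n m_n}.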

14 Extensions
Two further extensions of large margin GMMs are important for problems in ASR:
–Handling of outliers
–Segmental training
Handling of outliers
–Many discriminative learning algorithms are sensitive to outliers
–We adopt a simple strategy to detect outliers and reduce their harmful effect on learning

15 Handling of outliers (1/2)
Outliers are detected using ML estimates of the mean and covariance matrix of each class
–These estimates are used to initialize matrices Φ_c^ML of the form in eq. (2)
For each example x_n, we compute the accumulated hinge loss h_n incurred by violations of the large margin constraints, eq. (7) below
–Outliers are identified with large values of h_n
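Eq. (7) is missing from the transcript. A reconstruction using the ML-initialized matrices is (h_n denotes the accumulated hinge loss mentioned above):

\[
  h_n \;=\;
  \sum_{c \neq y_n}
    \Big[\, 1 + z_n^{\top} \Phi^{\mathrm{ML}}_{y_n} z_n - z_n^{\top} \Phi^{\mathrm{ML}}_{c} z_n \,\Big]_{+}
  \tag{7}
\]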

16 Handling of outliers (2/2)
In particular, correcting one badly misclassified outlier can decrease the cost function in eq. (5) more than correcting multiple examples that lie just barely on the wrong side of a decision boundary
We therefore reweight the hinge loss terms in eq. (5) involving example x_n by a multiplicative factor (see below)
–This reweighting equalizes the losses incurred by all initially misclassified examples
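The reweighting factor itself is not shown in the transcript; in the published paper it takes the form min(1, 1/h_n), which caps the initial contribution of every misclassified example at one unit:

\[
  w_n \;=\; \min\!\Big(1, \tfrac{1}{h_n}\Big),
  \qquad
  w_n \, h_n \;=\; \min(h_n, 1).
\]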

17 Segmental training
We can relax the margin constraints to apply, collectively, to multiple examples known to share the same class label
Let p index the l frames of the n-th phonetic segment
For segmental training, we rewrite the constraints as eq. (8) below
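Eq. (8) is not rendered in the transcript. A reconstruction, averaging the per-frame margin constraint of eq. (4) over the l frames z_{np} of the segment (as in the published paper), is:

\[
  \frac{1}{l} \sum_{p=1}^{l} z_{np}^{\top} \Phi_c \, z_{np}
  \;\geq\;
  1 + \frac{1}{l} \sum_{p=1}^{l} z_{np}^{\top} \Phi_{y_n} z_{np}
  \qquad \text{for all } c \neq y_n
  \tag{8}
\]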

18 Experimental results (1/3)
We applied large margin GMMs to problems in phonetic classification and recognition on the TIMIT database
We mapped the 61 phonetic labels in TIMIT to 48 classes and trained ML and large margin GMMs for each class
Our front end computed MFCCs with 25 ms windows at a 10 ms frame rate
We retained the first 13 MFCC coefficients of each frame, along with their first and second time derivatives
GMMs modeled these 39-dimensional feature vectors after they were whitened by PCA
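As an illustration only (not the authors' front end), a minimal sketch of this feature pipeline using librosa and scikit-learn might look like the following; the 25 ms window, 10 ms hop, 13 MFCCs, and delta features come from the slide, while the 16 kHz sampling rate and the library choices are assumptions:

import numpy as np
import librosa
from sklearn.decomposition import PCA

def timit_style_features(wav_path, sr=16000):
    """13 MFCCs plus first and second time derivatives per 10 ms frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms frame rate
    )
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T   # shape: (num_frames, 39)

def whiten(all_training_frames):
    """PCA-whiten the pooled 39-dimensional training frames."""
    pca = PCA(whiten=True)
    return pca.fit_transform(all_training_frames), pca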

19 Experimental results (2/3)
Phonetic classification
–The speech is assumed to have been correctly segmented into phonetic units, but the phonetic class label of each segment is unknown
–The best large margin GMM yields a slightly lower classification error rate than the state-of-the-art result (21.7%) obtained by hidden conditional random fields
Phonetic recognition
–The recognizers were first-order HMMs with one context-independent state per phonetic class
–The large margin GMMs lead to consistently lower error rates

20 Experimental results (3/3)

21 Conclusion
We have shown how to learn GMMs for multiway classification based on principles similar to large margin classification in SVMs
Classes are represented by ellipsoids whose location, shape, and size are discriminatively trained to maximize the margin of correct classification, as measured in terms of Mahalanobis distance
In ongoing work
–Investigate the use of context-dependent phone models
–Study schemes for integrating the large margin training of GMMs with sequence models such as HMMs and/or conditional random fields

22 Mahalanobis distance
In statistics, the Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936
–It is based on correlations between variables, by which different patterns can be identified and analysed
–It is a useful way of determining the similarity of an unknown sample set to a known one
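For reference (a standard definition, not taken from the transcript): for a sample x and a distribution with mean μ and covariance Σ,

\[
  D_M(x) \;=\; \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}.
\]

In the large margin GMMs above, the orientation matrices Ψ_c play the role of the inverse covariances Σ^{-1}.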

23 Ellipsoid
Any quadratic form can be written in the form x^T A x, with A symmetric
For example (see below):
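The slide's worked example is not rendered in the transcript; an illustrative example (not necessarily the one on the original slide) of writing a quadratic form as x^T A x with A symmetric is:

\[
  2x_1^2 + 6x_1 x_2 + 5x_2^2
  \;=\;
  \begin{bmatrix} x_1 & x_2 \end{bmatrix}
  \begin{bmatrix} 2 & 3 \\ 3 & 5 \end{bmatrix}
  \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.
\]

Since this A is positive definite (2 > 0 and 2·5 − 3·3 = 1 > 0), its level sets x^T A x = const are ellipses, which is why positive semidefinite matrices Φ_c define ellipsoidal class regions.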

