
Soft Margin Estimation for Speech Recognition
Main Reference: Jinyu Li, "Soft Margin Estimation for Automatic Speech Recognition," PhD thesis, Georgia Institute of Technology, School of Electrical and Computer Engineering, 2008.
Yueng-Tien Lo
Speech Lab, Computer Science and Information Engineering, National Taiwan Normal University

Outline
- Background (chap. 1, 2)
- Soft margin estimation (chap. 3)
  1. J. Li, M. Yuan, and C.-H. Lee, "Soft margin estimation of hidden Markov model parameters," Proc. Interspeech, 2006 (Best Student Paper).
  2. J. Li, S. M. Siniscalchi, and C.-H. Lee, "Approximate test risk minimization through soft margin estimation," Proc. IEEE ICASSP, pp. IV-653–IV-656, 2007.
  3. J. Li, M. Yuan, and C.-H. Lee, "Approximate test risk bound minimization through soft margin estimation," IEEE Trans. on Audio, Speech, and Language Proc., vol. 15, no. 8, 2007.
- SME for LVCSR (chap. 7)
  1. J. Li, Z. Yan, C.-H. Lee, and R.-H. Wang, "A study on soft margin estimation for LVCSR," Proc. IEEE ASRU, 2007.
  2. J. Li, Z. Yan, C.-H. Lee, and R.-H. Wang, "Soft margin estimation with various separation levels for LVCSR," Proc. Interspeech, 2008.
- The relationship between the margin and HMM parameters (chap. 6)
- Conclusion (chap. 8)

Background of Automatic Speech Recognition
- Feature Extraction
- Acoustic Modeling: characterizes the likelihood of the acoustic features with respect to (w.r.t.) the underlying word sequence.
- Language Modeling: provides a way to estimate the probability of a possible word sequence.
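The acoustic model and language model combine through the standard Bayes decision rule for ASR. A minimal reconstruction of the usual formulation (not shown on the original slide), where X is the acoustic observation sequence and W a candidate word sequence:

```latex
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \; p(X \mid W)\, P(W)
```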

Empirical Risk (1/2)
- The purpose of classification and recognition is usually to minimize classification errors on a representative testing set by constructing a classifier f (modeled by the parameter set Λ) based on a set of training samples.
- X is the observation space, Y is the label space, and N is the number of training samples.
- It is convenient to assume that there is a density p(x, y) corresponding to the distribution P(x, y) and to replace the distribution with that density.
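A hedged reconstruction of the quantities this slide refers to, following the standard statistical-learning formulation the thesis builds on (the exact slide notation is not recoverable): training samples are drawn from X × Y, and the expected risk integrates a loss function l over the joint distribution P:

```latex
\{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times \mathcal{Y},
\qquad
R(\Lambda) = \int l\bigl(f(x;\Lambda), y\bigr)\,\mathrm{d}P(x, y)
```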

Empirical Risk (2/2)
- Finally, the empirical risk is minimized instead of the intractable expected risk (see the reconstruction below).
- Most learning methods focus on how to minimize this empirical risk.
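The empirical risk that replaces the intractable expected risk is the sample average of the loss; a standard-form reconstruction, since the slide's equation image is missing:

```latex
R_{\mathrm{emp}}(\Lambda) = \frac{1}{N} \sum_{i=1}^{N} l\bigl(f(x_i;\Lambda), y_i\bigr)
```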

Conventional Discriminative Training
- Maximum Mutual Information Estimation (MMIE)
- Minimum Classification Error (MCE)
- Minimum Word/Phone Error (MWE/MPE)

Test Risk Bound
- Optimal performance on the training set does not guarantee optimal performance on the testing set (statistical learning theory).
- The risk on the testing set (i.e., the test risk) is bounded as shown below.
- v is the VC dimension that characterizes the complexity of a classifier function group G; a VC dimension of v means that at least one set of v (or fewer) samples can be found such that G shatters them.
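The bound referred to on this slide is the standard Vapnik generalization bound; a reconstruction of its usual form, which holds with probability at least 1 − η, where v is the VC dimension and N the number of training samples:

```latex
R(\Lambda) \;\le\; R_{\mathrm{emp}}(\Lambda)
  + \sqrt{\frac{v\left(\ln\frac{2N}{v} + 1\right) - \ln\frac{\eta}{4}}{N}}
```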

Margin-Based Methods in Automatic Speech Recognition
- Large Margin Estimation (LME)
- Large Margin Gaussian Mixture Model and Hidden Markov Model
- Soft Margin Estimation (SME)

Large Margin Estimation (1/2)
- For a speech utterance X_i, LME defines a multi-class separation margin d(X_i).
- For all utterances in a training set D, LME defines a subset of utterances S (see the reconstruction below), where ε is a preset positive number.
- S is called a support vector set, and each utterance in S is called a support token.
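A hedged reconstruction of the two missing definitions in the usual LME notation: the multi-class separation margin of an utterance X_i with correct label y_i and model set {λ_j}, and the support token set S drawn from the training set D (ε is the preset positive number mentioned above):

```latex
d(X_i) = \log p\bigl(X_i \mid \lambda_{y_i}\bigr) - \max_{j \neq y_i} \log p\bigl(X_i \mid \lambda_{j}\bigr),
\qquad
S = \bigl\{ X_i \in D \;:\; 0 \le d(X_i) \le \varepsilon \bigr\}
```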

Large Margin Estimation (2/2)
- LME estimates the HMM models based on the criterion of maximizing the minimum margin over all support tokens (see below); large-margin HMMs can equivalently be estimated through a constrained optimization over the support token set.
- The research focus of LME is to use different optimization methods to solve this problem.
- One potential weakness of LME is that it updates models only with accurately classified samples. However, it is well known that misclassified samples are also critical for classifier learning.
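The maximin criterion described in the first bullet can be written as follows (a reconstruction in the usual LME form):

```latex
\tilde{\Lambda} = \arg\max_{\Lambda} \; \min_{X_i \in S} \, d(X_i)
```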

Soft Margin Estimation
- The objective of this research is to design a discriminative training method that makes direct use of the successful soft margin idea from support vector machines to improve generalization capability, and of decision feedback learning from discriminative training to enhance model separation in classifier design. The proposed method is called soft margin estimation (SME).
- If the training set matches the testing set well, discriminative training methods usually achieve very good performance in testing.
- SVMs cannot be easily adopted for ASR modeling: they do not work directly on temporal sequences and cannot handle hidden states.
- SME combines the advantages of HMMs and SVMs for ASR.

Approximate Test Risk Bound Minimization Through Soft Margin Estimation (1/2)
- The bound of the test risk (shown earlier) is the sum of the empirical risk and a generalization term.
- As a monotonically increasing function of the VC dimension v, the generalization term cannot be directly minimized because of the difficulty of computing v.
- It can be shown that v is bounded by a decreasing function of the margin; hence, the generalization term can be reduced by increasing the margin.
- This suggests two objectives: one is to minimize the empirical risk; the other is to maximize the margin.

Approximate Test Risk Bound Minimization Through Soft Margin Estimation (2/2)
- For separable cases, the margin is defined as the minimum distance between the decision boundary and the samples nearest to it.
- For non-separable cases, the soft margin can be considered as the distance between the decision boundary (solid line) and the class boundary (dotted line).
- There is no exact margin for the inseparable classification task, since different balance coefficients will result in different margin values.

Loss Function Definition
- What is the loss function here? It measures not a recognition error but a shift.
- A margin is used to secure some generalization in classifier learning: if the mismatch between training and testing causes a shift smaller than this margin, a correct decision can still be made.
- The SME objective function is reconstructed below.
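A hedged reconstruction of the SME objective function mentioned at the end of the slide, following the thesis's general form: λ balances the soft margin ρ against the empirical risk, and the loss is a hinge-style penalty on samples whose separation d(X_i) falls inside the margin (the symbol names are my own labeling):

```latex
\min_{\Lambda,\,\rho}\; L_{\mathrm{SME}}(\Lambda,\rho)
  = \frac{\lambda}{\rho}
  + \frac{1}{N} \sum_{i=1}^{N} \bigl(\rho - d(X_i;\Lambda)\bigr)\,
    \mathbb{I}\bigl(d(X_i;\Lambda) \le \rho\bigr)
```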

Separation Measure Definition (1/2)
- A separation (misclassification) measure is used to maximize the distance between the correct and competing hypotheses.
- A common choice is the log likelihood ratio (LLR): if the separation is greater than 0, the classification is correct; otherwise, a wrong decision is made.
- For every utterance, we select the frames that have different HMM model labels in the target and competing strings.
- A more precise model separation measure is the average of those frame LLRs (reconstructed below).
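A hedged reconstruction of the frame-averaged LLR separation measure: F_i denotes the set of frames whose HMM labels differ between the target string S_i and the competing string Ŝ_i, and x_{it} is the t-th frame of utterance X_i:

```latex
d(X_i) = \frac{1}{|F_i|} \sum_{t \in F_i}
  \log \frac{p\bigl(x_{it} \mid \lambda_{S_i}\bigr)}{p\bigl(x_{it} \mid \lambda_{\hat{S}_i}\bigr)}
```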

Separation Measure Definition (2/2)
- For use in SME, the normalized LLR may be more discriminative.
- The optimization function of SME then includes both utterance selection and frame selection (a sketch follows this slide).
- For example, for frame selection, the frame set can be defined as a subset of frames more critical for discriminating HMM models, instead of equally choosing the distinct frames as in the current study.
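To make the two selection mechanisms concrete, here is a small illustrative Python sketch (my own construction, not code from the thesis) that computes an SME-style objective with frame selection (only frames whose labels differ) and utterance selection (only utterances whose separation falls inside the margin):

```python
import numpy as np

def sme_style_loss(frame_llrs, frame_masks, rho, lam):
    """Illustrative SME-style objective (sketch only, not the thesis implementation).

    frame_llrs:  list of 1-D arrays; per-utterance frame-level log-likelihood
                 ratios between the target and the best competing string.
    frame_masks: list of boolean arrays; True for frames whose HMM labels differ
                 between the two strings (frame selection).
    rho:         preset soft margin.
    lam:         coefficient balancing margin maximization and empirical risk.
    """
    losses = []
    for llr, mask in zip(frame_llrs, frame_masks):
        if not mask.any():            # no discriminative frames -> skip utterance
            continue
        d = llr[mask].mean()          # frame-averaged separation d(X_i)
        if d <= rho:                  # utterance selection: only samples inside the margin
            losses.append(rho - d)    # hinge-style margin loss
    empirical = float(np.mean(losses)) if losses else 0.0
    return lam / rho + empirical      # soft-margin term + empirical-risk term

# Example usage with toy numbers:
llrs  = [np.array([0.2, -0.1, 0.4]), np.array([1.5, 1.2])]
masks = [np.array([True, True, False]), np.array([True, True])]
print(sme_style_loss(llrs, masks, rho=1.0, lam=0.5))
```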

Separation Measures for SME
- Separations corresponding to MMIE, MCE, and MPE can also be defined.

Solutions to SME (1/2)
- Solution 1: jointly optimize the soft margin and the HMM parameters.
- The indicator function is approximated with a sigmoid function, which has a smoothing parameter.
- We need to preset the balance coefficient, which trades off soft margin maximization against empirical risk minimization.
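A minimal sketch (my own illustration, with hypothetical parameter names) of the sigmoid-smoothed indicator and the resulting objective, which is differentiable in both the margin rho and the per-utterance separations that depend on the HMM parameters:

```python
import numpy as np

def smoothed_indicator(d, rho, gamma=10.0):
    """Sigmoid approximation of the indicator I(d <= rho).

    gamma is the smoothing parameter: larger gamma -> closer to the hard indicator.
    """
    return 1.0 / (1.0 + np.exp(gamma * (d - rho)))

def joint_sme_objective(separations, rho, lam, gamma=10.0):
    """Smoothed SME objective, differentiable in rho and in the separations d(X_i)."""
    d = np.asarray(separations, dtype=float)
    hinge = (rho - d) * smoothed_indicator(d, rho, gamma)  # smoothed margin loss per utterance
    return lam / rho + hinge.mean()

# Toy check with a few separation values:
print(joint_sme_objective([0.3, 1.2, 2.0], rho=1.0, lam=0.5))
```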

Solutions to SME (2/2)
- Solution 2: preset the soft margin and optimize the HMM parameters.
- For a fixed balance coefficient, there is one corresponding margin as the final solution; with the margin fixed, only the samples with separation smaller than the margin need to be considered.
- Assuming a subset of utterances satisfies this condition, we can minimize the empirical margin loss over that subset subject to the margin constraint.
- This problem can be solved with the generalized probabilistic descent (GPD) algorithm, iteratively working on the training set.
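A schematic sketch of this second solution under stated assumptions: the margin rho is fixed, only utterances whose separation falls below rho contribute, and the parameters (abstracted here as a generic vector) are updated with a GPD-style decaying-step gradient loop. The helpers separation_fn and grad_fn are hypothetical placeholders standing in for the HMM-based computations:

```python
import numpy as np

def sme_gpd(params, separation_fn, grad_fn, rho, steps=200, step0=0.1):
    """Schematic GPD-style loop for SME with a preset margin rho.

    separation_fn(params)        -> 1-D array of per-utterance separations d(X_i)
    grad_fn(params, active_mask) -> gradient of the margin loss over the active utterances
    """
    params = np.asarray(params, dtype=float)
    for t in range(steps):
        d = separation_fn(params)
        active = d < rho                  # only utterances inside the margin contribute
        if not active.any():              # nothing left to correct -> stop early
            break
        step = step0 / (1.0 + t)          # decaying step size, as is typical for GPD
        params = params - step * grad_fn(params, active)
    return params

# Tiny usage example with dummy stand-ins for the HMM-based helpers:
sep = lambda p: np.array([p[0] - 0.5, p[0] + 0.2])
grd = lambda p, m: np.array([-float(m.sum())])
print(sme_gpd(np.array([0.0]), sep, grd, rho=1.0, steps=10))
```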

Margin-Based Methods Comparison

Experimental Setup
- The proposed SME framework was evaluated on the TIDIGITS connected-digit task.
- The hidden Markov model toolkit (HTK) was first used to build the baseline MLE HMMs.
- SME models were initialized with the MLE models; this is in clear contrast to LME models, which are typically built upon well-performing MCE models.

Experiments: Two Different Solutions of SME
- The column labeled SME presets the soft margin values.
- The results demonstrate that the two proposed solutions are nearly equivalent because of the mapping relationship between the balance coefficient and the margin.

Experiments
- Although the initial string accuracies (obtained from the MLE models) were very different, all SME models ended up with nearly the same accuracy of 99.99%.

The Histogram of Separation Distance
- Figure panels: 1-mixture MLE, 1-mixture SME, and 16-mixture SME models.
- The larger the separation value, the better the models are.

Comparison of the Generalization Capability of SME with MLE and MCE
- Figure panels: 16-mix model and 1-mix model.
- MCE has the right-most tail, which means that MCE has a better separation for the samples in its right tail. But SME outperforms MCE in the 1-mixture case. Why?
- SME achieves a significantly better separation than both MLE and MCE on the testing set (right-most curve) because of direct model separation maximization and better generalization.

Comparison of GPD Optimization and Quickprop Optimization for SME
- The best string accuracy (99.39%) is obtained by Quickprop-optimized SME in the 16-mixture case.
- Usually, GPD-optimized SME converges after around 200 iterations, while SME with Quickprop takes about 50 iterations.

SME: Conclusion
- SME is a novel discriminative training method that achieves both high accuracy and good model generalization.
- From the view of statistical learning theory, we show that SME can minimize the approximate risk bound on the test set.
- The framework allows a choice of various loss functions and different kinds of separation measures.
- Frame and utterance selection are integrated into a unified framework to select the training utterances and frames critical for discriminating competing models.