Ch 5b: Discriminative Training (temporal model)
14.2.2002, Ilkka Aho

Abbreviations
MCE = Minimum Classification Error
MMI = Maximum Mutual Information
STLVQ = Shift-Tolerant Learning Vector Quantization
TDNN = Time-Delay Neural Network
HMM = Hidden Markov Model
DP = Dynamic Programming
DTW = Dynamic Time Warping
GPD = Generalized Probabilistic Descent
PBMEC = Prototype-Based Minimum Error Classifier

Basics
Prototype-based methods use class representatives (a sample, or an average of samples) to classify new patterns.
The MCE framework is used for discriminative training (MMI is also possible).
A central concern is designing or learning prototypes that yield good classification performance.
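To make the prototype idea concrete, here is a minimal sketch of a nearest-prototype classifier. It is illustrative only: the function names, and the choice of class means as prototypes, are assumptions rather than anything from the chapter.

```python
import numpy as np

def train_prototypes(X, y):
    """One prototype per class: the mean of that class's training samples.
    (Illustrative choice; any class representative would do.)"""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(x, prototypes):
    """Assign x to the class whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda c: float(np.linalg.norm(x - prototypes[c])))
```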

STLVQ for Speech Recognition
The LVQ algorithm in its basic form is a method for static pattern recognition.
STLVQ handles a stream of dynamically varying patterns (fig. 1).
STLVQ is much simpler than the TDNN model, but yielded very good results on the same phoneme recognition tasks.
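For reference, a minimal sketch of the basic LVQ1 update that STLVQ builds on. This shows plain LVQ, not the shift-tolerant variant itself; the names and the learning rate are illustrative.

```python
import numpy as np

def lvq1_step(x, label, protos, proto_labels, lr=0.05):
    """One LVQ1 update: attract the winning prototype to x when its
    class matches the sample's label, repel it otherwise.
    protos: (N, D) array of reference vectors; proto_labels: (N,) array."""
    i = int(np.argmin(np.linalg.norm(protos - x, axis=1)))  # winner
    sign = 1.0 if proto_labels[i] == label else -1.0
    protos[i] += sign * lr * (x - protos[i])
    return protos
```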

Figure 1. STLVQ system architecture.

Limitations and Strengths of STLVQ
STLVQ assumes only a single phoneme as an input token.
Training and testing datasets are obtained from manually labeled speech databases.
How can phoneme recognition be extended to word or sentence recognition?
LVQ is applied locally.

Expanding the Scope of LVQ for Speech Recognition
Representation of longer speech sequences, such as entire utterances
Global optimization
Application to continuous speech recognition
A need for some kind of time warping or normalization
How can the discriminative power of LVQ be merged with the sequential modeling abilities of HMMs?
Two methods: LVQ-HMM (fig. 2) and HMM-LVQ (fig. 3)

Figure 2. LVQ-HMM architecture.

Figure 3. HMM-LVQ architecture.

MCE Interpretation of LVQ
LVQ can be seen as a prototype-based implementation of the MCE framework.
The LVQ classification rule is based on the Euclidean distance between a pattern vector and each category's reference vectors.
The category of the nearest reference vector is given as the classification decision.
Figures 4, 5, and 6 demonstrate the smoothness of the MCE loss.
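A sketch of the smoothed MCE loss that figures 4-6 refer to, assuming distance-based discriminants and the slope parameter a from the figures; the function names are illustrative. The misclassification measure compares the distance to the nearest correct prototype with the distance to the nearest rival, and a sigmoid turns it into a differentiable surrogate for the zero-one loss.

```python
import numpy as np

def mce_loss(x, label, protos, proto_labels, a=1.0):
    """Smoothed MCE loss for one sample in a distance-based classifier.
    d > 0 means the nearest rival prototype is closer than the nearest
    correct one, i.e. the sample would be misclassified."""
    dists = np.linalg.norm(protos - x, axis=1)
    d_correct = dists[proto_labels == label].min()
    d_rival = dists[proto_labels != label].min()
    d = d_correct - d_rival               # misclassification measure
    return 1.0 / (1.0 + np.exp(-a * d))   # sigmoidal loss with slope a
```

A small a gives a nearly flat loss (as in figure 5), while a larger a approaches the zero-one loss (figure 6).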

Figure 4. Average empirical loss measured over 10 samples from a one-dimensional, two-class classification problem. The ideal zero-one loss is used in calculating the overall loss.

Figure 5. Now a sigmoidal MCE loss, a = 0.1, is used in calculating the overall loss.

Figure 6. The same situation as in figure 5, except that a = 1.0.

Prototype-based Methods Using DP
DP is used to find the path through a grid of local matches between prototype and test sample frames that has the best overall score.
When calculating the reference distance between the input utterance and the reference utterance, it is more practical to use the top path (or the top few paths) than every possible DP path.
Prototypes are nonlinearly compressed and stretched in time.
DTW is a specific application of DP techniques to speech processing.
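A minimal DTW sketch along these lines (an illustration, assuming Euclidean frame distances and the standard step pattern): DP fills a grid of cumulative costs so that the single best warping path between a prototype and a test utterance can be scored without enumerating all paths.

```python
import numpy as np

def dtw_distance(ref, test):
    """Cumulative-cost DTW between two frame sequences (2-D arrays,
    one frame per row), using local Euclidean frame distances and
    diagonal / vertical / horizontal steps."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - test[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]  # score of the single best DP path
```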

MCE-Trained Prototypes and DTW
The idea is to define the MCE loss in terms of a discriminant function that reflects the structure of a straightforward DTW-based recognizer.
The loss function has to be continuous and differentiable so that a gradient-based optimization technique (for example GPD) can be used to minimize the overall loss.
The loss function also has to reflect classification performance.
Good results have been reported on the Bell Labs E-set task and on phoneme recognition tasks.
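One common way to meet the differentiability requirement, sketched here as an illustration rather than the chapter's exact formulation, is to replace hard minimum operations (nearest prototype, best path) with a smooth soft-min whose parameter eta controls how closely it approximates the true minimum.

```python
import numpy as np

def soft_min(values, eta=10.0):
    """Differentiable surrogate for min(values); approaches the true
    minimum as eta grows. The shift by v.min() is only for numerical
    stability and cancels in the weighted average."""
    v = np.asarray(values, dtype=float)
    w = np.exp(-eta * (v - v.min()))
    return float(np.sum(v * w) / np.sum(w))
```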

PBMEC
PBMEC models prototypes at a finer grain than MCE-trained DTW.
PBMEC prototypes are modeled within phonetic or subphonetic states.
Word models are formed by connecting different states together.
Multi-state PBMEC (fig. 7)
The discriminant function for a category is defined as the final accumulated score of the best DP path for that category (fig. 8).
The MCE-GPD update rule for PBMEC pulls the nearest reference vectors for the correct category closer to the input and pushes the nearest reference vectors for the incorrect category away.
MCE-GPD in the context of speech recognition using phoneme models (fig. 9)
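A sketch of the pull/push behavior of this MCE-GPD update, reduced to a single frame for clarity. It is illustrative: the actual PBMEC update distributes the gradient along the best DP path, and this version uses a simplified distance gradient.

```python
import numpy as np

def mce_gpd_step(x, label, protos, proto_labels, lr=0.05, a=1.0):
    """One frame-level MCE-GPD update: pull the nearest correct
    reference vector toward x and push the nearest rival away, both
    weighted by the slope of the sigmoidal loss at the current sample.
    protos: (N, D) array; proto_labels: (N,) array."""
    dists = np.linalg.norm(protos - x, axis=1)
    correct = np.where(proto_labels == label)[0]
    rival = np.where(proto_labels != label)[0]
    i = correct[np.argmin(dists[correct])]   # nearest correct prototype
    j = rival[np.argmin(dists[rival])]       # nearest rival prototype
    d = dists[i] - dists[j]                  # misclassification measure
    l = 1.0 / (1.0 + np.exp(-a * d))
    w = a * l * (1.0 - l)                    # sigmoid slope at d
    protos[i] += lr * w * (x - protos[i])    # pull correct closer
    protos[j] -= lr * w * (x - protos[j])    # push rival away
    return protos
```

Note that samples far from the decision boundary get a near-zero weight w, so the update concentrates on confusable patterns.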

Figure 7. Multi-state PBMEC architecture.

Figure 8. Final DP score.

Figure 9. DP segmentations for the words “aida” and “taira”.

HMM design based on MCE
HMMs have a prototype-like nature.
The MCE framework can be applied to HMMs in much the same way as to the PBMEC model.
HMM state likelihood and discriminant function
MCE misclassification measure and loss
Calculating the MCE gradient for HMMs
There are a very large number of applications of MCE-trained HMMs.
Some of the best context-independent results have been reported on the Texas Instruments-Massachusetts Institute of Technology (TIMIT) database.
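A sketch of the discriminant and misclassification measure this slide refers to, under stated assumptions: the Viterbi best-path log-likelihood serves as the discriminant g_k for category k, the initial-state distribution is taken as uniform (its log term is dropped), and the names log_A and log_b are illustrative.

```python
import numpy as np

def viterbi_score(log_A, log_b):
    """Best-path log-likelihood of an observation sequence under one HMM,
    used as the discriminant g_k(X). log_A: (S, S) log transition matrix;
    log_b: (T, S) per-frame log state likelihoods (emissions already
    evaluated on the observations)."""
    T, S = log_b.shape
    delta = log_b[0].copy()  # uniform start assumed; log-prior dropped
    for t in range(1, T):
        delta = log_b[t] + np.max(delta[:, None] + log_A, axis=0)
    return float(np.max(delta))

def mce_measure(scores, label, eta=5.0):
    """MCE misclassification measure: soft-max of the rival categories'
    scores minus the correct category's score; d > 0 flags an error."""
    g = np.asarray(scores, dtype=float)
    rivals = np.delete(g, label)
    m = rivals.max()  # shift for numerical stability
    g_rival = m + np.log(np.mean(np.exp(eta * (rivals - m)))) / eta
    return float(g_rival - g[label])
```

The same sigmoidal loss as in the LVQ case can then be applied to d, and its gradient propagated into the HMM parameters.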

Homework Question
Explain the main differences between the following methods in speech recognition:
STLVQ
Prototype-based DP (the DTW technique)
HMM design based on MCE