Minimum Phoneme Error Based Heteroscedastic Linear Discriminant Analysis for Speech Recognition. Bing Zhang and Spyros Matsoukas, BBN Technologies.

Presentation transcript:

Minimum Phoneme Error Based Heteroscedastic Linear Discriminant Analysis for Speech Recognition. Bing Zhang and Spyros Matsoukas, BBN Technologies. Presented by Shih-Hung Liu, 2006/05/16.

2 Outline
– Review of PCA, LDA, and HLDA
– Introduction to MPE-HLDA
– Experimental results
– Conclusions

3 Review - PCA
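The PCA equations on this slide are not included in the transcript. As a reminder (a standard formulation, not copied from the slide), PCA chooses the p x n orthonormal projection whose rows are the leading eigenvectors of the global covariance T of the original features, i.e.

\hat{A} = \arg\max_{A A^{\top} = I} \operatorname{tr}\left( A\, T\, A^{\top} \right),

which retains the directions of maximum total variance but makes no use of class labels.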

4 Review - LDA
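The LDA equations are likewise missing from the transcript. The standard Fisher criterion this slide reviews selects the projection that maximizes between-class scatter relative to within-class scatter,

\hat{A} = \arg\max_{A} \frac{\left| A\, S_b\, A^{\top} \right|}{\left| A\, S_w\, A^{\top} \right|},

where S_b and S_w are the between-class and within-class scatter matrices. This implicitly assumes that all classes share the same (pooled) within-class covariance.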

5 Review - HLDA
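The HLDA equations are also not reproduced in the transcript. In the usual maximum-likelihood formulation (Kumar and Andreou), a full n x n transform A is estimated, with its first p rows spanning the useful subspace and the remaining n - p rows the nuisance subspace, by maximizing

F(A) = N \log \left| \det A \right| - \frac{N}{2} \log \left| \operatorname{diag}\!\left( A_{[n-p]}\, T\, A_{[n-p]}^{\top} \right) \right| - \sum_{j} \frac{N_j}{2} \log \left| \operatorname{diag}\!\left( A_{[p]}\, W_j\, A_{[p]}^{\top} \right) \right|,

where T is the global covariance, W_j and N_j are the covariance and frame count of class j, and N = \sum_j N_j. Unlike LDA, each class keeps its own (diagonalized) covariance in the retained subspace.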

6

7 Introduction for MPE-HLDA
In speech recognition systems, feature analysis is usually employed to improve classification accuracy and to control model complexity. In recent years, extensions to classical LDA have been widely adopted. Among them, HDA seeks to remove the equal-variance constraint imposed by LDA, while ML-HLDA takes the HMM structure (e.g., diagonal-covariance Gaussian mixture state distributions) into consideration.

8 Introduction for MPE-HLDA
Despite the differences among the above techniques, they share some common limitations. First, none of them assumes any prior knowledge of confusable hypotheses, so the features they select can be suboptimal for recognition. Second, their objective functions are not directly related to the WER. For example, we found that HLDA could select totally non-discriminant features while still improving its objective function, by mapping all training samples to a single point in space along some dimensions.

9 Introduction
LDA and HLDA
– Better classification accuracy
– Some common limitations: none of them assumes any prior knowledge of confusable hypotheses, and their objective functions do not directly relate to the word error rate (WER)
– As a result, it is often unknown whether the selected features will do well in testing just by looking at the values of the objective functions
Minimum Phoneme Error (MPE)
– Minimizes phoneme errors in a lattice-based training framework
– Since this criterion is closely related to WER, MPE-HLDA tends to be more robust than other projection methods, which makes it potentially better suited for a wider variety of features

10 MPE-HLDA
The MPE-HLDA model: MPE-HLDA aims at minimizing the expected number of phoneme errors introduced by the MPE-HLDA model in a given hypothesis lattice, or equivalently at maximizing the function.
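The objective function itself is not reproduced in the transcript. The standard MPE criterion that MPE-HLDA maximizes over the projection A (and model parameters \lambda) has the form

F(A, \lambda) = \sum_{r} \sum_{W \in \mathcal{L}_r} P_{A,\lambda}\!\left( W \mid O_r \right) \operatorname{PhoneAcc}\!\left( W, W_r \right),

where \mathcal{L}_r is the hypothesis lattice of the r-th training utterance O_r, P_{A,\lambda}(W \mid O_r) is the (scaled) posterior of hypothesis W computed with the projected features A x_t, and PhoneAcc(W, W_r) is the raw phone accuracy of W against the reference W_r. This is a standard statement of the MPE criterion, not a verbatim copy of the paper's Eq. (4).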

11 MPE-HLDA

12 MPE-HLDA It can be shown that the derivative of (4) with respect to A is
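The derivative is likewise missing from the transcript. Schematically, and only as a sketch of its structure rather than the paper's exact expression, the chain rule gives

\frac{\partial F}{\partial A} = \sum_{r,t} \sum_{j} \gamma^{\mathrm{MPE}}_{j,r}(t)\, \frac{\partial}{\partial A} \log \mathcal{N}\!\left( A x_{r,t} ;\, \mu_j, \Sigma_j \right),

where the MPE posteriors \gamma^{\mathrm{MPE}}_{j,r}(t) come from the lattice forward-backward pass, weighted by how each arc's phone accuracy differs from the lattice average, and the Gaussian parameters \mu_j, \Sigma_j are themselves functions of A through their ML updates.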

13 MPE-HLDA

14 MPE-HLDA
Therefore, Eq. (6) can be rewritten as annotated on the slide (matrix dimensions 39x39 and 39x162).

15 MPE-HLDA Implementation
In theory, the derivative of the MPE-HLDA objective function can be computed based on Eq. (12), via a single forward-backward pass over the training lattices. In practice, however, it is not possible to fit all the full covariance matrices in memory. The computation is therefore split into two steps:
– First, run a forward-backward pass over the training lattices to accumulate the required statistics
– Second, use these statistics together with the full covariance matrices to synthesize the derivative
The paper uses gradient descent to update the projection matrix.

16 MPE-HLDA Overview

17 MPE-HLDA Overview

18 Implementation
1. Initialize the feature projection matrix by LDA or HLDA, and initialize the MPE-HLDA model
2. Set
3. Compute covariance statistics in the original feature space
– (a) Do a maximum likelihood update of the MPE-HLDA model in the feature space defined by
– (b) Do single-pass retraining using to generate and in the original feature space
4. Optimize the feature projection matrix:
– (a) Set
– (b) Project and using to get the model in the reduced subspace
– (c) Run a forward-backward pass on the lattices using to compute , and
– (d) Use the , and statistics from 4(c) to compute the MPE derivative
– (e) Update to using gradient descent
– (f) Set , and go to 4(b) unless converged
5. Optionally, set and go to 3
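A minimal Python sketch of this optimization loop is given below. The statistics accumulation, model projection, and lattice forward-backward machinery referred to on the slides are abstracted behind caller-supplied callables; all names (ml_update, project_model, lattice_forward_backward, mpe_gradient) and the fixed-step gradient update are illustrative assumptions, not the authors' implementation.

import numpy as np

def mpe_hlda_optimize(A_init, ml_update, project_model, lattice_forward_backward,
                      mpe_gradient, n_outer=2, n_inner=10, step=1e-3):
    """Sketch of the MPE-HLDA training loop (steps 3-5 of the slide).

    The expensive machinery is abstracted as callables:
      ml_update(A)                       -> (model, full_cov_stats)       # step 3
      project_model(model, stats, A)     -> reduced_model                 # step 4(b)
      lattice_forward_backward(model, A) -> mpe_posteriors                # step 4(c)
      mpe_gradient(posteriors, stats, A) -> dF/dA, same shape as A        # step 4(d)
    """
    A = np.asarray(A_init, dtype=float).copy()
    for _ in range(n_outer):                          # step 5: optional outer loop
        model, full_cov_stats = ml_update(A)          # step 3: ML update and covariance stats
        for _ in range(n_inner):                      # step 4: optimize the projection
            reduced_model = project_model(model, full_cov_stats, A)   # 4(a)-(b)
            posteriors = lattice_forward_backward(reduced_model, A)   # 4(c)
            grad = mpe_gradient(posteriors, full_cov_stats, A)        # 4(d)
            A = A + step * grad  # 4(e): ascent on the MPE gain (descent on phone errors)
    return A

In the paper's setup, these callables would wrap the single-pass retraining and lattice forward-backward code, and the fixed step size would be replaced by whatever gradient-descent schedule the authors actually used.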

19 Experiments
– DARPA EARS research project
– CTS: 800/2300 hours for ML training, 370 hours of held-out data for MPE-HLDA training
– BN: 600 hours from Hub4 and TDT for ML training, 330 hours of held-out data for MPE-HLDA estimation
– Features: PLP (15-dim) plus 1st, 2nd, and 3rd derivative coefficients (60-dim)
– Test set: EARS 2003 evaluation set

20 Experiments

21 Experiments

22 Conclusions
We have taken a first look at a new feature analysis method, MPE-HLDA. The experiments show that it is effective in reducing recognition errors, and that it is more robust than other commonly used analysis methods such as LDA and HLDA.