Statistical Models for Automatic Speech Recognition Lukáš Burget

Feature extraction Preprocessing of the speech signal to satisfy the needs of the subsequent recognition process (dimensionality reduction, preserving only the “important” information, decorrelation). Popular features are MFCCs: modifications based on psycho-acoustic findings applied to short-time spectra. For convenience, we will use one-dimensional features in most of our examples (e.g. short-time energy).
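
As an illustration of the one-dimensional feature used in the examples, here is a minimal NumPy sketch of short-time log-energy; the function name, frame length and shift are illustrative assumptions, not values from the slides.

```python
import numpy as np

def short_time_log_energy(signal, frame_len=400, frame_shift=160):
    """Short-time log-energy: one scalar feature per frame.
    frame_len/frame_shift correspond to 25 ms / 10 ms at 16 kHz (assumed)."""
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    feats = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * frame_shift : i * frame_shift + frame_len]
        feats[i] = np.log(np.sum(frame ** 2) + 1e-10)  # small floor avoids log(0)
    return feats
```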

Classifying single speech frame [Figure: class-conditional feature distributions for the classes “unvoiced” and “voiced”.]

Classifying single speech frame [Figure: class-conditional distributions for “unvoiced” and “voiced”.] Mathematically, we ask the following question:

P(voiced | x) > P(unvoiced | x)

But the value we can read from a probability distribution is p(x | class). According to Bayes rule, the above can be rewritten as:

p(x | voiced) P(voiced) / p(x) > p(x | unvoiced) P(unvoiced) / p(x)

Multi-class classification [Figure: class-conditional distributions for “unvoiced”, “voiced”, and “silence”.] The class that is correct with the highest probability is given by:

argmax_ω P(ω | x) = argmax_ω p(x | ω) P(ω)

But we do not know the true distributions, …
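
A minimal sketch of this decision rule, assuming each class-conditional density is a single Gaussian with already-estimated parameters; the class priors, means and variances are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

# Illustrative (assumed) per-class parameters: prior, mean, std of the 1-D feature
classes = {
    "unvoiced": (0.3, -1.0, 0.8),
    "voiced":   (0.5,  2.0, 1.0),
    "silence":  (0.2, -3.0, 0.5),
}

def classify(x):
    # argmax over classes of log p(x | class) + log P(class)
    return max(classes, key=lambda c: norm.logpdf(x, classes[c][1], classes[c][2])
                                      + np.log(classes[c][0]))

print(classify(1.5))  # -> "voiced" for these assumed parameters
```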

Estimation of parameters [Figure: training examples for the classes “unvoiced”, “voiced”, and “silence”.] … we only see some training examples.

Estimation of parameters [Figure: parametric distributions fitted to the training examples of “unvoiced”, “voiced”, and “silence”.] … we only see some training examples. Let’s choose some parametric model (e.g. a Gaussian distribution) and estimate its parameters from the data.

Maximum Likelihood Estimation In the next part, we will use ML estimation of model parameters:

Θ̂_class^(ML) = argmax_Θ ∏_{x_i ∈ class} p(x_i | Θ)

This allows us to estimate the parameters Θ of each class individually, given the data for that class. Therefore, for convenience, we can omit the class identities in the following equations. The models we are going to examine are:
– Single Gaussian
– Gaussian Mixture Model (GMM)
– Hidden Markov Model (HMM)
We want to solve three fundamental problems:
– Evaluation of the model (computing the likelihood of features given the model)
– Training the model (finding ML estimates of its parameters)
– Finding the most likely values of the hidden variables

Gaussian distribution (1 dimension)
Evaluation:
N(x; μ, σ²) = 1 / (σ √(2π)) · exp(−(x − μ)² / (2σ²))
ML estimates of parameters (Training):
μ = (1/T) Σ_{t=1}^{T} x(t)
σ² = (1/T) Σ_{t=1}^{T} (x(t) − μ)²
No hidden variables.
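
A minimal NumPy sketch of the two operations on this slide: training (ML estimates of μ and σ²) and evaluation (log-likelihood of a data point).

```python
import numpy as np

def train_gaussian(x):
    """ML estimates of mean and variance from 1-D training data x(1)..x(T)."""
    mu = np.mean(x)
    var = np.mean((x - mu) ** 2)   # ML estimate divides by T, not T-1
    return mu, var

def gaussian_loglik(x, mu, var):
    """log N(x; mu, var), evaluated elementwise."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

data = np.random.randn(1000) * 1.5 + 2.0   # synthetic training data
mu, var = train_gaussian(data)
print(mu, var, gaussian_loglik(0.0, mu, var))
```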

Gaussian distribution (2 dimensions)
N(x; μ, Σ) = 1 / √((2π)^P |Σ|) · exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)),   where P is the dimensionality of x.

Gaussian Mixture Model
Evaluation:
p(x | Θ) = Σ_c P_c N(x; μ_c, σ_c²),   where Σ_c P_c = 1
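
A minimal sketch of GMM evaluation, summing over the components in the log domain for numerical stability; the three-component parameters are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def gmm_loglik(x, weights, means, variances):
    """log p(x | Theta) for a 1-D GMM: log sum_c P_c N(x; mu_c, var_c)."""
    log_comp = (np.log(weights)
                - 0.5 * np.log(2 * np.pi * variances)
                - 0.5 * (x - means) ** 2 / variances)
    return logsumexp(log_comp)

# Illustrative 3-component GMM
w  = np.array([0.5, 0.3, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
v  = np.array([1.0, 0.5, 2.0])
print(gmm_loglik(0.3, w, mu, v))
```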

Gaussian Mixture Model
Evaluation:
p(x | Θ) = Σ_c P_c N(x; μ_c, σ_c²)
We can see the sum above just as a function defining the shape of the probability density function, or we can see it as a more complicated generative probabilistic model, from which features are generated as follows:
– One of the Gaussian components is first randomly selected according to the prior probabilities P_c
– The feature vector is then generated from the selected Gaussian distribution
For the evaluation, however, we do not know which component generated the input vector (the identity of the component is a hidden variable). Therefore, we marginalize: we sum over all the components, weighting them by their prior probabilities.
Why do we want to complicate our lives with this concept?
– It allows us to apply the EM algorithm to GMM training
– We will need this concept for HMMs
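
A minimal sketch of this generative view, sampling data points from the GMM exactly as described (reusing the illustrative parameters above):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(weights, means, variances, n=1):
    """Generate n samples: pick a component by its prior, then draw from it."""
    comps = rng.choice(len(weights), size=n, p=weights)          # hidden variable
    return rng.normal(means[comps], np.sqrt(variances[comps]))   # observed feature

print(sample_gmm(np.array([0.5, 0.3, 0.2]),
                 np.array([-2.0, 0.0, 3.0]),
                 np.array([1.0, 0.5, 2.0]), n=5))
```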

Training GMM – Viterbi training
An intuitive, approximate iterative algorithm for training GMM parameters:
1. Using the current model parameters, let the Gaussians classify the data as if the Gaussians were different classes (even though both the data and all the components correspond to the one class modeled by the GMM).
2. Re-estimate the parameters of each Gaussian using the data associated with it in the previous step.
3. Repeat the previous two steps until the algorithm converges.
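
A minimal sketch of Viterbi (hard-assignment) GMM training for 1-D data, assuming some initial parameters are given; this is just one way to write the loop described above, not the exact code from the lecture.

```python
import numpy as np

def viterbi_train_gmm(x, weights, means, variances, n_iter=10):
    """Hard-assignment ('Viterbi') training of a 1-D GMM."""
    for _ in range(n_iter):
        # 1. Each frame is claimed by the Gaussian with the highest weighted likelihood
        log_comp = (np.log(weights)
                    - 0.5 * np.log(2 * np.pi * variances)
                    - 0.5 * (x[:, None] - means) ** 2 / variances)
        assign = np.argmax(log_comp, axis=1)
        # 2. Re-estimate each Gaussian from the frames assigned to it
        for c in range(len(weights)):
            xc = x[assign == c]
            if len(xc) > 0:
                means[c] = xc.mean()
                variances[c] = xc.var() + 1e-6
                weights[c] = len(xc) / len(x)
    return weights, means, variances
```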

Training GMM – EM algorithm
Expectation-Maximization is a very general tool applicable in many cases where we deal with unobserved (hidden) data. Here, we only show the result of its application to the problem of re-estimating the parameters of a GMM. It is guaranteed to increase the likelihood of the training data in every iteration; however, it is not guaranteed to find the global optimum. The algorithm is very similar to the Viterbi training presented above, only instead of hard decisions it uses the “soft” posterior probabilities of the Gaussians (given the old model) as weights, and weighted averages are used to compute the new mean and variance estimates:

γ_c(t) = P_c N(x(t); μ̂_c^(old), σ̂²_c^(old)) / Σ_c P_c N(x(t); μ̂_c^(old), σ̂²_c^(old))

μ̂_c^(new) = Σ_{t=1}^{T} γ_c(t) x(t) / Σ_{t=1}^{T} γ_c(t)

σ̂²_c^(new) = Σ_{t=1}^{T} γ_c(t) (x(t) − μ̂_c^(new))² / Σ_{t=1}^{T} γ_c(t)
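
A minimal sketch of one EM iteration for a 1-D GMM, computing the responsibilities γ_c(t) and the weighted re-estimation formulas above:

```python
import numpy as np
from scipy.special import logsumexp

def em_step_gmm(x, weights, means, variances):
    """One EM iteration for a 1-D GMM; returns updated parameters."""
    # E-step: responsibilities gamma[t, c] = P(component c | x(t), old model)
    log_comp = (np.log(weights)
                - 0.5 * np.log(2 * np.pi * variances)
                - 0.5 * (x[:, None] - means) ** 2 / variances)
    gamma = np.exp(log_comp - logsumexp(log_comp, axis=1, keepdims=True))
    # M-step: weighted averages replace the hard counts of Viterbi training
    counts = gamma.sum(axis=0)
    new_means = gamma.T @ x / counts
    new_vars = (gamma * (x[:, None] - new_means) ** 2).sum(axis=0) / counts
    new_weights = counts / len(x)
    return new_weights, new_means, new_vars
```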

Classifying a stationary sequence [Figure: a sequence of frames scored against the “unvoiced”, “voiced”, and “silence” models.]

P(X | class) = ∏_{x_i ∈ X} p(x_i | class)

Frame-independence assumption
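
Under the frame-independence assumption, the sequence log-likelihood is just the sum of per-frame log-likelihoods; a minimal sketch that plugs in any per-frame scorer (e.g. the gmm_loglik sketch above with fixed parameters):

```python
import numpy as np

def sequence_loglik(X, frame_loglik):
    """log P(X | class) = sum over frames x_i of log p(x_i | class)."""
    return float(np.sum([frame_loglik(x) for x in X]))

# e.g. sequence_loglik(features, lambda x: gmm_loglik(x, w, mu, v))
```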

Modeling more general sequences: Hidden Markov Models
[Figure: 3-state left-to-right HMM with transition probabilities a_11, a_12, a_22, a_23, a_33, a_34 and output distributions b_1(x), b_2(x), b_3(x).]
Generative model: for each frame, the model moves from one state to another according to a transition probability a_ij and generates a feature vector from the probability distribution b_j(·) associated with the state that was entered.
When evaluating such a model, we do not see which path through the states was taken. Let’s start by evaluating the HMM for a particular state sequence.

Evaluating HMM for a particular state sequence
[Figure: the 3-state HMM unrolled over 5 frames for one particular path through the states.]
P(X, S | Θ) = b_1(x_1) b_1(x_2) b_2(x_3) b_3(x_4) b_3(x_5) · a_11 a_12 a_23 a_33

The joint likelihood of the observed sequence X and the state sequence S can be decomposed as follows:

P(X, S | Θ) = P(X | S, Θ) P(S | Θ)

P(S | Θ) = a_11 a_12 a_23 a_33 a_34 is the prior probability of the hidden variable, the state sequence S. For the GMM, the corresponding term was P_c.

P(X | S, Θ) = b_1(x_1) b_1(x_2) b_2(x_3) b_3(x_4) b_3(x_5) is the likelihood of the observed sequence X given the state sequence S. For the GMM, the corresponding term was N(x; μ_c, σ_c²).

Putting them together:

P(X, S | Θ) = b_1(x_1) a_11 b_1(x_2) a_12 b_2(x_3) a_23 b_3(x_4) a_33 b_3(x_5) a_34
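
A minimal sketch of this joint likelihood for a fixed state sequence, computed in the log domain; log_b stands for any per-state frame scorer (e.g. a per-state Gaussian), and all inputs are assumed shapes, not the lecture's code.

```python
import numpy as np

def joint_loglik(X, states, log_A, log_b):
    """log P(X, S | Theta) for a fixed state sequence `states` (0-based).
    log_A[i, j] = log a_ij, log_b(j, x) = log b_j(x).
    The final exit transition (a_34 on the slide) is omitted for simplicity."""
    ll = 0.0
    for t, (x, s) in enumerate(zip(X, states)):
        ll += log_b(s, x)
        if t + 1 < len(states):
            ll += log_A[s, states[t + 1]]   # transition to the next state
    return ll
```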

Evaluating HMM (for any state sequence)
Since we do not know the underlying state sequence, we must marginalize: compute and sum the likelihoods over all possible paths

P(X | Θ) = Σ_S P(X, S | Θ)
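
The sum over all paths can be computed efficiently with the forward algorithm; a minimal sketch in the log domain (initial probabilities are passed explicitly and the exit transition is ignored, as simplifying assumptions):

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglik(log_b, log_A, log_pi):
    """log P(X | Theta) = log sum_S P(X, S | Theta).
    log_b: (T, N) per-frame, per-state log-likelihoods log b_j(x_t)
    log_A: (N, N) log transition probabilities, log_pi: (N,) log initial probabilities."""
    T, N = log_b.shape
    log_alpha = log_pi + log_b[0]                       # alpha_j(1)
    for t in range(1, T):
        log_alpha = logsumexp(log_alpha[:, None] + log_A, axis=0) + log_b[t]
    return logsumexp(log_alpha)                         # sum over final states
```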

Finding the best (Viterbi) path

P(X | Θ) ≈ max_S P(X, S | Θ)
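
A minimal sketch of the Viterbi recursion, keeping the best log-score and backpointers so the best state sequence can be recovered (same assumed inputs as the forward sketch):

```python
import numpy as np

def viterbi(log_b, log_A, log_pi):
    """Best path: max_S P(X, S | Theta) and the corresponding state sequence."""
    T, N = log_b.shape
    delta = log_pi + log_b[0]
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A             # (from_state, to_state)
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(N)] + log_b[t]
    best_last = int(np.argmax(delta))
    path = [best_last]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return float(delta[best_last]), path[::-1]
```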

Training HMMs – Viterbi training
Similar to the approximate training we have already seen for GMMs:
1. For each training utterance, find the Viterbi path through the HMM, which associates feature frames with states.
2. Re-estimate each state distribution using the feature frames associated with it.
3. Repeat steps 1 and 2 until the algorithm converges.
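
A minimal sketch of one Viterbi-training pass for an HMM with a single Gaussian per state, reusing the viterbi() sketch above; transition re-estimation is omitted for brevity, and the function name is an assumption.

```python
import numpy as np

def viterbi_train_step(X, means, variances, log_A, log_pi):
    """One pass: align frames to states with Viterbi, then re-estimate state Gaussians."""
    log_b = (-0.5 * np.log(2 * np.pi * variances)
             - 0.5 * (X[:, None] - means) ** 2 / variances)   # (T, N)
    _, path = viterbi(log_b, log_A, log_pi)
    path = np.array(path)
    for s in range(len(means)):
        xs = X[path == s]
        if len(xs) > 0:
            means[s] = xs.mean()
            variances[s] = xs.var() + 1e-6
    return means, variances
```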

Training HMMs using EM
[Figure: trellis showing the forward probability α_s(t) and backward probability β_s(t) meeting at state s and time t.]

γ_s(t) = α_s(t) β_s(t) / P(X | Θ^(old))

μ̂_s^(new) = Σ_{t=1}^{T} γ_s(t) x(t) / Σ_{t=1}^{T} γ_s(t)

σ̂²_s^(new) = Σ_{t=1}^{T} γ_s(t) (x(t) − μ̂_s^(new))² / Σ_{t=1}^{T} γ_s(t)
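
A minimal sketch of the state-occupation posteriors γ_s(t) computed with the forward-backward recursions in the log domain, followed by the weighted mean/variance updates above (single-Gaussian states, same assumed inputs as before):

```python
import numpy as np
from scipy.special import logsumexp

def state_posteriors(log_b, log_A, log_pi):
    """gamma[t, s] = alpha_s(t) * beta_s(t) / P(X | Theta), in the log domain."""
    T, N = log_b.shape
    log_alpha = np.zeros((T, N))
    log_beta = np.zeros((T, N))
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0) + log_b[t]
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(log_A + log_b[t + 1] + log_beta[t + 1], axis=1)
    log_px = logsumexp(log_alpha[-1])                   # log P(X | Theta)
    return np.exp(log_alpha + log_beta - log_px)

def update_state_gaussians(X, gamma):
    """Weighted (soft) re-estimation of per-state means and variances."""
    counts = gamma.sum(axis=0)
    means = gamma.T @ X / counts
    variances = (gamma * (X[:, None] - means) ** 2).sum(axis=0) / counts
    return means, variances
```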

Isolated word recognition
[Figure: one HMM for the word YES and one for the word NO.]
Decide YES if:  p(X | YES) P(YES) > p(X | NO) P(NO)
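
A minimal sketch of the isolated-word decision, scoring the utterance with each word HMM (via the forward_loglik sketch above) and adding the log prior; the dictionary-based model container and priors are assumptions for illustration.

```python
def recognize_word(log_b_per_word, log_A_per_word, log_pi_per_word, log_priors):
    """Pick the word w maximizing log p(X | w) + log P(w)."""
    scores = {w: forward_loglik(log_b_per_word[w], log_A_per_word[w], log_pi_per_word[w])
                 + log_priors[w]
              for w in log_priors}
    return max(scores, key=scores.get)
```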

Connected word recognition
[Figure: decoding network connecting the word models YES and NO with a silence (sil) model.]

Phoneme-based models
[Figure: the word “yes” modeled by concatenating the phoneme HMMs y, eh, s.]

Using a language model – unigram
[Figure: decoding network with the phoneme-based word models one (w ah n), two (t uw), three (th r iy) and silence (sil), each word entered with its unigram probability P(one), P(two), P(three).]

Using a language model – bigram
[Figure: decoding network where the word models one, two, three (with silence models between words) are connected by bigram probabilities P(W2 | W1).]

Other basic ASR topics not covered by this presentation
– Context-dependent models
– Training phoneme-based models
– Feature extraction
  – Delta parameters
  – De-correlation of features
– Full-covariance vs. diagonal-covariance modeling
– Adaptation to the speaker or acoustic condition
– Language modeling
  – LM smoothing (back-off)
– Discriminative training (MMI or MPE)
– and so on