
1
Speech Technology Lab (bɜːmɪŋəm) — EEM4R Spoken Language Processing - Introduction
Training HMMs
Version 4: February 2005

2
Outline
Parameter estimation
Maximum Likelihood (ML) parameter estimation
ML for Gaussian PDFs
ML for HMMs – the Baum-Welch algorithm
HMM adaptation:
–MAP estimation
–MLLR

3
Discrete variables
Suppose that Y is a random variable which can take any value in a discrete set X = {x_1, x_2, …, x_M}.
Suppose that y_1, y_2, …, y_N are samples of the random variable Y.
If c_m is the number of times that y_n = x_m, then an estimate of the probability that Y takes the value x_m is given by:

P(Y = x_m) ≈ c_m / N

4
Discrete Probability Mass Function

Symbol            1    2    3    4    5    6    7    8    9  | Total
Num. Occurrences  120  231  90   87   63   57   156  203  91 | 1098
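The counts in the table can be turned into probability estimates with the relative-frequency rule P(x_m) ≈ c_m / N. A minimal sketch in Python, using the counts from the table:

```python
# Relative-frequency estimate of a discrete PMF: P(x_m) ≈ c_m / N.
counts = {1: 120, 2: 231, 3: 90, 4: 87, 5: 63, 6: 57, 7: 156, 8: 203, 9: 91}
total = sum(counts.values())  # 1098

pmf = {symbol: c / total for symbol, c in counts.items()}

# Most frequent symbol: 231/1098 ≈ 0.210
print(pmf[2])
```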

5
Continuous Random Variables
In most practical applications the data are not restricted to a finite set of values – they can take any value in N-dimensional space.
Simply counting the number of occurrences of each value is no longer a viable way of estimating probabilities…
…but there are generalisations of this approach which are applicable to continuous variables – these are referred to as non-parametric methods.

6
Continuous Random Variables
An alternative is to use a parametric model.
In a parametric model, probabilities are defined by a small set of parameters.
The simplest example is a normal, or Gaussian, model.
A Gaussian probability density function (PDF) is defined by two parameters:
–its mean μ, and
–its variance σ²

7
Gaussian PDF
‘Standard’ 1-dimensional Gaussian PDF:
–mean μ = 0
–variance σ² = 1

8
Gaussian PDF
[Figure: the area under the PDF between a and b gives P(a ≤ x ≤ b).]

9
Gaussian PDF
For a 1-dimensional Gaussian PDF p with mean μ and variance σ²:

p(y) = (1 / √(2πσ²)) · exp( −(y − μ)² / (2σ²) )

The factor 1/√(2πσ²) is a constant that ensures the area under the curve is 1; the exponential term defines the ‘bell’ shape.
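The formula can be evaluated directly; a small sketch (function name is mine):

```python
import math

def gauss_pdf(y, mu, var):
    """1-D Gaussian density: the 1/sqrt(2*pi*var) factor normalises the
    area under the curve to 1; the exponential gives the bell shape."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Peak of the 'standard' Gaussian (mu=0, var=1): 1/sqrt(2*pi) ≈ 0.3989
print(gauss_pdf(0.0, 0.0, 1.0))
```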

10
More examples
[Figure: Gaussian PDFs with variances σ² = 0.1, 1.0, 5.0 and 10.0 – larger variance gives a wider, flatter bell.]

11
Fitting a Gaussian PDF to Data
Suppose y = y_1, …, y_t, …, y_T is a sequence of T data values.
Given a Gaussian PDF p with mean μ and variance σ², define:

p(y | μ, σ²) = ∏_{t=1}^{T} p(y_t | μ, σ²)

How do we choose μ and σ² to maximise this probability?
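In practice the product above is computed as a sum of logs, since multiplying many densities underflows for long sequences. A minimal sketch (function name is mine):

```python
import math

def log_likelihood(ys, mu, var):
    """log p(y | mu, var) = sum_t log p(y_t | mu, var).
    Summing logs avoids the underflow the raw product would suffer."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (y - mu) ** 2 / var)
        for y in ys
    )

# A single standard-Gaussian observation at the mean: -0.5*log(2*pi) ≈ -0.919
print(log_likelihood([0.0], 0.0, 1.0))
```

Choosing μ and σ² to maximise this quantity is exactly the question posed on the slide.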

12
Fitting a Gaussian PDF to Data
[Figure: two Gaussians fitted to the same data – one a poor fit, one a good fit.]

13
Maximum Likelihood Estimation
Define the best-fitting Gaussian to be the one such that p(y | μ, σ²) is maximised.
Terminology:
–p(y | μ, σ²) as a function of y is the probability (density) of y
–p(y | μ, σ²) as a function of μ, σ² is the likelihood of μ, σ²
Maximising p(y | μ, σ²) with respect to μ, σ² is called Maximum Likelihood (ML) estimation of μ, σ².

14
ML estimation of μ, σ²
Intuitively:
–the maximum likelihood estimate of μ should be the average value of y_1, …, y_T (the sample mean)
–the maximum likelihood estimate of σ² should be the variance of y_1, …, y_T (the sample variance)
This turns out to be true: p(y | μ, σ²) is maximised by setting:

μ̂ = (1/T) Σ_{t=1}^{T} y_t        σ̂² = (1/T) Σ_{t=1}^{T} (y_t − μ̂)²
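The two estimates are just the sample mean and (biased, 1/T) sample variance; a direct sketch (function name is mine):

```python
def ml_estimates(ys):
    """ML fit of a 1-D Gaussian: sample mean and (biased, 1/T) sample variance."""
    T = len(ys)
    mu = sum(ys) / T
    var = sum((y - mu) ** 2 for y in ys) / T
    return mu, var

# Three points 1, 2, 3: mean 2.0, variance (1 + 0 + 1)/3 = 2/3
print(ml_estimates([1.0, 2.0, 3.0]))
```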

15
Proof
First note that maximising p(y) is the same as maximising log p(y):

log p(y | μ, σ²) = −(T/2) log(2πσ²) − (1/(2σ²)) Σ_t (y_t − μ)²

At a maximum, the derivative with respect to μ is zero:

∂/∂μ log p(y | μ, σ²) = (1/σ²) Σ_t (y_t − μ) = 0

Also Σ_t (y_t − μ) = (Σ_t y_t) − Tμ.
So Tμ = Σ_t y_t, i.e. μ̂ = (1/T) Σ_t y_t, the sample mean. (Differentiating with respect to σ² in the same way gives the sample variance.)

16
ML training for HMMs
Now consider:
–an N-state HMM M, each of whose states is associated with a Gaussian PDF
–a training sequence y_1, …, y_T
For simplicity assume that each y_t is 1-dimensional.

17
ML training for HMMs
If we knew that:
–y_1, …, y_{e(1)} correspond to state 1
–y_{e(1)+1}, …, y_{e(2)} correspond to state 2
–…
–y_{e(n−1)+1}, …, y_{e(n)} correspond to state n
–…
then we could set the mean of state n to the average value of y_{e(n−1)+1}, …, y_{e(n)}.
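If the segment boundaries e(1), …, e(N) were known, each state's mean would just be the average of its own segment. A sketch, with the boundaries passed as a list of segment end indices (my own encoding):

```python
def state_means(ys, ends):
    """ends[n] is the index e(n+1) just past the last frame of state n+1;
    each state's mean is the average of its own segment of ys."""
    means, start = [], 0
    for end in ends:
        seg = ys[start:end]
        means.append(sum(seg) / len(seg))
        start = end
    return means

# Frames 1-2 belong to state 1, frames 3-5 to state 2:
print(state_means([1.0, 3.0, 10.0, 11.0, 12.0], [2, 5]))  # [2.0, 11.0]
```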

18
ML Training for HMMs
y_1, …, y_{e(1)}, y_{e(1)+1}, …, y_{e(2)}, y_{e(2)+1}, …, y_T
Unfortunately we don’t know that y_{e(n−1)+1}, …, y_{e(n)} correspond to state n…

19
Solution
1. Define an initial HMM – M_0.
2. Use the Viterbi algorithm to compute the optimal state sequence between M_0 and y_1, …, y_T.
[Figure: Viterbi trellis over the observations y_1, y_2, y_3, …, y_t, …, y_T.]

20
Solution (continued)
Use the optimal state sequence to segment y:
y_1, …, y_{e(1)}, y_{e(1)+1}, …, y_{e(2)}, …, y_T
Reestimate the parameters to get a new model M_1.

21
Solution (continued)
Now repeat the whole process using M_1 instead of M_0, to get a new model M_2.
Then repeat again using M_2 to get a new model M_3, and so on.
Each iteration can only improve the fit:

p(y | M_0) ≤ p(y | M_1) ≤ p(y | M_2) ≤ … ≤ p(y | M_n) ≤ …
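The align-then-reestimate loop above can be sketched for 1-D Gaussian state PDFs. This is a deliberately simplified sketch: a strict left-to-right topology, transition probabilities ignored, and every state assumed to receive at least one frame; all names are mine.

```python
import math

def gauss_logpdf(y, mu, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mu) ** 2 / var)

def viterbi_align(ys, mus, variances):
    """Best left-to-right alignment of frames to states (stay or move on by one)."""
    N, T = len(mus), len(ys)
    NEG = float("-inf")
    delta = [[NEG] * N for _ in range(T)]  # best log-score ending in state n at time t
    back = [[0] * N for _ in range(T)]
    delta[0][0] = gauss_logpdf(ys[0], mus[0], variances[0])
    for t in range(1, T):
        for n in range(N):
            stay = delta[t - 1][n]
            move = delta[t - 1][n - 1] if n > 0 else NEG
            back[t][n] = n if stay >= move else n - 1
            delta[t][n] = max(stay, move) + gauss_logpdf(ys[t], mus[n], variances[n])
    states = [N - 1]                       # trace back from the final state
    for t in range(T - 1, 0, -1):
        states.append(back[t][states[-1]])
    return list(reversed(states))

def viterbi_reestimate(ys, states, N):
    """Set each state's mean/variance from the frames aligned to it
    (assumes every state received at least one frame)."""
    mus, variances = [], []
    for n in range(N):
        seg = [y for y, s in zip(ys, states) if s == n]
        mu = sum(seg) / len(seg)
        var = sum((y - mu) ** 2 for y in seg) / len(seg)
        mus.append(mu)
        variances.append(max(var, 1e-3))   # variance floor
    return mus, variances

ys = [0.1, -0.2, 0.0, 5.1, 4.9, 5.0]
states = viterbi_align(ys, [0.0, 4.0], [1.0, 1.0])
mus, variances = viterbi_reestimate(ys, states, 2)
print(states)  # [0, 0, 0, 1, 1, 1]
print(mus)
```

Repeating the two steps with the reestimated parameters yields the models M_1, M_2, … of the slide.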

22
Local optimization
[Figure: p(y | M) plotted against M – the sequence M_0, M_1, …, M_n climbs to a local optimum, which may not be the global optimum.]

23
Baum-Welch optimization
The algorithm just described is often called Viterbi training or Viterbi reestimation.
It is often used to train large sets of HMMs.
An alternative method is called Baum-Welch reestimation.

24
Baum-Welch Reestimation
[Figure: trellis of states against observations y_1, y_2, …, y_t, y_{t+1}, …, y_T.]
Instead of assigning each y_t to a single state, Baum-Welch assigns it to every state i with weight P(s_i | y_t) = γ_t(i).

25
‘Forward’ Probabilities
[Figure: trellis over y_1, …, y_t, y_{t+1}, …, y_T.]
α_t(i) = p(y_1, …, y_t, s_t = i | M): the probability of the first t observations, ending in state i at time t.

26
‘Backward’ Probabilities
[Figure: trellis over y_1, …, y_t, y_{t+1}, …, y_T.]
β_t(i) = p(y_{t+1}, …, y_T | s_t = i, M): the probability of the remaining observations, given state i at time t.

27
‘Forward-Backward’ Algorithm
Combining the two: α_t(i) β_t(i) = p(y, s_t = i | M), so

γ_t(i) = P(s_t = i | y, M) = α_t(i) β_t(i) / p(y | M),  where p(y | M) = Σ_i α_t(i) β_t(i) for any t.
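The α and β recursions can be sketched directly for Gaussian state PDFs. This unscaled version is only suitable for short sequences (real implementations scale α/β or work in the log domain); all names are mine.

```python
import math

def gauss_pdf(y, mu, var):
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def forward_backward(ys, pi, A, mus, variances):
    """pi: initial state probabilities, A: transition matrix A[i][j].
    Returns alpha, beta, the state posteriors gamma, and p(y | M)."""
    N, T = len(pi), len(ys)
    b = [[gauss_pdf(y, mus[i], variances[i]) for i in range(N)] for y in ys]
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):                      # forward recursion
        alpha[0][i] = pi[i] * b[0][i]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * b[t][j]
    for i in range(N):                      # backward recursion
        beta[T - 1][i] = 1.0
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * b[t + 1][j] * beta[t + 1][j] for j in range(N))
    py = sum(alpha[T - 1][i] for i in range(N))   # p(y | M)
    gamma = [[alpha[t][i] * beta[t][i] / py for i in range(N)] for t in range(T)]
    return alpha, beta, gamma, py

alpha, beta, gamma, py = forward_backward(
    [0.1, 4.9, 5.2], [1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [0.0, 5.0], [1.0, 1.0])
print(gamma[0][0])  # pi forces state 0 at t=1 here, so its posterior is 1.0
```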

28
Adaptation
A modern large-vocabulary continuous speech recognition system has many thousands of parameters.
Many hours of speech data are used to train the system (e.g. 200+ hours!).
The speech data come from many speakers.
Hence the recogniser is ‘speaker independent’.
But performance for an individual would be better if the system were speaker dependent.

29
Adaptation
For a single speaker, only a small amount of training data is available.
Viterbi reestimation or Baum-Welch reestimation will not work.

30
‘Parameters vs training data’
[Figure: performance plotted against number of parameters, for a larger and a smaller training set.]

31
Adaptation
Two common approaches to adaptation (with small amounts of training data):
–Bayesian adaptation, also known as MAP adaptation (MAP = Maximum a Posteriori)
–Transform-based adaptation, also known as MLLR (Maximum Likelihood Linear Regression)

32
Bayesian adaptation
Uses the well-trained, ‘speaker-independent’ HMM as a prior for the estimate of the parameters of the speaker-dependent HMM.
E.g.: [Figure: the speaker-independent state PDF, the sample mean of the adaptation data, and the MAP estimate of the mean lying between them.]
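A common form of the MAP estimate of a state mean interpolates between the speaker-independent prior mean and the sample mean of the adaptation data, weighted by a prior count. The symbol τ, its default value, and the function name here are my own choices for illustration:

```python
def map_mean(prior_mu, adaptation_data, tau=10.0):
    """MAP estimate of a Gaussian mean: with little data it stays near the
    speaker-independent prior; with lots of data it approaches the sample mean.
    tau acts as a 'prior count' controlling how strongly the prior is trusted."""
    n = len(adaptation_data)
    sample_mean = sum(adaptation_data) / n
    return (tau * prior_mu + n * sample_mean) / (tau + n)

print(map_mean(0.0, [2.0] * 10))   # 10 frames vs tau=10: halfway, 1.0
print(map_mean(0.0, [2.0] * 990))  # data dominates: (990*2)/1000 = 1.98
```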

33
Transform-based adaptation
Each acoustic vector is typically 40-dimensional.
So a linear transform of the acoustic data needs 40 × 40 = 1600 parameters.
This is much less than the tens or hundreds of thousands of parameters needed to train the whole system.
Estimate a linear transform to map the speaker-independent parameters to speaker-dependent parameters.
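MLLR proper estimates a full matrix transform by maximum likelihood. As a deliberately simplified sketch of the underlying idea, here is a 1-D least-squares fit of an affine map μ' ≈ aμ + b from the few states with adaptation data, which can then adapt every state's mean, including states never observed in the adaptation data (function names are mine):

```python
def fit_affine(si_means, adapted_targets):
    """Least-squares fit of mu' ≈ a*mu + b from the states we have data for
    (simple linear regression in closed form)."""
    n = len(si_means)
    mx = sum(si_means) / n
    my = sum(adapted_targets) / n
    a = sum((x - mx) * (y - my) for x, y in zip(si_means, adapted_targets)) \
        / sum((x - mx) ** 2 for x in si_means)
    return a, my - a * mx

# Fit from three observed states, then adapt an unseen state's mean:
a, b = fit_affine([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])  # a=2, b=1
print(a * 10.0 + b)  # unseen speaker-independent mean 10 -> adapted mean 21
```

One shared transform is the whole point: the handful of parameters in (a, b), or in the full 40 × 40 matrix, can be estimated from far less data than the individual state parameters could be.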

34
Transform-based adaptation
[Figure: the speaker-independent parameters are mapped by a ‘best fit’ transform onto the speaker-dependent data points, giving the adapted parameters.]

35
Summary
Maximum Likelihood (ML) estimation
Viterbi HMM parameter estimation
Baum-Welch HMM parameter estimation
Forward and backward probabilities
Adaptation:
–Bayesian adaptation
–Transform-based adaptation
