Published by Eric Conley. Modified about 1 year ago.

1
K.Marasek Multimedia Department Acoustic modeling

2
Hidden Markov Models
- Acoustic models (AMs) are stochastic models used together with language models and other models to make decisions based on incomplete or uncertain knowledge. Given a sequence of feature vectors X extracted from the speech signal by a front end, the purpose of the AM is to compute the probability that a particular linguistic event (word, sentence, etc.) has generated the sequence.
- AMs have to be flexible, accurate and efficient, and HMMs are.
- Easy training.
- First used by Markov (1913) to analyse the letter sequence in a text.
- Efficient training method proposed by Baum et al. (1960).
- Application to ASR: Jelinek (1975).
- Applications also in other areas: pattern recognition, linguistic analysis (stochastic parsing).

3
HMM Theory
- An HMM can be defined as a pair of discrete-time stochastic processes (I, X). The process I takes values in a finite set $\mathcal{I}$, whose elements are called the states of the model. The process X takes values in a space $\mathcal{X}$, called the observation space, which can be either discrete or continuous depending on the nature of the data sequences to be modeled.
- The processes satisfy the following two relations, where the right-hand probabilities are independent of time t:
  - First-order Markov hypothesis: $\Pr(I_t = i_t \mid I_0^{t-1} = i_0^{t-1}) = \Pr(I_t = i_t \mid I_{t-1} = i_{t-1})$. The history before time t has no influence on the future evolution of the process if the present state is specified.
  - Output independence hypothesis: $\Pr(X_t = x_t \mid X_1^{t-1} = x_1^{t-1}, I_0^{t} = i_0^{t}) = \Pr(X_t = x_t \mid I_{t-1} = i_{t-1}, I_t = i_t)$. Neither the evolution of I nor past observations influence the present observation if the last two states are specified; output probabilities at time t are conditioned on the states of I at times t-1 and t, i.e. on the transition at time t.
- The random variables of process X represent the variability in the realization of the acoustic events, while process I models the various possibilities in the temporal concatenation of these events.

4
HMM theory 2
- Properties of HMMs: the probability of every finite sequence $X_1^T$ of observable random variables can be decomposed as:
- From this it follows that an HMM can be defined by specifying the parameter set $V = (\pi, A, B)$, where
  - $\pi_i = \Pr(I_0 = i)$ is the initial state density,
  - $a_{ij} = \Pr(I_t = j \mid I_{t-1} = i)$ is an element of the transition probability matrix A,
  - $b_{ij}(x) = \Pr(X_t = x \mid I_{t-1} = i, I_t = j)$ is an element of the output density matrix B.
- The parameters satisfy the following relation:
- Thus the model parameters are sufficient for computing the probability of a sequence of observations (although a faster formula is usually used).
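As a sketch in the notation of this slide (arc-emission outputs $b_{ij}$, parameter set $V = (\pi, A, B)$), the decomposition, the normalization constraints and the resulting sequence probability take the following standard forms. The index $i_0^T$ is assumed to range over all state sequences; for continuous observations the sum over x becomes an integral.

```latex
\begin{align}
\Pr(X_1^T = x_1^T) &= \sum_{i_0^T} \Pr(I_0 = i_0) \prod_{t=1}^{T}
    \Pr(I_t = i_t \mid I_{t-1} = i_{t-1})\,
    \Pr(X_t = x_t \mid I_{t-1} = i_{t-1}, I_t = i_t) \\
\sum_i \pi_i &= 1, \qquad \sum_j a_{ij} = 1 \;\; \forall i, \qquad
    \sum_x b_{ij}(x) = 1 \;\; \forall i, j \\
\Pr(x_1^T \mid V) &= \sum_{i_0^T} \pi_{i_0} \prod_{t=1}^{T}
    a_{i_{t-1} i_t}\, b_{i_{t-1} i_t}(x_t)
\end{align}
```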

5
HMMs as probabilistic automata
- Nodes of the graph correspond to states of the Markov chain, while directed arcs correspond to allowed transitions $a_{ij}$.
- A sequence of observations is regarded as an emission of the system, which at each time instant makes a transition from one node to another, chosen randomly according to a node-specific probability density, and generates a random vector according to an arc-specific probability density. The number of states together with the set of arcs is usually called the model topology.
- In ASR it is common to use left-to-right topologies, in which $a_{ij} = 0$ for $j < i$.
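A left-to-right topology can be illustrated with a small transition matrix. The sketch below is only an illustration: the function name, the self-loop probability of 0.6 and the choice of allowing only "stay" or "move to the next state" are assumptions, not something taken from the slides.

```python
def left_to_right_transitions(n_states, self_loop=0.6):
    """Build a transition matrix with a[i][j] = 0 for j < i.

    Hypothetical topology: each state either stays where it is
    (probability self_loop) or moves to the next state; the last
    state is absorbing.
    """
    a = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i == n_states - 1:
            a[i][i] = 1.0                      # final state: no way back
        else:
            a[i][i] = self_loop                # stay in the same state
            a[i][i + 1] = 1.0 - self_loop      # advance to the next state
    return a


A = left_to_right_transitions(3)
for row in A:
    print(row)
```

Every entry below the diagonal is zero, so a path through the model can never return to an earlier state.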

6
HMM is described by:
- Observation sequence $O = O_1 \ldots O_T$
- State sequence $Q = q_1 \ldots q_T$
- HMM states $S = \{S_1, \ldots, S_N\}$
- Symbols (emissions) $V = \{v_1, \ldots, v_m\}$
- Transition probabilities A
- Emission probabilities B
- Initial state density $\pi$
- Parameter set $\lambda = (A, B, \pi)$

7
Example: weather
- States: $S_1$ = rain, $S_2$ = clouds, $S_3$ = sunny.
- The states are observed only through the temperature T (the density function differs between states).
- What is the probability of the observed temperature sequence (20, 22, 18, 15)?
- Which sequence of states (rain, clouds, sunny) is the most probable?
Let $O = \{O_1 = 20^\circ, O_2 = 22^\circ, O_3 = 18^\circ, O_4 = 15^\circ\}$, with sunny as the start state. The following state sequences are possible: $Q_1 = \{q_1 = S_3, q_2 = S_3, q_3 = S_3, q_4 = S_1\}$, $Q_2 = \{q_1 = S_3, q_2 = S_3, q_3 = S_3, q_4 = S_3\}$, $Q_3 = \{q_1 = S_3, q_2 = S_3, q_3 = S_1, q_4 = S_1\}$, etc.
[Figure: state graph over s1, s2, s3 with transition probabilities A (e.g. $a_{31}$), and a table of emission probabilities B with entries $b_j(T) = P(T \mid q = S_j)$ for j = 1, 2, 3 and T = 10, 11, ..., 40.]

8
Weather II
- For each state sequence, the conditional probability of the observation sequence can be found. As before, assume $O = \{O_1 = 20^\circ, O_2 = 22^\circ, O_3 = 18^\circ, O_4 = 15^\circ\}$ with sunny as the start state, and the candidate state sequences $Q_1 = \{q_1 = S_3, q_2 = S_3, q_3 = S_3, q_4 = S_1\}$, $Q_2 = \{q_1 = S_3, q_2 = S_3, q_3 = S_3, q_4 = S_3\}$, $Q_3 = \{q_1 = S_3, q_2 = S_3, q_3 = S_1, q_4 = S_1\}$, etc.
- In general, the observed temperature sequence O can be generated by many state sequences, which are not observable. The probability of the temperature sequence given the model is therefore a sum over all state sequences: $P(O \mid \lambda) = \sum_{Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda)$.

9
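Trellis for weather
[Figure: trellis with states s1, s2, s3 in each column and observations $O = O_1 O_2 O_3 O_4 O_5$ along the time axis. The first column is weighted by $b_1(O_1)$, $b_2(O_1)$, $b_3(O_1)$. Arcs carry transition probabilities (e.g. $a_{12}$, $a_{22}$, $a_{32}$, $a_{31}$) multiplied by the emission probability of the target state (e.g. $b_2(O_2)$); the scores of the arcs entering a node are summed.]
1. Init:
2. Iterations:
3. Final step: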
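The three steps of the trellis computation can be sketched as follows. This is the standard forward recursion written in the state-emission notation of the weather example ($b_j(O_t)$, N states, T observations); the symbol $\alpha_t(i)$ for the score stored in trellis node (t, i) and the symbol $\lambda$ for the model are assumptions.

```latex
\begin{align}
\text{1. Init:} \quad & \alpha_1(i) = \pi_i \, b_i(O_1), \qquad 1 \le i \le N \\
\text{2. Iterations:} \quad & \alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big] b_j(O_{t+1}),
    \qquad 1 \le t \le T-1 \\
\text{3. Final step:} \quad & P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)
\end{align}
```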

10
Left-right models: SR
- For speech recognition, left-to-right models are used.
- Find the best path in the trellis.
- Instead of summing, take the maximum: the resulting path gives the most probable state sequence.
- An additional array of pointers is needed to store the best paths.
- The backpointers show the optimal path.
1. Init:
2. Iterations:
3. Final:
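The best-path search with backpointers can be sketched in a few lines. This is a minimal illustration under the state-emission convention $b_j(O_t)$; the function name, the two-state model and all probabilities are hypothetical toy values, and plain probabilities are used instead of logarithms for readability.

```python
def viterbi(pi, a, b_obs):
    """Most probable state sequence for one observation sequence.

    pi[i]       initial probability of state i
    a[i][j]     transition probability from state i to state j
    b_obs[t][j] emission probability of the t-th observation in state j
    Returns (probability of the best path, list of state indices).
    """
    n, T = len(pi), len(b_obs)
    v = [[0.0] * n for _ in range(T)]      # best-path scores
    back = [[0] * n for _ in range(T)]     # backpointers
    # 1. Init
    for i in range(n):
        v[0][i] = pi[i] * b_obs[0][i]
    # 2. Iterations: take the max over predecessors instead of the sum
    for t in range(1, T):
        for j in range(n):
            best_i = max(range(n), key=lambda i: v[t - 1][i] * a[i][j])
            back[t][j] = best_i
            v[t][j] = v[t - 1][best_i] * a[best_i][j] * b_obs[t][j]
    # 3. Final: pick the best end state and follow the backpointers
    last = max(range(n), key=lambda i: v[T - 1][i])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return v[T - 1][last], path


# Hypothetical two-state toy model with three observations.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b_obs = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
best_prob, best_path = viterbi(pi, a, b_obs)
print(best_prob, best_path)
```

For these toy numbers the best path stays in state 0 for two frames and then moves to state 1.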

11
Training: estimation of model parameters
- Count the frequency of occurrences to estimate $b_j(k)$.
- Transition probabilities:
- Assumption: the states of the HMM can be observed, which is not always possible. Solution: forward-backward training.
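When the state sequence is observable, both parameter sets reduce to relative frequencies. The sketch below assumes discrete symbols and a single labeled sequence; the function name and the toy data are hypothetical.

```python
from collections import Counter


def estimate_from_labeled(states, symbols, n_states, n_symbols):
    """Relative-frequency estimates of a[i][j] and b[j][k].

    states[t]  index of the (observed) state at time t
    symbols[t] index of the symbol emitted at time t
    """
    trans = Counter(zip(states[:-1], states[1:]))   # counts of i -> j
    emit = Counter(zip(states, symbols))            # counts of (j, k)
    leaving = Counter(states[:-1])                  # transitions out of i
    visits = Counter(states)                        # frames spent in j
    a = [[trans[(i, j)] / leaving[i] if leaving[i] else 0.0
          for j in range(n_states)] for i in range(n_states)]
    b = [[emit[(j, k)] / visits[j] if visits[j] else 0.0
          for k in range(n_symbols)] for j in range(n_states)]
    return a, b


# Hypothetical labeled data: two states, two symbols.
a_hat, b_hat = estimate_from_labeled([0, 0, 1, 1, 0], [0, 1, 1, 1, 0], 2, 2)
print(a_hat)
print(b_hat)
```

Each row of the estimated matrices sums to one whenever the corresponding state has been visited at least once.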

12
Forward/Backward training
- The state sequences cannot be observed, but expected values can be computed from the current model parameters. This leads to an iterative estimation (new parameters are marked with a bar above the symbol).

13
Forward-Backward 2
[Figure: trellis over observations $O = O_1 \ldots O_{t-1}\, O_t\, O_{t+1} \ldots$ with states s1, s2, s3. A transition from state $S_i$ at time t to state $S_j$ at time t+1 carries the weight $a_{ij} b_j(O_{t+1})$; nodes at time t+1 are annotated with $\alpha_{t+1}(j)$ and $\beta_{t+1}(j)$.]
- Forward probability:
- Backward probability, i.e. the probability of observing $O_{t+1} \ldots O_T$ given that the model is in state $S_i$ at time t:
- Iterative computation of the backward probability:
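The backward recursion runs over the same trellis from right to left. The following sketch assumes the state-emission convention with arc weight $a_{ij} b_j(O_{t+1})$ shown in the figure; the function name and the toy numbers are hypothetical.

```python
def backward(a, b_obs):
    """Backward probabilities beta[t][i] for one observation sequence.

    a[i][j]     transition probability from state i to state j
    b_obs[t][j] emission probability of the t-th observation in state j
    """
    n, T = len(a), len(b_obs)
    beta = [[0.0] * n for _ in range(T)]
    for i in range(n):
        beta[T - 1][i] = 1.0                       # nothing left to emit
    for t in range(T - 2, -1, -1):                 # right to left
        for i in range(n):
            beta[t][i] = sum(a[i][j] * b_obs[t + 1][j] * beta[t + 1][j]
                             for j in range(n))
    return beta


# Hypothetical two-state toy model with three observations.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b_obs = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
beta = backward(a, b_obs)
# Combining the first column with pi and the first emission gives the
# total probability of the observation sequence.
total = sum(pi[i] * b_obs[0][i] * beta[0][i] for i in range(2))
print(total)
```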

14
FB 3
- Now compute the probability that the model is in state $s_i$ at time step t.
- This formula gives the probability of being in state i at time step t, but we additionally need the expected number of time steps spent in state i and the expected number of transitions.
- For ergodic processes (no dependence on time), assuming a sequence $X = x_1, x_2, \ldots, x_i, \ldots, x_T$ with only discrete values (e.g. {a, b, c}):

15
FB 4

16
FB 5
- The estimation procedure is carried out iteratively, so that $P(O \mid \bar{\lambda}) \ge P(O \mid \lambda)$.
- The previous equations assume a single observation sequence. What about multiple sequences? Let $O = \{O^{(1)}, O^{(2)}, \ldots, O^{(M)}\}$ be the training examples.
- The $O^{(m)}$ are statistically independent, thus: $P(O \mid \lambda) = \prod_{m=1}^{M} P(O^{(m)} \mid \lambda)$.
- This can be handled by introducing a fictitious observation in which all observations are concatenated together.
- Then we have:
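One iteration of the re-estimation can be sketched as below. This is a minimal illustration of a standard Baum-Welch step for a single discrete observation sequence under the state-emission convention; it is not the formulas from the slides, which did not survive. All function names and toy numbers are hypothetical, and no scaling is applied, so it only suits short sequences.

```python
def forward(pi, a, b, obs):
    n = len(pi)
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * a[i][j] for i in range(n)) * b[j][o]
                      for j in range(n)])
    return alpha


def backward(a, b, obs):
    n = len(a)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(a[i][j] * b[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(pi, a, b, obs):
    """Return (new_pi, new_a, new_b, P(O | old parameters))."""
    n, m, T = len(pi), len(b[0]), len(obs)
    alpha, beta = forward(pi, a, b, obs), backward(a, b, obs)
    prob = sum(alpha[-1])
    # gamma[t][i]: probability of being in state i at time t
    gamma = [[alpha[t][i] * beta[t][i] / prob for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: probability of the transition i -> j at time t
    xi = [[[alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / prob
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_a = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_b = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]
    return new_pi, new_a, new_b, prob


# Hypothetical toy model: two states, three symbols.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 1, 0]
pi1, a1, b1, p0 = baum_welch_step(pi, a, b, obs)
_, _, _, p1 = baum_welch_step(pi1, a1, b1, obs)
print(p0, p1)
```

The second call evaluates the likelihood under the re-estimated parameters, which should not be lower than under the initial ones.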

17
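HMM: forward and backward coefficients
- Additional probabilities are defined to reduce the computational load.
- Efficiency matters for two tasks: model parameter estimation and decoding (search, recognition).
- The forward probability is the probability that X emits the partial sequence $x_1^t$ and process I is in state i at time t: $\alpha_t(i) = \Pr(X_1^t = x_1^t, I_t = i)$.
- It can be computed iteratively by: $\alpha_0(i) = \pi_i$, $\alpha_t(j) = \sum_i \alpha_{t-1}(i)\, a_{ij}\, b_{ij}(x_t)$.
- The backward probability is the probability that X emits the partial sequence $x_{t+1}^T$ given that process I is in state i at time t: $\beta_t(i) = \Pr(X_{t+1}^T = x_{t+1}^T \mid I_t = i)$.
- The best-path probability is the maximum joint probability of the partial sequence $x_1^t$ and a state sequence ending in state i at time t: $v_t(i) = \max_{i_0^{t-1}} \Pr(X_1^t = x_1^t, I_0^{t-1} = i_0^{t-1}, I_t = i)$.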

18
Total probability: Trellis
- The total probability of an observation sequence can be computed as: $\Pr(X_1^T = x_1^T) = \sum_i \alpha_T(i)$
- or approximated using v, which measures the probability along the path that gives the highest contribution to the summation: $\max_i v_T(i)$
- These algorithms have a time complexity of O(MT), where M is the number of transitions with non-zero probability (which depends on the number of states N in the system) and T is the length of the input sequence.
- The computation of these probabilities is performed in a data structure called a trellis, which corresponds to unfolding the graph structure along the time axis.
[Figure: trellis. Dashed arrows mark the paths whose scores are added to obtain the $\alpha$ probability, dotted arrows those for $\beta$; v corresponds to the highest-scoring path among the dashed ones.]

19
Trellis 2
- Nodes in the trellis are pairs (t, i), where t is the time index and i the model state. Arcs represent model transitions composing the possible paths in the model; for a given observation $x_1^T$, each arc (t-1, i) -> (t, j) carries a "weight" given by $a_{ij} b_{ij}(x_t)$.
- Each path can then be assigned a score equal to the product of the weights of the arcs it traverses. This score is the probability of emitting the observed sequence along the path, given V, the current set of model parameters.
- The recurrent computation of $\alpha$, $\beta$ and v corresponds to appropriately combining, at each trellis node, the scores of the paths ending or starting at that node.
- The computation proceeds column by column, synchronously with the appearance of observations. At every frame, the scores of the nodes in a column are updated using recursion formulas that involve the values of an adjacent column, the transition probabilities of the models and the values of the output densities for the current observation. For $\alpha$ and v, the computation starts from the leftmost column, whose values are initialized by $\pi$, and ends at the rightmost column, where the final value is computed. For $\beta$, the computation goes in the opposite direction.

20
Output probabilities
- If the observation sequences are composed of symbols drawn from a finite alphabet of O symbols, then a density is a real-valued vector $[b(x)]_{x=1}^{O}$ with a probability entry for each possible symbol, subject to the constraints $b(x) \ge 0$ and $\sum_{x=1}^{O} b(x) = 1$.
- An observation may also be composed of a tuple of Q symbols, usually considered mutually statistically independent. The output density can then be represented by the product of Q independent densities. Such models are called discrete HMMs.
- Discrete HMMs are simpler (finding b(x) only requires an array access) but imprecise, so current implementations tend to use continuous densities.
- To reduce memory requirements, parametric representations are used.
- The most popular choice is the multivariate Gaussian density $\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left(-\tfrac{1}{2}(x-\mu)^{\top} \Sigma^{-1} (x-\mu)\right)$, where D is the dimension of the vector space (the length of a feature vector). The parameters of the Gaussian density are the mean vector $\mu$ (location parameter) and the symmetric covariance matrix $\Sigma$ (spread of values around $\mu$).
- Gaussians are widespread in statistics and their parameters are easy to estimate.

21
Forward algorithm
To calculate the probability (likelihood) $P(X \mid \lambda)$ of the observation sequence $X = (X_1, X_2, \ldots, X_T)$ given the HMM $\lambda$, the most intuitive way is to sum up the probabilities of all state sequences: $P(X \mid \lambda) = \sum_{\text{all } S} P(S \mid \lambda)\, P(X \mid S, \lambda)$
- In other words, we first enumerate all possible state sequences S of length T that generate the observation X and sum their probabilities. The probability of each path S is the product of the state sequence probability and the joint output probability along the path, which factorizes under the output-independence assumption.
- So finally we get:
- First we enumerate all possible state sequences of length T+1. For any given state sequence we go through all transitions and states in the sequence until we reach the last transition. This requires generating $O(N^T)$ state sequences, i.e. exponential computational complexity.
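The exhaustive summation can be written down directly, which makes the exponential cost visible. The sketch assumes discrete symbols and the state-emission convention; the function name and the toy model are hypothetical.

```python
from itertools import product


def brute_force_likelihood(pi, a, b, obs):
    """Sum the probability of the observations over every state sequence.

    Enumerates all N**T paths, so it is only usable for tiny examples.
    """
    n, T = len(pi), len(obs)
    total = 0.0
    for path in product(range(n), repeat=T):
        p = pi[path[0]] * b[path[0]][obs[0]]
        for t in range(1, T):
            p *= a[path[t - 1]][path[t]] * b[path[t]][obs[t]]
        total += p
    return total


# Hypothetical toy model: two states, three symbols, three observations.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
likelihood = brute_force_likelihood(pi, a, b, [0, 1, 2])
print(likelihood)
```

With N = 2 and T = 3 only 8 paths are visited, but the count grows as N**T, which is why the trellis-based recursion is preferred.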

22
Forward algorithm II - Trellis
Based on the HMM assumption that $P(s_t \mid s_{t-1}, \lambda)\, P(X_t \mid s_t, \lambda)$ involves only $s_{t-1}$ and $s_t$, $P(X \mid \lambda)$ can be computed recursively using the so-called forward probability $\alpha_t(i) = P(X_1^t, s_t = i \mid \lambda)$, the probability that the HMM is in state i at time t having generated the partial observation $X_1^t$ (i.e. $X_1 \ldots X_t$).
This can be illustrated by a trellis: an arrow is a transition from state to state, and the number within a circle denotes $\alpha_t(i)$. We start with the cells at t=0, which hold the initial probabilities. The other cells are computed time-synchronously from left to right, with each cell completely computed before proceeding to time t+1. When the states in the last column have been computed, the sum of all probabilities in the final column is the probability of generating the observation sequence.
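The column-by-column computation can be sketched as follows. This is an illustration for discrete symbols under the state-emission convention; the function name and the toy model are hypothetical, and the first column is taken at the first observation rather than at t=0.

```python
def forward_likelihood(pi, a, b, obs):
    """Probability of the observation sequence via the forward recursion.

    pi[i]   initial probability of state i
    a[i][j] transition probability from state i to state j
    b[j][k] probability of emitting symbol k in state j
    obs     list of symbol indices
    """
    n = len(pi)
    # First column of the trellis.
    alpha = [pi[i] * b[i][obs[0]] for i in range(n)]
    # Each later column depends only on the previous one.
    for o in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(n)) * b[j][o]
                 for j in range(n)]
    # Sum over the final column.
    return sum(alpha)


# Hypothetical toy model: two states, three symbols.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
likelihood = forward_likelihood(pi, a, b, [0, 1, 2])
print(likelihood)
```

Only one column is kept in memory, and the cost is on the order of N*N*T multiplications instead of N**T paths.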

23
Gaussians
- Disadvantage: Gaussian densities are unimodal. To overcome this, Gaussian mixtures (weighted sums of Gaussians) are used.
- Mixtures are capable of approximating other densities, given an appropriate number of components.
- A D-dimensional Gaussian mixture with K components can be described using K[1 + D + D(D+1)/2] real numbers (for D = 39 and K = 20 this gives 16,400 real numbers).
- Further reduction: a diagonal covariance matrix (components mutually independent). The joint density is then the product of one-dimensional Gaussian densities corresponding to the individual vector elements, which needs only 2D parameters per Gaussian.
- Diagonal-covariance Gaussians are widely used in ASR.
- To reduce the number of Gaussians, distribution tying (sharing) is often used: different transitions of different models are forced to share the same output density. The tying scheme exploits a priori knowledge, e.g. sharing densities among allophones or sound classes (described in detail later).
- Other densities known from the literature have been tried, such as Laplacians and lambda densities (for duration), but Gaussians dominate.
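The parameter counts and the diagonal-covariance density are easy to check numerically. The helper names below are hypothetical; the diagonal count assumes one weight, D means and D variances per component.

```python
import math


def full_cov_param_count(d, k):
    """K * [1 weight + D means + D(D+1)/2 covariance entries]."""
    return k * (1 + d + d * (d + 1) // 2)


def diag_cov_param_count(d, k):
    """K * [1 weight + D means + D variances]."""
    return k * (1 + 2 * d)


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian.

    The joint density is a product of one-dimensional Gaussians, so the
    log density is a sum over the vector elements.
    """
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


print(full_cov_param_count(39, 20))   # full covariance matrices
print(diag_cov_param_count(39, 20))   # diagonal covariance matrices
print(diag_gaussian_logpdf([0.0, 1.0], [0.0, 0.0], [1.0, 4.0]))
```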

24
HMM composition
- Probabilistic decoding: ASR with stochastic models means choosing, from the set of possible linguistic events, the one that corresponds to the observed data with the highest probability.
- In ASR an observation often corresponds not to the utterance of a single word but to a sequence of words. If the language has a limited set of sentences, it is possible to have a model for each utterance, but what if the number is unlimited? Too many models are also hard to handle: it is unclear how to train them, and items not observed in the training material cannot be recognized.
- Solution: concatenation of units from a list of manageable size (describe training).
[Figure: model linking]

25
DTW
- Warp two speech templates $x_1 \ldots x_N$ and $y_1 \ldots y_M$ with minimal distortion: to find the optimal path between the starting point (1,1) and the end point (N,M), we need to compute the optimal accumulated distance D(N,M) based on the local distances d(i,j). Since the optimal path must build on the optimal path to the previous step, the minimum distance must satisfy the following equation:

26
Dynamic programming algorithm
- We need to consider and keep only the best move for each pair, although there are M possible moves, so DTW can be computed recursively.
- We can identify the optimal match $y_j$ with respect to $x_i$ and save the index in a backpointer table B(i,j).
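The recursion with a backpointer table can be sketched as below. Note the assumptions: the slides do not show which moves are allowed, so this sketch uses the common three-move step pattern (from (i-1, j), (i, j-1) or (i-1, j-1)), an absolute-difference local distance on scalar frames, and 0-based indices. The function name and the data are hypothetical.

```python
def dtw(x, y, dist=lambda u, v: abs(u - v)):
    """Accumulated DTW distance and warping path between x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    D = [[inf] * m for _ in range(n)]     # accumulated distances
    B = [[None] * m for _ in range(n)]    # backpointer table
    D[0][0] = dist(x[0], y[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append((D[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((D[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((D[i - 1][j - 1], (i - 1, j - 1)))
            # Keep only the best move into (i, j).
            best, B[i][j] = min(candidates)
            D[i][j] = best + dist(x[i], y[j])
    # Follow the backpointers from the end point to the start point.
    path = [(n - 1, m - 1)]
    while B[path[-1][0]][path[-1][1]] is not None:
        path.append(B[path[-1][0]][path[-1][1]])
    path.reverse()
    return D[n - 1][m - 1], path


distance, path = dtw([1, 2, 3], [1, 2, 2, 3])
print(distance, path)
```

In this toy case the second template merely repeats one frame, so the warping path absorbs the repetition and the accumulated distance is zero.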

27
Viterbi algorithm
We are looking for the state sequence $S = (s_1, s_2, \ldots, s_T)$ that maximizes $P(S, X \mid \lambda)$. The procedure is very similar to the dynamic programming used for the forward probabilities. Instead of summing up probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path. The best-path probability is defined as: $V_t(i) = \max_{s_1^{t-1}} P(X_1^t, s_1^{t-1}, s_t = i \mid \lambda)$
- $V_t(i)$ is the probability of the most likely state sequence at time t that has generated the observations $X_1^t$ and ends in state i.

28
Viterbi algorithm
- An algorithm for computing the v probabilities: an application of dynamic programming to finding the best-scoring path in a directed graph with weighted arcs, i.e. in the trellis.
- One of the most important algorithms in current computer science.
- Uses a recursive formula:
- When the whole observation sequence $x_1^T$ has been processed, the score of the best path can be found by computing:
- The identity of the states on the best path can be recovered using backpointers f(x,I). This yields the optimal state sequence, which constitutes a time alignment of the input speech frames and makes it possible to locate occurrences of SR units (phonemes).
- Construction of the recognition model:
  - First, the recognition language is represented as a network of words (a finite-state automaton). The connections between words are empty transitions, but they can be assigned probabilities (LM, n-gram).
  - Each word is replaced by a sequence (or network) of phonemes according to lexical rules.
  - Phonemes are replaced by instances of the appropriate HMMs. Special labels are assigned to word-ending states (this simplifies retrieving the word sequence).

29
Viterbi movie
- The Viterbi algorithm is an efficient way to find the shortest route through a type of graph called a trellis. The algorithm uses a technique called 'forward dynamic programming', which relies on the property that the cost of a path can be expressed as a sum of incremental or transition costs between nodes adjacent in time in the trellis. The demo shows the evolution of the Viterbi algorithm over 6 time instants. At each time, the shortest path to each node at the next time instant is determined. Paths that do not survive to the next time instant are deleted. By time k+2, the shortest path (track) to time k has been determined unambiguously. This is called 'merging of paths'.
[Figure: trellis animation; vertical axis: states]

30
Compound model

31
Viterbi pseudocode
- Note the use of a stack.

32
Model choice in ASR
- Identifying basic units is complicated by various natural-language effects: reduction, context-dependent pronunciation, etc. Thus phonemes are sometimes not an appropriate representation.
- Better: use context-dependent units (allophones).
- Triphones: the context is formed by the previous and following phonemes (monophones).
- Phoneme models can have a left-to-right topology with 3 groups of states: onset, body and coda.
- Note the huge number of possible triphones: $40^3 = 64{,}000$ models! Of course not all of them occur, owing to phonotactic rules, but how to train them? How to manage them?
- Other attempts: half-syllables, diphones, microsegments, etc., but all these methods of unit selection are based on a priori phonetic knowledge.
- A totally different approach: automatic unsupervised clustering of frames. The corresponding centroids are taken as starting distributions for a set of basic simple units called fenones. A maximum-likelihood decoding of utterances in terms of fenones is generated (dictionary), fenones are then combined to build word models, and the models are then trained.
[Figure: fenone]
- Senone: the shared parameter (i.e., the output distribution) associated with a cluster of similar states is called a senone because of its state dependency. The phonetic models that share senones are shared-distribution models (SDMs).

33
Parameter tying
- A good trade-off between resolution and precision of the models: an equivalence relation is imposed between different components of the parameter set of a model, or between components of different models.
- Definition of the tying relation:
  - It involves a decision about every parameter of the model set.
  - A priori knowledge-based equivalence relations:
    - Semi-continuous HMMs (SCHMMs): a set of output density mixtures that share the same set of basic Gaussian components and differ only in their weights.
    - Phonetically tied mixtures: a set of context-dependent HMMs in which the mixtures of all allophones of a phoneme share a phoneme-dependent codebook.
  - State tying: clustering of states based on the similarity of their Gaussians, followed by retraining (Young & Woodland, 94).
  - Phonetic decision trees (Bahl et al., 91): a binary decision tree with a question and a set of HMM densities attached to each node; the questions generally reflect the phonetic context, e.g. "is the left context a plosive?"
  - Genones: automatically determined SCHMMs:
    1. Mixtures of allophones are clustered and mixtures with common components are identified.
    2. The most likely elements of the clusters are selected: genones.
    3. The system is retrained.
[Figure: decision tree with yes/no branches; annotation "Mostly used"]

34
Implementation issues
- Overflow and underflow may occur during computation because the probabilities are very small, especially for long sentences. To overcome this, logarithms of the probabilities are used.
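A log-domain version of the forward recursion illustrates the idea. In the log domain products become sums, and sums of probabilities are computed with the log-sum-exp trick. This is a sketch under the state-emission convention; the function names and the toy numbers are hypothetical.

```python
import math


def logsumexp(values):
    """log(sum(exp(v))) computed without leaving the log domain."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_forward(log_pi, log_a, log_b_obs):
    """Log-likelihood of an observation sequence.

    log_pi[i]       log initial probability of state i
    log_a[i][j]     log transition probability from i to j
    log_b_obs[t][j] log emission probability of observation t in state j
    """
    n = len(log_pi)
    alpha = [log_pi[i] + log_b_obs[0][i] for i in range(n)]
    for t in range(1, len(log_b_obs)):
        alpha = [logsumexp([alpha[i] + log_a[i][j] for i in range(n)])
                 + log_b_obs[t][j] for j in range(n)]
    return logsumexp(alpha)


# Hypothetical toy model; 2000 frames with emission probability 0.1 each.
log_pi = [math.log(0.5), math.log(0.5)]
log_a = [[math.log(0.5)] * 2 for _ in range(2)]
long_obs = [[math.log(0.1)] * 2 for _ in range(2000)]
log_likelihood = log_forward(log_pi, log_a, long_obs)
print(log_likelihood)
```

The corresponding plain probability, 0.1 raised to the power 2000, is far below the smallest positive double and would underflow to zero, while the log-likelihood remains an ordinary finite number.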
