
1 Maximum Entropy Model, Bayesian Networks, HMM, Markov Random Fields, (Hidden/Segmental) Conditional Random Fields

2 Maximum Entropy Model
x – observations, y – class identity, f_k – feature functions, λ_k – trainable parameters

p(y, x) \propto \exp\left\{ \sum_k \lambda_k f_k(x, y) \right\}

\log p(y, x) = \sum_k \lambda_k f_k(x, y) + \mathrm{const}

Collecting the parameters and feature functions into vectors \lambda = [\lambda_1, \ldots, \lambda_K]^T and f(x, y) = [f_1(x, y), \ldots, f_K(x, y)]^T, this becomes

\log p(y, x) = \lambda^T f(x, y) + \mathrm{const}
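Below is a minimal sketch (not from the slides) of this parameterization in code: the class posterior is a softmax over λ^T f(x, y). The feature map `feature_vector` and its layout are my own hypothetical example.

```python
# Minimal MaxEnt sketch: log p(y, x) = lambda^T f(x, y) + const,
# so P(y | x) is a softmax over the scores lambda^T f(x, y).
import numpy as np

def feature_vector(x, y, num_classes):
    """Hypothetical feature map f(x, y): observation x copied into the
    block belonging to class y, plus a per-class bias feature."""
    dim = len(x)
    f = np.zeros(num_classes * (dim + 1))
    start = y * (dim + 1)
    f[start] = 1.0                      # bias feature for class y
    f[start + 1:start + 1 + dim] = x
    return f

def class_posterior(x, lam, num_classes):
    """P(y | x) = exp(lam^T f(x, y)) / sum_y' exp(lam^T f(x, y'))."""
    scores = np.array([lam @ feature_vector(x, y, num_classes)
                       for y in range(num_classes)])
    scores -= scores.max()              # numerical stability
    p = np.exp(scores)
    return p / p.sum()

x = np.array([0.3, -1.2])
lam = np.zeros(2 * 3)                   # 2 classes, 3 features per class
print(class_posterior(x, lam, num_classes=2))   # uniform for zero weights
```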

3 Maximum Entropy Model

P(y \mid x) = \frac{\exp\{\lambda^T f(x, y)\}}{\sum_{y'} \exp\{\lambda^T f(x, y')\}}

We train the parameters λ_k to maximize the conditional likelihood of the training data (equivalently: minimize cross entropy, maximize the MMI objective function):

\hat{\lambda} = \arg\max_\lambda \prod_i P(y_i \mid x_i)
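A minimal sketch (my own, not from the slides) of this training criterion: the gradient of the conditional log-likelihood is the observed minus the expected feature vector, and we can follow it by gradient ascent. The toy feature map `feat` is a hypothetical example.

```python
# Gradient of sum_i log P(y_i | x_i) for a MaxEnt model with
# P(y | x) proportional to exp(lam^T feat(x, y)):
# grad = sum_i [ feat(x_i, y_i) - E_{P(y|x_i)}[ feat(x_i, y) ] ]
import numpy as np

def maxent_gradient(X, Y, lam, feat, num_classes):
    grad = np.zeros_like(lam)
    for x, y in zip(X, Y):
        scores = np.array([lam @ feat(x, c) for c in range(num_classes)])
        post = np.exp(scores - scores.max())
        post /= post.sum()
        grad += feat(x, y)                      # observed features
        for c in range(num_classes):
            grad -= post[c] * feat(x, c)        # expected features
    return grad

def train_maxent(X, Y, feat, num_dims, num_classes, lr=0.1, steps=200):
    lam = np.zeros(num_dims)
    for _ in range(steps):
        lam = lam + lr * maxent_gradient(X, Y, lam, feat, num_classes)
    return lam

# toy example: 1-D data, per-class features (1, x)
feat = lambda x, y: np.eye(2)[y].repeat(2) * np.array([1.0, x, 1.0, x])
X = [-2.0, -1.5, 1.0, 2.2]
Y = [0, 0, 1, 1]
print(train_maxent(X, Y, feat, num_dims=4, num_classes=2))
```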

4 Multiclass Logistic Regression

P(y \mid x) = \frac{\exp\{w_y^T x\}}{\sum_{y'} \exp\{w_{y'}^T x\}}

Multiclass logistic regression is a special case of the Maximum Entropy model: the feature vector f(x, y) has length N·K, with the observation x copied into the block corresponding to class y (and zeros elsewhere), and λ is the concatenation of the class weight vectors w_1, …, w_K, so that \lambda^T f(x, y) = w_y^T x.
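A short sketch (my own check of the statement above, with made-up weights) showing that the block feature map reproduces the usual softmax-regression scores:

```python
# Multiclass logistic regression as a MaxEnt model:
# f(x, y) = x placed in the y-th block of a length N*K vector,
# lambda = stacked class weight vectors w_1..w_K, so lam^T f(x,y) = w_y^T x.
import numpy as np

def block_features(x, y, num_classes):
    f = np.zeros(num_classes * len(x))
    f[y * len(x):(y + 1) * len(x)] = x
    return f

rng = np.random.default_rng(0)
N, K = 3, 4                       # feature dimension, number of classes
W = rng.normal(size=(K, N))       # class weight vectors w_1..w_K
lam = W.reshape(-1)               # stacked into a single lambda
x = rng.normal(size=N)

scores_maxent = np.array([lam @ block_features(x, y, K) for y in range(K)])
scores_softmax = W @ x            # w_y^T x for every class
assert np.allclose(scores_maxent, scores_softmax)
print(np.exp(scores_softmax) / np.exp(scores_softmax).sum())
```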

5 MaxEnt example
A MaxEnt model can be initialized to simulate a recognizer where classes are modeled by Gaussians. Example for two classes and 1-dimensional data:

\log p(x, y) = \log P(y) + \log \mathcal{N}(x; \mu_y, \sigma_y^2)
             = \left[\log P(y) - 0.5\,\log(2\pi\sigma_y^2) - \frac{\mu_y^2}{2\sigma_y^2}\right] + \frac{\mu_y}{\sigma_y^2}\, x - \frac{1}{2\sigma_y^2}\, x^2

so with per-class feature functions 1, x and x^2, the weights λ_k can be set directly from the class prior, mean and variance.
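A numeric sketch of this initialization (the priors, means and variances are made-up values), verifying that the resulting MaxEnt posterior matches the Bayes posterior of the Gaussian classifier:

```python
# Initialize a MaxEnt model so it reproduces a two-class classifier with
# 1-D Gaussian class models, using per-class features (1, x, x^2).
import numpy as np

priors = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])
var = np.array([1.5, 0.5])

# lambda for class y over the features (1, x, x^2)
lam = np.stack([
    np.log(priors) - 0.5 * np.log(2 * np.pi * var) - mu**2 / (2 * var),  # bias
    mu / var,                                                            # x
    -1.0 / (2 * var),                                                    # x^2
], axis=1)                                   # shape (2 classes, 3 features)

def maxent_posterior(x):
    scores = lam @ np.array([1.0, x, x**2])
    p = np.exp(scores - scores.max())
    return p / p.sum()

def gaussian_posterior(x):
    lik = priors * np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return lik / lik.sum()

x = 0.7
assert np.allclose(maxent_posterior(x), gaussian_posterior(x))
print(maxent_posterior(x))
```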

6 Bayesian Networks
The graph corresponds to a particular factorization of the joint probability distribution over a set of random variables. Nodes are random variables, but the graph does not say what the distributions of the variables are. The graph represents a set of distributions that conform to the factorization.

7 Bayesian Networks for GMM
s is a discrete latent random variable identifying the Gaussian component generating observation x (graph: s → x).

P(x, s) = P(s)\, p(x \mid s)

To compute the likelihood of the observed data, we need to marginalize over the latent variable s:

P(x) = \sum_s P(s)\, p(x \mid s)
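A minimal sketch of this marginalization for a 1-D GMM (the component weights, means and variances below are made-up values):

```python
# P(x) = sum_s P(s) p(x | s) with Gaussian components p(x | s).
import numpy as np

weights = np.array([0.3, 0.7])            # P(s)
mu = np.array([0.0, 4.0])                 # component means
var = np.array([1.0, 2.0])                # component variances

def gmm_likelihood(x):
    comp = np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.sum(weights * comp)         # marginalize over the latent s

print(gmm_likelihood(1.5))
```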

8 Bayesian Networks for GMM
Multiple observations (graph: s_1, …, s_N, with each s_n → x_n):

P(x_1, \ldots, x_N, s_1, \ldots, s_N) = \prod_n P(s_n)\, p(x_n \mid s_n)

P(x_1, \ldots, x_N) = \sum_S \prod_n P(s_n)\, p(x_n \mid s_n)

9 Bayesian Networks for HMM
The s_i nodes are not "HMM states"; they are random variables (one for each frame) whose values say which state we are in for frame i (graph: s_1 → s_2 → … → s_N, with each s_n → x_n).

P(x_1, \ldots, x_N, s_1, \ldots, s_N) = P(s_1)\, p(x_1 \mid s_1) \prod_{n=2}^{N} P(s_n \mid s_{n-1})\, p(x_n \mid s_n)

To evaluate the likelihood of the data p(x_1, …, x_N), we marginalize over all state sequences (all possible values of s_1, …, s_N), e.g. using Dynamic Programming.
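A sketch (my own, in the standard textbook form) of that Dynamic Programming marginalization, i.e. the forward algorithm; the parameter values are made up:

```python
# Forward algorithm: sums P(s_1)p(x_1|s_1) prod_n P(s_n|s_{n-1})p(x_n|s_n)
# over all state sequences in O(N * S^2) time.
import numpy as np

def hmm_likelihood(init, trans, obs_lik):
    """init[i] = P(s_1 = i); trans[i, j] = P(s_n = j | s_{n-1} = i);
    obs_lik[n, i] = p(x_n | s_n = i). Returns p(x_1, ..., x_N)."""
    alpha = init * obs_lik[0]
    for n in range(1, obs_lik.shape[0]):
        alpha = (alpha @ trans) * obs_lik[n]
    return alpha.sum()

init = np.array([1.0, 0.0])
trans = np.array([[0.9, 0.1],
                  [0.0, 1.0]])
obs_lik = np.array([[0.8, 0.1],
                    [0.6, 0.3],
                    [0.2, 0.7]])
print(hmm_likelihood(init, trans, obs_lik))
```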

10 Conditional independence
Bayesian Networks allow us to read conditional independence properties from the graph. But the opposite is true for the example shown on the slide.

11 Markov Random Fields
Undirected graphical model that directly describes conditional independence properties. In the example: P(x1, x4 | x2, x3) = P(x1 | x2, x3) P(x4 | x2, x3); x1 and x4 are independent given x2 and x3, as there is no path from x1 to x4 that does not lead through either x2 or x3.
Subsets of nodes in which all nodes are connected to each other are called cliques; the outline in blue marks a maximal clique.
When factorizing the distribution described by an MRF, variables not connected by a link must not appear in the same factor, so we make the factors correspond to (maximal) cliques.

12 MRF - factorization
The joint probability distribution over all random variables x can be expressed as a normalized product of potential functions ψ_C(x_C), which are positive-valued functions of the subsets of variables x_C corresponding to the maximal cliques C:

P(\mathbf{x}) = \frac{1}{Z} \prod_C \psi_C(\mathbf{x}_C)

It is useful to express the potential functions in terms of energy functions, \psi_C(\mathbf{x}_C) = \exp\{-E(\mathbf{x}_C)\}, so the product of potentials becomes a sum of energies E(x_C).

13 MRF - factorization
For the example graph (maximal cliques {x1, x2, x3} and {x2, x3, x4}):

P(x_1, x_2, x_3, x_4) = \frac{1}{Z}\, \psi_{123}(x_1, x_2, x_3)\, \psi_{234}(x_2, x_3, x_4) = \frac{1}{Z} \exp\{-E(x_1, x_2, x_3) - E(x_2, x_3, x_4)\}

14 Checking conditional independence

P(x_1, x_2, x_3, x_4) = \frac{1}{Z}\, \psi_{123}(x_1, x_2, x_3)\, \psi_{234}(x_2, x_3, x_4)

P(x_2, x_3) = \sum_{x_1} \sum_{x_4} \frac{1}{Z}\, \psi_{123}(x_1, x_2, x_3)\, \psi_{234}(x_2, x_3, x_4) = \frac{1}{Z} \left[\sum_{x_1} \psi_{123}(x_1, x_2, x_3)\right] \left[\sum_{x_4} \psi_{234}(x_2, x_3, x_4)\right]

P(x_1, x_4 \mid x_2, x_3) = \frac{P(x_1, x_2, x_3, x_4)}{P(x_2, x_3)} = \frac{\psi_{123}(x_1, x_2, x_3)}{\sum_{x_1} \psi_{123}(x_1, x_2, x_3)} \cdot \frac{\psi_{234}(x_2, x_3, x_4)}{\sum_{x_4} \psi_{234}(x_2, x_3, x_4)} = P(x_1 \mid x_2, x_3)\, P(x_4 \mid x_2, x_3)
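A numeric sketch of the same check (the potential values are random, made-up numbers): for binary variables, the conditional P(x1, x4 | x2, x3) computed from the full joint factorizes into the outer product of its marginals.

```python
# 4-node MRF with maximal cliques {x1,x2,x3} and {x2,x3,x4}:
# verify numerically that x1 and x4 are independent given x2 and x3.
import numpy as np

rng = np.random.default_rng(1)
psi_123 = rng.uniform(0.1, 1.0, size=(2, 2, 2))   # psi(x1, x2, x3)
psi_234 = rng.uniform(0.1, 1.0, size=(2, 2, 2))   # psi(x2, x3, x4)

# Joint over binary variables: P(x1,x2,x3,x4) proportional to psi_123 * psi_234
joint = np.einsum('abc,bcd->abcd', psi_123, psi_234)
joint /= joint.sum()                               # the sum is Z

x2, x3 = 1, 0
cond = joint[:, x2, x3, :] / joint[:, x2, x3, :].sum()   # P(x1, x4 | x2, x3)
p_x1 = cond.sum(axis=1)                                  # P(x1 | x2, x3)
p_x4 = cond.sum(axis=0)                                  # P(x4 | x2, x3)
assert np.allclose(cond, np.outer(p_x1, p_x4))
print(cond)
```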

15 Markov Random Fields for HMM
Chain-structured undirected graph: s_1 – s_2 – … – s_N, with each s_n connected to x_n.

P(x_1, \ldots, x_N, s_1, \ldots, s_N) = \frac{1}{Z} \left[ \psi(s_1, x_1) \prod_{n=2}^{N} \tilde{\psi}(s_n, s_{n-1}, x_n) \right]

For Z = 1, \psi(s_1, x_1) = P(s_1)\, p(x_1 \mid s_1) and \tilde{\psi}(s_n, s_{n-1}, x_n) = P(s_n \mid s_{n-1})\, p(x_n \mid s_n), we obtain the HMM model:

P(x_1, \ldots, x_N, s_1, \ldots, s_N) = P(s_1)\, p(x_1 \mid s_1) \prod_{n=2}^{N} P(s_n \mid s_{n-1})\, p(x_n \mid s_n)

16 Markov Random Fields for HMM
The HMM is only one of the possible distributions represented by this graph. In the case of the HMM, the individual factors are already well-normalized distributions, so Z = 1.
With general "unnormalized" potential functions, it would be difficult to compute Z, as we would have to integrate over all real-valued variables x_n. However, it is not difficult to evaluate the conditional probability

P(S \mid X) = \frac{p(X, S)}{\sum_{S'} p(X, S')}

The normalization terms Z in the numerator and denominator cancel. The sum in the denominator is over all possible state sequences, but the terms in the sum are just products of factors, as for the HMM, so we can use the same Dynamic Programming trick. We can also find the most likely sequence S using the familiar Viterbi algorithm.
To train such a model (the parameters of the potential functions), we can directly maximize the conditional likelihood P(S|X), i.e. discriminative training (like MMI or logistic regression).
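A sketch (my own, standard chain DP) of evaluating P(S | X) with unnormalized chain potentials: the numerator is the score of the given sequence, and the denominator is a sum over all state sequences computed with the same forward-style recursion as for HMMs. The array layout of `log_psi` is an assumption of this sketch.

```python
# log_psi[n, i, j] = log psi~(s_n = j, s_{n-1} = i, x_n) for n >= 1,
# log_psi[0, 0, j] = log psi(s_1 = j, x_1). Returns log P(S = states | X).
import numpy as np

def sequence_log_posterior(log_psi, states):
    N, _, S = log_psi.shape
    # Numerator: score of the given state sequence.
    num = log_psi[0, 0, states[0]]
    for n in range(1, N):
        num += log_psi[n, states[n - 1], states[n]]
    # Denominator: log-sum over all state sequences (forward recursion).
    alpha = log_psi[0, 0]                               # shape (S,)
    for n in range(1, N):
        alpha = np.array([np.logaddexp.reduce(alpha + log_psi[n, :, j])
                          for j in range(S)])
    return num - np.logaddexp.reduce(alpha)

rng = np.random.default_rng(2)
log_psi = rng.normal(size=(4, 3, 3))                    # 4 frames, 3 states
print(np.exp(sequence_log_posterior(log_psi, [0, 2, 2, 1])))
```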

17 Conditional Random Fields
Let us consider a special form of the potential functions:

\tilde{\psi}(s_n, s_{n-1}, x_n) = \exp\left\{ \sum_k \lambda_k f_k(s_n, s_{n-1}, x_n) \right\}

f_k – predefined feature functions, λ_k – trainable parameters (the feature functions may depend on the state transition (s_n, s_{n−1}), on the observation (s_n, x_n), or on both). We can rewrite

P(S \mid X) \propto \psi(s_1, x_1) \prod_{n=2}^{N} \tilde{\psi}(s_n, s_{n-1}, x_n)

as

P(S \mid X) = \frac{\exp\left\{ \sum_k \lambda_k \sum_{n=2}^{N} f_k(s_n, s_{n-1}, x_n) \right\}}{\sum_{S'} \exp\left\{ \sum_k \lambda_k \sum_{n=2}^{N} f_k(s'_n, s'_{n-1}, x_n) \right\}}

18 Conditional Random Fields
The model can be re-writen to a form that is very similar to Maximum entropy models (Logistic regression) However, S and X are sequences here (not only class identity and input vector) = 2 6 4 ~ 1 . K L 3 7 5 f ( S ; X ) = 2 6 4 P N n ~ 1 s . K x L 3 7 5 P ( S j X ) = e x p f T ; g
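A tiny sketch (my own illustration, with hypothetical local feature functions and made-up weights) of this sequence-level MaxEnt form; the denominator is enumerated by brute force here purely for clarity, whereas in practice it would be computed with Dynamic Programming.

```python
# P(S | X) = exp(lam^T f(S, X)) / sum_{S'} exp(lam^T f(S', X)),
# where f(S, X) = sum_n f(s_n, s_{n-1}, x_n) accumulates local features.
import numpy as np
from itertools import product

NUM_STATES = 2

def local_features(s_cur, s_prev, x):
    """Hypothetical local feature functions f_k(s_n, s_{n-1}, x_n)."""
    return np.array([
        x if s_cur == 0 else 0.0,          # observation feature for state 0
        x if s_cur == 1 else 0.0,          # observation feature for state 1
        1.0 if s_cur == s_prev else 0.0,   # "stay in the same state"
        1.0 if s_cur != s_prev else 0.0,   # "switch state"
    ])

def global_features(S, X):
    return sum(local_features(S[n], S[n - 1] if n else S[0], X[n])
               for n in range(len(X)))

def crf_posterior(S, X, lam):
    num = np.exp(lam @ global_features(S, X))
    den = sum(np.exp(lam @ global_features(Sp, X))
              for Sp in product(range(NUM_STATES), repeat=len(X)))
    return num / den

X = [0.2, 1.3, -0.4]
lam = np.array([1.0, -1.0, 0.5, -0.5])
print(crf_posterior([0, 1, 1], X, lam))
```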

19 Hidden Conditional Random Fields
As with HMMs, we can use CRFs to model state sequences, but the state sequence is not really what we are interested in; we are interested in sequences of words. Maybe we can live with decoding the most likely sequence of states (as we do anyway with HMMs), but for training we usually only know the sequence of words (or phonemes), not the states. HCRFs therefore marginalize over all state sequences corresponding to a sequence of words.
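A toy sketch of this marginalization (entirely my own illustration: the word set, the left-to-right state topology in `allowed_sequences`, and the placeholder scores are all made-up assumptions, not from the slides): a word is scored by summing exp(score) over every state sequence consistent with it, rather than committing to a single sequence.

```python
# HCRF idea: P(word | X) proportional to
# sum over state sequences S allowed for the word of exp(score(S, X)).
import numpy as np
from itertools import product

STATE_WEIGHTS = {"yes": [1.0, -0.5], "no": [-1.0, 0.3, 0.8]}   # made up

def allowed_sequences(word, num_frames):
    """All left-to-right passes through the word's states (hypothetical)."""
    n_states = len(STATE_WEIGHTS[word])
    for seq in product(range(n_states), repeat=num_frames):
        if (seq[0] == 0 and seq[-1] == n_states - 1
                and all(0 <= b - a <= 1 for a, b in zip(seq, seq[1:]))):
            yield seq

def sequence_score(word, seq, X):
    """Placeholder for lambda^T f(S, X)."""
    return sum(STATE_WEIGHTS[word][s] * x for s, x in zip(seq, X))

def word_posterior(X):
    totals = {w: sum(np.exp(sequence_score(w, s, X))
                     for s in allowed_sequences(w, len(X)))
              for w in STATE_WEIGHTS}
    z = sum(totals.values())
    return {w: t / z for w, t in totals.items()}

print(word_posterior([0.1, 0.5, 0.9, 0.2]))
```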

20 Hidden Conditional Random Fields
Still, we can initialize an HCRF to simulate HMMs whose states are modeled by Gaussians or GMMs.

21 Hidden Conditional Random Fields
Still, we can use Dynamic Programming to efficiently evaluate the normalizing constant and to decode, similarly to HMMs.

22 Segmental CRF for LVCSR
Let us have some unit "detectors":
- phone bigram recognizer
- multi-phone unit recognizer
If we knew the word boundaries, we would not care about any sequences; we would just train a Maximum Entropy model whose feature functions return quantities derived from the units detected in the word span.

23 SCRF for LVCSR features
- Ngram Existence Features
- Ngram Expectation Features

24 SCRF for LVCSR features
Levenshtein Features – compare the units detected in the segment/word span with the desired pronunciation; see the sketch below.
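A minimal sketch of such a feature (standard edit-distance dynamic programming, my own code; the example unit sequences are made up): count the edits between the detected units in the span and the dictionary pronunciation.

```python
# Edit distance between two unit sequences (lists of phone labels).
def levenshtein(detected, reference):
    d = [[0] * (len(reference) + 1) for _ in range(len(detected) + 1)]
    for i in range(len(detected) + 1):
        d[i][0] = i
    for j in range(len(reference) + 1):
        d[0][j] = j
    for i in range(1, len(detected) + 1):
        for j in range(1, len(reference) + 1):
            cost = 0 if detected[i - 1] == reference[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1]

# e.g. units detected in a word span vs. the dictionary pronunciation
print(levenshtein(["s", "b", "iy", "ch"], ["s", "p", "iy", "ch"]))   # -> 1
```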

25 Segmental CRF for LVCSR
State sequence = word sequence. The CRF observations are segments of frames. However, all possible segmentations of frames into observations must be taken into account.

26 Segmental CRF for LVCSR
For convenience, we make the observation depend also on the previous state/word, which simplifies the equations. We marginalize over all possible segmentations, as sketched below.
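A sketch (my own, standard segmentation DP) of that marginalization: sum, over all ways of splitting N frames into contiguous segments, the product of per-segment scores. `segment_score` is a hypothetical stand-in for the exponentiated segment feature score exp(λ^T f(segment, word, previous word)).

```python
# alpha[t] = total score of all segmentations of the first t frames.
import numpy as np

def segment_score(frames):
    """Hypothetical stand-in for an exponentiated segment feature score."""
    return np.exp(-0.1 * len(frames)) * (1.0 + abs(np.mean(frames)))

def sum_over_segmentations(X):
    N = len(X)
    alpha = np.zeros(N + 1)
    alpha[0] = 1.0
    for t in range(1, N + 1):
        for start in range(t):                 # last segment = X[start:t]
            alpha[t] += alpha[start] * segment_score(X[start:t])
    return alpha[N]

X = [0.3, -0.2, 0.8, 0.1, 0.4]
print(sum_over_segmentations(X))
```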

27 State transition features
- LM features
- Baseline features

28 Results

