Ch 9. Markov Models. Natural Language Processing Lab, Korea University. Han Kyung-Soo (한경수). 2000. 3. 25.

Natural Language Processing Lab., Korea Univ.
Contents
- Markov models
- Hidden Markov models
- The three fundamental questions for HMMs
  - Finding the probability of an observation
  - Finding the best state sequence
  - Parameter estimation
- HMMs: implementation, properties, and variants

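Markov Models
- $X = (X_1, \ldots, X_T)$: a sequence of random variables taking values in some finite set $S = \{s_1, \ldots, s_N\}$, the state space.
- Markov properties
  - Limited horizon: $P(X_{t+1} = s_k \mid X_1, \ldots, X_t) = P(X_{t+1} = s_k \mid X_t)$
  - Time invariant (stationary): $P(X_{t+1} = s_k \mid X_t) = P(X_2 = s_k \mid X_1)$
- If X has the Markov property, X is said to be a Markov chain.
- Stochastic transition matrix A: $a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$
- Probabilities of different initial states: $\pi_i = P(X_1 = s_i)$, with $\sum_{i=1}^{N} \pi_i = 1$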

Markov Models (Cont.)
- Markov models can be used whenever one wants to model the probability of a linear sequence of events: word n-gram models, modeling valid phone sequences in speech recognition, sequences of speech acts in dialog systems.
- A Markov model can be thought of as a probabilistic finite-state automaton.
- Probability of a sequence of states: $P(X_1, \ldots, X_T) = \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}$
- m-th order Markov model: m is the number of previous states used to predict the next state.
- An n-gram model is equivalent to an (n-1)-th order Markov model.
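The probability of a state sequence in a first-order Markov chain can be sketched in a few lines of Python. This is only an illustration: the two-state chain and its numbers are hypothetical (they are not the model of Figure 9.1), and `sequence_probability` is a name chosen here.

```python
def sequence_probability(pi, A, states):
    """P(X_1..X_T) = pi[X_1] * product over t of A[X_t][X_{t+1}]."""
    p = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev][nxt]
    return p

# Hypothetical two-state chain; each row of A sums to 1.
pi = [1.0, 0.0]
A = [[0.7, 0.3],
     [0.5, 0.5]]

# Sequence 0 -> 0 -> 1 -> 1: 1.0 * 0.7 * 0.3 * 0.5
p = sequence_probability(pi, A, [0, 0, 1, 1])
print(p)
```

Because of the limited horizon property, each factor depends only on the immediately preceding state, which is why a single pass over adjacent pairs is enough.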

Markov Models (Cont.)
[Figure: P.319, Figure 9.1]

Hidden Markov Models
- In an HMM, you don't know the state sequence that the model passes through, but only some probabilistic function of it.
- Emission probability for the observations: $P(O_t = k \mid X_t = s_i, X_{t+1} = s_j) = b_{ijk}$
- Example: the crazy soft drink machine
  - Q: What is the probability of seeing the output sequence {lem, ice_t} if the machine always starts off in the cola preferring state?
  - A: Consider all paths that might be taken through the HMM, and sum over them.

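Hidden Markov Models (Cont.): The Crazy Soft Drink Machine
States: CP (cola preferring, the start state) and IP (iced tea preferring).
Transition probabilities: CP -> CP 0.7, CP -> IP 0.3, IP -> CP 0.5, IP -> IP 0.5.
Output probabilities given the state:
      cola   iced tea (ice_t)   lemonade (lem)
CP    0.6    0.1                0.3
IP    0.1    0.7                0.2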
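The question about the output sequence {lem, ice_t} can be answered by brute force, enumerating every path and summing. The sketch below assumes, as in this example, that the symbol emitted at time t depends only on the state the machine is in at time t; the dictionary layout and the function name are choices made here.

```python
from itertools import product

STATES = ["CP", "IP"]
# Transition probabilities A[from][to].
A = {"CP": {"CP": 0.7, "IP": 0.3},
     "IP": {"CP": 0.5, "IP": 0.5}}
# Output probabilities B[state][symbol].
B = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

def brute_force_probability(obs, start="CP"):
    """Sum P(O, X) over every state path X that begins in `start`."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):  # candidate X_2 .. X_{T+1}
        p, prev = 1.0, start
        for symbol, nxt in zip(obs, path):
            p *= B[prev][symbol] * A[prev][nxt]    # emit, then move
            prev = nxt
        total += p
    return total

prob = brute_force_probability(["lem", "ice_t"])
print(round(prob, 6))  # 0.084
```

The four paths contribute 0.0147, 0.0063, 0.0315, and 0.0315. The number of paths grows exponentially with the length of the observation sequence, which is what the trellis algorithms avoid.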

Hidden Markov Models (Cont.): Why use HMMs?
- HMMs are useful when one can think of underlying events probabilistically generating surface events.
  - POS tagging (Chap. 10)
- There exist efficient methods of training through use of the EM algorithm.
  - Given plenty of data that we assume to be generated by some HMM, this algorithm allows us to automatically learn the model parameters that best account for the observed data.
- Linear interpolation of n-gram models
  - We can build an HMM with hidden states that represent the choice of whether to use the unigram, bigram, or trigram probabilities.

Hidden Markov Models (Cont.): Linear interpolation of n-gram models
[Figure: P.323, Figure 9.3]
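For reference, the interpolated estimate that such an HMM encodes can be written out as below. This is the usual form of linear interpolation over unigram, bigram, and trigram estimates; the symbols $\lambda_i$, $P_1$, $P_2$, $P_3$ are notation assumed here rather than taken from the slide.

```latex
P_{li}(w_n \mid w_{n-2}, w_{n-1})
  = \lambda_1 P_1(w_n)
  + \lambda_2 P_2(w_n \mid w_{n-1})
  + \lambda_3 P_3(w_n \mid w_{n-2}, w_{n-1}),
\qquad \lambda_i \ge 0,\ \ \sum_{i=1}^{3} \lambda_i = 1
```

Read as an HMM, each weight $\lambda_i$ plays the role of the probability of moving into the hidden state that selects the i-th estimator, so the weights can be trained like any other transition probability.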

Hidden Markov Models (Cont.): General form of an HMM
- An HMM is specified by a five-tuple $(S, K, \Pi, A, B)$
  - $S = \{s_1, \ldots, s_N\}$: set of states
  - $K = \{k_1, \ldots, k_M\}$: output alphabet
  - $\Pi = \{\pi_i\}, i \in S$: initial state probabilities
  - $A = \{a_{ij}\}, i, j \in S$: state transition probabilities
  - $B = \{b_{ijk}\}, i, j \in S, k \in K$: symbol emission probabilities
- $X = (X_1, \ldots, X_{T+1})$: state sequence
- $O = (o_1, \ldots, o_T)$: output sequence
- Arc-emission HMM vs. state-emission HMM
  - Arc-emission HMM: the symbol emitted at time t depends on both the state at time t and the state at time t+1.
  - State-emission HMM: the symbol emitted at time t depends just on the state at time t.

The Three Fundamental Questions for HMMs
1. Given a model $\mu = (A, B, \Pi)$, how do we efficiently compute how likely a certain observation is, that is, $P(O \mid \mu)$?
   - Used to decide which of several models is best.
2. Given the observation sequence O and a model $\mu$, how do we choose a state sequence $(X_1, \ldots, X_{T+1})$ that best explains the observations?
   - Guess what path was probably followed through the Markov chain; used for classification (e.g. POS tagging).
3. Given an observation sequence O, and a space of possible models found by varying the model parameters $\mu = (A, B, \Pi)$, how do we find the model that best explains the observed data?
   - Estimate model parameters from data.

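The Three Fundamental Questions for HMMs (Cont.): Finding the probability of an observation
- Decoding: compute $P(O \mid \mu)$ by summing over all state sequences:
  $P(O \mid \mu) = \sum_X P(O \mid X, \mu) P(X \mid \mu) = \sum_{X_1 \cdots X_{T+1}} \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}} b_{X_t X_{t+1} o_t}$
- Direct evaluation requires $(2T+1) \cdot N^{T+1}$ multiplications.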

Finding the probability of an observation (Cont.): Trellis algorithms
- The secret to avoiding this complexity is the general technique of dynamic programming.
  - Remember partial results rather than recomputing them.
- Trellis algorithms
  - Make a square array of states versus time.
  - Compute the probabilities of being at each state at each time in terms of the probabilities for being in each state at the preceding time instant.

Finding the probability of an observation (Cont.): Trellis algorithms (Cont.)
[Figure: P.328, Figure 9.5]

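Finding the probability of an observation (Cont.): The forward procedure
- Forward variables: $\alpha_i(t) = P(o_1 o_2 \cdots o_{t-1}, X_t = i \mid \mu)$
  - $\alpha_i(t)$ is stored at $(s_i, t)$ in the trellis and expresses the total probability of ending up in state $s_i$ at time t.
  - It is calculated by summing probabilities for all incoming arcs at a trellis node.
1. Initialization: $\alpha_i(1) = \pi_i, \quad 1 \le i \le N$
2. Induction: $\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t) a_{ij} b_{ij o_t}, \quad 1 \le t \le T, \ 1 \le j \le N$
3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T+1)$
- Requires $2N^2T$ multiplications.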
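A minimal sketch of the forward procedure, applied to the crazy soft drink machine, might look as follows. It assumes the emission probability depends only on the source state of each arc (so `B[i][o]` stands in for $b_{ij o_t}$), which holds for that example; the index encoding of states and symbols is a choice made here.

```python
def forward(pi, A, B, obs):
    """Return (alpha, P(O | mu)); alpha[t] holds the forward variables for time t+1."""
    n = len(pi)
    alpha = [list(pi)]                        # 1. initialization
    for o in obs:                             # 2. induction over t = 1..T
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] * B[i][o] for i in range(n))
                      for j in range(n)])
    return alpha, sum(alpha[-1])              # 3. total

# Crazy soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
pi = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
B = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]

alpha, prob = forward(pi, A, B, [2, 1])       # observation sequence (lem, ice_t)
print(round(prob, 6))  # 0.084
```

Each trellis column is computed from the previous one only, so the work is proportional to $N^2 T$ rather than growing exponentially in T.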

Finding the probability of an observation (Cont.): The forward procedure
[Figure: P.329, Figure 9.6]

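Finding the probability of an observation (Cont.): The backward procedure
- Backward variables: $\beta_i(t) = P(o_t \cdots o_T \mid X_t = i, \mu)$
  - The total probability of seeing the rest of the observation sequence given that we were in state $s_i$ at time t.
  - The combination of forward and backward probabilities is vital for solving the third problem, parameter reestimation.
1. Initialization: $\beta_i(T+1) = 1, \quad 1 \le i \le N$
2. Induction: $\beta_i(t) = \sum_{j=1}^{N} a_{ij} b_{ij o_t} \beta_j(t+1), \quad 1 \le t \le T, \ 1 \le i \le N$
3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \pi_i \beta_i(1)$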
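The backward procedure can be sketched in the same style. As before, this assumes emissions depend only on the source state of an arc, and uses the soft drink machine with an index encoding chosen here for illustration.

```python
def backward(pi, A, B, obs):
    """Return (beta, P(O | mu)); beta[t] holds the backward variables for time t+1."""
    n = len(pi)
    beta = [[1.0] * n]                        # 1. initialization at time T+1
    for o in reversed(obs):                   # 2. induction, from t = T down to 1
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[i][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    total = sum(pi[i] * beta[0][i] for i in range(n))   # 3. total
    return beta, total

# Crazy soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
pi = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
B = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]

beta, prob = backward(pi, A, B, [2, 1])       # observation sequence (lem, ice_t)
print(round(prob, 6))  # 0.084
```

The total agrees with the value obtained by summing over all paths, as it should, since both compute $P(O \mid \mu)$.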

Finding the probability of an observation (Cont.): Variable calculations
[Table: P.330, Table 9.2]

Finding the probability of an observation (Cont.): Combining them
- $P(O, X_t = i \mid \mu) = \alpha_i(t) \beta_i(t)$
- $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t) \beta_i(t), \quad 1 \le t \le T+1$

The Three Fundamental Questions for HMMs (Cont.): Finding the best state sequence
- Choosing the states individually
  - For each t, $1 \le t \le T+1$, we would find the $X_t$ that maximizes $P(X_t \mid O, \mu)$.
  - $\gamma_i(t) = P(X_t = i \mid O, \mu) = \dfrac{\alpha_i(t) \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t) \beta_j(t)}$
  - The individually most likely state: $\hat{X}_t = \arg\max_{1 \le i \le N} \gamma_i(t)$
- This choice maximizes the expected number of states that will be guessed correctly.
- However, it may yield a quite unlikely state sequence, so this is not the method that is normally used.
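Choosing the states individually can be sketched by combining forward and backward variables. The code below is an illustration under the same assumptions as the soft drink machine example (emissions depend only on the source state of an arc); the function names are chosen here.

```python
def forward(pi, A, B, obs):
    n = len(pi)
    alpha = [list(pi)]
    for o in obs:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] * B[i][o] for i in range(n))
                      for j in range(n)])
    return alpha

def backward(A, B, obs):
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[i][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def posterior_states(pi, A, B, obs):
    """Return (gammas, best): state posteriors and their argmax for t = 1..T+1."""
    gammas, best = [], []
    for a_t, b_t in zip(forward(pi, A, B, obs), backward(A, B, obs)):
        joint = [a * b for a, b in zip(a_t, b_t)]   # P(O, X_t = i | mu)
        total = sum(joint)                          # P(O | mu), the same at every t
        gamma = [x / total for x in joint]
        gammas.append(gamma)
        best.append(max(range(len(gamma)), key=gamma.__getitem__))
    return gammas, best

# Crazy soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
pi = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
B = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]

gammas, best = posterior_states(pi, A, B, [2, 1])   # observations (lem, ice_t)
print(best)  # [0, 1, 0], i.e. CP, IP, CP
```

Each position is decided independently, so nothing in this procedure checks that consecutive choices are joined by a likely transition. That is the weakness the Viterbi algorithm addresses.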

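Finding the best state sequence (Cont.): Viterbi algorithm
- We want to find the most likely complete path: $\arg\max_X P(X \mid O, \mu)$. For a fixed O, it is sufficient to maximize $\arg\max_X P(X, O \mid \mu)$.
- $\delta_j(t) = \max_{X_1 \cdots X_{t-1}} P(X_1 \cdots X_{t-1}, o_1 \cdots o_{t-1}, X_t = j \mid \mu)$
  - This variable stores, for each point in the trellis, the probability of the most probable path that leads to that node.
- $\psi_j(t)$
  - Records the node of the incoming arc that led to this most probable path.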

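Finding the best state sequence (Cont.): Viterbi algorithm (Cont.)
1. Initialization: $\delta_j(1) = \pi_j, \quad 1 \le j \le N$
2. Induction: $\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t) a_{ij} b_{ij o_t}, \quad 1 \le j \le N$
   Store backtrace: $\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t) a_{ij} b_{ij o_t}, \quad 1 \le j \le N$
3. Termination and path readout (by backtracking):
   $\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$
   $\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$
   $P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)$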
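The three Viterbi steps can be sketched as below, again on the soft drink machine and again assuming that emissions depend only on the source state of an arc. The observation sequence (lem, ice_t, cola) is picked here for illustration.

```python
def viterbi(pi, A, B, obs):
    """Return (path, probability) of the most likely state sequence X_1..X_{T+1}."""
    n = len(pi)
    delta = [list(pi)]                         # 1. initialization
    psi = []                                   # backtrace pointers
    for o in obs:                              # 2. induction
        prev, row, back = delta[-1], [], []
        for j in range(n):
            scores = [prev[i] * A[i][j] * B[i][o] for i in range(n)]
            i_best = max(range(n), key=scores.__getitem__)
            row.append(scores[i_best])
            back.append(i_best)                # store backtrace
        delta.append(row)
        psi.append(back)
    last = max(range(n), key=delta[-1].__getitem__)   # 3. termination
    path = [last]
    for back in reversed(psi):                 # path readout by backtracking
        path.append(back[path[-1]])
    path.reverse()
    return path, delta[-1][last]

# Crazy soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
pi = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
B = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]

path, prob = viterbi(pi, A, B, [2, 1, 0])      # observations (lem, ice_t, cola)
print(path)  # [0, 1, 0, 0], i.e. CP, IP, CP, CP
```

The only difference from the forward procedure is that the sum over incoming arcs is replaced by a maximum, plus the bookkeeping needed to recover the path. Ties are broken in favor of the lower state index in this sketch.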

The Three Fundamental Questions for HMMs (Cont.): The third problem: Parameter estimation
- We want $\arg\max_{\mu} P(O_{\text{training}} \mid \mu)$.
- There is no known analytic method to choose $\mu$ to maximize $P(O \mid \mu)$.
- We can locally maximize it by an iterative hill-climbing algorithm: the Baum-Welch or Forward-Backward algorithm.
  - Work out the probability of the observation sequence using some (perhaps randomly chosen) model.
  - We can see which state transitions and symbol emissions were probably used the most.
  - By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.
- This process is called training.

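The third problem: Parameter estimation (Cont.): Baum-Welch algorithm
- Probability of traversing a certain arc at time t given observation sequence O:
  $p_t(i, j) = P(X_t = i, X_{t+1} = j \mid O, \mu) = \dfrac{\alpha_i(t) a_{ij} b_{ij o_t} \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t) \beta_m(t)}$
- $\gamma_i(t) = \sum_{j=1}^{N} p_t(i, j)$
- $\sum_{t=1}^{T} \gamma_i(t)$ = expected number of transitions from state i in O
- $\sum_{t=1}^{T} p_t(i, j)$ = expected number of transitions from state i to j in O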

The third problem: Parameter estimation (Cont.): Baum-Welch algorithm
[Figure: P.334, Figure 9.7]

The third problem: Parameter estimation (Cont.): Baum-Welch algorithm (Cont.)
- Begin with some model $\mu$ (perhaps preselected, perhaps just chosen randomly).
- Run O through the current model to estimate the expectations of each model parameter.
- Change the model to maximize the values of the paths that are used a lot.
- Repeat this process, hoping to converge on optimal values for the model parameters $\mu$.

The third problem: Parameter estimation (Cont.): Baum-Welch algorithm (Cont.)
- Reestimation: from $\mu = (A, B, \Pi)$, derive $\hat{\mu} = (\hat{A}, \hat{B}, \hat{\Pi})$, with $P(O \mid \hat{\mu}) \ge P(O \mid \mu)$.
- Continue reestimating the parameters until the results are no longer improving significantly.
- This does not guarantee that we will find the best model: the procedure may stop at a local maximum or a saddle point.

The third problem: Parameter estimation (Cont.): Baum-Welch algorithm (Cont.)
[See P.336]
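One reestimation step of Baum-Welch can be sketched as follows. This is a simplified illustration, not the slides' exact formulation: it assumes emissions depend only on the source state of an arc (so the emission update is tied across destination states), it handles a single observation sequence, and it does not guard against states whose expected count is zero. All function names are chosen here.

```python
def forward(pi, A, B, obs):
    n = len(pi)
    alpha = [list(pi)]
    for o in obs:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] * B[i][o] for i in range(n))
                      for j in range(n)])
    return alpha

def backward(A, B, obs):
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[i][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def likelihood(pi, A, B, obs):
    return sum(forward(pi, A, B, obs)[-1])

def baum_welch_step(pi, A, B, obs):
    """One reestimation: expected counts under the current model, then renormalize."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    p_obs = sum(alpha[-1])
    # p[t][i][j]: probability of traversing arc i -> j at time t+1, given O.
    p = [[[alpha[t][i] * A[i][j] * B[i][obs[t]] * beta[t + 1][j] / p_obs
           for j in range(n)] for i in range(n)] for t in range(T)]
    gamma = [[sum(p[t][i]) for i in range(n)] for t in range(T)]
    from_i = [sum(gamma[t][i] for t in range(T)) for i in range(n)]  # expected transitions from i
    new_pi = list(gamma[0])
    new_A = [[sum(p[t][i][j] for t in range(T)) / from_i[i] for j in range(n)]
             for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) / from_i[i]
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B

# Crazy soft drink machine as the starting model; observations (lem, ice_t, cola).
pi = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
B = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
obs = [2, 1, 0]

before = likelihood(pi, A, B, obs)
new_pi, new_A, new_B = baum_welch_step(pi, A, B, obs)
after = likelihood(new_pi, new_A, new_B, obs)
print(before <= after)  # True
```

In practice the step is repeated until the likelihood stops improving significantly, and a real implementation would add scaling or log-space arithmetic and a guard for zero expected counts.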

HMMs: Implementation, Properties, and Variants: Implementation
- Floating point underflow
  - The probabilities we are calculating are products of many very small numbers.
- Work with logarithms
  - This also speeds up the computation.
- Employ auxiliary scaling coefficients
  - Their values grow with the time t, so that the probabilities multiplied by the scaling coefficient remain within the floating point range of the computer.
  - When the parameter values are reestimated, these scaling factors cancel out.
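One common way to realize the scaling idea is to renormalize the forward variables at every time step and accumulate the logarithms of the normalizers. The sketch below is one such variant, not necessarily the scheme the slides have in mind; it assumes emissions depend only on the source state of an arc and that every observation has nonzero probability.

```python
import math

def scaled_forward_loglik(pi, A, B, obs):
    """Return log P(O | mu), rescaling alpha at each step to avoid underflow."""
    n = len(pi)
    alpha = list(pi)                  # sums to 1, so no initial rescaling is needed
    log_lik = 0.0
    for o in obs:
        alpha = [sum(alpha[i] * A[i][j] * B[i][o] for i in range(n))
                 for j in range(n)]
        c = sum(alpha)                # scaling coefficient for this time step
        alpha = [a / c for a in alpha]
        log_lik += math.log(c)        # the product of all c equals P(O | mu)
    return log_lik

# Crazy soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
pi = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
B = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]

short = scaled_forward_loglik(pi, A, B, [2, 1])
long_run = scaled_forward_loglik(pi, A, B, [2, 1, 0] * 2000)
print(round(math.exp(short), 6))  # 0.084
print(math.isfinite(long_run))    # True
```

For the 6000-symbol sequence the unscaled probability is far below the smallest positive double, so a naive forward pass would return 0.0, while the log-likelihood stays well within range.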

HMMs: Implementation, Properties, and Variants (Cont.): Variants
- Epsilon or null transitions
- State-emission model
  - Make the output distribution dependent just on a single state.
- Large number of parameters that need to be estimated
  - Parameter tying: assume that the probability distributions on certain arcs or at certain states are the same as each other.
  - Structural zeros: decide that certain things are impossible (probability zero).

HMMs: Implementation, Properties, and Variants (Cont.): Multiple input observations
- Ergodic model
  - Every state is connected to every other state.
  - We simply concatenate all the observation sequences and train on them as one long input.
  - We do not get sufficient data to be able to reestimate the initial probabilities successfully.
- Feed forward model
  - Not fully connected.
  - There is an ordered set of states. One can only proceed at each time instant to the same or a higher numbered state.
  - We need to extend the reestimation formulae to work with a sequence of inputs.

HMMs: Implementation, Properties, and Variants (Cont.): Initialization of parameter values
- If we would rather find the global maximum, we should try to start the HMM in a region of the parameter space that is near the global maximum.
- Good initial estimates for the output parameters $B$ turn out to be particularly important, while random initial estimates for the parameters $A$ and $\Pi$ are normally satisfactory.
