Ch 9. Markov Models
Natural Language Processing Lab, Korea University
한 경 수, March 25, 2000

Contents
Markov models
Hidden Markov models
The three fundamental questions for HMMs
  Finding the probability of an observation
  Finding the best state sequence
  Parameter estimation
HMMs: implementation, properties, and variants

Markov Models
Let X = (X_1, ..., X_T) be a sequence of random variables taking values in some finite set S = {s_1, ..., s_N}, the state space. If X has the Markov properties below, X is said to be a Markov chain.
Markov properties:
Limited horizon: P(X_{t+1} = s_k | X_1, ..., X_t) = P(X_{t+1} = s_k | X_t)
Time invariant (stationary): this probability does not depend on t, i.e. it equals P(X_2 = s_k | X_1)
Stochastic transition matrix A: a_ij = P(X_{t+1} = s_j | X_t = s_i), with a_ij ≥ 0 and Σ_j a_ij = 1
Probabilities of different initial states Π: π_i = P(X_1 = s_i), with Σ_i π_i = 1

Markov Models (Cont.)
Markov models can be used whenever one wants to model the probability of a linear sequence of events:
word n-gram models
modeling valid phone sequences in speech recognition
sequences of speech acts in dialog systems
A Markov model can be thought of as a probabilistic finite-state automaton.
Probability of a sequence of states:
P(X_1, ..., X_T) = P(X_1) P(X_2 | X_1) ... P(X_T | X_{T-1}) = π_{X_1} ∏_{t=1}^{T-1} a_{X_t X_{t+1}}
m-th order Markov model: m is the number of previous states that we are using to predict the next state.
An n-gram model is equivalent to an (n-1)-th order Markov model. A short code sketch of the sequence probability follows.
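As a concrete illustration of the sequence probability above, here is a minimal Python sketch for a first-order chain. The two-state "weather" chain and its numbers are invented for the example, not taken from the chapter.

```python
import numpy as np

# Illustrative two-state Markov chain (states and numbers are invented for this example).
states = ["rain", "sun"]
pi = np.array([0.5, 0.5])            # initial state probabilities Π
A = np.array([[0.7, 0.3],            # transition probabilities out of "rain"
              [0.2, 0.8]])           # transition probabilities out of "sun"

def sequence_probability(seq, pi, A):
    """P(X_1, ..., X_T) = π_{X_1} * ∏_t a_{X_t X_{t+1}}, using limited horizon and stationarity."""
    p = pi[seq[0]]
    for s, s_next in zip(seq, seq[1:]):
        p *= A[s, s_next]
    return p

print(sequence_probability([0, 0, 1], pi, A))   # P(rain, rain, sun) = 0.5 * 0.7 * 0.3 = 0.105
```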

Markov Models (Cont.)
[Figure 9.1, p. 319]

Hidden Markov Models
In an HMM, you do not know the state sequence that the model passes through, but only some probabilistic function of it.
Each state has an emission probability distribution over the observations.
Example: the crazy soft drink machine.
Q: What is the probability of seeing the output sequence {lem, ice_t} if the machine always starts off in the cola-preferring state?
A: Consider all paths that might be taken through the HMM, and sum over them (a worked check follows the table on the next slide).

The Crazy Soft Drink Machine
Transition probabilities:
         to CP   to IP
start     1.0     0.0
CP        0.7     0.3
IP        0.5     0.5
Emission probabilities:
         cola   iced tea (ice_t)   lemonade (lem)
CP       0.6    0.1                0.3
IP       0.1    0.7                0.2
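A brute-force check of the question on the previous slide, summing over every hidden state path. It uses the probabilities in the table above and a state-emission reading (the symbol at time t is emitted from the state occupied at time t), which gives the same answer here because the emission probabilities do not depend on the destination state; the function name is just for illustration.

```python
import itertools
import numpy as np

# Crazy soft drink machine: CP = cola preferring, IP = iced tea preferring.
A = np.array([[0.7, 0.3],            # transitions out of CP
              [0.5, 0.5]])           # transitions out of IP
B = np.array([[0.6, 0.1, 0.3],       # CP emits cola, ice_t, lem
              [0.1, 0.7, 0.2]])      # IP emits cola, ice_t, lem
CP, IP = 0, 1
cola, ice_t, lem = 0, 1, 2

def prob_observations(obs, start):
    """Sum P(states) * P(obs | states) over all hidden state paths beginning in `start`."""
    total = 0.0
    for rest in itertools.product([CP, IP], repeat=len(obs) - 1):
        path = (start,) + rest
        p = 1.0
        for t, o in enumerate(obs):
            p *= B[path[t], o]                      # emit o_t from the state at time t
            if t + 1 < len(obs):
                p *= A[path[t], path[t + 1]]        # then move to the next state
        total += p
    return total

print(prob_observations([lem, ice_t], CP))          # 0.3 * (0.7*0.1 + 0.3*0.7) = 0.084
```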

Why use HMMs?
HMMs are useful when one can think of underlying events probabilistically generating surface events, e.g. POS tagging (Chap. 10).
There exist efficient methods of training through use of the EM algorithm: given plenty of data that we assume to be generated by some HMM, this algorithm allows us to automatically learn the model parameters that best account for the observed data.
Linear interpolation of n-gram models: we can build an HMM with hidden states that represent the choice of whether to use the unigram, bigram, or trigram probabilities.

Linear interpolation of n-gram models
[Figure 9.3, p. 323]

General form of an HMM
An HMM is specified by a five-tuple (S, K, Π, A, B):
S: set of states
K: output alphabet
Π: initial state probabilities, π_i = P(X_1 = s_i)
A: state transition probabilities, a_ij = P(X_{t+1} = s_j | X_t = s_i)
B: symbol emission probabilities, b_ijk = P(O_t = k | X_t = s_i, X_{t+1} = s_j)
X = (X_1, ..., X_{T+1}) is the state sequence; O = (o_1, ..., o_T) is the output sequence.
Arc-emission HMM vs. state-emission HMM:
arc-emission HMM: the symbol emitted at time t depends on both the state at time t and the state at time t+1.
state-emission HMM: the symbol emitted at time t depends just on the state at time t.
A minimal representation in code follows.
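One straightforward way to hold the five-tuple as plain arrays, shown here for the state-emission form where B depends on a single state; the class and field names are illustrative, not from the chapter.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    """A state-emission HMM μ = (S, K, Π, A, B)."""
    states: list            # S: set of states
    alphabet: list          # K: output alphabet
    pi: np.ndarray          # Π[i]    = P(X_1 = s_i)
    A: np.ndarray           # A[i, j] = P(X_{t+1} = s_j | X_t = s_i)
    B: np.ndarray           # B[i, k] = P(O_t = k | X_t = s_i)

drink_machine = HMM(
    states=["CP", "IP"],
    alphabet=["cola", "ice_t", "lem"],
    pi=np.array([1.0, 0.0]),
    A=np.array([[0.7, 0.3], [0.5, 0.5]]),
    B=np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]),
)
```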

The Three Fundamental Questions for HMMs
1. Given a model μ = (A, B, Π), how do we efficiently compute how likely a certain observation is, that is, P(O | μ)? Used to decide which of several models is best.
2. Given the observation sequence O and a model μ, how do we choose a state sequence (X_1, ..., X_{T+1}) that best explains the observations? Guess what path was probably followed through the Markov chain; used for classification (e.g. POS tagging).
3. Given an observation sequence O, and a space of possible models found by varying the model parameters μ = (A, B, Π), how do we find the model that best explains the observed data? Estimate model parameters from data.

Finding the probability of an observation
P(O | μ) = Σ_X P(O | X, μ) P(X | μ) = Σ_{X_1 ... X_{T+1}} π_{X_1} ∏_{t=1}^{T} a_{X_t X_{t+1}} b_{X_t X_{t+1} o_t}
Direct evaluation of this sum requires (2T + 1) · N^{T+1} multiplications.

Trellis algorithms
The secret to avoiding this complexity is the general technique of dynamic programming: remember partial results rather than recomputing them.
Trellis algorithms:
Make a square array of states versus time.
Compute the probabilities of being at each state at each time in terms of the probabilities for being in each state at the preceding time instant.

Trellis algorithms (Cont.)
[Figure 9.5, p. 328]

The forward procedure
Forward variables: α_i(t) = P(o_1 o_2 ... o_{t-1}, X_t = i | μ)
α_i(t) is stored at node (s_i, t) in the trellis and expresses the total probability of ending up in state s_i at time t.
It is calculated by summing probabilities for all incoming arcs at a trellis node.
1. Initialization: α_i(1) = π_i, for 1 ≤ i ≤ N
2. Induction: α_j(t+1) = Σ_{i=1}^{N} α_i(t) a_ij b_{ij o_t}, for 1 ≤ t ≤ T, 1 ≤ j ≤ N
3. Total: P(O | μ) = Σ_{i=1}^{N} α_i(T+1)
Requires 2N²T multiplications. A code sketch follows.
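A minimal sketch of the forward procedure in the state-emission convention, where α[t, i] = P(o_1 ... o_t, X_t = i); the chapter's arc-emission version indexes α and b slightly differently, but the idea and the O(N²T) cost are the same. Function and variable names are just for illustration.

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward procedure: α[t, i] = P(o_1 ... o_t, X_t = i).
    Returns the trellis α and the total P(O | μ) = Σ_i α[T-1, i]."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # 1. initialization
    for t in range(1, T):                              # 2. induction: sum over incoming arcs
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                      # 3. total

# Crazy soft drink machine, observation (lem, ice_t):
A = np.array([[0.7, 0.3], [0.5, 0.5]])
B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])
pi = np.array([1.0, 0.0])
alpha, p_obs = forward([2, 1], pi, A, B)
print(p_obs)                                           # 0.084, matching the brute-force sum
```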

The forward procedure (Cont.)
[Figure 9.6, p. 329]

The backward procedure
Backward variables: β_i(t) = P(o_t ... o_T | X_t = i, μ)
The total probability of seeing the rest of the observation sequence, given that we were in state s_i at time t.
The combination of forward and backward probabilities is vital for solving the third problem, parameter reestimation.
1. Initialization: β_i(T+1) = 1, for 1 ≤ i ≤ N
2. Induction: β_i(t) = Σ_{j=1}^{N} a_ij b_{ij o_t} β_j(t+1), for 1 ≤ t ≤ T, 1 ≤ i ≤ N
3. Total: P(O | μ) = Σ_{i=1}^{N} π_i β_i(1)
A matching code sketch follows.
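The matching backward sketch in the same state-emission convention as the forward sketch above, with β[t, i] = P(o_{t+1} ... o_T | X_t = i); again, the chapter's arc-emission indexing differs slightly.

```python
import numpy as np

def backward(obs, A, B):
    """Backward procedure: β[t, i] = P(o_{t+1} ... o_T | X_t = i)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                     # 1. initialization
    for t in range(T - 2, -1, -1):                     # 2. induction
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# 3. total: P(O | μ) = Σ_i π_i * B[i, o_1] * β[0, i]
```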

Variable calculations
[Table 9.2, p. 330]

Combining them
P(O, X_t = i | μ) = α_i(t) β_i(t)
P(O | μ) = Σ_{i=1}^{N} α_i(t) β_i(t), for any t, 1 ≤ t ≤ T+1
The forward total and the backward total on the previous slides are special cases of this identity; a numerical check follows.
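A quick numerical check of the identity, reusing the forward() and backward() sketches above on the drink machine parameters: Σ_i α[t, i] β[t, i] gives the same value P(O | μ) at every time step t.

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.5, 0.5]])
B = np.array([[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]])
pi = np.array([1.0, 0.0])
obs = [2, 1, 0]                                        # (lem, ice_t, cola)

alpha, p_obs = forward(obs, pi, A, B)
beta = backward(obs, A, B)
print((alpha * beta).sum(axis=1))                      # the same total probability at each t
print(p_obs)
```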

Finding the best state sequence
Choosing the states individually:
For each t, find the state that maximizes P(X_t | O, μ). The individually most likely state is
X̂_t = argmax_{1≤i≤N} γ_i(t), where γ_i(t) = P(X_t = i | O, μ) = α_i(t) β_i(t) / Σ_{j=1}^{N} α_j(t) β_j(t)
This quantity maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely state sequence, so this is not the method that is normally used (a short sketch follows anyway).
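Choosing the states individually amounts to posterior decoding; here is a short sketch, again reusing the forward() and backward() functions defined above (names are illustrative).

```python
import numpy as np

def posterior_states(obs, pi, A, B):
    """Individually most likely states: argmax_i γ[t, i], with γ[t, i] = P(X_t = i | O, μ)."""
    alpha, p_obs = forward(obs, pi, A, B)
    beta = backward(obs, A, B)
    gamma = alpha * beta / p_obs
    return gamma.argmax(axis=1)            # may not form a likely (or even possible) path
```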

Viterbi algorithm
We want to find the most likely complete path, argmax_X P(X | O, μ), which is equivalent to finding argmax_X P(X, O | μ).
δ_j(t) = max_{X_1 ... X_{t-1}} P(X_1 ... X_{t-1}, o_1 ... o_{t-1}, X_t = j | μ)
This variable stores, for each point in the trellis, the probability of the most probable path that leads to that node.
ψ_j(t) records the node of the incoming arc that led to this most probable path.

Viterbi algorithm (Cont.)
1. Initialization: δ_j(1) = π_j, for 1 ≤ j ≤ N
2. Induction: δ_j(t+1) = max_{1≤i≤N} δ_i(t) a_ij b_{ij o_t}
   Store backtrace: ψ_j(t+1) = argmax_{1≤i≤N} δ_i(t) a_ij b_{ij o_t}
3. Termination and path readout (by backtracking):
   X̂_{T+1} = argmax_{1≤i≤N} δ_i(T+1);  X̂_t = ψ_{X̂_{t+1}}(t+1);  P(X̂) = max_{1≤i≤N} δ_i(T+1)
A code sketch follows.
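A sketch of the Viterbi recursion in the same state-emission convention as the earlier snippets; δ holds best-path scores and ψ holds backpointers. Function and variable names are illustrative.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence argmax_X P(X, O | μ)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                             # 1. initialization
    for t in range(1, T):                                    # 2. induction
        scores = delta[t - 1][:, None] * A                   # scores[i, j]: reach j coming from i
        psi[t] = scores.argmax(axis=0)                       # store backtrace
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    best_last = int(delta[-1].argmax())                      # 3. termination
    path = [best_last]
    for t in range(T - 1, 0, -1):                            # path readout by backtracking
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```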

The third problem: Parameter estimation
There is no known analytic method to choose μ so as to maximize P(O | μ), but we can locally maximize it by an iterative hill-climbing algorithm: the Baum-Welch or forward-backward algorithm.
Work out the probability of the observation sequence using some (perhaps randomly chosen) model.
We can see which state transitions and symbol emissions were probably used the most.
By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.
This is training!

Baum-Welch algorithm
Probability of traversing a certain arc at time t, given the observation sequence O:
p_t(i, j) = P(X_t = i, X_{t+1} = j | O, μ) = α_i(t) a_ij b_{ij o_t} β_j(t+1) / Σ_{m=1}^{N} α_m(t) β_m(t)
γ_i(t) = Σ_{j=1}^{N} p_t(i, j)
Σ_{t=1}^{T} γ_i(t) = expected number of transitions from state i in O
Σ_{t=1}^{T} p_t(i, j) = expected number of transitions from state i to j in O

Baum-Welch algorithm (Cont.)
[Figure 9.7, p. 334]

Baum-Welch algorithm (Cont.)
Begin with some model μ (perhaps preselected, perhaps just chosen randomly).
Run O through the current model to estimate the expectations of each model parameter.
Change the model to maximize the values of the paths that are used a lot.
Repeat this process, hoping to converge on optimal values for the model parameters μ.

Baum-Welch algorithm (Cont.)
Reestimation: from μ = (A, B, Π), derive a revised model μ̂ = (Â, B̂, Π̂):
π̂_i = γ_i(1)
â_ij = Σ_{t=1}^{T} p_t(i, j) / Σ_{t=1}^{T} γ_i(t)
b̂_ijk = Σ_{t: o_t = k} p_t(i, j) / Σ_{t=1}^{T} p_t(i, j)
Continue reestimating the parameters until the results are no longer improving significantly.
This does not guarantee that we will find the best model: it may converge to a local maximum or a saddle point. A code sketch of one reestimation step follows.
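A sketch of one reestimation step in the state-emission convention, reusing forward() and backward() from above; p_t(i, j) appears here as xi[t, i, j], and B is reestimated per state rather than per arc. This is a rough illustration of the update, not the chapter's exact arc-emission formulae.

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One Baum-Welch reestimation step: μ = (Π, A, B) -> μ̂ = (Π̂, Â, B̂)."""
    alpha, p_obs = forward(obs, pi, A, B)
    beta = backward(obs, A, B)
    T, N = len(obs), len(pi)
    obs = np.asarray(obs)

    # xi[t, i, j] = P(X_t = i, X_{t+1} = j | O, μ);  gamma[t, i] = P(X_t = i | O, μ)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :] / p_obs
    gamma = alpha * beta / p_obs

    pi_new = gamma[0]                                              # expected time in state i at t = 1
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]       # expected i->j / expected exits from i
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[obs == k].sum(axis=0)                  # expected emissions of symbol k in state i
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new

# Iterate pi, A, B = baum_welch_step(obs, pi, A, B) until P(O | μ) stops improving noticeably.
```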

Baum-Welch algorithm (Cont.)
[p. 336]

Implementation
Floating point underflow: the probabilities we are calculating are products of many very small numbers.
Work with logarithms; this also speeds up the computation.
Or employ auxiliary scaling coefficients, whose values grow with the time t so that the scaled probabilities remain within the floating point range of the computer. When the parameter values are reestimated, these scaling factors cancel out.
A log-space forward sketch follows.
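A log-space variant of the earlier forward sketch, one common way to avoid underflow (the alternative, scaling coefficients, is described above); scipy's logsumexp is assumed to be available.

```python
import numpy as np
from scipy.special import logsumexp

def forward_log(obs, pi, A, B):
    """Forward procedure in log space, so long products of small probabilities do not underflow."""
    with np.errstate(divide="ignore"):                 # log(0) = -inf is fine here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    T, N = len(obs), len(pi)
    log_alpha = np.zeros((T, N))
    log_alpha[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        # log Σ_i exp(log α[t-1, i] + log a_ij), then add the emission term for state j
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0) + log_B[:, obs[t]]
    return log_alpha, logsumexp(log_alpha[-1])         # log P(O | μ)
```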

Variants
Epsilon or null transitions: arcs that emit no output symbol.
State-emission model: make the output distribution dependent on just a single state.
An HMM can have a large number of parameters that need to be estimated:
Parameter tying: assume that the probability distributions on certain arcs or at certain states are the same as each other.
Structural zeros: decide that certain things are impossible (probability zero).

Multiple input observations
Ergodic model: every state is connected to every other state.
We can simply concatenate all the observation sequences and train on them as one long input; however, we do not get sufficient data to be able to reestimate the initial probabilities successfully.
Feed-forward model: not fully connected; there is an ordered set of states, and at each time instant one can only proceed to the same or a higher-numbered state.
We need to extend the reestimation formulae to work with a collection of observation sequences (a sketch follows).
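A sketch of the extension mentioned above: accumulate the expected counts from each observation sequence separately (via the forward() and backward() sketches from earlier) and normalize once at the end; with a feed-forward model the per-sequence start states also make Π reestimable. Names and details are illustrative.

```python
import numpy as np

def baum_welch_step_multi(sequences, pi, A, B):
    """One reestimation step over several observation sequences."""
    N, M = B.shape
    A_num, A_den = np.zeros((N, N)), np.zeros(N)
    B_num, B_den = np.zeros((N, M)), np.zeros(N)
    pi_acc = np.zeros(N)
    for obs in sequences:
        obs = np.asarray(obs)
        alpha, p_obs = forward(obs, pi, A, B)
        beta = backward(obs, A, B)
        gamma = alpha * beta / p_obs
        for t in range(len(obs) - 1):                  # expected i->j transition counts
            A_num += alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :] / p_obs
        A_den += gamma[:-1].sum(axis=0)
        for k in range(M):                             # expected symbol-k emission counts
            B_num[:, k] += gamma[obs == k].sum(axis=0)
        B_den += gamma.sum(axis=0)
        pi_acc += gamma[0]                             # expected start-state counts
    return pi_acc / len(sequences), A_num / A_den[:, None], B_num / B_den[:, None]
```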

Initialization of parameter values
If we would rather find the global maximum, try to start the HMM in a region of the parameter space that is near the global maximum.
Good initial estimates for the output parameters B turn out to be particularly important, while random initial estimates for the parameters A and Π are normally satisfactory.